肿瘤全外显子测序数据分析流程大放送

时间:2022-05-03
本文章向大家介绍肿瘤全外显子测序数据分析流程大放送,主要内容包括文章解读、选择部分数据如下:、从SRA数据库下载并转换为fastq测序数据文件、然后走WES的标准SNP-calling流程、然后走走somatic mutation calling流程、需要安装的软件、基本概念、基础应用、原理机制和需要注意的事项等,并结合实例形式分析了其使用技巧,希望通过本文能帮助到大家理解应用这部分内容。

这个一个肿瘤外显子项目的文章发表并且公布的公共数据,我这里给出全套分析流程代码。只需要你肯实践,就可以运行成功。

PS:有些后起之秀自己运营公众号或者博客喜欢批评我们这些老人,一味的堆砌代码不给解释,恶意揣测我们是因为不懂代码的原理。我表示很无语,我写了3000多篇教程,要求我一篇篇都重复提到基础知识,我真的做不到。比如下面的流程,包括软件的用法,软件安装,注释数据库的下载,我博客都说过好几次了,直播我的基因组系列也详细解读过,我告诉你去哪里学,你却不珍惜,不当回事,呵呵。

文章解读

paper;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812767/ a whole-exome sequencing (WES) study was performed in 161 NPC cases and 895 controls of Southern Chinese descent

测序概述:

We sequenced the blood samples from 161 NPC cases, including 39 EAO cases, 63 FH+ cases from 52 independent families, and 59 sporadic cases by WES and achieved an average coverage of 49-fold on target (range of 32- to 76-fold)

后来还扩大了验证样本集:

An additional 2,160 NPC cases and 2,433 healthy controls from Chinese Hong Kong were further examined for the selected candidate variants.

数据全部上传了:

Whole-exome sequencing data for the early-age onset cases have been deposited in the Sequence Read Archive (SRA) database (accession ID SRA291701).

了解WES数据分析步骤: 2016-A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

选择部分数据如下:

随便选择了5个样本,其ID号及描述如下,

SRX445405    MALE    NPC15   SRR1139956  NPC15F  NO  SRS540548   NPC15F-TSRX445406    MALE    NPC15   SRR1139958  NPC15F  NO  SRS540549   NPC15F-NSRX445407    MALE    NPC29   SRR1139966  NPC29F  YES SRS540550   NPC29F-TSRX445408    MALE    NPC29   SRR1139973  NPC29F  YES SRS540551   NPC29F-NSRX445409    FEMALE  NPC10   SRR1139999  NPC10F  NO  SRS540552   NPC10F-TSRX445410    FEMALE  NPC10   SRR1140007  NPC10F  NO  SRS540553   NPC10F-NSRX445411    FEMALE  NPC34   SRR1140015  NPC34F  NO  SRS540554   NPC34F-TSRX445412    FEMALE  NPC34   SRR1140023  NPC34F  NO  SRS540555   NPC34F-NSRX445413    MALE    NPC37   SRR1140044  NPC37F  YES SRS540556   NPC37F-TSRX445414    MALE    NPC37   SRR1140045  NPC37F  YES SRS540557   NPC37F-N

从SRA数据库下载并转换为fastq测序数据文件

把上面的描述文本存为文件npc.sra.txt下载脚本如下:

cat npc.sra.txt | cut -f 4|while read iddo echo $idwget -c ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP035/SRP035573/$id/$id.sradone 

转换脚本如下:

cat npc.sra.txt | while read iddoarray=($id)echo  ${array[3]}.sra  ${array[7]} ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --gzip --split-3 -A   ${array[7]}  ${array[3]}.sra done 

得到的fastq文件如下:

3.5G Aug 26 09:48 NPC10F-N_1.fastq.gz3.6G Aug 26 09:48 NPC10F-N_2.fastq.gz3.2G Aug 26 09:48 NPC10F-T_1.fastq.gz3.3G Aug 26 09:48 NPC10F-T_2.fastq.gz2.7G Aug 26 09:47 NPC15F-N_1.fastq.gz2.8G Aug 26 09:47 NPC15F-N_2.fastq.gz2.8G Aug 26 09:47 NPC15F-T_1.fastq.gz2.9G Aug 26 09:47 NPC15F-T_2.fastq.gz2.8G Aug 26 09:47 NPC29F-N_1.fastq.gz2.9G Aug 26 09:47 NPC29F-N_2.fastq.gz2.4G Aug 26 09:47 NPC29F-T_1.fastq.gz2.5G Aug 26 09:47 NPC29F-T_2.fastq.gz2.0G Aug 26 09:47 NPC34F-N_1.fastq.gz2.0G Aug 26 09:47 NPC34F-N_2.fastq.gz2.2G Aug 26 09:48 NPC34F-T_1.fastq.gz2.3G Aug 26 09:48 NPC34F-T_2.fastq.gz2.1G Aug 26 09:47 NPC37F-N_1.fastq.gz2.1G Aug 26 09:47 NPC37F-N_2.fastq.gz2.2G Aug 26 09:46 NPC37F-T_1.fastq.gz2.2G Aug 26 09:46 NPC37F-T_2.fastq.gz

简单的走一下fastqc+multiqc 看看数据质量,一般都会很不错的,这个数据也不例外。

ls *.gz |xargs ~/biosoft/fastqc/FastQC/fastqc -o ./ -t 5 

但是值得注意的是本文章中的测序reads的编码格式是phred+64 而不是传统的33

然后走WES的标准SNP-calling流程

选用的是经典的GATK best practice的流程,代码如下:

/*
* 提示:该行代码过长,系统自动注释不进行高亮。一键复制会移除系统注释 
* module load java/1.8.0_91GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaINDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jarPICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jarDBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gzSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gzINDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gzKG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gzMills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcfKG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcfTMPDIR=/home/jianmingzeng/tmp/software## samtools and bwa are in the environment## samtools Version: 1.3.1 (using htslib 1.3.1)## bwa Version: 0.7.15-r1140#arr=($1)#fq1=${arr[0]}#fq2=${arr[1]}#sample=${arr[2]}fq1=$1fq2=$2sample=$3##################################################################### Step 1 : Alignment ######################################################################start=$(date +%s.%N)echo bwa `date`bwa mem -t 5 -M  -R "@RGtID:$sampletSM:$sampletLB:WEStPL:Illumina" $INDEX $fq1 $fq2 1>$sample.sam 2>/dev/null echo bwa `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for BWA : %.6f seconds" $durecho ##################################################################### Step 2: Sort and Index ##################################################################start=$(date +%s.%N)echo SortSam `date`java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bamsamtools index $sample.bamecho SortSam `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for SortSam : %.6f seconds" $durecho rm $sample.sam ##################################################################### Step 3: Basic Statistics ################################################################start=$(date +%s.%N)echo stats `date`samtools flagstat $sample.bam > ${sample}.alignment.flagstatsamtools stats  $sample.bam > ${sample}.alignment.statecho plot-bamstats -p ${sample}_QC  ${sample}.alignment.statecho stats `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for Basic Statistics : %.6f seconds" $durecho ############################################################ Step 4: multiple filtering for bam files ############################################################MarkDuplicates###start=$(date +%s.%N)echo MarkDuplicates `date`java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD MarkDuplicates INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metricsecho MarkDuplicates `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for MarkDuplicates : %.6f seconds" $durecho rm $sample.bam  ###FixMateInfo###start=$(date +%s.%N)echo FixMateInfo `date`java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD FixMateInformation INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinatesamtools index ${sample}_marked_fixed.bamecho FixMateInfo `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for FixMateInfo  : %.6f seconds" $durecho rm ${sample}_marked.bam ############################################################ Step 5: gatk process bam files ###################################################################### SplitNCigar ###start=$(date +%s.%N)echo SplitNCigar `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SplitNCigarReads  -R $GENOME  -I ${sample}_marked_fixed.bam  -o ${sample}_marked_fixed_split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS #--fix_misencoded_quality_scores## --fix_misencoded_quality_scores only if phred 64 echo SplitNCigar `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for SplitNCigar : %.6f seconds" $durecho rm ${sample}_marked_fixed.bam # rm ${sample}.sam ${sample}.bam ${sample}_marked.bam ${sample}_marked_fixed.bam###RealignerTargetCreator###start=$(date +%s.%N)echo RealignerTargetCreator `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T RealignerTargetCreator -I ${sample}_marked_fixed_split.bam -R $GENOME -o ${sample}_target.intervals -known $Mills_indels -known $KG_indels -nt 5echo RealignerTargetCreator `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for RealignerTargetCreator : %.6f seconds" $durecho ###IndelRealigner###start=$(date +%s.%N)echo IndelRealigner `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T IndelRealigner -I ${sample}_marked_fixed_split.bam  -R $GENOME -targetIntervals ${sample}_target.intervals -o ${sample}_realigned.bam -known $Mills_indels -known $KG_indelsecho IndelRealigner `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for IndelRealigner : %.6f seconds" $durecho rm ${sample}_marked_fixed_split.bam###BaseRecalibrator###start=$(date +%s.%N)echo BaseRecalibrator `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T BaseRecalibrator -I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNPecho BaseRecalibrator `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for BaseRecalibrator : %.6f seconds" $durecho ###PrintReads###start=$(date +%s.%N)echo PrintReads `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T PrintReads -R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.tablesamtools index ${sample}_recal.bamecho PrintReads `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for PrintReads : %.6f seconds" $durecho rm  ${sample}_realigned.bamchmod uga=r   ${sample}_recal.bam ################################################################### Step 6: gatk call snp/indel################################################################## start=$(date +%s.%N)echo HaplotypeCaller `date`java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T HaplotypeCaller  -R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP  -stand_emit_conf 10 -o  ${sample}_raw.snps.indels.vcfecho HaplotypeCaller `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for HaplotypeCaller : %.6f seconds" $durecho java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants  -R $GENOME  -selectType SNP -V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_snps.vcfjava -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants  -R $GENOME -selectType INDEL  -V ${sample}_raw.snps.indels.vcf   -o ${sample}_raw_indels.vcf## :''## for SNPjava -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME  --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"  --filterName "my_snp_filter" -V ${sample}_raw_snps.vcf  -o ${sample}_filtered_snps.vcf   java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants -R $GENOME  --excludeFiltered -V ${sample}_filtered_snps.vcf  -o  ${sample}_filtered_PASS.snps.vcfjava -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantEval -R $GENOME  -eval ${sample}_filtered_PASS.snps.vcf -o  ${sample}_filtered_PASS.snps.vcf.eval## for  INDELjava -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME  --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"  --filterName "my_indel_filter" -V ${sample}_raw_indels.vcf  -o ${sample}_filtered_indels.vcf   java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants -R $GENOME  --excludeFiltered -V ${sample}_filtered_indels.vcf  -o  ${sample}_filtered_PASS.indels.vcfjava -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantEval -R $GENOME  -eval ${sample}_filtered_PASS.indels.vcf -o  ${sample}_filtered_PASS.indels.vcf.eval
*/

比对成功后得到的bam文件如下;

7.7G Aug 26 19:22 NPC10F-N_marked_fixed.bam7.7G Aug 26 22:59 NPC10F-N_marked_fixed_split.bam7.7G Aug 26 23:57 NPC10F-N_realigned.bam 15G Aug 27 03:45 NPC10F-N_recal.bam 7.0G Aug 26 19:49 NPC10F-T_marked_fixed.bam7.0G Aug 26 22:55 NPC10F-T_marked_fixed_split.bam7.0G Aug 26 23:48 NPC10F-T_realigned.bam 13G Aug 27 03:12 NPC10F-T_recal.bam

Snp-calling结束后得到的vcf如下:

    82576 NPC10F-N_raw_indels.vcf   863243 NPC10F-N_raw_snps.vcf   945753 NPC10F-N_recal_raw.snps.indels.vcf    89604 NPC10F-T_raw_indels.vcf   819657 NPC10F-T_raw_snps.vcf   909190 NPC10F-T_recal_raw.snps.indels.vcf

这些vcf里面的变异位点还需要进行简单的过滤,或者只提取外显子区域的变异位点。(从WGS测序得到的VCF文件里面提取位于外显子区域的【直播】我的基因组84)

消耗时间如下;

# for NPC10F-N_1.fastq.gz NPC10F-N_2.fastq.gz NPC10F-Nbwa Sat Aug 26 16:04:44 CST 2017bwa Sat Aug 26 17:07:11 CST 2017SortSam Sat Aug 26 17:07:11 CST 2017SortSam Sat Aug 26 17:45:04 CST 2017stats Sat Aug 26 17:45:04 CST 2017plot-bamstats -p NPC10F-N_QC NPC10F-N.alignment.statstats Sat Aug 26 17:56:07 CST 2017MarkDuplicates Sat Aug 26 17:56:07 CST 2017MarkDuplicates Sat Aug 26 18:40:25 CST 2017FixMateInfo Sat Aug 26 18:40:25 CST 2017FixMateInfo Sat Aug 26 19:26:39 CST 2017SplitNCigar Sat Aug 26 22:20:48 CST 2017SplitNCigar Sat Aug 26 22:59:32 CST 2017RealignerTargetCreator Sat Aug 26 22:59:32 CST 2017RealignerTargetCreator Sat Aug 26 23:17:35 CST 2017IndelRealigner Sat Aug 26 23:17:35 CST 2017IndelRealigner Sat Aug 26 23:57:35 CST 2017BaseRecalibrator Sat Aug 26 23:57:35 CST 2017BaseRecalibrator Sun Aug 27 01:27:39 CST 2017PrintReads Sun Aug 27 01:27:39 CST 2017PrintReads Sun Aug 27 03:52:03 CST 2017#for NPC10F-T_1.fastq.gz NPC10F-T_2.fastq.gz NPC10F-Tbwa Sat Aug 26 16:54:14 CST 2017bwa Sat Aug 26 17:48:07 CST 2017SortSam Sat Aug 26 17:48:07 CST 2017SortSam Sat Aug 26 18:20:48 CST 2017stats Sat Aug 26 18:20:48 CST 2017plot-bamstats -p NPC10F-T_QC NPC10F-T.alignment.statstats Sat Aug 26 18:30:45 CST 2017MarkDuplicates Sat Aug 26 18:30:45 CST 2017MarkDuplicates Sat Aug 26 19:11:40 CST 2017FixMateInfo Sat Aug 26 19:11:40 CST 2017FixMateInfo Sat Aug 26 19:52:37 CST 2017SplitNCigar Sat Aug 26 22:20:54 CST 2017SplitNCigar Sat Aug 26 22:55:53 CST 2017RealignerTargetCreator Sat Aug 26 22:55:53 CST 2017RealignerTargetCreator Sat Aug 26 23:14:19 CST 2017IndelRealigner Sat Aug 26 23:14:19 CST 2017IndelRealigner Sat Aug 26 23:48:43 CST 2017BaseRecalibrator Sat Aug 26 23:48:43 CST 2017BaseRecalibrator Sun Aug 27 01:10:26 CST 2017PrintReads Sun Aug 27 01:10:26 CST 2017PrintReads Sun Aug 27 03:18:33 CST 2017

外显子的QC结果(这个是我自己写的脚本)是:

## for NPC10F-N32541594    2392028455  0.982188933530586   73.506800404430118711840    934414537   0.971564415370185   49.937073906147117075251    505674931   0.895198983425405   29.614494744469615656543    241509710   0.819070615186704   15.4254812189383## for NPC10F-T32348763    2946257280  0.97636879840624    91.077896239803718182204    1054428864  0.94406442121146    57.992356922186116075260    555403212   0.842772759844003   34.550185315820714181528    265599059   0.741905340358179   18.7285219900141

可以看到T的测序深度高于N,符合实验设计。T的外显子区域平均测序深度高达91X,是比较好的数据了,而且外显子旁边50bp范围区域平均测序深度还有57X,旁边100bp区域还有34X,也说明这款芯片的捕获效率比较好

然后走走somatic mutation calling流程

因为是配对数据,还可以走somatic mutation calling流程

reference=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaGENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaTMPDIR=/home/jianmingzeng/tmp/softwarenormal_bam=NPC10F-N_recal.bamtumor_bam=NPC10F-T_recal.bamsample=NPC10F######################################################################## Step : Run VarScan ##################################################################normal_pileup="samtools mpileup -q 1 -f $reference $normal_bam";tumor_pileup="samtools mpileup -q 1 -f $reference $tumor_bam";# Next, issue a system call that pipes input from these commands into VarScan :java -Djava.io.tmpdir=$TMPDIR   -Xmx40g  -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar somatic <($normal_pileup) <($tumor_pileup) $samplejava -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar processSomatic $sample.snp######################################################################## Step : Run Mutect2 ################################################################## java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK  -T MuTect2 -R $GENOME -I:tumor $tumor_bam  -I:normal $normal_bam --dbsnp  $DBSNP   -o ${sample}-mutect2.vcf######################################################################## Step : Run Muse######################################################################~/biosoft/muse/muse call -O $sample -f $reference $tumor_bam $normal_bam~/biosoft/muse/muse sump -I ${sample}.MuSE.txt -E –O ${sample}.vcf –D $DBSNP

其中varscan耗费两个半小时,结果如下:

893539210 positions in tumor891822458 positions shared in normal79827518 had sufficient coverage for comparison79718238 were called Reference26 were mixed SNP-indel calls and filtered102492 were called Germline3703 were called LOH2569 were called Somatic490 were called Unknown0 were called VariantReading input from NPC10F.snpOpening output files: NPC10F.snp.Somatic NPC10F.snp.Germline NPC10F.snp.LOH101352 VarScan calls processed2342 were Somatic (556 high confidence)95144 were Germline (4830 high confidence)3502 were LOH (3484 high confidence)

这3个软件找到的somatic mutation可以互相对比一下,差异主要是在哪里,都值得自己去探究,这样最终确定的肿瘤外显子测序数据分析流程就更有把握。

需要安装的软件

软件太多了,我就不一一列出具体代码了,还有很多需要下载的参考基因组,变异数据库也是以前直播基因组的时候已经反复提到过,也不赘述啦。

conda install -c bioconda bedtoolsconda install -c bioconda bwaconda install -c bioconda samtoolscd ~/biosoftmkdir sratoolkit &&  cd sratoolkitwget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gztar zxvf sratoolkit.current-centos_linux64.tar.gz~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastdump -h## https://sourceforge.net/projects/picard/## https://github.com/broadinstitute/picardcd ~/biosoftmkdir picardtools &&  cd picardtoolswget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zipunzip picard-tools-1.119.zipmkdir 2.9.2 && cd 2.9.2 wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jarcd ~/biosoft## https://sourceforge.net/projects/varscan/files/## http://varscan.sourceforge.net/index.htmlmkdir VarScan  &&  cd VarScan  wget https://sourceforge.net/projects/varscan/files/VarScan.v2.3.9.jar cd ~/biosoftmkdir SnpEff &&  cd SnpEff##    http://snpeff.sourceforge.net/##    http://snpeff.sourceforge.net/SnpSift.html ##    http://snpeff.sourceforge.net/SnpEff_manual.htmlwget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip ## java -jar snpEff.jar download GRCh37.75## java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.75 example.vcf > example_snpeff.vcfunzip snpEff_latest_core.zip