文章目录
ANNOVAR的程序模块
(ANNOVAR程序结构 │ annotate_variation.pl #主程序,功能包括下载数据库,三种不同的注释 │ coding_change.pl #可用来推断蛋白质序列 │ convert2annovar.pl #将多种格式转为.avinput的程序 │ retrieve_seq_from_fasta.pl #用于自行建立其他物种的转录本 │ table_annovar.pl #注释程序,可一次性完成三种类型的注释 │ variants_reduction.pl #可用来更灵活地定制过滤注释流程 │ ├─example #存放示例文件 │ └─humandb #人类注释数据库)
ANNOVAR的输入文件
ANNOVAR使用.avinput格式,如以上代码所示,该格式每列以tab分割,需要有以下几个信息:
染色体位置 起始位点 终止位点 参考基因组碱基 突变碱基 ......
chrM 302 302 - C chrM 302 . A AC 93.73 PASS AC=4;AF=0.500;AN=8;ClippingRankSum=0.000;DP=121;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;set=Intersection GT:AD:DP:GQ:PL 0/1:16,9:25:99:183,0,384 chrM 963 963 T - chrM 962 . CT C 258.73 PASS AC=3;AF=0.500;AN=6;ClippingRankSum=0.000;DP=75;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;set=variant2-variant3-variant4 GT:AD:DP:GQ:PL 0/1:11,11:22:99:296,0,247 chr1 GG - chr1 . TGG T 128 PASS AC=6;AF=1.00;AN=6;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;set=variant-variant2-variant3 GT:AD:DP:GQ:PL 1/1:0,4:4:12:165,12,0 chr1 - C chr1 . G GC 98.25 PASS AC=4;AF=1.00;AN=4;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;set=variant2-variant3 GT:AD:DP:GQ:PL 1/1:0,3:3:9:135,9,0 chr1 - C chr1 . A AC 30.71 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=15.35;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:67,6,0 chr1 - T chr1 . C CT 22.73 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=11.36;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:59,6,0 chr1 - G chr1 . T TG 21.73 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=10.87;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:58,6,0 chr1 - C chr1 . A AC 30.71 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=15.35;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:67,6,0 chr1 - TTTA chr1 . C CTTTA 53.70 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.00;QD=26.85;SOR=2.303;set=variant2 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0 chr1 - T chr1 . C CT 62.74 PASS AC=2;AF=0.500;AN=4;ClippingRankSum=0.000;DP=15;ExcessHet=3.0103;MLEAC=1;MLEAF=0.500;MQ=37.00;MQRankSum=0.000;set=variant2-variant3 GT:AD:DP:GQ:PL 0/1:1,3:4:25:100,0,25
ANNOVAR输入文件的格式转换
ANNOVAR主要使用convert2annovar.pl程序进行转换,转换后文件是精简过的,主要包含前面提到的5列内容,如果要将原格式的文件的所有内容都包含在转换后的.avinput文件中,可以使用-includeinfo参数;如果需要分开每个sample输出单一的.avinput文件,可以使用-allsample参数,等等。
$ convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput # -format vcf4 指定格式为vcf
ANNOVAR还主要支持以下格式转换:
SAMtools pileup format Complete Genomics format GFF3-SOLiD calling format SOAPsnp calling format MAQ calling format CASAVA calling format
ANNOVAR注释功能
Table_annovar.pl(可一次完成三种类型的注释)
使用ANNOVAR最简单的方法就是使用table_annovar.pl进行注释,它的输入文件可以是多种格式包括VCF,输出文件已Tab分隔,每一列代表着一种注释。
注释命令示例:
$~/biosoft/ANNOVAR/annovar/table_annovar.pl 15_indel_pre.avinput.hg19.variant2.avinput ~/biosoft/ANNOVAR/annovar/humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_eur,exac03,avsnp147,dbnsfp30a -operation g,r,r,f,f,f,f,f,f -nastring . -csvout # -buildver hg19 表示使用hg19版本 # -out myanno 表示输出文件的前缀为myanno # -remove 表示删除注释过程中的临时文件 # -protocol 表示注释使用的数据库,用逗号隔开,且要注意顺序 # -operation 表示对应顺序的数据库的类型(g代表gene-based、r代表region-based、f代表filter-based),用逗号隔开,注意顺序 # -nastring . 表示用点号替代缺省的值 # -csvout 表示最后输出.csv文件
下载数据库
#下载1000g2015Aug数据库 $perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/
Annotate_variation.pl
Annotate_variation.pl的注释方式分为三种:
Gene-based annotation Region-based annotation Filter-based annotation
annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/ annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/ annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 example/ex1.avinput humandb/ #三种命令示例,使用package自带数据进行注释,分别对应三种注释方式
Annotate_variation.pl 实例
Gene-based annotation
顾名思义,Gene-based annotation是根据SNPs以及CNVs的位置信息来确定是否会造成编码序列以及开放阅读框的改变从而影响氨基酸的改变,使用者可以自主选择RefSeq genes, 包括UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes等来进行注释。
命令示例:
$ annotate_variation.pl -geneanno -dbtype refGene -out ex1 -build hg19 example/ex1.avinput humandb/ # -geneanno 表示使用基于基因的注释 # -dbtype refGene 表示使用"refGene"数据库 # -out ex1 表示输出文件以ex1为前缀
#ex1.variant_function [kaiwang@biocluster ~/]$ cat ex1.variant_function UTR5 ISG15(NM_005101:c.-33T>C) 1 T C comments: rs15842, a SNP in 5' UTR of ISG15 UTR3 ATAD3C(NM_00:c.*91G>T) 1 G T comments: rs, a SNP in 3' UTR of ATAD3C splicing NPHP4(NM_00:exon19:c.1279-2T>A,NM_00:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 A T comments: rs, a splice site variant in NPHP4 intronic DDR2 1 C T comments: rs, a SNP in Illumina SNP arrays intronic DNASE2B 1 C T comments: rs or SNP_A-, a SNP in Affymetrix SNP arrays intergenic LOC(dist=11566),LOC(dist=) 1 TC - comments: rs, a 2-bp deletion intergenic UBIAD1(dist=55105),PTCHD2(dist=) 1 - AT comments: rs, a 2-bp insertion intergenic LOC(dist=),NONE(dist=NONE) 1 A ATAAA comments: rs, a block substitution exonic IL23R 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease exonic ATG16L1 2 A G comments: rs (T300A), a SNP in the ATG16L1 associated with Crohn's disease exonic NOD2 16 C T comments: rs (R702W), a non-synonymous SNP in NOD2 exonic NOD2 16 G C comments: rs (G908R), a non-synonymous SNP in NOD2 exonic NOD2 16 - C comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 exonic GJB2 13 G - comments: rs (del35G), a frameshift mutation in GJB2, associated with hearing loss exonic CRYL1,GJB6 13 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss 第一个文件包括对于所有突变的注释,通过在文件最前面加入两列,以tab分割 第一列为变异所在基因位置的类型,如外显子,内含子,UTR5,UTR3,基因间等 第二列为对第一列的描述信息,详情见下
#ex1.exonic_variant_function [kaiwang@biocluster ~/]$ cat ex1.exonic_variant_function line9 nonsynonymous SNV IL23R:NM_:exon9:c.G1142A:p.R381Q, 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease line10 nonsynonymous SNV ATG16L1:NM_00:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_00:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_:exon5:c.A409G:p.T137A, 2 A G comments: rs (T300A), a SNP in the ATG16L1 associated with Crohn's disease line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_00:exon3:c.C2023T:p.R675W, 16 C comments: rs (R702W), a non-synonymous SNP in NOD2 line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_00:exon7:c.G2641C:p.G881R, 16 G comments: rs (G908R), a non-synonymous SNP in NOD2 line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_00:exon10:c.2936dupC:p.A979fs, 16 comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 G - comments: rs (del35G), a frameshift mutation in GJB2, associated with hearing loss line15 frameshift deletion GJB6:NM_00:wholegene,GJB6:NM_00:wholegene,GJB6:NM_00:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss 第二个输出文件以.exonic_variant_function结尾,只列出外显子(氨基酸会改变)的变异 第一列为第一个文件中该变异所在的行号; 第二列为该变异的功能性后果,如外现在改变导致的氨基酸变化,阅读框移码等,详情见下 第三列为基因名称,转录识别标志和相应的转录本的序列变化 第四列为原输入文件内容
Region-based annotation
#数据库下载 [kaiwang@biocluster ~/]$ annotate_variation.pl -build hg19 -downdb phastConsElements46way humandb/ NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz ... OK NOTICE: Uncompressing downloaded files NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory #使用下载数据库进行注释 [kaiwang@biocluster ~/]$ annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way example/ex1.avinput humandb/ NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with regions NOTICE: Finished region-based annotation on 12 genetic variants in ex1.hg19.avinput NOTICE: Output files were written to ex1.hg19_phastConsElements46way # -regionanno 表示使用基于区域的注释 # -dbtype phastConsElements46way 表示使用"phastConsElements46way"数据库,注意需要使用Region-based的数据库
#输出文件 [kaiwang@biocluster ~/]$ cat ex1.hg19_phastConsElements46way phastConsElements46way Score=387;Name=lod=50 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease phastConsElements46way Score=420;Name=lod=68 16 G C comments: rs (G908R), a non-synonymous SNP in NOD2 phastConsElements46way Score=385;Name=lod=49 16 - C comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 phastConsElements46way Score=395;Name=lod=54 13 G - comments: rs (del35G), a frameshift mutation in GJB2, associated with hearing loss phastConsElements46way Score=545;Name=lod=218 13 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss 输出文件:输出的注释文件第1列为“phastConsElements46way”,对应注释的类型,这里的phastCons 46-way alignments属于保守的基因组区域的注释; 第二列包含评分和名称,评分来自UCSC,可以使用--score_threshold和--normscore_threshold来过滤评分低的变异,“Name=lod=x”名称表示该区域的名称; 剩余的部分为输入文件的内容。
Filter-based annotation
Filter-based annotation是用以确认已记录在特定数据库里的突变。例如想要知道突变是否为novel variation就需要知道该突变是否存在于dbSNP库里,它在1000 genome project里面等位基因频率怎样,以及计算一系列突变项目得分并加以过滤。它区别于region-based annotation就在于它针对突变碱基进行工作,而region-based annotation 针对染色体位置。举例来说就是region-based比对chr1:1000-1000而filter-based比对chr1:1000-1000上的A->G。
它拥有多种数据库,包括针对全基因组测序的突变频率,针对全外显子数据测序的突变频率,在孤立或者小类群人群中的突变频率,全基因组数据突变的功能预测,全外显子组突变的功能预测,剪切变异体的功能预测,疾病相关突变,突变确认等,如下:
下面给大家介绍常用的两种过滤注释
1000 Genomes Project annotations
[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/ NOTICE: Variants matching filtering criteria are written to ex1.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1.hg19_EUR.sites.2012_04_filtered NOTICE: Processing next batch with 15 unique variants in 15 input lines NOTICE: Database index loaded. Total number of bins is and the number of bins to be scanned is 12 NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done #查看数据格式 [kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped 1000g2012apr_eur 0.04 1 G T comments: rs, a SNP in 3' UTR of ATAD3C 1000g2012apr_eur 0.87 1 C T comments: rs, a SNP in Illumina SNP arrays 1000g2012apr_eur 0.81 1 A T comments: rs, a splice site variant in NPHP4 1000g2012apr_eur 0.06 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease 1000g2012apr_eur 0.54 1 C T comments: rs or SNP_A-, a SNP in Affymetrix SNP arrays 1000g2012apr_eur 0.96 1 T C comments: rs15842, a SNP in 5' UTR of ISG15 1000g2012apr_eur 0.05 16 C T comments: rs (R702W), a non-synonymous SNP in NOD2 1000g2012apr_eur 0.01 16 G C comments: rs (G908R), a non-synonymous SNP in NOD2 1000g2012apr_eur 0.01 16 - C comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 1000g2012apr_eur 0.53 2 A G comments: rs (T300A), a SNP in the ATG16L1 associated with Crohn's disease # -filter 使用基于过滤的注释 # -dbtype 1000g2012apr_eur 使用"1000g2012apr_eur"数据库
该注释使用2012年4月欧洲发布1000基因组计划数据库,输出文件会有两个,output_dropped file 和 output_filtered file
#dropped file [kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped 1000g2012apr_eur 0.04 1 G T comments: rs, a SNP in 3' UTR of ATAD3C 1000g2012apr_eur 0.87 1 C T comments: rs, a SNP in Illumina SNP arrays 1000g2012apr_eur 0.81 1 A T comments: rs, a splice site variant in NPHP4 1000g2012apr_eur 0.06 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease 1000g2012apr_eur 0.54 1 C T comments: rs or SNP_A-, a SNP in Affymetrix SNP arrays 1000g2012apr_eur 0.96 1 T C comments: rs15842, a SNP in 5' UTR of ISG15 1000g2012apr_eur 0.05 16 C T comments: rs (R702W), a non-synonymous SNP in NOD2 1000g2012apr_eur 0.01 16 G C comments: rs (G908R), a non-synonymous SNP in NOD2 1000g2012apr_eur 0.01 16 - C comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 1000g2012apr_eur 0.53 2 A G comments: rs (T300A), a SNP in the ATG16L1 associated with Crohn's disease #*dropped文件 第一列如region-based注释的结果一样以数据库命名; 第二列为等位基因频率,我们可以用-maf 0.05参数来过滤掉低于0.05的变异; 第三列开始同样是输入文件的内容。 #需要注意的是,我们也可以使用-maf 0.05 -reverse过滤掉高于0.05的变异;但是过滤ALT等位基因的频率,我们更提倡使用-score_threshold参数。
dbSNP annotations
#下载dbsnpp138数据库 [kaiwang@biocluster ~/]$ annotate_variation.pl -downdb -buildver hg19 -webfrom annovar snp138 humandb NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.gz ... OK NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.idx.gz ... OK NOTICE: Uncompressing downloaded files NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory #使用dbsnp138注释 [kaiwang@biocluster ~/]$ annotate_variation.pl -filter -out ex1 -build hg19 -dbtype snp138 example/ex1.avinput humandb/ NOTICE: Variants matching filtering criteria are written to ex1.hg19_snp138_dropped, other variants are written to ex1.hg19_snp138_filtered NOTICE: Processing next batch with 15 unique variants in 15 input lines NOTICE: Database index loaded. Total number of bins is and the number of bins to be scanned is 12 NOTICE: Scanning filter database humandb/hg19_snp138.txt...Done #输入dropped file [kaiwang@biocluster ~/]$ cat ex1.hg19_snp138_dropped snp138 rs 1 - AT comments: rs, a 2-bp insertion snp138 rs 1 G T comments: rs, a SNP in 3' UTR of ATAD3C snp138 rs 1 C T comments: rs, a SNP in Illumina SNP arrays snp138 rs 1 A T comments: rs, a splice site variant in NPHP4 snp138 rs 1 G A comments: rs (R381Q), a SNP in IL23R associated with Crohn's disease snp138 rs 1 C T comments: rs or SNP_A-, a SNP in Affymetrix SNP arrays snp138 rs15842 1 T C comments: rs15842, a SNP in 5' UTR of ISG15 snp138 rs 13 G - comments: rs (del35G), a frameshift mutation in GJB2, associated with hearing loss snp138 rs 16 C T comments: rs (R702W), a non-synonymous SNP in NOD2 snp138 rs 16 G C comments: rs (G908R), a non-synonymous SNP in NOD2 snp138 rs 16 - C comments: rs (c.3016_3017insC), a frameshift SNP in NOD2 snp138 rs 2 A G comments: rs (T300A), a SNP in the ATG16L1 associated with Crohn's disease #*dropped文件 第一列如region-based注释的结果一样以数据库命名; 第二列为已经在数据库的突变的indentifier号; 第三列开始同样是输入文件的内容。
参考文章:
ANNOVAR的使用
https://www.bianchenghao.cn/p/95331e7a98cd
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。
如需转载请保留出处:https://bianchenghao.cn/bian-cheng-ji-chu/85984.html