Ensembl Variant Effect Predictor (VEP) | Variant annotation tool

https://asia.ensembl.org/info/docs/tools/vep/index.html

https://github.com/Ensembl/ensembl-vep

 

Give VEP a list of variant identifiers and it returns annotation results for each of them.

 

Annotation output:

#Uploaded_variation	Location	Allele	Consequence	IMPACT	SYMBOL	Gene	Feature_type	Feature	BIOTYPE	EXON	INTRON	HGVSc	HGVSp	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	DISTANCE	STRAND	FLAGS	SYMBOL_SOURCE	HGNC_ID	MANE	TSL	APPRIS	SIFT	PolyPhen	AF	CLIN_SIG	SOMATIC	PHENO	PUBMED	MOTIF_NAME	MOTIF_POS	HIGH_INF_POS	MOTIF_SCORE_CHANGE	TRANSCRIPTION_FACTORS
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000631796.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	2	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000631994.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	4476	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000632089.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	3	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000632496.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3920	-1	-	HGNC	HGNC:33884	-	3	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	splice_region_variant,intron_variant,non_coding_transcript_variant	LOW	WASH5P	ENSG00000282458	Transcript	ENST00000632506.1	processed_transcript	-	2/2	-	-	-	-	-	-	-	rs1258750482	-	-1	-	HGNC	HGNC:33884	-	1	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633703.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	1919	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633719.1	retained_intron	-	-	-	-	-	-	-	-	-	rs1258750482	211	-1	-	HGNC	HGNC:33884	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000633742.1	transcribed_processed_pseudogene	-	-	-	-	-	-	-	-	-	rs1258750482	4418	-1	-	HGNC	HGNC:33884	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1258750482	19:61902-61902	A	downstream_gene_variant	MODIFIER	WASH5P	ENSG00000282458	Transcript	ENST00000634023.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1258750482	3149	-1	-	HGNC	HGNC:33884	-	5	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	upstream_gene_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000318050.4	protein_coding	-	-	-	-	-	-	-	-	-	rs1156485833	3486	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	splice_region_variant,5_prime_UTR_variant	LOW	OR4F17	ENSG00000176695	Transcript	ENST00000585993.3	protein_coding	1/3	-	-	-	54	-	-	-	-	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000588632.2	transcribed_unprocessed_pseudogene	-	-	-	-	-	-	-	-	-	rs1156485833	1685	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	missense_variant,splice_region_variant	MODERATE	OR4F17	ENSG00000176695	Transcript	ENST00000618231.3	protein_coding	1/2	-	-	-	54	9	3	K/N	aaG/aaC	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	P1	deleterious_low_confidence(0.03)	benign(0.062)	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641173.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1156485833	1080	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	splice_region_variant,non_coding_transcript_exon_variant	LOW	OR4F17	ENSG00000176695	Transcript	ENST00000641591.1	processed_transcript	1/4	-	-	-	54	-	-	-	-	rs1156485833	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs1156485833	19:107157-107157	C	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641984.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs1156485833	1080	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	upstream_gene_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000318050.4	protein_coding	-	-	-	-	-	-	-	-	-	rs867704559	13	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	5_prime_UTR_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000585993.3	protein_coding	3/3	-	-	-	143	-	-	-	-	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	frameshift_variant	HIGH	OR4F17	ENSG00000176695	Transcript	ENST00000618231.3	protein_coding	2/2	-	-	-	60	15	5	T/TX	acT/acTT	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	P1	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641173.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs867704559	4553	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	non_coding_transcript_exon_variant	MODIFIER	OR4F17	ENSG00000176695	Transcript	ENST00000641591.1	processed_transcript	3/4	-	-	-	143	-	-	-	-	rs867704559	-	1	-	HGNC	HGNC:15381	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-
rs867704559	19:110630-110630	TT	downstream_gene_variant	MODIFIER	OR4G1P	ENSG00000267310	Transcript	ENST00000641984.1	processed_transcript	-	-	-	-	-	-	-	-	-	rs867704559	4553	1	-	HGNC	HGNC:8302	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-

 

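Questions:

Why does a single SNP get so many annotation lines? Because annotation is done per transcript, and the same site can have a different consequence in each transcript. In addition, if two genes lie very close together, the variant may be annotated against both of them. [Sort by priority and drop the redundant lines.]

Why are the annotations redundant, with both downstream and non_coding terms for one variant? Redundancy is expected: annotation works at different levels, from coarse to fine-grained. In the output above, rs867704559 is a downstream_gene_variant for the OR4G1P transcripts and a non_coding_transcript_exon_variant for one OR4F17 transcript.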
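To see the one-row-per-transcript layout concretely, the rows can be grouped by variant ID. The following is a minimal sketch, assuming the tab-delimited output shown above; the sample is three rows copied from that output and trimmed to the first seven columns, and group_by_variant is a hypothetical helper name, not part of VEP.

```python
import csv
from collections import defaultdict

# Three rows copied from the output above, trimmed to the first seven columns.
SAMPLE = (
    "#Uploaded_variation\tLocation\tAllele\tConsequence\tIMPACT\tSYMBOL\tGene\n"
    "rs867704559\t19:110630-110630\tTT\tframeshift_variant\tHIGH\tOR4F17\tENSG00000176695\n"
    "rs867704559\t19:110630-110630\tTT\tdownstream_gene_variant\tMODIFIER\tOR4G1P\tENSG00000267310\n"
    "rs867704559\t19:110630-110630\tTT\t5_prime_UTR_variant\tMODIFIER\tOR4F17\tENSG00000176695\n"
)


def group_by_variant(text):
    """Map each variant ID to the list of its per-transcript rows."""
    # Drop empty lines and "##" meta lines; keep the single "#" header line.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("##")]
    reader = csv.DictReader(lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["#Uploaded_variation"]].append(row)
    return dict(groups)


groups = group_by_variant(SAMPLE)
for variant, rows in groups.items():
    genes = sorted({r["SYMBOL"] for r in rows})
    print(variant, len(rows), "rows; genes:", ", ".join(genes))
```

With the sample rows, the single variant maps to three rows spread over two genes, which is exactly the situation described in the two questions above.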

 


 

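For a small number of variants, use the web server, https://asia.ensembl.org/Tools/VEP [fine for anything up to a few hundred thousand variants, and it accepts rs IDs, which is very convenient]

For a large number of variants, use the local tool, https://github.com/Ensembl/ensembl-vep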

 

Install the Perl dependencies:

perl -MCPAN -e shell
install Archive::Zip
install DBI
install Module::Build

  

https://github.com/Ensembl/Bio-DB-HTS: this module is hard to install. The build fails with: fatal error: zlib.h: No such file or directory

 

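Install the local database (the VEP cache):

# Create the default VEP cache directory if it does not exist yet
mkdir -p $HOME/.vep
cd $HOME/.vep
# Download and unpack the indexed cache (release 102, GRCh38)
curl -O ftp://ftp.ensembl.org/pub/release-102/variation/indexed_vep_cache/homo_sapiens_vep_102_GRCh38.tar.gz
tar xzf homo_sapiens_vep_102_GRCh38.tar.gz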

  

  

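Problem: #include <zlib.h> zlib.h: No such file or directory [without a programming background, compilation errors are a real headache]

Solution: see "Compilation error - missing zlib.h". Point the compiler and linker at a local zlib build: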

# No spaces around "=" in a shell assignment
export PATH=$PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11
# Library search paths (run time and link time)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
export LIBRARY_PATH=$LIBRARY_PATH:/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/
# Header search paths, so the compiler can find zlib.h
export C_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
export CPLUS_INCLUDE_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/include/
export PKG_CONFIG_PATH=/home/lizhixin/softwares/ensembl-vep/zlib-1.2.11/lib/pkgconfig

  

 

Problem: MSG: ERROR: Cannot use ID format in offline mode [rs IDs cannot be used in local (offline) mode]

So prepare the input in another format for testing; VCF will certainly work.
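A minimal sketch of building such a VCF test file, assuming the coordinates and alleles are already known. The function name to_vcf is hypothetical. The single record is taken from the output above: position 19:107157, alternate allele C, and reference allele G as inferred from the Codons column (aaG/aaC on the forward strand); that inference should be checked against the reference genome before real use.

```python
def to_vcf(records):
    """Render (chrom, pos, id, ref, alt) tuples as minimal VCF text."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome name (as a string) and then by numeric position.
    for chrom, pos, var_id, ref, alt in sorted(records, key=lambda r: (r[0], r[1])):
        # QUAL, FILTER and INFO are left as "." (missing).
        lines.append("\t".join([chrom, str(pos), var_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# REF "G" is inferred from the Codons column above (assumption, not verified).
records = [("19", 107157, "rs1156485833", "G", "C")]
vcf_text = to_vcf(records)
print(vcf_text, end="")
```

Saved to a file, this text could then be given to a local VEP run as the input file (-i) together with --cache and --offline.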

  

If the installation still fails, fall back to Docker, https://hub.docker.com/r/ensemblorg/ensembl-vep#install

 


 

 

Interpreting the results:

Consequences (all)

  • intron_variant: 44%
  • non_coding_transcript_variant: 16%
  • upstream_gene_variant: 12%
  • downstream_gene_variant: 11%
  • NMD_transcript_variant: 4%
  • regulatory_region_variant: 3%
  • intergenic_variant: 3%
  • non_coding_transcript_exon_variant: 2%
  • missense_variant: 1%
  • Others

Coding consequences

  • missense_variant: 73%
  • synonymous_variant: 26%
  • stop_gained: 1%
  • protein_altering_variant: 0%
  • frameshift_variant: 0%
  • stop_lost: 0%
  • coding_sequence_variant: 0%
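Percentages of this kind can be approximated by counting consequence terms across all annotation rows. The sketch below assumes each row's Consequence field is a comma-separated list of terms, as in the output above; consequence_percentages is a hypothetical helper, and VEP's own summary may count in a slightly different way. The sample fields are the seven Consequence values for rs1156485833 from the output above.

```python
from collections import Counter


def consequence_percentages(consequence_fields):
    """Return {term: percent of all term occurrences}, most common first."""
    counts = Counter()
    for field in consequence_fields:
        # One row can carry several comma-separated terms.
        counts.update(field.split(","))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: 100.0 * n / total for term, n in counts.most_common()}


# Consequence column of the seven rs1156485833 rows shown above.
fields = [
    "upstream_gene_variant",
    "splice_region_variant,5_prime_UTR_variant",
    "downstream_gene_variant",
    "missense_variant,splice_region_variant",
    "downstream_gene_variant",
    "splice_region_variant,non_coding_transcript_exon_variant",
    "downstream_gene_variant",
]
percentages = consequence_percentages(fields)
for term, pct in percentages.items():
    print(f"{term}: {pct:.0f}%")
```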

 

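Notes:

  • non-coding includes intergenic + UTR + intron
  • exon includes CDS + UTR
  • upstream and downstream refer to the regions flanking a gene; VEP's default window is 5 kb (the output above contains DISTANCE values above 4,000 bp) and can be changed with --distance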
  • ncRNA exonic/splicing/intronic

 

Priority:

  • 1 splicing/ncRNA splicing
  • 2 missense
  • 3 coding region/ncRNA exonic
  • 4 5'UTR/3'UTR
  • 5 Upstream/Downstream
  • 6 regulatory_region_variant
  • 7 intronic/non_coding_transcript_variant
  • 8 intergenic
  • 9 others
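The priority list above can be applied to reduce the per-transcript rows to one row per variant. This is a minimal sketch: the rank numbers follow the list above, but the assignment of individual Sequence Ontology terms to each level is an assumption made for illustration, and pick_top is a hypothetical helper. This ordering is the one from these notes, not Ensembl's own severity ranking.

```python
# Rank numbers follow the priority list above; which SO term belongs to
# which level is an assumption made for this sketch.
RANK = {
    "splice_acceptor_variant": 1, "splice_donor_variant": 1,
    "splice_region_variant": 1,
    "missense_variant": 2,
    "frameshift_variant": 3, "stop_gained": 3, "stop_lost": 3,
    "synonymous_variant": 3, "protein_altering_variant": 3,
    "coding_sequence_variant": 3, "non_coding_transcript_exon_variant": 3,
    "5_prime_UTR_variant": 4, "3_prime_UTR_variant": 4,
    "upstream_gene_variant": 5, "downstream_gene_variant": 5,
    "regulatory_region_variant": 6,
    "intron_variant": 7, "non_coding_transcript_variant": 7,
    "intergenic_variant": 8,
}
OTHERS = 9  # level 9: anything not listed above


def rank_key(consequence_field):
    """Sorted ranks of all terms in one row; smaller tuples win."""
    return tuple(sorted(RANK.get(t, OTHERS) for t in consequence_field.split(",")))


def pick_top(rows):
    """rows: (variant, transcript, consequence_field). Keep the best row per variant."""
    best = {}
    for variant, transcript, consequence in rows:
        key = rank_key(consequence)
        # Strict "<" keeps the first row seen when two rows tie completely.
        if variant not in best or key < best[variant][0]:
            best[variant] = (key, transcript, consequence)
    return best


# Rows copied from the output above (variant, Feature, Consequence).
rows = [
    ("rs1156485833", "ENST00000318050.4", "upstream_gene_variant"),
    ("rs1156485833", "ENST00000585993.3", "splice_region_variant,5_prime_UTR_variant"),
    ("rs1156485833", "ENST00000618231.3", "missense_variant,splice_region_variant"),
    ("rs867704559", "ENST00000641173.1", "downstream_gene_variant"),
    ("rs867704559", "ENST00000585993.3", "5_prime_UTR_variant"),
    ("rs867704559", "ENST00000618231.3", "frameshift_variant"),
]
best = pick_top(rows)
for variant, (key, transcript, consequence) in best.items():
    print(variant, transcript, consequence)
```

When two rows share the same best term, the comparison falls through to the remaining terms, so the missense plus splice-region row beats the 5' UTR plus splice-region row. VEP also has built-in options such as --pick and --most_severe, which apply Ensembl's own ordering rather than the list above.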

 

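The consequence terms used here also come from an ontology (the Sequence Ontology).

Google search: vep regulatory_region_variant

Ensembl Variation - Calculated variant consequences: the annotation itself is derived from the function of the Ensembl transcripts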

http://www.sequenceontology.org/miso/current_svn/term/SO:0001566

 

Critical association of ncRNA with introns

 

posted @ 2021-01-21 22:11  Life·Intelligence