[使用整理-1] bamdst:统计覆盖度和测序深度

BAMDST

Bamdst是一个轻量级工具,用于统计bam文件中目标区域的测序深度、覆盖度。
必须使用排序后的bam文件,bed文件和输出目录必须优先指定。

1.下载安装

  git clone https://github.com/shiquan/bamdst.git
  cd bamdst/
  make

或者使用conda:

  conda install xdgene::bamdst

安装后会出现可执行程序

2.使用方式

首先准备一个排序后的bam文件,STAR可以直接输出排好序的bam文件,--outSAMtype BAM SortedByCoordinate,也可以自己用samtools进行排序:

  samtools sort {input.bam} --output-fmt BAM > {output.sorted.bam}

接下来准备一个bed文件,bed文件是一种以制表符为分隔的基因组注释文件,这里需要的格式很简单,只要必需的三列,染色体名、起始位点、终止位点(BED3),以该程序所给文件为例:

/example/MT-RNR1.bed

  chrM   649  	1603

这里bed的基础单位可以根据你的需求来选,chromosome、gene、exon等。请注意,间隔符为制表符,而不是连串空格

计算基因组全部数据,包含基因组所有染色体的位置。
可以用genome.fa.fai索引文件转化生成bed文件进行使用:

  samtools faidx genome.fa  ##建立索引
  cat genome.fa.fai |awk '{print $1"\t1\t"$2}' > sample.bed

gene和exon等可以通过gff/gtf文件进行提取,前提是第三列有这些特征

#gff3文件示例
Chr1	MSU_osa1r7	gene	2903	10817	.	+	.	ID=LOC_Os01g01010;Name=LOC_Os01g01010;...
Chr1	MSU_osa1r7	mRNA	2903	10817	.	+	.	ID=LOC_Os01g01010.1;...
Chr1	MSU_osa1r7	exon	2903	3268	.	+	.	ID=LOC_Os01g01010.1:exon_1;...

awk '$3=="exon"' file.gff3 |awk 'BEGIN{FS="\t|=|;";OFS="\t"}{print $1,$4,$5}' > exon.bed

两个文件都准备好后就可以使用了,找一个输出目录(这里我之前想给输出文件加个前缀来着,结果发现没输出(。_。)):

  Normal:
    bamdst -p <probe.bed> -o ./ in1.bam
  Pipeline mode:
    samtools view in1.bam -u | bamdst -p x.bed -o ./ -
(可以通过 bamdst --help 查看使用方法)

必须参数:

  -o, --outdir         output dir 
                       --输出目录
  -p, --bed            probe or target regions file, the region file will be merged before calculate depths 
                       --探测/探针(?)或目标区域文件,区域文件将在计算深度之前合并,也就是bed文件

可选参数:

参数 描述
-f, --flank [200] flank n bp of each region.
--设置侧翼区域的大小,如果要计算侧翼区的覆盖率,请设置此值,默认值为 200。
[以5'侧翼区为例,侧翼区是指与基因5’末端毗连的DNA区域,这些区域通常包含启动子、增强子或者其他蛋白质结合位点,这部分的区域不能被转录为RNA。侧翼区(flanking region)和侧翼区单核苷酸多态性(Flanking SNPs):https://www.cnblogs.com/chenwenyan/p/5802874.html ]
[侧翼序列:指存在于编码区第一个外显子和最末一个外显子的外侧是一段不被翻译的核苷酸序列。侧翼序列含有基因调控顺序,如5端含有的启动子、增强子,3端含有的终止子和多聚腺苷酸信号等,对基因表达起重要的调控作用。(之后总结) ]
-q [20] map quality cutoff value, greater or equal to the value will be count.
--比对质量截止值,大于或等于该值将被计数。
影响结果中 [Total] Map quality cutoff value // Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
--maxdepth [0] set the max depth to stat the cumu distribution.
--设置最大深度来统计累积分布。
对于一些项目来说,区域深度非常高,如果你不想在 Cumulation Distrbution 文件(depth_distribution.plot)中显示这些不正常的深度,可以设置 Cutoff 值来过滤它们。默认值为 0 (无过滤)。
--cutoffdepth [0] list the coverage of above depths.
--列出上述深度的覆盖度。
对于一些项目,人们关心指定深度的覆盖率,比如 10000x 等。bamdst 只计算 0x、4x、10x、30x、100x 的覆盖率,所以你可以设置这个值以在 coverage.report 文档中显示指定的 coverage。默认值为 0。
--isize [2000] stat the inferred insert size under this value.
--推断插入尺寸低于此值 对于错误的映射配对读长,推断的插入大小非常大。所以设置一个截止值。默认值为 2000。
--uncover [5] region will included in uncover file if below it.
--如果区域低于 Uncover 文档,则 Region 将包含在 Uncover 文档中。 为 Calculate the bad covered region (计算不良覆盖区域) 设置此截止值。默认值为 <5。
--bamout BAMFILE target reads will be exported to this bam file.
--目标reads将导出到此 BAM 文档。
-1 begin position of bed file is 1-based.
--bed的起始位置从1开始。
-h, --help print this help info.
--打印帮助信息。

还有一个参数已经被更新掉了,我用的版本是1.0.9
--use_rmdup (an invalid parament since v1.0.0 )
Use rmdup depth instead of cover depth to calculate the coverage of target regions and so on.
使用 rmdup depth 而不是 cover depth 来计算目标区域的覆盖率等等。


3.输出结果

主要输出7个文件

#--help中的介绍
* Five essential files would be created in the output dir. 
* region.tsv.gz and depth.tsv.gz are zipped by bgzip, so you can use tabix 
  index these files.

 - coverage.report     a report of the coverage information and reads 
                       information of whole target regions
 - cumu.plot           distribution data of depth values  #(这里应该是指文件:depth_distribution.plot(?可能作者更新没改这里?))
 - insert.plot         distribution data of inferred insert size 
 - chromosome.report   coverage information for each chromosome
 - region.tsv.gz       mean depth, median depth and coverage of each region
 - depth.tsv.gz        raw depth, rmdup depth, coverage depth of each position
 - uncover.bed         the bad covered or uncovered region in the probe file

* About depth.tsv.gz:
* There are five columns in this file, including chromosome, position, raw
* depth, rmdep depth, coverage depth
 - chromosome          the chromosome name
 - position            1-based position of each chromosome
 - raw depth           raw depth of position, not filter
 - rmdup depth         remove duplication, and only calculate the reads which
                       are primary mapped and mapQ >= cutoff_mapQ (default 20)
 - coverage depth      calculate the deletions (CIGAR level) into depths,
                       for coverage use.


输出文件内容详解:

chromosomes.report

从bam文件中获取的每条染色体的深度、覆盖度等信息

#Chromosome	   DATA(%)	 Avg depth	    Median	 Coverage%	  Cov 4x %	 Cov 10x %	 Cov 30x %	Cov 100x %
       Chr1	   13.38	   90.92	      4.0	   60.75	   51.37	   43.06	   31.89	   18.56

coverage.report

总结的统计信息结果,这个的文件的信息有很多,包含target区域和flank区域的所有覆盖度信息等。

## The file was created by bamdst
## Version : 1.0.9
## Files : /share/ofs1b/project/RNA_Seq/NLJKX20241028004_RnaSeq_T11_051/02_align/HF_N1_bam/HF_N1_Aligned.sort.out.bam 
                               [Total] Raw Reads (All reads)	95543390  #All reads in the bam file(s). bam文件中所有reads。
                                       [Total] QC Fail reads	0  #Reads number failed QC, this flag is marked by other software,like bwa. See flag in the bam structure. 读取失败的reads数,被其它软件标记的标志,如bwa,参照bam结构中的flag值。(指bam文件中的第二列:https://blog.csdn.net/LittleComputerRobot/article/details/139371891)
                                        [Total] Raw Data(Mb)	14248.07  #Total reads data in the bam file(s). 读取的bam文件的数据总量。
                                        [Total] Paired Reads	95543390  #Paired reads numbers. 配对reads数量。
                                        [Total] Mapped Reads	95543390  #Mapped reads numbers. 匹配上的reads数量。
                            [Total] Fraction of Mapped Reads	100.00%  #Ratio of mapped reads against raw reads. 比对上的reads和所有reads的比率。
                                     [Total] Mapped Data(Mb)	14248.07  #Mapped data in the bam file(s). bam文件中比对上的数据量。
                         [Total] Fraction of Mapped Data(Mb)	100.00%  #Ratio of mapped data against raw data. 比对上的数据和总数据比率。
                                     [Total] Properly paired	95543384  #Paired reads with properly insert size. See bam format protocol for details. 配对reads中有正确/恰当插入片段大小的数量,详细查看bam文件格式协议。(https://accio.github.io/bioinformatics/2020/03/10/filter-bam-by-insert-size.html)
                         [Total] Fraction of Properly paired	100.00%  #Ratio of properly paired reads against mapped reads. 正确配对的reads和比对上的reads的比率。
                                [Total] Read and mate paired	95543384  #Read (read1) and mate read (read2) paired. read1和read2中配对的reads。
                    [Total] Fraction of Read and mate paired	100.00%  #Ratio of read and mate paired against mapped reads.  reads和配对的reads与比对上的reads的比率。
                                          [Total] Singletons	6  #Read mapped but mate read unmapped, and vice versa. 一条map上但另一条没有,反之亦然。
                       [Total] Read and mate map to diff chr	0  #Read and mate read mapped to different chromosome, usually because mapping error and structure variants. read和配对的read比对到不同的基因组上,通常是因为比对错误或结构变异。
                                               [Total] Read1	47771698  #First reads in mate paired sequencing. read1数量。
                                               [Total] Read2	47771692  #Mate reads. read2数量。
                                        [Total] Read1(rmdup)	47771698  #First reads after remove duplications. 去重后的数量。
                                        [Total] Read2(rmdup)	47771692  #Mate reads after remove duplications. 去重后的数量。
                                [Total] forward strand reads	47771695  #Number of forward strand reads. 正链reads
                               [Total] backward strand reads	47771695  #Number of backward strand reads. 负链reads
                                 [Total] PCR duplicate reads	0  #PCR duplications. (https://zhuanlan.zhihu.com/p/660430787)
                     [Total] Fraction of PCR duplicate reads	0.00%  #Ratio of PCR duplications. PCR复制比率。
                            [Total] Map quality cutoff value	20  #Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads. map质量分数截止阈值,此值可以通过 -q 设置。默认为 20,因为像 GATK 这样的一些call突变的软件只考虑高质量的读取。
                       [Total] MapQuality above cutoff reads	93664629  #Number of reads with higher or equal quality score than cutoff value. 质量分数高于或等于截止值的reads数。
                 [Total] Fraction of MapQ reads in all reads	98.03%  #Ratio of reads with higher or equal Q score against raw reads. Q 分数较高或相等的reads与原始reads的比率。
              [Total] Fraction of MapQ reads in mapped reads	98.03%  #Ratio of reads with higher or equal Q score against mapped reads. Q 分数较高或相等的reads与映射reads的比率。
                                       [Target] Target Reads	91110291  #Number of reads covered target region (specified by bed file). 覆盖目标区域的reads数(由 bed 文档指定)。
              [Target] Fraction of Target Reads in all reads	95.36%  #Ratio of target reads against raw reads. 目标reads与原始reads的比率。
           [Target] Fraction of Target Reads in mapped reads	95.36%  #Ratio of target reads against mapped reads. 目标reads与map到的reads的比率。
                                    [Target] Target Data(Mb)	13435.50  #Total bases covered target region. If a read covered target region partly, only the covered bases will be counted. 覆盖目标区域的总碱基数。如果reads只覆盖了部分的目标区域,则仅计算覆盖到的碱基。
                              [Target] Target Data Rmdup(Mb)	13224.10  #Total bases covered target region after remove PCR duplications. 删除 PCR 复制后覆盖的目标区域的总碱基数。
                [Target] Fraction of Target Data in all data	94.30%  #Ratio of target bases against raw bases. 目标碱基与原始碱基的比率。
             [Target] Fraction of Target Data in mapped data	94.30%  #Ratio of target bases against mapped bases. 目标碱基与映射碱基的比率。
                                      [Target] Len of region	164886854  #The length of target regions. 目标区域的长度。
                                      [Target] Average depth	81.48  #Average depth of target regions. Calculated by "target bases / length of regions". 目标区域的平均深度。按“目标区域碱基数/区域长度”计算。
                               [Target] Average depth(rmdup)	80.20  #Average depth of target regions after remove PCR duplications. 去除 PCR 复制后目标区域的平均深度。
                                     [Target] Coverage (>0x)	51.61%  #Ratio of bases with depth greater than 0x in target regions, which also means the ratio of covered regions in target regions. 目标区域中深度大于 0x 的碱基比率,这也意味着目标区域中覆盖区域的比率。
                                    [Target] Coverage (>=4x)	43.23%  #Ratio of bases with depth greater than or equal to 4x in target regions. 目标区域中深度大于或等于 4 倍的碱基比率。
                                   [Target] Coverage (>=10x)	36.09%  #Ratio of bases with depth greater than or equal to 10x in target regions. 目标区域中深度大于或等于 10 倍的碱基比率。
                                   [Target] Coverage (>=30x)	26.76%  #Ratio of bases with depth greater than or equal to 30x in target regions. 目标区域中深度大于或等于 30 倍的碱基比率。
                                  [Target] Coverage (>=100x)	15.74%  #Ratio of bases with depth greater than or equal to 100x in target regions. 目标区域中深度大于或等于 100x 的碱基比率。
                                  [Target] Coverage (>=Nx)              #This is addtional line for user self-defined cutoff value, see --cutoffdepth 这是用户自定义截止值的附加行,参见 --cutoffdepth
                                [Target] Target Region Count	55104  #Number of target regions. In normal practise,it is the total number of exomes. 目标区域的数量。在正常实验中,它是外显子组的总数。
                                [Target] Region covered > 0x	25846  #The number of these regions with average depth greater than 0x. 平均深度大于 0x 的区域的数量。
                       [Target] Fraction Region covered > 0x	46.90%  #Ratio of these regions with average depth greater than 0x. 平均深度大于 0x 的这些区域的比率。
                      [Target] Fraction Region covered >= 4x	41.22%  #Ratio of these regions with average depth greater than or equal to 4x. 平均深度大于或等于 4x 的这些区域的比率。
                     [Target] Fraction Region covered >= 10x	36.67%  #Ratio of these regions with average depth greater than or equal to 10x. 平均深度大于或等于 10x 的这些区域的比率。
                     [Target] Fraction Region covered >= 30x	29.09%  #Ratio of these regions with average depth greater than or equal to 30x. 平均深度大于或等于 30x 的这些区域的比率。
                    [Target] Fraction Region covered >= 100x	15.76%  #Ratio of these regions with average depth greater than or equal to 100x. 平均深度大于或等于 100x 的这些区域的比率。
                                          [flank] flank size	200  #The flank size will be count. 200 bp in default. Oligos could also capture the nearby regions of target regions. 侧翼区将被计算在内。默认 200 个基点。寡核苷酸还可以捕获目标区域的附近区域。
           [flank] Len of region (not include target region)	186299040  #The length of flank regions (target regions will not be count). 侧翼区的长度 (目标区域将不计算在内)
                                       [flank] Average depth	72.94  #Average depth of flank regions. 侧翼区域的平均深度。
                                         [flank] flank Reads	91839874  #The total number of reads covered the flank regions. Note: some reads covered the edge of target regions, will be count in flank regions also. 覆盖侧翼区域的读取总数。注意:一些读取覆盖了目标区域的边缘,也将计入侧翼区域。
                [flank] Fraction of flank Reads in all reads	96.12%  #Ratio of reads covered in flank regions against raw reads. 侧翼区域覆盖的读取与原始读取的比率。
             [flank] Fraction of flank Reads in mapped reads	96.12%  #Ration of reads covered in flank regions against mapped reads. 侧翼区域覆盖的读取与映射读取的比率。
                                      [flank] flank Data(Mb)	13588.68  #Total bases in the flank regions. 侧翼区域的总碱基数。
                  [flank] Fraction of flank Data in all data	95.37%  #Ratio of total bases in the flank regions against raw data. 侧翼区域的总碱基数与原始数据的比率。
               [flank] Fraction of flank Data in mapped data	95.37%  #Ratio of total bases in the flank regions against mapped data. 侧翼区域的总碱基与映射数据的比率。
                                      [flank] Coverage (>0x)	48.01%  #Ratio of flank bases with depth greater than 0x. 深度大于 0x 的侧翼碱基的比率。
                                     [flank] Coverage (>=4x)	39.51%  #Ratio of flank bases with depth greater than or equal to 4x. 深度大于或等于 4 倍的侧翼碱基的比率。
                                    [flank] Coverage (>=10x)	32.68%  #Ratio of flank bases with depth greater than or equal to 10x. 深度大于或等于 10 倍的侧翼碱基的比率。
                                    [flank] Coverage (>=30x)	24.06%  #Ratio of flank bases with depth greater than or equal to 30x.深度大于或等于 30 倍的侧翼碱基的比率。
                                   [flank] Coverage (>=100x)	14.07%  #Ratio of flank bases with depth greater than or equal to 100x.(官网没这句,但我结果有,我自己加了一句)深度大于或等于 100 倍的侧翼碱基的比率。

map:映射,回帖至,匹配,意会一下(ˉ▽ˉ;)...

depth_distribution.plot

结合R可以做出目标区域的深度分布图(.plot结尾的几个文件应该都可以作图,不过我没做这里,后面试试吧

0	79772141	0.483856	85095379	0.516144
1	5844883	0.035452	79250496	0.480692
2	4686107	0.028423	74564389	0.452269
3	3289529	0.019953	71274860	0.432316
4	2778824	0.016855	68496036	0.415461

左边三列分别代表:覆盖深度,对应该深度的碱基数,碱基占比(以目标区域的碱基数为分母)。
右边两列是高深度向低深度累加的值,分别是碱基数及其占比,累加到X=1为止。

depth.tsv

对于探针文档 (in.bed) 中的每个位置,三种深度值,包括原始深度raw depth、删除重复读取的深度Rmdup depth和覆盖深度Cover depth。原始深度是从输入 bam 中检索的。Rmdup depth为去除了PCR重复的reads, 次优比对reads, 低比对质量的reads(mapQ < 20),该值与samtools depth的输出深度类似
其中,输出深度为 samtools 深度。覆盖深度是考虑删除区域的原始深度,因此此值应等于或大于原始深度。我们使用原始深度raw depth来统计 coverage.report 文件中的覆盖率信息。如果想用rmdup depth来计算覆盖率,可使用参数 “--use_rmdup”。

#Chr    Pos     Raw Depth       Rmdup depth     Cover depth
Chr1    2904    2       2       2
Chr1    2905    2       2       2
Chr1    2906    2       2       2

insertsize.plot

做出两个接头之间的插入片段大小的图,理论上是根据read1和read2在染色体上的坐标信息求得,这时read1和read2要求比对到一条染色体上。

0	0	0.000000	45594984	1.000000
1	0	0.000000	45594984	1.000000
2	0	0.000000	45594984	1.000000
3	0	0.000000	45594984	1.000000

region.tsv

可以用来评估捕获效率,结合基因自身的结构能探究其对捕获测序的影响
目标区域文件每一个区域的统计,其中Coverage以X>0来统计

#Chr    Start   Stop    Avg depth       Median  Coverage        Coverage(FIX)
Chr1    2903    10817   72.74   31.0    70.86   70.86
Chr1    11218   12435   21.75   21.0    100.00  100.00

uncover.bed

没有捕获的区域
--uncover的值默认是5,包含低覆盖或者是未覆盖的,通过"--uncover"规定“未覆盖”的阈值。

Chr1	2903	2914
Chr1	3751	4360
Chr1	4596	5458

第一次写这种,markdown甚至用的还不太熟练(所以啥都试了试),有些翻译可能有点词不达意,建议看看官网,如果有误还望指出,之后应该会再整理一些自己用过的软件或者包,当然可能已经烂大街了(ˉ▽ˉ;)...整理一下前人智慧自己看着玩儿吧。


整理参考,感谢:
https://www.jianshu.com/p/ac7b8e4d76e4
https://zhuanlan.zhihu.com/p/548250737
https://github.com/shiquan/bamdst

posted @ 2024-11-06 15:57  Janesy  阅读(1299)  评论(0)    收藏  举报