统计测序深度和覆盖度的工具
1.GATK DepthOfCoverage
cat test.sh
#!/usr/bin/bash HG38=/path/hg38/hg38.fa GATK=/path/biosoft/gatk3.7/GenomeAnalysisTK.jar java -jar -Xmx10g $GATK -T DepthOfCoverage \ -R $HG38 \ -o ./test \ -I /path2bam/03_bam_processing/03_base_recal/test.sorted_MarkDuplicates_BQSR.bam \ -L /path2bed/target.bed \ --omitDepthOutputAtEachBase --omitIntervalStatistics \ -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100
结果生成四个文件
4096 Jan 13 21:39 ./ 4096 Jan 13 21:05 ../ 6417 Jan 13 21:39 test.sample_cumulative_coverage_counts GenekVIP 6412 Jan 13 21:39 test.sample_cumulative_coverage_proportions 9362 Jan 13 21:39 test.sample_statistics 291 Jan 13 21:39 test.sample_summary
less test.sample_sumary 结果不是很好
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_1 %_bases_above_5 %_bases_above_10 %_bases_above_20 %_bases_above_3 test 2025198 17.50 1 1 1 5.1 2.7 2.6 2.5 2.5 2.4 2.3 Total 2025198 17.50 N/A N/A N/A
#
2.bamdst: bam文件深度统计
#
软件安装:
git clone https://github.com/shiquan/bamdst.git cd bamdst/ make
测试:
cat test.sh
#!/usr/bin/bash
/path2bamdst/biosoft/bamdst/bamdst -p /path2bed/target.bed -o ./ ../03_base_recal/test.sorted.markdup.realign.BQSR.bam echo "run status $?"
结果文件:
1630 Jan 13 16:58 chromosomes.report 4331 Jan 13 16:58 coverage.report 64188 Jan 13 16:58 depth_distribution.plot 838584 Jan 13 16:58 depth.tsv.gz 18119 Jan 13 16:58 insertsize.plot 9519 Jan 13 16:58 region.tsv.gz 0 Jan 13 16:58 uncover.bed
看一下内容:
$ less -S chromosomes.report#该文件中存储的是从bam文件中获取的染色体深度、覆盖度信息 Chromosome DATA(%) Avg depth Median Coverage% Cov 4x % Cov 10x % Cov 30x % chrM 100.00 0.26 0.0 5.66 2.83 0.00 0.00 $ less coverage.report#信息太多了,我目前觉得比较重要的有 [Total] Raw Reads (All reads) 15 [Total] Mapped Reads 15 [Total] Properly paired 0#Paired reads with properly insert size [Total] Fraction of Properly paired 0.00% [Total] Read and mate paired 0#read1 and read2 paired. [Total] Fraction of Read and mate paired 0.00% [Total] Map quality cutoff value 20 [Total] MapQuality above cutoff reads 10 [Total] Fraction of MapQ reads in all reads 66.67% [Total] Fraction of MapQ reads in mapped reads 66.67% [Target] Average depth 0.26 [Target] Average depth(rmdup) 0.06 [Target] Coverage (>0x) 5.66% [Target] Coverage (>=4x) 2.83% [Target] Coverage (>=10x) 0.00% [Target] Coverage (>=30x) 0.00% $ less depth_distribution.plot#结合R可以做出目标区域的深度分布图 0 900 0.943396 54 0.056604 1 0 0.000000 54 0.056604 2 0 0.000000 54 0.056604 3 27 0.028302 27 0.028302 4 4 0.004193 23 0.024109 #左边三列分别代表:覆盖深度,对应该深度的碱基数,碱基占比(以目标区域的碱基数为分母); #右边两列是高深度向低深度累加的值,分别是碱基数及其占比,累加到X=1为止。 $ less depth.tsv #Chr Pos Raw Depth Rmdup depth Cover depth chrM 650 8 6 8 chrM 651 8 6 8 chrM 652 8 6 8 chrM 653 9 6 9 chrM 654 9 6 9 #Raw Depth从输入bam文件中得到,没有任何限制; #Rmdup depth去除了PCR重复的reads, 次优比对reads, 低比对质量的reads(mapQ < 20), 该值与samtools depth的输出深度类似; #默认使用raw depth来统计coverage.report文件中的覆盖率信息。 如果要使用rmdup depth计算覆盖率,可使用参数"--use_rmdup"; #The coverage depth is the raw depth with consider of deletion region, so this value should be equal to or greated than the raw depth. $ less insertsize.plot#做出两个接头之间的插入片段大小的图,理论上是根据read1和read2在染色体上的坐标信息求得,这时read1和read2要求比对到一条染色体上。 $ less region.tsv #Chr Start Stop Avg depth Median Coverage Coverage(FIX) chrM 649 1603 0.26 0.0 5.66 5.66 #目标区域文件每一个区域的统计,其中Coverage以X>0来统计 $ less uncover.bed chrM 672 1603 #--uncover的值默认是5,从前面depth_distribution.plot文件中也能看出来,深度小于5的碱基数就是931; #包含低覆盖或者是未覆盖的,通过"--uncover"规定“未覆盖”的阈值。
REFERENCE 简书作者:hsy_hzauer 链接:https://www.jianshu.com/p/ac7b8e4d76e4