[TIPS]intersectBED/coverageBED -split option
I had posted a question on BioStar. http://www.biostars.org/post/show/53032/intersectbedcoveragebed-split-purify-exon/. I was wondering how to calculate RPKM for three situations:
- For given RefSeq gene, all reads overlapping with gene are included. (Q1)
- For given RefSeq gene, only reads overlapping exon are inclued.(Q3)
- For given Exon, all reads overlapping with exon are inclued.(Q2)
Questions are followed:
all.reads.bam file records mapped RNA-seq reads data, including:
- exon:exon junction
- exon body
- intron body
- exon:intron junction
Q1: When calculating RPKM for given RefSeq gene including all the position reads, will the following command just calculate exon:exon junction reads and at same time ignore all other reads?
coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed
I'm confused by the mannual (Page 62):
When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron seqeunce), The -split command allows for such coverage to be performed.
If "-split" is set, the exon:exon read (for example, 30M3000N46M") exists in -abam bam file, and the 3000N will NOT be wrongly intersected when running intersectBED command. But what about coverageBED command? I do hope the 3000N will be not calculated which makes sense, and I also hope the intron body reads and other reads will be NOT ignored.
Q2: If one just want to calculate exon's RPKM, does it mean one should prepare -b file to record all the exon information, and run like
coverageBED -abam all.reads.bam -b all.exons.bed -s >coverage.bed
Q3: How to calculate RPKM for given genes whose reads overlap with exons? We all known BED12 format file record the exon information (start and length) for given RefSeq gene. Could BEDtools do the magic things? Note this calculation is different from Q1.
As I don't have time to wait for others' replies, dummy data is produced to help me understand how exactly intersectBED/coverageBED -split works?
"reads.bed" input file records all the reads information in BED12 format:
chr1 30 50 a 0 + 30 50 0 1 20, 30, chr1 110 130 b 0 + 110 130 0 1 20, 110, chr1 180 210 c 0 + 180 210 0 1 30, 180, chr1 110 230 d 0 + 110 230 0 2 20,10, 110,220, chr1 60 230 e 0 + 60 70 0 2 50,10, 60,220,
read_a: within exon1
read_b: within intron1
read_c: overlap with 5' end of exon2
read_d: spliced read, 5' half within intron1, and 3' half within exon2 ( in the real situation, both half reads will be within exon)
read_e: spliced read, 5' half overlap with 3' end of exon1, and 3' half within exon2
"ref.bed" input file records the given refseq gene with exon information in BED12 format:
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400,
Now let's see what is going on with -split option?
1. No "-split" is set at either step.
head -1 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed
then we get:
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 1 20 500 0.0400000
read_a (number is 1) managed to be calculated, and the base coverage length is 20 nt as expected.
The same thing happens when we run :
head -2 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 2 40 500 0.0800000
This means as long as the read is mapped within the gene, it will be calculated as expected.
2. intersectBED -split will ignore all the intron reads
Having run this command, we will get:
intersectBed -a reads.bed -b ref.bed -s
chr1 30 50 a 0 + 30 50 0 1 20, 30, chr1 110 130 b 0 + 110 130 0 1 20, 110, chr1 180 210 c 0 + 180 210 0 1 30, 180, chr1 110 230 d 0 + 110 230 0 2 20,10, 110,220, chr1 60 230 e 0 + 60 70 0 2 50,10, 60,220,
All the reads are included.
However, when we add -split:
intersectBed -a reads.bed -b ref.bed -s -split
chr1 30 50 a 0 + 30 50 0 1 20, 30, chr1 180 210 c 0 + 180 210 0 1 30, 180, chr1 110 230 d 0 + 110 230 0 2 20,10, 110,220, chr1 60 230 e 0 + 60 70 0 2 50,10, 60,220,
Only read_b which is mapped within intron is gone. Though read_d has 5' half located within intron, it still passes and gets calculated.
3. coverageBED -split will calculate sub-region for spliced read, and will NOT ignore intron reads.
How about the next command coverage?
Having run as followd, then we will get the gene reads number.
intersectBed -a reads.bed -b ref.bed -s |coverageBed -a stdin -b ref.bed -s
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 5 190 500 0.3800000
How to get 190? Let's check it step by step.
head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 3 70 500 0.1400000
Apparently, the read_a to read_c pass and get calculated. 30~50, 110~130, and 180~210 are used to calculate the base coverage.
Try add read_d into the list.
head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 4 140 500 0.2800000
read_a to read_d pass. read_d has 2 blockes, one is 110~130 while the other is 220~230. Indeed, 110~230 will be included, but the repeat will not be calculated repeatedly. Thus, 30~50, 110~130, 130~180, 180~210, 210~-230, in sum, 140nt will be the anwser.
After adding -split.
awk '{if ($4 == "d") print $0}' reads.bed | intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split
chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 1 30 500 0.0600000
head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 3 70 500 0.1400000
head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 4 100 500 0.2000000
NOTE: Thus, coverageBED -split will calculate the repeated region 110~130 (within intron). Read_b is 110~130, and read_e 's 5'half is same.
This bug-like result probably is due to the 5'half of read_d will NOT be produced in the real RNA-seq experiments.
NOTE: intron repeat will calculated twice, but the exon repeat will NOT.
head -5 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split chr1 0 500 gene 0 + 0 500 0 3 100,100,100, 0,200,400, 5 150 500 0.3000000
The 220~230 region is covered by read_d and read_e twice. if this region is calculated twice, the answer will be 160, but it is 150 because 220~230 is calculated only once.