[TIPS]intersectBED/coverageBED -split option

I had posted a question on BioStar. http://www.biostars.org/post/show/53032/intersectbedcoveragebed-split-purify-exon/. I was wondering how to calculate RPKM for three situations: 

  1. For given RefSeq gene, all reads overlapping with gene are included. (Q1)
  2. For given RefSeq gene, only reads overlapping exon are inclued.(Q3)
  3. For given Exon, all reads overlapping with exon are inclued.(Q2)

 

Questions are followed: 

all.reads.bam file records mapped RNA-seq reads data, including:

  1. exon:exon junction
  2. exon body
  3. intron body
  4. exon:intron junction

Q1: When calculating RPKM for given RefSeq gene including all the position reads, will the following command just calculate exon:exon junction reads and at same time ignore all other reads?

coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed

I'm confused by the mannual (Page 62):

When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron seqeunce), The -split command allows for such coverage to be performed.

If "-split" is set, the exon:exon read (for example, 30M3000N46M") exists in -abam bam file, and the 3000N will NOT be wrongly intersected when running intersectBED command. But what about coverageBED command? I do hope the 3000N will be not calculated which makes sense, and I also hope the intron body reads and other reads will be NOT ignored.

Q2: If one just want to calculate exon's RPKM, does it mean one should prepare -b file to record all the exon information, and run like 

coverageBED -abam all.reads.bam -b all.exons.bed -s >coverage.bed

Q3: How to calculate RPKM for given genes whose reads overlap with exons? We all known BED12 format file record the exon information (start and length) for given RefSeq gene. Could BEDtools do the magic things? Note this calculation is different from Q1.

 

As I don't have time to wait for others' replies, dummy data is produced to help me understand how exactly intersectBED/coverageBED -split works? 

"reads.bed" input file records all the reads information in BED12 format: 

chr1    30    50    a    0    +    30    50    0    1    20,    30,
chr1    110   130   b    0    +    110   130   0    1    20,    110,
chr1    180   210   c    0    +    180   210   0    1    30,    180,
chr1    110   230   d    0    +    110   230   0    2    20,10,    110,220,
chr1    60    230   e    0    +    60    70    0    2    50,10,    60,220,

read_a: within exon1

read_b: within intron1

read_c: overlap with 5' end of exon2

read_d: spliced read, 5' half within intron1, and 3' half within exon2 ( in the real situation, both half reads will be within exon)

read_e: spliced read, 5' half overlap with 3' end of exon1, and 3' half within exon2

 

"ref.bed" input file records the given refseq gene with exon information in BED12 format: 

chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,

Now let's see what is going on with -split option? 

1. No "-split" is set at either step.

head -1 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed

then we get: 

chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    1   20    500    0.0400000

read_a (number is 1) managed to be calculated, and the base coverage length is 20 nt as expected. 

The same thing happens when we run :

head -2 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    2    40    500    0.0800000

This means as long as the read is mapped within the gene, it will be calculated as expected. 

2. intersectBED -split will ignore all the intron reads

Having run this command, we will get:

intersectBed -a reads.bed -b ref.bed -s

 

chr1    30    50    a    0    +    30    50    0    1    20,    30,
chr1    110    130    b    0    +    110    130    0    1    20,    110,
chr1    180    210    c    0    +    180    210    0    1    30,    180,
chr1    110    230    d    0    +    110    230    0    2    20,10,    110,220,
chr1    60    230    e    0    +    60    70    0    2    50,10,    60,220,

All the reads are included. 

However, when we add -split

intersectBed -a reads.bed -b ref.bed -s -split
chr1    30    50    a    0    +    30    50    0    1    20,    30,
chr1    180    210    c    0    +    180    210    0    1    30,    180,
chr1    110    230    d    0    +    110    230    0    2    20,10,    110,220,
chr1    60    230    e    0    +    60    70    0    2    50,10,    60,220,

Only read_b which is mapped within intron is gone. Though read_d has 5' half located within intron, it still passes and gets calculated. 

3. coverageBED -split will calculate sub-region for spliced read, and will NOT ignore intron reads. 

How about the next command coverage

Having run as followd, then we will get the gene reads number. 

intersectBed -a reads.bed -b ref.bed -s |coverageBed -a stdin -b ref.bed -s 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    5    190    500    0.3800000

How to get 190? Let's check it step by step.

head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    3    70    500    0.1400000

Apparently, the read_a to read_c pass and get calculated. 30~50, 110~130, and 180~210 are used to calculate the base coverage. 

Try add read_d into the list.

head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    4    140    500    0.2800000

read_a to read_d pass. read_d has 2 blockes, one is 110~130 while the other is 220~230. Indeed, 110~230 will be included, but the repeat will not be calculated repeatedly. Thus, 30~50, 110~130, 130~180, 180~210, 210~-230, in sum, 140nt will be the anwser. 

 

After adding -split.

awk '{if ($4 == "d") print $0}' reads.bed | intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    1    30    500    0.0600000
head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    3    70    500    0.1400000
head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    4    100    500    0.2000000

NOTE: Thus, coverageBED -split will calculate the repeated region 110~130 (within intron). Read_b is 110~130, and read_e 's 5'half is same. 

This bug-like result probably is due to the 5'half of read_d will NOT be produced in the real RNA-seq experiments. 

 

NOTE: intron repeat will calculated twice, but the exon repeat will NOT

head -5 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    5    150    500    0.3000000

The 220~230 region is covered by read_d and read_e twice. if this region is calculated twice, the answer will be 160, but it is 150 because 220~230 is calculated only once. 

posted @ 2012-09-15 22:21  Puriney  阅读(888)  评论(0编辑  收藏  举报