Rle与GRanges

1 Rle（Run Length Encoding，行程编码）

1.1 Rle类和Rle对象

序列或基因最终要定位到染色体上。序列往往数量非常巨大，但染色体数量很少，如果每条序列的染色体定位都显式标注，将会产生大量的重复信息，更糟糕的是它们要占用大量的内存。BioC的IRanges包为这些数据提供了一种简便可行的信息压缩方式，即Rle。如果染色体1-3分别有3000,5000和2000个基因，基因的染色体注释可以用字符向量表示，也可以用Rle对象表示：

c(rep("ChrI", 3000), rep("ChrII", 5000), rep("ChrIII", 2000)) chr.rle <- Rle(chr.str)

两种方式的效果是完全一样的，但是Rle对象占用空间还不到字符向量的2%：

## [1] TRUE

## [1] 0.01616283

使用Rle并不总是可以“压缩”数据。如果信息没有重复或重复量很少，Rle会占用更多的内存：

sample(DNA_BASES, 10000, replace = TRUE) strx.rle <- Rle(strx) as.vector(object.size(strx.rle)/object.size(strx))

chr.rle

getClass(class(chr.rle))

Rle(values)
Rle(values, lengths)

values和lengths均为（原子）向量。第一种用法前面已经出现过了，我们看看第二种用法：

Rle(values = c("Chr1", "Chr2", "Chr3", "Chr1", "Chr3"), lengths = c(3, 2, 5, 4, 5)) chr.rle

as(chr.str, "Rle")

runLength(chr.rle)

runValue(chr.rle)

nrun(chr.rle)

start(chr.rle)

end(chr.rle)

width(chr.rle)

runLength(chr.rle) <- rep(3, nrun(chr.rle))
chr.rle

runValue(chr.rle)[3:4] <- c("III", "IV")
chr.rle

## 替换向量和被替换向量的长度必需相同，否则出错。下面两个语句都不正确：
runValue(chr.rle) <- c("ChrI", "ChrV")

runLength(chr.rle) <- 3

as.factor(chr.rle)

as.character(chr.rle)

set.seed(0)
rle1 <- Rle(sample(4, 6, replace = TRUE))
rle2 <- Rle(sample(5, 12, replace = TRUE))
rle3 <- Rle(sample(4, 8, replace = TRUE))
rle1 + rle2

rle1 + rle3

rle1 * rle2

sqrt(rle1)

range(rle1)

cumsum(rle1)

(rle1 <- Rle(sample(DNA_BASES, 10, replace = TRUE)))

(rle2 <- Rle(sample(DNA_BASES, 8, replace = TRUE)))

paste(rle1, rle2, sep = "")

数据结构和处理方法，但不直接面向序列处理；GenomicRanges包定义的GRanges和GRangesList类除了储存Ranges信息外还包含了序列的名称和DNA链等信息；而GenomicFeatures（包）则处理以数据库形式提供的GRanges信息，如基因、外显子、内含子、启动子、UTR等。<br>先看看BioC中Ranges最基本的类定义：

## Virtual Class "Ranges" [package "IRanges"]
## 
## Slots:
##                                                       
## Name:      elementType elementMetadata        metadata
## Class:       character DataTableORNULL            list
## 
## Extends: 
## Class "IntegerList", directly
## Class "RangesORmissing", directly
## Class "AtomicList", by class "IntegerList", distance 2
## Class "List", by class "IntegerList", distance 3
## Class "Vector", by class "IntegerList", distance 4
## Class "Annotated", by class "IntegerList", distance 5
## 
## Known Subclasses: 
## Class "IRanges", directly
## Class "Partitioning", directly
## Class "GappedRanges", directly
## Class "NCList", directly
## Class "IntervalTree", directly
## Class "NormalIRanges", by class "IRanges", distance 2
## Class "PartitioningByEnd", by class "Partitioning", distance 2
## Class "PartitioningByWidth", by class "Partitioning", distance 2
## Class "PartitioningMap", by class "Partitioning", distance 3

Ranges是虚拟类，实际应用中最常用的IRanges子类，它继承了Ranges的数据结构，另外多设置了3个slots（存储槽），分别用于存贮Ranges的起点、宽度和名称信息。由于Ranges由整数确定，所以称为IRanges（Integer Ranges，整数区间），但也有人理解成间隔区间（Interval Ranges）：

##       elementType   elementMetadata          metadata 
##       "character" "DataTableORNULL"            "list"

##             start             width             NAMES       elementType 
##         "integer"         "integer" "characterORNULL"       "character" 
##   elementMetadata          metadata 
## "DataTableORNULL"            "list"

GRanges是Ranges概念在序列处理上的具体应用，但它和IRanges没有继承关系：

##        seqnames          ranges          strand elementMetadata 
##           "Rle"       "IRanges"           "Rle"     "DataFrame" 
##         seqinfo        metadata 
##       "Seqinfo"          "list"

Ranges对于序列处理非常重要，除GenomicRanges外，Biostrings一些类的定义也应用了Ranges：

##           subject            ranges       elementType   elementMetadata 
##         "XString"         "IRanges"       "character" "DataTableORNULL" 
##          metadata 
##            "list"

2.2 对象构建和属性获取

IRanges对象可以使用对象构造函数IRanges产生，需提供起点（start）、终点（end）和宽度（width）三个参数中的任意两个：

IRanges(start = 1:10, width = 10:1) ir2 <- IRanges(start = 1:10, end = 11) ir3 <- IRanges(end = 11, width = 10:1) ir1

genes <- GRanges(seqnames = c("Chr1", "Chr3", "Chr3"), ranges = IRanges(start = c(1300,
    1050, 2000), end = c(2500, 1870, 3200)), strand = c("+", "+", "-"), seqlengths = c(Chr1 = 1e+05,
    Chr3 = 2e+05))
genes

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     Chr1 [1300, 2500]      +
##   [2]     Chr3 [1050, 1870]      +
##   [3]     Chr3 [2000, 3200]      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome

IRanges和GRanges都是S4类，其属性获取有相应的函数：

##  [1]  1  2  3  4  5  6  7  8  9 10

##  [1] 10 10 10 10 10 10 10 10 10 10

##  [1] 10  9  8  7  6  5  4  3  2  1

## IRanges of length 3
##     start  end width
## [1]  1300 2500  1201
## [2]  1050 1870   821
## [3]  2000 3200  1201

## [1] 1300 1050 2000

Views对象也包含有IRanges属性：

function(dict, n) { paste(sample(dict, n, replace = T), collapse = "") } set.seed(0) dna <- DNAString(rndSeq(DNA_BASES, 1000)) vws <- as(maskMotif(dna, "TGA"), "Views") (ir <- ranges(vws))

plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
    add = FALSE, ybottom = NULL, ...) {
    require(scales)
    col <- alpha(col, 0.5)
    height <- 1
    sep <- 0.5
    if (is(xlim, "Ranges")) {
        xlim <- c(min(start(xlim)), max(end(xlim)) * 1.2)
    }
    if (!add) {
        bins <- disjointBins(IRanges(start(x), end(x) + 1))
        ybottom <- bins * (sep + height) - height
        par(mar = c(3, 0.5, 2.5, 0.5), mgp = c(1.5, 0.5, 0))
        plot.new()
        plot.window(xlim, c(0, max(bins) * (height + sep)))
    }
    rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col,
        ...)
    text((start(x) + end(x))/2, ybottom + height/2, 1:length(x), col = "white",
        xpd = TRUE)
    title(main)
    axis(1)
    invisible(ybottom)
}

shift函数对Ranges进行平移（下面图形中蓝色为原始Ranges，红色为变换后的Ranges，黑色/灰色则为参考Ranges，其他颜色为重叠区域）：

IRanges(c(3000, 2500), width = c(300, 850)) ir.trans <- shift(ir, 500) xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "shift", col = "blue") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

flank函数获取Ranges的相邻区域，width参数为整数表示左侧，负数表示右侧：

flank(ir, width = 200) xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "flank", col = "blue") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

reflect函数获取Ranges的镜面对称区域，bounds为用于设置镜面位置的Ranges对象：

IRanges(c(2000, 3000), width = 500) ir.trans <- reflect(ir, bounds = bounds) xlim <- c(0, max(end(ir, ir.trans, bounds)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "reflect", col = "blue") plotRanges(bounds, add = TRUE, ybottom = ybottom, main = "") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

promoters函数获取promoter区域，upstream和downstream分别设置上游和下游截取的序列长度：

promoters(ir, upstream = 1000, downstream = 100) xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "promoters", col = "blue") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

resize函数改变Ranges的大小，width设置宽度，fix设置固定位置（start/end/center）：

resize(ir, width = c(100, 1300), fix = "start") xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "resize, fix=\"start\"", col = "blue") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red") ir.trans <- resize(ir, width = c(100, 1300), fix = "center") xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, main = "resize, fix=\"center\"", col = "blue") plotRanges(ir.trans, add = TRUE, ybottom = ybottom, main = "", col = "red")

其他函数的使用请自行尝试使用。

2.3.2 Ranges间转换（Inter-range transformations）

range函数用于获取Ranges所包括的整个区域（包括间隔区）；reduce将重叠区域合并；gaps用于获取间隔区域：

IRanges(c(200, 1000, 3000, 2500), width = c(600, 1000, 300, 850)) ir.trans <- range(ir) xlim <- c(0, max(end(ir, ir.trans)) * 1.3) ybottom <- plotRanges(ir, xlim = xlim, col = "blue") plotRanges(ir.trans, xlim = xlim, col = "red", main = "range") ir.trans <- reduce(ir) plotRanges(ir.trans, xlim = xlim, col = "red", main = "reduce") ir.trans <- gaps(ir) plotRanges(ir.trans, xlim = xlim, col = "red", main = "gaps")

2.3.3 Ranges对象间的集合运算

intersect求交集区域；setdiff求差异区域；union求并集：

IRanges(c(200, 1000, 3000, 2500), width = c(600, 1000, 300, 850)) ir2 <- IRanges(c(100, 1500, 2000, 3500), width = c(600, 800, 1000, 550)) xlim <- c(0, max(end(ir1, ir2)) * 1.3) ybottom <- plotRanges(reduce(ir1), xlim = xlim, col = "blue", main = "original") plotRanges(reduce(ir2), xlim = xlim, col = "blue", main = "", add = TRUE, ybottom = ybottom) plotRanges(intersect(ir1, ir2), xlim = xlim, col = "red") plotRanges(setdiff(ir1, ir2), xlim = xlim, col = "red") plotRanges(union(ir1, ir2), xlim = xlim, col = "red")

此外还有punion，pintersect，psetdiff和pgap函数，进行element-wise的运算。

3 SessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 8 (jessie)
## 
## locale:
##  [1] LC_CTYPE=zh_CN.utf8       LC_NUMERIC=C             
##  [3] LC_TIME=zh_CN.utf8        LC_COLLATE=zh_CN.utf8    
##  [5] LC_MONETARY=zh_CN.utf8    LC_MESSAGES=zh_CN.utf8   
##  [7] LC_PAPER=zh_CN.utf8       LC_NAME=C                
##  [9] LC_ADDRESS=C              LC_TELEPHONE=C           
## [11] LC_MEASUREMENT=zh_CN.utf8 LC_IDENTIFICATION=C      
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scales_0.3.0         GenomicRanges_1.18.4 GenomeInfoDb_1.2.5  
##  [4] Biostrings_2.34.1    XVector_0.6.0        IRanges_2.0.1       
##  [7] S4Vectors_0.4.0      BiocGenerics_0.12.1  zblog_0.1.0         
## [10] knitr_1.11          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0      plyr_1.8.3       formatR_1.2      magrittr_1.5    
##  [5] evaluate_0.7.2   highr_0.5        stringi_0.5-5    zlibbioc_1.12.0 
##  [9] tools_3.2.2      stringr_1.0.0    munsell_0.4.2    colorspace_1.2-6

参考资料

R语言与生物信息学博客：http://blog.csdn.net/u014801157/article/details/24372479

posted @ 2017-07-25 16:37 ywliao 阅读(1370) 评论(0) 收藏举报

刷新页面返回顶部

ywliao