Rle与GRanges
1 Rle(Run Length Encoding,行程编码)
1.1 Rle类和Rle对象
序列或基因最终要定位到染色体上。序列往往数量非常巨大,但染色体数量很少,如果每条序列的染色体定位都显式标注,将会产生大量的重复信息,更糟糕的是它们要占用大量的内存。BioC的IRanges包为这些数据提供了一种简便可行的信息压缩方式,即Rle。如果染色体1-3分别有3000,5000和2000个基因,基因的染色体注释可以用字符向量表示,也可以用Rle对象表示:
两种方式的效果是完全一样的,但是Rle对象占用空间还不到字符向量的2%:
## [1] TRUE
## [1] 0.01616283
使用Rle并不总是可以“压缩”数据。如果信息没有重复或重复量很少,Rle会占用更多的内存:
chr.rle
getClass(class(chr.rle))
Rle(values)
Rle(values, lengths)
values和lengths均为(原子)向量。第一种用法前面已经出现过了,我们看看第二种用法:
as(chr.str, "Rle")
runLength(chr.rle)
runValue(chr.rle)
nrun(chr.rle)
start(chr.rle)
end(chr.rle)
width(chr.rle)
runLength(chr.rle) <- rep(3, nrun(chr.rle))
chr.rle
runValue(chr.rle)[3:4] <- c("III", "IV")
chr.rle
## 替换向量和被替换向量的长度必需相同,否则出错。下面两个语句都不正确:
runValue(chr.rle) <- c("ChrI", "ChrV")
runLength(chr.rle) <- 3
as.factor(chr.rle)
as.character(chr.rle)
set.seed(0)
rle1 <- Rle(sample(4, 6, replace = TRUE))
rle2 <- Rle(sample(5, 12, replace = TRUE))
rle3 <- Rle(sample(4, 8, replace = TRUE))
rle1 + rle2
rle1 + rle3
rle1 * rle2
sqrt(rle1)
range(rle1)
cumsum(rle1)
(rle1 <- Rle(sample(DNA_BASES, 10, replace = TRUE)))
(rle2 <- Rle(sample(DNA_BASES, 8, replace = TRUE)))
paste(rle1, rle2, sep = "")
## Virtual Class "Ranges" [package "IRanges"] ## ## Slots: ## ## Name: elementType elementMetadata metadata ## Class: character DataTableORNULL list ## ## Extends: ## Class "IntegerList", directly ## Class "RangesORmissing", directly ## Class "AtomicList", by class "IntegerList", distance 2 ## Class "List", by class "IntegerList", distance 3 ## Class "Vector", by class "IntegerList", distance 4 ## Class "Annotated", by class "IntegerList", distance 5 ## ## Known Subclasses: ## Class "IRanges", directly ## Class "Partitioning", directly ## Class "GappedRanges", directly ## Class "NCList", directly ## Class "IntervalTree", directly ## Class "NormalIRanges", by class "IRanges", distance 2 ## Class "PartitioningByEnd", by class "Partitioning", distance 2 ## Class "PartitioningByWidth", by class "Partitioning", distance 2 ## Class "PartitioningMap", by class "Partitioning", distance 3
Ranges是虚拟类,实际应用中最常用的IRanges子类,它继承了Ranges的数据结构,另外多设置了3个slots(存储槽),分别用于存贮Ranges的起点、宽度和名称信息。由于Ranges由整数确定,所以称为IRanges(Integer Ranges,整数区间),但也有人理解成间隔区间(Interval Ranges):
## elementType elementMetadata metadata ## "character" "DataTableORNULL" "list"
## start width NAMES elementType ## "integer" "integer" "characterORNULL" "character" ## elementMetadata metadata ## "DataTableORNULL" "list"
GRanges是Ranges概念在序列处理上的具体应用,但它和IRanges没有继承关系:
## seqnames ranges strand elementMetadata ## "Rle" "IRanges" "Rle" "DataFrame" ## seqinfo metadata ## "Seqinfo" "list"
Ranges对于序列处理非常重要,除GenomicRanges外,Biostrings一些类的定义也应用了Ranges:
## subject ranges elementType elementMetadata ## "XString" "IRanges" "character" "DataTableORNULL" ## metadata ## "list"
2.2 对象构建和属性获取
IRanges对象可以使用对象构造函数IRanges产生,需提供起点(start)、终点(end)和宽度(width)三个参数中的任意两个:
genes <- GRanges(seqnames = c("Chr1", "Chr3", "Chr3"), ranges = IRanges(start = c(1300,
1050, 2000), end = c(2500, 1870, 3200)), strand = c("+", "+", "-"), seqlengths = c(Chr1 = 1e+05,
Chr3 = 2e+05))
genes
## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] Chr1 [1300, 2500] + ## [2] Chr3 [1050, 1870] + ## [3] Chr3 [2000, 3200] - ## ------- ## seqinfo: 2 sequences from an unspecified genome
IRanges和GRanges都是S4类,其属性获取有相应的函数:
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 10 10 10 10 10 10 10 10 10 10
## [1] 10 9 8 7 6 5 4 3 2 1
## IRanges of length 3 ## start end width ## [1] 1300 2500 1201 ## [2] 1050 1870 821 ## [3] 2000 3200 1201
## [1] 1300 1050 2000
Views对象也包含有IRanges属性:
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
add = FALSE, ybottom = NULL, ...) {
require(scales)
col <- alpha(col, 0.5)
height <- 1
sep <- 0.5
if (is(xlim, "Ranges")) {
xlim <- c(min(start(xlim)), max(end(xlim)) * 1.2)
}
if (!add) {
bins <- disjointBins(IRanges(start(x), end(x) + 1))
ybottom <- bins * (sep + height) - height
par(mar = c(3, 0.5, 2.5, 0.5), mgp = c(1.5, 0.5, 0))
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
}
rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col,
...)
text((start(x) + end(x))/2, ybottom + height/2, 1:length(x), col = "white",
xpd = TRUE)
title(main)
axis(1)
invisible(ybottom)
}
shift函数对Ranges进行平移(下面图形中蓝色为原始Ranges,红色为变换后的Ranges,黑色/灰色则为参考Ranges,其他颜色为重叠区域):
flank函数获取Ranges的相邻区域,width参数为整数表示左侧,负数表示右侧:
reflect函数获取Ranges的镜面对称区域,bounds为用于设置镜面位置的Ranges对象:
promoters函数获取promoter区域,upstream和downstream分别设置上游和下游截取的序列长度:
resize函数改变Ranges的大小,width设置宽度,fix设置固定位置(start/end/center):
其他函数的使用请自行尝试使用。
2.3.2 Ranges间转换(Inter-range transformations)
range函数用于获取Ranges所包括的整个区域(包括间隔区);reduce将重叠区域合并;gaps用于获取间隔区域:
2.3.3 Ranges对象间的集合运算
intersect求交集区域;setdiff求差异区域;union求并集:
此外还有punion,pintersect,psetdiff和pgap函数,进行element-wise的运算。
3 SessionInfo()
## R version 3.2.2 (2015-08-14) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Debian GNU/Linux 8 (jessie) ## ## locale: ## [1] LC_CTYPE=zh_CN.utf8 LC_NUMERIC=C ## [3] LC_TIME=zh_CN.utf8 LC_COLLATE=zh_CN.utf8 ## [5] LC_MONETARY=zh_CN.utf8 LC_MESSAGES=zh_CN.utf8 ## [7] LC_PAPER=zh_CN.utf8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=zh_CN.utf8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] stats4 parallel stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages: ## [1] scales_0.3.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.5 ## [4] Biostrings_2.34.1 XVector_0.6.0 IRanges_2.0.1 ## [7] S4Vectors_0.4.0 BiocGenerics_0.12.1 zblog_0.1.0 ## [10] knitr_1.11 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.0 plyr_1.8.3 formatR_1.2 magrittr_1.5 ## [5] evaluate_0.7.2 highr_0.5 stringi_0.5-5 zlibbioc_1.12.0 ## [9] tools_3.2.2 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6

浙公网安备 33010602011771号