突变数据清洗

mutation_data_tidy

这次偶然课题需要用到突变相关数据,在这做一下简单的总结,抛砖引玉。

癌症相关数据最近都是从FIREHOSE上下载的,第一次发现突变数据并不是按照矩阵形式存储的,还需要自己整理清洗数据(这就很麻烦嘛…),为了方便以后的使用,将整个流程做一下梳理。

我们先看一下原始下载的数据是这介个样子的:

其中MANIFEST.txt文件中包含的是所有有文件的信息,长介个样子:

## 6e92e6639f6c1f76588ae54777f1c289 TCGA-G2-A2EF-01.maf.txt
## 4acdd7c2af2f0bd89e4562759113b8ef TCGA-FD-A3NA-01.maf.txt
## b47e7e2572ff8155cfac01a4381aaa2a TCGA-DK-A1A5-01.maf.txt
## 60232e72368959618ddc019ad4e66b9a TCGA-BT-A20U-01.maf.txt
## f98a76eddbc68f9a1969551057a68bb0 TCGA-G2-A2EJ-01.maf.txt

以其中一个文件为例,里面的内容是介个样子的,有用的信息其中只有第2,9两列(好惨啊,其他的都没什么用):

接下来,我们就需要利用MANIFEST.txt遍历所有的样本文件,将所有的非同义突变的gene输出,下面是写的简单的脚本:

#! /usr/bin/env bash
cd PATH

if [ -d result ]  
    then echo "result exists"
else 
    mkdir result
fi

for file_name in `awk '{print $2}' MANIFEST.txt`
do
     sample_ID=${file_name%\.maf\.txt}
    `awk '{print $1"\t"$9}' $file_name | grep -v Silent | sed -e "1d" | sed -e "s/$/\t${sample_ID}/" | awk '{print $1"\t"$3}' >> result/result.txt`
done

最终,我们得到的结果文件是长介个样子的,之所以把数据存成这个样子是为了接下去结合{igraph}包整理成矩阵形式:

## 26155    TCGA-G2-A2EF
## 84069    TCGA-G2-A2EF
## 254173   TCGA-G2-A2EF
## 23013    TCGA-G2-A2EF
## 55672    TCGA-G2-A2EF

接下去就是我们熟悉的R脚本部分啦,程序挺简单的就不加以说明了,最终就得到了行是gene,列是样本的突变表达谱:

library(igraph)
library(dplyr)
setwd("PATH")
ensembl_entrez <- tbl_df(read.table("ensembl_entrez.txt", header = T, stringsAsFactors = F, sep = "\t"))
data <- tbl_df(read.table("mutation_data/result/result.txt", header = F, stringsAsFactors = F, sep = "\t"))
colnames(data) <- c("entrezgene", "sample_ID")
final_data <- data %>% left_join(ensembl_entrez, by = "entrezgene") %>% na.omit
need_data <- as.data.frame(final_data[, c(3, 2)])
mutation_net <- graph_from_edgelist(as.matrix(need_data, ncol = 2), directed = F)
mutation_adj <- as_adj(mutation_net)
sample_ID <- unique(unlist(need_data[,2]))
gene_ID <- unique(unlist(need_data[,1]))
final_adj <- mutation_adj[gene_ID, sample_ID]
write.table(matrix(c("gene", sample_ID), nrow = 1), "mutation_profile.txt", col.names = F, row.names = F, sep = "\t", quote = F, append = T)
write.table(as.matrix(final_adj, ncol = ncol(final_adj)), "mutation_profile.txt", col.names = F, row.names = T, sep = "\t", quote = F, append = T)

嗯,这样就结束了整个流程,不是很复杂,感谢Quan Fei同学,提供了新的思路:


library(dplyr)
setwd("PATH")
ensembl_entrez <- tbl_df(read.table("ensembl_entrez.txt", header = T, stringsAsFactors = F, sep = "\t"))
data <- tbl_df(read.table("mutation_data/result/result.txt", header = F, stringsAsFactors = F, sep = "\t"))
colnames(data) <- c("entrezgene", "sample_ID")
final_data <- data %>% left_join(ensembl_entrez, by = "entrezgene") %>% na.omit
final_adj <- table(unlist(final_data[,3]), unlist(final_data[,2]))
result <- t(apply(
  as.matrix(final_adj, ncol = ncol(final_adj)),
  1,
  function(x){
    x[x > 1] = 1
    return(x)
  }
))
rownames(result) <- rownames(final_adj)
colnames(result) <- colnames(final_adj)
write.table(result, "mutation_profile.txt", col.names = T, row.names = T, sep = "\t", quote = F)

posted @ 2018-03-10 18:43  PeRl`  阅读(756)  评论(0编辑  收藏  举报