lncRNA定量分析流程 | Smart-seq单细胞数据挖掘

一直没时间碰这部分的内容,一是不擅长,二是不想摊子铺得太大。

现在是快要毕业了,主要的分析数据也发了,这部分如果再不做,马上别人拿到数据就可以分析了。

还有就是要清理集群,内存不够了,主要的分析做完了就可以给数据存档了。

 

最近看了一篇NC的lncRNA的分析文章,非常的有新意,当然也是结合了疾病模型才发这么高的,值得借鉴。

2020 - In vivo functional analysis of non-conserved human lncRNAs associated with cardiometabolic traits

 

直接借鉴里面的分析pipeline

 

1. 构建index

这里用的是lncRNAKB这个数据库,省了不少事,既然NC都用了,那我们也用这个吧。http://psychiatry.som.jhmi.edu/lncrnakb/

直接下载fasta和gtf文件,然后构建索引。

# hisat2提供了两个python脚本将GTF文件转换成hisat2-build能使用的文件
extract_exons.py lncRNAKB_hg38_v7.gtf > genome.exon
extract_splice_sites.py lncRNAKB_hg38_v7.gtf > genome.ss

# Error: Reference sequence has more than 2^32-1 characters!  Please build a large index by passing the --large-index option to bowtie2-build
hisat2-build --large-index -p 10 lncRNAKB_hg38_v7.transcript.primary.assembly.fa --ss genome.ss --exon genome.exon hg38_lncRNA_index

ga搞了好久,到最后一个gene count都没有,回头才发现问题,基因组用错了,我下载的是转录本数据(神奇的是居然能跑完整个流程)。因为注释就是基于GRCh38的,直接下个完整版的就行。

基因组下载也有讲究,不要下载GCF_000001405.39_GRCh38.p13_genomic.fna.gz,这是contig或scaffold(primary genome assembly),下载打开一看你会找不到染色体的。

human基因组fasta和gtf首选genecode网站:https://www.gencodegenes.org/human/

其次:http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz【这个附加的染色体跟lncRNADB对应上了】

 

这一步可以不需要gtf文件,主需要genome的fasta就行,可以跟lncRNA没有任何关系。【10个核需要20分钟】

hisat2-build -p 10 ucsc.hg38.fa ucsc_hg38_index

  

参考:

 

2. fastq比对

执行细节

这里需要一点小的shell脚本知识,读取fastq的文件,读取里面的sample name、fq1、fq2文件,方便批量处理和应用拓展。

比对细节

最好用conda更新一下hisat2到最新稳定版

另一点就是fastq的质量的编码,是33还是64,我这里是33,用64会报错。 

 

3. 定量分析

featurecount通过subread安装 

conda install -y subread
#!/bin/bash

IFS=,

# parameters
out_dir="reads_count"
cpu=10
ref="/home/lizhixin/project/Data_center/references/lncRNA/ucsc_hg38_index"
gtf="/home/lizhixin/project/Data_center/references/lncRNA/lncRNAKB_hg38_v7.gtf"

# cat all.sample.csv | awk 'NR<2' | while read sample_name fq1 fq2
cat all.sample.csv | while read sample_name fq1 fq2
do
        echo $sample_name $fq1 $fq2 &&\
        hisat2 --phred33 --fast -x $ref -1 $fq1 -2 $fq2 2>$out_dir/${sample_name}.Map2GenomeStat.xls | samtools sort - -o $out_dir/${sample_name}.bam -@ $cpu &&\
        featureCounts -T $cpu -F GTF -M -t exon -g gene_id -a $gtf -o $out_dir/${sample_name}.counts $out_dir/${sample_name}.bam &&\
        cut -f1,7 $out_dir/${sample_name}.counts | grep -v '^#' > $out_dir/${sample_name}.gene.counts &&\
        rm $out_dir/${sample_name}.bam $out_dir/${sample_name}.counts
done

  

4. 检查未完成任务&合并文件成matrix

一次可能跑不完,或者有出错的需要继续跑,写个R脚本来生成需要重新跑的文件。

options(stringsAsFactors = F)
tmp.files <- list.files("reads_count", pattern = "gene.counts")
all.samples <- read.csv("all.sample.csv", header=F)
tmp.files <- stringr::str_replace_all(tmp.files, ".gene.counts", "")
remain.sample <- subset(all.samples, !V1 %in% tmp.files)
write.table(remain.sample, file = "remain.sample.csv", sep=",", quote=F, row.names=F, col.names=F)

  

把零散的文件合并成一个matrix文件。【之前使用的方法真是太幼稚了,居然还用Python写了很多行代码,其实三行shell命令就能实现,这么普遍的功能,别人肯定早就实现了,你google一下就有现成的代码!!!】

ls -1  reads_count/*.gene.counts | parallel 'cat {} | sed '1d' | cut -f2 {} > {/.}_clean.txt'
ls -1  reads_count/*.gene.counts | head -1 | xargs cut -f1 > genes.txt
paste genes.txt *_clean.txt > merged.lncRNA.count.matrix.txt

6000个文件一次merge不完

# paste: Too many open files
ls -1 *_clean.txt | split -l 1000 -d - lists
for list in lists*; do paste $(cat $list) > merge${list##lists}; done
paste merge0* > merge.txt
paste genes.txt merge.txt > merged.lncRNA.count.matrix.txt

  

5. 其他

QC文件的收集

cellName的匹配 

 

 


 

参考:Single-cell Long Non-coding RNA Landscape of T Cells in Human Cancer Immunity

 

Read mapping and transcript assembling
Clean reads from each T cell were mapped to the human reference genome (version hg38/GRCh38) using STAR aligner (v2.7.1) with the twopassMode set as Basic.
The bam files of T cells from each cell-type of each patient were merged using SAMtools merge.
StringTie (v2.0.3) was used to assemble transcripts based on genomic read alignments. Assembled transcripts of all cell-types across all patients were merged together using the Cuffmerge utility of the Cufflinks package.

Comparison with reference gene annotation
For reference gene annotation, lncRNA genes were collected from RefLnc [38], and other genes were collected from GENCODE v31 [59]. According to the “class code” information outputted by Cuffcompare, the merged assembly was classified into four categories by comparison with the reference gene annotation, including known coding genes, known lncRNA genes, potentially novel genes (class code is “i, x, u”), and others.

Identification of novel lncRNAs
Based on the potentially novel gene catalog derived from single-cell data, we developed a custom pipeline for the identification of reliable novel lncRNAs that included the following steps: (1) transcripts that were no shorter than 200 nt and had more than one exon were selected for downstream analysis (for intergenic transcripts, at least 1 kb away from known protein-coding genes); (2) coding potential calculator (CPC) [36] and coding noncoding index (CNCI) [37] software were used to evaluate the protein-coding potential of transcripts, and transcripts that were reported to lack coding potential by both CPC and CNCI were regarded as candidate noncoding transcripts; (3) The remaining transcripts that were assembled and had the same intron chain of at least two cell-types were retained as the final novel lncRNA catalog. The final lncRNA catalog was obtained by combining the reference lncRNA and novel lncRNA genes directly. The UCSC liftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver?hgsid=806106955_h2xhcK2iPRI7SiMkxkB41I2mwF9O) was used to identify the orthologous locations of human novel lncRNAs in the mouse genome and in primates such as chimpanzee and gorilla, with the parameters: Minimum ratio of bases that must remap = 0.1 and Min ratio of alignment blocks or exons that must map = 0.5.

Experimental validation of novel lncRNAs

Data normalization
We calculated the read counts and transcripts per million (TPM) values using pseudoalignment of scRNA-seq reads to both protein-coding and lncRNA transcriptomes, as implemented in Kallisto (v0.46.0) [60] with default parameters, and summarized expression levels from the transcript level to the gene level.

 

给毕业论文加点数据吧!

 

 


 

见文章:Benchmark of lncRNA Quantification for RNA-Seq of Cancer Samples

Overall, 10-16% of lncRNAs can be detected in the samples, with antisense and lincRNAs the two most abundant categories.

lncRNA的两大类要知道。

lncRNA的表达特点要知道,组织特异性、低表达性。

其实,如果没做链特异性的RNA-seq,你是无法评估antisense lncRNA的表达量的。

 

posted @ 2018-03-27 15:42  Life·Intelligence  阅读(1924)  评论(0编辑  收藏  举报
TOP