Deep-learning augmented RNA-seq analysis of transcript splicing | Predicting alternative splicing with deep learning

• Predicting alternative splicing from DNA sequence and variants: GeneSplicer, MaxEntScan, dbscSNV, S-CAP, MMSplice, ClinVar, SpliceAI
• Predicting alternative splicing from RNA-seq data: MISO, rMATS, DARTS

Deep-learning augmented RNA-seq analysis of transcript splicing

Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples (refs. 7–10). It is worth reviewing how this earlier work predicted exon splicing patterns from cis sequence features.

MISO (Mixture of Isoforms) software documentation: MISO currently supports only Python 2. When installing it with conda, you also need to copy the miso_settings.txt file from the documentation.

rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features

a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing

During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.

During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.

This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deep-learning-augmented splicing analysis.

To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
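The labeling step can be sketched as a simple filter on the posterior probability reported by DARTS BHT(flat). This is a minimal sketch: the cutoffs 0.9 (differential) and 0.1 (unchanged) and the function name label_event are assumptions made for illustration, and the post_pr values are copied from the Darts_BHT output shown further down in these notes.

```python
def label_event(post_pr, hi=0.9, lo=0.1):
    """Return 1 (differential), 0 (unchanged), or None (ambiguous).

    The cutoffs hi and lo are illustrative assumptions.
    """
    if post_pr > hi:
        return 1
    if post_pr < lo:
        return 0
    return None


# post_pr values copied from the Darts_BHT output shown later in the notes
posteriors = {"1225": 0.4367, "15829": 0.8867, "20347": 1.0, "24817": 0.974}

labels = {event_id: label_event(p) for event_id, p in posteriors.items()}
# Ambiguous events are dropped instead of being forced into a class
kept = {event_id: y for event_id, y in labels.items() if y is not None}
print(kept)
```

Events whose posterior falls between the two cutoffs get no label, so only high-confidence calls would reach the DNN as training data.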

cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions

What exactly is the training data used by the DNN?

large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium

We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons

The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN

From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). At this point the features for each exon have been extracted and the labels are available, so the data can be used for training.
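The 90/10 split followed by fivefold cross-validation can be sketched in plain Python. This is only an illustration of the partitioning idea: the event IDs are hypothetical, the shuffling and fold assignment are assumptions, and the paper's Methods may partition the events differently (for example with stratification).

```python
import random


def split_train_test(events, test_frac=0.1, seed=0):
    """Shuffle the events and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(events)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def kfold(items, k=5):
    """Yield (train, validation) lists for each of k folds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


events = [f"event_{i}" for i in range(100)]  # hypothetical labeled events
train, test = split_train_test(events)
for fold_train, fold_val in kfold(train):
    print(len(fold_train), len(fold_val))
```

With 100 events this holds out 10 for testing and rotates five validation folds of 18 events over the remaining 90.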

We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.

incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).

Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
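How a prior and read counts combine can be illustrated with a small grid approximation. This is a simplified sketch, not the DARTS implementation: it assumes a binomial likelihood on length-adjusted read proportions, a two-component normal mixture prior on the PSI difference whose mixing weight prior_diff plays the role of the DNN prediction, and illustrative values for cutoff, sd_null and sd_alt.

```python
import math


def log_lik(inc, skp, psi, inc_len, skp_len):
    """Binomial log-likelihood of inclusion/skipping reads at a given PSI."""
    # Expected fraction of reads from the inclusion form, length-adjusted
    p = psi * inc_len / (psi * inc_len + (1 - psi) * skp_len)
    return inc * math.log(p) + skp * math.log(1 - p)


def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_diff(i1, s1, i2, s2, inc_len, skp_len, prior_diff=0.5,
                   cutoff=0.05, sd_null=0.02, sd_alt=0.3, n=200):
    """Approximate P(|psi2 - psi1| > cutoff | counts) on an n x n grid."""
    grid = [(k + 0.5) / n for k in range(n)]  # interior points of (0, 1)
    ll1 = [log_lik(i1, s1, p, inc_len, skp_len) for p in grid]
    ll2 = [log_lik(i2, s2, p, inc_len, skp_len) for p in grid]
    # Rescale by the maximum so exp() does not underflow everywhere
    w1 = [math.exp(v - max(ll1)) for v in ll1]
    w2 = [math.exp(v - max(ll2)) for v in ll2]
    num = den = 0.0
    for a, psi1 in enumerate(grid):
        for b, psi2 in enumerate(grid):
            delta = psi2 - psi1
            # Mixture prior: wide component = differential, narrow = unchanged
            prior = (prior_diff * normal_pdf(delta, sd_alt)
                     + (1 - prior_diff) * normal_pdf(delta, sd_null))
            weight = w1[a] * w2[b] * prior
            den += weight
            if abs(delta) > cutoff:
                num += weight
    return num / den


# High-coverage event (counts from the SNRPA1 row in these notes)
high = posterior_diff(4105, 118, 292, 54, 169, 90)
# Hypothetical low-coverage event under a skeptical and a confident prior
low_skeptic = posterior_diff(3, 1, 1, 3, 180, 90, prior_diff=0.1)
low_confident = posterior_diff(3, 1, 1, 3, 180, 90, prior_diff=0.9)
print(high, low_skeptic, low_confident)
```

With thousands of reads the likelihood dominates and the prior weight barely matters, whereas for the low-coverage event the posterior moves noticeably with prior_diff, which mirrors the behavior described in the simulation studies.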

Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/


The test_norep_data.txt file looks like this:

ID	GeneID	geneSymbol	chr	strand	exonStart_0base	exonEnd	upstreamES	upstreamEE	downstreamES	downstreamEE	ID	IJC_SAMPLE_1	SJC_SAMPLE_1	IJC_SAMPLE_2	SJC_SAMPLE_2	IncFormLen	SkipFormLen
82439	ENSG00000169045.17_1	HNRNPH1	chr5	-	179046269	179046408	179045145	179045324	179047892	179048036	82439	15236	319	6774	834	180	90
21374	ENSG00000131876.16_3	SNRPA1	chr15	-	101826418	101826498	101825930	101826006	101827112	101827215	21374	4105	118	292	54	169	90
32815	ENSG00000141027.20_3	NCOR1	chr17	-	15990485	15990659	15989712	15989756	15995176	15995232	32815	624	564	549	1261	180	90
43143	ENSG00000133731.9_2	IMPA1	chr8	-	82597997	82598198	82593732	82593819	82598486	82598518	43143	155	332	22	341	180	90
111671	ENSG00000100320.22_3	RBFOX2	chr22	-	36232366	36232486	36205826	36206051	36236238	36236460	111671	93	193	35	534	180	90
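To inspect such a count file outside of Darts_BHT, it can be read with the standard csv module. The sketch below is an assumption about how one might load it, not part of DARTS: the function name read_counts and the short output keys are hypothetical, and it assumes one integer per count column, as in this no-replicate file.

```python
import csv
import io

HEADER = ["ID", "GeneID", "geneSymbol", "chr", "strand", "exonStart_0base",
          "exonEnd", "upstreamES", "upstreamEE", "downstreamES",
          "downstreamEE", "ID", "IJC_SAMPLE_1", "SJC_SAMPLE_1",
          "IJC_SAMPLE_2", "SJC_SAMPLE_2", "IncFormLen", "SkipFormLen"]
ROW = ["21374", "ENSG00000131876.16_3", "SNRPA1", "chr15", "-", "101826418",
       "101826498", "101825930", "101826006", "101827112", "101827215",
       "21374", "4105", "118", "292", "54", "169", "90"]
# Build a tab-separated sample identical in layout to the file above
sample = "\n".join("\t".join(fields) for fields in (HEADER, ROW))


def read_counts(handle):
    """Map event ID -> junction counts and effective isoform lengths."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # "ID" occurs twice in the header; this dict keeps the last position.
    # Both ID columns carry the same value in the rows shown above.
    col = {name: i for i, name in enumerate(header)}
    events = {}
    for row in reader:
        if not row:
            continue
        events[row[col["ID"]]] = {
            "I1": int(row[col["IJC_SAMPLE_1"]]),
            "S1": int(row[col["SJC_SAMPLE_1"]]),
            "I2": int(row[col["IJC_SAMPLE_2"]]),
            "S2": int(row[col["SJC_SAMPLE_2"]]),
            "inc_len": int(row[col["IncFormLen"]]),
            "skp_len": int(row[col["SkipFormLen"]]),
        }
    return events


events = read_counts(io.StringIO(sample))
print(events["21374"])
```

For a real file, pass an open file handle instead of the io.StringIO sample. Files with replicates would need the comma-separated count fields split first.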


ID      I1      S1      I2      S2      inc_len skp_len psi1    psi2    delta.mle       post_pr
1225    160     0       169     6       180     90      1       0.934   -0.0663 0.4367
15829   52      58      12      41      180     90      0.31    0.128   -0.1819 0.8867
20347   1084    930     371     615     180     90      0.368   0.232   -0.1365 1
21374   4105    118     292     54      169     90      0.949   0.742   -0.2065 1
24817   177     275     263     741     143     90      0.288   0.183   -0.1057 0.974
32815   624     564     549     1261    180     90      0.356   0.179   -0.1774 1
43143   155     332     22      341     180     90      0.189   0.031   -0.158  1
46548   1685    4040    216     1752    180     90      0.173   0.058   -0.1145 1
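The psi1, psi2 and delta.mle columns are consistent with the usual length-normalized PSI estimate, in which each read count is divided by the effective length of its isoform before taking the inclusion fraction. A minimal sketch that reproduces the row for event 21374 (the function name psi is illustrative; that Darts_BHT computes the columns exactly this way is an inference from the numbers, not something stated in the notes):

```python
def psi(inc, skp, inc_len, skp_len):
    """Length-normalized percent spliced in for one condition."""
    inc_norm = inc / inc_len
    skp_norm = skp / skp_len
    total = inc_norm + skp_norm
    return inc_norm / total if total > 0 else float("nan")


# Counts and lengths from the row with ID 21374
psi1 = psi(4105, 118, 169, 90)
psi2 = psi(292, 54, 169, 90)
delta = psi2 - psi1
print(round(psi1, 3), round(psi2, 3), round(delta, 4))  # 0.949 0.742 -0.2065
```

The sign convention follows the table: delta.mle is psi2 minus psi1, so a negative value means lower inclusion in the second condition.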


Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS

Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS

The first file is the input feature file (*.h5) or the Darts_BHT output (*.txt):

ID      I1      S1      I2      S2      inc_len skp_len mu.mle  delta.mle       post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685      581     0       462     0       155     99      1       0       0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815      28      0       49      2       126     99      1       -0.0493827160493827     0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340      2       37      0       81      119     99      0.0430341230167355      -0.0430341230167355     0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901    11      2       49      4       176     99      0.755725190839695       0.117542135892979       0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005    80      750     64      738     133     99      0.0735580941766509      -0.0129207126090368     0


The second file is the Kallisto expression file:

thymus  adipose
RPS11   2678.83013      2531.887535
ERAL1   14.350975       13.709394
DDX27   18.2573 14.02368
DEK     32.463558       14.520312
PSMA6   102.332592      77.089475
TRIM56  4.519675        6.14762566667
TRIM71  0.082009        0.0153936666667
UPF2    7.150812        5.23628033333
FARS2   6.332831        7.291382
ALKBH8  3.056208        1.27043633333
ZNF579  5.13265 8.248575
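Note that the header of this table has one field fewer than the data rows, because the gene-name column is unnamed. The sketch below shows one way to load such a table and compare the two conditions. It is an exploration aid only: the function names and the pseudocount are assumptions, and the log2 fold change is not claimed to be the normalization Darts_DNN applies.

```python
import io
import math

# Two rows copied from the table above, tab-separated
sample = "\n".join([
    "thymus\tadipose",
    "RPS11\t2678.83013\t2531.887535",
    "DEK\t32.463558\t14.520312",
])


def read_expression(handle):
    """Return (conditions, {gene: {condition: value}})."""
    conditions = handle.readline().split()
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        gene, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(conditions):
            raise ValueError(f"unexpected number of columns for {gene}")
        table[gene] = dict(zip(conditions, values))
    return conditions, table


def log2_fold_change(row, cond_a, cond_b, pseudocount=1.0):
    """log2 of cond_b over cond_a, with a pseudocount to avoid log(0)."""
    return math.log2((row[cond_b] + pseudocount) / (row[cond_a] + pseudocount))


conditions, table = read_expression(io.StringIO(sample))
for gene, row in table.items():
    print(gene, round(log2_fold_change(row, *conditions), 3))
```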


The result file: the first column is the ID, the second is the true label, and the third is the predicted label:

ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140    0.947333        0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993    1.000000        0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213     0.937333        0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762
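A result file with Y_true and Y_pred columns can be scored with the area under the ROC curve once Y_true is turned into binary labels. The sketch below computes AUROC as the fraction of positive/negative pairs ranked correctly. The binarization cutoffs (0.9 and 0.1) are assumptions, and the values are hypothetical, because the rows shown above contain only high Y_true values and AUROC needs both classes.

```python
def binarize(y, hi=0.9, lo=0.1):
    """1 if y > hi, 0 if y < lo, else None (cutoffs are assumptions)."""
    if y > hi:
        return 1
    if y < lo:
        return 0
    return None


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count correctly ordered pairs; ties count as half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical (Y_true, Y_pred) pairs in the same format as the result file
y_true = [1.0, 0.947333, 0.02, 0.0, 0.5]
y_pred = [0.45, 0.30, 0.07, 0.35, 0.60]

# Drop ambiguous events before scoring
pairs = [(binarize(t), p) for t, p in zip(y_true, y_pred)]
pairs = [(y, p) for y, p in pairs if y is not None]
score = auroc([y for y, _ in pairs], [p for _, p in pairs])
print(score)  # 0.75
```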


The Expanding Landscape of Alternative Splicing Variation in Human Populations.

Gene expression inference with deep learning | Inferring gene expression with deep learning

uci-cbcl/D-GEX - github

LINCS L1000 data

posted @ 2019-08-29 21:39  Life·Intelligence