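Determining strand specificity with RSeQC

For strand-specific RNA-seq data, we must know which library preparation protocol was used before running any downstream quantification, because the quantification tools need the matching strand option.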

stringtie:

--rf    Assumes a stranded library fr-firststrand.
--fr    Assumes a stranded library fr-secondstrand.

kallisto:

--fr-stranded runs kallisto in strand specific mode, only fragments where the first read in the pair pseudoaligns to the forward strand of a transcript are processed. If a fragment pseudoaligns to multiple transcripts, only the transcripts that are consistent with the first read are kept.

--rf-stranded same as --fr-stranded but the first read maps to the reverse strand of a transcript.
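The flags quoted above can be collected into a small lookup table for quick reference. This is only a sketch: `STRAND_FLAGS` and `strand_flag` are names invented here for illustration, and the empty string for unstranded libraries assumes that neither tool needs a strand flag in that case.

```python
# Hypothetical lookup table: library type -> strand flag for each tool.
# The stranded entries follow the stringtie and kallisto help text quoted above.
STRAND_FLAGS = {
    "fr-firststrand": {"stringtie": "--rf", "kallisto": "--rf-stranded"},
    "fr-secondstrand": {"stringtie": "--fr", "kallisto": "--fr-stranded"},
    # Assumption: unstranded libraries are run without any strand flag.
    "unstranded": {"stringtie": "", "kallisto": ""},
}


def strand_flag(library_type: str, tool: str) -> str:
    """Return the strand flag for a tool; raises KeyError for unknown names."""
    return STRAND_FLAGS[library_type][tool]


if __name__ == "__main__":
    for lib in STRAND_FLAGS:
        print(lib,
              strand_flag(lib, "stringtie") or "(none)",
              strand_flag(lib, "kallisto") or "(none)")
```

Keeping the mapping in one place makes it harder to pass `--rf` to one tool and the opposite orientation to the other by accident.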

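The most commonly used protocol today is fr-firststrand, i.e. the dUTP-based library preparation. To be on the safe side, however, we can verify this with a tool from RSeQC. RSeQC is a package published in Bioinformatics in 2012 that bundles a range of RNA-seq quality-control utilities.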

1. Installation

# Install with pip
pip3 install RSeQC

# Install from source
tar zxf RSeQC-VERSION.tar.gz

cd RSeQC-VERSION

# Run 'python setup.py install --help' to see the options
python setup.py install        # Note: this requires root privileges
# or
python setup.py install --root=/home/user/XXX/         # Installs RSeQC to a user-specified location; does NOT require root privileges

#This is only an example. Change path according to your system configuration
export PYTHONPATH=/home/user/lib/python2.7/site-packages:$PYTHONPATH

#This is only an example. Change path according to your system configuration
export PATH=/home/user/bin:$PATH
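After installation, it can be useful to confirm that the script is actually reachable through `PATH`. The sketch below uses only the Python standard library; `find_tool` is a hypothetical helper name, and it only checks that an executable with that name exists, not that it runs correctly.

```python
import shutil
from typing import Optional


def find_tool(name: str = "infer_experiment.py") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("infer_experiment.py not found; check PATH and the install location")
    else:
        print("found:", path)
```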

2. infer_experiment.py

Single-end data:

infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam

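#Output:
This is SingleEnd Data
Fraction of reads failed to determine: 0.0170
Fraction of reads explained by "++,--": 0.9669
Fraction of reads explained by "+-,-+": 0.0161

The "++,--" fraction is far larger than the other one, so this is strand-specific data. "++,--" means that a read mapped to the + strand comes from a gene on the + strand, and a read mapped to the - strand comes from a gene on the - strand.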

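[Figure: example of non-strand-specific single-end data]

If the two fractions are close to 1:1, the data is non-strand-specific; if one is far larger than the other, the data is strand-specific.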
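The rule of thumb above can be written down as a short function. This is a minimal sketch, not something RSeQC provides: `classify_fractions` is a hypothetical name, and the `min_dominant` cutoff of 0.8 is an arbitrary assumption that should be adjusted to your own data.

```python
def classify_fractions(frac_a: float, frac_b: float, min_dominant: float = 0.8) -> str:
    """Classify the two "explained by" fractions from infer_experiment.py.

    min_dominant is an assumed cutoff (not an RSeQC value): the share of the
    explained reads that one pattern must reach to call the data strand-specific.
    """
    explained = frac_a + frac_b
    if explained <= 0:
        raise ValueError("no reads were explained by either pattern")
    share_a = frac_a / explained
    if share_a >= min_dominant:
        return "strand-specific (first pattern)"
    if share_a <= 1 - min_dominant:
        return "strand-specific (second pattern)"
    return "non-strand-specific or ambiguous"


if __name__ == "__main__":
    # Fractions taken from the single-end example output above.
    print(classify_fractions(0.9669, 0.0161))
```

Normalising by the sum of the two fractions keeps the "failed to determine" reads from distorting the comparison.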

Paired-end data:

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam

#Output:

This is PairEnd Data
Fraction of reads failed to determine: 0.0072
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487

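This is clearly strand-specific, and more precisely fr-secondstrand. "1++" means read 1 maps to the + strand and its gene is also on the + strand, while "2+-" means read 2 maps to the + strand and its gene is on the - strand. This corresponds to --fr-stranded in kallisto and --fr in stringtie.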

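Libraries of this type are relatively rare nowadays. The far more common case is the one where "1+-,1-+,2++,2--" dominates, i.e. read 1 maps to the + strand while its gene is actually on the - strand (reverse). This is "fr-firststrand", which corresponds to --rf in stringtie and --rf-stranded in kallisto.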
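The pattern strings are compact, so it can help to spell them out. The sketch below decodes a pattern into (read, read strand, gene strand) tuples, following the reading given above: for paired-end data the first character is the read number, the second is the strand the read maps to, and the third is the strand of the gene. `decode_pattern` is a hypothetical helper, not part of RSeQC.

```python
from typing import List, Tuple


def decode_pattern(pattern: str) -> List[Tuple[str, str, str]]:
    """Split e.g. "1+-,1-+,2++,2--" into (read, read strand, gene strand) tuples.

    Single-end patterns such as "++,--" carry no read number; the read field
    is then reported as "single".
    """
    decoded = []
    for item in pattern.split(","):
        item = item.strip()
        if len(item) == 3:
            read, read_strand, gene_strand = "read" + item[0], item[1], item[2]
        elif len(item) == 2:
            read, read_strand, gene_strand = "single", item[0], item[1]
        else:
            raise ValueError(f"unexpected pattern element: {item!r}")
        decoded.append((read, read_strand, gene_strand))
    return decoded


if __name__ == "__main__":
    for read, read_strand, gene_strand in decode_pattern("1+-,1-+,2++,2--"):
        print(f"{read}: mapped to {read_strand}, gene on {gene_strand}")
```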

Likewise, when both fractions are close to 0.5, the library is non-strand-specific:

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam

#Output:

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
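When many samples need checking, the text output can be parsed rather than read by eye. The following is a minimal sketch that assumes the output looks exactly like the examples shown here; other RSeQC versions may format it differently. `parse_infer_experiment` and the `"failed"` key are hypothetical names chosen for this example.

```python
import re
from typing import Dict

# Matches lines like: Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
_EXPLAINED = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')
_FAILED = re.compile(r"Fraction of reads failed to determine:\s*([0-9.]+)")


def parse_infer_experiment(text: str) -> Dict[str, float]:
    """Return {pattern: fraction}, plus a "failed" entry when that line is present."""
    result = {pattern: float(value) for pattern, value in _EXPLAINED.findall(text)}
    failed = _FAILED.search(text)
    if failed:
        result["failed"] = float(failed.group(1))
    return result


if __name__ == "__main__":
    sample = """This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
"""
    for key, value in parse_infer_experiment(sample).items():
        print(f"{key}\t{value}")
```

The text passed in would typically be the captured standard output of `infer_experiment.py -r <bed> -i <bam>`.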

The RefSeq BED file needed for this check can be downloaded from the RSeQC documentation page.

References: