TCGA数据下载分析(1-1):RTCGA包gene获取表达值并可视化

时间:2022-06-10
本文章向大家介绍TCGA数据下载分析(1-1):RTCGA包gene获取表达值并可视化,主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。
写在前面:
  • 方法:下载处理TCGA数据的R包很多,数据来源也不一样,这一部分开始对几个包分别进行使用,写出心得。
  • 结果最终想得到的是用其中两个包
  • 这部分,RTCGA包
  • 参考:作者文档这个以及生信技能树
---------------------------------------------------------------------------

1 安装并加载包

# Load the bioconductor installer. 
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")
# Install the clinical and mRNA gene expression data packages
biocLite("RTCGA.clinical") ## 14Mb
biocLite('RTCGA.rnaseq') ##  (612.6 MB)
biocLite("RTCGA.mRNA") ##  (85.0 MB)
biocLite('RTCGA.mutations')  ## (103.8 MB)
library(RTCGA.clinical) 
library(RTCGA.mRNA)
library(RTCGA.rnaseq)
library(RTCGA.mutations)
> library(RTCGA)
Welcome to the RTCGA (version: 1.8.0).
#查看BRCA的内容
> checkTCGA('DataSets', 'LIHC')
   Size                                                                                                                                     Name
1   61K                                                                                   LIHC.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
2  932K                                                                                        LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz
3  1.6G LIHC.Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data.Level_3.2016012800.0.0.tar.gz
4  1.5M                  LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2016012800.0.0.tar.gz
5   22M               LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data.Level_3.2016012800.0.0.tar.gz
6  255K                LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz
7   78K   LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz.bak.20160128
8   83M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__exon_expression__data.Level_3.2016012800.0.0.tar.gz
9  8.7M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz
10 8.0M                LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__splice_junction_expression__data.Level_3.2016012800.0.0.tar.gz
11 102M                            LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz
12  32M                 LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz
13 281M                         LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms__data.Level_3.2016012800.0.0.tar.gz
14  80M              LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data.Level_3.2016012800.0.0.tar.gz
15 905M                   LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data.Level_3.2016012800.0.0.tar.gz
16  70M               LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data.Level_3.2016012800.0.0.tar.gz
17 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg.Level_3.2016012800.0.0.tar.gz
18 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
19 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg.Level_3.2016012800.0.0.tar.gz
20 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
21 210M                                                                                LIHC.Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz
22 2.0M                                                                               LIHC.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
23 416M                                                                            LIHC.Mutation_Packager_Coverage.Level_3.2016012800.0.0.tar.gz
24  25M                                                                     LIHC.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0.tar.gz
25  47M                                                                 LIHC.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz
26 3.8M                                                                           LIHC.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
27 765M                                                                        LIHC.Mutation_Packager_Raw_Coverage.Level_3.2016012800.0.0.tar.gz
28 585K                                                                                 LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz
29 275K                                                                    LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz.bak.20160128
30 312M                                                                                    LIHC.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
31 3.4M                                                                              LIHC.miRseq_Mature_Preprocess.Level_3.2016012800.0.0.tar.gz
32 2.8M                                                                                     LIHC.miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz
关于这些数据
  • mRNA是芯片数据
  • ranseq是测序数据 具体参考这里这里
  • miRSeq is micro-RNA seq. Micro RNAs are a class of non protein coding RNAs that have regulatory functions (they typically bind to the 3`UTR of coding mRNAs and regulate that way)
  • mRNA, in this situation, refers to a cDNA microarray, where pre-designed probes on the microarray surface will bind to known target mRNAs, which may be coding or non-coding
  • mRNASeq is what most will loosely refer to as 'RNA-seq'. Most protocols will capture all classes of RNA species that have poly-adenylated (poly(A)) tails. RNAs that don't have these tails include ribsomal RNAs and many enhancer RNAs, but there are always exceptions to these rules.

推荐(建议参考)

miRSeq

  • illuminahiseq_mirnaseq-miR_gene_expression - normalised micro-RNAseq counts over each micro-RNA
  • illuminahiseq_mirnaseq-miR_isoform_expression - nomalised micro-RNAseq counts over each splice isoform of each micro-RNA mRNA
  • agilent4502a_07_3-unc_lowess_normalization_gene_level - LOWESS-normalised cDNA expression values over each gene (data from University of North Carolina) mRNASeq
  • illuminahiseq_rnaseq-gene_expression - normalised RNAseq counts over each gene
  • illuminahiseq_rnaseq-exon_expression - normalised RNAseq counts over each exon of each gene
> checkTCGA('Dates')
 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21"
[11] "2012-04-12" "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25"
[21] "2012-09-13" "2012-10-04" "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16"
[31] "2013-02-03" "2013-02-22" "2013-03-09" "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23"
[41] "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10" "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16"
[51] "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17" "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01"
[61] "2015-08-21" "2015-11-01" "2016-01-28"

2 下载某肿瘤中Datasets中的某类数据

可以用前述的checkTCGA查看类型,然后针对性下载,用法如下: downloadTCGA(cancerTypes, dataSet = "Merge_Clinical.Level_1", destDir, date = NULL, untarFile = TRUE, removeTar = TRUE, allDataSets = FALSE)

  • dataSet在checkTCGA中查看,可以看出数据类型名字很长,所以这里只要部分(连续)匹配即可
  • destDir在当前工作目录下创建一个新文件
setwd("E:/00TCGA/data/LIHC")
#downloading the miRNA DATA
dir.create('miRNA')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'miR_gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])

#downloading the ranseq data
dir.create('rnaseq')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'Level_3__gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])

3 获取某基因在任意癌症中的mRNA表达数据并可视化(ggplot2,ggpubrboxplotTCGA

3.0 获取mRNA表达数据

获取EZH2,PTEN,HGFAC,FYN,TIGD1等5个gene在BRCA,OV中的mRNA表达数据(LIHC没有mRNA)
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode          dataset     EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>                        <chr>      <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA -1.79   1.36  -2.31 -0.798 -0.86 
 2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA -1.57   0.428 -2.49 -0.532 -1.06 
 3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA -2.49   1.31  -2.65  0.007 -1.06 
 4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA -2.14   0.810 -2.58 -0.466 -0.714
 5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA  0.529  0.251 -2.95  0.960 -0.664
 6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA -1.44   1.31  -3.26 -0.622 -0.490
 7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -0.426 -0.237 -2.71  0.208  0.759
 8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -0.579 -1.24  -2.45  0.563  0.022
 9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA -1.16   1.21  -2.04 -0.616 -0.287
10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.894  0.288 -2.31 -0.314  1.33 
# ... with 1,295 more rows

3.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot

expr<-expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
expr1<-expr[,-1]
expr2<-gather(expr1, key ="mRNA", value="value", -dataset)

ggplot(expr2, aes(y=value,
             x=reorder(dataset, value, mean),
             fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(mRNA~.)+
  theme(legend.position = "top")

boxplot如下

5genes in three kinds of cancers.jpeg

3.2 ggpubr绘制指定基因在不同肿瘤中的表达boxplot并进行统计学分析作图

这部分参考jimmy 查看样本量

table(expr$dataset)
> nb_samples
BRCA.mRNA LUSC.mRNA   OV.mRNA 
      590       154       561

bcr_patient_barcode这列改名,以便下一步可视化作图

expr$dataset <- gsub(pattern = ".mRNA", replacement = "",  expr$dataset)
expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
expr
> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode dataset   EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>               <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 BRCA1               BRCA    -1.79   1.36  -2.31 -0.798 -0.86 
 2 BRCA2               BRCA    -1.57   0.428 -2.49 -0.532 -1.06 
 3 BRCA3               BRCA    -2.49   1.31  -2.65  0.007 -1.06 
 4 BRCA4               BRCA    -2.14   0.810 -2.58 -0.466 -0.714
 5 BRCA5               BRCA     0.529  0.251 -2.95  0.960 -0.664
 6 BRCA6               BRCA    -1.44   1.31  -3.26 -0.622 -0.490
 7 BRCA7               BRCA    -0.426 -0.237 -2.71  0.208  0.759
 8 BRCA8               BRCA    -0.579 -1.24  -2.45  0.563  0.022
 9 BRCA9               BRCA    -1.16   1.21  -2.04 -0.616 -0.287
10 BRCA10              BRCA    -0.894  0.288 -2.31 -0.314  1.33

绘制EZH2表达值的boxplot图

library(ggpubr)
ggboxplot(expr, x = "dataset", y = "EZH2",
          title = "EZH2", ylab = "Expression",
          color = "dataset", palette = "jco")

ezh2.jpeg

4 获取某基因在任意癌症中的rnaseq表达数据并可视化(ggplot2,ggpubr和boxplotTCGA)

  • ggplot2和ggpubr的用法与第3部分几乎一致,而boxplotTCGA无法直接获取mRNA表达数据,只有rnaseq,另外还有mutation等数据。
  • 不同的是,不仅需要gene symbol还要entrez的ID,如MET|4233

4.0 获取rnaseq表达数据

  • 获取EZH2,PTEN,HGFAC,FYN等4个gene在BRCA,OV和LUSC中的rnaseq表达数据
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
> expr
# A tibble: 2,071 x 6
   bcr_patient_barcode          dataset     `EZH2|2146` `PTEN|5728` `HGFAC|3083` `FYN|2534`
   <chr>                        <chr>             <dbl>       <dbl>        <dbl>      <dbl>
 1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq        487.       1724.        4.83       247. 
 2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq        941.       1107.        9.79       523. 
 3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq        492.       1479.        7.25       814. 
 4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq        334.       1877.        9.10       562. 
 5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq        297.       1740.        1.70       549. 
 6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq        238.       1597.        7.63       689. 
 7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq        258.       1374.        0.815      853. 
 8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq        253.       2181.        3.14        84.7
 9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq        392.       2529.        0.901      740. 
10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq        191.       1876.        0          522. 
# ... with 2,061 more rows

4.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot(rnaseq)

#rnaseq ggplot2
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))

expr1<-expr[,-1]
expr2<-gather(expr1, key ="rnaseq", value="value", -dataset)

ggplot(expr2, aes(y=value,
                  x=reorder(dataset, value, mean),
                  fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(rnaseq~.)+
  theme(legend.position = "top")

rnaseq ggplot2.jpeg

4.2 ggpubr

查看样本量

nb_samples <- table(expr$dataset)
nb_samples
> nb_samples
BRCA.rnaseq LUSC.rnaseq   OV.rnaseq 
       1212         552         307 
ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",
          title = "ESR1|2099", ylab = "Expression",
          color = "dataset", palette = "jco")

PTEN_5728.jpeg

4.3 boxplotTCGA

具体参考RTCGA的文档

expressionsTCGA(LIHC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,extra
                extract.cols = "MET|4233") %>%
  rename(cohort = dataset,
  MET = `MET|4233`) %>%
  #cancer samples
  filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
  ACC_BLCA_BRCA_OV.rnaseq
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")

MET RNASEQ.jpeg

boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)")

Rplot.jpeg

后记-----------------------------------------

这部分只是用RTCGA演示了如何下载数据;mRNA和rnaseq表达值的plot。