GEO 数据挖掘-数据获得
时间:2022-07-25
本文章向大家介绍GEO 数据挖掘-数据获得,主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。
GEO 数据挖掘-数据获得
1. 概述
NCBI Gene Expression Omnibus(GEO)是各种高通量实验数据的公共存储库,这些数据包括测量mRNA、基因组DNA和蛋白质丰度的单通道和双通道微阵列实验,以及非阵列技术,如基因表达序列分析(SAGE)、质谱蛋白质组数据和高通量测序数据。相比较TCGA数据库,因为数据是用户上传,所以更新较快
需要知道四个单词 1. GEO Platform (GPL) 芯片平台
- GEO Sample (GSM) 样本ID号
- GEO Series (GSE) study的ID号
- GEO Dataset (GDS) 数据集的ID号
2. 配置包
# 设置镜像,用于下载
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
# 安装
# install.packages("BiocManager")
# BiocManager::install("GEOquery")
3. 下载数据
# 加载
library(GEOquery)
#使用getGEO函数获得基因信息
gds <- getGEO("GDS507")# 下载
# 同时支持从本地获得
# gds <- getGEO(filename=system.file("路径",package="GEOquery"))
# 下载gsm数据
gsm <- getGEO("GSM11805")
4. GEOquery数据结构
GEOquery数据结构实际上有两种形式。第一种,包括GDS、GPL和GSM,第二种是GSE是,由GSM和GPL对象组合而成的复合数据类型。
4.1 GDS, GSM, 和 GPL
# 通过meta查看gsm文件,显示样本的信息
head(Meta(gsm))
## $channel_count
## [1] "1"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
# 查看前5行数值
Table(gsm)[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
# 查看列名
Columns(gsm)
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
对于gpl的格式和gsm是一样的,另外GDS的信息要多于上述两种
Columns(gds)[1:5,1:3]
## sample disease.state individual
## 1 GSM11815 RCC 035
## 2 GSM11832 RCC 023
## 3 GSM12069 RCC 001
## 4 GSM12083 RCC 005
## 5 GSM12101 RCC 011
4.2 GSE格式
GSE格式较为混乱,它可以表示任意平台的样本,gse有一个元数据部分,它没有DataTable。而是包含两个列表,可以使用GPLList和GSMList方法访问。
gse <- getGEO("GSE781",GSEMatrix=FALSE)
# 查看meta
head(Meta(gse))
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
##
## $contact_fax
## [1] "617-414-1646"
查看样本id
names(GSMList(gse))
## [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
## [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
获取第一个gsm样本
GSMList(gse)[[1]]
## An object of class "GSM"
## channel_count
## [1] "1"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "mlenburg@bu.edu"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM11nnn/GSM11805/suppl/GSM11805.CEL.gz"
## taxid_ch1
## [1] "9606"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
## 22278 more rows ...
查看平台信息
names(GPLList(gse))
## [1] "GPL96" "GPL97"
5. 转换为bioconductor ExpressionSets 和 limma MALists
5.1 获得GSE Series Matrix数据
# 获得数据集,格式为gse
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
# 显示具体信息
show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
## varLabels: title geo_accession ... data_row_count (30 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 12600 (12600 total)
## fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 16230383
## Annotation: GPL1977
5.2 将GDS转换为ExpressionSet
eset <- GDS2eSet(gds,do.log2=TRUE)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
## varLabels: sample disease.state individual description
## varMetadata: labelDescription
## featureData
## featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
## fvarLabels: ID Gene title ... GO:Component ID (21 total)
## fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 14641932
## Annotation:
此时eset就是一个表达矩阵
5.3 转换 GDS 为 MAList
Meta(gds)$platform
## [1] "GPL97"
# 选择
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
MA <- GDS2MA(gds,GPL=gpl)
class(MA)
## [1] "MAList"
## attr(,"package")
## [1] "limma"
5.4 GSE转换为 ExpressionSet
查看平台
gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)
## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"
通过fileter,筛选GPL96平台的数据
gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
## [1] 17
查看测序数据
Table(gsmlist[[1]])[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
查看描述
Columns(gsmlist[[1]])[1:5,]
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## NA <NA>
## NA.1 <NA>
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## NA <NA>
## NA.1 <NA>
# 探针
probesets <- Table(GPLList(gse)[[1]])$ID
# 提取每个样本的value
# wo ye kan bu dong 大致是通过lapply函数对每一个
# gsm进行匹配,然后返回value,最后cbind
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
# 转换为数字
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,1:5]
## GSM11805 GSM11814 GSM11823 GSM11830 GSM12067
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376
## [2,] 5.749534 7.908092 7.093814 7.514122 7.901470
## [3,] 7.066089 7.750205 7.244126 7.962896 7.337176
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568
## [5,] 6.195741 6.061776 6.565293 6.583459 6.877744
对data进一步处理
require(Biobase)
# 行列命名
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
# 转换为data.frame
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
## varLabels: samples
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
获取平台的信息
# gpl97 <- getGEO('GPL97')
# Meta(gpl97)$title
通过上述代码可以获得这个平台的所有信息,并通过id等查看,下载比较费时,不做展示
结束语
关于GEO数据挖掘的第一步,获得数据GEOquery的学习已经完成,没有学过关于测序的知识,这些信息获得之后还是懵逼的,2020-7-10更新 love&peace
- dedecms无法登录提示本页面禁止返回
- 前台开发从头说起:理解css盒模型
- 两个js冲突怎么解决?试试这四个方法
- dedecms如何去除后台登陆验证码
- DEDECMS自定义表单unix时间戳转换成常规时间方法及增加表单添加时间方法
- dedecms自定义表单发布成功后返回当前页面
- 前端构建工具 Gulp.js 上手实例
- dedecms数据库内容替换安全确认码不显示怎么解决
- 利用宏避免发送确认邮件时忘记添加附件
- dateDiff在Objective-C中的实现
- 禁用Firefox自带的元素查看工具
- 容易被误解的overflow:hidden
- dedecms调用全站相关文章怎么设置
- dedecms自定义表单提交成功后提示信息修改和跳转链接修改
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- (火狐)Selenium WebDriver测试 NotADirectoryError: [WinError 267] 目录名称无效。
- 浅析Android高斯模糊实现方案
- Android 自定义验证码输入框的实例代码(支持粘贴连续性)
- _countof和sizeof
- Flutter适配深色模式的方法(DarkMode)
- RecyclerView+SnapHelper实现无限循环筛选控件
- 详解Android 8.1.0 Service 中 弹出 Dialog的方法
- 短信收发类无错版JustinIO.cs
- Android快速实现无预览拍照功能
- RecyclerView+PagerSnapHelper实现抖音首页翻页的Viewpager效果
- android中使用react-native设置应用启动页过程详解
- 面试官问我单例模式真的安全吗?我懵逼了
- 详解Android使用CoordinatorLayout+AppBarLayout实现拉伸顶部图片功能
- Android自定义控制条效果
- Android使用MediaPlayer和TextureView实现视频无缝切换