【高能】用PyMC3进行贝叶斯统计分析(代码+实例)
时间:2022-04-25
本文章向大家介绍【高能】用PyMC3进行贝叶斯统计分析(代码+实例),主要内容包括例子1:抛硬币问题、说明、结果、模式、例子2:化学活性问题、数据、参数问题、说明、数据、MCMC Inference Button (TM)、结果、问题类型2:实验组之间的比较、例子1:药物IQ问题、说明、MCMC Inference Button (TM)、结果、例子2:手机消毒问题、数据、说明、MCMC Inference Button (TM)、结果、基本概念、基础应用、原理机制和需要注意的事项等,并结合实例形式分析了其使用技巧,希望通过本文能帮助到大家理解应用这部分内容。
问题类型1:参数估计
真实值是否等于X?
给出数据,对于参数,可能的值的概率分布是多少?
例子1:抛硬币问题
硬币扔了n次,正面朝上是h次。
参数问题
想知道 p 的可能性。给定 n 扔的次数和 h 正面朝上次数,p 的值很可能接近 0.5,比如说在 [0.48,0.52]?
说明
- 参数的先验信念:p∼Uniform(0,1)
- 似然函数:data∼Bernoulli(p)
import pymc3 as pmimport numpy.random as nprimport numpy as npimport matplotlib.pyplot as pltimport matplotlib as mplfrom collections import Counterimport seaborn as sns
sns.set_style('white')
sns.set_context('poster')
%load_ext autoreload
%autoreload 2%matplotlib inline
%config InlineBackend.figure_format = 'retina'import warnings
warnings.filterwarnings('ignore')from random import shuffle
total = 30n_heads = 11n_tails = total - n_heads
tosses = [1] * n_heads + [0] * n_tails
shuffle(tosses)
数据
def plot_coins():
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.bar(list(Counter(tosses).keys()), list(Counter(tosses).values()))
ax.set_xticks([0, 1])
ax.set_xticklabels(['tails', 'heads'])
ax.set_ylim(0, 20)
ax.set_yticks(np.arange(0, 21, 5)) return fig
fig = plot_coins()
plt.show()
# Context manager syntax. `coin_model` is **just** # a placeholderwith pm.Model() as coin_model:
# Distributions are PyMC3 objects.
# Specify prior using Uniform object.
p_prior = pm.Uniform('p', 0, 1)
# Specify likelihood using Bernoulli object.
like = pm.Bernoulli('likelihood', p=p_prior, observed=tosses) # "observed=data" is key
# for likelihood.
MCMC Inference Button (TM)
with coin_model: # don't worry about this:
step = pm.Metropolis() # focus on this, the Inference Button:
coin_trace = pm.sample(2000, step=step)
结果
pm.traceplot(coin_trace)
plt.show()
pm.plot_posterior(coin_trace[100:], color='#87ceeb', rope=[0.48, 0.52], point_estimate='mean', ref_val=0.5)
plt.show()
- 95% 的 HPD,包括 ROPE。
- 获取更多的数据!
模式
- 使用统计分布参数化问题
- 证明我们的模型结构
- 在PyMC3中编写模型,Inference ButtonTM
- 基于后验分布进行解释
- (可选) 新增信息,修改模型结构
例子2:化学活性问题
我有一个新开发的分子X; X在阻止流感方面的效果有多好?
实验
- 测试X的浓度范围,测量流感活动
- 计算 IC50:导致病毒复制率减半的X浓度。
数据
import numpy as np
chem_data = [(0.00080, 99),
(0.00800, 91),
(0.08000, 89),
(0.40000, 89),
(0.80000, 79),
(1.60000, 61),
(4.00000, 39),
(8.00000, 25),
(80.00000, 4)]import pandas as pd
chem_df = pd.DataFrame(chem_data)
chem_df.columns = ['concentration', 'activity']
chem_df['concentration_log'] = chem_df['concentration'].apply(lambda x:np.log10(x))# df.set_index('concentration', inplace=True)
参数问题
给出数据,化学品的IC50 值是多少, 以及其周围的不确定性?
说明
数据
def plot_chemical_data(log=True):
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1) if log:
ax.scatter(x=chem_df['concentration_log'], y=chem_df['activity'])
ax.set_xlabel('log10(concentration (mM))', fontsize=20) else:
ax.scatter(x=chem_df['concentration'], y=chem_df['activity'])
ax.set_xlabel('concentration (mM)', fontsize=20)
ax.set_xticklabels([int(i) for i in ax.get_xticks()], fontsize=18)
ax.set_yticklabels([int(i) for i in ax.get_yticks()], fontsize=18)
plt.hlines(y=50, xmin=min(ax.get_xlim()), xmax=max(ax.get_xlim()), linestyles='--',) return fig
fig = plot_chemical_data(log=True)
plt.show()
with pm.Model() as ic50_model:
beta = pm.HalfNormal('beta', sd=100**2)
ic50_log10 = pm.Flat('IC50_log10') # Flat prior
# MATH WITH DISTRIBUTION OBJECTS!
measurements = beta / (1 + np.exp(chem_df['concentration_log'].values - ic50_log10))
y_like = pm.Normal('y_like', mu=measurements, observed=chem_df['activity']) # Deterministic transformations.
ic50 = pm.Deterministic('IC50', np.power(10, ic50_log10))
MCMC Inference Button (TM)
with ic50_model:
step = pm.Metropolis()
ic50_trace = pm.sample(10000, step=step)
pm.traceplot(ic50_trace[2000:], varnames=['IC50_log10', 'IC50']) # live: sample from step 2000 onwards.plt.show()
结果
pm.plot_posterior(ic50_trace[4000:], varnames=['IC50'], color='#87ceeb', point_estimate='mean')
plt.show()
该化学物质的IC50在约 [2mM,2.4mM](95%HPD)。 这是一种不好的化学物质。
问题类型2:实验组之间的比较
实验组和对照组的不同
例子1:药物IQ问题
药物治疗是否影响 IQ Scores
drug = [ 99., 110., 107., 104., 省略]
placebo = [ 95., 105., 103., 99., 省略]def ECDF(data):
x = np.sort(data)
y = np.cumsum(x) / np.sum(x) return x, ydef plot_drug():
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
x_drug, y_drug = ECDF(drug)
ax.plot(x_drug, y_drug, label='drug, n={0}'.format(len(drug)))
x_placebo, y_placebo = ECDF(placebo)
ax.plot(x_placebo, y_placebo, label='placebo, n={0}'.format(len(placebo)))
ax.legend()
ax.set_xlabel('IQ Score')
ax.set_ylabel('Cumulative Frequency')
ax.hlines(0.5, ax.get_xlim()[0], ax.get_xlim()[1], linestyle='--') return fig
from scipy.stats import ttest_ind
ttest_ind(drug, placebo)
Ttest_indResult(statistic=2.2806701634329549, pvalue=0.025011500508647616)
实验
- 随机将参与者分配给两个实验组:
- +drug vs. -drug
- 测量每个参与者的 IQ Scores
说明
fig = plot_drug()
plt.show()
y_vals = np.concatenate([drug, placebo])
labels = ['drug'] * len(drug) + ['placebo'] * len(placebo)
data = pd.DataFrame([y_vals, labels]).T
data.columns = ['IQ', 'treatment']with pm.Model() as kruschke_model: # Focus on the use of Distribution Objects.
# Linking Distribution Objects together is done by
# passing objects into other objects' parameters.
mu_drug = pm.Normal('mu_drug', mu=0, sd=100**2)
mu_placebo = pm.Normal('mu_placebo', mu=0, sd=100**2)
sigma_drug = pm.HalfCauchy('sigma_drug', beta=100)
sigma_placebo = pm.HalfCauchy('sigma_placebo', beta=100)
nu = pm.Exponential('nu', lam=1/29) + 1
drug_like = pm.StudentT('drug', nu=nu, mu=mu_drug, sd=sigma_drug, observed=drug)
placebo_like = pm.StudentT('placebo', nu=nu, mu=mu_placebo, sd=sigma_placebo, observed=placebo)
diff_means = pm.Deterministic('diff_means', mu_drug - mu_placebo)
pooled_sd = pm.Deterministic('pooled_sd', np.sqrt(np.power(sigma_drug, 2) + np.power(sigma_placebo, 2) / 2))
effect_size = pm.Deterministic('effect_size', diff_means / pooled_sd)
MCMC Inference Button (TM)
with kruschke_model:
kruschke_trace = pm.sample(10000, step=pm.Metropolis())
结果
pm.traceplot(kruschke_trace[2000:], varnames=['mu_drug', 'mu_placebo'])
plt.show()
pm.plot_posterior(kruschke_trace[2000:], color='#87ceeb',varnames=['mu_drug', 'mu_placebo', 'diff_means'])
plt.show()
- Difference in mean IQ:[0.5, 4.6]
- 概率P值:0.02
def get_forestplot_line(ax, kind):
widths = {'median': 2.8, 'iqr': 2.0, 'hpd': 1.0} assert kind in widths.keys(), f('line kind must be one of {widths.keys()}')
lines = [] for child in ax.get_children(): if isinstance(child, mpl.lines.Line2D) and np.allclose(child.get_lw(), widths[kind]):
lines.append(child) return lines def adjust_forestplot_for_slides(ax):
for line in get_forestplot_line(ax, kind='median'):
line.set_markersize(10) for line in get_forestplot_line(ax, kind='iqr'):
line.set_linewidth(5) for line in get_forestplot_line(ax, kind='hpd'):
line.set_linewidth(3) return ax
pm.forestplot(kruschke_trace[2000:], varnames=['mu_drug', 'mu_placebo'])
ax = plt.gca()
ax = adjust_forestplot_for_slides(ax)
plt.show()
Forest plot:相同轴上后验分布的95%HPD(细线),IQR(较粗线)和中位数(点)。
def overlay_effect_size(ax):
height = ax.get_ylim()[1] * 0.5
ax.hlines(height, 0, 0.2, 'red', lw=5)
ax.hlines(height, 0.2, 0.8, 'blue', lw=5)
ax.hlines(height, 0.8, ax.get_xlim()[1], 'green', lw=5)
ax = pm.plot_posterior(kruschke_trace[2000:], varnames=['effect_size'],color='#87ceeb')[0]
overlay_effect_size(ax)
- Effect size (Cohen's d, none to small, medium, large) could be anywhere from essentially nothing to large (95% HPD [0.0, 0.77])。
- IQ改善0-4
- 该药很可能无关紧要。
- 没有生物学意义的证据。
例子2:手机消毒问题
两种常用的方法相比,我的“特别方法”能更好的消毒我的手机吗?
the experiment design
- 随机将手机分配到六组之一:4“特别”方法+ 2“对照”方法。
- count 形成的细菌菌落数,比较前后的计数。
renamed_treatments = dict()
renamed_treatments['FBM_2'] = 'FM1'renamed_treatments['bleachwipe'] = 'CTRL1'renamed_treatments['ethanol'] = 'CTRL2'renamed_treatments['kimwipe'] = 'FM2'renamed_treatments['phonesoap'] = 'FM3'renamed_treatments['quatricide'] = 'FM4'# Reload the data one more time.data = pd.read_csv('smartphone_sanitization_manuscript.csv', na_values=['#DIV/0!'])del data['perc_reduction colonies']# Exclude cellblaster datadata = data[data['treatment'] != 'CB30']
data = data[data['treatment'] != 'cellblaster']# Rename treatmentsdata['treatment'] = data['treatment'].apply(lambda x: renamed_treatments[x])# Sort the data according to the treatments.treatment_order = ['FM1', 'FM2', 'FM3', 'FM4', 'CTRL1', 'CTRL2']
data['treatment'] = data['treatment'].astype('category')
data['treatment'].cat.set_categories(treatment_order, inplace=True)
data = data.sort_values(['treatment']).reset_index(drop=True)# Encode the treatment index.data['treatment_idx'] = data['treatment'].apply(lambda x: treatment_order.index(x))
data['perc_change_colonies'] = (data['colonies_post'] - data['colonies_pre']) / data['colonies_pre']# # View the first 5 rows.# data.head(5)# # filter the data such that we have only PhoneSoap (PS-300) and Ethanol (ET)# data_filtered = data[(data['treatment'] == 'PS-300') | (data['treatment'] == 'QA')]# data_filtered = data_filtered[data_filtered['site'] == 'phone']# data_filtered.sample(10)
数据
def plot_colonies_data():
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(2,1,1)
sns.swarmplot(x='treatment', y='colonies_pre', data=data, ax=ax1)
ax1.set_title('pre-treatment')
ax1.set_xlabel('')
ax1.set_ylabel('colonies')
ax2 = fig.add_subplot(2,1,2)
sns.swarmplot(x='treatment', y='colonies_post', data=data, ax=ax2)
ax2.set_title('post-treatment')
ax2.set_ylabel('colonies')
ax2.set_ylim(ax1.get_ylim())
plt.tight_layout() return fig
fig = plot_colonies_data()
plt.show()
说明
计数是泊松分布。
with pm.Model() as poisson_estimation:
mu_pre = pm.DiscreteUniform('pre_mus', lower=0, upper=10000,shape=len(treatment_order))
pre_mus = mu_pre[data['treatment_idx'].values] # fancy indexing!!
pre_counts = pm.Poisson('pre_counts', mu=pre_mus,observed=data['colonies_pre'])
mu_post = pm.DiscreteUniform('post_mus', lower=0, upper=10000,shape=len(treatment_order))
post_mus = mu_post[data['treatment_idx'].values] # fancy indexing!!
post_counts = pm.Poisson('post_counts', mu=post_mus, observed=data['colonies_post'])
perc_change = pm.Deterministic('perc_change', 100 * (mu_pre - mu_post) / mu_pre)
MCMC Inference Button (TM)
with poisson_estimation:
poisson_trace = pm.sample(20000)
pm.traceplot(poisson_trace, varnames=['pre_mus', 'post_mus'])
plt.show()
结果
pm.forestplot(poisson_trace[10000:], varnames=['perc_change'], ylabels=treatment_order, xrange=[0, 110])
plt.xlabel('Percentage Reduction')
ax = plt.gca()
ax = adjust_forestplot_for_slides(ax)
- postMessage与postMessage跨域
- 【手把手教你做项目】自然语言处理:单词抽取/统计
- D-Link系列路由器漏洞挖掘入门
- 大家一直在谈的领域驱动设计(DDD),我们在互联网业务系统是这么实践的
- 在Atom中设置Python开发环境
- Kaggle赛题解析:逻辑回归预测模型实现
- Shield:支撑美团点评品类最丰富业务的移动端模块化框架开源了
- 点击块,让小块动起来 - 函数封装
- 玩玩文本挖掘-wordcloud、主题模型与文本分类
- Typecho 前台 getshell 漏洞分析
- 关于其他选择器以及选择器优先级详解
- 2016.05 第二周 群问题分享
- MyFlash——美团点评的开源MySQL闪回工具
- R语言关联规则可视化:扩展包arulesViz的介绍
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- Auto Remove Torrents:自动删种程序部署
- OpenVZ平台Alpine Linux一键安装脚本
- 使用holder.js生成美观的网页占位图
- Android studio 3.5.2安装图文教程详解
- Android面试必备的JVM虚拟机制详解,看完之后简历上多一个技能!
- 如何在PHP中JSON在线解析
- Linux下如何克隆磁盘/分区命令dd入门
- Android自定义跑马灯文字效果
- Android实现图片自动切换功能(实例代码详解)
- Android Studio 3.6 正式版终于发布了,快来围观
- android使用ViewPager实现图片自动切换
- Android Studio 3.6 调试 smali的全过程
- Android 10 适配攻略小结
- Android P实现静默安装的方法示例(官方Demo)
- Android studio实现滑动开关