用 Python 进行贝叶斯模型建模（2）_程序人生

用 Python 进行贝叶斯模型建模（2）

admin

2023-07-31 00:46:24

0次

本系列：

《第 0 节：导论》
《第 1 节：估计模型参数》
《第 2 节：模型检验》
《第 3 节：分层模型》
《第 4 节：贝叶斯回归》
待续

第2节：模型检验

在这一节中，我们将用到两种技术，旨在回答：

模型和估计的参数是否对潜在数据有很好的模拟？
给定两个独立的模型，哪个对潜在数据模拟的更好？

模型检验I：后验估计检验

一种检验模型拟合的方法是后验估计检验。这种方法很直观，回顾上节中，我们通过收集 200,000 个 μ 的后验分布样本来对泊松分布的参数 μ进行估计，每个样本都被认为是可信的参数值。

后验预测检验需要从预测模型中产生新的数据。具体来说就是，我们已经估计了 200,000 个可信的泊松分布参数值μ，这意味着我们可以根据这些值建立 200,000 个泊松分布，然后从这些分布中随机采样。用公式表示为：

理论上，如果模型对潜在数据拟合的很好，那么产生的数据应该和原始观测数据近似。PyMC 提供一种方便的方式来从拟合模型中采样。你可能已经注意到了上面模型应用中新的一行代码：

1	y_pred = pm.Poisson(\’y_pred\’, mu=mu)

这几乎和 y_est 一样，只不过没给观测数据赋值。PyMC 把它当做随机节点（和观测节点相反），当 MCMC 采样器运行时，它也从 y_est 中采集数据。

然后画出y_pred并和观测数据y_est比较。

12345678910111213141516171819

import jsonimport matplotlib.pyplot as pltimport numpy as npimport pandas as pdimport pymc3 as pmimport scipyimport scipy.stats as statsimport statsmodels.api as smimport theano.tensor as tt from IPython.display import Image %matplotlib inline plt.style.use(\’bmh\’)colors = [\’#348ABD\’, \’#A60628\’, \’#7A68A6\’, \’#467821\’, \’#D55E00\’, \’#CC79A7\’, \’#56B4E9\’, \’#009E73\’, \’#F0E442\’, \’#0072B2\’] messages = pd.read_csv(\’data/hangout_chat_data.csv\’)

12	Applied interval–transform to mu and added transformed mu_interval to model. [————————–100%————————–] 200000 of 200000 complete in 63.5 sec

12345678910111213141516171819202122

x_lim = 60burnin = 50000 y_pred = trace[burnin:].get_values(\’y_pred\’)mu_mean = trace[burnin:].get_values(\’mu\’).mean() fig = plt.figure(figsize=(10,6))fig.add_subplot(211) _ = plt.hist(y_pred, range=[0, x_lim], bins=x_lim, histtype=\’stepfilled\’, color=colors[1]) _ = plt.xlim(1, x_lim)_ = plt.ylabel(\’Frequency\’)_ = plt.title(\’Posterior predictive distribution\’) fig.add_subplot(212) _ = plt.hist(messages[\’time_delay_seconds\’].values, range=[0, x_lim], bins=x_lim, histtype=\’stepfilled\’)_ = plt.xlabel(\’Response time in seconds\’)_ = plt.ylabel(\’Frequency\’)_ = plt.title(\’Distribution of observed data\’) plt.tight_layout()

选择正确的分布

我对上面的结果不是很满意。理想情况下，我希望后验预测分布和观测数据的分布一定程度上近似。直观上，如果我们正确估计了模型参数，那么我们应该可以从模型中采样得到类似的数据。结果很明显不是这样。

可能泊松分布不适合拟合这些数据。一种可选模型是负二项分布，特点和泊松分布很像，只是有两个参数（μ 和 α），使得方差和均值独立。回顾泊松分布只有一个参数 μ，既是均值，又是方差。

12345678910111213141516171819202122232425262728293031323334353637

fig = plt.figure(figsize=(10,5))fig.add_subplot(211)x_lim = 70mu = [15, 40]for i in np.arange(x_lim): plt.bar(i, stats.poisson.pmf(mu[0], i), color=colors[3]) plt.bar(i, stats.poisson.pmf(mu[1], i), color=colors[4]) _ = plt.xlim(1, x_lim)_ = plt.xlabel(\’Response time in seconds\’)_ = plt.ylabel(\’Probability mass\’)_ = plt.title(\’Poisson distribution\’)_ = plt.legend([\’$lambda$ = %s\’ % mu[0], \’$lambda$ = %s\’ % mu[1]]) # Scipy takes parameters n & p, not mu & alphadef get_n(mu, alpha): return 1. / alpha * mu def get_p(mu, alpha): return get_n(mu, alpha) / (get_n(mu, alpha) + mu) fig.add_subplot(212) a = [2, 4] for i in np.arange(x_lim): plt.bar(i, stats.nbinom.pmf(i, n=get_n(mu[0], a[0]), p=get_p(mu[0], a[0])), color=colors[3]) plt.bar(i, stats.nbinom.pmf(i, n=get_n(mu[1], a[1]), p=get_p(mu[1], a[1])), color=colors[4]) _ = plt.xlabel(\’Response time in seconds\’)_ = plt.ylabel(\’Probability mass\’)_ = plt.title(\’Negative Binomial distribution\’)_ = plt.legend([\’$mu = %s, / beta = %s$\’ % (mu[0], a[0]), \’$mu = %s, / beta = %s$\’ % (mu[1], a[1])]) plt.tight_layout()

使用之前相同的数据集，继续对负二项分布的参数进行估计。同样地，使用均匀分布来估计 μ 和 α。模型可以表示为：

代码：

1	Image(\’graphics/Neg Binomial Dag.png\’, width=400)

1234567

上一篇：用 Python 进行贝叶斯模型建模（3）

下一篇：用 Python 进行贝叶斯模型建模（1）

用 Python 进行贝叶斯模型建模（2）

第2节：模型检验

模型检验I：后验估计检验

选择正确的分布

相关内容

热门资讯