See PyMC3 on GitHub here, the docs here, and the release notes here. This post is available as a notebook here.
I think there are a few great usability features in this new release that will help a lot with building, checking, and thinking about models. To give an introduction, I am going to do a bad job of implementing the “eight schools” model, and show how these new features help debug the model. I am using this particular model as one that is complicated enough to be interesting, but not too complicated.
1. Checking model initialization
This implementation has two different mistakes in it that we will find. First, the method Model.check_test_point is helpful to see if you have accidentally defined a model with 0 probability or with bad parameters:
import numpy as np
import pymc3 as pm

J = 8
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
with pm.Model() as non_centered_eight:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5, shape=J)
    theta_tilde = pm.Normal('theta_t', mu=0, sd=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sd=-sigma, observed=y)
non_centered_eight.check_test_point()
mu -2.530000
tau_log__ -9.160000
theta_t -7.350000
obs -inf
Name: Log-probability of test_point, dtype: float64
Now that I see that obs has -inf log probability, I notice that I set the standard deviation to a negative number! Quelle horreur! Let’s fix that and see if we can find other mistakes.
with pm.Model() as non_centered_eight:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5, shape=J)
    theta_tilde = pm.Normal('theta_t', mu=0, sd=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)
non_centered_eight.check_test_point()
mu -2.53
tau_log__ -9.16
theta_t -7.35
obs -31.46
Name: Log-probability of test_point, dtype: float64
Everything looks ok now at the test point, at least!
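To make those numbers a little less mysterious, here is a minimal sketch, in plain Python rather than PyMC3, of the sums behind the obs and theta_t rows. It assumes the test point puts mu and theta_t at zero (so theta is zero for every school). The helper normal_logp is a name I made up for this illustration, not a PyMC3 function. Returning -inf for a non-positive standard deviation imitates what the table shows for the buggy model.

```python
import math

y = [28., 8., -3., 7., -1., 1., 18., 12.]
sigma = [15., 10., 16., 11., 9., 11., 10., 18.]


def normal_logp(value, mu, sd):
    """Log-density of Normal(mu, sd) at value; -inf if sd is not positive."""
    if sd <= 0:
        return -math.inf
    z = (value - mu) / sd
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * z * z


# Assumed test point: mu = 0 and theta_t = 0, hence theta = 0 for every school.
theta = [0.0] * len(y)

obs_logp = sum(normal_logp(yi, ti, si) for yi, ti, si in zip(y, theta, sigma))
theta_t_logp = sum(normal_logp(0.0, 0.0, 1.0) for _ in y)
# Same sum with the sign mistake from the first version of the model.
bad_obs_logp = sum(normal_logp(yi, ti, -si) for yi, ti, si in zip(y, theta, sigma))

print(round(obs_logp, 2), round(theta_t_logp, 2), bad_obs_logp)
```

Rounded to two decimals, the first two values agree with the obs and theta_t rows above, and the third is -inf, just like the broken model.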
2. Model graphs
It takes an optional install (conda install -c conda-forge python-graphviz), but you can visualize your models in plate notation. This can be useful for sharing your model, or just checking that you implemented the right one.
pm.model_to_graphviz(non_centered_eight)
Oops! I meant for both mu and tau to be shared priors among the eight groups, but left an extra shape=J argument in there.
with pm.Model() as non_centered_eight:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta_tilde = pm.Normal('theta_t', mu=0, sd=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)
pm.model_to_graphviz(non_centered_eight)
3. Sampling from the prior
We can sample from the prior in the absence of data. This might seem like a small thing, but it required a lot of refactoring along the way. Previously, this would be done by copy/pasting the model, deleting the observed arguments, and using MCMC. Now it can be done in the same model context, and is vectorized, running thousands of times faster. I am excited to see what tools and visualizations can be built around this, but in the meantime we can see how the presence of data affects our prior beliefs for the hierarchical mean here.
with non_centered_eight:
    prior = pm.sample_prior_predictive(5000)
    posterior = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [theta_t, tau, mu]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:02<00:00, 1897.82draws/s]
import seaborn as sns

sns.distplot(prior['mu'], label='Prior', hist=False)
ax = sns.distplot(posterior['mu'], label='Posterior', hist=False)
ax.legend()
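For intuition about what sampling from the prior means here, this is a rough sketch of forward sampling the corrected eight schools model by hand with only the standard library: draw the parents first, then everything that depends on them. It is not how PyMC3 implements sample_prior_predictive (that version is vectorized), and draw_prior_predictive is a name invented for this example.

```python
import math
import random

sigma = [15., 10., 16., 11., 9., 11., 10., 18.]


def draw_prior_predictive(rng):
    """One forward draw from the eight schools model, ignoring the observed data."""
    mu = rng.gauss(0.0, 5.0)
    # Half-Cauchy(beta=5) via its inverse CDF: beta * tan(pi * u / 2), u in [0, 1).
    tau = 5.0 * math.tan(math.pi * rng.random() / 2.0)
    theta_t = [rng.gauss(0.0, 1.0) for _ in sigma]
    theta = [mu + tau * t for t in theta_t]
    obs = [rng.gauss(th, s) for th, s in zip(theta, sigma)]
    return {'mu': mu, 'tau': tau, 'theta': theta, 'obs': obs}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prior_draws = [draw_prior_predictive(rng) for _ in range(5000)]
mu_draws = [d['mu'] for d in prior_draws]
```

No MCMC is involved: each draw is independent, which is part of why this is so much cheaper than sampling a copy of the model with the observed arguments deleted.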
4. Nifty new progress bar
Also check out the progress bar above, which now shows you the total number of draws the sampler is doing, instead of just the first chain’s progress. The progress bar is actually the visible part of a change in multiprocessing. It removes (on OSX and Linux) the use of pickle to pass models around for multiprocessing, so you could use lambda in your models again if you really wanted.
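If you have not run into this before, the snippet below shows the standard library behavior that, as I read it, made lambda awkward in the first place: pickle cannot serialize a lambda. It only uses the pickle module and says nothing about PyMC3 internals. The name transform is a hypothetical stand-in for a function a model might hold on to.

```python
import pickle

transform = lambda x: 2 * x  # hypothetical helper a model might close over

try:
    pickle.dumps(transform)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by name, and a lambda has no importable name.
    lambda_pickles = False

print(lambda_pickles)
```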
5. Ordered transformation
There’s a new ordered transform which is handy for sampling from, for example, 1-d mixture models. I’ll quickly generate a mixture model and use some of the tricks above to fit it.
# Generate data
N_SAMPLES = 100
μ_true = np.array([-2, 0, 2])
σ_true = np.ones_like(μ_true)
z_true = np.random.randint(len(μ_true), size=N_SAMPLES)
y = np.random.normal(μ_true[z_true], σ_true[z_true])

with pm.Model() as mixture:
    μ = pm.Normal('μ', mu=0, sd=10, shape=3)
    z = pm.Categorical('z', p=np.ones(3)/3, shape=len(y))
    y_obs = pm.Normal('y_obs', mu=μ[z], sd=1., observed=y)
mixture.check_test_point()
μ -9.66
z -109.86
y_obs -292.47
Name: Log-probability of test_point, dtype: float64
pm.model_to_graphviz(mixture)
with mixture:
    posterior = pm.sample()
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [μ]
>CategoricalGibbsMetropolis: [z]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:09<00:00, 405.95draws/s]
The acceptance probability does not match the target. It is 0.4428216460380929, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
pm.traceplot(posterior, varnames=['μ'], combined=True)
Notice the chains “jumping” between modes. This phenomenon is called label switching. We can handle it with the ordered transform.
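To see why the modes are interchangeable, here is a small sketch, assuming equal mixture weights and summing out the labels z by hand. The data y_demo are made up for the illustration and mixture_loglik is not a PyMC3 function. Shuffling the component means leaves the likelihood unchanged, so the sampler has no reason to prefer one labeling.

```python
import math


def mixture_loglik(y, mus, sd=1.0):
    """Log-likelihood of y under an equal-weight Normal mixture, with z summed out."""
    total = 0.0
    for yi in y:
        dens = [math.exp(-0.5 * ((yi - m) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
                for m in mus]
        total += math.log(sum(dens) / len(mus))
    return total


y_demo = [-2.3, -1.9, 0.1, 0.4, 1.8, 2.2]  # made-up data for illustration
ll_a = mixture_loglik(y_demo, [-2.0, 0.0, 2.0])
ll_b = mixture_loglik(y_demo, [2.0, -2.0, 0.0])  # same components, relabeled

print(math.isclose(ll_a, ll_b))
```

Forcing the components of μ to be increasing keeps just one of those equivalent labelings, which is what the next model does.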
import pymc3.distributions.transforms as tr

with pm.Model() as mixture:
    μ = pm.Normal('μ', mu=0, sd=10, shape=3,
                  transform=tr.ordered,
                  testval=np.array([-1, 0, 1]))  # the `ordered` transform needs an initialization
                                                 # that is also ordered! PRs welcome!
    z = pm.Categorical('z', p=np.ones(3) / 3, shape=len(y))
    y_obs = pm.Normal('y_obs', mu=μ[z], sd=1., observed=y)
    posterior = pm.sample()
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [μ]
>CategoricalGibbsMetropolis: [z]
Sampling 4 chains: 100%|██████████| 4000/4000 [00:15<00:00, 251.06draws/s]
The estimated number of effective samples is smaller than 200 for some parameters.
pm.traceplot(posterior, varnames=['μ'], combined=True)
Look! No label switching! How cool!
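For the curious, here is a rough sketch of the idea behind an ordered transform, written in plain Python: keep the first element as is, and store the log of each gap between neighbors. This is my own illustration under that assumption, not code copied from PyMC3, and the names ordered_forward and ordered_backward are hypothetical. It also hints at why the initialization has to be ordered already: a starting value with ties or descending entries has no valid unconstrained counterpart.

```python
import math


def ordered_forward(x):
    """Map a strictly increasing vector to an unconstrained one."""
    return [x[0]] + [math.log(b - a) for a, b in zip(x, x[1:])]


def ordered_backward(u):
    """Map any real vector to a strictly increasing one."""
    x = [u[0]]
    for step in u[1:]:
        x.append(x[-1] + math.exp(step))  # exp(step) > 0, so x keeps increasing
    return x


u = ordered_forward([-1.0, 0.0, 1.0])   # the testval used above
x = ordered_backward([0.3, -2.0, 5.0])  # arbitrary unconstrained values

try:
    ordered_forward([0.0, 0.0, 0.0])  # ties: the gap is 0 and log(0) is undefined
    unordered_ok = True
except ValueError:
    unordered_ok = False
```

The sampler is free to move anywhere in the unconstrained space, and every point there maps back to a sorted μ, so the components cannot trade places.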