Parallel-Tempering Ensemble MCMC

Added in version 1.2.0

When your posterior is multi-modal or otherwise hard to sample with a standard MCMC, a good option to try is parallel-tempered MCMC (PTMCMC). PTMCMC runs multiple MCMC chains at different temperatures, \(T\). Each chain samples from a modified posterior, given by

\[\pi_T(x) = \left[ l(x) \right]^{\frac{1}{T}} p(x)\]

where \(l(x)\) is the likelihood and \(p(x)\) is the prior. As \(T \to \infty\), the tempered posterior reduces to the prior, which is hopefully easy to sample. If the likelihood is a Gaussian with standard deviation \(\sigma\), then the tempered likelihood is proportional to a Gaussian with standard deviation \(\sigma \sqrt{T}\).

Periodically during the run, the different temperatures swap members of their ensemble in a way that preserves detailed balance. The hot chains can more easily explore parameter space because the likelihood is flatter and broader, while the cold chains do a good job of exploring the peaks of the likelihood. This can dramatically improve convergence if your likelihood function has many well-separated modes.
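For concreteness, a minimal sketch of evaluating the tempered log-posterior (where loglike and logprior are placeholder functions returning the log-likelihood and log-prior) is simply:

def log_tempered_posterior(x, T, loglike, logprior):
    # log pi_T(x) = (1/T) * log l(x) + log p(x), up to normalization
    return loglike(x) / T + logprior(x)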

How To Sample a Multi-Modal Gaussian

Suppose we want to sample from the posterior given by

\[\pi(\vec{x}) \propto \exp\left[ - \frac{1}{2} \left( \vec{x} - \vec{\mu}_1 \right)^T \Sigma^{-1}_1 \left( \vec{x} - \vec{\mu}_1 \right) \right] + \exp\left[ -\frac{1}{2} \left( \vec{x} - \vec{\mu}_2 \right)^T \Sigma^{-1}_2 \left( \vec{x} - \vec{\mu}_2 \right) \right]\]

If the modes \(\mu_{1,2}\) are well-separated with respect to the scale of \(\Sigma_{1,2}\), then this distribution will be hard to sample with the EnsembleSampler. Here is how we would sample from it using the PTSampler.

First, some preliminaries:

import numpy as np
from emcee import PTSampler

Define the means and inverse covariance matrices of our multi-modal likelihood:

# mu1 = [1, 1], mu2 = [-1, -1]
mu1 = np.ones(2)
mu2 = -np.ones(2)

# Width of 0.1 in each dimension
sigma1inv = np.diag([100.0, 100.0])
sigma2inv = np.diag([100.0, 100.0])

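# Log-likelihood: log of the sum of the two Gaussian components (constant
# normalization dropped); np.logaddexp keeps the sum numerically stable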
def logl(x):
    dx1 = x - mu1
    dx2 = x - mu2

    return np.logaddexp(-np.dot(dx1, np.dot(sigma1inv, dx1))/2.0,
                        -np.dot(dx2, np.dot(sigma2inv, dx2))/2.0)

# Use a flat prior
def logp(x):
    return 0.0

Now we can construct a sampler object that will drive the PTMCMC; arbitrarily, we choose to use 20 temperatures (by default each temperature increases by a factor of \(\sqrt{2}\), so the highest temperature will be \(T = 2^{19/2} \approx 724\), resulting in an effective \(\sigma_T = \sigma \sqrt{T} \approx 2.7\), which is about the separation \(2\sqrt{2} \approx 2.8\) of our modes). Let’s use 100 walkers in the ensemble at each temperature:

ntemps = 20
nwalkers = 100
ndim = 2

sampler = PTSampler(ntemps, nwalkers, ndim, logl, logp)

Making the sampling multi-threaded is as simple as adding the threads=Nthreads argument to PTSampler. We could have modified the temperature ladder using the betas optional argument (which should be an array of \(\beta \equiv 1/T\) values). The pool argument also lets us supply our own pool of workers if we want fine-grained control over the parallelism.
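For example (an illustrative sketch; the pool size of 4 is arbitrary), either of the following would construct an equivalent, parallelized sampler:

from multiprocessing import Pool

# Let the sampler manage its own pool of 4 worker processes...
sampler = PTSampler(ntemps, nwalkers, ndim, logl, logp, threads=4)

# ...or supply a pool we control ourselves via the pool argument
pool = Pool(4)
sampler = PTSampler(ntemps, nwalkers, ndim, logl, logp, pool=pool)

Note that the log-likelihood and log-prior must be picklable (e.g. module-level functions) for multiprocessing to work.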

First, we run the sampler for 1000 burn-in iterations:

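# Initial walker positions, one ensemble per temperature:
# shape (ntemps, nwalkers, ndim)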
p0 = np.random.uniform(low=-1.0, high=1.0, size=(ntemps, nwalkers, ndim))
for p, lnprob, lnlike in sampler.sample(p0, iterations=1000):
    pass
sampler.reset()

Now we sample for 10000 iterations, recording every 10th sample:

for p, lnprob, lnlike in sampler.sample(p, lnprob0=lnprob,
                                           lnlike0=lnlike,
                                           iterations=10000, thin=10):
    pass

The resulting samples (1000 of them) are stored as the sampler.chain property:

assert sampler.chain.shape == (ntemps, nwalkers, 1000, ndim)

# Chain has shape (ntemps, nwalkers, nsteps, ndim)
# Zero temperature mean:
mu0 = np.mean(np.mean(sampler.chain[0,...], axis=0), axis=0)

# Longest autocorrelation length (over any temperature)
max_acl = np.max(sampler.acor)

# etc

Implementation Notes

For a description of the parallel-tempering algorithm, see, e.g., Earl & Deem (2005), Phys Chem Chem Phys, 7, 3910. The algorithm has some tunable parameters:

Temperature Ladder
The choice of temperatures for the chains will strongly influence the rate of convergence of the sampling. By default, the PTSampler class uses an exponential ladder, with each temperature increasing by a factor of \(\sqrt{2}\). The user can supply their own ladder using the betas optional argument in the constructor (see the sketch after this list).
Rate of Temperature Swaps
The rate at which temperature swaps are proposed can, to a lesser extent, also influence the rate of convergence of the sampling. The goal is to make sure that good positions found by the high-temperature chains can propagate to the lower temperatures, while still ensuring that the high-temperature chains do not lose all memory of good locations. Here we choose to implement one temperature-swap proposal per walker per rung on the temperature ladder after each ensemble update. This is not user-tunable, but seems to work well in practice.
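For instance (a minimal sketch, reusing the names from the example above), the default exponential ladder could be reproduced, or replaced with a custom one, by passing an explicit array of inverse temperatures:

# Inverse temperatures beta = 1, 2**(-1/2), 2**(-1), ..., 2**(-19/2),
# i.e. temperatures increasing by a factor of sqrt(2) at each rung
betas = np.sqrt(2.0) ** (-np.arange(ntemps))

sampler = PTSampler(ntemps, nwalkers, ndim, logl, logp, betas=betas)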

The args optional argument is not available in the PTSampler constructor; if you need to pass extra arguments to your log-likelihood or log-prior functions and do not want to use a global variable, use a custom class with a __call__ method.
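For example, a small wrapper class along these lines (the names here are hypothetical) plays the role that args would otherwise fill:

class GaussianLogLike(object):
    """Callable log-likelihood that carries its own fixed parameters."""

    def __init__(self, mu, sigmainv):
        self.mu = mu
        self.sigmainv = sigmainv

    def __call__(self, x):
        dx = x - self.mu
        return -np.dot(dx, np.dot(self.sigmainv, dx)) / 2.0

logl_single = GaussianLogLike(mu1, sigma1inv)
sampler = PTSampler(ntemps, nwalkers, ndim, logl_single, logp)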

The thermodynamic_integration_log_evidence method uses thermodynamic integration (see, e.g., Goggans & Chi (2004), AIP Conf Proc, 707, 59) to estimate the evidence integral. Define the evidence as a function of inverse temperature:

\[Z(\beta) \equiv \int dx\, l^\beta(x) p(x)\]

We want to compute \(Z(1)\). \(Z\) satisfies the following differential equation

\[\frac{d \ln Z}{d\beta} = \frac{1}{Z(\beta)} \int dx\, \ln l(x) l^\beta(x) p(x) = \left \langle \ln l \right\rangle_\beta\]

where \(\left\langle \ldots \right\rangle_\beta\) is the average of a quantity over the posterior at temperature \(T = 1/\beta\). Integrating (note that \(Z(0) = 1\) because the prior is normalized), we have

\[\ln Z(1) = \int_0^1 d\beta \left \langle \ln l \right\rangle_\beta\]

This quantity can be estimated from a PTMCMC by computing the average \(\ln l\) within each chain and applying a quadrature formula to estimate the integral.
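In code, a minimal sketch of this quadrature might look like the following; it assumes the per-temperature log-likelihoods are available as sampler.lnlikelihood with shape (ntemps, nwalkers, nsteps) and that sampler.betas runs from \(\beta = 1\) down to the hottest rung (in practice you would simply call sampler.thermodynamic_integration_log_evidence(), which also returns an error estimate):

# <ln l>_beta for each temperature, averaged over walkers and stored steps
mean_lnl = np.mean(sampler.lnlikelihood, axis=(1, 2))

# Sort by increasing beta and pad the hot end down to beta = 0, where the
# integrand is approximated by the hottest chain's value
order = np.argsort(sampler.betas)
betas = np.concatenate(([0.0], sampler.betas[order]))
lnls = np.concatenate(([mean_lnl[order][0]], mean_lnl[order]))

# Trapezoidal estimate of ln Z(1) = int_0^1 <ln l>_beta dbeta
ln_evidence = np.trapz(lnls, betas)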