## Posts filed under ‘Monte Carlo Methods’

### Software for Flexible Bayesian Modeling – New release

I’ve released a new version of my Software for Flexible Bayesian Modeling and Markov Chain Sampling (FBM). This is the first public release since 2004, with the first release of the precursor software being in 1995. There was a version mostly completed in 2007 that never got released (due to my not getting around to checking that I’d fixed up everything). The new version has the changes from 2007 plus some more recent updates, including new features used for the tests in this paper.

FBM implements several general-purpose Markov chain sampling methods, such as Metropolis updates, Hamiltonian (Hybrid) Monte Carlo, and slice sampling. These methods can be applied to distributions defined by simple formulas, including posterior distributions for simple Bayesian models. Several more specialized modules have been written that implement posterior distributions for more complex models, including Bayesian neural networks, Gaussian process models, and mixture models (including Dirichlet process mixture models).

(more…)

### Two Hamiltonian Monte Carlo papers

Two papers involving Hamiltonian Monte Carlo (HMC) have recently appeared on arxiv.org — Jascha Sohl-Dickstein’s Hamiltonian Monte Carlo with reduced momentum flips, and Jascha Sohl-Dickstein and Benjamin Culpepper’s Hamiltonian annealed importance sampling for partition function estimation.

These papers both relate to the variant of HMC in which momentum is only partially refreshed after each trajectory, which allows random-walk behaviour to be suppressed even when trajectories are short (even just one leapfrog step). This variant is described in Section 5.3 of my HMC review. It seems that the method described in the first paper by Sohl-Dickstein could be applied in the context of the second paper by Sohl-Dickstein and Culpepper, but if so it seems they haven’t tried it yet (or haven’t yet written it up).

(more…)

### Non-random MCMC

In my post on MCMC simulation as a random permutation (paper available at arxiv.org here), I mentioned that this view of MCMC also has implications for the role of randomness in MCMC. This has also been discussed in a recent paper by Iain Murray and Lloyd Elliott on Driving Markov chain Monte Carlo with a dependent random stream.

For the simple case of Gibbs sampling for a continuous distribution, Murray and Elliott’s procedure is the same as mine, except that they do not have the updates of extra variables needed to produce a volume-preserving map. These extra variables are relevant for my importance sampling application, but not for what I’ll discuss here. The method is a simple modification of the usual Gibbs sampling procedure, assuming that sampling from conditional distributions is done by inverting their CDFs (a common method for many standard distributions). It turns out that after this modification, one can often eliminate the random aspect of the simulation and still get good results! (more…)

### MCMC simulation as a random permutation

I’ve just finished a new paper. Continuing my recent use of unwieldy titles, I call it “How to view an MCMC simulation as a permutation, with applications to parallel simulation and improved importance sampling”.

The paper may look a bit technical in places, but the basic idea is fairly simple. I show that, after extending the state space a bit, it’s possible to view an MCMC simulation (done for some number of iterations) as a randomly selected map from an initial state to a final state that is either a permutation, if the extended state space is finite, or more generally a one-to-one map that preserves volume.

Why is this interesting? I think it’s a useful mathematical fact — sort of the opposite of how one can “couple” MCMC simulations in a way that promotes coalescence of states. It may turn out to be applicable in many contexts. I present two of these in the paper. (more…)

### Evaluation of NUTS — more comments on the paper by Hoffman and Gelman

Here is my second post on the paper by Matthew Hoffman and Andrew Gelman on “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”, available from arxiv.org. In my first post, I discussed how well the two main innovations in this “NUTS'” method — ending a trajectory when a “U-Turn” is encountered, and adaptively setting the stepsize — can be expected to work. In this post, I will discuss the empirical evaluations in the NUTS paper, and report on an evaluation of my own, made possible by the authors having kindly made available the NUTS software, concluding that the paper’s claims for NUTS are somewhat overstated. The issues I discuss are also of more general interest for other evaluations of HMC. (more…)

### No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman

Matthew Hoffman and Andrew Gelman recently posted a paper called “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” on arxiv.org. It has been discussed on Andrew’s blog.

It’s a good paper, which addresses two big barriers to wider use of Hamiltonian Monte Carlo — the difficulties of tuning the trajectory length and tuning the stepsize to use when simulating a trajectory. The name “No-U-Turn Sampler” (NUTS) comes from their way of addressing the problem of tuning the trajectory length — repeatedly double the length of the current trajectory, until (simplifying a bit) there is a part of the trajectory that makes a “U-Turn”, heading back towards its starting point. This doubling method is clever, and (as I discuss below) one aspect of it seems useful even apart from any attempt to adaptively set the trajectory length. They also introduce a method of adapting the stepsize during the burn-in period, so as to achieve some desired acceptance probability.

However, I don’t think these are completely satisfactory ways of setting trajectory lengths and stepsizes. As I’ll discuss below, these problems are more complicated than they might at first appear. (more…)

### GRIMS — General R Interface for Markov Sampling

I have released a (very) preliminary version of my new MCMC software in R, which I’m calling GRIMS, for General R Interface for Markov Sampling. You can get it here.

This software differs from other more-or-less general MCMC packages in several respects, all but one of which make it, I think, a much better tool for serious MCMC applications. Here are some highlights: (more…)

### Ensemble MCMC

I’m glad to have managed, before teaching starts again, to have finished a Technical Report (available here or at arxiv.org) with what may be my most unwieldy title ever:

MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression

I wanted the title to mention all three of the nested ideas in the paper. Actually, I wasn’t able to fit in a fourth, most general, idea, of MCMC methods based on caching and mapping (see here). Here is the abstract:

I introduce a Markov chain Monte Carlo (MCMC) scheme in which sampling from a distribution with density π(x) is done using updates operating on an “ensemble” of states. The current state x is first stochastically mapped to an ensemble, x^{(1)},…,x^{(K)}. This ensemble is then updated using MCMC updates that leave invariant a suitable ensemble density, ρ(x^{(1)},…,x^{(K)}), defined in terms of π(x^{(i)}) for i=1,…,K. Finally a single state is stochastically selected from the ensemble after these updates. Such ensemble MCMC updates can be useful when characteristics of π and the ensemble permit π(x^{(i)}) for all i in {1,…,K} to be computed in less than K times the amount of computation time needed to compute π(x) for a single x. One common situation of this type is when changes to some “fast” variables allow for quick re-computation of the density, whereas changes to other “slow” variables do not. Gaussian process regression models are an example of this sort of problem, with an overall scaling factor for covariances and the noise variance being fast variables. I show that ensemble MCMC for Gaussian process regression models can indeed substantially improve sampling performance. Finally, I discuss other possible applications of ensemble MCMC, and its relationship to the “multiple-try Metropolis” method of Liu, Liang, and Wong and the “multiset sampler” of Leman, Chen, and Lavine.

I’ve also posted the programs used to produce the results. These haven’t been tested much beyond their use for the paper, but I hope to incorporate them into a general MCMC package in R (also including programs accompanying my review of Hamiltonian Monte Carlo). That’s my next project to do in whatever time I have available after teaching, administration, and a three-year-old daughter, along with more efforts to speed up R, so the that this MCMC package won’t be *too *slow.

### The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever

Many Bayesian statisticians decide which of several models is most appropriate for a given dataset by computing the *marginal likelihood* of each model (also called the *integrated likelihood* or the *evidence*). The marginal likelihood is the probability that the model gives to the observed data, averaging over values of its parameters with respect to their prior distribution. If *x* is the entire dataset and *t* is the entire set of parameters, then the marginal likelihood is

Here, I use *P(x)* and so forth to represent a probability density or mass function, as appropriate. After P(x) has been computed for all models (which may well have different sorts of parameters), the model with largest marginal likelihood is chosen, or more elaborately, predictions might be made by averaging over all models, with weights proportional to their marginal likelihoods (perhaps multiplied by a prior factor for each model).

Use of the marginal likelihood to evaluate models is often rather dubious, because it is very sensitive to any flaws in the specifications of the priors for the parameters of each model. That may be a topic for a future post, but I’ll ignore that issue here, and talk instead about how people try to compute the marginal likelihood.

Computing *P(x)* is difficult, because it’s an integral of what is usually a highly variable function over what is typically a high dimensional parameter space. Of course, for some models the integral is analytically tractable (typically ones with conjugate priors), and if the parameter space is low-dimensional, brute-force numerical integration may work. For most interesting models, however, the only known way of accurately computing P(x) is by applying quite sophisticated Monte Carlo techniques, which have long been researched by physicists in order to solve the essentially identical problem of computing “free energies”. See, for example, Section 6.2 of my review of Probabilistic Inference Using Markov Chain Monte Carlo Methods, and my paper on Annealed Importance Sampling.

These sophisticated techniques aren’t too well-known, however. They also take substantial effort to implement and substantial amounts of computer time. So it’s maybe not surprising that many people have been tempted by what *seems *like a much easier method — compute the harmonic mean of the likelihood with respect to the posterior distribution, using the same Markov Chain Monte Carlo (MCMC) runs that they have usually already done to estimate parameters or make predictions based on each model.