Two Hamiltonian Monte Carlo papers
Two papers involving Hamiltonian Monte Carlo (HMC) have recently appeared on arxiv.org — Jascha Sohl-Dickstein’s Hamiltonian Monte Carlo with reduced momentum flips, and Jascha Sohl-Dickstein and Benjamin Culpepper’s Hamiltonian annealed importance sampling for partition function estimation.
These papers both relate to the variant of HMC in which momentum is only partially refreshed after each trajectory, which allows random-walk behaviour to be suppressed even when trajectories are short (even just one leapfrog step). This variant is described in Section 5.3 of my HMC review. It seems that the method described in the first paper by Sohl-Dickstein could be applied in the context of the second paper by Sohl-Dickstein and Culpepper, but if so it seems they haven’t tried it yet (or haven’t yet written it up).
The first paper presents a new modification of partial momentum refreshment, intended to improve its ability to suppress random-walk behaviour.
The original version of partial momentum refreshment (due to Horowitz) defines a non-reversible Markov chain update as the result of applying three reversible updates in sequence. The first update simulates a trajectory in position/momentum space by Hamiltonian dynamics and then uses the end-point, but with momentum negated, as the proposal for a Metropolis update. The second update simply negates the momentum. Finally, the third update replaces the momentum, p, with a new value computed as αp + (1−α²)^(1/2) n, where n is a vector of standard normal random variates. Each of these updates leaves invariant the desired distribution for position and momentum (the one we’re interested in for position, and independent standard normals for p), so the combination of them also leaves this distribution invariant.
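The three-step update can be sketched in code. This is a minimal illustration, not taken from either paper: the function name, arguments, and the use of a simple leapfrog integrator with a unit mass matrix are my own choices for the example.

```python
import numpy as np

def horowitz_update(q, p, U, grad_U, eps, L, alpha, rng):
    """One partial-momentum-refreshment update (illustrative sketch).

    U(q) is the potential energy (minus log density), grad_U its gradient,
    eps the leapfrog stepsize, L the number of leapfrog steps, and alpha
    the momentum-retention coefficient (alpha = 0 gives standard HMC).
    """
    # Step 1: simulate a trajectory by leapfrog, propose the end-point
    # with momentum negated, and accept or reject by a Metropolis test.
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * grad_U(q_new)
    for i in range(L):
        q_new += eps * p_new
        if i < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)
    p_new = -p_new                       # momentum negated in the proposal

    current_H = U(q) + 0.5 * np.dot(p, p)
    proposed_H = U(q_new) + 0.5 * np.dot(p_new, p_new)
    if rng.random() < np.exp(current_H - proposed_H):
        q, p = q_new, p_new              # accept

    # Step 2: negate the momentum (undoes the negation if we accepted,
    # reverses the direction of motion if we rejected).
    p = -p

    # Step 3: partial momentum refreshment.
    n = rng.standard_normal(len(p))
    p = alpha * p + np.sqrt(1 - alpha**2) * n
    return q, p
```

Each of the three steps separately leaves the joint distribution of (q, p) invariant, which is what makes the combined (non-reversible) update valid.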
Setting α to zero gives standard HMC. When α is only slightly less than one, and the first step accepts the trajectory end-point (with momentum negated), the effect is that the trajectory simulated in the next iteration will continue in almost the same direction as the previous trajectory — the trajectory accepted in the first step has momentum negated, this is undone by the negation in the second step, and the third step leaves this momentum almost unchanged. So even if trajectories are short, movement will continue in the same direction, rather than in a random walk, until the limits of the distribution are reached, provided that the trajectories are all accepted.
When a trajectory is rejected, the direction of movement is reversed, since no negation occurs in the first step (rejection leaves the state unchanged), but the momentum is negated in the second step. This limits the efficiency of the method. To suppress random walks, the rejection rate needs to be small enough that these reversals usually occur only after the chain has moved a distance comparable to the width of the distribution. This requires that the stepsize be smaller than would be needed otherwise, which means that trajectories take longer to compute.
Sohl-Dickstein shows how this method can be modified to reduce the number of momentum flips, which allows random walks to be suppressed without reducing the stepsize as much as in the original version. In one example, he shows that compared with the standard version with the same stepsize, the autocovariances for functions of state are reduced, though not by a huge amount.
Unfortunately, if I understand the method correctly, the modified method will have to compute reversed trajectories that are not needed in the original version, which seems like it will about double the computation time per iteration. This probably more than offsets the benefit from reducing momentum flips. Note that reducing the stepsize by a factor of two will also only double the computation time (for a given trajectory length), and will also reduce the number of momentum flips. So I’m not convinced that this variant of partial momentum refreshment is actually useful.
However, one interesting aspect of the paper is that it’s the only realistic example that I can recall of a non-reversible Markov chain Monte Carlo update that isn’t constructed by applying a sequence of reversible updates (as in my description of standard partial momentum refreshment above). Maybe the technique (which I won’t describe here) could be used in other contexts.
The second paper is a fairly straightforward application of my annealed importance sampling method for computing normalizing constants, and handling multimodal distributions, using HMC as the Markov chain Monte Carlo method that the importance sampling employs. Annealed importance sampling applies MCMC updates that leave invariant a series of distributions, ending in the distribution of interest. Often, getting good results requires using many distributions that slowly approach the distribution of interest from some simple reference distribution.
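The structure of annealed importance sampling can be sketched as follows. This is a generic illustration, not code from the paper: it uses a geometric bridge between the reference and target densities, and a simple random-walk Metropolis update stands in for HMC to keep the example short.

```python
import numpy as np

def ais_log_weight(log_p0, log_p1, x, betas, rng, step=0.5):
    """Log importance weight for one AIS chain (illustrative sketch).

    The intermediate distributions are f_beta(x) proportional to
    exp((1-beta) log_p0(x) + beta log_p1(x)), moving from the reference
    log_p0 (beta = 0) to the target log_p1 (beta = 1).
    """
    def log_f(x, beta):
        return (1 - beta) * log_p0(x) + beta * log_p1(x)

    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Accumulate the importance weight before updating the state.
        log_w += log_f(x, b) - log_f(x, b_prev)
        # MCMC update leaving f_b invariant (Metropolis, standing in for HMC).
        prop = x + step * rng.standard_normal(x.shape)
        if np.log(rng.random()) < log_f(prop, b) - log_f(x, b):
            x = prop
    return log_w
```

Averaging exp(log_w) over many independent chains, each started from a draw from the reference distribution, gives an unbiased estimate of the ratio of normalizing constants of the target and the reference.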
The methodological innovation in the paper is to use HMC with partial momentum refreshment, with the momentum retained from one distribution in the sequence to the next. If the HMC updates do just one leapfrog step, and a new distribution is used for each update, partial momentum refreshment allows one to combine a large number of distributions with random walk suppression, which wouldn’t be possible if one did standard HMC.
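The idea can be sketched schematically. This is not the authors' exact algorithm, just an illustration under my own assumptions (geometric bridge, unit mass matrix): one leapfrog step with a Metropolis test is applied per intermediate distribution, and the momentum is only partially refreshed before moving to the next distribution, so motion can persist in one direction across many distributions.

```python
import numpy as np

def ais_partial_refresh(log_p0, log_p1, grad0, grad1, x, betas,
                        eps=0.1, alpha=0.9, rng=None):
    """AIS with one-leapfrog HMC and partial momentum refreshment (sketch)."""
    def log_f(x, b):      # log density of the intermediate distribution
        return (1 - b) * log_p0(x) + b * log_p1(x)
    def grad_f(x, b):     # its gradient
        return (1 - b) * grad0(x) + b * grad1(x)

    p = rng.standard_normal(x.shape)
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_f(x, b) - log_f(x, b_prev)
        # One leapfrog step on distribution b, with a Metropolis test.
        p_new = p + 0.5 * eps * grad_f(x, b)
        x_new = x + eps * p_new
        p_new = p_new + 0.5 * eps * grad_f(x_new, b)
        dH = (log_f(x_new, b) - 0.5 * np.sum(p_new**2)) \
           - (log_f(x, b) - 0.5 * np.sum(p**2))
        if np.log(rng.random()) < dH:
            x, p = x_new, -p_new   # accept, with momentum negated
        p = -p                     # undoes the negation on acceptance
        # Partial refreshment retains most of the momentum.
        p = alpha * p + np.sqrt(1 - alpha**2) * rng.standard_normal(x.shape)
    return log_w
```

Because each update still leaves its intermediate distribution (with independent standard normal momentum) invariant, the usual AIS weight remains valid; the partial refreshment only changes how efficiently the chain moves.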
They apply this to comparing the performance of some complex image models, in which difficult normalizing constants need to be evaluated in order to find the probability the model assigns to test data.
My method of Hamiltonian importance sampling is a more radical way of applying Hamiltonian dynamics to this problem, which I’m currently working on developing further.