No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman
Matthew Hoffman and Andrew Gelman recently posted a paper called “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” on arxiv.org. It has been discussed on Andrew’s blog.
It’s a good paper, which addresses two big barriers to wider use of Hamiltonian Monte Carlo — the difficulties of tuning the trajectory length and tuning the stepsize to use when simulating a trajectory. The name “No-U-Turn Sampler” (NUTS) comes from their way of addressing the problem of tuning the trajectory length — repeatedly double the length of the current trajectory, until (simplifying a bit) there is a part of the trajectory that makes a “U-Turn”, heading back towards its starting point. This doubling method is clever, and (as I discuss below) one aspect of it seems useful even apart from any attempt to adaptively set the trajectory length. They also introduce a method of adapting the stepsize during the burn-in period, so as to achieve some desired acceptance probability.
However, I don’t think these are completely satisfactory ways of setting trajectory lengths and stepsizes. As I’ll discuss below, these problems are more complicated than they might at first appear.
The biggest advantage of HMC over simple random-walk Metropolis or Gibbs sampling methods is that it can propose to move to a distant point, with a high probability of acceptance. This gets rid of the inefficiency of a random walk, in which taking n small steps in random directions is likely to move you only about √n steps away from where you started. HMC can propose the end-point of a trajectory found using n steps that move more-or-less consistently in one direction, ending up about n steps from the starting point. To get the full advantage of this, the trajectory has to be long enough (but not much longer) that in the least constrained direction the end-point of the trajectory is distant from the start point. This could require that a trajectory be much longer than the length at which it starts to move back towards the starting point, which (roughly speaking) is the point where the No-U-Turn method would stop.
Such premature U-Turns can occur when some directions are highly constrained, limiting how big the stepsize can be, while some other directions are much less constrained, and hence will take many steps to move across, and yet other directions are constrained to an intermediate degree. In directions with intermediate constraints, the trajectory can reverse direction long before the least constrained direction has been explored. This can produce a “U-Turn” when the trajectory is much shorter than is optimal.
Consider an example of sampling from a 30-dimensional multivariate normal distribution in which the 30 components are independent, with standard deviations of 110, 100, 1.1, 1.0, and 26 values equally spaced between 8 and 16. (Since the operation of HMC is invariant to rotation, behaviour on this distribution is the same as on any 30-dimensional multivariate normal for which the square roots of the eigenvalues of the covariance matrix have these values.) The smallest standard deviation of 1 limits the stepsize to less than 2 (at which point the dynamics becomes unstable). A stepsize of 1.5 gives reasonably good results. Exploring the less constrained directions where the standard deviations are 110 and 100 will then require hundreds of steps.
I simulated six trajectories from starting points randomly drawn from this distribution. Each trajectory was for 300 leapfrog steps with a stepsize of 1.5, a duration of 450 time units. Below is a plot (produced using this R script, which calls this HMC routine) of the distance from the start point for the points after each leapfrog step in these trajectories (trajectories that were rejected are plotted as dotted lines).
As one can see, the distance from the start point often does not increase monotonically for the time needed to reach a distant point. It’s hard from this plot to tell exactly what will happen when using the No U-Turn Sampler, since it looks for U-Turns in subtrees, which don’t always have the same start point as the entire trajectory. It’s clear, though, that it will be quite possible for NUTS to stop doubling prematurely, before reaching a trajectory length that will produce good movement in the least-constrained direction. However, we also see that occasionally the distance from the starting point does increase monotonically for a suitable time, so perhaps NUTS will at least occasionally use a suitable trajectory length.
I think there is scope for improving NUTS in this respect, by finding some better criterion for stopping the doubling process. One simple modification that might be desirable is to simply refuse to stop doubling until some minimum trajectory length is reached — a minimum length of eight might be about right. This will be less efficient for problems in which the optimal trajectory length is smaller than this, but such problems are easy anyway, and will still be easy even if done up to eight times less efficiently.
Once it has stopped doubling, NUTS has an interesting way of sampling a point from the final trajectory. Rather than just picking a point randomly from amongst those that are eligible, it first tries to move to the half of the trajectory that doesn’t contain the starting point, then if that is rejected, to move to the quarter of the trajectory that doesn’t contain the starting point but is in the same half as the starting point, and so forth, attempting to move away from the starting point by successively smaller amounts. This should produce greater movement on average than randomly selecting a point that might, just by chance, turn out to be close to the starting point.
This technique ought to also be applicable to my “windowed HMC” method (see Section 5.4 of my review), most obviously when the window size is half the trajectory length. It seems that it should be easy to try this out by just modifying the NUTS program to do a fixed number of doublings, ignoring whether the trajectories are doing U-Turns or not. The discussion in the NUTS paper mentions that some of the advantage seen for NUTS over standard HMC might be due to it having some characteristics of windowed HMC. Comparing adaptive NUTS with NUTS with a fixed number of doublings (essentially an improved version of windowed HMC) would separate these effects.
The other big issue when tuning HMC is how to choose a stepsize for the leapfrog updates that will be small enough to give a reasonably high probability of accepting trajectories, but not unnecessarily small, which would waste computation time. NUTS addresses this issue by adapting the stepsize during a burn-in period, so as to achieve a specified acceptance probability. For example, if one assumes that the distribution being sampled can be viewed as consisting approximately of many replications of some distribution, the optimal acceptance rate is 0.65 (see Section 4.4 of my review), so this might be the target of the adaptation.
I expect that this will work well for many problems. However, in some problems it may fail disastrously. This could happen if the stepsize that produces an average acceptance probability of 0.65 is much smaller in some regions of the distribution than in others. If the sampler is started in the region that allows a large stepsize, the adaptation may converge on that large stepsize. The sampler will then very likely never visit the other region, where the stepsize needs to be much smaller, and the answers obtained will be drastically wrong.
You might think that instead the sampler would at some iteration move to the region that needs a small stepsize, and then get stuck there, as all the proposals are rejected — a problem that is easily diagnosed. Unfortunately, that’s not what will happen. Since the sampler leaves the correct distribution invariant, if it would be stuck for a very long time in some region, it must also be very unlikely to enter that region. This is a potential problem for any global Metropolis algorithm, for which acceptance probabilities can be very low if a bad proposal distribution is chosen, especially in high dimensions, but is particularly severe for HMC, since a stepsize that is beyond the stability limit for the dynamics will produce exponential growth in the error, and an extremely small acceptance probability. See Section 4.2 of my review for further discussion.
One way of alleviating this problem is to choose the stepsize to use for a trajectory randomly, from some moderately broad distribution. One could still adaptively choose the mean of this distribution. Occasional choices of a small stepsize would allow the sampler to enter regions that require such a small stepsize, and either force the mean of the stepsize distribution down, or at least allow the problem to be diagnosed from the presence of some long runs of rejected trajectories.
I will discuss another reason to randomly choose the stepsize or number of leapfrog steps in a second post on NUTS (coming, I hope, in a few days), in which I will discuss the empirical evaluations done in the NUTS paper. This post will also shed some light on whether the possibility of a premature U-Turn that I discuss above is likely to be a problem in practice.