## No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman

*2012-01-21 at 12:38 am* *
9 comments *

Matthew Hoffman and Andrew Gelman recently posted a paper called “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” on arxiv.org. It has been discussed on Andrew’s blog.

It’s a good paper, which addresses two big barriers to wider use of Hamiltonian Monte Carlo — the difficulties of tuning the trajectory length and tuning the stepsize to use when simulating a trajectory. The name “No-U-Turn Sampler” (NUTS) comes from their way of addressing the problem of tuning the trajectory length — repeatedly double the length of the current trajectory, until (simplifying a bit) there is a part of the trajectory that makes a “U-Turn”, heading back towards its starting point. This doubling method is clever, and (as I discuss below) one aspect of it seems useful even apart from any attempt to adaptively set the trajectory length. They also introduce a method of adapting the stepsize during the burn-in period, so as to achieve some desired acceptance probability.

However, I don’t think these are completely satisfactory ways of setting trajectory lengths and stepsizes. As I’ll discuss below, these problems are more complicated than they might at first appear.

The biggest advantage of HMC over simple random-walk Metropolis or Gibbs sampling methods is that it can propose to move to a distant point, with a high probability of acceptance. This gets rid of the inefficiency of a random walk, in which taking *n* small steps in random directions is likely to move you only about √*n* steps away from where you started. HMC can propose the end-point of a trajectory found using* n* steps that move more-or-less consistently in one direction, ending up about *n* steps from the starting point. To get the full advantage of this, the trajectory has to be long enough (but not much longer) that *in the least constrained direction *the end-point of the trajectory is distant from the start point. This could require that a trajectory be much longer than the length at which it starts to move back towards the starting point, which (roughly speaking) is the point where the No-U-Turn method would stop.

Such premature U-Turns can occur when some directions are highly constrained, limiting how big the stepsize can be, while some other directions are much less constrained, and hence will take many steps to move across, and yet other directions are constrained to an intermediate degree. In directions with intermediate constraints, the trajectory can reverse direction long before the least constrained direction has been explored. This can produce a “U-Turn” when the trajectory is much shorter than is optimal.

Consider an example of sampling from a 30-dimensional multivariate normal distribution in which the 30 components are independent, with standard deviations of 110, 100, 1.1, 1.0, and 26 values equally spaced between 8 and 16. (Since the operation of HMC is invariant to rotation, behaviour on this distribution is the same as on any 30-dimensional multivariate normal for which the square roots of the eigenvalues of the covariance matrix have these values.) The smallest standard deviation of 1 limits the stepsize to less than 2 (at which point the dynamics becomes unstable). A stepsize of 1.5 gives reasonably good results. Exploring the less constrained directions where the standard deviations are 110 and 100 will then require hundreds of steps.

I simulated six trajectories from starting points randomly drawn from this distribution. Each trajectory was for 300 leapfrog steps with a stepsize of 1.5, a duration of 450 time units. Below is a plot (produced using this R script, which calls this HMC routine) of the distance from the start point for the points after each leapfrog step in these trajectories (trajectories that were rejected are plotted as dotted lines).

As one can see, the distance from the start point often does not increase monotonically for the time needed to reach a distant point. It’s hard from this plot to tell exactly what will happen when using the No U-Turn Sampler, since it looks for U-Turns in subtrees, which don’t always have the same start point as the entire trajectory. It’s clear, though, that it will be quite possible for NUTS to stop doubling prematurely, before reaching a trajectory length that will produce good movement in the least-constrained direction. However, we also see that occasionally the distance from the starting point does increase monotonically for a suitable time, so perhaps NUTS will at least occasionally use a suitable trajectory length.

I think there is scope for improving NUTS in this respect, by finding some better criterion for stopping the doubling process. One simple modification that might be desirable is to simply refuse to stop doubling until some minimum trajectory length is reached — a minimum length of eight might be about right. This will be less efficient for problems in which the optimal trajectory length is smaller than this, but such problems are easy anyway, and will still be easy even if done up to eight times less efficiently.

Once it has stopped doubling, NUTS has an interesting way of sampling a point from the final trajectory. Rather than just picking a point randomly from amongst those that are eligible, it first tries to move to the half of the trajectory that doesn’t contain the starting point, then if that is rejected, to move to the quarter of the trajectory that doesn’t contain the starting point but is in the same half as the starting point, and so forth, attempting to move away from the starting point by successively smaller amounts. This should produce greater movement on average than randomly selecting a point that might, just by chance, turn out to be close to the starting point.

This technique ought to also be applicable to my “windowed HMC” method (see Section 5.4 of my review), most obviously when the window size is half the trajectory length. It seems that it should be easy to try this out by just modifying the NUTS program to do a fixed number of doublings, ignoring whether the trajectories are doing U-Turns or not. The discussion in the NUTS paper mentions that some of the advantage seen for NUTS over standard HMC might be due to it having some characteristics of windowed HMC. Comparing adaptive NUTS with NUTS with a fixed number of doublings (essentially an improved version of windowed HMC) would separate these effects.

The other big issue when tuning HMC is how to choose a stepsize for the leapfrog updates that will be small enough to give a reasonably high probability of accepting trajectories, but not unnecessarily small, which would waste computation time. NUTS addresses this issue by adapting the stepsize during a burn-in period, so as to achieve a specified acceptance probability. For example, if one assumes that the distribution being sampled can be viewed as consisting approximately of many replications of some distribution, the optimal acceptance rate is 0.65 (see Section 4.4 of my review), so this might be the target of the adaptation.

I expect that this will work well for many problems. However, in some problems it may fail disastrously. This could happen if the stepsize that produces an average acceptance probability of 0.65 is much smaller in some regions of the distribution than in others. If the sampler is started in the region that allows a large stepsize, the adaptation may converge on that large stepsize. The sampler will then very likely never visit the other region, where the stepsize needs to be much smaller, and the answers obtained will be drastically wrong.

You might think that instead the sampler would at some iteration move to the region that needs a small stepsize, and then get stuck there, as all the proposals are rejected — a problem that is easily diagnosed. Unfortunately, that’s not what will happen. Since the sampler leaves the correct distribution invariant, if it would be stuck for a very long time in some region, it must also be very unlikely to enter that region. This is a potential problem for any global Metropolis algorithm, for which acceptance probabilities can be very low if a bad proposal distribution is chosen, especially in high dimensions, but is particularly severe for HMC, since a stepsize that is beyond the stability limit for the dynamics will produce exponential growth in the error, and an extremely small acceptance probability. See Section 4.2 of my review for further discussion.

One way of alleviating this problem is to choose the stepsize to use for a trajectory randomly, from some moderately broad distribution. One could still adaptively choose the mean of this distribution. Occasional choices of a small stepsize would allow the sampler to enter regions that require such a small stepsize, and either force the mean of the stepsize distribution down, or at least allow the problem to be diagnosed from the presence of some long runs of rejected trajectories.

I will discuss another reason to randomly choose the stepsize or number of leapfrog steps in a second post on NUTS (coming, I hope, in a few days), in which I will discuss the empirical evaluations done in the NUTS paper. This post will also shed some light on whether the possibility of a premature U-Turn that I discuss above is likely to be a problem in practice.

Entry filed under: Monte Carlo Methods, Statistics, Statistics - Computing. Tags: .

1.Matt Hoffman | 2012-01-22 at 11:15 amThanks for the detailed and thoughtful comments! I’m looking forward to part II.

I’ll respond quickly to a couple of points:

I actually think that the u-turn criterion sometimes being met early may sometimes be an asset, not a liability. If the trajectory makes a u-turn before it’s fully explored the least-constrained direction, that may be a sign that that particular trajectory isn’t exploring the least-constrained direction very efficiently. It may be desirable to cut those trajectories short, since they will require more leapfrog steps to travel a given distance along the least-constrained direction than a trajectory where the initial momentum is better aligned with the least-constrained direction. For example, when running NUTS on the 250-D multivariate normal in the NUTS paper (whose covariance matrix’s condition number is about 1300^2) we saw NUTS choosing a large range of trajectory lengths, many of which were no doubt too short to fully explore the least-constrained dimension, but nonetheless achieving substantially better computation-normalized performance than HMC. Fortunately, too-short trajectories are cheap relative to long-enough trajectories.

As to the problems associated with using a single global step size (even one that’s been optimized for the target distribution), I can only agree that this is a potentially serious issue for some (real-world) target distributions. Unfortunately, randomizing the step size used by NUTS is more expensive than for HMC, since the number of steps that NUTS takes is (usually) inversely proportional to the step size. So if we randomly choose a step size near 0, then that iteration of NUTS will take a long, long time. An alternative might be to interleave NUTS iterations with Langevin (or short HMC) iterations with a randomly chosen step size. This at least would give us some hope of getting into a region where our NUTS step size leads to an unacceptable rejection rate, which would allow the problem to be diagnosed and addressed.

We’re also actively looking at/for alternatives to the use of a single global step size (or a single step size distribution independent of position). One option is to incorporate a (hopefully not too costly) variant on Girolami & Calderhead’s methods based on Riemannian geometry, but there may be others. Preserving detailed balance is, of course, the great challenge.

2.John Salvatier | 2012-01-22 at 4:00 pmHello, I have a question: what are the advantages of focusing on no-u-turn samplers instead of 2nd derivative based samplers. It seems like 2nd derivative based samplers should be able to bypass a lot of the problems here by in effect making all of the dimensions unity scaled, but perhaps I’m missing something critical.

3.Radford Neal | 2012-01-22 at 4:39 pmWell, I think that methods based on using a matrix of second-derivatives will not be feasible in many applications, when finding a suitable matrix is hard, or when the dimensionality is large enough that you wouldn’t want to use it anyway.

4.John Salvatier | 2012-01-22 at 4:48 pmTrue. Though in lots of cases, you should be able to approximate the matrix by assuming the matrix is diagonal for some dimensions and then modifying the computation so that the inversion ignores those bits.

5.John Salvatier | 2012-01-22 at 4:49 pmI can easily see that being complicated, though.

6.Evaluation of NUTS — more comments on the paper by Hoffman and Gelman « Radford Neal's blog | 2012-01-27 at 4:31 pm[…] Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”, available from arxiv.org. In my first post, I discussed how well the two main innovations in this “NUTS’” method — ending […]

7.Longhai Li | 2012-02-23 at 12:04 amI think using the optimal length of trajectory, which is often very long and dominates a whole MCMC simulation, may not be “optimal” either in Bayesian models using hyperparameters. As you mentioned in your thesis, spending much time in searching a distant point with hmc slows down the updating frequency of hyperparameters. High frequent updating of hypers is essential for the whole Markov chain to move between modes, because HMC cannot do it. So I think we don’t want to use overly long trajectory, especially we don’t the trajectory reverse. The method presented in the paper may choose a premature trajectory, but it can always guard against overly long trajectory.

8.» Interesting discussion of Hoffman & Gelman’s HMC paper Randomized Blocker | 2012-07-18 at 1:49 pm[…] don’t have too much to add to Radford Neal’s in-depth commentary on Hoffman and Gelman’s “No-U-Turn Sampler”. However, I would note that setting […]

9.KHOA TRAN | 2016-04-21 at 2:55 amI just discovered your blog and simply want to quickly shout out that I am a huge fan of the ingenuity behind the 2003 slice sampler.

I hope that you may be able to improvise something of that magnitude in the context of Hamiltonian dynamics.

Best regards,

Khoa.