## Non-random MCMC

In my post on MCMC simulation as a random permutation (paper available at arxiv.org here), I mentioned that this view of MCMC also has implications for the role of randomness in MCMC. This has also been discussed in a recent paper by Iain Murray and Lloyd Elliott on Driving Markov chain Monte Carlo with a dependent random stream.

For the simple case of Gibbs sampling for a continuous distribution, Murray and Elliott’s procedure is the same as mine, except that they do not have the updates of extra variables needed to produce a volume-preserving map. These extra variables are relevant for my importance sampling application, but not for what I’ll discuss here. The method is a simple modification of the usual Gibbs sampling procedure, assuming that sampling from conditional distributions is done by inverting their CDFs (a common method for many standard distributions). It turns out that after this modification, one can often eliminate the random aspect of the simulation and still get good results!

We can do standard Gibbs sampling using a sequence of values s0, s1, … that are independent random variables, uniformly distributed on [0,1). Suppose the state has components x(1), x(2), …, x(D). Each Gibbs sampling update will change one component of this state, by drawing a new value from its conditional distribution given the current values of the other components. If step i of the Gibbs sampling simulation should draw a new value for x(j), we can implement this by just setting the new value of x(j) to F-1(si), where F is the cumulative distribution function (CDF) for the conditional distribution of x(j), and F-1 is its inverse.

In the modified procedure, we introduce an additional state variable, u, which in the equilibrium distribution is independent of x and uniformly distributed on [0,1). If step i should be a Gibbs sampling update for x(j), we implement this by first setting u to u+si modulo 1 (ie, just the fractional part of the sum of u and si), then setting x(j) to F-1(u), and finally setting u to F(xjold), where xjold is the previous value of xj, before the update just performed.

If si is truly random, this has exactly the same effect as the standard Gibbs sampling procedure, since adding a uniform [0,1) random variable to anything modulo 1 will produce a uniform [0,1) random quantity. Then what’s the point? Well, in my paper, I consider parallel simulation of many chains, all using the same si, in which case the modified procedure prevents the chains from coalescing to the same state, which would eliminate any benefit from parallelism. Murray and Elliott consider the situation where the si are not truly random, and show that even quite drastic dependencies among the si have no effect on the results.  Indeed, they show good results when all si are the same (in a more complex context than simple Gibbs sampling). They also consider the possibility that non-random si might actually give better results than random si, but they present only a toy example of this.

I have also considered non-random si. I first looked at Gibbs sampling for the discrete Ising model, and found that fixing all si to some values works well, but that fixing them to some other values gives incorrect results. I then tried Gibbs sampling for a truncated bivariate normal distribution, with means of 0, standard deviations of 1, correlation 0.95, truncated to (-1,2.5)x(-1.5,2)). Here is part of Figure 8 in my paper: The top plot shows six chains (in different colours) with different random initializations of x and u, but that thereafter follow deterministic dynamics with all si set to 0. The bottom plot shows the same except that all si are set to 0.017.

Clearly, fixing all the si to 0 doesn’t give the correct result, either within one chain, or averaging over several chains. Each chain moves around an “orbit” that does not sample the whole distribution, and may include some values of very low probability.

Perhaps surprisingly, however, fixing all the si to a value slightly greater than 0 does give correct results. And in fact, the estimates for the means of the variables are actually more accurate than when the si are set randomly. Even better results are obtained when all si are fixed to 0.211, but the plots are not as easy to interpret. In the plot with si fixed to 0.017, one can see that the chain moves consistently upwards for many steps, and then moves rapidly down, before resuming its upward motion. This contrasts with the random-walk behaviour of standard Gibbs sampling, which exhibits frequent inefficient reversals of direction.

In its avoidance of a random walk, this method resembles overrelaxation methods, though the dynamics of overrelaxation are quite different. Some preliminary experiments indicate that Gibbs sampling with the si set to a small value may be better than overrelaxation.  In particular, it works well for a three-dimensional normal distribution with strong negative correlation, for which overrelaxation does very poorly.

It may not be too hard to analyse this method when it’s applied to a multivariate normal distribution. When all si are set to 0, each update can be seen as a linear map, if we replace u by Φ-1(u), where Φ is the standard normal CDF. The composition of these linear maps for the updates of all the components will be another linear map. This explains why the updates for a bivariate normal with all si set to 0 lead to x moving around an ellipse, which is the projection of a circle of rotation in three dimensions. You can see such ellipses in the plot above, when they are far from the truncation boundaries. (I don’t understand yet why the points remain on a one-dimensional orbit when the truncation has an effect, which makes the maps non-linear.) Of course, we then need to move from all si being 0 to them all being something else.

I think there’s a lot more to understand about such non-random methods, and that a better understanding may lead to substantial improvements in Markov chain Monte Carlo (or whatever it ought to be called if it’s not random!).

Entry filed under: Monte Carlo Methods, Statistics, Statistics - Computing.

• 1. Mark Johnson  |  2012-05-16 at 8:02 am

It’s very interesting that this deterministic modification of Gibbs sampling works at all, let alone so well, on this small example. But deterministic quadrature rules can be efficient in low dimensions. How well do these methods work in higher dimensions?

• 2. Radford Neal  |  2012-05-17 at 12:48 pm

That’s an interesting question. I’ve just tried a 10-dimensional truncated multivariate normal, with mean vector zero and a covariance matrix built as follows (in R):
``` X = rbind( c(1, 3, 3, 2, 1, 0, 1), c(2, 1, 3, 2, 1, 1, 1), c(2, 1, 2, 2, 1, 3, 1), c(1,-1,-2,-3,-1,-2,-1), c(0, 3, 0, 3, 1, 1, 0), c(2, 0,-2,-3, 0,-2,-1), c(2, 0,-2,-3, 2, 2,-1), c(1, 3, 2,-3, 2, 2,-1), c(0, 2, 0, 0, 5, 0, 0), c(0, 0, 3, 0, 0,-4, 0)) cov = X %*% t(X) + diag(10) ```
I truncated this distribution to minimum values of -10 for all coordinates and maximum values of 5, 6, …, 14.

I simulated 250 parallel Gibbs sampling chains, started from points drawn uniformly from [-1,1]10, in several ways. First, using standard Gibbs sampling, with the 250 chains being independent. Second using permutation Gibbs sampling, with the same random values for s0, s1, … for all chains (but different initial states). Finally, I tried fixing all the si to a single value, either 0.0073, 0.0187, 0.0518, or 0.1377. I simulated 400 burnin iterations for each chain, followed by 10000 iterations taken to be from equilibrium.

Using these 250×10000 states, I computed estimates of the expected values of each coordinate (which aren’t zero, due to truncation) and of the expected values of the squares of the coordinates, along with estimated standard errors for these estimates, based on the variation over the 250 chains. The results are plotted here. As you can see from the first and third plots, all the methods produce very consistent results. The standard errors of the methods differ, however. Standard Gibbs sampling and permuation Gibbs sampling with random si are very similar, but the standard errors for the coordinates are smaller by roughly a factor of two when si is fixed at 0.0073, corresponding to about a four times efficiency advantage over standard Gibbs sampling. Fixing the si at 0.0187 gives a smaller advantage, and the advantage is smaller still when the si are fixed at 0.0518. When the si are fixed at 0.1377, the results are very close to standard Gibbs sampling and permutation Gibbs sampling with random si.

The standard errors for the estimates of the expected values of the squares of the coordinates are all about the same, except that they are sometimes bigger when the si are fixed at 0.0073.

So it seems that fixing all the si to 0.0187 gives clearly better results than standard Gibbs sampling.

• 3. Nathan Bishop  |  2012-09-16 at 10:47 pm

Do you mean setting u to F^{-1}(x_j^{old}), where x is the previous value?

• 4. Radford Neal  |  2012-09-17 at 9:40 am

No. That wouldn’t make sense, since x needn’t be bounded by 0 and 1, and so may not be a valid argument of F inverse. Plus u is supposed to be in [0,1).

• 5. Nathan Bishop  |  2012-09-17 at 8:17 pm

Thank you for your answer. My background is physics so my questions might sound trivial, but what is the insight in putting u equal to F^{-1}(x_j^{old})? Can I use a linear combination of F^{-1}(x_j^{old}) and F^{-1}(x_j^{older})?

• 6. Anonymous  |  2013-11-14 at 9:51 am

Sorry for my 5c and it is not area of my expertise, but this reminded me of “low discrepancy numbers” technique used in finance for Monte Carlo methods. Are there any parallels with non-random MCMC?

• 7. Radford Neal  |  2013-11-14 at 1:06 pm

I’m not really familiar with what they do in finance, but this sounds like an instance of “Quasi Monte Carlo”, which people have indeed tried to apply to MCMC (see the Chen, Dick, and Owen reference in my paper). I think there must be some connections that could be explored.