## Non-random MCMC

*2012-05-03 at 11:37 pm · 7 comments*

In my post on MCMC simulation as a random permutation (paper available at arxiv.org here), I mentioned that this view of MCMC also has implications for the role of randomness in MCMC. This has also been discussed in a recent paper by Iain Murray and Lloyd Elliott on *Driving Markov chain Monte Carlo with a dependent random stream*.

For the simple case of Gibbs sampling for a continuous distribution, Murray and Elliott’s procedure is the same as mine, except that they do not have the updates of extra variables needed to produce a volume-preserving map. These extra variables are relevant for my importance sampling application, but not for what I’ll discuss here. The method is a simple modification of the usual Gibbs sampling procedure, assuming that sampling from conditional distributions is done by inverting their CDFs (a common method for many standard distributions). It turns out that after this modification, one can often eliminate the random aspect of the simulation and still get good results!

We can do standard Gibbs sampling using a sequence of values s_{0}, s_{1}, … that are independent random variables, uniformly distributed on [0,1). Suppose the state has components x^{(1)}, x^{(2)}, …, x^{(D)}. Each Gibbs sampling update will change one component of this state, by drawing a new value from its conditional distribution given the current values of the other components. If step i of the Gibbs sampling simulation should draw a new value for x^{(j)}, we can implement this by just setting the new value of x^{(j)} to F^{-1}(s_{i}), where F is the cumulative distribution function (CDF) for the conditional distribution of x^{(j)}, and F^{-1} is its inverse.
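To make this concrete, here is a minimal sketch (my illustration, not code from either paper) of inverse-CDF Gibbs sampling for a bivariate normal with correlation 0.95, whose conditional distributions are normal with mean ρ times the other component and standard deviation sqrt(1-ρ²):

```python
# Sketch of inverse-CDF Gibbs sampling for a bivariate normal with
# rho = 0.95.  Each conditional distribution is normal with mean
# rho * x_other and standard deviation sqrt(1 - rho^2), so F^{-1} is
# available in closed form via the normal inverse CDF.
from statistics import NormalDist
import random

rho = 0.95
sigma = (1 - rho**2) ** 0.5

def gibbs_update(x, j, s):
    """Set component j to F^{-1}(s), where F is the CDF of the
    conditional distribution of x[j] given the other component."""
    cond = NormalDist(mu=rho * x[1 - j], sigma=sigma)
    x[j] = cond.inv_cdf(s)

rng = random.Random(1)
x = [0.0, 0.0]
samples = []
for i in range(20000):
    for j in (0, 1):
        gibbs_update(x, j, rng.random())  # s_i independent uniform on [0,1)
    samples.append(tuple(x))

# The sample correlation should come out close to rho = 0.95.
n = len(samples)
mx = sum(a for a, _ in samples) / n
my = sum(b for _, b in samples) / n
cxy = sum((a - mx) * (b - my) for a, b in samples) / n
vx = sum((a - mx) ** 2 for a, _ in samples) / n
vy = sum((b - my) ** 2 for _, b in samples) / n
corr = cxy / (vx * vy) ** 0.5
print(round(corr, 2))
```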

In the modified procedure, we introduce an additional state variable, u, which in the equilibrium distribution is independent of x and uniformly distributed on [0,1). If step i should be a Gibbs sampling update for x^{(j)}, we implement this by first setting u to u+s_{i} modulo 1 (i.e., just the fractional part of the sum of u and s_{i}), then setting x^{(j)} to F^{-1}(u), and finally setting u to F(x^{(j)}_{old}), where x^{(j)}_{old} is the previous value of x^{(j)}, before the update just performed.
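In code, the modification is only a few lines; a sketch (again my illustration, with an assumed bivariate normal target, not the paper's code):

```python
# Sketch of the modified Gibbs update: an extra state variable u is kept,
# and each update (i) sets u to (u + s_i) mod 1, (ii) sets x[j] to F^{-1}(u),
# and (iii) sets u to F(x_j_old), where F is the conditional CDF used in (ii)
# and x_j_old is the value of x[j] before the update.
from statistics import NormalDist
import random

rho = 0.95
sigma = (1 - rho**2) ** 0.5

def modified_update(x, u, j, s):
    cond = NormalDist(mu=rho * x[1 - j], sigma=sigma)
    u = (u + s) % 1.0          # (i) add s_i to u, modulo 1
    x_old = x[j]
    x[j] = cond.inv_cdf(u)     # (ii) new value by inverting the conditional CDF
    u = cond.cdf(x_old)        # (iii) reset u from the OLD value of x[j]
    return u

# With truly random s_i this behaves exactly like standard Gibbs sampling.
rng = random.Random(2)
x, u = [0.0, 0.0], 0.37
for i in range(1000):
    for j in (0, 1):
        u = modified_update(x, u, j, rng.random())
print(0.0 <= u < 1.0)
```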

If s_{i} is truly random, this has exactly the same effect as the standard Gibbs sampling procedure, since adding a uniform [0,1) random variable to anything modulo 1 will produce a uniform [0,1) random quantity. Then what’s the point? Well, in my paper, I consider parallel simulation of many chains, all using the same s_{i}, in which case the modified procedure prevents the chains from coalescing to the same state, which would eliminate any benefit from parallelism. Murray and Elliott consider the situation where the s_{i} are not truly random, and show that even quite drastic dependencies among the s_{i} have no effect on the results. Indeed, they show good results when all s_{i} are the same (in a more complex context than simple Gibbs sampling). They also consider the possibility that non-random s_{i} might actually give *better* results than random s_{i}, but they present only a toy example of this.
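The coalescence point can be checked directly. In this sketch (an illustrative experiment of my own, not from either paper), two chains driven by an identical random stream merge under standard Gibbs sampling but stay apart under the modified procedure:

```python
# Two chains driven by the SAME stream s_i, for a bivariate normal with
# rho = 0.95.  Under standard Gibbs sampling the difference between chains
# shrinks by a factor rho^2 per sweep, so they coalesce; the modified
# procedure, with its per-chain u variable, keeps them apart.
from statistics import NormalDist
import random

rho = 0.95
sigma = (1 - rho**2) ** 0.5

def cond(other):
    return NormalDist(mu=rho * other, sigma=sigma)

def std_sweep(x, s):
    for j, sj in zip((0, 1), s):
        x[j] = cond(x[1 - j]).inv_cdf(sj)

def mod_sweep(x, u, s):
    for j, sj in zip((0, 1), s):
        c = cond(x[1 - j])
        u = (u + sj) % 1.0
        x_old = x[j]
        x[j] = c.inv_cdf(u)
        u = c.cdf(x_old)
    return u

rng = random.Random(3)
a, b = [1.0, -1.0], [-1.0, 1.0]      # standard Gibbs chains
am, bm = [1.0, -1.0], [-1.0, 1.0]    # modified-procedure chains
ua, ub = 0.2, 0.8                    # their u variables
for i in range(200):
    s = (rng.random(), rng.random())  # stream values shared by all chains
    std_sweep(a, s)
    std_sweep(b, s)
    ua = mod_sweep(am, ua, s)
    ub = mod_sweep(bm, ub, s)

std_dist = max(abs(p - q) for p, q in zip(a, b))
mod_dist = max(abs(p - q) for p, q in zip(am, bm))
print(std_dist < 1e-6, mod_dist > 1e-4)
```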

I have also considered non-random s_{i}. I first looked at Gibbs sampling for the discrete Ising model, and found that fixing all s_{i} to some values works well, but that fixing them to other values gives incorrect results. I then tried Gibbs sampling for a truncated bivariate normal distribution, with means of 0, standard deviations of 1, correlation 0.95, truncated to (-1,2.5)x(-1.5,2). Here is part of Figure 8 in my paper:

The top plot shows six chains (in different colours), started with different random initializations of x and u, that thereafter follow deterministic dynamics with all s_{i} set to 0. The bottom plot shows the same, except that all s_{i} are set to 0.017.

Clearly, fixing all the s_{i} to 0 doesn’t give the correct result, either within one chain, or averaging over several chains. Each chain moves around an “orbit” that does not sample the whole distribution, and may include some values of very low probability.

Perhaps surprisingly, however, fixing all the s_{i} to a value slightly greater than 0 *does* give correct results. And in fact, the estimates for the means of the variables are actually more accurate than when the s_{i} are set randomly. Even better results are obtained when all s_{i} are fixed to 0.211, but the plots are not as easy to interpret. In the plot with s_{i} fixed to 0.017, one can see that the chain moves consistently upwards for many steps, and then moves rapidly down, before resuming its upward motion. This contrasts with the random-walk behaviour of standard Gibbs sampling, which exhibits frequent inefficient reversals of direction.
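For readers who want to reproduce the flavour of this experiment, here is a sketch of the deterministic dynamics with all s_{i} fixed at 0.017 (my illustrative code with a fixed starting point, not the code behind the figure). The conditional distributions of the truncated bivariate normal are one-dimensional truncated normals, whose CDFs and inverse CDFs can be written in terms of the normal CDF:

```python
# Deterministic modified Gibbs sampling for the truncated bivariate normal
# (rho = 0.95, truncated to (-1,2.5) x (-1.5,2)), with every s_i fixed
# at 0.017.  Each conditional is a one-dimensional truncated normal, so
# its CDF and inverse CDF are built from the untruncated normal CDF.
from statistics import NormalDist

rho = 0.95
sigma = (1 - rho**2) ** 0.5
bounds = [(-1.0, 2.5), (-1.5, 2.0)]

def trunc_cdf(x, m, a, b):
    d = NormalDist(mu=m, sigma=sigma)
    return (d.cdf(x) - d.cdf(a)) / (d.cdf(b) - d.cdf(a))

def trunc_inv_cdf(u, m, a, b):
    d = NormalDist(mu=m, sigma=sigma)
    return d.inv_cdf(d.cdf(a) + u * (d.cdf(b) - d.cdf(a)))

x, u = [0.0, 0.0], 0.5   # illustrative fixed initialization
path = []
for i in range(2000):
    for j in (0, 1):
        a, b = bounds[j]
        m = rho * x[1 - j]
        u = (u + 0.017) % 1.0           # fixed, non-random stream value
        x_old = x[j]
        x[j] = trunc_inv_cdf(u, m, a, b)
        u = trunc_cdf(x_old, m, a, b)
    path.append(tuple(x))

# Every state stays inside the truncation region; the long-run averages
# approximate the (truncated) means of the two coordinates.
tail = path[500:]
print(round(sum(p for p, _ in tail) / len(tail), 2),
      round(sum(q for _, q in tail) / len(tail), 2))
```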

In its avoidance of a random walk, this method resembles overrelaxation methods, though the dynamics of overrelaxation are quite different. Some preliminary experiments indicate that Gibbs sampling with the s_{i} set to a small value may be better than overrelaxation. In particular, it works well for a three-dimensional normal distribution with strong negative correlation, for which overrelaxation does very poorly.

It may not be too hard to analyse this method when it’s applied to a multivariate normal distribution. When all s_{i} are set to 0, each update can be seen as a linear map, if we replace u by Φ^{-1}(u), where Φ is the standard normal CDF. The composition of these linear maps for the updates of all the components will be another linear map. This explains why the updates for a bivariate normal with all s_{i} set to 0 lead to x moving around an ellipse, which is the projection of a circle of rotation in three dimensions. You can see such ellipses in the plot above, when they are far from the truncation boundaries. (I don’t understand yet why the points remain on a one-dimensional orbit when the truncation has an effect, which makes the maps non-linear.) Of course, we then need to move from all s_{i} being 0 to them all being something else.
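The s_{i} = 0 case can be checked numerically. For the untruncated bivariate normal, writing v = Φ^{-1}(u) and σ = sqrt(1-ρ²), the update of the first coordinate becomes x^{(1)} ← ρx^{(2)} + σv followed by v ← (x^{(1)}_{old} - ρx^{(2)})/σ, which is linear in (x^{(1)}, x^{(2)}, v). The following sketch (my own check, not from the paper) verifies that the composed sweep preserves volume and conserves a quadratic form, so each orbit lies on an ellipsoid, projecting to an ellipse in x:

```python
# Numerical check of the linear-map view for the untruncated bivariate
# normal with rho = 0.95 and all s_i = 0, in coordinates (x1, x2, v)
# with v = Phi^{-1}(u).
rho = 0.95
sigma = (1 - rho**2) ** 0.5

def sweep(state):
    """One full Gibbs sweep with s_i = 0; each line is one coordinate update."""
    x1, x2, v = state
    x1, v = rho * x2 + sigma * v, (x1 - rho * x2) / sigma   # update x1
    x2, v = rho * x1 + sigma * v, (x2 - rho * x1) / sigma   # update x2
    return (x1, x2, v)

# The matrix of the (linear) sweep, built by applying it to basis vectors.
cols = [sweep(e) for e in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
M = [[cols[j][i] for j in range(3)] for i in range(3)]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def Q(state):
    # A quadratic form preserved by each update (check by substitution);
    # its level sets are ellipsoids, so orbits are bounded.
    x1, x2, v = state
    return x1 * x1 + x2 * x2 - 2 * rho * x1 * x2 + sigma * sigma * v * v

state = start = (1.0, 0.5, 0.2)
for i in range(5000):
    state = sweep(state)

# |det| = 1 (volume preservation) and Q is conserved along the orbit.
print(round(abs(det3(M)), 6), abs(Q(state) - Q(start)) < 1e-6)
```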

I think there’s a lot more to understand about such non-random methods, and that a better understanding may lead to substantial improvements in Markov chain Monte Carlo (or whatever it ought to be called if it’s not random!).

Entry filed under: Monte Carlo Methods, Statistics, Statistics - Computing.

1. Mark Johnson | 2012-05-16 at 8:02 am

It's very interesting that this deterministic modification of Gibbs sampling works at all, let alone so well, on this small example. But deterministic quadrature rules can be efficient in low dimensions. How well do these methods work in higher dimensions?

2. Radford Neal | 2012-05-17 at 12:48 pm

That's an interesting question. I've just tried a 10-dimensional truncated multivariate normal, with mean vector zero and a covariance matrix built as follows (in R):

    X = rbind( c(1, 3, 3, 2, 1, 0, 1),
               c(2, 1, 3, 2, 1, 1, 1),
               c(2, 1, 2, 2, 1, 3, 1),
               c(1,-1,-2,-3,-1,-2,-1),
               c(0, 3, 0, 3, 1, 1, 0),
               c(2, 0,-2,-3, 0,-2,-1),
               c(2, 0,-2,-3, 2, 2,-1),
               c(1, 3, 2,-3, 2, 2,-1),
               c(0, 2, 0, 0, 5, 0, 0),
               c(0, 0, 3, 0, 0,-4, 0))

    cov = X %*% t(X) + diag(10)

I truncated this distribution to minimum values of -10 for all coordinates and maximum values of 5, 6, …, 14.

I simulated 250 parallel Gibbs sampling chains, started from points drawn uniformly from [-1,1]^{10}, in several ways. First, using standard Gibbs sampling, with the 250 chains being independent. Second, using permutation Gibbs sampling, with the same random values for s_{0}, s_{1}, … for all chains (but different initial states). Finally, I tried fixing all the s_{i} to a single value, either 0.0073, 0.0187, 0.0518, or 0.1377. I simulated 400 burn-in iterations for each chain, followed by 10000 iterations taken to be from equilibrium.

Using these 250×10000 states, I computed estimates of the expected values of each coordinate (which aren't zero, due to truncation) and of the expected values of the squares of the coordinates, along with estimated standard errors for these estimates, based on the variation over the 250 chains. The results are plotted here. As you can see from the first and third plots, all the methods produce very consistent results. The standard errors of the methods differ, however. Standard Gibbs sampling and permutation Gibbs sampling with random s_{i} are very similar, but the standard errors for the coordinates are smaller by roughly a factor of two when s_{i} is fixed at 0.0073, corresponding to about a four times efficiency advantage over standard Gibbs sampling. Fixing the s_{i} at 0.0187 gives a smaller advantage, and the advantage is smaller still when the s_{i} are fixed at 0.0518. When the s_{i} are fixed at 0.1377, the results are very close to standard Gibbs sampling and permutation Gibbs sampling with random s_{i}.

The standard errors for the estimates of the expected values of the squares of the coordinates are all about the same, except that they are sometimes bigger when the s_{i} are fixed at 0.0073.

So it seems that fixing all the s_{i} to 0.0187 gives clearly better results than standard Gibbs sampling.

3. Nathan Bishop | 2012-09-16 at 10:47 pm

Do you mean setting u to F^{-1}(x_j^{old}), where x_j^{old} is the previous value?

4. Radford Neal | 2012-09-17 at 9:40 am

No. That wouldn't make sense, since x needn't be bounded by 0 and 1, and so may not be a valid argument of F inverse. Plus u is supposed to be in [0,1).

5. Nathan Bishop | 2012-09-17 at 8:17 pm

Thank you for your answer. My background is physics, so my questions might sound trivial, but what is the insight in putting u equal to F(x_j^{old})? Can I use a linear combination of F(x_j^{old}) and F(x_j^{older})?

6. Anonymous | 2013-11-14 at 9:51 am

Sorry for my 5c, and this is not my area of expertise, but this reminded me of the "low discrepancy numbers" technique used in finance for Monte Carlo methods. Are there any parallels with non-random MCMC?

7. Radford Neal | 2013-11-14 at 1:06 pm

I'm not really familiar with what they do in finance, but this sounds like an instance of "Quasi Monte Carlo", which people have indeed tried to apply to MCMC (see the Chen, Dick, and Owen reference in my paper). I think there must be some connections that could be explored.