In my post on MCMC simulation as a random permutation (paper available at arxiv.org here), I mentioned that this view of MCMC also has implications for the role of randomness in MCMC. This has also been discussed in a recent paper by Iain Murray and Lloyd Elliott on Driving Markov chain Monte Carlo with a dependent random stream.
For the simple case of Gibbs sampling for a continuous distribution, Murray and Elliott’s procedure is the same as mine, except that they do not have the updates of extra variables needed to produce a volume-preserving map. These extra variables are relevant for my importance sampling application, but not for what I’ll discuss here. The method is a simple modification of the usual Gibbs sampling procedure, assuming that sampling from conditional distributions is done by inverting their CDFs (a common method for many standard distributions). It turns out that after this modification, one can often eliminate the random aspect of the simulation and still get good results!
We can do standard Gibbs sampling using a sequence of values s₀, s₁, … that are independent random variables, uniformly distributed on [0,1). Suppose the state has components x(1), x(2), …, x(D). Each Gibbs sampling update will change one component of this state, by drawing a new value from its conditional distribution given the current values of the other components. If step i of the Gibbs sampling simulation should draw a new value for x(j), we can implement this by just setting the new value of x(j) to F⁻¹(sᵢ), where F is the cumulative distribution function (CDF) for the conditional distribution of x(j), and F⁻¹ is its inverse.
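As a concrete sketch, here is this standard procedure for a bivariate normal with correlation 0.95 (my illustrative choice of target, matching the experiment later in the post; the conditional of one component given the other is normal, so its CDF is easy to invert):

```python
# Sketch of standard Gibbs sampling by CDF inversion, for a bivariate
# normal with means 0, standard deviations 1, and correlation rho.
# (This particular target is chosen just for illustration.)

import random
from statistics import NormalDist

rho = 0.95
cond_sd = (1 - rho**2) ** 0.5        # sd of x(j) given the other component
std_normal = NormalDist()

def cond_inv_cdf(u, other):
    """F^{-1}(u) for the conditional distribution of x(j) given the
    other component's current value (conditional mean is rho*other)."""
    return rho * other + cond_sd * std_normal.inv_cdf(u)

def gibbs(n_sweeps, rng):
    x = [0.0, 0.0]
    samples = []
    for _ in range(n_sweeps):
        for j in (0, 1):
            s_i = rng.random()       # independent uniform [0,1) value
            x[j] = cond_inv_cdf(s_i, x[1 - j])
        samples.append(tuple(x))
    return samples

samples = gibbs(20000, random.Random(1))
```

The sample averages and correlation should come out close to the target's.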
In the modified procedure, we introduce an additional state variable, u, which in the equilibrium distribution is independent of x and uniformly distributed on [0,1). If step i should be a Gibbs sampling update for x(j), we implement this by first setting u to u+sᵢ modulo 1 (i.e., just the fractional part of the sum of u and sᵢ), then setting x(j) to F⁻¹(u), and finally setting u to F applied to the old value of x(j) (its value before the update just performed).
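In code, the modified update looks like this (again sketched for the bivariate normal above, which is my illustrative choice and not part of the method itself):

```python
# Sketch of the modified Gibbs update with the auxiliary variable u,
# for a bivariate normal with means 0, sds 1, and correlation rho.
# Update of component j:  u <- (u + s_i) mod 1,  x(j) <- F^{-1}(u),
# then u <- F(old value of x(j)).

import random
from statistics import NormalDist

rho = 0.95
cond_sd = (1 - rho**2) ** 0.5
std_normal = NormalDist()

def cond_cdf(xj, other):
    """F(x(j)) for the conditional distribution given the other component."""
    return std_normal.cdf((xj - rho * other) / cond_sd)

def cond_inv_cdf(u, other):
    return rho * other + cond_sd * std_normal.inv_cdf(u)

def modified_update(x, u, j, s_i):
    """One modified Gibbs update of component j, returning new (x, u)."""
    u = (u + s_i) % 1.0              # shift u by s_i, modulo 1
    x_old_j = x[j]
    x = list(x)
    x[j] = cond_inv_cdf(u, x[1 - j])
    u = cond_cdf(x_old_j, x[1 - j])  # new u from the OLD value of x(j)
    return x, u

# With truly random s_i, this behaves just like standard Gibbs sampling.
rng = random.Random(2)
x, u = [0.0, 0.0], rng.random()
samples = []
for _ in range(20000):
    for j in (0, 1):
        x, u = modified_update(x, u, j, rng.random())
    samples.append(tuple(x))
```

With a random stream this should reproduce the same equilibrium distribution as the standard procedure, which the moment checks below reflect.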
If sᵢ is truly random, this has exactly the same effect as the standard Gibbs sampling procedure, since adding a uniform [0,1) random variable to anything modulo 1 will produce a uniform [0,1) random quantity. Then what’s the point? Well, in my paper, I consider parallel simulation of many chains, all using the same sᵢ, in which case the modified procedure prevents the chains from coalescing to the same state, which would eliminate any benefit from parallelism. Murray and Elliott consider the situation where the sᵢ are not truly random, and show that even quite drastic dependencies among the sᵢ have no effect on the results. Indeed, they show good results when all sᵢ are the same (in a more complex context than simple Gibbs sampling). They also consider the possibility that non-random sᵢ might actually give better results than random sᵢ, but they present only a toy example of this.
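The coalescence point is easy to check numerically. In this toy demonstration of my own (bivariate normal conditionals as before), several chains driven by the same uniform stream collapse together under standard inverse-CDF Gibbs sampling, but stay distinct under the modified procedure, each chain having its own u:

```python
# Chains driven by a SHARED stream s_i: standard inverse-CDF Gibbs
# coalesces them, while the modified procedure (per-chain u) does not.
# Target: bivariate normal, means 0, sds 1, correlation rho.

import random
from statistics import NormalDist

rho = 0.95
cond_sd = (1 - rho**2) ** 0.5
std_normal = NormalDist()

def cond_cdf(xj, other):
    return std_normal.cdf((xj - rho * other) / cond_sd)

def cond_inv_cdf(u, other):
    return rho * other + cond_sd * std_normal.inv_cdf(u)

rng = random.Random(3)
n_chains, n_sweeps = 4, 200

std_chains = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(n_chains)]
mod_chains = [[c[0], c[1]] for c in std_chains]       # same starting points
mod_u = [rng.random() for _ in range(n_chains)]       # per-chain u

for _ in range(n_sweeps):
    for j in (0, 1):
        s_i = rng.random()            # ONE shared value per update
        for c in range(n_chains):
            # standard Gibbs: the shared s_i drives every chain directly
            std_chains[c][j] = cond_inv_cdf(s_i, std_chains[c][1 - j])
            # modified Gibbs: s_i only shifts this chain's own u
            u = (mod_u[c] + s_i) % 1.0
            x_old = mod_chains[c][j]
            mod_chains[c][j] = cond_inv_cdf(u, mod_chains[c][1 - j])
            mod_u[c] = cond_cdf(x_old, mod_chains[c][1 - j])

def spread(chains):
    """Largest pairwise L1 distance among the chains' states."""
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a in chains for b in chains)
```

For the standard chains, the difference between any two states shrinks by a factor of rho at every component update, so after 200 sweeps they are numerically identical; the modified chains remain well separated.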
I have also considered non-random sᵢ. I first looked at Gibbs sampling for the discrete Ising model, and found that fixing all sᵢ to some values works well, but that fixing them to other values gives incorrect results. I then tried Gibbs sampling for a truncated bivariate normal distribution, with means of 0, standard deviations of 1, and correlation 0.95, truncated to (−1,2.5)×(−1.5,2). Here is part of Figure 8 in my paper:
The top plot shows six chains (in different colours) with different random initializations of x and u, which thereafter follow deterministic dynamics with all sᵢ set to 0. The bottom plot shows the same, except that all sᵢ are set to 0.017.
Clearly, fixing all the sᵢ to 0 doesn’t give the correct result, either within one chain, or averaging over several chains. Each chain moves around an “orbit” that does not sample the whole distribution, and may include some values of very low probability.
Perhaps surprisingly, however, fixing all the sᵢ to a value slightly greater than 0 does give correct results. And in fact, the estimates for the means of the variables are actually more accurate than when the sᵢ are set randomly. Even better results are obtained when all sᵢ are fixed to 0.211, but the plots are not as easy to interpret. In the plot with sᵢ fixed to 0.017, one can see that the chain moves consistently upwards for many steps, and then moves rapidly down, before resuming its upward motion. This contrasts with the random-walk behaviour of standard Gibbs sampling, which exhibits frequent inefficient reversals of direction.
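The experiment can be sketched as follows. The truncated-normal conditional CDF and its inverse are the standard formulas; the run lengths, the random initialization, and the sanity check against a random-stream run are my own choices here, not taken from the paper:

```python
# Sketch of the truncated bivariate normal experiment: means 0, sds 1,
# correlation 0.95, truncated to (-1, 2.5) x (-1.5, 2).  Each conditional
# is a truncated normal, sampled by inverting its CDF; in the
# deterministic run, all s_i are fixed to 0.017.

import random
from statistics import NormalDist

rho = 0.95
cond_sd = (1 - rho**2) ** 0.5
bounds = [(-1.0, 2.5), (-1.5, 2.0)]   # truncation box for (x(1), x(2))
std_normal = NormalDist()

def cond_cdf(xj, other, j):
    lo, hi = bounds[j]
    m = rho * other
    plo = std_normal.cdf((lo - m) / cond_sd)
    phi = std_normal.cdf((hi - m) / cond_sd)
    return (std_normal.cdf((xj - m) / cond_sd) - plo) / (phi - plo)

def cond_inv_cdf(u, other, j):
    lo, hi = bounds[j]
    m = rho * other
    plo = std_normal.cdf((lo - m) / cond_sd)
    phi = std_normal.cdf((hi - m) / cond_sd)
    p = plo + u * (phi - plo)
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against rounding at boundaries
    return m + cond_sd * std_normal.inv_cdf(p)

def run(n_sweeps, s_fn, rng):
    x = [rng.uniform(*bounds[0]), rng.uniform(*bounds[1])]
    u = rng.random()
    samples = []
    for _ in range(n_sweeps):
        for j in (0, 1):
            u = (u + s_fn()) % 1.0
            x_old = x[j]
            x[j] = cond_inv_cdf(u, x[1 - j], j)
            u = cond_cdf(x_old, x[1 - j], j)
        samples.append(tuple(x))
    return samples

rng = random.Random(4)
fixed = run(20000, lambda: 0.017, rng)   # deterministic stream
randm = run(20000, rng.random, rng)      # ordinary random stream
```

If the fixed-sᵢ chain mixes as the figure suggests, its samples should cover the truncation box and its mean estimates should agree with the random-stream run.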
In its avoidance of a random walk, this method resembles overrelaxation methods, though the dynamics of overrelaxation are quite different. Some preliminary experiments indicate that Gibbs sampling with the sᵢ set to a small value may be better than overrelaxation. In particular, it works well for a three-dimensional normal distribution with strong negative correlation, for which overrelaxation does very poorly.
It may not be too hard to analyse this method when it’s applied to a multivariate normal distribution. When all sᵢ are set to 0, each update can be seen as a linear map, if we replace u by Φ⁻¹(u), where Φ is the standard normal CDF. The composition of these linear maps for the updates of all the components will be another linear map. This explains why the updates for a bivariate normal with all sᵢ set to 0 lead to x moving around an ellipse, which is the projection of a circle of rotation in three dimensions. You can see such ellipses in the plot above, when they are far from the truncation boundaries. (I don’t understand yet why the points remain on a one-dimensional orbit when the truncation has an effect, which makes the maps non-linear.) Of course, we then need to move from all sᵢ being 0 to them all being something else.
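Here is the sort of calculation I mean, worked through numerically for the untruncated bivariate normal (the explicit form of the two update matrices is my own derivation from the update rule, under the substitution v = Φ⁻¹(u)):

```python
# With all s_i = 0 and v = Phi^{-1}(u), the modified update of each
# component of a bivariate normal (means 0, sds 1, correlation rho)
# is linear in (x1, x2, v):
#   update of x1:  x1' = rho*x2 + sd*v,   v' = (x1 - rho*x2)/sd
#   update of x2:  x2' = rho*x1 + sd*v,   v' = (x2 - rho*x1)/sd
# where sd = sqrt(1 - rho^2).  (Derived here, not quoted from the paper.)

rho = 0.95
sd = (1 - rho**2) ** 0.5

# The two updates as 3x3 matrices acting on column vectors (x1, x2, v).
M1 = [[0.0,     rho,     sd ],    # x1 update
      [0.0,     1.0,     0.0],
      [1/sd,   -rho/sd,  0.0]]
M2 = [[1.0,     0.0,     0.0],    # x2 update
      [rho,     0.0,     sd ],
      [-rho/sd, 1/sd,    0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def det3(A):
    return (A[0][0] * (A[1][1]*A[2][2] - A[1][2]*A[2][1])
          - A[0][1] * (A[1][0]*A[2][2] - A[1][2]*A[2][0])
          + A[0][2] * (A[1][0]*A[2][1] - A[1][1]*A[2][0]))

M = matmul(M2, M1)    # one full sweep: x1 update, then x2 update

# Iterating the composed sweep keeps the state on a bounded orbit,
# consistent with rotation-like dynamics rather than expansion or decay.
state = [1.0, 0.5, 0.2]
orbit = []
for _ in range(2000):
    state = [sum(M[i][k] * state[k] for k in range(3)) for i in range(3)]
    orbit.append(state[0])
```

Each single update has determinant −1 (it swaps the roles of one x component and v), so the composed sweep has determinant 1, and the trajectory of (x1, x2, v) neither collapses nor diverges; the x values trace the ellipses seen in the plot.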
I think there’s a lot more to understand about such non-random methods, and that a better understanding may lead to substantial improvements in Markov chain Monte Carlo (or whatever it ought to be called if it’s not random!).