<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Radford Neal&#039;s blog</title>
	<atom:link href="http://radfordneal.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://radfordneal.wordpress.com</link>
	<description></description>
	<lastBuildDate>Sun, 29 Jan 2012 02:21:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='radfordneal.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Radford Neal&#039;s blog</title>
		<link>http://radfordneal.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://radfordneal.wordpress.com/osd.xml" title="Radford Neal&#039;s blog" />
	<atom:link rel='hub' href='http://radfordneal.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Evaluation of NUTS — more comments on the paper by Hoffman and Gelman</title>
		<link>http://radfordneal.wordpress.com/2012/01/27/evaluation-of-nuts-more-comments-on-the-paper-by-hoffman-and-gelman/</link>
		<comments>http://radfordneal.wordpress.com/2012/01/27/evaluation-of-nuts-more-comments-on-the-paper-by-hoffman-and-gelman/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 21:31:35 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Monte Carlo Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=1001</guid>
		<description><![CDATA[Here is my second post on the paper by Matthew Hoffman and Andrew Gelman on &#8220;The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo&#8221;, available from arxiv.org. In my first post, I discussed how well the two main innovations in this &#8220;NUTS&#8221; method: ending a trajectory when a &#8220;U-Turn&#8221; is encountered, and adaptively [...]]]></description>
			<content:encoded><![CDATA[<p>Here is my second post on the paper by Matthew Hoffman and Andrew Gelman on &#8220;The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo&#8221;, available <a href="http://arxiv.org/abs/1111.4246">from arxiv.org</a>. In my <a title="No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman" href="http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/">first post</a>, I discussed how well the two main innovations in this &#8220;NUTS&#8221; method (ending a trajectory when a &#8220;U-Turn&#8221; is encountered, and adaptively setting the stepsize) can be expected to work. In this post, I will discuss the empirical evaluations in the NUTS paper, and report on an evaluation of my own, made possible by the authors having kindly made available the <a href="http://www.cs.princeton.edu/~mdhoffma">NUTS software</a>. My conclusion is that the paper&#8217;s claims for NUTS are somewhat overstated. The issues I discuss are also of more general interest for other evaluations of HMC.<span id="more-1001"></span></p>
<p>The scripts used for the empirical evaluations in the NUTS paper are not included with the NUTS software, so in some respects I&#8217;m not certain what was done, but the description in the paper is fairly complete.</p>
<p>The foundation for any empirical assessment of MCMC methods is some way of estimating how good the MCMC estimates are, which could be the autocorrelation time, or the variance of the estimate, or the effective sample size (ESS, the equivalent number of independent points). The NUTS paper uses autocorrelation time, from which they compute the effective sample size.  They estimate autocorrelations using means and variances found from a very long run, not from the run being evaluated, which avoids one possible source of drastically wrong results in such evaluations. (For the zero-mean multivariate normal example I discuss below, they presumably plugged in the true mean of zero and the true variances, rather than use a long run.) As their measure of overall performance for a sampler, they used its worst performance (in terms of autocorrelation time or ESS) over all variables (looking at both mean and variance estimates for these variables). Focusing on the worst estimate seems reasonable in a way, but looking at only this one number could be misleading if the estimate that is worst is different for different samplers.</p>
<p>One aspect of their evaluation (as described in the paper) is incorrect. In the Appendix, they say that they estimate autocorrelation time by summing autocorrelation estimates at all lags up to the first lag at which the autocorrelation estimate is less than 0.05. Since the true autocorrelation can sometimes be negative, this is not valid in general. It would be valid for some simple random-walk Metropolis methods, for which autocorrelations are all positive, but not for HMC, for which negative autocorrelations are quite possible. However, for reversible methods (which include HMC), the sums of consecutive pairs of autocorrelations (lags 0 and 1, lags 2 and 3, and so on) are always positive. This provides the basis for a similar approach that is valid. See <a href="http://arxiv.org/abs/1011.0175">here</a> for a review and assessment of this and other approaches to estimating autocorrelation time.</p>
<p>Assuming that this is an error in the actual code (not just the paper), it could affect the empirical results quite a bit, but I&#8217;ll have to ignore this possibility in the remaining discussion.</p>
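<p>To make the difference between the two truncation rules concrete, here is a minimal sketch in Python (not the code used in the NUTS paper). The function names are hypothetical, the thresholded rule is only one reading of the description in the paper, and the test chain is an assumed AR(1) process with coefficient -0.5, for which the true autocorrelation time is (1 - 0.5) / (1 + 0.5) = 1/3. The true mean and variance are plugged in, as was presumably done for the multivariate normal example.</p>

```python
import random

def autocorr(x, lag, mean, var):
    """Lag-`lag` autocorrelation estimate, using a known mean and variance."""
    if lag == 0:
        return 1.0
    n = len(x) - lag
    total = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n))
    return total / (n * var)

def autocorr_time_pairs(x, mean, var, max_lag=1000):
    """Sum the pairs rho[0]+rho[1], rho[2]+rho[3], ... until a pair is not
    positive; 1 + 2*sum(rho[k], k >= 1) then equals 2*(sum of pairs) - 1."""
    pair_total = 0.0
    lag = 0
    while max_lag >= lag + 1:
        pair = autocorr(x, lag, mean, var) + autocorr(x, lag + 1, mean, var)
        if not pair > 0:
            break
        pair_total += pair
        lag += 2
    return 2.0 * pair_total - 1.0

def autocorr_time_threshold(x, mean, var, cutoff=0.05, max_lag=1000):
    """One reading of the thresholded rule: sum lags 1, 2, ... and stop at
    the first lag whose estimate falls below `cutoff` (that lag excluded)."""
    tau = 1.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(x, lag, mean, var)
        if cutoff > rho:
            break
        tau += 2.0 * rho
    return tau

def ar1_chain(n, phi, seed=1):
    """Stationary AR(1) chain with mean 0 and variance 1 (assumed test case)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    scale = (1.0 - phi * phi) ** 0.5
    for _ in range(n - 1):
        x.append(phi * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

chain = ar1_chain(50000, -0.5)
tau_pairs = autocorr_time_pairs(chain, 0.0, 1.0)
tau_threshold = autocorr_time_threshold(chain, 0.0, 1.0)
print(tau_pairs, tau_threshold)  # the true value is 1/3
```

<p>On this chain the lag-1 autocorrelation is about -0.5, so the thresholded rule stops immediately and reports an autocorrelation time of 1, roughly three times the true value, while the pairwise rule should land close to 1/3.</p>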
<p>The first distribution that Hoffman and Gelman test NUTS on is a 250-dimensional multivariate normal distribution with mean zero and a covariance matrix randomly drawn from an inverse Wishart distribution. They claim that NUTS performs better on this distribution than HMC (by a factor of about three) even when HMC is optimally tuned. This seems strange, since NUTS is just an automated way of tuning HMC. As I&#8217;ll show below, this claim is indeed likely to be due to deficiencies in the experimental methodology.</p>
<p>Due to rotational invariance, performance of HMC on a multivariate normal is determined by the set of eigenvalues of the covariance matrix, with the inherent difficulty of the problem being proportional to the square root of the ratio of the largest eigenvalue to the smallest eigenvalue. This is the ratio of the standard deviation in the least confined direction to the standard deviation in the most confined direction, and should be roughly the number of leapfrog steps in a trajectory of optimal length, using the optimal stepsize. (Performance by Hoffman and Gelman&#8217;s metric isn&#8217;t quite rotationally symmetric, however, since they look only at autocorrelation times for the 250 variables, not all linear combinations of them.) The evaluation of HMC and NUTS on this distribution would be much more informative if the eigenvalues of the covariance matrix were reported, and if the number of leapfrog steps in HMC trajectories (not just the product of stepsize and number of steps) were reported.</p>
<p>I&#8217;ll return below to the multivariate normal example, but first I&#8217;ll mention the other distributions on which HMC and NUTS are evaluated in the paper. Two of these other distributions are posterior distributions for logistic regression models. The results on these distributions are more what one might expect — NUTS performs about as well as HMC with optimal tuning. One slightly strange phenomenon is visible in the results presented in Figure 6, however. For both logistic regression problems, the performance of HMC drops drastically as the trajectory length is increased from the optimal value to a value larger by a factor of 1.5. I would expect a more gradual decline in efficiency.  It would be nice to know why this occurs.</p>
<p>The last distribution they evaluate NUTS and HMC on is the posterior distribution for  a stochastic volatility model.  Here again, they claim that NUTS performs better than HMC, even when the latter is optimally tuned. I&#8217;m not sure what is going on here, but one thing I notice in the lower left plot in Figure 6 is that the performance of NUTS seems to be bimodal — some of the runs are assessed as being three times better than the best HMC runs, but others are assessed as being worse. All I can think of to explain this bimodality is that the adaptive stepsize selection might be reaching either of two stable values. The advantage of some NUTS runs over the best HMC runs might be explained by the same factors that lead to this result for the multivariate normal.</p>
<p>Since the actual multivariate normal distribution used in the paper isn&#8217;t reported, I&#8217;ll look at performance of NUTS and HMC on another multivariate normal, the same 30-dimensional multivariate normal distribution that I used in my <a title="No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman" href="http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/">previous post</a>, in which the 30 components are independent, with standard deviations of 110, 100, 1.1, 1.0, and 26 values equally spaced between 8 and 16. This may in any case be a better demonstration example than one in which the standard deviations are chosen randomly.</p>
<p>I used the NUTS software for these experiments, with automatic stepsize adaptation for both HMC and NUTS (the adaptation always converged to a reasonable value of around 1.63 for HMC and around 1.73 for NUTS). I did 10,000 burn-in iterations, followed by 250,000 sampling iterations, which I divided into 400 batches of size 625. I estimated the mean of each variable and the second moment of each variable (equal to the variance, since the true mean is zero) for each of the 400 batches, and looked at the variance of these batch estimates as a measure of the performance of the sampler, not accounting for computation time yet. I then multiplied the variances of these batch estimates by the number of gradient evaluations for the whole run to get figures of merit that account for computation time. I used <a href="http://radfordneal.files.wordpress.com/2012/01/demomvn.doc">this Matlab script</a> for these experiments (and then moved the results to R for plotting).</p>
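<p>The batch-based figures of merit described above can be sketched as follows. This is a Python illustration, not the linked Matlab script; the function names are hypothetical, independent normal draws with standard deviation 16 stand in for sampler output, and the gradient-evaluation count is a placeholder.</p>

```python
import random
import statistics

def batch_estimates(samples, n_batches, power=1):
    """Per-batch estimates of E[x**power]: power=1 gives batch means,
    power=2 gives batch second moments."""
    size = len(samples) // n_batches
    return [statistics.fmean([v ** power for v in samples[b * size:(b + 1) * size]])
            for b in range(n_batches)]

def figures_of_merit(samples, n_batches, n_grad_evals, power=1):
    """Variance of the batch estimates, and that variance multiplied by the
    number of gradient evaluations used for the whole run."""
    var_of_batches = statistics.variance(batch_estimates(samples, n_batches, power))
    return var_of_batches, var_of_batches * n_grad_evals

rng = random.Random(0)
draws = [rng.gauss(0.0, 16.0) for _ in range(250000)]   # stand-in for sampler output
grad_evals = 250000 * 200                               # placeholder count
var_mean, cost_mean = figures_of_merit(draws, 400, grad_evals, power=1)
var_second, cost_second = figures_of_merit(draws, 400, grad_evals, power=2)
print(var_mean, var_second)
```

<p>For independent draws the variance of the batch means should be close to 16&#178; / 625, about 0.41, which gives a baseline against which a correlated sampler can be compared.</p>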
<p>Here are the results for estimating the means of each of the 30 variables (numbered 1 to 30 by decreasing standard deviation), for HMC with trajectory lengths of 100, 170, and 200 leapfrog steps (and hence trajectory lengths of about 163, 277, and 326 time units since the stepsize is about 1.63):</p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="demo-plot1a" src="http://radfordneal.files.wordpress.com/2012/01/demo-plot1a.gif?w=435" alt="" width="435" /></p>
<p>The left plot gives the variances of the estimates, not accounting for computation time; the right plot adjusts for computation time (gradient evaluations). Smaller is better in both plots. Note that the vertical scale is logarithmic, with horizontal lines at each power of ten.</p>
<p>These results look rather peculiar! Notice how the variances of the mean estimates for variables 3 to 28, with standard deviations declining from 16 to 8, don&#8217;t decline as one might expect, but rather go up and down, with peaks at positions that change with the trajectory length.</p>
<p>This behaviour is due to a phenomenon discussed in Section 3.2 of <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">my review of Hamiltonian Monte Carlo</a>. Hamiltonian dynamics for this simple distribution without dependencies between variables will be periodic for each variable. If the period for a variable happens to match (or be near) the trajectory length, the end-point of a trajectory will be close to the start point, with the result that sampling for that variable is very poor. (A related problem  happens when the trajectory length is half the period.) Note that this problem can arise for any multivariate normal distribution, with the poorly-explored directions being the eigenvectors of the covariance matrix, not necessarily the individual variables.</p>
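<p>As a rough check on this explanation, the exact dynamics for one normal variable can be written down directly. The sketch below (Python, hypothetical function names) assumes unit mass, so that a variable with standard deviation s follows q(t) = q0 cos(t/s) + s p0 sin(t/s), with period 2&#960;s. It ignores the small change in period caused by the leapfrog discretization, and it does not claim to reproduce the positions of the peaks in the plot.</p>

```python
import math

def exact_position(q0, p0, sd, t):
    """Position at time t under exact Hamiltonian dynamics (unit mass) for a
    zero-mean normal variable with standard deviation sd."""
    return q0 * math.cos(t / sd) + sd * p0 * math.sin(t / sd)

def distance_from_resonance(sd, t):
    """How far t is from a whole number of periods (period = 2*pi*sd)."""
    periods = t / (2.0 * math.pi * sd)
    return abs(periods - round(periods))

sds = [16.0 - 8.0 * i / 25 for i in range(26)]   # the 26 values from 16 down to 8
t = 100 * 1.63                                   # 100 leapfrog steps at stepsize 1.63
worst_sd = min(sds, key=lambda sd: distance_from_resonance(sd, t))
print(worst_sd, distance_from_resonance(worst_sd, t))

# After exactly one period the trajectory returns to its starting position.
print(exact_position(2.0, 0.7, worst_sd, 2.0 * math.pi * worst_sd))
```

<p>With these assumptions, a trajectory of 163 time units is almost exactly three periods for the variable with standard deviation 8.64, so the end-point of such a trajectory would nearly coincide with its start point for that variable.</p>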
<p>The cure for this is to randomly vary the trajectory length (randomly varying the stepsize, or the number of leapfrog steps, or both) over some moderate range. I modified the hmc_da function in the NUTS software to do this, changing the computation of the number of leapfrog steps from</p>
<p style="padding-left:30px;"><tt>L = max(1, round(lambda / epsilon));</tt></p>
<p>to</p>
<p style="padding-left:30px;"><tt>L = max(1, round((0.9 + rand/5) * lambda / epsilon));</tt></p>
<p>which has the effect of multiplying the number of leapfrog steps by a random factor uniformly distributed between 0.9 and 1.1.</p>
<p>Here are the results after this modification:</p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="demo-plot1b" src="http://radfordneal.files.wordpress.com/2012/01/demo-plot1b.gif?w=435" alt="" width="435" /></p>
<p style="text-align:left;">The variances of the mean estimates still go up and down a bit (maybe a wider random range would be better), but much less than before.</p>
<p style="text-align:left;">So we see that one can get a misleadingly bad impression of the performance of HMC on a multivariate normal distribution from experiments in which the trajectory length is not varied randomly. On less simple distributions, the exact periodicities that happen with a multivariate normal may not be present, but there can be approximate periodicities, which might still happen to nearly match a fixed trajectory length, with a consequent bad effect on performance. This is why randomly varying the trajectory length is a recommended part of standard HMC methodology. (However, this is unnecessary with &#8220;windowed&#8221; HMC, as long as the window size is a non-negligible fraction of the trajectory length.)</p>
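<p style="text-align:left;">For readers not using Matlab, the randomized number of leapfrog steps can be sketched in Python as below. The function name and the <tt>jitter</tt> parameter are hypothetical; <tt>jitter=0.1</tt> corresponds to the factor 0.9 + rand/5 used in the modified hmc_da line. Python rounds halves to even, so the sketch uses floor(x + 0.5) to mimic the Matlab <tt>round</tt> for positive values.</p>

```python
import math
import random

def leapfrog_steps(lam, epsilon, rng, jitter=0.1):
    """Number of leapfrog steps for a target trajectory length `lam` and
    stepsize `epsilon`, multiplied by a factor uniform on [1-jitter, 1+jitter)."""
    factor = (1.0 - jitter) + 2.0 * jitter * rng.random()
    return max(1, math.floor(factor * lam / epsilon + 0.5))

rng = random.Random(0)
steps = [leapfrog_steps(326.0, 1.63, rng) for _ in range(10000)]
print(min(steps), max(steps))
```

<p style="text-align:left;">With a trajectory length of 326 and a stepsize of 1.63, the number of steps then varies between 180 and 220, with a median near 200.</p>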
<p style="text-align:left;">So, how does NUTS perform on this distribution? Here are the results, together, for comparison, with the variances of mean estimates from an equal-size sample of points drawn independently (omitted for the computational cost plot, where it would not be meaningful):</p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="demo-plot1c" src="http://radfordneal.files.wordpress.com/2012/01/demo-plot1c.gif?w=435" alt="" width="435" /></p>
<p style="text-align:left;">For the first two variables, with the largest standard deviations, the variances of the NUTS estimates of the mean are about 30 times bigger than the variances when the sampled points are independent.  Compared to HMC with random trajectory length with median 200, these NUTS estimates have a variance that is about 25 times bigger. After adjusting for computational cost, the NUTS estimates of the means of these two variables are 7 times less efficient than the HMC estimates.</p>
<p style="text-align:left;">NUTS looks much better for the other variables, with smaller standard deviations. Indeed, the NUTS estimates are often better than those obtained with independent sampling! This is possible because NUTS (and HMC) can produce negative autocorrelations. However, for negative correlations to show up with these variables, NUTS must often be using trajectories that are much shorter than are needed for efficient exploration of the large-standard-deviation variables, which of course is consistent with it estimating the means of those variables inefficiently.</p>
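<p style="text-align:left;">That negative autocorrelations can beat independent sampling is easy to check on a toy chain. The sketch below (Python, hypothetical names) uses an assumed stationary AR(1) process with coefficient -0.5 and unit variance, not NUTS output. For such a chain the variance of the mean of n points is roughly (1/3) / n, compared with 1 / n for independent points.</p>

```python
import math
import random
import statistics

def ar1_mean(n, phi, rng):
    """Mean of n successive points of a stationary AR(1) chain with mean 0,
    variance 1 and lag-one autocorrelation phi (phi = 0 gives independence)."""
    scale = math.sqrt(1.0 - phi * phi)
    x = rng.gauss(0.0, 1.0)
    total = x
    for _ in range(n - 1):
        x = phi * x + scale * rng.gauss(0.0, 1.0)
        total += x
    return total / n

rng = random.Random(2)
n, reps = 200, 2000
var_negative = statistics.variance([ar1_mean(n, -0.5, rng) for _ in range(reps)])
var_independent = statistics.variance([ar1_mean(n, 0.0, rng) for _ in range(reps)])
print(var_negative * n, var_independent * n)
```

<p style="text-align:left;">The first number printed should be near 1/3 and the second near 1, so in this toy case the correlated chain is worth about three times as many independent points.</p>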
<p style="text-align:left;">Here are plots showing the performance of HMC with random trajectory length and of NUTS when estimating the variances (rather than the means) of the 30 variables:</p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="demo-plot2b" src="http://radfordneal.files.wordpress.com/2012/01/demo-plot2b.gif?w=435" alt="" width="435" /></p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="demo-plot2c" src="http://radfordneal.files.wordpress.com/2012/01/demo-plot2c.gif?w=435" alt="" width="435" /></p>
<p style="text-align:left;">HMC is about a factor of two more efficient than NUTS at estimating the variances of the two variables with the largest standard deviations, but is less efficient at estimating the variances of the other variables.</p>
<p style="text-align:left;">One might think that these results show that there&#8217;s a tradeoff: HMC is better for variables with large standard deviation, and NUTS is better for variables with small standard deviation. But I think this isn&#8217;t really so. First, it is often the absolute magnitude of the variance of estimates that is important, and in that respect HMC is clearly better. (Consider, for instance, estimating regression coefficients for covariates that have been standardized.) When that&#8217;s not so, that is, when you need precise estimates of the means of variables that have small standard deviations, it&#8217;s easy to alternate HMC updates designed to sample well for the variables with large standard deviations with other, much cheaper, updates designed to sample well for the variables with small standard deviations (e.g., univariate slice sampling updates, or just HMC with a smaller trajectory length). Using the windowed HMC method would also help, since rejected transitions in windowed HMC will usually move a small distance, helping with the sampling for variables with small standard deviations. (Note: as mentioned before, due to the rotational invariance of HMC, all this discussion applies also when the &#8220;variables&#8221; are actually various directions in the space that aren&#8217;t necessarily the coordinate axes being used.)</p>
<p style="text-align:left;">One conclusion I draw from this evaluation is that the better performance of NUTS over HMC (optimally tuned) shown in the evaluations in the NUTS paper is likely an artifact of not randomizing the HMC trajectory length, and of evaluating performance by the worst effective sample size for any variable (not accounting for some variables being easy to handle by other methods). Another conclusion is that the premature U-Turn problem that I discussed in <a title="No U-Turns for Hamiltonian Monte Carlo – comments on a paper by Hoffman and Gelman" href="http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/">my first post on NUTS</a> can be a problem in practice, since otherwise NUTS would estimate the means and variances of the large-standard-deviation variables in this example better. As I discussed there, I think there is scope for improving the way that NUTS decides when to stop a trajectory, thereby making it a more generally useful sampling method.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2012/01/27/evaluation-of-nuts-more-comments-on-the-paper-by-hoffman-and-gelman/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/demo-plot1a.gif" medium="image">
			<media:title type="html">demo-plot1a</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/demo-plot1b.gif" medium="image">
			<media:title type="html">demo-plot1b</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/demo-plot1c.gif" medium="image">
			<media:title type="html">demo-plot1c</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/demo-plot2b.gif" medium="image">
			<media:title type="html">demo-plot2b</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/demo-plot2c.gif" medium="image">
			<media:title type="html">demo-plot2c</media:title>
		</media:content>
	</item>
		<item>
		<title>No U-Turns for Hamiltonian Monte Carlo &#8211; comments on a paper by Hoffman and Gelman</title>
		<link>http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/</link>
		<comments>http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/#comments</comments>
		<pubDate>Sat, 21 Jan 2012 04:38:16 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Monte Carlo Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=898</guid>
		<description><![CDATA[Matthew Hoffman and Andrew Gelman recently posted a paper called &#8220;The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo&#8221; on arxiv.org. It has been discussed on Andrew&#8217;s blog. It&#8217;s a good paper, which addresses two big barriers to wider use of Hamiltonian Monte Carlo: the difficulties of tuning the trajectory length and tuning [...]]]></description>
			<content:encoded><![CDATA[<p>Matthew Hoffman and Andrew Gelman recently posted a paper called &#8220;The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo&#8221; <a href="http://arxiv.org/abs/1111.4246">on arxiv.org</a>. It has been discussed <a href="http://andrewgelman.com/2011/11/stan-uses-nuts/">on Andrew&#8217;s blog</a>.</p>
<p>It&#8217;s a good paper, which addresses two big barriers to wider use of <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">Hamiltonian Monte Carlo</a> — the difficulties of tuning the trajectory length and tuning the stepsize to use when simulating a trajectory. The name &#8220;No-U-Turn Sampler&#8221; (NUTS) comes from their way of addressing the problem of tuning the trajectory length — repeatedly double the length of the current trajectory, until (simplifying a bit) there is a part of the trajectory that makes a &#8220;U-Turn&#8221;, heading back towards its starting point. This doubling method is clever, and (as I discuss below) one aspect of it seems useful even apart from any attempt to adaptively set the trajectory length. They also introduce a method of adapting the stepsize during the burn-in period, so as to achieve some desired acceptance probability.</p>
<p>However, I don&#8217;t think these are completely satisfactory ways of setting trajectory lengths and stepsizes. As I&#8217;ll discuss below, these problems are more complicated than they might at first appear.<span id="more-898"></span></p>
<p>The biggest advantage of HMC over simple random-walk Metropolis or Gibbs sampling methods is that it can propose to move to a distant point, with a high probability of acceptance. This gets rid of the inefficiency of a random walk, in which taking <em>n</em> small steps in random directions is likely to move you only about √<em>n</em> steps away from where you started. HMC can propose the end-point of a trajectory found using <em>n</em> steps that move more-or-less consistently in one direction, ending up about <em>n</em> steps from the starting point. To get the full advantage of this, the trajectory has to be long enough (but not much longer) that <em>in the least constrained direction</em> the end-point of the trajectory is distant from the start point. This could require that a trajectory be much longer than the length at which it starts to move back towards the starting point, which (roughly speaking) is the point where the No-U-Turn method would stop.</p>
<p>Such premature U-Turns can occur when some directions are highly constrained, limiting how big the stepsize can be, while some other directions are much less constrained, and hence will take many steps to move across, and yet other directions are constrained to an intermediate degree. In  directions with intermediate constraints, the trajectory can reverse direction long before the least constrained direction has been explored. This can produce a &#8220;U-Turn&#8221; when the trajectory is much shorter than is optimal.</p>
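<p>The contrast drawn earlier between a random walk, which moves about √<em>n</em> steps in <em>n</em> steps, and consistent movement, which covers about <em>n</em> steps, can be checked with a small simulation. This is a one-dimensional sketch in Python with steps of +1 or -1 and a hypothetical function name. It illustrates only the random-walk scaling and is not a simulation of Metropolis or HMC.</p>

```python
import math
import random

def rms_random_walk_distance(n_steps, n_walks, rng):
    """Root-mean-square distance from the start after n_steps steps of +1 or
    -1 chosen at random, averaged over n_walks independent walks."""
    total_sq = 0
    for _ in range(n_walks):
        position = sum(1 if rng.random() > 0.5 else -1 for _ in range(n_steps))
        total_sq += position * position
    return math.sqrt(total_sq / n_walks)

rng = random.Random(3)
n = 400
rms = rms_random_walk_distance(n, 2000, rng)
print(rms, math.sqrt(n), n)
```

<p>With 400 steps the random walk ends up around 20 steps from its start on average, whereas 400 steps taken in a consistent direction would cover a distance of 400.</p>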
<p>Consider an example of sampling from a 30-dimensional multivariate normal distribution in which the 30 components are independent, with standard deviations of 110, 100, 1.1, 1.0, and 26 values equally spaced between 8 and 16. (Since the operation of HMC is invariant to rotation, behaviour on this distribution is the same as on any 30-dimensional multivariate normal for which the square roots of the eigenvalues of the covariance matrix have these values.) The smallest standard deviation of 1 limits the stepsize to less than 2 (at which point the dynamics becomes unstable). A stepsize of 1.5 gives reasonably good results. Exploring the less constrained directions where the standard deviations are 110 and 100 will then require hundreds of steps.</p>
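<p>The stepsize limit mentioned above can be seen by running the leapfrog method on a single normal variable. The Python sketch below is an illustration with hypothetical names and is not the HMC routine linked later in this post. It builds the list of 30 standard deviations, then compares the energy error for a variable with standard deviation 1 at stepsize 1.5, below the limit of 2, and at an assumed stepsize of 2.1, just above it.</p>

```python
def max_energy_error(q, p, sd, eps, n_steps):
    """Run n_steps leapfrog steps (unit mass) for a zero-mean normal variable
    with standard deviation sd; return the largest absolute change in the
    Hamiltonian q*q/(2*sd*sd) + p*p/2 seen along the trajectory."""
    def energy(q, p):
        return q * q / (2.0 * sd * sd) + p * p / 2.0
    h0 = energy(q, p)
    worst = 0.0
    for _ in range(n_steps):
        p -= 0.5 * eps * q / (sd * sd)   # half step for momentum
        q += eps * p                     # full step for position
        p -= 0.5 * eps * q / (sd * sd)   # half step for momentum
        worst = max(worst, abs(energy(q, p) - h0))
    return worst

# Standard deviations of the 30 components in the example.
sds = [110.0, 100.0] + [16.0 - 8.0 * i / 25 for i in range(26)] + [1.1, 1.0]
print(len(sds), max(sds) / min(sds))

err_stable = max_energy_error(1.0, 0.0, 1.0, 1.5, 50)
err_unstable = max_energy_error(1.0, 0.0, 1.0, 2.1, 50)
print(err_stable, err_unstable)
```

<p>At stepsize 1.5 the energy error stays bounded, while at stepsize 2.1 it grows exponentially with the number of steps, which would make acceptance of such a trajectory essentially impossible.</p>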
<p>I simulated six trajectories from starting points randomly drawn from this distribution.  Each trajectory was for 300 leapfrog steps with a stepsize of 1.5, a duration of 450 time units.  Below is a plot (produced using <a href="http://radfordneal.files.wordpress.com/2012/01/dist-tst.doc">this R script</a>, which calls <a href="http://radfordneal.files.wordpress.com/2012/01/hmc-dist4.doc">this HMC routine</a>) of the distance from the start point for the points after each leapfrog step in these trajectories (trajectories that were rejected are plotted as dotted lines).</p>
<p style="text-align:center;"><img class="aligncenter  wp-image-964" title="dist-tst-out-w" src="http://radfordneal.files.wordpress.com/2012/01/dist-tst-out-w.gif?w=435&#038;h=397" alt="" width="435" height="397" /></p>
<p>As one can see, the distance from the start point often does not increase monotonically for the time needed to reach a distant point. It&#8217;s hard to tell from this plot exactly what will happen when using the No-U-Turn Sampler, since it looks for U-Turns in subtrees, which don&#8217;t always have the same start point as the entire trajectory. It&#8217;s clear, though, that it will be quite possible for NUTS to stop doubling prematurely, before reaching a trajectory length that will produce good movement in the least-constrained direction. However, we also see that occasionally the distance from the starting point does increase monotonically for a suitable time, so perhaps NUTS will at least occasionally use a suitable trajectory length.</p>
<p>I think there is scope for improving NUTS in this respect, by finding some better criterion for stopping the doubling process. One simple modification that might be desirable is to simply refuse to stop doubling until some minimum trajectory length is reached — a minimum length of eight might be about right. This will be less efficient for problems in which the optimal trajectory length is smaller than this, but such problems are easy anyway, and will still be easy even if done up to eight times less efficiently.</p>
<p>Once it has stopped doubling, NUTS has an interesting way of sampling a point from the final trajectory.  Rather than just picking a point randomly from amongst those that are eligible, it first tries to move to the half of the trajectory that doesn&#8217;t contain the starting point, then if that is rejected, to move to the quarter of the trajectory that doesn&#8217;t contain the starting point but is in the same half as the starting point, and so forth, attempting to move away from the starting point by successively smaller amounts. This should produce greater movement on average than randomly selecting a point that might, just by chance, turn out to be close to the starting point.</p>
<p>This technique ought to also be applicable to my &#8220;windowed HMC&#8221; method (see Section 5.4 of my <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">review</a>), most obviously when the window size is half the trajectory length. It seems that it should be easy to try this out by just modifying the NUTS program to do a fixed number of doublings, ignoring whether the trajectories are doing U-Turns or not. The discussion in the NUTS paper mentions that some of the advantage seen for NUTS over standard HMC might be due to it having some characteristics of windowed HMC. Comparing adaptive NUTS with NUTS with a fixed number of doublings (essentially an improved version of windowed HMC) would separate these effects.</p>
<p>The other big issue when tuning HMC is how to choose a stepsize for the leapfrog updates that will be small enough to give a reasonably high probability of accepting trajectories, but not unnecessarily small, which would waste computation time. NUTS addresses this issue by adapting the stepsize during a burn-in period, so as to achieve a specified acceptance probability. For example, if one assumes that the distribution being sampled can be viewed as consisting approximately of many replications of some distribution, the optimal acceptance rate is 0.65 (see Section 4.4 of my <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">review</a>), so this might be the target of the adaptation.</p>
<p>I expect that this will work well for many problems. However, in some problems it may fail disastrously. This could happen if the stepsize that produces an average acceptance probability of 0.65 is much smaller in some regions of the distribution than in others. If the sampler is started in the region that allows a large stepsize, the adaptation may converge on that large stepsize. The sampler will then very likely never visit the other region, where the stepsize needs to be much smaller, and the answers obtained will be drastically wrong.</p>
<p>You might think that instead the sampler would at some iteration move to the region that needs a small stepsize, and then get stuck there, as all the proposals are rejected — a problem that is easily diagnosed. Unfortunately, that&#8217;s not what will happen. Since the sampler leaves the correct distribution invariant, if it would be stuck for a very long time in some region, it must also be very unlikely to enter that region. This is a potential problem for any global Metropolis algorithm, for which acceptance probabilities can be very low if a bad proposal distribution is chosen, especially in high dimensions, but is particularly severe for HMC, since a stepsize that is beyond the stability limit for the dynamics will produce exponential growth in the error, and an extremely small acceptance probability. See Section 4.2 of my <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">review</a> for further discussion.</p>
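<p>The stability limit can be seen in the simplest possible case. For a one-dimensional normal distribution with standard deviation sigma and unit mass, leapfrog steps are stable only when the stepsize is below 2*sigma. The following Python sketch (the function name and starting point are assumptions for the example) tracks the error in the Hamiltonian on either side of that limit.</p>

```python
def energy_error(eps, n_steps, sigma=1.0, q=1.0, p=1.0):
    """Change in H = q^2/(2 sigma^2) + p^2/2 after n_steps leapfrog steps of size eps."""
    h_start = 0.5 * (q / sigma) ** 2 + 0.5 * p ** 2
    p -= 0.5 * eps * q / sigma ** 2  # initial half step for momentum
    for step in range(n_steps):
        q += eps * p  # full step for position
        # Full momentum step, except for a half step at the end of the trajectory.
        scale = 0.5 * eps if step == n_steps - 1 else eps
        p -= scale * q / sigma ** 2
    return 0.5 * (q / sigma) ** 2 + 0.5 * p ** 2 - h_start


if __name__ == "__main__":
    for eps in (0.5, 1.9, 2.1, 2.5):
        print("sigma = 1.0, eps =", eps, "error:", energy_error(eps, 50))
    for eps in (0.19, 0.21):
        print("sigma = 0.1, eps =", eps, "error:", energy_error(eps, 50, sigma=0.1))
```

<p>With sigma = 1 the error stays bounded for stepsizes up to just under 2, but with sigma = 0.1 a stepsize of 0.21 is already unstable, and the error grows exponentially with the number of steps. A stepsize tuned in a region resembling the first case will give essentially zero acceptance probability for trajectories that pass through a region resembling the second.</p>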
<p>One way of alleviating this problem is to choose the stepsize to use for a trajectory randomly, from some moderately broad distribution. One could still adaptively choose the mean of this distribution. Occasional choices of a small stepsize would allow the sampler to enter regions that require such a small stepsize, and either force the mean of the stepsize distribution down, or at least allow the problem to be diagnosed from the presence of some long runs of rejected trajectories.</p>
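<p>A minimal Python sketch of this idea follows. The uniform distribution, its width, and the function names are assumptions chosen for illustration; any moderately broad distribution with an adjustable mean would serve.</p>

```python
import random


def draw_stepsize(mean_eps, rng, width=0.8):
    """Draw a stepsize uniformly from mean_eps*(1-width) to mean_eps*(1+width)."""
    return mean_eps * rng.uniform(1.0 - width, 1.0 + width)


def fraction_below(threshold, mean_eps, n_draws=10000, seed=1):
    """Fraction of drawn stepsizes that fall below a given threshold."""
    rng = random.Random(seed)
    draws = [draw_stepsize(mean_eps, rng) for _ in range(n_draws)]
    return sum(1 for eps in draws if threshold > eps) / n_draws


if __name__ == "__main__":
    # Suppose some region needs a stepsize below 0.3 while the adapted mean is 1.0.
    print("fraction of trajectories with eps below 0.3:", fraction_below(0.3, 1.0))
```

<p>With these settings roughly one trajectory in sixteen uses a stepsize below 0.3 times the mean, so a region that needs such a stepsize can be entered from time to time rather than essentially never.</p>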
<p>I will discuss another reason to randomly choose the stepsize or number of leapfrog steps in a second post on NUTS (coming, I hope, in a few days), in which I will discuss the empirical evaluations done in the NUTS paper. This post will also shed some light on whether the possibility of a premature U-Turn that I discuss above is likely to be a problem in practice.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2012/01/21/no-u-turns-for-hamiltonian-monte-carlo-comments-on-a-paper-by-hoffman-and-gelman/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2012/01/dist-tst-out-w.gif" medium="image">
			<media:title type="html">dist-tst-out-w</media:title>
		</media:content>
	</item>
		<item>
		<title>GRIMS — General R Interface for Markov Sampling</title>
		<link>http://radfordneal.wordpress.com/2011/06/26/grims-%e2%80%94-general-r-interface-for-markov-sampling/</link>
		<comments>http://radfordneal.wordpress.com/2011/06/26/grims-%e2%80%94-general-r-interface-for-markov-sampling/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 04:58:44 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Statistics - Computing]]></category>
		<category><![CDATA[R Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Computing]]></category>
		<category><![CDATA[Monte Carlo Methods]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=861</guid>
		<description><![CDATA[I have released a (very) preliminary version of my new MCMC software in R, which I&#8217;m calling GRIMS, for General R Interface for Markov Sampling. You can get it here. This software differs from other more-or-less general MCMC packages in several respects, all but one of which make it, I think, a much better tool [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=radfordneal.wordpress.com&amp;blog=4390751&amp;post=861&amp;subd=radfordneal&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I have released a (very) preliminary version of my new MCMC software in R, which I&#8217;m calling GRIMS, for General R Interface for Markov Sampling. You can get it <a href="http://www.cs.utoronto.ca/~radford/GRIMS.html">here</a>.</p>
<p>This software differs from other more-or-less general MCMC packages in several respects, all but one of which make it, I think, a much better tool for serious MCMC applications. Here are some highlights:<span id="more-861"></span></p>
<ul>
<li>The distribution to be sampled is defined by an R function.  This allows for full flexibility in defining models. For simple models, it is a bit clumsier than a BUGS specification, but one could address this by &#8220;compiling&#8221; a BUGS-like specification into such an R function (though I have no plans to write such a compiler myself).</li>
</ul>
<ul>
<li>The Markov chain updates are also defined by R functions, which can easily be written by the user. Of course, some updates, such as random-walk Metropolis, are pre-defined.</li>
</ul>
<ul>
<li>The interface between the R functions defining distributions and the R functions defining Markov chain updates converts between their different preferred representations of the state.  A general-purpose update function (eg, Metropolis) is most naturally written to operate on a state that is a simple vector. But a function defining the posterior distribution for a Bayesian model is most naturally written to work with a state that is a list of vectors, with named elements for various types of parameters, hyperparameters, etc.</li>
</ul>
<ul>
<li>The top-level function for running an MCMC simulation allows one to combine several different types of updates, which may each operate on only part of the state. This is often necessary for efficient MCMC sampling for a complex model.</li>
</ul>
<ul>
<li>GRIMS supports incremental computation, in which the log probability density for a state is quickly recomputed after only a subset of the state variables change. This is a crucial capability for some sampling methods, such as Metropolis updates with proposals that change a single variable in the state. This support depends, of course, on the function that computes the log density being written to cache suitable intermediate quantities, and make use of them when possible.</li>
</ul>
<ul>
<li>Updates that use the gradient of the log density are supported, provided the function defining the distribution is able to compute the gradient.</li>
</ul>
<ul>
<li>The state can be supplemented with auxiliary &#8220;momentum&#8221; variables, allowing full use of <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">sampling methods based on Hamiltonian dynamics</a>.</li>
</ul>
<ul>
<li>GRIMS has flexible facilities for specifying tuning parameters of update methods (eg, proposal standard deviations), and for collecting information (eg, acceptance rates) about how the update methods are operating.</li>
</ul>
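<p>The conversion between state representations mentioned above can be sketched as follows. This is an illustration in Python rather than R, and it is not the actual GRIMS interface; the function names and the example state are hypothetical.</p>

```python
def flatten(state):
    """Concatenate named parameter vectors into one flat list, recording the layout."""
    layout = [(name, len(values)) for name, values in state.items()]
    flat = [x for values in state.values() for x in values]
    return flat, layout


def unflatten(flat, layout):
    """Rebuild the named parameter vectors from a flat list and its layout."""
    state = {}
    start = 0
    for name, length in layout:
        state[name] = flat[start:start + length]
        start += length
    return state


if __name__ == "__main__":
    state = {"mu": [0.5], "log_sigma": [-1.0], "beta": [1.0, 2.0, 3.0]}
    flat, layout = flatten(state)
    # A general-purpose update would operate on the flat vector...
    proposal = [x + 0.1 for x in flat]
    # ...while the log density function sees named parameters again.
    print(unflatten(proposal, layout))
```

<p>The update function never needs to know the parameter names, and the function defining the distribution never needs to know the positions within the flat vector.</p>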
<p>There are additional aspects of the design (or extensions to it that I envision) that are intended to support both complex applications and complex sampling methods, including methods like <a href="http://www.cs.utoronto.ca/~radford/ttrans.abstract.html">tempered transitions</a> and <a href="http://www.cs.utoronto.ca/~radford/ensmcmc.abstract.html">ensemble MCMC</a> in which a whole series of Markov chain updates are nested within a complex outer update (though nothing that elaborate is implemented yet). It&#8217;s not completely general, however. For example, with its present design, GRIMS can&#8217;t easily handle states of varying dimensionality.</p>
<p>So what&#8217;s the one undesirable respect in which GRIMS differs from other MCMC packages? Speed. Though I haven&#8217;t quantified this yet, GRIMS is likely to be rather slow, due to the overhead of implementing the nice things listed above in R. This is one of my motivations for <a title="New patches to speed up R 2.13.0" href="http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/">speeding up R</a>, though ironically, GRIMS has ended up making heavy use of some R operations, such as subscripting lists with strings using &#8220;[[...]]&#8221;, that I haven&#8217;t (yet) looked at speeding up.</p>
<p>We&#8217;ll have to see how fast or slow it ends up. In any case, speed isn&#8217;t essential to all uses of GRIMS. Since much of my research is on new MCMC methods, I want a better environment for quickly trying out new MCMC methods on a variety of applications. I also want these new MCMC methods, as well as new Bayesian models that use MCMC, to have implementations that are easily accessible to statisticians, for which an R function is a lot better than a C program.</p>
<p>If any readers want to try out this rather unpolished <a href="http://www.cs.utoronto.ca/~radford/GRIMS.html">preliminary version of GRIMS</a> (or just read the documentation), I&#8217;d be interested in your comments.</p>
<p>UPDATE: I&#8217;ve put up a new version that fixes some bugs, adds a Gibbs sampling / overrelaxation update for normals, and adds some tests. There may still be plenty of bugs remaining.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/06/26/grims-%e2%80%94-general-r-interface-for-markov-sampling/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>Innumeracy at the Globe and Mail</title>
		<link>http://radfordneal.wordpress.com/2011/06/24/innumeracy-at-the-globe-and-mail/</link>
		<comments>http://radfordneal.wordpress.com/2011/06/24/innumeracy-at-the-globe-and-mail/#comments</comments>
		<pubDate>Fri, 24 Jun 2011 04:24:14 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Society]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Nontechnical]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=856</guid>
		<description><![CDATA[In the June 23 print edition of the Globe and Mail (billed as &#8220;Canada&#8217;s National Newspaper&#8221;), there&#8217;s an article on data centres (&#8220;Hewers of wood, storers of data&#8221;), in which, on page B4, one can read the following: Greenpeace recently released a report that said if the Internet were a country, it would be the [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=radfordneal.wordpress.com&amp;blog=4390751&amp;post=856&amp;subd=radfordneal&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>In the June 23 print edition of the Globe and Mail (billed as &#8220;Canada&#8217;s National Newspaper&#8221;), there&#8217;s an article on data centres (&#8220;Hewers of wood, storers of data&#8221;), in which, on page B4, one can read the following:</p>
<p style="padding-left:30px;">Greenpeace recently released a report that said if the Internet were a country, it would be the fifth-largest consumer of energy, largely because of the massive data centres that run unseen in the background. The group estimated that the centres will use 1.9 billion kilowatt hours of electricity by 2020 — more than the amount currently used by Canada, France, Germany and Brazil combined. (The average US home uses 8,000 kilowatt hours a year.)</p>
<p>An exercise for the reader: How many logical fallacies, arithmetic errors, or contradictions of common knowledge can you find in this passage?</p>
<p>I haven&#8217;t tried to determine whether these fallacies originate in the (unidentified) Greenpeace report, or are original to the Globe and Mail.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/06/24/innumeracy-at-the-globe-and-mail/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>Two textbooks on probability using R</title>
		<link>http://radfordneal.wordpress.com/2011/06/18/two-textbooks-on-probability-using-r/</link>
		<comments>http://radfordneal.wordpress.com/2011/06/18/two-textbooks-on-probability-using-r/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 03:34:03 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[R Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>
		<category><![CDATA[Statistics - Nontechnical]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=828</guid>
		<description><![CDATA[This fall, I&#8217;ll be teaching a second-year course on Probability with Computer Applications, which is required for Computer Science majors.  I&#8217;ve taught this before, but that was five years ago, so I&#8217;ve been looking to see what new textbooks would be suitable.  The course aims not just to use computer science applications as examples, but [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=radfordneal.wordpress.com&amp;blog=4390751&amp;post=828&amp;subd=radfordneal&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>This fall, I&#8217;ll be teaching a second-year course on Probability with Computer Applications, which is required for Computer Science majors.  I&#8217;ve taught this before, but that was five years ago, so I&#8217;ve been looking to see what new textbooks would be suitable.  The course aims not just to use computer science applications as examples, but also to reinforce concepts of probability with programs, and to show how simulation can be used to solve problems that aren&#8217;t easily solved analytically. I&#8217;ve used R for the programming part, and plan to again, so I was naturally interested in two recent textbooks that seemed to have similar aims:</p>
<p style="padding-left:30px;"><em>Introduction to Probability with R</em>, Kenneth Baclawski, Chapman &amp; Hall / CRC.</p>
<p style="padding-left:30px;"><em>Probability with R: An Introduction with Computer Science Applications</em>, Jane M. Horgan, Wiley.</p>
<p>I&#8217;ve now had a look at both of these textbooks.  Unfortunately, they are both seriously flawed.  Even more unfortunately, although some of the flaws in these books are particularly striking, I&#8217;ve seen similar, if usually less serious, problems in many other textbooks.<span id="more-828"></span></p>
<p><em>Introduction to Probability with R</em> seemed from the title and blurb to be quite promising.  I began looking at it by just flipping to a few random pages to see what it read like.  That wasn&#8217;t supposed to be a serious evaluation yet, but in only a couple minutes, I&#8217;d found two passages that pretty much eliminated it from further consideration.  Here is the introduction to parametric families of distributions on pages 56 and 57:</p>
<p style="padding-left:30px;">&#8230; The distributions within a family are distinguished from one another by &#8220;parameters&#8221;&#8230; Because of the dependence on parameters, a family of distributions is also called a <em>random</em> or <em>stochastic function</em>&#8230; Be careful not to think of a random function as a &#8220;randomly chosen function&#8221; any more than a random variable is a &#8220;randomly chosen variable.&#8221;</p>
<p>Now, I&#8217;ve <em>never</em> before seen a parametric family of distributions called a &#8220;random function&#8221; or &#8220;stochastic function&#8221;.  And I&#8217;ve quite frequently seen &#8220;random function&#8221; used to mean exactly a &#8220;randomly chosen function&#8221;.  Where the author got his terminology, I&#8217;ve no idea.  For good measure, this same passage refers to the <em>p</em> parameter of a binomial distribution as the &#8220;bias&#8221;, and has a totally pointless illustration of a bunch of gears grinding a value for <em>n</em> and a &#8220;bias&#8221; into a distribution.</p>
<p>So, maybe he uses non-standard notation, but could the content be good?  Here&#8217;s what&#8217;s on page 131:</p>
<p style="padding-left:30px;"><strong>Main Rule of Statistics</strong>.  In any statistical measurement we may assume that the individual measurements are distributed according to the normal distribution, <em>N</em>(<em>m</em>,σ<sup>2</sup>).</p>
<p style="padding-left:30px;">To use this rule, we first find the mean <em>m</em> and variance σ<sup>2</sup> from information given in our problem or by using the sample mean&#8230; and/or sample variance&#8230; defined below. We then compute using either the pnorm or the qnorm function.</p>
<p style="padding-left:30px;">As stated, the main rule says only that our results will be &#8220;reasonable&#8221; if we assume that the measurements are normally distributed. We can actually assert more. In the absence of better information, we <em>must</em> assume that a measurement is normally distributed. In other words, if several models are possible, we must use the normal model unless there is a significant reason for rejecting it.</p>
<p style="padding-left:30px;">(The author continues his avoidance of standard terminology in the passage above by denoting the sample mean of <em>x<sub>1</sub>, &#8230;, x<sub>n</sub></em> by <em>m</em> with a bar over it and the sample variance by σ<sup>2</sup> with a bar over it, which I haven&#8217;t tried to reproduce.)</p>
<p>Sometimes, an occasional incorrect passage in a textbook does no harm, if corrected, and can even make for an interesting example in lecture, but when a textbook emphatically states completely erroneous nonsense it would be a disservice to students to force them to buy it.  The passage above reads like a parody of a statistics textbook — too many of which say how important it is to think carefully about what model is appropriate for your problem, but then proceed to use a normal distribution in all the examples without any discussion.  Still, lip-service to good practice is better than nothing, and much better than insistent advocacy of bad practice.</p>
<p><em>Probability with R: An Introduction with Computer Science Applications </em>seemed from its title and blurb to be even closer to what I need for my course.  For this book, two minutes of flipping through it did not provide sufficient grounds for rejection, so I started looking at it more closely, including reading it systematically from the beginning.  I found lots of careless errors (some corrected in the on-line errata), and lots to quibble about, along with nothing that was particularly impressive, but it wasn&#8217;t until page 83 that I encountered something seriously wrong:</p>
<p style="padding-left:30px;">Returning to the birthday problem &#8230;, instead of using permutations and counting, we could view it as a series of <em>k</em> events and apply the multiplication law of probability.</p>
<p style="padding-left:30px;"><em>B<sub>i</sub></em> is the birthday of the <em>i</em>th student.</p>
<p style="padding-left:30px;"><em>E</em> is the event that all students have different birthdays.</p>
<p style="padding-left:30px;">For example with two students, the probability of different birthdays is that the second student has a birthday different from that of the first,</p>
<p style="padding-left:60px;"><img src='http://s0.wp.com/latex.php?latex=P%28E%29%5C+%3D%5C+P%28B_2%7C%5Coverline+B_1%29%5C+%3D%5C+364%2F365&amp;bg=ffffff&amp;fg=414141&amp;s=-1' alt='P(E)&#92; =&#92; P(B_2|&#92;overline B_1)&#92; =&#92; 364/365' title='P(E)&#92; =&#92; P(B_2|&#92;overline B_1)&#92; =&#92; 364/365' class='latex' /></p>
<p style="padding-left:30px;">that is, the second student can have a birthday on any of the days of the year except the birthday of the first student.</p>
<p style="padding-left:30px;">With three students, the probability that the third is different from the previous two is</p>
<p style="padding-left:60px;"><img src='http://s0.wp.com/latex.php?latex=P%28B_3%7C%5Coverline+B_1+%5Ccap+%5Coverline+B_2%29%5C+%3D%5C+363%2F365&amp;bg=ffffff&amp;fg=414141&amp;s=-1' alt='P(B_3|&#92;overline B_1 &#92;cap &#92;overline B_2)&#92; =&#92; 363/365' title='P(B_3|&#92;overline B_1 &#92;cap &#92;overline B_2)&#92; =&#92; 363/365' class='latex' /></p>
<p style="padding-left:30px;">that is, the third student can have a birthday on any of the days of the year, except the two of the previous two students.</p>
<p>The example continues like this, with equations that on the right side have numerical values that are correct (in terms of the subsequent explanation in words), while on the left side are probability statements that are complete nonsense — since of course <em>B<sub>i</sub></em>, &#8220;the birthday of the <em>i</em>th student&#8221; is <em>not</em> an event. Nor can I see any alternative definition of <em>B<sub>i</sub></em> that would lead to these probability statements making sense.</p>
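<p>The numerical values in the passage are what one gets from a well-defined application of the multiplication law. As one possible formulation (the notation here is an assumption, not the book&#8217;s), let <em>A<sub>i</sub></em> be the event that the <em>i</em>th student&#8217;s birthday differs from those of all earlier students. Then <em>E</em> is the intersection of <em>A<sub>2</sub></em>, &#8230;, <em>A<sub>k</sub></em>, and the probability of <em>A<sub>i</sub></em> given that all earlier birthdays are distinct is (365 - (<em>i</em> - 1))/365. A short Python sketch of the resulting product:</p>

```python
def prob_all_different(k, days=365):
    """Probability that k students all have different birthdays (equally likely days)."""
    prob = 1.0
    for i in range(k):
        # Student i+1 must avoid the i distinct birthdays already taken.
        prob *= (days - i) / days
    return prob


if __name__ == "__main__":
    for k in (2, 3, 10, 23, 50):
        print(k, "students:", prob_all_different(k))
```

<p>For two students this gives 364/365, and for three it gives (364/365)(363/365), matching the right sides of the equations quoted above.</p>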
<p>Maybe &#8220;anyone&#8221; could make a mistake like this, but maybe not. I do wonder whether the author actually understands elementary probability theory.  I lost all confidence in her ability to apply it in practice on reading the following on page 192, in the section on &#8220;Machine learning and the binomial distribution&#8221;:</p>
<p style="padding-left:30px;">Suppose there are three classifiers used to classify a new example. The probability that any of these classifiers correctly classifies the new case is 0.7, and therefore 0.3 of making an error. If a majority decision is made, what is the probability that the new case will be correctly classified?</p>
<p style="padding-left:30px;">Let <em>X</em> be the number of correct classifications made by the three classifiers. For a majority vote we need <em>X</em>≥2. Because the classifiers are independent, <em>X</em> follows a binomial distribution with parameters <em>n</em>=3 and <em>p</em>=0.7&#8230;</p>
<p style="padding-left:30px;">We have improved the probability of a correct classification from 0.7 with one classifier to 0.784 with three&#8230;</p>
<p style="padding-left:30px;">Obviously, by increasing the number of classifiers, we can improve classification accuracy further&#8230;</p>
<p style="padding-left:30px;">With 21 classifiers let us calculate the probability that a majority decision will be in error for various values of <em>p</em>, the probability that any one classifier will be in error&#8230;</p>
<p style="padding-left:30px;">Thus the key to successful ensemble methods is to construct individual classifiers with error rates below 0.5.</p>
<p>No, I haven&#8217;t omitted any phrase like &#8220;assuming the classifiers make independent errors&#8221;.  The author just says &#8220;because the classifiers are independent&#8221; as if this were totally obvious.  Of course, in any real ensemble of classifiers, the mistakes they make are <em>not</em> independent, and it is <em>not</em> enough to just produce lots of classifiers with error rates below 0.5.</p>
<p>Many, many textbooks give examples in which they say things like &#8220;assume that whether one patient dies after surgery is independent of whether another patient dies&#8221; without considering the many reasons why this might not be so.  But at least they do say they are making an assumption, and at least it is possible to imagine situations in which the assumption is approximately correct.  There is no realistic machine learning situation in which multiple classifiers in an ensemble will make errors independently, even approximately. The example in the book is totally misleading.</p>
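<p>To show how much the independence assumption matters, here is a small Python sketch. The first function reproduces the binomial calculation from the book. The second uses a deliberately crude, hypothetical dependence model (not meant to be realistic): with probability <em>c</em> all classifiers give the same answer, which is correct with probability <em>p</em>, and otherwise they act independently. Each classifier still has accuracy <em>p</em> on its own.</p>

```python
from math import comb


def majority_accuracy(n, p):
    """P(more than half of n independent classifiers are correct), each with accuracy p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(need, n + 1))


def majority_accuracy_shared(n, p, c):
    """Majority accuracy when, with probability c, all classifiers give one shared answer."""
    return c * p + (1 - c) * majority_accuracy(n, p)


if __name__ == "__main__":
    print("3 independent classifiers:", majority_accuracy(3, 0.7))
    for c in (0.0, 0.5, 0.9, 1.0):
        print("21 classifiers, c =", c, ":", majority_accuracy_shared(21, 0.7, c))
```

<p>Under independence, three classifiers give the book&#8217;s figure of 0.784, and more classifiers do better still. As <em>c</em> approaches 1, however, the accuracy of the majority vote falls back to 0.7 regardless of how many classifiers are used, even though every individual error rate is below 0.5.</p>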
<p>As well as being seriously flawed, neither of these books makes particularly good use of R to explain probability concepts or to demonstrate the use of simulation. The fragments of R code they contain are very short, basically using it as a calculator and plot program.</p>
<p>So, that&#8217;s it for these books.  If any readers know of good books on probability that either have good computer science applications, or use R for simulations and to clarify probability concepts, or preferably both, please let me know!</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/06/18/two-textbooks-on-probability-using-r/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>New patches to speed up R 2.13.0</title>
		<link>http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/</link>
		<comments>http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/#comments</comments>
		<pubDate>Thu, 09 Jun 2011 19:03:41 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[R Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=799</guid>
		<description><![CDATA[I have now released a new collection of 30 patches to speed up R version 2.13.0. You can get them here Assessing how much these patches speed up R is difficult. First of all, the speedup varies tremendously with the type of program. It also varies quite a bit with the machine and compiler used [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=radfordneal.wordpress.com&amp;blog=4390751&amp;post=799&amp;subd=radfordneal&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I have now released a new collection of 30 patches to speed up R version 2.13.0. You can <a href="http://www.cs.utoronto.ca/~radford/R-mods-2.13.0.html">get them here</a>.</p>
<p>Assessing how much these patches speed up R is difficult. First of all, the speedup varies tremendously with the type of program. It also varies quite a bit with the machine and compiler used to run R. Finally, it varies in apparently random ways — changing code in one part of the R interpreter can change the speed of operations that never use the modified code by plus or minus 5% or more.  This is presumably due to the change altering the exact addresses of other code segments, with consequent effects on alignment of memory fetches or on cache behaviour.</p>
<p>Nevertheless, <a href="http://radfordneal.files.wordpress.com/2011/06/mod-speedup.pdf">here is a comparison</a> of R 2.13.0 without modification and with all my patches applied, with and without compilation of R functions. The tests were done with an Intel X5680 processor running at 3.33GHz in 64-bit mode using gcc 4.4.4 under Red Hat Linux with default R configuration parameters. The tests use <a href="http://www.cs.utoronto.ca/%7Eradford/R-speed.html">my suite of speed tests for R</a>.</p>
<p>Here are some highlights:<span id="more-799"></span></p>
<ul>
<li>For programs dominated by general interpretive overhead, such as the first test listed (&#8220;em&#8221;), there is a speedup by a factor of up to about 1.5.  If these programs are compiled with the R bytecode compiler, the speedup is less, but not negligible (up to a factor of about 1.1).</li>
</ul>
<ul>
<li>Vector and matrix subscripting is substantially faster, for both interpreted and compiled programs.  The speedup is mostly when extracting subsets (for instance, r &lt;- v[100:200]), with only a small improvement when replacing a subset (for instance, v[100:200]&lt;-r). Here are some examples:
<ul>
<li>Operations such as A[100:200,300:400] are sped up by a factor of about 3.4.</li>
<li>Operations such as A[100:200,10] are sped up by a factor of about 1.8.</li>
<li>Operations such as A[,10:11], with A having 300 rows, are sped up by a factor of about 4.3.</li>
<li>Accessing a single element of a logical or complex vector is sped up by a factor of about 1.7.</li>
<li>Accessing a fairly big range of elements in a vector (eg, a[1:100]) is sped up by a factor of about 1.6.</li>
<li>Forming a vector with one element deleted (for example, a[-100], with a being of length 1000) is sped up by a factor of about 5.5.</li>
</ul>
</li>
</ul>
<ul>
<li>The interpreter now sometimes avoids creating a sequence of integers when it is just used as a subscript or in a for statement. This is partly responsible for the speedups detailed above for subscripting operations such as a[1:100].  A for statement such as for (i in 1:1000000) will now not actually create a vector of length 1000000, but just behave as if it had.</li>
</ul>
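<p>The idea of not materializing a sequence such as 1:1000000 can be illustrated outside R. What follows is a minimal sketch in Python, chosen only for illustration; it is not how the patched interpreter is implemented, and the class name VirtualSeq is hypothetical. It assumes an ascending sequence, represents it by its two endpoints, and still supports iteration and range extraction as if the full vector existed.</p>
<pre>
```python
class VirtualSeq:
    """Stands in for an ascending integer sequence lo:hi without storing it."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def __len__(self):
        return max(self.hi - self.lo + 1, 0)

    def __iter__(self):
        # Values are produced on demand, which is all a for statement needs.
        yield from range(self.lo, self.hi + 1)

    def extract(self, vec):
        # Equivalent of vec[lo:hi] with R's 1-based, inclusive bounds:
        # copy the block directly instead of building an index vector first.
        return vec[self.lo - 1:self.hi]


seq = VirtualSeq(1, 1000000)      # two integers stored, not a million
total = 0
for i in seq:
    total += i

a = list(range(10, 1010))         # a vector of length 1000
sub = VirtualSeq(1, 100).extract(a)
```
</pre>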
<ul>
<li>Vector dot products (done with %*%) are sped up by a factor of 10 for long vectors.  Matrix times vector and vector times matrix products with the vector being of length 1000 are sped up by factors ranging from 3.2 to 5.5.  General matrix-matrix products are sped up substantially as long as at least one dimension of the result matrix is not large.  For discussion of the issues involved here, see <a title="Slowing down matrix multiplication in R" href="http://radfordneal.wordpress.com/2011/05/21/slowing-down-matrix-multiplication-in-r/">my previous post</a>.</li>
</ul>
<ul>
<li>Accessing an element of a list by name with the $ operator is sped up by a factor of about 1.5.</li>
</ul>
<ul>
<li>Argument matching when a function is called has been sped up. The speedup for this operation alone isn&#8217;t too clear from the tests, but is probably a factor of 1.3 or more. Somewhat as a by-product of this patch, some primitive operations such as rep have also been sped up (they had used an inefficient kludge for their argument matching). For instance, rep(1,length=100) is sped up by a factor of 1.5.</li>
</ul>
<p>I&#8217;ll now switch to working on other projects for a while, but I plan to get back to R work in August.  In the meantime, I will be posting a bit more on what some of these patches are doing.  I&#8217;m also interested, of course, in hearing of any problems that people may have installing the patches, in any bugs that they find, and in reports of how much effect the patches have on the speed of real programs.</p>
<p>Note to commenters:  Remember that &#8220;&lt;&#8221; must be entered as &#8220;&amp;lt;&#8221;, and &#8220;&amp;&#8221; must be entered as &#8220;&amp;amp;&#8221;.  Unfortunately, there&#8217;s nothing I can do to fix things if you post a comment with a &#8220;&lt;&#8221; that gets interpreted as indicating the start of an HTML tag!</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>Under the slide</title>
		<link>http://radfordneal.wordpress.com/2011/06/09/under-the-slide/</link>
		<comments>http://radfordneal.wordpress.com/2011/06/09/under-the-slide/#comments</comments>
		<pubDate>Thu, 09 Jun 2011 19:03:24 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Photography]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=783</guid>
		<description><![CDATA[Click on image for larger version. Toronto, May 2011. Pentax ME Super with SMC Pentax-M 50mm 1:1.7 lens, Black&#8217;s (Fuji?) ISO 200 film, Nikon Coolscan V.]]></description>
			<content:encoded><![CDATA[<p style="text-align:center;"><a href="http://radfordneal.files.wordpress.com/2011/06/01-eleanor-small.jpg"><img class="size-full wp-image-325 aligncenter" src="http://radfordneal.files.wordpress.com/2011/06/01-eleanor-tiny.jpg?w=455" alt="" /></a><span id="more-783"></span></p>
<p style="text-align:center;">Click on image for larger version.</p>
<p style="text-align:left;">Toronto, May 2011. Pentax ME Super with SMC Pentax-M 50mm 1:1.7 lens, Black&#8217;s (Fuji?) ISO 200 film, Nikon Coolscan V.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/06/09/under-the-slide/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>

		<media:content url="http://radfordneal.files.wordpress.com/2011/06/01-eleanor-tiny.jpg" medium="image" />
	</item>
		<item>
		<title>Slowing down matrix multiplication in R</title>
		<link>http://radfordneal.wordpress.com/2011/05/21/slowing-down-matrix-multiplication-in-r/</link>
		<comments>http://radfordneal.wordpress.com/2011/05/21/slowing-down-matrix-multiplication-in-r/#comments</comments>
		<pubDate>Sun, 22 May 2011 02:52:11 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[R Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=739</guid>
		<description><![CDATA[After I realized that some aspects of R&#8217;s implementation are rather inefficient, one of the first things I looked at was matrix multiplication.  There I found a huge performance penalty for many matrix multiplies, a penalty which remains in the current version, 2.13.0.  As discussed below, eliminating this penalty speeds up long vector dot products [...]]]></description>
			<content:encoded><![CDATA[<p>After I realized that <a href="http://radfordneal.wordpress.com/2010/08/15/two-surpising-things-about-r/">some aspects of R&#8217;s implementation are rather inefficient</a>, one of the first things I looked at was matrix multiplication.  There I found a huge performance penalty for many matrix multiplies, a penalty which remains in the current version, 2.13.0.  As discussed below, eliminating this penalty speeds up long vector dot products by a factor of 9.5 (on my new machine), and other operations where the result matrix has at least one small dimension are sped up by factors that are somewhat smaller, but still substantial. There&#8217;s a long story behind why R&#8217;s matrix multiplies are so slow&#8230;<span id="more-739"></span></p>
<p>The main source of this speed penalty is an insistence that the result of a matrix multiply should follow R&#8217;s rules for handling infinity, NaN (not-a-number), and NA.  These rules correspond to what happens with ordinary arithmetic operations on modern computers, which follow a standard for floating-point arithmetic in which, for example, 0/0 is NaN.  You might therefore think that nothing special is needed to arrange for matrix multiplies to produce NaNs as required.  However, R does matrix multiplications using the BLAS library, which comes in many versions, some of which may try to speed things up by avoiding &#8220;unnecessary&#8221; operations such as multiplication by zero — assuming that that will always result in zero.  However, zero times NaN or infinity is supposed to be NaN, not zero.</p>
<p>To ensure that all the right NaNs appear in the result matrix, R currently looks at every element of the operands of a matrix multiply to see if any of them are NaN or NA.  If so, it does the matrix multiplication with a simple set of nested loops in C, which will propagate the NaNs correctly.  Only after verifying that no NaN or NA appears in the operands does R call the BLAS routine for a matrix multiplication.</p>
<p>A comment in the code indicates that whoever wrote it thought that the overhead of this check would be negligible (when no NaN or NA is present), but this is not always true.  If the result matrix has N rows and M columns, and the other dimension of the operands is K, the check will take (N+M)×K operations, whereas the multiplication itself will take N×M×K operations.  The NaN check is actually slower than a multiply, so the overhead of the check will be negligible only if N+M is several times smaller than N×M.  When either N or M is small (say, less than 10), the overhead is substantial.  The worst case is when N and M are both one, which corresponds to a vector dot product operation.  The NaN check may dominate even more if the actual multiply is done in parallel using several processors.</p>
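<p>To put numbers on this, here is a small sketch in Python, used purely as a calculator, with a hypothetical function name. It computes the ratio of elements checked, (N+M)×K, to multiply operations, N×M×K, for a few operand shapes. It counts operations only, and so understates the overhead, since each check costs more than a multiply.</p>
<pre>
```python
def check_to_multiply_ratio(n, m, k):
    """Ratio of operand elements checked to multiply operations.

    The result is n-by-m and the shared dimension is k, so the check
    examines (n + m) * k elements and the product needs n * m * k multiplies.
    """
    check_ops = (n + m) * k
    multiply_ops = n * m * k
    return check_ops / multiply_ops


shapes = {
    "vector dot product (1x1000 times 1000x1)": (1, 1, 1000),
    "matrix-vector (5x1000 times 1000x1)": (5, 1, 1000),
    "matrix-matrix (10x100 times 100x10)": (10, 10, 100),
    "matrix-matrix (500x100 times 100x500)": (500, 500, 100),
}

for label, (n, m, k) in shapes.items():
    print(f"{label}: {check_to_multiply_ratio(n, m, k):.3f}")
```
</pre>
<p>The ratios come out as 2.000, 1.200, 0.200 and 0.004: for a dot product the check touches twice as many elements as there are multiplies, while for a 500×500 result it is a small fraction of the work.</p>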
<p>In my <a href="http://radfordneal.wordpress.com/2010/09/03/fourteen-patches-to-speed-up-r/">first set of speed patches for R</a>, I tried to address this problem by estimating when the check would take longer than the matrix multiplication itself, and if so doing the multiplication with simple loops (avoiding the need for an explicit check). This sped up some operations by large factors, while ensuring that all the right NaNs were produced.  The R Core people didn&#8217;t adopt this patch, however, so R&#8217;s matrix multiplications are still as slow as ever.</p>
<p>The  latest release of R does have one change, however.  The BLAS routines included with the R release have been changed so that they don&#8217;t try to save time by avoiding multiplies by zero.  This has the effect of fixing a bug in 2.11.1 in which, for example, c(1/0,1) %*% c(0,1) produced 1 rather than NaN.  The costly check for NaNs had not actually been a full fix!</p>
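<p>This bug is easy to reproduce in miniature. The sketch below is in Python rather than the language of any real BLAS, and both function names are hypothetical; it only illustrates how a shortcut that skips multiplications by zero changes the answer for the operands of c(1/0,1) %*% c(0,1).</p>
<pre>
```python
import math


def dot_plain(x, y):
    # Every product is computed, so inf * 0 yields NaN and propagates.
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


def dot_skip_zeros(x, y):
    # "Optimized" variant: assumes anything times zero is zero and skips it.
    total = 0.0
    for xi, yi in zip(x, y):
        if yi != 0.0:
            total += xi * yi
    return total


x = [math.inf, 1.0]   # the R vector c(1/0, 1)
y = [0.0, 1.0]        # the R vector c(0, 1)

print(dot_plain(x, y))        # nan
print(dot_skip_zeros(x, y))   # 1.0
```
</pre>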
<p>This makes the situation in version 2.13.0 rather ridiculous.  Many matrix multiplies are slowed down substantially by a check that is unnecessary if the BLAS routines supplied with R are used.  These checks might make a difference only if some other set of BLAS routines is linked with R. Generally, users will go to the trouble of doing that only if they are looking to improve the speed of their matrix multiplies.  But even the fastest BLAS routine can do nothing to reduce the overhead of the NaN check that is done before it is called!  The final irony is that (as far as I can tell) there will still be some &#8220;incorrect&#8221; results, when a BLAS optimizes away a multiply of infinity by zero.  (It looks like this could be fixed, however, by simply changing the &#8220;isnan&#8221; checks to &#8220;!isfinite&#8221; checks.)</p>
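<p>The difference between the two kinds of check can be seen in a few lines. This is a Python sketch with hypothetical function names, standing in for C-level isnan and isfinite tests on the operands; it is not the code in R's matrix product routine.</p>
<pre>
```python
import math


def needs_safe_path_isnan(values):
    # Flags operands that contain a NaN, and nothing else.
    return any(math.isnan(v) for v in values)


def needs_safe_path_not_finite(values):
    # Flags NaN and also +/- infinity, which can become NaN when multiplied by zero.
    return any(not math.isfinite(v) for v in values)


x = [math.inf, 1.0]

print(needs_safe_path_isnan(x))        # False: the infinity slips through
print(needs_safe_path_not_finite(x))   # True: the simple loops would be used
```
</pre>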
<p>Although my previous patch sped up many operations considerably (eg, vector dot products sped up by something like a factor of six), I now think that the real solution is to just not be so concerned about NaNs. From a statistical viewpoint, does anyone really expect that missing data indicated by NA will propagate through matrix multiplies?  If so, are they also expecting NAs and NaNs to propagate correctly through eigenvalue computations?  At some point, it becomes ridiculous to expect this.  I think that point comes before matrix multiplies.  From a more general viewpoint, the fact that BLAS routines don&#8217;t guarantee correct propagation of NaNs is an indication that the community thinks that is less important than speed.  Now that the BLAS supplied with R does propagate NaNs correctly, anyone who is concerned about that can simply use the supplied BLAS. Someone who instead installs an optimized BLAS presumably does so because they want their multiplies to go fast.</p>
<p>I&#8217;ve therefore created a modified version of R that omits the NaN check. Also, rather than always call the BLAS general matrix multiply routine, DGEMM, I call the routines for vector dot products (DDOT) and matrix-vector products (DGEMV) when the dimensions of the matrices are suitable.  Calling these more specialized routines sometimes gives faster results with the BLAS supplied with R.  The modified version of the R matrix product routine is <a href="http://radfordneal.files.wordpress.com/2011/05/matprod-mod-11-05-21.doc">here</a>; the original from 2.13.0 is <a href="http://radfordneal.files.wordpress.com/2011/05/matprod-2-13-0.doc">here</a>.</p>
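<p>The choice among the three BLAS routines depends only on the shapes of the operands. A rough sketch of such a dispatch follows, in Python with a hypothetical function name, returning the routine name rather than calling BLAS. The exact conditions in the modified R code may differ; the linked source is the authority on that.</p>
<pre>
```python
def choose_blas_routine(nrow_x, ncol_x, nrow_y, ncol_y):
    """Pick a BLAS routine name for x %*% y from the operand shapes."""
    if ncol_x != nrow_y:
        raise ValueError("non-conformable arguments")
    if nrow_x == 1 and ncol_y == 1:
        return "DDOT"    # row vector times column vector: a dot product
    if nrow_x == 1 or ncol_y == 1:
        return "DGEMV"   # the result is a single row or a single column
    return "DGEMM"       # general matrix-matrix product


print(choose_blas_routine(1, 1000, 1000, 1))      # DDOT
print(choose_blas_routine(5, 1000, 1000, 1))      # DGEMV
print(choose_blas_routine(1, 1000, 1000, 50))     # DGEMV
print(choose_blas_routine(500, 100, 100, 500))    # DGEMM
```
</pre>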
<p>Below are the factors by which this modified version speeds up matrix multiplies, on my new workstation with an Intel X5680 processor, running at 3.33GHz:</p>
<table border="1" align="center">
<tbody>
<tr>
<th>Type</th>
<th>dimensions</th>
<th>speedup</th>
</tr>
<tr>
<td>vector dot product</td>
<td>1&#215;1000 times 1000&#215;1</td>
<td>6.5 times faster</td>
</tr>
<tr>
<td>vector dot product</td>
<td>1&#215;500000 times 500000&#215;1</td>
<td>9.5 times faster</td>
</tr>
<tr>
<td>matrix-vector</td>
<td>5&#215;1000 times 1000&#215;1</td>
<td>3.6 times faster</td>
</tr>
<tr>
<td>vector-matrix</td>
<td>1&#215;1000 times 1000&#215;50</td>
<td>5.4 times faster</td>
</tr>
<tr>
<td>matrix-matrix</td>
<td>10&#215;100 times 100&#215;10</td>
<td>1.6 times faster</td>
</tr>
<tr>
<td>matrix-matrix</td>
<td>500&#215;100 times 100&#215;500</td>
<td>no difference</td>
</tr>
</tbody>
</table>
<p>The speed ups seem to me to be quite worthwhile. This is with the BLAS supplied with R. I expect that the speed ups with an optimized BLAS would be even larger.</p>
<p>Note again that when using the BLAS supplied with R, there is no problem with NaNs not being propagated. There <em>might</em> be a problem with some other BLAS. However, it is difficult to see why an optimized BLAS would check whether an element of a vector in a vector dot product is zero, since it will be used in only one multiply.  Checking whether an element of a matrix in a matrix-vector product is zero seems similarly counterproductive.  Any problem is therefore probably confined to vector elements in a matrix-vector product or matrix elements in a matrix-matrix product that isn&#8217;t a matrix-vector or vector dot product. So if one were to insist on doing a NaN check, one might be able to look only at those elements.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/05/21/slowing-down-matrix-multiplication-in-r/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>Speed tests for R — and a look at the compiler</title>
		<link>http://radfordneal.wordpress.com/2011/05/13/speed-tests-for-r-%e2%80%94-and-a-look-at-the-compiler/</link>
		<comments>http://radfordneal.wordpress.com/2011/05/13/speed-tests-for-r-%e2%80%94-and-a-look-at-the-compiler/#comments</comments>
		<pubDate>Fri, 13 May 2011 13:16:40 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[R Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics - Computing]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=719</guid>
		<description><![CDATA[I&#8217;ve gotten back to work on speeding up R, starting with improving my suite of speed tests.  Among other new features, this suite allows one to easily try out the &#8220;byte-code&#8221; compiler that is now a standard part of the latest release of R, version 2.13.0. You can get the suite here. I&#8217;ve been running [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve gotten back to work on speeding up R, starting with improving my suite of speed tests.  Among other new features, this suite allows one to easily try out the &#8220;byte-code&#8221; compiler that is now a standard part of the latest release of R, version 2.13.0. You can get the suite <a href="http://www.cs.utoronto.ca/~radford/R-speed.html">here</a>.</p>
<p>I&#8217;ve been running these tests on my new workstation, which has a six-core Intel X5680 processor, running at 3.33GHz.  Unfortunately, it&#8217;s clear that things run somewhat slower when you use all the cores at once, so for consistency one needs to do the speed tests using just one core.  (Or one needs some more elaborate, and unclear, protocol for testing the speed of R in a multicore environment.)  I haven&#8217;t figured out how to get Red Hat Linux to compile 32-bit applications yet, so all the tests are in a 64-bit environment.</p>
<p>I&#8217;ve started with comparing the speed of R-2.13.0 with and without functions being compiled, and with comparing R-2.13.0 (without the compiler) to R-2.11.1, which was the last release before some of my speed improvements were incorporated.  A plot of the results is <a href="http://radfordneal.files.wordpress.com/2011/05/speed-test-example.pdf">here</a>.<span id="more-719"></span></p>
<p>Looking first at the effect of the compiler in R-2.13.0, one can see that for programs that do simple operations in loops, the compiler can speed things up by up to a factor of five, though the speed-up is often less than a factor of two, and in one strange case (a very simple for loop) the compiler slows things considerably.  As one would expect, there is no speed-up for programs dominated by large operations such as matrix multiplies.  There is also little speed-up when operations like matching arguments dominate.  There&#8217;s a modest speed-up for the vector arithmetic tests, which may be related to storage allocation.</p>
<p>Looking at R-2.13.0 versus R-2.11.1, one can see modest speed-ups for programs doing simple operations, which I believe are due to my improvements to &#8220;for&#8221; and to construction of argument lists.  There are also major improvements to some operations like &#8220;transpose&#8221;, which are also all due to modifications I introduced, with the exception of the improvement for matrix multiplies, which I believe is due to recent changes to the BLAS, which eliminate some special checks for zero, probably motivated by concern for proper NA/NaN propagation.  (My proposed modifications to matrix multiplies can produce a much larger improvement, but were not incorporated.)</p>
<p>Many of my other speed improvements have also not been incorporated into the released version of R.  I&#8217;m currently updating them for R-2.13.0, and adding some new speed improvements.  I hope to release them soon.</p>
<p>I expect that the speed-ups from these improvements will often be comparable to that obtained from using the compiler.  Indeed, in some cases they will be the <em>same</em> improvements — the compiler includes some optimizations that can just as easily (or more easily) be done in the interpreter.  For instance, the interpreter currently allocates new space for TRUE or FALSE for the result of every comparison or logical operation.  I came up with a simple modification to just allocate TRUE, FALSE, and logical NA once, and then re-use them as needed.  I then noticed that the compiler does something similar.</p>
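<p>The allocate-once idea can be sketched as follows. This is a Python illustration with hypothetical names (Logical, compare_greater), not the interpreter's C code: three result objects are created up front, and every comparison returns one of them instead of allocating a new one.</p>
<pre>
```python
class Logical:
    """A boxed logical value, standing in for a length-one logical vector."""

    def __init__(self, value):
        self.value = value   # True, False, or None for NA


# Allocated once, then shared by every comparison result.
R_TRUE = Logical(True)
R_FALSE = Logical(False)
R_NA = Logical(None)


def compare_greater(a, b):
    # Returns a shared constant rather than building a fresh Logical.
    if a is None or b is None:
        return R_NA
    return R_TRUE if a > b else R_FALSE


first = compare_greater(3, 2)
second = compare_greater(10, 1)
print(first is second)                    # True: same object, no new allocation
print(compare_greater(1, None).value)     # None, playing the role of NA
```
</pre>
<p>Sharing of this kind is only safe if the shared objects are never modified in place.</p>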
<p>Other speed-ups will be different, however.  It will be interesting to see the combined effect of using both my speed improvements and the compiler.</p>
<p>UPDATE: I&#8217;ve released a new version of these speed tests, which fixes some glitches, adds some new tests, and improves the appearance of the plots.  You can get the new version (and new plots comparing 2.13.0 with and without compilation and 2.11.1 versus 2.13.0)  <a href="http://www.cs.utoronto.ca/%7Eradford/R-speed.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/05/13/speed-tests-for-r-%e2%80%94-and-a-look-at-the-compiler/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
		<item>
		<title>Ensemble MCMC</title>
		<link>http://radfordneal.wordpress.com/2011/01/01/ensemble-mcmc/</link>
		<comments>http://radfordneal.wordpress.com/2011/01/01/ensemble-mcmc/#comments</comments>
		<pubDate>Sun, 02 Jan 2011 03:26:08 +0000</pubDate>
		<dc:creator>Radford Neal</dc:creator>
				<category><![CDATA[Statistics - Computing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Monte Carlo Methods]]></category>

		<guid isPermaLink="false">http://radfordneal.wordpress.com/?p=692</guid>
		<description><![CDATA[I&#8217;m glad to have managed, before teaching starts again, to have finished a Technical Report (available here or at arxiv.org) with what may be my most unwieldy title ever: MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression I wanted the title to mention all three of [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m glad to have managed, before teaching starts again, to have finished a Technical Report (available <a href="http://www.cs.utoronto.ca/~radford/ensmcmc.abstract.html">here</a> or at <a href="http://arxiv.org/abs/1101.0387">arxiv.org</a>) with what may be my most unwieldy title ever:</p>
<p style="padding-left:30px;">MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression</p>
<p>I wanted the title to mention all three of the nested ideas in the paper.  Actually, I wasn&#8217;t able to fit in a fourth, most general, idea, of MCMC methods based on caching and mapping (see <a href="http://www.cs.utoronto.ca/~radford/ftp/cache-map.pdf">here</a>).   Here is the abstract:</p>
<p style="padding-left:30px;">I introduce a Markov chain Monte Carlo (MCMC) scheme in which sampling from a distribution with density π(x) is done using updates operating on an &#8220;ensemble&#8221; of states. The current state x is first stochastically mapped to an ensemble, x<sup>(1)</sup>,&#8230;,x<sup>(K)</sup>.  This ensemble is then updated using MCMC updates that leave invariant a suitable ensemble density, ρ(x<sup>(1)</sup>,&#8230;,x<sup>(K)</sup>), defined in terms of π(x<sup>(i)</sup>) for i=1,&#8230;,K.  Finally a single state is stochastically selected from the ensemble after these updates.  Such ensemble MCMC updates can be useful when characteristics of π and the ensemble permit π(x<sup>(i)</sup>) for all i in {1,&#8230;,K} to be computed in less than K times the amount of computation time needed to compute π(x) for a single x.  One common situation of this type is when changes to some &#8220;fast&#8221; variables allow for quick re-computation of the density, whereas changes to other &#8220;slow&#8221; variables do not. Gaussian process regression models are an example of this sort of problem, with an overall scaling factor for covariances and the noise variance being fast variables.  I show that ensemble MCMC for Gaussian process regression models can indeed substantially improve sampling performance.  Finally, I discuss other possible applications of ensemble MCMC, and its relationship to the &#8220;multiple-try Metropolis&#8221; method of Liu, Liang, and Wong and the &#8220;multiset sampler&#8221; of Leman, Chen, and Lavine.</p>
<p>I&#8217;ve also posted the <a href="http://www.cs.utoronto.ca/~radford/ensmcmc.software.html">programs used to produce the results</a>.  These haven&#8217;t been tested much beyond their use for the paper, but I hope to incorporate them into a general MCMC package in R (also including programs accompanying my <a href="http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html">review of Hamiltonian Monte Carlo</a>).   That&#8217;s my next project to do in whatever time I have available after teaching, administration, and a three-year-old daughter, along with more <a href="http://radfordneal.wordpress.com/2010/09/03/fourteen-patches-to-speed-up-r/">efforts to speed up R</a>, so that this MCMC package won&#8217;t be <em>too</em> slow.</p>
]]></content:encoded>
			<wfw:commentRss>http://radfordneal.wordpress.com/2011/01/01/ensemble-mcmc/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">radfordneal</media:title>
		</media:content>
	</item>
	</channel>
</rss>
