Speed tests for R — and a look at the compiler

2011-05-13 at 9:16 am

I’ve gotten back to work on speeding up R, starting with improving my suite of speed tests.  Among other new features, this suite allows one to easily try out the “byte-code” compiler that is now a standard part of the latest release of R, version 2.13.0. You can get the suite here.
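
For anyone who wants to try the compiler themselves, here is a minimal sketch of the basic usage of the “compiler” package that comes with 2.13.0.  The sum.sq function is just a made-up example, not one of the tests in the suite:

    library(compiler)

    # A made-up function that does simple scalar operations in a loop.
    sum.sq <- function (n)
    { s <- 0
      for (i in 1:n) s <- s + i^2
      s
    }

    sum.sq.cmp <- cmpfun(sum.sq)   # byte-compiled version of the same function

    sum.sq(100)        # runs interpreted
    sum.sq.cmp(100)    # runs the compiled byte code

    # Alternatively, enableJIT(3) compiles functions automatically as they are used.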

I’ve been running these tests on my new workstation, which has a six-core Intel X5680 processor, running at 3.33GHz.  Unfortunately, it’s clear that it runs somewhat slower when all the cores are in use at once, so for consistency one needs to do the speed tests using just one core.  (Or one needs some more elaborate, and unclear, protocol for testing the speed of R in a multicore environment.)  I haven’t figured out how to get Red Hat Linux to compile 32-bit applications yet, so all the tests are in a 64-bit environment.

I’ve started with comparing the speed of R-2.13.0 with and without functions being compiled, and with comparing R-2.13.0 (without the compiler) to R-2.11.1, which was the last release before some of my speed improvements were incorporated.  A plot of the results is here.

Looking first at the effect of the compiler in R-2.13.0, one can see that for programs that do simple operations in loops, the compiler can speed things up by as much as a factor of five, though the speed-up is often less than a factor of two, and in one strange case (a very simple “for” loop) the compiler slows things down considerably.  As one would expect, there is no speed-up for programs dominated by large operations such as matrix multiplies.  There is also little speed-up when operations like matching arguments dominate.  There’s a modest speed-up for the vector arithmetic tests, which may be related to storage allocation.
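
As a rough illustration of the sort of comparison involved (this is not one of the actual tests in the suite, and the exact ratio will depend on the machine), one can time an interpreted and a byte-compiled version of a simple loop:

    library(compiler)

    f <- function (n)               # simple scalar operations in a loop,
    { s <- 0                        # the sort of code the compiler helps most
      for (i in 1:n) s <- s + i*i
      s
    }

    fc <- cmpfun(f)                 # byte-compiled version of f

    system.time (f(1000000))        # interpreted timing
    system.time (fc(1000000))       # compiled timing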

Looking at R-2.13.0 versus R-2.11.1, one can see modest speed-ups for programs doing simple operations, which I believe are due to my improvements to “for” and to construction of argument lists.  There are also major improvements to some operations, such as “transpose”, which are all due to modifications I introduced.  The exception is the improvement for matrix multiplies, which I believe is due to recent changes to the BLAS that eliminate some special checks for zero, probably motivated by concern for proper NA/NaN propagation.  (My proposed modifications to matrix multiplies can produce a much larger improvement, but were not incorporated.)

Many of my other speed improvements have also not been incorporated into the released version of R.  I’m currently updating them for R-2.13.0, and adding some new speed improvements.  I hope to release them soon.

I expect that the speed-ups from these improvements will often be comparable to that obtained from using the compiler.  Indeed, in some cases they will be the same improvements — the compiler includes some optimizations that can just as easily (or more easily) be done in the interpreter.  For instance, the interpreter currently allocates new space for TRUE or FALSE for the result of every comparison or logical operation.  I came up with a simple modification to just allocate TRUE, FALSE, and logical NA once, and then re-use them as needed.  I then noticed that the compiler does something similar.
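
To make this concrete, here is a made-up example (not from the test suite) of the kind of code affected.  Each evaluation of x>0 below produces a one-element logical vector, so without re-use the interpreter does one small allocation per iteration just to hold a TRUE or FALSE result:

    count.pos <- function (v)     # count the positive elements one at a time
    { n <- 0
      for (x in v)
      { if (x > 0)                # x>0 yields a scalar logical each time;
          n <- n + 1              # with pre-allocated TRUE/FALSE values this
      }                           # result could simply be shared instead
      n
    }

    count.pos (rnorm(100000))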

Other speed-ups will be different, however.  It will be interesting to see the combined effect of using both my speed improvements and the compiler.

UPDATE: I’ve released a new version of these speed tests, which fixes some glitches, adds some new tests, and improves the appearance of the plots.  You can get the new version (and new plots comparing 2.13.0 with and without compilation and 2.11.1 versus 2.13.0)  here.

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing.


9 Comments

  • 1. Tal Galili  |  2011-05-13 at 10:32 am

    Hello Neal,
    I don’t know if you hear this a lot or not, but I (and, I am sure, many of the other R users who run simulations) think the work you are doing is wonderful.

    So thank you very much!

    Best,
    Tal

    • 2. Radford Neal  |  2011-05-13 at 10:53 am

      Thanks! But note that my first name is “Radford” :-)

  • 4. Luke Tierney  |  2011-05-13 at 1:18 pm

    Just a quick note on TRUE/FALSE and the slow “for” loop.  In principle the compiler should not need to allocate literal constants; just using the ones in the constant pool, or pre-allocated values as for TRUE and FALSE, should work as long as they have NAMED set to 2.  The problem: there are currently a number of CRAN packages that play too fast and loose with .C(…,DUP=FALSE) and the like.  These fail if fresh values are not allocated.  The same would happen if the interpreter were modified to avoid these allocations.  Having spent hours in the distant past debugging FORTRAN code where a call to a function had managed to change the value of the constant ‘2’ to ‘3’, I am not eager to re-introduce that possibility in R.  So for now the byte code engine allocates in (most of) these cases.  It is even more defensive than the interpreter in this area, hence the difference in the simple “for” loop.  I am experimenting with an alternative that would eliminate these and many other allocations when scalar values are involved.  Preliminary results are fairly promising, but it will be a while before this makes it into the R distribution.

    • 5. Radford Neal  |  2011-05-13 at 1:30 pm

      I can see the problem here, but probably most of these packages are broken in any case — operating incorrectly when they encounter various other quantities with NAMED set to 2.  One possible approach would be to do a (very quick) check for the integrity of TRUE and FALSE after a C function returns, fixing them and reporting an error if they are found to have been altered.  That might identify problems rapidly.

  • 6. Joe Cheng  |  2011-05-13 at 9:36 pm

    Re: the second paragraph, you probably want to disable Turbo Boost on your Xeon for benchmarking. You should be able to do this in your BIOS.

    • 7. Radford Neal  |  2011-05-13 at 9:49 pm

      Thanks! I hadn’t been aware of this issue. Possibly this is the main reason for longer times when running concurrently on multiple cores. I had assumed that this was due to memory access contention, but it seems that with Turbo Boost the clock speed could actually be lowered by running concurrently.

    • 8. Ryan  |  2011-05-17 at 1:50 pm

      Do you have more information on why the Xeons behave this way?

  • 9. News about speeding R up « Xi'an's Og  |  2011-05-23 at 6:15 pm

    […] Ihaka’s comments, “simply start over and build something better”). I just spotted two new entries by Radford on his blog that are bound to rekindle the debate about the speed of R. The latest one […]

