Posts filed under ‘R Programming’
This fall, I’ll be teaching a second-year course on Probability with Computer Applications, which is required for Computer Science majors. I’ve taught this before, but that was five years ago, so I’ve been looking to see what new textbooks would be suitable. The course aims not just to use computer science applications as examples, but also to reinforce concepts of probability with programs, and to show how simulation can be used to solve problems that aren’t easily solved analytically. I’ve used R for the programming part, and plan to again, so I was naturally interested in two recent textbooks that seemed to have similar aims:
Introduction to Probability with R, Kenneth Baclawski, Chapman & Hall / CRC.
Probability with R: An Introduction with Computer Science Applications, Jane M. Horgan, Wiley.
I’ve now had a look at both of these textbooks. Unfortunately, they are both seriously flawed. Even more unfortunately, although some of the flaws in these books are particularly striking, I’ve seen similar, if usually less serious, problems in many other textbooks. (more…)
I have now released a new collection of 30 patches to speed up R version 2.13.0. You can get them here
Assessing how much these patches speed up R is difficult. First of all, the speedup varies tremendously with the type of program. It also varies quite a bit with the machine and compiler used to run R. Finally, it varies in apparently random ways — changing code in one part of the R interpreter can change the speed of operations that never use the modified code by plus or minus 5% or more. This is presumably due to the change altering the exact addresses of other code segments, with consequent effects on alignment of memory fetches or on cache behaviour.
Nevertheless, here is a comparison of R 2.13.0 without modification and with all my patches applied, with and without compilation of R functions. The tests were done with an Intel X5680 processor running at 3.33GHz in 64-bit mode using gcc 4.4.4 under Red Hat Linux with default R configuration parameters. The tests use my suite of speed tests for R.
Here are some highlights: (more…)
After I realized that some aspects of R’s implementation are rather inefficient, one of the first things I looked at was matrix multiplication. There I found a huge performance penalty for many matrix multiplies, a penalty which remains in the current version, 2.13.0. As discussed below, eliminating this penalty speeds up long vector dot products by a factor of 9.5 (on my new machine), and other operations where the result matrix has at least one small dimension are sped up by factors that are somewhat smaller, but still substantial. There’s a long story behind why R’s matrix multiplies are so slow… (more…)
I’ve gotten back to work on speeding up R, starting with improving my suite of speed tests. Among other new features, this suite allows one to easily try out the “byte-code” compiler that is now a standard part of the latest release of R, version 2.13.0. You can get the suite here.
I’ve been running these tests on my new workstation, which has a six-core Intel X5680 processor, running at 3.33GHz. Unfortunately, it’s clear that thing runs somewhat slower when you use all the cores at once, so for consistency one needs to do the speed tests using just one core. (Or one needs some more elaborate, and unclear, protocol for testing the speed of R in a muticore environment.) I haven’t figured out how to get Red Hat Linux to compile 32-bit applications yet, so all the tests are in a 64-bit environment.
I’ve started with comparing the speed of R-2.13.0 with and without functions being compiled, and with comparing R-2.13.0 (without the compiler) to R-2.11.1, which was the last release before some of my speed improvements were incorporated. A plot of the results is here. (more…)
Following my discovery of two surprising inefficiencies in R, I’ve been inspired to spend much of the last two weeks looking for ways to speed it up. I’ve had quite a bit of success, both at finding ways to speed up particular functions, and at finding ways to reduce general interpretive overhead.
You can get my fourteen patches to the R source code here. I’d be interested in hearing how much it speeds up typical applications, on various machines. Of course, you need to be comfortable with installing R from source code to use these patches. For meaningful speed comparisons, you also need to be sure to compile the modified and unmodified versions of R with the same compiler, same options, etc.
There look to be some more places in the R source code where speed improvements are possible, but for now, I had better switch to preparing for the coming teaching term…
UPDATE: I discovered a bug in the vec-subset patch. The version you can get from here now has this fixed. I also split the vec-subset patch into patch-vec-subset and patch-subscript, since these two parts are really independent. So there are now fifteen patches.
As I noted here, enclosing sub-expressions in parentheses is slower in R than enclosing them in curly brackets. I now know why, and I’ve modified R to reduce (but not eliminate) the slowness of parentheses. The modification speeds up many other operations in R as well, for an average speedup of something like 5% for programs that aren’t dominated by large built-in operations like matrix multiplies. (more…)
I see that it’s been over a year since my last post! I have a backlog of blog post ideas, but something else always seems to have higher priority. Today, though, I have some interesting (and useful) things to say about R, which I discovered in the last few days, and which shouldn’t take long to blog about. Of course, some other people may already be quite familiar with these things. Or maybe not…
First up, a useful feature of R that I hadn’t realized existed, which comes with a surprising gain in efficiency. Second, something surprisingly slow about R’s implementation of a very common operation. (more…)
Unlike the two design flaws I posted about before (here, here, and also here), where one could at least see a reason for the design decision, even if it was unwise, this design flaw is just incomprehensible. For no reason at all that I can see, R allows one to use zero as a subscript without triggering an error. (Remember that in R, indexes for vectors and matrices start at one, not zero.)
This is of course a terrible decision, because it makes debugging harder, and makes it more likely that bugs will exist that have never been noticed. (more…)
I’ve previously posted about two design flaws in R. The first post was about how R produces reversed sequences from a:b when a>b, with bad consequences in “for” statements (and elsewhere). The second post was about how R by default drops dimensions in expressions like M[i:j,] when i:j is a sequence only one long (ie, when i equals j).
In both posts, I suggested ways of extending R to try to solve these problems. I now think there is a better way, however, which solves both problems with one simple extension to R. This extension would also make R programs run faster and use less memory. (more…)
In a comment on my first post on design flaws in the R language, Longhai remarked that he has encountered problems as a result of R’s default behaviour of dropping a dimension of a matrix when you select only one row/column from that dimension. This was indeed the design flaw that I was going to get to next! I think it also points to what is perhaps a deeper design flaw. (more…)