New version of pqR, now with task merging
I’ve now released pqR-2013-12-29, a new version of my speedier implementation of R. There’s a new website, pqR-project.org, as well, and a new logo, seen here.
The big improvement in this version is that vector operations are sped up using task merging.
With task merging, several arithmetic operations on a vector may be merged into a single operation, reducing the time spent on memory stores and fetches of intermediate results. I was inspired to add task merging to pqR by Renjin and Riposte (see my post here and the subsequent discussion).
The main way pqR merges tasks is via the same deferred evaluation mechanism that it uses to perform tasks in helper threads, when multiple processor cores are available. Here’s an example, in which we’ll first assume that only one processor core is used (no helper threads):
f <- function (x,a,b) a*x+b v <- seq(1,2,length=10000) w <- f(v,2,3)^2 print(w)
When f(v,2,3) is called, a task is first scheduled to compute 2*v, but this computation is not started. Next, a task is scheduled to add 3 to the result of computing 2*v, at which point pqR recognizes that this task can be merged with the first task, into a task to compute 2*v+3. This computation is also not started yet, with the value returned from f(v,2,3) being a vector whose computation is pending. A task is then scheduled to square this vector, which is merged with the previously merged task into a task to compute (2*v+3)^2. This computation is also deferred; instead, a pending value is assigned to w. It’s only when w needs to be printed that the vector w is actually computed.
The final merged computation of (2*v+3)^2 is done with a single loop, which fetches each element, v[i], and stores (2*v[i]+3)^2 in w[i]. In contrast, if the three operations hadn’t been merged, three loops would have been done, with two stores and two fetches of intermediate results for each element of the vector. In my tests, the merged operation is 1.5 to 2.0 times faster than the combination of the three unmerged operations (depending on the processor, compiler, etc.).
This example is more complicated if pqR is using helper threads. The first task to compute 2*v will now probably be started in a helper thread before the task to add 3 to its result is scheduled, so no merge will be done. If there is only one helper thread, the next task to do the squaring operation will likely be merged with the “add 3″ task, but if a second helper is available, the “add 3″ task may also have already started. So it’s possible that the availability of helper threads could actually slow computations, by preventing merges. Helper threads will of course be a help when computations on vector elements dominate the memory store and fetch operations, and for this reason pqR does not try to merge expensive operations when helper threads are available (eg, sin(exp(v)) is not merged when there is at least one helper thread, whereas it is merged when there are none).
I plan to improve how task merging works with helper threads in a future release of pqR. It should be possible to merge tasks even when the first has already started, by telling the first task to stop at the point it has reached, doing the second task’s operation on those elements where the first has already been done, and then doing the merged operations on the remaining elements.
I also plan to extend the operations for which task merging can be done. At present, pqR can only merge operations with a single vector operand and a vector result, such as 3*v or exp(v), but it should also be possible to merge some operations with two vector operands, such as u+v with u and v both vectors. Task merging in pqR will remain less ambitious than the similar features in Renjin and Riposte however, since pqR uses pre-compiled code snippets for all possible combinations of merged operations (currently 2744 of them, with merges limited to three operations), whereas Renjin and Riposte compile code on-the-fly.
Previously, pqR had already had some features resembling task merging, implemented using the “variant result” mechanism (described here). The new version extends use of this mechanism to merge transpose and matrix multiply operations — now t(A)%*%B does not actually compute the transpose of A, but instead calls a routine that directly computes the transpose of A times B, which is actually faster than computing A times B, because of more favourable memory ordering. Those in the know can also achieve this result using the crossprod function, but in pqR it’s now automatic.
The new version also has some bug fixes and other improvements. You can install it on a Linux or Unix system from the source available at pqR-project.org. Installation from source on Max OS X is also possible, but is recommended only for those experienced in doing this. (You’ll need to install both Apple’s Xcode and a Fortran compiler, plus you’ll neeed a newer version of gcc than Xcode has if you want to use helper threads. The Mac GUI does not yet work if helper threads are enabled.) Some people have managed to install pqR on Microsoft Windows, but this is still experimental. One of my next objectives is to produce pre-compiled versions of pqR for common systems, but for now you have to install from source.
I recently gave a talk on pqR at the University of Guelph, the slides of which you can see here. (As usual, I had to hurry through the last few in the actual talk.)