Slowing down matrix multiplication in R
After I realized that some aspects of R’s implementation are rather inefficient, one of the first things I looked at was matrix multiplication. There I found a huge performance penalty for many matrix multiplies, a penalty which remains in the current version, 2.13.0. As discussed below, eliminating this penalty speeds up long vector dot products by a factor of 9.5 (on my new machine), and other operations where the result matrix has at least one small dimension are sped up by factors that are somewhat smaller, but still substantial. There’s a long story behind why R’s matrix multiplies are so slow…
The main source of this speed penalty is an insistence that the result of a matrix multiply should follow R’s rules for handling infinity, NaN (not-a-number), and NA. These rules correspond to what happens with ordinary arithmetic operations on modern computers, which follow a standard for floating-point arithmetic in which, for example, 0/0 is NaN. You might therefore think that nothing special is needed to arrange for matrix multiplies to produce NaNs as required. However, R does matrix multiplications using the BLAS library, which comes in many versions, some of which may try to speed things up by avoiding “unnecessary” operations such as multiplication by zero — assuming that that will always result in zero. However, zero times NaN or infinity is supposed to be NaN, not zero.
To ensure that all the right NaNs appear in the result matrix, R currently looks at every element of the operands of a matrix multiply to see if any of them are NaN or NA. If so, it does the matrix multiplication with a simple set of nested loops in C, which will propagate the NaNs correctly. Only after verifying that no NaN or NA appears in the operands does R call the BLAS routine for a matrix multiplication.
A comment in the code indicates that whoever wrote it thought that the overhead of this check would be negligible (when no NaN or NA is present), but this is not always true. If the result matrix has N rows and M columns, and the other dimension of the operands is K, the check will take (N+M)×K operations, whereas the multiplication itself will take N×M×K operations. The NaN check is actually slower than a multiply, so the overhead of the check will be negligible only if N+M is several times smaller than N×M. When either N or M is small (say, less than 10), the overhead is substantial. The worst case is when N and M are both one, which corresponds to a vector dot product operation. The NaN check may dominate even more if the actual multiply is done in parallel using several processors.
In my first set of speed patches for R, I tried to address this problem by estimating when the check would take longer than the matrix multiplication itself, and if so doing the multiplication with simple loops (avoiding the need for an explicit check). This sped up some operations by large factors, while ensuring that all the right NaNs were produced. The R Core people didn’t adopt this patch, however, so R’s matrix multiplications are still as slow as ever.
The latest release of R does have one change, however. The BLAS routines included with the R release have been changed so that they don’t try to save time by avoiding multiplies by zero. This has the effect of fixing a bug in 2.11.1 in which, for example, c(1/0,1) %*% c(0,1) produced 1 rather than NaN. The costly check for NaNs had not actually been a full fix!
This makes the situation in version 2.13.0 rather ridiculous. Many matrix multiplies are slowed down substantially by a check that is unnecessary if the BLAS routines supplied with R are used. These checks might make a difference only if some other set of BLAS routines are linked with R. Generally, users will go to the trouble of doing that only if they are looking to improve the speed of their matrix multiplies. But even the fasted BLAS routine can do nothing to reduce the overhead of the NaN check that is done before it is called! The final irony is that (as far as I can tell) there will still be some “incorrect” results, when a BLAS optimizes away a multiply of infinity by zero. (It looks like this could be fixed, however, by simply changing the “isnan” checks to “!isfinite” checks.)
Although my previous patch sped up many operations considerably (eg, vector dot products sped up by something like a factor of six), I now think that the real solution is to just not be so concerned about NaNs. From a statistical viewpoint, does anyone really expect that missing data indicated by NA will propagate through matrix multiplies? If so, are they also expecting NAs and NaNs to propagate correctly through eigenvalue computations? At some point, it becomes ridiculous to expect this. I think that point comes before matrix multiplies. From a more general viewpoint, the fact that BLAS routines don’t guarantee correct propagation of NaNs is an indication that the community thinks that is less important than speed. Now that the BLAS supplied with R does propagate NaNs correctly, anyone who is concerned about that can simply use the supplied BLAS. Someone who instead installs an optimized BLAS presumably does so because they want their multiplies to go fast.
I’ve therefore created a modified version of R that omits the NaN check. Also, rather than always call the BLAS general matrix multiply routine, DGEMM, I call the routines for vector dot products (DDOT) and matrix-vector products (DGEMV) when the dimensions of the matrices are suitable. Calling these more specialized routines sometimes gives faster results with the BLAS supplied with R. The modified version of the R matrix product routine is here; the original from 2.13.0 is here.
Below, are the factors by which this modified version speeds up matrix multiplies, on my new workstation with an Intel X5680 processor, running at 3.33GHz:
|vector dot product||1×1000 times 1000×1||6.5 times faster|
|vector dot product||1×500000 times 500000×1||9.5 times faster|
|matrix-vector||5×1000 times 1000×1||3.6 times faster|
|vector-matrix||1×1000 times 1000×50||5.4 times faster|
|matrix-matrix||10×100 times 100×10||1.6 times faster|
|matrix-matrix||500×100 times 100×500||no difference|
The speed ups seem to me to be quite worthwhile. This is with the BLAS supplied with R. I expect that the speed ups with an optimized BLAS would be even larger.
Note again that when using the BLAS supplied with R, there is no problem with NaNs not being propagated. There might be a problem with some other BLAS. However, it is difficult to see why an optimized BLAS would check whether an element of a vector in a vector dot product is zero, since it will be used in only one multiply. Checking whether an element of a matrix in a matrix-vector product is zero seems similarly counterproductive. Any problem is therefore probably confined to vector elements in a matrix-vector product or matrix elements in a matrix-matrix product that isn’t a matrix-vector or vector dot product. So if one were to insist on doing a NaN check, one might be able to look only at those elements.