Slowing down matrix multiplication in R

2011-05-21 at 10:52 pm 12 comments

After I realized that some aspects of R’s implementation are rather inefficient, one of the first things I looked at was matrix multiplication. There I found a huge performance penalty for many matrix multiplies, a penalty which remains in the current version, 2.13.0. As discussed below, eliminating this penalty speeds up long vector dot products by a factor of 9.5 (on my new machine), and other operations where the result matrix has at least one small dimension are sped up by factors that are somewhat smaller, but still substantial. There’s a long story behind why R’s matrix multiplies are so slow…

The main source of this speed penalty is an insistence that the result of a matrix multiply should follow R’s rules for handling infinity, NaN (not-a-number), and NA. These rules correspond to what happens with ordinary arithmetic operations on modern computers, which follow a standard for floating-point arithmetic in which, for example, 0/0 is NaN. You might therefore think that nothing special is needed to arrange for matrix multiplies to produce NaNs as required. However, R does matrix multiplications using the BLAS library, which comes in many versions, some of which may try to speed things up by avoiding “unnecessary” operations such as multiplication by zero — assuming that that will always result in zero. However, zero times NaN or infinity is supposed to be NaN, not zero.

To ensure that all the right NaNs appear in the result matrix, R currently looks at every element of the operands of a matrix multiply to see if any of them are NaN or NA. If so, it does the matrix multiplication with a simple set of nested loops in C, which will propagate the NaNs correctly. Only after verifying that no NaN or NA appears in the operands does R call the BLAS routine for a matrix multiplication.

A comment in the code indicates that whoever wrote it thought that the overhead of this check would be negligible (when no NaN or NA is present), but this is not always true. If the result matrix has N rows and M columns, and the other dimension of the operands is K, the check will take (N+M)×K operations, whereas the multiplication itself will take N×M×K operations. The NaN check is actually slower than a multiply, so the overhead of the check will be negligible only if N+M is several times smaller than N×M. When either N or M is small (say, less than 10), the overhead is substantial. The worst case is when N and M are both one, which corresponds to a vector dot product operation. The NaN check may dominate even more if the actual multiply is done in parallel using several processors.

In my first set of speed patches for R, I tried to address this problem by estimating when the check would take longer than the matrix multiplication itself, and if so doing the multiplication with simple loops (avoiding the need for an explicit check). This sped up some operations by large factors, while ensuring that all the right NaNs were produced. The R Core people didn’t adopt this patch, however, so R’s matrix multiplications are still as slow as ever.

The latest release of R does have one change, however. The BLAS routines included with the R release have been changed so that they don’t try to save time by avoiding multiplies by zero. This has the effect of fixing a bug in 2.11.1 in which, for example, c(1/0,1) %*% c(0,1) produced 1 rather than NaN. The costly check for NaNs had not actually been a full fix!

This makes the situation in version 2.13.0 rather ridiculous. Many matrix multiplies are slowed down substantially by a check that is unnecessary if the BLAS routines supplied with R are used. These checks might make a difference only if some other set of BLAS routines are linked with R. Generally, users will go to the trouble of doing that only if they are looking to improve the speed of their matrix multiplies. But even the fasted BLAS routine can do nothing to reduce the overhead of the NaN check that is done before it is called! The final irony is that (as far as I can tell) there will still be some “incorrect” results, when a BLAS optimizes away a multiply of infinity by zero. (It looks like this could be fixed, however, by simply changing the “isnan” checks to “!isfinite” checks.)

Although my previous patch sped up many operations considerably (eg, vector dot products sped up by something like a factor of six), I now think that the real solution is to just not be so concerned about NaNs. From a statistical viewpoint, does anyone really expect that missing data indicated by NA will propagate through matrix multiplies? If so, are they also expecting NAs and NaNs to propagate correctly through eigenvalue computations? At some point, it becomes ridiculous to expect this. I think that point comes before matrix multiplies. From a more general viewpoint, the fact that BLAS routines don’t guarantee correct propagation of NaNs is an indication that the community thinks that is less important than speed. Now that the BLAS supplied with R does propagate NaNs correctly, anyone who is concerned about that can simply use the supplied BLAS. Someone who instead installs an optimized BLAS presumably does so because they want their multiplies to go fast.

I’ve therefore created a modified version of R that omits the NaN check. Also, rather than always call the BLAS general matrix multiply routine, DGEMM, I call the routines for vector dot products (DDOT) and matrix-vector products (DGEMV) when the dimensions of the matrices are suitable. Calling these more specialized routines sometimes gives faster results with the BLAS supplied with R. The modified version of the R matrix product routine is here; the original from 2.13.0 is here.

Below, are the factors by which this modified version speeds up matrix multiplies, on my new workstation with an Intel X5680 processor, running at 3.33GHz:

Type	dimensions	speedup
vector dot product	1×1000 times 1000×1	6.5 times faster
vector dot product	1×500000 times 500000×1	9.5 times faster
matrix-vector	5×1000 times 1000×1	3.6 times faster
vector-matrix	1×1000 times 1000×50	5.4 times faster
matrix-matrix	10×100 times 100×10	1.6 times faster
matrix-matrix	500×100 times 100×500	no difference

The speed ups seem to me to be quite worthwhile. This is with the BLAS supplied with R. I expect that the speed ups with an optimized BLAS would be even larger.

Note again that when using the BLAS supplied with R, there is no problem with NaNs not being propagated. There might be a problem with some other BLAS. However, it is difficult to see why an optimized BLAS would check whether an element of a vector in a vector dot product is zero, since it will be used in only one multiply. Checking whether an element of a matrix in a matrix-vector product is zero seems similarly counterproductive. Any problem is therefore probably confined to vector elements in a matrix-vector product or matrix elements in a matrix-matrix product that isn’t a matrix-vector or vector dot product. So if one were to insist on doing a NaN check, one might be able to look only at those elements.

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing.

Speed tests for R — and a look at the compiler Under the slide

12 Comments Add your own

1. Ken | 2011-05-21 at 11:46 pm

This could be easily solved with a check after compilation that R produces the required result. It seems preferable that NA be propagated as they are missing data, so do mean something.
Reply
- 2. Radford Neal | 2011-05-22 at 12:27 am
  
  I’m not sure what you are suggesting here. For any particular BLAS, one could probably determine fairly easily whether or not it propagates NaNs, but what does one do with that information?
  Reply
  - 3. Ken | 2011-05-26 at 12:05 am
    
    If they don’t propagate then fail, and allow an option to overide this. Another alternative would be to only perform the time consuming checks if it fails.
    Reply
4. Patrick McCann | 2011-05-22 at 11:53 am

This is really fascinating; I hope the core team takes notice.
Reply
5. Quant | 2011-05-23 at 4:33 am

thanks a lot, the results look promising, it sounds a silly question, but how can I use your improved function instead of the one embedded by R? thank you.
Reply
- 6. Radford Neal | 2011-05-23 at 10:08 am
  
  I’ll be releasing a new set of speed patches soon. If you’re really eager, of course, and you have installed R from source, you can replace the matprod function with the revised one in the post yourself. It’s in src/main/array.c.
  Reply
7. jcborras | 2011-05-23 at 5:02 am

Algebraically speaking matrices are made of numbers non NAs, and we tolerate NaNs and Infs because IEEE754 clearly states what to do with them.
IMHO and BLAS idiosyncrasies aside, checking for the existence of NAs in a matrix product operation is a misplacement of responsibilities.
And I’d love to hear an opposite argument.
Reply
- 8. Radford Neal | 2011-05-23 at 10:12 am
  
  The issue isn’t just with NAs, but also NaNs and Infs, so this isn’t exclusively an R issue, or one with “missing data” indicators in general. Some BLAS packages don’t, for example, guarantee that 0 times Inf is NaN when it occurs as part of a matrix multiply, because the BLAS has realized that a large number of multiplies will have zero as one operand, and has decided to optimize them all away, on the assumption that the result of multiplying by zero is always zero.
  Reply
9. Ben Bolker | 2011-05-23 at 9:28 am

I wonder if this fix could be incorporated as an option (dangerous.matrix.multiplication=TRUE), if R-core thinks that it is unacceptable not to check in general … ?
Reply
- 10. Radford Neal | 2011-05-23 at 10:13 am
  
  That would certainly be possible. It may also be possible to detect when the supplied BLAS is being used, and not do the checks in that case (they would be unnecessary), regardless of the setting of any such option.
  Reply
11. richierocks | 2011-05-23 at 1:18 pm

This sounds very interesting.

I’m curious as to what sort of analyses have matrix multiplication as a bottleneck. Some things like reading and writing big text files are noticeably slow (though this may be related to an ageing hard drive). Matrix multiplication tends to be hidden away in subfunctions, so it is often unclear when it is being done.
Reply
12. Announcing pqR: A faster version of R | Radford Neal's blog | 2013-06-22 at 2:32 pm

[…] example, vector-matrix multiplies are over ten times faster in pqR than in R-2.15.0 or R-3.0.1 (see here for the main reason why, though pqR solves the problem differently than suggested […]
Reply

Radford Neal's blog