Deferred evaluation in Renjin, Riposte, and pqR

2013-07-24 at 11:03 pm

The previously sleepy world of R implementation is waking up.  Shortly after I announced pqR, my “pretty quick” implementation of R, the Renjin implementation was announced at UseR! 2013.  Work also proceeds on Riposte, with a release planned for a year from now. These three implementations differ greatly in some respects, but interestingly they all try to use multiple processor cores, and they all use some form of deferred evaluation.

Deferred evaluation isn’t the same as “lazy evaluation” (which is how R handles function arguments). Deferred evaluation is purely an implementation technique, invisible to the user, apart from its effect on performance. The idea is to sometimes not do an operation immediately, but instead wait, hoping that later events will allow the operation to be done faster, perhaps because a processor core becomes available for doing it in another thread, or perhaps because it turns out that it can be combined with a later operation, and both done at once.
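
For contrast, here is a small illustration of R's lazy evaluation of arguments, which is a language feature rather than an implementation technique: an argument that is never used is simply never evaluated.

    f <- function (x, y) x
    f(1, stop("this error never happens"))   # returns 1; stop() is never evaluated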

Below, I’ll sketch how deferred evaluation is implemented and used in these three new R implementations, and also comment a bit on their other characteristics. I’ll then consider whether these implementations might be able to borrow ideas from each other to further expand the usefulness of deferred evaluation.

My pqR implementation (discussed also in my blog post here and others it links to) is the most conservative of these projects. It is based on R-2.15.0, and so retains compatibility with the huge number of R packages that work with that version of R (apart from any bugs that I may have introduced, and excepting any packages that use internal characteristics of the R interpreter that they weren’t supposed to rely on).  Many of the performance improvements in pqR result simply from detailed rewrites of C code — indeed, the project began with my realization of just how much scope there must be for such improvements.  But there are other more profound changes in pqR. Two relating to deferred evaluation are the use of helper threads to perform numerical computations concurrently on multicore systems, and, to a lesser extent, the use of variant returns to optimize some operations.
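
As a minimal sketch of how this looks from the user's side (assuming pqR has been started with one or more helper threads enabled), the helpers_disable option, which also appears in the timing experiments in the comments below, can be used to compare immediate evaluation with deferred evaluation in helper threads:

    a <- seq(0, 1, length=1000000)
    b <- seq(1, 2, length=1000000)

    options(helpers_disable=TRUE)    # do all operations immediately, in the master thread
    print (system.time (for (i in 1:100) s <- sum(exp(a*b+1)/2)))

    options(helpers_disable=FALSE)   # allow operations to be deferred to helper threads
    print (system.time (for (i in 1:100) s <- sum(exp(a*b+1)/2)))

The call of sum forces the deferred operations to finish, so the elapsed times measure the full computation in both cases.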

Renjin (see also this talk) is written in Java, but tries to retain compatibility by reusing many parts of R-2.14.2 (eg, by translating C components to Java). One of its goals is to allow R to use huge objects that don’t fit in memory.  Perhaps related to this, its objects are immutable, which is apparently tolerable because it can create “views” of objects that are the same except for some simple changes. These views also play a role in how Renjin can defer computations until a tree of operations has been built, which can then be optimized and perhaps done in parallel.

Riposte (see also this paper) appears to be mostly written in C++, and to currently target only Intel processors. Of the three projects, Riposte’s implementation seems to be the most distinct from the current R Core implementation, which may perhaps lead to compatibility problems. However, this may also allow it to employ more sophisticated techniques. Its use of deferred evaluation seems generally similar to Renjin’s, but I expect that many details differ. (The available documentation on Renjin isn’t sufficient to really tell.)

So, when do these implementations defer evaluations?  Consider the following R code, and assume that the variables a and b are numeric vectors of the same length:

    f <- function (a,b) { u <- a*b+1; exp(u); }
    print(f(a,b)/2)

When evaluating this code, pqR, Renjin, and Riposte may all defer the multiply, the add, the exponentiation, and the division until the final value is actually needed by print.

In pqR, this deferral will happen when helper threads are enabled (provided the vectors a and b are sufficiently long). Tasks for performing the multiply, the add, the exponentiation, and the division will be scheduled as these operations are encountered when interpreting the code. Helper threads may immediately start performing these tasks, or they may not, if all the helper threads are busy. When the master thread needs the actual data to print, it will wait until all these tasks are finished, doing them itself if necessary. However, it’s possible that no wait will be needed — while the master thread was handling interpretive tasks such as returning from the function f, the helper threads may have finished all the computations, allowing printing to start immediately.

Renjin and Riposte also do the equivalent of scheduling tasks for the multiply, add, exponentiation, and division, though for Renjin this is seen as constructing a directed graph of operations, and for Riposte it is seen as creating a “vector trace”. They both will also ensure that these computations are actually performed when the result is needed for printing.

However, Renjin and Riposte differ from pqR in that, as far as I can tell, neither will start any of the vector computations above until the call of print, even if there are idle processor cores. Because of this, it seems that only pqR can overlap interpretive operations in the master thread with numeric computations done in other threads. The advantage of not immediately starting these computations is that when the result is finally needed, the whole tree of computations is available, allowing optimizations to be performed. In particular, operations can be combined, so that possibly long intermediate vectors do not need to be allocated. In the above example, a single loop (perhaps split amongst processor cores) can read successive elements, i, of a and b, compute exp(a[i]*b[i]+1)/2, and store the result (or even just print it without ever storing it).
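
Written out in R purely for illustration (an actual implementation would generate such a loop in compiled code rather than interpret it), the fused computation amounts to a single pass over the data, with no intermediate vectors for a*b, a*b+1, or exp(a*b+1):

    res <- numeric(length(a))
    for (i in seq_along(a))
        res[i] <- exp (a[i]*b[i] + 1) / 2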

This merging of operations can be beneficial even if only one processor core is available. Since the general apparatus for deferred evaluation is already implemented in pqR, adding a capability for merging operations might not be too much work, though I suspect the result would be a bit of a “hack” compared to what Renjin or Riposte do. Actually, pqR already does a very limited amount of merging via its variant result mechanism. For example, the evaluation of sum(exp(x)) will not store exp(x) in memory, but just sum its elements as they are computed.
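
Roughly, the distinction is between the two ways of writing the computation sketched below, where the variant result mechanism lets the first form consume the elements of exp(x) as they are produced:

    s <- sum(exp(x))   # elements of exp(x) can be summed as they are computed

    t <- exp(x)        # here the whole vector exp(x) must be stored in t ...
    s <- sum(t)        # ... before being summed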

In the other direction, I wonder whether pqR’s immediate use of available processor cores could be incorporated into Renjin or Riposte. For very long vectors, starting work immediately may not help much, but I think many R programs work on vectors of moderate size (thousands of elements), for which the time it takes for the interpreter to reach the point where the result is needed may be comparable to the time required for the vector computation. Starting the computation immediately will then give a substantial speedup.

Finally, Renjin, Riposte, and pqR aren’t the only new implementations of R in the works.  There are at least two more — FastR, which uses Java, and CXXR, which uses C++. As far as I can tell, neither of these projects involves deferred evaluation, but they may have other interesting ideas.



17 Comments

  • 1. Wayne  |  2013-07-25 at 12:41 pm

    Exciting! And props for your work on pqR. To some extent R’s lost a bit of its Open Source mojo (at the core), so it’s good to see some new school action there.

  • 2. Justin Talbot  |  2013-07-25 at 1:26 pm

    Author of Riposte here. Great progress on pqR. And I’m happy to see the renjin project publicly announced; I hadn’t checked in on them in a while.

    Your summary of Riposte’s deferred evaluation strategy seems right on, though I think you undersell the main point—for long vectors, R’s execution is completely memory bound. It spends almost all of its time reading and writing vector intermediates to memory. During this time, the CPU is essentially doing nothing.

    Parallelizing this work is not going to be particularly useful; you’ll just end up with 4 cores waiting on memory instead of just 1. (This depends on hardware. On some server architectures you can get more memory bandwidth by using more cores, but this isn’t likely to be true on the commodity hardware most analysts use.)

    The key to making long vector code run fast is eliminating as many memory accesses as possible, which means delaying execution as long as possible so that you can do fusion and so that temporary variables can go out of scope and be proven dead. That’s why Riposte delays vector instructions as long as it can. Starting execution earlier, even on another thread, would actually slow it down most of the time.

    When might parallel execution of deferred tasks be useful? The most obvious is matrix operations. Well-written matrix operations are compute bound. If you can run two or more matrix operations simultaneously, that would be a substantial win. Of course, this requires finding independent matrix operations. I’m not sure how common this is. My guess is that most matrix code consists of sequences of dependent operations.

    The other possibility, as you mention, is shorter vectors which are not memory bound. However, working against this is the overhead of launching a task on another thread. I’m not sure how the constants work out, but you might end up with a relatively narrow range of vector lengths for which parallelization is a clear win.

    This is all handwaving, so I’d be very interested to hear where you are seeing big wins.

  • 3. t. davis  |  2013-07-25 at 1:57 pm

    I’ve been watching the pqR discussion with interest. My great hope is that much of what you are doing becomes a part of “standard” R. In my opinion, the use of threads in R (especially on Windows) isn’t transparent enough to the average user. Perhaps a good “stretch” goal for the R world would be to have pqR-like performance improvements folded into R 3.x, for some small value of x.

    I suspect that Revolution Analytics is looking at similar improvements, but I’d much rather see this sort of thing in standard, open source R.

  • 4. Tim Keitt  |  2013-07-25 at 3:45 pm

    You all need to look at what they are doing with Julia over at julialang.org. Leveraging the LLVM system and type annotation are really good ideas I’ve been suggesting to folks for years.

  • 5. Alex Bertram  |  2013-07-25 at 4:19 pm

    Renjin developer here – great to see this conversation about new R interpreters take off!

    Justin – very cool to see Riposte’s roadmap, your talk at Purdue last year really helped focus our approach on Renjin. I think we’ve been able to go pretty far with deferred vector operations. I talked about an example of the dcor() function from the energy package at UseR!, where we can make it almost to the very end before doing any work. At that point, you can look at the graph, parallelize independent reduction operations like mean, sum or rowMeans, and do what I think Justin calls “vector fusion” by merging something like sum(dist(x) – sqrt(rowMeans(t(dist(x))))) into a single loop JITed out to JVM byte code, which then gets quickly jitted down to machine code.

    What’s interesting is that once you get to the graph of the calculation, it’s a very simple set of operations, so it’s straightforward to write additional targets for the vector fusion: I’ve experimented with using OpenCL as a target (https://github.com/bedatadriven/renjin/tree/feature-opencl) and in the future I can see a set of pluggable backends that target SQL or even Hadoop, depending on where the data is located.

    Finally, one last plug for Renjin – of all the reimplementations, we’ve been able to make the most progress in running existing CRAN packages — see http://packages.renjin.org. We still have a bit of a way to go before we can approach the level of compatibility that a fork like pqR has, but it’s exciting to see so many packages building and tests passing.

    • 6. Steven S.  |  2013-07-26 at 5:47 am

      Hello Alex,
      Thanks to this blog post I found out yesterday about Renjin. I took a look at the site & the presentation and I must say I am super excited about this project! Having a fast, lazy, parallelizing R on the JVM is awesome! The possibility of tightly integrating with all of the server-side Java frameworks is a BIG deal. This also gives R a new object model (the traditional single dispatch model from Java) which might be something that a lot of traditional software devs coming to R might like (although I actually prefer the multidispatch OO from R which blends nicely with the functional paradigm).

      Your comment about possibly targeting SQL or Hadoop (Hive or equivalents?) from the calculation tree: sounds a lot like LINQ/F# type providers (minus the static type checking) and this is for me THE killer feature of C#/F#, so YES PLEASE!!!

      • 7. alex  |  2013-07-30 at 7:38 pm

        Yes! That’s exactly it! I think LINQ is a very good model of where you can go with a deeply functional language like R.

        Regarding the javaObject$method() syntax, yes, I’m a little nervous to add a *fourth* object system to R, but I thought trying to shoehorn JVM method invocation into the S3/S4 dispatch system might require too many extra R functions sitting around – one for each uniquely named method in a Java library? For some libraries this could be thousands…

        Please try out the latest release candidate and let me know how you’re able to integrate it into your server environment!

        • 8. Steven S.  |  2013-07-31 at 3:27 am

          Hi Alex,
          don’t worry about the Java object model being exposed in R. I think this is fine in the context of Java integration. The same goes for e.g. F#: the object model also feels alien in F#, but you know it’s just there for .NET (mostly C#) integration and you shouldn’t use it as your main paradigm.

          Keep up the good work. I am eagerly looking forward to a version that might include something LINQ-like, not just for querying external data sources but also in-memory data structures (although data.table is already pretty sweet). BTW: I don’t know if you have pondered this, but this mechanism could be used for more than querying internal/external data; it could also be used for integration with external languages and libraries (e.g. Matlab, Mathematica, Python, …). This is something that is very active now in F# type providers: see e.g. http://richardminerich.com/2013/07/the-promise-of-f-language-type-providers

  • 9. Radford Neal  |  2013-07-25 at 10:07 pm

    Thanks for your comments, Justin and Alex. One thing I would be interested to hear is, how do you two think the approaches of Renjin and Riposte differ?

    Regarding memory bandwidth when pqR is doing unmerged operations on long vectors, empirically it seems to not be the constraint on how much is gained from helper threads, since performance doesn’t change much with vector size, past the point where overhead dominates. This is a bit puzzling, actually, but here are some results…

    First, a couple of little functions and a “doit” function to try them out, first without any helper threads and then with helpers (five helpers, for six threads total, on an Intel X5680 with six cores and a 12 megabyte cache).

    f1 = function (n, m) {
        r = numeric(10)
        a = seq(0,1,length=n)
        b = seq(1,2,length=n)
        for (i in 1:m) {
            x = a^2 * (b^2 + i/m)
            r[i%%10+1] = r[i%%10+1] + sum(x)
        }
        r
    }

    f2 = function (n, m) {
        r = numeric(10)
        a = seq(0,1,length=n)
        b = seq(1,2,length=n)
        for (i in 1:m) {
            x = exp (tanh (a^2 * (b^2 + i/m)))
            r[i%%10+1] = r[i%%10+1] + sum(x)
        }
        r
    }

    doit = function (f,n) {
        m = round(25000000/n)
        cat("NO HELPERS:\n")
        options(helpers_disable=TRUE)
        print(system.time(f(n,m)))
        print(system.time(f(n,m)))
        cat("WITH HELPERS:\n")
        options(helpers_disable=FALSE)
        print(system.time(f(n,m)))
        print(system.time(f(n,m)))
    }

    Note that “sum” will wait for its argument to be computed (though I intend to change that), so that’s the point where the master thread will wait for everything to be finished. The second elapsed time printed will be more reliable (first may be influenced by memory allocation).

    Here are the results with f1, which does only simple arithmetic.

    > doit(f1,2500)
    NO HELPERS:
    user system elapsed
    0.172 0.024 0.194
    user system elapsed
    0.160 0.028 0.189
    WITH HELPERS:
    user system elapsed
    0.892 0.024 0.153
    user system elapsed
    0.912 0.012 0.157
    > doit(f1,2500000)
    NO HELPERS:
    user system elapsed
    0.340 0.048 0.388
    user system elapsed
    0.264 0.060 0.327
    WITH HELPERS:
    user system elapsed
    0.560 0.048 0.247
    user system elapsed
    0.544 0.036 0.221

    For vectors of length 2500, using helper threads gives a small speedup of about a factor of 1.2. For vectors of length 2.5 million, the speedup is a factor of 1.48. Intermediate lengths give intermediate speedups.

    For f2, which also has exp and tanh in the computation, the results are as follows:

    > doit(f2,2500)
    NO HELPERS:
    user system elapsed
    1.492 0.024 1.519
    user system elapsed
    1.512 0.004 1.519
    WITH HELPERS:
    user system elapsed
    5.536 0.012 0.932
    user system elapsed
    5.584 0.008 0.939
    > doit(f2,2500000)
    NO HELPERS:
    user system elapsed
    1.693 0.044 1.741
    user system elapsed
    1.640 0.028 1.674
    WITH HELPERS:
    user system elapsed
    2.996 0.032 0.966
    user system elapsed
    2.984 0.052 0.973

    This time, the speedup is a factor of 1.62 for vectors of length 2500 and a factor of 1.72 for vectors of length 2.5 million.

    Given that pqR will be using all six cores for f2, a speedup by a factor of 1.72 isn’t too impressive, but it certainly doesn’t seem that memory bandwidth limits have eliminated any gains. The speedup from using helpers gets bigger if one adds extra stuff for the interpreter to do after the assignment to x and before sum(x).

    In trying to understand these results, remember that pqR does pipelining, so the task doing the “exp” will start as soon as the “tanh” task starts, processing its output as it is produced. Because of this, even for long vectors that don’t all fit in cache, a read by the “exp” task may find that the value produced by “tanh” is still in cache. Analysing what happens seems complicated…

  • 10. Justin Talbot  |  2013-07-28 at 1:32 am

    I can’t really speak to the difference between Riposte’s and renjin’s implementations. At a high level, I’d assume that they’re probably quite similar. The vector trace vs. directed graph distinction is really just terminology. The bigger differences are probably in the low-level details—how liveness of temporaries is tracked, how the deferred evaluation is handled by the interpreter, etc. I don’t know any of the details about renjin’s implementation. Maybe Alex would be able to make a better comparison?

    I’ll have to think some more about your example. Your processor is a Xeon, which is a server chip. You will get more bandwidth as you use more cores, but it will be less than linear. I’d be interested to see the tests run on a core i5 or i7 and see if the results are different.

    I’d like to know more about how pqR does pipelining. How do the threads coordinate?

    • 11. Radford Neal  |  2013-07-28 at 6:57 pm

      For the pipelining itself, the thread producing data writes some data to the variable, does an OpenMP “flush” (which generates an mfence instruction on Intel processors) and then writes the amount of data generated so far to a field the other threads can see. The thread reading the data does some slightly trickier things to find out how much data is currently available (it has to account for the possibility that the producer has finished and terminated). This is pretty low overhead, though the producer will typically say it has generated more data only after producing another fair-size chunk, to reduce the overhead of this further.

      More generally, coordination of threads is through a queue of tasks needing to be done, which includes dependencies of one task on output of another task (which may or may not be pipelinable). Some locking is done when accessing this queue, but it’s also pretty low overhead. I tried to keep the overhead low by limiting generality – for instance, tasks have exactly one output and two inputs (though these can be null), which eliminates overhead from loops over variable numbers of arguments.

      Within the bulk of the interpreter, coordination is through two bits in the header of every object saying whether the object is “being computed” by some task and whether it is “in use” as an input to some task. These bits are set and cleared only by the master thread (when it schedules tasks, or notices that tasks have been completed), so they can be accessed by the interpreter (in the master thread) without any synchronization overhead.

      The “in use” flag has to be looked at in many places by code written without knowledge of helper threads (legacy code, in particular). This is accomplished by redefining the NAMED macro to wait until “in use” isn’t set before returning the NAMED value, which has the effect of preventing an object that is in use by a helper thread from being modified (which will be done only after checking that NAMED is less than two).

      The “being computed” flag is consulted only by code that is aware of the possibility that an object might not be computed yet. Code that isn’t aware of this possibility will never see an object in the process of being computed, because “eval”, variable lookup, etc. will wait for the object to be computed before returning it unless specifically requested not to wait by the caller. So the needed coordination within the interpreter adds very little overhead.

    • 12. Radford Neal  |  2013-07-28 at 8:54 pm

      Regarding memory bandwidth… I tried the f1 and f2 examples above on a 2GHz Intel Core Duo processor (in an old MacBook, with 2GBytes of main memory and 2Mbytes of cache). For the f1 example with only simple arithmetic, there is no consistent gain using one helper thread (two threads total, matching the two cores) – depending on vector size, it ranges from slightly slower to slightly faster. There is a gain from using a helper thread in the f2 example, however:

      > doit(f2,2500)
      NO HELPERS:
      user system elapsed
      2.970 0.015 2.985
      user system elapsed
      2.986 0.004 2.991
      WITH HELPERS:
      user system elapsed
      4.009 0.089 2.165
      user system elapsed
      4.027 0.085 2.181
      > doit(f2,2500000)
      NO HELPERS:
      user system elapsed
      3.370 0.502 3.872
      user system elapsed
      3.162 0.489 3.651
      WITH HELPERS:
      user system elapsed
      3.905 0.710 2.486
      user system elapsed
      4.006 0.659 2.506

      I think at least part of the explanation is that exp and tanh are slow enough that memory bandwidth doesn’t dominate.

    • 13. alex  |  2013-07-30 at 7:44 pm

      Hi Justin,
      My understanding of Riposte is still quite limited, but I think the biggest difference is that Riposte still anchors an R vector to an in-memory array, is that right?

      So when I try with riposte:
      > sum(1:1e9)
      [1] 5e+17
      > x<-1:1e9; sum(x)
      Segmentation fault (core dumped)

      So the expression 1:1e9 is fused into the sum loop, but only within in a single expression? Or does assignment into the environment trigger calculation?

      With Renjin, the 1:1e9 evaluates to a sequence object, which can be passed around freely.

      I've tried to go through an example in depth here:
      http://www.renjin.org/blog/2013-07-30-deep-dive-vector-pipeliner.html

      • 14. Justin Talbot  |  2013-07-30 at 9:41 pm

        No, Riposte can represent vectors (including vectors in environments) in deferred form (called Futures in the code).

        The behavior you are seeing touches on a difficult issue with deferred evaluation. In cases where a deferred value can be used more than once, deferring too long can actually greatly increase runtime.

        Imagine your code looked like the following:

        > x <- super.expensive.fn(1:1e9); sum(x)
        1
        > mean(x)
        2

        If x is deferred as long as possible, as you suggest, super.expensive.fn will have to be executed twice, once when evaluating the sum and again when evaluating the mean. Worse, since the cost of super.expensive.fn could be arbitrarily large, this could result in a bad slowdown compared to eager evaluation.

        A simple fix, which the Riposte master branch uses, is to materialize any output that it cannot prove will not be used again. When Riposte sees:

        > x <- 1:1e9; sum(x)

        it doesn’t know if x might be used again, so it materializes x at the same time it computes the sum. This avoids the potential for expensive recompilation at the cost of potentially unnecessary materialization.

        In cases where Riposte can prove that a vector is dead after the deferred computation is run, it will avoid materializing it. For example, try:

        f <- function() { x <- 1:1e9; sum(x) }; f()

        A better solution, which I’ve been experimenting with in a separate branch, would be to estimate the cost of reevaluating a deferred expression vs. the cost of materializing it, then pick the cheaper one. In your example 1:1e9 is cheap to evaluate, so we would leave it deferred. But super.expensive.fn(1:1e9) is expensive, so we would materialize it on the first use and then load it from memory on subsequent uses.

        And, of course, Riposte needs to do a better job of figuring out that it’s running out of memory.

  • 15. isomorphismes  |  2013-07-28 at 4:31 am

    Reblogged this on isomorphismes.

  • […] always result in the best performance – sometimes it can be faster to wait. This “deferred execution” becomes very important when trying to parallelise code, as we’ll see in a […]

  • 17. DSC 2014. Day 1 | JAGS News  |  2014-07-07 at 12:15 pm

    […] optimizations. Some of these optimizations are described in more detail in his blog. In particular, deferred evaluation and task merging are also used in alternative R implementations Riposte and Renjin (see below). […]

