Subset replacement in pqR: Now faster and better

2014-10-13 at 11:01 pm 7 comments

The latest version of pqR that I just released uses a new way of implementing subset replacement operations — such as a[i]<-1 or L$M[1:100,i]<-v. The new approach is much faster, and eliminates some strange behaviour of the previous approach.

This change affects only interpreted code. The bytecode compiler (available since R-2.13.0) introduced a different mechanism, which is also faster than the previous approach used by the interpreter (though it still has some of the strange behaviour). This faster mechanism was one of the main reasons for byte-compiled code to be faster than interpreted code (although it would have been possible to use the new mechanism in the interpreter as well). With pqR’s new implementation of subset replacement, this advantage of byte-compiled over interpreted code is much reduced.

In addition to being faster, pqR’s new approach is also more coherent than the previous approach (still current in the interpreter for R Core releases to at least R-3.1.1), which despite its gross inefficiency and confused semantics has remained essentially unchanged for 18 years. Unfortunately, the new approach in pqR is not as coherent as it might be, because past confusion has resulted in some packages doing “wrong” things, which have to be accommodated, as least in the short term.

Replacement functions. To understand pqR’s new approach, and the problems with the old approach (some not currently fixable), you first need to know how R’s subset replacement operations are defined. The central concept is that every function for extracting part of an object is accompanied by a corresponding function for replacing that part, whose name has “<-” appended. So, for example, the “dimnames” function is accompanied by “dimnames<-“, the “$” operator is accompanied by “$<-“, and “[” is accompanied by “[<-“.

Those three pairs of functions are primitive, but users can define their own pairs of subset and replacement functions. For example, the pair of functions below access or replace those elements of a vector that have odd indexes:

odd_elements <- function (x)
{  x[seq(1,by=2,length=(length(x)+1)%/%2)]  }
`odd_elements<-` <- function (x,value)
{  x[seq(1,by=2,length=(length(x)+1)%/%2)] <- value;  x  }

In general, such functions may take additional arguments that specify which part of the variable should be accessed or modified.

Simple replacements. To see how these replacement functions are used, let’s start with a simple replacement of part of a variable:

x[3:5] <- 13:15

According to the current R Language Definition at r-project.org, the effect of this statement is the same as that of

`*tmp*` <- x
x <- `[<-`(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)

This specification is actually incomplete, since it fails to specify the value of the expression x[3:5] <- 13:15 (which might, uncommonly, be used someplace such as the argument of a function call), but it is close to a literal description of what the interpreter in R Core implementations does — this simple assignment to part of a vector really does cause a variable called “*tmp*” to be created in the current environment, to then be modified, and to finally be removed, with all the overhead this implies. You can confirm that this is what’s happening (for example, in R-3.1.1) by typing the following:

`*tmp*`<-9; a<-c(1,2); a[1]<-3; print(`*tmp*`)

You’ll get an error from print, since a[1]<-3 will have removed *tmp*.

In pqR, x[3:5] <- 13:15 is now instead implemented as something close to the following:

x <- `[<-`(x, 3:5, value=13:15)

This has the same effect as the code in the language definition, except that it has much less overhead, and lacks the undesired side effect of deleting any previously existing *tmp* variable. Subset replacement with a user-defined function is done the same way — for example, odd_elements(x)<-0 is translated to

x <- `odd_elements<-`(x, value=0)

Note that although their use in implementing assignments to subsets is the principal purpose of replacement functions, nothing stops them from being called directly. And it can occasionally be useful to write things like the following:

z <- W %*% `odd_elements<-`(x+y, value=1)

Avoiding duplication. If the `[<-` primitive were implemented in the most obvious way, the call `[<-`(x,i,v) would start by making a duplicate copy of x, then replace the elements of this copy that i indexes by v, and finally return this modified copy as its value. But this would be intolerably inefficient when x is a vector of 1000000 elements, that isn’t shared with any other variable, and i indexes just one of these elements.

The right way to solve this is to not duplicate the first argument of [<- if either it is a value that is not stored anywhere (eg, the result of some arithmetic operation), or it is the value of a variable that is not also stored elsewhere and the call of [<- is part of an assignment operation. This would not be hard to do in pqR, using its “variant result” mechanism (see here) to pass to the replacement operator the information on whether it has been called from an assignment operator.

That’s not what is currently done, however. Instead, the primitive replacement operators such as “[<-” duplicate their first argument only if it is stored in two or more variables (or list elements), regardless of the context in which it is called. This violates the usual pass-by-value semantics of R function calls. For example, the call

y <- `[<-` (x, 1, 0)

ought to set y to the value stored in x with the first element changed to zero, while leaving x unchanged. But it (sometimes) does change x, as you can confirm with the following test:

x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)

Unfortunately, some code now relies on this behaviour, although this is a very bad idea, both for general reasons, and also because in the following slightly different code, “[<-doesn’t change x:

w <- x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)

Worse, the “@<-” and “slot<-” operators for changing the value of a slot in an S4 object have been written to never duplicate their first argument, even if it is shared amongst many variables. To keep this from causing total chaos, the general code for assignment to subsets has to duplicate the value stored in the target variable if it is shared with another variable (even though this is necessary only for “@<-" and “slot<-“), which sometimes results in an extra duplication being done. Unfortunately, this behaviour of “@<-” and “slot<-” is also relied on by some code.

For the moment, pqR accommodates all this bad behaviour, though it would be nice to move to a coherent semantics sometime.

Complex replacements. Assignment operations with more complex replacements are trickier. The R Language Definition defines an assignment such as

L[[2]][3] <- 1

as being equivalent to

`*tmp*` <- L
L<-`[[<-`(`*tmp*`,2,value=`[<-`(`*tmp*`[[2]],3,value=1))
rm(`*tmp*`)

That is, the [[ operator is used to extract the second element (a vector) of L (which has been put in *tmp*), then [<- is used to create a new version of this vector with its second element changed to 1, and finally the [[<- operator is used to put this modified vector back as the second element of L.

The interpreter in R Core implementations (and pqR before the latest release) implement this definition quite literally, actually creating a *tmp* variable, and evaluating index expressions as implied above. This results in strange behaviour. The following code produces the error “cannot change value of locked binding for `*tmp*`”, though it should surely be legal:

L <- list(c(4,7),"x"); b <- c(2,3); L[[ b[1]<-1 ]] [1] <- 9

The following code calls the function f twice, though a programmer writing it would surely expect it to be called only once:

f <- function () { cat("Hi!\n"); 1 }
L <- list(c(4,7),"x"); L[[ f() ]] [1] <- 9

This prints “Hi!” twice, in R-3.1.1 and earlier R Core releases (for both interpreted and byte-compiled code).

How pqR implements complex replacements. These strange behaviours are eliminated in the new pqR implementation, which is also much faster.

In pqR, an assignment that does a complex replacement starts by evaluating the expression on the right side, and then calls in succession all the subset extraction functions that appear on the left side, except for the outermost one. For example, names(L[[f()]])[i]<-g() will first evaluate g(), and then evaluate the extraction functions from the inside out, effectively doing something like

tmp1 <- L[[f()]]
tmp2 < names(tmp1)

However, `tmp1` and `tmp2` are not actual R variables — the interpreter just stores the values extracted internally.

So far, this is similar to what the R Core interpreter does, but there are two crucial differences.

First, when evaluating an extraction function, pqR uses its “variant result” mechanism to ask the extraction function whether the value it returns is an unshared subset of the variable it was extracted from, which can safely be modified, and for which modifications will automatically be reflected in changes to that part of the larger variable.

For example, after L <- list("x", c(1,2)), the expression L[[2]] returns an unshared subset of L. However if either M <- L or M < L[[2]]; were then executed, L[[2]] would no longer be an unshared subset, since it would be shared with the value of M. And after v <- 1:100, the expression v[20:30] does not return an unshared subset, because it will return a copy of part of v, not that part of v itself (unlike list elements, parts of numeric vectors are not objects in themselves).

Knowing when the result of an extraction is an unshared subset is crucial to efficiently updating it. When the result of an extraction is not an unshared subset, and it is referenced elsewhere, pqR duplicates it (at the top level) before doing further extractions and replcements.

The second difference from R Core implementations concerns the index arguments of the extraction functions, which are later also arguments of the corresponding replacement functions. When pqR evaluates a call of an extraction function, such as L[[f()]], it creates what (in the terminology of R internals) are called “promises” for index arguments, such as f() in this example. These promises contain the expression to be evaluated, plus an initially empty field for the value of the expression. When (if ever) the extraction function actually references the index value, the expression is evaluated, and this field is filled in. Later references to the index value do not evaluate the expression again, but just use the value stored in this field of the promise. Crucially, in pqR, these promises are kept for later use when the corresponding replacement function is called, usually with their value fields already filled in.

Avoiding re-evaluation of index arguments saves time, and also eliminates the double occurrence of side effects of evaluation, such as “Hi!” being printed twice in the example above when f() is evaluated twice (once for extracting L[[f()]] and once when replacing that element of L by a call of `[[<-` with f() as the index argument).

Once all the extraction functions have been called, the outermost replacement function is called to store the right hand side of the assignment statement into the result from the last extraction function. The next replacement function is then called to store this modified value into the result of the previous extraction function, and so forth, until the last replacement function call produces the new value for the variable being assigned into.

This is again generally similar to R Core implementations. However, pqR is able to skip some of these replacement calls, when it knows that the result of an extraction function is part of the larger variable. In that case, when that part is modified, nothing has to be done to propagate the modification to the larger variable. For example, to perform the replacement operation below:

L <- list(a=c(1,2),b="x"); L$a[1] <- 9

pqR will first extract L$a, and find that this vector is an unshared subset of L. It will then call `[<-` to replace the first element of this vector by 9, at which point it is done — pqR realizes that there is no need to call the `$<-` replacement function, and also that there is no need to store the final result in L, since it is the same as the object already in L. However, if the assignment L$a[1] <- 1+2i is now done, the replacement of the first element of L$a by the complex number 1+2i will produce a new vector of complex type, and pqR will realize that `$<-` needs to be called to store this new vector in L$a.

R Core implementations try to infer whether an extracted value is an unshared subset from how many references there are to it (see the discussion here), which sort of works, but fails when extraction is done with a vector index, as below:

L <- list(1,list(2,list(3,c(4,5,6))))
K <- L
L[[c(2,2,2,3)]] <- 9

The vector index c(2,2,2,3) refers to the 3rd element of the 2nd element of the 2nd element of the 2nd element of L, which is the number 6. When replacing this by 9, the vector c(4,5,6) needs to be duplicated, because the entire object is shared by K and L. However, the reference count for c(4,5,6) should be only one, since it is referenced from only a single list (albeit one that ultimately is itself shared). To get around this problem, recent R Core releases increment reference counts as a result of simply extracting a value with a vector index, which will leave reference counts that are bigger than they should be, and may therefore cause unnecessary copying to be done later. (Earlier R Core releases just give the wrong answer.) In the new pqR implementation, extraction leaves the reference counts unchanged, but if asked, `[[` will say that the result returned in not an unshared subset, which will lead to the appropriate duplications before replacement functions are called.

User-defined replacement functions. As illustrated by the odd_elements and odd_elements<- functions above, users can write their own replacement functions, which can be used just like the built-in ones. Unfortunately, in both R Core and pqR implementations, there is presently no way to avoid duplication of the value in a variable that is updated with a user-defined function, even when the value is not shared with other variables. For example, odd_elements(a) <- 0 will always duplicate the value in a before setting its odd elements to 0. Furthermore, the modified value stored in a after the replacement will always be marked as shared with the variable x within odd_elements<- (even though x is inaccessible), forcing a copy when it is next updated.

The new version of pqR does avoid some unnecessary duplications that are done in R Core implementations, but the basic problems remain. One fundamental question is what should happen if a user-defined replacement function generates an error after partially changing the value being updated. At present, the variable being updated will be left unchanged after the error exit. But any scheme that doesn’t duplicate the variable being updated will have the possibility of leaving a partial update that was cut short by an error. Successfully resolving such issues would allow for much more efficient use of user-defined replacement functions.

Advertisements

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing.

New version of pqR with faster variable lookup, faster subset replacement, and more Faculty Position in Statistics at the University of Toronto (Mississauga)

7 Comments Add your own

  • 1. Janko Thyson  |  2014-10-14 at 6:48 am

    Haven’t had time to really look into pqR yet, but I find it quite a fascinating project! Keep up the great work and thanks for sharing details on the very internals of R, very interesting!!

    Reply
  • 2. Chris Paciorek  |  2014-10-14 at 7:48 pm

    So I’m confused about what R does when it uses `[ x .Internal(inspect(x))
    @28ae6f0 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 0.88136,-0.322485,-0.308981,-1.41978,-1.25584
    > x[2:3] .Internal(inspect(x))
    @28ae6f0 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 0.88136,1.95407,0.223208,-1.41978,-1.25584

    b) Furthermore, the standard syntax uses very little time compared to running your example code:

    > x = rnorm(1e7)
    > system.time(x[10000:10001] system.time({`*tmp*` <- x; x <- `[<-`(`*tmp*`, 10000:10001, value=rnorm(2)); rm(`*tmp*`)})
    user system elapsed
    0.020 0.028 0.047

    Any thoughts?

    Reply
    • 3. Radford Neal  |  2014-10-14 at 8:33 pm

      I think your code got garbled by wordpress. It takes what you type as HTML, so you have to type “&lt;” instead of “<“. Not doing that tends to make some of what you type disappear.

      Guessing at what you might have meant for (b) at least, the code I show involving *tmp* is an R-level picture of what the R Core interpeter more-or-less does, but the interpreter does do it faster than it would be done at the R level. In particular, at the R level, *tmp* will be treated as an ordinary variable that has to appear different from x, whereas in the interpreter it isn’t, so there’s not as much copying done.

      Reply
      • 4. Chris Paciorek  |  2014-10-14 at 8:52 pm

        yes, sorry about the garbling. Let me try again. Two empirical results that make this confusing. First when I replace into elements of a vector, the memory address for that vector appears unchanged when querying it via .Internal(inspect()).

        Second the amount of time taken compared to when I force an actual copy of an entire long vector is miniscule. So in terms of time taken, it’s either not making a copy or is somehow implemented so effectively that the time is essentially none…

        Here’s the address not changing:

        x = rnorm(5)

        .Internal(inspect(x))

        # @28ae698 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 0.394722,0.212128,1.58906,0.0491079,-0.817408

        x[2:3] = rnorm(2)

        .Internal(inspect(x))

        # @28ae698 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 0.394722,0.794627,0.791023,0.0491079,-0.817408

        and here’s the stark example of how little time the replacement takes, compared to when I force a copy of the entire vector to be made, where the actual copy is delayed because of R’s copy-on-change semantics:

        x = rnorm(1e8)

        system.time({x[1] = 3})

        # user system elapsed
        # 0 0 0

        system.time({y = x})

        # user system elapsed
        # 0 0 0

        system.time({y[1] = 3})

        # user system elapsed
        # 0.160 0.172 0.332

      • 5. Radford Neal  |  2014-10-14 at 9:21 pm

        Yes, for simple replacements like x[2:3] = rnorm(2), no copy is made of all of x. That’s what the “Avoiding duplication” section of my post is about (and for this example, both pqR and R Core implementations avoid duplication).

        For a vector of 100 million elements, replacing one element with no copying required will indeed be much, much faster than when a copy is required. However, it’s still quite possible for a sequence of updates of single elements with no copy done to take a long, long time. You’ll see that a loop doing updates one at a time to all 100 million elements of x will be slower than a copy (by a lot). Whether such updates of single elements (or small parts) are an important portion of the total computation time for your program will, of course, depend on your program…

  • 6. Longhai  |  2014-11-20 at 11:02 am

    I am waiting for a mac dmg file for pqR. I don’t think I will be able to compile the source on mac.

    Reply
    • 7. Radford Neal  |  2014-11-20 at 11:59 am

      Compiling a version of pqR that doesn’t support helper threads isn’t too bad. You need to get Apple’s Xcode from the app store, then get a Fortran compiler (which does come as a dmg or pkg, see the Mac instructions in the pqR wiki for where to get it), then get X Windows if you don’t have it (it used to come with OS X, but now you get it from XQuartz, also as a dmg or pkg), and finally you follow the configuration instructions at the pqR wiki.

      A version of pqR on a Mac that supports helper threads does take more work, since you have to get a recent version of gcc (4.7 or later) from macports.org or brew.sh, which can potentially be tedious and/or problematic.

      I do hope to distribute pqR in more packaged form sometime. But further improvements and testing it out on Windows are higher priority at the moment.

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


Calendar

October 2014
M T W T F S S
« Jun   Dec »
 12345
6789101112
13141516171819
20212223242526
2728293031  

Most Recent Posts