Fixing R’s NAMED problems in pqR

2013-07-02 at 9:44 pm 3 comments

In R, objects of most types are supposed to be treated as “values”, that do not change when other objects change. For instance, after doing the following:

  a <- c(1,2,3)
  b <- a
  a[2] <- 0

b[2] is supposed to have the value 2, not 0. Similarly, a vector passed as an argument to a function is not normally changed by the function. For example, with b as above, calling f(b), will not change b even if the definition of f is f <- function (x) x[2] <- 0.

This semantics would be easy to implement by simply copying an object whenever it is assigned, or evaluated as the argument to a function. Unfortunately, this would be unacceptably slow. Think, for example, of passing a 10000 by 10000 matrix as an argument to a little function that just accesses a few elements of the matrix and returns a value computed from them. The copying would take far longer than the computation within the function, and the extra 800 Megabytes of memory required might also be a problem.

So R doesn’t copy all the time. Instead, it maintains a count, called NAMED, of how many “names” refer to an object, and copies only when an object that needs to be modified is also referred to by another name. Unfortunately, however, this scheme works rather poorly. Many unnecessary copies are still made, while many bugs have arisen in which copies aren’t made when necessary. I’ll talk about this more below, and discuss how pqR has made a start at solving these problems.

One problem with the NAMED scheme in R (all recent versions up to at least R-3.0.1) is that the count can only be 0, 1, or 2, with 2 really meaning “2 or more”. Furthermore, the count never decreases.

Extending the example above illustrates how this causes unnecessary copying:

  a <- c(1,2,3)
  b <- a
  a[2] <- 0
  b[2] <- 4

Here, no copy is made when b <- a is executed. Instead, both a and b refer to the same object, which has NAMED set to 2. When executing a[2] <- 0, the R interpreter realizes that a copy needs to be made before changing a[2]. After this, a will refer to this new copy, which has NAMED set to 1, while b refers to the original object. Unfortunately, the original object still has NAMED set to 2, so when b[2] <- 4 is executed, another copy is made, unnecessarily. (This copying is fairly trivial for a vector of length three, but of course the same thing happens for vectors or matrices of any size.)

For similar reasons, simply passing something as an argument to a function will cause a copy to be made if it is later modified. For example, when doing the following:

  f <- function (X) sum(X^2)
  Y <- matrix(1.1,100,100)
  print(f(Y))
  Y[1,1] <- 2

R will make a copy of Y before changing Y[1,1], even though there are no other references to Y at this time.

So the NAMED scheme fails to eliminate many unnecessary copies. Despite this, it has also been a source of many bugs in which copies aren’t made when they should have been — for recent ones, see some of the bugs listed as fixed in the pqR NEWS file.

I think that one underlying cause of these bugs is that how NAMED should be used is not documented. Despite the terminology, reflected in my explanation above, NAMED should count not just the number of “names” referring to an object, but also the number of references to it as a member of a list (and maybe also as an attribute?). Except that that’s not exactly how things are actually done — objects with NAMED set to 0 are tolerated as list elements, on the understanding that this really means that NAMED is 1 if the list itself has a non-zero value for NAMED. Or something like that.

Addressing these performance and reliability issues is an ongoing task in pqR. Documenting a clear set of conventions is part of the plan, to be followed by reviewing code to fix places that don’t follow these conventions. The problem of unnecessary copying is addressed in pqR by extending NAMED to be a true count of references to an object, with this count now called NAMEDCNT. (For compatibility, the existing NAMED and SET_NAMED functions still exist, implemented in terms of NAMEDCNT.)

NAMEDCNT has a range of 0 to 7, and is intended to be as close as possible to an actual count of names and other references to an object, which will be decremented when references cease to be. There will still be some situations where NAMEDCNT isn’t decremented when it should be, so the count will remain an upper bound rather than an exact count. (This will certainly be the case if the count reaches the maximum of 7, and hence can no longer be decremented, since how much bigger than 7 it is will be unknown.) But keeping a more accurate count than previous versions of R is possible in many situations.

The benfits of keeping a better count are illustrated by the example I give above with the statement b[2] <- 4. In pqR, this statement does not cause b to be copied — when the preceding statement a[2] <- 0 causes a newly copied object to be assigned to a, pqR realizes that NAMEDCNT for the value previously bound to a can be decremented. Since this is the same object that is bound to b, its NAMEDCNT becomes 1, allowing b[2] <- 4 to be done without an unnecessary second copying operation.

Unfortunately, in the current (2013-06-28) version of pqR, the function call f(Y) in the latter example above still leads to a copy of Y being made unnecessarily when Y[1,1] is changed. So there is more work to be done, both with regard to performance and with regard to ensuring correctness. However, the current version of pqR does avoid the bug in R-3.0.1 involving NAMED that was reported today. (Though pqR does give the wrong answer for a more elaborate version of this bug.)

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing. Tags: pqR.

How pqR makes programs faster by not doing things Deferred evaluation in Renjin, Riposte, and pqR

3 Comments Add your own

1. Corey | 2013-07-03 at 10:42 am

Wow. That is terrifying, and also explains why my nested list approach of a few years ago mysteriously bugged out.

Thanks for this. I sure hope the R Core dev team gets with the program — they need you.
Reply
2. Karl | 2013-07-04 at 5:29 am

Thanks, very instructive !
Just to be sure to understand: why capping the NAMEDCNT at 7 ?
For memory savings ? Isn’t that negligible ?
Reply
- 3. Radford Neal | 2013-07-04 at 11:09 am
  
  The maximum of 7 comes from three bits being used for NAMEDCNT, versus two bits for the old NAMED. This is part of the header for every object. There are no spare bits at the moment, and indeed I had to get rid of the tracemem facility to make everything fit as is. Putting in one more bit would require either getting rid of some other facility (such as helper threads), or expanding the header. Given the alignment requirements for pointers, expanding the header would use up 8 bytes for a 64-bit configuration or 4 bytes for a 32-bit configuration, which is a non-negligible amount compared to the size of a vector of one real value, which is 48 bytes for a 64-bit configuration and 32 bytes for a 32-bit configuration. This affects run time, not just memory usage, since these bytes have to be read from memory, and occupy space in the cache.
  
  The maximum of 7 ought to be enough for most uses. I think it’s fairly hard to imagine scenarios where an object acquires 7 or more references and then loses all but one of these references, so that it could be modified without copying if only NAMEDCNT had been big enough to keep track of the references.
  Reply

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Radford Neal's blog