The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).

A superaccumulator is a giant fixed-point number that can exactly represent any floating-point value (and then some, to allow for bigger numbers arising from doing sums). This concept has been used before to implement exact summation methods. But if done in software in the most obvious way, it would be pretty slow. In my paper, I introduce two new variations on this method. The “small” superaccumulator method uses a superaccumulator composed of 64-bit “chunks” that overlap by 32 bits, allowing carry propagation to be done infrequently. The “large” superaccumulator method has a separate chunk for every possible combination of the sign bit and exponent bits in a floating-point number (4096 chunks in all). It has higher overhead for initialization than the small superaccumulator method, but takes less time per term added, so it turns out to be faster when summing more than about 1000 terms.

Here is a graph of performance on a Dell Precision T7500 workstation, with a 64-bit Intel Xeon X5680 processor:

The horizontal axis is the number of terms summed, the vertical axis the time per term in nanoseconds, both on logarithmic scales. The time is obtained by repeating the same summation many times, so the terms summed will be in cache memory if it is large enough (vertical lines give sizes of the three cache levels).

The red lines are for the new small (solid) and large (dashed) superaccumulator methods. The blue lines are for the iFastSum (solid) and OnlineExact (dashed) methods of Zhu and Hayes (2010), which appear to be the fastest previous exact summation methods. The black lines are for the obvious (inexact) simple summation method (solid) and a simple out-of-order summation method (dashed), which adds terms with even and odd indexes separately, then adds together these two partial sums. Out-of-order summation provides more potential for instruction-level parallelism, but may not produce the same result as simple ordered summation, illustrating the reproducibility problems that arise when trying to speed up non-exact summation.

One can see that my new superaccumulator methods are about twice as fast as the best previous methods, except for sums of less than 100 terms. For large sums (10000 or more terms), the large superaccumulator method is about 1.5 times slower than the obvious simple summation method, and about three times slower than out-of-order summation.

These results are all for serial implementations. One advantage of exact summation is that it can easily be parallelized without affecting the result, since the exact sum is the same for any summation order. I haven’t tried a parallel implementation yet, but it should be straightforward. For large summations, it should be possible to perform exact summation at the maximum rate possible given limited memory bandwidth, using only a few processor cores.

For small sums (eg, 10 terms), the exact methods are about ten times slower than simple summation. I think it should be possible to reduce this inefficiency, using a method specialized to such small sums.

However, even without such an improvement, the new superaccumulator methods should be good enough for replacing R’s “mean” function with one that computes the exact sample mean, since for small vectors the overhead of calling “mean” will be greater than the overhead of exactly summing the vector. Summing all the data points exactly, then rounding to 64-bit floating point, and then dividing by the number of data points wouldn’t actually produce the exactly-rounded mean (due to the effect of two rounding operations). However, it should be straightforward to combine the final division with the rounding, to produce the exactly-correct rounding of the sample mean. This should also be faster than the current inexact two-pass method.

Modifying “sum” to have an “exact=TRUE” option also seems like a good idea. I plan to implement these modifications to both “sum” and “mean” in a future version of pqR, though perhaps not the next version, which may be devoted to other improvements.


It seems that lots of actual data vectors could be stored more compactly than at present. Many integer vectors consist solely of elements that would fit in one or two bytes. Logical vectors could be stored using two bits per element (allowing TRUE, FALSE, and NA), which would use only one-sixteenth as much memory as at present. It’s likely that many operations would also be faster on such compact vectors, so there’s not even necessarily a time-space tradeoff.

For integer and logical types, the possible compact representations, and how to work with them, are fairly obvious. The challenge is how to start using such compact representations while retaining compatibility with existing R code, including functions written in C, Fortran, or whatever. Of course, one could use the S3 or S4 class facilities to define new classes for data stored compactly, with suitable redefinitions of standard operators such as `+`, but this would have substantial overhead, and would in any case not completely duplicate the behaviour of non-compact numeric, integer, or logical vectors. Below, I discuss how to implement compact representations in a way that is completely invisible to R programs. I hope to try this out in my pqR implementation of R sometime, though other improvements to pqR have higher priority at the moment.

How to compactly represent floating-point data (of R’s `numeric` type) is not so obvious. If the use of a compact representation is to have no effect on the results, one cannot just use single-precision floating point. I describe a different approach in a new paper on Representing numeric data in 32 bits while preserving 64-bit precision (also on arxiv). I’ll present the idea of this paper next, before returning to the question of how one might put compact representations of any sort into an R interpreter, invisibly to R programs.

Statistical applications written in R typically represent numbers read from data files using 64-bit double-precision floating-point numbers (unless all numbers are integers). However, the information content of such data is often small enough that each data item could be represented in 32 bits. For example, if every item in the data file contains six or fewer digits, with the decimal point in one of the seven places before or after one of these digits, there are less than 7 million possible numbers (14 million if a number might be negative), which is much less than the approximately 4 billion possible values of a 32-bit element.

However, representing such data with 32-bit single-precision floating-point values won’t really work. Single-precision floating-point will be able to distinguish all numbers that have no more than six digits, but if these numbers are used in standard arithmetic operations, the results will in general *not* be the same as if they had been represented with 64-bit double precision. The problem is that numbers, such as 0.1, that have non-terminating binary representations will be rounded to much less precise values in single precision than in double precision.

Exactly the same results as using double precision could be obtained by using a decimal floating point representation. For example, a compact number could consist of a 28-bit signed integer, *m*, and a 4-bit exponent, *e*, which represent the number *m*×10^{–e}. To decode such a number, we would extract *m* and *e* with bit operations, use *e* to look up 10^{e} from a table, and finally divide *m* by 10^{e}. Unfortunately, the final division operation is comparatively slow on most current processors, so compressing data with this method would lead to substantially slower operations on long vectors (typically about six times slower). It’s much faster to instead multiply *m* by 10^{-e}, but this will not give accurate results, since 10^{-e} cannot be exactly represented in binary notation.

In my paper, I propose a faster way of representing 64-bit floating-point values in 32 bits, while getting exactly the same answers. The idea is simply to store only the upper 32 bits of the 64-bit number, consisting of the sign, the exponent, and the upper 20 bits of the mantissa (21 bits of precision, counting the implicit 1 bit). To use such a compact number, we need to fill in the lower 32 bits of the mantissa, which is done by looking these bits up in a table, indexing with some bits from the retained part of the mantissa and perhaps from the exponent.

Of course, this only works for some subsets of possible 64-bit floating-point values, in which there aren’t two numbers with the same upper 32 bits. Perhaps surprisingly, there are a number of interesting subsets with this property. For example, the set of all six-digit decimal numbers with the decimal point before or after any of the digits can be represented, and decoded using a table indexed by 19 bits from the mantissa and the exponent. Some smaller subsets can be decoded with smaller tables. More details are in the paper, including timing results for operations on vectors of such compactly-represented values, which show that it’s faster than representing data by decimal floating point, and sometimes faster than using the original 64-bit floating point values.

An interesting feature of this scheme is that the compact representation of a 64-bit value is the same regardless of what subset is being represented (and hence what decoding table will be used). So when compressing a stream of data, the data can be encoded before it is known what decoding scheme will be used. (Of course, it may turn out that no decoding scheme will work, and hence the non-compact form of the data will need to be used.) In contrast, when trying to compress an integer vector by storing it in one or two bytes, it may initially seem that a one-byte representation of the data will be possible, but if an integer not representable in one byte is later encountered, the previous values will need to be expanded to two bytes.

I’d like to be able to use such compact representations for R vectors invisibly — without changing any R programs, or external routines called from R that use the R API. This requires that a compactly-represented vector sometimes be converted automatically to its non-compact form, for example, when passed to an external routine that knows nothing about compact representations, or when it is operated on by some part of the R interpreter that has not been re-written to handle compact vectors. Compactly-represented vectors will also need to be expanded to their non-compact form when an element of the vector is replaced by a value that is not in the set that can be compactly represented.

It should be possible to accommodate code that doesn’t know about compact representations using the same variant result mechanism in pqR that is used to implement parallel computation in helper threads. With this mechanism, code in C that calls the internal “eval” function to evaluate an R expression can specify that the caller is prepared to handle a “variant” of the normal evaluation result, which in this application would be a result that is a compactly-stored vector. By default, such variant results will not be returned, so code that is unaware of compact vectors will still work. Of course, compact representations will be useful only if modifications to handle compact representations have been made to many parts of the interpreter, so that vectors can often remain in their compact form.

When we do need to expand a compact vector into its non-compact form, how should we do it? Should we keep the compact form around too, and use it if we no longer need the expanded form? That seems bad, since far from reducing memory usage, we’d then be increasing it by 50%. And even if we discard the compact form after expanding it, we still use 50% more memory temporarily, while doing the expansion, which may cause serious problems if the vector is really huge.

We can avoid these issues by expanding the vector in place, having originally allocated enough memory for the non-compact representation, with the compact form taking up only the first part of this space allocation. Now, this may seem crazy, since the whole point of using a compact representation is to avoid having to allocate the amount of memory needed for the non-compact representation! Modern operating systems can be clever, though. At least on Linux, Solaris, and Mac OS X, if you allocate a large block of memory (with C’s malloc function), real physical memory will be assigned to addresses in this memory block only when they are actually used. So if you use only the first half of the block, only that much physical memory will be allocated — except that allocation is actually done in units of “pages”, typically around 4 KBytes. So as long as physical memory (rather than virtual address space) is what you’re short of, and the vector is several times larger than the page size, allocating enough memory to hold the vector’s non-compact form should still save on physical memory if in fact only the compact representation is used.

Expanding compact vectors in place also avoids problems with garbage collections being triggered at unexpected times, and with the address of a vector changing when existing code may assume it will stay the same. Indeed, it’s not clear that these problems could be solved any other way.

However, one unfortunate consequence of allocating space to allow expansion in place is that compact representations will not help with programs that create a huge number of short vectors, because the allocation of physical memory in units of pages limits the usefulness of compact representations to vectors of at least a few thousand elements. It’s difficult to assess how often compact representations will provide a substantial benefit in real applications until they have been implemented, which as I mentioned above, will have to wait until after several other planned features have been implemented in pqR.



Toronto, March 2015. Fujica G690 with 100mm 1:3.5 lens, Kodak Portra 400 film (120), Nikon Coolscan 9000.


Faculty at the suburban Mississauga campus teach undergraduate courses there, but also spend much time at the Department of Statistical Sciences on the downtown campus, teaching graduate courses, supervising graduate students, attending research seminars, etc. The University of Toronto has a diverse group of both young and experienced faculty working in statistics, both in the downtown and suburban statistics departments, and in related research groups such as machine learning and biostatistics.

The deadline to apply is December 15, 2014. You can see the ad here.


This change affects only interpreted code. The bytecode compiler (available since R-2.13.0) introduced a different mechanism, which is also faster than the previous approach used by the interpreter (though it still has some of the strange behaviour). This faster mechanism was one of the main reasons for byte-compiled code to be faster than interpreted code (although it would have been possible to use the new mechanism in the interpreter as well). With pqR’s new implementation of subset replacement, this advantage of byte-compiled over interpreted code is much reduced.

In addition to being faster, pqR’s new approach is also more coherent than the previous approach (still current in the interpreter for R Core releases to at least R-3.1.1), which despite its gross inefficiency and confused semantics has remained essentially unchanged for 18 years. Unfortunately, the new approach in pqR is not as coherent as it might be, because past confusion has resulted in some packages doing “wrong” things, which have to be accommodated, as least in the short term.

**Replacement functions.** To understand pqR’s new approach, and the problems with the old approach (some not currently fixable), you first need to know how R’s subset replacement operations are defined. The central concept is that every function for extracting part of an object is accompanied by a corresponding function for replacing that part, whose name has “`<-`” appended. So, for example, the “`dimnames`” function is accompanied by “`dimnames<-`“, the “`$`” operator is accompanied by “`$<-`“, and “`[`” is accompanied by “`[<-`“.

Those three pairs of functions are primitive, but users can define their own pairs of subset and replacement functions. For example, the pair of functions below access or replace those elements of a vector that have odd indexes:

```r
odd_elements <- function (x)
{ x[seq(1,by=2,length=(length(x)+1)%/%2)] }

`odd_elements<-` <- function (x,value)
{ x[seq(1,by=2,length=(length(x)+1)%/%2)] <- value; x }
```

In general, such functions may take additional arguments that specify which part of the variable should be accessed or modified.

**Simple replacements.** To see how these replacement functions are used, let’s start with a simple replacement of part of a variable:

```r
x[3:5] <- 13:15
```

According to the current R Language Definition at r-project.org, the effect of this statement is the same as that of

```r
`*tmp*` <- x
x <- `[<-`(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
```

This specification is actually incomplete, since it fails to specify the value of the expression `x[3:5] <- 13:15` (which might, uncommonly, be used someplace such as the argument of a function call), but it is close to a literal description of what the interpreter in R Core implementations does — this simple assignment to part of a vector really does cause a variable called “`*tmp*`” to be created in the current environment, to then be modified, and to finally be removed, with all the overhead this implies. You can confirm that this is what’s happening (for example, in R-3.1.1) by typing the following:

```r
`*tmp*`<-9; a<-c(1,2); a[1]<-3; print(`*tmp*`)
```

You’ll get an error from `print`, since `a[1]<-3` will have removed `*tmp*`.

In pqR, `x[3:5] <- 13:15` is now instead implemented as something close to the following:

```r
x <- `[<-`(x, 3:5, value=13:15)
```

This has the same effect as the code in the language definition, except that it has much less overhead, and lacks the undesired side effect of deleting any previously existing `*tmp*` variable. Subset replacement with a user-defined function is done the same way — for example, `odd_elements(x)<-0` is translated to

```r
x <- `odd_elements<-`(x, value=0)
```

Note that although their use in implementing assignments to subsets is the principal purpose of replacement functions, nothing stops them from being called directly. And it can occasionally be useful to write things like the following:

```r
z <- W %*% `odd_elements<-`(x+y, value=1)
```

**Avoiding duplication.** If the ``[<-`` primitive were implemented in the most obvious way, the call `` `[<-`(x,i,v) `` would start by making a duplicate copy of `x`, then replace the elements of this copy that `i` indexes by `v`, and finally return this modified copy as its value. But this would be intolerably inefficient when `x` is a vector of 1000000 elements that isn’t shared with any other variable, and `i` indexes just one of these elements.

The right way to solve this is to **not** duplicate the first argument of `[<-` if *either* it is a value that is not stored anywhere (eg, the result of some arithmetic operation), *or* it is the value of a variable that is not also stored elsewhere and the call of `[<-` is part of an assignment operation. This would not be hard to do in pqR, using its “variant result” mechanism (see here) to pass to the replacement operator the information on whether it has been called from an assignment operator.

That’s not what is currently done, however. Instead, the primitive replacement operators such as “`[<-`” duplicate their first argument only if it is stored in two or more variables (or list elements), regardless of the context in which it is called. This violates the usual pass-by-value semantics of R function calls. For example, the call

```r
y <- `[<-`(x, 1, 0)
```

ought to set `y` to the value stored in `x` with the first element changed to zero, *while leaving x unchanged*. But it (sometimes) does change `x`, as this example shows:

```r
x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)
```

Unfortunately, some code now relies on this behaviour, although this is a very bad idea, both for general reasons, and also because in the following slightly different code, “`[<-`” *doesn’t* change `x`:

```r
w <- x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)
```

Worse, the “`@<-`” and “`slot<-`” operators for changing the value of a slot in an S4 object have been written to *never* duplicate their first argument, even if it is shared amongst many variables. To keep this from causing total chaos, the general code for assignment to subsets has to duplicate the value stored in the target variable if it is shared with another variable (even though this is necessary only for “`@<-`” and “`slot<-`”), which sometimes results in an extra duplication being done. Unfortunately, this behaviour of “`@<-`” and “`slot<-`” is also relied on by some code.

For the moment, pqR accommodates all this bad behaviour, though it would be nice to move to a coherent semantics sometime.

**Complex replacements.** Assignment operations with more complex replacements are trickier. The R Language Definition defines an assignment such as

```r
L[[2]][3] <- 1
```

as being equivalent to

```r
`*tmp*` <- L
L <- `[[<-`(`*tmp*`, 2, value=`[<-`(`*tmp*`[[2]], 3, value=1))
rm(`*tmp*`)
```

That is, the `[[` operator is used to extract the second element (a vector) of `L` (which has been put in `*tmp*`), then `[<-` is used to create a new version of this vector with its third element changed to 1, and finally the `[[<-` operator is used to put this modified vector back as the second element of `L`.

The interpreter in R Core implementations (and pqR before the latest release) implements this definition quite literally, actually creating a `*tmp*` variable, and evaluating index expressions as implied above. This results in strange behaviour. The following code produces the error “cannot change value of locked binding for `*tmp*`”, though it should surely be legal:

```r
L <- list(c(4,7),"x"); b <- c(2,3); L[[ b[1]<-1 ]] [1] <- 9
```

The following code calls the function `f` twice, though a programmer writing it would surely expect it to be called only once:

```r
f <- function () { cat("Hi!\n"); 1 }
L <- list(c(4,7),"x"); L[[ f() ]] [1] <- 9
```

This prints “`Hi!`” twice, in R-3.1.1 and earlier R Core releases (for both interpreted and byte-compiled code).

**How pqR implements complex replacements.** These strange behaviours are eliminated in the new pqR implementation, which is also much faster.

In pqR, an assignment that does a complex replacement starts by evaluating the expression on the right side, and then calls in succession all the subset extraction functions that appear on the left side, except for the outermost one. For example, `names(L[[f()]])[i]<-g()` will first evaluate `g()`, and then evaluate the extraction functions from the inside out, effectively doing something like

```r
tmp1 <- L[[f()]]
tmp2 <- names(tmp1)
```

However, ``tmp1`` and ``tmp2`` are not actual R variables — the interpreter just stores the values extracted internally.

So far, this is similar to what the R Core interpreter does, but there are two crucial differences.

First, when evaluating an extraction function, pqR uses its “variant result” mechanism to ask the extraction function whether the value it returns is an unshared subset of the variable it was extracted from, which can safely be modified, and for which modifications will automatically be reflected in changes to that part of the larger variable.

For example, after `L <- list("x", c(1,2))`, the expression `L[[2]]` returns an unshared subset of `L`. However, if either `M <- L` or `M <- L[[2]]` were then executed, `L[[2]]` would no longer be an *unshared* subset, since it would be shared with the value of `M`. And after `v <- 1:100`, the expression `v[20:30]` does not return an unshared subset, because it will return a *copy* of part of `v`, not that part of `v` itself (unlike list elements, parts of numeric vectors are not objects in themselves).

Knowing when the result of an extraction is an unshared subset is crucial to efficiently updating it. When the result of an extraction is not an unshared subset, and it is referenced elsewhere, pqR duplicates it (at the top level) before doing further extractions and replacements.

The second difference from R Core implementations concerns the index arguments of the extraction functions, which are later also arguments of the corresponding replacement functions. When pqR evaluates a call of an extraction function, such as `L[[f()]]`, it creates what (in the terminology of R internals) are called “promises” for index arguments, such as `f()` in this example. These promises contain the expression to be evaluated, plus an initially empty field for the value of the expression. When (if ever) the extraction function actually references the index value, the expression is evaluated, and this field is filled in. Later references to the index value do not evaluate the expression again, but just use the value stored in this field of the promise. Crucially, in pqR, these promises are kept for later use when the corresponding replacement function is called, usually with their value fields already filled in.

Avoiding re-evaluation of index arguments saves time, and also eliminates the double occurrence of side effects of evaluation, such as “`Hi!`” being printed twice in the example above when `f()` is evaluated twice (once for extracting `L[[f()]]` and once when replacing that element of `L` by a call of ``[[<-`` with `f()` as the index argument).

Once all the extraction functions have been called, the outermost replacement function is called to store the right hand side of the assignment statement into the result from the last extraction function. The next replacement function is then called to store this modified value into the result of the previous extraction function, and so forth, until the last replacement function call produces the new value for the variable being assigned into.

This is again generally similar to R Core implementations. However, pqR is able to skip some of these replacement calls, when it knows that the result of an extraction function is part of the larger variable. In that case, when that part is modified, nothing has to be done to propagate the modification to the larger variable. For example, to perform the replacement operation below:

```r
L <- list(a=c(1,2),b="x"); L$a[1] <- 9
```

pqR will first extract `L$a`, and find that this vector is an unshared subset of `L`. It will then call ``[<-`` to replace the first element of this vector by 9, at which point it is done — pqR realizes that there is no need to call the ``$<-`` replacement function, and also that there is no need to store the final result in `L`, since it is the same as the object already in `L`. However, if the assignment `L$a[1] <- 1+2i` is now done, the replacement of the first element of `L$a` by the complex number `1+2i` will produce a new vector of complex type, and pqR will realize that ``$<-`` needs to be called to store this new vector in `L$a`.

R Core implementations try to infer whether an extracted value is an unshared subset from how many references there are to it (see the discussion here), which sort of works, but fails when extraction is done with a vector index, as below:

```r
L <- list(1,list(2,list(3,c(4,5,6))))
K <- L
L[[c(2,2,2,3)]] <- 9
```

The vector index `c(2,2,2,3)` refers to the 3rd element of the 2nd element of the 2nd element of the 2nd element of `L`, which is the number 6. When replacing this by 9, the vector `c(4,5,6)` needs to be duplicated, because the entire object is shared by `K` and `L`. However, the reference count for `c(4,5,6)` should be only one, since it is referenced from only a single list (albeit one that ultimately is itself shared). To get around this problem, recent R Core releases increment reference counts as a result of simply extracting a value with a vector index, which will leave reference counts that are bigger than they should be, and may therefore cause unnecessary copying to be done later. (Earlier R Core releases just give the wrong answer.) In the new pqR implementation, extraction leaves the reference counts unchanged, but if asked, ``[[`` will say that the result returned is *not* an unshared subset, which will lead to the appropriate duplications before replacement functions are called.

**User-defined replacement functions.** As illustrated by the `odd_elements` and `odd_elements<-` functions above, users can write their own replacement functions, which can be used just like the built-in ones. Unfortunately, in both R Core and pqR implementations, there is presently no way to avoid duplication of the value in a variable that is updated with a user-defined function, even when the value is not shared with other variables. For example, `odd_elements(a) <- 0` will always duplicate the value in `a` before setting its odd elements to 0. Furthermore, the modified value stored in `a` after the replacement will always be marked as shared with the variable `x` within `odd_elements<-` (even though `x` is inaccessible), forcing a copy when it is next updated.

The new version of pqR does avoid some unnecessary duplications that are done in R Core implementations, but the basic problems remain. One fundamental question is what should happen if a user-defined replacement function generates an error after partially changing the value being updated. At present, the variable being updated will be left unchanged after the error exit. But any scheme that doesn’t duplicate the variable being updated will have the possibility of leaving a partial update that was cut short by an error. Successfully resolving such issues would allow for much more efficient use of user-defined replacement functions.


Details are in pqR NEWS. Here I’ll highlight some of the more interesting improvements.

**Faster variable lookup.** In both pqR and R Core implementations, variables local to a function are stored in a linked list, which is searched sequentially when looking for a variable (though this may sometimes be avoided in byte-compiled functions). So the more variables you have in your function, the slower it is to access or modify one of them. The new version of pqR often avoids this search by saving for each symbol the result from the last time that symbol was looked up in some local environment, and re-using this if the same environment is searched for that symbol again.

**Re-using memory when updating variables.** When variables are updated with statements like `i <- i+1` or `v <- exp(v)` we would prefer that the variable be updated by modifying its stored value, without allocating a new object (provided this value isn’t shared with some other variable). This is now done in pqR for binary and unary arithmetic operators and for mathematical functions of one argument. Eliminating such unnecessary storage allocation is important both for scalar operands (eg, counters in while loops) and when the operands are vectors (possibly quite large).

Updating in place also produces more possibilities for task merging — for example, the two operations `v <- 2*v; v <- v+1` will now be merged into a single loop over the elements of `v` that replaces each element by twice the element plus one.

**Faster and better subset replacement operations.** The interpreter’s handling of subset replacement operations such as `a[i] <- 1`, `L$x <- y`, `L$x[i] <- 0`, and `diag(L$M)[i] <- 1` has been completely revised, substantially improving speed, and also fixing some long-standing problems with the previous scheme. I will discuss this important change in more detail in a later post.

**Shared, read-only constants.** The result of evaluating an expression may now sometimes be a shared constant, stored (on most platforms) in read-only memory. In addition to improving speed and reducing memory usage, this change will sometimes have the effect that buggy code in packages (or the interpreter itself) that fails to check whether an object is shared before modifying it will now result in a memory access fault, rather than silently producing an incorrect answer.

**Faster and better-defined external function calls.** The overhead of calling external functions with .C or .Fortran has been substantially reduced. Some improvements in .C and .Fortran were made in R-2.15.1; pqR now has these optimizations as well as others.

Furthermore, pqR now documents (in `help(.C)`) what expressions are guaranteed to return unshared objects that may safely be modified when the `DUP=FALSE` option is used to .C or .Fortran, and makes clear that `DUP=FALSE` should be used only to improve performance, not as a way of surreptitiously returning information to the caller without the caller referring to the list returned by .C or .Fortran. I will be writing more on the use of `DUP=FALSE` in a future post.

Under some circumstances, routines called via .C or .Fortran can now be run by a helper thread in parallel with other operations. This is done only if an argument of `HELPER=TRUE` is passed to .C or .Fortran, which should be done only when the routine performs a pure numerical computation without side effects.


The speed of .Call and .External has been improved slightly. More importantly, however, within a routine called by .Call or .External, LENGTH, TYPEOF, REAL, INTEGER, LOGICAL, RAW, COMPLEX, CAR, CDR, CADR, etc. are now macros or inline functions, avoiding possibly substantial procedure call overhead.

**And more…** Numerous other performance improvements are described in the NEWS file, which also describes other changes that improve compatibility with recent R Core releases, add a few new features, fix bugs, etc. Several changes have been made to make it easier to use fast BLAS routines for matrix multiplication and other matrix operations, which will be the topic of another post. I will also be posting soon about how the speed of pqR-2014-09-30 compares with earlier versions of pqR and with past and current R Core releases.



Pentax ME-Super with SMC Pentax-M 40mm 1:2.8 lens, Fuji Reala 100 film, Nikon Coolscan V.


The result is that pqR now works with a large collection of 3438 packages.

This collection was created starting with the complete set of CRAN packages as of 2012-06-25. Packages were then eliminated if they were not suitable for Linux, or if their licence was unacceptable, or if they failed their checks with R-2.15.0 (on which pqR is based) as well as with pqR (though sometimes a more recent version of the package that works was found), or if they required Linux software that wasn’t easily installed. Also, some packages were added, and some upgraded to more recent versions, in order to produce a collection of packages that all pass their checks if the entire set of packages is first installed from this repository. Finally, a few packages from bioconductor.org were added.

See the pqR wiki for more information on how to access this repository, and for more information on packages that do and do not work with pqR.

The process of creating this curated repository revealed a number of bugs in pqR, which have now been fixed. I now know of no packages that fail because of what appears to be a pqR bug. This repository of packages that all work together should be a valuable resource for testing future versions of pqR, and some features have been added to pqR to facilitate this use. For example, the random number seed is now initialized (if not set by set.seed) from the R_SEED environment variable, if it is set, rather than from the time and process id, which leads to random failures of checks for some packages.

Some of the bugs fixed in this release were introduced by changes made in pqR, others are present in the R Core version pqR is based on and in the latest R Core release, R-3.1.0. Some of the documentation improvements provide information that is missing or incorrect in R-3.1.0. See the pqR NEWS for details.

With this release, I think the reliability of pqR is comparable to that of recent R Core releases. There is still work to be done to improve compatibility by incorporating features introduced in R Core releases after 2.15.0. However, the main focus of the next release will be further performance improvements.


The inaccuracy of microbenchmark has two main sources — first, it does not correctly allocate the time for garbage collection to the expression that is responsible for it, and second, it summarizes the results by the *median* time for many repetitions, when the *mean* is what is needed. The median and mean can differ drastically, because just a few of the repetitions will include time for a garbage collection. These flaws can result in comparisons being reversed, with the expression that is actually faster looking slower in the output of microbenchmark.
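The effect of summarizing by the median is easy to see with simulated data. Here is a small sketch (with invented numbers) in which expression B is cheaper per evaluation, but triggers more of the rare, costly garbage collections:

```r
# Simulated per-evaluation times in microseconds (invented for
# illustration): most evaluations are quick, but about 2% of them
# include a full garbage collection costing gc_cost extra.
gc_cost <- 10000
a <- rep(800, 1000); a[seq(1, 1000, by = 72)] <- 800 + gc_cost  # 14 GCs
b <- rep(750, 1000); b[seq(1, 1000, by = 42)] <- 750 + gc_cost  # 24 GCs

# The median ignores the rare GC evaluations entirely, so b looks
# faster; the mean (like total time from system.time) includes them,
# showing that a is actually faster overall.
stopifnot(median(b) < median(a), mean(b) > mean(a))
```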

Here’s an example of this, using R-3.0.2 on an Ubuntu Linux 12.04 system with an Intel X5680 processor. First some setup, defining a vector `v` to be operated on, a vector of indexes, `m`, for an initial part of `v`, and variables used to repeat expressions `k` times:

```
> library(microbenchmark)
> set.seed(1)
> v <- seq(0,1,length=100000)
> m <- 1:42000
> k <- 1000
> n <- 1:k
```

We can now compare two ways of computing the same thing using microbenchmark:

```
> (m1a <- microbenchmark (u <- v[m]^2+v[m], u <- (v^2+v)[m],
+                         times=k))
Unit: microseconds
               expr     min       lq   median       uq      max neval
 u <- v[m]^2 + v[m] 549.607 570.2220 816.0170 902.6045 11381.40  1000
  u <- (v^2 + v)[m] 472.859 682.3785 749.4225 971.9475 11715.08  1000
```

Looking at the median time, it seems that `u <- (v^2+v)[m]` is 816.0/749.4 = 1.089 times faster than `u <- v[m]^2+v[m]`.

Alternatively, we could use system.time to see how long a “for” loop evaluating one of these expressions `k` times takes (`k` was set to 1000 and `n` was set to `1:k` above):

```
> system.time(for (i in n) u <- v[m]^2+v[m])
   user  system elapsed
  0.808   0.060   0.869
> system.time(for (i in n) u <- (v^2+v)[m])
   user  system elapsed
  0.912   0.092   1.006
```

This gives the opposite result! From this output, it seems that `u <- v[m]^2+v[m]` is 1.006/0.869 = 1.158 times faster than `u <- (v^2+v)[m]`. Multiplying these factors, we see that the comparisons using microbenchmark and system.time differ by a factor of 1.089*1.158 = 1.26.

Maybe one could argue that a 26% error is often not that bad — I did fiddle the lengths of `v` and `m` for this example so that the error would actually make a difference in which expression seems faster — but the main reason people use microbenchmark is apparently that they think it is more accurate than system.time. So which gives the right answer here?

To find out, the first thing to do is to repeat the same commands again. As seen in the full output of this example, the repetition produces much the same results. (This isn’t always the case, though — appreciably different results in repetitions can occur when, for example, R decides to change the level of memory usage that triggers a garbage collection.)

We can also try the `order="block"` control option for microbenchmark. This tells it to do all 1000 repetitions of the first expression, and then all 1000 repetitions of the second expression, rather than randomly shuffling them, as is the default. The results aren’t much different:

```
> (m2a <- microbenchmark (u <- v[m]^2+v[m], u <- (v^2+v)[m],
+                         times=k, control=list(order="block")))
Unit: microseconds
               expr     min      lq   median       uq      max neval
 u <- v[m]^2 + v[m] 548.558 566.785 816.6940 830.7535 11706.04  1000
  u <- (v^2 + v)[m] 471.602 543.331 743.3735 836.7810 12162.34  1000
```

However, we can see what’s going on if we plot the individual times that microbenchmark records for the 2000 repetitions (1000 of each expression). Here are the plots for the second repetition with both random and block order:

The red dots are times for `u <- v[m]^2+v[m]`, the green dots for `u <- (v^2+v)[m]`. Note the log scale for time.

Some of the repetitions take about 20 times longer than others. The long ones clearly must be for evaluations in which a memory allocation request triggers a full garbage collection. We can now see that the results using the median from microbenchmark are going to be wrong.

First, notice that with block ordering, 1000 evaluations of the first expression result in 14 full garbage collections, whereas 1000 evaluations of the second expression result in 24 full garbage collections. With random order, however, there are 20 full garbage collections during evaluations of the first expression, and 25 during evaluations of the second expression. Random ordering has obscured the fact that the second expression is responsible for almost twice as much garbage collection time as the first, presumably because it allocates larger vectors for intermediate results such as `v^2` (versus `v[m]^2`).

But why, then, does block ordering give almost the same result as random ordering? Because we are looking at microbenchmark’s output of the *median* time to evaluate each expression. The median will be almost totally insensitive to the time taken by the small number (about 2%) of evaluations that trigger full garbage collections. So an expression whose evaluation requires more garbage collections will not be penalized for the time they take, even assuming that this time is correctly assigned to it (which happens only with block ordering).

This explanation suggests that using microbenchmark with block ordering, and then looking at the *mean* evaluation time for each expression would give a more accurate result, agreeing with the result found with system.time. And indeed it does in this example — in the two repetitions, system.time gives ratios of times for the two expressions of 1.158 and 1.136, while microbenchmark with block ordering gives ratios of mean times for the two expressions of 1.131 and 1.137. (These ratios of mean times can be computed from the data saved by microbenchmark, as seen in the R program I used, though the package provides no convenience function for doing so.)
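For those wanting to do this themselves: the object microbenchmark returns contains the individual times, in nanoseconds, in a `time` column alongside an `expr` factor, so mean times and their ratio can be computed with `tapply`. A mocked-up illustration (invented numbers, but the same column layout as a real result object):

```r
# Mock of the data microbenchmark saves: a data frame with columns
# expr and time (nanoseconds); a real result can be used the same way.
m <- data.frame(
  expr = rep(c("first", "second"), each = 4),
  time = c(550e3, 560e3, 820e3, 11.4e6,    # first expression's times
           470e3, 500e3, 680e3, 12.7e6))   # second expression's times
means <- tapply(m$time, m$expr, mean)
meds  <- tapply(m$time, m$expr, median)
ratio <- means["second"] / means["first"]  # ratio of mean times
# Here "second" has the lower median but the higher mean, because its
# occasional expensive evaluations are counted only by the mean.
stopifnot(meds["second"] < meds["first"], ratio > 1)
```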

So microbenchmark’s problem with this example can be fixed by using block ordering and taking the mean time for repetitions. But there’s no reason to think the time it then produces is any more accurate than that from system.time. It’s probably less accurate. The overhead of the “for” loop used with system.time is very small; in particular, since a change I suggested was incorporated in R-2.12.0, there is no memory allocation from the “for” loop iteration itself. The overhead of measuring the time for every repetition in microbenchmark is greater, and though microbenchmark attempts to subtract this overhead, one might wonder how well this works. Executing the time measurement code between each repetition will also disturb the memory cache and other aspects of processor state, which may affect the time to evaluate the expression.

More fundamentally, nanosecond timing precision is just not useful for this task, since to correctly account for garbage collection time, the number of repetitions must be large enough for many garbage collections to have occurred, which will lead to run times that are at least a substantial fraction of a second. Millisecond timing precision is good enough.

Furthermore, there’s no point in measuring the relative speeds of different expressions to more than about 5% accuracy, since recompiling the R interpreter after any slight change (even in code that is never executed) can change relative timings of different expressions by this much — presumably because changes in exact memory addresses affect the behaviour of memory caching, branch prediction, or other arcane processor tricks.

The data gathered by microbenchmark could be useful. For instance, it would be interesting to investigate the variation in evaluation time seen in the plots above beyond the obvious full garbage collections. Some is presumably due to less-than-full garbage collections; there could be other effects as well. But the microbenchmark package isn’t useful for the purpose for which it was designed.


I’ve now released pqR-2013-12-29, a new version of my speedier implementation of R. There’s a new website, pqR-project.org, as well, and a new logo, seen here.

The big improvement in this version is that vector operations are sped up using *task merging*.

With task merging, several arithmetic operations on a vector may be merged into a single operation, reducing the time spent on memory stores and fetches of intermediate results. I was inspired to add task merging to pqR by Renjin and Riposte (see my post here and the subsequent discussion).

The main way pqR merges tasks is via the same deferred evaluation mechanism that it uses to perform tasks in helper threads, when multiple processor cores are available. Here’s an example, in which we’ll first assume that only one processor core is used (no helper threads):

```
f <- function (x,a,b) a*x+b
v <- seq(1,2,length=10000)
w <- f(v,2,3)^2
print(w[10000])
```

When `f(v,2,3)` is called, a task is first scheduled to compute `2*v`, but this computation is not started. Next, a task is scheduled to add 3 to the result of computing `2*v`, at which point pqR recognizes that this task can be merged with the first task, into a task to compute `2*v+3`. This computation is also not started yet, with the value returned from `f(v,2,3)` being a vector whose computation is pending. A task is then scheduled to square this vector, which is merged with the previously merged task into a task to compute `(2*v+3)^2`. This computation is also deferred; instead, a pending value is assigned to `w`. It’s only when `w[10000]` needs to be printed that the vector `w` is actually computed.

The final merged computation of `(2*v+3)^2` is done with a single loop, which fetches each element, `v[i]`, and stores `(2*v[i]+3)^2` in `w[i]`. In contrast, if the three operations hadn’t been merged, three loops would have been done, with two stores and two fetches of intermediate results for each element of the vector. In my tests, the merged operation is 1.5 to 2.0 times faster than the combination of the three unmerged operations (depending on the processor, compiler, etc.).
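The scheduling and merging described above can be sketched in plain R as follows (a toy illustration only, with names of my own invention; pqR's real mechanism is internal to its C interpreter and merges pre-compiled loop snippets): a pending vector is represented by the operation still to be applied, scheduling a new operation merges it with the pending one by composition, and nothing is computed until a value is actually demanded.

```r
# Toy sketch of deferred evaluation with task merging.
pending  <- function (v, op = identity) list(v = v, op = op)
schedule <- function (p, f) {            # merge f with the pending op
  g <- p$op
  pending(p$v, function (x) f(g(x)))
}
force_value <- function (p) p$op(p$v)    # compute only when demanded

v <- seq(1, 2, length = 10000)
p <- pending(v)
p <- schedule(p, function (x) 2*x)       # scheduled, not computed
p <- schedule(p, function (x) x + 3)     # merged: 2*v+3
p <- schedule(p, function (x) x^2)       # merged again: (2*v+3)^2
w <- force_value(p)                      # the single pass happens here
stopifnot(identical(w, (2*v+3)^2))
```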

This example is more complicated if pqR is using helper threads. The first task to compute `2*v` will now probably be started in a helper thread before the task to add 3 to its result is scheduled, so no merge will be done. If there is only one helper thread, the next task to do the squaring operation will likely be merged with the “add 3” task, but if a second helper is available, the “add 3” task may also have already started. So it’s possible that the availability of helper threads could actually slow computations, by preventing merges. Helper threads will of course be a help when computations on vector elements dominate the memory store and fetch operations, and for this reason pqR does not try to merge expensive operations when helper threads are available (eg, `sin(exp(v))` is not merged when there is at least one helper thread, whereas it is merged when there are none).

I plan to improve how task merging works with helper threads in a future release of pqR. It should be possible to merge tasks even when the first has already started, by telling the first task to stop at the point it has reached, doing the second task’s operation on those elements where the first has already been done, and then doing the merged operations on the remaining elements.

I also plan to extend the operations for which task merging can be done. At present, pqR can only merge operations with a single vector operand and a vector result, such as `3*v` or `exp(v)`, but it should also be possible to merge some operations with two vector operands, such as `u+v` with `u` and `v` both vectors. Task merging in pqR will remain less ambitious than the similar features in Renjin and Riposte however, since pqR uses pre-compiled code snippets for all possible combinations of merged operations (currently 2744 of them, with merges limited to three operations), whereas Renjin and Riposte compile code on-the-fly.

pqR already had some features resembling task merging, implemented using the “variant result” mechanism (described here). The new version extends use of this mechanism to merge transpose and matrix multiply operations — now `t(A)%*%B` does not actually compute the transpose of A, but instead calls a routine that directly computes the transpose of A times B, which is actually faster than computing A times B, because of more favourable memory ordering. Those in the know can already achieve this result using the crossprod function, but in pqR it’s now automatic.
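The equivalence of the two forms can be checked in any R (in pqR the rewriting of the first form happens automatically inside the interpreter):

```r
set.seed(3)
A <- matrix(rnorm(12), 4, 3)
B <- matrix(rnorm(8),  4, 2)
r1 <- t(A) %*% B          # in pqR, computed without forming t(A)
r2 <- crossprod(A, B)     # the explicit equivalent, in any R
stopifnot(isTRUE(all.equal(r1, r2)))
```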

The new version also has some bug fixes and other improvements. You can install it on a Linux or Unix system from the source available at pqR-project.org. Installation from source on Mac OS X is also possible, but is recommended only for those experienced in doing this. (You’ll need to install both Apple’s Xcode and a Fortran compiler, plus you’ll need a newer version of gcc than Xcode has if you want to use helper threads. The Mac GUI does not yet work if helper threads are enabled.) Some people have managed to install pqR on Microsoft Windows, but this is still experimental. One of my next objectives is to produce pre-compiled versions of pqR for common systems, but for now you have to install from source.

I recently gave a talk on pqR at the University of Guelph, the slides of which you can see here. (As usual, I had to hurry through the last few in the actual talk.)
