How large vectors in R might be stored compactly

Vectors in R can currently have elements of two sizes — 8-byte double-precision floating-point elements for `numeric’ vectors, or 4-byte elements for `integer’ or `logical’ vectors.  You can also have vectors whose elements are 1-byte `raw’ values, but these raw vectors don’t support negative numbers, or NA values, so they aren’t suitable for general use.

It seems that lots of actual data vectors could be stored more compactly than at present.  Many integer vectors consist solely of elements that would fit in one or two bytes. Logical vectors could be stored using two bits per element (allowing TRUE, FALSE, and NA), which would use only one-sixteenth as much memory as at present. It’s likely that many operations would also be faster on such compact vectors, so there’s not even necessarily a time-space tradeoff.

For integer and logical types, the possible compact representations, and how to work with them, are fairly obvious. The challenge is how to start using such compact representations while retaining compatibility with existing R code, including functions written in C, Fortran, or whatever. Of course, one could use the S3 or S4 class facilities to define new classes for data stored compactly, with suitable redefinitions of standard operators such as `+’, but this would have substantial overhead, and would in any case not completely duplicate the behaviour of non-compact numeric, integer, or logical vectors. Below, I discuss how to implement compact representations in a way that is completely invisible to R programs. I hope to try this out in my pqR implementation of R sometime, though other improvements to pqR have higher priority at the moment.

How to compactly represent floating-point data (of R’s `numeric’ type) is not so obvious. If the use of a compact representation is to have no effect on the results, one cannot just use single-precision floating point. I describe a different approach in a new paper on Representing numeric data in 32 bits while preserving 64-bit precision (also on arxiv). I’ll present the idea of this paper next, before returning to the question of how one might put compact representations of any sort into an R interpreter, invisibly to R programs.

Statistical applications written in R typically represent numbers read from data files using 64-bit double-precision floating-point numbers (unless all numbers are integers).  However, the information content of such data is often small enough that each data item could be represented in 32 bits. For example, if every item in the data file contains six or fewer digits, with the decimal point in one of the seven places before or after one of these digits, there are less than 7 million possible numbers (14 million if a number might be negative), which is much less than the approximately 4 billion possible values of a 32-bit element.

However, representing such data with 32-bit single-precision floating-point values won’t really work. Single-precision floating-point will be able to distinguish all numbers that have no more than six digits, but if these numbers are used in standard arithmetic operations, the results will in general not be the same as if they had been represented with 64-bit double precision. The problem is that numbers, such as 0.1, that have non-terminating binary representations will be rounded to much less precise values in single precision than in double precision.

Exactly the same results as using double precision could be obtained by using a decimal floating point representation.  For example, a compact number could consist of a 28-bit signed integer, m, and a 4-bit exponent, e, which represent the number m×10e. To decode such a number, we would extract m and e with bit operations, use e to look up 10e from a table, and finally divide m by 10e. Unfortunately, the final division operation is comparatively slow on most current processors, so compressing data with this method would lead to substantially slower operations on long vectors (typically about six times slower). It’s much faster to instead multiply m by 10-e, but this will not give accurate results, since 10-e cannot be exactly represented in binary notation.

In my paper, I propose a faster way of representing 64-bit floating-point values in 32 bits, while getting exactly the same answers. The idea is simply to store only the upper 32 bits of the 64-bit number, consisting of the sign, the exponent, and the upper 20 bits of the mantissa (21 bits of precision, counting the implicit 1 bit). To use such a compact number, we need to fill in the lower 32 bits of the mantissa, which is done by looking these bits up in a table, indexing with some bits from the retained part of the mantissa and perhaps from the exponent.

Of course, this only works for some subsets of possible 64-bit floating-point values, in which there aren’t two numbers with the same upper 32 bits. Perhaps surprisingly, there are a number of interesting subsets with this property. For example, the set of all six-digit decimal numbers with the decimal point before or after any of the digits can be represented, and decoded using a table indexed by 19 bits from the mantissa and the exponent. Some smaller subsets can be decoded with smaller tables. More details are in the paper, including timing results for operations on vectors of such compactly-represented values, which show that it’s faster than representing data by decimal floating point, and sometimes faster than using the original 64-bit floating point values.

An interesting feature of this scheme is that the compact representation of a 64-bit value is the same regardless of what subset is being represented (and hence what decoding table will be used). So when compressing a stream of data, the data can be encoded before it is known what decoding scheme will be used. (Of course, it may turn out that no decoding scheme will work, and hence the non-compact form of the data will need to be used.) In contrast, when trying to compress an integer vector by storing it in one or two bytes, it may initially seem that a one-byte representation of the data will be possible, but if an integer not representable in one byte is later encountered, the previous values will need to be expanded to two bytes.

I’d like to be able to use such compact representations for R vectors invisibly — without changing any R programs, or external routines called from R that use the R API. This requires that a compactly-represented vector sometimes be converted automatically to its non-compact form, for example, when passed to an external routine that knows nothing about compact representations, or when it is operated on by some part of the R interpreter that has not been re-written to handle compact vectors. Compactly-represented vectors will also need to be expanded to their non-compact form when an element of the vector is replaced by a value that is not in the set that can be compactly represented.

It should be possible to accommodate code that doesn’t know about compact representations using the same variant result mechanism in pqR that is used to implement parallel computation in helper threads. With this mechanism, code in C that calls the internal “eval” function to evaluate an R expression can specify that the caller is prepared to handle a “variant” of the normal evaluation result, which in this application would be a result that is a compactly-stored vector. By default, such variant results will not be returned, so code that is unaware of compact vectors will still work. Of course, compact representations will be useful only if modifications to handle compact representations have been made to many parts of the interpreter, so that vectors can often remain in their compact form.

When we do need to expand a compact vector into it’s non-compact form, how should we do it?  Should we keep the compact form around too, and use it if we no longer need the expanded form?  That seems bad, since far from reducing memory usage, we’d then be increasing it by 50%. And even if we discard the compact form after expanding it, we still use 50% more memory temporarily, while doing the expansion, which may cause serious problems if the vector is really huge.

We can avoid these issues by expanding the vector in place, having originally allocated enough memory for the non-compact representation, with the compact form taking up only the first part of this space allocation. Now, this may seem crazy, since the whole point of using a compact representation is to avoid having to allocate the amount of memory needed for the non-compact representation! Modern operating systems can be clever, though. At least on Linux, Solaris, and Mac OS X, if you allocate a large block of memory (with C’s malloc function), real physical memory will be assigned to addresses in this memory block only when they are actually used.  So if you use only the first half of the block, only that much physical memory will be allocated — except that allocation is actually done in units of “pages”, typically around 4 KBytes. So as long as physical memory (rather than virtual address space) is what you’re short of, and the vector is several times larger than the page size, allocating enough memory to hold the vector’s non-compact form should still save on physical memory if in fact only the compact representation is used.

Expanding compact vectors in place also avoids problems with garbage collections being triggered at unexpected times, and with the address of a vector changing when existing code may assume it will stay the same. Indeed, it’s not clear that these problems could be solved any other way.

However, one unfortunate consequence of allocating space to allow expansion in place is that compact representations will not help with programs that create a huge number of short vectors, because the allocation of physical memory in units of pages limits the usefulness of compact representations to vectors of at least a few thousand elements. It’s difficult to assess how often compact representations will provide a substantial benefit in real applications until they have been implemented, which as I mentioned above, will have to wait until after several other planned features have been implemented in pqR.

Entry filed under: Computing, R Programming, Statistics - Computing.

• 1. Norm Matloff  |  2015-05-01 at 12:50 am

I’m surprised you didn’t mention Huffman coding, run-length coding and so on for integers. These could result in huge space savings, though of course even more headaches in trying to make it transparent to the programmer.

But is transparency that important? It might be best for now to leave it at the package level.

• 2. Radford Neal  |  2015-05-01 at 9:32 am

There would indeed be situations where such methods would allow for much more compact storage. However, aside from the issue of transparency, there is the problem of accessing and updating elements randomly — schemes like Huffman coding have a sequential nature that would make that difficult. Random access may not be necessary for some applications, of course.

I think one should probably think of compression for such applications as being separate from the sort of automatic compression that I have in mind here. I expect that one would want to have the R programmer explicitly request compression, and probably make visible the decompression as well. There would probably be no need to integrate the scheme into the interpreter, so a package implementation might be fine.

Schemes like Huffman coding also produce a variable-size representation, which might cause complications, though the approach I describe of over-allocating space and relying on physical memory not being assigned until needed might be applicable in this context too.

• 3. Charles  |  2015-05-03 at 8:48 pm

Lots of interesting work being done in speeding up R at the moment. Besides the separate reimplementation efforts (riposte, renjin, fastR) I’ve also seen one or two others that are trying to improve base R along some dimension or another.

Have you seen Jeffrey Horner’s Array-Hash R? See eg http://r.789695.n4.nabble.com/R-with-Array-Hashes-td4704256.html, code available at https://github.com/jeffreyhorner/R-Array-Hash.

Seems to offer big wins for environment lookup which could be quite complementary to the sort of improvements you’ve been looking at…

• 4. Radford Neal  |  2015-05-03 at 9:18 pm

Yes, it looks interesting, but I haven’t had time to look at it in detail yet.

• 5. Ewan  |  2015-05-13 at 4:49 pm

You should also check out blosc, a compression library for use with in-memory data: http://www.blosc.org/

• 6. Radford Neal  |  2015-05-13 at 7:01 pm

That looks interesting in a context where you would want all, or at least a substantial part, of an object to be decompressed at once, and would also update it only all at once or by replacing large parts. I don’t know: perhaps that is the most important context, or perhaps not. It’s different from the context I have in mind, however, where the assumption is that individual elements may be randomly accessed, and perhaps also individual elements are randomly updated.