R Design Flaws #1 and #2: A Solution to Both?

2008-08-25 at 11:42 pm 15 comments

I’ve previously posted about two design flaws in R. The first post was about how R produces reversed sequences from a:b when a>b, with bad consequences in “for” statements (and elsewhere). The second post was about how R by default drops dimensions in expressions like M[i:j,] when i:j is a sequence only one long (ie, when i equals j).

In both posts, I suggested ways of extending R to try to solve these problems. I now think there is a better way, however, which solves both problems with one simple extension to R. This extension would also make R programs run faster and use less memory.

Recall the problems, and the solutions I proposed previously… To stop sequences from reversing, we need a new operator to use rather than 1:n, which suffers from the reversal problem when n is zero (which can’t be changed for compatibility reasons). I suggested 1:>:n. To stop dimenions from being dropped, I suggested using semicolons rather than commas to separate subscripts. My new suggestion is that in the most common case where the indexing vector is a sequence, we could use the same new operator introduced to solve the reversing problem — ie, we write M[1:>:n,] to get the first n rows, and define it so that the result isn’t converted to a vector when n is one.

Now, this may seem like it won’t work. If 1:>:n returns a vector, then when n is one, it’s a vector of length one, which has to (by default) lead to the dimension being dropped if ordinary subscripting by scalars is to work (since scalars in R are really vectors of length one).

The solution is for 1:>:n to not return a vector, but instead return a new data type that just records its two operands. This new data type — perhaps it could be called an indexing pair — would have to be recognized by “for” statements and by the subscripting operator. A dimension indexed by such an indexing pair would never be dropped. One could add an operator :<: as well, for iterating or subscripting in descending sequence, though this is much less common. If this descending operator is omitted, I think .. (two periods) would be a better name than :>: for the ascending indexing pair operator (but it unfortunately has no obviously mnemonic descending counterpart).

One disadvantage of this solution is that it doesn’t address the dropped dimension problem in the general case where the vector index may not be a sequence. But indexing by an ascending sequence is by far the most common case, so having to write drop=FALSE for the others may be OK. (However, perhaps knowledge of the drop=FALSE option would be less common if it is needed less often.) Similary, this solution doesn’t address the problem of reversing sequences outside the context of for loops and subscripting, but those are the most common uses.

One extra advantage of introducing indexing pairs is that they will take up a trivial amount of storage. In contrast, if you use 1:1000000 to iterate a million times in a for loop, R will allocate 4 Megabyes of storage to hold this sequence. Producing this sequence also takes time, of course. It would be possible for R to avoid this cost when using 1:1000000 in a for loop, by treating this combination specially, but it doesn’t (in version 2.4.1 at least):

   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234834  6.3     467875 12.5   407500 10.9
   Vcells 104319  0.8     786432  6.0   690698  5.3
   > for (i in 1:1000000) if (i==1000000) print(gc())
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234855  6.3     467875 12.5   467875 12.5
   Vcells 604321  4.7     905753  7.0   720903  5.5
   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234904  6.3     467875 12.5   467875 12.5
   Vcells 104328  0.8     786432  6.0   720903  5.5

Notice how the memory usage goes up by 3.9 Megabytes during the for loop.

Another extra advantage of introducing :>: (or ..) is that the precedence of this operator could be made lower than that of any other operator (with no loss, since an indexing pair wouldn’t be a valid operand for any other operator). Expressions like 1:>:n-1 would then do what the programmer meant (unlike 1:n-1).

So, does anyone see any flaws in this solution?

Entry filed under: R Programming, Statistics, Statistics - Computing.

Answers to Applied PhD Comprehensive Question #1 Down Syndrome and Decision Theory

15 Comments Add your own

1. Tom Moertel | 2008-08-26 at 2:06 pm

Your suggestion of using an “indexing pair” is one that has been adopted in other languages. In Ruby, for example, the syntax x..y creates a Range object:

http://www.ruby-doc.org/core-1.8.7/classes/Range.html

Ranges have proved to be handy for a number of uses beyond iteration (as a perusal of the documentation above might suggest). Adding something similar to R might likewise provide a number of benefits, only one being a way to work around the design flaws you have described.
Reply
2. Andrew Gelman | 2008-08-26 at 8:18 pm

Radford

I can’t follow the details, but I’m glad you’re thinking hard about this. I bet there are big gains out there that you, with an outsider’s eye, will be able to spot and fix.
Reply
3. Hans | 2008-08-27 at 12:25 pm

If your idea were implemented, suddenly matrix subscripts drop extra dimensions *except* when using your new operator. This kind of inconsistent behavior causes more bugs than it solves in the long run. It also makes life harder for generations of new R programmers who each will spend two hours banging their heads against a wall before running into that one sentence in the R tutorial: “oh, by the way, if you use this one operator then matrix subscripting doesn’t drop extra dimensions by default.”

For example, let’s say that you write a function that takes a vector of numbers, uses that to subscript a matrix and then processes the results. Now, if someone passes in an indexed pair, the function will fail with some obscure error message. Worse, it could not fail but produce some incorrect result. Or it could fail half the time… you get the idea.

I recommend introducing new matrix subscript notation that does not drop extra dimensions. Unfortunately, R already has the [[ ]] double brackets, which would be my first choice. So I do not know what notation would be good for “matrix subscripting that does not drop dimensions” but I recommend that such a notation is found and used. Maybe {} rather than [].

The your indexed pair can then work consistently with both of these matrix subscript notations.
Reply
4. Radford Neal | 2008-08-27 at 2:00 pm

Hans,

I don’t see that there would be problems of the sort you talk about. The documentation would present M[1..n,] (or M[1:>:n,] if that’s the syntax) as the normal way to get a matrix consisting of the first n rows of M. It might mention that M[1:n,] looks similar, but doesn’t work when n is zero or one. I don’t see how this produces anything but improvement. (Of course, there ideally wouldn’t be these two similar-looking things, but we don’t have the option of redesigning R from scratch without worrying about backward compatibility).

You mention a function that takes a vector of numbers as an argument and uses it as a subscript, and worry about someone calling it with an indexing pair like 1..n instead of a vector. But I don’t see the problem. The indexing pair should work just like 1:n, except that it does the right thing when n is zero or one. The only exception would be if the function actually wants the inconsistent behaviour of the dimension being dropped when n is one, but not otherwise. This is not usually the case (and hence the function will be using drop=FALSE, if it doesn’t have a bug). And if a function does want this behaviour, it would even now be good practice to program it explicity, or at least explicityly specify drop=TRUE (which might be made to still have effect with an indexing pair – ie, the default might depend on the subscript type, analogously to lots of other defaults in R).

My previous post did suggest a new subscript notation, replacing commas with semicolons to separate subscripts (obviously the issue arises only when there is more than one subscript). I don’t understand your last sentence about indexing pairs working with this – I’d see them as alternatives, not things to use together, unless one really thinks indexing by vectors that aren’t sequences is common enough to justify another extension.
Reply
5. Cyril | 2008-08-28 at 11:46 am

One comment about this proposal is that the current syntax, despite its flaws, provides a unification of the sequence and list notations, which is very convenient in some situations.

This is useful for extracting non-sequential rows/columns as mentioned above, but also in loops, eg one can use

for(l in c(“low”,”med”,”high”)) …

Using indexing pairs would require (if I understand correctly) two different implementation of for, which would have to be kept coherent so that, presumable, for (i in 1..n) would behave exactly like for (i in 1:n) when n>=1, apart from the memory usage.

Also, would you envision indexing pairs to default as sequences in cases like plot(log(1..20)), or (1..100)/100 ?
Reply
6. Radford Neal | 2008-08-28 at 12:42 pm

Certainly one would need to retain backward compatibility, so for loops would still work with “in” referring to vectors or lists, as well as indexing pairs. If we weren’t worried about backward compatibility, we could just redefine 1:n, rather than introducing the 1..n notation. And of course iterating over vectors or lists that aren’t sequences can be useful. The problem doesn’t arise because for loops can do that, but rather because 1:n doesn’t work as desired.

I don’t see any need to propagate indexing pairs into other contexts. The problem of reversing sequences doesn’t really arise with plot(log(1..n)), for instance, since when n is zero this doesn’t make sense anyway (unlike for loops, where zero iterations usually does make sense). Also, I’d envision i..j only being allowed when i and j are integers. I would envision indexing pairs as not being allowable operands for arithmetic and other operators, other than there being some sort of way of extracting the low and high bounds from an indexing pair.
Reply
7. Alex | 2008-08-29 at 2:58 pm

This concept reminds me of the xrange type in Python (Python library ref page), which can be quite useful for this type of application.

In my experience, the reversible sequences can be a bit of a nuisance, but dropped dimensions (and the general issue of numeric vectors and single column matrices being treated differently) tend to come up far more frequently. It’s one of the few standards I miss from Matlab (along with parallelization; I would kill for parallel implementations of apply & lapply).
Reply
8. wcw | 2008-09-03 at 1:08 am

..er, what?

> library(snow)

Yes, you have to parallelize by hand, and for someone who like me is really only competent to do that to embarrassingly parallel problems, that may not be the solution. Still, clusterSplit() your lists and go to town on them.

I don’t know Matlab that well. What would it do?
Reply
9. Radford Neal | 2008-09-03 at 1:22 am

I’d never heard of snow, and probably many other readers haven’t as well. It appears to be this package.
Reply
10. Fixing some problems in R « Hans Gilde’s weblog | 2008-09-03 at 1:45 pm

[…] 3, 2008 Radford Neal points out a few problems in R in this post on his […]
Reply
11. Corey | 2008-09-03 at 6:19 pm

wcw,

In MATLAB, 2-D arrays are atomic, so there’s no distinction between scalars, vectors, and matrices. (N-dimensional arrays for N>2 get a little special, but they’re not too bad.) There’s no wackiness when pulling a row vector or column vector out of a matrix, as both the operand and the output are atomic data types. Also, MATLAB’s binary “:” operator only generates increasing sequences, so if n>m, n:m is empty (MATLAB’s version of NULL). (There’s also a trinary n:k:m operator, where k is an arbitrary increment.)
Reply
12. Alex | 2008-09-08 at 12:57 am

wcw: I had not encounted the snow package previously. Thanks to you and Radford for the reference.
Reply
13. Sandro Saitta | 2008-09-18 at 3:02 am

Alex: I use Snow and PVM for parallelizing data mining tasks and it works fine (at least on Linux).
Reply
14. Jesse | 2012-04-20 at 7:58 pm

Seems like .. would be a problematic operator, since “x..y” is a valid identifier name (as are “x..” and “..y”, for that matter).
Certainly, naming variables in this way is usually a poor choice. But it does happen.
As mentioned in the comments on your “Flaws #1” post, the : character is already used in multiple contexts and creating new operators that use it could introduce complications for users.
Reply
- 15. Radford Neal | 2012-04-20 at 9:04 pm
  
  Yes, .. is probably not on, because as you say it can occur in valid identifiers.
  
  My current thought is that an operator, called :to:, should create an increasing sequence that is a vector with a dim attribute saying that it is one dimensional. It would of course work correctly when the end value is less than the start value, and its precedence would be lower than +. Subscripting would be changed to not drop dimensions when the subscript has a dim attribute. This could conceivably break an existing program, but I think this would be quite rare.
  
  I think :to: looks pretty natural, for instance
  for (i in 1:to:n) A [i, i+1:to:n] <- 7
  
  Current R practice would be for the name to be %to%, but that’s just too ugly. Occurences of “to” as an identifier in conjunction with existing uses of : should very seldom (never?) be ambiguous, and would be easy to disambiguate with spaces.
  
  I wouldn’t use :to: if designing a language from scratch, but that’s not the situation here.
  Reply

Radford Neal's blog