Fixing R’s design flaws in a new version of pqR

2016-06-25 at 8:46 pm 13 comments

I’ve released a new version of my pqR implementation of R. This version introduces extensions to the R language that fix some long-standing design flaws that were inherited from S.

These language extensions make it easier to write reliable programs that work even in edge cases, such as data sets with one observation.

In particular, the extensions fix the problems that 1:n doesn’t work as intended when n is zero, and that M[1:n,] is a vector rather than a matrix when n is one, or when M has only one column. Since changing the “:” operator would cause too many problems with existing programs, pqR introduces a new “..” operator for generating increasing sequences. Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.

The new release, pqR-2016-06-24, is available at pqR-project.org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).

I’ve written about these design flaws in R before, here and here (and for my previous ideas on a solution, now obsolete, see here). These design flaws have been producing unreliable programs for decades, including bugs in code maintained by R Core. It is long past time that they were fixed.

It is crucial that the fixes make the easy way of writing a program also be the correct way. This is not the case with previous “fixes” like the seq_len function, and the drop=FALSE option, both of which are clumsy, as well as being unknown to many R programmers.

Here’s an example of how the new .. operator can be used:

          for (i in 2..nrow(M)-1)
              for (j in 2..ncol(M)-1)
                  M[i,j] <- 0

This code sets all the elements of the matrix M to zero, except for those on the edges (in the first or last row or column).

If you replace the “..” operators above with “:”, the code will not work, because “:” has higher precedence than “-”. You need to write 2:(nrow(M)-1). This is a common error, which is avoided with the new “..” operator, which has lower precedence than the arithmetic operators. Fortunately the precedence problem with “:” is mostly just an annoyance, since it leads to the program not working at all, which is usually obvious.

The more insidious problem with writing the code above using “:” is that, after fixing the precedence problem, the result will work except when the number of rows or the number of columns in M is less than three. When M has two rows, 2:(nrow(M)-1) produces a sequence of length two, consisting of 2 and 1, rather than the sequence of length zero that is needed for this code to work correctly.

This could be fixed by prefixing the code segment with

        if (nrow(M)>2 && ncol(M)>2)

But this requires the programmer to realize that there is a problem, and to not be lazy (with the excuse that they don’t intend to ever use the code with small matrices). And of course the problems with “:” cannot in general be solved with a single check like this.

Alternatively, one could write the program as follows:

          for (i in 1+seq_len(nrow(M)-2))
              for (j in 1+seq_len(ncol(M)-2))
                  M[i,j] <- 0

I hope readers will agree that this is not an ideal solution.
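For readers who want to see these failure modes directly, here is a small sketch that runs in stock R, with no pqR extensions. The helper name zero_interior is hypothetical, introduced only for this illustration; it wraps the seq_len workaround shown above.

```r
# Stock R only; no pqR extensions are used here.

# Precedence: ":" binds tighter than "-", so 2:n-1 means (2:n)-1.
n <- 5
wrong <- 2:n-1            # same as (2:n) - 1
right <- 2:(n-1)

# Edge case: with n = 2 the parenthesized form counts downward
# instead of producing an empty sequence.
n <- 2
descending <- 2:(n-1)     # the two values 2 and 1
empty <- 1 + seq_len(n-2) # length zero, which is what the loop needs

# Hypothetical helper: zero the interior of a matrix using seq_len.
zero_interior <- function(M) {
    for (i in 1 + seq_len(nrow(M) - 2))
        for (j in 1 + seq_len(ncol(M) - 2))
            M[i, j] <- 0
    M
}

small <- matrix(1:4, 2, 2)        # a 2x2 matrix has no interior elements
unchanged <- zero_interior(small)

bigger <- zero_interior(matrix(1:9, 3, 3))   # only the centre element is interior
```

Note that seq_len signals an error when given a negative argument, so this sketch assumes the matrix has at least two rows and two columns.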

Now let’s consider the problems with R dropping dimensions from matrices (and higher-dimensional arrays). Some of these stem from R usually not distinguishing a scalar from a vector of length one. Fortunately, R actually can distinguish these, since a vector can have a dim attribute that explicitly states that it is a one-dimensional array. Such one-dimensional arrays are presently uncommon, but are easily created — if v is any vector, array(v) will be a one-dimensional array with the same contents. (Note that it will print like a plain vector, though dim(array(v)) will show the difference.)
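The distinction between a plain vector and a one-dimensional array can be checked in stock R with a few lines. This is only an illustration; the variable names are arbitrary.

```r
v <- c(10, 20, 30)
a <- array(v)                 # same contents, but with a dim attribute

plain_has_no_dim <- is.null(dim(v))   # a plain vector has no dim attribute
array_dim <- dim(a)                   # a dim vector of length one

plain_again <- as.vector(a)   # as.vector removes the dim attribute again
```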

So, the first change in pqR to address the dimension dropping problem is to not drop a dimension of size one if its subscript is a one-dimensional array (excluding logical arrays, or when drop=TRUE is stated explicitly). Here’s an example of how this now works in pqR:

    > M <- matrix(1:12,3,4)
    > M
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12
    > r <- c(1,3)
    > c <- c(2,4)
    > M[r,c]
         [,1] [,2]
    [1,]    4   10
    [2,]    6   12
    > c <- 3
    > M[r,c]
    [1] 7 9
    > M[array(r),array(c)]
         [,1]
    [1,]    7
    [2,]    9

The final command above is the one which now acts differently, not dropping the dimensions even though there is only one column, since array(c) is an explicit one-dimensional array. The use of array(r) similarly guards against only one row being selected, though that has no effect above, where r is of length two.

In this situation, the same result could be obtained with similar ease using M[r,c,drop=FALSE]. But drop=FALSE applies to every dimension, which is not always what is needed for higher-dimensional arrays. For example, in pqR, if A is a three-dimensional array, A[array(u),1,array(v)] will now select the slice of A with second subscript 1, and always return a matrix, even if u or v happened to have length one. There is no other convenient way of doing this that I know of.
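For comparison, here is one way to get a similar result in stock R, as a sketch only. It keeps every dimension with drop=FALSE and then removes just the middle one by resetting dim. The array A and the subscripts u and v are made-up example values.

```r
A <- array(1:24, dim = c(2, 3, 4))
u <- 2            # happens to have length one
v <- c(1, 3)

dropped <- A[u, 1, v]                 # stock R reduces this to a plain vector

S <- A[u, 1, v, drop = FALSE]         # dim is c(1, 1, 2)
dim(S) <- dim(S)[c(1, 3)]             # remove only the middle dimension
                                      # (assigning dim also discards any dimnames)
```

The result S is a 1x2 matrix, but getting there takes two statements and loses dimnames, which is part of why this is not a convenient substitute.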

The power of this feature becomes much greater when combined with the new “..” operator, which is defined to return a sequence that is a one-dimensional array, rather than a plain vector. Here’s how this works when continuing the example above:

    > n <- 2
    > m <- 3
    > M[1..n,1..m]
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    > m <- 1
    > M[1..n,1..m]
         [,1]
    [1,]    1
    [2,]    2
    > n <- 0
    > M[1..n,1..m]
         [,1]
    >

Note how M[1..n,1..m] is guaranteed to return a matrix, even if n or m is one. A matrix with zero rows or columns is also returned when appropriate, due to the “..” operator being able to produce a zero-length vector. To get the same effect without the “..” operator, one would need to write

    M [seq_len(n), seq_len(m), drop=FALSE]

It gets worse if you want to extract a subset that doesn’t start with the first row and first column — the simplest equivalent of M[a..b,x..y] seems to be

    M [a-1+seq_len(b-a+1), x-1+seq_len(y-x+1), drop=FALSE]

I suspect that not many R programmers have been writing code like this, which means that a lot of R programs don’t quite work correctly. Of course, the solution is not to berate these programmers for being lazy, but instead to make it easy to write correct code.
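In stock R, the long expression above can at least be hidden behind small helpers. The following is a sketch; upto and block are hypothetical names, not part of R or pqR. Treating every b < a as an empty range is this sketch's own choice, not a statement about how pqR's “..” handles that case.

```r
# Hypothetical helper: increasing sequence from a to b, empty when b < a.
upto <- function(a, b) a - 1 + seq_len(max(0, b - a + 1))

# Hypothetical helper: rows a to b and columns x to y of M, always a matrix.
block <- function(M, a, b, x, y) M[upto(a, b), upto(x, y), drop = FALSE]

M <- matrix(1:12, 3, 4)

two_by_one  <- block(M, 2, 3, 2, 2)   # one column selected, still a matrix
zero_by_all <- block(M, 2, 1, 1, 4)   # no rows selected, still a matrix
```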

Dimensions can also get dropped inappropriately when an empty subscript is used to select all the rows or all the columns of a matrix. If this dimension happens to be of size one, R will reduce the result to a plain vector. Of course, this issue can be combined with the issues above — for example, M[1:n,] will fail to do what is likely intended if n is zero, or if n is one, or if M has only one column.
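The three failure cases of M[1:n,] can be reproduced in stock R as follows. This is an illustrative sketch with made-up matrices; the last line shows the seq_len and drop=FALSE form for contrast.

```r
M <- matrix(1:12, 3, 4)

n <- 1
one_row <- M[1:n, ]        # plain vector of length 4, not a 1x4 matrix

n <- 0
zero_rows <- M[1:n, ]      # 1:0 is c(1, 0); the 0 is ignored, so row 1 comes back

N <- matrix(1:3, 3, 1)     # a matrix with only one column
n <- 2
one_col <- N[1:n, ]        # plain vector of length 2, not a 2x1 matrix

safe <- M[seq_len(0), , drop = FALSE]   # a matrix with 0 rows and 4 columns
```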

To solve this problem, pqR now allows “missing” arguments to be specified with an underscore, rather than by leaving the argument empty. The subscripting operator will not drop a dimension with an underscore subscript (unless drop=TRUE is specified explicitly). With this extension, along with “..”, one can rewrite M[1:n,] as M[1..n,_], which will always do the right thing.

Note that it is unfortunately probably not feasible to just never drop a dimension with a missing argument, since there is likely too much existing code that relies on the current behaviour (though there is probably even more code where the existing behaviour produces bugs). Hence the creation of a new way to specify a missing argument. A more explicit “missing” indicator may be desirable anyway, as it seems more readable, and less error-prone, than nothing at all.

It may also be infeasible to extend the rule of not dropping dimensions indexed by one-dimensional arrays to logical subscripts — when a and b are one-dimensional arrays, M[a==0,b==0] may be intended to select a single element of M, not to return a 1×1 matrix — though one-dimensional arrays are rare enough at present that maybe one could get away with this.
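The reason logical subscripts are a concern is that a comparison applied to a one-dimensional array yields a one-dimensional logical array. A short stock R sketch, with made-up values for a, b, and M:

```r
a <- array(c(0, 1, 2))
b <- array(c(5, 0, 7, 9))
M <- matrix(1:12, 3, 4)

cmp_dim <- dim(a == 0)        # the comparison result keeps the dim attribute

elem <- M[a == 0, b == 0]     # in stock R this is the single element M[1, 2]
```

If logical one-dimensional arrays also suppressed dropping, the last expression would give a 1×1 matrix rather than a single element.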

The new “..” operator does break some existing code. In order that “..” can conveniently be used without always putting spaces around it, pqR now prohibits names from containing consecutive dots, except at the beginning or the end. So i..j is no longer a valid name (unless quoted with backticks), although ..i.. is still valid (but not recommended). With this restriction, most uses of the “..” operator are unambiguous, though there are exceptions, such as i..(x+y), which is a call of the function i.., and i..-j, which computes i.. minus j. There would be no ambiguities at all if consecutive dots were allowed only at the beginning of names, but unfortunately the ggplot2 package uses names like ..count.. in its API (not just internally).

Also, .. is now a reserved word. This is not actually necessary to avoid ambiguity, but not making it reserved seems error-prone, since many typos would be valid syntax, and fetching from .. would not even be a run-time error, since it is defined as a primitive. A number of CRAN packages use .. as a name, but almost all such uses are typos, with ... being what was intended (many such uses are copied from an example with a typo in help(make.rgb)).

To accommodate packages with incompatible uses of “..”, there is an option to disable parsing of “..” as an operator, allowing packages written without using these new extensions to still be installed.

The new pqR also has other new features, including a new version of the “for” statement. Implementation of these new language features is made possible by the new parser that was introduced in pqR-2015-09-14, which has other advantages as well. I plan to write blog posts on these topics soon.

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing.


13 Comments

1. Daniel Mastropietro | 2016-06-27 at 4:53 am

I find this solution excellent! I’ve been repeatedly (just) complaining about how R is so flawed in these respects of managing indices and unnecessarily adding difficulty to a programmer’s life…
Thanks for your contribution!

The only (big) concern I have: how are we supposed to promote the use of pqR over R? I mean, if we write functions and packages that work with pqR, they will not work with regular R, which is so much widely used…
- 2. Radford Neal | 2016-06-27 at 6:13 am
  
  From the technical and licensing aspects, there would be no big problem with incorporating the pqR parser into R Core or other versions of R, along with the other code changes that implement these language extensions. And the pqR parser has other advantages in any case. There would be some work involved, of course, but no more than for many other changes that are routinely made.
  
  Of course, the R Core people would have to be persuaded to do this.
  - 3. Daniel Mastropietro | 2016-06-28 at 4:18 pm
    
    Sounds good! Is the R Core team already aware of the pqR package and/or have you been able to tell them about this idea of incorporating the pqR parser into R Core?
  - 4. Radford Neal | 2016-06-28 at 4:37 pm
    
    They’re certainly aware of pqR in general.
    
    I told them about the new parser in pqR about a year ago, and offered to help with incorporating it into R Core, in a message to the r-devel list, which you can see at https://stat.ethz.ch/pipermail/r-devel/2015-September/071777.html
    
    I received no response from R Core whatsoever.
    
    I’ll probably post another message to r-devel in a few days, after I’ve put up a couple more blog posts on features in the new version of pqR. But obviously I have no expectation that R Core will do anything.
  - 5. Daniel Mastropietro | 2016-06-29 at 8:20 am
    
    Have you tried contacting the R Consortium?
    http://www.r-bloggers.com/get-involved-with-the-r-consortium/
  - 6. Radford Neal | 2016-06-29 at 9:18 am
    
I’m involved with an R Consortium group thinking about the R API (for C, Fortran, C++, etc.), but I haven’t talked to anyone there about R language extensions.
    
    I plan to implement a number of other language extensions, at which point I’ll be better able to see what possible implementation or compatibility problems there are for the whole set. That might be a better time to make more formal proposals.
    
    You can see some of these plans (which may not exactly match what I eventually implement) at http://www.cs.utoronto.ca/~radford/ftp/pqR-Rusers.pdf
7. Joshua Pritikin | 2016-06-27 at 10:29 pm

My impression is that the R Core people suffer from tremendous inertial forces. Can you suggest anything that we can do to encourage them to seriously evaluate your work for inclusion?
8. Carl Witthoft | 2016-06-28 at 3:40 pm

OTOH, not everyone (I’m waving my hand here) wants the default class of a subsetting operation to be a matrix. I want a 1Xn array to be a vector, not an array. (Though I do admit that base:dim function really should default to “length” when the argument is not an array). As to 1:n-5 vs. 1..n-5 and operator precedence, I stand by the programmer’s rule: add parentheses whenever there’s a chance of misinterpretation. I have to bounce between R, MATLAB, and c++ (at least!) all the time. I don’t want to have to reset my memory of operator precedence every time I change languages.
- 9. Radford Neal | 2016-06-28 at 4:01 pm
  
  The way the extension has been done, if M is a matrix, then an expression such as M[1,1:10] will be a simple vector (of course, you’d probably write this as M[1,1..10] once you got used to the .. operator).
  
  So the question is whether you like the fact that M[1:n,1:10] is a matrix when n is greater than 1 but a simple vector when n happens to be 1. In particular, suppose you assign this value, with A <- M[1:n,1:10]. Do you like the fact that A[i,j] will then produce an error if n happened to be 1?
10. Daniel Mastropietro | 2016-06-29 at 9:47 am

Thanks for the link.

I am totally for the implementation of correct generalizations of operators (such as the v[-ix] that you mention in your presentation when ix has length 0).

I wonder why the R Core team is so uninterested in these improvements…

Also was happy to learn the origin of the pqR name… I thought the name was *purely* based on the alphabetical order of the letters involved :-)
11. Longhai Li | 2016-07-04 at 1:55 am

Congratulations on the release. More and more people will be convinced to try pqR, like me. I hope that R Core will incorporate the new operator .. into the new R. I don’t see what they would lose by adding ..
12. Conor Anderson (@conanelbarbudo) | 2016-08-01 at 2:25 pm

It’s a little late now, but I’ve updated my PKGBUILD for pqR on Arch. If the requirements look a little odd, it is because I shamefully crimped them from the R AUR entry. Let me know if anything there is superfluous. https://aur.archlinux.org/packages/pqr/
- 13. Radford Neal | 2016-08-01 at 4:36 pm
  
  Thanks. I should learn more about archlinux…

Radford Neal's blog