Design Flaws in R #2 — Dropped Dimensions

2008-08-20 at 12:06 am 25 comments

In a comment on my first post on design flaws in the R language, Longhai remarked that he has encountered problems as a result of R’s default behaviour of dropping a dimension of a matrix when you select only one row/column from that dimension. This was indeed the design flaw that I was going to get to next! I think it also points to what is perhaps a deeper design flaw.

I’ll give a toy example of the problem, which shows up regularly in real (but more complicated) contexts. Consider the following function, whose arguments are a matrix X and a vector s of column indexes, and which returns a vector containing the Euclidean norm of each row of X, looking only at columns in s:

   subset.norms <- function (X, s)
   { sqrt(apply(X[,s]^2,1,sum))
   }

Here’s an example of its use:

   > M
        [,1] [,2] [,3] [,4]
   [1,]    1    4    7   10
   [2,]    2    5    8   11
   [3,]    3    6    9   12
   > subset.norms(M,c(1,3))
   [1] 7.071068 8.246211 9.486833

Perhaps it would be interesting to see how the norms get smaller as we drop leading dimensions:

   > for (k in 1:4) print(subset.norms(M,k:4))
   [1] 12.88410 14.62874 16.43168
   [1] 12.84523 14.49138 16.15549
   [1] 12.20656 13.60147 15.00000
   Error in apply(X[, s]^2, 1, sum) : dim(X) must have a positive length

Oops… When the loops gets to where k has the value 4, the subset k:4 contains just one dimension. When R comes to evaluating X[,s], it doesn’t return a matrix with one column, but rather a vector. The apply function doesn’t work on vectors, so we get an error message rather than the answer.

There’s a fix. The “[” operator takes a “drop”argument, which defaults to TRUE, giving the behaviour above, but which can be explicitly set to FALSE to disable conversion from a matrix to a vector in these circumstances. If we modify the function as follows:

   subset.norms2 <- function (X, s)
   { sqrt(apply(X[,s,drop=FALSE]^2,1,sum))
   }

We get the right answers:

   > for (k in 1:4) print(subset.norms2(M,k:4))
   [1] 12.88410 14.62874 16.43168
   [1] 12.84523 14.49138 16.15549
   [1] 12.20656 13.60147 15.00000
   [1] 10 11 12

There are two problems with this fix. One is that in some programs, one needs to put drop=FALSE in numerous places, making the code rather hard to read. The more serious problem, and what makes this a real design flaw, is that writing code without drop=FALSE is so much easier that it’s often left out even when it is needed. Indeed, since the errors typically occur only for extreme cases, it’s easy for the programmer to not realize there’s a problem. (As an aside, this is another context where the reversing sequences problem arises — if n is 0, subset.norms2(M,1:n) should produce all zeros, but doesn’t, not because of a bug in subset.norms2 but because 1:0 doesn’t produce the empty sequence.)

Can we solve the dimenion dropping problem by just changing the default for drop from TRUE to FALSE? Of course not, since this would break too many existing programs. But even if backward compatibility weren’t a problem, this wouldn’t be a good solution, since the times when we want drop to be TRUE are even more numerous than when we want it to be FALSE. Look at the following, for instance:

   > M[2,3,drop=FALSE]
        [,1]
   [1,]    8

If drop defaults to FALSE, ordinary subscripting with single indexes for both dimensions will give a 1×1 matrix! Now, R will treat matrices as vectors (and scalars) as needed in many contexts, but propagating a dim attribute (what marks something as a matrix) all over the place will sooner or later cause problems.

What can be done at this point? I’m not sure. The best idea I can come up with is to let subscripts be separated by semicolons as well as commas, with use of a semicolon signaling that drop will be FALSE. In other words, M[i;j] would be equivalent to M[i,j,drop=FALSE], but much more concise. People could start using the semicolon whenever they’re not trying to extract single elements, without breaking old programs.

The root cause of the dropped dimension problem seems to be that R does not have scalars. Instead, what looks like a scalar is actually a vector of length one. To R, there is no difference between M[2,3] and M[2:2,3:3]. So there’s no way for the first of these to drop dimensions and the second to retain them.

It’s probably hopeless to try to change now, but I think it would have been better for scalars to be treated as vectors or converted to vectors as necessary, without pretending that they’re always vectors. Not having real scalars may seem like a unification or simplification, but it just creates problems. These start with the first thing a new user sees on trying R out, which is likely to be something like this:

   > 2+2
   [1] 4

Good. R knows how to add 2 and 2. But what’s the “[1]” doing there?

About these ads

Entry filed under: R Programming, Statistics, Statistics - Computing. Tags: .

The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever Young Explorer

25 Comments Add your own

  • 1. Ed  |  2008-08-20 at 1:47 am

    Thanks for the useful post. I have spent lots of time debugging this problem (I always forget that R converts a single row to a vector), and I never knew the “drop=F” solution. While the code still doesn’t look nice, it’s better than what I had been doing (inserting if (is.numeric(M)) statements that convert the vector back to a matrix).

    Reply
  • 2. Hadley Wickham  |  2008-08-20 at 9:24 am

    I think you miss the underlying reason that causes the problem with drop (and many other sometimes annoying features of R) – the tension between real-time data analysis and programming.

    Most of the time when you’re doing data analysis in R drop = T is the right default – you just want the minimum structure that has your data in it. When you’re programming, you want functions to be consistent, always return the same data type, with same dimensions. There are many functions in R which violate this rule of good programming, but most of the time it’s to facilitate real-time interactive use.

    Duncan Temple Lang (and others) have done some exploration of making this boundary more distinct by adding annotations to functions that would allow you to push the defaults towards the programming side of the scale.

    Reply
  • 3. radfordneal  |  2008-08-20 at 10:47 am

    Yes. As I mentioned in my first post on design flaws, the original version of S was (as I understand it) more a calculator than a programming language. So some things are set up on the assumption that it’s a human entering the command. I don’t think a substantial conflict between the goal of good interactive use and good programming is inevitable, however. I think it’s more that the programming aspects weren’t even considered at the beginning.

    For example, the way “:” produces both ascending and descending sequences isn’t a great help for an interactive user. Ascending sequences are what you mostly want, so making “:” always produce ascending sequences would be fine. On the infrequent occassions when you want a descending sequence, rev(1:n) isn’t too hard to type. I think the original designers simply threw in the reversing behaviour of “:” without much thought.

    Reply
  • 4. Luis  |  2008-08-20 at 7:52 pm

    Actually, I find the distinction between vectors and matrices more annoying than the ‘dropping’ issue. Why not treat everything as a matrix?

    Reply
  • 5. radfordneal  |  2008-08-20 at 8:11 pm

    Yes, that would be another consistent approach. As it is, R inconsistently considers scalars to be vectors of length one, but doesn’t consider vectors to be 1xn or nx1 matrices. Maybe a reason for that is the problem of choosing 1xn or nx1. But I suspect that if this was done consistently, dropping would cease to be an issue. (Actually, it would be impossible to drop, so the real question is whether this approach would cause some other problems.)

    Reply
  • 6. Tieming Ji  |  2008-08-21 at 3:55 pm

    The R insertions in this blog look neat. Could you please tell me how did you manage them (cut and paste)? Thanks!

    Reply
  • 7. Radford Neal  |  2008-08-21 at 6:11 pm

    Yes, I did copy and paste, into an HTML <pre> environment. The coloured background is just what the wordpress style I’m using does with <pre>.

    Reply
  • 8. wcw  |  2008-08-22 at 2:12 am

    The seq() behavior you noted really is a problem, if minor, but ‘drop = TRUE’ as default isn’t — as you noted around paragraph nine.

    Yes, the first time you realize that you have to regex-replace ‘.. = FALSE’ in a script, it’s annoying, but this one isn’t a design flaw. It’s a design decision with which you work. Your proposed solution — which boils down to changing nothing, but saving a couple keystrokes — should have told you as much. When you change nothing, you have not found a design flaw, full stop.

    But keep up the good work. R is nice, but indeed far, far, far from perfect.

    Reply
  • 9. Radford Neal  |  2008-08-22 at 9:54 am

    Well, I’m not saying that the design flaw was just setting the wrong default for “drop”! As I note, it’s not that simple – the design problem goes deeper. But I do think it’s a flaw, not just an annoying aspect of the language. The reason is that it strongly encourages the writing of unreliable programs. My proposed (imperfect) solution, even though it just “saves a couple keystrokes” (actually 7 to 11 keystrokes, depending on whether you rely on F being FALSE), would make it much easier to write programs that don’t have occassional funny bugs due to dropping.

    Reply
  • 10. Cyril  |  2008-08-22 at 3:01 pm

    A nice illustration of this irritating problem which almost invariably comes up when doing matrix calculations in R.

    Regarding your comment that R does not consider a vector as a 1xn or nx1 matrix. In fact, sometimes it seems to do either.

    Using matrix (hope this trick work in comments)

    M <- matrix(1:16, 4, 4)

    and a vector

    v <- c(0,0,1,0)

    one can left-multiply and right-multiply M by v to select the third row or column. So in that context the vector does behave like a row or column matrix depending on context. And the result, although it is (in math terms) a vector, is obtained in R as a row or column matrix.

    In fact v %*% M %*% v yields the same result as M[3,3,drop=F] as in your above example.

    I would also prefer to have everything kept as matrix, perhaps trimming the output. Or at least have some clear “typecasting” rules.

    Reply
  • 11. Sandro Saitta  |  2008-09-15 at 9:20 am

    I have also encountered these problem with R. Which means that my code contains a lot of DROP=F or data.frame(…) in order to avoid this conversion issue.

    To Luis: if you want everything to be treated as a matrix, then Matlab would be a better choice than R :-)

    Reply
  • [...] a few lucid demonstrations of R’s flaws, see these interesting Radford Neal posts: (1) (2) (3) .  It has way more problems than these, of course.  The core’s development model is too [...]

    Reply
  • 13. R complaints - Justin Talbot  |  2009-01-26 at 6:19 pm

    [...] indexing issues #1, #2, #2a, #3 (Radford Neal) As a CS guy I find 1-based vectors hard to justify, but Radford notes a [...]

    Reply
  • 14. Max  |  2009-11-11 at 7:40 am

    I agree with Hadley that the root of the problems is that R is not meant to be used as a programming language but rather as a data analysis environment. As such, we shouldn’t be using it as a programming language. The problem is that I do not know any programming language with sufficient high-level plotting functions.

    Reply
  • 15. Radford Neal  |  2009-11-12 at 8:31 pm

    I think “data analysis” versus “programming” is a false distinction. A data analysis environment that doesn’t allow programming would be very limiting. That’s one reason why R is popular – you can use it for data analysis, which includes a bit of programming now and then when you need to (for example) do some complicated conversion from one data format to another.

    I also think that there is little or no actual conflict between good features for data analysis and good features for programming. Historically, S was designed first with data analysis in mind, with probably little thought for programming. And I’m not blaming the original designers for not forseeing how programming would become more important. But if S/R were re-designed from scratch, there would be no reason to put in features bad for programming in order to make data analysis easier. It’s not too hard to make things good for both. Even now, it may not be hard, as seen from my other posts on how to solve these problems.

    Reply
  • 16. Florian B  |  2010-08-12 at 12:12 pm

    Regarding the disambiguation of vectors versus matrixes:

    A matrix is just a vector with a dim attribute:
    > v v
    [1] 1 2 3 4 5 6 7 8 9 10
    > attr(v,”dim”) v
    [,1] [,2] [,3] [,4] [,5]
    [1,] 1 3 5 7 9
    [2,] 2 4 6 8 10

    There are loads of things in R which are confusing in the beginning, but once you know them, often you can work with them.

    Reply
  • 17. denes turei  |  2011-12-05 at 9:23 pm

    or you can use a function
    as.matrix()
    or
    matrix(vector,ncol=a,byrow=T)

    Reply
    • 18. Radford Neal  |  2011-12-06 at 3:41 pm

      Yes, but matrix(M[1:i,],ncol=ncol(M),byrow=TRUE) is even worse than M[1:i,,drop=FALSE].

      Reply
  • 19. Clement Kent  |  2012-10-24 at 5:41 pm

    A bit of history: S was loosely based on the much more rigorous language APL. APL is more consistent about dimensions than either S or R and is/was, IMHO, easier to use for most array manipulations. On the other hand, R is free and comes with a gazillion great packages; APL is now used by a much smaller group of programmers than R.

    Reply
    • 20. Radford Neal  |  2012-10-24 at 6:24 pm

      Well, S and APL both have vectors, but beyond that, any connection is VERY loose…

      Reply
  • 21. Eli  |  2013-03-06 at 3:13 pm

    I too write a lot of programs in R that use matrix multiplication.

    #this is an example that kills me
    a=matrix(1,4,4)
    # transpose of the first row is a 4×1 COLUMN matrix
    t(a[1,])

    sadly, first R removes the dim attr, then does a transpose on the vector (which by some bizarre logic is possible with no dim attr) and gives you a 1×4 ROW matrix.

    Horrible. However this is easily circumvented by employing drop=FALSE defensively throughout all your R code. Until… you need to work with arrays with more than 2 dimensions.

    a=array(1,dim=c(1,2,10))
    #once again R drops the dim attr and we get a row
    t(a[,,1])
    #that’s ok, let’s use drop=FALSE
    t(a[,,1,drop=FALSE])
    #oh no, a is not a matrix so we get an error
    #–which actually makes sense unlike the behavior for t() on a vector with no dim attr

    But what do we do to get a[,,1] as a matrix so we can take the transpose?

    #this works but is computationally 3x as expensive as needed
    b=a[,,1]
    dim(b)=c(dim(a)[1],dim(a)[2])
    t(b)

    #this is less expensive but ugly
    matrix(a[,,1],dim(a)[2],dim(a)[1],byrow=TRUE)

    #by the way, this does not work for getting the correct transpose of a matrix
    b=a[,,1]
    dim(b)=c(dim(a)[2],dim(a)[1])

    My solution was to write my own utility function for subsetting 3D matrices.

    Reply
    • 22. Radford Neal  |  2013-03-06 at 3:29 pm

      For the three-dimensional case, you maybe should use the “aperm” function.

      Reply
    • 23. Clement Kent  |  2013-03-06 at 6:06 pm

      My previous comment on APL, the predecessor of R and S, applies well to Eli’s complaint. APL, and a number of non-S languages that descend from it, do not drop dimensions, have rational designs for permutation, transposition, and subsetting, and allow a larger number of operations along dimensions of arrays than R. I’m not going backward to use APL, but really wish R could be upgraded to be as good as its grandparent is for arrays.

      Reply
  • 24. Anonymous  |  2013-05-02 at 4:17 pm

    Thank you, this is a very useful tip! I have an automated porgram that has to deal with this situation often. The “drop” statement shortened my program by at least 20 lines!

    Reply
  • 25. David Chudzicki  |  2013-12-23 at 12:28 pm

    Indexing data frames’ columns with the matrix syntax (e.g., dataFrame[, cols]) has the same problem, but in that case (unlike with matrices) there’s a good solution: don’t use this syntax!

    To get a data frame, say “dataFrame[cols]“; to get a vector, say “dataFrame[[col]]“. (This is what you’d expect from the fact that data frames are lists.)

    I could be missing something, but it seems to me like this is a good convention that should be added to everyone’s style guides.

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


Calendar

August 2008
M T W T F S S
    Sep »
 123
45678910
11121314151617
18192021222324
25262728293031

Most Recent Posts


Follow

Get every new post delivered to your Inbox.

Join 117 other followers