Design Flaws in R #3 — Zero Subscripts

2008-09-21 at 2:34 pm 17 comments

Unlike the two design flaws I posted about before (here, here, and also here), where one could at least see a reason for the design decision, even if it was unwise, this design flaw is just incomprehensible. For no reason at all that I can see, R allows one to use zero as a subscript without triggering an error. (Remember that in R, indexes for vectors and matrices start at one, not zero.)

This is of course a terrible decision, because it makes debugging harder, and makes it more likely that bugs will exist that have never been noticed.

So what does R do with a zero subscript, seeing as it’s meaningless? It just ignores it, which is possible because R treats every numeric subscript as a vector that extracts or replaces a set of elements, not necessarily just one. So R simply removes all zeros from a vector used as a subscript, producing a shorter vector.

Here’s what happens (with the current version of R, 2.7.2):

   > a
   [1] 10 20 30 40 50
   > a[0]
   numeric(0)
   > a[c(4,2)]
   [1] 40 20
   > a[c(4,0,2,0)]
   [1] 40 20
   > a[0] <- 7
   > a
   [1] 10 20 30 40 50
   > a[c(4,0,2,0)] <- 7
   > a
   [1] 10  7 30  7 50

Contrast this with what happens when you use a subscript that is too large:

   > a
   [1] 10 20 30 40 50
   > a[7]
   [1] NA
   > a[c(4,7,2)]
   [1] 40 NA 20
   > a[7] <- 7
   > a
   [1] 10 20 30 40 50 NA  7

Extending vectors automatically when an assignment is made beyond the end can obviously be useful (though it might be wiser not to). Returning NA when extracting an element beyond the end is also a sensible action (though signalling an error immediately might be more useful for debugging). And negative subscripts are usefully defined as referring to their complement. But what possible use is there for ignoring zero subscripts rather than signalling an error?

It’s perhaps belabouring the obvious, but let me explain that signalling an error when a zero subscript is used is desirable because this is a very common sort of program bug. It can easily arise when a program is scanning backwards through the vector elements, and goes one step too far. It can also easily arise when data is initialized to zeros, with the intent to replace the zeros with something sensible later, but actually some zeros are never replaced. The way R behaves when zero is used as a subscript when replacing elements is particularly bad, since doing nothing at all can easily lead to an apparently working program that produces wrong answers. (The behaviour of returning an empty vector when zero is used as a subscript when extracting an element is more likely to produce an error later on, so that at least the problem will be evident.)
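
To make the second scenario concrete, here is a minimal sketch in which a vector of positions is initialized to zeros and one entry is never filled in. The names `scores` and `slot` are invented for this illustration and come from no real program.

```r
# Hypothetical example: 'slot' is meant to hold the positions in 'scores'
# that should be overwritten, but its third entry is never set.
scores <- c(10, 20, 30, 40, 50)
slot <- c(0, 0, 0)      # initialized to zeros, to be filled in later
slot[1] <- 4
slot[2] <- 2
# slot[3] is forgotten and stays 0

scores[slot] <- 7       # the zero is dropped: no error, no warning
scores                  # only elements 4 and 2 were replaced
```

The program runs to completion, and nothing in the output hints that one of the three intended replacements never happened.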

So what should be done? That’s easy — change R so that use of zero as a subscript produces an immediate error. That’s trivial to do (mixing positive and negative subscripts produces an immediate error now, so the apparatus for it must be there). Might that break some existing programs? Yes, it will. But 99.9% of those programs are already broken. The users just don’t know it, thinking that the answers they get are correct when they’re not. The remaining 0.1% of these broken programs were written by really stupid programmers who thought that exploiting an obscure and unwise feature in order to produce a really hard-to-understand program was a good idea. It wasn’t.
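
Until R itself changes, the check can be approximated in user code. The following is only a sketch under stated assumptions: `check_no_zero` is a hypothetical helper, not part of R, and it looks only at numeric subscripts.

```r
# Hypothetical helper (not part of base R): signal an error if a numeric
# subscript contains a zero, otherwise pass the subscript through unchanged.
check_no_zero <- function(i) {
  if (is.numeric(i) && any(i == 0, na.rm = TRUE)) {
    stop("zero used as a subscript")
  }
  i
}

a <- c(10, 20, 30, 40, 50)

ok <- a[check_no_zero(c(4, 2))]      # no zeros, so this extracts as usual

# With zeros present, the helper stops before any indexing happens.
r <- try(a[check_no_zero(c(4, 0, 2, 0))], silent = TRUE)
inherits(r, "try-error")             # TRUE
```

The same wrapper can be used on the left of an assignment, as in `a[check_no_zero(i)] <- 7`, which is where a silent no-op does the most damage.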

Along with this, R should be changed so that using NA as a subscript when replacing elements in a vector also produces an error. What to do with NA subscripts used to extract elements is a little bit harder to decide, but it seems to me that something about the following is a bit funny:

   > a
   [1] 10 20 30 40 50
   > a[NA]
   [1] NA NA NA NA NA
   > a[NA+0]
   [1] NA
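
One way to see where the difference comes from, offered as a sketch rather than a full account: a bare `NA` is a logical value, and logical subscripts are recycled to the length of the vector, whereas `NA+0` is numeric and is treated as a single position. `NA_integer_` is R's built-in integer NA constant.

```r
a <- c(10, 20, 30, 40, 50)

typeof(NA)               # "logical"
typeof(NA + 0)           # "double"

length(a[NA])            # 5: the logical NA is recycled over all five elements
length(a[NA + 0])        # 1: a single numeric subscript
length(a[NA_integer_])   # 1: likewise, using the integer NA constant
```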

Entry filed under: R Programming, Statistics, Statistics - Computing.


17 Comments

1. Rajiv | 2008-09-21 at 5:10 pm

Last week I tried to translate a C# program to R. It was so painful with the indexes. A graph with 100 nodes, a few vectors, and pain.
2. Luis | 2008-09-21 at 5:27 pm

This somewhat relates to another feature that I consider an annoyance: vector recycling. I am OK with the existence of recycling, but not with using it as the default. I can track quite a few of my own mistakes to that feature.
- 3. techrsr | 2015-08-09 at 1:11 pm
  
  Couldn’t agree more. Vector recycling is a senseless use that pays no heed to the veracity of the arithmetic operation and I have no reason to think anyone requested it as a feature!
4. Antonio Di Narzo | 2008-09-22 at 9:47 am

A very good point indeed.
Maybe a way to address it while keeping the R engine backward compatible is to raise a warning when 0s are used in indexing.
If one desires, by setting
> options(warn=2)
the interpreter will stop on warnings, giving a safer computational framework. What do you think about that?
5. Aniko Szabo | 2008-09-22 at 11:05 am

Those are some cool “features”. The last one apparently happens because NA is logical while NA+0 is numeric.
My favorite related annoyance is with subsetting with NAs:
> a a[a==3] [1] NA 3
I can see the logic, but it catches me all the time. The simplest solution that I know is to use
> a[which(a==3)]
[1] 3
but I still don’t like the default.
This is just to support your statement that it is not clear what to do with subsetting with NAs.
6. Aniko Szabo | 2008-09-22 at 11:07 am

I apparently did not enter the first code-block right:
It should be:
a <- c(NA,1:3)
a[a==3]

Is there a way to preview submissions?
7. Sandro Saitta | 2008-09-25 at 3:42 pm

Again, very good point. In fact I find indexes not straightforward in R since you can have negative indexes (to remove an element). So I find this somehow dangerous:

> a a[4] [1] NA > a[-4] [1] 1 2 3
8. Radford Neal | 2008-09-25 at 3:48 pm

I think something got absorbed by the blog software in the above post. I think that to enter a less-than sign, you need to use ampersand, “l”, “t”, semicolon. I’ll try it here: <

Unfortunately, there seems to be no way to edit comments after they’re submitted, even for me when moderating them.
9. David MacKay | 2008-09-25 at 9:04 pm

> a[HA]
[1] HA HA HA HA HA

I’m reminded of the useful conversational elements in the English-Spanish phrase book.
Do you speak Norwegian?
…
Does anyone here speak Norwegian?
…
I don’t speak Norwegian.
10. Sandro Saitta | 2008-09-29 at 3:26 am

> a = c(1,2,3)
> a[4]
[1] NA
> a[-4]
[1] 1 2 3
11. Radford Neal | 2008-09-29 at 9:28 am

Using = rather than <- does avoid the formatting problems.

The effect of 4 versus -4 does seem a bit funny, though individually the definitions of what they do seem reasonably sensible (not necessarily the best way, since many buggy references will be considered valid, but not silly).

Many properties that one might hope for aren’t going to be satisfied by R’s indexing. For instance, one might hope that saving the i’th element in another variable, then setting it to something else, then setting it back to its saved value, would leave the vector unchanged. But of course it won’t if the index is beyond the previous extent of the vector.
12. Kenn Konstabel | 2008-11-11 at 6:13 am

I agree that many of the flaws referred to above are indeed flaws. But R’s behavior in subsetting a vector with NA’s makes perfect sense:

a<-c(NA,3)
a[a==3]
# the first element might well be 3, were it not missing!!!!
# but if you want just 3’s, no NA’s:
# and it’s not about subsetting, it’s about comparing with `==`

a[a %in% 3]

#or
b <- na.omit(a)
b[b==3]

# or even [very funny!!!]
a[sapply(a, identical, 3)]

I almost always use %in% instead of == in functions
13. Radford Neal | 2008-11-11 at 3:56 pm

I agree that it’s not clear that an error should always be signalled when using NA in a subscript to access elements. I say above only that it should be an error when replacing elements.

However, I’m not convinced that the example you give is good behaviour. If I use a vector of TRUE/FALSE values as a subscript, I expect to get a vector whose length is equal to the number of TRUE values. If some of these TRUE/FALSE values are actually NA, the length of the vector to return is uncertain (if you think of NA as being uncertainly either TRUE or FALSE). This is not at all the same as a vector of a well-defined length in which some elements are NA. In particular, it’s unlikely that the presence of the NA elements will correctly propagate through subsequent operations to produce a sensible result. So it may be best to signal an error when a logical vector containing an NA is used as a subscript.

The situation is different when the subscript is a vector of integers, some of which are NA. Then the result should be a vector the same length as the subscript, and putting in NA where the subscript is NA correctly represents the uncertainty in the value of that subscript.

It’s interesting that %in% handles NA the way you show, though I’m not sure it’s consistent with R’s general treatment of NA!
14. mariotomo | 2009-12-10 at 10:14 am

yes, R is a funny language. it does have its own logic, though!

R> a<-c(NA,3)
R> a==3
[1] NA TRUE
R> a %in% 3
[1] FALSE TRUE
15. R’s Dynamic Scoping « LingPipe Blog | 2010-09-09 at 2:12 pm

[…] Radford Neal: Two Surprising Things about R (following up his earlier series, design flaws in R) […]
16. Rickity Split Tankard | 2012-01-06 at 3:39 am

But the feature works so well with your other favourite feature!

> x <- c("a","b","c","d")
>
> # 3 elements
> i <- 3; x[0:i]
[1] "a" "b" "c"
>
> # 0 elements
> i <- 0; x[0:i]
character(0)
>
> # with a 1 doesn't work so well:
> i <-0; x[1:i]
[1] "a"

RN: I edited this to restore what I take to be what was intended. Remember! To enter a "<" character, you need to use "&lt;".
17. Chaos Theatre » R: Statistical Bash | 2012-12-05 at 2:32 am

[…] 1-indexing arrays. I know Matlab does this. That doesn’t make it right. And also they somehow did it worse than Matlab. […]

Radford Neal's blog