## Design Flaws in R #1 — Reversing Sequences

*2008-08-06 at 7:34 pm* *
16 comments *

The R language for statistical computing has become the standard for academic statistical research, for the very good reason that it’s better than the alternatives. It’s far from perfect however. I could come up with a long “wish list” of desired features it lacks, but that’s not what I’ll do in this series of posts. Instead, I’ll concentrate on real design flaws — aspects of the language that make it all too easy to write unreliable programs. Many of these are inherited from R’s predecessors, S and S-Plus, and may even originate in the very earliest version of S, which was seen more as a calculator than as a programming language.

By far the most commonly-encountered design flaw in R is in the “:” operator for producing sequences, whose most common use is in “for” statements. Here’s an example:

# TRAPEZOIDAL INTEGRATION FUNCTION. Integrates # f(x,...) from x=a to x=b, using the trapezoidal # rule with n+1 points. trapezoidal.rule <- function(f,a,b,n,...) { h <- (b-a) / n s <- 0 for (j in 1 : (n-1)) { s <- s + f(a+j*h,...) } s*h + (f(a,...)+f(b,...))*(h/2) }

This approximates a definite integral by summing up the areas of *n* trapezoids defined using a grid of *n+1* points from *a* to *b*. The area of one trapezoid is the width of its base, *(b-a)/n*, times the average of the heights of its two side, *(f(left)+f(right))/2*, where *left* and *right* are adjacent grid points. The function *f* is used twice at every grid point except for the very first and very last. To avoid wasting time, the program evaluates *f* only once at these *n-1* middle points. This is handled by the “for” loop.

I’ve avoided a minor design flaw in R here, by using *1 : (n-1)* to define the range of values for *j* in the loop, which should be *1, 2, …, n-1*. Beginning R programmers usually write *1:n-1*, and only find out after hours of debugging that this is interpreted as *(1:n)-1*, and so produces the sequence *0, 1, …, n-1*. But at least they usually realize they have a bug.

The main problem is far more insidious. The program above works fine, *as long as n is greater than 1*. And it usually is, since you usually want to approximate the integral using more than one trapezoid, if you’re interested in getting an accurate result. But *n=1* is an entirely valid value, and sooner or later will come up, perhaps when a user cares more about speed than accuracy, or perhaps when another program automatically chooses this value of *n* for some reason. Here’s an illustration of what happens:

> trapezoidal.rule (function (x) x^2, 3, 6, 100) [1] 63.00045 > trapezoidal.rule (function (x) x^2, 3, 6, 2) [1] 64.125 > trapezoidal.rule (function (x) x^2, 3, 6, 1) [1] 202.5

The exact answer is 63. Even using *n=2*, one gets pretty close. But 202.5 when using *n=1*? The correct value for the approximation using one trapezoid is *3 x (3 ^{2}+6^{2})/2 = 67.5*.

What’s happened? Well, when *n=1*, the sequence *1 : (n-1)* defining the range in the “for” statement is *1:0*. Rather than interpreting this as the empty sequence, R decides that you must have wanted a sequence of *decreasing* values, namely 1 and 0. So the body of the “for” loop is done twice, rather than not at all, with the unfortunate result seen above.

Automatically reversing the direction of the sequence *p:q* when *p>q* must have seemed like a convenience to the original designers, who probably had in mind the situation where this expression is typed in directly by a user, who knows whether *p>q* or *p<q*. It’s a terrible idea in a programming language, however. Almost always, the logic of the program will require either an increasing sequence or a decreasing sequence, not one direction some times and the other direction other times. And in many programs, the empty sequence will be needed as a valid limiting case, and of course is never generated with automatic reversal.

Now, it clearly is possible to write working programs despite this design flaw. For a sequence starting at 1, *seq(length=n)* will work. You might think that *seq(1,n,by=1)* would also work, but instead it produces an error message when *n=0*. You can also use an “if” statement to bypass the “for” loop when it shouldn’t be done, as below:

if (n>1) { for (j in 1 : (n-1)) { s <- s + f(a+j*h,...) } }

Unfortunately, all these solutions are more work to write, and harder to read. It’s much easier to write programs that don’t guard against problems of reversing sequences (which crop up in places other than “for” loops too). The result is programs that mostly work, most of the time.

Can this flaw in R be fixed? At this point, changing the specification of the “:” operator is impossible — it would invalidate too many existing programs. New operators could be introduced, however. For example, “:>:” could generate an increasing sequence, with *a:>:b* producing the empty sequence if *a>b*. Similarly, “:<:” could generate a decreasing sequence. (Some might think these operators should be reversed, but I think they’re the right way around if you view “>” and “<” as arrows, not less than and greater than signs.) With these new operators, it would become almost as easy to write reliable programs as to write unreliable ones.

Entry filed under: R Programming, Statistics, Statistics - Computing. Tags: .

1.Jo | 2008-08-19 at 1:24 amAs a language designer, I would be very interested to hear your other critiques on the R language, from pedantic to the profound.

2.Byron | 2008-08-19 at 2:27 amI’d stay away from futher overloading : in R since it already does double duty when dealing with namespaces. You are, however, perfectly free to implement new operators using %% notation so you can have %>% and %<% if you’d like.

3.Hadley Wickham | 2008-08-19 at 9:43 amOr use seq_len. Doesn’t solve the problem when n is > 1 though.

A more R-like solution would be to assume that f is vectorised (or use Vectorise to ensure it is) and then eliminate the for loop.

4.Longhai | 2008-08-19 at 9:56 amHere is another issue about R I am feeling uncomfortable. When one takes only one column or one row from a matrix, R always “convert” it into a vector. I have not seen the convenience it brings to me so far, but I did find it sometimes makes program crash, for example, one may want to select only one variable from a large data set, and may want to use ncol(new data) to read the number of variables (should be equal to 1), but since R makes it a vector, function ncol does not work here. One can specify drop=FALSE to stop this conversion, but I really do not see why drop needs to be TRUE in my experience.

5.radfordneal | 2008-08-19 at 10:38 amAh, you’ve beat me to it!

I’m planning on posting on the “drop” issue, which I think is unfortunately a bit harder to solve than just making FALSE the default (even aside from backward compatibility issues).

6.Hans | 2008-08-21 at 12:31 amAfter being caught by this once too often (twice), I now usually define functions inc(low,high) and dec(high,low) that behave as your new operators would.

I don’t like strange binary operators, but you can feel free to assign them yourself and give it a try. The below code will assign a new global binary operator to %<%

“%<%”=function(l,r) {“your function here”}

7.Sandro Saitta | 2008-09-15 at 3:29 amVery interesting post. I had the same problem when I started to program in R a few months ago. I also think that one main drawback of R is its poor error outputs. Most of the time an error occurs, there is even no information about the line number. I’m looking forward to read your next posts about R.

8.Hadley Wickham | 2008-09-15 at 4:58 pmSandro – do you know about traceback? Line numbers are usually pretty meaningless.

9.Love it and hate it, R has come of age - Brendan O'Connor's Blog | 2009-01-07 at 2:04 am[…] a few lucid demonstrations of R’s flaws, see these interesting Radford Neal posts: (1) (2) (3) . It has way more problems than these, of course. The core’s development model is […]

10.Chris Barts | 2009-01-11 at 2:55 amCan you fix the smilies in your source code? They make it rather difficult to follow.

11.Radford Neal | 2009-01-11 at 2:52 pmThis is bizarre! I assure you that there were no smilies when I originally posted this, and for at least a few days afterwards. There must have been some software change with wordpress

since then.

I’ll try to figure out what’s going on…

12.Radford Neal | 2009-01-11 at 3:15 pmFixed now. I changed all occurances of colon followed by open parenthesis to put a space between them. It seems wordpress is now automatically converting such sequences to references to an image of a frowning face.

Personally, I think the smilies were always a bit too cute, and automatically replacing them with actual images of faces removes any remnants of the joke (such as it was). Some programmers have too much time on their hands.

I could go on about how stupid it is to format programs so that comments, keywords, etc. use different fonts and colours, and how the colour version of Unix/Linux “ls” that lists different types of files in different colours is really stupid, and really annoying when it’s the default and you can’t figure out how to disable it, but maybe I should stop about now…

13.R complaints — Justin Talbot | 2009-01-14 at 11:47 pm[…] indexing issues #1, #2, #2a, #3 (Radford Neal) As a CS guy I find 1-based vectors hard to justify, but Radford notes a […]

14.Stefan | 2010-06-04 at 9:11 pmFrom a programming language point of view, R seems to very flawed and bizarre. But, let’s remember it was designed by people who were not versed in computer science.

15.Florian B | 2010-08-12 at 12:03 pmAs Hadley said: Use seq_len or seq_along. It was promoted in a book by Chambers or Gentleman, and first I did not understand. After making the mistake a few times, I understood.

16.An R wish list for 2012 | Quantum Forest | 2011-12-26 at 6:09 am[…] by default R deals with the underlying numerical coding. Radford Neal presents other examples on reversing sequences and using curly brackets to speed up […]