The new pqR parser, and R’s “else” problem

2018-11-27 at 8:33 pm 1 comment

The latest version of pqR (mostly) solves R’s “else” problem, by modifying the new parser previously introduced in pqR. I’ll explain the “else” problem and solution here, and also present other advantages of pqR’s parser over the R Core parser, including a big speed advantage in one context.

The “else” problem in R

Probably most new R programmers have encountered the following problem. They put some statements like the following in a file, say “sltp.r”:

sltp <- iris$Sepal.Width < iris$Petal.Width
if (any(sltp)) print(iris[sltp,])
else cat("Sepal width is never less than petal width\n")

They then try to execute this script with “source”, and get this result:

> source("sltp.r")
Error in source("sltp.r") : sltp.r:3:1: unexpected 'else'
2: if (any(sltp)) print(iris[sltp,])
3: else
   ^

Trying to run the script with Rscript has the same problem:

$ Rscript sltp.r
Error: unexpected 'else' in "else"
Execution halted
$

This is puzzling, since the same code works fine inside a function whose body is in curly brackets:

> f <- function () {
+   sltp <- iris$Sepal.Width < iris$Petal.Width
+   if (any(sltp)) print(iris[sltp,])
+   else cat("Sepal width is never less than petal width\n")
+ }
> f()
Sepal width is never less than petal width

Now, there’s a reason for this. When entering expressions interactively, R is supposed to print the value of the expression as soon
as it’s been entered. But how can it know whether an “if” statement at the end of a line with no else clause is complete (there is no “else” part), or whether instead the user intends to enter an “else” clause on the next line?

One can imagine kludges to get around this in many cases, but the following example shows that there’s not going to be a general
solution:

> a <- 1:9; for (i in 1:9) if (a[i]>5) print(a[i])
[1] 6
[1] 7
[1] 8
[1] 9

After entering the first line above, the user expects to see the output below. They don’t expect to be asked to enter more text in case that text might be an “else” clause.

Situations like the one above are fairly rare, however. One could decide that the user has to enter a blank line after such an “if” statement to confirm that there is no “else” clause — that’s what Python does. But let’s suppose we don’t want to do that. Then we have to evaluate the expression containing the “if” immediately, and consider a following “else” to be an error.

But for non-interactive input, there’s no need to disallow a top-level “else” at the start of a line. Yet it has always been an error in R Core implementations. It no longer is an error in pqR — in input from a file read by Rscript, by “source”, or by “parse”, as well when “parse” is applied to a vector of character strings, it is now OK for an “else” be appear at the start of a line. This avoids annoyances when writing scripts, and also problems when pasting code into a script that was copied from inside a function with curly brackets (where “else” at the start of a line has always been legal).

I don’t know whether this problem has persisted so long in R Core implementations just due to inertia, or because it’s hard to modify the R Core parser to do this. It was fairly easy to modify the pqR parser, though there were some picky details involved.

New operators

The new pqR parser has also facilitated the introduction of new operators in pqR. The .. operator for reliable sequence generation was introduced previously. The latest release introduces ! and !! as operator forms of “paste0” and “paste”. It might be hard to modify the R Core parser to include these operators, because their correct parsing requires use of context — “..” is allowed as part of an identifier (but only at the start or end in pqR), and “!!” can be two successive unary-not operators, but not in a context where the new !! operator can legally appear.

Top-down recursive descent parsing

This leads into the biggest difference between the new pqR parser and the old R Core parser.

The R Core parser is a bottom-up parser automatically generated from a grammar using the Bison/Yacc parser generator. Automatically generated? Sounds wonderful! But one of the inside secrets of computer science is that, despite decades of development, automatic parser generators are not actually very useful. As an illustration of this, gcc previously used such a parser, but abandoned it.

The generally-preferred method is to manually write a top-down “recursive-descent” parser. In theory, this shouldn’t be. Top-down parsers with k-symbol lookahead can handle grammars in the class LL(k); bottom-up parsers with k-symbol lookahead can handle grammars in the class LR(k), which is larger than LL(k). But in practice, most programming languages are in the LL(k) class if one assumes low-level tokenization has been handled, and dealing with funny contextual issues like the pqR .. and !! operators is easier in a top-down parser. Furthermore, the advantage of just writing a grammar and having code for the parser generated automatically is largely illusory, since the code for recursive-descent parsers reads almost like a grammar, with one recursive function for each non-terminal symbol. The code gets more cluttered when one puts in the semantics, but this aspect is also easier in a recursive-descent parser than when using a parser generator.

Parsing speed

The new pqR parser is faster than the R Core parser, but typically only moderately so. One exception, however, is when parsing is done for Rscript, and an expression (e.g., a function definition) extends over many lines — R Core implementations take time growing as the square of the number of lines, whereas pqR takes linear time (as one would expect).

This gross inefficiency is due not just to the R Core parser itself but also to how it interfaces to R’s “read-eval-print loop” (the “REPL”). The R Core implementation used in Rscript first tries parsing the first line of input, and checks whether the parser says it has a complete expression. If not, it tries parsing the first two lines of input. If that doesn’t give a complete expression, it tries parsing the first three lines of input. And so forth.

Here’s an illustration of the resulting inefficiency. File x2.r contains the following:

f <- function (a) {
  a <- a+1; a <- a+1; a <- a+1; a <- a+1
  a <- a+1; a <- a+1; a <- a+1; a <- a+1
}
print(f(0))

Files x300.r and x600.r contain the same thing except that there are 300 and 600 repetitions of the line in the function definition rather than two repetitions.

The time for running these scripts with R-3.5.1’s version of Rscript can be seen here:

$ time R-3.5.1-gcc8/bin/Rscript x2.r
[1] 8

real    0m0.109s
user    0m0.096s
sys     0m0.014s
$ time R-3.5.1-gcc8/bin/Rscript x300.r
[1] 1200

real    0m0.866s
user    0m0.839s
sys     0m0.029s
$ time R-3.5.1-gcc8/bin/Rscript x600.r
[1] 2400

real    0m2.554s
user    0m2.523s
sys     0m0.032s

And for comparison, here are the times with pqR-2018-11-18’s version of Rscript:

$ time pqR-2018-11-18-gcc8/bin/Rscript x2.r
[1] 8

real    0m0.070s
user    0m0.052s
sys     0m0.019s
$ time pqR-2018-11-18-gcc8/bin/Rscript x300.r
[1] 1200

real    0m0.072s
user    0m0.062s
sys     0m0.012s
$ time pqR-2018-11-18-gcc8/bin/Rscript x600.r
[1] 2400

real    0m0.069s
user    0m0.058s
sys     0m0.012s

One can see that pqR is faster even for x2.r, but more importantly, with pqR the time for x300.r and x600.r is negligibly different, as one would expect since even 2400 additions should take negligible time. The huge increase in time with R-3.5.1 is due to the quadratic growth in the time to parse the function definition.

Entry filed under: Computing, R Programming, Statistics, Statistics - Computing.

New version of pqR, with major speed improvements Faster garbage collection in pqR

1 Comment Add your own

1. Bob Carpenter | 2018-11-28 at 9:23 am

The root of this problem is R’s poor statement syntax. I hadn’t run across the else problem, but what I really don’t like is that you have to break a non-nested arithmetic expression after the binary operator rather than before as is more standard in code and typography.

That’s why I went with the standard semicolon solution to the problem in Stan (despite Andrew Gelman mocking my choice for a year as leading to “BUGS with semicolons).

Like all optimization decisions, the requirements of the problem (primarily scalability and efficiency) dictate the appropriate solution. For a language like Stan, where programs tend to be much shorter than in C++, we can get away with less efficient approaches to parsing without it becoming a bottleneck. Programmer speed and ease of adding new constructs is much more important for us than parsing speed. To that end, we recently studied replacements for our existing C++ top-down recursive descent parser (based on the fiendishly clever and diabolically complicated Boost Spirit Qi framework, the choice of which was the worst decision I made in the original construction of Stan).

Sean Talts and Matthijs Vákár wrote prototypes of a custom top-down parser in Rust and a bottom-up LR(1) parser based on the Menhir framework in OCaml. It’s so much easier to write the latter, that it was an easy decision after seeing the prototypes. One of the reasons we didn’t go with a custom top-down approach is that it was harder to get the error messages right. OCaml’s Menhir has a nice built-in framework that automates error messages. In a matter of weeks, Matthijs was able to recreate much better error messages than we’d been able to develop for the top-down parser over several years (and person months, at least, of effort).
Reply

Radford Neal's blog