Here, I’ll give an overview of how the new scheme works, and present some performance comparisons with R-3.5.1. Some more details are presented in this talk.

The garbage collector is implemented as a separate module, which could also be of use in projects unrelated to R.

Objects in the new scheme are stored in “segments” — many objects per segment for small objects, just one per segment for big objects. This allows objects to be identified by a segment identifier and an offset within a segment (measured in “chunks”, currently 16 bytes in size), which together fit in 32 bits regardless of the size of a machine address.
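As a rough illustration of how a segment identifier and a chunk offset can share one 32-bit value, here is a small sketch in R. The 26-bit/6-bit split and all the function names are assumptions made for the example; the description above says only that the two parts together fit in 32 bits and that chunks are currently 16 bytes.

```r
# Illustrative only: the 26/6 bit split is an assumption, not taken from pqR.
SEGMENT_BITS <- 26      # assumed number of bits for the segment identifier
OFFSET_BITS  <- 6       # assumed number of bits for the chunk offset
CHUNK_BYTES  <- 16      # chunk size stated in the text

# Pack (segment, offset) into one number below 2^32.
# Doubles are used because R integers are signed 32-bit values.
encode_cptr <- function(segment, offset) {
  stopifnot(segment >= 0, segment < 2^SEGMENT_BITS,
            offset >= 0, offset < 2^OFFSET_BITS)
  segment * 2^OFFSET_BITS + offset
}

# Recover the two parts from the packed value.
decode_cptr <- function(cptr) {
  list(segment = cptr %/% 2^OFFSET_BITS,
       offset  = cptr %% 2^OFFSET_BITS)
}

# Byte position of an object relative to the start of its segment's data.
byte_offset <- function(cptr) decode_cptr(cptr)$offset * CHUNK_BYTES

p <- encode_cptr(segment = 1000, offset = 5)
decode_cptr(p)
byte_offset(p)
```

The point of the sketch is only that the packed value stays the same size whether machine addresses are 32 or 64 bits wide.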

It’s possible to configure pqR to use such 32-bit “compressed pointers” for all references, which reduces memory usage considerably on machines with 64-bit addresses, though at a cost of up to a 40% slowdown for scalar code dominated by interpretive overhead. (There are also compatibility issues with Rcpp and RStudio when compressed pointers are used.) The default is still to use machine addresses for references in R objects, but compressed pointers are always used internally by the garbage collector.

The new scheme reduces the space occupied by an R object even if references in the object do not use compressed pointers. The garbage collector needs to keep track of several sets of objects — for example, newly-allocated objects versus objects that were retained after the previous garbage collection. For this purpose, the old R Core garbage collector requires that every object contain two pointers used to implement such sets as doubly-linked lists. The new pqR garbage collector instead represents these sets much more compactly as bit vectors. If pqR is configured so object references are not done with compressed pointers, each object needs to store a compressed pointer to itself to allow the garbage collector to access these bit vectors, but that takes only 4 bytes, much less than the 16 bytes needed for two 64-bit pointers.
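To make the idea of a set stored as a bit vector concrete, here is a toy sketch in R. It assumes 64 object positions per segment and uses made-up helper names; it is not how pqR lays out its structures, only an illustration that membership costs one bit per position.

```r
# Toy model: one flag per object position in a segment.
# The figure of 64 positions is an assumption for this example.
N_POSITIONS <- 64

new_set      <- function() logical(N_POSITIONS)
set_add      <- function(set, offset) { set[offset + 1] <- TRUE; set }
set_contains <- function(set, offset) set[offset + 1]

# Mark the objects at offsets 3 and 17 as members of some set.
retained <- set_add(set_add(new_set(), 3), 17)

# Packed form: 64 flags occupy 8 bytes.
packed <- packBits(retained, type = "raw")
length(packed)

# Unpacking recovers the same flags.
unpacked <- as.integer(rawToBits(packed)) == 1L
identical(unpacked, retained)
```

By comparison, two 64-bit list pointers in each of 64 objects would occupy 64 × 16 = 1024 bytes.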

A full garbage collection requires that all accessible objects be scanned, and marked for retention. This can potentially result in accesses scattered over large areas of memory, many of which would not be to cache. On modern computers, an access to a memory location not in a cache can be hundreds of times slower than an in-cache reference.

This problem is reduced by the more compact layout of objects in pqR (considerably more compact if compressed pointers are used, and still somewhat more compact if not), since if the total memory occupied is smaller, a larger fraction of it will fit in cache. Locality of reference is also important, since an access to a location near one accessed recently will likely go to a cache.

The use by pqR of bit vectors to represent membership in sets, including the set of objects that have been marked for retention, helps with locality. These bit vectors are stored in 64-byte structures associated with segments, allocated in blocks, which should often result in good locality of access. In contrast, with the old R Core garbage collector these operations involve accessing and writing to a “mark” bit in the object header and accessing and modifying the pointers in an object used for the doubly-linked lists. These accesses will be scattered over the whole area of memory used to hold objects.

It’s difficult to conduct meaningful speed comparisons of the garbage collector alone between pqR and R Core implementations, since they differ not just in their garbage collectors, but also in how many objects they allocate, and how many objects exist that may need to be scanned during garbage collection.

Regarding the last point, the R Core implementation is at a disadvantage because in its recommended configuration all the base R functions will be byte-compiled, increasing the number of objects that need to be scanned in a full garbage collection, whereas byte-compilation is not recommended for pqR. This is not a difference in the garbage collectors themselves, however.

But one can get some insight by looking at the performance of R code for which garbage collection speed might be expected to be an issue. In two tests I show below, garbage collection is more significant in the second than in the first, because in the second test many objects are allocated, retained for some time, but finally recovered.

Here are the two tests run (separately) with pqR-2018-11-18:

    > a<-c(3,4,1); r <- rep(list(0),100)
    > system.time(for (i in 1:100000) for (j in 1:100)
    +   r[[j]] <- list(x1=a+1,x2=a-1,x3=a+2,x4=a-2))
       user  system elapsed
      5.993   0.000   5.993

    > a<-c(3,4,1); r <- rep(list(0),100000)
    > system.time(for (i in 1:100) for (j in 1:100000)
    +   r[[j]] <- list(x1=a+1,x2=a-1,x3=a+2,x4=a-2))
       user  system elapsed
      8.217   0.041   8.257

And here are the same tests run with R-3.5.1 (with the JIT enabled):

    > a<-c(3,4,1); r <- rep(list(0),100)
    > system.time(for (i in 1:100000) for (j in 1:100)
    +   r[[j]] <- list(x1=a+1,x2=a-1,x3=a+2,x4=a-2))
       user  system elapsed
      5.238   0.008   5.246

    > a<-c(3,4,1); r <- rep(list(0),100000)
    > system.time(for (i in 1:100) for (j in 1:100000)
    +   r[[j]] <- list(x1=a+1,x2=a-1,x3=a+2,x4=a-2))
       user  system elapsed
     14.098   0.031  14.129

One can see that R-3.5.1 is a bit faster than pqR-2018-11-18 for the first test, but much slower for the second test. The difference seems to be due to pqR’s faster garbage collector. For the first test, the Linux “perf record” command reveals that both implementations spend less than 5% of their time in the garbage collector (much less than 5% for pqR). For the second test, about 57% of the compute time for R-3.5.1 is spent in the garbage collector, whereas pqR-2018-11-18 spends about 12% of its time in the garbage collector during this test. The faster garbage collection seen here for pqR is presumably due to factors such as the locality effects discussed above.
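For readers without access to “perf”, a rough estimate of the time spent in garbage collection can be obtained from within R using `gc.time`. The sketch below is an alternative way of measuring, not the method used for the figures above, and the helper name `gc_share` is made up for this example. The loop counts are reduced so that it runs quickly.

```r
# Report total user CPU time for an expression and the part of it that
# R attributes to garbage collection, according to gc.time().
gc_share <- function(expr) {
  gc_before <- gc.time()                        # also enables GC timing
  timing <- system.time(expr, gcFirst = FALSE)  # no extra gc() before timing
  gc_spent <- gc.time() - gc_before
  c(total_user = timing[["user.self"]], gc_user = gc_spent[[1]])
}

a <- c(3, 4, 1)
r <- rep(list(0), 1000)
res <- gc_share(for (i in 1:20) for (j in 1:1000)
  r[[j]] <- list(x1 = a+1, x2 = a-1, x3 = a+2, x4 = a-2))
res
```

The numbers from `gc.time` will not match a profiler exactly, but the ratio of the two values gives a similar picture of how much of a run is garbage collection.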

The R Core garbage collector is also slower for a more specific reason, involving handling of character strings. Both pqR and R Core garbage collectors are of the “generational” sort, in which most garbage collections only attempt to recover unused objects that were recently allocated, and consequently do not have to scan old objects that were allocated long ago (they are recovered if no longer used only in the infrequent full collections). But in the R Core implementation, even these partial garbage collections scan *all* character strings. Consequently, as more strings are kept around, all operations that allocate memory (and hence may trigger garbage collection) become slower.

Here’s an illustration. First, some times with pqR-2018-11-18:

    > a<-seq(0,1,length=100); n <- 1000000
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      0.477   0.004   0.480
    > x<-paste("a",1:1000000,"a")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      0.869   0.004   0.873
    > y<-paste("b",1:1000000,"b")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      0.899   0.000   0.898
    > z<-paste("c",1:1000000,"c")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      0.940   0.003   0.943

And here are the times with R-3.5.1:

    > a<-seq(0,1,length=100); n <- 1000000
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      0.504   0.008   0.512
    > x<-paste("a",1:1000000,"a")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      1.975   0.000   1.975
    > y<-paste("b",1:1000000,"b")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      2.857   0.005   2.861
    > z<-paste("c",1:1000000,"c")
    > system.time(for (i in 1:n) r <- list(x=a+1,y=a-1))
       user  system elapsed
      4.073   0.009   4.082

As more strings are created (with references kept to them), the list creation operations slow down only a bit in pqR, but they slow down enormously in R-3.5.1, as every garbage collection (even partial ones) has to scan an increasing number of character strings.


Probably most new R programmers have encountered the following problem. They put some statements like the following in a file, say “sltp.r”:

    sltp <- iris$Sepal.Width < iris$Petal.Width
    if (any(sltp)) print(iris[sltp,])
    else cat("Sepal width is never less than petal width\n")

They then try to execute this script with “source”, and get this result:

    > source("sltp.r")
    Error in source("sltp.r") : sltp.r:3:1: unexpected 'else'
    2: if (any(sltp)) print(iris[sltp,])
    3: else
       ^

Trying to run the script with Rscript has the same problem:

    $ Rscript sltp.r
    Error: unexpected 'else' in "else"
    Execution halted
    $

This is puzzling, since the same code works fine inside a function whose body is in curly brackets:

    > f <- function () {
    +     sltp <- iris$Sepal.Width < iris$Petal.Width
    +     if (any(sltp)) print(iris[sltp,])
    +     else cat("Sepal width is never less than petal width\n")
    + }
    > f()
    Sepal width is never less than petal width

Now, there’s a reason for this. When entering expressions interactively, R is supposed to print the value of the expression as soon as it’s been entered. But how can it know whether an “if” statement with no “else” clause at the end of a line is complete, or whether the user instead intends to enter an “else” clause on the next line?

One can imagine kludges to get around this in many cases, but the following example shows that there’s not going to be a general solution:

    > a <- 1:9; for (i in 1:9) if (a[i]>5) print(a[i])
    [1] 6
    [1] 7
    [1] 8
    [1] 9

After entering the first line above, the user expects to see the output below. They don’t expect to be asked to enter more text in case that text might be an “else” clause.

Situations like the one above are fairly rare, however. One *could* decide that the user has to enter a blank line after such an “if” statement to confirm that there is no “else” clause — that’s what Python does. But let’s suppose we don’t want to do that. Then we have to evaluate the expression containing the “if” immediately, and consider a following “else” to be an error.

But for *non-interactive* input, there’s no need to disallow a top-level “else” at the start of a line. Yet it has always been an error in R Core implementations. It is no longer an error in pqR: in input from a file read by Rscript, by “source”, or by “parse”, as well as when “parse” is applied to a vector of character strings, it is now OK for an “else” to appear at the start of a line. This avoids annoyances when writing scripts, and also problems when pasting code into a script that was copied from inside a function with curly brackets (where “else” at the start of a line has always been legal).
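For scripts that must also run under R Core implementations, the usual workaround is to wrap the statements in curly brackets. Here is a minimal sketch using “parse” on a character vector; the variable names are just for illustration, and the error in the first attempt is what R Core implementations are expected to give (pqR, as described above, should accept it).

```r
src <- c("x <- 5",
         "if (x > 3) y <- 'big'",
         "else y <- 'small'")

# Top-level parse: expected to fail with "unexpected 'else'" in R Core
# implementations. tryCatch keeps the script going either way.
top_level <- tryCatch(parse(text = src),
                      error = function(e) conditionMessage(e))

# Portable workaround: inside braces, a newline after the "if" statement
# does not end it, so the parser goes on to find the "else".
braced <- parse(text = c("{", src, "}"))
eval(braced[[1]])
y
```

The braces change nothing about what the code does; they only tell the parser that the statement may continue on the next line.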

I don’t know whether this problem has persisted so long in R Core implementations just due to inertia, or because it’s hard to modify the R Core parser to do this. It was fairly easy to modify the pqR parser, though there were some picky details involved.

The new pqR parser has also facilitated the introduction of new operators in pqR. The `..` operator for reliable sequence generation was introduced previously. The latest release introduces `!` and `!!` as operator forms of “paste0” and “paste”. It might be hard to modify the R Core parser to include these operators, because their correct parsing requires use of context — “`..`” is allowed as part of an identifier (but only at the start or end in pqR), and “`!!`” can be two successive unary-not operators, but not in a context where the new `!!` operator can legally appear.

This leads into the biggest difference between the new pqR parser and the old R Core parser.

The R Core parser is a bottom-up parser automatically generated from a grammar using the Bison/Yacc parser generator. Automatically generated? Sounds wonderful! But one of the inside secrets of computer science is that, despite decades of development, automatic parser generators are not actually very useful. As an illustration of this, gcc previously used such a parser, but abandoned it.

The generally-preferred method is to manually write a top-down “recursive-descent” parser. In theory, this shouldn’t be. Top-down parsers with k-symbol lookahead can handle grammars in the class LL(k); bottom-up parsers with k-symbol lookahead can handle grammars in the class LR(k), which is larger than LL(k). But in practice, most programming languages are in the LL(k) class if one assumes low-level tokenization has been handled, and dealing with funny contextual issues like the pqR `..` and `!!` operators is easier in a top-down parser. Furthermore, the advantage of just writing a grammar and having code for the parser generated automatically is largely illusory, since the code for recursive-descent parsers reads almost like a grammar, with one recursive function for each non-terminal symbol. The code gets more cluttered when one puts in the semantics, but this aspect is also easier in a recursive-descent parser than when using a parser generator.
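To show how closely a recursive-descent parser can follow a grammar, here is a toy evaluator for arithmetic expressions, written in R. It is only an illustration of the technique, with one function per non-terminal symbol; it is not code from the pqR parser, and the grammar and names are invented for this example.

```r
# Split the input into number and operator tokens (other characters,
# such as spaces, are ignored).
tokenize <- function(s) regmatches(s, gregexpr("[0-9.]+|[-+*/()]", s))[[1]]

evaluate <- function(s) {
  tokens <- tokenize(s)
  pos <- 1
  peek    <- function() if (pos <= length(tokens)) tokens[pos] else ""
  advance <- function() { tok <- peek(); pos <<- pos + 1; tok }

  # expr := term { ("+" | "-") term }
  expr <- function() {
    value <- term()
    while (peek() %in% c("+", "-")) {
      op <- advance()
      rhs <- term()
      value <- if (op == "+") value + rhs else value - rhs
    }
    value
  }

  # term := primary { ("*" | "/") primary }
  term <- function() {
    value <- primary()
    while (peek() %in% c("*", "/")) {
      op <- advance()
      rhs <- primary()
      value <- if (op == "*") value * rhs else value / rhs
    }
    value
  }

  # primary := number | "(" expr ")"
  primary <- function() {
    tok <- advance()
    if (tok == "(") {
      value <- expr()
      if (advance() != ")") stop("expected ')'")
      return(value)
    }
    if (!grepl("^[0-9.]+$", tok)) stop("expected a number, got '", tok, "'")
    as.numeric(tok)
  }

  result <- expr()
  if (pos <= length(tokens)) stop("unexpected token '", peek(), "'")
  result
}

evaluate("2 + 3 * 4")
evaluate("(2 + 3) * 4")
```

Each grammar rule appears as a comment directly above the function that implements it, and the semantics (here, just computing the value) sit inline with the parsing.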

The new pqR parser is faster than the R Core parser, but typically only moderately so. One exception, however, is when parsing is done for Rscript, and an expression (e.g., a function definition) extends over many lines — R Core implementations take time growing as the *square* of the number of lines, whereas pqR takes linear time (as one would expect).

This gross inefficiency is due not just to the R Core parser itself but also to how it interfaces to R’s “read-eval-print loop” (the “REPL”). The R Core implementation used in Rscript first tries parsing the first line of input, and checks whether the parser says it has a complete expression. If not, it tries parsing the first two lines of input. If that doesn’t give a complete expression, it tries parsing the first three lines of input. And so forth.

Here’s an illustration of the resulting inefficiency. File x2.r contains the following:

    f <- function (a) {
      a <- a+1; a <- a+1; a <- a+1; a <- a+1
      a <- a+1; a <- a+1; a <- a+1; a <- a+1
    }
    print(f(0))

Files x300.r and x600.r contain the same thing except that there are 300 and 600 repetitions of the line in the function definition rather than two repetitions.

The time for running these scripts with R-3.5.1’s version of Rscript can be seen here:

    $ time R-3.5.1-gcc8/bin/Rscript x2.r
    [1] 8

    real    0m0.109s
    user    0m0.096s
    sys     0m0.014s
    $ time R-3.5.1-gcc8/bin/Rscript x300.r
    [1] 1200

    real    0m0.866s
    user    0m0.839s
    sys     0m0.029s
    $ time R-3.5.1-gcc8/bin/Rscript x600.r
    [1] 2400

    real    0m2.554s
    user    0m2.523s
    sys     0m0.032s

And for comparison, here are the times with pqR-2018-11-18’s version of Rscript:

    $ time pqR-2018-11-18-gcc8/bin/Rscript x2.r
    [1] 8

    real    0m0.070s
    user    0m0.052s
    sys     0m0.019s
    $ time pqR-2018-11-18-gcc8/bin/Rscript x300.r
    [1] 1200

    real    0m0.072s
    user    0m0.062s
    sys     0m0.012s
    $ time pqR-2018-11-18-gcc8/bin/Rscript x600.r
    [1] 2400

    real    0m0.069s
    user    0m0.058s
    sys     0m0.012s

One can see that pqR is faster even for x2.r, but more importantly, with pqR the time for x300.r and x600.r is negligibly different, as one would expect since even 2400 additions should take negligible time. The huge increase in time with R-3.5.1 is due to the quadratic growth in the time to parse the function definition.
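The retry strategy described above can be imitated in a few lines of R, which makes the quadratic growth easy to see. This is a sketch of the strategy only, not the actual R Core code; the function names are invented, and `make_script` simply builds the contents of files like x300.r.

```r
# Build the lines of a script like x2.r, x300.r, ... with n_lines body lines.
make_script <- function(n_lines) {
  c("f <- function (a) {",
    rep("  a <- a+1; a <- a+1; a <- a+1; a <- a+1", n_lines),
    "}",
    "print(f(0))")
}

# Parse lines 1..k, increasing k until the parser accepts the text,
# and count how many lines were handed to the parser in total.
read_first_expression <- function(lines) {
  lines_scanned <- 0
  for (k in seq_along(lines)) {
    lines_scanned <- lines_scanned + k
    parsed <- tryCatch(parse(text = lines[1:k]), error = function(e) NULL)
    if (!is.null(parsed))
      return(list(expr = parsed, lines_used = k,
                  lines_scanned = lines_scanned))
  }
  stop("no complete expression found")
}

sapply(c(300, 600), function(n)
  read_first_expression(make_script(n))$lines_scanned)
# 45753 181503
```

Doubling the length of the function roughly quadruples the number of lines the parser has to look at, whereas a parser that keeps its state between lines would look at each line once.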

This version has some major speed improvements, as well as some new features. I’ll detail some of these improvements in future posts. Here, I’ll just mention a few things to show the flavour of the improvements in this release, and why you might be interested in pqR as an alternative to the R Core implementation.

One landmark reached in this release is that it is no longer advisable to use the byte-code compiler in pqR. The speed of direct interpretation of R code has now been improved to the point where it is about as fast at executing simple scalar code as the byte-code interpreter. Eliminating the byte-code compiler simplifies the overall implementation, and avoids possible semantic differences between interpreted and byte-compiled code. It is also important for pqR because some pqR optimizations and some new pqR features are not implemented in byte-code. For example, only the interpreter does optimizations such as deferring vector operations so that they may automatically be merged with other operations or be done in parallel when multiple cores are available.

Some vector operations have been substantially sped up compared to the previous release, pqR-2017-06-09. The improvement compared to R-3.5.1 can be even greater. Here is an example of replacing a subset of vector elements, benchmarked on an Intel “Skylake” processor, with both pqR-2018-11-18 and R-3.5.1 compiled from source with gcc 8.2.0 at optimization level -O3:

Here’s R-3.5.1:

    > a <- numeric(20000)
    > system.time(for (i in 1:100000) a[2:19999] <- 3.1)
       user  system elapsed
      4.211   0.148   4.360

And here’s pqR-2018-11-18:

    > a <- numeric(20000)
    > system.time(for (i in 1:100000) a[2:19999] <- 3.1)
       user  system elapsed
      0.256   0.000   0.257

So the current R Core implementation is 17 times slower than pqR for this replacement operation.

The advantage of pqR isn’t always this large, but many vector operations are sped up by smaller but still significant factors. An example:

With R-3.5.1:

    > a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
    > system.time (for (i in 1:100000) {
    +   d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
       user  system elapsed
      1.215   0.015   1.231

With pqR-2018-11-18:

    > a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
    > system.time (for (i in 1:100000) {
    +   d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
       user  system elapsed
      0.654   0.008   0.662

So for this example, pqR is almost twice as fast.

For some operations, pqR’s implementation has lower asymptotic time complexity, and so can be enormously faster. An example is the following convenient coding pattern that R programmers are currently warned to avoid:

With R-3.5.1:

    > n <- 200000; a <- numeric(0);
    > system.time (for (i in 1:n) a <- c(a,(i+1)^2))
       user  system elapsed
     30.387   0.223  30.612

With pqR-2018-11-18:

    > n <- 200000; a <- numeric(0);
    > system.time (for (i in 1:n) a <- c(a,(i+1)^2))
       user  system elapsed
      0.040   0.004   0.045

In R-3.5.1, extending a vector one element at a time with “c” takes time growing as n², as a new vector is allocated when each element is appended. With the latest version of pqR, the time grows only as n log n. In this example, that leads to pqR being 680 times faster, but the ratio could be made arbitrarily large by increasing n.

It’s still faster in pqR to preallocate a vector of length n, but only by about a factor of three, which would often be tolerable when writing one-off code if using “c” is more convenient.
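For reference, here are the two coding patterns side by side, as a short sketch that runs in any R implementation. The function names are made up for this example, and n is kept small so that the appending version finishes quickly even where it takes quadratic time.

```r
# Pattern 1: grow the vector one element at a time with "c".
squares_by_append <- function(n) {
  a <- numeric(0)
  for (i in 1:n) a <- c(a, (i+1)^2)
  a
}

# Pattern 2: preallocate a vector of length n, then fill it in.
squares_preallocated <- function(n) {
  a <- numeric(n)
  for (i in 1:n) a[i] <- (i+1)^2
  a
}

n <- 5000
identical(squares_by_append(n), squares_preallocated(n))
```

Both produce the same vector; the difference is only in how the running time grows with n, which depends on the implementation as discussed above.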

The latest version of pqR has some new features. As for earlier pqR versions, some new features are aimed at addressing design flaws in R that lead to unreliable code, and others are aimed at making R more convenient for programming and scripting.

One new convenience feature is that the paste and paste0 operations can now be written with new `!!` and `!` operators. For example,

    > city <- "Toronto"; province <- "Ontario"
    > city ! "," !! province
    [1] "Toronto, Ontario"

The `!!` operator pastes strings together with space separation; the `!` operator pastes with no separation. Of course, `!` retains its meaning of “not” when used as a unary operator; there is no ambiguity.
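In R implementations without these operators, the same result can be had with “paste0” and “paste” directly, or with user-defined `%...%` operators as stand-ins. The operator names `%!%` and `%!!%` below are invented for this sketch; they are not part of pqR or of R Core.

```r
# Stand-ins for pqR's binary `!` (no separator) and `!!` (space separator).
`%!%`  <- function(a, b) paste0(a, b)
`%!!%` <- function(a, b) paste(a, b)

city <- "Toronto"; province <- "Ontario"

# User-defined %...% operators associate left to right.
city %!% "," %!!% province

# The same thing written with the functions themselves.
paste(paste0(city, ","), province)
```

Both expressions give the string shown in the pqR example above.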

I’ll be writing some more blog posts regarding improvements in pqR-2018-11-18, and regarding some improvements in earlier pqR versions that I haven’t already blogged about. Of course, you can read about these now in the pqR NEWS file.

The main disadvantage of pqR is that it is not fully compatible with the current R Core version. It is a fork of R-2.15.0, with many, but not all, later changes incorporated. This affects what packages will work with pqR.

Addressing this compatibility issue is one thing that needs to be done going forward. I’ll discuss this and other plans — notably implementing automatic differentiation — in another future blog post.

I’m open to other people getting involved in this project. Of course, you can contribute now by trying out pqR and reporting any problems in the comments here or at the pqR issues page. Performance comparisons, especially on real-world applications, are also welcome.

Finally, for the paranoid, here are the shasum values for the compressed and uncompressed tar files that you can download from pqR-project.org:

    89216dc76be23b3928c26561acc155b6e5ad32f3  pqR-2018-11-18.tar.gz
    f0ee8a37198b7e078fa1aec7dd5cda762f1a7799  pqR-2018-11-18.tar


Brickworks, Toronto, June 2018. Nikon F3, Nikkor AIS 135mm 1:2.8 lens, Kodak Portra 400 film, Nikon Coolscan V.

One major change with this version is that pqR, which was based on R-2.15.0, is now compatible with R-2.15.1. This allows for an increased number of packages in the pqR repository.

This release also has some significant speed improvements, a new form of the “for” statement for conveniently iterating across columns or down rows of a matrix, and a new, less error-prone way for C functions to “protect” objects from garbage collection. There are also a few bug fixes (including fixes for some bugs that are also in the current R Core release).

You can read more in the NEWS file, and get it from pqR-project.org.

Currently, pqR is distributed in source form only, and so you need to be comfortable compiling it yourself. It has been tested on Linux/Unix systems (with Intel/AMD, ARM, PowerPC, and SPARC processors), on Mac OS X (including macOS Sierra), and on Microsoft Windows (XP, 7, 8, 10) systems.

I plan to soon put up posts with more details on some of the features of this and the previous pqR release, as well as a post describing some of my future plans for pqR.


Nikon FG, Nikkor AF-D 35mm 1:2 lens, Kodak Portra 160 film, Nikon Coolscan V.

These language extensions make it easier to write reliable programs that work even in edge cases, such as data sets with one observation.

In particular, the extensions fix the problems that `1:n` doesn’t work as intended when `n` is zero, and that `M[1:n,]` is a vector rather than a matrix when `n` is one, or when `M` has only one column. Since changing the “`:`” operator would cause too many problems with existing programs, pqR introduces a new “`..`” operator for generating increasing sequences. Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.
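The two problems can be reproduced in any R implementation with a few lines. The sketch below uses small made-up values and shows the standard workarounds (`seq_len` and `drop=FALSE`) alongside the failing forms.

```r
# Problem 1: 1:n counts down when n is zero, so the loop body runs twice.
n <- 0
count <- 0
for (i in 1:n) count <- count + 1
count                               # 2, not 0

count_ok <- 0
for (i in seq_len(n)) count_ok <- count_ok + 1
count_ok                            # 0

# Problem 2: selecting a single row silently turns the matrix into a vector.
M <- matrix(1:6, 3, 2)
n <- 1
dim(M[1:n, ])                       # NULL: the result is a plain vector
dim(M[seq_len(n), , drop = FALSE])  # 1 2: still a matrix
```

Code written with `1:n` and `M[1:n,]` works for typical inputs and fails only in these edge cases, which is what makes the flaws easy to miss.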

The new release, pqR-2016-06-24, is available at pqR-project.org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).

I’ve written about these design flaws in R before, here and here (and for my previous ideas on a solution, now obsolete, see here). These design flaws have been producing unreliable programs for decades, including bugs in code maintained by R Core. It is long past time that they were fixed.

It is crucial that the fixes make the *easy* way of writing a program also be the *correct* way. This is not the case with previous “fixes” like the `seq_len` function, and the `drop=FALSE` option, both of which are clumsy, as well as being unknown to many R programmers.

Here’s an example of how the new `..` operator can be used:

    for (i in 2..nrow(M)-1)
        for (j in 2..ncol(M)-1)
            M[i,j] <- 0

This code sets all the elements of the matrix `M` to zero, except for those on the edges (in the first or last row or column).

If you replace the “`..`” operators above with “`:`”, the code will not work, because “`:`” has higher precedence than “`-`”. You need to write `2:(nrow(M)-1)`. This is a common error, which is avoided with the new “`..`” operator, which has lower precedence than the arithmetic operators. Fortunately the precedence problem with “`:`” is mostly just an annoyance, since it leads to the program not working at all, which is usually obvious.

The more insidious problem with writing the code above using “`:`” is that, after fixing the precedence problem, the result will work *except* when the number of rows or the number of columns in `M` is less than three. When `M` has two rows, `2:(nrow(M)-1)` produces a sequence of length two, consisting of 2 and 1, rather than the sequence of length zero that is needed for this code to work correctly.
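Both problems are easy to see in standard R with a two-row matrix. The function name `zero_interior_colon` below is made up for this sketch; it is the loop from above written with “`:`” and the precedence fixed.

```r
M <- matrix(1, 2, 4)

2:nrow(M)-1       # parsed as (2:nrow(M)) - 1, which is 1 here
2:(nrow(M)-1)     # 2:1, which counts down: 2 1

# The loop written with ":" after fixing the precedence problem.
zero_interior_colon <- function(M) {
  for (i in 2:(nrow(M)-1))
    for (j in 2:(ncol(M)-1))
      M[i,j] <- 0
  M
}

zero_interior_colon(matrix(1, 3, 4))  # correct: only the interior is zeroed
zero_interior_colon(M)                # wrong: edge elements are zeroed too
```

With three rows the function behaves as intended, but with two rows it visits rows 2 and 1 and overwrites elements that are on the edge, without any error or warning.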

This could be fixed by prefixing the code segment with

if (nrow(M)>2 && ncol(M)>2)

But this requires the programmer to realize that there is a problem, and to not be lazy (with the excuse that they don’t intend to ever use the code with small matrices). And of course the problems with “`:`” cannot in general be solved with a single check like this.

Alternatively, one could write the program as follows:

    for (i in 1+seq_len(nrow(M)-2))
        for (j in 1+seq_len(ncol(M)-2))
            M[i,j] <- 0

I hope readers will agree that this is not an ideal solution.

Now let’s consider the problems with R dropping dimensions from matrices (and higher-dimensional arrays). Some of these stem from R usually not distinguishing a scalar from a vector of length one. Fortunately, R actually can distinguish these, since a vector can have a `dim` attribute that explicitly states that it is a one-dimensional array. Such one-dimensional arrays are presently uncommon, but are easily created — if `v` is any vector, `array(v)` will be a one-dimensional array with the same contents. (Note that it will print like a plain vector, though `dim(array(v))` will show the difference.)

So, the first change in pqR to address the dimension dropping problem is to not drop a dimension of size one if its subscript is a one-dimensional array (excluding logical arrays, or when `drop=TRUE` is stated explicitly). Here’s an example of how this now works in pqR:

    > M <- matrix(1:12,3,4)
    > M
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12
    > r <- c(1,3)
    > c <- c(2,4)
    > M[r,c]
         [,1] [,2]
    [1,]    4   10
    [2,]    6   12
    > c <- 3
    > M[r,c]
    [1] 7 9
    > M[array(r),array(c)]
         [,1]
    [1,]    7
    [2,]    9

The final command above is the one which now acts differently, not dropping the dimensions even though there is only one column, since `array(c)` is an explicit one-dimensional array. The use of `array(r)` similarly guards against only one row being selected, though that has no effect above, where `r` is of length two.

In this situation, the same result could be obtained with similar ease using `M[r,c,drop=FALSE]`. But `drop=FALSE` applies to every dimension, which is not always what is needed for higher-dimensional arrays. For example, in pqR, if `A` is a three-dimensional array, `A[array(u),1,array(v)]` will now select the slice of `A` with second subscript 1, and always return a matrix, even if `u` or `v` happened to have length one. There is no other convenient way of doing this that I know of.
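For comparison, here is roughly what it takes to get the same guarantee in standard R. The helper name `slice_as_matrix` is invented for this sketch, and it assumes that `u` and `v` are vectors of positive integer indexes (so that their lengths give the dimensions of the result).

```r
A <- array(1:24, c(2, 3, 4))

# Select the slice with second subscript 1 and always return a matrix.
# Assumes u and v are positive integer index vectors.
slice_as_matrix <- function(A, u, v) {
  s <- A[u, 1, v, drop = FALSE]        # keeps all three dimensions
  dim(s) <- c(length(u), length(v))    # remove only the middle one
  s
}

dim(A[2, 1, 3])                     # NULL: everything was dropped
dim(A[2, 1, 3, drop = FALSE])       # 1 1 1: nothing was dropped
dim(slice_as_matrix(A, 2, 3))       # 1 1: a matrix, as wanted
dim(slice_as_matrix(A, 1:2, 2:4))   # 2 3
```

Neither setting of `drop` gives the desired result directly, so the dimensions have to be reset by hand (which also discards any dimnames).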

The power of this feature becomes much greater when combined with the new “`..`” operator, which is defined to return a sequence that is a one-dimensional array, rather than a plain vector. Here’s how this works when continuing the example above:

    > n <- 2
    > m <- 3
    > M[1..n,1..m]
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    > m <- 1
    > M[1..n,1..m]
         [,1]
    [1,]    1
    [2,]    2
    > n <- 0
    > M[1..n,1..m]
         [,1]
    >

Note how `M[1..n,1..m]` is guaranteed to return a matrix, even if `n` or `m` is one. A matrix with zero rows or columns is also returned when appropriate, due to the “`..`” operator being able to produce a zero-length vector. To get the same effect without the “`..`” operator, one would need to write

M [seq_len(n), seq_len(m), drop=FALSE]

It gets worse if you want to extract a subset that doesn’t start with the first row and first column — the simplest equivalent of `M[a..b,x..y]` seems to be

M [a-1+seq_len(b-a+1), x-1+seq_len(y-x+1), drop=FALSE]

I suspect that not many R programmers have been writing code like this, which means that a lot of R programs don’t quite work correctly. Of course, the solution is not to berate these programmers for being lazy, but instead to make it easy to write correct code.
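In standard R, the clumsy expressions above can at least be hidden inside a small helper. The name `seq_up` is invented for this sketch; it returns an increasing sequence, which is empty when the upper limit is below the lower limit. Unlike the pqR “`..`” operator it returns a plain vector, so `drop=FALSE` is still needed.

```r
# Increasing sequence from 'from' to 'to'; empty if to < from.
seq_up <- function(from, to) from - 1 + seq_len(max(0, to - from + 1))

M <- matrix(1:12, 3, 4)

seq_up(2, 4)                                      # 2 3 4
length(seq_up(1, 0))                              # 0
dim(M[seq_up(1, 2), seq_up(1, 1), drop = FALSE])  # 2 1
dim(M[seq_up(1, 0), seq_up(1, 1), drop = FALSE])  # 0 1
M[seq_up(2, 3), seq_up(3, 4), drop = FALSE]
```

This works, but every use still has to remember `drop=FALSE`, which is the kind of detail that gets forgotten in practice.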

Dimensions can also get dropped inappropriately when an empty subscript is used to select all the rows or all the columns of a matrix. If this dimension happens to be of size one, R will reduce the result to a plain vector. Of course, this issue can be combined with the issues above — for example, `M[1:n,]` will fail to do what is likely intended if `n` is zero, or if `n` is one, or if `M` has only one column.

To solve this problem, pqR now allows “missing” arguments to be specified with an underscore, rather than by leaving the argument empty. The subscripting operator will not drop a dimension with an underscore subscript (unless `drop=TRUE` is specified explicitly). With this extension, along with “`..`”, one can rewrite `M[1:n,]` as `M[1..n,_]`, which will always do the right thing.

Note that it is unfortunately probably not feasible to just never drop a dimension with a missing argument, since there is likely too much existing code that relies on the current behaviour (though there is probably even more code where the existing behaviour produces bugs). Hence the creation of a new way to specify a missing argument. A more explicit “missing” indicator may be desirable anyway, as it seems more readable, and less error-prone, than nothing at all.

It may also be infeasible to extend the rule of not dropping dimensions indexed by one-dimensional arrays to logical subscripts — when `a` and `b` are one-dimensional arrays, `M[a==0,b==0]` may be intended to select a single element of `M`, not to return a 1×1 matrix — though one-dimensional arrays are rare enough at present that maybe one could get away with this.

The new “`..`” operator does break some existing code. In order that “`..`” can conveniently be used without always putting spaces around it, pqR now prohibits names from containing consecutive dots, except at the beginning or the end. So `i..j` is no longer a valid name (unless quoted with backticks), although `..i..` is still valid (but not recommended). With this restriction, most uses of the “`..`” operator are unambiguous, though there are exceptions, such as `i..(x+y)`, which is a call of the function `i..`, and `i..-j`, which computes `i..` minus `j`. There would be no ambiguities at all if consecutive dots were allowed only at the beginning of names, but unfortunately the ggplot2 package uses names like `..count..` in its API (not just internally).

Also, `..` is now a reserved word. This is not actually necessary to avoid ambiguity, but not making it reserved seems error-prone, since many typos would be valid syntax, and fetching from `..` would not even be a run-time error, since it is defined as a primitive. A number of CRAN packages use `..` as a name, but almost all such uses are typos, with `...` being what was intended (many such uses are copied from an example with a typo in `help(make.rgb)`).

To accommodate packages with incompatible uses of “`..`”, there is an option to disable parsing of “`..`” as an operator, allowing packages written without using these new extensions to still be installed.

The new pqR also has other new features, including a new version of the “for” statement. Implementation of these new language features is made possible by the new parser that was introduced in pqR-2015-09-14, which has other advantages as well. I plan to write blog posts on these topics soon.

As I discussed in that post, the significance of a pause in warming since around 2000, after a period of warming from about 1970 to 2000, would be to show that whatever the warming effect of CO2, other factors influencing temperatures can be large enough to counteract its effect, and hence, conversely, that such factors could also be capable of enhancing a warming trend (e.g., from 1970 to 2000), perhaps giving a misleading impression that the effect of CO2 is larger than it actually is. To phrase this more technically, a pause, or substantial slowdown, in global warming would be evidence that there is a substantial degree of positive autocorrelation in global temperatures, which has the effect of rendering conclusions from apparent temperature trends more uncertain.
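The effect of positive autocorrelation on apparent trends can be illustrated with a small simulation in R. This is only a sketch under simple assumptions: the series contain no true trend at all, the noise is a first-order autoregressive process scaled to have unit variance, and the values of `rho`, the series length, and the number of simulations are arbitrary choices, not estimates from any temperature data.

```r
set.seed(1)

# Standard deviation of the fitted linear trend (slope per year) over many
# simulated series that have NO true trend, for a given autocorrelation rho.
trend_sd <- function(rho, n_years = 30, n_sims = 2000) {
  years <- seq_len(n_years)
  slopes <- replicate(n_sims, {
    if (rho == 0) {
      noise <- rnorm(n_years)
    } else {
      # AR(1) noise, rescaled so its marginal variance is 1
      noise <- sqrt(1 - rho^2) *
               as.numeric(arima.sim(list(ar = rho), n = n_years))
    }
    cov(years, noise) / var(years)   # least-squares slope
  })
  sd(slopes)
}

sd_independent    <- trend_sd(0)
sd_autocorrelated <- trend_sd(0.6)
c(independent = sd_independent, autocorrelated = sd_autocorrelated)
```

With the same overall variability, the autocorrelated series produce a noticeably wider spread of fitted slopes, so an apparent trend of a given size is weaker evidence of a real one.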

Whether you see a pause in global temperatures may depend on which series of temperature measurements you look at, and there is controversy about which temperature series is most reliable. In my previous post, I concluded that even when looking at the satellite temperature data, for which a pause seems most visually evident, one can’t conclude definitely that the trend in yearly average temperature actually slowed (ignoring short-term variation) in 2001 through 2014 compared to the period 1979 to 2000, though there is also no definite indication that the trend has not been zero in recent years.

Of course, I’m not the only one to have looked at the evidence for a pause. In this post, I’ll critique a paper on this topic by Bala Rajaratnam, Joseph Romano, Michael Tsiang, and Noah S. Diffenbaugh, Debunking the climate hiatus, published 17 September 2015 in the journal Climatic Change. Since my first post in this series, I’ve become aware that `tamino’ has also commented on this paper, here, making some of the same points that I will make. I’ll have more to say, however, some of which is of general interest, apart from the debate on the `pause’ or `hiatus’.

First, a bit about the authors of the paper, and the journal it is published in. The authors are all at Stanford University, one of the world’s most prestigious academic institutions. Rajaratnam is an Assistant Professor of Statistics and of Environmental Earth System Science. Romano is a Professor of Statistics and of Economics. Diffenbaugh is an Associate Professor of Earth System Science. Tsiang is a PhD student. Climatic Change appears to be a reputable refereed journal, which is published by Springer, and which is cited in the latest IPCC report. The paper was touted in popular accounts as showing that the whole hiatus thing was mistaken — for instance, by Stanford University itself.

You might therefore be surprised that, as I will discuss below, this paper is completely wrong. Nothing in it is correct. It fails in every imaginable respect.

To start, here is the data they analyse, taken from plots in their Figure 1:

The second plot is a closeup of data from the first plot, for years from 1998 to 2013.

Rajaratnam, et al. describe this data as “the NASA-GISS global mean land-ocean temperature index”, which is a commonly used data set, discussed in my first post in this series. However, the data plotted above, and which they use, is not actually the GISS land-ocean temperature data set. It is the GISS land-only data set, which is less widely used, since as GISS says, it “overestimates trends, since it disregards most of the dampening effects of the oceans”. They appear to have mistakenly downloaded the wrong data set, and not noticed that the vertical scale on their plot doesn’t match plots in other papers showing the GISS land-ocean temperature anomalies. (They also apply their methods to various other data sets, claiming similar results, but only results from this data are shown in the paper.)

GISS data sets continually change (even for past years), and I can’t locate the exact version used in this paper. For the 1998 to 2013 data, I manually digitized the plot above, obtaining the following values:

```
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
0.84 0.59 0.57 0.68 0.80 0.78 0.69 0.88 0.78 0.86 0.65 0.79 0.93 0.78 0.76 0.82
```

In my analyses below, I will use these values when only post-1998 data is relevant, and otherwise use the closest matching GISS land-only data I can find.

Before getting into details, we need to examine what Rajaratnam, et al. think are the questions that need to be addressed. In my previous post, I interpreted debate about a `pause’, or `hiatus’, or `slowdown’ as really being about the degree of long-term autocorrelation in the temperature series. If various processes unrelated to CO2 affect temperatures for periods of decades or more, judging the magnitude of the effect of CO2 will be more difficult, and in particular, the amount of recent warming since 1970 might give a misleadingly high impression of how much effect CO2 has on temperature. A hiatus of over a decade, following a period of warming, while CO2 continues to increase, would be evidence that such other effects, operating on long time scales, can be large enough to temporarily cancel the effects of CO2.

Rajaratnam, et al. seem to have some awareness that this is the real issue, since they say that the perceived hiatus has “inspired valuable scientific insight into the processes that regulate decadal-scale variations of the climate system”. But they then ignore this insight when formulating and testing four hypotheses, each of which they see as one possible way of formalizing a claim of a hiatus. In particular, they emphasize that they “attempt to properly account for temporal dependence”, since failure to do so can “lead to erroneous scientific conclusions”. This would make sense if they were accounting only for short-term auto-correlations. Accounting for long-term autocorrelation makes no sense, however, since the point of looking for a pause is that it is a way of looking for evidence of long-term autocorrelation. To declare that there is no pause on the grounds that it is not significant once long-term autocorrelation is accounted for would miss the entire point of the exercise.

So Rajaratnam, Romano, Tsiang, and Diffenbaugh are asking the wrong questions, and trying to answer them using the wrong data. Let’s go on to the details, however.

The first hypothesis that Rajaratnam, et al. test is that temperature anomalies from 1998 to 2013 have zero or negative trend. Here are the anomalies for this period (data shown above), along with a trend line fit by least-squares, and a horizontal line fit to the mean:

Do you think one can confidently reject the hypothesis that the true trend during this period is zero (or negative) based on these 16 data points? I think one can get a good idea with only a few seconds of thought. The least-squares fit must be heavily influenced by the two low points at 1999 and 2000. If there is actually no trend, the fact that both these points are in the first quarter of the series would have to be ascribed to chance. How likely is that? The answer is 1/4 times 1/4, which is 1/16, which produces a one-sided p-value over 5%. As Rajaratnam, et al. note, the standard regression t-test of the hypothesis that the slope is zero gives a two-sided p-value of 0.102, and a one-sided p-value of 0.051, in rough agreement.
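As a rough check, the standard regression t-test can be run in R on the digitized values listed earlier. This is only a minimal sketch: the variable names (`year`, `anom`, `fit`) are chosen for illustration, and since the values were read off a plot, the result should be close to, but need not exactly match, what the paper reports.

```r
# Digitized 1998-2013 anomalies (the values listed earlier in this post)
year <- 1998:2013
anom <- c(0.84, 0.59, 0.57, 0.68, 0.80, 0.78, 0.69, 0.88,
          0.78, 0.86, 0.65, 0.79, 0.93, 0.78, 0.76, 0.82)

fit <- lm(anom ~ year)                    # least-squares trend line
tab <- summary(fit)$coefficients          # estimate, std. error, t value, p-value

slope <- tab["year", "Estimate"]
p_two <- tab["year", "Pr(>|t|)"]          # two-sided p-value for zero slope
p_one <- pt(tab["year", "t value"],       # one-sided p-value (H0: slope <= 0)
            df = fit$df.residual, lower.tail = FALSE)

print(c(slope = slope, p_two_sided = p_two, p_one_sided = p_one))
```

The two p-values should come out near the 0.102 and 0.051 quoted above, with the one-sided value being exactly half the two-sided one because the estimated slope is positive.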

Yet Rajaratnam, et al. conclude that there is “overwhelming evidence” that the slope is positive — in particular, they reject the null hypothesis based on a two-sided p-value of 0.019 (though personally I would regard this as somewhat less than overwhelming evidence). How do they get the p-value to change by more than a factor of five, from 0.102 to 0.019? By accounting for autocorrelation.

Now, you may not see much evidence of autocorrelation in these 16 data points. And based on the appearance of the rest of the series, you may think that any autocorrelation that is present will be positive, and hence make any conclusions *less* certain, not *more* certain, a point that the authors themselves note on page 24 of the supplemental information for the paper. What matters, however, is autocorrelation in the residuals — the data points minus the trend line. Of course, we don’t know what the true trend is. But suppose we look at the residuals from the least-squares fit. We get the following autocorrelations (from Fig. 8 in the supplemental information for the paper):

As you can see, the autocorrelation estimates at lags 1, 2, and 3 are negative (though the dotted blue lines show that the estimates are entirely consistent with the null hypothesis that the true autocorrelations at lags 1 and up are all zero).

Rajaratnam, et al. first try to account for such autocorrelation by fitting a regression line using an AR(1) model for the residuals. I might have thought that one would do this by finding the maximum likelihood parameter estimates, and then employing some standard method for estimating uncertainty such as looking at the observed information matrix. But Rajaratnam, et al. instead find estimates using a method devised by Cochrane and Orcutt in 1949, and then estimate uncertainty by using a block bootstrap procedure. They report a two-sided p-value of 0.075, which of course makes the one-sided p-value significant at the conventional 5% level.

In the supplemental information (page 24), one can see that the reported p-value of 0.075 was obtained using a block size of 5. Oddly, however, a smaller p-value of 0.005 was obtained with a block size of 4. One might suspect that the procedure becomes more dubious with larger block sizes, so why didn’t they report the more significant result with the smaller block size (or perhaps report the p-value of 0.071 obtained with a block size of 3)?

The paper isn’t accompanied by code that would allow one to replicate its results, and I haven’t tried to reproduce this method myself, partly because they seem to regard it as inferior to their final method.

This final method of testing the null hypothesis of zero slope uses a circular block bootstrap without an AR(1) model, and yields the two-sided p-value of 0.019 mentioned above, which they regard as overwhelming evidence against the slope being zero (or negative). I haven’t tried reproducing their result exactly. There’s no code provided that would establish exactly how their method works, and exact reproduction would anyway be impossible without knowing what random number seed they used. But I have implemented my interpretation of their circular bootstrap method, as well as a non-circular block bootstrap, which seems better to me, since considering this series to be circular seems crazy.
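To make the idea concrete, here is a minimal sketch of a block bootstrap on least-squares residuals, in circular and non-circular forms. It is one plausible reading of the method, not the paper's code and not the script linked at the end of this post: the function names (`ls_slope`, `block_resample`, `boot_slope_pvalue`), the choice to resample residuals about a flat line (a zero-trend null), and the way the two-sided p-value is formed are all assumptions made for illustration.

```r
year <- 1998:2013
anom <- c(0.84, 0.59, 0.57, 0.68, 0.80, 0.78, 0.69, 0.88,
          0.78, 0.86, 0.65, 0.79, 0.93, 0.78, 0.76, 0.82)

# Slope of the least-squares line, in closed form for speed
ls_slope <- function(y, x) {
  xc <- x - mean(x)
  sum(xc * y) / sum(xc^2)
}

# Resample a series in blocks of length `block`. With circular = TRUE a block
# may start anywhere and wraps around from the last value to the first.
block_resample <- function(e, block, circular = TRUE) {
  n <- length(e)
  nblocks <- ceiling(n / block)
  max_start <- if (circular) n else n - block + 1
  starts <- sample.int(max_start, nblocks, replace = TRUE)
  idx <- as.vector(outer(0:(block - 1), starts, "+"))  # consecutive indices
  idx <- (idx - 1) %% n + 1                            # wrap (no-op if non-circular)
  e[idx[1:n]]                                          # trim to original length
}

# Two-sided p-value for zero slope (assumed form of the test)
boot_slope_pvalue <- function(y, x, block = 3, reps = 2000, circular = TRUE) {
  e <- residuals(lm(y ~ x))
  obs <- ls_slope(y, x)
  # Slopes of series made of resampled residuals about a flat line
  null_slopes <- replicate(reps, ls_slope(block_resample(e, block, circular), x))
  mean(abs(null_slopes) >= abs(obs))
}

set.seed(1)
p_circ    <- boot_slope_pvalue(anom, year, circular = TRUE)
p_noncirc <- boot_slope_pvalue(anom, year, circular = FALSE)
print(c(circular = p_circ, non_circular = p_noncirc))
```

Details such as how the last partial block is trimmed, or whether resampled residuals are added to the fitted line rather than a flat one, can change the answer, so this sketch should not be expected to reproduce any particular published p-value.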

I get a two-sided p-value of 0.023 with a circular bootstrap, and 0.029 with a non-circular bootstrap, both using a block size of 3, which is the same as Rajaratnam, et al. used to get their p-value of 0.019 (supplementary information, page 26). My circular bootstrap result, which I obtained using 20000 bootstrap replications, is consistent with their result, obtained with 1000 bootstrap replications, because 1000 replications is not really enough to obtain an accurate p-value (the probability of getting 0.019 or less if the true p-value is 0.023 is 23%).
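The 23% figure is a binomial tail probability. With 1000 replications, a reported p-value of 0.019 or less means that at most 19 replications were as extreme as the observed slope, when each has probability 0.023 of being so. A quick check, assuming independent replications and treating 0.023 as the true value:

```r
reps   <- 1000     # bootstrap replications used in the paper
true_p <- 0.023    # p-value from the 20000-replication run, taken as the truth

# Probability that at most 19 of the 1000 replications are "extreme"
prob <- pbinom(19, size = reps, prob = true_p)
print(prob)
```

The result should be close to the 23% quoted above.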

So I’ve more-or-less confirmed their result that using a circular bootstrap on residuals of a least-squares fit produces a p-value indicative of a significant positive trend in this data. But you shouldn’t believe that this result is actually *correct*. As I noted at the start, just looking at the data one can see that there is no significant trend, and that there is no noticeable autocorrelation that might lead one to revise this conclusion. Rajaratnam, et al. seem devoid of such statistical intuitions, however, concluding instead that “applying progressively more general statistical techniques, the scientific conclusions have progressively strengthened from “not significant,” to “significant at the 10 % level,” and then to “significant at the 5 % level.” It is therefore clear that naive statistical approaches can possibly lead to erroneous scientific conclusions.”

One problem with abandoning such “naive” approaches in favour of complex methods is that there are many complex methods, not all of which will lead to the same conclusion. Here, for example, are the results of testing the hypothesis of zero slope using a simple permutation test, and permutation tests based on permuting blocks of two successive observations (with two possible phases):

The simple permutation test gives the same p-value of 0.102 as the simple t-test, and looking at blocks of size two makes little difference. And here are the results of a simple bootstrap of the original observations, not residuals, and the same for block bootstraps of size two and three:

We again fail to see the significant results that Rajaratnam, et al. obtain with a circular bootstrap on residuals.
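The simple permutation test mentioned above can be sketched in a few lines of R. The test statistic used here (the absolute least-squares slope) and the helper names `ls_slope` and `perm_blocks` are assumptions for illustration. The block version shown permutes pairs starting at the first observation, which is only one of the two possible phases.

```r
year <- 1998:2013
anom <- c(0.84, 0.59, 0.57, 0.68, 0.80, 0.78, 0.69, 0.88,
          0.78, 0.86, 0.65, 0.79, 0.93, 0.78, 0.76, 0.82)

ls_slope <- function(y, x) {
  xc <- x - mean(x)
  sum(xc * y) / sum(xc^2)
}

# Shuffle whole blocks of `block` successive observations
# (length(y) must be a multiple of `block`)
perm_blocks <- function(y, block = 2) {
  b <- matrix(y, nrow = block)            # columns are successive blocks
  as.vector(b[, sample.int(ncol(b))])
}

set.seed(1)
obs <- ls_slope(anom, year)

# Simple permutation test: shuffle individual observations
perm_slopes <- replicate(20000, ls_slope(sample(anom), year))
p_perm <- mean(abs(perm_slopes) >= abs(obs))       # two-sided p-value

# Block permutation test with blocks of two
block_slopes <- replicate(20000, ls_slope(perm_blocks(anom, 2), year))
p_block <- mean(abs(block_slopes) >= abs(obs))

print(c(simple = p_perm, blocks_of_two = p_block))
```

Because the permutations are sampled rather than enumerated, the p-values carry some Monte Carlo error, though with 20000 permutations this is small.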

I also tried three Bayesian models, with zero slope, unknown slope, and unknown slope and AR(1) residuals. The log marginal likelihoods for these models were 11.2, 11.6, and 10.7, so there is no strong evidence as to which is better. The two models with unknown slope gave posterior probabilities of the slope being zero or negative of 0.077 and 0.134. Interestingly, the posterior distribution of the autoregressive coefficient in the model with AR(1) residuals showed a (slight) preference for a positive rather than a negative coefficient.

So how do Rajaratnam, et al. get a significant non-zero slope when all these other methods don’t? To find out, I tested my implementations of the circular and non-circular residual bootstrap methods on data sets obtained by randomly permuting the observations in the actual data set. I generated 200 permuted data sets, and applied these methods (with 20000 bootstrap samples) to each, then plotted a histogram of the 200 p-values produced. I also plotted the lag 1 autocorrelations for each series of permuted observations and each series of residuals from the least-squares fit to those permuted observations. Here are the results:

Since the observations are randomly permuted, the true trend is of course zero in all these data sets, so the distribution of p-values for a test of a zero trend ought to be uniform over (0,1). It actually has a pronounced peak between 0 and 0.05, which amplifies the probability of a `significant’ result by a factor of more than 3.5. If one adjusts the p-values obtained on the real data for this non-uniform distribution, the adjusted p-values are 0.145 and 0.130 for the non-circular and circular residual bootstrap methods. The `significant’ results of Rajaratnam, et al. are a result of the method they use being flawed.

One possible source of the wrong results can be seen in the rightmost plot above. On randomly permuted data, the lag-1 autocorrelations are biased towards negative values, more so for the residuals than for the observations themselves. This effect should be negligible for larger data sets. Using a block bootstrap method with only 16 observations may be asking for trouble.
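The negative bias in lag-1 autocorrelation estimates for short series is easy to demonstrate. The sketch below permutes the 16 digitized values many times and averages the lag-1 estimate from `acf`, both for the permuted observations and for the residuals from a least-squares fit. The helper name `lag1` and the number of permutations are arbitrary choices.

```r
year <- 1998:2013
anom <- c(0.84, 0.59, 0.57, 0.68, 0.80, 0.78, 0.69, 0.88,
          0.78, 0.86, 0.65, 0.79, 0.93, 0.78, 0.76, 0.82)

# Lag-1 autocorrelation estimate (element 1 of $acf is lag 0)
lag1 <- function(v) acf(v, lag.max = 1, plot = FALSE)$acf[2]

set.seed(1)
sims <- replicate(2000, {
  y <- sample(anom)                            # random permutation: no true trend
  c(obs = lag1(y), resid = lag1(residuals(lm(y ~ year))))
})

print(rowMeans(sims))    # average lag-1 estimates over the permuted data sets
```

For a randomly permuted series of length n, the expected value of this estimate is -1/n, which is about -0.06 here, and removing a fitted trend pushes the average further below zero.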

Finally, if one looks at the context of these 16 data points, it becomes clear why there are two low temperature anomaly values in 1999 and 2000, and consequently a positive (albeit non-significant) trend in the data. Here are the values of the Multivariate ENSO Index over the last few decades:

Note the peak index value in the El Nino year of 1998, which explains the relatively high temperature anomaly that year, and the drop for the La Nina years thereafter, which explains the low temperature anomalies in 1999 and 2000. The ENSO index shows no long-term trend, and is probably not related to any warming from CO2. Any trend based on these two low data points being near the beginning of some time period is therefore meaningless for any discussion of warming due to CO2.

The second hypothesis that Rajaratnam, et al. test is that the trend from 1998 to 2013 is at least as great as the trend from 1950 to 1997. Rejection of this hypothesis would support the claim of a `slowdown’ in global warming in recent years. They find estimates and standard errors (assuming independent residuals) for the slopes in separate regressions for 1950-1997 and for 1998-2013, which they then combine to obtain a p-value for this test. The (one-sided) p-value they obtain is 0.210. They therefore claim that there is no statistically significant slowdown in the recent trend (since 1998).
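A test of this kind can be sketched as follows. The function name `slope_diff_test` is hypothetical, and using a normal reference distribution for the standardized difference is an assumption (the paper may combine the estimates differently). The data in the example are simulated purely to show the mechanics; they are not GISS values.

```r
# One-sided test of H0: slope in period 2 >= slope in period 1,
# from separate regressions assuming independent residuals
slope_diff_test <- function(x1, y1, x2, y2) {
  s1 <- summary(lm(y1 ~ x1))$coefficients["x1", c("Estimate", "Std. Error")]
  s2 <- summary(lm(y2 ~ x2))$coefficients["x2", c("Estimate", "Std. Error")]
  z <- (s2[["Estimate"]] - s1[["Estimate"]]) /
       sqrt(s1[["Std. Error"]]^2 + s2[["Std. Error"]]^2)
  c(slope1 = s1[["Estimate"]], slope2 = s2[["Estimate"]],
    z = z, p_one_sided = pnorm(z))   # small p: evidence of a smaller later slope
}

# Simulated example (NOT real temperature data): a rising period, then a flat one
set.seed(1)
x1 <- 1950:1997
y1 <- 0.03 * (x1 - 1950) + rnorm(length(x1), sd = 0.1)
x2 <- 1998:2013
y2 <- 1.4 + rnorm(length(x2), sd = 0.1)

res <- slope_diff_test(x1, y1, x2, y2)
print(round(res, 4))
```

Changing the start year of the first period only requires passing a different `x1` and `y1`, which is what makes the sensitivity to that choice easy to explore.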

I can’t replicate this result exactly, since I can’t find the exact data set they used, but on what seems to be a similar version of the GISS land-only data, I get a one-sided p-value of 0.173, which is similar to their result.

This result is meaningless, however. As one can see in the plot from their paper shown above, there was a `pause’ in the temperature from 1950 to 1970, which makes the slope from 1950 to 1997 be less than it was in the decades immediately before 1998. Applying the same procedure but with the first period being from 1970 to 1997, I obtain a one-sided p-value of 0.027, which one might regard as statistically significant evidence of a slowdown in the trend.

Rajaratnam, et al. are aware of the sensitivity of their p-value to the start date of the first period, since in a footnote on page 27 of the supplemental information for the paper they say

“Changing the reference period from 1950-1997 to 1880-1997 only strengthens the null hypothesis of no difference between the hiatus period and before. This follows from the fact that the trend during 1880-1997 is more similar to the trend in the hiatus period. Thus the selected period 1950-1997 can be regarded as a lower bound on p-values for tests of difference in slopes.”

Obviously, it is not actually a lower bound, since the 1970-1997 period produces a lower value.

Perhaps they chose to start in 1950 because that is often considered the year by which CO2 levels had increased enough that one might expect to see a warming effect. But clearly (at least in this data set) there was no warming until around 1970. One might consider that only in 1970 did warming caused by CO2 actually start, in which case comparing to the trend from 1950 to 1997 (let alone 1880 to 1997) is deceptive. Alternatively, one might think that CO2 did have an effect starting in 1950, in which case the lack of any increase in temperatures from 1950 to 1970 is an early instance of a `pause’, which *strengthens*, rather than weakens, the claim that the effect of CO2 is weak enough that other factors can sometimes counteract it, and at other times produce a misleadingly large warming trend that should not all be attributed to CO2.

Rajaratnam, et al. also test this second hypothesis another way, by seeing how many 16-year periods starting after 1950 and ending before 1998 have a trend at least as small as the trend over 1998-2013. The results are summarized in these plots, from their Fig. 2:

Since 13 of these 33 periods have lower trends than that of 1998-2013, they arrive at a p-value of 13/33=0.3939, and conclude that “the observed trend during 1998–2013 does not appear to be anomalous in a historical context”.

However, in the supplemental information (page 29) they say

“Having said this, from Figure 2 there is a clear pattern in the distribution of 16 year linear trends over time: all 16 year trends starting at 1950 all the way to 1961 are lower than the trend during hiatus period, and all 16 year linear trends starting at years 1962 all the way to 1982 are higher than the trend during the hiatus period, with the exception of the 1979-1994 trend.”

That is the end of their discussion. It does not appear to have occurred to them that they should use 1970 as the start date, which would produce a p-value of 1/13=0.077, which would be at least some weak evidence of a slowdown starting in 1998.

The third hypothesis tested by Rajaratnam, et al. is that the expected global temperature is the same at every year from 1998 on — that global warming has `stalled’. Of course, as they briefly discuss in the supplemental information (page 31), this is the same as their first hypothesis — that the global temperature has a linear trend with slope zero from 1998 on. Their discussion of this hypothesis, and why it isn’t really the same as the first, makes no sense to me, so I will say no more about it, except to note that strangely no numerical p-values are given for the tests of this hypothesis, only `reject’ or `retain’ descriptions.

The fourth, and final, hypothesis tested in the paper is that the distribution of year-to-year differences in temperature is the same after 1998 as during the period from 1950 to 1998. Here also, their discussion of their methods is confusing and incomplete, and again no numerical p-values are given. It is clear, however, that their methods suffer from the same unwise choice of 1950 as the start of the comparison period as was the case for their test of the second hypothesis. Their failure to find a difference in the distribution of differences in the 1998-2013 period compared to the 1950-1997 period is therefore meaningless.

Rajaratnam, Romano, Tsiang, and Diffenbaugh conclude by summarizing their results as follows:

“Our rigorous statistical framework yields strong evidence against the presence of a global warming hiatus. Accounting for temporal dependence and selection effects rejects — with overwhelming evidence — the hypothesis that there has been no trend in global surface temperature over the past ≈15 years. This analysis also highlights the potential for improper statistical assumptions to yield improper scientific conclusions. Our statistical framework also clearly rejects the hypothesis that the trend in global surface temperature has been smaller over the recent ≈ 15 year period than over the prior period. Further, our framework also rejects the hypothesis that there has been no change in global mean surface temperature over the recent ≈15 years, and the hypothesis that the distribution of annual changes in global surface temperature has been different in the past ≈15 years than earlier in the record.”

This is all wrong. There is not “overwhelming evidence” of a positive trend in the last 15 years of the data — they conclude that only because they used a flawed method. They do not actually reject “the hypothesis that the trend in global surface temperature has been smaller over the recent ≈ 15 year period than over the prior period”. Rather, after an incorrect choice of start year, they fail to reject the hypothesis that the trend in the recent period has been equal to or greater than the trend in the prior period. Failure to reject a null hypothesis is not the same as rejecting the alternative hypothesis, as we try to teach students in introductory statistics courses, sometimes unsuccessfully. Similarly, they do not actually reject “the hypothesis that the distribution of annual changes in global surface temperature has been different in the past ≈15 years than earlier in the record”. To anyone who understands the null hypothesis testing framework, it is obvious that one could not possibly reject such a hypothesis using any finite amount of data.

Those familiar with the scientific literature will realize that completely wrong papers are published regularly, even in peer-reviewed journals, and even when (as for this paper) many of the flaws ought to have been obvious to the reviewers. So perhaps there’s nothing too notable about the publication of this paper. On the other hand, one may wonder whether the stringency of the review process was affected by how congenial the paper’s conclusions were to the editor and reviewers. One may also wonder whether a paper reaching the opposite conclusion would have been touted as a great achievement by Stanford University. Certainly this paper should be seen as a reminder that the reverence for “peer-reviewed scientific studies” sometimes seen in popular expositions is unfounded.

The results above can be reproduced by first downloading the data using this shell script (which also downloads other data that I use in other blog posts), or by manually downloading from the URLs it lists if you don’t have wget. You then need to download my R script for the above analysis and this R source file (renaming them to .r from the .doc that WordPress requires), and then run the script in R as described in its opening comments (which will take quite a long time).

]]>A recent focus of this debate has been whether temperature records show a `pause’ (or `hiatus’) in global warming over the last 10 to 20 years (or at least a `slowdown’ compared to the previous trend), and if so, what it might mean. Lukewarmers might interpret such a pause as evidence that other factors are comparable in importance to CO2, and can temporarily mask or exaggerate its effects, and hence that naively assuming the warming from 1970 to 2000 is primarily due to CO2 could lead one to overestimate the effect of CO2 on temperature.

Whether you see a pause might, of course, depend on which data set of global temperatures you look at. These data sets are continually revised, not just by adding the latest observations, but by readjusting past observations.

Here are the yearly average land-ocean temperature anomaly data from 1955 to 2014 from the Goddard Institute for Space Studies (GISS), in the version before and after July of this year:

The old version shows signs of a pause or slowdown after about 2000, which has largely disappeared in the new version. Unsurprisingly, the revision has engendered some controversy. I should note that the difference is not really due to GISS itself, but rather to NOAA, from whom GISS gets the sea surface temperatures used.

Many people pointing to a pause look at the satellite temperature data from UAH, which starts in 1979. Below, I show it on the right, with the new GISS data from 1979 on the left, both in yearly (top) and monthly (bottom) forms:

Two things can be noted from these plots. First, the yearly UAH data (top right) can certainly be seen as showing roughly constant temperatures since somewhere between 1995 and 2000, apart from short-term variability. However, if one so wishes, one can also see it as showing a pretty much constant upward trend, again with short-term variability. Looking at the monthly UAH data (bottom right) gives a much stronger impression of a pause, since fitting a straight line to the monthly data leads to most points after about 2007 being under the line, while those before then back to about 2001 are mostly above the line, which is what one would expect if there is a pause at the end — see the plot below of the least-squares fitted line and its residuals:

The (new) GISS data also gives more of an impression of a slowdown with monthly rather than yearly data:

There are two issues with looking at monthly data, however. The first is that although both GISS and UAH data effectively have a seasonal adjustment — anomalies for each month are from a baseline for that month in particular — the seasonal effects actually vary over the years, introducing possible confusion. I’ll try fitting a model that handles this in a later post, but for now sticking to the yearly data avoids the problem. The second issue is that one can see a considerable amount of `autocorrelation’ in the monthly data. This brings us to the crucial question of what one should really be asking when considering whether there is a pause (or a slowdown) in the temperature data.

To some extent, talk of a `pause’ by lukewarmers is for rhetorical effect — look, no warming for 15 years! — as a counter to the rhetoric of the warmers — see how much the planet has warmed since 1880! — with such rhetoric by both sides being only loosely related to any valid scientific argument. However, one should try as much as possible to interpret both sides as making sensible arguments.

In this respect, note that the lukewarmers are certainly *not* claiming that the pause shows that although CO2 had a warming effect up until the year 2000, it stopped having a warming effect after 2000, so we don’t have to worry now. I doubt that anyone in the entire world believes such a thing (which is saying a lot considering what some people do believe).

Instead, the sensible lukewarmer interpretation of a `pause’ would be that the departures from the underlying trend in the temperature time series have a high degree of positive *autocorrelation* — that the departure from trend in one year is likely to be similar to the departures from trend of recent years. (Alternatively, some lukewarmers might think that there are deterministic or stochastic cycles, with periods of decades or more.) The effect of high autocorrelation is to make it harder to infer the magnitude of the true underlying trend from a relatively short series of observations.

The problem can be illustrated with simulated data sets, which I’ve arranged to look vaguely similar to the GISS data from 1955 to 2014 (though to avoid misleading anyone, I label the x-axis from 1 to 60 rather than 1955 to 2014).

I start by generating a series of 20000 values with high autocorrelation that will be added as residuals to a linear trend. I do this by summing a Gaussian series with autocorrelations that slowly decline to zero at lag 70, a slightly non-Gaussian series with autocorrelations that decline more quickly, and a series of independent Gaussian values. The R code is as follows:

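```r
set.seed(1)

n0 <- 20069   # 20000 values remain after filter() leaves 69 NAs at the ends

# Filter weights: slowly declining (fa) and quickly declining (fb), 70 each
fa <- c(1,0.95,0.9,0.8/(1:67)^0.8); fa <- fa/sum(fa)
fb <- exp(-(0:69)/2.0); fb <- fb/sum(fb)

# stats::filter applies the weights as a moving average; drop the NA values
xa <- filter(rnorm(n0),fa); xa <- xa[!is.na(xa)]   # Gaussian, slow decline
xb <- filter(rt(n0,5),fb); xb <- xb[!is.na(xb)]    # slightly non-Gaussian
xc <- rnorm(length(xb))                            # independent Gaussian

xresid <- 0.75*xa + 0.08*xb + 0.06*xc
```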

Here are the first 1500 values of this residual series:

Here are the autocorrelations estimated from the entire simulated residual series:

The `autocorrelation time’ shown above is one plus twice the sum of autocorrelations at lag 1 and up. It is the factor by which the effective sample size is less than it would be if the points were independent. With an autocorrelation time of 13 as above, for example, a data set of 60 points is equivalent to about 5 independent points.
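The definition translates directly into code. In the sketch below, the function name `autocorr_time` is made up for illustration, and the geometrically decaying autocorrelations are a textbook AR(1)-style example rather than the autocorrelations of the simulated series above.

```r
# Autocorrelation time: one plus twice the sum of autocorrelations at lags 1 and up
autocorr_time <- function(rho) 1 + 2 * sum(rho)

# Illustrative case: autocorrelation 0.5^k at lag k, for which the
# autocorrelation time approaches (1 + 0.5) / (1 - 0.5) = 3
rho <- 0.5^(1:200)
tau <- autocorr_time(rho)

# Effective sample size for the example in the text: 60 points, time of 13
n_eff <- 60 / 13

print(c(tau = tau, n_eff = n_eff))
```

With an autocorrelation time of 13, the 60 points carry about as much information as 60/13, or roughly 5, independent points.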

I then split this long residual series into chunks of length 60, to each of which I added a trend with slope 0.01, and then shifted it to have sample mean of zero. Here are the first twenty of the 333 series that resulted:

The slope of the least-squares fit line is shown above each plot. As one can see, some slope estimates are almost twice the underlying trend of 0.01, while other slopes are much less than the underlying trend. Here is the histogram of slope estimates from all 333 series of length 60, along with the lower bound of the 95% confidence interval for the slope, computed assuming no autocorrelation:

Ignoring autocorrelation results in the true slope of 0.01 being below the lower bound of the 95% confidence interval 24% of the time (ten times what should be the case).
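The splitting and fitting steps can be sketched as below. To keep the example self-contained it uses stand-in residuals from `arima.sim` (an AR(1) process with arbitrarily chosen parameters) rather than the `xresid` series constructed earlier, so the fraction it prints will not equal the 24% reported above. The names `slope_and_lower` and `frac_below` are illustrative.

```r
set.seed(2)
len  <- 60                      # length of each series
nser <- 333                     # number of series
x    <- 1:len

# Stand-in residuals (an assumption): AR(1) noise, not the series built above
resid_long <- as.numeric(arima.sim(list(ar = 0.9), n = len * nser, sd = 0.05))
chunks <- matrix(resid_long, nrow = len)       # one series per column

slope_and_lower <- function(r) {
  y <- r + 0.01 * x             # add the true trend with slope 0.01
  y <- y - mean(y)              # shift to have sample mean zero
  fit <- lm(y ~ x)
  c(slope = unname(coef(fit)["x"]),
    lower = confint(fit, "x", level = 0.95)[1])  # bound ignoring autocorrelation
}

res <- t(apply(chunks, 2, slope_and_lower))

# Fraction of series where the true slope lies below the 95% interval
frac_below <- mean(res[, "lower"] > 0.01)
print(c(mean_slope = mean(res[, "slope"]), frac_below = frac_below))
```

If the residuals were independent, `frac_below` would be about 0.025; positive autocorrelation makes the intervals too narrow, so the fraction is larger.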

What is even more worrying is that looking at the residuals from the regression often shows only mild autocorrelation. Here are the autocorrelation (and autocorrelation time) estimates for the first 20 series:

One can compare these estimates with the plot of true residual autocorrelation above, and the true autocorrelation time of 13.

To see the possible relevance of this simulation to global temperature data, here are old and new GISS global temperature anomaly series (from 1955), centred and relabeled as for the simulated series, along with simulated series B and L from above:

It is worrying that the GISS series do not appear much different from the simulated series, which substantially overestimate the trend.

The real significance of a `pause’ or `slowdown’ in temperatures is that it would be evidence of such high autocorrelation, whose physical basis could be internal variability in the climate system, or the influence of external factors that themselves exhibit autocorrelation. Looking for a `pause’ may not be the best way of assessing whether autocorrelation is a big problem. But direct estimation of long-lag autocorrelations from relatively short series is not an easy problem, and may be impossible without making strong prior assumptions regarding the form of the autocorrelation function.

Accordingly, I’ll now go back to looking at whether one can see a pause in the GISS and UAH temperature data, while keeping in mind that the point of this is to see whether high autocorrelation is a problem. I’ll look only at the yearly data, though as noted above, a pause or slowdown may be more evident in the monthly data.

Here are the old and new versions of the GISS data, from 1955 through 2014, with least-squares regression lines fitted separately to data before 1970, from 1970 to 2001, and after 2001. In the top plots, the fits are required to join up; in the bottom plots, there may be jumps as well as slope changes at 1970 and 2001.

In the two top plots, the estimated slopes after 2001 are smaller than the slopes from 1970 to 2001, but the differences are not statistically significant (p-values about 0.3, assuming independent residuals). In the bottom two plots, the slopes before and after 2001 differ substantially, with the differences being significant (p-values of 0.003 and 0.018, assuming independent residuals). However, one might wonder whether the abrupt jumps are physically plausible.
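Both kinds of fit can be written as ordinary least-squares regressions with different design matrices. The following Python sketch (not the R script linked from this post) uses hinge terms for slope changes and optional step terms for jumps. The function names, the choice of 1955 as the time origin, and the convention that a break year belongs to the later segment are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, y):
    """Coefficients minimising squared error, via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def design_row(year, breaks=(1970, 2001), jumps=False, origin=1955):
    """Basis for a piecewise-linear trend with slope changes at the break years."""
    row = [1.0, float(year - origin)]
    for b in breaks:
        row.append(float(max(0, year - b)))          # slope change at b, fit stays joined
    if jumps:
        for b in breaks:
            row.append(1.0 if year >= b else 0.0)    # jump in level at b
    return row

def segment_slopes(coef, n_breaks=2):
    """Slope within each segment: the base slope plus the accumulated changes."""
    slopes, total = [], 0.0
    for c in coef[1:2 + n_breaks]:
        total += c
        slopes.append(total)
    return slopes
```

Given lists `years` and `anomalies`, the joined-up fit is `least_squares([design_row(y) for y in years], anomalies)`, and passing `jumps=True` to `design_row` gives the fit that allows jumps.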

Next, let’s look at the UAH data, which starts in 1979, along with the (new) GISS data from that date for comparison, and again consider a change in slope and/or a jump in 2001:

Omitting the data from 1970 to 1978 decreases the pre-2001 slope of the GISS data, lessening the contrast with the post-2001 slope. For the UAH data, the difference in slopes before and after 2001 is quite noticeable. However, for the top UAH plot, the difference is not statistically significant (p-value 0.19, assuming independent residuals). For the bottom plot, the two-sided p-value is 0.08. Based on the comparison with the GISS data, however, one might think that both differences would have been significant if data back to 1970 had been available.

There is a `cherry-picking’ issue with all the above p-values, however. The selection of 2001 as the point where the slope changes was made by looking at the data. One could try correcting for this by multiplying the p-values by the number of alternative choices of year, but this number is not clear. In a long series one would expect the slope to change at other times as well, as indeed seems to have happened in 1970. One could try fitting a general model of multiple `change-points’, but this seems inappropriately elaborate, given that the entire exercise is a crude way of testing for long-lag autocorrelation.

I have, however, tried out a Bayesian analysis, comparing a model with a single linear trend, a model with a trend that changes slope at an unknown year (between 1975 and 2010), a model with both a change in slope and a jump (at an unknown year), and a model in which the trend is a constant apart from a jump (at an unknown year). I selected informative priors for all the parameters, as is essential when comparing models in the Bayesian way by marginal likelihood, and computed the marginal likelihoods (and posterior quantities) by importance sampling from the prior (a feasible method for this small-scale problem). See the R code linked to below for details.
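The idea behind importance sampling from the prior is that the marginal likelihood is the average of the likelihood over parameter values drawn from the prior. Here is a minimal Python sketch of that idea, separate from the linked R code. The single-trend model, the fixed noise standard deviation of 0.1, and the priors are hypothetical stand-ins, not the ones used for the results in this post.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def log_marginal_likelihood(log_lik, sample_prior, n_draws=20000, seed=1):
    """Estimate the log marginal likelihood as the log of the average
    likelihood over parameter values drawn from the prior."""
    rng = random.Random(seed)
    return log_mean_exp([log_lik(sample_prior(rng)) for _ in range(n_draws)])

def normal_log_density(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

# Hypothetical single-trend model: y[t] = a + b*t + noise.
# These priors are illustrative only.
def trend_prior(rng):
    return rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.02)   # intercept a, slope b

def make_trend_log_lik(y, noise_sd=0.1):
    def log_lik(params):
        a, b = params
        return sum(normal_log_density(v, a + b * t, noise_sd)
                   for t, v in enumerate(y))
    return log_lik

y = [0.01 * t for t in range(20)]                      # made-up data with slope 0.01
print(log_marginal_likelihood(make_trend_log_lik(y), trend_prior))
```

This simple estimator works only when a reasonable fraction of prior draws have non-negligible likelihood, which is why it is feasible for a small problem with informative priors.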

Here are the results of these four Bayesian models, shown as the posterior average trend lines:

In the last plot, note that the model has an abrupt step up at some year, but the posterior average shows a more gradual rise, since the year of the jump is uncertain. The log marginal likelihoods for the four models above are 16.0, 15.4, 15.7, and 14.4. If one were to (rather artificially) assume that these are the only four possible models, and that they have equal prior probabilities, the posterior probabilities of the four models would be 39%, 23%, 30%, and 9%.
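The conversion from log marginal likelihoods to posterior model probabilities is just a normalisation of the exponentiated values, weighted by the prior probabilities. A short Python sketch (the function name is mine):

```python
import math

def posterior_model_probabilities(log_marginal_likelihoods, prior=None):
    """Posterior probabilities of the models, assuming they are the only
    candidates. Equal prior probabilities are used unless `prior` is given."""
    k = len(log_marginal_likelihoods)
    prior = prior or [1.0 / k] * k
    m = max(log_marginal_likelihoods)          # subtract the maximum for stability
    weights = [p * math.exp(l - m) for p, l in zip(prior, log_marginal_likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]

probs = posterior_model_probabilities([16.0, 15.4, 15.7, 14.4])
print([round(100 * p) for p in probs])   # [40, 22, 30, 8]
```

Starting from the rounded log marginal likelihoods quoted above, this gives about 40%, 22%, 30%, and 8%. The one-point differences from the percentages in the text presumably arise because those were computed from unrounded values.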

I emphasize again that the exercise of looking for a `pause’ or `slowdown’ is really a crude way of looking for evidence of long-lag autocorrelation. The quantitative results should not be taken too seriously. Nevertheless, the conclusion I reach is that this data does not produce a definitive yes or no answer to whether there is a pause, even in the UAH data, for which a pause seems most evident. A few years more data might (or might not) be enough to make the situation clearer. Analysis of monthly data might also give a more definite result. Note, however, that `lack of definite evidence of a pause’ is not the same as `no pause’. It is not reasonable to assume a lack of long-lag autocorrelation absent definite evidence to the contrary, since the presence of such autocorrelation is quite plausible *a priori*.

In my previous post, I had said that this next post would examine two papers `debunking’ the pause, but it’s gotten too long already, so I’ll leave that for the post after this. I’ll then look at what can be learned by looking at monthly data, and by modeling some known effects on temperature (such as volcanic activity).

The results above can be reproduced by first downloading the data using this shell script (which downloads other data too, that I will use for later blog posts), or by manually downloading it from the URLs the script lists if you don’t have wget. You then need to download my R script for reading these files, and my R script for the above analysis (and rename them to .r from the .doc that wordpress requires). Finally, run the second script in R as described in its opening comments.

UPDATE: You’ll also need this R source file.

]]>I will focus on anthropogenic warming that results, via the mis-named `greenhouse effect’, from CO2 produced by burning fossil fuels. There are other human-generated `greenhouse gasses’, and other human influences on climate, such as changes in land use, but the usual estimates of their effects are smaller than that of CO2, and in any case, they would call for different policy responses than reducing fossil fuel consumption. Other possible anthropogenic influences are, however, a possible complication when trying to determine the effects of CO2 by looking at temperature data.

What I’ll call the `warmer’ view of the effect of CO2 is what is accepted (at least verbally) by most governments, and is more-or-less found in the reports of the Intergovernmental Panel on Climate Change (IPCC) — that burning of fossil fuels increases CO2 in the atmosphere, resulting in a global increase in temperatures large enough to have quite substantial harmful effects on humans and the environment. The contrasting `no-warmer’ view is that increases in CO2 cause little or no warming, either (implausibly) because CO2 has no warming effect, or (somewhat more plausibly) because strong negative feedbacks limit its effects. In between is the `lukewarmer’ view — CO2 has some warming effect, but it is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption. This is the predominant view at some `skeptical’ web sites such as Watts Up With That.

There is also the `extreme-warmer’ view, that the effects of CO2 will be so large as to `fry the planet’, leading to the extinction of humans, and perhaps all life. This view is surprisingly common among the general public, despite being utterly implausible. Of course, those who hold it are encouraged in this belief by alarmist papers such as `Mathematical Modelling of Plankton–Oxygen Dynamics Under the Climate Change‘ by Sekerci and Petrovskii, who apparently don’t understand that any arbitrary system of differential equations has a good chance of producing unstable behaviour, and that calling such a system a `model of a coupled plankton–oxygen dynamics’ does not make it a good model. It is very, very unlikely that life on earth would have lasted for over three billion years if the global ecosystem were really as unstable as is suggested in this paper.

The `warmer’ and `lukewarmer’ views are sufficiently plausible that it’s worth asking whether global temperature data has anything to say about which is closer to the truth. An alternative source of evidence is physical theory, embodied in computer simulations. Unfortunately, earth’s climate system is too complex to be simulated without various simplifications and approximations being made, so simulation cannot provide definitive answers, and must ultimately be checked against observations. Observations also have a rhetorical role, being potentially convincing to those who may put no trust in theory and simulation, but who naively think that measuring global temperature is a simple matter of reading thermometers.

Unfortunately, measuring global temperature is not so simple. Earth is a big place, with few observing stations, and every observing station is subject to biases from factors such as changes in the nature of its surroundings and in the time of day when observations are made. Measurements of temperature from space are indirect, and have potential biases from factors such as decaying satellite orbits. All time series of global temperatures are therefore the result of complex processing of raw data, whose appropriateness can be questioned.

It should come as no surprise to those aware of the political nature of this debate that supporters of the `warmer’ and `lukewarmer’ views tend to favour different global temperature datasets, which show different temperature trends in recent years. A favourite of the warmers is NASA’s GISS data, whose land-ocean version combines land temperature observations with sea surface temperature data. This data set was recently revised, with the new version showing a larger upward trend in temperature in recent years. The lukewarmers tend to favour the UAH data from satellite observations, also recently revised, with the new version showing a lower trend than before.

One should note that these two data sets are not measuring the same thing, or even trying to. GISS measures an ill-defined combination of water temperature near the top of the ocean and air temperature a few feet above the ground, in some variety of surroundings. UAH measures temperature in the lower part of the atmosphere, up to about 8000 metres above the surface. So it’s conceivable that the different trends in these two data sets both accurately reflect reality, though if so it’s hard to see how these different trends could continue indefinitely.

I’ll first show the monthly GISS global land-ocean temperatures (retrieved 2015-11-30) from 1880 to the end of 2014. (That’s when some other data I’ll be looking at ends; 2015 is so far mostly warmer than 2014.) These temperatures are expressed as `anomalies’ (in degrees Celsius) with respect to a base period (separately for each month of the year), since absolute values are meaningless given the arbitrary nature of what GISS is measuring. Here they are:

This graph is often portrayed (to the public) as convincing evidence that CO2 causes global warming. Look at that upward trend from about 1910! However, the rise from 1910 to 1940 can’t really be due to CO2. The direct warming effect of CO2 is generally accepted to be proportional to the logarithm of its concentration, with a doubling of CO2 producing roughly one degree Celsius of warming, which might be amplified (or diminished) by feedbacks. Here is a plot of the log base 2 of CO2 over the period above (data from here):

The increase from 1910 to 1940 is only about 0.05, which even with a generous factor of four allowance for positive feedback would give only 0.2 degrees Celsius of warming, compared to the warming of about 0.5 degrees in the GISS data. And if the 1910-1940 warming was really due to CO2, the warming from 1970-2000 should have been even greater than it was. Furthermore, part of the effect of CO2 is expected to be delayed by decades, making it an even less likely explanation of the 1910-1940 warming, since CO2 is thought to have been more-or-less constant before 1880.
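The arithmetic behind the 0.2 degree figure is spelled out in this small Python sketch. The function names are mine, and the defaults simply encode the round numbers used above (one degree per doubling before feedbacks).

```python
import math

def log2_increase(c_start, c_end):
    """Change in log base 2 of CO2 concentration between two times."""
    return math.log2(c_end / c_start)

def warming_from_log2_increase(delta_log2_co2, degrees_per_doubling=1.0,
                               feedback_factor=1.0):
    """Warming in degrees Celsius for a response proportional to the log of CO2."""
    return delta_log2_co2 * degrees_per_doubling * feedback_factor

# A 0.05 increase in log2(CO2), with a generous factor of four for feedbacks
print(warming_from_log2_increase(0.05, feedback_factor=4.0))   # 0.2
```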

Clearly, there are other influences on temperature than CO2. Once one realizes this, the upward temperature trend from 1970 to 2000 becomes less convincing as evidence of a warming effect of CO2. Furthermore, since CO2 has been increasing pretty much monotonically for over a hundred years, it is highly confounded with everything else that has been increasing over that period, as well as with long-period cycles. So any really persuasive argument regarding the effect of CO2 must be based on physical theory and on more detailed measurements that can confirm the effects of CO2 at a greater level of detail than a simple global average of temperature. This is the subject of `attribution’ studies, the critique of which is beyond the scope of this blog post (and beyond my expertise).

Nevertheless, there seems to be value in trying to better understand the global temperature data, partly as a `sanity check’ on claims based on more complex, and perhaps more questionable, analyses, and also to see whether there is any evidence of the data being wrong.

To lukewarmers, an aspect of the data that provides evidence of other factors being comparable in importance to CO2 is the `pause’ in warming (or at least a `slowdown’) that one can visually see in the plot above from about 2002. For a closer look, here is the same GISS data, but going back only to 1979:

The UAH satellite temperature data starts in 1979, so we can now compare with it (version 6.0beta4, downloaded 2015-11-30):

The base period for the anomalies in the UAH plot is different from GISS, so only the changes are comparable. (I’ve made the vertical scales match in that respect.)

Both data sets seem visually to show a slowdown or `pause’ around 2002, with this being more prominent in the UAH data (in which one might see the pause as going back as far as 1995). To lukewarmers, the significance of this pause is not that global warming has stopped, showing that CO2 has no effect, since they think that CO2 does have at least some small effect. Rather, they see it as evidence that other effects are large, sometimes large enough to cancel any underlying warming trend from CO2, and sometimes making any such trend appear larger than it actually is — and hence the warming in the 1970-2000 period cannot be taken as indicative of the magnitude of the warming due to CO2, or of what to expect in future.

As alluded to above, simple linear least squares fits to the GISS and UAH data for 1979-2014 show a greater trend for GISS (1.59 degrees C per century) than for UAH (1.12 degrees C per century). But if there is actually a change around 2002, a single trend line is of course largely meaningless.

Reactions to the `pause’ (or `hiatus’) from the warmer camp have taken several forms:

1. Claims that the pause is an artifact of poorly adjusted temperature measurements, that disappears when adjustments are done properly.
2. Claims that the visual appearance of a pause is deceiving: that the `pause’ is just chance variation, which the human eye overinterprets.
3. Claims that if one subtracts changes due to known effects, such as volcanic eruptions, the pause disappears, showing that the underlying trend due to CO2 continues unabated. (Note that depending on the size of the underlying trend that is revealed, this would not necessarily be contrary to lukewarmer views.)
4. Claims that warming from CO2 continues at a substantial rate, but that the heat is going somewhere that escapes measurement in global temperature data sets.

I will leave claims in category (4) for others to critique.

Claims in category (3) include a blog post by `tamino’. I plan to present my own analysis of this sort in a future blog post, and compare to that of `tamino’.

Two recent papers making claims in category (2) are `Debunking the climate hiatus‘, by Rajaratnam, Romano, Tsiang, and Diffenbaugh, and `On the definition and identifiability of the alleged “hiatus” in global warming‘, by Lewandowsky, Risbey, and Oreskes. Both of these papers look at (or say they look at) the GISS land-ocean temperature data, displayed above, but before the recent revision. I plan to comment on these papers in my next blog post.

Regarding (1), the GISS temperatures displayed above show a less prominent `pause’ than the version of GISS land-ocean temperatures distributed prior to July 2015 (obtained from the wayback machine’s version of 2015-04-18, stored here), which is shown below:

The revision results in a greater upward trend during the `pause’ period, as shown by the following plot of differences (with enlarged vertical scale):

To tell whether or not this revision was justified, one would need to examine in depth the temperature adjustments done for the GISS data set, which I haven’t done.

However, it’s not too hard to see some interesting things by examining the GISS land-ocean temperature data in more detail. I’ll look only at the most recent version (accessed 2015-11-30).

First, one can look separately at the Northern Hemisphere:

and Southern Hemisphere:

The difference is rather striking. One would expect some overall difference due to the greater amount of ocean in the Southern Hemisphere, and the different nature of the polar regions. But that doesn’t explain the abrupt increase in the scatter of Southern Hemisphere data points after about 1955.

We can also look at each month of the year separately. Here’s the Northern Hemisphere:

And here’s the Southern Hemisphere:

In the Northern Hemisphere, variability is obviously greater in winter than in summer. The variability in the Southern Hemisphere winter seems slightly greater than in summer, but much less so than in the Northern Hemisphere. These are differences that I’ll take account of when modeling this data later.

I’ve marked 1955 by a short line at the bottom. In the Northern Hemisphere, the dip in January temperatures from 1955 to 1975 seems odd, since it doesn’t show up in December and February, but it’s hard to be sure that it’s not a real climatic effect. Something does happen around 1955 in the Southern Hemisphere plots, which increases the variance in May and August, and maybe June, July, and September. This can be confirmed by looking at plots for each of the 12 months of the year that show the difference of the anomaly for that month from the average anomaly for that month in the three preceding and three following years:

May through September seem to have higher variability in the years after 1955, and this is very clear for at least May and August. In contrast, similar plots for the Northern Hemisphere show no change in variance, or perhaps a slight decline after 1955 for May and June. It’s hard to see how this Southern Hemisphere variance change can reflect a real change in climate, given its abrupt onset, and that it does not appear in the Northern Hemisphere. More likely, it is an artifact of how the data is processed. A rapid improvement in quality of measurements after World War II might also be a possible explanation (though one would expect that to lead to less variability, rather than more).
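The quantity plotted for each month can be computed as in the Python sketch below, which also compares its variance before and after a chosen year. This is my own illustration rather than the linked R script, and the function names and the handling of the years at each end of the series (simply dropped) are assumptions.

```python
def local_deviations(values, window=3):
    """For each year with `window` years available on both sides, the value
    minus the average of the `window` preceding and `window` following values."""
    out = []
    for i in range(window, len(values) - window):
        neighbours = values[i - window:i] + values[i + 1:i + window + 1]
        out.append(values[i] - sum(neighbours) / len(neighbours))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def variance_before_and_after(years, values, split_year=1955, window=3):
    """Variance of the local deviations for years before and from split_year.
    `years` and `values` are lists holding one month's anomalies, one per year."""
    devs = local_deviations(values, window)
    dev_years = years[window:len(years) - window]
    before = [d for y, d in zip(dev_years, devs) if y < split_year]
    after = [d for y, d in zip(dev_years, devs) if y >= split_year]
    return variance(before), variance(after)
```

Subtracting the local average removes slow trends, so that a change in the variance of these deviations reflects a change in year-to-year scatter rather than a change in trend.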

Whatever the reason, it seems that relying on GISS data before 1955 might be unwise. In my later analyses, I will look at data only from 1959, since that is when some other related data sets begin, or from 1979 when comparing to the UAH data.

I note that obtaining all but the most recent GISS data is difficult. Some versions can be accessed at the wayback machine, but many versions apparently saved there produce an ‘access denied’ error. UAH has an extensive archive, but even it seems not to have all the versions that were distributed. GISS distributes the programs they use, but only the current version. I can’t find any programs at the UAH website. Both GISS and UAH ought to have a public repository that uses a source-code control system such as git, which would allow all versions of programs, raw data, and processed data to be accessed, with documentation of all changes.

To reproduce the results in this post, you will first need to download the data using this shell script (which downloads other data too, that I will use for later blog posts), or manually download from the URLs it lists if you don’t have wget. You then need to download my R script for reading these files, and my R script for making the plots (and rename them to .r from the .doc that wordpress requires). Finally, run the second script in R as described in its opening comments.

]]>