One major change with this version is that pqR, which was based on R-2.15.0, is now compatible with R-2.15.1. This allows for an increased number of packages in the pqR repository.

This release also has some significant speed improvements, a new form of the “for” statement, for conveniently iterating across columns or down rows of a matrix, and a new, less error-prone way for C functions to “protect” objects from garbage collection. There are also a few bug fixes (including fixes for some bugs that are also in the current R core release).

You can read more in the NEWS file, and get it from pqR-project.org.

Currently, pqR is distributed in source form only, and so you need to be comfortable compiling it yourself. It has been tested on Linux/Unix systems (with Intel/AMD, ARM, PowerPC, and SPARC processors), on Mac OS X (including macOS Sierra), and on Microsoft Windows (XP, 7, 8, 10) systems.

I plan to soon put up posts with more details on some of the features of this and the previous pqR release, as well as a post describing some of my future plans for pqR.

]]>Click on image for larger version.

Nikon FG, Nikkor AF-D 35mm 1:2 lens, Kodak Portra 160 film, Nikon Coolscan V.

]]>These language extensions make it easier to write reliable programs, that work even in edge cases, such as data sets with one observation.

In particular, the extensions fix the problems that `1:n` doesn’t work as intended when `n` is zero, and that `M[1:n,]` is a vector rather than a matrix when `n` is one, or when `M` has only one column. Since changing the “`:`” operator would cause too many problems with existing programs, pqR introduces a new “`..`” operator for generating increasing sequences. Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.

The new release, pqR-2016-06-24, is available at pqR-project-org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).

I’ve written about these design flaws in R before, here and here (and for my previous ideas on a solution, now obsolete, see here). These design flaws have been producing unreliable programs for decades, including bugs in code maintained by R Core. It is long past time that they were fixed.

It is crucial that the fixes make the *easy* way of writing a program also be the *correct* way. This is not the case with previous “fixes” like the `seq_len` function, and the `drop=FALSE` option, both of which are clumsy, as well as being unknown to many R programmers.

Here’s an example of how the new `..` operator can be used:

for (i in 2..nrow(M)-1) for (j in 2..ncol(M)-1) M[i,j] <- 0

This code sets all the elements of the matrix `M` to zeros, except for those on the edges — in the first or last row or column.

If you replace the “`..`” operators above with “`:`“, the code will not work, because “`:`” has higher precedence than “`-`“. You need to write `2:(nrow(M)-1)`. This is a common error, which is avoided with the new “`..`” operator, which has lower precedence than the arithmetic operators. Fortunately the precedence problem with “`:`” is mostly just an annoyance, since it leads to the program not working at all, which is usually obvious.

The more insidious problem with writing the code above using “`:`” is that, after fixing the precedence problem, the result will work *except* when the number of rows or the number of columns in `M` is less than three. When `M` has two rows, `2:(nrow(M)-1)` produces a sequence of length two, consisting of 2 and 1, rather than the sequence of length zero that is needed for this code to work correctly.

This could be fixed by prefixing the code segment with

if (nrow(M)>2 && ncol(M)>2)

But this requires the programmer to realize that there is a problem, and to not be lazy (with the excuse that they don’t intend to ever use the code with small matrices). And of course the problems with “`:`” cannot in general be solved with a single check like this.

Alternatively, one could write the program as follows:

for (i in 1+seq_len(nrow(M)-2)) for (j in 1+seq_len(ncol(M)-2)) M[i,j] <- 0

I hope readers will agree that this is not an ideal solution.

Now let’s consider the problems with R dropping dimensions from matrices (and higher-dimensional arrays). Some of these stem from R usually not distinguishing a scalar from a vector of length one. Fortunately, R actually can distinguish these, since a vector can have a `dim` attribute that explicitly states that it is a one-dimensional array. Such one-dimensional arrays are presently uncommon, but are easily created — if `v` is any vector, `array(v)` will be a one-dimensional array with the same contents. (Note that it will print like a plain vector, though `dim(array(v))` will show the difference.)

So, the first change in pqR to address the dimension dropping problem is to not drop a dimension of size one if its subscript is a one-dimensional array (excluding logical arrays, or when `drop=TRUE` is stated explicitly). Here’s an example of how this now works in pqR:

> M <- matrix(1:12,3,4) > M [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > r <- c(1,3) > c <- c(2,4) > M[r,c] [,1] [,2] [1,] 4 10 [2,] 6 12 > c <- 3 > M[r,c] [1] 7 9 > M[array(r),array(c)] [,1] [1,] 7 [2,] 9

The final command above is the one which now acts differently, not dropping the dimensions even though there is only one column, since `array(c)` is an explicit one-dimensional vector. The use of `array(r)` similarly guards against only one row being selected, though that has no effect above, where `r` is of length two.

In this situation, the same result could be obtained with similar ease using `M[r,c,drop=FALSE]`. But `drop=FALSE` applies to every dimension, which is not always what is needed for higher-dimensional arrays. For example, in pqR, if `A` is a three-dimensional array, `A[array(u),1,array(v)]` will now select the slice of `A` with second subscript 1, and always return a matrix, even if `u` or `v` happened to have length one. There is no other convenient way of doing this that I know of.

The power of this feature becomes much greater when combined with the new “`..`” operator, which is defined to return a sequence that is a one-dimensional array, rather than a plain vector. Here’s how this works when continuing the example above:

> n <- 2 > m <- 3 > M[1..n,1..m] [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 > m <- 1 > M[1..n,1..m] [,1] [1,] 1 [2,] 2 > n <- 0 > M[1..n,1..m] [,1] >

Note how `M[1..n,1..m]` is guaranteed to return a matrix, even if `n` or `m` is one. A matrix with zero rows or columns is also returned when appropriate, due to the “`..`” operator being able to produce a zero-length vector. To get the same effect without the “`..`” operator, one would need to write

M [seq_len(n), seq_len(m), drop=FALSE]

It gets worse if you want to extract a subset that doesn’t start with the first row and first column — the simplest equivalent of `M[a..b,x..y]` seems to be

M [a-1+seq_len(b-a+1), x-1+seq_len(y-x+1), drop=FALSE]

I suspect that not many R programmers have been writing code like this, which means that a lot of R programs don’t quite work correctly. Of course, the solution is not to berate these programmers for being lazy, but instead to make it easy to write correct code.

Dimensions can also get dropped inappropriately when an empty subscript is used to select all the rows or all the columns of a matrix. If this dimension happens to be of size one, R will reduce the result to a plain vector. Of course, this issue can be combined with the issues above — for example, `M[1:n,]` will fail to do what is likely intended if `n` is zero, or if `n` is one, or if `M` has only one column.

To solve this problem, pqR now allows “missing” arguments to be specified with an underscore, rather than by leaving the argument empty. The subscripting operator will not drop a dimension with an underscore subscript (unless `drop=TRUE` is specified explicitly). With this extension, along with “`..`“, one can rewrite `M[1:n,]` as `M[1..n,_]`, which will always do the right thing.

Note that it is unfortunately probably not feasible to just never drop a dimension with a missing argument, since there is likely too much existing code that relies on the current behaviour (though there is probably even more code where the existing behaviour produces bugs). Hence the creation of a new way to specify a missing argument. A more explicit “missing” indicator may be desirable anyway, as it seems more readable, and less error-prone, than nothing at all.

It may also be infeasible to extend the rule of not dropping dimensions indexed by one-dimensional arrays to logical subscripts — when `a` and `b` are one-dimensional arrays, `M[a==0,b==0]` may be intended to select a single element of `M`, not to return a 1×1 matrix — though one-dimensional arrays are rare enough at present that maybe one could get away with this.

The new “`..`” operator does break some existing code. In order that “`..`” can conveniently be used without always putting spaces around it, pqR now prohibits names from containing consecutive dots, except at the beginning or the end. So `i..j` is no longer a valid name (unless quoted with backticks), although `..i..` is still valid (but not recommended). With this restriction, most uses of the “`..`” operator are unambiguous, though there are exceptions, such as `i..(x+y)`, which is a call of the function `i..`, and `i..-j`, which computes `i..` minus `j`. There would be no ambiguities at all if consecutive dots were allowed only at the beginning of names, but unfortunately the ggplot2 package uses names like `..count..` in its API (not just internally).

Also, `..` is now a reserved word. This is not actually necessary to avoid ambiguity, but not making it reserved seems error-prone, since many typos would be valid syntax, and fetching from `..` would not even be a run-time error, since it is defined as a primitive. A number of CRAN packages use `..` as a name, but almost all such uses are typos, with `...` being what was intended (many such uses are copied from an example with a typo in `help(make.rgb)`).

To accommodate packages with incompatible uses of “`..`“, there is an option to disabling parsing of “`..`” as an operator, allowing packages written without using this new extensions to still be installed.

The new pqR also has other new features, including a new version of the “for” statement. Implementation of these new language features is made possible by the new parser that was introduced in pqR-2015-09-14, which has other advantages as well. I plan to write blog posts on these topics soon.

]]>As I discussed in that post, the significance of a pause in warming since around 2000, after a period of warming from about 1970 to 2000, would be to show that whatever the warming effect of CO2, other factors influencing temperatures can be large enough to counteract its effect, and hence, conversely, that such factors could also be capable of enhancing a warming trend (eg, from 1970 to 2000), perhaps giving a misleading impression that the effect of CO2 is larger than it actually is. To phrase this more technically, a pause, or substantial slowdown, in global warming would be evidence that there is a substantial degree of positive autocorrelation in global temperatures, which has the effect of rendering conclusions from apparent temperature trends more uncertain.

Whether you see a pause in global temperatures may depend on which series of temperature measurements you look at, and there is controversy about which temperature series is most reliable. In my previous post, I concluded that even when looking at the satellite temperature data, for which a pause seems most visually evident, one can’t conclude definitely that the trend in yearly average temperature actually slowed (ignoring short-term variation) in 2001 through 2014 compared to the period 1979 to 2000, though there is also no definite indication that the trend has not been zero in recent years.

Of course, I’m not the only one to have looked at the evidence for a pause. In this post, I’ll critique a paper on this topic by Bala Rajaratnam, Joseph Romano, Michael Tsiang, and Noah S. Diffenbaugh, Debunking the climate hiatus, published 17 September 2015 in the journal Climatic Change. Since my first post in this series, I’ve become aware that `tamino’ has also commented on this paper, here, making some of the same points that I will make. I’ll have more to say, however, some of which is of general interest, apart from the debate on the `pause’ or `hiatus’.

First, a bit about the authors of the paper, and the journal it is published in. The authors are all at Stanford University, one of the world’s most prestigious academic institutions. Rajaratnam is an Assistant Professor of Statistics and of Environmental Earth System Science. Romano is a Professor of Statistics and of Economics. Diffenbaugh is an Associate Professor of Earth System Science. Tsiang is a PhD student. Climatic Change appears to be a reputable refereed journal, which is published by Springer, and which is cited in the latest IPCC report. The paper was touted in popular accounts as showing that the whole hiatus thing was mistaken — for instance, by Stanford University itself.

You might therefore be surprised that, as I will discuss below, this paper is completely wrong. Nothing in it is correct. It fails in every imaginable respect.

To start, here is the data they analyse, taken from plots in their Figure 1:

The second plot is a closeup of data from the first plot, for years from 1998 to 2013.

Rajaratnam, et al. describe this data as “the NASA-GISS global mean land-ocean temperature index”, which is a commonly used data set, discussed in my first post in this series. However, the data plotted above, and which they use, is not actually the GISS land-ocean temperature data set. It is the GISS land-only data set, which is less widely used, since as GISS says, it “overestimates trends, since it disregards most of the dampening effects of the oceans”. They appear to have mistakenly downloaded the wrong data set, and not noticed that the vertical scale on their plot doesn’t match plots in other papers showing the GISS land-ocean temperature anomalies. (They also apply their methods to various other data sets, claiming similar results, but only results from this data are shown in the paper.)

GISS data sets continually change (even for past years), and I can’t locate the exact version used in this paper. For the 1998 to 2013 data, I manually digitized the plot above, obtaining the following values:

`1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
0.84 0.59 0.57 0.68 0.80 0.78 0.69 0.88 0.78 0.86 0.65 0.79 0.93 0.78 0.76 0.82
`

In my analyses below, I will use these values when only post-1998 data is relevant, and otherwise use the closest matching GISS land-only data I can find.

Before getting into details, we need to examine what Rajaratnam, et al. think are the questions that need to be addressed. In my previous post, I interpreted debate about a `pause’, or `hiatus’, or `slowdown’ as really being about the degree of long-term autocorrelation in the temperature series. If various processes unrelated to CO2 affect temperatures for periods of decades or more, judging the magnitude of the effect of CO2 will be more difficult, and in particular, the amount of recent warming since 1970 might give a misleadingly high impression of how much effect CO2 has on temperature. A hiatus of over a decade, following a period of warming, while CO2 continues to increase, would be evidence that such other effects, operating on long time scales, can be large enough to temporarily cancel the effects of CO2.

Rajaratnam, et al. seem to have some awareness that this is the real issue, since they say that the perceived hiatus has “inspired valuable scientific insight into the processes that regulate decadal-scale variations of the climate system”. But they then ignore this insight when formulating and testing four hypotheses, each of which they see as one possible way of formalizing a claim of a hiatus. In particular, they emphasize that they “attempt to properly account for temporal dependence”, since failure to do so can “lead to erroneous scientific conclusions”. This would make sense if they were accounting only for short-term auto-correlations. Accounting for long-term autocorrelation makes no sense, however, since the point of looking for a pause is that it is a way of looking for evidence of long-term autocorrelation. To declare that there is no pause on the grounds that it is not significant once long-term autocorrelation is accounted for would miss the entire point of the exercise.

So Rajaratnam, Romano, Tsiang, and Diffenbaugh are asking the wrong questions, and trying to answer them using the wrong data. Let’s go on to the details, however.

The first hypothesis that Rajaratnam, et al. test is that temperature anomalies from 1998 to 2013 have zero or negative trend. Here are the anomalies for this period (data shown above), along with a trend line fit by least-squares, and a horizontal line fit to the mean:

Do you think one can confidently reject the hypothesis that the true trend during this period is zero (or negative) based on these 16 data points? I think one can get a good idea with only a few seconds of thought. The least-squares fit must be heavily influenced by the two low points at 1999 and 2000. If there is actually no trend, the fact that both these points are in the first quarter of the series would have to be ascribed to chance. How likely is that? The answer is 1/4 times 1/4, which is 1/16, which produces a one-sided p-value over 5%. As Rajaratnam, et al. note, the standard regression t-test of the hypothesis that the slope is zero gives a two-sided p-value of 0.102, and a one-sided p-value of 0.051, in rough agreement.

Yet Rajaratnam, et al. conclude that there is “overwhelming evidence” that the slope is positive — in particular, they reject the null hypothesis based on a two-sided p-value of 0.019 (though personally I would regard this as somewhat less than overwhelming evidence). How do they get the p-value to change by more than a factor of five, from 0.102 to 0.019? By accounting for autocorrelation.

Now, you may not see much evidence of autocorrelation in these 16 data points. And based on the appearance of the rest of the series, you may think that any autocorrelation that is present will be positive, and hence make any conclusions *less* certain, not *more* certain, a point that the authors themselves note on page 24 of the supplemental information for the paper. What matters, however, is autocorrelation in the residuals — the data points minus the trend line. Of course, we don’t know what the true trend is. But suppose we look at the residuals from the least-squares fit. We get the following autocorrelations (from Fig. 8 in the supplemental information for the paper):

As you can see, the autocorrelation estimates at lags 1, 2, and 3 are negative (though the dotted blue lines show that the estimates are entirely consistent with the null hypothesis that the true autocorrelations at lags 1 and up are all zero).

Rajaratnam, et al. first try to account for such autocorrelation by fitting a regression line using an AR(1) model for the residuals. I might have thought that one would do this by finding the maximum likelihood parameter estimates, and then employing some standard method for estimating uncertainty such as looking at the observed information matrix. But Rajaratham, et al. instead find estimates using a method devised by Cochrane and Orcutt in 1949, and then estimate uncertainty by using a block bootstrap procedure. They report a two-sided p-value of 0.075, which of course makes the one-sided p-value significant at the conventional 5% level.

In the supplemental information (page 24), one can see that the reported p-value of 0.075 was obtained using a block size of 5. Oddly, however, a smaller p-value of 0.005 was obtained with a block size of 4. One might suspect that the procedure becomes more dubious with larger block sizes, so why didn’t they report the more significant result with the smaller block size (or perhaps report the p-value of 0.071 obtained with a block size of 3)?

The paper isn’t accompanied by code that would allow one to replicate its results, and I haven’t tried to reproduce this method myself, partly because they seem to regard it as inferior to their final method.

This final method of testing the null hypothesis of zero slope uses a circular block bootstrap without an AR(1) model, and yields the two-sided p-value of 0.019 mentioned above, which they regard as overwhelming evidence against the slope being zero (or negative). I haven’t tried reproducing their result exactly. There’s no code provided that would establish exactly how their method works, and exact reproduction would anyway be impossible without knowing what random number seed they used. But I have implemented my interpretation of their circular bootstrap method, as well as a non-circular block bootstrap, which seems better to me, since considering this series to be circular seems crazy.

I get a two-sided p-value of 0.023 with a circular bootstrap, and 0.029 with a non-circular bootstrap, both using a blocksize of 3, which is the same as Rajaratnam, et al. used to get their p-value of 0.019 (supplementary information, page 26). My circular bootstrap result, which I obtained using 20000 bootstrap replications, is consistent with their result, obtained with 1000 bootstrap replications, because 1000 replications is not really enough to obtain an accurate p-value (the probability of getting 0.019 or less if the true p-value is 0.023 is 23%).

So I’ve more-or-less confirmed their result that using a circular bootstrap on residuals of a least-squares fit produces a p-value indicative of a significant positive trend in this data. But you shouldn’t believe that this result is actually *correct*. As I noted at the start, just looking at the data one can see that there is no significant trend, and that there is no noticeable autocorrelation that might lead one to revise this conclusion. Rajaratam, et al. seem devoid of such statistical intuitions, however, concluding instead that “applying progressively more general statistical techniques, the scientific conclusions have progressively strengthened from “not significant,” to “significant at the 10 % level,” and then to “significant at the 5 % level.” It is therefore clear that naive statistical approaches can possibly lead to erroneous scientific conclusions.”

One problem with abandoning such “naive” approaches in favour of complex methods is that there are many complex methods, not all of which will lead to the same conclusion. Here, for example, are the results of testing the hypothesis of zero slope using a simple permutation test, and permutation tests based on permuting blocks of two successive observations (with two possible phases):

The simple permutation test gives the same p-value of 0.102 as the simple t-test, and looking at blocks of size two makes little difference. And here are the results of a simple bootstrap of the original observations, not residuals, and the same for block bootstraps of size two and three:

We again fail to see the significant results that Rajaratnam, et al. obtain with a circular bootstrap on residuals.

I also tried three Bayesian models, with zero slope, unknown slope, and unknown slope and AR(1) residuals. The log marginal likelihoods for these models were 11.2, 11.6, and 10.7, so there is no strong evidence as to which is better. The two models with unknown slope gave posterior probabilities of the slope being zero or negative of 0.077 and 0.134. Interestingly, the posterior distribution of the autoregressive coefficient in the model with AR(1) residuals showed a (slight) preference for a positive rather than a negative coefficient.

So how do Rajaratnam, et al. get a significant non-zero slope when all these other methods don’t? To find out, I tested my implementations of the circular and non-circular residual bootstrap methods on data sets obtained by randomly permuting the observations in the actual data set. I generated 200 permuted data sets, and applied these methods (with 20000 bootstrap samples) to each, then plotted a histogram of the 200 p-values produced. I also plotted the lag 1 autocorrelations for each series of permuted observations and each series of residuals from the least-squares fit to those permuted observations. Here are the results:

Since the observations are randomly permuted, the true trend is of course zero in all these data sets, so the distribution of p-values for a test of a zero trend ought to be uniform over (0,1). It actually has a pronounced peak between 0 and 0.05, which amplifies the probability of a `significant’ result by a factor of more than 3.5. If one adjusts the p-values obtained on the real data for this non-uniform distribution, the adjusted p-values are 0.145 and 0.130 for the non-circular and circular residual bootstrap methods. The `significant’ results of Rajaratnam, et al. are a result of the method they use being flawed.

One possible source of the wrong results can be seen in the rightmost plot above. On randomly permuted data, the lag-1 autocorrelations are biased towards negative values, more so for the residuals than for the observations themselves. This effect should be negligible for larger data sets. Using a block bootstrap method with only 16 observations may be asking for trouble.

Finally, if one looks at the context of these 16 data points, it becomes clear why there are two low temperature anomaly values in 1999 and 2000, and consequently a positive (albeit non-significant) trend in the data. Here are the values of the Multivariate ENSO Index over the last few decades:

Note the peak index value in the El Nino year of 1998, which explains the relatively high temperature anomaly that year, and the drop for the La Nina years thereafter, which explains the low temperature anomalies in 1999 and 2000. The ENSO index shows no long-term trend, and is probably not related to any warming from CO2. Any trend based on these two low data points being near the beginning of some time period is therefore meaningless for any discussion of warming due to CO2.

The second hypothesis that Rajaratnam, et al. test is that the trend from 1998 to 2013 is at least as great as the trend from 1950 to 1997. Rejection of this hypothesis would support the claim of a `slowdown’ in global warming in recent years. They find estimates and standard errors (assuming independent residuals) for the slopes in separate regressions for 1950-1997 and for 1998-2013, which they then combine to obtain a p-value for this test. The (one-sided) p-value they obtain is 0.210. They therefore claim that there is no statistically signficant slowdown in the recent trend (since 1998).

I can’t replicate this result exactly, since I can’t find the exact data set they used, but on what seems to be a similar version of the GISS land-only data, I get a one-sided p-value of 0.173, which is similar to their result.

This result is meaningless, however. As one can see in the plot from their paper shown above, there was a `pause’ in the temperature from 1950 to 1970, which makes the slope from 1950 to 1997 be less than it was in the decades immediately before 1998. Applying the same procedure but with the first period being from 1970 to 1997, I obtain a one-sided p-value of 0.027, which one might regard as statistically significant evidence of a slowdown in the trend.

Rajaratnam, et al. are aware of the sensitivity of their p-value to the start date of the first period, since in a footnote on page 27 of the supplemental information for the paper they say

Changing the reference period from 1950-1997 to 1880-1997 only strengthens the null hypothesis of no difference between the hiatus period and before. This follows from the fact that the trend during 1880-1997 is more similar to the trend in the hiatus period. Thus the selected period 1950-1997 can be regarded as a lower bound on p-values for tests of difference in slopes.

Obviously, it is not actually a lower bound, since the 1970-1997 period produces a lower value.

Perhaps they chose to start in 1950 because that is often considered the year by which CO2 levels had increased enough that one might expect to see a warming effect. But clearly (at least in this data set) there was no warming until around 1970. One might consider that only in 1970 did warming caused by CO2 actually start, in which case comparing to the trend from 1950 to 1997 (let alone 1880 to 1997) is deceptive. Alternatively, one might think that CO2 did have an effect starting in 1950, in which case the lack of any increase in temperatures from 1950 to 1970 is an early instance of a `pause’, which *strengthens*, rather than weakens, the claim that the effect of CO2 is weak enough that other factors can sometimes counteract it, and at other times produce a misleadingly large warming trend that should not all be attributed to CO2.

Rajaratnam, et al. also test this second hypothesis another way, by seeing how many 16-year periods starting after 1950 and ending before 1998 have a trend at least as small as the trend over 1998-2013. The results are summarized in these plots, from their Fig. 2:

Since 13 of these 33 periods have lower trends than that of 1998-2013, they arrive at a p-value of 13/33=0.3939, and conclude that “the observed trend during 1998–2013 does not appear to be anomalous in a historical context”.

However, in the supplemental information (page 29) they say

Having said this, from Figure 2 there is a clear pattern in the distribution of 16 year linear trends over time: all 16 year trends starting at 1950 all the way to 1961 are lower than the trend during hiatus period, and all 16 year linear trends starting at years 1962 all the way to 1982 are higher than the trend during the hiatus period, with the exception of the 1979-1994 trend.

That is the end of their discussion. It does not appear to have occurred to them that they should use 1970 as the start date, which would produce a p-value of 1/13=0.077, which would be at least some weak evidence of a slowdown starting in 1998.

The third hypothesis tested by Rajaratnam, et al. is that the expected global temperature is the same at every year from 1998 on — that global warming has `stalled’. Of course, as they briefly discuss in the supplemental information (page 31), this is the same as their first hypothesis — that the global temperature has a linear trend with slope zero from 1998 on. Their discussion of this hypothesis, and why it isn’t really the same as the first, makes no sense to me, so I will say no more about it, except to note that strangely no numerical p-values are given for the tests of this hypothesis, only `reject’ or `retain’ descriptions.

The fourth, and final, hypothesis tested in the paper is that the distribution of year-to-year differences in temperature is the same after 1998 as during the period from 1950 to 1998. Here also, their discussion of their methods is confusing and incomplete, and again no numerical p-values are given. It is clear, however, that their methods suffer from the same unwise choice of 1950 as the start of the comparison period as was the case for their test of the second hypothesis. Their failure to find a difference in the distribution of differences in the 1998-2013 period compared to the 1950-1997 period is therefore meaningless.

Rajaratnam, Romano, Tsiang, and Diffenbaugh conclude by summarizing their results as follows:

Our rigorous statistical framework yields strong evidence against the presence of a global warming hiatus. Accounting for temporal dependence and selection effects rejects — with overwhelming evidence — the hypothesis that there has been no trend in global surface temperature over the past ≈15 years. This analysis also highlights the potential for improper statistical assumptions to yield improper scientific conclusions. Our statistical framework also clearly rejects the hypothesis that the trend in global surface temperature has been smaller over the recent ≈ 15 year period than over the prior period. Further, our framework also rejects the hypothesis that there has been no change in global mean surface temperature over the recent ≈15 years, and the hypothesis that the distribution of annual changes in global surface temperature has been different in the past ≈15 years than earlier in the record.

This is all wrong. There is not “overwhelming evidence” of a positive trend in the last 15 years of the data — they conclude that only because they used a flawed method. They do not actually reject “the hypothesis that the trend in global surface temperature has been smaller over the recent ≈ 15 year period than over the prior period”. Rather, after an incorrect choice of start year, they fail to reject the hypothesis that the trend in the recent period has been equal to or greater than the trend in the prior period. Failure to reject a null hypothesis is not the same as rejecting the alternative hypothesis, as we try to teach students in introductory statistics courses, sometimes unsuccessfully. Similarly, they do not actually reject “the hypothesis that the distribution of annual changes in global surface temperature has been different in the past ≈15 years than earlier in the record”. To anyone who understands the null hypothesis testing framework, it is obvious that one could not possibly reject such a hypothesis using any finite amount of data.

Those familiar with the scientific literature will realize that completely wrong papers are published regularly, even in peer-reviewed journals, and even when (as for this paper) many of the flaws ought to have been obvious to the reviewers. So perhaps there’s nothing too notable about the publication of this paper. On the other hand, one may wonder whether the stringency of the review process was affected by how congenial the paper’s conclusions were to the editor and reviewers. One may also wonder whether a paper reaching the opposite conclusion would have been touted as a great achievement by Stanford University. Certainly this paper should be seen as a reminder that the reverence for “peer-reviewed scientific studies” sometimes seen in popular expositions is unfounded.

The results above can be reproduced by first downloading the data using this shell script (which downloads other data too, that I use in other blog posts), or manually download from the URLs it lists if you don’t have wget. You then need to download my R script for the above analysis and this R source file (renaming them to .r from the .doc that wordpress requires), and then running the script in R as described in its opening comments (which will take quite a long time).

]]>A recent focus of this debate has been whether temperature records show a `pause’ (or `hiatus’) in global warming over the last 10 to 20 years (or at least a `slowdown’ compared to the previous trend), and if so, what it might mean. Lukewarmers might interpret such a pause as evidence that other factors are comparable in importance to CO2, and can temporarily mask or exaggerate its effects, and hence that naively assuming the warming from 1970 to 2000 is primarily due to CO2 could lead one to overestimate the effect of CO2 on temperature.

Whether you sees a pause might, of course, depend on which data set of global temperatures you look at. These data sets are continually revised, not just by adding the latest observations, but by readjusting past observations.

Here are the yearly average land-ocean temperature anomaly data from 1955 to 2014 from the Goddard Institute for Space Studies (GISS), in the version before and after July of this year:

The old version shows signs of a pause or slowdown after about 2000, which has largely disappeared in the new version. Unsurprisingly, the revision has engendered some controversy. I should note that the difference is not really due to GISS itself, but rather to NOAA, from whom GISS gets the sea surface temperatures used.

Many people pointing to a pause look at the satellite temperature data from UAH, which starts in 1979. Below, I show it on the right, with the new GISS data from 1979 on the left, both in yearly (top) and monthly (bottom) forms:

Two things can be noted from these plots. First, the yearly UAH data (top right) can certainly be seen as showing roughly constant temperatures since somewhere between 1995 and 2000, apart from short-term variability. However, if one so wishes, one can also see it as showing a pretty much constant upward trend, again with short-term variability. Looking at the monthly UAH data (bottom right) gives a much stronger impression of a pause, since fitting a straight line to the monthly data leads to most points after about 2007 being under the line, while those before then back to about 2001 are mostly above the line, which is what one would expect if there is a pause at the end — see the plot below of the least-squares fitted line and its residuals:

The (new) GISS data also gives more of an impression of a slowdown with monthly rather than yearly data:

There are two issues with looking at monthly data, however. The first is that although both GISS and UAH data effectively have a seasonal adjustment — anomalies for each month are from a baseline for that month in particular — the seasonal effects actually vary over the years, introducing possible confusion. I’ll try fitting a model that handles this in a later post, but for now sticking to the yearly data avoids the problem. The second issue is that one can see a considerable amount of `autocorrelation’ in the monthly data. This brings us to the crucial question of what one should really be asking when considering whether there is a pause (or a slowdown) in the temperature data.

To some extent, talk of a `pause’ by lukewarmers is for rhetorical effect — look, no warming for 15 years! — as a counter to the rhetoric of the warmers — see how much the planet has warmed since 1880! — with such rhetoric by both sides being only loosely related to any valid scientific argument. However, one should try as much as possible to interpret both sides as making sensible arguments.

In this respect, note that the lukewarmers are certainly *not* claiming that the pause shows that although CO2 had a warming effect up until the year 2000, it stopped having a warming effect after 2000, so we don’t have to worry now. I doubt that anyone in the entire world believes such a thing (which is saying a lot considering what some people do believe).

Instead, the sensible lukewarmer interpretation of a `pause’ would be that the departures from the underlying trend in the temperature time series have a high degree of positive *autocorrelation* — that the departure from trend in one year is likely to be similar to the departures from trend of recent years. (Alternatively, some lukewarmers might think that there are deterministic or stochastic cycles, with periods of decades or more.) The effect of high autocorrelation is to make it harder to infer the magnitude of the true underlying trend from a relatively short series of observations.

The problem can be illustrated with simulated data sets, which I’ve arranged to look vaguely similar to the GISS data from 1955 to 2014 (though to avoid misleading anyone, I label the x-axis from 1 to 60 rather than 1955 to 2014).

I start by generating a series of 20000 values with high autocorrelation that will be added as residuals to a linear trend. I do this by summing a Gaussian series with autocorrelations that slowly decline to zero at lag 70, a slightly non-Gaussian series with autocorrelations that decline more quickly, and a series of independent Gaussian values. The R code is as follows:

`set.seed(1)`

n0 <- 20069

fa <- c(1,0.95,0.9,0.8/(1:67)^0.8); fa <- fa/sum(fa)

fb <- exp(-(0:69)/2.0); fb <- fb/sum(fb)

xa <- filter(rnorm(n0),fa); xa <- xa[!is.na(xa)]

xb <- filter(rt(n0,5),fb); xb <- xb[!is.na(xb)]

xc <- rnorm(length(xb))

xresid <- 0.75*xa + 0.08*xb + 0.06*xc

Here are the first 1500 values of this residual series:

Here are the autocorrelations estimated from the entire simulated residual series:

The `autocorrelation time’ shown above is one plus twice the sum of autocorrelations at lag 1 and up. It is the factor by which the effective sample size is less than it would be if the points were independent. With an autocorrelation time of 13 as above, for example, a data set of 60 points is equivalent to about 5 independent points.

I then split this long residual series into chunks of length 60, to each of which I added a trend with slope 0.01, and then shifted it to have sample mean of zero. Here are the first twenty of the 333 series that resulted:

The slope of the least-squares fit line is shown above each plot. As one can see, some slope estimates are almost twice the underlying trend of 0.01, while other slopes are much less than the underlying trend. Here is the histogram of slope estimates from all 333 series of length 60, along with the lower bound of the 95% confidence interval for the slope, computed assuming no autocorrelation:

Ignoring autocorrelation results in the true slope of 0.01 being below the lower bound of the 95% confidence interval 24% of the time (ten times what should be the case).

What is even more worrying is that looking at the residuals from the regression often shows only mild autocorrelation. Here are the autocorrelation (and autocorrelation time) estimates for the first 20 series:

One can compare these estimates with the plot of true residual autocorrelation above, and the true autocorrelation time of 13.

To see the possible relevance of this simulation to global temperature data, here are old and new GISS global temperature anomaly series (from 1955), centred and relabeled as for the simulated series, along with simulated series B and L from above:

It is worrying that the GISS series do not appear much different from the simulated series, which substantially overestimate the trend.

The real significance of a `pause’ or `slowdown’ in temperatures is that it would be evidence of such high autocorrelation, whose physical basis could be internal variability in the climate system, or the influence of external factors that themselves exhibit autocorrelation. Looking for a `pause’ may not be the best way of assessing whether autocorrelation is a big problem. But direct estimation of long-lag autocorrelations from relative short series is not an easy problem, and may be impossible without making strong prior assumptions regarding the form of the autocorrelation function.

Accordingly, I’ll now go back to looking at whether one can see a pause in the GISS and UAH temperature data, while keeping in mind that the point of this is to see whether high autocorrelation is a problem. I’ll look only at the yearly data, though as noted above, a pause or slowdown may be more evident in the monthly data.

Here are the old and new versions of the GISS data, from 1955 through 2014, with least-squares regression lines fitted separately to data before 1970, from 1970 to 2001, and after 2001. In the top plots, the fits are required to join up; in the bottom plots, there may jumps as well as slope changes at 1970 and 2001.

In the two top plots, the estimated slopes after 2001 are smaller than the slopes from 1970 to 2001, but the differences are not statistically significant (p-values about 0.3, assuming independent residuals). In the bottom two plots, the slopes before and after 2001 differ substantially, with the differences being significant (p-values of 0.003 and 0.018, assuming independent residuals). However, one might wonder whether the abrupt jumps are physically plausible.

Next, let’s look at the UAH data, which starts in 1979, along with the (new) GISS data from that date for comparison, and again consider a change in slope and/or a jump in 2001:

Omitting the data from 1970 to 1978 decreases the pre-2001 slope of the GISS data, lessening the contrast with the post-2001 slope. For the UAH data, the difference in slopes before and after 2001 is quite noticeable. However, for the top UAH plot, the difference is not statistically significant (p-value 0.19, assuming independent residuals). For the bottom plot, the two-sided p-value is 0.08. Based on the comparison with the GISS data, however, one might think that both differences would have been significant if data back to 1970 had been available.

There is a `cherry-picking’ issue with all the above p-values, however. The selection of 2001 as the point where the slope changes was made by looking at the data. One could try correcting for this by multiplying the p-values by the number of alternative choices of year, but this number is not clear. In a long series one would expect the slope to change at other times as well, as indeed seems to have happened in 1970. One could try fitting a general model of multiple `change-points’, but this seems inappropriately elaborate, given that the entire exercise is a crude way of testing for long-lag autocorrelation.

I have, however, tried out a Bayesian analysis, comparing a model with a single linear trend, a model with a trend that changes slope at an unknown year (between 1975 and 2010), a model with both a change in slope and a jump (at an unknown year), and a model in which the trend is a constant apart from a jump (at an unknown year). I selected informative priors for all the parameters, as is essential when comparing models in the Bayesian way by marginal likelihood, and computed the marginal likelihoods (and posterior quantities) by importance sampling from the prior (a feasible method for this small-scale problem). See the R code linked to below for details.

Here are the results of these four Bayesian models, shown as the posterior average trend lines:

In the last plot, note that the model has an abrupt step up at some year, but the posterior average shows a more gradual rise, since the year of the jump is uncertain. The log marginal likelihoods for the four models above are 16.0, 15.4, 15.7, and 14.4. If one were to (rather artificially) assume that these are the only four possible models, and that they have equal prior probabilities, the posterior probabilities of the four models would be 39%, 23%, 30%, and 9%.

I emphasize again that the exercise of looking for a `pause’ or `slowdown’ is really a crude way of looking for evidence of long-lag autocorrelation. The quantitative results should not be taken too seriously. Nevertheless, the conclusion I reach is that this data does not produce a definitive yes or no answer to whether there is a pause, even in the UAH data, for which a pause seems most evident. A few years more data might (or might not) be enough to make the situation clearer. Analysis of monthly data might also give a more definite result. Note, however, that `lack of definite evidence of a pause’ is not the same as `no pause’. It is not reasonable to assume a lack of long-lag autocorrelation absent definite evidence to the contrary, since the presence of such autocorrelation is quite plausible *a priori*.

In my previous post, I had said that this next post would examine two papers `debunking’ the pause, but it’s gotten too long already, so I’ll leave that for the post after this. I’ll then look at what can be learned by looking at monthly data, and by modeling some known effects on temperature (such as volcanic activity).

The results above can be reproduced by first downloading the data using this shell script (which downloads other data too, that I will use for later blog posts), or manually download from the URLs it lists if you don’t have wget. You then need to download my R script for reading these files, and my R script for the above analysis (and rename them to .r from the .doc that wordpress requires). Finally, run the second script in R as described in its opening comments.

UPDATE: You’ll also need this R source file.

]]>I will focus on anthropogenic warming that results, via the mis-named `greenhouse effect’, from CO2 produced by burning fossil fuels. There are other human-generated `greenhouse gasses’, and other human influences on climate, such as changes in land use, but the usual estimates of their effects are smaller than that of CO2, and in any case, they would call for different policy responses than reducing fossil fuel consumption. Other possible anthropogenic influences are, however, a possible complication when trying to determine the effects of CO2 by looking at temperature data.

What I’ll call the `warmer’ view of the effect of CO2 is what is accepted (at least verbally) by most governments, and is more-or-less found in the reports of the Intergovernmental Panel on Climate Change (IPCC) — that burning of fossil fuels increases CO2 in the atmosphere, resulting in a global increase in temperatures large enough to have quite substantial harmful effects on humans and the environment. The contrasting `no-warmer’ view is that increases in CO2 cause little or no warming, either (implausibly) because CO2 has no warming effect, or (somewhat more plausibly) because strong negative feedbacks limit its effects. In between is the `lukewarmer’ view — CO2 has some warming effect, but it is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption. This is the predominant view at some `skeptical’ web sites such as Watts Up With That.

There is also the `extreme-warmer’ view, that the effects of CO2 will be so large as to `fry the planet’, leading to the extinction of humans, and perhaps all life, which is surprisingly common among the general public, despite being utterly implausible. Of course, they are encouraged in this belief by alarmist papers such as `Mathematical Modelling of Plankton–Oxygen Dynamics Under the Climate Change‘ by Sekerci and Petrovskii, who apparently don’t understand that any arbitrary system of differential equations has a good chance of producing unstable behaviour, and that calling such a system a `model of a coupled plankton–oxygen dynamics’ does not make it a good model. It is very, very unlikely that life on earth would have lasted for over three billion years if the global ecosystem were really as unstable as is suggested in this paper.

The `warmer’ and `lukewarmer’ views are sufficiently plausible that it’s worth asking whether global temperature data has anything to say about which is closer to the truth. An alternative source of evidence is physical theory, embodied in computer simulations. Unfortunately, earth’s climate system is too complex to be simulated without various simplifications and approximations being made, so simulation cannot provide definitive answers, and must ultimately be checked against observations. Observations also have a rhetorical role, being potentially convincing to those who may put no trust in theory and simulation, but who naively think that measuring global temperature is a simple matter of reading thermometers.

Unfortunately, measuring global temperature is not so simple. Earth is a big place, with few observing stations, and every observing station is subject to biases from factors such as changes in the nature of its surroundings and in the time of day when observations are made. Measurements of temperature from space are indirect, and have potential biases from factors such as decaying satellite orbits. All time series of global temperatures are therefore the result of complex processing of raw data, whose appropriateness can be questioned.

It should come as no surprise to those aware of the political nature of this debate that supporters of the `warmer’ and `lukewarmer’ views tend to favour different global temperature datasets, which show different temperature trends in recent years. A favourite of the warmers is NASA’s GISS data, whose land-ocean version combines land temperature observations with sea surface temperature data. This data set was recently revised, with the new version showing a larger upward trend in temperature in recent years. The lukewarmers tend to favour the UAH data from satellite observations, also recently revised, with the new version showing a lower trend than before.

One should note that these two data sets are not measuring the same thing, or even trying to. GISS measures an ill-defined combination of water temperature near the top of the ocean and air temperature a few feet above the ground, in some variety of surroundings. UAH measures temperature in the lower part of the atmosphere, up to about 8000 metres above the surface. So it’s conceivable that the different trends in these two data sets both accurately reflect reality, though if so it’s hard to see how these different trends could continue indefinitely.

I’ll first show the monthly GISS global land-ocean temperatures (retrieved 2015-11-30) from 1880 to the end of 2014. (That’s when some other data I’ll be looking at ends; 2015 is so far mostly warmer than 2014.) These temperatures are expressed as `anomalies’ (in degrees Celsius) with respect to a base period (separately for each month of the year), since absolute values are meaningless given the arbitrary nature of what GISS is measuring. Here they are:

This graph is often portrayed (to the public) as convincing evidence that CO2 causes global warming. Look at that upward trend from about 1910! However, the rise from 1910 to 1940 can’t really be due to CO2. The direct warming effect of CO2 is generally accepted to be proportional to the logarithm of its concentration, with a doubling of CO2 producing roughly one degree Celsius of warming, which might be amplified (or diminished) by feedbacks. Here is a plot of the log base 2 of CO2 over the period above (data from here):

The increase from 1910 to 1940 is only about 0.05, which even with a generous factor of four allowance for positive feedback would give only 0.2 degrees Celsius of warming, compared to the warming of about 0.5 degrees in the GISS data. And if the 1910-1940 warming was really due to CO2, the warming from 1970-2000 should have been even greater than it was. Furthermore, part of the effect of CO2 is expected to be delayed by decades, making it an even less likely explanation of the 1910-1940 warming, since CO2 is thought to have been more-or-less constant before 1880.

Clearly, there are other influences on temperature than CO2. Once one realizes this, the upward temperature trend from 1970 to 2000 becomes less convincing as evidence of a warming effect of CO2. Furthermore, since CO2 has been increasing pretty much monotonically for over a hundred years, it is highly confounded with everything else that has been increasing over that period, as well as with long-period cycles. So any really persuasive argument regarding the effect of CO2 must be based on physical theory and on more detailed measurements that can confirm the effects of CO2 at a greater level of detail than a simple global average of temperature. This is the subject of `attribution’ studies, the critique of which is beyond the scope of this blog post (and beyond my expertise).

Nevertheless, there seems to be value in trying to better understand the global temperature data, partly as a `sanity check’ on claims based on more complex, and perhaps more questionable, analyses, and also to see whether there is any evidence of the data being wrong.

To lukewarmers, an aspect of the data that provides evidence of other factors being comparable in importance to CO2 is the `pause’ in warming (or at least a `slowdown’) that one can visually see in the plot above from about 2002. For a closer look, here is the same GISS data, but going back only to 1979:

The UAH satellite temperature data starts in 1979, so we can now compare with it (version 6.0beta4, downloaded 2015-11-30):

The base period for the anomalies in the UAH plot is different from GISS, so only the changes are comparable. (I’ve made the vertical scales match in that respect.)

Both data sets seem visually to show a slowdown or `pause’ around 2002, with this being more prominent in the UAH data (in which one might see the pause as going back as far as 1995). To lukewarmers, the significance of this pause is not that global warming has stopped, showing that CO2 has no effect, since they think that CO2 does have at least some small effect. Rather, they see it as evidence that other effects are large, sometimes large enough to cancel any underlying warming trend from CO2, and sometimes making any such trend appear larger than it actually is — and hence the warming in the 1970-2000 period cannot be taken as indicative of the magnitude of the warming due to CO2, or of what to expect in future.

As alluded to above, simple linear least squares fits to the GISS and UAH data for 1979-2014 show a greater trend for GISS (1.59 degrees C per century) than for UAH (1.12 degrees C per century). But if there is actually a change around 2002, a single trend line is of course largely meaningless.

Reactions to the `pause’ (or `hiatus’) from the warmer camp have taken several forms:

- Claims that the pause is an artifact of poorly adjusted temperature measurements, that disappears when adjustments are done properly.
- Claims that the visual appearance of a pause is deceiving — that the `pause’ is just chance variation, which the human eye overinterprets.
- Claims that if one subtracts changes due to known effects, such as volcanic eruptions, the pause disappears, showing that the underlying trend due to CO2 continues unabated. (Note that depending on the size of the underlying trend that is revealed, this would not necessarily be contrary to lukewarmer views.)
- Claims that warming from CO2 continues at a substantial rate, but that the heat is going somewhere that escapes measurement in global temperature data sets.

I will leave claims in category (4) for others to critique.

Claims in category (3) include a blog post by `tamino’. I plan to present my own analysis of this sort in a future blog post, and compare to that of `tamino’.

Two recent papers making claims in category (2) are `Debunking the climate hiatus‘, by Rajaratnam, Romano, Tsiang, and Diffenbaugh, and `On the definition and identifiability of the alleged “hiatus” in global warming‘, by Lewandowsky, Risbey, and Oreskes. Both of these papers look at (or say they look at) the GISS land-ocean temperature data, displayed above, but before the recent revision. I plan to comment on these papers in my next blog post.

Regarding (1), the GISS temperatures displayed above show a less prominent `pause’ than the version of GISS land-ocean temperatures distributed prior to July 2015 (obtained from the wayback machine’s version of 2015-04-18, stored here), which is shown below:

The revision results in a greater upward trend during the `pause’ period, as shown by the following plot of differences (with enlarged vertical scale):

To tell whether or not this revision was justified, one would need to examine in depth the temperature adjustments done for the GISS data set, which I haven’t done.

However, it’s not too hard to see some interesting things by examining the GISS land-ocean temperature data in more detail. I’ll look only at the most recent version (accessed 2015-11-30) .

First, one can look separately at the Northern Hemisphere:

and Southern Hemisphere:

The difference is rather striking. One would expect some overall difference due to the greater amount of ocean in the Southern Hemisphere, and the different nature of the polar regions. But that doesn’t explain the abrupt increase in the scatter of Southern Hemisphere data points after about 1955.

We can also look at each month of the year separately. Here’s the Northern Hemisphere:

And here’s the Southern Hemisphere:

In the Northern Hemisphere, variability is obviously greater in winter than in summer. The variability in the Southern Hemisphere winter seems slightly greater than in summer, but much less so than in the Northern Hemisphere. These are differences that I’ll take account of when modeling this data later.

I’ve marked 1955 by a short line at the bottom. In the Northern Hemisphere, the dip in January temperatures from 1955 to 1975 seems odd, since it doesn’t show up in December and February, but it’s hard to be sure that it’s not a real climatic effect. Something does happen around 1955 in the Southern Hemisphere plots, which increases the variance in May and August, and maybe June, July, and September. This can be confirmed by looking at plots for each of the 12 months of the year that show the difference of the anomaly for that month from the average anomaly for that month in the three preceding and three following years:

May through September seem to have higher variability in the years after 1955, and this is very clear for at least May and August. In contrast, similar plots for the Northern Hemisphere show no change in variance, or perhaps a slight decline after 1955 for May and June. It’s hard to see how this Southern Hemisphere variance change can reflect a real change in climate, given its abrupt onset, and that it does not appear in the Northern Hemisphere. More likely, it is an artifact of how the data is processed. A rapid improvement in quality of measurements after World War II might also be a possible explanation (though one would expect that to lead to less variability, rather than more).

Whatever the reason, it seems that relying on GISS data before 1955 might be unwise. In my later analyses, I will look at data only from 1959, since that is when some other related data sets begin, or from 1979 when comparing to the UAH data.

I note that obtaining all but the most recent GISS data is difficult. Some versions can be accessed at the wayback machine, but many versions apparently saved there produce an ‘access denied’ error. UAH has an extensive archive, but even it seems not to have all the versions that were distributed. GISS distributes the programs they use, but only the current version. I can’t find any programs at the UAH website. Both GISS and UAH ought to have a public repository that uses a source-code control system such as git, which would allow all versions of programs, raw data, and processed data to be accessed, with documentation of all changes.

To reproduce the results in this post, you will first need to download the data using this shell script (which downloads other data too, that I will use for later blog posts), or manually download from the URLs it lists if you don’t have wget. You then need to download my R script for reading these files, and my R script for making the plots (and rename them to .r from the .doc that wordpress requires). Finally, run the second script in R as described in its opening comments.

]]>The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).

A superaccumulator is a giant fixed-point number that can exactly represent any floating-point value (and then some, to allow for bigger numbers arising from doing sums). This concept has been used before to implement exact summation methods. But if done in software in the most obvious way, it would be pretty slow. In my paper, I introduce two new variations on this method. The “small” superaccumulator method uses a superaccumulator composed of 64-bit “chunks” that overlap by 32 bits, allowing carry propagation to be done infrequently. The “large” superaccumulator method has a separate chunk for every possible combination of the sign bit and exponent bits in a floating-point number (4096 chunks in all). It has higher overhead for initialization than the small superaccumulator method, but takes less time per term added, so it turns out to be faster when summing more than about 1000 terms.

Here is a graph of performance on a Dell Precision T7500 workstation, with a 64-bit Intel Xeon X5680 processor:

The horizontal axis is the number of terms summed, the vertical axis the time per term in nanoseconds, both on logarithmic scales. The time is obtained by repeating the same summation many times, so the terms summed will be in cache memory if it is large enough (vertical lines give sizes of the three cache levels).

The red lines are for the new small (solid) and large (dashed) superaccumulator methods. The blue lines are for the iFastSum (solid) and OnlineExact (dashed) methods of Zhu and Hayes (2010), which appear to be the fastest previous exact summation methods. The black lines are for the obvious (inexact) simple summation method (solid) and a simple out-of-order summation method, that adds terms with even and odd indexes separately, then adds together these two partial sums. Out-of-order summation provides more potential for instruction-level parallelism, but may not produce the same result as simple ordered summation, illustrating the reproducibility problems with trying to speed up non-exact summation.

One can see that my new superaccumulator methods are about twice as fast as the best previous methods, except for sums of less than 100 terms. For large sums (10000 or more terms), the large superaccumulator method is about 1.5 times slower than the obvious simple summation method, and about three times slower than out-of-order summation.

These results are all for serial implementations. One advantage of exact summation is that it can easily be parallelized without affecting the result, since the exact sum is the same for any summation order. I haven’t tried a parallel implementation yet, but it should be straightforward. For large summations, it should be possible to perform exact summation at the maximum rate possible given limited memory bandwidth, using only a few processor cores.

For small sums (eg, 10 terms), the exact methods are about ten times slower than simple summation. I think it should be possible to reduce this inefficiency, using a method specialized to such small sums.

However, even without such an improvement, the new superaccumulator methods should be good enough for replacing R’s “mean” function with one that computes the exact sample mean, since for small vectors the overhead of calling “mean” will be greater than the overhead of exactly summing the vector. Summing all the data points exactly, then rounding to 64-bit floating point, and then dividing by the number of data points wouldn’t actually produce the exactly-rounded mean (due to the effect of two rounding operations). However, it should be straightforward to combine the final division with the rounding, to produce the exactly-correct rounding of the sample mean. This should also be faster than the current inexact two-pass method.

Modifying “sum” to have an “exact=TRUE” option also seems like a good idea. I plan to implement these modifications to both “sum” and “mean” in a future version of pqR, though perhaps not the next version, which may be devoted to other improvements.

]]>

It seems that lots of actual data vectors could be stored more compactly than at present. Many integer vectors consist solely of elements that would fit in one or two bytes. Logical vectors could be stored using two bits per element (allowing TRUE, FALSE, and NA), which would use only one-sixteenth as much memory as at present. It’s likely that many operations would also be faster on such compact vectors, so there’s not even necessarily a time-space tradeoff.

For integer and logical types, the possible compact representations, and how to work with them, are fairly obvious. The challenge is how to start using such compact representations while retaining compatibility with existing R code, including functions written in C, Fortran, or whatever. Of course, one could use the S3 or S4 class facilities to define new classes for data stored compactly, with suitable redefinitions of standard operators such as `+’, but this would have substantial overhead, and would in any case not completely duplicate the behaviour of non-compact numeric, integer, or logical vectors. Below, I discuss how to implement compact representations in a way that is completely invisible to R programs. I hope to try this out in my pqR implementation of R sometime, though other improvements to pqR have higher priority at the moment.

How to compactly represent floating-point data (of R’s `numeric’ type) is not so obvious. If the use of a compact representation is to have no effect on the results, one cannot just use single-precision floating point. I describe a different approach in a new paper on Representing numeric data in 32 bits while preserving 64-bit precision (also on arxiv). I’ll present the idea of this paper next, before returning to the question of how one might put compact representations of any sort into an R interpreter, invisibly to R programs.

Statistical applications written in R typically represent numbers read from data files using 64-bit double-precision floating-point numbers (unless all numbers are integers). However, the information content of such data is often small enough that each data item could be represented in 32 bits. For example, if every item in the data file contains six or fewer digits, with the decimal point in one of the seven places before or after one of these digits, there are less than 7 million possible numbers (14 million if a number might be negative), which is much less than the approximately 4 billion possible values of a 32-bit element.

However, representing such data with 32-bit single-precision floating-point values won’t really work. Single-precision floating-point will be able to distinguish all numbers that have no more than six digits, but if these numbers are used in standard arithmetic operations, the results will in general *not* be the same as if they had been represented with 64-bit double precision. The problem is that numbers, such as 0.1, that have non-terminating binary representations will be rounded to much less precise values in single precision than in double precision.

Exactly the same results as using double precision could be obtained by using a decimal floating point representation. For example, a compact number could consist of a 28-bit signed integer, *m*, and a 4-bit exponent, *e*, which represent the number *m*×10^{–e}. To decode such a number, we would extract *m* and *e* with bit operations, use *e* to look up 10^{e} from a table, and finally divide *m* by 10^{e}. Unfortunately, the final division operation is comparatively slow on most current processors, so compressing data with this method would lead to substantially slower operations on long vectors (typically about six times slower). It’s much faster to instead multiply *m* by 10^{-e}, but this will not give accurate results, since 10^{-e} cannot be exactly represented in binary notation.

In my paper, I propose a faster way of representing 64-bit floating-point values in 32 bits, while getting exactly the same answers. The idea is simply to store only the upper 32 bits of the 64-bit number, consisting of the sign, the exponent, and the upper 20 bits of the mantissa (21 bits of precision, counting the implicit 1 bit). To use such a compact number, we need to fill in the lower 32 bits of the mantissa, which is done by looking these bits up in a table, indexing with some bits from the retained part of the mantissa and perhaps from the exponent.

Of course, this only works for some subsets of possible 64-bit floating-point values, in which there aren’t two numbers with the same upper 32 bits. Perhaps surprisingly, there are a number of interesting subsets with this property. For example, the set of all six-digit decimal numbers with the decimal point before or after any of the digits can be represented, and decoded using a table indexed by 19 bits from the mantissa and the exponent. Some smaller subsets can be decoded with smaller tables. More details are in the paper, including timing results for operations on vectors of such compactly-represented values, which show that it’s faster than representing data by decimal floating point, and sometimes faster than using the original 64-bit floating point values.

An interesting feature of this scheme is that the compact representation of a 64-bit value is the same regardless of what subset is being represented (and hence what decoding table will be used). So when compressing a stream of data, the data can be encoded before it is known what decoding scheme will be used. (Of course, it may turn out that no decoding scheme will work, and hence the non-compact form of the data will need to be used.) In contrast, when trying to compress an integer vector by storing it in one or two bytes, it may initially seem that a one-byte representation of the data will be possible, but if an integer not representable in one byte is later encountered, the previous values will need to be expanded to two bytes.

I’d like to be able to use such compact representations for R vectors invisibly — without changing any R programs, or external routines called from R that use the R API. This requires that a compactly-represented vector sometimes be converted automatically to its non-compact form, for example, when passed to an external routine that knows nothing about compact representations, or when it is operated on by some part of the R interpreter that has not been re-written to handle compact vectors. Compactly-represented vectors will also need to be expanded to their non-compact form when an element of the vector is replaced by a value that is not in the set that can be compactly represented.

It should be possible to accommodate code that doesn’t know about compact representations using the same variant result mechanism in pqR that is used to implement parallel computation in helper threads. With this mechanism, code in C that calls the internal “eval” function to evaluate an R expression can specify that the caller is prepared to handle a “variant” of the normal evaluation result, which in this application would be a result that is a compactly-stored vector. By default, such variant results will not be returned, so code that is unaware of compact vectors will still work. Of course, compact representations will be useful only if modifications to handle compact representations have been made to many parts of the interpreter, so that vectors can often remain in their compact form.

When we do need to expand a compact vector into it’s non-compact form, how should we do it? Should we keep the compact form around too, and use it if we no longer need the expanded form? That seems bad, since far from reducing memory usage, we’d then be increasing it by 50%. And even if we discard the compact form after expanding it, we still use 50% more memory temporarily, while doing the expansion, which may cause serious problems if the vector is really huge.

We can avoid these issues by expanding the vector in place, having originally allocated enough memory for the non-compact representation, with the compact form taking up only the first part of this space allocation. Now, this may seem crazy, since the whole point of using a compact representation is to avoid having to allocate the amount of memory needed for the non-compact representation! Modern operating systems can be clever, though. At least on Linux, Solaris, and Mac OS X, if you allocate a large block of memory (with C’s malloc function), real physical memory will be assigned to addresses in this memory block only when they are actually used. So if you use only the first half of the block, only that much physical memory will be allocated — except that allocation is actually done in units of “pages”, typically around 4 KBytes. So as long as physical memory (rather than virtual address space) is what you’re short of, and the vector is several times larger than the page size, allocating enough memory to hold the vector’s non-compact form should still save on physical memory if in fact only the compact representation is used.

Expanding compact vectors in place also avoids problems with garbage collections being triggered at unexpected times, and with the address of a vector changing when existing code may assume it will stay the same. Indeed, it’s not clear that these problems could be solved any other way.

However, one unfortunate consequence of allocating space to allow expansion in place is that compact representations will not help with programs that create a huge number of short vectors, because the allocation of physical memory in units of pages limits the usefulness of compact representations to vectors of at least a few thousand elements. It’s difficult to assess how often compact representations will provide a substantial benefit in real applications until they have been implemented, which as I mentioned above, will have to wait until after several other planned features have been implemented in pqR.

]]>

Click on image for larger version.

Toronto, March 2015. Fujica G690 with 100mm 1:3.5 lens, Kodak Portra 400 film (120), Nikon Coolscan 9000.

]]>Faculty at the suburban Mississauga campus teach undergraduate courses there, but also spend much time at the Department of Statistical Sciences on the downtown campus, teaching graduate courses, supervising graduate students, attending research seminars, etc. The University of Toronto has a diverse group of both young and experienced faculty working in statistics, both in the downtown and suburban statistics departments, and in related research groups such as machine learning and biostatistics.

The deadline to apply is December 15, 2014. You can see the ad here.

]]>