Posts filed under ‘Statistics’
I have released a new version of my “pretty quick” R interpreter, pqR-2016-10-05.
One major change with this version is that pqR, which was based on R-2.15.0, is now compatible with R-2.15.1. This allows for an increased number of packages in the pqR repository.
This release also has some significant speed improvements, a new form of the “for” statement, for conveniently iterating across columns or down rows of a matrix, and a new, less error-prone way for C functions to “protect” objects from garbage collection. There are also a few bug fixes (including fixes for some bugs that are also in the current R core release).
Currently, pqR is distributed in source form only, and so you need to be comfortable compiling it yourself. It has been tested on Linux/Unix systems (with Intel/AMD, ARM, PowerPC, and SPARC processors), on Mac OS X (including macOS Sierra), and on Microsoft Windows (XP, 7, 8, 10) systems.
I plan to soon put up posts with more details on some of the features of this and the previous pqR release, as well as a post describing some of my future plans for pqR.
In particular, the extensions fix the problems that 1:n doesn’t work as intended when n is zero, and that M[1:n,] is a vector rather than a matrix when n is one, or when M has only one column. Since changing the “:” operator would cause too many problems with existing programs, pqR introduces a new “..” operator for generating increasing sequences. Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.
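The `1:n` problem can be illustrated outside R. The following Python sketch is only an emulation: `r_colon` imitates how R's `:` counts downward when the upper bound is below the lower one, and `increasing_seq` is a hypothetical stand-in for an increasing-only sequence operator. It is not pqR's implementation, and it assumes only that such an operator yields an empty sequence when n is zero.

```python
def r_colon(a, b):
    """Imitate R's a:b for integers: counts down when b < a."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))


def increasing_seq(a, b):
    """Hypothetical increasing-only sequence: empty when b < a."""
    return list(range(a, b + 1))


n = 0
print(r_colon(1, n))         # [1, 0]: a loop over this runs twice, not zero times
print(increasing_seq(1, n))  # []: a loop over this runs zero times
```

For n of one or more the two functions agree, which is why the problem tends to go unnoticed until an empty case turns up.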
The new release, pqR-2016-06-24, is available at pqR-project.org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).
Records of global temperatures over the last few decades figure prominently in the debate over the climate effects of CO2 emitted by burning fossil fuels, as I discussed in my first post in this series, on What can global temperature data tell us? One recent controversy has been whether or not there has been a `pause’ (also referred to as a `hiatus’) in global warming over the last 15 to 20 years, or at least a `slowdown’ in the rate of warming, a question that I considered in my second post, on Has there been a `pause’ in global warming?
As I discussed in that post, the significance of a pause in warming since around 2000, after a period of warming from about 1970 to 2000, would be to show that whatever the warming effect of CO2, other factors influencing temperatures can be large enough to counteract its effect, and hence, conversely, that such factors could also be capable of enhancing a warming trend (eg, from 1970 to 2000), perhaps giving a misleading impression that the effect of CO2 is larger than it actually is. To phrase this more technically, a pause, or substantial slowdown, in global warming would be evidence that there is a substantial degree of positive autocorrelation in global temperatures, which has the effect of rendering conclusions from apparent temperature trends more uncertain.
Whether you see a pause in global temperatures may depend on which series of temperature measurements you look at, and there is controversy about which temperature series is most reliable. In my previous post, I concluded that even when looking at the satellite temperature data, for which a pause seems most visually evident, one can’t conclude definitely that the trend in yearly average temperature actually slowed (ignoring short-term variation) in 2001 through 2014 compared to the period 1979 to 2000, though there is also no definite indication that the trend has not been zero in recent years.
Of course, I’m not the only one to have looked at the evidence for a pause. In this post, I’ll critique a paper on this topic by Bala Rajaratnam, Joseph Romano, Michael Tsiang, and Noah S. Diffenbaugh, Debunking the climate hiatus, published 17 September 2015 in the journal Climatic Change. Since my first post in this series, I’ve become aware that `tamino’ has also commented on this paper, here, making some of the same points that I will make. I’ll have more to say, however, some of which is of general interest, apart from the debate on the `pause’ or `hiatus’. (more…)
As I discussed in my previous post, records of global temperatures over the last few decades figure prominently in the debate over the climate effects of CO2 emitted by burning fossil fuels. I am interested in what this data says about which of the reasonable positions in this debate is more likely to be true — the `warmer’ position, that CO2 from burning of fossil fuels results in a global increase in temperatures large enough to have quite substantial (though not absolutely catastrophic) harmful effects on humans and the environment, or the `lukewarmer’ position, that CO2 has some warming effect, but this effect is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption.
A recent focus of this debate has been whether temperature records show a `pause’ (or `hiatus’) in global warming over the last 10 to 20 years (or at least a `slowdown’ compared to the previous trend), and if so, what it might mean. Lukewarmers might interpret such a pause as evidence that other factors are comparable in importance to CO2, and can temporarily mask or exaggerate its effects, and hence that naively assuming the warming from 1970 to 2000 is primarily due to CO2 could lead one to overestimate the effect of CO2 on temperature.
Whether you see a pause might, of course, depend on which data set of global temperatures you look at. These data sets are continually revised, not just by adding the latest observations, but by readjusting past observations. (more…)
Debates about anthropogenic climate change often centre around data on changes in global temperatures over the last few decades. There are good scientific reasons to look at this data, but it also plays a prominent role in political advocacy, sometimes fairly, sometimes not so fairly. This is the first in a series of posts in which I’ll discuss what this data can and cannot tell us, and examine some recent papers concerning whether or not there has been a “pause” in global warming over the last 10 to 20 years, and if so, what it might mean.
I will focus on anthropogenic warming that results, via the mis-named `greenhouse effect’, from CO2 produced by burning fossil fuels. There are other human-generated `greenhouse gases’, and other human influences on climate, such as changes in land use, but the usual estimates of their effects are smaller than that of CO2, and in any case, they would call for different policy responses than reducing fossil fuel consumption. Such other anthropogenic influences are, however, a possible complication when trying to determine the effects of CO2 by looking at temperature data.
What I’ll call the `warmer’ view of the effect of CO2 is what is accepted (at least verbally) by most governments, and is more-or-less found in the reports of the Intergovernmental Panel on Climate Change (IPCC) — that burning of fossil fuels increases CO2 in the atmosphere, resulting in a global increase in temperatures large enough to have quite substantial harmful effects on humans and the environment. The contrasting `no-warmer’ view is that increases in CO2 cause little or no warming, either (implausibly) because CO2 has no warming effect, or (somewhat more plausibly) because strong negative feedbacks limit its effects. In between is the `lukewarmer’ view — CO2 has some warming effect, but it is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption. This is the predominant view at some `skeptical’ web sites such as Watts Up With That.
There is also the `extreme-warmer’ view, that the effects of CO2 will be so large as to `fry the planet’, leading to the extinction of humans, and perhaps all life. This view is surprisingly common among the general public, despite being utterly implausible. Of course, those who hold it are encouraged in this belief by alarmist papers such as `Mathematical Modelling of Plankton–Oxygen Dynamics Under the Climate Change’ by Sekerci and Petrovskii, who apparently don’t understand that any arbitrary system of differential equations has a good chance of producing unstable behaviour, and that calling such a system a `model of a coupled plankton–oxygen dynamics’ does not make it a good model. It is very, very unlikely that life on earth would have lasted for over three billion years if the global ecosystem were really as unstable as is suggested in this paper.
The `warmer’ and `lukewarmer’ views are sufficiently plausible that it’s worth asking whether global temperature data has anything to say about which is closer to the truth. An alternative source of evidence is physical theory, embodied in computer simulations. Unfortunately, earth’s climate system is too complex to be simulated without various simplifications and approximations being made, so simulation cannot provide definitive answers, and must ultimately be checked against observations. Observations also have a rhetorical role, being potentially convincing to those who may put no trust in theory and simulation, but who naively think that measuring global temperature is a simple matter of reading thermometers.
Unfortunately, measuring global temperature is not so simple. Earth is a big place, with few observing stations, and every observing station is subject to biases from factors such as changes in the nature of its surroundings and in the time of day when observations are made. Measurements of temperature from space are indirect, and have potential biases from factors such as decaying satellite orbits. All time series of global temperatures are therefore the result of complex processing of raw data, whose appropriateness can be questioned.
It should come as no surprise to those aware of the political nature of this debate that supporters of the `warmer’ and `lukewarmer’ views tend to favour different global temperature datasets, which show different temperature trends in recent years. A favourite of the warmers is NASA’s GISS data, whose land-ocean version combines land temperature observations with sea surface temperature data. This data set was recently revised, with the new version showing a larger upward trend in temperature in recent years. The lukewarmers tend to favour the UAH data from satellite observations, also recently revised, with the new version showing a lower trend than before.
One should note that these two data sets are not measuring the same thing, or even trying to. GISS measures an ill-defined combination of water temperature near the top of the ocean and air temperature a few feet above the ground, in some variety of surroundings. UAH measures temperature in the lower part of the atmosphere, up to about 8000 metres above the surface. So it’s conceivable that the different trends in these two data sets both accurately reflect reality, though if so it’s hard to see how these different trends could continue indefinitely.
I’ll first show the monthly GISS global land-ocean temperatures (retrieved 2015-11-30) from 1880 to the end of 2014. (That’s when some other data I’ll be looking at ends; 2015 is so far mostly warmer than 2014.) These temperatures are expressed as `anomalies’ (in degrees Celsius) with respect to a base period (separately for each month of the year), since absolute values are meaningless given the arbitrary nature of what GISS is measuring. Here they are:
This graph is often portrayed (to the public) as convincing evidence that CO2 causes global warming. Look at that upward trend from about 1910! However, the rise from 1910 to 1940 can’t really be due to CO2. The direct warming effect of CO2 is generally accepted to be proportional to the logarithm of its concentration, with a doubling of CO2 producing roughly one degree Celsius of warming, which might be amplified (or diminished) by feedbacks. Here is a plot of the log base 2 of CO2 over the period above (data from here):
The increase from 1910 to 1940 is only about 0.05, which even with a generous factor of four allowance for positive feedback would give only 0.2 degrees Celsius of warming, compared to the warming of about 0.5 degrees in the GISS data. And if the 1910-1940 warming was really due to CO2, the warming from 1970-2000 should have been even greater than it was. Furthermore, part of the effect of CO2 is expected to be delayed by decades, making it an even less likely explanation of the 1910-1940 warming, since CO2 is thought to have been more-or-less constant before 1880.
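The arithmetic in this paragraph can be written out explicitly. The sketch below uses only the figures quoted in the text (a rise of 0.05 in log base 2 of CO2, one degree per doubling, a feedback factor of four, and about 0.5 degrees of observed warming). The variable names are just labels chosen for this illustration.

```python
# Figures quoted in the text; nothing here is fitted to data.
delta_log2_co2 = 0.05       # rise in log2(CO2 concentration), 1910 to 1940
direct_per_doubling = 1.0   # degrees C per doubling of CO2, before feedbacks
feedback_factor = 4.0       # the "generous" allowance for positive feedback
observed_warming = 0.5      # approximate 1910-1940 warming in GISS, degrees C

# Warming is taken as proportional to the change in log2(CO2).
co2_warming = delta_log2_co2 * direct_per_doubling * feedback_factor

# The same rise expressed as a ratio of concentrations.
concentration_ratio = 2 ** delta_log2_co2

print(f"CO2-attributable warming: {co2_warming:.2f} C")
print(f"share of observed warming: {co2_warming / observed_warming:.0%}")
print(f"CO2 concentration ratio 1940/1910: {concentration_ratio:.3f}")
```

Even with the factor of four, CO2 accounts for well under half of the observed rise on these numbers.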
Clearly, there are other influences on temperature than CO2. Once one realizes this, the upward temperature trend from 1970 to 2000 becomes less convincing as evidence of a warming effect of CO2. Furthermore, since CO2 has been increasing pretty much monotonically for over a hundred years, it is highly confounded with everything else that has been increasing over that period, as well as with long-period cycles. So any really persuasive argument regarding the effect of CO2 must be based on physical theory and on more detailed measurements that can confirm the effects of CO2 at a greater level of detail than a simple global average of temperature. This is the subject of `attribution’ studies, the critique of which is beyond the scope of this blog post (and beyond my expertise).
Nevertheless, there seems to be value in trying to better understand the global temperature data, partly as a `sanity check’ on claims based on more complex, and perhaps more questionable, analyses, and also to see whether there is any evidence of the data being wrong.
To lukewarmers, an aspect of the data that provides evidence of other factors being comparable in importance to CO2 is the `pause’ in warming (or at least a `slowdown’) that one can visually see in the plot above from about 2002. For a closer look, here is the same GISS data, but going back only to 1979:
The UAH satellite temperature data starts in 1979, so we can now compare with it (version 6.0beta4, downloaded 2015-11-30):
The base period for the anomalies in the UAH plot is different from GISS, so only the changes are comparable. (I’ve made the vertical scales match in that respect.)
Both data sets seem visually to show a slowdown or `pause’ around 2002, with this being more prominent in the UAH data (in which one might see the pause as going back as far as 1995). To lukewarmers, the significance of this pause is not that global warming has stopped, showing that CO2 has no effect, since they think that CO2 does have at least some small effect. Rather, they see it as evidence that other effects are large, sometimes large enough to cancel any underlying warming trend from CO2, and sometimes making any such trend appear larger than it actually is — and hence the warming in the 1970-2000 period cannot be taken as indicative of the magnitude of the warming due to CO2, or of what to expect in future.
As alluded to above, simple linear least squares fits to the GISS and UAH data for 1979-2014 show a greater trend for GISS (1.59 degrees C per century) than for UAH (1.12 degrees C per century). But if there is actually a change around 2002, a single trend line is of course largely meaningless.
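To see why a single trend line says little when the trend changes part-way through, here is a minimal sketch of a least-squares slope computed on a synthetic yearly series. The series is invented for illustration (a steady rise until 2002, then flat) and is not GISS or UAH data. The function name `trend_per_century` is likewise an assumption of this example, not something taken from the scripts for this post.

```python
def trend_per_century(years, anomalies):
    """Ordinary least-squares slope of anomalies on years, in degrees per century."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(anomalies) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, anomalies))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return 100.0 * sxy / sxx


# Synthetic series, NOT real data: 0.02 C per year until 2002, flat afterwards.
years = list(range(1979, 2015))
anoms = [0.02 * (min(y, 2002) - 1979) for y in years]

whole = trend_per_century(years, anoms)             # 1979-2014
early = trend_per_century(years[:24], anoms[:24])   # 1979-2002
late = trend_per_century(years[23:], anoms[23:])    # 2002-2014

print(f"whole period: {whole:.2f}, before 2002: {early:.2f}, after 2002: {late:.2f}")
```

The whole-period slope falls between the two segment slopes and describes neither of them.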
Reactions to the `pause’ (or `hiatus’) from the warmer camp have taken several forms:
1. Claims that the pause is an artifact of poorly adjusted temperature measurements, that disappears when adjustments are done properly.
2. Claims that the visual appearance of a pause is deceiving — that the `pause’ is just chance variation, which the human eye overinterprets.
3. Claims that if one subtracts changes due to known effects, such as volcanic eruptions, the pause disappears, showing that the underlying trend due to CO2 continues unabated. (Note that depending on the size of the underlying trend that is revealed, this would not necessarily be contrary to lukewarmer views.)
4. Claims that warming from CO2 continues at a substantial rate, but that the heat is going somewhere that escapes measurement in global temperature data sets.
I will leave claims in category (4) for others to critique.
Claims in category (3) include a blog post by `tamino’. I plan to present my own analysis of this sort in a future blog post, and compare to that of `tamino’.
Two recent papers making claims in category (2) are `Debunking the climate hiatus’, by Rajaratnam, Romano, Tsiang, and Diffenbaugh, and `On the definition and identifiability of the alleged “hiatus” in global warming’, by Lewandowsky, Risbey, and Oreskes. Both of these papers look at (or say they look at) the GISS land-ocean temperature data, displayed above, but before the recent revision. I plan to comment on these papers in my next blog post.
Regarding (1), the GISS temperatures displayed above show a less prominent `pause’ than the version of GISS land-ocean temperatures distributed prior to July 2015 (obtained from the wayback machine’s version of 2015-04-18, stored here), which is shown below:
The revision results in a greater upward trend during the `pause’ period, as shown by the following plot of differences (with enlarged vertical scale):
To tell whether or not this revision was justified, one would need to examine in depth the temperature adjustments done for the GISS data set, which I haven’t done.
However, it’s not too hard to see some interesting things by examining the GISS land-ocean temperature data in more detail. I’ll look only at the most recent version (accessed 2015-11-30).
First, one can look separately at the Northern Hemisphere:
and Southern Hemisphere:
The difference is rather striking. One would expect some overall difference due to the greater amount of ocean in the Southern Hemisphere, and the different nature of the polar regions. But that doesn’t explain the abrupt increase in the scatter of Southern Hemisphere data points after about 1955.
We can also look at each month of the year separately. Here’s the Northern Hemisphere:
And here’s the Southern Hemisphere:
In the Northern Hemisphere, variability is obviously greater in winter than in summer. The variability in the Southern Hemisphere winter seems slightly greater than in summer, but much less so than in the Northern Hemisphere. These are differences that I’ll take account of when modeling this data later.
I’ve marked 1955 by a short line at the bottom. In the Northern Hemisphere, the dip in January temperatures from 1955 to 1975 seems odd, since it doesn’t show up in December and February, but it’s hard to be sure that it’s not a real climatic effect. Something does happen around 1955 in the Southern Hemisphere plots, which increases the variance in May and August, and maybe June, July, and September. This can be confirmed by looking at plots for each of the 12 months of the year that show the difference of the anomaly for that month from the average anomaly for that month in the three preceding and three following years:
May through September seem to have higher variability in the years after 1955, and this is very clear for at least May and August. In contrast, similar plots for the Northern Hemisphere show no change in variance, or perhaps a slight decline after 1955 for May and June. It’s hard to see how this Southern Hemisphere variance change can reflect a real change in climate, given its abrupt onset, and that it does not appear in the Northern Hemisphere. More likely, it is an artifact of how the data is processed. A rapid improvement in quality of measurements after World War II is another possible explanation (though one would expect that to lead to less variability, rather than more).
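The quantity plotted for each month, the anomaly minus the average anomaly for the same month in the three preceding and three following years, can be sketched as follows. This is an illustrative Python version, not the R script used for the plots. It assumes `values` is a plain list holding one calendar month's anomalies for consecutive years, and it simply skips years that lack three neighbours on both sides.

```python
def neighbour_difference(values, width=3):
    """Each value minus the mean of the `width` values before and after it.

    Only positions with `width` neighbours on both sides are returned,
    so the result is shorter than the input by 2 * width.
    """
    out = []
    for i in range(width, len(values) - width):
        neighbours = values[i - width:i] + values[i + 1:i + 1 + width]
        out.append(values[i] - sum(neighbours) / len(neighbours))
    return out


# A steady trend cancels out, while an isolated spike stands out.
print(neighbour_difference([1, 2, 3, 4, 5, 6, 7]))   # [0.0]
print(neighbour_difference([0, 0, 0, 6, 0, 0, 0]))   # [6.0]
```

Because a smooth trend largely cancels in this difference, its spread before and after 1955 reflects year-to-year scatter rather than long-term warming.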
Whatever the reason, it seems that relying on GISS data before 1955 might be unwise. In my later analyses, I will look at data only from 1959, since that is when some other related data sets begin, or from 1979 when comparing to the UAH data.
I note that obtaining all but the most recent GISS data is difficult. Some versions can be accessed at the wayback machine, but many versions apparently saved there produce an ‘access denied’ error. UAH has an extensive archive, but even it seems not to have all the versions that were distributed. GISS distributes the programs they use, but only the current version. I can’t find any programs at the UAH website. Both GISS and UAH ought to have a public repository that uses a source-code control system such as git, which would allow all versions of programs, raw data, and processed data to be accessed, with documentation of all changes.
To reproduce the results in this post, you will first need to download the data using this shell script (which downloads other data too, that I will use for later blog posts), or manually download from the URLs it lists if you don’t have wget. You then need to download my R script for reading these files, and my R script for making the plots (and rename them to .r from the .doc that wordpress requires). Finally, run the second script in R as described in its opening comments.
A while ago, I came across a mention of the Python math.fsum function, which sums a set of floating-point values exactly, then rounds to the closest floating point value. This seemed useful. In particular, I thought that if it’s fast enough it could be used instead of R’s rather primitive two-pass approach to trying to compute the sample mean more accurately (but still not exactly). My initial thought was to just implement the algorithm Python uses in pqR. But I soon discovered that there were newer (and faster) algorithms. And then I thought that I might be able to do even better…
The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).
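For readers who haven't used `math.fsum`, this short Python example shows the behaviour that makes exact summation attractive. It uses only the standard library and does not implement the superaccumulator methods of the paper.

```python
import math

# Ordinary left-to-right summation loses the 1.0, because 1e16 + 1.0
# rounds back to 1e16 in double precision.
values = [1e16, 1.0, -1e16]
naive = sum(values)
exact = math.fsum(values)
print(naive, exact)   # 0.0 1.0

# Rounding errors also accumulate when adding 0.1 ten times.
tenths = [0.1] * 10
print(sum(tenths) == 1.0, math.fsum(tenths) == 1.0)   # False True

# A sample mean built on the exactly rounded sum.
mean = math.fsum(values) / len(values)
```

The mean computed this way carries one rounding from the sum and one from the division.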
Vectors in R can currently have elements of two sizes — 8-byte double-precision floating-point elements for `numeric’ vectors, or 4-byte elements for `integer’ or `logical’ vectors. You can also have vectors whose elements are 1-byte `raw’ values, but these raw vectors don’t support negative numbers, or NA values, so they aren’t suitable for general use.
It seems that lots of actual data vectors could be stored more compactly than at present. Many integer vectors consist solely of elements that would fit in one or two bytes. Logical vectors could be stored using two bits per element (allowing TRUE, FALSE, and NA), which would use only one-sixteenth as much memory as at present. It’s likely that many operations would also be faster on such compact vectors, so there’s not even necessarily a time-space tradeoff.
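As a rough illustration of the two-bit idea, the following Python sketch packs logical values four to a byte. The particular codes (0 for FALSE, 1 for TRUE, 2 for NA) and the function names are assumptions made for this example, with Python's `None` standing in for NA. It says nothing about how an R interpreter would actually lay out such vectors.

```python
# Hypothetical 2-bit codes; None plays the role of NA.
ENCODE = {False: 0, True: 1, None: 2}
DECODE = {0: False, 1: True, 2: None}


def pack_logicals(values):
    """Pack True/False/None values into bytes, four elements per byte."""
    packed = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        packed[i // 4] |= ENCODE[v] << (2 * (i % 4))
    return bytes(packed)


def unpack_logicals(packed, n):
    """Recover the first n elements from a packed byte string."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n)]


vals = [True, False, None, True, True]
packed = pack_logicals(vals)
print(len(packed), unpack_logicals(packed, len(vals)) == vals)   # 2 True

# Sixteen elements need 4 bytes packed, against 64 bytes at 4 bytes each.
print(4 * 16 // len(pack_logicals([True] * 16)))   # 16
```

The final line reproduces the one-sixteenth figure, since 32 bits per element shrink to 2.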
For integer and logical types, the possible compact representations, and how to work with them, are fairly obvious. The challenge is how to start using such compact representations while retaining compatibility with existing R code, including functions written in C, Fortran, or whatever. Of course, one could use the S3 or S4 class facilities to define new classes for data stored compactly, with suitable redefinitions of standard operators such as `+’, but this would have substantial overhead, and would in any case not completely duplicate the behaviour of non-compact numeric, integer, or logical vectors. Below, I discuss how to implement compact representations in a way that is completely invisible to R programs. I hope to try this out in my pqR implementation of R sometime, though other improvements to pqR have higher priority at the moment.
How to compactly represent floating-point data (of R’s `numeric’ type) is not so obvious. If the use of a compact representation is to have no effect on the results, one cannot just use single-precision floating point. I describe a different approach in a new paper on Representing numeric data in 32 bits while preserving 64-bit precision (also on arxiv). I’ll present the idea of this paper next, before returning to the question of how one might put compact representations of any sort into an R interpreter, invisibly to R programs. (more…)