Posts filed under ‘Statistics’

Faster garbage collection in pqR

The latest version of pqR, like the version before it, uses a new garbage collector and new memory layouts for R objects, which together reduce memory usage and considerably speed up garbage collection.

(more…)

2018-11-29 at 7:45 pm

The new pqR parser, and R’s “else” problem

The latest version of pqR (mostly) solves R’s “else” problem, by modifying the new parser previously introduced in pqR.  I’ll explain the “else” problem and solution here, and also present other advantages of pqR’s parser over the R Core parser, including a big speed advantage in one context.

(more…)

2018-11-27 at 8:33 pm 1 comment

New version of pqR, with major speed improvements

I’ve released pqR-2018-11-18, a new version of my variant implementation of R.  You can install it on Linux, Windows, or Mac as described at pqR-project.org. Installation must currently be from source, similarly to source installs of R Core versions of R.

This version has some major speed improvements, as well as some new features. I’ll detail some of these improvements in future posts. Here, I’ll just mention a few things to show the flavour of the improvements in this release, and why you might be interested in pqR as an alternative to the R Core implementation.

Performance improvements

One landmark reached in this release is that it is no longer advisable to use the byte-code compiler in pqR. The speed of direct interpretation of R code has now been improved to the point where it is about as fast at executing simple scalar code as the byte-code interpreter. Eliminating the byte-code compiler simplifies the overall implementation, and avoids possible semantic differences between interpreted and byte-compiled code. It is also important for pqR because some pqR optimizations and some new pqR features are not implemented in byte-code. For example, only the interpreter does optimizations such as deferring vector operations so that they may automatically be merged with other operations or be done in parallel when multiple cores are available.

Some vector operations have been substantially sped up compared to the previous release, pqR-2017-06-09. The improvement compared to R-3.5.1 can be even greater. Here is an example of replacing a subset of vector elements, benchmarked on an Intel “Skylake” processor, with both pqR-2018-11-18 and R-3.5.1 compiled from source with gcc 8.2.0 at optimization level -O3:

Here’s R-3.5.1:

> a <- numeric(20000)
> system.time(for (i in 1:100000) a[2:19999] <- 3.1)
   user  system elapsed 
  4.211   0.148   4.360 

And here’s pqR-2018-11-18:

> a <- numeric(20000)
> system.time(for (i in 1:100000) a[2:19999] <- 3.1)
   user  system elapsed 
  0.256   0.000   0.257 

So the current R Core implementation is 17 times slower than pqR for this replacement operation.

The advantage of pqR isn’t always this large, but many vector operations are sped up by smaller but still significant factors. An example:

With R-3.5.1:

> a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
> system.time (for (i in 1:100000) {
+       d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
   user  system elapsed 
  1.215   0.015   1.231 

With pqR-2018-11-18:

> a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
> system.time (for (i in 1:100000) {
+       d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
   user  system elapsed 
  0.654   0.008   0.662 

So for this example, pqR is almost twice as fast.

For some operations, pqR’s implementation has lower asymptotic time complexity, and so can be enormously faster. An example is the following convenient coding pattern that R programmers are currently warned to avoid:

With R-3.5.1:

> n <- 200000; a <- numeric(0);
> system.time (for (i in 1:n) a <- c(a,(i+1)^2))
   user  system elapsed 
 30.387   0.223  30.612 

With pqR-2018-11-18:

> n <- 200000; a <- numeric(0);
> system.time (for (i in 1:n) a <- c(a,(i+1)^2))
   user  system elapsed 
  0.040   0.004   0.045 

In R-3.5.1, extending a vector one element at a time with “c” takes time growing as n², as a new vector is allocated when each element is appended. With the latest version of pqR, the time grows only as n log n. In this example, that leads to pqR being 680 times faster, but the ratio could be made arbitrarily large by increasing n.

It’s still faster in pqR to preallocate a vector of length n, but only by about a factor of three, which would often be tolerable when writing one-off code if using “c” is more convenient.
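The two coding patterns being compared can be sketched as follows. This is only an illustration in plain R: the function names `append_with_c` and `fill_preallocated` are made up for this sketch, and n is kept much smaller than in the timings above so that it runs quickly even where appending takes quadratic time.

```r
# Hypothetical helper names, chosen for this sketch only.

# Grow the vector one element at a time with c().
append_with_c <- function(n) {
  a <- numeric(0)
  for (i in seq_len(n)) a <- c(a, (i + 1)^2)
  a
}

# Allocate the full vector first, then fill it in place.
fill_preallocated <- function(n) {
  a <- numeric(n)
  for (i in seq_len(n)) a[i] <- (i + 1)^2
  a
}

n <- 10000  # deliberately small; the timings above use n = 200000

time_append   <- system.time(a1 <- append_with_c(n))
time_prealloc <- system.time(a2 <- fill_preallocated(n))

# Both patterns build exactly the same vector; only the cost differs.
same_result <- identical(a1, a2)
print(same_result)
print(rbind(append = time_append, prealloc = time_prealloc))
```

The timings printed will depend on the machine and on which R implementation runs the code, so no particular numbers should be expected from this sketch.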

New features

The latest version of pqR has some new features. As with earlier pqR versions, some new features are aimed at addressing design flaws in R that lead to unreliable code, and others are aimed at making R more convenient for programming and scripting.

One new convenience feature is that the paste and paste0 operations can now be written with new !! and ! operators. For example,

> city <- "Toronto"; province <- "Ontario"
> city ! "," !! province
[1] "Toronto, Ontario"

The !! operator pastes strings together with space separation; the ! operator pastes with no separation. Of course, ! retains its meaning of “not” when used as a unary operator; there is no ambiguity.
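For comparison, the same string can be built with the standard paste and paste0 functions, which work in any version of R. This is just the conventional equivalent of the example above, not a description of how pqR implements the operators; the variable name `address` is made up for the sketch.

```r
city <- "Toronto"; province <- "Ontario"

# paste0() joins with no separator, like the binary ! operator;
# paste() joins with a single space, like the !! operator.
address <- paste(paste0(city, ","), province)
print(address)
```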

What next?

I’ll be writing some more blog posts regarding improvements in pqR-2018-11-18, and regarding some improvements in earlier pqR versions that I haven’t already blogged about. Of course, you can read about these now in the pqR NEWS file.

The main disadvantage of pqR is that it is not fully compatible with the current R Core version. It is a fork of R-2.15.0, with many, but not all, later changes incorporated. This affects what packages will work with pqR.

Addressing this compatibility issue is one thing that needs to be done going forward. I’ll discuss this and other plans — notably implementing automatic differentiation — in another future blog post.

I’m open to other people getting involved in this project. Of course, you can contribute now by trying out pqR and reporting any problems in the comments here or at the pqR issues page. Performance comparisons, especially on real-world applications, are also welcome.

Finally, for the paranoid, here are the shasum values for the compressed and uncompressed tar files that you can download from pqR-project.org:

89216dc76be23b3928c26561acc155b6e5ad32f3  pqR-2018-11-18.tar.gz
f0ee8a37198b7e078fa1aec7dd5cda762f1a7799  pqR-2018-11-18.tar

2018-11-25 at 5:45 pm 5 comments

New release of pqR — faster, and now compatible with R-2.15.1

I have released a new version of my “pretty quick” R interpreter, pqR-2016-10-05.

One major change with this version is that pqR, which was based on R-2.15.0, is now compatible with R-2.15.1.  This allows for an increased number of packages in the pqR repository.

This release also has some significant speed improvements, a new form of the “for” statement, for conveniently iterating across columns or down rows of a matrix, and a new, less error-prone way for C functions to “protect” objects from garbage collection. There are also a few bug fixes (including fixes for some bugs that are also in the current R Core release).

You can read more in the NEWS file, and get it from pqR-project.org.

Currently, pqR is distributed in source form only, and so you need to be comfortable compiling it yourself. It has been tested on Linux/Unix systems (with Intel/AMD, ARM, PowerPC, and SPARC processors), on Mac OS X (including macOS Sierra), and on Microsoft Windows (XP, 7, 8, 10) systems.

I plan to soon put up posts with more details on some of the features of this and the previous pqR release, as well as a post describing some of my future plans for pqR.

2016-10-08 at 9:01 pm

Fixing R’s design flaws in a new version of pqR

I’ve released a new version of my pqR implementation of R. This version introduces extensions to the R language that fix some long-standing design flaws that were inherited from S.

These language extensions make it easier to write reliable programs that work even in edge cases, such as data sets with one observation.

In particular, the extensions fix the problems that 1:n doesn’t work as intended when n is zero, and that M[1:n,] is a vector rather than a matrix when n is one, or when M has only one column. Since changing the “:” operator would cause too many problems with existing programs, pqR introduces a new “..” operator for generating increasing sequences.  Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.
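The two problems are easy to reproduce in ordinary R. The sketch below shows them along with the usual base R workarounds, seq_len and drop=FALSE; it does not use pqR's new “..” operator or its other extensions, and the variable names are chosen only for illustration.

```r
# Problem 1: 1:n counts downward when n is zero.
n <- 0
len_colon   <- length(1:n)         # 1:0 is c(1, 0), so a loop runs twice
len_seq_len <- length(seq_len(n))  # seq_len(0) is empty, as intended

# Problem 2: subscripting drops dimensions when one row is selected.
M <- matrix(1:6, nrow = 3)
n <- 1
dropped <- M[1:n, ]                # a plain vector of length 2
kept    <- M[1:n, , drop = FALSE]  # still a 1 x 2 matrix

print(c(len_colon, len_seq_len))
print(is.matrix(dropped))
print(dim(kept))
```

The workarounds do the job, but they have to be remembered at every use, which is the kind of unreliability the extensions described above are meant to remove.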

The new release, pqR-2016-06-24, is available at pqR-project.org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).
(more…)

2016-06-25 at 8:46 pm 13 comments

Critique of ‘Debunking the climate hiatus’, by Rajaratnam, Romano, Tsiang, and Diffenbaugh

Records of global temperatures over the last few decades figure prominently in the debate over the climate effects of CO2 emitted by burning fossil fuels, as I discussed in my first post in this series, on What can global temperature data tell us? One recent controversy has been whether or not there has been a `pause’ (also referred to as a `hiatus’) in global warming over the last 15 to 20 years, or at least a `slowdown’ in the rate of warming, a question that I considered in my second post, on Has there been a `pause’ in global warming?

As I discussed in that post, the significance of a pause in warming since around 2000, after a period of warming from about 1970 to 2000, would be to show that whatever the warming effect of CO2, other factors influencing temperatures can be large enough to counteract its effect, and hence, conversely, that such factors could also be capable of enhancing a warming trend (e.g., from 1970 to 2000), perhaps giving a misleading impression that the effect of CO2 is larger than it actually is. To phrase this more technically, a pause, or substantial slowdown, in global warming would be evidence that there is a substantial degree of positive autocorrelation in global temperatures, which has the effect of rendering conclusions from apparent temperature trends more uncertain.
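The effect of positive autocorrelation on apparent trends can be illustrated with a small simulation. This is a toy sketch on synthetic data, not an analysis of any temperature record: the AR(1) model, the autocorrelation of 0.8, the 30-year length, and the helper names `simulate_ar1` and `trend_slope` are all assumptions made for the illustration. Both kinds of series have no true trend and the same marginal variance, so any difference in the spread of fitted slopes comes from the autocorrelation alone.

```r
# Hypothetical helpers for this sketch only.

# Stationary AR(1) series with marginal variance 1 and no trend.
simulate_ar1 <- function(n, rho) {
  x <- numeric(n)
  x[1] <- rnorm(1)
  for (t in seq_len(n - 1) + 1)
    x[t] <- rho * x[t - 1] + rnorm(1, sd = sqrt(1 - rho^2))
  x
}

# Least-squares slope of a series regressed on time.
trend_slope <- function(y) {
  t <- seq_along(y)
  cov(t, y) / var(t)
}

set.seed(1)
n_years <- 30
n_sims  <- 1000

slopes_white <- replicate(n_sims, trend_slope(rnorm(n_years)))
slopes_ar1   <- replicate(n_sims, trend_slope(simulate_ar1(n_years, rho = 0.8)))

# Spread of the apparent trend when the true trend is zero.
sd_white <- sd(slopes_white)
sd_ar1   <- sd(slopes_ar1)
print(c(white = sd_white, ar1 = sd_ar1))
```

With these settings the autocorrelated series produce a noticeably wider spread of fitted slopes than the independent ones, so a trend of a given size is weaker evidence when autocorrelation is present.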

Whether you see a pause in global temperatures may depend on which series of temperature measurements you look at, and there is controversy about which temperature series is most reliable. In my previous post, I concluded that even when looking at the satellite temperature data, for which a pause seems most visually evident, one can’t conclude definitely that the trend in yearly average temperature actually slowed (ignoring short-term variation) in 2001 through 2014 compared to the period 1979 to 2000, though there is also no definite indication that the trend has not been zero in recent years.

Of course, I’m not the only one to have looked at the evidence for a pause. In this post, I’ll critique a paper on this topic by Bala Rajaratnam, Joseph Romano, Michael Tsiang, and Noah S. Diffenbaugh, Debunking the climate hiatus, published 17 September 2015 in the journal Climatic Change. Since my first post in this series, I’ve become aware that `tamino’ has also commented on this paper, here, making some of the same points that I will make.  I’ll have more to say, however, some of which is of general interest, apart from the debate on the `pause’ or `hiatus’. (more…)

2016-01-10 at 11:43 pm 31 comments

Has there been a ‘pause’ in global warming?

As I discussed in my previous post, records of global temperatures over the last few decades figure prominently in the debate over the climate effects of CO2 emitted by burning fossil fuels. I am interested in what this data says about which of the reasonable positions in this debate is more likely to be true —  the `warmer’ position, that CO2 from burning of fossil fuels results in a global increase in temperatures large enough to have quite substantial (though not absolutely catastrophic) harmful effects on humans and the environment, or  the `lukewarmer’ position, that CO2 has some warming effect, but this effect is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption.

A recent focus of this debate has been whether temperature records show a `pause’ (or `hiatus’) in global warming over the last 10 to 20 years (or at least a `slowdown’ compared to the previous trend), and if so, what it might mean. Lukewarmers might interpret such a pause as evidence that other factors are comparable in importance to CO2, and can temporarily mask or exaggerate its effects, and hence that naively assuming the warming from 1970 to 2000 is primarily due to CO2 could lead one to overestimate the effect of CO2 on temperature.

Whether you see a pause might, of course, depend on which data set of global temperatures you look at. These data sets are continually revised, not just by adding the latest observations, but by readjusting past observations. (more…)

2015-12-19 at 11:52 pm 17 comments
