## Posts filed under ‘Statistics – Computing’

### New version of pqR, with automatic differentiation and arithmetic on lists

I’ve released pqR-2020-07-23, a new version of my variant implementation of R. You can install it on Linux, Windows, or Mac as described at pqR-project.org. Installation must currently be from source, similarly to source installs of R Core versions of R.

This version has preliminary implementations of automatic differentiation and of arithmetic on lists. These are both useful for gradient-based optimization, such as maximum likelihood estimation and neural network training, as well as gradient-based MCMC methods. List arithmetic is helpful when dealing with models that have several groups of parameters, which are most conveniently represented using a list of vectors or matrices, rather than a single vector.
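To illustrate why list arithmetic is convenient here, the following is a minimal sketch in standard R (not pqR) of a gradient-descent update for parameters held in a list. The objective, the hand-written `grad` function, and the names `params` and `step` are all hypothetical, chosen only for illustration. Standard R has no arithmetic on lists, so the update has to be applied group by group with `Map`; the syntax pqR actually provides for this is described in its documentation and is not shown here.

```r
# Parameters in two groups, held as a list (names are illustrative).
params <- list(w = c(1, -2, 3), b = 0.5)

# Gradient of f(params) = sum(w^2) + b^2, written by hand for this sketch.
grad <- function(p) list(w = 2 * p$w, b = 2 * p$b)

# One gradient-descent step. Standard R has no arithmetic on lists, so the
# update is applied to each group in turn with Map.
step <- function(p, g, rate) Map(function(x, d) x - rate * d, p, g)

for (i in 1:100) params <- step(params, grad(params), rate = 0.1)
print(params)
```

Each step shrinks every parameter by a factor of 0.8, so after 100 steps all values are very close to the minimum at zero.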

You can read the documentation on these facilities here and here. Some example programs are in this repository. I previously posted about the automatic differentiation facilities here. Automatic differentiation and arithmetic on lists for pqR are both discussed in this talk, along with some other proposals.

For the paranoid, here are the shasum values for the compressed and uncompressed tar files that you can download from pqR-project.org, allowing you to verify that they were downloaded uncorrupted:

    c1b389861f0388b90122cbe1038045da30879785  pqR-2020-07-23.tar.gz
    04b4586601d8796b12c310cd4bf81dc057f33bb2  pqR-2020-07-23.tar

### Critique of “Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period” — Part 4: Modelling R, seasonality, immunity

In this post, fourth in a series (previous posts: Part 1, Part 2, Part 3), I’ll finally talk about some substantive conclusions of the following paper:

Kissler, Tedijanto, Goldstein, Grad, and Lipsitch, Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period, Science, vol. 368, pp. 860-868, 22 May 2020 (released online 14 April 2020). The paper is also available here, with supplemental materials here.

In my previous post, I talked about how the authors estimate the reproduction numbers (*R*) over time for the four common cold coronaviruses, and how these estimates could be improved. In this post, I’ll talk about how Kissler et al. use these estimates for *R* to model immunity and cross-immunity for these viruses, and the seasonal effects on their transmission. These modelling results inform the later parts of the paper, in which they consider various scenarios for future transmission of SARS-CoV-2 (the coronavirus responsible for COVID-19), whose characteristics may perhaps resemble those of these other coronaviruses.

The conclusions that Kissler et al. draw from their model do not seem to me to be well supported. The problems start with the artifacts and noise in the proxy data and *R* estimates, which I discussed in Part 2 and Part 3. These issues with the *R* estimates induce Kissler et al. to model smoothed *R* estimates, which results in autocorrelated errors that invalidate their assessments of uncertainty. The noise in *R* estimates also leads them to limit their model to the 33 weeks of “flu season”; consequently, their model cannot possibly provide a full assessment of the degree of seasonal variation in *R*, which is one matter of vital importance. The conclusions Kissler et al. draw from their model regarding immunity and cross-immunity for the betacoronaviruses are also flawed, because they ignore the effects of aggregation over the whole US, and because their model is unrealistic and inconsistent in its treatment of immunity during a season and at the start of a season. A side effect of this unrealistic immunity model is that the partial information on seasonality that their model produces is biased.
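The general point about smoothing and autocorrelation can be seen in a small simulation. This is only a sketch: it uses a simple 3-point moving average on simulated independent noise, which is an assumption made for illustration and not the smoothing method used by Kissler et al. The helper name `lag1` is likewise hypothetical.

```r
set.seed(1)
n <- 1000
noise <- rnorm(n)                       # independent errors

# Centred 3-point moving average; the two end values are NA and are dropped.
smoothed <- as.numeric(stats::filter(noise, rep(1/3, 3), sides = 2))
smoothed <- smoothed[!is.na(smoothed)]

# Sample autocorrelation at lag 1.
lag1 <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

cat("lag-1 autocorrelation, raw:     ", round(lag1(noise), 2), "\n")
cat("lag-1 autocorrelation, smoothed:", round(lag1(smoothed), 2), "\n")
```

For a 3-point moving average of independent errors, the theoretical lag-1 autocorrelation is 2/3, since adjacent smoothed values share two of their three inputs. Standard errors computed as if the smoothed values had independent errors will therefore be too small.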

After justifying these criticisms of Kissler et al.’s results, I will explore what can be learned using better incidence proxies and *R* estimates, and better models of seasonality and immunity.

The code I use (written in R) is available here, with GPLv2 licence.

### Critique of “Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period” — Part 3: Estimating reproduction numbers

This is the third in a series of posts (previous posts: Part 1, Part 2, next post: Part 4) in which I look at the following paper:

Kissler, Tedijanto, Goldstein, Grad, and Lipsitch, Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period, Science, vol. 368, pp. 860-868, 22 May 2020 (released online 14 April 2020). The paper is also available here, with supplemental materials here.

In this post, I’ll look at how the authors estimate the reproduction numbers (*R*) over time for the four common cold coronaviruses, using the proxies for incidence that I discussed in Part 2. These estimates for *R* are used to model immunity and cross-immunity for these viruses, and the seasonal effects on their transmission. These modelling results inform the later parts of the paper, in which they consider various scenarios for future transmission of SARS-CoV-2 (the coronavirus responsible for COVID-19), whose characteristics may perhaps resemble those of these other coronaviruses.

I will be using the code (written in R) available here, with GPLv2 licence, which I wrote to replicate the results in the paper. Compared with the code provided by the authors (which I discussed in Part 1), it lets me more easily produce plots that help in understanding issues with the methods, and try out alternative methods that may work better. (more…)

### Critique of “Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period” — Part 2: Proxies for incidence of coronaviruses

This is the second in a series of posts (previous post: Part 1, next post: Part 3) in which I look at the following paper:

Kissler, Tedijanto, Goldstein, Grad, and Lipsitch, Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period, Science, vol. 368, pp. 860-868, 22 May 2020 (released online 14 April 2020). The paper is also available here, with supplemental materials here.

In this post, I’ll start to examine in detail the first part of the paper, where the authors look at past incidence of “common cold” coronaviruses, estimate the viruses’ reproduction numbers (R) over time, and use those estimates to model immunity and cross-immunity for these viruses, and seasonal effects on their transmission. The results of this part inform the later parts of the paper, in which they model the two common cold betacoronaviruses together with SARS-CoV-2 (the virus for COVID-19), and look at various scenarios for the future, varying the duration of immunity for SARS-CoV-2, the degree of cross-immunity of SARS-CoV-2 and common cold betacoronaviruses, and the effect of season on SARS-CoV-2 transmission.

In my previous post, I used the partial code released by the authors to try to reproduce the results in the first part of the paper. I was eventually able to do this. For this and future posts, however, I will use my own code, with which I can also replicate the paper’s results. This code allows me to more easily produce plots to help understand issues with the methods, and to try out alternative methods. The code (written in R) is available here, with GPLv2 licence. The data used is also included in this repository.

In this second post of the series, I examine how Kissler et al. produce proxies for the incidence of infection in the United States by the four common cold coronaviruses. I’ll look at some problems with their method, and propose small changes to try to fix them. I’ll also try out some more elaborate alternatives that may work better.

The coronavirus proxies are the empirical basis for the remainder of the paper. (more…)

### Critique of “Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period” — Part 1: Reproducing the results

UPDATES: Next post in series: Part 2. Minor fix at ~~strikethrough~~ before last figure.

I’ve been looking at the following paper, by researchers at Harvard’s school of public health, which was recently published in *Science*:

Kissler, Tedijanto, Goldstein, Grad, and Lipsitch (2020) Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period (also available here, with supplemental materials here).

This is one of the papers referenced in my recent post on seasonality of COVID-19. The paper does several things that seem interesting:

- It looks at past incidence of “common cold” coronaviruses, estimating the viruses’ reproduction numbers (R) over time, and from that their degrees of cross-immunity and the seasonal effect on their transmission.
- It fits an ODE model for the two common cold betacoronaviruses, which are related to SARS-CoV-2 (the virus for COVID-19), using the same data.
- It then adds SARS-CoV-2 to this ODE model, and looks at various scenarios for the future, varying the duration of immunity for SARS-CoV-2, the degree of cross-immunity of SARS-CoV-2 and common cold betacoronaviruses, and the effect of season on SARS-CoV-2 transmission.

In future posts, I’ll discuss the substance of these contributions. In this post, I’ll talk about my efforts at reproducing the results in the paper from the code and data available, which is a prerequisite for examining why the results are as they are, and for looking at how the methods used might be improved.

I’ll also talk about an amusing / horrifying aspect of the R code used, which I encountered along the way, about CDC data sharing policy, and about the authors’ choices regarding some graphical presentations. (more…)

### Software for Flexible Bayesian Modeling – New release

I’ve released a new version of my Software for Flexible Bayesian Modeling and Markov Chain Sampling (FBM). This is the first public release since 2004, with the first release of the precursor software being in 1995. There was a version mostly completed in 2007 that never got released (due to my not getting around to checking that I’d fixed up everything). The new version has the changes from 2007 plus some more recent updates, including new features used for the tests in this paper.

FBM implements several general-purpose Markov chain sampling methods, such as Metropolis updates, Hamiltonian (Hybrid) Monte Carlo, and slice sampling. These methods can be applied to distributions defined by simple formulas, including posterior distributions for simple Bayesian models. Several more specialized modules have been written that implement posterior distributions for more complex models, including Bayesian neural networks, Gaussian process models, and mixture models (including Dirichlet process mixture models).
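As a reminder of what the simplest of these methods looks like, here is a minimal sketch of random-walk Metropolis sampling, written in R. It is not FBM's implementation or interface; the function `metropolis`, its arguments, and the standard normal target are all assumptions made for illustration.

```r
set.seed(1)

# Log density of the target, up to an additive constant (standard normal).
log_density <- function(x) -0.5 * x^2

metropolis <- function(log_density, start, n_iter, step_sd) {
  x <- numeric(n_iter)
  current <- start
  current_lp <- log_density(current)
  for (i in seq_len(n_iter)) {
    # Symmetric random-walk proposal centred on the current state.
    proposal <- current + rnorm(1, sd = step_sd)
    proposal_lp <- log_density(proposal)
    # Accept with probability min(1, p(proposal) / p(current)).
    if (log(runif(1)) < proposal_lp - current_lp) {
      current <- proposal
      current_lp <- proposal_lp
    }
    x[i] <- current
  }
  x
}

draws <- metropolis(log_density, start = 0, n_iter = 20000, step_sd = 2)
cat("mean:", round(mean(draws), 2), " sd:", round(sd(draws), 2), "\n")
```

The sample mean and standard deviation should be close to 0 and 1, the values for the target distribution.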

(more…)

### Automatic differentiation in pqR

I’ve released a version of my pqR implementation of R that has extensions for automatic differentiation. This is not a stable release, but it can be downloaded from pqR-project.org (look for the test version at the bottom) and installed the same as other pqR versions (from source, so you’ll need C and Fortran compilers).

Note that this version very likely has various bugs — mostly showing up only if you use automatic differentiation, I hope.

You can read about the automatic differentiation facilities here, or with help(Gradient) after installing the test version. Below are a few examples to show a bit of what you can do.
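For comparison with what is already available in standard R, here is a minimal sketch using `deriv()` from the stats package. Note that `deriv()` does symbolic differentiation of an expression, which is a different technique from automatic differentiation, and this is not pqR's syntax; the function `sin(x)^2` is just an illustrative choice.

```r
# Symbolic differentiation of f(x) = sin(x)^2 with standard R's deriv().
f <- deriv(~ sin(x)^2, "x", function.arg = TRUE)

y <- f(1)
value <- as.numeric(y)                     # f(1)
gradient <- attr(y, "gradient")[1, "x"]    # f'(1) = 2 sin(1) cos(1)
cat("value:", value, " gradient:", gradient, "\n")
```

The result of `f(1)` carries the derivative in a "gradient" attribute alongside the function value.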

(more…)

### Faster garbage collection in pqR

The latest two versions of pqR use a new garbage collector and new memory layouts for R objects, which together reduce memory usage and considerably speed up garbage collection.

### The new pqR parser, and R’s “else” problem

The latest version of pqR (mostly) solves R’s “else” problem, by modifying the new parser previously introduced in pqR. I’ll explain the “else” problem and solution here, and also present other advantages of pqR’s parser over the R Core parser, including a big speed advantage in one context.
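On the assumption that the “else” problem refers to the well-known behaviour of the R Core parser, in which an `else` at the start of a line is a syntax error at top level but is accepted inside braces, the following sketch demonstrates it. The helper `parses` and the two source strings are hypothetical names used only here, and the results described are for R Core versions of R, not pqR.

```r
# Does a piece of R source text parse without a syntax error?
parses <- function(src) {
  tryCatch({ parse(text = src); TRUE }, error = function(e) FALSE)
}

# The same if/else statement, at top level and wrapped in braces.
top_level <- "if (x > 0) y <- 1\nelse y <- 2"
in_braces <- "{\nif (x > 0) y <- 1\nelse y <- 2\n}"

cat("top level:", parses(top_level), "\n")
cat("in braces:", parses(in_braces), "\n")
```

In R Core versions, the top-level form fails to parse, because the `if` statement is taken to be complete at the end of its line, while the form inside braces parses without error.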

### New version of pqR, with major speed improvements

I’ve released pqR-2018-11-18, a new version of my variant implementation of R. You can install it on Linux, Windows, or Mac as described at pqR-project.org. Installation must currently be from source, similarly to source installs of R Core versions of R.

This version has some major speed improvements, as well as some new features. I’ll detail some of these improvements in future posts. Here, I’ll just mention a few things to show the flavour of the improvements in this release, and why you might be interested in pqR as an alternative to the R Core implementation.

### Performance improvements

One landmark reached in this release is that it is no longer advisable to use the byte-code compiler in pqR. The speed of direct interpretation of R code has now been improved to the point where it is about as fast at executing simple scalar code as the byte-code interpreter. Eliminating the byte-code compiler simplifies the overall implementation, and avoids possible semantic differences between interpreted and byte-compiled code. It is also important for pqR because some pqR optimizations and some new pqR features are not implemented in byte-code. For example, only the interpreter does optimizations such as deferring vector operations so that they may automatically be merged with other operations or be done in parallel when multiple cores are available.

Some vector operations have been substantially sped up compared to the previous release of pqR-2017-06-09. The improvement compared to R-3.5.1 can be even greater. Here is an example of replacing a subset of vector elements, benchmarked on an Intel “Skylake” processor, with both pqR-2018-11-18 and R-3.5.1 compiled from source with gcc 8.2.0 at optimization level -O3:

Here’s R-3.5.1:

    > a <- numeric(20000)
    > system.time(for (i in 1:100000) a[2:19999] <- 3.1)
       user  system elapsed
      4.211   0.148   4.360

And here’s pqR-2018-11-18:

    > a <- numeric(20000)
    > system.time(for (i in 1:100000) a[2:19999] <- 3.1)
       user  system elapsed
      0.256   0.000   0.257

So the current R Core implementation is 17 times slower than pqR for this replacement operation.

The advantage of pqR isn’t always this large, but many vector operations are sped up by smaller but still significant factors. An example:

With R-3.5.1:

    > a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
    > system.time (for (i in 1:100000) {
    +   d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
       user  system elapsed
      1.215   0.015   1.231

With pqR-2018-11-18:

    > a <- seq(0,1,length=2000); b <- seq(1,0,length=2000)
    > system.time (for (i in 1:100000) {
    +   d <- abs(a-b); r <- sum (d>0.4 & d<0.7) })
       user  system elapsed
      0.654   0.008   0.662

So for this example, pqR is almost twice as fast.

For some operations, pqR’s implementation has lower asymptotic time complexity, and so can be enormously faster. An example is the following convenient coding pattern that R programmers are currently warned to avoid:

With R-3.5.1:

    > n <- 200000; a <- numeric(0);
    > system.time (for (i in 1:n) a <- c(a,(i+1)^2))
       user  system elapsed
     30.387   0.223  30.612

With pqR-2018-11-18:

    > n <- 200000; a <- numeric(0);
    > system.time (for (i in 1:n) a <- c(a,(i+1)^2))
       user  system elapsed
      0.040   0.004   0.045

In R-3.5.1, extending a vector one element at a time with “c” takes time growing as n², as a new vector is allocated when each element is appended. With the latest version of pqR, the time grows only as n log n. In this example, that leads to pqR being 680 times faster, but the ratio could be made arbitrarily large by increasing n.

It’s still faster in pqR to preallocate a vector of length n, but only by about a factor of three, which would often be tolerable when writing one-off code if using “c” is more convenient.
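For readers who want to try the comparison themselves, here is a sketch of the two coding patterns side by side. The function names `grow` and `prealloc` are hypothetical, and n is set smaller than in the timings above so that the slow version finishes quickly under R Core versions of R. Timings will vary by machine and R implementation, so none are quoted here.

```r
n <- 20000

# Growing the result one element at a time with c().
grow <- function(n) {
  a <- numeric(0)
  for (i in 1:n) a <- c(a, (i + 1)^2)
  a
}

# Preallocating the result and filling it in place.
prealloc <- function(n) {
  a <- numeric(n)
  for (i in 1:n) a[i] <- (i + 1)^2
  a
}

t_grow <- system.time(a1 <- grow(n))["elapsed"]
t_pre  <- system.time(a2 <- prealloc(n))["elapsed"]

# Both patterns must produce exactly the same vector.
stopifnot(identical(a1, a2))
cat("grow:", t_grow, "s  preallocate:", t_pre, "s\n")
```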

### New features

The latest version of pqR has some new features. As for earlier pqR versions, some new features are aimed at addressing design flaws in R that lead to unreliable code, and others are aimed at making R more convenient for programming and scripting.

One new convenience feature is that the paste and paste0 operations can now be written with new `!!` and `!` operators. For example,

    > city <- "Toronto"; province <- "Ontario"
    > city ! "," !! province
    [1] "Toronto, Ontario"

The `!!` operator pastes strings together with space separation; the `!` operator pastes with no separation. Of course, `!` retains its meaning of “not” when used as a unary operator; there is no ambiguity.
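For reference, the same string can be built in standard R with `paste0` and `paste`. This sketch only reproduces the result; it makes no claim about how pqR parses or evaluates the new operators, and the variable name `result` is just for illustration.

```r
city <- "Toronto"; province <- "Ontario"

# paste0 joins with no separator; paste joins with a single space by default.
result <- paste(paste0(city, ","), province)
print(result)
```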

### What next?

I’ll be writing some more blog posts regarding improvements in pqR-2018-11-18, and regarding some improvements in earlier pqR versions that I haven’t already blogged about. Of course, you can read about these now in the pqR NEWS file.

The main disadvantage of pqR is that it is not fully compatible with the current R Core version. It is a fork of R-2.15.0, with many, but not all, later changes incorporated. This affects what packages will work with pqR.

Addressing this compatibility issue is one thing that needs to be done going forward. I’ll discuss this and other plans — notably implementing automatic differentiation — in another future blog post.

I’m open to other people getting involved in this project. Of course, you can contribute now by trying out pqR and reporting any problems in the comments here or at the pqR issues page. Performance comparisons, especially on real-world applications, are also welcome.

Finally, for the paranoid, here are the shasum values for the compressed and uncompressed tar files that you can download from pqR-project.org:

    89216dc76be23b3928c26561acc155b6e5ad32f3  pqR-2018-11-18.tar.gz
    f0ee8a37198b7e078fa1aec7dd5cda762f1a7799  pqR-2018-11-18.tar