Posts filed under ‘Statistics’

New release of pqR, with a curated repository

I have released a new version, pqR-2014-06-19, of my speedier, “pretty quick”, implementation of R.  This and the previous release (pqR-2014-02-23) are maintenance releases, with bug fixes, improved documentation, and better test procedures.

The result is that pqR now works with a large collection of 3438 packages.

(more…)

2014-06-21 at 10:59 pm 2 comments

Inaccurate results from microbenchmark

The microbenchmark package is a popular way of comparing the time it takes to evaluate different R expressions — perhaps more popular than the alternative of just using system.time to see how long it takes to execute a loop that evaluates an expression many times. Unfortunately, when used in the usual way, microbenchmark can give inaccurate results.

The inaccuracy of microbenchmark has two main sources. First, it does not correctly allocate the time for garbage collection to the expression that is responsible for it. Second, it summarizes the results by the median time for many repetitions, when the mean is what is needed. The median and mean can differ drastically, because just a few of the repetitions will include time for a garbage collection. These flaws can result in comparisons being reversed, with the expression that is actually faster looking slower in the output of microbenchmark. (more…)
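As a small illustration of why the choice of summary matters, the sketch below uses made-up timings (they are assumptions, not measurements) in which only a couple of repetitions include a garbage collection. It then shows the system.time alternative mentioned above, where dividing the total by the number of repetitions gives a mean that includes any garbage collection time. The vector x and the repetition count n are arbitrary choices for the example.

```r
# Hypothetical timings (microseconds) for 100 repetitions of one expression:
# 98 ordinary runs, plus 2 runs that happened to include a garbage collection.
times <- c(rep(10, 98), rep(2000, 2))

median(times)   # 10: the typical run, with garbage collection cost ignored
mean(times)     # 49.8: average cost per evaluation, garbage collection included

# Timing a loop with system.time, then dividing by the repetition count,
# gives a mean time per evaluation rather than a median.
x <- runif(1000)
n <- 10000
elapsed  <- system.time(for (i in 1:n) y <- x + 1)[["elapsed"]]
per_eval <- elapsed / n   # seconds per evaluation, averaged over the loop
```

With these made-up numbers the mean is almost five times the median, so two expressions ranked by median could easily be ranked the other way by mean.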

2014-02-02 at 2:31 pm 8 comments

New version of pqR, now with task merging

 
I’ve now released pqR-2013-12-29, a new version of my speedier implementation of R.  There’s a new website, pqR-project.org, as well, and a new logo.

The big improvement in this version is that vector operations are sped up using task merging.

With task merging, several arithmetic operations on a vector may be merged into a single operation, reducing the time spent on memory stores and fetches of intermediate results. I was inspired to add task merging to pqR by Renjin and Riposte (see my post here and the subsequent discussion). (more…)
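The idea can be pictured with ordinary R code. This is only a conceptual sketch, not how pqR implements merging internally: the explicit loop stands in for a single merged pass, and an interpreted loop like this would itself be slow in practice.

```r
x <- c(1, 2, 3, 4)

# Unmerged: each operation makes a full pass over the data and stores
# an intermediate vector in memory.
tmp <- 2 * x        # first pass, intermediate result stored
y1  <- tmp + 1      # second pass, intermediate result fetched again

# Merged (conceptually): one pass, with no intermediate vector.
y2 <- numeric(length(x))
for (i in seq_along(x)) {
  y2[i] <- 2 * x[i] + 1
}

identical(y1, y2)   # TRUE: same result, fewer memory stores and fetches
```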

2014-01-01 at 10:11 pm 4 comments

Faculty Positions in Statistics and Computer Science at the University of Toronto

Several Assistant Professor positions are open at the University of Toronto, in Statistics and areas of Computer Science related to Statistics.

The suburban Scarborough campus of the University of Toronto has a position for an Assistant Professor in any area of Statistics. Faculty at Scarborough teach undergraduate courses at the suburban campus, but Statistics faculty there also spend much time at the Department of Statistical Sciences on the downtown campus, teaching graduate courses, supervising graduate students, attending research seminars, etc.

There is a position in Computational Biology at the downtown campus, joint between the Department of Computer Science and the Donnelly Centre for Cellular and Biomolecular Research.  Many research groups at the University of Toronto also work on computational biology, with significant interest within Statistics, Biostatistics, and the Machine Learning group in Computer Science.

There is also a position in Computer Science on “Big Data”, broadly interpreted.  You’ll note at the link that there are also two other Computer Science Assistant Professor positions open (at the two suburban campuses).  And there’s also a position for a lecturer (full-time teaching faculty, with a permanent appointment, subject to performance review).

U of T has recently recruited two new faculty in Statistics and Machine Learning — Ruslan Salakhutdinov and Raquel Urtasun.  They join the existing faculty interested in Machine Learning, who include Geoffrey Hinton, Richard Zemel, Brendan Frey, and myself.

The deadline for applying to the Assistant Professor position in Statistics is December 10.  For the Computer Science Assistant Professor positions, the deadline is January 10, and for the lecturer position, the deadline is January 15.

2013-12-05 at 8:57 pm

Deferred evaluation in Renjin, Riposte, and pqR

The previously sleepy world of R implementation is waking up.  Shortly after I announced pqR, my “pretty quick” implementation of R, the Renjin implementation was announced at useR! 2013.  Work also proceeds on Riposte, with release planned for a year from now. These three implementations differ greatly in some respects, but interestingly they all try to use multiple processor cores, and they all use some form of deferred evaluation.

Deferred evaluation isn’t the same as “lazy evaluation” (which is how R handles function arguments). Deferred evaluation is purely an implementation technique, invisible to the user, apart from its effect on performance. The idea is to sometimes not do an operation immediately, but instead wait, hoping that later events will allow the operation to be done faster, perhaps because a processor core becomes available for doing it in another thread, or perhaps because it turns out that it can be combined with a later operation, and both done at once.
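For contrast, lazy evaluation of function arguments is visible to the user, as the short sketch below shows. The functions f and g and the flag evaluated are names made up for this example.

```r
# An argument is evaluated only if the function body actually uses it.
f <- function(x, y) {
  if (x > 0) "positive" else y
}
result <- f(1, stop("y is never evaluated"))
result        # "positive", and no error is raised

# A side effect inside an unused argument never happens.
evaluated <- FALSE
g <- function(arg) "arg ignored"
g({ evaluated <- TRUE; 1 })
evaluated     # still FALSE
```

Deferred evaluation has no such observable effect: the program gets the same results either way, and only the timing changes.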

Below, I’ll sketch how deferred evaluation is implemented and used in these three new R implementations, and also comment a bit on their other characteristics. I’ll then consider whether these implementations might be able to borrow ideas from each other to further expand the usefulness of deferred evaluation. (more…)

2013-07-24 at 11:03 pm 17 comments

Fixing R’s NAMED problems in pqR

In R, objects of most types are supposed to be treated as “values”, that do not change when other objects change. For instance, after doing the following:

  a <- c(1,2,3)
  b <- a
  a[2] <- 0

b[2] is supposed to have the value 2, not 0. Similarly, a vector passed as an argument to a function is not normally changed by the function. For example, with b as above, calling f(b) will not change b even if the definition of f is f <- function (x) x[2] <- 0.
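Both claims can be checked directly in any R session, using only the definitions given above:

```r
a <- c(1, 2, 3)
b <- a
a[2] <- 0
b[2]          # 2: changing a did not change b

f <- function (x) x[2] <- 0
f(b)          # modifies only the function's own x
b             # 1 2 3: the caller's b is unchanged
```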

This semantics would be easy to implement by simply copying an object whenever it is assigned, or evaluated as the argument to a function. Unfortunately, this would be unacceptably slow. Think, for example, of passing a 10000 by 10000 matrix as an argument to a little function that just accesses a few elements of the matrix and returns a value computed from them.  The copying would take far longer than the computation within the function, and the extra 800 Megabytes of memory required might also be a problem.
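The 800 Megabyte figure follows from storing each matrix element as an 8-byte double. The check below is a small sketch; the 1000 by 1000 matrix is an assumption chosen so that the example runs quickly and uses little memory.

```r
n <- 10000
bytes <- n * n * 8                 # 8 bytes per double-precision element
bytes / 1e6                        # 800 (megabytes)

# The same arithmetic checked on a smaller matrix that is cheap to create.
small <- matrix(0, 1000, 1000)
small_bytes <- as.numeric(utils::object.size(small))
small_bytes                        # a little over 8e6: the data plus a small header
```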

So R doesn’t copy all the time.  Instead, it maintains a count, called NAMED, of how many “names” refer to an object, and copies only when an object that needs to be modified is also referred to by another name.  Unfortunately, however, this scheme works rather poorly.  Many unnecessary copies are still made, while many bugs have arisen in which copies aren’t made when necessary. I’ll talk about this more below, and discuss how pqR has made a start at solving these problems. (more…)
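The general scheme of counting names and copying only when necessary can be pictured with a toy model written in R itself. This is purely an illustration of the idea, not R's actual internal implementation: the functions new_obj, bind_name and set_elem, and the field named, are all hypothetical names invented for this sketch.

```r
# Toy model only: an "object" is an environment holding some data and a
# count of how many names refer to it.
new_obj <- function(data) {
  obj <- new.env()
  obj$data  <- data
  obj$named <- 1L            # one name refers to the object so far
  obj
}

# Binding a second name shares the object and increases the count.
bind_name <- function(obj) {
  obj$named <- obj$named + 1L
  obj
}

# Modify element i through one name, copying only if the object is shared.
set_elem <- function(obj, i, value) {
  if (obj$named > 1L) {
    obj$named <- obj$named - 1L   # this name lets go of the shared object
    obj <- new_obj(obj$data)      # private copy for the modifying name
  }
  obj$data[i] <- value
  obj
}

a <- new_obj(c(1, 2, 3))
b <- bind_name(a)          # like b <- a: no copy is made yet
a <- set_elem(a, 2, 0)     # like a[2] <- 0: copies, because b shares the data
a <- set_elem(a, 3, 9)     # a is unshared now, so this modifies in place
a$data                     # 1 0 9
b$data                     # 1 2 3
```

In this toy the count is decremented when a name lets go of the object; the unnecessary copies described above are what results when a count stays high after the other references are gone.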

2013-07-02 at 9:44 pm 3 comments

How pqR makes programs faster by not doing things

One way my faster version of R, called pqR (see updated release of 2013-06-28), can speed up R programs is by not even doing some operations. This happens in statements like for (i in 1:1000000) ..., in subscripting expressions like v[i:1000], and in logical expressions like any(v>0) or all(is.na(X)).

This is done using pqR’s internal “variant result” mechanism, which is also crucial to how helper threads are implemented. This mechanism is not visible to the user, apart from the reductions in run time and memory usage, but knowing about it will make it easier to understand the performance of programs running under pqR. (more…)
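To get a feel for what skipping work means in an expression like any(v>0), compare it with a hand-written loop that stops at the first positive element and never builds the full logical vector v>0. This is a sketch of the idea only, not pqR's variant result mechanism; any_positive is a hypothetical helper, and it assumes a numeric vector with no NA values.

```r
# Hypothetical helper: gives the same answer as any(v > 0) for a numeric
# vector without NAs, but stops as soon as the answer is known.
any_positive <- function(v) {
  for (x in v) {
    if (x > 0) return(TRUE)   # stop at the first positive element
  }
  FALSE
}

v <- c(-1, 3, -2, 5)
any_positive(v)                           # TRUE, after looking at two elements
identical(any_positive(v), any(v > 0))    # TRUE
```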

2013-06-30 at 8:37 pm
