## Archive for August, 2008

### R Design Flaws #1 and #2: A Solution to Both?

I’ve previously posted about two design flaws in R. The first post was about how R produces reversed sequences from a:b when a>b, with bad consequences in “for” statements (and elsewhere). The second post was about how R by default drops dimensions in expressions like M[i:j,] when i:j is a sequence only one long (ie, when i equals j).

In both posts, I suggested ways of extending R to try to solve these problems. I now think there is a better way, however, which solves both problems with one simple extension to R. This extension would also make R programs run faster and use less memory. (more…)

### Answers to Applied PhD Comprehensive Question #1

This post links to a Question I set for an applied statistics PhD comprehensive exam. My answers for this question are here (the question is repeated there, so no need to look at the old post).

Note that my answers are more elaborate than I would expect a student to write on an exam. I also gave credit for discussions that showed some insight, even if the final answer wasn’t completely correct.

### Design Flaws in R #2 — Dropped Dimensions

In a comment on my first post on design flaws in the R language, Longhai remarked that he has encountered problems as a result of R’s default behaviour of dropping a dimension of a matrix when you select only one row/column from that dimension. This was indeed the design flaw that I was going to get to next! I think it also points to what is perhaps a deeper design flaw. (more…)

### The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever

Many Bayesian statisticians decide which of several models is most appropriate for a given dataset by computing the *marginal likelihood* of each model (also called the *integrated likelihood* or the *evidence*). The marginal likelihood is the probability that the model gives to the observed data, averaging over values of its parameters with respect to their prior distribution. If *x* is the entire dataset and *t* is the entire set of parameters, then the marginal likelihood is

Here, I use *P(x)* and so forth to represent a probability density or mass function, as appropriate. After P(x) has been computed for all models (which may well have different sorts of parameters), the model with largest marginal likelihood is chosen, or more elaborately, predictions might be made by averaging over all models, with weights proportional to their marginal likelihoods (perhaps multiplied by a prior factor for each model).

Use of the marginal likelihood to evaluate models is often rather dubious, because it is very sensitive to any flaws in the specifications of the priors for the parameters of each model. That may be a topic for a future post, but I’ll ignore that issue here, and talk instead about how people try to compute the marginal likelihood.

Computing *P(x)* is difficult, because it’s an integral of what is usually a highly variable function over what is typically a high dimensional parameter space. Of course, for some models the integral is analytically tractable (typically ones with conjugate priors), and if the parameter space is low-dimensional, brute-force numerical integration may work. For most interesting models, however, the only known way of accurately computing P(x) is by applying quite sophisticated Monte Carlo techniques, which have long been researched by physicists in order to solve the essentially identical problem of computing “free energies”. See, for example, Section 6.2 of my review of Probabilistic Inference Using Markov Chain Monte Carlo Methods, and my paper on Annealed Importance Sampling.

These sophisticated techniques aren’t too well-known, however. They also take substantial effort to implement and substantial amounts of computer time. So it’s maybe not surprising that many people have been tempted by what *seems *like a much easier method — compute the harmonic mean of the likelihood with respect to the posterior distribution, using the same Markov Chain Monte Carlo (MCMC) runs that they have usually already done to estimate parameters or make predictions based on each model.

### Applied Statistics PhD Comprehensive Question #1

PhD students in the Dept. of Statistics at the University of Toronto normally write three comprehensive exams at the end of their first year, in Probability, Theoretical Statistics, and Applied Statistics. Below is a question I set for the 2008 exam in Applied Statistics. It may be an interesting exercise for others too. It should in theory be doable by someone with just a good introductory undergraduate course in statistics, including multiple regression. However, many PhD students had difficulty with it, so I wouldn’t say it’s easy.

The question is here. I’ll post my answer in a week or so. I’ll also post a question from 2007 sometime.

**Update:** Here is the post with the answers.

### Inconsistent Maximum Likelihood Estimation: An “Ordinary” Example

The widespread use of the Maximum Likelihood Estimate (MLE) is partly based on an intuition that the value of the model parameter that best explains the observed data must be the best estimate, and partly on the fact that for a wide class of models the MLE has good asymptotic properties. These properties include “consistency” — that as the amount of data increases, the estimate will, with higher and higher probability, become closer and closer to the true value — and, moreover, that the MLE converges as quickly to this true value as any other estimator. These asymptotic properties might be seen as validating the intuition that the MLE must be good, *except *that these good properties of the MLE *do not hold* for some models.

This is well known, but the common examples where the MLE is inconsistent aren’t too satisfying. Some involve models where the number of parameters increases with the number of data points, which I think is cheating, since these ought to be seen as “latent variables”, not parameters. Others involve singular probability densities, or cases where the MLE is at infinity or at the boundary of the parameter space. Normal (Gaussian) mixture models fall in this category — the likelihood becomes infinite as the variance of one of the mixture components goes to zero, while the mean is set to one of the data points. One might think that such examples are “pathological”, and do not really invalidate the intuition behind the MLE.

Here, I’ll present a simple “ordinary” model where the MLE is inconsistent. The probability density defined by this model is free of singularities (or any other pathologies), for any value of the parameter. The MLE is always well defined (apart from ties, which occur with probability zero), and the MLE is always in the interior of the parameter space. Moreover, the problem is one-dimensional, allowing easy visualization. (more…)

### Design Flaws in R #1 — Reversing Sequences

The R language for statistical computing has become the standard for academic statistical research, for the very good reason that it’s better than the alternatives. It’s far from perfect however. I could come up with a long “wish list” of desired features it lacks, but that’s not what I’ll do in this series of posts. Instead, I’ll concentrate on real design flaws — aspects of the language that make it all too easy to write unreliable programs. Many of these are inherited from R’s predecessors, S and S-Plus, and may even originate in the very earliest version of S, which was seen more as a calculator than as a programming language.

By far the most commonly-encountered design flaw in R is in the “:” operator for producing sequences, whose most common use is in “for” statements. Here’s an example: (more…)

### What is This Blog?

Primarily, this is a blog about statistics, machine learning, computing, and other scientific areas where I have some expertise. Many posts will be quite technical, but some will be more widely accessible. I will also occasionally post on photography and other subjects that I have an amateur interest in, and on some societal topics, particularly those relating to statistics and to broader intellectual issues.

Many of my technical posts will point out flaws in research, methods, and tools that are commonly used. Such negative comments are essential to the scientific enterprise, being the social counterpart of the crucial role of self-criticism in individual research. I find that one of the main challenges in supervising PhD students is getting them to constantly ask whether their results, or the results of others, might be wrong. The aim in research is to discard bad ideas *quickly*, and with minimal effort. For this a blog is much more efficient than formal publication.

I will also try to publicize good work, including what I think is good work of my own. Traditionally, this is the role of conferences, but blogs may be a more efficient mechanism for this as well. Other technical posts will contain tutorial examples that illustrate important concepts, or present results that I think are interesting, but not significant enough for a formal paper.

I’ll sometimes post less technical examples and explanations of statistical concepts, accessible to readers with some general scientific background. Other non-technical posts will discuss statistical aspects of issues of personal or public significance. Sometimes I may discuss posts on other people’s blogs.

I may occasionally venture into other areas of science where I have less expertise. I’m an amateur photographer, and will sometimes post on technical aspects of photography, or just present a photo of mine that I like. I may post on other amateur interests as well.

Note that although I am a Professor of Statistics and Computer Science at the University of Toronto, and will occassionally refer to my work there, this blog is not approved by, supported by, or in any other way affiliated with the University of Toronto, or any other organization.