Inconsistent Maximum Likelihood Estimation: An “Ordinary” Example
The widespread use of the Maximum Likelihood Estimate (MLE) is partly based on an intuition that the value of the model parameter that best explains the observed data must be the best estimate, and partly on the fact that for a wide class of models the MLE has good asymptotic properties. These properties include “consistency”: as the amount of data increases, the estimate will, with higher and higher probability, come closer and closer to the true value. Moreover, the MLE converges to this true value as quickly as any other estimator. These asymptotic properties might be seen as validating the intuition that the MLE must be good, except that these good properties do not hold for some models.
This is well known, but the common examples where the MLE is inconsistent aren’t too satisfying. Some involve models where the number of parameters increases with the number of data points, which I think is cheating, since these ought to be seen as “latent variables”, not parameters. Others involve singular probability densities, or cases where the MLE is at infinity or at the boundary of the parameter space. Normal (Gaussian) mixture models fall in this category — the likelihood becomes infinite as the variance of one of the mixture components goes to zero, while the mean is set to one of the data points. One might think that such examples are “pathological”, and do not really invalidate the intuition behind the MLE.
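To see concretely why the normal mixture case is usually dismissed as pathological, here is a small illustration (my own sketch in Python with numpy and scipy, not from any standard reference; for simplicity it fixes one component at N(0,1) and uses an arbitrary made-up data set). Placing the free component's mean exactly on a data point and shrinking its standard deviation makes the likelihood grow without bound:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary small data set, just for illustration.
x = np.array([-1.3, 0.2, 0.8, 1.7, 2.4])

def mixture_loglik(mu, sigma, x):
    # Equal mixture of N(0,1) and N(mu, sigma^2).
    return np.sum(np.log(0.5 * norm.pdf(x, 0.0, 1.0) + 0.5 * norm.pdf(x, mu, sigma)))

# Put mu exactly on a data point and let sigma shrink: that point's density term
# behaves like 1/(2*sigma*sqrt(2*pi)), so the log likelihood increases without bound.
for sigma in [1.0, 0.1, 0.01, 1e-4, 1e-8]:
    print(sigma, mixture_loglik(mu=x[0], sigma=sigma, x=x))
```

The maximum is therefore "at the boundary" (sigma equal to zero), which is why such examples feel like cheating.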
Here, I’ll present a simple “ordinary” model where the MLE is inconsistent. The probability density defined by this model is free of singularities (or any other pathologies), for any value of the parameter. The MLE is always well defined (apart from ties, which occur with probability zero), and the MLE is always in the interior of the parameter space. Moreover, the problem is one-dimensional, allowing easy visualization.
The data consists of i.i.d. real values x1, x2, …, xn. The model has one positive real parameter, t. The distribution of a data point is an equal mixture of the standard normal and a normal distribution with mean t and standard deviation exp(-1/t^2):

  p(x | t) = (1/2) N(x; 0, 1) + (1/2) N(x; t, exp(-1/t^2)^2)

where N(x; m, v) denotes the normal density with mean m and variance v, evaluated at x.
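For concreteness, here is a minimal sketch of the model in Python (my own code, using numpy and scipy; the names density and sample are just labels I've chosen, and the seed is arbitrary):

```python
import numpy as np
from scipy.stats import norm

def density(x, t):
    # Equal mixture of N(0,1) and a normal with mean t and sd exp(-1/t^2).
    s = np.exp(-1.0 / t**2)
    return 0.5 * norm.pdf(x, 0.0, 1.0) + 0.5 * norm.pdf(x, t, s)

def sample(n, t, seed=1):
    # Each point independently comes from the standard normal component or
    # the narrow component, each with probability 1/2.
    rng = np.random.default_rng(seed)
    s = np.exp(-1.0 / t**2)
    from_second = rng.random(n) < 0.5
    return np.where(from_second, rng.normal(t, s, n), rng.normal(0.0, 1.0, n))
```

Note how quickly the second component narrows as t decreases: its standard deviation is about 0.85 at t=2.5, about 0.06 at t=0.6, and about 1.4×10^-11 at t=0.2.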
Looking at the top left plot, the probability density function when t=2.5, you can see two modes near 0 and t. As t decreases to 0.6, and then 0.2, the mode at t moves left and gets narrower. A narrower mode has higher probability density at its peak. When t=0.2, the peak density of the mode at t is about 10^10 times higher than the peak density of the mode at 0, which is invisible at the scale of the plot (the scale is noted in the left-side caption).
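Since the plots themselves are not reproduced here, the following sketch (again my own, assuming matplotlib is available) produces similar density plots. The extra fine grid around t is needed because for small t the second mode is far too narrow for a uniform grid to catch:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def density(x, t):
    # Same mixture density as in the previous sketch.
    s = np.exp(-1.0 / t**2)
    return 0.5 * norm.pdf(x, 0.0, 1.0) + 0.5 * norm.pdf(x, t, s)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, t in zip(axes, [2.5, 0.6, 0.2]):
    s = np.exp(-1.0 / t**2)
    # A coarse grid over the plotting range plus a very fine grid around t.
    xs = np.sort(np.concatenate([np.linspace(-3.0, 4.0, 2000),
                                 t + s * np.linspace(-6.0, 6.0, 200)]))
    ax.plot(xs, density(xs, t))
    ax.set_title(f"t = {t}")
    ax.set_xlabel("x")
plt.tight_layout()
plt.show()
```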
The bottom plots are of likelihood functions given data generated from the model with t=0.6. I generated 100 points, and used the first 10, the first 30, and all 100 for the three plots. With 10 data points, the value that maximizes the likelihood (0.5916) is close to the true parameter value (0.6). But as the number of data points increases, the MLE moves away from the true value, getting closer and closer to zero. The value of the likelihood at the MLE also gets bigger, reaching about 0.3×10^162 when 100 data points are used.
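A rough way to see this numerically is sketched below (my own code with an arbitrary seed, so the exact values will differ from those quoted above). The likelihood is evaluated on a grid of t values, taking care to include the small positive data points themselves in the grid, since the spikes there are far too narrow to find otherwise, and working on the log scale to avoid overflow:

```python
import numpy as np
from scipy.stats import norm

def loglik(t, x):
    # Log likelihood of the mixture, computed on the log scale. The cap on the
    # exponent keeps exp() finite; it only matters for t below about 0.053,
    # where the narrow component is far below floating-point resolution anyway.
    log_a = np.log(0.5) + norm.logpdf(x, 0.0, 1.0)
    expo = np.exp(np.minimum(2.0 / t**2, 700.0))
    log_b = (np.log(0.5) + 1.0 / t**2 - 0.5 * np.log(2.0 * np.pi)
             - 0.5 * (x - t)**2 * expo)
    return np.sum(np.logaddexp(log_a, log_b))

rng = np.random.default_rng(1)
t_true = 0.6
s_true = np.exp(-1.0 / t_true**2)
from_second = rng.random(100) < 0.5
x = np.where(from_second, rng.normal(t_true, s_true, 100), rng.normal(0.0, 1.0, 100))

for n in [10, 30, 100]:
    xn = x[:n]
    # Uniform grid plus the small positive data points, where the spikes are.
    grid = np.concatenate([np.linspace(0.02, 3.0, 3000), xn[(xn > 0) & (xn < 1.0)]])
    ll = np.array([loglik(t, xn) for t in grid])
    print(n, grid[np.argmax(ll)], ll.max())
```

With enough data the maximizer ends up at (or essentially at) the smallest positive data point, though for small n it may still be near the true value, depending on the particular sample.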
This plot shows the likelihood function for 30 data points, with the vertical scale changed to show detail other than the peak. A local maximum of the likelihood around the true value of 0.6 can now be seen. It is completely dominated by the maximum at 0.0743, which is higher by a factor of about 10^52.
Why does this happen? First, note that regardless of the true value of t, the density in the vicinity of zero will be non-singular, so the probability that a data point will land in the interval (0,c) will be proportional to c when c is small. In a data set of n points, we can therefore expect the smallest positive data point, call it x0, to have a magnitude of about k/n, for some constant k. Now, consider the value of the log likelihood function at t=x0 versus its value at the true value of t. When t=x0, the log probability density for x0 will be approximately

  log[ (1/2) exp(1/x0^2) / sqrt(2 pi) ]  =  1/x0^2 + log(1/2) - (1/2) log(2 pi)

which is roughly (n/k)^2, since x0 is about k/n; that is, it grows in proportion to n^2.
In comparison, the density of x0 when t is not close to x0 will be approximately half its density under the N(0,1) distribution, which will approach a constant as n increases (and x0 goes to zero). The density under t=x0 for data points other than x0 will approach a constant as n increases and t goes to zero. On average, the true value of t (or values near the true value) will produce a higher value for the density of such points, but the difference in log densities will approach a constant. The end result is that the contribution to the log likelihood of data points other than x0 will be greater for the true value of t than for t=x0 by an amount that grows in proportion to n, while the contribution to the log likelihood of the point x0 will be greater for t=x0 than for the true value of t by an amount that grows in proportion to n^2. As n increases, the n^2 contribution will dominate, so the MLE will be close to zero rather than being close to the true value of the parameter. (Note, however, that the MLE will usually not be exactly equal to x0, though when n is large it will usually be nearby. The argument above just shows that a value of t near x0 will have higher likelihood than a value near the true value of t.)
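This n versus n^2 trade-off can be checked numerically with a sketch along the following lines (again my own code and an arbitrary seed): compare, for growing n, the gain in log likelihood that the single point x0 gets from setting t=x0 against the total loss over the other points relative to the true value t=0.6:

```python
import numpy as np
from scipy.stats import norm

def log_density(x, t):
    # Per-point log density of the same mixture as above, with the same
    # overflow guard on the exponent.
    log_a = np.log(0.5) + norm.logpdf(x, 0.0, 1.0)
    expo = np.exp(np.minimum(2.0 / t**2, 700.0))
    log_b = (np.log(0.5) + 1.0 / t**2 - 0.5 * np.log(2.0 * np.pi)
             - 0.5 * (x - t)**2 * expo)
    return np.logaddexp(log_a, log_b)

rng = np.random.default_rng(2)
t_true = 0.6
s_true = np.exp(-1.0 / t_true**2)

for n in [10, 30, 100, 300, 1000]:
    from_second = rng.random(n) < 0.5
    x = np.where(from_second, rng.normal(t_true, s_true, n), rng.normal(0.0, 1.0, n))
    x0 = x[x > 0].min()    # smallest positive data point, roughly k/n
    rest = x[x != x0]
    gain_x0 = log_density(x0, x0) - log_density(x0, t_true)               # roughly 1/x0^2, order n^2
    loss_rest = np.sum(log_density(rest, t_true) - log_density(rest, x0))  # order n
    print(n, x0, gain_x0, loss_rest)
```

Once n is moderately large, gain_x0 exceeds loss_rest, so a value of t at (or extremely near) x0 has higher likelihood than the true value, which is the inconsistency described above.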
So what does this say about the Maximum Likelihood intuition — that the best estimate is the parameter value that best explains the data? It illustrates the Bayesian critique of this intuition, which is that using the MLE ignores the volume of the parameter space where the data is fit well. When n is large, only a tiny region around x0 fits the data better than the true value of the parameter; roughly speaking, the spike in the likelihood near x0 has height of order exp(1/x0^2), but its width is of order only exp(-1/x0^2), the standard deviation of the narrow component. In a posterior distribution (based on some prior), the smallness of this region will cancel the large height of the peak, with the result that the posterior probability that t is near x0 will usually not be large.