Inconsistent Maximum Likelihood Estimation: An “Ordinary” Example

2008-08-09 at 6:24 pm

The widespread use of the Maximum Likelihood Estimate (MLE) is partly based on an intuition that the value of the model parameter that best explains the observed data must be the best estimate, and partly on the fact that for a wide class of models the MLE has good asymptotic properties. These properties include “consistency” — that as the amount of data increases, the estimate will, with higher and higher probability, become closer and closer to the true value — and, moreover, that the MLE converges as quickly to this true value as any other estimator. These asymptotic properties might be seen as validating the intuition that the MLE must be good, except that these good properties of the MLE do not hold for some models.

This is well known, but the common examples where the MLE is inconsistent aren’t too satisfying. Some involve models where the number of parameters increases with the number of data points, which I think is cheating, since these ought to be seen as “latent variables”, not parameters. Others involve singular probability densities, or cases where the MLE is at infinity or at the boundary of the parameter space. Normal (Gaussian) mixture models fall in this category — the likelihood becomes infinite as the variance of one of the mixture components goes to zero, while the mean is set to one of the data points. One might think that such examples are “pathological”, and do not really invalidate the intuition behind the MLE.

Here, I’ll present a simple “ordinary” model where the MLE is inconsistent. The probability density defined by this model is free of singularities (or any other pathologies), for any value of the parameter. The MLE is always well defined (apart from ties, which occur with probability zero), and the MLE is always in the interior of the parameter space. Moreover, the problem is one-dimensional, allowing easy visualization.

The data consists of i.i.d. real values x1, x2, …, xn. The model has one positive real parameter, t. The distribution of a data point is an equal mixture of the standard normal and a normal distribution with mean t and standard deviation exp(-1/t^2):

X \mid t\ \ \sim\ \ (1/2)\, N(0,1)\ +\ (1/2)\, N\!\left(t,\ \exp(-1/t^2)^2\right)
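The original code for this post is in R; as a quick illustration, here is a minimal sketch of the model density in Python (the helper `norm_pdf` and the grid check are mine, not part of the original program):

```python
import numpy as np

def norm_pdf(x, mean, sd):
    """Density of N(mean, sd^2)."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def density(x, t):
    """Density of the equal mixture of N(0,1) and N(t, exp(-1/t^2)^2)."""
    s = np.exp(-1.0 / t ** 2)      # standard deviation of the second component
    return 0.5 * norm_pdf(x, 0.0, 1.0) + 0.5 * norm_pdf(x, t, s)

# For t = 0.6 the second component has sd exp(-1/0.36) ≈ 0.062, so a fine
# uniform grid still resolves it, and the density integrates to about 1.
xs = np.linspace(-8.0, 8.0, 160001)
dx = xs[1] - xs[0]
total = density(xs, 0.6).sum() * dx
```

For much smaller t the second component becomes far narrower than any fixed grid spacing, which is exactly the numerical trap discussed in the comments below.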

Using this R program, I produced some plots of densities and likelihood functions derived from this model.

Looking at the top left plot, the probability density function when t=2.5, you can see two modes near 0 and t. As t decreases to 0.6, and then 0.2, the mode at t moves left and gets narrower. A narrower mode has higher probability density at its peak. When t=0.2, the peak density of the mode at t is about 10^10 times higher than the peak density of the mode at 0, which is invisible at the scale of the plot (the scale is noted in the left-side caption).

The bottom plots are of likelihood functions given data generated from the model with t=0.6. I generated 100 points, and used the first 10, the first 30, and all 100 for the three plots. With 10 data points, the value that maximizes the likelihood (0.5916) is close to the true parameter value (0.6). But as the number of data points increases, the MLE moves away from the true value, getting closer and closer to zero. The value of the likelihood at the MLE also gets bigger, reaching about 0.3×10^162 when 100 data points are used.

This plot shows the likelihood function with 30 data points with the vertical scale changed to show detail other than the peak. A local maximum of the likelihood around the true value of 0.6 can now be seen. It is completely dominated by the maximum at 0.0743, which is about a factor of 10^52 higher.

Why does this happen? First, note that regardless of the true value of t, the density in the vicinity of zero will be non-singular, so the probability that a data point will land in the interval (0,c) will be proportional to c when c is small. In a data set of n points, we can therefore expect the smallest positive data point, call it x0, to have a magnitude of about k/n, for some constant k. Now, consider the value of the log likelihood function at t=x0 versus its value at the true value of t. When t=x0, the log probability density for x0 will be approximately

-\log\!\left(2\sqrt{2\pi}\,\exp(-1/x_0^2)\right)\ =\ \mbox{constant} + 1/x_0^2\ \approx\ \mbox{constant} + n^2/k^2

In comparison, the density of x0 when t is not close to x0 will be approximately half its density under the N(0,1) distribution, which will approach a constant as n increases (and x0 goes to zero). The density under t=x0 for data points other than x0 will approach a constant as n increases and t goes to zero. On average, the true value of t (or values near the true value) will produce a higher value for the density of such points, but the difference in log densities will approach a constant. The end result is that the contribution to the log likelihood of data points other than x0 will be greater for the true value of t than for t=x0 by an amount that grows in proportion to n, while the contribution to the log likelihood of the point x0 will be greater for t=x0 than for the true value of t by an amount that grows in proportion to n^2. As n increases, the n^2 contribution will dominate, so the MLE will be close to zero rather than being close to the true value of the parameter. (Note, however, that the MLE will usually not be exactly equal to x0, though when n is large it will usually be nearby. The argument above just shows that a value of t near x0 will have higher likelihood than a value near the true value of t.)
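The claimed form of the peak log density is easy to check numerically. A small sketch in Python (mine, not from the original post; the tolerance is arbitrary):

```python
import numpy as np

const = -np.log(2 * np.sqrt(2 * np.pi))   # the "constant" in the displayed equation

for x0 in [0.2, 0.1, 0.05]:
    s = np.exp(-1.0 / x0 ** 2)            # sd of the component centred at t = x0
    # mixture density at the point x0 when the parameter t equals x0
    peak = (0.5 * np.exp(-0.5 * x0 ** 2) + 0.5 / s) / np.sqrt(2 * np.pi)
    # The N(0,1) component contributes negligibly, so
    # log(peak) ≈ constant + 1/x0^2, which grows like n^2/k^2 when x0 ≈ k/n.
    print(x0, np.log(peak) - (const + 1.0 / x0 ** 2))
```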

So what does this say about the Maximum Likelihood intuition — that the best estimate is the parameter value that best explains the data? It illustrates the Bayesian critique of this intuition, which is that using the MLE ignores the volume of the parameter space where the data is fit well. When n is large, only a tiny region around x0 fits the data better than the true value of the parameter. In a posterior distribution (based on some prior), the smallness of this region will cancel the large height of the peak, with the result that the posterior probability that t is near x0 will usually not be large.
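The volume argument can be made concrete using the numbers from the plots above (a back-of-the-envelope sketch; the widths are rough assumptions of mine, not computed from the data):

```python
import math

# With 30 data points, the spike at t ≈ 0.0743 is about 10^52 times higher
# than the local maximum near the true value 0.6 (numbers from the post).
log10_height_ratio = 52.0
# But the spike's width is on the order of the narrow component's sd,
# exp(-1/0.0743^2), while the bump near 0.6 has width roughly 0.1 (assumed).
log10_spike_width = (-1.0 / 0.0743 ** 2) / math.log(10.0)   # ≈ -78.7
log10_bump_width = math.log10(0.1)
# Ratio of approximate posterior masses under a flat prior:
log10_mass_ratio = log10_height_ratio + log10_spike_width - log10_bump_width
# ≈ 52 - 78.7 + 1 ≈ -25.7: the spike carries only ~10^-26 of the bump's mass.
```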




28 Comments

  • 1. Kevin Canini  |  2008-08-20 at 7:55 am

    This is an interesting post. I didn’t realize that cases where the MLE is inconsistent could be so simple. Anyway, I’m confused about the last paragraph. How would taking the MAP value of the parameter fix this pathology? For example, if an improper uniform prior (or a very wide normal distribution) were used, I don’t see how the MAP estimate would differ much from the MLE estimate. And isn’t the MLE analog of the “posterior probability that t is near x_0” (i.e., the integral of the likelihood function around x_0) going to be small anyway, since the tall spike is so narrow?

    Reply
  • 2. radfordneal  |  2008-08-20 at 10:56 am

    You’re right that the MAP estimate would be much the same as the MLE. But MAP estimates aren’t Bayesian – they are better referred to as “maximum penalized likelihood” estimates. If you were to look at an estimate that pays attention to the whole posterior distribution, like the posterior median or the posterior mean, you’d get something quite different from the MLE.

    In a non-Bayesian framework, integrals of the likelihood function are not done. So there’s no concept of integrating the likelihood over the neighborhood of the MLE. That would be a Bayesian method with the prior happening to be uniform.

    The big difference between Bayesian and non-Bayesian methods is that Bayesian methods integrate over the parameter space, and non-Bayesian methods don’t. To integrate, one needs a base measure – ie, a prior distribution – so Bayesian methods inevitably involve a prior, but from this (somewhat mathematical) point of view, the prior isn’t the central issue.

    Reply
    • 3. Corey  |  2012-06-11 at 12:37 am

      “In a non-Bayesian framework, integrals of the likelihood function are not done.”

      Just dropped by to refresh my memory of this model. In point of fact Thomas Severini had a paper on the subject at the time the above comment was written. There’s been another one since.

      Reply
      • 4. Radford Neal  |  2012-06-11 at 9:32 am

        From a quick glance at the abstracts, the methods in those papers seem to be “semi-Bayesian”. They have a part of a prior, for nuisance parameters given a parameter of interest, allowing them to integrate over the nuisance parameter, but the results are viewed in non-Bayesian terms.

        Integration is always with respect to some measure, which one might call a “prior”, and if you think having a “prior” means you’re “Bayesian”, then my statement is true. You might not think that, of course, but my point was really that you have to add something like a prior to the MLE framework before you can sensibly talk about integration over points near the MLE, as in the first comment.

  • 5. Bill Jefferys  |  2008-08-20 at 9:02 pm

    Radford, this would be a whole lot easier to visualize if you would rescale the pix so that they were all on the same horizontal scale. As it is, it’s not easy to see the effect you rightly point out. Since the likelihood isn’t a density, the vertical scale can be set any way you wish.

    I would think that roughly (-2,+3) would work well.

    Reply
  • 6. Corey  |  2008-08-20 at 10:08 pm

    Which of the regularity conditions that one normally handwaves away is being violated here?

    Reply
  • 7. radfordneal  |  2008-08-20 at 10:35 pm

    Bill,

    I maybe didn’t explain the plots well enough. The top row of plots are of the data density for various values of the parameter. There’s no actual data involved in these. They all have a horizontal range (for possible data points) of about (-3,5). But note that the vertical scale changes drastically (see the multiplication factor in the axis label). These plots are meant just to demonstrate how as the parameter gets closer to zero, a mode near the parameter value moves closer to zero too, and gets narrower and narrower.

    The lower plots are of the data and the corresponding likelihood function for 10, 30, and 100 points from a data set. The horizontal axis does double duty for both data values and parameter values (these are distinct, but closely related for this model). The horizontal range for all these lower plots is (-3,3), so you can compare what happens to the likelihood function (and hence the MLE) as the number of data points increases. Note that the likelihood function is plotted only for positive values, since I restricted the parameter to be positive (though the data can be negative). The horizontal range is (-3,3) for the later “zoomed in” likelihood plot too.

    Reply
  • 8. Adam  |  2008-08-21 at 5:25 pm

    Isn’t it possible that this is because the discrete sample representation of the true probability density is inaccurate? It seems one needs an infinite number of samples to accurately represent a continuous distribution. If we had an infinite number of samples, then this infinity might be able to cancel the infinity induced by the density concentrated at zero, so that the maximal mode would still be located at the true value instead of at zero.

    Reply
  • 9. Radford Neal  |  2008-08-21 at 6:20 pm

    Adam,

    The notion of consistency I’m using in this post is the usual one, which is more specifically referred to as “weak” consistency – as the amount of data increases, the estimate will, with higher and higher probability, become closer and closer to the true value. There’s another notion, called “Fisher consistency”, which is related to what you’re thinking of. It applies only when you can sensibly view the estimate as a function of the empirical distribution, in which case you can ask what happens when you apply it to the actual distribution. If it gets the right answer then, it’s Fisher consistent.

    For example, consider estimating the mean of a normal distribution by the sample median. This is Fisher consistent, because the median of a normal distribution is indeed equal to its mean. (The sample median is also weakly consistent.)

    Maximum likelihood estimates are always going to be Fisher consistent (assuming they’re well defined). This is related to how the Kullback-Leibler divergence between distributions is zero only for the divergence between a distribution and itself. Fisher consistency isn’t of any practical use, however, (at least for continuous distributions) since we’re always dealing with finite samples.
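The sample-median example is easy to simulate (a sketch in Python; the seed and tolerances are my arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0
# For a normal distribution the median equals the mean, so the sample
# median is a (Fisher and weakly) consistent estimate of mu.
errors = [abs(np.median(rng.normal(mu, 1.0, n)) - mu) for n in (100, 100000)]
# The error at n = 100000 should be tiny (sd of the median ≈ 1.25/sqrt(n) ≈ 0.004).
```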

    Reply
  • 10. Radford Neal  |  2008-08-22 at 3:38 pm

    Corey asks,

    Which of the regularity conditions that one normally handwaves away is being violated here?

    That depends on which MLE consistency theorem one looks at, of course, but once one gets beyond theorems for models with a finite number of values for the parameters, the conditions go beyond easily-checked things like continuity. The example here has all the non-singularity and continuity one might want. (The parameter space is non-compact, but that’s true for most models used in practice.)

    For example, in Theory of Statistics by Mark Schervish Theorem 7.49 plus Lemma 7.54, a crucial condition involves the expected value of the largest difference in log probability for a single data point given by the true parameter value and the log probability given by the parameter value in some region that best fits that data point. For this model, one needs to look at a region extending towards zero, and this expected value ends up as minus infinity, basically because -1/t^2 integrates to minus infinity over a region near zero. This is related to the argument in the post for why the MLE for this model won’t be consistent.

    Reply
  • 11. sv  |  2008-08-23 at 11:29 pm

    This is very interesting. For those of us who don’t have any experience with Bayesian methods, would it be possible to show the Bayesian solution as well?

    Reply
  • 12. Corey  |  2008-08-24 at 2:51 pm

    a crucial condition involves the expected value of the largest difference in log probability for a single data point given by the true parameter value and the log probability given by the parameter value in some region that best fits that data point

    Thanks! Putting it that way makes it seem — well, not obvious — but in obvious agreement with one’s intuition about how MLE consistency should (fail to) work.

    Reply
  • 13. Radford Neal  |  2008-08-24 at 3:00 pm

    Another point regarding how this example fails to satisfy the premises of theorems about consistency of the MLE… I can’t recall seeing a theorem to this effect, but I think one could prove the MLE is consistent for a bounded parameter space (as here, if we impose an upper limit on t, which wouldn’t affect the example) provided there is an upper bound on the probability density, for any parameter value and any data point. This isn’t satisfied by this example, of course.

    My intuition for why such a theorem should be possible is that under these conditions a finite discretization of the parameter space should give a good approximation to the model for all parameter values. And the MLE will be consistent if the parameter space is finite. Maybe sometime I’ll try formalizing this (if it hasn’t already been done).

    Reply
  • 14. Meng  |  2008-08-24 at 8:33 pm

    “finite parameter space” may be relaxed to “being bounded away from zero” in this example or similar conditions in general, I think.

    Reply
  • 15. Meng  |  2008-08-24 at 9:27 pm

    Re: Adam

    A sample size of 30 can be large enough to well represent the true density. You may draw a histogram to see this. Actually as illustrated by Radford, it will be even worse with more samples.

    “the infinity induced by the density concentrated at zero” is an (infinite) value of the likelihood function at a parameter point (close to zero). The value of the likelihood function at a parameter value equals the product (or sum in log-scale) of the values of the density function at each data point with that parameter value. Therefore, for a given parameter value, how large the likelihood function is is a combined effect from all the data points through their density functions. However, if one data point has some type of dominant contribution (e.g. an infinite density value), averaging over even a large number of values may not be able to remove atypical phenomena like “infinity”.

    My opinion though.

    Reply
  • 16. Radford Neal  |  2008-08-24 at 9:58 pm

    Meng,

    In this example, parameter values just above zero are the source of the problem, but of course other values might be the problem for other models. By “finite parameter space” I meant that there are only a finite number of possible values for the parameter (so the parameter space isn’t an interval of the real line, for instance). An approximation with a finite parameter space will be good for all values of the parameter if slight changes in parameter values don’t have huge effects. In the model of this post, that’s not true, because even though the density never becomes singular (so there aren’t any infinite density values), it does become increasingly peaked as the parameter gets closer to zero.

    Reply
  • 17. Keith O'Rourke  |  2008-08-25 at 3:07 pm

    Looking at how the individual observation likelihood components combine (and contrast with each other) can be useful.

    This example is interesting to me because even though we know the observations were generated with a single (common) parameter, most individual observation likelihoods support .6, but a few support (different?) small values around zero – and these few outweigh the majority… so the parameter value most likely to generate this set of observations is “misleading”.

    Simple modifications to the plot.lik code to get these plots are given below (on the log scale).

    cheers
    Keith

    function (x, ...)
    {
      grid <- sort(c(x[x > 0 & x < 3], seq(0.01, 3, by = 0.01)))

      ll <- rep(NA, length(grid))
      for (i in 1:length(grid))
      { ll[i] <- log.lik(x, grid[i])
      }

      mlv <- max(ll)
      mle <- grid[ll == mlv][1]
      max.ll <- round(mlv/log(10))

      # ll <- ll - max.ll*log(10)
      # lik <- exp(ll)
      mlvi <- (1:length(grid))[ll == mlv]
      ll <- ll - ll[ll == mlv] + 2
      lik <- ll

      plot(c(0, 2), c(-8, max(lik) + 2), type = "n",
           xlab = "data / parameter",
           ylab = paste("likelihood (x 10^", max.ll, ")", sep = ""),
           ...)
      points(x, rep(0, length(x)), pch = 19)
      lines(grid, lik)

      title(paste("Likelihood -", length(x), "data points,",
                  "MLE approximately", round(mle, 4)))
      iol <- NULL
      for (k in 1:length(x)) {
        for (ki in 1:length(grid))
          iol[ki] <- log.lik(x[k], grid[ki])
        lines(grid, iol - iol[mlvi] + 2/length(x), col = 2)
      }
    }

    Reply
  • 18. Radford Neal  |  2008-08-25 at 9:23 pm

    Keith,

    I think the code got sort of destroyed in your comment by the blog software interpreting less-than and greater-than signs as HTML commands. To get a less-than through, you have to use ampersand, “lt”, semicolon. I think…

    Here’s a try: < Worked?

    Reply
  • 19. Anonymous  |  2008-08-26 at 11:05 am

    I’ll resend when I am back at my usual computer

    It occurred to me this morning that if one took account of observations being actually discrete rather than truly continuous, and replaced the density with the integral from obs - e to obs + e, the inconsistency would go away - if e was big enough?

    cheers
    Keith

    Reply
  • 20. Keith O'Rourke  |  2008-08-26 at 11:07 am

    Last comment was from me – forgot to enter email and name ;-)

    Keith

    Reply
  • 21. Radford Neal  |  2008-08-26 at 2:47 pm

    if one took account of observations being actually discrete rather than truly continuous and replaced the density with the integral from obs - e to obs + e the inconsistency would go away - if e was big enough?

    Yes. In that case, the data space would be finite (we can ignore the infinity in the big direction), and with enough data, the probability of each of these possible data values would be very well estimated. Values for the single model parameter map to distributions over these finite number of data values, in a continuous and probably one-to-one fashion, allowing the parameter to be well estimated by maximum likelihood.
    However, you shouldn’t conclude that results about inconsistencies of MLEs aren’t relevant in practice, since data is always rounded. The practical effect of an inconsistent MLE is that the MLE also isn’t very good for finite amounts of data. Even if the data is rounded, so the MLE is consistent, this bad performance for finite amounts of data may remain. In problems with higher-dimensional data, even fairly coarse rounding will still produce a huge number of possible data values, so that the finiteness may not be of any practical relevance.
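The effect of rounding can be sketched as follows: once the density is replaced by a bin probability, each data point's log-likelihood contribution is bounded above by log 1 = 0, so no single point can contribute the exploding 1/x0^2 term. A Python sketch (the bin half-width e = 0.005 and the point x0 = 0.05 are illustrative choices of mine):

```python
import math

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bin_prob(a, b, t):
    """P(a < X < b | t) for the mixture model."""
    s = math.exp(-1.0 / t ** 2)
    return 0.5 * (Phi(b) - Phi(a)) + 0.5 * (Phi((b - t) / s) - Phi((a - t) / s))

e, x0 = 0.005, 0.05
p = bin_prob(x0 - e, x0 + e, x0)   # ≈ 0.5: the narrow component lands entirely
                                   # in the bin, but its probability cannot exceed 1/2
```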

    Reply
  • 22. Radford Neal  |  2008-08-26 at 9:35 pm

    Keith,

    Here are the R program and plots produced by it that you sent by email. I haven’t absorbed the plots entirely, but it’s certainly the case that the inconsistency of the MLE is due to a few (one, typically) data points dominating.

    Reply
  • [...] turns out to be quite a reasonable suggestion, which leads to good estimates (although, there are some rare exceptions). Here, for example, it turns out that to estimate the mean you should simply compute the average [...]

    Reply
  • 24. ekzept  |  2009-07-10 at 1:32 am

    Regarding “The big difference between Bayesian and non-Bayesian methods is that Bayesian methods integrate over the parameter space, and non-Bayesian methods don’t”, frequentist methods are not the only non-Bayesian method. There are also Kullback-Leibler methods, means of inference based upon divergence measures between likelihood functions. See for instance Burnham and Anderson, Wildlife Research, 2001, 28, 111–119, “KL information as a basis for strong inference in ecological studies.”

    Reply
  • 25. Oleg M.  |  2009-11-17 at 2:18 pm

    Did I misunderstand the example?

    I tried to replicate this example, and here is my code

    t <- 0.6
    n <- 100

    ind <- sample(c(0,1), n, replace = T) # this is the 0/1 indicator for the mixture
    X <- ind*rnorm(n) + (1-ind)*rnorm(n,t,exp(-1/t^2)) # this generates X's from Neal's example

    trange <- seq(0,1,0.0001)
    loglik <- function(th){
    sum(log(0.5*dnorm(X) + 0.5*dnorm(X,th,exp(-1/th^2)) ) )
    }

    LLt <- trange*0
    for (i in 1:length(trange)){
    LLt[i] <- loglik(trange[i])
    }
    plot(trange, LLt, type="l")

    j <- which.max(LLt)
    (mle <- trange[j])

    According to this, there are *no problems whatsoever* with the MLE.

    Reply
  • 26. Radford Neal  |  2009-11-18 at 2:02 am

    Looking at your program, I think you haven’t misunderstood the model. However, you haven’t realized how serious the numerical issues with finding the true MLE are.

    You will have generated a different sample of 100 points than I used, so I can’t say for sure exactly what happened. But one possible problem is floating-point overflow / underflow. Your first grid point for t (other than 0, which you shouldn’t really have included) is 0.0001. The standard deviation for this value, which is exp(-1/0.0001^2), will underflow to zero. Another problem is that your grid isn’t fine enough. The smallest positive data point may be around 0.01. For values of t around there, the standard deviation is around exp(-1/0.01^2), which also underflows to zero, and hence is certainly much smaller than your grid spacing of 0.0001. So your grid is likely to just skip from one side of the MLE to the other without seeing any point close enough to the MLE for it to have a higher likelihood than the likelihood in the vicinity of the true parameter value.

    The R program I link to just after the first equation goes to considerable lengths to avoid these numerical problems.
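A sketch of the log-space approach in Python (this is my reconstruction of the idea, not Radford's R program; `logaddexp` keeps the narrow component's huge density from overflowing the likelihood product):

```python
import numpy as np

def log_lik(x, t):
    """Log-likelihood of the mixture, evaluated entirely in log space."""
    s = np.exp(-1.0 / t ** 2)          # must stay > 0 in doubles, i.e. t > ~0.037
    lp0 = -0.5 * x ** 2                # log kernel of N(0,1)
    lp1 = -0.5 * ((x - t) / s) ** 2 - np.log(s)   # log kernel of N(t, s^2)
    with np.errstate(over="ignore"):   # (x - t)/s may overflow to inf; that's fine
        terms = np.logaddexp(lp0, lp1)
    return np.sum(terms) + len(x) * np.log(0.5 / np.sqrt(2.0 * np.pi))

x = np.array([-0.5, 0.05, 0.3, 0.7, 1.2])   # toy data with one small positive point
# Even on this toy set, setting t equal to the smallest positive point crushes
# t = 0.6: log_lik(x, 0.05) is larger by roughly 1/0.05^2 = 400.
```

A careful grid for the MLE would also include candidate t values at (or extremely near) each small positive data point, since the spike around the eventual MLE is far narrower than any uniform grid spacing.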

    Reply
  • 27. Quora  |  2012-03-10 at 2:08 pm

    In which situations should we use minimum MSE estimates, Unbiased estimates, and MLE estimates respectively?…

    In general, if you know estimator [math]T[/math] has lower MSE than estimator [math]U[/math] for all possible values of the unknown parameters, you should choose [math]T[/math] over [math]U[/math]. (If squared error loss is not appropriate, substitute …

    Reply
    • 28. Radford Neal  |  2012-03-10 at 5:00 pm

      Yes, that’s known as T “dominating” U, and would seem to be a good reason not to use U, though examples like the fact that the “James-Stein estimator” for the mean of a vector in three or more dimensions dominates the simple sample mean make this less obvious than one might have thought…

      Reply


