You mention that:

“Some involve models where the number of parameters increases with the number of data points, which I think is cheating, since these ought to be seen as “latent variables”, not parameters.”

Do you have any references for further reading on these types of models? I am working on a similar problem.

Best, Håkan

]]>What you integrate over the posterior depends on what you are trying to do. You might integrate predictive probability densities for a new observation, for instance. Occasionally you might just integrate a parameter value, which gives its posterior mean.
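As a minimal sketch of both kinds of integral, here is a grid approximation for a deliberately simple case that is not the model from the post: data assumed to come from N(mu, 1) with unknown mu and a flat prior on [-5, 5]. The data values, the grid, and the function names are all made up for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical data, assumed drawn from N(mu, 1) with unknown mu.
data = [0.8, 1.3, 0.4, 1.1]

# Grid over mu with a flat prior on [-5, 5] (an assumption for illustration).
step = 0.01
grid = [-5.0 + i * step for i in range(1001)]

# Unnormalized posterior = prior * likelihood; the flat prior is a constant.
unnorm = [math.prod(normal_pdf(x, mu, 1.0) for x in data) for mu in grid]
total = sum(unnorm)
weights = [u / total for u in unnorm]  # posterior weight of each grid point

# Integrating the parameter itself over the posterior gives the posterior mean.
posterior_mean = sum(w * mu for w, mu in zip(weights, grid))

def predictive_density(x_new):
    """Posterior predictive density: p(x_new | mu) averaged over the posterior."""
    return sum(w * normal_pdf(x_new, mu, 1.0) for w, mu in zip(weights, grid))

print("posterior mean of mu:", posterior_mean)
print("predictive density at 1.0:", predictive_density(1.0))
```

The posterior mean is also the point estimate with the smallest expected squared error under the posterior, which is what links it to MMSE estimation.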

Certainly if you have a bunch of unrelated distributions and only one sampled point from each, you’re not going to be able to estimate the parameters of all these distributions very well. There needs to be some connection between data points to get the property that more data gets you more certain results.

]]>Thank you for the suggestion, I really appreciate it. So is integrating over the posterior the same thing as trying to find an MMSE estimator? Are there reading materials that I can study to understand this idea better?

In my case I have independent but non-identically distributed Gaussian observation samples, and I am trying to find an MLE using this data. I wonder if the non-identical part makes the MLE inconsistent, as essentially I have only one sample per Gaussian distribution.

]]>You don’t say anything about your model or data, so it’s hard to give anything but general advice. The general Bayesian advice is that you should integrate predictions over the posterior distribution, not pick a single parameter vector that maximizes the likelihood, or anything else. Integrating over the posterior distribution certainly works fine for the example in this post – it’s only maximizing that fails for this problem, since the maximum is at a peak that is very high, but also very narrow, with a tiny total probability mass in the peak.
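To make the "very high, but also very narrow" point concrete, here is a toy sketch. It is not the likelihood from the post: the unnormalized posterior over t below is a hand-built sum of a broad bump and a tall spike, and every number in it is a hypothetical choice.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Toy unnormalized posterior (hypothetical numbers): a broad bump holding
# almost all of the mass, plus a spike that is much taller but very narrow.
BROAD_MEAN, BROAD_SD, BROAD_MASS = 0.6, 0.1, 1.0
SPIKE_MEAN, SPIKE_SD, SPIKE_MASS = 0.2, 1e-6, 1e-4

def posterior_unnorm(t):
    return (BROAD_MASS * normal_pdf(t, BROAD_MEAN, BROAD_SD)
            + SPIKE_MASS * normal_pdf(t, SPIKE_MEAN, SPIKE_SD))

def integrate(f, lo, hi, n=2000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

HALF = 10 * SPIKE_SD  # half-width of the window that contains the spike

def integrate_all(f):
    """Integrate over [-0.4, 1.6], resolving the spike window finely."""
    return (integrate(f, -0.4, SPIKE_MEAN - HALF)
            + integrate(f, SPIKE_MEAN - HALF, SPIKE_MEAN + HALF)
            + integrate(f, SPIKE_MEAN + HALF, 1.6))

# Maximizing picks whichever peak is taller, regardless of its width.
height_spike = posterior_unnorm(SPIKE_MEAN)
height_broad = posterior_unnorm(BROAD_MEAN)
map_estimate = SPIKE_MEAN if height_spike > height_broad else BROAD_MEAN

# Integrating weighs each region by its probability mass instead.
total = integrate_all(posterior_unnorm)
spike_fraction = integrate(posterior_unnorm,
                           SPIKE_MEAN - HALF, SPIKE_MEAN + HALF) / total
posterior_mean = integrate_all(lambda t: t * posterior_unnorm(t)) / total

print("maximizer:", map_estimate)
print("fraction of mass in the spike window:", spike_fraction)
print("posterior mean:", posterior_mean)
```

In this toy setup the maximizer sits on the spike, while the posterior mean is governed almost entirely by the broad bump.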

]]>I’d appreciate any suggestions you might have to offer. I’m also not sure how to pick a regularization factor properly, or how to assess the performance of such a regularized estimator: the CRLB cannot be used any more, as my estimator is biased. Thank you for reading my comment. I am from a wireless communication and signal processing background and am not very skilled in statistical thinking. ]]>

What is “ordinary” or “pathological” is a matter of opinion. But the example I give does avoid some of the obvious things that one might think might make it irrelevant to “real” problems. In particular, it is NOT the case that the likelihood is ever infinite, for any non-zero value of t, nor is it the case that the MLE will be exactly at a data point. Of course, the distributions do in some sense get a bit “extreme” for some values of t.

At the other extreme, for models in which the parameter space is a finite set, the MLE will certainly be consistent.

Looking at both of these extremes may give you more insight into what to expect from an MLE.

]]>For example, my ‘family’ is a function of t: at one end (say t0) it does not include a delta function. (The delta function is just an extreme, showing how absurdly narrow distributions in my/your ‘family’ could be.)

Such a ‘family’ will never converge to the right original distribution because, as it hits any data point, its MLE will diverge for t < 0.

I do want to point out that it is a great lesson and amazing work nevertheless. I simply disagree that it is an ORDINARY example. Boy, it’s anything but: it includes divergent likelihoods.

Am I getting something wrong here? It's always good to know if you are wrong, that's where we learn the most.

]]>It’s standard to define a likelihood function using a probability density function, when the data is continuous, so it’s quite common for likelihoods to be numerically evaluated as being greater than 1. However, theoretically, a likelihood is by definition an equivalence class of functions of parameters differing only in an overall constant factor, so it doesn’t even make sense to ask if a likelihood is greater than one.
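A small sketch of both points, using hypothetical measurements and a Gaussian model with known standard deviation (the names and numbers are assumptions, not anything from the post): the likelihood evaluates to well above 1 in one set of units and well below 1 in another, while the likelihood ratio between two parameter values is the same in both.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood(mu, data, sd):
    """Likelihood of mu for IID N(mu, sd^2) data, built from densities."""
    return math.prod(normal_pdf(x, mu, sd) for x in data)

# Hypothetical measurements in metres, with an assumed known sd of 0.05 m.
data_m = [1.02, 0.97, 1.05]
sd_m = 0.05

# The same measurements expressed in centimetres.
data_cm = [100.0 * x for x in data_m]
sd_cm = 100.0 * sd_m

lik_m = likelihood(1.0, data_m, sd_m)       # numerically greater than 1
lik_cm = likelihood(100.0, data_cm, sd_cm)  # same model, numerically below 1

# The change of units multiplies the likelihood by a constant factor only,
# so ratios between parameter values (and the maximizer) are unaffected.
ratio_m = lik_m / likelihood(1.1, data_m, sd_m)
ratio_cm = lik_cm / likelihood(110.0, data_cm, sd_cm)

print("likelihood in metres:", lik_m)
print("likelihood in centimetres:", lik_cm)
print("likelihood ratios:", ratio_m, ratio_cm)
```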

Trying to think what you might be thinking, however, my guess is that you think that all real data is discrete, not continuous, and hence all real likelihoods are based on probability mass functions, not probability density functions. There’s something to be said for this view, and it does formally eliminate the inconsistency of the MLE in this example if you assume that the data has limited precision.

However, in high-dimensional problems the finite space of possible data sets is extremely large, even assuming individual values are rounded to not-too-much precision. It may then be more enlightening to consider continuous data (even if that’s an unrealizable idealization) than to trust that the MLE is guaranteed to be consistent in finite settings, when convergence to the correct value may in practice occur extremely slowly.
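The contrast between the two views can be sketched for a single Gaussian component centred exactly on an observed value. The observation 0.37 and the rounding precision 0.01 are hypothetical. As the standard deviation shrinks, the density at the point grows without bound, whereas the probability of the rounding interval, which is what a likelihood for limited-precision data would use, can never exceed 1.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def rounded_prob(x, mean, sd, precision):
    """Probability that a N(mean, sd^2) draw rounds to x at the given precision."""
    half = precision / 2.0
    return normal_cdf(x + half, mean, sd) - normal_cdf(x - half, mean, sd)

x_obs = 0.37      # hypothetical observation
precision = 0.01  # assumed rounding precision

# Each row: (sd, density at x_obs, probability of the rounding interval).
rows = []
for sd in (1.0, 1e-2, 1e-4, 1e-6):
    rows.append((sd,
                 normal_pdf(x_obs, x_obs, sd),
                 rounded_prob(x_obs, x_obs, sd, precision)))

for sd, density, prob in rows:
    print(f"sd={sd:g}  density={density:.4g}  interval probability={prob:.4g}")
```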

]]>L(Data)=P(Data|Model). For normally distributed IID data, this likelihood, represented by the product of point-sampled Gaussians, CAN be a good approximation of un-normalized probability, but isn’t always. Here it is not, for the peaked distribution surrounding the point closest to zero (singling out that point as sigma -> 0).

As likelihood is a probability, L(D) is never >1. I believe the issue here is in a problematic estimate of likelihood, not inconsistency of the MLE.

]]>