Does coverage matter?

2009-03-07 at 6:42 pm 13 comments

In response to Andrew Gelman’s extended April Fool’s diatribe on Objections to Bayesian Statistics, Larry Wasserman commented regarding physicists who want guaranteed frequentist coverage for their confidence intervals that “Their desire for frequentist coverage seems well justified. Someday, we can count how many of their intervals trapped the true parameter values and assess the coverage. The 95 percent frequentist intervals will live up to their advertised coverage claims. A trail of Bayesian intervals will, in general, not have this property”.

One thing to note about this statement is that it’s just not true. Confidence intervals produced in actual scientific research are notorious for not covering the true value, even when they are produced using frequentist recipes. This is why high-energy physicists insist on such absurdly high confidence levels (or absurdly low p-values) before declaring discoveries — what they call “five sigma” evidence, which corresponds to a p-value of less than 10^-6. If taken seriously, quoting such a small p-value would be pointless, since any reader would surely assign a higher probability than that to the possibility that the “discovery” results from fraud or gross incompetence. The high confidence levels demanded are just an ad hoc way of trying to compensate for possible inadequacies in the statistical model used, which can easily make the true coverage probability be much less than advertised (or the true Type I error rate much higher than advertised).
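As a quick check of the "five sigma" figure, here is a minimal Python sketch. It assumes the threshold refers to the tail area of a standard normal distribution; the function name is illustrative only and is not taken from any physics software.

```python
import math

def normal_upper_tail(n_sigma: float) -> float:
    """Probability that a standard normal variate exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

# One-sided and two-sided tail areas at five standard deviations.
p_one_sided = normal_upper_tail(5.0)
p_two_sided = 2.0 * p_one_sided

print(f"one-sided p-value at 5 sigma: {p_one_sided:.3g}")
print(f"two-sided p-value at 5 sigma: {p_two_sided:.3g}")
```

Under this assumption both tail areas come out below 10^-6, whichever convention is used.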

Let’s ignore this, though, since discussions of theory omitting messy practical issues can be valuable. The next thing to ask, then, is how it is possible that a 90% Bayesian probability interval — which purports to contain the true value with 90% probability — can contain the true value less than 90% of the time. A simple example will show how this can happen, and provide insight into whether we should care.

One can trivially get an example of low or zero coverage using a Bayesian method in which some parameter values are assigned low or zero prior probability, but that’s not very interesting, since if you use such a prior, you presumably want such values to be considered low probability (and hence excluded from posterior intervals) even if they’re not especially disfavoured by the data. But here I’ll give an example in which the coverage is zero for a parameter value that has just as high prior probability as all the other parameter values.

Suppose we have an unknown parameter, θ, with ten possible values, 0, 1, …, 9. We obtain data, x, which has nine possible values, 1, …, 9. If θ=0, the observation x is equally likely to be any of the values 1, …, 9 (that is, the probability of each of these values is 1/9). If θ>0, we will observe x=θ with probability one. Suppose we use a uniform prior for θ, so each of the values 0, 1, …, 9 has prior probability 1/10.

Once we observe x, only two values of θ will remain possible — θ=x and θ=0. By assumption, the ratio of prior probabilities for these two values is one. The ratio of likelihoods is 1 over 1/9, or 9, in favour of θ=x. The posterior odds are the product of the prior odds and the likelihood ratio, so we find that the posterior probabilities are 9/10 for θ=x, 1/10 for θ=0, and zero for any other value of θ. A 90% Bayesian probability region is therefore easy to define — it’s just the set consisting only of the observed value of x.
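The arithmetic above can be checked with a short Python sketch that uses exact fractions. The names (`likelihood`, `posterior`, `THETAS`) are my own illustrative choices, and the uniform prior is the one assumed in the text.

```python
from fractions import Fraction

THETAS = range(10)  # possible parameter values 0, 1, ..., 9

def likelihood(x, theta):
    """P(x | theta) for the example: uniform over 1..9 if theta is 0, else x equals theta."""
    if theta == 0:
        return Fraction(1, 9)
    return Fraction(1) if x == theta else Fraction(0)

def posterior(x, prior=None):
    """Posterior over theta given x; defaults to the uniform prior of the text."""
    if prior is None:
        prior = {t: Fraction(1, 10) for t in THETAS}
    joint = {t: prior[t] * likelihood(x, t) for t in THETAS}
    total = sum(joint.values())  # marginal probability of x
    return {t: p / total for t, p in joint.items()}

post = posterior(9)
for theta, prob in post.items():
    print(theta, prob)
```

For x=9 this gives posterior probability 9/10 for θ=9, 1/10 for θ=0, and zero elsewhere, so the set {9} is a 90% posterior region.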

What’s the coverage of this Bayesian probability region? Well, if θ>0, the coverage is 100%, since when θ>0 we are guaranteed to observe x=θ. But when θ=0, the coverage is 0%, since we never produce a posterior probability region that includes 0. Frequentist coverage is the minimum, over all possible true values of θ, of the probability that the region will include the true θ. So the coverage for these Bayesian probability regions is zero.
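The coverage calculation can be sketched the same way. This is an illustrative Python version, with names of my own choosing, that computes the coverage of the region {x} for each possible true θ. The final line, a prior-weighted average, is an extra not discussed at this point in the text; it is included only to show where the 90% figure reappears.

```python
from fractions import Fraction

THETAS = range(10)   # possible parameter values 0..9
XS = range(1, 10)    # possible observations 1..9

def likelihood(x, theta):
    """P(x | theta) for the example model."""
    if theta == 0:
        return Fraction(1, 9)
    return Fraction(1) if x == theta else Fraction(0)

def region(x):
    """The 90% posterior region under the uniform prior: just the observed x."""
    return {x}

def coverage(theta):
    """Probability, when theta is the true value, that region(x) contains theta."""
    return sum(likelihood(x, theta) for x in XS if theta in region(x))

cov = {t: coverage(t) for t in THETAS}
worst_case = min(cov.values())                # frequentist coverage
prior_average = sum(cov.values()) / len(cov)  # average over the uniform prior

print(cov)
print("worst case:", worst_case, "prior average:", prior_average)
```

The worst case over θ is zero, reached at θ=0, while the average over the uniform prior is 9/10.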

Should zero coverage be cause for worry? That depends first of all on whether we actually believe the model and the prior that were used to obtain the Bayesian regions, and secondly on whether a posterior probability region actually provides the information that we need. It’s certainly possible that the answers to both these questions are “yes”, in which case the zero coverage should not be a cause for worry. But I think that often Bayesian posterior regions are not what we want.

Recall that one way of looking at a 90% frequentist confidence region is as the set of all parameter values for which a hypothesis test would produce a p-value greater than 0.1 — ie, the set of parameter values that are consistent with the data, according to a hypothesis test using a 10% significance level. From this point of view, the confidence region is a way of telling theorists what theories to discard — namely, all those theories that predict a value for the parameter outside the confidence region.

Does this work for the Bayesian posterior regions in the example above? If we observe x=9, we construct the 90% posterior region {9}, and conclude that all theories that predict any value for the parameter other than 9 should be discarded (at least if we think 90% is high enough). This is certainly the right thing to do for theories that predict that the parameter is 1, 2, …, 8, since those values are excluded by the data with certainty. But what about parameter value 0? It’s outside the 90% posterior region, but note that its posterior probability of 1/10 is exactly the same as its prior probability. The observation of x=9 has not reduced the probability that θ=0 at all, so it certainly seems strange to say that θ=0 is now excluded by the data!

I think part of the problem is that reports of experimental results should not be aimed at presenting conclusions, as may seem most natural from a Bayesian viewpoint, but rather at providing the information with which the readers may draw conclusions. This may be the source of some objections to the prior distribution in Bayesian analysis, which can be seen as corrupting the objective presentation of the experimental results, even though frequentist methods like p-values are not suitable presentations either. In simple examples like the one above, the experimental results can be communicated fully and objectively (assuming the model is uncontroversial) by reporting the likelihood function, which in this example is 1 for θ equal to the observed x, 1/9 for θ equal to 0, and zero for any other θ. There is no need for either frequentist confidence regions or Bayesian posterior regions. Readers can use the likelihood function to produce posterior regions based on whatever priors they like. (Frequentists using methods that violate the Likelihood Principle are out of luck, but you can’t please everyone.)
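To illustrate how a reader might use a reported likelihood function, here is a hedged Python sketch. The function names and the "sceptic" prior (which puts probability 1/2 on θ=0) are hypothetical, chosen purely for illustration.

```python
from fractions import Fraction

def reported_likelihood(x):
    """Likelihood function of theta (0..9) for an observed x in 1..9."""
    if x not in range(1, 10):
        raise ValueError("x must be one of 1, ..., 9")
    lik = {t: Fraction(0) for t in range(10)}
    lik[0] = Fraction(1, 9)
    lik[x] = Fraction(1)
    return lik

def posterior_from(lik, prior):
    """Combine a reported likelihood with the reader's own prior."""
    joint = {t: prior[t] * lik[t] for t in lik}
    total = sum(joint.values())
    return {t: p / total for t, p in joint.items()}

lik = reported_likelihood(9)

# Two hypothetical readers with different priors.
uniform = {t: Fraction(1, 10) for t in range(10)}
sceptic = {0: Fraction(1, 2), **{t: Fraction(1, 18) for t in range(1, 10)}}

post_uniform = posterior_from(lik, uniform)
post_sceptic = posterior_from(lik, sceptic)
print(post_uniform[9], post_uniform[0])
print(post_sceptic[9], post_sceptic[0])
```

The same reported likelihood gives the uniform-prior reader 9/10 for θ=9, while the hypothetical sceptic ends up at 1/2 for each of θ=9 and θ=0.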

In more complex problems with a high-dimensional parameter, however, just presenting the likelihood function is both infeasible and unenlightening. One solution is to present a marginal likelihood function for just a smaller set of parameters of interest, integrating with respect to a prior distribution for the other “nuisance” parameters. This only works if the prior for nuisance parameters is relatively uncontroversial, but that may often be the case (or at least, the experimenters may be the best people to formulate this prior).

If you think you must instead produce something like a “Bayesian confidence region”, perhaps the best method is to report the region for which the ratio of posterior probability (or density) to prior probability (or density) is not less than 0.1 (for an analogue of a 90% region). This can also be thought of as the region containing all θ values for which the Bayes Factor is at least 0.1 for a comparison of the model in which this θ has prior probability one to the model in which θ has whatever prior you have set up as the “default”. This can also be seen as the region of θ where a theorist who assigns prior probability 1/2 to their pet theory predicting that value of θ would not abandon their theory (using a 90% confidence level) if they use your “default” prior for parameter values conditional on their theory being false. For the example above, the regions produced in this way (with uniform default prior) would consist of the observed x plus 0. Coverage would be 100%.
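A minimal Python sketch of this posterior-to-prior ratio region for the example, assuming the uniform default prior and the 0.1 threshold from the text (function and variable names are illustrative):

```python
from fractions import Fraction

THETAS = range(10)   # possible parameter values 0..9
XS = range(1, 10)    # possible observations 1..9

def likelihood(x, theta):
    """P(x | theta) for the example model."""
    if theta == 0:
        return Fraction(1, 9)
    return Fraction(1) if x == theta else Fraction(0)

def ratio_region(x, prior, threshold=Fraction(1, 10)):
    """All theta whose posterior/prior ratio is at least the threshold."""
    joint = {t: prior[t] * likelihood(x, t) for t in THETAS}
    total = sum(joint.values())
    return {
        t for t in THETAS
        if prior[t] > 0 and (joint[t] / total) / prior[t] >= threshold
    }

uniform = {t: Fraction(1, 10) for t in THETAS}
regions = {x: ratio_region(x, uniform) for x in XS}

def coverage(theta):
    """Probability that the ratio region contains theta when theta is true."""
    return sum(likelihood(x, theta) for x in XS if theta in regions[x])

cov = {t: coverage(t) for t in THETAS}
print(regions[9], cov)
```

Here the ratio is 9 for θ=x, exactly 1 for θ=0, and 0 for every other value, so each region is {0, x} and every θ is covered with probability one.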

I hasten to add that this suggestion assumes a context in which various theories predict specific values for the parameter, and our interest is in which of these theories is true. This is quite unlike many applications of confidence intervals, in which the parameters are things such as the average daily caloric intake of Canadian adults, or the regression coefficient of income on years of education. For such parameters, the likelihood function (or marginal likelihood function) should be reported. If this likelihood happens to be approximately normal, and concentrated in a region where any reasonable prior would be nearly uniform, you could go ahead and present a 90% posterior region if you’re so inclined.

Entry filed under: Statistics, Statistics - Nontechnical.


13 Comments

1. Larry Wasserman | 2009-03-10 at 12:53 pm

Radford:

I enjoyed your post.
Here is my (extremely opinionated) response.

>One thing to note about this statement is that it’s just not true.
>Confidence intervals produced in actual scientific research are
>notorious for not covering the true value, even when they are produced
>using frequentist recipes.

That’s a separate issue from the conceptual issue of whether
coverage is desirable in the first place. Practical things that screw
up statistical methods affect Bayes and frequentist methods. I was
focusing on the philosophical question of whether coverage is a
desirable goal. (And I think it is).

>This is why high-energy physicists insist on such absurdly high
>confidence levels (or absurdly low p-values) before declaring
>discoveries — what they call “five sigma” evidence, which corresponds
>to a p-value of less than 10^-6.

My dealings with them suggest that the bigger concern is multiple testing.

>One can trivially get an example of low or zero coverage using a
>Bayesian method in which some parameter values are assigned low or
>zero prior probability, but that’s not very interesting, since if you
>use such a prior, you presumably want such values to be considered
>low probability (and hence excluded from posterior intervals) even if
>they’re not especially disfavoured by the data.

Actually this is far from trivial and is the situation I am most
concerned with. In nonparametric problems (or high dimensional
parametric problems) the prior must put low probability over most of
the space. For example, the coverage of the usual posterior on
Sobolev spaces is 0! In fact, there is no prior that gives correct
coverage in typical nonparametric problems.

>But here I’ll give an example in which the coverage is zero for a
>parameter value that has just as high prior probability as all the
>other parameter values.

I like your example. Of course I would report {0,x} as the 100
percent confidence interval, as you note.

The idea of just presenting the likelihood is not useful in my
opinion. As you note, for high dimensional (or nonparametric) models,
this is problematic. Integrating out nuisance parameters can lead to
very biased results.

(Here is an example. In my astrostatistics work, my colleagues and I
have concluded that the current estimates of the Hubble constant are biased
upwards. The reason is that they use integrated likelihood for a high
dimensional likelihood. We could be wrong and future data should nail
it more precisely.)

I like to say that Bayesians are “slaves to the likelihood function.”
The likelihood is useful in low dimensions but not in complex, high
dimensional problems.

–Larry
Reply
2. Radford Neal | 2009-03-10 at 11:46 pm

Hi Larry,

In practice, one can often get good results putting priors on high or infinite dimensional parameter spaces. So the practical significance of the results you mention isn’t clear to me, though I don’t yet have good arguments on this issue. Note, though, that I argue in the post that posterior regions often aren’t answering the right question anyway, so I’m not necessarily disturbed that they might have zero coverage (when I wouldn’t be using them).

Regarding the uses of likelihood functions, it’s not just when the parameter space is of high dimension that physicists abandon them. The highly-influential paper of Feldman and Cousins on “Unified approach to the classical statistical analysis of small signals” advocates a dubious frequentist procedure for constructing confidence intervals for a simple one-parameter problem, for which plotting the likelihood function is trivial. In that paper, they several times concede that for any actual question that you might want answered, a Bayesian method would be preferable, but nevertheless insist that reporting confidence intervals with correct coverage is a must, even if there are good arguments that the confidence intervals are ridiculous.

So I don’t think that the fondness of some physicists for frequentist confidence intervals is due to an awareness of the difficulties of handling nuisance parameters. I think it’s due to a desire for a ritual that justifies saying they’ve “discovered” something. I argue in the post that it shouldn’t be the role of the experimenter to declare that the evidence is sufficient to claim a “discovery”. That’s the role of the reader.
Reply
3. brendonbrewer | 2009-03-11 at 1:16 am

>>In more complex problems with a high-dimensional parameter, however, just presenting the likelihood function is both infeasible and unenlightening.<<

It’s surely infeasible to plot a function in many dimensions, but the likelihood is still enlightening – it’s the evidence provided by the data. Although we would have liked for the data to uniquely determine the solution, that’s rare, and presenting the likelihood is just a formal statement of this.

As discussed by both of you, high dimensional Bayesian Inference is hard. When it’s too hard I usually suggest that the best thing to do is simply look at the data, and summaries thereof, and use common sense, knowledge of the topic, and intellectual honesty. The reality is that any Bayesian analysis that disagreed with this judgment would be rejected anyway. The “non-parametric” methods mentioned by Larry Wasserman can be helpful tools to guide this process, but they’re not objective deciders-of-the-truth.

Confidence intervals are based on prior (pre-data) probabilities and are not the relevant probabilities once data has been taken into account. You can have old confidence intervals and see how many of them contain the true value, once you know that value. However, what’s the point? Once you know the true value, that’s it. There’s no more inference to be done.

I haven’t thought that hard about coverage though so I’m open to more arguments, but I find it hard to believe that the Bayesian prescription of 1) write down all the possibilities 2) delete all the ones that are false, will ever be superseded.

Regarding borderline detections, there’s no shame in being uncertain. You can claim tentative discovery, without being required to justify it with controversial formalisms. Unless you’re into that sort of thing! :-)
Reply
4. Corey | 2009-03-15 at 10:39 pm

The example problem is similar to a ten-door version of the Monty Hall problem. Was that deliberate?

(theta = 0 corresponds to the event “initial door guessed was correct”)
Reply
5. Radford Neal | 2009-03-15 at 11:22 pm

No, I’m familiar with the Monty Hall problem, but wasn’t thinking of it here. I’m not sure I see the correspondence, since when the initial door is not correct, Monty isn’t confined to just one possibility, but can open any door other than the correct one.
Reply
6. Corey | 2009-04-11 at 1:17 am

In the multi-door Monty Hall variant I’m thinking of, when the initial door is not correct Monty opens every door except the initial one and the correct one. In the ten-door version, you pick a door, Monty opens eight doors with booby prizes, and you choose whether to switch or stay with your original choice.

This variant is useful for getting people past the intuitive feeling that switching and sticking are equivalent in the 3-door version. When the number of doors is large it’s much easier to see that the probability that the door initially picked is the winner doesn’t change when Monty does the reveal, so the remaining door almost certainly hides the big prize.
Reply
7. Günter Zech | 2010-01-26 at 7:58 am

Hi,

I never understood the enthusiasm of many of my colleagues for the coverage paradigm. The problem with them is that they favor parameter values with low predictive power and often exclude regions of high likelihood in favor of regions where the latter is extremely low. The likelihood function provides the experimental evidence, and the problem with a high dimensional parameter space is also unavoidable and even more problematic in the frequentist scheme.
Physicists don’t like priors, and this is for good reasons. On the other hand, this does not justify the coverage approach. Why should intervals cover? (There are a few rare situations where coverage is important, but these are very much the exception.) Many high energy physicists and I think that likelihood ratio intervals and, in the presence of nuisance parameters, profile likelihood ratio intervals provide a sensible way to document the experimental evidence of the data with respect to the wanted parameters.

Günter

Reply
8. Harrison Prosper | 2010-03-21 at 9:05 pm

Hi,

High-energy physicists like exact coverage because it is viewed as “objective”. I have no problem with requiring approximate coverage. But I see absolutely no value in bending over backwards to get exact coverage. Why? Because coverage has no operational content in the real world. The thousands of intervals published over the past century by high-energy physicists have some coverage. My question to my colleagues is “so what?”. In what way is my understanding of the universe enhanced? Suppose that 2/3 of the intervals in the Particle Data Book cover their true values. That is very comforting of course but since I am not privy to the true values of any physical constant, I have no operational means of ascertaining with certainty the actual coverage. The fact that I can check the coverage of intervals in the ensemble of computer experiments I run on my laptop provides no mechanism to do the same in the real world.

In the end, what matters to me is that I learn something interesting about the world in which we live and I am able to test my understanding. My experience over the past quarter century is that I can learn a great deal about the world by using a judicious mix of frequentist and Bayesian ideas: analyze data using Bayesian ideas (because they are conceptually simpler, albeit computationally demanding) and run frequentist computer experiments to verify that the Bayesian analyses deliver what is claimed. That is precisely what my colleagues and I did in the work leading to the 2009 discovery of the production of single top quarks at Fermilab. We used Bayesian methods through and through (even using the dreaded “flat” prior for the parameter of interest – the cross section) and verified that the Bayesian methods worked as claimed. Indeed, the frequentist-verified Bayesian methods worked astonishingly well.
Reply
9. Mike Evans | 2010-07-02 at 1:10 pm

Hi Radford;

A couple of comments.

1. Bayesian regions have the right coverage probabilities (at least when using proper priors) if you think of a sequence (theta_i, x_i) with theta_i generated from the prior and x_i from the sampling distribution given theta_i. An obvious point, but it isn’t clear to me why this repeated sampling model is less relevant than the usual frequentist repeated sampling model. Both are just thought experiments that give some kind of validity to the inferences. As you point out, this validity depends on the correctness of the model (in the frequentist case), hence the need for model checking, and also the prior (in the Bayesian case), hence the need for checking for prior-data conflict to assess whether or not the true value is out in the tails of the prior. In the end we can never be sure that the ingredients we put into a statistical analysis are correct, only check that they make sense in light of the data. After that we want a logically sound system for reasoning, and that seems to require a prior, but maybe someday someone will present a convincing argument that one can get by without it.

2. The regions you suggest towards the end of your comment are, I believe, what I have been calling (for lack of better terminology) relative surprise regions for some years now. Besides their invariance under reparameterizations, they have lots of nice properties when compared to other Bayesian proposals (e.g. see http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ejs/1229975382), including optimal (Bayesian) repeated sampling properties.

3. Presenting only the likelihood is okay for inference but not for model checking.

Mike
Reply
10. Radford Neal | 2010-07-02 at 3:12 pm

Hi Mike,

You’re right that the suggestion at the end of my post is related to your “relative surprise” work. The link you gave doesn’t work for me. Another link that does (not sure if to the same thing) is here.

From equation (3) in the paper linked above, it seems that the “confidence regions” you would produce aren’t the same as the one I suggest above. I suggest including all parameter values for which the “relative surprise” is greater than 0.1 (say), but you look at what you call the “observed relative surprise”, which seems to be different.

Beyond this, I think my justification above may be different than yours. I’m thinking in somewhat sociological terms (as opposed to private inference), in a context in which different researchers have different priors, but all agree on a common prior for the parameter conditional on it not having a particular value on which their opinions differ from some other researchers’ opinions.
Reply
- 11. Mike Evans | 2010-07-11 at 11:06 pm
  
  Hi Radford;
  
  This link, to the pre-publication version of the paper, hopefully works
  
  this one
  
  This develops some properties of what I call relative surprise regions (as in the link you gave). These comprise all those parameter values for which the ratio of the posterior density to the prior density is greater than k, where k is chosen so that the posterior content of this set of parameter values is gamma. This set is also equal to the set of parameter values for which the observed relative surprise is less than or equal to gamma (you can think of 1 minus the observed relative surprise as a P-value). So your region would be a relative surprise region for some gamma. The paper linked to shows that these sets have a number of optimal properties in the class of all Bayesian credible regions.
  
  Your justification for them may indeed be different.
  
  Mike
  Reply
12. ali0482 | 2010-08-14 at 8:49 am

analyze data using Bayesian ideas (because they are conceptually simpler, albeit computationally demanding) and run frequentist computer experiments to verify that the Bayesian analyses deliver what is claimed. That is precisely what my colleagues and I did in the work leading to the 2009 discovery of the production of single top quarks at Fermilab. We used Bayesian methods through and through (even using the dreaded “flat” prior for the parameter of interest – the cross section) and verified that the Bayesian methods worked as claimed.
Reply
13. coreyyanofsky | 2018-11-28 at 10:27 pm

Apparently this model (only with θ going to 100) goes back to a book chapter by Allan Birnbaum entitled “Concepts of Statistical Evidence” in a festschrift for Ernest Nagel entitled “Philosophy, Science, and Method”.
Reply

Radford Neal's blog