A Beginner's Guide to the Statistical p-value
We will answer the questions:
- What is a p-value?
- How do we compute statistical p-values?
- How does the interpretation of p-values relate to the α-criterion and critical values of the test statistic?
- Is there an advantage to reporting exact p-values?
- Why is the definition of the p-value so confusing, and what are the practical consequences of defining it this way?
What is a p-value?
If you’ve ever read the results of a biomedical or behavioral experiment, you have run across the p-value. It is used to justify statements that certain results are ‘statistically significant’, elevating those results to a level of importance and validity not achieved by results that do not quite reach statistical significance.
How does the p-value achieve this feat, and what is the logical basis for using p-values as a decision variable when testing scientific hypotheses?
The first critical step in answering these questions is a solid understanding of what a p-value is, and is not.
The p-value is: “The area under a sampling distribution over theoretical values of a statistic that covers all values that are at least as extreme as the experimentally derived value of the statistic”.
Let’s look at this graphically for a t-test. The t-test is computed when we want to compare the mean of a data sample to some theoretically predicted value. For example, if your theory predicts that the average experimental datum will be zero, then the sampling distribution of the t-statistic will look something like Fig. 1.
Each probability in the distribution shown in Fig. 1 is the probability density corresponding to the t-statistic computed from a different potential data mean. Probability mass, on the other hand, can be defined only over ranges of abscissa values along a probability density function. Thus, a p-value cannot be assigned to any single observed mean, because the probability mass at any single abscissa value is zero.
You can see the p-value in Fig. 1b; it is the shaded area.
- Notice that the p-value includes probability densities from positive AND negative abscissa values… even though the observed mean is either positive or negative, not both.
- It may surprise you to realize that, in this example (and in any example where the abscissa is not a discrete variable), the p-value does not technically even include the observed dataset.
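Because probability mass lives in ranges rather than points, a p-value must be computed as an area under the density. A minimal numerical sketch (assuming MATLAB's Statistics Toolbox, with an arbitrary observed value chosen purely for illustration):
nu=7;                                 % degrees of freedom (illustrative)
tobs=2.1;                             % an arbitrary 'observed' t-statistic
tailArea=2*integral(@(t) tpdf(t,nu),tobs,Inf)   % two-tailed area: a p-value
pointMass=integral(@(t) tpdf(t,nu),tobs,tobs)   % zero-width range: mass is 0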
A p-value is not:
- It is not the probability of your dataset. Indeed, a p-value is the probability of many datasets, and those datasets do not always include the observed dataset.
- It is not the probability of the null hypothesis (H₀). In fact, the computation of the p-value depends on assuming that the null hypothesis is true, so assigning it anything but unity probability would make no sense.
  - Further, it is not even possible to define the probability of a hypothesis within classical statistical practice.
- It is not the probability of any alternative hypothesis (i.e., hypotheses competing with the null hypothesis). In fact, there is almost never any need to define a specific alternative to the null hypothesis to perform a classical statistical hypothesis test, let alone to compute probabilities relative to alternative hypotheses.
  - Again, it is not possible to define the probability of any hypothesis within classical statistical practice.
How do we compute statistical p-values?
To see how the p-value is computed, let's start with an example problem. Suppose you are testing a drug for treatment of the common cold. You collect data in which each of 8 individuals recovers from a cold once naturally and once using your drug. With the drug, they recover faster (in days) by the following amounts: D = [1.1, 2, 0, 0, 0.4, 0.6, 3, 1.2].
The t-statistic based on these data is t = d̄/(s/√n) = 1.0375/(1.0391/√8) ≈ 2.82.
To compute the t-test, we plot our data-defined t-statistic against the t-density, which is a sampling distribution based on the size of the dataset. Here, there are n = 8 observations, so the t-distribution is plotted by typing:
% Data: improvement in recovery time (days) for each of n = 8 individuals
d=[1.1, 2, 0, 0, .4, .6, 3, 1.2];
meand=mean(d); sd=std(d); n=length(d);
tdat=meand/(sd/sqrt(n));                  % t-statistic computed from the data

% Sampling distribution of t under the null hypothesis (df = n-1)
tlist=linspace(-4.5,4.5,201);
p=tpdf(tlist,n-1);

% Upper panel (a): shade the tails corresponding to the alpha-criterion
figure; subplot(2,1,1); plot(tlist,p,'.'); axis([tlist(1) tlist(end) 0 1.02*max(p)]); box off; hold on
tcrit=[tinv(.025,n-1) tinv(.975,n-1)];    % two-tailed critical values at alpha = 0.05
icrit=[find(tlist<tcrit(1),1,'last') find(tlist>tcrit(2),1)];
for i=1:icrit(1)
    plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(tcrit(1)*[1 1],[0 tpdf(tcrit(1),n-1)],'k-','LineWidth',1.75)
for i=icrit(2):length(tlist)
    plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(tcrit(2)*[1 1],[0 tpdf(tcrit(2),n-1)],'k-','LineWidth',1.75)

% Lower panel (b): shade the tails beyond +/- the observed t (the p-value)
subplot(2,1,2); plot(tlist,p,'.'); axis([tlist(1) tlist(end) 0 1.02*max(p)]); box off; hold on
icrit=[find(tlist<-tdat,1,'last') find(tlist>tdat,1)];
for i=1:icrit(1)
    plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(-tdat*[1 1],[0 tpdf(-tdat,n-1)],'k-','LineWidth',1.75)
for i=icrit(2):length(tlist)
    plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(tdat*[1 1],[0 tpdf(tdat,n-1)],'k-','LineWidth',1.75)
plot(tdat,0,'ko','MarkerFaceColor',[.2 .3 .4],'MarkerSize',7)  % mark the observed t
where we have added shading in the ‘tails’ of the upper distribution to represent the location of the probability mass corresponding to the α-criterion [α = 0.05], and shading in the lower distribution to represent the location of the probability mass corresponding to the p-value defined by the dataset, p = 0.0256.
These plots make it clear that the t-statistic, t ≈ 2.82, is greater than the critical value, tcrit ≈ 2.365, and therefore:
- this corresponds to a ‘statistically significant’ statistical hypothesis test
- we can compute the entire statistical hypothesis test in a single line of code:
>> [h,p,ci,stats]=ttest(d)
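Equivalently, the exact two-tailed p-value can be computed directly from the t cumulative distribution function (a one-line check, assuming the Statistics Toolbox):
pval=2*tcdf(-abs(tdat),n-1)   % area beyond +/-|tdat| in both tails; here 0.0256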
How does the interpretation of p-values relate to the α-criterion and critical values of the test statistic?
We just made passing mention of the critical value of the t-statistic and the α-criterion. Let's make sure we understand these, and their relationship to ‘statistical significance’.
1. You always start with the α-criterion.
   - The α-criterion sets your type I error rate: the rate at which you will (erroneously) find a statistically significant result (and reject H₀) in instances when the null hypothesis (H₀) is in fact correct.
   - You therefore choose an α-criterion to set a low type I error rate that is nevertheless not too low relative to the statistical power of your experiment.
2. The α-criterion also sets the critical value of the test statistic. You can see this in the upper panel (a) of the figure.
   - After setting aside 2.5% of the probability mass in each tail (for α = 0.05), the borders of those tail-area masses are the critical values of the statistic, here tcrit ≈ ±2.365 for the t-statistic.
3. After collecting your data, you can compute the p-value.
   - Just as α defines the edges of the criterion regions at each tail of the t-distribution, the absolute value of the t-statistic computed from your data defines the edges of the tail areas, positive and negative, that define the p-value.
   - The p-value is the area under the t-distribution starting at the positive and negative values of the observed t-statistic and continuing out into the tails, as in the sketch below.
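To make these relationships concrete, here is a short sketch (illustrative, not part of the original analysis): it computes the critical value from α, then checks the type I error rate empirically by simulating many datasets for which H₀ is true:
alpha=0.05; n=8;
tcrit=tinv(1-alpha/2,n-1);      % two-tailed critical value (about 2.365 for df = 7)
nSim=1e5; falseReject=0;
for s=1:nSim
    d0=randn(n,1);              % a dataset for which H0 (zero mean) is correct
    t0=mean(d0)/(std(d0)/sqrt(n));
    falseReject=falseReject+(abs(t0)>tcrit);
end
falseReject/nSim                % long-run false-rejection rate approaches alpha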
p-value quiz
You have developed a treatment for the common cold, and collect data on recovery times in both untreated patients and patients who are given your new treatment. In your analysis, the p-value corresponding to the observed difference in recovery times of the two groups was p = 0.015 (the average recovery time for patients receiving your new treatment was 1 day shorter than for controls). Given that you have set your alpha-criterion to α = 0.05, which of the following statements are true?
1. You have disproved the null hypothesis (the hypothesis that there is no statistical difference in recovery times).
2. You have obtained more evidence against the null hypothesis than if the p-value were p=0.045.
3. You have found the probability of the null hypothesis being true.
4. You have proved your hypothesis (there is a reliable statistical difference in recovery time).
5. From the p-value, you can deduce the probability of the experimental hypothesis being true.
6. You are able to lower your alpha-criterion and report this effect as significant at the 0.02 level (i.e., p < 0.02).
7. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
8. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 98.5% of occasions.
9. You have found the probability of the alternative hypothesis being false.
10. You have computed the data analog of the type I error rate, meaning there is a 1.5% chance you will incorrectly reject the null hypothesis when it is actually true.
Before looking at the quiz answers, ask yourself: How many of these ten statements are correct?
Is there an advantage to reporting exact p-values?
It is the current recommendation and directive of the American Psychological Association (APA), a directive followed by most biomedical journals, that exact p-values be reported.
Since it is neither incorrect to report the exact p-value associated with a statistical test, nor incorrect to report a p-value simply as greater or less than alpha, we should ask ourselves what the practical consequences of each practice are likely to be.
- Reporting exact p-values adds precision to the report of your calculations, but also gives the erroneous impression that the p-value is a stand-in for the 'level of support' or 'amount of evidence' for the conclusion that the null hypothesis should be rejected.
- Reporting only the values of test statistics and indicating which reach significance (i.e., whether p < α) makes it clear that the p-value is meant only to be compared to the threshold defined by α, and that its exact value is otherwise inconsequential.
If you've ever spoken to students in an undergraduate statistics course (or, often, their professors as well) about the meaning of the p-value, you will quickly realize that the most common misunderstanding regarding p-values is the erroneous belief that they are a measure of the 'evidence' against H₀ (or perhaps a measure of effect size).
The only argument in favor of reporting exact p-values that makes some sense is that when a low-powered study just misses statistical significance, it might be worth following it up with another study using a larger dataset.
However, this still would not offset the detrimental effect of confusing so many regarding the type of information that is meant to be conveyed by p-values (particularly because anyone sophisticated enough to make this judgement is also capable of recognizing the same thing based on the value of the test statistic and sample size).
In short, exact p-values only serve to perpetuate confusion regarding the nature of statistical hypothesis testing and the actual information conveyed by p-values, and should therefore not be reported.
In seeming agreement with my opinion, and also my distaste for the program of classical statistical hypothesis testing generally, the American Statistical Association (ASA) board of directors has issued a statement regarding the p-value that reads in part:
"Widespread use of 'statistical significance' (generally interpreted as 'p < 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process"
- They suggest instead that the p-value threshold should act as only one element of a larger argument in favor of your conclusions and interpretation of your data.
Why is the definition of the p-value so confusing, and what are the practical consequences of defining it this way?
I think first I have to address the thought you may well be having, which is:
‘Is this REALLY correct? The p-value, which is the basis for nearly all historical biomedical and many other scientific findings from the last 100 years:
- is NOT the probability of your dataset under the null hypothesis? And also
- is NOT the probability of the null hypothesis being correct based on the data?’
Unfortunately, the answer to both questions is ‘yes’.
If the definition and use of the p-value felt a bit convoluted to you, you’re not alone. The p-value is the basis for interpreting the classical statistical null hypothesis test, which is a very roundabout way of assessing the connection between your data and competing hypotheses precisely because the p-value is NOT the information we really want, if we want to compare and test hypotheses.
You might be wondering, for example:
- why it is that we computed the probabilities of potential (unobserved) datasets under a single hypothesis (H₀), when we could have instead computed the probability of having observed the actual dataset under a range of competing hypotheses
  - this is the difference between computing sampling distributions and likelihood functions (see the sketch after this list)
- why we didn’t compute the probability of the null hypothesis based on the observed data, and compare that to the probability of the alternative hypothesis (i.e., determine which hypothesis is more likely, based on the observed data)
- why none of our computations involve the predictions of the hypotheses that are competing with the null hypothesis
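To see the contrast concretely, here is a minimal sketch of a likelihood function (illustrative only; it assumes normally distributed data with the spread fixed at the sample standard deviation). Instead of fixing H₀ and ranging over unobserved datasets, we fix the observed dataset and range over hypothesized means:
d=[1.1, 2, 0, 0, .4, .6, 3, 1.2];
mulist=linspace(-1,3,201);        % candidate values of the true mean
L=zeros(size(mulist));
for i=1:length(mulist)
    L(i)=prod(normpdf(d,mulist(i),std(d)));  % density of the observed data
end
figure; plot(mulist,L/max(L),'k-'); box off
xlabel('hypothesized mean'); ylabel('relative likelihood of the observed data')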
These are excellent questions, and in fact represent much of the difference between classical and Bayesian statistical methods.
Quick answers:
- The p-value is a probability that primarily describes unobserved datasets (datasets more extreme than the one observed experimentally), so we can’t just compute the probability of the observed dataset, because that would not yield a p-value.
- You are forbidden from even uttering the phrase ‘probability of the hypothesis’ (null or otherwise) within classical statistics. For this thought crime you are sent straight to classical statistical jail, and you do not collect 200 classical statistical dollars.
Within classical statistical hypothesis testing, the prohibition against defining probabilities of hypotheses is due to the definition of probability.
- Within classical statistics, probability is defined as the long-run frequency with which an event occurs.
  - This means, for example, that you can define the probability of a coin coming up tails, because flipping a coin sometimes results in heads and sometimes in tails (a simulation of this long-run frequency follows below).
  - Contrariwise, you cannot define the probability that it will rain at 10 am tomorrow, because 10 am tomorrow happens only once, and it will either rain or not at that time (i.e., the event ‘raining tomorrow at 10 am’ cannot occur at some frequency out of a number of total occurrences).
  - Similarly, hypotheses are either true or false. They are not sometimes true and sometimes false, and therefore hypotheses do not meet the basic criterion for defining a probability under this frequency-based definition.
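As a quick illustration of the frequency-based definition (a sketch, not from the text), we can simulate repeated coin flips and watch the relative frequency of tails settle at the probability:
nFlips=1e5;
isTails=rand(1,nFlips)<0.5;               % 1 = tails on each simulated flip
runningFreq=cumsum(isTails)./(1:nFlips);  % relative frequency of tails so far
figure; semilogx(runningFreq,'k-'); hold on; box off
plot([1 nFlips],[.5 .5],'k--')            % the long-run probability, 0.5
xlabel('number of flips'); ylabel('relative frequency of tails')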
[Fig. 1 The sampling distribution of the t-statistic under the null hypothesis. (a) Shaded tails mark the probability mass corresponding to the α-criterion. (b) Shaded tails mark the probability mass corresponding to the p-value defined by the dataset.]