A Beginner's Guide to the Statistical p-value
We will answer the questions:

What is a p-value?

How do we compute statistical p-values?

How does the interpretation of p-values relate to the $\alpha$ criterion and critical statistical values?

Is there an advantage to reporting exact p-values?

Why is the definition of the p-value so confusing, and what are the practical consequences of defining it this way?
What is a p-value?
If you’ve ever read the results of a biomedical or behavioral experiment, you have run across the p-value. It is used to justify statements that certain results are ‘statistically significant’, elevating those results to a level of importance and validity not achieved by results that do not quite reach statistical significance.
How does the p-value achieve this feat, and what is the logical basis for using p-values as a decision variable when testing scientific hypotheses?
The first critical step in answering these questions is a solid understanding of what a p-value is, and is not.
The p-value is: “The area under a sampling distribution over theoretical values of a statistic that covers all values that are at least as extreme as the experimentally derived value of the statistic”.
Let’s look at this graphically for a t-test. The t-test is computed when we want to compare the mean of a data sample to some theoretically predicted value. For example, if your theory predicts that the average experimental datum will be zero, then the sampling distribution of the t-statistic will look something like Fig. 1.
Each height of the curve shown in Fig. 1 is the probability density corresponding to the t-statistic computed from a different potential data mean, $\bar{x}$. Probability mass, on the other hand, can only be defined in terms of ranges of abscissa values along a probability density function. Thus, the p-value cannot be the probability of any single observed mean, because the probability mass of any single abscissa value is zero.
You can see the p-value in Fig. 1b; it is the shaded area.

Notice that the p-value includes probability mass from both positive AND negative abscissa values, even though the observed mean is either positive or negative, not both.

It may surprise you to realize that, in this example (and any example where the abscissa is not a discrete variable), the p-value does not technically even include the observed dataset.
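The distinction between probability density and probability mass can be made concrete with a short sketch. The observed t-statistic (t_obs = 2) and the degrees of freedom (df = 7) below are illustrative assumptions, not values taken from the figure, and the sketch assumes the Statistics and Machine Learning Toolbox functions tpdf and tcdf are available.

```matlab
% Sketch: probability density vs. probability mass for a t-distribution.
% t_obs and df are hypothetical values chosen only for illustration.
t_obs = 2;      % hypothetical observed t-statistic
df    = 7;      % hypothetical degrees of freedom

% Density at the single observed value: a height of the curve, not a probability.
density_at_t = tpdf(t_obs, df);

% Probability mass of a single point: the area over a zero-width range.
mass_at_t = tcdf(t_obs, df) - tcdf(t_obs, df);

% Two-tailed p-value: the area at least as extreme as t_obs in BOTH tails.
p_lower = tcdf(-abs(t_obs), df);            % area below -|t_obs|
p_upper = tcdf(abs(t_obs), df, 'upper');    % area above +|t_obs|
p_two   = p_lower + p_upper;

% Cross-check the upper tail by numerically integrating the density.
p_upper_num = integral(@(t) tpdf(t, df), abs(t_obs), Inf);

fprintf('density = %.4f, point mass = %g, two-tailed p = %.4f\n', ...
    density_at_t, mass_at_t, p_two);
```

The density at the observed value is nonzero, yet the mass of that single point is zero; only the two tail areas carry probability, which is why the p-value is built from them.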
A p-value is not:

It is not the probability of your dataset. Indeed, a p-value is the probability of many datasets, although those datasets do not always include the observed dataset.

It is not the probability of the null hypothesis ($H_0$). In fact, the definition of the p-value depends on assuming that the null hypothesis is true, so to assign the null hypothesis anything but unity probability would make no sense. Further, it is not even possible to define the probability of a hypothesis within classical statistical practice.

It is not the probability of any alternative hypothesis (hypotheses competing with the null hypothesis). In fact, there is almost never any need to define any specific alternative to the null hypothesis to perform the classical statistical hypothesis test, let alone compute probabilities relative to alternative hypotheses. Again, it is not possible to define the probability of any hypothesis within classical statistical practice.
How do we compute statistical p-values?
To see how the p-value is computed, let’s start with an example problem. Suppose you are testing a drug for treatment of the common cold. You collect data in which 8 individuals each recover from a cold once naturally and once using your drug. With the drug, they recover faster (in days) by the following amounts: D = [1.1, 2, 0, 0, 0.4, 0.6, 3, 1.2].
The t-statistic based on these data is: $t = \bar{D} / (s_D / \sqrt{n}) = 1.0375 / (1.0391 / \sqrt{8}) \approx 2.82$, where $\bar{D}$ is the sample mean, $s_D$ is the sample standard deviation, and $n$ is the number of observations.
To compute the t-test, we plot our data-defined t-statistic against the t-density, which is a sampling distribution based on the size of the dataset. Here, there are n = 8 observations, so the t-distribution has n - 1 = 7 degrees of freedom and is plotted by typing:
d=[1.1, 2, 0, 0, .4, .6, 3, 1.2];
meand=mean(d); sd=std(d); n=length(d);
tdat=meand/(sd/sqrt(n)); % t-statistic computed from the data
tlist=linspace(-4.5,4.5,201); % abscissa: potential values of the t-statistic
p=tpdf(tlist,n-1); % t density with n-1 degrees of freedom
% upper panel: tail areas defined by the alpha criterion
figure; subplot(2,1,1); plot(tlist,p,'.'); axis([tlist(1) tlist(end) 0 1.02*max(p)]); box off; hold on
tcrit=[tinv(.025,n-1) tinv(.975,n-1)]; % critical values for alpha = 0.05
icrit=[find(tlist<tcrit(1),1,'last') find(tlist>tcrit(2),1)];
for i=1:icrit(1),
plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(tcrit(1)*[1 1],[0 tpdf(tcrit(1),n-1)],'k','LineWidth',1.75)
for i=icrit(2):length(tlist),
plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(tcrit(2)*[1 1],[0 tpdf(tcrit(2),n-1)],'k','LineWidth',1.75)
% lower panel: tail areas defined by the data (the p-value)
subplot(2,1,2); plot(tlist,p,'.'); axis([tlist(1) tlist(end) 0 1.02*max(p)]); box off; hold on
icrit=[find(tlist<-abs(tdat),1,'last') find(tlist>abs(tdat),1)]; % both tails beyond |tdat|
for i=1:icrit(1),
plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(-abs(tdat)*[1 1],[0 tpdf(-abs(tdat),n-1)],'k','LineWidth',1.75)
for i=icrit(2):length(tlist),
plot(tlist(i)*[1 1],p(i)*[0 1],'-','Color',.5*[1 1 1],'LineWidth',1.75); end
plot(abs(tdat)*[1 1],[0 tpdf(abs(tdat),n-1)],'k','LineWidth',1.75)
plot(tdat,0,'ko','MarkerFaceColor',[.2 .3 .4],'MarkerSize',7) % observed t-statistic
where we have added shading in the ‘tails’ of the upper distribution to represent the location of the probability mass corresponding to the $\alpha$ criterion [$\alpha = 0.05$] and shading in the lower distribution to represent the location of the probability mass corresponding to the p-value defined by the dataset, p = 0.0256.
These plots make it clear that the t-statistic, $t = 2.82$, is greater than the critical value, $t_{crit} = 2.36$, and therefore this corresponds to a ‘statistically significant’ statistical hypothesis test.

We can compute the statistical hypothesis test in a single line of code:
>> [h,p,ci,stats]=ttest(d)
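As a cross-check on that one-liner, here is a minimal sketch that computes the same p-value by hand as the two-tailed area beyond the observed t-statistic. It assumes tcdf from the Statistics and Machine Learning Toolbox (the toolbox that also provides tpdf, tinv, and ttest used above); the variable names p_manual and significant are introduced here for illustration only.

```matlab
% Sketch: the p-value from ttest, reproduced from the definition.
d = [1.1, 2, 0, 0, .4, .6, 3, 1.2];
n = length(d);
alpha = 0.05;                                % the alpha criterion used in the text

tdat  = mean(d) / (std(d) / sqrt(n));        % t-statistic, about 2.82
tcrit = tinv(1 - alpha/2, n-1);              % positive critical value, about 2.36

% Two-tailed p-value: twice the area below -|tdat| (the t density is symmetric).
p_manual = 2 * tcdf(-abs(tdat), n-1);        % about 0.0256

% The two decision rules are equivalent.
significant = abs(tdat) > tcrit;             % same answer as p_manual < alpha

% Built-in test for comparison (default: mean 0, two-sided, alpha = 0.05).
[h, p, ci, stats] = ttest(d);
fprintf('manual p = %.4f, ttest p = %.4f, t = %.3f, df = %d\n', ...
    p_manual, p, stats.tstat, stats.df);
```

Both routes give the same number because ttest is doing nothing more than this tail-area calculation on the t-statistic.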
How does the interpretation of p-values relate to the $\alpha$ criterion and critical statistical values?
We just made passing mention of the critical value of the t-statistic and the $\alpha$ criterion. Let’s make sure we understand these, and their relationship to ‘statistical significance’.
1. You always start with the $\alpha$ criterion.
 The $\alpha$ criterion sets your type I error rate, which is the rate at which you will (erroneously) find a statistically significant result (and reject $H_0$) in instances when the null hypothesis ($H_0$) is in fact correct.

You therefore choose an $\alpha$ criterion that gives a low type I error rate, but not one so low that your experiment loses too much statistical power.
2. The $\alpha$ criterion also sets the critical value of the test statistic. You can see this in the upper panel (a) of the figure.
 After setting aside 2.5% of the mass in each tail, the borders of those tail-area masses are the critical values of the statistic, here $t_{crit} = \pm 2.36$ for the t-statistic.
3. After collecting your data, you can compute the p-value.
 Just as $t_{crit}$ defines the edges of the $\alpha$ criterion at each tail of the t-distribution, the absolute value of the t-statistic computed from your data defines the edges of the tail areas, positive and negative, that define the p-value.

The p-value is the area under the t-distribution starting at the positive and negative values of the t-statistic, and continuing out into the tails.
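The three steps above can be checked with a small simulation. This is a sketch under stated assumptions rather than part of the original analysis: datasets are drawn from a standard normal distribution so that the null hypothesis (true mean of zero) holds, and the seed and the number of simulated experiments (nsim) are arbitrary choices.

```matlab
% Sketch: under H0, the alpha criterion fixes the long-run rejection rate,
% and comparing |t| to tcrit gives the same decisions as comparing p to alpha.
rng(1);                          % arbitrary seed, for reproducibility only
alpha = 0.05;                    % alpha criterion
n     = 8;                       % observations per simulated experiment
nsim  = 20000;                   % number of simulated experiments (assumption)

tcrit = tinv(1 - alpha/2, n-1);  % critical value set by alpha

x    = randn(n, nsim);           % each column is one dataset generated under H0
tsim = mean(x) ./ (std(x) ./ sqrt(n));     % one t-statistic per dataset
psim = 2 * tcdf(-abs(tsim), n-1);          % one two-tailed p-value per dataset

reject_by_t = abs(tsim) > tcrit;           % decision via the critical value
reject_by_p = psim < alpha;                % decision via the p-value

agreement  = mean(reject_by_t == reject_by_p);  % fraction of identical decisions
type1_rate = mean(reject_by_p);                 % fraction of (erroneous) rejections

fprintf('agreement = %.4f, type I error rate = %.4f\n', agreement, type1_rate);
```

The rejection rate should land close to the chosen $\alpha$, which is exactly what it means for the criterion to set the type I error rate.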
p-value quiz
You have developed a treatment for the common cold, and collect data on recovery times in both untreated patients and patients who are given your new treatment. In your analysis, the p-value corresponding to the observed difference in recovery times of the two groups was p = .015 (the average recovery time for patients receiving your new treatment was 1 day shorter than for controls). Given that you have set your $\alpha$ criterion to $\alpha = 0.05$, which of the following statements are true:
1. You have disproved the null hypothesis (the hypothesis that there is no statistical difference in recovery times).
2. You have obtained more evidence against the null hypothesis than if the p-value were p = 0.045.
3. You have found the probability of the null hypothesis being true.
4. You have proved your hypothesis (there is a reliable statistical difference in recovery time).
5. From the p-value, you can deduce the probability of the experimental hypothesis being true.
6. You are able to lower your $\alpha$ criterion and report this effect as significant at the 0.02 level (i.e., p < 0.02).
7. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
8. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 98.5% of occasions.
9. You have found the probability of the alternative hypothesis being false.
10. You have computed the data analog of the type I error rate, meaning there is a 1.5% chance you will incorrectly reject the null hypothesis when it is actually true.
Before looking at the quiz answers, ask yourself: How many of these ten statements are correct?
Is there an advantage to reporting exact p-values?
It is the current recommendation and directive of the American Psychological Association (APA), a directive followed by most biomedical journals, that exact p-values be reported.
Since it is neither incorrect to report the exact p-value associated with a statistical test, nor incorrect to report a p-value simply as greater or less than $\alpha$, we should ask ourselves what the practical consequences of each practice are likely to be.

Reporting exact p-values adds precision to the report of your calculations, but also gives the erroneous impression that the p-value is a stand-in for 'level of support' or 'amount of evidence' for the conclusion that the null hypothesis should be rejected.

Reporting only the values of test statistics and indicating which statistics reach significance (i.e., if p < $\alpha$) makes it clear that the p-value is meant only to be compared to the threshold value defined by $\alpha$, and therefore the exact value is otherwise inconsequential.
If you've ever spoken to students in an undergraduate statistics course (or, often, their professors as well) about the meaning of the p-value, you will quickly realize that the most common misunderstanding regarding p-values is the erroneous belief that they are a measure of the 'evidence' against $H_0$ (or perhaps a measure of effect size).
The only argument in favor of reporting exact p-values that makes some sense is that when a low-powered study just misses statistical significance, it might be worth following it up with another study using a larger dataset.
However, this still would not offset the detrimental effect of confusing so many regarding the type of information that is meant to be conveyed by p-values (particularly because anyone sophisticated enough to make this judgement is also capable of recognizing the same thing based on the value of the test statistic and sample size).
In short, exact p-values only serve to perpetuate confusion regarding the nature of statistical hypothesis testing and the actual information conveyed by p-values, and should therefore not be reported.
In seeming agreement with my opinion, and also my distaste for the program of classical statistical hypothesis testing generally, the American Statistical Association (ASA) board of directors has issued a statement regarding the p-value that in part reads:
"Widespread use of 'statistical significance' (generally interpreted as 'p < 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process"

They instead suggest that p-values should be used as a threshold that acts as only one element of a larger argument in favor of your conclusions and interpretation of your data.
Why is the definition of the p-value so confusing, and what are the practical consequences of defining it this way?
I think first I have to address the thought you may well be having, which is:
‘Is this REALLY correct? The p-value, which is the basis for nearly all historical biomedical and many other scientific findings from the last 100 years:
 is NOT the probability of your dataset under the null hypothesis, and also

is NOT the probability of the null hypothesis being correct based on the data?’
Unfortunately, the answers to these questions are all 'yes'.
If the definition and use of the p-value felt a bit convoluted to you, you’re not alone. The p-value is the basis for interpreting the classical statistical null hypothesis test, which is a very roundabout way of assessing the connection between your data and competing hypotheses, precisely because the p-value is NOT the information we really want if we want to compare and test hypotheses.
You might be wondering, for example:

Why did we compute the probabilities of potential (unobserved) datasets under a single hypothesis ($H_0$), when we could have instead computed the probability of having observed the actual dataset under a range of competing hypotheses?

This is the difference between computing sampling distributions and likelihood functions.

Why didn’t we compute the probability of the null hypothesis based on the observed data, and compare that to the probability of the alternative hypothesis (i.e., determine which hypothesis is more likely, based on the observed data)?

Why don’t any of our computations involve the predictions of the hypotheses that are competing with the null hypothesis?
These are excellent questions, and in fact represent much of the difference between classical and Bayesian statistical methods.
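To make the contrast between a sampling distribution and a likelihood function concrete, here is a minimal sketch using the cold-recovery data from earlier. The likelihood part rests on assumptions that are mine and not part of the original analysis: the observations are modeled as normally distributed, the standard deviation is treated as known and fixed at the sample value, and the grid of hypothesized means (mulist) is arbitrary. It assumes tpdf and normpdf from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: sampling distribution (many potential datasets, one hypothesis)
% versus likelihood function (one observed dataset, many hypotheses).
d  = [1.1, 2, 0, 0, .4, .6, 3, 1.2];
n  = length(d);
sd = std(d);                         % assumption: SD treated as known

% Sampling distribution: density of POTENTIAL t-statistics under H0 alone.
tlist = linspace(-4.5, 4.5, 201);
samp  = tpdf(tlist, n-1);

% Likelihood function: density of the OBSERVED data under each hypothesized mean.
mulist = linspace(-1, 3, 201);       % hypothetical grid of competing hypotheses
loglik = zeros(size(mulist));
for k = 1:numel(mulist)
    loglik(k) = sum(log(normpdf(d, mulist(k), sd)));
end
lik = exp(loglik - max(loglik));     % rescaled so that the peak equals 1

% The hypothesized mean that makes the observed data most probable.
[~, imax] = max(lik);
mu_best = mulist(imax);
fprintf('likelihood peaks near mu = %.2f (sample mean = %.4f)\n', mu_best, mean(d));
```

The abscissa of the sampling distribution is a statistic computed from datasets that were never observed, while the abscissa of the likelihood function is the hypothesis itself, evaluated only against the data actually in hand.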
Quick answers:
The p-value is a probability that primarily describes unobserved datasets (datasets more extreme than observed experimentally), so we can’t just compute the probability of the observed dataset, because that would not yield a p-value.
You are forbidden from even uttering the phrase, ‘probability of the hypothesis’ (null or otherwise) within classical statistics. For this thought crime you are sent straight to classical statistical jail, and you do not collect 200 classical statistical dollars.
Within classical statistical hypothesis testing, the prohibition against defining probabilities of hypotheses is due to the definition of probability.

Within classical statistics, probability is defined as the frequency at which something is true.
 This means, for example, you can define the probability of a coin coming up tails, because flipping a coin sometimes results in it coming up heads and sometimes tails.

Contrariwise, you cannot define the probability that it will rain at 10 am tomorrow, because 10 am tomorrow only happens once and it will rain or not at that time (i.e., the event, raining tomorrow at 10 am, cannot occur at some frequency out of the number of total occurrences).

Similarly, hypotheses are either true or false. They are not sometimes true and sometimes false, and therefore hypotheses do not meet the basic criterion for defining a probability under this frequency-based definition.
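Fig. 1. Sampling distribution of the t-statistic: (a) shaded tail areas defined by the $\alpha$ criterion; (b) shaded tail areas defined by the data (the p-value).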