Power analysis
We will answer the questions:

What is statistical power?

How is power related to effect size and the p-value?

How is statistical power computed and interpreted? Three examples.

Does power analysis end up providing us with useful (actionable) information?
What is statistical power, and why do we compute it?
Scientific progress usually occurs by performing experiments. Experiments cost money, and it will be unsurprising to learn that large experiments that generate lots of data cost more money than small experiments that generate small amounts of data.
The bean counters will always want you to design and execute the smallest experiments possible, while still accomplishing your goals (of advancing knowledge within your particular scientific field).
When I was trained, this was conveyed to me in a slightly different way: it was pointed out that 'it is much easier to chase after large effects than small effects'.

Large and small effects are not absolute numbers, but rather are defined relative to noise

a large effect is large relative to noise

e.g., your loudest relative speaking during a quiet dinner party

it is easy to hear what this person is saying

because the signal is large, and the noise is low



a small effect is small relative to noise

e.g., your quietest relative speaking during a heated dinnertime debate

it is difficult, if not impossible, to hear what this person is saying

because the signal is small and the noise is high



Power analysis is meant to tell you whether your experiment is designed to find a large effect or small effect, and consequently whether you will need very little or a great deal of data before you are likely to be able to detect that effect.
Statistical power is the probability that a particular type of experimental dataset will yield a statistically significant effect. This depends on:

whether the to-be-found effect is large or small (relative to noise)

whether the experiment is large or small (the size of the to-be-collected dataset)
The importance of a power calculation is in the planning stages of an experiment:

if your experiment is 'underpowered', you will be unlikely to detect the effect in the data even if it is real

if your experiment is 'overpowered' you will be wasting time and money collecting more data than necessary to achieve your goal

.... so instead you want the power to be juuust right. And that requires calculations.

How is power related to effect size and the p-value?
In short, power analysis is meant to use the alpha-level of your experiment (the minimum threshold for a significant p-value) and tell you how large an effect you are likely to be able to detect based on having collected N experimental observations.

In a power analysis, the probability of finding an effect with a particular hypothesis test under particular conditions (sample size, effect size, etc.) is called the 'statistical power' of the hypothesis test

Statistical power is always equal to 1 − β, where

β is the Type II error rate

Examples of the questions answered by a power analysis are:

With statistical power 0.8, how large a correlation is one able to find at an alpha-level of 0.05 based on four experimental observations?

With statistical power 0.8, how many observations are needed to find a statistically reliable (alpha = 0.05) correlation of r = 0.5?

What is the statistical power of the hypothesis test using alpha = 0.05 of the correlation, r = 0.7, based on 12 observations?
Power calculations can be based on any three of:

Statistical power level (1 − β)

Effect size (given in normalized or SD units)

alpha-criterion (α)

sample size (N)
allowing one to solve for the fourth.
Notice also that effect size generally takes the variance of the observations into consideration, so that multiple scenarios can refer to the same effect size in SD units.

For example, a one-sample t-test has 80% statistical power to detect an effect size of 0.5 standard deviations at an alpha-level of 0.05 when based on N = 31 observations
 which could mean a true value of µ = 1, null hypothesis of 0, and sd of 2
 or it could mean a true value of µ = 3, null hypothesis of 1, and sd of 4
 or a true value of µ = 10, null hypothesis of 0, and sd of 20, and so on...
An interesting thing to notice here is that having a threshold (the p-value) for significance creates a situation in which we expect effect sizes to be systematically overestimated from data whenever statistical power is less than 100%

This is because effect sizes, as measured from noisy data, will fall somewhat above or below the true effect size. When effect sizes measured from data are low (relative to the true effect size), they may fall below the significance threshold and go unreported, whereas high measured effect sizes are more likely to be statistically significant and therefore reported (published). Thus, the overall outcome is that the expected value of a reported effect size is higher than the true effect size.

For underpowered experiments, this overestimation of effect size is expected to be more pronounced
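This overestimation (sometimes called the 'winner's curse') is easy to demonstrate by simulation. The sketch below is in Python rather than the Matlab used elsewhere in these notes, and (as a simplification to keep it standard-library only) uses a two-tailed z-test with known sd in place of a t-test; it runs many replications of an underpowered one-sample experiment and averages only the significant effect estimates:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
mu_true, sd, n, alpha = 0.5, 2.0, 20, 0.05   # a deliberately underpowered design (power ~ 0.2)
# a sample mean is 'significant' when it exceeds this cutoff (two-tailed z-test, known sd)
crit = NormalDist().inv_cdf(1 - alpha / 2) * sd / n ** 0.5

significant_estimates = []
for _ in range(20000):                        # simulate many replications of the experiment
    xbar = mean(random.gauss(mu_true, sd) for _ in range(n))
    if abs(xbar) > crit:                      # keep only the 'publishable' (significant) results
        significant_estimates.append(abs(xbar))

# every retained estimate exceeds the cutoff (~0.88), so their average
# is well above the true effect of 0.5
print(mean(significant_estimates))
```

Raising n toward a properly powered value shrinks the gap between the average reported estimate and the true effect.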

How is statistical power computed and interpreted?
Recall that we usually will only make power calculations when we are in the planning stages of an experiment.

A corollary is that we are not yet in possession of the experimental data

This is problematic, because we cannot generally predict how noisy our experimental data will ultimately be

Instead, we must guess, based on our expert knowledge of the scientific domain within which we are working


It is also problematic, because we do not know how large an effect we will ultimately find (assuming there is any effect to be found at all) in the data

Again, we must usually guess.


We will consider three examples:

one-sample t-test

two-sample t-test

correlation
In each example, we take the most common use of a power analysis, and compute the sample size consistent with a statistical power of 0.8.
Sample size equations typically make use of the inverse of the cumulative normal distribution. The cumulative normal distribution yields the probability mass
p = Φ(c),
that is, the area under the distribution to the left of the critical value, c. The inverse of the cumulative Gaussian distribution (whose output is often called a z-score) works in reverse, and yields the critical value
c = Φ⁻¹(p)
based on the probability mass, p.
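That these two functions are inverses of one another can be verified directly; here in Python, whose standard-library NormalDist plays the role of Matlab's normcdf/norminv:

```python
from statistics import NormalDist

z = NormalDist()           # the standard normal distribution
p = z.cdf(1.96)            # probability mass to the left of the critical value c = 1.96
c = z.inv_cdf(p)           # recover the critical value from the probability mass
print(round(p, 3), round(c, 2))  # 0.975 1.96
```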
1. one-sample t-test
The number of observations is computed with the equation:
n = [σ · (z(1 − α/2) + z(1 − β)) / Δμ]²
where z(p) = Φ⁻¹(p) is the inverse cumulative normal and Δμ is the difference between the true mean and the reference value, which in Matlab is calculated via:
muprime=0; %known reference value defined by theory or practical application
mu=1; %true mean (mu=muprime under the null hypothesis)
deltamu=mu-muprime; %difference between the true mean (mu) and the reference value (muprime)
sd=2; %true sd of the sampling distribution
alpha=0.05; tails=2; beta=.2;
n=(sd*(norminv(1-alpha/tails)+norminv(1-beta))/deltamu)^2
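As a cross-check, the same calculation in Python, using the standard library's NormalDist.inv_cdf in place of norminv:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf               # quantile function, counterpart of Matlab's norminv
deltamu = 1.0                          # difference between true mean and reference value
sd = 2.0                               # true sd of the sampling distribution
alpha, tails, beta = 0.05, 2, 0.2

n = (sd * (z(1 - alpha / tails) + z(1 - beta)) / deltamu) ** 2
print(round(n, 1))                     # ~31.4, matching the N = 31 example above
```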
2. two-sample t-test
The number of observations (per sample) is computed with the equation:
n = 2 · [σ · (z(1 − α/2) + z(1 − β)) / Δμ]²
which in Matlab is calculated via:
deltamu=1; %true difference between two sample means (deltamu=0 under H0)
sd=2; %true sd of the sampling distributions
alpha=0.05; tails=2; beta=.2;
n=2*(sd*(norminv(1-alpha/tails)+norminv(1-beta))/deltamu)^2
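The two-sample calculation can likewise be cross-checked in Python (standard library only):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf               # counterpart of Matlab's norminv
deltamu, sd = 1.0, 2.0                 # true mean difference and common sd
alpha, tails, beta = 0.05, 2, 0.2

n = 2 * (sd * (z(1 - alpha / tails) + z(1 - beta)) / deltamu) ** 2
print(round(n, 1))                     # ~62.8 observations per sample
```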
(note that a paired t-test is a one-sample test of differences, not a two-sample test)
3. correlation
The number of observations is computed with the equation:
n = [(z(1 − α/2) + z(1 − β)) / (½ · ln((1 + r)/(1 − r)))]² + 3
where ½·ln((1 + r)/(1 − r)) is Fisher's z-transform of the predicted correlation r, which in Matlab is calculated via:
rtrue=.5; %predicted true correlation between two samples (rtrue=0 under H0)
alpha=0.05; tails=2; beta=.2;
n=((norminv(1-alpha/tails)+norminv(1-beta))/(.5*log((1+rtrue)/(1-rtrue))))^2+3
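A Python cross-check for the correlation case, for a predicted true correlation of r = 0.5 (matching the example question earlier):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf               # counterpart of Matlab's norminv
rtrue = 0.5                            # predicted true correlation
alpha, tails, beta = 0.05, 2, 0.2

# math.atanh(r) is Fisher's z-transform, 0.5 * log((1 + r) / (1 - r))
n = ((z(1 - alpha / tails) + z(1 - beta)) / math.atanh(rtrue)) ** 2 + 3
print(round(n, 1))                     # ~29, so about 30 observations
```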
Note that instead of using specialized equations specific to sample size calculations, one can also determine n by simulation, using only the equation of the statistical test itself. For example, we can find by simulation the sample size needed to reject 80% of (false) null hypotheses using the two-sample t-test via the following sequence of steps:

define all constants, including H0, true sample means and standard deviations, and alpha

use a pseudo-random number generator to draw many (say 750K) samples of various sizes n from the two distributions

compute t-tests for each of the 750K sample-pairs

count the proportion of significant t-tests
The value of n that achieves an 80% rejection rate is the desired sample size.
A simple Matlab script to execute these four steps is:
mu1=0; mu2=1; sd1=2; sd2=2; Nstep=5; Nrep=750000; Nover=0; iN=0; pwrlvl=0.8; %true values and other constants
reject=0; %initialize reject vector
figure; hold on; box off %prepare the figure
plot([0 80],[0 0],'k'); plot([0 80],[1 1],'k'); axis([0 80 -.01 1.01])
while Nover<3, iN=iN+1; Nnow=iN*Nstep;
disp(['testing N=' num2str(Nnow) ' per sample'])
samp1=normrnd(mu1,sd1,[Nnow Nrep]);
samp2=normrnd(mu2,sd2,[Nnow Nrep]);
reject(iN+1)=sum(ttest2(samp1,samp2))/Nrep; Nover=Nover+(reject(end)>pwrlvl); end
Nlist=[0 [1:iN]]*Nstep; Ncrit=interp1(reject,Nlist,pwrlvl);
plot(Nlist,reject,'ko','MarkerFaceColor',[.2 .3 .5],'MarkerSize',9);
plot(Ncrit*[1 1],[0 pwrlvl],'k:')
plot([0 Ncrit],pwrlvl*[1 1],'k:')
which matches the result above (between 63 and 64 observations per sample) computed from the specialized equation.
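The simulation approach also translates directly to Python. The sketch below (standard library only) substitutes a two-sample z-test for ttest2 — a simplification that is reasonable here because the true standard deviations are known and n is fairly large — and estimates the power at a single candidate n rather than scanning a range:

```python
import random
from statistics import NormalDist, mean

random.seed(0)
mu1, mu2, sd1, sd2 = 0.0, 1.0, 2.0, 2.0   # true values (H0: equal means)
n, alpha, nrep = 63, 0.05, 4000           # candidate sample size and number of replications
zcrit = NormalDist().inv_cdf(1 - alpha / 2)

def significant():
    # draw one pair of samples and apply a two-sample z-test (stand-in for ttest2)
    x1 = [random.gauss(mu1, sd1) for _ in range(n)]
    x2 = [random.gauss(mu2, sd2) for _ in range(n)]
    se = (sd1 ** 2 / n + sd2 ** 2 / n) ** 0.5
    return abs(mean(x1) - mean(x2)) / se > zcrit

power = mean(significant() for _ in range(nrep))
print(round(power, 2))                    # close to the target power of 0.8
```

Unequal sample sizes or unequal standard deviations require only a small edit inside significant(), just as with the Matlab version.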

As an added bonus, it is easy to modify the simulation to fit nonstandard cases of the two-sample t-test

such as unequal sample sizes (n1 ≠ n2) or

unequal standard deviations (sd1 ≠ sd2)

Try one of these by modifying the code above to see how easy it is.
Does power analysis end up providing us with useful (actionable) information?
A sample size analysis is generally performed to verify that a proposed experiment will yield a data analysis that is neither over- nor underpowered with regard to finding a particular effect.
The ultimate usefulness of this exercise is determined entirely by the extent to which you are guessing about the true effect size.
Since we cannot know the true underlying effect size (if one even exists), the best-case scenario is usually that we make power calculations based on either:

a specific theoretical prediction

the smallest effect size we judge to still be 'important'
In a basic t-test scenario, the ideal is that a minimum 'clinically significant' effect size is known, allowing us to determine the minimum sample size necessary to achieve a given statistical power.

A clinically significant effect is one that makes a noticeable real-world difference. For example, a drug that shortens a bout of the flu by an hour would not be a particularly useful discovery (even if the effect were extremely reliable), whereas a drug that cut the duration of a typical bout of the flu by half would be highly useful, because it would be a noticeable (not simply detectable) improvement over not taking the drug.

In the ideal case, your information includes both the mean difference (between sample and reference value, or between two samples) and the standard deviation that together define the minimum clinically significant effect size

In a less ideal case, you know the mean difference that defines your desired effect size, but not the standard deviation

In an even less ideal case, you have a theoretical prediction of the reference value only (usually the null hypothesis of 0), but not of the change in sample mean or standard deviation that defines a minimum clinically significant effect.

The last of these three cases is far and away the most common, and power analysis in this situation is based on the effect size you feel you might find, rather than on any theoretical consideration such as a minimum clinically significant effect.
In all cases, there is generally some uncertainty regarding the assumed values of the means and standard deviations that underlie any prediction of effect size. If your uncertainty spans a wide enough range of these parameters (i.e., a wide range of potential standard deviation values as well as sample mean values), it becomes very difficult to make sample size calculations that usefully inform your experimental design.