statistical power analysis

Power analysis

We will answer the questions:

What is statistical power?
How is power related to effect size and the p-value?
How is statistical power computed and interpreted? Three examples.
Does power analysis end up providing us with useful (actionable) information?

What is statistical power, and why do we compute it?

Scientific progress usually occurs by performing experiments. Experiments cost money, and it will be unsurprising to learn that large experiments that generate lots of data cost more money than small experiments that generate small amounts of data.

The bean counters will always want you to design and execute the smallest experiments possible, while still accomplishing your goals (of advancing knowledge within your particular scientific field).

When I was trained, this was conveyed to me slightly differently, by pointing out that 'it is much easier to chase after large effects than small effects'.

Large and small effects are not absolute numbers, but rather are defined relative to noise:
- a large effect is large relative to noise
  - e.g., your loudest relative speaking during a quiet dinner party
  - it is easy to hear what this person is saying
    - because the signal is large, and the noise is low
- a small effect is small relative to noise
  - e.g., your quietest relative speaking during a heated dinnertime debate
  - it is difficult, if not impossible, to hear what this person is saying
    - because the signal is small and the noise is high

Power analysis is meant to tell you whether your experiment is designed to find a large effect or small effect, and consequently whether you will need very little or a great deal of data before you are likely to be able to detect that effect.

Statistical power is the probability that a particular type of experimental dataset will yield a statistically significant effect, given that the effect really exists. This depends on:

whether the to-be-found effect is large or small (relative to noise)
whether the experiment is large or small (the size of the to-be-collected dataset)

The importance of a power calculation is in the planning stages of an experiment:

if your experiment is 'underpowered' you are unlikely to detect the effect in the data, even if it is really there
if your experiment is 'overpowered' you will be wasting time and money collecting more data than necessary to achieve your goal
- ... so instead you want the power to be juuust right. And that requires calculations.

How is power related to effect size and the p-value?

In short, power analysis is meant to use the alpha-level of your experiment (the threshold below which a p-value is considered significant) and tell you how large an effect you are likely to be able to detect based on having collected N experimental observations.

In a power analysis, the probability of finding an effect with a particular hypothesis test under particular conditions (sample size, effect size, etc.) is called the 'statistical power' of the hypothesis test.
Statistical power is always equal to 1 - β, where
- β is the Type II error rate (the probability of failing to detect an effect that really exists)
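As a rough sketch of what this definition means in practice, the power of a two-tailed one-sample test can be approximated with the normal distribution. The effect size and sample size below are illustrative assumptions, and the approximation ignores the (negligible) chance of rejecting in the wrong tail.

```matlab
d=0.5; n=31; alpha=0.05; %assumed effect size (sd units), sample size and alpha-level
zcrit=norminv(1-alpha/2); %critical value of the two-tailed test
pwr=normcdf(d*sqrt(n)-zcrit); %probability of exceeding the critical value when the effect is real
beta=1-pwr; %Type II error rate: probability of missing the real effect
disp([pwr beta])
```

Increasing either d or n pushes pwr toward 1 and beta toward 0.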

Examples of the questions answered by a power analysis are:

With statistical power 0.8, how large a correlation is one able to find at an alpha-level of 0.05 based on four experimental observations?
With statistical power 0.8, how many observations are needed to find a statistically reliable (alpha = 0.05) correlation of r = 0.5?
What is the statistical power of the hypothesis test using alpha = 0.05 of the correlation, r = 0.7, based on 12 observations?
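The three questions above can be sketched numerically using the Fisher z-transform (atanh) of the correlation, which is only an approximation and is quite crude for very small samples such as n = 4. The variable names here are illustrative, not part of any standard routine.

```matlab
alpha=0.05; beta=0.2; %alpha-level and Type II error rate (power = 1-beta = 0.8)
zsum=norminv(1-alpha/2)+norminv(1-beta); %sum of the two critical z-values

%question 1: smallest correlation detectable with n = 4 observations
rmin=tanh(zsum/sqrt(4-3));

%question 2: observations needed to detect r = 0.5
nreq=ceil((zsum/atanh(0.5))^2+3);

%question 3: power to detect r = 0.7 with n = 12 observations
pwr=normcdf(atanh(0.7)*sqrt(12-3)-norminv(1-alpha/2));

disp([rmin nreq pwr])
```

Each line holds three of the four quantities fixed and solves for the remaining one.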

Power calculations can be based on any three of:

Statistical power level (1 - β)
Effect size (given in normalized or SD units)
alpha-criterion (α)
sample size (N)

allowing one to solve for the fourth.

Notice also that effect size will generally take the variance of observations into consideration, so that multiple scenarios will refer to the same effect, in sd units.

For example, a 1-sample t-test has approximately 80% statistical power to detect an effect size of 0.5 standard deviations at an alpha-level of 0.05 when based on N = 32 observations (using the normal approximation)
- which could mean a true value of µ = 1, null hypothesis of 0, and sd of 2
- or it could mean a true value of µ = 3, null hypothesis of 1, and sd of 4
- or a true value of µ = -10, null hypothesis of 0, and sd of 20, and so on...
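A quick check that the three scenarios listed above share one effect size, computed as the difference between the true mean and the null value divided by the standard deviation:

```matlab
mu=[1 3 -10]; mu0=[0 1 0]; sd=[2 4 20]; %true means, null-hypothesis values and sds of the three scenarios
d=(mu-mu0)./sd; %effect size in sd units
disp(d) %the magnitude is 0.5 in every scenario; only the sign differs in the last one
```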

An interesting thing to notice here is that having a threshold for significance (the alpha-level) creates a situation where we expect reported effect sizes to be over-estimated, on average, whenever statistical power is less than 100%.

This is because effect sizes, as measured from noisy data, will be either somewhat higher or somewhat lower than the true effect size. When effect sizes measured from data are low (relative to the true effect size), they may fail to reach statistical significance and go unreported, whereas they are more likely to be statistically significant and reported (published) when they are high. Thus, the overall outcome is that the expected value of a reported effect size is higher than the true effect size.
- For underpowered experiments, this overestimation of effect size is expected to be more pronounced
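This selection effect can be illustrated with a small simulation. The sketch below assumes a true effect of 0.5 sd and deliberately underpowered experiments of n = 10; all values are illustrative assumptions. It compares the average measured effect across all experiments with the average across only the significant ones.

```matlab
rng(1); %fix the random seed so the sketch is repeatable
dtrue=0.5; n=10; Nrep=100000; %assumed true effect (sd units), sample size, number of simulated experiments
x=dtrue+randn(n,Nrep); %each column is one experiment, drawn with sd = 1
dobs=mean(x); %measured effect in each experiment, in units of the true sd
h=ttest(x)==1; %which experiments reach significance at alpha = 0.05
disp(['power: ' num2str(mean(h))])
disp(['mean of all measured effects: ' num2str(mean(dobs))])
disp(['mean of significant effects: ' num2str(mean(dobs(h)))])
```

Raising n increases the power and shrinks the gap between the two averages.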

How is statistical power computed and interpreted?

Recall that we usually will only make power calculations when we are in the planning stages of an experiment.

A corollary is that we are not yet in possession of the experimental data
- This is problematic, because we cannot generally predict how noisy our experimental data will ultimately be
  - Instead, we must guess, based on our expert knowledge of the scientific domain within which we are working
- It is also problematic, because we do not know how large an effect we will ultimately find (assuming there is any effect to be found at all) in the data
  - Again, we must usually guess.

We will consider three examples:

one-sample t-test
two-sample t-test
correlation

In each example, we take the most common use of a power analysis, and compute the sample size consistent with a statistical power of 0.8.

Sample size equations typically make use of the inverse of the cumulative normal distribution. The cumulative (standard) normal distribution yields the probability mass:

$$p = \Phi(c) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}\, dx$$

that is, the area under the distribution to the left of the critical value, c. The inverse of the cumulative Gaussian distribution (whose output is often called a z-score) works in reverse, and yields the critical value:

$$c = \Phi^{-1}(p)$$

based on the probability mass, p.
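In Matlab, the cumulative normal distribution and its inverse are available as normcdf and norminv (Statistics Toolbox). A minimal round trip, using the familiar critical value 1.96 as an illustrative input:

```matlab
c=1.96; %critical value (z-score)
p=normcdf(c) %probability mass to the left of c, approximately 0.975
cback=norminv(p) %the inverse recovers the critical value
```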

1. one-sample t-test

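The number of observations is computed with the equation:

$$n = \left( \frac{\sigma \left[ \Phi^{-1}(1-\alpha/\mathrm{tails}) + \Phi^{-1}(1-\beta) \right]}{\Delta\mu} \right)^{2}$$

where $\Phi^{-1}$ is the inverse cumulative normal distribution, $\sigma$ is the standard deviation of the observations, and $\Delta\mu$ is the difference between the true mean and the reference value. In Matlab this is calculated via: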

muprime=0; %known reference value defined by theory or practical application (the mean under H0)
mu=1; %true mean (mu=muprime under the null hypothesis)
deltamu=mu-muprime; %difference between the true mean (mu) and the reference value (muprime)
sd=2; %true sd of the observations
alpha=0.05; tails=2; beta=.2; %alpha-level, number of tails, and Type II error rate (power = 1-beta)
n=(sd*(norminv(1-alpha/tails)+norminv(1-beta))/deltamu)^2 %about 31.4; round up to the next whole observation
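The calculation above relies on the normal distribution, so it tends to run slightly low for a t-test. As a hedged cross-check (this is a swapped-in technique, not the method used in these notes), one can step through n and compute the power directly from the noncentral t-distribution with tinv and nctcdf from the Statistics Toolbox. The effect size d is deltamu/sd from the example above.

```matlab
d=0.5; alpha=0.05; pwrlvl=0.8; %assumed effect size (deltamu/sd), alpha-level and target power
n=1; pwr=0; %start below any usable sample size
while pwr<pwrlvl
    n=n+1; df=n-1;
    tcrit=tinv(1-alpha/2,df); %two-tailed critical t-value
    ncp=d*sqrt(n); %noncentrality parameter of the t-statistic when the effect is real
    pwr=1-nctcdf(tcrit,df,ncp)+nctcdf(-tcrit,df,ncp); %probability of rejecting H0
end
disp([n pwr])
```

The resulting n should land a few observations above the normal-approximation value.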

2. two-sample t-test

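The number of observations per sample is computed with the equation:

$$n = 2 \left( \frac{\sigma \left[ \Phi^{-1}(1-\alpha/\mathrm{tails}) + \Phi^{-1}(1-\beta) \right]}{\Delta\mu} \right)^{2}$$

where $\Phi^{-1}$ is the inverse cumulative normal distribution, $\sigma$ is the standard deviation of the observations (assumed equal in both populations), and $\Delta\mu$ is the true difference between the two means. In Matlab this is calculated via: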

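deltamu=1; %true difference between the two population means (deltamu=0 under H0)
sd=2; %true sd of the observations (assumed equal in both populations)
alpha=0.05; tails=2; beta=.2; %alpha-level, number of tails, and Type II error rate (power = 1-beta)
n=2*(sd*(norminv(1-alpha/tails)+norminv(1-beta))/deltamu)^2 %observations per sample, about 62.8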

(note that a paired t-test is a one-sample test of differences, not a two-sample test)

3. correlation

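The number of observations is computed with the equation:

$$n = \left( \frac{\Phi^{-1}(1-\alpha/\mathrm{tails}) + \Phi^{-1}(1-\beta)}{\frac{1}{2}\ln\frac{1+r}{1-r}} \right)^{2} + 3$$

where $\Phi^{-1}$ is the inverse cumulative normal distribution and $r$ is the true correlation. In Matlab this is calculated via: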

rtrue=0.5; %true correlation between the two variables (rtrue=0 under H0); must lie strictly between -1 and 1
alpha=0.05; tails=2; beta=.2; %alpha-level, number of tails, and Type II error rate (power = 1-beta)
n=((norminv(1-alpha/tails)+norminv(1-beta))/(.5*log((1+rtrue)/(1-rtrue))))^2+3

Note that instead of using specialized equations specific to sample size calculations, one can also determine n by simulation, using only the equation of the statistical test itself. For example, we can simulate the required sample size to reject 80% of null hypotheses using the two-sample t-test by the following sequence of steps:

define all constants, including H0, true sample means and standard deviations, and alpha
use a pseudo-random number generator to draw many (say 750K) samples of various n from the two distributions
compute t-tests for each of the 750K sample-pairs
count the proportion of significant t-tests

The value of n that achieves an 80% rejection rate is the desired sample size.

A simple Matlab script to execute these four steps is:

mu1=0; mu2=1; sd1=2; sd2=2; %true means and sds of the two populations
Nstep=5; Nrep=750000; pwrlvl=0.8; %step in n, number of simulated experiments per n, target power
Nover=0; iN=0; %counters
reject=0; %rejection rate for each tested n (the first entry corresponds to n=0)

figure; hold on; box off %prepare the figure
plot([0 80],[0 0],'k--'); plot([0 80],[1 1],'k--'); axis([0 80 -.01 1.01])
while Nover<3, iN=iN+1; Nnow=iN*Nstep; %stop once three tested n exceed the target power
    disp(['testing N=' num2str(Nnow) ' per sample'])
    samp1=normrnd(mu1,sd1,[Nnow Nrep]); %each column is one simulated sample
    samp2=normrnd(mu2,sd2,[Nnow Nrep]);
    reject(iN+1)=sum(ttest2(samp1,samp2))/Nrep; %proportion of significant tests (alpha=0.05)
    Nover=Nover+(reject(end)>pwrlvl); end

Nlist=(0:iN)*Nstep; Ncrit=interp1(reject,Nlist,pwrlvl); %interpolate n at the target power
plot(Nlist,reject,'k-o','MarkerFaceColor',[.2 .3 .5],'MarkerSize',9);
plot(Ncrit*[1 1],[0 pwrlvl],'k:')
plot([0 Ncrit],pwrlvl*[1 1],'k:')

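The resulting estimate (between 63 and 64 observations per sample) closely matches the value computed above from the specialized equation (about 62.8, which rounds up to 63). The small difference arises because the equation uses a normal approximation to the t-distribution.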

As an added bonus, it is easy to modify the simulation to fit non-standard cases of the two-sample t-test
- such as unequal sample sizes (n1 ≠ n2) or
- unequal standard deviations (sd1 ≠ sd2)

Try one of these by modifying the code above to see how easy it is.
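One possible modification is sketched below for a single design with both unequal sample sizes and unequal standard deviations. The specific values of n1, n2, sd1 and sd2 are illustrative assumptions, and the test is switched to the unequal-variance (Welch) form via the 'Vartype' option of ttest2.

```matlab
rng(1); %fix the random seed so the sketch is repeatable
mu1=0; mu2=1; sd1=2; sd2=3; %assumed true means and (unequal) sds
n1=60; n2=90; Nrep=20000; %assumed (unequal) sample sizes and number of simulated experiments
samp1=mu1+sd1*randn(n1,Nrep); %each column is one simulated sample
samp2=mu2+sd2*randn(n2,Nrep);
h=ttest2(samp1,samp2,'Vartype','unequal'); %Welch's t-test, which does not assume equal variances
pwr=sum(h)/Nrep %estimated power for this design
```

Wrapping these lines in a loop over candidate values of n1 and n2 recovers the full sample size search.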

Does power analysis end up providing us with useful (actionable) information?

A sample size analysis is generally performed to verify that a proposed experiment will yield a data analysis that is not over- or under-powered with regard to finding a particular effect.

The ultimate usefulness of this exercise is determined entirely by the extent to which you are guessing about the true effect size.

Since you cannot know the true underlying effect size (if one even exists), the best-case scenario is usually that we make power calculations based either on:

a specific theoretical prediction
the smallest effect size we judge to still be 'important'

In a basic t-test scenario the ideal is that a minimum 'clinically significant' effect size is known, allowing us to determine the minimum sample size necessary to achieve a given statistical power.

A clinically significant effect is one that makes a noticeable real-world difference. For example, a drug that shortens a bout of the flu by an hour would not be a particularly useful discovery (even if it were extremely reliable), whereas a drug that cut the duration of a typical bout of the flu by 1/2 would be highly useful, because it would be a noticeable (not simply detectable) improvement over not taking the drug.
- In the ideal case, your information includes both the mean difference (between sample and reference value or between two samples) and the standard deviation that together define the minimum clinically significant effect size
- In a less ideal case, you know the mean difference that defines your desired effect size, but not the standard deviation
- In an even less ideal case, you have a theoretical prediction of the reference value only (usually the null hypothesis of 0), but not of the change in sample mean or the standard deviation that define a minimum clinically significant effect.

The last of these three cases is far and away the most common, and power analysis in this situation is based on the effect size that you feel you might find, rather than on any theoretical considerations such as a minimum effect that is clinically significant.

In all cases, there is generally some uncertainty regarding the assumed values of means and standard deviations that underlie any prediction of effect size. If your uncertainty spans enough parameter space for these variables (i.e., if your uncertainty spans a wide range of potential standard deviation values as well as sample mean values, etc.), it is very difficult to make sample size calculations that usefully inform your experimental designs.
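To get a feel for how much this uncertainty matters, the one-sample calculation can be repeated over a grid of plausible values. The ranges below are illustrative assumptions only.

```matlab
alpha=0.05; tails=2; beta=.2; %alpha-level, number of tails, and Type II error rate
zsum=norminv(1-alpha/tails)+norminv(1-beta); %sum of the two critical z-values
[deltamu,sd]=meshgrid([0.5 1 2],[1 2 4]); %assumed plausible ranges: columns vary deltamu, rows vary sd
n=ceil((sd.*zsum./deltamu).^2) %one-sample n for every combination
```

Even with only a four-fold uncertainty in each of deltamu and sd, the required n ranges from single digits to several hundred.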
