What is probability?

We will answer the questions:

  • How are probabilities written?

  • What is probability?

  • How are probabilities computed?

  • How are probabilities used to test hypotheses?

  • Are there problems with the frequentist statistical definition of probability?

Here we will answer the question ‘what is probability’ from a classical statistical perspective (the original definition is from the Bayesian perspective).

How are probabilities written?

It is standard to talk about probabilities in terms of flipping a coin, so let’s go ahead and start there. To express a coin probability, we have to first define certain propositions (statements) about the coin and our background information about the scenario. For example, our background information might include something like the propositions:

Screen Shot 2021-03-23 at 8.27.30 AM.png

We also need to know which proposition we want the probability of; we’ll call this proposition the object of the probability statement. An example might be:

Screen Shot 2021-03-23 at 8.31.48 AM.png

Fig. 1

Probability statements are written in the particular format given in Figure 1, so that we can immediately see the object and the background information, separated by the vertical line. The statement in the figure is read, ‘the probability of x, given either y or z, and iota’. The propositions in the background information (to the right of the vertical line) are also called conditioning statements, because they define the conditions we assume are true when we compute the probability of the object. In the case of coin-flipping, we might want to compute the probability:

Screen Shot 2021-03-23 at 8.35.01 AM.png

In all cases, we are computing a numerical value for the object of the probability statement (propositions to the left of the vertical bar), in the scenario defined by the conditioning statements. The numerical values of the probabilities we compute are only valid when the conditioning statements are true. 

What is probability?

So far, we’ve seen how to write probabilities, but we still haven’t seen a definition of probability. 

 

Within classical, or ‘frequentist’, statistics, probabilities are defined exclusively in terms of (relative) frequencies.

 

  • Thus, the probability of a coin coming up heads is entirely defined in terms of the long-run frequency of coin-flips coming up heads. Ultimately, this is supposed to be an empirical matter, wherein the probability of an event is its frequency of occurrence in an infinite series of repetitions.

    • So the probability of a particular datum in an experiment is the frequency at which that datum would occur in an infinite series of repetitions of the experiment. 

 

  • Long-run frequencies are elements of the world, and therefore prima facie objective, but this new definition has a drawback: you can define sampling probabilities, but not posterior probabilities (and, technically, only those sampling probabilities for which you’ve counted long-run frequencies, but we’ll ignore that requirement).
    • Thus, for example, you can define the probability of a coin-flip coming up heads, the probability of drawing an orange jelly bean from a jar of mixed jelly beans, the probability of observing a particular correlation between two sets of numbers, the probability that the means of two sets of numbers differ by a particular amount, etc.

      • These are all the types of things that you could theoretically repeat many times, where the outcome would vary from one repetition to the next, and you could count the different outcomes to compute long-run frequencies. 

    • By contrast, you cannot define the probability that a coin is two-tailed, that it will rain tomorrow, that two variables (not data) are correlated, or that stroke produces a deficit in temporal coordination. 

      • These are all the types of things that cannot be repeated, or where the outcome cannot change from one repetition to another, and therefore you cannot count the number of each type of outcome to compute a long-run frequency. 
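The idea of a long-run frequency can be illustrated with a small simulation. This is only a sketch: the coin is simulated with a built-in chance of heads of 0.5 (an assumption, not something measured), and the function name and flip counts are chosen here for illustration.

```python
import random


def relative_frequency_of_heads(n_flips, rng):
    """Flip a simulated coin n_flips times and return the fraction of heads."""
    # Assumption: each simulated flip comes up heads with chance 0.5.
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    # The relative frequency tends to settle near 0.5 as n grows.
    print(n, relative_frequency_of_heads(n, rng))
```

No finite run reaches the infinite series that the definition asks for; the simulation only shows the relative frequency settling down as the number of repetitions grows.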

How are probabilities computed?

  • First, how it’s supposed to be done in theory:

Theoretically, because the frequentist statistical tradition takes probability to be identical with a frequency, you just need a list of occurrences, and probability is always just the number of times the event of interest occurs (heads in coin-flipping) out of the total number of all events (total coin flips, whether heads or tails).

 

This works for coin-flips:

Screen Shot 2021-03-23 at 8.44.00 AM.png

because a coin-flip can come up heads on one flip, tails on the next, and so on. The probability is the relative frequency of the coin coming up heads. It also works for orange jelly beans:

Screen Shot 2021-03-23 at 8.49.45 AM.png
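As a minimal sketch of this counting rule, the relative frequency of each outcome can be computed from a recorded list of occurrences. The two lists below are hypothetical records made up for illustration, and the function name is not from the original text.

```python
from collections import Counter


def relative_frequencies(occurrences):
    """Return each outcome's count divided by the total number of events."""
    counts = Counter(occurrences)
    total = len(occurrences)
    return {outcome: count / total for outcome, count in counts.items()}


# Hypothetical records of eight coin flips and ten jelly bean draws.
flips = ["H", "T", "T", "H", "H", "T", "H", "T"]
draws = ["orange", "blue", "red", "blue", "blue",
         "orange", "red", "blue", "red", "blue"]

print(relative_frequencies(flips)["H"])       # 4 heads out of 8 flips
print(relative_frequencies(draws)["orange"])  # 2 orange out of 10 draws
```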
  • How it’s done in practice:

While in theory you would just count empirical relative frequencies to compute probabilities, this is not the way it ever works in practice. To do so, you would literally need an infinite series of repetitions for each variable whose probability you wish to compute!

 

Probabilities are actually computed via a kind of ‘thought experiment’. 

 

Instead of actually flipping a coin an infinite number of times, you would need to imagine how such coin flips would occur. In practice, this is done by counting the number of ways that a coin can come up heads (one), and dividing by the number of all outcomes (two), giving a probability of 1/2. 

 

In practice, you compute the probability of drawing an orange jelly bean from a jar by counting all the orange jelly beans in the jar, and dividing by the total number of jelly beans, regardless of color.

 

In the coin example, your thought experiment goes something like this: ‘each time I flip the coin, I have no information that would make me think the coin is more likely to come up heads vs. tails, so I assume each type of outcome will occur roughly equally often. Since heads is one of two possible outcomes, it should happen about half the time’.

 

In the jelly bean example, your thought experiment goes something like: ‘if I can’t see the jelly beans when I pull them from the jar, then I have no reason to think that I’ll pick any particular bean over any other - they are all equally likely to be chosen’. So if there are 11 orange, 37 blue, and 22 red jelly beans (70 in total), then the probabilities of drawing each color are 11/70, 37/70, and 22/70, respectively.

You can see why this won’t work for hypotheses: While there are two conceivable outcomes, only one is possible. If a hypothesis is true, then it is only possible for it to be true; if it is false it will always be false. 
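The counting rule used in the coin and jelly bean thought experiments above can be sketched as follows. The function name is an illustrative choice; the jelly bean counts are the ones given in the text, and the rule assumes every individual outcome is equally likely.

```python
from fractions import Fraction


def probabilities_by_counting(counts):
    """Divide the number of ways each outcome can occur by the total number of ways.

    Assumes every individual way (each coin face, each bean) is equally likely.
    """
    total = sum(counts.values())
    return {outcome: Fraction(n, total) for outcome, n in counts.items()}


coin = probabilities_by_counting({"heads": 1, "tails": 1})
jar = probabilities_by_counting({"orange": 11, "blue": 37, "red": 22})

print(coin["heads"])  # 1/2
print(jar["orange"])  # 11/70
print(jar["red"])     # 22/70, which Fraction prints in lowest terms as 11/35
```

Exact fractions are used here so the results match the hand calculation; floating-point division would work equally well.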

 

 

How are probabilities used to test hypotheses?

Your next question must surely be: ‘If we can’t define posterior probabilities, then how do we compare and test hypotheses?’. The answer is a bit complicated, but relies on computing sampling probabilities for different datasets. Suppose you are interested in the correlation between:

Screen Shot 2021-03-23 at 8.27.30 AM.png

and your hypothesis is:

Screen Shot 2021-03-23 at 8.58.59 AM.png

Fig. 2

That is, this is a drug that does not affect anxiety (e.g., aspirin). Now, you conduct an experiment in which you provide 0, 10, …, 60 mg of aspirin, and your data are ‘average anxiety level’ 10 min post-administration. How do you evaluate your hypothesis? You compute all the sampling probabilities for the datasets (or equivalently, data correlations) you might observe in your experiment. For example, you might observe a correlation of zero between the x-data and the average y-data, as in the black data in the lefthand plot of Fig. 2.

 

Or, you might observe a higher positive (blue) or negative (green) correlation in the particular (noisy) dataset that you collect in your experiment. Although these are all conceivable outcomes, it is also clear that, if it is true that there is in fact no effect of this drug on anxiety, then it will be more likely that you will collect data that have a low correlation than a high correlation.

 

Thus, you imagine that the highest probability for data from your experiment will be at zero correlation, and that the probability drops off for higher potential data correlations. In particular, you would expect to see something that looked like the lefthand (flat line) plot in Fig. 2 if you ran this experiment. You wouldn’t really expect to see something that looked like the blue or green data, even though both datasets would be possible outcomes of the experiment (just due to noisy data).

 

This thought experiment would lead you to something like the bell-shaped black probability distribution plotted in Fig. 3. 

 

Notice that it’s possible to create this probability distribution because there are a great many possible outcomes of the experiment, even when there is no real relationship between dosage and anxiety. Those outcomes are just more likely (data correlations near zero) or less likely (data correlations near +/-1).
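The thought experiment can be imitated with a small simulation of the experiment under the hypothesis of no relationship. This is a sketch under stated assumptions: the doses are taken to be evenly spaced in 10 mg steps (the text elides the middle values), the anxiety scores are pure standard-normal noise, and all function names are illustrative.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_null_correlations(doses, n_repeats, rng):
    """Repeat the experiment with anxiety unrelated to dose; collect each r."""
    # Assumption: anxiety scores are standard-normal noise, independent of dose.
    return [
        pearson_r(doses, [rng.gauss(0.0, 1.0) for _ in doses])
        for _ in range(n_repeats)
    ]


doses = list(range(0, 61, 10))  # assumed: 0, 10, 20, ..., 60 mg
rs = simulate_null_correlations(doses, 10_000, random.Random(0))

near_zero = sum(abs(r) < 0.2 for r in rs) / len(rs)
extreme = sum(abs(r) > 0.8 for r in rs) / len(rs)
print("fraction with |r| < 0.2:", near_zero)
print("fraction with |r| > 0.8:", extreme)
```

In such a run, data correlations near zero turn up far more often than correlations near +/-1, which is the bell shape described above.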

 

Hypothesis testing within the frequentist statistical tradition uses these data probabilities and assigns a ‘cutoff value’ for a dataset (data correlation in this example) that seems ‘too unlikely’. That cutoff is shown as a pair of vertical red lines in Fig. 3. Then, if the data correlation that is actually observed in your experiment crosses either threshold line (it is greater than about r = 0.42 or less than about r = -0.42), you have to assume that your hypothesis (that there is really no relationship between aspirin dosage and anxiety) is false.

Screen Shot 2021-03-23 at 9.02.18 AM.png
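A hedged sketch of the cutoff rule: simulate data correlations under the no-relationship hypothesis, find the value that only a small fraction of them exceed in size, and reject the hypothesis when the observed correlation is larger than that. The text does not state the sample size or the tail fraction behind its cutoff, so the 22 data points and the 5% fraction below are assumptions picked for illustration; with them the simulated cutoff should land near the quoted 0.42. All names are illustrative.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_cutoff(n_points, alpha, n_repeats, rng):
    """Size of |r| exceeded by roughly a fraction alpha of no-relationship datasets."""
    xs = list(range(n_points))  # placeholder x-values; only their count matters here
    abs_rs = sorted(
        abs(pearson_r(xs, [rng.gauss(0.0, 1.0) for _ in xs]))
        for _ in range(n_repeats)
    )
    index = min(int((1 - alpha) * n_repeats), n_repeats - 1)
    return abs_rs[index]


def reject_no_relationship(observed_r, cutoff):
    """Decision rule: reject when the observed r crosses either threshold."""
    return abs(observed_r) > cutoff


# Assumptions: 22 data points and a 5% tail fraction (neither is given in the text).
cutoff = simulated_cutoff(n_points=22, alpha=0.05, n_repeats=20_000,
                          rng=random.Random(0))
print(round(cutoff, 2))
print(reject_no_relationship(0.10, cutoff))   # False: well inside the thresholds
print(reject_no_relationship(-0.55, cutoff))  # True: beyond the lower threshold
```

The cutoff depends on both assumed inputs: fewer data points or a smaller tail fraction push the thresholds further from zero.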

Are there problems with the frequentist statistical definition of probability?

If you are taking a statistics course, here are a few issues you may want to explore with your professor or classmates: 

 

  • The frequentist statistical definition was not the first historical attempt at defining probability. The definition of probability used by the originators of probability theory (people like Thomas Bayes, Jacob (James) Bernoulli, Blaise Pascal and Pierre-Simon Laplace) was simply a ‘degree of belief’ in a proposition (such as ‘the true rate of heads for this coin is 0.5’, or ‘it will rain tomorrow’).

Fig. 3

Although it may sound antiquated to the modern ear, this definition of probability would be expressed today as ‘the degree to which your information indicates the truth of a proposition’ (propositions being statements, such as ‘it will rain tomorrow’).

 

Expressing it as ‘a measure of the available information that compels us to assert the truth or falsity of a proposition’ makes it clear that this ‘Bayesian’ definition of probability is just as modern and scientific as any thought-experiment-based definition relying on imagined frequencies of coin-flip outcomes.

 

But in addition, it allows us to:

  • define the probabilities of hypotheses, which allows a straightforward method of Bayesian hypothesis testing that relies on comparing the probabilities of competing hypotheses (instead of the convoluted and ultimately flawed statistical hypothesis testing algorithm)

  • use frequencies and thought experiments to assign probabilities, if that is the information available to us (i.e., it does not exclude frequency information for defining probabilities)

  • easily combine the outcomes of previous experiments, and the reliabilities of those experiments, to create an intuitive method of meta-analysis via Bayes' theorem

  • use the mathematics of statistical mechanics and entropy to assign complex probability distributions 
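To illustrate comparing the probabilities of competing hypotheses, here is a minimal sketch using Bayes' theorem for the earlier question of whether a coin is two-tailed. Everything specific is an assumption made for illustration: only two hypotheses are considered (‘fair’ and ‘two-tailed’), each starts with probability 1/2, and the data are five tails in a row.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' theorem over mutually exclusive, exhaustive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # probability of the data under any hypothesis
    return {h: joint[h] / evidence for h in joint}


n_tails = 5  # hypothetical data: five flips, all tails

# Assumed prior: no reason to favor either hypothesis before seeing data.
priors = {"fair": Fraction(1, 2), "two-tailed": Fraction(1, 2)}

# Probability of the data under each hypothesis.
likelihoods = {
    "fair": Fraction(1, 2) ** n_tails,  # (1/2)^5 = 1/32
    "two-tailed": Fraction(1),          # tails is certain on every flip
}

post = posterior(priors, likelihoods)
print(post["two-tailed"])  # 32/33
print(post["fair"])        # 1/33
```

Feeding the resulting probabilities back in as the priors for a later batch of flips gives the same answer as analyzing all the flips at once, which is the sense in which results from separate experiments combine.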

 

  • The so-called Bayesian definition of probability allows us to use Bayes’ theorem, the cornerstone of modern probability theory, to analyze data and

  • optimally take account of background information (as in medical decision-making)
  • use error propagation (which sounds like a bad thing, but is actually just the practice of understanding and measuring how errors introduced into one aspect of your calculation will affect your final conclusions) to honestly account for uncertainty in measurements and hypothesis tests 

  • use a mathematically consistent method of dealing with nuisance variables (e.g., obtaining a group location measure from combined data across subjects with multiple unknown variances, marginalizing over the unknown variances with respect to a maximally uninformative prior). 
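As a small illustration of taking account of background information in a medical decision, the sketch below applies Bayes' theorem to a positive test result. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen only to make the arithmetic easy to follow, and the function name is illustrative.

```python
from fractions import Fraction


def probability_of_disease_given_positive(prevalence, sensitivity,
                                          false_positive_rate):
    """Bayes' theorem: p(disease | positive test, background information)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical background information: 1% of patients have the disease, the
# test detects 90% of true cases, and 5% of healthy patients test positive.
p = probability_of_disease_given_positive(
    prevalence=Fraction(1, 100),
    sensitivity=Fraction(9, 10),
    false_positive_rate=Fraction(1, 20),
)
print(p, float(p))  # 2/13, about 0.154
```

With these made-up numbers, the low prevalence keeps the probability of disease near 15% even after a positive result from a fairly accurate test.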
