22 Nov 2017

Power Analysis

In statistics, power analysis refers to calculating how many runs of an experiment you need to perform if you want to detect an effect, given a lower bound for the strength of the effect. For example, if you were doing an experiment to detect the efficacy of a new drug, and you suspect that the drug will only work in 1 out of 10 people (but when it works, it completely eliminates the disease), how many patients will you have to test to conclude that the effect of the drug is statistically significant?

You can also run the analysis backwards: if you know you have a fixed number of subjects for your experiment, you can establish a lower bound for the size of any statistically significant effects you might measure. For example, if you’re only doing 50 trials, then you can use power analysis to figure out that you’ll only be able to detect a condition that affects 1 out of 20 people. If the effect you’re looking for is real, but only affects 1 out of 30 people, then you won’t be able to find statistically significant evidence for it. (The numbers here are obviously made up, but that’s the kind of statement this analysis lets you make.)

Inputs and outputs

Instead of medical experiments, let’s confine ourselves to finding out whether a coin is biased or not [1]. To fully specify the problem, let’s state precisely what we want to figure out. The question we’re going to answer is,

If a coin’s bias is $x$, and we do an experiment recording $N$ coin flips, what’s the chance that we find statistically significant evidence that the coin is biased?

If we answer this question, we’ll be able to figure out how many trials of flipping the coin we need–just keep increasing $N$ until we calculate that we have a very high probability of detecting the bias.

We’ve also specified that we’re looking for statistically significant evidence that the coin is biased. This means that we need to select a level of significance. I’m going to use the typical value of 0.05–that is, if we observe a $p$-value under 0.05 in our experiment, then we conclude that the coin is biased. Technically, the significance level is another parameter you’d have to supply when doing power analysis, but for the rest of the post I’m going to fix it at 0.05.

So, an answer to the question would be a function or expression that you could supply with the suspected bias of the coin, $x$, and the number of flips in your experiment, $N$, and receive an output of a probability. Let’s figure out that function.

Math

We still need to state what we mean by “a coin’s bias is $x$”. There’s more than one way to define this, but for this post $x$ will be the probability that the coin lands on heads when you flip it. So an unbiased coin has $x = 0.5$, but any other value of $x$ means that a coin is biased.

We know the probability of observing heads when we flip the coin once is $x$. What’s the probability of observing $h$ heads when we flip the coin $N$ times? This is the binomial distribution

\begin{equation} P(h) = x^h \, (1 - x)^{N - h} \frac{N!}{(N - h)!\,h!} \end{equation}

Let’s write a function which we can supply with $h$ (and $N$ and $x$, of course) that’ll give us this probability:

import numpy as np
from scipy.special import comb
import matplotlib.pyplot as plt
%matplotlib inline
def Pheads(h, N, x):
    # binomial probability of observing h heads in N flips of a coin with bias x
    heads = h
    tails = N - heads
    return (x**heads) * ((1 - x)**tails) * comb(N, heads)

A quick check that the probabilities of all the outcomes we can observe add up to 1:

a = np.arange(101)
f = [Pheads(A, 100, 0.8) for A in a]
np.trapz(f, a)
0.99999999989814803
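Since this is a discrete distribution, a plain np.sum(f) is the more direct check; np.trapz happens to agree here only because the endpoint terms it halves are vanishingly small:

np.sum(f)  # sums the probability mass directly, no trapezoid approximation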

We’re good. Now we can calculate the outcome of doing an experiment.

The next question to answer is, given a certain outcome, what can we conclude about the coin? This is where the statistics starts–we conclude that the coin is biased if the $p$-value is below 0.05. So we just need a way to calculate the $p$-value from the outcome of the experiment.

To do that, assume that the coin is actually fair (this corresponds to our null hypothesis). We need to calculate the probability of the outcome we observed, or one more extreme, occurring. In our context, say we observe 3 heads after 10 coin flips. We need to find the probability of getting 3 heads if the coin were fair. We also need the probabilities of observing all events more extreme than 3 heads, which means getting 2, 1, or 0 heads. The sum of these probabilities is the $p$-value.
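For concreteness, here’s that sum computed with the Pheads function from above; the exact value is 176/1024:

# one-tailed p-value for observing 3 heads in 10 flips of a fair coin
sum(Pheads(i, 10, 0.5) for i in range(4))  # P(0) + P(1) + P(2) + P(3)
0.171875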

On the other hand, another way to phrase this outcome is: the difference between the expected number of heads (5) and the observed number was 2. Put this way, the outcomes at least as extreme also include differences of 2, 3, 4, and 5 in the other direction, which means we’d have to also add up the probabilities for observing 7, 8, 9, and 10 heads. Which way is right?

They both are. The first way of adding up probabilities corresponds to a one-tailed $p$-value, and the second corresponds to a two-tailed $p$-value (since we’re adding up probabilities on both sides/tails of the distribution). If we knew that the coin was biased to land on tails more often, then we’d use the first calculation, where we add up the probabilities for landing on fewer and fewer heads. But we’ve stated only that we suspect the coin to be biased; we don’t know if it’s biased towards landing on heads more, or on tails more (we’re not testing whether $x < 0.5$ or $x > 0.5$; we’re just testing whether $x \neq 0.5$). This means that we add up the probabilities of all outcomes whose difference from the expected count $N/2$ is at least as large as the one we observed–the two-tailed $p$-value.

def pvalue(h, N):
    # fold onto the smaller tail so that h is the lower, more extreme count
    h = min(h, N - h)
    left_tail = sum(Pheads(i, N, 0.5) for i in range(h + 1))
    right_tail = sum(Pheads(i, N, 0.5) for i in range(N - h, N + 1))
    # if h is exactly N/2, the two tails overlap at the middle outcome
    if (2 * h == N):
        right_tail -= Pheads(h, N, 0.5)
    return left_tail + right_tail

Let’s test this out: if we flip the coin 10 times and observe 4 heads, we shouldn’t have much reason to suspect that the coin is biased. What’s the $p$-value that corresponds to this outcome?

pvalue(4, 10)
0.75390625

It’s a very large number, and certainly not less than 0.05, which matches our intuition that we have very little reason to suspect a bias. On the other hand, what’s the $p$-value for observing 1 head when flipping the coin 10 times?

pvalue(1, 10)
0.021484375

This $p$-value is 0.02, which is less than 0.05. That means that at this significance level, we’d conclude that the coin is biased. This also matches our intuition: we’d be suspicious if we only saw one head after 10 flips.

Now that we can calculate the $p$-value, and therefore know how to draw a conclusion from every experiment, we can answer our original question: what is the probability that we conclude the coin is biased? Let’s write an expression for that, including all our variables:

\begin{equation} P(\mathrm{biased}, h, N, x) \end{equation}

We’re going to externally fix $N$ and $x$–in other words, they’ll be parameters we supply to the function that calculates $P(\mathrm{biased})$. Let’s leave them out of the expression. That means there’s only one variable left to expand this in, $h$:

\begin{equation} P(\mathrm{biased}) = \sum_{H=0}^N P(\mathrm{biased} \mid h \;\mathrm{is}\; H) \; P(h \;\mathrm{is}\; H) \end{equation}

The binomial formula, encoded in the function Pheads we wrote above, gives us $P(h \;\mathrm{is}\; H)$. To figure out the other factor, we know that we only ever conclude the coin is biased if the $p$-value is less than 0.05. This means that whenever $H$ is close enough to $N/2$ that the $p$-value is above 0.05, there’s 0 probability we conclude the coin is biased, and so $P(\mathrm{biased} \mid h \;\mathrm{is}\; H) = 0$. Whenever $H$ is sufficiently close to 0 or $N$ that the $p$-value is under 0.05, the probability of concluding the coin is biased is 1. Therefore the factor $P(\mathrm{biased} \mid h \;\mathrm{is}\; H)$ has the effect of filtering out all terms in the sum for which the $p$-value is not below 0.05.

Based on this intuition, the Python function that calculates $P(\mathrm{biased})$, which we’ll call the power, is:

def power(x, N):
    # total probability of the outcomes that reject the null (p-value < 0.05)
    return sum(Pheads(h, N, x) for h in range(N + 1) if pvalue(h, N) < 0.05)

Giving it a spin

If the coin isn’t actually biased, and we observe 50 coin flips, we should almost never see evidence that would lead us to conclude the coin is rigged. The probability of making such a conclusion, the power, should be low:

x = 0.5
N = 50
prob = power(x, N)
print('probability of concluding rigged: {0}'.format(prob))
probability of concluding rigged: 0.0328391375643

There’s a 3.3% chance we’d conclude that the coin is rigged. This is a small number, like we suspected, so the code seems like it’s working. I’m actually surprised that the probability is this high, though–50 coin flips is a lot. But remember that we’re basing this calculation on a significance level of 0.05; if we made the test more stringent by lowering 0.05 to, say, 0.01, we’d calculate smaller powers.
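To see the effect of the significance level directly, here’s a sketch of a variant of power that takes the threshold as a parameter (power_at_alpha is a name I’m introducing here, not something used elsewhere in this post):

def power_at_alpha(x, N, alpha):
    # same as power(), but with an adjustable significance level
    return sum(Pheads(h, N, x) for h in range(N + 1) if pvalue(h, N) < alpha)

At $x = 0.5$ the rejection region this sums over has total probability below alpha by construction, so power_at_alpha(0.5, 50, 0.01) is guaranteed to come out under 0.01.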

For the opposite scenario, where the coin is strongly biased and we do 50 flips, the power is

x = 0.75
N = 50
prob = power(x, N)
print('probability of concluding rigged: {0}'.format(prob))
probability of concluding rigged: 0.944876640866

So we’d almost certainly be able to see evidence for a bias by doing 50 flips, if the bias is as strong as a 75% chance to land on one side of the coin.

Plots with fixed N

Since we’re now reasonably sure our code is correct, let’s get a better sense of the behavior of this function by making plots. We’ll first look at the power across all possible values of the bias $x$, for different $N$.

from collections import OrderedDict

X = np.linspace(0, 1, 100)
N = [10, 40, 100]
power_curves = []
for n in N:
    power_curves.append([power(x, n) for x in X])
curve_dict = OrderedDict(zip(N, power_curves))
for k in curve_dict:
    plt.plot(X, curve_dict[k], label=r'$N = {0}$'.format(k))
plt.legend(loc='lower left')
plt.xlabel(r'$x$', size=20, usetex=True)
plt.ylabel('power', size=16)
plt.axis([0, 1, 0, 1])

[Figure: power vs. bias $x$ for $N$ = 10, 40, 100]

The first thing we notice is that these curves are symmetric about $x = 0.5$. This makes sense since we’re not testing for whether our coin is biased specifically towards heads or tails, but just whether it’s biased in some way. So a bias of landing on heads 90% of the time and a bias of landing on tails 90% of the time are treated equally.

The next interesting detail is that the curves start out very fat for $N = 10$, and get sharper and sharper as $N$ increases. That just means that as we do more coin flips in our experiment, we’re able to detect more subtle biases, i.e. when $x$ is closer to 0.5. Let’s make this same plot again for larger values of $N$.

X = np.linspace(0, 1, 100)
N = [100, 300, 1000]
power_curves = []
for n in N:
    power_curves.append([power(x, n) for x in X])
curve_dict = OrderedDict(zip(N, power_curves))
for k in curve_dict:
    plt.plot(X, curve_dict[k], label=r'$N = {0}$'.format(k))
plt.legend(loc='lower left')
plt.xlabel(r'$x$', size=20, usetex=True)
plt.ylabel('power', size=16)
plt.axis([0, 1, 0, 1])

[Figure: power vs. bias $x$ for $N$ = 100, 300, 1000]

The curves continue to get narrower as we increase $N$. Although I’m still a little surprised that the curve for $N = 1000$ doesn’t look like an upside-down delta function; this plot implies that if you suspect a bias where $x$ is close to 0.5, you still can’t be absolutely sure you’ve done enough testing to catch it after 1000 coin flips!

There’s something else that’s surprising in the first plot above. Look at the lowest points on each of the curves, right in the middle where $x = 0.5$. This point doesn’t always get higher as you increase $N$! When I first took a glance at these plots, I came away with the pattern that “as you increase $N$, the curves get skinnier and the nadir gets higher.” But in the first plot, the nadir is highest for the $N = 40$ curve, which is neither the skinniest nor the fattest curve! That means the relationship between the height of the nadir and the width of the power curve, or $N$, may not be as simple as we might think.
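To double-check that reading of the plot, we can print the nadirs directly, since the nadir of each curve is just the power at $x = 0.5$:

# nadir of each power curve in the first plot
for n in (10, 40, 100):
    print(n, power(0.5, n))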

Plots with fixed x

Let’s plot the value of the power when $x = 0.5$ for different values of $N$. You can interpret this value as the probability that you’d conclude you had a rigged coin, when in fact the coin wasn’t biased at all.

N = [10, 30, 60, 90, 150, 250, 400, 700, 1000]
z = [power(0.5, n) for n in N]
plt.plot(N, z, 'o-', label=r'$x = 0.5$')
plt.xlabel(r'$N$', usetex=True, size=20)
plt.ylabel('power', size=16)
plt.legend(loc='lower right')

[Figure: power at $x = 0.5$ vs. $N$]

Clearly there’s some non-monotonic behavior here. This is interesting. I would have thought that the power at $x = 0.5$ would be a strictly decreasing function of $N$–in other words, that there’d be less of a chance of drawing the incorrect conclusion that the coin is biased as we perform more trials. This is just based on the intuition that we usually have a higher probability of drawing correct conclusions as the number of trials goes up.

Let’s zoom in on the region from $N = 10$ to $N = 200$.

N = np.arange(10, 210, 5)
z = [power(0.5, n) for n in N]
plt.plot(N, z, 'o-', label=r'$x = 0.5$')
plt.xlabel(r'$N$', usetex=True, size=20)
plt.ylabel('power', size=16)
plt.legend(loc='lower right')

[Figure: power at $x = 0.5$ for $N$ from 10 to 200, showing a sawtooth pattern]

That’s definitely interesting behavior. The overall trend of the curve is upward, but there’s some periodicity too. You can see a series of sawtooths, each slightly higher than the previous one, with a period that seems to be increasing. I wouldn’t have predicted this. I’m not just referring to the interesting shape of the graph, but to the fact that the power essentially increases with $N$ when $x = 0.5$. This sounds good, until you realize that when $x = 0.5$ the power is the probability of drawing the wrong conclusion.
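My guess, and it’s only a guess not explored further here, is that the sawtooth comes from the discreteness of the rejection region: the cutoff count of heads below which we reject can only move in whole steps as $N$ grows, so the false positive rate decays while the cutoff stays put and then jumps back up toward 0.05 each time the cutoff increments. A quick probe of that idea:

# track the low-tail rejection cutoff alongside the power at x = 0.5;
# the power should jump up at the values of N where the cutoff increments,
# and decay slowly in between
for n in range(20, 80, 5):
    rejected = [h for h in range(n + 1) if pvalue(h, n) < 0.05]
    low_cutoff = max(h for h in rejected if h <= n // 2)
    print(n, low_cutoff, power(0.5, n))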

Minimum sample size needed

We can also answer the question: given a significance level, and a suspected minimum bias of the coin, what is the minimum number of flips $N$ we need to carry out so that the probability of concluding the coin is rigged is at least 0.8?

We can do this by fixing $x$ at the suspected bias value, and calculating the power over a large range of $N$. Then we can just look for the first value of $N$ for which the power is at least 0.8.

x = 0.7
N = list(range(4, 200))
powers = np.array([power(x, n) for n in N])
ind = np.where(powers >= 0.8)[0][0]  # index of the first N reaching 0.8
print('At least {0} flips'.format(N[ind]))
At least 49 flips

If we suspect the coin is biased to land on heads or tails at least 70% of the time, and we want an 80% chance of detecting this bias in an experiment, we’d have to flip the coin at least 49 times.
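Wrapped up as a reusable helper (min_flips is a name I’m introducing here, not something defined earlier in the post):

def min_flips(x, target_power=0.8, max_N=1000):
    # smallest N whose power reaches the target, using the same
    # first-crossing search as above; None if the target is never reached
    for n in range(4, max_N + 1):
        if power(x, n) >= target_power:
            return n
    return None

Calling min_flips(0.7) should reproduce the 49 flips we just found.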

Testing

Another nice thing about examining a simple situation like coin flipping is that we can run simulations to check our function. Since the output of our function is a probability, we can calculate it by brute force: run many trials using a random number generator, and see the proportion of them for which our condition is satisfied.

In this case, since the power tells us the probability of making a conclusion in an experiment, for every trial we have to simulate an experiment, where we flip a coin $N$ times and use the $p$-value to conclude whether or not it’s biased. After a large number of trials, say 10,000, we look at the fraction of trials where we concluded that the coin was biased. This fraction should approach the theoretical probability our function calculates.

from numpy.random import binomial
x = 0.3
N = 60
trials = 10000
print('simulated:   {0}'.format(sum(1 for i in range(trials) if pvalue(binomial(N, x), N) < 0.05) / float(trials)))
print('theoretical: {0}'.format(power(x, N)))
simulated:   0.8352
theoretical: 0.838182139392
x = 0.5
N = 200
trials = 10000
print('simulated:   {0}'.format(sum(1 for i in range(trials) if pvalue(binomial(N, x), N) < 0.05) / float(trials)))
print('theoretical: {0}'.format(power(x, N)))
simulated:   0.0405
theoretical: 0.0400371916134

It seems our power function matches the brute-force calculations, so we can be confident that the analysis above is correct.


[1] Coin flipping is like mouse models in biology. I once heard a cancer researcher say something like, “If you can’t cure cancer in a mouse, you’d better find another job.” Similarly, if we can’t apply statistical tools to a binomial model like coin flipping, then we probably don’t know what we’re talking about.