31 Oct 2017

Information

What does it mean to get information about one variable by observing another variable? (I’m going to assume that you know what joint, conditional, and marginal probability distributions are.) Let’s consider the simplest possible case: there are two variables, $x$ and $y$, and their joint probability distribution is $P(x, y)$.

Let’s say that you have a device that only measures the variable $x$. Over some span of time you record a bunch of measurements for $x$, and come up with a putative probability distribution for $x$, $P(x)$. If you knew $P(x, y)$, this would correspond to the marginal distribution for $x$ you could get by integrating out $y$.
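Written out, that marginalization is

\begin{equation} P(x) = \int P(x, y)\, dy. \end{equation}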

Now let’s say that you have another device, one that only measures $y$. Pick a certain value of $y$, and call that value $c$. Now, you only measure $x$ when your second device tells you that the variable $y$ is close to $c$. Again, after some time you can come up with another probability distribution for $x$. This distribution is denoted by $P(x \vert y = c)$, which means the probability distribution of $x$ given that $y$ is $c$. This is also known as a conditional probability distribution (it’s the distribution of $x$, conditioning on the fact that $y$ is $c$).
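In terms of the joint distribution, this conditional is defined as

\begin{equation} P(x \vert y = c) = \frac{P(x, y = c)}{P(y = c)}, \end{equation}

where the denominator, the marginal $P(y = c)$, ensures the conditional is properly normalized over $x$.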

What if $P(x)$ and $P(x \vert y = c)$ are different? Let’s consider a concrete case: say $P(x)$ is a flat, uniform distribution, but $P(x \vert y = c)$ sharply spikes around one value. This means that normally, we don’t have a good idea what value $x$ takes. But, when we happen to know that $y$ is taking on the value $c$, we have a pretty good idea of what value $x$ will take. Observing the value of $y$ changes our idea of what value $x$ will be, in a way that would let us better predict $x$ (it literally changes our state of knowledge and therefore the probability distribution we would use for $x$, as a Bayesian would say). Another way of saying this is that we have gained information about one variable, $x$, by observing another variable, $y$.

To get really concrete, let $x$ be a variable measuring hair length, and $y$ be a categorical variable representing gender. If we used our first measuring device, our eyes, to indiscriminately look at the hair lengths of everyone around us, we’d observe a distribution that would be quite broad, with a sizable right tail. But, if we first pick a value for $y$ (male), and only observe someone’s hair length after we ascertain that the $y$ variable is male, we’d come up with a very different distribution of hair lengths, one that is concentrated at small values and falls off much more quickly away from zero. Knowing someone’s gender lets us make better predictions (on average) of what their hair length would be; observing the gender variable gives us information about the hair length variable (by Bayes’ theorem, the converse is also true, although the information gain could be much less substantial).
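To see the effect numerically, here’s a minimal simulation sketch. The distributions are entirely made up for illustration (an exponential for male hair lengths, a clipped normal otherwise); only the qualitative shape matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical model, purely for illustration: gender is a fair coin flip,
# and hair length (in cm) is drawn from a different distribution per gender.
gender = rng.choice(["male", "female"], size=n)
hair = np.where(
    gender == "male",
    rng.exponential(scale=5.0, size=n),        # short, falls off quickly
    rng.normal(loc=30.0, scale=12.0, size=n),  # longer, broader
).clip(min=0.0)

# Marginal P(x): broad, with a sizable right tail.
print("marginal    mean/std:", hair.mean(), hair.std())

# Conditional P(x | y = male): concentrated at small values.
male = hair[gender == "male"]
print("conditional mean/std:", male.mean(), male.std())
```

The marginal mixes the two groups and comes out broad; the conditional slice is tight around small values, which is exactly the “gained information.”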

What about the opposite case, when $P(x)$ and $P(x \vert y = c)$ are the same? Does that mean we won’t get information about $x$ by observing $y$? Not quite: what if you pick a different value for $y$, other than $c$? It’s possible that $P(x \vert y = c) = P(x)$, but when $y$ takes another value, $c'$, you find that $P(x \vert y = c') \neq P(x)$. In this case, you can still gain information about $x$ if you observe that $y$ is $c'$. So the true requirement that two variables $x$ and $y$ have to satisfy if they can’t give you information about each other (in other words, if they’re statistically independent) is

\begin{equation} P(x\vert y = c) = P(x)\quad \forall c \in \mathrm{range}(y). \end{equation}

In other words, you have to make sure that $P(x \vert y = c) = P(x)$ for every value $c$ that $y$ can take.
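As a sketch of what this check looks like for discrete variables, here’s a comparison of every conditional slice of a joint probability table against the marginal (the numbers in the table are made up for illustration):

```python
import numpy as np

# Hypothetical joint probability table P(x, y): rows index values of x,
# columns index values of y.
joint = np.array([
    [0.10, 0.15, 0.05],
    [0.20, 0.30, 0.10],
    [0.05, 0.03, 0.02],
])
joint /= joint.sum()     # normalize so the table sums to 1

p_x = joint.sum(axis=1)  # marginal P(x), summing out y
p_y = joint.sum(axis=0)  # marginal P(y)

# Independence requires P(x | y = c) == P(x) for *every* value c of y.
for c in range(joint.shape[1]):
    cond = joint[:, c] / p_y[c]  # conditional P(x | y = c)
    print(f"y = {c}: P(x|y) = {cond.round(3)}, matches P(x): {np.allclose(cond, p_x)}")

# By contrast, a joint built as an outer product of marginals is
# independent by construction: every conditional slice equals P(x).
indep = np.outer(p_x, p_y)
print(all(np.allclose(indep[:, c] / p_y[c], p_x) for c in range(3)))
```

This particular table fails the test: the conditional slices differ from the marginal, so observing $y$ does tell you something about $x$.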

To summarize, gaining information means that we’re better able to predict one variable if we first observe another variable, compared to if we didn’t know what value the second variable took.