2.5.5. Conditional Random Variables

Consider two discrete random variables \(X\) and \(Y\). Then we can define the conditional probability

\[\P(X=x \given Y=y)\]

Note that for any value \(y\) we have a random variable \(X\given Y=y\) with probability mass function

\[X \given Y=y \sim p_{X\given Y=y}(x)\]
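
For a concrete feel of how such a conditional PMF arises from a joint PMF, here is a minimal sketch in Python; the joint probabilities are made-up numbers, used purely for illustration:

```python
# A minimal sketch: conditional PMF p_{X|Y=y}(x) from a (hypothetical) joint PMF.
# The joint probabilities below are invented numbers for illustration only.
joint = {  # p_{X,Y}(x, y)
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

def conditional_pmf(joint, y):
    """Return p_{X|Y=y} as a dict x -> P(X=x | Y=y)."""
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)  # marginal P(Y=y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

print(conditional_pmf(joint, y=1))  # {0: 0.4615..., 1: 0.5384...}
```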

The notation for conditional random variables is not consistent across the literature. You will often find the notation

\[p_{X\given Y}(x\given y)\]

which I find a bit confusing, as I like to reserve the \(\given\) symbol to precede the conditioning event and not a mere value. But be warned that if you advance to Bayesian statistics, the \(\given\) symbol is used very often to denote the dependence on parameter values.

Now consider the situation where \(X\) is a continuous random variable whereas \(Y\) is a discrete random variable. In that case we have a probability density function for the random variable \(X\given Y=y\):

\[X\given Y=y \sim f_{X\given Y=y}(x)\]

The conditional random variable \(Y\given X=x\), on the other hand, is a discrete random variable with probability mass function

\[Y\given X=x \sim p_{Y\given X=x}(y)\]

And yes, although \(\P(X=x)=0\), the outcome \(X=x\) can evidently occur in the random experiment (there is always some outcome, and that outcome can of course be \(x\)).
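
One way to make this mixed discrete/continuous situation concrete is to sample from the joint distribution: first draw the discrete \(Y\), then draw \(X\) from the conditional density \(f_{X\given Y=y}\). A minimal sketch, with Gaussian conditional densities chosen purely as an assumption for illustration:

```python
import random

# Hypothetical model: Y ~ Bernoulli(0.5), and X | Y=y ~ Normal(mu[y], sigma[y]).
# All parameter values below are made up for illustration.
mu = {0: 150.0, 1: 180.0}
sigma = {0: 10.0, 1: 15.0}

def sample_joint():
    """Draw one outcome (x, y): the discrete Y first, then the continuous X | Y=y."""
    y = 1 if random.random() < 0.5 else 0
    x = random.gauss(mu[y], sigma[y])
    return x, y

print(sample_joint())  # e.g. (172.3..., 1); P(X = that exact x) is still 0
```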

To illustrate this situation, consider apples (1) and pears (0) to be the possible outcomes of the random variable \(Y\), and let \(X\) denote the weight of a piece of fruit (either an apple or a pear). Then we may wonder what the probability is that a piece of fruit of weight \(x\) is a pear or an apple, i.e. \(\P(Y=y\given X=x)\). It is tempting to use Bayes' rule directly and write

\[\P(Y=y\given X=x) = \frac{\P(X=x\given Y=y)\P(Y=y)}{\P(X=x)} \quad\text{THIS IS WRONG}\]

It is obviously wrong, as \(X\) and \(X\given Y=y\) are both continuous random variables, and hence the above expression evaluates to \(0/0\). The failure of this naive application does not imply that we cannot use Bayes' rule at all. To make the correct use of Bayes' rule a bit more intuitive, rather than just stating the result, we introduce the (admittedly sloppy) notation:

\[X\approx x\]

for the event \(x\leq X \leq x+dx\), which has probability \(f_X(x)dx\) for \(dx\rightarrow 0\), i.e. for infinitesimally small \(dx\):

\[\P(X\approx x) = \P(x\leq X \leq x+dx) = f_X(x)dx\]
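
This approximation is easy to verify numerically. A minimal sketch for a standard normal \(X\) (the choice of distribution is arbitrary), comparing the exact interval probability with \(f_X(x)dx\):

```python
import math

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

x, dx = 1.0, 1e-4
print(Phi(x + dx) - Phi(x))  # P(x <= X <= x+dx) ~= 2.4196e-05
print(phi(x) * dx)           # f_X(x) dx         ~= 2.4197e-05
```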

(remember that a probability density times an interval length is a probability). With this definition we have:

\[\P(Y=y\given X=x) = \frac{\P(X\approx x\given Y=y)\P(Y=y)}{\P(X\approx x)} = \frac{f_{X\given Y=y}(x) dx \P(Y=y)}{f_X(x)dx}\]

The \(dx\) factors in the numerator and denominator cancel, and so:

\[\P(Y=y\given X=x) = \frac{f_{X\given Y=y}(x) \P(Y=y)}{f_X(x)}\]

where:

  • \(\P(Y=y\given X=x)\): The a posteriori probability of class \(y\) given the value \(x\).

  • \(\P(Y=y)\): The a priori probability of class \(y\).

  • \(f_{X\given Y=y}\): The class-conditional probability density function for \(X\given Y=y\).

  • \(f_X\): The evidence, i.e. the probability density for \(X\).
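
To see this final formula in action, here is a minimal numerical sketch for the fruit example; the Gaussian class-conditional densities and the prior below are assumptions invented for illustration, not fitted to data. Note that the evidence \(f_X(x)\) is obtained from the law of total probability: \(f_X(x) = \sum_y f_{X\given Y=y}(x)\P(Y=y)\).

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density, used here as the class-conditional f_{X|Y=y}."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical model: apples (y=1) weigh ~ N(180, 15), pears (y=0) ~ N(150, 10) grams.
prior = {0: 0.4, 1: 0.6}                    # P(Y=y), made up for illustration
cc = {0: (150.0, 10.0), 1: (180.0, 15.0)}   # (mu, sigma) of f_{X|Y=y}

def posterior(x):
    """P(Y=y | X=x) = f_{X|Y=y}(x) P(Y=y) / f_X(x)."""
    num = {y: gauss_pdf(x, *cc[y]) * prior[y] for y in prior}
    evidence = sum(num.values())            # f_X(x) by total probability
    return {y: n / evidence for y, n in num.items()}

print(posterior(165.0))  # posterior over pear (0) vs apple (1) at 165 g
```

For a weight of 165 grams this hypothetical model puts roughly 65% posterior probability on apple, and the posterior shifts smoothly toward pear for lighter fruit.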