Conditional Probabilities

The conditional probability \(\P(A\given B)\) is the probability of event \(A\) given that we know that event \(B\) has occurred. For example, we may ask for the probability of throwing a 6 with a fair die given that we have thrown an even number of points.

The conditional probability is defined as:

\[\P(A\given B) = \frac{\P(A\cap B)}{\P(B)}\]
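For the die example this gives: throwing an even number has probability \(1/2\), and throwing a 6 and an even number is the same event as throwing a 6, so

\[\P(6\given \text{even}) = \frac{\P(6\cap \text{even})}{\P(\text{even})} = \frac{1/6}{1/2} = \frac{1}{3}\]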

In practical applications in machine learning we often find ourselves in a situation where we would like to calculate the probability of some event, say \(\P(A)\), but only the conditional probabilities \(\P(A\given B)\) and \(\P(A\given \neg B)\) are known. In that case the following theorem can be used.

\[\P(A) = \P(A\given B)\,\P(B) + \P(A\given\neg B)\,\P(\neg B)\]
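As a small sketch of how this computation might look in code (Python, with made-up values for the probabilities, chosen only for illustration):

    # Total probability with the partition {B, not B}.
    # The numbers are made-up values, used only to illustrate the formula.
    p_B = 0.3              # P(B)
    p_A_given_B = 0.9      # P(A | B)
    p_A_given_not_B = 0.2  # P(A | not B)

    # P(A) = P(A|B) P(B) + P(A|not B) P(not B)
    p_A = p_A_given_B * p_B + p_A_given_not_B * (1 - p_B)
    print(p_A)             # 0.9*0.3 + 0.2*0.7 = 0.41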

The proof starts with observing that:

\[A = (A\cap B) \cup (A\cap \neg B)\]

and because \(A\cap B\) and \(A\cap \neg B\) are disjoint we may apply the third axiom and obtain:

\[\begin{split}\P(A) &= \P(A\cap B) + \P(A\cap \neg B)\\ &= \frac{\P(A\cap B)}{\P(B)}\,\P(B) + \frac{\P(A\cap \neg B)}{\P(\neg B)}\,\P(\neg B)\\ &= \P(A\given B)\,\P(B) + \P(A\given\neg B)\,\P(\neg B)\end{split}\]

This theorem may be extended to partitions of the universe \(U\). A partition of \(U\) is a collection of subsets \(B_i\) for \(i=1,\ldots,n\) such that \(B_i \cap B_j=\emptyset\) for any \(i\not=j\) and \(B_1\cup B_2\cup\cdots\cup B_n=U\).

For any partition \(\{B_i\}\) of \(U\) we have:

\[\P(A) = \sum_{i=1}^{n} \P(A \given B_i)\,\P(B_i)\]
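A sketch of the same computation for a partition with more than two parts (again with made-up numbers; the \(\P(B_i)\) must sum to one):

    # Total probability over a partition {B_1, ..., B_n}.
    # Made-up numbers for a partition with n = 3 parts.
    p_B = [0.2, 0.5, 0.3]            # P(B_i), sums to 1
    p_A_given_B = [0.1, 0.6, 0.4]    # P(A | B_i)

    # P(A) = sum_i P(A | B_i) P(B_i)
    p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
    print(p_A)                       # 0.1*0.2 + 0.6*0.5 + 0.4*0.3 = 0.44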

The proof is a generalization of the proof for the partition \(\{B, \neg B\}\).

Bayes' rule allows us to write \(\P(A\given B)\) in terms of \(\P(B\given A)\):

\[\P(A\given B) = \frac{\P(A)}{\P(B)}\,\P(B\given A)\]

The proof of Bayes' rule follows directly from the definition of the conditional probability.
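Writing \(\P(A\cap B)\) in two ways and dividing by \(\P(B)\) gives:

\[\P(A\given B)\,\P(B) = \P(A\cap B) = \P(B\given A)\,\P(A) \quad\Longrightarrow\quad \P(A\given B) = \frac{\P(A)}{\P(B)}\,\P(B\given A)\]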

The definition of the conditional probability can be written in another form:

\[\P(A\,B) = \P(A\given B)\,\P(B)\]

In this form it is known as the chain rule (or product rule). This rule can be generalised as:

\[\P(A_1\,A_2\cdots A_n) = \P(A_1\given A_2,\ldots,A_n)\,\P(A_2\given A_3,\ldots,A_n) \cdots \P(A_{n-1}\given A_n)\,\P(A_n)\]
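For three events, for instance, the chain rule reads:

\[\P(A_1\,A_2\,A_3) = \P(A_1\given A_2, A_3)\,\P(A_2\given A_3)\,\P(A_3)\]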

If you learned to make probability trees in high school, you were already using conditional probabilities and the chain rule, probably without knowing it. Consider the following random experiment. There are two vases labelled \(V_1\) and \(V_2\). In the vases are marbles of three colors. The following table shows how many marbles of each color are in each vase.

Vase    #Red    #Green    #Blue
1       2       3         4
2       3       5         1

First we randomly select a vase (with no preference for either vase) and then we pick a marble from the selected vase. The probability tree for this experiment is described below.

Starting at the root node on the left we first select Vase 1 or Vase 2. From either the node ‘Vase 1’ or the node ‘Vase 2’ we then pick a marble from the vase. The marble is either ‘Red’, ‘Green’ or ‘Blue’.

The probability of picking either vase is \(1/2\). Picking a Red marble from Vase 1 has probability \(\P(\text{Red}\given \text{Vase 1})=2/9\). So the probability of ending up with a Red marble from Vase 1 is given by:

\[\P(\text{Vase 1}\cap \text{Red}) = \P(\text{Vase 1}) \, \P(\text{Red}\given \text{Vase 1}) = \frac{1}{2}\,\frac{2}{9} = \frac{1}{9}\]

Thus the ‘trick’ that you have been using in high school, multiplying the probabilities along the edges of the tree, is nothing more than the application of the chain rule of conditional probabilities.
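The calculations for the vase experiment also fit in a few lines of Python. The sketch below (with variable names chosen here purely for illustration) reproduces the number above and additionally computes \(\P(\text{Red})\) with the total probability theorem and \(\P(\text{Vase 1}\given\text{Red})\) with Bayes' rule:

    # Marble counts per vase: (red, green, blue)
    vases = {"Vase 1": (2, 3, 4), "Vase 2": (3, 5, 1)}
    p_vase = {"Vase 1": 0.5, "Vase 2": 0.5}   # no preference for a vase

    # Chain rule: P(Vase 1 and Red) = P(Vase 1) P(Red | Vase 1)
    red, green, blue = vases["Vase 1"]
    p_red_given_v1 = red / (red + green + blue)        # 2/9
    p_v1_and_red = p_vase["Vase 1"] * p_red_given_v1   # 1/2 * 2/9 = 1/9
    print(p_v1_and_red)

    # Total probability: P(Red) = sum over vases of P(Red | vase) P(vase)
    p_red = sum(p_vase[v] * vases[v][0] / sum(vases[v]) for v in vases)
    print(p_red)                                       # 5/18

    # Bayes' rule: P(Vase 1 | Red) = P(Red | Vase 1) P(Vase 1) / P(Red)
    print(p_v1_and_red / p_red)                        # 2/5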