Basic Probability Ideas


Probability

The collection of every possible outcome of an experiment is called the sample space, denoted \(S\). It can be discrete or continuous. An event \(A\) is a set of outcomes of an experiment, that is, a subset of the sample space: \(A \subseteq S\).

The probability of \(A\), denoted \(P(A)\), satisfies the following axioms:

  1. \( 0 \leq P(A) \leq 1\)
  2. \(P(S) = 1\) and \(P(\emptyset) = 0\)
  3. If the events \(A\) and \(B\) are mutually exclusive, \(P(A \cup B) = P(A) + P(B)\)
     Note: Mutually exclusive means that the two events \(A\) and \(B\) cannot occur at the same time.

For a finite sample space whose outcomes are equally likely, \(P(A)\) can be calculated as: \[ P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}. \]
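
As a quick illustration of this counting formula, here is a minimal Python sketch using an assumed example (one roll of a fair six-sided die), not taken from the text:

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (assumed example).
S = {1, 2, 3, 4, 5, 6}

# Event A: the roll is even.
A = {s for s in S if s % 2 == 0}

# Classical probability: |A| / |S|, valid because the outcomes are equally likely.
P_A = Fraction(len(A), len(S))
print(P_A)  # 1/2
```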
Using event algebra, the following basic facts can be derived (a short numerical check follows the list):

  1. Complement: \(P(\bar{A}) = 1 - P(A)\)
     Note: \(1 = P(S) = P(A \cup \bar{A}) = P(A) + P(\bar{A})\)

  2. Addition: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
     Note: If \(A\) and \(B\) are mutually exclusive, \(P(A \cap B) = P(\emptyset) = 0\).

  3. Inclusion: If \(B \subset A\), then \(A \cap B = B\), and so \(P(A \cap \bar{B}) = P(A) - P(A \cap B) = P(A) - P(B)\)

  4. De Morgan's laws:
     \(P(\overline{A \cup B}) = P( \bar{A} \cap \bar{B})\)
     \(P(\overline{A \cap B}) = P( \bar{A} \cup \bar{B})\)
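
The following sketch verifies the complement, addition, and De Morgan rules by brute-force counting. The die events \(A\) and \(B\) are assumed examples chosen only for illustration:

```python
from fractions import Fraction

S = set(range(1, 7))   # one fair die (assumed equally likely outcomes)
A = {2, 4, 6}          # "the roll is even"
B = {4, 5, 6}          # "the roll is at least 4"

def P(E):
    """Classical probability of an event E within the sample space S."""
    return Fraction(len(E), len(S))

# Complement rule: P(not A) = 1 - P(A)
assert P(S - A) == 1 - P(A)

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# De Morgan: the complement of a union is the intersection of the complements.
assert P(S - (A | B)) == P((S - A) & (S - B))
```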

There are some counting rules to find all possible outcomes in \(S\) quickly (a short numerical sketch follows this list):

  1. Multiplication principle:
     If an experiment consists of a sequence of operations \(O_1, O_2, \cdots, O_r\) with \(n_1, n_2, \cdots, n_r\) outcomes respectively, then the total number of possible outcomes is the product \(n_1 n_2 \cdots n_r\).

  2. Permutation:
     Sampling without replacement where the order of sampling matters. \[ \begin{align*} {}_n P_r &= n(n-1)(n-2)\cdots (n-r+1) \\\\ &= \frac{n!}{(n-r)!} \end{align*} \] (Multiplying the number of possible outcomes at each step until the \(r\)th one is made.)

  3. Combinations:
     Sampling without replacement where the order of sampling does not matter. \[ {}_n C_r = \binom{n}{r} = \frac{n!}{r!(n-r)!} = \frac{ {}_n P_r }{r!} \] Note: This is the binomial coefficient for the binomial expansion.

     We can generalize the binomial coefficient to \(k \geq 2\) groups so that there are \(r_i\) in group \(i \quad (1 \leq i \leq k)\) and \(r_1 + r_2 + \cdots + r_k = n\): \[ \frac{n!}{r_1 ! r_2 ! \cdots r_k !} \] This is called the multinomial coefficient.
     For example, if we deal 52 cards evenly among 4 players, giving each player 13 cards, the total number of different hands among the 4 players is calculated by \[ \frac{52!}{13!\,13!\,13!\,13!} \]
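
A minimal Python sketch of these counting rules, using the card-dealing example above and an assumed choice of \(n = 5\), \(r = 3\) for the permutation and combination formulas:

```python
import math

# Permutations vs. combinations: choose r = 3 out of n = 5 without replacement.
n, r = 5, 3
print(math.perm(n, r))  # 60 = 5!/(5-3)!   (order matters)
print(math.comb(n, r))  # 10 = 5!/(3!·2!)  (order does not matter)

# Multinomial coefficient: dealing 52 cards evenly to 4 players, 13 each.
deals = math.factorial(52) // math.factorial(13) ** 4
print(deals)  # ≈ 5.36e28 possible deals
```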

Conditional Probability

The conditional probability of an event \(A\) given an event \(B\) (with \(P(B) > 0\)) is defined as \[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{1} \] which can be rearranged as \[ P(A \cap B) = P(A \mid B) P(B). \tag{2} \] Also, from Equation (1), \[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)}. \] Using Equation (2), \[ P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}. \tag{3} \] This is known as Bayes' theorem or the inverse probability law. We will discuss it in a more useful form later.
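
Here is a small sketch of Equations (1) and (3) by direct counting. The two-dice setup and the events \(A\) (sum at least 10) and \(B\) (first die shows 6) are assumed examples, not from the text:

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair six-sided dice.
S = list(product(range(1, 7), repeat=2))

A = [s for s in S if s[0] + s[1] >= 10]  # A: the sum is at least 10
B = [s for s in S if s[0] == 6]          # B: the first die shows 6

def P(E):
    return Fraction(len(E), len(S))

A_and_B = [s for s in A if s in B]

# Definition (1): P(A | B) = P(A ∩ B) / P(B)
P_A_given_B = P(A_and_B) / P(B)
print(P_A_given_B)                # 1/2 (the second die must show 4, 5, or 6)

# Equation (3): P(B | A) = P(A | B) P(B) / P(A), i.e. Bayes' theorem
print(P_A_given_B * P(B) / P(A))  # 1/2, matching P(B ∩ A)/P(A) = (1/12)/(1/6)
```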

If \(A\) and \(B\) are mutually independent, \[ P(A \mid B) = P(A), \text{ and } P(B \mid A) = P(B). \] Thus \[ P(A \cap B) = P(A) P(B). \] In addition, in this case, \[ \begin{align*} P(A)P(\bar{B}) &= P(A)[1 - P(B)] \\\\ &= P(A) - P(A)P(B) \\\\ &= P(A) - P(A \cap B) \\\\ &= P(A \cap \bar{B}) \end{align*} \] Similarly, \( P(\bar{A})P(B) = P(\bar{A} \cap B)\) and \( P(\bar{A})P(\bar{B}) = P(\bar{A} \cap \bar{B})\).

Note: Two events are mutually independent if the occurrence of one event does not affect the probability of the occurrence of the other event. Be careful not to confuse it with mutually exclusive.
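
The sketch below checks the product rule for independent events and for their complements by counting. The two-dice events are assumed examples; independence holds because each event depends on a different die:

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice; events about different dice are independent.
S = list(product(range(1, 7), repeat=2))

A = [s for s in S if s[0] % 2 == 0]  # A: the first die is even
B = [s for s in S if s[1] >= 5]      # B: the second die shows 5 or 6

def P(E):
    return Fraction(len(E), len(S))

not_B = [s for s in S if s not in B]

# Independence: P(A ∩ B) = P(A) P(B), and the complement inherits it.
assert P([s for s in A if s in B]) == P(A) * P(B)          # 1/2 · 1/3 = 1/6
assert P([s for s in A if s in not_B]) == P(A) * P(not_B)  # 1/2 · 2/3 = 1/3
```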

Law of Total Probability

Theorem 1: Law of Total Probability Let the sample space \(S\) be decomposed into \(k\) mutually exclusive events \(B_1, B_2, \cdots, B_k\) with \(P(B_i) > 0\) for each \(i\). Then for any event \(A\), \[ \begin{align*} P(A) &= P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_k)P(B_k) \\\\ &= \sum_{i =1} ^k P(A \mid B_i)P(B_i) \end{align*} \]
Proof: Since the \(B_i\) decompose \(S\), the events \(A \cap B_i\) are mutually exclusive and their union is \(A\), so \[ \begin{align*} P(A) &= P[(A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k)] \\\\ &= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k) \\\\ &= P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_k)P(B_k) \\\\ &= \sum_{i =1} ^k P(A \mid B_i)P(B_i) \end{align*} \]
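
As a numerical sketch of Theorem 1, consider the classic (assumed) setup of three machines \(B_1, B_2, B_3\) producing items with different defect rates; the percentages below are illustrative only:

```python
from fractions import Fraction as F

# Assumed illustrative numbers: machines B1, B2, B3 produce 50%, 30%, and 20%
# of all items, with defect rates of 1%, 2%, and 3% respectively.
P_B = [F(50, 100), F(30, 100), F(20, 100)]        # P(B_i): a partition of S
P_A_given_B = [F(1, 100), F(2, 100), F(3, 100)]   # P(A | B_i), A = "item is defective"

# Law of Total Probability: P(A) = sum_i P(A | B_i) P(B_i)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(P_A)  # 17/1000, i.e. 1.7% of items are defective overall
```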

Bayes' Theorem

We revisit Equation (3): \[ P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}. \] \(P(B)\) is called the prior probability of \(B\) and \(P(B \mid A)\) is called the posterior probability of \(B\). This is the foundation of Bayesian statistics.

Here, using the Law of Total Probability, we can get a more general form of Bayes' Theorem.

Theorem 2: Bayes' Theorem Let mutually exclusive events \(B_1, B_2, \cdots, B_k\) form a partition of the sample space \(S\) with the condition that \(P(B_i) > 0\) for \(i = 1, 2, \cdots, k\). Then for \(j = 1, 2, \cdots, k\), \[ \begin{align*} P(B_j \mid A) &= \frac{P(A \mid B_j)P(B_j)}{\sum_{i=1}^k P(A \mid B_i)P(B_i)} \\\\ &= \frac{P(B_j \cap A)}{P(A)} \end{align*} \]
This is a powerful tool in machine learning tasks such as classification. For example, in medical diagnosis, \(B_1, B_2, \cdots, B_k\) can be the possible diseases and \(A\) the observed symptoms; then \(P(B_j \mid A)\) represents the probability of having disease \(B_j\) given the symptoms \(A\), as in the sketch below.
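
A minimal sketch of Theorem 2 for the diagnosis example. All prior prevalences and likelihoods below are assumed, illustrative numbers, not data from the text:

```python
from fractions import Fraction as F

# Assumed illustrative numbers: three candidate conditions B1, B2, B3 with
# prior prevalences, and the probability of observing the symptoms A under each.
priors = [F(1, 100), F(4, 100), F(95, 100)]        # P(B_j), summing to 1
likelihoods = [F(90, 100), F(50, 100), F(5, 100)]  # P(A | B_j)

# Denominator via the Law of Total Probability: P(A) = sum_j P(A | B_j) P(B_j)
P_A = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' Theorem: posterior P(B_j | A) for each candidate condition.
for j, (l, p) in enumerate(zip(likelihoods, priors), start=1):
    posterior = l * p / P_A
    print(f"P(B{j} | A) = {posterior} = {float(posterior):.3f}")
# The posteriors sum to 1 because the B_j partition the sample space.
```

Note how the rare condition \(B_1\) gains probability after observing \(A\) (its likelihood is high), yet the common condition \(B_3\) can still dominate through its prior; this interplay between prior and likelihood is exactly what the posterior captures.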