Point Estimators
In practice, it is often difficult or impossible to determine a population parameter \(\theta\) exactly. Instead, we estimate the unknown parameter by
computing a statistic from sample data. The statistic used to estimate the parameter is called a point estimator and is denoted
by \(\hat{\theta}\).
For example, to estimate the population mean \(\mu = \theta\), we compute a sample mean
\[
\bar{X} = \frac{1}{n}\sum_{i = 1}^n X_i = \hat{\theta}
\]
The point estimator is itself a random variable, since it is a function of the sampled random variables \(X_1, X_2, \cdots, X_n\).
Once we have an estimator, it is natural to ask how close it is to the true parameter. There are two
factors to consider:
- \(\text{Bias }(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta\)
The bias measures the average accuracy of the estimator.
- \(\text{Var }(\hat{\theta}) = \mathbb{E}[\hat{\theta} - \mathbb{E}(\hat{\theta})]^2\)
The variance measures the reliability (precision) of the estimator.
If both the bias and the variance are small, the estimator is an acceptable approximation of the population parameter.
Both factors are combined in the mean squared error (MSE):
\[
\begin{align*}
\text{MSE }(\hat{\theta}) &= \mathbb{E }(\hat{\theta} - \theta)^2 \\\\
&= \text{Var }(\hat{\theta}) + [\text{Bias }(\hat{\theta})]^2
\end{align*}
\]
This decomposition can be derived as follows:
\[
\begin{align*}
\text{MSE }(\hat{\theta}) &= \mathbb{E }(\hat{\theta} - \theta)^2 \\\\
&= \mathbb{E }\big\{[\hat{\theta} - \mathbb{E }(\hat{\theta})] + [\mathbb{E }(\hat{\theta}) - \theta]\big\}^2 \\\\
&= \mathbb{E }[\hat{\theta} - \mathbb{E }(\hat{\theta})]^2
+ 2 \mathbb{E }\big\{[\hat{\theta} - \mathbb{E }(\hat{\theta})][\mathbb{E }(\hat{\theta})- \theta]\big\}
+ [\mathbb{E }(\hat{\theta})- \theta]^2 \\\\
&= \mathbb{E }[\hat{\theta} - \mathbb{E }(\hat{\theta})]^2
+ 2 [\mathbb{E }(\hat{\theta}) - \mathbb{E }(\hat{\theta})][\mathbb{E }(\hat{\theta})- \theta]
+ [\mathbb{E }(\hat{\theta})- \theta]^2 \\\\
&= \text{Var }(\hat{\theta}) + 0 + [\text{Bias }(\hat{\theta})]^2
\end{align*}
\]
Note: The population parameter \(\theta\) is a fixed value. Thus,
\(\mathbb{E }[\mathbb{E }(\hat{\theta})- \theta] = \mathbb{E }(\hat{\theta})- \theta \) because
\(\mathbb{E }(\text{constant}) = \text{constant}\).
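As a quick numerical check of this decomposition, the minimal simulation sketch below uses the biased variance estimator \(\frac{1}{n}\sum_i (x_i - \bar{x})^2\) on normal samples; the sample size, true variance, and number of replications are made-up illustrative values.

```python
# Minimal sketch: check numerically that MSE ≈ Var + Bias^2 for a biased estimator.
# Assumed setup: normal samples with known variance sigma2, and the biased
# variance estimator (1/n) * sum((x - xbar)^2). All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_rep = 20, 4.0, 200_000            # sample size, true variance, replications

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_rep, n))
theta_hat = samples.var(axis=1, ddof=0)        # biased variance estimate from each sample

mse = np.mean((theta_hat - sigma2) ** 2)       # E[(theta_hat - theta)^2]
bias = theta_hat.mean() - sigma2               # E[theta_hat] - theta
var = theta_hat.var()                          # Var(theta_hat)

print(f"MSE          : {mse:.4f}")
print(f"Var + Bias^2 : {var + bias**2:.4f}")   # the two numbers should agree closely
```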
The MSE serves as a criterion for comparing estimators, enabling us to identify the most suitable one. Once an estimator
is selected, its precision in approximating the population parameter is typically assessed using the standard error (SE), which
represents the standard deviation of the estimator's sampling distribution.
For example, since \(\text{Var }(\bar{X}) = \frac{\sigma^2}{n}\), the standard error of the mean (SEM) is given by
\[
\text{SE }(\bar{X}) = \sqrt{\text{Var }(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\]
Likelihood Functions
Suppose the observations \(X_1, X_2, \cdots, X_n\) are i.i.d. random variables, and denote their observed values by
\(x_1, x_2, \cdots, x_n\), respectively. Then the joint p.d.f. or p.m.f. of \(X_1, X_2, \cdots, X_n\) is given by
\[
f(x_1, x_2, \cdots, x_n | \theta) = \prod_{i = 1}^n f(x_i|\theta)
\]
where \(\theta\) is an unknown parameter.
Regarded as a function of \(\theta\) for the observed values \(x_1, x_2, \cdots, x_n\), this is called the likelihood function and is denoted by
\(L(\theta | x_1, x_2, \cdots, x_n)\), or simply \(L(\theta)\):
\[
\underbrace{L(\theta | x_1, x_2, \cdots, x_n)}_{\text{After sampling}} = \underbrace{\prod_{i = 1}^n f(x_i|\theta)}_{\text{Before sampling}}
\]
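To make this concrete, the sketch below evaluates a likelihood over a grid of \(\theta\) values. The Bernoulli p.m.f. and the observed data are assumptions chosen purely for illustration.

```python
# Minimal sketch: evaluate L(theta | x_1, ..., x_n) for i.i.d. Bernoulli observations.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # made-up observed values x_1, ..., x_n

def likelihood(theta, x):
    """L(theta | x) = prod_i f(x_i | theta) with the Bernoulli p.m.f."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

grid = np.linspace(0.01, 0.99, 99)              # candidate values of theta
L = np.array([likelihood(t, x) for t in grid])

print(f"theta maximizing L on the grid: {grid[np.argmax(L)]:.2f}")  # close to the sample mean 0.7
```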
Maximum Likelihood Estimation
In machine learning, model fitting (or training) is the process of estimating unknown
parameters \(\pmb{\theta} = (\theta_1, \theta_2, \cdots, \theta_k) \) from sample data
\(\mathcal{D} = \{\mathbf{x_1}, \mathbf{x_2}, \cdots, \mathbf{x_n}\}\), which can be represented by an optimization problem of the form
\[
\pmb{\hat{\theta}} = \arg \min_{\pmb{\theta}} \mathcal{L} (\pmb{\theta})
\]
where \(\mathcal{L} (\pmb{\theta})\) is a loss function (or objective function).
The most common approach is maximum likelihood estimation (MLE), which corresponds to taking the loss \(\mathcal{L} (\pmb{\theta})\) to be the negative log-likelihood:
\[
\pmb{\hat{\theta}_{MLE}} = \arg \max_{\pmb{\theta}} L(\pmb{\theta})
\]
where \(L(\pmb{\theta}) \) is the likelihood function of \(\pmb{\theta}\) for the sample data \(\mathcal{D}\).
If \(L(\pmb{\theta})\) is a differentiable function of \(\pmb{\theta}\), then \(\pmb{\hat{\theta}_{MLE}}\) can be found by solving the
following equation:
\[
\begin{align*}
&\nabla_{\pmb{\theta}} \ln L(\pmb{\theta}) = \nabla_{\pmb{\theta}} \ln \prod_{i = 1}^n f(\mathbf{x_i}|\pmb{\theta}) = 0 \\\\
&\Longrightarrow \nabla_{\pmb{\theta}} \ln L(\pmb{\theta}) = \sum_{i=1}^ n \nabla_{\pmb{\theta}} \ln f(\mathbf{x_i}|\pmb{\theta}) = 0
\end{align*}
\]
Note: In practice, it is more efficient to work with the log-likelihood function because the product of densities becomes a sum, so we
compute additions instead of multiplications.
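When the gradient equation has no convenient closed form, the MLE can also be approximated numerically by minimizing the negative log-likelihood with a general-purpose optimizer. The sketch below is one possible setup, assuming scipy is available, a normal model, and simulated data; none of these choices come from the text above.

```python
# Minimal sketch: numerical MLE by minimizing the negative log-likelihood.
# Assumptions: a normal model, simulated data, and scipy as the optimizer.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # pretend these are the observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params                        # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")   # should be near 5 and 2
```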
Example 1: Binomial Distribution \(X \sim b(n, p) \)
Suppose we flip a coin \(n\) times and observe \(k\) heads. We assume \(P(\text{Head}) = \theta\) and
\(P(\text{Tail}) = 1 - \theta\), where \(\theta \in [0, 1]\). Then
\[
P(\mathcal{D} | \theta) = \theta^k (1-\theta)^{n-k}.
\]
To obtain \(\hat{\theta}_{MLE}\), we can solve:
\[
\frac{d}{d\theta}[\ln \theta^k (1-\theta)^{n-k}] = 0
\]
\[
\begin{align*}
&\Longrightarrow \frac{d}{d\theta}[ k\ln (\theta) + (n-k)\ln (1-\theta)] = 0 \\\\
&\Longrightarrow \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \\\\
&\Longrightarrow k(1 - \theta) - (n-k)\theta = 0 \\\\
&\Longrightarrow k - n\theta = 0
\end{align*}
\]
Therefore,
\[
\hat{\theta}_{MLE} = \frac{k}{n}.
\]
This is equivalent to the sample proportion \(\hat{p} = \frac{X}{n}\), which is used as the point estimator for
the population proportion \(p = \theta\).
Note:
\[
\begin{align*}
&\mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = \frac{1}{n}np = p \\\\
&\text{Var }(\hat{p}) = \frac{1}{n^2}\text{Var }[X] = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}
\end{align*}
\]
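The closed-form result \(\hat{\theta}_{MLE} = \frac{k}{n}\) can be sanity-checked against a grid search over the log-likelihood; the values of \(n\) and \(k\) below are made up for illustration.

```python
# Minimal sketch: grid search over the binomial log-likelihood vs. the closed form k/n.
import numpy as np

n, k = 50, 18                                   # illustrative: 50 flips, 18 heads
grid = np.linspace(0.001, 0.999, 999)           # candidate values of theta
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)

print(f"grid maximizer : {grid[np.argmax(log_lik)]:.3f}")
print(f"k / n          : {k / n:.3f}")          # both print 0.360
```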
Example 2: Normal Distribution
Suppose \(\mathcal{D} = \{x_1, x_2, \cdots, x_n\}\) is drawn from a normal distribution with p.d.f.
\[
f(x | \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \Big\{- \frac{(x - \mu)^2}{2\sigma^2}\Big\}.
\]
Its likelihood function is given by
\[
\begin{align*}
L(\mu, \sigma^2) &= \prod_{i=1}^n [ f(x_i | \mu, \sigma^2)] \\\\
&= \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big)^n \exp \Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \Big\}
\end{align*}
\]
The log-likelihood function is given by
\[
\begin{align*}
\ln L(\mu, \sigma^2) &= n \ln \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \tag{1}
\end{align*}
\]
Setting the partial derivative of (1) with respect to \(\mu\) equal to zero:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\\\
&\Longrightarrow \sum_{i=1}^n (x_i) - n\mu = 0
\end{align*}
\]
Thus,
\[
\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \tag{2}
\]
Similarly, setting the partial derivative of (1) with respect to \(\sigma^2\) equal to zero and substituting (2) into the equation:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial \sigma^2} = -\frac{n}{2\sigma^2}+ \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \bar{x})^2 = 0 \\\\
&\Longrightarrow -n \sigma^2 + \sum_{i=1}^n (x_i - \bar{x})^2 = 0
\end{align*}
\]
Thus,
\[
\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2
\]
Recall that this estimator of the variance is biased, since \(\mathbb{E}[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2\), so the unbiased sample variance \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\) is used instead.
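A short sketch comparing the two variance formulas on simulated data (the population mean, standard deviation, and sample size below are assumed values):

```python
# Minimal sketch: MLE estimates for a normal sample vs. the unbiased sample variance.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=3.0, size=40)     # simulated observations

mu_mle = x.mean()                               # mu_hat_MLE = xbar
var_mle = np.mean((x - mu_mle) ** 2)            # (1/n) * sum((x_i - xbar)^2), same as x.var(ddof=0)
s2 = x.var(ddof=1)                              # (1/(n-1)) * sum((x_i - xbar)^2)

print(f"mu_hat_MLE      : {mu_mle:.3f}")
print(f"sigma2_hat_MLE  : {var_mle:.3f}")
print(f"s^2 (unbiased)  : {s2:.3f}")
```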