Point Estimators
In practice, it is often difficult or impossible to determine a population parameter \(\theta\) exactly. Instead, we estimate the unknown parameter by
computing a statistic from sample data. The statistic used to estimate the parameter is called a point estimator and is denoted
by \(\hat{\theta}\).
For example, to estimate the population mean \(\mu = \theta\), we compute a sample mean
\[
\bar{X} = \frac{1}{n}\sum_{i = 1}^n X_i = \hat{\theta}
\]
The point estimator is itself a random variable, since it is a function of the sampled random variables \(X_1, X_2, \cdots, X_n\).
Once we have an estimator, it is natural to ask how close it is to the true parameter. There are two
factors to consider:
- \(\text{Bias }(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta\)
The bias measures the average accuracy of the estimator.
- \(\text{Var }(\hat{\theta}) = \mathbb{E}[\hat{\theta} - \mathbb{E}(\hat{\theta})]^2\)
The variance measures the reliability (precision) of the estimator.
If both the bias and the variance are small, the estimator is an acceptable approximation of the population parameter.
Both factors are combined in the mean squared error (MSE):
\[
\begin{align*}
\text{MSE }(\hat{\theta}) &= \mathbb{E }(\hat{\theta} - \theta)^2 \\\\
&= \text{Var }(\hat{\theta}) + [\text{Bias }(\hat{\theta})]^2
\end{align*}
\]
This decomposition can be derived as follows:
\[
\begin{align*}
\text{MSE }(\hat{\theta}) &= \mathbb{E }(\hat{\theta} - \theta)^2 \\\\
&= \mathbb{E }\big\{[\hat{\theta} - \mathbb{E }(\hat{\theta})] + [\mathbb{E }(\hat{\theta}) - \theta]\big\}^2 \\\\
&= \mathbb{E }[\hat{\theta} - \mathbb{E }(\hat{\theta})]^2
+ 2 \mathbb{E }\big\{[\hat{\theta} - \mathbb{E }(\hat{\theta})][\mathbb{E }(\hat{\theta})- \theta]\big\}
+ [\mathbb{E }(\hat{\theta})- \theta]^2 \\\\
&= \mathbb{E }[\hat{\theta} - \mathbb{E }(\hat{\theta})]^2
+ 2 [\mathbb{E }(\hat{\theta}) - \mathbb{E }(\hat{\theta})][\mathbb{E }(\hat{\theta})- \theta]
+ [\mathbb{E }(\hat{\theta})- \theta]^2 \\\\
&= \text{Var }(\hat{\theta}) + 0 + [\text{Bias }(\hat{\theta})]^2
\end{align*}
\]
Note: The population parameter \(\theta\) is a fixed value. Thus,
\(\mathbb{E }[\mathbb{E }(\hat{\theta})- \theta] = \mathbb{E }(\hat{\theta})- \theta \) because
\(\mathbb{E }(\text{constant}) = \text{constant}\).
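As a quick numerical check of this decomposition, the minimal simulation sketch below uses the biased variance estimator \(\frac{1}{n}\sum_i (x_i - \bar{x})^2\) on normal samples; the sample size, true variance, and number of replications are made-up illustrative values.

```python
# Minimal sketch: check numerically that MSE ≈ Var + Bias^2 for a biased estimator.
# Assumed setup: normal samples with known variance sigma2, and the biased
# variance estimator (1/n) * sum((x - xbar)^2). All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_rep = 20, 4.0, 200_000            # sample size, true variance, replications

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_rep, n))
theta_hat = samples.var(axis=1, ddof=0)        # biased variance estimate from each sample

mse = np.mean((theta_hat - sigma2) ** 2)       # E[(theta_hat - theta)^2]
bias = theta_hat.mean() - sigma2               # E[theta_hat] - theta
var = theta_hat.var()                          # Var(theta_hat)

print(f"MSE          : {mse:.4f}")
print(f"Var + Bias^2 : {var + bias**2:.4f}")   # the two numbers should agree closely
```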
The MSE serves as a criterion for comparing estimators, enabling us to identify the most suitable one. Once an estimator
is selected, its precision in approximating the population parameter is typically assessed using the standard error (SE), which
represents the standard deviation of the estimator's sampling distribution.
For example, since \(\text{Var }(\bar{X}) = \frac{\sigma^2}{n}\), the standard error of the mean (SEM) is given by
\[
\text{SE }(\bar{X}) = \sqrt{\text{Var }(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\]
Likelihood Functions
Suppose the observations \(X_1, X_2, \cdots, X_n\) are i.i.d. random variables, and denote their observed values by
\(x_1, x_2, \cdots, x_n\), respectively. Then the joint p.d.f. or p.m.f. of \(X_1, X_2, \cdots, X_n\) is given by
\[
f(x_1, x_2, \cdots, x_n | \theta) = \prod_{i = 1}^n f(x_i|\theta)
\]
where \(\theta\) is an unknown parameter.
Regarded as a function of \(\theta\) for the observed values \(x_1, x_2, \cdots, x_n\), this is called the likelihood function and is denoted by
\(L(\theta | x_1, x_2, \cdots, x_n)\), or simply \(L(\theta)\):
\[
\underbrace{L(\theta | x_1, x_2, \cdots, x_n)}_{\text{After sampling}} = \underbrace{\prod_{i = 1}^n f(x_i|\theta)}_{\text{Before sampling}}
\]
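To make this concrete, the sketch below evaluates a likelihood over a grid of \(\theta\) values. The Bernoulli p.m.f. and the observed data are assumptions chosen purely for illustration.

```python
# Minimal sketch: evaluate L(theta | x_1, ..., x_n) for i.i.d. Bernoulli observations.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # made-up observed values x_1, ..., x_n

def likelihood(theta, x):
    """L(theta | x) = prod_i f(x_i | theta) with the Bernoulli p.m.f."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

grid = np.linspace(0.01, 0.99, 99)              # candidate values of theta
L = np.array([likelihood(t, x) for t in grid])

print(f"theta maximizing L on the grid: {grid[np.argmax(L)]:.2f}")  # close to the sample mean 0.7
```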
Maximum Likelihood Estimation
In machine learning, model fitting (or training) is the process of estimating unknown
parameters \(\pmb{\theta} = (\theta_1, \theta_2, \cdots, \theta_k) \) from sample data
\(\mathcal{D} = \{\mathbf{x_1}, \mathbf{x_2}, \cdots, \mathbf{x_n}\}\), which can be represented by an optimization problem of the form
\[
\pmb{\hat{\theta}} = \arg \min_{\pmb{\theta}} \mathcal{L} (\pmb{\theta})
\]
where \(\mathcal{L} (\pmb{\theta})\) is a loss function (or objective function).
The most common approach is maximum likelihood estimation (MLE), which corresponds to taking the loss \(\mathcal{L} (\pmb{\theta})\) to be the negative log-likelihood:
\[
\pmb{\hat{\theta}_{MLE}} = \arg \max_{\pmb{\theta}} L(\pmb{\theta})
\]
where \(L(\pmb{\theta}) \) is the likelihood function of \(\pmb{\theta}\) for the sample data \(\mathcal{D}\).
If \(L(\pmb{\theta})\) is a differentiable function of \(\pmb{\theta}\), then \(\pmb{\hat{\theta}_{MLE}}\) can be found by solving the
following equation:
\[
\begin{align*}
&\nabla_{\pmb{\theta}} \ln L(\pmb{\theta}) = \nabla_{\pmb{\theta}} \ln \prod_{i = 1}^n f(\mathbf{x_i}|\pmb{\theta}) = 0 \\\\
&\Longrightarrow \nabla_{\pmb{\theta}} \ln L(\pmb{\theta}) = \sum_{i=1}^ n \nabla_{\pmb{\theta}} \ln f(\mathbf{x_i}|\pmb{\theta}) = 0
\end{align*}
\]
Note: In practice, it is more efficient to work with the log-likelihood function because the product of densities becomes a sum, so we
compute additions instead of multiplications.
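When the gradient equation has no convenient closed form, the MLE can also be approximated numerically by minimizing the negative log-likelihood with a general-purpose optimizer. The sketch below is one possible setup, assuming scipy is available, a normal model, and simulated data; none of these choices come from the text above.

```python
# Minimal sketch: numerical MLE by minimizing the negative log-likelihood.
# Assumptions: a normal model, simulated data, and scipy as the optimizer.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # pretend these are the observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params                        # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")   # should be near 5 and 2
```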
Example 1: Binomial Distribution \(X \sim b(n, p) \)
Suppose we flip a coin \(n\) times and observe \(k\) heads. We assume \(P(\text{Head}) = \theta\) and
\(P(\text{Tail}) = 1 - \theta\), where \(\theta \in [0, 1]\). Then
\[
P(\mathcal{D} | \theta) = \theta^k (1-\theta)^{n-k}.
\]
To obtain \(\hat{\theta}_{MLE}\), we can solve:
\[
\frac{d}{d\theta}[\ln \theta^k (1-\theta)^{n-k}] = 0
\]
\[
\begin{align*}
&\Longrightarrow \frac{d}{d\theta}[ k\ln (\theta) + (n-k)\ln (1-\theta)] = 0 \\\\
&\Longrightarrow \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \\\\
&\Longrightarrow k(1 - \theta) - (n-k)\theta = 0 \\\\
&\Longrightarrow k - n\theta = 0
\end{align*}
\]
Therefore,
\[
\hat{\theta}_{MLE} = \frac{k}{n}.
\]
This is equivalent to the sample proportion \(\hat{p} = \frac{X}{n}\), which is used as the point estimator for
the population proportion \(p = \theta\).
Note:
\[
\begin{align*}
&\mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = \frac{1}{n}np = p \\\\
&\text{Var }(\hat{p}) = \frac{1}{n^2}\text{Var }[X] = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}
\end{align*}
\]
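The closed-form result \(\hat{\theta}_{MLE} = \frac{k}{n}\) can be sanity-checked against a grid search over the log-likelihood; the values of \(n\) and \(k\) below are made up for illustration.

```python
# Minimal sketch: grid search over the binomial log-likelihood vs. the closed form k/n.
import numpy as np

n, k = 50, 18                                   # illustrative: 50 flips, 18 heads
grid = np.linspace(0.001, 0.999, 999)           # candidate values of theta
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)

print(f"grid maximizer : {grid[np.argmax(log_lik)]:.3f}")
print(f"k / n          : {k / n:.3f}")          # both print 0.360
```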
Example 2: Normal Distribution
Suppose \(\mathcal{D} = \{x_1, x_2, \cdots, x_n\}\) is drawn from a normal distribution with p.d.f.
\[
f(x | \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \Big\{- \frac{(x - \mu)^2}{2\sigma^2}\Big\}.
\]
Its likelihood function is given by
\[
\begin{align*}
L(\mu, \sigma^2) &= \prod_{i=1}^n [ f(x_i | \mu, \sigma^2)] \\\\
&= \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big)^n \exp \Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \Big\}
\end{align*}
\]
The log-likelihood function is given by
\[
\begin{align*}
\ln L(\mu, \sigma^2) &= n \ln \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \tag{1}
\end{align*}
\]
Setting the partial derivative of (1) with respect to \(\mu\) equal to zero:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\\\
&\Longrightarrow \sum_{i=1}^n (x_i) - n\mu = 0
\end{align*}
\]
Thus,
\[
\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \tag{2}
\]
Similarly, setting the partial derivative of (1) with respect to \(\sigma^2\) equal to zero and substituting (2) into the equation:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial \sigma^2} = -\frac{n}{2\sigma^2}+ \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \bar{x})^2 = 0 \\\\
&\Longrightarrow -n \sigma^2 + \sum_{i=1}^n (x_i - \bar{x})^2 = 0
\end{align*}
\]
Thus,
\[
\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2
\]
Recall that this estimator of the variance is biased, since \(\mathbb{E}[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2\), so the unbiased sample variance \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\) is used instead.
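A short sketch comparing the two variance formulas on simulated data (the population mean, standard deviation, and sample size below are assumed values):

```python
# Minimal sketch: MLE estimates for a normal sample vs. the unbiased sample variance.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=3.0, size=40)     # simulated observations

mu_mle = x.mean()                               # mu_hat_MLE = xbar
var_mle = np.mean((x - mu_mle) ** 2)            # (1/n) * sum((x_i - xbar)^2), same as x.var(ddof=0)
s2 = x.var(ddof=1)                              # (1/(n-1)) * sum((x_i - xbar)^2)

print(f"mu_hat_MLE      : {mu_mle:.3f}")
print(f"sigma2_hat_MLE  : {var_mle:.3f}")
print(f"s^2 (unbiased)  : {s2:.3f}")
```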