Binary Logistic Regression
Consider a discriminative classification model: \[ p(y \mid x, \theta) \] where \(x \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \cdots, C\}\) is the class label, and \(\theta\) are the parameters.
Let \(y \in \{0, 1\}\). Then the model is known as binary logistic regression, which can be represented by: \[ p(y \mid x, \theta) = \text{Ber }(y \mid \sigma(w^\top x + b)) \] where \(\sigma\) is the sigmoid (logistic) function, \(w \in \mathbb{R}^D\) is the weight vector, \(b \in \mathbb{R}\) is a bias, and \(\theta = (w, b)\) are the model parameters. The notation \(\text{Ber}(y \mid p)\) denotes the Bernoulli distribution, defined as: \[ \text{Ber}(y \mid p) = p^y (1 - p)^{1 - y}, \quad y \in \{0, 1\},\; p \in [0, 1]. \]
In other words, \[ p(y = 1 \mid x, \theta) = \sigma(a) = \frac{1}{1 + e^{-a}} \] where \(a\) is called the logit (or pre-activation): \[ a = w^\top x + b = \log \left(\frac{p}{1-p}\right), \quad \text{with } p = p(y = 1 \mid x, \theta). \]
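As a concrete sketch (the particular values of `w`, `b`, and `x` below are arbitrary illustrations, not taken from the text), the probability of class 1 is obtained by passing the logit through the sigmoid:

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)); for very negative a this can overflow,
    # in which case scipy.special.expit is a more robust alternative.
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameters and input (D = 3).
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.5, -0.25])

a = w @ x + b      # logit (pre-activation): -0.4
p = sigmoid(a)     # p(y = 1 | x, theta): ~0.40
print(a, p)
```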
The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule that predicts \(y = 1\) if and only if class 1 is more likely than class 0. That is, \[ \begin{align*} f(x) &= \mathbb{I}(p(y=1 \mid x) > p(y=0 \mid x)) \\\\ &= \mathbb{I} \left( \log \frac{p(y = 1 \mid x)}{p(y =0 \mid x)} > 0 \right) \\\\ &= \mathbb{I}(a > 0). \end{align*} \] Here, \(\mathbb{I}(\cdot)\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{I}(\text{condition}) = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]
Therefore, the decision function is linear in \(x\): \[ f(x ; \theta) = w^\top x + b. \] Since \(w^\top x\) is the inner product between the weight vector \(w\) and the feature vector \(x\), the decision boundary \(\{x : w^\top x + b = 0\}\) is a hyperplane with normal vector \(w\) and an offset from the origin determined by \(b\). In other words, the weight vector \(w\) determines the "orientation" of the decision boundary, and the bias term \(b\) shifts it. Moreover, the norm \(\|w\|\) controls the "steepness" of the sigmoid: a larger norm yields a sharper transition between predicted classes, which reflects greater confidence in predictions.
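A minimal NumPy sketch of this decision rule, assuming a design matrix `X` whose rows are inputs (all names and numbers are illustrative):

```python
import numpy as np

def predict(X, w, b):
    # Predict y_hat = 1 iff the logit w^T x + b is positive (the decision rule above).
    logits = X @ w + b
    return (logits > 0).astype(int)

# Hypothetical 2-D example: w is the normal vector of the boundary x1 - x2 + b = 0.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 1.0],    # logit =  1.0 -> class 1
              [1.0, 2.0]])   # logit = -1.0 -> class 0
print(predict(X, w, b))      # [1 0]
```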
To estimate the parameters, we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N\); for notational simplicity, we absorb the bias \(b\) into \(w\) by appending a constant 1 to each \(x_n\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL }(w) &= - \frac{1}{N} \log p(\mathcal{D} \mid w) \\\\ &= - \frac{1}{N} \log \prod_{n =1}^N \text{Ber }(y_n \mid \mu_n) \\\\ &= - \frac{1}{N} \sum_{n=1}^N \log \left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\\\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log (1-\mu_n) \right] \\\\ &= \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce} (y_n , \mu_n) \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(w^\top x_n)\) is the model's predicted probability of class 1, and \(\mathbb{H}_{ce}(y_n, \mu_n) = -[y_n \log \mu_n + (1-y_n)\log(1-\mu_n)]\) is the binary cross entropy between \(y_n\) and \(\mu_n\).
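The NLL translates directly into a short NumPy sketch (the function and argument names are my own, and the `eps` clipping is a practical guard against \(\log 0\), not part of the derivation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y, eps=1e-12):
    # Average negative log-likelihood (binary cross entropy), bias absorbed into w.
    mu = sigmoid(X @ w)                 # mu_n = sigma(w^T x_n)
    mu = np.clip(mu, eps, 1.0 - eps)    # guard against log(0)
    return -np.mean(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
```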
We then find the maximum likelihood estimate (MLE) by solving \[ \nabla_w \text{NLL }(w) = 0 \] via gradient-based optimization (in practice, the gradient is often computed with automatic differentiation).
The gradient of the NLL with respect to \(w\) is \[ \begin{align*} \nabla_w \text{NLL }(w) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) x_n - (1-y_n) \mu_n x_n \right] \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) x_n. \end{align*} \] Equivalently, in matrix form: \[ \nabla_w \text{NLL }(w) = \frac{1}{N} \left( \mathbf{1}_N^\top (\text{diag }(\mu - y)X)\right)^\top \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(x_n^\top\), and \(\mu, y\in\mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. Written this way, it is clear that each \(x_n\) is weighted by its residual \(\mu_n - y_n\).
In practice, we never build an \(N \times N\) diagonal matrix. Instead, we simply compute \[ \nabla_w \text{NLL }(w) = \frac{1}{N} X^\top (\mu - y) \] which is both more memory- and compute-efficient.
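A sketch of this efficient form, assuming `X` has shape (N, D) and `y` has shape (N,) (names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad(w, X, y):
    # Gradient (1/N) X^T (mu - y): residuals weight the rows of X; no diagonal matrix is built.
    mu = sigmoid(X @ w)
    return X.T @ (mu - y) / X.shape[0]
```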
Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\\\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\\\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\\\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\\\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]
Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]
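This identity is easy to sanity-check numerically with a centered finite difference (the point \(a = 0.7\) and the step size are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.7, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # centered finite difference
analytic = sigmoid(a) * (1.0 - sigmoid(a))              # sigma(a) (1 - sigma(a))
print(abs(numeric - analytic))                          # tiny, on the order of 1e-11
```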
For a single data point \((x_n,y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\\\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\\\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = w^\top x_n + b\), \[ \frac{\partial a_n}{\partial w} = x_n. \] Combining these, \[ \frac{\partial \ell_n}{\partial w} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial w} = (\mu_n - y_n)x_n. \] Finally, summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_w \mathrm{NLL}(w) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,x_n. \]
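The same result can be verified numerically by comparing the closed-form gradient with finite differences of the NLL on synthetic data (all data and names below are illustrative, not from the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y, eps=1e-12):
    mu = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.mean(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

# Synthetic data, purely for checking the formula.
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(float)
w = rng.normal(size=D)

analytic = X.T @ (sigmoid(X @ w) - y) / N      # (1/N) sum_n (mu_n - y_n) x_n
h = 1e-6
numeric = np.array([
    (nll(w + h * np.eye(D)[j], X, y) - nll(w - h * np.eye(D)[j], X, y)) / (2 * h)
    for j in range(D)
])
print(np.max(np.abs(analytic - numeric)))      # should be very small (~1e-9)
```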
To verify convexity, we examine the Hessian: \[ \begin{align*} H(w) &=\nabla_w^2 \text{NLL }(w) \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n( 1 - \mu_n)x_n) x_n^\top \\\\ &= \frac{1}{N} X^\top S X \end{align*} \] where \(S = \text{diag }(\mu_1(1-\mu_1), \cdots, \mu_N(1-\mu_N))\).
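A sketch of the Hessian computation; like the gradient, it avoids materializing the full \(N \times N\) matrix \(S\) (names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_hessian(w, X):
    # Hessian (1/N) X^T S X with S = diag(mu_n (1 - mu_n)); only the diagonal is formed.
    mu = sigmoid(X @ w)
    s = mu * (1.0 - mu)                        # diagonal of S, shape (N,)
    return (X * s[:, None]).T @ X / X.shape[0]

# e.g. np.linalg.eigvalsh(nll_hessian(w, X)) for some w, X shows nonnegative eigenvalues.
```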
For any \(v \in \mathbb{R}^D\), \[ v^\top X^\top S X v = (S^{\frac{1}{2}} X v)^\top (S^{\frac{1}{2}} X v) = \| S^{\frac{1}{2}} X v \|_2^2 \ge 0. \] Thus the Hessian \(H(w)\) is positive semidefinite, so the NLL is convex and any stationary point is a global optimum. Moreover, since \(\mu_n(1-\mu_n) > 0\) for all \(n\), the Hessian is positive definite, and the NLL strictly convex, whenever \(X\) has full column rank.
Note: In practice, \(\mu_n\) can be very close to 0 or 1, in which case the corresponding entries \(\mu_n(1-\mu_n)\) of \(S\) are nearly zero and the Hessian can become singular or badly conditioned. Adding \(\ell_2\) regularization, which contributes \(\lambda I\) to the Hessian, keeps it nonsingular.
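As a rough end-to-end sketch, \(\ell_2\) regularization adds \(\frac{\lambda}{2}\|w\|^2\) to the objective, i.e. \(\lambda w\) to the gradient and \(\lambda I\) to the Hessian; the learning rate, \(\lambda\), and iteration count below are arbitrary choices, not recommendations from the text:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logreg(X, y, lam=1e-2, lr=0.1, n_iters=5000):
    # Gradient descent on the l2-regularized objective NLL(w) + (lam / 2) * ||w||^2.
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        grad = X.T @ (mu - y) / N + lam * w   # regularization adds lam * w to the gradient
        w -= lr * grad                        # (and lam * I to the Hessian)
    return w
```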