Intro to Classification


Binary Logistic Regression

Consider a discriminative classification model: \[ p(y \mid x, \theta) \] where \(x \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \cdots, C\}\) is the class label, and \(\theta\) are the parameters.

Let \(y \in \{0, 1\}\). Then the model is known as binary logistic regression, which can be represented by: \[ p(y \mid x, \theta) = \text{Ber }(y \mid \sigma(w^\top x + b)) \] where \(\sigma\) is the sigmoid (logistic) function, \(w \in \mathbb{R}^D\) is the weight vector, \(b \in \mathbb{R}\) is the bias, and \(\theta = (w, b)\) are the model parameters. The notation \(\text{Ber}(y \mid p)\) denotes the Bernoulli distribution, defined as: \[ \text{Ber}(y \mid p) = p^y (1 - p)^{1 - y}, \quad y \in \{0, 1\},\; p \in [0, 1]. \]

In other words, \[ p(y = 1 \mid x, \theta) = \sigma(a) = \frac{1}{1 + e^{-a}} \] where \(a\) is called the logit (or pre-activation): \[ a = w^\top x + b = \log \left(\frac{p}{1-p}\right), \quad \text{with } p = p(y = 1 \mid x, \theta). \]

The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule such that we predict \(y = 1\) if and only if class 1 is more likely than class 0. That is, \[ \begin{align*} f(x) &= \mathbb{I}(p(y=1 \mid x) > p(y=0 \mid x)) \\\\ &= \mathbb{I} \left( \log \frac{p(y = 1 \mid x)}{p(y =0 \mid x)} > 0 \right) \\\\ &= \mathbb{I}(a > 0), \end{align*} \] since the log-odds is exactly the logit \(a\). Here, \(\mathbb{I}(\cdot)\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{I}(\text{condition}) = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]
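As a minimal sketch of this decision rule (NumPy, with purely illustrative values for \(w\), \(b\), and \(x\)), predicting class 1 amounts to thresholding the logit at zero:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters and input (D = 2)
w = np.array([1.5, -0.7])
b = 0.2
x = np.array([0.4, 1.1])

a = w @ x + b          # logit (pre-activation)
p1 = sigmoid(a)        # p(y = 1 | x, theta)
y_hat = int(a > 0)     # same as int(p1 > 0.5)
print(p1, y_hat)
```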

Therefore, the decision rule reduces to thresholding the linear score \[ f(x ; \theta) = w^\top x + b \] at zero. Since \(w^\top x\) is the inner product between the weight vector \(w\) and the feature vector \(x\), the decision boundary \(f(x;\theta) = 0\) is a hyperplane with normal vector \(w\) and offset \(b\) from the origin. In other words, the weight vector \(w\) determines the "orientation" of the decision boundary, and the bias term \(b\) shifts the boundary. Moreover, the norm \(\|w\|\) controls the "steepness" of the sigmoid function: a larger norm results in a sharper transition between predicted classes, which reflects greater confidence in predictions.

To estimate the parameters \(w\), we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL }(w) &= - \frac{1}{N} \log p(\mathcal{D} \mid w) \\\\ &= - \frac{1}{N} \log \prod_{n =1}^N \text{Ber }(y_n \mid \mu_n) \\\\ &= - \frac{1}{N} \sum_{n=1}^N \log \left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\\\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log (1-\mu_n) \right] \\\\ &= \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce} (y_n , \mu_n) \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(w^\top x_n)\) is the model's predicted probability of class 1 (absorbing the bias \(b\) into \(w\) by appending a constant 1 to each \(x_n\)), and \(\mathbb{H}_{ce}(y_n, \mu_n)\) is the binary cross entropy between \(y_n\) and \(\mu_n\).
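For concreteness, here is a minimal sketch of this NLL in NumPy; the bias is assumed absorbed into \(w\) via a constant feature, and the clipping constant is only an assumption for numerical safety:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y, eps=1e-12):
    """Average binary cross-entropy between labels y and predictions mu = sigmoid(Xw)."""
    mu = np.clip(sigmoid(X @ w), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# With w = 0 every mu_n = 0.5, so the NLL equals log 2 regardless of the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
print(nll(np.zeros(3), X, y))   # ≈ 0.6931
```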

We then find the maximum likelihood estimate (MLE) by solving: \[ \nabla_w \text{NLL }(w) = 0. \] This equation has no closed-form solution, so we solve it via gradient-based optimization (in practice, computing gradients with automatic differentiation).

The gradient of the NLL with respect to \(w\) is \[ \begin{align*} \nabla_w \text{NLL }(w) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) x_n - (1-y_n) \mu_n x_n \right] \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) x_n. \end{align*} \] Equivalently, in matrix form: \[ \nabla_w \text{NLL }(w) = \frac{1}{N} \left( \mathbf{1}_N^\top (\text{diag }(\mu - y)X)\right)^\top, \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(x_n^\top\), and \(\mu, y\in\mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. In this form it is clear that each \(x_n\) is weighted by its residual \(\mu_n - y_n\).

In practice, we never build an \(N \times N\) diagonal matrix. Instead, we simply compute \[ \nabla_w \text{NLL }(w) = \frac{1}{N} X^\top (\mu - y), \] which is both more memory- and compute-efficient.
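A sketch of that vectorized computation, under the same conventions as the NLL sketch above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad(w, X, y):
    """Gradient of the average NLL: (1/N) X^T (mu - y), no diagonal matrix needed."""
    mu = sigmoid(X @ w)                 # predicted probabilities, shape (N,)
    return X.T @ (mu - y) / X.shape[0]
```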

Derivation of the gradient of \(\text{NLL }(w)\):

Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\\\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\\\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\\\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\\\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]

Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]

For a single data point \((x_n,y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\\\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\\\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = w^\top x_n + b\), \[ \frac{\partial a_n}{\partial w} = x_n. \] Now combine all of them: \[ \frac{\partial \ell_n}{\partial w} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial w} = (\mu_n - y_n)x_n. \] Finally, summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_w \mathrm{NLL}(w) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,x_n. \]
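A quick numerical sanity check of this result compares the analytic gradient with central finite differences. Below is a self-contained sketch on synthetic data; all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    mu = sigmoid(X @ w)
    return -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4)

eps = 1e-6
num_grad = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(4)])
print(np.max(np.abs(num_grad - grad(w, X, y))))  # agreement to ~1e-9 confirms the formula
```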

To verify convexity, we examine the Hessian: \[ \begin{align*} H(w) &=\nabla_w^2 \text{NLL }(w) \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n( 1 - \mu_n)x_n) x_n^\top \\\\ &= \frac{1}{N} X^\top S X \end{align*} \] where \(S = \text{diag }(\mu_1(1-\mu_1), \cdots, \mu_N(1-\mu_N))\).

For any \(v \in \mathbb{R}^D\), \[ v^\top X^\top S X v = (S^{\frac{1}{2}} X v)^\top (S^{\frac{1}{2}} X v) = \| S^{\frac{1}{2}} X v \|_2^2 \ge 0, \] so the Hessian \(H(w)\) is positive semi-definite and the NLL is convex. If \(X\) has full column rank (and each \(\mu_n \in (0,1)\), so \(S\) is positive definite), the inequality is strict for \(v \neq 0\); then \(H(w)\) is positive definite, the NLL is strictly convex, and any stationary point is the global optimum.
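A sketch of the Hessian computation and a numerical check of positive definiteness (NumPy, synthetic data; with random Gaussian inputs \(X\) has full column rank almost surely):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_hessian(w, X):
    """H = (1/N) X^T S X with S = diag(mu_n (1 - mu_n))."""
    mu = sigmoid(X @ w)
    s = mu * (1 - mu)                      # diagonal of S, never materialized as a matrix
    return (X * s[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
w = rng.normal(size=5)
H = nll_hessian(w, X)
print(np.linalg.eigvalsh(H).min())         # > 0 when X has full column rank
```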

Note: In practice, some \(\mu_n\) can be very close to 0 or 1, which makes \(S\) (and hence the Hessian) nearly singular. Adding \(\ell_2\) regularization keeps the Hessian nonsingular and well conditioned.

Binary Logistic Regression Demo

Given a linear scoring function: \[ z = w_0 + w_1 x_1 + w_2 x_2 \] and the predicted probability: \[ \sigma(z) = \frac{1}{1 + e^{-z}}, \] we minimize the average binary cross-entropy plus an \(\ell_2\) penalty on the weights (excluding the bias): \[ \mathrm{NLL }(w) = - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \sigma(z_n) + (1-y_n) \log (1-\sigma(z_n)) \right] + \frac{\lambda}{2} \sum_{j=1}^2 w_j^2. \] Note: The learning rate controls the step size in each gradient descent update.
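A minimal sketch of the demo's training loop (NumPy; the data are synthetic, and the learning rate, \(\lambda\), and iteration count are illustrative choices, not values from the demo). The weights \(w_1, w_2\) are penalized while the bias \(w_0\) is not:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))                                 # columns are x1, x2
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=N) > 0).astype(float)

w = np.zeros(2)      # [w1, w2]
b = 0.0              # bias w0, excluded from the l2 penalty
lr, lam = 0.5, 0.01  # learning rate and regularization strength

for _ in range(500):
    mu = sigmoid(X @ w + b)
    grad_w = X.T @ (mu - y) / N + lam * w   # gradient of penalized NLL w.r.t. weights
    grad_b = np.mean(mu - y)                # gradient w.r.t. bias
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```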

Kernel Methods

In many problems the original feature vectors \(x\in\mathbb{R}^D\) are not linearly separable. One fix is to apply a nonlinear feature map \[ \phi\colon\mathbb{R}^D\to\mathbb{R}^P,\quad P\gg D, \] hoping that the transformed points \(\{\phi(x_n)\}\) become separable. But explicit high-dimensional mappings can be expensive to compute and blow up model complexity, leading to overfitting even with regularization.

Alternatively, if a learning algorithm depends only on inner products \(\langle \phi(x),\phi(x')\rangle\), we can invoke the kernel trick: replace each dot-product with a kernel function \(K(x,x')\) that computes \(\langle\phi(x),\phi(x')\rangle\) implicitly—no \(\phi\) ever constructed.

Kernel: A function \[ K : \mathbb{R}^D \times \mathbb{R}^D \;\to\; \mathbb{R} \] is a kernel for some feature map \(\phi\) if \[ \forall\,x,x',\quad K(x,x') = \langle \phi(x),\,\phi(x')\rangle. \] In practice \(K\) must also be symmetric and positive semi-definite (Mercer's condition).

Concretely, we build the kernel matrix \[ K_{ij} = K(x_i,x_j),\quad i,j=1,\dots,N, \] and run our algorithm using \(\{K_{ij}\}\) instead of \(\{\phi(x_i)\}\).

For large \(N\) (e.g. \(>10^5\)), storing the \(N\times N\) kernel matrix is infeasible. Two common remedies are low-rank approximations (e.g., the Nyström method) and random feature expansions such as the random Fourier features described next.

RBF Kernel & Random Fourier Features

Among the many valid kernel functions, one of the most widely used is the RBF (Gaussian) kernel: \[ K_{\mathrm{RBF}}(x,x') = \exp\Bigl(-\tfrac{\|x-x'\|^2}{2\sigma^2}\Bigr), \] which corresponds to an infinite-dimensional feature map.
Note: RBF stands for "Radial Basis Function."
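As a sketch, the full \(N \times N\) kernel (Gram) matrix from the previous section can be computed with pairwise squared distances; the bandwidth \(\sigma\) and the data here are illustrative:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for all pairs of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0)     # guard against small negative round-off
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel_matrix(X)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > -1e-10))  # symmetric, PSD
```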

By Bochner's theorem, any continuous shift-invariant positive-definite kernel is the Fourier transform of a non-negative measure; for the RBF kernel this yields \[ K_{\mathrm{RBF}}(x,x') = \mathbb{E}_{\omega\sim\mathcal{N}(0,\sigma^{-2}I)}\bigl[\cos(\omega^\top(x-x'))\bigr]. \]

To approximate this expectation, sample \(D'\) frequencies \(\omega_k\sim\mathcal{N}(0,\sigma^{-2}I)\) and phases \(b_k\sim\mathrm{Uniform}[0,2\pi]\), then define \[ \phi_{\mathrm{RFF}}(x) = \sqrt{\frac{2}{D'}}\, \bigl[\cos(\omega_1^\top x + b_1),\,\dots,\,\cos(\omega_{D'}^\top x + b_{D'})\bigr]\in\mathbb{R}^{D'}. \]

One can show that \[ \phi_{\mathrm{RFF}}(x)^\top \phi_{\mathrm{RFF}}(x') \;\approx\; K_{\mathrm{RBF}}(x,x'), \] so we can train a regularized linear model on these \(D'\)-dimensional features in \(O(ND')\) time, avoiding the \(O(N^2)\) kernel matrix.
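A sketch of the RFF approximation (NumPy; \(D'\), \(\sigma\), and the data sizes are illustrative), comparing the approximate inner products against the exact RBF kernel:

```python
import numpy as np

def rff_features(X, n_features=500, sigma=1.0, seed=0):
    """Random Fourier features whose inner products approximate the RBF kernel."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(D, n_features))  # omega ~ N(0, sigma^{-2} I)
    b = rng.uniform(0, 2 * np.pi, size=n_features)               # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ omega + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Phi = rff_features(X)
K_approx = Phi @ Phi.T
sq = np.sum(X ** 2, axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)  # sigma = 1
print(np.max(np.abs(K_approx - K_exact)))   # error shrinks as n_features grows
```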

Multinomial Logistic Regression

When there are more than two classes (\(C>2\)), we generalize the sigmoid to the softmax function. For inputs \(x\in\mathbb{R}^D\) and class weights \(\{w_c,b_c\}_{c=1}^C\), the model defines \[ p(y=c\,|\,x,\theta) = \frac{\exp\!\bigl(w_c^\top x + b_c\bigr)} {\sum_{c'=1}^C \exp\!\bigl(w_{c'}^\top x + b_{c'}\bigr)}, \quad c=1,\dots,C. \]

We fit \(\theta=\{w_c,b_c\}\) by minimizing the average cross-entropy loss \[ \mathrm{NLL}(\theta) = -\frac{1}{N}\sum_{n=1}^N \sum_{c=1}^C \mathbb{I}\{y_n=c\}\,\log p(y_n=c\,|\,x_n,\theta). \]

The gradient has a similarly clean form: \[ \frac{\partial\mathrm{NLL}}{\partial w_c} = \frac{1}{N}\sum_{n=1}^N \Bigl[p(y_n=c\,|\,x_n)-\mathbb{I}\{y_n=c\}\Bigr]\,x_n, \] and we can optimize it via batch or mini-batch gradient descent.
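A sketch of the softmax probabilities, the cross-entropy loss, and the gradient above (NumPy; synthetic data, and the sizes and names are illustrative):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def nll_and_grad(W, b, X, y, C):
    """Average cross-entropy and its gradients w.r.t. W (D x C) and b (C,)."""
    N = X.shape[0]
    P = softmax(X @ W + b)                  # N x C matrix of p(y = c | x_n)
    Y = np.eye(C)[y]                        # one-hot labels, N x C
    loss = -np.mean(np.log(P[np.arange(N), y]))
    grad_W = X.T @ (P - Y) / N              # matches (1/N) sum_n (p - indicator) x_n
    grad_b = np.mean(P - Y, axis=0)
    return loss, grad_W, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
print(nll_and_grad(np.zeros((4, 3)), np.zeros(3), X, y, C=3)[0])  # log(3) ≈ 1.099 at init
```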

The Hessian is block-structured but remains positive semi-definite, ensuring convexity.