Multilayer Perceptron (MLP)
The key idea of deep neural networks (DNNs) is "composing" a vast number of simple functions to
build a much more complex function. In this section, we focus on a specific type of DNN known as the multilayer perceptron (MLP),
also referred to as a feedforward neural network (FFNN).
An MLP defines a composite function of the form:
\[
f(x ; \theta) = f_L (f_{L-1}(\cdots(f_1(x))\cdots))
\]
where each component function \( f_\ell(x) = f(x; \theta_\ell) \) represents the transformation at
layer \( \ell \), \( x \in \mathbb{R}^D \) is an input vector with \( D \) features, and
\(\theta\) is a collection of parameters (weights and biases):
\[
\theta = \{ \theta_\ell \}_{\ell=1}^L \text{, where } \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \}.
\]
Each layer is assumed to be differentiable and consists of two operations: an affine transformation followed
by a non-linear differentiable activation function \( g_\ell : \mathbb{R} \to \mathbb{R}\). An MLP consists of an input layer,
one or more hidden layers, and an output layer.
The hidden units \(z^{(\ell)}\) at layer \(\ell\) are obtained by applying the activation elementwise to an affine transformation of the previous layer's output:
\[
z^{(\ell)} = g_{\ell}(b^{(\ell)} + W^{(\ell)}z^{(\ell -1)}) = g_{\ell}(a^{(\ell)})
\]
where \(a^{(\ell)}\) are called the pre-activations, and the output of the network is denoted by
\(\hat{y} = h_\theta(x) = g_L(a^{(L)})\).
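To make the recursion concrete, here is a minimal NumPy sketch of a forward pass for a one-hidden-layer MLP with a ReLU hidden layer and a sigmoid output (the same shape of model as in the demo below); the sizes, random seed, and names are illustrative assumptions rather than the demo's actual code.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: D input features, H hidden units, 1 output unit.
rng = np.random.default_rng(0)
D, H = 2, 4
params = {
    "W1": rng.normal(scale=0.5, size=(H, D)), "b1": np.zeros(H),
    "W2": rng.normal(scale=0.5, size=(1, H)), "b2": np.zeros(1),
}

def forward(x, p):
    a1 = p["W1"] @ x + p["b1"]   # pre-activations a^(1)
    z1 = relu(a1)                # hidden units z^(1) (elementwise activation)
    a2 = p["W2"] @ z1 + p["b2"]  # pre-activations a^(2)
    return sigmoid(a2)           # output y_hat = g_L(a^(L))

print(forward(np.array([0.5, -1.2]), params))  # a single probability in (0, 1)
```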
Note: Input data is typically stored as an \( N \times D \) design matrix, where each
row corresponds to a data point and each column to a feature. This is referred to as structured data or
tabular data. In contrast, for unstructured data such as images or text, different architectures
are used:
- Convolutional Neural Networks (CNNs) for images
- Recurrent Neural Networks (RNNs) and Transformers for sequential data (e.g. text)
In particular, modern Large Language Models (LLMs) such as GPT are based on the transformer architecture and
have replaced RNNs in many natural language processing tasks.
Activation Functions
Without non-linear activation functions, a neural network composed of multiple layers would collapse to a
single linear transformation (writing only the weight matrices here):
\[
f(x ; \theta) = \theta^{(L)} \theta^{(L-1)} \cdots \theta^{(2)} \theta^{(1)} x.
\]
This composition is still linear in \( x \), and therefore incapable of representing non-linear decision boundaries.
Non-linear activation functions are necessary to break this linearity and allow networks to approximate arbitrary functions.
Historically, a common choice was the sigmoid (logistic) activation function:
\[
\sigma(a) = \frac{1}{1+e^{-a}}.
\]
However, the sigmoid saturates for large positive or negative inputs: \( \sigma(a) \to 1 \) as \( a \to +\infty \), and \( \sigma(a) \to 0 \) as \( a \to -\infty \).
In these regions, the gradient becomes very small, leading to the vanishing gradient problem — gradients shrink as they propagate backward, making learning
slow or unstable in deep networks.
To address this, modern networks often use the Rectified Linear Unit (ReLU):
\[
g(a) = \max(0, a) = a \mathbb{I}(a>0)
\]
ReLU introduces non-linearity while preserving gradient magnitude for positive inputs. It is computationally simple and helps maintain gradient flow during training, which
is why it is now a standard choice in modern neural network architectures.
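The difference in gradient behaviour is easy to see numerically. The short sketch below (assuming NumPy; the sample pre-activation values are arbitrary) evaluates both derivatives: the sigmoid's derivative never exceeds 0.25 and is nearly zero for large \(|a|\), while the ReLU derivative stays at 1 for all positive inputs.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)              # peaks at 0.25, vanishes for large |a|

def relu_grad(a):
    return np.where(a > 0, 1.0, 0.0)  # exactly 1 for every positive input

a = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'(a):", np.round(sigmoid_grad(a), 5))
print("relu'(a):   ", relu_grad(a))
```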
Learning in Neural Networks
Training the network means finding parameters \( \theta = \{ \theta_\ell \}_{\ell=1}^L \), where
\( \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \} \), that minimize the empirical risk (the average loss over all training data):
\[
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i)
\]
where \(\hat{y}_i = h_{\theta}(x_i)\) is the network's prediction.
For binary classification, a common choice is the
binary cross-entropy:
\[
\mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})
\]
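As a concrete sketch (assuming NumPy), the average binary cross-entropy over a batch can be computed as follows; the small `eps` clip is an added numerical-stability assumption, and the example labels and predictions are arbitrary.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logs stay finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_hat))  # smaller when predictions match the labels
```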
Optimization is performed with a gradient-based method that iteratively updates the
parameters in the direction of the negative gradient:
\[
\theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta)
\]
where \(\alpha\) is the learning rate.
Our demo employs mini-batch gradient descent, which computes gradients on a small random subset of the data at
each iteration. This provides a good balance between computational efficiency and gradient quality, often leading to faster convergence
and better generalization compared to using the entire dataset at once.
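Putting the update rule and mini-batching together, the training loop might look like the following sketch; `compute_gradients` is a hypothetical stand-in for backpropagation (described next), and the batch size, learning rate, and iteration count are illustrative.

```python
import numpy as np

def minibatch_gradient_descent(params, X, y, compute_gradients,
                               alpha=0.1, batch_size=32, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    for _ in range(n_iters):
        # Sample a small random subset of the training data.
        idx = rng.choice(N, size=min(batch_size, N), replace=False)
        grads = compute_gradients(params, X[idx], y[idx])
        # Gradient step: theta <- theta - alpha * grad_theta J(theta).
        for name in params:
            params[name] -= alpha * grads[name]
    return params
```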
The gradients are computed efficiently by the backpropagation
algorithm. Backprop is an efficient application of the chain rule, starting from the gradient of the loss with respect to the
output and working backwards. The algorithm computes all gradients in just two passes through the network:
Algorithm: BACKPROPAGATION
Consider an MLP with \(K\) layers, where the final layer \(f_K\) also computes the scalar loss \(\mathcal{L}\)
Input: \(x \in \mathbb{R}^D\)
//Forward Pass
\(x_1 = x\);
// \(f_k\) is the function of layer \(k\), taking the previous output \(x_k\) and this layer's parameters \(\theta_k\)
for \(k = 1 : K\) do
\(x_{k+1} = f_k(x_k, \theta_k)\);
//Backward Pass
\(u_{K+1} = 1\); //Gradient of \(\mathcal{L}\) wrt itself is 1
for \(k = K : 1\) do
\(g_k = u_{k+1}^\top \frac{\partial f_k (x_k, \theta_k)}{\partial \theta_k}\); //Gradient of the loss wrt \(\theta_k\)
\(u_k^\top = u_{k+1}^\top \frac{\partial f_k (x_k, \theta_k)}{\partial x_k}\); //Gradient of the loss wrt \(x_k\)
Output:
\(\mathcal{L} = x_{K+1}\); //Loss value (computed in forward pass)
\(\nabla_x \mathcal{L} = u_1\); //Gradient wrt the input
\(\{\nabla_{\theta_k} \mathcal{L} = g_k\}_{k=1}^K\); //Gradients wrt the parameters
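For the one-hidden-layer network sketched earlier (ReLU hidden layer, sigmoid output, binary cross-entropy), the two passes of backpropagation could be written as below. This is a sketch under those assumptions, not the demo's implementation; it reuses the parameter names from the earlier forward-pass example.

```python
import numpy as np

def backprop(params, X, y):
    """Forward and backward pass for a 1-hidden-layer MLP with ReLU hidden
    units, a sigmoid output, and binary cross-entropy averaged over the batch."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    B = X.shape[0]

    # Forward pass: keep pre-activations and activations for the backward pass.
    a1 = X @ W1.T + b1                       # (B, H) pre-activations
    z1 = np.maximum(0.0, a1)                 # (B, H) ReLU hidden units
    a2 = z1 @ W2.T + b2                      # (B, 1) output pre-activation
    y_hat = 1.0 / (1.0 + np.exp(-a2))        # (B, 1) predicted probabilities

    # Backward pass: apply the chain rule from the loss back to each parameter.
    # For sigmoid output + cross-entropy, dL/da2 simplifies to (y_hat - y) / B.
    delta2 = (y_hat - y.reshape(-1, 1)) / B  # (B, 1)
    grads = {"W2": delta2.T @ z1,            # (1, H)
             "b2": delta2.sum(axis=0)}       # (1,)
    delta1 = (delta2 @ W2) * (a1 > 0)        # (B, H) gradient through ReLU
    grads["W1"] = delta1.T @ X               # (H, D)
    grads["b1"] = delta1.sum(axis=0)         # (H,)
    return y_hat, grads
```

Wrapping this as `lambda p, Xb, yb: backprop(p, Xb, yb)[1]` would let it serve as the `compute_gradients` argument of the mini-batch loop sketched above.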
Neural Networks Demo
This interactive demo showcases how a simple neural network can learn to classify non-linear patterns.
You can generate datasets, tweak model parameters, and visualize the training process in real time.
- Model Architecture:
- 2 input features (\(x_1\) and \(x_2\))
- 1 hidden layer with ReLU activation (adjustable number of units)
- 1 output unit with sigmoid activation for binary classification
- Forward Pass:
The network computes predictions by applying matrix operations and non-linear activations. Selecting a
demo point shows a step-by-step computation.
- Training:
The network is trained using mini-batch gradient descent with backpropagation to minimize
binary cross-entropy loss. Each iteration uses a small, randomly sampled subset of the training data to
update weights.
- Faster and more stable than full-batch training
- Helps escape flat regions and saddle points
- More closely mirrors how real-world neural networks are trained
- Training Optimizations:
- Dynamic learning rate adjustment
- Gradient clipping: Prevents instability from exploding gradients by rescaling them when their norm exceeds a threshold (see the sketch after these lists).
- Early stopping when performance stabilizes
- \(\ell_2\) regularization (λ) to reduce overfitting
- Visualizations:
- Color-coded data points for training and test sets
- Decision boundary (green) shows where prediction = 0.5
- Probability contours reveal model confidence
- Dynamic network graph and forward pass breakdown
Try Adjusting:
- Hidden Units: More neurons allow for more complex decision boundaries
- Regularization (λ): Helps prevent overfitting by discouraging large weights
- Learning Rate: Controls how quickly the model updates
- Max Iterations: Sets how long the training runs before stopping
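To illustrate two of the training optimizations listed above, the sketch below (assuming NumPy) shows global-norm gradient clipping and an \(\ell_2\) penalty added to the weight gradients before the update step; the threshold, the λ value, and the convention of regularizing only the weight matrices are assumptions for illustration, not the demo's actual code.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale all gradients together if their combined norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        grads = {name: g * (max_norm / total_norm) for name, g in grads.items()}
    return grads

def add_l2_penalty(grads, params, lam=1e-3):
    # l2 regularization: the gradient of (lam / 2) * ||W||^2 is lam * W.
    # Conventionally applied to the weight matrices only, not to the biases.
    return {name: g + lam * params[name] if name.startswith("W") else g
            for name, g in grads.items()}
```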
Development of Deep Learning
The modern revolution in deep learning has been driven not only by algorithmic advances, but also by dramatic improvements in hardware—especially the rise of
graphics processing units (GPUs). Originally designed to accelerate matrix-vector computations for real-time rendering in
video games, GPUs turned out to be ideally suited for the linear algebra operations at the heart of neural networks.
In the early 2010s, researchers discovered that GPUs could speed up deep learning training by orders of magnitude compared to traditional CPUs. This enabled the
training of large neural networks on large labeled datasets, like ImageNet, which led to breakthroughs in computer vision, speech recognition
(converting spoken language to text), and broader natural language processing (NLP) tasks such as translation, summarization, and question answering.
Today, GPUs are a core component in AI research and development, alongside other fields such as scientific computing, complex simulations, and even cryptocurrency mining.
Zooming out further, GPUs themselves rely on foundational advances in semiconductor technology. Semiconductors are materials whose conductivity can
be precisely controlled, making them the backbone of all modern electronics—from GPUs and CPUs to memory chips and mobile devices. By using advanced fabrication
techniques and nanometer-scale engineering, manufacturers can pack billions of transistors (the basic units of computation) onto a single chip.
This density of computation enables the incredible power of today's hardware and fuels the era of foundation models, including large language models (LLMs)
such as GPT and BERT.