Intro to Deep Neural Networks


Contents: Convolutional Neural Networks (CNNs) · ResNet: Residual Connections · Layer Normalization · Attention Mechanisms · Self-Attention · Multi-Head Attention · Positional Encoding · Transformers · Demo

Convolutional Neural Networks (CNNs)

Deep neural networks (DNNs) have undergone remarkable evolution over the past few decades. From early image classifiers to today's cutting-edge language and multimodal models, the field has progressed through a series of foundational innovations. Broadly speaking, modern DNNs are typically feedforward networks, where information flows in a single direction — from input to output — without any loops or recurrence. This design simplifies computations, making them faster and more stable.

In the 1990s, Convolutional Neural Networks (CNNs) emerged as an architecture for processing grid-like data such as images. A CNN uses convolutional layers that apply learnable filters across local spatial regions. Mathematically, a 2D convolution operation is given by: \[ (f * x)(i, j) = \sum_m \sum_n f(m, n) \cdot x(i - m, j - n) \] where \(f\) is the filter (or kernel) and \(x\) is the input image. This operation captures local spatial patterns while sharing parameters across the image.
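The following is a minimal NumPy sketch of the 2D convolution defined above (the function and variable names are illustrative, not from the original demo). It computes the "valid" portion of the output and flips the kernel as the formula requires; most deep learning libraries skip the flip and compute cross-correlation instead.

import numpy as np

def conv2d(x, f):
    """'Valid' 2D convolution of image x with filter f, per the formula above."""
    f = np.flip(f)                                  # true convolution flips the kernel
    H, W = x.shape
    kh, kw = f.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum over a local patch; the same weights are shared at every position
            out[i, j] = np.sum(f * x[i:i + kh, j:j + kw])
    return out

x = np.arange(25, dtype=float).reshape(5, 5)        # toy 5x5 "image"
f = np.array([[1., 0.], [0., -1.]])                 # 2x2 filter (learnable in a real CNN)
print(conv2d(x, f).shape)                           # (4, 4)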

The breakthrough came in 2012, when AlexNet won the ImageNet competition. It stacked convolution and pooling layers, used the ReLU nonlinearity \( \text{ReLU}(x) = \max(0, x) \), and applied dropout to reduce overfitting. AlexNet demonstrated that deep CNNs trained on GPUs could vastly outperform traditional vision pipelines.

This success led to deeper architectures, most notably ResNet (2015), which introduced residual connections. ResNet substantially advanced image recognition and set the stage for modern vision architectures.

ResNet: Residual Connections

ResNet is a feedforward architecture in which each layer uses a residual block of the form: \[ \mathcal{F}_{\ell}' (\mathbf{x}) = \mathcal{F}_{\ell} (\mathbf{x}) + \mathbf{x} \] where \(\mathcal{F}_{\ell}\) is a nonlinear transformation at layer \(\ell\), and the input \(\mathbf{x}\) is added back directly.

The activations at the final layer \(L\) can be expressed recursively in terms of an earlier layer \(\ell\): \[ \mathbf{z}_L = \mathbf{z}_{\ell} + \sum_{i = \ell}^{L - 1} \mathcal{F}_i (\mathbf{z}_i; \mathbf{\theta}_i), \] where \(\mathbf{\theta}_i\) denotes the parameters of the transformation at layer \(i\).

This formulation mitigates the vanishing gradient problem in deep neural networks, enabling the successful training of very deep models.

Applying the chain rule, the gradient of the loss \(\mathcal{L}\) with respect to the parameters at layer \(\ell\) becomes: \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial \mathbf{\theta}_{\ell}} &= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{\ell}} \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \left(1 + \sum_{i = \ell}^{L-1} \frac{\partial \mathcal{F}_i (\mathbf{z}_i; \mathbf{\theta}_i)}{\partial \mathbf{z}_{\ell}}\right) \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} + \text{other terms} \end{align*} \] The “other terms” in the gradient capture indirect paths through deeper residual blocks and reflect how updates propagate backward through the network. These terms may diminish in magnitude in very deep networks, but they are generally not zero and play a critical role in learning refined representations. Residual networks avoid the vanishing gradient problem not by eliminating these terms, but by ensuring they are not the only path for gradient flow.
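As a concrete illustration, here is a minimal NumPy sketch of a residual block \(\mathcal{F}_{\ell}(\mathbf{x}) + \mathbf{x}\). The two-layer MLP used for \(\mathcal{F}_{\ell}\) is an illustrative assumption, not the convolutional block of the original ResNet; the point is the skip connection, whose Jacobian contributes the identity term \(1\) in the gradient above.

import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x):
    fx = relu(x @ W1) @ W2      # the nonlinear transformation F_l(x)
    return fx + x               # skip connection: the input is added back directly

x = rng.normal(size=(4, d))     # a batch of 4 activation vectors
print(residual_block(x).shape)  # (4, 16)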

Layer Normalization

Let \(z_i\) be the \(i\)'th element of a tensor. For example, for 2D image data, the index \(i\) has four components, indicating batch, height, width, and channel: \[ i = (i_N, i_H, i_W, i_C). \] For each element \(z_i\), the mean and variance are computed by pooling statistics across the other dimensions of the tensor, but not across examples in the batch: \[ \begin{align*} &\mu_i = \frac{1}{| \mathcal{S}_i|} \sum_{k \in \mathcal{S}_i} z_k \\\\ &\sigma_i = \sqrt{\frac{1}{| \mathcal{S}_i|}\sum_{k \in \mathcal{S}_i} (z_k - \mu_i)^2 + \epsilon} \end{align*} \] where \(\mathcal{S}_i\) is the set of elements over which statistics are computed for index \(i\), and \(\epsilon\) is a small constant for numerical stability.

The normalized and scaled output is given by: \[ \begin{align*} &\hat{z}_i = \frac{z_i - \mu_i}{\sigma_i} \\\\ &\tilde{z}_i = \gamma_c \hat{z}_i + \beta_c \end{align*} \] where \(c\) is the channel index corresponding to \(i\), and \(\gamma_c, \beta_c\) are learnable parameters for scaling and shifting.

Unlike batch normalization, which pools across the batch dimension, layer normalization pools only over feature dimensions for each input example. This makes it especially effective in settings where batch-level statistics are unstable or unavailable — such as in autoregressive models, small batch sizes, or variable-length sequences.
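The following is a minimal NumPy sketch of layer normalization for a batch of feature vectors, pooling statistics over the feature dimension only and never across the batch; the function and variable names are illustrative.

import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=-1, keepdims=True)             # per-example mean over features
    var = z.var(axis=-1, keepdims=True)             # per-example variance over features
    z_hat = (z - mu) / np.sqrt(var + eps)           # normalize
    return gamma * z_hat + beta                     # learnable scale and shift

z = np.random.default_rng(0).normal(size=(2, 8))    # (batch, features)
gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(z, gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))          # approximately 0 and 1 per example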

Attention Mechanisms

Recurrent architectures process sequences step-by-step, which limits parallelism and makes modeling long-range dependencies challenging. Attention mechanisms address this by computing a weighted sum of all input feature vectors, allowing the model to “attend” to the most relevant parts of the sequence regardless of distance.

In traditional feedforward neural networks, each layer computes hidden activations as a linear transformation of input activations followed by a nonlinearity: \[ \mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}) \in \mathbb{R}^{m \times v'} \] where \(\mathbf{X} \in \mathbb{R}^{m \times v}\) is a matrix of input (or hidden) feature vectors, and \(\mathbf{W} \in \mathbb{R}^{v \times v'}\) is a fixed weight matrix learned during training.

This formulation uses the same set of weights \(\mathbf{W}\) for every input position. To allow more flexibility, we can instead make the weights input-dependent. That is, we can allow: \[ \mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}(\mathbf{X})), \] where the transformation matrix \(\mathbf{W}(\mathbf{X})\) varies depending on the input itself. This allows the model to dynamically adjust its computations based on the input context. Such input-dependent multiplicative interactions are central to attention mechanisms.

In more general settings, the input-dependent transformation matrix can be computed from separate sets of learned representations called queries, keys, and values. This leads to a broader formulation of attention: \[ \mathbf{Z} = \varphi (\mathbf{V}\mathbf{W}(\mathbf{Q}, \mathbf{K})) \] where:

- \(\mathbf{Q}\) is a matrix of query vectors,
- \(\mathbf{K}\) is a matrix of key vectors,
- \(\mathbf{V}\) is a matrix of value vectors.

The transformation matrix \(\mathbf{W}(\mathbf{Q}, \mathbf{K})\) represents a learned or computed relationship between queries and keys. Note: in practice, these matrices are often obtained by applying learned linear projections to the input sequence.

In attention, the core idea is to compute each output vector \(\mathbf{z}_j\) as a weighted sum over all value vectors \(\mathbf{v}_i\), where the weights \(\alpha_{ij}\) are determined by a similarity between the query \(\mathbf{q}_j\) and each key \(\mathbf{k}_i\): \[ \mathbf{z}_j = \sum_{i}\alpha_{ij}\mathbf{v}_i \] where \(0 \leq \alpha_{ij} \leq 1\) and \(\sum_i \alpha_{ij} = 1\). This allows the model to selectively attend to relevant parts of the input.

A common similarity function is the inner product between query and key vectors. If \(\mathbf{q}, \mathbf{k} \in \mathbb{R}^d\) are independent random variables with zero mean and unit variance, then their inner product \(\mathbf{q}^\top \mathbf{k}\) has zero mean and variance \(d\). To prevent large variance from pushing the softmax into extreme regions, we normalize it: \[ a(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}. \] This is known as the scaled dot-product attention score; the scale factor \(\frac{1}{\sqrt{d}}\) keeps the variance stable (equal to 1) as the dimension \(d\) increases. The attention weights \(\alpha_{ij}\) are obtained by applying the softmax function to these scores.

In practice, we compute attention over minibatches of \(n\) query vectors. Let: \[ \mathbf{Q} \in \mathbb{R}^{n \times d}, \quad \mathbf{K} \in \mathbb{R}^{m \times d}, \quad \mathbf{V} \in \mathbb{R}^{m \times v} \] Then the attention-weighted output is computed as: \[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v} \] where the softmax is applied row-wise to normalize each row of the \(n \times m\) similarity matrix.
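The following is a minimal NumPy sketch of scaled dot-product attention as defined above; shapes follow the text (\(\mathbf{Q}\) is \(n \times d\), \(\mathbf{K}\) is \(m \times d\), \(\mathbf{V}\) is \(m \times v\)), and the function names are illustrative.

import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)         # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, m) scaled similarities
    alpha = softmax(scores, axis=-1)                # row-wise attention weights
    return alpha @ V                                # (n, v) weighted sums of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)                     # (3, 4)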

Self-Attention

Self-attention is a mechanism that allows each position \(i\) in a sequence to attend to all other positions in the same sequence. Given a sequence of input tokens \(\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^d\), a sequence of outputs is given by: \[ \mathbf{y}_i = \text{Attention}(\mathbf{q}_i, (\mathbf{x}_1, \mathbf{x}_1), \cdots, (\mathbf{x}_n, \mathbf{x}_n )) \in \mathbb{R}^d \] where the query is \(\mathbf{q}_i\) and the keys and values are \(\mathbf{x}_1, \cdots, \mathbf{x}_n\).

This mechanism captures contextual relationships between tokens by allowing information to flow from any position to any other in a single layer. Because the entire input sequence is available during training, all outputs \(\mathbf{y}_i\) can be computed in parallel, which overcomes the sequential bottleneck of recurrent architectures.
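In the simplest case, self-attention is the attention operation above applied with the same token matrix as queries, keys, and values. A minimal NumPy sketch (names illustrative; learned projections omitted):

import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])               # scaled similarity scores
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)            # row-wise softmax weights
    return a @ V

X = np.random.default_rng(0).normal(size=(6, 8))     # 6 tokens, d = 8
Y = attention(X, X, X)                               # every token attends to every token
print(Y.shape)                                       # (6, 8), all outputs computed in parallel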

Multi-Head Attention

While a single self-attention mechanism can capture dependencies between tokens, using only one set of projections may limit the model's expressiveness. Multi-head attention (MHA) addresses this by computing multiple self-attention operations in parallel using different learned projections, called "heads."

Formally, given a query \(\mathbf{q}\) and a set of key-value pairs \(\{(\mathbf{k}_j, \mathbf{v}_j)\}\) (in self-attention, all derived from the input matrix \( \mathbf{X} \in \mathbb{R}^{n \times d} \)), each attention head applies its own learned projections to compute queries, keys, and values: \[ \text{head}_i = \text{Attention}\left(\mathbf{W}_i^{(q)} \mathbf{q}, \{\mathbf{W}_i^{(k)} \mathbf{k}_j, \mathbf{W}_i^{(v)}\mathbf{v}_j\}\right) \in \mathbb{R}^{p_v}, \] where \( \mathbf{W}_i^{(q)}, \mathbf{W}_i^{(k)}, \mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d} \) are projection matrices specific to head \( i \).

All heads are computed in parallel and their outputs are concatenated and linearly projected to produce the final output: \[ \text{MHA}(\mathbf{X}) = \text{MHA}\left(\mathbf{q}, \{\mathbf{k}_j, \mathbf{v}_j\}\right) = \mathbf{W}_o \begin{bmatrix} \text{head}_1 \\ \vdots \\ \text{head}_h\\ \end{bmatrix}. \] where \( \mathbf{W}_o \in \mathbb{R}^{d \times hp_v} \) is a learned output projection matrix.

By using multiple heads, the model can attend to information from different representation subspaces and capture a richer set of relationships in the data. Each head may focus on different positions or interaction patterns, improving both performance and interpretability.
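The following is a minimal NumPy sketch of multi-head self-attention with \(h\) heads. The random projection matrices stand in for learned weights, and the projections act on row vectors (the transpose of the convention above); all names are illustrative.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
p = d // h                                           # per-head dimension p_v
X = rng.normal(size=(n, d))                          # input sequence (queries = keys = values)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(h, d, p)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(h * p, d))          # output projection W_o

heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]  # each (n, p)
out = np.concatenate(heads, axis=-1) @ Wo            # concatenate heads, project back to d
print(out.shape)                                     # (6, 16)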

Positional Encoding

Standard self-attention is permutation equivariant: permuting the input tokens simply permutes the outputs, so the model has no notion of token order in the sequence. To address this, positional encoding is added to the input embeddings to inject information about the position of each token. A widely used method is to define fixed, deterministic positional encodings using sine and cosine functions of varying frequencies.

The positional encoding vector for position \( i \) and dimension \( j \) is defined as follows: \[ p_{i, 2j} = \sin \left( \frac{i}{C^{2j/d}} \right) \] \[ p_{i, 2j+1} = \cos \left( \frac{i}{C^{2j/d}} \right) \] where:

- \( i \) is the position of the token in the sequence,
- \( j \) indexes the sinusoid pairs (dimensions \(2j\) and \(2j+1\) of the encoding),
- \( d \) is the model (embedding) dimension,
- \( C \) is a large constant, typically 10,000 as in the example below.

This formulation ensures that each dimension of the positional encoding corresponds to a sinusoid of a different frequency. As a result, any two positions will have a unique positional encoding vector, and the relative position between tokens can be represented as a linear function of the encodings.

Example: Suppose the model dimension is \( d = 4 \), and we compute the positional encoding for position \( i = 2 \). Then we compute each component of the encoding as: \[ \begin{align*} &p_{2,0} = \sin \left( \frac{2}{10,000^{0/4}} \right) = \sin(2) \\\\ &p_{2,1} = \cos \left( \frac{2}{10,000^{0/4}} \right) = \cos(2) \\\\ &p_{2,2} = \sin \left( \frac{2}{10,000^{2/4}} \right) = \sin \left( \frac{2}{100} \right) \\\\ &p_{2,3} = \cos \left( \frac{2}{10,000^{2/4}} \right) = \cos \left( \frac{2}{100} \right) \end{align*} \]

This results in the positional encoding vector for position 2 as: \[ \mathbf{p}_2 = \left[ \sin(2),\ \cos(2),\ \sin(0.02),\ \cos(0.02) \right]. \]

These encodings are added element-wise to the token embeddings at each position, allowing the model to incorporate order information without relying on recurrence or convolution. Moreover, the use of periodic functions enables the model to potentially generalize to sequences longer than those seen during training.
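The following is a minimal NumPy sketch of the sinusoidal positional encoding above, with \(C = 10{,}000\) as in the worked example; printing row 2 reproduces \(\mathbf{p}_2\).

import numpy as np

def positional_encoding(num_positions, d, C=10_000):
    P = np.zeros((num_positions, d))
    for i in range(num_positions):
        for j in range(d // 2):
            angle = i / C ** (2 * j / d)
            P[i, 2 * j] = np.sin(angle)              # even dimensions use sine
            P[i, 2 * j + 1] = np.cos(angle)          # odd dimensions use cosine
    return P

P = positional_encoding(num_positions=4, d=4)
print(np.round(P[2], 4))                             # [sin(2), cos(2), sin(0.02), cos(0.02)]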

Transformers

The Transformer architecture, introduced by Vaswani et al. (2017), replaces recurrence and convolution with a mechanism based entirely on self-attention. In both the encoder and decoder, the core components are multi-head self-attention layers, feedforward networks, residual connections, and layer normalization. The design enables full parallelization across sequence positions during training, and allows the model to capture long-range dependencies through direct pairwise interactions between tokens. By using learned attention weights that vary depending on the input, the model dynamically determines which parts of the sequence to emphasize at each layer.

Transformer-based models form the basis of many state-of-the-art systems across domains such as text, vision, and multimodal learning. In particular, large language models (LLMs) such as the GPT family, developed by OpenAI, are built on stacked Transformer decoder blocks trained on massive text corpora. These models leverage the scalability and expressiveness of the Transformer to generate coherent and context-aware output across a wide range of tasks.

Transformer Encoder & Decoder

EncoderBlock(X):
    Z = LayerNorm(MHA(Q = X, K = X, V = X) + X)
    E = LayerNorm(FeedForward(Z) + Z)
    return E

Encoder(X, N):
    E = POS(Embed(X))
    for n in range(N):
        E = EncoderBlock(E)
    return E

DecoderBlock(Y, E):
    Z = LayerNorm(MHA(Q = Y, K = Y, V = Y) + Y)
    Z' = LayerNorm(MHA(Q = Z, K = E, V = E) + Z)
    D = LayerNorm(FeedForward(Z') + Z')
    return D

Decoder(Y, E, N):
    D = POS(Embed(Y))
    for n in range(N):
        D = DecoderBlock(D, E)
    return D

Note: In the pseudocode above, Embed maps input tokens to \(d\)-dimensional vectors, POS adds the positional encodings, MHA denotes multi-head attention, FeedForward is a position-wise feedforward network, and \(N\) is the number of stacked blocks.
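As a concrete illustration of the EncoderBlock pseudocode, here is a minimal PyTorch sketch; the framework, hyperparameters, and class name are assumptions for illustration, not the course demo itself.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)              # Q = K = V = X (self-attention)
        z = self.norm1(attn_out + x)                 # residual connection + layer norm
        return self.norm2(self.ff(z) + z)            # feedforward + residual + layer norm

x = torch.randn(2, 10, 64)                           # (batch, sequence length, d_model)
print(EncoderBlock()(x).shape)                       # torch.Size([2, 10, 64])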

Demo