How gradients flow through a simplified sentiment classifier

The Embedding Table

Every word in our vocabulary has a learned embedding — a vector of numbers that captures meaning. These aren't hand-crafted; they're learned from data through the very gradient process we're about to trace.
To process a phrase, we look up each word's embedding in the table. Mathematically, this is a matrix multiplication: a one-hot vector times the embedding matrix. In practice, frameworks use an index operation as a shortcut, since multiplying by a one-hot vector just selects one row.
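A minimal NumPy sketch of the two equivalent lookups (the vocabulary size, embedding dimension, and table values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                   # toy sizes
E = rng.normal(size=(vocab_size, dim))   # embedding table, one row per word

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# The "textbook" lookup: one-hot vector times the embedding matrix...
via_matmul = one_hot @ E
# ...and the shortcut frameworks actually use: plain row indexing.
via_index = E[word_id]

assert np.allclose(via_matmul, via_index)
```

Both paths return the same row; indexing just skips the multiplications by zero.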
We sum the two word embeddings element-by-element to get a single vector h representing the whole phrase. Both words contribute equally. Remember this — it matters for backprop. As Karpathy puts it, a sum in the forward pass becomes a broadcast in the backward pass.
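Sketched with made-up embedding values:

```python
import numpy as np

e1 = np.array([0.5, -1.0, 2.0])   # embedding of the first word (made-up values)
e2 = np.array([1.5,  0.5, -1.0])  # embedding of the second word

# Element-wise sum: one vector for the whole phrase.
h = e1 + e2

assert np.allclose(h, [2.0, -0.5, 1.0])
```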
Each class (Positive/Negative) has its own weight vector. We compute the dot product of h with each weight vector to get a logit — a raw score for each class.
The logits pass through softmax to become probabilities that sum to 1. Cross-entropy loss measures how far the prediction is from the true label.
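The scoring and loss steps, sketched in NumPy with made-up values for h and the class weights:

```python
import numpy as np

h = np.array([2.0, -0.5, 1.0])            # combined phrase embedding (made-up)
W = np.array([[ 0.5, 1.0, -0.5],          # weight vector for class 0 (Positive)
              [-1.0, 0.5,  1.0]])         # weight vector for class 1 (Negative)

logits = W @ h                            # dot product of h with each class vector

# Softmax turns raw scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max())       # subtract the max for numerical stability
probs = exp / exp.sum()

true_class = 0                            # suppose the label is Positive
loss = -np.log(probs[true_class])         # cross-entropy for a one-hot label
```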
Now we reverse direction. Backpropagation starts at the loss itself: the gradient of the loss with respect to itself is simply 1.
The gradient flows back through the softmax. Combined with cross-entropy, the gradient with respect to each logit is simply p − y. For the true class this is negative (we want its logit to increase); for each wrong class it is positive (we want its logit to decrease).
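Combining softmax with cross-entropy gives the well-known gradient p − y at the logits; a quick sign check with made-up probabilities:

```python
import numpy as np

probs = np.array([0.3, 0.7])          # softmax output (made-up values)
true_class = 0
y = np.zeros_like(probs)
y[true_class] = 1.0                   # one-hot label

# Combined softmax + cross-entropy gradient w.r.t. the logits: p - y
dlogits = probs - y

assert dlogits[true_class] < 0        # true class: negative, push its logit up
assert dlogits[1 - true_class] > 0    # wrong class: positive, push its logit down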
Gradients flow back to the weights. Using the chain rule, ∂L/∂W = ∂L/∂logit × h: each class's weight gradient is its logit gradient scaled by h. The gradient also flows through the weights to the combined embedding h, as ∂L/∂h = Wᵀ × ∂L/∂logit.
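Both chain-rule products, sketched in NumPy with the same made-up values as before:

```python
import numpy as np

h = np.array([2.0, -0.5, 1.0])            # combined embedding (made-up)
W = np.array([[ 0.5, 1.0, -0.5],          # class weight vectors (made-up)
              [-1.0, 0.5,  1.0]])
dlogits = np.array([-0.7, 0.7])           # p - y from the softmax step

# dL/dW: each class's gradient row is its logit gradient times h.
dW = np.outer(dlogits, h)                 # shape (2, 3), same as W
# dL/dh: the logit gradients flow back through the weights.
dh = W.T @ dlogits

assert dW.shape == W.shape
assert dh.shape == h.shape
```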
Since h = e1 + e2, the gradient ∂L/∂h is copied identically to both words. Both rows in the embedding table receive the exact same gradient. This is gradient routing — the sum operation acts as a gradient duplicator. As Karpathy puts it, the add gate takes the gradient on its output and distributes it equally to all of its inputs.
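The routing step, sketched in NumPy (the word indices and gradient values are made up):

```python
import numpy as np

dh = np.array([-1.05, -0.35, 1.05])   # gradient arriving at h (made-up values)

# h = e1 + e2, so the gradient is copied unchanged to each input.
de1 = dh.copy()
de2 = dh.copy()

# Accumulate into the embedding table's gradient buffer, row by row.
dE = np.zeros((5, 3))                 # one row per word in a 5-word toy vocabulary
w1, w2 = 1, 3                         # the two words in the phrase
dE[w1] += de1
dE[w2] += de2

assert np.allclose(dE[w1], dE[w2])    # both rows get the exact same gradient
```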
Because addition is commutative, this model treats “not bad” and “bad not” as identical inputs. Both words receive the same gradient, so there is no mechanism to learn that “not” should flip the meaning of its neighbor. To handle composition like negation, you need architectures that break this symmetry — positional embeddings, attention, or recurrence.
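A quick demonstration that word order cannot matter in this model, using a made-up random embedding table and class weights:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))           # toy embedding table
W = rng.normal(size=(2, 3))           # toy class weight vectors

def logits(word_ids):
    h = E[word_ids].sum(axis=0)       # sum of embeddings: order is lost here
    return W @ h

# Any permutation of the same words produces identical logits,
# so "not bad" and "bad not" are indistinguishable to the model.
assert np.allclose(logits([1, 3]), logits([3, 1]))
```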