Shallow Learning

How gradients flow through a simplified sentiment classifier

Scroll down to begin the forward pass. Watch how words become predictions through learned embeddings.

The Embedding Table

Every word in our vocabulary has a learned embedding — a vector of numbers that captures meaning. These aren't hand-crafted; they're learned from data through the very gradient process we're about to trace.
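
To make the table concrete, here is a minimal NumPy sketch of one. The vocabulary, the 4-dimensional size, and the random values are illustrative stand-ins, not the embeddings the demo actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: word -> row index in the embedding table.
vocab = {"not": 0, "bad": 1, "good": 2, "terrible": 3}

# Embedding table: one row per word, 4 numbers per row.
# Real tables start random like this and are shaped by gradient updates.
E = rng.normal(scale=0.1, size=(len(vocab), 4))

print(E[vocab["bad"]])  # the current 4-number vector for "bad"
```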

Word Lookup

To process a phrase, we look up each word's embedding from the table. Mathematically, this is a matrix multiplication: a one-hot vector times the embedding matrix (see it in action). In practice, frameworks use an index operation as a shortcut.
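
A quick sketch of that equivalence, reusing the same toy table (illustrative values only): multiplying by a one-hot vector selects exactly one row, which is what indexing does directly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"not": 0, "bad": 1, "good": 2, "terrible": 3}
E = rng.normal(scale=0.1, size=(len(vocab), 4))

# The "textbook" lookup: a one-hot row vector times the embedding matrix.
one_hot = np.zeros(len(vocab))
one_hot[vocab["bad"]] = 1.0
via_matmul = one_hot @ E

# The shortcut frameworks actually use: index the row directly.
via_index = E[vocab["bad"]]

assert np.allclose(via_matmul, via_index)
```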

Combining Words

We sum the two word embeddings element-by-element to get a single vector h representing the whole phrase. Both words contribute equally. Remember this — it matters for backprop. As Karpathy puts it, if there is a sum in the forward pass, there is a broadcast in the backward pass.
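
In code the combination is a single element-wise addition. The two 4-dimensional vectors below are made-up stand-ins for the embeddings of a two-word phrase.

```python
import numpy as np

e1 = np.array([ 0.2, -0.1,  0.5,  0.3])   # first word's embedding
e2 = np.array([-0.4,  0.6, -0.2,  0.1])   # second word's embedding

# Element-wise sum: one vector h for the whole phrase.
h = e1 + e2
print(h)   # [-0.2  0.5  0.3  0.4]
```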

Weighted Sum → Logits

Each class (Positive/Negative) has its own weight vector. We compute the dot product of h with each weight vector to get a logit — a raw score for each class.
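
Sketched with made-up numbers: two weight vectors, one per class, and one dot product each. The matrix-vector product W @ h computes both at once.

```python
import numpy as np

h = np.array([-0.2, 0.5, 0.3, 0.4])   # combined phrase vector from above
W = np.array([
    [ 0.3, 0.8, -0.1,  0.5],   # weight vector for Positive
    [-0.6, 0.2,  0.4, -0.3],   # weight vector for Negative
])

logits = W @ h   # one dot product per class
print(logits)    # [0.51 0.22]
```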

Softmax & Cross-Entropy Loss

The logits pass through softmax to become probabilities that sum to 1. Cross-entropy loss measures how far the prediction is from the true label.
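
A small sketch of both steps, continuing the illustrative numbers from above. Subtracting the max inside softmax is a standard trick for numerical stability.

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([0.51, 0.22])       # [Positive, Negative]
probs = softmax(logits)                # probabilities, sum to 1
true_class = 0                         # suppose the label is Positive
loss = -np.log(probs[true_class])      # cross-entropy with a one-hot label
print(probs, loss)                     # roughly [0.572 0.428] and 0.559
```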

Backpropagation Begins

Now we reverse direction. The gradient of the loss with respect to itself is simply 1: increasing the loss by a small amount increases the loss by exactly that amount. This is the seed from which the chain rule builds every other gradient.

Through Softmax

The gradient flows through softmax and cross-entropy together, and the combination collapses to a simple form: predicted probability minus target. For the true class the gradient is negative (we want its logit to increase); for the wrong class it is positive (we want its logit to decrease).
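
The same fact in code, using the illustrative probabilities from the sketch above: the gradient with respect to each logit is just probability minus target.

```python
import numpy as np

probs = np.array([0.572, 0.428])   # softmax output from the previous sketch
y = np.array([1.0, 0.0])           # one-hot label: Positive is the true class

# Softmax followed by cross-entropy has the simple combined gradient p - y.
dlogits = probs - y
print(dlogits)   # [-0.428  0.428]: negative for the true class, positive for the other
```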

To the Weights

Gradients flow back to the weights. Using the chain rule: ∂L/∂W = ∂L/∂logit × h. The gradient also flows through to the combined embedding h.
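
Continuing the same made-up numbers, both gradients are one line each: an outer product for the weights and a transposed matrix-vector product for h.

```python
import numpy as np

h = np.array([-0.2, 0.5, 0.3, 0.4])        # combined phrase vector
W = np.array([
    [ 0.3, 0.8, -0.1,  0.5],
    [-0.6, 0.2,  0.4, -0.3],
])
dlogits = np.array([-0.428, 0.428])        # p - y from the previous step

# Chain rule: each class's weight gradient is its logit gradient times h.
dW = np.outer(dlogits, h)                  # same shape as W

# The gradient reaching h collects contributions from both logits.
dh = W.T @ dlogits
print(dh)   # roughly [-0.385 -0.257  0.214 -0.342]
```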

Back to the Embedding Table

Since h = e1 + e2, the gradient ∂L/∂h is copied identically to both words. Both rows in the embedding table receive the exact same gradient. This is gradient routing — the sum operation acts as a gradient duplicator. As Karpathy puts it, “addition is a router of gradient.”
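
The routing step in code, with dh carried over from the sketch above: because h = e1 + e2, the backward pass simply hands the same vector to both words.

```python
import numpy as np

dh = np.array([-0.385, -0.257, 0.214, -0.342])   # dL/dh from the previous sketch

# h = e1 + e2, so the sum's backward pass copies the gradient unchanged.
de1 = dh.copy()
de2 = dh.copy()

# Both words' rows in the embedding table receive the exact same gradient.
assert np.array_equal(de1, de2)
```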

The Limits of Sum

Because addition is commutative, this model treats “not bad” and “bad not” as identical inputs. Both words receive the same gradient, so there is no mechanism to learn that “not” should flip the meaning of its neighbor. To handle composition like negation, you need architectures that break this symmetry — positional embeddings, attention, or recurrence.
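
A two-line demonstration of the blind spot, with made-up embeddings: swapping the word order cannot change h, so it cannot change the logits, the loss, or any gradient.

```python
import numpy as np

e_not = np.array([ 0.2, -0.1,  0.5,  0.3])   # hypothetical embedding for "not"
e_bad = np.array([-0.4,  0.6, -0.2,  0.1])   # hypothetical embedding for "bad"

# Addition is commutative, so word order is invisible to this model:
# "not bad" and "bad not" produce identical h, and therefore identical predictions.
assert np.array_equal(e_not + e_bad, e_bad + e_not)
```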