AI Foundations for PMs
PM 01 · Chapter 2 of 7

How Models Learn

~7 min essentials·20 min full·7 sections

Parameters, loss functions, and gradient descent — demystified.

Key takeaway

Training is the process of adjusting parameters to minimize the loss function using gradient descent.

§2.1·~1 min

What is a parameter (weight)

The internal settings a model adjusts

Key takeaway

Parameters are the internal settings a model adjusts during training; they are the learned "knowledge" of the system, not hand-coded logic.

Why this matters for you

When a vendor quotes you for fine-tuning, they are charging you to adjust these billions of knobs using your proprietary data. If your data isn't clean, you are paying to turn the knobs in the wrong direction.

Walk into a server room hosting a modern large language model and you won't find a database of facts or a hard drive full of rules. Instead, you'll find an enormous matrix of decimal numbers that define exactly how the model reacts to any input it receives. A parameter, or weight, is simply a numerical value inside the model that determines how much importance to give to a specific piece of input data. Think of them as billions of tiny volume knobs on an audio mixer, where every knob controls the signal passing from one artificial neuron to the next. When we say a model 'learns,' we literally mean it is twisting these billions of knobs up and down until the output matches what we want to see.
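
The knob analogy can be made concrete with a minimal sketch of a single artificial neuron. The function name, the three inputs, and the weight values are all made up for illustration; a real model has billions of such weights arranged in layers.

```python
# One artificial neuron: each weight is a "knob" that scales one input.
# All names and numbers here are hypothetical, chosen only for illustration.

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias term."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

inputs = [1.0, 2.0, 3.0]

# Same inputs, one knob turned up (second weight 0.0 -> 2.0): the output changes.
before = neuron_output(inputs, weights=[0.5, 0.0, -1.0], bias=0.0)
after = neuron_output(inputs, weights=[0.5, 2.0, -1.0], bias=0.0)

print(before)  # -2.5
print(after)   # 1.5
```

Nothing in the code "knows" anything about the inputs; all of the behavior lives in the weight values, which is why training amounts to finding good numbers.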

§2.2·~1 min

What is a loss function

The mathematical definition of a mistake

Key takeaway

The loss function is the mathematical definition of a mistake; it tells the model exactly how wrong its current prediction is so it can adjust.

Why this matters for you

If you don't explicitly define what a "mistake" is for your product, the engineering team will choose a default metric that might optimize for the wrong behavior. You must align the loss function with your actual business goals.

Imagine shooting an arrow at a target while blindfolded, with a coach calling out exactly how many inches you missed the bullseye by. That coach is the loss function. A loss function is a mathematical formula that calculates the difference between what the model predicted and what the correct answer actually is. If a model predicts a house will sell for $400,000 and it actually sells for $500,000, the loss function assigns a large penalty to that $100,000 miss. The goal of training is to minimize this penalty, driving the average loss as low as possible across millions of examples.
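
As a rough illustration of the house-price example, the sketch below uses squared error, one common loss function. The text does not say which loss is in play, so that choice is an assumption, as are the function names and the second (smaller) miss.

```python
# Squared error: the penalty grows with the square of the miss.
# The choice of squared error is an assumption for this sketch.

def squared_error(predicted, actual):
    return (predicted - actual) ** 2

def mean_squared_error(predictions, actuals):
    """Average penalty across many examples."""
    errors = [squared_error(p, a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)

big_miss = squared_error(400_000, 500_000)    # off by $100,000
small_miss = squared_error(490_000, 500_000)  # off by $10,000 (hypothetical)

print(big_miss)    # 10000000000
print(small_miss)  # 100000000
```

A miss ten times larger is penalized one hundred times more under this loss. A different loss, such as absolute error, would penalize it only ten times more, which is why the choice of loss function shapes what the model treats as a serious mistake.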

§2.3·~1 min

Forward pass explained

The core action of generating predictions

Key takeaway

A forward pass is the act of pushing data through the model's layers to generate a single prediction; it is the fundamental action of inference.

Why this matters for you

Every forward pass costs compute, money, and time. When scoping latency requirements for a real-time feature, you are directly dictating how fast the forward pass must execute.

To see the loss function in action, the model first has to make a guess. A forward pass is the process where raw input data enters the model, flows through the network of parameters, and emerges on the other side as a final prediction. When you upload an image of a dog to an AI classifier, the pixels flow through the first layer of parameters to detect edges, then the next layer to detect shapes, until the final layer outputs a score for each possible label, with "dog" scoring highest. During a forward pass, the model is simply applying its current knowledge; it is not learning or updating its parameters at all.
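
A minimal sketch of a forward pass through a tiny two-layer network follows. The layer sizes, weight values, and the ReLU activation are assumptions chosen to keep the arithmetic easy to follow; they are not taken from any particular model.

```python
# Forward pass through a tiny two-layer network (hypothetical weights).

def relu(value):
    """A common activation: negative signals are clipped to zero."""
    return max(0.0, value)

def layer(inputs, weights, biases):
    """Each row of weights produces one output: a weighted sum plus a bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, params):
    hidden = [relu(h) for h in layer(inputs, params["w1"], params["b1"])]
    return layer(hidden, params["w2"], params["b2"])

params = {
    "w1": [[1.0, -1.0], [0.5, 0.5]],  # first layer: 2 inputs -> 2 hidden values
    "b1": [0.0, 0.0],
    "w2": [[1.0, 2.0]],               # second layer: 2 hidden values -> 1 output
    "b2": [0.5],
}

prediction = forward([2.0, 1.0], params)
print(prediction)  # [4.5]
```

The function only reads `params` and never writes to it, which mirrors the point above: a forward pass applies the current weights without changing them.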

§2.4·~1 min

Backpropagation explained

How the network learns from its errors

Key takeaway

Backpropagation is the feedback loop that calculates exactly how much each parameter contributed to an error, allowing the model to learn.

Why this matters for you

Understanding backpropagation reveals why training is so much more expensive and complex than inference. It explains why you can't just "teach the model a new rule" in real-time.

If the forward pass is the model taking a test, backpropagation is the teacher grading the test and showing the model exactly where it went wrong. Backpropagation is the algorithm that traces an error backwards through the network to determine which specific parameters were responsible for the mistake. After the loss function calculates the total error, backpropagation acts like a forensic investigator, moving in reverse from the final output layer all the way back to the input layer. It assigns a slice of the blame to every single parameter that played a role in generating the incorrect prediction.
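
The blame assignment can be sketched by hand on a chain of just two weights. This is an illustration under stated assumptions (squared-error loss, one input, made-up values); real frameworks automate the same chain-rule bookkeeping across billions of parameters.

```python
# Tiny chain: hidden = w1 * x, prediction = w2 * hidden.
# Loss is squared error (an assumed choice for this sketch).

def forward(x, w1, w2):
    hidden = w1 * x
    prediction = w2 * hidden
    return hidden, prediction

def backward(x, y, w1, w2):
    """Return (loss, dloss/dw1, dloss/dw2) using the chain rule."""
    hidden, prediction = forward(x, w1, w2)
    loss = (prediction - y) ** 2

    # Start at the output and walk backwards toward the input.
    d_prediction = 2 * (prediction - y)  # how the loss changes with the prediction
    d_w2 = d_prediction * hidden         # blame assigned to w2
    d_hidden = d_prediction * w2         # error passed back to the hidden value
    d_w1 = d_hidden * x                  # blame assigned to w1
    return loss, d_w1, d_w2

loss, grad_w1, grad_w2 = backward(x=2.0, y=1.0, w1=3.0, w2=0.5)
print(loss, grad_w1, grad_w2)  # 4.0 4.0 24.0
```

Here `w2` receives six times more blame than `w1`, so a small change to `w2` would move the loss far more. Each backward step reuses the result of the step before it, which is why the error has to be traced from the output back toward the input.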

§2.5·~1 min

Gradient descent without the math

Navigating the landscape to find the lowest loss

Key takeaway

Gradient descent is the directional compass the model uses to incrementally update its parameters and find the lowest possible error.

Why this matters for you

Gradient descent always steps toward lower loss, but it doesn't guarantee it will reach the lowest loss possible; it can settle in a valley that is good rather than best. If your training stalls or the model gets stuck, this is the first mechanism to investigate.

Imagine you are blindfolded and dropped onto the side of a mountain, and your only goal is to find the lowest point in the valley. You cannot see the landscape, so you feel the ground with your feet, take a step in the direction that goes downhill, and repeat. Gradient descent is exactly this process: a step-by-step mathematical algorithm that constantly nudges the model's parameters in the direction that reduces the error. The "gradient" is simply the slope of the hill, representing how steeply the error is increasing or decreasing. By always moving opposite to the gradient—stepping downhill—the model incrementally approaches the optimal configuration where the loss is minimized.
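
The downhill walk can be sketched with a single parameter. The loss curve below is a made-up bowl whose lowest point sits at w = 3, and the learning rate (step size) of 0.1 is an arbitrary assumption.

```python
# Gradient descent on a one-parameter "valley" (hypothetical loss curve).

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w: positive means uphill to the right."""
    return 2.0 * (w - 3.0)

def gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        # Step opposite to the slope, i.e. downhill.
        w = w - learning_rate * gradient(w)
    return w

final_w = gradient_descent(start=0.0, learning_rate=0.1, steps=50)
print(round(final_w, 3))  # 3.0
```

The learning rate controls the size of each step. On this particular curve any value above 1.0 overshoots the valley by more than it gains, so the parameter bounces further away with every step instead of settling.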

The Training Loop

How the pieces work together: a one-time initialization, then four repeating steps that iteratively improve the model.

1. Initialize parameters: Start with trainable weights that encode model behavior.
2. Run forward pass: Generate predictions from current weights on training examples.
3. Compute loss: Measure prediction error against labeled outcomes.
4. Backpropagate gradients: Calculate how each weight contributed to the error.
5. Update and repeat: Adjust weights over many batches until validation performance stabilizes.

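
The loop above can be sketched end to end on a toy problem: fitting a straight line to five points. Everything here is an assumption made for illustration, including the data (generated from y = 2x + 1), the learning rate, and the fixed step count, which stands in for the "until validation performance stabilizes" stopping rule.

```python
# A complete training loop for a two-parameter model: prediction = w * x + b.
# Toy data generated from y = 2x + 1 (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def train(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0                       # 1. initialize parameters
    n = len(xs)
    loss = None
    for _ in range(steps):
        predictions = [w * x + b for x in xs]               # 2. forward pass
        errors = [p - y for p, y in zip(predictions, ys)]
        loss = sum(e * e for e in errors) / n               # 3. compute loss
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n  # 4. gradients
        grad_b = sum(2 * e for e in errors) / n
        w -= learning_rate * grad_w                         # 5. update, repeat
        b -= learning_rate * grad_b
    return w, b, loss

w, b, final_loss = train(xs, ys)
print(round(w, 2), round(b, 2))  # approximately 2.0 and 1.0
```

The model is never told that the answer is "2 and 1"; it recovers those values purely by repeating the five steps.
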
§2.6·~1 min

Epochs and iterations

The cycles required to fully absorb patterns

Key takeaway

An epoch is one full pass through the entire training dataset; training requires many epochs for the model to fully absorb the patterns.

Why this matters for you

The number of epochs dictates how long training takes and how much compute it burns. You have to balance the need for a smarter model against the sheer cost of looping through the data repeatedly.

Reading a textbook once rarely guarantees you will ace the final exam; you usually have to read it multiple times for the concepts to stick. In machine learning, an epoch represents one complete pass of the entire training dataset through the model. If you are training a fraud detector on one million historical transactions, the model completes one epoch only after it has looked at all one million examples, made predictions, and updated its parameters along the way. Because one pass is rarely enough to find the bottom of the valley, models are often trained for many epochs, sometimes dozens or hundreds, to solidify their understanding.

The Data Hierarchy

An epoch contains multiple batches; processing one batch is one iteration.

Batch updates (inner loop): Weights are updated after each mini-batch using gradient signals. Steps: forward pass, loss, backward pass, optimizer step.
Epoch cycle (middle loop): One epoch processes the full training set through many batches. Steps: shuffle data, run batches, aggregate training metrics.
Training program (outer loop): Multiple epochs continue until validation no longer improves. Steps: early stopping, checkpointing, generalization checks.

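
The relationship between epochs, batches, and iterations comes down to simple arithmetic. The sketch below reuses the one-million-transaction dataset from the fraud example; the batch size of 256 and the 10 epochs are assumptions picked only to make the numbers concrete.

```python
import math

def iterations_per_epoch(dataset_size, batch_size):
    """Number of batches needed to see every example once.

    The last batch may be smaller than the rest, hence the ceiling.
    """
    return math.ceil(dataset_size / batch_size)

dataset_size = 1_000_000  # from the fraud-detector example
batch_size = 256          # assumed for illustration
epochs = 10               # assumed for illustration

per_epoch = iterations_per_epoch(dataset_size, batch_size)
total_updates = per_epoch * epochs

print(per_epoch)      # 3907
print(total_updates)  # 39070
```

Each of those updates includes a forward pass, a loss calculation, and a backward pass, so doubling the epoch count roughly doubles the compute bill.
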
§2.7·~1 min

Overfitting vs underfitting

The balance between memorization and generalization

Key takeaway

Overfitting is when a model memorizes the training data but fails on new data; underfitting is when it fails to learn anything useful at all.

Why this matters for you

If you only test a model on the data it was trained on, an overfitted model will look like a perfect success right up until the moment it fails catastrophically in production.

Think of a student preparing for a math test. An underfitting student glances at the textbook for five minutes, learns nothing, and fails the test. Underfitting occurs when a model is too simple or hasn't trained long enough to capture the underlying patterns in the data, resulting in poor predictions across the board. It is the equivalent of a real estate model guessing that every house costs exactly $300,000 regardless of the neighborhood or square footage. Underfitting is usually obvious early in the development cycle, and the fix is straightforward: use a more complex model or train it for more epochs. An overfitting student is the opposite case: they memorize the answers to the practice problems and then fail as soon as the test asks a question they have not seen before.
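
The contrast can be sketched with three toy house-price models scored on data they trained on and on data they have never seen. The prices, square footages, and all three models are made up for illustration, and the "memorizer" is a deliberate caricature of overfitting rather than a realistic model.

```python
# (square feet, sale price) pairs; all values are hypothetical.
train_set = [(1000, 210_000), (1500, 290_000), (2000, 410_000), (2500, 490_000)]
test_set = [(1200, 240_000), (1800, 360_000), (2200, 440_000)]

def mean_abs_error(model, data):
    return sum(abs(model(sqft) - price) for sqft, price in data) / len(data)

def underfit_model(sqft):
    """Ignores the input entirely: every house costs $300,000."""
    return 300_000

memory = dict(train_set)
average_price = sum(memory.values()) / len(memory)

def overfit_model(sqft):
    """Recalls training answers exactly; falls back to the average otherwise."""
    return memory.get(sqft, average_price)

def fit_line(data):
    """Least-squares straight line through the training points."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda sqft: slope * sqft + intercept

balanced_model = fit_line(train_set)

for name, model in [("underfit", underfit_model),
                    ("overfit", overfit_model),
                    ("balanced", balanced_model)]:
    print(name,
          round(mean_abs_error(model, train_set)),
          round(mean_abs_error(model, test_set)))
```

The memorizer scores a perfect zero error on the training set and then misses badly on the test set, while the straight line is slightly wrong on both but far closer on unseen houses. Judged on training data alone, the memorizer would look like the best of the three.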

The Fitting Spectrum

Underfitting learns nothing. Overfitting memorizes everything. Optimal fit learns the underlying pattern.

Model Metrics (technical quality signal): Track precision, recall, latency, and drift for reliability.
Business Metrics (commercial value signal): Track conversion, cost-to-serve, cycle time, and retention outcomes.
Decision Metrics (governance signal): Track override rates, incident levels, and escalation quality.

When the model fails to generalize, your job is to fix the data, not the code.

Spotify's Audio Profiles

Spotify's recommendation engine uses millions of parameters to represent the acoustic characteristics of songs. One parameter might loosely correlate with "acousticness," while another correlates with "danceability," but they are just numbers discovered during training. The product team cannot simply hand-tune a specific parameter to fix a bad recommendation, which highlights the opaque nature of learned weights.

Concept check · 1 of 10
Put in order

Put the four steps of a single training iteration in the order they actually happen.

The steps below are listed in scrambled order; arrange them from what runs first to what runs last.

Backpropagation: assign blame for the loss to every parameter that contributed.
Loss calculation: compare predictions to the labels and score how wrong they were.
Weight update: nudge each parameter a small step in the direction that would have reduced the loss.
Forward pass: push a batch of inputs through the network's current weights to produce predictions.

Vetted by Krishna Kumar, Curator, FactorBeam