How Models Learn
Parameters, loss functions, and gradient descent — demystified.
Key takeaway
Training is the process of adjusting parameters to minimize the loss function using gradient descent.
What is a parameter (weight)
The internal settings a model adjusts
Key takeaway
Parameters are the internal settings a model adjusts during training; they are the learned "knowledge" of the system, not hand-coded logic.
Why this matters for you
When a vendor quotes you for fine-tuning, they are charging you to adjust these billions of knobs using your proprietary data. If your data isn't clean, you are paying to turn the knobs in the wrong direction.
Walk into a server room hosting a modern large language model and you won't find a database of facts or a hard drive full of rules. Instead, you'll find an enormous matrix of decimal numbers that define exactly how the model reacts to any input it receives. A parameter, or weight, is simply a numerical value inside the model that determines how much importance to give to a specific piece of input data. Think of them as billions of tiny volume knobs on an audio mixer, where every knob controls the signal passing from one artificial neuron to the next. When we say a model 'learns,' we literally mean it is twisting these billions of knobs up and down until the output matches what we want to see.
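To make the knob analogy concrete, here is a minimal sketch of a single artificial neuron in Python. The function name `neuron` and every number in it are illustrative assumptions, not values from any real model; the only point is that changing one weight changes how much its input counts toward the output.

```python
# A single artificial neuron: each weight is a "knob" that scales one input.
def neuron(inputs, weights, bias):
    """Return the weighted sum of the inputs plus a bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

inputs = [1.0, 2.0, 3.0]       # illustrative input features
weights = [0.5, -1.0, 2.0]     # illustrative parameter values (the knobs)
bias = 0.5

before = neuron(inputs, weights, bias)   # 0.5 - 2.0 + 6.0 + 0.5 = 5.0

# "Turning a knob": raise the second weight from -1.0 to 0.0.
weights[1] = 0.0
after = neuron(inputs, weights, bias)    # 0.5 + 0.0 + 6.0 + 0.5 = 7.0

print(before, after)
```

Training automates this knob-turning across every weight at once instead of doing it by hand.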
What is a loss function
The mathematical definition of a mistake
Key takeaway
The loss function is the mathematical definition of a mistake; it tells the model exactly how wrong its current prediction is so it can adjust.
Why this matters for you
If you don't explicitly define what a "mistake" is for your product, the engineering team will choose a default metric that might optimize for the wrong behavior. You must align the loss function with your actual business goals.
Imagine shooting an arrow at a target while blindfolded, and a coach yelling out exactly how many inches wide you missed the bullseye. That coach is the loss function. A loss function is a mathematical formula that calculates the difference between what the model predicted and what the correct answer actually is. If a model predicts a house will sell for $400,000 and it actually sells for $500,000, the loss function calculates a massive penalty. The entire goal of the training process is to minimize this penalty, driving the loss as close to zero as possible over millions of examples.
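The house-price example can be sketched with one common loss function, mean squared error. The text does not name a specific loss, so this choice is an assumption, as is the helper name `mean_squared_error`. Prices are written in thousands of dollars to keep the numbers readable.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and correct answers."""
    squared_gaps = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_gaps) / len(squared_gaps)

# House prices in thousands of dollars (illustrative numbers).
actual_price = [500.0]

far_off = mean_squared_error([400.0], actual_price)   # missed by 100
close = mean_squared_error([495.0], actual_price)     # missed by 5
perfect = mean_squared_error([500.0], actual_price)   # exact hit

print(far_off, close, perfect)   # prints 10000.0 25.0 0.0
```

Squaring the gap means a miss of 100 is penalized 400 times more than a miss of 5, not 20 times more. A different loss would rank mistakes differently, which is why the choice of loss is a product decision as well as an engineering one.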
Forward pass explained
The core action of generating predictions
Key takeaway
A forward pass is the act of pushing data through the model's layers to generate a single prediction; it is the fundamental action of inference.
Why this matters for you
Every forward pass costs compute, money, and time. When scoping latency requirements for a real-time feature, you are directly dictating how fast the forward pass must execute.
To see the loss function in action, the model first has to make a guess. A forward pass is the process where raw input data enters the model, flows through the network of parameters, and emerges on the other side as a final prediction. When you upload an image of a dog to an AI classifier, the pixels flow through the first layer of parameters to detect edges, then the next layer to detect shapes, until the final layer outputs the word "dog." During a forward pass, the model is simply applying its current knowledge; it is not learning or updating its parameters at all.
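A forward pass through a tiny two-layer network might look like the sketch below. The layer sizes, the ReLU activation, and the names `dense`, `forward`, and `params` are all assumptions made for illustration; real image classifiers are far larger. The final comparison checks that the parameters are unchanged after the prediction is made.

```python
import copy

def relu(x):
    """A common activation function: pass positives, zero out negatives."""
    return max(0.0, x)

def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, params):
    """Push inputs through layer 1, then layer 2, and return the prediction."""
    hidden = [relu(h) for h in dense(inputs, params["w1"], params["b1"])]
    return dense(hidden, params["w2"], params["b2"])

# Illustrative parameter values; a trained model would have learned these.
params = {
    "w1": [[1.0, -1.0], [0.5, 0.5]],
    "b1": [0.0, 0.0],
    "w2": [[1.0, 2.0]],
    "b2": [0.5],
}

snapshot = copy.deepcopy(params)
prediction = forward([2.0, 1.0], params)

print(prediction)            # [4.5]
print(params == snapshot)    # True: inference did not change any weight
```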
Backpropagation explained
How the network learns from its errors
Key takeaway
Backpropagation is the feedback loop that calculates exactly how much each parameter contributed to an error, allowing the model to learn.
Why this matters for you
Understanding backpropagation reveals why training is so much more expensive and complex than inference. It explains why you can't just "teach the model a new rule" in real-time.
If the forward pass is the model taking a test, backpropagation is the teacher grading the test and showing the model exactly where it went wrong. Backpropagation is the algorithm that traces an error backwards through the network to determine which specific parameters were responsible for the mistake. After the loss function calculates the total error, backpropagation acts like a forensic investigator, moving in reverse from the final output layer all the way back to the input layer. It assigns a slice of the blame to every single parameter that played a role in generating the incorrect prediction.
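The blame assignment can be sketched on the smallest possible network: two weights in a chain, with a squared-error loss. Everything here (the two-weight model, the function names, the numbers) is a hypothetical setup chosen so the arithmetic can be followed by hand. Real frameworks compute these derivatives automatically.

```python
def forward(x, w1, w2):
    hidden = w1 * x              # first "layer"
    prediction = w2 * hidden     # second "layer"
    return hidden, prediction

def loss_fn(prediction, target):
    return (prediction - target) ** 2

def backward(x, w1, w2, target):
    """Apply the chain rule from the loss back toward the input."""
    hidden, prediction = forward(x, w1, w2)
    d_loss_d_pred = 2 * (prediction - target)   # how the loss reacts to the output
    d_loss_d_w2 = d_loss_d_pred * hidden        # blame assigned to w2
    d_loss_d_hidden = d_loss_d_pred * w2        # error passed back one layer
    d_loss_d_w1 = d_loss_d_hidden * x           # blame assigned to w1
    return d_loss_d_w1, d_loss_d_w2

x, w1, w2, target = 2.0, 3.0, 0.5, 1.0
grad_w1, grad_w2 = backward(x, w1, w2, target)

# Sanity check: nudge w1 slightly and measure the change in loss directly.
eps = 1e-6
numeric_w1 = (loss_fn(forward(x, w1 + eps, w2)[1], target)
              - loss_fn(forward(x, w1 - eps, w2)[1], target)) / (2 * eps)

print(grad_w1, grad_w2)   # 4.0 24.0
```

The backward pass reuses values from the forward pass (`hidden`, `prediction`) and adds a second sweep through the network, which is one reason training costs more than inference.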
Gradient descent without the math
Navigating the landscape to find the lowest loss
Key takeaway
Gradient descent is the directional compass the model uses to incrementally update its parameters and find the lowest possible error.
Why this matters for you
Gradient descent guarantees the model will try to find a solution, but it doesn't guarantee it will find the best one. If your training stalls or the model gets stuck, this is the mechanism that is failing.
Imagine you are blindfolded and dropped onto the side of a mountain, and your only goal is to find the lowest point in the valley. You cannot see the landscape, so you feel the ground with your feet, take a step in the direction that goes downhill, and repeat. Gradient descent is exactly this process: a step-by-step mathematical algorithm that constantly nudges the model's parameters in the direction that reduces the error. The "gradient" is simply the slope of the hill, representing how steeply the error is increasing or decreasing. By always moving opposite to the gradient (stepping downhill), the model incrementally approaches the optimal configuration where the loss is minimized.
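The downhill walk can be sketched with a single parameter and a bowl-shaped loss whose lowest point is at `w = 3`. The loss function, the starting point, and the step size `learning_rate` are assumptions picked for illustration; the step size is not discussed in the text but any implementation needs one.

```python
def loss(w):
    """A one-parameter "valley" whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w: positive means uphill to the right."""
    return 2.0 * (w - 3.0)

w = 0.0                  # arbitrary starting position on the hillside
learning_rate = 0.1      # hypothetical step size
history = [loss(w)]

for step in range(50):
    w -= learning_rate * gradient(w)   # step opposite to the slope
    history.append(loss(w))

print(round(w, 4), history[0], history[-1] < history[0])
```

This valley has one bottom, so the walk cannot go wrong. On a bumpier loss surface the same loop can settle into a shallow dip and stop there, which is the sense in which gradient descent does not guarantee the best solution.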
The Training Loop
How the four components work together to iteratively improve the model.
Epochs and iterations
The cycles required to fully absorb patterns
Key takeaway
An epoch is one full pass through the entire training dataset; training requires many epochs for the model to fully absorb the patterns.
Why this matters for you
The number of epochs dictates how long training takes and how much compute it burns. You have to balance the need for a smarter model against the sheer cost of looping through the data repeatedly.
Reading a textbook once rarely guarantees you will ace the final exam; you usually have to read it multiple times for the concepts to stick. In machine learning, an epoch represents one complete pass of the entire training dataset through the model. If you are training a fraud detector on one million historical transactions, the model completes one epoch only after it has looked at all one million examples, made predictions, and updated its parameters using the gradients that backpropagation computes. Because one pass is rarely enough to find the bottom of the valley, models are often trained over many epochs, sometimes dozens or hundreds, to solidify their understanding.
The Data Hierarchy
An epoch contains multiple batches; processing one batch is one iteration.
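The relationship between epochs, batches, and iterations can be sketched as two nested loops. The dataset, `batch_size`, `epochs`, and `learning_rate` values below are hypothetical, and the model is a single weight learning the pattern `y = 2 * x`, chosen only so the result is easy to check.

```python
def batches(dataset, batch_size):
    """Yield the dataset in consecutive chunks of batch_size examples."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

# Toy dataset: the pattern to learn is y = 2 * x.
dataset = [(float(x), 2.0 * x) for x in range(1, 9)]   # 8 examples
batch_size = 2          # hypothetical choice: 8 / 2 = 4 iterations per epoch
epochs = 20
learning_rate = 0.01

w = 0.0                 # the single parameter being trained
iterations = 0

for epoch in range(epochs):                      # one epoch = full pass over the data
    for batch in batches(dataset, batch_size):   # one batch = one iteration
        # Gradient of the mean squared error with respect to w for this batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad                # weight update
        iterations += 1

print(iterations, round(w, 4))   # 80 2.0
```

Total iterations equal epochs times batches per epoch, so doubling the epochs or halving the batch size doubles the number of weight updates.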
Overfitting vs underfitting
The balance between memorization and generalization
Key takeaway
Overfitting is when a model memorizes the training data but fails on new data; underfitting is when it fails to learn anything useful at all.
Why this matters for you
If you only test a model on the data it was trained on, an overfitted model will look like a perfect success right up until the moment it fails catastrophically in production.
Think of a student preparing for a math test. An underfitting student glances at the textbook for five minutes, learns nothing, and fails the test. Underfitting occurs when a model is too simple or hasn't trained long enough to capture the underlying patterns in the data, resulting in poor predictions across the board. It is the equivalent of a real estate model guessing that every house costs exactly $300,000 regardless of the neighborhood or square footage. Underfitting is usually obvious early in the development cycle, and the fix is straightforward: use a more complex model or train it for more epochs.
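The difference between the two failure modes shows up when error on the training data is compared with error on data the model has never seen. The sketch below uses made-up house prices and three stand-in models: a constant guess for underfitting, a nearest-example lookup as a simple proxy for memorization, and a straight-line fit. All names and numbers are assumptions for illustration.

```python
def mse(model, data):
    """Mean squared error of a model over (input, answer) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# (size in thousands of sq ft, price in thousands of dollars); made-up numbers.
train = [(1.0, 210.0), (2.0, 290.0), (3.0, 410.0), (4.0, 490.0)]
held_out = [(1.4, 240.0), (2.6, 360.0), (3.4, 440.0)]

# Underfit: ignore the input and always guess the average training price.
mean_price = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_price

# Overfit stand-in: memorize the training set, echo the closest example's price.
def memorizer(x):
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Reasonable fit: least-squares straight line through the training data.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_price) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_price - slope * mean_x
def linear(x):
    return slope * x + intercept

for name, model in [("underfit", underfit), ("memorizer", memorizer),
                    ("linear", linear)]:
    print(name, round(mse(model, train), 1), round(mse(model, held_out), 1))
```

The memorizer scores a perfect zero on the training set yet does worse than the straight line on the held-out houses, while the constant guess is poor on both. Only the held-out column reveals which model actually learned the pattern.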
The Fitting Spectrum
Underfitting learns nothing. Overfitting memorizes everything. Optimal fit learns the underlying pattern.
Real product examples
Spotify's Audio Profiles
Spotify's recommendation engine uses millions of parameters to represent the acoustic characteristics of songs. One parameter might loosely correlate with "acousticness," while another correlates with "danceability," but they are just numbers discovered during training. The product team cannot simply hand-tweak a single parameter to fix a bad recommendation, which highlights how opaque learned weights are.
Put the four steps of a single training iteration in the order they actually happen.
Arrange the steps from what runs first (top) to what runs last (bottom). They are listed here out of order.
- Backpropagation: assign blame for the loss to every parameter that contributed.
- Loss calculation: compare predictions to the labels and score how wrong they were.
- Weight update: nudge each parameter a small step in the direction that would have reduced the loss.
- Forward pass: push a batch of inputs through the network's current weights to produce predictions.

Vetted by Krishna Kumar, Curator, FactorBeam

