Founder 01 · Chapter 2 of 8

How Models Learn — What It Means for Your Company

~8 min essentials·22 min full·8 sections

Parameters, training costs, feedback loops, overfitting, and generalisation — explained for founders allocating capital, runway, and hiring.

Key takeaway

Models learn by adjusting billions of numerical weights through expensive offline training cycles. Every 'the model needs more data' request is a runway and roadmap decision — not a technical footnote.

§2.1·~1 min

What a model actually is

A very expensive function that gets better with data — and what 'better' costs

Key takeaway

A trained model is not a database of facts or a rulebook. It is a frozen mathematical function — billions of numeric weights — that maps inputs to outputs. 'Better' means those weights were adjusted on more (and better) data, which always costs compute, time, and labelling.

Why this matters for you

Investors, customers, and your own team will talk about 'the model' as if it were ordinary software where a bug can be patched overnight. Founders who understand what a model actually is make build-vs-buy decisions, set honest timelines, and stop promising instant learning from a single user complaint.

Strip away the marketing and a model is a very large calculator with adjustable internal knobs. Input data flows through layers of weighted connections and emerges as a prediction — a classification, a score, a generated sentence. The entire 'intelligence' of the system is the configuration of those numbers at the moment training stopped.
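
The "frozen function" idea can be made concrete with a toy example. The sketch below assumes a two-input scoring model; the weights, the bias, and the name `predict` are hypothetical illustrations, not taken from any real system.

```python
import math

# Hypothetical weights, frozen at the moment training stopped.
WEIGHTS = (0.8, -0.5)
BIAS = 0.1

def predict(features):
    """Map inputs to a score between 0 and 1 using only the frozen numbers."""
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing into (0, 1)

# The same input always yields the same output until someone retrains.
score_a = predict((1.0, 2.0))
score_b = predict((1.0, 2.0))
print(score_a == score_b)
```

Nothing in `predict` looks anything up or learns from the call. Changing its behaviour means producing new values for `WEIGHTS` and `BIAS`, which is what a training run does.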

§2.2·~1 min

Parameters and weights

Why model size matters to your budget, your latency, and your build-vs-buy call

Key takeaway

Parameters — also called weights — are the individual numbers inside a model that control how input signals are amplified or suppressed. More parameters generally mean more capability, more training cost, and more inference cost. Model size is a budget line, not a vanity metric.

Why this matters for you

When a vendor quotes fine-tuning on a '70 billion parameter model,' they are describing the scale of the asset you are renting or adapting. Founders who do not understand parameter count cannot negotiate infrastructure contracts, evaluate vendor proposals, or explain to the board why GPT-4 costs more per token than GPT-4o-mini.

Think of parameters as volume knobs on an impossibly large mixing board. Each knob controls how much one signal influences the next layer. The model's behaviour is the combined effect of billions of these settings. Parameter count is one of the biggest drivers of both the training bill and the inference bill, alongside how much data you train on and how much traffic you serve.
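
One way to see why size is a budget line is to estimate the memory needed just to hold the weights. This is a rough sketch, assuming 2 bytes per parameter (16-bit precision) and ignoring everything else a running model needs, such as activations and caches. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory to hold the weights alone, in gigabytes (1e9 bytes).

    Assumes 16-bit weights by default; real deployments may use more or fewer bytes.
    """
    return num_parameters * bytes_per_parameter / 1e9

size_70b = weight_memory_gb(70e9)   # a '70 billion parameter' model
size_7b = weight_memory_gb(7e9)     # a model ten times smaller
print(size_70b, size_7b)
```

Under these assumptions the 70B model needs roughly 140 GB for its weights and the 7B model roughly 14 GB, which is the difference between a multi-GPU server and a single accelerator.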

§2.3·~1 min

What training actually costs

Compute, data, and time — the three resources that drain your runway

Key takeaway

Training consumes three scarce resources: GPU compute (dollars per hour), labelled data (weeks of human effort), and calendar time (months before you know if it worked). Every ML roadmap request is a capital allocation decision disguised as an engineering ticket.

Why this matters for you

When your ML lead says 'we need another training run,' they are asking for a budget line item, not a code change. Founders who cannot translate training requests into dollars and weeks will approve work that burns runway without moving product-market fit.

Training is not a one-time software build — it is a recurring R&D experiment. Each training run rents clusters of GPUs for days or weeks, consumes electricity at data-center scale, and requires engineers to babysit loss curves in case the run collapses halfway through. For a seed-stage startup, a single serious fine-tuning project can cost $20K–$100K in compute alone — before you count the salaries of the people running it.
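
A training request can be turned into a dollar figure with simple arithmetic. The sketch below is illustrative only: the GPU count, duration, hourly price, and number of attempts are hypothetical placeholders, and the function name `compute_cost_usd` is not from any vendor's tooling.

```python
def compute_cost_usd(num_gpus, hours, usd_per_gpu_hour):
    """Rental cost of one training run: GPUs x hours x hourly price."""
    return num_gpus * hours * usd_per_gpu_hour

# Hypothetical request: 16 GPUs for 10 days at $3.00 per GPU-hour.
single_run = compute_cost_usd(num_gpus=16, hours=10 * 24, usd_per_gpu_hour=3.0)

# Runs rarely succeed on the first attempt; assume 3 attempts for planning.
planned_attempts = 3
project_compute = single_run * planned_attempts

print(single_run, project_compute)
```

With these placeholder numbers, one run is $11,520 and the three-attempt project is $34,560 in compute, before salaries and labelling.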

Batch updates (inner loop): Weights are updated after each mini-batch using gradient signals. Steps: forward pass, loss, backward pass, optimizer step.

Epoch cycle (middle loop): One epoch processes the full training set through many batches. Steps: shuffle data, run batches, aggregate training metrics.

Training program (outer loop): Multiple epochs continue until validation no longer improves. Steps: early stopping, checkpointing, generalization checks.
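
The three nested loops described above can be sketched in a few lines. This is a toy illustration, assuming a model with one weight and synthetic data where the right answer is about 3. The learning rate, batch size, and patience values are arbitrary choices for the example.

```python
import random

random.seed(0)

# Hypothetical toy data: y is roughly 3 * x, with a little noise.
points = [(i / 10, 3 * (i / 10) + random.uniform(-0.1, 0.1)) for i in range(40)]
random.shuffle(points)
train, val = points[:30], points[30:]

def mean_squared_error(w, rows):
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)

w = 0.0                 # the single trainable weight
learning_rate = 0.02
batch_size = 5
patience = 3            # epochs to wait for a validation improvement
best_val, best_w, stale_epochs = float("inf"), w, 0
epochs_run = 0

for epoch in range(200):                          # outer loop: training program
    epochs_run = epoch + 1
    random.shuffle(train)                         # middle loop: one epoch
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        # inner loop: forward pass, loss gradient, optimizer step
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad
    val_loss = mean_squared_error(w, val)
    if val_loss < best_val:
        best_val, best_w, stale_epochs = val_loss, w, 0   # checkpoint the best weight
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break                                 # early stopping

print(f"stopped after {epochs_run} epochs, best weight {best_w:.2f}")
```

The loop stops when validation stops improving, not when the budget runs out. On a real model each pass of the outer loop is billed GPU time.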

§2.4·~1 min

Loss functions and model improvement

The feedback loop that turns usage data into compounding advantage

Key takeaway

A loss function is the mathematical definition of 'mistake' your model optimises against. Whatever you measure is exactly what the model will chase — including behaviours that destroy trust, margin, or regulatory standing.

Why this matters for you

Companies with more usage generate more labelled outcomes, which enables better loss signals, which produces better models, which attract more usage. Founders who do not design their product to capture that signal are leaving the compounding loop on the table for competitors.

Every training cycle follows the same loop: predict, measure error, assign blame, adjust weights. The loss function scores how wrong the prediction was. Backpropagation traces that error back through billions of parameters. Gradient descent nudges each weight to reduce the score next time. There is no shortcut where a thumbs-down instantly rewires the model. Feedback becomes training data; training data becomes the next run.

Initialize parameters: Start with trainable weights that encode model behavior.

Run forward pass: Generate predictions from current weights on training examples.

Compute loss: Measure prediction error against labeled outcomes.

Backpropagate gradients: Calculate how each weight contributed to the error.

Update and repeat: Adjust weights over many batches until validation performance stabilizes.
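
The five steps above can be traced by hand on the smallest possible model. The sketch assumes one weight, one labelled example, and a squared-error loss; the numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
# One labelled example (hypothetical): input 2.0 should produce 6.0.
x, y_true = 2.0, 6.0

w = 0.0                 # 1. initialize the parameter
learning_rate = 0.05
history = []

for step in range(3):
    y_pred = w * x                          # 2. forward pass
    loss = (y_pred - y_true) ** 2           # 3. compute loss (squared error)
    grad = 2 * (y_pred - y_true) * x        # 4. backpropagate: d(loss)/d(w)
    w = w - learning_rate * grad            # 5. update, then repeat
    history.append((round(loss, 4), round(w, 4)))

print(history)
# [(36.0, 1.2), (12.96, 1.92), (4.6656, 2.352)]
```

The loss falls from 36 to about 4.7 in three steps as the weight moves toward 3, the value that maps 2.0 to 6.0. A production model repeats the same update across billions of weights and many batches, which is where the cost comes from.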

§2.5·~1 min

Overfitting — the startup analogy

Building so perfectly for your first ten customers that you fail the next thousand

Key takeaway

Overfitting is when a model memorises your training data — including noise, outliers, and quirks — instead of learning patterns that generalise. It is the ML equivalent of building a product so tailored to your design-partner accounts that it cannot scale to your actual market.

Why this matters for you

Overfitted models can score 99% in demos and still collapse in production. Founders who accept vendor metrics without asking what data they were measured on are signing up for the most expensive kind of wrong: one that looks like success until customers churn.

Picture a student who memorises answer keys instead of learning the subject. On the practice test, they score perfectly. On the real exam with slightly different questions, they fail. In a startup context, this is the pilot that works flawlessly in your office with your data and dies at the first enterprise customer with different document formats, edge cases, and user behaviour.
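
The answer-key student can be written as code. In this sketch the task, the data, and both "models" are hypothetical: one memorises the training answers, the other learns a single threshold from the same examples.

```python
# Hypothetical task: label a number "big" when it is 50 or more.
def true_label(n):
    return "big" if n >= 50 else "small"

train_inputs = [3, 12, 47, 51, 68, 90]
test_inputs = [5, 30, 40, 55, 77, 99]      # never seen during training

# Overfitted model: a lookup table of the exact training answers.
memorised = {n: true_label(n) for n in train_inputs}

def memoriser(n):
    return memorised.get(n, "small")        # blind guess on anything new

# Generalising model: learn one threshold from the training examples.
largest_small = max(n for n in train_inputs if true_label(n) == "small")
smallest_big = min(n for n in train_inputs if true_label(n) == "big")
threshold = (largest_small + smallest_big) / 2

def threshold_rule(n):
    return "big" if n >= threshold else "small"

def accuracy(model, inputs):
    return sum(model(n) == true_label(n) for n in inputs) / len(inputs)

print(accuracy(memoriser, train_inputs), accuracy(memoriser, test_inputs))
print(accuracy(threshold_rule, train_inputs), accuracy(threshold_rule, test_inputs))
```

Both models are perfect on the training inputs. Only the unseen inputs separate them, which is why the question to ask a vendor is what data the headline metric was measured on.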

Input layer: Raw features enter the network (tokens, pixels, tabular fields).

Hidden layers: Combinations of signals (millions to billions of weights).

Output layer: Final prediction or generation (class probabilities, next token, embedding).
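
To see how the weight count grows with layer width, here is a small counting sketch. It assumes a plain fully connected network where every unit in one layer connects to every unit in the next; the layer sizes are hypothetical, and large production models use more elaborate layouts.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical networks: [input, hidden, hidden, output]
small = count_parameters([10, 32, 32, 3])
wide = count_parameters([1000, 4096, 4096, 10])
print(small, wide)
```

The small network has about 1,500 parameters and the wide one about 21 million. Widening the hidden layers grows the count roughly with the square of the width.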

§2.6·~1 min

Generalisation

Why a model that wins your demo can still lose product-market fit

Key takeaway

Generalisation is the model's ability to perform on data it has never seen — new users, new geographies, new document formats, new fraud patterns. A model that only works in your demo environment is negative ROI with good marketing.

Why this matters for you

Product-market fit is generalisation under commercial pressure. Founders who conflate pilot success with market readiness discover at Series A that the model works for three design partners and fails for the segment they need to grow into.

Training teaches patterns; generalisation proves you learned the right ones. A churn model trained on US SaaS customers may fail in EU markets with different payment behaviour. A document parser trained on clean PDFs may fail on phone photos of contracts. Every new segment, geography, or use case is a generalisation test — not a configuration change.
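
A single headline accuracy can hide a segment where the model fails. The sketch below breaks accuracy out by segment; the evaluation records and the function name `accuracy_by_segment` are hypothetical.

```python
from collections import defaultdict

# Hypothetical evaluation records: (segment, model_prediction, actual_outcome)
records = [
    ("US", "churn", "churn"), ("US", "stay", "stay"),
    ("US", "stay", "stay"), ("US", "churn", "churn"),
    ("EU", "stay", "churn"), ("EU", "churn", "churn"),
    ("EU", "stay", "churn"), ("EU", "stay", "stay"),
]

def accuracy_by_segment(rows):
    """Fraction of correct predictions within each segment."""
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, predicted, actual in rows:
        totals[segment] += 1
        hits[segment] += predicted == actual
    return {segment: hits[segment] / totals[segment] for segment in totals}

overall = sum(predicted == actual for _, predicted, actual in records) / len(records)
per_segment = accuracy_by_segment(records)
print(overall, per_segment)
```

In this made-up data the overall figure is 75%, which hides a perfect score on US customers and a coin flip on EU customers. Asking for the per-segment breakdown before entering a new market is a cheap generalisation test.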

§2.7·~1 min

Transfer learning

Why you do not train from scratch — and what you actually pay for with an API

Key takeaway

Transfer learning starts from a model that already learned language, vision, or audio from massive datasets, then adapts it to your domain with far less data and compute. APIs sell you access to a model someone else already trained and adapted: you pay per use, not per training run.

Why this matters for you

Training GPT-4 from scratch is a multi-hundred-million-dollar bet. Fine-tuning Llama on your support transcripts is a five-figure project. Using OpenAI's API is a per-token line item. Founders who understand transfer learning choose the right tier instead of accidentally funding frontier pre-training.

Pre-training teaches general capability; adaptation teaches your context. A base model already understands English grammar, code syntax, or image edges. Fine-tuning nudges weights toward your tone, format, or task — not toward learning language from zero. Transfer learning is the startup cost advantage — if you use it instead of reinventing it.
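
The split between a frozen base and a small trainable part can be sketched as follows. This is a toy stand-in, not how any particular fine-tuning service works: the "pre-trained" base is two fixed numbers, the dataset is four hypothetical examples, and only the two `head` weights are trained.

```python
# Hypothetical "pre-trained" base: these weights are frozen and never updated here.
BASE_WEIGHTS = (0.5, 1.5)

def base_features(x):
    """Frozen feature extractor standing in for a large pre-trained model."""
    return (BASE_WEIGHTS[0] * x, BASE_WEIGHTS[1] * x * x)

# Small domain dataset (hypothetical): the target is 2 * f0 + 1 * f1.
examples = [(x, 2 * base_features(x)[0] + base_features(x)[1])
            for x in (-1.0, -0.5, 0.5, 1.0)]

head = [0.0, 0.0]        # the only trainable parameters
learning_rate = 0.1

for _ in range(500):
    for x, target in examples:
        f = base_features(x)
        error = head[0] * f[0] + head[1] * f[1] - target
        # Gradient step on the head only; the base is left untouched.
        head[0] -= learning_rate * 2 * error * f[0]
        head[1] -= learning_rate * 2 * error * f[1]

print([round(h, 3) for h in head])
```

Four examples are enough here because the base already supplies useful features and only two numbers need to be learned. Training the base from scratch would need far more data and compute.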

§2.8·~1 min

Founder decision lens

What 'the model needs more data' means for your roadmap and runway

Key takeaway

'We need more data' is never a throwaway engineering comment. It is a request for labelling budget, calendar time, hiring, and a delayed launch — with no guarantee the next training run generalises better. Translate it before you approve it.

Why this matters for you

Boards approve headcount and burn based on milestones. ML teams approve training runs based on data readiness. Founders who bridge those languages avoid the worst outcome: burning a quarter retraining a model that still fails the enterprise pilot.

When engineering asks for more data, ask four questions before approving budget. How many labelled examples, by when, at what cost per label? What validation metric must improve to justify the run? What happens if the run fails? What is the opportunity cost versus shipping with current performance plus human review? If the team cannot answer them, the problem is scoping — not GPU shortage.
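
The four questions can be kept as a simple checklist that makes missing answers visible. The sketch below is one hypothetical way to record them; the class name `DataRequest`, its fields, and the sample figures are illustrative assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class DataRequest:
    """Hypothetical record of the answers needed before approving budget."""
    labelled_examples: Optional[int] = None
    deadline: Optional[str] = None
    cost_per_label_usd: Optional[float] = None
    target_metric: Optional[str] = None
    plan_if_run_fails: Optional[str] = None
    opportunity_cost_note: Optional[str] = None

def unanswered(request):
    """Names of the questions the team has not answered yet."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]

def labelling_budget_usd(request):
    return request.labelled_examples * request.cost_per_label_usd

# Hypothetical request with only the first two questions answered.
request = DataRequest(
    labelled_examples=20_000,
    deadline="end of quarter",
    cost_per_label_usd=0.50,
    target_metric="validation F1 from 0.81 to 0.86",
)
missing = unanswered(request)
budget = labelling_budget_usd(request)
print(budget, missing)
```

Here the labelling budget is $10,000, but `missing` still lists the failure plan and the opportunity cost. By the rule above, that makes this request a scoping problem to resolve before any compute is approved.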

Real product examples

As a founder: when your ML team asks for another training run, translate it into dollars, weeks, and board-level risk before you say yes. Learning is never free, instant, or guaranteed to generalise.

OpenAI's GPT-4 — capability as frozen weights

GPT-4 is not 'looking things up' when you ask a question. It is running your prompt through a fixed set of weights produced by a training run that reportedly cost nine figures and took months of calendar time. OpenAI sells access to that frozen artifact via API. Founders buying API access are renting someone else's training CapEx, not avoiding the economics of learning.

Vetted by Krishna Kumar, Curator, FactorBeam