Founder 01 · Chapter 3 of 8

Training vs Inference — Your Two Biggest Cost Lines

Training is upfront CapEx; inference is per-user OpEx. Confusing them destroys unit economics and kills Series A companies that scale before the spreadsheet says they can.

Key takeaway

Training buys capability once. Inference rents capability on every click. Founders who model inference cost per user before launch survive viral growth; founders who do not get a bridge-round conversation they did not plan for.

§3.1·~1 min

Training cost

The one-time capital expenditure — and why most startups should never pay it

Key takeaway

Training is the massive offline investment that produces a frozen model: GPU clusters, months of calendar time, and irrecoverable spend if the run fails. It is R&D CapEx, not a feature launch. Almost every startup should rent pre-trained models instead of funding pre-training.

Why this matters for you

Boards and investors will ask why you are not 'building your own AI.' The answer is arithmetic: frontier training runs cost eight to nine figures. Founders who cannot articulate this confuse capital allocation and approve projects that consume a Series A without producing a deployable product.

Training is what happens before any user sees your product. Clusters of GPUs run for weeks or months, processing trillions of tokens or millions of images, adjusting weights through billions of forward-backward cycles. You are paying to manufacture the asset; you have not yet paid to operate it.

Training (CapEx)

Upfront model creation

Compute-heavy runs produce model weights before revenue.

Inference (OpEx)

Continuous serving cost

Each user request consumes tokens and scales with usage.

§3.2·~1 min

Inference cost

The recurring operational expenditure on every user click

Key takeaway

Inference is the per-query cost of running the frozen model: input tokens processed, output tokens generated, GPUs rented by the millisecond. It is COGS. Every user interaction spends money; unlike traditional SaaS, the marginal cost of serving one more request is not close to zero.

Why this matters for you

Your gross margin is set at inference time, not at fundraising time. Founders who launch AI features without a per-query cost model discover at 10× usage that their best customers are their least profitable — often during the same quarter they are pitching Series A.

Inference is a forward pass — data in, prediction out, weights unchanged. The model does not learn during inference. It applies the weights produced by training to calculate the next token, classification, or embedding. Training is a bonfire you light once. Inference is a meter that runs forever.
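To make the meter concrete, here is a minimal sketch of a per-query cost calculation. The prices are placeholders chosen for illustration, not quotes from any vendor, and the names `TokenPrices` and `cost_per_query` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Hypothetical per-million-token prices in USD; replace with your vendor's rates."""
    input_per_million: float
    output_per_million: float


def cost_per_query(input_tokens: int, output_tokens: int, prices: TokenPrices) -> float:
    """Return the token cost of one inference request in USD."""
    input_cost = input_tokens / 1_000_000 * prices.input_per_million
    output_cost = output_tokens / 1_000_000 * prices.output_per_million
    return input_cost + output_cost


# Placeholder prices for illustration only.
EXAMPLE_PRICES = TokenPrices(input_per_million=3.0, output_per_million=15.0)

# One 'Summarise' click: 2,000 prompt tokens in, 600 tokens out.
summarise_cost = cost_per_query(2_000, 600, EXAMPLE_PRICES)
print(f"Cost of one summarise click: ${summarise_cost:.4f}")
```

Under these assumed prices the 600 output tokens cost more than the 2,000 input tokens, which is why output length deserves its own line in the model.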

§3.3·~1 min

The unit economics trap

Technically impressive and financially catastrophic — at the same time

Key takeaway

The unit economics trap is shipping an AI feature customers love whose inference cost per user exceeds the revenue that user generates. Viral adoption makes this worse, not better — there are no economies of scale in the forward pass.

Why this matters for you

Series A investors will model your gross margin. If your AI feature drags blended margin below 50% with no path to improvement, you are a services business wearing a SaaS multiple. Founders who catch this pre-launch keep fundraising optionality; founders who catch it post-viral spike negotiate from weakness.

Traditional SaaS unit economics assume near-zero marginal cost per user. AI breaks that assumption permanently. Each generation requires fresh compute proportional to tokens processed. Caching helps only when queries repeat identically — rare in conversational or personalised products. Your LTV:CAC ratio means nothing if gross margin on the AI SKU is negative.
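A rough sketch of the trap in numbers, assuming a flat subscription price and a fixed cost per query. The plan price, query counts, and cost per query below are invented for illustration.

```python
def monthly_gross_margin(price_per_user: float, queries_per_month: int,
                         cost_per_query: float) -> float:
    """Gross margin fraction for one user: (revenue - inference COGS) / revenue."""
    if price_per_user <= 0:
        raise ValueError("price_per_user must be positive")
    cogs = queries_per_month * cost_per_query
    return (price_per_user - cogs) / price_per_user


# Hypothetical figures: a $20/month plan and $0.015 of inference per query.
PLAN_PRICE = 20.0
COST_PER_QUERY = 0.015

light_user = monthly_gross_margin(PLAN_PRICE, 200, COST_PER_QUERY)    # 200 queries/month
heavy_user = monthly_gross_margin(PLAN_PRICE, 2_000, COST_PER_QUERY)  # 2,000 queries/month

print(f"Light user margin: {light_user:.0%}")
print(f"Heavy user margin: {heavy_user:.0%}")
```

With these assumptions the light user is comfortably profitable and the heavy user, the one who loves the feature most, has a negative margin on the same plan.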

Launch AI feature: Product demand increases quickly.

Usage scales: Inference volume rises with engagement.

Revenue lags: Monetization does not match variable cost.

Margins compress: COGS outpaces contribution per user.

Corrective action: Reprice, optimize, or redesign workload.

§3.4·~1 min

API dependency vs model ownership

The strategic tradeoff you must choose — not stumble into

Key takeaway

API dependency means paying retail per-token prices with zero infrastructure overhead. Model ownership means renting GPUs 24/7 and hiring ML ops — but capturing wholesale inference margins at scale. Neither is universally correct; the answer is a function of volume, capital, and control requirements.

Why this matters for you

Founders who default to APIs without a migration thesis pay forever. Founders who self-host too early pay idle GPU bills while product-market fit is still uncertain. The decision should be modelled, dated, and revisited at revenue milestones — not inherited from engineering preference.

APIs trade margin for speed and optionality. You ship in days, scale instantly, and swap models when vendors release improvements. You pay a markup that funds the vendor's training CapEx and profit. APIs are the correct seed-stage default for most products.

API dependency

Fast launch

Low setup burden

Vendor-priced inference

Lower control

Model ownership

Higher setup burden

Dedicated infra

Lower marginal cost at scale

Higher control
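The tradeoff above can be modelled as a break-even volume. The sketch below assumes a simple linear cost structure: the API charges per token with no fixed cost, while self-hosting has a fixed monthly bill (GPUs plus ML ops) and a lower marginal cost per token. All dollar figures and function names are hypothetical.

```python
from typing import Optional


def api_monthly_cost(tokens_per_month: float, api_price_per_million: float) -> float:
    """API bill: purely variable, no fixed infrastructure cost."""
    return tokens_per_month / 1_000_000 * api_price_per_million


def self_hosted_monthly_cost(tokens_per_month: float, fixed_monthly: float,
                             marginal_price_per_million: float) -> float:
    """Self-hosted bill: fixed GPU and staffing cost plus a smaller per-token cost."""
    return fixed_monthly + tokens_per_month / 1_000_000 * marginal_price_per_million


def break_even_tokens_per_month(api_price_per_million: float, fixed_monthly: float,
                                marginal_price_per_million: float) -> Optional[float]:
    """Monthly token volume where both options cost the same.

    Returns None when self-hosting never catches up with the API price.
    """
    saving_per_million = api_price_per_million - marginal_price_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly / saving_per_million * 1_000_000


# Hypothetical inputs: $10 per million tokens via API, $40,000/month fixed
# for GPUs and ML ops, $2 per million tokens marginal cost when self-hosted.
volume = break_even_tokens_per_month(10.0, 40_000.0, 2.0)
print(f"Break-even volume: {volume:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper under these assumptions; above it, ownership starts to pay back. Rerunning the calculation at each revenue milestone gives the dated migration thesis a number.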

§3.5·~1 min

Inference cost at scale

The numbers that kill Series A companies between 1,000 and 100,000 users

Key takeaway

Going from 1,000 to 100,000 users can multiply inference spend 100× while revenue per user stays flat. Auto-regressive generation makes long outputs disproportionately expensive. There is no volume discount on the forward pass.

Why this matters for you

Series A decks show hockey-stick revenue. They often omit hockey-stick COGS. Founders who model inference at 10× and 100× current usage avoid the emergency bridge round caused by a viral feature that loses money on every click.

Auto-regressive generation makes output length a cost multiplier. Each generated token requires another full forward pass through the model, so cost grows with every token of output. A 2,000-token report needs roughly twice the sequential compute of a 1,000-token report, and output tokens are typically priced at a premium over input tokens. Default verbosity in your product is a COGS policy, not a UX accident.
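A small sketch of how user growth and output length both move the monthly bill. Token counts, query volume, and the default prices are illustrative assumptions, and the function name is hypothetical.

```python
def monthly_inference_spend(users: int, queries_per_user_per_month: int,
                            input_tokens: int, output_tokens: int,
                            input_price_per_million: float = 3.0,
                            output_price_per_million: float = 15.0) -> float:
    """Monthly inference spend in USD under placeholder per-million-token prices."""
    per_query = (input_tokens * input_price_per_million
                 + output_tokens * output_price_per_million) / 1_000_000
    return users * queries_per_user_per_month * per_query


# Baseline: 1,000 users, 30 queries each, 1,500 tokens in and 1,000 tokens out.
base = monthly_inference_spend(1_000, 30, 1_500, 1_000)

# Same product at 100,000 users: spend scales linearly with usage.
scaled = monthly_inference_spend(100_000, 30, 1_500, 1_000)

# Same 1,000 users, but default output doubled to 2,000 tokens.
verbose = monthly_inference_spend(1_000, 30, 1_500, 2_000)

print(f"Baseline:        ${base:,.2f}")
print(f"100x users:      ${scaled:,.2f}")
print(f"2x output length: ${verbose:,.2f}")
```

In this sketch the 100× user jump multiplies spend by 100, and doubling output length alone adds most of another baseline bill because output tokens carry the higher assumed price.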

§3.6·~1 min

Cost optimisation levers

Caching, routing, compression, and smaller models — before you raise a bridge round

Key takeaway

Inference cost is not fixed. Caching, model routing, prompt compression, batching, and distillation can cut COGS 40–80% without killing the feature — if you invest before crisis, not during it.

Why this matters for you

Investors prefer founders who show margin discipline proactively. A bridge round to 'fix unit economics' signals you shipped without understanding the business model. These levers are cheaper than emergency fundraising.

Model routing sends easy queries to cheap models and hard queries to expensive ones. A classifier or small model can handle FAQ routing, extraction, or simple summarisation. Reserve frontier models for multi-step reasoning. Ask engineering for a routing architecture in the MVP spec, not the Series A retrofit.
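A minimal routing sketch, assuming each request already carries a task type. In a real system that label would come from a classifier or small model; here it is passed in directly. The tier names, per-query costs, and task labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_query: float  # USD, hypothetical


CHEAP = ModelTier("small-model", 0.001)
FRONTIER = ModelTier("frontier-model", 0.02)

# Task types assumed to be safe for the cheap tier.
EASY_TASKS = {"faq", "extraction", "simple_summary"}


def route(task_type: str) -> ModelTier:
    """Send known-easy task types to the cheap tier, everything else to the frontier tier."""
    return CHEAP if task_type in EASY_TASKS else FRONTIER


def blended_cost(task_types: list[str]) -> float:
    """Average cost per query for a workload after routing."""
    if not task_types:
        raise ValueError("task_types must not be empty")
    return sum(route(t).cost_per_query for t in task_types) / len(task_types)


# Hypothetical workload: 70% FAQ lookups, 30% multi-step reasoning.
workload = ["faq"] * 70 + ["multi_step_reasoning"] * 30
routed = blended_cost(workload)
print(f"Blended cost with routing: ${routed:.4f} vs ${FRONTIER.cost_per_query:.4f} all-frontier")
```

Unknown task types default to the frontier tier, so a routing mistake costs money rather than quality. That default is a design choice worth discussing with engineering.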

Route requests: Match workload to cheapest model that clears quality.

Reduce context: Pass only relevant evidence into prompts.

Cache repeated work: Avoid paying twice for identical answers.

Batch non-urgent jobs: Shift throughput work to efficient windows.

Distill and tune: Keep quality while lowering serving cost.

§3.7·~1 min

The inference cost conversation with your CTO

Five questions every founder should ask — and understand the answers to

Key takeaway

You do not need to write CUDA kernels. You need to ask five questions about every AI feature and understand whether the answers fit your margin model: cost per query, p95 latency, model tier, trigger logic, and 10× scale projection.

Why this matters for you

CTOs optimise for capability and reliability by default. Founders optimise for survival and margin. The inference cost conversation aligns both, or surfaces that you need a different technical lead for this phase of the company.

Question 1: What is all-in cost per query at median and p95 usage? Include input tokens, output tokens, retrieval, tool calls, and orchestration overhead. Median tells you pricing; p95 tells you tail risk. Do not accept 'it depends' without a worked example using real pilot data.
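One way to turn Question 1 into a worked example is to compute median and p95 directly from a log of per-query costs. The sketch below uses Python's standard `statistics` module; the pilot costs are invented, and `cost_percentiles` is a hypothetical helper name.

```python
import statistics


def cost_percentiles(query_costs: list[float]) -> dict[str, float]:
    """Return median and 95th-percentile all-in cost per query in USD."""
    if len(query_costs) < 2:
        raise ValueError("need at least two observations")
    median = statistics.median(query_costs)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(query_costs, n=100, method="inclusive")[94]
    return {"median": median, "p95": p95}


# Invented pilot log: most queries are cheap, a few long agentic runs are not.
pilot_costs = [0.010] * 80 + [0.030] * 15 + [0.250] * 5
summary = cost_percentiles(pilot_costs)
print(f"Median: ${summary['median']:.3f}  p95: ${summary['p95']:.3f}")
```

If p95 sits many multiples above the median, as in this invented log, pricing on the median alone leaves the tail unfunded.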

§3.8·~1 min

Founder decision lens

Building your AI cost model before investors ask for it

Key takeaway

Before you sign any AI infrastructure contract, build a spreadsheet: cost per query × queries per user × users × model price trajectory. Include training/fine-tuning as one-time rows and inference as monthly rows. Update it when vendors change pricing.

Why this matters for you

Investors will ask for AI unit economics in Series A diligence. Customers will ask for predictability in enterprise contracts. Your future self will ask why you launched without caps. The spreadsheet is the founder tool that prevents all three conversations from becoming surprises.

Start with a single feature and model it honestly. Inputs: average prompt tokens, average output tokens, model price per million tokens, queries per user per day, expected MAU. If COGS exceeds 30% of ARPU at expected scale, the feature economics need redesign before marketing amplifies them.
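The same single-feature model can be sketched in a few lines before it goes into a spreadsheet. Every input value below is a placeholder, and the `FeatureModel` class is a hypothetical structure; the 30% threshold is the one stated above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeatureModel:
    avg_prompt_tokens: int
    avg_output_tokens: int
    input_price_per_million: float   # USD, placeholder
    output_price_per_million: float  # USD, placeholder
    queries_per_user_per_day: float
    mau: int
    arpu_monthly: float              # USD revenue per user per month
    days_per_month: int = 30

    def cost_per_query(self) -> float:
        return (self.avg_prompt_tokens * self.input_price_per_million
                + self.avg_output_tokens * self.output_price_per_million) / 1_000_000

    def cogs_per_user_monthly(self) -> float:
        return self.cost_per_query() * self.queries_per_user_per_day * self.days_per_month

    def monthly_inference_cost(self) -> float:
        return self.cogs_per_user_monthly() * self.mau

    def cogs_share_of_arpu(self) -> float:
        return self.cogs_per_user_monthly() / self.arpu_monthly

    def needs_redesign(self, threshold: float = 0.30) -> bool:
        """True when inference COGS exceeds the given share of ARPU."""
        return self.cogs_share_of_arpu() > threshold


# Placeholder inputs for one feature.
expected = FeatureModel(
    avg_prompt_tokens=1_500, avg_output_tokens=500,
    input_price_per_million=3.0, output_price_per_million=15.0,
    queries_per_user_per_day=10, mau=20_000, arpu_monthly=15.0,
)
# Stress case: engagement doubles while ARPU stays flat.
stressed = replace(expected, queries_per_user_per_day=20)

for label, model in (("expected", expected), ("stressed", stressed)):
    print(f"{label}: COGS share of ARPU {model.cogs_share_of_arpu():.0%}, "
          f"redesign needed: {model.needs_redesign()}")
```

Under these assumed inputs the feature clears the 30% line at expected usage and fails it once engagement doubles, which is the scenario to check before marketing amplifies the feature.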

Real product examples

As a founder: when your CEO or board asks why you are not training your own foundation model, explain that you are buying variable inference to avoid fixed training CapEx — and show the spreadsheet that says when that trade flips.

OpenAI GPT-4 — nine-figure training bet

GPT-4's training run reportedly exceeded $100M in compute alone, ran for months, and answered zero customer queries during that period. OpenAI amortises that CapEx across API revenue and enterprise contracts. Founders buying tokens are renting the outcome of that bet — not replicating it.

Concept check · 1 of 6

Sort into categories

Sort each line item into the correct cost bucket.

Renting 512 H100 GPUs for four weeks to pre-train a foundation model.

Every user click on 'Summarise' generating 600 output tokens.

A two-day fine-tuning run on 3,000 labelled sales emails.

Nightly batch scoring of support tickets with your classifier.

OpenAI API bill charged per million tokens in production.

Labelling 40,000 medical images for a new training dataset.

Training (CapEx)

Inference (OpEx)

Vetted by Krishna Kumar, Curator, FactorBeam