PM 01 · Chapter 3 of 7

Training vs Inference — Two completely different operations with very different costs

Training is a massive one-time R&D cost; inference is the ongoing variable cost of running the model. Confuse them and viral success will bankrupt your margins.

Key takeaway

Training is a massive fixed R&D cost; inference is an ongoing variable cost. You rent intelligence during inference to avoid the capital expenditure of training.

§3.1·~1 min

What is training

The learning phase — slow, expensive, done offline

Key takeaway

Training is the massive, offline, upfront capital expenditure where the model learns; it is an R&D investment you make once per version.

Why this matters for you

When your CEO asks why the new open-source model isn't running on your servers yet, you need to explain that serving the model is easy, but if you want to teach it your proprietary data from scratch, you are asking for a multi-million dollar GPU budget and a six-month timeline.

Walk into a server farm running a large language model and you will see racks of specialized hardware pulling enough electricity to power a small town, running for months without a single user query. You are not looking at a product serving customers. You are looking at the brute-force computational effort required to compress the internet into a statistical artifact. This massive initial computation is what creates the model's intelligence. Training is the phase where the model is actually learning, and it happens entirely behind closed doors long before the first user logs in.

§3.2·~1 min

What is inference

The usage phase — fast, cheap per call, expensive at scale

Key takeaway

Inference is the operational, per-query variable cost of using the frozen model to generate answers for your users.

Why this matters for you

In your next sprint planning, when the team debates the cost of the new AI feature, you must focus the conversation entirely on inference volume. Every time a user clicks the button, you incur a small compute cost that will eat your margins if not capped.

A user pastes a rough draft into your text box, clicks "Polish", and waits two seconds for a beautifully rewritten paragraph to appear. The model did not learn anything about the user, it did not update its internal parameters, and it did not get smarter for the next person. It simply pushed the user's text through its frozen web of parameters to calculate the most statistically likely response. This process of generating an output from a frozen model is called inference. It is the operational phase of AI, and it is the only phase your users will ever experience.
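To make "frozen" concrete, here is a toy sketch in Python. It is not how a real language model is built: it assumes a hypothetical one-parameter model (y = w * x), and the names `train` and `predict` and every number are illustrative assumptions. The point is only the division of labor: training changes the parameter, inference merely reads it.

```python
def predict(w, x):
    """Inference: use the parameter w to produce an output. w is never changed."""
    return w * x


def train(data, w=0.0, learning_rate=0.01, epochs=200):
    """Training: repeatedly nudge w to reduce the error on known examples."""
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            # Gradient step for squared error: this is the only place w changes.
            w -= learning_rate * error * x
    return w


# Hypothetical training data that follows y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

frozen_w = train(examples)          # slow, done once, offline
answer = predict(frozen_w, 10.0)    # fast, done on every user request
```

However many times `predict` is called, `frozen_w` stays the same. That is why a user's request does not make the model smarter for the next person.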

§3.3·~1 min

Who pays for what

Why using GPT-4 via API is fundamentally different from building your own model

Key takeaway

Using a vendor API means paying a retail markup on variable inference costs, while hosting an open-source model means absorbing fixed infrastructure costs to get wholesale inference.

Why this matters for you

When you choose between OpenAI and hosting Llama 3, you are not making a technology choice; you are making a gross margin choice. You must project your user volume to determine which cost structure fits your business model.

Your engineering lead presents two options for the new summarization feature: call the Anthropic API, or download an open-source model and host it on your own AWS instances. Both will achieve the exact same user experience. The difference lies entirely in how the money moves. You are choosing between paying a premium per usage, or paying a flat rate to keep servers running even when no one is using them. This is the classic build-versus-buy dilemma, weaponized by the sheer cost of AI compute.
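The choice between the two cost structures can be sketched as a break-even calculation. This is a minimal sketch under stated assumptions: every price and volume below is a hypothetical placeholder, not a quote from any vendor, and the function names are invented for illustration.

```python
def monthly_api_cost(requests, api_cost_per_request):
    """Vendor API: you pay only for what you use."""
    return requests * api_cost_per_request


def monthly_self_hosted_cost(requests, fixed_monthly_cost, marginal_cost_per_request=0.0):
    """Self-hosting: you pay for the servers whether or not anyone shows up."""
    return fixed_monthly_cost + requests * marginal_cost_per_request


def break_even_requests(fixed_monthly_cost, api_cost_per_request, marginal_cost_per_request=0.0):
    """Monthly volume above which self-hosting becomes cheaper than the API.

    Returns None if the API is never more expensive per request.
    """
    saving_per_request = api_cost_per_request - marginal_cost_per_request
    if saving_per_request <= 0:
        return None
    return fixed_monthly_cost / saving_per_request


# Hypothetical inputs: $6,000/month of GPU servers, $0.01 per API call,
# and $0.002 of power and overhead per self-hosted call.
threshold = break_even_requests(6000.0, 0.01, 0.002)
```

Below the threshold the API is the cheaper option; above it the fixed servers start to pay for themselves. A real comparison would also need to include engineering time and idle capacity, which this sketch leaves out.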

§3.4·~1 min

Inference cost at scale

How 10x users becomes 10x your AWS bill

Key takeaway

Generative AI breaks traditional software economics because processing 10x more users requires roughly 10x more compute, offering almost none of the economies of scale you are used to.

Why this matters for you

When your marketing team celebrates a viral spike in usage, you need to be the one checking the dashboard. If you mispriced your feature, that viral spike will bankrupt the product line before the end of the quarter.

A viral tweet sends thousands of new users to your SaaS platform overnight. In traditional software, this is almost pure profit. Serving a web page to ten thousand users costs virtually the same as serving it to ten users; the database queries are cheap and the infrastructure scales effortlessly. Generative AI destroys this paradigm. Generating text or images requires heavy, dedicated compute for every single request, so your bill grows roughly in step with usage. Batching and caching help at the margins, but you do not get a meaningful discount for volume.
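A rough way to see the difference is to model both bills side by side. The sketch below assumes a deliberately simple linear model that ignores batching, caching, and volume discounts; all figures and function names are hypothetical.

```python
def traditional_monthly_cost(users, fixed_cost=500.0, cost_per_user=0.001):
    """Classic web app: mostly fixed infrastructure, tiny marginal cost per user."""
    return fixed_cost + users * cost_per_user


def ai_monthly_cost(users, requests_per_user=20, cost_per_request=0.004):
    """AI feature: every request triggers paid inference."""
    return users * requests_per_user * cost_per_request


def growth_factor(cost_fn, users_before, users_after):
    """How many times larger the bill gets when usage grows."""
    return cost_fn(users_after) / cost_fn(users_before)


# A hypothetical viral spike from 1,000 to 10,000 users.
web_growth = growth_factor(traditional_monthly_cost, 1_000, 10_000)
ai_growth = growth_factor(ai_monthly_cost, 1_000, 10_000)
```

Under these assumptions the traditional bill barely moves while the AI bill grows tenfold along with the user count. Revenue per user has to cover that, or the spike costs more than it earns.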

§3.5·~1 min

Fine-tuning as a middle path

Starting from a trained model and adapting it — cost and tradeoff explained

Key takeaway

Fine-tuning is a lightweight training run that alters the model's tone and format, acting as a bridge between the massive cost of pre-training and the constraints of inference.

Why this matters for you

When marketing complains that the AI writes like a robot, your engineers will suggest fine-tuning. You must ensure they are doing it to teach the model how to speak, not what facts to know, otherwise you will waste time on the wrong solution.

Your customer support bot is technically accurate, but it sounds like a generic corporate manual. You want it to speak in your brand's specific, empathetic tone. You cannot fix this with a better prompt without consuming half your context window, and you certainly aren't going to train a new model from scratch. You need the model to fundamentally adopt your style. This is where you take a fully trained model and run a much smaller, cheaper training pass over your specific data. You are not teaching it English; you are just teaching it your accent.

§3.6·~1 min

The latency problem

Why inference speed is a product problem, not just an engineering problem

Key takeaway

Because models generate text sequentially, one token at a time, larger and more capable models generally respond more slowly, forcing you to design UI patterns that mask the waiting time.

Why this matters for you

When your CEO asks why the new AI feature feels sluggish compared to the rest of the app, you need to explain that you cannot simply optimize the database. You must manage the user's perception of time while the model thinks.

A user clicks "Generate Report" and stares at a loading spinner. Three seconds pass. Five seconds. They assume the app has crashed and click refresh, abandoning the process entirely. The model hasn't crashed; it is performing flawlessly. It is simply grinding through billions of parameters to calculate the very first word of the report, and it will repeat that work for every word after it. In traditional software, latency is an engineering bug; in generative AI, latency is closer to a law of physics. Faster hardware only buys you so much, so you must design your way around it.
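The wait described above can be approximated with simple arithmetic: a delay before the first token, then a steady per-token cost for the rest of the output. The sketch below is illustrative only; the timings are hypothetical, and the function names are invented for this example.

```python
def total_generation_time_s(time_to_first_token_s, output_tokens, seconds_per_token):
    """Time until the full response exists, since tokens are produced one by one."""
    return time_to_first_token_s + output_tokens * seconds_per_token


def perceived_wait_s(time_to_first_token_s, output_tokens, seconds_per_token, streaming):
    """Time before the user sees anything on screen.

    With streaming, text appears as soon as the first token arrives.
    Without it, the user stares at a spinner until the whole response is done.
    """
    if streaming:
        return time_to_first_token_s
    return total_generation_time_s(time_to_first_token_s, output_tokens, seconds_per_token)


# Hypothetical report: 0.5 s to the first token, 400 tokens at 0.02 s each.
spinner_wait = perceived_wait_s(0.5, 400, 0.02, streaming=False)
streamed_wait = perceived_wait_s(0.5, 400, 0.02, streaming=True)
```

The model does the same amount of work in both cases. Streaming changes only what the user perceives, which is why it is a product decision as much as an engineering one.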

§3.7·~1 min

PM decision lens: unit economics

The cost-per-query calculation every AI PM must own

Key takeaway

You must calculate the cost-per-query and multiply it by expected user volume before a single line of code is written, or your feature will accidentally burn your runway.

Why this matters for you

In the roadmap review, when stakeholders push to add AI to every surface area of the product, your financial model is your only defense against shipping features that lose money on every click.

The design team pitches a feature that proactively reads every inbound email and generates a suggested reply, storing it as a draft so the user sees it instantly when they open the app. The UX is flawless. The engineering is trivial. But the unit economics are catastrophic. You are proposing running expensive inference on 100% of emails, even though users only open and reply to 20% of them. You have designed a feature that spends 80% of its inference budget in the background on drafts that deliver zero user value.
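The calculation behind that verdict can be sketched in a few lines. This is a back-of-the-envelope model, not a pricing tool: the token counts, per-token prices, and email volume are hypothetical assumptions, and only the 20% reply rate comes from the scenario above.

```python
def cost_per_query(input_tokens, output_tokens, price_per_1k_input, price_per_1k_output):
    """Cost of one model call when the vendor bills per 1,000 tokens."""
    return (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output


def monthly_feature_cost(emails_per_month, share_processed, query_cost):
    """Total inference spend when only a share of emails triggers a model call."""
    return emails_per_month * share_processed * query_cost


# Hypothetical inputs: 1,500 tokens in, 200 tokens out,
# $0.002 per 1k input tokens and $0.006 per 1k output tokens.
query_cost = cost_per_query(1500, 200, 0.002, 0.006)

emails = 1_000_000
eager = monthly_feature_cost(emails, 1.0, query_cost)      # draft a reply for every email
on_demand = monthly_feature_cost(emails, 0.2, query_cost)  # draft only when the user replies
wasted = eager - on_demand
```

With these placeholder numbers, generating drafts on demand costs a fifth of the eager design. The trade-off is that the user waits for the draft instead of seeing it instantly, so the cheaper version needs a plan for the wait.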

Real product examples

As a PM: When your CEO asks why you aren't training your own model, you must explain that you are buying variable inference to avoid fixed capital expenditure. You only migrate to self-hosted models when your volume justifies the fixed infrastructure cost.

Meta's Llama 3 — The open-source subsidy

Meta spent billions of dollars purchasing NVIDIA H100 GPUs and used tens of thousands of them to train its Llama 3 models over several months. They absorbed this massive upfront capital expenditure so the open-source community wouldn't have to. The model they released is just the frozen weights produced by that training run.

Concept check

Sort into categories

Sort each line item on the AI cloud bill into the bucket it belongs to.

Renting 1,024 GPUs for three weeks to pre-train a 70B-parameter foundation model.

Every time a user clicks 'Summarise this doc' and the model generates 800 tokens.

A nightly batch job that scores all new support tickets with your churn classifier.

A two-day fine-tuning run on 5,000 brand-voice examples.

Real-time autocomplete suggestions firing as the user types.

An OpenAI API charge of $0.002 per 1k input tokens on a production chatbot.

Training (large one-time CapEx)

Inference (per-request OpEx)

Vetted by Krishna Kumar, Curator, FactorBeam