Training vs Inference — Two completely different operations with very different costs
Training is a massive one-time R&D cost; inference is the ongoing variable cost of running the model. Confuse them and viral success will bankrupt your margins.
Key takeaway
Training is a massive fixed R&D cost; inference is an ongoing variable cost. You rent intelligence during inference to avoid the capital expenditure of training.
What is training
The learning phase — slow, expensive, done offline
Key takeaway
Training is the massive, offline, upfront capital expenditure where the model learns; it is an R&D investment you make once per version.
Why this matters for you
When your CEO asks why the new open-source model isn't running on your servers yet, you need to explain that serving the model is easy, but teaching it your proprietary data from scratch means asking for a multi-million-dollar GPU budget and a six-month timeline.
Walk into a server farm training a large language model and you will see racks of specialized hardware pulling enough electricity to power a small town, running for months without a single user query. You are not looking at a product serving customers. You are looking at the brute-force computational effort required to compress the internet into a statistical artifact. This massive initial computation is what creates the model's intelligence. Training is the phase where the model is actually learning, and it happens entirely behind closed doors long before the first user logs in.
What is inference
The usage phase — fast, cheap per call, expensive at scale
Key takeaway
Inference is the operational, per-query variable cost of using the frozen model to generate answers for your users.
Why this matters for you
In your next sprint planning, when the team debates the cost of the new AI feature, you must focus the conversation entirely on inference volume. Every time a user clicks the button, you incur a small compute cost that will eat your margins if not capped.
A user pastes a rough draft into your text box, clicks "Polish", and waits two seconds for a beautifully rewritten paragraph to appear. The model did not learn anything about the user, it did not update its internal parameters, and it did not get smarter for the next person. It simply pushed the user's text through its frozen web of parameters to calculate the most statistically likely response. This process of generating an output from a frozen model is called inference. It is the operational phase of AI, and it is the only phase your users will ever experience.
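The difference between the two phases can be sketched with a toy one-parameter model. This is only an illustration under invented numbers: the function names, the learning rate, and the target value are all assumptions, and a real model has billions of parameters rather than one. The point it shows is that a training step returns a changed weight, while an inference call only reads the frozen weight.

```python
# Toy one-parameter "model": prediction = weight * x.
# Every number here is hypothetical and chosen only for illustration.

def infer(weight: float, x: float) -> float:
    """Inference: a forward pass that reads the weight and never changes it."""
    return weight * x


def train_step(weight: float, x: float, target: float, lr: float = 0.1) -> float:
    """Training: one gradient-descent step on squared error; returns a new weight."""
    error = infer(weight, x) - target
    gradient = 2 * error * x
    return weight - lr * gradient


# Training phase: repeated steps that change the weight (offline, done up front).
weight = 0.0
for _ in range(50):
    weight = train_step(weight, x=1.0, target=3.0)

# Inference phase: the weight is frozen; each call costs compute but learns nothing.
frozen = weight
outputs = [infer(frozen, x) for x in (1.0, 2.0, 4.0)]

print(f"learned weight: {frozen:.4f}")
print("weight unchanged after inference:", weight == frozen)
```

Serving more users means more calls to `infer`, never more calls to `train_step`, which is why the two phases show up as different kinds of cost.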
Who pays for what
Why using GPT-4 via API is fundamentally different from building your own model
Key takeaway
Using a vendor API means paying a retail markup on variable inference costs, while hosting an open-source model means absorbing fixed infrastructure costs to get wholesale inference.
Why this matters for you
When you choose between OpenAI and hosting Llama 3, you are not making a technology choice; you are making a gross margin choice. You must project your user volume to determine which cost structure fits your business model.
Your engineering lead presents two options for the new summarization feature: call the Anthropic API, or download an open-source model and host it on your own AWS instances. Both can deliver essentially the same user experience. The difference lies almost entirely in how the money moves. You are choosing between paying a premium per use, or paying a flat rate to keep servers running even when no one is using them. This is the classic build-versus-buy dilemma, weaponized by the sheer cost of AI compute.
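The build-versus-buy choice can be reduced to a break-even volume. The sketch below is a simplified model, not vendor pricing: the per-request API price, the fixed monthly server cost, and the self-hosted marginal cost are hypothetical placeholders you would replace with your own quotes. It also assumes one fixed server fleet can absorb the whole volume.

```python
from typing import Optional

# All prices below are hypothetical placeholders, not real vendor quotes.

def api_monthly_cost(requests: int, price_per_request: float) -> float:
    """Vendor API: pure variable cost, nothing to pay at zero usage."""
    return requests * price_per_request


def self_hosted_monthly_cost(
    requests: int, fixed_server_cost: float, marginal_cost_per_request: float
) -> float:
    """Self-hosting: a flat monthly bill plus a smaller per-request cost."""
    return fixed_server_cost + requests * marginal_cost_per_request


def break_even_requests(
    price_per_request: float, fixed_server_cost: float, marginal_cost_per_request: float
) -> Optional[float]:
    """Monthly volume above which self-hosting becomes cheaper than the API.

    Returns None when self-hosting is never cheaper per request.
    """
    saving_per_request = price_per_request - marginal_cost_per_request
    if saving_per_request <= 0:
        return None
    return fixed_server_cost / saving_per_request


if __name__ == "__main__":
    # Hypothetical inputs: $0.01 per API call, $6,000/month of servers,
    # $0.002 per self-hosted call.
    volume = break_even_requests(0.01, 6000.0, 0.002)
    print(f"break-even at about {volume:,.0f} requests per month")
```

Below the break-even volume the API's retail markup is cheaper than idle servers; above it, the fixed cost is spread thin enough that wholesale inference wins.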
Inference cost at scale
How 10x users becomes 10x your AWS bill
Key takeaway
Generative AI breaks traditional software economics because processing 10x more users requires roughly 10x more compute, offering few of the economies of scale that conventional software enjoys.
Why this matters for you
When your marketing team celebrates a viral spike in usage, you need to be the one checking the dashboard. If you mispriced your feature, that viral spike will bankrupt the product line before the end of the quarter.
A viral tweet sends thousands of new users to your SaaS platform overnight. In traditional software, this is pure profit. Serving a web page to ten thousand users costs virtually the same as serving it to ten users; the database queries are cheap and the infrastructure scales effortlessly. Generative AI destroys this paradigm. Generating text or images requires heavy, dedicated compute for every single request. You do not get a discount for volume.
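A simple linear model makes the viral-spike risk concrete. It is a sketch under stated assumptions: the subscription price, requests per user, and cost per request are invented, and the model deliberately ignores any savings from batching or caching. It shows that when the per-user margin is negative, ten times the users means ten times the loss.

```python
# Hypothetical unit economics; replace every number with your own data.

def monthly_inference_cost(users: int, requests_per_user: int, cost_per_request: float) -> float:
    """Inference cost grows linearly with users in this simplified model."""
    return users * requests_per_user * cost_per_request


def monthly_margin(
    users: int, price_per_user: float, requests_per_user: int, cost_per_request: float
) -> float:
    """Revenue minus inference cost; other costs are ignored for clarity."""
    revenue = users * price_per_user
    return revenue - monthly_inference_cost(users, requests_per_user, cost_per_request)


if __name__ == "__main__":
    # Assumed: $10/month plan, 400 requests per user, $0.03 per request.
    for users in (1_000, 10_000):
        cost = monthly_inference_cost(users, 400, 0.03)
        margin = monthly_margin(users, 10.0, 400, 0.03)
        print(f"{users:>6} users: cost ${cost:,.0f}, margin ${margin:,.0f}")
```

With these assumed inputs each user costs more to serve than they pay, so growth makes the loss larger. Capping requests per user or raising the price changes the sign of the margin, not the linear shape of the cost.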
Fine-tuning as a middle path
Starting from a trained model and adapting it — cost and tradeoff explained
Key takeaway
Fine-tuning is a lightweight training run that alters the model's tone and format, acting as a bridge between the massive cost of pre-training and the constraints of inference.
Why this matters for you
When marketing complains that the AI writes like a robot, your engineers will suggest fine-tuning. You must ensure they are doing it to teach the model how to speak, not what facts to know, otherwise you will waste time on the wrong solution.
Your customer support bot is technically accurate, but it sounds like a generic corporate manual. You want it to speak in your brand's specific, empathetic tone. You cannot fix this with a better prompt without consuming half your context window, and you certainly aren't going to train a new model from scratch. You need the model to fundamentally adopt your style. This is where you take a fully trained model and run a much smaller, cheaper training pass over your specific data. You are not teaching it English; you are just teaching it your accent.
The latency problem
Why inference speed is a product problem, not just an engineering problem
Key takeaway
Because models generate text sequentially, one token at a time, larger and more capable models generally mean higher latency, forcing you to design UI paradigms that mask the waiting time.
Why this matters for you
When your CEO asks why the new AI feature feels sluggish compared to the rest of the app, you need to explain that you cannot simply optimize the database. You must manage the user's perception of time while the model thinks.
A user clicks "Generate Report" and stares at a loading spinner. Three seconds pass. Five seconds. They assume the app has crashed and click refresh, abandoning the process entirely. The model hasn't crashed; it is performing flawlessly. It is simply grinding through billions of parameters to calculate the very first word of the report. In traditional software, latency is an engineering bug; in generative AI, latency is closer to a law of physics. You cannot entirely buy your way out of it, so you must design your way around it.
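Sequential generation can be approximated as a wait for the first token plus a steady per-token rate. The sketch below uses that approximation with invented numbers (time to first token, output length, tokens per second are all assumptions), and it uses streaming the output as one example of a UI pattern that masks the wait. Streaming is a common technique but is not named in the text above.

```python
# Hypothetical latency figures; measure your own model before relying on them.

def total_generation_seconds(
    time_to_first_token: float, output_tokens: int, tokens_per_second: float
) -> float:
    """Approximate time until the full response exists."""
    return time_to_first_token + output_tokens / tokens_per_second


def perceived_wait_seconds(
    time_to_first_token: float,
    output_tokens: int,
    tokens_per_second: float,
    streaming: bool,
) -> float:
    """Time the user stares at a blank screen before anything appears."""
    if streaming:
        # With streaming, text starts appearing as soon as the first token is ready.
        return time_to_first_token
    # Without streaming, the user waits for the entire response.
    return total_generation_seconds(time_to_first_token, output_tokens, tokens_per_second)


if __name__ == "__main__":
    # Assumed: 0.5 s to first token, a 400-token report, 50 tokens per second.
    print(perceived_wait_seconds(0.5, 400, 50.0, streaming=False))  # 8.5
    print(perceived_wait_seconds(0.5, 400, 50.0, streaming=True))   # 0.5
```

The total compute time is identical in both cases. Only the moment at which the user first sees progress changes, which is why this is a product decision as much as an engineering one.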
PM decision lens: unit economics
The cost-per-query calculation every AI PM must own
Key takeaway
You must calculate the cost-per-query and multiply it by expected user volume before a single line of code is written, or your feature will accidentally burn your runway.
Why this matters for you
In the roadmap review, when stakeholders push to add AI to every surface area of the product, your financial model is your only defense against shipping features that lose money on every click.
The design team pitches a feature that proactively reads every inbound email and generates a suggested reply, storing it as a draft so the user sees it instantly when they open the app. The UX is flawless. The engineering is trivial. But the unit economics are catastrophic. You are proposing running expensive inference on 100% of emails, even though users only open and reply to 20% of them. You have designed a feature that burns money in the background for zero user value.
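The email-draft example can be worked through as arithmetic. In the sketch below, the 20% open rate comes from the scenario above, while the monthly email volume and the cost per generated draft are hypothetical values chosen for illustration. The "on demand" variant, which generates a draft only when the user opens the email, is an assumed alternative for comparison.

```python
# The open rate is from the scenario; volume and per-draft cost are hypothetical.

OPEN_RATE = 0.20  # share of inbound emails users actually open and reply to


def monthly_draft_cost(emails: int, cost_per_draft: float, fraction_generated: float) -> float:
    """Inference spend when drafts are generated for a given fraction of emails."""
    return emails * fraction_generated * cost_per_draft


def cost_per_used_draft(cost_per_draft: float, open_rate: float) -> float:
    """Effective cost of each draft a user actually sees under eager generation."""
    return cost_per_draft / open_rate


if __name__ == "__main__":
    emails = 1_000_000      # assumed monthly inbound volume
    cost_per_draft = 0.01   # assumed inference cost per generated draft, in dollars

    eager = monthly_draft_cost(emails, cost_per_draft, fraction_generated=1.0)
    on_demand = monthly_draft_cost(emails, cost_per_draft, fraction_generated=OPEN_RATE)

    print(f"eager generation:     ${eager:,.0f} per month")
    print(f"on-demand generation: ${on_demand:,.0f} per month")
    print(f"spent on unread drafts: ${eager - on_demand:,.0f} per month")
    print(f"effective cost per used draft: ${cost_per_used_draft(cost_per_draft, OPEN_RATE):.2f}")
```

Generating on demand removes the wasted spend but gives up the instant draft, so the saving has to be weighed against the latency the user will now see.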
Real product examples
Meta's Llama 3 — The open-source subsidy
Meta has invested billions of dollars in NVIDIA H100 GPUs and used large clusters of them to train its Llama 3 models over several months. It absorbed this massive upfront capital expenditure so the open-source community wouldn't have to. The model it released is just the frozen weights produced by that training run.

Vetted by Krishna Kumar, Curator, FactorBeam

