Founder 01 · Chapter 6 of 8

Model Evaluation — Why your team's metrics lie to you (and what to measure instead)

~8 min essentials·26 min full·8 sections

The founder's guide to AI evaluation beyond accuracy slides — offline vs online truth, benchmark theatre, production signals, regression discipline, and the five questions that protect you before every model swap.

Key takeaway

Evaluation is a business governance system, not a dashboard exercise. Founders who ask which errors cost money, whether offline metrics predict customer reality, and how regressions get caught before users see them ship AI that survives diligence and renewals. Founders who accept headline accuracy ship demos that collapse in week three.

§6.1·~1 min

Why accuracy is the wrong metric — the founder version

The 99% slide that impresses investors and bankrupts pilots

Key takeaway

Accuracy is a headline number that hides what your business actually needs. When the event you care about is rare — fraud, churn, disease, defects — a model can score 99% while catching none of it. Founders who lead with accuracy signal they have not translated the product into economics yet.

Why this matters for you

Investors and enterprise buyers have seen enough AI pitches to distrust top-line accuracy. When a founder cannot explain what 'accurate' means for the minority class their revenue depends on, diligence turns skeptical and pilots expose the gap within weeks.

Your ML lead reports 99% accuracy on a fraud-detection model. If only 0.1% of transactions are fraudulent, a model that labels every transaction 'legitimate' is 99.9% accurate, which beats the reported figure, while catching zero fraud. Founders who celebrate accuracy without asking 'accurate at what?' set themselves up for a production disaster and a painful board conversation.
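
The arithmetic behind this trap is easy to check. The sketch below is illustrative only: it assumes a hypothetical batch of 100,000 transactions at the 0.1% fraud rate from the example, and the function name is made up for this illustration.

```python
def always_legitimate_metrics(n_transactions: int, n_fraud: int) -> dict:
    """Score a 'model' that labels every transaction legitimate."""
    true_negatives = n_transactions - n_fraud  # legitimate, labeled legitimate
    false_negatives = n_fraud                  # fraud, labeled legitimate
    true_positives = 0                         # this model never flags fraud
    accuracy = (true_positives + true_negatives) / n_transactions
    recall = true_positives / n_fraud if n_fraud else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "missed_fraud": false_negatives,
    }


# Hypothetical batch: 100,000 transactions, 0.1% of them fraudulent.
metrics = always_legitimate_metrics(n_transactions=100_000, n_fraud=100)
print(f"accuracy: {metrics['accuracy']:.1%}")
print(f"fraud caught (recall): {metrics['recall']:.0%}")
print(f"fraud cases missed: {metrics['missed_fraud']}")
```

A model has to beat this do-nothing baseline on recall, not on accuracy, before its headline number means anything.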

§6.2·~1 min

Precision vs recall — a business decision disguised as a technical one

The threshold meeting is a strategy meeting — even if engineering scheduled it

Key takeaway

Precision asks: when the model speaks, can we trust it? Recall asks: when something important happens, did we catch it? You cannot maximize both at once. The tradeoff is not a tuning detail — it is a product, pricing, and liability decision that belongs on the founder's calendar.

Why this matters for you

Engineers will optimize for whatever metric is easiest to plot. Founders must translate customer pain into which error type the company can afford. Get this wrong and retention, margin, and regulatory posture all break — regardless of how impressive the offline charts look.

Your CTO asks: 'Catch 90% of churn signals with 40% false alarm rate, or 60% of signals with 10% false alarms?' That is not a technical question. It is a question about how much sales team attention costs and how much a missed churning account costs. Founders who defer this decision get a product that optimizes for engineering convenience.
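
One way to turn the CTO's question into a business comparison is to price each option. The sketch below rests on assumptions that are not in the original: it reads 'false alarm rate' as the share of alerts that turn out to be false, charges one sales follow-up per alert, and uses invented dollar figures and an invented count of churning accounts.

```python
def operating_point_cost(recall, false_alarm_share, churners,
                         followup_cost, missed_cost):
    """Estimated cost of one threshold choice over a review period.

    Assumes 'false alarm rate' is the share of alerts that are false,
    and that every alert consumes one sales follow-up.
    """
    caught = churners * recall
    alerts = caught / (1 - false_alarm_share)  # true alerts plus false alarms
    missed = churners - caught
    return alerts * followup_cost + missed * missed_cost


# Hypothetical inputs: 100 accounts about to churn, $500 per follow-up,
# $20,000 lost per missed churning account.
inputs = dict(churners=100, followup_cost=500, missed_cost=20_000)
wide_net = operating_point_cost(recall=0.9, false_alarm_share=0.4, **inputs)
narrow_net = operating_point_cost(recall=0.6, false_alarm_share=0.1, **inputs)
print(f"catch 90%, 40% false alarms: ${wide_net:,.0f}")
print(f"catch 60%, 10% false alarms: ${narrow_net:,.0f}")
```

Under these made-up figures the wide net is cheaper. Raise the follow-up cost to $5,000 and drop the missed-account cost to $2,000 and the ranking flips, which is why the inputs have to come from the founder rather than from the model.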

§6.3·~1 min

Offline evaluation vs online evaluation

Why your test set is a rehearsal — and production is opening night

Key takeaway

Offline evaluation runs the model against historical labeled data in a controlled lab. Online evaluation measures what happens when real users, real adversaries, and real drift interact with the product. Offline metrics are necessary; online metrics are truth. Founders who conflate them fundraise on fiction.

Why this matters for you

Every AI startup has impressive offline numbers. The ones that survive prove those numbers predict customer outcomes. Diligence teams and design partners increasingly ask how offline eval connects to production — founders who cannot answer lose deals and board trust.

Your team ships a model that beats the old one on every offline benchmark. Week two in production: support tickets spike and a design partner threatens to churn. Offline data is frozen history. Production is live adversaries, changing user behavior, seasonality, and feedback loops the test set never saw. The gap between offline win and online failure is one of the most common causes of AI startup embarrassment.

§6.4·~1 min

Benchmark theatre — when your team's metrics don't reflect reality

Impressive charts that survive the meeting and die in the pilot

Key takeaway

Benchmark theatre is when metrics look rigorous but fail to represent how customers use the product — leaky test sets, cherry-picked time windows, synthetic tasks, or evaluation data that does not match production traffic. The team is not necessarily lying; they are often optimizing for what is easy to measure instead of what matters.

Why this matters for you

Founders who accept benchmark theatre discover the truth in the worst venues: customer escalations, diligence deep-dives, and competitor bake-offs on real data. Catching theatre early saves quarters of misallocated engineering and prevents fundraising narratives that cannot survive contact with production.

Your team presents a 15-point offline improvement. You ask one question: 'Would our largest customer recognize this dataset?' Long pause. Benchmark theatre often starts innocently — a clean internal dataset, a public leaderboard task, a demo prompt suite that never appears in support tickets. Founders must audit whether evaluation data looks like paying customers, not like Kaggle.
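
Leaky test sets are the one form of theatre that can be checked mechanically. The sketch below is a minimal example covering only exact duplicates after light normalization; the function names and sample tickets are hypothetical, and real leakage checks usually also look for near-duplicates and overlapping time windows.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def leaked_examples(train: list[str], test: list[str]) -> list[str]:
    """Return test examples that also appear in the training data."""
    seen = {normalize(t) for t in train}
    return [t for t in test if normalize(t) in seen]


# Hypothetical support tickets.
train = ["Refund my last invoice", "Card declined at checkout"]
test = ["refund my  last invoice", "Why was I charged twice?"]

leaks = leaked_examples(train, test)
print(f"{len(leaks)} of {len(test)} test examples also appear in training data")
```

A non-zero count means part of the reported score is memorization rather than generalization.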

§6.5·~1 min

Evaluation in production — what to measure after you ship

The metrics that show up in renewals, not in research papers

Key takeaway

Production evaluation asks whether the AI is making the business healthier — not whether the model still scores well on a frozen CSV. Founders need a short list of online metrics tied to revenue, cost, risk, and customer experience that update weekly and trigger action when they move.

Why this matters for you

Boards and customers do not renew on offline F1. They renew when fraud losses fall, support cost per ticket drops, time-to-hire improves, or the error rates customers feel go down. Founders who instrument production eval early catch model decay before it turns into churn; founders who skip it find out in angry quarterly business reviews (QBRs).

After launch, the question changes from 'how accurate is the model?' to 'is the AI worth the infrastructure, risk, and customer trust it consumes?' Production metrics include business outcomes (dollars saved, conversion lift, handle time), product health (override rate, thumbs-down rate, escalation rate), and risk signals (incident count, near-miss log, segment disparity). Founders should pick three to five production metrics and review them in every exec staff meeting.

Define business KPIs: revenue, retention, support cost, not just model accuracy

Instrument live traffic: log inputs, outputs, overrides, and downstream outcomes

Segment performance: break metrics by cohort, channel, and edge cases

Detect drift: alert when distributions or error rates shift

Iterate or roll back: retrain, change prompts, or revert based on business impact
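
The segment and drift steps above can be sketched with nothing more than logged events. This is a minimal sketch: the event shape (segment, was_overridden), the 5-point alert threshold, and the sample data are assumptions for the example, and it uses override rate as the health signal simply because it is one of the product-health metrics named earlier.

```python
from collections import defaultdict


def override_rate_by_segment(events):
    """events: iterable of (segment, was_overridden) pairs from production logs."""
    totals, overrides = defaultdict(int), defaultdict(int)
    for segment, was_overridden in events:
        totals[segment] += 1
        overrides[segment] += int(was_overridden)
    return {s: overrides[s] / totals[s] for s in totals}


def drift_alerts(baseline, current, max_increase=0.05):
    """Segments whose override rate rose by more than max_increase (absolute)."""
    return sorted(
        s for s in current
        if s in baseline and current[s] - baseline[s] > max_increase
    )


# Hypothetical logs: ten decisions per segment per week.
last_week = [("enterprise", i < 1) for i in range(10)] + \
            [("smb", i < 2) for i in range(10)]
this_week = [("enterprise", i < 4) for i in range(10)] + \
            [("smb", i < 2) for i in range(10)]

alerts = drift_alerts(override_rate_by_segment(last_week),
                      override_rate_by_segment(this_week))
print("segments needing review:", alerts)
```

Averaged over both segments the override rate moves far less than the enterprise figure does, so an aggregate-only dashboard would understate the problem the segmented view catches.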

§6.6·~1 min

Regression testing for AI — why models break in non-obvious ways

The update that fixed fraud and broke refunds — and how to catch it before users do

Key takeaway

Regression testing for AI means proving a new model still handles the cases the old model got right — not just that average metrics improved. Models break in non-obvious ways: a fraud upgrade that silently degrades refund classification, a prompt change that breaks JSON output, a fine-tune that forgets a key customer segment.

Why this matters for you

Traditional software regressions are binary — a test passes or fails. AI regressions are statistical and segment-specific — average metrics improve while a high-value customer cohort collapses. Founders who skip regression discipline ship 'improvements' that cause emergency rollbacks and erode customer trust.

Your team ships v2 of the support classifier. Aggregate accuracy rises 4 points. Enterprise customers open tickets: the model now misroutes billing disputes — the one category they pay premium for. Global metric improvements hide local catastrophes. Regression testing asks: what got worse, for whom, on the cases we cannot afford to break? Founders should mandate regression gates alongside improvement targets for every model release.
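
A regression gate of this kind can be as simple as a per-segment comparison that blocks release when any protected segment gets worse. The sketch below is hypothetical throughout: the segment names, the ticket counts, and the 2-point tolerance are invented, chosen so that aggregate accuracy rises 4 points while billing disputes collapse.

```python
def segment_accuracy(results):
    """results: dict mapping segment -> list of booleans (True = correct)."""
    return {seg: sum(flags) / len(flags) for seg, flags in results.items()}


def overall_accuracy(results):
    """Accuracy pooled across every segment."""
    flags = [f for seg_flags in results.values() for f in seg_flags]
    return sum(flags) / len(flags)


def regression_gate(old, new, max_drop=0.02):
    """Segments where the new model is worse than the old by more than max_drop."""
    old_acc, new_acc = segment_accuracy(old), segment_accuracy(new)
    return {
        seg: (old_acc[seg], new_acc[seg])
        for seg in old_acc
        if old_acc[seg] - new_acc[seg] > max_drop
    }


# Hypothetical evaluation results for v1 and v2 of the support classifier.
v1 = {"general": [True] * 144 + [False] * 36,
      "billing_disputes": [True] * 18 + [False] * 2}
v2 = {"general": [True] * 158 + [False] * 22,
      "billing_disputes": [True] * 12 + [False] * 8}

print(f"aggregate: {overall_accuracy(v1):.0%} -> {overall_accuracy(v2):.0%}")
failures = regression_gate(v1, v2)
for seg, (before, after) in failures.items():
    print(f"BLOCKED: {seg} fell from {before:.0%} to {after:.0%}")
```

The gate fails the release even though the aggregate improved, which is the behavior a founder should ask for.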

§6.7·~1 min

Building an evaluation culture — not just an evaluation metric

Why the best AI companies argue about eval in every sprint review

Key takeaway

Evaluation culture means the whole company — product, sales, support, legal — feeds failure cases into a shared system, disagreements about metrics are welcomed, and nobody ships because 'the number went up.' Metrics are the output of culture; without culture, metrics become theatre.

Why this matters for you

Startups that treat eval as ML's private homework repeat the same production failures quarterly. Startups that treat eval as company infrastructure compound learning speed — every churned pilot, every sales loss, every support escalation improves the next model. Investors bet on learning velocity; eval culture is how you prove it.

Two startups with identical models diverge when one treats eval as a dashboard and the other as organizational memory. Eval culture routes customer escalations into labeled examples, puts support leaders in model review meetings, and celebrates catches before launch — not just launches. Founders set the tone by asking 'what did we learn?' before 'did we ship?'

§6.8·~1 min

Founder decision lens: the five evaluation questions to ask your team before every major model update

Five questions that take twenty minutes and prevent twenty-week rollbacks

Key takeaway

Before every major model update, founders should ask five evaluation questions and refuse to ship until the answers are documented. This is not micromanagement — it is the minimum governance layer between probabilistic software and customer trust.

Why this matters for you

Model updates are irreversible in reputation even when reversible in code. A single bad deploy becomes the story customers remember. Five questions create a consistent decision lens that works whether you understand backpropagation or not — because they force the team to connect metrics to money, customers, and risk.

Question 1: Which error costs us more money or trust, false alarms or misses? If the team cannot answer in dollars or customer terms, they are not ready to ship.

Question 2: What offline metric predicts the online outcome we sell, and what happened to it on fresh data?
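
For Question 2, one rough sanity check is to look back at past releases and count how often the offline metric and the online outcome moved in the same direction. The sketch below is illustrative only: the release history, the choice of metrics, and the function name are all hypothetical, and direction agreement is a deliberately crude stand-in for a proper correlation analysis.

```python
def direction_agreement(releases):
    """Share of releases where offline and online metrics moved the same way.

    releases: list of (offline_delta, online_delta) pairs, one per release.
    """
    agree = sum(
        1 for offline_delta, online_delta in releases
        if offline_delta * online_delta > 0
    )
    return agree / len(releases)


# Hypothetical history: (change in offline F1, change in weekly savings in $k).
history = [(0.03, 12.0), (0.01, -4.0), (0.05, 20.0), (0.02, -1.5)]

agreement = direction_agreement(history)
print(f"offline and online agreed on {agreement:.0%} of releases")
```

If agreement is close to a coin flip, the offline metric is not yet a usable predictor of what customers pay for.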

Real product examples

As a founder: before your next model update ships, ask your team for offline metrics, online metrics, and the last regression test result — in the same meeting. If they can only produce one of the three, you do not have an evaluation strategy. You have a launch prayer.

Medical screening startup — the 99.9% trap

A seed-stage health AI pitched 99.9% accuracy detecting a disorder affecting 1 in 10,000 patients. A hospital diligence team calculated that a script printing 'healthy' for everyone would score 99.99% and beat the model. The pilot was cancelled before term sheet discussions. Founder lesson: always present performance on the cases you actually sell, not headline accuracy.

Concept check · 1 of 6

Your fraud model reports 99.5% accuracy. Fraud represents 0.5% of transactions. What should concern you most?

Vetted by Krishna Kumar, Curator, FactorBeam