Model Evaluation — Why "accuracy" alone will mislead you every time
Accuracy hides models that miss rare events. Choose precision, recall, F1 or AUC based on the cost of false positives vs false negatives — it's a product call.
Key takeaway
The evaluation metric is a product decision. If you leave it to engineering, they will optimize for mathematical elegance instead of commercial viability.
Why accuracy is almost always the wrong metric
The cancer screening model that gets 99% accuracy and is completely useless
Key takeaway
Accuracy is a deceptive metric in highly imbalanced datasets, often masking models that do absolutely nothing useful.
Why this matters for you
When a vendor or internal team boasts about a 99% accurate model, your immediate reflex must be to ask what the baseline distribution of the data is. If 99% of the dataset belongs to one category, the model may not be smart at all; it may simply be predicting the majority class every time.
Your data science team bursts into your office to celebrate. They just built a cancer screening model that achieves 99% accuracy on historical patient data. You deploy it to a pilot hospital, only to discover that the model outputs "No Cancer" for every single patient it sees. Because only 1% of the patients in the historical dataset actually had cancer, the model reached 99% accuracy without detecting a single case. The metric looked flawless on a dashboard, but the product was functionally useless and actively dangerous. You have just learned why accuracy is a trap.
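The arithmetic behind this scenario can be checked in a few lines. This is a minimal sketch on a made-up dataset of 1,000 patients with a 1% cancer rate (the dataset size is an assumption; only the 1% rate comes from the scenario above). It compares accuracy with the share of real cases the model catches.

```python
# Hypothetical dataset: 1,000 patients, 1% with cancer (1 = cancer, 0 = no cancer).
labels = [1] * 10 + [0] * 990

# A "model" that outputs "No Cancer" for every patient.
predictions = [0] * len(labels)

# Accuracy: share of all predictions that match the true label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Share of real cancer cases the model actually caught.
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
caught_share = caught / sum(labels)

print(f"accuracy     = {accuracy:.2%}")      # accuracy     = 99.00%
print(f"cases caught = {caught_share:.2%}")  # cases caught = 0.00%
```

The majority-class baseline is the number to beat: any model whose accuracy is not clearly above it has not demonstrated that it learned anything.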
False positives vs false negatives
The two ways a model can be wrong — and why they have different costs
Key takeaway
A model makes two distinct types of errors—crying wolf (false positive) and missing the wolf (false negative)—and their business costs are almost never equal.
Why this matters for you
When defining the success criteria for a model, you must map out the financial and UX cost of a false positive versus a false negative. If you leave this to engineering, the default training objective will usually treat both errors as equally costly.
You are designing the AI logic for an autonomous vehicle's emergency braking system. If the model sees a shadow and mistakenly thinks it's a pedestrian, it slams the brakes for no reason. If the model sees an actual pedestrian and mistakenly thinks it's a shadow, it fails to brake at all. These are both "errors" in the model's loss function, but in the real world, one causes a minor annoyance and the other causes a catastrophic fatality. The cost of the mistakes is massively asymmetrical.
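One way to make the asymmetry concrete is to price each error type and compare models by total cost rather than by error count. The sketch below uses entirely hypothetical error counts and cost units; `total_error_cost` is an illustrative helper, not something from the lesson.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a model's mistakes, pricing each error type separately."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Hypothetical cost units: a needless brake is cheap, a missed pedestrian is not.
COST_FP = 50
COST_FN = 1_000_000

# Hypothetical model A: brakes too often, almost never misses.
cost_a = total_error_cost(false_positives=200, false_negatives=1,
                          cost_fp=COST_FP, cost_fn=COST_FN)

# Hypothetical model B: far fewer errors overall, but more misses.
cost_b = total_error_cost(false_positives=20, false_negatives=5,
                          cost_fp=COST_FP, cost_fn=COST_FN)

print(f"model A: 201 errors, cost {cost_a:,}")  # model A: 201 errors, cost 1,010,000
print(f"model B: 25 errors, cost {cost_b:,}")   # model B: 25 errors, cost 5,001,000
```

Model B makes about an eighth as many mistakes yet costs roughly five times more, which is exactly what a raw error count hides.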
Precision explained
Of everything the model flagged, how much was actually correct
Key takeaway
Precision measures the trustworthiness of the model's alerts; it answers the question, "Of everything the model flagged, how much was actually correct?"
Why this matters for you
When user attention or operational time is highly expensive, you must optimize for precision. If a model generates too many false alarms, users will simply turn it off or learn to ignore it completely.
Your operations team asks for an AI tool to automatically flag customer accounts for manual review when they look like money laundering. You deploy a model, and the next day the operations team is furious. The model flagged 10,000 accounts for review, but only 10 of them were actually money launderers, a precision of 0.1%. The team wasted thousands of hours reviewing 9,990 legitimate accounts, completely destroying the operational efficiency the AI was supposed to create. The model caught the bad guys, but its precision was so abysmal that the tool was useless.
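Precision is true positives divided by everything the model flagged. A minimal sketch using the counts from the money-laundering scenario above (the `precision` helper name is illustrative):

```python
def precision(true_positives, false_positives):
    """Share of flagged items that were actually correct."""
    flagged = true_positives + false_positives
    # If nothing was flagged, precision is undefined; 0.0 is returned here by convention.
    return true_positives / flagged if flagged else 0.0

# 10,000 accounts flagged: 10 real launderers, 9,990 legitimate customers.
p = precision(true_positives=10, false_positives=9_990)
print(f"precision = {p:.2%}")  # precision = 0.10%
```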
Recall explained
Of everything that should have been caught, how much did the model actually catch
Key takeaway
Recall measures the model's ability to sweep the board; it answers the question, "Of everything that should have been caught, how much did the model actually catch?"
Why this matters for you
When the business cost of missing a rare event is catastrophic or legally fatal, you must optimize for recall. You are actively choosing to tolerate false alarms to ensure nothing slips through the cracks.
Your company builds an AI system that scans x-rays to detect early-stage lung tumors. During testing, the model achieves 95% precision, meaning that when it flags a tumor, it is almost always right. However, doctors soon realize that the model is missing 40% of the actual tumors in the dataset, so its recall is only 60%. The model is incredibly trustworthy when it speaks, but its silence is deadly because it fails to catch two in every five patients who need treatment. The model's recall is unacceptably low.
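Recall is true positives divided by everything that should have been caught. The sketch below uses hypothetical counts chosen to reproduce the scenario's 95% precision and 40% miss rate; the counts themselves are assumptions.

```python
def recall(true_positives, false_negatives):
    """Share of real positives that the model actually caught."""
    actual_positives = true_positives + false_negatives
    # With no real positives, recall is undefined; 0.0 is returned here by convention.
    return true_positives / actual_positives if actual_positives else 0.0

# Hypothetical test set: 190 real tumors.
true_positives = 114   # tumors the model flagged
false_negatives = 76   # tumors the model missed
false_positives = 6    # healthy scans flagged by mistake

model_recall = recall(true_positives, false_negatives)
model_precision = true_positives / (true_positives + false_positives)

print(f"precision = {model_precision:.0%}")  # precision = 95%
print(f"recall    = {model_recall:.0%}")     # recall    = 60%
```

The two numbers use different denominators (everything flagged versus everything real), which is why a model can score well on one and badly on the other.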
The precision-recall tradeoff
Why you can't maximise both — and how to choose
Key takeaway
Precision and recall pull against each other: lowering the model's decision threshold catches more targets (higher recall) but usually generates more false alarms (lower precision).
Why this matters for you
For a given model, engineering cannot maximize both metrics simultaneously. Your primary strategic job during model evaluation is to look at the business unit economics and explicitly choose the exact point on the tradeoff curve where the model will operate.
The engineering lead shows you a chart with a curve on it. She asks, "Do you want us to catch 90% of the fraud with half of our alerts being false alarms, or do you want us to catch 60% of the fraud with only 5% of our alerts being false alarms?" She is asking you to make a product decision about the precision-recall tradeoff. The model's underlying intelligence is fixed; what you are deciding is the mathematical threshold at which the model is allowed to pull the trigger.
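The tradeoff can be seen by sweeping the threshold over one fixed set of model scores. This is a minimal sketch on ten made-up transactions; the scores, labels, and the `precision_recall_at` helper are all illustrative assumptions.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every item scoring at or above the threshold, then measure both metrics."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    true_positives = sum(flagged)
    # With nothing flagged there are no false alarms; 1.0 is used here by convention.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / sum(labels)
    return precision, recall

# Hypothetical model scores (higher = more suspicious) and true labels (1 = fraud).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    1,    0]

for threshold in (0.85, 0.45, 0.15):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

The scores never change between rows of output; only the cut-off does. On this toy data, each step down in threshold raises recall and lowers precision.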
F1 score
When to use it, when it hides the truth
Key takeaway
The F1 score is a single, blended metric that balances precision and recall, useful for high-level model comparisons but dangerous when business costs are asymmetrical.
Why this matters for you
When data scientists report that a new model is "better" because its F1 score increased, you must unpack the metric. F1 weights precision and recall equally, which rarely matches the real costs of false positives and false negatives in a product.
Your data science team proposes replacing the current production model. The old model has 90% precision and 40% recall. The new model has 65% precision and 65% recall. They argue the new model is mathematically superior because its F1 score is higher (0.65 versus roughly 0.55). If you blindly accept the F1 score, you might deploy a model that drastically increases the number of false alarms your users experience, destroying trust. You must look past the blended average to see how the errors actually shifted.
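F1 is the harmonic mean of precision and recall. The sketch below computes it for the two models in the scenario, and also computes the F-beta generalization, which the lesson itself does not mention; it is included here only as one common way to encode unequal weights. With beta = 0.5 (precision weighted more heavily), the ranking of the two models flips.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score. beta=1 is F1; beta<1 favors precision; beta>1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

old_model = (0.90, 0.40)  # (precision, recall) from the scenario
new_model = (0.65, 0.65)

for name, (p, r) in [("old", old_model), ("new", new_model)]:
    print(name, f"F1={f_beta(p, r):.2f}", f"F0.5={f_beta(p, r, beta=0.5):.2f}")
# old F1=0.55 F0.5=0.72
# new F1=0.65 F0.5=0.65
```

Which model is "better" depends on the weighting, and the weighting is a statement about business costs, not about the model.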
AUC-ROC
The model comparison metric — what it tells you and what it doesn't
Key takeaway
AUC-ROC evaluates a model's ability to separate the good from the bad across all possible thresholds, making it a standard metric for comparing different algorithms independently of any one threshold.
Why this matters for you
When you are deciding whether to upgrade from a simple linear model to a massive deep learning model, AUC is the metric that indicates whether the new architecture is actually better at ranking positives above negatives.
The engineering team wants to spend $50,000 in compute to upgrade your fraud system from a simple decision tree to a deep neural network. You ask for proof that it's worth it. They don't show you precision or recall, because those depend on where the threshold is set. Instead, they show you that the Area Under the Curve (AUC) jumped from 0.75 to 0.92. This is strong evidence that, across the range of possible thresholds, the new neural network is better at separating the fraudulent transactions from the legitimate ones. You still need to check precision and recall at the threshold you actually plan to ship.
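AUC-ROC has a useful plain-language reading: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it that way, by brute-force pair counting, for two hypothetical models scored on the same eight transactions. The data and the `roc_auc` helper are illustrative; in practice a library routine would be used instead.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = fraud) and scores from two models on the same items.
labels     = [1,   1,   1,   1,   0,   0,   0,   0]
old_scores = [0.9, 0.6, 0.4, 0.3, 0.7, 0.5, 0.2, 0.1]
new_scores = [0.9, 0.8, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]

print(f"old model AUC = {roc_auc(old_scores, labels):.4f}")  # old model AUC = 0.6875
print(f"new model AUC = {roc_auc(new_scores, labels):.4f}")  # new model AUC = 0.9375
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. No threshold appears anywhere in the calculation.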
Train set, validation set, test set
Why you need three buckets of data and what each one is for
Key takeaway
A dataset must be strictly partitioned into three isolated buckets—learning, tuning, and the final exam—to ensure the model is actually generalizing rather than just memorizing the answers.
Why this matters for you
If an engineer reports a 99% accuracy metric without explicitly confirming it was run on a quarantined "test set," you must treat the metric as unverified. A model tested on its own training data is taking an open-book exam.
Your team is building an AI to predict stock market movements. They train the model on data from 2018 to 2022. They test the model on data from 2021, and the metrics are staggering: it predicts market crashes perfectly. You deploy the model, and it immediately loses millions of dollars. The model didn't learn the underlying mechanics of the stock market; it simply memorized the historical timeline. Because the test data was included in the training data, the model had already seen the answers before taking the test.
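For time-ordered data like the stock example, one simple safeguard is to split by date so the test period lies strictly after everything used for training and tuning. This is a minimal sketch; the yearly rows, the cut-off dates, and the `chronological_split` helper are hypothetical.

```python
from datetime import date

def chronological_split(rows, train_end, validation_end):
    """Split (date, value) rows so that train < validation < test in time."""
    train = [r for r in rows if r[0] <= train_end]
    validation = [r for r in rows if train_end < r[0] <= validation_end]
    test = [r for r in rows if r[0] > validation_end]
    return train, validation, test

# Hypothetical data: one observation per year, 2018 through 2023.
rows = [(date(year, 6, 30), f"observation-{year}") for year in range(2018, 2024)]

train, validation, test = chronological_split(
    rows,
    train_end=date(2021, 12, 31),       # learn on 2018-2021
    validation_end=date(2022, 12, 31),  # tune on 2022
)                                       # final exam on 2023

# Leakage check: no row may appear in more than one bucket,
# and the test period must start after training ends.
assert not (set(train) & set(test)) and not (set(validation) & set(test))
assert max(d for d, _ in train) < min(d for d, _ in test)

print(len(train), len(validation), len(test))  # 4 1 1
```

In the scenario above, the 2021 test data sits inside the 2018 to 2022 training window, so the second assertion is the check that would have failed.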
Evaluation in production vs offline
Why a model that aced the test set fails in the real world
Key takeaway
Offline metrics prove a model works mathematically; online metrics prove the model works commercially. A model that aces the test set can still fail the business.
Why this matters for you
You must never launch an AI feature based solely on offline F1 or AUC scores. You must define the online, product-level tracking metrics (conversion rate, churn, latency) that will determine if the feature stays in production.
The offline metrics for the new AI search ranking algorithm are spectacular. The test set proves the model is returning highly relevant results with pristine precision and recall. You push the model to 10% of live users via an A/B test. Three days later, the analytics team reports that revenue in the test cohort has dropped by 8%. The offline metrics were mathematically perfect, but the model took 200 milliseconds longer to generate the results, and that latency caused users to abandon their carts. The offline test set completely missed the UX impact.
PM decision lens: choosing your metric from the business problem, not the model
The cost asymmetry question that determines everything
Key takeaway
The PM dictates the metric based on the cost asymmetry of the business problem; the engineers optimize the architecture to hit that metric.
Why this matters for you
If you ask engineering to "build the best model," they will default to optimizing accuracy or F1. If that default metric misaligns with your unit economics, you will launch a product that technically works but destroys business value.
You are kicking off a project to build an automated content moderation system for a new social network. The engineering lead asks you what the target metric is. If you say, "Make it as accurate as possible," you have failed. You must translate the business strategy into a mathematical target. You realize that letting a violent video slip through (false negative) will cause a PR disaster, while accidentally taking down a safe video (false positive) just requires a quick human review. You tell engineering: "Optimize for 99% recall; we will eat the cost of the false positives."
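A recall target like this translates directly into a threshold choice: take the strictest threshold that still meets the target, then report the precision you have to live with at that point. The sketch below does this on ten made-up videos; the data and the `strictest_threshold_for_recall` helper are illustrative assumptions, and a real system would choose the threshold on a validation set.

```python
def strictest_threshold_for_recall(scores, labels, target_recall):
    """Return (threshold, precision) at the highest threshold meeting the recall target."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("recall is undefined without any positive examples")
    # Try each distinct score as a threshold, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_positives = sum(flagged)
        if true_positives / total_positives >= target_recall:
            return threshold, true_positives / len(flagged)
    # Unreachable when positives exist: the lowest threshold flags everything.
    raise ValueError("no threshold meets the recall target")

# Hypothetical moderation scores (higher = more likely violent) and labels (1 = violent).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    1,    0]

for target in (0.8, 1.0):
    threshold, precision = strictest_threshold_for_recall(scores, labels, target)
    print(f"recall >= {target:.0%}: threshold {threshold:.2f}, precision {precision:.2f}")
```

On this toy data, moving the target from 80% to 100% recall forces the threshold down and the precision with it. That drop in precision is the cost of false positives that the product owner has agreed to absorb.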
Real product examples
Medical Diagnosis Trap
A startup pitched a 99.9% accurate model for detecting a rare genetic disorder. Hospital buyers quickly realized the disorder only occurs in 1 in 10,000 people. A simple hardcoded script that prints "Healthy" achieves 99.99% accuracy, proving the startup's metric was completely meaningless for clinical use.

Vetted by Krishna Kumar, Curator, FactorBeam

