AI Foundations for PMs
PM 01 · Chapter 4 of 7

Data & Labels — Why your data strategy is your AI strategy

~10 min essentials·20 min full·10 sections

Algorithms are commoditized; data is the moat. How training data and labels work, and why a product-driven data flywheel is your real AI defensibility.


Key takeaway

Proprietary data is the ultimate moat. Algorithms are rented; data is owned.

§4.1·~1 min

What is training data

Why data is the raw material and the moat

Key takeaway

Training data is the raw material that shapes the model; in a world of commoditized algorithms, proprietary data is the only defensible moat.

Why this matters for you

When an investor or executive asks what your AI competitive advantage is, you cannot say "we use a really advanced model." You must point to a proprietary dataset that your competitors cannot easily buy or scrape.

An executive asks you why the new AI feature feels so generic compared to a specialized competitor. You check the architecture and realize both companies are calling the exact same foundation model API. The difference isn't in the code; it is in the diet. Models are simply statistical mirrors reflecting the information they were fed. You are looking at a product failure caused entirely by a lack of unique, high-quality training data.

§4.2·~1 min

What are labels and annotations

The difference between raw data and learning signal

Key takeaway

Labels provide the ground truth target that the model uses to measure its mistakes during training; for a supervised task, raw data without labels provides no learning signal.

Why this matters for you

When engineering says they have a million rows of data to train on, you must immediately ask who labelled it. For a supervised task, a million rows of unlabelled data is just storage cost; the label is what actually teaches the model what to do.

You hand a junior engineer a spreadsheet containing 10,000 historical customer support emails and ask them to build a sentiment analysis model. A week later, they tell you the project is blocked. The emails are raw text; nowhere in the spreadsheet does it actually say which emails are "happy" and which are "angry." The model cannot learn to predict a category if it has never been shown a concrete example of that category. You have data, but you lack the instructional signal.

§4.3·~1 min

Supervised learning

Learning from labelled examples — the most common paradigm

Key takeaway

Supervised learning is the paradigm of training a model on clearly labelled examples so that it produces a specific, tightly bounded output, such as a category or a number.

Why this matters for you

When you need a predictable system with a narrowly defined job, like predicting churn or detecting fraud, supervised learning is usually the right choice. It is the paradigm where you specify most directly what the model is trying to achieve, and where its accuracy can be measured against known answers.

Your compliance team needs a system to flag potentially fraudulent wire transfers. They do not want the system to be creative, and they do not want it to discover new financial theories. They want it to look at a transaction and accurately output either a 1 or a 0. You achieve this rigid reliability by showing the model thousands of historical transactions that have been explicitly marked as 'fraud' or 'safe' by human auditors. This is supervised learning: teaching by explicit example.
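To make "teaching by explicit example" concrete, here is a minimal sketch in Python. It assumes a toy dataset where each transaction is reduced to two made-up features (amount in thousands of dollars, and whether the recipient is new). The feature names, the values, and the nearest-centroid method are illustrative assumptions, not a real fraud model.

```python
# Toy supervised learning: every training row carries a human-assigned label.
# Features and values are hypothetical; real fraud models use far richer inputs.

# (amount_in_thousands, is_new_recipient) -> label (1 = fraud, 0 = safe)
labelled_transactions = [
    ((0.2, 0), 0),
    ((1.5, 0), 0),
    ((0.8, 1), 0),
    ((45.0, 1), 1),
    ((60.0, 1), 1),
    ((38.0, 0), 1),
]


def train_centroids(rows):
    """Average the feature vectors of each class (nearest-centroid classifier)."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the new transaction."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


centroids = train_centroids(labelled_transactions)
print(predict(centroids, (52.0, 1)))  # large transfer to a new recipient
print(predict(centroids, (0.5, 0)))   # small transfer to a known recipient
```

The model can only ever answer 1 or 0 because those are the only labels it was shown. Remove the labels from `labelled_transactions` and `train_centroids` has nothing to average.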

§4.4·~1 min

Unsupervised learning

Finding structure in data with no labels

Key takeaway

Unsupervised learning finds hidden structures, clusters, and anomalies in raw, unlabelled data without human guidance.

Why this matters for you

When marketing asks you to "find interesting segments" in the customer database, unsupervised learning is the tool. But you must be prepared for the reality that the model will find mathematical clusters that might make zero logical sense to a human marketer.

You have a massive database containing millions of user clickstreams, but you have no idea how to categorize the behaviors. You cannot use supervised learning because you don't have predefined labels like "power user" or "churn risk." You just want the algorithm to group similar users together based on their actions. You are asking the model to discover the structure of the data without giving it an answer key. This is unsupervised learning: finding patterns in the dark.
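As a rough illustration of grouping users without an answer key, the sketch below runs a plain k-means clustering on a single made-up feature (sessions per week). The data, the choice of two clusters, and the deterministic starting points are all assumptions for the example.

```python
# Toy unsupervised learning: no labels, only behaviour. Values are hypothetical.

sessions_per_week = [1, 2, 2, 3, 18, 20, 22, 25]


def kmeans_1d(values, k=2, iterations=10):
    """Plain k-means on one feature (assumes k >= 2).

    Returns (centroids, cluster index for each value).
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Step 1: assign each value to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


centroids, assignments = kmeans_1d(sessions_per_week)
print(centroids)
print(assignments)
```

The algorithm returns "cluster 0" and "cluster 1", not "casual user" and "power user". Deciding whether a cluster means anything to the business is still a human job.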

§4.5·~1 min

Semi-supervised and self-supervised learning

How modern LLMs learned without human-labelled everything

Key takeaway

Self-supervised learning is the paradigm that unlocked large language models by using the structure of the raw data itself as the label, removing the need for human annotators during pre-training.

Why this matters for you

When you wonder how OpenAI trained a model on much of the public web without hiring a billion people to label it, this is the answer. Understanding self-supervision explains why foundation models possess such a massive, unconstrained breadth of knowledge.

You want to train a model that fundamentally understands the English language. If you use supervised learning, you have to pay linguists to manually label billions of sentences with their grammatical structure. The cost and time would be prohibitive. Instead, you feed the model a raw sentence, digitally hide the last word, and ask the model to guess it. The sentence itself provides the correct answer, completely bypassing the need for human intervention. The data is creating its own labels.
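The hide-and-guess idea can be sketched in a few lines of Python. This version produces a training pair at every word position rather than only the last one, and it splits on whitespace for simplicity; production language models work on sub-word tokens, so treat the function name and the word-level split as assumptions.

```python
def next_word_examples(sentence):
    """Turn one raw sentence into (context, target) training pairs.

    No human labels are involved: the target is simply the word that
    actually came next in the original text.
    """
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


for context, target in next_word_examples("the cat sat on the mat"):
    print(f"{context!r} -> {target!r}")
```

One six-word sentence yields five labelled examples for free, which is why raw text at web scale becomes such a large training set.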

§4.6·~1 min

Reinforcement learning from human feedback (RLHF)

How ChatGPT and Claude were made useful and safe

Key takeaway

RLHF is the alignment technique that transforms a chaotic, internet-trained autocomplete engine into a safe, helpful chatbot by scoring its outputs against human preferences.

Why this matters for you

When your base model is generating accurate but rude or dangerous answers, RLHF is the training-stage fix. It is how you teach a model subjective human values that cannot be captured by simple next-word prediction.

Your new medical chatbot is technically brilliant. It has ingested every medical journal on earth via self-supervised learning. But when a user asks for advice on a headache, the bot responds with a terrifying, highly technical lecture on brain tumors. The model is statistically correct, but behaviorally disastrous. You need a way to teach the model bedside manner, which is a subjective human preference that cannot be found in raw data. You must train the model to prioritize being helpful and safe. RLHF does this by having human raters compare candidate answers, training a reward model on those preferences, and then tuning the chatbot to score well against that reward model.
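A hedged sketch of what the human-preference signal can look like in practice: one comparison record, plus the pairwise loss commonly used to train a reward model on such comparisons. The record fields, the example answers, and the reward scores passed in are all hypothetical; a real reward model would compute the scores itself.

```python
import math

# One hypothetical human comparison: same prompt, two candidate answers.
preference = {
    "prompt": "I have a headache. What should I do?",
    "chosen": "Most headaches are harmless. Rest, drink water, and see a "
              "doctor if it persists or gets worse.",
    "rejected": "Headaches can be a symptom of brain tumours. Here is a "
                "technical overview of tumour pathology.",
}


def pairwise_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).

    The loss is small when the reward model scores the human-preferred
    answer higher, and large when it prefers the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Reward model agrees with the human rater: low loss.
print(round(pairwise_loss(2.0, -1.0), 4))
# Reward model disagrees with the human rater: high loss.
print(round(pairwise_loss(-1.0, 2.0), 4))
```

The label here is not "correct" or "incorrect" but "a person preferred this one", which is how a subjective quality like bedside manner becomes trainable.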

§4.7·~1 min

The labelling cost spectrum

From near-zero (web text) to extremely expensive (medical expert annotations)

Key takeaway

The unit economics of data acquisition depend heavily on the domain expertise required; capturing user clicks is nearly free, while hiring doctors to annotate MRIs can destroy your budget.

Why this matters for you

When engineering requests a budget for a massive data labelling project, you must model the cost. If the task requires specialized degrees, the labelling cost might exceed the total revenue potential of the feature.

A startup pitches you an AI tool that automatically highlights suspicious clauses in commercial real estate contracts. The architecture is standard, but the unit economics are terrifying. To train this supervised model, the startup cannot use cheap crowdsourced labor; they must hire corporate lawyers billing at $300 an hour to read and manually annotate thousands of dense documents. The cost of creating the ground truth data is so astronomically high that it threatens the viability of the entire business. You are looking at the brutal reality of the labelling cost spectrum.
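The cost model is simple arithmetic, sketched below. Only the $300 hourly rate comes from the scenario above. The document count (5,000), the 45 minutes per contract, and the $15 crowdsourced rate are assumptions chosen for illustration.

```python
def labelling_cost(num_documents, minutes_per_document, hourly_rate):
    """Total labelling cost in dollars for one annotation pass."""
    hours = num_documents * minutes_per_document / 60
    return hours * hourly_rate


# Assumed volume and effort; only the $300/hour rate is from the scenario.
lawyer_cost = labelling_cost(num_documents=5_000, minutes_per_document=45, hourly_rate=300)

# Hypothetical crowdsourced rate, shown only for comparison.
crowd_cost = labelling_cost(num_documents=5_000, minutes_per_document=45, hourly_rate=15)

print(f"Expert labelling: ${lawyer_cost:,.0f}")
print(f"Crowd labelling:  ${crowd_cost:,.0f}")
```

Under these assumptions the expert-labelled dataset costs twenty times more than the crowdsourced one, before any second-reviewer or re-labelling passes are counted.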

§4.8·~1 min

Data quality vs data quantity

Why 10,000 clean examples can beat 1 million noisy ones

Key takeaway

Feeding a model millions of noisy, inaccurate examples can actively degrade performance; a small, carefully cleaned dataset often beats a massive dataset of garbage.

Why this matters for you

When your team wants to scrape the entire internet to increase the training dataset size, push back. Enforce data hygiene before volume, because bad data degrades the model's behavior and is costly to undo once it has been trained in.

The data science team is excited. They just acquired an additional two million rows of customer interactions from a third-party data broker. They run the retraining pipeline, expecting a massive boost in accuracy. Instead, the model's performance drops by 15%. The new data was riddled with duplicates, formatting errors, and contradictory labels. The model faithfully learned the noise, actively forgetting the clean patterns it had previously mastered. You have just witnessed the reality that garbage in equals garbage out.
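A basic hygiene check before retraining might look like the sketch below. It catches two of the problems named in the scenario, duplicates and contradictory labels, on a handful of made-up rows. The rows, the normalisation rule (trim and lowercase), and the function name are assumptions.

```python
from collections import defaultdict

# Hypothetical purchased rows: (customer message, sentiment label).
rows = [
    ("Where is my refund?", "angry"),
    ("Where is my refund?", "angry"),      # exact duplicate
    ("where is my refund? ", "neutral"),   # same text, contradictory label
    ("Love the new dashboard", "happy"),
]


def audit(rows):
    """Group rows by normalised text; report duplicates and label conflicts."""
    labels_by_text = defaultdict(list)
    for text, label in rows:
        labels_by_text[text.strip().lower()].append(label)

    # Every extra copy of the same text counts as one duplicate.
    duplicates = sum(len(labels) - 1 for labels in labels_by_text.values())
    # Texts whose copies disagree on the label need human review.
    conflicts = [t for t, labels in labels_by_text.items() if len(set(labels)) > 1]
    # Keep one row per text, and only where the labels agree.
    clean = [(t, labels[0]) for t, labels in labels_by_text.items()
             if len(set(labels)) == 1]
    return duplicates, conflicts, clean


duplicates, conflicts, clean = audit(rows)
print(f"duplicates: {duplicates}, conflicting texts: {len(conflicts)}, clean rows: {len(clean)}")
```

Running a report like this on the broker's two million rows before retraining would have surfaced the problem as a number on a dashboard instead of a drop in production accuracy.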

§4.9·~1 min

The data flywheel

Why product usage generating training signal is your deepest moat

Key takeaway

A data flywheel is a product loop in which usage automatically generates new training data, which improves the model, which drives more usage, creating a compounding moat that is very hard to replicate.

Why this matters for you

If you design an AI feature that requires manual data exports and offline retraining, you have built a static tool. If you design the UX so that user actions seamlessly stream back as training labels, you have built an engine that competitors will struggle to catch.

A competitor launches an exact clone of your AI recommendation engine. They hired the same engineers, used the same algorithms, and copied your UI. But within six months, your product is vastly superior and they are going out of business. The difference is that your system learns from every single user interaction, retraining continuously on what users actually click. You have a data flywheel; every user action makes the product smarter for the next user, pulling away from the competition at an accelerating rate.

§4.10·~1 min

PM decision lens: designing your product to capture training signal from day one

The one architectural decision most PMs miss

Key takeaway

You must design UI friction that forces users to explicitly accept, reject, or modify AI outputs, because implicit data is noisy and explicit corrections are the gold standard for retraining.

Why this matters for you

If you surface an AI summary and provide no way for the user to rate it, edit it, or complain about it, you are flying blind. You must build the feedback mechanism into the core user flow before launch, or your model will stagnate on day two.

Your team ships a generative AI feature that automatically drafts weekly status reports for managers. The adoption numbers look great; users are opening the reports every week. But the engineering team has no idea if the model is actually doing a good job, because the UI only has a "Close" button. You know the users are looking at the output, but you have zero signal on whether the output was accurate, hallucinated, or completely useless. You have launched a product that is incapable of improving.
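One way to picture the missing feedback mechanism is as a small event record that the UI emits for every draft, plus a rule for turning each event into a training example. The sketch below is hypothetical: the `FeedbackEvent` fields and the action names "accept", "edit", "reject", and "close" are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    draft: str                          # what the model produced
    action: str                         # "accept", "edit", "reject", or "close"
    edited_text: Optional[str] = None   # the user's version, if they edited


def to_training_example(event):
    """Map one explicit user action to (target_text, is_positive), or None."""
    if event.action == "accept":
        return (event.draft, True)
    if event.action == "edit" and event.edited_text:
        # The user's correction becomes the new target output.
        return (event.edited_text, True)
    if event.action == "reject":
        return (event.draft, False)
    # A bare "close" tells us the report was opened, nothing more.
    return None


events = [
    FeedbackEvent("Sprint is on track.", "accept"),
    FeedbackEvent("Sprint is on track.", "edit", "Sprint is two days behind."),
    FeedbackEvent("Sprint is on track.", "close"),
]
for event in events:
    print(event.action, "->", to_training_example(event))
```

With only a "Close" button, every event falls through to the last branch, so the product collects usage counts but no training examples.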

As a PM: You must treat data acquisition as a primary product feature. If you wait until you need to train a model to figure out where your data is coming from, you will have nothing to train on.

Bloomberg — The 40-year data moat

Bloomberg spent decades building a massive, proprietary terminal network that captured the daily financial workflows of Wall Street. When they trained BloombergGPT, they didn't need a better algorithm; they had exclusive access to forty years of pristine financial data that no competitor could replicate.

Concept check · 1 of 12
Sort into categories

Match each setup to the learning paradigm it actually uses.

1M emails, each manually tagged spam/not-spam, used to train a classifier.
10M customer records clustered into behavioural segments with no predefined labels.
Billions of web pages where the model is trained to predict the next masked word.
Human raters score pairs of model outputs; a reward model is trained on their preferences.
A churn model trained on 18 months of accounts labelled 'churned' or 'retained'.
GPT-style pre-training that uses raw text as its own answer key.

Supervised

Unsupervised

Self-supervised

RLHF


Vetted by Krishna Kumar, Curator, FactorBeam