01

What Makes an AI Product Different

"Traditional software is a vending machine. AI software is a jazz musician — capable, but unpredictable."

Regular software is deterministic: the same input always produces the same output. You can write a unit test for every function. AI changes this fundamentally. The same prompt can produce different outputs each time. There's no single "correct" answer to test against. Building AI products requires a completely different philosophy: you're engineering for distributions of outputs, not fixed behaviors.

Probabilistic software: Software whose outputs are drawn from a probability distribution rather than computed deterministically — making traditional testing insufficient.
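One practical consequence: instead of asserting a single golden output, you sample the model several times and check that a *property* holds across the distribution. A minimal sketch, where `call_model` is a hypothetical stand-in for a real model call:

```python
import random

# Hypothetical stand-in for a real model call -- outputs vary run to run.
def call_model(prompt: str) -> str:
    templates = [
        "Quantum computing uses qubits to process information.",
        "Quantum computers exploit superposition for computation.",
        "Qubits let quantum machines explore many states at once.",
    ]
    return random.choice(templates)

def passes_property(output: str) -> bool:
    # No exact-match assertion -- check properties of the output instead:
    # non-empty, a single sentence, under 30 words.
    words = output.split()
    return 0 < len(words) < 30 and output.count(".") == 1

# Sample the distribution: require that outputs satisfy the property,
# not that any single output equals a golden answer.
outputs = [call_model("Summarize quantum computing in 1 sentence") for _ in range(20)]
pass_rate = sum(passes_property(o) for o in outputs) / len(outputs)
```

This "pass rate over samples" framing is the bridge from unit testing to the evals covered next.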
02

Evaluations (Evals)

Input: "Summarize quantum computing in 1 sentence" → Expected: concise, accurate, under 30 words
Input: "Translate 'hello' to French" → Expected: "Bonjour"
Input: "What is 17 × 8?" → Expected: "136"

"An AI feature without evals is a car without a speedometer — you're moving but you have no idea how fast or safely."

Evals are the AI equivalent of unit tests. You define a set of inputs and what "good" output looks like, then run them automatically to measure model performance. Evals can be rule-based (does the output contain X?), model-graded (have another AI judge the quality), or human-evaluated. Running evals before and after every model or prompt change is non-negotiable for production AI.

Evaluations (evals): Automated test suites that measure AI model performance on defined inputs against quality criteria, used to detect regressions and compare models or prompts.
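A rule-based eval like the cases above can be sketched as a tiny harness; `model` here is a hypothetical canned stand-in for a real LLM call:

```python
# Minimal rule-based eval harness sketch. `model` is a hypothetical stand-in;
# in practice this would call your LLM API.
def model(prompt: str) -> str:
    canned = {
        "Translate 'hello' to French": "Bonjour",
        "What is 17 x 8?": "136",
    }
    return canned.get(prompt, "")

EVAL_CASES = [
    # Each case pairs an input with a rule for what "good" looks like.
    {"input": "Translate 'hello' to French",
     "check": lambda out: out.strip() == "Bonjour"},
    {"input": "What is 17 x 8?",
     "check": lambda out: "136" in out},
]

def run_evals(model_fn, cases):
    results = [case["check"](model_fn(case["input"])) for case in cases]
    return sum(results) / len(results)  # pass rate: 1.0 means all passed

score = run_evals(model, EVAL_CASES)
```

Running this before and after every prompt or model change gives you a regression signal, exactly as with a unit-test suite.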
03

Deterministic vs Probabilistic

Temperature scale: 0.0 (deterministic) → 1.0 (creative)

"Temperature is the AI's creativity dial — at 0 it's a calculator, at 1 it's an improv comedian."

Temperature is the key parameter controlling AI output variability. At temperature 0, the model always picks the most probable next token — deterministic, consistent, great for factual tasks. At temperature 1, it samples from the full probability distribution — creative, varied, but less predictable. Most production systems use 0.2–0.5 to balance consistency with naturalness.

Temperature: A parameter (0.0 to 1.0+) controlling the randomness of AI token selection. Lower = more deterministic. Higher = more creative and variable.
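Under the hood, temperature divides the model's next-token scores (logits) before they are normalized into probabilities. A small self-contained sketch with toy logits shows why low temperature is near-deterministic:

```python
import math

def token_probabilities(logits, temperature):
    # Softmax with temperature: divide logits by T before normalizing.
    # T -> 0 concentrates probability mass on the top token;
    # T = 1 keeps the model's raw distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # toy next-token scores
cold = token_probabilities(logits, 0.1)          # near-deterministic
warm = token_probabilities(logits, 1.0)          # full distribution
```

At T=0.1 the top token takes essentially all the probability mass; at T=1.0 the other tokens keep a meaningful share, which is where output variety comes from.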
04

RAG — Retrieval-Augmented Generation

Step 1 — Query: the user asks "What is our refund policy?" and the query is vectorized for semantic search.
Step 2 — Retrieval: the knowledge base is searched; 2 chunks are found ([Refund Policy §3.2] and [FAQ: Returns]) and injected into the prompt as context.
Step 3 — Generation: the LLM answers using that context: "Based on our policy, refunds are available within 30 days with receipt." — grounded and accurate.

"RAG is how you give AI a memory it can actually trust — connected to your real data, not its training."

RAG (Retrieval-Augmented Generation) is the standard architecture for AI products that need to answer questions about specific knowledge. The process: (1) Convert documents to embeddings and store them in a vector database. (2) When a user asks a question, retrieve the most relevant document chunks. (3) Inject those chunks into the LLM prompt as context. (4) The LLM answers using the retrieved context. This reduces hallucination and keeps answers current.

RAG: An architecture that augments LLM generation with retrieved documents from a knowledge base, grounding responses in specific data rather than training data alone.
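The four steps can be sketched end to end with a toy in-memory index. This is a minimal sketch, not a production pipeline: `embed` uses word overlap as a stand-in for a real embedding model, and the final prompt is what would be sent to the LLM:

```python
import re

def embed(text: str) -> set:
    # Toy "embedding": a bag of lowercase words. Real systems use dense
    # vectors from an embedding model.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: set, b: set) -> float:
    # Jaccard overlap as a stand-in for cosine similarity between vectors.
    return len(a & b) / len(a | b) if a | b else 0.0

DOCS = [
    "Refunds are available within 30 days with receipt.",
    "Shipping takes 5 to 7 business days.",
    "Support is open Monday through Friday.",
]
INDEX = [(doc, embed(doc)) for doc in DOCS]    # step 1: index the documents

def retrieve(query: str, k: int = 1) -> list:
    qv = embed(query)                           # step 2: vectorize the query
    ranked = sorted(INDEX, key=lambda d: similarity(qv, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))        # step 3: inject chunks as context
    # Step 4: this prompt would be sent to the LLM for a grounded answer.
    return f"Context:\n{context}\n\nQuestion: {query}"

top = retrieve("Are refunds available within 30 days?")
```

Swapping in a real embedding model and vector database changes the storage and similarity function, but not the shape of the pipeline.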
05

Vector Databases and Embeddings


"A vector database doesn't store your data — it stores the meaning of your data."

Embedding models convert text into a high-dimensional numerical vector (an embedding) that encodes semantic meaning. Similar texts get similar vectors. A vector database stores millions of these embeddings and can instantly find the ones most semantically similar to a query — powering RAG, recommendations, and semantic search. Popular options: Pinecone, Weaviate, Chroma, pgvector.

Embeddings: Fixed-length numerical vectors that represent the semantic meaning of text, where similar meanings produce similar vectors — enabling semantic search and retrieval.
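"Similar vectors" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their lengths. 1.0 = same direction, 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical toy embeddings -- invented for illustration only.
VECTORS = {
    "dog":   [0.90, 0.10, 0.00],
    "puppy": [0.85, 0.15, 0.05],
    "car":   [0.05, 0.90, 0.20],
}

def nearest(word: str) -> str:
    query = VECTORS[word]
    others = [(w, cosine(query, v)) for w, v in VECTORS.items() if w != word]
    return max(others, key=lambda t: t[1])[0]

neighbor = nearest("dog")
```

A vector database does exactly this nearest-neighbor search, but over millions of vectors with specialized indexes instead of a linear scan.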
06

AI Agents

The agent loop: Perceive → Reason → Act → Observe → repeat.

"An agent isn't a smarter chatbot — it's an AI that can take actions in the world and adapt based on results."

AI agents use the agent loop: Perceive (observe the environment), Reason (plan what to do), Act (take an action with a tool), Observe (check the result), and repeat. Unlike single-turn LLM calls, agents persist across multiple steps and can use tools — browsing the web, running code, querying databases, sending emails. This makes them capable of completing complex, multi-step tasks autonomously.

AI agent: An AI system that autonomously perceives its environment, plans actions, uses tools to affect the world, and iterates until a goal is achieved.
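The loop above can be sketched with a toy environment. This is a deliberately simplified illustration: the "environment" is a counter the agent must raise to a goal, and the single "tool" is an increment action:

```python
# Minimal agent-loop sketch over a toy environment. In a real agent the
# Reason step is an LLM call and Act invokes real tools (web, code, APIs).
def run_agent(goal: int, max_steps: int = 10) -> int:
    state = 0
    for _ in range(max_steps):
        observation = state               # Perceive: read the environment
        if observation >= goal:           # Reason: is the goal already met?
            return state
        action = "increment"              # Reason: choose the next action
        if action == "increment":         # Act: apply a tool to the world
            state += 1
        # Observe: the loop repeats with the updated state
    return state                          # give up after max_steps (a safety cap)

result = run_agent(goal=3)
```

The `max_steps` cap matters in practice: without it, an agent that never satisfies its goal check loops (and spends tokens) forever.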
07

Reliability and Observability

Example monitoring dashboard:
Response latency: 420 ms
Error rate: 1.2%
Hallucination rate: 3.8% — an alert fires if this exceeds the 5% threshold

"You can't improve what you can't measure — and in AI products, you need to measure constantly."

Production AI systems fail in new ways: they hallucinate, produce inconsistent outputs, get slower under load, and degrade as the underlying model changes. Observability means logging every AI interaction, monitoring latency and error rates, tracking output quality metrics over time, and setting up alerts for anomalies. Tools like LangSmith, Arize, and Weights & Biases help teams see what's happening inside their AI pipelines.

AI observability: The practice of monitoring AI system inputs, outputs, latency, and quality metrics in production to detect failures, regressions, and unexpected behaviors.
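The core pattern — log every interaction, compute rolling metrics, alert on thresholds — can be sketched in a few lines. This is a minimal illustration, not a substitute for tools like LangSmith or Arize; the 5% threshold matches the dashboard example above:

```python
# Minimal observability sketch: log each AI interaction, compute metrics,
# and fire an alert when a quality threshold is crossed.
LOG = []
HALLUCINATION_THRESHOLD = 0.05  # 5%

def record(prompt, response, latency_ms, hallucinated):
    LOG.append({"prompt": prompt, "response": response,
                "latency_ms": latency_ms, "hallucinated": hallucinated})

def metrics():
    n = len(LOG)
    return {
        "avg_latency_ms": sum(e["latency_ms"] for e in LOG) / n,
        "hallucination_rate": sum(e["hallucinated"] for e in LOG) / n,
    }

def should_alert():
    return metrics()["hallucination_rate"] > HALLUCINATION_THRESHOLD

# Simulated traffic: 1 hallucination in 10 calls -> 10% rate, above threshold.
for i in range(10):
    record(f"q{i}", f"a{i}", latency_ms=420, hallucinated=(i == 0))
```

In production, `hallucinated` itself would come from an automated check — often a model-graded eval run against each logged response.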
08

Designing for Failure

Hallucination — the AI stated incorrect facts confidently. UX solution: show a confidence score and citations; add an "I'm not fully sure about this" disclaimer below low-confidence answers.
Timeout — the response took too long and failed. UX solution: show a skeleton loader with progress hints; offer a "Retry" button and a shorter follow-up prompt suggestion.
Off-topic response — the AI ignored the actual task. UX solution: redirect gracefully: "That seems outside my scope. Let me connect you with a human specialist." An escalation path is key.

"AI products that fail gracefully feel trustworthy. AI products that fail silently destroy trust forever."

Great AI product design anticipates failure. For hallucinations: show confidence scores, add citations, or surface an "I'm not sure about this" disclaimer. For timeouts: show a loading state and offer a retry. For off-topic responses: redirect gracefully. The best AI products are designed with the assumption that the AI will sometimes be wrong, and the UX handles that case with care.

Graceful degradation: Designing AI product UX to handle model failures, low-confidence outputs, and edge cases in ways that maintain user trust rather than breaking the experience.
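The hallucination pattern can be sketched as a simple rendering rule. This assumes the model returns an answer plus a confidence score and citations — a hypothetical response shape, since real APIs expose confidence in different ways (log probabilities, grader scores, etc.):

```python
# Confidence-based rendering sketch. The 0.7 floor is an arbitrary
# illustrative choice, not a recommended value.
CONFIDENCE_FLOOR = 0.7

def render(answer: str, confidence: float, citations: list) -> str:
    if confidence >= CONFIDENCE_FLOOR and citations:
        # High confidence with sources: show the answer and cite them.
        return f"{answer} (sources: {', '.join(citations)})"
    if confidence >= CONFIDENCE_FLOOR:
        return answer
    # Low confidence: degrade gracefully instead of asserting a
    # possible hallucination as fact.
    return f"I'm not fully sure about this: {answer}"

confident = render("Refunds take 30 days.", 0.92, ["Policy 3.2"])
hedged = render("Refunds take 30 days.", 0.41, [])
```

The key design choice is that the low-confidence branch changes the *framing*, not just the styling — the user is told explicitly that the answer may be wrong.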
09

The AI PM / Designer Role

Six core skills: Prompting, Evals, UX, Data, Ethics, Comms.

"The best AI builders are the ones who understand the technology deeply enough to know where it breaks."

Working on AI products requires a unique blend of skills. PMs and designers need to understand context windows and their limits, how to write and evaluate prompts, basic eval methodology, the difference between deterministic and probabilistic systems, and ethical implications of their design choices. The most valuable skill: knowing when not to use AI, and when to add human oversight.

AI product skills: The cross-functional capabilities needed to build AI features — including prompt design, eval methodology, UX for uncertainty, and understanding model capabilities and limits.
10

Real-World Case Studies

Netflix
AI use case: Personalized thumbnail selection — the same movie shows different artwork to different users based on their viewing history.
Key challenge: Measuring artwork effectiveness at scale across 200M+ users without running too many A/B tests in parallel.
Approach: Contextual bandits plus evals on click-through rate. AI augments; humans curate the creative assets.

Spotify
AI use case: Discover Weekly — personalized playlists, plus LLMs for playlist narration and mood-based recommendations.
Key challenge: Balancing collaborative filtering (what others like) with individual taste without creating filter bubbles.
Approach: Hybrid model: collaborative filtering + audio embeddings + LLM-generated context. Weekly refresh cycle.

Linear
AI use case: Auto-label issues, suggest duplicates, and draft summaries — AI as a co-pilot for engineering teams.
Key challenge: Avoiding false positives in duplicate detection — a missed duplicate wastes time; a wrong merge loses data.
Approach: Suggestions only — AI surfaces possible duplicates and humans confirm. Always with human oversight.

"Every AI product success story is really a story about good evals, good data, and the courage to ship incrementally."

Netflix uses AI for personalized thumbnail selection — the same movie shows different artwork to different users based on their viewing patterns. Spotify's "Discover Weekly" uses collaborative filtering plus LLMs for playlist narration. Linear uses AI to auto-label issues and suggest duplicates. What these share: well-defined evals, A/B testing, and AI augmenting human workflows rather than replacing them.

Incremental AI deployment: Shipping AI features progressively — starting with suggestions or co-pilot modes before moving to automation — to build trust, collect data, and catch failures safely.

You've finished AI Products!

You now understand evals, RAG, agents, observability, and how to design AI products that users can trust.

Continue: Claude Tips →