MTG Commander AI

A RAG pipeline turned 32,807 podcast transcript chunks into 41,790 structured insights, then Claude Sonnet cross-checked every one against real card text and deleted 5,236 it had invented.

Live site

Why this stack

The backend and knowledge pipeline are Python: the raw material is 33K+ cards from Scryfall plus tens of thousands of podcast and YouTube transcript chunks, and Python's ecosystem for batch LLM calls, embeddings, and direct Postgres access fit that better than anything else considered. The web frontend is SvelteKit with the Vercel AI SDK, not Next.js. The project's own feature-branch folder is literally named 026-nextjs-frontend, a name left over from the initial plan before SvelteKit's smaller client bundle and native streaming primitives won the actual build. The 'Vercel' in the SDK's name is just a library brand, not the deploy target: both the SvelteKit frontend and the Python backend deploy to Railway via a Node adapter, not to Vercel. Supabase Postgres with pgvector holds the 33K-card embedding index and the structured-insight knowledge base in the same database that already handles auth and conversation storage, instead of standing up a separate vector store for 1 extra table. FastMCP exposes the deck-building tool surface (search_cards, find_similar_cards, analyze_cube, build_commander_deck, and more) as the same interface any MCP client, not just this web app, could connect to directly.

AI-assist note

It's spec-driven, built with Claude Code across 164 specs (spec, plan, tasks, and research docs per feature, with a dedicated research pass on the trickier ones like the hallucination cleanup and the tool-routing audit). I wrote every spec and reviewed every diff; Claude Code wrote most of the implementation and tests. The RAG grounding fix, the tool-routing consolidation, and the conversation-history hardening below all came out of that same spec-plan-implement-review loop, not a rewrite from scratch.

Stack

Python 3.12
FastMCP
Anthropic Claude (Opus for extraction, Sonnet Batch API for validation)
OpenAI text-embedding-3-small
Supabase (Postgres + pgvector + Auth)
SvelteKit + Vercel AI SDK v6
Mana Pool (affiliate) + Patreon + Ko-fi
PostHog (product analytics)
Railway

Domains

Agentic AI
RAG & Knowledge Systems
AI Safety & Data Quality
Systems Reliability
Full-Stack Product Engineering

Live4 mo

Users570 registered users, 2,271 decks built, 2,490 chat conversations (Supabase, Jul 2026)

PaymentsFree forever; Patreon (1 patron at $20/mo) + Ko-fi tips + Mana Pool affiliate

InfraPython 3.12, Supabase Postgres + pgvector, FastMCP, SvelteKit + Vercel AI SDK v6, Railway

AuthSupabase Auth (Google, Discord, Twitch, email) + RLS per user

CommunityDiscord (4 Patreon-tier roles, auto-synced on pledge/cancel)

Specs164

Why this exists

A friend of mine, Wes, was building a homebrew Magic: The Gathering cube by hand. He had 10 defined archetypes and a card list, but no easy way to check whether his black-green reanimator support was actually deep enough or just felt deep. Commander deckbuilding has the same problem at a bigger scale. EDHREC gives you popularity data, not the reasoning a good player would give you for why a specific card fits your specific 99. mtgcommander.ai is a conversational assistant that reads real card data (33K+ Scryfall cards) and real expert reasoning (transcript chunks from 10+ podcast and YouTube shows), builds full Commander decks against a stated budget and bracket, and hands the actual purchase off to Mana Pool through an affiliate link. It runs as a live, free product today, not a prototype, funded by 1 Patreon patron at $20/month plus Ko-fi tips instead of a paywall.

Architecture

Card data, podcast and YouTube transcripts, and cube or deck imports feed a Python pipeline that chunks, embeds, and extracts structured insights, validating every mechanic claim against the card’s own oracle text before anything is stored. All of it lives in 1 Supabase Postgres database with pgvector, alongside auth and conversation history, not a separate vector store. A FastMCP tool server exposes that knowledge base as discrete, individually callable tools (the same tools a raw MCP client like Claude Desktop could call directly), and a SvelteKit chat frontend running the Vercel AI SDK orchestrates those tools through phase-based routing informed by a real usage audit. The product is free forever, funded by optional Patreon support (1 patron at $20/month) and Ko-fi tips instead of a paywall, and when a user is ready to buy the deck, the purchase itself completes on Mana Pool through an affiliate link, not inside this app.

What shipped

By spec 164 the platform had grown from a favor for Wes’s cube into a real free product: a Mana Pool affiliate checkout handoff, a Patreon-synced Discord community, and a knowledge base that validates itself before storing anything new. The hallucination cleanup alone processed 33,040 insight-card pairs, deleted 5,236 confirmed hallucinations, and left the extraction pipeline validating every new insight against oracle text before storage, not just the backfill. The tool-routing audit replaced guesswork with 293 real tool calls across 50 conversations, consolidating or rerouting 12 tools that had 0 calls in the sample. All of these fixes came out of the same spec, plan, tasks, and research loop with Claude Code, not a rewrite from scratch.

The extraction pipeline still runs as a batch process, not real time. That’s deliberate: validating a new insight against a card’s oracle text before it ever reaches a user matters more than shipping it a few minutes faster.

Skill stories

Click a skill to open the story behind it: the decision, what broke, how it got measured, and how it got fixed.

RAG Knowledge PipelineRAG & Knowledge Systems
Insight Hallucination Diagnosis and CleanupAI Safety & Data Quality
MCP Tool Server and Usage-Driven RoutingAgentic AI
Persisted Conversation HistorySystems Reliability

RAG & Knowledge Systems

RAG Knowledge Pipeline

Decision: The knowledge base started as 32,807 raw podcast and YouTube transcript chunks (500-1000 tokens each, embedded and stored in Supabase pgvector) retrieved by plain similarity search. I added a structured extraction layer on top: a Claude pipeline that discards pure noise (sponsor reads, show intros, off-topic tangents) and extracts card-linked insights tagged with color identity, archetype, and a recency era marker, with any creator or show attribution stripped before storage.
What broke: Raw-chunk similarity search alone wasn't good enough to ship. A query about a specific commander returned transcript fragments rambling about a different, unrelated commander, a sponsor read, and a personal anecdote, because vector similarity over raw transcript noise has no way to separate signal from filler.
How I measured it: Compared raw-chunk retrieval against the new structured-insight retrieval side by side for the same card queries, and tracked the share of the 32,807 chunks that actually yielded a usable insight versus discardable noise, plus the card-name resolution rate against the existing card database.
How I fixed it: The extraction pass turned the 32,807 raw chunks into 41,790 structured, card-linked insights, each tagged with color identity and an archetype label, with source attribution stripped before it ever reaches the database. Recency weighting (post-2024 bracket-system content ranked above older meta advice) replaced a flat similarity score, so a 2020 evaluation of a since-banned card doesn't outrank a 2026 one.

AI Safety & Data Quality

Insight Hallucination Diagnosis and Cleanup

Decision: An earlier pilot validator (a Graphiti spike) had already caught 1 confirmed hallucination: an insight claiming a card had an ability it doesn't actually have. That pilot flagged 24% of insights with many false positives, since it was regex matching a mechanic keyword anywhere near a card name. I replaced it with a Claude Sonnet Batch API validator that reads each insight next to the actual card's oracle text and keywords, and labels it VALID, HALLUCINATION, or OPINION instead of a binary pass/fail.
What broke: The regex validator couldn't tell 'Card X has flying' (a false claim to catch) from 'Card X enables landfall strategies' (a true strategic observation to keep), because it had no word boundaries, no proximity window, and no attribution-verb check. It matched 'ward' inside 'reward' as often as it caught a real hallucination.
How I measured it: Submitted 33,040 insight-card pairs as a Sonnet Batch API job (about $15) and tallied the 3 labels, then manually reviewed a sample of the flagged insights to separate confirmed hallucinations from false positives that mentioned a stat or usage claim not literally present in the oracle text.
How I fixed it: 8,380 of the 33,040 pairs (25.4%) were flagged. Of those, 5,236 were confirmed wrong mechanic claims and got deleted; the other 3,144 were false positives (a stats claim outside the oracle text, not a mechanic error) and were kept. A separate partial-matching pass on unresolved card names (mostly double-faced cards saved under 1 half of their full name) resolved 1,208 more links, leaving 902 genuinely unresolvable. The knowledge base went from 41,790 unvalidated insights to 36,554 confirmed-clean ones, and the extraction pipeline now validates every new insight the same way before it's ever stored.

Agentic AI

MCP Tool Server and Usage-Driven Routing

Decision: The conversational agent is a FastMCP tool server exposing the deck-building surface (search_cards, get_card, find_similar_cards, analyze_cube, build_commander_deck, search_expert_knowledge, and more) as the same interface Claude Desktop or any other MCP client could connect to directly, not a bespoke API only the web chat can call. After launch, instead of guessing which tools the model actually needed, I audited real usage: 293 tool calls across 50 logged conversations.
What broke: The audit showed a long tail of dead weight: 12 of the 30 exposed tools recorded 0 calls across all 50 conversations, while search_cards and get_card alone accounted for 160 of the 293 calls (55%). A flat, always-available tool list makes tool selection noisier for the model and gives no signal about which capability is actually doing anything.
How I measured it: Built a frequency table (calls and percentage per tool) directly from the 50-conversation sample, then classified each tool as keep, merge, internalize, or needs-routing based on that data instead of intuition. search_expert_knowledge had 16 calls but got reclassified INTERNALIZE, since it turned out to be called by build_commander_deck and brainstorm_cards internally, never directly by the model in response to a user ask.
How I fixed it: 0- and near-0-call tools got consolidated by their real verdict: explore_commanders (4 calls) merged into search_commanders_by_strategy, get_card_price (1 call) merged into price_card_list, and suggest_creator_decks (0 calls) got removed as a standalone tool since brainstorm_cards already baked it in. The remaining 0-call tools (rules lookups, post-build analysis) moved behind phase-based activeTools routing, so they only surface once their precondition step has actually run instead of sitting in front of the model on every single turn.

Systems Reliability

Persisted Conversation History

Decision: Before this feature, every page refresh wiped the chat entirely. I built full conversation persistence: every message and tool invocation auto-saves to Postgres scoped to the authenticated user, a sidebar restores the full workspace state (including rendered tool results) on reload, and conversations auto-title from the first message or the commander name if a deck was built.
What broke: After persistence shipped, a runaway multi-call tool loop in 1 conversation left corrupted, incomplete tool-invocation records in that conversation's stored history. Reloading it crashed the chat entirely with AI_MessageConversionError: 'ToolInvocation must have a result,' because the AI SDK v4 message-reconstruction path reads message.parts first, and the existing filter only cleaned the older message.toolInvocations field.
How I measured it: Reproduced the crash by loading the specific corrupted conversation and confirming the same AI_MessageConversionError client-side, then compared the 2 code paths (toolInvocations vs. parts) the SDK actually reads from to find where the existing filter fell short.
How I fixed it: Added a parts-level filter (checking state === 'result' && result !== undefined) alongside the legacy toolInvocations filter, so both message-shape paths a corrupted conversation could take are covered. A later defense-in-depth pass, sanitize-model-history, now runs before every streamText call and drops any output-error or incomplete tool part before the model ever sees it (29/29 unit tests), so 1 bad turn can't crash every future reload of that conversation. A Patreon patron who specifically requested the memory feature bumped their pledge from $10 to $20/month after it shipped.