Working title and alternates
Original "AI Still Needs You. AI Still Hasn't Beaten You." was the seed; B expands the scope to include the pedagogy gap, making it a fuller decision-science piece.
- ⭐ "What the Algorithms Can't Do (And Why You Still Need Coaches and Human Experts)" — cultural pitch
- "Three Things AI Hasn't Replaced in Poker — and Probably Won't" — list-style, op-ed
- "AI Still Needs You. AI Still Hasn't Beaten You. (And the Solver Charts Aren't Enough Either.)" — the original, expanded
- "The Limits of Formalism in Real-World Decision-Making" — academic-leaning, broader
- "What the Solver Doesn't Solve" — short and provocative
One-line goal
A cultural-frame argument for the irreducibility of human judgment in modern gameplay AI. Three structural gaps the marketing keeps trying to write out: autonomy (AI methodology depends on humans), outperformance (no published controlled setting shows AI beating top humans at scale), and pedagogy (GTO study alone doesn't make a player). The article uses poker as the case study but reaches for adjacent decision domains where the same pattern holds.
Story arc
The cultural assumption — across poker, finance, medicine, law, anywhere AI is being marketed as a replacement for expert judgment — is that the algorithms are catching up to humans. In poker, where we have the cleanest measurement environment for this kind of claim, the assumption fails on three fronts. The article walks through the three gaps with poker-specific evidence, then generalizes the lesson for other decision domains.
Section-by-section beats
Open: the cultural narrative we keep being sold
Not a poker-specific open. A cultural one: the AI that's going to replace your coach, your doctor, your lawyer, your editor. Poker is the cleanest case study because measurement is unambiguous: you can play hands and count the results. So poker is the right place to test whether the marketing claim survives contact with the data.
The autonomy gap: AI methodology is human-in-the-loop
Walk through where humans are required at every step:
- Tree designers are humans (Tombos21: "the human operator must define exactly what bet sizes are allowed... this is the heart of the problem")
- Reward functions and training curricula are human-built
- Failure-mode identification (the AceGuardian + QuintAce team caught the action-set ceiling, MC-5 MTT meta-bias, MC-6 binary ICM gradient, KI-5 thin-spot — all surfaced by humans, not the model)
- Eval design and exploit validation (which exploits matter, which spots are representative — human judgment)
- Ambiguous anti-cheat cases (every behavioral-detection model has an edge-case review queue)
- New game variants and format definitions (Squid family wasn't auto-discovered — humans defined the rules)
- Real-game performance evaluation (Rung 4), the standard that ultimately matters, requires human evaluators
- Coach-specific judgment AI doesn't have — live tells, emotional dynamics, table talk, opponent learning patterns, tilt management, study-plan personalization, motivational psychology
The cultural takeaway: even the most automated-looking AI runs on humans. The question for any AI claim in any domain is where the humans are, not whether they are there.
The outperformance gap: even where AI matches, it's overfitting, not generalization
Two layers to this gap.
Layer 1 — the empirical record. No public controlled setting has shown the best gameplay AI outperforming the best human players and coaches at scale.
- Cepheus 2015 — HU LHE, a research curiosity nobody plays seriously
- Libratus 2017 — HU NLHE, small samples, "superhuman" disclaimer, no Nash claim
- DeepStack 2017 — HU NLHE, expert-level, LBR couldn't exploit but no upper bound proven
- Pluribus 2019 — 6-max NLHE, AIVAT asymmetry concession, individual results anonymized, Loeliger result undisclosed, p=0.014/0.028 borderline
- Brown's July 2024 X thread — "Pluribus cost $150 to train... cautionary tale on overoptimizing for benchmarks"
- Russian bot operations (per Bloomberg's BFC reporting) — succeed at mid-stakes against recreational players; explicitly avoid pros (per the GipsyTeam interview: "afraid of losing to the regs who have studied their features")
- Counter-example search — no public verified pure-GTO real-money win against top humans beyond HU
- No pro-vs-AI head-to-head event at the level of AlphaStar's Grandmaster matches or AlphaGo's Lee Sedol match has happened in poker for 6-max+ formats
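For the significance figures cited in the list above, a minimal sketch of how a measured win rate and its standard error translate into a one-sided p-value. It assumes a normal approximation, and the helper names (`one_sided_p`, `is_borderline`) and the 0.01 / 0.05 thresholds are illustrative choices, not anything taken from the cited studies. The example inputs are hypothetical.

```python
import math


def one_sided_p(win_rate_mbb: float, std_err_mbb: float) -> float:
    """One-sided p-value for 'true win rate > 0' under a normal approximation.

    win_rate_mbb and std_err_mbb are in the same units (e.g. mbb/game).
    """
    z = win_rate_mbb / std_err_mbb
    # Upper-tail probability of a standard normal: P(Z >= z)
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_borderline(p: float, strict: float = 0.01, loose: float = 0.05) -> bool:
    """True when p clears the loose threshold but not the strict one (assumed cutoffs)."""
    return strict <= p < loose


# Hypothetical inputs, NOT figures from any paper: 40 mbb/game with SE 20 mbb/game.
p = one_sided_p(40.0, 20.0)
print(f"p = {p:.3f}, borderline = {is_borderline(p)}")
```

The point for the article: a result that sits between the two thresholds depends heavily on sample size and variance-reduction choices, which is why the caveats in the list matter.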
Layer 2 — the overfitting layer. Even where AI matches top humans on a specific benchmark, the matching is typically achieved by throwing massive data and compute at one narrow setting — one game, one ruleset, one stack depth, one opponent pool — and overfitting the model to that setting. Change the setting and performance degrades. This pattern holds across game-playing AI more broadly: AlphaStar plays one version of StarCraft against a specific opponent distribution; OpenAI Five plays Dota 2 with a restricted hero pool; Suphx plays Mahjong at a specific ruleset; Pluribus plays 6-max NLHE at exactly the configuration its training expected. There is no generalized AI poker player — one that outperforms humans across the variety of poker the actual game presents (different formats, stack depths, opponent populations, exotic variants, ruleset shifts).
The claim "AI matches the best" is in practice "AI matches the best in this one narrow setting we trained for." When the setting shifts — when the rules change, when true reasoning and adaptation are required, when the opponent pool isn't in the training distribution — the overfitting is exposed. The technology hasn't demonstrated the leap to general game-playing reasoning that the marketing implies.
The cultural takeaway: the headline "AI beats humans" claim — in poker as elsewhere — usually rests on one paper, one experiment, one narrow setting, and a set of caveats the marketing erases. When you pull the experiment apart, it's almost never what the headlines said. And even where the matching claim survives, what's been demonstrated is narrow-setting overfitting, not generalized reasoning. The two layers compound.
The pedagogy gap: GTO study alone doesn't make a player
Compressed version of Article A's §2 + §3. Eight foundation skills the chart teaches; eight ceiling skills the chart doesn't reach (live reads, opponent-specific deviations, ICM-heavy spots, tilt management, table selection, study-plan personalization, format-cross-pollination, multi-way / exotic formats).
The cultural takeaway: in any expert domain, the formalism (the textbook, the chart, the algorithm's output) is the floor. The expert's role is the application of the formalism plus everything the formalism leaves out.
Synthesis: three irreducible roles of human judgment
Stack the three gaps:
- Building the AI (autonomy)
- Outperforming the AI (outperformance)
- Teaching past the AI (pedagogy)
Each gap is independent. Each is an open problem, not a "we just need more compute" problem. Each generalizes beyond poker to any decision domain where AI is being marketed as a replacement for expert judgment.
The right question for any AI claim isn't "can the AI do this?" It's "on which gap is this claim quietly conceding?"
What this means for AI in adjacent decision domains
Generalize. In medicine, law, finance, education — the same three gaps appear. The cultural narrative ("the algorithm will replace the expert") fails on the same three fronts. The article points at examples (legal-doc AI that depends on attorney review queues; medical-imaging AI whose performance ceiling matches but doesn't beat top radiologists; financial-trading AI whose strongest deployments are exploitative rather than equilibrium-seeking).
This is where the piece earns the cultural / decision-science register. The poker case study isn't just about poker; it's a clean diagnostic instrument for a much broader class of AI claims.
Close: what coaches (and other experts) should make of this
Practical for the audience that reads Atlantic / HBR. If you're an expert in a domain being told an AI is replacing you, the three gaps are the diagnostic. Where is the human in the loop? Where has the AI's outperformance been measured against the best of you? Where does the formalism the AI implements stop reaching real practice? When all three answers are clean, the AI claim survives. When any one is muddy, the claim doesn't.
The next decade isn't AI versus experts. It's experts who learn to teach with AI versus experts who don't.
Length budget
| § | Beat | Words |
|---|---|---|
| §1 | Open | 400 |
| §2 | Autonomy gap | 1,000 |
| §3 | Outperformance gap | 1,200 |
| §4 | Pedagogy gap | 1,200 |
| §5 | Synthesis | 700 |
| §6 | Adjacent domains | 700 |
| §7 | Close | 500 |
| | Total | ~5,700 |
About 700 words (14%) above the 5,000-word target. Trim §6 first if the outlet's length cap requires it.
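A quick way to re-check the budget whenever a section's allocation changes, sketched under the assumption that the table's figures are the working numbers. The names `BUDGET`, `TARGET_WORDS`, `total_words`, and `overage` are hypothetical helpers for this outline, not part of any tool.

```python
# Per-section word allocations, copied from the length budget table.
BUDGET = {
    "§1 Open": 400,
    "§2 Autonomy gap": 1_000,
    "§3 Outperformance gap": 1_200,
    "§4 Pedagogy gap": 1_200,
    "§5 Synthesis": 700,
    "§6 Adjacent domains": 700,
    "§7 Close": 500,
}

TARGET_WORDS = 5_000  # the stated target; the outlet's actual cap may differ


def total_words(budget: dict[str, int]) -> int:
    """Sum of all section allocations."""
    return sum(budget.values())


def overage(budget: dict[str, int], target: int) -> int:
    """Words over target (negative means under target)."""
    return total_words(budget) - target


print(f"total = {total_words(BUDGET):,}, over target by {overage(BUDGET, TARGET_WORDS):,}")
```

Note that the overage equals the whole §6 allocation, so hitting the target exactly by trimming §6 alone would remove that section; spreading cuts across §3 and §4 may be the more realistic option.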
Cross-references in the sub-series
- A — same pedagogy gap, but A is the practical poker-community version. B treats it as one of three gaps in a broader cultural argument.
- C (flagship) — comprehensive merge. B's content is roughly half of C, with cultural framing. A is the other half.
- Manifesto (A1) — field-level argument behind §3 (outperformance). Cite directly.
- B7 (DRL Is the Default) — for the autonomy framing in §2.