B — What the Algorithms Can't Do | Outline

Outline only. This is the v0.1 working outline for Article B in the Coach + AI Critique sub-series. v1 prose will be drafted later, replacing the section-beat summaries below with full prose. Open editorial questions are flagged with ⚑ at the end of the outline.

Audience: Decision-science and AI-curious cultural readers; sophisticated coaches and players who want the broader frame; tech-media editors evaluating how to think about AI claims in adjacent decision domains.
Outlets: The Atlantic / HBR / IEEE Spectrum / Wired / New Yorker (op-ed-leaning).
Length: ~5,000 words target.
Tone: Op-ed-leaning, cultural, evidence-grounded. Less "how to study," more "what does this technology actually do, and what doesn't it do?"
Sub-series role: B is the cultural / decision-science version. Companions: A (practical poker-community) and C (flagship comprehensive).

Working title and alternates

Original "AI Still Needs You. AI Still Hasn't Beaten You." was the seed; B expands the scope to include the pedagogy gap, making it a fuller decision-science piece.

⭐ "What the Algorithms Can't Do (And Why You Still Need Coaches and Human Experts)" — cultural pitch
"Three Things AI Hasn't Replaced in Poker — and Probably Won't" — list-style, op-ed
"AI Still Needs You. AI Still Hasn't Beaten You. (And the Solver Charts Aren't Enough Either.)" — the original, expanded
"The Limits of Formalism in Real-World Decision-Making" — academic-leaning, broader
"What the Solver Doesn't Solve" — short and provocative

One-line goal

A cultural-frame argument for the irreducibility of human judgment in modern gameplay AI. Three structural gaps the marketing keeps trying to write out: autonomy (AI methodology depends on humans), outperformance (AI hasn't beaten top humans in any published controlled setting), and pedagogy (GTO study alone doesn't make a player). The article uses poker as the case study but reaches for adjacent decision domains where the same pattern holds.

Story arc

The cultural assumption — across poker, finance, medicine, law, anywhere AI is being marketed as a replacement for expert judgment — is that the algorithms are catching up to humans. In poker, where we have the cleanest measurement environment for this kind of claim, the assumption fails on three fronts. The article walks through the three gaps with poker-specific evidence, then generalizes the lesson for other decision domains.

Section-by-section beats

§1 · ~400 words

Open: the cultural narrative we keep being sold

Not a poker-specific open. A cultural one: the AI that's going to replace your coach, your doctor, your lawyer, your editor. Poker is the cleanest case study because measurement is unambiguous — you can play hands and count the result. So poker is the right place to look for whether the marketing claim survives contact with the data.

§2 · ~1,000 words

The autonomy gap: AI methodology is human-in-the-loop

Walk through where humans are required at every step:

Tree designers are humans (Tombos21: "the human operator must define exactly what bet sizes are allowed... this is the heart of the problem")
Reward functions and training curricula are human-built
Failure-mode identification (the AceGuardian + QuintAce team caught the action-set ceiling, MC-5 MTT meta-bias, MC-6 binary ICM gradient, KI-5 thin-spot — all surfaced by humans, not the model)
Eval design and exploit validation (which exploits matter, which spots are representative — human judgment)
Ambiguous anti-cheat cases (every behavioral-detection model has an edge-case review queue)
New game variants and format definitions (Squid family wasn't auto-discovered — humans defined the rules)
Real-game performance evaluation (Rung 4), the standard that ultimately matters, requires human evaluators
Coach-specific judgment that AI doesn't have: live tells, emotional dynamics, table talk, opponent learning patterns, tilt management, study-plan personalization, motivational psychology

The cultural takeaway: even the most automated-looking AI runs on humans. The question for any AI claim in any domain is where the humans are, not whether.

§3 · ~1,200 words

The outperformance gap: even where AI matches, it's overfitting, not generalization

Two layers to this gap.

Layer 1 — the empirical record. No public controlled setting has shown the best gameplay AI outperforming the best human players and coaches at scale.

Cepheus 2015 — HU LHE, a research curiosity nobody plays seriously
Libratus 2017 — HU NLHE, small samples, "superhuman" disclaimer, no Nash claim
DeepStack 2017 — HU NLHE, expert-level, LBR couldn't exploit but no upper bound proven
Pluribus 2019 — 6-max NLHE, AIVAT asymmetry concession, individual results anonymized, Loeliger result undisclosed, p=0.014/0.028 borderline
Brown's July 2024 X thread — "Pluribus cost $150 to train... cautionary tale on overoptimizing for benchmarks"
Russian bot operations (per Bloomberg's BFC reporting) — succeed at mid-stakes against recreational players; explicitly avoid pros (per the GipsyTeam interview: "afraid of losing to the regs who have studied their features")
Counter-example search — no public verified pure-GTO real-money win against top humans beyond HU
No pro-vs-AI head-to-head event at the level of AlphaStar's Grandmaster matches or AlphaGo's Lee Sedol match has happened in poker for 6-max+ formats

Layer 2 — the overfitting layer. Even where AI matches top humans on a specific benchmark, the match is typically achieved by throwing massive data and compute at one narrow setting (one game, one ruleset, one stack depth, one opponent pool) and overfitting the model to that setting. Change the setting and performance degrades. This pattern holds across game-playing AI more broadly: AlphaStar plays one version of StarCraft against a specific opponent distribution; OpenAI Five plays Dota 2 with a restricted hero pool; Suphx plays Mahjong under a specific ruleset; Pluribus plays 6-max NLHE at exactly the configuration its training expected. There is no generalized AI poker player, one that outperforms humans across the variety of poker the actual game presents (different formats, stack depths, opponent populations, exotic variants, ruleset shifts).

The claim "AI matches the best" is in practice "AI matches the best in this one narrow setting we trained for." When the setting shifts (the rules change, true reasoning and adaptation are required, or the opponent pool falls outside the training distribution), performance degrades, because the model was overfit to its training setting. The technology hasn't demonstrated the leap to general game-playing reasoning that the marketing implies.

The cultural takeaway: the headline "AI beats humans" claim, in poker as elsewhere, usually rests on one paper, one experiment, one narrow setting, and a set of caveats the marketing erases. When you pull the experiment apart, it's almost never what the headlines said. And even where the matching claim survives, what's been demonstrated is narrow-setting overfitting, not generalized reasoning. The two layers compound.

§4 · ~1,200 words

The pedagogy gap: GTO study alone doesn't make a player

Compressed version of Article A's §2 + §3. Eight foundation skills the chart teaches; eight ceiling skills the chart doesn't reach (live reads, opponent-specific deviations, ICM-heavy spots, tilt management, table selection, study-plan personalization, format-cross-pollination, multi-way / exotic formats).

The cultural takeaway: in any expert domain, the formalism (the textbook, the chart, the algorithm's output) is the floor. The expert's role is the application of the formalism plus everything the formalism leaves out.

§5 · ~700 words

Synthesis: three irreducible roles of human judgment

Stack the three gaps:

Building the AI (autonomy)
Outperforming the AI (outperformance)
Teaching past the AI (pedagogy)

Each gap is independent. Each is an open problem, not a "we just need more compute" problem. Each generalizes beyond poker to any decision domain where AI is being marketed as a replacement for expert judgment.

The right question for any AI claim isn't "can the AI do this?" It's "which gap is this claim quietly conceding?"

§6 · ~700 words

What this means for AI in adjacent decision domains

Generalize. In medicine, law, finance, and education, the same three gaps appear. The cultural narrative ("the algorithm will replace the expert") fails on the same three fronts. The article points to examples (legal-doc AI that depends on attorney review queues; medical-imaging AI whose performance ceiling matches but doesn't beat top radiologists; financial-trading AI whose strongest deployments are exploitative rather than equilibrium-seeking).

This is where the piece earns the cultural / decision-science register. The poker case study isn't just about poker; it's a clean diagnostic instrument for a much broader class of AI claims.

§7 · ~500 words

Close: what coaches (and other experts) should make of this

Practical for the audience that reads Atlantic / HBR. If you're an expert in a domain being told an AI is replacing you, the three gaps are the diagnostic. Where is the human in the loop? Where has the AI's outperformance been measured against the best of you? Where does the formalism the AI implements stop reaching real practice? When all three answers are clean, the AI claim survives. When any one is muddy, the claim doesn't.

The next decade isn't AI versus experts. It's experts who learn to teach with AI versus experts who don't.

Length budget

§	Beat	Words
§1	Open	400
§2	Autonomy gap	1,000
§3	Outperformance gap	1,200
§4	Pedagogy gap	1,200
§5	Synthesis	700
§6	Adjacent domains	700
§7	Close	500
Total		~5,700

Slightly above the 5,000-word target. Trim §6 if needed for outlet length cap.
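
A quick arithmetic check on the budget, as a minimal sketch. The per-section figures are copied from the table above and the 5,000-word target from the header block; the variable names (budget, TARGET, s6_after_trim) are illustrative only and not part of any editorial tooling.

```python
# Word budgets copied from the length-budget table above.
budget = {
    "§1 Open": 400,
    "§2 Autonomy gap": 1000,
    "§3 Outperformance gap": 1200,
    "§4 Pedagogy gap": 1200,
    "§5 Synthesis": 700,
    "§6 Adjacent domains": 700,
    "§7 Close": 500,
}
TARGET = 5000  # target length stated in the header block

total = sum(budget.values())
overage = total - TARGET

# Assumption for illustration: the whole overage is taken out of §6,
# as the note above suggests. Floor at zero so the budget cannot go negative.
s6_after_trim = max(budget["§6 Adjacent domains"] - overage, 0)

print(f"total={total}, overage={overage}, §6 after trim={s6_after_trim}")
```

On the table's figures, the overage equals the entire §6 budget, so hitting exactly 5,000 words by trimming §6 alone would mean cutting the section; a partial trim of §6 would need to be paired with cuts elsewhere.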

Cross-references in the sub-series

A — same pedagogy gap, but A is the practical poker-community version. B treats it as one of three gaps in a broader cultural argument.
C (flagship) — comprehensive merge. B's content is roughly half of C, with cultural framing. A is the other half.
Manifesto (A1) — field-level argument behind §3 (outperformance). Cite directly.
B7 (DRL Is the Default) — for the autonomy framing in §2.

Open Editor's Qs

⚑ Q1 — Adjacent-domain claims in §6. How aggressive on the medicine / law / finance comparisons? My recommendation: 1 specific example per domain, sourced. Don't generalize beyond what the data supports.

⚑ Q2 — Length. 5,000 vs 6,000 words? §6 is the swing variable. My recommendation: 5,000 with §6 tighter; expand to 6,000 if a flagship outlet wants the longer cultural feature.

⚑ Q3 — Title. Pick from the candidates above. My recommendation: ⭐ "What the Algorithms Can't Do (And Why You Still Need Coaches and Human Experts)".

⚑ Q4 — Outlet pitch order. HBR for the business-decision frame, Atlantic for the cultural frame, Wired backup for the AI-critique frame? Or different order?

⚑ Q5 — Coach byline / co-byline. Should B carry a named-coach co-byline (Petrangelo, Brad Wilson, Annie Duke / Maria Konnikova for the decision-science angle)? Optional but lifts the piece for tier-1 outlets.