YOUTUBE
The term "AI agent" encompasses at least four fundamentally different system architectures (coding harnesses, dark factories, auto research, orchestration) that solve distinct problem types. Using the wrong agent architecture for a given problem type leads to failure.
Agentic systems are not monolithic; they represent divergent architectural approaches optimized for different problem shapes (task execution vs. metric optimization vs. workflow routing). The key to successful implementation is matching the agent architecture to the nature of the work, not merely selecting a model or toolset. This taxonomy explains why many agent projects fail and provides a decision framework for practitioners.
Four distinct agent archetypes exist in production today: coding harnesses (human-managed task execution), dark factories (specification-to-evaluation autonomy), auto research (metric optimization), and orchestration frameworks (specialized role handoffs). Each solves a different fundamental problem shape. [1]
Problem shape is the primary selection criterion: work is either "software-shaped" (requiring code/outputs) or "metric-shaped" (requiring optimization). Software-shaped work further divides into human-judged (coding harness) vs. eval-judged (dark factory). Orchestration applies when multiple specialized roles are needed. [1]
Scale transitions require architectural shifts: individual developer use (single-agent coding harness) differs from project-scale work (multi-agent planner/executor harness). Moving from human-managed to eval-managed represents the dark factory progression. The wrong scale/architecture combination creates bottlenecks. [1]
Auto research is ML-native, not software-native: auto research agents relentlessly experiment and hill-climb metrics. They succeed only with clear, measurable targets (conversion rates, performance benchmarks, loss functions) and cannot produce working software without that evaluation framework. [1]
Orchestration trades simplicity for specialization: multi-role agent systems (e.g., researcher → writer → editor) require heavy investment in handoff protocols, context passing, and prompt engineering. They are justified only at high scale (10,000+ units of work) where specialized roles yield ROI. [1]
Tobi Lütke's Liquid optimization demonstrates auto research: Shopify's CEO used an auto research agent to find performance micro-optimizations in the 20-year-old Liquid templating engine, achieving a 53% speedup and 61% fewer allocations by having the agent run hundreds of experiments against benchmarks. [2]
Cursor's multi-agent system illustrates project-scale harness design: Cursor employs a planner agent that spawns short-lived executor agents to solve discrete sub-tasks, with the planner tracking progress and evaluating results. Simplicity scales: three-level hierarchies failed, while the two-level design succeeds. [1]
Dark factories minimize mid-process human involvement: the architecture deliberately removes humans from the execution loop to prevent bottlenecks. Humans intervene only at the specification (intent) and evaluation (quality) stages. This matches high-trust, high-volume environments where evals are reliable. [1]
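The selection criteria in the points above can be sketched as a small decision function. This is a minimal illustration, not the speaker's own procedure: the `Work` fields, the function name, and the order in which the checks run are all assumptions; only the four architecture names and the 10,000-unit threshold come from the summary.

```python
from dataclasses import dataclass

# Threshold quoted in the summary for when orchestration pays off.
ORCHESTRATION_MIN_UNITS = 10_000


@dataclass
class Work:
    """Hypothetical description of a piece of work (field names are illustrative)."""
    metric_shaped: bool            # goal is to move a measurable number
    needs_specialized_roles: bool  # e.g. researcher -> writer -> editor
    units_of_work: int             # expected volume
    reliable_evals: bool           # automated evals can judge quality


def choose_architecture(work: Work) -> str:
    """Map a problem shape to one of the four archetypes (assumed check order)."""
    if work.metric_shaped:
        return "auto research"
    if work.needs_specialized_roles and work.units_of_work >= ORCHESTRATION_MIN_UNITS:
        return "orchestration"
    if work.reliable_evals:
        return "dark factory"      # software-shaped, eval-judged
    return "coding harness"        # software-shaped, human-judged
```

Under these assumptions, a one-off feature with no trustworthy eval suite falls through to the coding harness, which matches the summary's advice to avoid forcing orchestration when a single harness suffices.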
"When we say agents, it is too simplistic to say agents are just like an AI plus tools in a loop. Like that's true, but we are missing the point. We are missing the fact that sophisticated agents diverge into at least four different types."
- [Speaker, early] [1]
"The key to understanding the difference between these individual coding harnesses... versus the big long running ones... You need to recognize that the individual coding harnesses are built for the mind of an individual developer."
- [Speaker, ~09:00] [1]
"Dark factories are designed as entire complete systems that hit eval at the end and iterate back automatically until the software passes the evaluation."
- [Speaker, ~15:30] [1]
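The dark factory loop described in the last quote (generate from a specification, hit the eval, iterate back automatically until it passes) might look roughly like the sketch below. The function name, the `generate` and `evaluate` callables, and the iteration budget are hypothetical stand-ins; a real system would plug an agent and an eval suite into those slots.

```python
from typing import Callable, Tuple


def dark_factory(
    spec: str,
    generate: Callable[[str, str], str],           # (spec, feedback) -> candidate
    evaluate: Callable[[str], Tuple[bool, str]],   # candidate -> (passed, feedback)
    max_iterations: int = 10,                      # assumed safety budget
) -> str:
    """Iterate generate -> evaluate with no human in the loop until the eval passes."""
    feedback = ""
    for _ in range(max_iterations):
        candidate = generate(spec, feedback)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate
    # Humans re-enter only here, when the automated loop gives up.
    raise RuntimeError("evaluation never passed within the iteration budget")
```

The only human touchpoints in this sketch are the `spec` going in and whatever review follows the returned candidate, which mirrors the specification and evaluation stages named in the summary.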
VERIFIED: Tobi Lütke used an auto research agent to optimize Shopify's Liquid engine, achieving a 53% speedup. Multiple independent sources confirm this result and describe the agent running hundreds of experiments against benchmark suites. [2]
VERIFIED: Amazon convened a mandatory meeting of senior engineers on March 10, 2026 to address AI-assisted production incidents. The company subsequently required senior sign-off on AI-assisted code changes from junior staff. [3]
VERIFIED: Andrej Karpathy released an "autoresearch" package in early March 2026. It uses a coding agent to autonomously run ML experiments: modify train.py, train briefly, evaluate against val_bpb, and keep improvements. [4]
VERIFIED: Cursor's multi-agent coding system uses a planner/executor architecture (two-level hierarchy). The team explicitly found that three-level hierarchies did not work well, confirming the "simple scales" principle. [1]
UNVERIFIED: Claim that Peter Steinberger used multiple Codex agents to build "open claw" and that his process took 20 minutes per task. No independent verification of this specific project or timing was found; it may be illustrative or anonymized.
UNVERIFIED: Assertion that "Andrej Karpathy talks about his agents running 16 hours a day." No direct source is cited; it is likely paraphrased from public statements but remains unverified.
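The experiment loop attributed to autoresearch above (change something, run briefly, score against a validation metric, keep only improvements) is a form of greedy hill climbing. The sketch below shows that general pattern only; it is not the package's API, and `hill_climb`, `propose`, and `score` are hypothetical names. It assumes a lower-is-better metric, as a loss-style score such as val_bpb would be.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def hill_climb(
    baseline: T,
    propose: Callable[[T, random.Random], T],  # make one change to the current best
    score: Callable[[T], float],               # lower is better (assumption)
    experiments: int = 200,
    seed: int = 0,
) -> Tuple[T, float]:
    """Run a fixed number of experiments, keeping a change only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = baseline, score(baseline)
    for _ in range(experiments):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score < best_score:       # keep improvements, discard the rest
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy usage: the "system" is a single number and the metric is distance from 3.
best, best_score = hill_climb(
    0.0,
    propose=lambda x, rng: x + rng.uniform(-1.0, 1.0),
    score=lambda x: (x - 3.0) ** 2,
)
```

The loop has nothing to climb without `score`, which is the practical meaning of the summary's point that auto research only works when a clear, measurable target exists.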
autoresearch package; demonstrated metric optimization via agent experimentation
For developers adopting agents: Map your work to one of the four problem shapes before choosing tools. Ask: "Is this software-shaped or metric-shaped?" If software-shaped, decide if human judgment or eval judgment should gate quality. Avoid forcing orchestration when a single harness suffices.
For engineering leaders: At team scale (>8 engineers on a project), shift from individual coding harnesses to project-level multi-agent architectures. The goal is to reduce human bottlenecks by letting agents manage coordination, not just assist individuals.
For product builders: Auto research opens new possibilities for continuous performance optimization and A/B testing. Any system with a reliable benchmark and a hill to climb is a candidate for autonomous improvement cycles.
For organizations experimenting with agents: Expect pushback when introducing orchestration due to its handoff complexity. Pilot only at scale (≥10k units of work). For most use cases, coding harnesses (individual or project) provide better ROI.
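The project-level harness mentioned in these recommendations, a planner that hands discrete sub-tasks to short-lived executors and checks their results, can be sketched as a two-level structure. This is an illustrative outline under stated assumptions, not Cursor's implementation: the `Planner` class, its `split`, `execute`, and `accept` callables, and the retry budget are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Planner:
    """Two-level hierarchy: one planner, many short-lived executor calls."""
    split: Callable[[str], List[str]]    # goal -> discrete sub-tasks
    execute: Callable[[str], str]        # stands in for one fresh executor agent
    accept: Callable[[str, str], bool]   # planner's evaluation of a result
    max_attempts: int = 3                # assumed retry budget per sub-task
    results: Dict[str, str] = field(default_factory=dict)

    def run(self, goal: str) -> Dict[str, str]:
        for task in self.split(goal):
            for _ in range(self.max_attempts):
                output = self.execute(task)      # executor holds no state between calls
                if self.accept(task, output):
                    self.results[task] = output  # planner tracks progress
                    break
            else:
                raise RuntimeError(f"sub-task failed: {task}")
        return self.results


# Toy usage with trivial stand-ins for the agents.
planner = Planner(
    split=lambda goal: goal.split(" and "),
    execute=str.upper,
    accept=lambda task, output: output == task.upper(),
)
done = planner.run("parse and render")
```

Executors never spawn executors of their own here, which keeps the hierarchy at the two levels the summary reports as working.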
Source credibility: High. Nate Herk is a recognized practitioner in the agentic engineering space with extensive real-world implementation experience. His channel focuses on practical agent deployment. The taxonomy presented aligns with known architectures from Cursor, Karpathy, and enterprise case studies.
Claim verifiability: 5 of 7 key claims verified (71%). The unverified claims are anecdotal (Steinberger's 20-minute tasks, Karpathy's 16-hour agents) but plausible within the context.
Potential biases: Framework advocacy. The speaker promotes his own taxonomy and may overstate the distinctness of categories to establish a clear mental model. He also positions himself as a guide to avoid "common mistakes," creating a slight selection bias toward problems that fit the framework.
Quality flags: None. Transcript is coherent, well-structured, and free of obvious errors. Timestamps are available via reference.
Confidence in synthesis: High. The four-type classification is consistently articulated, distinguished by clear decision criteria (problem shape, evaluation method, human involvement). Real-world examples provide strong anchoring.
No sponsor segments identified in transcript.
Card 1
Q: What are the four archetypal agent architectures?
A: Coding harnesses (human-managed task execution), dark factories (spec-to-eval autonomy), auto research (metric optimization), orchestration (role handoffs).
Card 2
Q: How do you choose between coding harness and dark factory?
A: By who judges quality: human (coding harness) vs. automated eval (dark factory). The transition is reducing human involvement in the execution loop.
Card 3
Q: What distinguishes auto research from other agent types?
A: It requires a measurable metric to hill-climb; it's about optimization, not producing working software. It descends from classical ML, not software engineering.
1. [Speaker, early/mid/late] "We want agents, but we don't know what we really want... sophisticated agents diverge into at least four different types."
2. [Verified] Simon Willison (2026-03-13). "Tobi Lutke just pointed an autonomous AI researcher at the code that renders every storefront on Shopify. The agent found a 53% speedup." https://simonwillison.net/2026/Mar/13/liquid/
3. [Verified] The New Stack (2026-03-XX). "Amazon calls engineers for a 'deep dive' internal meeting to discuss 'GenAI'-related outages." https://thenewstack.io/amazon-ai-assisted-errors/
4. [Verified] DataCamp (2026-03-XX). "A Guide to Andrej Karpathy's AutoResearch: Automating ML with AI..." https://www.datacamp.com/tutorial/guide-to-autoresearch