Tag: AI Engineering


    Three Modes, Three Economies

    There is a kind of AI that has not filled many conference rooms these past two years. It sits in production systems where it has been working for years or decades, without anyone registering it as an event. It is the layer that, in most organizations, actually produces value — and it is the layer that almost never shows up on a slide.

    This is not accidental. An AI that functions as a performance has to remain visible. An AI that functions as a component gains with every additional hour in which no one notices it. Both modes are legitimate. They are simply not the same thing, and the difference explains a large part of the current discourse confusion.

    The three modes, briefly

    At least three distinct modes can be distinguished in which AI enters production software systems today. Class 1 (embedded) works as a component in a larger system, in the background of a product that would still be understood without it. Class 2 (tooling) works as a tool in the hand of a practitioner who uses it in two distinct ways — to think (asking, comparing, reframing) and to produce (generating artifacts under constraints, then checking them), with the human responsible for the outcome in both cases. Class 3 (orchestration) works as a composition, coordinating multi-step work across tools and services with limited or no human review at the per-step level.

    These modes are composable but not interchangeable. They have different architectures, different economies, different failure modes, and different surfaces of accountability. Treating them as stylistic variants of the same thing is a category error — it addresses problems with the wrong class and pulls investment toward the most visible layer without the less visible layers becoming any better for it.

    The following sorting takes each of the three modes seriously on its own terms. It begins with the first class — the embedded component — and then works its way through tooling and into orchestration.

    What Class 1 concretely is

    Three examples are enough to make the layer visible without turning it into an inventory.

    A spam filter is a classification component sitting between a receiving service and an inbox. The AI decides what gets through — and the AI here is not a language model but a classical statistical classifier, typically a Bayesian filter or a gradient-boosted tree trained on features of the incoming message. The spam filter is one of the earliest broadly productive ML use cases in the software industry, and one of the most successful — measured by the fact that hardly anyone still consciously registers how many thousands of messages an average inbox does not receive per year. The operational reality of this filter is drift: spammers respond to classifiers, yesterday’s training signal is thinner tomorrow, and the model that was excellent last quarter visibly loses precision without maintenance.
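    A minimal sketch of the classical shape such a filter takes: a multinomial naive Bayes classifier with Laplace smoothing over word counts. The training messages and the two-class setup are invented for illustration; a production filter adds far richer features and, crucially, the retraining pipeline that counters the drift described above.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over bag-of-words features (illustrative sketch)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, message, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(message.lower().split())

    def spam_probability(self, message):
        total_docs = sum(self.doc_counts.values())
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_scores = {}
        for label in ("spam", "ham"):
            # log prior with add-one smoothing
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in message.lower().split():
                # Laplace-smoothed per-word likelihood
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # normalize the two log scores into P(spam | message)
        m = max(log_scores.values())
        exp = {k: math.exp(v - m) for k, v in log_scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])

f = NaiveBayesSpamFilter()
f.train("win money now", "spam")
f.train("free money win", "spam")
f.train("meeting notes attached", "ham")
f.train("project meeting tomorrow", "ham")
assert f.spam_probability("win free money") > 0.5
assert f.spam_probability("meeting notes tomorrow") < 0.5
```

    Drift is exactly the failure mode this sketch ignores: once senders adapt their wording, the smoothed counts above describe yesterday's traffic.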

    A fraud-detection component sits in a payment flow and issues risk scores on which a decision point hangs: let the transaction through, challenge it, block it. This too is not a language model but a classical ML system — gradient-boosted trees, graph-based anomaly detection, sometimes a narrow neural network on tabular features. Here the AI is not only a classifier but also the carrier of a hard trade-off between two kinds of cost — falsely blocked legitimate transactions, which produce customer-service load and lost trust, against passed-through fraudulent transactions, which mean direct damage. Every recalibration of the model shifts this trade-off. There is no position at which both error types vanish, and the model is not what resolves the trade-off — it is what makes it manageable.
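    The trade-off can be made concrete as threshold selection against asymmetric costs. Everything below (scores, labels, cost figures) is hypothetical; the point is only that shifting the cost ratio shifts the chosen threshold, and that no threshold drives both error types to zero.

```python
def expected_cost_per_txn(threshold, scored_txns, cost_false_block, cost_missed_fraud):
    """scored_txns: list of (risk_score, is_fraud). Blocks txns with score >= threshold."""
    total = 0.0
    for score, is_fraud in scored_txns:
        blocked = score >= threshold
        if blocked and not is_fraud:
            total += cost_false_block    # legitimate customer blocked: service load, lost trust
        elif not blocked and is_fraud:
            total += cost_missed_fraud   # fraud passed through: direct damage
    return total / len(scored_txns)

def pick_threshold(scored_txns, cost_false_block, cost_missed_fraud):
    # candidate thresholds: every observed score, plus one above the maximum
    candidates = sorted({s for s, _ in scored_txns}) + [1.01]
    return min(candidates,
               key=lambda t: expected_cost_per_txn(t, scored_txns,
                                                   cost_false_block, cost_missed_fraud))

# hypothetical scored transactions: (model risk score, ground-truth fraud flag)
txns = [(0.05, False), (0.10, False), (0.20, False), (0.30, False),
        (0.40, True), (0.55, False), (0.70, True), (0.90, True)]

# making missed fraud relatively more expensive can only push the threshold down
t_cheap_fraud = pick_threshold(txns, cost_false_block=5.0, cost_missed_fraud=5.0)
t_costly_fraud = pick_threshold(txns, cost_false_block=5.0, cost_missed_fraud=50.0)
assert t_costly_fraud <= t_cheap_fraud
```

    Every recalibration of the model reshuffles the score distribution, which is why this selection is not a one-time decision but part of the maintenance surface.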

    An AI summary over search results is a generative sub-component in a significantly larger product workflow. It reads search results and produces a paragraph presented to the user above the actual results. Unlike the first two examples, this one does use a language model, and unlike the first two it is visible — recognized by the user as AI and a subject of public debate. But its operational reality remains that of a Class-1 component: drift in the underlying data sources, factual fidelity as a continuous downstream-cost line, human-review load in cases where the summary diverges between formally correct and actually right.

    The order is not accidental. The first two examples are invisible classical ML, continuously in need of calibration; the third is visible generative AI and contested. The architectural role is the same in all three, and that is what defines the class. But visibility is not operationally neutral. A failure in an invisible component produces operator load and slow customer attrition; a failure in a visible component produces brand damage, public discourse, and liability questions in the same hour. The class is the same; the cost surface of failure is not. Visibility shifts the radius of effect into the reputational economy, and the architecture has to account for that — through tighter human review, stricter rollback discipline, faster intervention paths. Class identity is defined by the architectural role, but the operational shape within the class is modulated by visibility, by adversarial pressure, and by the regime the component sits in.

    The operational reality

    A Class-1 component does not only add capability to a system, it adds a particular maintenance surface. Deterministic systems have their own quiet failure modes — retry storms, cascading timeouts, thread starvation, slow resource leaks — but they also fail loudly and locally, with exceptions, stack traces, and bug tickets that point at the place the failure occurred. AI components less often produce that kind of failure. They tend to fail quietly and longitudinally instead: through drift, through shifted input distributions, through stale assumptions, through slow quality erosion that no one sees on a dashboard, because the dashboard was tuned to the point at which the model was still good last month.

    But the shape of this failure is not uniform across the class. The three examples above occupy three different operating regimes. The spam filter and the fraud detector stand in an adversarial regime: there is a counterparty actively optimizing against the classifier, yesterday’s signal is structurally weaker tomorrow, and degradation is not neutral — an unmaintained spam filter eventually makes an inbox unusable, an unmaintained fraud detector costs money per day of neglect. The search summary stands in a non-adversarial regime with a very different property: no one is optimizing against it, and the surrounding product — the actual search results — remains functional without it. It can be switched off, rate-limited, or narrowed in scope without the product losing its core function. This is graceful degradation as an architectural option, available here and not available to the first two.
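    Graceful degradation for a component like the summary can be sketched as an explicit fallback path: the core product never waits on the optional component. The function and the stand-in callables below are hypothetical names, not any real product's API.

```python
def render_search_page(query, fetch_results, summarize, summary_enabled=True, timeout_s=2.0):
    """Return (summary_or_None, results). The summary is strictly optional:
    a kill-switch or any failure falls back to plain results."""
    results = fetch_results(query)      # core product: must work on its own
    if not summary_enabled:
        return None, results            # architectural option: just switch it off
    try:
        summary = summarize(results, timeout=timeout_s)
    except Exception:
        summary = None                  # degrade gracefully, never block the results
    return summary, results

# usage with stand-in callables
def fake_fetch(q):
    return [f"result for {q}"]

def broken_summarize(results, timeout):
    raise TimeoutError

summary, results = render_search_page("drift", fake_fetch, broken_summarize)
assert summary is None and results == ["result for drift"]
```

    The adversarial components have no equivalent of this shape: an inbox without its spam filter is not a degraded product, it is an unusable one.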

    From this follows a point that is consistently missing from the public narrative around AI value: monitoring is not an activity beside the feature in Class 1. It is part of the feature — but how much of the feature, and with what weight, depends on the regime. An adversarial component without continuous evaluation, retraining, and human-review capacity is not a cheaper component; it is only the one whose costs later become visible as an incident. A non-adversarial component with a degradation path can get by with lighter monitoring, because the fallback is cheap. The economy of a Class-1 component does not exhaust itself in compute and inference. It encompasses evaluation harnesses, post-deployment monitoring, human-review volume, intervention costs, and the ongoing maintenance of the training-data pipeline. These are real line items. They show up in real budgets, and they scale with the regime the component sits in.

    There is a point here that deserves to be named, because it is normally left in the subtext. Good monitoring of an AI component is itself usually a statistical, self-optimizing system — drift detectors that learn the baseline, anomaly scorers that adjust their thresholds, outlier models that update with new input distributions. The monitoring layer for an AI system is, architecturally, another AI system. This is not a paradox; it is the condition under which monitoring at scale is possible at all. Deterministic checks catch the cases you already anticipated; statistical monitoring catches the cases you did not. But it means the line between "the system" and "the oversight of the system" is not a clean one. The oversight has its own drift, its own calibration, its own training-data pipeline. It is an AI component with the operational reality of a Class-1 component, wrapped around another Class-1 component. The regress is finite in practice — at some point a human looks at a dashboard — but it is worth knowing that the wrapping happens, because it explains why "just add monitoring" is never as cheap as it sounds, and it quietly raises a question this text does not answer: what, exactly, counts as an AI and what counts as ordinary software, when the latter is increasingly built out of the former. A related observation runs in the data direction rather than the oversight direction: nominally verified outputs from Class 2 flow back as training or context material into later Class-1 components, making the class boundary permeable in a direction this sorting does not address. That too belongs in a later text.
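    A minimal instance of such self-calibrating oversight: a drift monitor whose baseline is learned from the very stream it watches. The window and threshold values are arbitrary; the point is that the monitor itself carries state that drifts and needs calibration, which is the wrapping described above in miniature.

```python
from collections import deque
import statistics

class DriftMonitor:
    """Rolling-baseline anomaly detector: flags metric values that deviate
    strongly from a learned baseline. The baseline itself updates with the
    stream, so the monitor has its own drift and its own calibration."""

    def __init__(self, window=50, z_threshold=3.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        alert = False
        if len(self.history) >= 10:   # need some baseline before judging
            mean = statistics.fmean(self.history)
            stdev = statistics.stdev(self.history) or 1e-9
            alert = abs(value - mean) / stdev > self.z_threshold
        self.history.append(value)    # the baseline moves with the input
        return alert

# usage: a precision metric that was stable, then degrades
mon = DriftMonitor()
for _ in range(30):
    mon.observe(0.95)                 # stable baseline, no alerts yet
```

    Note what this does not solve: the thresholds above are themselves a calibration decision, and a slow enough degradation walks the baseline down without ever crossing them.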

    Latency is the third place where the operational reality forces an architectural decision. The synchronous call to a generative model is often only the surface above a system that has to be thought through asynchronously, fault-tolerantly, and latency-tolerantly. What looks like an "API call" becomes, in practice, a flow of asynchronous and fault-tolerant sub-steps — not because anyone loves complexity, but because the behavior of the system under load is otherwise not defined.
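    The surface above that asynchronous system can be sketched as a wrapper that makes behavior under failure explicit: per-attempt timeout, bounded retries with backoff, and a defined fallback. The function name and the stand-in model are hypothetical.

```python
import asyncio

async def call_model_with_fallback(prompt, model_call, *,
                                   timeout_s=5.0, retries=2, fallback="(unavailable)"):
    """What looks like one API call, written out: timeout per attempt,
    bounded retries with exponential backoff, and a defined answer
    when everything fails."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(model_call(prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt < retries:
                await asyncio.sleep(0.1 * 2 ** attempt)   # backoff between attempts
    return fallback   # behavior under load is defined, not accidental

# usage with a stand-in model that fails once, then succeeds
calls = {"n": 0}

async def flaky_model(prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError
    return f"answer to: {prompt}"

result = asyncio.run(call_model_with_fallback("why drift?", flaky_model))
assert result == "answer to: why drift?"
```

    Even this sketch forces the decisions the paragraph names: how long is too long, how many retries are acceptable, and what the product shows when the fallback fires.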

    The selection question

    From this operational reality follows the question that precedes every Class-1 decision and is the one most frequently skipped in the current discussion: Which of the problems in front of me actually needs an AI component — rather than a deterministic alternative that would be cheaper, faster, more debuggable, and more maintainable?

    It is a question that compresses generations of ML-engineering experience into a single sentence. An established guideline from machine learning has for years begun with the instruction to launch without machine learning when the problem allows it. Practitioners building with generative models have been making a similar case: begin with the simplest workable approach and add model complexity only where simpler approaches demonstrably fall short. Both observations say the same thing: AI is a component with downstream costs, and the downstream costs are not negligible just because the component is new and exciting.

    The question is not whether AI can work in Class 1. It works. Spam filters work. Fraud detection works. Generative sub-components inside larger products work. The question is whether the concrete problem on someone’s table actually needs a probabilistic answer — or whether the AI component is offering itself as the solution to a problem that a script, a heuristic, or a deterministic rule would have served better, more cheaply, and with less maintenance surface.

    The architectural mismatch

    There is a recognizable shape in which this question is answered wrongly, and it is frequent enough to warrant structural description. A Class-1 problem — a classification, an extraction, a scoring — is addressed with a Class-3 solution: an orchestrated composition of multiple model calls, tool calls, intermediate steps, and a coordination layer that ties the whole thing together. The result is more architecture, more failure surface, more validation load, more state to debug, and a unit economics that shifts in ways that were not anticipated during planning.

    The point is not that orchestration is bad. It is legitimate and has its own applications, which are covered later in this text. The point is that orchestration is expensive, and these costs have to be justified by a problem that needs orchestration. A pure classification problem does not, as a rule, need orchestration. It needs a classifier. And a good classifier is visually unremarkable — which may explain why the temptation in the other direction is so persistent.

    The layer that pays for the slides

    Class 1 is a layer in which a large share of the already-realized production value of AI in software systems sits today. It is also the layer that speaks least about itself, because components that function invisibly need no marketing of their own. Both sentences are true at the same time, and the tension between them cannot be dissolved — it is the form in which this layer exists.

    What can be drawn from this observation is neither a lesson about hype nor a recommendation about investment priorities. It is a plainer sorting: before anyone decides on AI in a software system, the first usable question is which mode the AI would be sitting in. If the answer is embedded component, then everything described so far applies. If the answer is something else, then other rules apply, which deserve other questions. Those questions follow next.

    Tooling and orchestration, with the human at the tool

    The two remaining modes share a feature that the first class lacks: a human holds the tool in hand and uses it as a tool — directly in Class 2, more remotely in Class 3.

    A practitioner asks questions and checks answers. They have options proposed and discard the ones that do not fit. They produce drafts under conditions they have formulated themselves, and decide what is kept and what is not. The work that emerges is recognizably theirs. They carry responsibility for the result operationally — at the end of the day, something exists that they have to be able to justify.

    This layer receives little attention, although a considerable share of the hours in which engineers today actually work with AI fall inside it. The discourse speaks of it rarely, because it has neither the spectacle of fully automated systems nor the invisible production maturity of embedded components.

    Two forms within Class 2

    Work in Class 2 happens in two distinct forms. They are not subclasses but grips on the same tool, and the practitioner moves between them within a single session.

    In the divergent form, the tool helps with thinking. Someone stands in front of a problem whose shape is not yet clear — an architecture decision, a comparison between options, a diagnosis, a translation between domain need and technical description. The tool opens the space: it suggests readings the questioner had not thought of. It compares. It reframes. It holds up a mirror in which the practitioner’s own assumptions become visible, because the tool does not share them. The value of this form lies not in the answer, but in the sharpening of the question.

    In the convergent form, the tool helps produce something. Someone has understood a problem far enough to formulate the conditions under which a solution has to hold — a function, a schema, a text, a configuration. The tool produces a candidate under these conditions. The practitioner checks whether the candidate actually holds the conditions, and decides what stays and what is to be rewritten. The value of this form lies not in sharpening, but in the load it shifts onto verification.

    The same tool, two different activities. The divergent form opens spaces; the convergent form closes them under constraints. Whoever conflates the forms produces drafts at a moment when they should still have been thinking, or remains stuck in thought when the tool could already have delivered a candidate.

    The divergent form as hinge

    A working observation that is almost never stated: the divergent form of Class 2 is not only a useful grip in its own right, it is the upstream condition for the targeted use of every class.

    Whoever goes into the convergent form with divergent preparation writes different prompts. The constraints are sharper because they were interrogated before they were formulated. The requirements are more complete because the divergent grip surfaced the conditions that would otherwise have shown up only in the verification step, when they are expensive to repair. The candidate the tool produces is already closer to what is needed, because what is needed has been thought through.

    Whoever goes into Class 3 (orchestration) with divergent preparation plans different compositions. The orchestration is scoped against a problem whose shape has been tested, not against a problem whose shape was assumed. Tool boundaries, state transitions, failure modes — all of these land better when they were first thought through in dialog with a divergent counterpart that pushed back on the plan.

    Whoever goes into a Class-1 selection with divergent preparation asks the selection question from the previous block more sharply. Does this problem need a probabilistic answer? Would a simpler approach hold? What would the maintenance surface look like if the problem shifted? These are questions the divergent form is good at opening, precisely because the model does not share the questioner’s assumptions.

    Downstream, the same hinge works in the other direction. A convergent artifact can be interrogated by the divergent form before it goes to review. A Class-3 plan can be stress-tested by the divergent form before it is executed. A Class-1 component’s behavior can be examined in the divergent form before it is deployed. The divergent grip is useful wherever a convergent artifact or a committed decision needs pressure applied to it before it becomes expensive to revise.

    This hinge function is underdescribed in the current discourse for the same reason the divergent form generally is: it produces no artifact of its own. What it produces shows up in the quality of the artifacts that the other modes then generate. The work is invisible in the output. That is not a reason to leave it out of the sorting. It is a reason to name it explicitly — because a practice that produces no visible output is the first to be cut when an organization asks what "using AI" means in its workflows.

    The actual bottleneck

    In the convergent form there is an observation that is regularly absent from the current discussion. In many Class-2 workflows, the harder bottleneck is not the speed at which the tool produces candidates, but the bandwidth with which a human can review them.

    Verification is cognitive work. It requires attention, memory, comparison against a mental representation of the requirements, the recognition of subtly wrong places at which the candidate looks formally right and diverges in substance. This work has an upper limit, which does not depend on model size but on the condition of the reviewer in the twelfth hour of a long day.

    When the tool delivers faster than the human can check, throughput does not shift upward. Oversight shifts from real to nominal. The human is still in the loop, in the sense of the org chart. The loop is just cognitively no longer closed.

    That is a structural limit of this layer in review-heavy workflows. Not a limit that disappears with a better model, but one that moves closer with every faster model. Automated evaluation, function-call accuracy checks, and model-based graders can absorb part of the load in domains where a reference set exists; the residual — the work that requires a human to weigh whether a candidate actually holds the conditions — is what runs into this ceiling.

    The hinge described above has a downstream application here, and it is the most frequently reached-for partial remedy. A reviewer working in the convergent form can pull in the divergent form as cognitive support during review itself — the same model in a different grip, tasked not with producing but with interrogating a candidate that already exists. Where does this candidate diverge from the requirements? What assumption does it make silently? What alternative would hold the constraints differently? Used this way, the divergent form extends the reviewer’s bandwidth by sharpening the questions asked of the candidate, and by surfacing failure modes the tired reviewer would not have seen. It is a legitimate and often underused move, and it is a variant of the hinge described in the previous block — applied in the middle of verification rather than upstream of generation.

    What it does not do is transfer the responsibility. The model can grade a candidate against defined criteria, surface inconsistencies, and propose alternatives — those are real capabilities, and there is a growing practice of using models as graders or judges for that purpose. What it cannot be is the accountable signature on the artifact. Its output is more material for the reviewer to weigh, not the loop-closure itself. Whoever treats it as the loop-closure has not resolved the bottleneck — they have added a second generator whose output also flows through with the same nominal-rather-than-real oversight, and moved the closure one step further out while calling it closed. "The other model said it looked fine" is not verification. It is the same nominal oversight, dressed up with a second opinion that carries no more accountability than the first.
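    The division of labor can be made visible in the shape of the grader's output: structured material for the reviewer, with no approval field that could be mistaken for signoff. Here model_judge stands in for a real LLM-judge call; all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GraderReport:
    """Output of a model-based grader: material for the reviewer, not a verdict.
    Deliberately absent: any 'approved' flag. The accountable decision
    stays with the human."""
    criterion_notes: dict
    flagged: list = field(default_factory=list)

def grade_candidate(candidate, criteria, model_judge):
    """criteria: {name: check_prompt}. model_judge is a hypothetical callable
    wrapping an LLM-judge prompt; it returns free-text notes per criterion."""
    notes, flagged = {}, []
    for name, check_prompt in criteria.items():
        verdict = model_judge(candidate, check_prompt)
        notes[name] = verdict
        if "diverges" in verdict.lower():
            flagged.append(name)        # a pointer for the reviewer, not a rejection
    return GraderReport(criterion_notes=notes, flagged=flagged)

# usage with a stand-in judge
def fake_judge(candidate, prompt):
    return "diverges: no input validation" if "validate" in prompt else "consistent"

report = grade_candidate("def handler(x): ...",
                         {"validation": "check it validates inputs",
                          "naming": "check naming conventions"},
                         fake_judge)
assert report.flagged == ["validation"]
```

    The design choice is in the type, not the model: a report that cannot express approval cannot quietly become the loop-closure.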

    When throughput in a Class-2 workflow exceeds the reviewer’s verification capacity — whether or not the divergent form was used as support — the workflow does not simply slow down. It changes class. What was a tool-in-hand mode tips over into something that operationally resembles an embedded component: outputs flow through with nominal rather than real oversight. But the resemblance is partial. The workflow has now acquired the failure mode of a Class-1 component without acquiring its monitoring discipline. No drift detection, no evaluation harness, no post-deployment review of where the candidates were actually wrong. The failure surface of Class 1, without the maintenance surface that makes Class 1 viable.
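    One way to keep the loop cognitively closed rather than nominally closed is backpressure: a gate that refuses new candidates once the review backlog exceeds a stated capacity. A sketch with hypothetical names:

```python
from collections import deque

class ReviewGate:
    """Backpressure between generation and human review: when the backlog
    exceeds the reviewer's stated capacity, generation pauses instead of
    letting outputs flow through with nominal oversight."""

    def __init__(self, max_backlog=5):
        self.backlog = deque()
        self.max_backlog = max_backlog

    def can_generate(self):
        return len(self.backlog) < self.max_backlog

    def submit(self, candidate):
        if not self.can_generate():
            raise RuntimeError("review backlog full: generating more now would "
                               "turn real oversight into nominal oversight")
        self.backlog.append(candidate)

    def review_done(self):
        return self.backlog.popleft()   # one verified artifact frees one slot

gate = ReviewGate(max_backlog=2)
gate.submit("candidate 1")
gate.submit("candidate 2")
assert not gate.can_generate()
gate.review_done()
assert gate.can_generate()
```

    The uncomfortable property of this gate is that it makes the bottleneck visible on a dashboard, which is precisely what the class-change it prevents would have hidden.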

    The same bottleneck has a second escalation path, this one upward. When several Class-2 convergent steps are composed into a Class-3 chain, verification does not scale linearly. Reviewing a single artifact already pressed cognitive limits; reviewing a chain of artifacts where each step’s correctness depends on the previous step’s interpretation is not n times harder, it is operationally a different problem. The reviewer can no longer hold the full state in mind, can no longer compare each step against an unchanged reference, and is increasingly forced to spot-check rather than verify. Spot-checking is not verification at chain length — it is sampling, with all the failure modes sampling carries on adversarial or drift-prone surfaces. On a floor whose specific failure mode is the chain that succeeds step-by-step but drifts as a whole, sampling is structurally inadequate. The Class-2 bottleneck does not disappear when the work moves up to Class 3; it gets multiplied by the chain length, and the cognitive ceiling that bounded a single reviewer becomes the ceiling that bounds the entire orchestration’s verifiability.
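    The inadequacy of sampling can be put in numbers under a deliberately generous assumption: that errors are independent per step and recognizable by a spot-check at all (whole-chain drift is not). Even then, the probability that the chain contains an error outruns the probability that spot-checking finds one. The rates below are invented for illustration.

```python
def detection_probability(steps, error_rate_per_step, sample_fraction):
    """P(spot-checking catches at least one bad step), assuming independent
    per-step errors that a check would actually recognize: a generous simplification."""
    p_caught_per_step = error_rate_per_step * sample_fraction
    return 1 - (1 - p_caught_per_step) ** steps

ERROR_RATE = 0.02   # hypothetical 2% per-step error rate
SAMPLE = 0.2        # reviewer spot-checks 1 step in 5

short_chain = detection_probability(3, ERROR_RATE, SAMPLE)    # roughly 0.012
long_chain = detection_probability(30, ERROR_RATE, SAMPLE)    # roughly 0.113

# probability the 30-step chain contains at least one erroneous step: roughly 0.455
p_chain_has_error = 1 - (1 - ERROR_RATE) ** 30

assert long_chain < 0.12
assert p_chain_has_error > 3 * long_chain   # errors accumulate faster than detection
```

    And this is the favorable case: a chain that drifts as a whole while every sampled step looks locally correct has a detection probability near zero under per-step sampling, at any sampling rate.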

    Composition as the next floor

    Once these review and generation steps are no longer organized individually but as a repeatable, branched, and stateful flow, Class 2 tips over into Class 3.

    A practitioner working with the tool on a single artifact stays in the grip just described. The moment that same person begins to assemble several such steps into a system — steps that run in sequence, pass results to one another, branch on conditions, call external tools, hold state over time — the work moves up a floor. What was previously a sequence of tool invocations becomes a composition. The hand is the same. What it works on is no longer an artifact but a plan, according to which several artifacts and calls cooperate.

    This floor has acquired its own names: orchestration, agent system, autonomous pipeline. Technically, it is a composition of bounded capabilities plus a layer that takes on coordination, state, routing, guardrails, and workflow management. The question that carries this floor is not whether it works — it works, in a growing number of cases. The question is which problem actually needs this form and which would be better served elsewhere.

    Two specification regimes

    On this floor, public discussion regularly draws a distinction that is built askew. The load-bearing contrast is not "vibe vs. enterprise" and not "pros vs. amateurs." Both sides iterate, both can use the same tools, both produce running systems. The real distinction lies in the specification regime: how far is the intention the system acts on externalized, versioned, checkable, and capable of being handed off?

    At the industrial pole (contract-first), the authoritative intention sits in explicit artifacts. Brief, architecture plan, security plan, test plan, approvals, documentation, change trail, accountabilities. Conversation can be input, but it is not authority. Authoritative are the artifacts against which architecture, security, testability, handoff, and accountability can be hung. In some practices the code or the configuration is itself one of those artifacts — infrastructure-as-code, GitOps repositories, declarative manifests treated as source of truth. The point is not that code cannot carry intention; it is that something has to carry it in a form that survives the people who wrote it, and that form has to be the one against which subsequent change negotiates. Whether that artifact is a separate document or the code itself is a question of practice. That an artifact exists, is maintained, and is authoritative is not.

    At the conversational pole (conversation-first), the authoritative intention sits primarily in the dialog, in the tool state, and in the platform’s defaults. A part of the specification sits in the ongoing conversation, a part in the prompts and plugins the system ships with, a part in conventions that have never been written out. The specification is not absent — it is situational, lighter, and more strongly embedded in the ongoing exchange.

    Both poles are legitimate. For a personal tool that only its builder and immediate context need, for a one-off automation hack, for an artifact that regenerates faster than it can be maintained, the conversational regime is economically rational. The overhead of externalizing intention into maintained artifacts would not be justified.

    Run, not tool

    The sorting into these two poles does not apply to a tool as a whole, but to the way it is operated. The same system can carry the same build process in a conversational run or in an industrial run. What decides is not the machine, but the mode.

    This observation matters because it prevents readers from sorting their favorite tools into one pole or the other. The question is not whether a tool is "industrial" or "vibe." The question is whether the concrete run someone is currently driving anchors its intention in artifacts or holds it in dialog.

    The cheapest entry

    From this separation follows an operational observation that is decisive for most industrial runs: the conversational mode is the cheapest place at which risk can be removed at entry.

    In dialog, assumptions, boundary conditions, and contradictions are cheap to move. A requirements conversation that surfaces a wrong assumption costs a question and an answer. The same wrong assumption, once poured into code, deployment, and operation, costs a revision, a rollback, a debugging session, a post-mortem. The distance between the two costs is the entire industrialization machinery standing behind it.

    This is where the Class-2 hinge described earlier becomes economically visible. The conversational entry is the moment at which the divergent grip is at its cheapest and most useful. Assumptions that are interrogated here cost a sentence. Assumptions that are interrogated later cost a change request. An industrialized run that treats the conversational entry as a cost-minimization mechanism — not as a warm-up phase — is running the hinge in its highest-leverage position. A run that skips this phase because it looks unproductive is paying the same cost later, and paying it with industrialization overhead layered on top.

    An industrialized run that begins conversationally has an additional advantage: the interviewer knows the guardrails against which they are asking. They know which security requirements apply, which architectural decisions have already been made, which test discipline the practice runs. Their questions are more targeted because they run against something. A conversation-first run without these guardrails can ask the same questions, but not with the same precision against already-settled conditions.

    The hybrid — conversational at entry, contract-first in execution — is therefore not a special case next to the two poles, but an economically common shape of an industrial run. The poles themselves are real. Their most frequent combination is also real.

    The limit of the conversational regime

    The conversational regime has a scaling limit that does not lie in the technology but in the lifespan and headcount of the system. As long as the builder is still reading along in the tool state — as long as they know what they meant last week, which assumption they implicitly set in the dialog, which platform default they accepted — the regime holds. Beyond that condition, it becomes unstable.

    A system that is operated for three years by rotating humans needs a specification that is not bound to the original builder. If this specification does not exist in artifacts, it exists in the heads of the humans who still remember it — and loses its carriers when they move on. Escalation then lands with the person who, as the last one, still knows what was meant. This role is rarely named explicitly, is often carried implicitly, and is operationally an expensive escalation point.

    This is neither a forecast nor a judgment against the conversational regime. It is the operational condition under which it remains stable.

    The governance façade

    From the separation of the two regimes, an anti-pattern can be named operationally. It begins where an organization builds industrialized governance machinery — permissions, tracing, approval workflows, policy gates, pipeline discipline — while the specification itself still exists only as conversation. The form is industrial, the substance is conversational. Governance machinery is being maintained; the specification artifacts that this machinery exists to enforce have never been written down.

    The anti-pattern is not that the wrong regime was chosen. It is that one form is executed in the substance of the other. The operational test is simple: take the governance-relevant questions (which requirements apply, who decided what, why a decision was made). If no load-bearing artifact answers exist for them, and the practically effective answers are at their core "it’s in the chat," "it sits in the prompt," "it is person-bound," "it is assumed in the tooling" — then the system is conversation-first, regardless of which governance façade sits on top.

    The result is more permissions administration, more orchestration surface, more tracing, more validation load, and more governance machinery than the underlying specification substance carries. The point is not that governance is bad. The point is that governance has its prerequisite, and the prerequisite is called: maintained artifacts against which governance can enforce something.
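    That operational test can even be written down. The question list, locations, and category names below are invented; the point is the shape of the test: the effective regime is decided by where the answers live, not by the machinery on top.

```python
from dataclasses import dataclass

@dataclass
class SpecQuestion:
    """One governance-relevant question and where its effective answer lives."""
    question: str
    answer_location: str   # e.g. "docs/security-plan.md" or "chat"

# locations that mean the answer is conversational, person-bound, or implicit
CONVERSATIONAL_LOCATIONS = {"chat", "prompt", "person", "tooling default"}

def effective_regime(questions):
    """Return the regime the system actually runs in, plus the questions
    whose answers are not anchored in any load-bearing artifact."""
    unanchored = [q.question for q in questions
                  if q.answer_location in CONVERSATIONAL_LOCATIONS]
    return ("conversation-first", unanchored) if unanchored else ("contract-first", [])

# usage: a hypothetical audit of three questions
audit = [
    SpecQuestion("Which security requirements apply?", "docs/security-plan.md"),
    SpecQuestion("Why was the queue chosen over polling?", "chat"),
    SpecQuestion("Who approves schema changes?", "person"),
]
regime, gaps = effective_regime(audit)
assert regime == "conversation-first" and len(gaps) == 2
```

    The binary verdict is of course a simplification; in practice the list of unanchored questions is the useful output, because it is also the extraction backlog for the next block.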

    From conversation to artifact

    The diagnosis above is incomplete without its constructive counterpart. A conversational specification is not condemned to remain conversational. It can be extracted — turned into declarative artifacts that, from the moment they exist, serve as the basis for every subsequent change. This is the path that makes a conversational entry compatible with an industrial run, and on the industrial side it is not optional.

    The mechanism is recognizable. A working dialog produces, in passing, the material that a specification is made of: requirements as they were actually formulated, decisions as they were actually made, constraints as they were actually accepted, alternatives as they were actually rejected. This material exists. It is just not yet in a form against which governance can enforce anything. Extraction transforms it: requirements become a requirements list, decisions become decision records, constraints become a constraints document, rejected alternatives become a documented “considered and not chosen” trail. The form changes. The substance was already there.
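What such extraction output could look like can be sketched in a few lines. This is a minimal illustration, not a prescribed schema — the class names, fields, and the load-bearing test are assumptions chosen for the example:

```python
from dataclasses import dataclass, field

# Illustrative structures for the artifact types the extraction produces.
# Names and fields are assumptions, not a prescribed format.

@dataclass
class DecisionRecord:
    title: str
    decision: str
    rationale: str
    rejected_alternatives: list[str] = field(default_factory=list)

@dataclass
class SpecificationArtifacts:
    requirements: list[str]
    decisions: list[DecisionRecord]
    constraints: list[str]

    def is_load_bearing(self) -> bool:
        # A transcript with headers has the form but not the substance;
        # the minimal test here is that every section is actually populated.
        return bool(self.requirements and self.decisions and self.constraints)

spec = SpecificationArtifacts(
    requirements=["Responses must cite a source document"],
    decisions=[DecisionRecord(
        title="Retrieval store",
        decision="Use the existing Postgres instance",
        rationale="No new operational surface",
        rejected_alternatives=["Dedicated vector database"],
    )],
    constraints=["P95 latency under 2 seconds"],
)
print(spec.is_load_bearing())
```

The point of the structure is only that each of the four material types lands in a named, checkable place rather than remaining dissolved in the transcript.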

    This extraction is itself a Class-2 task that uses both of the mode’s forms. The divergent form surfaces what was implicit in the conversation that the participants themselves no longer see — the assumptions that were made silently, the constraints that were treated as obvious, the decisions that were never explicitly named as decisions. The convergent form then produces the artifact under the form constraints of the practice’s documentation discipline. Both forms are needed; neither alone is enough. A pure convergent extraction produces a transcript with headers, which is not the same as a specification. A pure divergent interrogation produces insight without artifact, which is not the same as documentation.

    The operative condition that makes this path real, rather than ceremonial, has one sentence: from the moment the artifact exists, it is the authority. Subsequent changes negotiate against it, not around it. Conversation may continue, and is often the cheapest place to surface the next change — but the change lands in the artifact before it lands in the system. An organization that produces extraction artifacts and then continues to make changes in conversation, leaving the artifacts to drift, has not built an industrial run. It has built a governance façade with documentation attached, which is the same anti-pattern with one more component to maintain.
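The authority condition can itself be made checkable: if the artifact declares values that the running system must honor, drift between the two becomes a detectable defect rather than a silent renegotiation. A minimal sketch, assuming a flat key–value artifact; all keys and values are illustrative:

```python
# Hypothetical drift check: the artifact is authoritative, so any value
# in the running system that disagrees with it is reported as a defect,
# not silently accepted.

ARTIFACT = {"max_retries": 3, "timeout_s": 30, "model_tier": "small"}

def drift(artifact: dict, deployed: dict) -> list[str]:
    """Return one defect entry per key where system and artifact disagree."""
    defects = []
    for key, expected in artifact.items():
        actual = deployed.get(key)
        if actual != expected:
            defects.append(f"{key}: artifact={expected!r}, system={actual!r}")
    return defects

deployed = {"max_retries": 5, "timeout_s": 30, "model_tier": "small"}
print(drift(ARTIFACT, deployed))  # flags the drifted max_retries
```

A check like this, run in CI or on a schedule, is one concrete way to make “changes land in the artifact before they land in the system” enforceable rather than aspirational.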

    The path therefore has two halves. The extraction is the easier half — it is a workflow that, once practiced, becomes routine, and Class 2 makes it cheaper than it used to be when it had to be done by hand from memory. The harder half is the discipline that follows: treating the artifact as the basis for change, refusing to let the conversation become the de facto specification again as soon as the next iteration begins. This is not a tooling problem. It is an organizational one, and it is the same discipline that the industrial pole has always required of any specification regime, AI-related or not.

    This is also the half that most reliably fails, and the failure mode is not ignorance but incentive. Sprint velocity does not reward documentation work that produces no shipped artifact. Ticketing systems privilege whatever produces a closeable item; specification maintenance produces no closeable item. Conversational tooling has zero friction; artifact maintenance has continuous friction. Under those incentives, an organization that initially extracts artifacts will, within two or three iteration cycles, find itself making changes in conversation again — not because anyone decided to, but because the path of least resistance was reinstated by the surrounding system. The text describes the mechanism. It does not describe a fix for the incentive structure that erodes it, because that fix is not technical. It is organizational, and the practice that survives is the one whose incentive structure has been redesigned to reward specification maintenance — by making it visible, by making it part of definition-of-done, by escalating drift between artifact and system as a defect class. Whether an organization is willing to do that is not a question the sorting can answer.

    A correctly built hybrid is recognizable by this rhythm. Conversational at entry, extraction to artifact at the threshold, contract-first in execution, change requests routed back through the artifact rather than around it. The conversation is preserved as a tool, not promoted to authority. This is what the industrial pole’s claim to load-bearing artifacts looks like in an organization that has integrated the conversational entry as a cost-reduction mechanism rather than refused it as a discipline failure.

    Containment as an architectural form

    Once a system on this floor can plan, persist, call external tools, and act across workflows, the question of what it means for the system to be “wrong” shifts. In the lower layers, a wrong AI answer is a wrong output that is processed further or not. On this floor, a wrong AI answer is potentially a wrong action that changes state in another system.

    The phenomena that follow have been described for several years and named in several agentic safety taxonomies. Inputs can bend goals. Persisted context can infect later steps. Local errors can amplify across a pipeline rather than being caught.

    What governance looks like on this floor is not decorative contract machinery. It is containment: what limits permissions, scopes the radius of effect of a single action, narrows the authority of individual tool calls, restricts memory, and prevents a once-compromised state from propagating across further steps. Containment is part of the architecture on this floor. Technical mechanisms — sandboxing, minimal privileges, approval gates on sensitive operations — make containment possible at all; maintained artifacts make it scalable, auditable, and handoff-capable.
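The narrowing of tool-call authority can be illustrated with a few lines. This is a sketch under assumptions — agent names, tool names, and scopes are invented for the example, and a production mechanism would sit below the agent, not beside it:

```python
# Minimal containment sketch: each tool call is checked against an
# explicit per-agent permission scope before it runs.

ALLOWED = {
    "ticket-triage-agent": {"read_ticket", "add_comment"},  # no state-changing ops
}

class PermissionDenied(Exception):
    pass

def call_tool(agent: str, tool: str, fn, *args):
    """Run fn only if this agent's scope includes the tool; deny otherwise."""
    if tool not in ALLOWED.get(agent, set()):
        raise PermissionDenied(f"{agent} may not call {tool}")
    return fn(*args)

# A read within scope succeeds; a write outside the scope is refused.
print(call_tool("ticket-triage-agent", "read_ticket", lambda t: f"ticket {t}", 42))
try:
    call_tool("ticket-triage-agent", "close_ticket", lambda t: None, 42)
except PermissionDenied as e:
    print(e)
```

The design point is that the scope is declared as an artifact (the allowlist), which is exactly what makes the containment auditable and handoff-capable rather than person-bound.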

    Perfection through handling of imperfection

    Containment limits what a single wrong action can reach. It does not address the prior question, which is architecturally the more important one on this floor: what the system does when a wrong action occurs — and it will occur, because probabilistic components fail probabilistically.

    This is where a large body of ordinary systems engineering, predating the current AI discourse by decades, becomes load-bearing again. Fault tolerance. Graceful degradation. Circuit breakers. Retry with backoff. Idempotent operations. Compensating transactions. Dead-letter queues. Recovery procedures that bring the system back to a known state after a step produces nonsense. Observability dense enough to locate which step produced the nonsense and why. None of this is new — these patterns have been mandatory in deterministic distributed systems for a long time, for the same structural reason: components fail, networks partition, dependencies time out. What changes on this floor is the failure rate per step and the nature of the failure. The patterns are familiar; the calibration is not. Circuit-breaker thresholds, retry budgets, and idempotency boundaries that were sufficient when the failure rate was 0.1% become inadequate when a probabilistic step fails meaningfully more often, and „fails“ includes outputs that pass schema validation and are still wrong.
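Two of the listed patterns — retry with backoff and a circuit breaker — can be combined in a compact sketch. The thresholds and the breaker logic are deliberately minimal and illustrative; the calibration question the paragraph raises is precisely what this sketch does not answer:

```python
import time

class CircuitOpen(Exception):
    pass

class Breaker:
    """Tiny circuit breaker: after max_failures exhausted retry cycles
    the circuit opens and further calls fail fast instead of retrying."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, retries=2, base_delay=0.01):
        if self.failures >= self.max_failures:
            raise CircuitOpen("failing fast; downstream step is unhealthy")
        for attempt in range(retries + 1):
            try:
                result = fn()
                self.failures = 0  # success resets the failure count
                return result
            except Exception:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        self.failures += 1
        raise RuntimeError("step failed after retries")

# A transiently failing step succeeds on the second attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("transient")
    return "ok"

breaker = Breaker()
print(breaker.call(flaky))
```

Note what the sketch cannot see: an output that passes every check and is still wrong never enters the `except` branch at all, which is why these patterns are necessary on this floor but not sufficient.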

    An industrial Class-3 run is recognizable by the fact that its architecture assumes errors will happen and is organized around handling them. A conversational run often carries the opposite assumption implicitly — that the model “usually gets it right” and that error cases are edge cases to be addressed later. On this floor, that assumption is structurally wrong. The model will get it wrong. Not rarely. The architecture has to hold regardless.

    The formulation that compresses this best is old: perfection through handling of imperfection. The perfection of the system is not the perfection of each step. It is the perfection of the handling of the imperfection of the steps. Building Class-3 systems without this framing produces systems whose reliability is bounded by the reliability of their least reliable component — which, on this floor, is a probabilistic one.

    The principle has a stronger reading than error handling alone. If imperfection is genuinely accepted as the operating condition of this floor, then handling it can extend beyond degradation and recovery into the full development loop: defect reports become tickets, tickets are routed to the swarm, the swarm produces fixes, fixes ship through declarative deployment, and the next defect report enters the same loop. End-acceptance is no longer a per-artifact human gate. It becomes a property of the loop as a whole — the system’s correctness measured by whether the rate of fix-shipping keeps up with the rate of defect-surfacing, not by whether each individual change passed a human signoff before deployment.

    This variant is sometimes presented as fully autonomous end-to-end development. The framing is misleading and worth correcting at the point at which the architecture is named. The human has not been eliminated from the loop; the human position has moved. End-acceptance is removed; the signal edge — where defects and feature requests enter — is where the human now sits. Most defect signals in such a system come from people: users encountering wrong outputs, operators noticing degraded behavior, stakeholders requesting changes. A secondary signal channel is the monitoring layer itself — often a Class-1 component — that detects an anomaly and files a ticket against the swarm without a human in between. Both channels are real. Neither makes the loop autonomous. They make it human-fed at the edge instead of human-checked at the center.

    Naming the architecture this way matters because the operational reality follows the naming. A system described as “fully autonomous” produces one set of expectations about responsibility, oversight, and competence requirements. A system described as “human-fed at the edge, fix-loop closed in the middle” produces a different set — one in which the quality of the signal edge becomes load-bearing. If users are the primary defect detector, the cost surface includes every defect users do not surface: silent corruption, drift, outputs that are wrong but not visibly wrong from the user’s vantage point. If a Class-1 monitoring component files the tickets, the swarm’s behavior is bounded by what that component is configured to detect. The closed loop is only as good as the signal it closes around, and the signal does not originate inside the loop. The economic shape this gives the cost surface — what fails between defect-introduction and edge-detection, and how its external reach scales with detection latency — is a question this sorting names but does not resolve.

    This is the strongest reading of perfection through handling of imperfection on this floor. It accepts that individual steps will fail, accepts that even the swarm’s fixes will sometimes fail, and organizes the architecture so that the loop converges anyway — under the condition that the signal edge is well-positioned and the rate of correct fixes exceeds the rate of defect-introduction. Whether that condition holds in a given system is an empirical question, not an architectural guarantee.
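The convergence condition has a simple quantitative shape that a toy model makes visible. All rates below are invented for illustration; the only claim carried over from the text is the structure — the backlog shrinks only while fixes shipped exceed defects entering, including the defects the fixes themselves introduce:

```python
# Toy model of the loop's convergence condition: each cycle surfaces new
# defects, ships fixes, and a fraction of fixes introduce fresh defects.

def backlog_after(cycles, backlog, defects_per_cycle, fixes_per_cycle,
                  defect_rate_of_fixes):
    for _ in range(cycles):
        shipped = min(fixes_per_cycle, backlog)
        backlog += defects_per_cycle + shipped * defect_rate_of_fixes - shipped
    return backlog

# Converging: 12 fixes/cycle, 10% of them regress, 8 new defects/cycle.
print(backlog_after(20, 50, 8, 12, 0.1))
# Diverging: same loop, but fixes regress 40% of the time.
print(backlog_after(20, 50, 8, 12, 0.4))
```

With the first parameter set the backlog settles near a small steady state; with the second it grows without bound. Which regime a real system sits in is, as the text says, an empirical question.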

    This connects back to the specification-regime distinction. Error handling of this kind is not something a conversation carries. It is something artifacts carry: the failure modes enumerated, the compensations specified, the recovery paths tested, the circuit-breaker thresholds documented. A conversational Class-3 run without artifact-carried error handling can look impressive in the happy path and collapse in the first non-trivial failure. The governance façade described earlier has a sibling here: the orchestration façade, which runs the happy path beautifully and has nothing to say when a tool call returns nonsense.

    Who builds this floor

    Class 3 is built with Class-2 work. Eroding the craft layer erodes the build capacity of this floor. Orchestration, too, emerges from the same skill: sharpening questions, formulating constraints, checking intermediate states, making handoffs explicit. But this craft layer is necessary, not sufficient — the difficulties that follow draw on a different engineering tradition, one that the lower classes do not require in this form.

    Three of these difficulties deserve to be named, because they are the difficulties that distinguish a working orchestration from one that merely runs.

    The first is coordination. Class 3 is multiple model calls cooperating, not in the sense of sequential execution — that part is trivial — but in the sense of semantic interoperability. One step’s output has to be in a form the next step can productively work with. This is interface design on a floor where the interfaces are not deterministically defined. Classical API design asks: is this output formally valid. Class-3 coordination has to ask the harder question: is this output formally valid and semantically usable by what comes next, or is it the kind of correct-but-useless artifact that passes the schema check and breaks the downstream reasoning. Coordination on this floor is not solved by stricter schemas alone; it is solved by handoff design that anticipates what the receiving step actually needs to do its work.
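The schema-valid-but-unusable distinction can be made concrete with a sketch. The field names and the downstream need are assumptions invented for the example; the pattern is the point — the receiving step declares what it needs, and the handoff check tests against that, not only against the schema:

```python
# Handoff check that goes beyond formal validity: the receiving step
# drafts an answer from citations, so a schema-valid output with an
# empty citation list is correct-but-useless for it.

def schema_valid(output: dict) -> bool:
    return (isinstance(output.get("summary"), str)
            and isinstance(output.get("citations"), list))

def usable_by_next_step(output: dict) -> bool:
    return (schema_valid(output)
            and len(output["citations"]) > 0
            and output["summary"].strip() != "")

ok = {"summary": "Outage caused by expired cert", "citations": ["incident-4711"]}
useless = {"summary": "", "citations": []}

print(usable_by_next_step(ok))        # passes both checks
print(schema_valid(useless))          # formally valid...
print(usable_by_next_step(useless))   # ...and still rejected
```

The `useless` output is exactly the artifact the paragraph describes: it would clear a pure schema gate and break the downstream reasoning anyway.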

    The second is staffing the swarm. Which step gets which model. Generalist or specialist. Large model or small. Expensive reasoning model for the steps that need to plan, cheap classifier for the steps that need to route. This is resource allocation, but it is probabilistic resource allocation — the choice of model at a given step does not only shift cost, it shifts the probability distribution of that step’s outputs and therefore the behavior of the system as a whole. A pipeline staffed entirely with the most capable model is expensive and overkill; a pipeline staffed entirely with the cheapest is brittle in the steps that needed reasoning. The competence required here is staffing judgment under probabilistic constraints, and it has very little prior art in the deterministic engineering tradition that the rest of the build draws on.
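The staffing decision can be sketched as an explicit assignment table. Step names, tier names, and the per-call costs are all illustrative assumptions — the sketch shows only the cost half of the tradeoff, not the probabilistic half that makes the judgment hard:

```python
# Illustrative staffing table: each pipeline step is assigned a model
# tier by what the step needs, not by what is most capable overall.

STAFFING = {
    "route_request":  "small-classifier",   # cheap, high-volume routing
    "plan_changes":   "large-reasoner",     # needs multi-step planning
    "summarize_diff": "mid-generalist",     # fluent text, no deep reasoning
}

COST_PER_CALL = {
    "small-classifier": 0.001,
    "mid-generalist":   0.01,
    "large-reasoner":   0.1,
}

def pipeline_cost(steps: list[str]) -> float:
    return sum(COST_PER_CALL[STAFFING[s]] for s in steps)

mixed = pipeline_cost(["route_request", "plan_changes", "summarize_diff"])
all_large = len(STAFFING) * COST_PER_CALL["large-reasoner"]
print(mixed, all_large)  # mixed staffing at a fraction of all-large cost
```

What the table cannot encode is the text's harder point: swapping the tier at a step shifts that step's output distribution, not just its line in the bill.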

    The third is alignment across the chain. Each step has its own latitude for interpretation. The question is whether the chain as a whole still does what the initial intention was. Drift between steps, cumulative shift of interpretation across the pipeline, consistency of assumptions across tool calls — these are alignment problems specific to multi-step probabilistic systems. They are not the same as containment (which limits the radius of a wrong action) and not the same as error handling (which addresses what to do when a step fails outright). Alignment addresses the case where every individual step succeeds on its own terms and the chain as a whole nevertheless arrives somewhere other than the initial intention pointed. This failure mode is hard to detect because no single step looks wrong. It is visible only at the chain level, which means it is visible only to whoever is watching the chain — and on a floor whose appeal is partly that it runs without watching, that is a competence that has to be deliberately built and deliberately preserved.
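Why no single step looks wrong while the chain drifts has a simple multiplicative illustration. The per-step fidelity figure is invented, not measured — the arithmetic, not the number, is the point:

```python
# If each step preserves the upstream intention with fidelity f, a chain
# of n steps preserves it with roughly f**n. Every step can look locally
# fine while the chain as a whole does not.

def chain_fidelity(per_step: float, steps: int) -> float:
    return per_step ** steps

for n in (1, 5, 10):
    print(n, round(chain_fidelity(0.95, n), 2))
```

At an illustrative 95% per step, a ten-step chain lands around 60% — which is why this failure mode is visible only at the chain level, exactly as the paragraph argues.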

    These three difficulties — coordination, staffing, alignment — are not eliminated by any of the lower classes’ techniques and do not yield to better individual models, though each can be partially mitigated by interface discipline, evaluation infrastructure, and stronger steps in isolation. They are properties of the composition itself, and they are the reason that Class 3 is genuinely a class of its own and not a stack of Class-2 calls in a trench coat.

    The strategic precondition

    The three modes described so far are operational. They are executed by people working on artifacts. One level above the work, the same modes appear differently — as objects of decision rather than execution. Where to invest, what to build up, how to organize, which competencies to develop. At that level, the sorting stops being a description of how AI is used and starts being a precondition for how it is procured. That is the point at which sorting touches strategy.

    Strategic consistency means that comparable decisions fall comparably. Whoever decides without the sorting implicitly refers to a different class in every decision, without knowing it. A Class-1 problem gets a Class-3 solution, because Class 3 is the current language that receives investment. A Class-2 need is procured as a Class-1 component, because the component can be booked against a budget line. These are not strategic errors. They are the inevitable consequences of missing sorting.

    The costs are paid anyway. Mis-staffings, maintenance surfaces dragged in later, governance façades without substance, pilot projects that crowd one another out without ever answering the same question. They show up as infrastructure overspend, review-team headcount, incident-response load, retraining cycles, compliance retrofitting, and the slow accrual of architectural debt that has to be serviced before the next thing can be built on top of it. The only difference lies in whether they are later booked as a learning item or as an unattributable loss.

    From this follows a hard observation. A strategy that says nothing about which of the three modes an organization invests in, how it builds the competencies required for that, how it organizes work in those modes, which practices it regards as load-bearing in each mode, and by which criteria it selects tools, remains closer to an investment catalog than to a strategy.

    The tool question deserves its own sentence because it is the one most often left unasked. Reputation and default behavior usually answer it before anyone formulates it: what gets taken is what is currently ahead in the discourse or what the platform suggests. The question of whether the chosen tool fits the concrete task — whether a smaller model is enough for calibration and learning, whether a specialist model would be more appropriate for a Class-1 component than a generalist, whether the task even needs a model — is rarely raised in practice, because it is rarely asked.

    This sorting does not prescribe a strategy. It makes one possible.

    The end of the sorting

    With this piece, the sorting of the three modes is complete. Embedded component, tool in hand (with its two forms — divergent and convergent), composition into orchestration. Three modes, three economies, three surfaces of accountability.

    What to do with this sorting — which of the three fits which problem, how investment decisions, team composition, and architectural choices shift under this lens, what that concretely means for the work in this or that organization — does not belong in this piece. It belongs in later texts that can fall back on this sorting as common ground. The piece proposes a taxonomy and argues for distinctions; it does not prescribe what to do with them.


    Follow this blog on Mastodon or in the Fediverse to get updates directly in your feed.
