High-Velocity Decision Making and Agentic Systems Architecture

Amazon takes its culture seriously — more seriously than any company I’ve worked at before. The first 90 days are a firehose of Jeff-isms, Leadership Principles, and decision-making frameworks, and the expectation is that you’ll be fluent in them by the time you’re contributing. Two-way doors was one of them. High-velocity decision making. And the idea that you shouldn’t wait for 90% of the information before making a call — that if you wait that long, you’re almost always being too slow, and the cost of being slow is usually higher than the cost of being wrong on something reversible. The payoff is real: teams move faster, escalation paths stay clear, and the organization doesn’t grind itself into consensus paralysis as it scales.

I’d been applying the one-way and two-way door idea at the tool layer in my most recent builds — designing MCP servers for my agent team so the reversible actions stayed reversible and the irreversible ones got their own surface. Then we did a high-velocity decision making refresher at a recent offsite, and I realized I’d been thinking about this too narrowly. The door concept is one piece of a bigger framework, and the whole framework maps onto agentic systems architecture — not just the MCP surface, but how you design the team, authorize its actions, monitor its behavior, and decide when to expand its trust envelope.

So that’s what this post is. A walk through high-velocity decision making the way Amazon teaches it, alongside what each piece means for how you build an agent team. Some of it confirms what the leading voices on agentic architecture are already saying. Some of it sharpens their framing. And in a few places, it points somewhere different — because Amazon has been operating this model on humans for two decades, and that’s a long head start on a problem the industry is still trying to name.

One-way and two-way doors

The door concept is the load-bearing piece of high-velocity decision making, so it’s worth getting precise. A two-way door is a decision you can walk back through. You try something, you watch what happens, and if it’s wrong you reverse it. The cost of being wrong is bounded — mostly the time you spent walking through the door in the first place. A one-way door is a decision that commits you. Once you walk through it, the door closes behind you. Reversing it is either impossible or so expensive that “impossible” is the right planning assumption.

The point Bezos made in 2016 — and the reason the framing stuck inside Amazon — is that most decisions are two-way doors, but large organizations have a gravitational pull toward treating every decision like a one-way one. The more layers of approval, the more review cycles, the more meetings about the meeting. Applied to a two-way door, that process is waste. Applied to a one-way door, it’s the appropriate amount of care. The mistake isn’t having a careful process. The mistake is using the same process for both kinds of doors.

Now look at your agent team.

Every tool call your agent makes is a door. Some of them are two-way — reading a file, listing the calendar, fetching a document, creating a draft. The action is reversible, the blast radius is small, and if the agent gets it wrong you can delete what it created and try again. Some of them are one-way — sending the email, wiring the payment, deleting the row, publishing the post. Once the action lands, you’re not walking it back without real effort and often without real cost.

Here’s where agentic architecture diverges from how Amazon operates humans. When you onboard a new employee, they show up carrying the door framework in their head. They know that hitting send on a customer-facing email is different from drafting one. They know that a production deploy is different from a code review comment. The classification is internal to the person. Your agent doesn’t carry that. It sees a tool, it sees a description, and it calls it. Whatever door-awareness exists has to live in the interface you hand it.

That’s why the tool surface matters more than the prompt. You can write the most careful agent instructions in the world — “be cautious with destructive actions, prefer drafts when possible, ask before committing” — and the agent will still blow through a one-way door if your tool signature doesn’t distinguish the draft from the commit. The agent is doing what you told it. The tool is what lied about which kind of door it was.

The Model Context Protocol specification has a vocabulary for this and the vocabulary is almost right. Tools can be annotated with readOnlyHint, destructiveHint, and idempotentHint — the spec explicitly describes these as fields that help clients decide when to ask for user confirmation. The problem is the word hint. These are advisory metadata, not structural guarantees. A tool that lies about its destructiveHint is still a valid MCP tool. An agent using that tool has no way to know it’s been misled.
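
To make that concrete, here is roughly what the annotated form looks like, written as a Python dict mirroring the JSON a server returns from tools/list. The tool itself is hypothetical; the annotation field names are the ones the MCP spec defines.

# A minimal sketch of an MCP tool definition with annotations, expressed as a
# Python dict mirroring the JSON shape a server returns from tools/list.
# The tool name and schema are hypothetical; the annotation field names
# (readOnlyHint, destructiveHint, idempotentHint) come from the MCP spec.
delete_record_tool = {
    "name": "delete_record",
    "description": "Permanently delete a record by id.",
    "inputSchema": {
        "type": "object",
        "properties": {"record_id": {"type": "string"}},
        "required": ["record_id"],
    },
    "annotations": {
        "readOnlyHint": False,    # this tool modifies state
        "destructiveHint": True,  # and the modification is not reversible
        "idempotentHint": True,   # deleting twice has no further effect
    },
}
# Nothing stops a server from shipping the same tool with destructiveHint: False.
# The client cannot tell the honest definition from the dishonest one, which is
# the sense in which these are hints rather than guarantees.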

Amazon didn’t rely on annotating its decisions. It built the two-way-door framework into how people are taught to think, and then into how escalation actually works. Managers operate as the enforcement layer — they push back when a two-way door is being over-processed, and they push harder when a one-way door is being under-processed. The enforcement is human judgment applied consistently, not a destructiveHint: true tag on a meeting invite.

Your agent team needs the same enforcement, and it needs to be structural, not advisory. The door classification belongs in the shape of the tool itself. A reversible action and an irreversible action are different tools. Not different parameters on the same tool. Not different values of a boolean flag. Different tools, with different names, different signatures, and different authorization boundaries.

The 70% information rule

One of the specific numbers Bezos put in the 2016 letter was 70%. Most decisions, he said, should be made with somewhere around 70% of the information you wish you had. If you wait until you have 90%, you’re almost always being too slow. And the cost of slow is usually worse than the cost of wrong, because wrong on a two-way door is just a reversal — you walk back, try again, update your model — while slow is compounding opportunity cost that never shows up on anyone’s dashboard.

The cultural weight of this inside Amazon is hard to overstate. It changes how meetings end. It changes how proposals get written. It changes the tolerable error rate on a decision, which is a variable most organizations don’t realize they have. A team operating at 90%-information decision making looks careful and responsible from the outside and is actually bleeding speed in every direction. A team operating at 70% looks aggressive and is actually running at the efficient frontier.

Now apply that to your agent team.

The dominant posture in agentic architecture writing right now is the opposite of the 70% rule. The default is: when in doubt, escalate to a human. When confidence is low, ask for approval. When the action has any risk profile at all, put a human in the loop. The framing has a name — human-in-the-loop — and it’s presented as the mature, responsible, safety-conscious approach.

It isn’t. It’s the 90% posture, reinvented for agents, with all the same costs.

The problem is the same problem Bezos was pointing at. When every decision requires approval, approval stops meaning anything. The reviewer rubber-stamps because the volume is too high to engage with each item. The 4% of decisions that actually needed careful review get the same glance as the 96% that didn’t. The governance theater of “a human is in the loop” produces exactly zero governance value, because the human is drowning in two-way-door traffic and missing the one-way doors that slip past.

The 70% rule applied to agents looks like this: if the action is a two-way door, the agent acts. No approval, no escalation, no human hand-hold. The action goes through, gets logged, and is available for post-hoc review if anyone wants to look. If the agent is wrong, the action gets reversed. The bias for action is structural — it’s encoded in which tools don’t gate and which tools do.

If the action is a one-way door, the posture flips. Full stop. Named human approval, synchronous, with enough context to actually evaluate the decision. Not a yes/no prompt at the end of a long agent chain where the reviewer has lost the thread. A real decision moment, scoped tightly enough that the human can bring 70% judgment to it — which is the right amount, because the reviewer shouldn’t be at 90% either.
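
Here is a minimal sketch of what that dispatch can look like, assuming the door classification is a field on the tool and the approval function is supplied by whatever surface your reviewers actually use. Every name is illustrative; the structural point is that the gate hangs off the tool, not off the agent's judgment.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    one_way: bool                  # door classification, fixed at design time
    run: Callable[..., Any]        # the actual side effect

@dataclass
class Approval:
    approved: bool
    reviewer: str

def execute(tool: Tool, args: dict,
            log: Callable[..., None],
            request_approval: Callable[..., Approval]) -> Any:
    """Two-way doors run immediately and are logged for post-hoc review.
    One-way doors block on a named human approval before the side effect runs."""
    if not tool.one_way:
        result = tool.run(**args)
        log(tool=tool.name, args=args, result=result)       # reviewable later
        return result
    decision = request_approval(tool=tool.name, args=args)   # scoped decision moment
    if not decision.approved:
        log(tool=tool.name, args=args, result="rejected", reviewer=decision.reviewer)
        return None
    result = tool.run(**args)
    log(tool=tool.name, args=args, result=result, reviewer=decision.reviewer)
    return result

The agent never sees the gate as optional, because the approval path lives in the executor rather than in the prompt.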

This is where the current human-in-the-loop literature starts to agree with the Amazon framing, even if it hasn’t named the connection. The recent writing on “calibrated autonomy” is converging on exactly this split — full autonomy for reversible, low-stakes, high-confidence actions; human review for irreversible, high-stakes, or low-confidence ones. The framing is right. The vocabulary is worse than what Amazon has been using for a decade, but the destination is the same.

Where the industry writing still goes wrong is the implementation. It treats the human-in-the-loop decision as orchestration — something the agent framework handles by pausing execution and routing to a reviewer. That’s a control-flow solution to what is actually a tool-surface problem. If your tools correctly represent which door they are, the orchestration layer barely has a job. The gate is structural, not procedural. The human gets invoked because they own a one-way door, not because the framework decided to pause.

There’s a second-order effect of getting this right that’s worth naming. The reviewer’s cognitive load drops dramatically when they only see one-way doors. They have time to think. They’re not in rubber-stamp mode. The approval actually means something because the volume is manageable. This is how the 70% rule produces better decisions, not worse ones — by concentrating human judgment on the decisions that need it and getting out of the way everywhere else.

The industry version of this — having a human rubber-stamp every action just in case — is the failure mode Bezos was warning against in 2016, applied to a workforce that doesn’t even have the satisfaction of being slow in a human way. It’s slow in a machine way, which is somehow worse.

Disagree and commit

Disagree and commit is the phrase inside Amazon that people outside Amazon hear most often and understand least. It’s not “go along to get along.” It’s not “the loudest voice wins.” It’s a specific operational move: once a decision is made, people who disagreed with it are expected to commit to executing it as if they had agreed. Not to undermine it. Not to slow-walk it. Not to wait for it to fail so they can say they were right.

The reason this works, and the reason it matters, is that consensus is expensive and often unavailable. On a hard decision, you can often get to 60% alignment. Getting to 90% alignment requires either dumbing down the decision until everyone likes it, or running an extended debate that costs more than the decision is worth. Disagree and commit is the mechanism for moving forward without paying that cost. The dissenter says their piece, the decision gets made, and then everyone pulls the same direction. The dissenter can still be proven right later — and when they are, the organization learns from it. But the execution doesn’t stall waiting for unanimity that was never going to happen.

Now think about multi-agent orchestration.

The current fashion in multi-agent systems is some version of consensus — agents vote, agents debate, agents negotiate until they reach agreement, and then the system acts. There are papers and frameworks and products built around getting agents to deliberate. The assumption is that agreement produces better outcomes.

It doesn’t, or at least not at the cost the agreement takes. A debate loop between three agents about whether to call the database or the API is not producing insight. It’s producing latency and token burn. The subordinate agent’s “disagreement” is not usually a load-bearing objection — it’s a confidence score below some threshold, or a preference for a different tool path, or a hallucinated concern. Treating that as veto power is how you build an agent system that can’t finish anything.

Disagree and commit applied to agent teams is a specific architectural pattern. The orchestrator makes the plan. Subordinate agents execute the pieces they were assigned, even when their own internal scoring would have preferred a different approach. If a subordinate agent has a real objection — not a preference, an actual concern about correctness or safety — it surfaces that objection through a structured channel, once, and then commits to the plan unless the orchestrator revises it. No infinite negotiation. No consensus loop. No agent politics.

The structured channel matters. This isn’t “shut up and execute.” It’s “raise the concern through the designated path, then execute.” That channel is observable — which is where the Dive Deep principle will show up later — and it gives the system a record of where subordinate agents disagreed and whether they were right. Over time, that record informs which subordinates get more decision authority on which kinds of calls. Which is disagree and commit producing the same organizational learning inside the agent team that it produces inside Amazon.
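
A sketch of what the structured channel could look like, with hypothetical names throughout: the subordinate files its objection once, the orchestrator decides, execution proceeds, and the record gets scored after the outcome is known.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Objection:
    """One structured objection from a subordinate agent, filed before execution."""
    agent: str
    task_id: str
    concern: str                          # what the agent thinks is wrong
    severity: str                         # e.g. "correctness" or "safety", not "preference"
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    agent_was_right: bool | None = None   # filled in after the outcome is known

class ObjectionLog:
    """Append-only record of disagreements, reviewed after the action, not before."""
    def __init__(self):
        self.entries: list[Objection] = []

    def file(self, objection: Objection) -> None:
        # The agent files once. There is no negotiation loop: after this call
        # it executes whatever plan the orchestrator hands back.
        self.entries.append(objection)

    def hit_rate(self, agent: str) -> float:
        """How often this agent's objections turned out to be correct, which is the
        signal that later informs how much decision authority it earns."""
        scored = [o for o in self.entries
                  if o.agent == agent and o.agent_was_right is not None]
        if not scored:
            return 0.0
        return sum(o.agent_was_right for o in scored) / len(scored)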

The contrast with the industry default is sharp. Most multi-agent frameworks today treat disagreement as something to resolve before action. The Amazon move is to treat disagreement as something to record, commit past, and learn from after the action. These are incompatible postures, and which one you pick determines whether your agent team moves at human speed or at machine speed.

One thing worth saying plainly: disagree and commit is not the right posture on a one-way door. When the decision being made is irreversible, consensus is worth paying for, because the cost of being wrong is higher than the cost of the debate. This is consistent across both the human and agent versions. A team at Amazon will spend more time on a one-way door than a two-way one, and so should your agents. The framework scales care appropriately, which is the whole point.

Escalation paths

Escalation at Amazon is not a failure state. It’s a mechanism. When a decision is above the level where it should be made — when the stakes are higher than the current owner’s authority, or when the decision crosses team boundaries in a way the current owner can’t unilaterally resolve — it escalates. The escalation path is known in advance, the escalation criteria are known in advance, and the person receiving the escalation has both the authority and the context to make the call.

What makes this work is that most decisions don’t escalate. The path exists specifically so that it doesn’t have to be used most of the time. A team that’s escalating every decision isn’t operating the escalation mechanism correctly — they’re operating a consensus mechanism wearing the escalation mechanism’s clothes. The whole value is in the contrast. Routine decisions flow; exceptional decisions escalate; the exceptional ones get the attention they deserve because they’re actually exceptional.

Here’s where the agent team story gets interesting, because the industry has been building escalation the wrong way around.

The default pattern in current agentic systems is: everything the agent does is potentially reviewable, and the reviewer decides what to pay attention to. This is backward. It puts the attention-allocation decision on the reviewer, at the moment when they have the least context to make it well. The reviewer sees a queue of agent actions and has to figure out which ones matter. Most of them don’t. The ones that do don’t look meaningfully different in the queue. The reviewer either pays attention to everything and burns out, or pays attention to nothing and misses the important ones.

The Amazon pattern flips this. Which actions escalate is decided in advance, not by the agent in the moment. Escalation criteria are encoded in the tools themselves — which is the same structural point that keeps coming up, because it’s the same problem every time. A two-way-door tool doesn’t escalate. A one-way-door tool always escalates. A tool whose blast radius exceeds a threshold escalates. A call whose confidence score falls below a threshold escalates. The criteria are baked in, the agent has no discretion about whether to escalate, and the reviewer only sees what genuinely belongs in their queue.

This is the delegation chain piece that the agentic identity writing has been working on. The current vocabulary talks about subject, actor, purpose, and policy — the subject is the human, the actor is the agent, the purpose is what the action is trying to accomplish, and the policy decides whether the combination is allowed. That’s a good decomposition. The piece it’s missing is the door classification. A correctly designed delegation chain doesn’t just record who authorized what — it structures the authorization around which doors are one-way and which are two-way, and puts the human in the chain only at the one-way doors.
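
Here is a sketch of how those two pieces fit together: the door classification and escalation criteria live on the tool, the subject/actor/purpose decomposition lives on the delegation, and the authorization decision reads both. The names and thresholds are placeholders, not recommendations.

from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    one_way: bool              # door classification, set by the tool author
    blast_radius: int          # rough count of records/recipients the call can touch
    # Escalation criteria live here, on the tool, not in the agent's judgment.

@dataclass
class Delegation:
    subject: str               # the human on whose behalf the agent acts
    actor: str                 # the agent actually making the call
    purpose: str               # what this action is trying to accomplish

def must_escalate(tool: ToolSpec, confidence: float,
                  blast_radius_limit: int = 100,
                  confidence_floor: float = 0.7) -> bool:
    """Escalation is decided by criteria baked into the tool and the call,
    never left to the agent's discretion."""
    if tool.one_way:
        return True                                 # one-way doors always escalate
    if tool.blast_radius > blast_radius_limit:
        return True                                 # too many things touched at once
    if confidence < confidence_floor:
        return True                                 # agent is not sure enough to act alone
    return False

def authorize(delegation: Delegation, tool: ToolSpec, confidence: float) -> str:
    """Returns where the call goes: straight through, or to the human in the chain."""
    if must_escalate(tool, confidence):
        return f"escalate to {delegation.subject}"  # human sits at the one-way doors
    return "execute"                                # the loop runs without intervention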

This is also where “human-in-the-loop” starts to become a useful phrase again, if you scope it correctly. Humans don’t belong in the loop — they belong at the escalation points. The loop is the routine operation of the agent team, which should run without human intervention. The escalation points are where the loop hands off to humans for a decision that actually requires human judgment. Those are two different architectural positions, and the current vocabulary blurs them into one.

The monitoring implication follows directly. If escalation is structural, monitoring doesn’t need to catch every agent action — it needs to catch the escalations that should have happened but didn’t, and confirm that the actions that flowed through without escalation really were two-way doors. The monitoring surface is much smaller, which means it can be much more careful. The reviewer actually reads the escalations instead of glancing at them.
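
The audit that falls out of this is small enough to sketch. Assuming every executed action is logged with its door classification and whether it escalated (field names here are illustrative), the monitor only has two questions to ask of each record.

from dataclasses import dataclass

@dataclass
class ActionRecord:
    tool: str
    one_way: bool            # door classification recorded at call time
    escalated: bool          # did the call go through an approval point?
    approved_by: str | None  # who approved it, if anyone

def audit(records: list[ActionRecord]) -> list[str]:
    """Flag the two cases the monitoring surface actually needs to catch:
    one-way doors that ran without escalating, and escalations nobody approved."""
    findings = []
    for r in records:
        if r.one_way and not r.escalated:
            findings.append(f"{r.tool}: one-way door executed without escalation")
        if r.escalated and r.approved_by is None:
            findings.append(f"{r.tool}: escalated but executed without an approver")
    return findings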

The Amazon version of this, running on humans, has a name: it’s called being a manager. A good manager doesn’t review every email their reports send. They review the ones that got flagged, the ones that crossed a threshold, the ones a report asked for review on. The rest flow. Your agent team needs the same kind of manager — which is what the human-in-the-loop position actually is, when it’s scoped correctly. Not a reviewer of everything. A manager of exceptions.

Expanding trust with demonstrated reliability

Amazon has a specific stance on trust. It isn’t given in advance. It isn’t granted by title. It’s earned by demonstrated reliability over time, and it expands the scope of decisions someone is expected to make without escalation. A new hire’s authority envelope is small. A senior engineer’s is much larger. The difference between them isn’t mostly the title — it’s the accumulated track record that makes it appropriate to let the senior engineer make calls that the new hire would escalate.

This is how Earn Trust, as a Leadership Principle, actually operates inside the company. It sounds like a soft interpersonal value from the outside. From the inside, it’s an authorization model. You demonstrate reliability, your decision authority expands. You fail at something, your authority contracts in the relevant area until you’ve rebuilt the track record. The mechanism is granular — you can be very trusted in one domain and still earning trust in another — and it’s based on observable behavior, not on how long you’ve been in the chair.

Your agent team needs exactly this model, and almost nobody is building it that way.

The current default for agent authorization is the one-time grant. You set up the MCP server, you configure the scopes, you give the agent access to a list of tools, and those scopes remain until someone manually changes them. The agent at day one has the same authority as the agent at day 90. Reliability doesn’t factor in. The agent could have been wrong a hundred times on a particular action class and its permission to keep taking that action is unchanged.

This is not how any organization actually manages its humans. It’s not how Amazon manages its humans. It shouldn’t be how your agent team manages its agents.

The right model is a progressively expanding permission envelope, granular by action class, based on observed reliability. A new agent — or an agent performing a new kind of task — starts with a tight envelope. Most actions escalate for review. The human reviewing the escalations produces a signal every time: this was the right call, this was wrong. That signal accumulates. When the agent’s observed reliability on an action class crosses a threshold, the envelope expands — fewer escalations, more autonomous execution. The reviewer stops seeing routine cases and starts seeing only the edges.

The inverse has to be true too. When an agent’s reliability on an action class drops — when it starts getting things wrong at a rate that exceeds the tolerable threshold for that action class — the envelope contracts. More escalations, more supervision, until the reliability rebuilds or the agent is pulled off that task class entirely.
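
A sketch of the envelope mechanics, under the assumption that every reviewed escalation (and every post-hoc spot check) produces a right/wrong signal. The thresholds and window size are illustrative; the shape is what matters: reliability tracked per agent, per action class, read by the authorization layer, expanding and contracting automatically.

from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TrustEnvelope:
    """Per-agent, per-action-class reliability scoring that widens or narrows
    the set of actions an agent may take without escalation."""
    expand_at: float = 0.95      # observed reliability needed to stop escalating
    contract_at: float = 0.80    # reliability below this re-enables escalation
    window: int = 50             # how many recent reviewed outcomes to consider
    history: dict[tuple[str, str], list[bool]] = field(
        default_factory=lambda: defaultdict(list))
    autonomous: set[tuple[str, str]] = field(default_factory=set)

    def record_review(self, agent: str, action_class: str, was_right: bool) -> None:
        """Every human review is a signal; the envelope reacts when thresholds cross."""
        key = (agent, action_class)
        self.history[key] = (self.history[key] + [was_right])[-self.window:]
        rate = sum(self.history[key]) / len(self.history[key])
        if rate >= self.expand_at and len(self.history[key]) == self.window:
            self.autonomous.add(key)        # envelope expands: fewer escalations
        elif rate < self.contract_at:
            self.autonomous.discard(key)    # envelope contracts: back under supervision

    def requires_escalation(self, agent: str, action_class: str) -> bool:
        return (agent, action_class) not in self.autonomous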

This is the piece that’s furthest from current industry practice. The agentic identity writing from the last few months is starting to touch it — the “just-in-time provisioning” framing with TTL and purpose-scoped credentials is in the same neighborhood — but it’s mostly about rotating identities and scoping sessions, not about the dynamic expansion and contraction of an agent’s decision authority based on its track record. The dynamic piece is what makes it analogous to how humans earn trust. Without it, you have static permissioning with better session management.

Building this is not easy, which is probably why nobody has. It requires the reviewer signal to be structured enough to aggregate. It requires per-agent, per-action-class reliability scoring. It requires the authorization layer to read from that scoring in real time. And it requires the infrastructure to react when scores cross thresholds in either direction. That’s a real engineering effort. But it’s the right shape, and the Amazon version of it — running on humans — has been in production for two decades. The pattern is known. The work is translating it.

One note on why this matters more as the team scales. With ten agents, static permissioning is fine — a human can manually adjust things when they misbehave. With a thousand agents, static permissioning is a production risk. The agents will drift in their reliability at rates no human can track manually. The trust envelope has to expand and contract automatically or it won’t happen at all. The Earn Trust principle, built into the authorization layer, is how you operate an agent team at scale without the operators losing the plot.

Dive deep and observability

Dive Deep is one of the stranger Leadership Principles to encounter from the outside, because it sounds like a personal work style rather than an organizational practice. The wording is about leaders staying connected to the details, auditing frequently, being skeptical when metrics and anecdotes diverge. Read literally, it’s a virtue. Read as an operating practice, it’s a specific claim about where the useful information lives in an organization: at the level of the raw data, two layers below where a leader would naturally be looking.

Amazon’s version of this is concrete. Leaders are expected to read the actual customer complaints, not just the summary. Read the raw log lines, not just the dashboard. Sit in on the incident, not just get briefed on the postmortem. The reason is that abstractions lose information, and the information they lose is disproportionately the information that mattered. The weird cases, the edges, the tells that something’s off before it’s off enough to show up in the aggregate — all of that lives at the raw level and gets smoothed out by the time it reaches a summary.

Agentic systems make this ten times more important, and most operator tooling is built as if it were ten times less.

The default interface for operating an agent team is a dashboard. Actions taken, success rate, latency percentiles, cost per task, maybe a leaderboard of which tools got called the most. These are fine things to have. They are also almost useless for the question an operator actually needs to answer, which is: is my agent team doing what I think it’s doing, and when it’s not, why not.

That question cannot be answered at dashboard resolution. It can only be answered by reading traces. Not sampled traces, not summary traces — the actual sequence of tool calls, tool responses, reasoning tokens, and decisions the agent made on a specific task. That’s where the failure modes live. The agent that’s confidently wrong about an entity resolution. The agent that’s using the right tool for the wrong reason and getting lucky. The agent that’s making a decision the operator didn’t know was in scope. None of these show up in aggregate. All of them show up in traces.

The Dive Deep posture, translated to agent operations, is: the operator reads traces. Not every trace — that’s impossible at any scale that matters — but a meaningful sample, including the ones the system flagged, the ones that were close to a threshold, and a random sample just to keep the operator’s sense of the baseline honest. The operator spends real time in the raw data, not just in summary views.

This has an architectural implication that’s not obvious from the dashboard-centric framing. It means traces have to be readable. That sounds trivial and isn’t. Most agent traces today are either buried inside a framework’s internal logging, flattened into a format that strips the reasoning structure, or captured at a level of detail that’s either too shallow to be useful or too deep to be scannable. The Dive Deep principle, applied to agent infrastructure, is a requirement on the trace format: it has to be organized so that a human who cares can actually read through it and understand what happened.
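
One way to picture it, with every field name a placeholder rather than a standard: a trace entry that keeps the reasoning next to the call it produced, and a sampler that assembles the operator's reading list from the flagged traces, the near-misses, and a random slice of everything else.

import random
from dataclasses import dataclass

@dataclass
class TraceStep:
    tool: str
    args_summary: str        # short and human-readable; the full payload lives elsewhere
    result_summary: str
    reasoning_excerpt: str   # why the agent chose this call, kept with the call

@dataclass
class Trace:
    task_id: str
    agent: str
    steps: list[TraceStep]
    flagged: bool            # tripped an escalation or an audit finding
    near_threshold: bool     # came close to a gate without crossing it

def weekly_reading_list(traces: list[Trace], random_sample: int = 10) -> list[Trace]:
    """Everything flagged, everything near a threshold, plus a random slice
    to keep the operator's sense of the baseline honest."""
    flagged = [t for t in traces if t.flagged]
    near = [t for t in traces if t.near_threshold and not t.flagged]
    rest = [t for t in traces if not t.flagged and not t.near_threshold]
    return flagged + near + random.sample(rest, min(random_sample, len(rest)))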

The current industry default is observability tooling that assumes the operator is looking at dashboards. The traces are there, technically, but accessing them takes real effort, they’re formatted for machines rather than for humans, and nothing in the workflow encourages the operator to actually read them. The operator ends up at dashboard resolution whether they wanted to be or not, which is exactly the Day 2 posture Bezos was warning against.

The fix is not more dashboards. It’s better traces, presented in a format a human can reason about, with a workflow that nudges the operator toward reading them regularly. An operator who dives deep into their agent team’s traces every week will catch things no dashboard will ever show. An operator who relies on the dashboard will discover the interesting problems six months late, from a customer complaint, the same way Day 2 companies always do.

Lab: the Outlook example, walked through

Take Outlook as a concrete example. Most teams I’ve seen wire up an Outlook MCP server end up with something close to this:

send_email(to, subject, body, draft: bool = false)
delete_message(id)
move_message(id, folder)
update_message(id, ...)

This is a clean mapping of the underlying Graph API. It’s also a CRUD surface that has thrown out the door classification entirely. send_email does two completely different things depending on the value of a boolean flag. From the agent’s point of view, a call with draft=true and a call with draft=false look like the same call — same tool, same shape, same description block — and the difference between “save a draft I can delete later” and “deliver a message to a human’s inbox” is one boolean.

The agent instructions to compensate look like this:

Always set draft=true when sending email unless the user has explicitly approved sending. If the user has not explicitly approved, ask before changing draft to false.

This works until it doesn’t. Most of the time the agent will follow it. Some non-zero percentage of the time — under instruction-following pressure, when the prompt drifts, when context is long — the agent will hit draft=false because it pattern-matched on the wrong example, or because the instruction tokens dropped out of attention. The boolean is an attractive nuisance. There’s no structural reason for it to be safe.

The split version looks like this:

list_messages(folder)
read_message(id)
create_draft(to, subject, body) -> draft_id
update_draft(draft_id, ...)
send_draft(draft_id)
delete_message(id)

Four of these are two-way doors. send_draft and delete_message are one-way. The authorization for the two-way tools is open — the agent calls them as needed, no escalation. The authorization for send_draft and delete_message routes through a human approval point, structurally, before the tool actually executes.

The agent instructions for the split surface look like this:

Use the available tools to draft, review, and prepare email. The send_draft and delete_message tools require human approval; describe what you’re about to send or delete and let the approval flow run.

Shorter. More honest. The instructions are no longer carrying load that the tool surface should be carrying. The agent can still pattern-match its way to the wrong tool — but the wrong tool in the split surface is structurally gated, so the worst case is a human fielding an approval prompt they didn’t need to see. That’s the right failure mode. The CRUD-surface failure mode is the agent silently sending an email it wasn’t supposed to.
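
For completeness, here is a sketch of the split surface as plain Python handlers. The graph client and the request_approval function are stand-ins, not the real Graph SDK or any particular approval product; the point is that the gate is inside the one-way tools, so there is no code path that sends or deletes without passing through it.

# Split Outlook surface as plain Python handlers. `graph` and `request_approval`
# are hypothetical stand-ins supplied by the MCP server's runtime, not real SDK calls.

def list_messages(graph, folder: str) -> list[dict]:
    return graph.list_messages(folder)                          # two-way: read only

def read_message(graph, message_id: str) -> dict:
    return graph.get_message(message_id)                        # two-way: read only

def create_draft(graph, to: str, subject: str, body: str) -> str:
    return graph.create_draft(to=to, subject=subject, body=body)  # two-way: deletable

def update_draft(graph, draft_id: str, **fields) -> None:
    graph.update_draft(draft_id, **fields)                      # two-way: still just a draft

def send_draft(graph, request_approval, draft_id: str) -> str:
    # One-way door. The approval is part of the tool, not a suggestion in the prompt.
    decision = request_approval(action="send_draft", draft=graph.get_message(draft_id))
    if not decision.approved:
        return "not sent: rejected by reviewer"
    graph.send_draft(draft_id)
    return "sent"

def delete_message(graph, request_approval, message_id: str) -> str:
    # One-way door, gated the same way.
    decision = request_approval(action="delete_message", message=graph.get_message(message_id))
    if not decision.approved:
        return "not deleted: rejected by reviewer"
    graph.delete_message(message_id)
    return "deleted"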

This generalizes. Whenever you’re writing an MCP server against a CRUD-shaped API, you have two moves:

Rewrite — when you own the underlying API, separate the irreversible operations into their own surface. Don’t pass draft: bool; have create_draft and send_draft as different tools with different authorization.

Wrap — when you don’t own the underlying API (Outlook’s Graph API is what it is), build the split at the MCP layer. The wrapper knows that one Graph call is a draft and another is a send, and exposes them as different tools with different authorization characteristics, even though the underlying API doesn’t care.

The decision flow generalizes to a small chart you can keep next to you:

flowchart TD
    A[Operation in underlying API] --> B{Reversible by the caller, cheaply?}
    B -->|yes| C[Expose directly<br/>two-way door<br/>no gate]
    B -->|no| D{Do you own the API?}
    D -->|yes| E[Rewrite: split into<br/>reversible prep +<br/>irreversible commit]
    D -->|no| F[Wrap: synthesize prep<br/>step at the MCP layer<br/>gate the commit]
    E --> G[Human approval point<br/>on the commit tool,<br/>not in the prompt]
    F --> G

The wrap version takes more work. It’s also the version that scales, because most of the APIs your agents will touch were designed for human users and treat the door classification as an implementation detail. Your MCP layer is where the classification gets put back in.

Closing

The framework Amazon teaches isn’t agent-specific. It was built for humans two decades ago, and it’s been refined under operating pressure ever since. Every piece of it — the door classification, the 70% rule, disagree and commit, escalation paths, earned trust, dive deep — translates onto agent teams with the same logic that made it work on humans. The translation is not metaphorical. The mechanics are the same. What changes is where the framework lives. For humans, it lives in culture and judgment. For agents, it has to live in the tool surface, the authorization layer, the escalation criteria, and the trace format. The agent doesn’t carry the framework in its head. The infrastructure has to carry it instead.

The current industry posture treats agent governance as an orchestration problem and an approval problem. It’s neither. It’s a structural problem, and the structures that solve it have been operating at scale on humans for twenty years. The opportunity isn’t to invent new governance. It’s to translate governance that already works.