TL;DR
- Tag every incoming AI PR with a blast radius tier (T0/T1/T2/T3) before reading any diff. A 30-second decision based on paths plus CODEOWNERS, not line count.
- Allocate your read budget by tier: T0 gets the full six-item checklist from Part 1, T1 gets a trimmed checklist, T2 gets a behavioral spot-check, T3 gates on CI green plus a sample.
- When three AI PRs land in the same hour, triage takes under two minutes and the dangerous one still gets a full-quality review.
📊 Result proof. Tried on a real PR queue (private repo, May 2026, ~14 AI PRs/week): triage averaged 41 seconds per PR; total review time dropped from “couldn’t finish” to ~70 minutes/day with zero T0 misses over six weeks. Sample size is small. Treat the numbers as a baseline, not a benchmark.
Part 1 gave you a six-item checklist that works when you can read a whole PR end-to-end. The catch is the one I left you with: three AI PRs landed in the same hour, and you can’t read all of them line-by-line. The checklist isn’t wrong. Your read budget is finite, and the queue isn’t.
This tutorial gives you the triage layer that sits in front of Part 1. You’ll learn to assign a blast radius tier to each AI PR in roughly 30 seconds without opening the Files Changed tab, allocate read time per tier, and run a behavioral spot-check on the low tiers instead of reading diffs you can’t afford to read. By the end you’ll have a workflow that survives a four-PR hour.
Prerequisites
- You’ve read Part 1 and the six-item checklist is muscle memory.
- You’re a senior dev or tech lead with approve/reject authority, not a first-reviewer.
- Your repo has at least one ownership signal: CODEOWNERS, path-based labels, or service ownership in a manifest. Triage uses this; it does not build it.
Three AI PRs in the same hour: why the Part 1 checklist breaks
The checklist doesn’t break because it’s wrong. It breaks because it assumes you have time to read a whole PR, and when input triples, your read budget does not. I’ve watched senior devs react three ways, and all three are wrong: read serially and merge the last one in a hurry, read in parallel and lose context between them, or defer all three and grow a queue you’ll never catch up on.
Define the term once: read budget is the maximum attention you can spend reading carefully in one review shift. It’s wall time times focus, not lines per minute. Two hours of meetings drops it to zero even if your calendar says “free.”
Key insight: Read budget is finite and shift-bound. Treat it like an SRE error budget, not like a queue depth.
| Wrong reaction | What it costs you | How you spot yourself doing it |
|---|---|---|
| Read serially, merge the third in a hurry | The last PR gets the least attention, and it is often the dangerous one | You’re skimming the diff at minute 45 |
| Read all three in parallel | Context bleed: you approve PR A’s pattern in PR B | You can’t remember which PR raised a question |
| Defer everything until tomorrow | T0 stays unreviewed; trust decays in the team | Your queue is six PRs deep on Friday |
Triage isn’t a productivity hack. It’s how you stop spending the same budget on a docs PR and a Terraform PR.
Triage AI pull requests by blast radius: what T0/T1/T2/T3 mean
A blast radius tier classifies a PR by how much damage it can do after it merges, not by how clever the diff is. T0 is highest impact, T3 is lowest. Size doesn’t enter the model. A three-line IAM policy change is T0; a 300-line test file is usually T3.
Key insight: Blast radius is “what breaks if this is wrong in production,” not “how hard was this to write.”
| Tier | What it covers | Example | Decision authority |
|---|---|---|---|
| T0 | Infra, security, data integrity | IAM policy, Dockerfile, schema migration, auth middleware | Senior + second senior, or tech lead |
| T1 | Core business logic | Refactor of the checkout service, pricing rules, payment retry | Senior, solo OK |
| T2 | Isolated feature behind a flag or boundary | New endpoint gated by feature flag, new admin page | Senior, solo OK |
| T3 | Docs, tests against existing coverage, fixtures | README updates, new unit tests on covered code, formatting | Any reviewer |
Two cautions. First, T0 is the only tier I won’t approve alone if I can avoid it; the failure mode is too cheap for the agent to produce and too expensive to recover from. Second, “behind a feature flag” only counts as T2 if the flag has been used to roll back before. A flag nobody has flipped under load is theatre.
Tier an AI PR in 30 seconds: do this before opening the diff
What. Open the PR page. Don’t click Files Changed. Look at three things: the paths touched, the CODEOWNERS match, and the labels the agent already self-applied.
Why. Coding agents (Claude Code, Cursor, Codex) routinely mis-label scope. They see a .md file in the diff and tag the PR “docs-only” while the same PR also touches a migration script. Checking the paths yourself takes seconds and is safer than trusting the agent’s label.
```bash
# Production: run against the real PR queue. Read-only.
gh pr view 4821 --json files,labels,additions,deletions \
  | jq '{
      paths: [.files[].path],
      ai_labels: (.labels | map(.name)),
      size: (.additions + .deletions)
    }'
```

Expected output (real PR, lightly redacted):

```json
{
  "paths": [
    "services/checkout/pricing.ts",
    "services/checkout/__tests__/pricing.test.ts",
    "infra/terraform/iam_checkout.tf"
  ],
  "ai_labels": ["enhancement", "tests"],
  "size": 287
}
```

Look at the paths. infra/terraform/iam_checkout.tf is in the diff. The agent labeled this “tests” and “enhancement.” It’s T0. Assign the tier, add a tier/T0 label, move on.
Verify. You wrote down T0/T1/T2/T3 (label, comment, or a sticky note) within 30 seconds of opening the PR, and you haven’t read a line of the diff yet.
A failure mode I’ve lived through: I tiered a PR T3 because the agent labeled it “test only.” The “tests” touched a shared fixture file that a pending migration depended on. It should have been T1. Now I treat agent-applied scope labels as untrusted input. Read the paths.
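Once you know which directories map to which tier, the path check can be written down as a script. What follows is a minimal sketch, not tooling from this series: the glob patterns and the function names `tier_for_path` and `tier_for_pr` are assumptions for illustration, so replace them with your own repo’s layout. The PR takes the highest-impact tier of any single path it touches.

```bash
# Sketch: map changed paths to a blast radius tier (0 = T0 ... 3 = T3).
# Every glob below is hypothetical; adapt it to your repo layout.
tier_for_path() {
  case "$1" in
    # T0: infra, security, data integrity
    infra/*|migrations/*|*/migrations/*|auth/*|*/auth/*|Dockerfile|*/Dockerfile|*.tf) echo 0 ;;
    # T3: docs and tests. Checked before T1 so a test file inside a core
    # service does not inherit the service's tier. Shared fixtures are
    # deliberately not listed here.
    *.md|docs/*|*/__tests__/*|*.test.ts) echo 3 ;;
    # T1: core business logic
    services/checkout/*|services/payments/*) echo 1 ;;
    # T2: everything else (assumed to sit behind a flag or boundary)
    *) echo 2 ;;
  esac
}

# The PR's tier is the lowest tier number found across all its paths.
tier_for_pr() {
  local worst=3 t path
  for path in "$@"; do
    t=$(tier_for_path "$path")
    if [ "$t" -lt "$worst" ]; then worst=$t; fi
  done
  echo "T$worst"
}

tier_for_pr \
  services/checkout/pricing.ts \
  services/checkout/__tests__/pricing.test.ts \
  infra/terraform/iam_checkout.tf
# prints: T0
```

The agent’s labels never enter the function, which is the point: only paths decide the tier. The ordering is a judgment call; a README under infra/ lands in T0 here, which errs on the cautious side.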
Read budget: how to allocate time across tiers
Read budget is allocated inversely to tier number. T0 gets full attention, T3 gets the least. The numbers below are the baseline I run; tune them to your team’s velocity and incident history.
| Tier | Treatment | Wall time per PR | Trade-off you’re accepting |
|---|---|---|---|
| T0 | Full Part 1 checklist, line-by-line, paired review when possible | 20–30 min | Slow. You will not merge two T0s in an hour. |
| T1 | Part 1 checklist, allowed to skip item 6 (commit message) if CI is green | 10–15 min | You miss commit-message lies; you keep checklist coverage on the dangerous items. |
| T2 | Behavioral spot-check (next section) instead of full checklist | ~5 min | You miss bugs only reachable through diff reading, not through behavior. |
| T3 | Sample one or two spots + gate on CI green; reject on red | ~2 min | You will miss trivial bugs. You traded that for finishing T0 properly. |
You can argue with the times. You can’t argue with the trade-off shape: you don’t escape risk by reading every line of every PR, because your read budget runs out before the queue does. You only get to choose where the risk lives.
Key insight: Don’t allocate read budget by diff size. Size is a poor proxy for blast radius. A 3-line IAM change can wreck a region; a 300-line new test file usually can’t.
A quick aside on T1: if your team’s checklist item 6 (commit-message lies) has caught real bugs, don’t skip it on T1. The rule is “earn your skips from postmortems,” not “skip what feels safe.”
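To see whether a queue fits into a shift, you can sum the per-tier times from the table directly. The sketch below is an illustration only: `queue_minutes` is a hypothetical helper, and it hard-codes the table’s baseline figures (20 to 30, 10 to 15, about 5, and about 2 minutes), which you should replace with your own.

```bash
# Sketch: sum the table's per-tier wall times for a queue.
# Usage: queue_minutes <T0 count> <T1 count> <T2 count> <T3 count>
# Prints "<low> <high>" in minutes.
queue_minutes() {
  local t0=$1 t1=$2 t2=$3 t3=$4
  local low=$((  t0 * 20 + t1 * 10 + t2 * 5 + t3 * 2 ))
  local high=$(( t0 * 30 + t1 * 15 + t2 * 5 + t3 * 2 ))
  echo "$low $high"
}

# Example queue: 1 T0, 2 T1, 3 T2, 2 T3.
queue_minutes 1 2 3 2
# prints: 59 79
```

If the high end exceeds the focused time you actually have, cut from the T3 end first. Trimming T0 defeats the purpose of tiering.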
Behavioral spot-check for T2 and T3: what to do instead of reading the diff
A behavioral spot-check is the protocol I run instead of reading every line on T2 and T3. The definition (behavioral diff) was introduced in Part 1; here I’ll show the operational protocol.
Four steps. Run them in order. A red test run in Step 1 rejects the PR outright; any later failure escalates the PR to the next tier up and the full checklist.
Step 1 — Pull the branch and run tests locally.
```bash
# Local sandbox. Don't run on a production-connected machine.
gh pr checkout 4821
make test   # or `pnpm test`, `cargo test`, whichever your repo uses
```

Expected output: green. If red, the PR is rejected; you don’t owe it a behavioral diff.
Step 2 — Behavioral diff: compare one important code path before and after.
```bash
# Same sandbox. Capture behavior, switch to main, capture again, compare.
# Assumption: the PR build is already serving on localhost:3000, and
# `make run-bg` restarts the server from the current checkout.
# URLs are quoted so the shell does not treat `?` as a glob character.
curl -s 'localhost:3000/api/pricing?sku=ABC' > /tmp/after.json
git stash
git checkout main && make build && make run-bg
curl -s 'localhost:3000/api/pricing?sku=ABC' > /tmp/before.json
git checkout - && git stash pop
diff /tmp/before.json /tmp/after.json
```

Expected output for a “refactor only” PR: empty diff. Anything non-empty deserves a second look, even if the agent’s PR description claims no behavior change.
Step 3 — Spot-check one place in the diff.
Pick one file at random (not the file the agent’s description highlights; the one it doesn’t). Apply Part 1 item 2 only: “API used incorrectly but plausibly.” Five minutes maximum.
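Picking “the file the description doesn’t highlight” is easy to bias by eye, so it can help to let the shell choose. This is a minimal sketch under stated assumptions: `pick_spot_check` is a hypothetical helper, the PR description is passed in as plain text, and a file counts as highlighted if its basename appears anywhere in that text.

```bash
# Sketch: choose one changed file at random, skipping files the PR
# description mentions by basename.
# Usage: pick_spot_check "<description text>" <path>...
pick_spot_check() {
  local description=$1; shift
  local candidates=() path
  for path in "$@"; do
    case "$description" in
      *"${path##*/}"*) ;;               # highlighted by the agent: skip
      *) candidates+=("$path") ;;       # not mentioned: keep as a candidate
    esac
  done
  # If the description names every file, fall back to the full list.
  if [ "${#candidates[@]}" -eq 0 ]; then candidates=("$@"); fi
  [ "${#candidates[@]}" -gt 0 ] || return 1
  echo "${candidates[RANDOM % ${#candidates[@]}]}"
}

pick_spot_check "Refactors pricing.ts and updates pricing.test.ts" \
  services/checkout/pricing.ts \
  services/checkout/__tests__/pricing.test.ts \
  infra/terraform/iam_checkout.tf
# prints: infra/terraform/iam_checkout.tf (the only file the description omits)
```

The substring match is crude: a description that mentions data.ts also hides a file named a.ts. For a five-minute spot-check, that imprecision is usually acceptable.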
Step 4 — Gate.
Approve only if all three hold: CI green, behavioral diff matches the PR’s stated intent, and the one spot-checked spot passes item 2. Any failure escalates the tier and you read the full checklist.
Verify. You spent under 10 minutes per T2 PR and you can name the one behavior you confirmed and the one spot you checked.
Key insight: Behavioral diff covers the code paths your tests cover. It does not cover error paths without tests. You accepted that trade-off when you tiered the PR T2.
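Step 4 is a plain three-way AND, which makes it easy to write down so nobody approves on two out of three. A minimal sketch, assuming each check has already been reduced to `pass` or `fail` by hand; `spot_check_gate` and `next_tier_up` are hypothetical names, not part of any tool.

```bash
# Sketch: the Step 4 gate. Every argument is "pass" or "fail".
# Usage: spot_check_gate <ci> <behavioral diff matches intent> <spot check>
spot_check_gate() {
  local check
  for check in "$1" "$2" "$3"; do
    if [ "$check" != "pass" ]; then
      echo "escalate"
      return 1
    fi
  done
  echo "approve"
}

# One tier up means one number down; T0 has nowhere higher to go.
next_tier_up() {
  case "$1" in
    T3) echo T2 ;;
    T2) echo T1 ;;
    *)  echo T0 ;;
  esac
}

spot_check_gate pass pass fail || next_tier_up T2
# prints: escalate, then T1
```

A missing argument counts as a failure, so the gate fails closed when you forget to record one of the three checks.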
Scaling triage when N AI PRs land per hour
Triage cost grows linearly with queue size because each tiering step is bounded at roughly 30 seconds. The bottleneck is T0, which still demands serial line-by-line review. When the queue grows from three to eight per hour, you don’t speed up review; you defer more of T3.
Concrete scenario: eight AI PRs land between 10:00 and 11:00. Triage takes about two minutes total. The distribution from my own logs (six weeks, ~80 PRs) tends to look like: 1 T0, 2 T1, 3 T2, 2 T3. That’s not a universal claim; it’s an observed shape, and your distribution depends on what your agents are tasked with.
Schedule:
```text
10:02  Triage all 8. T0 labeled, T1/T2/T3 queued.
10:02  Read T0 in full. (~25 min)
10:30  Batch T1 (2 PRs). (~25 min)
14:00  Batch T2 (3 PRs) with behavioral spot-checks. (~15 min)
17:00  T3 cleanup (2 PRs). (~5 min)
```

Two cautions worth saying loud. First, if a T0 PR depends on a T3 PR (rare but real: a Terraform PR can depend on a docs-as-data file), unblock the T3 first; defer order is not always tier order. Second, deferred T3 has a cap: 24 hours. Past that, the backlog erodes team trust in the review process, which is worse than the bugs you’d catch by reading the diff.
Key insight: Triage scales because it’s bounded per PR. The bottleneck moves to serial T0 review. Hire (or train) accordingly.
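The 24-hour cap on deferred T3 PRs is the one rule in this section that is pure arithmetic, so it is a natural candidate for a reminder script. A sketch under assumptions: `defer_expired` is a hypothetical helper that takes Unix timestamps in seconds, and how you obtain a PR’s open time is left to your own tooling.

```bash
# Sketch: has a deferred T3 PR passed the 24-hour cap?
# Usage: defer_expired <opened_epoch_seconds> <now_epoch_seconds>
# Exit status 0 means "past the cap, review it now".
DEFER_CAP_SECONDS=$(( 24 * 60 * 60 ))

defer_expired() {
  local opened=$1 now=$2
  [ $(( now - opened )) -gt "$DEFER_CAP_SECONDS" ]
}

opened=1700000000                        # arbitrary example timestamp
if defer_expired "$opened" $(( opened + 25 * 3600 )); then
  echo "T3 past cap: review now"
else
  echo "T3 within cap"
fi
# prints: T3 past cap: review now
```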
FAQ
Q: What if I tiered a PR wrong?
A: Re-tier when the spot-check reveals broader scope. Always move up, never down. Treat tiering as a one-way ratchet; the cost of accidentally lowering a tier is a missed T0, which is exactly what triage is built to prevent.
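The one-way ratchet is small enough to state as code. This is an illustrative sketch, with `retier` as a hypothetical helper: given the current tier and a newly proposed one, it keeps whichever has the higher impact, so a re-tier can never lower a PR.

```bash
# Sketch: re-tiering as a one-way ratchet. Lower number = higher impact.
# Usage: retier <current tier> <proposed tier>   e.g. retier T2 T0
retier() {
  local current=${1#T} proposed=${2#T}
  if [ "$proposed" -lt "$current" ]; then
    echo "T$proposed"      # moving up is allowed
  else
    echo "T$current"       # moving down is ignored
  fi
}

retier T3 T1   # prints: T1
retier T1 T3   # prints: T1
```

Because the function is symmetric in effect, it also settles a disagreement between two reviewers: pass both proposed tiers and take the result.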
Q: Two senior devs tier the same PR differently. How do we resolve it?
A: Default to the higher tier. Argue async; don’t block review. After five disagreements of the same shape, write the tier criteria into a team doc and call it done.
Q: Is a behavioral spot-check really enough for T2?
A: It’s enough for the code paths your tests cover. It is not enough for error paths without tests. That’s the trade-off you accepted when you assigned T2. If the PR materially changes error handling, escalate to T1 even if the diff looks small.
Q: My repo has no CODEOWNERS. How do I tier?
A: Use path-based heuristics. A regex match on infra/, migrations/, auth/, or payments/ is good enough for T0/T1 detection. Building CODEOWNERS is its own tech-debt ticket; it does not block triage.
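As a starting point, the regex heuristic can be a single `grep -E` over the changed paths. A minimal sketch: the four directory names come from the answer above, while the function name `high_risk_paths` and the choice to match them as whole path segments at any depth are assumptions.

```bash
# Sketch: flag paths that touch a high-risk directory at any depth.
# Reads one path per line on stdin; prints the matches.
# Exit status is grep's: 0 if anything matched, 1 if nothing did.
high_risk_paths() {
  grep -E '(^|/)(infra|migrations|auth|payments)/'
}

printf '%s\n' \
  docs/setup.md \
  services/api/auth/session.ts \
  infra/terraform/iam_checkout.tf \
  | high_risk_paths
# prints:
# services/api/auth/session.ts
# infra/terraform/iam_checkout.tf
```

Treat any match as a reason to start at T0 or T1 and narrow from there. Against a live PR, `gh pr view 4821 --json files --jq '.files[].path'` should produce the one-path-per-line input this function expects.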
Q: Should I let the AI agent self-tier its PRs?
A: Not in 2026. The agents I’ve tested mis-label scope often enough that their tier becomes noise. A senior dev tiers manually. Part 3 will automate this with CI rules (path globs and ownership lookups), not with an LLM judging itself.
What you have now — and what Part 3 picks up
You’ve got a triage protocol that runs before you read any diff: a 30-second tier assignment per PR, a read budget allocated by tier, a behavioral spot-check that replaces line-by-line reading on T2 and T3, and a defer policy that holds when the queue spikes. The model shift is the one to keep: blast radius is decided by ownership and impact, not by diff size.
Three things that can still go wrong, with fixes:
- You tiered a PR wrong because you trusted the agent’s label. Re-tier on spot-check; treat agent labels as untrusted input.
- T3 deferred too long and the team noticed. Cap defer at 24 hours; batch T3 at end of day.
- The team disagrees on tier criteria. Default high, then write the criteria down after five disputes.
Triage holds the line while you’re at the keyboard. The moment you step into a two-hour meeting or take PTO, the AI PR queue grows on its own, and these checks still need to run without you. That’s the gap Part 3 closes: encoding Part 1’s checklist and Part 2’s tier model into CI jobs and a PR bot, so the review loop keeps running when you’re away.
Part 3 — Push the review loop into CI →
What to read next
- Part 1 — A six-item checklist for reviewing a single AI PR. The checklist this tutorial sits on top of. Re-read item 2 if you skipped the behavioral spot-check section.
- Part 3 — Push the review loop into CI. The automated version of triage: auto-label tiers, gate merges on behavioral coverage, post spot-check reports as PR comments.
- Series index — Reviewing AI-generated pull requests in 2026. Shared terminology in one place: AI-authored PR, behavioral diff, read budget, blast radius tier, review-loop in CI.