TL;DR AI-authored PRs fail on six axes your old human-author checklist was never calibrated for: fake/over-mocked tests, plausible-but-wrong API usage, scope creep, hallucinated imports, hidden side effects, and misleading commit messages. Run this six-step pass in one sitting, then decide ship-with-nudge or reject. No “comment and forget.”

📊 Result Proof: Worked example below catches all six failure modes on a single 12-file, 47-assertion PR that compiles, passes CI, and reads like a clean refactor.

In 2026, your inbox isn’t getting one human-author PR a day. It’s getting three coding-agent diffs an hour. The “read every line” style doesn’t survive that throughput, and the checklist you built for human authors catches the wrong bugs. Coding agents write plausible code that compiles, tests that pass without verifying anything, and commit messages that describe a PR you didn’t get.

This tutorial gives you a six-check pass for one AI-authored PR, end to end. Run it linearly, decide in one sitting.

Two terms used throughout (defined here, referenced for the rest of the series):

  • AI-authored PR: a pull request whose diff is mostly generated by a coding agent (Claude Code, Cursor agent, Codex), not a human-written diff that an agent later patched. The failure modes differ. Human-with-agent-assist PRs still fail on logic and edge cases. Agent-authored PRs fail on plausibility.
  • Behavioral diff: comparing a PR by observable behavior (tests that actually exercise production code, snapshot tests, contract tests) rather than line-by-line diff reading. You’ll lean on this in Step 2 and across the rest of the series.
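
To make the second term concrete, here is a minimal Python sketch of a behavioral diff. The two `normalize_*` functions and the input list are hypothetical stand-ins for the pre-PR and post-PR versions of a helper; they are not taken from any real codebase.

```python
from typing import Any, Callable, Iterable


def _observe(fn: Callable[[Any], Any], value: Any) -> Any:
    """Run fn and capture either its return value or the exception type it raised."""
    try:
        return ("ok", fn(value))
    except Exception as exc:  # a raised exception is observable behavior too
        return ("raised", type(exc).__name__)


def behavioral_diff(
    old: Callable[[Any], Any],
    new: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list:
    """Return (input, old_result, new_result) for every input whose behavior changed."""
    changed = []
    for value in inputs:
        before, after = _observe(old, value), _observe(new, value)
        if before != after:
            changed.append((value, before, after))
    return changed


# Hypothetical "before" and "after" versions of the same helper.
def normalize_old(email: str) -> str:
    return email.strip().lower()


def normalize_new(email: str) -> str:
    return email.lower()  # the PR description claims "no behavior change"


diff = behavioral_diff(normalize_old, normalize_new, ["A@x.io", " a@x.io ", "b@x.io"])
print(diff)
```

Only the padded input shows up in the result, which is the kind of difference a line-by-line read of a "clean refactor" tends to miss. The sketch is only as good as the inputs you feed it.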

Prerequisites:

  • A year-plus of reviewing PRs; comfortable with git diff, gh pr view, and running tests locally.
  • Have seen at least one coding agent’s output (Claude Code, Cursor, Codex) end-to-end.
  • A repo where you can run the test suite against a PR branch (CI or local).

Why your old checklist misses AI-authored PR review failures

Your human-author checklist is calibrated for the bugs humans introduce: off-by-one, race conditions, edge cases nobody thought to handle, refactors that broke an undocumented invariant. Agents write a different class of bug: code that looks correct to a reader who’s pattern-matching on shape rather than semantics. Six axes show up repeatedly:

Human-author failure mode | AI-author failure mode
Logic bug under edge input | Test that asserts on a self-returning mock
Missed null check | Plausible signature with wrong semantics
Out-of-date dependency assumption | Hallucinated import that resolves to a same-named symbol elsewhere
Refactor breaks invariant | Side effect hidden in module-level code
Forgot to update docs | Commit message describes a different diff
Reaches outside the ticket scope by accident | Reaches outside scope and rationalizes it in the commit

The checks below are calibrated for the right column. If you’re reviewing a human-with-agent-assist PR, run your old checklist alongside this one. They catch different things.

Key insight: The six axes share a structural cause: agents optimize for outputs that look like passing PRs. Tests that pass, diffs that compile, commit messages that sound reasonable. None of those signals are evidence of correctness. The checklist below replaces “looks right” with a concrete verification action per axis.

Step 1: Pin the expected scope before you open the diff

What: Before clicking Files changed, read the PR description and the first commit title. Write the expected scope in one or two lines. “Adds verifyJwt to auth/jwt.ts and calls it from the login handler” is a scope. “Auth refactor” is not.

Why: Confirmation bias is real. If the first thing you see is a 12-file diff, you’ll rationalize whatever scope creep is in there. Writing the scope first gives you a baseline for Step 4 and Step 6.

Command:

gh pr view 4821 --json title,body,baseRefName,headRefName \
| jq -r '.title, "", .body'

Expected output (truncated):

refactor: extract auth helper
Pull `verifyJwt` out of `login.ts` into a reusable
helper in `auth/jwt.ts`. Update call sites in the
login and refresh flows. No behavior change.

Verify: Is the scope concrete enough to filter git diff --stat? “Auth refactor” fails. “Touch auth/jwt.ts, login.ts, refresh.ts; no other files in scope” passes. If you can’t write it concretely, the PR description is too vague; push back before reviewing.
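
One rough way to sanity-check concreteness is to ask whether the scope you wrote names any file paths at all. The sketch below assumes that "concrete" means "mentions at least one path-like token"; the extension list and both function names are illustrative, not part of any tool.

```python
import re

# Assumption: a concrete scope names at least one file path.
# The extension list is illustrative; extend it for your repo.
PATH_TOKEN = re.compile(r"\b[\w/-]+\.(?:ts|tsx|js|py|go|rs|java)\b")


def scope_paths(scope: str) -> list:
    """Extract path-like tokens from a written scope statement."""
    return PATH_TOKEN.findall(scope)


def is_concrete(scope: str) -> bool:
    """True if the scope statement names at least one file."""
    return len(scope_paths(scope)) > 0


print(is_concrete("Auth refactor"))
print(scope_paths("Touch auth/jwt.ts, login.ts, refresh.ts; no other files in scope"))
```

The extracted paths double as the allow-list for the Step 4 filter, so writing the scope this way pays off twice.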

Step 2: Do the tests actually test anything, or are they fake?

What: Open the test files before the production diff. Two checks. First, invert one assertion at random and re-run. If the test still passes, the test is fake. Second, count mock(, patch(, and stub( calls and compare that count to the number of assertions on real return values. If the mocks outnumber the real assertions (a ratio above 1:1), you’re probably looking at mocks asserting on mocks.

Why: This is the most common agent failure mode I see. Agents are very good at writing tests that pass; they’re not yet good at writing tests that exercise the diff. Coverage goes up. The behavioral diff stays empty. The last AI-authored PR I rejected for this added 47 assertions; 46 of them asserted that a mock had been called with arguments the mock itself synthesized. Coverage went up 8%. Zero production code paths were verified.

Command:

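# List the assertions with their line numbers in the file
grep -n 'assert' tests/auth/test_verify_jwt.py
# Pick one at random, edit the file to flip == to !=, then:
pytest tests/auth/test_verify_jwt.py -x --tb=short
# Restore the file when you're done:
git checkout -- tests/auth/test_verify_jwt.py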

Expected output if the test is real:

FAILED tests/auth/test_verify_jwt.py::test_valid_token
AssertionError: assert 'user-42' != 'user-42'

Expected output if the test is fake:

1 passed in 0.07s

That second output is the failure mode. The assertion changed and the test didn’t care.
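
For intuition on how a test can survive inversion, here is one illustrative shape, sketched in Python. The `verify_jwt` function and both tests are hypothetical; the swallowed exception is just one of several ways an assertion can end up never reaching the test runner.

```python
from unittest.mock import MagicMock


def verify_jwt(token: str) -> str:
    """Hypothetical production function: extracts the user id from a toy token."""
    prefix = "token-for-"
    if not token.startswith(prefix):
        raise ValueError("malformed token")
    return token[len(prefix):]


def real_test(invert: bool = False) -> None:
    user = verify_jwt("token-for-user-42")  # exercises production code
    assert (user != "user-42") if invert else (user == "user-42")


def fake_test(invert: bool = False) -> None:
    client = MagicMock()  # production code is never called
    try:
        user = client.verify_jwt("token-for-user-42")  # returns another MagicMock
        assert (user != "user-42") if invert else (user == "user-42")
    except Exception:
        pass  # swallows AssertionError, so the outcome never reaches the runner


def survives_inversion(test) -> bool:
    """True if the test still passes after its assertion is inverted."""
    try:
        test(invert=True)
    except AssertionError:
        return False
    return True


print("real test survives inversion:", survives_inversion(real_test))  # False
print("fake test survives inversion:", survives_inversion(fake_test))  # True
```

The fake test is green in both directions, which is exactly what the manual flip is designed to expose.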

Verify: A real test fails when its assertion is inverted. Run this on at least one assertion per test file in the PR; sample three at random for files with many assertions. If tests don’t fail under inversion, the behavioral diff is zero and you have no evidence the production diff does what the description claims.

Key insight: A test that doesn’t fail when its assertion is inverted is not a test; it’s a coverage line. Treat the inversion result, not the green CI badge, as the signal that the suite exercises production.

Step 3: Does every new call site match the real API signature?

What: For every new or changed function call in the diff, jump-to-definition and compare the signature against the official docs. Don’t trust the diff. Don’t trust autocomplete.

Why: Common shapes: an await on a call that doesn’t return a Promise (the await resolves immediately to whatever the call returned, often undefined), an argument order matching an overload the library deprecated three minors ago, a sync method called as if async. These compile. They sometimes pass tests, if the tests are over-mocked (Step 2). They fail in production.

Code shape:

// Diff added this line. Looks fine.
const result = await db.query(sql, params, callback);
// Real signature in node-postgres v8:
// query(sql, params) -> Promise<Result>
// query(sql, params, callback) -> void (callback-based)
// With a callback, the await resolves to undefined, so `result` is undefined.
// The rows are delivered to `callback` instead.
// The next lines read `result.rows` and throw a TypeError at runtime.

Command:

git diff origin/main...HEAD -- '*.ts' '*.py' \
| grep -E '^\+.*[a-zA-Z_]+\.[a-zA-Z_]+\(' \
| head -40

Walk that list. For each call, open the symbol’s defining module and confirm the signature.

Verify: Every new call site has been compared against its actual definition. Drive at least one call with an edge input (null, empty array, very large value) to confirm runtime behavior matches the type signature. Type checks alone don’t catch overload confusion.
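
For Python call sites, part of this comparison can be done at runtime with the standard `inspect` module. The sketch below assumes the library is importable in your review environment; `query` and `query_async` are hypothetical stand-ins for a real client API, and the check covers argument count and names only, not semantics.

```python
import inspect


def call_fits(func, *args, **kwargs) -> bool:
    """True if func could be called with these arguments (arity and names only)."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Hypothetical library functions standing in for a real client API.
def query(sql: str, params: list) -> list:
    return []


async def query_async(sql: str, params: list) -> list:
    return []


print(call_fits(query, "SELECT 1", []))               # True: two-argument form exists
print(call_fits(query, "SELECT 1", [], lambda r: r))  # False: no callback overload
print(inspect.iscoroutinefunction(query))             # False: awaiting this is suspect
print(inspect.iscoroutinefunction(query_async))       # True
```

Unlike the JavaScript case above, Python raises a TypeError when you await a non-awaitable, so the sync-called-as-async mistake is louder there. A call that binds cleanly can still mean the wrong thing, which is why the edge-input run remains part of the check.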

Step 4: Did the agent widen the diff beyond the scope you pinned?

What: Filter git diff --stat against the Step 1 scope. Every out-of-scope file needs a one-line justification in the commit message. “Cleanup” is not a justification. “Fixed a lint error the new code surfaced” is.

Why: Agents reflexively widen the diff. They rename a variable in a neighboring file for “consistency.” They reorder imports because the linter prefers it. Each out-of-scope file is a review surface you weren’t planning to spend attention on.

Command:

# --stat indents every line, so allow leading spaces before each pattern
git diff --stat origin/main...HEAD \
| grep -vE '^ *(auth/|login\.ts|refresh\.ts|[0-9]+ files? changed)'

Real output, often:

src/utils/logger.ts | 3 +-
src/api/users/profile.ts | 12 +++--
src/components/Avatar.tsx | 4 +-

Verify: Each out-of-scope file has a justification you can read aloud and defend. Three or more without justification: ask the author to split the PR. Splitting forces a description for the second PR, the cheapest way to surface unintended changes.
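
If you would rather keep the scope as data than as a regex, the same filter is a few lines of Python. This is a sketch under the assumption that the scope is a tuple of path prefixes; the file lists are the illustrative ones from this step, and `out_of_scope` is a hypothetical helper name.

```python
def out_of_scope(changed: list, allowed: tuple) -> list:
    """Return changed paths that do not start with any allowed prefix."""
    return [path for path in changed if not path.startswith(allowed)]


# In practice, feed this from: git diff --name-only origin/main...HEAD
changed_files = [
    "auth/jwt.ts",
    "login.ts",
    "refresh.ts",
    "src/utils/logger.ts",
    "src/api/users/profile.ts",
    "src/components/Avatar.tsx",
]
scope = ("auth/", "login.ts", "refresh.ts")

extra = out_of_scope(changed_files, scope)
for path in extra:
    print("needs justification:", path)
```

The function only lists candidates. Whether each one has a justification you can defend is still a judgment you make by reading the commit message.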

Step 5: Does every import resolve where you expect, with no module-level side effects?

What: Two passes. First, grep every import in the diff and verify the symbol resolves to the module you expect, not a same-named symbol elsewhere in the namespace. Second, read the constructor and module-level statements of every new or changed file, because side effects tend to hide in code that runs at import time. Then run the test suite with network blocked.

Why: Hallucinated imports sometimes compile. A from utils.security import sanitize_html works if there’s a sanitize_html somewhere in that namespace, even if it’s a different function. The last time I hit this, the resolved symbol was a no-op stub from a vendored package the agent didn’t know existed. Production silently accepted unsanitized HTML for two weeks before someone filed an XSS report. Module-level side effects are similar: an agent imports a client and instantiates it eagerly, the client opens a connection on import, and your unit tests start hitting the real backend.

Command:

# 5a. Verify the import resolves where you expect.
python3 -c "from utils.security import sanitize_html; \
print(sanitize_html.__module__, sanitize_html.__qualname__)"
# Expected: utils.security.sanitizers sanitize_html
# Bad: fallback.shim NoOpSanitizer.sanitize_html
# 5b. Read module-level code for every new or changed file (skip deletions).
git diff origin/main...HEAD --name-only --diff-filter=d \
| xargs -I{} sh -c "echo '=== {} ==='; head -40 {}"
# 5c. Run tests with network blocked (macOS sketch; adapt per OS).
sudo pfctl -e -f /etc/pf.conf.test-isolated && pytest -x

Verify: Every unfamiliar import resolves to the module you expect. No new file has a side-effecting statement at module level (no client = APIClient(), no cache.warm(), no requests.get(...) outside a function body). Tests pass with network blocked. If any of those fail, this is a Step 5 fail, which per Step 7 is not a nudge.
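
Reading the first 40 lines by eye works for small files. For Python files, a rough alternative is to let the standard `ast` module list every call that would run at import time. This is a sketch, not a linter: it skips function and class bodies (class-level calls also run on import, so extend it if that matters to you), and `APIClient` and the sample source are hypothetical.

```python
import ast

# Top-level statements whose bodies do not execute calls of interest on import.
SKIPPED = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom)


def module_level_calls(source: str) -> list:
    """Return (line number, callee) for every call executed at import time."""
    findings = []
    for node in ast.parse(source).body:
        if isinstance(node, SKIPPED):
            continue
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                findings.append((child.lineno, ast.unparse(child.func)))
    return findings


# Hypothetical new file from the PR.
SAMPLE = '''import requests
from clients import APIClient

client = APIClient()
TIMEOUT = 30


def fetch(url):
    return requests.get(url, timeout=TIMEOUT)
'''

print(module_level_calls(SAMPLE))  # [(4, 'APIClient')]
```

The eager `APIClient()` on line 4 is flagged while the `requests.get` inside `fetch` is not. Expect benign hits too, such as calls under an `if __name__ == "__main__":` guard; the list is a reading aid, not a pass/fail gate. `ast.unparse` needs Python 3.9 or later.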

Key insight: A symbol that resolves to the wrong namespace is not the symbol you imported. The fix is environment evidence, not a re-read of the diff; agents cannot hallucinate a runtime, so make the runtime answer the question.
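
The one-liner in 5a generalizes to a small helper you can run over every import in the diff. This sketch uses only `importlib` from the standard library; `resolved_origin` is a hypothetical name, and the two standard-library lookups are there purely to show that an import path and a definition site can differ.

```python
import importlib


def resolved_origin(module_name: str, symbol: str) -> tuple:
    """Import module_name and report where symbol is actually defined."""
    obj = getattr(importlib.import_module(module_name), symbol)
    return obj.__module__, obj.__qualname__


print(resolved_origin("json", "dumps"))    # defined where you imported it from
print(resolved_origin("os.path", "join"))  # defined in posixpath or ntpath, not os.path
```

Note that importing a module executes its module-level code, so run this in the same isolated environment you use for the network-blocked test run.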

Step 6: Does each commit message actually describe its diff?

What: Read each commit message with one question in mind: is this claim verifiable from the diff? For at least one randomly chosen commit, walk the message against git show {sha} and confirm every claim is supported by the actual changes.

Why: Agents write commit messages confident about what the diff does. Sometimes they’re accurate. Sometimes they describe a refactor the diff doesn’t perform, or omit a behavior change that’s right there. The last bad one I caught said “fix null check on user id” and the diff changed the retry policy on the auth client from 3 attempts to unbounded. The null check was real. The retry change was nowhere in the message and would have shipped a thundering-herd risk.

Command:

git log --oneline origin/main..HEAD
# Pick one sha at random:
git show 7c4f9a2
# Read the message. Read the diff. Do they match?

Verify: Every commit message accurately describes its diff. On a mismatch, push the author to rewrite the message. Don’t rewrite it yourself; the message is the author’s claim about their own work, and replacing it hides the disconnect from future readers.
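
A narrow slice of this check can be mechanized: names the message puts in backticks should appear somewhere in the diff. The sketch below is a one-directional heuristic under that assumption. It can flag a claim the diff doesn’t support, but it cannot notice a change the message leaves out, such as the retry-policy example above. The message and diff text are hypothetical.

```python
import re

BACKTICKED = re.compile(r"`([^`]+)`")


def unsupported_claims(message: str, diff_text: str) -> list:
    """Backticked names in the commit message that never appear in the diff."""
    return [name for name in BACKTICKED.findall(message) if name not in diff_text]


# Hypothetical commit message and the output of `git show <sha>` for it.
message = "refactor: extract auth helper\n\nPull `verifyJwt` out of `login.ts` into `auth/jwt.ts`."
diff_text = """diff --git a/auth/jwt.ts b/auth/jwt.ts
+export function verifyJwt(token: string) {
diff --git a/src/utils/logger.ts b/src/utils/logger.ts
-const level = "info";
"""

print(unsupported_claims(message, diff_text))  # ['login.ts']
```

Here the message claims a change to `login.ts` that the diff never makes. An empty result means only that the named symbols exist in the diff; you still read the commit.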

Step 7: Do you ship-with-nudge or reject in one pass?

What: Translate the six checks into one of two outcomes. Ship-with-nudge: leave a focused review comment the author can resolve in a one-sentence change, and trust the PR to land after that. Reject: PR goes back to draft. There is no third option called “comment and forget.” That option creates a queue of half-reviewed PRs nobody owns.

Why: Nudges work when the root cause is local: rename a variable, add an assertion, remove an out-of-scope file. Nudges don’t work when the cause is structural. A fake test (Step 2 fail) means the author didn’t understand what the test was supposed to verify; rewrite. A hallucinated import (Step 5 fail) means the code never ran in the right environment; rewrite. A misleading commit message (Step 6 fail) means you can’t trust the description against the diff, and the rest of your review was built on that description; rewrite.

Decision table:

Failing check | Action
Step 1 (scope unclear) | Push back on description; no review until rewritten
Step 2 (fake/over-mock tests) | Reject. Not a nudge.
Step 3 (wrong API) | Nudge if one or two call sites; reject if pervasive
Step 4 (scope creep) | Nudge if one file; ask to split if three-plus
Step 5 (hallucinated import / hidden side effect) | Reject. Not a nudge.
Step 6 (misleading commit) | Reject. Not a nudge.
All six pass | Ship

Verify: For each check, you can name the specific evidence that made it pass or fail. If you’re hand-waving (“looked fine”), you haven’t run the check; go back. The whole point of this pass is that “looked fine” was the failure mode you started with.

Key insight: The decision rule isn’t about severity; it’s about repairability. A nudge presumes the author can fix the root cause from one sentence. Fake tests, hallucinated imports, and misleading commits all require the author to redo a step you can’t shortcut. Treating them as nudges is how you ship a clean-looking PR that fails in production a week later.
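
The decision table is small enough to encode directly, which helps if you want the outcome recorded consistently across reviewers. This is a sketch with two labeled assumptions the table leaves open: "pervasive" in Step 3 is read as more than two call sites, and two out-of-scope files in Step 4 are treated as a nudge. The function name and return strings are hypothetical.

```python
REJECT_STEPS = (2, 5, 6)  # structural failures: never a nudge


def decide(failures: dict) -> str:
    """Map failing steps (step number -> count of offending items) to an action."""
    if 1 in failures:
        return "push back on description"
    if any(step in failures for step in REJECT_STEPS):
        return "reject"
    if failures.get(3, 0) > 2:  # assumption: more than two call sites is "pervasive"
        return "reject"
    if failures.get(4, 0) >= 3:
        return "ask to split"
    if failures:  # assumption: anything left is locally repairable
        return "ship with nudge"
    return "ship"


print(decide({}))            # ship
print(decide({3: 1}))        # ship with nudge
print(decide({4: 3}))        # ask to split
print(decide({3: 1, 5: 1}))  # reject
```

The ordering mirrors the repairability rule: any single structural failure outranks any number of local ones.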

Verified outcome and common pitfalls

You now have a fixed six-check pass for one AI-authored PR that runs end-to-end in roughly the time it used to take to read the diff twice. Paste it into your team’s Notion or a CI comment template.

Three pitfalls:

  • Sampling on huge PRs. Step 2 (assertion inversion) doesn’t scale to a 100-test PR. Sample three random assertions per test file; if any of the three pass after inversion, drill the whole file.
  • Step 3 vs. Step 5 overlap. Step 3 catches a real symbol used wrong; Step 5 catches a symbol that doesn’t exist or resolves somewhere unexpected. Run both.
  • Letting Step 4 turn into Step 7. Scope creep is a nudge if there’s one out-of-scope file with a justification, and a request to split if there are three or more without one. It is not grounds for relitigating the whole decision. The line: “can the author fix this with a one-sentence comment?”
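
For the sampling pitfall, picking the three assertions by eye tends to favor the ones near the top of the file. A small sketch that picks them at random instead, assuming Python test files where assertions are plain `assert` statements; the function name and the sample source are hypothetical.

```python
import random
from typing import Optional


def sample_assertions(source: str, k: int = 3, seed: Optional[int] = None) -> list:
    """Pick up to k random 1-based line numbers that hold an assert statement."""
    candidates = [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if line.lstrip().startswith("assert ")
    ]
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(k, len(candidates))))


# Hypothetical test file contents.
SOURCE = """def test_tokens():
    assert verify("a") == "user-1"
    assert verify("b") == "user-2"
    token = make_token()
    assert verify(token) == "user-3"
    assert verify("d") == "user-4"
"""

print(sample_assertions(SOURCE, k=3))
```

Pass a `seed` if you want a second reviewer to reproduce the same sample.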

FAQ

Q: Does this checklist work for human-authored PRs?

A: It runs, but it’s over-engineered for that case. The six axes are calibrated for agent failure modes. Keep your old checklist for human PRs; layer this one on top when the PR is agent-authored.

Q: Step 2 says invert an assertion. What if the PR has a hundred tests?

A: Sample. Invert three assertions at random per test file. If any of the three still passes after inversion, treat that file as suspect and drill it. If all three fail correctly, the file is probably real; spot-check the rest.

Q: How long should I spend on a single AI-authored PR?

A: Wrong question. The right question is how to allocate attention across the queue. Part 2 addresses that. For one PR in isolation, this checklist runs in roughly 25-40 minutes of focused attention.

Q: Which steps can I automate?

A: Step 4 (scope-creep filtering), Step 5a (import resolution), and Step 6 (commit-vs-diff comparison) encode into CI checks. Step 2 (assertion inversion) and Step 3 (signature verification) are harder to automate cleanly. Part 3 builds the CI side.

Q: A PR passes all six checks but ships a bug. What went wrong?

A: This checklist reduces AI-specific false-pass; it doesn’t replace domain judgment. Logic bugs in the change itself still require the senior judgment you bring on the seventh axis. The checklist exists to free attention for that axis by handling the other six structurally.

Part 1 stops here

This checklist holds up when you have time to read one PR end-to-end. It falls apart the moment three AI-authored PRs land in the same hour and you don’t have the read budget to walk all three line by line.

Next in the series: Part 2 — Triage your AI PR queue by blast radius →