TL;DR

  • Encode the six-item Part 1 checklist as six CI assertions, map Part 2’s blast radius tiers to auto-labels, and gate behavioral coverage delta on T0/T1 PRs. Three jobs, one bot.
  • The review loop keeps running while you’re in a two-hour meeting, on PTO, or asleep. The queue still grows, but each PR is labeled, gated, and spot-check-commented before you log back in.
  • The bot does not judge taste, architectural intent, or business fit. Naming that boundary is half the point of this post.

📊 What you’ll have at the end:

  • A tier-label.yml workflow auto-labeling PRs tier:T0/T1/T2/T3 and ai-authored.
  • A six-asserts.yml matrix workflow with four blocking and two soft assertions.
  • A behavioral-gate.yml required check that fails T0/T1 PRs with zero behavioral coverage delta.
  • A small GitHub Action that posts one idempotent spot-check comment per PR.

You closed Part 2 with the triage loop holding. Then you stood up, walked to a meeting, and forty minutes later seven new AI-authored PRs sat in the queue. Triage works while you’re on the keyboard, but as soon as you’re off, the queue grows and the checks still need to run. That’s where Part 3 starts. (If you skipped Part 2, the triage layer is here: Part 2, triage the AI-PR queue by blast radius.)

The promise: three CI jobs plus one PR bot, all paste-and-adapt for a repo you already control. By the end, the AI-PR queue self-labels, self-gates on behavioral coverage, and self-comments with a spot-check checklist before any human shows up.

Prerequisites:

  • You read Parts 1 and 2. I reference their checklist and tier definitions rather than redefine them.
  • You can edit CI config (GitHub Actions in the examples; the patterns port cleanly to GitLab CI and Buildkite).
  • You can create a GitHub App or bot account with scopes to post review comments and create check runs.

How to automate AI pull request review with a CI bot in 2026: the three-job loop

The setup is direct: push the Part 1 checklist into Job 2, push the Part 2 tiering into Job 1, then add a behavioral-coverage gate (Job 3) that only fires on the dangerous tiers. The PR bot reads all three job outputs and renders one comment with the things only a human can sign off on. Order matters because each layer narrows the next.

Here’s the pipe-level shape:

PR opens
-> Job 1: label tier + detect ai-authored
-> Job 2: six failure-mode asserts (4 blocking, 2 soft)
-> Job 3: behavioral diff gate (T0/T1 only)
-> PR bot: single idempotent comment with spot-check checklist

Key insight: The Part 2 cliffhanger is “the checks need to run without you.” The CI loop is not a smarter reviewer. It’s the same reviewer, encoded so it doesn’t sleep. That framing keeps the scope honest: you’re not building an AI judge, you’re externalizing a process you already trust. Source: Part 2’s blast radius tiers.

One trade-off to name upfront: encoding the checklist means accepting some false positives. The script can’t read your team’s taste. You’ll get noisy comments on legitimate PRs for the first two weeks. Plan for a tuning sprint, not a one-shot setup.

Job 1: how do you auto-label blast radius tier and detect AI-authored PRs?

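Job 1 attaches tier:T0/T1/T2/T3 and ai-authored labels to every opened PR using path heuristics plus commit-trailer detection. Required reviewers then route via CODEOWNERS keyed on tier. Run order: this job must finish before Job 2 and Job 3, because both branch on the tier label. Separate workflow files triggered by the same event start in parallel, so that ordering has to be enforced explicitly, for example by re-running the later jobs when the label lands.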

What: Map changed paths to a tier; detect AI-authored signature; emit two labels.

Why: Every later job (gate, bot) reads these labels. Without Job 1, the gate has nothing to branch on and the bot doesn’t know which spot-check checklist to render.

Code (.github/workflows/tier-label.yml):

name: tier-label
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: read
  pull-requests: write   # needed to add labels to the PR
jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Compute tier and ai-authored
        id: classify
        # head.sha is the PR's tip commit; github.sha would be the temporary merge commit.
        run: python3 .ci/classify_pr.py "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}"
      - name: Apply labels
        uses: actions/github-script@v7
        with:
          script: |
            // The classifier emits a JSON array, so it can be inlined as a JS literal.
            const labels = ${{ steps.classify.outputs.labels }};
            await github.rest.issues.addLabels({
              owner: context.repo.owner, repo: context.repo.repo,
              issue_number: context.issue.number, labels,
            });

The classifier (.ci/classify_pr.py, sketch):

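import json, os, re, subprocess, sys

base, head = sys.argv[1], sys.argv[2]

# Three-dot diff: only what the PR branch changed since it forked from base.
paths = subprocess.check_output(
    ["git", "diff", "--name-only", f"{base}...{head}"]
).decode().splitlines()

# High-risk tiers, checked in order: one matching path is enough.
HIGH_RISK = [
    ("T0", re.compile(r"^(infra|security|auth|db/migrations)/")),
    ("T1", re.compile(r"^(core|domain|api)/")),
]
LOW_RISK = re.compile(r"^(docs|tests)/")

tier = "T2"  # default
for name, pat in HIGH_RISK:
    if any(pat.match(p) for p in paths):
        tier = name
        break
else:
    # T3 only when every changed path is docs or tests; a mixed PR stays T2.
    if paths and all(LOW_RISK.match(p) for p in paths):
        tier = "T3"

trailers = subprocess.check_output(
    ["git", "log", f"{base}..{head}", "--format=%(trailers)"]
).decode()
ai = bool(re.search(r"Co-Authored-By:\s*(Claude|Cursor|Codex)", trailers, re.I))

labels = [f"tier:{tier}"]
if ai:
    labels.append("ai-authored")

# ::set-output is deprecated; step outputs are written to the GITHUB_OUTPUT file.
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"labels={json.dumps(labels)}\n")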

Verify: Open a test PR touching core/billing/invoice.py. Within 30 seconds, the PR should carry tier:T1 plus ai-authored if the trailer is present. Both labels appearing means Job 1 is wired.

Key insight: Path heuristics are never 100% right. Allow maintainers to override the label by hand, and have Job 2 and Job 3 re-read the label rather than re-classify. Borrowed pattern from Promote registry triple to live, where commit-boundary overrides matter.
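
A minimal sketch of that override rule, assuming a hypothetical helper named resolve_tier that receives the labels already on the PR and the tier the classifier just computed. Neither the helper nor its name comes from the workflow above; it only illustrates "re-read, don't re-classify."

```python
import re

TIER_LABEL = re.compile(r"^tier:(T[0-3])$")

def resolve_tier(existing_labels, computed_tier):
    """Return the tier to act on: a tier label already on the PR wins over the heuristic."""
    for label in existing_labels:
        match = TIER_LABEL.match(label)
        if match:
            return match.group(1)  # keep whatever is already on the PR
    return computed_tier           # nothing set yet, fall back to the path heuristic

print(resolve_tier(["ai-authored", "tier:T0"], "T2"))  # -> T0
print(resolve_tier(["ai-authored"], "T2"))             # -> T2
```

One limit worth knowing: this sketch can't tell a label a maintainer set by hand from one the bot applied on an earlier push, so a later push that wanders into a riskier path would keep the old tier. Telling the two apart needs extra information (who applied the label), which is outside this sketch.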

Job 2: how do you encode the Part 1 six-item checklist as CI assertions?

Job 2 is a matrix of six independent assertions, each mapping to one failure mode from Part 1’s review checklist. Assertions 1, 2, 4, and 5 are blocking. Assertions 3 and 6 are soft, posting a warning comment instead of failing the check. Blocking assertions catch errors with low false-positive rates; soft ones catch noisier drift signals.

What: Six independent CI jobs, parallel, each returning pass/fail plus a one-line reason.

Why: Bundling six checks into one script makes failures opaque; a matrix makes the PR comment readable.

Code (.github/workflows/six-asserts.yml, abridged):

name: six-asserts
on: pull_request
jobs:
  assert:
    strategy:
      fail-fast: false   # let all six cells report, even if one fails
      matrix:
        check: [fake-tests, schema-diff, scope-drift, hallucinated-import, side-effect, commit-msg]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Run ${{ matrix.check }}
        # head.sha is the PR's tip commit; github.sha would be the temporary merge commit.
        run: python3 .ci/asserts/${{ matrix.check }}.py "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}"

The six scripts, each ~60 lines, do the following:

  1. fake-tests.py (blocking). AST-scan new test files; flag tests whose only assertions are mock.call_count == N or mock.assert_called_with(...) without a return-value or state assertion afterward. Fails if mock-only assertions exceed 30% of total. Output: FAIL: 8 assertions, 6 mock-only (75%).

  2. schema-diff.py (blocking). Diff openapi.yaml, *.prisma, or JSON Schema files between base and head. If schema changed but no migration landed under db/migrations/, fail. Output: FAIL: openapi.yaml changed without migration.

  3. scope-drift.py (soft). Extract declared scope from a fenced ### Scope block in the PR description, compare to git diff --name-only. If overlap is below 80%, warn. Soft because stale descriptions are common. Output: WARN: PR touches src/auth, Scope lists src/billing.

  4. hallucinated-import.py (blocking). Extract every new import line; resolve against package.json, pyproject.toml, go.mod, or the equivalent manifest. Fail on any import that doesn’t resolve. This single check has paid for the whole setup more than once on my teams. Output: FAIL: 'utils.retrylib' not in pyproject dependencies.

  5. side-effect.py (blocking). Diff for changes in *.config.*, *.env*, db/migrations/, cron/, or DI container files. If any are touched and the PR lacks an infra-change label, fail. Output: FAIL: terraform/main.tf changed without 'infra-change' label.

  6. commit-msg.py (soft). Regex check ^(feat|fix|chore|docs|refactor|test|perf|build|ci)(\(.+\))?: .{10,} on every non-merge commit. Warn if any fails. Soft because squash-merge teams ignore per-commit messages. Output: WARN: 1/3 commits missing prefix.
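
To make the first item concrete, here is a rough sketch of the counting core of fake-tests.py. It is not the script from my repo. It assumes pytest-style tests (plain assert statements plus unittest.mock's assert_called* helpers), it ignores unittest's self.assertEqual family entirely, and it only computes the ratio; it does not look at whether a state assertion follows a mock assertion. The function names count_assertions and verdict are made up for this example.

```python
import ast

# Attributes that only tell you a mock was called, not what the code produced.
MOCK_ATTRS = {"call_count", "called", "call_args", "call_args_list"}
MOCK_METHODS = ("assert_called", "assert_not_called", "assert_any_call", "assert_has_calls")

def _mentions_mock_attr(node):
    return any(isinstance(n, ast.Attribute) and n.attr in MOCK_ATTRS for n in ast.walk(node))

def count_assertions(source):
    """Return (total, mock_only) assertion counts for one test file's source."""
    total = mock_only = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            total += 1
            if _mentions_mock_attr(node.test):
                mock_only += 1
        elif (isinstance(node, ast.Expr)
              and isinstance(node.value, ast.Call)
              and isinstance(node.value.func, ast.Attribute)
              and node.value.func.attr.startswith(MOCK_METHODS)):
            total += 1       # e.g. mock.assert_called_with(...)
            mock_only += 1
    return total, mock_only

def verdict(total, mock_only, threshold=0.30):
    """Format the one-line result; FAIL when the mock-only share exceeds the threshold."""
    if total == 0:
        return "OK: no assertions found"
    ratio = mock_only / total
    status = "FAIL" if ratio > threshold else "OK"
    return f"{status}: {total} assertions, {mock_only} mock-only ({ratio:.0%})"

SAMPLE = '''
def test_charge(gateway):
    charge(gateway, 100)
    gateway.post.assert_called_with("/charge", amount=100)
    assert gateway.post.call_count == 1
    assert gateway.balance == 0
'''
print(verdict(*count_assertions(SAMPLE)))  # -> FAIL: 3 assertions, 2 mock-only (67%)
```

The 30% threshold is the one knob here. Treat the attribute and method lists as a starting point to tune against your own test suite.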

Verify: Open a PR with a single hallucinated import (from utils.fakelib import foo). The hallucinated-import matrix cell should turn red within two minutes. Add the dependency, re-run, the cell turns green.
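
For a Python-only repo, the resolution step behind that red cell might look roughly like the sketch below. The assumptions are loud ones: the caller has already parsed the manifest into a set of importable top-level names (distribution names and import names can differ, and this sketch does not map between them), the caller supplies the set of dotted module names that exist in the repo, and sys.stdlib_module_names requires Python 3.10 or newer. The function names are hypothetical.

```python
import ast
import sys

def imported_modules(source):
    """Yield the dotted module names imported by a Python source string."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            yield node.module  # relative imports (level > 0) are skipped

def unresolved_imports(source, declared, local_modules):
    """Return imports that are neither stdlib, declared dependencies, nor repo modules."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # Python 3.10+
    missing = set()
    for name in imported_modules(source):
        top = name.split(".")[0]
        if top in stdlib or top in declared or name in local_modules:
            continue
        missing.add(name)
    return sorted(missing)

SOURCE = "import requests\nfrom utils.retry import backoff\nfrom utils.fakelib import foo\n"
print(unresolved_imports(SOURCE, {"requests"}, {"utils", "utils.retry"}))
# -> ['utils.fakelib']
```

Note what it does not do: `from utils import fakelib` only checks that `utils` exists, so a hallucinated submodule imported that way slips through.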

Key insight: These six assertions catch the machine-catchable subset of the Part 1 checklist. They don’t catch over-mocked tests that mock the right shape but the wrong contract. They don’t catch a hallucinated function inside an existing module. The blocking-vs-soft split is the only place you tune false-positive tolerance, so do that consciously, not by accident.
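
One way to keep that split in a single, visible place is a shared helper that every assertion script calls before exiting. This is a sketch under my own naming (report, BLOCKING, SOFT are hypothetical); the check names match the matrix in six-asserts.yml.

```python
BLOCKING = {"fake-tests", "schema-diff", "hallucinated-import", "side-effect"}
SOFT = {"scope-drift", "commit-msg"}

def report(check, passed, reason=""):
    """Return (message, exit_code) for one matrix cell."""
    if passed:
        return f"OK: {check}", 0
    if check in SOFT:
        # Soft checks surface in the PR comment but never turn the cell red.
        return f"WARN: {reason}", 0
    # Anything not explicitly soft is treated as blocking, including unknown checks.
    return f"FAIL: {reason}", 1

print(report("commit-msg", False, "1/3 commits missing prefix"))
# -> ('WARN: 1/3 commits missing prefix', 0)
print(report("schema-diff", False, "openapi.yaml changed without migration"))
# -> ('FAIL: openapi.yaml changed without migration', 1)
```

Moving a check between the two sets is then a one-line diff that shows up in review, which is the "consciously" part.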

The trade-off for assertion 3 (scope-drift) is worth naming: setting it soft means real scope explosions slip past CI and land in human review. The alternative (blocking) means every PR that touches a file outside the scope block fails, including legitimate “while I was here” fixes. I picked soft after two weeks of false positives during the pilot. Your repo’s culture decides which side is right.
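
Here is roughly how the overlap number could be computed, as a sketch. It assumes the PR description lists one path prefix per bullet under a "### Scope" heading and that the section ends at the next heading; your template may differ. The function names are invented for this example.

```python
def declared_scope(description):
    """Collect path prefixes listed under a '### Scope' heading."""
    prefixes, in_scope = [], False
    for line in description.splitlines():
        if line.startswith("#"):
            in_scope = line.strip().lower() == "### scope"
            continue
        if in_scope:
            item = line.strip().lstrip("-*").strip().strip("`")
            if item:
                prefixes.append(item.rstrip("/"))
    return prefixes

def scope_overlap(changed_paths, prefixes):
    """Fraction of changed files that sit under a declared prefix."""
    if not changed_paths:
        return 1.0
    inside = [
        p for p in changed_paths
        if any(p == pre or p.startswith(pre + "/") for pre in prefixes)
    ]
    return len(inside) / len(changed_paths)

DESCRIPTION = "Fix rounding\n\n### Scope\n- src/billing\n\n### Notes\n- src/auth is untouched\n"
CHANGED = ["src/billing/invoice.py", "src/billing/tax.py", "src/billing/util.py", "src/auth/session.py"]

overlap = scope_overlap(CHANGED, declared_scope(DESCRIPTION))
print(f"{overlap:.0%}")                      # -> 75%
print("WARN" if overlap < 0.80 else "OK")    # -> WARN
```

With three of four files in scope, a single "while I was here" fix is enough to trip the 80% line on a small PR, which is the false-positive pattern that pushed me to soft.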

Job 3: how do you gate merges on behavioral test coverage delta?

Job 3 fires only on PRs labeled tier:T0 or tier:T1. It computes a behavioral coverage delta, the count of new or changed test names plus contract tests between base and head, and fails the required check behavioral-gate if the delta is zero on a PR that touches src/*. Behavioral coverage delta is not line coverage.

What: A required check that blocks merge on T0/T1 PRs lacking behavioral test signal.

Why: AI-authored PRs frequently change behavior without adding tests for that behavior (Part 1 failure mode #1). Job 3 is the same check at queue scale.

Code (.github/workflows/behavioral-gate.yml):

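name: behavioral-gate
on:
  pull_request:
    # "labeled" re-runs the gate once the tier label lands. Labels applied with the
    # default GITHUB_TOKEN do not trigger workflows, so Job 1 needs a bot or App token for this.
    types: [opened, synchronize, reopened, labeled]
jobs:
  gate:
    if: contains(github.event.pull_request.labels.*.name, 'tier:T0') || contains(github.event.pull_request.labels.*.name, 'tier:T1')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Compute behavioral coverage delta
        run: python3 .ci/behavioral_delta.py "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}"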

The script (.ci/behavioral_delta.py, sketch):

import re, subprocess, sys

base, head = sys.argv[1], sys.argv[2]

def test_names(ref):
    # git grep searches file contents at a ref; `git show ref:tests/` would only list the directory.
    # Exit code 1 just means "no matches", so check_output is not used here.
    result = subprocess.run(
        ["git", "grep", "-h", "def test_", ref, "--", "tests/"],
        capture_output=True,
    )
    out = result.stdout.decode("utf-8", "ignore")
    return set(re.findall(r"def (test_\w+)", out))

base_tests = test_names(base)
head_tests = test_names(head)
new_tests = head_tests - base_tests

# Three-dot diff: only what the PR branch changed under src/ since it forked from base.
src_changed = subprocess.check_output(
    ["git", "diff", "--name-only", f"{base}...{head}", "--", "src/"]
).decode().splitlines()

if src_changed and not new_tests:
    print(f"FAIL: src/* changed ({len(src_changed)} files) but zero new test names")
    sys.exit(1)
print(f"OK: {len(new_tests)} new test names for {len(src_changed)} changed src files")

Verify: Create a test PR touching src/billing/invoice.py with no test changes. Label it tier:T1. Expected: behavioral-gate reports red within two minutes. Add a test_invoice_rounds_half_up() to tests/billing/. Re-run. Green.

Key insight: Behavioral coverage delta is not line coverage. A test that mocks the new code path raises line coverage without proving the behavior is correct. Counting new test names plus contract test entries is a coarse but honest proxy: the AI can’t write a new test name unless it’s testing something at least nominally new. Honest because it’s measurable, coarse because a deceptive author could name a test poorly. Tighten with contract-test integration later.

A real trade-off: this gate has false positives on refactors. If you renamed src/billing/invoice.py to src/billing/invoicing.py with no behavior change, Job 3 will fail unless you also touch tests. The fix is a refactor-only label that bypasses the gate, and a culture norm that pure refactors get that label or get split. I prefer the friction.
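
If you want CI to at least hint when the refactor-only label might apply, git can report renames with a similarity score. The sketch below parses the output of `git diff --name-status -M base...head`, where an untouched rename shows up as R100. The helper name is hypothetical and the label decision stays with a human; this only detects the trivial case.

```python
def is_pure_rename(name_status):
    """True when every entry is a 100%-similarity rename, i.e. no file content changed."""
    lines = [line for line in name_status.splitlines() if line.strip()]
    return bool(lines) and all(line.split("\t")[0] == "R100" for line in lines)

# Output shaped like `git diff --name-status -M`: status, then tab-separated paths.
print(is_pure_rename("R100\tsrc/billing/invoice.py\tsrc/billing/invoicing.py\n"))  # -> True
print(is_pure_rename("R092\tsrc/billing/invoice.py\tsrc/billing/invoicing.py\nM\tsrc/api/routes.py\n"))  # -> False
```

In practice a rename usually drags import updates along with it, so expect this to return False more often than you'd like. That is part of why I lean on the label and the split-the-PR norm rather than on detection.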

How do you build the PR bot that posts an idempotent spot-check comment?

The bot is one small GitHub Action that reads outputs from Jobs 1, 2, and 3, then renders a single PR review comment, idempotent so it updates instead of duplicating. The comment lists the tier, six assertion results, and a checklist of spot-checks only a human can do: taste, architectural intent, and business fit.

What: One workflow, one comment per PR, updated on every push.

Why: Without the bot, a reviewer arrives at the PR and has to mentally aggregate three job result pages. With it, they see one comment saying “here’s what the machine confirmed, here’s what you still owe.”

Code (.github/workflows/pr-bot.yml):

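name: pr-bot
on:
  workflow_run:
    workflows: [tier-label, six-asserts, behavioral-gate]
    types: [completed]
permissions:
  contents: read
  pull-requests: write   # needed to create and update the comment
jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      # render_comment.js lives in the repo, so it has to be checked out first.
      - uses: actions/checkout@v4
      - uses: actions/github-script@v7
        with:
          script: |
            // workflow_run events carry no issue context; read the PR number from the payload.
            // The list is empty for PRs from forks, in which case there is nothing to comment on.
            const pr = context.payload.workflow_run.pull_requests[0];
            if (!pr) return;
            const issue_number = pr.number;
            const body = require('./.ci/render_comment.js')(context);
            const { data: comments } = await github.rest.issues.listComments({
              owner: context.repo.owner, repo: context.repo.repo,
              issue_number, per_page: 100,
            });
            // The hidden marker is what makes the comment idempotent.
            const existing = comments.find(c =>
              c.user.login === 'github-actions[bot]' &&
              c.body.includes('<!-- pr-review-bot:v1 -->'));
            if (existing) {
              await github.rest.issues.updateComment({
                owner: context.repo.owner, repo: context.repo.repo,
                comment_id: existing.id, body,
              });
            } else {
              await github.rest.issues.createComment({
                owner: context.repo.owner, repo: context.repo.repo,
                issue_number, body,
              });
            }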

The rendered comment looks like this:

<!-- pr-review-bot:v1 -->
## Review status — tier:T1, ai-authored
**Automated checks**
- [x] fake-tests: OK (14 assertions, 2 mock-only, 14%)
- [x] schema-diff: OK
- [ ] scope-drift: WARN (PR touches src/auth, scope block lists src/billing)
- [x] hallucinated-import: OK
- [x] side-effect: OK
- [x] commit-msg: OK
- [x] behavioral-gate: OK (3 new test names for 2 changed src files)
**Spot-check (human required)**
- [ ] Taste review: does the API shape match how the rest of the codebase reads?
- [ ] Architectural intent: does this fit the boundary the domain layer expects?
- [ ] Business intent: is this on the current sprint scope, or scope creep?
- [ ] Logic review: does the new test actually cover the behavior, not just the call shape?

Verify: Open a test PR, wait for all three jobs to finish, expect one bot comment. Push a fixup. Expect the same comment to update in place, not a duplicate.
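
The update-in-place behavior comes down to one decision: is there already a bot comment carrying the marker? The workflow makes that decision in JavaScript; the sketch below restates it in Python purely to show the logic in isolation, with comments represented as plain dicts shaped like the fields the workflow reads (id, body, user.login). The function name plan_upsert is made up.

```python
MARKER = "<!-- pr-review-bot:v1 -->"

def plan_upsert(comments, bot_login="github-actions[bot]"):
    """Decide whether to update an existing bot comment or create a new one."""
    for comment in comments:
        if comment["user"]["login"] == bot_login and MARKER in comment["body"]:
            return ("update", comment["id"])
    return ("create", None)

thread = [
    {"id": 1, "user": {"login": "alice"}, "body": "LGTM once tests land"},
    {"id": 2, "user": {"login": "github-actions[bot]"}, "body": MARKER + "\n## Review status"},
]
print(plan_upsert(thread))      # -> ('update', 2)
print(plan_upsert(thread[:1]))  # -> ('create', None)
```

Checking both the author and the marker matters: the marker alone would let anyone who quotes the bot's comment hijack the update target.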

Key insight: actions/github-script is enough. You do not need to stand up a separate service, a Danger.js install, or a GitHub App with its own server. The bot is fifty lines of YAML plus a hundred lines of JS, and it lives in the same repo as the assertions, which keeps the encoding loop tight.

What does the loop look like running while you’re away, and what doesn’t it cover?

By 8 a.m. Monday, seven AI-authored PRs have landed overnight. Four passed all six assertions and the behavioral gate, and the bot rendered “automated checks green, spot-check pending” on each; you spend five minutes per PR on taste review and merge. Two failed behavioral-gate; you comment “needs tests” without reading the diff. One failed hallucinated-import; you close it. Total time from laptop-open to inbox-zero on the AI-PR queue: forty minutes. Without the CI loop, that morning is two and a half hours.

That’s the success picture. Now the limits, named so nobody confuses CI with judgment.

What the loop does not cover:

  • Taste. The bot can’t tell you the API shape is wrong, that a function name lies about what it does, or that the abstraction leaks. Naming is hard, and CI can’t help.
  • Architectural intent. The bot can flag schema changes (Job 2 assertion 2), but it can’t tell you the schema change violates a contract the rest of the system depends on.
  • Business fit. The bot can’t tell you this feature was deprioritized last sprint and shouldn’t ship at all. That’s a roadmap decision, not a code check.
  • Subtle logic. The behavioral gate counts new test names; it doesn’t read them. A test named test_handles_edge_case that asserts True == True passes Job 3. Only a human catches that.

| What the CI loop does | What you still do |
| --- | --- |
| Auto-label tier and AI-author signature | Override label when path heuristic is wrong |
| Block PRs with hallucinated imports, schema drift without migrations, side-effect changes without infra-label, mock-only tests | Read the diff for taste, naming, and abstraction quality |
| Fail T0/T1 PRs with zero behavioral coverage delta | Confirm the new tests actually exercise the new behavior |
| Render one consolidated comment with assertion results | Tick the spot-check checklist after thinking |

Key insight: The line between “encode into CI” and “still requires a human” tracks roughly the line between syntactic correctness and semantic intent. The machine handles signature-level checks; anything that requires modeling what the code means stays human. Naming the line clearly is how you stop the bot from being overtrusted six months in.

You have the three jobs, the bot, and the boundary. If you’re new here, Part 1 walks through reviewing a single AI-authored PR end to end. If you have the checklist but haven’t built queue-level triage yet, Part 2 covers blast-radius tiering by hand. The full arc lives at the series index.

FAQ

Q: How do you detect an AI-authored PR in CI?

A: Two signals. First, scan commit trailers for Co-Authored-By: Claude, Co-Authored-By: Cursor, or Co-Authored-By: Codex. Second, check the PR description for a magic marker that your team’s agent template emits. Combine both because different agents leave different signatures, and the marker survives squash-merges that strip trailers.
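
Combining the two signals is a few lines. In this sketch the trailer pattern mirrors the one in classify_pr.py, while the marker string is a placeholder I made up: substitute whatever your agent template actually emits.

```python
import re

# Same agents the classifier looks for, anchored to the start of a trailer line.
TRAILER = re.compile(r"^Co-Authored-By:\s*(Claude|Cursor|Codex)\b", re.I | re.M)
# Hypothetical marker; replace with the one your team's agent template writes.
MARKER = "<!-- ai-agent -->"

def is_ai_authored(trailers, pr_description):
    """True if either the commit trailers or the PR description carries an agent signature."""
    return bool(TRAILER.search(trailers)) or MARKER in pr_description

print(is_ai_authored("Co-authored-by: Claude <noreply@example.com>\n", ""))  # -> True
print(is_ai_authored("", "Adds retry logic.\n<!-- ai-agent -->"))             # -> True
print(is_ai_authored("Signed-off-by: Alice <alice@example.com>\n", "Manual fix"))  # -> False
```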

Q: How is the behavioral-gate different from a line-coverage gate?

A: Line coverage can rise when a test adds a mock that touches a new code path without asserting anything meaningful about it. Behavioral coverage delta requires a new test name or a new contract-test entry, evidence that a new behavior exists and is being tested by name. It’s coarser than line coverage but harder to game with mocks.

Q: Should the bot auto-close PRs that fail blocking assertions, or just comment?

A: Comment, do not auto-close. Closing PRs by bot creates a culture where contributors stop trusting the queue. Failed required checks block merge; that’s enough. The human reviewer decides whether to reopen-and-fix or close-and-redo, and that decision is part of the social contract you do not want to automate.

Q: How long does this take to set up on a normal repo?

A: One to two days of focused work for a medium repo that already has GitHub Actions. Day one: Job 1 plus the six assertions in Job 2. Day two: Job 3 plus the bot. Add a third day if you don’t have CODEOWNERS yet, because the required-reviewer routing depends on it.

Q: Do you still need manual review after this is set up?

A: Yes. The bot handles roughly 60-70% of the line-by-line checking the Part 1 checklist used to ask of you. The remaining 30-40% (taste, architectural intent, business fit, subtle logic) is precisely the work that justifies your salary. The setup buys back the time you were spending on the mechanical part.