# agent-tool-pr-reviewer

CLI that reviews the current git branch’s diff against a base ref using a single Pydantic AI call. Emits a typed `findings.json` plus a human-readable `review-output.md` under `<repo>/.ai-review/runs/<timestamp>/`.

Pairs with the pr-review skill in the agent-skills Claude Code plugin marketplace, which runs this CLI and surfaces blocker/high findings to the user.
## Install

```shell
uv tool install --editable D:/agent-tool-pr-reviewer
```

Verify:

```shell
agent-tool-pr-reviewer --version  # 0.2.2
```

The default model is `openrouter:google/gemini-2.5-pro`, which expects `OPENROUTER_API_KEY` in the environment. See “Recommended models” below for the rationale and alternatives.
## Quick start

From inside any repo with at least one rule file:

````shell
mkdir .ai-review
cat > .ai-review/no-bare-fences.md <<'EOF'
---
description: All fenced markdown code blocks must specify a language.
---
# no-bare-fences

Reason: bare ``` blocks render unstyled and lose semantic info.
EOF

git checkout -b feature/x
# make some changes ...
git commit -am "test"
agent-tool-pr-reviewer review
````

Output goes to `.ai-review/runs/<UTC-timestamp>/`. Read `findings.json` for the typed contract or `review-output.md` for a quick human read.
## How it works

```
git diff --merge-base → .ai-review/*.md → single Pydantic AI call → typed Report
                                                                        ↓
                                                  findings.json + review-output.md
```

Deterministic everywhere except the one LLM call. The schema is the contract — every finding has a stable category (`bug` or `project_rule`), a severity (`blocker | high | medium | low`), a file/line range, an evidence quote (verbatim diff lines, 1–500 chars), and (for `project_rule`) a `rule_id` matching the `.md` filename of the violated rule.

See D:/ai-agents/docs/superpowers/specs/2026-05-07-agent-tool-pr-reviewer-design.md for full design rationale.
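The schema contract can be sketched as a minimal dataclass. This is illustrative only — field names mirror the ones documented here, but the real Pydantic models live in `schema.py` and may differ in detail:

```python
from dataclasses import dataclass
from typing import Optional

CATEGORIES = {"bug", "project_rule"}
SEVERITIES = {"blocker", "high", "medium", "low"}

@dataclass
class FindingSketch:
    """Hypothetical mirror of the documented finding contract."""
    category: str
    severity: str
    file: str
    line_start: int
    line_end: int
    evidence: str                   # verbatim diff lines, 1-500 chars
    rule_id: Optional[str] = None   # required when category == "project_rule"

    def validate(self) -> None:
        assert self.category in CATEGORIES
        assert self.severity in SEVERITIES
        assert 1 <= len(self.evidence) <= 500
        if self.category == "project_rule":
            assert self.rule_id, "project_rule findings must name the violated rule"

f = FindingSketch("project_rule", "high", "docs/readme.md", 10, 12,
                  "+x = Nnone", rule_id="no-bare-fences")
f.validate()  # passes: all invariants hold
```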
## CLI reference

### review

Reviews HEAD against the resolved base ref.

| Flag | Default | Notes |
|---|---|---|
| `--base <ref>` | auto-detect | `git symbolic-ref refs/remotes/origin/HEAD` → `main` → `master` → fail |
| `--budget <tokens>` | 80000 | Refuses with exit 2 if the assembled prompt exceeds this. Heuristic: ~4 chars/token. |
| `--rules-dir <path>` | walk up from cwd | First `.ai-review/` directory found before hitting `.git/` or filesystem root |
| `--out <path>` | `<repo>/.ai-review/runs/<ts>/` | When set, suppresses `latest.txt` write |
| `--model <model-string>` | `openrouter:google/gemini-2.5-pro` | Any Pydantic AI model string (`anthropic:claude-sonnet-4-6`, `openai:gpt-4o`, `ollama:llama3.1`, etc.) or `openrouter:<model>` to route through OpenRouter (see Configuration). See “Recommended models” below. |
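The ~4 chars/token budget guard can be sketched like this (a heuristic sketch only; the function name is illustrative, not the CLI's actual code):

```python
import sys

def check_budget(prompt: str, budget_tokens: int = 80_000) -> None:
    # Heuristic from the flag table: roughly 4 characters per token.
    estimated = len(prompt) // 4
    if estimated > budget_tokens:
        # Configuration error -> exit code 2, matching the exit-code table below.
        print(f"error: prompt exceeds --budget {budget_tokens} tokens "
              f"(estimated {estimated})", file=sys.stderr)
        sys.exit(2)

check_budget("x" * 1000)  # ~250 estimated tokens: under budget, returns quietly
```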
### rules list

Prints discovered rules (`<rule_id>\t<description>`) without making any LLM call. Useful for sanity-checking which rules will be applied.
## Exit codes

| Code | Meaning |
|---|---|
| 0 | Review completed; no blocker findings |
| 1 | Review completed; one or more blocker findings present |
| 2 | Configuration error: base ref unresolvable, budget exceeded, malformed rule frontmatter, missing API key, etc. |
## Rule file format

Rules live in `<repo>/.ai-review/<rule-id>.md` (flat, no subdirectories). The filename — minus `.md` — is the `rule_id`. Each file:

```markdown
---
description: One-line summary used by `rules list` and surfaced to the model.
---
# rule-id-here

Free-form prose: explain why, give examples, anti-examples, edge cases.
```

The model receives the full body verbatim.

The frontmatter `description:` field is required — the CLI exits 2 with a clear message on missing or non-string descriptions. Subdirectories of `.ai-review/` (notably `runs/`) are skipped during discovery.
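Discovery and frontmatter parsing can be sketched as follows. This is a hedged approximation — function names are illustrative and the real logic lives in `rules.py`:

```python
from pathlib import Path

def parse_rule(path: Path) -> tuple[str, str]:
    """Return (rule_id, description) or raise on malformed frontmatter."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        raise ValueError(f"{path.name}: missing frontmatter")
    _, frontmatter, _body = text.split("---", 2)
    for line in frontmatter.splitlines():
        if line.startswith("description:"):
            desc = line.split(":", 1)[1].strip()
            if desc:
                return path.stem, desc  # rule_id is the filename minus .md
    raise ValueError(f"{path.name}: frontmatter needs a string description")

def discover_rules(rules_dir: Path) -> list[tuple[str, str]]:
    # Only top-level *.md files; subdirectories such as runs/ are ignored.
    return sorted(parse_rule(p) for p in rules_dir.glob("*.md"))
```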
## Output layout

```
<repo>/.ai-review/
  <rule>.md              ← committed; rules
  runs/                  ← gitignored; one subdir per run
    2026-05-08T03-55-23Z/
      findings.json      ← typed Report (Pydantic-validated)
      review-output.md   ← human-readable rendering
    latest.txt           ← single line: basename of the freshest run
```

`findings.json` round-trips through `pr_reviewer.schema.Report.model_validate_json` cleanly; downstream consumers (e.g., the pr-review skill) can rely on the schema without defensive re-validation.
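A minimal downstream-consumer sketch using only the stdlib — pulling out the blocker/high findings the way the pr-review skill surfaces them. It assumes a top-level `findings` list with the documented field names; the authoritative contract is `pr_reviewer.schema.Report`:

```python
import json

def gating_findings(findings_json: str) -> list[dict]:
    """Return the findings a reviewer should act on before merging."""
    report = json.loads(findings_json)
    return [f for f in report["findings"] if f["severity"] in ("blocker", "high")]

sample = json.dumps({"findings": [
    {"severity": "blocker", "category": "bug", "evidence": "+x = Nnone"},
    {"severity": "low", "category": "project_rule",
     "rule_id": "no-bare-fences", "evidence": "+bare fence"},
]})
print(len(gating_findings(sample)))  # 1
```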
## Configuration

| Env var | Required | Notes |
|---|---|---|
| `OPENROUTER_API_KEY` | when using the default model or any `--model openrouter:...` | OpenRouter routes to many providers (Anthropic, OpenAI, Google, etc.) under one key |
| `ANTHROPIC_API_KEY` | when using `--model anthropic:...` directly | Pydantic AI’s default for the `anthropic:` provider |
| `OPENAI_API_KEY` | when using `--model openai:...` | |
| (other provider keys) | as needed | See Pydantic AI provider docs |

No config file in v1. Everything is via flags + env.
## Using OpenRouter

`--model openrouter:<model-name>` wraps an OpenAIChatModel with Pydantic AI’s OpenRouterProvider. Model names follow OpenRouter’s slash convention:

```shell
agent-tool-pr-reviewer review --model openrouter:anthropic/claude-sonnet-4
agent-tool-pr-reviewer review --model openrouter:openai/gpt-4o
agent-tool-pr-reviewer review --model openrouter:google/gemini-2.5-pro
```

A single `OPENROUTER_API_KEY` covers all of them. The model name shows up in `findings.json`’s `metadata.model` exactly as you typed it (e.g., `openrouter:anthropic/claude-sonnet-4`), so runs across providers stay distinguishable.
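The provider prefix (everything before the first colon) determines which API key is consulted. A sketch of that mapping, built from the Configuration table — illustrative only, not the CLI's actual code:

```python
from typing import Optional

ENV_KEYS = {  # provider prefix -> env var, per the Configuration table
    "openrouter": "OPENROUTER_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}

def required_key(model: str) -> Optional[str]:
    provider = model.partition(":")[0]
    return ENV_KEYS.get(provider)  # None for providers not listed here

assert required_key("openrouter:google/gemini-2.5-pro") == "OPENROUTER_API_KEY"
```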
## Recommended models

Two trials (16 distinct models, 39 successful runs across 4 chorus-sqlserver PRs) produced this preference order — both for single-model use and for the eventual Tier 2 consensus mode:

- `openrouter:google/gemini-2.5-pro` — default. Caught both real bugs across the trials (:rregex in trial 1, error-message wording in trial 1) at ~$0.06/run. Has one known FP class (scope-misalignment on generated fixtures) that the deferred Tier 2 scope filter will eliminate.
- `openrouter:moonshotai/kimi-k2.6` — precision pick. 1 TP, 0 FPs across 4 PRs at ~$0.06/run. Slow on large diffs (up to ~22 min on a 50K-token diff), so a poor fit for interactive use but well-suited to CI and consensus mode.
- `openrouter:deepseek/deepseek-chat-v3.1` — quietness sentinel. 0 TPs, 0 FPs at ~$0.006/run. Useless as a primary reviewer, valuable in a basket: when DeepSeek does emit a finding, it’s worth a closer look because it almost never speaks.

The retrospective with full data is in D:/ai-agents/CONTRIBUTING.md under the agent-tool-pr-reviewer section. Models that did NOT make the cut despite costing more or being marketed for code: Claude Sonnet 4.6, Claude Opus 4.7, GPT-5, GPT-5-mini, Codestral 2508, Qwen3 Coder 480B, GLM 4.6, Grok Code Fast 1, MiniMax M2.7, Llama 4 Maverick (architecturally unusable), DeepSeek R1 Distill (architecturally unusable).
## Troubleshooting

**`UnicodeDecodeError: 'charmap' codec ...` on Windows.** Already fixed in v0.1.0 — `_git` forces `encoding="utf-8"` for subprocess output. If you see it on a non-Windows platform, file an issue.

**`ImportError: Please install anthropic to use the Anthropic model`.** The anthropic SDK lower-bound floats; recent 0.100+ versions removed types that pydantic-ai 0.8.x still imports. The pin in pyproject.toml (`anthropic>=0.61,<0.70`) addresses this. Run `uv tool install --editable . --reinstall` after pulling.

**`error: Could not determine base ref. Pass --base <ref> explicitly.`** Your repo has no origin/HEAD, no main, and no master branch. Pass `--base <whatever-your-default-is>`.

**`error: prompt exceeds --budget N tokens`.** The diff is too big. Either split the PR, or pass `--budget` with a higher value if you trust your model’s context window.
## Calibration notes
Phase-1 trial (2026-05-08, v0.1.0): 4 models (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, DeepSeek V3.1) via OpenRouter on a single Docker/Flyway diff in chorus-sqlserver. 4 findings, 0 true positives. Two failure modes named: external-tool-hallucination (GPT-5 confidently flagged a valid sqlcmd -No flag as invalid) and speculative-downstream-consequences (GPT-5 chained “X fails → Y fails → Z blocked” without grounding). v0.2.0’s required evidence field, system-prompt exclusions, and hedging-word guard on blockers are the targeted response.
Bug-vs-rule FP rate: in phase-1, bug-category findings (no rule_id) had a higher false-positive rate than project_rule findings. The skill body’s calibration note recommends surfacing the evidence quote prominently and asking the user to verify before acting on bug findings. Re-evaluate this once phase-2+ data accumulates.
## What’s NOT in v1

Deferred deliberately:

- GitHub PR mode (`--pr <num>`, `gh` integration) — local branch only.
- Verifier / evaluator-optimizer pass — single LLM call, no second-pass grounding.
- Security findings — covered by Anthropic’s `/security-review` slash command. Out of scope here.
- API/contract-breaking-change, doc-drift, test-coverage categories.
- Auto-chunking for oversized diffs — refuse with exit 2.
- Per-rule scoping (glob `applies_to`) — all rules apply to all changed files.
- Token cost estimation in dollars.
- MCP server wrapping the same logic.
- Auto-apply for suggested fixes (`--fix` mode).

See the spec’s “Out of scope” section for the full list.
## Development

```shell
cd D:/agent-tool-pr-reviewer
uv sync --extra dev
uv run pytest -v
```

63 tests across 8 modules: schema, paths, rules, diff, prompt, render, agent, CLI smoke. Tests use Pydantic AI’s TestModel for deterministic LLM stubbing.
## Architecture

Small, focused modules in `src/pr_reviewer/`:

| Module | Responsibility |
|---|---|
| `schema.py` | `Finding`, `RunMetadata`, `Report` Pydantic models |
| `diff.py` | git subprocess wrappers (`resolve_base_ref`, `extract_diff`, SHAs) |
| `rules.py` | `.ai-review/` walk-up discovery + frontmatter parsing |
| `prompt.py` | system prompt + user prompt builder |
| `render.py` | Report → review-output.md |
| `paths.py` | run dir naming + `latest.txt` pointer |
| `agent.py` | Pydantic AI agent construction |
| `cli.py` | argparse, subcommand dispatch, exit codes |
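The `paths.py` responsibility — a UTC run directory like `2026-05-08T03-55-23Z` plus the `latest.txt` pointer — can be sketched as follows (function name illustrative, not the module's actual API):

```python
from datetime import datetime, timezone
from pathlib import Path

def new_run_dir(runs_root: Path) -> Path:
    # Timestamp matches the Output layout section: colons replaced with
    # hyphens so the directory name is valid on Windows.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    run_dir = runs_root / ts
    run_dir.mkdir(parents=True, exist_ok=True)
    # latest.txt holds a single line: the basename of the freshest run.
    (runs_root / "latest.txt").write_text(ts + "\n", encoding="utf-8")
    return run_dir
```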
Implementation plan: D:/ai-agents/docs/superpowers/plans/2026-05-07-agent-tool-pr-reviewer.md.