
# agent-tool-pr-reviewer

CLI that reviews the current git branch's diff against a base ref using a single Pydantic AI call. Emits a typed `findings.json` plus a human-readable `review-output.md` under `<repo>/.ai-review/runs/<timestamp>/`.


Pairs with the `pr-review` skill in the `agent-skills` Claude Code plugin marketplace, which runs this CLI and surfaces blocker/high findings to the user.

## Install

```bash
uv tool install --editable D:/agent-tool-pr-reviewer
```

Verify:

```bash
agent-tool-pr-reviewer --version    # 0.2.2
```

The default model is `openrouter:google/gemini-2.5-pro`, which expects `OPENROUTER_API_KEY` in the environment. See "Recommended models" below for the rationale and alternatives.

## Quick start

From inside any repo with at least one rule file:

```bash
mkdir .ai-review
cat > .ai-review/no-bare-fences.md <<'EOF'
---
description: All fenced markdown code blocks must specify a language.
---

# no-bare-fences

Reason: bare ``` blocks render unstyled and lose semantic info.
EOF

git checkout -b feature/x
# make some changes ...
git commit -am "test"

agent-tool-pr-reviewer review
```

Output goes to `.ai-review/runs/<UTC-timestamp>/`. Read `findings.json` for the typed contract or `review-output.md` for a quick human read.

## How it works

```text
git diff --merge-base  →  .ai-review/*.md  →  single Pydantic AI call  →  typed Report
                                                                             ↓
                                                   findings.json + review-output.md
```

Deterministic everywhere except the one LLM call. The schema is the contract: every finding has a stable `category` (`bug` or `project_rule`), a `severity` (`blocker` | `high` | `medium` | `low`), a file/line range, an `evidence` quote (verbatim diff lines, 1–500 chars), and (for `project_rule`) a `rule_id` matching the `.md` filename of the violated rule.
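
For illustration, that contract maps onto a Pydantic model shaped roughly like this. This is a sketch built only from the fields named above; the real `pr_reviewer.schema.Finding` may differ in field names and structure:

```python
# Hypothetical sketch of the documented contract; not the actual schema.py.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class Finding(BaseModel):
    category: Literal["bug", "project_rule"]
    severity: Literal["blocker", "high", "medium", "low"]
    file: str                           # path touched by the diff (assumed name)
    line_start: int                     # line range (assumed field names)
    line_end: int
    evidence: str = Field(min_length=1, max_length=500)  # verbatim diff lines
    rule_id: Optional[str] = None       # set for project_rule findings
```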

See `D:/ai-agents/docs/superpowers/specs/2026-05-07-agent-tool-pr-reviewer-design.md` for the full design rationale.

## CLI reference

### review

Reviews HEAD against the resolved base ref.

| Flag | Default | Notes |
|---|---|---|
| `--base <ref>` | auto-detect | `git symbolic-ref refs/remotes/origin/HEAD` → `main` → `master` → fail |
| `--budget <tokens>` | `80000` | Refuses with exit 2 if the assembled prompt exceeds this. Heuristic: ~4 chars/token. |
| `--rules-dir <path>` | walk up from cwd | First `.ai-review/` directory found before hitting `.git/` or the filesystem root |
| `--out <path>` | `<repo>/.ai-review/runs/<ts>/` | When set, suppresses the `latest.txt` write |
| `--model <model-string>` | `openrouter:google/gemini-2.5-pro` | Any Pydantic AI model string (`anthropic:claude-sonnet-4-6`, `openai:gpt-4o`, `ollama:llama3.1`, etc.), or `openrouter:<model>` to route through OpenRouter (see Configuration). See "Recommended models" below. |
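
The budget check is intentionally crude. A sketch of the documented ~4 chars/token heuristic (illustrative only, not the tool's actual code):

```python
# Rough model of the documented budget check: ~4 chars per token,
# exit code 2 (configuration error) on overrun.
def estimate_tokens(prompt: str) -> int:
    return len(prompt) // 4

def check_budget(prompt: str, budget: int = 80_000) -> None:
    if estimate_tokens(prompt) > budget:
        raise SystemExit(2)
```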

### rules list

Prints discovered rules (`<rule_id>\t<description>`, one per line) without making any LLM call. Useful for sanity-checking which rules will be applied.
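
Because the output is plain tab-separated lines, downstream scripts can consume it without the LLM in the loop; a minimal sketch:

```python
# Parse `agent-tool-pr-reviewer rules list` output: one
# "<rule_id>\t<description>" line per rule, no LLM call involved.
import subprocess

out = subprocess.run(
    ["agent-tool-pr-reviewer", "rules", "list"],
    capture_output=True, text=True, check=True,
).stdout

rules = dict(line.split("\t", 1) for line in out.splitlines() if "\t" in line)
for rule_id, description in rules.items():
    print(f"{rule_id}: {description}")
```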

### Exit codes

| Code | Meaning |
|---|---|
| 0 | Review completed; no blocker findings |
| 1 | Review completed; one or more blocker findings present |
| 2 | Configuration error: base ref unresolvable, budget exceeded, malformed rule frontmatter, missing API key, etc. |
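
These codes make the CLI easy to gate on in CI; a sketch of a wrapper that maps them onto pass/fail/error:

```python
# Drive the CLI from CI using the documented exit codes.
import subprocess
import sys

result = subprocess.run(["agent-tool-pr-reviewer", "review"])
if result.returncode == 0:
    print("review clean: no blocker findings")
elif result.returncode == 1:
    print("blocker findings present; see .ai-review/runs/")
    sys.exit(1)
else:  # 2 = configuration error (base ref, budget, frontmatter, API key, ...)
    print("configuration error; review did not run")
    sys.exit(2)
```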

## Rule file format

Rules live in `<repo>/.ai-review/<rule-id>.md` (flat, no subdirectories). The filename, minus `.md`, is the `rule_id`. Each file:

```markdown
---
description: One-line summary used by `rules list` and surfaced to the model.
---

# rule-id-here

Free-form prose: explain why, give examples, anti-examples, edge cases.
The model receives the full body verbatim.
```

The frontmatter `description:` field is required; the CLI exits 2 with a clear message on missing or non-string descriptions. Subdirectories of `.ai-review/` (notably `runs/`) are skipped during discovery.
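
A sketch of that validation contract (illustrative only, not the actual `rules.py`; assumes PyYAML for the frontmatter):

```python
# Illustrative frontmatter check mirroring the documented contract.
from pathlib import Path

import yaml  # PyYAML assumed for this sketch

def load_rule(path: Path) -> tuple[str, str, str]:
    text = path.read_text(encoding="utf-8")
    _, fm, body = text.split("---", 2)   # naive frontmatter split
    meta = yaml.safe_load(fm) or {}
    desc = meta.get("description")
    if not isinstance(desc, str):
        raise SystemExit(2)              # exit 2 on missing/non-string description
    return path.stem, desc, body        # rule_id is the filename minus .md
```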

## Output layout

```text
<repo>/.ai-review/
  <rule>.md                      ← committed; rules
  runs/                          ← gitignored; one subdir per run
    2026-05-08T03-55-23Z/
      findings.json              ← typed Report (Pydantic-validated)
      review-output.md           ← human-readable rendering
    latest.txt                   ← single line: basename of the freshest run
```

`findings.json` round-trips through `pr_reviewer.schema.Report.model_validate_json` cleanly; downstream consumers (e.g., the `pr-review` skill) can rely on the schema without defensive re-validation.
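
For example, a consumer can resolve the freshest run via `latest.txt` and re-hydrate the report. The `.findings`/`.severity` attribute names below are assumptions for illustration:

```python
# Load the freshest run's findings using latest.txt, per the layout above.
from pathlib import Path

from pr_reviewer.schema import Report  # documented round-trip entry point

runs = Path(".ai-review/runs")
latest = runs / (runs / "latest.txt").read_text(encoding="utf-8").strip()
report = Report.model_validate_json(
    (latest / "findings.json").read_text(encoding="utf-8")
)

# Attribute names assumed for illustration; see the actual schema module.
blockers = [f for f in report.findings if f.severity == "blocker"]
```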

## Configuration

| Env var | Required | Notes |
|---|---|---|
| `OPENROUTER_API_KEY` | when using the default model or any `--model openrouter:...` | OpenRouter routes to many providers (Anthropic, OpenAI, Google, etc.) under one key |
| `ANTHROPIC_API_KEY` | when using `--model anthropic:...` directly | Pydantic AI's default for the `anthropic:` provider |
| `OPENAI_API_KEY` | when using `--model openai:...` | |
| (other provider keys) | as needed | See the Pydantic AI provider docs |

No config file in v1. Everything is via flags + env.

## Using OpenRouter

`--model openrouter:<model-name>` wraps an `OpenAIChatModel` with Pydantic AI's `OpenRouterProvider`. Model names follow OpenRouter's slash convention:

```bash
agent-tool-pr-reviewer review --model openrouter:anthropic/claude-sonnet-4
agent-tool-pr-reviewer review --model openrouter:openai/gpt-4o
agent-tool-pr-reviewer review --model openrouter:google/gemini-2.5-pro
```

A single `OPENROUTER_API_KEY` covers all of them. The model name shows up in `findings.json`'s `metadata.model` exactly as you typed it (e.g., `openrouter:anthropic/claude-sonnet-4`), so runs across providers stay distinguishable.
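
For reference, the wrapping described above looks roughly like this in Pydantic AI. This is a sketch; exact class and import names track recent pydantic-ai releases and may differ in yours:

```python
# Sketch of the documented wrapping: an OpenAIChatModel routed through
# OpenRouterProvider. Not necessarily the tool's actual agent.py.
import os

from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

def build_openrouter_model(name: str) -> OpenAIChatModel:
    # `name` is the part after "openrouter:", e.g. "anthropic/claude-sonnet-4".
    provider = OpenRouterProvider(api_key=os.environ["OPENROUTER_API_KEY"])
    return OpenAIChatModel(name, provider=provider)
```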

### Recommended models

Two trials (16 distinct models, 39 successful runs across 4 chorus-sqlserver PRs) produced this preference order, both for single-model use and for the eventual Tier 2 consensus mode:

1. `openrouter:google/gemini-2.5-pro` (default). Caught both real bugs across the trials (the `:r` regex in trial 1, error-message wording in trial 1) at ~$0.06/run. Has one known FP class (scope misalignment on generated fixtures) that the deferred Tier 2 scope filter will eliminate.
2. `openrouter:moonshotai/kimi-k2.6` (precision pick). 1 TP, 0 FPs across 4 PRs at ~$0.06/run. Slow on large diffs (up to ~22 min on a 50K-token diff), so a poor fit for interactive use but well suited to CI and consensus mode.
3. `openrouter:deepseek/deepseek-chat-v3.1` (quietness sentinel). 0 TPs, 0 FPs at ~$0.006/run. Useless as a primary reviewer, valuable in a basket: when DeepSeek does emit a finding, it's worth a closer look because it almost never speaks.

The retrospective with full data is in `D:/ai-agents/CONTRIBUTING.md` under the agent-tool-pr-reviewer section. Models that did NOT make the cut despite costing more or being marketed for code: Claude Sonnet 4.6, Claude Opus 4.7, GPT-5, GPT-5-mini, Codestral 2508, Qwen3 Coder 480B, GLM 4.6, Grok Code Fast 1, MiniMax M2.7, Llama 4 Maverick (architecturally unusable), DeepSeek R1 Distill (architecturally unusable).

## Troubleshooting

**`UnicodeDecodeError: 'charmap' codec ...` on Windows.** Already fixed in v0.1.0: `_git` forces `encoding="utf-8"` for subprocess output. If you see it on a non-Windows platform, file an issue.

**`ImportError: Please install anthropic to use the Anthropic model.`** The anthropic SDK lower bound floats; recent 0.100+ versions removed types that pydantic-ai 0.8.x still imports. The pin in `pyproject.toml` (`anthropic>=0.61,<0.70`) addresses this. Run `uv tool install --editable . --reinstall` after pulling.

**`error: Could not determine base ref. Pass --base <ref> explicitly.`** Your repo has no `origin/HEAD`, no `main`, and no `master` branch. Pass `--base <whatever-your-default-is>`.

**`error: prompt exceeds --budget N tokens.`** The diff is too big. Either split the PR, or pass a higher `--budget` if you trust your model's context window.

## Calibration notes

Phase-1 trial (2026-05-08, v0.1.0): 4 models (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro, DeepSeek V3.1) via OpenRouter on a single Docker/Flyway diff in chorus-sqlserver. 4 findings, 0 true positives. Two failure modes were named: external-tool-hallucination (GPT-5 confidently flagged a valid `sqlcmd -No` flag as invalid) and speculative-downstream-consequences (GPT-5 chained "X fails → Y fails → Z blocked" without grounding). v0.2.0's required `evidence` field, system-prompt exclusions, and hedging-word guard on blockers are the targeted response.

Bug-vs-rule FP rate: in phase 1, bug-category findings (no `rule_id`) had a higher false-positive rate than `project_rule` findings. The skill body's calibration note recommends surfacing the evidence quote prominently and asking the user to verify before acting on bug findings. Re-evaluate this once phase-2+ data accumulates.

## What's NOT in v1

Deferred deliberately; see the spec's "Out of scope" section for the full list.

## Development

```bash
cd D:/agent-tool-pr-reviewer
uv sync --extra dev
uv run pytest -v
```

63 tests across 8 modules: schema, paths, rules, diff, prompt, render, agent, CLI smoke. Tests use Pydantic AI's `TestModel` for deterministic LLM stubbing.
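
A minimal sketch of that stubbing pattern (standalone, not the actual test suite; result attribute names vary across pydantic-ai versions):

```python
# TestModel fabricates a schema-valid response with no network call
# and no API key, which is what makes the test suite deterministic.
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

agent = Agent(TestModel())
result = agent.run_sync("stub prompt")
print(result.output)  # attribute name varies across pydantic-ai versions
```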

## Architecture

Small, focused modules in `src/pr_reviewer/`:

| Module | Responsibility |
|---|---|
| `schema.py` | `Finding`, `RunMetadata`, `Report` Pydantic models |
| `diff.py` | git subprocess wrappers (`resolve_base_ref`, `extract_diff`, SHAs) |
| `rules.py` | `.ai-review/` walk-up discovery + frontmatter parsing |
| `prompt.py` | system prompt + user prompt builder |
| `render.py` | `Report` → `review-output.md` |
| `paths.py` | run dir naming + `latest.txt` pointer |
| `agent.py` | Pydantic AI agent construction |
| `cli.py` | argparse, subcommand dispatch, exit codes |

Implementation plan: `D:/ai-agents/docs/superpowers/plans/2026-05-07-agent-tool-pr-reviewer.md`.