Methodology
The goal is transparency. Everything the CRS testimony left unstated — which bills, which models, which prompts, and how summaries were graded — is fixed here and published in full.
1. The baseline: real CRS summaries
Two datasets (switch between them with the toggle on the leaderboard):
- 119th Congress — 50 recent bills (2025–26), sampled by most-recently-updated and interleaved across chambers (House/Senate bills and joint resolutions). A mix of activity levels.
- 2024 · high-activity — 50 bills from the 118th Congress (2024). Here "high activity" means the number of legislative actions a bill accumulated — every step recorded on Congress.gov (introduction, committee referrals and markups, floor votes, chamber passage, enrollment, signing, etc.), read from the bill's /actions endpoint. More actions is a proxy for how far a bill advanced and how consequential it was, so this set skews toward substantive, contested legislation rather than routine measures. The selected bills range from 27 to 76 actions each (shown as an "N actions" badge on every bill). One filter is applied: bills whose full text exceeds the model input budget (~180k characters) are skipped, so every summary is graded on text the models read in full — this excludes the largest omnibus appropriations bills, which would otherwise top the list.
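The selection rule for the high-activity set can be sketched roughly as follows. This is a minimal illustration, not the project's actual fetch_bills.py: the Bill record, its field names, and the assumption that the set is simply the top n bills by action count after the length filter are all hypothetical.

```python
from dataclasses import dataclass

# Approximate model input budget stated above (~180k characters).
MAX_TEXT_CHARS = 180_000


@dataclass
class Bill:
    # Hypothetical record; the real pipeline's data structure is not shown in this document.
    bill_id: str
    action_count: int  # number of actions listed for the bill
    text: str          # full bill text


def select_high_activity(bills, n=50, max_chars=MAX_TEXT_CHARS):
    """Drop bills whose text exceeds the budget, then keep the n with the most actions."""
    eligible = [b for b in bills if len(b.text) <= max_chars]
    eligible.sort(key=lambda b: b.action_count, reverse=True)
    return eligible[:n]
```

Filtering before ranking is what keeps an oversized omnibus bill from occupying a slot it could never be graded in.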
2. Models under test
3. The grading rubric — binary criteria
The headline metric, "passes all criteria," counts a summary only if it
satisfies every applicable criterion below. Criteria marked conditional are
auto-passed when the judge decides they don't apply to a given bill (e.g. a bill with no dollar figures
isn't penalized on correct_figures). Note: the rubric is this project's own, derived from
the qualities CRS says it values (accuracy, coherence, relevance, objectivity) — it is not an official CRS
standard or determination. We make no claim to define what CRS considers acceptable.
- …
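The "passes all criteria" rule with conditional auto-pass can be sketched as below. This is an assumption-laden illustration rather than the project's evaluate.py: the verdict strings ("pass", "fail", "n/a") and the criterion name "accurate" are hypothetical, and only correct_figures comes from the text above.

```python
def passes_all(verdicts, conditional):
    """Return True only if every applicable criterion passed.

    verdicts:    dict mapping criterion name -> "pass", "fail", or "n/a"
    conditional: set of criterion names that may be auto-passed when "n/a"
    """
    for name, verdict in verdicts.items():
        if verdict == "pass":
            continue
        # A conditional criterion the judge marked not applicable is auto-passed.
        if verdict == "n/a" and name in conditional:
            continue
        # Anything else (a fail, or "n/a" on an unconditional criterion) fails the summary.
        return False
    return True


def pass_rate(all_verdicts, conditional):
    """Fraction of summaries that pass every applicable criterion."""
    if not all_verdicts:
        raise ValueError("no summaries to score")
    passed = sum(passes_all(v, conditional) for v in all_verdicts)
    return passed / len(all_verdicts)
```

Because the headline metric is all-or-nothing, one failed criterion lowers a model's score by a full summary, which is why the per-bill reasons matter more than the percentage.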
4. The judge
Known limitation — not validated against human judgments. The pass/fail verdicts here are the LLM judge's alone. They have not yet been compared against human experts applying the same criteria to the same summaries, so there is no measure of how often the judge and a human would agree on what counts as a pass or a fail. Where a criterion is a close call, the judge's boundary may differ from a human's, and some verdicts are genuinely contestable. Treat the scores as one consistent automated rater's opinion, not ground truth. A human-agreement study (inter-rater reliability on a sample of bills) is the most important next step.
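One way such an agreement study could be scored is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is only an illustration of that statistic for binary pass/fail verdicts; the project has not stated which agreement measure it would use, and the function name and input format are assumptions.

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two raters giving binary verdicts (True = pass).

    judge, human: equal-length sequences of booleans, one entry per summary.
    """
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(judge)
    # Observed agreement: share of summaries where both raters gave the same verdict.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement, from each rater's own pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical label throughout; kappa is undefined,
        # and this sketch reports 1.0 as a convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A chance-corrected measure matters here because when nearly every summary passes, two raters would show high raw agreement even if they disagreed on most of the few failures.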
Known limitation — self-preference. This judge (Opus 4.8) is also one of the contestants, so some bias toward the Claude summaries is possible. The grading of the human CRS summaries on the same rubric is a partial safeguard: if the judge were badly miscalibrated, the human baseline would not score sensibly. The judge is a single configuration line to swap, and a multi-judge panel is a natural extension.
Both datasets judged by Opus 4.8. The 119th Congress set was graded by
anthropic/claude-opus-4.8 via OpenRouter (with the Claude Code fallback noted below); the
2024 high-activity set by the same model running as ten parallel Claude Code subagents, given
the full bill text and a verify-before-failing rule.
A discarded judge pass — why the judge matters. The 2024 set was first graded by a faster,
cheaper model (claude-sonnet-4-6) as subagents. That pass produced a dramatic, very different
ranking — but it contained real errors (e.g. it failed the CRS summary of the FISHES Act for "hallucinating" a
notice requirement that is plainly in the bill text) and an over-strict conciseness bias that pushed the more
verbose models down. We discarded it and re-judged with Opus, after which the field clustered at 96–100% much
like the 119th set. This is the not-validated-against-humans caveat made concrete: a weaker judge gave
a confidently wrong leaderboard. Treat all scores as one automated rater's opinion, and read the published
per-bill reasons rather than the headline alone.
Disclosure — judge access path (119th set). Because the OpenRouter account's credits were exhausted at points, some 119th-set judgments were made via the OpenRouter web plugin and others by the same model (Opus 4.8) running in Claude Code with live web search: the same model and method, reached through a different access path. Roughly 206 model summaries and 6 CRS summaries were judged via OpenRouter; the 44 re-generated model summaries and 44 CRS summaries were judged via Claude Code. Every per-bill verdict and reason is published so anyone can check them.
5. Re-runs & corrections
- Output-length fix. The first run capped model output at 1,500 tokens, which truncated 40 of gemini-3.5-flash's summaries mid-sentence (it writes long) and unfairly tanked its score. The cap was raised to 4,000 tokens and the truncated summaries were re-generated and re-judged. Gemini's score moved from an artifact-driven 24% to 100%.
- A caught CRS error. On H.R. 2066, the official CRS summary states a $450M cap that the bill text puts at $475M; all five models stated it correctly. The CRS row was corrected to reflect the error — an illustration that the rubric is strict enough to catch even an occasional human slip.
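A truncation problem like the one behind the output-length fix can be caught before judging with a simple check. The sketch below is a hypothetical heuristic, not the project's code: it assumes the API response exposes a finish_reason field that equals "length" when the token cap was hit (as OpenAI-compatible chat APIs do), and otherwise falls back to checking whether the text ends in sentence-final punctuation.

```python
def looks_truncated(text, finish_reason=None):
    """Heuristically flag a summary that was cut off mid-sentence.

    finish_reason: optional stop reason reported by the API; "length" is
    assumed to mean the output hit the token cap.
    """
    if finish_reason == "length":
        return True
    stripped = text.rstrip()
    if not stripped:
        return True  # an empty summary is treated as unusable
    # Ignore trailing closing quotes and brackets before checking punctuation.
    core = stripped.rstrip("\"')]\u201d\u2019")
    return not core.endswith((".", "!", "?"))
```

The punctuation fallback is deliberately crude: a summary that legitimately ends in a bullet item without a period would be flagged, so flagged outputs should be reviewed or re-generated rather than scored as failures.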
6. Prompts (verbatim)
Summarization prompt
…
Judge prompt
…
7. Reproduce it
Run the four scripts in order: fetch_bills.py → run_models.py → evaluate.py → report.py.
The criteria live in an editable criteria.yaml; models and the judge live in config.yaml.
See the repository for setup and the raw per-bill results.
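As a rough illustration of the run order, the four stages could be chained from a small driver script. This is a sketch under assumptions: it presumes each script runs with no command-line arguments from the repository root, which this page does not confirm, and the helper names are hypothetical. Check the repository for the actual invocation.

```python
import subprocess
import sys

# Stage order as listed above.
PIPELINE = ["fetch_bills.py", "run_models.py", "evaluate.py", "report.py"]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one command (argument list) per stage, in pipeline order."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each stage in turn; check=True stops at the first stage that fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_pipeline(build_commands()) from the repository root would execute the stages in sequence. Stopping at the first failure avoids generating a report from stale or partial judgments.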