CRS Summary Benchmark: Frontier LLMs vs. real Congressional Research Service summaries

Methodology

The goal is transparency. Everything the CRS testimony left unstated — which bills, which models, which prompts, and how summaries were graded — is fixed here and published in full.

1. The baseline: real CRS summaries

CRS legislative analysts write the official plain-language summaries served by the Congress.gov API. We collect bills that have both a CRS summary and retrievable full text. The bill text is the input given to models; the CRS summary is the human reference. Crucially, the CRS summaries are also graded on the identical criteria (shown as the CRS (human) row) — a calibration check on the judge.

Two datasets (switch between them with the toggle on the leaderboard):

  • 119th Congress — 50 recent bills (2025–26), sampled by most-recently-updated and interleaved across chambers (House/Senate bills and joint resolutions). A mix of activity levels.
  • 2024 · high-activity — 50 bills from the 118th Congress (2024). Here "high activity" means the number of legislative actions a bill accumulated — every step recorded on Congress.gov (introduction, committee referrals and markups, floor votes, chamber passage, enrollment, signing, etc.), read from the bill's /actions endpoint. More actions is a proxy for how far a bill advanced and how consequential it was, so this set skews toward substantive, contested legislation rather than routine measures. The selected bills range from 27 to 76 actions each (shown as an "N actions" badge on every bill). One filter is applied: bills whose full text exceeds the model input budget (~180k characters) are skipped, so every summary is graded on text the models read in full — this excludes the largest omnibus appropriations bills, which would otherwise top the list.
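The selection rule for the high-activity set can be sketched as follows. This is a minimal illustration, not the project's fetch_bills.py: the `Bill` record, its field names, and `select_high_activity` are hypothetical, the sample bills are made up, and the 180,000-character limit is the approximate budget quoted above.

```python
from dataclasses import dataclass
from typing import List

MAX_INPUT_CHARS = 180_000  # approximate model input budget (assumed exact here)

@dataclass
class Bill:
    bill_id: str       # hypothetical identifier
    action_count: int  # number of entries read from the bill's /actions endpoint
    text: str          # full bill text given to the models

def select_high_activity(bills: List[Bill], n: int = 50,
                         max_chars: int = MAX_INPUT_CHARS) -> List[Bill]:
    """Return the n bills with the most actions whose text fits the input budget."""
    # Skip bills the models could not read in full.
    eligible = [b for b in bills if len(b.text) <= max_chars]
    # Rank the remainder by how many legislative actions they accumulated.
    ranked = sorted(eligible, key=lambda b: b.action_count, reverse=True)
    return ranked[:n]

sample = [
    Bill("bill-a", 76, "x" * 1_000),
    Bill("bill-b", 120, "x" * 200_000),  # over budget, so skipped despite most actions
    Bill("bill-c", 27, "x" * 5_000),
]
picked = select_high_activity(sample, n=2)
print([b.bill_id for b in picked])  # ['bill-a', 'bill-c']
```

Filtering before ranking is what keeps an oversized omnibus bill from occupying a slot it would otherwise win on action count.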

2. Models under test

Each model receives the same neutral summarization prompt and the bill text, with no CRS-specific coaching. Models tested: . All are called through OpenRouter.

3. The grading rubric — binary criteria

The headline metric, "passes all criteria," counts a summary only if it satisfies every applicable criterion below. Criteria marked conditional are auto-passed when the judge decides they don't apply to a given bill (e.g. a bill with no dollar figures isn't penalized on correct_figures). Note: this is this project's rubric, derived from the qualities CRS says it values (accuracy, coherence, relevance, objectivity) — it is not an official CRS standard or determination. We make no claim to define what CRS considers acceptable.
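The headline metric and the auto-pass rule for conditional criteria can be expressed in a few lines. This is a sketch under assumptions: the verdict encoding (`True`, `False`, `None` for "not applicable") and the function names are hypothetical, and `no_hallucination` is a made-up criterion name used only for illustration; `correct_figures` is the one name taken from the text above.

```python
from typing import Dict, List, Optional

# Verdict per criterion: True = pass, False = fail,
# None = conditional criterion the judge marked as not applicable.
Verdicts = Dict[str, Optional[bool]]

def passes_all(verdicts: Verdicts) -> bool:
    """A summary passes only if no applicable criterion failed."""
    return all(v is not False for v in verdicts.values())

def pass_rate(all_verdicts: List[Verdicts]) -> float:
    """Share of summaries that pass every applicable criterion."""
    if not all_verdicts:
        return 0.0
    return sum(passes_all(v) for v in all_verdicts) / len(all_verdicts)

graded = [
    {"correct_figures": None, "no_hallucination": True},   # no dollar figures: auto-pass
    {"correct_figures": False, "no_hallucination": True},  # wrong figure: fails overall
]
print(pass_rate(graded))  # 0.5
```

Because a single failed criterion sinks the whole summary, this metric is stricter than an average over criteria.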

4. The judge

A single strong model, Opus 4.8 (anthropic/claude-opus-4.8), scores each summary against the bill text (the ground truth) and returns a pass/fail verdict plus a one-line reason per criterion.

Known limitation — not validated against human judgments. The pass/fail verdicts here are the LLM judge's alone. They have not yet been compared against human experts applying the same criteria to the same summaries, so there is no measure of how often the judge and a human would agree on what counts as a pass or a fail. Where a criterion is a close call, the judge's boundary may differ from a human's, and some verdicts are genuinely contestable. Treat the scores as one consistent automated rater's opinion, not ground truth. A human-agreement study (inter-rater reliability on a sample of bills) is the most important next step.
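One common way to quantify judge-versus-human agreement in such a study is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration only: no such study has been run, the function name is hypothetical, and the verdict lists are invented to show the arithmetic.

```python
from typing import Sequence

def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail verdicts."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(judge)
    # Observed agreement: share of items where both raters gave the same verdict.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's own pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return float("nan")  # undefined when both raters give one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented verdicts for four summaries on one criterion.
judge_verdicts = [True, True, False, True]
human_verdicts = [True, False, False, True]
print(cohens_kappa(judge_verdicts, human_verdicts))  # 0.5
```

When nearly every verdict is a pass, as in a field clustered near 100%, kappa becomes unstable, so reporting raw agreement and the disputed cases alongside it would be sensible.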

Known limitation — self-preference. This judge (Opus 4.8) is also one of the contestants, so some bias toward the Claude summaries is possible. The grading of the human CRS summaries on the same rubric is a partial safeguard: if the judge were badly miscalibrated, the human baseline would not score sensibly. The judge is a single configuration line to swap, and a multi-judge panel is a natural extension.
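A multi-judge panel could combine verdicts per criterion by majority vote. The sketch below is one possible aggregation, not something the project implements: the function name, the judge labels, the `neutral_tone` criterion, and the ties-fail rule are all assumptions.

```python
from typing import Dict

def panel_verdicts(per_judge: Dict[str, Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-criterion verdicts from several judges by strict majority."""
    if not per_judge:
        raise ValueError("need at least one judge")
    criteria = next(iter(per_judge.values())).keys()
    combined = {}
    for criterion in criteria:
        votes = [verdicts[criterion] for verdicts in per_judge.values()]
        # Pass only when more than half of the judges pass; ties count as a fail.
        combined[criterion] = sum(votes) * 2 > len(votes)
    return combined

votes = {
    "judge_a": {"correct_figures": True, "neutral_tone": True},
    "judge_b": {"correct_figures": False, "neutral_tone": True},
    "judge_c": {"correct_figures": False, "neutral_tone": True},
}
print(panel_verdicts(votes))  # {'correct_figures': False, 'neutral_tone': True}
```

Drawing the judges from different model families would be the point of such a panel, since a panel of related models could share the same self-preference.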

Both datasets judged by Opus 4.8. The 119th Congress set was graded by anthropic/claude-opus-4.8 via OpenRouter (with the Claude Code fallback noted below); the 2024 high-activity set by the same model running as ten parallel Claude Code subagents, given the full bill text and a verify-before-failing rule.

A discarded judge pass — why the judge matters. The 2024 set was first graded by a faster, cheaper model (claude-sonnet-4-6) as subagents. That pass produced a dramatic, very different ranking — but it contained real errors (e.g. it failed the CRS summary of the FISHES Act for "hallucinating" a notice requirement that is plainly in the bill text) and an over-strict conciseness bias that pushed the more verbose models down. We discarded it and re-judged with Opus, after which the field clustered at 96–100% much like the 119th set. This is the not-validated-against-humans caveat made concrete: a weaker judge gave a confidently wrong leaderboard. Treat all scores as one automated rater's opinion, and read the published per-bill reasons rather than the headline alone.

Disclosure — judge access path (119th set). Because the OpenRouter account's credits were exhausted at points, some 119th-set judgments were made via the OpenRouter web plugin and others by the same model (Opus 4.8) running in Claude Code with live web search — same model and method, a different access path. Roughly 206 model summaries and 6 CRS summaries via OpenRouter; the 44 re-generated model summaries and 44 CRS summaries via Claude Code. Every per-bill verdict and reason is published so anyone can check them.

5. Re-runs & corrections

In the spirit of full transparency, the substantive changes made after the first run:
  • Output-length fix. The first run capped model output at 1,500 tokens, which truncated 40 of gemini-3.5-flash's summaries mid-sentence (it writes long) and unfairly tanked its score. The cap was raised to 4,000 tokens and the truncated summaries were re-generated and re-judged. Gemini's score moved from an artifact-driven 24% to 100%.
  • A caught CRS error. On H.R. 2066, the official CRS summary states a $450M cap that the bill text puts at $475M; all five models stated it correctly. The CRS row was corrected to reflect the error — an illustration that the rubric is strict enough to catch even an occasional human slip.
This is why the human CRS baseline lands slightly below the models: on this faithfulness-and-completeness rubric the current models are uniformly thorough and accurate, and the rubric flags the rare CRS omission or misstatement. It does not mean the models are "better than CRS" at CRS's actual job.
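The truncation problem behind the output-length fix can be screened for before judging. The check below is a heuristic sketch, not the project's code: `looks_truncated` is a hypothetical helper, and it assumes the pipeline records the `finish_reason` field that OpenAI-compatible chat completion responses use, where `"length"` indicates the token cap was hit.

```python
from typing import Optional

# Endings that suggest a complete sentence, including closing quotes and brackets.
TERMINAL_ENDINGS = (".", "!", "?", '."', ".'", ".)", ".]")

def looks_truncated(text: str, finish_reason: Optional[str] = None) -> bool:
    """Flag a summary that probably stopped before the model was done."""
    # The API's own signal, when available, is the most reliable one.
    if finish_reason == "length":
        return True
    stripped = text.rstrip()
    # Fallback heuristic: empty output or no sentence-ending punctuation.
    return not stripped or not stripped.endswith(TERMINAL_ENDINGS)

print(looks_truncated("The bill requires the Secretary to"))        # True
print(looks_truncated("The bill sets a cap on annual spending."))   # False
print(looks_truncated("A complete sentence.", finish_reason="length"))  # True
```

The punctuation fallback will also flag summaries that end in an unpunctuated bullet list, so flagged items are candidates for review and re-generation rather than automatic failures.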

6. Prompts (verbatim)

Summarization prompt

Judge prompt

7. Reproduce it

The full pipeline is open source: fetch_bills.py, run_models.py, evaluate.py, report.py. The criteria live in an editable criteria.yaml; models and the judge live in config.yaml. See the repository for setup and the raw per-bill results.