CRS Summary Benchmark: Frontier LLMs vs. real Congressional Research Service summaries

Can today's LLMs match the Congressional Research Service?

The CRS testified that, across ~1,000 bills tested in 2024, fewer than 3% of AI-generated summaries met its standards for accuracy, coherence, relevance, and objectivity — but it never published the bills, the rubric, or the models. This is an open, reproducible re-run: current frontier models summarize real bills, and an LLM judge grades every summary against an inspectable checklist of binary criteria. The genuine, human-written CRS summaries are graded on the same checklist as a baseline.

Every number here is auditable. Browse each bill to read the model and CRS summaries side by side with the judge's per-criterion verdicts, or read the exact criteria, prompts, and judge.

Leaderboard


Per-criterion pass rate

Share of applicable summaries that passed each criterion. Greener is better. This is where the headline "passes all criteria" number comes from.
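As a rough illustration of how these two numbers relate, the sketch below computes a per-criterion pass rate and a "passes all criteria" rate from judge verdicts. The record layout (summarizer, bill, criterion, verdict), the use of None for "not applicable", and the function names are assumptions made for this example, not the benchmark's actual data format or code.

```python
from collections import defaultdict

# Hypothetical verdict record: (summarizer, bill_id, criterion, verdict).
# verdict is True (pass), False (fail), or None (criterion not applicable).

def per_criterion_pass_rate(verdicts):
    """Share of applicable summaries that passed, per (summarizer, criterion)."""
    passed = defaultdict(int)
    applicable = defaultdict(int)
    for summarizer, _bill, criterion, verdict in verdicts:
        if verdict is None:
            continue  # not applicable: excluded from the denominator
        key = (summarizer, criterion)
        applicable[key] += 1
        passed[key] += int(verdict)
    return {key: passed[key] / applicable[key] for key in applicable}

def passes_all_rate(verdicts):
    """Share of each summarizer's summaries that passed every applicable criterion."""
    by_summary = defaultdict(list)
    for summarizer, bill, _criterion, verdict in verdicts:
        if verdict is not None:
            by_summary[(summarizer, bill)].append(verdict)
    totals = defaultdict(int)
    clean = defaultdict(int)
    for (summarizer, _bill), results in by_summary.items():
        totals[summarizer] += 1
        clean[summarizer] += int(all(results))
    return {s: clean[s] / totals[s] for s in totals}

# Made-up verdicts for one hypothetical summarizer on two bills.
example = [
    ("model-a", "bill-1", "accuracy", True),
    ("model-a", "bill-1", "objectivity", None),
    ("model-a", "bill-2", "accuracy", False),
    ("model-a", "bill-2", "objectivity", True),
]
criterion_rates = per_criterion_pass_rate(example)
headline_rates = passes_all_rate(example)
```

In this sketch a single failed criterion is enough to keep a summary out of the headline number, which is why the headline rate can sit well below every individual per-criterion rate.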


Summary length

How verbose is each summarizer? Each lane is a pavement plot (rendered with the pavement library) of that summarizer's 50 summaries, on a shared word-count axis. Each lane is split into eight equal-mass blocks, each holding one eighth of the summaries, so a narrow block means lengths cluster there and a wide block means they spread out. Read block width as inverse density: the narrower the block, the denser the lengths. Hover any block for its exact share and value range.
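To make "equal-mass blocks" concrete, here is a minimal sketch that derives eight block boundaries from a list of word counts using octiles. It uses only Python's standard library and does not call the pavement library; how that library actually computes its blocks is not documented here, so the quantile method and the function name are assumptions.

```python
import statistics

def equal_mass_blocks(word_counts, n_blocks=8):
    """Return (low, high) word-count ranges that each hold about 1/n_blocks of the data.

    Assumption: blocks are bounded by quantile cut points, with the minimum
    and maximum as the outer edges.
    """
    cuts = statistics.quantiles(word_counts, n=n_blocks, method="inclusive")
    edges = [min(word_counts), *cuts, max(word_counts)]
    return list(zip(edges[:-1], edges[1:]))

# Made-up data: 50 summaries with evenly spread lengths from 100 to 149 words.
lengths = list(range(100, 150))
blocks = equal_mass_blocks(lengths)
widths = [high - low for low, high in blocks]
```

With evenly spread lengths every block comes out the same width. If many summaries clustered around one length, the blocks covering that region would narrow and the others would widen.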
