Can today's LLMs match the Congressional Research Service?
The CRS testified that, across ~1,000 bills tested in 2024, fewer than 3% of AI-generated summaries met its standards for accuracy, coherence, relevance, and objectivity — but it never published the bills, the rubric, or the models. This is an open, reproducible re-run: current frontier models summarize real bills, and an LLM judge grades every summary against an inspectable checklist of binary criteria. The genuine, human-written CRS summaries are graded on the same checklist as a baseline.
Leaderboard
Per-criterion pass rate
Share of applicable summaries that passed each criterion. Greener is better. This is where the headline "passes all criteria" number comes from.
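As a rough illustration of how the two numbers relate, here is a minimal sketch of the aggregation. It assumes each graded summary is a mapping from criterion name to a pass/fail verdict, with `None` marking a criterion that does not apply. The criterion names, the data, and both function names are hypothetical and are not taken from the actual grading pipeline.

```python
from typing import Optional

# Hypothetical verdict format: True = pass, False = fail, None = not applicable.
Verdicts = dict[str, Optional[bool]]


def per_criterion_pass_rate(graded: list[Verdicts]) -> dict[str, float]:
    """Share of applicable summaries that passed each criterion."""
    passed: dict[str, int] = {}
    applicable: dict[str, int] = {}
    for verdicts in graded:
        for criterion, verdict in verdicts.items():
            if verdict is None:
                continue  # not applicable: excluded from the denominator
            applicable[criterion] = applicable.get(criterion, 0) + 1
            passed[criterion] = passed.get(criterion, 0) + int(verdict)
    return {c: passed[c] / applicable[c] for c in applicable}


def passes_all_rate(graded: list[Verdicts]) -> float:
    """Share of summaries that passed every criterion that applied to them."""
    if not graded:
        return 0.0
    clean = sum(
        all(v for v in verdicts.values() if v is not None)
        for verdicts in graded
    )
    return clean / len(graded)


# Made-up verdicts for four summaries (illustrative only, not real results).
graded = [
    {"accurate": True, "objective": True, "covers_amendments": None},
    {"accurate": True, "objective": False, "covers_amendments": True},
    {"accurate": False, "objective": True, "covers_amendments": None},
    {"accurate": True, "objective": True, "covers_amendments": True},
]

rates = per_criterion_pass_rate(graded)
headline = passes_all_rate(graded)
```

Under these assumptions a single failed criterion removes a summary from the headline count, so the "passes all criteria" share can never exceed the lowest per-criterion pass rate when every criterion applies to every summary.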
Summary length
How verbose is each summarizer? Each lane is a pavement plot (rendered with the pavement library) of that summarizer's 50 summaries, on a shared word-count axis. Each lane is split into eight equal-mass blocks, each holding roughly one eighth of the summaries, so a narrow block means lengths cluster there and a wide block means they spread out: the narrower the block, the higher the density. Hover any block for its exact share and value range.
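The equal-mass idea can be sketched with octiles: cut the sorted word counts at seven quantile points so that each of the eight resulting ranges holds about the same number of summaries. This is only an illustration built on Python's standard `statistics.quantiles`. The function name `equal_mass_blocks` is hypothetical, and the pavement library may compute its block edges differently.

```python
import statistics


def equal_mass_blocks(
    word_counts: list[int], n_blocks: int = 8
) -> list[tuple[float, float]]:
    """Return (start, end) ranges that each hold about 1/n_blocks of the values."""
    if len(word_counts) < 2:
        raise ValueError("need at least two word counts")
    # n_blocks - 1 interior cut points; "inclusive" keeps them within min..max.
    cuts = statistics.quantiles(word_counts, n=n_blocks, method="inclusive")
    edges = [min(word_counts), *cuts, max(word_counts)]
    return list(zip(edges[:-1], edges[1:]))


# Made-up lengths for 50 summaries, evenly spread from 100 to 149 words.
word_counts = list(range(100, 150))
blocks = equal_mass_blocks(word_counts)
widths = [end - start for start, end in blocks]
```

With evenly spread lengths all eight widths come out equal. When many summaries share similar lengths, the blocks covering that region shrink, which is the clustering the plot makes visible.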