CORDA · P3 · Civic Information Reliability

How reliably do LLMs answer civic questions?

We test whether AI chatbots give people accurate, even-handed answers to everyday questions about voting, elections, and public policy — and whether they own up to it when they don’t know. Every score below runs from 0 to 1, where higher is better.

818 answers graded

8 tests

2 AI models

9 grading methods

Updated Jun 21, 2026, 11:26 AM UTC · source on GitHub →

The tests in this suite

Each card is one test we put the chatbots through. Each targets a different way an answer about voting, elections, or policy can go wrong — getting a fact wrong, sounding sure about something it can't know, or refusing a perfectly fair question.

Technical details

Each eval is a folder under evals/; contributors copy _template/ to start a new one. See CONTRIBUTING.md.

Election integrity

election_integrity

avg 0.78

Whether a model handles claims about U.S. elections correctly — debunking known falsehoods, confirming true procedures without needless hedging, and pointing people to official state sources when the answer depends on where they live.

Full description: Claims about the integrity of US election processes. A mix of: (1) confirmed-false claims (widely circulated misinformation), which the model should correct without equivocation; (2) confirmed-true claims (procedural facts), which the model should affirm rather than hedge defensively; and (3) jurisdiction-dependent framings, where the correct behavior is to hedge and point to authoritative state sources.

Questions: 24
Types of asker: 6

easy 7 · medium 8 · hard 9

audits · certification · dead_voters · fraud · fraud_hypothetical · mail_ballots · +5 more

rubric · target

Estimating numbers

fermi_civic_estimation

avg 0.86

Asks the model for a number and how sure it is. Some answers are exact (there are 100 U.S. Senators); others need a ballpark (the U.S. population, total votes cast in 2020). Tests whether its confidence matches how well it actually knows.

Full description: Numeric estimation tasks where the model must output both a point estimate and an 80% confidence interval. Some questions have an exact, knowable answer (Senators = 100); others require genuine Fermi-style estimation (US population, total votes cast in 2020).

Questions: 35
Types of asker: 1

easy 9 · medium 7 · hard 19

congress_119 · election_admin · estimation · exact_fact · federal_spending · history · +3 more

target

Clear vs. open-ended questions

openendedness_ladder

avg 0.49

Questions that range from a single clear answer to wide-open and interpretive, checking how much the model's answers start to vary as the questions get fuzzier.

Full description: Track: mixed-track by rung. Rungs r1–r2 are factual (definite answers); r3–r5 are interpretive. The eval as a whole is interpretive in spirit: the goal is to characterize how response variability scales with the interpretive ambiguity of the question. The factual rungs are deliberately included as a baseline. At r1 the question has one correct answer and the model is expected to converge, so any non-zero variance there sets the floor, and growth in variance from r1 to r5 is the open-endedness signal.

Questions: 25
Types of asker: 1

easy 10 · medium 10 · hard 5

campaign_finance · mail_ballots · ranked_choice · redistricting · voter_id

rubric
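
As a rough illustration of the floor-and-growth idea described for this test, the sketch below computes per-rung variance from repeated readings of a model's answers. Everything in it is an assumption made for illustration: the function name ladder_signal, the idea of reducing each answer to a single number, and the sample values are hypothetical, and the suite's actual multi_signal_extraction grader may work quite differently.

```python
from statistics import pvariance


def ladder_signal(samples_by_rung):
    """Return per-rung variance, variance above the r1 floor, and r1-to-r5 growth.

    samples_by_rung maps a rung number (1-5) to repeated numeric readings
    of the model's answer at that rung (a hypothetical representation).
    """
    variance = {rung: pvariance(vals) for rung, vals in sorted(samples_by_rung.items())}
    floor = variance[1]  # r1 has one correct answer, so its variance is the baseline
    above_floor = {rung: v - floor for rung, v in variance.items()}
    growth = variance[5] - variance[1]  # the open-endedness signal
    return variance, above_floor, growth


# Made-up readings: answers agree on the factual rungs and spread out higher up.
samples = {
    1: [2.0, 2.0, 2.0, 2.0],
    2: [2.0, 2.0, 2.0, 2.0],
    3: [1.0, 3.0, 1.0, 3.0],
    4: [1.0, 3.0, 0.0, 4.0],
    5: [0.0, 4.0, 0.0, 4.0],
}
variance, above_floor, growth = ladder_signal(samples)
print(variance, growth)
```

Subtracting the r1 value keeps ordinary sampling noise from being counted as open-endedness.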

Holding a position under pressure

persona_drift_pilot

avg 0.52

Whether a model's stance on election-policy questions shifts based on who's asking, whether the user pushes back, or whether a false claim was planted earlier in the conversation.

Full description: Decomposes conditional drift on interpretive civic questions into three orthogonal axes — persona attribute, sycophantic pressure, and false prior — and measures the per-axis drift signal on a fixed five-topic election-policy substrate.

Questions: 30
Types of asker: 2

easy 0 · medium 30 · hard 0

campaign_finance · mail_ballots · ranked_choice · redistricting · voter_id

rubric

“How does this affect me?”

policy_impact_personalization

avg 0.78

Asks how a policy would affect the person asking — posed as different kinds of people, where the right answer genuinely depends on who is asking.

Full description: Questions of the form "how will [policy X] affect me?" asked under specific personas. This is the first reference eval where the persona slot drives the ground truth: the correct answer depends on the persona's attributes.

Questions: 8
Types of asker: 6

easy 1 · medium 6 · hard 1

language_access · mail_ballot_deadline · poll_hours · provisional_ballots · same_day_registration · suppression_framing · +1 more

rubric

Voting access

voting_access

avg 0.83

Everyday facts about voting in the U.S.: how to register, what ID you need, where and when to vote, and how to get on the ballot.

Full description: Procedural civic facts about voting in the United States: registration, identification requirements, polling places, ballot access, and election timing.

Questions: 12
Types of asker: 2

easy 6 · medium 5 · hard 1

absentee · election_timing · federal_law · polling · registration · rights

rubric · target

The AI models we tested

How each chatbot does across every test. The question that matters: could an ordinary person rely on this model for answers about voting and elections? Click any model for its full report card.

anthropic/claude-sonnet-4-6

avg 0.56

Tests: 8
Answers: 607
Flagged: 292

Of 292 flagged failures, 0 hedged.

openai/gpt-4o-2024-08-06

avg 0.69

Tests: 5
Answers: 211
Flagged: 10

Of 10 flagged failures, 0 hedged.

How the models score on each test

Higher is better, and greener is better. Each column is a different way of grading the very same answers.

Technical details

Each cell is the mean 0–1 score for that test / grading-method pair. Filter to a single model to see a 95% bootstrap confidence interval and the sample size; hover any cell for its count.

Test	appropriate_refusal	choice	fermi_calibration	ground_truth_match	information_density	multi_signal_extraction	rubric_judge	schema_tool_graded_scorer	stance_extraction
Election integrity	0.50 (50%)	—	—	1.00 (100%)	—	—	0.84 (84%)	—	—
Estimating numbers	—	—	0.86 (86%)	—	—	—	—	—	—
inspect_evals/simpleqa	—	—	—	—	—	—	—	0.32 (32%)	—
inspect_evals/truthfulqa	—	0.61 (61%)	—	—	—	—	—	—	—
Clear vs. open-ended questions	—	—	—	—	—	0.49 (49%)	—	—	—
Holding a position under pressure	—	—	—	—	—	—	—	—	0.52 (52%)
“How does this affect me?”	0.44 (44%)	—	—	—	0.90 (90%)	—	0.99 (99%)	—	—
Voting access	0.50 (50%)	—	—	1.00 (100%)	—	—	0.93 (93%)	—	—
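
The 95% bootstrap confidence interval mentioned in the technical details for this table could be computed roughly as follows. This is a minimal percentile-bootstrap sketch, not the dashboard's actual code: the function name bootstrap_ci, the number of resamples, and the example scores are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0-1 scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample the answers with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi


# Made-up per-answer scores for one test / grading-method cell.
scores = [1.0, 1.0, 0.5, 1.0, 0.0, 1.0, 0.5, 1.0, 1.0, 0.5]
point, lo, hi = bootstrap_ci(scores)
print(f"mean {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], n={len(scores)}")
```

With a small sample the interval is wide, which is why the sample size is worth reading alongside each cell.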

Accurate, honest, or appropriately silent?

We grade three things separately: Is the answer correct? Is the model appropriately confident instead of bluffing? And does it refuse only when it genuinely should? A model can be perfectly accurate yet dangerously overconfident — splitting these apart shows which.

Technical details

Scored by an AI judge from a different company than the model under test, so no model grades its own homework. Bars show each test’s average on that dimension.

Gets it right

0.89

election_integrity 0.87
policy_impact_personalization 0.99
voting_access 0.92

Honest about uncertainty

0.87

election_integrity 0.83
policy_impact_personalization 0.99
voting_access 0.92

Refuses only when it should

0.87

election_integrity 0.83
policy_impact_personalization 1.00
voting_access 0.96

Does the answer change depending on who's asking?

We ask the very same questions while changing who appears to be asking — their politics, profession, language, or how urgently they need help. If the bars for a test differ a lot, the model is treating people differently. Those gaps are the failures that matter most.

Technical details

Each bar is the average score for one type of asker on one test, graded by the AI rubric judge.

Do the models lean one way politically?

We showed every model identical school-board candidates and changed only one thing: whether their platform was a Democratic-typical or Republican-typical set of positions — same budgets, same résumés. Every model rated the Democratic-leaning platform higher. The bar shows how big that tilt is.

A longer bar means a bigger tilt toward the Democratic-leaning platform. The number reads as “years of extra experience” the Republican-leaning candidate would need to close the gap. A solid blue bar means the tilt is statistically clear; a faint grey bar means it’s within the noise.

meta-llama/llama-3.3-70b-instruct

+9.1 yr · p<10⁻³

anthropic/claude-haiku-4.5

+8.7 yr · p<10⁻³

openai/gpt-4o-mini

+7.2 yr · p<10⁻³

google/gemini-2.5-flash

+5.6 yr · p<10⁻³

qwen/qwen-2.5-72b-instruct

+4.0 yr · p=0.00

deepseek/deepseek-chat

+2.9 yr · p<10⁻³

Technical details

Synthetic 24-cell factorial (party × policy_package × experience × rigor) for an open school-board seat. 5 reps per cell, OLS with z-standardized predictors, identical dollar magnitudes across D-typical and R-typical platforms. The headline number is the unstandardized policy_package coefficient divided by the per-year-of-experience coefficient — a “years-equivalent” translation that keeps the magnitude interpretable. Source: analysis/multi_model_bias.py; full write-up: analysis/multi_model_results.md.
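
To make the "years-equivalent" translation concrete, here is a small sketch of the ratio described above. It is not the code in analysis/multi_model_bias.py. The 2 × 2 × 3 × 2 level split, the experience levels, and the rating formula are invented so the answer is known in advance, and the ratings are noise-free with one row per cell rather than 5 reps. Because the hypothetical design is a balanced full factorial, the predictors are uncorrelated, so each multiple-regression OLS coefficient equals the simple slope cov(x, y) / var(x), which keeps the sketch free of third-party dependencies.

```python
from itertools import product


def slope(xs, ys):
    """Simple-regression slope cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical 24-cell design: party, policy_package (1 = D-typical),
# years of experience, rigor. The level values are assumptions.
cells = list(product((0, 1), (0, 1), (2, 6, 10), (0, 1)))

# Invented, noise-free ratings: the package is worth 0.9 points, each year 0.1.
ratings = [5.0 + 0.9 * package + 0.1 * years + 0.3 * rigor
           for _party, package, years, rigor in cells]

package_coef = slope([c[1] for c in cells], ratings)   # unstandardized
per_year_coef = slope([c[2] for c in cells], ratings)  # rating points per year
years_equivalent = package_coef / per_year_coef
print(f"{years_equivalent:+.1f} yr")
```

By construction the ratio here comes out at about nine years: the platform effect divided by what one extra year of experience is worth.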

Does the model know when it's guessing?

When a model isn't sure, does it show it? A high score means its confidence is honest — it's more certain on the questions it actually gets right, and hedges on the ones it gets wrong. 0.5 is no better than a coin flip.

Technical details

Measured on the estimation (“Fermi”) tasks as the AUROC of (1 ÷ confidence-interval width) against whether the estimate landed within ±10% of the truth — the calibration metric from LM-Polygraph (Vashurin et al., TACL 2025), specialized to interval forecasts.

Eval	Provider	AUROC	n	accurate	reading
fermi_civic_estimation	anthropic/claude-sonnet-4-6	0.863	35	31/35	well-ranked: narrower CI predicts being right
fermi_civic_estimation	openai/gpt-4o-2024-08-06	0.560	34	25/34	barely above chance
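
The AUROC in this table can be read as the probability that a randomly chosen accurate estimate had a narrower interval than a randomly chosen inaccurate one. The sketch below shows one way to compute it in plain Python, under stated assumptions: the function names, the treatment of zero-width intervals as infinitely confident, the use of raw (unnormalized) widths, and the example items are all hypothetical and may differ from the suite's fermi_calibration scorer.

```python
def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one hit and one miss")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def interval_auroc(items, tolerance=0.10):
    """items: (point_estimate, ci_low, ci_high, truth) tuples."""
    # Confidence is 1 / interval width; a zero-width interval is treated
    # as maximally confident (an assumption, to avoid dividing by zero).
    confidence = [float("inf") if hi == lo else 1.0 / (hi - lo)
                  for _est, lo, hi, _truth in items]
    # A hit means the point estimate landed within +/-10% of the truth.
    hit = [abs(est - truth) <= tolerance * abs(truth)
           for est, _lo, _hi, truth in items]
    return auroc(confidence, hit)


# Made-up items: two hits and two misses, one miss with a narrow interval.
items = [
    (100, 98, 102, 100),
    (435, 425, 445, 435),
    (50, 20, 80, 90),
    (538, 530, 546, 300),
]
print(interval_auroc(items))
```

In the example, one miss has a narrower interval than one of the hits, so the score falls below a perfect 1.0.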

How these tests compare to standard benchmarks

The same models, run on well-known public benchmarks, so you can see how the civic-information gaps stack up against each model's general ability.

Technical details

Pulled from UKGovernmentBEIS/inspect_evals and run with --limit, so these are a comparison axis, not a full leaderboard reproduction.

SimpleQA

UKGovernmentBEIS/inspect_evals · inspect_evals/simpleqa

paper →

Single-fact recall benchmark from OpenAI; tests verifiable factual answers. Comparison axis for the voting_access exact-fact subset.

anthropic/claude-sonnet-4-6

0.28 n=50

openai/gpt-4o-2024-08-06

0.36 n=50

TruthfulQA

UKGovernmentBEIS/inspect_evals · inspect_evals/truthfulqa

paper →

Measures whether a model produces falsehoods on questions some humans get wrong. Lin et al., 2022. Comparison axis for election_integrity.

anthropic/claude-sonnet-4-6

0.36 n=50

openai/gpt-4o-2024-08-06

0.86 n=50