BenGER: Dataset & Benchmark Released

Today we are releasing BenGER, an open benchmark for legal reasoning in German law — and, for us, the real centerpiece of the project. The platform we published earlier was the means; this dataset is the end. To our knowledge it is the first large-scale, openly licensed benchmark that measures whether language models can actually reason through German legal cases — not whether they can recall a statute, but whether they can carry a problem through the full Gutachtenstil: hypothesis, definition, subsumption, conclusion.

What's in the dataset

BenGER is a single benchmark with three thematic subsets, spanning civil, criminal, and public law:

  • Benchathon — long-form, exam-style legal analyses, collaboratively graded, collected at our Benchathon in March 2026.
  • ZJS — published exam-style cases from the Zeitschrift für das Juristische Studium.
  • Grundprinzipien — short, focused items on foundational legal principles.

In total: 596 exam-style free-text tasks and 531 short doctrinal items. What makes the Benchathon subset unusual is its human baseline — a controlled set of timed, human-written solutions, produced under two conditions: traditional, unaided work and human-AI co-creation. That lets us compare, on the very same tasks, what a model writes alone, what a jurist writes alone, and what the two produce together.

A real human baseline, and a judge you can trust

Grading long-form legal writing is hard: there is no single correct string to match against. BenGER scores every solution on a ten-dimension rubric built around subsumption. The dimensions include result correctness, issue spotting, legal grounding, subsumption quality, structure, and language, among others. The dimension scores are aggregated to a 100-point scale and mapped onto the familiar German 0–18 grade scale. The scoring is done by a rubric-aligned LLM-as-a-Judge.
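As a rough illustration of the aggregation step, the sketch below sums ten per-dimension scores into a 100-point total and rescales it to the 0–18 grade scale. The equal 0–10 range per dimension, the linear rescaling, and the function names are assumptions made for this example, not the rubric's published definition; the released rubric is the authoritative source for the actual weights and mapping.

```python
from typing import Sequence

N_DIMENSIONS = 10        # the rubric has ten dimensions (from the post)
MAX_PER_DIMENSION = 10   # assumption: every dimension is scored 0-10
MAX_GRADE = 18           # top of the German 0-18 grade scale


def aggregate_rubric(scores: Sequence[float]) -> float:
    """Sum ten per-dimension scores into a 0-100 total (equal weights assumed)."""
    if len(scores) != N_DIMENSIONS:
        raise ValueError(f"expected {N_DIMENSIONS} scores, got {len(scores)}")
    for value in scores:
        if not 0 <= value <= MAX_PER_DIMENSION:
            raise ValueError(f"score {value} outside 0-{MAX_PER_DIMENSION}")
    return float(sum(scores))


def to_german_grade(points: float) -> float:
    """Rescale a 0-100 total to the 0-18 scale (linear mapping assumed)."""
    if not 0 <= points <= 100:
        raise ValueError(f"points {points} outside 0-100")
    return points * MAX_GRADE / 100


if __name__ == "__main__":
    # Made-up scores for a single solution, one value per dimension.
    example = [8, 7, 6, 7, 9, 8, 6, 7, 5, 7]
    total = aggregate_rubric(example)
    print(total, to_german_grade(total))
```

Under this linear assumption, a 15.7-point difference corresponds to about 2.8 grade points, which is consistent with the co-creation lift reported in the results.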

The obvious objection to an LLM judge is whether you can trust it, so we validated it against people. Seven reviewers graded the Benchathon solutions by hand, and we checked the judge against that human pool. It tracks human grading at Pearson r = 0.76 and Cohen's κ = 0.60 — inside the band of disagreement between the human reviewers themselves. Rankings stay stable across different judge families, and two judges from independent providers clear the bar for replacing a single human reviewer. The headline numbers are not an artifact of one model grading in its own favour.
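For readers who want to run this kind of agreement check on their own gradings, here is a minimal sketch of the two statistics in plain Python. It assumes unweighted Cohen's κ over categorical grade bands; the post does not say whether the reported κ is weighted or how grades were binned, and the function names and sample data are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    if n != len(b) or n == 0:
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up scores and bands, for illustration only.
    human_points = [42.0, 55.5, 61.0, 70.0, 38.5]
    judge_points = [45.0, 52.0, 65.5, 68.0, 41.0]
    human_bands = ["low", "mid", "mid", "high", "low"]
    judge_bands = ["low", "mid", "high", "high", "low"]
    print(pearson_r(human_points, judge_points))
    print(cohens_kappa(human_bands, judge_bands))
```

Comparing the judge-versus-human values with the same statistics computed between pairs of human reviewers is what allows a statement such as "inside the band of disagreement between the human reviewers".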

What we found

We ran 12 contemporary LLM systems (closed-flagship, efficiency-tier, and open-weight), producing 14,614 generations across the three subsets. Three results stand out:

  • Closed-flagship systems lead across the board. Opus-4.7, Gemini-3.1-Pro, and GPT-5.4 top all three subsets, with a clear gap to the open-weight tier.
  • Human + AI beats either alone. Co-creation solutions score roughly +15.7 points (≈ +2.8 German grade points) above unaided human work — enough to put AI-assisted jurists in the same band as the strongest standalone models. The lift holds across civil, criminal, and public law.
  • Generic metrics miss the point. Off-the-shelf lexical and embedding metrics correlate only loosely with rubric-based legal quality — which is exactly why a domain-aligned, validated judge matters.

Get the data, cite the work

BenGER is released openly:

  • Dataset (canonical): archived on Zenodo under CC BY 4.0, DOI 10.5281/zenodo.20489788.
  • Paper (preprint): arXiv:2605.28183.
  • Code & analysis pipeline: the corpora, judge prompts, rubric, and the scripts that produced every number in the paper are on GitHub (Apache-2.0).

Everything was produced end-to-end on the BenGER platform; you can explore the hosted instance at what-a-benger.net.

A benchmark like this only exists because many people did careful, unglamorous work: the Benchathon participants who sat down and solved real cases, the reviewers who graded them, our co-authors across TUM, LMU, Konstanz, and Saarbrücken, and TUM for backing the project throughout. Thank you — and if you build on BenGER, we would love to hear about it.