Platform Release
The time has come: our paper "BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks" has been accepted into the demo track at ICAIL 2026 (June 8-12, Singapore) (preprint on arXiv), and the full source code is available to the community as free software under the Apache 2.0 license, effective immediately. This marks the end of an extended development and validation phase that began with a first poster at the Tübingen AI & Law conference in autumn 2025, ran through several iterations with annotators from legal practice, and most recently faced its first serious test in front of a larger user base at the Benchathon in March 2026.
Why BenGER
Anyone trying to measure the performance of large language models in German law typically faces the same tooling problem: tasks are scribbled together in notebooks, annotations are managed in Excel spreadsheets, model runs are executed as ad-hoc scripts against the respective provider APIs, and the evaluation at the end is done with a haphazardly assembled collection of metrics. This fragmentation is not merely inconvenient - it reduces the control of legal experts over precisely the steps where their expertise matters (task formulation, reference solutions, quality assurance), and it makes studies practically impossible to reproduce across groups.
The problem is particularly acute in German law because high-quality legal expertise is expensive and scarce. Every lawyer we can recruit for annotation work is a valuable resource - it is worth giving them a tool in which they can run the entire workflow without scripting.
That is why BenGER bundles the entire benchmarking workflow in a browser-based tool: from creating legal tasks and reference solutions, through collaborative annotation and execution against various LLM providers, to structured evaluation. Everything that emerges in the process - task catalogs, annotations, model outputs, metrics - remains linked within the same system and can be frozen and shared as a citable artifact.
What the Platform Can Do
Tasks and reference solutions. Legal experts create tasks directly in the browser, with full text, structured metadata, and, depending on the format, a reference solution. Free-text reasoning, multiple choice, and span annotations on source documents are supported - the three formats that cover the bulk of tasks in existing research on German law.
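To make the three formats more tangible, here is a minimal sketch of how a task record could be modeled in Python; the class and field names are our own illustration and not the platform's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class TaskFormat(str, Enum):
    FREE_TEXT = "free_text_reasoning"    # open-ended legal reasoning, compared against a model answer
    MULTIPLE_CHOICE = "multiple_choice"  # one or more correct options
    SPAN_ANNOTATION = "span_annotation"  # labeled character spans on a source document

@dataclass
class LegalTask:
    title: str
    format: TaskFormat
    full_text: str                                  # the question or case description
    metadata: dict = field(default_factory=dict)    # e.g. legal area, difficulty, statute references
    reference_solution: Optional[str] = None        # expert answer, or the correct option key
    source_document: Optional[str] = None           # only needed for span annotation tasks

# Example: a multiple-choice task on German contract law
task = LegalTask(
    title="Wirksamkeit eines Kaufvertrags",
    format=TaskFormat.MULTIPLE_CHOICE,
    full_text="Which statement about § 433 BGB is correct? (A) ... (B) ... (C) ...",
    metadata={"area": "Zivilrecht", "difficulty": "intro"},
    reference_solution="B",
)
```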
Collaborative annotation. Multiple annotators work in parallel on the same tasks without overwriting each other. Progress and consistency indicators are continuously calculated and made visible, so project leads can systematically steer the construction of a human baseline instead of discovering at the end that the conclusions are being carried by noisy annotation.
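To illustrate what such a consistency indicator can look like (the concrete statistic and its implementation on the platform are not spelled out here), pairwise Cohen's kappa over the tasks two annotators have both completed is one common choice, assuming scikit-learn is available:

```python
from sklearn.metrics import cohen_kappa_score

# Answers from two annotators on the same multiple-choice tasks
annotator_a = {"task-1": "B", "task-2": "A", "task-3": "C", "task-4": "B"}
annotator_b = {"task-1": "B", "task-2": "C", "task-3": "C"}

# Agreement is only defined on tasks both annotators have completed
shared = sorted(set(annotator_a) & set(annotator_b))
labels_a = [annotator_a[t] for t in shared]
labels_b = [annotator_b[t] for t in shared]

kappa = cohen_kappa_score(labels_a, labels_b)                     # 1.0 = perfect, 0 = chance agreement
overlap = len(shared) / len(set(annotator_a) | set(annotator_b))  # share of tasks doubly annotated
print(f"kappa={kappa:.2f}, doubly annotated={overlap:.0%}")
```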
Optional formative feedback. A point that is particularly close to our hearts: annotators can, on request, receive LLM-based feedback anchored in the reference solution after submitting their answer - deliberately modeled on the logic of a German Repetitor, who points out missing reasoning steps and typical pitfalls without replacing expert control over the evaluation itself. The feature can be deactivated or restricted per project and per role.
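A rough sketch of how such reference-anchored feedback can be requested, purely for illustration: the prompt wording is ours, and `complete` stands in for whichever provider client the project has configured.

```python
def build_feedback_prompt(task_text: str, reference_solution: str, student_answer: str) -> str:
    """Compose a feedback prompt that is anchored in the expert reference solution."""
    return (
        "You act like a German legal Repetitor. Compare the submitted answer with the reference "
        "solution. Point out missing reasoning steps, misapplied norms, and typical pitfalls. "
        "Do not assign a grade and do not reveal the reference solution verbatim.\n\n"
        f"Task:\n{task_text}\n\n"
        f"Reference solution (not shown to the student):\n{reference_solution}\n\n"
        f"Submitted answer:\n{student_answer}"
    )

def formative_feedback(task, student_answer: str, complete) -> str:
    """`complete` is any callable that sends a prompt to the configured LLM and returns its text."""
    prompt = build_feedback_prompt(task.full_text, task.reference_solution, student_answer)
    return complete(prompt)
```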
LLM execution. Tasks can be run in batches against virtually all relevant LLM providers - currently OpenAI, Anthropic, Google, Mistral, Cohere, DeepInfra, and Zhipu AI. Provider keys are stored per user or per project, so research groups and law firms can cleanly map their own quotas and institutional policies without sharing keys with others. Prompts, sampling parameters, and system messages are versioned and stored along with each run.
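A sketch of what versioning a run configuration can look like; the field names, example values, and fingerprinting approach are illustrative assumptions rather than the platform's actual storage format.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass(frozen=True)
class RunConfig:
    provider: str            # e.g. "openai", "anthropic", "mistral"
    model: str
    system_message: str
    prompt_template: str     # template with a {task} placeholder
    temperature: float = 0.0
    max_tokens: int = 1024

def config_fingerprint(cfg: RunConfig) -> str:
    """Stable hash over every parameter that influences the output, stored alongside the run."""
    canonical = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

cfg = RunConfig(
    provider="openai",
    model="example-model",
    system_message="You are a careful assistant for German legal tasks.",
    prompt_template="Answer the following task and cite the relevant norms.\n\n{task}",
)
run_record = {
    "config": asdict(cfg),
    "fingerprint": config_fingerprint(cfg),
    "started_at": datetime.now(timezone.utc).isoformat(),
}
```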
Evaluation. Over 40 metrics are available, grouped into classification (accuracy, F1, Cohen's kappa), lexical (BLEU, ROUGE, METEOR), semantic (BERTScore, MoverScore, sentence-transformer embeddings), factuality, and LLM-as-a-judge with a configurable judge model. Each metric is linked to its scientific source - a point at which many comparative studies otherwise quietly fail because it remains unclear which variant of a metric was actually computed.
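As a sketch of how the link between metric and source can look in code (the registry layout and citation strings here are illustrative), pairing each scoring function with the reference that defines its variant is enough to make results auditable:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Each metric carries the reference that pins down the exact variant being computed
METRICS = {
    "accuracy": {
        "fn": accuracy_score,
        "source": "standard classification accuracy",
    },
    "macro_f1": {
        "fn": lambda y_true, y_pred: f1_score(y_true, y_pred, average="macro"),
        "source": "macro-averaged F1; the averaging variant is recorded explicitly",
    },
    "cohens_kappa": {
        "fn": cohen_kappa_score,
        "source": "Cohen (1960), Educational and Psychological Measurement",
    },
}

def evaluate(y_true, y_pred):
    """Return every score together with the citation that defines it."""
    return {
        name: {"score": m["fn"](y_true, y_pred), "source": m["source"]}
        for name, m in METRICS.items()
    }

print(evaluate(["A", "B", "B", "C"], ["A", "B", "C", "C"]))
```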
Multi-organization and data protection. Multiple research groups, chairs, public authorities, NGOs, or law firms can work in parallel and isolated from one another on the same instance. The separation is enforced at the data layer and via role-based permissions (admin, contributor, annotator); at the project level, who sees what can be configured in fine-grained detail. This makes the platform suitable for collaborations involving potentially sensitive legal materials, without raw data ever having to be handed over to external engineering teams.
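A minimal sketch of the role model described above; the three role names come from the text, while the capability sets and function names are invented for illustration.

```python
from enum import Enum

class Role(str, Enum):
    ADMIN = "admin"
    CONTRIBUTOR = "contributor"
    ANNOTATOR = "annotator"

# Illustrative capability sets per role, scoped to a single organization
CAPABILITIES = {
    Role.ADMIN:       {"manage_members", "create_project", "edit_tasks", "annotate", "view_results"},
    Role.CONTRIBUTOR: {"create_project", "edit_tasks", "annotate", "view_results"},
    Role.ANNOTATOR:   {"annotate"},
}

def can(user_org: str, role: Role, project_org: str, action: str) -> bool:
    """Organization isolation comes first, then the role's capabilities decide."""
    if user_org != project_org:   # hard separation between organizations
        return False
    return action in CAPABILITIES[role]

assert can("chair-a", Role.ANNOTATOR, "chair-a", "annotate")
assert not can("chair-a", Role.ADMIN, "lawfirm-b", "view_results")
```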
Differentiation from Existing Tools
Existing tools usually cover only part of the pipeline: general annotation platforms such as LabelStudio, Doccano, or the legal-focused Lawnotation offer flexible labeling UIs and dataset export, but leave multi-tenancy, LLM connectivity, and standardized evaluation open. DeepWrite and similar solutions take care of data management, not evaluation. The evaluation itself, in practice, almost always runs as project-specific code in notebooks that can hardly be reused between groups.
BenGER closes this gap by turning tasks, model configurations, and metrics into explicit, auditable artifacts that can be shared and reproduced between organizations - without domain experts ever having to leave the browser tab.
Open Core
The code is organized following an open-core model. The public platform under the Apache 2.0 license contains everything needed for a complete benchmarking workflow - annotation, generation, evaluation, multi-organization, reports. In addition, there are some extensions that we use in research at our chair (e.g., exam-solution annotations, review workflows, human leaderboards from the Benchathon context).
Technology
The frontend runs Next.js 15 with App Router, TypeScript, and Tailwind, with state management via Zustand and TanStack Query. The backend is a FastAPI service in Python with PostgreSQL as storage and SQLAlchemy/Alembic for ORM and migrations. The compute-intensive evaluation pipeline runs in Celery workers backed by Redis, loading models for embedding and factuality metrics on demand. Authentication is JWT-based with role-based access control.
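To make the division of labor between API and workers concrete, here is a compressed sketch of the pattern (the endpoint enqueues, the worker computes); module layout, task names, and the Redis URLs are placeholders, not taken from the actual codebase.

```python
# worker.py - Celery worker that runs the heavy evaluation off the request path
from celery import Celery

celery_app = Celery(
    "benger",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@celery_app.task
def evaluate_run(run_id: str) -> dict:
    # In the real pipeline: load model outputs, compute metrics, persist results in Postgres.
    return {"run_id": run_id, "status": "scored"}

# api.py - FastAPI endpoint that only enqueues the job and returns immediately
from fastapi import FastAPI

api = FastAPI()

@api.post("/runs/{run_id}/evaluate")
def trigger_evaluation(run_id: str):
    job = evaluate_run.delay(run_id)   # hand the work to a Celery worker via Redis
    return {"task_id": job.id, "status": "queued"}
```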
For operations, there is a Helm chart that deploys the platform - including Postgres, Redis, Traefik, and worker pools - to any K3s or standard Kubernetes cluster. Anyone who prefers something smaller can get by with a docker compose up from the included Compose configuration. We operate the hosted instance ourselves on a K3s server and welcome experience reports from other setups.
Testing
Anyone who wants to try it out has two options:
- Hosted: Create an account at what-a-benger.net, create an organization, store provider keys, and get started.
- Self-hosted: Clone the source code from the GitHub repository, copy the env files from the templates, and run docker compose up - the onboarding is documented in the README.
What's Next
Later this month, we will publish the first complete benchmark dataset created end-to-end on BenGER - including human baselines from the Benchathon and co-creation baselines, i.e., solutions in which humans and models worked together. This will, for the first time, provide an open point of comparison between pure LLM answers, pure expert answers, and jointly developed solutions for German legal tasks.
In the medium to long term, we want to integrate BenGER at all law faculties in Germany and make the platform freely available to all students. From our point of view, all parties involved benefit: students get exam training with immediate, reference-grounded feedback. Universities get a teaching platform that digitally supports the existing system of practice sessions and Repetitor courses without replacing it. And we as researchers receive a regular flow of fresh, fully anonymized baseline data on a scale that is practically unattainable in individual studies.
If you want to use the platform at your faculty, public authority, law firm, or NGO: please just get in touch - we will help with setup, with onboarding the first users, and with adaptation to the respective use case.
In addition, we are working on further annotation formats (in particular structured subsumption), additional judge configurations, an open REST API for external pipelines, and a clean way to freeze finished benchmarks as citable snapshots and publish them. The metric and provider integration layers are designed in such a way that new tasks, providers, or scoring methods can be added without rewriting the end-to-end pipeline - the platform is intended as a foundation for long-lived benchmark initiatives, not as a throwaway tool for a single paper.
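As a sketch of what such an extension point can look like (the protocol names and registry functions are ours for illustration, not the platform's actual API), a new provider or metric only needs to satisfy a small interface to plug into the existing pipeline:

```python
from typing import Protocol, Sequence

class Provider(Protocol):
    """Anything that can turn a prompt into model output can be registered as a provider."""
    name: str
    def generate(self, prompt: str, **params) -> str: ...

class Metric(Protocol):
    """Anything that scores predictions against references can be registered as a metric."""
    name: str
    source: str   # citation for the exact variant being computed
    def score(self, predictions: Sequence[str], references: Sequence[str]) -> float: ...

PROVIDERS: dict[str, Provider] = {}
METRICS: dict[str, Metric] = {}

def register_provider(provider: Provider) -> None:
    PROVIDERS[provider.name] = provider

def register_metric(metric: Metric) -> None:
    METRICS[metric.name] = metric
```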
We warmly welcome issues, pull requests, and feedback from the community - and equally hints as to which task types, metrics, or providers should be added in the next iteration. Finally, a big thank-you goes to everyone who contributed between the Tübingen poster, the Benchathon, and the ICAIL demo: to the annotators from the chair's environment, to the student assistants, to the participants of the Benchathon, and to TUM, which has supported the project from the very beginning.