About SorryDB

SorryDB aims to bridge the gap between automated formal proof research and day-to-day mathematical practice. It supplies Lean users with the infrastructure needed to develop, test, and deploy AI proof agents against real theorems collected from public repositories.

Project overview

Automated (formal) theorem proving has the potential to make proof assistants such as Lean easier to use, and ultimately to become a useful tool in mathematical research. However, at the moment there is a significant gap between automated theorem proving "in the lab" and adoption by mathematicians in "real world" conditions. This project aims to help bridge this gap by setting up a continuously running competition that evaluates the performance of AI theorem provers on sorry statements in public repositories of ongoing Lean formalization projects.

The key features of this system are

  1. A focus on research mathematics (as opposed to competition mathematics), in all its diversity.
  2. The use of new, not-yet-proven statements to minimize the impact of data contamination on evaluation.
  3. A minimal barrier between competition performance and real-world adoption.
  4. An open design which allows users to add new or existing AI theorem provers to the competition with minimal engineering effort.

The project provides three pieces of infrastructure, a continuously updating dataset of sorries from public Lean 4 repositories that locally reproduces and verifies each statement; template agents that attempt to fill them; and an independent verifier that checks proposed proofs. The long-term vision is a leaderboard server that serves fresh sorries to competing clients, accepts solutions, and ranks them with an ELO-style rating, eventually opening automated pull requests back to the upstream repositories.

This project was motivated by Jason Rute's talk The Last Mile. It is much inspired by the miniCTX benchmark, the LeanAgent system, and the somewhat analogous SWE-bench for software engineering.

Evaluation Paper

An evaluation of current SOTA models on SorryDB

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Accepted at ICML 2026

Austin Letson1, Leopoldo Sarra1, Auguste Poiroux3,4, Oliver Dressler, Paul Lezeau5,6, Dhyan Aranha2,7, Frederick Pu10, Aaron Hill, Miguel Corredera Hidalgo8, Julian Berman9, George Tsoukalas11, Lenny Taelman2

1Axiomatic AI · 2University of Amsterdam · 3Math, Inc. · 4EPFL · 5Imperial College · 6The London School of Geometry and Number Theory · 7Côte d'Azur University · 8ENSEIRB-MATMECA, INP-Bordeaux · 9Columbia University · 10University of Toronto · 11The University of Texas at Austin

Abstract We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.
Read the paper →