SorryDB Leaderboard

SorryDB is a dataset of open proof obligations drawn from real-world Lean formalization projects, designed as a benchmark for measuring the effectiveness of AI theorem provers for day-to-day Lean practitioners.

Snapshot · SorryDB_2601 · January 2026. A fixed slice of 1,000 tasks from the SorryDB dataset; new snapshots will follow.

Evaluation pipeline

SorryDB continuously monitors active Lean projects listed on Reservoir. For every open sorry it finds, it records the repository, commit, Lean version, and source location so the task can be reproduced locally.
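As a minimal sketch of what a recorded task points at (the repository, commit, and declaration below are invented for illustration and are not an actual SorryDB entry), each task identifies one specific sorry in one specific file:

```lean
-- Hypothetical task metadata (illustrative only):
--   repository:   github.com/example/my-formalization
--   commit:       0123abcd
--   Lean version: leanprover/lean4:v4.x.0
--   location:     MyFormalization/Basic.lean, the `sorry` below
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry
```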

Each strategy then proposes a proof to replace the sorry, and an independent verifier compiles the candidate inside the original project to check whether it closes the goal.
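Continuing the illustrative example above, a strategy submits a candidate tactic block or proof term in place of the sorry, and the verifier rebuilds the project to confirm the result still compiles:

```lean
-- Candidate proof for the hypothetical task above; the verifier accepts it
-- only if the project compiles with this replacement and no sorry remains.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```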

Evaluation Paper

An evaluation of current state-of-the-art models on SorryDB

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Accepted at ICML 2026

Austin Letson¹, Leopoldo Sarra¹, Auguste Poiroux³,⁴, Oliver Dressler, Paul Lezeau⁵,⁶, Dhyan Aranha²,⁷, Frederick Pu¹⁰, Aaron Hill, Miguel Corredera Hidalgo⁸, Julian Berman⁹, George Tsoukalas¹¹, Lenny Taelman²

¹Axiomatic AI · ²University of Amsterdam · ³Math, Inc. · ⁴EPFL · ⁵Imperial College · ⁶The London School of Geometry and Number Theory · ⁷Côte d'Azur University · ⁸ENSEIRB-MATMECA, INP-Bordeaux · ⁹Columbia University · ¹⁰University of Toronto · ¹¹The University of Texas at Austin

Abstract
We present SorryDB, a dynamically updated benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, which are often composed of competition problems, SorryDB reflects day-to-day formalization work, so hill-climbing on it will yield tools that are aligned with community needs, more usable by mathematicians, and more capable of handling complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, on a selected snapshot of 1,000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large language models, specialized provers, or even a curated list of Lean tactics.
Read the paper →