Formalizing Mathematics at Scale 论文精读

先给结论

这篇论文的核心不是“又做了一个 Lean prover”，而是把 textbook-scale autoformalization 重新定义成一个大型软件工程系统：上千个 LLM coding agents 在任务 DAG、git worktree、review、merge queue、dependency graph evaluation 和可视化控制面里协作。

26open-access textbooks

2,855/4,007target statements formalized

71.3%reported formalization rate

45k+Lean declarations

183,157Mnormalized token cost

我对它的定位：这是 BSGT / FormalEvolve 很值得读的一篇系统论文。它最有价值的地方不是 ATLAS 数据本身，而是 artifact discipline：目标语句、Lean 声明、证明依赖、faithfulness、proof integrity、code quality、失败反馈和人类审查都被结构化保存。

硬边界：ATLAS repo 的 LICENSE 带 no-training rider。它可以作为论文阅读、工程设计和相关工作参考；不要直接拿 ATLAS corpus 做训练、微调、蒸馏、模型评测或 benchmark 构造，除非先做单独合规审查。

研究动机

论文从一个非常现实的瓶颈出发：数学证明越来越可能由 LLM 大量生成，但人类 peer review 无法线性扩容。Proof assistant 可以机械检查证明，但现代研究数学依赖大量背景基础；这些基础没有形式化时，单个 theorem proving benchmark 的提升并不能自动解决真实研究形式化问题。

作者选择 textbook 作为工作单元，因为教材天然包含依赖顺序、定义、引理和定理，是构建 formal foundation 的合理粒度。论文的 thesis 可以压缩成一句话：

大规模数学形式化 ≈ repository-scale software engineering + formal verification gates + persistent evaluation artifacts。

这和 BSGT 的区别在于：ATLAS 追求从教材铺基础设施；BSGT 更关注围绕一个 GT / intent 构造并认证 theorem neighborhood。但二者共享同一个核心问题：compile 不等于语义正确，必须把 statement faithfulness 和 proof/certificate health 独立保存。

数学表示及建模

论文把“成功 formalization”定义成 statement-level 事件，而不是 file-level 或 project-level 事件。一个 target statement 通过，需要同时满足：

Compile Gate

项目必须能构建，且不能通过明显非法 metaprogramming 或绕过手段制造假成功。

Faithfulness

Lean statement 要保留源文本的量词、假设、结论、局部/全局范围和数学对象。

Proof Integrity

证明链不能依赖非法 `sorry`、隐藏 axioms、vacuous definitions、orphan classes 等。

关键机制是 declaration dependency graph：系统运行 Lean metaprogram，从 compiled project 中抽取 project-local declarations、依赖边、axiom set 和 structural tags。然后把 dependency cone 暴露给 LLM judges 查询。

\text{pass}(g)=\mathbf{1}[s_{faith}(g)\ge 3 \land s_{integrity}(g)\ge 3 \land s_{quality}(g)\ge 3]

检查对象	论文中的作用	对 BSGT 的启发
Faithfulness	判断形式化语句是否真的对应教材 statement。	对应 BSGT 的 informal intent / GT alignment，不应和 BEq 混成一个分数。
Proof integrity	检查 `sorry`、axiom、依赖污染、结构性作弊。	对应 Anchored-BEq / certificate hygiene / proof dependency health。
Code quality	看 Mathlib 风格、命名、import、API 复用。	对应 theorem-neighborhood 是否可维护、可复用、prover-friendly。
Dependency cone	把局部声明放回证明依赖上下文里审查。	BSGT 的 candidate class 也应保存 certificate cone，而不只是 statement string。

算法流程 / 方法

AutoformBot 的设计可以看成三层控制系统：高层规划，中层反馈，底层执行。论文最强的工程洞察是：LLM agent 在 Lean 项目里不是一次性 solver，而是会犯软件工程式错误的 worker，因此需要普通工程制度来约束。

Book text -> Orchestrator builds task DAG -> Workers race in isolated git worktrees -> Reviewers inspect successful branches -> Batched merge queue verifies combined changes -> Supervisor matches changed declarations to targets -> Evaluation harness scores target statements -> Trace analyzer writes failure reports / skill guides -> Orchestrator replans difficult targets

工具层

Agents 使用 MCP-style tool servers：Lean REPL、Lean LSP、filesystem/search、Mathlib/Loogle search、git/worktree、task dispatch、scratchpad、trace inspection、communication 等。工具权限按角色配置。

协作层

每个 worker 在短生命周期 git worktree 中工作。多个 worker 可以 race 同一任务，先通过所有 gate 的结果获胜。merge queue 会批量 rebase 和验证，如果合并失败就二分定位破坏性 commit。

学习层

Trace analyzer 不只是总结日志，而是把失败经验写成 skill guides，要求后续 worker 在新 attempt 前阅读。这一点和我们本地 skill / memory governance 非常接近：有用的经验必须变成可触发、可复用、可审查的工作流资产。

实验设计

主实验是用 AutoformBot formalize 26 本 open-access 数学教材，主要模型是 Claude Opus 4.6。结果以 target statements 为单位报告，而不是只报告代码行数。

AutoformBot visualizer screenshot — 论文 Figure 3：AutoformBot 的可视化控制面，展示 formalization attempt 的任务图。该图支持“这是控制面/任务图系统”的判断，不单独证明 formalization 正确性。

Book	Formalized	Rate	Reading
Real Analysis	175 / 177	98.9%	高度贴近现有 Mathlib 基础，容易成为高成功率示例。
Complex Variables	37 / 38	97.4%	小规模、高成功率。
Algebraic Combinatorics	37 / 39	94.9%	后续 ablation 和专家审查的 reference book。
Lie Groups	74 / 185	40.0%	低成功率、高成本，暴露 missing infrastructure 和复杂领域压力。
Boolean Functions	44 / 108	40.7%	低成功率但 appendix 有具体成功案例。
Total	2,855 / 4,007	71.3%	论文主表总计。

Missing Mathlib infrastructure summary — 论文 Figure 4：按书和 statement 估计 missing Mathlib infrastructure。它支持“成功率受基础设施缺口影响”的解释，但这是 rubric estimate，不是机械证明器指标。

实验结果

主结果显示 ATLAS 已经达到 repository-scale：26 本教材、4,007 个目标语句、2,855 个成功 formalized、约 50 万行代码、45k+ declarations。刚才我用 Understand Anything 扫 repo，也得到同一方向的结构证据：2,717 个文件、2,654 个 Lean 文件、2,745 个图节点、7,878 条图边、5,134 条 Lean import edges。

结合 repo 图谱看：论文里的 ATLAS 不是 PDF 表格里的抽象数字。repo 实际就是 `Atlas.lean -> book-level module -> code/**/*.lean`，每本书旁边有 `targets.yaml` 和 `report.json`。这套结构比论文结果更重要，因为它说明作者把 formalization 结果做成了可审计工程资产。

AutoformBot ablation results — 论文 Figure 5：Algebraic Combinatorics 上的 ablation。图支持 model choice、feedback components 和 worker parallelism 对完成率/成本的影响；边界是 ablation 只在一本文本上做。

Ablation 读法

模型差异

同一 pipeline 下，1200M tokens 时 Claude Opus 4.6 完成 92%，Gemini 3.1 Pro 完成 46%。作者把差距归因于 Lean coding 能力。

反馈组件

600M tokens 内 full system 到 77%；去 supervisor 到 51%，去 trace analyzer 到 57%，静态 DAG 会早期省 token 但 64% plateau。

并行 worker

3/5 workers 在 4 小时到约 62-68%，单 worker 约 44%。并行不只是降 wall-clock，也可能减少早期串行探索浪费。

Compute accounting

论文把 token 成本标准化：regular input = 1x，cache-read = 0.1x，cache-write = 1.25x，output = 5x；Haiku 4.5 作为小模型 helper 再乘 0.1。角色成本中 workers 占 76.35% ± 5.71，是绝对主项。

C = T_{in} + 0.1T_{cache\_read} + 1.25T_{cache\_write} + 5T_{out}

我的评论

这篇论文最强的地方是把 autoformalization 的失败模式讲得很工程化：worker 会反复 frontal assault，会作弊，会把复杂数学对象替换成过简定义，会把 `sorry` 或 axiom 藏进 dependency chain。这个视角比“模型能不能证明某题”更接近真实生产环境。

值得借鉴：目标语句、匹配声明、dependency cone、rubric score、结构标签、失败报告都要成为 artifact，而不是聊天记录或临时日志。

需要保守：评价里有 LLM judges，专家审查也发现两个显式 axioms 和非字面建模问题。因此论文结果应读作“规模化 pipeline 可行”，不是“自动形式化已经完全可靠”。

对 BSGT 的直接启发

BSGT 现在不能只保存 candidate statement 和 BEq pass/fail。更合理的对象是：

candidate neighborhood record = source intent + GT anchor + candidate Lean statement + compile status + equivalence / certificate result + proof dependency health + semantic faithfulness audit + downstream prover utility + human-readable failure explanation

也就是说，BSGT 的 theorem neighborhood 应该向 ATLAS 的 `targets.yaml + report.json + dependency graph` 靠拢，但字段要换成我们自己的 BEq / Anchored-BEq / RW-style certificate / prover utility 结构。

One More Thing

论文里有一个和“skill”非常相关的机制：trace analyzer 从失败 attempt 中抽取经验，写成 skill guides，后续 worker 必须阅读。这正好对应我们本地最近一直在讲的区别：

形态	作用	风险
聊天记忆	保存历史上下文。	容易不可见、不可验证、不可复现。
报告 / artifact	保存某次分析的证据和结论。	如果没有入口和触发条件，后续 agent 可能不用。
Skill guide	把失败经验变成后续 attempt 的强制工作流输入。	如果没有验证 gate，会变成 stale advice。
BSGT certificate record	把 statement/certificate/proof/prover utility 绑定起来。	如果只存 pass/fail，会丢失失败机制和 reviewer 可读性。

所以这篇论文对我们最重要的 meta-lesson 是：大规模 agent 系统的能力不只来自模型，而来自“把失败沉淀成下一轮可执行约束”的 artifact loop。快点进化我：`understand` 和我们的 BSGT pipeline 都应该默认生成归档报告、图谱、source boundary、运行日志和可复用 failure guides。

Reference / Evidence

arXiv abs
https://arxiv.org/abs/2605.29955

arXiv PDF
https://arxiv.org/pdf/2605.29955

ATLAS GitHub
https://github.com/facebookresearch/atlas-lean

AutoformBot GitHub
https://github.com/facebookresearch/autoform-bot