FormalEvolve 论文精读

先给结论

FormalEvolve 的贡献可以浓缩成一句话：在 Lean autoformalization 中，statement 本身是 prover-facing interface，不只是语义翻译终点；所以与其赌一个输出，不如在固定 generator-call 预算内构造一个去重的、语义通过的候选集合。

我认为它真正解决的问题

它抓住了 single-output evaluation 的盲点：一个 informal theorem 可以有多个 faithful Lean 表达，而这些表达对 prover 的难度可能不同。标准 autoformalization 只看“有没有一个对”，会丢掉这种 prover 相关差异。

我会怎么读它

这不是单纯的“LLM + evolution”论文。更准确地说，它是一个 compile-gated archive search 协议：把 Lean compiler 当硬门，把 LLM judge 当语义过滤，把 archive 当 test-time memory。

一句锐评：这篇最强的地方是问题建模非常对，最弱的地方是主结果仍强依赖 LLM-as-judge 和固定 prover protocol；它证明了“repertoire 有用”，但还没有完全证明“这个具体 evolution 机制是最优的 repertoire 构造方式”。

研究动机：为什么 single-output 不够

Autoformalization 表面上是把自然语言数学题翻译成 Lean statement。但本文强调，这个任务天然是 many-to-many：同一个 informal theorem 可以对应多个 faithful formal statements。问题在于，下游 proving 并不是只关心语义是否一致，还关心这个 formal statement 是否容易被 prover 搜索到证明。

换句话说，formal statement 有两层角色：第一，它是语义翻译结果；第二，它是 prover 看到的接口。即使两个 statement 都满足 $\operatorname{CompOK}(c)=1$ 和 $\operatorname{SemOK}(c)=1$，它们在同一个 prover、同一个 attempt budget 下也可能产生完全不同的 proof outcome。

Motivating illustration from the paper — Figure: 论文的 motivating illustration。重点不是图形本身，而是它表达的 gap：compile + semantic pass 不等于 prover-equivalent。

Significance：如果这个观察成立，那么 autoformalization evaluation 就不应该只问“第一个输出是否命中”，而应该问：“在同样预算下，系统能不能构造出一个覆盖更广、结构更多样、对 prover 更友好的候选集合？”

数学表示及建模

核心对象

给定 informal statement $x$，系统搜索 Lean 4 candidate $c=(\hat{h}, y)$。其中 $\hat{h}$ 是 import/header context，$y$ 是主要 Lean statement 或 declaration target。每个 candidate 要过两个门：

$$\operatorname{CompOK}(c)\in\{0,1\},\qquad \operatorname{SemOK}(c)\in\{0,1\}.$$

$\operatorname{CompOK}$ 是 Lean 编译/ elaboration 门；$\operatorname{SemOK}$ 是 CriticLean-Qwen3-14B 判断 statement 是否 faithfully formalizes $x$。archive 只允许 $\operatorname{CompOK}(c)=1$ 的候选进入，但 archive 内可以保留 $\operatorname{SemOK}(c)=0$ 的 stepping stones，因为它们可能提供有用的 binder、import 或结构。

真正报告的对象：semantically consistent repertoire

最终用于 SH@T 和 downstream proving 的不是 archive 全体，而是预算内生成的、去重后的语义通过集合：

$$G_T=\mathrm{Dedup}\{c\in E_T:\operatorname{CompOK}(c)=1\land \operatorname{SemOK}(c)=1\}.$$

这点很关键：archive 是搜索记忆，$G_T$ 是评估与 prover 输入。论文避免了把“编译可行但语义错误”的 archive stepping stone 直接算作成功。

parent selection：质量与反复使用惩罚

在 island archive $\mathcal{A}_I$ 中，候选 $c_i$ 被选为 parent 的概率是：

$$\Pr(p=c_i\mid\mathcal{A}_I)=\frac{w_i}{\sum_{c_j\in\mathcal{A}_I} w_j}.$$

权重由 gate score、robust normalization 和 usage penalty 组成：

$$ \begin{aligned} s_i &= \operatorname{CompOK}(c_i)(1+\operatorname{SemOK}(c_i)),\\ \tilde{s}_I &= \mathrm{median}_{c_j\in\mathcal{A}_I}(s_j),\\ d_I &= \max(\mathrm{MAD}(\{s_j:c_j\in\mathcal{A}_I\}),\varepsilon),\\ z_i &= (s_i-\tilde{s}_I)/d_I,\\ u_i &= [1+(1+\beta)n_i]^{-1},\\ w_i &= \sigma(\lambda z_i)u_i . \end{aligned} $$

因为 archive 里都已经 compile-ok，所以 $\operatorname{SemOK}=0$ 的候选 $s_i=1$，$\operatorname{SemOK}=1$ 的候选 $s_i=2$。$u_i$ 则惩罚一个 parent 被反复抽中，降低 mode collapse。

concentration metric：不是只看覆盖率

论文特别避免“成功都集中在简单题”的假象。设 $s_j(T)$ 是第 $j$ 个 problem 在预算 $T$ 内获得的 semantic-success candidate 数量，Gini 定义为：

$$\mathrm{Gini}(T)=\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}|s_i(T)-s_j(T)|}{2N\sum_{j=1}^{N}s_j(T)+\varepsilon}.$$

低 Gini 和低 Top-10% share 表示 semantic successes 没有过度堆在少数 easy problems 上。

算法流程：compile-gated archive search

FormalEvolve framework overview — Figure: FormalEvolve framework。Seed model 初始化 archive，patch model 做 full/diff/cross edits，compiler 和 semantic judge 负责过滤，最后把 accepted repertoire 送入 prover。

1. Seedbank 初始化

用 $M_{\mathrm{seed}}$ 采样初始 Lean candidates；每个 seed 也计入 generator-call budget。能 compile 的进入 archive，语义通过的也进入 $G$。

2. Island archive

默认 $K=2$ 个 island，archive capacity 40。parent 和 inspirations 只在当前 island 内采样；每 10 代做 0.1 rate migration，并保护 top-1 elite。

3. LLM variation operators

full 重写整个 statement；diff 做局部编辑；cross 借用 archive 中 inspiration candidate 的 binder/import/type 结构。

4. Bounded repair + EvolAST

compile fail 触发 bounded compilation repair；compile-ok 但 semantic fail 可触发 semantic repair。重复或 compile fail 时还可用 generator-call-free EvolAST 做结构变体。

精简伪代码

Input: informal theorem x, budget T, seed model, patch model
Archive A = compile-feasible candidates
Accepted G = compile-ok and semantic-ok candidates

Seed A with initial candidates; debit calls
while budget remains and A is non-empty:
  sample island I
  sample parent p from A_I using weighted + usage-penalized rule
  sample archive context / inspiration candidates
  propose c via full / diff / cross patch
  if duplicate or compile fail:
      try EvolAST fallback; optionally bounded compile repair
  if compiles:
      judge semantic consistency
      insert compile-ok candidate into archive
      if semantic-ok: add to accepted repertoire G
      else optionally run bounded semantic repair
return Dedup(G)

实验方法与实验设计

实验设计的主线是：在固定 generator-call budget $T=100$ 下比较 archive search 与 no-archive controls，再用固定 prover budget $B=64$ 测试这些 statement repertoires 是否真的能转化成 proof utility。

维度	设置
Benchmarks	ProofNet Lean 4 test split $N=186$；CombiBench $N=100$，作为 domain/style shift。
Generator budget	$T=100$ per problem；seed、proposal、compile repair、semantic repair 都按 generator-side LLM call 记账。
Compiler	Lean 4.15.0 + Mathlib commit `9837ca9`，通过 Kimina Lean Server 编译。
Main models	Seed: Kimina-Autoformalizer-7B；patch/repair: Qwen3-30B-A3B；semantic judge: CriticLean-Qwen3-14B。
Prover	Goedel-Prover-V2-32B；每个 problem 最多接收 64 个 deduplicated semantic-ok candidates。
FormalEvolve config	$K=2$ islands；archive capacity 40；$\lambda=10$, $\beta=0.05$；full/diff/cross = 0.5/0.3/0.2；migration interval 10, rate 0.1；repair max attempts 2。
Baselines	Sample；Compile Repair；Compile+Semantic Repair；Qwen3 no-archive variants；Hybrid control = same Kimina/Qwen3 stack but no archive, no parent reuse, no island, no migration。

为什么 matched Hybrid control 很重要

如果只和 Kimina baseline 比，FormalEvolve 的收益可能来自更强的 Qwen3 repair model。Hybrid control 把 seeding/repair stack 尽量对齐，但去掉 archive search，因此它是判断“archive 是否真的有用”的关键对照。

Prompt / repair 复现要点

初始生成和 evolution prompt 都要求输出完整 Lean 4 file，必须从 import Mathlib 和 import Aesop 开始，且包含 exactly one theorem，结尾是 := by sorry。compile repair 只修编译，不应改变语义；semantic repair 可以改 hypotheses/conclusion，但必须对齐原 informal statement，不能弱化成 tautology。

实验结果及核心结论

1. Statement generation：FormalEvolve 提高 semantic hit，也降低集中度

Method	ProofNet SH@100	ProofNet Gini	CombiBench SH@100	CombiBench Gini
Compile+Semantic Repair	0.780	0.555	0.460	0.813
Hybrid control	0.828	0.505	0.530	0.790
FormalEvolve $K=2$	0.849	0.443	0.580	0.759
FormalEvolve $K=1$	0.866	0.362	0.550	0.726
FormalEvolve w/o patch repair	0.780	0.449	0.470	0.772

最清晰的结论是：相对 matched hybrid，FormalEvolve 在两个 benchmark 上都有 SH@100 增益，并且 Gini 更低。也就是说，它不只是多命中几个题，还让 semantic successes 更不集中。

Semantic coverage vs calls — Figure: Semantic coverage vs generator calls。虚线是 seedbank boundary $t=16$。

2. Patch repair 是最直接的驱动，island/EvolAST 是更细的实现敏感项

移除 patch repair 后，CombiBench SH@100 从 0.580 掉到 0.470，ProofNet 从 0.849 掉到 0.780，说明 bounded repair 是最强的直接贡献。相比之下，$K=1$、$K=2$、w/o EvolAST 的差异更 dataset-dependent：它们说明实现细节重要，但不应该被过度解读成某个 island 设置绝对更优。

3. Downstream proof utility：repertoire 确实能帮 prover，但不是全面碾压

Method	Combi pass@64	Combi complete@64	Combi thm-complete@64	ProofNet pass@64	ProofNet complete@64	ProofNet thm-complete@64
Ours $K=2$	44/100	27/100	13/100	127/186	52/186	45/186
Hybrid control	40/100	22/100	9/100	110/186	49/186	40/186
Sample	41/100	23/100	8/100	106/186	46/186	41/186
C+S Repair	40/100	23/100	8/100	119/186	50/186	46/186
GT reference	96/100	68/100	18/100	146/186	46/186	34/186

相对 Hybrid control，FormalEvolve 的 theorem-complete@64 在 CombiBench 从 9/100 到 13/100，在 ProofNet 从 40/186 到 45/186。但在 ProofNet 上，Kimina Compile+Semantic Repair 的 theorem-complete 是 46/186，略高于 FormalEvolve。这是一个重要边界：repertoire search 有帮助，但具体 prover utility 不是单调支配所有 baseline。

Proof utility curves — Figure: proof utility as function of prover attempts。它比单点 table 更能显示不同方法随 $k$ 增长的行为。

Proof utility decomposition — Figure: theorem-complete@64 decomposition。把失败拆成 empty repertoire、non-empty but unsolved、solved 三类。

4. Manual audit：语义风险没有消失，只是被显式报告了

论文承认 $\operatorname{SemOK}$ 来自 LLM-as-judge，所以做了人工 faithfulness audit。statement-stage matched ProofNet slice 中，FormalEvolve / Hybrid / Sample 的 Faithful/Partial/Severe 分别是 42/5/3、43/4/3、43/3/4。这个结果说明 archive search 没有明显放大 judge-positive semantic shift，但也说明 judge-positive 不是绝对可靠。

我的评论：优势、不足与改进方向

最强处

问题建模很对。 论文把 autoformalization 从 single answer prediction 拉到 budgeted repertoire construction，这比“再调一个 prompt”更有研究价值。它也把 compile gate、semantic judge、downstream prover 三个层次分开，避免把 proof success 和 faithfulness 混为一谈。

最弱处

主张边界需要读得很保守。 实验确实支持 matched no-archive 对照下的 archive benefit，但很多结果仍绑定在 CriticLean-Qwen3-14B、Goedel-Prover-V2-32B、$T=100$、$B=64$ 这套协议上。

我会追问的 reviewer 问题

1. Archive 的收益到底来自“多样性”，还是来自 repair budget 被重新分配？ 论文有 call accounting 和 ablation，但如果要更有说服力，可以进一步把同等 repair composition 的 no-archive control 做得更强。

2. Semantic judge 的 calibration 是否足够？ Manual audit 是必要且加分的，但样本量和抽样策略仍决定了结论边界。尤其 theorem-complete solved pool 中仍有 severe mismatches，说明 proof success 不能当作 faithfulness proxy。

3. EvolAST 的定位需要更锋利。 它作为 call-free fallback 是合理的，但主表里 w/o EvolAST 在 ProofNet SH@100 反而更高。当前证据更像“有时改变轨迹”，而不是稳定核心贡献。

可能的改进方向

加入更强的 diversity metric：不仅按 problem-level success 分布，也按 Lean AST / theorem shape / binder structure 聚类。
对 prover utility 做 cross-prover 测试：同一 repertoire 在 Goedel、Kimina、不同 prompt policy 下是否仍有增益。
把 archive policy 学习化：parent selection 目前比较手工，未来可以把 judge/prover feedback 合成 bandit-style policy。
把 semantic judge 从单模型判定改成 calibration ensemble 或 verifier-augmented judge，降低 judge-positive 的系统性偏差。

One More Thing：这篇对“研究路线”的启发

我觉得这篇最值得带走的不是某个具体数字，而是一个范式：当一个任务的正确输出不是唯一的，且下游系统对输出形式敏感时，单输出评价会天然低估 test-time search 的价值。

这对 autoformalization 很明显，对代码生成、工具调用计划、证明策略生成也类似。一个 candidate 是否“语义正确”只是第一层；但“它是否能成为下游搜索系统的好接口”需要靠 proof utility 表和 prover outcome 读出，不能只从 semantic trace 图推出。

证据边界：下面这张 ProofNet 0000 trace 只说明 statement-stage 机制：一个 compile-ok 但 semantic-fail 的 parent，经过 semantic repair 后变成 compile-ok 且 semantic-ok 的 conjugacy statement。它不直接展示 prover 尝试、proof script 或 complete=True；该 case 的 prover 结果来自论文 case-study 文字和 prover excerpt，而不是这张图。

Statement evolution trace — Case-study mechanism: ProofNet 0000 的 statement evolution trace。图中 $C=1$ 表示 compile-ok，$S=1$ 表示 semantic-ok；semantic repair 去掉额外 conjunction，并引入 $g^{-1}$ conjugation 结构。不要把这张图单独读成 prover success 证据。

Reference / Evidence

公开版只列可公开访问的在线来源；私有源文件、内部目录和非公开运行记录不作为网页证据暴露。

FormalEvolve 公开论文页arXiv:2603.19828。公开版论文标题、摘要、作者与版本入口。

Lean 4 / Mathlib 背景Lean theorem prover; mathlib4 repository。用于理解本文的 proof assistant 与库环境。

ProofNet benchmarkProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics。本文 ProofNet 评测语境的公开基准来源。

CombiBench benchmarkBEq+ / ProofNet# paper and linked benchmark discussion。用于理解 combinatorics-oriented autoformalization evaluation 的公开背景。

Goedel-Prover 背景Goedel-Prover GitHub。本文 proof-stage baseline/prover 语境的公开项目入口。

页面边界正文解读来自论文内容本身；公开网页只暴露公共来源链接，不列本机路径、私有 appendix 文件名或内部实验目录。

维度	设置
Benchmarks	ProofNet Lean 4 test split \(N=186\)；CombiBench \(N=100\)，作为 domain/style shift。
Generator budget	\(T=100\) per problem；seed、proposal、compile repair、semantic repair 都按 generator-side LLM call 记账。
Compiler	Lean 4.15.0 + Mathlib commit `9837ca9`，通过 Kimina Lean Server 编译。
Main models	Seed: Kimina-Autoformalizer-7B；patch/repair: Qwen3-30B-A3B；semantic judge: CriticLean-Qwen3-14B。
Prover	Goedel-Prover-V2-32B；每个 problem 最多接收 64 个 deduplicated semantic-ok candidates。
FormalEvolve config	\(K=2\) islands；archive capacity 40；\(\lambda=10\), \(\beta=0.05\)；full/diff/cross = 0.5/0.3/0.2；migration interval 10, rate 0.1；repair max attempts 2。
Baselines	Sample；Compile Repair；Compile+Semantic Repair；Qwen3 no-archive variants；Hybrid control = same Kimina/Qwen3 stack but no archive, no parent reuse, no island, no migration。

Method	Combi pass@64	Combi complete@64	Combi thm-complete@64	ProofNet pass@64	ProofNet complete@64	ProofNet thm-complete@64
Ours \(K=2\)	44/100	27/100	13/100	127/186	52/186	45/186
Hybrid control	40/100	22/100	9/100	110/186	49/186	40/186
Sample	41/100	23/100	8/100	106/186	46/186	41/186
C+S Repair	40/100	23/100	8/100	119/186	50/186	46/186
GT reference	96/100	68/100	18/100	146/186	46/186	34/186