What are the Right Symmetries for Formal Theorem Proving? 论文精读

先给结论

这篇真正问的不是“怎么多采样”，而是 prover 应该对什么保持不变

这篇论文的中心问题是：在 Lean formal theorem proving 中，同一个数学内容可以有很多等价表述，那么 LLM prover 的成功率到底应该对哪些表述变化保持不变？作者的答案不是普通 group symmetry，而是 rewriting categories：Lean tactic 引发的变换通常可组合、可验证，但不一定可逆。

问题Representation sensitivity

同一数学内容换一种 Lean 写法，prover 成功率可能大幅变化。

建模Rewriting category

把 theorem statement 当对象，把 tactic-induced rewrite 当 arrow。

目标Success invariance

等价 statement 应该有相同或近似相同的 proof success probability。

方法Test-time ensemble

在推理时生成等价改写，把固定 budget 分配到多个表述上。

一句话压缩

FormalEvolve 说“一个 informal theorem 不止一种 faithful Lean 表达，所以要构造 prover-effective repertoire”；这篇说“同一个 formal theorem 的等价表述也不止一种，prover 的成功率不该被表面写法支配”。前者重点在 autoformalization output space，后者重点在 proving input symmetry。

研究动机：形式证明里的 symmetry 为什么不是普通不变性

传统几何深度学习里，symmetry 往往被建模为 group action，例如旋转、平移、置换。但 Lean 里的 theorem transformation 不是这么干净：tactic 会把 proof state 从一个状态推到另一个状态，这个过程可组合，却经常不可逆，也可能丢失信息。比如把更强的目标变成更弱的目标，反向并不自动成立。

因此，形式定理证明里“正确的 symmetry”不是“所有等价表达都由一个 group 操作得到”，而是：哪些 Lean 可验证的 rewrite 可以把一个 statement 与另一个 statement 放进同一个可证明等价类？论文用 category 而不是 group 来承载这个结构。

Figure 1 evidence：DeepSeek-Prover-V2 对语义保持的表述改写非常敏感。例如同一类 statement 的成功率可从 32/64 降到 3/64。caption 只说明原图显示的表述敏感性。

这里的核心风险

如果 prover 的成功率主要取决于 Lean statement 的表面形式，那么 benchmark score 就会混入“写法运气”。一个模型看起来不会证明，可能只是不会处理当前表述；一个模型看起来会证明，也可能只是撞中了熟悉的形式。

数学表示：rewriting category、proof equivariance、success invariance

论文先把 Lean proving 写成状态转换问题。设 \(\mathsf{Stmt}\) 是合法 theorem statements， \(\mathcal T\) 是 tactics，\(\Omega\) 是 Lean states。一个 statement \(T\) 被 Lean elaborator 变成初始状态 \(\mathcal L(T)\)。 tactic sequence \(\mathbf t=(t_1,\dots,t_k)\) 是 \(T\) 的 proof，当且仅当：

\[ \mathbf t(\mathcal L(T))=(t_k\circ\cdots\circ t_1)(\mathcal L(T))=\varnothing. \]

对 whole-proof LLM prover，模型给出 proof script 分布 \(p_\theta(\cdot\mid T)\)，成功率是：

\[ s_\theta(T)=\sum_{\mathbf t}p_\theta(\mathbf t\mid T)\mathbf 1[\mathbf t(\mathcal L(T))=\varnothing]. \]

1. Rewriting category

一个 Lean rewriting category \(\mathscr R\subseteq\mathscr R_{\mathcal T}\) 的 objects 是 theorem statements， arrows 是由指定 tactic set 诱导的 statement transformations。若某个 tactic sequence 把 \(\mathcal L(T)\) 变成 \(\mathcal L(T')\)，就有 arrow \(T\to T'\)。composition 对应 tactic sequence 拼接。

2. Proof equivariance

若 \(T\xrightarrow{\mathbf r}T'\)，那么 \(T'\) 的 proof 可以通过前接 \(\mathbf r\) lifting 成 \(T\) 的 proof。 proof equivariance 要求 proof distribution 沿着 arrow 的传输是一致的：

\[ p_\theta(\cdot\mid T')=\mathcal P(\mathbf r)(p_\theta(\cdot\mid T)). \]

这很强，要求模型不只是成功率相同，还要 proof 分布结构可以正确转移。论文指出 state-based next-tactic provers 更接近这种归纳偏置，但 whole-proof LLM provers 通常不满足。

3. Success invariance

论文主要实证处理的是较弱但更实用的 success invariance。若 \(T\sim_{\mathscr R}T'\)，也就是二者在 \(\mathscr R\) 中互相可达，那么理想 prover 应满足：

\[ s_\theta(T)=s_\theta(T'). \]

为什么这个定义重要

它把“同义改写鲁棒性”从模糊直觉变成了可评价性质：先定义可接受的 rewrite category，再检查 prover 在这些 equivalence classes 上的成功率是否稳定。

方法：test-time rewriting ensemble 如何近似 invariance

作者没有重新训练 prover，而是在 test time 构造新 prover \(\bar p_\theta\)：给定原始 statement \(T\)，采样 \(K\) 个等价改写 \(T_1,\dots,T_K\)，把总 proof budget \(N\) 平均分给这些 variants。任何 variant 上找到的 proof 都通过 rewrite arrow lift 回原始 \(T\)。

\[ \bar s_{\theta,\mu}^{(K,N)}(T)=1-\mathbb E_{T_1,\ldots,T_K\sim\mu(\cdot\mid T,K)} \left[\prod_{i=1}^K(1-s_\theta(T_i))^{\lfloor N/K\rfloor}\right]. \]

理论部分有两个要点：如果 sampler 在样本数极限下覆盖整个 equivalence class，则 ensemble 的成功率会恢复 success invariance；在 reciprocity 和独立成功率先验下，固定总 budget 时，分散到更多 rewrites 在期望上不差。

Appendix Figure：test-time augmented Lean statement sampling。图中流程分为 rewriting tactic sampling、augmented statement composition、energy-function reranking。

在线采样的三步

步骤	具体机制	为什么需要
Sampling	从 Mathlib 中超过 30,000 个 unconditional lemmas 里采样 \(P=Q\) 或 \(P\Leftrightarrow Q\) 类 rewrite，并加入 commutativity / simp / ring_nf 等规则。	保证语义保持，同时比纯 AST symmetry 更丰富。
Composition	对 theorem 中多个 expressions 独立采样改写，再把 proof state 转成 theorem statement。	避免每次只改一个局部，提升表述多样性。
Reranking	用 theorem surprise \(\mathrm{{ThSur}}\) 选择更自然的 7 个 variants，加上 seed 形成 8-way ensemble。	用低成本 proxy 近似 prover 对 statement 的熟悉程度。

实验设计：miniF2F-rw 与跨 benchmark test-time evaluation

论文构造了 miniF2F-rw：基于 miniF2F 的 488 个问题，每个 seed 配 5-15 个语义保持改写，最终得到 5,726 个 theorems，其中 validation 2,922 个，test 2,804 个。每个 variant 都配有 Lean certificate，保证 variant 与 seed 可以互相转化；作者还人工过滤掉改变数学内容、过度复杂化或过度简化的 variants。

Appendix Figure：miniF2F-rw 数据集生成流程。图本身显示 GPT-5.4 生成候选、人工过滤、Lean certificate 验证和最终 dataset 统计。

评估设置	含义	用途
seed	所有 \(k\) 次尝试都跑原始 theorem statement。	对应普通 miniF2F 风格 baseline。
random	随机选一个 miniF2F-rw variant，所有 budget 跑它。	直接测 representation sensitivity。
controlled	从已验证 variant pool 中抽 8 个，budget 均分。	在离线 verified rewrite 池中测 ensemble 上限。
test-time	在线生成 seed + 7 个 variants，budget 均分。	真实推理时不依赖预计算 rewrite 池。

主实验模型包括 DeepSeek-Prover-V2-7B、Goedel-Prover-DPO、Goedel-Prover-SFT，附录还检查了 reasoning prover。额外 benchmark 包括 ProofNet 和 Ineq-Comp，后者的 trans split 专门测表述变换鲁棒性。

实验结果：随机改写会伤模型，ensemble 能把伤害变成收益

1. miniF2F-rw：random rewrite 暴露 invariance failure

所有 tested models 在 random perturbation 下都比 seed 差，说明 current LLM provers 对等价表述不稳定。例如 DeepSeek-Prover-V2 在 miniF2F-rw-valid 的 PASS@8 从 78.4 降到 71.1，下降 7.3 points；在 test 上从 61.7 降到 59.7，下降 2.0 points。

Model / Split	Seed PASS@64	Random PASS@64	Controlled PASS@64	Test-time PASS@64	读法
Goedel-Prover-SFT valid	64.8	60.9 ± 1.1	63.4 ± 0.2	63.6	ensemble 追回大部分 random 损失。
Goedel-Prover-DPO test	59.0	56.4 ± 1.1	59.9 ± 0.2	57.1	controlled variant pool 强于 seed，online 仍弱。
DeepSeek-Prover-V2 valid	79.5	75.3 ± 1.2	80.8 ± 0.4	81.5	test-time ensemble 超过 seed。
DeepSeek-Prover-V2 test	66.8	64.8 ± 1.1	68.8 ± 0.3	67.6	controlled 和 test-time 都改善鲁棒性。

2. ProofNet / Ineq-Comp：test-time ensemble 在 trans split 上最有价值

Model	Budget	ProofNet-test	Ineq-Comp-seed	Ineq-Comp-trans	Ineq ratio
EvolProver	32	--	52.2	34.0	65.2%
DeepSeek-Prover-V2-7B	64	23.1	66.7	37.3	55.9%
DeepSeek-Prover-V2-7B + Ensemble	64	23.2	64.7	45.5	70.3%

最强结果在 Ineq-Comp-trans：DeepSeek-Prover-V2 的 PASS@64 从 37.3% 到 45.5%。注意 seed split 上 ensemble 反而从 66.7 到 64.7，说明这不是“所有地方都更高”的魔法，而是专门缓解 transformation sensitivity。

Appendix Figure：ensemble size ablation。图本身显示在 miniF2F-rw 上，2、4、8 个 variants 的曲线通常随 ensemble size 增大而提高，并与 seed/random baseline 对照。

重点对比：它和 FormalEvolve 的区别在哪里

用户特别要求比较这篇和现有 FormalEvolve paper2html。两篇都在形式化数学里讨论“等价/多样表述”，但问题位置完全不同。 FormalEvolve 的对象是 informal theorem 到 Lean statement 的候选集合； Right Symmetries 的对象是 已经是 formal statement 后，prover 对等价 rewrite 的行为是否稳定。

维度	FormalEvolve	Right Symmetries for FTP	真正差异
任务阶段	Autoformalization：informal math → Lean candidates。	Formal theorem proving：Lean statement → proof script。	前者改“翻译输出”，后者改“prover 输入”。
核心失败模式	Single-output evaluation 低估 faithful Lean 表达的多样性。	同一 formal content 的表述变化导致 proof success 波动。	前者是候选空间不够，后者是成功率不 invariant。
数学抽象	CompileOK / SemOK / semantically consistent repertoire / concentration。	Rewriting category / proof equivariance / success invariance。	前者偏搜索协议，后者给出 symmetry 理论语言。
正确性门	Lean compiler + LLM semantic judge + manual audit。	Lean tactic-induced rewrites + Lean certificates + manual filtering。	FormalEvolve 的 semantic judge 是关键边界；本篇更强调 proof-transfer certificate。
搜索形态	Archive search：seedbank、islands、LLM patch、repair、EvolAST。	Test-time ensemble：rewrite sampling、statement composition、theorem-surprise reranking。	一个维护长期候选 archive，一个临场生成等价 variants 分配 proof budget。
下游指标	Semantic hit、Gini/concentration、theorem-complete@64、manual faithfulness audit。	PASS@k、random rewrite drop、controlled/test-time ensemble gain、Ineq trans/seed ratio。	FormalEvolve 问 repertoire 是否 prover-effective；本篇问 prover 是否 representation-robust。
最值得接的研究点	如何生成更多 faithful 且 prover-friendly 的 Lean statements。	如何让 prover 对 rewrite category 下的等价类保持稳定成功率。	两者可以拼成“先生成 repertoire，再做 symmetry-aware proving”。

我会怎么把两篇放进同一条研究线

FormalEvolve 可以看作 statement-level repertoire construction：给一个 informal theorem，构造一组 compile-ok、semantic-ok、prover-useful 的 Lean 入口。 Right Symmetries 可以看作 equivalence-class-aware proving：给一个 Lean theorem，围绕它的 rewrite equivalence class 分配 inference budget。如果要做下一步工作，最自然的组合是：用 FormalEvolve 产生多个 faithful formalizations，再对每个 formalization 做 rewriting ensemble，最后以 proof success、faithfulness 和 compute cost 三者一起评价。

不要混淆的点

Right Symmetries 在 related work 里说 FormalEvolve 证明“diverse reformulations benefit autoformalization”，但它并不是 FormalEvolve 的替代品。它扩展的是 proof generation 阶段：当 formal statement 已经存在，如何让 prover 不被 surface form 卡住。

我的评论：强点、边界和可追问问题

最强处

这篇最强的是把“形式证明对表述敏感”提升成了一个清晰的 symmetry 问题。它没有只说 LLM prover 不鲁棒，而是定义了 prover 应该满足的结构性质，并给出一个不改模型的 test-time 近似方案。

主要边界

第一，test-time ensemble 的收益依赖 rewrite sampler 的质量；如果 sampler 生成的 variants 都很陌生或过度复杂，budget 分散可能伤害 seed performance。第二，state-to-statement conversion 仍用了 GPT-5.4，这在工程上方便，但把一部分 pipeline 质量放到了外部 LLM 上。第三，论文主要实验集中在 non-reasoning prover；附录显示 reasoning prover 也敏感，但增益幅度和成本 tradeoff 仍值得更系统分析。

我会追问的 reviewer 问题

theorem surprise 是否真的稳定预测 proof success，还是只是在 DeepSeek/Goedel 这类模型上暂时有效？
如果把 state-to-statement conversion 换成本地小模型或 deterministic pretty-printer，性能损失多大？
rewrite category 的选择会不会引入 benchmark-specific bias？例如 miniF2F-rw 中人工过滤策略对结论有多大影响？
对 FormalEvolve 产生的 repertoire 再做 rewriting ensemble，收益是叠加、重叠，还是互相抵消？

One More Thing：这篇对形式化数学路线的启发

这篇给 autoformalization / ATP 研究一个很重要的方向：未来不能只问“模型会不会证明这个 statement”，还要问“这个 statement 所在的 equivalence class 里，模型是否稳定地会证明”。这会改变 benchmark 设计：一个题不应该只有一个 Lean statement，而应该有可验证的 rewrite family。

对你的研究线来说，这篇和 FormalEvolve 可以形成一个很好的二段式：

1Informal theorem自然语言数学题

2FormalEvolve-style repertoire多个 faithful Lean candidates

3Rewriting category每个 candidate 的等价改写类

4Symmetry-aware prover跨 variants 分配 proof budget

5Auditfaithfulness + proof success + cost

这个组合比单纯“多采样”更有研究味道，因为它把候选生成、多样性、语义一致性、证明成功率和计算预算放进了同一个可审计闭环。

Reference / Evidence

arXiv abstractWhat are the Right Symmetries for Formal Theorem Proving? arXiv PDF32-page public paper PDF, retrieved 2026-05-29. Code repositoryrw-ensembles: rewriting ensembles and miniF2F-rw dataset implementation. Local comparison pageFormalEvolve paper2html page used as conceptual comparison target.