FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

先给结论

这篇最值得看的不是“LLM 会出题”，而是它把开放式编程题变成了一个可筛选、可执行、可训练的数据漏斗。

FrontierSmith 的核心问题是：如果真实 coding agent 任务不是 pass/fail，而是“很多方案都可行，但质量连续变化”，那训练数据怎么规模化来？它的答案不是凭空让模型写题，而是从 closed-ended competitive programming seed 出发，经过 formulation mutation、coarse filter、 idea divergence、test/verifier synthesis、execution-based rerank，再进入 GRPO 训练。

复现边界必须先说清楚： 官方 GitHub 公开了训练代码、评测代码、10 个 synthetic algorithmic problems、sample manifests 和若干复现实验入口；但 repo 明确说明 orchestrator 和 LLM-driven test/checker generators 被 intentionally withheld。所以这不是一个完整可复刻“200 题生成流水线”的 release，更像是训练/评测与 10 题 parity experiment 的 release。

我认为成立的强结论

把 closed-ended 题变异成 open-ended optimization task，再用 solution-diversity signal 过滤，比直接拿 closed-ended HardTests 做 RL 明显更有效。

我不会替它夸大的结论

它还没有证明“任意 open-ended agent environment 都能被自动合成”；当前范围主要是 self-contained algorithmic environments。

研究动机

为什么 closed-ended coding 不够？因为真实优化问题通常没有一个可廉价证明的最优答案。

closed-ended coding 任务的奖励结构很清楚：单元测试全过就是 1，否则就是 0；competitive programming 也通常有明确的正确性判定。但 open-ended coding 任务的困难在于：可行解很多，目标值连续变化，且在目标规模下很难验证全局最优。例如 scheduling、route planning、layout、heuristic contest、系统优化等任务，重点不是“有没有唯一正确答案”，而是解的质量、鲁棒性和计算代价。

论文给出的背景差距很直接：FrontierCS 的 algorithmic tasks 上，人类专家得分约 95.41，而 Gemini 3.0 Pro 约 29.37； ALE-bench 这类 heuristic contest 风格任务上，frontier LLM 也落后于平均人类参赛者。另一方面，FrontierCS/ALE-bench 这种 open-ended 数据很少： FrontierCS 约 240 个问题，论文训练/评测只使用其中 172 个 local-computation algorithmic problems；ALE-bench 人工 curated 任务约 40 个。

FrontierSmith pipeline overview — 论文 teaser 图：从 closed-ended coding seeds 出发，经过 mutation、filtering、test/verifier construction，得到连续奖励的 open-ended problems。

数学表示及建模

FrontierSmith 把“题目”先拆成 formulation，再用 idea divergence 判断它是不是会诱发多种策略。

1. Problem formulation

论文把一个 problem formulation 表示为三元组：

\[ c = (\mathcal{O},\mathcal{C}_I,\mathcal{C}_O) \]

其中 \(\mathcal{O}\) 是 computational goal，\(\mathcal{C}_I\) 是 admissible input instances， \(\mathcal{C}_O\) 是 valid program outputs 的约束。open-ended mutation 的目标，就是让原来有清晰判定/高效解法的 formulation 变成“没有廉价最优证书、但仍有连续质量评分”的 formulation。

2. Idea divergence

如果一个开放式题虽然有连续分数，但所有强解都沿着同一个 dominant trick 走，它对训练的价值会下降。 FrontierSmith 因此定义 idea divergence：两个独立 sampled solutions 使用不同 algorithmic strategy 的概率。

\[ d(c) \coloneqq \mathbb{P}_{s_i,s_j \sim \mathrm{Solver}(c)} \left[\mathrm{strategy}(s_i) \neq \mathrm{strategy}(s_j)\right] \]

实际计算 \(d(c)\) 不可行，所以论文用了两种估计。

LLM-based estimate

先为 candidate \(c\) 采样 \(n\) 个解，然后让 LLM-as-a-Judge 判断每对解是否属于不同策略：

\[ \hat d(c)=\frac{1}{\binom n2}\sum_{i \lt j}\texttt{Judge}(s_i,s_j) \]

Execution-grounded estimate

有了 test cases 和 verifier 之后，把每个解在 \(m\) 个测试上的得分写成向量 \(\mathbf q_i\)，再比较行为差异：

\[ \hat d(c)=\frac{1}{\binom n2}\sum_{i \lt j}\frac{1}{\sqrt m}\lVert\mathbf q_i-\mathbf q_j\rVert_2 \]

Idea divergence classifier plot — 论文 divergence plot 左侧：LLM-based idea divergence 把 FrontierCS、FrontierSmith、ALE-bench 这类 open-ended sources 与 HardTests 区分开。

3. Verifier reward

verifier 把 objective value 归一化到连续 reward。论文给出的一种常见写法是用 baseline solution \(s^*\) 做相对改进：

\[ \mathcal{V}_c(s,t)= \max\left(0,\sigma_{\mathcal{O}}\cdot \frac{\mathcal{O}(s,t)-\mathcal{O}(s^*,t)} {\max\{\mathcal{O}(s,t),\mathcal{O}(s^*,t)\}}\right) \]

这里 \(\sigma_{\mathcal{O}}=+1\) 表示 maximization，\(-1\) 表示 minimization。 crash、timeout、unparseable output 都给 0。这一点很关键：FrontierSmith 不是只写题面，还要把题变成 RL 可用的 scoring environment。

算法流程 / 方法

整个 pipeline 可以看成一个“生成很多，严筛少数，再把少数做成可执行环境”的闭环。

1Seed pool

用 closed-ended CP corpus 初始化，并在每轮加入已通过验证的 synthetic problems。

2Mutate

LLM 抽取原 formulation，并沿三类 mutation 生成 open-ended candidate。

3Coarse filter

检查是否有 optimization objective、是否无已知 optimum、是否可产生 meaningful ranking。

4Sample solutions

每个候选采样 n = 10 个 solution，用于估计 strategy diversity。

5Idea divergence

LLM judge 先选出 top N_div；有 verifier 后再用 execution divergence 重排。

6Build environment

test-case agent 与 verifier agent 互相 cross-validate；不收敛的 candidate 丢弃。

7Select final

每轮取 top N_final = 50，加入 seed pool 进入下一轮。

三类 mutation

Mutation	形式	直觉	论文例子
Changing goals	\(\mathcal{O}\to\mathcal{O}'\)	把 decision/exact-answer goal 改成 graded optimization goal。	2-SAT 变成 Min-True 2-SAT：不只是 satisfiable，而是满足同时最小化 true variables。
Restricting outputs	\(\mathcal{C}_O\to\mathcal{C}_O'\)	保留目标，但收紧有效输出的约束，让精确最优在规模上变难。	MST 加 degree constraints，变成 degree-constrained spanning tree。
Generalizing inputs	\(\mathcal{C}_I\to\mathcal{C}_I'\)	放宽输入结构假设，让原来的多项式特殊情形退回一般困难问题。	bipartite graph 上的 maximum independent set 推广到 arbitrary graph。

coarse filter 筛什么？

论文的 coarse LLM-as-a-judge filter 不是复杂 verifier，而是一个早期降噪器，主要检查三件事：

Objective

是否定义了 optimization objective，且没有已知可廉价获得的 optimum。

Strategy diversity

是否存在多种 plausible strategies，而不是一个明显 dominant strategy。

Scoring

是否能设计 scoring function 来 meaningful rank submissions。

test/verifier cross-validation

FrontierSmith 为每个 surviving candidate 生成测试输入集合 \(\mathcal{T}_c\) 和 scoring program \(\mathcal{V}_c\)。 test-case agent 会利用 sampled solutions 设计不同规模、结构和 adversarial cases； verifier agent 则检查 score 是否崩成窄带、是否违反相对质量判断、是否对 crash/timeout/parse error 处理正确。两个 agent 用彼此输出反向暴露自身错误，迭代到一致；论文报告进入该阶段的 candidates 中约 10% 最终得到 validated pair。

实验设计

主实验不是只看生成题好不好，而是看它能不能作为 RL 训练数据提升 open-ended coding benchmark。

模块	设置	为什么重要
Benchmarks	FrontierCS 172 algorithmic local-computation tasks；ALE-bench-lite 10 tasks。	分别覆盖 FrontierCS 风格开放式算法任务和 AtCoder Heuristic Contest 风格任务。
Baselines	Base、FrontierCS(172)、ALE-bench(40)、HardTests(200)、Random Reward(172)。	区分“开放式 formulation 有用”“闭集题本身有用”“随机 RL 动力学也许有用”这几种解释。
Synthesis	4 iterations；每轮 \(B=1000\)，\(N_\text{div}=100\)，\(N_\text{final}=50\)，最终 200 synthetic problems。	说明它不是一次 prompt 生成，而是 iterative seed-pool expansion。
LLM roles	GPT-5.4 Thinking medium 用于 mutation、coarse filter、LLM divergence；Claude Sonnet 4.6 用于 sample solutions、test generation、verifier generation。	也意味着生成质量依赖强 LLM 和未公开 orchestrator。
Training	veRL + GRPO；9B LR \(10^{-6}\)，27B LR \(5\times10^{-7}\)；rollout batch size 8，group size 8；100 steps。	奖励来自 generated verifier，而不是 binary pass/fail。
Compute	Qwen3.5-9B: 8 A100, 1.5 days, max response 16k；Qwen3.5-27B: 32 H200, 1.5 days, max response 32k。	结果不是低成本复现实验；训练算力规模需要明确。

实验结果

主要结论：FrontierSmith 合成数据明显强于直接训练 closed-ended HardTests，并在 27B 上超过 human-curated FrontierCS 训练。

Main results

Training Data	Qwen3.5-9B				Qwen3.5-27B
Training Data	FrontierCS Avg@5	FrontierCS Best@5	ALE Avg@5	ALE Best@5	FrontierCS Avg@5	FrontierCS Best@5	ALE Avg@5	ALE Best@5
Base	1.80	5.00	327.22	327.22	7.70	13.91	352.52	498.00
FrontierCS (172)	11.17	16.29	558.49	762.28	13.98	21.92	543.80	843.80
ALE-bench (40)	10.64	15.70	657.40	750.15	-	-	-	-
FrontierSmith (200)	10.62	15.73	633.58	782.30	19.82	29.38	661.64	938.10
HardTests (200)	5.38	11.19	397.18	512.83	11.20	18.37	529.12	922.50
Random Reward (172)	3.04	6.93	376.82	466.73	-	-	-	-

对 9B，FrontierSmith 在 FrontierCS 上 10.62 Avg@5，接近 human-curated FrontierCS 训练的 11.17；在 ALE-bench 上 633.58 Avg@5，高于 FrontierCS 训练的 558.49，低于 in-domain ALE-bench 训练的 657.40。对 27B，FrontierSmith 在 FrontierCS/ALE-bench 两个 benchmark 上都超过 FrontierCS 训练 baseline。

Filter ablation

论文报告 Qwen3.5-9B 上 FrontierSmith 的 FrontierCS Avg@5 为 10.62，而 No Filter 为 8.57； ALE-bench 上 FrontierSmith 633.6，No Filter 564.4。这说明 mutation 本身不够，idea-divergence/coarse filtering 对训练数据质量有实际贡献。

coarse filter sanity check

对 100 个 closed-ended HardTests，coarse filter reject 91、retain 9，false-positive rate 9%。

open-ended recall check

对 100 个 FrontierCS，coarse filter reject 19，false-negative rate 19%；论文认为可接受，因为目标偏 precision。

Long-horizon behavior

Long-horizon agent behavior plot — long-horizon plot：ALE-bench 与 FrontierSmith 更容易让 agent 进入多 turn、多 token 的长程解题行为；HardTests 和 FrontierCS 在该实验里更短。

long-horizon 实验从 FrontierCS、FrontierSmith、HardTests、ALE-bench 各采样 10 个任务，通过 Harbor 跑 Claude SDK Sonnet 4.6、 Codex GPT-5.5、Kimi Code K2.6。论文称 ALE-bench 对所有三个 agents 都超过 100 turns 和 \(3\times10^6\) tokens； FrontierSmith 也表现出类似长程行为，其中 Claude SDK 达到 113 turns 和 \(6.3\times10^6\) tokens。

附录里的两个具体题

合成题长什么样？重点是从“判定正确”转成“可行解质量越高越好”。

Concat Factory Compression Challenge

输入是一组 target strings。参赛者输出一个由 literal creation 和 concatenation 组成的小程序，构造所有目标串。 feasible solution 需要语法正确、引用已有变量、长度受限；objective 是让 command 数量和 literal 总长度尽量小。 baseline 是不复用地切块构造，score 用 \(B/C\) 形式鼓励压缩和复用。

Yae Village Defense Network

输入包含村庄图、建路预算、每日巡逻时间和攻击事件。参赛者选择建哪些路、每天怎么巡逻，以 minimize uncleared attacks 的 damage。 feasible solution 需要预算、路径连通性、每日时间限制都满足；score 相对 baseline damage 归一化。

这两个例子能看出 FrontierSmith 的题型倾向：不要求输出唯一答案，而要求构造一个满足约束的方案，并用连续 reward 比较方案质量。

我的评论

我会把 FrontierSmith 放在“self-evolving agent 的任务/环境课程生成器”类，而不是 agent 自我记忆或技能优化类。

优点

它抓住了 open-ended coding 数据稀缺的关键工程点：题面、测试、verifier、reward、solution diversity 缺一不可。 idea divergence 也比“LLM judge 觉得像开放式”更具体，因为它绑定到 sampled solutions 的 strategy diversity 和 execution behavior。

风险

LLM judge、orchestrator、test/checker generators 都是关键组件；其中 orchestrator 和 LLM-driven generators 未公开。这使得外部读者很难判断 200 题生成过程中的失败样本、prompt 细节、人工干预、retry 策略和 selection bias。

我最关心的复现问题

问题	为什么重要	当前公开材料状态
200 题生成流水线能否复跑？	决定 synthesis claim 的外部可验证性。	不能完整复跑；官方 repo 明确 withheld orchestrator 和 LLM-driven test/checker generators。
公开的 synthetic problems 是否足够审计？	10 题可以检查题目形态，但不足以代表全部 200 题分布。	repo 公开 10 个 problems，对应 Frontier-CS IDs 306-315。
LLM-as-a-Judge 是否稳定？	idea divergence 排序依赖 judge 对 strategy difference 的判断。	论文给出 classifier/ablation 证据，但外部复现仍依赖 prompt 和 orchestrator 细节。
训练收益是否来自题面 exposure？	如果只是格式 exposure，Random Reward 也会涨。	Random Reward baseline 接近 Base，支持“真实 reward signal 是必要的”。

One More Thing

和 self-evolving agents 的关系：它更像 curriculum factory，而不是 autonomous agent 本体。

如果把 self-evolving coding agent 拆成“agent policy”“memory/skill update”“environment/task distribution”“evaluator/reward”几层， FrontierSmith 主要贡献在 environment/task distribution 与 evaluator/reward 层：它自动制造会迫使模型探索多种策略的开放式 coding tasks。这对 lifelong agent adaptation 很有价值，因为 agent 是否退化、是否保留能力，最终都要靠一个能暴露长程行为差异的任务分布来测。

但它不是 DGM-lite 那种 agent 自己改自身代码/技能的循环，也不是 prompt memory 优化器。更准确的归类是：coding benchmark / synthetic curriculum / open-ended environment generation。所以放在博客的 self-evolving-agent 下没有问题，但二级分类最好落在 coding-benchmark，而不是 agent-memory 或 tool-use。

Reference / Evidence

arXiv abstract 论文摘要、作者、版本和主要结果。 arXiv PDF 本文方法、实验表格、图和附录样例的主要阅读来源。 Official GitHub 训练/评测代码、10 个公开合成题、复现实验脚本与发布边界。 Frontier-CS blog 作者团队发布的项目解读与 release 说明。 Hugging Face model 公开的 Qwen3.5-9B FrontierSmith 模型卡。