AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

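Bottom line first

Of the papers in this Drift Monitor set, AgentXRay is the one that best fits the phrase "humans do not fully understand." It does not promise to recover the true internal mechanism; instead it reconstructs an editable, interpretable surrogate workflow from input-output behavior. For a self-evolving agent, this can become the core material for release review: what exactly changed in the workflow structure between the versions before and after self-modification.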

0.426: five-domain average SFE for AgentXRay All Tools

8-22%: token reduction from Red-Black Pruning

ICML: arXiv v3 is marked "Accepted at ICML 2026"

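Core judgment: a Drift Monitor cannot just hand the human a "high risk / low risk" label. AgentXRay offers a more reviewable object: the workflow delta \(W_t\to W_{t+1}\). The human does not need to read the full token trace, but can check whether the role, tool, thought pattern, or workflow length has changed in a dangerous way.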

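Version note: this page was built from the public arXiv v3 PDF/source, a TeX/source inventory, figure and table extraction, and reconstruction of the experiment tables; the retrieval date is 2026-05-26. It is a paper2html deep-reading note and does not claim to have completed an independent reproduction of the experiments.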

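Motivation

Many high-performing agentic systems are black boxes to their users: we can see the input and the final output, but not the internal division of roles, the order of tool calls, the prompts, the coordination topology, or the intermediate trajectories. Even when a system has an explicit multi-agent architecture, in real deployments it is often exposed only through an API, a packaged product, or a closed-source service.

This is a problem for safety. A user who cannot understand the decision process can hardly judge whether the system is controllable. Downstream adaptation is also hard, because it is unclear which sub-capabilities the system actually relies on to complete a task. It is worse for a self-evolving agent: after self-modification, not only may the outputs change, the internal workflow may also change silently.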

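AWR concept figure: a black-box agentic system is approximated as an editable white-box workflow using only input-output pairs.

AgentXRay's approach is to propose Agentic Workflow Reconstruction (AWR): given a set of input-output samples from a black-box system, synthesize an explicit white-box surrogate workflow that comes as close as possible to the black-box system on observable output metrics. The word surrogate matters here: it is not the "ground truth" but a behavioral approximation that can be edited, reviewed, and compared.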

Mathematical formulation and modeling

AWR task

The black-box agentic system is written as:

\[ \mathcal{M}_{black}:\mathcal{X}\rightarrow\mathcal{Y} \]

We can only access an input-output dataset:

\[ \mathcal{D}=\{(\tau_i,o_i^*)\}_{i=1}^{N}, \quad o_i^*=\mathcal{M}_{black}(\tau_i) \]

The goal is to synthesize a white-box workflow \(\mathcal{W}\) that produces outputs similar to \(o_i^*\) on the same inputs.

Unified primitive space

AgentXRay unifies agents and tools into a single primitive space \(\Omega\). Each primitive is:

\[ p=\langle \rho,\mu,\pi,T_{local}\rangle \]

Symbol	Meaning	Role in workflow reconstruction
\(\rho\)	role	A functional role such as researcher, coder, or reviewer.
\(\mu\)	base model	Determines the model capability of the primitive.
\(\pi\)	thought pattern	For example CoT, self-reflection, or plan-and-solve.
\(T_{local}\)	attached toolset	May be empty; when non-empty, it denotes a tool-augmented agent primitive.

A workflow is restricted to a linear sequence of length at most \(L_{max}\):

\[ \mathbf{s}=[s_1,s_2,\ldots,s_L], \quad s_j\in\Omega,\quad 1\le L\le L_{max} \]

The optimization objective is:

\[ \mathbf{s}^* = \operatorname*{arg\,max}_{\mathbf{s}\in\Omega^{\le L_{max}}} \mathbb{E}_{(\tau,o^*)\sim\mathcal{D}} \left[ \mathrm{Sim}\left(\Phi(\mathbf{s},\tau),o^*\right) \right] \]

Here \(\Phi(\mathbf{s},\tau)\) denotes executing workflow \(\mathbf{s}\) on input \(\tau\), and \(\mathrm{Sim}\) is a task-specific output similarity metric.
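A minimal sketch of this objective, assuming a small explicit candidate list in place of the real search. `toy_execute` and `toy_sim` are hypothetical stand-ins for \(\Phi\) and \(\mathrm{Sim}\), and primitives are reduced to plain name strings; none of these names come from the paper.

```python
from statistics import mean

def empirical_objective(workflow, dataset, execute, sim):
    """Average Sim(Phi(s, tau), o*) over the input-output dataset D."""
    return mean([sim(execute(workflow, tau), o_star) for tau, o_star in dataset])

def best_workflow(candidates, dataset, execute, sim):
    """Arg-max over an explicit candidate list (the real search space is far larger)."""
    return max(candidates, key=lambda s: empirical_objective(s, dataset, execute, sim))

# Hypothetical stand-in for Phi: each primitive appends its name to the task text.
def toy_execute(workflow, task):
    return " ".join([task, *workflow])

# Hypothetical stand-in for Sim: Jaccard overlap of whitespace-separated tokens.
def toy_sim(output, target):
    a, b = set(output.split()), set(target.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

dataset = [
    ("sort list", "sort list plan code"),
    ("plot data", "plot data plan code"),
]
candidates = [("plan",), ("plan", "code"), ("code", "review")]

best = best_workflow(candidates, dataset, toy_execute, toy_sim)
print(best, empirical_objective(best, dataset, toy_execute, toy_sim))
```

Enumerating candidates only works for a toy space; the point is that the objective needs nothing beyond black-box outputs and a similarity function.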

Linearity hypothesis

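Searching over arbitrary graph topologies is very expensive; the paper gives an intuitive order of magnitude of \(O(2^{|\Omega|^2})\). AgentXRay therefore adopts a linearity hypothesis: even if the real system is graph-shaped, multi-agent, or driven by a complex controller, actual execution is usually serialized into a time-ordered trace. AWR first reconstructs this execution-time ordering.

This step has limits. It is suited to explaining the "observed execution structure", not to recovering true concurrency, asynchronous coordination, hidden memory, branching DAGs, or proprietary tools outside the primitive space.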

Algorithm / Method

AgentXRay framework: MCTS + Red-Black Pruning searches the primitive space for high-scoring agent/tool sequences.

MCTS over workflow prefixes

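AgentXRay defines a search-tree node \(v\) as a partial workflow prefix \(s_{1:t}\); the root is the empty prefix, and an edge appends a primitive \(p\in\Omega\). Each MCTS iteration consists of selection/expansion, simulation, and backup.

Stage	What it does	Why it is needed
Color-guided selection / expansion	Uses the node's color to decide whether to descend into an existing child or create a new child.	Splits the budget between exploiting high-potential paths and exploring new branches.
Simulation / rollout	Completes the workflow up to \(L_{max}\), executes it on a sample task, and obtains the output.	The delayed reward can only be observed after the full workflow has run.
Reward	\(r=0\) if execution fails, otherwise \(r=\mathrm{Sim}(o,o^*)\).	Uses output similarity as the reconstruction proxy.
Backup	Updates \(N(v)\leftarrow N(v)+1\) and \(Q(v)\leftarrow Q(v)+r\) along the visited path.	Supplies the statistics for later UCB and node scoring.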
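The Reward and Backup rows can be sketched as follows. This is an illustration only: the `Node` class, the convention that a failed execution is represented by `None`, and the function names are assumptions made for this note, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    prefix: tuple                      # partial workflow s_{1:t}
    parent: Optional["Node"] = None
    visits: int = 0                    # N(v)
    total_reward: float = 0.0          # Q(v)
    children: dict = field(default_factory=dict)

def reward(output, target, sim):
    """r = 0 on execution failure (modelled here as output=None), else Sim(o, o*)."""
    return 0.0 if output is None else sim(output, target)

def backup(leaf, r):
    """Walk from the evaluated node to the root, updating N and Q on the way."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += r
        node = node.parent

root = Node(prefix=())
child = Node(prefix=("planner",), parent=root)
grandchild = Node(prefix=("planner", "coder"), parent=child)

backup(grandchild, 0.5)   # one rollout through the deeper path
backup(child, 1.0)        # one rollout that stopped at the child
print(root.visits, root.total_reward, grandchild.visits)
```

Because \(Q(v)\) accumulates raw rewards, the mean quality used later in scoring is `total_reward / visits`.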

Dynamic Red-Black Pruning

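Red-Black Pruning: high-potential nodes are explored deeper and low-potential nodes are widened, so the budget is not scattered.

Each node is scored on three factors, Quality, Depth, and Width: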

\[ \mathrm{Score}(v)= \frac{Q(v)}{N(v)} \cdot \left(\frac{d(v)+1}{L_{max}+1}\right) \cdot \frac{|\mathcal{C}(v)|}{M} \]

Nodes are then colored against the \(\beta\)-quantile threshold \(\theta_\beta\) computed over the active nodes:

\[ C(v)= \begin{cases} \textsc{Red}, & \mathrm{Score}(v)\ge\theta_\beta\land v\notin\mathcal{L}_{term}\\ \textsc{Black}, & \mathrm{otherwise} \end{cases} \]

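A Red node marks an existing path with high potential: the search keeps using UCB to choose among its existing children and descends. A Black node marks low potential or insufficient coverage: the search prioritizes creating a new child to widen it. The point of the mechanism is not to "cut every low-scoring node" but to concentrate the limited rollout budget on the structural backbone most likely to yield a deep workflow.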
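The scoring and coloring rule can be sketched like this. Two things are assumptions of this note: \(M\) is not defined in the text above, so it is treated as a plain width normalizer (`width_norm`), and the quantile uses the nearest-rank convention because the paper's exact convention is not stated here. Unvisited nodes are given a score of 0 to avoid dividing by zero.

```python
import math
from dataclasses import dataclass

@dataclass
class NodeStats:
    total_reward: float    # Q(v)
    visits: int            # N(v)
    depth: int             # d(v)
    num_children: int      # |C(v)|
    terminal: bool = False # v in L_term

def score(v, l_max, width_norm):
    """Quality * Depth * Width, following the Score(v) formula."""
    if v.visits == 0:
        return 0.0  # assumption: unvisited nodes score zero
    quality = v.total_reward / v.visits
    depth = (v.depth + 1) / (l_max + 1)
    width = v.num_children / width_norm
    return quality * depth * width

def nearest_rank_quantile(values, beta):
    """beta-quantile by the nearest-rank method (an assumed convention)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(beta * len(ordered)) - 1))
    return ordered[idx]

def color_nodes(nodes, beta, l_max, width_norm):
    scores = [score(v, l_max, width_norm) for v in nodes]
    theta = nearest_rank_quantile(scores, beta)
    return ["RED" if s >= theta and not v.terminal else "BLACK"
            for v, s in zip(nodes, scores)]

nodes = [
    NodeStats(total_reward=3.0, visits=4, depth=3, num_children=2),
    NodeStats(total_reward=1.0, visits=4, depth=1, num_children=1),
    NodeStats(total_reward=2.0, visits=4, depth=1, num_children=4),
    NodeStats(total_reward=0.0, visits=0, depth=0, num_children=0),
]
print(color_nodes(nodes, beta=0.75, l_max=4, width_norm=4))
```

In this toy setting the first and third nodes clear the threshold; marking the first one as terminal would turn it Black regardless of its score.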

Search-space contraction

Without pruning, the branching factor is \(b=|\Omega|\) and the search volume is:

\[ \mathcal{V}_{full} = \sum_{d=0}^{L_{max}}b^d = \Theta(b^{L_{max}}) \]

With a realized pruning rate \(p\), the effective volume is approximately:

\[ \mathcal{V}_{eff} = \sum_{d=0}^{L_{max}}(b(1-p))^d = \Theta((b(1-p))^{L_{max}}) \]

The speedup ratio is stated to satisfy the lower bound:

\[ \eta(L_{max}) \ge \left(\frac{1}{1-p}\right)^{L_{max}} \]

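The paper also states explicitly that this is an idealized asymptotic explanation, not the actual token savings. In the experiments, under a strict budget of \(N=20\), the actual savings are 8-22%.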
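A quick numeric look at these volumes, using illustrative values \(b=10\), \(p=0.5\), \(L_{max}=3\) that are not taken from the paper. The sketch evaluates the two finite sums exactly and prints them next to the leading-order term \((1/(1-p))^{L_{max}}\).

```python
def search_volume(branching, l_max):
    """Sum of branching**d for d = 0..l_max (number of prefixes up to depth l_max)."""
    return sum(branching ** d for d in range(l_max + 1))

def contraction(b, p, l_max):
    full = search_volume(b, l_max)                 # V_full
    effective = search_volume(b * (1 - p), l_max)  # V_eff
    leading_term = (1 / (1 - p)) ** l_max          # (1/(1-p))^L_max
    return full, effective, full / effective, leading_term

full, effective, ratio, leading_term = contraction(b=10, p=0.5, l_max=3)
print(full, effective, round(ratio, 2), leading_term)
```

With these numbers the exact finite-sum ratio comes out a little under the leading-order term; the two agree only to leading order in \(b\), which fits the paper's own caveat that the expression is an idealized asymptotic account and not a measured saving.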

Experimental design

The evaluation asks three questions: whether the primitive space can express diverse agentic behavior; whether MCTS can recover a high-fidelity workflow under strict black-box access; and whether Red-Black Pruning improves budget efficiency.

Domain	Target system	Tasks	Notes
Software development	ChatDev	52 SRDD multi-file Python generation tasks	Multi-file code generation with strong workflow dependencies.
Data analysis	MetaGPT	52 MatPlotBench visualization tasks	Data-analysis pipeline, relatively regular.
Education	TeachMaster	25 automated teaching video generation tasks	Teaching-video generation workflow.
3D modeling	ChatGPT	100 ScanRefer scripting tasks	Black-box output of a proprietary assistant.
Scientific computing	Gemini	80 SciBench problems	Scientific computing and reasoning-style artifacts.

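Baselines include SFT, Claude Opus 4.5, Claude Opus 4.5 + ReAct, and AFlow, along with the AgentXRay variants w/o tools, w/o pruning, selected tools, and all tools. Note in particular that the Claude baselines use a stronger model than the components in AgentXRay's primitive space, so this experiment is not simply a case of "the stronger model wins."

The evaluation metric is proxy fidelity, mainly Static Functional Equivalence (SFE): 20% interface similarity, 50% logic similarity, and 30% semantic similarity. When AST parsing fails, a TF-IDF cosine fallback is used; this happens in under 5% of cases. The authors also ran a small human validation: 3 annotators, 30 output pairs, Spearman \(\rho=0.61\) between SFE and human similarity ratings with \(p<0.001\), and Krippendorff's \(\alpha=0.57\).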
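The 20/50/30 weighting amounts to a weighted sum. The sketch below assumes the three component similarities have already been computed and lie in \([0,1]\); how the paper derives each component (for example from ASTs) is not reproduced, and the names `SFE_WEIGHTS` and `sfe` are hypothetical.

```python
# Weights as stated in the note: interface 20%, logic 50%, semantic 30%.
SFE_WEIGHTS = {"interface": 0.2, "logic": 0.5, "semantic": 0.3}

def sfe(components):
    """Weighted sum of precomputed component similarities, each assumed in [0, 1]."""
    missing = set(SFE_WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing SFE components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in SFE_WEIGHTS.items())

# Illustrative inputs: identical interface, half-matching logic, no semantic overlap.
example = {"interface": 1.0, "logic": 0.5, "semantic": 0.0}
print(round(sfe(example), 3))
```

Since logic similarity carries half the weight, two outputs with matching interfaces but divergent control flow still score low, which is worth keeping in mind when reading the tables that follow.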

Results

Main results

Method	ChatDev	MetaGPT	TeachMaster	3D	Sci	Avg
SFT	0.355	0.272	0.124	0.091	0.139	0.196
Claude	0.256	0.322	0.303	0.282	0.299	0.292
ReAct	0.267	0.331	0.322	0.270	0.305	0.299
AFlow	0.403	0.280	0.348	0.290	0.373	0.339
AgentXRay w/o Tools	0.413	0.301	0.357	0.332	0.378	0.356
AgentXRay w/o Pruning	0.286	0.334	0.378	0.279	0.312	0.318
AgentXRay Selected	0.509	0.470	0.399	0.318	0.407	0.421
AgentXRay All Tools	0.425	0.557	0.390	0.362	0.395	0.426

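All Tools has the highest five-domain average at 0.426; Selected averages 0.421 and is best on ChatDev, TeachMaster, and Sci. AFlow averages 0.339, Claude + ReAct only 0.299, and SFT 0.196. The authors' explanation is that input-output behavior cloning lacks supervision on the intermediate decomposition, and that the capability of a strong single model cannot reliably substitute for explicit workflow search.

Ablation: pruning lets the search actually go deep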

Method	ChatDev Len	MetaGPT Len	TeachMaster Len	3D Len	Sci Len
w/o Pruning	2	2	2	2	2
w/o Tools	6	2	4	9	6
All Tools	6	2	5	2	6
Selected Tools	6	4	6	6	6

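Without pruning, every domain stalls at \(L=2\), which indicates that the MCTS budget is diluted by the huge branching factor. The real contribution of Red-Black Pruning is budget allocation: it lets the search reach deeper workflows under the limited budget of \(N=20\).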

Efficiency

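Compared with w/o pruning, Red-Black Pruning reduces token consumption by 8-22% across the five domains. ChatDev shows the largest effect, dropping from 17.16M to 13.33M tokens, a 22.3% reduction. The w/o tools variant uses the fewest tokens, but its average fidelity is only 0.356, below All Tools at 0.426.

Convergence: pruned variants reach high-scoring candidates with fewer cumulative tokens.

The convergence curves show pruned variants reaching high-scoring candidates earlier and more steeply on ChatDev and Sci. The authors also check randomness with 5-seed runs and paired t-tests; the AgentXRay variants with pruning are significantly better than AFlow on average, and even their worst-case runs still exceed AFlow.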

Open-weight generalization

To check whether AgentXRay depends on proprietary models, the paper builds the primitive space from open-weight models only, on the Atoms automatic slide-generation task. The results are:

Method	Similarity
SFT	0.275
AFlow	0.321
AgentXRay w/o Tools	0.275
AgentXRay Selected Tools	0.326
AgentXRay w/o Pruning	0.328
AgentXRay All Tools	0.337

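All Tools scores 0.337, above AFlow's 0.321, in line with the trend in the main experiments. This supports the authors' claim that the gains come from the algorithm design and not only from proprietary primitives.

My comments

Its strongest contribution to Drift Monitor: a human-readable workflow delta

The earlier papers mostly answer "how do we detect drift": Goal Drift looks at goal shift, OEP at memory-write poisoning, MemoryGraft at retrieval poisoning, Routine Chats at long-term state writeback, and ATP at trend tipping. AgentXRay instead answers "what should the human look at."

I would attach it after an AgentDevel-style release gate: each self-evolution candidate \(u_t\) not only runs the tests but also has its before/after workflow reconstructed, and the reviewer is shown \(\Delta W\):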

\[ workflow\_delta_t = reconstruct(\mathcal{M}_{t+1}) - reconstruct(\mathcal{M}_{t}) \]

\[ review\_gate(u_t)= tests\_pass(u_t) \land risk(u_t)\le\tau \land workflow\_delta\_acceptable(u_t) \]

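In practice, the workflow delta can flag which tool was added, which review step disappeared, whether a multi-agent flow collapsed into a single-agent shortcut, whether a primitive with stronger permissions appeared, and whether the original verifier was bypassed. The human does not need to read the full trace, only to review these structural changes.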
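One way the delta and the gate could look in code, as a rough sketch. Everything here is hypothetical: the `Primitive` and `WorkflowDelta` types, the set-difference reading of the "minus" in the delta formula, and the acceptability policy (no protected role removed, no tool added outside an allowlist) are choices made for this note, not something AgentXRay or AgentDevel specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    role: str
    tools: frozenset = frozenset()

@dataclass(frozen=True)
class WorkflowDelta:
    added_tools: frozenset
    removed_tools: frozenset
    removed_roles: frozenset
    length_change: int

def _all_tools(workflow):
    return frozenset(t for p in workflow for t in p.tools)

def _all_roles(workflow):
    return frozenset(p.role for p in workflow)

def workflow_delta(before, after):
    """Structural diff between two reconstructed linear workflows."""
    return WorkflowDelta(
        added_tools=_all_tools(after) - _all_tools(before),
        removed_tools=_all_tools(before) - _all_tools(after),
        removed_roles=_all_roles(before) - _all_roles(after),
        length_change=len(after) - len(before),
    )

def delta_acceptable(delta, protected_roles, allowed_tools):
    """Assumed policy: keep protected roles, add only allowlisted tools."""
    return not (delta.removed_roles & protected_roles) and delta.added_tools <= allowed_tools

def review_gate(tests_pass, risk, tau, delta, protected_roles, allowed_tools):
    return tests_pass and risk <= tau and delta_acceptable(delta, protected_roles, allowed_tools)

before = [Primitive("planner"), Primitive("coder", frozenset({"python"})), Primitive("reviewer")]
after = [Primitive("planner"), Primitive("coder", frozenset({"python", "shell"}))]

delta = workflow_delta(before, after)
print(delta)
print(review_gate(True, 0.1, 0.3, delta, frozenset({"reviewer"}), frozenset({"python"})))
```

Here the candidate passes its tests and stays under the risk threshold, yet the gate still rejects it because the reviewer step vanished and a new `shell` tool appeared, which is the kind of change a score-only monitor would not surface.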

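Do not over-trust the reconstruction

What AgentXRay recovers is a behavioral surrogate, not the true internal mechanism. SFE is likewise a proxy, not execution-level correctness. For a self-evolving agent in particular, hidden memory, prompt mutation, branching workflows, and concurrent tool calls can all be missed by a linear surrogate. So I would not treat AgentXRay as "explaining the truth"; I would treat it as a release artifact that supports human review and regression analysis.

A usable version within 3-6 months

A full AWR search is probably too heavy to run at every online step. A more realistic path is tiered use: ordinary small updates get only diff + tests; only high-risk memory/skill/state promotions get workflow reconstruction; and the human-readable delta is generated only for the release candidate. This keeps the interpretability without blowing up the cost of the monitor.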
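The tiering can be written down as a small policy function. The update-kind labels and the `Check` enum are hypothetical, and the sketch assumes that a release candidate always triggers reconstruction, since a delta cannot be produced without one.

```python
from enum import Enum

class Check(Enum):
    DIFF_AND_TESTS = "diff + tests"
    WORKFLOW_RECONSTRUCTION = "workflow reconstruction"
    HUMAN_READABLE_DELTA = "human-readable delta"

# Assumed labels for the high-risk promotions named in the text.
HIGH_RISK_KINDS = {"memory", "skill", "state"}

def required_checks(update_kind, is_release_candidate):
    """Return the checks to run for one update, cheapest first."""
    checks = [Check.DIFF_AND_TESTS]
    if update_kind in HIGH_RISK_KINDS or is_release_candidate:
        checks.append(Check.WORKFLOW_RECONSTRUCTION)
    if is_release_candidate:
        checks.append(Check.HUMAN_READABLE_DELTA)
    return checks

print([c.value for c in required_checks("prompt_tweak", False)])
print([c.value for c in required_checks("memory", False)])
print([c.value for c in required_checks("prompt_tweak", True)])
```

Keeping the policy explicit makes the cost trade-off auditable: the expensive search runs only where the function says so.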

One More Thing

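If we turn Drift Monitor into a paper, AgentXRay can help us avoid a common weakness: safety papers often end up showing readers only tables of metrics, without letting them really understand what changed in the system. We could put the reconstructed workflow delta \(W_t\to W_{t+1}\) in the main figure.

Delta type	Possible risk	Corresponding probe
tool removal	The agent learns to stop using a key tool, as in ATP.	Complex-task probe / tool-usage distribution.
reviewer/verifier removal	The release gate is bypassed.	AgentDevel-style regression tests.
new retrieval primitive	MemoryGraft risk goes up.	Retrieval exposure audit.
memory writer priority increase	OEP / Routine Chats risk goes up.	Writeback boundary diff audit.
role simplification	A multi-stage safety process is compressed into a shortcut.	Before/after workflow reconstruction.

This is what is most worth borrowing from AgentXRay: it turns "black-box behavior change" into structural change that can be displayed, reviewed, and compared.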

Reference / Evidence

arXiv abstract

https://arxiv.org/abs/2602.05353

PDF

https://arxiv.org/pdf/2602.05353

arXiv DOI

https://doi.org/10.48550/arXiv.2602.05353

Reading basis

Based on the public arXiv abstract page, public arXiv v3 PDF/source package, and source inventory retrieved on 2026-05-26. This page is a paper2html deep-reading note, not an independent reproduction report.