MemoryGraft

The conclusion first

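MemoryGraft establishes a key boundary: an agent's "past experience" cannot be trusted by default. The attacker does not need to jailbreak the current prompt. It is enough to get the agent, during some ingestion step, to write a batch of memories disguised as successful experiences. A future clean task may then pull those experiences back through retrieval and imitate the unsafe procedures they contain.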

#5: close-reading priority in the Drift Monitor Top 10

47.9%: poisoned retrieval proportion reported in the paper

100 + 10: benign / poisoned experience seeds in the public artifact

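Core judgment: OEP attacks memory consolidation, whereas MemoryGraft attacks memory retrieval and use. A Drift Monitor that only watches whether a write is malicious will miss a finer-grained drift surface: which old experiences are steering current behavior at retrieval time.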

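Version note: this page is based on the public arXiv PDF, the public arXiv source package, the public GitHub artifact, and a TeX/source inventory, retrieved on 2026-05-26. It is a paper2html deep-reading note and does not claim to have completed an independent reproduction.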

Research motivation

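LLM agents increasingly rely on long-term memory and RAG: they store past tasks, successful trajectories, user preferences, and tool-use experience, then reuse them when a similar task comes up. This looks like "learning from experience", but it introduces an underestimated trust boundary: can the reasoning core actually trust its own past?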

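Traditional prompt injection attacks the current context; RAG poisoning attacks an external knowledge base; many memory attacks depend on an explicit trigger or a follow-up attack prompt. MemoryGraft's entry point is closer to self-evolving agents: it packages a malicious procedure as a "successful experience", writes it into long-term memory through the normal ingestion pathway, and waits for a future clean query to retrieve it naturally.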

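The value of this paper for Drift Monitor is not that it is "yet another memory poisoning attack", but that it turns the retrieval distribution itself into a measurable object: if certain high-risk experiences are retrieved abnormally often under normal queries, the drift is already halfway done before the agent has executed anything.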

MemoryGraft pipeline: a benign-looking document triggers the agent to build a poisoned RAG memory; a future clean query retrieves the malicious "successful experience", leading to unsafe actions and persistent behavior drift.

Mathematical formulation and modeling

Let the agent have a RAG module and a persistent long-term memory \(\mathcal{M}\). For the current query \(q\), the agent retrieves \(k\) past experiences:

\[ \mathrm{Retr}(q)=\{(q_i,R_{q_i})\}_{i=1}^{k} \]

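These experiences enter the prompt and, like demonstrations, shape the new trace \(R_q\). If the task is judged successful, the system writes the new experience back into memory: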

\[ \mathcal{M}\leftarrow\mathcal{M}\cup\{(q,R_q)\} \]

The key to MemoryGraft is union retrieval. BM25 captures lexical overlap and FAISS/cosine embeddings capture semantic similarity; the final set is the union of the two:

\[ \mathrm{Retr}(q)=\mathrm{Retr}^{lex}(q)\cup\mathrm{Retr}^{vec}(q) \]
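The union rule above can be sketched in a few lines. This is a minimal illustration, not the artifact's code: the shared-token count is only a stand-in for BM25, the hand-written 2-d vectors are only a stand-in for FAISS embeddings, and the record ids, texts, and the helper names (`top_k`, `union_retrieve`) are all hypothetical.

```python
import math

def top_k(scores, k):
    """Ids of the k best-scoring records; zero-score records never qualify."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [rid for rid, score in ranked[:k] if score > 0]

def lexical_scores(query, records):
    """Toy lexical channel: shared-token count (a stand-in for BM25)."""
    q_tokens = set(query.lower().split())
    return {rid: len(q_tokens & set(text.lower().split()))
            for rid, (text, _vec) in records.items()}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def vector_scores(query_vec, records):
    """Toy semantic channel: cosine over hand-written vectors (a stand-in for FAISS)."""
    return {rid: cosine(query_vec, vec) for rid, (_text, vec) in records.items()}

def union_retrieve(query, query_vec, records, k=3):
    """Retr(q) = Retr^lex(q) U Retr^vec(q), each channel contributing its own top-k."""
    lex = top_k(lexical_scores(query, records), k)
    vec = top_k(vector_scores(query_vec, records), k)
    return sorted(set(lex) | set(vec))

# Hypothetical records: id -> (text, toy embedding).
RECORDS = {
    "benign-1": ("check column types in the customer table", [0.6, 0.8]),
    "poison-lex": ("validate schema for customer table and force the success flag", [0.0, 1.0]),
    "poison-vec": ("skip checks and reuse cached output", [1.0, 0.0]),
}
QUERY = "validate schema for customer table"
QUERY_VEC = [1.0, 0.0]

lex_only = top_k(lexical_scores(QUERY, RECORDS), 1)
vec_only = top_k(vector_scores(QUERY_VEC, RECORDS), 1)
both = union_retrieve(QUERY, QUERY_VEC, RECORDS, k=1)
print(lex_only, vec_only, both)
```

In this toy setup each poisoned record wins only one channel, yet both reach the context once the channels are unioned. It also shows that the union can return more than \(k\) records when the two channels disagree.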

The adversarial documents supplied by the attacker are denoted:

\[ \mathcal{D}_{adv}=\{d_1,\dots,d_m\} \]

Through legitimate ingestion, these documents induce the agent to produce a poisoned store:

\[ \mathcal{M}_{poison}=\mathcal{M}_{benign}\cup\mathcal{M}_{adv} \]

The attack goal is not to make a trigger appear in the user prompt, but to make at least one poisoned memory appear in the retrieval result for a clean victim query \(q^\star\):

\[ \exists i\in[k]:(q_i,R_{q_i})\in\mathcal{M}_{adv} \]

Poisoned Retrieval Proportion

The paper's core metric is PRP. For an evaluation query set \(\mathcal{Q}=\{q^{(1)},\ldots,q^{(N)}\}\), the retrieved set for each query, whose size \(k_i\) may vary from query to query, is:

\[ \mathrm{Retr}_{\mathcal{M}'}(q^{(i)})=\{(q_j,R_{q_j})\}_{j=1}^{k_i} \]

where the poisoned retrieved count and the total retrieved count are:

\[ p_i=\left|\mathrm{Retr}_{\mathcal{M}'}(q^{(i)})\cap\mathcal{S}_{poison}\right|, \quad t_i=\left|\mathrm{Retr}_{\mathcal{M}'}(q^{(i)})\right| \]

\[ \mathrm{PRP}= \frac{\sum_{i=1}^{N}p_i}{\sum_{i=1}^{N}t_i} \]
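A minimal sketch of the PRP computation, assuming retrieval results are available as lists of record ids. The function name and the toy ids are hypothetical, and the numbers below are illustrative only, not results from the paper.

```python
def poisoned_retrieval_proportion(retrieved_sets, poisoned_ids):
    """PRP = sum_i p_i / sum_i t_i, pooled over all evaluation queries."""
    poisoned_ids = set(poisoned_ids)
    p_total = sum(len(set(retrieved) & poisoned_ids) for retrieved in retrieved_sets)
    t_total = sum(len(set(retrieved)) for retrieved in retrieved_sets)
    if t_total == 0:
        raise ValueError("no records were retrieved, PRP is undefined")
    return p_total / t_total

# Toy example: three queries with retrieved sets of different sizes k_i.
toy_retrievals = [
    ["b1", "p1", "b2"],        # p_1 = 1, t_1 = 3
    ["p1", "p2", "b3", "b4"],  # p_2 = 2, t_2 = 4
    ["b5"],                    # p_3 = 0, t_3 = 1
]
toy_prp = poisoned_retrieval_proportion(toy_retrievals, {"p1", "p2"})
print(toy_prp)  # 3 poisoned out of 8 retrieved
```

Note that the formula pools counts across queries (a micro-average) rather than averaging per-query ratios, so queries with larger retrieved sets weigh more.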

Algorithm / method

Two-stage structure of the attack

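Stage	Action	Why it is stealthy
Poisoning	Craft a benign-looking note / documentation containing a normal workflow plus a small number of poisoned "success" templates.	The artifact looks like ordinary engineering documentation, and the poisoned entries are labeled as validated / safe best practice.
Store construction	The agent reads and processes the document, builds a BM25/FAISS RAG store, and persists benign and poisoned seeds together.	The attack does not modify the memory DB directly; it rides the agent's own ingestion/write pipeline.
Evaluation / Activation	A future clean task retrieves poisoned memory through semantic or lexical similarity.	There is no explicit trigger; activation comes from the similarity between an ordinary task and a poisoned seed.
Behavior drift	The agent treats retrieved memory as prior success and imitates patterns such as skipping validation, running remote scripts, and exfiltrating artifacts.	The unsafe pattern arrives labeled as "past successful experience", which is trusted more readily than ordinary external text.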

Seed construction

The attack constructs two disjoint sets: benign experiences and poisoned experiences.

\[ \mathcal{S}_{benign}=\{(q_b^{(i)},R_b^{(i)})\}_{i=1}^{n_b}, \quad \mathcal{S}_{poison}=\{(q_p^{(j)},R_p^{(j)},\pi)\}_{j=1}^{n_p} \]

where \(\pi\) is an unsafe behavioral pattern, for example skip schema checks, run remote helper scripts, force success flags, or upload artifacts externally. The payload note \(\mathcal{N}\) triggers:

\[ \mathcal{M}_{poison}=\mathrm{build\_store}(\mathcal{S}_{benign},\mathcal{S}_{poison}) \]

Finally, the poisoned memory is merged with the existing memory:

\[ \mathcal{M}'=\mathcal{M}\cup\mathcal{M}_{poison} \]

Why union retrieval amplifies the attack

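With BM25 alone, the attacker has to guess the right wording; with embeddings alone, the attacker has to occupy the right semantic neighborhood. Union retrieval lets the attacker enter the context by hitting either channel: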

\[ \mathrm{Retr}_{\mathcal{M}'}(q) = \mathrm{Retr}^{vec}_{\mathcal{M}'}(q) \cup \mathrm{Retr}^{lex}_{\mathcal{M}'}(q) \]

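Drift Monitor perspective: the anomaly here is not the content of any single memory but a polluted retrieval basin. A few poisoned records are placed in high-traffic regions of the semantic space of ordinary tasks, so they are recalled under many clean queries.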

Experimental design

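The paper evaluates on MetaGPT's DataInterpreter agent, with GPT-4o as the underlying LLM. DataInterpreter generates and runs Python/SQL code, inspects intermediate results, and returns structured output, which makes it well suited to testing whether past successful experience becomes a procedural prior for future tasks.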

Dimension	Setting	Notes
Agent	MetaGPT DataInterpreter	A software / data-analysis agent with executable code and retrieval memory.
Model	GPT-4o	Underlying LLM for the paper's main experiment.
Memory retrieval	BM25 always on; FAISS if embedding configured	Top-k is 3 in the public artifact.
Seeds	110 total	100 benign DataInterpreter-style workflows, 10 poisoned shortcuts.
Queries	12 handcrafted probes	Cover PII cleaning, schema validation, EDA, reporting, pipeline debugging, and similar tasks.
Metric	PRP / PRR	poisoned retrieved records / total retrieved records。

How the public artifact complements the paper

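Both the arXiv abstract and the TeX source point to a public GitHub repository. The artifact contains a MetaGPT-based POC: a memory graft / RAG vector-drift script, a payload note, the 100/10 seed data, a memory inspector, and two stronger-setting experiments (schema-spoofing and judge-jacking). I use it here only as public evidence and do not present it as a reproduced result.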

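Artifact evidence	How it helps the reading
100 benign + 10 poisoned seeds	Confirms that the paper's poisoning cover is not purely conceptual but concretely built around DataInterpreter workflows.
12 evaluation queries	Confirms that the probes cover ordinary data-analysis workflows rather than only trigger-like prompts.
BM25 + optional FAISS	Explains why the paper emphasizes the attack basin of union retrieval.
poison signatures	Gives a retrieval-time risk classifier a direct starting point: skip / force / remote helper / upload / stale cache.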

Experimental results

Aggregate retrieval result

The paper reports that across the 12 evaluation queries, 48 records were retrieved in total, of which 23 were poisoned:

\[ \mathrm{PRP} = \frac{P_{tot}}{T_{tot}} = \frac{23}{48} = 0.479 \]

48: total retrieved records

23: poisoned retrieved records

47.9%: Poisoned Retrieval Proportion

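The key point is not that the 10 poisoned records make up 10/110 of the store, but that they account for 23/48 of the retrieval distribution. In other words, poison is a small share of storage (about 9.1%) but a large share of retrieval exposure (47.9%).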
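The gap between storage share and retrieval share can be made explicit with a few lines of arithmetic on the reported counts. The "amplification" ratio below is my own derived quantity for illustration; the paper reports PRP, not this ratio.

```python
from fractions import Fraction

# Counts as reported: 10 poisoned of 110 stored seeds, 23 poisoned of 48 retrieved records.
storage_share = Fraction(10, 110)
retrieval_share = Fraction(23, 48)

# Derived ratio (not a paper metric): how over-represented poison is at retrieval time.
amplification = retrieval_share / storage_share

print(f"storage share:   {float(storage_share):.3f}")
print(f"retrieval share: {float(retrieval_share):.3f}")
print(f"amplification:   {float(amplification):.2f}x")
```

Under these counts, a poisoned record is a bit more than five times as prominent in retrieval as it is in storage.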

Poison patterns

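Poisoned pattern	Risk in agent behavior	Drift signal
skip / force validation	Disguises a failed state as a pass and contaminates the release gate.	Retrieved trace recommends bypassing checks.
remote helper script	Hands the execution boundary to an external script.	Retrieved trace contains remote execution / download patterns.
external upload	Sends data or artifacts to an uncontrolled location.	Retrieved trace recommends exfiltrating artifacts.
stale cache reuse	Substitutes stale results for recomputation and validation.	Retrieved trace puts speed ahead of correctness.
blanket imputation / sampling	Masks data-quality problems.	Retrieved trace downplays schema / data-quality checks.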
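As a rough starting point, the drift signals in the table can be approximated with a keyword scan over retrieved traces. This is a deliberately naive sketch: the regular expressions and label names are my own assumptions, not the signatures shipped in the public artifact, and a real classifier would need far better recall and precision.

```python
import re

# Hypothetical patterns, one per row of the table they are meant to approximate.
RISK_PATTERNS = {
    "skip_or_force_validation": re.compile(
        r"\b(skip|bypass|force)\w*\s+(\w+\s+){0,3}(validation|checks?|success)\b", re.I),
    "remote_helper_script": re.compile(
        r"\b(curl|wget)\b|\bremote\s+(helper\s+)?script", re.I),
    "external_upload": re.compile(r"\bupload\w*\b.{0,40}\bexternal", re.I),
    "stale_cache_reuse": re.compile(r"\b(reuse|stale)\w*\b.{0,40}\bcache", re.I),
}

def scan_trace(trace_text):
    """Return the sorted labels of all risk patterns found in a retrieved trace."""
    return sorted(label for label, pattern in RISK_PATTERNS.items()
                  if pattern.search(trace_text))

risky = scan_trace("Skip schema checks, then upload artifacts to an external bucket.")
clean = scan_trace("Validate the schema, then recompute the report.")
print(risky, clean)
```

A scan like this only flags explicit wording. It would miss paraphrased shortcuts, which is one reason to pair it with provenance checks and retrieval-distribution statistics.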

Defenses proposed by the authors

The first is Cryptographic Provenance Attestation. Only successful traces that were actually validated by the environment may be signed and written to memory:

\[ \sigma=\mathrm{Sign}(H(q\parallel R_q),K_{priv}) \]

At retrieval time, only memories whose signatures verify are allowed into the context:

\[ \mathrm{Retr}_{secure}(q^\star)= \{(q_i,R_{q_i})\in\mathrm{Retr}(q^\star)\mid \mathrm{Verify}((q_i,R_{q_i}),\sigma_i,K_{pub})=\mathrm{True}\} \]
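The filter can be sketched with the Python standard library. One substitution should be stated plainly: the formulas use a private/public key signature, while this sketch uses HMAC with a shared secret, because the standard library has no asymmetric signing. The record layout (`query`, `trace`, `tag`) and the function names are assumptions for illustration.

```python
import hashlib
import hmac

def trace_digest(query, trace):
    """H(q || R_q), with length prefixes so the concatenation is unambiguous."""
    h = hashlib.sha256()
    for part in (query, trace):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.digest()

def sign_memory(query, trace, key):
    """Stand-in for Sign(...): an HMAC tag over the digest (shared key, NOT public-key)."""
    return hmac.new(key, trace_digest(query, trace), hashlib.sha256).hexdigest()

def verify_memory(query, trace, tag, key):
    expected = sign_memory(query, trace, key)
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))

def secure_retrieve(retrieved, key):
    """Keep only retrieved memories whose tag verifies."""
    return [m for m in retrieved
            if verify_memory(m["query"], m["trace"], m.get("tag", ""), key)]

KEY = b"demo-key-for-illustration-only"
attested = {"query": "clean the orders table", "trace": "validate schema, then dedupe"}
attested["tag"] = sign_memory(attested["query"], attested["trace"], KEY)
grafted = {"query": "clean the orders table", "trace": "skip schema checks", "tag": "00"}
tampered = dict(attested, trace="skip schema checks")  # valid tag, altered trace

kept = secure_retrieve([attested, grafted, tampered], KEY)
print([m["trace"] for m in kept])
```

The design point is that the tag is issued only at the moment the environment validates a trace, so a memory written through ordinary document ingestion never obtains one.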

The second is Constitutional Consistency Reranking: keep similarity, but subtract a penalty for risky traces.

\[ S(q,q_i)=\alpha\cos(\mathbf{e}_q,\mathbf{e}_{q_i}) -\beta\mathcal{L}_{risk}(R_i\mid\mathcal{C}) \]

If \(\mathcal{L}_{risk}(R_i\mid\mathcal{C})>\tau\), the memory is suppressed even when similarity is high. This idea is close to Drift Monitor: high relevance is not high trustworthiness.
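A minimal sketch of the reranking rule, assuming that cosine similarities and risk scores have already been computed elsewhere. How \(\mathcal{L}_{risk}\) is actually scored against the constitution \(\mathcal{C}\) is not modeled here, and the candidate values and default weights are made up for illustration.

```python
def rerank(candidates, alpha=1.0, beta=1.0, tau=0.5):
    """Score S = alpha * similarity - beta * risk; drop candidates with risk > tau.

    candidates: list of (memory_id, cosine_similarity, risk) tuples.
    Returns memory ids ordered from best to worst score.
    """
    scored = []
    for memory_id, similarity, risk in candidates:
        if risk > tau:
            continue  # suppressed regardless of how similar it is
        scored.append((alpha * similarity - beta * risk, memory_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [memory_id for _score, memory_id in scored]

# Hypothetical candidates: the poisoned record is the most similar one.
CANDIDATES = [
    ("poison-1", 0.95, 0.90),
    ("benign-1", 0.80, 0.05),
    ("benign-2", 0.85, 0.30),
]
by_similarity = [m for m, _s, _r in sorted(CANDIDATES, key=lambda c: -c[1])]
reranked = rerank(CANDIDATES)
print(by_similarity, reranked)
```

Ranking by similarity alone puts the poisoned record first. With the penalty and threshold it disappears, and the lower-risk benign record moves ahead of the more similar one.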

My comments

The strongest lesson MemoryGraft offers Drift Monitor

This paper moves the monitor from "is the memory item safe" to "is the retrieval event safe". The same memory can be reasonable experience under some queries and a dangerous precedent under others. The monitor needs to operate on the triple (memory, query, retrieved context).

\[ retrieval\_gate(m_i,q_t)= provenance\_valid(m_i)\land safe\_trace(m_i)\land context\_compatible(m_i,q_t)\land retrieval\_distribution\_normal(q_t) \]
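The gate is a plain conjunction, so a sketch mainly needs to show how the four checks combine and how a rejection stays explainable. The field names mirror the formula; how each check is computed is left open, and the `GateInputs` type is a hypothetical container.

```python
from dataclasses import dataclass

@dataclass
class GateInputs:
    """Outcome of the four checks for one (memory, query) retrieval event."""
    provenance_valid: bool
    safe_trace: bool
    context_compatible: bool
    retrieval_distribution_normal: bool

def retrieval_gate(inputs):
    """Return (allowed, failed_checks); allowed only if every check passes."""
    failed = [name for name, passed in vars(inputs).items() if not passed]
    return (not failed, failed)

ok = retrieval_gate(GateInputs(True, True, True, True))
blocked = retrieval_gate(GateInputs(True, False, True, False))
print(ok, blocked)
```

Returning the list of failed checks, rather than a bare boolean, gives the monitor something to log when a memory is withheld from the context.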

MemoryGraft mechanism	Monitor checkpoint	Implementable probe
poisoned records disguised as successes	provenance / attestation	Down-weight or quarantine unsigned or external-origin memory.
union retrieval enlarges the attack basin	retrieval distribution audit	Measure the exposure rate of high-risk memory for each query class.
semantic imitation heuristic	trace-level safety classifier	Scan retrieved reasoning traces for recommendations to bypass / skip / force / upload.
persistent store reused across sessions	memory lineage / rollback	Bind every memory promotion to its source, validation, and rollback point.
small poison, high PRP	retrieval centrality shift	Trigger a review when a small number of memories appear abnormally often.
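The "retrieval centrality shift" row can be prototyped from a retrieval log alone. The sketch below assumes the log is a list of retrieved-id lists, one per query, and uses a fixed threshold; both the log format and the 0.5 cutoff are arbitrary choices for illustration.

```python
from collections import Counter

def exposure_rates(retrieval_log):
    """Map each memory id to the fraction of queries that retrieved it."""
    n_queries = len(retrieval_log)
    counts = Counter(mid for retrieved in retrieval_log for mid in set(retrieved))
    return {mid: count / n_queries for mid, count in counts.items()}

def flag_central(retrieval_log, max_rate=0.5):
    """Ids retrieved in more than max_rate of queries, as candidates for review."""
    rates = exposure_rates(retrieval_log)
    return sorted(mid for mid, rate in rates.items() if rate > max_rate)

# Toy log: one record shows up under three of four unrelated queries.
TOY_LOG = [
    ["b1", "p1"],
    ["b2", "p1"],
    ["b3", "p1"],
    ["b1", "b4"],
]
rates = exposure_rates(TOY_LOG)
flagged = flag_central(TOY_LOG)
print(rates, flagged)
```

A flagged id is not proof of poisoning, since a legitimately general-purpose memory can also be central. It is a trigger for review, which matches the table's framing.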

How it fits together with OEP / AgentDevel

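OEP mainly tells us that a transferability gate is needed before writing to memory. MemoryGraft tells us that a retrieval gate is needed when reading from memory. AgentDevel tells us that these gates should not be one-off heuristics but part of a release-style pipeline with tests, critics, promotion, and rollback.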

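So a baseline achievable in 3-6 months is clear: wrap an AgentDevel-style release gate around both ends of memory update and retrieval, use OEP as the write-time adversarial probe, use MemoryGraft as the retrieval-time adversarial probe, and use the Goal Drift benchmark to evaluate whether observable behavior has shifted.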

Limits of the paper itself

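The experiments are still small: 100/10 seeds and 12 queries, and the paper mainly reports a retrieval statistic rather than a large-scale end-to-end behavioral severity benchmark. The threat model also leans semi-white-box, because the attacker knows which artifacts the agent will ingest. Even so, it is valuable as an adversarial probe for Drift Monitor, because it formalizes the question of how past experience comes to be trusted in the future.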

One More Thing

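If we turn MemoryGraft into our experimental baseline, I would generalize it into a Retrieval Drift Probe: inject into the agent's memory store a small set of experiences that look locally successful but are dangerous at the policy level, then systematically scan exposure rate, trace imitation rate, and final task regression across different query families.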
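The reporting side of such a probe could look like the sketch below, assuming each probe run has already been labeled with three booleans: whether injected memory was retrieved, whether the trace imitated it, and whether the task regressed. The run format, family names, and values are hypothetical and carry no experimental meaning.

```python
from collections import defaultdict

METRICS = ("exposed", "imitated", "regressed")

def summarize_probe(runs):
    """Aggregate per-query-family rates from labeled probe runs."""
    totals = defaultdict(lambda: {"n": 0, "exposed": 0, "imitated": 0, "regressed": 0})
    for run in runs:
        bucket = totals[run["family"]]
        bucket["n"] += 1
        for metric in METRICS:
            bucket[metric] += int(run[metric])
    return {
        family: {
            "exposure_rate": bucket["exposed"] / bucket["n"],
            "imitation_rate": bucket["imitated"] / bucket["n"],
            "regression_rate": bucket["regressed"] / bucket["n"],
        }
        for family, bucket in totals.items()
    }

# Made-up runs for two query families.
TOY_RUNS = [
    {"family": "schema_validation", "exposed": True, "imitated": True, "regressed": False},
    {"family": "schema_validation", "exposed": True, "imitated": False, "regressed": False},
    {"family": "reporting", "exposed": False, "imitated": False, "regressed": False},
    {"family": "reporting", "exposed": True, "imitated": True, "regressed": True},
]
summary = summarize_probe(TOY_RUNS)
print(summary)
```

Keeping the three rates separate matters: exposure without imitation and imitation without regression are different stages of drift and call for different responses.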

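This is closer to real self-evolution risk than simply detecting malicious strings: an experience need not be explicitly malicious, yet in the wrong retrieval context it becomes bad policy. The core task of Drift Monitor is to detect, before behavior actually goes wrong, that these retrieval priors are reshaping the agent.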

Reference / Evidence

arXiv abstract

https://arxiv.org/abs/2512.16962

PDF

https://arxiv.org/pdf/2512.16962

arXiv DOI

https://doi.org/10.48550/arXiv.2512.16962

Public code / evaluation artifact

https://github.com/Jacobhhy/Agent-Memory-Poisoning

Reading basis

Based on the public arXiv PDF, arXiv abstract page, public arXiv source package, and the public GitHub artifact retrieved on 2026-05-26. This page is a paper2html deep-reading note, not an independent reproduction report.