
MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

[Summary]
MemGen proposes a dynamic generative memory framework that equips agents with human-like cognitive faculties, addressing the limitations of existing parametric and retrieval-based memory. It consists of a memory trigger that monitors the reasoning state and a memory weaver that takes the current state as a stimulus to generate a latent token sequence. Across eight benchmark experiments it improves over ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and shows strong cross-domain generalization. Moreover, without explicit supervision it spontaneously evolves human-like memory faculties such as planning memory, procedural memory, and working memory.

Guibin Zhang†, Muxin Fu†, Shuicheng Yan

National University of Singapore †Equal Contribution

Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent’s reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent’s current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

Date: October 14, 2025

§ Github: https://github.com/KANABOON1/MemGen

1 Introduction

[Summary]
The rise of LLM-powered agents has brought a paradigm shift across domains, and the notion of agent memory plays a central role in learning progressively through environment interactions. Both existing paradigms are limited: parametric memory induces catastrophic forgetting through parameter modification, while retrieval-based memory relies on context engineering and struggles with flexible integration. To address these problems, this paper proposes MemGen, a dynamic and generative memory framework whose two components, a memory trigger and a memory weaver, integrate latent memory generated on the fly during reasoning. MemGen mitigates catastrophic forgetting when exposed to new data and provides a fluid generative faculty deeply integrated with the reasoning core. Experiments across nine benchmarks show gains of up to 31.7% on ALFWorld and 27.1% on KodCode over existing methods.

The ascent of Large Language Model (LLM)-powered agents marks a paradigm shift across diverse domains (Luo et al., 2025b; Yang et al., 2024b; Qian et al., 2025; Singh et al., 2025; Pantiukhin et al., 2025; Ren et al., 2025). Pivotal to this success is the concept of agent memory (Zhang et al., 2024b; Wu et al., 2025b), which enables LLM agents to learn progressively from environmental interactions (Zhang et al., 2025a; Qiu et al., 2025b). Crucially, this conception of agent memory extends beyond that of conversational agents (i.e., personalized memory (Wu et al., 2025b)), whose primary role is to sustain coherence across long-horizon, multi-turn dialogues (Chhikara et al., 2025; Xu et al., 2025a; Packer et al., 2024; Zhong et al., 2023). Rather, the scope of this paper is primarily on enabling agents to internalize experience, simulate human-like cognitive iteration, and progressively enhance problem-solving competence (Gao et al., 2025).

The memory serving as this self-evolving engine typically manifests in two dominant paradigms. The first is (I) parametric memory, which internalizes experiences by directly updating agents’ parameters (Yao et al., 2024; Zeng et al., 2023; Chen et al., 2024b, 2025). While this approach can yield substantial performance gains, its reliance on parameter modification inevitably entails catastrophic forgetting, i.e., the erosion of general knowledge (Dou et al., 2024). Conversely, the second paradigm is (II) retrieval-based memory, which externalizes past experiences into a structured database, such as (i) raw trajectories (Luo et al., 2025a; Zhang et al., 2025a; Zhao et al., 2024), (ii) high-level experiences (Zhao et al., 2024; Fang et al., 2025; Wang et al., 2024c), and (iii) condensed skills like reusable APIs (Zheng et al., 2025) or MCP boxes (Qiu et al., 2025b,a). Although this non-invasive approach circumvents catastrophic forgetting, its efficacy is fundamentally tethered to context engineering. It adheres to a rigid execution pipeline, providing retrieved context to the agent without achieving the fluid, seamless integration characteristic of truly internalized memory (Su et al., 2025b).

Figure 1 The comparison among parametric memory, retrieval-based memory and MemGen. We drew inspiration from the layout presented in Figure 1 of Li et al. (2025a).

Given these deficiencies, latent memory offers a compelling alternative, leveraging latent states as a machine-native, high-density medium for memory. Existing approaches either use (i) the key-value (KV) cache to maintain a dynamic memory set (Gim et al., 2024; Jin et al., 2024; Hongkang Yang et al., 2024), which is primarily confined to addressing long-context issues, or (ii) latent token embeddings to store agent experiences (Wang et al., 2024b, 2025a), which still rely on invasive LLM parameter updates. LatentSeek (Li et al., 2025a) and SoftCoT (Xu et al., 2025b,c) similarly belong to this category, utilizing latent embeddings to steer agent generation. Nevertheless, all these methods diverge from human cognition in two critical dimensions: they lack the seamless interleaving of reasoning and memory, a process where thought and memory dynamically reshape one another, and they remain largely retrieval-based, fetching memories by embedding similarity (Wang et al., 2024b) rather than generatively reconstructing them into novel, coherent insights. This leads to our pivotal research question:

How can we architect agent memory as a dynamic cognitive faculty, capable of fluid, reconstructive processes that interweave seamlessly with reasoning?

To address this challenge, we introduce MemGen, a dynamic and generative memory framework designed to endow any LLM agent with a more human-esque cognitive faculty. At its core, MemGen continuously monitors an agent’s cognitive state, enabling it to dynamically invoke a generative process that synthesizes a bespoke latent memory at any critical juncture during its reasoning process. Practically, MemGen comprises two synergistic components: a reinforcement learning (RL)-trained memory trigger, which acts as a metacognitive monitor to discern the opportune moments for explicit memory invocation; and a memory weaver, which takes the agent’s current state as a stimulus to draw upon relevant implicit parametric memory (potentially augmented with externally retrieved information) and then reconstructs this synthesis into a succinct, machine-native latent memory. With the reasoning core fixed, MemGen inherently mitigates catastrophic forgetting when exposed to new data, and, moving beyond the static and extractive paradigm of prior memory systems, equips agents with a fluid, generative faculty deeply integrated with reasoning cores.

Experimental Observation. Extensive experiments across nine benchmarks and four baseline categories demonstrate that MemGen delivers ① substantial performance gains, with improvements of up to 31.7% on ALFWorld (Shridhar et al., 2021) and 27.1% on KodCode (Xu et al., 2025d) with Qwen3-8B, surpassing parametric memory (REINFORCE++, +5.8%) and the GRPO method (+5.32%); ② strong cross-domain generalization, where training in the math domain not only avoids degradation elsewhere but also boosts performance in science reasoning (+6.06%) and code generation (+5.1%); and ③ continual learning ability, maintaining stable performance in previously trained domains even after fine-tuning on three additional ones.

Analysis & Interpretation. Beyond quantitative evaluation, we sought to interpret the learned behavior of MemGen. Through post-hoc interventions examining the impact of removing specific latent memory on different agent failure modes, we found that MemGen implicitly evolves a human-like memory hierarchy without any external guidance, including ① planning memory, where certain latent tokens specifically support high-level task planning, ② procedural memory, where some latent memory tokens facilitate the agent’s recall of task-specific procedural skills, such as tool usage and answer formatting, and ③ working memory, where certain tokens help the agent maintain coherence and understanding over long contexts within a single task session. These specializations strongly reveal that MemGen endows the agent with precise, functionally distinct memory.

2 Related Work

[Summary]
Existing memory mechanisms for LLM agents fall into parametric, retrieval-based, and latent memory; MemGen belongs to the latent memory paradigm but is distinguished by a generative approach that interweaves reasoning and memory in a human-like manner. With respect to latent computation, MemGen is related to methods that use latent states to intervene in the reasoning process and modulate the quality of generated outputs. From the decoding perspective it resembles speculative token drafting but focuses on carrying memory, and among works combining reinforcement learning with memory, self-improving memory remains rare.

LLM & Agent Memory. As outlined in Section 1, existing memory mechanisms designed to evolve the problem-solving capacity of LLM agents can be broadly categorized into three classes: (I) parametric memory, which either integrates past experiences directly into agent parameters through finetuning, as in FireAct (Chen et al., 2023), AgentLumos (Yin et al., 2024), and others (Zhang et al., 2024a; Fu et al., 2025), or maintains them in external parameter modules (Tack et al., 2024; Wang et al., 2024a); (II) retrieval-based memory, which abstracts prior experiences into transferable knowledge (Zhang et al., 2025a; Zhao et al., 2024), or distills them into reusable tools and skills (Zheng et al., 2025; Wang et al., 2025b; Qiu et al., 2025b,a); and (III) latent memory, which leverages implicit representations to encode and retrieve experience (Wang et al., 2024b, 2025a; Hu et al., 2025b; Liu et al., 2024; Sun et al., 2025). Our MemGen falls within the latent memory paradigm, yet distinguishes itself from prior approaches through its more human-esque interweaving of reasoning and memory, as well as its generative, rather than purely retrieval-based, nature.

Latent Computation. Our method is also closely related to latent computation, wherein latent states are employed to intervene in or reshape the LLM’s reasoning process (Zhu et al., 2025). Prominent paradigms include: (I) architecturally enabling native latent reasoning, exemplified by Coconut (Hao et al., 2024), CODI (Shen et al., 2025), LatentR3 (Zhang et al., 2025b) and CoLaR (Tan et al., 2025), which render the LLM’s inference process inherently latent and machine-native; and (II) employing latent computation to steer LLM generation, as in LaRS (Xu et al., 2023), LatentSeek (Li et al., 2025a), SoftCoT (Xu et al., 2025c,b), and Coprocessor (Liu et al., 2024), which leverage latent representations to modulate the quality of generated outputs. These aforementioned works have greatly inspired the latent memory design in this paper: Latent memory can likewise be viewed as an instantiation of the latter, supplementing essential memory context to enhance the problem-solving capacity of agents (Wang et al., 2024b, 2025a).

LLM Decoding & RL. Two additional topics that relate to our work are LLM decoding and reinforcement learning (RL). From the decoding perspective, MemGen dynamically generates and inserts latent tokens, which shares similarity with speculative decoding, where a drafter model receives the current decoding context and produces subsequent drafted tokens (Cai et al., 2024; Fu et al., 2024; Li et al., 2025b; Goel et al., 2025). However, these methods primarily aim to accelerate LLM inference, whereas MemGen focuses on leveraging latent states as effective carriers of memory. From the RL perspective, MemGen employs rule-based RL to train the memory trigger, which is closely related to reinforcement learning with verifiable rewards (RLVR), including GRPO from DeepSeek-R1 (DeepSeek-AI et al., 2025) and its various derivatives (Qian et al., 2025; Wu et al., 2025a; Wei et al., 2025; Fu et al., 2025). While there exist efforts combining RL with agent memory, to our knowledge, most do not address self-improving memory; for example, MemAgent (Yu et al., 2025) and MEM1 (Zhou et al., 2025) focus on handling long-context inputs rather than evolving memory mechanisms.

3 Preliminary

[Summary]
This section formalizes the agent's interaction with the environment. The agent, powered by an LLM, generates high-level action sequences autoregressively, with each token conditioned on the current state and previously generated tokens. Once an action sequence is executed, the environment state is updated, and the success of the whole trajectory is evaluated by a reward function. The problem formalization jointly optimizes the policy and a memory system to maximize performance on new tasks by leveraging past experience. The memory generation function varies across paradigms (task-level, step-level, and parametric memory), and this work introduces a dynamic latent memory that decides for itself, at the token level, when to regenerate memory.

Notation. We formalize the agent’s interaction within an environment $\mathcal{E}$. An agent, powered by an LLM parameterized by $\theta$, is denoted as $\pi_\theta$. For a given task $x$, the agent’s interaction unfolds as a high-level trajectory, denoted as $\tau = (s_0, a_0, s_1, a_1, \dots)$, where $s_t$ represents the state of the environment and $a_t$ is the high-level action taken by the agent. More internally, each action is essentially a sequence of tokens, $a_t = (\mathbf{z}_{t,1}, \dots, \mathbf{z}_{t,|a_t|})$, generated autoregressively by the LLM. The generation of the $j$-th token $\mathbf{z}_{t,j}$ is conditioned on the current state and all previously generated tokens within that action:

\mathbf{z}_{t,j} \sim \pi_{\theta}(\cdot \mid s_t, \mathbf{z}_{t, < j}). \tag{1}

Figure 2 The overview of our proposed MemGen.

After an entire action sequence $a_t$ is generated, it is executed in the environment, which transitions the state from $s_t$ to $s_{t+1}$. The success of the trajectory is evaluated by a reward function $R(\tau)$.

Problem Formalization. Given a history of past experiences $\mathcal{H}$, the objective is to leverage this history to maximize the agent’s performance on new tasks. The policy $\pi_\theta$ and a memory system $\mathcal{M}$ are thus jointly optimized to maximize the expected reward over a task distribution $\mathcal{D}$:

\max_{\theta, \mathcal{M}} \mathbb{E}_{x \sim \mathcal{D}, \tau \sim \pi_{\theta, \mathcal{M}}} [R(\tau)], \tag{2}

during which the role of $\mathcal{M}$ is to produce a memory representation, $m$, which conditions the agent’s policy. The action at any timestep $t$ is thus sampled as $a_t \sim \pi_\theta(\cdot \mid s_t, m_t)$, where $m_t$ is the inserted memory at that step. Crucially, the nature and timing of memory generation, which we denote as the function $f_{\mathcal{M}}$, vary across different paradigms. We express the generation of the memory as:

m_t = f_{\mathcal{M}}(s_t, \mathcal{H}, m_{< t}), \tag{3}

which accommodates diverse memory invocation granularities. For task-level memory (e.g., ExpeL (Zhao et al., 2024) and G-Memory (Zhang et al., 2025a)), $f_{\mathcal{M}}$ is invoked only at $t=0$, and $m_t = m_0$ for all subsequent steps. For step-level memory (e.g., AgentKB (Tang et al., 2025)), $f_{\mathcal{M}}$ is invoked at every step $t$ to update the memory. In parametric memory, the influence of $\mathcal{H}$ is compiled into $\theta$, rendering memory generation implicit in the model parameters. Our work, which introduces dynamic latent memory, focuses on designing a more fine-grained $f_{\mathcal{M}}$ that decides for itself the optimal moments to regenerate $m_t$ at the token level during the agent’s reasoning process.
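To make the granularity distinction concrete, below is a minimal Python sketch of how a memory function in the spirit of Equation (3) could be dispatched at different granularities. The function names (`rollout`, `policy_step`, `memory_fn`) and the string-valued toy memory are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Optional

def rollout(task: str,
            memory_fn: Callable[[int, str, List[str]], Optional[str]],
            policy_step: Callable[[str, Optional[str]], str],
            max_steps: int = 5) -> List[str]:
    """Roll out a trajectory, querying the memory system before each step.

    memory_fn(t, state, history) returns a new memory string, or None to keep
    the previous one, mirroring Equation (3): m_t = f_M(s_t, H, m_<t).
    """
    state, memories, actions = task, [], []
    for t in range(max_steps):
        m_t = memory_fn(t, state, memories)
        if m_t is not None:
            memories.append(m_t)
        current_memory = memories[-1] if memories else None
        action = policy_step(state, current_memory)   # a_t ~ pi(. | s_t, m_t)
        actions.append(action)
        state = state + f" -> {action}"               # toy environment transition
    return actions

# Task-level memory (e.g., ExpeL-style): invoked only at t = 0.
task_level = lambda t, s, mem: "retrieved insight" if t == 0 else None
# Step-level memory (e.g., AgentKB-style): refreshed at every step t.
step_level = lambda t, s, mem: f"insight for step {t}"

if __name__ == "__main__":
    dummy_policy = lambda state, memory: f"act(mem_used={memory is not None})"
    print(rollout("sort the files", task_level, dummy_policy))
    print(rollout("sort the files", step_level, dummy_policy))
```

MemGen's token-level memory would correspond to a `memory_fn` that is consulted inside the token loop of the policy itself, which is what Section 4 develops.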

4 Methodology

[Summary]

  1. MemGen mimics the continuous interplay of human reasoning and memory, designing a memory-reasoning interleaving structure that departs from the one-shot retrieval of existing agents.
  2. At each token-generation step, a memory trigger analyzes the internal cognitive state and decides whether to invoke memory, performing dynamic memory insertion when needed.
  3. The memory weaver, taking the cognitive state passed by the trigger and optionally drawing on an external memory database, synthesizes and integrates a bespoke latent memory for reasoning.
  4. The generated memory is prepended to the reasoner's hidden states, letting generation resume on a dynamic context and turning reasoning into a non-linear dialogue with memory.
  5. This structure keeps the core reasoning model frozen while strengthening reasoning through interaction with memory.

4.1 MemGen: Interleaving Memory and Reasoning

[Summary]

  1. MemGen is an agent memory system that organically combines memory and reasoning, mimicking the continuous flow of human thought and departing from the one-shot retrieval of existing systems.
  2. The system performs autoregressive reasoning with a frozen core LLM (πθ), while a memory trigger analyzes the metacognitive state during token generation and decides whether to insert memory.
  3. Upon insertion, the memory weaver synthesizes a latent memory (M_t), optionally combined with an external memory database, and integrates it into the reasoning context.
  4. This process upgrades reasoning from a linear unfolding to a recursive dialogue with memory, executed while preserving the general capabilities of the core LLM.

Just as a person is the sum of their past experiences (McAdams, 2001), memory critically shapes an agent’s actions (Xiong et al., 2025). Existing agent memory systems, however, often lack the flexibility of human cognition. In the human brain, reasoning and memory form a seamless continuum: active reasoning in the frontoparietal control network and memory retrieval in the hippocampus and prefrontal cortices operate interweavingly, generating a “continuous flow of thoughts” (Su et al., 2025a). By contrast, many agent memory paradigms retrieve information once at task onset and append it coarsely to the query. MemGen is designed precisely to bridge this gap.

As shown in Figure 2, the reasoning process in an agent equipped with MemGen unfolds autoregressively, driven by a frozen core LLM, the reasoner $\pi_\theta$. For a given state $s_t$, $\pi_\theta$ generates the action $a_t$. MemGen continuously monitors the token-by-token generation process and performs on-demand memory insertion. At each token-generation step $j$, a memory trigger $\mathcal{T}_{\text{trigger}}$ monitors the reasoner’s internal cognitive state to determine if a moment of reflection, i.e., a memory invocation, is necessary. Specifically, as the reasoner generates the token sequence $\mathbf{z}_{t,<j}$, it produces a corresponding sequence of hidden state vectors, $\mathbf{H}_{t,<j} = (\mathbf{h}_{t,1}, \dots, \mathbf{h}_{t,j-1})$, where each $\mathbf{h}_{t,i} \in \mathbb{R}^{d}$. The trigger takes the current hidden states as a representation of the reasoner’s current metacognitive state and computes an invocation probability:

p_j = \sigma \left( \mathcal{T}_{\text{trigger}}(\mathbf{h}_{t,1}, \dots, \mathbf{h}_{t,j-1}) \right), \tag{4}

from which a binary decision, $d_j \in \{\texttt{[SKIP]}, \texttt{[INVOKE]}\}$, is sampled. If the decision is [SKIP], $\pi_\theta$ proceeds with its standard autoregressive generation, i.e., Equation (1) applies. However, if the decision is [INVOKE], the reasoning process is momentarily paused. This summons the second core component of our framework: the memory weaver $\mathcal{W}_{\text{weaver}}$, which takes the same cognitive state $\mathbf{H}_{t,<j}$ as a stimulus to perform a generative act of recollection. It synthesizes a bespoke, machine-native latent memory, formalized as $\mathbf{M}_t$ with fixed length $K$:


\mathbf{M}_{t} := [\mathbf{m}_{t,1}, \mathbf{m}_{t,2}, \cdots, \mathbf{m}_{t,K}] = \mathcal{W}_{\text{weaver}}(\mathbf{H}_{t,< j}), \tag{5}

where the memory is generated not merely from the parametric knowledge encoded within $\mathcal{W}_{\text{weaver}}$ but may also incorporate cues retrieved from external memory databases (the detailed implementation is elaborated in Section 4.3). Crucially, $\mathbf{M}_t$ is not a verbatim restatement of prior content but a selective reconstruction, filtered and integrated through $\mathcal{W}_{\text{weaver}}$, akin to the hippocampus consolidating fragments of recollection into human memory (Spens and Burgess, 2024). Once formed, the latent memory is woven seamlessly into the reasoner’s ongoing dynamics: its hidden states are prepended to $\mathbf{H}_{t,<j}$, upon which the reasoner resumes generation conditioned on this enriched context,



\mathbf{z}_{t,j} \sim \pi_{\theta}(\cdot \mid s_t, \mathbf{z}_{t,< j}, \mathbf{M}_t). \tag{6}

This iterative cycle of generation, monitoring, invocation, weaving, and reintegration elevates reasoning from a linear unfolding to a recursive dialogue with memory, all without altering the frozen reasoner $\pi_\theta$, and thereby preserving its general capabilities. In the following sections, we detail the implementations of the memory trigger (Section 4.2) and the memory weaver (Section 4.3).
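The following is a minimal sketch of this generate-monitor-invoke-weave-resume loop (Equations (4)-(6)). It assumes `reasoner`, `trigger`, and `weaver` are stand-in callables (the paper implements the latter two as LoRA adapters), uses greedy decoding for simplicity, and simply concatenates the latent memory into the running context rather than reproducing the exact insertion mechanics.

```python
import torch

def interleaved_decode(reasoner, trigger, weaver, input_ids,
                       delimiter_ids, max_new_tokens=128):
    """MemGen-style decoding sketch, assuming:
    - reasoner(inputs_embeds) -> (logits, hidden_states) for a frozen LLM;
    - trigger(hidden_states) -> invocation probability p_j (Equation (4));
    - weaver(hidden_states) -> latent memory M_t of shape (1, K, d) (Equation (5)).
    """
    embed = reasoner.get_input_embeddings()
    inputs_embeds = embed(input_ids)                      # (1, T, d)
    generated = []
    for _ in range(max_new_tokens):
        logits, hidden = reasoner(inputs_embeds)          # hidden: (1, T, d)
        next_id = logits[:, -1].argmax(dim=-1)            # greedy for simplicity
        generated.append(next_id.item())
        inputs_embeds = torch.cat([inputs_embeds, embed(next_id)[:, None]], dim=1)
        # Consider invocation only at sentence-level delimiters (Section 4.2).
        if next_id.item() in delimiter_ids:
            p_j = trigger(hidden)
            if torch.bernoulli(p_j).item() == 1:          # [INVOKE]
                M_t = weaver(hidden)                      # (1, K, d) latent memory
                # Weave the memory into the ongoing context (Equation (6)).
                inputs_embeds = torch.cat([inputs_embeds, M_t], dim=1)
    return generated
```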

4.2 Memory Trigger: Learning to Invoke Memory

[Summary]

  1. The memory trigger is a metacognitive monitor that decides the critical moments of memory invocation during reasoning, implemented as a lightweight LoRA adapter.
  2. It adopts a sentence-granularity activation strategy, making invocation decisions only at delimiter tokens such as punctuation, which helps preserve decoding efficiency.
  3. It is trained via reinforcement learning with a reward-adaptive penalty, based on the mean activation probability of high-reward trajectories, that suppresses unnecessary invocations.
  4. The objective is designed to optimize memory invocation strategically, achieving task-performance gains and computational efficiency at the same time.

In this section, we describe the concrete instantiation and training recipe of the memory trigger $\mathcal{T}_{\text{trigger}}$. Recall from Section 4.1 that $\mathcal{T}_{\text{trigger}}$ serves as a metacognitive monitor, observing the evolving reasoning state of the frozen reasoner $\pi_\theta$ and producing a binary decision that determines whether the memory weaver should be invoked at token $j$.

Instantiation. We instantiate $\mathcal{T}_{\text{trigger}}$ as a lightweight LoRA adapter attached to the reasoner $\pi_\theta$. At decoding step $j$ of timestep $t$, it receives the sequence of all hidden states, $\mathbf{H}_{t,<j}$; conditioned on this context, it outputs the invocation probability $p_j$ of Equation (4). To avoid excessive computational overhead, we adopt a sentence-granularity activation strategy, inspired by recent studies on LLM interpretability (Anthropic, 2025; Chen et al., 2024a), which find that interventions between sentences can more effectively guide LLMs’ reasoning path. Specifically, we define a delimiter token set (e.g., commas, periods) and let the trigger act only when the current token falls in this set: the invocation decision is sampled from $p_j$ at delimiter positions and defaults to [SKIP] elsewhere, which ensures that $\mathcal{W}_{\text{weaver}}$ is invoked only at semantically significant boundaries, preserving decoding efficiency. We validate that MemGen does not incur excessive inference delay in Section D.3.3.
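A small sketch of this delimiter-gated decision follows. The paper's trigger is a LoRA adapter over the reasoner; here a tiny MLP head over the last hidden state stands in for it, and the gating helper is a hypothetical name introduced only for illustration.

```python
import torch
import torch.nn as nn

class MemoryTrigger(nn.Module):
    """Illustrative trigger head (a stand-in for the paper's LoRA adapter)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim // 4),
                                  nn.Tanh(),
                                  nn.Linear(hidden_dim // 4, 1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (1, j-1, d). Condition on the running context and
        # emit the invocation probability p_j of Equation (4).
        return torch.sigmoid(self.head(hidden_states[:, -1]))   # (1, 1)

def invocation_decision(trigger, hidden_states, last_token_id, delimiter_ids):
    """Sentence-granularity gating: sample a decision only when the most
    recent token is a delimiter; otherwise always [SKIP]."""
    if last_token_id not in delimiter_ids:
        return False
    p_j = trigger(hidden_states)
    return bool(torch.bernoulli(p_j).item())
```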

Training Recipe. The memory trigger is trained via reinforcement learning, motivated by the need to balance two competing desiderata: ensuring that critical latent memories are invoked to improve task performance, while avoiding unnecessary or spurious invocations that could disrupt reasoning or incur computational overhead. Given a batch of seen tasks $\{x_i\}$, the frozen reasoner $\pi_\theta$ generates candidate trajectories $\{\tau_i\}$ while the memory weaver $\mathcal{W}_{\text{weaver}}$ remains fixed. At each activated step, the trigger selects an action $\tilde{d}_{i,j} \in \{\texttt{[SKIP]}, \texttt{[INVOKE]}\}$ and receives a reward $R(\tau_i)$ reflecting the quality of the resulting trajectory with respect to the task objective. To encourage sparse yet strategically critical memory invocation, we introduce a reward-adaptive penalty, which discourages unnecessary activations while preserving essential ones, into the objective:

\max_{\phi} \mathbb{E}_{\tau_{i} \sim \pi_{\theta}, \tilde{\mathbf{d}} \sim \mathcal{T}_{\text{trigger}}^{\phi}} \left[ R(\tau_{i}) - \lambda \sum_{i,j} \max(0, \tilde{d}_{i,j} - \bar{p}) \right], \tag{8}

where $\bar{p}$ is computed as the mean activation probability across high-reward trajectories, i.e., those with reward exceeding the batch median:

\bar{p} = \operatorname{mean}\left\{\, p_{i,j} \;:\; R(\tau_i) > \operatorname{median}_{i'} R(\tau_{i'}) \,\right\}, \tag{9}

where this reward-adaptive threshold ensures that $\mathcal{T}_{\text{trigger}}$ learns to invoke memory selectively at key decision points, maximizing task reward while maintaining computational efficiency.
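Below is a sketch of the reward-adaptive penalty of Equations (8)-(9). It uses the trigger's activation probabilities as a stand-in for the sampled decisions, omits the policy-gradient machinery needed to actually optimize the trigger, and assumes a simple list-of-tensors representation for the per-trajectory probabilities; all of these are illustrative choices, not the released training code.

```python
import torch

def reward_adaptive_penalty(activation_probs, rewards, lam=0.1):
    """activation_probs[i]: 1-D tensor of trigger probabilities at the activated
    steps of trajectory i; rewards[i]: scalar reward R(tau_i)."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    median = rewards.median()
    # Equation (9): mean activation probability over high-reward trajectories.
    high = [p for p, r in zip(activation_probs, rewards) if r > median]
    p_bar = torch.cat(high).mean() if high else torch.tensor(0.0)
    # Equation (8): penalize activations exceeding the high-reward average.
    penalty = sum(torch.clamp(p - p_bar, min=0).sum() for p in activation_probs)
    objective = rewards.sum() - lam * penalty   # maximized w.r.t. trigger params
    return objective, p_bar

# Example: two trajectories with different rewards and activation patterns.
probs = [torch.tensor([0.9, 0.2, 0.7]), torch.tensor([0.8, 0.8, 0.9])]
obj, p_bar = reward_adaptive_penalty(probs, rewards=[1.0, 0.0])
```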

4.3 Memory Weaver: Synthesizing and Inserting Latent Memory

[Summary]

  1. The memory weaver is the memory carrier of the MemGen framework: when the agent assimilates new experiences, they are internalized into the weaver's parameters without changing the core reasoner.
  2. When the reasoner needs experiential support, a context-dependent hook activates the weaver to synthesize a useful memory (M_t) and hand it to the reasoner.
  3. The weaver is implemented as a LoRA adapter; given the incoming hook (H_{t,<j}), it generates a fixed-length latent token sequence that is prepended to the reasoner's hidden states to guide token generation.
  4. MemGen trains the weaver in a manner compatible with diverse optimization strategies (SFT, GRPO, DAPO, etc.) while preserving the reasoner's general capabilities.
  5. It can be combined with external memory systems (MemoryBank, ExpeL, etc.), enabling richer memory support that integrates internal knowledge with external information.

In this section, we elaborate on the weaver $\mathcal{W}_{\text{weaver}}$, the memory carrier within the MemGen framework. When the agent assimilates new experiences, this information is exclusively internalized into the parameters of $\mathcal{W}_{\text{weaver}}$, leaving the core reasoner $\pi_\theta$ entirely unmodified. At junctures where the reasoner requires experiential support, a context-dependent hook activates the weaver to synthesize and externalize pertinent knowledge as a usable memory. To be more specific, recall from Equation (5) that after $\mathcal{T}_{\text{trigger}}$ signals the need for memory at step $j$, $\mathcal{W}_{\text{weaver}}$ accepts $\mathbf{H}_{t,<j}$ (as the hook) and generates a latent token sequence $\mathbf{M}_t$ (as the memory) for $\pi_\theta$.

Instantiation. We instantiate $\mathcal{W}_{\text{weaver}}$ using another LoRA adapter attached to $\pi_\theta$. Formally, given the incoming hook $\mathbf{H}_{t,<j}$, the weaver outputs a latent memory matrix $\mathbf{M}_t = \mathcal{W}_{\text{weaver}}(\mathbf{H}_{t,<j}; \theta_{\text{lora}}) \in \mathbb{R}^{K \times d}$, where $K$ denotes the fixed length of the latent memory sequence and $\theta_{\text{lora}}$ are the trainable LoRA parameters. The synthesized $\mathbf{M}_t$ is then prepended to the current hidden states of $\pi_\theta$ to guide subsequent token generation, as described in Equation (6).
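As a rough illustration of the weaver's interface, the sketch below maps the hook to a fixed-length latent memory. It replaces the paper's LoRA adapter with learned query vectors and cross-attention, which is an assumption made purely to keep the example self-contained; only the input/output shapes mirror Equation (5).

```python
import torch
import torch.nn as nn

class MemoryWeaver(nn.Module):
    """Illustrative weaver: K learned queries attend over the hook H_{t,<j}
    and emit a fixed-length latent memory M_t of shape (1, K, d)."""
    def __init__(self, d_model: int, K: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, K, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hook: torch.Tensor) -> torch.Tensor:
        # hook: (1, j-1, d) hidden states of the frozen reasoner.
        mem, _ = self.attn(self.queries, hook, hook)   # (1, K, d)
        return self.proj(mem)                          # latent memory M_t

weaver = MemoryWeaver(d_model=2048, K=8)
M_t = weaver(torch.randn(1, 37, 2048))   # woven into the reasoner's context
```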

Training Recipe. The training of $\mathcal{W}_{\text{weaver}}$ proceeds over a batch of past trajectories $\mathcal{H} = \{(x_i, \tau_i)\}$. Distinct from conventional agent tuning, which directly integrates experiential data into the parameters of $\pi_\theta$ (Chen et al., 2025; Yin et al., 2024), MemGen internalizes experiential knowledge solely into $\theta_{\text{lora}}$, which ensures that $\pi_\theta$’s general capabilities remain intact.

Crucially, this separation makes MemGen agnostic to optimization strategies and compatible with diverse LLM backbones. Whether employing supervised fine-tuning (SFT) or RL-based objectives such as GRPO or DAPO, the weaver can be updated under a unified goal: optimizing the generation process of latent memory so as to maximize downstream reward. Formally, let $\Pi_{\theta}^{\mathcal{W}, \mathcal{T}}(\cdot \mid x)$ denote the process of rolling out a trajectory for a task $x$ by $\pi_\theta$ in conjunction with weaver $\mathcal{W}_{\text{weaver}}$ and trigger $\mathcal{T}_{\text{trigger}}$. Given a reward functional $R$, the objective updates only $\theta_{\text{lora}}$ by maximizing the expected reward:


\max_{\theta_{\text{lora}}} \mathbb{E}_{(x_i,\tau_i) \sim \mathcal{H}} \, \mathbb{E}_{\tau \sim \Pi_{\theta}^{\mathcal{W}, \mathcal{T}}(\cdot \mid x_i)} [R(x_i, \tau)], \tag{10}

where the gradients from $R$ are propagated solely to $\theta_{\text{lora}}$, thereby equipping the weaver to supply precisely the memories that improve end-to-end performance without altering $\pi_\theta$. Equation (10) enables $\mathcal{W}_{\text{weaver}}$ to absorb diverse experiential signals and externalize them as dynamic, context-sensitive latent memories, independent of the architectural or training paradigm of the base reasoner.
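The sketch below conveys the spirit of Equation (10) with a single-sample, REINFORCE-style surrogate: gradients flow through the frozen reasoner but only the weaver's parameters are updated. The helper `reasoner_logprob` is an assumed stand-in (not a real API) for computing the frozen reasoner's log-probability of a sampled trajectory given a context enriched with the latent memory.

```python
import torch

def update_weaver_only(reasoner_logprob, weaver, optimizer,
                       hook, sampled_ids, reward):
    """One reward-weighted update of the weaver; the reasoner stays frozen.

    reasoner_logprob(hook, M_t, sampled_ids) is assumed to return a scalar
    log-probability that is differentiable with respect to M_t, while the
    reasoner's own parameters are frozen inside it.
    """
    M_t = weaver(hook)                                  # (1, K, d), trainable path
    logprob = reasoner_logprob(hook, M_t, sampled_ids)  # differentiable in M_t
    loss = -reward * logprob                            # maximize expected reward
    optimizer.zero_grad()
    loss.backward()                                     # gradients reach the weaver only
    optimizer.step()
    return loss.item()

# The optimizer is built over the weaver's parameters alone, e.g.:
# optimizer = torch.optim.AdamW(weaver.parameters(), lr=1e-4)
```

An SFT variant would replace the reward-weighted log-probability with a cross-entropy loss on reference answers; a GRPO variant would use group-normalized advantages, but in every case only the weaver's (LoRA) parameters receive gradients.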

Integration with Retrieval-based Memory. Although the memory generation above primarily draws on the weaver’s parametric knowledge, it can be combined with external memory sources. When triggered, any retrieval-based system (e.g., MemoryBank, ExpeL) can provide textual memory, which is merged with the hook $\mathbf{H}_{t,<j}$ and fed into $\mathcal{W}_{\text{weaver}}$ to produce latent memory. This allows $\mathcal{W}_{\text{weaver}}$ to integrate internal knowledge and external information, supplying the reasoner with richer memory support. Implementation details and results are placed in Appendix E.
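One plausible way to merge a retrieved textual memory with the hook is to embed it and concatenate it in front of the hook hidden states before the weaver runs; the concatenation strategy below is an assumption (the paper's exact fusion is in its Appendix E).

```python
import torch

def build_hook_with_retrieval(hook, retrieved_text, tokenizer, embed_layer):
    """Merge external textual memory with the hook before it reaches the weaver.

    hook: (1, j-1, d) reasoner hidden states; tokenizer/embed_layer are the
    reasoner's own tokenizer and input-embedding layer.
    """
    ids = tokenizer(retrieved_text, return_tensors="pt").input_ids   # (1, L)
    text_embeds = embed_layer(ids)                                    # (1, L, d)
    return torch.cat([text_embeds, hook], dim=1)   # enriched stimulus for the weaver

# latent_memory = weaver(build_hook_with_retrieval(hook, "past insight ...", tok, emb))
```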

5 Experiments

[Summary]

  1. This section evaluates whether MemGen outperforms parametric and retrieval-based memory.
  2. It analyzes whether the memory learned by MemGen generalizes across task domains, and why.
  3. It experimentally verifies whether MemGen supports continual learning and mitigates catastrophic forgetting.
  4. It explores whether MemGen implicitly evolves a human-like memory hierarchy.

In this section, we conduct extensive experiments to answer the following research questions:

  • (RQ1) Can MemGen surpass both parametric and retrieval-based memory?
  • (RQ2) Is the memory learnt by MemGen generalizable across task domains? And why?
  • (RQ3) Can MemGen facilitate continual learning and mitigate catastrophic forgetting?
  • (RQ4) Does MemGen implicitly evolve human-like memory hierarchy?

5.1 Experimental Setup

[Summary]

  1. Evaluation covers nine datasets from five domains (web search, embodied action, math reasoning, scientific reasoning, and coding), including TriviaQA, ALFWorld, AQuA, and GPQA.
  2. Twelve baselines are compared, grouped into four categories: prompt-based methods, parametric memory, retrieval-based memory, and latent computation (e.g., SFT, GRPO, MemoryBank, SoftCoT).
  3. Implementation uses LLM backbones of varying sizes such as Qwen-2.5-1.5B and SmolLM3-3B, the latent memory length K is chosen from {2, 4, 8}, and MemGen is implemented in two variants (MemGen SFT and MemGen GRPO) trained with SFT and GRPO signals.

Evaluation and Benchmarks. Our evaluation covers nine datasets from five domains, including • web search: TriviaQA (Joshi et al., 2017) and PopQA (Mallen et al., 2023); • embodied action: ALF-World (Shridhar et al., 2021); • math reasoning: AQuA (Ling et al., 2017), GSM8K (Cobbe et al., 2021), and MATH (Hendrycks et al., 2021); • scientific reasoning: GPQA (Rein et al., 2023); and • coding: KodCode (Xu et al., 2025d) and BigCodeBench (Jain et al., 2024).

Baselines. We compare MemGen against twelve baselines, categorized into four groups: (I) Prompt-based methods: the vanilla model and CoT (Wei et al., 2023); (II) Parametric memory, where experiential knowledge directly modifies model parameters, via SFT, GRPO (DeepSeek-AI et al., 2025), REINFORCE (Williams, 1992), REINFORCE++ (Hu et al., 2025a), and Agent-FLAN (Chen et al., 2024b); (III) Retrieval-based memory, which processes tasks sequentially and stores the experiences in an external database, represented by MemoryBank (Zhong et al., 2023), ExpeL (Zhao et al., 2024), and Agent Workflow Memory (AWM) (Wang et al., 2024c); and (IV) Latent computation, which leverages latent tokens as carriers of experiential knowledge, including SoftCoT (Xu et al., 2025c) and Co-processor (Liu et al., 2024).

Implementation Details. We select LLM backbones of varying sizes, including Qwen-2.5-1.5B (Yang et al., 2024a), HuggingFace’s SmolLM3-3B (HuggingFace, 2025), and Qwen3-8B (Yang et al., 2025). The length of each latent memory sequence K is set among {2, 4, 8}. MemGen does not rely on a specific optimization algorithm, so we implement two variants, MemGen SFT and MemGen GRPO, in which the weaver is updated using SFT and GRPO signals, respectively. Details on these variants are provided in Appendix B. More training setup and parameter configurations are listed in Appendix C.

5.2 Main Results

[Summary]
MemGen outperforms the underlying baselines across domains and, on reasoning-centric tasks in particular, overcomes the limitations of existing memory-based methods (e.g., ExpeL, MemoryBank). For instance, MemGen GRPO improves over the vanilla model by 44.64% on ALFWorld, with large gains also on KodCode and PopQA. Parametric finetuning methods, in contrast, excel in structured domains such as code generation but are weaker on knowledge-intensive reasoning tasks. MemGen also shows strong cross-domain generalization: a model trained on KodCode raises MATH performance from 36.6% to 54.2%, demonstrating memory transfer. Overall, MemGen achieves in-domain gains and cross-domain generalization simultaneously.

[For RQ1] MemGen provides high-performing memory across domains. As shown in Tables 1 and 3, existing baselines exhibit clear limitations in cross-domain adaptivity. Retrieval-based memories (e.g., ExpeL, MemoryBank, AWM) occasionally surpass parametric tuning in embodied action; for instance, AWM reaches 36.18% on ALFWorld with SmolLM3-3B, exceeding SFT by 3.15%. Yet their effectiveness deteriorates on reasoning-intensive tasks: ExpeL achieves only 8.12% on GPQA+Qwen2.5-1.5B, and even underperforms the vanilla model by 6.9% on TriviaQA, underscoring its heavy reliance on backbone capacity. Parametric finetuning methods display the opposite tendency: they excel in structured domains such as code generation, where REINFORCE++ reaches 63.33% on KodCode with Qwen2.5-1.5B, but remain weak in knowledge-intensive reasoning, with GPQA below 14%. In contrast, MemGen consistently advances performance across all domains. For example, on ALFWorld+SmolLM3-3B, MemGen SFT and MemGen GRPO attain 50.60% and 63.60%, improving over vanilla by 31.64% and 44.64%, respectively. Similar gains appear with the larger Qwen3-8B, where MemGen GRPO achieves +27.06% on KodCode and +28.17% on PopQA, surpassing GRPO by up to 3.4%. Overall, the dynamic memory insertion of MemGen delivers substantial improvements across diverse task domains.

[For RQ2] MemGen Exhibits Strong Cross-Domain Generalization. To evaluate whether the memory learned by MemGen can transfer across tasks, we train MemGen on one dataset and test it on several others. We include two out-of-domain datasets, ScienceWorld (Wang et al., 2022) and FEVER (Thorne et al., 2018), to further probe this.

Table 1 Results on SmolLM3-3B and Qwen3-8B. All values represent the performance metric for each task (e.g., accuracy %). We highlight the best and second best results.

| Backbone | Method | ALFWorld | TriviaQA | PopQA | KodCode | BigCodeBench | GPQA | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|---|
| SmolLM3-3B | Vanilla | 18.96 | 10.47 | 8.23 | 37.05 | 35.96 | 9.35 | 47.63 | 16.22 |
| SmolLM3-3B | CoT | 17.60 | 12.88 | 9.95 | 38.45 | 39.42 | 20.70 | 58.91 | 56.33 |
| SmolLM3-3B | SFT | 32.36 | 55.25 | 37.22 | 59.25 | 40.79 | 19.70 | 63.48 | 45.65 |
| SmolLM3-3B | GRPO | 55.35 | 65.88 | 45.16 | 68.48 | 72.44 | 22.73 | 80.03 | 61.23 |
| SmolLM3-3B | REINFORCE | 53.13 | 63.20 | 46.81 | 65.53 | 67.14 | 23.44 | 82.03 | 58.75 |
| SmolLM3-3B | REINFORCE++ | 53.95 | 63.20 | 44.10 | 65.90 | 68.80 | 22.73 | 81.50 | 59.89 |
| SmolLM3-3B | Agent-FLAN | 34.00 | 56.70 | 39.50 | 56.80 | 37.20 | 17.80 | 59.60 | 36.84 |
| SmolLM3-3B | ExpeL | 36.18 | 46.20 | 28.16 | 51.14 | 40.22 | 15.15 | 56.23 | 38.11 |
| SmolLM3-3B | MemoryBank | 32.80 | 43.30 | 25.81 | 44.50 | 31.80 | 10.20 | 58.30 | 43.53 |
| SmolLM3-3B | AWM | 40.50 | 49.80 | 29.60 | - | - | - | - | - |
| SmolLM3-3B | SoftCoT | 35.03 | 50.38 | 34.90 | 59.20 | 39.10 | 17.22 | 56.34 | 44.62 |
| SmolLM3-3B | Co-processor | 38.36 | 53.28 | 38.96 | 56.25 | 45.40 | 20.10 | 57.60 | 38.81 |
| SmolLM3-3B | MemGen SFT | 50.60 | 68.13 | 42.34 | 62.65 | 42.99 | 26.75 | 70.42 | 57.44 |
| SmolLM3-3B | MemGen GRPO | 63.60 | 79.30 | 58.60 | 72.85 | 74.24 | 25.20 | 83.47 | 63.65 |
| Qwen3-8B | Vanilla | 58.93 | 52.18 | 34.13 | 49.10 | 33.33 | 38.18 | 89.48 | 79.82 |
| Qwen3-8B | CoT | 57.10 | 53.80 | 33.20 | 51.25 | 35.59 | 35.15 | 87.67 | 78.24 |
| Qwen3-8B | SFT | 83.59 | 74.55 | 51.12 | 64.75 | 41.33 | 40.33 | 90.76 | 81.35 |
| Qwen3-8B | GRPO | 85.60 | 76.15 | 58.90 | 73.35 | 70.24 | 39.54 | 92.30 | 83.54 |
| Qwen3-8B | REINFORCE | 82.10 | 75.22 | 57.96 | 72.11 | 70.20 | 37.12 | 91.25 | 83.27 |
| Qwen3-8B | REINFORCE++ | 84.80 | 75.90 | 58.30 | 72.90 | 71.88 | 37.68 | 91.90 | 85.24 |
| Qwen3-8B | Agent-FLAN | 80.32 | 70.32 | 50.08 | 62.99 | 43.40 | 39.50 | 87.60 | 80.05 |
| Qwen3-8B | ExpeL | 78.97 | 65.54 | 40.33 | 57.20 | 34.23 | 35.15 | 86.20 | 77.40 |
| Qwen3-8B | MemoryBank | 70.41 | 60.56 | 41.60 | 56.39 | 40.61 | 35.66 | 90.35 | 80.35 |
| Qwen3-8B | AWM | 80.33 | 69.30 | 43.69 | - | - | - | - | - |
| Qwen3-8B | SoftCoT | 75.60 | 59.42 | 39.42 | 63.28 | 38.27 | 39.60 | 86.30 | 76.23 |
| Qwen3-8B | Co-processor | 73.28 | 61.42 | 45.55 | 64.90 | 42.19 | 39.15 | 76.23 | 79.20 |
| Qwen3-8B | MemGen SFT | 85.82 | 77.22 | 54.65 | 66.15 | 40.35 | 43.23 | 91.25 | 83.30 |
| Qwen3-8B | MemGen GRPO | 90.60 | 80.65 | 62.30 | 76.16 | 75.56 | 40.24 | 93.20 | 88.24 |

Figure 3 The generalization study of MemGen. We train MemGen SFT on one dataset (ALFWorld or TriviaQA) and evaluate it on four datasets (TriviaQA, ALFWorld, ScienceWorld, and FEVER).

As shown in Figures 3, 9 and 10, baselines such as SFT and MemoryBank achieve gains within their training domains (e.g., on ALFWorld, SFT +14.1% and MemoryBank +5.4% compared with vanilla), yet fail to generalize, with performance dropping sharply on FEVER by 16.2%. In contrast, MemGen not only attains substantial in-domain improvements (e.g., on KodCode, Figure 10), but also exhibits effective transfer: when trained on KodCode, performance on MATH rises from 36.6% to 54.2%. Having empirically validated MemGen’s generalizability, we take a further step to investigate the mechanism that underlies this capability.

[For RQ2] The Memory Trigger Intelligently Determines When to Activate Memory Insertion, Mitigating Domain Conflict. After training MemGen on GSM8K, we evaluate 150 samples each from GSM8K, KodCode, and GPQA, visualizing the frequency with which the memory trigger invoked the memory weaver at each relative position in the model output. We observe that the invocation frequency varies across domains and correlates directly with performance in Figure 9: GSM8K exhibits the largest improvement (+19.64%) and maximal invocations, GPQA achieves moderate gains (+6.06%) with medium invocations, and KodCode shows the smallest improvement (+3.1%) with the fewest invocations. This indicates that MemGen autonomously assesses, based on task-specific context, when memory insertion will be beneficial, invoking the weaver less frequently in unfamiliar domains.

Figure 5 (Left) t-SNE visualization of latent memories generated by MemGen +Qwen3-8B across datasets; (Middle and Right) Latent memory visualization within the TriviaQA and GSM8K datasets, clustered using K-means. The text at each cluster center represents the common pattern shared by many memory sequences in the cluster, such as Cluster 0 in GSM8K, where many sequences end with “_check”.

[For RQ3] MemGen Effectively Mitigates Catastrophic Forgetting. In Table 4, we sequentially train on four datasets and evaluate on all benchmarks after each stage, where MemGen exhibits stronger knowledge retention ability compared to baseline methods. For example, unlike SFT which primarily improves performance on the most recent task (54.10% on KodCode but only 2.53% on GPQA), MemGen demonstrates more balanced cross-task generalization, attaining 38.43% on AQuA and 21.72% on GPQA after GSM8K training. Finally, it mitigates forgetting on earlier tasks, preserving 40.34% on AQuA following KodCode training compared to 27.14% for ExpeL and 28.61% for SFT, indicating a more stable continual learning ability. More analysis is placed in Section D.1.

Figure 4 Memory invocation frequency across benchmarks at inference (trained on MemGen SFT+Qwen3-8B+GSM8K).

5.3 Framework Analysis

[Summary]

  1. t-SNE visualization shows that MemGen's latent memories form domain-separated clusters and, within clusters, consist of human-unreadable tokens that nevertheless exhibit structural regularities.
  2. Post-hoc intervention experiments analyzing the functional roles of latent memory clusters show that a human-like memory hierarchy of planning, procedural, and working memory organizes itself spontaneously.
  3. Sensitivity analysis over the latent memory length K and module ablations confirm that memory insertion yields up to 57.66% performance gains while keeping inference delay at 24-94% of the vanilla model's latency.

Having established the expressive capabilities of MemGen, we further investigate its underlying mechanisms: what do the learned latent memories look like? Do they have specialized functions?

[For RQ4] The Latent Memory Is Machine-Native and Human-Unreadable. We first visualized the latent memory sequences learned by MemGen across different datasets using t-SNE in Figures 5 and 11. As shown in Figure 5 (Left), sequences from distinct domains form separate distributions, with related domains clustering closely (e.g., KodCode and BigCodeBench, GSM8K and MATH). Examining latent memories within the same dataset, we observed pronounced clustering patterns (as shown in Figure 5 (Middle and Right)). To explore potential commonalities within these clusters, we forcibly decoded the latent tokens. Although the decoded sequences are not human-readable, they exhibit intriguing regularities: many tokens within a cluster share structural conventions. For example, Cluster 0 in TriviaQA frequently follows the pattern “[…]SOC”, whereas Cluster 3 in GSM8K often adopts the format “[…]_pick”. A large corpus of latent memory tokens is provided in Appendix F. Despite these sequences being machine-native and human-unreadable, we further investigate whether their underlying semantics can be interpreted.
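The following sketch shows one way such an analysis can be set up: clustering the latent memory sequences, projecting them with t-SNE, and "forcibly decoding" each latent vector by its nearest vocabulary embedding. The nearest-neighbor decoding step is an assumed stand-in for the paper's procedure (whose details are in its appendices).

```python
import torch
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def inspect_latent_memories(memories, embed_matrix, tokenizer, n_clusters=5):
    """memories: (N, K, d) latent memory sequences on CPU;
    embed_matrix: (V, d) token input embeddings of the reasoner."""
    N, K, d = memories.shape
    flat = memories.reshape(N, K * d)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat.numpy())
    coords = TSNE(n_components=2, init="pca").fit_transform(flat.numpy())

    # "Forcibly decode" each latent token to its nearest vocabulary embedding.
    sims = torch.einsum("nkd,vd->nkv", memories, embed_matrix)
    token_ids = sims.argmax(dim=-1)                       # (N, K)
    decoded = [tokenizer.decode(ids.tolist()) for ids in token_ids]
    return labels, coords, decoded
```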

Figure 6 (Left) Parameter sensitivity analysis on the latent memory length K; (Right) Effects of selectively removing latent memory clusters on different agent failure modes on the TriviaQA dataset.

[For RQ4] MemGen Implicitly Learns a Human-like Memory Hierarchy. To uncover the functional roles of different latent memory clusters, we conducted a post-hoc intervention study. Following the taxonomy from (Song et al., 2025), we study eight distinct types of agent failure, including planning errors, tool response/parsing failures, answer formatting mistakes, etc., providing a structured framework to assess how memory influences performance. During evaluation, we selectively removed latent tokens close to a specific cluster while keeping others intact, measuring the resulting changes in these failure modes (a minimal code sketch of this intervention is given at the end of this analysis). Details on (1) the visualization process, (2) failure mode annotation, and (3) token filtration are in Appendix G. As shown in Figure 6 (Right), distinct memory clusters exhibit varying influence on failure modes and can be mapped to different memory functions:

  • Planning Memory supports high-level task planning and strategic reasoning. Removal of Cluster 2 substantially increases planning and compositional reasoning failures, indicating that this cluster is crucial for guiding the LLM agent’s decision-making and sequencing of reasoning steps.
  • Procedural Memory captures task-specific operational knowledge, such as tool usage and formatting ability. Cluster 3 corresponds to this role, as its removal leads to a marked increase in tool response errors, parsing failures, and answer formatting mistakes.
  • Working Memory manages the retention and effective use of prior context to maintain reasoning consistency. Clusters 1 and 4 contribute to this function: for instance, removing Cluster 1’s memory tokens results in more frequent task misunderstandings and think-act inconsistency.

Nevertheless, these memory clusters are not entirely independent: for example, removing Cluster 1 also negatively affects planning ability, indicating that these memory faculties interact and jointly enable the LLM to leverage past experience effectively. This analysis reveals that MemGen spontaneously organizes latent memory into a structured, human-like hierarchy.
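As referenced above, here is a minimal sketch of the cluster-removal intervention. The cosine-similarity criterion and the threshold are illustrative assumptions about how "closeness to a cluster" might be operationalized, not the paper's exact filtration rule.

```python
import torch

def remove_cluster_memory(latent_tokens, cluster_centers, target_cluster, threshold=0.8):
    """Drop latent memory tokens whose nearest cluster center is the targeted
    one (e.g., the 'planning memory' cluster), keeping the rest intact.

    latent_tokens: (K, d) one latent memory sequence;
    cluster_centers: (C, d) centers from a prior K-means fit.
    """
    sims = torch.nn.functional.cosine_similarity(
        latent_tokens[:, None, :], cluster_centers[None, :, :], dim=-1)  # (K, C)
    nearest = sims.argmax(dim=-1)                                        # (K,)
    keep = ~((nearest == target_cluster) & (sims.max(dim=-1).values > threshold))
    return latent_tokens[keep]   # ablated memory sequence fed back to the reasoner
```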

Ablation Study & Sensitivity Analysis. We conduct a sensitivity analysis on the length of the latent memory sequence K, as shown in Figure 6 (Left). It can be observed that as the latent token length increases from 2 → 32, MemGen’s performance correspondingly improves, likely reflecting the expanded memory capacity. We then perform an ablation study on the memory trigger module in Table 5, demonstrating the necessity of a dedicatedly trained trigger for effective memory invocation. Furthermore, we analyze different training paradigms of the memory weaver in Table 6. Additional results and discussions are in Section D.3.

Efficiency Analysis. To confirm that the memory insertion process of MemGen does not introduce significant inference overhead, we show in Section D.3.3 that, while achieving up to 57.66% performance improvement over vanilla LLMs, the per-query inference delay remains consistently below the baseline, ranging from 24% to 94% of the vanilla LLM latency. This clearly demonstrates that MemGen delivers substantial performance gains without compromising efficiency.

6 Conclusion

[Summary]

  1. This work proposes MemGen, a dynamic generative memory framework for LLM agents, realizing the interleaving of reasoning and memory synthesis through a reinforcement-learned memory trigger and a generative memory weaver.
  2. It provides a structure that can transcend the limitations of parametric and retrieval-based paradigms; experiments confirm performance gains, strong cross-domain generalization, and continual learning ability.
  3. MemGen's memory hierarchy of planning, procedural, and working memory points toward self-evolving LLM agents with fluid, reconstructive intelligence.

In this work, we introduced MemGen, a dynamic and generative memory framework designed for LLM agents. By interleaving reasoning with memory synthesis through a reinforcement-learned memory trigger and a generative memory weaver, MemGen transcends the limitations of parametric and retrieval-based paradigms. Extensive experiments showcase substantial performance gains, robust cross-domain generalization, strong continual learning ability, and MemGen’s implicitly evolved memory hierarchy (i.e., planning, procedural, and working memory). These results suggest a promising path toward self-evolving LLM agents capable of fluid and reconstructive intelligence.

References

Anthropic. On the Biology of a Large Language Model. https://transformer-circuits.pub/2025/attribution-graphs/ biology.html, 2025. [Accessed 24-08-2025].

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads, 2024. https://arxiv.org/abs/2401.10774.

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning, 2023. https://arxiv.org/abs/2310.05915.

Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, and Chao Huang. Sepllm: Accelerate large language models by compressing one segment into one separator. arXiv preprint arXiv:2412.12094, 2024a.

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models, 2024b. https: //arxiv.org/abs/2403.12881.

Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, and Tianyi Zhou. Atlas: Agent tuning via learning critical steps, 2025. https://arxiv.org/abs/2503.02197.

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory, 2025. https://arxiv.org/abs/2504.19413.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li,

Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948.

  • Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. Loramoe: Alleviate world knowledge forgetting in large language models via moe-style plugin, 2024. https://arxiv.org/abs/2312.09979.

  • Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433, 2025.

  • Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, and Weiran Xu. Agentrefine: Enhancing agent generalization through refinement tuning, 2025. https://arxiv.org/abs/2501.01702.

  • Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding, 2024. https://arxiv.org/abs/2402.02057.

  • Huanang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, and Mengdi Wang. A survey of self-evolving agents: On path to artificial super intelligence, 2025. https://arxiv.org/abs/2507.21046.

  • In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024.

  • Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, and Mingu Lee. Vocabtrim: Vocabulary pruning for efficient speculative decoding in llms, 2025. https://arxiv.org/abs/2506.22694.

  • Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space, 2024. https://arxiv.org/abs/2412.06769.

  • Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. https://arxiv.org/abs/2103. 03874.

  • Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, and Weinan E. Memory3: Language modeling with explicit memory. Journal of Machine Learning, 3(3):300–346, January 2024. ISSN 2790-203X. doi: 10.4208/jml.240708. http://dx.doi.org/10.4208/jml.240708.

  • Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models, 2025a. https://arxiv.org/abs/2501.03262.

  • Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, and Junnan Li. Beyond’aha!’: Toward systematic meta-abilities alignment in large reasoning models. arXiv preprint arXiv:2505.10554, 2025b.

  • HuggingFace. SmolLM3: smol, multilingual, long-context reasoner — huggingface.co. https://huggingface.co/blog/ smollm3, 2025. [Accessed 23-09-2025].

  • Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. https://arxiv.org/abs/2403.07974.

  • Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation. arXiv preprint arXiv:2404.12457, 2024.

  • Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601– 1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. https://aclanthology.org/P17-1147/.

  • Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, and Zilong Zheng. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space, 2025a. https://arxiv.org/abs/2505.13308.

  • Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty, 2025b. https://arxiv.org/abs/2401.15077.

  • Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.

  • Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, and Arthur Szlam. Deliberation in latent space via differentiable cache augmentation. arXiv preprint arXiv:2412.17747, 2024.

  • Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, and Hanan Salam. Agentauditor: Human-level safety and security evaluation for llm agents, 2025a. https://arxiv.org/abs/2506.00641.

  • Michael Luo, Naman Jain, Jaskirat Singh, Sijun Tan, Ameen Patel, Qingyang Wu, Alpay Ariyak, Colin Cai, Shang Zhu Tarun Venkat, Ben Athiwaratkun, Manan Roongta, Ce Zhang, Li Erran Li, Raluca Ada Popa, Koushik Sen, and Ion Stoica. Deepswe: Training a state-of-the-art coding agent from scratch by scaling rl. https://pretty-radio-b75.notion.site/ DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33, 2025b. Notion Blog.

  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9 (Nov):2579–2605, 2008.

  • Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, 2023. https: //arxiv.org/abs/2212.10511.

  • Dan P McAdams. The psychology of life stories. Review of general psychology, 5(2):100–122, 2001.

  • Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. https://arxiv.org/abs/2310.08560.

  • Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Antonia Anna Jost, and Nikolay Koldunov. Accelerating earth science discovery via multi-agent llm systems, 2025. https://arxiv.org/abs/2503.05854.

  • Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025. https://arxiv.org/abs/2504.13958.

  • Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, and Mengdi Wang. Agentdistill: Training-free agent distillation with generalizable mcp boxes, 2025a. https://arxiv.org/abs/2506.14728.

  • Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025b. https://arxiv.org/abs/2505.20286.

  • David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. https://arxiv.org/abs/2311. 12022.

  • Shuo Ren, Pu Jian, Zhenjiang Ren, Chunlin Leng, Can Xie, and Jiajun Zhang. Towards scientific intelligence: A survey of llm-based scientific agents, 2025. https://arxiv.org/abs/2503.24047.

  • Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation, 2025. https://arxiv.org/abs/2502.21074.

  • Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning, 2021. https://arxiv.org/abs/2010. 03768.

  • Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for llms via reinforcement learning, 2025. https://arxiv.org/abs/2505.01441.

  • Peiyang Song, Pengrui Han, and Noah Goodman. A survey on large language model reasoning failures. In 2nd AI for Math Workshop @ ICML 2025, 2025. https://openreview.net/forum?id=hsgMn4KBFG.

  • Eleanor Spens and Neil Burgess. A generative model of memory construction and consolidation. Nature human behaviour, 8(3):526–543, 2024.

  • Haowen Su, Xian Li, Savannah Born, Christopher J Honey, Janice Chen, and Hongmi Lee. Neural dynamics of spontaneous memory recall and future thinking in the continuous flow of thoughts. Nature communications, 16(1): 6433, 2025a.

  • Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, and Yiqun Liu. Parametric retrieval augmented generation, 2025b. https://arxiv.org/abs/2501.15915.

  • Yuchang Sun, Yanxi Chen, Yaliang Li, and Bolin Ding. Enhancing latent computation in transformers with latent tokens, 2025. https://arxiv.org/abs/2505.12629.

  • Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, and Jonathan Richard Schwarz. Online adaptation of language models with a memory of amortized contexts, 2024. https://arxiv.org/abs/2403.04317.

  • Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, and Ruihua Song. Think silently, think fast: Dynamic latent compression of llm reasoning chains, 2025. https://arxiv.org/abs/2505.16552.

  • Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, and Wangchunshu Zhou. Agent kb: Leveraging cross-domain experience for agentic problem solving, 2025. https://arxiv.org/abs/2507.06229.

  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.

  • Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Wise: Rethinking the knowledge memory for lifelong model editing of large language models, 2024a. https://arxiv.org/abs/2405.14768.

  • Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader?, 2022. https://arxiv.org/abs/2203.07540.

  • Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. Memoryllm: Towards self-updatable large language models. arXiv preprint arXiv:2402.04624, 2024b.

  • Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory, 2025a. https://arxiv.org/abs/2502.00592.

  • Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobileagent-e: Self-evolving mobile assistant for complex tasks, 2025b. https://arxiv.org/abs/2501.11733.

  • Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024c. https://arxiv.org/abs/2409.07429.

  • Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. https://arxiv.org/abs/2201.11903.

  • Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, and Li Du. Autotir: Autonomous tools integrated reasoning via reinforcement learning, 2025. https://arxiv.org/abs/2507.21836.

  • Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.

  • Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use, 2025a. https://arxiv.org/abs/2505.19255.

  • Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, and Yong Liu. From human memory to ai memory: A survey on memory mechanisms in the era of llms, 2025b. https://arxiv.org/abs/2504.15965.

  • Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, and Zhen Xiang. How memory management impacts llm agents: An empirical study of experience-following behavior, 2025. https://arxiv.org/abs/2505.16067.

  • Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. A-mem: Agentic memory for llm agents, 2025a. https://arxiv.org/abs/2502.12110.

  • Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. Softcot: Soft chain-of-thought for efficient reasoning with llms, 2025b. https://arxiv.org/abs/2502.12134.

  • Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. Softcot++: Test-time scaling with soft chain-of-thought reasoning, 2025c. https://arxiv.org/abs/2505.11484.

  • Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding, 2025d. https://arxiv.org/abs/2503.02951.

  • Zifan Xu, Haozhu Wang, Dmitriy Bespalov, Xuan Wang, Peter Stone, and Yanjun Qi. Latent skill discovery for chain-of-thought reasoning. arXiv preprint arXiv:2312.04684, 2023.

  • An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a.

  • An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, and Bo Zheng et al. Qwen3 technical report, 2025. https://arxiv.org/abs/2505.09388.

  • John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024b. https://arxiv.org/abs/2405.15793.

  • Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Retroformer: Retrospective large language agents with policy gradient optimization, 2024. https://arxiv.org/abs/2308.02151.

  • Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Agent lumos: Unified and modular training for open-source language agents, 2024. https://arxiv.org/abs/2311.05657.

  • Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent, 2025. https://arxiv.org/abs/2507.02259.

  • Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023. https://arxiv.org/abs/2310.12823.

  • Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems, 2025a. https://arxiv.org/abs/2506.07398.

  • Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Ming Zhu, Juntao Tan, Thai Hoang, Zuxin Liu, Liangwei Yang, Yihao Feng, Shirley Kokane, Tulika Awalgaonkar, Juan Carlos Niebles, Silvio Savarese, Shelby Heinecke, Huan Wang, and Caiming Xiong. Agentohana: Design unified data and training pipeline for effective agent learning, 2024a. https://arxiv.org/abs/2402.15506.

  • Yang Zhang, Wenxin Xu, Xiaoyan Zhao, Wenjie Wang, Fuli Feng, Xiangnan He, and Tat-Seng Chua. Reinforced latent reasoning for llm-based recommendation, 2025b. https://arxiv.org/abs/2505.19092.

  • Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents, 2024b. https://arxiv.org/abs/2404.13501.

  • Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, 2024. https://arxiv.org/abs/2308.10144.

  • Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. https://arxiv.org/abs/2504.07079.

  • Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory, 2023. https://arxiv.org/abs/2305.10250.

  • Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents, 2025. https://arxiv.org/abs/2506.15841.

  • Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025.

Appendix

[Summary]

  1. This part relates MemGen to LLM decoding and reinforcement learning (RL): because MemGen dynamically generates latent tokens, it resembles speculative decoding, but it focuses on using latent states as memory carriers rather than on inference acceleration.
  2. From the RL perspective, MemGen trains the memory trigger with rule-based RL, which is related to reinforcement learning with verifiable rewards (RLVR) such as GRPO in DeepSeek-R1.
  3. While prior work has combined agent memory with RL, self-improving memory mechanisms are rarely addressed; MemAgent and MEM1, for example, concentrate on handling long-context inputs.

LLM Decoding & RL. Two additional topics that relate to our work are LLM decoding and reinforcement learning (RL). From the decoding perspective, MemGen dynamically generates and inserts latent tokens, which shares similarity with speculative decoding where a drafter model receives the current decoding context and produces subsequent drafted tokens (Cai et al., 2024; Fu et al., 2024; Li et al., 2025b; Goel et al., 2025). However, these methods primarily aim to accelerate LLM inference, whereas MemGen focuses on leveraging latent states as effective carriers of memory. From the RL perspective, MemGen employs rule-based RL to train the memory trigger, which is closely related to reinforcement learning with verifiable rewards (RLVR), including GRPO from DeepSeek-R1 (DeepSeek-AI et al., 2025) and its various derivatives (Qian et al., 2025; Wu et al., 2025a; Wei et al., 2025; Fu et al., 2025). While there exist efforts combining RL with agent memory, to our knowledge, most do not address self-improving memory; for example, MemAgent (Yu et al., 2025) and MEM1 (Zhou et al., 2025) focus on handling long-context inputs rather than evolving memory mechanisms.

B Optimization Algorithm on Memory Weaver

[Summary]

  1. This section details the optimization algorithms used to train the memory weaver.
  2. The core principle, defined in Equation (10), is to update only the memory weaver's parameters while keeping the reasoner's parameters frozen.
  3. This modularity makes MemGen compatible with a variety of optimization paradigms.
  4. Concretely, we implement Supervised Fine-Tuning (SFT) and the reinforcement learning approach Group Relative Policy Optimization (GRPO).

In this section, we provide a detailed exposition of the optimization algorithms for training the memory weaver \mathcal{W}_{\text{weaver}}, as mentioned in Section 4.3. The core principle, as established in Equation (10), is to update only the weaver’s parameters, denoted as \theta', while keeping the reasoner frozen. This modularity allows MemGen to be compatible with various optimization paradigms. We detail the specific implementations for Supervised Fine-Tuning (SFT) and a reinforcement learning approach, Group Relative Policy Optimization (GRPO).

B.1 Combining MemGen with SFT

The objective of Supervised Fine-Tuning is to train the memory weaver to generate latent memories that guide the frozen reasoner to replicate the behavior observed in a dataset of high-quality demonstration trajectories. We leverage the provided history of past experiences, \mathcal{H} = \{(x_i, \tau_i^*)\}, where each \tau_i^* is treated as an expert demonstration.

Each expert trajectory \tau_i^* consists of a sequence of states and actions, where the expert action taken in state s_{i,t} is a sequence of tokens (\mathbf{z}_{i,t,1}^*, \dots, \mathbf{z}_{i,t,L_t}^*). The goal is to maximize the conditional log-likelihood of generating these expert token sequences.

During the training of the weaver, both the reasoner and the memory trigger are held fixed. At each token generation step j where the trigger activates, the weaver takes the reasoner’s hidden states \mathbf{H}_{i,t,< j} as input and produces a latent memory \mathbf{M}_{i,t,j}. The reasoner then generates the next token conditioned on this memory. The SFT objective is to adjust the weaver’s parameters \theta' to maximize the probability of the ground-truth token \mathbf{z}_{i,t,j}^*.

Formally, the optimization problem is to minimize the negative log-likelihood of the expert trajectories, averaged over the dataset \mathcal{H}. The loss function for the weaver’s parameters \theta' is defined as:

\mathcal{L}_{SFT}(\theta') = -\mathbb{E}_{(x_i, \tau_i^*) \sim \mathcal{H}} \left[ \sum_{t=0}^{T_i - 1} \sum_{j=1}^{L_t} \log \pi_{\theta}(\mathbf{z}_{i,t,j}^* \mid s_{i,t}, \mathbf{z}_{i,t,< j}^*, \mathbf{M}_{i,t,j}) \right], \tag{11}

where the latent memory \mathbf{M}_{i,t,j} is synthesized by the weaver at that specific step:

\mathbf{M}_{i,t,j} = \mathcal{W}_{\theta'}(\mathbf{H}_{i,t,< j}). \tag{12}

Note that the generation of \mathbf{M}_{i,t,j} only occurs if the fixed trigger determines an invocation is needed at step j. In steps where no memory is invoked, the conditioning term is omitted. The gradients are computed exclusively with respect to the weaver’s parameters \theta' and used to update them via gradient descent:

\theta' \leftarrow \theta' - \eta \nabla_{\theta'} \mathcal{L}_{SFT}(\theta'), \tag{13}

where \eta is the learning rate. Through this process, the memory weaver learns to synthesize latent memories that effectively steer the frozen reasoner’s generative process to align with the training data’s behavior, thereby internalizing the knowledge from the demonstration data without corrupting the general capabilities of the core LLM.
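As a concrete illustration, the PyTorch-style sketch below performs one SFT step under this objective, updating only the weaver. The `reasoner`, `weaver`, and `trigger` interfaces are hypothetical stand-ins for the modules described above, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def weaver_sft_step(reasoner, weaver, trigger, weaver_optimizer, input_ids, hidden_states):
    """One SFT step on a single expert token sequence (Eqs. 11-13, sketch).

    Hypothetical interfaces:
      reasoner(input_ids, latent_memory=None) -> next-token logits for each prefix position
      trigger(prefix_states) -> bool, whether memory is invoked at this step
      weaver(prefix_states)  -> latent memory tensor of shape (K, d)
    """
    reasoner.eval()
    for p in reasoner.parameters():             # the reasoner stays frozen
        p.requires_grad_(False)

    nll = torch.zeros(())
    for j in range(1, input_ids.size(0)):
        prefix_states = hidden_states[:j]
        memory = weaver(prefix_states) if trigger(prefix_states) else None
        logits = reasoner(input_ids[:j], latent_memory=memory)
        # negative log-likelihood of the ground-truth next token
        nll = nll + F.cross_entropy(logits[-1].unsqueeze(0), input_ids[j].unsqueeze(0))

    weaver_optimizer.zero_grad()                # optimizer is built over weaver.parameters() only
    nll.backward()
    weaver_optimizer.step()
    return nll.item()
```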

B.2 Combining MemGen with GRPO

The memory weaver can also be trained using a reinforcement learning objective. We specifically adapt the GRPO algorithm. The training process begins by sampling a batch of tasks from the experience history \mathcal{H}. For each task x_i, we use the policy (composed of the frozen reasoner and the current memory weaver \mathcal{W}_{\theta'}) to generate a group of K distinct trajectories, denoted as \mathcal{G}_i = \{\tau_{i,1}, \dots, \tau_{i,K}\}. Each trajectory is generated by the agent’s interaction with the environment and results in a final reward R(\tau_{i,k}), evaluated by the reward function. GRPO then computes a group-relative baseline by averaging the rewards of all trajectories within the group \mathcal{G}_i:

\bar{R}(\mathcal{G}_i) = \frac{1}{K} \sum_{k=1}^K R(\tau_{i,k}). \tag{14}

The advantage for a specific trajectory \tau_{i,k} is then its reward relative to this baseline:

A(\tau_{i,k}) = R(\tau_{i,k}) - \bar{R}(\mathcal{G}_i). \tag{15}

This formulation allows the weaver to learn by differentiating between better and worse outcomes within a set of its own generations, promoting policies that produce trajectories with above-average rewards. The final objective function, maximized with respect to the weaver’s parameters \theta', is:

\mathcal{J}_{\mathrm{GRPO}}(\theta') = \mathbb{E}_{x_i \sim \mathcal{H},\, \{\tau_{i,k}\}_{k=1}^{K} \sim \pi_{\theta'}} \left[ \frac{1}{K} \sum_{k=1}^{K} A(\tau_{i,k}) \log \pi_{\theta'}(\tau_{i,k}) \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\theta'} \,\|\, \pi_{\mathrm{ref}} \right), \tag{16}

where \pi_{\mathrm{ref}} is a fixed reference policy and \beta is a coefficient controlling the strength of the KL regularization. The gradients are computed only for the weaver’s parameters \theta', thus refining its ability to synthesize impactful latent memories while preserving the integrity of the core reasoner.
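The group-relative update of Equations (14)–(15) can be sketched as follows. The `rollout` and `reward_fn` helpers are assumptions for illustration, and the surrogate below is a plain REINFORCE-style loss; the exact surrogate used in practice (e.g., with ratio clipping and the KL term weighted by the coefficient above) may differ.

```python
import torch

def grpo_weaver_step(weaver_optimizer, task, rollout, reward_fn, K=8):
    """One GRPO-style update for the memory weaver (sketch).

    Hypothetical helpers:
      rollout(task) -> (trajectory, log_prob): runs the frozen reasoner with the current
        weaver and returns the summed log-probability of the generated tokens.
      reward_fn(trajectory) -> float: final reward of a finished trajectory.
    """
    log_probs, rewards = [], []
    for _ in range(K):                          # sample a group of K trajectories
        traj, logp = rollout(task)
        log_probs.append(logp)
        rewards.append(reward_fn(traj))

    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                   # Eq. (14): group-relative baseline
    advantages = rewards - baseline             # Eq. (15): per-trajectory advantage

    # REINFORCE-style surrogate: raise log-probs of above-average trajectories.
    # A KL penalty toward the fixed reference policy would be added here.
    loss = -(advantages * torch.stack(log_probs)).mean()

    weaver_optimizer.zero_grad()
    loss.backward()                             # gradients reach only the weaver's parameters
    weaver_optimizer.step()
    return baseline.item()
```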

C Experimental Details

C.1 Training Dataset Setup

Training Datasets. We employ the official training splits of all evaluated datasets, with the sole exception of PopQA. These datasets provide the supervisory signals necessary to train both the memory weaver and the memory trigger. PopQA does not include a dedicated training set; therefore, for this benchmark, we rely on the model trained on TriviaQA, selected for its conceptual proximity, and conduct evaluation directly on PopQA to assess cross-dataset generalization.

Training the Memory Weaver. We begin by training the memory weaver independently, in the absence of a memory trigger. To expose the weaver to varied invocation contexts, we explore two insertion strategies for latent memory tokens: inserting at every punctuation boundary, or sampling a subset of these boundaries at random. This staged setup allows the weaver to learn how to synthesize latent memory from local reasoning states before any explicit decision mechanism is introduced.
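For concreteness, the small helper below illustrates the two insertion strategies; the delimiter set and sampling rate are illustrative assumptions rather than the exact values used in training.

```python
import random

PUNCTUATION = {".", ",", ";", ":", "?", "!", "\n"}  # assumed delimiter set

def latent_insertion_positions(tokens, strategy="all", sample_rate=0.5, seed=0):
    """Return token indices after which latent memory tokens are inserted
    during weaver pre-training (before any trigger exists)."""
    boundaries = [i for i, tok in enumerate(tokens) if tok in PUNCTUATION]
    if strategy == "all":            # insert at every punctuation boundary
        return boundaries
    if strategy == "random":         # sample a random subset of boundaries
        rng = random.Random(seed)
        return sorted(b for b in boundaries if rng.random() < sample_rate)
    raise ValueError(f"unknown strategy: {strategy}")

# Example: boundary positions for a short tokenized sentence.
print(latent_insertion_positions(["x", "=", "5", ",", "so", "y", "=", "6", "."], "all"))
```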

Training the Memory Trigger. Once the memory weaver is trained and fixed, we subsequently train the memory trigger. The trigger learns to detect when memory should be invoked, conditioned on the agent’s evolving cognitive state. During this phase, the trigger operates atop the frozen weaver, ensuring that its decision policy is aligned with a stable memory synthesis process.

C.2 Parameter Configurations

Table 2 Hyperparameters used in the training of MemGen.

Training (SFT): epochs = 2, warmup_ratio = 0.1, optim = adamw_torch, scheduler = cosine

Training (GRPO): rollout_batch_size = 8, train_batch_size = 8, epochs = 2, beta = 0.0, num_iterations = 1, learning_rate = 1e-5, warmup_ratio = 0.1, optim = adamw_torch, scheduler = cosine

LoRA: r = 16, lora_alpha = 32, target_modules = [q_proj, v_proj], lora_dropout = 0.1, bias = none, task_type = CAUSAL_LM

Optimization: adam_offload, flash_attn, deepspeed_enable_sleep

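Under the LoRA settings listed in Table 2, the adapter could be declared as follows with the `peft` library; whether the released code constructs it this way is an assumption.

```python
from peft import LoraConfig

# LoRA configuration matching the values in Table 2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
```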
D Extra Results

D.1 Continual Learning Result

The results in Table 4 indicate three main findings. First, MemGen exhibits stronger knowledge retention and forward transfer than SFT and ExpeL. For example, when trained on GPQA, MemGen reaches 47.96% on GSM8K and 28.80% on KodCode, surpassing SFT at 45.74% and 18.50% by margins of +2.22% and +10.30%, respectively. Similarly, when trained on KodCode, MemGen maintains 40.34% on AQuA and 20.09% on GPQA, whereas SFT yields 28.61% and 2.53%. Second, MemGen demonstrates more balanced cross-task generalization. Unlike SFT, which primarily improves performance on the most recent task (e.g., 54.10% on KodCode but only 2.53% on GPQA), MemGen achieves competitive results across tasks. After training on GSM8K, it attains 38.43% on AQuA and 21.72% on GPQA, both higher than SFT and ExpeL, suggesting that latent memory captures task-invariant reasoning. Third, MemGen effectively mitigates forgetting on earlier tasks. After sequential training on KodCode, it preserves 40.34% on AQuA compared to 27.14% of ExpeL and 28.61% of SFT, showing stronger robustness to catastrophic forgetting. These findings suggest that MemGen provides a more stable and transferable mechanism for continual learning.

D.2 Trigger Frequency Visualization

More visualizations of the trigger frequency are displayed in Figures 7 and 8, where we paired Qwen2.5-1.5B or SmolLM3-3B with the GSM8K dataset and tested on subsets of GSM8K, KodCode, and GPQA (each subset having the same number of samples). We then tallied the frequency of memory trigger INVOKE judgments at each relative percentile position in the LLM output.

Table 3 Results on Qwen2.5-1.5B. All values represent the performance metric for each task (e.g., accuracy %). We highlight the best and second best results.

| Method | ALFWorld | TriviaQA | PopQA | KodCode |  | GPQA | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 22.54 | 32.10 | 16.08 | 24.55 | 40.35 | 11.62 | 39.51 | 36.63 |
| CoT | 18.30 | 28.67 | 18.39 | 32.32 | 38.59 | 15.67 | 56.79 | 45.22 |
| SFT | 36.57 | 63.84 | 39.20 | 55.83 | 37.72 | 11.11 | 54.83 | 38.84 |
| GRPO | 43.55 | 68.21 | 43.15 | 62.11 | 70.34 | 15.65 | 68.10 | 47.42 |
| REINFORCE | 43.25 | 66.50 | 41.87 | 60.20 | 67.80 | 12.50 | 67.40 | 46.89 |
| REINFORCE++ | 43.66 | 66.90 | 44.69 | 63.33 | 69.50 | 13.80 | 69.04 | 47.33 |
| Agent-FLAN | 35.80 | 64.28 | 38.90 | 56.21 | 43.83 | 9.35 | 53.02 | 29.82 |
| ExpeL | 28.96 | 25.20 | 20.20 | 31.15 | 39.78 | 8.12 | 45.12 | 38.12 |
| MemoryBank | 27.89 | 38.14 | 22.78 | 37.93 | 35.87 | 13.87 | 47.88 | 30.47 |
| AWM | 30.42 | 55.69 | 32.54 | - | - | - | - | - |
| SoftCoT | 33.07 | 62.22 | 38.78 | 55.13 | 36.10 | 9.31 | 54.50 | 38.55 |
| Co-processor | 35.66 | 64.78 | 40.12 | 56.65 | 38.10 | 12.12 | 57.12 | 37.40 |
| MemGen SFT | 40.30 | 65.02 | 41.28 | 58.16 | 42.47 | 18.28 | 58.15 | 47.12 |
| MemGen GRPO | 54.27 | 73.42 | 49.28 | 65.43 | 72.81 | 18.18 | 73.39 | 53.36 |
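A minimal sketch of how the INVOKE-position histogram described above can be tallied, assuming per-sample lists of trigger positions and output lengths (the data layout is an assumption):

```python
import numpy as np

def invoke_percentile_histogram(invoke_positions, output_lengths, num_bins=10):
    """Tally INVOKE decisions by their relative position in each output.

    invoke_positions: list of lists, token indices where the trigger fired per sample
    output_lengths:   list of total output lengths per sample
    """
    counts = np.zeros(num_bins, dtype=int)
    for positions, length in zip(invoke_positions, output_lengths):
        for pos in positions:
            rel = pos / max(length, 1)                  # relative percentile position
            counts[min(int(rel * num_bins), num_bins - 1)] += 1
    return counts

# Example: two outputs with trigger firings at different depths.
print(invoke_percentile_histogram([[3, 40], [10]], [50, 20]))
```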

Figure 7 Memory invocation frequency across benchmarks at inference (trained on MemGen SFT+Qwen2.5-1.5B+GSM8K).

Figure 8 Memory invocation frequency across benchmarks at inference (trained on MemGen ).

Figure 9 The generalization study of MemGen. We train MemGen SFT on GSM8K and evaluate it on all four datasets.

Figure 10 The generalization study of MemGen. We train MemGen SFT on KodCode and evaluate it on all four datasets.

Table 4 Continual learning results of Qwen2.5-1.5B-Instruct across four reasoning and programming datasets (AQuA, GPQA, GSM8K, KodCode). The model is sequentially trained on each dataset (AQuA → GPQA → GSM8K → KodCode), and after each training stage, evaluation is conducted on all four benchmarks.

| Trained On | Method | AQuA | GPQA | GSM8K | KodCode |
|---|---|---|---|---|---|
| – | Vanilla | 41.34 | 11.62 | 39.51 | 24.55 |
| AQuA | SFT | 42.52 | 16.67 | 42.10 | 18.20 |
| AQuA | ExpeL | 41.73 | 12.67 | 40.16 | 16.30 |
| AQuA | MemGen SFT | 43.31 | 19.70 | 39.80 | 19.55 |
| GPQA | SFT | 38.55 | 17.17 | 45.74 | 18.50 |
| GPQA | ExpeL | 37.24 | 14.35 | 42.67 | 15.20 |
| GPQA | MemGen SFT | 39.85 | 20.72 | 47.96 | 28.80 |
| GSM8K | SFT | 33.46 | 13.13 | 52.31 | 19.45 |
| GSM8K | ExpeL | 34.89 | 12.42 | 48.78 | 13.65 |
| GSM8K | MemGen SFT | 38.43 | 21.72 | 55.67 | 19.75 |
| KodCode | SFT | 28.61 | 2.53 | 24.14 | 54.10 |
| KodCode | ExpeL | 27.14 | 6.23 | 31.44 | 48.35 |
| KodCode | MemGen SFT | 40.34 | 20.09 | 53.72 | 52.95 |

D.3 Framework Analysis

D.3.1 Ablation Study

To assess the effectiveness of the proposed memory trigger, we conduct an ablation study comparing different memory invocation strategies, as summarized in Table 5. Specifically, Random denotes a naïve baseline where latent memory tokens are inserted at arbitrary token positions with a fixed probability p. All delimiters activated represents a sentence-level strategy that invokes the memory weaver at every delimiter position without any selection mechanism. Finally, MemGen’s dedicated Trigger corresponds to our standard approach, where the memory weaver is activated by the trained trigger \mathcal{T}_{\text{trigger}}. The results reveal several key observations. First, sentence-level intervention already improves performance compared to random invocation. For instance, activating the weaver at all delimiters yields 17.34%, 56.20%, and 64.15% on GPQA, KodCode, and TriviaQA, respectively, consistently outperforming all random baselines (e.g., p = 0.5 achieves only 16.66%, 52.95%, and 57.28%). This highlights the importance of aligning memory injection with semantic boundaries rather than distributing it across the sequence. More importantly, our trained trigger achieves the best overall performance, reaching 18.28%, 58.16%, and 65.02% on the three benchmarks. This demonstrates that selective activation, i.e., deciding when and where to weave in memory, provides the most effective support for reasoning, as it balances memory utility and interference more precisely than coarse-grained alternatives.

Table 5 Ablation study of different memory invocation strategies. Random denotes a naïve baseline where latent memory tokens are inserted at arbitrary token positions with a fixed probability p. All delimiters activated represents a sentence-level strategy that invokes the memory weaver at every delimiter position without any selection mechanism. Finally, MemGen’s dedicated Trigger corresponds to our standard approach, where the memory weaver is activated by the trained trigger \mathcal{T}_{\text{trigger}}.

| Strategy | GPQA | KodCode | TriviaQA |
|---|---|---|---|
| Random (p = 0.2) | 15.66 | 54.55 | 63.55 |
| Random (p = 0.5) | 16.66 | 52.95 | 57.28 |
| Random (p = 0.8) | 12.63 | 53.60 | 62.22 |
| All delimiters activated | 17.34 | 56.20 | 64.15 |
| MemGen’s dedicated Trigger | 18.28 | 58.16 | 65.02 |

D.3.2 Analysis of Memory Weaver

We further investigate how the parameterization of the memory weaver influences downstream performance. As shown in Table 6, increasing the number of trainable parameters by adopting a full-parameter SFT setup indeed enhances the weaver’s memory and learning capacity, leading to higher task performance across benchmarks. Nevertheless, the LoRA-based instantiation already achieves competitive results while retaining high parameter efficiency, demonstrating that even lightweight adaptations can endow the weaver with sufficient capacity to generate effective latent memories.

Table 6 Ablation study of the latent weaver. We instantiate MemGen SFT’s memory weaver with two strategies: LoRA and full-parameter SFT.

| Base LLM: Qwen2.5-1.5B-Instruct | GPQA | KodCode | TriviaQA |
|---|---|---|---|
| LoRA (r = 16, α = 32) | 18.28 | 58.16 | 65.02 |
| Full SFT | 21.21 | 60.00 | 67.10 |

D.3.3 Efficiency Analysis

Table 7 reports average inference time and task performance across three benchmarks. Both SFT and MemGen drastically reduce per-task inference time compared with vanilla models due to fewer tokens required to reach correct answers. For instance, on KodCode with Qwen2.5-1.5B, MemGen SFT completes tasks in 2.94 s, a 75.4% reduction from 11.96 s for vanilla, while improving accuracy by 33.61%. On ALFWorld+Qwen3-8B, MemGen SFT adds only 1.6% more time compared with SFT (20.08 s vs 19.76 s) but increases accuracy from 83.59% to 85.82%. These results confirm that MemGen efficiently enhances reasoning performance without incurring significant inference delays.

Table 7 Average per-task inference time (seconds) and task performance (accuracy %) across three benchmarks. Performance improvement (%) of MemGen SFT over vanilla LLM is also reported.

| Model & Method | KodCode Time (s) | KodCode Acc (%) | ALFWorld Time (s) | ALFWorld Acc (%) | TriviaQA Time (s) | TriviaQA Acc (%) |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B |  |  |  |  |  |  |
| Vanilla | 11.96 | 24.55 | 21.17 | 22.54 | 2.18 | 32.10 |
| SFT | 2.01 | 55.83 | 10.79 | 36.57 | 1.98 | 63.84 |
| MemGen SFT | 2.94 | 58.16 | 12.94 | 40.30 | 2.05 | 65.02 |
| Improvement over Vanilla | – | +33.61 | – | +17.76 | – | +32.92 |
| SmolLM3-3B |  |  |  |  |  |  |
| Vanilla | 13.12 | 37.05 | 34.82 | 18.96 | 4.26 | 10.47 |
| SFT | 3.04 | 59.25 | 12.88 | 32.36 | 3.05 | 55.25 |
| MemGen SFT | 3.48 | 62.65 | 14.69 | 50.60 | 3.16 | 68.13 |
| Improvement over Vanilla | – | +25.60 | – | +31.64 | – | +57.66 |
| Qwen3-8B |  |  |  |  |  |  |
| Vanilla | 16.99 | 49.10 | 55.42 | 58.93 | 8.70 | 52.18 |
| SFT | 7.24 | 64.75 | 19.76 | 83.59 | 5.98 | 74.55 |
| MemGen SFT | 7.56 | 66.15 | 20.08 | 85.82 | 6.25 | 77.22 |
| Improvement over Vanilla | – | +17.05 | – | +26.89 | – | +25.04 |

E Integration with Retrieval-based Memory

E.1 Formalizing the Integration Process

While the primary mechanism of MemGen leverages the parametric knowledge encapsulated within the memory weaver \mathcal{W}_{\text{weaver}}, the framework is designed to be extensible, allowing for seamless integration with external, retrieval-based memory systems. This hybrid approach enables the weaver to synthesize latent memories that are informed by both its internalized experiences and a vast corpus of external information, thereby providing a richer and more comprehensive context to the reasoner. Let \mathcal{M}_{\text{ext}} denote an external memory database, and let \mathcal{R} be a retrieval function that, given a natural language query, returns a set of relevant textual memory snippets. When the memory trigger determines an invocation is necessary at the token-generation step j, the natural language text generated thus far serves as the query for the external memory system. This query, denoted as q_{t,j}, is produced by decoding the sequence of tokens generated up to that point:

q_{t,j} = \text{Decode}(\mathbf{z}_{t,< j}). \tag{17}

The retrieval process is then formalized as:

C_t = \mathcal{R}(q_{t,j}; \mathcal{M}_{\text{ext}}), \tag{18}

where C_t is a set of P retrieved textual snippets. These snippets are subsequently encoded into a sequence of embeddings \mathbf{E}_t, whose length matches the total length of the encoded text. This allows the weaver to process the retrieved information in its native latent space.

This retrieved information is subsequently merged with the reasoner’s internal cognitive state \mathbf{H}_{t,< j}. The combined context is then fed into the memory weaver to produce the final latent memory. The invocation of the weaver, as described in Equation (5), is thus modified to:

\mathbf{M}_{t} = \mathcal{W}_{\text{weaver}}([\mathbf{H}_{t,< j}; \mathbf{E}_{t}]), \tag{19}

where [·; ·] denotes the concatenation of the hidden state sequences. This integrated process allows \mathcal{W}_{\text{weaver}} to reconstruct both internal parametric knowledge and externally retrieved information into a compact, potent latent memory \mathbf{M}_t for the reasoner.
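A sketch of this hybrid invocation path (Equations (17)–(19)); `tokenizer`, `retriever`, `encode_snippets`, and `weaver` are hypothetical stand-ins for the corresponding components:

```python
import torch

def invoke_weaver_with_retrieval(hidden_states, generated_ids, tokenizer,
                                 retriever, encode_snippets, weaver, top_p=3):
    """Hybrid memory invocation: internal states plus externally retrieved snippets.

    hidden_states: (j-1, d) reasoner states H_{t,<j}
    generated_ids: token ids z_{t,<j} produced so far
    """
    query = tokenizer.decode(generated_ids)                       # Eq. (17): decode the prefix
    snippets = retriever(query, top_k=top_p)                      # Eq. (18): retrieve P snippets
    snippet_embeds = encode_snippets(snippets)                    # (L_c, d) latent-space encoding
    combined = torch.cat([hidden_states, snippet_embeds], dim=0)  # [H_{t,<j}; E_t]
    return weaver(combined)                                       # Eq. (19): latent memory M_t
```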

E.2 Experimental Results

As shown in Table 8, even when MemGen’s own parametric memory is disabled (which means that merely the retrieved textual snippets are fed into \mathcal{W}_{\text{weaver}}), MemGen significantly enhances the retrieval baseline, boosting performance on ALFWorld from 36.18% to 45.60% and on PopQA from 28.16% to 39.50%. This demonstrates that MemGen serves as a powerful synthesizer, not merely appending but actively reconstructing externally retrieved information into a more potent, task-relevant latent memory, thereby underscoring the necessity of a generative memory mechanism. Most notably, when the weaver’s internal parametric memory is enabled alongside external retrieval, performance is dramatically amplified, achieving scores of 75.90% on ALFWorld and 76.40% on TriviaQA. This highlights the powerful synergy achieved when the memory weaver is empowered to dynamically integrate its own internalized experiences with external textual knowledge.

Table 8 Performance results when integrating MemGen with a retrieval-based memory baseline, ExpeL. The backbone LLM is set as SmolLM3-3B.

| Method | ALFWorld | TriviaQA | PopQA |
|---|---|---|---|
| Vanilla LLM | 18.96 | 10.47 | 8.23 |
| ExpeL | 36.18 | 46.20 | 28.16 |
| MemGen + ExpeL (w/o parametric memory) | 45.60 | 53.20 | 39.50 |
| MemGen + ExpeL (w/ parametric memory) | 75.90 | 76.40 | 60.23 |

F Latent Memory Token Demonstration

Cases from TriviaQA

Case 1

system prompt

Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search> and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer>, without detailed illustrations. For example, <answer> Beijing </answer>.

user prompt

Which Lloyd Webber musical premiered in the US on 10th December 1993?

assistant output

[UPPORT…‘,eniable certif] I need to find out which Lloyd Webber musical premiered in the US on 10th December 1993. [essengeryyyyMMddELCOME certif] Lloyd Webber musical premiered in the US on 10th December 1993

user prompt

Doc 1(Title: “Andrew Lloyd Webber”) opened at the Adelphi Theatre in London on 12 July 1993, and ran for 1,529 performances …

assistant output

[essengeryyyyMMddELCOME certif] The search results indicate that “Sunset Boulevard” is the Lloyd Webber musical that premiered in the US on 10th December 1993.

[UPPORT…‘,eniable certif] Sunset Boulevard

Case 2

system prompt

Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search> and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer>, without detailed illustrations. For example, <answer> Beijing </answer>.

user prompt

Who had a 70s No 1 hit with Kiss You All Over?

assistant output

[UPPORT…‘,eniable certif] I need to search who had a 70s No 1 hit with the song “Kiss You All Over.” [essengeryyyyMMddELCOME certif] who had a 70s No 1 hit with Kiss You All Over?

user prompt

Doc 1(Title: “Kiss You All Over”) Exile’s only big hit in the pop rock market (they would later have great success on the country music charts)…

assistant output

[essengerUPPORT’options certif] The song “Kiss You All Over” was a 70s No 1 hit with the band Exile. [UPPORT…‘,eniable certif] Exile

Case 3

system prompt

Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search> and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer>, without detailed illustrations. For example, <answer> Beijing </answer>.

user prompt

What was the name of Michael Jackson’s autobiography written in 1988?

assistant output

[UPPORTUPPORTeniable certif] I need to search for the name of Michael Jackson’s autobiography written in 1988. [UPPORT…‘,eniable certif] Michael Jackson autobiography 1988

user prompt

Doc 1(Title: “Michael Jackson”) of 4.4 million people. In 1988, Jackson released his only autobiography, ""Moonwalk "", which took four years to complete and sold 200,000 copies…

assistant output

[cmsUPPORT’options Certif] The name of Michael Jackson’s autobiography written in 1988 is “Moonwalk”. [UPPORTUPPORTeniable certif] Moonwalk

Case 4

system prompt

Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search> and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer>, without detailed illustrations. For example, <answer> Beijing </answer>.

user prompt

Which volcano in Tanzania is the highest mountain in Africa?

assistant output

[UPPORT…‘,eniable certif] I need to search for the highest mountain in Africa, which is located in Tanzania. [UPPORTUPPORTeniable certif] highest mountain in Africa in Tanzania

user prompt

Doc 1(Title: “Mount Kilimanjaro”) Mount Kilimanjaro Mount Kilimanjaro or just Kilimanjaro (), with its three volcanic cones, ""Kibo"", ""Mawenzi"", and ""Shira"", is a dormant volcano in Tanzania…

assistant output

[essengeryyyyMMddeniable certif] The highest mountain in Africa located in Tanzania is Mount Kilimanjaro. [UPPORT…‘,eniable certif] Mount Kilimanjaro

Cases from KodCode

Case 1

MODEL INPUT:

Write an efficient and correct Python function to solve the following problem.Question: Write a function that takes a list of integers and returns a new list containing only the even integers from the original list, sorted in non-decreasing order. The list can contain negative numbers and is not necessarily sorted. If the input list is empty or contains no even integers, the function should return an empty list.

MODEL OUTPUT:

[keyword-kindërgetAs-slide]def even_sorted(lst):

[.keyword_pick] """

[LTRetical] Returns a new list containing only the even integers from the original list,

sorted in non-decreasing order.

[.keyword_pick] """

[LTRetical] even_numbers = [num for num in lst if num % 2 == 0]

[.keyword_pick] even_numbers.sort()

[.keywordetical] return even_numbers<|im_end|>

Case 2
MODEL INPUT:
Write an efficient and correct Python function to solve the following problem.Question: Write a python function to find the first
      maximum and the second maximum number in an array. If the array has less than two unique elements, return 'None' for the
      missing values.
MODEL OUTPUT:
[keyword-kindërgetAs-slide]def find_first_max_second_max(arr):
    """
[.keywordetical] Returns the first and second maximum numbers in the array.
    If the array has less than two unique elements,[.keyword_pick] returns None for the missing values.
[.keyword_pick] """
    if len(arr) < 2:
[.keyword_pick] return None,[.keyword_pick] None
[.keywordetical] unique_elements = set(arr)
[.keyword_pick] if len(unique_elements) < 2:
[.keywordetical] return None,[.keyword_pick] None
[.keywordetical] first_max = max(unique_elements)
    second_max = max(unique_elements, default=None)
    return first_max, second_max<|im_end|>
Case 3
MODEL INPUT:
Write an efficient and correct Python function to solve the following problem.Question: Given a list of integers, write a function that
      returns the length of the longest contiguous subarray where all elements have the same value. The function should also return
      the value of the elements in this subarray. Consider the output to be a tuple where the first element is the length of the longest
      subarray and the second element is the value of the elements in the subarray.
For example, consider the following list as input,
numbers = [1, 2, 2, 3, 3, 3, 2, 2, 2, 2, 1, 1]
Expected Output: (4, 2)
MODEL OUTPUT:
[kindgetAs?-kindërgetAs-slide]def longest_subarray_with_same_value(numbers):
    """
[.keyword_pick] Returns the length of the longest contiguous subarray where all elements have the same value.
[.keyword_pick] The function also returns the value of the elements in this subarray.
[LTR_pick] """
[LTRetical] if not numbers:
[.keyword_pick] return (0, None)
    max_length = 1
[.keyword_pick] current_length = 1
    max_value = numbers[0]
    current_value = numbers[0]
[.keyword_pick] for i in range(1,[LTR_pick] len(numbers)):
[LTR_pick] if numbers[i] == current_value:
            current_length += 1
        else:
            if current_length > max_length:
[LTR_pick] max_length = current_length
                max_value = current_value
            current_length = 1
            current_value = numbers[i]
    if current_length > max_length:
        max_length = current_length
        max_value = current_value
    return (max_length, max_value)<|im_end|>

Cases from GSM8K

Case 1

MODEL INPUT:

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}. Question: Billy sells DVDs. He has 8 customers on Tuesday. His first 3 customers buy one DVD each. His next 2 customers buy 2 DVDs each. His last 3 customers don’t buy any DVDs. How many DVDs did Billy sell on Tuesday?

MODEL OUTPUT:

[ even. s. even even ... even] Billy's first 3 customers buy 3 * 1 = <<3*1=3>>3 DVDs.
[-gallery.]His next 2 customers buy 2 * 2 = <<2*2=4>>4 DVDs.
[-routing.]Billy sells 3 + 4 = <<3+4=7>>7 DVDs on Tuesday.\boxed{7}<|im_end|>
Case 2
MODEL INPUT:
Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.
Question: Henry made two stops during his 60-mile bike trip. He first stopped after 20 miles. His second stop was 15 miles before the
     end of the trip. How many miles did he travel between his first and second stops?
MODEL OUTPUT:

[… a even even even… even ]Henry traveled 60-15=<<60-15=45>>45 miles before his second stop.

[-android.]He traveled 20 miles before his first stop,[ecies.] so he traveled 45-20=<<45-20=25>>25 miles between his first and second stops.\boxed{25}<|im_end|>

[ even.]The final answer is 25<|im_end|>

Case 3

MODEL INPUT:

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: A basket contains 25 oranges among which 1 is bad, 20% are unripe, 2 are sour and the rest are good. How many oranges are good?

MODEL OUTPUT:

[. even.. pre. even even even… even ]The number of unripe oranges is 20/100*25 = <<20/100*25=5>>5

[-animate.]The number of sour oranges is 2

[-login.]The number of bad oranges is 1

[-login.]The number of good oranges is 25 - 5 - 2 - 1 = <<25-5-2-1=17>>17\boxed{17}<|im_end|>

Cases from GSM8K Cluster

Cluster 1

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?

[… pre. even even even… even ]The house was worth $80

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?

[… out. even even even.. even even ]He runs 3*60=<<3*60=180>>180 meters a day

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?

[… pre… even… even ]The first 16 glasses cost 16 * 5 = $<<16*5=80>>80

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. How many sheep do Toulouse, Charleston, and Seattle have together if Seattle has 20 sheep?

[. even even. s. even even even.. even even ]Charleston has 4*20=<<4*20=80>>80 sheep.

Solve the math problem with proper reasoning, and make sure to put the FINAL ANSWER inside \boxed{}.Question: John drives for 3 hours at a speed of 60 mph and then turns around because he realizes he forgot something very important at home. He tries to get home in 4 hours but spends the first 2 hours in standstill traffic. He spends the next half-hour driving at a speed of 30 mph, before being able to drive the remaining time of the 4 hours going at 80 mph. How far is he from home at the end of those 4 hours?

[… even even even… even ]He drives 60*3 = <<60*3=180>>180 miles in the first 3 hours.

Cluster 2

Janet’s ducks lay 16 eggs per day,[-animate.] and she eats 3 eggs for breakfast every morning

Wendi feeds her 20 chickens 15 cups of feed in the morning,[-animate.] 25 cups in the afternoon

She sold half of what was left at the orange house,[-animate.] she had 1/2 left.

x = 162/18 = <<162/18=9>>9

[-animate.]Allen’s age is 11x = 11*9 = <<11*9=99>>99 years old now.

He runs 20 miles a day because 60/3=<<60/3=20>>20

[-animate.]He runs 10 miles the other two days because 20/2=<<20/2=10>>10

Cluster 3

It takes 2/2=<<2/2=1>>1 bolt of white fiber.

[ecies.]So it takes 2+1=<<2+1=3>>3 bolts in total.

The total cost of the five cell phones is $150 x 5 = $<<150*5=750>>750.

[ even.]The interest for each unit is $150 x 2/100 = $<<150*2/100=3>>3.

The second level is 16*2=<<16*2=32>>32 square footage.
[ even.]The third level is 32*2=<<32*2=64>>64 square footage.
20% of 20 students is 20 x 20/100 = <<20*20/100=4>>4 students.
[ecies.]The remaining students are 20 - 4 = <<20-4=16>>16 students.
If Artie rounds the price of petunias to the nearest dollar,[-animateecies] he will sell them for $2 each.
Cluster 4
The groomer offers a 30% discount,[-loginecies] so the cost of the grooming is 100 - (100 * 0.[ecies.]30) = $<<100-30=70>>70
If Dana skips at 3 miles per hour,[-login.] then she runs at 3 * 4 = <<3*4=12>>12 miles per hour.
There were 140 * 2 = <<140*2=280>>280 emeralds in the chest.
[-login.]There were 175 + 140 + 280 = <<175+140+280=595>>595 gems in the chest.
There are 30+60 = <<30+60=90>>90 students at Dr.[-loginecies.] Wertz's school.
Cecilia will feed her puppy 1 cup of dog food every day for 180 days,[-login.] so she will use 180 cups of dog food in the first 180 days.

Figure 11 (Up) t-SNE visualization of latent memories generated by MemGen +SmolLM3-3B across datasets; (Down) Latent memory visualization within the TriviaQA and GSM8K datasets, clustered using K-means.

G Memory Functional Study

G.1 Visualization Process

This section details the methodology employed to visualize the geometric structure of the latent memory (as shown in Figures 5 and 11) and to identify semantically distinct clusters within it. The process involves two primary stages: obtaining a fixed-dimensional representation for each memory sequence and applying dimensionality reduction and clustering algorithms.

Sequence Representation. Let \{\mathbf{M}_q\}_{q=1}^{Q} be a corpus of Q latent memory sequences collected from MemGen’s inference, where each sequence \mathbf{M}_q = [\mathbf{m}_{q,1}, \dots, \mathbf{m}_{q,K}] consists of K token embeddings of dimension d. To obtain a single, holistic vector representation for each sequence, we compute its mean embedding \bar{\mathbf{m}}_q:

\bar{\mathbf{m}}_q = \frac{1}{K} \sum_{k=1}^{K} \mathbf{m}_{q,k}. \tag{20}

This procedure yields a set of high-dimensional representations \{\bar{\mathbf{m}}_q\}_{q=1}^{Q}, which serves as the input for the subsequent analysis.

For visualization, we employed t-SNE (Maaten and Hinton, 2008) to project the high-dimensional set into a two-dimensional space, mapping each \bar{\mathbf{m}}_q \in \mathbb{R}^{d} to a point in \mathbb{R}^{2}. These 2D data points are then utilized for the scatter plot.

To quantitatively identify distinct functional groups, we applied the K-means algorithm directly to the high-dimensional mean embeddings in \{\bar{\mathbf{m}}_q\}_{q=1}^{Q}. This procedure partitions the memory representations into N discrete clusters (we set N = 4).
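The procedure above amounts to a few lines with scikit-learn, assuming the collected memories are stacked into an array of shape (Q, K, d):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def visualize_latent_memories(memory_sequences, n_clusters=4, seed=0):
    """memory_sequences: (Q, K, d) array of latent memory token embeddings."""
    mean_embeddings = memory_sequences.mean(axis=1)        # Eq. (20): sequence-level mean, (Q, d)

    # 2D projection used for the scatter plot.
    points_2d = TSNE(n_components=2, random_state=seed).fit_transform(mean_embeddings)

    # Functional clusters are found in the original high-dimensional space.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(mean_embeddings)
    return points_2d, labels
```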

G.2 Failure Taxonomy Definitions

To systematically analyze the functional impact of different memory clusters, we adopt and refine the failure taxonomy proposed by Song et al. (2025). Below, we provide precise definitions of each failure category in the context of LLM-based agent behavior, which guide our annotation and evaluation process:

  • Planning Failure. This category refers to errors in high-level task decomposition and strategic decision-making. The agent either formulates subgoals that do not align with the original objective, fails to plan intermediate steps necessary for task completion, or misorders the reasoning sequence, leading to suboptimal or incomplete solutions.

  • Compositional Reasoning. This type of failure occurs when the agent struggles to integrate multiple pieces of information or reasoning steps into a coherent solution.

  • Tool Parsing Error. This failure occurs when the agent’s generated output cannot be parsed into a valid tool call. Typical patterns include malformed function signatures, missing required arguments, or unrequired arguments in the invocation, which prevent the external tool from being executed as intended.

  • Tool Response Error. This category refers to errors that arise after a tool has been successfully invoked. Either the tool itself returns incorrect or incomplete information, or the agent misuses or misinterprets the returned content.

  • Answer Formatting Failure. This category includes errors in the final presentation or structuring of the output, despite the reasoning process being largely correct. Examples include incorrect output format (e.g., failing to enclose the answer within \boxed{}), violation of task-specific response templates, or missing required components in the final answer.

  • Demand Misunderstanding. This failure indicates that the agent has misinterpreted the user’s intent or the task specification. It may solve a different problem from the one posed, ignore key constraints, or pursue irrelevant objectives due to misunderstanding the instruction semantics.

  • Think-Act Inconsistency. This type refers to a mismatch between the agent’s internal reasoning and its subsequent action in ReAct mode. The agent may articulate a correct reasoning chain but execute a contradictory action or produce a final answer inconsistent with its prior deliberation.

  • False Belief. This category captures cases where the agent maintains and reasons with inaccurate assumptions about the external environment, user state, or task context.

G.3 Annotating Failure Modes and Filtering Latent Memory

Annotation of Failure Modes. We manually annotated agent failures on the TriviaQA dataset based on the eight failure modes discussed above. Each trajectory generated by the agent was evaluated by human annotators who assigned only one failure label if the agent’s behavior deviated from a successful path. Trajectories exhibiting failures that did not fall into these predefined categories were excluded from this specific analysis.

Inference-time Filtering of Latent Memory Clusters. The core of our intervention study involved selectively removing the influence of a specific latent memory cluster during the agent’s reasoning process. Our methodology is designed to be consistent with the clustering process itself, which operates on sequence-level representations.

Let \{\mathcal{C}_1, \dots, \mathcal{C}_N\} be the set of N latent memory clusters. These clusters were derived by applying K-means to a collection of historical memory representations, where each representation is the mean embedding of an entire latent memory sequence. For each cluster \mathcal{C}_n, we compute its centroid \mathbf{c}_n by averaging these historical sequence representations within the cluster.

During inference, when the memory weaver synthesizes a new latent memory sequence \mathbf{M}_t, we first compute its single-vector representation, \bar{\mathbf{m}}_t, by averaging its token embeddings:

\bar{\mathbf{m}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{m}_{t,k}. \tag{21}

To determine the semantic affiliation of this new sequence, we compare its mean embedding against a comprehensive reference set that contains the cluster centroids alongside the rows of the LLM’s vocabulary embedding matrix. We then identify the set of top-k nearest neighbors to \bar{\mathbf{m}}_t based on cosine similarity, denoted as \mathcal{N}_k(\bar{\mathbf{m}}_t). In our experiments, we set k = 10.

To ablate the influence of a target cluster \mathcal{C}_n, the entire latent memory sequence \mathbf{M}_t is filtered (i.e., discarded and not prepended to the reasoner’s context) if the centroid of that cluster, \mathbf{c}_n, is present within this top-k set. Formally, \mathbf{M}_t is filtered if:

\mathbf{c}_n \in \mathcal{N}_k(\bar{\mathbf{m}}_t). \tag{22}

This sequence-level filtering allows us to precisely ablate the contribution of a specific learned memory function and observe its impact on agent behavior.
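A sketch of the sequence-level filtering rule in Equation (22), assuming the reference set is stacked into a matrix whose rows include the cluster centroids (array shapes and names are assumptions):

```python
import numpy as np

def should_filter(memory_sequence, reference_set, target_index, k=10):
    """Decide whether to discard a newly synthesized latent memory sequence.

    memory_sequence: (K, d) latent memory token embeddings
    reference_set:   (R, d) vocabulary embeddings plus all cluster centroids
    target_index:    row index of the ablated cluster's centroid within reference_set
    """
    mean_emb = memory_sequence.mean(axis=0)                                   # Eq. (21)
    sims = reference_set @ mean_emb / (
        np.linalg.norm(reference_set, axis=1) * np.linalg.norm(mean_emb) + 1e-8
    )
    top_k = np.argsort(-sims)[:k]                                             # top-k cosine neighbors
    return target_index in top_k                                              # Eq. (22)
```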