ExpWeaver: How LLM Agents Learn from Past Experience in Latent Space

Review date: 2026-06-08
Review author: Zhongzhu Zhou
Paper reviewed: ExpWeaver: LLM Agents Learn from Experience via Latent RAG
Paper authors: Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
arXiv: 2606.01041v1, 2026-05-31
Venue/status: ICML 2026 (Proceedings of the 43rd International Conference on Machine Learning)

Short Answer

ExpWeaver is a framework for enabling LLM agents to improve over time by learning from their own past interactions — without paying the token-overhead penalty that plagues text-based Retrieval-Augmented Generation (RAG). The core idea is strikingly elegant: instead of retrieving past experiences as text and concatenating them into the context window (which consumes tokens and requires a separate retriever), ExpWeaver stores every experience as a dense embedding in the LLM’s own hidden-state space, and then at each autoregressive decoding step, the model looks up the most relevant past experiences directly in that latent space and integrates them via a learned cross-attention mechanism with a gated residual. The whole pipeline — experience encoding, latent retrieval, cross-attention aggregation, gated integration, and generation — is trained end-to-end with reinforcement learning using GRPO.

The practical result is remarkable: on the 13-task Experience-driven Benchmark (ExpBench), ExpWeaver achieves state-of-the-art on 12 tasks, outperforming the strongest baseline (Search-R1) by over 6.8% average. Token consumption is comparable to non-retrieval baselines — a 1.5–2× improvement over text-concatenation methods. Perhaps most impressive is the cross-domain generalization: in a zero-shot transfer experiment to pharmaceutical chemistry (Chem-TDC), ExpWeaver exceeds the best text-based retrieval baseline by over 8%, suggesting that latent experience embeddings capture transferable problem-solving strategies rather than surface-level textual patterns.

The critical limitation I see — and I will expand on this in the Critical Analysis section — is the narrow evaluation scope: only Qwen2.5-3B-Instruct was tested, the ranking benchmark uses two datasets of similar type, and variance across seeds is not reported. These gaps make it hard to assess whether the results would hold at scale or in more diverse agent scenarios.

1. Prerequisites

Before diving into the mechanics of ExpWeaver, let me lay out the background knowledge needed to follow the paper.

1.1 LLM Agents and the Experience Learning Paradigm

A language model agent is an LLM that, given a query $x$, produces not only an answer $y$ but also a reasoning trace $z$: a step-by-step chain of thought, tool calls, or subgoal decomposition. The agent interacts with an environment (web browser, code interpreter, search engine, database) and receives reward or feedback $r(x, y, z)$ after each episode.

Experience learning refers to the idea that an agent’s past rollouts are valuable knowledge: if the agent solved a problem correctly before, the reasoning strategy it used — the sequence of subgoals, the tools it invoked, the mistakes it corrected — can inform future decisions on similar problems.

The natural way to implement this is retrieval-augmented generation: at inference time, retrieve similar past experiences from a memory bank and prepend them to the context. The agent can then pattern-match on the retrieved examples to improve its answer. This is conceptually similar to in-context learning (few-shot prompting), but the examples come from the agent’s own history rather than a fixed hand-curated set.

1.2 Retrieval-Augmented Generation (RAG) Basics

Standard RAG works as follows:

                   ┌──────────────────────────────┐
                   │  Query x at inference time   │
                   └──────────────┬───────────────┘

                    embed with retriever encoder

                   ┌──────────────▼───────────────┐
                   │  Experience Memory Bank (text)│
                   │   e1, e2, ..., eN (text docs) │
                   └──────────────┬───────────────┘

                    retrieve top-K by cosine sim

                   ┌──────────────▼───────────────┐
                   │  Concatenate retrieved docs  │
                   │  into context window         │
                   └──────────────┬───────────────┘

                   ┌──────────────▼───────────────┐
                   │  LLM generates response y    │
                   └──────────────────────────────┘

The problems with this approach for experience learning are:

  1. Token overhead. Retrieved experiences can be hundreds to thousands of tokens each. Retrieving K=3 experiences with 300 tokens each adds 900 tokens to the context of every generation, and memory and compute cost grow at least linearly with that added context.
  2. Decoupled architecture. The retriever (typically a separate encoder model) is trained independently from the generator. This separation prevents joint optimization — the retriever cannot learn which embeddings are most useful for generation, and the generator cannot signal back to the retriever what it actually needed.
  3. On-policy misalignment. As the generator LLM is fine-tuned, the distribution of its outputs drifts. But text-based experiences stored from earlier policy checkpoints no longer have the same semantics relative to the evolved policy. A retriever based on fixed text embeddings cannot adapt.
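The token-overhead arithmetic in item 1 can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function names and the 500-token base prompt are illustrative assumptions, not values from the paper.

```python
def added_context_tokens(k: int, tokens_per_experience: int) -> int:
    """Tokens that text-based RAG prepends when K experiences are concatenated."""
    return k * tokens_per_experience


def relative_prompt_growth(base_prompt_tokens: int, k: int, tokens_per_experience: int) -> float:
    """Ratio of the augmented prompt length to the original prompt length."""
    extra = added_context_tokens(k, tokens_per_experience)
    return (base_prompt_tokens + extra) / base_prompt_tokens


# The example from the text: K = 3 experiences of 300 tokens each.
print(added_context_tokens(3, 300))          # 900

# Hypothetical 500-token base prompt (an assumed number, for illustration only).
print(relative_prompt_growth(500, 3, 300))   # 2.8
```

With a latent approach the first function would return 0 for any K, because nothing is added to the token sequence.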

1.3 Autoregressive Generation and Hidden States

When an LLM generates text autoregressively, it processes the input prefix and produces a sequence of hidden states $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T \in \mathbb{R}^d$, one per generated token, at the final transformer layer. These hidden states are rich representations of the model’s internal “understanding” of the current context: they encode the semantic content, syntactic structure, and accumulated reasoning up to each position.

The key architectural insight in ExpWeaver is: these hidden states are the right space in which to do retrieval. A hidden state $\mathbf{h}_t$ at decoding step $t$ summarizes what context the model has processed and what reasoning it has done so far, which in turn shapes what it needs next. Retrieving experiences that match this latent context (rather than matching the surface text) is, on the paper’s argument, more informative.

1.4 Cross-Attention and Gated Residuals

Cross-attention is the standard mechanism in transformer architectures for one sequence to attend to another. For a query vector $\mathbf{u} \in \mathbb{R}^d$ and a set of key-value pairs packed as a matrix $\mathbf{Z} \in \mathbb{R}^{K \times d}$:

$$\text{CrossAttn}(\mathbf{u}, \mathbf{Z}, \mathbf{Z}) = \text{softmax}\!\left(\frac{\mathbf{u} \mathbf{Z}^{\top}}{\sqrt{d}}\right) \mathbf{Z}$$

The output is a weighted sum of the value vectors, where the weights are determined by the similarity between the query and each key. ExpWeaver uses a single learnable query token $\mathbf{u}$ (rather than the hidden state itself) to aggregate experiences, which decouples how retrieved experiences are aggregated from how they are selected.
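To make the formula concrete, here is a minimal pure-Python sketch of single-query cross-attention. It follows the simplified equation above literally, so it assumes a single head and no learned query, key, or value projections; the function names and the toy vectors are mine.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_query_cross_attention(u, Z):
    """Attend from one query vector u (length d) to the K rows of Z (K x d).

    Keys and values are both Z, matching CrossAttn(u, Z, Z).
    Returns (output vector, attention weights).
    """
    d = len(u)
    scores = [sum(ui * zi for ui, zi in zip(u, row)) / math.sqrt(d) for row in Z]
    weights = softmax(scores)
    output = [sum(w * row[j] for w, row in zip(weights, Z)) for j in range(d)]
    return output, weights


u = [1.0, 0.0, 0.0, 0.0]        # stands in for the learnable query token
Z = [
    [2.0, 0.0, 0.0, 0.0],       # aligned with u, so it gets the largest weight
    [0.0, 2.0, 0.0, 0.0],       # orthogonal to u
    [0.0, 0.0, 2.0, 0.0],       # orthogonal to u
]
out, w = single_query_cross_attention(u, Z)
print([round(x, 3) for x in w])
print([round(x, 3) for x in out])
```

The weights always sum to 1, so the output is a convex combination of the rows of Z.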

A gated residual (also called a gated skip connection) is a mechanism of the form:

$$\mathbf{h}' = \alpha \cdot \mathbf{h} + (1 - \alpha) \cdot \mathbf{e}$$

where $\alpha \in [0, 1]$ is a learned gate that controls how much of the original hidden state $\mathbf{h}$ vs. the experience signal $\mathbf{e}$ to retain. Setting $\alpha \approx 1$ makes the model conservative (mostly preserving its own reasoning); $\alpha \approx 0$ would let experiences dominate. In ExpWeaver, the gating is more sophisticated: it uses a norm-preserving interpolation that I will explain in detail in Section 3.2.

1.5 GRPO: Group Relative Policy Optimization

GRPO (Shao et al., 2024) is a reinforcement learning algorithm for fine-tuning LLMs that avoids the need for a separate critic model (unlike PPO). For a query $x$, GRPO samples $G$ responses $\{y_i\}_{i=1}^G$ and assigns each a scalar reward $r_i$. The group-normalized advantage is:

$$\hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}$$

where $\bar{r}$ is the group mean reward, $\sigma_r$ is the group standard deviation, and $\epsilon$ is a small constant that avoids division by zero. The policy is then updated to increase the log-probability of high-advantage responses and decrease it for low-advantage ones, with a KL penalty that keeps the policy close to the reference policy.

The key advantage of GRPO for ExpWeaver is that it provides a task-agnostic reward signal that can train both the LLM policy and the experience integration parameters jointly, without requiring separate reward models or critic networks.
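A small sketch of the group-normalized advantage may help. It assumes the population standard deviation for $\sigma_r$ (the text does not say which estimator is used) and a made-up group of binary rewards; the function name is mine.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)    # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# One query, G = 4 sampled responses with binary exact-match rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])

# If every response earns the same reward, all advantages are zero,
# so that group contributes no policy-gradient signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second call shows a practical consequence of the normalization: queries that are uniformly solved or uniformly failed within a group give no learning signal.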

1.6 LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2022) is a parameter-efficient fine-tuning method that inserts trainable low-rank matrices into transformer layers. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{m \times n}$, LoRA adds a low-rank perturbation:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ with rank $r \ll \min(m, n)$. This reduces the trainable parameters for that matrix from $mn$ to $r(m + n)$, a factor of $mn / (r(m + n))$. ExpWeaver applies LoRA to all attention and feed-forward layers of Qwen2.5-3B-Instruct with rank 32.
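The parameter saving is easy to verify by counting entries of $B$ and $A$. The sketch below uses a hypothetical square 2048 × 2048 projection purely for illustration; only the rank 32 comes from the review.

```python
def lora_param_counts(m: int, n: int, r: int):
    """Return (full, lora, reduction) trainable-parameter counts for one m x n weight."""
    full = m * n              # updating W_0 directly
    lora = r * (m + n)        # B is m x r, A is r x n
    return full, lora, full / lora


# Hypothetical 2048 x 2048 projection with rank 32.
full, lora, reduction = lora_param_counts(2048, 2048, 32)
print(full, lora, reduction)   # 4194304 131072 32.0
```

For a square matrix the reduction factor simplifies to $n / (2r)$, which is 32 in this example.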

2. Problem Setting and Motivation

2.1 The Experience Learning Problem

Let $\mathcal{D} = \{(x, y^*)\}$ be a dataset of (query, ground-truth) pairs. An LLM agent $p_\theta(y \mid x)$ with parameters $\theta$ produces output $y$ and reasoning trace $z$ for each query $x$. After generating a response, the agent receives a scalar reward $r(x, y, z)$ measuring quality.

The experience learning problem is to enable the agent to improve its future responses by leveraging a growing memory bank $\mathcal{M}$ of past trajectories. Each trajectory is compressed into a condensed experience $e$ using a summarization function $\mathcal{S}$:

$$e = \mathcal{S}(x, y, z, r) \tag{1}$$

At generation time, the agent has access to $\mathcal{M}$ and can select a relevant candidate set $\mathcal{C}_\phi(x)$ of size $K$:

$$\mathcal{C}_\phi(x) = \arg\max_{\mathcal{C} \subseteq \mathcal{M},\, |\mathcal{C}|=K} \sum_{e \in \mathcal{C}} f_\phi(x, e) \tag{2}$$

where $f_\phi(x, e)$ is a relevance scoring function parameterized by $\phi$. The augmented generation policy is:

$$y \sim p_\theta(y \mid x, \mathcal{C}_\phi(x))$$
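Because the objective in Eq. 2 is a sum of per-experience scores, the best size-$K$ subset is just the $K$ highest-scoring experiences. The sketch below illustrates that with a toy word-overlap score standing in for $f_\phi$; the scoring function, the function names, and the example strings are all assumptions for illustration.

```python
import heapq


def select_candidates(query, bank, score_fn, k):
    """Pick the K experiences with the highest relevance score f(x, e).

    Since the subset objective is additive, top-K by individual score
    maximizes the sum over any size-K subset.
    """
    return heapq.nlargest(k, bank, key=lambda e: score_fn(query, e))


def word_overlap(query, experience):
    """Toy relevance score: number of shared lowercase words (illustrative only)."""
    return len(set(query.lower().split()) & set(experience.lower().split()))


bank = [
    "factor the quadratic then check both roots",
    "use a hash map to count word frequency",
    "convert units before comparing the two rates",
    "check both roots against the original equation",
]
picked = select_candidates("solve the quadratic and check the roots", bank, word_overlap, k=2)
print(picked)
```

In ExpWeaver the score is a cosine similarity in latent space rather than anything lexical, but the selection step has the same shape.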

2.2 Why Existing Approaches Fall Short

The paper organizes prior work into two families:

Retrieval-centric methods (ReasoningBank, ExpeL, LLM-R) optimize the retriever $\phi$ while keeping $\theta$ fixed:

$$\max_\phi \mathbb{E}_{x \sim \mathcal{D}} \left[ U_\theta(x, \mathcal{C}_\phi(x)) \right] \tag{3}$$

where $U_\theta$ is an LLM-induced utility function. The problem: training the retriever independently from the generator creates a decoupled architecture where the retriever does not learn what the generator actually needs.

LLM-centric methods (IRCoT, Search-o1, Search-R1) treat retrieval as a tool-use decision made by the LLM itself:

$$\max_\theta \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{\mathcal{C} \sim \pi_\theta(\cdot \mid x)} \mathbb{E}_{y \sim p_\theta(y \mid x, \mathcal{C})} [r(x, y)] \right] \tag{4}$$

This couples retrieval and generation in a single RL loop, but the architecture still relies on an external, separately-parameterized RAG module (a search engine, a vector database with independent embedding model). Experiences remain in text space, consuming context window tokens.

The gap ExpWeaver fills: neither family integrates retrieval and generation in a shared representational space. ExpWeaver does.

3. The ExpWeaver Framework

Figure 1 from the paper shows the three-component architecture: Experience Bank, Latent Retrieval-Augmented Generation, and Task Adaptation. Let me build this up piece by piece.

flowchart TD
    subgraph EB["Experience Bank"]
        TRAJ["Past trajectory τ = (x,z,y,y*,r)"]
        SUM["LLM summarizer S(x,y,z,r) → summary s"]
        ENC["Embedding z = h_θ(s)\n(same LLM, last-token hidden state)"]
        STORE["Memory M\nstore (x,z,y,y*,r,s,z)"]
        TRAJ --> SUM --> ENC --> STORE
    end

    subgraph LRAG["Latent Retrieval-Augmented Generation"]
        INPUT["Query x"]
        DECODE["Autoregressive decoding\nh₁, h₂, ..., h_T"]
        RETR["TopK retrieval\nC_t = TopK(sim(h_t, z_e), K)"]
        CA["Cross-Attention\ne_t = LN(CrossAttn(u, Z_t, Z_t))"]
        GATE["Gated Integration\nh'_t = a_t ⊙ h_t + √(1-a²_t) ⊙ (i_t ⊙ ẽ_t)"]
        INPUT --> DECODE --> RETR --> CA --> GATE --> DECODE
    end

    subgraph TA["Task Adaptation"]
        GEN["Generative tasks:\ny ~ p_θ(y | x) via h'_T"]
        RANK["Ranking tasks:\nscore(y_j|x) = sim(h'_T, c_j)"]
        GATE --> GEN
        GATE --> RANK
    end

    STORE -.latent embeddings.-> RETR
    GEN & RANK --> REWARD["Reward r(x,y)"]
    REWARD --> RL["GRPO training\nupdate θ and ψ_exp jointly"]
    RL -.update.-> DECODE
    RL -.update.-> STORE

3.1 Experience Representation and the Experience Bank

Each experience $e \in \mathcal{M}$ is stored as a 7-tuple:

$$e = (x,\; z,\; y,\; y^*,\; r,\; s,\; \mathbf{z}) \tag{5}$$

where:

  • $x$: the original query
  • $z$: the agent’s reasoning trace (chain-of-thought steps, tool calls)
  • $y$: the agent’s generated output
  • $y^*$: the ground-truth reference answer
  • $r$: the scalar reward received from the environment
  • $s = \mathcal{S}(x, y, z, r)$: a textual summary produced by prompting the LLM
  • $\mathbf{z} \in \mathbb{R}^d$: the latent embedding (dense vector)

Computing the latent embedding. The embedding $\mathbf{z}$ is the single most important design choice in ExpWeaver. Rather than using a separate encoder model (like a sentence-BERT), ExpWeaver computes:

$$\mathbf{z} = \mathbf{h}_\theta(s) \tag{6}$$

where $\mathbf{h}_\theta(s)$ denotes the hidden state of the last token at the final transformer layer of $p_\theta$ when processing the summary $s$.

Why is this important? Three reasons:

  1. Same representation space. The decoding hidden states $\mathbf{h}_t$ and the experience embeddings $\mathbf{z}$ live in the same $\mathbb{R}^d$ space, produced by the same LLM. Cosine similarity between $\mathbf{h}_t$ and $\mathbf{z}$ is therefore semantically meaningful: it measures whether the model’s current reasoning context is similar to the reasoning context that produced experience $e$.

  2. No separate retriever. Because both query and key embeddings come from the same LLM, there is no need for an external encoder. The retrieval step itself is architecturally “free”: the only added parameters are those of the cross-attention aggregation and gating modules.

  3. On-policy alignment. As LoRA fine-tunes $\theta$, all future $\mathbf{h}_\theta(s)$ embeddings are computed with the updated $\theta$. Experiences added to $\mathcal{M}$ during training therefore reside in the latent space of the current policy, analogous to the recency bias in on-policy RL. Older experiences (computed with an earlier $\theta$) are eventually evicted by the FIFO capacity policy, which limits how far stored embeddings can drift from the current policy.

Storage and indexing. Embeddings are $L_2$-normalized. Similarity is cosine similarity:

$$\text{sim}(\mathbf{z}, \mathbf{z}') = \frac{\mathbf{z}^\top \mathbf{z}'}{\|\mathbf{z}\| \|\mathbf{z}'\|} \tag{7}$$

The bank uses FAISS (Johnson et al., 2019) with inner product search on $L_2$-normalized embeddings (on unit vectors the inner product equals cosine similarity), giving efficient approximate nearest-neighbor lookup. The bank enforces a fixed capacity by evicting the oldest entries when full.
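The storage behaviour described here can be sketched without FAISS. The class below is a hypothetical stand-in, not the paper's implementation: it uses exhaustive search instead of an index, and the name `ExperienceBank` and its methods are my own.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


class ExperienceBank:
    """Fixed-capacity bank: the oldest entries are evicted first (FIFO)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.entries = deque(maxlen=capacity)

    def add(self, summary, embedding):
        self.entries.append((summary, l2_normalize(embedding)))

    def top_k(self, query, k):
        """Exhaustive inner-product search over unit vectors (equals cosine similarity)."""
        q = l2_normalize(query)
        scored = [(sum(a * b for a, b in zip(q, z)), s) for s, z in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


bank = ExperienceBank(capacity=3)
bank.add("exp-A", [1.0, 0.0])
bank.add("exp-B", [0.0, 1.0])
bank.add("exp-C", [1.0, 1.0])
bank.add("exp-D", [2.0, 0.1])      # bank is full, so exp-A is evicted
print([s for _, s in bank.top_k([1.0, 0.0], k=2)])
```

Note that exp-A would have been the best match for this query, but it is no longer in the bank: FIFO eviction trades recall of old experiences for alignment with the current policy.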

3.2 Latent Retrieval-Augmented Generation

The LRAG module operates token-by-token during autoregressive decoding. At decoding step $t$, the LLM has produced hidden state $\mathbf{h}_t \in \mathbb{R}^d$.

Step 1: Latent Experience Retrieval

Retrieve the top-$K$ most relevant experiences from $\mathcal{M}$:

$$\mathcal{C}_t = \text{TopK}_{e \in \mathcal{M}}\!\left(\text{sim}(\mathbf{h}_t,\; \mathbf{z}_e),\; K\right) \tag{8}$$

where $\mathbf{z}_e$ is the latent embedding of experience $e$. Since $\mathbf{h}_t$ is available at every decoding step at zero extra cost (it’s a byproduct of the forward pass), retrieval adds only the cost of the FAISS lookup, a few milliseconds for a bank of thousands of experiences.

Step 2: Cross-Attention Aggregation

Let $\mathcal{C}_t = \{e_1, \ldots, e_K\}$. Stack their embeddings into a matrix:

$$\mathbf{Z}_t = [\mathbf{z}_{e_1};\; \ldots;\; \mathbf{z}_{e_K}] \in \mathbb{R}^{K \times d}$$

Aggregate via cross-attention with a single learnable query token $\mathbf{u} \in \mathbb{R}^d$:

$$\mathbf{e}_t = \text{LN}\!\left(\text{CrossAttn}(\mathbf{u},\; \mathbf{Z}_t,\; \mathbf{Z}_t)\right) \tag{9}$$

The layer normalization (LN) stabilizes the magnitude of the aggregated experience vector. The output $\mathbf{e}_t \in \mathbb{R}^d$ is the “distilled experience signal” for this decoding step.

Why a learnable query token $\mathbf{u}$ rather than $\mathbf{h}_t$ itself? The paper’s argument is that $\mathbf{h}_t$ already serves as the retrieval query (Eq. 8); using it again as the aggregation query would create a direct dependency between retrieval and aggregation that could cause the model to weight experiences based on lexical overlap with the current token rather than their problem-solving relevance. A learned $\mathbf{u}$ provides a task-general aggregation scheme that the model can adapt through training.

After aggregation, the experience vector is rescaled to match the magnitude of the current hidden state:

$$\tilde{\mathbf{e}}_t = \mathbf{e}_t \cdot \frac{\|\mathbf{h}_t\|}{\|\mathbf{e}_t\|}$$

This magnitude-matching is crucial: without it, the scale difference between $\mathbf{e}_t$ and $\mathbf{h}_t$ could dominate the gating dynamics and cause instability.

Step 3: Gated Experience Integration

The most mathematically careful part of ExpWeaver is the integration of $\tilde{\mathbf{e}}_t$ into the hidden state $\mathbf{h}_t$. A naive addition $\mathbf{h}_t + \alpha \tilde{\mathbf{e}}_t$ can push the hidden-state norm away from what downstream layers expect. ExpWeaver uses a norm-preserving interpolation inspired by spherical linear interpolation (slerp).

First, compute retention and input gate vectors:

$$\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{h}_t), \quad \mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{h}_t) \tag{10}$$

where $\mathbf{W}_r, \mathbf{W}_i \in \mathbb{R}^{d \times d}$ are learnable weight matrices and $\sigma$ is the sigmoid function. The retention gate $\mathbf{r}_t$ modulates how much of the original hidden state to preserve; the input gate $\mathbf{i}_t$ modulates how much of the experience signal to incorporate.

The mixing coefficient vector $\mathbf{a}_t \in \mathbb{R}^d$ is computed as:

$$\mathbf{a}_t = \exp\!\left(-\alpha \cdot \text{softplus}(-\boldsymbol{\lambda}) \odot \mathbf{r}_t\right) \tag{11}$$

where $\boldsymbol{\lambda} \in \mathbb{R}^d$ is a learnable vector and $\alpha > 0$ is a fixed scaling hyperparameter. Because $\text{softplus}(-\boldsymbol{\lambda})$ is positive for any unconstrained $\boldsymbol{\lambda}$, the argument of the exponential is never positive. The exponential mapping therefore gives $\mathbf{a}_t \in (0, 1]^d$, guaranteeing that the original hidden state is always preserved to some extent.
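A short numeric sketch of Eq. 11 shows the range of the mixing coefficient. The values of $\boldsymbol{\lambda}$ and the gate pre-activations below are made up for illustration, and $\alpha = 1$ is an assumed setting, not one reported in the review.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


def mixing_coefficients(lam, r, alpha):
    """Elementwise a = exp(-alpha * softplus(-lam) * r), following Eq. 11."""
    return [math.exp(-alpha * softplus(-l) * ri) for l, ri in zip(lam, r)]


# Hypothetical values: a large positive lambda keeps a close to 1,
# a negative lambda lets a drop much lower.
lam = [4.0, 0.0, -2.0]
r = [sigmoid(x) for x in (-1.0, 0.0, 3.0)]
a = mixing_coefficients(lam, r, alpha=1.0)
print([round(x, 4) for x in a])
```

Every entry stays strictly between 0 and 1 here, since both the softplus term and the sigmoid gate are strictly positive.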

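Finally, the updated hidden state is:

$$\mathbf{h}'_t = \mathbf{a}_t \odot \mathbf{h}_t + \sqrt{1 - \mathbf{a}_t^2} \odot (\mathbf{i}_t \odot \tilde{\mathbf{e}}_t) \tag{12}$$

The key property of this formula is approximate norm preservation. For each dimension the two coefficients satisfy $a^2 + (\sqrt{1-a^2})^2 = 1$, and the magnitude rescaling enforces $\|\tilde{\mathbf{e}}_t\| = \|\mathbf{h}_t\|$, so $\|\mathbf{h}'_t\| \approx \|\mathbf{h}_t\|$ whenever the cross terms between $\mathbf{h}_t$ and $\tilde{\mathbf{e}}_t$ are small and $\mathbf{i}_t$ is close to 1. Exact equality needs additional conditions (for example a uniform $\mathbf{a}_t$, $\mathbf{i}_t = \mathbf{1}$, and $\mathbf{h}_t \perp \tilde{\mathbf{e}}_t$). The design intent is that hidden-state norms neither explode nor collapse as they pass through the gating layer.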
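The norm behaviour of Eq. 12 can be probed numerically. This is a minimal sketch with toy 4-dimensional vectors of my own choosing; it shows one case where the norm is preserved exactly (no overlap between the two vectors, input gate equal to 1, uniform mixing coefficient) and two cases where it is not.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def rescale_to(e, h):
    """Rescale e so that its L2 norm matches the norm of h."""
    scale = norm(h) / norm(e)
    return [x * scale for x in e]


def gated_integration(h, e_tilde, a, i):
    """Elementwise h' = a*h + sqrt(1 - a^2) * (i * e_tilde), following Eq. 12."""
    return [
        aj * hj + math.sqrt(1.0 - aj * aj) * (ij * ej)
        for hj, ej, aj, ij in zip(h, e_tilde, a, i)
    ]


h = [3.0, 0.0, 4.0, 0.0]                       # norm 5
e = rescale_to([0.0, 1.0, 0.0, 1.0], h)        # disjoint support from h, norm 5
a = [0.98] * 4                                 # uniform mixing coefficient
ones = [1.0] * 4

exact = gated_integration(h, e, a, ones)
print(norm(h), norm(exact))                    # equal: no cross terms, i = 1

damped = gated_integration(h, e, a, [0.5] * 4)
print(norm(damped))                            # smaller: i < 1 shrinks the injected part

overlapping = gated_integration(h, rescale_to([1.0, 0.0, 1.0, 0.0], h), a, ones)
print(norm(overlapping))                       # larger: e points partly along h
```

With a mixing coefficient near 0.98 the deviations stay modest, which is consistent with the conservative initialization discussed in the ablations.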

Let me visualize the geometry of this gated integration:

      Hidden state space R^d (one dimension shown):

      h_t ─────────────────────────────────────────●  (magnitude ‖h_t‖)
                                                    ↕  retained: a_t ⊙ h_t

      ẽ_t ───────────────────────────────●            (magnitude ‖h_t‖ after rescaling)
                                          ↕  injected: √(1-a²_t) ⊙ (i_t ⊙ ẽ_t)

      h'_t ─────────────────────────────────────●     (magnitude ≈ ‖h_t‖, preserved)

      The Pythagorean identity a² + (√(1-a²))² = 1
      ensures norm is preserved when a_t ∈ [0,1].

When $\mathbf{a}_t$ is initialized close to 1 (as in the paper with $a_\text{min} = 0.98$), the model starts with $\mathbf{h}'_t \approx \mathbf{h}_t$ (mostly ignoring experiences). As training progresses with RL rewards, the model learns to lower $\mathbf{a}_t$ in exactly those positions where experience information helps, and the $\mathbf{i}_t$ gate controls which components of the experience vector are relevant.

All parameters $\psi_\text{exp} = \{\mathbf{u}, \mathbf{W}_r, \mathbf{W}_i, \boldsymbol{\lambda}\}$ are jointly optimized with $\theta$ during RL training.

3.3 Task Adaptation

ExpWeaver naturally handles two task types:

Generative tasks (QA, reasoning, coding). The enhanced hidden states $\{\mathbf{h}'_t\}$ are used for next-token prediction. At each step, the output distribution $p_\theta(y_t \mid y_{<t}, x, \mathcal{C})$ is computed from $\mathbf{h}'_t$ rather than $\mathbf{h}_t$. No structural change to the generation process is needed.

Ranking tasks (recommendation). Given a candidate set $\mathcal{V} = \{y_1, \ldots, y_M\}$ (e.g., movies or songs), each candidate is encoded as:

$$\mathbf{c}_j = \mathbf{h}_\theta(y_j)$$

Then the final hidden state $\mathbf{h}'_T$ (after processing all of $x$) produces a relevance score:

$$\text{score}(y_j \mid x) = \text{sim}(\mathbf{h}'_T, \mathbf{c}_j)$$

The ranking $\pi$ is produced by sorting candidates in descending order of score. This zero-parameter adaptation to ranking requires no modifications to the architecture.
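The ranking step reduces to a cosine-similarity sort. The sketch below uses toy 3-dimensional vectors in place of real hidden states; the candidate names and the helper functions are illustrative assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_candidates(h_final, candidates):
    """Sort candidate names by cosine similarity to the final hidden state, best first."""
    scored = [(cosine(h_final, emb), name) for name, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy embeddings; real ones would be last-token hidden states from the LLM.
h_final = [0.9, 0.1, 0.4]
candidates = {
    "movie_a": [0.1, 0.9, 0.0],
    "movie_b": [0.8, 0.2, 0.5],
    "movie_c": [0.5, 0.5, 0.5],
}
ranking = rank_candidates(h_final, candidates)
print(ranking)
```

Since candidate embeddings do not depend on the query, they could be computed once and cached, although the review does not say whether the paper does this.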

3.4 Training Algorithm

The full training procedure is summarized in Algorithm 1 of the paper:

Algorithm 1: Training ExpWeaver

Input: Dataset D = {(x, y*)}, policy p_θ, experience
       parameters ψ_exp, group size G, retrieval size K,
       learning rate η, KL coefficient β, cold-start threshold M_min

Initialize: M ← ∅

For each iteration:
  1. Sample mini-batch B ⊆ D

  2. For each (x, y*) ∈ B:
     a. For i = 1 to G:
        [Cold-start check]
        if |M| < M_min:
          sample (y_i, z_i) ~ p_θ(·|x)         # skip retrieval
        else:
          sample (y_i, z_i) ~ p_θ(·|x, C)      # with latent retrieval (Eq. 8, 12)
        compute reward r_i ← r(x, y_i)

     b. For i = 1 to G:
        summarize s_i ← S(x, y_i, z_i, r_i)   # LLM summarizer
        encode z_i ← h_θ(s_i)                  # latent embedding (Eq. 6)
        add to bank: M ← M ∪ {(x,z_i,y_i,y*,r_i,s_i,z_i)}

     c. Compute advantages: Â_i ← (r_i - r̄)/(σ_r + ε)

  3. Compute GRPO loss L(θ, ψ_exp)              # Eq. 13
  4. Update: (θ, ψ_exp) ← (θ, ψ_exp) - η∇L(θ, ψ_exp)

The GRPO objective is:

$$\mathcal{L}(\theta, \psi_\text{exp}) = -\mathbb{E}_{x \sim \mathcal{D}}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{p_\theta(y_i \mid x)}{p_{\theta_\text{old}}(y_i \mid x)} \hat{A}_i\right] + \beta\,\mathbb{E}_{x \sim \mathcal{D}}\!\left[\frac{1}{G}\sum_{i=1}^G D_\text{KL}(p_\theta \| p_\text{ref})\right] \tag{13}$$

The first term is the policy gradient (maximize expected advantage). The second is a KL penalty to prevent the policy from diverging too far from the reference model (the base pre-trained LLM), which is critical for preventing reward hacking and maintaining language coherence.

The group advantage $\hat{A}_i = (r_i - \bar{r}) / (\sigma_r + \epsilon)$ normalizes rewards within the group, which reduces variance and ensures that only above-average responses are reinforced.
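For a single query, the objective above can be evaluated directly from per-response quantities. The sketch follows Eq. 13 as written in this review (no ratio clipping), and every number except $\beta = 0.005$ is a made-up placeholder; the function name and argument names are mine.

```python
import math


def grpo_group_loss(logp_new, logp_old, advantages, kl_to_ref, beta):
    """Loss for one query's group of G responses, following Eq. 13.

    logp_new / logp_old: sequence log-probabilities under the current and
    sampling policies. kl_to_ref: per-response KL estimates to the reference.
    """
    g = len(advantages)
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / g
    kl_penalty = sum(kl_to_ref) / g
    return -surrogate + beta * kl_penalty


# Hypothetical numbers for G = 4.
loss = grpo_group_loss(
    logp_new=[-10.0, -12.0, -11.0, -9.5],
    logp_old=[-10.0, -12.0, -11.0, -9.5],    # on-policy step: all ratios equal 1
    advantages=[1.0, -1.0, -1.0, 1.0],
    kl_to_ref=[0.02, 0.01, 0.03, 0.02],
    beta=0.005,
)
print(loss)
```

At an exactly on-policy step the surrogate term evaluates to zero because the advantages sum to zero; the learning signal comes from its gradient, which raises the log-probability of positive-advantage responses.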

Reward design differs by task type:

  • Generative tasks: $r_\text{gen}(y, y^*) = \mathcal{F}(y, y^*)$, where $\mathcal{F}$ is task-appropriate (exact match for QA, pass@1 for coding)
  • Ranking tasks: $r_\text{rank}(\pi, y^*) = 1 / \rho_\pi(y^*)$, where $\rho_\pi(y^*)$ is the position of the ground-truth item in ranking $\pi$. This is $1$ when ranked first and decreases as position worsens.
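Both reward shapes are simple to write down. In the sketch below the string normalization in the exact-match reward and the zero reward for a missing target are my assumptions; the text specifies neither.

```python
def exact_match_reward(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match, else 0.0 (one possible choice of F)."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def reciprocal_rank_reward(ranking, target) -> float:
    """1 / position of the ground-truth item (1-indexed)."""
    if target not in ranking:
        return 0.0      # assumption: the text does not define this case
    return 1.0 / (ranking.index(target) + 1)


print(exact_match_reward(" Paris ", "paris"))                # 1.0
print(reciprocal_rank_reward(["m3", "m7", "m1"], "m3"))      # 1.0
print(reciprocal_rank_reward(["m3", "m7", "m1"], "m1"))      # 1/3, about 0.333
```

The reciprocal-rank reward is dense compared with exact match: moving the correct item from position 3 to position 2 already increases the reward.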

4. Experimental Setup

4.1 The ExpBench Benchmark

ExpWeaver is evaluated on ExpBench, a custom benchmark spanning 13 tasks across three scenarios:

graph TD
    ExpBench --> GenBench["ExpBench-Generic\n10 tasks (QA + Reasoning + Coding)"]
    ExpBench --> SciBench["ExpBench-Sci\nChem-TDC (pharmaceutical chemistry)"]
    ExpBench --> RecBench["ExpBench-Rec\nRec-Movie + Rec-Music"]

    GenBench --> QA["Question Answering:\nARC-C, CommonsenseQA, GPQA, MMLU, OBQA"]
    GenBench --> Reason["Mathematical Reasoning:\nGSM8K, GSM-Symbolic, MATH"]
    GenBench --> Code["Code Generation:\nHumanEval+, MBPP+"]

    SciBench --> Chem["Chem-TDC:\ndomain-specific scientific reasoning\n(expert-level chemistry questions)"]

    RecBench --> Movie["Rec-Movie:\nMovieLens ml-1m"]
    RecBench --> Music["Rec-Music:\nNi et al. 2019"]

Why these tasks? The three scenarios test complementary aspects:

  • ExpBench-Generic tests generalization of experience learning across diverse general-domain tasks
  • ExpBench-Sci tests transfer to a specialized scientific domain where factual expertise matters beyond general reasoning patterns
  • ExpBench-Rec tests ranking capabilities in recommendation — a very different output structure (ranked list vs. text)

4.2 Baselines

The paper compares ExpWeaver against three tiers:

General Reasoning Baselines (no experience learning):

  • CoT (Wei et al., 2022): chain-of-thought prompting
  • HRPO: hybrid reasoning policy optimization, an RL-trained latent reasoning method with no retrieval
  • R1 (Guo et al., 2025): RL-based reasoning with DeepSeek-style training

Retrieval-Centric Experience Learning:

  • ReasoningBank (Ouyang et al., 2025): maintains a bank of reasoning examples
  • ExpeL (Zhao et al., 2024): extracts task-specific experiences via KNN
  • LLM-R (Wang et al., 2024): pairwise ranking loss for in-context retrieval

LLM-Centric Experience Learning:

  • IRCoT (Trivedi et al., 2023): interleaved retrieval within reasoning chains
  • Search-o1 (Li et al., 2025): search-augmented reasoning with agentic search
  • Search-R1 (Jin et al., 2025): RL-trained LLM for effective search engine utilization

4.3 Implementation Details

Parameter              | Value
-----------------------+--------------------------------------
Base model             | Qwen2.5-3B-Instruct
LoRA rank              | 32
LoRA scaling factor    | 64
Experience bank index  | FAISS (inner product, L2-normalized)
Retrieval size K       | 3
GRPO group size G      | 4
KL coefficient β       | 0.005
LoRA learning rate     | 5×10⁻⁶
ψ_exp learning rate    | 1×10⁻⁴
Max sequence length    | 1024 tokens
Batch size             | 8 per device
Gradient accumulation  | 4 steps (effective batch 32)
Precision              | BF16
Cross-attention heads  | 8
a_min initialization   | 0.98
Hardware               | 4× NVIDIA A6000 GPUs

5. Results and Analysis

5.1 ExpBench-Generic: SOTA on 10 Diverse Tasks

Table 2 from the paper shows ExpWeaver vs. baselines on the 10-task Generic benchmark. I reproduce the key numbers below:

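Method    | ARC-C | CQA   | GPQA  | MMLU  | OBQA  | GSM8K | GSM-Sym | MATH  | HEval+ | MBPP+ | Avg
----------+-------+-------+-------+-------+-------+-------+---------+-------+--------+-------+------
CoT       | 62.44 | 56.44 | 26.67 | 58.22 | 64.22 | 71.11 | 66.44   | 53.11 | 66.00  | 72.81 | 61.40
R1        | 78.44 | 74.67 | 21.67 | 66.22 | 71.33 | 87.11 | 79.78   | 61.78 | 73.08  | 68.75 | 73.10
Search-R1 | 80.44 | 76.67 | 23.33 | 67.78 | 73.33 | 85.11 | 77.56   | 59.78 | 71.79  | 67.50 | 73.27
ExpWeaver | 84.00 | 82.00 | 35.00 | 68.00 | 80.00 | 89.78 | 85.00   | 65.33 | 72.00  | 78.00 | 78.25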

ExpWeaver achieves the best score on 9 out of 10 Generic tasks. The only task where it does not win is HumanEval+, where R1 scores 73.08 against ExpWeaver’s 72.00. Among the rows reproduced above, the largest margins over the best baseline on each task are on GPQA (+8.33 over CoT, +11.67 over Search-R1), OBQA (+6.67 over Search-R1), and CommonsenseQA (+5.33 over Search-R1).

The GPQA result is particularly interesting: GPQA (Graduate-Level Google-Proof QA) requires graduate-level scientific reasoning. The +11.67 absolute gain over Search-R1 (+13.33 over R1) suggests that latent experience retrieval is especially valuable when the problem requires building on prior problem-solving strategies. This is the kind of structured reasoning that benefits from “I solved a similar problem before by doing X” rather than raw reasoning power.

5.2 Token Efficiency: Latent vs. Text RAG

Figure 2 from the paper shows a bar chart comparison of average tokens consumed per query:

Token consumption (approximate, from paper Figure 2):

Task Category | R1   | SearchR1 | HRPO | ExpWeaver
--------------+------+----------+------+-----------
QA            | 713  | 1210     | 374  | 412
Reasoning     | 698  | 1490     | 351  | 393
Coding        | 632  | 428      | 267  | 241
Chemistry     | 676  | 2005     | 272  | 303
Rec           | 519  | 1862     | 518  | 471

Notes: SearchR1 is 1.5–2× higher due to text concatenation.
ExpWeaver ≈ HRPO (non-retrieval RL baseline), despite outperforming it.

The key observation: ExpWeaver’s token count is comparable to HRPO (an RL method with no retrieval at all). This is because latent experience integration adds zero tokens to the context window — the experiences are injected directly into hidden states, never appearing in the token sequence. Text-based Search-R1 adds retrieved passages verbatim, incurring 1.5–2× overhead.

5.3 Cross-Domain Generalization (ExpBench-Sci)

This is the most compelling result in the paper. The generalization experiment tests three settings on Chem-TDC:

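Setting                                                   | R1    | Search-R1 | ExpWeaver
----------------------------------------------------------+-------+-----------+----------
ZeroShot (trained Generic, eval Chem)                     | 50.44 | 54.20     | 62.44
FewShot (exp bank from Generic, model fine-tuned on Chem) | two values given: 58.67 and 63.78 (ExpWeaver)
InDomain (both trained and evaluated on Chem)             | two values given: 69.58 (R1) and 69.58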

ZeroShot is the most important row. The model has never seen any chemistry questions. Yet ExpWeaver outperforms Search-R1 by 8.24 percentage points. The authors argue this is because latent experience representations capture “meta-level” problem-solving strategies (how to break down a scientific question, when to hedge, how to apply analogical reasoning) rather than domain-specific surface features.

A second notable finding, read from Figure 3 of the paper: ExpWeaver-FewShot (63.78) does not beat R1-InDomain (69.58), but it comes within 5.8 points of it while using only Generic training data plus a few Chem experiences in the bank. That is still an impressive share of the gap closed.

5.4 Recommendation Ranking (ExpBench-Rec)

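Method    | Rec-Movie NDCG@10 | Rec-Movie MRR | Rec-Music NDCG@10 | Rec-Music MRR | Avg NDCG | Avg MRR
----------+-------------------+---------------+-------------------+---------------+----------+--------
R1        | 21.50             | 16.80         | 24.50             | 19.80         | 23.00    | 18.30
IRanker   | 42.32             | 34.69         | 33.47             | 29.18         | 37.90    | 31.94
Search-R1 | 26.34             | 19.55         | 30.72             | 25.33         | 28.53    | 22.44
ExpWeaver | 49.37             | 39.55         | 39.42             | 33.21         | 44.40    | 36.38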

ExpWeaver outperforms all baselines on ranking by a wide margin. The 6.50-point improvement in average NDCG@10 over IRanker (the strongest ranking-specialized baseline) is notable given that IRanker was specifically designed for ranking tasks using progressive candidate elimination.

The ranking adaptation requires zero architectural changes to ExpWeaver: the final enhanced hidden state $\mathbf{h}'_T$ is used as-is for cosine similarity with candidate embeddings. This universality, with the same mechanism handling both generation and ranking, is a key practical advantage.

6. Ablation Studies

6.1 Effect of Retrieval Number K

The ablation varies $K \in \{1, 3, 5, 8\}$ and evaluates on all five task categories (radar charts in Figure 4a).

Key findings:

  • K=1 degrades noticeably. A single retrieved experience provides insufficient diversity; if the top-1 experience is slightly mismatched, the entire integration is misleading.
  • K=3 and K=5 achieve nearly identical best results. The sweet spot is a small set of diverse relevant experiences.
  • K=8 slightly degrades vs. K=3. Adding more experiences introduces noise — some of the K=8 candidates are low-relevance and their embeddings add confusion to the cross-attention aggregation.

This suggests that the cross-attention aggregation is not infinitely robust to noise: when many irrelevant embeddings are present, the learnable query token $\mathbf{u}$ cannot fully down-weight them.

6.2 Aggregation Mechanism Comparison

Three variants compared (Figure 4b):

| Variant | Description | Effect |
| --- | --- | --- |
| Mean Pooling | Replace cross-attention with simple average of K embeddings | Largest degradation. Cannot weight experiences by relevance. |
| Weighted Mean | Weight by retrieval similarity score | Improves over mean pooling. But static weighting cannot adapt to the current generation context. |
| Qformer | Replace with Qformer-style multi-query architecture | Competitive but does not surpass ExpWeaver despite more parameters. |
| ExpWeaver (cross-attn) | Single learnable query token, cross-attention | Best across all task categories |

The result validates the choice of cross-attention with a single global query token: the single token learns a task-general aggregation strategy, while the cross-attention mechanism provides the flexibility to dynamically weight experiences based on the current decoding state.
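
The difference between the three simpler aggregation styles can be sketched in a few lines. This is a simplified illustration, not the paper's module: the attention variant below omits the learned query/key/value projections and the LayerNorm, and the values of the query `u`, the embeddings `Z`, and the similarity scores `sims` are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

def mean_pool(Z):
    """Every retrieved experience gets the same weight 1/K."""
    return weighted_sum([1.0 / len(Z)] * len(Z), Z)

def similarity_weighted_mean(Z, sims):
    """Weights are fixed by the retrieval similarity scores."""
    total = sum(sims)
    return weighted_sum([s / total for s in sims], Z)

def single_query_attention(u, Z):
    """Scaled dot-product attention with one query token (projections omitted)."""
    scale = math.sqrt(len(u))
    scores = [sum(a * b for a, b in zip(u, z)) / scale for z in Z]
    return weighted_sum(softmax(scores), Z)

Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 retrieved embeddings
sims = [0.9, 0.1, 0.5]                    # hypothetical retrieval scores
u = [4.0, 0.0]                            # hypothetical learned query

print(mean_pool(Z))
print(similarity_weighted_mean(Z, sims))
print(single_query_attention(u, Z))
```

With this query, the attention variant puts almost no weight on the second embedding, which has zero dot product with `u`, whereas mean pooling always gives it one third of the mass.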

6.3 Mixing Coefficient Initialization

Figure 4c varies $a_\text{min} \in \{0.95, 0.97, 0.98, 0.99\}$:

| $a_\text{min}$ | Effect |
| --- | --- |
| 0.95 | Model integrates experiences too aggressively early in training, disrupting pre-trained representations → degraded performance |
| 0.97 | Better but still too much experience influence before the bank is populated |
| 0.98 | Optimal balance: the model primarily preserves hidden states while gradually learning experience integration |
| 0.99 | Mixing range too narrow: the model has insufficient flexibility to incorporate experiences even after the bank is well-populated |

The conclusion: the initialization $a_\text{min} = 0.98$ ensures the model first masters the task independently (cold-start phase) and then progressively learns to leverage its experience bank as the bank grows more useful.
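
One way to get a feel for these four settings is to look at the weight the experience term receives in the norm-preserving mix of Eq. 12, which is $\sqrt{1 - a^2}$ when the retained share of the hidden state is $a$. The sketch below assumes that $a_\text{min}$ is the smallest value the mixing coefficient takes at initialization; that reading is an interpretation on my part, and the function name is illustrative.

```python
import math

def experience_weight(a):
    """Coefficient on the experience term, sqrt(1 - a^2), for mixing value a."""
    return math.sqrt(1.0 - a * a)

for a_min in (0.95, 0.97, 0.98, 0.99):
    weight = experience_weight(a_min)
    print(f"a_min={a_min:.2f}  largest experience weight={weight:.3f}")
```

Under that assumption the largest experience weight falls from roughly 0.31 at 0.95 to roughly 0.14 at 0.99, with 0.98 sitting near 0.20. The range looks narrow on the $a$ axis but spans more than a factor of two on the experience side.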

7. Limitations and Boundary Conditions

7.1 Model Scale

All experiments use Qwen2.5-3B-Instruct. At 3B parameters, the hidden dimension is $d \approx 2048$ and the experience bank entries are manageable ($K \approx 3$ retrievals per step). At 70B+ parameters, $d \approx 8192$, meaning each experience embedding is 4× larger and the FAISS index scales accordingly. The cross-attention module cost ($O(K \cdot d)$ per step) also grows. The paper provides no analysis of how performance vs. efficiency tradeoffs shift with model size.
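
The storage side of this scaling is simple arithmetic. The sketch below assumes an uncompressed index that stores one float32 vector per experience and a bank of 100,000 entries; both are illustrative assumptions, since the paper's bank size and index type are not stated in this review.

```python
def bank_bytes(num_experiences, hidden_dim, bytes_per_value=4):
    """Raw storage for an uncompressed bank of float32 embeddings."""
    return num_experiences * hidden_dim * bytes_per_value

BANK_SIZE = 100_000  # hypothetical bank size, chosen only for illustration

small = bank_bytes(BANK_SIZE, 2048)  # 3B-scale hidden dimension
large = bank_bytes(BANK_SIZE, 8192)  # 70B-scale hidden dimension

print(f"d=2048: {small / 1e9:.2f} GB")
print(f"d=8192: {large / 1e9:.2f} GB")
print(f"ratio: {large / small:.0f}x")
```

The 4× ratio holds for any bank size under these assumptions; only a compressed or quantized index would change it.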

7.2 Experience Bank Staleness

Experiences from early training (when the policy is weak) are encoded with an earlier version of $\theta$. As LoRA fine-tuning proceeds, the LLM’s hidden state space drifts, potentially misaligning old experiences with the current policy’s embedding space. The FIFO eviction policy partially addresses this: old experiences are evicted as new ones arrive. But if the bank is large and training is long, there will always be some proportion of “stale” experiences. The paper does not report the distribution of experience ages or their impact.

7.3 Task Diversity

The 13 evaluation tasks are drawn from established LLM benchmarks (ARC-C, GSM8K, HumanEval+, etc.). These benchmarks are predominantly academic QA-style tasks. They do not test experience learning in more complex agentic settings:

  • Web browsing tasks where state changes between steps (e.g., WebArena)
  • Multi-turn dialogue with environment feedback
  • Long-horizon tasks requiring dozens of tool calls
  • Tasks where wrong experiences could cause catastrophic failures (medical, legal)

7.4 Cold-Start Sensitivity

The cold-start threshold $M_\text{min}$ is a critical hyperparameter: if too small, the model attempts to retrieve from an empty or near-empty bank, potentially learning to ignore the retrieval signal. If too large, the model spends too many gradient steps without experience integration. The paper uses a fixed $M_\text{min}$ but does not report an ablation over its value.

8. Critical Assessment: Weaknesses & Improvements

8.1 Weaknesses and Flaws

(a) Single-model evaluation. Every result in the paper uses Qwen2.5-3B-Instruct. This is a significant weakness for a method that claims to be “general-purpose.” It is entirely possible that the latent RAG approach works well for 3B models but degrades for larger models (where the experience bank needs to be much larger to cover the richer hypothesis space) or for different model families (LLaMA, Mistral, etc.). Without multi-model experiments, claims about generality are unsupported.

(b) No variance reporting. The paper reports single-run results for all tasks. RL training is notoriously noisy — two runs of GRPO with different seeds can easily produce ±2–3% swings on reasoning benchmarks. The claimed gains over Search-R1 (average 6.8%) are larger than typical RL variance, but the gains on individual tasks (e.g., GPQA: 35.00 vs. 23.33 for Search-R1) could partly reflect seed luck. Standard deviations over multiple runs should be reported.

(c) FIFO eviction ignores experience quality. The experience bank uses FIFO eviction — when the bank is full, the oldest experience is removed regardless of its quality. A bad experience (reward = 0, incorrect reasoning trace) stored early will be retrieved and potentially disrupt generation until it ages out. A quality-aware eviction policy that retains high-reward experiences and discards low-reward ones would be much more sensible. The paper does not ablate this choice.

(d) The ranking benchmark is too narrow. The recommendation evaluation uses only two datasets (MovieLens-1m and a music dataset) with very similar sequential interaction structure. Both use NDCG@10 and MRR as metrics. There is no test on ranking tasks with different structure: document retrieval, code completion ranking, response selection in dialogue. The presented recommendation gains may not generalize to all ranking tasks.

(e) Summarization quality is a hidden dependency. Experiences are encoded through an LLM-produced textual summary $s = \mathcal{S}(x, y, z, r)$ before being embedded. The quality of this summary is critical: a bad summarizer that strips out key reasoning steps would produce uninformative embeddings. The paper uses the same LLM $p_\theta$ for summarization, which creates a dependency: early in training when $p_\theta$ is weak, summaries may be poor, leading to uninformative experience embeddings. This bootstrapping problem is not analyzed.

8.2 Limitations the Authors Understate

(a) The “on-policy alignment” argument has a catch. The paper argues that using the same LLM for embedding ensures on-policy alignment. But this means the embedding function changes at every gradient step as LoRA updates $\theta$. An experience $e$ stored at training step $t$ will have embedding $\mathbf{z} = \mathbf{h}_{\theta_t}(s)$. At training step $t + \Delta t$, $\theta$ has changed, so $\mathbf{h}_{\theta_{t+\Delta t}}(s) \neq \mathbf{h}_{\theta_t}(s)$. But the stored embedding $\mathbf{z}$ is not recomputed; it remains at its original value. So retrieval (which uses the current hidden state as query) becomes misaligned with the stored keys. The paper only partially addresses this via FIFO eviction.

(b) FAISS lookup cost is glossed over. The paper states that ExpWeaver “maintains token efficiency comparable to non-retrieval baselines.” This is true for context window tokens, but FAISS nearest-neighbor search on a large bank requires non-trivial CPU/GPU time per decoding step. With a bank of 100K experiences and $d = 2048$, and $T = 500$ decoding steps per query, that’s 500 FAISS lookups per query. The latency overhead is not measured in the paper.
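
A back-of-the-envelope count shows the scale of that overhead. The sketch below assumes exhaustive inner-product search (every lookup touches every stored vector), which is the worst case; an approximate index would do less work. The `retrieve_every` parameter is a hypothetical knob for retrieving only every few steps and is not something the paper describes.

```python
def lookups_per_query(decode_steps, retrieve_every=1):
    """Number of bank lookups when retrieving once every `retrieve_every` steps."""
    return -(-decode_steps // retrieve_every)  # ceiling division

def exhaustive_macs(bank_size, hidden_dim, num_lookups):
    """Multiply-accumulate operations for brute-force search over the whole bank."""
    return bank_size * hidden_dim * num_lookups

every_step = lookups_per_query(500)
macs = exhaustive_macs(100_000, 2048, every_step)
print(f"{every_step} lookups, {macs:.3e} multiply-accumulates per query")

sparse = lookups_per_query(500, retrieve_every=8)
print(f"{sparse} lookups if retrieving every 8th step")
```

At these sizes the brute-force search alone costs on the order of 10^11 multiply-accumulates per generated response, which is why a measured latency number would be more informative than the token count.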

(c) Sensitivity to the summarizer prompt. The textual summary $s$ is produced by prompting $p_\theta$. The specific summarization prompt used by the paper is not provided in the main body (presumably in the appendix). The embedding quality may be sensitive to prompt wording, which would limit reproducibility.

8.3 Concrete Improvement Suggestions

(1) Quality-aware eviction. Replace FIFO with a reward-weighted LRU strategy: when the bank is full, prefer to evict experiences with lower reward. Maintain a priority queue keyed by $(r_i, \text{access\_time})$ to balance quality and recency. This would prevent the bank from being polluted by failed trajectories.
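
A minimal sketch of this suggestion, using a heap keyed by (reward, last access time) with lazy invalidation of outdated heap rows. This is my own illustration of the proposed policy, not anything from the paper or its repository; the class and method names are hypothetical.

```python
import heapq
import itertools

class RewardAwareBank:
    """Fixed-capacity bank that evicts the lowest-reward, least-recently-used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = {}  # exp_id -> (reward, last_access_tick, payload)
        self._heap = []     # (reward, tick, exp_id); may contain stale rows
        self._clock = itertools.count()

    def _push(self, exp_id, reward, payload):
        tick = next(self._clock)
        self._entries[exp_id] = (reward, tick, payload)
        heapq.heappush(self._heap, (reward, tick, exp_id))

    def add(self, exp_id, reward, payload=None):
        """Insert an experience, evicting one first if the bank is full."""
        if exp_id not in self._entries and len(self._entries) >= self.capacity:
            self._evict()
        self._push(exp_id, reward, payload)

    def touch(self, exp_id):
        """Record a retrieval hit so the entry counts as recently used."""
        reward, _, payload = self._entries[exp_id]
        self._push(exp_id, reward, payload)

    def _evict(self):
        while self._heap:
            _, tick, exp_id = heapq.heappop(self._heap)
            current = self._entries.get(exp_id)
            # Skip rows made stale by a later touch() or re-add.
            if current is not None and current[1] == tick:
                del self._entries[exp_id]
                return exp_id
        return None

    def __contains__(self, exp_id):
        return exp_id in self._entries

    def __len__(self):
        return len(self._entries)

bank = RewardAwareBank(capacity=2)
bank.add("failed_trace", reward=0.0)
bank.add("good_trace", reward=1.0)
bank.add("new_trace", reward=1.0)  # bank is full, so the reward-0 entry is evicted
print("failed_trace" in bank, "good_trace" in bank)
```

One design caveat: as written, a new low-reward experience still displaces an existing entry when the bank is full. A stricter variant could refuse to admit experiences whose reward is below the current minimum.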

(2) Online embedding refresh. Every $N$ training iterations (for example, at each LoRA checkpoint), re-encode a random sample of stored experiences with the updated $\theta$ to counteract embedding drift. This could be implemented as a background process on CPU.

(3) Multi-scale retrieval. Instead of a single retrieval at every decoding step $t$, experiment with retrieving only at key semantic boundaries (e.g., at step-boundaries in chain-of-thought reasoning, or at function boundaries in code generation). This would reduce the FAISS lookup overhead by 5–10× while potentially retaining most of the benefit.

(4) Hierarchical experience aggregation. Instead of a single experience bank, maintain a two-level hierarchy: a small hot-cache of high-quality recent experiences (fast GPU FAISS) and a cold archive of older experiences (CPU-side index). Most retrievals would hit the hot cache, dramatically reducing latency.

(5) Extend to multi-turn and long-horizon evaluation. Testing ExpWeaver on WebArena or SWE-bench (which require dozens of sequential tool calls) would reveal whether latent experience learning helps in truly long-horizon agent tasks, where the value of “I tried this approach before and it failed” is most pronounced.

(6) Scale experiments to 7B and 13B. The most important missing experiment is simply running the same setup on larger models. If ExpWeaver still achieves 5%+ gains at 7B, it would substantially strengthen the claims. If the gains diminish, this reveals important information about when latent experience learning is most valuable.

9. Reproducibility Notes

| Artifact | Status |
| --- | --- |
| Code | Released at https://github.com/ulab-uiuc/ExpWeaver |
| Base model | Qwen2.5-3B-Instruct (publicly available on HuggingFace) |
| Training library | Unsloth (public), TRL (public) |
| FAISS index | FAISS (Meta, open-source) |
| Benchmarks | ARC-C, CommonsenseQA, GPQA, MMLU, OBQA, GSM8K, MATH, HumanEval+, MBPP+ (all public) |
| Hardware | 4× NVIDIA A6000 (48GB each); reproducible on comparable hardware |

Key parameters to replicate:

  • LoRA rank=32, scaling=64 on all attention+FFN layers
  • GRPO group size G=4, KL β=0.005
  • Experience bank retrieval K=3
  • Mixing coefficient init: a_min=0.98
  • Cold-start: skip retrieval when |M| < M_min (exact M_min value should be in appendix)
  • Cosine learning rate decay with 10% warmup

Potential pitfalls:

  • The FAISS index must be rebuilt/reset when switching between ExpBench scenarios (the paper trains separate models per scenario)
  • Gradient checkpointing is used due to memory constraints — disabling it will cause OOM on A6000 GPUs at effective batch size 32
  • The summarization prompt used to generate $s = \mathcal{S}(x, y, z, r)$ is critical for experience quality; refer to the code repository for the exact prompt template

10.1 How ExpWeaver Differs from Memory Systems in LLM Agents

There is an important distinction between two types of memory in LLM agent research:

Intra-episode memory (MemGPT, MemoryBank, CoALA): these systems manage what information is held in the working context during a single multi-turn interaction. For example, MemGPT orchestrates the transfer of older context into external storage when the context window fills up, and retrieves it on demand within the same task. These systems address within-task coordination: how do the reasoning steps of a single task connect to each other?

Inter-episode experience learning (ReasoningBank, ExpeL, IRCoT, Search-R1, ExpWeaver): these systems transfer knowledge between different tasks and episodes. The agent finished Task A on Monday; on Wednesday it faces Task B with similar structure. Can knowledge from Task A help with Task B? This is what ExpWeaver targets.

ExpWeaver’s contribution is specifically at the inter-episode level, and within that space, it is the first to operate the entire retrieval-integration loop in latent (hidden state) space rather than text space.

10.2 Relationship to Retrieval-Augmented Generation Literature

Standard RAG (Lewis et al., 2020) was designed for knowledge retrieval — fetching factual documents to supplement the LLM’s parametric knowledge. The retriever is a frozen dense encoder (e.g., DPR), and retrieved documents are appended as context.

ExpWeaver differs in three key ways:

  1. What is retrieved: not external knowledge documents but the agent’s own past trajectories and reasoning strategies
  2. How retrieval is done: in latent hidden state space, not text embedding space
  3. When retrieval happens: at every autoregressive decoding step, not just once at the beginning of a query

This makes ExpWeaver more closely related to recent work on “memory-augmented neural networks” (Graves et al., 2016) and “neural episodic control” (Pritzel et al., 2017) from the DRL literature, where an agent learns to read from and write to a differentiable memory at every step. ExpWeaver can be viewed as a modern, RL-trained instantiation of this idea for language models.

10.3 The GRPO Choice and Its Implications

Using GRPO rather than PPO has practical implications for ExpWeaver’s design. PPO requires a value network $V_\phi$ that estimates the expected future return. Training $V_\phi$ requires access to a dense reward signal across the generation trajectory, which is difficult to obtain for most natural language tasks (rewards are typically sparse, available only after the full response is generated).

GRPO sidesteps this by using group relative normalization: within a batch of $G$ responses to the same query, the mean reward serves as the value estimate. This works well when $G$ is large enough to provide a stable baseline and when reward variance within a group is meaningful (i.e., the policy is not yet optimal and there is room for relative discrimination).
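
The group-relative normalization can be sketched in a few lines. This follows the commonly used form (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` value are assumptions here, and implementations differ on both.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 4 responses to one query: two correct (reward 1), two wrong (reward 0).
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
print(mixed)

# If every response gets the same reward, there is nothing to contrast.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

The second call illustrates the point about reward variance: when all responses in a group score the same, every advantage is zero and the group contributes no gradient signal.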

For ExpWeaver, GRPO is particularly well-suited because:

  • The LoRA parameters $\theta$ and the experience parameters $\psi_\text{exp}$ can be jointly updated in the same gradient step (no separate critic to train)
  • Group sampling naturally generates diverse responses that populate the experience bank with varied quality levels, from which the model can learn the contrast between successful and failed strategies

10.4 Why Norm Preservation Matters in Practice

Let me explain intuitively why the norm-preserving interpolation (Eq. 12) is critical, not just mathematically elegant.

Modern transformer architectures include LayerNorm (or RMSNorm) layers whose behavior depends critically on the magnitude of their input. A LayerNorm with parameters $\gamma$ and $\beta$ computes:

$$\text{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x}) + \epsilon} + \beta$$

If $\mathbf{x} = \mathbf{h}'_t$ has the same norm as $\mathbf{h}_t$, the LayerNorm sees a numerically stable input and operates in its trained regime. But if the integration of $\tilde{\mathbf{e}}_t$ were done naively as $\mathbf{h}'_t = \mathbf{h}_t + \tilde{\mathbf{e}}_t$, the norm of $\mathbf{h}'_t$ would be approximately $\sqrt{2}\,\|\mathbf{h}_t\|$ (if the two vectors are orthogonal and of comparable norm), about 41% larger. Across multiple layers, this compounding effect would cause the model’s intermediate representations to explode in magnitude, potentially destabilizing generation.

The norm-preserving design in Eq. 12 ensures this doesn’t happen. The mathematical guarantee $\|\mathbf{h}'_t\| \approx \|\mathbf{h}_t\|$ means the enhanced hidden state can be passed directly to subsequent transformer layers without disrupting the fine-tuned LayerNorm statistics. This is why ExpWeaver can be trained end-to-end without special normalization tricks or gradient clipping that would otherwise be needed.
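
The 41% figure and the norm-preserving alternative are easy to check numerically. The sketch below is a simplification of Eq. 12: it uses a single scalar mixing value instead of a per-dimension vector, sets the input gate to 1, and picks an experience vector that is exactly orthogonal to the hidden state with the same norm. All of these are assumptions chosen to make the arithmetic clean.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def naive_add(h, e):
    """Plain residual addition h + e."""
    return [x + y for x, y in zip(h, e)]

def norm_preserving_mix(h, e, a):
    """Scalar-a simplification of Eq. 12: a * h + sqrt(1 - a^2) * e."""
    b = math.sqrt(1.0 - a * a)
    return [a * x + b * y for x, y in zip(h, e)]

h = [3.0, 0.0, 4.0, 0.0]  # norm 5
e = [0.0, 5.0, 0.0, 0.0]  # norm 5, orthogonal to h

added = naive_add(h, e)
mixed = norm_preserving_mix(h, e, a=0.98)

print(f"|h| = {norm(h):.3f}")
print(f"|h + e| = {norm(added):.3f}  (ratio {norm(added) / norm(h):.3f})")
print(f"|mix| = {norm(mixed):.3f}")
```

In this idealized case the mix keeps the norm at 5 because $a^2 + (1 - a^2) = 1$. With non-orthogonal vectors or unequal norms the equality becomes approximate, which matches the $\approx$ in the guarantee.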

10.5 The Cold-Start Strategy as Curriculum Learning

The cold-start mechanism ($|\mathcal{M}| < M_\text{min}$ → skip retrieval) is a form of curriculum learning. In standard curriculum learning, the training distribution starts simple and progressively becomes harder. Here, the “curriculum” is:

  1. Phase 1 (cold start): Train on the base task without experience retrieval. The model learns the task structure and builds up a meaningful experience bank. Since the model starts from a pretrained LLM checkpoint, this phase is relatively fast.

  2. Phase 2 (warm experience): Once $|\mathcal{M}| \geq M_\text{min}$, the model is allowed to use the experience bank. Because the bank now contains real (if imperfect) experiences from Phase 1, the retrieval signal is meaningful. The gated integration with $a_\text{min} = 0.98$ ensures the model initially trusts its own reasoning over the bank.

  3. Phase 3 (converged): As training continues, the bank fills with higher-quality experiences (the model is getting better, so more trajectories have high rewards). The model gradually lowers $\mathbf{a}_t$ in informative positions and learns to integrate experience effectively.

This three-phase curriculum is implicit in the algorithm but is important for training stability. Without it (i.e., starting retrieval from an empty bank), the model would learn to ignore the retrieval signal from the outset, potentially converging to a local optimum that never uses experience.
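
The switch between Phase 1 and Phase 2 amounts to a single size check on the bank. The sketch below shows that check and estimates when it first passes. The threshold of 64 and the assumption that each training step adds exactly one group of G = 4 experiences are both hypothetical, since the paper's $M_\text{min}$ and batching are not given in this review.

```python
def should_retrieve(bank_size, m_min):
    """Cold-start rule: retrieval is enabled only once the bank holds m_min entries."""
    return bank_size >= m_min

def first_retrieval_step(m_min, added_per_step):
    """Index of the first training step that starts with retrieval enabled."""
    bank_size, step = 0, 0
    while not should_retrieve(bank_size, m_min):
        bank_size += added_per_step  # experiences written during this step
        step += 1
    return step

M_MIN = 64          # hypothetical threshold
ADDED_PER_STEP = 4  # hypothetical: one group of G = 4 responses per step

print(first_retrieval_step(M_MIN, ADDED_PER_STEP))
```

Under these assumptions steps 0 through 15 run without retrieval and step 16 is the first to use the bank, so the choice of $M_\text{min}$ directly sets the length of Phase 1.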

10.6 Comparison to Episodic Memory Approaches

A related but distinct concept is the episodic memory architecture used in meta-learning and continual learning. Systems like Neural Episodic Control (NEC) and experience replay in DRL maintain a buffer of (state, action, reward) transitions and use nearest-neighbor lookup to make decisions. ExpWeaver shares this spirit but has important differences:

| Aspect | DRL Episodic Memory (e.g., NEC) | ExpWeaver |
| --- | --- | --- |
| Memory units | State-action-reward transitions | Full task trajectories (x, z, y, r) |
| Retrieval query | Current environment state | Current decoding hidden state h_t |
| Retrieval granularity | Once per action step | Once per token generation step |
| Integration method | Direct value function output | Gated residual to hidden state |
| Training signal | Q-learning / TD | GRPO policy gradient |
| Task type | Sequential decision in fixed MDP | Language generation across diverse tasks |

The key insight ExpWeaver borrows from episodic memory: the most useful “memory” is one that is directly retrievable from the current decision state. By using the decoding hidden state as the retrieval query, ExpWeaver operationalizes this principle in the language generation context.

11. Conclusion

ExpWeaver makes a convincing case that the right place to integrate past experience into LLM agent generation is the model’s hidden state space, not the context window. The core technical contributions — the latent experience bank, the on-policy embedding via the same LLM, the cross-attention aggregation with a learnable query token, the norm-preserving gated interpolation, and the cold-start curriculum — are each individually motivated and collectively elegant. The GRPO training framework ties everything together in a single end-to-end optimization loop.

The results are strong: SOTA on 12/13 tasks, 1.5–2× better token efficiency than text-RAG methods, and impressive zero-shot cross-domain generalization. For practitioners building LLM agents, the practical message is clear: if your agent is doing repeated tasks of the same type, storing and retrieving past trajectories in latent space is a much more efficient mechanism than text-based RAG, and the ExpWeaver training framework provides a principled way to learn this capability from scratch.

The main gaps in the current paper (single-model evaluation, no variance reporting, FIFO eviction without quality awareness) leave open questions about robustness at scale. These are addressable with incremental experiments and would substantially strengthen the contribution. Regardless, ExpWeaver represents an important step toward LLM agents that genuinely learn and improve from experience, rather than merely retrieving it.

The broader research direction this work points toward is one where the distinction between “agent memory” and “agent parameters” becomes blurry: if experience embeddings are computed by the LLM itself and integrated into its decoding process via learned gates, the agent’s memory is inseparable from its reasoning. This is perhaps the most exciting long-term implication of the ExpWeaver line of work.

12. Deep Dive: Why GPQA Benefits Most

The most striking individual result in Table 2 is the GPQA gain: ExpWeaver achieves 35.00% vs. 23.33% for Search-R1 and 21.67% for R1. A gain of over 11 percentage points on a benchmark that evaluates graduate-level scientific reasoning deserves closer attention.

What GPQA tests. GPQA (Rein et al., 2024) contains questions written by domain experts to be genuinely difficult — not solvable by Google search or shallow reasoning. Questions require multi-step inference, often combining knowledge from multiple sub-fields. The benchmark name includes “Google-Proof” specifically because the answers are not easily retrieval-completable.

Why text-based retrieval underperforms here. Search-R1 retrieves text passages from its experience bank based on surface-level semantic similarity (text embeddings). For GPQA, two problems with similar surface form (both asking about quantum chemistry, say) may require completely different reasoning strategies. Text similarity retrieves superficially similar problems that use similar vocabulary but don’t share the key reasoning structure. The retrieved experience is then concatenated into the context and may actively confuse the model if the analogy is misleading.

Why latent retrieval outperforms. ExpWeaver’s retrieval query is the current hidden state $\mathbf{h}_t$ at each decoding step. By the time the model has generated half a reasoning trace for a GPQA question, its hidden state encodes a rich representation of the reasoning strategy in progress: the sequence of inferences made, the concepts activated, the uncertainty level. Two problems that use different vocabulary but require the same reasoning structure (say, both requiring constraint propagation across multiple independent variables) would have similar hidden states at the corresponding decoding positions, even if their surface text is unrelated.

This is the core argument for latent retrieval: for tasks requiring structural reasoning rather than factual recall, similarity in hidden state space better captures problem-solving similarity than similarity in text space. The GPQA result is the strongest empirical confirmation of this hypothesis in the paper.

Caveat. The 11.67-point gain is large enough that random seed variance alone is unlikely to explain it. But GPQA has a small test set (~450 questions), so even a few questions answered differently between runs could produce 2-3% swings. Multiple runs with confidence intervals would make this result bulletproof.
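
The sampling-noise part of this caveat can be quantified with the binomial standard error of an accuracy estimate. The sketch below uses the approximate test-set size of 450 quoted above and treats the two accuracies as independent, which is a simplification because both models are scored on the same questions. It captures only finite-test-set noise, not variance across RL seeds.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)

N_QUESTIONS = 450  # approximate size quoted above, not an exact count

se_expweaver = accuracy_standard_error(0.3500, N_QUESTIONS)
se_search_r1 = accuracy_standard_error(0.2333, N_QUESTIONS)

# Standard error of the difference under the independence assumption.
se_gap = math.sqrt(se_expweaver ** 2 + se_search_r1 ** 2)
gap_in_se = (0.3500 - 0.2333) / se_gap

print(f"SE(ExpWeaver) = {100 * se_expweaver:.2f} points")
print(f"SE(Search-R1) = {100 * se_search_r1:.2f} points")
print(f"SE(gap) = {100 * se_gap:.2f} points, gap = {gap_in_se:.1f} standard errors")
```

Each accuracy carries a standard error of roughly 2 points and the difference roughly 3 points, so the 11.67-point gap is several standard errors wide. That supports the claim that the gain is real while confirming that swings of 2 to 3 points are within noise.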

13. Quick Reference: Key Symbols and Equations

For quick reference, here are all the key symbols and equations from ExpWeaver in one place:

| Symbol | Meaning |
| --- | --- |
| $x$ | Query input |
| $y$ | Generated output |
| $z$ | Reasoning trace |
| $y^*$ | Ground-truth reference |
| $r$ | Scalar reward from environment |
| $s$ | Textual summary of experience |
| $\mathbf{z} \in \mathbb{R}^d$ | Latent experience embedding |
| $e = (x, z, y, y^*, r, s, \mathbf{z})$ | Full experience tuple |
| $\mathcal{M}$ | Experience bank |
| $K$ | Number of retrieved experiences |
| $\mathbf{h}_t$ | Hidden state at decoding step $t$ |
| $\mathbf{u}$ | Learnable cross-attention query token |
| $\mathbf{Z}_t$ | Stack of retrieved experience embeddings |
| $\mathbf{e}_t$ | Aggregated experience signal |
| $\tilde{\mathbf{e}}_t$ | Magnitude-scaled experience signal |
| $\mathbf{r}_t, \mathbf{i}_t$ | Retention and input gate vectors |
| $\mathbf{a}_t$ | Mixing coefficient vector |
| $\mathbf{h}'_t$ | Enhanced hidden state |
| $G$ | GRPO group size |
| $\hat{A}_i$ | Group-normalized advantage |
| $\psi_\text{exp} = \{\mathbf{u}, \mathbf{W}_r, \mathbf{W}_i, \boldsymbol{\lambda}\}$ | Experience integration parameters |

The five equations that define ExpWeaver:

Experience embedding:

$$\mathbf{z} = \mathbf{h}_\theta(s) \tag{6}$$

Latent retrieval:

$$\mathcal{C}_t = \text{TopK}_{e \in \mathcal{M}}\!\left(\text{sim}(\mathbf{h}_t, \mathbf{z}_e), K\right) \tag{8}$$

Cross-attention aggregation:

$$\mathbf{e}_t = \text{LN}\!\left(\text{CrossAttn}(\mathbf{u}, \mathbf{Z}_t, \mathbf{Z}_t)\right) \tag{9}$$

Mixing coefficient:

$$\mathbf{a}_t = \exp\!\left(-\alpha \cdot \text{softplus}(-\boldsymbol{\lambda}) \odot \mathbf{r}_t\right) \tag{11}$$

Norm-preserving integration:

$$\mathbf{h}'_t = \mathbf{a}_t \odot \mathbf{h}_t + \sqrt{1 - \mathbf{a}_t^2} \odot (\mathbf{i}_t \odot \tilde{\mathbf{e}}_t) \tag{12}$$
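
As a reading aid for the last two equations, here is an elementwise sketch of Eq. 11 and Eq. 12 in plain Python. It takes the gate vectors and the scaled experience signal as given inputs, since their defining equation is not reproduced in this reference list. The value of `alpha`, the assumption that the gates lie in [0, 1], and all example numbers are hypothetical.

```python
import math

def softplus(x):
    """softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def mixing_coefficient(lam, r, alpha):
    """Eq. 11, elementwise: a = exp(-alpha * softplus(-lam) * r)."""
    return [math.exp(-alpha * softplus(-l) * ri) for l, ri in zip(lam, r)]

def integrate(h, e_tilde, a, i):
    """Eq. 12, elementwise: h' = a * h + sqrt(1 - a^2) * (i * e_tilde)."""
    return [
        ak * hk + math.sqrt(1.0 - ak * ak) * ik * ek
        for hk, ek, ak, ik in zip(h, e_tilde, a, i)
    ]

ALPHA = 8.0                   # hypothetical scale constant
h = [3.0, 4.0]                # current hidden state (toy values)
e_tilde = [5.0, -1.0]         # scaled experience signal (toy values)
lam = [2.0, 2.0]              # learnable parameter (toy values)

closed = mixing_coefficient(lam, r=[0.0, 0.0], alpha=ALPHA)  # retention gate shut
opened = mixing_coefficient(lam, r=[1.0, 1.0], alpha=ALPHA)  # retention gate open

print(closed, integrate(h, e_tilde, closed, i=[1.0, 1.0]))
print(opened, integrate(h, e_tilde, opened, i=[1.0, 1.0]))
```

Two limiting cases are visible here. When $\mathbf{r}_t = 0$ the coefficient is exactly 1 and the hidden state passes through unchanged. When $\mathbf{r}_t = 1$ the coefficient drops to its floor, $\exp(-\alpha \cdot \text{softplus}(-\boldsymbol{\lambda}))$, which is where experience has the most influence.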