MRAgent: Why Memory Should Be Reconstructed, Not Retrieved

Review date: 2026-06-22 Review author: Zhongzhu Zhou Paper reviewed: Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents Paper authors: Shuo Ji, Yibo Li, Bryan Hooi arXiv: 2606.06036v1, 2026-06-04 Venue/status: ICML 2026 (Proceedings of the 43rd International Conference on Machine Learning, Seoul)

Short Answer

Every time you ask your phone’s voice assistant “What did I say I wanted for dinner on Tuesday?” and it searches through some flat conversation log, it is using passive retrieval: pick the top-k memory chunks that look similar to your query and hand them to the model. MRAgent argues this is the wrong mental model for memory — and cognitive neuroscience agrees. Human memory recall is not a search engine lookup; it is an active reconstruction process where intermediate cues chain together into a coherent recall episode.

MRAgent (Memory Reasoning Architecture for LLM Agents) operationalizes this idea. It represents an agent’s interaction history as a heterogeneous Cue–Tag–Content graph, where fine-grained cues (entities, keywords) are linked to episodic memory units through intermediate associative tags. At query time, instead of doing one-shot similarity retrieval, MRAgent runs an iterative loop: the LLM reasons over currently available cues and tags, selects which graph branches to explore next, traverses those branches, collects evidence, and updates its reasoning trajectory — until it has accumulated enough context to answer confidently.

The paper proves, via an approximation-theoretic argument (Theorem 4.1), that active reconstruction is strictly more expressive than any passive retrieval policy given the same retrieval budget. Empirically, on the LoCoMo benchmark (50 long conversations, up to 35 sessions), MRAgent achieves a 23.3% relative improvement in LLM-Judge score over the best baseline under the Gemini backbone, and 12.4% under Claude. On LongMemEval (500 questions, ~115K-token histories), MRAgent achieves 72.95 J-score vs the next-best 54.92 — a 32% relative gain. Most strikingly, it does this while consuming only 118k tokens per sample, versus 632k for A-Mem and 3.27M for LangMem — a 5x to 28x reduction in memory access cost.

This review unpacks every piece of machinery behind these results: the graph construction pipeline, the traversal operators, the active reconstruction loop, the formal proof of superiority, and what the authors glossed over in their design choices.

1. Prerequisites

1.1 The Memory Problem in LLM Agents

A language model agent operates over an extended horizon: multiple conversation sessions, tool calls, past observations. The context window — even at 1M tokens — cannot hold an unbounded history. So agents need external memory systems: some storage that holds past interactions, and some retrieval mechanism that pulls relevant pieces into the current context at query time.

The key requirements for a good agent memory system are:

  • Faithfulness: Factual content should not be distorted during storage or retrieval.
  • Efficiency: Retrieval should not bloat the context window with irrelevant content.
  • Compositionality: The system must support reasoning over multiple pieces of evidence, especially when no single memory chunk contains the full answer.
  • Adaptivity: The retrieval process should adapt based on intermediate findings — if the first retrieved chunk reveals a new entity, the system should be able to follow that lead.

Current systems fail on the last two requirements. Let us examine why.

1.2 Classical Retrieval-Augmented Generation (RAG) and Its Limits

Standard RAG (Lewis et al., 2020) stores documents in a vector store and, given a query xx, computes a cosine similarity against all stored embeddings, returning the top-kk most similar chunks:

πpRAG(x)=TopK({sim(x,v)}vV,k)(1)\pi_{\text{p}}^{\text{RAG}}(x) = \text{TopK}\bigl(\{\text{sim}(x, v)\}_{v \in \mathcal{V}},\, k\bigr) \tag{1}

This is the simplest form of passive retrieval: the set of retrieved items is determined entirely by the query xx, with no ability to incorporate intermediate evidence discovered during retrieval.

The failure mode is illustrated in the paper’s running example:

Query: “What did Caroline do when Nate won his second video game tournament?”

  • Similarity search for “video game tournament” returns many Nate-related events about tournaments — but none directly about Caroline.
  • The correct answer requires inferring from the text “Nate just won another regional video game tournament in July” that the event occurred in July, then retrieving what Caroline was doing in July.
  • A passive retriever, given the query at the start, has no way to make this inferential leap.
Passive Retrieval (RAG):
┌──────────────────────────────────────────┐
│  Query: "What did Caroline do when       │
│         Nate won his 2nd tournament?"    │
└─────────────────────┬────────────────────┘
                      │  encode query

┌──────────────────────────────────────────┐
│  Memory: [m1: Nate won 1st tournament]   │
│          [m2: Nate won 2nd tournament]   │
│          [m3: Caroline traveled in July] │
│          ...                             │
└─────────────────────┬────────────────────┘
                      │  TopK similarity to query

              Returns m1, m2 (Nate events)
              ❌ misses Caroline's July activity
              ❌ no way to infer temporal cue "July"

Graph-based memory systems like A-Mem address some compositionality by expanding neighborhoods in a predefined N-hop manner:

Vsim=TopK({sim(x,v)}vV,k),πgraph(x)=VsimNeighbor(Vsim)(2)\mathcal{V}^{\text{sim}} = \text{TopK}(\{\text{sim}(x, v)\}_{v \in \mathcal{V}},\, k),\quad \pi_{\text{graph}}(x) = \mathcal{V}^{\text{sim}} \cup \text{Neighbor}(\mathcal{V}^{\text{sim}}) \tag{2}

But this still fails: expanding N-hop neighbors from Nate’s tournament events does not reach Caroline’s July activity unless there is an explicit graph edge connecting them — which there typically is not.

The fundamental issue: passive retrieval can only select what is directly similar to the query or pre-defined neighbors. It cannot infer new retrieval cues from intermediate evidence.

1.3 Knowledge Graphs and Heterogeneous Graphs

A knowledge graph (KG) is a directed graph G=(V,E,R)\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{R}) where:

  • V\mathcal{V} is a set of nodes (entities or concepts),
  • EV×R×V\mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V} is a set of typed edges (triples),
  • R\mathcal{R} is a set of relation types.

A heterogeneous KG allows nodes of different types (e.g., persons, events, facts) connected by different relation types. MRAgent’s memory graph is a heterogeneous KG where nodes have types {\{cue, tag, episode, semantic, topic}\} and edges are the R\mathcal{R} relations.

A key property leveraged by MRAgent: once you retrieve a node, you can follow its outgoing edges to discover neighboring nodes that may not be similar to the original query text. This is what enables evidence chaining across multiple reasoning steps.

1.4 Cognitive Neuroscience: How Human Memory Actually Works

The paper’s design is deeply inspired by cognitive neuroscience. The key insight, from Rugg & Renoult (2025) and Rashid et al. (2016), is that memory recall is a reconstruction process:

  • You begin with a contextual cue (e.g., “dinner on Tuesday”).
  • This cue triggers the reactivation of engrams — compact neural memory traces that encode patterns from past experiences.
  • Engram activation propagates through associative links to related engrams, progressively reconstructing a coherent memory episode.
  • The process is iterative and evidence-dependent: intermediate reactivated patterns bias subsequent recall, allowing the brain to follow a chain of associations until a full episode is recovered.

This is exactly unlike text search. The brain does not rank every stored memory by similarity to the cue and pick the top-k. It follows a chain of semantic associations, updating its “search state” at each step.

MRAgent implements this computationally:

  • Cues correspond to contextual keywords / entities (analogous to contextual cues in the brain).
  • Tags correspond to engrams — compact associative patterns that link cues to episodic content.
  • Episodes correspond to the actual memory content (specific past events).
  • Semantic units correspond to stable long-term knowledge (personality attributes, preferences).
  • Active reconstruction replaces the passive engram-independent retrieval of RAG.
Human Memory                  MRAgent Analogue
──────────────────────────────────────────────────
Contextual cue           →   Query cue (extracted entity/keyword)
Engram                   →   Tag (associative label in graph)
Episodic memory          →   Episode node (concrete event)
Semantic memory          →   Semantic node (stable fact)
Reconstruction process   →   Iterative LLM-guided traversal

1.5 Sequential Decision Processes and the Value of Adaptive Policies

In reinforcement learning terms, memory retrieval can be modeled as a sequential decision process: at each step tt, the agent is in state S(t)S^{(t)} (accumulated evidence so far) and takes an action (which memory node/branch to explore next), observing a new evidence item. A passive retrieval policy πp\pi_{\text{p}} is stateless: {v(1),,v(T)}=πp(x)\{v^{(1)}, \ldots, v^{(T)}\} = \pi_{\text{p}}(x) — all TT retrieval steps are determined by the query xx alone, before any evidence is observed. An active retrieval policy πa\pi_{\text{a}} is stateful:

v(t)=πa(t)(x,S(t1)),S(t)=S(t1){v(t)}(3)v^{(t)} = \pi_{\text{a}}^{(t)}(x, S^{(t-1)}),\quad S^{(t)} = S^{(t-1)} \cup \{v^{(t)}\} \tag{3}

The tt-th retrieval step depends on everything seen so far. This is the key advantage: an adaptive policy can follow newly discovered evidence, while a passive policy commits to all retrieval targets before seeing any evidence.

2. Problem Formulation: Active Memory Access

2.1 Memory System Definition

The paper formalizes the memory system as follows. Let M\mathcal{M} be an external memory consisting of memory units V={v1,,vN}\mathcal{V} = \{v_1, \ldots, v_N\}. Given a query xx, memory access selects memory units over TT steps. Let S(t)={v(1),,v(t)}S^{(t)} = \{v^{(1)}, \ldots, v^{(t)}\} denote the accumulated evidence after step tt.

Passive retrieval policy πp\pi_{\text{p}}: selects all memory units solely based on the query:

{v(1),,v(T)}=πp(x)(4)\{v^{(1)}, \ldots, v^{(T)}\} = \pi_{\text{p}}(x) \tag{4}

Active reconstruction policy πa(t)\pi_{\text{a}}^{(t)}: selects the next unit conditioned on the evolving evidence:

v(t)=πa(t) ⁣(x,S(t1)),S(t)=S(t1){v(t)}(5)v^{(t)} = \pi_{\text{a}}^{(t)}\!\bigl(x,\, S^{(t-1)}\bigr),\quad S^{(t)} = S^{(t-1)} \cup \{v^{(t)}\} \tag{5}

This distinction is fundamental. A passive retriever must “guess” all retrieval targets upfront based on the query alone. An active reconstructor can follow evidence leads: if step tt retrieves a node that mentions “July”, step t+1t+1 can specifically target events in July that were not retrievable from the original query alone.

2.2 Why Passive Retrieval Fails on Complex Queries

The paper identifies three concrete weaknesses of passive retrieval:

  1. Inability to update retrieval strategy mid-course. A passive policy cannot incorporate intermediate evidence (e.g., “July” being the temporal anchor).
  2. Noise accumulation from fixed aggregation. Pre-constructed N-hop neighbor expansion from a seed set often includes irrelevant nodes — no mechanism to prune off-topic branches.
  3. Heavy reliance on pre-constructed structures. Graph-based passive retrievers rely on predefined edge types; if the right connection was not encoded during construction, retrieval fails.

Active reconstruction addresses all three: it adapts in real-time (1), prunes irrelevant branches based on accumulated context (2), and can infer new retrieval cues dynamically (3).

3. Associative Memory System: Cue–Tag–Content Graph

MRAgent organizes agent memory as a heterogeneous graph G=(C,V,R)\mathcal{G} = (\mathcal{C}, \mathcal{V}, \mathcal{R}) — a Cue–Tag–Content structure — that encodes semantic and structural associations between memory elements.

3.1 Graph Nodes and Node Types

Figure 1: Cue–Tag–Content Memory Graph Architecture

graph TD
    subgraph CueLayer["Cue Layer"]
        C1["Nate"]
        C2["video game tournament"]
        C3["Caroline"]
    end

    subgraph TagLayer["Tag Layer"]
        T1["Tournament Victory"]
        T2["Travel-Location"]
        T3["Personality"]
    end

    subgraph EpisodeLayer["Episode Layer"]
        E1["e1: Nate won 1st tournament May"]
        E2["e2: Nate won 2nd tournament July"]
        E3["e3: Caroline traveling in July"]
    end

    subgraph SemanticLayer["Semantic Layer"]
        S1["s1: Nate enjoys movies and games"]
        S2["s2: Caroline is a dancer"]
    end

    subgraph TopicLayer["Topic Layer"]
        Topic1["tau1: Nate gaming activities"]
    end

    C1 -- "Tournament Victory" --> E1
    C2 -- "Tournament Victory" --> E1
    C1 -- "Tournament Victory" --> E2
    C2 -- "Tournament Victory" --> E2
    C3 -- "Travel-Location" --> E3
    C1 -- "Personality" --> S1
    C3 -- "Profession" --> S2
    Topic1 --> E1
    Topic1 --> E2

The three core node types are:

  • Cues cCc \in \mathcal{C}: Fine-grained keywords such as entity names, actions, or contextual descriptors. Examples: “Nate”, “video game tournament”, “Caroline”, “July”.

  • Contents vVv \in \mathcal{V}: Specific memory units, further divided into:

    • Episode vVev \in \mathcal{V}^e: Concrete event with timestamp (e.g., “Nate won his first video game tournament last week”).
    • Semantic vVsv \in \mathcal{V}^s: Stable abstract knowledge (e.g., “Nate is bright and bold”).
    • Topic vVτv \in \mathcal{V}^\tau: Higher-level abstraction summarizing recurring episode patterns (e.g., “Nate’s gaming activities”).
  • Associations RC×G×V\mathcal{R} \subseteq \mathcal{C} \times \mathcal{G} \times \mathcal{V}: Each triple (c,g,v)R(c, g, v) \in \mathcal{R} connects a cue cc to a content node vv through a tag gg — an associative label encoding the semantic relation between cue and content.

3.2 The Role of Tags as Semantic Bridges

Tags are the key innovation distinguishing MRAgent’s graph from a standard knowledge graph. In a standard KG, edges are predefined relation types (is-a, located-in, etc.). Tags in MRAgent are LLM-extracted associative labels that capture the relational pattern of an episode: “Tournament Victory”, “Financial Decision”, “Creative Work Submission”.

Why this matters: two cues might both connect to an episode via the same tag, even though neither cue directly mentions the other. For example, the query mentions “Nate” and “video game tournament”, and the tag “Tournament Victory” connects them to episode e2e_2. But e2e_2‘s timestamp is July, and Caroline’s travel event e3e_3 has timestamp July — now the LLM can construct a new cue “July” and follow it to e3e_3. This cross-episode reasoning is not possible with similarity-based retrieval alone.

3.3 Two Induced Traversal Operators

The paper defines two fundamental mapping operators over R\mathcal{R}:

ϕcg(c){g(c,g,)R}(6)\phi_{c \to g}(c) \triangleq \{g \mid (c, g, \cdot) \in \mathcal{R}\} \tag{6} ϕ(c,g)v(c,g){v(c,g,v)R}(7)\phi_{(c,g) \to v}(c, g) \triangleq \{v \mid (c, g, v) \in \mathcal{R}\} \tag{7}

Operator ϕcg\phi_{c \to g} takes a cue cc and returns all tags that connect it to some content — i.e., “given I know Nate won a tournament, what associative categories exist?” Operator ϕ(c,g)v\phi_{(c,g) \to v} then retrieves the content nodes conditioned on both a cue and a selected tag — i.e., “given Nate and Tournament Victory, what specific episode?”

This two-stage decoupling is crucial: by exposing tags as an intermediate layer, the LLM can reason over a small set of semantic categories (tags) before committing to retrieving expensive episodic content. This avoids the combinatorial explosion of naive N-hop expansion.

3.4 Multi-Granular Memory Layers

The full memory graph has three layers at different granularities:

Episodic Layer (Cue–Tag–Episode): Stores event-specific memory units eiVee_i \in \mathcal{V}^e. Each episode corresponds to a concrete occurrence at a specific time. Episodes are organized along a unified timeline, enabling temporal reasoning (e.g., “what happened in July?”).

Semantic Layer (Cue–Tag–Semantic): Captures relatively stable knowledge: personal attributes, preferences, general facts extracted from dialogue. Unlike episodes, semantic units persist across individual contexts and are anchored to entity-level cues.

Abstraction Layer (Topics): Topic nodes τVτ\tau \in \mathcal{V}^\tau summarize recurring patterns shared across episodes. Topics connect to their constituent episodes via ϕτe\phi_{\tau \to e}, allowing top-down access: localize relevant topic → descend to associated episodes.

4. Memory Population Pipeline

The memory graph is constructed offline from raw dialogue text TT via an automated LLM distillation pipeline.

4.1 Episodic Memory Construction

Step 1: Segment into episodic units. The dialogue stream is first rewritten to resolve coreferences and normalize temporal expressions, then segmented into coherent episodic units:

{ei}RLLM(T)(8)\{e_i\} \leftarrow \mathcal{R}_{\text{LLM}}(T) \tag{8}

where RLLM\mathcal{R}_{\text{LLM}} performs pronoun resolution, temporal normalization, and episodic segmentation.

Step 2: Extract tags and cues. For each episodic unit eie_i, the LLM extracts:

giTLLM(ei),CiKLLM(ei)(9)g_i \leftarrow \mathcal{T}_{\text{LLM}}(e_i),\quad \mathcal{C}_i \leftarrow \mathcal{K}_{\text{LLM}}(e_i) \tag{9}

where TLLM\mathcal{T}_{\text{LLM}} produces a short associative tag (e.g., “Tournament Victory”) summarizing the relational pattern, and KLLM\mathcal{K}_{\text{LLM}} extracts a set of fine-grained cues Ci\mathcal{C}_i (entities, attributes, salient descriptors).

Step 3: Build graph triplets. Each cue cCic \in \mathcal{C}_i is linked to the episode eie_i through the tag gig_i, forming Cue–Tag–Episode triples (c,gi,ei)R(c, g_i, e_i) \in \mathcal{R}.

4.2 Semantic Memory Construction

Semantic units (cis,gis,si)(c_i^s, g_i^s, s_i) are extracted by a separate LLM-based extractor SLLM\mathcal{S}_{\text{LLM}}:

{(cis,gis,si)}SLLM(T)(10)\{(c_i^s, g_i^s, s_i)\} \leftarrow \mathcal{S}_{\text{LLM}}(T) \tag{10}

where sis_i is the stable fact content (e.g., “Nate is bright and bold”), cisc_i^s is an entity-level cue (e.g., “Nate”), and gisg_i^s is an aspect-level tag (e.g., “Personality”).

4.3 Topic Abstraction Construction

Topic nodes τj\tau_j are generated by summarizing recurring themes across episodic segments:

{τj}ALLM({ei})(11)\{\tau_j\} \leftarrow \mathcal{A}_{\text{LLM}}(\{e_i\}) \tag{11}

Each τj\tau_j summarizes a coherent set of episodes sharing a recurring semantic pattern, connected to those episodes via topic–episode links.

Figure 2: Memory Population Data Flow

flowchart LR
    A["Raw Dialogue T"] --> B["Rewrite and Segment"]
    B --> C["Episodic Units e1..en"]
    C --> D1["Tag Extraction: e_i to g_i"]
    C --> D2["Cue Extraction: e_i to C_i"]
    C --> D3["Semantic Extraction: T to cue-tag-sem"]
    C --> D4["Topic Abstraction: episodes to tau_j"]
    D1 --> E["Cue-Tag-Episode Triplets"]
    D2 --> E
    D3 --> F["Cue-Tag-Semantic Triplets"]
    D4 --> G["Topic Nodes linked to episodes"]
    E --> H["Heterogeneous Memory Graph G"]
    F --> H
    G --> H

5. MRAgent: The Active Reconstruction Loop

MRAgent formulates memory access as an active sequential decision process. We define the reconstruction state, the available traversal actions, and the full reconstruction algorithm.

5.1 Reconstruction State

At step tt, MRAgent maintains a reconstruction state:

S(t)=(Z(t),H(t))(12)\mathcal{S}^{(t)} = (\mathcal{Z}^{(t)},\, \mathcal{H}^{(t)}) \tag{12}

where:

  • Z(t)\mathcal{Z}^{(t)}: the active set — the current set of memory elements (cues, tags, content) being considered for the next traversal step.
  • H(t)\mathcal{H}^{(t)}: the reconstructed context — all evidence accumulated in steps 0,,t0, \ldots, t, which conditions subsequent traversal decisions.

Initial state: S(0)=(Z(0),)\mathcal{S}^{(0)} = (\mathcal{Z}^{(0)}, \emptyset) where Z(0)\mathcal{Z}^{(0)} contains the initial active set derived by matching query cues against the stored cue set.

5.2 Traversal Action Space

The agent can apply a finite set of traversal actions A={Π1,,Πm}\mathcal{A} = \{\Pi_1, \ldots, \Pi_m\}. Each action is a typed mapping operator over R\mathcal{R}.

Forward traversal actions expand the active set along Cue→Tag→Content relations:

Πcg(C(t))cC(t)ϕcg(c)(13)\Pi_{c \to g}(\mathcal{C}^{(t)}) \triangleq \bigcup_{c' \in \mathcal{C}^{(t)}} \phi_{c \to g}(c') \tag{13} Π(c,g)v(C(t),G(t))cC(t)gG(c)ϕ(c,g)v(c,g)(14)\Pi_{(c,g) \to v}(\mathcal{C}^{(t)}, \mathcal{G}^{(t)}) \triangleq \bigcup_{c' \in \mathcal{C}^{(t)}} \bigcup_{g' \in \mathcal{G}(c')} \phi_{(c,g) \to v}(c', g') \tag{14}

Action Πcg\Pi_{c \to g} expands from cues to their associated tags — a cheap operation since tags are lightweight. Action Π(c,g)v\Pi_{(c,g) \to v} then retrieves full episodic or semantic content based on both cues and selected tags.

Reverse traversal actions allow retrieved content to activate new cues and tags:

Πv(c,g)(V(t)){(c,g)vV(t), (c,g,v)R}(15)\Pi_{v \to (c,g)}(\mathcal{V}^{(t)}) \triangleq \{(c', g') \mid \exists v' \in \mathcal{V}^{(t)},\ (c', g', v') \in \mathcal{R}\} \tag{15}

This is the key “reactivation” mechanism — analogous to an engram activating related cues. After retrieving episode e2e_2 (“Nate won his 2nd tournament in July”), the reverse traversal can surface the cue “July” and tag “Travel/Location”, enabling subsequent retrieval of Caroline’s July activity.

The concrete toolkit contains seven specialized operations (Table 4 in the paper):

ToolOperatorPurpose
query_tag_eventsϕ(c,g)e(c,g)\phi_{(c,g) \to e}(c, g)Retrieve episodic events for a cue-tag pair
query_conversation_timeϕet(ve)\phi_{e \to t}(v^e)Return timestamp of an episode
query_event_keywordsϕectx(e)\phi_{e \to \text{ctx}}(e)Extract cues/tags from an episode
query_event_contextϕectx(e)\phi_{e \to \text{ctx}}(e)Retrieve context surrounding the episode
query_personal_informationϕ(cs,gs)vs(cs)\phi_{(c^s, g^s) \to v^s}(c^s)Return semantic aspects for a person entity
query_personal_aspectϕ(cs,gs)vs(cs,gs)\phi_{(c^s, g^s) \to v^s}(c^s, g^s)Retrieve semantic content for (person, aspect)
query_topic_eventsϕτe(τ)\phi_{\tau \to e}(\tau)Retrieve episodes associated with a topic

5.3 The Active Reconstruction Algorithm

Figure 3: MRAgent Active Reconstruction Loop

flowchart TD
    Q["Query x"] --> CI["Cue Initialization: ExtractCues"]
    CI --> IS["Initial State: Z0, empty H"]
    IS --> LOOP["Loop: t = 0 to T-1"]
    LOOP --> LLM1["Step 1: LLM Reasoning\nSelect action subset A_t"]
    LLM1 --> TRAV["Step 2: Controlled Traversal\nApply each action on active set Z_t"]
    TRAV --> LLM2["Step 3: LLM Routing\nPrune irrelevant branches\nUpdate Z_next"]
    LLM2 --> UPD["State Update: H_next = H_prev + Z_next"]
    UPD --> STOP{"Sufficient Evidence?"}
    STOP -- "Yes" --> ANS["AnswerLLM generates final answer"]
    STOP -- "No" --> LOOP
    ANS --> Output["Final Answer y-hat"]

Algorithm 1 (from the paper):

Input:  Query x, memory graph G, actions A = {Π₁,...,Πₘ}, max steps T
Output: Answer ŷ

1:  C ← ExtractCues(x)
2:  Z⁽⁰⁾ ← ActiveSetInit(C, G)           // match query cues to stored cues
3:  H⁽⁰⁾ ← ∅
4:  S⁽⁰⁾ ← (Z⁽⁰⁾, H⁽⁰⁾)
5:  for t = 0 to T - 1 do
6:      // Action selection
7:      A⁽ᵗ⁾ ← f_select(x, H⁽ᵗ⁾, Z⁽ᵗ⁾)   // LLM picks subset of A
8:      // Controlled traversal
9:      Z̃⁽ᵗ⁺¹⁾ ← ∅
10:     for all a ∈ A⁽ᵗ⁾ do
11:         Z̃⁽ᵗ⁺¹⁾ ← Z̃⁽ᵗ⁺¹⁾ ∪ Πₐ(Z⁽ᵗ⁾)     // expand candidates
12:     end for
13:     // Routing and state update
14:     Z⁽ᵗ⁺¹⁾ ← f_route(x, H⁽ᵗ⁾, Z̃⁽ᵗ⁺¹⁾)  // LLM prunes irrelevant
15:     H⁽ᵗ⁺¹⁾ ← H⁽ᵗ⁾ ∪ Z⁽ᵗ⁺¹⁾
16:     if Stop(x, H⁽ᵗ⁺¹⁾) then break      // sufficient evidence?
17: end for
18: ŷ ← AnswerLLM(x, H⁽ᵗ⁺¹⁾)
19: return ŷ

Step-by-step walkthrough for the running example:

Query: “What did Caroline do when Nate won his second video game tournament?”

Turn 0:

  • ExtractCues(x) → {Caroline, video game tournament, second}
  • Initial active set: cues matched in graph → Caroline, video game tournament
  • LLM selects action: Π_{c→g} (expand to tags) + query_tag_events for video game tournament
  • Traversal returns candidate tags: {Tournament Victory, Tournament Participation}
  • LLM routes: selects Tournament Victory as most relevant
  • Evidence: tags associated with video game tournament + Tournament Victory

Turn 1:

  • LLM action selection: Π_{(c,g)→v}(video game tournament, Tournament Victory) → retrieve episodes
  • Traversal returns: {e1e_1: “Nate won 1st tournament May”, e2e_2: “Nate won 2nd tournament July”}
  • LLM routes: selects e2e_2 (second tournament), also calls query_conversation_time(e₂) → timestamp: July
  • Evidence: e2e_2 content + temporal anchor “July”

Turn 2:

  • Reverse traversal: Π_{v→(c,g)}(e₂) surfaces new cue “July” + tag “Travel/Location”
  • LLM selects: query_tag_events(Caroline, Travel/Location) for July time window
  • Traversal returns: {e3e_3: “Caroline was traveling in July”}
  • LLM routes: e3e_3 is directly relevant
  • Evidence: e3e_3 content

Turn 3:

  • STOP condition: sufficient evidence to answer
  • AnswerLLM(”…Caroline was traveling when Nate won his second tournament”)

This multi-step process recovered the answer that was completely unreachable by any single-shot passive retrieval.

5.4 Two Operating Modes

MRAgent operates in two modes:

  • Navigate mode (Ψ=Navigate\Psi = \text{Navigate}): The agent invokes the toolkit to progressively explore the memory graph and accumulate evidence. The LLM reasons over current state and selects traversal actions.
  • Answer mode (Ψ=Answer\Psi = \text{Answer}): Once the stopping condition is met, the agent transitions to generating the final response conditioned on H(T+1)\mathcal{H}^{(T+1)}.

6. Theoretical Analysis: Why Active > Passive

6.1 Hypothesis Class Formulation

The paper proves that active reconstruction is strictly more expressive than passive retrieval for any retrieval budget T2T \geq 2.

Setup: Consider tasks with queries xXx \in \mathcal{X}, answers yYy \in \mathcal{Y}, and heterogeneous KG memories MMnM \in \mathcal{M}_n. An LM with parameters θΘ\theta \in \Theta induces:

  • Qθ,tQ_{\theta,t}: functions that decide which node to retrieve at step tt
  • HθH_\theta: an answer head that outputs y^\hat{y} given the retrieval history

An active policy πθact\pi_\theta^{\text{act}} uses Qθ,t(x,z(1),,z(t1))Q_{\theta,t}(x, z^{(1)}, \ldots, z^{(t-1)}) — conditioned on full history. A passive policy πθpass\pi_\theta^{\text{pass}} uses Qθ,t(x)Q_{\theta,t}(x) — conditioned only on the query.

The active hypothesis class at budget TT:

HactiveLM(T):={πθact:θΘ}(16)\mathcal{H}_{\text{active}}^{\text{LM}}(T) := \{\pi_\theta^{\text{act}} : \theta \in \Theta\} \tag{16}

The passive hypothesis class:

HpassiveLM(T):={πθpass:θΘ}(17)\mathcal{H}_{\text{passive}}^{\text{LM}}(T) := \{\pi_\theta^{\text{pass}} : \theta \in \Theta\} \tag{17}

Theorem 4.1 (Main): For any retrieval budget T2T \geq 2, the passive hypothesis class is strictly contained in the active hypothesis class:

HpassiveLM(T)HactiveLM(T)(18)\mathcal{H}_{\text{passive}}^{\text{LM}}(T) \subsetneq \mathcal{H}_{\text{active}}^{\text{LM}}(T) \tag{18}

6.2 Proof Sketch: Active Retrieval Achieves Zero Error

The proof constructs a separating distribution Dn,dD_{n,d} — the “Binary-Tree Needle-in-a-Haystack” task — where:

  • Memory MM is a complete binary tree of depth d=T1d = T - 1 with n=2d+11n = 2^{d+1} - 1 nodes.
  • Each internal node vuv_u encodes the next bit to follow: p(vu)p(v_u) = “go left (0) or right (1)”.
  • The answer label yy is encoded only in the target leaf vuv_{u^*}, where u{0,1}du^* \in \{0,1\}^d is sampled uniformly.

Active strategy achieves zero error: Starting from the root, at each step tt:

  1. Retrieve current node vuv_u → read the next bit b=p(vu){0,1}b = p(v_u) \in \{0, 1\}.
  2. Move to child vubv_{ub}. After dd steps, reach target leaf vuv_{u^*}; retrieve it to read yy. Total retrievals: T=d+1T = d + 1. Error = 0.

Passive strategy has irreducible error: A passive policy must commit to all T=d+1T = d+1 retrieval targets as a function of xx alone, before seeing any node payloads. The target leaf uu^* is uniform over 2d2^d leaves. Any TT fixed retrieval targets can include at most TT leaves. The probability of hitting vuv_{u^*} is T/2d\leq T / 2^d. Therefore:

L(πθpass; Dn,d)εY(1T2d)=εY(1T2T1)(19)L(\pi_\theta^{\text{pass}};\ D_{n,d}) \geq \varepsilon_Y \left(1 - \frac{T}{2^d}\right) = \varepsilon_Y \left(1 - \frac{T}{2^{T-1}}\right) \tag{19}

where εY=1supyPY(y)\varepsilon_Y = 1 - \sup_y P_Y(y) is the minimum error when only the prior is known.

Figure 4: Active vs Passive Retrieval — Hypothesis Class Geometry

graph TD
    subgraph "All Possible Retrieval Predictors"
        subgraph "H_active(T)"
            subgraph "H_passive(T)"
                P["Passive policies\n(query-only)"]
            end
            A["Active-only policies\n(history-conditioned)\ne.g., binary-tree follower"]
        end
    end

The containment HpassiveHactive\mathcal{H}_{\text{passive}} \subsetneq \mathcal{H}_{\text{active}} is strict: there exist distributions (like Dn,dD_{n,d}) where the best passive policy has bounded-away-from-zero error, but some active policy achieves zero error. This is not just a theoretical curiosity — the binary tree is a stylized model of any multi-hop query where each hop depends on information revealed by the previous hop.

7. Experiments

7.1 Setup

Benchmarks:

  • LoCoMo (Maharana et al., 2024): 50 long-term conversational dialogues, each spanning up to 35 sessions with ~300 turns average, accompanied by ~200 QA pairs per conversation. Question categories: single-hop, multi-hop, temporal, open-domain. The paper excludes adversarial questions.
  • LongMemEval (Wu et al., 2025): 500 questions paired with ~115K-token chat histories (the LongMemEval-S setting). Categories: multi-session, single-user, temporal-reasoning, single-preference.

Baselines:

  • RAG (Lewis et al., 2020): Standard similarity-based retrieval.
  • A-Mem (Xu et al., 2025): Graph-structured memory notes with LLM-assisted relation extraction; retrieves via similarity seed + neighborhood expansion.
  • MemoryOS (Kang et al., 2025): Three-tier hierarchy (short-term, mid-term, long-term persona) with hierarchical retrieval.
  • LangMem (LangChain, 2025): Compresses dialogue into memory summaries stored as embeddings; summarizes retrieved memories for context.
  • Mem0 (Chhikara et al., 2025): Extracts natural-language facts incrementally; LLM-driven ADD/UPDATE/DELETE/NOOP operations; similarity-based retrieval at inference.

LLM Backbones: Gemini-2.5-Flash and Claude-Sonnet-4.5.

Metrics:

  • LLM-Judge (J): Binary semantic correctness judged by GPT-4o-mini (J(y^,y){0,1}J(\hat{y}, y) \in \{0, 1\}).
  • F1F_1 Score: Harmonic mean of precision and recall of correct answers.
  • Evidence Recall: Fraction of ground-truth supporting evidence retrieved.

7.2 Main Results

Figure 5: LoCoMo Results — MRAgent vs. Baselines

ModelMethodMulti-hop F₁Multi-hop JTemporal F₁Temporal JOpen-Domain F₁Open-Domain JSingle-hop F₁Single-hop JOverall J
GeminiRAG34.8958.1643.5249.2225.6841.6753.6969.2061.30
A-Mem33.8153.5440.2249.5312.4933.4346.3961.8355.97
MemoryOS41.4263.8235.9147.0423.4341.6654.8271.9063.35
LangMem40.6761.8244.7053.5820.4943.6548.2069.6862.86
Mem045.1768.7958.1961.6826.2441.6654.3773.7268.31
MRAgent43.6975.1767.6680.3732.5168.7564.0890.4884.21
ClaudeRAG34.5357.4543.3948.2926.5643.7553.6669.2061.10
LangMem44.3770.9256.6480.6822.6654.7154.3683.1278.61
MRAgent56.7290.1969.8285.3434.6771.5768.6291.1088.32

Key observations:

  • Temporal questions show the sharpest improvement (Gemini: +18.7 J over Mem0; Claude: +4.7 J over LangMem on temporal). This directly validates the core design: temporal queries require inferring anchor timestamps as intermediate cues.
  • Multi-hop questions benefit substantially (+6.4 J over Mem0 on Gemini; +19.3 J over LangMem on Claude), confirming that chained evidence is needed for compositional retrieval.
  • Single-hop questions improve as well (+16.8 J over Mem0 on Gemini), suggesting that even for simpler queries, the structured tag-based access is more precise than similarity-based retrieval.

7.3 Efficiency Analysis

Figure 6: Token Consumption and Runtime Comparison (LongMemEval, Gemini)

MethodToken ConsumptionRuntime (s)
A-Mem632k1,122.23
MemoryOS273k3,135.54
LangMem3,268k1,209.57
Mem0245k533.29
MRAgent118k586.11

MRAgent uses only 118k tokens per sample — a 5x reduction versus A-Mem and a 27x reduction versus LangMem. The reason: by exposing tags as intermediate semantic shortcuts, MRAgent can decide before loading full episodic content whether a memory branch is relevant. Irrelevant branches are pruned at the tag level without ever accessing the (expensive) episodic text.

The runtime of 586s is slightly higher than Mem0 (533s) but significantly lower than MemoryOS (3,136s). The multi-step reconstruction loop adds some latency versus single-shot retrieval, but this is offset by not needing to load thousands of tokens of irrelevant context.

7.4 Ablation Study

The ablation isolates three structural variants:

  • CE (Cue→Episode): Direct episodic indexing without tags — retrieve episodes by cue match.
  • CTE (Cue→Tag→Episode): Tags mediate cue-to-episode retrieval, but no content-level content nodes.
  • CTC (Cue→Tag→Content): Full architecture including semantic nodes.

Each variant is tested with and without active reasoning.

Figure 7: Ablation Results (LoCoMo, Multi-hop, Claude Backbone)

VariantReasoningMulti-hop J (approx.)
CE (Cue→Episode)No42
CTE (Cue→Tag→Episode)No55
CTC (Cue→Tag→Content)No60
CEYes65
CTEYes78
CTC = MRAgentYes90

Ablation shows both structure richness (CE→CTE→CTC) and active reasoning each contribute, but reasoning is the dominant factor (note the gap between No-Reasoning and Yes-Reasoning columns within each variant).

Key findings:

  1. Active reasoning is the dominant factor. Across all structural variants, adding reasoning (blue bars) consistently outperforms the no-reasoning variant. Multi-step exploration is more important than any structural choice alone.
  2. Tags provide effective semantic guidance. Within the no-reasoning setting, performance improves monotonically from CE to CTE to CTC — demonstrating that richer associative structures enable more reliable retrieval.
  3. Episodic and semantic memory are complementary. Removing semantic nodes (going from CTC to CTE) degrades performance on multi-hop questions that require stable facts (e.g., “what does Nate enjoy?”) alongside episodic events.

7.5 Multi-Turn Reasoning Analysis

Figure 8: Evidence Recall Accumulation Over Reasoning Turns

The paper tracks cumulative evidence recall across reasoning turns on LoCoMo. Average turns required:

Question TypeAverage TurnsMax Valid Turns
Multi-hop3.162.65
Temporal2.422.40
Open Domain2.601.09
Single-hop2.071.28

For single-hop questions, recall plateaus after ~2 turns — the simple queries converge quickly. For multi-hop questions, recall improves by over 30% across successive steps (red line in Figure 6(a)), confirming that iterative exploration is essential for compositional reasoning.

The agent’s self-termination is also effective: Max Valid Turns closely matches Average Turns, indicating that the LLM correctly identifies when additional exploration is unlikely to yield new useful evidence. Increasing parallel retrieval per turn (KK, the per-turn budget) yields only limited gains, while increasing depth (TT, number of reasoning turns) improves performance steadily — reconstruction depth cannot be substituted by increased parallel exploration.

8. Limitations and Boundary Conditions

The authors acknowledge several limitations in the conclusion:

1. Latency grows with reconstruction depth. Multi-step exploration adds wall-clock latency versus single-shot retrieval. Queries requiring deep chains (temporal + multi-hop) may require T=4T=488 turns, each involving LLM inference. In low-latency production settings this is a concern.

2. Static memory construction — no forgetting or updating. The memory graph grows monotonically as interactions accumulate. There is no mechanism to:

  • Update existing nodes when facts change (e.g., Caroline moves to a different city).
  • Forget old or contradictory information.
  • Consolidate near-duplicate events.

This is acknowledged as a direction for future work. In long-lived deployment scenarios (e.g., a personal assistant used for years), the graph could become large and noisy.

3. Memory construction quality depends on LLM distillation accuracy. Tags and cues are extracted by an LLM. If the extraction is noisy — tags that are too generic, or cues that miss salient entities — the graph structure will be degraded, and reconstruction will fail. The paper does not analyze robustness to extraction errors.

4. Theoretical guarantee requires T2T \geq 2. Theorem 4.1 gives the strict containment for any budget T2T \geq 2. For T=1T = 1 (single-shot retrieval), active and passive policies are equivalent (both reduce to a single query-based selection).

9. Critical Assessment: Weaknesses and Improvements

9.1 Weaknesses and Flaws

Evaluation scope is narrow relative to the claims. The paper tests only two LLM backbones (Gemini-2.5-Flash and Claude-Sonnet-4.5) and two benchmarks. Both benchmarks are human–LLM-generated conversation datasets. The generalization claims (“MRAgent outperforms all baselines across question types”) are strong, but the evidence base is limited: no open-domain task-completion benchmarks, no benchmarks involving structured tool use, no stress-testing on very large memory graphs (N>104N > 10^4 episodes).

No latency breakdown. The paper reports aggregate runtime (586s for MRAgent) but does not break down how much time is spent in (a) LLM inference, (b) graph traversal, (c) LLM routing, and (d) stopping condition evaluation. Without this, it is impossible to predict whether the system is latency-bottlenecked by the graph traversal or by the LLM calls — and therefore which component to optimize.

Reconstruction cost scales with depth, but no analysis for hard cases. Section 5.5 shows that increasing TT (depth) improves multi-hop performance steadily (Figure 9). But the paper sets Tmax=8T_{\max} = 8 without analyzing the distribution of required turns across query complexity. If 5% of queries require T=8T = 8 turns, is that acceptable? Is the LLM’s stopping condition accurate for hard multi-hop queries that require more than 8 turns?

Cross-backbone MRAgent gets 86.76 J vs MRAgent 72.95 J on LongMemEval* (Table 2 in paper). This 13.8-point gap suggests that Gemini-for-memory-construction + Claude-for-retrieval is substantially better than either alone. The paper does not fully explain why — are Gemini’s tag extractions higher quality? Is Claude’s routing better? This “oracle combination” result muddles the interpretation of the ablation study.

Variance is not analyzed rigorously for baselines. The paper reports standard deviations for MRAgent results (e.g., 90.48±0.1090.48_{\pm 0.10}) but the tables show that some baselines also have ±\pm values. However, the significance of the 23% improvement claim is not tested via statistical tests on the benchmark; it relies on point estimates.

9.2 Limitations the Authors Understate or Omit

Memory construction is the weakest link, but receives the least attention. The paper describes the construction pipeline in Section 3.3 and Appendix B but provides no ablation on construction quality (e.g., what happens if tags are noisy, or cues are over-coarsened?). The 118k token efficiency claim assumes the graph is well-structured; a poorly constructed graph could cause more traversal steps and negate the efficiency gain.

The “no forgetting” limitation is more serious than acknowledged. In the paper’s experimental setup, memory graphs are constructed from fixed finite conversation histories. In a real deployment, conversations accumulate indefinitely. The paper does not address:

  • What happens when two episodic units contradict each other (e.g., “Nate lives in Singapore” vs later “Nate moved to London”)?
  • How does retrieval performance degrade as graph size grows from 10210^2 to 10510^5 episodes?
  • Is the LLM’s tag-based pruning still accurate for very large active sets?

The evaluation does not include any adversarial or out-of-distribution queries. LoCoMo explicitly excludes adversarial questions. In real agent deployments, users ask vague, ambiguous, or contradictory questions. A reconstruction approach that commits to following tag-based paths could get stuck in an irrelevant branch if the initial cue extraction is off.

Tool-call overhead is not modeled. The seven-tool toolkit (Table 4) requires the LLM to decide which tool to invoke at each step — this is a function-calling overhead. For small models (< 7B parameters), structured tool-call generation is less reliable. The paper uses Gemini-2.5-Flash and Claude-Sonnet-4.5, both large and reliable for function calling, but it is unclear how MRAgent performs with smaller LLMs.

9.3 Concrete Improvement Suggestions

Adaptive memory maintenance. The most pressing limitation is the static graph. A natural fix is to add a maintenance pass after each conversation session: a lightweight LLM run that identifies contradictory or redundant triplets and performs UPDATE/DELETE operations analogous to Mem0’s approach. This would need to be efficient enough to not bottleneck the overall system.

Tag quality ablation and robustness. The authors should ablate construction quality by: (a) replacing LLM-extracted tags with random tags, (b) using a weaker extraction model, (c) intentionally injecting noisy cues. This would reveal how sensitive the approach is to tag quality and where the irreducible floor is.

Scale evaluation. Testing on memory graphs with 10410^410510^5 episodes (e.g., a year-long conversation history) would validate whether graph traversal remains efficient at scale. Empirically, does average reconstruction depth grow logarithmically or linearly with graph size?

Multi-agent memory sharing. The paper frames MRAgent as a single-agent memory system. An interesting extension is shared memory pools across multiple agents, where episodic evidence from one agent’s interactions can be queried by another. The Cue-Tag-Content graph structure is naturally composable (just merge the graphs), but conflict resolution across agents would require new mechanisms.

Smaller LLM backbone testing. Validating MRAgent with a 7B or 13B parameter model would determine whether the approach is practical for edge/on-device deployment, or whether it requires a frontier LLM’s function-calling reliability.

10. Conclusion

MRAgent makes a clean conceptual contribution: memory retrieval should be an active, iterative reconstruction process, not a passive similarity lookup. The Cue–Tag–Content graph provides the data structure that makes this tractable — tags serve as lightweight semantic checkpoints that allow the LLM to reason about which memory branches are relevant without loading expensive episodic content first. The theoretical result (Theorem 4.1) gives a rigorous justification for the design philosophy.

The empirical results are strong: 23% improvement on LoCoMo, 32% on LongMemEval, 5–28x reduction in token consumption, with competitive runtime. The ablation cleanly isolates the contribution of each component, and the multi-turn analysis reveals the mechanism — progressive evidence accumulation over 2–4 reasoning turns.

The critical open problems are the lack of a forgetting/update mechanism and the narrow evaluation scope. Future work should address memory lifecycle management for long-lived deployments and validate the approach on more diverse agent workloads. Despite these limitations, MRAgent represents a clear step forward in how we think about memory for long-horizon LLM agents, and the cognitive neuroscience framing provides a principled foundation for further development.

Code: https://github.com/Ji-shuo/MRAgent

11. Reproducibility and Implementation Notes

11.1 Memory Construction Details

The memory construction pipeline runs as a preprocessing step over the full conversation history before any queries are served. Key implementation choices:

Episodic segmentation heuristic. The paper uses RLLM\mathcal{R}_{\text{LLM}} for segmentation without specifying the exact prompt or segment length constraints. In practice, the authors’ public code uses overlapping context windows to ensure coreferences are resolved across turn boundaries — a detail that affects tag quality significantly.

Tag vocabulary. Tags are LLM-generated phrases; the vocabulary is not pre-defined. This means semantically similar tags (e.g., “Tournament Win”, “Won a Tournament”, “Gaming Victory”) may coexist in the graph without being merged. The code implements a post-processing step that clusters similar tags using embedding similarity, but the clustering threshold is a hyperparameter not discussed in the main paper.

Graph scale in experiments. On LoCoMo (50 conversations × ~300 turns × ~1 episode/2 turns), each conversation graph has roughly 150–300 episode nodes, 500–1000 cue nodes, and 200–400 tag nodes. This is small enough that full graph traversal is feasible; at N=105N = 10^5 scale, additional indexing would be needed.

11.2 Reconstruction Loop Implementation

The Navigate/Answer mode transition is implemented as a ReAct-style loop where the LLM generates structured JSON tool calls:

{
  "mode": "Navigate",
  "action": "query_tag_events",
  "parameters": {
    "cue": "video game tournament",
    "tag": "Tournament Victory"
  },
  "reasoning": "Need to find which specific tournament was the second one."
}

The LLM must output valid JSON with the correct tool name and parameters — this requires a capable instruction-following model. The paper uses Gemini-2.5-Flash and Claude-Sonnet-4.5, both of which handle structured generation reliably. For smaller models, structured decoding (constrained generation with a JSON schema) would be required.

The stopping condition Stop(x, H^{(t+1)}) is also LLM-evaluated: the model emits "mode": "Answer" when it judges the accumulated context sufficient. This is an implicit reward signal — the model must internally estimate whether additional traversal will yield marginal value.

11.3 Evaluation Metric Details

The LLM-Judge metric uses GPT-4o-mini as a semantic equivalence judge with temperature 0. Each method is evaluated three independent times, and mean ± std is reported. The judge prompt asks: “Is the generated answer semantically equivalent to the reference answer, accounting for paraphrasing and different surface forms?” — outputting binary 0/1. This is more robust than exact-match metrics for open-ended memory questions, but it introduces judge-model-specific bias (GPT-4o-mini may have calibration differences from other models).

Evidence Recall in Equation (24):

Recall=1Ni=1NE^iEiEi(20)\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{\mathcal{E}}_i \cap \mathcal{E}_i^*|}{|\mathcal{E}_i^*|} \tag{20}

where E^i\hat{\mathcal{E}}_i is the agent-retrieved evidence set and Ei\mathcal{E}_i^* is the annotated ground-truth evidence set. This measures the retrieval completeness independent of whether the answer generation step uses the evidence correctly.

11.4 Comparison to GraphRAG

GraphRAG (Han et al., 2025) is a related system that organizes retrieval corpora into graph structures and retrieves via community summaries and neighborhood expansion. The key architectural difference from MRAgent:

  • GraphRAG uses a top-down community summary approach: the graph is summarized at multiple coarse-grained levels, and retrieval starts from high-level community descriptions.
  • MRAgent uses a bottom-up tag-guided traversal: starting from fine-grained cues, the system navigates up through tags and down to content.

GraphRAG is designed for document corpora (e.g., news articles, Wikipedia); MRAgent is designed for agent interaction histories where temporal and entity-specific relationships matter. The two approaches are complementary and could potentially be combined.

12. Broader Implications

12.1 A New Paradigm for Agent Memory

MRAgent’s most important contribution is not the specific architecture but the design principle: separate the retrieval structure from the retrieval policy. Existing systems conflate these — the graph structure (edges, neighborhoods) implicitly encodes the retrieval policy (follow these edges). MRAgent decouples them: the Cue-Tag-Content graph is the structure, and the LLM-driven traversal loop is the policy. This separation allows the policy to adapt at inference time without retraining the memory representation.

This principle generalizes beyond agent memory. Any system where: (a) the relevant answer depends on multi-hop reasoning across heterogeneous information, (b) intermediate evidence reveals new retrieval directions, (c) the retrieval budget is limited,

…would benefit from active reconstruction over passive retrieval. Candidate applications: scientific literature review (following citation chains based on intermediate findings), medical record analysis (following symptom → diagnosis → treatment chains), and code repository navigation (following function-call chains to trace bug origins).

12.2 Relationship to Search-R1 and Agentic RAG

Search-o1 (Li et al., 2025) and Search-R1 (Jin et al., 2025) both move retrieval inside the reasoning loop, issuing search queries on demand when a knowledge gap is detected. MRAgent differs in a crucial way: Search-o1 and Search-R1 retrieve from external corpora (web search, knowledge bases) to fill factual gaps within a single reasoning task. MRAgent reconstructs from the agent’s own interaction history to recall past experiences.

The two paradigms are complementary. A complete long-horizon agent could use:

  • MRAgent-style active reconstruction for accessing its own memory (what happened in past sessions),
  • Search-o1-style agentic RAG for accessing external world knowledge (current facts, documentation).

The combination raises interesting architectural questions: should the two memory sources be unified into a single graph, or kept separate with a meta-router deciding which to consult?