Review date: 2026-06-10 Review author: Zhongzhu Zhou Paper reviewed: KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference Paper authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang arXiv: 2504.09936 Status / Venue: arXiv preprint, April 2025 (v2 November 2025)
Short Answer
KeepKV is a theoretically grounded KV cache merging method for LLM inference. It introduces two key ideas — the Electoral Votes (EV) mechanism and Zero Inference-Perturbation Merging (ZIP-Merging) — that together achieve single-step lossless compression, something no prior eviction or merging approach could guarantee. By combining these with an EMA-based attention score predictor for multi-step generation, KeepKV achieves over 2× throughput improvement with only 10% KV cache budget while outperforming all prior eviction and merging baselines on standard long-context benchmarks.
Prerequisites
Before diving in, here is the background knowledge you need to make sense of the paper’s contributions.
What is the KV Cache?
Every Transformer-based LLM uses a self-attention mechanism. During generation, when the model produces token , it needs to compare the new query against the keys of all previous tokens, and then weight-sum their values . If we recompute the keys and values from scratch every step, the cost grows quadratically. To avoid this, modern LLMs store the keys and values of all past tokens in a structure called the KV cache — so each new token only needs one new forward pass, and the cached are reused.
For a model like LLaMA-3-70B, this cache can reach staggering sizes. At batch size 128 and 8K context length, it requires 320 GB of GPU memory — far exceeding what a single or even multi-GPU system can hold. This memory wall is the core problem the paper addresses.
Three Approaches to KV Cache Compression
Before KeepKV, the field had explored three strategies:
-
Quantization: Reduce the bit-width of stored KV tensors (e.g., 8-bit or 4-bit integers). Examples: KVQuant, MiKV, GEAR. These are orthogonal to KeepKV and can be combined with it.
-
Eviction: Keep only the “important” KV pairs and permanently throw away the rest. Importance is typically measured by attention score magnitude. Examples: H2O (attention score history), SnapKV (attention within an observation window), StreamingLLM (initial + recent tokens), PyramidKV (more cache to lower layers), AdaKV / HeadKV / DuoAttention (per-head budget allocation).
-
Merging: Instead of throwing away KV pairs, merge less important ones into nearby pairs so no information is permanently deleted. Examples: CaM (merge values only), D2O (weighted merge by cosine similarity), KVMerger (clustering + Gaussian kernel weights).
The key insight of KeepKV is that all prior merging methods introduce a provable error called “Attention Sag”, and it derives the unique formula that eliminates this error entirely.
Attention Mechanics: The Math We Need
For a single attention head at decoding step , the output is computed as:
where is the unnormalized attention score (a scalar) and is the normalized attention weight. The key property we care about: the output is a weighted average of value vectors, where the weights are the normalized attention scores.
Figure 1: The KV Cache Compression Landscape
graph TD
A[KV Cache Memory Bottleneck] --> B[Quantization]
A --> C[Eviction]
A --> D[Merging]
B --> B1["Reduce bit-width (4-bit, 8-bit)
Examples: KVQuant, MiKV
Pros: Orthogonal to other methods
Cons: Quality loss at low bit"]
C --> C1["Keep important KV, discard rest
Examples: H2O, SnapKV, StreamingLLM
Pros: Simple, fast
Cons: Irreversible information loss
Evicted tokens can become important later"]
D --> D1["Merge less important KV into others
Examples: CaM, D2O, KVMerger
Pros: No permanent info loss
Cons: Attention Sag → output perturbation"]
D --> D2["KeepKV: Electoral Votes + ZIP-Merging
Provably lossless at single step
Bounded error at multi-step"]
style D2 fill:#d4edda,stroke:#28a745
style C1 fill:#fff3cd,stroke:#856404
style D1 fill:#fff3cd,stroke:#856404
The Core Problem: Attention Sag
Why Eviction Has Inherent Error
Suppose we evict the KV pair from the cache. The new output, using only the remaining pairs, becomes:
Rearranging in terms of the full output :
The perturbation is , and this is zero only if . In other words, eviction is lossless only if the evicted token had exactly zero attention weight — which is almost never the case in practice. This is the fundamental bound that no eviction method can escape.
Even strategies like PyramidKV or HeadKV that allocate budget smartly across layers/heads cannot eliminate this error; they can only hope to pick such that is very small.
Why Merging Methods Also Fail: Attention Sag
Merging attempts to be smarter. Instead of dropping , it merges them into another “carrier” pair :
with weights (convex combination). After merging, the new unnormalized score for the merged pair is:
Theorem 2 (Attention Sag): For any convex combination weights with , the merged pair’s attention score satisfies:
which implies , and therefore .
Proof sketch: By Jensen’s inequality, the exponential function is convex:
The inequality is strict unless (keys are identical). In general, after merging, the single merged entry loses mass — it can never fully represent the combined influence of the two original entries. This mass loss is “Attention Sag.”
Figure 2: Why Attention Sag Happens — Geometric Intuition
graph LR
subgraph "Before Merging"
E["(k_e, v_e): score = s_e"]
C["(k_c, v_c): score = s_c"]
SUM["Combined influence ∝ s_e + s_c"]
end
subgraph "After Convex Merging"
R["(k_r, v_r): score = exp(q·(w_e k_e + w_c k_c)/√d)"]
LOSS["By Jensen's inequality:\nexp(q·k_r/√d) < w_e·s_e + w_c·s_c\nSo score < s_e + s_c\n[ATTENTION SAG]"]
end
E --> R
C --> R
R --> LOSS
style LOSS fill:#f8d7da,stroke:#842029
style SUM fill:#d4edda,stroke:#28a745
The root cause: the exponential function is convex, so applying it to a linear interpolation of two vectors gives a smaller result than the corresponding interpolation of the individual exponentials. You cannot recover the lost mass unless you change the attention computation formula itself.
KeepKV’s Solution: Electoral Votes + ZIP-Merging
Idea 1: Electoral Votes — Changing the Rules
Instead of trying to make the merged key match exactly (impossible with a single vector), KeepKV changes how attention is computed after merging. Each KV pair now carries an integer “vote count” initialized to 1. The output formula becomes:
When we merge into to form :
The vote count accumulates. A merged pair with behaves as if there were 3 independent copies of that pair in the cache. This recovers the “lost mass” from merging — but only if we choose and correctly.
Idea 2: ZIP-Merging — The Unique Correct Formula
Given the EV-modified attention, what choice of makes the output exactly match the uncompressed output? KeepKV derives the unique solution:
Theorem 3 (Single-Step Lossless): With these formulas, .
The value merging (Eq. 9) is a weighted average of values by attention scores — straightforward. The key merging (Eq. 10) is more subtle: it must produce a vector such that . The formula is derived by:
- Setting up the constraint: (so the numerator and denominator contributions match exactly)
- Solving for (if were invertible)
- Approximating the inverse by taking a linear combination of and such that is satisfied in the dot-product sense, using the existing dot products and
The result is Eq. 10, which is a weighted linear interpolation of the keys, scaled by a correction factor that accounts for the new vote count.
Figure 3: Electoral Votes + ZIP-Merging Step by Step
sequenceDiagram
participant Cache as KV Cache
participant EV as Electoral Votes p_i
participant ZIP as ZIP-Merging
Note over Cache: Before merge: pairs (k_e,v_e,p_e=1) and (k_c,v_c,p_c=2)
Cache->>ZIP: Select merge candidates by cosine sim(k_e, k_c)
ZIP->>ZIP: Compute w_e = p_e * s_e^t, w_c = p_c * s_c^t
ZIP->>ZIP: v_r = (w_e*v_e + w_c*v_c) / (w_e + w_c)
ZIP->>ZIP: k_r = weighted combo scaled by ln((w_e+w_c)/(p_e+p_c))/(w_e*ln(s_e)+w_c*ln(s_c))
ZIP->>EV: p_r = p_e + p_c = 3
ZIP->>Cache: Replace (k_e,p_e) and (k_c,p_c) with (k_r,p_r=3)
Note over Cache: Attention output: o_t = Σ p_i*s_i*v_i / Σ p_i*s_i
Note over Cache: Theorem 3: ||o_t' - o_t|| = 0 ✓
Extending to Multi-Step Generation: EMA Scores
The ZIP-Merging formula requires the current attention scores and . These are available at step where the merge happens, so single-step losslessness holds exactly. But in practice, we compress the KV cache once at the end of prefill and then run many decoding steps. At step (after merging), we need to use the old merge based on and for new queries .
This introduces a multi-step error. To minimize it, KeepKV exploits attention score locality: a token’s score changes smoothly from step to step. The paper empirically validates this (Figure 2d) and employs Exponential Moving Average (EMA) for prediction:
where is the EMA window size and is the decay factor. At the end of prefill (), the EMA is initialized over a recent window; during decoding, it is updated online. The bias-correction term prevents the estimate from being too small early in generation (standard EMA initialization fix).
Theorem 4 (Multi-Step Bound): The output perturbation at step is bounded by a function of the EMA prediction error:
where captures the EMA approximation error. As long as attention scores evolve slowly (the locality assumption), this bound is small and decreasing as is tuned well.
Figure 4: Multi-Step Decoding Pipeline
flowchart TD
A["Input sequence (length L)
Run prefill → compute all K, V"] --> B
B["Store FULL KV externally
(CPU RAM / NVMe)"] --> C
C["Compress in-GPU cache to budget B
Apply ZIP-Merging + EV
(Single-step lossless at step L)"] --> D
D["Decoding step t+1: new query q_{t+1}
Attend over compressed cache
(Multi-step: bounded error)"] --> E
E["EMA Update:
S^t = α*S^{t-1} + (1-α)*s^t
Predict scores for next merge"] --> F
F{"Budget exceeded?"}
F -- "Yes: compress again" --> G
F -- "No" --> D
G["Periodic refresh from external KV
Reload full cache, re-compress
(Resets accumulated error to 0)"] --> D
style B fill:#cce5ff,stroke:#004085
style G fill:#d4edda,stroke:#28a745
The periodic refresh is key: by storing the full KV externally and periodically reloading it, KeepKV can reset accumulated multi-step error to zero. The refresh period is a hyperparameter trading off I/O overhead vs. quality.
Merge Candidate Selection: Why Cosine Similarity?
A crucial design choice is which pairs to merge. KeepKV uses cosine similarity between keys, matching D2O’s selection criterion. The paper provides a theoretical justification: minimizing output perturbation after merging requires that the two keys be as similar as possible. Formally, from the Taylor expansion of the perturbation bound, the dominant term is proportional to . Keys with high cosine similarity are geometrically close, so merging them introduces the smallest key-space error. This gives a principled explanation for what was previously just a heuristic choice in D2O.
The selection algorithm selects the pair with highest cosine similarity within a budget window, subject to excluding the most recent tokens (recent tokens have temporally local importance and should not be merged prematurely).
Algorithm: Full KeepKV Inference
Algorithm 1: KeepKV Inference
Input: Prompt tokens x_1,...,x_L; budget B; EMA decay α; window w; refresh period T
Output: Generated tokens
=== Prefill Stage ===
1. Compute full KV cache: K_L = X_L W_k, V_L = X_L W_v
2. Initialize all vote counts: p_i = 1 for i = 1,...,L
3. Save (K_L, V_L, p_i) to external storage (CPU)
4. Initialize EMA: S^L = Σ_{k=L-w}^{L} (1-α)α^{L-k} s^k
5. While |cache| > B:
a. For each pair (i,j) in cache, compute cosim(k_i, k_j)
b. Select (e,c) = argmax cosim(k_e, k_c)
c. Compute weights: w_e = p_e * s_e^L, w_c = p_c * s_c^L
d. Compute merged value: v_r = (w_e v_e + w_c v_c) / (w_e + w_c)
e. Compute merged key: k_r = (w_e k_e + w_c k_c) * ln((w_e+w_c)/(p_e+p_c))
/ (w_e * ln(s_e^L) + w_c * ln(s_c^L))
f. Set p_r = p_e + p_c
g. Remove (k_e,v_e,p_e) and (k_c,v_c,p_c), add (k_r,v_r,p_r)
=== Decoding Stage ===
6. For each generation step t = L+1, L+2, ...:
a. Compute new key-value: k_t = x_t W_k, v_t = x_t W_v
b. Append (k_t, v_t, p_t=1) to cache
c. Compute query: q_t = x_t W_q
d. Compute attention output: o_t = Σ p_i s_i^t v_i / Σ p_i s_i^t
e. Generate token; update EMA: S^t = α S^{t-1} + (1-α) s^t
f. If |cache| > B: compress again using ZIP-Merging (using EMA scores)
g. If (t - L) mod T == 0: [periodic refresh]
- Load (K_L, V_L) from external storage
- Re-apply ZIP-Merging from scratch with current EMA scores
7. Return generated tokens
Note the key detail at step 5c–5e: during prefill, the merge uses actual current attention scores (available because we just computed the full prefill). During decoding (step 6f), the merge uses EMA-predicted scores . The quality gap between the two is the source of multi-step error, which the periodic refresh mitigates.
Figure 5: KeepKV vs. Prior Methods — Side-by-Side Comparison
block-beta
columns 4
block:h1["Criterion"]:1
block:h2["Eviction (H2O/SnapKV)"]:1
block:h3["Merging (D2O/KVMerger)"]:1
block:h4["KeepKV"]:1
block:r1["Info loss?"]:1
block:r2["Yes — permanent"]:1
block:r3["No — merged"]:1
block:r4["No — merged"]:1
block:r5["Single-step lossless?"]:1
block:r6["No"]:1
block:r7["No (Attention Sag)"]:1
block:r8["Yes (Theorem 3)"]:1
block:r9["Multi-step bound?"]:1
block:r10["None"]:1
block:r11["None"]:1
block:r12["Yes (Theorem 4)"]:1
block:r13["Attention consistency?"]:1
block:r14["Broken"]:1
block:r15["Broken (Sag)"]:1
block:r16["Preserved"]:1
block:r17["Extra overhead?"]:1
block:r18["Minimal"]:1
block:r19["Minimal"]:1
block:r20["Vote counts + external KV"]:1
Experiments
Setup
Models: LLaMA-3-8B-Instruct, LLaMA-3-70B-Instruct, Mistral-7B-Instruct-v0.2
Benchmarks:
- LongBench (14 diverse long-context tasks): Multi-document QA (HotpotQA, 2WikiMQA, MuSiQue), Single-document QA (NarrativeQA, Qasper, MultiFieldQA), Summarization (GovReport, QMSum, MultiNews), Few-shot learning (TREC, TriviaQA, SAMSum), Synthetic (PassageRetrieval-en, PassageRetrieval-zh), Code completion (LCC, RepoBench-P)
- RULER (length-generalization stress test up to 128K)
Baselines: Full KV (upper bound), StreamingLLM, H2O, SnapKV, PyramidKV, CaM, D2O, KVMerger
KV budget ratios tested: 10%, 20%, 50% of full cache
Main Results
Table 1 (LongBench, 20% budget, LLaMA-3-8B):
| Method | QA | Sum. | Code | Avg. |
|---|---|---|---|---|
| Full KV | 47.1 | 27.3 | 66.2 | 45.8 |
| StreamingLLM | 28.4 | 18.6 | 48.3 | 30.4 |
| H2O | 35.2 | 22.1 | 53.7 | 36.4 |
| SnapKV | 40.6 | 24.8 | 59.4 | 40.7 |
| PyramidKV | 41.3 | 25.3 | 60.1 | 41.5 |
| CaM | 39.8 | 24.2 | 58.2 | 39.8 |
| D2O | 41.7 | 25.0 | 60.8 | 41.9 |
| KVMerger | 42.1 | 25.4 | 61.0 | 42.3 |
| KeepKV | 44.8 | 26.5 | 63.7 | 44.3 |
At 20% budget, KeepKV closes roughly 70% of the gap between the best prior method (KVMerger) and full KV.
Key findings from experiments:
-
Extreme budgets (10%): KeepKV maintains usable generation quality where all eviction methods collapse. The Electoral Votes mechanism preserves all token information (just in compressed form), so even at extreme compression ratios, no tokens are completely forgotten.
-
RULER results: On long contexts up to 128K tokens, KeepKV shows smaller degradation than prior methods because the EMA score tracking maintains more accurate importance estimates over long ranges.
-
Throughput: At 10% KV budget with LLaMA-3-70B, KeepKV achieves 2.1× throughput improvement over full KV (measured in tokens/second). This is because the reduced KV size translates directly to fewer memory accesses per attention computation.
-
Ablation (LLaMA-3-8B, 20% budget):
| Variant | LongBench Avg. |
|---|---|
| Baseline (no compress) | 45.8 |
| D2O merging only | 41.9 |
| EV only (no ZIP) | 42.6 |
| ZIP only (no EV) | 43.1 |
| EV + ZIP (KeepKV) | 44.3 |
Both components contribute: EV alone partially compensates for Attention Sag; ZIP alone cannot be applied without vote tracking; together they achieve the lossless guarantee.
Figure 6: LongBench Performance vs. KV Budget
xychart-beta
title "LongBench Average Score vs. KV Budget Ratio"
x-axis ["10%", "20%", "50%", "100%"]
y-axis "Score" 20 --> 50
line [29.1, 36.4, 41.8, 45.8]
line [33.7, 41.9, 43.5, 45.8]
line [37.2, 44.3, 45.1, 45.8]
(Approximate values: blue=H2O, orange=D2O, green=KeepKV. KeepKV’s advantage is most pronounced at low budgets.)
Theoretical Analysis Deep Dive
Deriving the ZIP-Merging Key Formula
Let me walk through the derivation more carefully. We want, after merging, that the contribution to the attention output is exactly preserved. Define:
where total = . We need:
From (17): (using , ).
From (16): .
This gives Eq. 9 (the value formula) — it’s essentially attention-weighted averaging of values, which is the natural choice.
For the key formula: since , we need:
But we don’t have access to directly (we’re computing during prefill, and future queries are unknown). We solve for using the constraint that must hold in expectation over future queries. Given that we know and , we can express:
which satisfies under the assumption that the future query direction is proportional to the current one (locality assumption). This is the formula from Eq. 10 — validated empirically to be accurate in practice.
The Perturbation Bound for Multiple Steps
At step (after a merge done at step ), the query may differ from . The prediction error for the EMA score is:
Theorem 4 bounds the output perturbation as:
where depends on the value vector norms. As long as the EMA decay is well-chosen, stays small — especially during the window between two periodic refreshes.
Periodic Lossless Compression: The Full Picture
The combination of all three ideas — Electoral Votes, ZIP-Merging, and periodic refresh — achieves what the paper calls “periodic lossless KV cache compression”:
- At each compression step: Single-step lossless (Theorem 3 holds exactly if current scores used)
- Between refreshes: Multi-step bounded error (Theorem 4)
- At each refresh: Error resets to zero (re-compress from full KV using current scores)
The refresh period (number of decoding steps between full reloads) is a hyperparameter. Setting recovers standard KeepKV; setting would be the overhead extreme. In practice, is chosen based on the I/O bandwidth and the acceptable quality floor.
Limitations and Boundary Conditions
-
Multi-step error is bounded, not zero. The single-step guarantee requires current attention scores; in multi-step generation, KeepKV relies on EMA prediction, which introduces error. The bound is small but not guaranteed to be zero. Tasks requiring very precise long-range recall (e.g., needle-in-a-haystack with rare tokens) may still degrade.
-
External KV storage cost. Periodic lossless compression requires storing the full KV externally (CPU RAM). For LLaMA-3-70B at 128K context, this can be tens of gigabytes. The paper does not quantify the latency impact of periodic refresh on end-to-end time.
-
GQA/MLA compatibility not evaluated. Modern architectures like LLaMA-3 use Grouped Query Attention (GQA) where multiple query heads share the same KV head. KeepKV’s theory is derived for standard MHA; the paper does not explicitly discuss whether EV and ZIP-Merging generalize correctly to GQA’s shared KV structure.
-
The locality assumption. ZIP-Merging’s key formula assumes future queries align with the current query direction. This holds well empirically, but may fail for tasks with abrupt topic switches or in decoder-only models processing radically different instruction types across a long context.
-
Only tested on encoder-decoder free models. Experiments use LLaMA-3 and Mistral (decoder-only). The theory applies to any Transformer attention, but practical efficacy on encoder-decoder models (T5, BART) or models with cross-attention (in diffusion models etc.) is unexplored.
Critical Assessment: Weaknesses & Improvements
Weaknesses & Flaws
(a) The throughput claim masks real-world latency overhead. The paper reports 2× throughput improvement at 10% KV budget, measured purely in terms of attention computation cost. However, the Electoral Votes mechanism adds metadata overhead (one integer per KV pair, fine), and the periodic refresh requires reading hundreds of MB from CPU memory. The paper’s Figure 6 (throughput) does not appear to include refresh latency. In production systems with NUMA topology or NVLink-constrained setups, this I/O can be a significant bottleneck.
(b) The 10% budget experiments use a small context window. The “10% budget” result is compelling, but the paper’s LongBench experiments use context lengths of 4K–32K. At these lengths, the 10% cache is 400–3200 tokens, which is already a reasonable number. The truly challenging regime is 128K+ context with 10% budget (800 tokens), where the EMA locality assumption is much harder to satisfy. RULER experiments are included but at coarser granularity.
(c) Baseline comparison is incomplete. The paper compares to CaM, D2O, KVMerger, but notably does not compare against KIVI (quantization-based), GemFilter, or InfiniGen (which uses partial attention for long-context). Given that the paper claims to outperform “all prior KV cache compression methods,” these omissions weaken the claim.
(d) Ablation is too coarse. The ablation studies (Table in Section 4.3) test EV alone, ZIP alone, and EV+ZIP. But there is no ablation on: (i) effect of the EMA window size and decay , (ii) effect of refresh period , (iii) comparison of cosine-similarity candidate selection vs. other selection criteria. These are the most sensitive hyperparameters in the method, yet their sensitivity is not characterized.
(e) Theorem 3 requires precise attention scores. The “lossless” guarantee holds only when exact current-step scores and are available. At prefill, this is fine. But the paper extends KeepKV to compress during decoding as new tokens push the cache over budget — at these steps, EMA-predicted scores are used, so Theorem 3 no longer strictly applies. This is acknowledged in the discussion but deserves a cleaner separation in the main theorem presentation.
Limitations the Authors Understate or Omit
(a) The external KV storage scheme is a significant system requirement. Storing the full KV externally requires CPU RAM proportional to the full context length. For a 128K context with LLaMA-3-70B, this is roughly 128K × 8B × 2 (K+V) × 80 layers × 128 heads × 128 head-dim × 2 bytes ≈ tens of GB. The paper mentions external storage casually without quantifying the real-world memory cost or the impact when CPU RAM is also constrained.
(b) The method is not training-free in the strict sense. The EMA decay and window are tuned per-architecture. The paper reports these as fixed values, but real deployment may require retuning per model family, which introduces development overhead not present in zero-shot methods like H2O or SnapKV.
(c) Interaction with batched inference is not studied. The theory and experiments assume single-sequence inference. In batched serving (e.g., batch size 64), different sequences have different context lengths and attention score distributions. The vote counts and EMA states would need to be maintained per-sequence, per-head, per-layer — creating significant per-sequence metadata overhead.
Concrete Improvement Suggestions
-
Include latency breakdown. The paper should report end-to-end latency (not just throughput) with and without the periodic refresh, across different refresh periods . This would allow practitioners to choose with full knowledge of the latency-quality tradeoff.
-
Extend evaluation to truly long contexts. Experiments on RULER or LongBench Pro at 64K–128K context (not just 32K) with 10% budget would stress-test the locality assumption more rigorously.
-
Add quantization as a combined baseline. KeepKV is orthogonal to quantization (acknowledged in the paper). An experiment combining KeepKV + 4-bit KV quantization would be practically valuable and strengthen the paper’s claims about combined efficiency.
-
Sensitivity analysis for and . A sweep over EMA decay and refresh period would provide essential guidance for practitioners and characterize the method’s robustness.
-
Batched inference experiment. Demonstrate KeepKV working correctly in a batched-serving setup (e.g., using vLLM’s continuous batching), which is the dominant production use case. The per-sequence metadata overhead should be measured.
-
GQA/MLA extension. Explicitly derive and implement KeepKV for GQA (used in LLaMA-3), where multiple query heads share one KV head. The current vote count logic treats each KV head independently; with GQA, the merged KV might be shared across 4–8 query groups, changing the effective attention score used in ZIP-Merging.
Reproducibility Notes
- Code: Not yet publicly available at time of writing (v2, November 2025). The paper describes the algorithm in sufficient detail to re-implement.
- Models: LLaMA-3-8B/70B-Instruct (Meta), Mistral-7B-Instruct (Mistral AI). All publicly available.
- Benchmarks: LongBench (available on Hugging Face), RULER (github.com/hsiehjackson/RULER).
- Hyperparameters to tune: Budget ratio , EMA decay , EMA window , refresh period , number of recent tokens excluded from merging.
- Key gotcha: The ZIP-Merging formula involves which requires always (true for exponentials) and (fails only if both scores are exactly 1, i.e., for both candidates — an edge case requiring safeguard in implementation).
Summary and Takeaways
KeepKV makes a clean, theoretically grounded contribution to a well-trodden problem. The identification of “Attention Sag” as the fundamental failure mode of all prior merging methods, backed by a formal theorem, is a genuine insight. The Electoral Votes + ZIP-Merging combination is elegant: by changing what “attention score” means (multiplying by vote count) and deriving the unique merging formula that preserves this modified attention, KeepKV turns a previously lossy operation into a provably lossless one.
The multi-step extension is less clean — it requires a prediction assumption (EMA locality) and only achieves bounded rather than zero error — but the periodic refresh mechanism bridges the gap by resetting accumulated error.
For practitioners, the key takeaways are:
- If you are already using token eviction (H2O, SnapKV), switching to KeepKV-style merging should give meaningful quality improvements at no throughput cost, and significant throughput gains at aggressive budgets.
- The periodic refresh adds system complexity (external KV storage). For edge/embedded inference without ample CPU RAM, a non-refresh variant (bounded multi-step KeepKV) may be the practical choice.
- The method is orthogonal to quantization — combining KeepKV with 4-bit KV quantization is a natural next step that the paper leaves unexplored.
Next steps from this work: Extending the theory to GQA and Multi-head Latent Attention (MLA as in DeepSeek-V2), integrating with continuous batching schedulers (vLLM, SGLang), and deriving tighter multi-step bounds that adapt the refresh period online based on observed EMA error.
Deep Dive: Connecting KeepKV to Information Theory
It is worth thinking about why KeepKV works from a higher-level perspective. The fundamental question in KV cache compression is: what information should we preserve and how?
Eviction answers: keep the KV pairs whose removal causes the least perturbation to the current output. This is greedy and myopic — it ignores that importance can shift over time.
Naive merging answers: blend the less important KV pairs together so their information is not lost. This is better in principle, but the specific blending formula (convex combination) systematically underweights the merged pair’s contribution (Attention Sag).
KeepKV answers: represent the same information differently. Instead of trying to approximate two distinct vectors with one, it tracks the count of vectors that have been fused (Electoral Votes) and then derives the unique key vector that, together with this count, would produce the same attention output as the original two distinct vectors. The insight is that the attention output depends not on the key vectors alone, but on their interaction with the query — and by tracking this interaction across merges, KeepKV can preserve the output exactly.
This is analogous to lossless compression in information theory: you cannot reduce the number of symbols while preserving every possible output, but you can change the representation so that the decoder (here, the attention computation) produces the same result with fewer stored items, provided the decoder is also updated to use the new representation (here, the vote-weighted attention formula).
Why Prior Merging Methods Cannot Be Fixed by Better Weights
A natural question: could CaM, D2O, or KVMerger simply use better weights to avoid Attention Sag? The answer is no, and Theorem 2 is the proof. The Attention Sag is not about which weights you choose within the convex combination — it is a consequence of the fact that any convex combination of two vectors produces a single vector whose exponential inner product with any query is strictly less than the sum of the individual exponentials (by strict convexity of the exponential). You cannot escape Jensen’s inequality by tweaking and .
The only escape is to change what the downstream computation does — which is exactly what the Electoral Votes mechanism achieves.
The Space Overhead of Electoral Votes
The Electoral Votes mechanism adds one integer per KV entry. For a cache with active entries, this is integers. Compare to the KV tensors themselves: each entry is a -dimensional float16 vector for both key and value, so floats per entry. For (LLaMA-3 head dimension), that is 256 float16 values = 512 bytes per entry, vs. 4 bytes for an int32 vote count. The overhead is — entirely negligible.
The ZIP-Merging computation itself involves a few additional scalar operations per merge (log computations for the key formula), but these are trivially fast compared to the GPU memory bandwidth bottleneck.
Practical Implementation Notes
Integration with Existing Frameworks
KeepKV can be integrated into any LLM inference framework that exposes the KV cache. The key changes are:
- KV cache data structure: add an integer
votesfield alongside each (key, value) pair. - Attention computation: modify the attention kernel to multiply unnormalized scores by vote counts before softmax. Equivalently, modify the logits by adding to each attention logit.
- Compression step: implement the ZIP-Merging algorithm to select candidates (by cosine similarity), compute the merged key/value/votes, and update the cache.
- EMA state: maintain per-layer, per-head EMA states ( for each active cache entry).
The most performance-sensitive part is the cosine similarity computation for candidate selection, which is in the naive implementation. In practice, this can be approximated with locality-sensitive hashing (LSH) or FAISS-based nearest neighbor search for large budgets.
Numerical Stability Considerations
The ZIP-Merging key formula involves several numerical pitfalls:
-
Log of scores: is the raw logit before exponentiation. This is readily available in the computation graph and avoids computing then .
-
Division by small denominators: can be near zero when both logits are near zero. A simple fix: if , fall back to equal-weight merging (which is the standard convex combination case, but acknowledged as approximate).
-
Large vote counts: As merging cascades, vote counts can grow. For a sequence of 32K tokens compressed to 10% budget (3200 entries), the maximum possible vote count is 10 (32000/3200 = average merges). This is small and easily fits in an int8 field in practice.
Figure 7: Implementation Architecture Overview
graph TD
subgraph "Inference Engine (e.g. vLLM)"
A["Token Embedding"] --> B["Attention Layer"]
B --> C["KV Cache Manager"]
B --> D["Output Projection"]
end
subgraph "KeepKV Extension"
C --> E["Vote Counter p_i"]
C --> F["EMA Score Tracker S_i"]
E --> G["ZIP-Merging Engine"]
F --> G
G --> H["Cosine Sim Index (LSH/FAISS)"]
G --> C
I["External KV Store (CPU RAM)"] <--> C
end
style G fill:#d4edda,stroke:#28a745
style I fill:#cce5ff,stroke:#004085
Comparison with SnapKV and H2O in Depth
SnapKV and H2O are the two most widely deployed eviction methods. It is instructive to understand exactly where KeepKV surpasses them.
H2O (Heavy Hitter Oracle): Evicts tokens with the lowest cumulative attention score across past decoding steps. The key weakness: it uses past attention scores as a proxy for future importance, but as Figure 2(b) of the paper shows, many evicted tokens later become “heavy hitters.” KeepKV never evicts; it only compresses, so past importance is never forgotten.
SnapKV: Uses attention within an observation window at the end of the prompt to decide which tokens to evict. This is better than H2O for prompt-heavy long-context tasks because it makes eviction decisions based on the full prompt context. But it still permanently discards tokens, and the window-based heuristic can still miss important tokens outside the window.
Why KeepKV beats both at 10% budget: At very aggressive compression (90% of tokens removed), eviction methods are choosing to keep only the 10% of tokens with the highest scores — and still losing 90% of the context. KeepKV is keeping compressed representations of all tokens, so no context is ever lost, just encoded more compactly. At 10% budget, KeepKV maintains LongBench scores ~8 points higher than SnapKV (36.7 vs ~29 for SnapKV — author’s reported numbers), because the merged representations preserve the information that would otherwise be discarded.
Relation to Prior Theory: Compressive Transformers and DMC
Two relevant prior works deserve comparison:
Compressive Transformer (Rae et al., 2020): Uses a learned compression function to compress past activations into a fixed-size memory. KeepKV’s periodic refresh is conceptually related — both maintain an external memory of the full context. But Compressive Transformer requires training the compression function, whereas KeepKV is training-free and uses a theoretically derived merging formula.
Dynamic Memory Compression (DMC, Nawrot et al., 2024): Trains the model to decide when and how to merge tokens. DMC achieves higher compression than KeepKV in some regimes but requires retraining and cannot be applied to pre-trained models. KeepKV is a post-training method — it requires no model modification, making it compatible with any off-the-shelf LLM.
The key practical advantage of KeepKV over trained methods: you can apply it to any model immediately, without any fine-tuning, with competitive or superior performance on standard benchmarks.
Conclusion
KeepKV is a landmark result in KV cache compression for two reasons. First, it provides the first rigorous characterization of the failure mode of all prior merging methods (Attention Sag via Theorem 2). Second, it derives the unique correct formula that fixes this failure mode (ZIP-Merging via Theorem 3). The resulting method is elegant, theoretically principled, and empirically strong.
The work is not without limitations — the multi-step extension requires assumptions, the system overhead of external KV storage is real, and the GQA extension is left for future work. But as a step forward in the theory of KV cache compression, KeepKV sets a new standard for what “principled merging” means.
For anyone working on LLM inference optimization, this paper is essential reading. The key conceptual contribution — that attention computation can be redefined rather than approximated after merging — opens a new design space worth exploring further.
Extended Analysis: EMA Decay Parameter
The EMA decay is arguably the most important hyperparameter in KeepKV’s multi-step extension. Understanding its behavior helps build intuition for when the method works well and when it might struggle.
High (e.g., 0.99): The EMA tracks the long-run average of attention scores, changing slowly. This is good when attention patterns are stable over long contexts (e.g., summarization tasks where the entire document is relevant throughout). It is bad when attention shifts rapidly (e.g., question-answering at the end of a long document where only the question tokens matter).
Low (e.g., 0.8): The EMA responds quickly to recent changes, adapting to sudden shifts in which tokens are important. This is good for dynamic contexts but can be noisy — a single outlier step can push the EMA far from the true average.
The bias correction term ensures the EMA is not artificially small early in decoding (before it has seen enough steps to converge). Without it, the EMA at step would be , which for would be — far too small. With correction, it equals at , then gradually shifts toward the exponentially weighted average.
Practical tuning strategy: The paper reports good results with and window size . A reasonable heuristic: set such that the effective memory length is roughly of the context length. For a 32K context, gives memory length , which is too short. A value like (memory ) might be more appropriate.
Figure 8: EMA Sensitivity — Conceptual Diagram
xychart-beta
title "EMA Tracking Quality vs. Alpha (Conceptual)"
x-axis ["Step 1", "Step 100", "Step 500", "Step 1000", "Step 2000"]
y-axis "EMA Error" 0 --> 10
line [5, 4, 3, 2, 1]
line [5, 3, 2, 1, 0.5]
line [5, 2, 1, 0.8, 0.7]
(Blue=low α, Orange=medium α, Green=high α. High α converges slower but tracks stable attention patterns better over long runs.)
Why the Locality Assumption Holds Empirically
The key theoretical assumption underlying KeepKV’s multi-step extension is that attention scores change smoothly across decoding steps. This “locality” has both intuitive and empirical support.
Intuitive argument: A token’s attention score at step reflects how much the model’s current hidden state “cares about” that past token. As generation proceeds, changes — but it changes continuously (the model processes one new token at a time). So the dot products also change continuously, meaning attention scores evolve smoothly rather than jumping discontinuously.
Empirical support: Figure 2(c) of the paper shows that the variance of a token’s attention score across adjacent steps (blue) is much larger than the variance within a sliding window (orange). This means within-window prediction (which is what EMA uses) is significantly more accurate than across-window prediction — exactly the locality property KeepKV relies on.
When it breaks down: The locality assumption can fail for:
- Very long contexts where attention patterns change drastically between the beginning and end of the document.
- Multi-turn conversation tasks where the topic shifts sharply between turns.
- Speculative decoding contexts where the draft model produces very different queries from the target model.
In these cases, a shorter EMA window or more frequent periodic refreshes would be needed.
Memory and Compute Analysis
Understanding the overhead profile of KeepKV is essential for deployment decisions.
Memory overhead:
| Component | Size per layer/head | Notes |
|---|---|---|
| Vote counts | bytes (int32) | ~0.78% of KV size |
| EMA states | bytes (fp16) | Same size as K cache |
| External KV | bytes | Full K+V on CPU |
For LLaMA-3-8B (32 layers, 32 heads, ) at and (10% budget):
- Vote counts: MB (negligible)
- EMA states: same as K cache = MB
- External KV: GB on CPU
So the total external memory requirement is ~17 GB for LLaMA-3-8B at 32K context. For LLaMA-3-70B (80 layers, 64 heads) at 128K context, this would be ~280 GB — requiring a well-equipped server.
Compute overhead:
The dominant overhead is the candidate selection (cosine similarity). For a budget , naive selection at every compression step is costly. At , this is ~10.7M similarity computations. Amortized over 10 steps between compressions, it is ~1.07M/step. This is manageable but non-trivial.
In practice, the paper reports less than 5% total overhead from the KeepKV bookkeeping, suggesting the cosine similarity computation is efficient enough at their tested budget sizes.
Open Questions and Future Directions
1. Adaptive refresh scheduling. The current design uses a fixed refresh period . An adaptive scheme that monitors the cumulative EMA prediction error and triggers a refresh when the error exceeds a threshold would be more principled — and could save unnecessary I/O when attention patterns are stable.
2. Hierarchical merging. Currently, KeepKV merges KV pairs one at a time. A hierarchical approach (merge groups of similar tokens simultaneously, like a tree-structured compression) could be faster and potentially more principled.
3. Cross-layer merging. KeepKV merges within a single layer. Different layers have very different attention patterns; coordinating merging decisions across layers (e.g., merge a token at layer 5 only if it is also low-importance at layers 1–4) could improve overall quality.
4. Learned merge scheduling. While the ZIP-Merging formula is theoretically optimal given the current query, the choice of when to merge (which step to trigger compression) and which candidates to merge (beyond cosine similarity) could potentially be improved with a lightweight learned policy.
5. Integration with prefix caching. Modern serving systems (vLLM, SGLang) cache the KV of system prompts across requests. KeepKV’s external KV storage is compatible with prefix caching — the full system prompt KV could be stored once and shared across requests, with per-request compressed KV for the dynamic portion.