- Review date: 2026-06-17
- Review author: Zhongzhu Zhou
- Paper reviewed: OScaR: The Occam’s Razor for Extreme KV Cache Quantization in LLMs and Beyond
- Paper authors: Zunhai Su, Rui Yang, Chao Zhang, et al.
- arXiv: https://arxiv.org/abs/2605.19660
- Status/Venue: Preprint, arXiv May 2026
Short Answer
OScaR is a training-free method that pushes KV cache quantization to INT2 (2-bit) precision with near-lossless accuracy by diagnosing and surgically fixing a previously under-examined root cause: Token Norm Imbalance (TNI) — the fact that a small subset of tokens (Attention Sinks) carry anomalously low ℓ2 norms, which inflates per-channel quantization step sizes and wastes representational bits. The fix is a two-component pipeline: a Hadamard-based Canalized Rotation that disperses channel-wise outliers, followed by Omni-Token Scaling that normalizes inter-token norm disparity. On Llama-3.1-8B at 128K context, OScaR delivers a 3.0× latency reduction and a 5.3× memory footprint reduction relative to BF16, while matching 16-bit accuracy on LongBench-E.
Prerequisites: What You Need to Know First
Before dissecting OScaR’s machinery, this section crystallizes the essential background that the rest of the review assumes. Readers already comfortable with KV caching, INT quantization, and the Hadamard transform may skip ahead.
The KV Cache and Why It Grows
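During autoregressive decoding in a transformer, each token at position $t$ attends to all previous tokens via the familiar scaled dot-product attention:

$$\mathrm{Attn}(q_t, K, V) = \mathrm{softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_h}}\right) V$$

where $q_t \in \mathbb{R}^{1 \times d_h}$ is the current-token query, $K \in \mathbb{R}^{t \times d_h}$ is the key matrix accumulated over all past tokens, and $V \in \mathbb{R}^{t \times d_h}$ is the value matrix. Without caching, computing $K$ and $V$ from scratch at each step costs $O(t)$ matrix-vector products, which is quadratic in total. The standard fix is the KV cache: store $K$ and $V$ for every past token and reuse them.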
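The downside is memory. For a model with $L$ layers, $H$ attention heads, head dimension $d_h$, precision $b$ bits, and sequence length $T$, the KV cache occupies:

$$M_{\text{KV}} = 2 \cdot L \cdot H \cdot d_h \cdot T \cdot \frac{b}{8} \ \text{bytes}$$

For Llama-3.1-8B ($L = 32$, $H = 32$, $d_h = 128$) at BF16 ($b = 16$) and $T = 128\text{K}$:

$$M_{\text{KV}} = 2 \cdot 32 \cdot 32 \cdot 128 \cdot 128{,}000 \cdot 2 \ \text{bytes} \approx 67 \ \text{GB}$$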
The model weights themselves are only ~16 GB. The KV cache overwhelms GPU memory for long contexts, severely limiting batch size and throughput. INT2 quantization reduces per-element storage from 16 bits to 2 bits — an 8× compression ratio, shrinking that 67 GB to ~8 GB.
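The sizing above can be checked with a few lines of Python. This is a minimal sketch: the helper name `kv_cache_bytes` is hypothetical, and the configuration (32 layers, 32 heads, head dimension 128, 128,000 tokens) is the one implied by this review's 67 GB figure rather than a value quoted from the paper.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits, batch=1):
    """Bytes needed to cache keys and values (the leading 2) in every layer."""
    return 2 * layers * heads * head_dim * seq_len * batch * bits // 8


# Assumed configuration matching the review's numbers (illustrative only).
bf16 = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=128_000, bits=16)
int2 = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=128_000, bits=2)

print(f"BF16: {bf16 / 1e9:.1f} GB")
print(f"INT2: {int2 / 1e9:.1f} GB")
print(f"compression: {bf16 // int2}x")
```

Passing the number of key/value heads (rather than query heads) as `heads` gives the cache size for a grouped-query attention model.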
Uniform Quantization: Mechanics and Error
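Uniform $b$-bit quantization maps a floating-point value $x$ to an integer in $\{0, 1, \dots, 2^b - 1\}$. Given a block of values with minimum $x_{\min}$ and maximum $x_{\max}$:

$$\Delta = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad Q(x) = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x - x_{\min}}{\Delta}\right),\ 0,\ 2^b - 1\right), \qquad \hat{x} = x_{\min} + \Delta \cdot Q(x)$$

The worst-case reconstruction error for a single element is bounded by half a step:

$$|x - \hat{x}| \le \frac{\Delta}{2} = \frac{x_{\max} - x_{\min}}{2\,(2^b - 1)}$$

At INT2, we have only 4 discrete levels ($2^2 = 4$). The step size equals one-third of the entire range:

$$\Delta_{\text{INT2}} = \frac{x_{\max} - x_{\min}}{3}$$

Compare this to INT8, where $\Delta_{\text{INT8}} = (x_{\max} - x_{\min}) / 255$, roughly 85× more granular. The message: at INT2, the dynamic range must be kept as small as possible, or quantization error explodes.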
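The min/max scheme described above can be sketched in a few lines of Python. The function names and the sample block are illustrative assumptions, not code from the paper; the point is only to show the half-step error bound and the 85× step ratio between INT2 and INT8.

```python
def quantize(values, bits):
    """Asymmetric uniform quantization of a list of floats to `bits`-bit codes."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0  # guard against a constant block
    codes = [min(levels, max(0, round((v - lo) / step))) for v in values]
    return codes, step, lo


def dequantize(codes, step, lo):
    """Map integer codes back to floating-point reconstruction levels."""
    return [lo + c * step for c in codes]


block = [-1.0, -0.2, 0.1, 0.4, 2.0]  # illustrative values only
for bits in (2, 8):
    codes, step, lo = quantize(block, bits)
    recon = dequantize(codes, step, lo)
    worst = max(abs(a - b) for a, b in zip(block, recon))
    print(f"{bits}-bit: step={step:.4f}, worst error={worst:.4f}, bound={step / 2:.4f}")
```

The worst error never exceeds half a step, so shrinking the block's range is the only way to shrink the error once the bit width is fixed.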
Attention Sinks
One of the most reliably reproduced empirical phenomena in modern LLMs is the existence of Attention Sink tokens. These are typically the first one or two tokens in a sequence (often BOS or punctuation), which receive disproportionately large attention weights from almost every query and every head. The current interpretation is that sinks serve as “parking” positions for the model to distribute probability mass when no past token is strongly relevant.
The key observation for OScaR: Attention Sink tokens tend to have anomalously small key norms compared to ordinary tokens. The absolute values in their key vectors are small, but the model still assigns them large attention weights — a structural property of the softmax mechanism that OScaR exploits to explain quantization degradation.
The Fast Hadamard Transform
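The Hadamard transform $H_d$ (where $d$ is a power of 2) is a real orthogonal transform defined recursively:

$$H_1 = \begin{bmatrix} 1 \end{bmatrix}, \qquad H_{2d} = \frac{1}{\sqrt{2}} \begin{bmatrix} H_d & H_d \\ H_d & -H_d \end{bmatrix}$$

For $d = 4$, the explicit matrix is:

$$H_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

Notice every entry is $\pm \tfrac{1}{2}$. For general $d$, every entry is $\pm \tfrac{1}{\sqrt{d}}$.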
Key properties exploited by OScaR:
- Orthogonality: $H_d^\top H_d = I$, so the transform is lossless and its inverse equals its transpose: $H_d^{-1} = H_d^\top$.
- Energy equalization: A sparse vector (one large entry, rest near-zero) is spread uniformly across all dimensions after $H_d$: each output dimension receives a contribution of $\pm 1/\sqrt{d}$ times the original large entry. If the input is $x = M e_j$ (an outlier of size $M$ in dimension $j$), then $H_d x = \frac{M}{\sqrt{d}} (\pm 1, \dots, \pm 1)^\top$, so the energy is now spread over $d$ dimensions each with magnitude $M / \sqrt{d}$.
- Norm preservation: $\lVert H_d x \rVert_2 = \lVert x \rVert_2$ for all $x$, since $H_d$ is orthogonal.
- Fast computation: Via the Fast Walsh-Hadamard algorithm in $O(d \log d)$ rather than $O(d^2)$, executable on GPU in a single fused CUDA kernel.
The energy-equalization property is the essential insight: outlier channels (large values in one dimension) become diffuse after the Hadamard transform, equalizing the dynamic range across all channels.
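The properties listed above can be checked with a small pure-Python fast Walsh-Hadamard transform. This is a reference sketch for illustration (the function name `fwht` and the test vector are assumptions), not the fused GPU kernel the paper describes.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:  # log2(n) butterfly stages, O(n log n) in total
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in x]


d = 128
outlier = [0.0] * d
outlier[5] = 10.0  # a single large channel (illustrative)
rotated = fwht(outlier)

print("largest entry after rotation:", max(abs(v) for v in rotated))
print("norm after rotation:", math.sqrt(sum(v * v for v in rotated)))
restored = fwht(rotated)  # the normalized transform is its own inverse
print("round-trip error:", max(abs(a - b) for a, b in zip(restored, outlier)))
```

The single outlier of size 10 becomes 128 entries of magnitude $10/\sqrt{128}$, while the vector norm is unchanged.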
Prior Work: Rotation-Based Quantization
The idea of using orthogonal rotations to smooth outliers before quantization is not unique to OScaR — it builds on a lineage that includes QuaRot (Ashkboos et al., 2024) and SpinQuant (Liu et al., 2024), which apply random Hadamard rotations to weight quantization. OScaR’s novelty is applying this principle specifically to the KV cache in the online decoding regime, and identifying that rotation alone is insufficient without the subsequent token-norm equalization step. The combination and the TNI diagnostic are new.
Problem: Why KV Cache Memory Is the Bottleneck
To motivate why OScaR targets INT2 specifically — rather than, say, INT4 which is already well-studied — consider the practical pressure points in production LLM serving.
Memory-Bound vs Compute-Bound Regimes
Modern LLM inference is typically memory-bandwidth-bound during decoding. Loading KV cache tensors from HBM to SRAM dominates latency. With KV cache at BF16:
[GPU HBM] ──(67 GB at 128K context)──> [SRAM] ──> Attention
↑
bottleneck: HBM bandwidth ~3.35 TB/s (H100)
At INT2, the same data is only ~8 GB — loading it takes 8× less time. The arithmetic intensity (FLOPs per byte loaded) improves dramatically, enabling the 3× latency reduction OScaR reports.
Batch Size and Throughput
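For an LLM serving multiple users concurrently (batch size $B$), total KV cache memory scales linearly in $B$. At BF16, Llama-3.1-8B with 4K context and batch=48 requires (32 layers, 32 heads, head dimension 128, keys and values, 2 bytes per element):

$$M_{\text{KV}} = 2 \cdot 32 \cdot 32 \cdot 128 \cdot 4096 \cdot 48 \cdot 2 \ \text{bytes} \approx 103 \ \text{GB}$$

This exceeds a single 40 GB A100. INT2 brings it to ~13 GB, which fits comfortably on a single card, with headroom for larger batches or longer contexts.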
The INT4 vs INT2 Trade-off
INT4 quantization (e.g., with methods like KVQuant) is already practical and widely deployed. Moving from INT4 to INT2 doubles memory compression again but increases quantization error by a factor of ~5 (step size scales inversely with $2^b - 1$, so going from $b = 4$ to $b = 2$ gives $15/3 = 5\times$ coarser steps). The challenge OScaR accepts: can we restructure the KV cache representation so that INT2 quantization error remains small enough for near-lossless task performance?
Background: KIVI and the Per-Channel Paradigm
The most important prior baseline for OScaR is KIVI (Key-Value INT quantization), the method that first established viable INT2 KV cache quantization for practical-length sequences. Understanding KIVI’s design choices and failure modes is essential to appreciating what OScaR fixes.
KIVI’s Asymmetric Quantization Strategy
KIVI applies different quantization axes to keys and values, motivated by their different statistical distributions:
For Keys (per-channel quantization): Transformers consistently exhibit large outliers along specific channel (head-dimension) axes for key tensors, but relatively uniform distributions across token (sequence) axes. Per-channel quantization assigns one step size per column across a group of tokens:

$$\Delta_{j,g} = \frac{\max_{i \in \mathcal{G}_g} K_{i,j} - \min_{i \in \mathcal{G}_g} K_{i,j}}{2^b - 1}$$

where $\mathcal{G}_g$ is the set of token positions in group $g$ of size $G$.
For Values (per-token quantization): Value tensors have relatively uniform distributions across channels for a given token, but large inter-token variation. Per-token quantization is therefore appropriate:

$$\Delta_i = \frac{\max_j V_{i,j} - \min_j V_{i,j}}{2^b - 1}$$

Residual buffer: Because quantization errors accumulate on newly-arriving tokens before enough context exists for reliable statistics, KIVI maintains the most recent $R$ tokens (typically $R = 128$) in full BF16 precision. Once a token is more than $R$ positions behind the current one, it gets committed to the quantized cache.
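The difference between the two quantization axes can be seen on a tiny key block. The sketch below uses made-up values in which one channel carries a consistent outlier, as described for keys; the helper `step_size` is a hypothetical name for the INT2 min/max step.

```python
def step_size(values, bits=2):
    """Min/max quantization step for one group of values."""
    return (max(values) - min(values)) / (2 ** bits - 1)


# Toy key block: rows are tokens, columns are channels (illustrative values).
# Channel 2 is an outlier channel that is large for every token.
keys = [
    [0.1, -0.2, 9.0, 0.3],
    [0.2, -0.1, 9.4, 0.1],
    [-0.1, 0.3, 8.8, 0.2],
]

# Per-channel: one step per column, computed across the tokens in the group.
per_channel = [step_size([row[j] for row in keys]) for j in range(4)]
# Per-token: one step per row, computed across that token's channels.
per_token = [step_size(row) for row in keys]

print("per-channel steps:", [round(s, 3) for s in per_channel])
print("per-token steps:  ", [round(s, 3) for s in per_token])
```

With a channel-aligned outlier, every per-token step is dominated by the outlier channel, while each per-channel step only has to cover that channel's own spread.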
KIVI’s Architecture
graph LR
subgraph KIVI Pipeline
A[New token k_t] --> B{t > R?}
B -- No --> C[BF16 residual buffer]
B -- Yes --> D[Per-channel block quant]
D --> E[INT2 K cache]
C --> F[Attention]
E --> G[Dequant INT2 → BF16]
G --> F
end
Figure 1: KIVI KV cache pipeline. The residual BF16 buffer holds the most recent tokens; older tokens are committed to the INT2 quantized cache.
Where KIVI Breaks at INT2
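At INT2, the step size covers one-third of the entire dynamic range. Within a group of $G$ tokens (positions $\mathcal{G}_g$), the per-channel step size is:

$$\Delta_{j,g} = \frac{D_{j,g}}{3}, \qquad D_{j,g} = \max_{i \in \mathcal{G}_g} K_{i,j} - \min_{i \in \mathcal{G}_g} K_{i,j}$$

where $D_{j,g}$ is the within-group range for channel $j$.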
KIVI’s LongBench-E score at INT2 drops from the BF16 baseline of 41.70% to 39.84%, a 1.86 percentage-point gap. On some specific subtasks the drop is much larger. OScaR eliminates the gap on average, scoring 41.75% against 41.70% for BF16 (+0.05 pp). The question is: why does KIVI’s gap exist, and what specific mechanism creates it?
Diagnosing the Root Cause: Token Norm Imbalance
The core intellectual contribution of OScaR is not the fix, but the diagnosis. The paper introduces the concept of Token Norm Imbalance (TNI) and traces the performance degradation in KIVI back to this identifiable, measurable cause.
Defining Token Norm Imbalance
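For each attention head $h$ and token position $t$, define the ℓ2 norm of the key vector:

$$n_t^{(h)} = \lVert k_t^{(h)} \rVert_2 = \sqrt{\sum_{j=1}^{d_h} \left(k_{t,j}^{(h)}\right)^2}$$

Let $\mathbf{n}_t$ denote the collection of these norms across all heads for token $t$:

$$\mathbf{n}_t = \left( n_t^{(1)}, \dots, n_t^{(H)} \right)$$

TNI is defined as the substantial disparity in $\mathbf{n}_t$ across different token positions within the same quantization block. Empirically, the paper finds that Attention Sink tokens consistently have ℓ2 norms 5–50× smaller than ordinary tokens.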
Why Norm Imbalance Breaks Per-Channel Quantization
Consider a quantization group $\mathcal{G}_g$ containing one Attention Sink token (with small norm $n_{\text{sink}}$) and $G - 1$ ordinary tokens (with typical norm $n_{\text{ord}} \gg n_{\text{sink}}$).
For channel $j$, the key values from ordinary tokens occupy an interval $[a_j, b_j]$ whose endpoints scale with $n_{\text{ord}}$. The Attention Sink token’s key values are proportionally small: near zero for channel $j$ in expectation. But the exact sign and magnitude of the Attention Sink key values are incoherent with the ordinary tokens: they may be slightly positive or negative in a channel where ordinary tokens cluster, or they may pull the minimum or maximum of the range in unexpected ways.
The quantization range for channel $j$ in group $g$ is:

$$D_{j,g} = \max_{i \in \mathcal{G}_g} k_{i,j} - \min_{i \in \mathcal{G}_g} k_{i,j}$$

The reconstruction error for any element in this group is bounded by:

$$|k_{i,j} - \hat{k}_{i,j}| \le \frac{\Delta_{j,g}}{2} = \frac{D_{j,g}}{2\,(2^b - 1)}$$

TNI inflates $D_{j,g}$ in the following way: even though the Attention Sink token has a small norm overall, the relative placement of its channel values can expand the observed range beyond what ordinary-token variance alone would produce. More precisely, since Attention Sink key vectors have small norms, their individual channel entries are small, but this “small” value may lie on the opposite side of zero from the ordinary tokens’ cluster in that channel, effectively expanding $D_{j,g}$.
The theoretical reconstruction error bound (OScaR Appendix G, Eq. 11) states that the expected per-channel quantization error is governed by the inter-token norm variance within the group. TNI directly drives that variance upward through the presence of low-norm sink tokens, which in turn amplifies quantization error.
Worked Example: TNI in a Single Quantization Block
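To make TNI concrete, consider a toy example with a single channel, $G = 4$ tokens, and $b = 2$. Token 0 is an Attention Sink (key value 0.05); tokens 1–3 are ordinary (values near 8, one of them 8.3).
With the sink in the group, the per-channel quantization range is at least:

$$D = 8.3 - 0.05 = 8.25$$

The INT2 step size is then:

$$\Delta = \frac{D}{3} = 2.75$$

so any token in the group, ordinary or sink, can be reconstructed with an error of up to $\Delta / 2 \approx 1.38$. Ordinary tokens near 8 and the sink near 0 are forced onto the same four levels spaced 2.75 apart.
Now suppose we remove the sink token and only quantize tokens 1–3. The range shrinks to the spread of the ordinary values alone, and the step size and maximum error shrink in proportion: a 4× improvement in precision for tokens 1–3 in this example. The Attention Sink’s presence forces the entire group to use a coarse step size that wastes precision on ordinary tokens. OScaR’s OTS step normalizes all tokens to unit norm, effectively collapsing this 8.3 vs 0.05 disparity before quantization.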
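The effect in the worked example can be reproduced numerically. The sketch below uses hypothetical channel values (one sink near zero, three ordinary tokens near 8) chosen only for illustration; the exact improvement ratio depends on how tightly the ordinary values cluster.

```python
def int2_step_and_error(values):
    """Step size and worst reconstruction error of 2-bit min/max quantization."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3  # 4 levels give 3 intervals
    worst = 0.0
    for v in values:
        code = min(3, max(0, round((v - lo) / step)))
        worst = max(worst, abs(v - (lo + code * step)))
    return step, worst


# Hypothetical values for one channel across a group of four tokens.
with_sink = [0.05, 7.9, 8.3, 8.1]  # token 0 is the low-norm sink
without_sink = with_sink[1:]       # ordinary tokens only

step_a, err_a = int2_step_and_error(with_sink)
step_b, err_b = int2_step_and_error(without_sink)

print(f"with sink:    step={step_a:.3f}, worst error={err_a:.3f}")
print(f"without sink: step={step_b:.3f}, worst error={err_b:.3f}")
```

A single near-zero entry stretches the range from the ordinary tokens' spread to almost the full magnitude of the channel, and every token in the group pays for it.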
Empirical Evidence for TNI
The paper provides visualization of token-wise ℓ2 norms across layers in Llama-3.1-8B. The pattern is consistent:
- Token positions 0 and 1 (BOS and the first real token) have norms roughly 10–20× smaller than the median token norm.
- This low-norm pattern persists across all layers and all heads.
- When a group of tokens is quantized per-channel, the probability that at least one Attention Sink falls within the group is high (roughly of groups for standard long-context inputs, but the first group always contains sinks).
- The groups containing Attention Sinks show measurably higher reconstruction error in channel-wise metrics.
graph LR
subgraph LOW["Low-Norm Attention Sinks"]
S0["Sink pos=0<br/>ℓ2 norm ≈ 0.4"]
S1["Sink pos=1<br/>ℓ2 norm ≈ 0.6"]
end
subgraph NORM["Normal Tokens"]
T2["t=2 norm≈7.2"]
T3["t=3 norm≈8.1"]
T4["t=4 norm≈7.8"]
T5["t=5 norm≈8.4"]
end
S0 -- "~15× below median" --> T2
S1 --> T3
Figure 2: Conceptual illustration of Token Norm Imbalance. Attention Sink tokens at positions 0 and 1 have norms roughly 15× below the mean of ordinary tokens. Within a quantization block containing a sink, the effective dynamic range for per-channel quantization expands, increasing step size and reconstruction error.
Why Direct Scaling Fails: The Outlier Artifact Trap
Given the diagnosis — low-norm sink tokens inflating the per-channel range — the obvious fix is to normalize token norms before quantization. Apply a per-token scale to bring all tokens to unit norm, quantize, then multiply back at dequantization. This is called direct token-wise scaling and it fails. Understanding why is as important as understanding OScaR’s solution.
The Scaling-Induced Outlier Artifact
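When a token has very small norm $n_t$, scaling by $1/n_t$ amplifies every channel uniformly. Consider channel $j$: for ordinary tokens, $k_{i,j} / n_i$ stays within a modest interval (in normalized units). For the Attention Sink, $k_{s,j}$ is small in absolute terms, but after scaling it becomes $k_{s,j} / n_s$, potentially a very large number relative to the ordinary tokens' normalized values in that channel.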
The critical issue: the Attention Sink’s key vector, when enlarged to unit norm, can have large entries in channels where ordinary tokens have near-zero entries. This is because the Attention Sink’s unit-norm direction is not aligned with ordinary tokens’ unit-norm direction. The Hadamard transform’s energy-equalization property prevents this misalignment from causing outliers — but without the Hadamard transform, scaling the sink token first creates a new kind of channel-wise outlier.
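Formally, let $k_s = n_s u_s$ where $u_s$ is the unit-norm direction and $n_s = \lVert k_s \rVert_2$ (and likewise $u_i = k_i / \lVert k_i \rVert_2$ for every token $i$ in the group $\mathcal{G}_g$). After scaling, the contribution to channel $j$ of the sink token is $u_{s,j}$. If $|u_{s,j}|$ is large in a channel where ordinary tokens have small values, then the post-scaling range

$$\max_{i \in \mathcal{G}_g} u_{i,j} - \min_{i \in \mathcal{G}_g} u_{i,j}$$

can be larger than the range spanned by the ordinary tokens alone. Direct scaling has traded the original TNI problem for a new channel outlier problem, and at INT2, either problem is fatal.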
Illustration: Why Order Matters
flowchart TD
A["Raw Keys: TNI present\n(sink: norm≈0.5, ord: norm≈8)"] --> B{Direct Scaling?}
B -- "Yes (wrong order)" --> C["Sink amplified to unit norm\nCreates channel outliers\nPer-channel range EXPANDS\nQuantization error WORSE"]
B -- "No" --> D[Canalized Rotation first]
D --> E["Outliers spread across d dims\nNo single channel dominates\nScaling now SAFE"]
E --> F[Omni-Token Scaling]
F --> G["Uniform norms\nSmall per-channel range\nINT2 error SMALL"]
Figure 3: Order dependency in OScaR. Applying token scaling before Canalized Rotation triggers the Scaling-Induced Outlier Artifact, expanding the per-channel dynamic range. OScaR applies Canalized Rotation first to eliminate channel outliers, then Omni-Token Scaling is safe.
The OScaR Framework: Algorithm and Theory
OScaR’s two components — Canalized Rotation (CR) and Omni-Token Scaling (OTS) — are mutually necessary: CR alone reduces channel outliers but leaves TNI unaddressed; OTS alone triggers the Outlier Artifact. Together, they address the full TNI problem.
Component 1: Canalized Rotation (CR)
Motivation. Before any token scaling, we must ensure that the per-channel dynamic range is already small and that no single channel dominates. The Fast Hadamard Transform accomplishes this by redistributing energy uniformly.
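The Hadamard Transform Applied to Keys. For each token $t$ at head $h$, apply $H_{d_h}$:

$$\tilde{k}_t^{(h)} = H_{d_h} \, k_t^{(h)}$$

After this transform, if channel $j$ had a large value $M$ (an outlier), its contribution to any output dimension is:

$$\pm \frac{M}{\sqrt{d_h}}$$

The outlier energy is diluted by $\sqrt{d_h}$ and spread across all dimensions. For $d_h = 128$, each dimension receives at most $M / \sqrt{128} \approx 0.09\,M$, an order-of-magnitude reduction.
Key invariant preserved: Since $H_{d_h}$ is orthogonal:

$$\lVert \tilde{k}_t^{(h)} \rVert_2 = \lVert H_{d_h} k_t^{(h)} \rVert_2 = \lVert k_t^{(h)} \rVert_2$$

The Hadamard transform does not change token norms. TNI persists after CR, but channel outliers are eliminated.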
Handling Queries. Attention requires $q^\top k$. After applying $H_{d_h}$ to keys, the query must also be transformed to preserve correctness:

$$(H_{d_h} q)^\top (H_{d_h} k) = q^\top H_{d_h}^\top H_{d_h} k = q^\top k$$

since $H_{d_h}^\top H_{d_h} = I$. So applying identical Hadamard transforms to both $q$ and $k$ is lossless: the attention logits are unchanged. In practice, the query transform is applied online (per decoding step), while the key transform is fused into the key projection weight matrix offline.
Handling Values. Values use per-token quantization which is less sensitive to channel outliers. The Hadamard transform can still improve value quantization: apply $H_{d_h}$ to both the value projection weight matrix $W_V$ and the output projection weight matrix $W_O$ offline. At inference, $\tilde{v}_t = \tilde{W}_V x_t$ automatically has the Hadamard baked in (no per-token online computation), and the attention output is:

$$\tilde{W}_O \sum_i \alpha_i \tilde{v}_i = W_O H_{d_h}^\top H_{d_h} \sum_i \alpha_i v_i = W_O \sum_i \alpha_i v_i$$

The merged matrices $\tilde{W}_V = H_{d_h} W_V$ and $\tilde{W}_O = W_O H_{d_h}^\top$ are computed once offline, adding zero runtime overhead.
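The losslessness of rotating both queries and keys can be verified numerically. The following is a small sketch with random vectors (the dimension, seed, and helper names are arbitrary assumptions); it checks only the identity $(Hq)^\top (Hk) = q^\top k$, not any part of the paper's kernels.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d = 64  # illustrative head dimension
q = [random.gauss(0.0, 1.0) for _ in range(d)]
k = [random.gauss(0.0, 1.0) for _ in range(d)]

plain = dot(q, k)                # logit in the original space
rotated = dot(fwht(q), fwht(k))  # logit after rotating both vectors

print("difference between logits:", abs(plain - rotated))
```

The two logits agree up to floating-point rounding, which is why the rotation can be applied before quantization without changing attention in exact arithmetic.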
Component 2: Omni-Token Scaling (OTS)
Motivation. After Canalized Rotation, channel outliers are gone, so scaling individual tokens is now safe. We now address the remaining TNI: the vast norm disparity between Attention Sink tokens and ordinary tokens.
The Scaling Procedure. For each token $t$, compute its ℓ2 norm in the rotated space:

$$s_t = \lVert \tilde{k}_t \rVert_2$$

(same magnitude as the original, since $H_{d_h}$ is orthogonal). Normalize:

$$\hat{k}_t = \frac{\tilde{k}_t}{s_t}$$

Now every token has unit ℓ2 norm. The inter-token norm variance is exactly zero, so the norm-variance term in the quantization error bound from the TNI theory vanishes.
In practice the bound is not exactly zero (the variance of channel values across unit-norm vectors is not zero), but the dominant source of quantization error, norm disparity, is eliminated.
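Storage and Dequantization. The scalar $s_t$ must be stored alongside the quantized key for later recovery. The storage cost is one BF16 scalar per token per head in every layer:

$$M_{\text{scale}} = L \cdot H \cdot T \cdot 2 \ \text{bytes}$$

For $L = 32$ layers, $H = 32$ heads, and $T = 128\text{K}$ tokens this is about 262 MB versus the ~8 GB INT2 KV cache, roughly 3% overhead (2 bytes of scale for every 32 bytes of packed INT2 key at $d_h = 128$).
At dequantization, the original scaled key is recovered:

$$\tilde{k}_t^{\text{rec}} = s_t \cdot \mathrm{Dequant}(Q_t)$$

and the attention logit uses this recovered value.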
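The scale, quantize, and recover steps can be traced end to end on a toy group. This is a minimal sketch under stated assumptions: the keys are made-up values standing in for already-rotated keys, and the function names are hypothetical rather than taken from the paper's code.

```python
import math


def normalize(key):
    """Scaling step: return the unit-norm key and the scale to store."""
    s = math.sqrt(sum(v * v for v in key))
    return [v / s for v in key], s


def quantize_group_per_channel(group, bits=2):
    """Min/max quantize each channel (column) across the tokens (rows)."""
    levels = 2 ** bits - 1
    n_ch = len(group[0])
    params, codes = [], [[0] * n_ch for _ in group]
    for j in range(n_ch):
        col = [row[j] for row in group]
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels if hi > lo else 1.0
        params.append((step, lo))
        for i, v in enumerate(col):
            codes[i][j] = min(levels, max(0, round((v - lo) / step)))
    return codes, params


def recover(codes_row, params, s):
    """Dequantize one token and multiply its stored scale back in."""
    return [s * (lo + c * step) for c, (step, lo) in zip(codes_row, params)]


# Illustrative keys: token 0 plays the low-norm sink.
keys = [[0.03, -0.02, 0.04, -0.01],
        [4.1, -3.9, 4.2, -4.0],
        [3.8, -4.2, 4.0, -4.1],
        [4.3, -4.0, 3.9, -3.8]]

unit, scales = zip(*(normalize(k) for k in keys))
codes, params = quantize_group_per_channel(unit)
recovered = [recover(c, params, s) for c, s in zip(codes, scales)]

for key, rec in zip(keys, recovered):
    print([round(v, 3) for v in key], "->", [round(v, 3) for v in rec])
```

Because the quantizer only ever sees unit-norm vectors, the per-channel step is set by differences in direction, and each token's absolute error scales with its own stored norm.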
Complete Algorithm: OScaR Algorithm 1
The following pseudocode traces OScaR from offline preparation through online inference:
ALGORITHM: OScaR KV Cache Pipeline
--- OFFLINE (one-time, before inference) ---
1. For each attention layer l:
a. Load K projection weight: W_K ∈ R^{d_h × d_model}
b. Compute H_{d_h} (normalized Hadamard matrix of size d_h × d_h)
c. Merge: W̃_K ← H_{d_h} · W_K (fused into weight matrix)
d. Load V projection weight: W_V ∈ R^{d_h × d_model}
e. Load output projection weight: W_O ∈ R^{d_model × d_h}
f. Merge: W̃_V ← H_{d_h} · W_V; W̃_O ← W_O · H_{d_h}^T
g. Query: W̃_Q ← H_{d_h} · W_Q (cancel Key's rotation at attention)
--- ONLINE (per decode step, token t) ---
2. For token t at layer l:
a. Compute k̃_t = W̃_K · x_t [K with baked-in H; no extra FHT]
Compute q̃_t = W̃_Q · x_t [Q with baked-in H; cancels K's H]
Compute ṽ_t = W̃_V · x_t [V with baked-in H]
b. Omni-Token Scaling for keys:
s_t ← ‖k̃_t‖₂ [scalar ℓ2 norm]
k̂_t ← k̃_t / s_t [unit-norm key]
Append (k̂_t, s_t) to residual BF16 buffer
c. If residual buffer length > R (e.g., R = 128):
Commit oldest group of G (e.g., G = 32) tokens to INT2:
For each channel j, block g:
Δ_{j,g} = (max_i k̂_{i,j} - min_i k̂_{i,j}) / 3
z_{j,g} = -(min_i k̂_{i,j}) / Δ_{j,g}
Q_{i,j,g} = clamp(round(k̂_{i,j}/Δ_{j,g} + z_{j,g}), 0, 3)
Store Q_{i,j,g} as INT2, store (Δ_{j,g}, z_{j,g}) as BF16
--- ATTENTION (per step) ---
3. Compute attention logits for token t:
a. For each past token i in INT2 cache:
k̃_i^rec ← s_i · Dequant(Q_i, Δ_{j,g(i)}, z_{j,g(i)})
logit_i ← q̃_t · k̃_i^rec^T / √d_h
b. For each past token i in BF16 residual buffer:
logit_i ← q̃_t · (s_i · k̂_i)^T / √d_h
c. Softmax → attention weights α
d. Output ← Σ_i α_i · ṽ_i^rec
--- KEY CUDA IMPLEMENTATION DETAILS ---
4. Single fused kernel: FHT + ‖·‖₂ computation + INT2 packing
(avoids three separate kernel launches and intermediate HBM writes)
5. FlashDecoding-v2 extended with INT2 dequant path
(online dequantization inside the tiling loop, never materializing full BF16 K)
Figure 4: OScaR Algorithm 1 pseudocode with CUDA implementation notes.
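The residual-buffer logic in step 2c of the pseudocode (keep recent tokens in full precision, commit the oldest group once the buffer overflows) can be sketched as follows. The class name `ResidualBuffer` and its fields are hypothetical, and the "tokens" here are plain integers standing in for (key, scale) pairs.

```python
from collections import deque


class ResidualBuffer:
    """Holds the newest tokens unquantized; hands the oldest group to the quantizer."""

    def __init__(self, residual=128, group=32):
        self.residual = residual  # R: tokens kept in full precision
        self.group = group        # G: tokens committed per quantization block
        self.recent = deque()     # full-precision tokens, oldest first
        self.committed = []       # groups that would be quantized to INT2

    def append(self, token):
        self.recent.append(token)
        if len(self.recent) > self.residual:
            block = [self.recent.popleft() for _ in range(self.group)]
            self.committed.append(block)


buf = ResidualBuffer(residual=128, group=32)
for t in range(200):
    buf.append(t)

print("committed groups:", len(buf.committed))
print("tokens still in full precision:", len(buf.recent))
```

With these settings the full-precision window oscillates between 97 and 128 tokens, so every quantized block is formed from a complete group of 32.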
Mathematical Analysis of Why OScaR Works
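After Canalized Rotation and Omni-Token Scaling, every stored key has unit ℓ2 norm. The values on the unit sphere in $\mathbb{R}^{d_h}$ are roughly isotropically distributed (after the Hadamard whitening). For a per-channel block quantization group of $G$ unit-norm vectors in $\mathbb{R}^{d_h}$, each coordinate then has standard deviation $1/\sqrt{d_h}$, so the per-channel range is a small multiple of $1/\sqrt{d_h}$.
For $d_h = 128$ and $G = 32$, $1/\sqrt{d_h} \approx 0.088$, and the step size (range divided by 3) and the max quantization error (half a step) shrink with it. By contrast, without OScaR, the range for a channel containing an outlier token can easily be several times wider, giving a correspondingly larger step size and a 5× larger error bound in this comparison.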
System Design and CUDA Implementation
OScaR’s algorithmic correctness is a necessary but not sufficient condition for practical utility. At INT2, memory is compressed but compute must also be efficient. This section details the engineering choices that deliver OScaR’s reported 3.0× latency improvement.
Kernel Fusion for Canalized Rotation + Scaling
The naive implementation would require three kernel launches per decoding step per key token:
- Apply FHT: $\tilde{k} = H_{d_h} k$
- Compute ℓ2 norm: $s = \lVert \tilde{k} \rVert_2$
- Scale and pack into INT2: $\hat{k} = \tilde{k} / s$, then quantize
Each kernel launch has ~5–10 μs overhead and each intermediate result must round-trip through HBM. For short sequences, kernel launch overhead dominates. OScaR fuses all three into a single CUDA kernel:
#include <cstdint>

// Outline of the fused encode kernel; the body is described in comments only.
__global__ void oscar_encode_key(
const float* k_in, // raw key: [batch, heads, d_h]
uint8_t* k_int2_out, // packed INT2: [batch, heads, d_h/4]
float* scales_out, // per-token ℓ2 norms: [batch, heads]
float* quant_params // per-channel (Δ, z): [batch, heads, d_h/G, 2]
) {
// Tile across d_h; each warp handles one head
// Step 1: Load k_in to shared memory, apply butterfly FHT in-place
// Step 2: Reduce warp-level squared-sum → ℓ2 norm s
// Step 3: Divide by s in shared memory (unit-norm k̂)
// Step 4: Per-channel min/max reduction over group G
// Step 5: Compute Δ, z; quantize; pack 4× INT2 into 1 byte
// Step 6: Write k_int2_out, scales_out, quant_params to HBM
}
FlashDecoding-v2 INT2 extension. Standard FlashDecoding splits the KV cache across sequence chunks and accumulates partial attention outputs. OScaR extends this by performing INT2 dequantization inside the tiled loop, restoring keys to BF16 transiently in SRAM (never writing back to HBM), then scaling by the stored per-token norm $s_i$ and computing attention. The dequantized values are immediately consumed by the dot product, so SRAM bandwidth, not HBM bandwidth, determines cost.
Memory Layout for INT2 Cache
Packing 4× INT2 values into a single byte requires careful layout to avoid bank conflicts and enable coalesced access. OScaR uses a channel-major INT2 layout:
Packed byte = [bits 7:6 = val for channel j+3]
[bits 5:4 = val for channel j+2]
[bits 3:2 = val for channel j+1]
[bits 1:0 = val for channel j+0]
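For a warp accessing the key cache at a fixed time step, each thread loads one packed byte covering four consecutive channels, so adjacent threads touch adjacent bytes and the accesses coalesce: one warp transaction covers 32 threads × 1 byte = 32 bytes (the packed size of one 128-channel key), a single 32-byte sector of a 128-byte cache line.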
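The bit layout above can be expressed as a pair of pack and unpack helpers. This is an illustrative host-side sketch in Python (the function names are assumptions), not the CUDA packing code itself.

```python
def pack_int2(codes):
    """Pack 2-bit codes (0..3) four per byte: channel j+0 in bits 1:0, j+3 in bits 7:6."""
    assert len(codes) % 4 == 0 and all(0 <= c <= 3 for c in codes)
    out = bytearray()
    for j in range(0, len(codes), 4):
        c0, c1, c2, c3 = codes[j:j + 4]
        out.append(c0 | (c1 << 2) | (c2 << 4) | (c3 << 6))
    return bytes(out)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the 2-bit codes in channel order."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]


codes = [3, 0, 1, 2, 2, 2, 0, 1]  # eight channels of example codes
packed = pack_int2(codes)

print(packed.hex())  # 934a
print(unpack_int2(packed) == codes)
```

Eight channels fit in two bytes, so a 128-channel key occupies 32 bytes once packed.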
Efficiency Results Summary
graph LR
A["BF16 FlashDecoding-v2<br/>Memory: 5.3x relative<br/>Throughput: 331 tok/s<br/>Latency 128K: 30.9 ms/tok"]
B["OScaR INT2<br/>Memory: 1.0x - 5.3x reduction<br/>Throughput: 1354 tok/s<br/>Latency 128K: 10.3 ms/tok"]
A -- "5.3x memory reduction, 4.1x throughput, 3.0x latency" --> B
Figure 5: Memory footprint comparison. OScaR achieves a 5.3× reduction in KV cache memory at batch=48, ctx=4K on Qwen3-8B compared to BF16 FlashDecoding-v2.
| Metric | BF16 FlashDecoding-v2 | OScaR INT2 | Ratio |
|---|---|---|---|
| Memory (batch=48, ctx=4K) | 5.3× relative | 1.0× | 5.3× |
| Throughput (tokens/s, same setup) | 331 | 1354 | 4.1× |
| Latency (ms/token, ctx=128K) | ~30.9 | ~10.3 | 3.0× |
Table 1: OScaR efficiency metrics on H20 GPU, Qwen3-8B.
The 4.1× throughput gain exceeds the 3× latency gain because higher batch sizes become feasible at INT2, amortizing fixed overheads (weight loading, layer norm, FFN) across more tokens simultaneously.
Experimental Setup
Models and Hardware
- Primary LLM: Llama-3.1-8B, Qwen3-8B (text-only)
- Multimodal: Qwen3-VL-8B, Qwen3-VL-4B (vision-language)
- Omni-modal: Qwen3-Omni-30B (audio-text-vision)
- Hardware: NVIDIA H20 GPU (96 GB HBM3, 4.0 TB/s bandwidth)
- Quantization config: INT2, group_size=32, residual buffer R=128 tokens
- Precision: BF16 for weights, INT2 for cached K/V
Baselines
- BF16 FlashDecoding-v2: Full-precision upper bound
- KIVI: Per-channel key quantization, per-token value quantization; no rotation/scaling
- TurboQuant+: SOTA prior method combining rotation and per-channel quantization
- OTT (Omni-Token Transfer): Token-only normalization without Canalized Rotation
Evaluation Benchmarks
LongBench-E: A long-context benchmark suite testing 6 task types: single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic retrieval (NIAH). Average scores are reported as percentages.
NIAH (Needle-in-a-Haystack): Exact-match retrieval of a specific fact planted at a random position in a long document (up to 128K tokens). Tests whether quantization destroys retrieval fidelity.
OCRBench: Optical character recognition in images — tests whether quantization affects the vision encoder–LLM interface in multimodal models.
MMAU-Pro: Multi-modal audio understanding benchmark used with Qwen3-Omni.
Results
Text-Only: LongBench-E
graph LR
BASE["BF16 Baseline 41.70pct reference"]
OScaR["OScaR INT2 41.75pct best"]
OTT["OTT 40.74pct -0.96pp"]
TQ["TurboQuant plus 40.03pct -1.67pp"]
KIVI["KIVI 39.84pct -1.86pp"]
BASE --> OScaR
BASE --> OTT
BASE --> TQ
BASE --> KIVI
Figure 6: LongBench-E INT2 scores on Llama-3.1-8B. OScaR (41.75%) exceeds the BF16 baseline (41.70%) by 0.05pp — the only INT2 method to do so. The second-best method (OTT, 40.74%) is 1.01pp behind OScaR.
On Qwen3-8B, OScaR scores 48.7% versus the BF16 baseline of 49.6%, a 0.9 pp gap (about 1.8% relative), while KIVI drops 4.5% relative.
NIAH Retrieval
The NIAH result is arguably OScaR’s strongest single result:
- OScaR: 96.5% exact-match retrieval
- 16-bit BF16: 96.0%
- Second-best INT2 method: 92.7%
- KIVI INT2: ~88%
OScaR at INT2 surpasses full-precision BF16 by 0.5pp on this task. This counterintuitive result suggests that OScaR’s norm normalization incidentally improves the uniformity of attention weight distributions, making needle retrieval more reliable — not merely “preserving” full-precision performance but improving it.
Multimodal: OCRBench
On Qwen3-VL-8B: OScaR 66.6% vs TurboQuant 65.8% vs KIVI 66.2% vs 16bit 67.4%. OScaR trails 16-bit by 0.8 pp and leads the next-best INT2 method (KIVI) by 0.4 pp. On the smaller Qwen3-VL-4B: OScaR achieves +2.5pp over the second-best INT2 method; the gap is larger on smaller models, consistent with small models being more sensitive to quantization noise.
Omni-modal: MMAU-Pro
On Qwen3-Omni-30B: OScaR 85.6%, TurboQuant 84.7%, KIVI 85.1%, 16bit 85.8%. The gap from 16bit is only 0.2pp. OScaR generalizes beyond text-only transformers to audio-visual-language models that process heterogeneous modality tokens in a unified KV cache — demonstrating that TNI is a modality-agnostic pathology.
Cross-Model Comparison: Does OScaR Generalize?
An important question is whether OScaR’s benefits are model-specific (artifacts of Llama/Qwen architectures) or general. The paper addresses this partially by testing four models spanning text-only, vision-language, and omni-modal settings:
- Llama-3.1-8B: Standard GQA with 32 K/V heads, RoPE positional embeddings, SwiGLU FFN.
- Qwen3-8B: Modified GQA, different head-dimension ratios, Qwen-specific positional encoding.
- Qwen3-VL-8B: Vision encoder prefix tokens added to the KV cache alongside text tokens; tests whether TNI exists for vision tokens.
- Qwen3-Omni-30B: Audio and text tokens interleaved; tests TNI in multi-modal token streams.
The consistent improvement over INT2 baselines across these architectures (from roughly 0.4–0.5 pp over KIVI on OCRBench and MMAU-Pro to about 1.9 pp on LongBench-E) supports the claim that TNI is architecture-agnostic. The underlying cause, Attention Sinks at fixed positions with anomalously small norms, is a property of the softmax attention mechanism rather than any specific weight initialization, making it plausibly universal across the current transformer family.
Per-Task Analysis: LongBench-E Subtasks
While average numbers look strong, the distribution of per-task results reveals important nuances:
| Task type | OScaR vs 16bit gap (Llama-3.1-8B) |
|---|---|
| Single-doc QA | ~0 pp |
| Multi-doc QA | +0.3 pp (OScaR better!) |
| Summarization | -0.2 pp |
| Few-shot | -0.1 pp |
| Code completion | ~0 pp |
| NIAH synthetic | +0.5 pp |
| Qasper (long doc QA) | -3.2 pp (largest gap) |
Qasper requires long-range reasoning over a full scientific paper. The residual BF16 buffer covers only the last $R = 128$ tokens, so tokens requiring long-range attention must pass through INT2. For tasks where a single distant key is critically important (Qasper), INT2 quantization error in that key can derail the answer.
Limitations
The paper acknowledges several boundaries of OScaR’s applicability:
1. Residual Buffer Dependency. OScaR still requires a full-precision buffer of $R = 128$ tokens. This is necessary for newly-arriving tokens whose statistics are insufficient for stable block quantization. While the paper treats $R$ as a fixed hyperparameter, the optimal $R$ likely varies by task and model. The buffer represents under 1% of KV memory at 128K context (BF16 buffer: $128 \times 0.5\ \text{MB} \approx 67$ MB vs INT2 cache ~8 GB), but at shorter contexts the fraction is larger.
2. Sequence-Length Scaling is Not Addressed. OScaR reduces bits per token but does not address the $O(T)$ growth of the cache. At extreme sequence lengths, even 2-bit quantization may be insufficient. OScaR is complementary to, but does not replace, methods like H2O (heavy-hitter eviction), SnapKV, or streaming window attention that reduce $T$ itself.
3. Architecture-Specific CUDA Kernels. The fused FHT + INT2 packing kernel is implemented specifically for GQA (Grouped-Query Attention) heads in Qwen and Llama architectures. MLA (Multi-head Latent Attention), used in DeepSeek models, has a structurally different KV layout — the paper does not discuss MLA adaptation.
4. Extreme Degradation on Specific Tasks. The ~3pp gap on Qasper (mentioned above) suggests that INT2 remains insufficient for tasks requiring precise retrieval of many specific distant facts in a long document. The paper’s average-metric presentation can obscure these task-specific failures.
5. Group Size Sensitivity. Group size $G = 32$ is used throughout. Larger $G$ reduces metadata overhead but worsens quantization (more tokens per block means more chance of TNI within a block). The paper does not provide ablations across different $G$ values.
6. Interaction with Positional Encoding. Modern LLMs use RoPE (Rotary Positional Encoding), which applies position-dependent rotations to queries and keys before computing attention. OScaR applies its Hadamard rotation after the linear projection but before the per-head dimension split where RoPE is applied. The interaction between the fixed Hadamard matrix and the continuously varying RoPE rotation has not been analyzed: in principle, RoPE could partially “undo” the channel-equalization effect of the Hadamard transform for certain head dimensions at certain positions. The empirical results suggest this is not a practical problem, but the theory is incomplete.
7. Online Scale Computation Adds Latency at Prefill. During the prefill phase, OScaR must compute the ℓ2 norm for each token in the prompt before quantization. For a 128K-token prompt, this means one norm computation per token, per head, per layer, in addition to the Hadamard transforms. While individually cheap (a single warp-level reduction), at large head counts and long contexts this contributes measurably to prefill latency. The paper reports decoding latency only; prefill latency comparison with BF16 is not provided.
Critical Assessment: Weaknesses & Improvements
Weakness 1: Causal Attribution Is Partially Circular
The paper argues that TNI is the fundamental bottleneck and validates this by showing OScaR (which fixes TNI) outperforms all baselines. However, this is an indirect argument — the paper does not isolate the contribution of TNI-fixing from the Hadamard rotation’s other beneficial effects (e.g., reduction of channel-wise kurtosis, which is independently known to help INT quantization). The OTT baseline (scaling without Canalized Rotation) scores 40.74%, showing that scaling alone doesn’t work. But a Hadamard-only baseline (CR without OTS) is not reported. Without CR-only numbers, it is unclear whether the primary mechanism is TNI equalization or generic outlier reduction from the Hadamard transform. This matters for understanding generalizability.
Suggested improvement: Add a CR-only ablation (Hadamard transform without Omni-Token Scaling) to the results table. Report per-group range statistics before and after each step to isolate the mechanism.
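A toy version of the suggested four-way ablation can be sketched in pure Python. Everything here is an assumption made for illustration: the synthetic keys, the single outlier channel, the two low-norm "sink" tokens, the plain min/max 2-bit per-channel quantizer, and the helper names. It shows the shape of the experiment, not the paper's implementation, and its numbers say nothing about real models.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k list."""
    out = list(vec)
    n, h = len(out), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def quantize_dequantize_per_channel(keys, bits=2):
    """Min/max uniform quantization of each channel across all tokens."""
    levels = (1 << bits) - 1
    out = [row[:] for row in keys]
    for c in range(len(keys[0])):
        col = [k[c] for k in keys]
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels
        if step == 0.0:
            continue  # a constant channel is reproduced exactly
        for t, x in enumerate(col):
            out[t][c] = lo + round((x - lo) / step) * step
    return out


def pipeline_error(keys, rotate, scale):
    """Mean per-token relative reconstruction error for one ablation arm."""
    work = [fwht(k) for k in keys] if rotate else [k[:] for k in keys]
    norms = [l2_norm(k) for k in work] if scale else [1.0] * len(work)
    work = [[x / n for x in k] for k, n in zip(work, norms)]
    deq = quantize_dequantize_per_channel(work)
    deq = [[x * n for x in k] for k, n in zip(deq, norms)]
    if rotate:
        deq = [fwht(k) for k in deq]  # the orthonormal FWHT is its own inverse
    errs = [l2_norm([a - b for a, b in zip(k, d)]) / l2_norm(k)
            for k, d in zip(keys, deq)]
    return sum(errs) / len(errs)


# Synthetic keys (hypothetical): Gaussian entries, one outlier channel,
# and two low-norm tokens standing in for attention sinks.
random.seed(0)
DIM, TOKENS = 64, 128
keys = []
for t in range(TOKENS):
    k = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    k[3] += 8.0
    if t < 2:
        k = [0.05 * x for x in k]
    keys.append(k)

for name, rotate, scale in [("raw", False, False), ("CR only", True, False),
                            ("OTS only", False, True), ("CR + OTS", True, True)]:
    print(f"{name:9s} mean relative error = {pipeline_error(keys, rotate, scale):.4f}")
```

Reporting the same four arms on real key caches, alongside per-group range statistics, is what would separate TNI equalization from generic outlier dispersion.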
Weakness 2: Evaluation Scope Is Narrow
All text-only experiments use Llama-3.1-8B and Qwen3-8B — both ~8B parameter models. INT2 quantization behavior at larger scales (30B, 70B) is reported only for the omni-modal task (Qwen3-Omni-30B), where one datapoint is insufficient to draw conclusions. The claim “OScaR performs well across model scales” is unsupported for the 70B-175B range. Large models have more attention heads, different outlier statistics, and architectures (e.g., MoE) that may change the TNI picture.
Suggested improvement: Include at least one result on a ~70B dense model such as Llama-3.1-70B or Qwen2.5-72B. Even a single LongBench-E number would substantially strengthen the generalization claim.
Weakness 3: The NIAH “Better than BF16” Result Is Suspicious
OScaR reports 96.5% vs. BF16’s 96.0% on NIAH, which would imply that INT2 quantization improves retrieval. The paper attributes this to more uniform attention distributions but provides no mechanistic explanation or ablation. Possible alternative explanations: (a) random seed variation in needle placement; (b) the specific NIAH configuration used (needle length, document structure) is favorable to OScaR; (c) the residual BF16 buffer happens to include the needle token in most test cases, so INT2 quantization of other tokens doesn’t matter.
Suggested improvement: Report NIAH scores broken down by needle depth (shallow, middle, deep) and context length. A genuine attention-uniformity improvement should show the largest gains at deep/long positions.
Weakness 4: Overhead Accounting for the Scale Vector
Storing one BF16 scale per token per head adds 2 × S × H bytes per layer to the cache, where S is the sequence length and H the number of heads that carry scales. For H=32, S=128K: 8 MB. For H=8 (GQA), S=1M: 16 MB. This is described as negligible, and indeed it is, but the paper does not account for the scale vector bandwidth cost during attention. For each cached key, a BF16 scale multiply must be performed during dequantization. At 128K context with 32 heads, that cost is paid for every cached key in every head at each decoding step. Whether this is a bottleneck in practice is not measured.
Suggested improvement: Profile and report the runtime breakdown between dequantization cost and attention FLOPS for different context lengths. This would validate the claim that the scale-vector overhead is truly negligible.
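The 8 MB and 16 MB storage figures above are consistent with two bytes per token per head within a single layer, which the following sketch checks. The head dimension of 128 used for the relative-overhead line is an assumption based on the Llama-3.1-8B configuration; the function name is hypothetical.

```python
MIB = 1024 ** 2


def scale_vector_bytes(seq_len: int, num_heads: int, bytes_per_scale: int = 2) -> int:
    """Per-layer storage for one BF16 scale per token per head."""
    return seq_len * num_heads * bytes_per_scale


print(scale_vector_bytes(128 * 1024, 32) / MIB)   # 8.0 MiB  (32 heads, 128K tokens)
print(scale_vector_bytes(1024 * 1024, 8) / MIB)   # 16.0 MiB (8 KV heads, 1M tokens)

# Relative overhead against the INT2 key payload of one token in one head,
# assuming head_dim = 128 (2 bits per element -> 32 bytes per token).
head_dim = 128
int2_payload_bytes = head_dim * 2 // 8
print(2 / int2_payload_bytes)                     # 0.0625, i.e. 6.25% on top of keys
```

Storage is therefore small in absolute terms, which leaves the runtime cost of applying the scales as the quantity that still needs profiling.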
Weakness 5: Comparison with Structured Pruning Is Missing
OScaR achieves 5.3× memory reduction at INT2. Competing approaches like SnapKV (evict low-importance KV pairs), StreamingLLM (sliding window + attention sinks), and H2O (heavy-hitter eviction) also achieve large effective memory reductions by dropping tokens entirely. The paper compares only against quantization-based methods. A direct comparison with SnapKV or similar at equivalent memory budgets would better contextualize OScaR’s practical value.
Suggested improvement: Include a memory-controlled comparison where both OScaR (INT2, full context) and SnapKV/H2O (BF16, pruned context) use the same total memory. On tasks requiring access to sparse distant tokens (NIAH, long-range QA), OScaR’s full-context retention should win decisively.
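To set up such a memory-controlled comparison, one first needs the token budget an eviction method gets when held to the same memory as the quantized full-context cache. A minimal sketch, taking the 5.3× reduction reported for OScaR at face value and ignoring any bookkeeping overhead on the eviction side (both simplifying assumptions; the function name is hypothetical):

```python
def equal_memory_token_budget(context_len: int, compression_ratio: float) -> int:
    """Tokens a BF16 eviction method may retain to match a quantized full cache."""
    return int(context_len / compression_ratio)


context_len = 128 * 1024
budget = equal_memory_token_budget(context_len, 5.3)
print(budget, f"{budget / context_len:.1%}")  # 24730 tokens, about 18.9% of the context
```

Under these assumptions the eviction baseline would have to answer long-range queries while keeping fewer than one in five tokens, which is the regime where the comparison is most informative.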
Weakness 6: The “Training-Free” Claim Requires Nuance
The paper emphasizes training-free applicability. However, the offline weight-merging step modifies the four attention projection matrices (query, key, value, and output) with Hadamard rotations. For models deployed with quantized weights (e.g., already INT4-quantized via GPTQ or AWQ), these offline transformations may interfere with the existing quantization scheme. The paper assumes BF16 base weights, which may not hold for all deployment scenarios.
Suggested improvement: Test OScaR applied on top of an already weight-quantized model (e.g., W4A16). Report whether the offline Hadamard merging degrades weight quantization quality, or whether the two compression axes are orthogonal.
Weakness 7: No Sensitivity Analysis on Residual Buffer Size
The residual buffer of BF16 tokens is treated as a fixed hyperparameter with no ablation. At shorter contexts (e.g., 2K tokens), the BF16 buffer covers 6.25% of all tokens (which corresponds to a 128-token buffer), a non-trivial fraction that partly explains low degradation on short-context tasks and potentially inflates reported accuracy. At very long contexts, the buffer's share becomes negligible. A sensitivity curve showing accuracy vs. buffer size (at a fixed context length) would quantify how much of OScaR’s accuracy benefit comes from the INT2 innovation versus simply keeping more tokens in high precision.
Suggested improvement: Ablate residual buffer size on LongBench-E and NIAH. Report the accuracy vs. memory trade-off curve for different buffer sizes. This would clearly separate the contribution of INT2 accuracy from BF16 residual accuracy.
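The share of tokens held in BF16 shrinks quickly with context length, which a few lines of arithmetic make visible. The 128-token buffer used here is inferred from the 6.25% figure quoted for a 2K context (6.25% of 2048 is 128); treat it as an assumption rather than a confirmed setting, and the function name as hypothetical.

```python
def residual_fraction(buffer_tokens: int, context_len: int) -> float:
    """Fraction of cached tokens kept in full precision by the residual buffer."""
    return min(buffer_tokens, context_len) / context_len


for context_len in (2048, 8192, 32768, 131072):
    print(f"context={context_len:7d}  bf16_share={residual_fraction(128, context_len):.4%}")
```

Short-context benchmarks therefore evaluate a mixed-precision cache to a much larger degree than the 128K setting does, which is the reason an ablation over buffer size matters most on short tasks.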
Conclusion
OScaR is a well-motivated and carefully engineered contribution to extreme KV cache compression. Its central diagnostic insight, that Token Norm Imbalance (specifically the anomalously low norms of Attention Sink tokens) is the primary bottleneck for INT2 per-channel key quantization, is original and empirically supported. The two-component solution (Canalized Rotation followed by Omni-Token Scaling) is elegant: in the paper's account, CR is necessary to prevent scaling-induced outlier artifacts, OTS is necessary to equalize token norms, and neither alone suffices, although the CR-only case is not directly ablated.
The experimental results are compelling across three model modalities (text, vision-language, audio-visual-language), with the NIAH result being particularly striking. The 5.3× memory reduction and 4.1× throughput improvement on H20 GPU are practically relevant numbers for production serving workloads.
The main open questions are:
- Can CR-only and OTS-only ablations confirm the claimed mechanism?
- Does OScaR generalize to 70B+ models and MLA architectures?
- How does OScaR compare to token-eviction methods under equal memory budgets?
- Does OScaR’s residual buffer size significantly affect the reported accuracy?
- How does the Hadamard rotation interact with RoPE at long contexts?
Despite these gaps, OScaR represents a clear advance over KIVI and TurboQuant+ and provides a principled framework that could be extended to quantization of other tensor types (e.g., activation quantization, weight-activation quantization) where similar norm imbalance pathologies may exist. The Occam’s Razor in the title is apt: the simplest explanation of INT2 degradation (token norm disparity) points directly to the most parsimonious fix (normalize norms after smoothing outliers), and the engineering execution makes that fix practical.
For practitioners, OScaR is immediately usable on Llama and Qwen families with the released code, and the 5.3× memory reduction makes 128K-context inference feasible on a single consumer GPU for the first time. For researchers, the TNI diagnostic opens a promising direction: understanding which structural token properties (not just channel statistics) govern quantization quality in transformer KV caches, and whether similar norm-based analyses apply to activations, FFN states, or speculative decoding draft caches.
The paper is recommended reading for anyone working on LLM inference efficiency. Its combination of a clearly-articulated failure diagnosis, a theoretically principled fix, and rigorous multi-modal empirical validation sets a high standard for the KV quantization literature.
Broader Context: Where OScaR Fits in the KV Compression Landscape
To close, it is useful to position OScaR within the three dominant paradigms for KV cache compression:
mindmap
  root((KV Compression))
    Quantization
      KIVI -- per-channel INT2
      KVQuant -- INT4 + outlier
      OScaR -- INT2 + rotation + scaling
    Eviction / Pruning
      H2O -- heavy-hitter eviction
      SnapKV -- clustered eviction
      StreamingLLM -- sliding window + sinks
    Low-Rank Projection
      MLA -- latent KV vectors
      KVSharer -- cross-layer sharing
      GEAR -- low-rank residual
Figure 7: KV cache compression taxonomy. OScaR occupies the quantization branch. It is complementary to eviction and low-rank methods — combining OScaR with SnapKV (quantize the retained cache) could yield compounding memory savings, though the interaction of eviction with OScaR’s norm statistics has not been studied.
Practical Deployment Considerations
For teams considering OScaR in production:
- vLLM integration: OScaR requires modifying the attention backend to use its fused INT2 kernel. A PagedAttention-compatible INT2 extension would be needed — currently not available as an upstream vLLM plugin.
- Calibration-free deployment: Because OScaR is training-free and requires no calibration data for the rotation (the Hadamard matrix is fixed by architecture, not data-dependent), it can be applied to any new model without additional preparation beyond the offline weight merge.
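- Cloud serving cost: If KV cache memory is the binding constraint, a 5.3× memory reduction means a workload that needs 8 GPUs at BF16 could fit on 2 GPUs with OScaR INT2 (8 / 5.3 ≈ 1.5, rounded up), roughly a 4× reduction in GPU-hour cost once rounding and compute overhead are accounted for. This is the economic driver for pushing to INT2.
- Numerical stability: The per-token scale is stored in BF16. For Attention Sink tokens, whose norms are anomalously low, the BF16 representation remains accurate: BF16 shares FP32's 8-bit exponent, so there is no risk of underflow or overflow at these magnitudes, and the relative rounding error is bounded by the 8-bit significand.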
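The numerical-stability point can be checked without a GPU by emulating BF16 rounding on the upper 16 bits of an FP32 value. This is a minimal sketch using only the standard library; the sample scale values are hypothetical and chosen only to span small and large magnitudes.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest BF16 value (round-to-nearest-even).

    BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of
    the FP32 encoding, so rounding operates on the upper 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical scale magnitudes, from very small norms to large ones.
for scale in (1e-3, 1e-2, 0.1, 1.0, 3.7, 100.0):
    stored = to_bf16(scale)
    rel_err = abs(stored - scale) / scale
    print(f"scale={scale:8.3f}  bf16={stored:.6g}  relative_error={rel_err:.2e}")
```

For normal-range values the relative error stays at or below about 2^-8 (roughly 0.4%), independent of magnitude, which is the property the stability claim relies on.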
References and Further Reading
- KIVI: Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” ICML 2024.
- TurboQuant+: Prior SOTA on rotation-based INT2 KV quantization (exact citation not provided in OScaR preprint).
- Attention Sinks: Xiao et al., “Efficient Streaming Language Models with Attention Sinks,” ICLR 2024.
- Fast Hadamard Transform: Fino & Algazi, “Unified Matrix Treatment of the Fast Walsh-Hadamard Transform,” IEEE Trans. Comput. 1976.
- FlashDecoding-v2: Dao et al., “FlashDecoding: Fast Large Language Model Inference on GPUs,” MLSys 2024.
- SnapKV: Li et al., “SnapKV: LLM Knows What You are Looking for Before Generation,” NeurIPS 2024.
- OScaR arXiv: https://arxiv.org/abs/2605.19660