OScaR: Occam's Razor for Extreme KV Cache Quantization

Review date: 2026-06-17
Review author: Zhongzhu Zhou
Paper reviewed: OScaR: The Occam’s Razor for Extreme KV Cache Quantization in LLMs and Beyond
Paper authors: Zunhai Su, Rui Yang, Chao Zhang, et al.
arXiv: https://arxiv.org/abs/2605.19660
Status/Venue: Preprint, arXiv May 2026

Short Answer

OScaR is a training-free method that pushes KV cache quantization to INT2 (2-bit) precision with near-lossless accuracy by diagnosing and surgically fixing a previously under-examined root cause: Token Norm Imbalance (TNI) — the fact that a small subset of tokens (Attention Sinks) carry anomalously low ℓ2 norms, which inflates per-channel quantization step sizes and wastes representational bits. The fix is a two-component pipeline: a Hadamard-based Canalized Rotation that disperses channel-wise outliers, followed by Omni-Token Scaling that normalizes inter-token norm disparity. On Llama-3.1-8B at 128K context, OScaR delivers a 3.0× latency reduction and a 5.3× memory footprint reduction relative to BF16, while matching 16-bit accuracy on LongBench-E.

Prerequisites: What You Need to Know First

Before dissecting OScaR’s machinery, this section crystallizes the essential background that the rest of the review assumes. Readers already comfortable with KV caching, INT quantization, and the Hadamard transform may skip ahead.

The KV Cache and Why It Grows

During autoregressive decoding in a transformer, each token at position $t$ attends to all previous tokens via the familiar scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{1 \times d}$ is the current-token query, $K \in \mathbb{R}^{S \times d}$ is the key matrix accumulated over all $S$ past tokens, and $V \in \mathbb{R}^{S \times d}$ is the value matrix. Without caching, computing $K$ and $V$ from scratch at each step costs $O(S)$ matrix-vector products, which is quadratic in total. The standard fix is the KV cache: store $K$ and $V$ for every past token and reuse them.

The downside is memory. For a model with $L$ layers, $H$ attention heads, head dimension $d_h$, precision $p$ bits, and sequence length $S$, the KV cache occupies:

$$\text{Mem}_{KV} = 2 \cdot L \cdot H \cdot d_h \cdot S \cdot \frac{p}{8} \;\text{ bytes}$$

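For Llama-3.1-8B ($L=32, H=32, d_h=128$) at BF16 ($p=16$) and $S=128{,}000$:

$$\text{Mem}_{KV} = 2 \times 32 \times 32 \times 128 \times 128000 \times \frac{16}{8} \approx \mathbf{67\,\text{GB}}$$

This counts all $H=32$ attention heads as key/value heads. Llama-3.1-8B uses grouped-query attention with 8 key/value heads, so its actual BF16 cache is about 4× smaller (roughly 17 GB); the 67 GB figure is best read as the multi-head upper bound.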

The model weights themselves are only ~16 GB. The KV cache overwhelms GPU memory for long contexts, severely limiting batch size and throughput. INT2 quantization reduces per-element storage from 16 bits to 2 bits — an 8× compression ratio, shrinking that 67 GB to ~8 GB.
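The memory formula above is easy to check numerically. The following is a minimal sketch, not code from the paper: the function name `kv_cache_bytes` and its parameters are illustrative, and the head count passed in is whatever number of key/value heads the model actually caches. It counts the quantized payload only and ignores scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch=1):
    """Bytes for keys plus values: 2 * L * H * d_h * S * p / 8, times the batch size."""
    total_bits = 2 * num_layers * num_kv_heads * head_dim * seq_len * bits * batch
    return total_bits // 8

# Configuration used in the text (H = 32 heads, 128,000 tokens).
bf16_bytes = kv_cache_bytes(32, 32, 128, 128_000, bits=16)
int2_bytes = kv_cache_bytes(32, 32, 128, 128_000, bits=2)

print(f"BF16: {bf16_bytes / 1e9:.1f} GB")
print(f"INT2: {int2_bytes / 1e9:.1f} GB")
print(f"compression: {bf16_bytes / int2_bytes:.0f}x")
```

The same helper covers batched serving through the `batch` argument.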

Uniform Quantization: Mechanics and Error

Uniform $b$-bit quantization maps a floating-point value $x$ to an integer in $[0, 2^b - 1]$. Given a block of values with minimum $x_{\min}$ and maximum $x_{\max}$:

$$\Delta = \frac{x_{\max} - x_{\min}}{2^b - 1}$$

$$z = -\frac{x_{\min}}{\Delta}$$

$$Q(x) = \text{clamp}\!\left(\text{round}\!\left(\frac{x}{\Delta} + z\right),\; 0,\; 2^b - 1\right)$$

$$\hat{x} = \Delta \cdot (Q(x) - z)$$

The worst-case reconstruction error for a single element is bounded by half a step:

$$|x - \hat{x}| \leq \frac{\Delta}{2} = \frac{x_{\max} - x_{\min}}{2(2^b - 1)}$$

At INT2, we have only 4 discrete levels ($2^2 = 4$). The step size equals one-third of the entire range:

$$\Delta_{\text{INT2}} = \frac{x_{\max} - x_{\min}}{3}$$

Compare this to INT8, where $\Delta_{\text{INT8}} = (x_{\max} - x_{\min})/255$, roughly 85× finer. The message: at INT2, the dynamic range $x_{\max} - x_{\min}$ must be kept as small as possible, or quantization error explodes.
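The four formulas above translate almost line for line into code. This is a small sketch under the definitions in this section; the function names `quantize` and `dequantize` and the sample block are illustrative, and the constant-block guard is an added assumption.

```python
def quantize(values, bits):
    """Asymmetric min-max quantization of a block of floats to `bits`-bit codes."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0   # guard for a constant block
    zero = -lo / step
    codes = [min(max(round(x / step + zero), 0), levels) for x in values]
    return codes, step, zero

def dequantize(codes, step, zero):
    """Map integer codes back to floats: x_hat = step * (code - zero)."""
    return [step * (q - zero) for q in codes]

block = [-1.0, -0.2, 0.3, 2.0]
for bits in (8, 2):
    codes, step, zero = quantize(block, bits)
    recon = dequantize(codes, step, zero)
    worst = max(abs(x - r) for x, r in zip(block, recon))
    print(f"INT{bits}: step={step:.4f}, worst error={worst:.4f}")
```

For any block, the worst error stays within half a step, and the INT2 step is 85 times the INT8 step over the same range.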

Attention Sinks

One of the most reliably reproduced empirical phenomena in modern LLMs is the existence of Attention Sink tokens. These are typically the first one or two tokens in a sequence (often BOS or punctuation), which receive disproportionately large attention weights from almost every query and every head. The current interpretation is that sinks serve as “parking” positions for the model to distribute probability mass when no past token is strongly relevant.

The key observation for OScaR: Attention Sink tokens tend to have anomalously small key norms compared to ordinary tokens. The absolute values in their key vectors are small, but the model still assigns them large attention weights — a structural property of the softmax mechanism that OScaR exploits to explain quantization degradation.

The Fast Hadamard Transform

The Hadamard transform $H_d \in \mathbb{R}^{d \times d}$ (where $d$ is a power of 2) is a real orthogonal transform defined recursively:

$$H_1 = \begin{bmatrix} 1 \end{bmatrix},\quad H_d = \frac{1}{\sqrt{2}} \begin{bmatrix} H_{d/2} & H_{d/2} \\ H_{d/2} & -H_{d/2} \end{bmatrix}$$

For $d=4$, the explicit matrix is:

$$H_4 = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

Notice every entry is $\pm 1/2$. For general $d$, every entry is $\pm 1/\sqrt{d}$.

Key properties exploited by OScaR:

  1. Orthogonality: $H_d H_d^T = I_d$, so the transform is lossless and its inverse equals its transpose: $H_d^{-1} = H_d^T = H_d$.
  2. Energy equalization: A sparse vector (one large entry, rest near-zero) is spread uniformly across all $d$ dimensions after $H_d$: each output dimension receives a contribution of $\pm 1/\sqrt{d}$ times the original large entry. If the input is $e_j = (0,\ldots,0, M, 0,\ldots,0)$ (outlier in dimension $j$), then $(H_d e_j)_{j'} = H_{j',j} \cdot M = \pm M/\sqrt{d}$, so the energy $M^2$ is now spread over $d$ dimensions, each with magnitude $M/\sqrt{d}$.
  3. Norm preservation: $\|H_d x\|_2 = \|x\|_2$ for all $x$, since $H_d$ is orthogonal.
  4. Fast computation: Via the Fast Walsh-Hadamard algorithm in $O(d \log d)$ rather than $O(d^2)$, executable on GPU in a single fused CUDA kernel.

Property 2 is the essential insight: outlier channels (large values in one dimension) become diffuse after the Hadamard transform, equalizing the dynamic range across all channels.
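The properties listed above can be checked with a few lines of code. Below is a plain-Python sketch of the orthonormal fast Walsh-Hadamard butterfly; the name `fwht` and the 8-dimensional test vector are illustrative choices, and a production kernel would run this on the GPU instead.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly stage: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)   # makes the transform orthogonal
    return [v * scale for v in y]

# A single outlier of magnitude 16 in an 8-dimensional vector.
outlier = [0.0] * 8
outlier[3] = 16.0
spread = fwht(outlier)

print([round(abs(v), 3) for v in spread])            # every entry has the same magnitude
print(round(math.sqrt(sum(v * v for v in spread)), 6))  # norm is unchanged
```

Each output entry has magnitude $16/\sqrt{8}$, the ℓ2 norm stays at 16, and applying `fwht` twice returns the input.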

Prior Work: Rotation-Based Quantization

The idea of using orthogonal rotations to smooth outliers before quantization is not unique to OScaR. It builds on a lineage that includes QuaRot (Ashkboos et al., 2024) and SpinQuant (Liu et al., 2024), which apply Hadamard-style rotations (randomized in QuaRot, learned in SpinQuant) to 4-bit quantization of weights, activations, and the KV cache. OScaR’s novelty is carrying this principle to the INT2 KV cache in the online decoding regime, and identifying that rotation alone is insufficient without the subsequent token-norm equalization step. The combination and the TNI diagnostic are new.

Problem: Why KV Cache Memory Is the Bottleneck

To motivate why OScaR targets INT2 specifically — rather than, say, INT4 which is already well-studied — consider the practical pressure points in production LLM serving.

Memory-Bound vs Compute-Bound Regimes

Modern LLM inference is typically memory-bandwidth-bound during decoding. Loading KV cache tensors from HBM to SRAM dominates latency. With KV cache at BF16:

[GPU HBM] ──(67 GB at 128K context)──> [SRAM] ──> Attention

          bottleneck: HBM bandwidth ~3.35 TB/s (H100)

At INT2, the same data is only ~8 GB — loading it takes 8× less time. The arithmetic intensity (FLOPs per byte loaded) improves dramatically, enabling the 3× latency reduction OScaR reports.
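A back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below assumes the cache is streamed exactly once per decoding step at the full quoted HBM bandwidth; the helper name `stream_time_ms` is illustrative, and real kernels will not reach this ideal.

```python
def stream_time_ms(num_bytes, bandwidth_bytes_per_s):
    """Ideal time to read `num_bytes` once at the given bandwidth, in milliseconds."""
    return 1e3 * num_bytes / bandwidth_bytes_per_s

HBM_BANDWIDTH = 3.35e12                        # bytes/s, the H100 figure quoted above
bf16_bytes = 2 * 32 * 32 * 128 * 128_000 * 2   # ~67 GB BF16 cache from the text
int2_bytes = bf16_bytes // 8                   # 2-bit payload only, no scales or zero points

bf16_ms = stream_time_ms(bf16_bytes, HBM_BANDWIDTH)
int2_ms = stream_time_ms(int2_bytes, HBM_BANDWIDTH)
print(f"BF16: {bf16_ms:.1f} ms, INT2: {int2_ms:.1f} ms per full cache read")
```

The ideal ratio is 8×. The reported 3.0× end-to-end gain is lower, which is plausible once quantization metadata, dequantization work, and the non-attention parts of the model are included.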

Batch Size and Throughput

For an LLM serving multiple users concurrently (batch size $B$), total KV cache memory scales linearly in $B$. At BF16, Llama-3.1-8B with 4K context and batch=48 requires:

$$\text{KV}_{BF16} = 48 \times 2 \times 32 \times 32 \times 128 \times 4096 \times 2 \approx 103\,\text{GB}$$

This far exceeds a single 40 GB A100. INT2 brings it to ~13 GB, fitting comfortably on a single card, with headroom for larger batches or longer contexts.

The INT4 vs INT2 Trade-off

INT4 quantization (e.g., with methods like KVQuant) is already practical and widely deployed. Moving from INT4 to INT2 doubles memory compression again but makes the quantization steps 5× coarser for the same dynamic range (step size scales inversely with $2^b - 1$, and going from $b=4$ to $b=2$ gives $15/3 = 5$). The challenge OScaR accepts: can we restructure the KV cache representation so that INT2 quantization error remains small enough for near-lossless task performance?

Background: KIVI and the Per-Channel Paradigm

The most important prior baseline for OScaR is KIVI (Key-Value INT quantization), the method that first established viable INT2 KV cache quantization for practical-length sequences. Understanding KIVI’s design choices and failure modes is essential to appreciating what OScaR fixes.

KIVI’s Asymmetric Quantization Strategy

KIVI applies different quantization axes to keys and values, motivated by their different statistical distributions:

For Keys (per-channel quantization): Transformers consistently exhibit large outliers along specific channel (head-dimension) axes for key tensors, but relatively uniform distributions across token (sequence) axes. Per-channel quantization assigns one step size per column $j$ across a group of $G$ tokens:

$$\Delta_{j,g} = \frac{\max_{i \in \mathcal{B}_g} K_{i,j} - \min_{i \in \mathcal{B}_g} K_{i,j}}{2^b - 1}$$

where $\mathcal{B}_g$ is the set of token positions in group $g$ of size $G$.

For Values (per-token quantization): Value tensors have relatively uniform distributions across channels for a given token, but large inter-token variation. Per-token quantization is therefore appropriate:

$$\Delta_{t} = \frac{\max_{j} V_{t,j} - \min_{j} V_{t,j}}{2^b - 1}$$

Residual buffer: Because quantization errors accumulate on newly arriving tokens before enough context exists for reliable statistics, KIVI maintains the most recent $R$ tokens (typically $R = 128$) in full BF16 precision. Once a token ages past position $R$, it gets committed to the quantized cache.
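The difference between the two quantization axes shows up directly in the step sizes. The sketch below uses a made-up 3-token, 3-channel key group with one outlier channel; the function names and the numbers are illustrative and do not come from KIVI's code.

```python
def per_channel_steps(rows, bits=2):
    """One step size per channel (column), computed over a group of tokens (rows)."""
    levels = 2 ** bits - 1
    return [(max(col) - min(col)) / levels for col in zip(*rows)]

def per_token_steps(rows, bits=2):
    """One step size per token (row), computed over its channels."""
    levels = 2 ** bits - 1
    return [(max(row) - min(row)) / levels for row in rows]

# Toy key group: channel 0 is an outlier channel shared by every token.
keys = [
    [6.0, 0.1, -0.2],
    [6.3, -0.1, 0.2],
    [5.7, 0.2, 0.1],
]
channel_steps = per_channel_steps(keys)
token_steps = per_token_steps(keys)
print("per-channel:", [round(s, 3) for s in channel_steps])
print("per-token:  ", [round(s, 3) for s in token_steps])
```

Grouping by channel isolates the outlier in its own column, so every step size stays small. Grouping by token mixes the outlier with the small channels and forces a step roughly ten times coarser, which is the motivation for quantizing keys per channel.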

KIVI’s Architecture

graph LR
    subgraph KIVI Pipeline
        A[New token k_t] --> B{t > R?}
        B -- No --> C[BF16 residual buffer]
        B -- Yes --> D[Per-channel block quant]
        D --> E[INT2 K cache]
        C --> F[Attention]
        E --> G[Dequant INT2 → BF16]
        G --> F
    end

Figure 1: KIVI KV cache pipeline. The residual BF16 buffer holds the most recent $R=128$ tokens; older tokens are committed to the INT2 quantized cache.

Where KIVI Breaks at INT2

At INT2, the step size covers one-third of the entire dynamic range. Within a group of $G=32$ tokens, the per-channel step size is:

$$\Delta_{j,g} = \frac{R_{j,g}}{3}$$

where $R_{j,g} = \max_i K_{i,j} - \min_i K_{i,j}$ is the within-group range for channel $j$.

KIVI’s LongBench-E score at INT2 drops from the BF16 baseline of 41.70% to 39.84%, a 1.86 percentage-point gap. On some specific subtasks the drop is much larger. OScaR removes the gap entirely, scoring 41.75% (0.05 pp above the 41.70% baseline). The question is: why does KIVI’s gap exist, and what specific mechanism creates it?

Diagnosing the Root Cause: Token Norm Imbalance

The core intellectual contribution of OScaR is not the fix, but the diagnosis. The paper introduces the concept of Token Norm Imbalance (TNI) and traces the performance degradation in KIVI back to this identifiable, measurable cause.

Defining Token Norm Imbalance

For each attention head $h$ and token position $t$, define the ℓ2 norm of the key vector:

$$\|k_{t,h}\|_2 = \sqrt{\sum_{j=1}^{d_h} k_{t,h,j}^2}$$

Let $\mathcal{N}_{t}^{K}$ denote the collection of these norms across all heads for token $t$:

$$\mathcal{N}_t^K = \left\{ \|k_{t,h}\|_2 \mid h = 1, \ldots, H \right\}$$

TNI is defined as the substantial disparity in $\mathcal{N}_t^K$ across different token positions $t$ within the same quantization block. Empirically, the paper finds that Attention Sink tokens consistently have ℓ2 norms 5–50× smaller than ordinary tokens.

Why Norm Imbalance Breaks Per-Channel Quantization

Consider a quantization group $\mathcal{B}_g$ containing one Attention Sink token (with small norm $\mu_{\text{sink}}$) and $G-1$ ordinary tokens (with typical norm $\mu_{\text{ord}} \gg \mu_{\text{sink}}$).

For channel $j$, the key values from ordinary tokens occupy $[-a_j, a_j]$ for some $a_j$ proportional to $\mu_{\text{ord}}$. The Attention Sink token’s key values are proportionally small, near zero for channel $j$ in expectation. But the exact sign and magnitude of the Attention Sink key values are incoherent with the ordinary tokens: they may be slightly positive or negative in a channel where ordinary tokens cluster, or they may pull the minimum or maximum of the range in unexpected ways.

The quantization range for channel $j$ in group $g$ is:

$$R_{j,g} = \max_{i \in \mathcal{B}_g} K_{i,j} - \min_{i \in \mathcal{B}_g} K_{i,j}$$

The reconstruction error for any element in this group is bounded by:

$$\epsilon_{j,g} \leq \frac{\Delta_{j,g}}{2} = \frac{R_{j,g}}{2(2^b - 1)}$$

TNI inflates $R_{j,g}$ in the following way: even though the Attention Sink token has a small norm overall, the relative placement of its channel values can expand the observed range beyond what ordinary-token variance alone would produce. More precisely, since Attention Sink key vectors have small norms, their individual channel entries are small, but this “small” value may lie on the opposite side of zero from the ordinary tokens’ cluster in that channel, effectively expanding $R_{j,g}$.

The theoretical reconstruction error bound (OScaR Appendix G, Eq. 11) states that the expected per-channel quantization error is governed by the inter-token norm variance within the group:

$$\mathbb{E}[\epsilon_{j,g}] \propto \frac{\text{Var}_{t \in \mathcal{B}_g}\!\left(\|k_t\|_2\right)^{1/2}}{2^b - 1}$$

TNI directly drives $\text{Var}_{t \in \mathcal{B}_g}(\|k_t\|_2)$ upward through the presence of low-norm sink tokens, which in turn amplifies quantization error.

Worked Example: TNI in a Single Quantization Block

To make TNI concrete, consider a toy example with $d_h = 4$, $G = 4$ tokens, $b = 2$. Suppose the key values for one channel $j$ across the four tokens are:

$$\mathbf{K}_{:,j} = \begin{bmatrix} 0.05 \\ 8.1 \\ 7.9 \\ 8.3 \end{bmatrix}$$

Token 0 is an Attention Sink (key value 0.05); tokens 1–3 are ordinary (values near 8).

The per-channel quantization range is:

$$R_j = 8.3 - 0.05 = 8.25$$

The INT2 step size is:

$$\Delta_j = \frac{8.25}{3} = 2.75$$

Quantizing token 1 ($K_{1,j} = 8.1$), with the zero point $z = -x_{\min}/\Delta_j = -0.05/2.75 \approx -0.018$:

$$Q(8.1) = \text{clamp}\!\left(\text{round}\!\left(\frac{8.1}{2.75} - 0.018\right), 0, 3\right) = \text{round}(2.927) = 3$$

$$\hat{K}_{1,j} = 2.75 \times (3 + 0.018) = 8.30 \quad \Rightarrow \quad \text{error} = |8.30 - 8.1| = 0.20$$

Tokens 2 and 3 also land on level 3, so all three ordinary tokens collapse to the same reconstructed value 8.30 (token 2 with error 0.40).

Quantizing token 0 ($K_{0,j} = 0.05$):

$$Q(0.05) = \text{clamp}\!\left(\text{round}\!\left(\frac{0.05}{2.75} - 0.018\right), 0, 3\right) = 0$$

$$\hat{K}_{0,j} = 2.75 \times (0 + 0.018) = 0.05 \quad \Rightarrow \quad \text{error} = 0$$

Now suppose we remove the sink token and only quantize tokens 1–3:

$$R_j^{\text{no-sink}} = 8.3 - 7.9 = 0.4, \quad \Delta_j^{\text{no-sink}} = \frac{0.4}{3} \approx 0.133$$

Max error $\leq 0.067$, against a worst-case bound of $\Delta_j/2 = 1.375$ with the sink present: a roughly 20× tighter bound for tokens 1–3. The Attention Sink’s presence forces the entire group to use a coarse step size that wastes precision on ordinary tokens. OScaR’s OTS step normalizes all tokens to unit norm, effectively collapsing this 8.3 vs 0.05 disparity before quantization.
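The toy block can be replayed in code. This sketch applies min-max INT2 quantization with the zero point from the quantization section, so the block minimum maps to code 0; the helper name `quantize_block` is illustrative.

```python
def quantize_block(values, bits=2):
    """Min-max quantize one channel of a group; return the step and reconstructions."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels
    # (x - lo) / step equals x / step + z with z = -lo / step.
    codes = [min(max(round((x - lo) / step), 0), levels) for x in values]
    return step, [lo + step * q for q in codes]

with_sink = [0.05, 8.1, 7.9, 8.3]          # token 0 is the sink
step_sink, recon_sink = quantize_block(with_sink)
err_sink = max(abs(x - r) for x, r in zip(with_sink[1:], recon_sink[1:]))

no_sink = with_sink[1:]                     # ordinary tokens only
step_plain, recon_plain = quantize_block(no_sink)
err_plain = max(abs(x - r) for x, r in zip(no_sink, recon_plain))

print(f"with sink:    step={step_sink:.3f}, worst ordinary-token error={err_sink:.3f}")
print(f"without sink: step={step_plain:.3f}, worst ordinary-token error={err_plain:.3f}")
```

With the sink in the group, every ordinary token is rounded to the same top level, so the three distinct values become indistinguishable after dequantization.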

Empirical Evidence for TNI

The paper provides visualization of token-wise ℓ2 norms across layers in Llama-3.1-8B. The pattern is consistent:

  • Token positions 0 and 1 (BOS and the first real token) have norms roughly 10–20× smaller than the median token norm.
  • This low-norm pattern persists across all layers and all heads.
  • When keys are quantized per-channel in groups of $G=32$ tokens, the first group always contains the sinks at positions 0 and 1. They make up only $2/32 = 6.25\%$ of that group’s tokens, yet they influence the step size shared by all 32.
  • The groups containing Attention Sinks show measurably higher reconstruction error in channel-wise metrics.
graph LR
    subgraph LOW["Low-Norm Attention Sinks"]
        S0["Sink pos=0<br/>ℓ2 norm ≈ 0.4"]
        S1["Sink pos=1<br/>ℓ2 norm ≈ 0.6"]
    end
    subgraph NORM["Normal Tokens"]
        T2["t=2  norm≈7.2"]
        T3["t=3  norm≈8.1"]
        T4["t=4  norm≈7.8"]
        T5["t=5  norm≈8.4"]
    end
    S0 -- "~15× below median" --> T2
    S1 --> T3

Figure 2: Conceptual illustration of Token Norm Imbalance. Attention Sink tokens at positions 0 and 1 have norms roughly 15× below the mean of ordinary tokens. Within a quantization block containing a sink, the effective dynamic range for per-channel quantization expands, increasing step size and reconstruction error.

Why Direct Scaling Fails: The Outlier Artifact Trap

Given the diagnosis (low-norm sink tokens inflating the per-channel range), the obvious fix is to normalize token norms before quantization: divide each token by its scale $s_t = \|k_t\|_2$ to bring all tokens to unit norm, quantize, then multiply back at dequantization. This is called direct token-wise scaling, and it fails. Understanding why is as important as understanding OScaR’s solution.

The Scaling-Induced Outlier Artifact

When a token has very small norm $\|k_{t_{\text{sink}}}\|_2 = \epsilon \ll 1$, scaling by $1/\epsilon$ amplifies every channel uniformly. Consider channel $j$: for ordinary tokens, $K_{i,j}$ might have values in $[-3, 3]$ (in normalized units). For the Attention Sink, $K_{t_\text{sink},j} \approx 0.01$ in absolute terms, which after scaling becomes $0.01/\epsilon$, potentially a very large number.

The critical issue: the Attention Sink’s key vector, when enlarged to unit norm, can have large entries in channels where ordinary tokens have near-zero entries. This is because the Attention Sink’s unit-norm direction is not aligned with ordinary tokens’ unit-norm direction. The Hadamard transform’s energy-equalization property prevents this misalignment from causing outliers, but without the Hadamard transform, scaling the sink token first creates a new kind of channel-wise outlier.

Formally, let $k_{\text{sink}} = \epsilon \cdot \hat{k}_{\text{sink}}$ where $\hat{k}_{\text{sink}}$ is the unit-norm direction and $\epsilon = \|k_{\text{sink}}\|_2$. After scaling, the contribution to channel $j$ of the sink token is $\hat{k}_{\text{sink},j}$. If $\hat{k}_{\text{sink},j}$ is large in a channel where ordinary tokens have small values, then:

$$R_{j,g}^{\text{after direct scaling}} = \max\!\left(\hat{k}_{\text{sink},j},\; \max_{t \neq \text{sink}} \tilde{k}_{t,j}\right) - \min\!\left(\hat{k}_{\text{sink},j},\; \min_{t \neq \text{sink}} \tilde{k}_{t,j}\right)$$

can be larger than $R_{j,g}^{\text{before scaling}}$. Direct scaling has traded the original TNI problem for a new channel outlier problem, and at INT2, either problem is fatal.
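A four-token toy group illustrates the artifact. The numbers below are invented for illustration: the ordinary tokens are dominated by channel 0, while the sink's tiny vector points along channel 1. The helper names are likewise illustrative.

```python
import math

def unit_normalize(vec):
    """Scale a vector to unit l2 norm (direct token-wise scaling)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def channel_ranges(rows):
    """Per-channel max minus min over the group."""
    return [max(col) - min(col) for col in zip(*rows)]

group = [
    [0.0, 0.05, 0.0, 0.0],    # sink: tiny norm, direction along channel 1
    [8.0, 0.1, 0.2, -0.1],    # ordinary tokens: energy concentrated in channel 0
    [7.9, -0.1, 0.1, 0.2],
    [8.1, 0.2, -0.2, 0.1],
]
raw = channel_ranges(group)
scaled = channel_ranges([unit_normalize(row) for row in group])

print("raw ranges:   ", [round(r, 3) for r in raw])
print("scaled ranges:", [round(r, 3) for r in scaled])
```

After scaling, the ordinary tokens have entries near 0.01 to 0.03 in channel 1, but the sink sits at 1.0 there, so that channel's range (and INT2 step) is set almost entirely by the sink.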

Illustration: Why Order Matters

flowchart TD
    A["Raw Keys: TNI present\n(sink: norm≈0.5, ord: norm≈8)"] --> B{Direct Scaling?}
    B -- "Yes (wrong order)" --> C["Sink amplified to unit norm\nCreates channel outliers\nPer-channel range EXPANDS\nQuantization error WORSE"]
    B -- "No" --> D[Canalized Rotation first]
    D --> E["Outliers spread across d dims\nNo single channel dominates\nScaling now SAFE"]
    E --> F[Omni-Token Scaling]
    F --> G["Uniform norms\nSmall per-channel range\nINT2 error SMALL"]

Figure 3: Order dependency in OScaR. Applying token scaling before Canalized Rotation triggers the Scaling-Induced Outlier Artifact, expanding the per-channel dynamic range. OScaR applies Canalized Rotation first to eliminate channel outliers, then Omni-Token Scaling is safe.

The OScaR Framework: Algorithm and Theory

OScaR’s two components — Canalized Rotation (CR) and Omni-Token Scaling (OTS) — are mutually necessary: CR alone reduces channel outliers but leaves TNI unaddressed; OTS alone triggers the Outlier Artifact. Together, they address the full TNI problem.

Component 1: Canalized Rotation (CR)

Motivation. Before any token scaling, we must ensure that the per-channel dynamic range is already small and that no single channel dominates. The Fast Hadamard Transform accomplishes this by redistributing energy uniformly.

The Hadamard Transform Applied to Keys. For each token $t$ at head $h$, apply $H_{d_h} \in \mathbb{R}^{d_h \times d_h}$:

$$\tilde{k}_{t,h} = H_{d_h} \cdot k_{t,h}$$

After this transform, if channel $j$ had a large value $k_{t,h,j} = M$ (an outlier), its contribution to any output dimension $j'$ is:

$$\tilde{k}_{t,h,j'} = \sum_{j=1}^{d_h} H_{j',j} \cdot k_{t,h,j} = \frac{\pm 1}{\sqrt{d_h}} \cdot k_{t,h,j} + \text{other small terms}$$

The outlier energy $M$ is diluted by $1/\sqrt{d_h}$ and spread across all $d_h$ dimensions. For $d_h = 128$, each dimension receives at most $M/\sqrt{128} \approx M/11$, an order-of-magnitude reduction.

Key invariant preserved: Since $H_{d_h}$ is orthogonal:

$$\|\tilde{k}_{t,h}\|_2 = \|H_{d_h} k_{t,h}\|_2 = \|k_{t,h}\|_2$$

The Hadamard transform does not change token norms. TNI persists after CR — but channel outliers are eliminated.

Handling Queries. Attention requires the logit $q^T k_t$ for every cached token $t$ (the entries of $QK^T$). After applying $H$ to keys, the query must also be transformed to preserve correctness:

$$(H q)^T (H k_t) = q^T H^T H k_t = q^T k_t$$

since $H^T H = I$. So applying identical Hadamard transforms to both $Q$ and $K$ is lossless: the attention logits are unchanged. In practice, the query transform is applied online (per decoding step), while the key transform is fused into the key projection weight matrix offline.
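The invariance of the logit under a shared rotation can be verified numerically. This sketch builds the normalized Hadamard matrix explicitly from the recursion given earlier; the vectors `q` and `k` are made-up 8-dimensional examples with one outlier channel in the key, and the helper names are illustrative.

```python
import math

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    rows = [[1.0]]
    while len(rows) < n:
        # [[H, H], [H, -H]] block construction.
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in r] for r in rows]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

H = hadamard(8)
q = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 3.0]
k = [12.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1]   # channel 0 is an outlier

logit_raw = dot(q, k)
logit_rot = dot(matvec(H, q), matvec(H, k))
peak_raw = max(abs(v) for v in k)
peak_rot = max(abs(v) for v in matvec(H, k))
print(f"logit raw={logit_raw:.6f}, rotated={logit_rot:.6f}")
print(f"largest |entry| raw={peak_raw:.2f}, rotated={peak_rot:.2f}")
```

The logit is unchanged up to floating-point rounding, while the largest key entry shrinks by roughly $\sqrt{8}$.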

Handling Values. Values use per-token quantization, which is less sensitive to channel outliers. The Hadamard transform can still improve value quantization: fold $H$ into both the value projection weight matrix $W_V$ and the output projection weight matrix $W_O$ offline. At inference, $V$ automatically has the Hadamard baked in (no per-token online computation), and with attention weights $\alpha_i$ the attention output is:

$$\text{out} = \tilde{W}_O \sum_i \alpha_i \tilde{v}_i = W_O^{\text{raw}} H^T \sum_i \alpha_i \, H v_i^{\text{raw}} = W_O^{\text{raw}} \sum_i \alpha_i v_i^{\text{raw}}$$

The merged matrices $\tilde{W}_V = H W_V^{\text{raw}}$ and $\tilde{W}_O = W_O^{\text{raw}} H^T$ are computed once offline, adding zero runtime overhead.

Component 2: Omni-Token Scaling (OTS)

Motivation. After Canalized Rotation, channel outliers are gone, so scaling individual tokens is now safe. We now address the remaining TNI: the vast norm disparity between Attention Sink tokens and ordinary tokens.

The Scaling Procedure. For each token $t$, compute its ℓ2 norm in the rotated space:

$$s_t = \|\tilde{k}_{t,h}\|_2 = \|H \cdot k_{t,h}\|_2 = \|k_{t,h}\|_2$$

(same magnitude as the original, since $H$ is orthogonal). Normalize:

$$\hat{k}_{t,h} = \frac{\tilde{k}_{t,h}}{s_t}$$

Now every token has unit ℓ2 norm. The inter-token norm variance is exactly zero: $\text{Var}_{t \in \mathcal{B}_g}(\|\hat{k}_t\|_2) = 0$. The quantization error bound from the TNI theory becomes:

$$\mathbb{E}[\epsilon_{j,g}] \propto \frac{\text{Var}_{t \in \mathcal{B}_g}\!\left(\|\hat{k}_t\|_2\right)^{1/2}}{2^b - 1} = 0$$

In practice the bound is not exactly zero (the variance of channel values across unit-norm vectors is not zero), but the dominant source of quantization error — norm disparity — is eliminated.

Storage and Dequantization. The scalar $s_t$ must be stored alongside the quantized key for later recovery. The storage cost is one BF16 scalar per token per head per layer:

$$\text{Extra cost} = L \cdot H \cdot S \cdot 2 \text{ bytes} \ll \text{KV cache size}$$

For $L=32$ layers, $H=32$ heads, and $S=128{,}000$ tokens: $32 \times 32 \times 128000 \times 2 \approx 262$ MB versus the ~8 GB INT2 KV cache, a modest overhead of about 3%.

At dequantization, the original scaled key is recovered:

$$\tilde{k}_{t,h} \approx s_t \cdot \hat{k}_{t,h}^{\text{dequant}}$$

and the attention logit uses this recovered value.
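The OTS round trip (normalize, quantize the unit-norm directions per channel, dequantize, rescale) can be sketched as follows. The keys are hypothetical values standing in for already-rotated vectors, and the function names are illustrative rather than taken from the paper's code.

```python
import math

def ots_encode(rotated_keys):
    """Split each rotated key into a unit-norm direction and its l2 norm s_t."""
    scales = [math.sqrt(sum(v * v for v in key)) for key in rotated_keys]
    units = [[v / s for v in key] for key, s in zip(rotated_keys, scales)]
    return units, scales

def quantize_per_channel(rows, bits=2):
    """Min-max quantize each channel (column) over the group, then dequantize."""
    levels = 2 ** bits - 1
    columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels if hi > lo else 1.0
        codes = [min(max(round((v - lo) / step), 0), levels) for v in col]
        columns.append([lo + step * q for q in codes])
    return [list(row) for row in zip(*columns)]

def ots_decode(unit_rows, scales):
    """Multiply each dequantized direction by its stored norm."""
    return [[s * v for v in row] for row, s in zip(unit_rows, scales)]

# Hypothetical rotated keys: token 0 is a low-norm sink, tokens 1-3 are ordinary.
keys = [
    [0.2, -0.2, 0.2, 0.2],
    [4.0, 4.2, -3.9, 4.1],
    [-4.1, 3.8, 4.0, -4.2],
    [3.9, -4.0, 4.1, 4.0],
]
units, scales = ots_encode(keys)
recovered = ots_decode(quantize_per_channel(units), scales)
print("stored norms:", [round(s, 2) for s in scales])
```

Because every unit-norm entry lies in $[-1, 1]$, each recovered entry is within $s_t/3$ of the original at INT2, so the sink's error is scaled by its own small norm and does not depend on the ordinary tokens' magnitudes.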

Complete Algorithm: OScaR Algorithm 1

The following pseudocode traces OScaR from offline preparation through online inference:

ALGORITHM: OScaR KV Cache Pipeline

--- OFFLINE (one-time, before inference) ---
1. For each attention layer l:
   a. Load K projection weight: W_K ∈ R^{d_h × d_model}
   b. Compute H_{d_h} (normalized Hadamard matrix of size d_h × d_h)
   c. Merge: W̃_K ← H_{d_h} · W_K  (fused into weight matrix)
   d. Load V projection weight: W_V ∈ R^{d_h × d_model}
   e. Load output projection weight: W_O ∈ R^{d_model × d_h}
   f. Merge: W̃_V ← H_{d_h} · W_V;  W̃_O ← W_O · H_{d_h}^T
   g. Query: W̃_Q ← H_{d_h} · W_Q  (cancel Key's rotation at attention)

--- ONLINE (per decode step, token t) ---
2. For token t at layer l:
   a. Compute k̃_t = W̃_K · x_t     [K with baked-in H; no extra FHT]
      Compute q̃_t = W̃_Q · x_t     [Q with baked-in H; cancels K's H]
      Compute ṽ_t = W̃_V · x_t     [V with baked-in H]

   b. Omni-Token Scaling for keys:
      s_t ← ‖k̃_t‖₂                [scalar ℓ2 norm]
      k̂_t ← k̃_t / s_t             [unit-norm key]
      Append (k̂_t, s_t) to residual BF16 buffer

   c. If residual buffer length > R (e.g., R = 128):
      Commit oldest group of G (e.g., G = 32) tokens to INT2:
        For each channel j, block g:
          Δ_{j,g} = (max_i k̂_{i,j} - min_i k̂_{i,j}) / 3
          z_{j,g} = -(min_i k̂_{i,j}) / Δ_{j,g}
          Q_{i,j,g} = clamp(round(k̂_{i,j}/Δ_{j,g} + z_{j,g}), 0, 3)
        Store Q_{i,j,g} as INT2, store (Δ_{j,g}, z_{j,g}) as BF16

--- ATTENTION (per step) ---
3. Compute attention logits for token t:
   a. For each past token i in INT2 cache:
      k̃_i^rec ← s_i · Dequant(Q_i, Δ_{j,g(i)}, z_{j,g(i)})
      logit_i ← q̃_t · k̃_i^rec^T / √d_h
   b. For each past token i in BF16 residual buffer:
      logit_i ← q̃_t · (s_i · k̂_i)^T / √d_h
   c. Softmax → attention weights α
   d. Output ← Σ_i α_i · ṽ_i^rec

--- KEY CUDA IMPLEMENTATION DETAILS ---
4. Single fused kernel: FHT + ‖·‖₂ computation + INT2 packing
   (avoids three separate kernel launches and intermediate HBM writes)
5. FlashDecoding-v2 extended with INT2 dequant path
   (online dequantization inside the tiling loop, never materializing full BF16 K)

Figure 4: OScaR Algorithm 1 pseudocode with CUDA implementation notes.

Mathematical Analysis of Why OScaR Works

After Canalized Rotation and Omni-Token Scaling, every stored key has unit ℓ2 norm. The values on the unit $d_h$-sphere are roughly isotropically distributed (after the Hadamard whitening). For a per-channel block quantization group of $G$ unit-norm vectors in $\mathbb{R}^{d_h}$:

$$\mathbb{E}\!\left[\max_{i \in \mathcal{B}_g} \hat{k}_{i,j} - \min_{i \in \mathcal{B}_g} \hat{k}_{i,j}\right] = O\!\left(\sqrt{\frac{\log G}{d_h}}\right)$$

For $G=32$ and $d_h=128$, taking the logarithm base 2 and the hidden constant as 1: $\sqrt{\log_2(32)/128} = \sqrt{5/128} \approx 0.20$. The step size is then $\Delta_{j,g} \approx 0.067$ and the max quantization error is $\epsilon \leq 0.033$. By contrast, without OScaR, the range $R_{j,g}$ for a channel containing an outlier token can easily be $O(1)$ or larger, giving $\Delta \approx 0.33$ and max error $\leq 0.17$, a 5× larger error bound.
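The numbers in this estimate follow from a few lines of arithmetic. The sketch below treats the constant hidden in the $O(\cdot)$ as 1 and uses a base-2 logarithm, as the figures in the text do; both choices are simplifying assumptions.

```python
import math

G, d_h, bits = 32, 128, 2
levels = 2 ** bits - 1

# Expected per-channel range of unit-norm keys, up to the hidden constant.
range_oscar = math.sqrt(math.log2(G) / d_h)
step_oscar = range_oscar / levels
err_oscar = step_oscar / 2

# Reference case from the text: a channel whose range is about 1.
step_ref = 1.0 / levels
err_ref = step_ref / 2

print(f"OScaR:     range={range_oscar:.3f}, step={step_oscar:.3f}, max error={err_oscar:.3f}")
print(f"reference: range=1.000, step={step_ref:.3f}, max error={err_ref:.3f}")
print(f"ratio of error bounds: {err_ref / err_oscar:.1f}x")
```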

System Design and CUDA Implementation

OScaR’s algorithmic correctness is a necessary but not sufficient condition for practical utility. At INT2, memory is compressed but compute must also be efficient. This section details the engineering choices that deliver OScaR’s reported 3.0× latency improvement.

Kernel Fusion for Canalized Rotation + Scaling

The naive implementation would require three kernel launches per decoding step per key token:

  1. Apply FHT: $k \to H \cdot k$
  2. Compute ℓ2 norm: $s = \|H \cdot k\|_2$
  3. Scale and pack into INT2: $k \to \hat{k}$, then quantize

Each kernel launch has ~5–10 μs overhead and each intermediate result must round-trip through HBM. For short sequences, kernel launch overhead dominates. OScaR fuses all three into a single CUDA kernel:

__global__ void oscar_encode_key(
    float* k_in,          // raw key: [batch, heads, d_h]
    uint8_t* k_int2_out,  // packed INT2: [batch, heads, d_h/4]
    float* scales_out,    // per-token ℓ2 norms: [batch, heads]
    float* quant_params   // per-channel (Δ, z): [batch, heads, d_h/G, 2]
) {
    // Tile across d_h; each warp handles one head
    // Step 1: Load k_in to shared memory, apply butterfly FHT in-place
    // Step 2: Reduce warp-level squared-sum → ℓ2 norm s
    // Step 3: Divide by s in shared memory (unit-norm k̂)
    // Step 4: Per-channel min/max reduction over group G
    // Step 5: Compute Δ, z; quantize; pack 4× INT2 into 1 byte
    // Step 6: Write k_int2_out, scales_out, quant_params to HBM
}

FlashDecoding-v2 INT2 extension. Standard FlashDecoding splits the KV cache across sequence chunks and accumulates partial attention outputs. OScaR extends this by performing INT2 dequantization inside the tiled loop, restoring $\hat{k}_{i,j}$ to BF16 transiently in SRAM (never writing back to HBM), then scaling by $s_i$ and computing attention. The dequantized values are immediately consumed by the dot product, so SRAM bandwidth, not HBM bandwidth, determines cost.

Memory Layout for INT2 Cache

Packing 4× INT2 values into a single byte requires careful layout to avoid bank conflicts and enable coalesced access. OScaR uses a channel-major INT2 layout:

Packed byte = [bits 7:6 = val for channel j+3]
              [bits 5:4 = val for channel j+2]
              [bits 3:2 = val for channel j+1]
              [bits 1:0 = val for channel j+0]

For a warp reading the key cache, adjacent threads read adjacent packed bytes, so the accesses coalesce: each warp transaction covers 32 threads × 1 byte = 32 bytes, which is one aligned 32-byte sector of a 128-byte cache line.
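The bit layout above can be expressed as a pack/unpack pair. This is a host-side sketch of the byte format only, with illustrative function names; it says nothing about how the actual CUDA kernel arranges threads.

```python
def pack_int2(codes):
    """Pack 2-bit codes (0..3) four per byte; channel j+0 lands in bits 1:0."""
    assert len(codes) % 4 == 0, "need a multiple of four codes"
    assert all(0 <= c <= 3 for c in codes), "codes must fit in two bits"
    out = bytearray()
    for i in range(0, len(codes), 4):
        c0, c1, c2, c3 = codes[i:i + 4]
        out.append(c0 | (c1 << 2) | (c2 << 4) | (c3 << 6))
    return bytes(out)

def unpack_int2(packed):
    """Inverse of pack_int2: recover four codes per byte, lowest bits first."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]

codes = [3, 0, 1, 2, 0, 0, 3, 3]
packed = pack_int2(codes)
print([f"{b:08b}" for b in packed])
print(unpack_int2(packed) == codes)
```

A head with $d_h = 128$ channels packs into 32 bytes per token under this layout.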

Efficiency Results Summary

graph LR
    A["BF16 FlashDecoding-v2<br/>Memory: 5.3x relative<br/>Throughput: 331 tok/s<br/>Latency 128K: 30.9 ms/tok"]
    B["OScaR INT2<br/>Memory: 1.0x - 5.3x reduction<br/>Throughput: 1354 tok/s<br/>Latency 128K: 10.3 ms/tok"]
    A -- "5.3x memory reduction, 4.1x throughput, 3.0x latency" --> B

Figure 5: Memory footprint comparison. OScaR achieves a 5.3× reduction in KV cache memory at batch=48, ctx=4K on Qwen3-8B compared to BF16 FlashDecoding-v2.

| Metric | BF16 FlashDecoding-v2 | OScaR INT2 | Ratio |
|---|---|---|---|
| Memory (batch=48, ctx=4K) | 5.3× relative | 1.0× | 5.3× |
| Throughput (tokens/s, same setup) | 331 | 1354 | 4.1× |
| Latency (ms/token, ctx=128K) | ~30.9 | ~10.3 | 3.0× |

Table 1: OScaR efficiency metrics on H20 GPU, Qwen3-8B.

The 4.1× throughput gain exceeds the 3× latency gain because higher batch sizes become feasible at INT2, amortizing fixed overheads (weight loading, layer norm, FFN) across more tokens simultaneously.

Experimental Setup

Models and Hardware

  • Primary LLM: Llama-3.1-8B, Qwen3-8B (text-only)
  • Multimodal: Qwen3-VL-8B, Qwen3-VL-4B (vision-language)
  • Omni-modal: Qwen3-Omni-30B (audio-text-vision)
  • Hardware: NVIDIA H20 GPU (96 GB HBM3, 4.0 TB/s bandwidth)
  • Quantization config: INT2, group_size=32, residual buffer R=128 tokens
  • Precision: BF16 for weights, INT2 for cached K/V
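To make the `INT2, group_size=32` setting concrete, here is a minimal Python sketch of per-channel group quantization. It assumes a KIVI-style asymmetric min-max quantizer; the review does not state OScaR's exact rounding or zero-point convention, so the function names and the affine scheme are illustrative assumptions.

```python
def quantize_group(column, bits=2):
    """Asymmetric min-max quantization of one channel over one token group (assumed scheme)."""
    levels = (1 << bits) - 1               # 3 for INT2
    lo, hi = min(column), max(column)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((x - lo) / step))) for x in column]
    return codes, step, lo

def dequantize_group(codes, step, lo):
    """Reconstruct approximate values from codes and the group's (step, lo) metadata."""
    return [c * step + lo for c in codes]

def quantize_channel(values, group_size=32, bits=2):
    """Split one channel's token sequence into groups and quantize each independently."""
    return [quantize_group(values[i:i + group_size], bits)
            for i in range(0, len(values), group_size)]
```

With only four levels per group, the reconstruction error is bounded by half a step, and the step is set by the group's range. This is why a single token with an atypical magnitude inside a group degrades every other token in that group.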

Baselines

  1. BF16 FlashDecoding-v2: Full-precision upper bound
  2. KIVI: Per-channel key quantization, per-token value quantization; no rotation/scaling
  3. TurboQuant+: SOTA prior method combining rotation and per-channel quantization
  4. OTT (Omni-Token Transfer): Token-only normalization without Canalized Rotation

Evaluation Benchmarks

LongBench-E: A long-context benchmark suite testing 6 task types: single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic retrieval (NIAH). Average scores are reported as percentages.

NIAH (Needle-in-a-Haystack): Exact-match retrieval of a specific fact planted at a random position in a long document (up to 128K tokens). Tests whether quantization destroys retrieval fidelity.

OCRBench: Optical character recognition in images — tests whether quantization affects the vision encoder–LLM interface in multimodal models.

MMAU-Pro: Multi-modal audio understanding benchmark used with Qwen3-Omni.

Results

Text-Only: LongBench-E

graph LR
    BASE["BF16 Baseline 41.70pct reference"]
    OScaR["OScaR INT2 41.75pct best"]
    OTT["OTT 40.74pct -0.96pp"]
    TQ["TurboQuant plus 40.03pct -1.67pp"]
    KIVI["KIVI 39.84pct -1.86pp"]
    BASE --> OScaR
    BASE --> OTT
    BASE --> TQ
    BASE --> KIVI

Figure 6: LongBench-E INT2 scores on Llama-3.1-8B. OScaR (41.75%) exceeds the BF16 baseline (41.70%) by 0.05pp — the only INT2 method to do so. The second-best method (OTT, 40.74%) is 1.01pp behind OScaR.

On Qwen3-8B, OScaR scores 48.7% versus the BF16 baseline of 49.6%, a 0.9pp (about 1.8% relative) drop, while KIVI drops 4.5% relative.
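Because this section mixes percentage-point gaps and relative drops, a small helper makes the distinction explicit. This is an illustrative sketch; the function name `gap` is hypothetical, and the inputs are the scores quoted in this section.

```python
def gap(score, reference):
    """Return (absolute gap in percentage points, relative gap in percent) vs. the reference."""
    abs_pp = score - reference
    rel_pct = 100.0 * abs_pp / reference
    return round(abs_pp, 2), round(rel_pct, 2)

# Llama-3.1-8B LongBench-E scores from Figure 6, each against the BF16 baseline.
llama_gaps = {name: gap(score, 41.70)
              for name, score in [("OScaR", 41.75), ("OTT", 40.74),
                                  ("TurboQuant+", 40.03), ("KIVI", 39.84)]}
```

The same helper applies to the Qwen3-8B pair (48.7 vs 49.6); the relative figure is always the absolute gap divided by the baseline score.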

NIAH Retrieval

The NIAH result is arguably OScaR’s strongest single result:

  • OScaR: 96.5% exact-match retrieval
  • 16-bit BF16: 96.0%
  • Second-best INT2 method: 92.7%
  • KIVI INT2: ~88%

OScaR at INT2 surpasses full-precision BF16 by 0.5pp on this task. This is counterintuitive. One reading is that OScaR’s token-norm equalization incidentally makes attention weight distributions more uniform and needle retrieval more reliable, but a 0.5pp margin is small enough that it is safer to treat the result as parity with BF16 until it is confirmed by ablation.

Multimodal: OCRBench

On Qwen3-VL-8B: OScaR 66.6% vs TurboQuant 65.8% vs KIVI 66.2% vs 16bit 67.4%. OScaR trails 16-bit by 0.8pp, recovering half of TurboQuant’s 1.6pp deficit and a third of KIVI’s 1.2pp deficit. On the smaller Qwen3-VL-4B, OScaR achieves +2.5pp over the second-best INT2 method; the gap is larger on smaller models, consistent with small models being more sensitive to quantization noise.

Omni-modal: MMAU-Pro

On Qwen3-Omni-30B: OScaR 85.6%, TurboQuant 84.7%, KIVI 85.1%, 16bit 85.8%. The gap from 16bit is only 0.2pp. OScaR generalizes beyond text-only transformers to audio-visual-language models that process heterogeneous modality tokens in a unified KV cache, which suggests that TNI is a modality-agnostic pathology.

Cross-Model Comparison: Does OScaR Generalize?

An important question is whether OScaR’s benefits are model-specific (artifacts of Llama/Qwen architectures) or general. The paper addresses this partially by testing four models across three settings (text-only, vision-language, and omni-modal):

  • Llama-3.1-8B: Standard GQA with 32 query heads and 8 K/V heads, RoPE positional embeddings, SwiGLU FFN.
  • Qwen3-8B: GQA with RoPE, plus QK-Norm on queries and keys (absent in Llama).
  • Qwen3-VL-8B: Vision encoder prefix tokens added to the KV cache alongside text tokens; tests whether TNI exists for vision tokens.
  • Qwen3-Omni-30B: Audio and text tokens interleaved; tests TNI in multi-modal token streams.

Across the reported models, OScaR improves on the other INT2 baselines by roughly 0.4 to 2.5pp depending on model and benchmark, which supports (but does not prove) the claim that TNI is architecture-agnostic. The underlying cause, Attention Sinks at fixed positions with anomalously small norms, is a property of the softmax attention mechanism rather than any specific weight initialization, making it plausibly universal across the current transformer family.

Per-Task Analysis: LongBench-E Subtasks

While average numbers look strong, the distribution of per-task results reveals important nuances:

| Task type | OScaR vs 16bit gap (Llama-3.1-8B) |
| --- | --- |
| Single-doc QA | ~0 pp |
| Multi-doc QA | +0.3 pp (OScaR better!) |
| Summarization | -0.2 pp |
| Few-shot | -0.1 pp |
| Code completion | ~0 pp |
| NIAH synthetic | +0.5 pp |
| Qasper (long doc QA) | -3.2 pp (largest gap) |

Qasper requires long-range reasoning over a full scientific paper. The residual BF16 buffer covers only the last $R=128$ tokens, so tokens requiring long-range attention must pass through INT2. For tasks where a single distant key is critically important (Qasper), INT2 quantization error in that key can derail the answer.

Limitations

The paper acknowledges several boundaries of OScaR’s applicability:

1. Residual Buffer Dependency. OScaR still requires a full-precision buffer of $R=128$ tokens. This is necessary for newly-arriving tokens whose statistics are insufficient for stable block quantization. While the paper treats $R$ as a fixed hyperparameter, the optimal $R$ likely varies by task and model. At 128K context the buffer holds only about 0.1% of tokens; because a BF16 token costs roughly 5 to 8 times as much as an INT2 token, this is still under 1% of cache memory (per layer, the BF16 key buffer is $128 \times 32 \times 128 \times 2$ bytes $= 1$ MB). At shorter contexts the fraction is larger.
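How the buffer's share of cache memory changes with context length can be estimated with a back-of-the-envelope sketch. It assumes 16 bits per buffered element and 2 bits per quantized element and ignores scale and zero-point metadata, so the result is an approximation, not a measurement from the paper. The function name is hypothetical.

```python
def buffer_fraction(context_len, buffer_tokens=128, bf16_bits=16, quant_bits=2):
    """Approximate share of KV-cache bits held by the full-precision residual buffer.

    Metadata for the quantized part is ignored, so the true share is slightly lower.
    """
    buffered = min(buffer_tokens, context_len)
    buffer_bits = buffered * bf16_bits
    quantized_bits = (context_len - buffered) * quant_bits
    return buffer_bits / (buffer_bits + quantized_bits)

# Share at a few context lengths, from short prompts to the 128K setting.
shares = {ctx: buffer_fraction(ctx) for ctx in (2048, 4096, 32768, 128000)}
```

Under these assumptions the share falls steadily as context grows, which is why the buffer matters far more for short-context benchmarks than for the 128K setting.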

2. Sequence-Length Scaling is Not Addressed. OScaR reduces bits per token but does not address the $O(S)$ growth of the cache. As $S \to \infty$, even 2-bit quantization may be insufficient. OScaR is complementary to, but does not replace, methods like H2O (heavy-hitter eviction), SnapKV, or streaming window attention that reduce $S$ itself.

3. Architecture-Specific CUDA Kernels. The fused FHT + INT2 packing kernel is implemented specifically for GQA (Grouped-Query Attention) heads in Qwen and Llama architectures. MLA (Multi-head Latent Attention), used in DeepSeek models, has a structurally different KV layout — the paper does not discuss MLA adaptation.

4. Extreme Degradation on Specific Tasks. The ~3pp gap on Qasper (mentioned above) suggests that INT2 remains insufficient for tasks requiring precise retrieval of many specific distant facts in a long document. The paper’s average-metric presentation can obscure these task-specific failures.

5. Group Size Sensitivity. Group size $G=32$ is used throughout. Larger $G$ reduces metadata overhead but worsens quantization (more tokens per block means more chance of TNI within a block). The paper does not provide ablations across different $G$ values.

6. Interaction with Positional Encoding. Modern LLMs use RoPE (Rotary Positional Encoding), which applies position-dependent rotations to queries and keys before computing attention. OScaR applies its Hadamard rotation after the linear projection but before the per-head dimension split where RoPE is applied. The interaction between the fixed Hadamard matrix and the continuously varying RoPE rotation has not been analyzed: in principle, RoPE could partially “undo” the channel-equalization effect of the Hadamard transform for certain head dimensions at certain positions. The empirical results suggest this is not a practical problem, but the theory is incomplete.

7. Online Scale Computation Adds Latency at Prefill. During the prefill phase, OScaR must compute the ℓ2 norm for each token in the prompt before quantization. For a 128K-token prompt, this means $128000 \times H$ norm computations in addition to the Hadamard transforms. While individually cheap (a single warp-level reduction), at large head counts and long contexts this contributes measurably to prefill latency. The paper reports decoding latency only; prefill latency comparison with BF16 is not provided.

Critical Assessment: Weaknesses & Improvements

Weakness 1: Causal Attribution Is Partially Circular

The paper argues that TNI is the fundamental bottleneck and validates this by showing OScaR (which fixes TNI) outperforms all baselines. However, this is an indirect argument: the paper does not isolate the contribution of TNI-fixing from the Hadamard rotation’s other beneficial effects (e.g., reduction of channel-wise kurtosis, which is independently known to help INT quantization). The OTT baseline (scaling without Canalized Rotation) scores 40.74%, about 1pp below OScaR, showing that scaling alone is not enough. But a Hadamard-only baseline (CR without OTS) is not reported. Without CR-only numbers, it is unclear whether the primary mechanism is TNI equalization or generic outlier reduction from the Hadamard transform. This matters for understanding generalizability.

Suggested improvement: Add a CR-only ablation (Hadamard transform without Omni-Token Scaling) to the results table. Report per-group range $R_{j,g}$ statistics before and after each step to isolate the mechanism.
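The range statistics suggested above could be collected along the following lines. This is a sketch under stated assumptions: an orthonormal fast Walsh-Hadamard transform stands in for Canalized Rotation, and plain unit-ℓ2 normalization stands in for Omni-Token Scaling. The paper's exact operators may differ, and all function names are hypothetical.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def unit_norm(vec):
    """Scale a token vector to unit l2 norm (stand-in for per-token scaling)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def channel_ranges(tokens):
    """Per-channel (max - min) over the tokens of one quantization group."""
    return [max(col) - min(col) for col in zip(*tokens)]

def range_report(tokens):
    """Largest per-channel range of one group under each ablation variant."""
    rotated = [fwht(t) for t in tokens]
    variants = {
        "raw": tokens,
        "cr_only": rotated,
        "scaling_only": [unit_norm(t) for t in tokens],
        "cr_then_scaling": [unit_norm(t) for t in rotated],
    }
    return {name: max(channel_ranges(v)) for name, v in variants.items()}
```

Running `range_report` per group and per layer on real key activations would show how much of the range reduction comes from rotation alone and how much from scaling, which is the attribution the ablation is meant to settle.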

Weakness 2: Evaluation Scope Is Narrow

All text-only experiments use Llama-3.1-8B and Qwen3-8B — both ~8B parameter models. INT2 quantization behavior at larger scales (30B, 70B) is reported only for the omni-modal task (Qwen3-Omni-30B), where one datapoint is insufficient to draw conclusions. The claim “OScaR performs well across model scales” is unsupported for the 70B-175B range. Large models have more attention heads, different outlier statistics, and architectures (e.g., MoE) that may change the TNI picture.

Suggested improvement: Include at least one result on a 70B-class model such as Llama-3.1-70B or Qwen2.5-72B. Even a single LongBench-E number would substantially strengthen the generalization claim.

Weakness 3: The NIAH “Better than BF16” Result Is Suspicious

OScaR reports 96.5% vs. BF16’s 96.0% on NIAH, meaning INT2 quantization appears to improve retrieval. The paper attributes this to more uniform attention distributions, but provides no mechanistic explanation or ablation. Possible alternative explanations: (a) random seed variation in needle placement; (b) the specific NIAH configuration used (needle length, document structure) is favorable to OScaR; (c) the residual BF16 buffer happens to include the needle token in most test cases, so INT2 quantization of other tokens doesn’t matter.

Suggested improvement: Report NIAH scores broken down by needle depth (shallow, middle, deep) and context length. A genuine attention-uniformity improvement should show the largest gains at deep/long positions.

Weakness 4: Overhead Accounting for the Scale Vector

Storing one BF16 scale $s_t$ per token per head adds $H \times S \times 2$ bytes to the cache per layer. For $H=32$, $S=128000$: about 8 MB. For $H=8$ (GQA), $S=1$M: 16 MB. This is described as negligible, and indeed it is, but the paper does not account for the scale vector bandwidth cost during attention. For each cached key, a BF16 multiply by the scale must be performed during dequantization. At 128K context with 32 heads, this involves $32 \times 128000 \approx 4$M BF16 multiply operations. Whether this is a bottleneck in practice is not measured.

Suggested improvement: Profile and report the runtime breakdown between dequantization cost and attention FLOPs for different context lengths. This would validate the claim that the scale-vector overhead is truly negligible.
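The storage figures quoted for the scale vector are easy to reproduce, and the same arithmetic gives the overhead relative to the INT2 key payload. This sketch assumes a head dimension of 128 (the value used in the buffer-size arithmetic earlier in this review) and counts a single layer; the function names are hypothetical.

```python
def scale_vector_bytes(num_heads, seq_len, bytes_per_scale=2):
    """Bytes needed for one BF16 scale per token per head, for a single layer."""
    return num_heads * seq_len * bytes_per_scale

def scale_overhead_ratio(head_dim=128, key_bits=2, bytes_per_scale=2):
    """Scale bytes divided by packed key bytes, per token per head."""
    payload_bytes = head_dim * key_bits / 8
    return bytes_per_scale / payload_bytes
```

Under these assumptions a 2-byte scale sits next to 32 bytes of packed keys, so the scale is small in absolute terms but not vanishing relative to the INT2 payload it accompanies. This is the motivation for profiling it.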

Weakness 5: Comparison with Structured Pruning Is Missing

OScaR achieves 5.3× memory reduction at INT2. Competing approaches like SnapKV (evict low-importance KV pairs), StreamingLLM (sliding window + attention sinks), and H2O (heavy-hitter eviction) also achieve large effective memory reductions by dropping tokens entirely. The paper compares only against quantization-based methods. A direct comparison with SnapKV or similar at equivalent memory budgets would better contextualize OScaR’s practical value.

Suggested improvement: Include a memory-controlled comparison where both OScaR (INT2, full context) and SnapKV/H2O (BF16, pruned context) use the same total memory. On tasks requiring access to sparse distant tokens (NIAH, long-range QA), OScaR’s full-context retention should win decisively.
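To size such a memory-controlled comparison, one needs the number of BF16 tokens an eviction method may retain while matching the quantized cache's footprint. A minimal sketch, assuming the reported 5.3× compression ratio applies uniformly across the cache (the function name is hypothetical):

```python
def equal_memory_budget(context_len, compression_ratio=5.3):
    """Tokens (and fraction of context) a BF16 eviction method may keep at equal memory."""
    kept = int(context_len / compression_ratio)
    return kept, kept / context_len
```

At 128K context this leaves the eviction baseline with fewer than a fifth of the tokens, so the comparison directly tests whether keeping every token at low precision beats keeping a subset at full precision.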

Weakness 6: The “Training-Free” Claim Requires Nuance

The paper emphasizes training-free applicability. However, the offline weight-merging step modifies $W_K$, $W_Q$, $W_V$, and $W_O$ with Hadamard rotations. For models deployed with quantized weights (e.g., already-INT4-quantized via GPTQ or AWQ), these offline transformations may interfere with the existing quantization scheme. The paper assumes BF16 base weights, which may not hold for all deployment scenarios.

Suggested improvement: Test OScaR applied on top of an already weight-quantized model (e.g., W4A16). Report whether the offline Hadamard merging degrades weight quantization quality, or whether the two compression axes are orthogonal.

Weakness 7: No Sensitivity Analysis on Residual Buffer Size

The residual buffer of $R=128$ BF16 tokens is treated as a fixed hyperparameter with no ablation. At shorter contexts (e.g., 2K tokens), the BF16 buffer covers 6.25% of all tokens, a non-trivial fraction that partly explains low degradation on short-context tasks, potentially inflating reported accuracy. At very long contexts ($S = 1$M tokens), $R=128$ becomes negligible. A sensitivity curve showing accuracy vs $R$ (at $R \in \{0, 32, 64, 128, 256\}$) would quantify how much of OScaR’s accuracy benefit comes from the INT2 innovation versus simply keeping more tokens in high precision.

Suggested improvement: Ablate residual buffer size $R$ on LongBench-E and NIAH. Report the accuracy vs memory trade-off curve for different $R$ values. This would clearly separate the contribution of INT2 accuracy from BF16 residual accuracy.

Conclusion

OScaR is a well-motivated and carefully engineered contribution to extreme KV cache compression. Its central diagnostic insight (that Token Norm Imbalance, specifically the anomalously low norms of Attention Sink tokens, is the primary bottleneck for INT2 per-channel key quantization) is original and empirically supported. The two-component solution (Canalized Rotation followed by Omni-Token Scaling) is elegant: CR prevents scaling-induced outlier artifacts, OTS equalizes token norms, and the paper argues that neither alone suffices. The scaling-only OTT baseline supports this; a CR-only ablation is not reported.

The experimental results are compelling across three model modalities (text, vision-language, audio-visual-language), with the NIAH result being particularly striking. The 5.3× memory reduction and 4.1× throughput improvement on H20 GPU are practically relevant numbers for production serving workloads.

The main open questions are:

  1. Can CR-only and OTS-only ablations confirm the claimed mechanism?
  2. Does OScaR generalize to 70B+ models and MLA architectures?
  3. How does OScaR compare to token-eviction methods under equal memory budgets?
  4. Does OScaR’s residual buffer size $R$ significantly affect the reported accuracy?
  5. How does the Hadamard rotation interact with RoPE at long contexts?

Despite these gaps, OScaR represents a clear advance over KIVI and TurboQuant+ and provides a principled framework that could be extended to quantization of other tensor types (e.g., activation quantization, weight-activation quantization) where similar norm imbalance pathologies may exist. The Occam’s Razor in the title is apt: the simplest explanation of INT2 degradation (token norm disparity) points directly to the most parsimonious fix (normalize norms after smoothing outliers), and the engineering execution makes that fix practical.

For practitioners, OScaR is immediately usable on Llama and Qwen families with the released code, and the 5.3× memory reduction brings 128K-context inference within reach of GPUs with far less memory than a BF16 cache requires. For researchers, the TNI diagnostic opens a promising direction: understanding which structural token properties (not just channel statistics) govern quantization quality in transformer KV caches, and whether similar norm-based analyses apply to activations, FFN states, or speculative decoding draft caches.

The paper is recommended reading for anyone working on LLM inference efficiency. Its combination of a clearly-articulated failure diagnosis, a theoretically principled fix, and rigorous multi-modal empirical validation sets a high standard for the KV quantization literature.

Broader Context: Where OScaR Fits in the KV Compression Landscape

To close, it is useful to position OScaR within the three dominant paradigms for KV cache compression:

mindmap
  root((KV Compression))
    Quantization
      KIVI -- per-channel INT2
      KVQuant -- INT4 + outlier
      OScaR -- INT2 + rotation + scaling
    Eviction / Pruning
      H2O -- heavy-hitter eviction
      SnapKV -- clustered eviction
      StreamingLLM -- sliding window + sinks
    Low-Rank Projection
      MLA -- latent KV vectors
      KVSharer -- cross-layer sharing
      GEAR -- low-rank residual

Figure 7: KV cache compression taxonomy. OScaR occupies the quantization branch. It is complementary to eviction and low-rank methods — combining OScaR with SnapKV (quantize the retained cache) could yield compounding memory savings, though the interaction of eviction with OScaR’s norm statistics has not been studied.

Practical Deployment Considerations

For teams considering OScaR in production:

  1. vLLM integration: OScaR requires modifying the attention backend to use its fused INT2 kernel. A PagedAttention-compatible INT2 extension would be needed — currently not available as an upstream vLLM plugin.
  2. Calibration-free deployment: Because OScaR is training-free and requires no calibration data for the rotation (the Hadamard matrix is fixed by architecture, not data-dependent), it can be applied to any new model without additional preparation beyond the offline weight merge.
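  3. Cloud serving cost: If KV cache memory is the binding constraint, a 5.3× memory reduction means an 8-GPU serving cluster at BF16 could in principle serve the same load with 2 GPUs at OScaR INT2 (8 / 5.3 ≈ 1.5, rounded up), roughly a 4× reduction in GPU-hours cost. Model weights and compute are not compressed by OScaR, so real savings will be smaller. This is the economic driver for pushing to INT2.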
  4. Numerical stability: The scale $s_t$ is stored in BF16. For Attention Sink tokens with $s_t \approx 0.4$, BF16’s 8 bits of significand precision bound the relative rounding error at about 0.4%, and BF16 shares FP32’s exponent range, so there is no risk of underflow or overflow at these magnitudes.
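The numerical-stability point in the last item can be checked directly by rounding a scale to BF16. The sketch below emulates BF16 by keeping the upper 16 bits of the float32 bit pattern with round-to-nearest-even; `to_bf16` is a hypothetical helper, and NaN and infinity handling is omitted for brevity.

```python
import struct

def to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Rounding bias: 0x7FFF, plus 1 when the kept LSB is odd (ties go to even).
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]

def relative_error(x):
    """Relative rounding error incurred by storing x in BF16."""
    return abs(to_bf16(x) - x) / abs(x)
```

For a scale near 0.4 the rounding error stays well inside the worst-case bound of $2^{-8}$, so storing the per-token scale in BF16 is unlikely to be a meaningful error source compared with the 2-bit key codes.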

References and Further Reading

  • KIVI: Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” ICML 2024.
  • TurboQuant+: Prior SOTA on rotation-based INT2 KV quantization (exact citation not provided in OScaR preprint).
  • Attention Sinks: Xiao et al., “Efficient Streaming Language Models with Attention Sinks,” ICLR 2024.
  • Fast Hadamard Transform: Fino & Algazi, “Unified Matrix Treatment of the Fast Walsh-Hadamard Transform,” IEEE Trans. Comput. 1976.
  • FlashDecoding-v2: Dao et al., “FlashDecoding: Fast Large Language Model Inference on GPUs,” MLSys 2024.
  • SnapKV: Li et al., “SnapKV: LLM Knows What You are Looking for Before Generation,” NeurIPS 2024.
  • OScaR arXiv: https://arxiv.org/abs/2605.19660