SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression

Review date: 2026-06-26
Review author: Zhongzhu Zhou
Paper reviewed: SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
Paper authors: Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini
arXiv: 2606.07098
Status/Venue: arXiv preprint, June 2026

Short Answer

SigmaScale learns per-weight-matrix row and column scaling vectors that reshape the singular-value spectrum before truncated SVD compression, reducing the effective intrinsic rank and cutting activation-based reconstruction loss — making it competitive with the best SVD methods in the mild-to-moderate compression regime without requiring any specialized hardware.

Prerequisites: What You Need to Know Before Diving In

Before we get into SigmaScale itself, let me lay out the core concepts you need to follow the technical content. If you’ve worked with matrix factorization before, feel free to skim; if not, read this section carefully because everything else builds on it.

What Is Singular Value Decomposition (SVD)?

SVD is a fundamental matrix factorization theorem. For any matrix $W \in \mathbb{R}^{m \times n}$, SVD factorizes it into three matrices:

$$W = U \Sigma V^T$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left singular vectors
  • $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ sorted in descending order
  • $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right singular vectors

Think of the singular values as measuring “how important” each component direction is. Large singular values correspond to directions in which the matrix has large action; small singular values correspond to nearly-null directions.

Another useful way to see SVD: you can write the full matrix as a sum of rank-1 outer products:

$$W = \sum_{i=1}^{\min(m,n)} u_i \sigma_i v_i^T$$

where $u_i$ is the $i$-th column of $U$ and $v_i$ is the $i$-th column of $V$.

Truncated SVD and the Eckart–Young–Mirsky Theorem

The key theorem driving nearly all low-rank compression work is the Eckart–Young–Mirsky theorem (1936/1960):

Theorem (Eckart–Young–Mirsky): Among all rank-$k$ matrices $W'$, the one that minimizes the Frobenius norm $\|W - W'\|_F$ is given by the truncated SVD:

$$W^{(k)} = \sum_{i=1}^{k} u_i \sigma_i v_i^T = U_k \Sigma_k V_k^T$$

where $U_k, V_k$ keep only the top $k$ columns and $\Sigma_k$ keeps only the top $k$ singular values.

Intuition: Because singular values are sorted in descending order, keeping the top $k$ retains the “most important” $k$ directions and discards the weakest ones. The error of this approximation is:

$$\|W - W^{(k)}\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_{\min(m,n)}^2}$$

This is optimal: no other rank-$k$ matrix is closer to $W$ in Frobenius norm.
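The error formula is easy to check numerically. Here is a minimal NumPy sketch on a small random matrix; the sizes, the seed, and the helper name `truncated_svd` are my own choices for illustration, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Thin SVD: U is 6x4, s holds the 4 singular values (descending), Vt is 4x4.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def truncated_svd(U, s, Vt, k):
    """Rank-k approximation built from the top-k singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

k = 2
W_k = truncated_svd(U, s, Vt, k)

# Frobenius error of the rank-k approximation ...
err = np.linalg.norm(W - W_k, "fro")
# ... versus the root of the sum of squared discarded singular values.
tail = np.sqrt(np.sum(s[k:] ** 2))
print(err, tail)
```

The two printed numbers should agree to floating-point precision, and keeping all four singular triplets reproduces `W` itself.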

Memory savings: instead of storing $m \times n$ parameters, you store $U_k$ ($m \times k$), $\Sigma_k$ ($k$ values), and $V_k^T$ ($k \times n$), a total of $k(m+n+1)$ parameters vs. $mn$. Folding $\Sigma_k$ into the two factors removes the extra $k$ values, so the fraction of parameters retained is $k(m+n)/mn$. For large matrices and small $k$, this is a big saving.

Why doesn’t vanilla SVD work well for LLMs? The Eckart–Young theorem minimizes $\|W - W'\|_F$, but what we really care about is whether the model produces the same outputs on real data. The Frobenius norm treats all weight entries equally, but in practice some directions matter enormously (because they amplify large activations) while others are nearly irrelevant. This is the root cause motivating activation-aware methods.

Low-Rank Representation at Inference Time

Once you have $W' = U_k \Sigma_k V_k^T = L R$ where:

$$L = U_k \sqrt{\Sigma_k} \in \mathbb{R}^{m \times k}, \quad R = \sqrt{\Sigma_k} V_k^T \in \mathbb{R}^{k \times n}$$

a forward pass becomes:

$$Wx \approx W'x = L(Rx)$$

You compute $Rx$ first (a $k \times n$ matrix times an $n$-dimensional vector gives a $k$-dimensional vector, cost $kn$), then $L(Rx)$ (an $m \times k$ matrix times a $k$-dimensional vector, cost $mk$). Total cost: $k(m+n)$ vs. the original $mn$. For $k \ll \min(m,n)$ this is a substantial speedup that works on any hardware: no special kernel or quantized data type needed.
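To make the two-step forward pass concrete, here is a small NumPy sketch. The helper `factorize` and the toy dimensions are assumptions of mine; the multiply-accumulate counts are just the $mn$ and $k(m+n)$ figures from the text.

```python
import numpy as np

def factorize(W, k):
    """Split the rank-k truncated SVD of W into L (m x k) and R (k x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:k])
    L = U[:, :k] * root            # U_k sqrt(Sigma_k)
    R = root[:, None] * Vt[:k]     # sqrt(Sigma_k) V_k^T
    return L, R

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

L, R = factorize(W, k)
y_lowrank = L @ (R @ x)            # two small matvecs, never forms the m x n product

dense_macs = m * n                 # cost of W @ x
lowrank_macs = k * (m + n)         # cost of R @ x followed by L @ (...)
print(dense_macs, lowrank_macs)
```

For these toy sizes the factored path needs 896 multiply-accumulates instead of 3072; whether that translates into wall-clock speedup depends on the hardware and kernel efficiency, which this sketch does not measure.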

Activation-Aware Compression Loss

Instead of the Frobenius norm on weights, we want to minimize reconstruction error on actual activations. For a calibration dataset with input activations $X \in \mathbb{R}^{n \times s}$ ($s$ samples), the activation-aware Frobenius loss is:

$$\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2$$

This shifts focus from weight structure to functional equivalence: two weight matrices that compute similar outputs on typical inputs are “the same” from a compression standpoint, even if they differ entry-wise.
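A tiny experiment shows why this matters. In the sketch below (toy data and variable names are my own), two perturbations of $W$ have exactly the same Frobenius norm in weight space, but one touches a column that multiplies an outlier input channel and the other touches a quiet channel.

```python
import numpy as np

def activation_loss(W, W_approx, X):
    """L_F = ||W X - W' X||_F^2 / (m n), following the normalization in the text."""
    m, n = W.shape
    diff = (W - W_approx) @ X
    return np.sum(diff ** 2) / (m * n)

rng = np.random.default_rng(0)
m, n, s = 4, 3, 100
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, s))
X[0] *= 50.0                        # make input channel 0 an outlier channel

E_outlier = np.zeros((m, n))
E_outlier[:, 0] = 0.1               # error on the column fed by the outlier channel
E_quiet = np.zeros((m, n))
E_quiet[:, 1] = 0.1                 # same-size error on a quiet column

loss_outlier = activation_loss(W, W + E_outlier, X)
loss_quiet = activation_loss(W, W + E_quiet, X)
print(loss_outlier, loss_quiet)
```

The plain Frobenius norm cannot tell the two perturbations apart, while the activation-aware loss differs by several orders of magnitude.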

Effective Rank Entropy

The effective rank entropy of a matrix’s singular value spectrum is a soft measure of how many singular values carry meaningful information. For a diagonal matrix $\Sigma$ with non-negative entries $\sigma_i$, define the normalized probabilities $p_i = \sigma_i / \sum_j \sigma_j$. The effective-rank entropy is:

$$H(\Sigma) = -\sum_i p_i \log p_i$$

Low entropy means the spectrum is concentrated (a few large singular values dominate, others are tiny), which is to say effectively low rank. High entropy means the singular values are spread out (many directions matter equally). When a compression method can lower the effective rank entropy of the scaled weight matrix, it means the spectrum becomes more concentrated after the linear transformation, and truncated SVD can capture a larger fraction of the information with fewer rank-1 components.
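The definition translates directly into a few lines of NumPy. This is a sketch under my own conventions: the function name and the `eps` guard (which treats $0 \log 0$ as $0$) are assumptions, not something the paper specifies.

```python
import numpy as np

def effective_rank_entropy(singular_values, eps=1e-12):
    """Shannon entropy of the singular values normalized to sum to 1."""
    s = np.asarray(singular_values, dtype=float)
    p = s / s.sum()
    p = p[p > eps]                  # drop zero entries so that 0 * log(0) counts as 0
    return float(-(p * np.log(p)).sum())

flat = effective_rank_entropy(np.ones(8))                      # spread-out spectrum
peaked = effective_rank_entropy([100, 1, 1, 1, 1, 1, 1, 1])    # one dominant direction
print(flat, peaked)
```

A perfectly flat spectrum of 8 values gives the maximum entropy $\log 8$, and a spectrum dominated by one singular value gives a much smaller number, matching the "concentrated vs. spread out" reading above.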

Prior Art: ASVD and SVD-LLM

Before SigmaScale, two dominant approaches solved the “activation outlier” problem:

ASVD (Yuan et al., 2023): Instead of minimizing $\|W - W'\|_F$ directly, ASVD absorbs the activation statistics into the weight matrix before SVD. Specifically, it computes a diagonal scaling matrix $S$ analytically from per-channel activation magnitudes on the calibration data, then decomposes $WS$ by truncated SVD and folds $S^{-1}$ back in afterwards. The idea: if certain input channels have very large activation magnitudes, scaling the corresponding columns of $W$ up before SVD forces the decomposition to “pay attention” to those directions.

SVD-LLM (Wang et al., 2024): Computes the scaling matrix $S$ via Cholesky decomposition of the (uncentered) activation covariance matrix, $XX^T = SS^T$. The Cholesky factor $S$ defines a whitening transform ($S^{-1}X$ has identity covariance), and the truncated SVD of $WS$ is then optimal in the whitened (activation-covariance-normalized) metric. This gives a principled analytical solution, and SVD-LLM further combines this with a sequential layer-by-layer update scheme.

Both methods analytically derive the scaling from calibration statistics. SigmaScale’s key idea: why not learn the scaling matrices by gradient descent instead? This offers more flexibility to adapt to per-layer weight structure, at the cost of requiring an optimization loop.
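For reference, the whitening idea behind the analytical baselines can be sketched in a few lines. This is my own simplified rendering of Cholesky-based whitening on a single matrix, not SVD-LLM's actual implementation; the `ridge` term, the function names, and the toy data are assumptions.

```python
import numpy as np

def truncate(M, k):
    """Rank-k truncated SVD of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def whitened_truncate(W, X, k, ridge=1e-8):
    """Truncate W S at rank k, where S is the Cholesky factor of X X^T, then undo S."""
    n = W.shape[1]
    C = X @ X.T + ridge * np.eye(n)      # small ridge (assumption) keeps C positive definite
    S = np.linalg.cholesky(C)            # lower triangular, C = S S^T
    WS_k = truncate(W @ S, k)
    # W' = WS_k S^{-1}, computed by solving S^T Z^T = WS_k^T instead of inverting S.
    return np.linalg.solve(S.T, WS_k.T).T

def output_error(W, W_approx, X):
    return np.linalg.norm((W - W_approx) @ X, "fro")

rng = np.random.default_rng(0)
m, n, s, k = 16, 12, 200, 4
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, s)) * rng.uniform(0.1, 10.0, size=(n, 1))  # uneven channel scales

err_plain = output_error(W, truncate(W, k), X)
err_white = output_error(W, whitened_truncate(W, X, k), X)
print(err_plain, err_white)
```

On this calibration data the whitened truncation should never do worse than plain truncation in output error, because $\|(W - W')X\|_F = \|(W - W')S\|_F$ when $SS^T = XX^T$, and truncated SVD is optimal for the right-hand side.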

Introduction: The Problem SigmaScale Solves

Large language models have grown rapidly to tens and hundreds of billions of parameters (Llama, DeepSeek, Qwen, GPT-4, etc.). While their performance scales with parameter count, so does the deployment cost: GPU memory, inference latency, and power consumption.

Low-rank decomposition via SVD is an attractive compression approach because:

  1. It works on any hardware — no quantized data types or special kernels needed.
  2. It can be stacked with quantization or pruning.
  3. The compressed representation $W' = LR$ replaces every matrix multiply $Wx$ with two smaller ones $L(Rx)$ at reduced FLOPs.

But naïve SVD compression (minimize $\|W - W'\|_F$) performs poorly in practice because LLM activations contain outliers: certain input channels are much larger in magnitude than others, causing the activation-unaware SVD to allocate rank to directions that barely affect the output.

Prior works (ASVD, SVD-LLM) resolve this by computing a scaling transformation $S$ analytically from activation statistics, then decomposing the scaled matrix $WS$. Both approaches work well, but they fix $S$ up front and compute it from a summary statistic (activation magnitudes, or the Cholesky factor of the activation covariance) rather than directly from the compression loss.

SigmaScale’s hypothesis: directly optimizing $S$ under the activation-aware loss $\mathcal{L}_F$ should learn a better scaling transformation, one that minimizes actual compression error rather than a proxy statistic. Specifically, it learns per-matrix row and column scaling vectors $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ via gradient descent, then uses the resulting scaling matrices $S_r = \text{diag}(\exp(d_r))$ and $S_c = \text{diag}(\exp(d_c))$ to pre-condition the weight matrix before SVD truncation.

The SigmaScale Method: Full Technical Walkthrough

Figure 1: The SigmaScale Processing Pipeline

flowchart TD
    A["Pre-trained LLM\n(Llama 3.1 8B / Qwen3-8B)"] --> B["Phase 1: Sensitivity Probing\nPer-layer perplexity at 9 compression levels"]
    B --> C["Binary Search\nGlobal rank assignment k* per layer"]
    C --> D["Phase 2: Scaling Matrix Learning\nOptimize d_r, d_c per weight matrix\nunder activation-aware loss L_F"]
    D --> E["Phase 3: Apply Scaled SVD\nW' = Sr^{-1} * f_svd(Sr*W*Sc) * Sc^{-1}"]
    E --> F["Phase 4: Post-Compression Fine-Tuning\nSFT or KD with frozen uncompressed layers"]
    F --> G["Compressed LLM\nW' = L * R  (rank-k factors)"]

The pipeline has four distinct phases, executed once per model. Let me walk through each in detail.

Phase 1: Sensitivity Probing — Finding the Right Rank Per Layer

Not all layers are equally sensitive to compression. An early attention layer might tolerate aggressive rank reduction while a crucial MLP layer in the middle of the network might degrade sharply. Sensitivity probing characterizes this per-layer tolerance.

Step-by-Step: Sensitivity Probing

  1. Define a grid of compression ratios $c \in \{0.1, 0.2, \ldots, 0.9\}$ (where 0.9 means retain 90% of parameters).
  2. For each layer $\ell$ and each module (Q, K, V, O projections; MLP up/down/gate projections):
     a. Compute the target rank from the compression ratio:

        $$k = c \cdot |\mathbf{W}| \cdot (m + n)^{-1}$$

        where $|\mathbf{W}| = mn$ is the total parameter count of the weight matrix, and $m, n$ are its row and column dimensions. Rearranging: $k(m+n) = c \cdot mn$, so $k = c \cdot mn / (m+n)$.
     b. Apply truncated SVD at rank $k$ to the isolated weight matrix.
     c. Measure perplexity on the calibration set with this single weight compressed, all others intact.
  3. Result: a 2D sensitivity map (compression ratio × layer) with perplexity impact for each entry.
  4. Run the ASVD binary search algorithm over this map to find the optimal per-layer ranks $\{k_1^*, k_2^*, \ldots, k_L^*\}$ that meet the global compression target while minimizing total perplexity increase.
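The rank formula in step 2a is simple arithmetic, sketched below. Rounding down to an integer is my assumption (the review text only gives the real-valued formula), and `rank_for_ratio` is a hypothetical helper name.

```python
def rank_for_ratio(c, m, n):
    """Largest integer rank k with k * (m + n) <= c * m * n (floor is an assumption)."""
    return int(c * m * n // (m + n))

# Square 4096 x 4096 matrix at a few points of the probing grid.
for c in (0.1, 0.5, 0.9):
    k = rank_for_ratio(c, 4096, 4096)
    print(c, k, k / 4096)           # ratio, rank, fraction of singular directions kept
```

One thing this makes visible: for a square matrix the fraction of singular directions kept is $c/2$, so retaining 50% of the parameters keeps only 25% of the directions, because each retained direction costs a row of $R$ and a column of $L$.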

Figure 2: Sensitivity Probing Flow for a Single Layer

flowchart LR
    subgraph "For each layer ℓ and module"
        W["Weight matrix W ∈ R^{m×n}"] --> SVD["Compute SVD: W = U Σ V^T"]
        SVD --> RANK["Compute target rank k\nfor each c in {0.1,...,0.9}"]
        RANK --> TRUNC["Truncated SVD W_k = U_k Σ_k V_k^T"]
        TRUNC --> PPL["Measure perplexity\non calibration set"]
        PPL --> MAP["Sensitivity entry:\n(layer ℓ, module, c) → Δppl"]
    end
    MAP --> BINARY["Binary Search\nFind optimal k* per layer\nunder global budget"]

Why binary search? The problem of assigning per-layer ranks under a global parameter budget is combinatorially large. Binary search over the compression ratio $c$ (treating all layers uniformly at each candidate $c$, then perturbing) finds a good solution efficiently. ASVD introduced this technique; SigmaScale inherits it.

Why probe in isolation? Probing each layer’s sensitivity independently ignores cross-layer interactions, but it provides a good first approximation. The key insight is that layers with steeply rising perplexity curves are “sensitive” and should be given higher rank; flat curves indicate compressible layers.
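To give a feel for how a sensitivity map can be turned into per-matrix ratios, here is one possible threshold-style binary search. I want to be clear that this is my own simplified illustration and not ASVD's or SigmaScale's exact procedure: the function `allocate`, the tolerance-based selection rule, and the toy numbers are all hypothetical.

```python
def allocate(sens, ratios, sizes, c_global):
    """Pick one retention ratio per matrix so the kept-parameter fraction is <= c_global.

    sens[i][j] is the measured perplexity increase of matrix i at ratios[j].
    """
    total = sum(sizes)

    def choose(tau):
        # Per matrix: the smallest ratio whose measured damage stays within tau.
        picks = []
        for row in sens:
            ok = [r for r, d in zip(ratios, row) if d <= tau]
            picks.append(min(ok) if ok else 1.0)   # 1.0 means "leave uncompressed"
        return picks

    def kept_fraction(picks):
        return sum(r * sz for r, sz in zip(picks, sizes)) / total

    # Larger tolerance never increases the kept fraction, so binary search is valid.
    taus = sorted({d for row in sens for d in row})
    lo, hi = 0, len(taus) - 1
    best = choose(taus[hi])          # most aggressive option as a fallback
    while lo <= hi:
        mid = (lo + hi) // 2
        picks = choose(taus[mid])
        if kept_fraction(picks) <= c_global:
            best, hi = picks, mid - 1   # budget met: try a stricter tolerance
        else:
            lo = mid + 1
    return best

ratios = [0.25, 0.5, 0.75]
sens = [[9.0, 3.0, 1.0],     # sensitive matrix: damage rises steeply
        [0.5, 0.2, 0.1]]     # robust matrix: flat curve
print(allocate(sens, ratios, sizes=[100, 100], c_global=0.5))
```

In the toy example the sensitive matrix keeps 75% of its parameters and the robust one only 25%, which meets the 50% global budget at the smallest damage tolerance. If the budget is below what the grid can reach, the function just returns the most aggressive choice.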

Phase 2: Learning Scaling Matrices

This is the core novel contribution. For each weight matrix $W \in \mathbb{R}^{m \times n}$, SigmaScale learns two vectors $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ that define diagonal scaling transformations.

Design Choice 1: Why Diagonal Scaling?

A full scaling matrix $S \in \mathbb{R}^{m \times m}$ would have $m^2$ parameters to optimize, which is far too many. Restricting to diagonal scaling (just $m + n$ parameters total for row and column) makes the optimization lightweight and avoids overfitting to the calibration set.

Geometrically, diagonal row scaling $S_r = \text{diag}(s_1, \ldots, s_m)$ rescales each row of $W$ (each output channel) independently, while column scaling $S_c$ rescales each column (each input channel). If input channel $j$ carries activation outliers, adjusting the scale of column $j$ “absorbs” the outlier into the weight matrix in a way that SVD can better handle; row scaling gives the optimizer the same kind of control over how output channels are weighted during truncation.

Design Choice 2: Parameterizing via Exponentiation

Rather than learning $d_r, d_c$ as the scaling values directly, SigmaScale parameterizes through the exponential:

$$S_r = \text{diag}(\exp(d_r)), \quad S_c = \text{diag}(\exp(d_c))$$

Why exp? This ensures $S_r$ and $S_c$ are always positive definite diagonal matrices regardless of the values of $d_r, d_c$. This matters for two reasons:

  1. The inverse $S_r^{-1} = \text{diag}(\exp(-d_r))$ always exists (no division by zero).
  2. Positivity is a natural constraint for scaling matrices that “stretch” or “shrink” directions.

The optimization over $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ is unconstrained: no box constraints needed.

Initialization

The scaling vectors are initialized with small Gaussian noise scaled by the weight matrix’s standard deviation:

$$d_r, d_c = 0.1 \cdot \sigma_W \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\sigma_W$ is the empirical standard deviation of entries of $W$. This ensures the initial scaling is close to identity (since $\exp(d) \approx 1$ for small $d$) while respecting the scale of the weight matrix. Starting near identity means the first SVD compression is essentially unscaled, and the optimization incrementally learns how to scale.

The Objective: Activation-Aware Frobenius Loss

With the scaling matrices defined, the compressed approximation of $W$ under row/column scaling is:

$$W' = S_r^{-1} \cdot f_{\text{svd}}^{(k)}(S_r W S_c) \cdot S_c^{-1}$$

where $f_{\text{svd}}^{(k)}(M)$ denotes the rank-$k$ truncated SVD of matrix $M$.

Step-by-step breakdown of this formula:

  1. $S_r W S_c$: pre-condition the weight matrix by scaling rows (by $S_r$) and columns (by $S_c$). In the scaled space, the singular value spectrum more closely tracks functional importance.
  2. $f_{\text{svd}}^{(k)}(\cdot)$: truncate to rank $k$ in the scaled space. By Eckart–Young, this is the best rank-$k$ approximation in the scaled metric.
  3. $S_r^{-1}(\cdot)S_c^{-1}$: undo the scaling to get back to the original weight space. The final $W'$ is the “best rank-$k$ approximation of $W$ in the metric defined by $S_r, S_c$.”
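The three steps above can be sketched as a single NumPy function. Because $S_r$ and $S_c$ are diagonal, I apply them as elementwise row and column multiplications rather than building $m \times m$ and $n \times n$ matrices; the function names and toy inputs are my own.

```python
import numpy as np

def truncate(M, k):
    """Rank-k truncated SVD of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def scaled_truncate(W, d_r, d_c, k):
    """W' = S_r^{-1} f_svd^{(k)}(S_r W S_c) S_c^{-1} with S = diag(exp(d))."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    W_hat = s_r[:, None] * W * s_c[None, :]          # step 1: scale rows and columns
    W_hat_k = truncate(W_hat, k)                     # step 2: truncate in the scaled space
    return W_hat_k / s_r[:, None] / s_c[None, :]     # step 3: undo the scaling

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
d_r = 0.3 * rng.standard_normal(6)
d_c = 0.3 * rng.standard_normal(5)

W_approx = scaled_truncate(W, d_r, d_c, k=2)
print(np.linalg.matrix_rank(W_approx, tol=1e-8))
```

Two sanity checks follow from the formula: with $d_r = d_c = 0$ this reduces to plain truncated SVD, and at full rank it returns $W$ exactly regardless of the scaling, since the scaling and its inverse cancel.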

The training objective is:

$$\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2$$

Gradients flow through $W' = S_r^{-1} f_{\text{svd}}^{(k)}(S_r W S_c) S_c^{-1}$ with respect to $d_r$ and $d_c$ (via $S_r$ and $S_c$). The SVD is differentiable only where the singular values are distinct, and its gradients become numerically unstable when singular values are close together, but Taylor-expansion-based approximations (cited by the paper) allow approximate gradient computation.

Why normalize by $mn$? Without normalization, the loss magnitude grows with matrix size, making it hard to use a single learning rate schedule across different layers. Normalizing by $mn$ gives a loss that is roughly scale-invariant.
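Here is a toy version of the optimization loop. To keep it dependency-free I swap in central finite differences for the gradient and a backtracking step size, which is not how the paper computes gradients (it backpropagates through the SVD with the approximations mentioned above). The step count, learning rate, and helper names are all my assumptions.

```python
import numpy as np

def loss(W, X, d, k):
    """Activation-aware loss L_F for scaling parameters d = concat(d_r, d_c)."""
    m, n = W.shape
    s_r, s_c = np.exp(d[:m]), np.exp(d[m:])
    U, s, Vt = np.linalg.svd(s_r[:, None] * W * s_c[None, :], full_matrices=False)
    W_approx = ((U[:, :k] * s[:k]) @ Vt[:k]) / s_r[:, None] / s_c[None, :]
    return np.sum(((W - W_approx) @ X) ** 2) / (m * n)

def fd_grad(f, d, h=1e-5):
    """Central finite-difference gradient (stand-in for backprop through the SVD)."""
    g = np.zeros_like(d)
    for i in range(d.size):
        e = np.zeros_like(d)
        e[i] = h
        g[i] = (f(d + e) - f(d - e)) / (2 * h)
    return g

def learn_scaling(W, X, k, steps=30, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m, n = W.shape
    d = 0.1 * W.std() * rng.standard_normal(m + n)   # near-identity initialization
    f = lambda v: loss(W, X, v, k)
    history = [f(d)]
    for _ in range(steps):
        g = fd_grad(f, d)
        step = lr
        while step > 1e-8:                           # backtrack until the loss decreases
            cand = d - step * g
            if f(cand) < history[-1]:
                d = cand
                break
            step *= 0.5
        history.append(f(d))
    return d, history

rng = np.random.default_rng(1)
m, n, s, k = 8, 6, 64, 2
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, s)) * rng.uniform(0.1, 5.0, size=(n, 1))
d_star, history = learn_scaling(W, X, k)
print(history[0], history[-1])
```

Because a step is only accepted when it lowers the loss, the recorded history never increases. On a real 4096-wide matrix finite differences would be hopeless (two SVDs per parameter per step), which is exactly why differentiating through the SVD matters in practice.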

Figure 3: Scaling + SVD Data Flow for a Single Weight Matrix

flowchart LR
    subgraph inputs
        W["W ∈ R^{m×n}\noriginal weight"]
        X["X ∈ R^{n×s}\ncalibration activations"]
        dr["d_r ∈ R^m\nrow scale vector"]
        dc["d_c ∈ R^n\ncol scale vector"]
    end

    subgraph scaling
        Sr["Sr = diag(exp(d_r))\nRow scaling (m×m diag)"]
        Sc["Sc = diag(exp(d_c))\nCol scaling (n×n diag)"]
    end

    subgraph svd_compress
        SW["Ŵ = Sr · W · Sc\nScaled weight (m×n)"]
        TSVD["f_svd^k(Ŵ) = Uk Σk Vk^T\nRank-k truncated SVD"]
        Wprime["W' = Sr^{-1} Uk Σk Vk^T Sc^{-1}\nUnscaled compressed weight (m×n)"]
    end

    subgraph loss
        diff["WX - W'X (output diff)"]
        LF["L_F = (1/mn) ||WX - W'X||_F^2"]
    end

    dr --> Sr
    dc --> Sc
    W --> SW
    Sr --> SW
    Sc --> SW
    SW --> TSVD
    TSVD --> Wprime
    Sr --> Wprime
    Sc --> Wprime
    W --> diff
    X --> diff
    Wprime --> diff
    diff --> LF
    LF -->|"backprop through Sr, Sc"| dr
    LF -->|"backprop through Sr, Sc"| dc

Phase 3: Final Compressed Weight Extraction

After learning $d_r$ and $d_c$, the final low-rank factors are extracted as:

$$L = S_r^{-1} U_k \sqrt{\Sigma_k} \in \mathbb{R}^{m \times k}, \quad R = \sqrt{\Sigma_k} V_k^T S_c^{-1} \in \mathbb{R}^{k \times n}$$

so that $W' = LR$ exactly.

Why split $\Sigma_k$ as $\sqrt{\Sigma_k}$ between $L$ and $R$? This is a symmetric factorization that balances the magnitude of the two factors, helping numerical stability during post-compression fine-tuning. Alternatives (absorbing all of $\Sigma_k$ into $L$ or $R$) are also valid but create imbalanced scales.

What is stored? Instead of $W \in \mathbb{R}^{m \times n}$ ($mn$ parameters), we store $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{k \times n}$, totalling $k(m+n)$ parameters. At 0.9× retention with a square $4096 \times 4096$ matrix, such as a query or output projection in Llama 3.1 8B ($m = n = 4096$, $k \approx 0.9 \times 4096 \times 4096 / (4096+4096) = 0.9 \times 2048 \approx 1843$), the storage ratio is about $1843 \times 2 \times 4096 / 4096^2 \approx 0.90$, consistent with a 10% parameter reduction per matrix. (Llama 3.1 8B MLP matrices are rectangular, $4096 \times 14336$, so their ranks differ.)
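Factor extraction is a direct transcription of the two formulas. The sketch below uses toy sizes and my own function name, and checks that the product $LR$ matches $S_r^{-1} U_k \Sigma_k V_k^T S_c^{-1}$ and that the stored parameter count is $k(m+n)$.

```python
import numpy as np

def extract_factors(W, d_r, d_c, k):
    """Return L (m x k) and R (k x n) from the scaled rank-k SVD of W."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    U, s, Vt = np.linalg.svd(s_r[:, None] * W * s_c[None, :], full_matrices=False)
    root = np.sqrt(s[:k])
    L = (U[:, :k] * root) / s_r[:, None]           # S_r^{-1} U_k sqrt(Sigma_k)
    R = (root[:, None] * Vt[:k]) / s_c[None, :]    # sqrt(Sigma_k) V_k^T S_c^{-1}
    return L, R

rng = np.random.default_rng(0)
m, n, k = 10, 8, 3
W = rng.standard_normal((m, n))
d_r = 0.2 * rng.standard_normal(m)
d_c = 0.2 * rng.standard_normal(n)

L, R = extract_factors(W, d_r, d_c, k)

# Reference: build W' the long way and compare with L @ R.
U, s, Vt = np.linalg.svd(np.exp(d_r)[:, None] * W * np.exp(d_c)[None, :], full_matrices=False)
W_ref = ((U[:, :k] * s[:k]) @ Vt[:k]) / np.exp(d_r)[:, None] / np.exp(d_c)[None, :]
print(np.allclose(L @ R, W_ref), L.size + R.size)
```

After this step the scaling vectors are no longer needed at inference time: they are baked into $L$ and $R$, so the deployed model looks like any other low-rank factorization.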

Phase 4: Post-Compression Fine-Tuning

After replacing all weight matrices with their low-rank approximations, the model needs to be fine-tuned to recover performance. SigmaScale compares two strategies:

Supervised Fine-Tuning (SFT): optimize the compressed weights on an instruction-following dataset (Alpaca in this case). Non-compressed weights (layer norms, embeddings, LM head) are frozen; only the low-rank factor weights are updated.

Knowledge Distillation (KD): use the uncompressed teacher model to provide soft targets, minimizing KL-divergence between teacher and compressed student output distributions. The rationale: multi-step post-training (RLHF, instruction tuning) shaped the original model’s output distribution in ways that may not be captured by a simple supervised dataset. KD re-anchors the student to the teacher’s behavior.
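The difference between the two recovery objectives comes down to what the student is compared against. The NumPy sketch below is a generic illustration with made-up logits; details such as temperature scaling or loss weighting are omitted, and I do not know which exact variants the paper uses.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sft_loss(student_logits, labels):
    """Cross-entropy against hard ground-truth token labels."""
    logp = log_softmax(student_logits)
    return float(-logp[np.arange(len(labels)), labels].mean())

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) between the two output distributions."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))                 # 4 positions, vocabulary of 10
student = teacher + 0.5 * rng.standard_normal((4, 10)) # a perturbed "compressed" model
labels = np.array([1, 3, 5, 7])

print(sft_loss(student, labels), kd_loss(teacher, student))
```

SFT only sees one correct token per position, while KD sees the teacher's full distribution over the vocabulary, which is the reason one might expect KD to recover more of the original behavior.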

Interestingly, SigmaScale’s results show that KD does not substantially outperform SFT for this method — a negative result that the authors flag and contrast with prior work (Xin et al., 2026) that found KD beneficial for SVD compression recovery.

Pseudocode: Full SigmaScale Algorithm

Algorithm: SigmaScale Compression

Input:
  - Pre-trained LLM with weight matrices {W_ℓ}
  - Calibration activations X (32 samples, seq_len=2048)
  - Global target compression ratio c_global
  - Compression-ratio grid c ∈ {0.1, 0.2, ..., 0.9}

Phase 1 — Sensitivity Probing:
  for each layer ℓ, each module m (attn/MLP):
    for each c in {0.1, ..., 0.9}:
      k_c = c * |W_ℓ_m| / (rows + cols)   # Eq. (2)
      W'_c = f_svd^{k_c}(W_ℓ_m)           # Truncated SVD, no scaling
      Measure PPL(W_ℓ_m ← W'_c) on calibration set
      store sensitivity[ℓ][m][c] = Δppl
  # Binary search for globally optimal k* per layer
  {k*_ℓ_m} = BinarySearch(sensitivity, c_global)

Phase 2 — Learn Scaling Matrices:
  for each layer ℓ, each module m:
    k = k*_ℓ_m   # from Phase 1
    Initialize d_r ~ 0.1 * σ(W) * N(0, I_m)
    Initialize d_c ~ 0.1 * σ(W) * N(0, I_n)
    
    Optimization loop (T steps):
      S_r = diag(exp(d_r))                     # positive row scaling
      S_c = diag(exp(d_c))                     # positive col scaling
      Ŵ = S_r @ W @ S_c                        # scaled weight
      Û_k, Σ̂_k, V̂_k^T = truncated_SVD(Ŵ, k)  # rank-k SVD of scaled W
      W' = S_r^{-1} @ Û_k @ Σ̂_k @ V̂_k^T @ S_c^{-1}  # unscaled approx
      L_F = (1/mn) * ||W*X - W'*X||_F^2       # Eq. (4)
      Backprop: update d_r, d_c via gradient descent on L_F

Phase 3 — Extract Low-Rank Factors:
  for each layer ℓ, each module m:
    S_r = diag(exp(d_r*))   # final learned scaling
    S_c = diag(exp(d_c*))
    Ŵ = S_r @ W @ S_c
    U_k, Σ_k, V_k^T = truncated_SVD(Ŵ, k)
    L = S_r^{-1} @ U_k @ sqrt(Σ_k)    # Eq. (5a)
    R = sqrt(Σ_k) @ V_k^T @ S_c^{-1}  # Eq. (5b)
    Replace W with (L, R) in model     # W ≈ L @ R

Phase 4 — Post-Compression Fine-Tuning:
  Freeze all non-compressed weights (layer norms, embeddings, LM head)
  For each batch (x, y) from Alpaca dataset:
    Option A (SFT): minimize cross-entropy(student(x), y)
    Option B (KD):  minimize KL(teacher_logits(x) || student_logits(x))
    Update only L, R factors for compressed matrices

Output: Compressed LLM with all W replaced by LR factorizations

Line-by-Line Explanation of Key Steps

Phase 1, rank computation k_c = c * |W| / (rows + cols): This comes from solving $k(m+n) = c \cdot mn$ for $k$. The constraint is: the total parameter count of the factored representation ($k \cdot m + k \cdot n = k(m+n)$) should equal $c$ times the original parameter count $mn$.

Phase 2, S_r = diag(exp(d_r)): Exponentiation ensures all diagonal entries are strictly positive, making the matrix invertible. The unconstrained parameter space $d_r \in \mathbb{R}^m$ is mapped to positive definite diagonal matrices.

Phase 2, backprop through truncated SVD: This is non-trivial because the SVD function is not differentiable at repeated singular values. The paper cites Taylor-expansion-based gradient approximations for this step.

Phase 3, L = S_r^{-1} @ U_k @ sqrt(Σ_k) and R = sqrt(Σ_k) @ V_k^T @ S_c^{-1}: Verify: $LR = S_r^{-1} U_k \sqrt{\Sigma_k} \cdot \sqrt{\Sigma_k} V_k^T S_c^{-1} = S_r^{-1} U_k \Sigma_k V_k^T S_c^{-1} = W'$. ✓

The Mathematics: Why Does Scaling Help?

Framing the Problem as a Metric Change

The key insight is that SVD minimizes reconstruction error in a specific metric. Vanilla SVD minimizes $\|W - W'\|_F$ (the standard Frobenius norm, which treats all entries equally). What we actually want is to minimize output error $\|Wx - W'x\|$ for typical activations $x$.

If activations $x$ have (uncentered) covariance $\Sigma_x = \mathbb{E}[xx^T]$, the weighted output error is:

$$\mathbb{E}_x[\|Wx - W'x\|^2] = \|(W - W')\Sigma_x^{1/2}\|_F^2$$

So the “right” metric for compression is the activation-covariance-weighted Frobenius norm $\|\cdot\, \Sigma_x^{1/2}\|_F^2$. Any factor $S$ with $SS^T = \Sigma_x$ gives the same value, so SVD-LLM uses the Cholesky factor of $\Sigma_x$ in place of the symmetric square root and applies it as the scaling matrix $S_c$ on columns.
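The identity above holds exactly when the expectation is taken over the calibration samples themselves, which is easy to confirm numerically. In this sketch `D` stands in for $W - W'$, and I compare the symmetric square root with the Cholesky factor; the data and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 5, 4, 1000
D = rng.standard_normal((m, n))                       # stands in for W - W'
X = rng.standard_normal((n, s)) * np.array([[10.0], [1.0], [1.0], [0.1]])

Sigma_x = X @ X.T / s                                 # empirical E[x x^T]

# Symmetric square root via the eigendecomposition of Sigma_x.
evals, evecs = np.linalg.eigh(Sigma_x)
Sigma_half = (evecs * np.sqrt(evals)) @ evecs.T

# Cholesky factor: a different matrix, but also satisfies chol @ chol.T == Sigma_x.
chol = np.linalg.cholesky(Sigma_x)

lhs = np.mean(np.sum((D @ X) ** 2, axis=0))           # average squared output error
rhs_sym = np.linalg.norm(D @ Sigma_half, "fro") ** 2
rhs_chol = np.linalg.norm(D @ chol, "fro") ** 2
print(lhs, rhs_sym, rhs_chol)
```

All three numbers should coincide, since each equals $\text{tr}(D \Sigma_x D^T)$. This is why a triangular Cholesky factor is just as valid a "square root" for this purpose as the symmetric one.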

SigmaScale generalizes this: instead of fixing $S_c = \Sigma_x^{1/2}$, it learns $S_c$ (and also $S_r$ for rows) by gradient descent on the actual activation-aware loss $\mathcal{L}_F$.

Why Learned Scaling Can Beat Analytical Scaling

Analytical methods (ASVD, SVD-LLM) derive the optimal $S$ for a specific proxy objective (whitening, covariance alignment). But the true objective is minimizing $\mathcal{L}_F$ with the truncation at exactly rank $k$, which is a non-convex problem. Gradient descent over the full loss can find solutions that analytical methods cannot, because:

  1. It can account for interactions between row and column scaling simultaneously.
  2. It directly minimizes $\mathcal{L}_F$ rather than a proxy.
  3. It can adapt to per-matrix structure that doesn’t match simple covariance-based patterns.

The trade-off: every gradient step requires a full SVD computation (cost $O(mn \cdot \min(m,n))$, which is $O(n^3)$ for a square matrix), making it much more expensive than analytical methods that compute scaling once. SigmaScale is slower to compress but potentially higher quality.

Effective Rank Entropy: A Proxy for Compressibility

The effective rank entropy $H(\Sigma)$ of the singular value spectrum quantifies how “spread out” the information is across dimensions. For compression to be effective, we want the spectrum to be concentrated: a few large singular values capturing most of the information.

When SigmaScale’s learned scaling reshapes $W \to S_r W S_c$, it changes the singular value distribution of the scaled matrix. The paper shows (Table 2) that during optimization, the average effective rank entropy decreases, meaning the spectrum becomes more concentrated, and this decrease correlates strongly with reductions in $\mathcal{L}_F$.

Intuition: Diagonal scaling does not rotate the matrix; it stretches or shrinks individual rows and columns, which changes both the singular values and the singular vectors of the scaled matrix. A well-chosen scaling can concentrate variance along a few dominant singular directions, making rank-$k$ truncation more efficient. This is why SigmaScale works: it actively reshapes the singular value spectrum to be more amenable to low-rank approximation.

Experiments

Experimental Setup

| Factor | Details |
|---|---|
| Models | Llama 3.1 8B Instruct, Qwen3-8B |
| Compression ratios | 0.90× (mild), 0.75× (moderate), 0.50× (aggressive) |
| Calibration data | 32 samples × 2048 tokens from Wikitext-2 training split |
| Perplexity eval | 141 samples × 2048 tokens from Wikitext-2 test split |
| Zero-shot benchmarks | 5 downstream tasks (BoolQ, PIQA, SIQA, WinoGrande, ARC) |
| Fine-tuning dataset | Alpaca (52K instruction-following examples) |
| Baselines | SVD-LLM (Wang et al. 2024), ASVD+ (Yuan et al. 2023) |
| Post-compression FT | SFT vs. KD (uncompressed teacher) |
| Compute | Described in Appendix C (not fully disclosed in main text) |
| Evaluation | lm-evaluation-harness framework |

Figure 4: Comparison of Scaling Matrix Derivation Strategies

graph LR
    subgraph "ASVD (Yuan 2023)"
        A1["Compute activation\nmagnitudes from X"] --> A2["Scale columns of W\nby activation magnitude"]
        A2 --> A3["SVD decompose scaled W\nat rank k"]
    end
    
    subgraph "SVD-LLM (Wang 2024)"
        B1["Compute activation\ncovariance: C = XX^T"] --> B2["Cholesky: C = LL^T\nS_c = L (whitening)"]
        B2 --> B3["SVD decompose W S_c\nat rank k"]
    end
    
    subgraph "SigmaScale (This paper)"
        C1["Initialize d_r, d_c\n≈ small Gaussian"] --> C2["Learn S_r=diag(exp(d_r))\nS_c=diag(exp(d_c)) via SGD"]
        C2 --> C3["Minimize L_F = ||WX - W'X||_F^2\ndirectly over T steps"]
        C3 --> C2
        C3 --> C4["SVD decompose S_r W S_c\nat rank k*"]
    end

Key difference: ASVD and SVD-LLM derive scaling from activation statistics once before compression. SigmaScale optimizes scaling under the actual compression objective over multiple gradient steps.

Results Summary

The paper’s Table 1 (reproduced in condensed form) shows results for Llama 3.1 8B Instruct:

At 0.90× retention (mild compression):

  • SigmaScale substantially improves perplexity over SVD-LLM
  • Recovers most zero-shot performance on all five benchmarks
  • Both KD and SFT variants perform similarly

At 0.75× retention (moderate compression):

  • SigmaScale improves some zero-shot benchmarks vs. baselines
  • Perplexity improvements are marginal

At 0.50× retention (aggressive compression):

  • SigmaScale degrades sharply, especially for Llama 3.1 8B Instruct
  • ASVD+ and SVD-LLM appear more resilient at this extreme regime

Similar trends hold for Qwen3-8B, though the degradation at 0.50× is less severe.

| Method | 0.90× (mild) | 0.75× (moderate) | 0.50× (aggressive) |
|---|---|---|---|
| SigmaScale | Best (lowest PPL) | Competitive / marginal gain | Worst (sharp degradation) |
| SVD-LLM | Good | Good | More resilient |
| ASVD+ | Good | Good | More resilient |

(Qualitative summary from paper text; exact numbers in Table 1.)

Key trend: SigmaScale leads at mild compression but degrades most sharply under aggressive compression, suggesting the method’s benefit is specific to the retained-rank regime where learned scaling can reshape the spectrum without losing critical subspaces.

In short, SigmaScale is best at 0.90×, competitive at 0.75×, and degrades most at 0.50×. The method appears to be a “mild compression specialist.”

Why Does SigmaScale Fail at Aggressive Compression?

The paper’s own explanation: at 0.50× retention, the retained rank subspace is so small that no amount of scaling can compensate for the information discarded. Scaling manipulates which directions are considered important, but it cannot create information that simply isn’t there. Because the factored form stores $k(m+n)$ parameters, 0.50× parameter retention means $k = 0.5 \cdot mn/(m+n)$, which for a square matrix keeps only a quarter of the singular directions; at that point the model fundamentally loses capacity.

This is analogous to audio compression: you can choose which frequencies to keep (scaling), but at extremely low bitrates, no choice can preserve the signal quality.

Effective Rank Entropy Analysis

Table 2 from the paper quantifies the correlation between scaling optimization and effective rank entropy:

| Metric | Average Decrease During Training |
|---|---|
| Compression loss $\mathcal{L}_F$ | Measured (strong decrease) |
| Effective rank entropy $H(\Sigma)$ | Strong correlated decrease |

Interpretation: when gradient descent pushes the scaling vectors to reduce $\mathcal{L}_F$, it simultaneously reshapes the singular value spectrum to be more concentrated (lower $H(\Sigma)$). This is mechanistic evidence that SigmaScale works by “focusing” the weight matrix’s information content into fewer dominant directions, which is exactly what truncated SVD needs to perform well.

Figure 6: Feature Comparison of SVD Compression Methods

| Feature | Vanilla SVD | ASVD | SVD-LLM | SigmaScale |
|---|---|---|---|---|
| Scaling type | None | Column (mag.) | Column (Cholesky) | Row + Column (learned) |
| Scaling derived from | N/A | Act. magnitude | Act. covariance | Gradient descent |
| Optimization steps | 0 | 0 | 0 | Multiple (O(n³) per step) |
| Post-compression FT | Optional | Optional | Yes | Yes (SFT or KD) |
| Best regime | Any | Mild | Mild-moderate | Mild |
| Hardware requirement | None | None | None | None |
| Computational cost | Low | Medium | Medium | High |

The table highlights SigmaScale’s trade-off: most flexible and potentially highest quality, but most computationally expensive at compression time (though inference cost is identical to any other low-rank factorization).

Critical Assessment: Weaknesses and Improvements

Weaknesses and Flaws

1. Limited compression regimes evaluated. The paper only tests three compression levels: 0.90×, 0.75×, and 0.50×. The actually interesting and practically useful range for deployment is often 0.6×–0.85× — and results at these intermediate points are not presented. This makes it hard to assess where exactly SigmaScale transitions from effective to ineffective.

2. Evaluation breadth is narrow. The paper evaluates perplexity on Wikitext-2 and five zero-shot benchmarks. This omits:

  • Long-form generation quality (coherence, factuality, instruction following on real queries)
  • Coding benchmarks (HumanEval, MBPP)
  • Mathematical reasoning (GSM8K, MATH) — particularly relevant since quantization/compression has known issues with reasoning chains
  • Multilingual tasks (Qwen3 is multilingual; English-only eval seems insufficient)

The 5-benchmark suite is standard but known to be saturated at this model scale, meaning small differences in accuracy may be noise rather than signal.

3. Calibration data sensitivity not rigorously studied. The authors acknowledge using Wikitext-2 primarily “for consistency with SVD-LLM and ASVD” and admit it is likely a “subpar choice.” Yet they do not run any ablation varying the calibration dataset (e.g., instruction-following data vs. Wikipedia text vs. code). This is a significant omission: ASVD and SVD-LLM both demonstrate sensitivity to calibration distribution, and a learned scaling method with $m+n$ free parameters per matrix is potentially more sensitive.

4. Computational cost not quantified. The paper describes needing an SVD at every optimization step (cost $O(n^3)$), but Appendix C does not appear in the main text excerpt, and precise wall-clock compression times are not directly compared against SVD-LLM and ASVD. How many gradient steps are taken? What is the actual compression time overhead? For practitioners deciding whether to use SigmaScale vs. SVD-LLM, this information is critical.

5. Only 8B-scale models. Results are shown only on Llama 3.1 8B Instruct and Qwen3-8B. Low-rank methods often behave differently at different scales: 70B models have different singular value structures than 8B models. There is no evidence the method scales to the models most relevant for deployment (the 70B+ range where compression savings are largest in absolute terms).

6. No latency or throughput measurements. The paper motivates SVD compression as reducing “LLM-inference computing cost,” but reports no inference latency or throughput numbers. Frobenius reconstruction loss and perplexity tell us about weight quality, not actual speedup. Especially at 0.90× retention, the question is: what is the actual wall-clock speedup vs. the quality loss?

Limitations the Authors Understate or Omit

The O(n³) per-step cost is a showstopper for large layers. The paper mentions this as a limitation but does not quantify it. In a 70B model, MLP weight matrices are $8192 \times 28672$. A single SVD costs $O(\min(m,n)^2 \max(m,n))$, which for these dimensions is on the order of $10^{12}$ operations before constant factors. Running hundreds of gradient steps per matrix (each requiring a full SVD), repeated for every compressed matrix in the model, would be prohibitively slow. The paper does not propose approximate SVD (e.g., randomized SVD or Lanczos) to alleviate this, and does not bound the number of gradient steps.
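To put a rough number on that cost, the sketch below evaluates the leading-order term $\min(m,n)^2 \max(m,n)$ for the $8192 \times 28672$ shape quoted above. The constant factor (set to 1) and the step count of 200 are illustrative assumptions, not figures from the paper, so the result is an order-of-magnitude guide only.

```python
# Back-of-the-envelope SVD cost for one 70B-scale MLP matrix.
# ASSUMPTIONS (not from the paper): the constant factor and the step count
# are illustrative placeholders.

def svd_flops(m: int, n: int, constant: float = 1.0) -> float:
    """Leading-order cost of a dense SVD: constant * min(m, n)^2 * max(m, n)."""
    return constant * min(m, n) ** 2 * max(m, n)

m, n = 8192, 28672   # MLP matrix shape quoted in the text
steps = 200          # hypothetical number of gradient steps per matrix

per_svd = svd_flops(m, n)
per_matrix = per_svd * steps

print(f"one SVD   : {per_svd:.3e} operations (up to a constant factor)")
print(f"{steps} steps : {per_matrix:.3e} operations for a single matrix")
```

Multiplying the per-matrix figure by the number of compressed matrices in the model gives a first estimate of total compression cost, which is the quantity a wall-clock comparison would need to confirm.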

The negative KD result needs more investigation. Prior work (Xin et al., 2026) found KD significantly better than SFT for compressed LLM recovery. SigmaScale’s KD results are “not substantially better.” The authors note this but do not investigate why. Possible explanations: (a) SigmaScale’s learned scaling already pre-aligns the compressed model’s output distribution with the teacher; (b) the specific KD implementation was suboptimal; (c) the 8B model scale is too small for KD to show benefits. Without analysis, this result is hard to interpret or build on.

Interaction with LoRA or quantization not tested. Many practical deployments combine multiple compression techniques (e.g., SVD compression + INT8 quantization, or SVD initialization for LoRA fine-tuning). The paper claims SVD methods “can be deployed alongside quantization and pruning” but does not demonstrate this for SigmaScale.

Concrete Improvement Suggestions

1. Study calibration data ablation. Run SigmaScale with at least 3 calibration datasets: Wikitext-2 (used), Alpaca (instruction-following), and code (e.g., The Stack). Report how much calibration distribution shifts compression quality. This would directly address the paper’s own stated uncertainty about Wikitext being “subpar.”

2. Add randomized/approximate SVD. Replace the exact $O(n^3)$ SVD per gradient step with a randomized SVD (Halko et al., 2011) of cost $O(mn \log k)$. This would dramatically reduce compression time and enable applying the method to larger models. The loss in approximation quality from using approximate SVD in the inner loop is likely small compared to the truncation approximation itself.
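For readers unfamiliar with the technique, here is a minimal NumPy sketch of a randomized SVD in the style of Halko et al. It is not part of SigmaScale; the function name, the oversampling of 10, and the two power iterations are assumptions chosen for illustration. Note that this Gaussian-test-matrix variant costs roughly $O(mnk)$; the $O(mn \log k)$ figure relies on structured random test matrices.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate top-k SVD of A via a randomized range finder (illustrative)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, min(m, n))
    # Range finder: sample the column space of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))
    # Power iterations sharpen the spectrum; re-orthonormalize for stability.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix Q^T A (ell x n).
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Synthetic test matrix with exact rank 20.
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 20)) @ rng.standard_normal((20, 200))

U, s, Vt = randomized_svd(A, k=20)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Whether gradients through such an approximate factorization are accurate enough for learning the scaling vectors is an open question that the suggested experiment would have to answer.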

3. Extend evaluation to reasoning and coding. Add at minimum GSM8K (mathematical reasoning) and HumanEval (coding) to the benchmark suite. These tasks are known to be sensitive to model compression in ways that perplexity does not predict.

4. Report actual compression time. Provide wall-clock compression time vs. SVD-LLM and ASVD on the same hardware. This is essential for practitioners to make a trade-off decision.

5. Test at 70B scale. Even a single experiment on Llama 3.1 70B would dramatically increase the practical relevance of the work. The authors could limit this to 0.90× retention (where the method works best) and a single benchmark suite to keep cost manageable.

6. Ablate the number of optimization steps. How does quality evolve with the number of gradient steps? A convergence plot would show whether 100 steps or 10,000 steps are needed, informing practitioners about the compression time vs. quality trade-off.

Limitations and Boundary Conditions

SigmaScale is most effective when:

  • The compression ratio is mild (0.90× retention, i.e., 10% parameter reduction per matrix).
  • The weight matrices have structured singular value spectra that can be reshaped by diagonal scaling.
  • Computational resources for compression time are available (O(n³) per step × many steps per matrix × many matrices).

It is least effective when:

  • Aggressive compression is needed (0.50× or lower).
  • Calibration data distribution differs from inference distribution.
  • The target is a large-scale model (70B+), where the O(n³) SVD per step is prohibitively expensive.

It is not a complete solution for extreme low-rank compression: at very low retention rates, the fundamental information loss cannot be overcome by any choice of scaling.

Conclusion

SigmaScale introduces a novel approach to SVD-based LLM compression: rather than analytically deriving scaling matrices from activation statistics (as ASVD and SVD-LLM do), it learns them by gradient descent under the activation-aware Frobenius loss. The key contribution is demonstrating that:

  1. Learned scaling can lower the effective rank entropy of weight matrices, making them more amenable to low-rank truncation.
  2. This entropy reduction correlates strongly with compression quality (lower $\mathcal{L}_F$).
  3. The method is competitive with state-of-the-art SVD methods in the mild-to-moderate compression regime, without requiring specialized hardware.

The work exposes an interesting research question: how much better can SVD-based compression become if the scaling pre-conditioning is optimized rather than analytically derived? SigmaScale provides a first data point, though the computational cost of the approach limits its near-term practical applicability. Future work combining approximate SVD, richer fine-tuning datasets, and larger model scales will determine whether learned scaling becomes the standard approach.

Reproduction Notes

Key implementation details:

  • Models: Llama 3.1 8B Instruct (HuggingFace meta-llama/Llama-3.1-8B-Instruct) and Qwen3-8B (Qwen/Qwen3-8B)
  • Calibration: 32 samples × 2048 tokens from Wikitext-2 training split
  • Eval perplexity: Wikitext-2 test split (141 samples × 2048 tokens)
  • Zero-shot eval: lm-evaluation-harness framework
  • Fine-tuning data: Alpaca (52K samples); authors also created a custom Alpaca variant based on Llama 3.1-8B output distribution (see Appendix G in the paper)
  • Baselines: SVD-LLM and ASVD+ with unified hyperparameters for fair comparison
  • Codebase: Available (linked in Appendix G of the paper)
  • Compute: Described in Appendix C (not fully disclosed in main text)

Potential pitfalls:

  • The gradient computation through SVD requires careful handling of repeated singular values (Taylor approximation).
  • The optimal number of optimization steps is not stated explicitly in the main text.
  • The Alpaca dataset used for fine-tuning may introduce instruction-following distribution shift; testing with more diverse fine-tuning data is recommended before deploying.

Quick sanity check for reproduction: at 0.90× retention on Llama 3.1 8B Instruct, SigmaScale should substantially lower perplexity vs. vanilla truncated SVD and modestly improve over SVD-LLM, while recovering BoolQ/PIQA/ARC accuracy close to the uncompressed baseline.

Deep Dive: Mathematical Relationships Between Scaling and Compression Quality

The Weighted Low-Rank Approximation Perspective

To understand why scaling helps, it is instructive to derive the optimal low-rank approximation under a weighted Frobenius norm.

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ and symmetric positive definite matrices $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{n \times n}$, define the $(A, B)$-weighted Frobenius norm:

$$\|M\|_{A, B}^2 = \text{tr}(A M B M^T) = \|A^{1/2} M B^{1/2}\|_F^2$$

The best rank-$k$ approximation of $W$ under this metric is:

$$W^* = A^{-1/2} \left( \sum_{i=1}^{k} u_i \sigma_i v_i^T \right) B^{-1/2}$$

where $u_i, \sigma_i, v_i$ are the singular triplets of $A^{1/2} W B^{1/2}$.

SigmaScale’s design in this framework: By setting $A = S_r^2 = \text{diag}(\exp(2d_r))$ and $B = S_c^2 = \text{diag}(\exp(2d_c))$ (so $A^{1/2} = S_r$, $B^{1/2} = S_c$), the problem reduces exactly to the SigmaScale formulation:

$$W' = S_r^{-1} f_{\text{svd}}^{(k)}(S_r W S_c) S_c^{-1}$$

This confirms that SigmaScale is finding the best rank-$k$ approximation of $W$ in the metric defined by the learned scaling matrices. Optimizing the scaling parameters $d_r, d_c$ is equivalent to searching for the best weighted norm under which rank-$k$ truncation incurs minimum activation-based loss.
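The weighted-approximation claim can be checked numerically. The sketch below implements the scale, truncate, unscale formula in NumPy and compares it against plain truncated SVD under the same weighted norm. The helper names are hypothetical, and the log-scale vectors `d_r` and `d_c` are random placeholders rather than learned values.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M in the plain Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def scaled_truncation(W, d_r, d_c, k):
    """W' = S_r^{-1} f_svd^{(k)}(S_r W S_c) S_c^{-1} with S = diag(exp(d))."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    scaled = s_r[:, None] * W * s_c[None, :]          # S_r W S_c
    return truncate(scaled, k) / s_r[:, None] / s_c[None, :]

def weighted_sq_error(D, d_r, d_c):
    """||S_r D S_c||_F^2, the (S_r^2, S_c^2)-weighted squared norm of D."""
    return np.linalg.norm(np.exp(d_r)[:, None] * D * np.exp(d_c)[None, :]) ** 2

rng = np.random.default_rng(0)
m, n, k = 40, 30, 5
W = rng.standard_normal((m, n))
d_r = 0.5 * rng.standard_normal(m)   # placeholder values, not learned
d_c = 0.5 * rng.standard_normal(n)

err_scaled = weighted_sq_error(W - scaled_truncation(W, d_r, d_c, k), d_r, d_c)
err_plain = weighted_sq_error(W - truncate(W, k), d_r, d_c)
print(f"weighted error, scaled truncation: {err_scaled:.3f}")
print(f"weighted error, plain truncation : {err_plain:.3f}")
```

By the Eckart–Young theorem applied to $S_r W S_c$, the scaled truncation can never be worse than plain truncation in this weighted norm. What SigmaScale adds on top is the search over `d_r` and `d_c` so that the weighted norm tracks the activation-based loss.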

Connection to the Activation Covariance Matrix

Let $X \in \mathbb{R}^{n \times s}$ be the calibration activation matrix. The activation-aware loss can be written as:

$$\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2 = \frac{1}{mn} \|(W - W')X\|_F^2$$

If we define the empirical activation covariance $C = XX^T \in \mathbb{R}^{n \times n}$ (positive semi-definite), then:

$$\|(W - W')X\|_F^2 = \text{tr}\left((W - W')^T (W - W') C\right) = \|W - W'\|_C^2$$

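where $\|\cdot\|_C$ is the $C$-weighted Frobenius norm: each row of $W - W'$ is measured in the metric induced by $C$, so the weighting acts on the column (input) dimension.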

SVD-LLM directly uses the Cholesky factor $S_c$ of $C$ (so $S_c S_c^T = C$) as the column scaling, which yields the best rank-$k$ approximation under exactly this column-weighted norm. This is theoretically motivated: SVD-LLM minimizes $\|WX - W'X\|_F^2$ over the choice of the best factored form that is expressible via column scaling.

SigmaScale additionally introduces row scaling $S_r$, which is not captured by column-covariance weighting alone. The row scaling allows the method to also reweight output directions, which is useful when the output distribution has structured asymmetries that simple column weighting misses.
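The covariance identity and the Cholesky-based column scaling can both be verified on synthetic data. The sketch below is a simplified illustration of the whitening step only, not SVD-LLM's full pipeline; the activation matrix is random with deliberately uneven channel scales, and the $\frac{1}{mn}$ normalization is dropped because it does not affect the comparison.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M in the plain Frobenius norm."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sv[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
m, n, n_samples, k = 32, 24, 200, 8
W = rng.standard_normal((m, n))
# Synthetic calibration activations with uneven channel scales ("outliers").
X = rng.standard_normal((n, n_samples)) * np.logspace(0, 2, n)[:, None]
C = X @ X.T                                   # empirical activation covariance

def act_loss(W_hat):
    """Unnormalized activation-aware loss ||(W - W_hat) X||_F^2."""
    return np.linalg.norm((W - W_hat) @ X) ** 2

# 1) Trace identity: ||D X||_F^2 == tr(D^T D C).
W_plain = truncate(W, k)
D = W - W_plain
lhs = np.linalg.norm(D @ X) ** 2
rhs = np.trace(D.T @ D @ C)

# 2) Cholesky whitening: W' = f_svd(W S) S^{-1} with S S^T = C.
S = np.linalg.cholesky(C)                     # lower triangular
W_white = np.linalg.solve(S.T, truncate(W @ S, k).T).T

loss_plain, loss_white = act_loss(W_plain), act_loss(W_white)
print(f"identity gap       : {abs(lhs - rhs) / lhs:.2e}")
print(f"loss, plain SVD    : {loss_plain:.3e}")
print(f"loss, whitened SVD : {loss_white:.3e}")
```

Because the whitened solution is optimal for this loss among rank-$k$ matrices, any further gain from learned scaling has to come from somewhere else, for example from row scaling interacting with later fine-tuning or from a loss that is not purely this per-matrix objective.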

Why Row Scaling Matters

Consider an LLM’s attention output projection $W_O \in \mathbb{R}^{d \times d}$. The input activations to $W_O$ are the concatenated attention head outputs, and the output is added to the residual stream. The residual stream has its own distribution: certain output dimensions may be much more “important” (strongly coupled to downstream computation) than others.

Column scaling $S_c$ accounts for the input activation distribution. Row scaling $S_r$ can account for the output importance, essentially weighting reconstruction error more heavily for high-importance output dimensions. Pure column-covariance methods (SVD-LLM) do not have this degree of freedom.

This theoretical argument predicts that the benefit of learned row scaling should be larger for weight matrices whose row importance is heterogeneous and not well-correlated with column activation magnitudes — and indeed the paper shows improvement in the mild compression regime where these subtle asymmetries matter.

Practical Deployment Considerations

Memory and Inference Cost

For a layer with weight $W \in \mathbb{R}^{m \times n}$ compressed to rank $k$:

| Quantity | Formula | Example ($m=n=4096$, $k=0.9 \times 2048$) |
| --- | --- | --- |
| Original parameters | $mn$ | 16.8M |
| Compressed parameters | $k(m+n)$ | $\approx 15.1$M |
| Parameter reduction | $(1 - k(m+n)/mn) \times 100\%$ | $\approx 10\%$ |
| Original MACs (batch 1) | $mn$ | 16.8M MACs |
| Compressed MACs (batch 1) | $k(m+n)$ | $\approx 15.1$M MACs |
| Memory bandwidth saved | Same ratio as parameters | $\approx 10\%$ |

At 0.90× retention, the savings are modest in absolute terms: roughly 10% parameter reduction per compressed matrix. Since the model also has uncompressed components (embeddings, normalization layers, output head), the model-level parameter reduction is less than 10%.

For 0.50× retention, the savings are substantial: $k(m+n) = 0.5\,mn$, i.e., 50% of the parameters per matrix. But as SigmaScale shows, quality degrades sharply in this regime.
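The arithmetic behind these figures is easy to reproduce. The helper below (a hypothetical name, not from the paper) picks the largest rank $k$ that satisfies $k(m+n) \le \text{retention} \cdot mn$ and reports the resulting parameter counts for the three retention levels discussed in this review.

```python
def rank_for_retention(m: int, n: int, retention: float) -> int:
    """Largest rank k such that k * (m + n) <= retention * m * n."""
    return int(retention * m * n // (m + n))

m = n = 4096
original = m * n

for retention in (0.90, 0.75, 0.50):
    k = rank_for_retention(m, n, retention)
    compressed = k * (m + n)
    reduction = 100.0 * (1.0 - compressed / original)
    print(
        f"retention {retention:.2f}: k={k:4d}, "
        f"params {original / 1e6:.1f}M -> {compressed / 1e6:.1f}M, "
        f"reduction {reduction:.1f}%"
    )
```

The same counts apply to multiply-accumulates at batch size 1, since applying the two factors costs one MAC per stored parameter.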

Hardware Considerations for Inference

Low-rank matrix products $Wx \approx L(Rx)$ introduce a sequential dependency ($Rx$ must finish before $L(Rx)$ can start). For small batch sizes (latency-critical serving), this can actually hurt throughput, because the two smaller, dependent multiplications may not fully saturate the GPU's SIMD units along the small rank dimension.

For large batches (throughput-critical serving), the $k(m+n)$ vs. $mn$ FLOP reduction translates more directly to speedup, since tensor cores can efficiently handle both steps.

Rule of thumb: SVD low-rank compression benefits throughput-heavy serving (batch sizes ≥ 32) more than latency-sensitive serving (batch sizes = 1 or small). This is a consideration when deciding whether to use SigmaScale vs. quantization for a given deployment scenario.
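The two-step structure is easiest to see in code. The NumPy sketch below factors a random matrix with truncated SVD, applies it as two dependent multiplications, and counts MACs for both forms. Splitting $\sqrt{\sigma_i}$ evenly between the factors is one common convention and is an assumption here, as are the toy dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, batch = 64, 48, 8, 4
W = rng.standard_normal((m, n))

# Rank-k factors W_k = L @ R, with sqrt(singular values) split across both.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
L = U[:, :k] * np.sqrt(s[:k])             # shape (m, k)
R = np.sqrt(s[:k])[:, None] * Vt[:k]      # shape (k, n)

X = rng.standard_normal((n, batch))

H = R @ X    # step 1: project inputs down to the rank-k space
Y = L @ H    # step 2: cannot start until H is available

dense_macs = m * n * batch
lowrank_macs = k * (m + n) * batch
print(f"dense MACs   : {dense_macs}")
print(f"low-rank MACs: {lowrank_macs}")
```

The MAC count drops, but the work is now two kernels with a data dependency and an inner dimension of only `k`, which is why measured latency can lag behind the FLOP reduction at small batch sizes.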

Stacking with Quantization

The compressed matrices $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{k \times n}$ can in principle be quantized independently after compression. However:

  1. The factor matrices $L$ and $R$ have different value distributions than the original weight $W$.
  2. The error from quantization stacks with the truncation error from SVD.
  3. The post-compression fine-tuning (SFT or KD) is done on FP16 factors; quantizing after fine-tuning is one option; quantization-aware fine-tuning of the low-rank factors is another.

SigmaScale does not report any quantization experiments, leaving this as an open direction.
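As a hedged illustration of how the two error sources combine, the sketch below applies a simple symmetric per-row INT8 scheme to both factors of a plain truncated SVD. The quantizer is a generic textbook scheme chosen for brevity, not something the paper evaluates, and the matrix is random rather than a real LLM weight.

```python
import numpy as np

def quantize_int8(M):
    """Symmetric per-row INT8 quantization; returns integer codes and scales."""
    scale = np.abs(M).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero rows
    q = np.clip(np.round(M / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8
W = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
L = U[:, :k] * np.sqrt(s[:k])
R = np.sqrt(s[:k])[:, None] * Vt[:k]

qL, sL = quantize_int8(L)
qR, sR = quantize_int8(R)
L_hat, R_hat = dequantize(qL, sL), dequantize(qR, sR)

trunc_err = np.linalg.norm(W - L @ R)           # truncation only
total_err = np.linalg.norm(W - L_hat @ R_hat)   # truncation + quantization
print(f"truncation error          : {trunc_err:.4f}")
print(f"truncation + quantization : {total_err:.4f}")
```

In plain Frobenius norm the combined error cannot fall below the truncation error, since the dequantized product still has rank at most $k$. Whether the increase matters for perplexity is exactly what a quantization experiment would need to measure.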

Historical Context: The Evolution of SVD-Based LLM Compression

Understanding where SigmaScale fits requires a brief historical arc:

Phase 1 — Naive SVD (2021-2022): Direct truncated SVD on weight matrices. Very fast to compress, but perplexity loss is unacceptably high. Root cause: ignored activation outliers.

Phase 2 — Activation-Aware Scaling (2023): ASVD introduced column scaling based on activation magnitudes. First to demonstrate competitive quality on 7B models. Simple and efficient but uses a rough proxy (L1 magnitude) rather than full covariance.

Phase 3 — Covariance-Based Scaling (2024): SVD-LLM uses Cholesky decomposition of activation covariance for provably optimal column scaling. Adds sequential layer-by-layer weight update to propagate compression error corrections. State-of-the-art at the time.

Phase 4 — Learned Scaling (2026, SigmaScale): Directly optimizes scaling parameters under the compression loss. Adds row scaling as a new degree of freedom. Competitive in mild regime, not a full solution for aggressive compression. Computational cost higher.

What’s next? The natural extensions are: (1) learned non-diagonal transformations (full rotations, as in QuaRot/QuIP for quantization); (2) joint optimization across layers (SigmaScale optimizes each matrix independently); (3) integration with LoRA fine-tuning post-deployment.

Reflection: What Makes This Paper Worth Reading?

SigmaScale is a clean, well-motivated paper that makes a targeted contribution: demonstrating that learned scaling beats analytical scaling for SVD compression in the mild regime, and providing mechanistic evidence via the effective rank entropy correlation.

What it does well:

  • Clear hypothesis (learn vs. derive scaling)
  • Mechanistic analysis (effective rank entropy correlation)
  • Honest about limitations (aggressive compression fails, O(n³) cost, narrow eval)
  • Two models tested (Llama 3.1 + Qwen3)
  • SFT vs. KD comparison (even if the negative KD result isn’t fully explained)

What I’d want to see in a follow-up:

  • Randomized SVD for scalability
  • Calibration data ablation (the most obviously missing experiment)
  • 70B scale validation
  • Latency measurements
  • Integration with quantization

For researchers working on efficient LLM deployment, SigmaScale is a useful reference for the proposition that “activation-aware diagonal pre-conditioning + learned optimization can outperform covariance-based analytics” — and the effective rank entropy metric is a potentially reusable diagnostic tool for other compression methods.

Glossary of Key Terms

| Term | Definition |
| --- | --- |
| Truncated SVD | Keeping only the top $k$ singular triplets of the SVD; optimal rank-$k$ approximation under Frobenius norm (Eckart–Young theorem) |
| Low-rank factorization | Representing weight matrix $W$ as product $LR$ of two thin matrices, reducing storage and FLOPs |
| Activation outliers | Input channels with abnormally large activation magnitudes relative to others; cause naïve SVD to misallocate rank |
| Scaling matrix | Diagonal matrix applied to pre-condition a weight matrix before SVD; shifts the effective metric for rank-$k$ approximation |
| Activation-aware loss | Frobenius reconstruction error on actual calibration activations $X$: $\lVert (W - W')X \rVert_F^2$; contrasted with weight-space Frobenius norm |
| Effective rank entropy | Entropy of the normalized singular value distribution; low entropy = concentrated spectrum = easier to compress |
| Knowledge distillation (KD) | Minimizing KL divergence between a compressed student and uncompressed teacher’s output logits; used to recover post-compression performance |
| Sensitivity probing | Measuring how much each layer’s perplexity rises under compression at various ratios; drives per-layer rank allocation |
| Binary search (ASVD) | Efficient algorithm to find globally optimal rank allocation satisfying a total parameter budget |
| Retention ratio | Fraction of original parameters kept per matrix after low-rank approximation (0.90 = keep 90%) |
| ASVD | Activation-aware SVD: column scaling from activation magnitudes (Yuan et al., 2023) |
| SVD-LLM | Column scaling from Cholesky decomposition of activation covariance (Wang et al., 2024) |
| SigmaScale | This paper: learned row+column diagonal scaling via gradient descent on activation-aware loss |
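The effective rank entropy entry can be made concrete with a few lines of NumPy. The sketch assumes the simplest reading of the glossary definition: Shannon entropy (natural log) of the singular values normalized to sum to one. The paper's exact normalization may differ, so treat the function as a diagnostic approximation rather than a reproduction of the authors' metric.

```python
import numpy as np

def spectral_entropy(M):
    """Shannon entropy of the singular values of M, normalized to sum to 1.

    ASSUMPTION: plain singular values and natural log; the paper may use a
    different normalization.
    """
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                       # drop exact zeros so log is defined
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
flat = np.eye(16)                                    # all singular values equal
rank_one = np.outer(rng.standard_normal(16), rng.standard_normal(16))

print(f"flat spectrum     : {spectral_entropy(flat):.4f}")
print(f"rank-one spectrum : {spectral_entropy(rank_one):.4f}")
```

A flat spectrum attains the maximum $\log n$, while a rank-one matrix scores close to zero, matching the reading that lower entropy means a spectrum that truncation can capture with fewer components.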