June 26, 2026 EN #SVD & Low-Rank #Model Compression #LLM Inference

SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression

Review date: 2026-06-26
Review author: Zhongzhu Zhou
Paper reviewed: SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
Paper authors: Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini
arXiv: 2606.07098
Status/Venue: arXiv preprint, June 2026

Short Answer

SigmaScale learns per-weight-matrix row and column scaling vectors that reshape the singular-value spectrum before truncated SVD compression, reducing the effective intrinsic rank and cutting activation-based reconstruction loss — making it competitive with the best SVD methods in the mild-to-moderate compression regime without requiring any specialized hardware.

Prerequisites: What You Need to Know Before Diving In

Before we get into SigmaScale itself, let me lay out the core concepts you need to follow the technical content. If you’ve worked with matrix factorization before, feel free to skim; if not, read this section carefully because everything else builds on it.

What Is Singular Value Decomposition (SVD)?

SVD is a fundamental matrix factorization theorem. For any matrix $W \in \mathbb{R}^{m \times n}$ , SVD factorizes it into three matrices:

W = U \Sigma V^T

where:

$U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left singular vectors
$\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ sorted in descending order
$V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right singular vectors

Think of the singular values as measuring “how important” each component direction is. Large singular values correspond to directions in which the matrix has large action; small singular values correspond to nearly-null directions.

Another useful way to see SVD: you can write the full matrix as a sum of rank-1 outer products:

W = \sum_{i=1}^{\min(m,n)} u_i \sigma_i v_i^T

where $u_i$ is the $i$ -th column of $U$ and $v_i$ is the $i$ -th column of $V$ .

Truncated SVD and the Eckart–Young–Mirsky Theorem

The key theorem driving nearly all low-rank compression work is the Eckart–Young–Mirsky theorem (1936/1960):

Theorem (Eckart–Young–Mirsky): Among all rank- $k$ matrices $W'$ , the one that minimizes the Frobenius norm $\|W - W'\|_F$ is given by the truncated SVD:

W^{(k)} = \sum_{i=1}^{k} u_i \sigma_i v_i^T = U_k \Sigma_k V_k^T

where $U_k, V_k$ keep only the top $k$ columns and $\Sigma_k$ keeps only the top $k$ singular values.

Intuition: Because singular values are sorted in descending order, keeping the top $k$ retains the “most important” $k$ directions and discards the weakest ones. The error of this approximation is:

\|W - W^{(k)}\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_{\min(m,n)}^2}

This is optimal — no other rank- $k$ matrix is closer to $W$ in Frobenius norm.

Memory savings: instead of storing $m \times n$ parameters, you store $U_k$ ( $m \times k$ ), $\Sigma_k$ ( $k$ ), and $V_k^T$ ( $k \times n$ ), a total of $k(m+n+1)$ parameters vs. $mn$ . In practice $\Sigma_k$ is folded into the two factors, leaving $k(m+n)$ parameters, so the compression ratio is $k(m+n)/mn$ . Memory is only saved when $k < mn/(m+n)$ ; for large matrices and small $k$ , the saving is substantial.
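To make the theorem and the parameter count concrete, here is a minimal NumPy sketch on a random matrix. The sizes and the helper name `truncated_svd` are illustrative choices, not taken from the paper.

```python
import numpy as np

def truncated_svd(W, k):
    """Return the rank-k truncated SVD factors (U_k, s_k, Vt_k) of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8
W = rng.standard_normal((m, n))

U_k, s_k, Vt_k = truncated_svd(W, k)
W_k = (U_k * s_k) @ Vt_k  # U_k diag(s_k) V_k^T

# Eckart-Young-Mirsky: the Frobenius error equals the norm of the discarded tail.
all_s = np.linalg.svd(W, compute_uv=False)
err = np.linalg.norm(W - W_k, "fro")
tail = np.sqrt(np.sum(all_s[k:] ** 2))

# Parameter counts with Sigma_k folded into the two factors.
params_full = m * n
params_lowrank = k * (m + n)
print(err, tail, params_lowrank / params_full)
```

The two printed error values should agree to floating-point precision, and the last number is the retained parameter fraction $k(m+n)/mn$ .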

Why doesn’t vanilla SVD work well for LLMs? The Eckart–Young theorem minimizes $\|W - W'\|_F$ , but what we really care about is whether the model produces the same outputs on real data. The Frobenius norm treats all weight entries equally, but in practice some directions matter enormously (because they amplify large activations) while others are nearly irrelevant. This is the root cause motivating activation-aware methods.

Low-Rank Representation at Inference Time

Once you have $W' = U_k \Sigma_k V_k^T = L R$ where:

L = U_k \sqrt{\Sigma_k} \in \mathbb{R}^{m \times k}, \quad R = \sqrt{\Sigma_k} V_k^T \in \mathbb{R}^{k \times n}

a forward pass becomes:

Wx \approx W'x = L(Rx)

You compute $Rx$ first ( $k \times n$ multiplied by $n \times 1$ = $k$ -dim vector, cost $kn$ ), then $L \cdot (Rx)$ ( $m \times k$ times $k$ -dim vector, cost $mk$ ). Total cost: $k(m+n)$ vs. the original $mn$ . For $k \ll \min(m,n)$ this is a substantial speedup that works on any hardware — no special kernel or quantized data type needed.
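The factored forward pass can be sketched in a few lines of NumPy. This is a toy illustration with made-up sizes; the multiply-add counts are the idealized $mn$ and $k(m+n)$ figures from the text, not measured latency.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
sqrt_s = np.sqrt(s[:k])
L = U[:, :k] * sqrt_s            # m x k, equals U_k sqrt(Sigma_k)
R = sqrt_s[:, None] * Vt[:k, :]  # k x n, equals sqrt(Sigma_k) V_k^T

y_factored = L @ (R @ x)         # two thin products: k*n + m*k multiply-adds
y_dense = (L @ R) @ x            # same result via the dense rank-k matrix

macs_dense = m * n
macs_factored = k * (m + n)
```

The order of operations matters: forming `L @ R` first would rebuild the dense $m \times n$ matrix and throw away the saving.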

Activation-Aware Compression Loss

Instead of the Frobenius norm on weights, we want to minimize reconstruction error on actual activations. For a calibration dataset with input activations $X \in \mathbb{R}^{n \times s}$ ( $s$ samples), the activation-aware Frobenius loss is:

\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2

This shifts focus from weight structure to functional equivalence: two weight matrices that compute similar outputs on typical inputs are “the same” from a compression standpoint, even if they differ entry-wise.
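A tiny worked example shows why the weight-space norm and the activation-aware loss can disagree. The matrices below are invented for illustration: two approximations have the same entry-wise error, but one error sits on an input channel with outlier activations.

```python
import numpy as np

def activation_loss(W, W_approx, X):
    """L_F = ||W X - W_approx X||_F^2 / (m n), as in the formula above."""
    m, n = W.shape
    return np.linalg.norm((W - W_approx) @ X, "fro") ** 2 / (m * n)

# Toy setup: input channel 0 carries outlier activations, channel 1 is quiet.
X = np.array([[10.0, -10.0, 10.0],
              [0.1, 0.1, -0.1]])   # n = 2 channels, s = 3 samples
W = np.array([[1.0, 1.0],
              [1.0, -1.0]])

# Two approximations with the same weight-space error (one entry off by 0.5).
W_err_loud = W.copy()
W_err_loud[0, 0] += 0.5            # error on the outlier channel
W_err_quiet = W.copy()
W_err_quiet[0, 1] += 0.5           # error on the quiet channel

loss_loud = activation_loss(W, W_err_loud, X)
loss_quiet = activation_loss(W, W_err_quiet, X)
```

Both approximations are 0.5 away from $W$ in Frobenius norm, yet `loss_loud` is 18.75 and `loss_quiet` is 0.001875, a factor of 10,000 apart.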

Effective Rank Entropy

The effective rank entropy of a matrix’s singular value spectrum is a soft measure of how many singular values carry meaningful information. For a diagonal matrix $\Sigma$ with non-negative entries, define the normalized probabilities $p_i = \sigma_i / \sum_j \sigma_j$ . The effective-rank entropy is:

H(\Sigma) = -\sum_i p_i \log p_i

Low entropy means the spectrum is concentrated (a few large singular values dominate, others are tiny) — effectively low rank. High entropy means the singular values are spread out (many directions matter equally). When a compression method can lower the effective rank entropy of the scaled weight matrix, it means the spectrum becomes more concentrated after the linear transformation, and truncated SVD can capture a larger fraction of the information with fewer rank- $k$ components.
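The definition translates directly into code. The sketch below uses the natural logarithm and treats $0 \log 0$ as 0; both are conventions I am assuming here, since the review does not state which the paper uses.

```python
import numpy as np

def effective_rank_entropy(singular_values):
    """Shannon entropy of the normalized singular value spectrum."""
    s = np.asarray(singular_values, dtype=float)
    p = s / s.sum()
    p = p[p > 0]              # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())

flat = effective_rank_entropy([1.0, 1.0, 1.0, 1.0])      # all directions equal
peaked = effective_rank_entropy([10.0, 1.0, 1.0, 1.0])   # one dominant direction
rank_one = effective_rank_entropy([5.0, 0.0, 0.0, 0.0])  # exactly rank 1
```

The flat spectrum gives $\log 4 \approx 1.386$ , the peaked one about 0.79, and the rank-1 spectrum gives 0. Exponentiating the entropy yields an "effective number of directions", which is 4 for the flat case.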

Prior Art: ASVD and SVD-LLM

Before SigmaScale, two dominant approaches solved the “activation outlier” problem:

ASVD (Yuan et al., 2023): Instead of minimizing $\|W - W'\|_F$ , ASVD absorbs the activation statistics into the weight matrix before SVD. Specifically, it computes a diagonal scaling matrix $S$ analytically from per-channel activation magnitudes on the calibration data, decomposes the scaled matrix $WS$ by truncated SVD, and folds $S^{-1}$ back into the right factor. The idea: if certain input channels have very large activation magnitudes, scaling the corresponding columns of $W$ up before SVD forces the decomposition to “pay attention” to those directions.

SVD-LLM (Wang et al., 2024): Computes the scaling matrix $S$ via Cholesky decomposition of the activation covariance matrix $\text{Cov}(X) = X X^T = S S^T$ . The inverse of the Cholesky factor whitens the activations ( $S^{-1}X$ has identity covariance), and the truncated SVD of $WS$ is then optimal in the whitened (activation-covariance-normalized) metric. This gives a principled analytical solution, and SVD-LLM further combines this with a sequential layer-by-layer update scheme.

Both methods analytically derive the scaling from calibration statistics. SigmaScale’s key idea: why not learn the scaling matrices by gradient descent instead? This offers more flexibility to adapt to per-layer weight structure, at the cost of requiring an optimization loop.
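The Cholesky-whitening idea can be sketched in NumPy as follows. This is a toy reconstruction from the description above, not SVD-LLM's actual implementation; the sizes and synthetic activations are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, k = 16, 12, 200, 4
W = rng.standard_normal((m, n))
# Synthetic calibration activations with very different per-channel magnitudes.
X = rng.standard_normal((n, s)) * np.logspace(-1, 1, n)[:, None]

S = np.linalg.cholesky(X @ X.T)                  # lower-triangular, S S^T = X X^T

U, sig, Vt = np.linalg.svd(W @ S, full_matrices=False)
WS_k = (U[:, :k] * sig[:k]) @ Vt[:k, :]          # rank-k truncation of W S
W_whitened = np.linalg.solve(S.T, WS_k.T).T      # WS_k @ S^{-1} without forming S^{-1}

# Plain truncated SVD of W for comparison.
U0, sig0, Vt0 = np.linalg.svd(W, full_matrices=False)
W_plain = (U0[:, :k] * sig0[:k]) @ Vt0[:k, :]

def output_error(W_approx):
    """Frobenius norm of the output difference on the calibration activations."""
    return np.linalg.norm((W - W_approx) @ X, "fro")

err_whitened = output_error(W_whitened)
err_plain = output_error(W_plain)
tail = np.sqrt(np.sum(sig[k:] ** 2))             # discarded singular values of W S
```

Because $XX^T = SS^T$ , the output error of the whitened approximation equals the discarded tail of the singular values of $WS$ , and no other rank- $k$ matrix (including the plain truncation) can have a smaller output error on this calibration set.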

Introduction: The Problem SigmaScale Solves

Large language models have grown rapidly to tens and hundreds of billions of parameters (Llama, DeepSeek, Qwen, GPT-4, etc.). While their performance scales with parameter count, so does the deployment cost: GPU memory, inference latency, and power consumption.

Low-rank decomposition via SVD is an attractive compression approach because:

It works on any hardware — no quantized data types or special kernels needed.
It can be stacked with quantization or pruning.
The compressed representation $W' = LR$ replaces every matrix multiply $Wx$ with two smaller ones $L(Rx)$ at reduced FLOPs.

But naïve SVD compression (minimize $\|W - W'\|_F$ ) performs poorly in practice because LLM activations contain outliers: certain input channels are much larger in magnitude than others, so an activation-unaware SVD allocates rank to weight directions that barely affect the output while under-serving the directions those outlier channels amplify.

Prior works (ASVD, SVD-LLM) resolve this by computing a scaling transformation $S$ analytically from activation statistics, then decomposing the scaled matrix $WS$ . Both approaches work well, but they fix $S$ before any optimization and compute it from a summary statistic (activation magnitudes, or the Cholesky factor of the activation covariance) rather than directly from the compression loss.

SigmaScale’s hypothesis: directly optimizing $S$ under the activation-aware loss $\mathcal{L}_F$ should learn a better scaling transformation — one that minimizes actual compression error rather than a proxy statistic. Specifically, it learns per-matrix row and column scaling vectors $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ via gradient descent, then uses the resulting scaling matrices $S_r = \text{diag}(\exp(d_r))$ and $S_c = \text{diag}(\exp(d_c))$ to pre-condition the weight matrix before SVD truncation.

The SigmaScale Method: Full Technical Walkthrough

Figure 1: The SigmaScale Processing Pipeline

flowchart TD
    A["Pre-trained LLM\n(Llama 3.1 8B / Qwen3-8B)"] --> B["Phase 1: Sensitivity Probing\nPer-layer perplexity at 9 compression levels"]
    B --> C["Binary Search\nGlobal rank assignment k* per layer"]
    C --> D["Phase 2: Scaling Matrix Learning\nOptimize d_r, d_c per weight matrix\nunder activation-aware loss L_F"]
    D --> E["Phase 3: Apply Scaled SVD\nW' = Sr^{-1} * f_svd(Sr*W*Sc) * Sc^{-1}"]
    E --> F["Phase 4: Post-Compression Fine-Tuning\nSFT or KD with frozen uncompressed layers"]
    F --> G["Compressed LLM\nW' = L * R  (rank-k factors)"]

The pipeline has four distinct phases, executed once per model. Let me walk through each in detail.

Phase 1: Sensitivity Probing — Finding the Right Rank Per Layer

Not all layers are equally sensitive to compression. An early attention layer might tolerate aggressive rank reduction while a crucial MLP layer in the middle of the network might degrade sharply. Sensitivity probing characterizes this per-layer tolerance.

Step-by-Step: Sensitivity Probing

1. Define a grid of compression ratios $c \in \{0.1, 0.2, \ldots, 0.9\}$ (where 0.9 means retain 90% of parameters).
2. For each layer $\ell$ and each module (Q, K, V, O projections; MLP up/down/gate projections):

a. Compute the target rank from the compression ratio:

k = c \cdot |\mathbf{W}| \cdot (m + n)^{-1}

where $|\mathbf{W}| = mn$ is the total parameter count of the weight matrix, and $m, n$ are its row and column dimensions. Rearranging: $k(m+n) = c \cdot mn$ , so $k = c \cdot mn / (m+n)$ .

b. Apply truncated SVD at rank $k$ to the isolated weight matrix.
c. Measure perplexity on the calibration set with this single weight compressed, all others intact.

3. Result: a 2D sensitivity map (compression ratio × layer) with perplexity impact for each entry.
4. Run the ASVD binary search algorithm over this map to find the optimal per-layer ranks $\{k_1^*, k_2^*, \ldots, k_L^*\}$ that meet the global compression target while minimizing total perplexity increase.
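The rank formula in step 2a is simple enough to check by hand. The helper below is a sketch; rounding down to an integer is my assumption, since the review does not say how the paper rounds.

```python
def rank_for_retention(m, n, c):
    """Largest integer rank k whose factors use at most c * m * n parameters.

    Rounding down is an assumption; the source only gives k = c * m * n / (m + n).
    """
    return int(c * m * n / (m + n))

grid = [round(0.1 * i, 1) for i in range(1, 10)]           # 0.1, 0.2, ..., 0.9
ranks_square = {c: rank_for_retention(4096, 4096, c) for c in grid}
rank_rect = rank_for_retention(4096, 14336, 0.5)            # a non-square example
```

For a $4096 \times 4096$ matrix this gives rank 1843 at $c = 0.9$ and rank 1024 at $c = 0.5$ ; the $4096 \times 14336$ example gives 1592 at $c = 0.5$ . Note that for a square matrix the retained rank is only $c \cdot n / 2$ , so even mild parameter retention discards more than half of the singular directions.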

Figure 2: Sensitivity Probing Flow for a Single Layer

flowchart LR
    subgraph "For each layer ℓ and module"
        W["Weight matrix W ∈ R^{m×n}"] --> SVD["Compute SVD: W = U Σ V^T"]
        SVD --> RANK["Compute target rank k\nfor each c in {0.1,...,0.9}"]
        RANK --> TRUNC["Truncated SVD W_k = U_k Σ_k V_k^T"]
        TRUNC --> PPL["Measure perplexity\non calibration set"]
        PPL --> MAP["Sensitivity entry:\n(layer ℓ, module, c) → Δppl"]
    end
    MAP --> BINARY["Binary Search\nFind optimal k* per layer\nunder global budget"]

Why binary search? The problem of assigning per-layer ranks under a global parameter budget is combinatorially large. Binary search over the compression ratio $c$ (treating all layers uniformly at each candidate $c$ , then perturbing) finds a good solution efficiently. ASVD introduced this technique; SigmaScale inherits it.

Why probe in isolation? Probing each layer’s sensitivity independently ignores cross-layer interactions, but it provides a good first approximation. The key insight is that layers with steeply rising perplexity curves are “sensitive” and should be given higher rank; flat curves indicate compressible layers.

Phase 2: Learning Scaling Matrices

This is the core novel contribution. For each weight matrix $W \in \mathbb{R}^{m \times n}$ , SigmaScale learns two vectors $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ that define diagonal scaling transformations.

Design Choice 1: Why Diagonal Scaling?

A full scaling matrix $S \in \mathbb{R}^{m \times m}$ would have $m^2$ parameters to optimize — far too many. Restricting to diagonal scaling (just $m + n$ parameters total for row and column) makes the optimization lightweight and avoids overfitting to the calibration set.

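Geometrically, diagonal row scaling $S_r = \text{diag}(s_1, \ldots, s_m)$ rescales each row of $W$ (each output channel) independently, and column scaling $S_c$ rescales each column (each input channel). Activation outliers live on input channels, so column scaling is what lets the SVD account for them directly, in the same spirit as ASVD. Row scaling adds a second degree of freedom: it reweights how much the error on each output channel counts in the scaled space.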

Design Choice 2: Parameterizing via Exponentiation

Rather than learning $d_r, d_c$ as the scaling values directly, SigmaScale parameterizes through the exponential:

S_r = \text{diag}(\exp(d_r)), \quad S_c = \text{diag}(\exp(d_c))

Why exp? This ensures $S_r$ and $S_c$ are always positive definite diagonal matrices regardless of the values of $d_r, d_c$ . This matters for two reasons:

The inverse $S_r^{-1} = \text{diag}(\exp(-d_r))$ always exists (no division by zero).
Positivity is a natural constraint for scaling matrices that “stretch” or “shrink” directions.

The unconstrained optimization is over $d_r \in \mathbb{R}^m$ and $d_c \in \mathbb{R}^n$ — no box constraints needed.

Initialization

The scaling vectors are initialized with small Gaussian noise scaled by the weight matrix’s standard deviation:

d_{r}, d_{c} = (0.1) \cdot \sigma_W \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where $\sigma_W$ is the empirical standard deviation of entries of $W$ . This ensures the initial scaling is close to identity (since $\exp(\text{small}) \approx 1$ ) while respecting the scale of the weight matrix. Starting near identity means the first SVD compression is essentially unscaled, and the optimization incrementally learns how to scale.

The Objective: Activation-Aware Frobenius Loss

With the scaling matrices defined, the compressed approximation of $W$ under row/column scaling is:

W' = S_r^{-1} \cdot f_{\text{svd}}^{(k)}(S_r W S_c) \cdot S_c^{-1}

where $f_{\text{svd}}^{(k)}(M)$ denotes the rank- $k$ truncated SVD of matrix $M$ .

Step-by-step breakdown of this formula:

$S_r W S_c$ : pre-condition the weight matrix by scaling rows (by $S_r$ ) and columns (by $S_c$ ). In the scaled space, the singular value spectrum more closely tracks functional importance.
$f_{\text{svd}}^{(k)}(\cdot)$ : truncate to rank $k$ in the scaled space. By Eckart–Young, this is the best rank- $k$ approximation in the scaled metric.
$S_r^{-1}(\cdot)S_c^{-1}$ : undo the scaling to get back to the original weight space. The final $W'$ is the “best rank- $k$ approximation of $W$ in the metric defined by $S_r, S_c$ .”
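The three steps above can be sketched in NumPy. The function names and the random test matrix are illustrative; the diagonal matrices are applied by broadcasting rather than built explicitly.

```python
import numpy as np

def truncate(M, k):
    """Rank-k truncated SVD of M, returned as a dense matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def scaled_svd_approx(W, d_r, d_c, k):
    """W' = S_r^{-1} f_svd^k(S_r W S_c) S_c^{-1} with S = diag(exp(d))."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    W_scaled = s_r[:, None] * W * s_c[None, :]     # S_r W S_c via broadcasting
    return truncate(W_scaled, k) / s_r[:, None] / s_c[None, :]

rng = np.random.default_rng(0)
m, n, k = 20, 15, 5
W = rng.standard_normal((m, n))
d_r = 0.5 * rng.standard_normal(m)
d_c = 0.5 * rng.standard_normal(n)

W_scaled_approx = scaled_svd_approx(W, d_r, d_c, k)
W_plain_approx = scaled_svd_approx(W, np.zeros(m), np.zeros(n), k)  # identity scaling

def scaled_error(W_approx):
    """Error measured in the metric ||S_r (W - W') S_c||_F."""
    D = np.exp(d_r)[:, None] * (W - W_approx) * np.exp(d_c)[None, :]
    return np.linalg.norm(D, "fro")
```

With zero scaling vectors the function reduces to plain truncated SVD. With non-zero vectors, the result is the better of the two in the scaled metric and the worse of the two in the plain Frobenius metric, which is exactly the trade the method is making.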

The training objective is:

\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2

Gradients flow through $W' = S_r^{-1} f_{\text{svd}}^{(k)}(S_r W S_c) S_c^{-1}$ with respect to $d_r$ and $d_c$ (via $S_r$ and $S_c$ ). The SVD is differentiable only where the singular values are distinct, and its exact gradient becomes numerically unstable when singular values are close together; the paper cites Taylor-expansion-based approximations that allow approximate gradient computation.

Why normalize by $mn$ ? Without normalization, the loss magnitude grows with matrix size, making it hard to use a single learning rate schedule across different layers. Normalizing by $mn$ gives a loss that is roughly scale-invariant.
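To show the shape of the optimization loop without an autodiff library, the sketch below swaps in central finite differences for backpropagation through the SVD and adds a simple backtracking rule. Both substitutions are mine, as are the toy sizes, step count, and learning rate; the paper's optimizer and hyperparameters are not reproduced here.

```python
import numpy as np

def loss_fn(params, W, X, k):
    """Activation-aware loss L_F for scaling parameters params = [d_r, d_c]."""
    m, n = W.shape
    s_r, s_c = np.exp(params[:m]), np.exp(params[m:])
    U, s, Vt = np.linalg.svd(s_r[:, None] * W * s_c[None, :], full_matrices=False)
    W_approx = ((U[:, :k] * s[:k]) @ Vt[:k, :]) / s_r[:, None] / s_c[None, :]
    return np.linalg.norm((W - W_approx) @ X, "fro") ** 2 / (m * n)

def fd_grad(params, W, X, k, eps=1e-5):
    """Central finite differences: a slow stand-in for backprop through the SVD."""
    g = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        g[i] = (loss_fn(params + step, W, X, k)
                - loss_fn(params - step, W, X, k)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
m, n, s, k = 8, 6, 64, 2
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, s)) * np.logspace(-1, 1, n)[:, None]  # outlier channels

params = 0.1 * W.std() * rng.standard_normal(m + n)   # near-identity initialization
initial_loss = loss_fn(params, W, X, k)

lr = 0.1
for _ in range(100):
    candidate = params - lr * fd_grad(params, W, X, k)
    if loss_fn(candidate, W, X, k) < loss_fn(params, W, X, k):
        params = candidate          # accept the step
    else:
        lr *= 0.5                   # backtrack: shrink the step size

final_loss = loss_fn(params, W, X, k)
```

Because a step is only accepted when it lowers the loss, `final_loss` cannot exceed `initial_loss`. Every loss evaluation performs one SVD, which is the cost the review returns to later.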

Figure 3: Scaling + SVD Data Flow for a Single Weight Matrix

flowchart LR
    subgraph inputs
        W["W ∈ R^{m×n}\noriginal weight"]
        X["X ∈ R^{n×s}\ncalibration activations"]
        dr["d_r ∈ R^m\nrow scale vector"]
        dc["d_c ∈ R^n\ncol scale vector"]
    end

    subgraph scaling
        Sr["Sr = diag(exp(d_r))\nRow scaling (m×m diag)"]
        Sc["Sc = diag(exp(d_c))\nCol scaling (n×n diag)"]
    end

    subgraph svd_compress
        SW["Ŵ = Sr · W · Sc\nScaled weight (m×n)"]
        TSVD["f_svd^k(Ŵ) = Uk Σk Vk^T\nRank-k truncated SVD"]
        Wprime["W' = Sr^{-1} Uk Σk Vk^T Sc^{-1}\nUnscaled compressed weight (m×n)"]
    end

    subgraph loss
        diff["WX - W'X (output diff)"]
        LF["L_F = (1/mn) ||WX - W'X||_F^2"]
    end

    dr --> Sr
    dc --> Sc
    W --> SW
    Sr --> SW
    Sc --> SW
    SW --> TSVD
    TSVD --> Wprime
    Sr --> Wprime
    Sc --> Wprime
    W --> diff
    X --> diff
    Wprime --> diff
    diff --> LF
    LF -->|"backprop through Sr, Sc"| dr
    LF -->|"backprop through Sr, Sc"| dc

Phase 3: Final Compressed Weight Extraction

After learning $d_r$ and $d_c$ , the final low-rank factors are extracted as:

L = S_r^{-1} U_k \sqrt{\Sigma_k} \in \mathbb{R}^{m \times k}, \quad R = \sqrt{\Sigma_k} V_k^T S_c^{-1} \in \mathbb{R}^{k \times n}

so that $W' = LR$ exactly.

Why split $\Sigma_k$ as $\sqrt{\Sigma_k}$ between $L$ and $R$ ? This is a symmetric factorization that balances the magnitude of the two factors, helping numerical stability during post-compression fine-tuning. Alternatives (absorbing all of $\Sigma_k$ into $L$ or $R$ ) are also valid but create imbalanced scales.

What is stored? Instead of $W \in \mathbb{R}^{m \times n}$ ( $mn$ parameters), we store $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{k \times n}$ , totalling $k(m+n)$ parameters. Take 0.9x retention on a square $4096 \times 4096$ matrix, the shape of the Q and O projections in Llama 3.1 8B (its MLP matrices are $4096 \times 14336$ ). The rank is $k \approx 0.9 \times 4096 \times 4096 / (4096+4096) = 0.9 \times 2048 \approx 1843$ , and the storage ratio is about $1843 \times 2 \times 4096 / (4096^2) = 0.90$ , consistent with a 10% parameter reduction per matrix.
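A minimal sketch of the factor extraction, assuming the scaling vectors have already been learned (random stand-ins are used below). The function name `extract_factors` is hypothetical.

```python
import numpy as np

def extract_factors(W, d_r, d_c, k):
    """Return L (m x k) and R (k x n) with L @ R = S_r^{-1} f_svd^k(S_r W S_c) S_c^{-1}."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    U, s, Vt = np.linalg.svd(s_r[:, None] * W * s_c[None, :], full_matrices=False)
    sqrt_s = np.sqrt(s[:k])
    L = (U[:, :k] * sqrt_s) / s_r[:, None]            # S_r^{-1} U_k sqrt(Sigma_k)
    R = (sqrt_s[:, None] * Vt[:k, :]) / s_c[None, :]  # sqrt(Sigma_k) V_k^T S_c^{-1}
    return L, R

rng = np.random.default_rng(0)
m, n, k = 32, 24, 6
W = rng.standard_normal((m, n))
d_r = 0.3 * rng.standard_normal(m)   # stand-ins for learned row scaling
d_c = 0.3 * rng.standard_normal(n)   # stand-ins for learned column scaling

L, R = extract_factors(W, d_r, d_c, k)
stored_params = L.size + R.size      # k * (m + n)
```

The scaling matrices are absorbed into `L` and `R`, so nothing extra is stored or applied at inference time: the compressed layer is just two dense factors.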

Phase 4: Post-Compression Fine-Tuning

After replacing all weight matrices with their low-rank approximations, the model needs to be fine-tuned to recover performance. SigmaScale compares two strategies:

Supervised Fine-Tuning (SFT): optimize the compressed weights on an instruction-following dataset (Alpaca in this case). Non-compressed weights (layer norms, embeddings, LM head) are frozen; only the low-rank factor weights are updated.

Knowledge Distillation (KD): use the uncompressed teacher model to provide soft targets, minimizing KL-divergence between teacher and compressed student output distributions. The rationale: multi-step post-training (RLHF, instruction tuning) shaped the original model’s output distribution in ways that may not be captured by a simple supervised dataset. KD re-anchors the student to the teacher’s behavior.
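The two recovery objectives differ only in what the student is compared against. The NumPy sketch below uses random logits and omits details the review does not specify (temperature, loss weighting, masking), so treat it as an illustration of the loss shapes rather than the paper's training code.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize the exponent
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sft_loss(student_logits, target_ids):
    """Mean cross-entropy against hard labels."""
    logp = log_softmax(student_logits)
    return float(-logp[np.arange(len(target_ids)), target_ids].mean())

def kd_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions."""
    logp_t = log_softmax(teacher_logits)
    logp_s = log_softmax(student_logits)
    return float((np.exp(logp_t) * (logp_t - logp_s)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))                  # 4 positions, vocab of 10
student = teacher + 0.5 * rng.standard_normal((4, 10))  # a perturbed "compressed" model
targets = teacher.argmax(axis=-1)                       # toy hard labels

kd_same = kd_loss(teacher, teacher)
kd_diff = kd_loss(teacher, student)
sft = sft_loss(student, targets)
```

SFT only sees one target token per position, while KD matches the teacher's full distribution over the vocabulary; `kd_same` is zero because a student identical to the teacher has nothing to recover.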

Interestingly, SigmaScale’s results show that KD does not substantially outperform SFT for this method — a negative result that the authors flag and contrast with prior work (Xin et al., 2026) that found KD beneficial for SVD compression recovery.

Pseudocode: Full SigmaScale Algorithm

Algorithm: SigmaScale Compression

Input:
  - Pre-trained LLM with weight matrices {W_ℓ}
  - Calibration activations X (32 samples, seq_len=2048)
  - Global target compression ratio c_global
  - Retention-ratio grid c ∈ {0.1, 0.2, ..., 0.9}

Phase 1 — Sensitivity Probing:
  for each layer ℓ, each module m (attn/MLP):
    for each c in {0.1, ..., 0.9}:
      k_c = c * |W_ℓ_m| / (rows + cols)   # Eq. (2)
      W'_c = f_svd^{k_c}(W_ℓ_m)           # Truncated SVD, no scaling
      Measure PPL(W_ℓ_m ← W'_c) on calibration set
      store sensitivity[ℓ][m][c] = Δppl
  # Binary search for globally optimal k* per layer
  {k*_ℓ_m} = BinarySearch(sensitivity, c_global)

Phase 2 — Learn Scaling Matrices:
  for each layer ℓ, each module m:
    k = k*_ℓ_m   # from Phase 1
    Initialize d_r ~ 0.1 * σ(W) * N(0, I_m)
    Initialize d_c ~ 0.1 * σ(W) * N(0, I_n)
    
    Optimization loop (T steps):
      S_r = diag(exp(d_r))                     # positive row scaling
      S_c = diag(exp(d_c))                     # positive col scaling
      Ŵ = S_r @ W @ S_c                        # scaled weight
      Û_k, Σ̂_k, V̂_k^T = truncated_SVD(Ŵ, k)  # rank-k SVD of scaled W
      W' = S_r^{-1} @ Û_k @ Σ̂_k @ V̂_k^T @ S_c^{-1}  # unscaled approx
      L_F = (1/mn) * ||W @ X - W' @ X||_F^2    # Eq. (4)
      Backprop: update d_r, d_c via gradient descent on L_F

Phase 3 — Extract Low-Rank Factors:
  for each layer ℓ, each module m:
    S_r = diag(exp(d_r*))   # final learned scaling
    S_c = diag(exp(d_c*))
    Ŵ = S_r @ W @ S_c
    U_k, Σ_k, V_k^T = truncated_SVD(Ŵ, k)
    L = S_r^{-1} @ U_k @ sqrt(Σ_k)    # Eq. (5a)
    R = sqrt(Σ_k) @ V_k^T @ S_c^{-1}  # Eq. (5b)
    Replace W with (L, R) in model     # W ≈ L @ R

Phase 4 — Post-Compression Fine-Tuning:
  Freeze all non-compressed weights (layer norms, embeddings, LM head)
  For each batch (x, y) from Alpaca dataset:
    Option A (SFT): minimize cross-entropy(student(x), y)
    Option B (KD):  minimize KL(teacher_logits(x) || student_logits(x))
    Update only L, R factors for compressed matrices

Output: Compressed LLM with all W replaced by LR factorizations

Line-by-Line Explanation of Key Steps

Phase 1, rank computation k_c = c * |W| / (rows + cols): This comes from solving $k(m+n) = c \cdot mn$ for $k$ . The constraint is: the total parameter count of the factored representation $(k \cdot m + k \cdot n = k(m+n))$ should equal $c$ times the original parameter count $mn$ .

Phase 2, S_r = diag(exp(d_r)): Exponentiation ensures all diagonal entries are strictly positive, making the matrix invertible. The unconstrained parameter space $d_r \in \mathbb{R}^m$ is mapped to positive definite diagonal matrices.

Phase 2, backprop through truncated SVD: This is non-trivial because the SVD function is not differentiable at repeated singular values. The paper cites Taylor-expansion-based gradient approximations for this step.

Phase 3, L = S_r^{-1} @ U_k @ sqrt(Σ_k) and R = sqrt(Σ_k) @ V_k^T @ S_c^{-1}: Verify: $LR = S_r^{-1} U_k \sqrt{\Sigma_k} \cdot \sqrt{\Sigma_k} V_k^T S_c^{-1} = S_r^{-1} U_k \Sigma_k V_k^T S_c^{-1} = W'$ . ✓

The Mathematics: Why Does Scaling Help?

Framing the Problem as a Metric Change

The key insight is that SVD minimizes reconstruction error in a specific metric. Vanilla SVD minimizes $\|W - W'\|_F$ (the standard Frobenius norm, which treats all entries equally). What we actually want is to minimize output error $\|Wx - W'x\|$ for typical activations $x$ .

If activations $x$ have covariance $\Sigma_x = \mathbb{E}[xx^T]$ , the weighted output error is:

\mathbb{E}_x[\|Wx - W'x\|^2] = \|(W - W')\Sigma_x^{1/2}\|_F^2

So the “right” metric for compression is the activation-covariance-weighted Frobenius norm $\|\cdot \Sigma_x^{1/2}\|_F^2$ . SVD-LLM obtains a factor $S$ with $SS^T = \Sigma_x$ via Cholesky decomposition (any such factor yields the same norm as $\Sigma_x^{1/2}$ ) and uses it as the scaling matrix on the columns of $W$ .
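The identity follows from a short trace argument. Writing $\Delta = W - W'$ and letting $S$ be any factor with $SS^T = \Sigma_x$ (the symmetric square root or a Cholesky factor), a worked derivation is:

```latex
\begin{aligned}
\mathbb{E}_x\big[\|\Delta x\|^2\big]
  &= \mathbb{E}_x\big[\operatorname{tr}(\Delta x x^T \Delta^T)\big] \\
  &= \operatorname{tr}\big(\Delta\, \mathbb{E}_x[x x^T]\, \Delta^T\big)
   = \operatorname{tr}(\Delta \Sigma_x \Delta^T) \\
  &= \operatorname{tr}(\Delta S S^T \Delta^T)
   = \|\Delta S\|_F^2 .
\end{aligned}
```

Since $\Delta S = WS - W'S$ and $S$ is invertible, minimizing over rank- $k$ matrices $W'$ is the same as finding the best rank- $k$ approximation of $WS$ , which Eckart–Young–Mirsky solves by truncated SVD; the compressed weight is then $W' = f_{\text{svd}}^{(k)}(WS)\, S^{-1}$ .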

SigmaScale takes a different route: instead of fixing the column scaling analytically, it restricts the scaling to diagonal matrices and learns $S_c$ (and also $S_r$ for rows) by gradient descent on the actual activation-aware loss $\mathcal{L}_F$ .

Why Learned Scaling Can Beat Analytical Scaling

Analytical methods (ASVD, SVD-LLM) derive the optimal $S$ for a specific proxy objective (whitening, covariance alignment). But the true objective is minimizing $\mathcal{L}_F$ with the truncation at exactly rank $k$ — a non-convex problem. Gradient descent over the full loss can find solutions that analytical methods cannot, because:

It can account for interactions between row and column scaling simultaneously.
It directly minimizes $\mathcal{L}_F$ rather than a proxy.
It can adapt to per-matrix structure that doesn’t match simple covariance-based patterns.

The trade-off: every gradient step requires a full SVD computation (cost $O(mn \min(m,n))$ , i.e. $O(n^3)$ for a square matrix), making it much more expensive than analytical methods that compute scaling once. SigmaScale is slower to compress but potentially higher quality.

Effective Rank Entropy: A Proxy for Compressibility

The effective rank entropy $H(\Sigma)$ of the singular value spectrum quantifies how “spread out” the information is across dimensions. For compression to be effective, we want the spectrum to be concentrated — a few large singular values capturing most of the information.

When SigmaScale’s learned scaling reshapes $W \to S_r W S_c$ , it changes the singular value distribution of the scaled matrix. The paper shows (Table 2) that during optimization, the average effective rank entropy decreases — meaning the spectrum becomes more concentrated — and this decrease correlates strongly with reductions in $\mathcal{L}_F$ .

Intuition: Diagonal scaling does not rotate anything; it stretches the weight matrix along the coordinate axes of its input and output spaces, which changes both the singular values and the singular vectors of the scaled matrix. A well-chosen scaling can concentrate variance along a few dominant singular directions, making rank- $k$ truncation more efficient. This is why SigmaScale works: it actively reshapes the singular value spectrum to be more amenable to low-rank approximation.

Experiments

Experimental Setup

Factor	Details
Models	Llama 3.1 8B Instruct, Qwen3-8B
Compression ratios	0.90× (mild), 0.75× (moderate), 0.50× (aggressive)
Calibration data	32 samples × 2048 tokens from Wikitext-2 training split
Perplexity eval	141 samples × 2048 tokens from Wikitext-2 test split
Zero-shot benchmarks	5 downstream tasks (BoolQ, PIQA, SIQA, WinoGrande, ARC)
Fine-tuning dataset	Alpaca (52K instruction-following examples)
Baselines	SVD-LLM (Wang et al. 2024), ASVD+ (Yuan et al. 2023)
Post-compression FT	SFT vs. KD (uncompressed teacher)
Compute	Described in Appendix C (not fully disclosed in main text)
Evaluation	lm-evaluation-harness framework

Figure 4: Comparison of Scaling Matrix Derivation Strategies

graph LR
    subgraph "ASVD (Yuan 2023)"
        A1["Compute activation\nmagnitudes from X"] --> A2["Scale columns of W\nby activation magnitude"]
        A2 --> A3["SVD decompose scaled W\nat rank k"]
    end
    
    subgraph "SVD-LLM (Wang 2024)"
        B1["Compute activation\ncovariance: C = XX^T"] --> B2["Cholesky: C = LL^T\nS_c = L (whitening)"]
        B2 --> B3["SVD decompose W S_c\nat rank k"]
    end
    
    subgraph "SigmaScale (This paper)"
        C1["Initialize d_r, d_c\n≈ small Gaussian"] --> C2["Learn S_r=diag(exp(d_r))\nS_c=diag(exp(d_c)) via SGD"]
        C2 --> C3["Minimize L_F = ||WX - W'X||_F^2\ndirectly over T steps"]
        C3 --> C2
        C3 --> C4["SVD decompose S_r W S_c\nat rank k*"]
    end

Key difference: ASVD and SVD-LLM derive scaling from activation statistics once before compression. SigmaScale optimizes scaling under the actual compression objective over multiple gradient steps.

Results Summary

The paper’s Table 1 (summarized qualitatively here) shows results for Llama 3.1 8B Instruct:

At 0.90× retention (mild compression):

SigmaScale substantially improves perplexity over SVD-LLM
Recovers most zero-shot performance on all five benchmarks
Both KD and SFT variants perform similarly

At 0.75× retention (moderate compression):

SigmaScale improves on some zero-shot benchmarks relative to the baselines
Perplexity improvements are marginal

At 0.50× retention (aggressive compression):

SigmaScale degrades sharply, especially for Llama 3.1 8B Instruct
ASVD+ and SVD-LLM appear more resilient at this extreme regime

Similar trends hold for Qwen3-8B, though the degradation at 0.50× is less severe.

Figure 5: Compression Quality vs. Retention Rate (Qualitative Trends)

Method	0.90× (mild)	0.75× (moderate)	0.50× (aggressive)
SigmaScale	Best (lowest PPL)	Competitive / marginal gain	Worst (sharp degradation)
SVD-LLM	Good	Good	More resilient
ASVD+	Good	Good	More resilient

(Qualitative summary from paper text; exact numbers in Table 1.)

Key trend: SigmaScale is best at 0.90×, competitive at 0.75×, and degrades most sharply at 0.50×. The method appears to be a “mild compression specialist”: its benefit is specific to the regime where enough rank is retained for learned scaling to reshape the spectrum without losing critical subspaces.

Why Does SigmaScale Fail at Aggressive Compression?

The paper’s own explanation: at 0.50× retention, the retained rank subspace is so small that no amount of scaling can compensate for the information discarded. Scaling manipulates which directions are considered important, but it cannot create information that simply isn’t there. Once you discard most of the singular directions (for a square matrix, 0.50× parameter retention means $k = n/4$ , so three quarters of them are dropped), the model fundamentally loses capacity.

This is analogous to audio compression: you can choose which frequencies to keep (scaling), but at extremely low bitrates, no choice can preserve the signal quality.

Effective Rank Entropy Analysis

Table 2 from the paper quantifies the correlation between scaling optimization and effective rank entropy:

Metric	Average Decrease During Training
Compression loss $\mathcal{L}_F$	Measured (strong decrease)
Effective rank entropy $H(\Sigma)$	Strong correlated decrease

Interpretation: when gradient descent pushes the scaling vectors to reduce $\mathcal{L}_F$ , it simultaneously reshapes the singular value spectrum to be more concentrated (lower $H(\Sigma)$ ). This is mechanistic evidence that SigmaScale works by “focusing” the weight matrix’s information content into fewer dominant directions — exactly what truncated SVD needs to perform well.

Figure 6: Feature Comparison of SVD Compression Methods

Feature	Vanilla SVD	ASVD	SVD-LLM	SigmaScale
Scaling type	None	Column (mag.)	Column (Cholesky)	Row + Column (learned)
Scaling derived from	—	Act. magnitude	Act. covariance	Gradient descent
Optimization steps	0	0	0	Multiple (O(n³) per step)
Post-compression FT	Optional	Optional	Yes	Yes (SFT or KD)
Best regime	Any	Mild	Mild-moderate	Mild
Hardware requirement	None	None	None	None
Computational cost	Low	Medium	Medium	High

The table highlights SigmaScale’s trade-off: most flexible and potentially highest quality, but most computationally expensive at compression time (though inference cost is identical to any other low-rank factorization).

Critical Assessment: Weaknesses and Improvements

Weaknesses and Flaws

1. Limited compression regimes evaluated. The paper only tests three compression levels: 0.90×, 0.75×, and 0.50×. The range most relevant for deployment is often 0.6×–0.85×, and only one tested point (0.75×) falls inside it. This makes it hard to assess where exactly SigmaScale transitions from effective to ineffective.

2. Evaluation breadth is narrow. The paper evaluates perplexity on Wikitext-2 and five zero-shot benchmarks. This omits:

Long-form generation quality (coherence, factuality, instruction following on real queries)
Coding benchmarks (HumanEval, MBPP)
Mathematical reasoning (GSM8K, MATH) — particularly relevant since quantization/compression has known issues with reasoning chains
Multilingual tasks (Qwen3 is multilingual; English-only eval seems insufficient)

The 5-benchmark suite is standard but known to be saturated at this model scale, meaning small differences in accuracy may be noise rather than signal.

3. Calibration data sensitivity not rigorously studied. The authors acknowledge using Wikitext-2 primarily “for consistency with SVD-LLM and ASVD” and admit it is likely a “subpar choice.” Yet they do not run any ablation varying the calibration dataset (e.g., instruction-following data vs. Wikipedia text vs. code). This is a significant omission: ASVD and SVD-LLM both demonstrate sensitivity to calibration distribution, and a learned scaling method with $m+n$ free parameters per matrix is potentially more sensitive.

4. Computational cost not quantified. The paper describes needing an SVD at every optimization step (cost $O(n^3)$ for a square matrix), but compute details are deferred to Appendix C, and precise wall-clock compression times are not directly compared against SVD-LLM and ASVD. How many gradient steps are taken? What is the actual compression time overhead? For practitioners deciding whether to use SigmaScale vs. SVD-LLM, this information is critical.

5. Only 8B-scale models. Results are shown only on Llama 3.1 8B Instruct and Qwen3-8B. Low-rank methods can behave differently across scales, since 70B models may have different singular value structure than 8B models. There is no evidence the method scales to the models most relevant for deployment (the 70B+ range where compression savings are largest in absolute terms).

6. No latency or throughput measurements. The paper motivates SVD compression as reducing “LLM-inference computing cost,” but reports no inference latency or throughput numbers. Frobenius reconstruction loss and perplexity tell us about weight quality, not actual speedup. Especially at 0.90× retention, the question is: what is the actual wall-clock speedup vs. the quality loss?

Limitations the Authors Understate or Omit

The O(n³) per-step cost becomes a serious obstacle for large layers. The paper mentions this as a limitation but does not quantify it. In a 70B model, MLP weight matrices are $8192 \times 28672$. A single SVD costs $O(\min(m,n)^2 \max(m,n))$, which for these dimensions is on the order of $10^{12}$ operations before constant factors. Running hundreds of gradient steps per matrix (each requiring a full SVD) across every compressed matrix in the model multiplies that cost accordingly, and without measured timings it is unclear whether the method stays practical at this scale. The paper does not propose approximate SVD (e.g., randomized SVD or Lanczos) to alleviate this, and does not bound the number of gradient steps.
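To put rough numbers on the per-step cost discussed above, the sketch below evaluates the leading-order term $\min(m,n)^2 \max(m,n)$ for one dense SVD of an $8192 \times 28672$ matrix. The constant factor `c` and the step count `steps` are placeholders chosen for illustration; neither comes from the paper.

```python
def svd_flops(m: int, n: int, c: float = 1.0) -> float:
    """Leading-order FLOP estimate for one dense SVD of an m x n matrix.

    c is an implementation-dependent constant (assumption: left at 1.0,
    so the result is only an order-of-magnitude figure).
    """
    small, large = min(m, n), max(m, n)
    return c * small**2 * large


m, n = 8192, 28672   # MLP shape quoted above for a 70B model
steps = 500          # hypothetical number of gradient steps per matrix

per_svd = svd_flops(m, n)
per_matrix = steps * per_svd

print(f"one SVD       ~ {per_svd:.2e} FLOPs (times constant c)")
print(f"{steps} steps ~ {per_matrix:.2e} FLOPs per matrix")
```

The real cost depends on the SVD backend and on how many matrices are compressed, which is why measured wall-clock numbers from the authors would be more informative than this estimate.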

The negative KD result needs more investigation. Prior work (Xin et al., 2026) found KD significantly better than SFT for compressed LLM recovery. SigmaScale’s KD results are “not substantially better.” The authors note this but do not investigate why. Possible explanations: (a) SigmaScale’s learned scaling already pre-aligns the compressed model’s output distribution with the teacher; (b) the specific KD implementation was suboptimal; (c) the 8B model scale is too small for KD to show benefits. Without analysis, this result is hard to interpret or build on.

Interaction with LoRA or quantization not tested. Many practical deployments combine multiple compression techniques (e.g., SVD compression + INT8 quantization, or SVD initialization for LoRA fine-tuning). The paper claims SVD methods “can be deployed alongside quantization and pruning” but does not demonstrate this for SigmaScale.

Concrete Improvement Suggestions

1. Study calibration data ablation. Run SigmaScale with at least 3 calibration datasets: Wikitext-2 (used), Alpaca (instruction-following), and code (e.g., The Stack). Report how much calibration distribution shifts compression quality. This would directly address the paper’s own stated uncertainty about Wikitext being “subpar.”

2. Add randomized/approximate SVD. Replace the exact $O(n^3)$ SVD per gradient step with a randomized SVD (Halko et al., 2011), which costs roughly $O(mn \log k + (m+n)k^2)$ with a structured random test matrix, or $O(mnk)$ with a Gaussian one. This could substantially reduce compression time and make larger models feasible, although the gain shrinks when the target rank $k$ is a large fraction of $\min(m,n)$, as it is at 0.90× retention. The loss in approximation quality from using approximate SVD in the inner loop is likely small compared to the truncation approximation itself.
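For readers unfamiliar with the technique, here is a minimal NumPy sketch of a randomized SVD in the style of Halko et al. (2011), using a Gaussian test matrix and a few power iterations. It is not taken from the paper; the function name `randomized_svd` and the defaults for `oversample` and `n_iter` are my own illustrative choices.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate top-k SVD via a Gaussian range finder.

    oversample and n_iter are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, min(m, n))
    # Sample the range of A with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))
    # Power iterations sharpen the basis when the spectrum decays slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix Q^T A (ell x n).
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


# Demo on a synthetic matrix that is exactly rank 20.
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 20)) @ rng.standard_normal((20, 200))
U, s, Vt = randomized_svd(A, k=20)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Whether gradients through such an approximate factorization are accurate enough for learning the scaling vectors is an open question that this sketch does not address.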

3. Extend evaluation to reasoning and coding. Add at minimum GSM8K (mathematical reasoning) and HumanEval (coding) to the benchmark suite. These tasks are known to be sensitive to model compression in ways that perplexity does not predict.

4. Report actual compression time. Provide wall-clock compression time vs. SVD-LLM and ASVD on the same hardware. This is essential for practitioners to make a trade-off decision.

5. Test at 70B scale. Even a single experiment on Llama 3.1 70B would dramatically increase the practical relevance of the work. The authors could limit this to 0.90× retention (where the method works best) and a single benchmark suite to keep cost manageable.

6. Ablate the number of optimization steps. How does quality evolve with the number of gradient steps? A convergence plot would show whether 100 steps or 10,000 steps are needed, informing practitioners about the compression time vs. quality trade-off.

Limitations and Boundary Conditions

SigmaScale is most effective when:

The compression ratio is mild (0.90× retention, i.e., 10% parameter reduction per matrix).
The weight matrices have structured singular value spectra that can be reshaped by diagonal scaling.
Computational resources for compression time are available (O(n³) per step × many steps per matrix × many matrices).

It is least effective when:

Aggressive compression is needed (0.50× or lower).
Calibration data distribution differs from inference distribution.
The target is a large-scale model (70B+), where an O(n³) SVD per step becomes very costly.

It is not a complete solution for extreme low-rank compression: at very low retention rates, the fundamental information loss cannot be overcome by any choice of scaling.

Conclusion

SigmaScale introduces a novel approach to SVD-based LLM compression: rather than analytically deriving scaling matrices from activation statistics (as ASVD and SVD-LLM do), it learns them by gradient descent under the activation-aware Frobenius loss. The key contribution is demonstrating that:

Learned scaling can lower the effective rank entropy of weight matrices, making them more amenable to low-rank truncation.
This entropy reduction correlates strongly with compression quality (lower $\mathcal{L}_F$ ).
The method is competitive with state-of-the-art SVD methods in the mild-to-moderate compression regime, without requiring specialized hardware.

The work exposes an interesting research question: how much better can SVD-based compression become if the scaling pre-conditioning is optimized rather than analytically derived? SigmaScale provides a first data point, though the computational cost of the approach limits its near-term practical applicability. Future work combining approximate SVD, richer fine-tuning datasets, and larger model scales will determine whether learned scaling becomes the standard approach.

Reproduction Notes

Key implementation details:

Models: Llama 3.1 8B Instruct (HuggingFace meta-llama/Llama-3.1-8B-Instruct) and Qwen3-8B (Qwen/Qwen3-8B)
Calibration: 32 samples × 2048 tokens from Wikitext-2 training split
Eval perplexity: Wikitext-2 test split (141 samples × 2048 tokens)
Zero-shot eval: lm-evaluation-harness framework
Fine-tuning data: Alpaca (52K samples); authors also created a custom Alpaca variant based on Llama 3.1-8B output distribution (see Appendix G in the paper)
Baselines: SVD-LLM and ASVD+ with unified hyperparameters for fair comparison
Codebase: Available (linked in Appendix G of the paper)
Compute: Described in Appendix C (not fully disclosed in main text)

Potential pitfalls:

Backpropagating through the SVD requires careful handling of repeated or nearly repeated singular values (handled via a Taylor approximation).
The optimal number of optimization steps is not stated explicitly in the main text.
The Alpaca dataset used for fine-tuning may introduce instruction-following distribution shift; testing with more diverse fine-tuning data is recommended before deploying.
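On the first pitfall: the SVD backward pass contains coefficients of the form $1/(\sigma_i^2 - \sigma_j^2)$, which blow up when two singular values nearly coincide. One known remedy is to replace the coefficient with a truncated geometric (Taylor) series. The sketch below illustrates that idea; whether SigmaScale uses exactly this form is an assumption on my part, and the function names and the `degree` default are hypothetical.

```python
def exact_coeff(a: float, b: float) -> float:
    """Coefficient 1 / (a - b), with a = sigma_i^2 and b = sigma_j^2."""
    return 1.0 / (a - b)


def taylor_coeff(a: float, b: float, degree: int = 9) -> float:
    """Truncated geometric series for 1 / (a - b), assuming a >= b > 0.

    1 / (a - b) = (1 / a) * sum_{t >= 0} (b / a)^t, cut off at `degree`.
    The truncated sum is bounded by (degree + 1) / a even when a == b.
    """
    ratio = b / a
    return sum(ratio**t for t in range(degree + 1)) / a


# Well-separated values: the series closely matches the exact coefficient.
print(exact_coeff(4.0, 1.0), taylor_coeff(4.0, 1.0))
# Nearly repeated values: the exact coefficient explodes, the series stays bounded.
print(exact_coeff(1.0 + 1e-9, 1.0), taylor_coeff(1.0 + 1e-9, 1.0))
```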

Quick sanity check for reproduction: at 0.90× retention on Llama 3.1 8B Instruct, SigmaScale should substantially lower perplexity vs. vanilla truncated SVD and modestly improve over SVD-LLM, while recovering BoolQ/PIQA/ARC accuracy close to the uncompressed baseline.

Deep Dive: Mathematical Relationships Between Scaling and Compression Quality

The Weighted Low-Rank Approximation Perspective

To understand why scaling helps, it is instructive to derive the optimal low-rank approximation under a weighted Frobenius norm.

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ and symmetric positive definite matrices $A \in \mathbb{R}^{m \times m}$ , $B \in \mathbb{R}^{n \times n}$ , define the $(A, B)$ -weighted Frobenius norm:

\|M\|_{A, B}^2 = \text{tr}(A M B M^T) = \|A^{1/2} M B^{1/2}\|_F^2

The best rank- $k$ approximation of $W$ under this metric is:

W^* = A^{-1/2} \left( \sum_{i=1}^{k} u_i \sigma_i v_i^T \right) B^{-1/2}

where $u_i, \sigma_i, v_i$ are the singular triplets of $A^{1/2} W B^{1/2}$ .

SigmaScale’s design in this framework: By setting $A = S_r^2 = \text{diag}(\exp(2d_r))$ and $B = S_c^2 = \text{diag}(\exp(2d_c))$ (so $A^{1/2} = S_r$ , $B^{1/2} = S_c$ ), the problem reduces exactly to the SigmaScale formulation:

W' = S_r^{-1} f_{\text{svd}}^{(k)}(S_r W S_c) S_c^{-1}

This confirms that SigmaScale is finding the best rank- $k$ approximation of $W$ in the metric defined by the learned scaling matrices. Optimizing the scaling parameters $d_r, d_c$ is equivalent to searching for the best weighted norm under which rank- $k$ truncation incurs minimum activation-based loss.
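The optimality claim can be checked numerically. The sketch below builds $W' = S_r^{-1} f_{\text{svd}}^{(k)}(S_r W S_c) S_c^{-1}$ for fixed diagonal scalings and compares its weighted error against plain truncated SVD. The vectors `d_r` and `d_c` are random stand-ins for learned parameters, and the helper names are mine, not the paper's.

```python
import numpy as np


def truncated_svd(M, k):
    """Best rank-k approximation of M in the plain Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def scaled_truncation(W, d_r, d_c, k):
    """W' = S_r^{-1} f_svd^{(k)}(S_r W S_c) S_c^{-1} with diagonal S = exp(d)."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    scaled = s_r[:, None] * W * s_c[None, :]
    return truncated_svd(scaled, k) / s_r[:, None] / s_c[None, :]


def weighted_err(W, W_approx, d_r, d_c):
    """Squared (A, B)-weighted Frobenius error with A = S_r^2, B = S_c^2."""
    s_r, s_c = np.exp(d_r), np.exp(d_c)
    return np.linalg.norm(s_r[:, None] * (W - W_approx) * s_c[None, :]) ** 2


rng = np.random.default_rng(0)
m, n, k = 32, 48, 8
W = rng.standard_normal((m, n))
d_r = 0.5 * rng.standard_normal(m)   # stand-ins for learned parameters
d_c = 0.5 * rng.standard_normal(n)

err_scaled = weighted_err(W, scaled_truncation(W, d_r, d_c, k), d_r, d_c)
err_plain = weighted_err(W, truncated_svd(W, k), d_r, d_c)
print(err_scaled, err_plain)
```

By Eckart–Young applied to the scaled matrix, `err_scaled` can never exceed `err_plain`. The interesting part in SigmaScale is choosing `d_r` and `d_c` so that this weighted metric tracks the activation-based loss, which the sketch does not attempt.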

Connection to the Activation Covariance Matrix

Let $X \in \mathbb{R}^{n \times s}$ be the calibration activation matrix. The activation-aware loss can be written as:

\mathcal{L}_F = \frac{1}{mn} \|WX - W'X\|_F^2 = \frac{1}{mn} \|(W - W')X\|_F^2

If we define the empirical activation covariance $C = XX^T \in \mathbb{R}^{n \times n}$ (positive semi-definite), then:

\|(W - W')X\|_F^2 = \text{tr}\left((W - W')^T (W - W') C\right) = \|W - W'\|_C^2

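where $\|\cdot\|_C$ is the $C$-weighted Frobenius norm: the $(A, B)$-weighted norm defined above with $A = I$ and $B = C$, so the weighting acts on the column (input) dimension of $W$.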

SVD-LLM directly uses the Cholesky factor of $C$ as the column scaling: a lower-triangular, generally non-diagonal $S_c$ with $S_c S_c^T = C$. This yields the best rank-$k$ approximation under exactly this column-weighted norm. The choice is theoretically motivated: for the calibration activations $X$ at that layer, truncating the whitened matrix $W S_c$ and mapping back with $S_c^{-1}$ minimizes $\|WX - W'X\|_F^2$ over rank-$k$ matrices $W'$.
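Both the trace identity and the effect of Cholesky whitening are easy to verify on synthetic data. The sketch below assumes more calibration samples than input channels, so that $C$ is positive definite and the Cholesky factor exists; the shapes and the synthetic activations are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, k = 16, 24, 200, 6
W = rng.standard_normal((m, n))
# Synthetic calibration activations with uneven per-channel scale.
X = rng.standard_normal((n, s)) * rng.uniform(0.1, 5.0, size=(n, 1))
C = X @ X.T                                  # empirical covariance, n x n


def truncated_svd(M, k):
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sv[:k]) @ Vt[:k]


def activation_err(W, W_approx):
    """Squared Frobenius error on the calibration activations."""
    return np.linalg.norm((W - W_approx) @ X) ** 2


# Identity: ||(W - W')X||_F^2 == tr((W - W')^T (W - W') C).
W_plain = truncated_svd(W, k)
D = W - W_plain
trace_form = np.trace(D.T @ D @ C)

# Whiten with the Cholesky factor S (S S^T = C), truncate, map back.
S = np.linalg.cholesky(C)                    # requires C positive definite
W_white = truncated_svd(W @ S, k) @ np.linalg.inv(S)

print(activation_err(W, W_plain), trace_form)
print(activation_err(W, W_white))
```

The whitened variant should never do worse than plain truncation on this loss, since $\|(W - W')X\|_F = \|(W - W')S\|_F$ and truncating $WS$ is optimal for the right-hand side.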

SigmaScale additionally introduces row scaling $S_r$ , which is not captured by column-covariance weighting alone. The row scaling allows the method to also reweight output directions — useful when the output distribution has structured asymmetries that simple column weighting misses.

Why Row Scaling Matters

Consider an LLM’s attention output projection $W_O \in \mathbb{R}^{d \times d}$. The input activations to $W_O$ are the concatenated per-head attention outputs, and the output is added to the residual stream. The residual stream has its own distribution: certain output dimensions may be much more “important” (strongly coupled to downstream computation) than others.

Column scaling $S_c$ accounts for the input activation distribution. Row scaling $S_r$ can account for the output importance — essentially weighting reconstruction error more heavily for high-importance output dimensions. Pure column-covariance methods (SVD-LLM) do not have this degree of freedom.

This theoretical argument predicts that the benefit of learned row scaling should be larger for weight matrices whose row importance is heterogeneous and not well-correlated with column activation magnitudes — and indeed the paper shows improvement in the mild compression regime where these subtle asymmetries matter.
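A toy example makes the role of row scaling concrete. Below, one output row is given a much larger importance weight than the others, and the per-row reconstruction error is compared with and without row scaling. The weight of 100 on row 0 is a made-up value for illustration, not something derived from a real model.

```python
import numpy as np


def truncated_svd(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


rng = np.random.default_rng(0)
m, n, k = 8, 8, 2
W = rng.standard_normal((m, n))

# Hypothetical importance weights: output row 0 matters far more than the rest.
s_r = np.ones(m)
s_r[0] = 100.0

W_plain = truncated_svd(W, k)
W_row = truncated_svd(s_r[:, None] * W, k) / s_r[:, None]

# Per-row squared reconstruction error for each variant.
err_plain = np.sum((W - W_plain) ** 2, axis=1)
err_row = np.sum((W - W_row) ** 2, axis=1)
print("row 0 error, plain truncation:", err_plain[0])
print("row 0 error, row-scaled      :", err_row[0])

# Importance-weighted totals; the row-scaled variant is optimal for this metric.
weighted_plain = np.sum(s_r**2 * err_plain)
weighted_row = np.sum(s_r**2 * err_row)
print(weighted_plain, weighted_row)
```

The row-scaled truncation spends its limited rank on the heavily weighted row, at the expense of the others. Column scaling alone has no way to express that preference.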

Practical Deployment Considerations

Memory and Inference Cost

For a layer with weight $W \in \mathbb{R}^{m \times n}$ compressed to rank $k$ :

Quantity	Formula	Example ($m=n=4096$, $k=\lfloor 0.9 \times 2048 \rfloor = 1843$)
Original parameters	$mn$	16.8M
Compressed parameters	$k(m+n)$	$\approx 15.1M$
Parameter reduction	$(1 - k(m+n)/mn) \times 100\%$	$\approx 10\%$
Original MACs (batch 1)	$mn$	16.8M MACs
Compressed MACs (batch 1)	$k(m+n)$	$\approx 15.1M$ MACs
Memory bandwidth saved	Same ratio as parameters	$\approx 10\%$

At 0.90× retention, the savings are modest in absolute terms: roughly 10% parameter reduction per compressed matrix. Since the model also has uncompressed components (embeddings, normalization layers, output head), the model-level parameter reduction is less than 10%.

For 0.50× retention, the savings are substantial: $k(m+n) = 0.5\,mn$, i.e. 50% of the parameters per matrix. But as the SigmaScale results show, quality degrades sharply in this regime.
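The arithmetic behind these figures is short enough to script. The helper names below are hypothetical, and the rank is rounded down so that the parameter budget is never exceeded.

```python
def rank_for_retention(m: int, n: int, retention: float) -> int:
    """Largest rank k with k * (m + n) <= retention * m * n."""
    return int(retention * m * n / (m + n))


def compressed_params(m: int, n: int, k: int) -> int:
    """Parameters (and batch-1 MACs) of the factored form L (m x k), R (k x n)."""
    return k * (m + n)


m = n = 4096
for retention in (0.90, 0.75, 0.50):
    k = rank_for_retention(m, n, retention)
    kept = compressed_params(m, n, k) / (m * n)
    print(f"retention {retention:.2f}: k = {k}, kept fraction = {kept:.4f}")
```

Note that a factorization only saves parameters when $k < mn/(m+n)$, which is $k < 2048$ for a $4096 \times 4096$ matrix.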

Hardware Considerations for Inference

Low-rank matrix products $Wx \approx L(Rx)$ introduce a sequential dependency: $Rx$ must finish before the multiplication by $L$ can start. For small batch sizes (latency-critical serving), this can actually hurt throughput, because two small matrix multiplications with inner dimension $k$ may not saturate the GPU's compute units, so the reduced FLOP count does not translate into a speedup.

For large batches (throughput-critical serving), the $k(m+n)$ vs $mn$ FLOP reduction translates more directly to speedup, since tensor cores can efficiently handle both steps.

Rule of thumb: SVD low-rank compression benefits throughput-heavy serving (batch sizes ≥ 32) more than latency-sensitive serving (batch sizes = 1 or small). This is a consideration when deciding whether to use SigmaScale vs. quantization for a given deployment scenario.
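The two-step forward pass looks like this in NumPy. The factors are built from a plain truncated SVD purely for illustration, with singular values split evenly between `L` and `R`; how SigmaScale distributes them is not something this sketch assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, batch = 64, 96, 16, 4
W = rng.standard_normal((m, n))

# Build rank-k factors from the SVD, splitting singular values evenly.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
L = U[:, :k] * np.sqrt(s[:k])            # m x k
R = np.sqrt(s[:k])[:, None] * Vt[:k]     # k x n

x = rng.standard_normal((n, batch))
h = R @ x        # first matmul: n -> k, must complete before the second
y = L @ h        # second matmul: k -> m

# Same result as materializing the rank-k weight first,
# but with k(m+n) instead of mn multiply-accumulates per input column.
y_dense = (L @ R) @ x
print(np.max(np.abs(y - y_dense)))
```

The intermediate `h` has only `k` rows, which is where the small-batch utilization concern comes from: both matmuls are thin.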

Stacking with Quantization

The compressed matrices $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{k \times n}$ can in principle be quantized independently after compression. However:

The factor matrices $L$ and $R$ have different value distributions than the original weight $W$ .
The error from quantization stacks with the truncation error from SVD.
The post-compression fine-tuning (SFT or KD) is done on FP16 factors; quantizing after fine-tuning is one option; quantization-aware fine-tuning of the low-rank factors is another.

SigmaScale does not report any quantization experiments, leaving this as an open direction.
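To illustrate how the two error sources combine, the sketch below applies a simple symmetric per-row INT8 quantizer to low-rank factors obtained from plain truncated SVD. The quantizer is a generic textbook scheme chosen for brevity, not a method from the paper, and the matrix is random rather than a real LLM weight.

```python
import numpy as np


def quantize_int8(M):
    """Symmetric per-row INT8 quantization; returns codes and per-row scales."""
    scale = np.abs(M).max(axis=1, keepdims=True) / 127.0
    codes = np.clip(np.round(M / scale), -127, 127).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    return codes.astype(np.float64) * scale


rng = np.random.default_rng(0)
m, n, k = 64, 96, 16
W = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
L = U[:, :k] * np.sqrt(s[:k])
R = np.sqrt(s[:k])[:, None] * Vt[:k]

L_q = dequantize(*quantize_int8(L))
R_q = dequantize(*quantize_int8(R))

trunc_err = np.linalg.norm(W - L @ R)           # SVD truncation only
quant_err = np.linalg.norm(L @ R - L_q @ R_q)   # quantization only
total_err = np.linalg.norm(W - L_q @ R_q)       # truncation + quantization
print(trunc_err, quant_err, total_err)
```

The total error is bounded above by the sum of the two parts and below by the truncation error alone. How the balance plays out on real weights with activation-aware losses is exactly the experiment the paper leaves open.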

Historical Context: The Evolution of SVD-Based LLM Compression

Understanding where SigmaScale fits requires a brief historical arc:

Phase 1 — Naive SVD (2021-2022): Direct truncated SVD on weight matrices. Very fast to compress, but perplexity loss is unacceptably high. Root cause: ignored activation outliers.

Phase 2 — Activation-Aware Scaling (2023): ASVD introduced column scaling based on activation magnitudes. First to demonstrate competitive quality on 7B models. Simple and efficient but uses a rough proxy (L1 magnitude) rather than full covariance.

Phase 3 — Covariance-Based Scaling (2024): SVD-LLM uses Cholesky decomposition of activation covariance for provably optimal column scaling. Adds sequential layer-by-layer weight update to propagate compression error corrections. State-of-the-art at the time.

Phase 4 — Learned Scaling (2026, SigmaScale): Directly optimizes scaling parameters under the compression loss. Adds row scaling as a new degree of freedom. Competitive in mild regime, not a full solution for aggressive compression. Computational cost higher.

What’s next? The natural extensions are: (1) learned non-diagonal transformations (full rotations, as in QuaRot/QuIP for quantization); (2) joint optimization across layers (SigmaScale optimizes each matrix independently); (3) integration with LoRA fine-tuning post-deployment.

Reflection: What Makes This Paper Worth Reading?

SigmaScale is a clean, well-motivated paper that makes a targeted contribution: demonstrating that learned scaling beats analytical scaling for SVD compression in the mild regime, and providing mechanistic evidence via the effective rank entropy correlation.

What it does well:

Clear hypothesis (learn vs. derive scaling)
Mechanistic analysis (effective rank entropy correlation)
Honest about limitations (aggressive compression fails, O(n³) cost, narrow eval)
Two models tested (Llama 3.1 + Qwen3)
SFT vs. KD comparison (even if the negative KD result isn’t fully explained)

What I’d want to see in a follow-up:

Randomized SVD for scalability
Calibration data ablation (the most obviously missing experiment)
70B scale validation
Latency measurements
Integration with quantization

For researchers working on efficient LLM deployment, SigmaScale is a useful reference for the proposition that “activation-aware diagonal pre-conditioning + learned optimization can outperform covariance-based analytics” — and the effective rank entropy metric is a potentially reusable diagnostic tool for other compression methods.

Glossary of Key Terms

Term	Definition
Truncated SVD	Keeping only the top $k$ singular triplets of the SVD; optimal rank- $k$ approximation under Frobenius norm (Eckart–Young theorem)
Low-rank factorization	Representing weight matrix $W$ as product $LR$ of two thin matrices, reducing storage and FLOPs
Activation outliers	Input channels with abnormally large activation magnitudes relative to others; cause naïve SVD to misallocate rank
Scaling matrix	Diagonal matrix applied to pre-condition a weight matrix before SVD; shifts the effective metric for rank- $k$ approximation
Activation-aware loss	Frobenius reconstruction error on actual calibration activations $X$: $\|(W - W')X\|_F^2$; contrasted with weight-space Frobenius norm
Effective rank entropy	Entropy of the normalized singular value distribution; low entropy = concentrated spectrum = easier to compress
Knowledge distillation (KD)	Minimizing KL divergence between a compressed student and uncompressed teacher’s output logits; used to recover post-compression performance
Sensitivity probing	Measuring how much each layer’s perplexity rises under compression at various ratios; drives per-layer rank allocation
Binary search (ASVD)	Search procedure ASVD uses to pick per-layer ranks that satisfy a total parameter budget, guided by sensitivity probing
Retention ratio	Fraction of original parameters kept per matrix after low-rank approximation (0.90 = keep 90%)
ASVD	Activation-aware SVD: column scaling from activation magnitudes (Yuan et al., 2023)
SVD-LLM	Column scaling from Cholesky decomposition of activation covariance (Wang et al., 2024)
SigmaScale	This paper: learned row+column diagonal scaling via gradient descent on activation-aware loss
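The effective rank entropy entry in the glossary can be turned into a few lines of NumPy. This sketch normalizes the singular values by their plain sum, which is my assumption; the paper's exact definition may differ, for example by using squared singular values.

```python
import numpy as np


def effective_rank_entropy(M, eps=1e-12):
    """Shannon entropy of singular values normalized to sum to 1.

    Assumption: normalization by the plain sum of singular values.
    """
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop zeros so that 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())


n = 16
flat = np.eye(n)                        # all singular values equal
rng = np.random.default_rng(0)
peaked = np.outer(rng.standard_normal(n), rng.standard_normal(n))  # rank 1

print(effective_rank_entropy(flat), np.log(n))
print(effective_rank_entropy(peaked))
```

A flat spectrum gives the maximum entropy $\log n$, and a rank-1 matrix gives an entropy near zero, which matches the intuition that low entropy means a concentrated spectrum that is easier to truncate.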