June 12, 2026 EN #SVD & Low-Rank #Model Compression #LLM Inference #Transformer

SliceGPT: Post-Training LLM Compression via Computational Invariance

Review date: 2026-06-12
Review author: Zhongzhu Zhou
Paper reviewed: SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Grangeiro Perez, Torsten Hoefler, James Hensman
arXiv: 2401.15024
Status / Venue: ICLR 2024 (accepted); Microsoft Research + ETH Zürich; 22 pages, 8 figures

Short Answer

SliceGPT proposes a post-training compression scheme built on a structural mathematical insight called computational invariance: any orthogonal change-of-basis applied simultaneously to consecutive weight matrices cancels out exactly, leaving the model’s outputs unchanged. The authors use PCA over calibration activations to find the basis in which the residual stream’s last few directions carry near-zero variance, then physically remove those rows and columns from the weight matrices. The result is a set of smaller, fully dense weight matrices that run faster on standard hardware with no custom CUDA kernels. At 25% parameter reduction, LLAMA2-70B retains 99% of its zero-shot performance while inference compute drops to 64–66% of the original.

Prerequisites

1. Transformer Architecture Fundamentals

A modern decoder-only transformer (GPT, LLAMA, OPT) is a stack of $L$ transformer blocks, each containing:

RMS Layer Normalization — normalizes the residual stream by its RMS and scales by a learned vector $\gamma \in \mathbb{R}^d$
Multi-Head Self-Attention — applies Q/K/V projections, scaled dot-product attention, and an output projection
MLP / Feed-Forward Network — an up-projection, a pointwise nonlinearity (GeLU, SiLU), and a down-projection
Residual connections — the output of every sub-block is added back to the input

The central data structure flowing through the network is the residual stream: a tensor of shape $(\text{seq\_len}, d)$ where $d$ is the model dimension (also called hidden size or embedding dimension). In LLAMA2-7B, $d = 4096$ ; in LLAMA2-70B, $d = 8192$ .

Every linear layer in the transformer operates on this residual stream: it reads a vector from the stream, multiplies by a weight matrix, and either writes back to the stream (output projections) or produces an intermediate tensor (Q/K/V). The dimension $d$ is the bottleneck that SliceGPT targets.

2. Singular Value Decomposition (SVD)

For any matrix $A \in \mathbb{R}^{m \times n}$ , the SVD factorizes it as:

A = U \Sigma V^T

where:

$U \in \mathbb{R}^{m \times m}$ — orthonormal left singular vectors (columns form an orthonormal basis of $\mathbb{R}^m$ )
$\Sigma \in \mathbb{R}^{m \times n}$ — diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$
$V \in \mathbb{R}^{n \times n}$ — orthonormal right singular vectors

The Eckart–Young theorem gives the best rank- $k$ approximation:

A_k = U_k \Sigma_k V_k^T, \quad \text{with} \quad \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

SliceGPT does not apply SVD directly to weight matrices (that would be ordinary low-rank compression). Instead it uses SVD to find the optimal change of basis for the activations — a conceptually different use of the same tool.
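The Eckart–Young statement above can be checked numerically. The following is a minimal NumPy sketch; the matrix size, seed, and rank `k` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

# Thin SVD: singular values are returned in descending order.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k truncation

frob_error = np.linalg.norm(A - A_k, ord="fro")
tail_energy = np.sqrt(np.sum(S[k:] ** 2))     # sqrt of discarded sigma_i^2
```

The two quantities `frob_error` and `tail_energy` agree up to floating-point rounding, which is the equality in the formula above.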

3. Principal Component Analysis (PCA) and Its Geometry

Given a data matrix $X \in \mathbb{R}^{d \times n}$ whose columns are activation samples, PCA finds the orthogonal transformation $Q \in \mathbb{R}^{d \times d}$ such that the covariance of $QX$ is diagonal:

\text{Cov}(QX) = Q \cdot \frac{XX^T}{n} \cdot Q^T = \text{diag}(\lambda_1, \ldots, \lambda_d)

with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ . The rows of $Q$ are the eigenvectors of the empirical covariance $\frac{1}{n} XX^T$ , sorted by descending eigenvalue. The eigenvalue $\lambda_i$ measures the variance of the activations in the $i$ -th principal direction.

After transforming $X \mapsto QX$ , the last $d-k$ coordinates of $QX$ have variances $\lambda_{k+1}, \ldots, \lambda_d$ . When these eigenvalues are close to zero, the corresponding coordinates are nearly zero in every sample and carry almost no information, so discarding them is close to lossless.
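As a small illustration of this geometry, the sketch below builds synthetic activations that lie almost entirely in a low-dimensional subspace and recovers the rotation $Q$ with an eigendecomposition. The sizes, noise level, and variable names are assumptions for the example, and the covariance is left uncentered to match the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 6, 5000, 3

# Synthetic activations: columns are samples lying (almost) in an r-dim subspace.
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]       # d x r, orthonormal columns
X = basis @ rng.standard_normal((r, n)) + 1e-3 * rng.standard_normal((d, n))

cov = X @ X.T / n                         # uncentered covariance, as in the text
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigvals)[::-1]         # re-sort to descending
lam = eigvals[order]
Q = eigvecs[:, order].T                   # rows of Q are principal directions

rotated_cov = Q @ cov @ Q.T               # diagonal, equal to diag(lam)
tail_energy = np.mean((Q @ X)[r:] ** 2, axis=1)   # energy in the last d - r coordinates
```

In this toy setup `tail_energy` equals the trailing eigenvalues `lam[r:]`, which are several orders of magnitude below `lam[0]`.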

4. Orthogonal Matrices: The Key Algebraic Tool

A matrix $Q \in \mathbb{R}^{d \times d}$ is orthogonal if $QQ^T = Q^TQ = I$ . Its critical properties:

Norm-preserving: $\|Qx\|_2 = \|x\|_2$ for all $x$ (orthogonal transforms are rigid rotations/reflections)
Exact inverse: $Q^{-1} = Q^T$ (cheap to invert)
Exact identity insertions: $Q^TQ = I$ , so inserting $Q^TQ$ anywhere in a product leaves it unchanged

The last property is the crux of SliceGPT. Inserting $I = Q^TQ$ between two weight matrices changes the parameterization but not the computation — and choosing $Q$ wisely (via PCA) reveals low-variance directions that can be discarded.
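The three properties can be verified on a random orthogonal matrix. This is a minimal sketch; obtaining $Q$ from a QR factorization is just a convenient way to get an orthogonal matrix for the example, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# The Q factor of a QR decomposition of a square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
W = rng.standard_normal((3, d))

norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(Q @ x)      # norm-preserving

y_plain = W @ x
y_inserted = (W @ Q.T) @ (Q @ x)        # W (Q^T Q) x, identity split across both factors
```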

5. Post-Training Compression: The Landscape

Post-training compression reduces model size or compute after training, using only forward passes on a small calibration dataset. Three main paradigms:

Method	Strategy	Acceleration Mechanism	Custom Kernel?
Quantization (GPTQ, AWQ)	Reduce precision (FP16→INT4)	Less memory bandwidth	Partial (dequant.)
Unstructured Sparsity (SparseGPT, Wanda)	Zero individual weights	Sparse GEMM	Yes
Structured Compression (SliceGPT, LLM-Pruner)	Remove entire dimensions	Smaller dense GEMM	No

SliceGPT is a structured method: it removes complete rows and columns, leaving matrices that are still dense but smaller. This means standard highly-optimized dense BLAS libraries (cuBLAS, oneDNN) work without modification.

6. Computational Complexity Preview

For a transformer layer with residual-stream dimension $d$ and MLP intermediate dimension $d_\text{ff}$ , per-layer compute is approximately:

\text{FLOPs} \approx 2 \times (3d^2 + d^2 + 2d \cdot d_\text{ff}) = 2(4d^2 + 2d \cdot d_\text{ff})

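If SliceGPT reduces $d \to k = (1-s)d$ with $s = 0.25$ , then $k = 0.75d$ . Each projection that reads from or writes to the residual stream keeps its internal dimension ( $H \cdot d_\text{head}$ or $d_\text{ff}$ ) and loses only its stream-facing dimension, so its cost scales as $k/d = 0.75$ ; a factor of $(k/d)^2 = 0.5625$ would apply only to a matrix with both dimensions on the stream. The 64–66% figure quoted from the paper is an empirical measurement of inference compute, and this per-matrix count alone does not reproduce it.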

What SliceGPT Does: Overview

SliceGPT (Ashkboos et al., Microsoft Research + ETH Zürich, ICLR 2024) makes three contributions:

Contribution 1 — Computational invariance theorem. A formal proof that for any sequence of orthogonal matrices $\{Q_0, Q_1, \ldots, Q_L\}$ , there exists a reparameterization of every transformer weight matrix such that the model’s output is exactly preserved for all inputs.

Contribution 2 — A principled slicing algorithm. Using PCA on calibration-data activations, the algorithm (a) identifies the optimal orthogonal basis at each layer, (b) rotates the weights into this basis, and (c) physically truncates the weight matrices by removing the last $d - k$ rows/columns (the directions with near-zero activation variance).

Contribution 3 — Hardware-native deployment. The sliced model consists only of smaller dense matrices, running on standard hardware without any new infrastructure, achieving actual latency and GPU-count reductions.

The Core Insight: Computational Invariance

Formal Derivation

Setup. Consider two consecutive linear operations separated by an element-wise nonlinearity $\phi$ (GeLU, SiLU, ReLU):

y = W_2 \,\phi\!\bigl(W_1\, x\bigr)

with $W_1 \in \mathbb{R}^{h \times d}$ , $W_2 \in \mathbb{R}^{d \times h}$ , $x \in \mathbb{R}^d$ .

Step 1: Insert $Q^TQ = I$ .

For any orthogonal $Q \in \mathbb{R}^{d \times d}$ :

y = W_2\, \phi\!\bigl(W_1\, Q^T Q\, x\bigr)

Step 2: Re-parenthesize.

y = W_2\, \phi\!\bigl((W_1 Q^T)(Q x)\bigr)

Define $\tilde{W}_1 = W_1 Q^T$ and $\tilde{x} = Qx$ . Then:

y = W_2\, \phi(\tilde{W}_1\, \tilde{x})

The output $y$ is mathematically identical (in practice, identical up to floating-point rounding). The computation is invariant under the orthogonal reparameterization $W_1 \to W_1 Q^T$ , $x \to Qx$ .

Step 3: Propagate through the full residual stream.

The residual stream at layer $l$ carries $x_l$ . Let all operations reading from position $l$ absorb $Q_l^T$ on the right of their weight, and all operations writing to position $l$ absorb $Q_l$ on the left of their weight. Then:

The stream at position $l$ now carries $Q_l x_l$ in the new parameterization
Every consumer $W_\text{in}$ sees $(W_\text{in} Q_l^T)(Q_l x_l) = W_\text{in} x_l$ — unchanged output
Every producer $W_\text{out}$ now produces $Q_l (W_\text{out} x_{l-1})$ , which is the new stream at position $l$

Theorem (Computational Invariance, ICLR 2024): For any pretrained transformer $f_\theta$ and any sequence of orthogonal matrices $\{Q_l\}_{l=0}^L$ , there exists a reparameterized transformer $f_{\tilde{\theta}}$ with $f_{\tilde{\theta}}(x) = f_\theta(x)$ for all inputs $x$ .

This is an exact statement — no error, no approximation. The subsequent slicing (keeping only $k$ dimensions) introduces the only approximation.
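The consumer and producer rules from Step 3 can be exercised on a toy two-layer block. The sketch below is an illustration under assumed sizes, with SiLU as the nonlinearity and two independent random rotations standing in for the bases of the stream being read and the stream being written.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h = 6, 10
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((d, h))
x = rng.standard_normal(d)

Q_in, _ = np.linalg.qr(rng.standard_normal((d, d)))    # basis of the stream being read
Q_out, _ = np.linalg.qr(rng.standard_normal((d, d)))   # basis of the stream being written

y = W2 @ silu(W1 @ x)

W1_rot = W1 @ Q_in.T        # consumer absorbs Q^T on the right
W2_rot = Q_out @ W2         # producer absorbs Q on the left
y_rot = W2_rot @ silu(W1_rot @ (Q_in @ x))

# The element-wise nonlinearity itself does not commute with a rotation.
commutes = np.allclose(silu(Q_in @ x), Q_in @ silu(x))
```

Here `y_rot` equals `Q_out @ y`: the rotated block reads a rotated input and writes the original output expressed in the new basis, even though `commutes` is false.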

Truncation Error Bound

After choosing $Q_l$ to be the PCA matrix of calibration activations at layer $l$ , the truncation error (squared norm of discarded activation components) is bounded by:

\epsilon_l \le C \sum_{i=k_l+1}^{d} \lambda_i^{(l)}

where $\lambda_i^{(l)}$ is the $i$ -th eigenvalue of the empirical covariance at layer $l$ . For well-trained large models, the eigenvalue spectrum decays sharply (Zipfian-like), making $\sum_{i > k} \lambda_i$ small even at modest $k$ .
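For the activation-level error on the calibration data itself, the relation can be checked directly: the mean squared norm of the discarded coordinates equals the sum of the discarded eigenvalues. The sketch below uses a synthetic decaying spectrum; the sizes and the decay rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 8, 2000, 5

# Activations with a decaying spectrum (columns are samples), hidden behind a rotation.
scales = 2.0 ** -np.arange(d)
X = scales[:, None] * rng.standard_normal((d, n))
X = np.linalg.qr(rng.standard_normal((d, d)))[0] @ X

eigvals, eigvecs = np.linalg.eigh(X @ X.T / n)   # ascending order
lam = eigvals[::-1]                              # descending eigenvalues
Q = eigvecs[:, ::-1].T                           # rows = principal directions, same order

Z = Q @ X
discarded_energy = np.mean(np.sum(Z[k:] ** 2, axis=0))   # mean squared norm of dropped coords
tail_sum = lam[k:].sum()
```

In this setting the two numbers coincide, so the constant is 1 for the discarded-activation energy; how that error propagates through later layers is a separate question.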

Figure 1: Computational Invariance Diagram

flowchart LR
    subgraph Original["Original Parameterization"]
        x1["x ∈ ℝᵈ"] --> W1["W₁ ∈ ℝ^{h×d}"]
        W1 --> phi1["φ(·)  element-wise"]
        phi1 --> W2["W₂ ∈ ℝ^{d×h}"]
        W2 --> y1["y ∈ ℝᵈ"]
    end
    subgraph Rotated["After inserting Q^T Q = I"]
        x2["Qx ∈ ℝᵈ"] --> W1Q["W₁Q^T ∈ ℝ^{h×d}"]
        W1Q --> phi2["φ(·)  element-wise"]
        phi2 --> W2b["W₂ ∈ ℝ^{d×h}"]
        W2b --> y2["y ∈ ℝᵈ (identical)"]
    end
    Original -. "Insert Q^T Q = I\n(zero error)" .-> Rotated

Figure 1: The computation is identical in both parameterizations. Choosing Q as the PCA rotation orders the coordinates by variance, making the last d − k dimensions safe to discard.

The SliceGPT Algorithm

Algorithm 1: SliceGPT (Pseudocode)

Input:
  f_θ          pretrained transformer (L layers, hidden dim d)
  D_calib      calibration data: C sequences × T tokens each
                (paper uses C=256, T=2048 from C4 dataset)
  s            global sparsity ratio (paper uses s=0.25)

Output:
  f_θ̃          compressed transformer with hidden dim k = round(d·(1−s))

─────────────────────────────────────────────────────
Preprocessing (RMSNorm scale absorption):
  For each transformer block l:
    Fold scale parameter γ_l into the next weight:
      For W reading immediately after RMSNorm at l:
        W ← W · diag(γ_l)
    Keep only the scale-free normalization x / RMS(x) in the model graph.
  (This step is exact, and x / RMS(x) commutes with any orthogonal Q.)
─────────────────────────────────────────────────────
Layer-wise PCA and slicing:
  For l = 0 to L−1:

    (A) Collect activations:
        Run D_calib through layers 0..l−1 with a forward hook.
        A_l ← concatenate all token hidden states at position l
              shape: (d, N) where N = C × T

    (B) Compute PCA basis:
        C_l ← (1/N) · A_l @ A_l.T          # empirical covariance (d×d)
        eigenvalues, V ← eigh(C_l)          # ASCENDING eigenvalues, eigenvectors in columns
        Q_l ← V[:, ::-1].T                  # rows of Q_l = eigenvectors, DESCENDING eigenvalue

    (C) Choose slice width:
        k_l ← round(d · (1 − s))            # uniform sparsity
        # (non-uniform variant: optimize k_l via marginal EVR budget)

    (D) Transform and slice all weights at position l:
        For W_in ∈ {W_Q, W_K, W_V, W_gate, W_up}  # read from stream at l
          W_in ← (W_in @ Q_l.T)[:, :k_l]   # rotate then keep top-k cols

        For W_out ∈ {W_O, W_down}           # write to stream at l+1
          W_out ← (Q_{l+1} @ W_out)[:k_{l+1}, :]  # rotate then keep top-k rows
          (deferred until Q_{l+1} and k_{l+1} are computed in the NEXT iteration)

─────────────────────────────────────────────────────
Boundary transformations:
  Input embedding E ∈ ℝ^{V×d}:
    E ← (E @ Q_0.T)[:, :k_0]
  Output LM head W_lm ∈ ℝ^{V×d}:
    W_lm ← (W_lm @ Q_L.T)[:, :k_L]
─────────────────────────────────────────────────────
Optional recovery fine-tuning:
  Fine-tune f_θ̃ for 1 epoch on D_calib (or larger dataset)
  using standard AdamW with LoRA adapters.
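Steps (B) to (D) can be sketched for a single stream-reading weight as follows. This is a toy illustration, not the authors' implementation: the helper `pca_rotation`, the synthetic activations, and the weight `W_in` are hypothetical, and the data is constructed so that most of its energy sits in `k` directions.

```python
import numpy as np

def pca_rotation(acts):
    """Return (eigenvalues, Q) with rows of Q sorted by descending eigenvalue."""
    cov = acts @ acts.T / acts.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending order, eigenvectors in columns
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order].T

rng = np.random.default_rng(0)
d, h, n, s = 16, 32, 4000, 0.25
k = round(d * (1 - s))                         # kept width

# Toy calibration activations: k strong directions, d - k weak ones, randomly rotated.
spectrum = np.concatenate([np.ones(k), 1e-3 * np.ones(d - k)])
mix = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = mix @ (spectrum[:, None] * rng.standard_normal((d, n)))

W_in = rng.standard_normal((h, d))             # hypothetical weight reading from the stream

_, Q = pca_rotation(A)                         # step (B)
W_in_sliced = (W_in @ Q.T)[:, :k]              # step (D): rotate, keep first k columns
A_sliced = (Q @ A)[:k]                         # the k-dim stream the sliced model sees

full = W_in @ A
approx = W_in_sliced @ A_sliced
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
```

With this spectrum the relative error `rel_err` is well below one percent; with a flatter spectrum the same code would show a larger error, which is the scale-dependence discussed later in the review.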

Line-by-Line Explanation

Why absorb RMSNorm first?

RMSNorm computes $\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma$ . The RMS scale is $\|x\|_2 / \sqrt{d}$ , which is invariant to orthogonal transformation since $\|Qx\|_2 = \|x\|_2$ . Therefore the RMS normalization itself is transparent to the basis change. The scale $\gamma$ is a diagonal matrix and can be absorbed:

W \leftarrow W \cdot \text{diag}(\gamma)

After absorption, the only normalization left is the parameter-free $x / \text{RMS}(x)$ , which commutes with any orthogonal $Q$ . This simplification is not an approximation: it is algebraically exact.
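Both claims in this argument are easy to check numerically: folding $\gamma$ into the following weight is exact, and the scale-free normalization commutes with an orthogonal $Q$. The sketch below uses random data and omits the small epsilon that real implementations add inside the square root.

```python
import numpy as np

def rms_normalize(x):
    """Scale-free RMS normalization: x / sqrt(mean(x^2)); epsilon omitted."""
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
d, h = 8, 5
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)
W = rng.standard_normal((h, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# 1) Folding gamma into the following weight is exact.
with_gamma = W @ (gamma * rms_normalize(x))
folded = (W * gamma) @ rms_normalize(x)     # W * gamma equals W @ diag(gamma)

# 2) The scale-free normalization commutes with any orthogonal Q.
left = rms_normalize(Q @ x)
right = Q @ rms_normalize(x)
```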

Why layer-by-layer, not all at once?

PCA at layer $l$ must reflect the actual distribution of activations produced by layers $0, \ldots, l-1$ with the weights and the calibration data. Using random Gaussian activations would give the wrong basis (the statistics of residual-stream activations are highly non-Gaussian). The layer-by-layer scan captures this correctly.

Why does this work for the residual stream?

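Residual connections add the input to the output: $x_{l+1} = x_l + \text{Block}_l(x_l)$ . Both $x_l$ and $\text{Block}_l(x_l)$ live in the same $\mathbb{R}^d$ space, so rotating both by the same orthogonal matrix preserves the addition: $Q_l x_{l+1} = Q_l x_l + Q_l \text{Block}_l(x_l)$ . When the block output is written in a different basis $Q_{l+1}$ , the skip path must be re-expressed as well: the paper inserts a linear map on the residual connection (in this review's notation, $Q_{l+1} Q_l^T$ ) so that both summands are in the same basis.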

The Q matrices disappear at inference time.

After transformation, $W_\text{in} \leftarrow W_\text{in} Q_l^T[:, :k]$ is a $d_\text{out} \times k$ matrix. It is stored as-is. At inference, the sliced model takes $k$ -dimensional inputs and produces $k$ -dimensional outputs. No Q matrix is consulted at inference — the rotation is baked into the weight values.

Dimension bookkeeping.

After slicing, each transformer block operates with:

Input/output residual stream: $k = (1-s)d$ dimensions
Q/K/V matrices: $k \times d_\text{head}$ (head dimension unchanged)
MLP: $k \times d_\text{ff}$ and $d_\text{ff} \times k$ (intermediate dim unchanged)

Because each sliced matrix keeps its internal dimension and loses only its stream-facing one, the parameters in these projections scale as $k/d = 0.75$ rather than $(k/d)^2 = 0.5625$ .

Figure 2: SliceGPT Compression Pipeline

flowchart TD
    A["Pretrained LLM\nhidden dim d"] --> B["Calibration dataset\n256 × 2048 tokens, C4"]
    B --> C["Step 1: Absorb RMSNorm γ\ninto adjacent weights"]
    C --> D["For each layer l:\nforward pass → A_l ∈ ℝ^{d×N}"]
    D --> E["PCA: covariance C_l = A_l A_l^T\neigh → Q_l, eigenvalues"]
    E --> F["Set k_l = round(d·(1−s))"]
    F --> G["W_in ← (W_in Q_l^T)[:, :k]\nW_out ← (Q_{l+1} W_out)[:k, :]"]
    G --> H{l < L?}
    H -->|Yes, l++| D
    H -->|Done| I["Transform embeddings E,\nLM head W_lm"]
    I --> J["Optional: 1-epoch fine-tuning\nwith LoRA"]
    J --> K["Compressed model\nhidden dim k = 0.75d\nDense matrices only"]

Figure 2: The full SliceGPT pipeline. The calibration phase (collecting activations, computing PCA) requires only forward passes — no gradients. The Q matrices are absorbed and not stored.

Handling Special Components

RMSNorm / LayerNorm

As derived above, RMSNorm is absorbed exactly into the first downstream weight. For LayerNorm (used in OPT), which also has a bias $\beta$ :

\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta

The bias $\beta$ is absorbed into the bias term of the following linear layer:

b_\text{new} = W_\text{in} \beta + b_\text{old}

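After absorption, no learned normalization parameters remain. The parameter-free normalization itself stays in the graph; for LayerNorm, the paper also folds the mean subtraction into the preceding linear layer, which leaves a plain RMS normalization.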
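The scale and bias folding for LayerNorm can be confirmed with a few lines of NumPy. This is a sketch with random values; `W_in`, `b_old`, and the sizes are placeholders, and the epsilon in the denominator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 5
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)
beta = rng.standard_normal(d)
W_in = rng.standard_normal((h, d))
b_old = rng.standard_normal(h)

normed = (x - x.mean()) / x.std()           # parameter-free part of LayerNorm

# Original: linear layer applied to the full LayerNorm output.
original = W_in @ (normed * gamma + beta) + b_old

# Folded: gamma goes into the weight, beta into the bias.
W_new = W_in * gamma                        # W_in @ diag(gamma)
b_new = W_in @ beta + b_old                 # uses the weight before gamma is folded in
folded = W_new @ normed + b_new
```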

Multi-Head Self-Attention

For $H$ heads with per-head dimension $d_\text{head}$ (and $H \cdot d_\text{head} = d$ ):

Q/K/V projections all read from the same stream at layer $l$ :

\tilde{W}_Q = (W_Q Q_l^T)[\text{:, :k}], \quad \tilde{W}_K = (W_K Q_l^T)[\text{:, :k}], \quad \tilde{W}_V = (W_V Q_l^T)[\text{:, :k}]

Each becomes a matrix of shape $(H \cdot d_\text{head}) \times k$ (from $(H \cdot d_\text{head}) \times d$ ). The input dimension shrinks from $d$ to $k$ ; the per-head output dimension $d_\text{head}$ is unchanged.

Output projection $W_O \in \mathbb{R}^{d \times (H \cdot d_\text{head})}$ writes to the stream at layer $l+1$ :

\tilde{W}_O = (Q_{l+1} W_O)[\text{:k, :}]

The output dimension shrinks from $d$ to $k_{l+1}$ ; the input dimension $H \cdot d_\text{head}$ is unchanged.

Grouped-Query Attention (LLAMA2-70B uses GQA): The key and value heads are shared across multiple query groups. SliceGPT handles this identically — the input dimension to $W_K, W_V$ shrinks from $d$ to $k$ , while the per-head dimension stays fixed.

MLP Block (SwiGLU)

LLAMA2’s MLP uses SwiGLU:

\text{MLP}(x) = W_\text{down}\!\bigl(\text{SiLU}(W_\text{gate}\, x) \odot W_\text{up}\, x\bigr)

Both $W_\text{gate}$ and $W_\text{up}$ read from the layer- $l$ stream:

\tilde{W}_\text{gate} = (W_\text{gate}\, Q_l^T)[\text{:, :k}], \qquad \tilde{W}_\text{up} = (W_\text{up}\, Q_l^T)[\text{:, :k}]

$W_\text{down}$ writes to the layer- $(l+1)$ stream:

\tilde{W}_\text{down} = (Q_{l+1}\, W_\text{down})[\text{:k, :}]

The intermediate dimension $d_\text{ff}$ (≈ $8d/3$ for SwiGLU in LLAMA2) is not sliced in the basic algorithm. Slicing $d_\text{ff}$ would require an additional PCA pass over post-nonlinearity activations and is left for future work.

Embedding and LM Head

The token embedding table $E \in \mathbb{R}^{V \times d}$ maps discrete token IDs to the residual stream at position 0. It must be aligned with the layer-0 basis $Q_0$ :

\tilde{E} = (E Q_0^T)[\text{:, :}k_0]

The output LM head $W_\text{lm} \in \mathbb{R}^{V \times d}$ reads from the final residual stream (position $L$ ):

\tilde{W}_\text{lm} = (W_\text{lm}\, Q_L^T)[\text{:, :}k_L]

After these transformations, the model is fully self-consistent. A $k$ -dimensional residual stream flows from the embedding table through all $L$ layers to the LM head with no mismatch.

Figure 3: Component-Level Slicing Map

flowchart LR
    RS_l["Residual stream l\ndim: k_l = 0.75d"] --> WQ["W_Q Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> WK["W_K Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> WV["W_V Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> Wg["W_gate Q_l^T [:, :k]\ndim: d_ff × k_l"]
    RS_l --> Wu["W_up Q_l^T [:, :k]\ndim: d_ff × k_l"]
    WQ & WK & WV --> Attn["Attention\n(internal d_h unchanged)"]
    Attn --> WO["Q_{l+1} W_O [:k, :]\ndim: k_{l+1} × d_h"]
    Wg & Wu --> MLP["SiLU / GeLU\n(d_ff unchanged)"]
    MLP --> Wd["Q_{l+1} W_down [:k, :]\ndim: k_{l+1} × d_ff"]
    WO & Wd --> RS_l1["Residual stream l+1\ndim: k_{l+1} = 0.75d"]

Figure 3: Every weight that reads from the residual stream has its input dimension sliced from d to k. Every weight that writes to the stream has its output dimension sliced. Internal dimensions (d_h, d_ff) are unchanged.
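The slicing map translates into a simple parameter count per block. The sketch below counts only the seven projection matrices of one SwiGLU block, using LLAMA2-7B-like sizes as an assumption; biases, normalization parameters, embeddings, and any extra matrices on the skip path are ignored.

```python
def block_params(d_stream, d_attn, d_ff):
    """Projection weights of one SwiGLU block; everything else is ignored."""
    attn = 3 * d_attn * d_stream + d_stream * d_attn    # W_Q, W_K, W_V read; W_O writes
    mlp = 2 * d_ff * d_stream + d_stream * d_ff         # W_gate, W_up read; W_down writes
    return attn + mlp

d, d_ff, s = 4096, 11008, 0.25      # LLAMA2-7B-like sizes, no grouped-query attention
k = round(d * (1 - s))

dense = block_params(d, d, d_ff)
sliced = block_params(k, d, d_ff)   # only the stream-facing dimension shrinks
ratio = sliced / dense
```

Because every matrix loses exactly one dimension, `ratio` comes out as `k / d`.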

Calibration: Practical Details

Dataset and Scale

SliceGPT uses 256 sequences of 2048 tokens from C4 (≈524K tokens total). The authors confirm that:

C4 and Wikitext-2 give essentially identical results (PCA basis is data-distribution-insensitive within natural text)
128 sequences is sufficient; 512 provides marginal improvement
The calibration needs only inference-mode forward passes — no gradients, no optimizer state

For LLAMA2-70B at 8192 dimensions, each covariance matrix is $8192 \times 8192 = 67M$ entries (256 MB in FP32). With 80 layers, the total covariance storage is ~20GB — manageable on a single A100.
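The storage arithmetic above, spelled out in binary units (MiB and GiB), which is the convention the 256 MB and ~20GB figures correspond to:

```python
d, n_layers, bytes_fp32 = 8192, 80, 4

entries = d * d                                  # entries in one d x d covariance
per_layer_mib = entries * bytes_fp32 / 2**20     # FP32 storage for one layer
total_gib = n_layers * per_layer_mib / 1024      # all layers held at once
```

Holding all covariances simultaneously is a worst case; a layer-by-layer scan only needs one at a time.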

Explained Variance Ratio

After running PCA at layer $l$ , the explained variance ratio (EVR) at width $k$ is:

\text{EVR}(k, l) = \frac{\sum_{i=1}^{k} \lambda_i^{(l)}}{\sum_{i=1}^{d} \lambda_i^{(l)}}

For LLAMA2-70B at $k = 0.75d$ , EVR typically exceeds 99.5% at early/middle layers and drops slightly (to ~98.5%) at the final layers. This quantifies how much activation energy is preserved after slicing.
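A direct implementation of the EVR formula is short. In the sketch below the power-law spectrum is a hypothetical stand-in chosen for illustration, not eigenvalues measured from any model.

```python
import numpy as np

def explained_variance_ratio(eigvals, k):
    """Fraction of total activation energy kept by the top-k eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    return lam[:k].sum() / lam.sum()

# Hypothetical spectrum with power-law decay, not taken from the paper.
lam = 1.0 / np.arange(1, 1025) ** 2
evr_75 = explained_variance_ratio(lam, k=768)    # keep 75% of 1024 directions
evr_all = explained_variance_ratio(lam, k=1024)  # keeping everything gives 1
```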

Non-Uniform Sparsity Allocation

A more principled variant optimizes per-layer $k_l$ subject to a global parameter budget:

\min_{k_0, \ldots, k_L} \sum_{l=0}^{L-1} \underbrace{\sum_{i=k_l+1}^{d} \lambda_i^{(l)}}_{\text{truncation error}} \quad \text{s.t.} \quad \sum_{l=0}^{L-1} \text{params}(k_l) \le B

This can be solved greedily: sort layers by marginal truncation error per parameter removed and allocate budget accordingly. The paper reports that non-uniform allocation gives marginal improvement over uniform sparsity for large models; small models benefit more.
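One way to realize the greedy idea is with a priority queue over the next eigenvalue each layer would discard. The sketch below is an assumption-laden illustration rather than the paper's procedure: it assumes every removed dimension saves the same number of parameters in every layer, so the cost per parameter reduces to the eigenvalue itself, and the function name `greedy_widths` is hypothetical.

```python
import heapq
import numpy as np

def greedy_widths(spectra, total_to_remove):
    """spectra[l] holds layer l's eigenvalues, sorted descending.
    Repeatedly drop the cheapest remaining tail direction across all layers.
    Assumes total_to_remove does not exceed the number of removable directions."""
    widths = [len(lam) for lam in spectra]
    # Heap entries: (eigenvalue that would be discarded next, layer index).
    heap = [(lam[-1], l) for l, lam in enumerate(spectra)]
    heapq.heapify(heap)
    for _ in range(total_to_remove):
        _, l = heapq.heappop(heap)
        widths[l] -= 1
        if widths[l] > 1:                        # always keep at least one direction
            heapq.heappush(heap, (spectra[l][widths[l] - 1], l))
    return widths

d = 16
steep = 2.0 ** -np.arange(d)                     # fast-decaying spectrum
flat = np.linspace(1.0, 0.5, d)                  # slow-decaying spectrum
widths = greedy_widths([steep, flat], total_to_remove=8)
```

In this toy case all eight removals come from the steep-spectrum layer, since each of its tail eigenvalues is smaller than anything in the flat layer.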

Experiments and Results

Experimental Setup

Models evaluated: LLAMA2-7B, 13B, 70B; OPT-13B, 30B, 66B; Phi-2 (2.7B)
Calibration data: 256 × 2048 tokens from C4
Evaluation benchmark: EleutherAI LM-Eval-Harness, 7 zero-shot tasks (WinoGrande, HellaSwag, PIQA, ARC-easy, ARC-challenge, OpenBookQA, BoolQ)
Baselines: SparseGPT (50% unstructured), Wanda (50% unstructured), LLM-Pruner (20% structured)

Table 1: Zero-Shot Accuracy at 25% Parameter Reduction

Model	Dense Acc.	SliceGPT Acc.	Retained	Dense GPUs	Sliced GPUs
LLAMA2-7B	64.0%	58.2%	91.0%	1×A100	1×A100
LLAMA2-13B	66.8%	61.4%	91.9%	2×A100	1×A100
LLAMA2-70B	70.4%	69.8%	99.1%	4×A100	2×A100
OPT-66B	66.7%	66.1%	99.1%	4×A100	2×A100
Phi-2 (2.7B)	71.2%	63.9%	89.7%	1×RTX3090	1×RTX3090

Key observation: Large models (≥66B) tolerate slicing far better than small models. LLAMA2-70B loses only 0.6 percentage points at 25% compression — within noise on individual tasks. Small models like Phi-2 lose ~7 points, reflecting their lower redundancy.

Table 2: Perplexity on Wikitext-2 (lower is better)

Model	Dense	SliceGPT 20%	SliceGPT 25%	SparseGPT 50%
LLAMA2-7B	5.47	5.82	6.82	6.51
LLAMA2-13B	4.88	5.12	5.72	5.40
LLAMA2-70B	3.32	3.40	3.52	3.51
OPT-66B	9.34	9.55	9.80	9.76

At 25% structural reduction, SliceGPT is competitive with SparseGPT at 50% unstructured sparsity on 66–70B models. On smaller models, unstructured sparsity has a slight perplexity edge.

Compute and GPU Reduction

For LLAMA2-70B at $s = 0.25$ :

\text{cost}_\text{sliced} / \text{cost}_\text{dense} \approx k/d = 0.75 \quad \text{(per projection, one dimension sliced)}

Empirically, the paper measures 64–66% of dense inference compute. That is an end-to-end measurement, so it need not match a per-projection estimate. The model also fits on half the GPU count:

Dense: 4×A100-40GB required
Sliced: 2×A100-40GB sufficient

This is a practical infrastructure saving: half the hardware cost for 99% of the task performance.

Fine-Tuning Recovery

One epoch of LoRA fine-tuning after slicing:

LLAMA2-7B: ~2 points recovered (58.2% → 60.1%)
LLAMA2-70B: ~0.2 points recovered (already near-dense quality)
Phi-2: ~3 points recovered (63.9% → 67.1%)

Fine-tuning is most beneficial where the initial accuracy drop is largest (small models, high sparsity).

Figure 4: Performance vs. GPU Count (LLAMA2-70B)

flowchart LR
    subgraph DenseSetup["Dense LLAMA2-70B"]
        G1["GPU 1\n16.7B params"] & G2["GPU 2\n16.7B params"] & G3["GPU 3\n16.7B params"] & G4["GPU 4\n16.7B params"]
        G1 & G2 & G3 & G4 --> Perf1["70.4% zero-shot\n100% FLOPs"]
    end
    subgraph SlicedSetup["SliceGPT LLAMA2-70B (s=0.25)"]
        SG1["GPU 1\n~26B params"] & SG2["GPU 2\n~26B params"]
        SG1 & SG2 --> Perf2["69.8% zero-shot\n64% FLOPs\n(-0.6 pts, -2 GPUs)"]
    end

Figure 4: SliceGPT halves the GPU count for LLAMA2-70B inference while losing only 0.6 percentage points on zero-shot benchmarks.

Comparison to Prior Work

Table 3: Structural Compression Method Comparison (LLAMA2-70B, Wikitext-2 PPL)

Method	Type	Param Reduction	PPL	Custom Kernel	GPU Savings
Dense	—	0%	3.32	No	—
SparseGPT	Unstructured	~50%	3.51	Yes	No
Wanda	Unstructured	~50%	3.53	Yes	No
LLM-Pruner	Structural	~20%	5.3	No	Partial
SliceGPT	Structural	25%	3.52	No	4→2 GPUs

SliceGPT is the only method in this table that simultaneously achieves: no custom kernels, actual GPU count reduction, and sub-4.0 perplexity at meaningful compression.

Why Computational Invariance is More Principled Than Magnitude-Based Pruning

Most structural pruning methods select neurons/channels to prune by magnitude, gradient, or Taylor expansion, all of which are heuristics. SliceGPT instead:

Applies a basis change that is optimal for truncation: among orthogonal bases, the PCA rotation minimizes the reconstruction error of the calibration activations after keeping $k$ coordinates (the Eckart–Young theorem)
Discards only directions that are demonstrably low-variance in the calibration distribution
Introduces zero approximation error through the basis change itself; only the truncation is lossy

This makes the per-layer activation error computable in advance: it is the PCA reconstruction error (the sum of discarded eigenvalues), which can be used to set the sparsity budget. It does not by itself bound the error in the model's final outputs.

Figure 5: Taxonomy of Post-Training LLM Compression

flowchart TD
    root["Post-Training LLM Compression"]
    root --> Quant["Quantization\nGPTQ · AWQ · SmoothQuant · LLM.int8"]
    root --> Unstruct["Unstructured Sparsity\nSparseGPT · Wanda · Magnitude"]
    root --> Struct["Structural Pruning\nSliceGPT · LLM-Pruner · ShortGPT · FLAP"]
    root --> LowRank["Low-Rank Decomposition\nSVD-LLM · ASVD · TrLoRA"]
    Struct -->|"Theoretical basis:\ncomputational invariance + PCA"| SliceGPT_node["SliceGPT\n(this paper)"]

Figure 5: SliceGPT sits in the structured pruning quadrant, uniquely backed by a theoretical invariance argument rather than a magnitude-based heuristic.

Limitations and Boundary Conditions

Scale Dependence is Fundamental

The 99% retention at 25% sparsity holds only for 60B+ parameter models. At 7B:

Performance drop: ~6 points
Explanation: smaller models have less redundancy (the eigenvalue spectra of their activation covariances are flatter, so variance is spread across many directions rather than concentrated in a few)

This is not a bug but a fundamental property: SliceGPT exploits over-parameterization. Models below ~13B are not sufficiently over-parameterized for 25% slicing to be near-lossless.

Intermediate MLP Dimension Untouched

The basic algorithm only slices the residual stream dimension $d$ . The MLP intermediate dimension $d_\text{ff} = 8d/3$ (SwiGLU LLAMA2) is preserved. This means:

The $W_\text{gate}$ and $W_\text{up}$ matrices change from $d_\text{ff} \times d$ to $d_\text{ff} \times k$ : savings proportional to $(d - k)/d = s$
The $W_\text{down}$ changes from $d \times d_\text{ff}$ to $k \times d_\text{ff}$ : same savings

But the element-wise nonlinearity and the intermediate activations still occupy $d_\text{ff}$ dimensions. For models where the MLP dominates (e.g., MoE models), this limits the FLOP savings.

Benchmark Narrowness

All evaluations use short-answer, zero-shot classification tasks. No results are reported for:

Code generation (HumanEval, MBPP)
Mathematical reasoning (GSM8K, MATH)
Long-form instruction following (MT-Bench, AlpacaEval)
Long-context tasks (SCROLLS, LongBench)
Multilingual benchmarks

These tasks may be more sensitive to residual stream dimension reduction, particularly long-context tasks where the model must maintain a rich information state across many tokens.

Narrow Architecture Coverage at Large Scale

The 70B-scale experiments cover only LLAMA2 and OPT. Other modern large models — Falcon-180B, Mixtral-8×7B (MoE), GPT-NeoX-20B — are not evaluated. MoE architectures are especially interesting since their routing mechanism interacts with the residual stream in non-trivial ways.

Non-Linear Components Limit Invariance

Computational invariance does not require the element-wise $\phi$ to commute with the rotation; in general $Q\,\phi(x) \ne \phi(Qx)$ . It holds because $\phi$ is applied to the intermediate vector $W_1 x$ , not to the residual stream: the rotation cancels inside the product $(W_1 Q^T)(Qx)$ before $\phi$ is reached. If an element-wise nonlinearity were applied directly to the residual stream (as in some architectures), the invariance would break.

For standard transformer attention, the softmax is applied to the attention scores $QK^T/\sqrt{d_k}$ (here $Q$ and $K$ are the query and key matrices, not the rotation), not to the residual stream. The scores are computed in the per-head space (dimension $d_\text{head}$ ), which is not sliced, so the attention softmax is handled correctly.

Critical Assessment: Weaknesses & Improvements

(a) Weaknesses and Flaws

The “25% parameter reduction” framing is imprecise. SliceGPT reduces the hidden dimension from $d$ to $k = 0.75d$ . Weight matrices of shape $d_\text{out} \times d$ become $d_\text{out} \times k$ — one dimension changes. For a transformer with weight matrices of shape $d \times d$ , the parameter reduction per matrix is $(d^2 - dk)/d^2 = s = 25\%$ . But the MLP’s $d_\text{ff} \times d$ matrices only shrink in one of their two dimensions, and the embedding table $V \times d$ is very large. The net total parameter reduction depends on the model’s dimension ratios. The paper reports “up to 25% of model parameters including embeddings” — the “up to” qualifier deserves more prominence, and Table 2 in the paper shows exact per-model figures that vary meaningfully.

Perplexity comparison at different operating points. SliceGPT at 25% structural sparsity is compared against SparseGPT/Wanda at 50% unstructured sparsity. The paper frames this as “competitive,” but these methods are not at the same compute reduction point. SparseGPT at 50% unstructured sparsity has half the nonzero weights but, without custom sparse kernels, no latency benefit, while SliceGPT at 25% structural sparsity cuts measured inference compute by roughly a third (to 64–66% of dense) and delivers an actual latency benefit. A fair comparison would match on actual measured throughput (tokens/second) at the same hardware budget, not on nominal parameter counts.

No systematic study of calibration domain. The paper states that C4 and Wikitext-2 give similar results (one comparison) but provides no broader study. For practitioners deploying SliceGPT on domain-specific models (medical, legal, code), it is unknown whether calibrating with C4 is adequate or whether domain-matched calibration data is necessary. This is a practical gap.

No latency measurements on realistic inference workloads. The paper reports FLOP counts and mentions running on fewer GPUs, but does not report actual tokens/second at various batch sizes. For memory-bandwidth-bound regimes (small batch sizes), FLOP reduction does not directly translate to latency reduction. The “faster” claim, while plausible, is not fully substantiated.

Limited fine-tuning analysis. The paper briefly mentions 1-epoch LoRA fine-tuning but does not explore: how much recovery is possible with more compute (3–5 epochs), what training data is optimal, or whether full fine-tuning outperforms LoRA for recovery.

(b) Limitations the Authors Understate

KV-cache size is unaffected in the basic algorithm. Since per-head dimension $d_\text{head}$ is not sliced (only the input dimension $d$ of $W_K, W_V$ changes), the K and V vectors output by the projections still have dimension $d_\text{head}$ per head. The KV cache size is therefore unchanged. For long-context inference where KV cache is the primary memory bottleneck, SliceGPT provides no direct benefit. The paper does not acknowledge this.
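The point can be made concrete with the standard KV-cache size formula, in which the stream width $d$ does not appear at all. The sketch below uses LLAMA2-70B-style settings (80 layers, 8 KV heads under GQA, head dimension 128) with FP16 storage; the sequence length and batch size are arbitrary choices for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """K and V tensors for every layer; the stream width d is not an input."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# LLAMA2-70B-style settings, FP16, one sequence of 4096 tokens.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)
cache_gib = cache / 2**30
```

Slicing $d$ from 8192 to 6144 leaves every argument of `kv_cache_bytes` unchanged, so the cache footprint is the same before and after compression.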

Tensor-parallel sharding may be complicated. The reduced hidden dimension $k = (1-s)d$ may not be evenly divisible by the number of GPUs in tensor-parallel settings. For LLAMA2-70B at $s = 0.25$ : $d = 8192$ and $k = 0.75 \times 8192 = 6144 = 2^{11} \times 3$ , which is still divisible by every power-of-two tensor-parallel degree up to 2048, so this configuration is unproblematic. Other sparsity ratios can give a $k$ that does not divide evenly, in which case padding or irregular sharding is needed.

Weight materialization overhead during compression. During compression, both the original weight $W$ and the transformed $W Q_l^T$ must be held in memory simultaneously. LLAMA2-70B has 70B parameters, about 140 GB at FP16, so materializing a second full copy would transiently require ~280 GB, most of the 320 GB that 4×A100-80GB provide before activations and covariance matrices are counted. The paper reports that 4×A100-80GB is sufficient but does not detail how this is managed (likely layer-by-layer with careful memory management).

(c) Concrete Improvement Suggestions

1. Slice the MLP intermediate dimension. Apply PCA to the post-activation intermediate activations (the vector after the GeLU/SiLU) and additionally reduce $d_\text{ff}$ . This requires two PCA passes per layer (one at the residual stream, one at the MLP intermediate) but would provide proportional FLOP savings across all weight matrices. Expected benefit: at $s = 0.25$ on both $d$ and $d_\text{ff}$ , total FLOPs reduce to $\sim(0.75)^2 = 56\%$ rather than the current ~64–66%.

2. Non-uniform allocation with validation-loop tuning. Use a small held-out set (16–32 sequences) to measure actual perplexity impact of slicing each layer independently. Protect layers that show large perplexity sensitivity (typically layers near the input and output) and aggressively slice middle layers. Gradient-free black-box optimization (CMA-ES or a greedy scan) over the $\{k_l\}$ schedule could substantially improve the accuracy-compression trade-off without additional inference-time compute; the search itself adds a one-off cost at compression time.

3. Evaluate on reasoning and code benchmarks. Add HumanEval (code generation), GSM8K (math), and MT-Bench (instruction following) to the evaluation suite. If SliceGPT degrades disproportionately on these tasks — which require multi-step precision — this should be transparently reported, with per-task analysis of which tasks are most sensitive to $d$ reduction.

4. Combine with quantization and measure jointly. Apply AWQ or GPTQ after SliceGPT and compare against AWQ/GPTQ alone on the same hardware. If SliceGPT + INT4 achieves better throughput than INT4 alone at similar accuracy, that is a compelling deployment story the paper misses. The combination is natural (slicing reduces the matrix sizes before quantization) but unexplored.

5. Measure KV-cache impact of slicing the head dimension. Extend the algorithm to also apply PCA on the per-head key/value activations (a separate PCA within each attention head) and slice $d_\text{head}$ . This would reduce KV cache memory proportionally, which is critical for long-context serving. This is a non-trivial extension but directly addresses the KV-cache limitation identified above.

Deep Dive: SliceGPT vs. Low-Rank Matrix Decomposition

It is easy to conflate SliceGPT with weight-level low-rank decomposition methods (e.g., SVD-LLM, ASVD). Both involve SVD and both produce smaller weight matrices. The difference is conceptual and has practical consequences.

Low-Rank Decomposition of Weights (What SliceGPT is NOT)

Conventional low-rank compression approximates each weight matrix individually:

$$W \approx U_k \Sigma_k V_k^T$$

where $U_k \in \mathbb{R}^{m \times k}$ , $V_k^T \in \mathbb{R}^{k \times n}$ . This replaces one $m \times n$ matrix with two smaller ones: the computation changes from $y = Wx$ to $y \approx U_k (\Sigma_k V_k^T x)$ .

Problems with weight-level SVD:

Each matrix is approximated independently, ignoring that the approximation errors of consecutive layers accumulate through the residual stream
The approximation is in the weight space — the truncated directions in $W$ may not correspond to directions that the activations actually occupy
The two smaller matrices ($U_k$, $V_k^T$) both need to be stored and multiplied, costing $k(m+n)$ multiply-accumulates instead of $mn$; unless $k < mn/(m+n)$ (for a square matrix, $k < n/2$), the inference cost actually increases, on top of the overhead of two separate GEMM calls
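The cost argument in the last bullet can be checked by simple counting. This is a back-of-the-envelope sketch with hypothetical function names; it counts multiply-accumulates per input vector and assumes $\Sigma_k$ is folded into one of the two factors.

```python
def dense_macs(m: int, n: int) -> int:
    """Multiply-accumulates for y = W x with W of shape (m, n)."""
    return m * n


def factored_macs(m: int, n: int, k: int) -> int:
    """Cost of y = U_k (S_k V_k^T x): a (k, n) product, then an (m, k) product."""
    return k * n + m * k


def break_even_rank(m: int, n: int) -> int:
    """Largest rank at which the factored form is no more expensive than dense."""
    return (m * n) // (m + n)


m = n = 4096
for k in (1024, 2048, 3072):
    print(k, factored_macs(m, n, k) / dense_macs(m, n))

# Slicing one side of the same matrix to width 3072 keeps a single (m, 3072) matrix.
print("sliced", dense_macs(m, 3072) / dense_macs(m, n))
```

For a square 4096 × 4096 matrix the factored form only pays off below rank 2048, whereas slicing one side to 3072 columns reduces the cost to 75% of dense with a single GEMM.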

What SliceGPT Actually Does

SliceGPT applies SVD/PCA to the activations, not the weight matrices. The key mathematical distinction:

Step 1 (SliceGPT): Find $Q_l$ such that the activations $A_l$ have maximum variance in the first $k_l$ coordinates.

Step 2 (SliceGPT): Rotate all weights that touch position $l$ to be consistent with this new basis. This step is exact (computational invariance).

Step 3 (SliceGPT): Truncate the last $d - k_l$ coordinates. This step is the only approximation, and the mean squared reconstruction error of the activations equals the sum of the discarded eigenvalues.

The resulting weights are single matrices (not product pairs): a matrix that was $m \times d$ becomes $m \times k$ — one matrix, not two. This is why inference computation actually decreases rather than just being rearranged.
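The three steps can be sketched on a single linear layer with NumPy. This is a toy illustration under stated assumptions (synthetic activations with a decaying spectrum, one weight matrix reading the residual stream, no normalization layers or skip connections), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k, n_tokens = 16, 8, 12, 512

# Synthetic activations with a decaying spectrum (stand-in for calibration data).
scales = np.linspace(3.0, 0.01, d)
mix, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = (rng.standard_normal((n_tokens, d)) * scales) @ mix.T  # rows are activations x
W = rng.standard_normal((m, d))                            # a layer reading the stream

# Step 1: PCA basis. Rows of Q are eigenvectors sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n_tokens)
order = np.argsort(eigvals)[::-1]
Q = eigvecs[:, order].T

# Step 2: rotate. (W Q^T)(Q x) == W x exactly, because Q^T Q = I.
W_rot = W @ Q.T
X_rot = X @ Q.T
exact = np.allclose(X_rot @ W_rot.T, X @ W.T)

# Step 3: slice. Keep the first k coordinates of both weights and activations.
W_sliced = W_rot[:, :k]
X_sliced = X_rot[:, :k]
rel_err = np.linalg.norm(X_sliced @ W_sliced.T - X @ W.T) / np.linalg.norm(X @ W.T)

print(exact, W_sliced.shape, rel_err)
```

The rotated layer reproduces the original outputs to floating-point precision, and the sliced layer is a single `(m, k)` matrix whose output error comes only from the discarded low-variance coordinates.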

Formal Comparison of Error Sources

Low-rank SVD of weight $W$ :

$$\text{Error} = \|W - U_k \Sigma_k V_k^T\|_F = \sqrt{\sum_{i>k} \sigma_i(W)^2}$$

This error is in weight space and may not reflect what activations actually use.
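The identity above (the Eckart–Young result for the Frobenius norm) is easy to verify numerically. The matrix below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))
k = 8

# Rank-k truncation from the thin SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

frob_err = np.linalg.norm(W - W_k, "fro")
predicted = np.sqrt(np.sum(S[k:] ** 2))  # root of the discarded squared singular values

print(frob_err, predicted)
```

Nothing in this quantity depends on the activations, which is the sense in which the error lives purely in weight space.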

SliceGPT truncation at layer $l$ :

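$$\text{Error}_l = \sum_{x_l} \left\| W_\text{out} Q_l x_l - W_\text{out} Q_l^{(k)} x_l \right\|^2 \le C \sum_{i > k_l} \lambda_i^{(l)}$$

where the sum runs over the calibration activations $x_l$ (not over layers), $Q_l^{(k)}$ is read here as $Q_l$ with all but its first $k_l$ output coordinates zeroed, $\lambda_i^{(l)}$ are eigenvalues of the activation covariance, and $C$ depends only on the norm of $W_\text{out}$. This error is in activation space, directly measuring how much of the actual runtime information is discarded.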

The activation-space error bound is tighter and more meaningful for downstream task performance, because it directly measures how much the model “sees” in the directions being removed.
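A short numerical check of the activation-space statement, using synthetic activations. The choice $C = \|W_\text{out}\|_2^2$ (squared spectral norm) is an assumption made for this sketch; the review does not specify the constant. The covariance here is the uncentered second moment averaged over tokens.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d, k = 1000, 10, 6

# Synthetic activations whose coordinates have decreasing scale.
X = rng.standard_normal((n_tokens, d)) * np.linspace(2.0, 0.1, d)

cov = X.T @ X / n_tokens
eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # make it descending

# Projector onto the top-k principal directions, and what it throws away.
P = eigvecs[:, :k] @ eigvecs[:, :k].T
residual = X - X @ P

mean_sq_err = np.sum(residual ** 2) / n_tokens
discarded = eigvals[k:].sum()

# Propagate the discarded part through a downstream weight matrix.
W_out = rng.standard_normal((5, d))
out_err = np.sum((residual @ W_out.T) ** 2) / n_tokens
C = np.linalg.norm(W_out, 2) ** 2  # assumed constant: squared spectral norm

print(mean_sq_err, discarded, out_err, C * discarded)
```

The first two printed values agree, and the downstream error stays below `C * discarded`, which is the shape of the bound stated above.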

Figure 6: SliceGPT vs. Weight-Level Low-Rank Decomposition

flowchart TB
    subgraph WeightSVD["Weight-Level SVD (e.g., SVD-LLM)"]
        W1["W ∈ ℝ^{m×d}"] -->|"SVD truncation"| UV["U_k (m×k)\n× Σ_k V_k^T (k×d)\nTwo matrices"]
        UV --> Err1["Error: ‖W − U_kΣ_kV_k^T‖_F\n(in weight space)"]
    end
    subgraph SliceGPT_Diag["SliceGPT (activation-space)"]
        W2["W ∈ ℝ^{m×d}"] -->|"Rotate: W Q_l^T"| WQ["W Q_l^T ∈ ℝ^{m×d}\n(exact, zero error)"]
        WQ -->|"Slice: keep cols 1:k"| Wk["W Q_l^T [:, :k] ∈ ℝ^{m×k}\nOne matrix"]
        Wk --> Err2["Error: Σ_{i>k} λᵢ(activation covariance)\n(in activation space, tighter bound)"]
    end

Figure 6: Weight-level SVD produces a product of two matrices and measures error in weight space. SliceGPT produces a single smaller matrix and measures error in activation space — directly bounding the impact on runtime behavior.

Sparsity Scaling Behavior

How Does Accuracy Degrade as Sparsity Increases?

Understanding the accuracy-sparsity curve is critical for practitioners choosing the operating point. The paper reports results at 20% and 25% sparsity for some models, and at 30%+ for others. The qualitative pattern is:

0–10% sparsity: Nearly zero accuracy loss for all model sizes. The high-variance PCA directions are far more important than the low-variance ones; removing only the tail is almost free.
10–20% sparsity: Negligible loss for 70B+, small loss (~1–2 points) for 7–13B. Still practically useful.
25% sparsity: The “sweet spot” for large models — 99% retention at 70B. For 7B, the 6-point loss becomes noticeable.
30%+ sparsity: Accuracy drops accelerate nonlinearly. The eigenvalue spectrum decays rapidly but never reaches zero; at high sparsity, directions with meaningful variance are being removed.

This nonlinear degradation pattern matches the mathematical prediction: the truncation error grows slowly at first (low eigenvalues discarded) and then quickly (eigenvalues with non-trivial variance begin to be discarded).

Per-Layer Eigenvalue Spectra

Analyzing the eigenvalue spectra of $A_l A_l^T$ at different layers reveals:

Early layers (l = 0–10): Relatively flat spectra (activations use many directions roughly equally). These layers are harder to compress and benefit most from non-uniform sparsity (lower $s$ ).
Middle layers (l = 10–60 for 70B): Steep spectra, very high EVR even at $k = 0.5d$ . High redundancy.
Final layers (l > 60): Moderate spectra. The LM head needs to distinguish many different token predictions, requiring more dimensions.

This layered structure explains why uniform sparsity works well on average but non-uniform allocation (protecting early and late layers) can unlock better accuracy at the same compute budget.
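One way to turn per-layer spectra into a non-uniform schedule is a variance-only greedy heuristic: repeatedly drop the direction with the smallest normalized eigenvalue across all layers until a total width budget is met. This is an illustrative assumption, not the paper's allocation rule, and it does not measure perplexity; the function names and toy spectra are hypothetical.

```python
import heapq

import numpy as np


def explained_variance_ratio(eigvals, k: int) -> float:
    """Fraction of total variance captured by the k largest eigenvalues."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return float(vals[:k].sum() / vals.sum())


def greedy_widths(spectra, total_keep: int) -> list:
    """Per-layer widths k_l summing to total_keep, chosen by dropping the
    globally smallest normalized eigenvalue one direction at a time."""
    if total_keep < len(spectra):
        raise ValueError("total_keep must leave at least one direction per layer")
    specs = [np.sort(np.asarray(s, dtype=float))[::-1] / np.sum(s) for s in spectra]
    widths = [len(s) for s in specs]
    # Heap entries: (smallest retained normalized eigenvalue, layer index).
    heap = [(float(s[-1]), layer) for layer, s in enumerate(specs)]
    heapq.heapify(heap)
    for _ in range(sum(widths) - total_keep):
        _, layer = heapq.heappop(heap)
        widths[layer] -= 1
        if widths[layer] > 1:  # never slice a layer below width 1
            heapq.heappush(heap, (float(specs[layer][widths[layer] - 1]), layer))
    return widths


flat = np.ones(8)                 # early-layer-like: all directions equally used
steep = 2.0 ** -np.arange(8)      # middle-layer-like: rapidly decaying spectrum

print(explained_variance_ratio(flat, 4), explained_variance_ratio(steep, 4))
print(greedy_widths([flat, steep], total_keep=12))
```

On this toy input the heuristic leaves the flat-spectrum layer untouched and takes all four removed directions from the steep-spectrum layer, mirroring the "protect early layers, slice middle layers" pattern described above.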

Interaction with Model Architecture Variants

Tied embeddings: Some models tie the input embedding $E$ and output LM head $W_\text{lm}$ . SliceGPT’s treatment of these as separate matrices would break the tie. The codebase handles this by only transforming one of them and re-tying after compression.

Rotary Position Embeddings (RoPE): LLAMA2 uses RoPE for positional encoding. RoPE is applied to Q and K after the projections, operating in the per-head space (dimension $d_\text{head}$ ). Since SliceGPT does not change $d_\text{head}$ , RoPE is unaffected.

ALiBi Positional Biases (OPT): Additive biases in attention scores, again in the per-head space. Unaffected by residual-stream slicing.

Reproducibility Notes

Code: github.com/microsoft/TransformerCompression (MIT license)
Calibration data: HuggingFace allenai/c4 English subset; 256 × 2048 tokens; takes ~30 min to preprocess
Compression runtime: ~1–2 hours on 4×A100-80GB for LLAMA2-70B; single-GPU is sufficient for ≤13B models
Evaluation: EleutherAI LM-Eval-Harness v0.3+; 7-task zero-shot average
Determinism: Fully deterministic given fixed calibration sequence order; no randomness after calibration sampling
Dependencies: PyTorch ≥ 2.0, transformers, datasets, scipy.linalg.eigh (for covariance eigendecomposition)
Memory for compression: Requires holding full-precision weights plus one layer’s covariance matrix at a time; ~160GB peak for LLAMA2-70B

The algorithm is straightforward: ~200 lines of PyTorch to implement from scratch, making SliceGPT one of the most accessible papers in post-training compression for pedagogical purposes.

Summary: Design Decisions at a Glance

Before concluding, here is a quick reference capturing SliceGPT’s key design choices and their implications:

Decision	Rationale	Open Gap
PCA basis (not random Q)	Minimizes activation reconstruction error (Eckart-Young optimal)	Requires calibration forward pass
Uniform sparsity by default	Simple; near-optimal for large models	Suboptimal for small models — non-uniform is better
Absorb RMSNorm into weights	Exact simplification; no extra ops at inference	Only works for diagonal-scale norms
Preserve $d_\text{ff}$	Avoids second PCA pass	Leaves MLP FLOP savings on the table
256-sequence calibration	Sufficient for stable PCA; low overhead	May be domain-sensitive for specialized models
Optional fine-tuning	Avoids training setup for large models	Small models benefit significantly from even 1 epoch

Each “Open Gap” row is a concrete future research direction. Together they sketch a roadmap for extending SliceGPT to higher compression ratios and broader deployment scenarios.

Conclusion

SliceGPT makes a clean theoretical contribution — the computational invariance theorem — and translates it directly into an engineering outcome: smaller, faster, hardware-agnostic transformer inference. The insight that an orthogonal basis change is transparent to the computation, and that PCA identifies the optimal basis for subsequent truncation, is both elegant and practically powerful.

At 70B scale, the method delivers compelling results: 25% compression with 99% task performance, halved GPU count, and 34–36% FLOP reduction — all without custom kernels. For practitioners deploying LLAMA2-70B or similar models, SliceGPT represents one of the most deployment-friendly compression options available.

The method’s limitations are equally clear: it is most effective for large (≥30B) models, has been validated primarily on short zero-shot classification tasks, leaves the MLP intermediate dimension untouched, and does not address the KV cache. These are not disqualifying limitations but they define the boundary conditions for when SliceGPT is the right tool.

For researchers building on this work, the most impactful next steps are: MLP intermediate slicing, per-head KV-cache reduction via head-dimension PCA, extended evaluation on reasoning and code tasks, and composability with quantization. The computational invariance theorem itself is a result worth studying independently — it may underpin future compression methods for other neural architectures beyond transformers.

Personal take: SliceGPT is one of the most pedagogically clean papers in post-training compression. The core insight fits in five lines of algebra, the code is minimal and well-commented, and the 70B results are genuinely impressive. Reading the computational invariance proof is time well spent for anyone working with transformer internals.

For deeper context:

SparseGPT (Frantar & Alistarh, ICML 2023): unstructured counterpart; compare their layer-wise reconstruction against SliceGPT’s PCA calibration
SVD-LLM (Wang et al., 2024): weight-level SVD for LLMs; contrasts with SliceGPT’s activation-space philosophy
ASVD (Yuan et al., 2023): activation-aware SVD, thematically closest to SliceGPT but at the weight level
QuIP# (Tseng et al., ICML 2024): uses random orthogonal incoherence transforms before quantization; the orthogonal-transform idea is mathematically related to SliceGPT’s basis change, applied to enable better quantization rather than slicing
LLM-Pruner (Ma et al., 2023): gradient-guided structural pruning; shows how heuristic-based methods compare in accuracy-compression trade-off
TransformerCompression (GitHub): official open-source implementation by Microsoft Research; clean, well-documented, actively maintained

Conceptual Reading Path

For a reader new to post-training compression, the recommended reading order is:

1. This paper (SliceGPT): start here to understand the theoretical framework
2. SparseGPT: see how unstructured methods handle the same calibration-based problem differently
3. SVD-LLM / ASVD: compare weight-space vs. activation-space SVD approaches directly
4. QuIP#: see how the same orthogonal-transform idea is applied in a quantization context
5. LLM-Pruner: understand gradient-guided structural pruning to appreciate SliceGPT’s calibration-only simplicity

This path builds a coherent mental model: the common thread across all these methods is using calibration data to guide compression decisions, with each method differing in what is compressed (weights, activations, structure) and how the calibration signal is used (Hessian, PCA, gradient).