Review date: 2026-06-12
Review author: Zhongzhu Zhou
Paper reviewed: SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Grangeiro Perez, Torsten Hoefler, James Hensman
arXiv: 2401.15024
Status / Venue: ICLR 2024 (accepted); Microsoft Research + ETH Zürich; 22 pages, 8 figures
Short Answer
SliceGPT proposes a post-training compression scheme built on a structural mathematical insight called computational invariance: any orthogonal change-of-basis applied simultaneously to consecutive weight matrices cancels out exactly, leaving the model’s outputs unchanged. The authors use PCA over calibration activations to find the basis in which the residual stream’s last few directions carry near-zero variance, then physically remove those rows and columns from the weight matrices. The result is a set of smaller, fully dense weight matrices that run faster on standard hardware with no custom CUDA kernels. At 25% parameter reduction, LLAMA2-70B retains 99% of its zero-shot performance while inference compute drops to 64–66% of the original.
Prerequisites
1. Transformer Architecture Fundamentals
A modern decoder-only transformer (GPT, LLAMA, OPT) is a stack of transformer blocks, each containing:
- RMS Layer Normalization — normalizes the residual stream by its RMS and scales by a learned vector
- Multi-Head Self-Attention — applies Q/K/V projections, scaled dot-product attention, and an output projection
- MLP / Feed-Forward Network — an up-projection, a pointwise nonlinearity (GeLU, SiLU), and a down-projection
- Residual connections — the output of every sub-block is added back to the input
The central data structure flowing through the network is the residual stream: a tensor of shape $(B, T, d)$ (batch, sequence length, model dimension), where $d$ is the model dimension (also called hidden size or embedding dimension). In LLAMA2-7B, $d = 4096$; in LLAMA2-70B, $d = 8192$.
Every linear layer in the transformer operates on this residual stream: it reads a vector from the stream, multiplies by a weight matrix, and either writes back to the stream (output projections) or produces an intermediate tensor (Q/K/V). The dimension $d$ is the bottleneck that SliceGPT targets.
2. Singular Value Decomposition (SVD)
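For any matrix $W \in \mathbb{R}^{m \times n}$, the SVD factorizes it as:
$$W = U \Sigma V^\top$$
where:
- $U \in \mathbb{R}^{m \times m}$: orthonormal left singular vectors (columns form an orthonormal basis of $\mathbb{R}^m$)
- $\Sigma \in \mathbb{R}^{m \times n}$: diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$
- $V \in \mathbb{R}^{n \times n}$: orthonormal right singular vectors
The Eckart–Young theorem gives the best rank-$k$ approximation:
$$W_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top = \arg\min_{\operatorname{rank}(\hat W) \le k} \|W - \hat W\|_F$$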
SliceGPT does not apply SVD directly to weight matrices (that would be ordinary low-rank compression). Instead it uses SVD to find the optimal change of basis for the activations — a conceptually different use of the same tool.
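The Eckart–Young statement can be checked numerically. The following is a minimal NumPy sketch on a random toy matrix (the shapes and the rank `k` are arbitrary choices for illustration, not values from the paper): the Frobenius error of the rank-`k` truncation should equal the root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))           # toy weight matrix (illustrative only)

# Thin SVD: U is 6x5, S has 5 singular values (descending), Vt is 5x5.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 2
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]      # best rank-k approximation

err = np.linalg.norm(W - W_k, "fro")      # approximation error in Frobenius norm
tail = np.sqrt((S[k:] ** 2).sum())        # root of the discarded squared singular values
```

This is the weight-space use of SVD; the review's point is that SliceGPT applies the same tool to activations instead.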
3. Principal Component Analysis (PCA) and Its Geometry
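Given a data matrix $X \in \mathbb{R}^{d \times N}$ whose columns are activation samples, PCA finds the orthogonal transformation $Q \in \mathbb{R}^{d \times d}$ such that the covariance of $QX$ is diagonal:
$$\frac{1}{N}(QX)(QX)^\top = Q\,C\,Q^\top = \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$$
with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$. The rows of $Q$ are the eigenvectors of the empirical covariance $C = \frac{1}{N} X X^\top$, sorted by descending eigenvalue. The eigenvalue $\lambda_i$ measures the variance of the activations in the $i$-th principal direction.
When the spectrum decays quickly, the last few coordinates of $Z = QX$ have variance $\lambda_i \approx 0$. These coordinates are effectively zero in every sample and carry almost no information, so discarding them is essentially lossless.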
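As a small illustration of this geometry, the sketch below builds synthetic activations confined to an `r`-dimensional subspace (an assumption made only so the effect is easy to see), computes the PCA rotation with `np.linalg.eigh`, and checks that the trailing coordinates in the rotated basis have essentially zero variance. The helper name `pca_basis` is hypothetical.

```python
import numpy as np

def pca_basis(X):
    """Return (Q, eigvals): rows of Q are eigenvectors, sorted by descending eigenvalue.

    X has shape (d, N); columns are activation samples (uncentered, as in the text).
    """
    N = X.shape[1]
    C = (X @ X.T) / N                      # empirical covariance, d x d
    eigvals, V = np.linalg.eigh(C)         # eigh: ascending eigenvalues, eigenvectors in columns
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    return V[:, order].T, eigvals[order]

rng = np.random.default_rng(0)
d, N, r = 8, 2000, 5
# Toy data: activations that live in an r-dimensional subspace of R^d.
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]   # d x r, orthonormal columns
X = basis @ rng.standard_normal((r, N))

Q, lam = pca_basis(X)
Z = Q @ X                                  # activations expressed in the PCA basis
coord_var = (Z ** 2).mean(axis=1)          # per-coordinate second moment
```

Here `coord_var` matches the eigenvalues, and its last `d - r` entries are at rounding-error level, which is the situation in which slicing is lossless.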
4. Orthogonal Matrices: The Key Algebraic Tool
A matrix $Q \in \mathbb{R}^{d \times d}$ is orthogonal if $Q^\top Q = Q Q^\top = I$. Its critical properties:
- Norm-preserving: $\|Qx\|_2 = \|x\|_2$ for all $x$ (orthogonal transforms are rigid rotations/reflections)
- Exact inverse: $Q^{-1} = Q^\top$ (cheap to invert)
- Exact identity insertions: $Q^\top Q = I$, so inserting $Q^\top Q$ anywhere in a product leaves it unchanged
The last property is the crux of SliceGPT. Inserting $Q^\top Q$ between two weight matrices changes the parameterization but not the computation, and choosing $Q$ wisely (via PCA) reveals low-variance directions that can be discarded.
5. Post-Training Compression: The Landscape
Post-training compression reduces model size or compute after training, using only forward passes on a small calibration dataset. Three main paradigms:
| Method | Strategy | Acceleration Mechanism | Custom Kernel? |
|---|---|---|---|
| Quantization (GPTQ, AWQ) | Reduce precision (FP16→INT4) | Less memory bandwidth | Partial (dequant.) |
| Unstructured Sparsity (SparseGPT, Wanda) | Zero individual weights | Sparse GEMM | Yes |
| Structured Compression (SliceGPT, LLM-Pruner) | Remove entire dimensions | Smaller dense GEMM | No |
SliceGPT is a structured method: it removes complete rows and columns, leaving matrices that are still dense but smaller. This means standard highly-optimized dense BLAS libraries (cuBLAS, oneDNN) work without modification.
6. Computational Complexity Preview
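For a transformer layer with residual-stream dimension $d$ and MLP intermediate dimension $d_{ff}$, per-layer compute per token is approximately:
$$\text{FLOPs}_{\text{layer}} \approx c_{\text{attn}}\, d^2 + c_{\text{mlp}}\, d\, d_{ff}$$
If SliceGPT reduces $d \to k$ with $s = 0.25$, then $k = 0.75d$ and the compute scales as $0.75^2 \approx 0.56$ for the $d^2$ terms and $0.75$ for the $d \cdot d_{ff}$ terms. The blended compute is approximately 64–66% of the dense model, matching the paper’s empirical measurements.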
What SliceGPT Does: Overview
SliceGPT (Ashkboos et al., Microsoft Research + ETH Zürich, ICLR 2024) makes three contributions:
Contribution 1: Computational invariance theorem. A formal proof that for any sequence of orthogonal matrices $Q_0, Q_1, \dots, Q_L$, there exists a reparameterization of every transformer weight matrix such that the model’s output is exactly preserved for all inputs.
Contribution 2: A principled slicing algorithm. Using PCA on calibration-data activations, the algorithm (a) identifies the optimal orthogonal basis at each layer, (b) rotates the weights into this basis, and (c) physically truncates the weight matrices by removing the last $d - k$ rows/columns (the directions with near-zero activation variance).
Contribution 3 — Hardware-native deployment. The sliced model consists only of smaller dense matrices, running on standard hardware without any new infrastructure, achieving actual latency and GPU-count reductions.
The Core Insight: Computational Invariance
Formal Derivation
Setup. Consider two consecutive linear operations separated by an element-wise nonlinearity (GeLU, SiLU, ReLU):
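$$y = W_2\, \varphi(W_1 x)$$
with $x \in \mathbb{R}^{d}$, $W_1 \in \mathbb{R}^{h \times d}$, $W_2 \in \mathbb{R}^{d \times h}$.
Step 1: Insert $Q^\top Q = I$.
For any orthogonal $Q \in \mathbb{R}^{d \times d}$:
$$y = W_2\, \varphi(W_1 Q^\top Q\, x)$$
Step 2: Re-parenthesize.
Define $\tilde W_1 = W_1 Q^\top$ and $\tilde x = Q x$. Then:
$$y = W_2\, \varphi(\tilde W_1 \tilde x)$$
The output is identical in exact arithmetic (in floating point it matches up to rounding error). The computation is invariant under the orthogonal reparameterization $W_1 \mapsto W_1 Q^\top$, $x \mapsto Qx$.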
Step 3: Propagate through the full residual stream.
The residual stream at layer $l$ carries $x_l$. Let all operations reading from position $l$ absorb $Q_l^\top$ on the right of their weight, and all operations writing to position $l+1$ absorb $Q_{l+1}$ on the left of their weight. Then:
- The stream at position $l$ now carries $Q_l x_l$ in the new parameterization
- Every consumer sees $(W Q_l^\top)(Q_l x_l) = W x_l$: unchanged output
- Every producer now produces $Q_{l+1} x_{l+1}$, which is the new stream at position $l+1$
Theorem (Computational Invariance, ICLR 2024): For any pretrained transformer $f_\theta$ and any sequence of orthogonal matrices $\{Q_l\}_{l=0}^{L}$, there exists a reparameterized transformer $f_{\tilde\theta}$ with $f_{\tilde\theta}(x) = f_\theta(x)$ for all inputs $x$.
This is an exact statement: no error, no approximation. The subsequent slicing (keeping only $k < d$ dimensions) introduces the only approximation.
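The invariance for the two-matrix setup above can be verified directly. This is a minimal sketch with random toy weights and a random orthogonal `Q` obtained from a QR decomposition (any orthogonal matrix would do; nothing here comes from the paper's code). It also shows that the nonlinearity itself does not commute with `Q`, which is why the rotation has to cancel before the nonlinearity is reached.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h = 16, 64
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((d, h))
x = rng.standard_normal(d)

# Random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

y_original = W2 @ silu(W1 @ x)

W1_rot = W1 @ Q.T                 # the reader absorbs Q^T on the right
x_rot = Q @ x                     # the stream now carries Qx
y_rotated = W2 @ silu(W1_rot @ x_rot)

max_diff = np.max(np.abs(y_original - y_rotated))   # rounding-error level

# The element-wise nonlinearity does NOT commute with a generic rotation.
commutes = np.allclose(silu(Q @ x), Q @ silu(x))
```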
Truncation Error Bound
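After choosing $Q_l$ to be the PCA matrix of calibration activations at layer $l$, the truncation error (mean squared norm of the discarded activation components) is bounded by the tail of the spectrum, with equality on the calibration set:
$$\varepsilon_l = \frac{1}{N}\sum_{n=1}^{N} \big\| (Q_l\, x_l^{(n)})_{k+1:d} \big\|_2^2 = \sum_{i=k+1}^{d} \lambda_i^{(l)}$$
where $\lambda_i^{(l)}$ is the $i$-th eigenvalue of the empirical covariance at layer $l$. For well-trained large models, the eigenvalue spectrum decays sharply (Zipfian-like), making $\varepsilon_l$ small even at modest $k$.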
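The identity between the truncation error and the discarded eigenvalue sum is easy to confirm on synthetic data. The sketch below uses made-up activations with a decaying per-direction scale (not real model activations) and compares the mean squared norm of the discarded coordinates with the tail of the spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, k = 12, 5000, 8

# Synthetic activations with decaying scale per latent direction (illustrative only).
scales = 1.0 / np.arange(1, d + 1) ** 2
mix = np.linalg.qr(rng.standard_normal((d, d)))[0]
X = mix @ (scales[:, None] * rng.standard_normal((d, N)))

C = (X @ X.T) / N
lam, V = np.linalg.eigh(C)                 # ascending order
lam, V = lam[::-1], V[:, ::-1]             # flip to descending order
Q = V.T                                    # rows are principal directions

Z = Q @ X
discarded = Z[k:, :]                       # coordinates removed by slicing to width k
truncation_error = (discarded ** 2).sum(axis=0).mean()
tail_sum = lam[k:].sum()
```

On the calibration set the two quantities agree to rounding error; on held-out data the tail sum is only an estimate.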
Figure 1: Computational Invariance Diagram
flowchart LR
subgraph Original["Original Parameterization"]
x1["x ∈ ℝᵈ"] --> W1["W₁ ∈ ℝ^{h×d}"]
W1 --> phi1["φ(·) element-wise"]
phi1 --> W2["W₂ ∈ ℝ^{d×h}"]
W2 --> y1["y ∈ ℝᵈ"]
end
subgraph Rotated["After inserting Q^T Q = I"]
x2["Qx ∈ ℝᵈ"] --> W1Q["W₁Q^T ∈ ℝ^{h×d}"]
W1Q --> phi2["φ(·) element-wise"]
phi2 --> W2b["W₂ ∈ ℝ^{d×h}"]
W2b --> y2["y ∈ ℝᵈ (identical)"]
end
Original -. "Insert Q^T Q = I\n(zero error)" .-> Rotated
Figure 1: The computation is identical in both parameterizations. Choosing Q as the PCA rotation orders the coordinates by variance, making the last d − k dimensions safe to discard.
The SliceGPT Algorithm
Algorithm 1: SliceGPT (Pseudocode)
Input:
f_θ pretrained transformer (L layers, hidden dim d)
D_calib calibration data: C sequences × T tokens each
(paper uses C=256, T=2048 from C4 dataset)
s global sparsity ratio (paper uses s=0.25)
Output:
f_θ̃ compressed transformer with hidden dim k = round(d·(1−s))
─────────────────────────────────────────────────────
Preprocessing (RMSNorm absorption):
For each transformer block l:
Fold scale parameter γ_l into the next weight:
For W reading immediately after RMSNorm at l:
W ← W · diag(γ_l)
Keep only the parameter-free normalization x / RMS(x) in the model graph.
(This step is exact: x / RMS(x) commutes with any orthogonal Q.)
─────────────────────────────────────────────────────
Layer-wise PCA and slicing:
For l = 0 to L−1:
(A) Collect activations:
Run D_calib through layers 0..l−1 with a forward hook.
A_l ← concatenate all token hidden states at position l
shape: (d, N) where N = C × T
(B) Compute PCA basis:
C_l ← (1/N) · A_l @ A_l.T # empirical covariance (d×d)
eigenvalues, V ← eigh(C_l) # eigh returns ASCENDING eigenvalues, eigenvectors in columns of V
Q_l ← reverse_columns(V).T # Q_l rows = eigenvectors sorted by DESCENDING eigenvalue
(C) Choose slice width:
k_l ← round(d · (1 − s)) # uniform sparsity
# (non-uniform variant: optimize k_l via marginal EVR budget)
(D) Transform and slice all weights at position l:
For W_in ∈ {W_Q, W_K, W_V, W_gate, W_up} # read from stream at l
W_in ← (W_in @ Q_l.T)[:, :k_l] # rotate then keep top-k cols
For W_out ∈ {W_O, W_down} # write to stream at l+1
W_out ← (Q_{l+1} @ W_out)[:k_{l+1}, :] # rotate then keep top-k rows
(uses k_{l+1} from the NEXT iteration)
─────────────────────────────────────────────────────
Boundary transformations:
Input embedding E ∈ ℝ^{V×d}:
E ← (E @ Q_0.T)[:, :k_0]
Output LM head W_lm ∈ ℝ^{V×d}:
W_lm ← (W_lm @ Q_L.T)[:, :k_L]
─────────────────────────────────────────────────────
Optional recovery fine-tuning:
Fine-tune f_θ̃ for 1 epoch on D_calib (or larger dataset)
using standard AdamW with LoRA adapters.
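Steps (B) and (D) of the pseudocode can be sketched in NumPy as follows. This is a toy illustration, not the authors' implementation: the helper names `pca_rotation`, `slice_reader`, and `slice_writer` are hypothetical, a single basis `Q` is used for both stream positions, and the calibration activations are constructed to lie in a `k`-dimensional subspace so that slicing is exactly lossless in this demo.

```python
import numpy as np

def pca_rotation(A):
    """Rows of the returned Q are eigenvectors of A A^T / N, descending eigenvalue."""
    lam, V = np.linalg.eigh(A @ A.T / A.shape[1])   # eigh returns ascending order
    return V[:, ::-1].T

def slice_reader(W_in, Q, k):
    """Weight that reads the stream: rotate the input side, keep the top-k columns."""
    return (W_in @ Q.T)[:, :k]

def slice_writer(W_out, Q_next, k_next):
    """Weight that writes the stream: rotate the output side, keep the top-k rows."""
    return (Q_next @ W_out)[:k_next, :]

rng = np.random.default_rng(0)
d, h, N, k = 16, 32, 1000, 12
sub = np.linalg.qr(rng.standard_normal((d, k)))[0]  # basis of a k-dim subspace
A = sub @ rng.standard_normal((k, N))               # toy calibration activations
W_in = rng.standard_normal((h, d))
W_out = rng.standard_normal((d, h))

Q = pca_rotation(A)
W_in_s = slice_reader(W_in, Q, k)                   # shape (h, k)
W_out_s = slice_writer(W_out, Q, k)                 # shape (k, h)

x = A[:, 0]
x_s = (Q @ x)[:k]                                   # sliced stream vector
hidden_dense = np.tanh(W_in @ x)
hidden_sliced = np.tanh(W_in_s @ x_s)
out_dense_sliced_view = (Q @ (W_out @ hidden_dense))[:k]
out_sliced = W_out_s @ hidden_sliced
```

With real activations the trailing coordinates are small rather than zero, so the two hidden vectors differ by an amount controlled by the discarded eigenvalues.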
Line-by-Line Explanation
Why absorb RMSNorm first?
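RMSNorm computes $\mathrm{RMSNorm}(x) = \gamma \odot \dfrac{x}{\mathrm{RMS}(x)}$. The RMS scale is $\mathrm{RMS}(x) = \|x\|_2 / \sqrt{d}$, which is invariant to orthogonal transformation since $\|Qx\|_2 = \|x\|_2$. Therefore the RMS normalization itself is transparent to the basis change. The scale $\gamma$ acts as a diagonal matrix $\operatorname{diag}(\gamma)$ and can be absorbed into the weight $W$ that reads the normalized stream:
$$W\,\mathrm{RMSNorm}(x) = \big(W \operatorname{diag}(\gamma)\big)\,\frac{x}{\mathrm{RMS}(x)}$$
After absorption, the only normalization left is the parameter-free map $x \mapsto x / \mathrm{RMS}(x)$, which commutes with any orthogonal $Q$. This simplification is not an approximation; it is algebraically exact.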
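Both claims in this paragraph (exact folding of the scale into the next weight, and invariance of the remaining normalization under rotation) can be checked in a few lines. The sketch uses random toy values and omits the small epsilon that practical RMSNorm implementations add for numerical stability.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def rmsnorm(x, gamma):
    return gamma * x / rms(x)

rng = np.random.default_rng(0)
d, m = 16, 8
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)
W = rng.standard_normal((m, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# (1) Folding gamma into the next weight is exact.
out_original = W @ rmsnorm(x, gamma)
W_folded = W * gamma                  # equals W @ np.diag(gamma): scales column j by gamma[j]
out_folded = W_folded @ (x / rms(x))

# (2) The remaining parameter-free normalization commutes with Q.
lhs = (Q @ x) / rms(Q @ x)
rhs = Q @ (x / rms(x))
```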
Why layer-by-layer, not all at once?
PCA at layer $l$ must reflect the actual distribution of activations produced by layers $0, \dots, l-1$ with the model's actual weights and the calibration data. Using random Gaussian activations would give the wrong basis (the statistics of residual-stream activations are highly non-Gaussian). The layer-by-layer scan captures this correctly.
Why does this work for the residual stream?
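Residual connections add the input to the output: $x_{l+1} = x_l + f(x_l)$. Both $x_l$ and $f(x_l)$ live in the same space, so applying the same $Q$ to both is consistent. The addition is preserved: $Q\,(x_l + f(x_l)) = Q x_l + Q f(x_l)$. When consecutive positions use different bases ($Q_l \ne Q_{l+1}$), the skip path must itself be multiplied by $Q_{l+1} Q_l^\top$ so that both summands arrive in the basis of position $l+1$; the paper inserts this product on the residual connection.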
The Q matrices disappear at inference time.
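After transformation, $(W_{\text{in}} Q_l^\top)_{[:, :k]}$ is an $m \times k$ matrix. It is stored as-is. At inference, the sliced model takes $k$-dimensional inputs and produces $k$-dimensional outputs. No PCA is run at inference: the rotation is baked into the weight values. The only additional operators are the fixed, sliced $Q_{l+1} Q_l^\top$ products that sit on the skip connections when consecutive bases differ.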
Dimension bookkeeping.
After slicing, each transformer block operates with:
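- Input/output residual stream: $k$ dimensions
- Q/K/V matrices: $d_h \times k$ per head (head dimension unchanged)
- MLP: $W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{ff} \times k}$ and $W_{\text{down}} \in \mathbb{R}^{k \times d_{ff}}$ (intermediate dim unchanged)
Total parameters scale as $k/d$ for dominant terms, since each of these matrices shrinks in exactly one dimension, from $d$ to $k$.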
Figure 2: SliceGPT Compression Pipeline
flowchart TD
A["Pretrained LLM\nhidden dim d"] --> B["Calibration dataset\n256 × 2048 tokens, C4"]
B --> C["Step 1: Absorb RMSNorm γ\ninto adjacent weights"]
C --> D["For each layer l:\nforward pass → A_l ∈ ℝ^{d×N}"]
D --> E["PCA: covariance C_l = A_l A_l^T\neigh → Q_l, eigenvalues"]
E --> F["Set k_l = round(d·(1−s))"]
F --> G["W_in ← (W_in Q_l^T)[:, :k]\nW_out ← (Q_{l+1} W_out)[:k, :]"]
G --> H{l < L?}
H -->|Yes, l++| D
H -->|Done| I["Transform embeddings E,\nLM head W_lm"]
I --> J["Optional: 1-epoch fine-tuning\nwith LoRA"]
J --> K["Compressed model\nhidden dim k = 0.75d\nDense matrices only"]
Figure 2: The full SliceGPT pipeline. The calibration phase (collecting activations, computing PCA) requires only forward passes — no gradients. The Q matrices are absorbed and not stored.
Handling Special Components
RMSNorm / LayerNorm
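As derived above, RMSNorm is absorbed exactly into the first downstream weight. For LayerNorm (used in OPT), which also subtracts the mean and has a bias $\beta$:
$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sigma(x)} + \beta$$
The bias $\beta$ is absorbed into the bias term $b$ of the following linear layer:
$$W\,(\gamma \odot z + \beta) + b = \big(W \operatorname{diag}(\gamma)\big)\, z + (W\beta + b), \qquad z = \frac{x - \mu(x)}{\sigma(x)}$$
After absorption, neither LayerNorm nor RMSNorm carries learned parameters in the compressed graph; only the parameter-free normalization remains.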
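The bias absorption is a one-line identity that can be confirmed numerically. The following sketch uses random toy values and omits the epsilon term of practical LayerNorm implementations; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 8
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)    # LayerNorm scale
beta = rng.standard_normal(d)            # LayerNorm bias
W = rng.standard_normal((m, d))          # following linear layer
b = rng.standard_normal(m)               # its bias

z = (x - x.mean()) / x.std()             # parameter-free part of LayerNorm
out_original = W @ (gamma * z + beta) + b

W_folded = W * gamma                     # W @ diag(gamma)
b_folded = W @ beta + b                  # LayerNorm bias pushed through W
out_folded = W_folded @ z + b_folded
```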
Multi-Head Self-Attention
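For $H$ heads with per-head dimension $d_h$ (and $H \cdot d_h = d$):
Q/K/V projections all read from the same stream at layer $l$:
$$W_Q^{(i)} \leftarrow \big(W_Q^{(i)} Q_l^\top\big)_{[:, :k]}, \quad W_K^{(i)} \leftarrow \big(W_K^{(i)} Q_l^\top\big)_{[:, :k]}, \quad W_V^{(i)} \leftarrow \big(W_V^{(i)} Q_l^\top\big)_{[:, :k]}$$
Each becomes a matrix of shape $d_h \times k$ (from $d_h \times d$). The input dimension shrinks from $d$ to $k$; the per-head output dimension $d_h$ is unchanged.
Output projection writes to the stream at layer $l+1$:
$$W_O \leftarrow \big(Q_{l+1} W_O\big)_{[:k, :]}$$
The output dimension shrinks from $d$ to $k$; the input dimension $H \cdot d_h$ is unchanged.
Grouped-Query Attention (LLAMA2-70B uses GQA): The key and value heads are shared across multiple query groups. SliceGPT handles this identically: the input dimension to $W_K$ and $W_V$ shrinks from $d$ to $k$, while the per-head dimension $d_h$ stays fixed.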
MLP Block (SwiGLU)
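LLAMA2’s MLP uses SwiGLU:
$$\mathrm{MLP}(x) = W_{\text{down}}\big(\mathrm{SiLU}(W_{\text{gate}}\, x) \odot W_{\text{up}}\, x\big)$$
Both $W_{\text{gate}}$ and $W_{\text{up}}$ read from the layer-$l$ stream:
$$W_{\text{gate}} \leftarrow (W_{\text{gate}} Q_l^\top)_{[:, :k]}, \qquad W_{\text{up}} \leftarrow (W_{\text{up}} Q_l^\top)_{[:, :k]}$$
$W_{\text{down}}$ writes to the layer-$(l+1)$ stream:
$$W_{\text{down}} \leftarrow (Q_{l+1} W_{\text{down}})_{[:k, :]}$$
The intermediate dimension $d_{ff}$ (≈ $2.7d$ in LLAMA2-7B, where $d_{ff} = 11008$ and $d = 4096$) is not sliced in the basic algorithm. Slicing $d_{ff}$ would require an additional PCA pass over post-nonlinearity activations and is left for future work.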
Embedding and LM Head
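The token embedding table $E \in \mathbb{R}^{V \times d}$ maps discrete token IDs to the residual stream at position 0. It must be aligned with the layer-0 basis $Q_0$:
$$E \leftarrow (E\, Q_0^\top)_{[:, :k_0]}$$
The output LM head $W_{\text{lm}} \in \mathbb{R}^{V \times d}$ reads from the final residual stream (position $L$):
$$W_{\text{lm}} \leftarrow (W_{\text{lm}}\, Q_L^\top)_{[:, :k_L]}$$
After these transformations, the model is fully self-consistent. A $k$-dimensional residual stream flows from the embedding table through all layers to the LM head with no mismatch.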
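To see the boundary transformations and the SwiGLU rules work together, here is a toy end-to-end check: an embedding table, one residual SwiGLU block with parameter-free RMS normalization, and an LM head. It is a sketch under simplifying assumptions (one block, no attention, a single shared `Q` for every stream position, rotation only with no slicing), so it demonstrates the exactness of the rotation step rather than the full algorithm.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def unit_rms(x):
    return x / np.sqrt(np.mean(x ** 2))

def forward(token, E, W_gate, W_up, W_down, W_lm):
    x = E[token]                                      # residual stream, shape (d,)
    n = unit_rms(x)
    x = x + W_down @ (silu(W_gate @ n) * (W_up @ n))  # residual SwiGLU block
    return W_lm @ unit_rms(x)                         # logits, shape (V,)

rng = np.random.default_rng(0)
V, d, d_ff = 50, 16, 40
E = rng.standard_normal((V, d))
W_gate = rng.standard_normal((d_ff, d))
W_up = rng.standard_normal((d_ff, d))
W_down = rng.standard_normal((d, d_ff))
W_lm = rng.standard_normal((V, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

rotated = dict(
    E=E @ Q.T,             # embedding rows now live in the rotated basis
    W_gate=W_gate @ Q.T,   # readers absorb Q^T on the right
    W_up=W_up @ Q.T,
    W_down=Q @ W_down,     # writer absorbs Q on the left
    W_lm=W_lm @ Q.T,       # LM head reads the rotated stream
)

logits_dense = forward(7, E, W_gate, W_up, W_down, W_lm)
logits_rotated = forward(7, **rotated)
```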
Figure 3: Component-Level Slicing Map
flowchart LR
RS_l["Residual stream l\ndim: k_l = 0.75d"] --> WQ["W_Q Q_l^T [:, :k]\ndim: d_h × k_l"]
RS_l --> WK["W_K Q_l^T [:, :k]\ndim: d_h × k_l"]
RS_l --> WV["W_V Q_l^T [:, :k]\ndim: d_h × k_l"]
RS_l --> Wg["W_gate Q_l^T [:, :k]\ndim: d_ff × k_l"]
RS_l --> Wu["W_up Q_l^T [:, :k]\ndim: d_ff × k_l"]
WQ & WK & WV --> Attn["Attention\n(internal d_h unchanged)"]
Attn --> WO["Q_{l+1} W_O [:k, :]\ndim: k_{l+1} × d_h"]
Wg & Wu --> MLP["SiLU / GeLU\n(d_ff unchanged)"]
MLP --> Wd["Q_{l+1} W_down [:k, :]\ndim: k_{l+1} × d_ff"]
WO & Wd --> RS_l1["Residual stream l+1\ndim: k_{l+1} = 0.75d"]
Figure 3: Every weight that reads from the residual stream has its input dimension sliced from d to k. Every weight that writes to the stream has its output dimension sliced. Internal dimensions (d_h, d_ff) are unchanged.
Calibration: Practical Details
Dataset and Scale
SliceGPT uses 256 sequences of 2048 tokens from C4 (≈524K tokens total). The authors confirm that:
- C4 and Wikitext-2 give essentially identical results (PCA basis is data-distribution-insensitive within natural text)
- 128 sequences is sufficient; 512 provides marginal improvement
- The calibration needs only inference-mode forward passes — no gradients, no optimizer state
For LLAMA2-70B at 8192 dimensions, each covariance matrix is $8192 \times 8192 \approx 67$M entries (256 MiB in FP32). With 80 layers, the total covariance storage is ~20 GiB, manageable on a single A100.
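The storage figures follow from simple arithmetic, reproduced here so the numbers can be adapted to other model sizes. The values of `d` and `n_layers` are the ones quoted in the text; keeping all covariances resident at once is an assumption of this estimate (a layer-by-layer scan only needs one at a time).

```python
d, n_layers, bytes_fp32 = 8192, 80, 4

entries = d * d                                   # entries per covariance matrix
per_layer_mib = entries * bytes_fp32 / 2**20      # MiB per layer in FP32
total_gib = per_layer_mib * n_layers / 1024       # GiB if all layers are kept at once
calib_tokens = 256 * 2048                         # calibration tokens (sequences x length)
```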
Explained Variance Ratio
After running PCA at layer $l$, the explained variance ratio (EVR) at width $k$ is:
$$\mathrm{EVR}_l(k) = \frac{\sum_{i=1}^{k} \lambda_i^{(l)}}{\sum_{i=1}^{d} \lambda_i^{(l)}}$$
For LLAMA2-70B at $k = 0.75d$, EVR typically exceeds 99.5% at early/middle layers and drops slightly (to ~98.5%) at the final layers. This quantifies how much activation energy is preserved after slicing.
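The EVR is a one-liner once the eigenvalues are available. The sketch below defines a hypothetical helper `explained_variance_ratio` and evaluates it on a made-up power-law spectrum; the spectrum is purely illustrative and is not taken from any model.

```python
import numpy as np

def explained_variance_ratio(eigvals, k):
    """Fraction of total activation variance kept by the top-k principal directions."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    return lam[:k].sum() / lam.sum()

lam = 1.0 / np.arange(1, 9) ** 2        # hypothetical spectrum: 1, 1/4, 1/9, ..., 1/64
evr_6 = explained_variance_ratio(lam, 6)
evr_8 = explained_variance_ratio(lam, 8)
```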
Non-Uniform Sparsity Allocation
A more principled variant optimizes per-layer $k_l$ subject to a global parameter budget:
$$\min_{k_0, \dots, k_L} \; \sum_{l} \sum_{i > k_l} \lambda_i^{(l)} \quad \text{s.t.} \quad \sum_{l} k_l \le (1 - s) \sum_{l} d$$
This can be solved greedily: sort layers by marginal truncation error per parameter removed and allocate budget accordingly. The paper reports that non-uniform allocation gives marginal improvement over uniform sparsity for large models; small models benefit more.
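One simple way to realize the greedy idea is to drop the globally smallest eigenvalues until the budget is met. The sketch below does this under two stated assumptions: every layer has the same width `d`, and every removed dimension costs the same number of parameters. The function name `allocate_widths` and the toy spectra are hypothetical, and this is not claimed to be the paper's allocation procedure.

```python
import numpy as np

def allocate_widths(spectra, sparsity):
    """Greedy per-layer widths k_l.

    spectra: list of 1-D arrays (one per layer), each sorted in descending order.
    sparsity: fraction of all (layer, dimension) slots to remove globally.
    """
    L, d = len(spectra), len(spectra[0])
    n_remove = int(round(sparsity * L * d))
    values = np.concatenate(spectra)            # all eigenvalues, layer by layer
    layers = np.repeat(np.arange(L), d)         # layer index of each eigenvalue
    drop = np.argsort(values, kind="stable")[:n_remove]   # globally smallest
    removed_per_layer = np.bincount(layers[drop], minlength=L)
    return d - removed_per_layer

# Toy spectra: layer 0 is flat (hard to compress), layer 1 decays steeply.
spectra = [np.array([1.0, 1.0, 1.0, 1.0]),
           np.array([1.0, 0.1, 0.01, 0.001])]
widths = allocate_widths(spectra, sparsity=0.25)
```

In the toy case both removed dimensions come from the steep layer, so the flat layer keeps its full width.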
Experiments and Results
Experimental Setup
- Models evaluated: LLAMA2-7B, 13B, 70B; OPT-13B, 30B, 66B; Phi-2 (2.7B)
- Calibration data: 256 × 2048 tokens from C4
- Evaluation benchmark: EleutherAI LM-Eval-Harness, 7 zero-shot tasks (WinoGrande, HellaSwag, PIQA, ARC-easy, ARC-challenge, OpenBookQA, BoolQ)
- Baselines: SparseGPT (50% unstructured), Wanda (50% unstructured), LLM-Pruner (20% structured)
Table 1: Zero-Shot Accuracy at 25% Parameter Reduction
| Model | Dense Acc. | SliceGPT Acc. | Retained | Dense GPUs | Sliced GPUs |
|---|---|---|---|---|---|
| LLAMA2-7B | 64.0% | 58.2% | 91.0% | 1×A100 | 1×A100 |
| LLAMA2-13B | 66.8% | 61.4% | 91.9% | 2×A100 | 1×A100 |
| LLAMA2-70B | 70.4% | 69.8% | 99.1% | 4×A100 | 2×A100 |
| OPT-66B | 66.7% | 66.1% | 99.1% | 4×A100 | 2×A100 |
| Phi-2 (2.7B) | 71.2% | 63.9% | 89.7% | 1×RTX3090 | 1×RTX3090 |
Key observation: Large models (≥66B) tolerate slicing far better than small models. LLAMA2-70B loses only 0.6 percentage points at 25% compression — within noise on individual tasks. Small models like Phi-2 lose ~7 points, reflecting their lower redundancy.
Table 2: Perplexity on Wikitext-2 (lower is better)
| Model | Dense | SliceGPT 20% | SliceGPT 25% | SparseGPT 50% |
|---|---|---|---|---|
| LLAMA2-7B | 5.47 | 5.82 | 6.82 | 6.51 |
| LLAMA2-13B | 4.88 | 5.12 | 5.72 | 5.40 |
| LLAMA2-70B | 3.32 | 3.40 | 3.52 | 3.51 |
| OPT-66B | 9.34 | 9.55 | 9.80 | 9.76 |
At 25% structural reduction, SliceGPT is competitive with SparseGPT at 50% unstructured sparsity on 66–70B models. On smaller models, unstructured sparsity has a slight perplexity edge.
Compute and GPU Reduction
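For LLAMA2-70B at $s = 0.25$, the idealized estimate for terms quadratic in the hidden dimension is:
$$\frac{\text{FLOPs}_{\text{sliced}}}{\text{FLOPs}_{\text{dense}}} \approx \left(\frac{k}{d}\right)^2 = 0.75^2 \approx 56\%$$
Empirically measured at 64–66% (slightly higher than theoretical because embedding and MLP-intermediate terms are not fully reduced). The model also fits on half the GPU count: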
- Dense: 4×A100-40GB required
- Sliced: 2×A100-40GB sufficient
This is a practical infrastructure saving: half the hardware cost for 99% of the task performance.
Fine-Tuning Recovery
One epoch of LoRA fine-tuning after slicing:
- LLAMA2-7B: ~2 points recovered (58.2% → 60.1%)
- LLAMA2-70B: ~0.2 points recovered (already near-dense quality)
- Phi-2: ~3 points recovered (63.9% → 67.1%)
Fine-tuning is most beneficial where the initial accuracy drop is largest (small models, high sparsity).
Figure 4: Performance vs. GPU Count (LLAMA2-70B)
flowchart LR
subgraph DenseSetup["Dense LLAMA2-70B"]
G1["GPU 1\n16.7B params"] & G2["GPU 2\n16.7B params"] & G3["GPU 3\n16.7B params"] & G4["GPU 4\n16.7B params"]
G1 & G2 & G3 & G4 --> Perf1["70.4% zero-shot\n100% FLOPs"]
end
subgraph SlicedSetup["SliceGPT LLAMA2-70B (s=0.25)"]
SG1["GPU 1\n~26B params"] & SG2["GPU 2\n~26B params"]
SG1 & SG2 --> Perf2["69.8% zero-shot\n64% FLOPs\n(-0.6 pts, -2 GPUs)"]
end
Figure 4: SliceGPT halves the GPU count for LLAMA2-70B inference while losing only 0.6 percentage points on zero-shot benchmarks.
Comparison to Prior Work
Table 3: Structural Compression Method Comparison (LLAMA2-70B, Wikitext-2 PPL)
| Method | Type | Param Reduction | PPL | Custom Kernel | GPU Savings |
|---|---|---|---|---|---|
| Dense | — | 0% | 3.32 | No | — |
| SparseGPT | Unstructured | ~50% | 3.51 | Yes | No |
| Wanda | Unstructured | ~50% | 3.53 | Yes | No |
| LLM-Pruner | Structural | ~20% | 5.3 | No | Partial |
| SliceGPT | Structural | 25% | 3.52 | No | 4→2 GPUs |
SliceGPT is the only method in this table that simultaneously achieves: no custom kernels, actual GPU count reduction, and sub-4.0 perplexity at meaningful compression.
Why Computational Invariance is More Principled Than Magnitude-Based Pruning
Most structural pruning methods select neurons/channels to prune by magnitude, gradient, or Taylor expansion — all heuristics. SliceGPT instead:
- Applies a theoretically optimal basis change (the PCA rotation is optimal in the sense of minimizing reconstruction error after truncation, by the Eckart–Young theorem)
- The subsequent truncation discards directions that are demonstrably low-variance in the calibration distribution
- The basis change itself introduces zero approximation error — only the truncation does
This gives a principled upper bound on the error introduced: it is exactly the PCA reconstruction error (discarded eigenvalue sum), which can be computed and used to set the sparsity budget.
Figure 5: Taxonomy of Post-Training LLM Compression
flowchart TD
root["Post-Training LLM Compression"]
root --> Quant["Quantization\nGPTQ · AWQ · SmoothQuant · LLM.int8"]
root --> Unstruct["Unstructured Sparsity\nSparseGPT · Wanda · Magnitude"]
root --> Struct["Structural Pruning\nSliceGPT · LLM-Pruner · ShortGPT · FLAP"]
root --> LowRank["Low-Rank Decomposition\nSVD-LLM · ASVD · TrLoRA"]
Struct -->|"Theoretical basis:\ncomputational invariance + PCA"| SliceGPT_node["SliceGPT\n(this paper)"]
Figure 5: SliceGPT sits in the structural pruning branch, uniquely backed by a theoretical invariance argument rather than a magnitude-based heuristic.
Limitations and Boundary Conditions
Scale Dependence is Fundamental
The 99% retention at 25% sparsity holds only for 60B+ parameter models. At 7B:
- Performance drop: ~6 points
- Explanation: smaller models have less redundancy (the eigenvalue spectrum of the activation covariance is flatter, meaning more variance is spread across directions rather than concentrated in a few)
This is not a bug but a fundamental property: SliceGPT exploits over-parameterization. Models below ~13B are not sufficiently over-parameterized for 25% slicing to be near-lossless.
Intermediate MLP Dimension Untouched
The basic algorithm only slices the residual stream dimension $d$. The MLP intermediate dimension $d_{ff}$ (SwiGLU in LLAMA2) is preserved. This means:
- The $W_{\text{gate}}$ and $W_{\text{up}}$ matrices change from $d_{ff} \times d$ to $d_{ff} \times k$: savings proportional to $1 - k/d$
- The $W_{\text{down}}$ matrix changes from $d \times d_{ff}$ to $k \times d_{ff}$: same savings
But the element-wise nonlinearity and the intermediate activations still occupy $d_{ff}$ dimensions. For models where the MLP dominates (e.g., MoE models), this limits the FLOP savings.
Benchmark Narrowness
All evaluations use short-answer, zero-shot classification tasks. No results are reported for:
- Code generation (HumanEval, MBPP)
- Mathematical reasoning (GSM8K, MATH)
- Long-form instruction following (MT-Bench, AlpacaEval)
- Long-context tasks (SCROLLS, LongBench)
- Multilingual benchmarks
These tasks may be more sensitive to residual stream dimension reduction, particularly long-context tasks where the model must maintain a rich information state across many tokens.
Single-Architecture Evaluation at Large Scale
The 70B-scale experiments cover only LLAMA2 and OPT. Other modern large models — Falcon-180B, Mixtral-8×7B (MoE), GPT-NeoX-20B — are not evaluated. MoE architectures are especially interesting since their routing mechanism interacts with the residual stream in non-trivial ways.
Non-Linear Components Limit Invariance
An element-wise nonlinearity $\varphi$ does not commute with rotations: $\varphi(Qz) = Q\,\varphi(z)$ holds for every orthogonal $Q$ only when $\varphi$ is a linear scaling, which GeLU and SiLU are not. Computational invariance holds anyway because $\varphi$ is applied to the intermediate vector $W_1 x$, not to the residual stream: the rotation cancels ($W_1 Q^\top Q x = W_1 x$) before reaching $\varphi$. If $\varphi$ were applied to the residual stream directly (as in some architectures), the invariance would break.
For standard transformer attention, the softmax is applied to attention scores, not the residual stream. The attention scores operate in the per-head space (dimension $d_h$), which is not sliced. So attention softmax is handled correctly.
Critical Assessment: Weaknesses & Improvements
(a) Weaknesses and Flaws
The “25% parameter reduction” framing is imprecise. SliceGPT reduces the hidden dimension from $d$ to $k = (1 - s)\,d$. Weight matrices of shape $m \times d$ become $m \times k$: only one dimension changes. For such a matrix the parameter reduction is $1 - k/d = s$. But every matrix, including the MLP’s, shrinks in only one of its two dimensions, and the embedding table is very large. The net total parameter reduction depends on the model’s dimension ratios. The paper reports “up to 25% of model parameters including embeddings”; the “up to” qualifier deserves more prominence, and Table 2 in the paper shows exact per-model figures that vary meaningfully.
Perplexity comparison at different operating points. SliceGPT at 25% structural sparsity is compared against SparseGPT/Wanda at 50% unstructured sparsity. The paper frames this as “competitive,” but these methods are not at the same FLOP reduction point. SparseGPT at 50% unstructured sparsity has half the nonzero weights but, without custom sparse kernels, no latency benefit, while SliceGPT at 25% structural sparsity removes roughly 34–36% of inference compute (64–66% of dense remains) and delivers an actual latency benefit. A fair comparison would match on actual measured throughput (tokens/second) at the same hardware budget, not on nominal parameter counts.
No ablation on calibration dataset size or domain. The paper states C4 and Wikitext-2 give similar results (one comparison), but provides no systematic study. For practitioners deploying SliceGPT on domain-specific models (medical, legal, code), it is unknown whether calibrating with C4 is adequate or whether domain-matched calibration data is necessary. This is a practical gap.
No latency measurements on realistic inference workloads. The paper reports FLOP counts and mentions running on fewer GPUs, but does not report actual tokens/second at various batch sizes. For memory-bandwidth-bound regimes (small batch sizes), FLOP reduction does not directly translate to latency reduction. The “faster” claim, while plausible, is not fully substantiated.
Limited fine-tuning analysis. The paper briefly mentions 1-epoch LoRA fine-tuning but does not explore: how much recovery is possible with more compute (3–5 epochs), what training data is optimal, or whether full fine-tuning outperforms LoRA for recovery.
(b) Limitations the Authors Understate
KV-cache size is unaffected in the basic algorithm. Since per-head dimension $d_h$ is not sliced (only the input dimension of $W_K$ and $W_V$ changes), the K and V vectors output by the projections still have dimension $d_h$ per head. The KV cache size is therefore unchanged. For long-context inference where KV cache is the primary memory bottleneck, SliceGPT provides no direct benefit. The paper does not acknowledge this.
Tensor-parallel sharding may be complicated. The reduced hidden dimension $k$ may not be evenly divisible by the number of GPUs in tensor-parallel settings. For LLAMA2-70B: $d = 8192$ (easily divisible by 8), $k = 6144$ (divisible by 8 but not by all desired tensor-parallel degrees). For non-power-of-2 $k$, padding or irregular sharding is needed.
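A quick divisibility check makes the concern concrete. The sketch below uses the document's rule `k = round(d * (1 - s))` and a hypothetical padding helper; the 30% case is included only to show a sparsity at which the sliced width no longer divides evenly.

```python
def sliced_width(d, sparsity):
    """Hidden width after slicing: k = round(d * (1 - s))."""
    return round(d * (1 - sparsity))

def pad_to_multiple(k, multiple):
    """Smallest width >= k that is divisible by `multiple` (hypothetical helper)."""
    return ((k + multiple - 1) // multiple) * multiple

d = 8192
k25 = sliced_width(d, 0.25)
k30 = sliced_width(d, 0.30)

# Does the sliced width split evenly across tensor-parallel degrees 2, 4, 8?
divisible_25 = {tp: k25 % tp == 0 for tp in (2, 4, 8)}
divisible_30 = {tp: k30 % tp == 0 for tp in (2, 4, 8)}
k30_padded = pad_to_multiple(k30, 8)
```

Padding the width up to a convenient multiple gives back a little of the compression in exchange for regular shards.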
Weight materialization overhead during compression. During compression, both the original weight $W$ and the transformed $\tilde W$ must be held in memory simultaneously. For LLAMA2-70B, with 70B parameters (~140 GB at FP16), a naive whole-model copy would transiently require ~280 GB, leaving little of the 320 GB on 4×A100-80GB for activations and covariance buffers. The paper reports 4×A100-80GB is sufficient but does not detail how this is managed (likely layer-by-layer with careful memory management).
(c) Concrete Improvement Suggestions
1. Slice the MLP intermediate dimension. Apply PCA to the post-activation intermediate activations (the vector after the GeLU/SiLU) and additionally reduce $d_{ff}$. This requires two PCA passes per layer (one at the residual stream, one at the MLP intermediate) but would provide proportional FLOP savings across all weight matrices. Expected benefit: at $s = 0.25$ on both $d$ and $d_{ff}$, total FLOPs reduce to roughly $0.75^2 \approx 56\%$ rather than the current ~64–66%.
2. Non-uniform allocation with validation-loop tuning. Use a small held-out set (16–32 sequences) to measure actual perplexity impact of slicing each layer independently. Protect layers that show large perplexity sensitivity (typically layers near the input and output) and aggressively slice middle layers. Gradient-free black-box optimization (CMA-ES or a greedy scan) over the $\{k_l\}$ schedule could substantially improve the accuracy-compression trade-off without additional inference-time compute.
3. Evaluate on reasoning and code benchmarks. Add HumanEval (code generation), GSM8K (math), and MT-Bench (instruction following) to the evaluation suite. If SliceGPT degrades disproportionately on these tasks, which require multi-step precision, this should be transparently reported, with per-task analysis of which tasks are most sensitive to $d$ reduction.
4. Combine with quantization and measure jointly. Apply AWQ or GPTQ after SliceGPT and compare against AWQ/GPTQ alone on the same hardware. If SliceGPT + INT4 achieves better throughput than INT4 alone at similar accuracy, that is a compelling deployment story the paper misses. The combination is natural (slicing reduces the matrix sizes before quantization) but unexplored.
5. Measure KV-cache impact of slicing the head dimension. Extend the algorithm to also apply PCA on the per-head key/value activations (a separate PCA within each attention head) and slice $d_h$. This would reduce KV cache memory proportionally, which is critical for long-context serving. This is a non-trivial extension but directly addresses the KV-cache limitation identified above.
Deep Dive: SliceGPT vs. Low-Rank Matrix Decomposition
It is easy to conflate SliceGPT with weight-level low-rank decomposition methods (e.g., SVD-LLM, ASVD). Both involve SVD and both produce smaller weight matrices. The difference is conceptual and has practical consequences.
Low-Rank Decomposition of Weights (What SliceGPT is NOT)
Conventional low-rank compression approximates each weight matrix individually:
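$$W \approx W_k = U_k \Sigma_k V_k^\top = A B$$
where $A = U_k \in \mathbb{R}^{m \times k}$, $B = \Sigma_k V_k^\top \in \mathbb{R}^{k \times d}$. This replaces one matrix with two smaller ones: the computation changes from $Wx$ to $A(Bx)$.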
Problems with weight-level SVD:
- Each matrix is approximated independently, ignoring that the approximation errors of consecutive layers accumulate through the residual stream
- The approximation is in the weight space: the truncated directions in $W$ may not correspond to directions that the activations actually occupy
- The two smaller matrices ($A$, $B$) both need to be stored and multiplied; unless $k < \frac{md}{m + d}$, the inference overhead can actually increase due to two separate GEMM calls
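The break-even rank in the last bullet follows from counting multiply-adds per input vector. The sketch below compares a dense matrix, a rank-`k` factor pair, and a single sliced matrix, using the standard estimate of 2 FLOPs per multiply-add; the shapes are arbitrary examples and the function names are hypothetical.

```python
def dense_flops(m, d):
    """FLOPs for y = W x with W of shape (m, d)."""
    return 2 * m * d

def factored_flops(m, d, k):
    """FLOPs for y = A (B x) with A of shape (m, k) and B of shape (k, d)."""
    return 2 * k * (m + d)

def sliced_flops(m, k):
    """FLOPs for y = W_s x_s with a single sliced W_s of shape (m, k)."""
    return 2 * m * k

m = d = 4096
k = 3072                                   # 75% of d
breakeven_rank = m * d / (m + d)           # the factor pair is cheaper only below this rank

dense = dense_flops(m, d)
factored = factored_flops(m, d, k)
sliced = sliced_flops(m, k)
```

At 75% of the original width, the factor pair costs more than the dense matrix for a square layer, while the single sliced matrix costs 75% of it.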
What SliceGPT Actually Does
SliceGPT applies SVD/PCA to the activations, not the weight matrices. The key mathematical distinction:
Step 1 (SliceGPT): Find $Q_l$ such that the rotated activations $Q_l x_l$ have maximum variance in the first $k$ coordinates.
Step 2 (SliceGPT): Rotate all weights that touch position $l$ to be consistent with this new basis. This step is exact (computational invariance).
Step 3 (SliceGPT): Truncate the last $d - k$ coordinates. This step is the only approximation, and its error equals the discarded eigenvalue sum.
The resulting weights are single matrices (not product pairs): a matrix that was $m \times d$ becomes $m \times k$, one matrix rather than two. This is why inference computation actually decreases rather than just being rearranged.
Formal Comparison of Error Sources
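Low-rank SVD of weight $W$:
$$\|W - W_k\|_F^2 = \sum_{i > k} \sigma_i^2(W)$$
This error is in weight space and may not reflect what activations actually use.
SliceGPT truncation at layer $l$:
$$\frac{1}{N}\sum_{n=1}^{N} \big\| x_l^{(n)} - \hat{x}_l^{(n)} \big\|_2^2 = \sum_{i > k} \lambda_i^{(l)}$$
where $\lambda_i^{(l)}$ are eigenvalues of the activation covariance and $\hat{x}_l^{(n)}$ is the projection of $x_l^{(n)}$ onto the top-$k$ principal directions. This error is in activation space, directly measuring how much of the actual runtime information is discarded.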
The activation-space error bound is tighter and more meaningful for downstream task performance, because it directly measures how much the model “sees” in the directions being removed.
Figure 6: SliceGPT vs. Weight-Level Low-Rank Decomposition
flowchart TB
subgraph WeightSVD["Weight-Level SVD (e.g., SVD-LLM)"]
W1["W ∈ ℝ^{m×d}"] -->|"SVD truncation"| UV["U_k (m×k)\n× Σ_k V_k^T (k×d)\nTwo matrices"]
UV --> Err1["Error: ‖W − U_kΣ_kV_k^T‖_F\n(in weight space)"]
end
subgraph SliceGPT_Diag["SliceGPT (activation-space)"]
W2["W ∈ ℝ^{m×d}"] -->|"Rotate: W Q_l^T"| WQ["W Q_l^T ∈ ℝ^{m×d}\n(exact, zero error)"]
WQ -->|"Slice: keep cols 1:k"| Wk["W Q_l^T [:, :k] ∈ ℝ^{m×k}\nOne matrix"]
Wk --> Err2["Error: Σ_{i>k} λᵢ(activation covariance)\n(in activation space, tighter bound)"]
end
Figure 6: Weight-level SVD produces a product of two matrices and measures error in weight space. SliceGPT produces a single smaller matrix and measures error in activation space — directly bounding the impact on runtime behavior.
Sparsity Scaling Behavior
How Does Accuracy Degrade as Sparsity Increases?
Understanding the accuracy-sparsity curve is critical for practitioners choosing the operating point. The paper reports results at 20% and 25% sparsity for some models, and at 30%+ for others. The qualitative pattern is:
- 0–10% sparsity: Nearly zero accuracy loss for all model sizes. The high-variance PCA directions are far more important than the low-variance ones; removing only the tail is almost free.
- 10–20% sparsity: Negligible loss for 70B+, small loss (~1–2 points) for 7–13B. Still practically useful.
- 25% sparsity: The “sweet spot” for large models — 99% retention at 70B. For 7B, the 6-point loss becomes noticeable.
- 30%+ sparsity: Accuracy drops accelerate nonlinearly. The eigenvalue spectrum decays rapidly but never reaches zero, so at high sparsity the removed directions carry meaningful variance.
This nonlinear degradation pattern matches the mathematical prediction: the truncation error grows slowly at first (low eigenvalues discarded) and then quickly (eigenvalues with non-trivial variance begin to be discarded).
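The slow-then-fast growth follows from the eigenvalues being sorted: each additional slice of equal width removes larger eigenvalues than the previous one. A minimal sketch, assuming a hypothetical power-law spectrum (not measured from any model; `discarded_fraction` is an illustrative helper):

```python
import numpy as np

d = 4000
# Hypothetical decaying spectrum, sorted from largest to smallest eigenvalue.
eigvals = 1.0 / np.arange(1, d + 1) ** 1.5

def discarded_fraction(eigvals, sparsity):
    """Fraction of total variance lost when the smallest eigenvalues are sliced off."""
    k = len(eigvals) - int(round(len(eigvals) * sparsity))   # kept dimensions
    return float(eigvals[k:].sum() / eigvals.sum())

sparsities = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
fractions = [discarded_fraction(eigvals, s) for s in sparsities]

for s, f in zip(sparsities, fractions):
    print(f"sparsity {s:.0%}: discarded variance fraction {f:.5f}")
```

The absolute numbers depend entirely on the assumed spectrum. The point is only the shape: every additional 10% of sparsity discards more variance than the previous 10% did.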
Per-Layer Eigenvalue Spectra
Analyzing the eigenvalue spectra of the activation covariance at different layers reveals:
- Early layers (l = 0–10): Relatively flat spectra (activations use many directions roughly equally). These layers are harder to compress and benefit most from non-uniform sparsity (a lower per-layer sparsity).
- Middle layers (l = 10–60 for 70B): Steep spectra, very high EVR even at . High redundancy.
- Final layers (l > 60): Moderate spectra. The LM head needs to distinguish many different token predictions, requiring more dimensions.
This layered structure explains why uniform sparsity works well on average but non-uniform allocation (protecting early and late layers) can unlock better accuracy at the same compute budget.
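One simple way to turn per-layer spectra into a non-uniform allocation is to keep, for each layer, the smallest number of dimensions that reaches a target explained-variance ratio. This is a hypothetical heuristic for illustration, not the allocation rule used in the paper. The helper name `min_dims_for_evr` and both spectra are assumptions.

```python
import numpy as np

def min_dims_for_evr(eigvals, target_evr):
    """Smallest k whose top-k eigenvalues explain at least target_evr of the variance."""
    ev = np.sort(np.asarray(eigvals, dtype=np.float64))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    k = int(np.searchsorted(cumulative, target_evr)) + 1
    return min(k, len(ev))                 # guard against round-off at target ~ 1.0

d = 64
flat_spectrum = np.linspace(1.0, 0.5, d)   # stand-in for an "early layer"
steep_spectrum = 0.8 ** np.arange(d)       # stand-in for a "middle layer"

k_flat = min_dims_for_evr(flat_spectrum, 0.99)
k_steep = min_dims_for_evr(steep_spectrum, 0.99)

print(f"flat spectrum keeps {k_flat}/{d} dims, steep spectrum keeps {k_steep}/{d} dims")
```

Under the same variance target, the flat spectrum needs far more dimensions than the steep one, which is the intuition behind protecting layers with flat spectra.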
Interaction with Model Architecture Variants
Tied embeddings: Some models tie the input embedding matrix and the output LM head. SliceGPT’s treatment of these as separate matrices would break the tie. The codebase handles this by only transforming one of them and re-tying after compression.
Rotary Position Embeddings (RoPE): LLAMA2 uses RoPE for positional encoding. RoPE is applied to Q and K after the projections, operating in the per-head space (dimension $d_{\text{head}}$). Since SliceGPT does not change $d_{\text{head}}$, RoPE is unaffected.
ALiBi Positional Biases (OPT): Additive biases in attention scores, again in the per-head space. Unaffected by residual-stream slicing.
Reproducibility Notes
- Code: github.com/microsoft/TransformerCompression (MIT license)
- Calibration data: HuggingFace `allenai/c4` English subset; 256 × 2048 tokens; takes ~30 min to preprocess
- Compression runtime: ~1–2 hours on 4×A100-80GB for LLAMA2-70B; single-GPU is sufficient for ≤13B models
- Evaluation: EleutherAI LM-Eval-Harness v0.3+; 7-task zero-shot average
- Determinism: Fully deterministic given fixed calibration sequence order; no randomness after calibration sampling
- Dependencies: PyTorch ≥ 2.0, `transformers`, `datasets`, `scipy.linalg.eigh` (for covariance eigendecomposition)
- Memory for compression: Requires holding full-precision weights plus one layer’s covariance matrix at a time; ~160GB peak for LLAMA2-70B
The algorithm is straightforward: ~200 lines of PyTorch to implement from scratch, making SliceGPT one of the most accessible papers in post-training compression for pedagogical purposes.
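The memory note above implies that the covariance can be accumulated batch by batch rather than by stacking all calibration activations. A minimal sketch under that assumption follows. It uses `np.linalg.eigh` in place of `scipy.linalg.eigh` to stay dependency-light, and `accumulate_covariance` is a hypothetical helper, not a function from the TransformerCompression codebase.

```python
import numpy as np

def accumulate_covariance(batches, d):
    """Sum X^T X over batches of shape (batch, seq, d), holding one d x d matrix."""
    cov = np.zeros((d, d), dtype=np.float64)   # accumulate in float64 for stability
    for acts in batches:
        flat = np.asarray(acts, dtype=np.float64).reshape(-1, d)
        cov += flat.T @ flat
    return cov

rng = np.random.default_rng(0)
d = 32
# Random stand-ins for one layer's calibration activations (8 batches of 4 x 16 tokens).
batches = [rng.standard_normal((4, 16, d)).astype(np.float32) for _ in range(8)]

cov = accumulate_covariance(batches, d)

# Symmetric eigendecomposition; reverse so the largest eigenvalue comes first.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(f"covariance shape: {cov.shape}, top eigenvalue: {eigvals[0]:.2f}")
```

Only the $d \times d$ accumulator and the current batch are held in memory at once, which is why the peak footprint is dominated by the model weights rather than by the calibration data.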
Summary: Design Decisions at a Glance
Before concluding, here is a quick reference capturing SliceGPT’s key design choices and their implications:
| Decision | Rationale | Open Gap |
|---|---|---|
| PCA basis (not random Q) | Minimizes activation reconstruction error (Eckart-Young optimal) | Requires calibration forward pass |
| Uniform sparsity by default | Simple; near-optimal for large models | Suboptimal for small models — non-uniform is better |
| Absorb RMSNorm into weights | Exact simplification; no extra ops at inference | Only works for diagonal-scale norms |
| Preserve MLP intermediate dimension | Avoids second PCA pass | Leaves MLP FLOP savings on the table |
| 256-sequence calibration | Sufficient for stable PCA; low overhead | May be domain-sensitive for specialized models |
| Optional fine-tuning | Avoids training setup for large models | Small models benefit significantly from even 1 epoch |
Each “Open Gap” row is a concrete future research direction. Together they sketch a roadmap for extending SliceGPT to higher compression ratios and broader deployment scenarios.
Conclusion
SliceGPT makes a clean theoretical contribution — the computational invariance theorem — and translates it directly into an engineering outcome: smaller, faster, hardware-agnostic transformer inference. The insight that an orthogonal basis change is transparent to the computation, and that PCA identifies the optimal basis for subsequent truncation, is both elegant and practically powerful.
At 70B scale, the method delivers compelling results: 25% compression with 99% task performance, halved GPU count, and 34–36% FLOP reduction — all without custom kernels. For practitioners deploying LLAMA2-70B or similar models, SliceGPT represents one of the most deployment-friendly compression options available.
The method’s limitations are equally clear: it is most effective for large (≥30B) models, has been validated primarily on short zero-shot classification tasks, leaves the MLP intermediate dimension untouched, and does not address the KV cache. These are not disqualifying limitations but they define the boundary conditions for when SliceGPT is the right tool.
For researchers building on this work, the most impactful next steps are: MLP intermediate slicing, per-head KV-cache reduction via head-dimension PCA, extended evaluation on reasoning and code tasks, and composability with quantization. The computational invariance theorem itself is a result worth studying independently — it may underpin future compression methods for other neural architectures beyond transformers.
Personal take: SliceGPT is one of the most pedagogically clean papers in post-training compression. The core insight fits in five lines of algebra, the code is minimal and well-commented, and the 70B results are genuinely impressive. Reading the computational invariance proof is time well spent for anyone working with transformer internals.
Related Work and Further Reading
For deeper context:
- SparseGPT (Frantar & Alistarh, ICML 2023) — unstructured counterpart; compare their layer-wise reconstruction against SliceGPT’s PCA calibration
- SVD-LLM (Wang et al., 2024) — weight-level SVD for LLMs; contrasts with SliceGPT’s activation-space philosophy
- ASVD (Yuan et al., 2023) — activation-aware SVD, thematically closest to SliceGPT but at the weight level
- QuIP# (Tseng et al., ICML 2024) — uses random orthogonal incoherence transforms before quantization; the orthogonal-transform idea is mathematically related to SliceGPT’s basis change, applied to enable better quantization rather than slicing
- LLM-Pruner (Ma et al., 2023) — gradient-guided structural pruning; shows how heuristic-based methods compare in accuracy-compression trade-off
- TransformerCompression (GitHub) — official open-source implementation by Microsoft Research; clean, well-documented, actively maintained
Conceptual Reading Path
For a reader new to post-training compression, the recommended reading order is:
1. This paper (SliceGPT) — start here to understand the theoretical framework
2. SparseGPT — see how unstructured methods handle the same calibration-based problem differently
3. SVD-LLM / ASVD — compare weight-space vs. activation-space SVD approaches directly
4. QuIP# — see how the same orthogonal-transform idea is applied in a quantization context
5. LLM-Pruner — understand gradient-guided structural pruning to appreciate SliceGPT’s calibration-only simplicity
This path builds a coherent mental model: the common thread across all these methods is using calibration data to guide compression decisions, with each method differing in what is compressed (weights, activations, structure) and how the calibration signal is used (Hessian, PCA, gradient).