DoRA: Weight-Decomposed Low-Rank Adaptation — Technical Review

Review date: 2026-05-22
Reviewer: Zhongzhu Zhou
Paper: DoRA: Weight-Decomposed Low-Rank Adaptation
Authors: Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen (NVIDIA & HKUST)
arXiv: 2402.09353v6, 2024-07-09
Venue: ICML 2024 (Oral), PMLR 235

Short answer

LoRA is the workhorse of parameter-efficient fine-tuning — cheap, fast, and practical. But it consistently trails full fine-tuning (FT) in accuracy. The standard explanation has been “LoRA just doesn’t have enough trainable parameters.” DoRA challenges that story with hard evidence: the problem is not parameter count, it’s the structure of the update.

The key insight: full fine-tuning and LoRA update weights in qualitatively different ways. FT tends to make either a large magnitude change with a small directional change or the reverse, rather than changing both in proportion. LoRA, by contrast, couples them: across layers and training steps, its magnitude and direction changes rise and fall together. It lacks the fine-grained control to change one component strongly while holding the other nearly fixed.

DoRA fixes this by borrowing from weight normalization (Salimans & Kingma, 2016): decompose any weight matrix $W_0$ into a magnitude vector $\mathbf{m}$ and a direction matrix $V$, then treat them as separate trainable quantities. Because the direction is high-dimensional, LoRA is applied there for efficiency. The magnitude, just one scalar per column, is trained directly. After merging, the weight used at inference is a plain dense matrix of the original shape, so there’s zero inference overhead.

In concrete numbers: on LLaMA-3-8B commonsense reasoning, DoRA surpasses LoRA by +4.4 points while using virtually the same parameter budget (0.71% vs 0.70%). DoRA† (half the rank of LoRA) beats LoRA by +4.2 points with half the trainable parameters. On LLaVA-1.5-7B visual instruction tuning, DoRA improves by +0.7 points over LoRA and +1.1 over full FT. The average improvement holds across every model and rank setting reported below, although a few individual benchmarks (for example VQAv2 in §3.3) favor LoRA.

The paper is also a good case study in analysis-driven design: the authors first built a diagnostic tool (weight decomposition analysis), found a structural difference between FT and LoRA, and then directly designed DoRA to close that gap. The resulting method is conceptually tight and the empirical improvements are unusually reproducible.

1. Prerequisites

This section is for readers who have worked with transformers but haven’t studied the theory of weight normalization, LoRA internals, or parameter-efficient fine-tuning design space. Skip §1.1–1.3 if you’ve read the LoRA and AdaLoRA papers; skip §1.4–1.5 if you’ve read the weight normalization paper.

1.1 Full fine-tuning and its cost

Given a pretrained model $f_\theta$ with parameter vector $\theta \in \mathbb{R}^P$, full fine-tuning (FT) finds:

\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}_{\text{target}})

with $\theta$ initialized at the pretrained weights $\theta_0$. For a 7-billion-parameter LLM, this means storing and updating 7B float32 parameters (28 GB) every step, plus optimizer states (Adam stores first and second moments: another 56 GB), plus activations for backprop. In practice, FT requires 80–160 GB of GPU memory for a 7B model, which rules it out for most practitioners.
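The arithmetic behind these figures can be reproduced with a minimal sketch. It assumes float32 for weights, gradients, and both Adam moments (the gradient term is an added assumption), and it ignores activations, which depend on batch size and sequence length. The function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4

def full_ft_memory_gb(num_params: int) -> dict:
    """Rough float32 memory for full fine-tuning with Adam, excluding activations."""
    weights = num_params * BYTES_PER_FLOAT32
    grads = num_params * BYTES_PER_FLOAT32              # assumption: fp32 gradients
    adam_moments = 2 * num_params * BYTES_PER_FLOAT32   # first and second moment
    total = weights + grads + adam_moments
    parts = [("weights", weights), ("grads", grads),
             ("adam_moments", adam_moments), ("total", total)]
    return {name: nbytes / 1e9 for name, nbytes in parts}

mem_7b = full_ft_memory_gb(7_000_000_000)
print(mem_7b)
```

Under these assumptions the total comes to 112 GB before activations, which sits inside the 80–160 GB range quoted above.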

1.2 The PEFT design space

Parameter-efficient fine-tuning (PEFT) methods reduce trainable parameters by orders of magnitude. The design space has three broad families:

Adapter-based: Insert small bottleneck modules (typically linear → nonlinear → linear with a narrow middle dimension) at specific points in the transformer (after self-attention, after FFN). Only the adapter weights are trained. Sequential adapters add latency because they cannot be merged; parallel adapters can sometimes be fused.

Prompt-based / prefix-tuning: Prepend trainable “soft tokens” to the input sequence or to each layer’s key-value cache (prefix tuning). The backbone is frozen; only the soft tokens are optimized. These are sensitive to initialization and usually underperform adapters.

Low-rank update (LoRA family): Model the weight update $\Delta W$ as a low-rank product $BA$ rather than a full-rank matrix. After training, $\Delta W$ is merged into $W_0$ with zero inference overhead. This is the dominant paradigm and the focus of DoRA.

1.3 LoRA: mathematical formulation

For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA (Hu et al., ICLR 2022) restricts the weight update to be low-rank:

W' = W_0 + \Delta W = W_0 + BA \tag{1}

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

Initialization strategy: $A$ is initialized randomly (Gaussian in the original LoRA paper; Kaiming uniform in the DoRA setup and in common implementations); $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training, so the model starts from exactly the pretrained weights.

Parameter count: Instead of $dk$ parameters, LoRA uses $r(d+k)$ parameters. For a typical attention projection with $d = k = 4096$ and $r = 16$, this is $16 \times 8192 = 131072$ versus $16777216$, a 128× reduction.
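The count above is easy to check directly. The helper names below are illustrative, not taken from any library.

```python
def dense_params(d: int, k: int) -> int:
    """Parameters in a full d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096
r = 16
n_dense = dense_params(d, k)
n_lora = lora_params(d, k, r)
reduction = n_dense / n_lora
print(n_lora, n_dense, reduction)
```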

Inference merge: At deployment, compute $W' = W_0 + BA$ once and store the merged dense matrix. The forward pass is then identical to the original model’s, with no overhead.

Scaling: In practice, LoRA applies a scaling factor $\alpha/r$ to $\Delta W$, where $\alpha$ is a hyperparameter (often set to $r$ or $2r$; the example in §6.1 uses $r = 16$, $\alpha = 32$). This keeps the effective update size roughly comparable across ranks, which reduces the need to retune the learning rate when $r$ changes.
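A small NumPy sketch of the merge and scaling described above. The shapes and random values are toy assumptions, and $B$ is nonzero here to mimic a trained adapter. The sketch checks that the adapter path and the merged matrix give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4            # toy sizes, chosen only for illustration
W0 = rng.standard_normal((d, k))       # frozen pretrained weight
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))        # nonzero, as if training had already happened
scale = alpha / r
x = rng.standard_normal(k)

# Training-time view: base path plus scaled low-rank path.
y_adapter = W0 @ x + scale * (B @ (A @ x))

# Deployment view: fold the update into one dense matrix, once.
W_merged = W0 + scale * (B @ A)
y_merged = W_merged @ x

# With B = 0 (the initialization), the update vanishes entirely.
delta_at_init = scale * (np.zeros((d, r)) @ A)
```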

1.4 Weight normalization and the magnitude-direction decomposition

Weight normalization (Salimans & Kingma, NeurIPS 2016) reparameterizes a weight vector $\mathbf{w}$ as:

\mathbf{w} = g \cdot \frac{\mathbf{v}}{\|\mathbf{v}\|} \tag{2}

where $g \in \mathbb{R}$ is a scalar magnitude and $\mathbf{v} \in \mathbb{R}^d$ is a direction vector. The motivation is conditioning: if the gradient covariance is better aligned with the identity matrix, SGD converges faster. The key property is that $g$ and $\mathbf{v}$ decouple magnitude from direction, so the optimizer can adjust them independently at different rates.

For a matrix $W \in \mathbb{R}^{d \times k}$ with columns $\{W_n\}_{n=1}^k$ (each column is a weight vector), the column-wise generalization is:

W = \mathbf{m} \cdot \frac{V}{\|V\|_c} \tag{3}

where $\mathbf{m} \in \mathbb{R}^{1 \times k}$ is the row vector of column norms, $V \in \mathbb{R}^{d \times k}$ is the direction matrix, and $\|\cdot\|_c$ denotes the column-wise $\ell_2$ norm, so that $V / \|V\|_c$ divides each column of $V$ by its $\ell_2$ norm. After this, every column of $V / \|V\|_c$ is a unit vector.

The difference from weight normalization is the initialization: weight normalization trains from random initialization (sensitive to initialization), whereas DoRA initializes from the pretrained weights ($\mathbf{m} = \|W_0\|_c$ and $V = W_0$ at the start), which sidesteps initialization sensitivity.
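The column-wise decomposition in Eq. (3) can be sketched in a few lines of NumPy. The function names `decompose` and `compose` are illustrative. The sketch checks that recomposing the two parts recovers the original matrix and that the normalized columns have unit length.

```python
import numpy as np

def decompose(W):
    """Split W into column magnitudes m (1 x k) and a direction base V (= W)."""
    m = np.linalg.norm(W, axis=0, keepdims=True)
    return m, W

def compose(m, V):
    """Rebuild a weight as m * V / ||V||_c, column by column."""
    return m * (V / np.linalg.norm(V, axis=0, keepdims=True))

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))        # toy stand-in for a pretrained weight
m, V = decompose(W)
W_rebuilt = compose(m, V)
unit_cols = V / np.linalg.norm(V, axis=0, keepdims=True)
unit_norms = np.linalg.norm(unit_cols, axis=0)
```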

1.5 What “learning pattern” means in this context

DoRA introduces a weight decomposition analysis (Section 3 of the paper). Given a fine-tuned weight $W^t$ at training step $t$ and the pretrained weight $W_0$, decompose both:

W^t = \mathbf{m}^t \frac{V^t}{\|V^t\|_c}, \quad W_0 = \mathbf{m}^0 \frac{V^0}{\|V^0\|_c}

Then define the magnitude difference:

\Delta M^t = \frac{1}{k} \sum_{n=1}^{k} |m_n^t - m_n^0| \tag{4}

and the directional difference:

\Delta D^t = \frac{1}{k} \sum_{n=1}^{k} \left(1 - \cos(V_n^t, W_0^n)\right) \tag{5}

where $\cos(\cdot, \cdot)$ is cosine similarity and $V_n^t, W_0^n$ are the $n$-th columns of $V^t$ and $W_0$.

By plotting $(\Delta D^t, \Delta M^t)$ as scatter plots across layers and training steps, the authors reveal that:

  • Full FT: Points scatter with a negative slope (large direction change correlates with small magnitude change, and vice versa).
  • LoRA: Points scatter with a positive slope (direction and magnitude always increase/decrease together).
  • DoRA: Points scatter with a negative slope similar to FT.

The Pearson correlation between $\Delta D$ and $\Delta M$ is $-0.62$ for FT, $+0.83$ for LoRA, and $-0.31$ for DoRA, confirming that DoRA’s learning pattern is qualitatively more similar to FT than LoRA’s is.

2. Method

2.1 The core problem with LoRA’s coupled updates

Figure 1: FT vs LoRA vs DoRA — learning patterns
graph TD
    subgraph FT["Full Fine-Tuning (FT)"]
        F1["Large ΔD → Small ΔM (or vice versa)"]
        F2["Negative slope in scatter plot"]
        F3["Correlation(ΔD, ΔM) = −0.62"]
    end
    subgraph LR["LoRA"]
        L1["ΔD and ΔM always proportional"]
        L2["Positive slope in scatter plot"]
        L3["Correlation(ΔD, ΔM) = +0.83"]
        L4["Cannot make subtle directional change\nwithout also changing magnitude"]
    end
    subgraph DR["DoRA"]
        D1["Large ΔD → Small ΔM (or vice versa)"]
        D2["Negative slope (like FT)"]
        D3["Correlation(ΔD, ΔM) = −0.31"]
        D4["Decoupled by design"]
    end
    FT -- "DoRA mimics" --> DR
    LR -- "DoRA improves" --> DR

Why does positive slope hurt? When LoRA needs a strong directional change (rotating the weight vector toward a new direction), its positive coupling pushes the magnitude change up as well. Conversely, when only a small directional update is needed, the magnitude update stays small too, even if a larger rescaling would help. This rigid coupling forces LoRA into a suboptimal learning trajectory: in the $(\Delta D, \Delta M)$ plane it cannot trade a large change in one quantity against a small change in the other, the way FT does.

2.2 The DoRA formulation

Drawing on the weight decomposition from Eq. (3), DoRA decomposes the pretrained weight into magnitude and direction, then fine-tunes both:

W' = \mathbf{m} \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c} = \mathbf{m} \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c} \tag{6}

What’s trained:

  • $\mathbf{m} \in \mathbb{R}^{1 \times k}$: the magnitude vector, trained directly (one scalar per column, tiny parameter count)
  • $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$: the LoRA matrices for the directional update $\Delta V$

What’s frozen: $V = W_0$ (the original weight, used as the frozen base for direction)

Initialization: At the start of training, $B = 0$ (so $\Delta V = 0$), meaning $V + \Delta V = W_0$. The magnitude is $\mathbf{m} = \|W_0\|_c$. This gives $W' = \|W_0\|_c \cdot W_0 / \|W_0\|_c = W_0$, so DoRA starts exactly at the pretrained weights, same as LoRA.

Inference merge: After training, $W' = \mathbf{m} \cdot (W_0 + BA) / \|W_0 + BA\|_c$ is a dense matrix of the same shape as $W_0$. It can be pre-computed and stored, with zero inference overhead.
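Two properties of Eq. (6) are easy to confirm numerically: the weight equals $W_0$ at initialization, and after any update the column norms of the merged weight equal $\mathbf{m}$. The NumPy sketch below uses toy shapes and random stand-ins for trained values; `dora_weight` is an illustrative name.

```python
import numpy as np

def col_norm(M):
    return np.linalg.norm(M, axis=0, keepdims=True)   # shape (1, k)

def dora_weight(W0, B, A, m):
    """Eq. (6): m * (W0 + BA) / ||W0 + BA||_c."""
    V_prime = W0 + B @ A
    return m * (V_prime / col_norm(V_prime))

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))

# Initialization: B = 0 and m = ||W0||_c, so W' should equal W0.
W_init = dora_weight(W0, np.zeros((d, r)), A, col_norm(W0))

# "After training" (random stand-ins): the merged weight is dense, same shape as W0,
# and its column norms are exactly the learned magnitudes.
B_trained = rng.standard_normal((d, r))
m_trained = 1.5 * col_norm(W0)
W_merged = dora_weight(W0, B_trained, A, m_trained)
```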

2.3 The algorithm, step by step

Algorithm 1: DoRA Fine-Tuning

Input: Pretrained weight W₀ ∈ ℝ^{d×k}, rank r, target task dataset D
Output: Fine-tuned merged weight W' ∈ ℝ^{d×k}

---
Initialization:
  1.  Compute m ← column-wise ℓ₂ norm of W₀      # m ∈ ℝ^{1×k}
  2.  Set V ← W₀                                   # frozen direction base
  3.  Initialize A ~ Kaiming_uniform(r, k)         # LoRA A matrix
  4.  Initialize B ← 0_{d×r}                       # LoRA B matrix (zero init)
  5.  Mark as trainable: {m, A, B}
  6.  Mark as frozen:    {V (= W₀)}

Forward pass (each training step):
  7.  Compute ΔV ← B @ A                           # low-rank directional delta
  8.  Compute V' ← V + ΔV                          # updated direction (unnorm.)
  9.  Compute norms ← column_norms(V')             # ℝ^{1×k}, treated as CONSTANT
                                                   # (detach from grad graph)
  10. Compute W' ← m * (V' / norms)               # ∈ ℝ^{d×k}
  11. Compute output ← W' @ x

Backward pass:
  12. Compute ∂L/∂W' via autograd
  13. Gradient w.r.t. m:  ∂L/∂m = (∂L/∂W') · V' / norms
                                 = ||∇_{W'} L|| · cos(∇_{W'} L, v')   [Eq. 8]
  14. Gradient w.r.t. V': ∂L/∂V' = (m / norms) · ∂L/∂W'             [Eq. 7]
                           (propagated to A and B through ΔV = BA)
  15. Update {m, A, B} with optimizer step

Post-training merge (once, before deployment):
  16. Compute W' ← m * (W₀ + B@A) / column_norms(W₀ + B@A)
  17. Store W' as the deployed weight (dense, same shape as W₀)
  18. Discard m, A, B

Key implementation note (line 9): The column norms $\|V'\|_c$ are computed dynamically each step (so they track the evolving $\Delta V$), but they are detached from the gradient graph. This means $\nabla_{V'}\mathcal{L}$ is computed as if the norms were constant, i.e., $\nabla_{V'}\mathcal{L} \approx (\mathbf{m}/C) \cdot \nabla_{W'}\mathcal{L}$ where $C = \|V'\|_c$. This removes a significant memory overhead in backprop (about 24% less GPU memory on LLaMA-7B) with negligible accuracy loss ($\approx 0.2$ points on commonsense reasoning).
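Steps 7 to 15 of Algorithm 1 can be sketched in NumPy with hand-written gradients. This is a toy illustration, not the authors’ implementation. It assumes a squared-error loss on a single input, plain SGD instead of AdamW, and a small uniform draw standing in for Kaiming initialization.

```python
import numpy as np

def col_norm(M):
    return np.linalg.norm(M, axis=0, keepdims=True)     # shape (1, k)

def dora_step(W0, A, B, m, x, y, lr):
    """One SGD step on 0.5 * ||W' x - y||^2, with column norms treated as constants."""
    V_prime = W0 + B @ A                    # steps 7-8
    C = col_norm(V_prime)                   # step 9: no gradient flows through C
    W_prime = m * (V_prime / C)             # step 10
    residual = W_prime @ x - y              # step 11 and the loss residual
    loss = 0.5 * float(residual @ residual)

    G = np.outer(residual, x)               # dL/dW', shape (d, k)
    grad_m = (G * V_prime / C).sum(axis=0, keepdims=True)   # step 13
    grad_V = (m / C) * G                    # step 14, detached-norm form
    grad_B = grad_V @ A.T                   # chain rule through delta_V = B @ A
    grad_A = B.T @ grad_V
    updated = (A - lr * grad_A, B - lr * grad_B, m - lr * grad_m)   # step 15
    return loss, updated, grad_A

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2
W0 = rng.standard_normal((d, k))
A = rng.uniform(-0.1, 0.1, size=(r, k))     # stand-in for Kaiming-uniform init
B = np.zeros((d, r))                        # zero init, so W' = W0 at the start
m = col_norm(W0)
x, y = rng.standard_normal(k), rng.standard_normal(d)

loss0, (A1, B1, m1), grad_A0 = dora_step(W0, A, B, m, x, y, lr=1e-3)
```

Because $B$ starts at zero, the first step moves only $\mathbf{m}$ and $B$; the gradient for $A$ is exactly zero until $B$ becomes nonzero.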

2.4 Gradient analysis: why decomposition stabilizes LoRA

This is the most mathematically interesting part of the paper. Let’s write out the gradient equations under the detached-norm approximation.

Starting from DoRA’s forward pass (treating $\|V'\|_c = C$ as constant, per the optimization from §2.3):

W' = \mathbf{m} \cdot \frac{V'}{C}, \quad V' = V + \Delta V = W_0 + BA

Gradient of loss w.r.t. $V'$ (and thus w.r.t. $\Delta V = BA$):

\nabla_{V'} \mathcal{L} = \frac{\mathbf{m}}{C} \cdot \nabla_{W'} \mathcal{L} \tag{7}

This is a per-column rescaling of the weight gradient: each column keeps its direction, but its size is modulated by $m_n / C_n$. Notice what this does:

  • Columns whose learned magnitude is large relative to their current norm ($m_n / C_n$ large) receive larger gradients.
  • The effect resembles a per-column preconditioner on the directional update.

Gradient of loss w.r.t. $\mathbf{m}$ (column-wise, where $\nabla_{W'_n}\mathcal{L}$ is the $n$-th column of $\nabla_{W'}\mathcal{L}$):

\nabla_{m_n} \mathcal{L} = \frac{\nabla_{W'_n} \mathcal{L} \cdot V'_n}{\|V'_n\|} = \|\nabla_{W'_n} \mathcal{L}\| \cdot \cos(\nabla_{W'_n}\mathcal{L}, V'_n) \tag{8}

Key insight from Eq. (8): The gradient for the magnitude scalar depends on the cosine alignment between the loss gradient and the current direction vector. When the loss gradient is nearly perpendicular to the current direction (small cosine, so the weight mostly needs to rotate), $\nabla_{m_n}\mathcal{L}$ is small, so the magnitude barely changes while the direction updates. Conversely, when the gradient aligns well with the current direction (large cosine, so the weight mostly needs to scale up or down without rotating), the magnitude update is large. This is consistent with the negative correlation between $\Delta D$ and $\Delta M$ observed empirically.

In other words, Eq. (8) mathematically explains why DoRA exhibits FT-like learning patterns: the gradient geometry automatically decouples direction updates from magnitude updates.
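Eq. (8) can be checked numerically. The sketch below uses random stand-ins for $V'$ and for $\nabla_{W'}\mathcal{L}$, and a loss that is linear in $W'$, chosen only so that its gradient with respect to $W'$ is known exactly. It compares the inner-product form, the norm-times-cosine form, and central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 4
V_prime = rng.standard_normal((d, k))    # stands in for W0 + BA
G = rng.standard_normal((d, k))          # stands in for dL/dW'
m = rng.uniform(0.5, 2.0, size=k)
C = np.linalg.norm(V_prime, axis=0)      # column norms, held constant

def loss(m_vec):
    # Linear in W', so dL/dW' is exactly G (an assumption made for the check).
    W_prime = m_vec * V_prime / C
    return float((G * W_prime).sum())

# First form of Eq. (8): inner product with the unit direction, per column.
grad_inner = (G * V_prime).sum(axis=0) / C

# Second form: gradient norm times cosine, per column.
g_norm = np.linalg.norm(G, axis=0)
cosine = (G * V_prime).sum(axis=0) / (g_norm * C)
grad_cos = g_norm * cosine

# Independent check by central finite differences on each m_n.
eps = 1e-6
grad_fd = np.array([(loss(m + eps * e) - loss(m - eps * e)) / (2 * eps)
                    for e in np.eye(k)])
```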

Figure 2: Gradient flow in DoRA vs LoRA
graph LR
    subgraph LoRA_grad["LoRA Backward Pass"]
        lg1["∂L/∂W' ∈ ℝ^{d×k}"]
        lg2["∂L/∂B = (∂L/∂W') Aᵀ"]
        lg3["∂L/∂A = Bᵀ (∂L/∂W')"]
        lg1 --> lg2
        lg1 --> lg3
        lg4["ΔM and ΔD always coupled\n(via BA product)"]
        lg2 --> lg4
        lg3 --> lg4
    end
    subgraph DoRA_grad["DoRA Backward Pass"]
        dg1["∂L/∂W' ∈ ℝ^{d×k}"]
        dg2["∂L/∂m = (∂L/∂W')·V'/‖V'‖\n= ‖grad‖·cos(grad, v')"]
        dg3["∂L/∂V' = (m/C)·∂L/∂W'\n→ propagates to A, B"]
        dg1 --> dg2
        dg1 --> dg3
        dg4["cos(grad, v') large → big Δm, small ΔD\ncos(grad, v') small → small Δm, big ΔD\n(negative correlation, like FT)"]
        dg2 --> dg4
        dg3 --> dg4
    end

2.5 The subtle “decoupling” argument, more carefully

One might wonder: why does decoupling magnitude from direction help, specifically? Here is the precise argument from the paper.

Consider two hypothetical update scenarios $S_1$ and $S_2$ with equal gradient norms: $\|S_1\| = \|S_2\|$. In $S_1$, the update is mostly along the current weight direction (large $|\cos(\nabla L, \mathbf{v})|$, so a large magnitude change and a small directional change). In $S_2$, the update is mostly perpendicular to the current direction (small $|\cos(\nabla L, \mathbf{v})|$, so a small magnitude change and a large directional change).

For LoRA:

  • $\nabla_A \mathcal{L} = B^\top \nabla_{W'}\mathcal{L}$, $\nabla_B \mathcal{L} = \nabla_{W'}\mathcal{L} \cdot A^\top$
  • These gradients update both magnitude and direction implicitly through $BA$. There’s no mechanism to sense whether the current step should prioritize magnitude or direction.

For DoRA:

  • In scenario $S_1$: large $\cos$ $\Rightarrow$ large $\nabla_m \mathcal{L}$ $\Rightarrow$ large magnitude update, small direction update (because $\nabla_{V'}\mathcal{L}$ is mostly in the direction already captured by $\mathbf{m}$).
  • In scenario $S_2$: small $\cos$ $\Rightarrow$ small $\nabla_m \mathcal{L}$ $\Rightarrow$ small magnitude update, large direction update.

This auto-routing of gradient energy between magnitude and direction is the core efficiency gain.
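The two scenarios can be made concrete with a two-dimensional toy example. The vectors below are hand-picked for illustration, not taken from the paper. Both gradients have unit norm; only their split between the current direction and its orthogonal complement differs.

```python
import numpy as np

v = np.array([1.0, 0.0])                       # current unit direction of one column
g_s1 = np.array([0.9, np.sqrt(1 - 0.9**2)])    # S1: mostly parallel to v
g_s2 = np.array([0.1, np.sqrt(1 - 0.1**2)])    # S2: mostly perpendicular to v

def split(g, v):
    """Return (parallel, perpendicular) sizes of g relative to unit vector v."""
    parallel = float(g @ v)                    # equals ||g|| cos(g, v): the magnitude gradient
    perpendicular = float(np.linalg.norm(g - parallel * v))  # the part that rotates the column
    return parallel, perpendicular

par1, perp1 = split(g_s1, v)
par2, perp2 = split(g_s2, v)
print(par1, perp1, par2, perp2)
```

With equal gradient norms, $S_1$ puts most of its signal into the magnitude gradient and $S_2$ into the component that rotates the column.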

2.6 DVoRA: DoRA + VeRA

DoRA is modular: the low-rank component $\Delta V = BA$ can be replaced by any LoRA variant. The paper demonstrates this with VeRA (Kopiczko et al., ICLR 2024).

VeRA (Vector-based Random Matrix Adaptation): freeze a single shared pair of random matrices $\{A_{\text{shared}}, B_{\text{shared}}\}$ across all layers; use only layer-specific scaling vectors $\{b_\ell, d_\ell\}$ as trainable parameters:

\Delta W_\ell = \text{diag}(b_\ell) \cdot B_{\text{shared}} \cdot \text{diag}(d_\ell) \cdot A_{\text{shared}}

VeRA achieves 10× fewer trainable parameters than LoRA at the cost of some accuracy. DVoRA plugs VeRA in as the directional update in DoRA:

W'_\ell = \mathbf{m}_\ell \cdot \frac{W_{0,\ell} + \text{diag}(b_\ell) B_{\text{shared}} \text{diag}(d_\ell) A_{\text{shared}}}{\|W_{0,\ell} + \ldots\|_c}

On MT-Bench with LLaMA2-7B, DVoRA achieves a score of 6.0, matching DoRA and surpassing both VeRA (5.5) and LoRA (5.7), with only 0.04% trainable parameters versus LoRA’s 2.31%. That is roughly a 58× parameter reduction relative to LoRA, at a higher score.
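A minimal NumPy sketch of the DVoRA weight for one layer, under assumptions: toy shapes, Gaussian frozen matrices, and $b_\ell$ initialized to zero so the layer starts at $W_0$. The function names and the per-layer parameter counts are illustrative bookkeeping, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 4
A_shared = rng.standard_normal((r, k))   # frozen, shared across layers
B_shared = rng.standard_normal((d, r))   # frozen, shared across layers

def vera_delta(b, d_vec):
    """diag(b) @ B_shared @ diag(d_vec) @ A_shared, with b in R^d and d_vec in R^r."""
    return np.diag(b) @ B_shared @ np.diag(d_vec) @ A_shared

def dvora_weight(W0, m, b, d_vec):
    """DoRA's magnitude/direction form with VeRA as the directional update."""
    V_prime = W0 + vera_delta(b, d_vec)
    return m * V_prime / np.linalg.norm(V_prime, axis=0, keepdims=True)

W0 = rng.standard_normal((d, k))
m0 = np.linalg.norm(W0, axis=0, keepdims=True)
b0 = np.zeros(d)                 # assumption: zero init, so the update starts at zero
d0 = np.full(r, 0.1)             # assumption: small constant init
W_init = dvora_weight(W0, m0, b0, d0)

# Trainable parameters per layer for this toy shape.
lora_count = r * (d + k)
vera_count = d + r
dvora_count = d + r + k          # VeRA vectors plus one magnitude per column
```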

2.7 Architecture overview diagram

Figure 3: DoRA System Architecture
graph TD
    subgraph initialization["Initialization (once, from W₀)"]
        W0["W₀ ∈ ℝ^{d×k}\n(pretrained, frozen)"]
        decompose["Decompose into\nm = ‖W₀‖_c (trainable)\nV = W₀ (frozen direction base)"]
        W0 --> decompose
    end

    subgraph lora_branch["LoRA Branch (trainable)"]
        A["A ∈ ℝ^{r×k}\n(Kaiming init)"]
        B["B ∈ ℝ^{d×r}\n(zero init)"]
        delta["ΔV = B@A\n∈ ℝ^{d×k}"]
        A --> delta
        B --> delta
    end

    subgraph forward["Forward Pass"]
        add["V' = W₀ + ΔV"]
        norm["C = ‖V'‖_c\n(detached from grad)"]
        mag["m ∈ ℝ^{1×k}\n(trainable scalar per column)"]
        out["W' = m · (V'/C)\n∈ ℝ^{d×k}"]
        delta --> add
        W0 --> add
        add --> norm
        add --> out
        norm --> out
        mag --> out
    end

    subgraph merge["Merge (inference, once)"]
        merged["W'_merged = m · (W₀ + B@A) / ‖W₀+B@A‖_c\nStore as dense matrix, discard {m,A,B}"]
        out --> merged
    end

    decompose --> lora_branch
    decompose --> forward

3. Experiments

3.1 Commonsense reasoning (LLaMA family)

Setup: Eight commonsense reasoning benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA. Training data: the 8 task training sets combined (following the LLM-Adapters protocol from Hu et al., 2023). Models: LLaMA-7B, LLaMA-13B, LLaMA2-7B, LLaMA3-8B. Evaluated baselines: Prefix, Series adapter, Parallel adapter, LoRA, ChatGPT (zero-shot CoT).

Figure 4: Commonsense Reasoning Results Summary
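| Model | Method | #Params (%) | Avg. Accuracy |
| --- | --- | --- | --- |
| ChatGPT | zero-shot CoT | n/a | 77.0 |
| LLaMA-7B | LoRA | 0.83 | 74.7 |
| LLaMA-7B | DoRA† (r/2) | 0.43 | 77.5 |
| LLaMA-7B | DoRA | 0.84 | 78.4 (+3.7) |
| LLaMA-13B | LoRA | 0.67 | 80.5 |
| LLaMA-13B | DoRA | 0.68 | 81.5 (+1.0) |
| LLaMA2-7B | LoRA | 0.83 | 77.6 |
| LLaMA2-7B | DoRA | 0.84 | 79.7 (+2.1) |
| LLaMA3-8B | LoRA | 0.70 | 80.8 |
| LLaMA3-8B | DoRA | 0.71 | 85.2 (+4.4) |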

What to notice: The improvement is not uniform across model sizes — it’s +3.7 on 7B, +1.0 on 13B, then bigger again (+2.1 and +4.4) on the newer architectures. This suggests the benefit of DoRA is not purely a function of model size but also of the “distance” between LoRA and FT’s optimal learning pattern, which may vary by architecture.

DoRA† (half the rank of LoRA) consistently beats LoRA with half the parameters: +2.8/+1.0/+2.9/+4.2 points on 7B/13B/2-7B/3-8B respectively. This is arguably more practically important than the equal-rank comparison: DoRA can exceed LoRA’s accuracy with roughly half the trainable parameters. Training cost does not halve with it, since compute is dominated by the frozen backbone.
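The #Params column can be approximately reproduced from the model shape. The sketch below assumes the published LLaMA-7B dimensions (hidden size 4096, MLP size 11008, 32 layers, about 6.74B parameters), adapters on Q, K, V, Up, and Down as listed in §6.2, and ranks 32 and 16 for the full and halved settings, an inference from the rank table in §3.2. How the paper computes its percentage is not stated here, so small rounding differences are expected.

```python
TOTAL_PARAMS = 6_738_415_616          # LLaMA-7B parameter count
LAYERS = 32
# (in_features, out_features) of each adapted projection in one layer
TARGETS = {
    "q": (4096, 4096), "k": (4096, 4096), "v": (4096, 4096),
    "up": (4096, 11008), "down": (11008, 4096),
}

def lora_params(r: int) -> int:
    """LoRA parameters over all layers: r * (in + out) per adapted projection."""
    return LAYERS * sum(r * (i + o) for i, o in TARGETS.values())

def magnitude_params() -> int:
    """One magnitude scalar per weight column. For this target set the total is the
    same whether the norm runs over input or output features."""
    return LAYERS * sum(i for i, _ in TARGETS.values())

def pct(n: int) -> float:
    return 100.0 * n / TOTAL_PARAMS

lora_r32 = pct(lora_params(32))
dora_r32 = pct(lora_params(32) + magnitude_params())
dora_r16 = pct(lora_params(16) + magnitude_params())
print(round(lora_r32, 2), round(dora_r32, 3), round(dora_r16, 2))
```

Under these assumptions LoRA at rank 32 lands on 0.83% and the halved-rank DoRA on 0.43%, matching the table; full DoRA comes out near 0.85%, within rounding of the reported 0.84%.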

3.2 Rank robustness analysis

Setup: Fix LLaMA-7B, vary rank $r \in \{4, 8, 16, 32, 64\}$ for both LoRA and DoRA. Evaluate on commonsense reasoning.

Figure 5: Accuracy vs Rank (LLaMA-7B, Commonsense Reasoning)
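| Rank | LoRA Avg. | DoRA Avg. | Delta |
| --- | --- | --- | --- |
| r=4 | 39.5 | 61.9 | +22.4 |
| r=8 | 40.7 | 77.9 | +37.2 |
| r=16 | 70.9 | 77.5 | +6.6 |
| r=32 | 74.7 | 78.4 | +3.7 |
| r=64 | 65.8 | 72.1 | +6.3 |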

The most striking finding is the catastrophic failure of LoRA at r=4 and r=8 (39.5% and 40.7% — near random). DoRA maintains 61.9% at r=4 and 77.9% at r=8. This is a 37-point gap at r=8.

The explanation connects to §2.4: at very low ranks, LoRA’s coupled magnitude-direction updates waste gradient capacity, because the limited rank budget must simultaneously correct both magnitude and direction. DoRA separates them, so even with a tiny $r$ (small directional budget), the trainable $\mathbf{m}$ handles the magnitude correction, and the LoRA matrices focus entirely on directional updates.

3.3 Visual instruction tuning (LLaVA-1.5-7B)

Setup: Fine-tune LLaVA-1.5-7B (Vicuna-1.5-7B language model + CLIP ViT-L/336px vision encoder) on standard visual instruction tuning data. Evaluate on seven VL benchmarks: VQAv2, GQA, VisWiz, SQA, VQA^T, POPE, MMBench.

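| Method | #Params (%) | VQAv2 | GQA | VisWiz | SQA | VQA^T | POPE | MMBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FT | 100 | 78.5 | 61.9 | 50.0 | 66.8 | 58.2 | 85.9 | 64.3 | 66.5 |
| LoRA | 4.61 | 79.1 | 62.9 | 47.8 | 68.4 | 58.2 | 86.4 | 66.1 | 66.9 |
| DoRA | 4.63 | 78.6 | 62.9 | 52.2 | 69.9 | 57.0 | 87.2 | 66.1 | 67.6 |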

Note: On VisWiz (visual QA for blind users, more challenging), DoRA improves by +4.4 points over LoRA. On SQA (science QA), +1.5 points. DoRA’s overall average of 67.6 beats both LoRA (66.9) and FT (66.5). The fact that DoRA beats FT here likely indicates that the training data setup causes FT to overfit, while DoRA’s constrained optimization (low-rank direction + scalar magnitude) provides implicit regularization.

3.4 Image/video-text understanding (VL-BART)

Setup: Fine-tune VL-BART (CLIP-ResNet101 + BART-Base) on four image-text tasks (VQAv2, GQA, NLVR2, MSCOCO) and four video-text tasks (TVQA, How2QA, TVC, YC2C).

| Task | FT | LoRA | DoRA |
| --- | --- | --- | --- |
| Image avg. | 77.3 | 76.5 | 77.4 (+0.9) |
| Video avg. | 83.5 | (see below) | 85.4 (+1.9) |

DoRA nearly matches FT on image-text tasks (77.4 vs 77.3) while using only 6% of parameters. The +1.9 point gap on video-text is especially notable since video-text requires stronger temporal reasoning — suggesting DoRA’s fine-grained update control benefits tasks with higher adaptation complexity.

3.5 Training sample robustness

Setup: Fine-tune LLaMA2-7B and LLaMA-7B on instruction-tuning subsets of Alpaca (1000, 4000, 7000, 10000 samples). Evaluate on MT-Bench.

Figure 6: MT-Bench vs Training Set Size (LLaMA2-7B)
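| #Samples | LoRA | DoRA | VeRA | DVoRA |
| --- | --- | --- | --- | --- |
| 1,000 | 5.41 | 5.70 | 5.21 | 5.43 |
| 4,000 | 5.55 | 5.82 | 5.38 | 5.60 |
| 7,000 | 5.68 | 5.98 | 5.40 | 5.71 |
| 10,000 | 5.70 | 6.00 | 5.50 | 6.00 |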

DoRA’s advantage over LoRA is essentially constant across data sizes: the gap is +0.29 at 1000 samples, +0.27 at 4000, and +0.30 at both 7000 and 10000. This argues against the explanation that “DoRA just gets more from more data”; the benefit is stable. DVoRA at 0.04% parameters achieves the same score as DoRA at 2.33% parameters (6.00 on LLaMA2-7B with 10000 samples), a remarkable efficiency-accuracy tradeoff.
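The per-size gaps can be recomputed from the Figure 6 scores. The dictionary below simply transcribes that table.

```python
# MT-Bench scores on LLaMA2-7B, transcribed from Figure 6.
mt_bench = {
    1000:  {"LoRA": 5.41, "DoRA": 5.70, "VeRA": 5.21, "DVoRA": 5.43},
    4000:  {"LoRA": 5.55, "DoRA": 5.82, "VeRA": 5.38, "DVoRA": 5.60},
    7000:  {"LoRA": 5.68, "DoRA": 5.98, "VeRA": 5.40, "DVoRA": 5.71},
    10000: {"LoRA": 5.70, "DoRA": 6.00, "VeRA": 5.50, "DVoRA": 6.00},
}

# DoRA minus LoRA at each training-set size, rounded to the table's precision.
dora_gaps = {n: round(s["DoRA"] - s["LoRA"], 2) for n, s in mt_bench.items()}
# Same comparison for the VeRA family.
dvora_gaps = {n: round(s["DVoRA"] - s["VeRA"], 2) for n, s in mt_bench.items()}
print(dora_gaps)
print(dvora_gaps)
```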

3.6 Tuning granularity: selective magnitude updates

DoRA’s analysis reveals that when directional updates dominate, magnitude changes are small. Exploiting this, the authors test a reduced granularity variant: apply full DoRA (direction + magnitude) to Q, K, V attention projections, but apply only magnitude updates to gate/up/down (MLP) projections.

| Method | #Params (%) | LLaMA-7B Avg. | LLaMA-13B Avg. |
| --- | --- | --- | --- |
| LoRA | 0.83 | 74.7 | 80.5 |
| DoRA (full) | 0.84 | 78.1 | 81.5 |
| DoRA (reduced) | 0.39 | 77.5 | 81.3 |

The reduced granularity variant uses 0.39% parameters (less than half of LoRA’s 0.83%) and still beats LoRA by +2.8/+0.8 points. This is consistent with the premise of the experiment, that the MLP weights mostly need magnitude correction rather than directional rotation, so splitting the budget accordingly is worthwhile.

4. Design choices, alternatives, and boundary conditions

4.1 Why column-wise norms, not row-wise or global?

The choice to normalize column-wise ($\|\cdot\|_c$) matches the linear algebra of the forward pass. For $W \in \mathbb{R}^{d \times k}$ applied as $Wx$ where $x \in \mathbb{R}^k$, each output dimension $i$ is $W_{i,:} \cdot x$ (a row inner product). But each column $W_{:,j}$ corresponds to the weight vector associated with input dimension $j$. The column-wise normalization ensures each such weight vector is a unit direction, with the scalar magnitude absorbing the scale.

Alternative: Row-wise normalization (normalize each row of $W$ and learn a magnitude per row). Under the $Wx$ convention above, this would decompose each “output neuron” rather than each “input feature weight.” Note that weight normalization (Salimans & Kingma) is defined per output neuron in fully-connected layers, so whether DoRA’s column-wise norm corresponds to it depends on whether the layer is written as $Wx$ or $xW$; when reading an implementation, check which axis the norm is taken over.

Boundary: In practice, $W$ may have columns with near-zero norms (dead neurons), which makes the division in Eq. (3) ill-conditioned. A small $\epsilon$ floor on the column norm is a common safeguard against division by zero; initializing $\mathbf{m} = \|W_0\|_c$ keeps the magnitude of such columns near zero at the start of training.
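One common way to guard the division is to floor the column norm at a small constant. This is a generic numerical safeguard sketched here as an assumption, not a detail taken from the paper; the function name and the value of `eps` are illustrative.

```python
import numpy as np

def safe_unit_columns(V, eps=1e-8):
    """Divide each column by its l2 norm, flooring the norm at eps to avoid 0/0."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    return V / np.maximum(norms, eps)

V = np.array([[3.0, 0.0],
              [4.0, 0.0]])          # second column is a "dead" all-zero column
U = safe_unit_columns(V)
print(U)
```

The live column becomes a unit vector, and the dead column stays at zero instead of turning into NaN.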

4.2 What happens if we train $V$ directly instead of via LoRA?

If $V$ (the direction matrix) were trained directly (without low-rank constraint), DoRA would be equivalent to full FT, just a reparameterization with more trainable parameters (both $\mathbf{m}$ and $V$ are trained). The LoRA constraint on $\Delta V$ is what makes DoRA parameter-efficient.

This also means DoRA is not a method for improving full FT. Its gains are specific to the PEFT setting, where the limited rank budget is used more efficiently when magnitude and direction are decoupled.

4.3 Does DoRA add inference overhead?

No. The magnitude $\mathbf{m}$ and the LoRA matrices $\{A, B\}$ can all be merged into a single dense weight before deployment:

W'_{\text{merged}} = \mathbf{m} \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}

This is a one-time computation. The deployed model has identical architecture and inference cost to the original pretrained model. This is DoRA’s key advantage over adapter-based methods, which add latency via sequential/parallel insertion.

4.4 Why is DoRA better at low ranks than LoRA?

At very low ranks (e.g., $r = 4$), LoRA must spend its limited expressiveness on both directional updates and magnitude corrections simultaneously. Since $\Delta W = BA$ couples these, a small rank means neither can be done well.

DoRA makes magnitude correction nearly free (it’s just $k$ scalars, one per column, with no rank restriction). The rank budget $r$ is dedicated exclusively to directional updates. At $r = 4$, DoRA can still rescale every column independently, while LoRA must fit both effects into a single rank-4 delta.
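The rank argument can be illustrated numerically. With toy random matrices (an assumption for illustration), a LoRA delta has rank at most $r$, while rescaling every column by a different factor changes the weight by a matrix that is generically full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2
W0 = rng.standard_normal((d, k))

# LoRA-style delta: a product of d x r and r x k factors.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
lora_delta = B @ A

# Magnitude-only change: scale each column by its own factor (k scalars in total).
scales = rng.uniform(1.1, 1.5, size=(1, k))
magnitude_delta = W0 * scales - W0

rank_lora = np.linalg.matrix_rank(lora_delta)
rank_magnitude = np.linalg.matrix_rank(magnitude_delta)
print(rank_lora, rank_magnitude)
```

The $k$ magnitude scalars express a change that a rank-$r$ product could only approximate.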

Boundary condition: As $r \to \min(d, k)$ (full rank), LoRA approaches FT and DoRA’s advantage should shrink. In the results above, the improvement is most pronounced at small $r$ (§3.2).

4.5 Relationship to previous weight decomposition work

Weight normalization (Salimans & Kingma, 2016): Same mathematical decomposition, but (1) applied during pretraining from scratch, (2) both $g$ and $\mathbf{v}$ are randomly initialized (sensitive), (3) motivates faster convergence via gradient covariance conditioning. DoRA uses the same decomposition for fine-tuning, initialized from pretrained weights (no sensitivity issue), and motivates it via learning pattern analysis rather than convergence speed.

SVD-based compression (SVD-LLM, ASVD): These methods approximate $W$ with a low-rank matrix by truncating singular values, for post-training compression. They do not train the model further. DoRA is a training method, not a compression method: the weight is not low-rank at deployment (it’s merged to a full dense matrix).

AdaLoRA: Adaptively allocates the rank budget across layers by parameterizing $\Delta W$ in an SVD-like form and pruning the least important singular values. It’s still a LoRA variant: all gradient energy goes into updating $\Delta W$, with no explicit magnitude/direction separation. DoRA’s improvement comes from a different mechanism.

4.6 QDoRA: combining with quantized backbones

QLoRA (Dettmers et al., NeurIPS 2023) quantizes the frozen backbone to 4-bit NF4 and applies LoRA adapters in full precision. QDoRA substitutes the LoRA component with DoRA:

W'_\ell = \mathbf{m}_\ell \cdot \frac{\text{dequant}(W_{0,\ell}^{4\text{bit}}) + BA}{\|\text{dequant}(W_{0,\ell}^{4\text{bit}}) + BA\|_c}

On Orca-Math (100k math word problems), QDoRA achieves exact-match 0.27 on LLaMA2-7B versus QLoRA’s 0.08 — a 3.4× improvement. On LLaMA3-8B, QDoRA achieves 0.31 versus QLoRA’s 0.23. Notably, QDoRA slightly outperforms full FT (which requires much more memory) on these benchmarks.
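The structure of the QDoRA weight can be sketched with a toy quantizer. The code below uses a simple symmetric 4-bit absmax scheme as a stand-in and is not NF4. It also assumes the magnitude is initialized from the dequantized weight, so that the layer starts exactly at the dequantized backbone. All names are illustrative.

```python
import numpy as np

def quantize_4bit(W):
    """Toy symmetric 4-bit quantizer (NOT NF4): integer codes in [-8, 7] plus one scale."""
    scale = np.abs(W).max() / 7.0
    codes = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float64) * scale

def qdora_weight(codes, scale, B, A, m):
    """DoRA applied on top of the dequantized frozen weight."""
    V_prime = dequantize(codes, scale) + B @ A
    return m * V_prime / np.linalg.norm(V_prime, axis=0, keepdims=True)

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2
W0 = rng.standard_normal((d, k))
codes, scale = quantize_4bit(W0)
W_deq = dequantize(codes, scale)

A = rng.standard_normal((r, k))
B = np.zeros((d, r))                                   # zero init
m = np.linalg.norm(W_deq, axis=0, keepdims=True)       # assumption: init from dequantized weight
W_init = qdora_weight(codes, scale, B, A, m)
```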

5. Limitations and boundary conditions

5.1 Training memory

The base DoRA (without the detach trick in §2.3) requires computing gradients through the normalization $\|V'\|_c$, which increases the gradient graph depth and memory. The detach trick recovers most of this (−24.4% GPU memory on LLaMA-7B) at negligible accuracy cost. Still, DoRA requires slightly more memory than plain LoRA at equal rank because of the additional $\mathbf{m}$ vector and the dynamic norm computation.

5.2 Hyperparameter sensitivity for learning rate

The magnitude $\mathbf{m}$ and the LoRA matrices $\{A, B\}$ may prefer different learning rates. In the paper’s experiments, the same learning rate is used for all (with some per-experiment tuning), which works well but may not be optimal. Separate learning rate schedules for $\mathbf{m}$ vs $\{A, B\}$ could potentially improve results further.

5.3 Task coverage

The commonsense reasoning benchmark suite used (following Hu et al., 2023) has known limitations: all tasks are multiple-choice, which may not fully represent instruction-following, generation quality, or reasoning-intensive tasks. The MT-Bench evaluations (GPT-4 scored) provide a more nuanced signal and confirm the trend.

5.4 Comparison with newer LoRA variants

The paper was submitted in Feb 2024 (ICML 2024 accepted). More recent variants (PiSSA, LoRA+, FLORA) have since emerged. Whether DoRA remains the Pareto-optimal PEFT method in 2025–2026 requires updated comparison — though the underlying gradient decoupling insight is structural and unlikely to be superseded by minor tweaks to LoRA.

5.5 Does DoRA help with alignment fine-tuning?

The paper demonstrates DoRA on SFT (supervised fine-tuning) and instruction-following, but not on RLHF or DPO. Since DoRA modifies the learning dynamics of the gradient update, it’s plausible (but unproven) that it also improves reward modeling or preference learning. This is an open direction.

6. Reproducibility

6.1 Code availability

Official PyTorch implementation: https://github.com/NVlabs/DoRA

DoRA is integrated into Hugging Face PEFT (supported by the HF PEFT team, acknowledged in the paper). Standard usage:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,          # ← DoRA flag
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],  # Q, K, V, Up, Down as in §6.2
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

6.2 Replicating the commonsense reasoning benchmark

Data: Hu et al. (2023) training protocol — combine training data from 8 commonsense tasks.

LLaMA-7B hyperparameters (DoRA):

Rank r:        16 (full), 8 (DoRA†)
Alpha:         32
Dropout:       0.05
Optimizer:     AdamW
LR:            2e-4
Scheduler:     Linear decay
Batch size:    16
Warmup steps:  100
Epochs:        3
Target modules: Q, K, V, Up, Down

Expected throughput: On an A100 80GB, DoRA with $r=16$ on LLaMA-7B trains at approximately the same speed as LoRA (same FLOP count during forward; slight overhead in norm computation during backward).
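
To make the schedule rows of the table concrete, here is a minimal sketch of one common reading of "linear decay with 100 warmup steps": linear ramp to the peak rate, then linear decay to zero. The function name, the decay-to-zero endpoint, and the example count of 170,000 are assumptions for illustration; the paper's exact schedule implementation may differ.

```python
import math


def linear_warmup_decay_lr(step, total_steps, peak_lr=2e-4, warmup_steps=100):
    """Learning rate at a 0-indexed step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical run length: ~170K examples, batch size 16, 3 epochs.
num_examples = 170_000
batch_size, epochs = 16, 3
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

lr_start = linear_warmup_decay_lr(0, total_steps)
lr_peak = linear_warmup_decay_lr(99, total_steps)
lr_end = linear_warmup_decay_lr(total_steps, total_steps)
```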

6.3 Key ablation: verifying the detach trick

To verify the memory saving from §2.3 without accuracy loss:

# Standard DoRA (high memory):
W_prime = m * (V + delta_V) / (V + delta_V).norm(dim=0, keepdim=True)

# Efficient DoRA (detach norms from grad graph):
norms = (V + delta_V).norm(dim=0, keepdim=True).detach()
W_prime = m * (V + delta_V) / norms

Reported: 24.4% less GPU memory, 0.2 accuracy point difference on LLaMA-7B commonsense.
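
A small self-contained check of what the detach changes, written as a sketch on random toy tensors (shapes, seed, and the squared-error loss are arbitrary choices, and `B` is non-zero here only so the gradients are non-trivial). It confirms that the forward pass and the gradient for `m` are unchanged while the gradients for `B` and `A` differ. It does not measure memory, which needs a real model and a GPU profiler.

```python
import torch

torch.manual_seed(0)
d, k, r = 16, 8, 4
V = torch.randn(d, k)                        # frozen pretrained weight W0
B = torch.randn(d, r, requires_grad=True)    # LoRA factors for the direction
A = torch.randn(r, k, requires_grad=True)
m = V.norm(dim=0, keepdim=True).clone().requires_grad_(True)  # magnitude
x = torch.randn(k, 5)
target = torch.randn(d, 5)


def dora_weight(detach_norm):
    V_prime = V + B @ A
    norms = V_prime.norm(dim=0, keepdim=True)
    if detach_norm:
        norms = norms.detach()  # treat the column norms as constants in backward
    return m * V_prime / norms


def loss_and_grads(detach_norm):
    loss = ((dora_weight(detach_norm) @ x - target) ** 2).mean()
    grads = torch.autograd.grad(loss, [B, A, m])
    return loss.detach(), grads


loss_full, grads_full = loss_and_grads(detach_norm=False)
loss_det, grads_det = loss_and_grads(detach_norm=True)

same_forward = torch.allclose(loss_full, loss_det)
same_m_grad = torch.allclose(grads_full[2], grads_det[2])
same_B_grad = torch.allclose(grads_full[0], grads_det[0])
```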

6.4 The weight decomposition diagnostic tool

The analysis tool from Section 3.2 of the paper can be implemented as:

import torch

def weight_decomp_analysis(W0, W_finetuned):
    """
    Returns (delta_M, delta_D) for each column.
    W0, W_finetuned: (d, k) tensors
    """
    m0   = W0.norm(dim=0)          # (k,)
    m_ft = W_finetuned.norm(dim=0) # (k,)
    
    v0   = W0 / m0.unsqueeze(0)           # unit columns
    v_ft = W_finetuned / m_ft.unsqueeze(0)
    
    delta_M = (m_ft - m0).abs().mean().item()
    
    # cosine similarity per column
    cos_sim = (v0 * v_ft).sum(dim=0)  # (k,)
    delta_D = (1 - cos_sim).mean().item()
    
    return delta_M, delta_D

This lets practitioners diagnose whether their LoRA vs FT gap is driven by magnitude issues, direction issues, or both — informing whether DoRA (or simpler magnitude-only tuning) is appropriate.
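
As a sanity check before running the diagnostic on real checkpoints, it can help to confirm that it separates the two effects on synthetic weights. The sketch below uses a compact restatement of the function above (`decomp_deltas` is a hypothetical name, and the random matrices are stand-ins for real weights): a pure rescaling should show up only in the magnitude term, and a norm-preserving perturbation only in the direction term.

```python
import torch
import torch.nn.functional as F


def decomp_deltas(W0, W_ft):
    """Mean absolute column-norm change and mean (1 - cosine) per column."""
    delta_M = (W_ft.norm(dim=0) - W0.norm(dim=0)).abs().mean().item()
    delta_D = (1 - F.cosine_similarity(W0, W_ft, dim=0)).mean().item()
    return delta_M, delta_D


torch.manual_seed(0)
W0 = torch.randn(64, 32)

# Case 1: magnitude-only change (every column scaled, directions untouched).
dM_scale, dD_scale = decomp_deltas(W0, 1.5 * W0)

# Case 2: direction-only change (perturb, then restore the original column norms).
W_rot = W0 + 0.5 * torch.randn_like(W0)
W_rot = W_rot * (W0.norm(dim=0) / W_rot.norm(dim=0))
dM_rot, dD_rot = decomp_deltas(W0, W_rot)
```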

7. Summary and broader perspective

DoRA is a small but principled improvement to LoRA that is grounded in a concrete empirical observation. The key contributions are:

  1. Diagnostic method (weight decomposition analysis): A simple, general tool to compare the learning patterns of any fine-tuning method against FT. This is independently valuable beyond DoRA.

  2. DoRA: Decompose weights into magnitude and direction; train magnitude directly and direction via LoRA. The decomposition offers a mechanistic explanation for the FT-LoRA accuracy gap and is designed to narrow it.

  3. Empirical breadth: Consistent improvements across LLaMA/LLaMA2/LLaMA3, LLaVA, VL-BART, NLP and vision-language tasks, instruction tuning and commonsense reasoning, with and without quantization.

The broader lesson is about diagnostic-driven design: rather than proposing a new architecture or loss function and hoping it improves accuracy, the authors first characterized the structural difference between what they had (LoRA) and what they wanted (FT behavior), then designed the minimal change to close that gap. This methodology tends to produce methods that generalize well precisely because they fix a root cause rather than add complexity.

For practitioners working with constrained GPU budgets, DoRA’s most actionable results are:

  • At low rank ($r \leq 8$): DoRA’s improvement over LoRA is massive (+22 to +37 points) and DoRA should be strongly preferred.
  • At standard rank ($r = 16$–$32$): DoRA gives consistent +2–4 point improvements with negligible overhead.
  • DoRA† (half rank): If training budget is tight, halving the rank and using DoRA consistently outperforms standard LoRA.
  • The HuggingFace PEFT integration (use_dora=True) makes adoption a one-line change.

Appendix A: Extended Derivations

A.1 Full gradient derivation without the detach approximation

Without the detach optimization, the normalization $C = \|V'\|_c$ is part of the computation graph. Using the chain rule through the column-norm operation:

For column $n$ of $V'$, let $v'_n = V'_{:,n}$. The column norm is $C_n = \|v'_n\|_2$. The normalized direction is $\hat{v}'_n = v'_n / C_n$.

The loss gradient w.r.t. $v'_n$:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial v'_n}
&= \frac{m_n}{C_n} \cdot \frac{\partial \mathcal{L}}{\partial W'_{:,n}} - \frac{m_n}{C_n^3} \left(\frac{\partial \mathcal{L}}{\partial W'_{:,n}} \cdot v'_n\right) v'_n \\
&= \frac{m_n}{C_n} \left(I - \frac{v'_n v'^{\top}_n}{C_n^2}\right) \frac{\partial \mathcal{L}}{\partial W'_{:,n}}
\end{aligned}
\tag{A.1}
$$

This is a scaled projection of the weight gradient onto the subspace orthogonal to $v'_n$. Equation (A.1) says: the gradient with respect to $V'$ (and thus the signal that reaches $\Delta V = BA$) is the component of the weight gradient that is perpendicular to the current direction. The component parallel to $v'_n$ is absorbed by the magnitude gradient (Eq. 8). This is a cleaner, more formal statement of why DoRA decouples direction from magnitude.

The detach approximation drops the $-\frac{m_n}{C_n^3}(\ldots)\, v'_n$ term, replacing Eq. (A.1) with simply $\frac{m_n}{C_n} \frac{\partial \mathcal{L}}{\partial W'_{:,n}}$ (Eq. 7). The dropped term has norm $\frac{m_n}{C_n^3} \cdot |\nabla_{W'_{:,n}}\mathcal{L} \cdot v'_n| \cdot \|v'_n\|$. Since $\|v'_n\| = C_n$, this equals $\frac{m_n}{C_n^2} |\nabla_{W'_{:,n}}\mathcal{L} \cdot v'_n|$. It is small when the gradient is nearly perpendicular to $v'_n$ (i.e., when directional updates are needed). When the gradient is aligned with $v'_n$ (i.e., when magnitude updates are needed), it is comparable to the kept term, but it points along $v'_n$: to first order, and setting aside the low-rank parameterization of $\Delta V$, a step along $v'_n$ only rescales the column, and the normalization in the next forward pass removes that rescaling. So the approximation mainly alters a component that does not change the normalized direction, which is consistent with the reported loss of only 0.2 accuracy points.
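
Equation (A.1) can be checked numerically against autograd. The sketch below does this for a single column with a toy linear loss $\mathcal{L} = c \cdot W'_{:,n}$, so that $\partial \mathcal{L} / \partial W'_{:,n} = c$; the dimension, seed, and magnitude value are arbitrary. It also confirms that the term removed by the detach approximation is parallel to $v'_n$.

```python
import torch

torch.manual_seed(0)
d = 12
v = torch.randn(d, dtype=torch.float64, requires_grad=True)  # one column v'_n
m_n = torch.tensor(1.7, dtype=torch.float64)                 # its magnitude
c = torch.randn(d, dtype=torch.float64)                      # dL/dW'_{:,n} for the toy loss

# Exact path: the norm stays in the graph.
w = m_n * v / v.norm()            # W'_{:,n} = m_n * v'_n / C_n
loss = (c * w).sum()
(grad_autograd,) = torch.autograd.grad(loss, v)

# Closed form from Eq. (A.1).
v_d = v.detach()
C_n = v_d.norm()
proj = torch.eye(d, dtype=torch.float64) - torch.outer(v_d, v_d) / C_n**2
grad_closed_form = (m_n / C_n) * (proj @ c)

# Detached gradient (Eq. 7) and the term it leaves out.
grad_detached = (m_n / C_n) * c
dropped = grad_detached - grad_closed_form
parallel_part = (dropped @ v_d) / C_n**2 * v_d   # projection of `dropped` onto v'_n
```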

A.2 Proof sketch: DoRA preserves LoRA’s inference merge property

Claim: After training, DoRA weights can be merged into a dense matrix with no additional inference computation.

Proof: Let $\mathbf{m}^*, B^*, A^*$ be the final trained values. Define:

$$W'_{\text{merged}} := \mathbf{m}^* \cdot \frac{W_0 + B^*A^*}{\|W_0 + B^*A^*\|_c}$$

This is a dense matrix in $\mathbb{R}^{d \times k}$. At inference, for any input $x \in \mathbb{R}^k$:

$$W' x = W'_{\text{merged}} x$$

No additional computation is needed. The magnitude vector $\mathbf{m}^*$ and LoRA matrices $B^*, A^*$ can be discarded. Memory footprint at inference = $dk$ floats, identical to the original model. $\square$

The computation cost of computing $W'_{\text{merged}}$ is $O(dk + rkd)$ (one matrix multiply for $B^*A^*$, one column-norm computation, one elementwise multiply by $\mathbf{m}^*$). This is done once and amortized across all inference calls.
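
The merge can be illustrated in a few lines. This is a minimal sketch on random tensors standing in for trained values (the names `B_star`, `A_star`, `m_star`, and `dora_forward` are illustrative, not taken from any library); it checks that the merged dense matrix reproduces the unmerged forward pass and that its column norms equal the magnitude vector.

```python
import torch

torch.manual_seed(0)
d, k, r = 16, 8, 4
W0 = torch.randn(d, k)
B_star = torch.randn(d, r)        # stand-ins for trained LoRA factors
A_star = torch.randn(r, k)
m_star = torch.rand(1, k) + 0.5   # stand-in for the trained magnitude (positive)


def dora_forward(x):
    """Unmerged path: rebuild the direction, normalize, rescale, then apply."""
    V_prime = W0 + B_star @ A_star
    return (m_star * V_prime / V_prime.norm(dim=0, keepdim=True)) @ x


# One-time merge: a (d x r)(r x k) matmul plus O(dk) normalization and scaling.
V_prime = W0 + B_star @ A_star
W_merged = m_star * V_prime / V_prime.norm(dim=0, keepdim=True)

x = torch.randn(k, 3)
y_unmerged = dora_forward(x)
y_merged = W_merged @ x   # plain dense matmul, no adapter state needed
```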

A.3 Parameter count comparison across PEFT methods

For a single linear layer $W \in \mathbb{R}^{d \times k}$:

| Method | Trainable Params | Notes |
|---|---|---|
| Full FT | $dk$ | All params |
| LoRA (rank $r$) | $r(d+k)$ | Both $A$ and $B$ |
| DoRA (rank $r$) | $r(d+k) + k$ | $A$, $B$, plus magnitude $\mathbf{m}$ |
| AdaLoRA | $r(d+k)$ | Same as LoRA, but $r$ varies per layer |
| VeRA | $d + r$ | Only layer-specific scaling vectors (one of length $d$, one of length $r$); the random $A$, $B$ are frozen and shared |
| DVoRA | $d + r + k$ | VeRA vectors + DoRA magnitude |
| Prefix (length $L$) | $2 L d_{\text{model}}$ | For each transformer layer |

For LLaMA-7B with $d = k = 4096$ and $r = 16$:

  • LoRA: $16 \times 8192 = 131{,}072$ per layer
  • DoRA: $131{,}072 + 4{,}096 = 135{,}168$ per layer (+3.1%)
  • Full FT: $16{,}777{,}216$ per layer (128× more than LoRA, about 124× more than DoRA)

The 3.1% parameter overhead of DoRA over LoRA (the $k$ magnitude scalars) is negligible in practice.
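
The arithmetic above is easy to reproduce for other shapes and ranks. A minimal sketch, using the per-layer formulas from the table (the helper names are illustrative):

```python
def lora_params(d, k, r):
    """Trainable parameters of LoRA on a (d x k) weight: B is (d x r), A is (r x k)."""
    return r * (d + k)


def dora_params(d, k, r):
    """LoRA parameters plus one magnitude scalar per column."""
    return lora_params(d, k, r) + k


d = k = 4096
r = 16
full = d * k
lora = lora_params(d, k, r)
dora = dora_params(d, k, r)
overhead_pct = 100 * (dora - lora) / lora   # DoRA's extra parameters relative to LoRA
ratio_full_to_lora = full / lora
ratio_full_to_dora = full / dora
```

For a square layer the relative overhead is $k / (r(d+k)) = 1/(2r)$, so it shrinks as the rank grows and becomes more noticeable at very low rank.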

Appendix B: Implementation Details

B.1 HuggingFace PEFT implementation notes

The HuggingFace PEFT library implements DoRA as an extension of LoRA. Key implementation choices:

Column norm computation: PEFT computes column norms per forward pass, matching the paper’s “detach” variant. The implementation stores the computed norms as a buffer (not a parameter) to avoid redundant computation across the same layer.

Magnitude initialization: When use_dora=True, PEFT initializes lora_magnitude_vector (the $\mathbf{m}$ vector) from the column norms of the pretrained weight. This is equivalent to the paper’s initialization $\mathbf{m} = \|W_0\|_c$.

Merge/unmerge: The PEFT library supports model.merge_adapter() and model.unmerge_adapter() for DoRA, correctly handling the normalization step in the merge computation.

B.2 Adapting DoRA for quantized models (QDoRA)

When using DoRA with bitsandbytes 4-bit quantization:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # Hub ID of the LLaMA-3-8B base model (gated; requires access)
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,          # QDoRA
    target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

The dequantization step (NF4 → bfloat16) happens automatically during the column norm computation and the DoRA forward pass. Memory usage is typically in the 8–12 GB range for LLaMA3-8B on a single GPU, depending on batch size and sequence length.

B.3 Diagnosing whether DoRA will help: the quick check

The question “should I use DoRA instead of LoRA for my task?” can be answered with the weight decomposition analysis. A simple heuristic:

  1. Train LoRA for a few hundred steps.
  2. Compute the Pearson correlation between $(\Delta D, \Delta M)$ across layers and checkpoints.
  3. If correlation > +0.3 (LoRA-like positive coupling), DoRA is likely to help.
  4. If correlation is already negative (FT-like), DoRA may give marginal improvement.

In practice, the correlation is usually strongly positive for LoRA (the paper found +0.83), so DoRA almost always helps when LoRA is used as the baseline.
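
The heuristic above can be sketched in plain Python. The function names and the example numbers below are made up for illustration; in practice the two lists would hold the $(\Delta D, \Delta M)$ pairs collected per layer and checkpoint with the diagnostic from §6.4, and the 0.3 threshold is this review's rule of thumb rather than a value from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def dora_likely_to_help(delta_D, delta_M, threshold=0.3):
    """True when direction and magnitude changes are positively coupled."""
    return pearson(delta_D, delta_M) > threshold


# Made-up measurements: a LoRA-like run (coupled) and an FT-like run (decoupled).
coupled_D = [0.01, 0.02, 0.03, 0.04, 0.05]
coupled_M = [0.10, 0.22, 0.29, 0.41, 0.52]
decoupled_D = [0.01, 0.02, 0.03, 0.04, 0.05]
decoupled_M = [0.50, 0.42, 0.33, 0.18, 0.11]

corr_coupled = pearson(coupled_D, coupled_M)
corr_decoupled = pearson(decoupled_D, decoupled_M)
```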

Appendix C: Relationship to Other Low-Rank Methods

C.1 Spectral perspective on LoRA vs DoRA

The weight matrix $W_0$ has a singular value decomposition $W_0 = U_0 \Sigma_0 V_0^\top$. Each column of $W_0$ is a combination of the left singular vectors $u_i$, with coefficients given by the singular values and the entries of $V_0$: $W_{0,:,j} = \sum_i \sigma_i u_i (V_0)_{ji}$.

LoRA’s update $\Delta W = BA$ adds a rank-$r$ perturbation in the full $\mathbb{R}^{d \times k}$ space. There’s no explicit constraint on which singular directions are updated.

DoRA’s direction update $\Delta V = BA$ also operates in the full space, but the normalization ensures that the columns of $W_0 + BA$ are mapped to unit vectors before magnitude scaling. As a result, the column norms of the adapted weight are set entirely by $\mathbf{m}$, so the low-rank update cannot inflate the norm of any single column.
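
The statement about column norms follows in one line from the DoRA definition. Writing $v'_j$ for column $j$ of $W_0 + BA$ (a label introduced here for convenience):

```latex
\|W'_{:,j}\|_2
  = \left\| m_j \, \frac{v'_j}{\|v'_j\|_2} \right\|_2
  = |m_j| \, \frac{\|v'_j\|_2}{\|v'_j\|_2}
  = |m_j| .
```

So however large the entries of $BA$ become, they influence $W'$ only through the direction of each column.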

C.2 Why DoRA might outperform LoRA more on newer architectures

LLaMA-3-8B shows a larger DoRA improvement (+4.4 points) than LLaMA-7B (+3.7) despite similar parameter counts. Several factors may contribute:

Grouped-Query Attention (GQA): LLaMA-3 uses GQA, which means key and value projections have fewer heads than query projections. The matrices have different aspect ratios, and their “natural” low-rank direction in the task-specific fine-tuning objective may diverge more from LoRA’s isotropic update space.

Rotary Position Embeddings (RoPE): The RoPE variant in LLaMA-3 (with different base frequency) may result in weight matrices where the task-relevant fine-tuning directions are more strongly separated from the pretrained directions, making the magnitude/direction decoupling more valuable.

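Embedding layer scale: LLaMA-3 uses a larger vocabulary (128K tokens vs 32K in LLaMA), which changes the embedding weight matrices. Since the embeddings are not among the adapted modules in the setup of §6.2 (Q, K, V, Up, Down), any effect on DoRA’s column-wise normalization would be indirect, through the statistics of the activations that reach the adapted layers.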

These are hypotheses — the paper does not provide ablations on these architectural differences. An interesting future experiment would be to apply the weight decomposition analysis separately to each module type (q/k/v/o/gate/up/down) for LLaMA-3 to identify where the largest FT-vs-LoRA learning pattern difference occurs.

Appendix D: Extended Experimental Context

D.1 The commonsense reasoning benchmark suite

The eight tasks used in the commonsense reasoning evaluation are:

| Task | Type | Size (test) | Description |
|---|---|---|---|
| BoolQ | Binary QA | 3,270 | Reading comprehension, yes/no |
| PIQA | MC (2-choice) | 1,838 | Physical intuition QA |
| SIQA | MC (3-choice) | 1,954 | Social interaction QA |
| HellaSwag | MC (4-choice) | 10,042 | Sentence completion, activity |
| WinoGrande | Coreference | 1,267 | Winograd-style pronoun resolution |
| ARC-e (Easy) | MC (4-choice) | 2,376 | Science exam questions, easy |
| ARC-c (Challenge) | MC (4-choice) | 1,172 | Science exam questions, hard |
| OBQA (OpenBookQA) | MC (4-choice) | 500 | Open-book science questions |

These tasks vary widely in their linguistic demands. BoolQ requires careful passage reading; HellaSwag requires world knowledge about typical activity progressions; WinoGrande requires pronoun coreference with commonsense grounding. The combined training set contains 170,000+ examples across all tasks.

Following the LLM-Adapters protocol, all 8 training sets are combined for training, and evaluation is done on each task’s test set separately. The reported metric is accuracy (binary or multi-class), averaged across all 8 tasks for the summary number.

D.2 MT-Bench details and what the scores mean

MT-Bench evaluates 80 multi-turn conversations across 8 categories. The GPT-4 judge assigns each answer a score from 1 to 10. As a rough guide to interpreting the scores:

  • < 4.0: Poor instruction following, frequent off-topic or incoherent answers
  • 4.0 – 5.5: Below average; can follow simple instructions but struggles with multi-step reasoning
  • 5.5 – 6.5: Average; capable model with some reasoning ability
  • 6.5 – 7.5: Good; handles most MT-Bench categories well
  • > 7.5: Excellent; comparable to commercial APIs

The LoRA baseline scores of 5.1–5.7 and DoRA’s 5.5–6.0 are in the “below average to average” range, consistent with these being relatively small 7–13B models fine-tuned on limited instruction data. The improvement from DoRA (+0.3–0.5) is meaningful given the benchmark’s resolution.

D.3 Variance and statistical significance

The paper reports single-run results for most experiments. The commonsense reasoning results are relatively stable (low-variance tasks with large test sets). MT-Bench results have more variance due to GPT-4 judge noise (estimated ±0.2 per run). The paper’s 0.3 DoRA improvement on MT-Bench should be interpreted with this in mind — it’s a consistent trend, not a precisely measured delta.

For practical deployment decisions, the rank robustness results (the 37-point gap at r=8) are the most statistically decisive finding, as the difference is far larger than any plausible variance.

D.4 Comparison to concurrent work

Around the time of DoRA’s publication (Feb 2024), several closely related PEFT works appeared within a few months:

  • PiSSA (arXiv 2404.02948): Computes an SVD of $W_0$ and initializes LoRA’s $A$ and $B$ from the principal singular components rather than Kaiming/zero. PiSSA and DoRA target different root causes: PiSSA improves initialization, DoRA improves the structural coupling.

  • MoRA (arXiv 2405.12130): Replaces the two rectangular $A, B$ matrices with a single square matrix to allow higher-rank updates with the same parameter count. This is orthogonal to DoRA’s magnitude/direction decomposition.

  • LoRA+ (arXiv 2402.12354): Addresses the learning rate imbalance between $A$ and $B$ in LoRA. DoRA addresses a different problem (magnitude/direction coupling), and the two fixes could be combined.

None of these address the same root cause as DoRA, suggesting they could be combined. A DoRA variant with PiSSA-style initialization, LoRA+ learning rate scheduling, and DVoRA’s parameter efficiency has not been systematically studied but would be a natural next step.