- Review date: 2026-06-30
- Review author: Zhongzhu Zhou
- Paper reviewed: DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Paper authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, et al. (ByteDance Seed / SIA-Lab, Tsinghua AIR)
- arXiv: 2503.14476
- Venue/Status: Preprint, v2 May 20 2025
Short Answer
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance’s fully open-source RL training recipe for LLM reasoning. It identifies four specific failure modes in naive GRPO-style training — entropy collapse, gradient starvation from zero-advantage batches, length bias in the per-sample loss, and reward noise from truncated responses — and fixes each with a targeted technique. Applied to Qwen2.5-32B, DAPO achieves 50% accuracy on AIME 2024 (avg@32 evaluation protocol) using only 50% of the training steps required by DeepSeek-R1-Zero-Qwen-32B (47% accuracy). The paper releases everything: algorithm, training code (verl framework), and a curated math dataset (DAPO-Math-17K).
Prerequisites: What You Need Before Reading This Paper
This section builds up the background knowledge required to follow DAPO’s technical contributions. Skip to the problem statement if you already know PPO and GRPO well.
1.1 The Reinforcement Learning Objective for Language Models
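Training a language model with RL frames the process as a policy optimization problem. The model is the policy $\pi_\theta$: given a prompt $q$, it samples a response $o$ token by token. A reward function $R(q, o)$ scores how good the response is. The goal is to maximize:

$$J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi_\theta(\cdot \mid q)}\big[R(q, o)\big]$$

The policy gradient theorem (REINFORCE, Williams 1992) tells us the gradient of this expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi_\theta(\cdot \mid q)}\big[R(q, o)\, \nabla_\theta \log \pi_\theta(o \mid q)\big] \tag{2}$$

In practice we use the advantage $A(q, o) = R(q, o) - b(q)$ instead of the raw reward, where $b(q)$ is a baseline (often the value function or the mean reward) that reduces gradient variance without introducing bias.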
1.2 Token-Level Decomposition of the Policy Gradient
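A language model generates a response token by token, so the log-probability of the full response decomposes as:

$$\log \pi_\theta(o \mid q) = \sum_{t=1}^{|o|} \log \pi_\theta(o_t \mid q, o_{<t})$$

The gradient over a full response is therefore a sum over per-token gradients, each scaled by the same response-level advantage $A(q, o)$:

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[A(q, o) \sum_{t=1}^{|o|} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\Big]$$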
Different implementations differ in how they weight this sum — this is exactly the distinction between sample-level and token-level loss that DAPO exploits.
1.3 Importance Sampling and the PPO Clipping Trick
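Raw policy gradient methods are sample-inefficient: once parameters update, the old samples are off-policy and must be discarded. Importance sampling lets us reuse samples from the old policy $\pi_{\theta_{\text{old}}}$ to estimate gradients of the new policy $\pi_\theta$ by reweighting with the importance ratio:

$$r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$$

The importance-sampled policy gradient objective becomes $\mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}}\big[r_t(\theta)\, \hat{A}_t\big]$. The problem is that if $\pi_\theta$ drifts far from $\pi_{\theta_{\text{old}}}$, $r_t(\theta)$ can become very large or very small, causing variance explosion and training instability.
PPO (Schulman et al., 2017) addresses this by clipping the importance ratio to the interval $[1-\varepsilon, 1+\varepsilon]$ (default $\varepsilon = 0.2$), then taking the pessimistic min:

$$J_{\text{PPO}}(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\, \hat{A}_t\big)\Big]$$

The min enforces conservatism: once the ratio leaves the interval in the direction that would increase the objective, the clipped term takes over and that token stops contributing gradient. Multiple gradient steps can be taken on the same batch of rollouts, with $\pi_{\theta_{\text{old}}}$ held fixed.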
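The behavior of the clipped objective is easiest to see on single numbers. The following is a minimal sketch of the per-token PPO-clip term; the function name `clipped_surrogate` and the example ratios are illustrative assumptions, not code from the paper.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO-clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio past 1 + eps earns nothing extra.
gain_capped = clipped_surrogate(1.5, 1.0)
# Negative advantage with a ratio above 1: the unclipped, more pessimistic term wins.
loss_uncapped = clipped_surrogate(1.5, -1.0)
# Negative advantage: pushing the ratio below 1 - eps earns nothing extra either.
loss_capped = clipped_surrogate(0.5, -1.0)

print(gain_capped, loss_uncapped, loss_capped)
```

The asymmetry of the `min` is the point: clipping only removes the incentive to move further in the direction the advantage already favors.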
1.4 GRPO: Group Relative Policy Optimization
PPO requires a value function (critic) to estimate advantages. Training a critic model for LLMs is expensive. GRPO (DeepSeekMath, 2024) eliminates the critic by computing advantages within a group of responses to the same prompt.
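For each prompt $q$, GRPO samples $G$ responses $\{o_1, \dots, o_G\}$ and computes rewards $\{R_1, \dots, R_G\}$. The group-normalized advantage for response $i$ is:

$$\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^{G})}{\text{std}(\{R_j\}_{j=1}^{G})} \tag{7}$$

The same advantage is assigned to all tokens $t$ within response $o_i$. GRPO’s objective, at the sample level, is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\} \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}\hat{A}_{i,t},\ \text{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t}\big) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})\Big)\right] \tag{8}$$

Key observations:
- No critic: advantages come from relative group performance, not a value network.
- KL penalty: the $\beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ term anchors the policy to the reference (pretrained) model.
- Sample-level averaging: each response contributes $1/G$ to the gradient regardless of length.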
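The group-normalized advantage is simple enough to sketch directly. The version below is a minimal illustration, assuming a population standard deviation and a small `eps` in the denominator to avoid division by zero; both choices, and the name `group_advantages`, are assumptions of this sketch rather than details stated in the text.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Return (R_i - mean) / (std + eps) for every reward in one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get positive advantage, wrong ones negative.
mixed = group_advantages([1.0, -1.0, -1.0, 1.0])
# Uniform outcomes: every advantage is zero, so the group carries no signal.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

The uniform case is the degenerate situation that matters later in the review: a group in which every response earns the same reward produces no gradient at all.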
1.5 Entropy, Entropy Collapse, and Why It Matters
Entropy of a discrete distribution $p$ measures randomness:

$$H(p) = -\sum_{x} p(x) \log p(x)$$
High entropy means diverse outputs; low entropy means near-deterministic outputs. In LLM RL, there is a persistent pressure toward entropy collapse: the model concentrates probability on a small set of tokens, losing the ability to explore different reasoning approaches. Once collapsed, the model is stuck in a local optimum and further RL training provides little benefit.
Entropy collapse is especially problematic for hard reasoning tasks like AIME, where problems often require trying multiple approaches before finding the correct one.
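To make "high" and "low" entropy concrete, here is a small sketch that compares a uniform next-token distribution with a sharply peaked one. The two four-token distributions are made-up illustrations, not measurements from any model.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])   # maximal for four outcomes: log(4)
peaked = entropy([0.97, 0.01, 0.01, 0.01])    # near-deterministic distribution

print(f"uniform: {uniform:.3f} nats, peaked: {peaked:.3f} nats")
```

A policy whose per-token distributions look like the second case rarely samples anything but its top choice, which is what limits exploration after collapse.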
1.5a Variance Reduction and Why It Matters
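High-variance gradient estimates slow learning. In REINFORCE (Eq. 2) with a baseline $b$, the single-sample estimator is $\hat{g} = (R - b)\, \nabla_\theta \log \pi_\theta(o \mid q)$, and its total variance is:

$$\text{Var}[\hat{g}] = \mathbb{E}\big[\|\hat{g}\|^2\big] - \|\nabla_\theta J(\theta)\|^2$$

For a binary reward that is earned with probability $p$ on a given prompt, the reward variance is proportional to $p(1-p)$, so it depends on how balanced the correct/incorrect split is. When all responses are correct ($p = 1$), the reward variance is zero, but so is the learning signal: every advantage in the group is zero, and we cannot learn to improve on a task we already solve perfectly. Dynamic sampling directly targets this: by keeping only prompts with mixed results, it guarantees nonzero reward variance within every group, so every prompt in the batch carries gradient signal.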
The group normalization in GRPO/DAPO (Eq. 7) is itself a variance reduction technique: by centering and scaling rewards within a group, it converts absolute reward values (which depend on the reward scale) into relative comparisons (which are scale-independent and have controlled variance). This is conceptually similar to using advantage functions in PPO, but without requiring a separate value network.
1.6 The KL Divergence Constraint
In standard RLHF, the KL term $\beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ prevents the policy from deviating too far from the reference model, which is useful for instruction-following where you want to stay close to the initial distribution. For long chain-of-thought reasoning, however, the paper argues that the policy can diverge significantly from the initial model during training, so this restriction is unnecessary. DAPO removes the term entirely.
The Problem: What Goes Wrong with Naive GRPO
When training a 32B language model on AIME-level math reasoning using GRPO out-of-the-box, the ByteDance team found four distinct failure modes:
Failure 1 — Entropy collapse. Policy entropy drops rapidly as training progresses. The model converges to stereotyped reasoning patterns. Performance plateaus early around 30% AIME accuracy and does not improve further.
Failure 2: Gradient starvation. When all responses to a prompt are correct (accuracy=1) or all incorrect (accuracy=0), the group-normalized advantage (Eq. 7) is zero for every response: $\hat{A}_{i,t} = 0$. Zero advantage means zero gradient, so the prompt contributes no training signal. As the model improves, an increasing fraction of easy prompts become “all correct” (Figure 3b in the paper shows this growing over training steps), wasting computation.
Failure 3: Length bias in per-sample loss. GRPO weights each response equally ($1/G$) regardless of length, so each token in a long response receives less gradient weight than each token in a short one. This dilutes the credit for high-quality long reasoning chains, and it equally dilutes the penalty for gibberish or repetition inside long responses.
Failure 4: Reward noise from truncated responses. Responses that exceed the maximum generation length are truncated. Applying a fixed penalty reward of $-1$ to all truncated responses creates noise: a response that was heading toward a correct solution but got cut off is punished identically to a response that was clearly wrong from the start. This noise corrupts the gradient signal.
DAPO proposes one targeted fix per failure mode.
Background: GRPO as the Baseline
The paper uses naive GRPO on Qwen2.5-32B as the experimental baseline, achieving 30% accuracy on AIME 2024 (avg@32). The goal is to reach or exceed DeepSeek-R1-Zero-Qwen-32B’s 47% using DAPO’s four techniques.
```mermaid
graph TD
    A["Naive GRPO: 30pct"] --> B["+ Overlong Filtering: 36pct"]
    B --> C["+ Clip-Higher: 38pct"]
    C --> D["+ Soft Overlong Penalty: 41pct"]
    D --> E["+ Token-Level Loss: 42pct"]
    E --> F["DAPO Full: 50pct"]
    G["DeepSeek-R1-Zero-Qwen-32B: 47pct"] -.->|"comparison baseline"| F
    style A fill:#fcc,stroke:#900
    style F fill:#cfc,stroke:#060
    style G fill:#ffc,stroke:#660
```
Figure 1: Ablation chain from the GRPO baseline (30%) to full DAPO (50%) on AIME 2024 avg@32. Each node adds one technique to the configuration before it. DAPO exceeds DeepSeek-R1-Zero-Qwen-32B by 3 percentage points using 50% of the training steps.
Innovation 1: Clip-Higher — Asymmetric Clipping Bounds
The Root Cause of Entropy Collapse
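Standard PPO/GRPO clips the importance ratio symmetrically around 1 with radius $\varepsilon$:

$$\text{clip}\big(r_t(\theta),\ 1-\varepsilon,\ 1+\varepsilon\big)$$

When the advantage $\hat{A}_t > 0$ (the system wants to increase a token’s probability), the clipped update is limited by the upper bound $1+\varepsilon$. Consider a low-probability “exploration” token with $\pi_{\theta_{\text{old}}} = 0.01$ and $\varepsilon = 0.2$:
- The importance ratio tries to grow above $1+\varepsilon = 1.2$.
- After clipping, the maximum allowed $\pi_\theta = 0.01 \times 1.2 = 0.012$.
- The absolute increase is $0.002$. Tiny.
Now consider a high-probability “exploitation” token with $\pi_{\theta_{\text{old}}} = 0.9$:
- Maximum allowed $\pi_\theta = 0.9 \times 1.2 = 1.08$ (effectively capped at 1.0).
- The absolute increase can be up to $0.1$, the full remaining headroom. Enormous by comparison.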
The symmetric upper clip hits a practical asymmetry: low-probability tokens can barely increase their probability, while high-probability tokens can increase by a large absolute amount. The paper empirically confirms this by showing the mean probability of up-clipped tokens remains below 0.2 throughout training — evidence that the clip is actively suppressing exploration.
Over many training steps, the model never meaningfully increases the probability of rare tokens, driving entropy downward.
The Fix: Decouple Upper and Lower Clip Bounds
DAPO introduces two separate clipping parameters:

$$\text{clip}\big(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\big)$$

with $\varepsilon_{\text{low}} = 0.2$ (unchanged) and $\varepsilon_{\text{high}} = 0.28$ (relaxed from 0.2).
The rationale for asymmetry:
- Lower bound ($1-\varepsilon_{\text{low}} = 0.8$): Controls how aggressively the policy can decrease token probabilities. Keeping $\varepsilon_{\text{low}}$ at 0.2 prevents catastrophic forgetting of useful tokens; this is the “don’t forget too fast” constraint.
- Upper bound ($1+\varepsilon_{\text{high}} = 1.28$): Controls how aggressively the policy can increase token probabilities. Raising $\varepsilon_{\text{high}}$ to 0.28 gives low-probability exploration tokens more headroom to grow; this is the “let me try new things” constraint.
Now the maximum allowed probability for an exploration token with $\pi_{\theta_{\text{old}}} = 0.01$ becomes $0.0128$ per step. That is still small in absolute terms, but the allowed increase ($0.0028$) is 40% larger than in the symmetric case ($0.002$). Accumulated over many steps, this makes a qualitative difference.
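The headroom argument can be checked with a few lines of arithmetic. This sketch assumes illustrative old-policy probabilities of 0.01 (a rare token) and 0.9 (a dominant token); the helper name `max_prob_after_clip` is hypothetical.

```python
def max_prob_after_clip(p_old: float, eps_high: float) -> float:
    """Largest new probability still inside the upper clip bound (capped at 1.0)."""
    return min(1.0, p_old * (1.0 + eps_high))


for p_old in (0.01, 0.9):
    for eps_high in (0.2, 0.28):
        gain = max_prob_after_clip(p_old, eps_high) - p_old
        print(f"p_old={p_old:<5} eps_high={eps_high:<5} max absolute gain={gain:.4f}")
```

Under these assumptions the rare token gains 40% more headroom per step when the upper bound moves from 0.2 to 0.28, while the dominant token is limited by the probability ceiling of 1.0 in both cases and is unaffected.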
```mermaid
graph LR
    subgraph "Standard PPO: Symmetric clip"
        A1["r_t = 1.5\n(token wants to grow)"] --> B1["clipped to 1.2\n(ε = 0.2)"]
        A2["r_t = 0.5\n(token wants to shrink)"] --> B2["clipped to 0.8\n(ε = 0.2)"]
    end
    subgraph "DAPO Clip-Higher: Asymmetric"
        C1["r_t = 1.5\n(token wants to grow)"] --> D1["clipped to 1.28\n(ε_high = 0.28)"]
        C2["r_t = 0.5\n(token wants to shrink)"] --> D2["clipped to 0.8\n(ε_low = 0.2 unchanged)"]
    end
```
Figure 2: Symmetric (PPO) vs. asymmetric (DAPO Clip-Higher) clipping. The upper bound is relaxed from 1.2 to 1.28, giving more room for low-probability exploration tokens to grow, while the lower bound remains at 0.8 to protect against rapid forgetting.
Effect and Why Not Simply Raise ε Symmetrically?
Figure 2 in the paper shows entropy clearly diverges between Clip-Higher and no-Clip-Higher over training: with Clip-Higher, entropy grows gradually; without it, entropy collapses. On AIME 2024, Clip-Higher alone contributes +2 percentage points.
Raising $\varepsilon$ symmetrically to 0.28 would also relax the lower bound, allowing the model to decrease token probabilities much faster. This can destabilize training by enabling rapid forgetting of learned behaviors. The asymmetric choice is intentional: conservative forgetting, aggressive learning.
Boundary Conditions for Clip-Higher
- $\varepsilon_{\text{high}}$ too large: importance ratios become extreme, reintroducing training instability. The clipping interval grows so wide that it no longer provides the “trust region” guarantee PPO was designed to enforce.
- $\varepsilon_{\text{high}}$ too close to $\varepsilon_{\text{low}}$: no benefit over the symmetric case.
- The value 0.28 is empirically tuned; no theoretical justification is given in the paper.
Innovation 2: Dynamic Sampling — Eliminating Zero-Gradient Batches
Why Zero-Gradient Batches Arise
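Recall the group-normalized advantage (Eq. 7). When all $G$ responses to a prompt have the same correctness (either all correct, $R_i = 1$ for all $i$, or all wrong, $R_i = -1$ for all $i$), the numerator vanishes:

$$R_i - \text{mean}(\{R_j\}_{j=1}^{G}) = 0 \quad\Longrightarrow\quad \hat{A}_{i,t} = 0 \ \text{ for all } i, t$$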
Every token gets advantage zero → the gradient contribution from this prompt is exactly zero. The batch size effectively shrinks, gradient variance grows, and training efficiency drops. Figure 3b in the paper shows the proportion of “all-correct” prompts growing steadily across training steps — a structural problem that worsens as the model improves.
The Fix: Filter Prompts Before Training
DAPO enforces a hard constraint at sampling time: only keep prompts whose $G$ responses contain at least one correct and at least one incorrect response:

$$0 < \big|\{o_i \mid \text{is\_equivalent}(a, o_i)\}\big| < G$$

If this constraint is not satisfied, the prompt (and its responses) is discarded from the current training batch. Additional prompts are sampled from $\mathcal{D}$ until the buffer contains $N$ valid prompts.
```mermaid
flowchart TD
    A["Sample G responses for prompt q"] --> B{"All correct or all wrong?"}
    B -- "Yes: 0 gradient" --> C["Discard prompt"]
    B -- "No: mixed results" --> D["Add to training buffer"]
    C --> E["Sample next prompt from dataset"]
    E --> A
    D --> F{"Buffer size >= N?"}
    F -- "No" --> E
    F -- "Yes" --> G["Proceed to policy update"]
```
Figure 3: Dynamic sampling flow. Prompts producing zero-gradient batches (all correct or all wrong) are discarded. New prompts are sampled until the buffer fills with N effective prompts, all of which contribute nonzero gradient signal.
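The filtering loop can be sketched in a few lines. This is a toy illustration only: `fill_buffer`, `toy_sample_group`, the `solve_prob` table, and the `max_draws` guard are all hypothetical stand-ins, and real rollouts would come from the policy rather than from a coin flip.

```python
import random


def fill_buffer(prompts, sample_group, n_target, max_draws=10_000):
    """Collect n_target prompts whose responses are neither all right nor all wrong."""
    buffer, draws = [], 0
    while len(buffer) < n_target and draws < max_draws:
        q = random.choice(prompts)
        correct = sample_group(q)              # list of G booleans
        draws += 1
        if 0 < sum(correct) < len(correct):    # keep mixed groups only
            buffer.append((q, correct))
    return buffer, draws


# Toy stand-in for rollouts: each prompt has a fixed probability of being solved.
solve_prob = {"easy": 1.0, "medium": 0.5, "hard": 0.0}


def toy_sample_group(q, G=16):
    return [random.random() < solve_prob[q] for _ in range(G)]


random.seed(0)
buffer, draws = fill_buffer(list(solve_prob), toy_sample_group, n_target=8)
acceptance_rate = len(buffer) / draws
print(f"kept {len(buffer)} of {draws} sampled prompts ({acceptance_rate:.0%})")
```

In this toy setup only the "medium" prompt ever survives the filter. The `max_draws` guard is one way to avoid looping forever when almost nothing passes, and the acceptance rate is the quantity worth logging in practice.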
Efficiency Analysis: Does Filtering Hurt Throughput?
The paper makes a crucial practical observation: in a synchronized RL system, generation time is dominated by the longest sample in each batch (because the accelerator must wait for all sequences to finish). Discarding some prompts (and their generated responses) does not necessarily increase wall-clock time, since the long-tail responses often take the same time whether the prompt is kept or discarded. Empirically (Figure 6 of the paper), dynamic sampling actually reduces convergence time despite generating more total responses per step.
Why Not Curriculum Learning Instead?
An alternative to dynamic sampling is curriculum learning: pre-sort prompts by difficulty and schedule easier prompts first, gradually increasing difficulty. Dynamic sampling is simpler and adaptive: it automatically tracks which prompts are at the right difficulty level for the current policy without requiring pre-computed difficulty scores or a training schedule. As the policy improves and more prompts become “all correct,” dynamic sampling automatically shifts to harder prompts.
Boundary Conditions for Dynamic Sampling
- On very hard datasets where even the trained model rarely gets anything correct, filtering may leave too few valid prompts per step, effectively halting training. Monitoring the filter acceptance rate is essential.
- The `is_equivalent(a, o_i)` correctness check must be reliable. For integer-answer math (DAPO-Math-17K), this is exact match. For open-ended domains, designing a reliable equivalence function is much harder.
- The effective batch size is always $N$ valid prompts; the number of discarded prompts varies and is not reported, making compute cost analysis difficult.
Innovation 3: Token-Level Policy Gradient Loss
The Sample-Level Averaging Problem
GRPO uses sample-level loss averaging: the loss for each response is first averaged over its tokens, then the per-response averages are averaged across the group:

$$J_{\text{sample}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} g_{i,t}$$

where $g_{i,t}$ is the per-token gradient term.
Each response contributes equally ($1/G$) regardless of length. As a result, a short response with 200 tokens and a long response with 5,000 tokens both contribute $1/G$ to the total gradient. In terms of per-token weight, the short response’s tokens each receive $\frac{1}{200\,G}$ and the long response’s tokens each receive $\frac{1}{5000\,G}$, a 25× disparity.
This has two adverse effects:
- Long high-quality responses have lower per-token gradient weight, implicitly penalizing the model for generating extended reasoning chains even when they are correct. The model learns to prefer short answers.
- Gibberish or repetition inside long responses is under-penalized for the same reason: each low-quality token carries only a small share of its response's weight, so the loss cannot effectively suppress these patterns.
Figure 4 in the paper shows that without token-level loss, entropy and response length can increase in an “unhealthy” way (entropy growing due to gibberish generation, not genuine exploration).
The Fix: Token-Level Averaging
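DAPO switches the normalization from $\frac{1}{G}\sum_i \frac{1}{|o_i|}$ (per-sample) to $\frac{1}{\sum_i |o_i|}$ (per-token):

$$J_{\text{token}}(\theta) = \frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} g_{i,t}$$

Each token now contributes $1/N_{\text{total}}$ to the gradient, where $N_{\text{total}} = \sum_i |o_i|$, regardless of which response it belongs to. Longer responses contribute proportionally more gradient signal, but each individual token within them is treated equally to tokens in shorter responses.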
```mermaid
graph TD
    subgraph "Sample-Level GRPO"
        R1["Response A\n200 tokens, correct"] -- "weight 1/G" --> GL["Gradient Loss"]
        R2["Response B\n5000 tokens, correct + gibberish"] -- "weight 1/G" --> GL
        note1["Each response = equal weight\nLong gibberish is NOT penalized"]
    end
    subgraph "Token-Level DAPO"
        R3["Response A\n200 tokens, correct"] -- "200 tokens at 1/N_total each" --> GL2["Gradient Loss"]
        R4["Response B\n5000 tokens, correct + gibberish"] -- "5000 tokens at 1/N_total each" --> GL2
        note2["Each token = equal weight\nGibberish tokens still penalized if low-reward"]
    end
```
Figure 4: Sample-level (GRPO) vs. token-level (DAPO) gradient averaging. With token-level averaging, long responses contribute proportionally more total gradient signal, but each token is treated equally. Garbage tokens in long responses contribute their disadvantage-scaled gradient, naturally penalizing repetition.
Mathematical Derivation of the Difference
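Let $N_{\text{total}} = \sum_{i=1}^{G} |o_i|$ (total tokens in the group). Define $\bar{g}_i = \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} g_{i,t}$ (the per-sample mean gradient). Then:

$$J_{\text{sample}} = \sum_{i=1}^{G} \frac{1}{G}\, \bar{g}_i, \qquad J_{\text{token}} = \sum_{i=1}^{G} \frac{|o_i|}{N_{\text{total}}}\, \bar{g}_i$$

The sample-level objective is an equally-weighted average of per-response means; the token-level objective is a length-weighted average. Responses are weighted by their fraction of total tokens $|o_i| / N_{\text{total}}$.
If longer responses have higher $\bar{g}_i$ (because they tend to be more detailed and correct), token-level improves training by upweighting them. If longer responses have lower $\bar{g}_i$ (because they contain garbage), token-level penalizes them more than sample-level does.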
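The two weighting schemes can be compared numerically. The sketch below uses a two-response group of 200 and 5,000 tokens as an illustration; the function name `per_token_weights` is hypothetical.

```python
def per_token_weights(lengths, level):
    """Weight that each token of response i receives in the loss average."""
    G, n_total = len(lengths), sum(lengths)
    if level == "sample":
        return [1.0 / (G * n) for n in lengths]    # 1/G split across |o_i| tokens
    if level == "token":
        return [1.0 / n_total for _ in lengths]    # every token weighted equally
    raise ValueError(f"unknown level: {level}")


lengths = [200, 5000]
sample_w = per_token_weights(lengths, "sample")
token_w = per_token_weights(lengths, "token")

# Per-token disparity under sample-level averaging.
ratio = sample_w[0] / sample_w[1]
# Share of the total loss weight that each whole response receives.
sample_share = [w * n for w, n in zip(sample_w, lengths)]
token_share = [w * n for w, n in zip(token_w, lengths)]

print(f"sample-level per-token ratio: {ratio:.1f}x")
print(f"response shares, sample-level: {sample_share}")
print(f"response shares, token-level:  {token_share}")
```

Under sample-level averaging both responses hold half of the total weight; under token-level averaging the long response holds 5000/5200 of it, which is exactly the length-weighted average in the derivation above.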
Effect and Boundary Conditions
The paper reports that token-level loss brings only +1 percentage point on AIME 2024 directly, but “enhances training stability and makes the length increase more healthy” — meaning the model generates longer responses that contain genuine reasoning rather than repetitive filler.
Boundary condition: token-level averaging is length-neutral in terms of policy gradient direction — it does not explicitly reward or penalize length. However, combined with overlong reward shaping, longer responses that earn negative length penalties will have those penalties more heavily weighted in the gradient.
Innovation 4: Overlong Reward Shaping — Soft Length Penalties
The Problem: Hard Truncation Creates Reward Noise
When a response exceeds $L_{\max}$ tokens, generation is truncated. A naive approach assigns a fixed penalty of $-1$ to all truncated responses. This is reward noise: a response that was 90% of the way through a correct solution gets the same penalty as one that rambled incoherently from the first token.
The paper first tries overlong filtering: mask the loss for all truncated samples (don’t update on them at all). Figure 5 in the paper shows this alone contributes a significant improvement — +6 percentage points on AIME 2024 — confirming that the hard truncation penalty was actively hurting training. However, filtering wastes the computation spent generating overlong samples.
The Fix: Graduated Soft Penalty
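DAPO proposes a soft overlong punishment: a length-based penalty that is zero within the safe zone and grows linearly in magnitude as responses approach the limit:

$$R_{\text{length}}(y) = \begin{cases} 0, & |y| \le L_{\max} - L_{\text{cache}} \\[4pt] \dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\[4pt] -1, & |y| > L_{\max} \end{cases}$$

Parameters: $L_{\max} = 20{,}480$ tokens, $L_{\text{cache}} = 4{,}096$ tokens. So the penalty-free zone is $|y| \le 16{,}384$ tokens; truncated responses still receive $-1$.
The total reward for response $o_i$ to prompt $q$ is:

$$R_{\text{total}}(q, o_i) = R_{\text{correctness}}(q, o_i) + R_{\text{length}}(o_i)$$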
```mermaid
graph LR
    A["Response length |y|"] --> B{"Is |y| <= 16384?"}
    B -- "Yes" --> C["R_length = 0\nNo penalty"]
    B -- "No" --> D{"Is |y| <= 20480?"}
    D -- "Yes (16384 < |y| <= 20480)" --> E["R_length = linear\nfrom 0 to -1"]
    D -- "No (|y| > 20480, truncated)" --> F["R_length = -1\nMaximum penalty"]
```
Figure 5: Overlong reward shaping function. Responses within 16,384 tokens incur no length penalty. Responses between 16,384 and 20,480 tokens receive a proportional soft penalty. Truncated responses (> 20,480 tokens) receive the full -1 penalty.
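The piecewise penalty translates directly into code. The following is a minimal sketch using the thresholds quoted in this section (no penalty up to 16,384 tokens, full penalty at 20,480); the function and parameter names are illustrative.

```python
def length_penalty(length: int, l_max: int = 20480, l_cache: int = 4096) -> float:
    """Soft overlong penalty: 0 in the safe zone, linear down to -1 at l_max."""
    soft_start = l_max - l_cache               # 16384 with the defaults
    if length <= soft_start:
        return 0.0
    if length <= l_max:
        return (soft_start - length) / l_cache  # falls linearly from 0 to -1
    return -1.0                                 # beyond the limit (truncated)


for n in (10_000, 16_384, 18_432, 20_480, 25_000):
    print(n, length_penalty(n))
```

A response halfway through the buffer zone (18,432 tokens) gets a penalty of -0.5, so the signal scales with how far the response has overshot instead of jumping straight to the maximum.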
Why Soft Beats Hard
A hard penalty of $-1$ for any truncated response tells the model “this was maximally bad.” The gradient says: suppress the entire sequence’s behavior. But if the sequence was almost correct, that suppression is counterproductive. The soft penalty provides a gradient signal proportional to how overlong the sequence is: slightly overlong sequences get a small penalty, giving the model a smooth objective surface to optimize.
Additionally, the linear interpolation zone ($L_{\text{cache}} = 4{,}096$ tokens of buffer) gives the model early warning before it hits the hard limit, encouraging it to plan for response length and conclude before truncation.
Interaction with Token-Level Loss
With token-level loss, overlong responses contribute many tokens to the gradient, each carrying the advantage computed from $R_{\text{total}}$. If $R_{\text{length}}$ is close to $-1$ for an overlong response, the combined reward penalizes the model more strongly than it would under sample-level loss, a compounding effect that discourages pathological length growth.
The Complete DAPO Objective Function
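Combining all four innovations, the DAPO objective is:

$$J_{\text{DAPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\big(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}\Big)\right] \tag{21}$$

subject to the dynamic sampling constraint:

$$0 < \big|\{o_i \mid \text{is\_equivalent}(a, o_i)\}\big| < G \tag{22}$$

where:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}, \qquad \hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^{G})}{\text{std}(\{R_j\}_{j=1}^{G})}$$

The five differences from GRPO (Eq. 8) are:
- No KL penalty: removed because for hard reasoning, the policy needs to deviate significantly from the pretrained model.
- Asymmetric clip bounds: $\varepsilon_{\text{low}} \neq \varepsilon_{\text{high}}$ (0.20 vs. 0.28) for Clip-Higher.
- Token-level denominator: $\sum_i |o_i|$ instead of $G$ with a per-response $1/|o_i|$.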
- Dynamic sampling constraint (Eq. 22) filters zero-gradient batches.
- Soft length penalty in the reward .
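To see how the pieces fit together, here is a pure-Python sketch that evaluates the token-level, asymmetrically clipped surrogate for one group. It only computes the objective value; a real implementation would use an autodiff framework and batched tensors. The function name, the list-of-lists input layout, and the choice to raise on zero reward variance are assumptions of this sketch.

```python
import math
from statistics import mean, pstdev


def dapo_objective(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate for one group of responses.

    logp_new, logp_old: per-response lists of per-token log-probabilities.
    rewards: one scalar total reward per response.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All rewards equal: dynamic sampling would have discarded this group.
        raise ValueError("group has no reward variance")
    total, n_tokens = 0.0, 0
    for new_i, old_i, r_i in zip(logp_new, logp_old, rewards):
        adv = (r_i - mu) / sigma                 # same advantage for every token
        for lp_new, lp_old in zip(new_i, old_i):
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return total / n_tokens                      # token-level normalization


# On-policy check: all ratios are 1, so the value is the token-weighted mean advantage.
logp = [[-1.0], [-2.0, -2.0, -2.0]]
on_policy = dapo_objective(logp, logp, rewards=[1.0, -1.0])
print(on_policy)
```

With one correct single-token response and one wrong three-token response, the on-policy value is (1 - 3) / 4 = -0.5, which shows the length weighting of the token-level denominator at work.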
Algorithm: DAPO Step by Step
Pseudocode (Algorithm 1)
```
Algorithm DAPO
Input: policy π_θ, reward function R, prompts D,
       hyperparameters G, ε_low, ε_high, L_max, L_cache,
       N (target buffer size), μ (gradient steps per rollout)

for step = 1, 2, ..., M:
    π_{θ_old} ← π_θ                                  // freeze old policy for this step

    // === Phase 1: Dynamic Sampling ===
    buffer ← []
    while len(buffer) < N:
        q ← sample_prompt(D)
        {o_1, ..., o_G} ← π_{θ_old}(· | q)           // sample G responses
        {R_1, ..., R_G} ← [R_total(q, o_i) for i]    // compute rewards incl. length penalty
        n_correct ← |{i : is_equivalent(answer(q), o_i)}|
        if 0 < n_correct < G:                        // dynamic sampling filter
            buffer.append((q, {o_i}, {R_i}))

    // === Phase 2: Advantage Computation ===
    for each (q, {o_i}, {R_i}) in buffer:
        μ_R ← mean({R_i}); σ_R ← std({R_i})
        A_i ← (R_i - μ_R) / σ_R                      // group-normalized advantage

    // === Phase 3: Multiple Gradient Updates ===
    for update = 1, 2, ..., μ:
        compute J_DAPO(θ) using Eq. (21)
        θ ← θ + α ∇_θ J_DAPO(θ)                      // gradient ascent

Output: π_θ
```
Line-by-Line Explanation
Dynamic sampling loop: The while loop samples new prompts until the buffer contains $N$ valid items. Validity requires mixed correct/incorrect responses. This is the outer loop that ensures gradient quality before any model update.
Reward computation: R_total(q, o_i) = R_correctness(q, o_i) + R_length(o_i). For responses within $L_{\max} - L_{\text{cache}} = 16{,}384$ tokens, the length penalty is 0.
Advantage computation: Standard GRPO-style group normalization, but applied to $R_{\text{total}}$ rather than just $R_{\text{correctness}}$. The advantage captures both task success and length behavior.
Freeze old policy ($\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$): The importance ratios $r_{i,t}$ are computed relative to $\pi_{\theta_{\text{old}}}$. Freezing at the start of each rollout step enables $\mu$ gradient updates with the same reference, following the PPO multi-epoch pattern.
Multiple gradient updates ($\mu$ iterations): Reusing samples multiple times per rollout improves sample efficiency. The clip constraint ensures the policy cannot drift too far from $\pi_{\theta_{\text{old}}}$ in any single update.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | Qwen2.5-32B |
| Optimizer | AdamW |
| Learning rate | $1 \times 10^{-6}$ |
| Warmup steps | 20 (linear warmup) |
| Rollout prompt batch size | 512 |
| Responses per prompt ($G$) | 16 |
| Gradient update iterations ($\mu$) | 16 |
| Generation temperature | 1.0 |
| Top-p | 0.7 |
| $L_{\max}$ | 20,480 tokens |
| $L_{\text{cache}}$ | 4,096 tokens |
| $\varepsilon_{\text{low}}$ | 0.2 |
| $\varepsilon_{\text{high}}$ | 0.28 |
| Training infrastructure | 128 × H20 GPUs |
| Framework | verl (volcengine) |
Connections to Related Work
DAPO sits within a rapidly evolving landscape of RL algorithms for LLM post-training. Understanding where it fits helps clarify what is novel and what builds on prior work.
REINFORCE++ (Jian Hu, 2025)
REINFORCE++ (arXiv:2501.03262) is a simplified variant that removes the value function from PPO and uses a token-level baseline for variance reduction. It does not use group sampling. DAPO’s token-level loss is inspired by similar thinking, but DAPO additionally introduces dynamic sampling and Clip-Higher, which REINFORCE++ lacks.
RLOO (Leave-One-Out, 2024)
RLOO computes per-response baselines by leaving one response out: for response $i$, the baseline is the mean of the other $G-1$ rewards. This reduces variance compared to the global mean baseline in GRPO. RLOO is equivalent to GRPO in the limit of large $G$. DAPO uses the same group-relative baseline as GRPO (not RLOO), but the dynamic sampling filter ensures the baseline is computed only on informative samples.
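The relationship between the leave-one-out baseline and the plain group-mean baseline can be checked numerically. This is a minimal sketch with hypothetical function names; it compares only the centered rewards and leaves out GRPO's division by the group standard deviation.

```python
def rloo_advantages(rewards):
    """Advantage of each response against the mean of the other G - 1 rewards."""
    G, total = len(rewards), sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]


def mean_baseline_advantages(rewards):
    """Advantage of each response against the mean of the whole group."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


rewards = [1.0, -1.0, -1.0, -1.0]
rloo = rloo_advantages(rewards)
centered = mean_baseline_advantages(rewards)
scale = len(rewards) / (len(rewards) - 1)   # G / (G - 1)

print(rloo)
print([scale * c for c in centered])
```

The two differ only by the constant factor G / (G - 1), which tends to 1 as the group grows; this is the sense in which the two baselines coincide for large groups.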
GSPO (Group Sequence Policy Optimization, 2025)
GSPO (covered in this blog series, 2026-05-31) proposes sequence-level clipping rather than token-level clipping, arguing that token-level importance ratios can be pathologically small or large in long sequences even when the overall sequence distribution shift is moderate. DAPO’s Clip-Higher addresses a different aspect: it relaxes the upper bound for exploration rather than changing the clipping unit. The two techniques are complementary.
DeepSeek-R1-Zero (2025)
DeepSeek-R1-Zero uses GRPO directly on a large model with carefully curated data. The key recipe difference is that DeepSeek-R1-Zero uses the standard KL-regularized GRPO while DAPO removes KL and adds the four targeted fixes. DAPO achieves better performance with fewer training steps, suggesting that the DAPO modifications are more efficient than the DeepSeek approach even when applied to the same base model (Qwen2.5-32B).
Rule-Based Rewards: Why Not a Learned Reward Model?
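A natural question: why use a simple binary correctness reward ($R \in \{-1, +1\}$) rather than a learned reward model that could give nuanced feedback?
DAPO follows DeepSeek-R1’s approach of using rule-based rewards for verifiable tasks:

$$R(\hat{y}, y) = \begin{cases} 1, & \text{is\_equivalent}(\hat{y}, y) \\ -1, & \text{otherwise} \end{cases}$$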
The key advantage is reward hacking resistance. Learned reward models can be “gamed”: the policy finds responses that score high on the reward model but are not actually correct — a well-documented failure mode called reward over-optimization or specification gaming. With verifiable integer answers, exact-match checking is immune to this: a response either equals the ground-truth integer or it does not.
There is also an accuracy advantage: learned reward models make mistakes, especially on hard math problems where the reward model may not understand the reasoning. An exact-match checker on integers has zero error rate on valid answers.
The limitation: binary rewards provide no partial credit. A response that completes 9 of 10 solution steps correctly but makes a final arithmetic error gets $R = -1$, identical to a completely wrong response. This coarseness may slow learning. More granular reward signals (process reward models, step-level rewards) are an active research direction, but they require much more expensive supervision collection.
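An exact-match reward for integer answers fits in a few lines. The sketch below assumes the final answer has already been extracted from the response as a string; that extraction step, and the name `rule_based_reward`, are assumptions here and are not taken from the paper's code.

```python
def rule_based_reward(predicted: str, ground_truth: int) -> int:
    """+1 if the predicted answer parses to the ground-truth integer, else -1."""
    try:
        return 1 if int(predicted.strip()) == ground_truth else -1
    except ValueError:
        # Anything that is not an integer literal counts as wrong.
        return -1


print(rule_based_reward(" 19 ", 19))      # correct answer with stray whitespace
print(rule_based_reward("20", 19))        # wrong integer
print(rule_based_reward("nineteen", 19))  # unparseable answer
```

There is nothing for the policy to game in this check, which is the hacking resistance discussed above, and there is also no notion of a nearly right answer, which is the partial-credit limitation.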
Experiments: Setup and Evaluation
Dataset: DAPO-Math-17K
The training dataset was scraped from mathematical competition websites and manually annotated, then transformed to integer-answer format. The original math datasets include answers in various forms (fractions, surds, expressions). The team uses an LLM-driven transformation: given an answer in closed form (for example, one containing a surd), the problem is rewritten to ask for an integer derived from that form. This transformation:
- Enables exact-match reward computation () without a formula parser.
- Makes the reward signal reliable (no ambiguity in equivalence checking).
- Restricts the dataset to math problems with unambiguous correct answers.
The final dataset: 17,000 prompts each paired with an integer answer.
Dataset transformation example (from the paper’s Appendix):
- Original: “Let and be real numbers such that . Determine the smallest possible value of .” Answer: .
- Transformed: “The original answer is in the form . Find .” Answer: .
The transformation is LLM-driven: a chain-of-thought prompt asks the LLM to (1) extract the answer format, (2) rewrite the question, (3) solve the modified version, and (4) give an integer. The paper reports that in most cases, the LLM can successfully perform this transformation with high accuracy.
This approach is clever but domain-limited: it works for math problems with algebraic or numeric answers expressible as integer combinations. It cannot be directly applied to geometry (visual), proof-based problems, or open-ended reasoning.
Benchmark: AIME 2024
AIME (American Invitational Mathematics Examination) is a competition math benchmark. AIME 2024 has 30 problems (15 AIME I + 15 AIME II). The paper evaluates using avg@32: sample 32 responses per problem and report the mean accuracy. Averaging over 32 samples gives a much less noisy estimate than a single-sample pass@1 on a 30-problem test set.
“50 points” = 50% average accuracy on AIME 2024 (50% of 30 problems × 32 samples answered correctly on average). The model correctly solves about 15 out of 30 AIME problems on any given attempt.
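The avg@k metric is a plain mean over all (problem, sample) pairs. Here is a minimal sketch on a made-up 2-problem, 4-sample grid; the function name `avg_at_k` and the toy data are illustrative only.

```python
def avg_at_k(correct_matrix):
    """Mean accuracy over problems, where each row holds k sample outcomes (0 or 1)."""
    per_problem = [sum(row) / len(row) for row in correct_matrix]
    return sum(per_problem) / len(per_problem)


# Two problems, four samples each: solved 2/4 and 1/4 of the time.
toy_results = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
score = avg_at_k(toy_results)
print(f"avg@4 = {score:.3f}")
```

For AIME 2024 the grid would be 30 rows by 32 columns, and a score of 0.50 means half of those 960 attempts were correct.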
Results: Incremental Ablation Analysis
Table 1 in the paper reports the progressive ablation:
| Technique Added | AIME 2024 avg@32 | Δ |
|---|---|---|
| DeepSeek-R1-Zero-Qwen-32B (reference) | 47% | — |
| Naive GRPO baseline | 30% | — |
| + Overlong Filtering | 36% | +6 |
| + Clip-Higher | 38% | +2 |
| + Soft Overlong Punishment | 41% | +3 |
| + Token-Level Loss | 42% | +1 |
| + Dynamic Sampling (= Full DAPO) | 50% | +8 |
The ablation is cumulative: each row adds exactly one technique to the configuration of the row above.
Key observations:
- Overlong filtering is the largest early fix (+6): The naive GRPO baseline was severely hurt by the hard truncation penalty. Simply masking overlong responses (ignoring them rather than penalizing them) gives the second-largest gain in the table, which suggests that reward noise from truncation was a major problem in the baseline.
- Dynamic sampling provides the final +8 jump: When all other techniques are in place, dynamic sampling moves accuracy from 42% to 50%. This large final jump suggests the other techniques enable the model to reach a point where gradient quality (which dynamic sampling addresses) becomes the binding constraint.
- Token-level loss is subtle (+1 direct, stability improvement): The direct accuracy gain is small, but the paper reports qualitative benefits in training stability and length dynamics. The effect may be more pronounced over longer training runs.
- Full DAPO (50%) exceeds DeepSeek-R1-Zero-Qwen-32B (47%) using only 50% of the training steps (Figure 1 of the paper), a significant efficiency gain.
Case Study: Emergent Reflective Reasoning
One of the most remarkable findings in DAPO is the spontaneous emergence of self-reflection behaviors. Table 2 in the paper presents a case where the model, midway through generating a solution, pauses and reconsiders its approach:
Example problem: Given a tetrahedron S-ABC with equilateral base ABC, where the projection H of point A onto face SBC is the orthocenter of triangle SBC, the dihedral angle H-AB-C is 30°, and SA=2, find the volume. Express the answer in the form $\frac{k}{m}$ and give $k + m$.
The model’s response (paraphrased) shows:
- Setting up coordinates and computing distances.
- Attempting a geometric calculation.
- Stopping mid-calculation: “However, wait a moment, let’s rethink about the dihedral angle involving planes in a more thoughtful geometric way.”
- Restarting with a different approach (plane intersection method).
- Arriving at the correct answer.
This reflective behavior — pausing, questioning the current approach, and switching strategies — is characteristic of expert mathematical reasoning. It was not present in the base model before RL training and was not explicitly rewarded. It emerged purely because responses containing backtracking and self-correction were more likely to produce correct final answers, which the binary reward captures.
This finding has significant implications: RL with outcome-only rewards can produce process-level reasoning improvements that would normally require dense process supervision (process reward models) to elicit. The mechanism likely involves the model discovering that “check your work” behaviors increase final accuracy over many training steps.
Training Dynamics Analysis
The paper monitors four training metrics throughout the run:
Response Length (Figure 7a): Length generally increases as training progresses, reflecting the model learning to generate extended chain-of-thought reasoning. However, length can exhibit periods of stagnation or even decline — the paper attributes this to the model “finding shortcuts” (brief but correct patterns). The authors use length in conjunction with validation accuracy to detect deteriorating experiments.
Reward Score (Figure 7b): Training reward increases smoothly and stably. Importantly, the paper notes that at later training stages, training reward shows “little correlation with validation accuracy on AIME” — a sign of overfitting to the specific training prompts rather than generalizing the reasoning capability.
Entropy (Figure 7c): With DAPO (and specifically Clip-Higher), entropy maintains a slow upward trend throughout training. This is the diagnostic signature of healthy exploration: the model is continuously diversifying its reasoning approaches. Compare to GRPO without Clip-Higher, where entropy collapses.
Mean Probability (Figure 7d): Inversely correlated with entropy. As entropy rises, mean probability of tokens slightly decreases — the probability mass is spreading out rather than concentrating.
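To make the last two metrics concrete, here is a minimal sketch of how per-token entropy and mean probability could be computed from logits. It assumes that "mean probability" means the average probability the policy assigns to the tokens it actually sampled; the toy logits and all function names are made up for illustration and are not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def generation_stats(per_token_logits, sampled_ids):
    """Mean entropy and mean sampled-token probability over one response."""
    entropies, chosen = [], []
    for logits, tok in zip(per_token_logits, sampled_ids):
        probs = softmax(logits)
        entropies.append(token_entropy(probs))
        chosen.append(probs[tok])
    n = len(sampled_ids)
    return sum(entropies) / n, sum(chosen) / n


# Toy 4-token vocabulary: a sharply peaked policy versus a flatter one.
peaked_logits = [[8.0, 0.0, 0.0, 0.0]] * 3
flat_logits = [[1.0, 0.5, 0.0, 0.0]] * 3
sampled = [0, 0, 0]

peaked_entropy, peaked_prob = generation_stats(peaked_logits, sampled)
flat_entropy, flat_prob = generation_stats(flat_logits, sampled)
print(f"peaked: entropy={peaked_entropy:.3f}, mean prob={peaked_prob:.3f}")
print(f"flat:   entropy={flat_entropy:.3f}, mean prob={flat_prob:.3f}")
```

The flatter distribution has higher entropy and a lower probability on the sampled token, which is the inverse relationship described above.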
Emergent Reasoning Behaviors
One of the most striking results is the emergence of reflective behavior during RL training. In early training, the model rarely checks or revises its reasoning mid-response. As training progresses, the model spontaneously starts generating:
- Self-verification steps (“Let me verify this…”).
- Backtracking patterns (“Wait, let me reconsider…”).
- Error-correction sequences (identifying a mistake and restarting a sub-calculation).
These behaviors are not explicitly supervised — they emerge purely from the reward signal (task correctness). This provides strong empirical evidence that RL with outcome rewards can induce sophisticated metacognitive reasoning behaviors.
The verl Framework
DAPO is built on verl (volcengine reinforcement learning, also known as HybridFlow), an open-source RLHF framework from ByteDance designed for efficient large-scale LLM RL.
Architecture Overview
graph TD
A["Prompt Dataset D"] --> B["Rollout Workers\n(Policy π_θ_old generates G responses)"]
B --> C["Reward Computation\n(R_correctness + R_length)"]
C --> D["Dynamic Sampling Filter\n(keep only mixed batches)"]
D --> E["Advantage Normalization\n(group-level standardization)"]
E --> F["Policy Update\n(μ gradient steps with DAPO objective)"]
F --> G["Updated Policy π_θ"]
G --> B
style A fill:#e8f4fd,stroke:#2196f3
style F fill:#e8fde8,stroke:#4caf50
style G fill:#e8fde8,stroke:#4caf50
Figure 7 (system): End-to-end DAPO training pipeline built on verl. Rollout workers generate responses asynchronously; the dynamic sampling filter acts as a quality gate; policy updates use the DAPO objective with decoupled clipping and token-level averaging.
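The middle of this pipeline (reward computation, dynamic sampling filter, group-level standardization) can be sketched in a few lines. This is a toy illustration in plain Python, not verl's API: `collect_batch`, `toy_rollout`, and `toy_reward` are hypothetical names, the binary +1/-1 reward stands in for the correctness reward, and using the population standard deviation as the normalizer is an assumption.

```python
import random
import statistics


def dynamic_sampling_filter(groups):
    """Keep only prompt groups whose rewards are not all identical."""
    return [g for g in groups if len(set(g["rewards"])) > 1]


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed normalizer
    return [(r - mean) / (std + eps) for r in rewards]


def collect_batch(prompts, rollout, reward_fn, group_size, batch_size):
    """Sample prompts until batch_size groups with mixed outcomes are kept."""
    batch = []
    for prompt in prompts:
        responses = [rollout(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, resp) for resp in responses]
        group = {"prompt": prompt, "responses": responses, "rewards": rewards}
        for kept in dynamic_sampling_filter([group]):
            kept["advantages"] = group_advantages(kept["rewards"])
            batch.append(kept)
        if len(batch) >= batch_size:
            break
    return batch


# Toy setup: each prompt is (name, probability that a sampled response is correct).
rng = random.Random(0)
prompts = [("easy", 1.0), ("hard", 0.0), ("mid-a", 0.5), ("mid-b", 0.4), ("mid-c", 0.6)]


def toy_rollout(prompt):
    return "right" if rng.random() < prompt[1] else "wrong"


def toy_reward(prompt, response):
    return 1.0 if response == "right" else -1.0


batch = collect_batch(prompts, toy_rollout, toy_reward, group_size=8, batch_size=2)
for group in batch:
    print(group["prompt"][0], group["rewards"])
```

The always-correct and always-wrong prompts never reach the batch, because every response in their group receives the same reward and the standardized advantage would be zero for all of them.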
Key Design Principles
Hybrid parallelism: verl supports configuring actor (policy), reference, reward, and critic models with independent parallelism strategies. For a 32B model across 128 H20 GPUs, different components can use different tensor-parallel and pipeline-parallel configurations to maximize GPU utilization.
Flexible placement: Actor and reference models can be co-located or placed on separate GPU groups, allowing memory-compute tradeoffs based on available hardware.
Efficient rollout: The framework pipelines rollout generation with training — while the policy model is being updated, the next batch of responses can be generating in parallel.
Policy-reference decoupling: The policy model (frequently updated) and reference model (frozen) can run with different parallelism configurations. Since the reference model never updates, it can be kept at lower precision (e.g., BF16 instead of FP32) or more aggressively sharded to save memory.
Rollout-train overlap: While the policy model processes one batch of training updates, verl can concurrently generate the next rollout batch. This overlapping reduces GPU idle time — a critical efficiency factor when rollout dominates total compute time (which it does for long-CoT reasoning models generating 10,000+ tokens per response).
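The overlap idea can be illustrated with a generic producer/consumer loop. The sketch below is not verl's implementation or API; `train_with_overlap`, `toy_rollout`, and `toy_train_step` are hypothetical names, and a single background thread stands in for the rollout workers.

```python
from concurrent.futures import ThreadPoolExecutor


def train_with_overlap(generate_rollout, train_step, policy, num_steps):
    """Train on batch k while batch k+1 is generated in a background thread."""
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(generate_rollout, policy, 0)
        for step in range(num_steps):
            batch = pending.result()  # wait for the rollout prepared earlier
            if step + 1 < num_steps:
                # Launched before this step's update, so the next batch comes
                # from a policy that is one update behind.
                pending = pool.submit(generate_rollout, policy, step + 1)
            policy = train_step(policy, batch)
            log.append((step, batch["policy_version"], policy))
    return log


# Toy stand-ins: the "policy" is just an integer version counter.
def toy_rollout(policy_version, step):
    return {"step": step, "policy_version": policy_version}


def toy_train_step(policy_version, batch):
    return policy_version + 1


log = train_with_overlap(toy_rollout, toy_train_step, policy=0, num_steps=3)
for step, rollout_version, new_version in log:
    print(f"step {step}: trained on rollout from v{rollout_version}, now at v{new_version}")
```

The log makes the cost of overlapping visible: from the second step on, each batch was generated by a policy one update older than the one being trained. Whether that staleness is acceptable depends on how the importance ratio and clipping are configured.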
Released Components
The paper releases:
- Training code: github.com/volcengine/verl
- Training dataset: DAPO-Math-17K on HuggingFace
- Complete hyperparameter configuration
This is an unusually complete open-source release for a state-of-the-art reasoning system.
The Four Techniques as a Coherent System
While each DAPO technique targets a specific failure mode, they are not independent — they interact:
Clip-Higher enables dynamic sampling to work better. With higher entropy (from Clip-Higher), the model generates more diverse responses per prompt, making it more likely that some responses are correct and some incorrect. This increases the proportion of valid prompts that pass the dynamic sampling filter, reducing overhead.
Dynamic sampling makes token-level loss more effective. By ensuring every prompt in the batch has non-trivial advantage variation, dynamic sampling ensures the token-level loss gradients are more meaningful — they come from prompts where the model has genuine uncertainty, not from trivial always-correct/always-wrong situations.
Token-level loss interacts with overlong reward shaping. Under token-level averaging, an overlong response contributes many tokens to the gradient, and each of those tokens carries the negative advantage induced by the length penalty. The combination creates a stronger signal against pathological length growth than either technique alone.
Removing KL enables all four to work without constraint. The KL penalty would partially counteract Clip-Higher’s entropy promotion (by pulling the policy back toward the reference model’s lower entropy). Removing KL allows the policy to genuinely maintain high entropy during training.
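These interactions are one plausible reading of the ablation table, but the table cannot confirm them on its own. Because each row adds one technique to the row above, the increments sum to the full +20 percentage points over naive GRPO by construction (6 + 2 + 3 + 1 + 8). The first four techniques account for +12 points, and dynamic sampling, added last on top of them, contributes the remaining +8. That final step is consistent with synergy, but separating interaction effects from main effects would require ablations that toggle the techniques independently.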
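The interaction between token-level loss and overlong shaping can be made concrete with a small sketch. It assumes a piecewise-linear soft penalty (zero up to a threshold, decreasing linearly to -1 across a buffer, -1 beyond the maximum length); the default lengths below are illustrative assumptions, and the helper names are hypothetical.

```python
def soft_overlong_penalty(length, max_len=20480, cache_len=4096):
    """Length penalty: 0 below the soft threshold, linear down to -1 at max_len."""
    soft_start = max_len - cache_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / cache_len
    return -1.0


def sample_level_weights(lengths):
    """Per-token weight when each response is averaged first, then responses are averaged."""
    return [1.0 / (len(lengths) * n) for n in lengths]


def token_level_weights(lengths):
    """Per-token weight when all tokens in the batch are averaged together."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]


def response_shares(lengths, per_token_weights):
    """Fraction of the total loss weight that each response receives."""
    return [n * w for n, w in zip(lengths, per_token_weights)]


lengths = [2000, 18000]  # one short response, one inside the soft-penalty zone
print("penalties:", [soft_overlong_penalty(n) for n in lengths])
print("sample-level shares:", response_shares(lengths, sample_level_weights(lengths)))
print("token-level shares: ", response_shares(lengths, token_level_weights(lengths)))
```

Under sample-level averaging both responses get an equal share of the loss regardless of length, while under token-level averaging the long, penalized response dominates. That is the mechanism by which the two techniques reinforce each other.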
Limitations and Boundary Conditions
The paper itself acknowledges several limitations, and there are additional ones worth flagging:
1. No formal ablation of KL removal. Removing the KL penalty is presented as a straightforward design choice (“not necessary for long-CoT models”), but it is never ablated independently. An experiment comparing full DAPO against full DAPO + KL penalty would isolate its contribution. Given that KL removal is one of the largest departures from the GRPO objective, its independent effect should be quantified.
2. Hyperparameter sensitivity: The authors acknowledge that “Even seemingly minor changes in initial conditions, such as variations in data and hyperparameters, can amplify through iterative reinforcement learning processes, yielding substantial deviations in outcomes.” RL training for LLMs is notoriously sensitive: different seeds can produce qualitatively different training dynamics.
3. Reward-validation divergence: At later training stages, training reward and validation accuracy become weakly correlated. This makes it unclear when to stop training. Researchers must monitor validation accuracy directly (expensive, as AIME evaluation with avg@32 requires 32× the inference compute).
4. Math-only: All experiments use verifiable mathematical tasks. The techniques are described as general, but no evidence is provided for other domains.
5. Scale requirement: 128 H20 GPUs is inaccessible to most academic groups. The practical accessibility of the technique is limited despite open-sourcing.
Critical Assessment: Weaknesses and Improvements
Weaknesses and Flaws
W1: The 8-point dynamic sampling jump lacks mechanistic explanation. The ablation shows that adding dynamic sampling as the final step moves accuracy from 42% to 50% (+8 percentage points). This is a dramatic jump for a single technique that supposedly only addresses zero-gradient batches. The paper does not analyze what happens mechanistically — does dynamic sampling enable the other techniques to work better in combination? Is this driven by a specific training phase where zero-gradient batches would otherwise dominate? The interaction effect is large and unexplained.
W2: Narrow benchmark — 30 AIME problems with high variance. AIME 2024 has 30 problems. With avg@32, the total evaluation pool is 30 × 32 = 960 trials. A 50% accuracy means ~480 correct trials. A 47% accuracy means ~451 correct trials. The difference (29 trials out of 960) is statistically marginal. The paper reports no confidence intervals, no statistical tests, no variance estimates across multiple evaluation runs. The 3-percentage-point gap over DeepSeek-R1-Zero may not be statistically significant.
W3: No generalization to non-math domains. The paper title is “An Open-Source LLM Reinforcement Learning System at Scale”, suggesting broad applicability. All results are on math. Clip-Higher, token-level loss, and overlong reward shaping are plausibly general, but dynamic sampling requires a reliable is_equivalent check, which is far harder to implement for code, science questions, or open-ended tasks. The paper never discusses this limitation.
W4: ε_high = 0.28 has no theoretical or ablative justification. The choice of 0.28 is presented as a fixed hyperparameter without any ablation showing why 0.28 is better than 0.25 or 0.30. The optimal value of ε_high likely depends on the model, dataset, and training phase. Presenting a single value without ablation is insufficient for practitioners who want to reproduce the technique on different setups.
W5: Compute overhead of dynamic sampling is not quantified. Dynamic sampling generates more responses per effective training step. The paper claims this “does not significantly affect” total training time, citing that generation is bottlenecked by long-tail samples. But no numbers are provided: how many extra prompts need to be sampled on average? What is the total response generation count per step? Without these numbers, the efficiency comparison with DeepSeek-R1-Zero (“50% fewer steps”) is incomplete — fewer steps does not mean fewer GPU-hours if each step costs more.
W6: Comparison baseline set is narrow. The paper was submitted in March 2025. Contemporary reasoning models (Kimi K1.5, Gemini 2.0 Flash Thinking, QwQ-32B) were publicly available. The paper compares only against naive GRPO and DeepSeek-R1-Zero-Qwen-32B. Without comparison to other RL training approaches that existed at the time (REINFORCE++, RLOO), it is hard to assess whether DAPO’s specific design choices matter or whether any reasonably engineered RL recipe would perform similarly.
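The arithmetic behind W2 can be checked with a back-of-the-envelope interval. The sketch below treats all 960 trials as independent Bernoulli draws, which is an optimistic assumption: the 32 samples for one problem are strongly correlated, so the true uncertainty is larger than what this calculation shows. The function name is illustrative.

```python
import math


def naive_binomial_interval(accuracy, n_trials, z=1.96):
    """Normal-approximation 95% interval, assuming independent trials."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_trials)
    return accuracy - z * se, accuracy + z * se


n_trials = 30 * 32  # 30 AIME problems, 32 samples each
dapo_ci = naive_binomial_interval(0.50, n_trials)
r1_zero_ci = naive_binomial_interval(0.47, n_trials)
intervals_overlap = dapo_ci[0] <= r1_zero_ci[1]

print(f"DAPO     50%: [{dapo_ci[0]:.3f}, {dapo_ci[1]:.3f}]")
print(f"R1-Zero  47%: [{r1_zero_ci[0]:.3f}, {r1_zero_ci[1]:.3f}]")
print("intervals overlap:", intervals_overlap)
```

Even under the independence assumption, each interval is roughly ±3 percentage points wide, about the size of the reported gap, which is why the comparison needs explicit variance estimates.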
Limitations the Authors Understate or Omit
L1: The reference policy removal may harm alignment in non-reasoning domains. Removing the KL penalty allows the policy to drift arbitrarily far from the pretrained model. For math reasoning, this is beneficial (the base model has weak math ability). For other domains — instruction-following, safety-relevant tasks, conversational tasks — an unanchored policy may develop harmful or inconsistent behaviors. The paper presents KL removal as universally correct for “long-CoT reasoning models” without discussing when the KL term is actually important.
L2: The integer-answer transformation limits applicability. DAPO-Math-17K uses integer answers to simplify reward computation. But many important mathematical reasoning tasks involve non-integer answers (probabilities, continued fractions, geometric objects). The transformation process (LLM rewrites the problem to have integer answers) may introduce subtle errors, select for simpler problem types, or bias the model toward integer-arithmetic reasoning styles. This is not analyzed.
L3: Dynamic sampling scalability problem at high performance. As the model improves, more prompts produce “all-correct” batches. The filter acceptance rate drops. Eventually, dynamic sampling may need to generate 10× or 100× more prompts to fill a batch — making training impractically expensive at frontier performance levels. This is a fundamental scalability concern for the technique, and the paper does not address it.
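The scalability concern in L3 can be quantified under a simple model. The sketch below assumes each of the G responses to a prompt is independently correct with the same probability p, so a group is discarded only when all G responses agree. Real prompts differ in difficulty, so this shows a trend rather than a prediction; the function names are hypothetical.

```python
def pass_rate(p_correct, group_size):
    """Probability that a group has mixed outcomes (not all right, not all wrong)."""
    return 1.0 - p_correct ** group_size - (1.0 - p_correct) ** group_size


def expected_prompts_per_kept_group(p_correct, group_size):
    """Average number of prompts sampled to obtain one group that passes the filter."""
    rate = pass_rate(p_correct, group_size)
    return float("inf") if rate <= 0.0 else 1.0 / rate


for group_size in (4, 16):
    for p in (0.5, 0.9, 0.99, 0.999):
        rate = pass_rate(p, group_size)
        cost = expected_prompts_per_kept_group(p, group_size)
        print(f"G={group_size:2d} p={p:5.3f} pass_rate={rate:.4f} prompts_per_kept={cost:.1f}")
```

In this model the oversampling factor stays modest until per-prompt accuracy gets very close to 1, then grows quickly. How soon that regime is reached in practice depends on the difficulty distribution of the training prompts.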
Concrete Improvement Suggestions
S1: Ablate ε_high over a range (0.20, 0.22, 0.25, 0.28, 0.30, 0.35) with all other techniques fixed. Plot AIME 2024 accuracy vs. ε_high to show the sensitivity landscape and justify the 0.28 choice. This experiment requires only 6 training runs.
S2: Isolate the dynamic sampling interaction effect with a 2×2 ablation. Run: (a) Clip-Higher on/off × (b) dynamic sampling on/off, with all other DAPO techniques present. This would reveal whether the 8-point jump from dynamic sampling is explained by an interaction with Clip-Higher or is independent.
S3: Report confidence intervals on all benchmark numbers. With 30 AIME problems and 32 samples, bootstrap the 95% CI on the accuracy score. If the CI for DAPO (50%) overlaps with DeepSeek-R1-Zero (47%), the main result needs qualification.
S4: Quantify dynamic sampling overhead explicitly. Report: average number of prompts sampled per effective training step (with and without dynamic sampling), total response-tokens generated per step, and actual wall-clock time per step. These numbers are essential for practitioners reproducing the work.
S5: Test on code generation with execution-based rewards (HumanEval, LeetCode). Code execution provides a reliable binary reward signal (analogous to integer-answer math), making DAPO’s techniques directly applicable. A code RL experiment would demonstrate generality beyond math and attract a broader audience.
S6: Evaluate effect on general capabilities (MMLU, HellaSwag). RL training can cause regression in general language understanding. The paper does not report any general capability benchmarks after DAPO training. Including these would allow readers to assess the safety/capability tradeoff of removing the KL penalty.
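The bootstrap suggested in S3 could look like the following sketch. It resamples problems rather than individual trials, so the correlation between the 32 samples of one problem is respected. The per-problem counts are made-up placeholder data (the paper does not report them), and `bootstrap_ci` is a hypothetical helper.

```python
import random


def bootstrap_ci(per_problem_correct, samples_per_problem,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over problems for the avg@k accuracy."""
    rng = random.Random(seed)
    rates = [c / samples_per_problem for c in per_problem_correct]
    n = len(rates)
    means = []
    for _ in range(n_boot):
        resample = [rates[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(rates) / n, lower, upper


# Placeholder data: 30 problems, number of correct responses out of 32 for each.
# 12 always solved, 6 solved half the time, 12 never solved (overall accuracy 50%).
hypothetical_counts = [32] * 12 + [16] * 6 + [0] * 12

mean_acc, ci_low, ci_high = bootstrap_ci(hypothetical_counts, samples_per_problem=32)
print(f"accuracy={mean_acc:.3f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

With only 30 problems, and outcomes that are often all-or-nothing per problem, an interval of this kind is typically much wider than one computed from 960 nominally independent trials.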
Conclusion
DAPO represents a significant, practical, and well-motivated advance in LLM RL training for mathematical reasoning. Its four techniques each address a specific, diagnosable failure mode in naive GRPO, and the empirical results are strong: 50% AIME 2024 accuracy with Qwen2.5-32B, surpassing the 47% of DeepSeek-R1-Zero-Qwen-32B on the same base model, with full open-source release of code and data.
The most conceptually interesting contribution is Clip-Higher: the observation that symmetric clipping suppresses exploration of low-probability tokens, and the simple asymmetric fix that relaxes only the upper bound. This is a precise, mechanistically motivated change backed by direct measurement (mean up-clipped probability) that should generalize to any PPO-based RL training.
Dynamic sampling is practically important and underappreciated — zero-gradient batches are a real and growing problem in any curriculum where the model progressively masters easier examples, and the simple filtering solution is elegant.
The main weaknesses are the unexplained 8-point interaction effect when all techniques combine, the narrow benchmark scope (30 AIME problems), and the missing quantification of dynamic sampling overhead. Future work should extend these techniques to non-math domains, provide theoretical grounding for the asymmetric clip, and address the scalability concern of dynamic sampling at high policy performance levels.
Open Questions for Future Work
- Does Clip-Higher generalize? The entropy collapse problem is not unique to GRPO; it likely affects any clipped policy gradient method for LLMs. Does asymmetric clipping help REINFORCE++ or RLOO as well?
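- What is the minimum group size for dynamic sampling? With G = 4 and 50% base accuracy, roughly 87% of groups are non-trivial (1 − 2 · 0.5^4 = 87.5%, assuming independent responses). With smaller groups noticeably fewer pass the filter: 75% at G = 3 and 50% at G = 2 under the same assumption. Is there a minimum G below which dynamic sampling overhead becomes prohibitive?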
- Can process reward models replace dynamic sampling? With per-step rewards, even “all-correct” responses would have varied step-level advantages. This might eliminate zero-gradient batches without the sampling overhead.
- Compute-optimal training recipe? Is DAPO compute-optimal in FLOPs-per-accuracy-point, or could a simpler algorithm with more compute reach the same result? The 50%-fewer-steps claim is about gradient steps, not total compute.
- What is the interaction between Clip-Higher and the number of gradient updates per rollout (μ)? DAPO performs μ gradient updates per rollout. With the relaxed upper clip bound (ε_high=0.28), does the policy drift further between rollouts, requiring fewer reuse steps to stay within the trust region? The optimal μ for asymmetric clipping may differ from the standard PPO optimum.
Overall, DAPO is a technically sound, practically valuable, and openly shared contribution to the LLM RL training ecosystem. The research community owes its authors credit for choosing transparency over competitive advantage — releasing a complete, working recipe is rare and important in a field where key details are typically withheld.
For practitioners, the takeaway is actionable: if you are running GRPO or PPO for LLM reasoning and hitting entropy collapse or slow convergence, start with Clip-Higher (a two-line change) and overlong reward shaping (a reward engineering change), then add dynamic sampling if your dataset has many trivial prompts. Token-level loss is a stable improvement that should be used by default in any long-CoT RL setting.
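As a rough illustration of why Clip-Higher is such a small change, the sketch below writes the per-token clipped surrogate with separate lower and upper ranges. It is a scalar toy version, not the paper's or verl's code: ε_high = 0.28 is the value discussed in this review, ε_low = 0.2 is assumed as the conventional PPO clip range, and both function names are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped objective with separate lower and upper clip ranges."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def has_gradient(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """True if the unclipped term is active, i.e. the token still receives gradient."""
    if advantage > 0:
        return ratio <= 1.0 + eps_high
    return ratio >= 1.0 - eps_low


# A token whose probability has grown by 25% and whose advantage is positive.
ratio, advantage = 1.25, 1.0
symmetric = has_gradient(ratio, advantage, eps_low=0.2, eps_high=0.2)
clip_higher = has_gradient(ratio, advantage, eps_low=0.2, eps_high=0.28)
print("symmetric clip keeps gradient:  ", symmetric)
print("Clip-Higher keeps gradient:     ", clip_higher)
print("objective under Clip-Higher:    ", clipped_surrogate(ratio, advantage))
```

Relative to a symmetric clip, the only difference is that the upper bound uses its own parameter, so tokens whose probability is rising can keep receiving gradient a little longer while the lower bound is unchanged.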