June 13, 2026 EN #Mixture of Experts #LLM Training #Reinforcement Learning #Distributed Training #RLHF

ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight

Review date: 2026-06-13 Review author: Zhongzhu Zhou Paper reviewed: Harnessing Routing Foresight for Micro-step-level MoE Load Balancing in RL Post-training Paper authors: Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao, Tong Zhao, Xupeng Miao, Jie Jiang, Fangcheng Fu, Bin Cui arXiv: 2606.11867v1, 2026-06-11 Venue/status: arXiv preprint (Peking University / Shanghai Jiao Tong University / Tencent)

Short Answer

ForeMoE identifies and solves a real but overlooked problem: expert load imbalance during reinforcement learning post-training of Mixture-of-Experts (MoE) models operates at a fundamentally different timescale than during pre-training, and all existing load-balancing methods fail to account for this difference.

The root cause is subtle. Pre-training uses large, diverse batches; the routing distribution across experts is unpredictable before each step, so prior systems estimate it from historical statistics. RL post-training, by contrast, focuses on concentrated task domains (mathematics, coding) with a model whose experts are already specialized. The result is that the step-level routing pattern is stable and predictable — but each gradient-accumulation micro-step processes only a handful of samples, destroying the statistical averaging that made step-level statistics reliable. Expert loads fluctuate wildly micro-step to micro-step, yet existing systems have no mechanism to detect or respond to this fine-grained variation.

ForeMoE’s key insight is that RL post-training has a structural property that makes micro-step load balancing tractable: router replay. The rollout stage generates responses and records each token’s expert assignments. Because the recompute and policy update stages must use exactly the same routing decisions (for correctness of importance sampling), the routing for every micro-step in those stages is fully known before they execute. This converts the load balancing problem from a prediction problem into a planning problem.

Building on this, ForeMoE introduces a Four-Stage Planner that decomposes the NP-hard load balancing optimization into tractable sub-problems, and an Expert Transfer Engine that exploits complementary hardware paths (CPU-assisted PCIe for recompute, GPU-direct NVLink for policy update) to overlap expert migrations with computation. On 64 GPUs training Qwen3-30B-A3B, ForeMoE achieves up to 1.45× end-to-end speedup over state-of-the-art RL training systems.

1. Prerequisites

ForeMoE sits at the intersection of MoE architecture, distributed training, and RL post-training pipelines. I cover each area thoroughly below.

1.1 Transformer Architecture and Feed-Forward Networks

The standard Transformer has alternating multi-head self-attention and feed-forward network (FFN) sublayers. For a hidden state $x \in \mathbb{R}^d$ , the FFN is:

\text{FFN}(x) = W_2 \cdot \sigma(W_1 x + b_1) + b_2

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$ , $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$ , and $\sigma$ is an activation function (GELU, SiLU, ReLU). The FFN accounts for roughly two-thirds of total parameter count in large Transformers, and it is applied independently and identically to every token position. For models with billions of parameters, this is computationally expensive — but note that all tokens undergo the same computation regardless of content.

1.2 Mixture-of-Experts (MoE) Architecture

The MoE architecture (Shazeer et al., 2017; Lepikhin et al., 2021 — GShard) replaces the monolithic dense FFN with $E$ parallel expert networks $\{f_1, \ldots, f_E\}$ and a lightweight router $G$ :

\text{MoE}(x) = \sum_{i \in \text{Top-}K(G(x))} g_i(x) \cdot f_i(x)

Router mechanics: the router is a linear layer producing logits $r = W_{\text{router}} x \in \mathbb{R}^E$ . The gating weights $g(x)$ are computed by applying softmax to the top- $K$ logits:

g_i(x) = \frac{\exp(r_i)}{\sum_{j \in \text{Top-}K} \exp(r_j)}, \quad i \in \text{Top-}K(r)

For tokens not assigned to expert $i$ (i.e., $i \notin \text{Top-}K$ ), $g_i(x) = 0$ and the expert $f_i$ is not executed. With $K = 2$ and $E = 64$ , only 2 out of 64 experts activate per token — compute is roughly $2/64 = 3.1\%$ of dense equivalent per layer.

Token capacity constraint: in hardware, each expert can process at most $C_{\text{cap}} = (B \cdot K) / E \cdot \text{capacity\_factor}$ tokens per micro-step (where $B$ is batch size). Tokens assigned to overloaded experts beyond capacity are dropped or re-routed. This capacity constraint interacts with load imbalance: popular experts hit capacity first, forcing token drops that degrade model quality.

Modern MoE examples:

DeepSeek-V3/R1: 671B total, 37B active per token; 256 experts per MoE layer, top-8 routing
Qwen3-30B-A3B: 30B total, ~3B active; 64 experts per layer, top-2 routing (used in ForeMoE experiments)
Kimi K2.5: MoE model from Moonshot AI, used in production RL post-training

1.3 Expert Parallelism (EP)

When training on $R$ devices, Expert Parallelism partitions the $E$ experts across ranks so rank $r$ hosts experts $\{f_{r \cdot E/R}, \ldots, f_{(r+1) \cdot E/R - 1}\}$ . Processing a MoE layer requires two All-to-All collective communications:

Dispatch: each device sends tokens to the device hosting their assigned experts. Combine: each device returns the expert outputs to the original token-owner devices.

Token flow through MoE with Expert Parallelism (E=4, R=2, K=1):

  Rank 0 tokens: [T0→E2, T1→E0, T2→E3, T3→E0]
  Rank 1 tokens: [T4→E1, T5→E3, T6→E0, T7→E2]

  Dispatch All-to-All:
    Rank 0 → Rank 0: T1, T3 (assigned to E0 on Rank 0)
    Rank 0 → Rank 1: T0, T2 (assigned to E2,E3 on Rank 1)
    Rank 1 → Rank 0: T6    (assigned to E0 on Rank 0)
    Rank 1 → Rank 1: T4,T5,T7 (assigned to E1,E3,E2 on Rank 1)

  Expert computation (parallel on both ranks)
  
  Combine All-to-All: reverse of dispatch

The All-to-All latency is $\propto \max_r |\text{tokens sent to rank } r|$ , which is dominated by the most overloaded rank. If one rank receives 3× the average token count, the entire All-to-All is delayed by 3×.

1.4 Expert Load Imbalance: Formal Definition

Define the load of rank $r$ during micro-step $m$ as:

L_r^{(m)} = \sum_{t \in \text{micro-step } m} \sum_{k=1}^{K} \mathbf{1}[\text{token } t \text{ routes to expert on rank } r \text{ at position } k]

The ideal load per rank is:

\bar{L}^{(m)} = \frac{B_m \cdot K}{R}

where $B_m$ is the number of tokens in micro-step $m$ . The imbalance ratio $\rho^{(m)} = \max_r L_r^{(m)} / \bar{L}^{(m)}$ measures how far reality departs from ideal balance; $\rho = 1$ is perfect, $\rho = 2$ means the bottleneck rank processes twice the average.

The makespan of a micro-step is proportional to $\rho^{(m)}$ (in the absence of load balancing). A 40% load imbalance ( $\rho = 1.4$ ) means the All-to-All takes 40% longer than necessary, blocking all ranks until the overloaded one finishes.

1.5 Pre-training Load Balancing: Existing Approaches

Prior systems for pre-training load imbalance fall into two categories:

Expert relocation: move expert $e$ from rank $r_1$ to rank $r_2$ (copy weights over interconnect, update routing table). Reduces load on $r_1$ , increases load on $r_2$ .

Expert replication: create an additional copy of expert $e$ on rank $r'$ . Tokens assigned to $e$ can now be processed by either the original or the replica, distributing load across both ranks. Increases memory usage but can reduce the peak load on the original rank.

Both mechanisms require predicting upcoming load before a step executes. During pre-training, routing decisions are unknown until execution begins, so systems estimate future loads from an exponential moving average (EMA) of past steps:

\hat{c}_e^{(t)} = \alpha \cdot c_e^{(t-1)} + (1 - \alpha) \cdot \hat{c}_e^{(t-2)}

This works in pre-training because the routing distribution is approximately stationary over large diverse batches.

1.6 RL Post-training Pipeline

Reinforcement learning post-training fine-tunes a pre-trained model to optimize a scalar reward $r(x, y)$ where $x$ is a prompt and $y$ is the model’s response. The dominant algorithms (PPO, GRPO, DAPO) all share a three-stage pipeline per training step:

Stage 1 — Rollout: the current policy $\pi_\theta$ generates responses for a batch of $N$ prompts through autoregressive decoding. Crucially, the routing decisions for every token at every layer are recorded:

\mathcal{R} = \{(t, \ell, e_{t,\ell,1}, e_{t,\ell,2}, g_{t,\ell,1}, g_{t,\ell,2}) : t \in [N \cdot L_\text{resp}], \ell \in [n_\text{layers}]\}

where $e_{t,\ell,k}$ is the $k$ -th expert assigned to token $t$ at layer $\ell$ and $g_{t,\ell,k}$ the corresponding gating weight. This routing table $\mathcal{R}$ completely determines future expert loads.

Stage 2 — Recompute: the training framework recomputes log-probabilities $\log \pi_\theta(y|x)$ via a forward pass. This recomputation is necessary because:

The rollout inference engine (vLLM/SGLang) uses a different parallelization strategy than the training framework (Megatron, FSDP)
Gradient computation requires per-token log-probabilities aligned with the training framework’s tensor layout
Importance sampling for GRPO requires an accurate reference log-probability

Due to GPU memory limits, the sequences are processed in $M$ micro-steps of $B/M$ tokens each (where $B$ is total batch size), accumulating contributions to the policy gradient.

Stage 3 — Policy Update: compute gradients of the policy objective (e.g., GRPO):

\mathcal{L}_\text{GRPO}(\theta) = \mathbb{E}_{(x,y) \sim \pi_{\theta_\text{old}}} \left[ \min\left( \frac{\pi_\theta(y|x)}{\pi_{\theta_\text{old}}(y|x)} A(x,y),\; \text{clip}(\ldots) \right) \right]

Again processed in micro-steps for memory efficiency. Gradients are accumulated across micro-steps and applied once at the end.

Router Replay (key constraint): the recompute and policy update stages must use exactly the same token-to-expert routing that was recorded during rollout. This constraint exists because:

The log-probabilities in the recompute stage must correspond to the same computational path as rollout
If routing changed between rollout and recompute, the log-probs would be computed by different experts, invalidating the importance sampling estimate

This “router replay” constraint means the routing table $\mathcal{R}$ is fully fixed before stages 2 and 3 begin — every token knows in advance which expert it will visit in every layer, in every micro-step.

2. The Load Imbalance Problem in RL Post-training

2.1 Step-Level vs Micro-Step-Level Dynamics

ForeMoE’s central empirical observation (validated on Qwen3-30B-A3B) has two components:

Step-level load is stable and skewed. Because RL post-training targets concentrated domains (mathematics or coding), and expert specialization is already established from pre-training, the aggregate routing distribution across a full step is remarkably stable. Mathematically, if $p_e$ is the fraction of tokens that expert $e$ receives at step $t$ , then $p_e^{(t)} \approx p_e^{(t-1)}$ across most training steps. However, the distribution $\{p_e\}$ is not uniform: popular math/code experts receive 2–5× more tokens than average.

Micro-step-level load is highly variable. Within a step, sequences are split into $M$ micro-steps. Each micro-step contains $B_m = B/M$ tokens. With small $B_m$ (say, 512 tokens across 64 experts), the per-expert count follows:

c_e^{(m)} \sim \text{Poisson}(B_m \cdot K \cdot p_e)

The coefficient of variation (CV = std/mean) for a Poisson is $1/\sqrt{B_m \cdot K \cdot p_e}$ . For a popular expert with $p_e = 0.05$ and $B_m = 512$ , $K = 2$ :

\text{CV} = \frac{1}{\sqrt{512 \times 2 \times 0.05}} = \frac{1}{\sqrt{51.2}} \approx 0.14

This 14% CV means the per-expert token count fluctuates by ±14% (1σ) from its mean — but across all $R = 8$ ranks, the maximum imbalance ratio $\rho$ can be substantially higher, especially for the most popular experts.

2.2 Why Pre-training Methods Fail

Three compounding failure modes:

EMA cannot track micro-step variance. EMA estimates $\hat{c}_e^{(t)} \approx \bar{c}_e$ (the step-level mean). It correctly captures the stable, skewed step-level distribution but completely ignores micro-step deviations $\delta_e^{(m)} = c_e^{(m)} - \bar{c}_e$ . The EMA-based load balancing plan is computed against $\bar{c}_e$ , not against the actual micro-step load — so even with perfect relocation, residual micro-step imbalance persists.
No latency budget for planning. Pre-training load balancing triggers every $T$ steps (e.g., $T = 50$ ), giving ample time (~seconds) to solve the optimization. Micro-step balancing must plan and execute in the gap between micro-steps — often tens of milliseconds. The full NP-hard optimization is intractable in this window.
Expert transfer frequency is too high. In pre-training, an expert relocation happens once per $T$ steps; its cost is amortized across many forward passes. Micro-step balancing may require a different placement for every micro-step, meaning expert weights must be transferred at O(1/micro-step time) frequency. Without efficient hardware paths, transfer overhead exceeds the benefit.

Figure 2 illustrates the timing constraints:

sequenceDiagram
    participant Rollout as Rollout (Stage 1)
    participant Planner as ForeMoE Planner
    participant Transfer as Expert Transfer Engine
    participant Recompute as Recompute (Stage 2)
    participant Policy as Policy Update (Stage 3)
    
    Rollout->>Planner: Send routing table R
    Planner->>Planner: Stage 1: BPO (base placement)
    loop For each micro-step i
        Planner->>Planner: Stage 2: MDA (micro-step adj)
        Planner->>Transfer: Send placement plan A[i]
        Transfer->>Recompute: Execute expert transfers
        Recompute->>Planner: Micro-step i done
    end
    loop For each micro-step j
        Planner->>Planner: Stage 2: MDA (policy update)
        Planner->>Transfer: Send placement plan A[j]
        Transfer->>Policy: GPU-direct transfers
        Policy->>Planner: Micro-step j done
    end

Figure 2: ForeMoE’s interaction with the RL post-training pipeline. The routing table from rollout drives all subsequent load balancing decisions. The planner handles both recompute and policy update stages with stage-appropriate transfer strategies.

3. ForeMoE System Architecture

3.1 The Router Replay Opportunity

Router replay is a correctness constraint in RL frameworks like VERL and AceMo: recompute and policy update must replay the exact expert assignments from rollout. ForeMoE reframes this constraint as an opportunity:

Given the routing table $\mathcal{R}$ (from rollout), we know exactly how many tokens each expert will receive in each micro-step $m$ of both recompute and policy update, before any of those micro-steps execute.

This is the difference between prediction and planning. Pre-training load balancing must predict an uncertain future; ForeMoE plans against a certain future. This makes it possible to compute a provably optimal (within planner approximation) placement for every micro-step before the micro-step runs.

3.2 Problem Formulation

Notation:

$E$ experts indexed $1..E$ ; $R$ ranks indexed $1..R$ ; $M$ micro-steps indexed $1..M$
Expert $e$ memory cost $m_e$ (in GPU memory units); per-rank memory budget $C$
From routing table: $c_{e,m}$ = number of tokens assigned to expert $e$ in micro-step $m$

Decision variables:

$x_{e,r} \in \{0,1\}$ : expert $e$ is relocated to rank $r$ (primary location)
$y_{e,r} \in \{0,1\}$ : expert $e$ is replicated on rank $r$ (additional copy)
$z_{e,r,m} \in \{0,1\}$ : in micro-step $m$ , rank $r$ handles the tokens for expert $e$

Objective — minimize makespan:

\min \quad \max_{r \in [R],\; m \in [M]} \sum_{e=1}^{E} z_{e,r,m} \cdot c_{e,m}

Constraints:

Memory:

\sum_{e=1}^{E} (x_{e,r} + y_{e,r}) \cdot m_e \leq C \quad \forall r

Coverage (each expert on at least one rank):

\sum_{r=1}^{R} x_{e,r} \geq 1 \quad \forall e

Routing feasibility (can only send tokens to ranks hosting the expert):

z_{e,r,m} \leq x_{e,r} + y_{e,r} \quad \forall e, r, m

Token assignment completeness (all tokens for expert $e$ in micro-step $m$ must be processed):

\sum_{r=1}^{R} z_{e,r,m} \cdot c_{e,m} = c_{e,m} \quad \forall e, m

NP-hardness: proven by reduction from bin packing (see §5). In practice, with $E = 64$ experts across $R = 8$ ranks, the placement space has $8^{64}$ configurations even without replication.

3.3 The Four-Stage Planner

The planner exploits two structural properties to decompose the NP-hard problem:

Property A (step stability): $\bar{c}_e = \frac{1}{M} \sum_{m=1}^{M} c_{e,m}$ is approximately the same as $\bar{c}_e^{(t-1)}$ (previous step’s mean). So a base placement computed against $\bar{c}_e$ remains valid across steps.

Property B (deviation centering): micro-step deviations $\delta_{e,m} = c_{e,m} - \bar{c}_e$ are zero-mean by construction $(\sum_m \delta_{e,m} = 0)$ , and small in magnitude relative to $\bar{c}_e$ .

Decomposition: split the placement into a stable base component (solved once per step) and a micro-step adjustment component (solved per micro-step, with restricted search space).

Stage 1 — Base Placement Optimizer (BPO)

Algorithm 1: Base Placement Optimizer (BPO)
─────────────────────────────────────────────────────────────────────
Input:  step-level mean expert token counts {c̄_e} (from routing table),
        previous placement P_prev (warm start),
        memory budget C per rank, replication budget delta_C
Output: base placement P* = {(x_{e,r}, y_{e,r})} for all e, r

1.  P = P_prev  (warm start from previous step)
2.  Compute per-rank mean load: load(r, P) = sum_{e: x_{e,r}=1} c̄_e / R
3.  ideal = total_tokens * K / R   (tokens per rank at perfect balance)

4.  RELOCATION PHASE:
    Sort experts by |load(host(e), P) - ideal| descending
    For each expert e in sorted order:
        r_current = host(e, P)
        For each alternative rank r' ≠ r_current (sorted by load asc):
            If memory_ok(r', e, P) and load(r', P) + c̄_e < load(r_current, P):
                Relocate e to r': P = update(P, x_{e,r_current}=0, x_{e,r'}=1)
                Update load(r_current), load(r')
                Break

5.  REPLICATION PHASE:
    Identify overloaded ranks O = {r : load(r,P) > (1+eps) * ideal}
    For each r_o in O, sorted by load(r_o) desc:
        e* = argmax_e c̄_e such that host(e*) = r_o
        For each rank r_u with memory headroom (C - used_mem(r_u) >= m_{e*}):
            If load(r_u) < (1-eps) * ideal:
                Replicate e* on r_u: P = update(P, y_{e*,r_u}=1)
                Load(r_u) += c̄_{e*} / 2  (tokens split between original and replica)
                Load(r_o) -= c̄_{e*} / 2
                Break (one replica per overloaded expert)

6.  Return P*
─────────────────────────────────────────────────────────────────────

Stage 2 — Micro-step Deviation Adjuster (MDA)

Algorithm 2: Micro-step Deviation Adjuster (MDA)
─────────────────────────────────────────────────────────────────────
Input:  base placement P*, micro-step routing sub-table R[m]
        (exact token counts c_{e,m} for this micro-step),
        deviation threshold eps
Output: adjustment plan A[m] = list of (expert, src_rank, dst_rank)

1.  Compute actual per-rank load for this micro-step:
        L_r = sum_{e: on_rank(e,r,P*)} c_{e,m}  for each r

2.  Identify overloaded ranks: O = {r : L_r > (1+eps) * ideal_m}
    Identify underloaded ranks: U = {r : L_r < (1-eps) * ideal_m}
    (ideal_m = B_m * K / R)

3.  If O is empty: A[m] = {} (no adjustment needed); return

4.  For each (r_o, r_u) pair in O × U, sorted by (L_{r_o} - L_{r_u}) desc:
        e* = expert on r_o with highest c_{e*,m} that can fit in r_u memory
        excess = L_{r_o} - ideal_m
        If c_{e*,m} <= excess * 1.5:  // relocation would not over-correct
            Add (e*, r_o, r_u) to A[m]
            Update L_{r_o} -= c_{e*,m}
            Update L_{r_u} += c_{e*,m}
            If r_o no longer overloaded: break

5.  Return A[m]
─────────────────────────────────────────────────────────────────────

Stage 3 — Token Reassignment Validator: verifies that the proposed adjustment plan A[m] is consistent with the routing table. Specifically, tokens assigned to a relocated expert must be dispatched to the expert’s new location. This stage checks that the adjusted placement does not violate the router replay constraint.

Stage 4 — Load Verification: computes the final imbalance ratio $\rho$ under the adjusted placement. If $\rho$ exceeds a target threshold, the planner can attempt additional adjustments or fall back to the base placement (accepting suboptimal balance to avoid transfer overhead).

Figure 3 shows the planner data flow:

flowchart LR
    subgraph Rollout
        RTab["Routing Table R\n(all micro-steps)"]
    end
    subgraph Planner
        BPO["Stage 1: BPO\n(once/step)\nc̄_e → P*"]
        MDA["Stage 2: MDA\n(per micro-step)\nR[m] + P* → A[m]"]
        TV["Stage 3: Token\nValidator\ncheck consistency"]
        LV["Stage 4: Load\nVerifier\ncheck ρ"]
    end
    subgraph Transfer
        TX["Expert Transfer\nEngine\nexecute A[m]"]
    end
    subgraph Execute
        MS["Micro-step m\n(MoE forward/backward)"]
    end
    
    RTab --> BPO
    RTab --> MDA
    BPO --> MDA
    MDA --> TV
    TV --> LV
    LV --> TX
    TX --> MS
    MS --> MDA

Figure 3: ForeMoE Four-Stage Planner. The routing table from rollout feeds both the BPO (once per step) and the per-micro-step MDA. The output adjustment plan is validated and verified before being sent to the Expert Transfer Engine.

3.4 Expert Transfer Engine

The transfer engine must execute expert weight migrations fast enough that they complete before the micro-step’s compute phase begins. The key innovation is using different hardware paths for different RL stages:

graph TB
    subgraph RecomputeStage["Recompute Stage — CPU-Assisted Path"]
        GPU0R["GPU 0\n(source expert)"]
        CPU0["CPU DRAM 0"]
        CPU1["CPU DRAM 1"]
        GPU1R["GPU 1\n(dest rank)"]
        GPU0R -- "D2H DMA\n(PCIe, ~32 GB/s)" --> CPU0
        CPU0 -- "CPU network\n(via RDMA NIC)" --> CPU1
        CPU1 -- "H2D DMA\n(PCIe, ~32 GB/s)" --> GPU1R
    end
    subgraph PolicyUpdateStage["Policy Update Stage — GPU-Direct Path"]
        GPU0P["GPU 0\n(source expert)"]
        GPU1P["GPU 1\n(dest rank)"]
        GPU0P -- "NVLink 4\n(~600 GB/s)" --> GPU1P
    end

Figure 4: Expert Transfer Engine uses complementary hardware paths. Recompute uses CPU-assisted PCIe (GPU memory → CPU DRAM → remote CPU DRAM → GPU memory), overlapping with GPU compute. Policy update uses GPU-direct NVLink (direct GPU-to-GPU), providing 10–20× higher bandwidth for time-critical gradient synchronization.

Why CPU-assisted for recompute? During recompute, the GPU is executing MoE forward pass computation. CPU-assisted DMA transfers proceed in parallel through the PCIe bus without interrupting GPU execution. The expert weight size for Qwen3-30B-A3B is approximately $d_{\text{model}}^2 / E_{\text{layer}} \approx 7168^2 / 64 \approx 800 \text{ MB} / 64 \approx 12.5 \text{ MB}$ per expert. At 32 GB/s PCIe, transferring one expert takes ~0.4ms — well within the 80–120ms micro-step compute window.

Why GPU-direct for policy update? Policy update involves gradient accumulation with frequent synchronization points. CPU-assisted transfers during gradient computation would introduce PCIe round-trips that add latency to the critical path. GPU-direct NVLink transfers (10–20× faster than PCIe) complete in ~20µs for a 12.5 MB expert, fitting within inter-operation gaps.

Overlap pipeline: the transfer engine asynchronously prepares the placement for micro-step $m+1$ while micro-step $m$ is executing:

Timeline (two consecutive micro-steps):

  t=0:     Micro-step m begins compute
  t=0:     MDA runs for micro-step m+1 (async, on CPU)
  t=5ms:   Transfer begins for experts needed in micro-step m+1
  t=~85ms: Micro-step m compute completes
  t=~85ms: Transfer for micro-step m+1 done (started at t=5ms, ~80ms budget)
  t=85ms:  Micro-step m+1 begins with correct placement

If the transfer fits within the compute window (the common case), the effective overhead of load balancing is zero — the transfers run for free in the background.

4. Mathematical Analysis: Load Imbalance Model

4.1 Expected Makespan Without Load Balancing

For each micro-step $m$ , the token count for expert $e$ is:

c_{e,m} \sim \text{Multinomial}(B_m \cdot K,\; p_1, \ldots, p_E)

where $p_e$ is the long-term routing probability for expert $e$ . The load on rank $r$ is:

L_r^{(m)} = \sum_{e \in \text{experts on rank } r} c_{e,m}

Under independent Poisson approximation (valid when $B_m \cdot K \gg E$ ):

L_r^{(m)} \approx \text{Poisson}\!\left(B_m \cdot K \cdot \sum_{e \in r} p_e\right)

Define $P_r = \sum_{e \in r} p_e$ (the “routing mass” assigned to rank $r$ ). The expected makespan per micro-step is:

\mathbb{E}[\text{makespan}] \propto \mathbb{E}\!\left[\max_{r} L_r^{(m)}\right]

For perfectly balanced routing ( $P_r = 1/R$ for all $r$ ) with $\lambda = B_m K / R$ :

\mathbb{E}\!\left[\max_r \text{Poisson}(\lambda)\right] \approx \lambda + \sqrt{2\lambda \ln R}

This is the baseline. With imbalanced routing ( $P_r$ varies), the dominant rank’s mean shifts to $B_m K \max_r P_r$ , worsening the makespan.

4.2 Benefit of Micro-step Load Balancing

After MDA adjustment, the effective load variance is reduced. The residual imbalance comes from experts that could not be relocated (due to memory constraints) or replicated (insufficient headroom). If ForeMoE can relocate/replicate experts to achieve near-uniform load, the makespan approaches:

\mathbb{E}[\text{makespan}_\text{ForeMoE}] \approx \lambda + \sqrt{2\lambda \ln R} + \text{overhead}

where overhead is the expert transfer time. The 1.45× speedup in practice corresponds to reducing the effective $\max_r P_r$ from ~0.22 (8% imbalance on a 8-rank setup with popular experts) to ~0.135 (≈1/R = 0.125).

5. NP-Hardness: Formal Reduction

Theorem (ForeMoE §7): The load balancing decision problem — “does there exist a placement with imbalance ratio $\leq \rho^*$ ?” — is NP-complete.

Proof sketch (reduction from 3-Dimensional Matching / Bin Packing):

Given a bin packing instance with items $\{s_1, \ldots, s_n\}$ (positive integers) and bin capacity $W$ , construct:

$E = n$ experts with memory cost $m_e = s_e$
$R$ ranks with budget $C = W$
Single micro-step ( $M = 1$ ) with token count $c_e = s_e$ for expert $e$
Target imbalance ratio $\rho^* = 1$ (perfect balance)

A placement with $\rho^* = 1$ exists iff:

\sum_{e \in r} c_e = W \cdot \frac{\sum_e c_e}{R \cdot W} \quad \text{(equal load)} \quad\text{AND}\quad \sum_{e \in r} m_e \leq W \quad \text{(memory fits)}

This is exactly a balanced bin packing problem, which is NP-complete. Since the decision problem reduces to NP-complete, the optimization is NP-hard. $\square$

Practical implication: for the Qwen3-30B-A3B setup (64 experts, 8 ranks), a brute-force optimal solution would require evaluating $\binom{64}{8}^7 \approx 10^{70}$ placements per micro-step. The hierarchical BPO+MDA decomposition reduces this to $O(E \cdot R)$ greedy steps for BPO plus $O(E)$ candidate experts for MDA.

6. Experiments

6.1 Experimental Setup

Cluster: 64 NVIDIA GPUs (H100-class, NVLink 4), Megatron-LM training framework, vLLM rollout engine, GRPO-based RL objective

Model: Qwen3-30B-A3B (30B parameters, ~3B active; 64 experts per MoE layer; top-2 routing; 8 expert parallelism ranks)

Micro-step configuration: global batch = 256 sequences; $M = 8$ micro-steps per step; $B_m = 32$ sequences × ~512 tokens each

RL datasets:

DAPO-Math-17k: 17,000 math problems (concentrated domain → high expert specialization)
CodeForces: competitive programming problems (concentrated domain → different specialization pattern)

Baselines:

No LB: vanilla MoE training without any load balancing
EMA LB: step-level EMA prediction with greedy relocation (representative SOTA pre-training LB)
Oracle: optimal placement with foreknowledge of micro-step loads but no latency constraint
ForeMoE: proposed system (BPO + MDA + Expert Transfer Engine)

6.2 Main Speedup Results

Method	DAPO-Math	CodeForces	GPU Util.	Peak Imbal.
No LB	1.00×	1.00×	62%	2.3×
EMA LB	1.08×	1.06×	67%	1.88×
ForeMoE	1.45×	1.38×	88%	1.21×
Oracle	1.52×	1.46×	92%	1.08×

ForeMoE reaches 95% of oracle speedup on DAPO-Math (1.45 vs 1.52), validating the hierarchical planner’s near-optimality. The EMA baseline’s 8% improvement confirms that even step-level corrections help, but they miss the bulk of the micro-step imbalance.

End-to-End Speedup on DAPO-Math-17k (64 GPUs, relative to No LB)

  1.6× ┤
  1.5× ┤                              ██████ 1.52×
  1.4× ┤               ██████ 1.45×
  1.3× ┤
  1.2× ┤
  1.1× ┤  ██████ 1.08×
  1.0× ┤  (baseline)
       └──────────────────────────────────────────
          No LB      EMA LB    ForeMoE    Oracle

  ForeMoE / Oracle gap: 0.07× (95% efficiency)
  ForeMoE / EMA gap:    0.37× (+34% improvement)

Figure 5: End-to-end speedup comparison on DAPO-Math-17k with 64 GPUs. ForeMoE achieves 1.45× over unoptimized training, reaching 95% of oracle. The gap to EMA LB is 37 percentage points, showing how much micro-step imbalance was missed.

6.3 Load Imbalance Distribution

The paper provides a histogram of imbalance ratios $\rho^{(m)}$ across micro-steps:

Imbalance Range	No LB (% of micro-steps)	EMA LB	ForeMoE
ρ < 1.1	3%	9%	61%
1.1 ≤ ρ < 1.3	12%	21%	32%
1.3 ≤ ρ < 1.5	18%	24%	6%
1.5 ≤ ρ < 2.0	34%	31%	1%
ρ ≥ 2.0	33%	15%	0%

ForeMoE concentrates 93% of micro-steps below ρ = 1.3, compared to 15% for no LB. High-imbalance events (ρ ≥ 2.0) are eliminated entirely.

6.4 Component Ablation

Ablation	DAPO-Math Speedup	CodeForces Speedup
BPO only (no MDA)	1.21×	1.17×
MDA only (no BPO)	1.18×	1.15×
BPO + MDA	1.45×	1.38×
CPU path only	1.31×	1.28×
GPU-direct only	1.28×	1.24×
Full ForeMoE	1.45×	1.38×

The BPO and MDA are complementary: BPO reduces the step-level imbalance that MDA then fine-tunes, while MDA alone without a good base placement wastes its adjustment budget on large corrections. Similarly, the two transfer paths are complementary: CPU-assisted excels during recompute (parallel with compute), GPU-direct excels during policy update (lower latency).

6.5 Planner Latency Analysis

BPO execution time: ~50ms (runs once per step, ~1000ms step time → 5% overhead)
MDA execution time: ~5ms  (runs per micro-step, ~100ms compute → 5% overhead)
Expert transfer time (CPU-assisted, 12.5 MB expert, PCIe): ~0.4ms
Expert transfer time (GPU-direct, 12.5 MB expert, NVLink): ~0.02ms

Effective overhead: ~0% (fully overlapped with computation in common case)
Worst case (many experts need transfer, small micro-step): ~8% overhead

The planner is implemented in C++ with CUDA streams for asynchronous GPU operations. The BPO runs on CPU in a background thread, completing before the next training step begins.

Pre-training MoE load balancing: FasterMoE (He et al., 2022) introduces expert shadowing (asynchronous replication); LAER-MoE (Zhai et al., 2023) uses latency-aware expert replication; FlexMoE (Nie et al., 2023) applies flexible expert placement; EPLB (Skiadopoulos et al., 2026) uses EMA-based load prediction with bin-packing-style assignment. All of these operate at step granularity using historical statistics — they are not designed for micro-step imbalance.

RL post-training systems: VERL (Zheng et al., 2025) and AceMo (Ma et al., 2025) implement router replay for correctness but do not exploit it for load balancing. OpenRLHF (Hu et al., 2024) provides a general RL training framework without MoE-specific optimizations. ForeMoE is designed as an add-on module for these systems.

Expert parallelism communication: GShard (Lepikhin et al., 2021), Switch Transformer (Fedus et al., 2022), and DeepSpeed-MoE (Rajbhandari et al., 2022) study the communication patterns of MoE training but focus on correctness and general efficiency, not the specific RL post-training dynamics.

System-level position of ForeMoE: it is the first work to (a) characterize the micro-step-level load imbalance in MoE RL post-training, (b) formally prove the load balancing problem is NP-hard, and (c) propose a tractable hierarchical algorithm exploiting router replay as a foresight signal.

8. Boundary Conditions

When ForeMoE helps most:

Small micro-step batch size (high per-expert variance)
Concentrated RL task domains (strong expert specialization → skewed step-level mean)
Many experts and ranks (large reconfiguration flexibility)
Memory headroom for replication
Fast interconnect (NVLink for GPU-direct, PCIe 4.0+ for CPU-assisted)

When ForeMoE helps less:

Large micro-step batch size (statistical averaging reduces variance)
Diverse RL task domains (routing roughly uniform → mild imbalance)
Very tight GPU memory (limits replication budget)
No router replay (some RL frameworks do not enforce routing consistency between stages)

Hard requirements:

The RL training framework must implement router replay
The system must record the full routing table during rollout (memory cost: $\approx N \cdot L_\text{resp} \cdot n_\text{layers} \cdot 2 \cdot 4$ bytes for top-2 routing = ~200 MB for a 30B model with 512-token responses)

9. Critical Assessment: Weaknesses & Improvements

9.1 Weaknesses and Flaws

W1 — Single model scale only. All experiments use Qwen3-30B-A3B. MoE dynamics change substantially at larger scale: DeepSeek-V3 uses 256 experts per MoE layer (vs 64), a higher EP degree, and 671B total parameters. Whether the 1.45× speedup, planner overhead, and transfer engine’s overlap strategy hold for these larger configurations is completely unknown. The micro-step batch size $B_m$ relative to $E$ is different at scale, potentially changing the variance characteristics.

W2 — No training quality metrics reported. The paper reports only wall-clock speedup. It never reports reward curves, downstream task accuracy (e.g., MATH500, HumanEval, AIME benchmarks), or convergence speed in samples. Expert relocations and replications could in principle introduce subtle numerical differences in gradient computation. Even if the router replay constraint is satisfied, floating-point rounding during weight transfer might cause micro-differences. The paper should demonstrate that ForeMoE converges to the same policy quality as baseline training.

W3 — The 5% oracle gap is unexplained. ForeMoE achieves 95% of oracle (1.45× vs 1.52×). The paper does not analyze what causes the remaining gap. Possible explanations include: (a) MDA sub-optimality (greedy vs optimal), (b) memory-constrained replication limiting the best achievable balance, (c) residual transfer latency not fully overlapped, (d) per-layer vs aggregate load balancing mismatch. Understanding this gap would guide future improvements.

W4 — Memory overhead of replication unreported. Expert replication increases GPU memory footprint. For Qwen3-30B-A3B with 32 MoE layers and ~12.5 MB per expert, replicating even 2 experts per rank adds ~25 MB per rank — small in absolute terms but non-trivial when training memory is 90%+ full. The paper does not report the replication overhead or provide guidance on setting the replication budget.

W5 — Scalability of BPO warm-starting. The BPO uses the previous step’s placement as a warm start. If the routing distribution shifts (e.g., the model is rapidly learning and changing its expert preferences), the warm start degrades from an accelerator to a liability. For early RL training when the policy is changing rapidly, the warm start may be stale more often than beneficial. No analysis of warm-start quality over training time is provided.

9.2 Limitations the Authors Understate

L1 — Router replay scope. The paper presents router replay as an inherent property of RL post-training. However, it is a property of specific RL implementations (VERL, AceMo) that choose to preserve routing decisions for importance sampling correction. Some RL variants — including some implementations of PPO with a separate reference model — do not require routing replay and therefore do not expose ForeMoE’s key opportunity. The paper does not clearly scope this.

L2 — Only concentrated-domain datasets tested. DAPO-Math-17k and CodeForces are both “concentrated domain” RL tasks where expert specialization is high and step-level load is skewed. The paper’s motivation explicitly says these conditions cause micro-step imbalance. But for general-purpose RL post-training (e.g., aligning a model to diverse conversational tasks via RLHF), the routing distribution may be much more uniform, reducing both the problem severity and ForeMoE’s benefit. No diverse-domain experiment is included.

L3 — No multi-node results. The 64 GPUs appear to be intra-node or single-pod (NVLink-connected). In real production training (hundreds of nodes), GPU-direct NVLink is unavailable for inter-node transfers, and CPU-assisted PCIe must handle all cross-node migrations. PCIe’s ~32 GB/s bandwidth may be insufficient to overlap expert transfers within the inter-node micro-step compute window. The system’s behavior in this more common deployment scenario is not characterized.

9.3 Concrete Improvement Suggestions

I1 — Multi-scale evaluation. Run experiments on at least one 70B+ MoE model (e.g., DeepSeek-V2-Lite or a Mixtral-8x22B-class model) to characterize how speedup, planner overhead, and load imbalance scale with model size.

I2 — Training quality validation. Report reward curves and benchmark accuracy (MATH500, HumanEval) for models trained with and without ForeMoE, confirming that speedup does not come at the cost of policy quality.

I3 — Multi-node extension. Implement and evaluate an inter-node transfer path using RDMA (InfiniBand or RoCE), measuring transfer latency and whether overlap remains feasible in multi-node training.

I4 — Diverse-domain RL experiment. Include at least one experiment with a diverse task domain (e.g., a general instruction-following dataset or a mixed math+code dataset) to characterize ForeMoE’s benefit under more uniform routing distributions.

I5 — Memory-accuracy tradeoff study. Provide a Pareto frontier plot of speedup vs memory overhead as the replication budget varies, giving practitioners guidance on deployment configuration.

10. Reproducibility

No public code repository linked. The system is described as “implemented over Megatron-LM/FSDP with VERL-style router replay” but no specific integration is provided.
Key hyperparameters (deviation threshold $\varepsilon$ in MDA, replication budget $\delta_C$ , BPO warm-start step count, EMA decay $\alpha$ for BPO initialization) are mentioned in the text but not consolidated in a reproducibility table.
The expert weight sizes and transfer bandwidth assumptions are specific to the H100 + NVLink 4 configuration. Users on different hardware (A100 + NVLink 3, or multi-node InfiniBand) would need to re-characterize transfer latencies.
The routing table memory overhead (~200 MB for the tested configuration) scales with response length and model depth; long-context RL training (e.g., 8K token responses) would increase this by 16×.

11. Conclusion

ForeMoE makes a clean, well-motivated contribution: it identifies that MoE RL post-training has a fundamentally different load imbalance profile from pre-training (micro-step-level, not step-level), explains why existing methods fail, and proposes a practical system that exploits the structural property of router replay to convert an NP-hard online planning problem into a tractable offline planning problem.

The 1.45× end-to-end speedup is substantial and practically meaningful — equivalent to reducing a 4-hour training run by nearly 45 minutes per episode. The hierarchical Four-Stage Planner is algorithmically principled, the Expert Transfer Engine’s dual-path design is hardware-aware, and the NP-hardness proof grounds the algorithmic choices.

The main gaps are scope (single model, two concentrated-domain datasets, intra-node only, no training quality metrics) and transparency (no code release, underspecified hyperparameters). As MoE RL post-training becomes the norm for frontier model development — DeepSeek-R1, Kimi K2.5, and likely future GRPO-trained MoE models — ForeMoE’s approach to micro-step load balancing will become increasingly important. The ideas here should generalize to multi-node settings and larger model scales, though that work remains to be done.

Appendix A: Key Symbols Reference

Symbol	Meaning
$E$	Total number of experts per MoE layer
$R$	Number of ranks (GPUs) with expert parallelism
$K$	Number of experts selected per token (top- $K$ routing)
$M$	Number of micro-steps per training step
$B$	Total batch size per step (tokens)
$B_m = B/M$	Batch size per micro-step
$c_{e,m}$	Token count assigned to expert $e$ in micro-step $m$ (from routing table $\mathcal{R}$ )
$\bar{c}_e$	Step-level mean token count for expert $e$ : $\frac{1}{M}\sum_m c_{e,m}$
$\delta_{e,m}$	Micro-step deviation: $c_{e,m} - \bar{c}_e$
$L_r^{(m)}$	Load on rank $r$ during micro-step $m$
$\bar{L}^{(m)}$	Ideal (balanced) load per rank: $B_m \cdot K / R$
$\rho^{(m)}$	Imbalance ratio for micro-step $m$ : $\max_r L_r^{(m)} / \bar{L}^{(m)}$
$p_e$	Long-run routing probability for expert $e$
$P_r = \sum_{e \in r} p_e$	Total routing mass assigned to rank $r$
$x_{e,r}$	Relocation variable: expert $e$ is primary on rank $r$
$y_{e,r}$	Replication variable: expert $e$ has a copy on rank $r$
$m_e$	Memory cost of expert $e$
$C$	Per-rank GPU memory budget
$\mathcal{R}$	Routing table from rollout stage
$P^*$	Base placement from BPO
$A[m]$	Micro-step adjustment plan from MDA for micro-step $m$

Appendix B: Comparison with Pre-training Load Balancing Methods

To make the positioning of ForeMoE concrete, here is a structured comparison with key prior methods:

Aspect	FasterMoE	LAER-MoE	EPLB	ForeMoE
Target training phase	Pre-training	Pre-training	Pre-training	RL Post-training
Planning signal	Historical EMA	Historical EMA	Historical EMA	Router replay (exact)
Granularity	Step-level	Step-level	Step-level	Micro-step-level
Algorithm	Expert shadowing	Latency-aware	Bin packing	BPO + MDA (hierarchical)
Hardware paths	PCIe only	PCIe only	PCIe	PCIe + NVLink (stage-specific)
NP-hardness addressed	No	No	Approximation	Yes (hierarchical decomp)
Proven speedup context	Pre-training	Pre-training	Pre-training	RL post-training

The fundamental difference is the planning signal. Pre-training methods must estimate the unknown future load from historical statistics; ForeMoE reads the exact future load from the routing table that rollout has already computed.

Appendix C: ForeMoE Integration with GRPO Training Loop

To make the system concrete, here is a pseudocode-level integration of ForeMoE into a GRPO training step:

Algorithm 3: ForeMoE-augmented GRPO Training Step
─────────────────────────────────────────────────────────────────────
Input:  prompts X = {x_1, ..., x_N}; current policy θ; previous placement P_prev
Output: updated policy θ'; updated placement P_next

// Stage 1: Rollout
(Y, log_probs_old, R) = vLLM_rollout(X, θ)
    // Y = generated responses; R = routing table for all tokens/layers

// Compute rewards
rewards = reward_model(X, Y)
advantages = GRPO_advantage(rewards, log_probs_old)

// ForeMoE: Base Placement (once per step, async on CPU)
mean_loads = aggregate_step_loads(R)  // {c̄_e} from routing table
P* = BPO(mean_loads, P_prev, memory_budget=C)

// Stage 2: Recompute (M micro-steps)
log_probs = []
For m = 1 to M:
    R_m = get_micro_step_routing(R, m)   // exact loads for this micro-step
    A_m = MDA(P*, R_m, eps=0.1)           // micro-step adjustment (async)
    ExpertTransferEngine.execute_cpu_path(A_m)  // CPU-assisted PCIe transfer
    log_probs_m = megatron_forward(Y_m, routing=R_m, placement=P*)
    log_probs.append(log_probs_m)

// Stage 3: Policy Update (M micro-steps)
For m = 1 to M:
    R_m = get_micro_step_routing(R, m)
    A_m = MDA(P*, R_m, eps=0.1)
    ExpertTransferEngine.execute_gpu_direct_path(A_m)  // NVLink transfer
    loss_m = GRPO_loss(log_probs[m], log_probs_old_m, advantages_m, routing=R_m)
    loss_m.backward()

// Apply gradient update
optimizer.step(θ)
θ' = θ
P_next = P*   // base placement becomes warm start for next step

Return θ', P_next
─────────────────────────────────────────────────────────────────────

The critical property: routing=R_m is passed to both megatron_forward and GRPO_loss, enforcing router replay. Expert placement changes in $A_m$ affect only which physical GPU processes each expert, not which expert each token visits — so router replay is preserved exactly.