June 28, 2026 EN #LLM Serving #Mixture of Experts #LLM Inference

Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving

Review date: 2026-06-28
Review author: Zhongzhu Zhou
Paper reviewed: Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch
Paper authors: Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li, Seo Jin Park
arXiv: 2606.26607
Status / Venue: Preprint (2026)

Short Answer

Production MoE LLM serving faces a fundamental dilemma: tensor parallelism (TP) is faster at low batch concurrency, while expert parallelism (EP) wins at high concurrency. Workloads constantly cross this boundary—bursty request spikes, RL rollout tails, day/night traffic swings—yet current serving systems must commit to one parallelism layout at startup. Moebius breaks this lock by treating EP and TP as two different “data views” of the same model weights and switching between them at runtime in 215–434 ms without stopping the engine or dropping in-flight requests. On 8×H200 GPUs serving Qwen3-235B-A22B, Moebius matches the better static layout at every operating point and outperforms any single static choice by 1.16–1.25× on RL rollouts, with only 2.4% extra memory overhead.

Prerequisites

Before diving into Moebius, a reader should be comfortable with the following concepts. This section builds up the background in a self-contained way.

MoE Architecture Recap

A Mixture-of-Experts (MoE) language model replaces some or all of the dense feed-forward network (FFN) layers in a Transformer with a collection of $E$ expert FFN sub-networks. During inference, a lightweight router assigns each input token to a small subset of $k$ of these experts (typically $k=2$ to $k=8$), and only the selected experts’ computations are performed for that token.

Formally, if $x$ is an input token embedding of dimension $H$ , the MoE output is:

\text{MoE}(x) = \sum_{i \in \text{top-}k(r(x))} g_i(x) \cdot \text{FFN}_i(x) \tag{1}

where $r(x)$ is the router output (a vector of expert scores), $g_i(x)$ is a gating weight, and $\text{FFN}_i$ is the $i$ -th expert network. Each expert $i$ has two weight matrices:

W^{(i)}_{\text{gate/up}} \in \mathbb{R}^{2I \times H}, \quad W^{(i)}_{\text{down}} \in \mathbb{R}^{H \times I} \tag{2}

where $H$ is the hidden dimension and $I$ is the expert intermediate dimension. In Qwen3-235B-A22B there are $E = 128$ experts per layer with $k = 8$ active per token, and expert weights dominate model memory—often 70–80% of total parameters.
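Equation (1) can be made concrete with a toy, single-token sketch. It assumes the gates $g_i$ are a softmax over the selected router scores (models differ in how they normalize), and it uses scalar functions in place of real FFNs; the name `moe_forward` is illustrative only.

```python
import math

def moe_forward(x, router_scores, experts, k):
    """Equation (1) for one token: gated sum over the top-k experts.

    x             -- token representation (a float here, for brevity)
    router_scores -- one score per expert, r(x)
    experts       -- list of callables standing in for FFN_i
    Gates are a softmax over the selected scores (an assumption).
    """
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    m = max(router_scores[i] for i in top)            # for numerical stability
    exps = {i: math.exp(router_scores[i] - m) for i in top}
    z = sum(exps.values())
    gates = {i: exps[i] / z for i in top}
    out = sum(gates[i] * experts[i](x) for i in top)  # unselected experts never run
    return out, sorted(top)

# Four toy experts that scale their input by 1, 2, 3 and 4.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward(2.0, [0.0, 1.0, 1.0, -1.0], experts, k=2)
print(chosen, out)
```

Only experts 1 and 2 are evaluated here; the other two contribute neither compute nor output, which is the sparsity that makes MoE cheap per token.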

Tensor Parallelism (TP) Basics

Tensor Parallelism distributes individual weight matrices across $P$ GPUs (ranks). For MoE experts under TP, each GPU holds all $E$ experts but each expert is partially sharded along the intermediate dimension:

W^{\text{TP}}_{\text{gate/up}} \in \mathbb{R}^{E \times (2I/P) \times H}, \quad W^{\text{TP}}_{\text{down}} \in \mathbb{R}^{E \times H \times (I/P)} \tag{3}

Computation across experts is split in column/row fashion across all $P$ GPUs; an All-Reduce collective after each expert layer synchronizes outputs. KV cache is also split by attention head: each rank holds $H_{kv}/P$ KV heads.

When TP wins: At low request concurrency (small batch $B$), expert GEMMs are memory-bandwidth-bound. TP shards every expert across all $P$ GPUs, so each GPU streams only $1/P$ of each activated expert's weights from HBM and the aggregate memory bandwidth of all GPUs is applied to every token.

Expert Parallelism (EP) Basics

Expert Parallelism distributes the experts themselves across GPUs. Each GPU holds $E/P$ complete experts:

W^{\text{EP}}_{\text{gate/up}} \in \mathbb{R}^{(E/P) \times 2I \times H}, \quad W^{\text{EP}}_{\text{down}} \in \mathbb{R}^{(E/P) \times H \times I} \tag{4}

Token routing requires All-to-All communication: tokens are dispatched to whichever GPU owns their selected expert, computed there, and results are returned. Attention runs in data-parallel mode—each GPU handles its own set of requests’ full KV state.
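The dispatch half of that All-to-All can be sketched as a grouping step. The snippet assumes the contiguous placement used later in this review (rank $r$ owns experts $[r \cdot E/P, (r+1) \cdot E/P)$); `ep_owner` and `build_dispatch` are hypothetical helper names, not SGLang or Moebius APIs.

```python
from collections import defaultdict

def ep_owner(expert_id, num_experts, num_ranks):
    """Rank that holds a given expert under contiguous EP placement."""
    per_rank = num_experts // num_ranks
    return expert_id // per_rank

def build_dispatch(token_to_experts, num_experts, num_ranks):
    """Group (token, expert) pairs by destination rank: the All-to-All send lists."""
    sends = defaultdict(list)
    for token, expert_ids in token_to_experts.items():
        for e in expert_ids:
            sends[ep_owner(e, num_experts, num_ranks)].append((token, e))
    return dict(sends)

# Two tokens, each routed to two experts (k=2 here to keep the example small).
routing = {0: [3, 64], 1: [127, 0]}
sends = build_dispatch(routing, num_experts=128, num_ranks=8)
print(sends)
```

With 128 experts on 8 ranks, each rank owns 16 experts, so experts 3 and 0 both land on rank 0 while experts 64 and 127 go to ranks 4 and 7.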

When EP wins: At high batch concurrency (large $B$ ), expert GEMMs become compute-bound. EP keeps each expert’s full weight matrices on one GPU, maximizing batch size per expert and compute utilization.

The TP/EP Crossover

There exists a crossover batch size $B^*$ where TP and EP exchange optimality:

B < B^*: \quad \text{TP wins} \qquad B > B^*: \quad \text{EP wins} \tag{5}

For Qwen3-235B-A22B on 8×H200, $B^* \approx 128$–$256$. Production workloads oscillate across $B^*$ continuously: bursty serving sees request rates vary by 2–3 orders of magnitude, and RL rollouts start with many active sequences (EP-favoring) then decay to a few long stragglers (TP-favoring).

CUDA Graphs

CUDA graphs capture a sequence of GPU operations into a reusable graph object, replayed via a single cudaGraphLaunch call. This eliminates per-step CPU scheduling latency (1–10 ms per step for large models), which is critical at low batch sizes where GPU compute is fast.

The key constraint: CUDA graph replay requires that memory addresses of all tensors remain fixed from capture to replay. Weight tensors, KV buffers, and attention state must live at the same device addresses—a design constraint Moebius must carefully navigate.

Paged Attention and KV Cache

Modern serving systems (vLLM, SGLang) use paged attention: KV cache for each sequence is stored in non-contiguous pages (physical GPU memory blocks) referenced by a logical-to-physical page table. Under EP, each GPU holds the full KV state for its assigned requests. Under TP, KV cache is partitioned by attention head. A parallelism switch therefore requires redistributing KV state—one of the most expensive operations in Moebius.
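A toy page table shows what "non-contiguous pages referenced by a logical-to-physical page table" means in practice. This is a minimal sketch, not vLLM's or SGLang's actual data structure; the class and method names are invented for illustration.

```python
class PagedKVCache:
    """Toy logical-to-physical page table; each page holds `page_size` tokens."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free = list(range(num_pages))   # free physical page ids
        self.table = {}                      # seq_id -> [physical page ids]
        self.length = {}                     # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % self.page_size == 0:          # no page yet, or current page is full
            if not self.free:
                raise MemoryError("out of KV pages")
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Physical (page, offset) of token `pos` in sequence `seq_id`."""
        return self.table[seq_id][pos // self.page_size], pos % self.page_size

cache = PagedKVCache(num_pages=4, page_size=2)
for seq in ["a", "a", "b", "a"]:             # interleaved decoding of two sequences
    cache.append_token(seq)
print(cache.table, cache.locate("a", 2))
```

Sequence "a" ends up on physical pages 3 and 1 with "b" on page 2 in between, so any migration scheme has to walk the page table rather than copy one contiguous range.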

1. Introduction: The TP/EP Dilemma

MoE models—Mixtral-8x7B, DeepSeek-V3, Qwen3-235B-A22B—dominate production LLM serving. Their sparse activation (few experts fire per token) enables strong capability at reduced per-token compute, but their large total parameter counts create parallelism challenges.

Figure 1. TP vs. EP layout for MoE serving, drawn for P = 4 GPUs and E = 128 experts (the evaluated system uses 8 GPUs).

┌─────────────────────────────────────────────────────────────────────┐
│              Tensor Parallelism (TP)                                │
│  GPU 0         GPU 1         GPU 2         GPU 3                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │All E exp.│  │All E exp.│  │All E exp.│  │All E exp.│           │
│  │shard I   │  │shard I   │  │shard I   │  │shard I   │           │
│  │[0:I/P]   │  │[I/P:2I/P]│  │[2I/P:3I/P]│ │[3I/P:I]  │           │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘           │
│     ←────────── All-Reduce after each expert layer ────────────→   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│              Expert Parallelism (EP)                                │
│  GPU 0         GPU 1         GPU 2         GPU 3                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │Exp 0..31 │  │Exp 32..63│  │Exp 64..95│  │Exp 96..127│          │
│  │full I dim│  │full I dim│  │full I dim│  │full I dim│           │
│  │own reqs' │  │own reqs' │  │own reqs' │  │own reqs' │           │
│  │full KV   │  │full KV   │  │full KV   │  │full KV   │           │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘           │
│     ←──────── All-to-All dispatch before each expert layer ──────→ │
└─────────────────────────────────────────────────────────────────────┘

Today’s systems must commit to one layout at startup. A serving system configured for EP suffers 52 ms TPOT during quiet periods (vs. TP’s 37 ms), while one configured for TP suffers a 90 s p99 TTFT during bursts. No static choice is universally optimal.

Why prior work fails for MoE:

Weight transfer scale: Expert weights dominate MoE memory (70–80%). Dense-model switching systems transfer modest weights; MoE requires tens of GB over NVLink per switch.
In-flight continuity: Production engines cannot drain all in-flight requests before switching; long-context requests would time out.
CUDA graph invalidation: Engine restart breaks CUDA graph tensor addresses, forcing minutes of recapture warmup.

Moebius addresses all three simultaneously.

2. The Core Insight: EP and TP as Data Views

The key intellectual contribution is recognizing that EP and TP are not different systems; they are different data layouts of the same tensor data. Both layouts hold the same expert data volume per GPU, per layer, when $P \mid E$ and $P \mid I$:

\text{Expert data per GPU} = \frac{E \cdot (2I \cdot H + H \cdot I)}{P} \cdot \text{dtype\_bytes} \quad \text{(invariant across EP and TP)} \tag{6}

This volume invariance enables in-place resharding: the transformation is volume-preserving, so no second copy of the expert weights is needed and the data only has to be rearranged across GPUs.

EP layout (rank $r$ owns experts $[r \cdot E/P, (r{+}1) \cdot E/P)$ ):

W^{\text{EP}} \in \mathbb{R}^{(E/P) \times 2I \times H} \quad \text{(complete intermediate dim, subset of experts)} \tag{7}

TP layout (rank $r$ owns intermediate shard $[r \cdot 2I/P, (r{+}1) \cdot 2I/P)$ for all experts):

W^{\text{TP}} \in \mathbb{R}^{E \times (2I/P) \times H} \quad \text{(partial intermediate dim, all experts)} \tag{8}

Both layouts have the same total element count per GPU. Converting between them means redistributing the data along a different axis (experts versus intermediate dimension), which is a distributed AllToAll exchange.
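The invariance is easy to check numerically from the per-rank shapes in Equations (3) and (4). The sketch below uses small made-up dimensions, and the helper names are illustrative.

```python
from math import prod

def expert_shapes(E, I, H, P):
    """Per-rank (gate/up, down) shapes under TP and EP, per Equations (3) and (4)."""
    assert E % P == 0 and I % P == 0, "P must divide both E and I"
    tp = [(E, 2 * I // P, H), (E, H, I // P)]
    ep = [(E // P, 2 * I, H), (E // P, H, I)]
    return tp, ep

def elements(shapes):
    """Total element count across a list of tensor shapes."""
    return sum(prod(s) for s in shapes)

tp, ep = expert_shapes(E=8, I=16, H=4, P=4)
print(tp, elements(tp))
print(ep, elements(ep))
```

Both layouts hold 384 elements per rank in this toy case: the shapes differ, the volume does not, which is what allows one buffer to serve both modes.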

3. Weight Resharding: The EP↔TP Transformation

3.1 EP→TP Resharding

Under EP, rank $r$ owns experts $[r \cdot E/P, (r{+}1) \cdot E/P)$ with full intermediate dimension. Under TP, rank $r$ needs shard $[r \cdot 2I/P, (r{+}1) \cdot 2I/P)$ of every expert. The transformation requires a distributed all-to-all exchange.

Algorithm 1: EP→TP Weight Resharding

Input:  W_EP[r] ∈ ℝ^{(E/P, 2I, H)} on rank r   (our E/P complete experts)
Output: W_TP[r] ∈ ℝ^{(E, 2I/P, H)} on rank r   (I/P-shard of all E experts)

Step 1 — Local permute (pack by destination rank):
  For each peer rank p in 0..P-1:
    chunk[p] = W_EP[r][:, p*(2I/P) : (p+1)*(2I/P), :]
    # shape: (E/P, 2I/P, H) — the I-shard of our experts that rank p needs

Step 2 — Concatenate into AllToAll send buffer:
  send_buf = concat(chunk[0], chunk[1], ..., chunk[P-1]) along axis 0
  # shape: (E, 2I/P, H)

Step 3 — AllToAll exchange via NVLink direct-transfer:
  recv_buf = AllToAll(send_buf)
  # recv_buf[(p*(E/P)) : ((p+1)*(E/P)), :, :]
  #   ← chunk[r] from rank p  (rank p's experts, our I-shard)

Step 4 — Direct placement (recv_buf is already the correct layout):
  W_TP[r] = recv_buf   # shape: (E, 2I/P, H) — all E experts, I/P shard ✓

Figure 2. EP→TP data flow with P=4 GPUs and E=8 experts (simplified).

EP Layout (before):                 TP Layout (after):
GPU 0: [Exp0, Exp1] all I           GPU 0: [Exp0..7] I[0:I/4]
GPU 1: [Exp2, Exp3] all I    ─→     GPU 1: [Exp0..7] I[I/4:I/2]
GPU 2: [Exp4, Exp5] all I    A2A    GPU 2: [Exp0..7] I[I/2:3I/4]
GPU 3: [Exp6, Exp7] all I           GPU 3: [Exp0..7] I[3I/4:I]

Each GPU sends P sub-chunks (one per peer), receives P sub-chunks (one per peer).
Total traffic per GPU: (P-1)/P × local_weight_bytes (one NVLink pass).
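Algorithm 1 can be simulated in plain Python to confirm that the receive buffer needs no further permutation. This is a single-process sketch: each weight row is replaced by a label `(expert, row)`, the hidden dimension is dropped, `rows` stands for $2I$, and `all_to_all` is a stand-in for the collective rather than a real NCCL or Moebius call.

```python
def ep_weights(rank, E, rows, P):
    """EP shard: E/P whole experts; each row is labelled (global expert, row)."""
    per = E // P
    return [[(rank * per + e, j) for j in range(rows)] for e in range(per)]

def all_to_all(send):
    """send[src][dst] -> recv[dst][src], the collective's exchange pattern."""
    P = len(send)
    return [[send[src][dst] for src in range(P)] for dst in range(P)]

def ep_to_tp(E, rows, P):
    shard = rows // P
    # Steps 1-2: each rank slices its experts' rows by destination rank.
    send = [[[w[p * shard:(p + 1) * shard] for w in ep_weights(r, E, rows, P)]
             for p in range(P)] for r in range(P)]
    recv = all_to_all(send)                      # Step 3
    # Step 4: chunks arrive in source-rank order, which is already expert order.
    return [[expert for chunk in recv[r] for expert in chunk] for r in range(P)]

tp = ep_to_tp(E=8, rows=8, P=4)
print(tp[1][5])   # rank 1's shard of expert 5
```

Rank 1 ends up with rows 2 and 3 of every expert, in expert order, matching the "direct placement" claim in Step 4.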

3.2 TP→EP Resharding

The reverse. Under TP, each rank has $2I/P$ of every expert. Under EP, each rank needs all $2I$ of $E/P$ experts.

Algorithm 2: TP→EP Weight Resharding

Input:  W_TP[r] ∈ ℝ^{(E, 2I/P, H)} on rank r   (I/P-shard of all E experts)
Output: W_EP[r] ∈ ℝ^{(E/P, 2I, H)} on rank r   (our E/P experts, full I dim)

Step 1 — Partition by expert owner:
  For each peer rank p in 0..P-1:
    chunk[p] = W_TP[r][p*(E/P) : (p+1)*(E/P), :, :]
    # shape: (E/P, 2I/P, H) — rank p's experts, our I-shard

Step 2 — Concatenate into AllToAll send buffer:
  send_buf = concat(chunk[0], ..., chunk[P-1])   # shape: (E, 2I/P, H)

Step 3 — AllToAll exchange:
  recv_buf = AllToAll(send_buf)
  # recv_buf[p*(E/P) : (p+1)*(E/P), :, :]
  #   ← chunk[r] from rank p  (our experts, rank p's I-shard)

Step 4 — Local interleave (assemble full I dimension per expert):
  For each local expert e in 0..E/P-1:
    For each peer rank p in 0..P-1:
      W_EP[r][e, p*(2I/P) : (p+1)*(2I/P), :] = recv_buf[p*(E/P)+e, :, :]
  # Result: W_EP[r] shape (E/P, 2I, H) with full intermediate dimension ✓
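The reverse direction can be checked the same way. As before, this is a single-process sketch with labelled rows instead of real tensors; `rows` stands for $2I$ and `all_to_all` is a stand-in for the collective. The only structural difference from Algorithm 1 is the interleave in Step 4.

```python
def tp_weights(rank, E, rows, P):
    """TP shard: rows [rank*rows/P, (rank+1)*rows/P) of every expert."""
    shard = rows // P
    return [[(e, j) for j in range(rank * shard, (rank + 1) * shard)]
            for e in range(E)]

def all_to_all(send):
    """send[src][dst] -> recv[dst][src], the collective's exchange pattern."""
    P = len(send)
    return [[send[src][dst] for src in range(P)] for dst in range(P)]

def tp_to_ep(E, rows, P):
    per = E // P
    # Steps 1-2: slice the expert axis by the rank that will own each expert.
    send = [[tp_weights(r, E, rows, P)[p * per:(p + 1) * per] for p in range(P)]
            for r in range(P)]
    recv = all_to_all(send)                      # Step 3
    # Step 4: for each local expert, join the row shards in source-rank order.
    return [[[row for p in range(P) for row in recv[r][p][e]]
             for e in range(per)] for r in range(P)]

ep = tp_to_ep(E=8, rows=8, P=4)
print(ep[2][1])   # rank 2's second local expert, i.e. global expert 5
```

Each rank finishes with its $E/P$ experts and all eight rows of each, so the round trip EP→TP→EP restores the original layout.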

TP→EP is more expensive than EP→TP because of KV cache migration: under TP, each request's KV heads are partitioned across ranks, whereas under EP a single rank must hold the complete KV state for each of its requests. Gathering all head shards onto each request’s new EP owner doubles NVLink traffic for KV migration.

3.3 Transfer Cost Analysis

The data volume transferred per GPU per reshard is:

V_{\text{reshard}} = \frac{P-1}{P} \times \text{ExpertWeightsPerGPU} \tag{9}

For Qwen3-235B-A22B (E=128, H=4096, I=1536, P=8, 94 layers, BF16):

\text{Weights per GPU} \approx \frac{128 \times (2 \times 1536 \times 4096 + 4096 \times 1536) \times 94}{8} \times 2 \text{ B} \approx 56.8 \text{ GB} \tag{10}

At 70% NVLink efficiency (∼280 GB/s effective per-GPU bandwidth for AllToAll):

t_{\text{weight}} \approx \frac{56.8 \text{ GB} \times \frac{7}{8}}{280 \text{ GB/s}} \approx 177 \text{ ms (weight transfer only)} \tag{11}

The paper reports ~152 ms for the KV-free weight reshard using its fused direct-transfer kernel (which exceeds 70% of NVLink peak), in line with this back-of-envelope estimate. The full 215–434 ms production window adds KV cache redistribution, which scales with cache occupancy.
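The estimate in Equations (9) to (11) can be reproduced with a few lines of arithmetic. This is a hedged calculator, not the paper's methodology: the function names are illustrative, and the dimensions passed in (hidden size 4096, expert intermediate size 1536) are assumed from the publicly released Qwen3-235B-A22B configuration rather than measured.

```python
def expert_bytes_per_gpu(E, H, I, layers, P, dtype_bytes=2):
    """Per-GPU expert weight bytes: gate/up (2I x H) plus down (H x I) per expert."""
    per_expert = 2 * I * H + H * I
    return E * per_expert * layers * dtype_bytes // P

def reshard_seconds(bytes_per_gpu, P, bandwidth_bytes_per_s):
    """Equation (9) divided by the effective per-GPU bandwidth."""
    return bytes_per_gpu * (P - 1) / P / bandwidth_bytes_per_s

# Assumed dimensions (public model config); 280 GB/s is the effective
# AllToAll bandwidth quoted in the text.
w = expert_bytes_per_gpu(E=128, H=4096, I=1536, layers=94, P=8)
t = reshard_seconds(w, P=8, bandwidth_bytes_per_s=280e9)
print(f"{w / 1e9:.1f} GB per GPU, {t * 1e3:.0f} ms")
```

Under these assumptions the script prints 56.8 GB per GPU and 177 ms, the same order as the ~152 ms the paper reports for the weight-only reshard.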

4. Request Redistribution: KV Cache Migration

Weight resharding alone is insufficient—in-flight requests carry KV cache state that must be redistributed when layout changes.

4.1 EP→TP: KV Repartition by Attention Head

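Under EP, rank $r$ holds the complete KV cache (all $H_{kv}$ heads) for its assigned requests. Under TP, each rank must instead hold $H_{kv}/P$ heads for all requests. When $H_{kv} < P$, as with this model's 4 KV heads on 8 GPUs, the heads are replicated so that every rank still holds one (Figure 8 labels the TP KV heads as replicated).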

Algorithm 3: EP→TP KV Cache Redistribution

Input:  rank r has full KV for requests {R_r} (all H_kv heads, paged)
Output: rank r has H_kv/P heads of KV for ALL requests

Step 1 — Metadata All-Gather:
  Collect page table metadata from all ranks
  GlobalOrder = merge({R_0}, {R_1}, ..., {R_{P-1}}) sorted by sequence id

Step 2 — Determine head ownership:
  rank r owns heads [r * H_kv/P : (r+1) * H_kv/P]

Step 3 — Per-request, per-layer KV page migration:
  For each seq_id in GlobalOrder:
    source_rank = EP rank that currently owns seq_id
    For each layer l in 0..N_layers-1:
      For each page p belonging to (seq_id, layer l):
        Direct-transfer: source_rank sends KV[heads=r*H_kv/P:(r+1)*H_kv/P, page p]
                        → rank r at target page address (page table preserved)

The key optimization: the page table is used to locate physical KV pages without copying data to a contiguous buffer first—eliminating one HBM round-trip vs. naive approaches.
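Step 2 and the outer loop of Step 3 can be sketched as a transfer plan. The replication branch for $H_{kv} < P$ is an assumption that mirrors common serving-engine behavior, not something Algorithm 3 spells out, and both function names are hypothetical.

```python
def tp_kv_heads(rank, num_ranks, num_kv_heads):
    """KV head indices a TP rank holds.

    With fewer KV heads than ranks, heads are replicated so that each rank
    holds one head (an assumption, see the lead-in).
    """
    if num_kv_heads >= num_ranks:
        per = num_kv_heads // num_ranks
        return list(range(rank * per, (rank + 1) * per))
    ranks_per_head = num_ranks // num_kv_heads
    return [rank // ranks_per_head]

def kv_transfers(owner_of_seq, num_ranks, num_kv_heads):
    """(seq, src_rank, dst_rank, heads) moves for an EP->TP switch."""
    moves = []
    for seq, src in sorted(owner_of_seq.items()):   # global order by sequence id
        for dst in range(num_ranks):
            if dst != src:                          # the owner keeps its own heads in place
                moves.append((seq, src, dst, tp_kv_heads(dst, num_ranks, num_kv_heads)))
    return moves

moves = kv_transfers({"s0": 0, "s1": 3}, num_ranks=4, num_kv_heads=8)
for m in moves:
    print(m)
```

Each sequence's EP owner sends every other rank only the heads that rank will own under TP; per-layer and per-page iteration (the inner loops of Step 3) would sit inside each move.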

4.2 TP→EP: Sequence Redistribution

Under TP, all ranks jointly handle all requests. Under EP, each rank handles a disjoint subset.

Algorithm 4: TP→EP Sequence Assignment and KV Migration

Step 1 — Greedy longest-first assignment:
  Sort all in-flight requests by current KV page count (longest first)
  Initialize load[r] = 0 for each rank r
  For each request seq (sorted descending by length):
    r* = argmin_r(load[r])   # assign to least-loaded rank
    assign(seq → r*)
    load[r*] += pages(seq)

Step 2 — KV migration to new owners:
  For each request seq now assigned to rank r:
    For each layer l:
      For each KV page of seq, layer l:
        Gather all P head-shards from all TP ranks
        Write concatenated KV to rank r's memory at target page address
        Update page table entry on rank r

Why longest-first greedy? Assigning the largest item first to the least-loaded rank is the classic longest-processing-time (LPT) rule for makespan scheduling, which guarantees a maximum load within a factor of $4/3 - 1/(3P)$ of optimal on $P$ ranks. Placing the largest sequences first leaves the small ones to even out the remaining imbalance, rather than having a late large sequence land on an already loaded rank. The assignment uses measured KV page count rather than future generation length (which is unknown), so sequences near completion can skew the balance, a limitation noted in the critical analysis below.
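Step 1 of Algorithm 4 maps directly onto a min-heap of rank loads. The sketch below is a minimal version under the assumption that load is measured purely in KV pages; `assign_longest_first` is an illustrative name.

```python
import heapq

def assign_longest_first(pages_by_seq, num_ranks):
    """Greedy longest-first: give each sequence to the least-loaded rank."""
    heap = [(0, r) for r in range(num_ranks)]    # (load in pages, rank)
    heapq.heapify(heap)
    owner = {}
    # Longest sequences first; ties broken by sequence id for determinism.
    for seq, pages in sorted(pages_by_seq.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        owner[seq] = rank
        heapq.heappush(heap, (load + pages, rank))
    loads = [0] * num_ranks
    for seq, rank in owner.items():
        loads[rank] += pages_by_seq[seq]
    return owner, loads

owner, loads = assign_longest_first(
    {"a": 7, "b": 5, "c": 4, "d": 3, "e": 3}, num_ranks=2)
print(owner, loads)
```

This input is chosen to show the heuristic is not always optimal: it yields loads of 10 and 12 pages, whereas a perfect 11/11 split exists (7+4 and 5+3+3). The result still sits inside the LPT guarantee for two ranks.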

5. Unified Memory Manager: Fixed Addresses for CUDA Graphs

The subtlest engineering challenge is maintaining valid CUDA graph tensor addresses across switches.

5.1 The Fixed-Address Constraint

CUDA graph replay requires that device pointers used during capture remain unchanged. Naive dynamic weight reallocation breaks all graphs, requiring minutes of recapture warmup. Moebius must keep weights at stable addresses while still resharding their content.

5.2 UMM Design: One Buffer, Mode-Specific Aliases

Each GPU allocates a single large contiguous buffer at startup, dimensioned to hold either layout’s working tensors. Mode-specific aliases are established as pointer views (arithmetic offsets) into this buffer:

Unified buffer (per GPU): shape (N_layers + 1, max_expert_slot_bytes)

TP mode alias:  layer i weight  →  buffer[i]      (slot i)
EP mode alias:  layer i weight  →  buffer[i + 1]  (slot i+1)
Spare slot:     buffer[0] in EP mode, buffer[N_layers] in TP mode

Figure 3. UMM buffer layout with offset-by-one aliasing.

Buffer slots:  [  0  ]  [  1  ]  [  2  ]  ...  [ N_L-1 ]  [ N_L ]
EP mode:        spare    L0       L1       ...   L(N_L-2)   L(N_L-1)
TP mode:        L0       L1       L2       ...   L(N_L-1)   spare

One ordering consistent with these aliases:

EP → TP reshard of layer i (layers visited in ascending order):
  - Source (EP alias):      read from buffer[i+1]
  - Destination (TP alias): write to buffer[i]
  - buffer[i] is free at that point: slot 0 is the EP-mode spare, and each
    later slot i was vacated when layer i-1 moved down.

TP → EP reshard of layer i (layers visited in descending order):
  - Source (TP alias):      read from buffer[i]
  - Destination (EP alias): write to buffer[i+1]
  - buffer[N_L] is the TP-mode spare, so the same argument holds in reverse.

The offset-by-one is the key: EP layer $i$ sits at slot $i+1$ and TP layer $i$ sits at slot $i$, so exactly one slot is unused in either mode and each layer's new layout can be written into a slot that holds no live data. After switching from EP to TP, the weight data has moved from slot $i+1$ to slot $i$, exactly where the CUDA graph for TP expects it. No graph recapture is required.
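The sliding-slot idea can be sanity-checked with a small simulation. This is a sketch of one ordering consistent with the aliases above (ascending layers for EP→TP, descending for TP→EP), not the paper's implementation; slot contents are just tags, and the class name is invented.

```python
class UnifiedBuffer:
    """N+1 slots; TP layer i lives in slot i, EP layer i in slot i+1."""

    def __init__(self, num_layers):
        self.n = num_layers
        self.mode = "EP"
        self.slots = [None] + [("EP", i) for i in range(num_layers)]

    def _move(self, src, dst, new_tag):
        assert self.slots[dst] is None, "destination slot must be free"
        self.slots[dst] = new_tag     # stands in for the AllToAll write
        self.slots[src] = None        # source slot becomes the new spare

    def switch(self):
        if self.mode == "EP":         # ascending: slot i is free when layer i moves
            for i in range(self.n):
                self._move(i + 1, i, ("TP", i))
            self.mode = "TP"
        else:                         # descending: slot i+1 is free when layer i moves
            for i in reversed(range(self.n)):
                self._move(i, i + 1, ("EP", i))
            self.mode = "EP"

    def alias(self, layer):
        """Slot index the active mode's CUDA graphs read for `layer`."""
        return layer if self.mode == "TP" else layer + 1

buf = UnifiedBuffer(num_layers=3)
buf.switch()
tp_slots = list(buf.slots)
buf.switch()
ep_slots = list(buf.slots)
print(tp_slots, ep_slots)
```

The assertion inside `_move` never fires in either direction, which is the property the design depends on: every write lands in a slot that holds no live weights, and each mode's aliases stay at fixed addresses.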

5.3 Memory Cost

UMM adds one extra slot (the scratch space):

\text{Overhead}_{\text{UMM}} = \frac{1}{N_{\text{layers}} + 1} = \frac{1}{95} \approx 1.05\% \tag{12}

Combined with dual attention shards (TP attention metadata + EP attention metadata both resident), total overhead reaches 2.4%—small enough that Moebius funds it by forgoing a 2.8 GB KV capacity margin.

6. Fused Direct-Transfer Kernel

Standard NCCL collectives stage data through intermediate buffers, requiring multiple HBM passes. Moebius implements a custom fused direct-transfer kernel that writes directly from one GPU’s HBM to another’s via NVLink peer access:

Table 1. HBM pass comparison: NCCL vs. Moebius direct-transfer.

Transfer	Method	HBM Passes (Send+Recv)	NVLink Passes
Expert weights	NCCL (naive)	2 + 1 = 3	1
Expert weights	Moebius (direct)	1 + 0 = 1	1
KV cache	NCCL (naive)	3 + 2 = 5	1
KV cache	Moebius (direct)	1 + 0 = 1	1

Figure 4. Data path comparison: NCCL staging vs. Moebius direct NVLink write.

NCCL path (weight reshard):
  GPU_src HBM → staging buffer → NVLink → staging buffer → GPU_dst HBM
    (3 HBM reads/writes total, 1 NVLink transfer)

Moebius direct-transfer path:
  GPU_src HBM → NVLink → GPU_dst HBM
    (1 HBM read on src, 0 extra reads on dst, 1 NVLink transfer)

Results:

Expert weights: 1.49× faster than NCCL
KV cache: >2× faster than NCCL
NVLink utilization: >70% of peak bandwidth achieved

7. Switch Policy: Asymmetric Hysteresis

7.1 Policy Formulation

Moebius uses a threshold-based policy with asymmetric hysteresis to prevent oscillation:

Parameters (interactive serving mode):

$T_h = 256$: switch TP→EP when current request count exceeds $T_h$
$T_\ell = 0.8 \cdot T_h \approx 205$: switch EP→TP when mean count over the last $W$ steps drops below $T_\ell$
$C = 5\text{ s}$: cooldown period between consecutive switches
Capacity check: verify target mode has sufficient KV space before switching

Switch Policy Algorithm:

Per-inference-step:
  count ← number of active in-flight requests
  history.append(count)
  mean_count ← mean(history[-W:])

  if elapsed_since_last_switch < C:
    skip (cooldown active)

  if mode == TP and count ≥ T_h:
    if KV_capacity(EP) ≥ KV_needed(active requests):
      initiate_switch(TP → EP)
      reset_cooldown()

  elif mode == EP and mean_count < T_ℓ:
    if KV_capacity(TP) ≥ KV_needed(active requests):
      initiate_switch(EP → TP)
      reset_cooldown()

Figure 5. Switch policy state machine with asymmetric thresholds and hysteresis band.

Request count (B)
    │
256 ┼──────────────────────────────── T_h  (TP→EP trigger: immediate)
    │         ↑ switch                      ↑ switch
    │   ┌─────┤    EP regime    ┌────────────┤
    │   │     └─────────────────┘            │
205 ┼──────────────────────────────── T_ℓ  (EP→TP trigger: mean over W steps)
    │   TP    ↓ switch EP →TP                ↓ switch
    └───────────────────────────────────────────────────────→ Time
        TP    EP           TP           EP           TP
                     ↑                       ↑
               5s cooldown             5s cooldown

7.2 Asymmetry Explained

Fast EP entry ($T_h$, immediate): Burst onset harms TTFT immediately; react fast.
Slow TP return ($T_\ell$, mean over $W$): Transient dips in request count shouldn’t trigger a costly switch. Average over $W$ steps to distinguish sustained quiet from a momentary gap.
Hysteresis band ($T_h > T_\ell$): Prevents oscillation. With a single threshold, a workload hovering around it would flip between TP and EP as often as the cooldown allows; with the band, a workload sitting at 230 requests stays in whichever mode it is already in.

7.3 Rollout Mode Variant

For RL rollout workloads, request count monotonically decreases (sequences complete; none arrive mid-step):

T_\ell^{\text{rollout}} = T_h, \quad W^{\text{rollout}} = 1 \tag{13}

Switch to TP the instant current batch drops below $T_h$ . No hysteresis needed since there is no oscillation risk—batch size only decreases.
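The policy from Sections 7.1 to 7.3 fits in one small class. This is a sketch rather than the paper's code: the window length $W$ is not given in the text, so the default of 8 is a placeholder, and `kv_fits` is a hypothetical callback standing in for the capacity check.

```python
from collections import deque

class SwitchPolicy:
    """Threshold policy with asymmetric hysteresis and a cooldown."""

    def __init__(self, t_high=256, t_low=205, window=8, cooldown_s=5.0, mode="TP"):
        self.t_high, self.t_low = t_high, t_low
        self.cooldown_s = cooldown_s
        self.history = deque(maxlen=window)   # last W request counts
        self.mode = mode
        self.last_switch = float("-inf")

    def step(self, now, count, kv_fits=lambda target: True):
        """Record one decode step; return the target mode if a switch fires."""
        self.history.append(count)
        if now - self.last_switch < self.cooldown_s:
            return None                       # cooldown active
        mean = sum(self.history) / len(self.history)
        if self.mode == "TP" and count >= self.t_high:
            target = "EP"                     # immediate, on the current count
        elif self.mode == "EP" and mean < self.t_low:
            target = "TP"                     # delayed, on the windowed mean
        else:
            return None
        if not kv_fits(target):
            return None                       # target layout lacks KV capacity
        self.mode, self.last_switch = target, now
        return target

policy = SwitchPolicy(window=2)
trace = [(0.0, 100), (1.0, 300), (2.0, 150), (7.0, 150), (8.0, 150)]
events = [policy.step(t, c) for t, c in trace]

# Rollout variant (Equation 13): T_l = T_h and W = 1, starting in EP.
rollout = SwitchPolicy(t_low=256, window=1, mode="EP")
rollout_event = rollout.step(0.0, 255)
print(events, rollout_event)
```

In the interactive trace the burst at t=1 s triggers EP at once, the dip at t=2 s is ignored because of the cooldown, and the return to TP happens at t=7 s once the windowed mean is below $T_\ell$. The rollout instance switches to TP on the first step below $T_h$.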

8. System Architecture and Implementation

8.1 Architecture Overview

Moebius layers cleanly on SGLang v0.5.5 with only 200 lines of SGLang modification:

Figure 6. Moebius system architecture layered on SGLang.

┌────────────────────────────────────────────────────────────────────┐
│                      SGLang Frontend                               │
│   Request Scheduler → Token Sampler → Streaming Response          │
└─────────────────────────────┬──────────────────────────────────────┘
                              │ requests / decode steps
┌─────────────────────────────▼──────────────────────────────────────┐
│                  Moebius Switch Coordinator (NEW)                  │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────────┐    │
│  │ Switch Policy  │  │ Capacity Monitor │  │ Runtime Selector │    │
│  │ (hysteresis,   │  │ (KV availability │  │ (TP / EP graph   │    │
│  │  cooldown)     │  │  check before    │  │  set pointer)    │    │
│  └────────────────┘  │  switch)         │  └──────────────────┘    │
│                      └─────────────────┘                           │
└─────────────────────────────┬──────────────────────────────────────┘
                              │ triggers reshard
┌─────────────────────────────▼──────────────────────────────────────┐
│              Unified Memory Manager + Reshard Engine (NEW)         │
│  ┌───────────────────────────────────────────────────────────┐     │
│  │ Unified Buffer [0..N_layers]: EP aliases / TP aliases     │     │
│  └───────────────────────────────────────────────────────────┘     │
│  ┌───────────────────────┐   ┌────────────────────────────┐        │
│  │ Direct-Transfer       │   │ Page Table Manager         │        │
│  │ Kernels (NVLink       │   │ (KV page redistribution,   │        │
│  │  GPU→GPU writes,      │   │  sequence reassignment)    │        │
│  │  no NCCL staging)     │   └────────────────────────────┘        │
│  └───────────────────────┘                                          │
└────────────────────────────────────────────────────────────────────┘

8.2 Dual Runtime Residence

Both the TP and EP CUDA graph sets (36 graphs per layout, per-rank batch up to 256) are captured at startup and kept resident. A switch involves:

Complete the current decode step normally
Run weight reshard kernels (direct-transfer AllToAll, Algorithm 1 or 2)
Run KV cache redistribution (Algorithms 3 or 4)
Atomically swap the runtime pointer from one CUDA graph set to the other
Resume inference—no graph recapture, no warmup, no engine restart

8.3 Implementation Cost

Core Moebius code: 7,400 lines
SGLang modifications: 200 lines
Both runtimes resident: TP + EP graph sets, communication groups, attention metadata—all simultaneously in memory

The 200-line modification count is a strong indicator of clean composability—Moebius slots into SGLang without major structural changes.

9. Experimental Results

9.1 Setup

Hardware: 8× NVIDIA H200 (141 GB HBM each, NVLink fully connected)
Model: Qwen3-235B-A22B, 94 layers, 64 query / 4 KV heads, BF16
Config: 2,048-request cap, 0.85 memory fraction, CUDA graphs enabled
Baselines: Static TP (optimal for low concurrency) and Static EP (optimal for high concurrency)

9.2 Bursty Online Serving

Workload: 3,107 ChatBot Arena requests over 375 seconds: two bursts (80 and 120 req/s) separated by 300 seconds of quiet (~5 req/s).

Metric	Static TP	Static EP	Moebius
Mean TTFT (burst)	9.9 s	—	3.1 s (3.2×)
p99 TTFT (burst)	90 s	—	6.0 s (15×)
Mean TPOT (quiet)	~37 ms	52 ms	~37 ms (tracks TP)
Mode switches	—	—	4 switches total

Static TP during bursts: catastrophic queue buildup, 90s p99 TTFT. Static EP during quiet: 52 ms TPOT vs. TP’s 37 ms. Moebius switches 4 times across the trace, matching the optimal layout at each phase.

9.3 RL Rollout Results

Workload: DeepMath benchmark, 9 rollout steps, 2,048 prompts per step, 32,768-token cap.

Each rollout step exhibits a characteristic high-to-low batch trajectory: starts with 2,048 active sequences (EP-favoring), decays as sequences complete to a few long-context stragglers (TP-favoring).

Comparison	Moebius Speedup
vs. worse static layout	up to 1.31×
vs. better static layout	1.16–1.25× (mean 1.22×)
vs. oracle (per-step layout selection)	beats oracle
Projected end-to-end training (Amdahl)	1.10–1.20×

Moebius beats the oracle. The oracle switches between rollout steps (using step-level foreknowledge), while Moebius switches within steps in response to intra-step batch decay. By transitioning to TP during each step’s straggler tail, Moebius captures per-step gains that step-level switching cannot.
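A short inequality makes this result less surprising. The notation here is introduced for illustration and is not from the paper: let $b_t$ be the active batch at decode iteration $t$ of one rollout step, $c_{\text{TP}}(b)$ and $c_{\text{EP}}(b)$ the per-iteration latencies of each layout, $n_{\text{sw}}$ the number of switches, and $t_{\text{sw}}$ the switch latency. An ideal intra-step switcher and the per-step oracle then cost

T_{\text{intra}} = \sum_t \min\big(c_{\text{TP}}(b_t),\, c_{\text{EP}}(b_t)\big) + n_{\text{sw}}\, t_{\text{sw}}, \qquad T_{\text{oracle}} = \min\Big(\sum_t c_{\text{TP}}(b_t),\; \sum_t c_{\text{EP}}(b_t)\Big)

Because a sum of minima never exceeds the minimum of the sums, $T_{\text{intra}} \le T_{\text{oracle}}$ whenever $n_{\text{sw}}\, t_{\text{sw}}$ is smaller than the gap between the two. With switches costing 215–434 ms and a batch that crosses $B^*$ once per rollout step, that condition plausibly holds for the long straggler tails described above.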

9.4 Switch Latency Breakdown

Figure 7. Switch latency comparison across methods.

Method                              Latency
──────────────────────────────────────────────────────────────────────────
Engine restart (strawman)           93 – 133 s   ████████████████████████
Host-memory reload                  13 – 20  s   ██████
Moebius weight-only (no KV)        ~152    ms   █
Moebius production (low KV load)    215     ms   █
Moebius production (high KV load)   434     ms   █
──────────────────────────────────────────────────────────────────────────

The 215–434 ms range: weight resharding is a fixed ~152 ms floor; KV cache redistribution scales with cache occupancy (more in-flight requests → more pages to migrate).

Transfer speed comparison:

Fused kernel vs. NCCL: expert weights 1.49× faster, KV cache >2× faster
NVLink utilization: >70% of peak in both transfer directions

9.5 Memory Footprint

Figure 8. Per-GPU memory breakdown.

Per-GPU Memory (GB):
                  Static TP    Static EP    Moebius
Model weights     ████████     ████████     ████████  (identical)
TP attention KV   ████         ·            ▓         (TP KV heads replicated)
EP attention KV   ·            ████         ▓         (EP KV: data-parallel)
UMM overhead      ·            ·            █         (2.8 GB dual-mode buffer)
──────────────────────────────────────────────────────
vs. Static TP:    baseline     −3.9 GB      −3.7 GB
vs. Static EP:    +3.9 GB      baseline     +0.2 GB

Moebius’s memory is within 0.2 GB of Static EP and 3.7 GB below Static TP. The 2.8 GB dual-mode buffer (UMM + dual attention shards) is funded by forgoing a small KV capacity margin.

CUDA graph residence cost: Each layout requires 36 CUDA graphs; both kept resident. The sub-millisecond switch selection cost (choosing which graph set to use) is negligible vs. the minutes that lazy capture would require.

10. Design Choice Analysis: WHY / ALTERNATIVE / BOUNDARY

Choice 1: Unified Memory Buffer (UMM)

WHY: CUDA graphs require fixed tensor addresses. Without UMM, resharding would require either graph recapture (minutes per switch, unacceptable) or dual full-weight storage (100% memory overhead, infeasible for 235B models).

Alternative: Dual separate allocations (full EP weights + full TP weights both in HBM). Eliminates the offset-aliasing complexity but doubles expert memory, adding a second copy of the per-GPU expert weights from Equation 10 (on the order of 60 GB per GPU for Qwen3-235B-A22B) and leaving little or no room for KV cache.

Boundary: UMM works only because EP and TP layouts have equal total data volume per GPU (Equation 6). A hypothetical layout requiring more data per GPU (e.g., full data-parallel replication) would require a larger buffer and would partially defeat the memory efficiency goal.

Choice 2: In-Place AllToAll vs. Drain-and-Reload

WHY: Draining all in-flight requests before switching causes user-visible latency spikes or timeouts, especially for long-context requests (which may take minutes to complete). Host reload takes 13–20s—too slow even for batch serving.

Alternative: Drain queues, suspend serving, reload from SSD/host memory, then resume. Avoids the in-place resharding complexity but introduces 13–133s service interruptions. Unacceptable for interactive serving and disruptive for RL training pipelines.

Boundary: In-place resharding requires NVLink (GPU-to-GPU direct write). On InfiniBand-connected clusters, the direct-transfer kernel falls back to slower paths, potentially increasing switch latency by 3–5× and making Moebius less compelling in cloud deployments.

Choice 3: Greedy Longest-First Sequence Assignment (TP→EP)

WHY: Greedy longest-first is the well-studied LPT scheduling heuristic: it achieves near-optimal load balance (maximum load within 4/3 of optimal) with O(n log n) complexity, fast enough to run synchronously during a switch.

Alternative: Balanced round-robin or random assignment. Much simpler but produces high KV load variance across GPUs, making some GPUs the bottleneck and degrading EP throughput by 10–20% in experiments.

Boundary: The greedy assignment uses current KV page count as the proxy for future load. Sequences near completion (small remaining generation) will have large historical KV but contribute little to future load—the heuristic may over-weight these sequences, causing suboptimal balance for short remaining generations.

Choice 4: Dual Resident CUDA Graph Sets

WHY: The alternative (lazy graph capture on first switch) would incur a 1–5 minute graph recapture delay on the first switch. For production serving, this means the first batch switch causes a service-level violation that invalidates the entire benefit of Moebius.

Alternative: Lazy capture. First switch takes minutes, but subsequent switches are instant. For deployments where switches are rare (< 1/day) and latency during the first switch is tolerable (offline batch), this reduces baseline memory usage.

Boundary: Dual residency requires that both sets of CUDA graphs fit in HBM simultaneously. On a 141 GB H200 this is fine; on 24–48 GB GPUs such as the A10, the resident graphs might crowd out KV cache capacity unacceptably.

Choice 5: 200-Line SGLang Integration

WHY: Minimizing host-system modifications allows Moebius to benefit from SGLang’s ongoing optimizations (RadixAttention, chunked prefill, speculative decoding) without maintaining a divergent fork.

Alternative: Custom serving engine with dual-mode as a first-class primitive. Maximizes design freedom but requires reimplementing all SGLang optimizations—a multi-year investment with ongoing maintenance burden.

Boundary: The thin integration means Moebius depends on SGLang’s internal APIs remaining stable. Major SGLang refactors could require nontrivial Moebius patches.

11. Critical Assessment: Weaknesses & Improvements

W1: Fixed Threshold Policy Requires Manual Tuning

Weakness: The switch policy parameters ( $T_h$ , $T_\ell$ , $C$ , $W$ ) have no principled derivation. The values $T_h=256$ and $T_\ell=0.8 \times T_h$ were selected for Qwen3-235B-A22B on 8×H200. Different models (smaller MoE, different $E$ or $k$ ), hardware (4×H100, 8×A100), or parallelism degrees ( $P = 4$ or $16$ ) will have different crossover points $B^*$ requiring re-tuning.

What the paper underplays: The crossover $B^*$ is a function of NVLink bandwidth, memory bandwidth, expert count, hidden dimension, and GPU count—a multi-variable function with no closed-form expression. There is no profiling-based tool to automatically determine thresholds. Operators deploying Moebius on new hardware must run their own benchmarks and manually set parameters.

Improvement: Develop an online profiling routine that measures EP and TP step latency at several batch sizes at startup, fits a linear crossover model, and automatically calibrates $T_h$ and $T_\ell$ . This would make Moebius self-configuring for any hardware/model combination.
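
A rough sketch of what such a calibration routine could look like, assuming step latency is approximately linear in batch size for each layout. The function names and the synthetic latencies in the usage note are illustrative; only the $T_\ell = 0.8 \times T_h$ ratio comes from the reviewed configuration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def calibrate_thresholds(batches, tp_ms, ep_ms, low_ratio=0.8):
    """Return (T_h, T_l) from profiled step latencies, or None.

    Assumes TP is faster at small batches and EP at large ones, so the
    TP line must have the steeper slope for a crossover to exist.
    """
    a_tp, b_tp = fit_line(batches, tp_ms)
    a_ep, b_ep = fit_line(batches, ep_ms)
    if b_tp <= b_ep:
        return None  # TP never falls behind EP as the batch grows
    crossover = (a_ep - a_tp) / (b_tp - b_ep)
    if crossover <= 0:
        return None  # EP is already faster at every positive batch size
    return crossover, low_ratio * crossover
```

With made-up profiles `tp_ms = 10 + 0.10 * B` and `ep_ms = 20 + 0.05 * B`, the fitted lines cross at a batch size of 200, giving a high threshold of 200 and a low threshold of 160. Real latency curves are unlikely to be this clean, so a piecewise or nonparametric fit may be needed in practice.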

W2: Single Model and Hardware Configuration Evaluated

Weakness: All quantitative results use Qwen3-235B-A22B on 8×H200 with NVLink. No results for:

Other MoE models (the much smaller Mixtral-8x7B at 46.7B, or DeepSeek-V2 at 236B with a different expert configuration)
Different parallelism degrees ( $P=4$ or $P=16$ )
Non-NVLink hardware (InfiniBand-connected A100 clusters, common in cloud deployments)
Heterogeneous node configurations

What is missing: InfiniBand bandwidth is typically 25–100 GB/s (vs. NVLink’s 400–900 GB/s). The direct-transfer kernel’s advantage over NCCL (which is already optimized for InfiniBand) may shrink significantly, potentially making the 215–434 ms switch window grow to seconds—eliminating the benefit for interactive serving.

Improvement: Provide an analytical cost model parameterized by $P$ , model size, and interconnect bandwidth, validated against at least two hardware configurations (e.g., A100+InfiniBand and H100+NVLink). Show where Moebius is and is not beneficial as a function of these parameters.

W3: KV Cache Redistribution Correctness Under Edge Cases

Weakness: Algorithms 3 and 4 operate at page granularity. The paper asserts correctness but does not address:

Partial pages: The last page of each sequence’s KV is typically partially filled. Redistribution must handle non-full pages without corrupting neighboring data.
Shared prefix caching: SGLang’s RadixAttention shares KV pages across requests with common prefixes. Redistributing a shared page to one EP rank while another rank still references it could corrupt the prefix cache.
Mid-prefill sequences: Requests undergoing chunked prefill at switch time have partially computed KV—redistribution must handle these consistently.

Improvement: Provide formal correctness proofs or detailed case analyses for these edge cases. Add unit tests and integration tests specifically targeting partial pages, RadixAttention sharing, and mid-prefill redistribution.
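
Two of these edge cases lend themselves to small, testable helpers. The sketch below is hypothetical: it assumes KV pages hold a fixed number of tokens and that sharing is tracked with a per-page reference count, which may not match how SGLang's RadixAttention represents it.

```python
def page_extents(seq_len, page_size):
    """Return [(page_index, valid_tokens)] for one sequence's KV pages.

    The last entry carries fewer than page_size tokens when the sequence
    length is not a multiple of the page size (the partial-page case).
    """
    if seq_len < 0 or page_size <= 0:
        raise ValueError("seq_len must be >= 0 and page_size > 0")
    full, remainder = divmod(seq_len, page_size)
    extents = [(i, page_size) for i in range(full)]
    if remainder:
        extents.append((full, remainder))
    return extents


def split_by_sharing(page_refcounts):
    """Split page ids into (exclusive, shared) by reference count.

    Exclusive pages (refcount 1) can be moved with their owning request;
    shared pages need a policy that keeps every referencing request valid.
    """
    exclusive = sorted(p for p, rc in page_refcounts.items() if rc == 1)
    shared = sorted(p for p, rc in page_refcounts.items() if rc > 1)
    return exclusive, shared
```

A redistribution test could then assert that only `valid_tokens` entries of the final page are copied, and that no page in the `shared` list is released on the source rank while another request still references it.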

W4: Memory Overhead May Be Understated

Weakness: The reported 2.4% overhead accounts for the UMM spare slot and dual attention shards, but the paper does not itemize:

Dual communication buffers (AllReduce communicators for TP + AllToAll communicators for EP, both resident)
CUDA graph side-tables (each of the 36+36 = 72 graphs has metadata tables)
Switch coordinator state (page table snapshots, history buffers, mode metadata)

For memory-constrained hardware these additional overheads could accumulate to several GB, narrowing the KV capacity advantage.

Improvement: Provide a complete memory budget table with absolute GB values for every component (weights, KV pool, CUDA graphs, communicators, UMM overhead, switch coordinator state) for the Qwen3-235B-A22B 8×H200 configuration.

W5: No Failure Recovery Protocol

Weakness: A GPU failure or NVLink partition during a reshard leaves the model in an inconsistent state: some layers in EP layout, others in TP layout. The paper does not discuss failure detection, rollback to the previous layout, or graceful degradation (e.g., drain remaining requests and restart). On clusters with thousands of GPUs, a multi-GPU failure inside a 215–434 ms window is rare, but its probability is not zero.

Improvement: Define the failure atomicity semantics (rollback to pre-switch state on failure?), implement a commit protocol that checks layout consistency before resuming inference, and add monitoring hooks that alert operators when switch consistency checks fail.
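
One possible shape for the consistency check, sketched under the assumption that every rank can report the layout of each of its layers after the reshard. The function and its arguments are hypothetical, and the sketch only makes the commit decision; an actual rollback would also require the pre-switch weights and KV pages to remain recoverable.

```python
def commit_or_rollback(reports, expected_ranks, target, previous):
    """Decide the post-switch layout from per-rank, per-layer reports.

    reports: dict rank -> list of layout strings, one per layer.
    Returns ("commit", target) only if every expected rank reported and
    every layer on every rank is in the target layout; otherwise
    ("rollback", previous).
    """
    # A rank that never reports is treated as failed.
    if set(expected_ranks) - set(reports):
        return "rollback", previous
    for layers in reports.values():
        if any(layout != target for layout in layers):
            return "rollback", previous
    return "commit", target
```

Inference would resume only after a `"commit"` result; a `"rollback"` result is also a natural place to fire the monitoring hook mentioned above.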

L1: InfiniBand Deployment Gap Understated

The paper mentions NVLink connectivity without adequately flagging InfiniBand as an unvalidated case. Most cloud-scale GPU clusters (AWS, Azure, GCP A100/H100 deployments) connect nodes over InfiniBand or Ethernet-based fabrics rather than NVLink, so this is a significant real-world applicability gap. The direct-transfer kernel's speedup over NCCL (1.49× to more than 2×) likely shrinks dramatically over InfiniBand, where NCCL is already highly optimized. The paper should include a discussion of expected behavior and a performance projection for InfiniBand deployments.

L2: Workload Generalizability Unclear

Results use ChatBot Arena (conversational) and DeepMath (math reasoning). Code generation, RAG, multi-modal, and structured output workloads have very different request length distributions and arrival patterns, which affect both the frequency of crossings above/below $B^*$ and the KV migration cost per switch. The 4-switch count and timing observed in the bursty trace may not generalize.

L3: Interaction with Speculative Decoding Unexplored

JetSpec, LayerSkip, and other speculative decoding systems maintain multiple KV caches per request (draft model KV + verifier model KV). A Moebius switch during speculative decoding would need to redistribute both KV sets, potentially doubling migration cost. Additionally, speculative decoding step sizes differ from standard decoding, potentially shifting the crossover $B^*$ . The paper does not discuss this interaction.

I1: Learned Switch Policy (RL Controller)

Replace the hand-tuned threshold policy with a lightweight RL controller observing {current batch size, TTFT trend, TPOT trend, KV occupancy} and outputting switch decisions optimized for SLO attainment. A tabular Q-learning agent would add minimal overhead and adapt automatically to novel workload distributions.
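
To make the proposal concrete, here is a minimal tabular Q-learning sketch. The state encoding, the two-action set, and the reward are all assumptions made for illustration, and exploration (e.g., epsilon-greedy) is omitted for brevity.

```python
from collections import defaultdict

ACTIONS = ("stay", "switch")


class SwitchAgent:
    """Tabular Q-learning over discretized serving states."""

    def __init__(self, alpha=0.5, gamma=0.9):
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        # Unseen states start with a value of 0.0 for every action.
        self.q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    def best_action(self, state):
        values = self.q[state]
        # Ties resolve to the first action, i.e. "stay".
        return max(ACTIONS, key=lambda a: values[a])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning target: r + gamma * max_a' Q(s', a').
        best_next = max(self.q[next_state].values())
        target = reward + self.gamma * best_next
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

A state could be a tuple such as `(layout, batch_bucket, kv_occupancy_bucket)`, and the reward could be the negative of an SLO-violation count over the next window; both choices would need validation against real traces.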

I2: Cross-Node Topology-Aware Switching

Scale beyond a single NVLink domain by implementing topology-aware KV redistribution: prefer intra-node KV moves first (NVLink), then inter-node (InfiniBand). This staged approach amortizes InfiniBand cost by maximizing local data reuse during redistribution.

I3: Hybrid Partial-Layer Switching

Apply EP to the bottom half of the layers (where attention dominates) and TP to the top half (where expert GEMMs dominate at intermediate batch sizes). This hybrid layout may outperform either pure EP or TP at $B \approx B^*$ , at the cost of implementing partial reshard protocols.

I4: Lazy KV Migration

Migrate KV cache pages on-demand during the first decode step after a switch, rather than proactively before resuming inference. This reduces the switch window from 215–434 ms to ~152 ms (weight-only), at the price of a brief burst of demand-driven page transfers right after the switch.

I5: Prefix Cache Integration

Integrate with SGLang’s RadixAttention to migrate shared prefix pages only once (updating all referencing page table entries) rather than per-request. For workloads with high prefix reuse (e.g., system prompts shared across users), this could reduce KV migration cost by 50–80%.
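
The deduplication step can be sketched as follows, assuming each request exposes the list of physical page ids it references. The function name and data layout are hypothetical and do not reflect SGLang's actual page-table structures.

```python
def plan_page_moves(request_pages):
    """Plan a migration that moves each physical page at most once.

    request_pages: dict request id -> list of physical page ids.
    Returns (unique_pages, naive_moves): the pages to move, in first-seen
    order, and the number of moves a per-request migration would issue.
    """
    seen = set()
    unique_pages = []
    naive_moves = 0
    for pages in request_pages.values():
        for page in pages:
            naive_moves += 1
            if page not in seen:
                seen.add(page)
                unique_pages.append(page)
    return unique_pages, naive_moves
```

The ratio `len(unique_pages) / naive_moves` gives a quick estimate of how much migration traffic prefix sharing could save for a given workload snapshot.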

12. Conclusion

Moebius establishes that serving-time parallelism should be dynamic, not static. The core insight—EP and TP are two data views of the same model—is clean and enables a practically engineered system: 215–434 ms switches, 2.4% memory overhead, dual resident CUDA graph sets, and a fused NVLink direct-transfer kernel that achieves >70% of peak bandwidth. The results are compelling: 15× p99 TTFT improvement during serving bursts, and 1.16–1.25× speedup on RL rollouts over the best static layout, beating even a per-step oracle by adapting within individual rollout steps.

The limitations—single-hardware evaluation, manually tuned thresholds, unaddressed InfiniBand deployments, and no failure recovery protocol—are real but addressable in follow-on work. As MoE models continue to dominate at scale and RL-based post-training becomes standard practice, systems that adapt parallelism to workload dynamics will become increasingly necessary infrastructure.

Practitioner Decision Guide:

Does your workload cross the TP/EP crossover batch size?
├── No (uniform high or low concurrency): Use static EP or TP. Moebius overhead unnecessary.
└── Yes (bursty serving or RL rollouts):
    Is your cluster NVLink-connected (single node or NVLink bridge)?
    ├── No (InfiniBand): Wait for IB-validated results. Switch cost may exceed savings.
    └── Yes: Moebius is a strong candidate. Expect 1.16–1.25× on rollouts.
         Is KV cache capacity available for 2.4% overhead?
         ├── Yes: Deploy with auto-calibrated thresholds (once available).
         └── No (very tight memory): Consider lazy-KV variant (I4 above).
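
The guide can also be written as a small function. This encodes one reading of the tree, in which the memory question applies only on the NVLink branch; the function name and the returned strings are illustrative.

```python
def recommend(crosses_crossover, nvlink_connected, kv_headroom):
    """Map the three yes/no questions of the guide to a recommendation."""
    if not crosses_crossover:
        return "static EP or TP"
    if not nvlink_connected:
        return "wait for InfiniBand-validated results"
    if kv_headroom:
        return "deploy Moebius"
    return "consider lazy-KV variant (I4)"
```

For example, `recommend(True, True, False)` returns the lazy-KV suggestion, while any workload that never crosses the crossover is routed to a static layout regardless of hardware.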

A1. Roofline Model for the TP/EP Crossover

To understand the crossover batch size $B^*$ analytically, consider the compute and memory bandwidth requirements of a single expert GEMM.

Under TP (P GPUs share each expert):

Each GPU holds a $1/P$ slice of every expert and processes all $B \cdot k$ token-to-expert assignments (where $k$ is the routing top-k). The expert GEMMs have arithmetic intensity (AI) of:

\text{AI}_{\text{TP}} = \frac{B \cdot k \cdot (2I/P) \cdot H \cdot 2}{(E \cdot (2I/P) \cdot H) \cdot 2} = \frac{B \cdot k}{E} \quad \text{(FLOPs per byte)} \tag{A1}

Memory bandwidth achievable: $\text{BW}_{\text{HBM}}$ (full bandwidth, weight working set fits in L2 for small $E \cdot 2I/P$ ).

Under EP (P GPUs each own E/P experts):

Each GPU handles $B_{\text{local}} = B \cdot k / P$ token-to-expert assignments spread across its $E/P$ experts, so each expert sees $B_{\text{expert}} = B_{\text{local}} / (E/P) = B \cdot k / E$ tokens on average. The AI for a single expert's GEMM:

\text{AI}_{\text{EP}} = \frac{B_{\text{expert}} \cdot 2I \cdot H \cdot 2}{(2I \cdot H) \cdot 2} = B_{\text{expert}} = \frac{B \cdot k}{E} \tag{A2}

But EP’s All-to-All communication adds overhead $t_{\text{A2A}} = \frac{(P-1)}{P} \cdot B \cdot H \cdot \text{dtype\_bytes} / \text{BW}_{\text{NVLink}}$ .

Crossover condition: since (A1) and (A2) coincide, a pure roofline model does not separate the two layouts; the only batch size it singles out is the ridge point, where the expert GEMMs stop being memory-bound:

B^* \approx \frac{\text{AI}_{\text{ridge}} \cdot E}{k} \tag{A3}

where $\text{AI}_{\text{ridge}} = \text{Peak\_FLOPS} / \text{BW}_{\text{HBM}}$ is the ridge point of the roofline. For H200: Peak_FLOPS ≈ 989 TFLOPS (dense BF16; the 1,979 TFLOPS datasheet figure assumes structured sparsity), BW_HBM ≈ 4.8 TB/s, giving $\text{AI}_{\text{ridge}} \approx 206$ FLOPs/byte. With $E=128$ , $k=8$ :

B^* \approx \frac{206 \times 128}{8} \approx 3,296 \tag{A4}

This is much higher than the empirically observed $B^* \approx 128$ – $256$ , reflecting that real-world crossover is dominated by All-to-All communication overhead and CUDA launch overhead rather than the pure compute/memory bandwidth ratio—showing why empirical profiling (not closed-form analysis) is needed to calibrate $T_h$ .
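
The All-to-All term and the ridge point above are simple enough to evaluate directly. The sketch below is a plain transcription of those two expressions; the function names are hypothetical, and any concrete hidden size or link bandwidth passed in is an assumption of the caller rather than a value from the paper.

```python
def all_to_all_seconds(batch, hidden, num_gpus, dtype_bytes, link_bytes_per_s):
    """Bandwidth-only All-to-All time: (P-1)/P * B * H * bytes / BW."""
    remote_fraction = (num_gpus - 1) / num_gpus
    return remote_fraction * batch * hidden * dtype_bytes / link_bytes_per_s


def ridge_point(peak_flops, hbm_bytes_per_s):
    """Arithmetic intensity (FLOPs/byte) at which a kernel turns compute-bound."""
    return peak_flops / hbm_bytes_per_s
```

Because this model ignores per-message latency and kernel launch cost, it gives a lower bound on communication time; measured step latencies remain the reference for choosing $T_h$ .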

A2. Related Work

Runtime parallelism switching (dense models):

Amoeba (OSDI 2024): Runtime degree switching for dense Transformers; assumes weights can be quickly transferred from host memory (feasible for 7–13B models, infeasible for 235B MoE).
Flying Serving and Shift Parallelism: Similar approaches focused on rebalancing load under heterogeneous hardware; all assume engine drain before switching.

MoE-specific serving systems:

HAP (MLSys 2024): Offline hybrid attention+expert parallelism strategy selection; static once deployed.
DeepSpeed-MoE and DeepSpeed-Inference: Fixed parallelism strategy per deployment.
Lina: Expert load balancing within a fixed EP layout; orthogonal to Moebius’s parallelism switching.

RL training infrastructure:

HybridFlow, OpenRLHF, DAPO: Disaggregated rollout+training pipelines; focus on synchronous/asynchronous data flow, not within-step parallelism adaptation.
StreamRL and AReaL: Async rollout generation; Moebius is complementary (can be applied within their rollout workers).

Moebius is orthogonal to all of these—it reuses their kernels and scheduling algorithms while adding the dynamic layout-switching layer on top.

A3. CUDA Graph Capture Cost Breakdown

Moebius captures 36 CUDA graphs per layout (TP + EP = 72 total). The capture process for each graph involves:

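Warmup pass (1–3 iterations): Execute the model in eager mode so that lazy initialization (kernel loading, autotuning, memory-pool allocation) finishes before capture.
Graph capture (1 iteration): Run the step once between cudaStreamBeginCapture and cudaStreamEndCapture, which records the stream's kernel launches and memory operations into a graph instead of executing them.
Graph instantiation: Call cudaGraphInstantiate to turn the captured graph into an executable graph object (cudaGraphExec_t) that can be launched repeatedly.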

For Qwen3-235B-A22B with 94 layers and per-rank batch 1–256, each graph set capture takes approximately 2–5 minutes. With both sets captured at startup, Moebius pays this cost once—amortized over the deployment lifetime, it is negligible. Lazy capture would impose this cost on the first switch, causing a service-level violation that Moebius explicitly avoids.