May 21, 2026 EN #LLM Inference #LLM Serving #KV Cache

SGLang: Efficient Execution of Structured Language Model Programs — Technical Review

Review date: 2026-05-21 Review author: Zhongzhu Zhou Paper reviewed: SGLang: Efficient Execution of Structured Language Model Programs Paper authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng (Stanford + UC Berkeley + SJTU + Texas A&M) arXiv: 2312.07104v2, 2024-06-06 Venue: NeurIPS 2024 Code: sgl-project/sglang

Short answer

Anyone who has shipped an LLM application past the toy stage has hit the same wall: a single gen() call is easy, but the moment you compose two or three or twenty calls with branching, parallel forks, structured outputs, and shared system prompts, you discover that the inference engine treats every call as if it were the first one ever. The KV cache is rebuilt. The system prompt is re-tokenized and re-embedded. The few-shot exemplars are re-prefilled. Forks of the same prompt do parallel work that should have shared compute. This is the world that SGLang sets out to fix.

SGLang is best read as two halves joined by an idea:

A Python-embedded DSL for language model programs (LM programs): gen(), select(), extend(), fork(), join(), plus image/video primitives. The frontend is small, idiomatic Python, and composable — you build branch–solve–merge, tree-of-thought, agent loops, and JSON-schema-constrained decoding with native control flow.
A co-designed runtime that knows which calls share prefixes with which other calls and exploits that information aggressively. The flagship technique, RadixAttention, manages the KV cache as a tree-structured LRU cache so that any common prefix — across calls, across instances, across tensor-parallel ranks — is reused for free.

Two more optimizations round out the runtime: a compressed finite-state machine (FSM) that recognizes when a regex constraint forces multiple tokens to be identical and decodes them in a single forward pass; and API speculative execution that opportunistically over-generates against a black-box endpoint (e.g. GPT-4) to save round-trips.

The empirical result is striking. Up to 6.4× throughput and 3.7× latency reduction over vLLM, LMQL, and Guidance on Llama-7B; comparable gains on Mixtral-8x7B and Llama-70B with tensor parallelism; 6× speedup on LLaVA image and video workloads. Cache hit rates between 50% and 99% on the benchmarks, with cache-aware scheduling reaching 96% of the offline-optimal rate. In production at Chatbot Arena, RadixAttention delivers a 52.4% hit rate on LLaVA-NeXT-34B and 74.1% on Vicuna-33B, cutting first-token latency by 1.7×.

This review (a) builds the prerequisites for readers who know vLLM’s PagedAttention but have not thought about prefix-aware scheduling, (b) walks through RadixAttention carefully — the data structure, the LRU policy, the cache-aware scheduling theorem, the frontend hints — (c) covers the compressed-FSM decoder and API speculation, (d) summarizes the experiments, including the ablations that matter, and (e) places SGLang against the rest of the LLM-serving and LM-programming literature.

1. Prerequisites

SGLang sits at the intersection of three sub-fields: LM programming frameworks, LLM serving systems, and structured generation. Without context the contributions look modest; with it, they read as one careful observation followed by a series of inevitable consequences. I’ll sketch what you need.

1.1 What is a “language model program”?

The phrase appears in DSPy and in this paper, with roughly the same meaning. An LM program is any executable artifact whose runtime behaviour involves more than one LLM call, with control flow connecting them. The canonical examples are:

Chain-of-thought + self-consistency. Sample $k$ different reasoning chains for the same question, then majority-vote the answers.
Branch–solve–merge. Generate $k$ partial solutions in parallel against different sub-prompts, then merge.
Tree-of-thought. Build a search tree where each node is a partial reasoning step generated by the model, expand promising branches, prune dead ones.
ReAct and other agents. Interleave model calls with tool calls (search, calculator, code execution).
Retrieval-augmented generation (RAG) pipelines. Embed → retrieve → re-rank → generate, sometimes with a self-critique loop.
JSON-mode / structured outputs. A single gen() call constrained by a regex (or a JSON schema compiled to a regex).

All of these share two structural traits:

Multiple LLM calls with control flow. You cannot answer the whole task in one forward pass; downstream calls depend on upstream calls’ outputs.
Structured inputs and outputs. Calls receive structured prompts (system prompt + few-shot exemplars + dynamic context) and emit structured outputs (JSON, a graded score, a choice from an enumerated set).

The two-trait observation is the moral foundation of the paper. Each trait creates an optimization opportunity that single-call engines cannot exploit: trait (1) creates prefix sharing across calls; trait (2) creates multi-token determinism inside a single call.

1.2 The KV cache, in 60 seconds

A decoder-only transformer at inference time produces, for every layer and every position, two vectors: the key $K$ and the value $V$ . The forward pass computing token $t$ needs the keys and values of all previous positions $0, 1, \dots, t-1$ . Recomputing those keys/values on every step would be $O(t^2)$ in compute and bandwidth. So we cache them: after each forward pass, the new $K_t, V_t$ are appended to a per-request KV cache that lives in HBM. The next step only computes $K_{t+1}, V_{t+1}$ and reuses the cached prefix.

The crucial point for SGLang: the KV cache depends only on the prefix tokens. If two different requests share the first $L$ tokens, their KV caches for those positions are bit-identical. Reusing them is mathematically lossless — it is not an approximation, not a quantization, not a learned compression. It is exact reuse.

Single-call engines such as the original vLLM treated each request as a self-contained unit, allocated a KV-cache slab on admission, and freed it on completion. Inter-request reuse was not on the table. That is the gap SGLang fills.

1.3 vLLM and PagedAttention

vLLM (Kwon et al., SOSP 2023) tackled a different KV-cache problem: intra-request fragmentation. Naively, each request reserves a contiguous KV-cache slab for its worst-case generation length, wasting HBM on requests that finish early. PagedAttention pages the KV cache into fixed-size blocks like an OS pages virtual memory, eliminating fragmentation and roughly doubling achievable batch sizes.

SGLang adopts paged storage too — it is compatible with, not a replacement for, PagedAttention. But it adds the inter-request dimension: those paged blocks can now be referenced by multiple requests when they share a prefix. Each page is in some sense an extent in a content-addressed store.

1.4 The radix tree

A radix tree (also known as a Patricia trie or compact prefix tree) is a trie in which every node with exactly one child is merged into its parent. Edges carry not single elements but variable-length sequences. For our purposes: it is the natural data structure for storing a set of strings (or token sequences) such that we can find the longest common prefix of any new string with the set in time linear in the new string’s length, regardless of how big the set is.

If you have ever implemented routing tables, you have used a radix tree. If you have ever implemented a tokenizer’s vocabulary lookup, you have used one. SGLang’s contribution is to apply it to KV-cache extents, with edges labeled by token-id sequences and leaves storing per-token KV-cache page references.

1.5 LMQL and Guidance — the two closest cousins

Two prior languages tackle the same niche:

LMQL (Beurer-Kellner et al., PLDI 2023) — a custom-syntax declarative language with where clauses for constraints. Strong type and constraint system, but relies on Hugging Face Transformers as the backend, which lacks continuous batching and tensor parallelism.
Guidance (Microsoft, 2023) — a Python-embedded DSL with similar primitives (gen, select). Uses llama.cpp as backend; lacks batching and parallelism support.

Both predate SGLang’s frontend, and SGLang acknowledges them. Where SGLang differs is precisely in the co-design of frontend and runtime: LMQL and Guidance can be retargeted to any backend, but no backend exists that understands LMQL or Guidance program structure well enough to exploit it. SGLang’s runtime (SRT — SGLang Runtime) is built specifically for the frontend’s prefix-sharing and parallelism semantics, and the frontend emits hints the runtime acts on.

1.6 Continuous batching

Iteration-level batching (Yu et al., OSDI 2022) is the established baseline: at every iteration, the batch is free to add a new request or evict a finished one. Every modern engine (vLLM, TGI, TensorRT-LLM, SGLang) uses some variant. The interesting question is scheduling: in what order do we admit requests when the queue is bigger than the batch can hold?

SGLang answers: pull the request whose prefix overlaps most with what is already cached. This is what cache-aware scheduling means in the paper.

1.7 Constrained / structured decoding

To force an LLM to emit valid JSON (or valid SQL, or valid XML, or any other regular language), the standard trick is: at every step, compute a vocabulary mask based on the current state of an automaton that accepts the constraint language; set the logits of disallowed tokens to $-\infty$ ; sample as usual. The automaton state advances based on the sampled token. Both Outlines and lm-format-enforcer popularized this approach in 2023.

The constraint usually starts as a regex. Most JSON-schema-to-regex compilers exist. The regex is converted to an NFA, then to a DFA, then often minimized into an FSM with manageable state count. The FSM is queried per step.

The inefficiency SGLang attacks: if from the current state the FSM has only one outgoing edge labelled with a sequence of tokens “summary”:”, then those tokens are forced — the LLM has no freedom. Token-by-token decoding wastes forward passes. We should decode all of them at once.

With those seven pieces in mind, the rest of the paper is roughly “store the KV cache in a radix tree, schedule against it, and decode forced subsequences in one shot.”

2. Programming Model

Section 2 of the paper introduces the frontend. I’ll move faster through this section because the runtime is the more substantive contribution, but the language matters: every runtime optimization in Sections 3–5 hinges on the frontend providing the right information.

2.1 The running example

The paper opens with a multi_dimensional_judge function: a branch–solve–merge essay grader that takes an image and an essay, asks the model whether the essay is on-topic (via select), forks into three parallel evaluators (clarity, originality, evidence), and finally produces a JSON summary with grade letter.

In raw OpenAI-style code, this is roughly 50 lines of asyncio glue plus brittle string parsing. In SGLang it is about 25. The space saving is not the point; the readability is. You can see the control flow because the control flow is Python.

@function
def multi_dimensional_judge(s, path, essay):
    s += system("Evaluate an essay about an image.")
    s += user(image(path) + "Essay:" + essay)
    s += assistant("Sure!")

    s += user("Is the essay related to the image?")
    s += assistant(select("related", choices=["yes", "no"]))
    if s["related"] == "no": return

    forks = s.fork(len(dimensions))
    for f, dim in zip(forks, dimensions):
        f += user("Evaluate based on the following dimension:" + dim + ".")
        f += assistant("Judgment:" + gen("judgment", stop="END"))

    judgment = "\n".join(f["judgment"] for f in forks)
    s += user("Provide the judgment, summary, and a letter grade")
    s += assistant(judgment + "In summary," + gen("summary", stop=".") +
                   "The grade of it is" + gen("grade"))

    schema = r'\{"summary": "[\w\d\s]+\.", "grade": "[ABCD][+-]?"\}'
    s += assistant(gen("output", regex=schema))

Five things to notice:

system / user / assistant apply the chat template — the program is agnostic to which model is in the back.
select calls the model and constrains the output to one of an explicit list. The runtime decides this by computing logprobs for each choice.
s.fork(k) creates $k$ parallel copies of the prompt state. Each fork inherits the prefix; subsequent appends diverge.
gen("name", stop=..., regex=...) is the workhorse. The regex argument plugs straight into the compressed-FSM decoder.
s["name"] fetches the result of a previous gen and is what makes Python control flow possible — without it, you cannot if-branch on a model output.

2.2 Primitives, formally

gen(name, stop=, regex=, max_tokens=) — sample tokens from the model and store under name. Optional regex constraint.
select(name, choices=) — pick the choice with the highest joint logprob.
extend or += — append a string (system/user/assistant role-tagged) to the state.
[name] — fetch a previously generated result, blocking until ready.
fork(k) — branch the state into $k$ parallel copies.
join — synchronize back. (Implicit when s["name"] is fetched on the parent or via for f in forks.)
image(path) / video(path) — attach multi-modal content.

The set is deliberately small. The paper’s Table 1 compares against LMQL and Guidance: SGLang adds video, fork, join explicitly, where the others did not have first-class parallelism primitives.

2.3 Execution modes — interpreter vs compiler

The interpreter — the default — runs the SGLang function as Python. Each primitive is submitted to a per-program stream running in a background thread; primitives are non-blocking unless the program reads back a result. This makes intra-program parallelism cheap: forks = s.fork(3) schedules three independent streams; each gen inside a fork launches asynchronously.

The compiler mode (discussed in Appendix D of the paper) traces the program into a static computational graph and applies graph-level optimizations: dead-code elimination, common-subexpression elimination, ahead-of-time prefix planning. The paper reports comparable performance in interpreter mode and uses interpreter mode for all the main results; compiler mode helps mostly for short-running programs where Python overhead dominates.

A useful mental model: the interpreter is to the SGLang runtime what CUDA streams are to a GPU. The program calls gen(), the call returns immediately, work happens in the runtime, and synchronization is implicit at the read.

2.4 Where SGLang sits in the stack

The paper draws a clean two-tier picture: high-level frameworks (LangChain, DSPy) provide prompt management and pipeline orchestration. Low-level languages (LMQL, Guidance, SGLang) provide direct primitives. SGLang acts as a compilation target for the high-level layer — the paper demonstrates DSPy → SGLang lowering and reports speedups for DSPy RAG pipelines as a result. This is the right factoring: the high-level layer owns prompt design and quality; the low-level layer owns throughput.

3. RadixAttention

This is the heart of the paper. Sections 3 and Appendix A.

3.1 The opportunity

In any non-trivial LM program, prefixes are shared in three ways:

Across calls within one program. The system prompt appears in every call. Few-shot exemplars stay constant. After a fork, the pre-fork prefix is identical across forks.
Across program instances. Two users running the same chat application share the system prompt; two users in the same chatbot share the model’s persona; a self-consistency sampling fans out from the same prefix.
Across tensor-parallel ranks. Every rank holds a shard of the model, but the prefix (in token-id form) is the same across ranks. Each rank reuses its own shard of the KV cache; no extra inter-rank synchronization is needed.

Without reuse, every shared prefix is recomputed for every request. With RadixAttention, it is computed once, stored in a tree, and looked up on the next request.

3.2 The data structure

A node of the radix tree holds:

an edge label — a sequence of token ids leading from the parent;
a pointer to a contiguous block of KV-cache pages corresponding to that edge;
a reference count — how many in-flight requests are reading this node’s tokens (so it cannot be evicted while in use);
last-access timestamp for LRU eviction.

The tree’s root is the empty prefix. For a new incoming request with token sequence $t_0 t_1 t_2 \dots t_n$ , the runtime walks the tree from the root, greedily following edges whose labels match the request’s prefix as far as possible. The matched prefix length $m$ tells the prefill what to skip: KV-cache entries for $t_0 \dots t_{m-1}$ are already on the GPU; the prefill only needs to compute entries for $t_m \dots t_n$ .

If the matching ends in the middle of an edge — i.e. the request’s prefix matches the first $k$ tokens of an edge labelled with $l > k$ tokens — the runtime splits that node into two: one parent edge holding the first $k$ tokens (kept cached and referenced by both old and new requests), one child edge holding the remaining $l - k$ tokens (still owned by the older request).

When a new request finishes, its generated tokens — plus any newly-prefilled prefix tokens — are inserted as new edges. Reference counts on every traversed node are decremented.

This is a classic copy-on-write radix tree with reference counting, except that the payload is GPU memory pages rather than disk blocks. The paper points out that maintaining the tree on the CPU side has negligible overhead — tree traversal and split costs are dominated by the model forward pass.

3.3 LRU eviction and reference counting

HBM fills quickly: a 7B model has KV-cache footprint of about 0.5 GB per 1000 tokens, so a 24 GB A10 has room for maybe 30k tokens of cache across all requests. Eviction is therefore inevitable.

RadixAttention evicts by LRU on leaves. Two important refinements:

Reference counting. A node currently being read by an in-flight forward pass is non-evictable. The reference count is incremented at request admission, decremented at request completion. This is what makes the cache live alongside the active batch without preallocation — there is no fixed “cache region” and “live region” in HBM; all KV-cache memory is the same pool, and when a node’s refcount hits zero it becomes eligible for eviction.
Leaf-first eviction. Always evict the LRU leaf. Internal nodes only become evictable once their children are evicted. This preserves common ancestors — exactly the prefixes most likely to be hit by future requests.

The paper provides a beautiful 9-step illustration (Figure 3) of the tree evolving across two chat sessions, a few-shot learning batch, and a self-consistency sample. The figure is worth reading even if you skip the rest of the paper: it makes concrete the abstract claim that the tree captures the natural sharing patterns of LM programs.

3.4 Frontend hints

A subtle implementation detail. When the SGLang interpreter executes s.fork(k), it could naively send $k$ full prompts (each identical up to the fork point) to the runtime. The runtime would then have to discover the shared prefix at request admission. This works, but it has a race: if the $k$ forks land in the runtime before the prefix is fully inserted into the tree, each fork might trigger its own prefill of the common prefix, wasting compute.

The frontend instead sends a hint: “I’m about to send $k$ requests sharing this prefix.” The runtime inserts the prefix first (as a no-completion warmup), then admits the $k$ children. This guarantees the prefix exists in the tree before any child arrives. The paper calls this a frontend hint, and the ablation in Figure 8(c) shows that disabling hints costs measurable throughput on tree-of-thought.

This is a textbook example of front-end / back-end co-design: neither the frontend nor the runtime could have made this optimization alone.

3.5 Cache-aware scheduling

The paper’s most theoretically satisfying contribution is cache-aware scheduling. The question: given a queue of waiting requests larger than the batch can absorb, in what order should we admit them?

The intuition: pick the request whose matched prefix is longest. That request prefills the least and frees up cycles for the rest. Equivalently, the request that hits the cache most.

Algorithmically: sort the waiting queue by matched prefix length against the current tree, descending. Pull the head into the next batch. After it finishes, recompute prefix lengths (the tree may have grown) and repeat.

The paper proves a clean offline-optimality result:

Theorem 3.1. For a batch of requests, an optimal cache hit rate is achieved by visiting the request radix tree in depth-first search order, with cache size at least the maximum request length. The longest-shared-prefix-first order is a DFS order.

The proof (Appendix A.3) is a small exchange argument: at any step, swapping the next-to-be-served request with a sibling in DFS order cannot increase the hit rate. The “cache size ≥ max request length” condition is needed so that the prefix of any single request never has to be evicted while serving that request.

In the online case, the DFS order is disrupted because requests arrive at unpredictable times. But the same longest-prefix-first heuristic approximates DFS on the dynamically-growing tree. The paper’s empirical hit rate (Figure 13) is 96% of the offline optimum on average across the eleven benchmarks.

A side warning: greedy cache-aware scheduling can starve a request whose prefix never matches anything. The paper acknowledges this and points at fair-scheduling extensions [42] as future work.

3.6 Distributed extension

For tensor parallelism, each rank holds a shard of the model and a sharded KV cache. The radix tree itself does not need to be sharded because the token-id sequence on which the tree is keyed is identical across ranks. Each rank traverses an identical tree and looks up its own shard of the KV-cache pages. Synchronization happens only through the model forward pass, exactly as with any other TP optimization.

For data parallelism across multiple workers, the trees are independent per worker. The paper discusses (Appendix A.4) a routing strategy that pins prompts with the same prefix to the same worker, increasing per-worker hit rate at the cost of some load imbalance.

3.7 Why does this work better than vLLM’s prefix caching?

vLLM has had experimental prefix caching for a while. The paper is careful to spell out the differences:

Sharing scope. vLLM caches a single common prefix at the system-prompt level; RadixAttention caches any prefix at any depth.
Tree structure. vLLM’s table-based cache cannot represent the branching of fork-style programs. The “No Tree-Structure” ablation in Figure 8(c) shows the throughput cost of going table-only.
Scheduling integration. vLLM admits requests FCFS; RadixAttention is prefix-aware. The “FCFS Schedule” ablation isolates the cost of giving up prefix-aware ordering.
Eviction policy. vLLM has no LRU on cache extents; RadixAttention does.

The “No Cache” ablation is the most dramatic: turning RadixAttention off entirely costs ~50–75% of throughput on the favourable benchmarks (LLM Judge, Tree-of-Thought, MMLU). Removing only the tree structure gives back about half of that. Switching to FCFS gives back maybe a quarter. Each component matters.

3.8 Overhead when there is no cache hit

A reasonable worry: if my workload has no shared prefixes — say, a ShareGPT-style stream of independent chat turns — does RadixAttention slow things down?

The paper’s answer (Section 6.3): on 100 ShareGPT requests, total runtime is 74.3 s, of which 0.2 s is RadixAttention bookkeeping. That is 0.27% overhead, which the paper rightly calls negligible.

This is the practical reason RadixAttention can be on by default. Workloads that happen to share prefixes get up to 6× speedup; workloads that don’t share anything pay essentially nothing.

4. Compressed FSM for Constrained Decoding

Section 4 of the paper. Smaller contribution than RadixAttention, but cleanly engineered.

4.1 Setup

The user provides a regex (or a JSON schema compiled to a regex). The runtime builds a deterministic FSM. At each decoding step, the FSM’s current state determines a set of allowed tokens; the logit mask zeros out everything else; the model samples; the FSM advances based on the sampled token.

The inefficiency: if the FSM’s current state has only one outgoing edge labelled s u m m a r y " : _ " (the prefix {"summary": " in the running example), the model has no freedom. Eleven separate forward passes are spent decoding eleven tokens that were determined the moment we entered this state.

4.2 The compressed FSM

Pre-process the FSM: any chain of states with single outgoing edges is fused into one super-edge labelled with the concatenation of the edges’ tokens. This is structurally identical to building a radix tree from the FSM (deliberate echo of Section 3, I think, but the paper doesn’t draw that parallel explicitly).

At runtime, when the model enters such a super-edge, the runtime decodes all the super-edge’s tokens in a single forward pass. Mechanically: the prefill kernel is reused with forced_tokens = [...]; the next sampling step picks up after the super-edge ends.

4.3 Why is this not trivial?

Two reasons it didn’t already exist in vLLM / TGI / TensorRT-LLM:

The FSM and the model runner are usually in different parts of the codebase. The mask is computed by a constraint library (Outlines, lm-format-enforcer); the runner sees the mask but does not see the constraint structure. To know “the next $k$ tokens are forced,” the runner must understand the FSM, not just consume a per-step mask.
The FSM is per-request. Each new structured generation has its own regex, and pre-processing the FSM has non-zero cost. SGLang preprocesses once per regex and caches the compressed FSM across all requests in the batch that use it. The paper notes that without batch-level FSM reuse, throughput drops by 2.4×.

The empirical effect: JSON decoding throughput goes up 1.6× with the compressed FSM enabled, on the same hardware and the same model.

4.4 Limitations

The technique applies only to forced subsequences — places where the regex has no branching. For free-form fields like "name": "<anything>", there is still one forward pass per token. The win is concentrated on long literal segments: JSON delimiters, schema keys, fixed boilerplate.

For workloads that are mostly free-form (e.g. natural-language summarization with a small JSON envelope), the per-request gain is modest. For workloads with heavy structural skeleton (deeply nested JSON, code generation with predictable scaffolding), the gain is larger. The paper’s JSON-decoding benchmark sits between the two extremes.

5. API Speculative Execution

Section 5 — the shortest section, applicable only to API-only models (GPT-4 etc.). Conceptually clean.

Suppose a program looks like

s += context + "name:" + gen("name", stop="\n") + "job:" + gen("job", stop="\n")

Naively this is two API calls, each charged for the input tokens of context. With API speculative execution, the first call ignores the stop="\n" argument and lets the model continue past the name. The interpreter parses the continuation against the upcoming template (“job:”) and, if the model has guessed correctly, fuses the second gen into the first. Two API calls become one, and the input-token cost of context is paid once instead of twice.

The accuracy of this speculation depends on prompt engineering: if context strongly cues the model to emit job: after the name, hits are common. The paper reports a 3× input-token cost reduction on a Wikipedia field-extraction task with few-shot prompting.

This is essentially template-aware speculative decoding for black-box APIs. The contribution is taking the idea out of model-internal speculation (where it requires logit access) into the API/template realm (where you only need the response text). It is small but useful.

6. Evaluation

Section 6. SGLang is implemented in PyTorch with custom CUDA kernels from FlashInfer and Triton.

6.1 Setup

Models. Llama-2 7B / 13B / 70B (dense), Mixtral-8x7B (MoE), LLaVA-v1.5-7B (image), LLaVA-NeXT-34B (video), GPT-3.5 (API).
Hardware. Mostly AWS EC2 g5 (A10G, 24 GB); some experiments on A100 (80 GB). Tensor parallelism for ≥13B models.
Baselines. Guidance v0.1.8 (llama.cpp), vLLM v0.2.5 (default API server), LMQL v0.7.3 (HF Transformers).
Workloads. Eleven: 5-shot MMLU, 20-shot HellaSwag, ReAct agents, generative agents, Tree-of-Thought (GSM8K), Skeleton-of-Thought (tip generation), LLM Judge (branch–solve–merge), JSON decoding, Multi-Turn Chat (short / long output), DSPy RAG pipeline.

6.2 Llama-7B results

Figure 5 (throughput) and Figure 6 (latency) of the paper. SGLang dominates on every workload that has prefix sharing. Specifically:

MMLU. Few-shot prefix is shared; cache hit rate is very high; SGLang ~3.4× vs vLLM.
HellaSwag. Two-level sharing — few-shot exemplars and the per-question multi-choice prefix. Strongest cache hit rate; SGLang ~5–6×.
ReAct / generative agents. Agent template + previous-turn context shared; gain comes from KV reuse across turns.
Tree-of-Thought / Skeleton-of-Thought. Within-program forks; gain comes from RadixAttention plus intra-program parallelism.
JSON decoding. Compressed FSM does the heavy lifting; 1.6× over single-token-FSM baseline.
Multi-Turn Chat (short). Chat history shared across turns; KV reuse cuts first-token latency.
Multi-Turn Chat (long). Almost no speedup — decode time dominates, sharing only affects prefill.
DSPy RAG pipeline. Context examples shared across pipeline calls; SGLang accelerates the DSPy backend transparently.

The “almost no speedup on long-output chat” is an honest disclosure. It tells you exactly where this technique stops working: anywhere prefill is a small fraction of end-to-end time, prefix sharing cannot help.

6.3 Mixtral-8x7B and Llama-70B

Figure 7 and Figure 12 (Appendix) — tensor-parallel runs across multiple A10Gs. Similar speedups (1.5–4×). The paper drops LMQL and Guidance from these comparisons because neither supports TP efficiently. SGLang’s TP scaling is consistent with the underlying inference engines (vLLM is the only competitive baseline), so the comparison is fair within that bracket.

Table 2: 6.4× throughput on LLaVA-v1.5-7B image benchmark, 5× on LLaVA-NeXT-34B video. The mechanism: hash the input image, use the hash as a key in the radix tree, and reuse the KV cache of the vision encoder’s image tokens whenever the same image is queried multiple times (common in llava-bench-in-the-wild).

This is a nice generalization. The “prefix” abstraction doesn’t have to be only language tokens; vision tokens are also prefix-cacheable.

6.5 Chatbot Arena production data

The strongest external validation: real Chatbot Arena traffic over one month.

LLaVA-Next-34B: 52.4% cache hit rate.
Vicuna-33B: 74.1% cache hit rate, 1.7× lower first-token latency on average.

These hit rates are not from controlled benchmarks; they come from the messy distribution of real user prompts. The fact that more than half of all token positions hit the cache is a strong indicator that prefix sharing is a fundamental property of real LM workloads, not an artefact of benchmark design.

6.6 Ablations (Figure 8)

Three sub-plots, each tells a separate story:

Cache hit rate vs latency / throughput. Almost linear: hit rate up → batch size up → throughput up → first-token latency down. The plot is generated by artificially throttling the hit rate, so it isolates the effect of cache hits from any other variable.
Cache hit rate vs first-token latency. Same story from the latency side. The slope shows that on tree-of-thought, every 10 percentage points of hit rate gain ~25 ms of TTFT.
Component ablation. Disabling tree structure, cache-aware scheduling, frontend parallelism, and frontend hints each individually costs 10–30% throughput on Tree-of-Thought / LLM-Judge / MMLU / Multi-Turn Chat (short). All four are necessary for the headline numbers.

The component ablation matters most. It would have been easy to attribute all the gain to “we added prefix caching”; the ablation refutes that and shows the gains compose.

6.7 Cost of RadixAttention when there’s no reuse

Already mentioned: 0.27% overhead on ShareGPT. This is the key result that justifies turning RadixAttention on by default.

7. Discussion

7.1 What this paper got right

Co-design. This is the rare paper where a frontend language and a runtime are designed against each other rather than retrofit. RadixAttention without the frontend’s fork() and shared-prefix semantics would still work but would lose the hint mechanism and the parallelism. The DSL without the runtime would still work but would not be faster than LMQL.
Composability with other engine optimizations. Continuous batching, PagedAttention, tensor parallelism, FlashAttention — RadixAttention sits orthogonal to all of them. The paper is explicit that it composes.
Honest negative results. The Multi-Turn Chat (long) result, where SGLang is essentially tied with vLLM, is in the same Figure 5 as the favorable results. The paper does not hide it. This is unusual.
The radix-tree abstraction. Once you see it, you cannot un-see it: the LM-program prefix-sharing problem and the routing-table longest-match problem are the same problem. Radix trees are the right data structure for both.

7.2 What I think is undersold

The fairness story. The paper says cache-aware scheduling can starve requests. That is true, but the paper does not provide a fairness baseline. In production, this matters — Chatbot Arena will not adopt a scheduler that starves users. A follow-up [42] addresses this.
The compiler mode. The compiler is mentioned but its results are only in Appendix D. For latency-sensitive single-program workloads, the compiler should help a lot; the paper underplays its potential.
The radix tree’s role in image / video. The Table 2 result is a one-paragraph subsection, but it generalizes the technique to non-text modalities in a way that almost everyone reading the paper misses on first pass.

7.3 Limitations

No fuzzy / semantic matching. Two prompts that mean the same thing but differ by a token cannot share KV cache. The paper acknowledges this as future work, and recent follow-ups (e.g. CacheBlend, RAG-Cache) address it with embedding-based proxies.
No DRAM / disk extension. The tree lives entirely in HBM. For very large prompt corpora (think a 200k-context RAG store), spilling to host memory or NVMe would be useful. The paper points at this in §8.
Greedy starvation. Already mentioned.
Limited applicability to long-output workloads. Anywhere decode dominates, RadixAttention helps only the small prefill fraction. Mitigations exist (speculative decoding, etc.) but they are outside the paper’s scope.
API speculative execution is dependent on prompt engineering. When the model fails to follow the template, the speculation is wasted bytes. The paper does not measure failure modes carefully.
The frontend is Python-only. Embedding in another host language (JavaScript, Rust) is non-trivial because the interpreter relies on Python’s threading model.

7.4 What this changed

In the eighteen months since the paper appeared, SGLang has become one of the dominant open-source serving stacks alongside vLLM and TensorRT-LLM. RadixAttention (in some form) has been ported into vLLM’s prefix caching. The compressed-FSM idea has been ported into Outlines and into TensorRT-LLM’s structured-output mode. The frontend DSL is used in production at multiple labs (Together, DeepSeek, Mistral) as the orchestration layer for agent workflows.

The conceptual influence is broader still. The paper helped clarify that LLM serving is not just inference; it is interpretation of programs whose forward passes happen to be neural. Once you accept that framing, the right abstraction is not “request” but “call graph,” and the right primitive is not “KV cache” but “tree-of-shared-extents.” Both abstractions show up explicitly in nearly every serving paper since.

8. Place in the literature

For my own purposes, here is how I file SGLang against the LLM-systems papers I have been reviewing on this blog:

vLLM (PagedAttention, 2023) — fixes intra-request KV-cache fragmentation. SGLang fixes inter-request KV-cache reuse. Complementary, both used in production simultaneously.
Orca (2022) — introduced iteration-level batching. SGLang inherits Orca’s batching loop and adds prefix-aware scheduling on top.
Sarathi-Serve (OSDI 2024) — fixes the throughput–latency tradeoff with chunked prefills and stall-free hybrid batches. Orthogonal to SGLang: a stack can do chunked prefills and RadixAttention. Recent SGLang versions integrate Sarathi-style scheduling.
DistServe (OSDI 2024) — disaggregates prefill and decode onto separate worker pools. Different design point; SGLang is the single-pool design, DistServe the two-pool design. The trade-off is interconnect bandwidth vs scheduler complexity.
HydraGen (2024) and ChunkedAttention (2024) — CUDA kernel optimizations for shared-prefix attention. Lower-level than SGLang; SGLang sits above them and decides what to share, not how to compute the shared attention.
Prompt Cache (2023) — modular prompt caching with intentional accuracy loss. SGLang’s exact reuse is lossless; Prompt Cache trades accuracy for higher reuse rates.
LMQL (2023), Guidance (2023) — frontend DSL ancestors. SGLang’s frontend borrows their primitives but invests heavily in the runtime, which is the lever the others lacked.
DSPy (2023) — high-level framework. SGLang is a backend; DSPy compiles down to SGLang.
FlexGen (ICML 2023) — high-throughput single-GPU inference via offloading. Orthogonal; SGLang assumes everything fits in HBM, FlexGen does not.

The most useful one-line summary I can offer: SGLang is to LM programs what a JIT compiler is to a dynamic language. The frontend captures program structure; the runtime exploits it. Once the analogy clicks, every individual contribution — RadixAttention, compressed FSM, API speculation, frontend hints — falls into place as a specific instance of “use known structure to skip work.”

9. Closing thoughts

Reading this paper after Sarathi-Serve and DistServe gave me a clearer mental model of where the LLM-serving stack is going. Three layers are crystallizing:

Kernels. FlashAttention, FlashInfer, HydraGen — the fastest possible primitives.
Engines. vLLM, TensorRT-LLM, SGLang Runtime — orchestrate kernels into iteration-level batches with paged memory.
Programs / orchestrators. SGLang frontend, DSPy, LangChain, LlamaIndex — describe what work the engine should do.

SGLang is the one paper I have read that contributes meaningfully to both (2) and (3), and treats them as a single design problem. Whether or not the SGLang code-base wins the long-run engine war (vLLM is also moving fast), the conceptual lesson — programs are first-class, prefixes are shared, schedule against the tree — is permanent.

The next interesting unanswered question, in my view, is what happens when “prefix” gets generalized to “semantic prefix.” Two prompts that mean the same thing but differ by one synonym share no token prefix and therefore no KV cache today. If we can compute an approximate KV-cache reuse for semantically-similar prefixes (with bounded accuracy loss, like Prompt Cache), the hit rate on real chatbot traffic could rise from 74% to >90%. That is the next 2× of throughput, and SGLang’s radix tree is the obvious place to plug it in.

Code and reproducible benchmarks: https://github.com/sgl-project/sglang. Paper: https://arxiv.org/abs/2312.07104.