June 21, 2026 EN #KV Cache #LLM Serving #Operating Systems

Tutti: GPU-Centric SSD-Backed KV Cache That Finally Makes SSDs Practical for Long-Context LLM Serving

Review date: 2026-06-21
Review author: Zhongzhu Zhou
Paper reviewed: Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Authors: Shi Qiu, Yifan Hu, Xintao Wang, Wenhao Zhu, Jianqin Yan, Hao Chen, Kaiqiang Xu, Kai Chen, Yiming Zhang
arXiv: 2605.03375
Status/Venue: arXiv preprint, May 2026

Short Answer

Tutti is a GPU-centric KV cache system that moves both the data path and I/O control path off the CPU and onto the GPU, enabling NVMe SSDs to achieve DRAM-like inference performance for long-context LLM serving. Its three core innovations — a GPU-native object store with Scatter Gather List addressing, a GPU io_uring mechanism that mirrors Linux’s asynchronous I/O subsystem, and a slack-aware scheduler that uses offline profiling to avoid bandwidth contention — collectively reduce Time-to-First-Token by 78.3% over GDS-enabled LMCache and cut serving cost by 27% by exploiting SSDs that are roughly 100× cheaper per GB than DRAM.

Prerequisites: What You Need to Know Before Reading

Before diving into Tutti’s design, this section lays out the background knowledge that makes the paper’s contributions legible. Readers comfortable with LLM inference internals and GPU storage systems may skip to the next section.

1. LLM Inference: Prefill and Decode

A Transformer-based LLM processes a request in two distinct phases:

Prefill: the entire input prompt is processed in one parallel forward pass. The model computes Query ( $Q$ ), Key ( $K$ ), and Value ( $V$ ) matrices for every token simultaneously. This phase is compute-bound and its latency is measured as Time to First Token (TTFT).
Decode: the model generates output tokens one at a time, autoregressively. At each step, attention is computed between the new query vector and all previously generated keys and values. Decode latency is measured as Inter-Token Latency (ITL).

The critical efficiency observation: during decode, recomputing the $K$ and $V$ matrices for every previous token at every step is extremely wasteful. Those matrices are deterministic given the input, so they can be cached and reused.

2. Key-Value (KV) Cache

The KV cache stores the $K$ and $V$ matrices computed for every token in the context so they do not need to be recomputed on subsequent decode steps. For a model with $L$ layers and $H$ attention heads of dimension $d$ , the KV cache for a sequence of $T$ tokens requires approximately:

\text{KV size} = 2 \times L \times T \times H \times d \times \text{dtype\_bytes} \tag{1}

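The factor of 2 accounts for both $K$ and $V$. For Llama3-8B ( $L=32$ , $H=32$ , $d=128$ ) with BF16 precision and $T=128{,}000$ tokens, Eq. (1) evaluates to about 67 GB, most of the 80 GB of HBM on an H100. (This counts all 32 heads; with the 8 KV heads that Llama3-8B's grouped-query attention actually stores, the same formula gives about 17 GB.)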
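Eq. (1) is easy to evaluate directly. The following is a minimal sketch; the function name `kv_cache_bytes` is illustrative, and the second call assumes that only the 8 grouped-query KV heads of Llama3-8B are stored.

```python
def kv_cache_bytes(num_layers: int, num_tokens: int, num_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Evaluate Eq. (1): 2 * L * T * H * d * dtype_bytes."""
    return 2 * num_layers * num_tokens * num_heads * head_dim * dtype_bytes


# Parameters quoted in the text for Llama3-8B, BF16 (2 bytes per element).
full_heads = kv_cache_bytes(num_layers=32, num_tokens=128_000,
                            num_heads=32, head_dim=128)

# Assumption: with grouped-query attention only the 8 KV heads are cached.
gqa_heads = kv_cache_bytes(num_layers=32, num_tokens=128_000,
                           num_heads=8, head_dim=128)

print(f"H=32: {full_heads / 1e9:.1f} GB")   # 67.1 GB
print(f"H=8:  {gqa_heads / 1e9:.1f} GB")    # 16.8 GB
```

Either way, a single long sequence consumes a large share of one GPU's HBM, which is what motivates tiering.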

Beyond single-session caching, modern inference engines exploit prefix caching: if many requests share a common prefix (e.g., a system prompt), the KV cache for that prefix is computed once and reused across all requests, reducing prefill cost by up to 90%.

3. Paged KV Memory Management

As sequences grow dynamically and requests arrive with varying lengths, fixed-size static KV cache allocations lead to severe memory fragmentation. The solution — pioneered by vLLM’s PagedAttention — is to divide the KV cache into fixed-size blocks (typically 16–32 tokens per block). Blocks are allocated on demand from a pool, similar to OS virtual memory paging.

Each block holds:

\text{Block shape} = [B_{\text{tokens}}, H, d] \tag{2}

where $B_{\text{tokens}}$ is the block size (e.g., 16 tokens). Critically, the blocks belonging to a single sequence are not contiguous in GPU memory; they are linked via a block table. This non-contiguity is the root cause of the problem Tutti addresses, as we will see.
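To make the non-contiguity concrete, here is a toy block allocator and block table. It is a simplified sketch, not vLLM's implementation; the class names `BlockPool` and `Sequence` are made up for illustration.

```python
class BlockPool:
    """Toy allocator that hands out physical KV block ids on demand."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        return self.free.pop()          # takes the highest free id


class Sequence:
    def __init__(self, pool: BlockPool, block_tokens: int = 16):
        self.pool = pool
        self.block_tokens = block_tokens
        self.block_table = []           # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed whenever the last one is full.
        if self.num_tokens % self.block_tokens == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def locate(self, token_idx: int):
        """Return (physical block id, slot within the block) for a token."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        return (self.block_table[token_idx // self.block_tokens],
                token_idx % self.block_tokens)


pool = BlockPool(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(32):                     # two sequences growing concurrently
    a.append_token()
    b.append_token()

print(a.block_table)                    # [7, 5]
print(b.block_table)                    # [6, 4]
print(a.locate(17))                     # (5, 1)
```

Because the two sequences grow at the same time, each one ends up with physical blocks that are interleaved with the other's, so offloading one sequence touches scattered regions of the pool.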

4. KV Cache Tiering: Why SSDs Are Necessary

GPU HBM (High Bandwidth Memory) is fast but small (80 GB per H100). When context windows grow to hundreds of thousands of tokens and concurrent session counts increase, HBM is quickly exhausted. Systems use a tiered hierarchy:

HBM (GPU): fastest, $\sim$ 3.35 TB/s bandwidth, $\sim$ 80 GB per H100
CPU DRAM: slower, $\sim$ 50–100 GB/s bandwidth, $\sim$ 2 TB server capacity
NVMe SSD: slowest, $\sim$ 7–30 GB/s sequential bandwidth per drive, $\sim$ 100 TB capacity per server

Evicted KV blocks that don’t fit in HBM are offloaded to DRAM or SSDs. When a new request needs an evicted prefix, the system restores it from the tier where it was stored. The latency of this restoration directly adds to TTFT.

5. Why DRAM Works But SSDs Don’t (Before Tutti)

DRAM works well because:

Fine-grained random access has low latency ( $\sim$ 100 ns)
CPU-GPU memory copies via cudaMemcpyAsync batched by layer are efficient
Layer-wise pipelining can hide transfer time behind attention computation

SSDs fail because of a fundamental mismatch between paged KV layout and SSD access patterns:

A 128K-token KV cache for Qwen3-32B ( $L=64$ , block size 64) fragments into $256{,}000$ scattered 80 KB objects
Each object must be individually addressed, creating hundreds of thousands of small random I/O requests
NVMe SSDs are optimized for sequential or large random I/O, not hundreds of thousands of small scattered transfers

6. NVMe I/O Architecture: Queues, Descriptors, and PRPs

NVMe (Non-Volatile Memory Express) was designed for flash SSDs to maximize parallelism. Its key architectural elements:

Submission Queue (SQ) and Completion Queue (CQ): software ring buffers where the host (traditionally CPU) places I/O commands and receives completions
Doorbell register: memory-mapped register that the host writes to signal the SSD controller that new commands are in the SQ
Physical Region Pages (PRP): the standard descriptor format for specifying which host memory pages contain the I/O data — a linked list of 4 KB page pointers

The problem: each NVMe I/O command requires the host to: (1) build PRP descriptors, (2) write the command to the SQ, (3) ring the doorbell, (4) wait for the completion entry in the CQ. Steps 1–4 are CPU operations. For thousands of concurrent small I/Os, this creates severe CPU serialization overhead.
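The four steps can be simulated on the host to show who does what. This is a toy model of one queue pair, not driver code: `ToyQueuePair` and its fields are hypothetical, and the "controller" is just a method that stands in for the SSD.

```python
from collections import deque


class ToyQueuePair:
    """Host-side simulation of one submission/completion queue pair."""

    def __init__(self, depth: int):
        self.depth = depth
        self.sq = [None] * depth        # submission ring
        self.sq_tail = 0                # host writes commands here
        self.sq_head = 0                # controller consumes from here
        self.doorbell = 0               # last tail value published to the controller
        self.cq = deque()               # completions posted by the controller

    def submit(self, command: dict) -> None:
        """Steps 1-2: build the command and place it in the SQ."""
        if (self.sq_tail + 1) % self.depth == self.sq_head:
            raise BufferError("submission queue full")
        self.sq[self.sq_tail] = command
        self.sq_tail = (self.sq_tail + 1) % self.depth

    def ring_doorbell(self) -> None:
        """Step 3: publish the new tail so the controller sees the commands."""
        self.doorbell = self.sq_tail

    def controller_process(self) -> None:
        """Stand-in for the SSD: consume up to the doorbell, post completions."""
        while self.sq_head != self.doorbell:
            cmd = self.sq[self.sq_head]
            self.cq.append({"cid": cmd["cid"], "status": 0})
            self.sq_head = (self.sq_head + 1) % self.depth

    def reap(self) -> list:
        """Step 4: collect completion entries."""
        done = list(self.cq)
        self.cq.clear()
        return done


qp = ToyQueuePair(depth=8)
for cid in range(3):
    qp.submit({"cid": cid, "opcode": "read"})

qp.controller_process()                 # no effect: the doorbell has not been rung
before = qp.reap()

qp.ring_doorbell()
qp.controller_process()
completed = [c["cid"] for c in qp.reap()]
print(before, completed)                # [] [0, 1, 2]
```

Every method here runs on whichever processor owns the queue pair. In the conventional stack that is the CPU, which is the serialization point the rest of the review keeps returning to.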

7. GPU Direct Storage (GDS) and Its Limits

NVIDIA’s GPU Direct Storage (GDS) attempts to solve one part of this: it establishes a direct DMA path from SSDs to GPU HBM, bypassing the CPU bounce buffer in DRAM. This sounds ideal, but GDS still requires the CPU to initiate every I/O request — it only removes the CPU from the data path, not the I/O control path. The CPU remains a bottleneck for I/O submission and completion signaling, especially at the high I/O parallelism that LLM inference requires.

8. Linux io_uring and Asynchronous I/O

The Linux io_uring subsystem (introduced in Linux 5.1) is the state-of-the-art asynchronous I/O interface for userspace applications. Its design uses two ring buffers shared between kernel and userspace:

Submission Queue (SQ): the application writes I/O requests here
Completion Queue (CQ): the kernel writes completions here

This lock-free ring buffer design achieves near-zero overhead for submitting and reaping I/O. Tutti’s key idea is to implement an analogous mechanism on the GPU, called gio_uring, so the GPU can autonomously submit and reap NVMe I/O without CPU involvement.

9. NVIDIA Green Contexts: GPU Resource Isolation

NVIDIA’s green context feature (available on Hopper and later GPUs) provides hardware-level resource isolation within a single GPU. It allows partitioning the GPU’s Streaming Multiprocessors (SMs) into independent domains: one for LLM computation and another for I/O control kernels. Without this isolation, a long-running I/O polling kernel can monopolize SMs and block latency-critical compute kernels, since the GPU’s hardware scheduler is largely non-preemptive.

Paper Overview: The Problem and Solution in One Picture

graph TB
    subgraph CPU-Centric["CPU-Centric (LMCache + GDS)"]
        direction LR
        A1["Inference Engine\n(GPU)"] -->|"Initiate I/O\n(per block)"| B1["CPU"]
        B1 -->|"GDS DMA"| C1["NVMe SSD"]
        C1 -->|"Data → HBM\n(direct)"| A1
    end

    subgraph GPU-Centric["GPU-Centric (Tutti)"]
        direction LR
        A2["Inference Engine\n(GPU)"] -->|"Load IOCB\nonce per layer"| B2["CPU"]
        A2 -->|"Issue I/O via\ngio_uring"| C2["NVMe SSD"]
        C2 -->|"Data → HBM\n(P2P DMA)"| A2
    end

    style CPU-Centric fill:#ffcccc,stroke:#cc0000
    style GPU-Centric fill:#ccffcc,stroke:#00cc00

Figure 1: CPU-centric vs. GPU-centric KV cache storage. In LMCache+GDS, the CPU is on the critical I/O control path for every block. In Tutti, the CPU only loads I/O kernels once per layer; the GPU drives all NVMe commands directly.

The core insight: the CPU’s involvement is a bottleneck not because it handles data (GDS already fixed that), but because it handles every I/O control request — preparing descriptor addresses, submitting to NVMe queues, and receiving completions — for each of the potentially hundreds of thousands of KV blocks per request. Tutti eliminates this by giving the GPU its own I/O control mechanism.

How Existing Tiered KV Cache Systems Work (and Fail)

Before understanding Tutti, it helps to understand exactly what LMCache does and why it fails at scale. LMCache is the state-of-the-art KV cache offloading system that Tutti improves upon. Its architecture:

DRAM tier (LMCache-DRAM-LW): when KV blocks are evicted from HBM, they are batched and transferred to CPU DRAM via cudaMemcpyAsync. On restoration, the reverse transfer happens. LMCache applies layer-wise pipelining: while the GPU computes layer $l$ , it simultaneously copies KV data for layer $l+1$ from DRAM. This overlap is effective because DRAM-HBM bandwidth (~50 GB/s) is sufficient to transfer one layer’s KV before the next layer’s attention completes.
SSD tier (LMCache-SSD): KV blocks are serialized from GPU HBM → CPU DRAM → NVMe SSD (via the filesystem). Restoration reverses the path. The problem is twofold: (a) all the copy overhead that applies to DRAM, plus (b) the far lower SSD bandwidth and high per-I/O overhead.
SSD tier with GDS (LMCache-GDS): uses NVIDIA cuFile to enable direct SSD→HBM DMA, bypassing the DRAM bounce buffer. But cuFile still requires the CPU to prepare cuFile transfer descriptors and call cuFileRead/cuFileWrite for each transfer. At the scale of 256,000 objects per request, this CPU overhead is the bottleneck.
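The layer-wise pipelining described for the DRAM tier can be captured in a small timing model, which also shows what happens when the storage tier is too slow to keep up. This is a sketch under simplifying assumptions (one transfer in flight at a time, fixed per-layer times); the millisecond values are illustrative, not measurements.

```python
def pipelined_prefill_ms(compute_ms, transfer_ms):
    """Total time when layer l+1's KV is fetched while layer l computes.

    Layer 0's KV cannot overlap with any compute, so it is paid up front.
    """
    assert len(compute_ms) == len(transfer_ms)
    total = transfer_ms[0]
    for l, c in enumerate(compute_ms):
        next_fetch = transfer_ms[l + 1] if l + 1 < len(transfer_ms) else 0.0
        total += max(c, next_fetch)     # the slower of the two sets the pace
    return total


def bubble_ms(compute_ms, transfer_ms):
    """GPU idle time: total pipelined time minus pure compute time."""
    return pipelined_prefill_ms(compute_ms, transfer_ms) - sum(compute_ms)


layers = 32
fast_tier = bubble_ms([3.5] * layers, [1.0] * layers)    # transfer < compute
slow_tier = bubble_ms([3.5] * layers, [10.0] * layers)   # transfer > compute

print(f"bubble with fast tier: {fast_tier:.1f} ms")      # 1.0 ms
print(f"bubble with slow tier: {slow_tier:.1f} ms")      # 211.5 ms
```

When each transfer fits inside the preceding layer's compute, only the first fetch is exposed. Once transfers take longer than compute, the GPU waits on nearly every layer, which is the bubble time the paper measures.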

The paper’s measurement (Figure 2) is striking: with LMCache-GDS on vLLM v0.17.0, GPU bubble time at a 75% hit rate reaches 72.3% of total inference latency. The SSD tier ends up slower than recomputation, at 9.4 seconds total versus 1.7 seconds for DRAM.

The Root Cause: Three Bottlenecks in Tiered KV Cache

Bottleneck 1: I/O Fragmentation from Paged Layout

The paged KV memory layout is fundamental to efficient GPU memory management. But it creates a severe problem when combined with SSD I/O. Consider evicting the KV cache for a 128K-token sequence from a 64-layer model with block size 64:

\text{Num blocks} = \frac{128{,}000}{64} = 2{,}000 \text{ blocks} \tag{3}

\text{Num objects} = 2 \times 64 \times 2{,}000 = 256{,}000 \text{ objects (K and V per layer)} \tag{4}

Each object ( $K$ or $V$ for one layer, one block) is approximately 80 KB. Restoring this sequence requires fetching 256,000 randomly scattered 80 KB objects. Even at 30 GB/s peak SSD bandwidth, the overhead from CPU-side I/O submission for each object dominates over raw data transfer time.
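The counts in Eqs. (3) and (4) and the raw transfer time can be checked with a few lines. This is a back-of-the-envelope sketch: the 80 KB object size and 30 GB/s bandwidth are the figures quoted above, and the "ideal" time deliberately ignores all per-I/O overhead.

```python
def paged_kv_objects(num_tokens: int, block_tokens: int, num_layers: int) -> int:
    """Eqs. (3)-(4): one K and one V object per layer per block."""
    num_blocks = num_tokens // block_tokens
    return 2 * num_layers * num_blocks


num_objects = paged_kv_objects(num_tokens=128_000, block_tokens=64, num_layers=64)
object_bytes = 80 * 1024                # ~80 KB per object, as quoted in the text
total_bytes = num_objects * object_bytes
ideal_seconds = total_bytes / 30e9      # pure transfer at 30 GB/s, no per-I/O cost

print(num_objects)                                                   # 256000
print(f"{total_bytes / 1e9:.1f} GB in {ideal_seconds:.2f} s ideal")  # 21.0 GB in 0.70 s ideal
```

Under these assumptions the data itself would move in well under a second, so multi-second restore latencies have to come from per-request overhead rather than from bandwidth.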

Bottleneck 2: CPU Serialization in GDS

GPU Direct Storage removes CPU from the data plane (no DRAM bounce buffer), but the CPU still:

Receives an I/O request from the inference engine for each block
Computes Physical Region Pages (PRP) descriptors for that block
Submits the NVMe command to the submission queue
Waits for (or polls) the completion queue
Signals the GPU that data is ready

Steps 2–4 happen for every block, serially on the CPU. With 256,000 objects to restore, this creates massive queuing delay. The paper measures GPU bubble time (time the GPU spends waiting for I/O) at 70–80% of total inference latency with GDS.

Bottleneck 3: Read/Write Bandwidth Contention

During KV cache prefill, the system simultaneously:

Reads previously evicted KV for the current request (to restore its prefix)
Writes newly computed KV to SSD (for future evictions)

This concurrent read/write causes a 60% bandwidth collapse on NVMe SSDs. The reason is that large-block reads and writes contend for the SSD’s internal write buffer and cache structures, degrading effective bandwidth from ~30 GB/s to ~12 GB/s combined.

Tutti’s Design: Three Interlocking Solutions

Design 1: GPU-Native Object Store (§3.1)

The goal is to give the GPU a way to access KV cache objects stored on NVMe without CPU intervention on the critical path. This requires solving two sub-problems: object management (indexing, allocation) and physical addressing (translating virtual GPU addresses to NVMe-visible physical addresses).

GPU File Pool

Tutti extends GeminiFS (a companion GPU file system) with a GPU file pool that aligns storage allocation with vLLM’s KV block manager. The key design decisions:

Each memory block (covering $B_{\text{tokens}}$ tokens) maps to one GPU file
Each GPU file contains $2L$ objects (one $K$ and one $V$ object per layer)
Objects follow the Tensor-Stripe layout: each GPU file maps to multiple NVMe files, striped along tensor granularity (not fine-grained storage pages)
Multiple GPU files are distributed across SSDs via round-robin placement

This design ensures that I/O granularity naturally aligns with KV transfer granularity ( $\text{block\_size} \times H \times d$ ), avoiding the mismatch between tiny NVMe I/Os and large KV transfers.

Critically, metadata management (allocation, indexing, hash tables) stays on the CPU but operates off the critical path: at startup and during idle periods, not during active inference. At inference time, the GPU only needs to look up a pre-built P2P mapping table.

P2P Memory Mapping Table with SGL

The second sub-problem is physical addressing: when the GPU submits an NVMe I/O command, it must provide the physical addresses of the target HBM pages. The standard NVMe mechanism for this is Physical Region Pages (PRP): a list of pointers, one per 4 KB page.

For a 60 GB KV cache pool:

\text{Total 4KB pages} = \frac{60 \times 1024^3}{4096} = 15{,}728{,}640 \text{ pages} \tag{5}

If each 64 KB region gets its own 4 KB PRP list page (holding 16 page pointers):

\text{PRP list pages} = \frac{15{,}728{,}640}{16} = 983{,}040 \text{ pages} \approx 3.75 \text{ GB HBM overhead} \tag{6}

3.75 GB of HBM wasted just on address metadata is unacceptable. Tutti uses Scatter Gather Lists (SGL) instead. An SGL entry describes a contiguous memory region with just 16 bytes (physical address + length + identifier):

\text{SGL memory} = 983{,}040 \times 16 \text{ B} = 15 \text{ MB HBM overhead} \tag{7}

A 256× reduction. More importantly, SGL enables bulk transfers: instead of one PRP entry per 4 KB page (generating fine-grained I/Os), a single SGL entry covers a large contiguous HBM region, matching the natural $\sim$ 100 KB KV cache object size.

The improvement is dramatic: in a microbenchmark, SGL achieves 8.9 GB/s read bandwidth vs. PRP’s 0.29 GB/s — a 31× improvement.
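The metadata arithmetic in Eqs. (5) to (7) can be reproduced as follows. The sketch assumes, as the 3.75 GB figure implies, that each PRP list occupies a full 4 KB page and that one 16-byte SGL entry replaces each such list.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
PAGE = 4096                             # 4 KB memory page

pool_bytes = 60 * GIB
data_pages = pool_bytes // PAGE         # Eq. (5)
prp_list_pages = data_pages // 16       # Eq. (6): 16 page pointers per list
prp_bytes = prp_list_pages * PAGE       # each list occupies one 4 KB page
sgl_bytes = prp_list_pages * 16         # Eq. (7): one 16-byte SGL entry per region

print(data_pages)                       # 15728640
print(prp_bytes / GIB)                  # 3.75
print(sgl_bytes / MIB)                  # 15.0
print(prp_bytes // sgl_bytes)           # 256
```

The ratio is simply 4096 / 16, so it does not depend on the pool size.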

Algorithm 1: GPU File Store/Retrieve Interface

Procedure: retrieve_layer(layer_id, block_ids[], output_hbm[])
  For each block b in block_ids:
    gpu_file_id = cpu_block_to_file[b]         // CPU-managed hash table
    sgl_entry = p2p_table[b][layer_id]         // Pre-computed P2P mapping
    ioctx = allocate_ioctx(gio_uring)
    ioctx.sgl = sgl_entry
    ioctx.offset = gpu_file_id * layer_stride + layer_id * object_size
    ioctx.len = object_size
    enqueue_to_sq(gio_uring, ioctx)
  End for
  // GPU executes all IOCTXs concurrently via gio_uring
  wait_cqe(gio_uring, block_ids)

This reduces CPU overhead from $O(\text{layers} \times \text{blocks})$ to $O(\text{layers})$ — one CPU operation per layer to load the I/O kernels, after which the GPU executes everything.

graph LR
    subgraph HBM["GPU HBM"]
        KB["KV Block Pool\n(paged, non-contiguous)"]
        P2P["P2P Mapping Table\n(SGL entries, 15 MB)"]
        GFP["GPU File Pool\n(file → block mapping)"]
        SQ["SQ Ring Buffer"]
        CQ["CQ Ring Buffer"]
    end

    subgraph SSD["NVMe SSD"]
        NF["NVMe Files\n(GeminiFS physical extents)"]
    end

    subgraph CPU["CPU (off critical path)"]
        HT["Hash Table\n(block → GPU file ID)"]
        KM["Kernel Loader\n(once per layer)"]
    end

    KB --> P2P
    P2P --> SQ
    GFP --> SQ
    SQ -->|"P2P DMA"| NF
    NF -->|"Direct → HBM"| KB
    NF --> CQ
    HT -.->|"lookup at startup"| GFP
    KM -.->|"once per layer"| SQ

    style CPU fill:#fff3cd,stroke:#ffc107
    style HBM fill:#d1ecf1,stroke:#17a2b8
    style SSD fill:#d4edda,stroke:#28a745

Figure 2: GPU-centric KV cache object store layout. CPU manages metadata off the critical path; GPU drives all NVMe I/O via SGL descriptors in ring buffers.

Design 2: GPU io_uring (gio_uring) (§3.2)

Even with the GPU file pool and P2P table, the GPU needs a mechanism to asynchronously issue and track NVMe commands without blocking compute. This is the GPU analog of Linux’s io_uring.

Ring Buffer Architecture

gio_uring uses two ring buffers resident in GPU HBM but mapped to the CPU via non-cached mmap:

Submission Queue (SQ): GPU writes I/O commands here; NVMe controller reads them
Completion Queue (CQ): NVMe controller writes completion events; GPU polls here

Each SQ entry is an I/O Control Block (IOCB), containing 2048 I/O Contexts (IOCTXs). An IOCTX records:

SGL address (physical pointer, 8 bytes)
GPU file offset (4 bytes)
I/O length (4 bytes)

The batching structure (2048 IOCTXs per IOCB) matches the GPU’s minimum scheduling unit. On an H100 (2 SMs per unit), each SM supports 64 warps × 32 threads = 2,048 concurrent threads. By aligning IOCTX count with thread count, Tutti maximizes parallelism.
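The 16-byte IOCTX record and the IOCB batching can be mocked up with Python's `struct` module. The field order and little-endian encoding are assumptions made for illustration; the review only gives the field sizes.

```python
import math
import struct

# Hypothetical layout: 8-byte SGL address, 4-byte file offset, 4-byte length.
IOCTX = struct.Struct("<QII")
IOCTX_PER_IOCB = 2048


def pack_ioctx(sgl_addr: int, file_offset: int, length: int) -> bytes:
    """Serialize one I/O context into its 16-byte record."""
    return IOCTX.pack(sgl_addr, file_offset, length)


def iocbs_needed(num_ioctx: int) -> int:
    """Number of IOCBs required to carry num_ioctx contexts."""
    return math.ceil(num_ioctx / IOCTX_PER_IOCB)


record = pack_ioctx(sgl_addr=0x7F00_0000_0000, file_offset=3, length=80 * 1024)

print(IOCTX.size)                       # 16
print(IOCTX.size * IOCTX_PER_IOCB)      # 32768 bytes of metadata per full IOCB
print(iocbs_needed(8_000))              # 4
```

A full IOCB is therefore 32 KB of metadata, and a batch of a few thousand requests fits in a handful of IOCBs.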

Algorithm 2: gio_uring Async I/O Lifecycle

Phase 1: Init (CPU, once at startup)
  init_queue(depth):
    Allocate SQ[depth] and CQ[depth] in HBM
    mmap SQ and CQ to CPU virtual address space (non-cached)
    Pre-register all NVMe queue pairs to GPU

Phase 2: Prepare (CPU, once per layer per request)
  get_iocb(nums, event):
    Retrieve nums available IOCBs from SQ
    For each IOCB: fill IOCTXs from CPU-side virtual addresses
    Insert CUDA event: I/O kernel starts only after dependency satisfied
    Return IOCB_ids

Phase 3: Issue (GPU kernel)
  issue_io(IOCB_ids, SMs):
    Launch I/O kernel on dedicated SM partition
    For each IOCTX in IOCB:
      Translate SGL virtual → physical address
      Enqueue NVMe command to SQ
      Ring NVMe doorbell register
    Poll CQ for completions (non-blocking, separate from compute)
    On completion: atomically write IOCB_id to CQ

Phase 4: Wait (GPU compute kernel)
  wait_cqe(IOCB_ids):
    Check CQ for specific IOCB_id
    Return when all requested IOCBs completed
    No CPU participation required

SM Partitioning via NVIDIA Green Contexts

A naive implementation of gio_uring would have the I/O polling kernel compete with the attention kernel for SMs. Since the GPU hardware scheduler is non-preemptive, a long-running I/O kernel can delay a critical compute kernel indefinitely.

Tutti uses NVIDIA green contexts to partition SMs into two isolated domains:

Compute Domain: runs attention, GEMM, normalization kernels
I/O Control Domain: runs gio_uring submission and polling kernels

The I/O control kernel runs on a fixed set of dedicated SMs, completely isolated from compute resource fluctuations. This provides deterministic Quality of Service and eliminates long-tail latency from SM contention.

sequenceDiagram
    participant CPU as CPU Runtime
    participant IO_SM as I/O Control SMs
    participant Compute_SM as Compute SMs
    participant NVMe as NVMe SSD

    CPU->>IO_SM: Load I/O kernels (once per layer)
    activate IO_SM

    Note over Compute_SM: Layer N-1 attention (in progress)

    IO_SM->>NVMe: Enqueue read commands (SGL, doorbell)
    NVMe-->>IO_SM: Completions (via CQ)
    IO_SM->>IO_SM: Write IOCB_id to CQ

    Note over Compute_SM: Layer N attention waits for KV

    Compute_SM->>IO_SM: wait_cqe(IOCB_ids)
    IO_SM-->>Compute_SM: KV data ready in HBM
    activate Compute_SM
    Note over Compute_SM: Layer N attention (executes)
    deactivate Compute_SM
    deactivate IO_SM

Figure 3: SM-partitioned gio_uring execution model. I/O control and compute run in parallel on isolated SM partitions. The compute kernel only blocks at wait_cqe, which completes as soon as the I/O kernel confirms data arrival.

Design 3: Slack-Aware I/O Scheduler (§3.3)

Even with gio_uring and SM partitioning, two interference problems remain. The slack-aware scheduler addresses both.

Problem 1: Read/Write Bandwidth Contention

During layer-wise pipelining, the natural execution order simultaneously reads KV for the next layer while writing KV from the previous layer. This concurrent read/write causes a 60.1% bandwidth collapse:

Separate read: ~29 GB/s
Separate write: ~12 GB/s
Concurrent read + write: ~12 GB/s combined (the write bandwidth effectively drops to zero)

The root cause: large-block reads and writes compete for the NVMe’s internal write buffer and cache structures, causing write stalls that also degrade read performance.
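A rough model shows why separating reads from writes pays off. It uses the bandwidth figures listed above and assumes that total throughput stays at the combined figure for as long as reads and writes overlap; the data volumes are illustrative, not from the paper.

```python
READ_GBPS = 29.0        # reads alone
WRITE_GBPS = 12.0       # writes alone
MIXED_GBPS = 12.0       # combined throughput when reads and writes overlap


def serialized_seconds(read_gb: float, write_gb: float) -> float:
    """Reads first at full read bandwidth, then writes at full write bandwidth."""
    return read_gb / READ_GBPS + write_gb / WRITE_GBPS


def concurrent_seconds(read_gb: float, write_gb: float) -> float:
    """Assumption: while both are in flight, total throughput is MIXED_GBPS."""
    return (read_gb + write_gb) / MIXED_GBPS


read_gb, write_gb = 10.0, 2.0           # illustrative volumes
print(f"serialized: {serialized_seconds(read_gb, write_gb):.2f} s")   # 0.51 s
print(f"concurrent: {concurrent_seconds(read_gb, write_gb):.2f} s")   # 1.00 s
```

In this simplified model, issuing the two streams back to back finishes in about half the time of issuing them together, and the reads on the critical path complete far earlier.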

Problem 2: I/O Kernel SM Contention

Even with SM partitioning, if the I/O control kernel is launched too aggressively, it can consume SMs that the compute kernel occasionally needs for small operators (embedding layers, normalization, projection layers). These operators have lower compute intensity than attention and can spill onto available SMs; if those are occupied by I/O, compute latency increases.

The Scheduler Solution: Offline Profiling + Lookup Table

The key insight: prefill computation time varies predictably with input and prefix length, because attention cost grows with the product of new-token count and context length, while the remaining operators scale only with the number of new tokens. This predictability makes offline profiling feasible.

Tutti profiles each model layer offline, generating a slack table indexed by $(L_{\text{input}}, L_{\text{prefix}})$ :

\text{SlackTable}[L_{\text{input}}][L_{\text{prefix}}] = (\text{window\_duration}, \text{SM\_budget}, \text{max\_IOCBs}) \tag{8}

Each entry records:

The duration of the schedulable slack window (gap between compute kernels)
Available SM budget during that window
Maximum number of IOCBs that can be launched without impacting compute
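One plausible way to represent such a table is a dictionary keyed by profiled length buckets. The sketch below is hypothetical: the bucket grid, the placeholder values, and the round-down policy are all assumptions, since the review does not say how the paper discretizes the table.

```python
from bisect import bisect_right
from typing import NamedTuple


class SlackEntry(NamedTuple):
    window_ms: float    # schedulable slack window
    sm_budget: int      # SMs the I/O kernel may use
    max_iocbs: int      # IOCBs that fit without delaying compute


# Hypothetical profiling grid; a real table would come from offline measurement.
INPUT_BUCKETS = [1_024, 8_192, 65_536]
PREFIX_BUCKETS = [0, 16_384, 65_536]

# Placeholder values only, growing with both indices.
SLACK_TABLE = {
    (i, p): SlackEntry(window_ms=0.5 * (a + 1) * (b + 1),
                       sm_budget=4,
                       max_iocbs=40 * (a + 1) * (b + 1))
    for a, i in enumerate(INPUT_BUCKETS)
    for b, p in enumerate(PREFIX_BUCKETS)
}


def bucket(value: int, buckets: list) -> int:
    """Round down to the nearest profiled point (clamped to the smallest)."""
    idx = bisect_right(buckets, value) - 1
    return buckets[max(idx, 0)]


def lookup(l_input: int, l_prefix: int) -> SlackEntry:
    key = (bucket(l_input, INPUT_BUCKETS), bucket(l_prefix, PREFIX_BUCKETS))
    return SLACK_TABLE[key]


entry = lookup(l_input=10_000, l_prefix=64_000)
print(entry)   # SlackEntry(window_ms=2.0, sm_budget=4, max_iocbs=160)
```

Rounding down is a conservative choice if slack does not shrink as lengths grow, because the scheduler then never assumes more slack than was profiled.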

Algorithm 3: Slack-Aware I/O Scheduling (Prefill Phase)

Input: Request with L_input tokens, L_prefix cached tokens

At request arrival:
  priority = READ   // reads are on critical path

For each layer l = 0 to L-1:
  slack = SlackTable[L_input][L_prefix][l]

  // Schedule reads first (critical path)
  if READ queue non-empty:
    n_read = min(len(READ_queue), slack.max_IOCBs)
    issue_io(read_iocbs[:n_read], slack.SM_budget)

  // Schedule writes only if reads done and slack remains
  if WRITE queue non-empty and slack.remaining_IOCBs > 0:
    n_write = min(len(WRITE_queue), slack.remaining_IOCBs)
    issue_io(write_iocbs[:n_write], slack.SM_budget)
  else:
    // Defer writes to decode phase (best-effort)
    defer_writes_to_decode()

  // Execute compute kernel for layer l
  attention_and_ffn(l)
  wait_cqe(issued_iocb_ids)

Algorithm 4: Slack-Aware I/O Scheduling (Decode Phase)

For each decode step:
  slack = decode_slack_profile[current_length]

  // Issue any remaining writes opportunistically
  if WRITE queue non-empty and slack.window_exists:
    n_write = min(len(WRITE_queue), slack.max_IOCBs)
    issue_io(write_iocbs[:n_write], slack.SM_budget)

  // Defer remaining writes to next decode steps
  // No reads needed (decode KV is always in HBM)
  generate_next_token()

The decoupled scheduling guarantees:

Reads always have higher priority than writes during prefill
Concurrent read/write is forbidden (no bandwidth collapse)
I/O kernel SM usage is bounded by slack.SM_budget (no compute interference)
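The first two guarantees can be illustrated with a tiny scheduler that fills each slack window with either reads or writes, never both. This is a stricter simplification than Algorithms 3 and 4, chosen to make the "no concurrent read/write" rule obvious; the function name and queue contents are hypothetical.

```python
from collections import deque


def schedule_window(read_q: deque, write_q: deque, max_iocbs: int):
    """Choose the IOCBs for one slack window.

    Reads take priority; writes are issued only in windows with no pending
    reads, so a window never mixes the two.
    """
    if read_q:
        n = min(len(read_q), max_iocbs)
        return "read", [read_q.popleft() for _ in range(n)]
    if write_q:
        n = min(len(write_q), max_iocbs)
        return "write", [write_q.popleft() for _ in range(n)]
    return "idle", []


read_q = deque(f"r{i}" for i in range(5))
write_q = deque(f"w{i}" for i in range(3))

timeline = []
while read_q or write_q:
    timeline.append(schedule_window(read_q, write_q, max_iocbs=2))

for kind, batch in timeline:
    print(kind, batch)
# read ['r0', 'r1']
# read ['r2', 'r3']
# read ['r4']
# write ['w0', 'w1']
# write ['w2']
```

The third window is only half full, which shows the cost of the stricter rule: spare budget after the last reads goes unused rather than being shared with writes.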

graph LR
    subgraph L0["Layer 0 (t=0..30ms)"]
        R0["Read KV\n(t=0..20ms)"] 
        C0["Compute\n(t=0..30ms)"]
    end
    subgraph L1["Layer 1 (t=30..65ms)"]
        R1["Read KV\n(t=30..50ms)"]
        C1["Compute\n(t=30..65ms)"]
    end
    subgraph L2["Layer 2 (t=65..95ms)"]
        R2["Read KV\n(t=65..80ms)"]
        C2["Compute\n(t=65..95ms)"]
    end
    W["Write KV to SSD\n(deferred, t=80..100ms)"]

    C0 --> R1
    C1 --> R2
    R2 --> W

    style R0 fill:#d1ecf1,stroke:#17a2b8
    style R1 fill:#d1ecf1,stroke:#17a2b8
    style R2 fill:#d1ecf1,stroke:#17a2b8
    style C0 fill:#d4edda,stroke:#28a745
    style C1 fill:#d4edda,stroke:#28a745
    style C2 fill:#d4edda,stroke:#28a745
    style W fill:#fff3cd,stroke:#ffc107

Figure 4: Tutti’s slack-aware layer-wise pipeline. Reads overlap with compute within each layer’s slack window; writes are deferred to later slots that don’t overlap with read I/O, preventing bandwidth contention.

Key Equations and Derivations

Cost Per Million Tokens

The paper defines the serving cost as:

\text{Cost}_{1M} = \frac{P_{\text{GPU}} \cdot N_{\text{GPU}} + P_{\text{mem}} \cdot S_{\text{mem}} + P_{\text{ssd}} \cdot S_{\text{ssd}}}{\text{Throughput (tokens/hour)}} \times 10^6 \tag{9}

Where:

$P_{\text{GPU}}$ = $5/hour per H100
$P_{\text{mem}}$ = $0.0088/GB/hour for DRAM
$P_{\text{ssd}}$ = $0.000082/GB/hour for NVMe SSD

The ratio $P_{\text{mem}} / P_{\text{ssd}} \approx 107$: DRAM costs about 107 times more per GB than SSD. When Tutti enables SSDs to match DRAM throughput (by eliminating I/O overhead), the denominator (throughput) remains the same, but the numerator drops dramatically: a 14 TB SSD volume costs about \$1.15/hour, versus about \$12.3/hour for 1.4 TB of DRAM.
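Eq. (9) and the hourly figures above translate directly into code. The prices are the ones listed; the throughput value is hypothetical and only serves to show how the formula is applied.

```python
P_GPU = 5.0          # $/hour per H100
P_MEM = 0.0088       # $/GB/hour, DRAM
P_SSD = 0.000082     # $/GB/hour, NVMe SSD


def cost_per_million_tokens(num_gpus: int, dram_gb: float, ssd_gb: float,
                            tokens_per_hour: float) -> float:
    """Eq. (9): hourly hardware cost divided by hourly throughput, per 1M tokens."""
    hourly = P_GPU * num_gpus + P_MEM * dram_gb + P_SSD * ssd_gb
    return hourly / tokens_per_hour * 1e6


ratio = P_MEM / P_SSD
ssd_hourly = P_SSD * 14_000             # 14 TB of SSD
dram_hourly = P_MEM * 1_400             # 1.4 TB of DRAM
print(f"DRAM/SSD price ratio: {ratio:.0f}x")        # 107x
print(f"14 TB SSD:   ${ssd_hourly:.2f}/hour")       # $1.15/hour
print(f"1.4 TB DRAM: ${dram_hourly:.2f}/hour")      # $12.32/hour

# Hypothetical throughput, identical for both configurations.
tokens_per_hour = 50e6
dram_cfg = cost_per_million_tokens(1, dram_gb=1_400, ssd_gb=0,
                                   tokens_per_hour=tokens_per_hour)
ssd_cfg = cost_per_million_tokens(1, dram_gb=0, ssd_gb=14_000,
                                  tokens_per_hour=tokens_per_hour)
print(f"DRAM config: ${dram_cfg:.3f} per 1M tokens")
print(f"SSD config:  ${ssd_cfg:.3f} per 1M tokens")
```

With equal throughput, the cost difference comes entirely from the storage term, since the GPU term is the same in both configurations.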

GPU Bubble Time Analysis

The crossover point between compute-bound and I/O-bound operation occurs when:

T_{\text{compute}}^{(l)} = T_{\text{transfer}}^{(l)} \tag{10}

For layer $l$ with prefix of $L_{\text{prefix}}$ tokens and new tokens $L_{\text{new}}$ :

T_{\text{compute}}^{(l)} \propto L_{\text{new}} \cdot L_{\text{prefix}} \cdot d + L_{\text{new}} \cdot d_{\text{ff}} \tag{11}

The attention term grows linearly with $L_{\text{prefix}}$, while the FFN term does not depend on it. Transfer time also grows linearly with $L_{\text{prefix}}$, so whether I/O can hide behind compute depends mainly on how many new tokens $L_{\text{new}}$ accompany the prefix. As long as enough new tokens are computed per layer, the slack window is ample, which is why Tutti’s crossover point sits as high as a 98.3% hit rate.

At very high hit rates (large $L_{\text{prefix}}$, small $L_{\text{new}}$), per-layer compute shrinks while the KV volume to restore grows, so the slack window narrows and I/O latency becomes harder to hide. This explains the 20.6% gap versus DRAM in the >96K prefix regime in Fig. 11.

Implementation Details

Tutti is implemented in $\sim$ 8,000 lines of C++ for the core GPU storage layer and integrated with vLLM’s KVConnector interface using $\sim$ 1,500 lines of Python. Key implementation choices:

Pre-allocation at startup: The KV cache pool is pre-allocated at initialization and remains stable. This allows the P2P mapping table to be computed once and reused, avoiding per-request address translation overhead.
Warm-up profiling: Before inference begins, Tutti profiles per-layer slack windows for the specific model and hardware configuration. The resulting profile is cached and reused across inference runs.
Multi-GPU support via independent queues: Each GPU maintains independent NVMe submission/completion queue pairs (via a local daemon). Solidigm D7-PS1010 SSDs support up to 256 I/O queues, sufficient for 32 queues per GPU across 8 GPUs.
Distributed extension via Mooncake: For multi-node deployments, Tutti handles local HBM↔SSD transfers; Mooncake provides cluster-wide KV metadata and routing. Remote retrieval currently uses CPU-side RDMA (a noted future work item).

Worked Example: Tracing a Request Through Tutti

To make the design concrete, let’s trace what happens when a request with a 64K shared prefix arrives at a Tutti-enabled vLLM instance running Llama3-8B (32 layers, block size 16, BF16 precision).

Step 1: Request Arrives, Prefix Lookup

vLLM’s scheduler receives the request and checks the KV cache for the 64K-token prefix. With HBM hit rate of ~8% (per Table 1 for LEval), the 64K prefix is likely evicted to SSD. The KV block manager identifies the block IDs for the prefix.

Number of blocks for 64K tokens:

\text{Num blocks} = \frac{64{,}000}{16} = 4{,}000 \text{ blocks} \tag{12}

\text{Num I/O objects} = 2 \times 32 \times 4{,}000 = 256{,}000 \text{ objects} \tag{13}

In LMCache-GDS, the CPU would need to issue 256,000 separate cuFile calls. In Tutti, the CPU only loads I/O kernels once per layer.

Step 2: P2P Table Lookup and IOCB Preparation (CPU, once per layer)

For layer $l = 0$ , the CPU runtime:

Looks up GPU file IDs for the 4,000 block IDs (hash table lookup)
Retrieves 8,000 SGL entries from the pre-computed P2P table ( $K$ and $V$ for each block)
Fills 8,000 IOCTXs into the gio_uring SQ (fast: 8,000 × 16 B = 128 KB of metadata)
Inserts a CUDA event dependency
Returns IOCB handles to the GPU runtime

This CPU work completes in microseconds per layer, not seconds.

Step 3: GPU Issues Layer-0 I/O (I/O Control SMs)

The gio_uring I/O kernel on dedicated SMs:

Translates 8,000 SGL entries to NVMe commands (parallel warp-level execution)
Enqueues all 8,000 NVMe read commands to the SQ in one batch
Rings the NVMe doorbell once
The NVMe controller begins 8,000 concurrent DMA transfers to HBM
Meanwhile, the compute SMs begin Layer 0 embedding and normalization

Total data for Layer 0: $4{,}000 \times 2 \times 1 \times 16 \times 128 \times 2 \text{ B} = 32 \text{ MB}$

At 25.9 GB/s retrieval bandwidth: transfer takes ~1.2 ms. Layer 0 attention for 64K tokens takes ~3.5 ms on H100. Slack window: 2.3 ms.

Step 4: Scheduler Checks Slack and Issues Writes

Before starting Layer 0 compute, the scheduler looks up the slack table entry for this request’s input length and its 64K prefix, SlackTable[L_input][64K][layer=0]:

Window duration: 2.3 ms
SM budget: 4 SMs (out of 132 on H100)
Max IOCBs that fit: 320 write IOCBs

If there are pending KV writes from the previous request’s decode phase, the scheduler issues up to 320 write IOCBs using the spare SMs and the remaining 2.3 ms window.

Step 5: Compute Starts, I/O Completes in Parallel

Layer 0 compute executes on the compute SM partition. When compute reaches wait_cqe(), the 8,000 read I/Os for Layer 0 are already complete (they started at the same time and took 1.2 ms, while compute needs 3.5 ms). Zero bubble time.

This pattern repeats for all 32 layers. Total prefill time is dominated by 32 × 3.5 ms = 112 ms of compute, with I/O fully hidden. TTFT = ~112 ms (vs. ~3.9 s for LMCache-GDS).

The key equation: Tutti achieves zero bubble time whenever:

T_{\text{transfer}}^{(l)} \leq T_{\text{compute}}^{(l)} \tag{14}

At 25.9 GB/s transfer bandwidth and sufficient attention complexity, this holds for all layers when the hit rate is below 98.3%.
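The per-layer numbers in this walkthrough can be recomputed from the quoted inputs. The sketch takes the Step 3 byte-count expression and the 3.5 ms compute time at face value; results land close to the rounded figures in the text.

```python
blocks = 64_000 // 16                        # Eq. (12)
ioctx_per_layer = 2 * blocks                 # one K and one V object per block
layer_bytes = 4_000 * 2 * 1 * 16 * 128 * 2   # expression quoted in Step 3
bandwidth = 25.9e9                           # bytes/s retrieval bandwidth
compute_ms = 3.5                             # per-layer compute time from the text
num_layers = 32

transfer_ms = layer_bytes / bandwidth * 1e3
slack_ms = compute_ms - transfer_ms
hidden = transfer_ms <= compute_ms           # condition (14)
prefill_ms = num_layers * (compute_ms if hidden else transfer_ms)

print(blocks, ioctx_per_layer)               # 4000 8000
print(f"{layer_bytes / 1e6:.1f} MB per layer")
print(f"transfer {transfer_ms:.2f} ms, slack {slack_ms:.2f} ms")
print(f"I/O hidden: {hidden}, prefill ~{prefill_ms:.0f} ms")
```

As long as condition (14) holds for every layer, the prefill estimate is simply the sum of per-layer compute times.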

Design Trade-off Summary

Tutti’s design involves several explicit trade-offs that deserve enumeration:

Design Choice	Alternative	Why Tutti’s Choice Wins
GPU-managed I/O control (gio_uring)	CPU-managed (GDS)	Eliminates CPU serialization bottleneck for 256K concurrent I/Os
SGL addressing (16 B/region)	PRP addressing (4 KB/page)	256× lower HBM metadata overhead; enables bulk transfers
SM partitioning (green contexts)	CUDA streams only	Deterministic QoS; prevents I/O kernel from monopolizing SMs
Offline slack profiling	Online dynamic estimation	Zero runtime overhead; profiling is one-time cost
Decoupled R/W scheduling	Layer-wise interleaved pipelining	Prevents 60% bandwidth collapse from read/write contention
Object-level granularity (per block × 2L)	Fine-grained block/file	Natural alignment with KV transfer granularity
Pre-computed P2P mapping table	Per-request address construction	Eliminates per-request physical address overhead

The common thread behind most of these choices is the same: move as much work as possible to initialization time (offline profiling, P2P table pre-computation, I/O kernel pre-loading), so that inference-time overhead scales as $O(L)$ (layers) rather than $O(L \times B)$ (layers × blocks).

Experimental Results

Setup

Server: 64-core Intel Xeon 6530, 512 GB DRAM, 2× H100 80GB, 4× Solidigm D7-PS1010 7.68 TB SSDs
Primary model: Llama3-8B (single GPU)
Scalability model: GLM-4-9B-Chat-1M (2 GPUs, tensor parallelism)
Workloads: LEval (3K–200K tokens) and LooGLE (>100K tokens)
Baselines: HBM-only, LMCache-DRAM, LMCache-SSD, LMCache-GDS

End-to-End Performance (Figure 8)

System	Avg TTFT (s) @ 1.5 req/s	vs Tutti
HBM only	7.2	8.3× slower
LMCache-DRAM	2.8	3.2× slower
LMCache-SSD	6.5	7.5× slower
LMCache-GDS	3.9	4.5× slower
Tutti	0.87	baseline

Figure 5: TTFT comparison at high request rate (LEval, vLLM v0.17.0, 1.5 req/s). Tutti achieves 69.1% lower TTFT than DRAM and 78.3% lower than GDS, and under a 1-second TTFT SLO it sustains twice as many requests as GDS (50% more than DRAM).
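The "vs Tutti" column and the percentage reductions can be re-derived from the rounded TTFT values in the table. The short check below does that arithmetic; because it starts from rounded table entries, the reductions it computes (about 68.9% and 77.7%) differ slightly from the 69.1% and 78.3% quoted from the paper, which presumably use unrounded measurements.

```python
# Average TTFT in seconds at 1.5 req/s, copied from the table above.
ttft_s = {
    "HBM only": 7.2,
    "LMCache-DRAM": 2.8,
    "LMCache-SSD": 6.5,
    "LMCache-GDS": 3.9,
    "Tutti": 0.87,
}


def slowdown(system):
    """How many times slower than Tutti a system is."""
    return ttft_s[system] / ttft_s["Tutti"]


def reduction_pct(system):
    """Percentage by which Tutti's TTFT is lower than the given system's."""
    return 100.0 * (1.0 - ttft_s["Tutti"] / ttft_s[system])


for name in ttft_s:
    if name != "Tutti":
        print(f"{name}: {slowdown(name):.1f}x slower, "
              f"Tutti is {reduction_pct(name):.1f}% lower")
```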

An important nuance in these results: the gap between Tutti and DRAM varies by workload. On LEval (moderate context lengths, 3K–200K), Tutti even outperforms DRAM at some load levels by up to 13.4% (Figure 11, 16K–96K prefix). The likely reason is that Tutti’s I/O-compute overlap hides SSD transfer latency almost completely, so the RAID-striped SSDs deliver effective retrieval throughput on par with or better than what the DRAM baseline achieves in practice. Only for very long prefixes (>96K) does DRAM’s lower latency advantage re-emerge, giving DRAM a 20.6% lead. This suggests that for workloads with moderate prefix lengths and high hit rates, SSD can genuinely outperform DRAM, a counterintuitive result that challenges the conventional wisdom that DRAM is always preferable for KV caching.

Key results:

TTFT (LEval, 1.5 req/s, vLLM v0.17.0): Tutti achieves the best TTFT, 69.1% below DRAM, 78.3% below GDS
Under 1s TTFT SLO: Tutti supports 50% more requests than DRAM and 100% more than GDS
LooGLE (0.6 req/s, vLLM v0.17.0): Tutti TTFT is 93.2% below DRAM, 62.0% below GDS
ITL (LEval, 1.5 req/s): Tutti reduces ITL by 22.0% vs. DRAM and 24.4% vs. GDS

Storage Bandwidth (Figure 9)

Retrieve bandwidth: Tutti scales near-linearly with context length to 25.9 GB/s at 128K tokens; LMCache-GDS saturates at 11.9 GB/s (2.2× gap)
Store bandwidth: Tutti sustains ~10 GB/s (device-limited); LMCache-GDS reaches only ~7 GB/s

Bubble Time Analysis (Figure 13)

The crossover point analysis is particularly revealing:

System	Crossover Hit Rate
LMCache-SSD	~50%
LMCache-DRAM-LW	~97.9%
Tutti	~98.3%

Tutti matches DRAM’s behavior almost exactly — at most hit rates below 98.3%, the bubble time is near zero (averaging 25 ms, dropping to 6 ms at 93.75% hit rate). This confirms that the slack-aware scheduler successfully hides I/O latency behind compute for all practical deployment scenarios.
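One way to build intuition for why a crossover hit rate exists is a deliberately simple linear model. It is an assumption of this review, not the paper's analysis: per-layer compute is taken to shrink linearly with the hit rate (fewer tokens to prefill) while per-layer load time grows linearly with it (more cached KV to fetch), and the quadratic attention term is ignored. The input values are hypothetical and chosen only for illustration.

```python
def crossover_hit_rate(compute_ms_full, load_ms_full):
    """Hit rate h* where load time equals compute time in the toy model.

    compute(h) = compute_ms_full * (1 - h)   # time to prefill the misses
    load(h)    = load_ms_full * h            # time to fetch the hits
    """
    return compute_ms_full / (compute_ms_full + load_ms_full)


def bubble_ms(hit_rate, compute_ms_full, load_ms_full):
    """Per-layer stall in the toy model; zero while compute hides the load."""
    load = load_ms_full * hit_rate
    compute = compute_ms_full * (1.0 - hit_rate)
    return max(0.0, load - compute)


# Hypothetical per-layer times: 60 ms to compute everything, 1 ms to load everything.
print(f"{crossover_hit_rate(60.0, 1.0):.3f}")  # 0.984
# Halving storage bandwidth doubles the load time and lowers the crossover.
print(f"{crossover_hit_rate(60.0, 2.0):.3f}")  # 0.968
```

Under this model the crossover depends only on the ratio of full-compute time to full-load time, which is why higher retrieval bandwidth pushes the crossover toward 100%.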

Cost Analysis (Figure 14)

On LooGLE at 0.5 req/s:

Tutti vs. LMCache-SSD: 66.2% cost reduction (higher throughput, same SSD cost)
Tutti vs. LMCache-GDS: 27% cost reduction

Multi-GPU Scalability (Figure 12)

On GLM-4-9B-1M with 2 GPUs and 4 SSDs:

At 640K prefix length: Tutti achieves 1.2s TTFT
LMCache-GDS fails (OOM) at 512K and 640K due to cuFile staging buffer overhead
Tutti succeeds at all tested lengths, demonstrating architectural robustness

Related Work

Understanding Tutti’s position requires situating it in the broader KV cache serving landscape:

HBM-only systems (vLLM, SGLang): Keep all KV cache in HBM. Fast but limited capacity leads to high eviction rates (8% hit rate on LEval). These are the baseline for all other approaches.

DRAM-extension systems (LMCache-DRAM, CachedAttention, HCache): Extend capacity into CPU DRAM. DRAM provides good bandwidth (~50 GB/s) and low latency. Layer-wise pipelining effectively hides transfer latency. The capacity ceiling is ~2 TB per server, inadequate for large multi-session workloads.

CacheBlend (EuroSys’25): Focuses on KV cache reuse for RAG workloads, blending cached and computed KV entries. Complementary to Tutti — Tutti handles the storage tier, CacheBlend handles semantic cache matching.

Strata: A hierarchical context caching system that uses importance-based eviction. Not directly compared in Tutti, but the orthogonal focus (eviction policy vs. storage I/O efficiency) suggests they could be combined.

IMPRESS (FAST’25): A multi-tier prefix KV storage system that uses importance scores to decide which KV entries to retain in which tier. Like Strata, the eviction policy is orthogonal to Tutti’s I/O path improvement — a combined system could use IMPRESS’s policy with Tutti’s SSD access layer.

BaM, GeminiFS, GoFS: GPU-centric storage systems for generic workloads (raw blocks and files). Tutti builds on GeminiFS for its underlying file system, and addresses the specific challenges of adapting GPU-centric storage to KV cache workloads (abstraction mismatch, granularity gap, contention).

Tutti’s unique contribution: it is the first to jointly address the I/O control path (gio_uring), physical addressing overhead (SGL vs PRP), and I/O-compute contention (slack-aware scheduler) in a single system that is integrated with a production serving stack (vLLM). Each piece is necessary; neither BaM nor GeminiFS alone is sufficient.

Critical Assessment: Weaknesses & Improvements

(a) Weaknesses and Flaws

1. Missing full ablation. The paper shows excellent ablation studies for individual components (PRP vs SGL, slack scheduling), but there is no single ablation that disables all three innovations simultaneously to show the baseline “naive GPU-centric” performance. Understanding which of the three design choices (SGL, gio_uring, slack scheduling) contributes most to end-to-end gains would be valuable for future system designers who might want to implement only part of Tutti.

2. Narrow baseline coverage. The paper compares primarily against LMCache (v0.4.2). By May 2026, the KV cache serving space includes Strata, IMPRESS, HCache, CacheBlend, and SGLang HiCache. None of these are evaluated. IMPRESS in particular (FAST’25) specifically addresses tiered storage for KV cache, so its absence is conspicuous.

3. Single-model evaluation. Core end-to-end results use only Llama3-8B. This is an 8B parameter model with only 32 layers, a relatively small model where per-layer I/O overhead is less severe than in larger models (70B, 405B). The paper shows GLM-4-9B-1M for scalability, but provides no TTFT comparison with baselines at that scale. Claims about “DRAM-like efficiency” should be validated on larger models where the KV cache footprint is substantially larger relative to HBM.

4. Cache hit rate dependency not decomposed. The paper evaluates performance at fixed hit rates (derived from the LEval/LooGLE datasets). In production, hit rates depend heavily on traffic patterns, session mix, and eviction policy. The paper does not study how Tutti’s performance degrades as hit rates fall (e.g., cold-start scenarios with 0% hit rate) or how it compares to LMCache-GDS under equal hit rate conditions imposed by experimental control rather than natural workload variation.

5. GDS implementation version not specified. The paper notes that GDS “still relies on CPU intervention” but uses LMCache v0.4.2 as the GDS baseline. NVIDIA has continued optimizing GDS and cuFile. The gap between Tutti and GDS may narrow on newer hardware (PCIe 6.0, CXL-attached memory) or with better GDS implementations; the paper does not discuss this.

6. Write bandwidth bottleneck. Tutti’s store bandwidth is device-limited at ~10 GB/s per SSD (sequential write peak). For write-heavy workloads with many new unique sessions, this could become a bottleneck. The paper acknowledges this but does not quantify write latency under sustained high eviction rates.

7. Tensor parallelism interaction not analyzed. Under tensor parallelism, each GPU holds only a shard of the KV cache. Tutti spawns one instance per GPU, and each instance only manages its GPU’s KV shard. The paper mentions multi-GPU support but does not analyze how tensor parallelism’s all-reduce communication pattern interacts with concurrent SSD I/O, particularly when both compete for PCIe bandwidth on the same root complex.

8. Sparse decode phase. The slack-aware scheduler admits that decode slack windows are “short and less predictable,” resulting in deferred writes during decode. In workloads with many long decode sequences, the accumulated write backlog could cause KV eviction stalls when HBM is full and writes cannot complete fast enough.

(b) Limitations the Authors Understate or Omit

1. Warm-up overhead not reported. Tutti requires offline profiling to generate the slack table before inference can start. The profiling complexity is $O(L_{\text{max\_input}} \times L_{\text{max\_prefix}})$ model evaluations — potentially hours for million-token models. The paper mentions “the profile only needs to be generated once” but provides no profiling time measurements.

2. Remote retrieval path is unoptimized. Section 3.4 notes that multi-node remote KV retrieval “uses a CPU-side interface to read the GPU file into host memory and then transfers it across nodes via RDMA.” This CPU-side remote path completely negates Tutti’s CPU-elimination benefit for inter-node KV sharing, which is the common case in cluster-level prefix caching. This limitation is mentioned as “future work” but its performance impact is not quantified.

3. SSD endurance not discussed. NVMe SSDs have finite write endurance (typically 1–3 drive writes per day). High write amplification from constant KV cache eviction/restore cycles could substantially reduce SSD lifespan. A production deployment would need to account for SSD replacement costs, which are not reflected in the cost analysis.

4. CUDA event overhead not analyzed. gio_uring uses CUDA events to serialize I/O and compute kernels. For very short decode steps, the event synchronization overhead may be non-negligible. The paper provides no CUDA event overhead measurements.

5. Single-SSD vs. multi-SSD results conflated. Section 4.1 evaluates TTFT using “two SSDs with 29 GB/s peak bandwidth” but does not report single-SSD results. The bandwidth improvements shown may require multi-SSD RAID configurations that add hardware cost not reflected in the $0.000082/GB/hour SSD pricing.
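The endurance concern in item 3 can be sized with a rough drive-writes-per-day (DWPD) estimate. The sketch below assumes the review's ~10 GB/s store bandwidth and the 4× 7.68 TB array from the evaluation setup; the duty-cycle values (fraction of the day spent writing at that rate) are hypothetical, and write amplification inside the drive is ignored.

```python
SECONDS_PER_DAY = 86_400


def drive_writes_per_day(write_gb_per_s, duty_cycle, total_capacity_tb):
    """Host writes per day expressed as multiples of total array capacity."""
    written_tb_per_day = write_gb_per_s * duty_cycle * SECONDS_PER_DAY / 1_000
    return written_tb_per_day / total_capacity_tb


array_tb = 4 * 7.68  # four 7.68 TB drives

# Worst case: writing at the full store bandwidth around the clock.
print(f"{drive_writes_per_day(10.0, 1.0, array_tb):.1f} DWPD")  # 28.1 DWPD
# Hypothetical 10% duty cycle.
print(f"{drive_writes_per_day(10.0, 0.1, array_tb):.1f} DWPD")  # 2.8 DWPD
```

Even the 10% duty-cycle case sits near the top of the 1–3 DWPD range cited above, which is why the omission of replacement cost from the cost analysis matters.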

(c) Concrete Improvement Suggestions

1. Evaluate with 70B+ models. The claims generalize most to deployments where HBM exhaustion is most acute, i.e., large models with long contexts. With Llama3-70B, the FP16 weights alone occupy ~140 GB, and a single 128K-token sequence adds roughly 40 GB of KV cache (80 layers, 8 KV heads, head dimension 128), so long-context multi-session serving requires aggressive SSD offloading. Results at this scale would be far more convincing.

2. Benchmark against Strata and IMPRESS. These systems represent the current state of the art in tiered KV caching at the time of submission. Including them would strengthen the paper’s positioning considerably.

3. Quantify and optimize the remote retrieval path. GPU-initiated RDMA (e.g., GPUDirect RDMA with GPU-side work submission, as in NVSHMEM’s IBGDA transport) would extend Tutti’s CPU-elimination principle to the inter-node case. The paper already mentions this as future work, but preliminary benchmarks would help assess the opportunity.

4. Adaptive slack window sizing. The current scheduler uses offline-profiled slack tables, requiring re-profiling when model, hardware, or vLLM version changes. An online adaptive version that dynamically estimates slack from runtime measurements would make the system more robust to deployment variability.

5. SSD wear-leveling simulation. A Markov chain model of KV eviction and restoration patterns would allow estimating drive write amplification and predicting SSD endurance under production workloads — important for total cost of ownership calculations.
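For suggestion 1, the KV cache footprint of a single sequence is easy to estimate from model shape. The sketch below assumes FP16 storage and the publicly documented Llama 3 configurations (8B: 32 layers, 70B: 80 layers, both with 8 KV heads under grouped-query attention and head dimension 128); these values are not taken from the Tutti paper, and the result scales directly with the number of KV heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence: K and V for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len


GIB = 2**30
seq_len = 128 * 1024  # 128K tokens

llama3_8b = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=seq_len)
llama3_70b = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"Llama3-8B:  {llama3_8b / GIB:.1f} GiB")   # 16.0 GiB
print(f"Llama3-70B: {llama3_70b / GIB:.1f} GiB")  # 40.0 GiB
```

A handful of concurrent 128K-token sessions on the 70B model therefore exceeds the HBM left over after weights, which is the regime where an SSD tier would be exercised hardest.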

Broader Impact: What Tutti Means for LLM Infrastructure

Tutti’s results suggest a re-examination of how LLM serving infrastructure should be designed. Currently, the standard practice is to keep all KV cache in DRAM when possible, accepting the 1–2 TB capacity ceiling that DRAM provides. This forces a choice between two bad options: accept low cache hit rates (HBM only) or pay high DRAM costs for large capacity.

Tutti breaks this dichotomy. By making SSD performance comparable to DRAM for KV cache workloads, it suggests a new “SSD-first” architecture for KV serving:

Primary tier: SSD (nearly infinite capacity, DRAM-like performance via Tutti)
Cache tier: HBM (hot working set, highest-reuse prefixes)
Optional DRAM tier: for multi-node setups where NVMe is not locally attached

This architectural shift could fundamentally change the economics of long-context serving. At current cloud pricing, a server with 4× 7.68 TB SSDs ($0.000082/GB/h × 30 TB ≈ $2.46/h) provides more KV cache capacity than 300 H100 GPUs’ HBM combined, at a fraction of the cost.
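The arithmetic behind this comparison is short enough to check directly. The sketch below uses the SSD price quoted in this review and 80 GB of HBM per H100; it shows both the rounded 30 TB figure and the exact 4 × 7.68 TB capacity.

```python
SSD_PRICE_PER_GB_HOUR = 0.000082  # USD, as quoted in this review


def hourly_cost_usd(capacity_gb):
    """Hourly SSD cost for a given capacity at the quoted per-GB price."""
    return capacity_gb * SSD_PRICE_PER_GB_HOUR


ssd_capacity_gb = 4 * 7.68 * 1_000  # four 7.68 TB drives = 30,720 GB
hbm_capacity_gb = 300 * 80          # 300 H100s at 80 GB each = 24,000 GB

print(f"30 TB (rounded): ${hourly_cost_usd(30_000):.2f}/h")          # $2.46/h
print(f"30.72 TB (exact): ${hourly_cost_usd(ssd_capacity_gb):.2f}/h")  # $2.52/h
print(ssd_capacity_gb > hbm_capacity_gb)                              # True
```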

Agentic AI workloads are a particularly compelling use case: long-running agents maintaining large conversational histories across many sessions could have their KV states persisted cheaply on local SSDs, making multi-turn interaction over very long accumulated contexts economically viable.

Reproducibility Notes

The paper reports that Tutti is implemented and integrated with vLLM. The code consists of ~8,000 lines of C++ for the GPU storage layer and ~1,500 lines of Python for vLLM integration. As of the paper’s submission, the source code is described as “open-source” but the repository URL is not provided in the paper text. Key reproducibility requirements:

Hardware: H100 GPUs (required for green context SM partitioning, PCIe 5.0 for SGL bandwidth)
Software: GeminiFS (from the same research group, FAST’25), vLLM v0.12+ or v0.17+
SSDs: enterprise NVMe with >20 GB/s aggregate sequential read (Solidigm D7-PS1010 used in paper)

Reproducing the full evaluation requires significant hardware (2× H100, 4× enterprise SSDs). Partial reproduction of the bandwidth comparisons (SGL vs PRP, gio_uring throughput) would require only the GPU storage layer and a single H100.

Conclusion

Tutti makes a fundamental architectural shift: by giving the GPU autonomous I/O control over NVMe SSDs through gio_uring, it breaks the CPU bottleneck that has historically made SSD-backed KV cache impractical. The combination of GPU-native object abstraction (SGL addressing), asynchronous GPU io_uring (SM-partitioned, lock-free ring buffers), and slack-aware scheduling (offline-profiled, read/write decoupled) achieves SSD-backed KV caching that matches DRAM performance in most operating regimes while offering far larger capacity at roughly 100× lower cost per GB.

The most important insight is the crossover analysis: Tutti’s slack-aware scheduler keeps the system in the compute-bound regime up to a 98.3% cache hit rate — effectively matching DRAM’s behavior at all practically relevant operating points. For long-context serving where hit rates naturally reach 80–90%, Tutti provides DRAM-like latency with SSD-level cost, fundamentally changing the economics of large-scale LLM deployment.

The paper’s main open questions — remote GPU-initiated RDMA, adaptive slack estimation, and validation at 70B+ scale — leave clear directions for follow-up work that could make Tutti’s approach the default for production LLM serving infrastructure.

From a research perspective, Tutti’s most replicable conceptual contribution is the identification of the CPU I/O control path (not the data path) as the primary bottleneck in SSD-backed KV serving. GDS had already addressed the data path; Tutti’s insight that this was insufficient, and that the control path needed to move to the GPU, is the key intellectual step. The gio_uring implementation and slack-aware scheduler are the engineering realization of that insight.

For practitioners deploying LLM serving systems today: Tutti is not yet production-ready for all use cases (the remote path is unoptimized, hardware requires H100+), but its benchmark results provide strong evidence that SSD-backed KV serving can match DRAM performance at DRAM-like cache hit rates. Monitoring for public code release and evaluating on your specific workload profile would be a reasonable next step.

The paper appeared on arXiv in May 2026. Given its results and clear engineering contributions, it is a strong candidate for submission to a top systems venue (OSDI, EuroSys, or ATC). The reproducibility path is well-defined; the code, once released, will likely become a reference implementation for GPU-centric KV cache serving.