200 posts in total. Keep on posting.

Showing posts 13–24 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

06-26 EN

SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression

SigmaScale learns per-layer row and column scaling vectors to reshape weight-matrix singular-value spectra before truncated SVD, improving compression quality in the mild-to-moderate regime without requiring specialized hardware.
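06-26 ZH

SigmaScale Reading Notes: Improving SVD-Based LLM Compression by Learning Scaling Matrices

SigmaScale uses gradient descent to learn row and column scaling vectors for each layer's weight matrix, reshaping the singular-value spectrum before truncated SVD. It outperforms existing analytic-scaling methods at mild-to-moderate compression ratios, without relying on any specialized hardware.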
06-25 EN

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

ReMP turns the static TP/PP topology into a runtime-adjustable resource, achieving topology switches in 1–7 seconds (100× faster than a restart) through shared CPU weight stores, two-dimensional KV cache migration, and pre-built MPU state snapshots. This enables adaptive LLM serving under dynamic workloads.
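06-25 ZH

ReMP: Low-Downtime Runtime Parallel-Topology Reconfiguration for LLM Inference Serving

ReMP turns the TP/PP topology from a static launch-time parameter into a dynamic resource that can be switched online. Using a shared CPU weight store, two-dimensional KV cache migration, and pre-built MPU state snapshots, it cuts topology-switch time from minutes to 1–7 seconds on models from 7B to 70B parameters, a speedup of up to 100×.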
06-24 EN

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

SparDA introduces a fourth per-layer Forecast projection that decouples KV block selection from attention computation, enabling lookahead CPU-to-GPU prefetch and a compact GQA-level indexer—delivering up to 1.7x decode speedup and 5.3x throughput over sparse attention baselines on 128K-context inference.
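06-24 ZH

SparDA: Sparse Decoupled Attention for Fast and Accurate Long-Context Inference

SparDA introduces a fourth per-layer Forecast projection that decouples KV block selection from attention computation, so it can predict one layer ahead and asynchronously prefetch the CPU-resident KV cache, delivering up to 1.7x decode speedup and a 5.3x throughput improvement on 128K-context inference.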
06-23 EN

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO integrates natural language critiques into online RL loops to overcome the three core failure modes of purely numerical reward feedback (performance plateaus, failed self-reflection, and persistent failures), achieving up to a +21.6% Pass@1 improvement over GRPO on challenging math and reasoning benchmarks.
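06-23 ZH

Critique-GRPO: Breaking Through Reinforcement Learning Training Bottlenecks with Natural-Language Critique Feedback

Critique-GRPO brings natural-language critique feedback into the online reinforcement learning loop to address three structural bottlenecks of training with purely numerical rewards: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. It achieves up to a +26.7% Pass@1 improvement on hard reasoning benchmarks such as AIME 2024.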
06-22 EN

MRAgent: Why Memory Should Be Reconstructed, Not Retrieved

MRAgent replaces passive top-k retrieval with active, multi-step graph traversal over a Cue-Tag-Content associative memory, achieving up to 23% improvement on long-horizon conversational benchmarks while using 5x fewer tokens than competing methods.
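06-22 ZH

MRAgent: Memory Should Be Reconstructed, Not Retrieved

MRAgent replaces passive top-k retrieval with active multi-step graph traversal, improving results on long-conversation memory benchmarks by up to 23% while cutting token consumption to one-fifth that of competing methods.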
06-21 EN

Tutti: GPU-Centric SSD-Backed KV Cache That Finally Makes SSDs Practical for Long-Context LLM Serving

Tutti eliminates CPU intervention from the KV cache I/O path by introducing GPU io_uring and slack-aware scheduling, achieving DRAM-like efficiency from NVMe SSDs at 100x lower cost per GB.
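06-21 ZH

Tutti Reading Notes: A GPU-Native SSD KV Cache That Makes NVMe SSDs Truly Usable for Long-Context LLM Inference

Tutti uses a GPU io_uring mechanism to remove CPU intervention from the KV cache I/O path and pairs it with a slack-aware scheduler, letting NVMe SSDs deliver near-DRAM inference performance while cutting per-GB storage cost by roughly 100×.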