Showing posts 13–24 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression
SigmaScale learns per-layer row and column scaling vectors to reshape weight-matrix singular-value spectra before truncated SVD, improving compression quality in the mild-to-moderate regime without requiring specialized hardware.
- 中
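SigmaScale Reading Notes: Improving SVD-Based LLM Compression by Learning Scaling Matrices
SigmaScale uses gradient descent to learn row and column scaling vectors for each layer's weight matrix, reshaping the singular-value spectrum before truncated SVD. At mild-to-moderate compression ratios it outperforms existing methods based on analytic scaling, without relying on any specialized hardware.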
- EN
ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
ReMP turns static TP/PP topology into a runtime-adjustable resource, achieving topology switches in 1–7 seconds (100× faster than restart) through shared CPU weight stores, two-dimensional KV cache migration, and pre-built MPU state snapshots — enabling adaptive LLM serving under dynamic workloads.
- 中
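ReMP: Low-Downtime Runtime Parallelism-Topology Reconfiguration for LLM Inference Serving
ReMP turns the TP/PP topology from a static launch-time parameter into a dynamic resource that can be switched online. Using a shared CPU weight store, two-dimensional KV Cache migration, and pre-built MPU state snapshots, it cuts topology switch time from minutes to 1-7 seconds at 7B to 70B parameter scales, a speedup of up to 100x.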
- EN
SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
SparDA introduces a fourth per-layer Forecast projection that decouples KV block selection from attention computation, enabling lookahead CPU-to-GPU prefetch and a compact GQA-level indexer—delivering up to 1.7x decode speedup and 5.3x throughput over sparse attention baselines on 128K-context inference.
- 中
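SparDA: Sparse Decoupled Attention That Makes Long-Context Inference Both Fast and Accurate
SparDA introduces a fourth per-layer Forecast projection that decouples KV block selection from attention computation, predicting one layer ahead and asynchronously prefetching the CPU KV cache, for up to 1.7x decode speedup and 5.3x throughput improvement on 128K-context inference.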
- EN
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Critique-GRPO integrates natural language critiques into online RL loops to overcome the three core failure modes of purely numerical reward feedback—performance plateaus, failed self-reflection, and persistent failures—achieving up to +21.6% Pass@1 improvements over GRPO on challenging math and reasoning benchmarks.
- 中
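Critique-GRPO: Breaking Through Reinforcement Learning Training Bottlenecks with Natural Language Critique Feedback
Critique-GRPO brings natural language critique feedback into the online reinforcement learning loop to address three structural bottlenecks of training with purely numerical rewards (performance plateaus, ineffective spontaneous self-reflection, and persistent failures), achieving up to +26.7% Pass@1 improvement on hard reasoning benchmarks such as AIME 2024.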
- EN
MRAgent: Why Memory Should Be Reconstructed, Not Retrieved
MRAgent replaces passive top-k retrieval with active, multi-step graph traversal over a Cue-Tag-Content associative memory, achieving up to 23% improvement on long-horizon conversational benchmarks while using 5x fewer tokens than competing methods.
- 中
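MRAgent: Memory Should Be Reconstructed, Not Retrieved
MRAgent replaces passive top-k retrieval with active multi-step graph traversal, improving results by up to 23% on long-conversation memory benchmarks while cutting token consumption to one fifth of competing methods.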
- EN
Tutti: GPU-Centric SSD-Backed KV Cache That Finally Makes SSDs Practical for Long-Context LLM Serving
Tutti eliminates CPU intervention from the KV cache I/O path by introducing GPU io_uring and slack-aware scheduling, achieving DRAM-like efficiency from NVMe SSDs at 100x lower cost per GB.
- 中
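Tutti Reading Notes: A GPU-Native SSD KV Cache That Makes NVMe SSDs Truly Usable for Long-Context LLM Inference
Tutti removes CPU intervention from the KV cache I/O path with a GPU io_uring mechanism and, together with a slack-aware scheduler, lets NVMe SSDs reach near-DRAM inference performance while lowering storage cost per GB by roughly 100x.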