204 posts in total. Keep on posting.
Showing posts 1–12 of 204. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
MosaicKV: Dynamic Two-Dimensional KV Cache Compression for Long-Context LLM Serving — Technical Review
MosaicKV solves the long-context KV cache bottleneck by applying dynamic per-vector element selection and segment-adaptive strategies across both sequence and channel dimensions, achieving 16x attention speedup and 7.3x throughput gain at only 1.76% average accuracy loss.
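- ZH
MosaicKV: Dynamic Two-Dimensional KV Cache Compression for Ultra-Long-Context LLM Serving (Reading Notes)
MosaicKV compresses both the sequence and channel dimensions through per-vector element selection and segment-adaptive strategies, achieving a 16x attention speedup and a 7.3x throughput gain while losing only 1.76% accuracy on LongBench and RULER.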
- EN
AIR: Activation- and Influence-Aware SVD Compression for LLMs — Technical Review
AIR adds a closed-form element-wise influence ALS sweep on top of SVD-LLM(W) whitening, achieving 18-45% perplexity gains at 20-60% parameter retention while cutting peak memory 64% and per-token latency 53% on an A100.
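- ZH
AIR Reading Notes: Activation- and Influence-Aware SVD Low-Rank Compression for LLMs
Building on activation whitening, AIR introduces an element-wise backpropagation influence matrix and performs a hybrid-aware low-rank approximation through closed-form ALS iterations, reducing perplexity by 18% at 60% parameter retention, cutting peak memory by 64%, and lowering inference latency by 53%.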
- EN
Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization
Tangram decouples parallelization planning from GPU heterogeneity by abstracting heterogeneous clusters into homogeneous GPU islands, then composing partial plans from existing parallelizers into work-balanced pipelines — achieving up to 2.3× higher throughput than heterogeneous baselines while retaining full support for expert parallelism, ZeRO, and activation recomputation.
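- ZH
Tangram: An Efficient LLM Parallelization System That Hides Hardware Differences in Heterogeneous GPU Clusters
Tangram abstracts a heterogeneous GPU cluster into homogeneous GPU islands, lets existing homogeneous parallelizers generate partial plans, and then composes them via dynamic programming into a globally load-balanced pipeline. It retains all features, including expert parallelism, ZeRO, and activation recomputation, while delivering up to 2.3x higher throughput than existing heterogeneous parallelizers.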
- EN
SSV: Sparse Speculative Verification for Efficient LLM Inference
SSV resolves the structural mismatch between speculative decoding and dynamic sparse attention by grouping overlapping verifier queries, fusing NSA branches across layers, and adaptively orchestrating draft-verify strategies per prompt — achieving up to 3.49x end-to-end throughput on H100 GPUs.
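- ZH
SSV: Sparse Speculative Verification, or Speculative Decoding under Dynamic Sparse Attention
Through overlap-aware query grouping, refresh/reuse NSA kernel fusion, and adaptive strategy orchestration, SSV completely resolves the structural conflict between speculative decoding and dynamic sparse attention, achieving up to 3.49x end-to-end throughput improvement on H100 GPUs.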
- EN
DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Technical Review
DAPO introduces four targeted algorithmic fixes to GRPO (asymmetric clip bounds, dynamic sampling, token-level gradient averaging, and soft overlong penalties), achieving 50% accuracy on AIME 2024 with Qwen2.5-32B in 50% fewer steps than DeepSeek-R1-Zero.
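- ZH
DAPO: Reading Notes on a Large-Scale LLM Reinforcement Learning System
DAPO proposes a fix for each of four specific problems in GRPO: asymmetric clipping (Clip-Higher), dynamic sampling, token-level policy gradient loss, and soft overlong punishment. With these, Qwen2.5-32B reaches 50% accuracy on AIME 2024 using half as many training steps as DeepSeek-R1-Zero.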
- EN
ACTS: Steering How LLMs Reason, Not Just How Long
ACTS introduces an RL-trained controller agent that steers a frozen reasoning LLM step-by-step through a budget-aware Markov decision process, achieving Vanilla-level accuracy with up to 57 percent token savings and even surpassing full-thinking baselines on harder tasks by eliminating overthinking spirals.
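- ZH
ACTS: An RL-Trained Controller That Makes LLM Reasoning Smarter, Not Just Shorter
ACTS models chain-of-thought control as a budget-constrained Markov decision process and trains a lightweight controller agent that assigns reasoning strategies to a frozen reasoning model step by step, maintaining or even exceeding the original model's accuracy while saving up to 57% of tokens.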