Showing posts 37–48 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

06-13 EN

ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight

ForeMoE exploits the unique structure of RL post-training — where rollout routing decisions are replayed in later stages — to predict and proactively balance MoE expert loads at micro-step granularity, achieving up to 1.45x speedup over state-of-the-art RL training systems on 64 GPUs.
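
06-13 ZH

ForeMoE: Micro-step-level MoE Load Balancing in RL Post-training via Routing Foresight

ForeMoE exploits the routing-replay structure unique to RL post-training, in which routing decisions made during rollout are reused in later stages, to precisely predict and proactively balance MoE expert loads for every gradient micro-step, achieving up to 1.45× end-to-end speedup on 64 GPUs.

06-12 EN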

SliceGPT: Post-Training LLM Compression via Computational Invariance

SliceGPT exploits an exact structural symmetry in transformers to physically delete rows and columns from every weight matrix, cutting parameters by 25% while retaining 99% of the dense model's zero-shot performance on LLAMA2-70B and OPT-66B, with no custom hardware kernels required.
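
06-12 ZH

SliceGPT Reading Notes: Deleting Rows and Columns of a Transformer via Computational Invariance

SliceGPT proves that transformer computation is exactly invariant under orthogonal changes of basis. It uses PCA to rotate the weight matrices toward the directions of highest variance and then slices off the low-variance dimensions directly, retaining 99% of zero-shot performance on LLAMA2-70B with a 25% parameter reduction and without any custom CUDA kernels.

06-11 EN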

MegaScale: Engineering 55% MFU at 12,288 GPUs for LLM Training

MegaScale is ByteDance's full-stack production system for training LLMs on more than 10,000 GPUs, achieving 55.2% Model FLOPs Utilization through co-designed algorithmic optimizations, communication overlapping, and deep observability for fault tolerance.
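
06-11 ZH

MegaScale: How ByteDance Achieves 55% MFU on 12,288 GPUs for Large-scale LLM Training

MegaScale is ByteDance's production system for ultra-large-scale LLM training. Through algorithm-system co-design, communication-computation overlap, operator optimization, and deep observability, it achieves 55.2% Model FLOPs Utilization on 12,288 GPUs, a 1.34× improvement over Megatron-LM.

06-10 EN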

KeepKV: Lossless KV Cache Compression via Electoral Votes and ZIP-Merging

KeepKV introduces Electoral Votes and Zero Inference-Perturbation Merging to achieve single-step lossless KV cache compression, provably fixing the Attention Sag problem that plagues all prior merging methods.
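
06-10 ZH

KeepKV: Lossless KV Cache Compression via an "Electoral Votes" Mechanism and Zero-Perturbation Merging

KeepKV proposes an "Electoral Votes" mechanism and Zero Inference-Perturbation Merging (ZIP-Merging), mathematically proves single-step lossless KV cache compression, and fundamentally resolves the "Attention Sag" problem present in all existing merging methods.

06-09 EN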

VAPO: Value-Augmented Proximal Policy Optimization for Long-CoT Reasoning

VAPO revives value-model-based RL for LLM reasoning by introducing Length-adaptive GAE and a suite of complementary techniques, reaching 60.4 on AIME 2024 with Qwen2.5-32B — outperforming DAPO by more than 10 points in under 5,000 training steps.
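
06-09 ZH

VAPO: Value-Augmented Proximal Policy Optimization for Long-Chain Reasoning

By introducing Length-adaptive GAE and a suite of complementary techniques, VAPO lets value-model-based reinforcement learning once again surpass value-model-free methods, reaching an AIME 2024 score of 60.4 on Qwen2.5-32B in fewer than 5,000 steps, more than 10 points above DAPO.

06-08 EN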

ExpWeaver: How LLM Agents Learn from Past Experience in Latent Space

ExpWeaver replaces text-based experience retrieval with latent-space RAG — encoding past agent trajectories as dense hidden-state vectors and retrieving them at every decoding step via cross-attention, achieving SOTA on 12/13 tasks with 1.5-2x better token efficiency.
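
06-08 ZH

ExpWeaver: How LLM Agents Learn from Experience in Latent Space

ExpWeaver replaces text retrieval with latent-space RAG: it encodes an agent's past trajectories as dense hidden-state vectors and retrieves and fuses them at every decoding step via cross-attention, achieving SOTA on 12/13 tasks while reducing token consumption by a factor of 1.5-2.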