Showing posts 61–72 of 204. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

06-02 EN

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

REINFORCE++ proves that GRPO's per-prompt advantage normalization is a biased estimator, then fixes it with a single global batch normalization step — achieving state-of-the-art results across general RLHF, complex reasoning, and long-horizon agentic tasks, all without a critic network.
06-02 中

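REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

REINFORCE++ proves mathematically that GRPO's per-prompt local normalization is a biased estimator and replaces it with global batch normalization, outperforming both GRPO and PPO across general RLHF, complex reasoning, and long-horizon agent tasks, all without any critic network.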
06-01 EN

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

AutoSci is a memory-centric agentic system that automates the full scientific research lifecycle — reading, ideation, experimentation, writing, and rebuttal — through four integrated modules (SciMem, SciFlow, SciDAG, SciEvolve). I walk through the architecture from scratch and critically assess its evaluation methodology and narrow domain coverage.
06-01 中

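AutoSci: A Memory-Centric Autonomous Agent System for the Full Scientific Research Lifecycle

AutoSci is a "permanent research environment" proposed by a Peking University team. It uses memory-centric multi-agents to chain literature reading, ideation, experimentation, paper writing, and reviewer rebuttal into a self-evolving closed loop. This post walks through its four modules (SciMem / SciFlow / SciDAG / SciEvolve) and two end-to-end case studies from scratch, and critically analyzes the limitations of its evaluation methodology and scope of applicability.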
05-31 EN

Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models

GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, yielding more stable and efficient RL for LLMs — and crucially fixing the training collapse that plagues RL on large Mixture-of-Experts models. A from-scratch walkthrough of the math and algorithm, plus a critical look at what the paper leaves untested.
05-31 中

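Group Sequence Policy Optimization: An RL Training Method That Fixes GRPO with Sequence-Level Importance Sampling

GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, making RL training for LLMs more stable and more efficient, and resolving the RL training collapse seen on large MoE models. This post explains its mathematical motivation and algorithmic details from scratch and critically analyzes what the paper has not yet validated.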
05-29 EN

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

How double-sided KL-aware whitening, adaptive heterogeneous rank allocation, and loss-aware remapping combine to push SVD-based LLM compression to a new state of the art — with 4.34× decode throughput and minimal quality loss even at 60% parameter removal.
05-29 中

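IO-SVD: An Adaptive-Rank LLM Compression Method Based on Input-Output Two-Sided Whitening

KL-divergence-aware two-sided whitening, greedy heterogeneous rank allocation, and loss-aware quantization remapping: together these three techniques push SVD compression to a new SOTA, lowering PPL to 5.59 on LLaMA-7B at an 80% retention ratio while delivering a 4.34× decode throughput improvement.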
05-28 EN

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

How Moonshot AI's Kimi serving platform redesigned LLM infrastructure around KV cache disaggregation—achieving 525% throughput gains for long-context workloads while maintaining strict TTFT and TBT SLO compliance.
05-28 中

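Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving

How Moonshot AI (Kimi) redesigned its entire LLM serving system around KV Cache scheduling, reuse, and migration, achieving a 525% throughput improvement in long-context scenarios while meeting strict TTFT and TBT latency SLOs.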
05-26 EN

SimPO: Simple Preference Optimization with a Reference-Free Reward

SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, eliminating the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It outperforms DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard while keeping response length controlled. The Gemma-2-9B-it SimPO model ranked #1 on Chatbot Arena among all <10B models.
05-26 中

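SimPO: Simple Preference Optimization Without a Reference Model

SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, removes the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It outperforms DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard without inflating response length. The Gemma-2-9B-it-based SimPO model ranked first among all sub-10B models in Chatbot Arena's real human voting.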