204 posts in total. Keep on posting.

Showing posts 61–72 of 204. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

  • EN

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    REINFORCE++ proves that GRPO's per-prompt advantage normalization is a biased estimator, then fixes it with a single global batch normalization step — achieving state-of-the-art results across general RLHF, complex reasoning, and long-horizon agentic tasks, all without a critic network.

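  • ZH

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    REINFORCE++ mathematically proves that GRPO's per-prompt local normalization is a biased estimator and replaces it with global batch normalization, outperforming both GRPO and PPO across general RLHF, complex reasoning, and long-horizon agent tasks, all without any critic network.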

  • EN

    AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

    AutoSci is a memory-centric agentic system that automates the full scientific research lifecycle — reading, ideation, experimentation, writing, and rebuttal — through four integrated modules (SciMem, SciFlow, SciDAG, SciEvolve). I walk through the architecture from scratch and critically assess its evaluation methodology and narrow domain coverage.

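  • ZH

    AutoSci: A Memory-Centric Autonomous Agent System for the Full Scientific Research Lifecycle

    AutoSci is a "permanent research environment" proposed by a Peking University team. Its memory-centric multi-agent design chains reading the literature, proposing ideas, running experiments, writing papers, and replying to reviewers into a self-evolving closed loop. This post walks through its four modules (SciMem / SciFlow / SciDAG / SciEvolve) and two end-to-end case studies from scratch, and critically analyzes the limitations of its evaluation methodology and its scope of applicability.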

  • EN

    Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models

    GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, yielding more stable and efficient RL for LLMs — and crucially fixing the training collapse that plagues RL on large Mixture-of-Experts models. A from-scratch walkthrough of the math and algorithm, plus a critical look at what the paper leaves untested.

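  • ZH

    Group Sequence Policy Optimization: An RL Training Method That Corrects GRPO with Sequence-Level Importance Sampling

    GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, making RL training for LLMs more stable and more efficient, and resolving the RL training collapse seen on large MoE models. This post explains its mathematical motivation and algorithmic details from scratch and critically analyzes what the paper has not yet validated.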

  • EN

    IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

    How double-sided KL-aware whitening, adaptive heterogeneous rank allocation, and loss-aware remapping combine to push SVD-based LLM compression to a new state of the art — with 4.34× decode throughput and minimal quality loss even at 60% parameter removal.

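  • ZH

    IO-SVD: An Adaptive-Rank LLM Compression Method Based on Input-Output Two-Sided Whitening

    KL-divergence-aware two-sided whitening, greedy heterogeneous rank allocation, and loss-aware quantization remapping: three techniques that together push SVD compression to a new SOTA, lowering PPL to 5.59 on LLaMA-7B at an 80% retention ratio while delivering a 4.34× decode throughput improvement.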

  • EN

    Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

    How Moonshot AI's Kimi serving platform redesigned LLM infrastructure around KV cache disaggregation, achieving throughput gains of up to 525% for long-context workloads while maintaining strict TTFT and TBT SLO compliance.

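  • ZH

    Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving

    How Moonshot AI (Kimi) redesigned its entire LLM serving system around the scheduling, reuse, and migration of the KV Cache, achieving a 525% throughput improvement in long-context scenarios while meeting strict TTFT and TBT latency SLOs.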

  • EN

    SimPO: Simple Preference Optimization with a Reference-Free Reward

    SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, eliminates the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard while keeping response length controlled. The Gemma-2-9B-it SimPO model ranked #1 on Chatbot Arena among all <10B models.

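  • ZH

    SimPO: Simple Preference Optimization Without a Reference Model

    SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, removes the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard without inflating response length. The SimPO model built on Gemma-2-9B-it ranks first among all sub-10B models in Chatbot Arena's real human voting.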
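
As a rough illustration of the SimPO summary above, the sketch below computes a length-normalized reward and a Bradley-Terry loss with a target margin for one preference pair. The function names, the plain-list inputs, and the default values for `beta` and `gamma` are assumptions made for this example; they are not taken from the post or from any official implementation.

```python
import math

def simpo_reward(token_logprobs, beta):
    """Length-normalized reward: beta times the average token log probability."""
    return beta * sum(token_logprobs) / len(token_logprobs)

def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=1.0):
    """Bradley-Terry loss with a target reward margin gamma, for a single pair.

    beta and gamma defaults are illustrative placeholders, not tuned values.
    No reference model appears anywhere: only the policy's own log probabilities.
    """
    margin = (
        simpo_reward(chosen_logprobs, beta)
        - simpo_reward(rejected_logprobs, beta)
        - gamma
    )
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical per-token log probabilities for a preferred and a rejected response
chosen = [-0.5, -0.5]
rejected = [-1.0, -1.0, -1.0, -1.0]
loss = simpo_loss(chosen, rejected)
```

Because the reward is an average rather than a sum, a longer response does not earn a larger reward just by having more tokens, which is the length-control property the summary mentions.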