116 posts in total. Keep on posting.

Showing posts 25–36 of 116. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

  • EN

    SpecGuard: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

    A technical review of SpecGuard, a verification-aware speculative decoding method that uses model-internal attention and log-probability signals to improve multi-step reasoning efficiency.

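  • SpecGuard: Verification-Aware Speculative Decoding for Multi-Step Reasoning

    Reading notes on SpecGuard: it uses model-internal attention and log-probability signals to improve speculative-decoding verification in multi-step reasoning settings.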

  • EN

    GRASP Technical Review: Replacing Redundant LLM Layers with Adaptive Singular Parameters

    A detailed review of GRASP, which replaces redundant transformer layers with gradient-selected adaptive singular parameters instead of simply deleting layers or keeping only the largest singular values.

  • EN

    PipeDream: Turning Pipeline Parallelism into a Practical Training System — Deep Technical Review

    1. Why this paper still matters in 2026. I think PipeDream is one of those papers that is easier to appreciate after the field has moved on. If I had to explain it in one sentence, I would say that PipeDream turned pipeline parallelism from a vague idea into a system-level recipe: profile the model, partition it automatically, keep multiple minibatches in flight, and repair the optimization semantics enough that training still converges. That sounds modest today because pipeline parallelism is now standard vocabulary in large-model training, but in 2018 it was an important systems step. The paper is historically important for at least four reasons. (1) It shows clearly that data parallelism is not always the right default: when models become large, or when interconnects are slow relative to GPU speed, weight synchronization becomes a real bottleneck. (2) It reframes pipeline parallelism as a joint scheduling and optimization problem, not just a diagram in which layers are placed on different GPUs. (3) It identifies the subtle but crucial issue of parameter-version mismatch between forward and backward passes, the kind of detail that separates a classroom concept from a production system. (4) It anticipates much of the design space that later became standard in large-scale training stacks: stage partitioning, pipeline schedules, weight-version policies, stage replication, and runtime-managed buffer reuse. I also think the paper is still useful for modern readers because it teaches a systems mindset that remains valid: first find the actual bottleneck, then pick the right parallelization dimension, then ask what semantic damage the optimization introduces, then engineer around that damage carefully. That sequence is still how good ML systems work gets done today.
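    The parameter-version mismatch mentioned above can be made concrete with a toy example. The sketch below is a minimal illustration, not PipeDream's implementation: it assumes a single scalar-weight stage, and the class and method names (`PipelineStage`, `stash`) are hypothetical. It shows the idea the PipeDream paper calls weight stashing: each in-flight minibatch remembers the weight version its forward pass used, so its backward pass differentiates against that same version even if the weight has been updated in between.

```python
from typing import Dict, Tuple


class PipelineStage:
    """Toy one-parameter stage computing y = weight * x (hypothetical example)."""

    def __init__(self, weight: float, lr: float = 0.1) -> None:
        self.weight = weight
        self.lr = lr
        # Maps minibatch id -> (weight version used in forward, input x).
        self.stash: Dict[int, Tuple[float, float]] = {}

    def forward(self, mb_id: int, x: float) -> float:
        # Remember which weight version this minibatch saw.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id: int, grad_out: float) -> float:
        # Use the stashed weight, not the current one, so forward and
        # backward of the same minibatch agree on the parameter version.
        w_used, x = self.stash.pop(mb_id)
        grad_w = grad_out * x
        grad_x = grad_out * w_used
        self.weight -= self.lr * grad_w
        return grad_x


stage = PipelineStage(weight=2.0)
stage.forward(0, 1.0)            # minibatch 0 sees weight 2.0
stage.forward(1, 1.0)            # minibatch 1 is in flight too, also sees 2.0
stage.backward(0, 1.0)           # updates the weight to about 1.9
grad_x_mb1 = stage.backward(1, 1.0)  # still differentiates against 2.0
```

    Without the stash, the second backward pass would use the already-updated weight, which is exactly the forward/backward inconsistency the excerpt describes.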

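  • PipeDream: Turning Pipeline Parallelism into a Truly Trainable System (In-Depth Reading Notes)

    1. Why this paper is still worth reading in 2026. If I had to summarize this paper in one sentence, I would say: the value of PipeDream is not just “cutting the model into segments that run on different GPUs,” but turning pipeline parallelism into a complete training system: first profile, then partition, then schedule, while handling parameter-version consistency, and finally measure the system's value by time-to-accuracy. Today, when people discuss large-model training, terms such as pipeline, tensor parallel, ZeRO, FSDP, and activation checkpointing are routine, so looking back, PipeDream can seem like just one of many early works. But placed back in the context of 2018, the paper did several crucial things. It stated clearly that data parallelism is not always the right default. It moved pipeline parallelism from a “concept diagram” to a system design that can be implemented, verified, and compared. It pinned down a fundamental question: if the forward and backward passes of the same minibatch see different versions of the parameters, does that break training semantics? It made many concepts in later large-model training systems easier to express, such as stage partitioning, 1F1B scheduling, weight versions, and stage replication. I think it still deserves a careful read today, not because “it can be used directly to train the latest LLMs,” but because it teaches an important systems way of thinking: first find the real bottleneck; then decide which form of parallelism to use; then ask whether that parallelism breaks training semantics; and only then turn to engineering at the runtime and implementation level. That way of thinking is not at all outdated today.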

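  • LayerSkip: Making “Early Exit + Self-Verifying Inference” a Deployable Option for Large Models (In-Depth Reading Notes)

    1. Why this paper deserves a careful read. If I had to summarize this paper in the plainest possible sentence: it lets a single large model “guess quickly with its first few layers, then check and correct in batches with its later layers,” achieving a clear speedup without introducing a second draft model. This looks like an “inference trick,” but it is really a joint design of training and deployment. The core pain points of large-model inference today are: each generated token usually has to traverse the full depth of the network; autoregression creates a serial bottleneck that cannot be parallelized at scale the way training can; latency and cost are both high; and multi-model speculative decoding, although effective, raises memory usage and engineering complexity. The value of LayerSkip is that it is not a simple “post-hoc acceleration patch” but three coordinated steps: during training, make the early layers more predictive; during inference, let early-exit layers draft tokens first; then use the remaining deeper layers of the same model to verify and correct, reusing the cache to reduce extra overhead. The representative speedups reported in the paper are: up to 2.16× on CNN/DM; up to 1.82× on coding; up to 2.0× on TOPv2. If you are a systems engineer, the most important thing about this paper is not “the number 2.16×” but the question of longer-term value that it raises: can we build “inference that can be accelerated” into the model's capability structure at training time, rather than squeezing it out at deployment? That is a question about direction, and LayerSkip gives a workable answer.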

  • EN

    LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding — Deep Technical Review

    1. Why this paper matters. If I had to explain this paper to a non-specialist in one sentence, I would say: the paper teaches a large language model to make decent predictions from earlier layers, then uses the remaining layers as a built-in checker so that inference becomes faster without needing a second draft model. That sounds simple, but it addresses a very real systems bottleneck. Modern LLM inference is expensive because each generated token usually pays for the full depth of the model. If a model has 32 or 40 transformer layers, then every next token runs through essentially all of them. That is painful for three reasons: latency is high, GPU cost is high, and memory pressure becomes a serious deployment constraint. A lot of acceleration work tries to reduce one of these costs through quantization, sparsity, pruning, or a separate draft model. Those are useful directions, but they all come with trade-offs: quantization can hurt quality or require hardware-aware kernels, sparsity often needs special kernels to pay off, and separate-model speculative decoding adds engineering complexity and increases memory footprint. What LayerSkip tries to do is elegant in a systems sense: train one model so that its intermediate layers are more predictive, let those early layers draft tokens, let the later layers verify and correct them, and reuse shared computation and cache because draft and verification come from the same network. I like this paper because it sits exactly at the boundary between model training design and serving systems design. It is not merely “here is a trick that is 3% better on one benchmark.” It asks a deeper question: can we train the model so that its internal depth becomes more usable at inference time? That is a powerful framing. Instead of treating inference optimization as something that happens only after training, the authors redesign training so that faster inference becomes natural. The headline results justify paying attention: up to 2.16× speedup on CNN/DM summarization, up to 1.82× speedup on coding, and 2.0× speedup on TOPv2 semantic parsing, with code and checkpoints open sourced. For an inference paper, that is already respectable. But the deeper contribution is conceptual: the paper turns one deep model into an ensemble of sub-models of different depths plus a built-in verifier.
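    The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not LayerSkip's code: `draft_next` stands in for an early-exit prediction and `full_next` for the full-depth model, both hypothetical callables that map a token list to the next token, and acceptance is plain greedy matching. A real system would verify all drafted tokens in one batched forward pass and reuse the KV cache; here verification is a simple loop for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(next_token: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding, used as the reference behaviour."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def self_speculative_decode(
    draft_next: NextToken,
    full_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft: the cheap early-exit path proposes up to draft_len tokens.
        k = min(draft_len, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: the full model checks each position; keep its token,
        #    and stop at the first disagreement with the draft.
        accepted: List[int] = []
        for proposed in draft:
            verified = full_next(tokens + accepted)
            accepted.append(verified)
            if verified != proposed:
                break
        tokens.extend(accepted)
    return tokens


def toy_full_next(tokens: List[int]) -> int:
    # Hypothetical stand-in for the full-depth model.
    return (tokens[-1] * 3 + 1) % 7


def toy_draft_next(tokens: List[int]) -> int:
    # Hypothetical early-exit path: agrees with the full model most of the time.
    guess = toy_full_next(tokens)
    return guess if tokens[-1] % 3 != 0 else (guess + 1) % 7


spec_out = self_speculative_decode(toy_draft_next, toy_full_next, [1], 10)
ref_out = greedy_decode(toy_full_next, [1], 10)
```

    Because every kept token is the full model's own choice, the output matches ordinary greedy decoding; the speedup in practice comes from how many drafted tokens are accepted per verification pass.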

  • EN

    Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts — Deep Technical Review

    1. Why this paper matters. If I had to explain this paper to a non-specialist in one sentence: the paper tries to make reward models less like mysterious black boxes and more like structured judges that can say, in effect, “I value helpfulness this much, safety this much, and verbosity this much for this prompt.” That is a very important problem. In modern RLHF pipelines, the reward model is often the quiet center of power. People talk more about PPO, DPO, rejection sampling, or the final chatbot behavior, but the reward model is the component that decides what counts as “good.” If that judge is biased, the whole pipeline can drift in a strange direction. A classic example is verbosity bias: the reward model gives higher scores to longer answers, the policy learns to write longer answers, and humans then receive bloated, repetitive outputs that are not actually better. So the question is not merely “can we train a reward model?” We already can. The deeper question is: can we build a reward model whose internal preferences are more interpretable, more controllable, and less vulnerable to hidden shortcuts? This paper answers with a fairly elegant design: predict multiple human-readable reward dimensions first, then learn a prompt-dependent gating network that decides how to combine them, while explicitly correcting for verbosity correlation. Even though the paper is short, the design idea is rich. It touches several central issues in alignment: how to represent human preference, how to keep reward models from becoming opaque hacks, how to move beyond simple pairwise wins and losses, and how to separate “what is being judged” from “how those judgments are combined.” I think this makes the paper more important than its page count suggests.
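    The “predict several reward dimensions, then combine them with prompt-dependent weights” design can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the softmax gate over hand-supplied logits, and the linear form of the verbosity correction (subtracting a per-objective multiple of a verbosity score) are all assumptions made for the example. In a real model the rewards and gating logits would come from learned heads.

```python
import math
from typing import Dict, Optional, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn gating logits into non-negative weights that sum to 1."""
    m = max(logits.values())
    exps = {name: math.exp(v - m) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def combine_rewards(
    objective_rewards: Dict[str, float],
    gating_logits: Dict[str, float],
    verbosity_reward: float = 0.0,
    penalty: Optional[Dict[str, float]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return (scalar score, per-objective weights).

    gating_logits must have the same keys as objective_rewards.
    penalty[name] is an assumed per-objective coefficient that removes
    the part of that objective which tracks answer length.
    """
    penalty = penalty or {}
    weights = softmax(gating_logits)
    adjusted = {
        name: r - penalty.get(name, 0.0) * verbosity_reward
        for name, r in objective_rewards.items()
    }
    score = sum(weights[name] * adjusted[name] for name in adjusted)
    return score, weights


rewards = {"helpfulness": 0.8, "safety": 0.6}
gate = {"helpfulness": 0.0, "safety": 0.0}  # equal weights for this prompt
plain_score, plain_weights = combine_rewards(rewards, gate)
corrected_score, _ = combine_rewards(
    rewards, gate, verbosity_reward=0.5, penalty={"helpfulness": 0.4}
)
```

    The returned weights are what makes the judge inspectable: for a given prompt, one can read off how much each objective contributed to the final score.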

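  • ArmoRM: Interpretable Preference Learning with “Multi-Objective Reward Modeling + Mixture-of-Experts Gating” (In-Depth Reading Notes)

    1. Why this paper deserves a careful read. If I had to describe this paper in one very blunt sentence: it is not “building yet another, bigger reward model”; it is trying to turn the reward model from a black-box scorer into a preference-judgment system that is “decomposable, inspectable, and adjustable in its weights.” This matters a great deal in RLHF, because in many alignment pipelines the component with the most “invisible power” is neither PPO nor DPO but the reward model: it decides which answers are judged “good”; its biases are amplified by subsequent policy optimization; and once it is wrong, the model will “steadily work harder in the wrong direction.” The most typical failure is verbosity bias: the reward model implicitly prefers longer answers; the policy model learns that “longer is safer”; and end users get not better answers but answers that are wordier, more roundabout, and sometimes lower in information density. So the real question of this paper is not “can a reward model be built?” That question was answered long ago. It aims to answer a deeper one: can the reward model be given a structure that is “multi-dimensional, interpretable, and dynamically adjustable by scenario,” reducing black-box bias and the risk of reward hacking? I think the paper has identified exactly the right question.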

  • EN

    Toolformer: Language Models Can Teach Themselves to Use Tools — Deep Technical Review

    1. Why this paper still matters in 2026. If I had to explain this paper in one sentence to a non-technical reader: Toolformer teaches a language model to decide by itself when to ask outside tools for help, and then to use the returned information inside normal text generation. That sounds simple, but the timing of this paper was very important. In the early waves of LLMs, people observed a paradox: large models were amazing at fluent writing, yet the same models were often bad at arithmetic, date reasoning, up-to-date facts, and precise retrieval. A common workaround was to design prompting pipelines by hand: "For this benchmark, always call the calculator first," or "For this benchmark, use retrieval prompt template X." But those pipelines were usually task-specific and hand-wired. Toolformer asked a deeper systems question: can the model itself learn when and how to call tools, from self-supervised signals, without large human-annotated datasets for tool usage? This question is still central in 2026 because production AI systems now rely heavily on tool use: search, code execution, calculators, calendars, retrieval, and domain APIs. The paper is not "the final answer" to tool-using agents, but it gives a clear baseline recipe with measurable gains.

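  • Toolformer: Letting Language Models Learn for Themselves “When to Call a Tool” (In-Depth Reading Notes)

    1. Why this paper still matters today (2026). First, the plainest possible summary: the core of Toolformer is not “bolting tools onto a model” but “letting the model learn for itself when to call which tool, and how to feed the tool's result back into generation.” This is the key point. Early large models gave the impression that they “could do anything,” but anyone building real systems quickly ran into several typical problems: arithmetic is unreliable, especially multi-step calculation; date and time reasoning is error-prone; recent facts may be out of date; and factual question answering produces hallucinations. The common approach used to be hand-written workflows: call the calculator first for this task; always run retrieval first for that task; then bolt on a fixed prompt template. This works, but it is very much a “hand-built pipeline” and transfers poorly. The value of Toolformer is that it posed a more automated question: can a model learn a tool-use policy from self-supervised signals, with almost no large-scale human annotation? In 2026 this is still one of the core problems for industrial agent systems, so the paper remains worth studying.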

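  • Voyager: An LLM Embodied Agent That Keeps Growing in Minecraft (In-Depth Reading Notes)

    1. Why this paper deserves “a whole weekend block of time” for a careful read. If I had to summarize this paper in the plainest possible sentence, I would put it this way: the core of Voyager is not “getting the model to answer one question correctly” but “letting the model, like a player who grows, keep exploring, keep accumulating, and keep getting stronger in the world.” This sentence is key. Many early LLM agents look smart because they can explain a problem, write a plan, call a tool, and complete one loop. But their common weaknesses are also obvious: every attempt is like “doing the problem for the first time”; successful experience does not necessarily settle into reusable capability; long tasks tend to collapse midway; and there is no stable “capability growth curve.” What Voyager really tries to answer are harder questions. Can an agent keep exploring in an open world with no fixed endpoint? Can it automatically choose “the next task that is appropriate right now”? Can it distill successful actions into skills it can reuse later? Can it transfer learned skills to a new world and keep solving new tasks? This is no longer the “chatbot paradigm”; it is clearly closer to a “continual learning system paradigm.” One thing I appreciate about this paper is that it does not claim that “AGI is solved.” It does solid systems engineering: a task-selection mechanism; code-based action generation; feedback-driven repair; skill storage and retrieval; and interpretable evaluation metrics. It does not retrain a very large model; it relies mainly on prompt structure design, the way memory is organized, a closed loop of execution feedback, and programmatic action abstraction. In other words, the paper's most important contribution is not “a bigger model” but “a more correct agent architecture.” That is also why it is still worth reading closely today.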