2026

05-19 EN

KTO: Model Alignment as Prospect Theoretic Optimization — Technical Blog Review

Technical review of KTO (Ethayarajh et al., Stanford / Contextual AI, ICML 2024, arXiv:2402.01306): the paper reframes DPO and PPO-Clip through Kahneman-Tversky prospect theory as members of a family of Human-Aware Losses (HALOs), then derives Kahneman-Tversky Optimization, an alignment objective that needs only a binary desirable/undesirable signal per response rather than preference pairs. KTO matches or exceeds DPO from Pythia-1.4B to Llama-30B (GSM8K +13.5 pts on Zephyr-β-SFT/UltraFeedback) and stays robust under 1:10 class imbalance via λ_D / λ_U reweighting.
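05-19 ZH

KTO: Model Alignment as Prospect-Theoretic Optimization (Reading Notes)

Reading notes on KTO: the paper places DPO and PPO-Clip within the Kahneman-Tversky prospect theory framework, unifies them as Human-Aware Losses (HALOs), and then derives Kahneman-Tversky Optimization, which needs only a binary desirable/undesirable signal. KTO matches or exceeds DPO at every scale from Pythia-1.4B to Llama-30B (GSM8K +13.5 pts on Zephyr-β-SFT + UltraFeedback) and stays robust under 1:10 class imbalance. Stanford / Contextual AI, ICML 2024, arXiv 2402.01306.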
05-18 EN

Why Single-Agent LLMs Beat Multi-Agent Systems on Multi-Hop Reasoning — A Budget-Controlled Story

Technical review of Tran & Kiela (Stanford, arXiv 2604.02460): once you fix the thinking-token budget as the sole resource axis, single-agent LLMs (SAS) match or beat every multi-agent architecture (Sequential / Subtask-parallel / Parallel-roles / Debate / Ensemble) across a 336-configuration matrix (Qwen3-30B-A3B, DeepSeek-R1-Distill-Llama-70B, Gemini-2.5-Flash/Pro × FRAMES + MuSiQue 4-hop × 100–10000 tokens). The paper grounds this in a clean Data Processing Inequality argument, identifies the regime flip under heavy context degradation (substitution/masking at α=0.7), and audits the Gemini 2.5 thinking_budget API artifact that motivates the SAS-L scaffold.
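05-18 ZH

Why a Single Agent Beats Multi-Agent Systems Once the Thinking Budget Is Fixed (Reading Notes)

Reading notes on Tran & Kiela (Stanford, arXiv 2604.02460): with the thinking-token budget as the sole resource axis, a single agent (SAS) matches or beats the strongest multi-agent architecture (Sequential / Subtask-parallel / Parallel-roles / Debate / Ensemble) almost everywhere across 336 configurations (Qwen3-30B-A3B / DeepSeek-R1-Distill-70B / Gemini-2.5-Flash/Pro × FRAMES + MuSiQue 4-hop × budgets of 100–10000). The paper gives a Bayesian argument based on the Data Processing Inequality, describes the phase change in which the DPI result reverses under context degradation, and audits the Gemini 2.5 thinking_budget API metering artifact (the origin of the SAS-L prefix).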
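05-17 ZH

PipeSD: A Cloud-Edge Collaborative Pipeline Inference Framework Based on Speculative Decoding (Reading Notes)

PipeSD treats cloud-edge collaborative speculative decoding as a three-resource (draft, network, verify) pipelining problem. Using DP-optimal token-batch scheduling plus a dual-threshold NAV trigger, it improves TPT by 1.16×–2.16× and cuts energy consumption by 14.3%–25.3% on a real cloud-edge testbed.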
05-17 EN

PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review

PipeSD reframes cloud-edge speculative decoding as a three-resource pipelining problem (draft, network, verify) and shows that DP-optimal token-batch scheduling plus a confidence-based verify trigger together yield 1.16×–2.16× TPT improvement on a real edge-cloud testbed.
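05-16 ZH

Explaining the Speedup Curve of Speculative Decoding in Real Serving with Little's Law (Reading Notes)

Reading notes on the latency model for speculative decoding in real serving proposed by Kong et al. Using a roofline-style latency decomposition plus Little's Law, the paper collapses the latency curves for different RPS levels, models, and hardware onto a single universal 1/(1-x) form, and gives a mechanistic explanation of why batch=1 SD speedups vanish under high load.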
05-16 EN

An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review

A detailed technical review of Kong et al.'s interpretable latency model for speculative decoding under real serving workloads. Using a roofline-style decomposition plus Little's Law, the paper collapses RPS-versus-latency curves onto a single universal form and gives a mechanistic explanation for why batch=1 SD speedups erode under load.
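05-15 ZH

Zero Sum SVD: An LLM Compression Method That Uses a Zero-Sum Loss Rule for Global Singular-Value Budget Allocation

Chinese-language reading notes on Zero Sum SVD: the method pushes the singular values of all layers into one global priority queue and uses signed loss sensitivities with a greedy zero-sum conservation rule to set the rank budget for the whole model in a single pass, so that heterogeneous per-layer ranks fall out naturally from one scalar constraint.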
05-15 EN

Zero Sum SVD: A Global, Loss-Aware Rank Budget for LLM Compression

A detailed technical review of Zero Sum SVD, which replaces per-layer rank optimization with a global, signed loss-sensitivity heap and a greedy zero-sum rule, letting heterogeneous per-layer ranks fall out of one scalar conservation law.
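05-14 ZH

DisagMoE: Breaking the All-to-All Bottleneck in MoE Training by Decoupling Attention and FFN

Chinese-language reading notes on DisagMoE: attention and FFN are placed on separate GPU pools and joined by the AF-Pipe schedule and M2N communication primitives, hiding the all-to-all bottleneck of MoE training beneath computation.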
05-14 EN

DisagMoE: Disaggregating Attention and FFN to Beat the MoE All-to-All Bottleneck

A detailed technical review of DisagMoE, which disaggregates attention and FFN layers onto separate GPU pools and stitches them together via the AF-Pipe schedule to hide the MoE all-to-all bottleneck during training.