Page 5 / 10
116 posts in total. Keep on posting.
Showing posts 49–60 of 116. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- 中
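Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — In-Depth Reading Notes
Switch Transformer routes each token to a single expert, enabling efficient training of sparse trillion-parameter models at a comparable compute budget.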
- EN
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — In-Depth Technical Review
AWQ uses activation-aware scaling to protect salient weights during low-bit quantization, enabling strong LLM accuracy with efficient deployment on edge and server hardware.
- 中
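AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — In-Depth Reading Notes
AWQ uses activation-aware scaling to protect the most critical weights, retaining strong accuracy under low-bit quantization while remaining suitable for deployment on real devices.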
- 中
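GPipe: Large-Scale Model Training with Micro-Batch Pipeline Parallelism — In-Depth Reading Notes
GPipe proposes micro-batch pipeline parallelism for efficient training of large-scale neural networks. This post explains from scratch the pipeline scheduling algorithm, gradient accumulation, re-materialization for memory optimization, and the experimental results on AmoebaNet and Transformer.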
- EN
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism — In-Depth Technical Review
GPipe introduces micro-batch pipeline parallelism for efficient training of large neural networks. This review covers the pipeline scheduling algorithm, gradient accumulation, re-materialization for memory optimization, and experimental results on AmoebaNet and Transformer models.
- EN
Layer Pruning for Efficient Large Language Models — In-Depth Technical Review
Layer pruning removes redundant layers from LLMs to reduce compute and memory costs. Covers layer importance metrics, pruning strategies, and fine-tuning recovery.
- EN
Constitutional AI: Harmlessness from AI Feedback — In-Depth Technical Review
Constitutional AI trains harmless AI assistants using AI-generated feedback instead of human labels. Covers the critique-revision pipeline, RLAIF, and comparison with RLHF.
- EN
Chain-of-Thought Prompting Elicits Reasoning in LLMs — In-Depth Technical Review
Chain-of-Thought prompting enables LLMs to perform complex reasoning by generating intermediate steps. Covers few-shot CoT, zero-shot CoT, and analysis across arithmetic, commonsense, and symbolic tasks.
- EN
Ring Attention: Blockwise Transformers for Near-Infinite Context — In-Depth Technical Review
Ring Attention enables near-infinite context length by distributing attention computation across devices in a ring topology. Covers blockwise computation, online softmax, and memory analysis.
- EN
Mamba: Linear-Time Sequence Modeling with Selective State Spaces — In-Depth Technical Review
Mamba introduces selective state space models as an alternative to Transformers with linear-time complexity. Covers selective scan, hardware-aware algorithms, and language modeling results.
- EN
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — In-Depth Technical Review
GaLore reduces memory requirements for LLM training through gradient low-rank projection. Covers the mathematical foundation, subspace switching, and memory savings analysis.
- EN
Alpa: Automating Inter- and Intra-Operator Parallelism — In-Depth Technical Review
Alpa automates the search for optimal parallelism strategies combining data, tensor, and pipeline parallelism. Covers the ILP formulation for intra-operator parallelism, the dynamic programming pass for inter-operator parallelism, and the compilation framework.