Tag

#Distributed Training

29 posts tagged with this label.

2026

07-02 EN

Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization
07-02 ZH

Tangram: An Efficient LLM Parallelization System That Hides Hardware Differences in Heterogeneous GPU Clusters
06-13 EN

ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight
06-13 ZH

ForeMoE: Micro-step-level MoE Load Balancing in RL Post-training via Routing Foresight
06-11 EN

MegaScale: Engineering 55% MFU at 12,288 GPUs for LLM Training
06-11 ZH

MegaScale: How ByteDance Achieves 55% MFU on 12,288 GPUs for Large-Scale LLM Training
05-15 EN

Zero Sum SVD: A Global, Loss-Aware Rank Budget for LLM Compression
05-15 ZH

Zero Sum SVD: An LLM Compression Method That Uses a "Zero-Sum Loss" Rule for Global Singular-Value Budget Allocation
05-14 EN

DisagMoE: Disaggregating Attention and FFN to Beat the MoE All-to-All Bottleneck
05-14 ZH

DisagMoE: Disaggregating Attention and FFN to Break Through the All-to-All Bottleneck in MoE Training
05-07 EN

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
04-29 EN

FEPLB Technical Review: Nearly Free MoE Load Balancing with the NVLink Copy Engine
04-24 EN

FEPLB: Zero-Cost MoE Load Balancing via NVLink Copy Engine
04-16 EN

PipeDream: Turning Pipeline Parallelism into a Practical Training System — Deep Technical Review
04-16 ZH

PipeDream: Turning Pipeline Parallelism into a Truly Trainable System (In-Depth Reading Notes)
04-04 EN

Switch Transformers: Scaling to Trillion-Parameter Sparse Models — In-Depth Technical Review
04-04 ZH

Switch Transformers: Scaling to Trillion-Parameter Models with Simple and Efficient Sparsity (In-Depth Reading Notes)
04-02 EN

GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism — In-Depth Technical Review
04-02 ZH

GPipe: Large-Scale Model Training with Micro-Batch Pipeline Parallelism (In-Depth Reading Notes)
03-29 EN

Ring Attention: Blockwise Transformers for Near-Infinite Context — In-Depth Technical Review
03-26 EN

Alpa: Automating Inter- and Intra-Operator Parallelism — In-Depth Technical Review
03-19 EN

ZeRO: Shattering the Memory Wall — How DeepSpeed Trains Trillion-Parameter Models
03-12 EN

Megatron-LM: NVIDIA's Blueprint for Training Billion-Parameter Models at Scale
03-12 EN

PaRO: Smarter Partitioning for Distributed Training — Beyond ZeRO's One-Size-Fits-All

2020

09-25 EN

Slurm-Day5
09-09 EN

Slurm-Day4
09-05 EN

Slurm-Day2
09-05 EN

Slurm-Day3
09-04 EN

Slurm-Day1