Research

Research portfolio

My research spans efficient machine learning and systems, from model pretraining quality to algorithm and system co-design for LLM training, inference, and agent infrastructure. Projects are grouped below into five long-running themes. For paper details and authors, see the Selected Publications section on the home page or my Google Scholar profile.

Efficient ML Algorithm

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning (Efficient training · NeurIPS 2024)
Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping (Efficient inference · ICML 2025)
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention (Efficient inference · ICLR 2026)
Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient (Efficient training)
Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time (Efficient inference)
I-DLM: Introspective Diffusion Language Models (Efficient inference)
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution (Efficient inference)
VocabPrune (Efficient inference)
Diffusion Router (Efficient inference)
MixOfSpeculator: Mix-Architecture Speculator Design (Efficient inference)
Phoenix Speculator (Efficient inference)
Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation (Efficient training)
Bio-Inspired LLM-Based Multiagent Systems (Efficient inference)
Tail Likelihood Reinforcement Learning (Efficient training)
Scaling Law of Speculative Decoding (Efficient inference)

Efficient ML System

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (RL training system · DeepSpeed)
Aurora: When RL Meets Adaptive Speculative Training — A Unified Training-Serving System (Speculator training system · ICML 2026)
Pre-Expedite: Hierarchical Structure Space for Improving Small File Access in Parallel File Systems (ML file system)
HybridShare: Universal Resource Scheduling for Hybrid Jobs (ML scheduling system)
MAEM: Multiple-Application co-Execution Time Estimation (ML scheduling system)
EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning (RL training system)
XoRL (RL training system)
Hierarchical Performance Isolation for Distributed LLM (Agent system)
AgentGO (Agent system)
Smart KV (Agent system)
Universal KV System (Agent system)

Quantization

Flash-LLM: Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (VLDB 2024)
Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design on Modern GPUs (USENIX ATC 2024)
KITTY: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost (MLSys 2026)
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (2026)

Modeling

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
CoderForge-Preview (TogetherAI Blog)
Loop Diffusion

Survey

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Models (IEEE TPAMI)
Survey of LLM Agents

Looking to collaborate?

Feel free to reach out at zhongzhu.zhou@sydney.edu.au if your interests align with efficient ML systems, LLM training/serving infrastructure, quantization, or coding-agent research. For a role-by-role breakdown of my contributions (motivation and specific contributions), see the Experience page.