Research

Research portfolio

My research spans efficient machine learning and systems, from model pretraining quality to algorithm-system co-design for LLM training, inference, and agent infrastructure. Projects are grouped below into five long-running themes. For paper details and author lists, see the Selected Publications section on the home page or my Google Scholar profile.

Efficient ML Algorithm

  • CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning (Efficient training · NeurIPS 2024)
  • Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping (Efficient inference · ICML 2025)
  • CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention (Efficient inference · ICLR 2026)
  • Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient (Efficient training)
  • Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time (Efficient inference)
  • I-DLM: Introspective Diffusion Language Models (Efficient inference)
  • Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution (Efficient inference)
  • VocabPrune (Efficient inference)
  • Diffusion Router (Efficient inference)
  • MixOfSpeculator: Mix-Architecture Speculator Design (Efficient inference)
  • Phoenix Speculator (Efficient inference)
  • Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation (Efficient training)
  • Bio-Inspired LLM-Based Multiagent Systems (Efficient inference)
  • Tail Likelihood Reinforcement Learning (Efficient training)
  • Scaling Law of Speculative Decoding (Efficient inference)

Efficient ML System

  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (RL training system · DeepSpeed)
  • Aurora: When RL Meets Adaptive Speculative Training — A Unified Training-Serving System (Speculator training system · ICML 2026)
  • Pre-Expedite: Hierarchical Structure Space for Improving Small File Access in Parallel File Systems (ML file system)
  • HybridShare: Universal Resource Scheduling for Hybrid Jobs (ML scheduling system)
  • MAEM: Multiple-Application co-Execution Time Estimation (ML scheduling system)
  • EmReal: A Digital Twin Framework of Emulated and Real Components for Robots with Reinforcement Learning (RL training system)
  • XoRL (RL training system)
  • Hierarchical Performance Isolation for Distributed LLM (Agent system)
  • AgentGO (Agent system)
  • Smart KV (Agent system)
  • Universal KV System (Agent system)

Quantization

  • Flash-LLM: Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (VLDB 2024)
  • Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design on Modern GPUs (USENIX ATC 2024)
  • KITTY: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost (MLSys 2026)
  • SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
  • OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (2026)

Modeling

  • DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
  • CoderForge-Preview
  • Loop Diffusion

Survey

  • RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Models (IEEE TPAMI)
  • Survey of LLM Agents

Looking to collaborate?

Feel free to reach out at zhongzhu.zhou@sydney.edu.au if your interests align with efficient ML systems, LLM training/serving infrastructure, quantization, or coding-agent research. For a role-by-role breakdown of my contributions (motivation and specific contributions), see the Experience page.