Improving Performance of Deep Learning Training and Inference in Machine Learning Clusters


Abstract

Deep Learning (DL) has driven significant advances across numerous domains, but its training and inference jobs are resource-intensive and often bound by strict deadlines. Distributed DL training divides datasets into smaller tasks processed by multiple workers. A major challenge in these environments is managing stragglers: slow tasks caused by factors such as resource contention, hardware heterogeneity, or insufficient resources. Stragglers delay training iterations and waste resources, as faster workers sit idle while waiting, which is especially problematic in dynamic, shared, or heterogeneous environments such as production clusters or spot markets (e.g., AWS EC2).

Traditional DL training systems often assume static resource availability and rely on reactive methods, such as task migration or job re-launching, which fail to address the underlying causes and lead to inefficiencies. Proactive methods, such as spawning duplicate tasks or shrinking stragglers' data sizes, either increase resource consumption or risk compromising model accuracy.

This dissertation addresses these issues by proposing novel solutions that minimize the impact of stragglers in DL training, thereby reducing training times and improving resource efficiency. It introduces a Straggler-Avoiding DL Training Job Scheduling (SAS) system, which uses machine learning to predict each node's resource availability and builds a resource matrix for clustering nodes with similar availability. Tasks are then assigned to node groups that minimize training job completion time while ensuring a high probability of adequate resource provisioning. In addition, a deep reinforcement learning-based approach, SAS-RL, is proposed to accelerate decision-making in SAS.

To further tackle stragglers, this dissertation presents a Straggler Mitigation System (SMS), which predicts potential stragglers by exploiting the predictable computation patterns of DL tasks. SMS employs a novel batch-size reduction technique that dynamically adjusts workloads, removing similar data from clusters within a batch to preserve accuracy. SMS also integrates a gradient transfer control mechanism that bypasses redundant gradient transfers from stragglers.

Finally, for DL inference, this dissertation presents ongoing work on a memory-efficient inference system for Mixture-of-Experts (MoE) based large language models (LLMs), which predicts experts from their recurrence patterns and loads only the necessary experts into GPU memory, significantly reducing memory consumption during LLM inference.
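To make the SAS idea concrete, the sketch below shows one plausible shape of its availability-driven scheduling step: cluster nodes by their predicted resource-availability vectors, then place a task on the group most likely to satisfy its demand. This is a minimal illustration, not the dissertation's implementation; the function names, the 3-resource toy data, the KMeans choice, and the probability-threshold heuristic are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_nodes(avail_matrix: np.ndarray, n_groups: int = 3) -> np.ndarray:
    """Group nodes by similarity of their predicted resource availability.

    avail_matrix: shape (num_nodes, num_resources), e.g. predicted free
    fractions of CPU, GPU memory, and network bandwidth for the next interval.
    Returns a cluster label per node (a stand-in for SAS's resource matrix
    clustering; the real system's model is described in the dissertation).
    """
    return KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(avail_matrix)

def pick_group(avail_matrix, labels, demand, threshold=0.9):
    """Pick the node group most likely to cover a task's resource demand.

    Scores each group by the fraction of its nodes whose predicted
    availability meets the demand on every resource dimension, and keeps
    only groups above a provisioning-probability threshold (an assumed
    simplification of SAS's completion-time objective).
    """
    best_group, best_score = None, -1.0
    for g in np.unique(labels):
        nodes = avail_matrix[labels == g]
        score = np.mean(np.all(nodes >= demand, axis=1))
        if score >= threshold and score > best_score:
            best_group, best_score = g, score
    return best_group

# Toy example: 6 nodes x 3 resources (CPU, GPU memory, network), values in [0, 1].
avail = np.array([[0.90, 0.80, 0.70],
                  [0.85, 0.90, 0.75],
                  [0.30, 0.20, 0.40],
                  [0.25, 0.30, 0.35],
                  [0.60, 0.50, 0.55],
                  [0.65, 0.55, 0.60]])
labels = cluster_nodes(avail, n_groups=3)
print(pick_group(avail, labels, demand=np.array([0.5, 0.4, 0.5]), threshold=0.5))
```

Under these assumptions, the high-availability pair of nodes forms the winning group; a task is steered away from nodes whose predicted availability makes them likely stragglers, rather than migrated after the slowdown appears.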

Committee

  • Yue Cheng, Committee Chair (CS/SEAS, SDS/UVA)
  • Haiying Shen, Advisor (CS, ECE/SEAS/UVA)
  • Shangtong Zhang (CS/SEAS/UVA)
  • Chang Lou (CS/SEAS/UVA)
  • Anand Iyer (School of CS/College of Computing/Georgia Tech)