Enhancing Performance of Deep Learning Training and Inference on Resource-Constrained Edge Devices


Abstract:

As edge computing gains prominence for its local computation advantages, deep learning (DL) training and updates, alongside inference on edge devices, have become increasingly relevant. Existing training methods for DL models often rely on centralized scheduling or the remote cloud. However, scaling DL models and managing large datasets pose significant challenges in edge scenarios due to the resource constraints of edge devices. Similarly, DL inference on edge devices is challenging, despite the advantages of on-device processing with accelerators such as GPUs, NPUs, and DSPs. Because high floating-point precision incurs high energy consumption, users may limit precision to save energy. Moreover, many edge-device users are in regions where it is prohibitively expensive for manufacturers to include high-precision accelerators. As a result, low-cost edge devices are often equipped with low-precision floating-point accelerators, sacrificing accuracy.
In parallel, large language models (LLMs) have recently transformed natural language processing, enabling advanced tasks such as automated customer support, text generation, and real-time translation. However, deploying these models in real-world edge environments introduces performance bottlenecks, particularly in handling the KV cache (KVC) during inference. The KVC bottleneck can lead to frequent preemptions, increased queuing delays, and high response latencies, which are especially problematic in time-sensitive applications such as autonomous systems and healthcare diagnostics. Consequently, achieving low latency while managing memory effectively is essential for running LLMs at the edge, where constrained resources demand efficient KVC handling to meet real-time processing requirements.

This dissertation focuses on minimizing the time required for both DL training and inference on edge devices by addressing the above challenges. It proposes systematic heuristic- and reinforcement learning (RL)-based approaches that reduce training and inference time without significant accuracy loss. First, the dissertation introduces DMP, a distributed training system built on Data and Model Parallelism. DMP optimizes the training structure by clustering edge devices and leveraging geographically close nodes for data sensing and model partitioning, thereby reducing overall training time. Next, it introduces SROLE, a system that employs Shielded RL for decentralized scheduling in data- and model-parallel training to reduce the load on any single node within a cluster of edge devices. SROLE enables autonomous job scheduling at each edge node, mitigating resource overloading and action collisions, with the shared goal of reducing training time.
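To make the partitioning idea concrete, the following Python sketch splits a model's layers across a cluster of edge devices in proportion to each device's compute capability. The greedy contiguous split, the function name, and the per-layer FLOP counts are illustrative assumptions, not the dissertation's actual DMP algorithm.

# A minimal sketch of capability-proportional model partitioning (assumed,
# not DMP's actual algorithm): assign contiguous layer ranges to devices so
# that each device's share of the work roughly matches its capability.
def partition_layers(layer_flops, device_caps):
    total = sum(layer_flops)
    total_cap = sum(device_caps)
    # Cumulative work target at each device boundary.
    targets, acc = [], 0.0
    for cap in device_caps:
        acc += total * cap / total_cap
        targets.append(acc)
    parts = [[] for _ in device_caps]
    done, dev = 0.0, 0
    for layer, flops in enumerate(layer_flops):
        # Move to the next device once this one's cumulative target is met.
        while dev < len(device_caps) - 1 and done >= targets[dev]:
            dev += 1
        parts[dev].append(layer)
        done += flops
    return parts

# Example: six layers split across two weak devices and one stronger device.
print(partition_layers([4, 4, 2, 2, 1, 1], [1, 1, 2]))  # [[0], [1], [2, 3, 4, 5]]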
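Likewise, the shield in SROLE can be pictured as a safety layer that masks any scheduling action that would overload the node before the RL policy chooses among the remainder. The ActionShield class, the resource model, and the random stand-in policy below are assumptions for illustration, not SROLE's actual interfaces.

import random

class ActionShield:
    """Filters scheduling actions so resource demand never exceeds capacity."""
    def __init__(self, cpu_capacity, mem_capacity):
        self.cpu_capacity = cpu_capacity
        self.mem_capacity = mem_capacity

    def safe_actions(self, actions, cpu_used, mem_used):
        # Keep only actions whose added demand fits the remaining headroom.
        return [a for a in actions
                if cpu_used + a["cpu"] <= self.cpu_capacity
                and mem_used + a["mem"] <= self.mem_capacity]

def schedule_step(shield, actions, cpu_used, mem_used, policy=random.choice):
    # One decentralized scheduling step: shield first, then let the policy
    # pick. A trained RL agent would replace the random stand-in policy.
    safe = shield.safe_actions(actions, cpu_used, mem_used)
    if not safe:
        return None  # defer the job rather than overload the node
    return policy(safe)

# Example: three candidate placements for a training task on this node.
shield = ActionShield(cpu_capacity=4.0, mem_capacity=8.0)
actions = [
    {"job": "layers-0-2", "cpu": 1.5, "mem": 2.0},
    {"job": "layers-3-5", "cpu": 3.5, "mem": 6.5},  # would overload: masked
    {"job": "data-shard", "cpu": 1.0, "mem": 1.0},
]
print(schedule_step(shield, actions, cpu_used=1.0, mem_used=3.0))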

For inference, the dissertation proposes a system for Fast, Accurate DNN Inference on Low-Cost Edges for pre-trained models, which dynamically determines layer assignments across CPUs and accelerators using heuristic and RL methods. Finally, the dissertation introduces a Confidence-Aware Padding with Efficient Preemption technique for efficient LLM inference, a critical stepping stone toward deploying LLMs on edge devices. This system addresses the KVC bottleneck through confidence-guided KVC allocation based on response-length predictions, dynamically adjusting padding to better utilize GPU memory while reducing response latency. It also employs a cost-based heuristic to decide, in real time, between recomputation and swapping, depending on resource occupancy and system load. The proposed approaches are evaluated using well-known ML models across various applications, demonstrating their generality and effectiveness in real-world scenarios.
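As a rough illustration of confidence-guided KVC allocation, the sketch below reserves KV-cache blocks from a predicted response length plus a padding term that shrinks as the length predictor's confidence grows, so confident predictions waste less GPU memory. The block size, padding schedule, and function names are assumptions, not the system's actual design.

BLOCK_TOKENS = 16  # assumed tokens per KV-cache block

def blocks_needed(tokens):
    return -(-tokens // BLOCK_TOKENS)  # ceiling division

def allocate_kvc(prompt_tokens, predicted_len, confidence, max_padding=256):
    # Low confidence in the length prediction buys more headroom, trading
    # memory for fewer mid-generation preemptions.
    padding = int(max_padding * (1.0 - confidence))
    return blocks_needed(prompt_tokens + predicted_len + padding)

# Example: a 200-token prompt with a 300-token predicted response.
print(allocate_kvc(200, 300, confidence=0.9))  # 33 blocks: tight reservation
print(allocate_kvc(200, 300, confidence=0.3))  # 43 blocks: extra padding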
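The recomputation-versus-swapping decision can likewise be sketched as comparing two estimated costs under the current load: re-running prefill over a preempted request's cached tokens versus moving its KV cache to host memory and back over PCIe. The constants and the cost model are illustrative assumptions, not measurements from the dissertation.

def swap_cost_s(kv_bytes, pcie_gb_per_s):
    # Time to move the KV cache out to host memory and back over PCIe.
    return 2.0 * kv_bytes / (pcie_gb_per_s * 1e9)

def recompute_cost_s(cached_tokens, prefill_tokens_per_s):
    # Time to rebuild the KV cache by re-running prefill on cached tokens.
    return cached_tokens / prefill_tokens_per_s

def preemption_policy(kv_bytes, cached_tokens, pcie_gb_per_s=16.0,
                      prefill_tokens_per_s=8000.0, gpu_busy=0.5):
    # Under heavy compute load, recomputation competes with running
    # requests, so its cost is weighted up; swapping mostly consumes
    # otherwise-idle PCIe bandwidth.
    recompute = recompute_cost_s(cached_tokens, prefill_tokens_per_s) * (1.0 + gpu_busy)
    swap = swap_cost_s(kv_bytes, pcie_gb_per_s)
    return "recompute" if recompute < swap else "swap"

# Example: 1,024 cached tokens whose KV cache occupies about 0.5 GB.
print(preemption_policy(kv_bytes=512 * 2**20, cached_tokens=1024))  # "swap"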
 

Committee:  

  • Yue Cheng, Committee Chair (CS/SEAS, SDS/UVA)
  • Haiying Shen, Advisor (CS, BME/SEAS, SDS/UVA)
  • Lu Feng (CS, SIE/SEAS/UVA)
  • Yangfeng Ji (CS/SEAS/UVA)
  • Anand Iyer (School of CS/College of Computing/Georgia Tech)