Understanding and Optimizing Memory Access Behaviors via Hardware/Software Co-design
Due to the ever-increasing gap between the speed of processing elements and the rate at which memory systems can feed them with data, current computing systems are often bottlenecked by limited memory bandwidth. I hypothesize that, to alleviate the memory-wall problem, algorithms and data structures for data-intensive applications must be designed to leverage hardware features in the memory hierarchy, and that in some cases a software-hardware co-design approach is beneficial. To that end, we first propose Hopscotch, a micro-benchmark suite for memory performance characterization. It aims to fill the gap left by existing memory benchmarks by providing kernels that cover a wide range of the spatio-temporal locality spectrum. The Hopscotch suite also contains tools for visualizing memory-access patterns and for empirically deriving the Roofline model, helping users isolate performance bottlenecks and reverse-engineer memory characteristics. Next, we optimize two data-intensive applications by leveraging memory-centric hardware features in traditional general-purpose architectures. The BigMap approach optimizes the memory access behavior of a popular fuzzer, AFL, to enable large bitmaps for accurate coverage tracking. In BigMap, we propose a two-level hashing scheme that consolidates scattered random accesses, vastly improving the spatial locality of bitmap operations. In the GraphTango approach, we propose a hybrid storage format for streaming graphs and a cache-friendly hashing scheme that minimizes cache-line accesses. Unlike prior approaches, GraphTango provides excellent update and analytics throughput regardless of a graph's degree distribution. My last two works focus on hardware-software co-optimization for reducing data movement. In Pulley, we present an in-memory sorting accelerator that leverages subarray-level parallelism for distributed LSB-first radix sorting. This approach avoids the single-point merging bottleneck of prior near-data and in-memory sorting accelerators. Finally, we propose a processing-in-memory (PIM) architecture for accelerating inference on temporal graph neural networks. We incorporate a vertex-migration technique that caters to the evolving structure of the graph, in which vertices cluster together with their neighbors over time, reducing data movement.
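The two-level hashing idea behind BigMap can be illustrated with a small sketch. This is a hypothetical, simplified rendering (table sizes, function names, and the collision handling are illustrative, not AFL's or BigMap's actual code): a sparse first-level table translates raw edge hashes into compactly allocated slot numbers, so the hot counters land in a small, dense second-level region.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a two-level hashing scheme in the spirit of
 * BigMap (sizes and names are illustrative).
 * Level 1: a sparse index table maps raw edge hashes to compact slot
 * numbers, allocated sequentially on first use.
 * Level 2: a dense bitmap indexed by those compact slots, so the hot
 * working set stays contiguous no matter how scattered the raw
 * hashes are. */

#define L1_SIZE (1u << 20)          /* sparse index table entries    */
#define L2_SIZE (1u << 16)          /* dense coverage bitmap (bytes) */

static uint32_t l1_index[L1_SIZE];  /* 0 = unassigned                */
static uint8_t  l2_bitmap[L2_SIZE];
static uint32_t next_slot = 1;      /* slot 0 reserved as "empty"    */

/* Record one edge hit: translate the sparse hash through level 1,
 * then bump the dense level-2 counter. */
static void record_edge(uint32_t edge_hash) {
    uint32_t *slot = &l1_index[edge_hash % L1_SIZE];
    if (*slot == 0)                 /* first use: allocate next slot */
        *slot = next_slot++;
    l2_bitmap[*slot % L2_SIZE]++;   /* dense, cache-friendly update  */
}
```

Because slots are handed out sequentially on first use, a run that touches only a few thousand distinct edges confines its updates to a few thousand contiguous bytes of the dense bitmap, regardless of how scattered the raw hashes are. (In this sketch, two raw hashes colliding in the level-1 table would silently share a slot; a real design resolves such collisions.)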
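The property that lets an LSB-first design avoid a merge step can be shown in software: LSB-first radix sort is a sequence of stable counting-sort passes, and within each pass, digit histograms can be built independently per partition and combined with a single prefix sum into global write offsets. The sketch below is a single-threaded stand-in under assumed parameters (partition count and digit width are illustrative; Pulley itself realizes this with DRAM subarrays in hardware, not software loops):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define P      4    /* partitions, standing in for DRAM subarrays */
#define RADIX  256  /* 8-bit digits -> 4 passes over 32-bit keys  */

/* LSB-first radix sort where each pass (1) builds per-partition digit
 * histograms -- the step subarrays would perform in parallel --
 * (2) combines them with one prefix sum into global offsets, and
 * (3) scatters stably.  The array is globally ordered after the final
 * pass without any merge step. */
static void radix_sort(uint32_t *a, size_t n) {
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t hist[P][RADIX];
        memset(hist, 0, sizeof hist);
        /* 1. independent per-partition histograms */
        for (size_t p = 0; p < P; p++)
            for (size_t i = p * n / P; i < (p + 1) * n / P; i++)
                hist[p][(a[i] >> shift) & 0xFF]++;
        /* 2. one prefix sum yields each partition's global offsets */
        size_t off[P][RADIX], base = 0;
        for (int d = 0; d < RADIX; d++)
            for (size_t p = 0; p < P; p++) {
                off[p][d] = base;
                base += hist[p][d];
            }
        /* 3. stable scatter straight to final positions */
        for (size_t p = 0; p < P; p++)
            for (size_t i = p * n / P; i < (p + 1) * n / P; i++)
                tmp[off[p][(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

Because step 2 assigns each partition its exact global destination range for every digit value, step 3 writes directly to final positions; nothing is left to merge when the passes complete.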
- Ashish Venkat, Committee Chair, CS/SEAS/UVA
- Kevin Skadron, Advisor, CS/SEAS/UVA
- Felix Lin, CS/SEAS/UVA
- Adwait Jog, CS/SEAS/UVA
- Jundong Li, ECE/CS/DS/SEAS/UVA