Computer Science
PhD Qualifying Exam Presentation by Marzieh Lenjani
Date: 2019-11-18, 09:30 (America/New_York)
Location: Rice 242

Fulcrum: a Simplified Control and Access Mechanism toward Flexible and Practical in-situ Accelerators


Many modern and emerging applications are memory intensive: the cost of moving data from memory to the processor dominates the cost of computation. In-situ approaches process data very close to the memory cells, in the row buffer of each subarray, immediately after the data are read. This minimizes data movement costs and also affords parallelism across subarrays. However, current in-situ approaches are limited to row-wide bitwise (or few-bit) operations applied uniformly across the row buffer. Therefore, they cannot support common operations (such as 32-bit addition and multiplication), operations with data dependencies, or operations based on predicates. Moreover, with current peripheral logic, communication among subarrays is inefficient, and the bits of a word are not physically adjacent. Our proposed method, Fulcrum, addresses these issues. We propose a new lightweight access and control mechanism that processes the data sequentially, enables operations with data dependencies along the row buffer, and supports operations based on a predicate. The sequential control mechanism requires only a one-word scalar ALU at the edge of every two subarrays, capable of a range of common operations (bitwise, addition, and multiplication). We show that this one-word ALU outperforms a row-wide bitwise ALU. For algorithms that require communication among subarrays, we augment the peripheral logic with broadcasting capabilities and a previously proposed method for low-cost inter-subarray data movement. To realize true subarray-level parallelism, we introduce a lightweight column-selection mechanism based on shifting one-hot encoded values. This technique enables independent column selection in each subarray. Our evaluation shows that, on average, our method outperforms (i) a contemporary GPU (Nvidia P100 with HBM2) by 118×, and (ii) an ideal model of a GPU (which accounts only for the overhead of data movement) by 83×.
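
As a rough software illustration of the sequential, predicate-aware processing and the one-hot column selection described above, the following Python sketch walks a row buffer one word at a time. It is only an analogy under stated assumptions, not the hardware design from the talk: the row buffer is modeled as a list of words, and the names one_hot_select and filtered_sum are hypothetical and introduced here purely for illustration.

```python
def one_hot_select(row, one_hot):
    """Return the word at the position of the single set bit in one_hot.

    Hypothetical helper: models column selection with a one-hot value,
    where bit i set means "column i is selected".
    """
    assert one_hot != 0 and one_hot & (one_hot - 1) == 0, "exactly one bit must be set"
    return row[one_hot.bit_length() - 1]


def filtered_sum(row, predicate):
    """Process the row one word at a time, summing words that satisfy predicate."""
    one_hot = 1              # one-hot value selecting column 0
    total = 0                # running accumulator (a data dependency across words)
    for _ in range(len(row)):
        word = one_hot_select(row, one_hot)
        if predicate(word):  # operation applied only where the predicate holds
            total += word
        one_hot <<= 1        # shifting the one-hot value selects the next column
    return total


# Assumed toy data standing in for one subarray's row buffer.
row_buffer = [3, 8, 1, 12, 7, 10]
result = filtered_sum(row_buffer, lambda w: w > 5)
print(result)
```

Each subarray would hold its own one-hot value in this model, so different subarrays could point at different columns at the same time. A row-wide bitwise ALU, by contrast, applies the same operation to every column and has no natural place for the accumulator or the predicate.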


  • Worthy Martin (Chair)
  • Kevin Skadron (Advisor)
  • Yangfeng Ji
  • Ashish Venkat