A Compiler for Throughput Optimization of Graph Algorithms on GPUs
Writing high-performance GPU implementations of graph algorithms
can be challenging. In this paper, we argue that three optimizations
called throughput optimizations are key to high performance
for this application class.
These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand.
To address this problem, we have implemented these optimizations in a compiler that
produces CUDA code from an intermediate-level program representation called IrGL.
Compared to state-of-the-art handwritten CUDA implementations of eight graph applications,
code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never
more than 30% slower for the others. Throughput optimizations contribute an improvement
of up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.
Wed 2 Nov
10:30 - 12:10
Optimization and Performance (OOPSLA) at Matterhorn 1
Chair(s): Jan Vitek (Northeastern University)
A Compiler for Throughput Optimization of Graph Algorithms on GPUs
Sreepathi Pai (University of Texas at Austin, USA), Keshav Pingali (University of Texas at Austin, USA). DOI, Pre-print

Automatic Parallelization of Pure Method Calls via Conditional Future Synthesis
Rishi Surendran (Rice University, USA), Vivek Sarkar (Rice University, USA). DOI

Portable Inter-workgroup Barrier Synchronisation for GPUs
Tyler Sorensen (Imperial College London), Alastair F. Donaldson (Imperial College London), Mark Batty (University of Kent), Ganesh Gopalakrishnan (University of Utah), Zvonimir Rakamaric (University of Utah). DOI, Pre-print

Parallel Incremental Whole-Program Optimizations for Scala.js
Sébastien Doeraene (EPFL, Switzerland), Tobias Schlatter (EPFL, Switzerland). DOI, Pre-print