Writing high-performance GPU implementations of graph algorithms
can be challenging. In this paper, we argue that three optimizations,
which we call throughput optimizations, are key to high performance
for this application class.
These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand.
To address this problem, we have implemented these optimizations in a compiler that
produces CUDA code from an intermediate-level program representation called IrGL.
Compared to state-of-the-art handwritten CUDA implementations of eight graph applications,
code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never
more than 30% slower for the others. Throughput optimizations contribute an improvement
of up to 4.16x (median 1.4x) over the performance of unoptimized IrGL code.
Wed 2 Nov