A Compiler for Throughput Optimization of Graph Algorithms on GPUs
Writing high-performance GPU implementations of graph algorithms
can be challenging. In this paper, we argue that three optimizations
called throughput optimizations are key to high performance
for this application class.
These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand.
To address this problem, we have implemented these optimizations in a compiler that
produces CUDA code from an intermediate-level program representation called IrGL.
Compared to state-of-the-art handwritten CUDA implementations of eight graph applications,
code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never
more than 30% slower for the others. Throughput optimizations contribute an improvement
of up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.
Wed 2 Nov
10:30 - 12:10
Optimization and Performance (OOPSLA) at Matterhorn 1
Chair(s): Jan Vitek (Northeastern University)
A Compiler for Throughput Optimization of Graph Algorithms on GPUs
Sreepathi Pai (University of Texas at Austin, USA), Keshav Pingali (University of Texas at Austin, USA). DOI, Pre-print

Automatic Parallelization of Pure Method Calls via Conditional Future Synthesis
Rishi Surendran (Rice University, USA), Vivek Sarkar (Rice University, USA). DOI

Portable Inter-workgroup Barrier Synchronisation for GPUs
Tyler Sorensen (Imperial College London), Alastair F. Donaldson (Imperial College London), Mark Batty (University of Kent), Ganesh Gopalakrishnan (University of Utah), Zvonimir Rakamaric (University of Utah). DOI, Pre-print

Parallel Incremental Whole-Program Optimizations for Scala.js
Sébastien Doeraene (EPFL, Switzerland), Tobias Schlatter (EPFL, Switzerland). DOI, Pre-print