Sun 30 October - Fri 4 November 2016 Amsterdam, Netherlands
Mon 31 Oct 2016 16:05 - 16:30 at Matterhorn 1 - Session 2

Numerical software in computational science and engineering often relies on highly optimized building blocks from libraries such as BLAS and LAPACK. Examples of such blocks include, but are not limited to, matrix multiplications, matrix factorizations, and solvers for Sylvester-like equations. While the BLAS and LAPACK libraries have been very successful in providing portable performance for a wide range of computing architectures, they still present severe limitations in terms of flexibility. First, these libraries are optimized for large matrices (of sizes at least in the hundreds). Second, the interface they provide, in terms of operations and matrix structures, specifically targets computational science. These limitations can render those libraries suboptimal in performance or code size for applications in communications, graphics, and control, which may require smaller-scale computations and a more flexible interface. To overcome these limitations, we advocate a domain-specific program generator capable of producing library routines tailored to the specific needs of the application in terms of sizes, interface, and target architecture. In this work, we introduce such a generator that translates a desired linear algebra computation, annotated with matrix properties, into optimized C code, optionally vectorized with intrinsics. The generator unites prior work on two independent frameworks: the FLAME-based CL1CK, and LGen, which was designed after Spiral. For a given linear algebra problem such as a matrix factorization, matrix inversion, or equation to be solved, CL1CK synthesizes families of blocked algorithms that rely on basic computations provided by BLAS. These, in turn, are compiled into efficient, vectorized C code by (an extension of) LGen. As case studies, we consider the Cholesky decomposition and solvers for the continuous-time Lyapunov and Sylvester equations.
We compare the performance of our generated code with the commercial Intel MKL, showing competitive results.