Optimizing GEMM/GEMV on x86 CPUs and NVIDIA GPUs
My PhD research centers on low-level performance optimization for math libraries and is rooted in highly efficient hand-tuned GEMM/GEMV. Here I dissect some common strategies that closed-source commercial libraries (Intel oneMKL, NVIDIA cuBLAS) adopt to optimize GEMM/GEMV.
- On NVIDIA GPUs (tested on RTX 2080 Super, TU104)
  - SGEMM (tiling, warp-level tiling, register blocking, prefetching, double buffering); sketched after this list. [code] [tutorial]
  - SGEMV (register blocking, vectorization); sketched after this list. [code]
- On Intel CPUs (tested on Xeon W-2255, Cascade Lake)
  - DGEMM (register blocking, cache blocking, packing, SIMD, prefetching); sketched after this list. [code] [tutorial]
  - DGEMV (register blocking, SIMD, OpenMP multithreading); sketched after this list. [code]
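
To make the SGEMM strategies concrete, here is a minimal CUDA sketch of shared-memory tiling plus per-thread register blocking. The tile sizes (64x64 block tile, 16-wide K step, 4x4 register tile) are illustrative, not the tuned values from the linked code, and double buffering/prefetching are omitted for brevity; M, N, K are assumed to be multiples of the tile sizes.

```cuda
// Illustrative tiled SGEMM: C = A * B, all matrices row-major, with
// M % BM == 0, N % BN == 0, K % BK == 0 assumed for brevity.
constexpr int BM = 64;  // block tile: rows of C per thread block
constexpr int BN = 64;  // block tile: cols of C per thread block
constexpr int BK = 16;  // K-step staged through shared memory
constexpr int TM = 4;   // register tile: rows of C per thread
constexpr int TN = 4;   // register tile: cols of C per thread

__global__ void sgemm_tiled(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    __shared__ float As[BM][BK];  // staged tile of A
    __shared__ float Bs[BK][BN];  // staged tile of B

    const int tx = threadIdx.x, ty = threadIdx.y;  // 16x16 threads per block
    const int row0 = blockIdx.y * BM + ty * TM;    // this thread's first C row
    const int col0 = blockIdx.x * BN + tx * TN;    // this thread's first C col

    float acc[TM][TN] = {};  // TM x TN accumulators held in registers

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative load: each of the 256 threads stages TM elements of As
        // (64x16 floats) and TN elements of Bs (16x64 floats).
        for (int i = 0; i < TM; ++i)
            As[ty * TM + i][tx] = A[(row0 + i) * K + k0 + tx];
        for (int j = 0; j < TN; ++j)
            Bs[ty][tx * TN + j] = B[(k0 + ty) * N + col0 + j];
        __syncthreads();

        // Outer-product update of the register tile from shared memory.
        for (int kk = 0; kk < BK; ++kk)
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += As[ty * TM + i][kk] * Bs[kk][tx * TN + j];
        __syncthreads();
    }

    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```

A typical launch is `sgemm_tiled<<<dim3(N / BN, M / BM), dim3(16, 16)>>>(M, N, K, dA, dB, dC)`; double buffering would add a second pair of shared tiles so the loads for the next K-step overlap the current tile's arithmetic.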
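For SGEMV the dominant cost is streaming the matrix, so this sketch (again illustrative, not the linked code) assigns one warp per row, uses `float4` vectorized loads, keeps partial sums in registers, and finishes with a warp-shuffle reduction. It assumes N % 4 == 0 and 16-byte-aligned buffers so the `float4` casts are legal (cudaMalloc-returned bases are aligned, and each row then stays aligned).

```cuda
// Illustrative SGEMV: y = A * x with A row-major (M x N), one warp per row.
__global__ void sgemv_warp_per_row(int M, int N,
                                   const float *A, const float *x, float *y) {
    const int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    const int lane = threadIdx.x % 32;
    if (row >= M) return;

    // Vectorized loads: each lane streams float4 chunks of the row and keeps
    // its partial dot product in a register.
    const float4 *a4 = reinterpret_cast<const float4 *>(A + (size_t)row * N);
    const float4 *x4 = reinterpret_cast<const float4 *>(x);
    float sum = 0.0f;
    for (int i = lane; i < N / 4; i += 32) {
        float4 a = a4[i], b = x4[i];
        sum += a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    }

    // Warp-level tree reduction combines the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (lane == 0) y[row] = sum;
}
```

With 128 threads per block, each block covers 4 rows, so a grid of (M + 3) / 4 blocks processes the whole matrix.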
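On the CPU side, a blocked-and-packed DGEMM bottoms out in a small register-blocked micro-kernel. The sketch below uses plain AVX-512 intrinsics rather than the hand-written assembly of the linked code, and illustrative tile sizes: it updates an 8x8 tile of C held entirely in zmm registers from packed panels of A and B, with a software prefetch on the packed A stream.

```cpp
#include <immintrin.h>

// Illustrative AVX-512 DGEMM micro-kernel, the innermost piece of a blocked,
// packed DGEMM.
//   Apack: k slices of MR contiguous doubles (one A column fragment each)
//   Bpack: k slices of NR contiguous doubles (one B row fragment each)
//   C:     column-major 8x8 tile with leading dimension ldc
enum { MR = 8, NR = 8 };

void dgemm_ukernel_8x8(int k, const double *Apack, const double *Bpack,
                       double *C, int ldc) {
    __m512d acc[NR];
    for (int j = 0; j < NR; ++j)                // register blocking: 8 zmm
        acc[j] = _mm512_loadu_pd(&C[j * ldc]);  // accumulators for C's columns

    for (int p = 0; p < k; ++p) {
        // Software prefetch a few K-steps ahead on the packed A stream
        // (running past the end of the panel is only a harmless hint).
        _mm_prefetch((const char *)(Apack + MR * (p + 8)), _MM_HINT_T0);
        __m512d a = _mm512_loadu_pd(Apack + MR * p);        // 8 rows of A
        for (int j = 0; j < NR; ++j) {
            __m512d b = _mm512_set1_pd(Bpack[NR * p + j]);  // broadcast B
            acc[j] = _mm512_fmadd_pd(a, b, acc[j]);         // fused mul-add
        }
    }

    for (int j = 0; j < NR; ++j)
        _mm512_storeu_pd(&C[j * ldc], acc[j]);
}
```

Cache blocking and packing live in the loops around this kernel: B is packed into cache-resident KC x NC panels and A into MC x KC panels, so the micro-kernel only ever streams contiguous memory.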
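DGEMV parallelizes naturally over rows. A minimal sketch, assuming row-major A with N % 16 == 0: OpenMP splits the rows across cores, and two independent AVX-512 accumulators per row (register blocking) hide the FMA latency within each dot product.

```cpp
#include <immintrin.h>

// Illustrative DGEMV: y = A * x with A row-major (M x N).
// Compile with, e.g., -fopenmp -mavx512f.
void dgemv_avx512(int M, int N, const double *A, const double *x, double *y) {
#pragma omp parallel for schedule(static)
    for (int i = 0; i < M; ++i) {
        const double *a = A + (size_t)i * N;
        __m512d acc0 = _mm512_setzero_pd();
        __m512d acc1 = _mm512_setzero_pd();
        for (int j = 0; j < N; j += 16) {  // 16 doubles per iteration
            acc0 = _mm512_fmadd_pd(_mm512_loadu_pd(a + j),
                                   _mm512_loadu_pd(x + j), acc0);
            acc1 = _mm512_fmadd_pd(_mm512_loadu_pd(a + j + 8),
                                   _mm512_loadu_pd(x + j + 8), acc1);
        }
        // Horizontal reduction of both accumulators finishes row i.
        y[i] = _mm512_reduce_add_pd(_mm512_add_pd(acc0, acc1));
    }
}
```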
Accelerating Homomorphic Encryption on Intel GPUs [paper]
- The first-ever SYCL-based GPU backend for Microsoft SEAL APIs.
- The first HE library based on the CKKS scheme optimized for Intel GPUs.
- Optimizations at the instruction, algorithmic, and application levels accelerate our HE library.
- Our NTT implementation reaches up to 85.7% of the theoretical peak performance on the latest Intel GPUs; a scalar sketch of the butterfly network follows this list.
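
To give a flavor of what the GPU kernels compute, here is a scalar sketch of the in-place forward NTT; a GPU version parallelizes each stage (one work-item per butterfly, with a barrier or kernel launch between stages). All names are illustrative, not our library's API. It assumes a GCC/Clang-style `unsigned __int128`, a prime modulus q < 2^63, n a power of two, and a twiddle table `w` of size n holding powers of a primitive root of unity mod q in bit-reversed order.

```cpp
#include <cstddef>
#include <cstdint>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t q) {
    return (uint64_t)((unsigned __int128)a * b % q);  // 64x64 multiply mod q
}

void ntt_forward(uint64_t *a, size_t n, uint64_t q, const uint64_t *w) {
    size_t t = n;
    for (size_t m = 1; m < n; m <<= 1) {   // log2(n) stages
        t >>= 1;
        for (size_t i = 0; i < m; ++i) {   // m independent butterfly groups
            uint64_t s = w[m + i];         // this group's twiddle factor
            for (size_t j = 2 * i * t; j < 2 * i * t + t; ++j) {
                uint64_t u = a[j];                   // Cooley-Tukey butterfly:
                uint64_t v = mulmod(a[j + t], s, q); // (u, v) -> (u+v, u-v)
                a[j]     = (u + v) % q;
                a[j + t] = (u + q - v) % q;
            }
        }
    }
}
```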
FT-BLAS: A High-Performance BLAS Implementation With Online Fault Tolerance [paper]
- A brand-new BLAS implementation (Level-1/2/3) built on Intel AVX-512 instructions, comparable to or faster than state-of-the-art BLAS libraries on the latest Intel CPUs.
- Fault-tolerance codes are encoded into the assembly kernels with negligible overhead (0.5%-3%) over the baseline.
- The fault-tolerant library remains comparable to or faster than MKL, OpenBLAS, and BLIS, both with and without computing errors injected at runtime; the checksum idea is sketched below.
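
To illustrate the underlying idea (not the library's fused assembly implementation), here is a minimal sketch of ABFT-style checksum verification for C = A * B: since e^T C = (e^T A) B for the all-ones vector e, each column sum of the computed C can be checked against an independently encoded value. In FT-BLAS the checksum updates are fused into the kernels themselves rather than run as a separate pass like this; the `tol` threshold here is an illustrative placeholder.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// ABFT-style verification sketch for row-major C = A * B (A: m x k, B: k x n).
// Returns true iff every column checksum of C matches its encoded prediction.
bool abft_verify_gemm(int m, int n, int k,
                      const double *A, const double *B, const double *C,
                      double tol) {
    // Encode A once: ca[p] = sum_i A[i][p], i.e. the checksum row e^T A.
    std::vector<double> ca(k, 0.0);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            ca[p] += A[(size_t)i * k + p];

    bool ok = true;
    for (int j = 0; j < n; ++j) {
        double encoded = 0.0;                 // (e^T A) * B, column j
        for (int p = 0; p < k; ++p)
            encoded += ca[p] * B[(size_t)p * n + j];
        double observed = 0.0;                // e^T C, column j
        for (int i = 0; i < m; ++i)
            observed += C[(size_t)i * n + j];
        // A mismatch beyond round-off flags a runtime computing error.
        if (std::fabs(encoded - observed) > tol) {
            std::fprintf(stderr, "ABFT: checksum mismatch in column %d\n", j);
            ok = false;
        }
    }
    return ok;
}
```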