Optimizing GEMM/GEMV on x86 CPUs and NVIDIA GPUs
My PhD research centers on low-level performance optimization for math libraries and is rooted in highly efficient hand-tuned GEMM/GEMV. Here I dissect some common strategies that closed-source commercial libraries (Intel oneMKL, NVIDIA cuBLAS) adopt to optimize GEMM/GEMV.
- On NVIDIA GPUs (tested on RTX 2080 Super, TU104)
  - SGEMM (tiling, warp-level tiling, register blocking, prefetching, double buffering); sketched after this list. [code] [tutorial]
  - SGEMV (register blocking, vectorization); sketched after this list. [code]
- On Intel CPUs (tested on Xeon W-2255, Cascade Lake)
  - DGEMM (register blocking, cache blocking, packing, SIMD, prefetching); sketched after this list. [code] [tutorial]
  - DGEMV (register blocking, SIMD, OpenMP multithreading); sketched after this list. [code]
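
To make the SGEMM strategies concrete, here is a minimal CUDA sketch of shared-memory tiling plus per-thread register blocking. The tile sizes (64x64 block tile, 16-wide K step, 4x4 register tile) are illustrative, not the tuned values from the linked code, and double buffering/prefetching are omitted for brevity; M, N, K are assumed to be multiples of the tile sizes.

```cuda
// Illustrative tiled SGEMM: C = A * B, all matrices row-major, with
// M % BM == 0, N % BN == 0, K % BK == 0 assumed for brevity.
constexpr int BM = 64;  // block tile: rows of C per thread block
constexpr int BN = 64;  // block tile: cols of C per thread block
constexpr int BK = 16;  // K-step staged through shared memory
constexpr int TM = 4;   // register tile: rows of C per thread
constexpr int TN = 4;   // register tile: cols of C per thread

__global__ void sgemm_tiled(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    __shared__ float As[BM][BK];  // staged tile of A
    __shared__ float Bs[BK][BN];  // staged tile of B

    const int tx = threadIdx.x, ty = threadIdx.y;  // 16x16 threads per block
    const int row0 = blockIdx.y * BM + ty * TM;    // this thread's first C row
    const int col0 = blockIdx.x * BN + tx * TN;    // this thread's first C col

    float acc[TM][TN] = {};  // TM x TN accumulators held in registers

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative load: each of the 256 threads stages TM elements of As
        // (64x16 floats) and TN elements of Bs (16x64 floats).
        for (int i = 0; i < TM; ++i)
            As[ty * TM + i][tx] = A[(row0 + i) * K + k0 + tx];
        for (int j = 0; j < TN; ++j)
            Bs[ty][tx * TN + j] = B[(k0 + ty) * N + col0 + j];
        __syncthreads();

        // Outer-product update of the register tile from shared memory.
        for (int kk = 0; kk < BK; ++kk)
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += As[ty * TM + i][kk] * Bs[kk][tx * TN + j];
        __syncthreads();
    }

    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```

A typical launch is `sgemm_tiled<<<dim3(N / BN, M / BM), dim3(16, 16)>>>(M, N, K, dA, dB, dC)`; double buffering would add a second pair of shared tiles so the loads for the next K-step overlap the current tile's arithmetic.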
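For SGEMV the dominant cost is streaming the matrix, so this sketch (again illustrative, not the linked code) assigns one warp per row, uses `float4` vectorized loads, keeps partial sums in registers, and finishes with a warp-shuffle reduction. It assumes N % 4 == 0 and 16-byte-aligned buffers so the `float4` casts are legal (cudaMalloc-returned bases are aligned, and each row then stays aligned).

```cuda
// Illustrative SGEMV: y = A * x with A row-major (M x N), one warp per row.
__global__ void sgemv_warp_per_row(int M, int N,
                                   const float *A, const float *x, float *y) {
    const int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    const int lane = threadIdx.x % 32;
    if (row >= M) return;

    // Vectorized loads: each lane streams float4 chunks of the row and keeps
    // its partial dot product in a register.
    const float4 *a4 = reinterpret_cast<const float4 *>(A + (size_t)row * N);
    const float4 *x4 = reinterpret_cast<const float4 *>(x);
    float sum = 0.0f;
    for (int i = lane; i < N / 4; i += 32) {
        float4 a = a4[i], b = x4[i];
        sum += a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    }

    // Warp-level tree reduction combines the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (lane == 0) y[row] = sum;
}
```

With 128 threads per block, each block covers 4 rows, so a grid of (M + 3) / 4 blocks processes the whole matrix.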
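On the CPU side, a blocked-and-packed DGEMM bottoms out in a small register-blocked micro-kernel. The sketch below uses plain AVX-512 intrinsics rather than the hand-written assembly of the linked code, and illustrative tile sizes: it updates an 8x8 tile of C held entirely in zmm registers from packed panels of A and B, with a software prefetch on the packed A stream.

```cpp
#include <immintrin.h>

// Illustrative AVX-512 DGEMM micro-kernel, the innermost piece of a blocked,
// packed DGEMM.
//   Apack: k slices of MR contiguous doubles (one A column fragment each)
//   Bpack: k slices of NR contiguous doubles (one B row fragment each)
//   C:     column-major 8x8 tile with leading dimension ldc
enum { MR = 8, NR = 8 };

void dgemm_ukernel_8x8(int k, const double *Apack, const double *Bpack,
                       double *C, int ldc) {
    __m512d acc[NR];
    for (int j = 0; j < NR; ++j)                // register blocking: 8 zmm
        acc[j] = _mm512_loadu_pd(&C[j * ldc]);  // accumulators for C's columns

    for (int p = 0; p < k; ++p) {
        // Software prefetch a few K-steps ahead on the packed A stream
        // (running past the end of the panel is only a harmless hint).
        _mm_prefetch((const char *)(Apack + MR * (p + 8)), _MM_HINT_T0);
        __m512d a = _mm512_loadu_pd(Apack + MR * p);        // 8 rows of A
        for (int j = 0; j < NR; ++j) {
            __m512d b = _mm512_set1_pd(Bpack[NR * p + j]);  // broadcast B
            acc[j] = _mm512_fmadd_pd(a, b, acc[j]);         // fused mul-add
        }
    }

    for (int j = 0; j < NR; ++j)
        _mm512_storeu_pd(&C[j * ldc], acc[j]);
}
```

Cache blocking and packing live in the loops around this kernel: B is packed into cache-resident KC x NC panels and A into MC x KC panels, so the micro-kernel only ever streams contiguous memory.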
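DGEMV parallelizes naturally over rows. A minimal sketch, assuming row-major A with N % 16 == 0: OpenMP splits the rows across cores, and two independent AVX-512 accumulators per row (register blocking) hide the FMA latency within each dot product.

```cpp
#include <immintrin.h>

// Illustrative DGEMV: y = A * x with A row-major (M x N).
// Compile with, e.g., -fopenmp -mavx512f.
void dgemv_avx512(int M, int N, const double *A, const double *x, double *y) {
#pragma omp parallel for schedule(static)
    for (int i = 0; i < M; ++i) {
        const double *a = A + (size_t)i * N;
        __m512d acc0 = _mm512_setzero_pd();
        __m512d acc1 = _mm512_setzero_pd();
        for (int j = 0; j < N; j += 16) {  // 16 doubles per iteration
            acc0 = _mm512_fmadd_pd(_mm512_loadu_pd(a + j),
                                   _mm512_loadu_pd(x + j), acc0);
            acc1 = _mm512_fmadd_pd(_mm512_loadu_pd(a + j + 8),
                                   _mm512_loadu_pd(x + j + 8), acc1);
        }
        // Horizontal reduction of both accumulators finishes row i.
        y[i] = _mm512_reduce_add_pd(_mm512_add_pd(acc0, acc1));
    }
}
```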
Accelerating Homomorphic Encryption on Intel GPUs [paper]
- The first-ever SYCL-based GPU backend for Microsoft SEAL APIs.
- The first HE library based on the CKKS scheme optimized for Intel GPUs.
- Optimizations at the instruction, algorithmic, and application levels accelerate our HE library.
- Our NTT implementation reaches up to 85.7% of the theoretical peak performance on the latest Intel GPUs; a scalar sketch of the butterfly network follows this list.
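
To give a flavor of what the GPU kernels compute, here is a scalar sketch of the in-place forward NTT; a GPU version parallelizes each stage (one work-item per butterfly, with a barrier or kernel launch between stages). All names are illustrative, not our library's API. It assumes a GCC/Clang-style `unsigned __int128`, a prime modulus q < 2^63, n a power of two, and a twiddle table `w` of size n holding powers of a primitive root of unity mod q in bit-reversed order.

```cpp
#include <cstddef>
#include <cstdint>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t q) {
    return (uint64_t)((unsigned __int128)a * b % q);  // 64x64 multiply mod q
}

void ntt_forward(uint64_t *a, size_t n, uint64_t q, const uint64_t *w) {
    size_t t = n;
    for (size_t m = 1; m < n; m <<= 1) {   // log2(n) stages
        t >>= 1;
        for (size_t i = 0; i < m; ++i) {   // m independent butterfly groups
            uint64_t s = w[m + i];         // this group's twiddle factor
            for (size_t j = 2 * i * t; j < 2 * i * t + t; ++j) {
                uint64_t u = a[j];                   // Cooley-Tukey butterfly:
                uint64_t v = mulmod(a[j + t], s, q); // (u, v) -> (u+v, u-v)
                a[j]     = (u + v) % q;
                a[j + t] = (u + q - v) % q;
            }
        }
    }
}
```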
FT-BLAS: A High-Performance BLAS Implementation With Online Fault Tolerance [paper]
- A brand-new BLAS implementation (Level-1/2/3) built on Intel AVX-512 instructions, comparable to or faster than state-of-the-art BLAS libraries on the latest Intel CPUs.
- Fault-tolerance codes are encoded into the assembly kernels with negligible overhead (0.5%-3%) over the baseline.
- The fault-tolerant library remains comparable to or faster than MKL, OpenBLAS, and BLIS, both with and without computing errors injected at runtime; the checksum idea is sketched below.
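
To illustrate the underlying idea (not the library's fused assembly implementation), here is a minimal sketch of ABFT-style checksum verification for C = A * B: since e^T C = (e^T A) B for the all-ones vector e, each column sum of the computed C can be checked against an independently encoded value. In FT-BLAS the checksum updates are fused into the kernels themselves rather than run as a separate pass like this; the `tol` threshold here is an illustrative placeholder.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// ABFT-style verification sketch for row-major C = A * B (A: m x k, B: k x n).
// Returns true iff every column checksum of C matches its encoded prediction.
bool abft_verify_gemm(int m, int n, int k,
                      const double *A, const double *B, const double *C,
                      double tol) {
    // Encode A once: ca[p] = sum_i A[i][p], i.e. the checksum row e^T A.
    std::vector<double> ca(k, 0.0);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            ca[p] += A[(size_t)i * k + p];

    bool ok = true;
    for (int j = 0; j < n; ++j) {
        double encoded = 0.0;                 // (e^T A) * B, column j
        for (int p = 0; p < k; ++p)
            encoded += ca[p] * B[(size_t)p * n + j];
        double observed = 0.0;                // e^T C, column j
        for (int i = 0; i < m; ++i)
            observed += C[(size_t)i * n + j];
        // A mismatch beyond round-off flags a runtime computing error.
        if (std::fabs(encoded - observed) > tol) {
            std::fprintf(stderr, "ABFT: checksum mismatch in column %d\n", j);
            ok = false;
        }
    }
    return ok;
}
```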