Optimizing GEMM/GEMV on x86 CPUs and NVIDIA GPUs

My PhD research focuses on low-level performance optimization for math libraries and is rooted in highly efficient, hand-tuned GEMM/GEMV. Here I dissect some common strategies that closed-source commercial libraries (Intel oneMKL, NVIDIA cuBLAS) adopt to optimize GEMM/GEMV.

Accelerating Homomorphic Encryption on Intel GPUs [paper]

FT-BLAS: A High-Performance BLAS Implementation With Online Fault Tolerance [paper]