This benchmarks libxsmm vs LoopVectorization.jl for the batched matmul C = ∑ᵢ Aᵢ* Bᵢ where C is an accumulator of size m×n, all Aᵢ are m×k and all Bᵢ are k×n, and the sum runs over 10k batches. Both libxsmm and LoopVectorization.jl use column-major matrices.

Instructions:

git clone --recursive https://github.com/haampie/smm-bench.git
cd smm-bench
make AVX={1,2,3} -j
julia --project -e 'using Pkg; pkg"instantiate"; include("julia.jl"); benchmark()'