To make sure CPUSummary 1.11 and newer use Hwloc, you can run
julia> using CPUSummary
julia> CPUSummary.use_hwloc(true);
This should enable accurate hardware information. Hwloc is already the default, so this step is typically unnecessary.
Octavian.jl is a multi-threaded BLAS-like library that provides pure Julia matrix multiplication on the CPU, built on top of LoopVectorization.jl.
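As a minimal usage sketch (assuming Octavian is installed, and using its `matmul` and `matmul!` functions as the entry points), a product can be computed and checked against Julia's built-in multiplication as follows. The matrix sizes and variable names are arbitrary illustrative choices.

```julia
using Octavian          # provides matmul and matmul!
using LinearAlgebra     # stdlib; supplies the reference product A * B

# Illustrative sizes (an assumption); any compatible dimensions work.
M, K, N = 200, 300, 400
A = rand(M, K)
B = rand(K, N)

# Allocating form: returns a new M×N matrix.
C = Octavian.matmul(A, B)

# In-place form: writes the product into a preallocated output.
C2 = Matrix{Float64}(undef, M, N)
Octavian.matmul!(C2, A, B)

# Both results should agree with the built-in product up to rounding.
@assert C ≈ A * B
@assert C2 ≈ C
```

The in-place form avoids allocating the output on every call, which is usually preferable inside a hot loop.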
Please see the Octavian documentation.
Octavian no longer supports 32-bit Julia; see PR #157. If you are interested in restoring support, please open a PR that fixes the failing tests.
You can run benchmarks using BLASBenchmarksCPU.jl:
julia> @time using BLASBenchmarksCPU
  7.278954 seconds (17.59 M allocations: 1.107 GiB, 6.22% gc time)
julia> rb = runbench(sizes = logspace(10, 1_000, 200)); plot(rb, displayplot = false);
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 2:25:04
  Size:               (1000, 1000, 1000)
  BLIS:               (MedianGFLOPS = 1051.0, MaxGFLOPS = 1476.0)
  Gaius:              (MedianGFLOPS = 765.8, MaxGFLOPS = 941.7)
  MKL:                (MedianGFLOPS = 1348.0, MaxGFLOPS = 1589.0)
  Octavian:           (MedianGFLOPS = 1816.0, MaxGFLOPS = 1895.0)
  OpenBLAS:           (MedianGFLOPS = 1254.0, MaxGFLOPS = 1385.0)
  Tullio:             (MedianGFLOPS = 1102.0, MaxGFLOPS = 1196.0)
  LoopVectorization:  (MedianGFLOPS = 1552.0, MaxGFLOPS = 1721.0)
julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

| Julia Package | CPU | GPU |
|---|---|---|
| Gaius.jl | Yes | No | 
| GemmKernels.jl | No | Yes | 
| Octavian.jl | Yes | No | 
| Tullio.jl | Yes | Yes | 
In general:
- Octavian has the fastest CPU performance.
- GemmKernels has the fastest GPU performance.
- Tullio is the most flexible.