Simplified example for a NERSC OpenMP issue.
First, build the demo executable with make, which should work out of the box at NERSC. The executable does nothing meaningful; it only keeps each CPU core busy for about a second.
At NERSC, submit a batch job with sbatch demo0.sh, or request an interactive one with bash demo_interactive.sh 0.
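Either option requests only 5 minutes and should finish much sooner.
For me, the clock time was almost 256 times larger than the wtime, which I think indicates good parallelism: if the clock time counts CPU time summed over all threads, a ratio near 256 means about 256 hardware threads were busy at once.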
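The comparison above can be illustrated with a minimal Python sketch. It is not the demo's C++ code: the names cpu_to_wall_ratio and busy are hypothetical, and it assumes that time.process_time() (CPU time of the whole process) and time.perf_counter() (wall time) play the roles of the clock time and the wtime.

```python
import time


def cpu_to_wall_ratio(work):
    """Run work() and return CPU time divided by wall time.

    A ratio near 1 suggests one busy core; a ratio near N suggests
    roughly N cores were busy at the same time.
    """
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of the whole process
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else float("nan")


def busy():
    """Hypothetical single-threaded workload that just burns CPU."""
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


ratio = cpu_to_wall_ratio(busy)
print(f"CPU time / wall time = {ratio:.2f}")
```

Because busy() is single-threaded, the ratio should stay around 1 here; a multithreaded workload would push it towards the number of cores in use.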
At NERSC, submit a batch job with sbatch demo1.sh, or request an interactive one with bash demo_interactive.sh 1.
Either option requests only 5 minutes and should finish a bit sooner.
For me, the clock time (the same as in the previous case) was very close to the wtime, which seems to indicate that the C++ code is given only one CPU core/hyperthread after sklearn.cluster.KMeans is used in demo.py.
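One possible way to probe this observation is to print how many CPUs the process is allowed to run on, before and after the KMeans call. This is only a sketch under assumptions: it is not part of demo.py, the helper name available_cpus is hypothetical, and whether CPU affinity is actually involved in this issue is not confirmed here. os.sched_getaffinity exists only on some platforms (Linux among them), so the sketch falls back to os.cpu_count().

```python
import os


def available_cpus():
    """Return how many CPUs this process may run on.

    Uses the scheduler affinity mask where the platform provides it,
    otherwise falls back to the total CPU count (which may be None).
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))  # 0 means the current process
    return os.cpu_count()


n_cpus = available_cpus()
print(f"CPUs available to this process: {n_cpus}")
```

Calling available_cpus() once at the start of a script and once after the step under suspicion would show whether the count changes in between.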