 The current configuration
 uses 1 MPI rank per node:
 max-num-threads = 24
 num-threads     = 24

 This is usually not the best way to set things up. I would, for example,
 have expected the default choice to be something like 1 MPI rank per
 NUMA domain.
 Unless limited by communication overhead, we seem to obtain the fastest
 per-node performance when using only MPI and no OpenMP (about a 50%
 speedup on my 12-core workstation with 2 NUMA domains).

 Given that, if anyone is using Comet for production work and wants to
 contribute their machine description file, that would be great.
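
 As a rough illustration of the alternatives discussed above, here is a
 minimal sketch that computes the thread count for a given number of MPI
 ranks per node. The function name is hypothetical, and the figure of 2 NUMA
 domains on a 24-core node is an assumption, not something taken from the
 machine description file; the real layout should be checked on the machine
 (eg with lscpu or numactl --hardware).

```python
def rank_layout(cores_per_node, ranks_per_node):
    """Return (ranks_per_node, threads_per_rank) for an even split of cores."""
    if ranks_per_node < 1 or cores_per_node % ranks_per_node != 0:
        raise ValueError("cores must divide evenly among MPI ranks")
    return ranks_per_node, cores_per_node // ranks_per_node


# 24 cores per node is taken from max-num-threads above.
# 2 NUMA domains per node is an ASSUMPTION made for this example.
cores, numa_domains = 24, 2

for label, ranks in [("1 rank per node", 1),
                     ("1 rank per NUMA domain", numa_domains),
                     ("pure MPI", cores)]:
    r, t = rank_layout(cores, ranks)
    print(f"{label}: MPI ranks per node = {r}, num-threads = {t}")
```

 Under those assumptions, 1 MPI rank per NUMA domain would mean 2 ranks per
 node with num-threads = 12, and the pure MPI case would mean 24 ranks with
 num-threads = 1.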

