[Users] Cactus Core Usage on HPC Cluster
allgwy001 at myuct.ac.za
Sun Feb 5 11:09:33 CST 2017
Hi Ian and Erik,
Thank you very much for all the advice and pointers so far!
I didn't compile the ET myself; it was done by an HPC engineer. He is
unfamiliar with Cactus and started off not using a config file, so he had
to troubleshoot his way through the compilation process. We are both
scratching our heads about what the issue with mpirun could be.
I suspect he didn't set MPI_DIR, so I'm going to suggest that he fix that
and see whether recompiling takes care of things.
The scheduler automatically terminates jobs that run on too many
processors. For my simulation, this appears to happen as soon as
TwoPunctures starts generating the initial data. I then get error messages
of the form: "Job terminated as it used more cores (17.6) than requested
(4)." (I switched from requesting 3 processors to requesting 4.) The number
of cores it tries to use appears to differ from run to run.
The parameter file uses Carpet. It generates the following output (when I
request 4 processors):
INFO (Carpet): MPI is enabled
INFO (Carpet): Carpet is running on 4 processes
INFO (Carpet): This is process 0
INFO (Carpet): OpenMP is enabled
INFO (Carpet): This process contains 16 threads, this is thread 0
INFO (Carpet): There are 64 threads in total
INFO (Carpet): There are 16 threads per process
mpirun gives me the following information for the node allocation: slots=4,
max_slots=0, slots_inuse=0, state=UP.
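The Carpet output above shows 4 processes with 16 OpenMP threads each, i.e. 64 threads on a 4-core request, which may be what the scheduler is objecting to. A minimal sketch of capping the thread count in a job script follows. OMP_NUM_THREADS is the standard OpenMP environment variable; threads_per_rank is a hypothetical helper written for illustration, and the values 4 and 4 are simply the numbers from this run. Whether this is the actual cause on HEX is an assumption.

```shell
# Sketch only: derive an OpenMP thread count per MPI process from the
# number of cores requested, so that processes x threads <= cores.
# threads_per_rank is a hypothetical helper, not part of Cactus, PBS or Open MPI.
threads_per_rank() {
    cores=$1   # cores requested from the scheduler (e.g. ppn)
    ranks=$2   # number of MPI processes started by mpirun
    t=$(( cores / ranks ))
    # Never go below one thread per process.
    if [ "$t" -lt 1 ]; then
        t=1
    fi
    echo "$t"
}

# Numbers from the run above: 4 cores requested, 4 MPI processes.
OMP_NUM_THREADS=$(threads_per_rank 4 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The export would go before the mpirun line in the PBS script. On a multi-node job the variable may also need to be forwarded to the remote processes, for example with Open MPI's -x option.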
The tree view of the processes looks like this:
  PID TTY      STAT   TIME COMMAND
19503 ?        S      0:00 sshd: allgwy001@pts/7
19504 pts/7    Ss     0:00  \_ -bash
 6047 pts/7    R+     0:00      \_ ps -u allgwy001 f
Adding "cat $PBS_NODEFILE" to my PBS script didn't seem to produce
anything, although I could be doing something stupid. I'm very new to the
Erik: documentation for the cluster I'm trying to run on (called HEX) is
available under the Documentation tab at hpc.uct.ac.za, but it's very basic.
Thanks again for all your help! I'll let you know if we make any progress.
On Sun, Feb 5, 2017 at 10:55 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
> On 4 Feb 2017, at 20:13, Gwyneth Allwright <allgwy001 at myuct.ac.za> wrote:
> Hi All,
> I'm trying to get the Einstein Toolkit installed on an HPC cluster running
> SLES. The trouble is that Cactus tries to use all the available processors
> even when I specify a smaller number (by setting ppn in my PBS script).
> As a test, I tried running the compiled executable with a parameter file that
> required about 5 GB of RAM. In my PBS script, I set nodes=1 and ppn=3, and
> then ran using openmpi-1.10.1:
> mpirun -hostfile $PBS_NODEFILE <ET exe> <parameter file>
> This resulted in the simulation running on all 24 available processors,
> even though I'd only requested 3. Since PBS and MPI are integrated, I was
> told that using -np with mpirun wouldn't help.
> Does anyone know how to address this issue?
> Hi Gwyneth,
> A few questions:
> Can you check which nodes are appearing in $PBS_NODEFILE?
> When you say it's running on all 24 available processors, what do you
> mean? Do you mean that "top" shows 24 processes, or that the CPU is at
> 100%? Could it be that you do in fact have three processes, but due to
> OpenMP, each process actually has 12 threads?
> Can you do
> ps -u $USER f
> on the node while Cactus is running? This will show you a tree view of
> the processes that are running.
> Is the parameter file using PUGH or Carpet? The standard output from the
> first process should show you the number of processes that Cactus thinks it
> is running on.
> When you compiled the ET, did you specify the location of the MPI library
> with MPI_DIR? If Cactus failed to find the library automatically, it will
> have automatically built its own version of OpenMPI and linked with that
> version. If you then use an mpirun from a different installation of MPI,
> things will not work correctly.
> Maybe you can increase the verbosity of mpirun, to check how many
> processes it is trying to start.
> Ian Hinder