[Users] simulationfactory issue?
Richard O'Shaughnessy
oshaughn at gravity.phys.uwm.edu
Sun Oct 17 10:17:01 CDT 2010
Hi Erik,
Thanks for the breakdown.
With regard to Condor, I know Duncan Brown at Syracuse has a working script, 'condor_mpirun', that's GPL'd. I'm including a copy of a message that circulated on a LIGO software mailing list describing how to run Condor MPI jobs.
So you'd submit to a cluster with
condor_submit <name-of-script-text-file-included-below>
which in turn calls the condor_mpirun script.
See http://www.cs.wisc.edu/condor/manual/v7.4/2_9Parallel_Applications.html
for a description of the underlying issues. I haven't used Condor in parallel for a few years, but it was particularly convenient when I did, and it's the submission system I'll likely be using in the future.
-Richard
> From: Duncan Brown <dabrown at physics.syr.edu>
> Date: October 6, 2010 8:35:36 AM CDT
> To: daswg at gravity.phys.uwm.edu, Vivien Raymond <vivien at u.northwestern.edu>
> Cc: Stuart Anderson <anderson at ligo.caltech.edu>, LDAS_ADMIN_ALL <ldas_admin_all at ligo.caltech.edu>
> Subject: Re: [DASWG] MPI jobs on LDG clusters
> Reply-To: daswg at gravity.phys.uwm.edu
>
> Hi Vivien,
> On Oct 5, 2010, at 8:26 PM, Vivien Raymond wrote:
>> Thanks Stuart, I can now compile and run my jobs locally on
>> ldas-pcdev1.ligo.caltech.edu. As far as I understand from Xavier, there
>> is no way to condor_submit MPI jobs yet, is there?
>
>
> On sugar-dev1.phy.syr.edu, you can use the installation of openmpi
> 1.2.5 in
>
> /opt/condor/openmpi-1.2.5
>
> You should be able to build and link your code against this without
> any problems. Once you've mpi-compiled, you can use the script below
> to run your job. Change
>
> /path/to/your/mpiexecutable
>
> as appropriate. Note: *don't* change "executable"; that has to be
> condor_mpirun. Change HOWMANYCORES to the appropriate integer. Dump
> the lines below into a Condor .sub file (e.g. spinspiral.sub) and then
> condor_submit it as normal.
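>
> (For the build step, something along these lines should work -- just a
> sketch assuming the standard layout under /opt/condor/openmpi-1.2.5; the
> compiler wrapper and source-file name are placeholders for your own code:
>
> export PATH=/opt/condor/openmpi-1.2.5/bin:$PATH
> mpicc -O2 -o mpiexecutable your_mpi_code.c
>
> Use mpicxx or mpif90 from the same bin directory for C++ or Fortran sources.)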
>
> Let me know if you have problems. A similar thing should work at CIT
> (I'll have to point you to my condor_mpirun, as it's not installed
> system-wide), but let's try sugar first. There may be some SpEC-
> specific stuff we have to iron out.
>
> BTW, if you modify your code so that it exits gracefully and can
> resume after a SIGUSR2, the code can checkpoint on eviction (SpEC does
> this). If not, I'll just lock you onto as many cores as you need and we
> can deal with this later.
>
> Cheers,
> Duncan.
>
> universe = parallel
> executable = /opt/condor/bin/condor_mpirun
> arguments = --verbose --stdout cluster$(CLUSTER).proc$(PROCESS).mpiout --stderr cluster$(CLUSTER).proc$(PROCESS).mpierr /path/to/your/mpiexecutable
> machine_count = HOWMANYCORES
> log = cluster$(CLUSTER).proc$(PROCESS).log
> output = cluster$(CLUSTER).proc$(PROCESS).subproc$(NODE).out
> error = cluster$(CLUSTER).proc$(PROCESS).subproc$(NODE).err
> notification = Always
> on_exit_remove = (ExitBySignal == True) || (ExitCode != 143)
> rank = (40 - (2.0 * TotalCondorLoadAvg))
> queue
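>
> (Once that's saved as e.g. spinspiral.sub, submitting and watching it is
> the usual Condor workflow -- the file name here is just the example above:
>
> condor_submit spinspiral.sub
> condor_q
> )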
>
>
> --
>
> Duncan Brown Room 263-1, Department of Physics,
> Assistant Professor of Physics Syracuse University, NY 13244, USA
> Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
>
>
------
On Oct 16, 2010, at 2:48 PM, Erik Schnetter wrote:
> Richard
>
> A Cactus configuration is defined by its thorn list, i.e. by the set
> of thorns which it includes. You can compile more thorns into a
> configuration than you activate at run time. I usually have a single,
> large configuration containing many thorns, and activate a few of them
> at run time. I use the default name "sim" for this configuration.
>
> (I also have a variant sim-debug with debugging enabled, which runs at
> reduced speed but contains more run-time checks to catch coding
> errors.)
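>
> (Concretely, building these two configurations would look roughly like
> the build command quoted further down in this thread -- the thorn-list
> path there is only an example:
>
> ./simfactory/sim build sim --thornlist=manifest/einsteintoolkit.th
> ./simfactory/sim build sim-debug --thornlist=manifest/einsteintoolkit.th --debug
> )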
>
> Others prefer to have smaller configurations, and create a new
> configuration for each project or simulation. I can see that this is a
> good idea, since re-building my large configuration from scratch can
> take some time.
>
> No, there is no consistency between configurations on different
> machines. We enforce consistency between source trees, but not between
> configurations. However, I think enforcing it would be a good idea. We
> could e.g. derive the configuration name from the thorn list, and
> replicate thorn lists to remote systems -- this would (in a way) ensure
> consistency. Of course, you then still need to ensure that a
> configuration is re-built whenever the thorn list or the source tree
> changes, which we don't yet do automatically.
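>
> (As a rough sketch of what I mean -- this is not something simfactory
> does today, just an illustration of deriving a configuration name from
> the thorn list:
>
> thornhash=$(md5sum manifest/einsteintoolkit.th | cut -c1-8)
> ./simfactory/sim build sim-$thornhash --thornlist=manifest/einsteintoolkit.th
> )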
>
> We don't have submit scripts for Condor's Parallel Universe. Can you
> give us a pointer to details of this system?
>
> -erik
>
> On Fri, Oct 15, 2010 at 4:21 PM, Richard O'Shaughnessy
> <oshaughn at gravity.phys.uwm.edu> wrote:
>> Hi Erik,
>> Thanks -- I didn't realize I was accidentally (and consistently!) adding
>> static_tov to the build command.
>> How does the build system work? It's obvious now that I can (for example)
>> build wavetoy, static_tov, and ks-mclachlan at once, and run instances of
>> each independently. But what about maintaining the source trees -- do I
>> need to rebuild each named configuration? Can I check if there are updates
>> to a particular configuration's source tree? Is there any consistency
>> enforced between configurations used on a local host and a remote build
>> (i.e., if I want to be sure I use the same code tag on each of many target
>> clusters)?
>> -Richard
>> PS: On a related note, are there submission scripts for condor's parallel
>> universe?
>> On Oct 15, 2010, at 3:58 PM, Erik Schnetter wrote:
>>
>> Richard
>>
>> You created a configuration with the non-default name "static_tov".
>> (The default name would be "sim"). Therefore you need to specify this
>> configuration name when you create and submit a simulation:
>>
>> ./simfactory/sim create-submit static_tov --configuration=static_tov
>> --parfile=...
>>
>> You will then have a configuration and a simulation with the same
>> name; this does not matter. I usually have a single configuration
>> "sim", and use this configuration for all my simulations.
>>
>> -erik
>>
>> On Fri, Oct 15, 2010 at 1:16 PM, Richard O'Shaughnessy
>> <oshaughn at gravity.phys.uwm.edu> wrote:
>>
>> Hi Erik
>>
>> After a seemingly successful compile, I tried a simple single-cpu run
>> (using generic.sh submission) on my cluster head node. I believe this
>> installation is the release version; the only changes have been to the
>> optionlist and udb.pm. A similar configuration works on other machines
>> (i.e., my laptop, albeit with svn rather than release versions), but for
>> some reason not here. It's not creating the simulations directory at all.
>> Thoughts?
>>
>> --- command-line
>>
>> [oshaughn at hydra Cactus]$ ./simfactory/sim create-submit static_tov --parfile=par/static_tov.par --procs=1 --walltime=8:0:0
>> Simulation Factory:
>> Configuration name(s) not specified -- using default configuration "sim"
>> Uncaught exception from user code:
>>   Configuration "sim" contains no executable at ./simfactory/sim line 5216.
>>   at ./simfactory/sim line 5216
>>   main::get_executable() called at ./simfactory/sim line 1883
>>   main::command_create('static_tov') called at ./simfactory/sim line 2955
>>   main::command_create_submit('static_tov') called at ./simfactory/sim line 452
>>
>> Richard
>>
>> The error message "no executable" indicates that your build didn't
>> complete. There are probably problems with your compiler or linker
>> options.
>>
>> - All Cactus executables are stored in Cactus's "exe" directories.
>>   What does "ls exe" say?
>> - Did you specify a different name for your configuration while
>>   building? If so, you need to use the --configuration=... option when
>>   submitting the simulation.
>> - If you use the --debug or --profile flag while building, you also
>>   need to specify it while submitting a simulation, since you'll need to
>>   use the debugging or profiling executable (see the sketch after this
>>   list).
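>>
>> (For example, a build with a non-default configuration name and --debug
>> would be submitted along these lines -- the simulation name here is just
>> a placeholder:
>>
>> ./simfactory/sim create-submit mysim --configuration=static_tov --debug --parfile=par/static_tov.par
>> )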
>>
>>
>> Ok. I'm simply trying to follow the EinsteinToolkit new user instructions
>> on a new machine, as a test case, with one CPU.
>>
>> 1) executables are made:
>>
>> [oshaughn at hydra Cactus]$ ls exe
>> cactus_static_tov static_tov
>>
>> 2) I just changed udb.pm (see original email). The build command (which
>> says it completed successfully) is
>>
>> [oshaughn at hydra Cactus]$ ./simfactory/sim build static_tov --thornlist=manifest/einsteintoolkit.th
>>
>> 3) I didn't specify any debugging options.
>>
>> -erik
>>
>> --
>> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
>>
>> Richard O'Shaughnessy oshaughn at gravity.phys.uwm.edu
>> 462 Physics Building Phone: 414 229 6674
>> Center for Gravitation and Cosmology
>> University of Wisconsin, Milwaukee 53211
>>
>> --
>> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
>>
>> Richard O'Shaughnessy oshaughn at gravity.phys.uwm.edu
>> 462 Physics Building Phone: 414 229 6674
>> Center for Gravitation and Cosmology
>> University of Wisconsin, Milwaukee 53211
>>
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
Richard O'Shaughnessy oshaughn at gravity.phys.uwm.edu
462 Physics Building Phone: 414 229 6674
Center for Gravitation and Cosmology
University of Wisconsin, Milwaukee 53211
-------------- next part --------------
A non-text attachment was scrubbed...
Name: condor_mpirun
Type: application/octet-stream
Size: 21897 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20101017/b8dbb578/attachment-0001.obj