[ET Trac] [Einstein Toolkit] #1772: Simfactory: potentially serious problem with CACHE directory in the simulations directory
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue May 5 10:21:21 CDT 2015
#1772: Simfactory: potentially serious problem with CACHE directory in the
simulations directory
------------------------+---------------------------------------------------
Reporter: bmundim | Owner:
Type: defect | Status: new
Priority: critical | Milestone: ET_2015_05
Component: SimFactory | Version: development version
Keywords: CACHE |
------------------------+---------------------------------------------------
Is the CACHE directory in the simulations directory really necessary? We
are talking about executables of at most 400MB, which is negligible
compared to the capacity of current HPC storage systems.
I think I have found a design flaw in simfactory's use of the CACHE
directory, one that can go unnoticed until it is too late, with the
potential loss of thousands of SUs. Suppose we have the following situation:
1) We build a configuration A and submit a simulation A1 with parameter
file 1. Simfactory copies the executable from configuration A to the
simulation A1 SIMFACTORY directory and creates a symlink from
/scratch/simulations/CACHE/exe/cactus_A to
/scratch/simulations/A1/SIMFACTORY/exe/cactus_A.
2) We then create a new simulation A2 with a different parameter file 2.
This time simfactory symlinks the simulation executable
/scratch/simulations/A2/SIMFACTORY/exe/cactus_A to the cached one
/scratch/simulations/CACHE/exe/cactus_A.
3) After a few days (or restarts) of simulations A1 and A2, we come up
with a better idea, fix or new parameter that requires recompiling
configuration A. Note that we don't want to build a new configuration from
scratch, since Cactus configurations take a lot of both time and space
to build. So we rebuild configuration A, and its executable cactus_A
is updated.
4) Now we submit the updated configuration with the same parameter file 2
in order to test the new idea, fix or parameter and compare it with
simulation A2, which is still running and has a few more restarts to go
before completion. Call this simulation A2_updated. Simfactory then
copies the updated executable from Cactus/exe/cactus_A to
the simulation directory
/scratch/simulations/A2_updated/SIMFACTORY/exe/cactus_A *and* updates the
CACHE symlink to point to that new simulation directory, i.e.:
$ cd /scratch/simulations/CACHE/exe
$ ls -l cactus_A
cactus_A ->
../../../../scratch/simulations/A2_updated/SIMFACTORY/exe/cactus_A
5) The problem: the remaining restarts of simulation A2 are now
compromised, because they will run the new executable. Remember that the
A2 executable is actually a symlink to the one in the CACHE directory,
which has just been updated.
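The scenario in steps 1) to 5) can be mimicked with plain files and
symlinks in a throwaway directory. The following is only a sketch of the
behaviour described in this ticket, not Simfactory code: the helper names
submit_with_copy and submit_with_link are made up for illustration, a
temporary directory stands in for /scratch, and small text files stand in
for the Cactus executable.

```shell
#!/bin/sh
# Hypothetical reproduction of the symlink aliasing described in the ticket.
# Text files stand in for the executable; this is not Simfactory code.
set -eu
root=$(mktemp -d)                 # stand-in for /scratch and the Cactus tree
sims="$root/simulations"
mkdir -p "$root/Cactus/exe" "$sims/CACHE/exe"

# Stand-in for the first build of configuration A.
echo "build 1" > "$root/Cactus/exe/cactus_A"

# Steps 1 and 4: copy the current build into the simulation directory and
# repoint the CACHE entry at that copy.
submit_with_copy() {
    mkdir -p "$sims/$1/SIMFACTORY/exe"
    cp "$root/Cactus/exe/cactus_A" "$sims/$1/SIMFACTORY/exe/cactus_A"
    rm -f "$sims/CACHE/exe/cactus_A"
    ln -s "$sims/$1/SIMFACTORY/exe/cactus_A" "$sims/CACHE/exe/cactus_A"
}

# Step 2: the simulation only receives a symlink to the CACHE entry.
submit_with_link() {
    mkdir -p "$sims/$1/SIMFACTORY/exe"
    ln -s "$sims/CACHE/exe/cactus_A" "$sims/$1/SIMFACTORY/exe/cactus_A"
}

submit_with_copy A1                                   # step 1
submit_with_link A2                                   # step 2
a2_before=$(cat "$sims/A2/SIMFACTORY/exe/cactus_A")

echo "build 2" > "$root/Cactus/exe/cactus_A"          # step 3: rebuild
submit_with_copy A2_updated                           # step 4
a2_after=$(cat "$sims/A2/SIMFACTORY/exe/cactus_A")    # step 5

echo "A2 before rebuild: $a2_before"
echo "A2 after rebuild:  $a2_after"
```

Nothing under A2 is touched after step 2, yet the second line reports
"build 2": A2 resolves through CACHE to whichever simulation was submitted
last, while A1 keeps its own copy of "build 1".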
I think this intermediate cache directory introduces unnecessary
complexity for the user to track, and in my opinion it is not a good
design choice. I would vote to eliminate it from
simfactory completely as soon as possible, ideally even for this release.
Just copy the executable once from cactus/exe to
simulation/SIMFACTORY/exe and that's it. This is all we need to have that
simulation and future ones running consistently with the same executable.
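A minimal sketch of that proposal, under the same kind of assumptions (a
temporary directory instead of /scratch, text files instead of the real
executable, and a made-up helper named submit), could look like this. It
only illustrates the idea of one private copy per simulation and is not
how Simfactory is implemented.

```shell
#!/bin/sh
# Hypothetical sketch of the proposal: every simulation gets its own copy of
# the executable, with no CACHE directory and no symlinks.
set -eu
root=$(mktemp -d)                 # stand-in for /scratch and the Cactus tree
sims="$root/simulations"
mkdir -p "$root/Cactus/exe"

# Made-up helper: copy the current build into the simulation directory.
submit() {
    mkdir -p "$sims/$1/SIMFACTORY/exe"
    cp "$root/Cactus/exe/cactus_A" "$sims/$1/SIMFACTORY/exe/cactus_A"
}

echo "build 1" > "$root/Cactus/exe/cactus_A"   # first build of configuration A
submit A1
submit A2

echo "build 2" > "$root/Cactus/exe/cactus_A"   # rebuild of configuration A
submit A2_updated

a2=$(cat "$sims/A2/SIMFACTORY/exe/cactus_A")
a2_updated=$(cat "$sims/A2_updated/SIMFACTORY/exe/cactus_A")
echo "A2:         $a2"
echo "A2_updated: $a2_updated"
```

Here A2 still reports "build 1" after the rebuild, at the cost of one extra
copy of the executable per simulation.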
Thanks!
PS: I actually noticed this issue on the Herschel release (there is no
option for the Herschel release on trac). I am working on tests for the
development version to confirm this issue, but given the simfactory
commits I believe it is still there.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1772>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit