[ET Trac] [Einstein Toolkit] #2073: IO corruption on SDSC oasis file systems
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Aug 25 08:52:10 CDT 2017
#2073: IO corruption on SDSC oasis file systems
-------------------------------+--------------------------------------------
Reporter: rhaas | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Other | Version: development version
Keywords: Comet Gordon SDSC |
-------------------------------+--------------------------------------------
I am experiencing file corruption in ASCII output produced by the Cactus
code on Comet's (and Gordon's as far as I remember) scratch file systems.
This manifests as lines of output being mashed together in the output
file.
I have added strace calls to my job script to capture all arguments to the
OS's write() function and re-created the write() calls based on this.
Those write calls, when replayed on a login node, do produce a correct (no
mashed lines) file.
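
The replay step described above can be sketched in Python. This is only a minimal sketch, not the script actually used for this ticket. It assumes the strace log shows file-descriptor paths (as `strace -y` prints them, e.g. `write(7</path/to/file>, "...", N) = N`) and was recorded with a string limit (`-s`) large enough that no write is truncated. The function name `replay_writes` is hypothetical.

```python
import re

# Assumed strace -y line format: write(FD</path/to/file>, "escaped data", N) = N
# Only writes whose target path ends in grid-coordinates.xy.asc are kept.
WRITE_RE = re.compile(
    r'write\(\d+<[^>]*grid-coordinates\.xy\.asc>, "((?:[^"\\]|\\.)*)"'
)


def replay_writes(strace_lines):
    """Rebuild the byte stream that the logged write() calls should produce."""
    chunks = []
    for line in strace_lines:
        match = WRITE_RE.search(line)
        if match is None:
            continue  # not a write() to the file of interest
        escaped = match.group(1)
        # strace prints C-style escapes (\n, \t, octal); decode them to bytes.
        data = escaped.encode("latin-1").decode("unicode_escape").encode("latin-1")
        chunks.append(data)
    return b"".join(chunks)
```

The returned bytes can then be compared against the contents of the file on the scratch file system. Unlike a shell `printf` replay, this does not interpret `%` characters in the data.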
All output to the file in question was from rank 0 only even though the
code used MPI and ran on two MPI ranks.
The same code and number of MPI ranks produces a correct output file when
run on the $HOME file system.
Thus it seems to me as if there may be an issue with the file system. I
can try to reduce the test case to a more minimal example (right now it
is a full simulation, even though it runs for less than a minute).
You can find the job script (for account, SLURM options etc) here:
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/SIMFACTORY/SubmitScript
the script that launches the MPI executable here:
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/SIMFACTORY/RunScript
the strace output here:
/home/rhaas/strace/strace.1882[67].log
and the awk script to recreate the write calls is:
gawk -vFS='"' '/write.*\/grid-coordinates.xy.asc/{print "printf \""$2"\""}' ~/strace.18826.log >recreate.sh
The corrupted line is, e.g., line 161 of
/oasis/scratch/comet/rhaas/temp_project/simulations/OSTREAM_2_12/output-0000/TEST/sim/CarpetIOASCII/newsep/grid-coordinates.xy.asc
which reads
1 4 3 4 1 0.1666666666660.505076272276105285714 etc
but should read
1 4 3 4 1 0.166666666666667 -0.0714285714285714
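
A quick way to scan an output file for this kind of damage is sketched below. It is a heuristic under stated assumptions, not a definitive check: it assumes data lines contain only whitespace-separated numbers and that comment lines start with `#`. The names `find_mashed_lines` and `expected_fields` are hypothetical. A mashed token such as the one above contains two decimal points and therefore fails to parse as a number.

```python
def _is_number(token):
    """Return True if the token parses as a floating-point number."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def find_mashed_lines(lines, expected_fields=None):
    """Return 1-based numbers of lines that look corrupted.

    A line is flagged if any token is not a valid number, or if
    expected_fields is given and the field count differs from it.
    """
    suspicious = []
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if expected_fields is not None and len(tokens) != expected_fields:
            suspicious.append(lineno)
            continue
        if not all(_is_number(t) for t in tokens):
            suspicious.append(lineno)
    return suspicious
```

Two lines that happen to merge into tokens that are still valid numbers are only caught when `expected_fields` is supplied.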
I can avoid the file corruption by flushing the output file after each
line.
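
The flush-per-line workaround can be illustrated as follows. This is a sketch in Python rather than the actual Cactus output code, and the function name `write_lines_flushed` is hypothetical. Flushing after each line hands the data to the operating system in one write per line instead of in larger buffered chunks; it does not force the data onto disk.

```python
def write_lines_flushed(path, lines):
    """Write each line to path and flush the user-space buffer immediately."""
    with open(path, "w") as out:
        for line in lines:
            out.write(line + "\n")
            # Push this line to the OS right away instead of waiting
            # for the buffer to fill up.
            out.flush()
```

Flushing after every line trades throughput for smaller, line-aligned writes, so it is likely to be slow for large output.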
I am wondering if there is anything known about this or if there is a
workaround that does not boil down to first writing all data to a file system
local to the compute node and copying it to /oasis/scratch after the job has
finished. (How much local space would be available? I would also have
to do this for e.g. checkpoint files and 3D HDF5 output.)
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/2073>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit