[ET Trac] #2888: race condition writing properties.ini in simfactory

Roland Haas trac-noreply at einsteintoolkit.org
Wed Sep 24 13:39:51 CDT 2025


#2888: race condition writing properties.ini in simfactory

 Reporter: Roland Haas
   Status: new
Milestone: 
  Version: 
     Type: bug
 Priority: major
Component: SimFactory

Changes (by Roland Haas):
Currently both the `submit()` and `run()` functions in `lib/simrestart.py` write (update) the file `output-NNNN/properties.ini`. In particular, `submit()` does so _after_ submitting, to record the `jobid` field, while `run()` does so before running, to record the `checkpointing` value.

This means that (e.g. on SLURM where jobs start quickly, in particular when the `submit` command contains a `sleep 5` to slow down job submission) there can be a race condition:

1. `submit` writes initial copy of `properties.ini` lacking `jobid`
2. `submit` submits the job to SLURM
3. job starts and `run` reads `properties.ini` (still lacking `jobid`)
4. `submit` writes updated `properties.ini` with `jobid`
5. `run` writes `properties.ini` with `checkpointing`

At this point `jobid` is lost from `properties.ini`, because `run` writes back the copy it read in step 3, which predates the `jobid` update.
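The sequence above can be replayed deterministically in a few lines. This is only an illustrative sketch, not simfactory's code: the helpers `read_props`, `write_props` and `simulate_race`, the `[properties]` section name, and the sample values are all assumptions made for the example.

```python
import configparser
import os
import tempfile

SECTION = "properties"  # hypothetical section name, chosen for this sketch


def read_props(path):
    """Read the ini file and return its entries as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser[SECTION]) if parser.has_section(SECTION) else {}


def write_props(path, props):
    """Overwrite the ini file with exactly the entries in props."""
    parser = configparser.ConfigParser()
    parser[SECTION] = props
    with open(path, "w") as handle:
        parser.write(handle)


def simulate_race(path):
    """Replay steps 1-5 in the problematic order within a single process."""
    write_props(path, {"simulationid": "demo"})  # 1. submit: initial copy, no jobid
    # 2. submit hands the job to the queueing system (not modeled here)
    run_copy = read_props(path)                  # 3. run: reads the file early
    submit_copy = read_props(path)
    submit_copy["jobid"] = "12345"
    write_props(path, submit_copy)               # 4. submit: records jobid
    run_copy["checkpointing"] = "no"
    write_props(path, run_copy)                  # 5. run: writes back its stale copy
    return read_props(path)


with tempfile.TemporaryDirectory() as tmp:
    final = simulate_race(os.path.join(tmp, "properties.ini"))
    print(final)
```

The printed dictionary contains `checkpointing` but no `jobid`, which is the lost update described in the ticket.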

Fixes would be:

* add a lock for `properties.ini` which `submit` only releases once it has done its final update
* submit jobs in “held” state (`sbatch --hold`) and only release them once the final update of `properties.ini` has happened
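A rough sketch of the second option, assuming SLURM and hard-coding the commands that simfactory would in practice take from the machine ini files. The function names `run_command` and `submit_held` and the `record_jobid` callback are hypothetical; `sbatch --hold`, `sbatch --parsable` and `scontrol release` are standard SLURM commands.

```python
import subprocess


def run_command(argv):
    """Run a command and return its stdout as text; raises on non-zero exit."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def submit_held(script, record_jobid, run=run_command):
    """Submit a job in held state, record its id, then release it.

    record_jobid is a hypothetical callback standing in for the final
    update of properties.ini; run is injectable so the flow can be
    exercised without a SLURM installation.
    """
    # --hold keeps the job from starting; --parsable makes sbatch print
    # "jobid" or "jobid;cluster" instead of a full sentence.
    output = run(["sbatch", "--hold", "--parsable", script])
    jobid = output.strip().split(";")[0]

    # The job cannot start yet, so run() cannot read a stale file here.
    record_jobid(jobid)

    # Only now is the job allowed to start.
    run(["scontrol", "release", jobid])
    return jobid
```

Because the job stays held until `record_jobid` returns, `run` can only ever read a `properties.ini` that already contains `jobid`.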

Both will require updates to simfactory’s code. The first option would work without updates to the machine ini files; the second will require extra entries to tell simfactory how to release a job. That release mechanism could also be used to implement `hold` and `release` commands in simfactory. The first option will require fallback code in case the lock file is left behind (e.g. wait no longer than 1 minute before assuming a stale lock and going ahead anyway). The lock file will also require at least some level of POSIX conformance from the file system (namely that file creation and removal are atomic operations cluster-wide).
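A minimal sketch of the lock-file option with the stale-lock fallback, assuming the file system honours exclusive file creation cluster-wide. The name `properties_lock`, the lock file name `properties.ini.lock`, and the polling interval are assumptions for illustration, not existing simfactory code.

```python
import contextlib
import os
import tempfile
import time


@contextlib.contextmanager
def properties_lock(directory, timeout=60.0, poll=0.1):
    """Hold a lock file next to properties.ini while the body runs.

    Yields True if the lock was acquired, or False if the timeout
    expired and the caller proceeds anyway (lock assumed stale).
    """
    lock_path = os.path.join(directory, "properties.ini.lock")
    deadline = time.monotonic() + timeout
    owned = False
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so at
            # most one process can succeed in creating the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            owned = True
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                break  # assume a stale lock and go ahead anyway
            time.sleep(poll)
    try:
        yield owned
    finally:
        # Only remove a lock this process created itself.
        if owned:
            with contextlib.suppress(FileNotFoundError):
                os.remove(lock_path)


# Usage: submit would hold the lock from before its first write until
# after the jobid update; run would take it around its own update.
with tempfile.TemporaryDirectory() as outdir:
    with properties_lock(outdir) as acquired:
        print("lock acquired:", acquired)
```

In this sketch a timed-out caller proceeds without deleting the foreign lock file, so a truly stale lock would delay every later caller by the full timeout. Whether to remove a lock judged stale is a design decision left open here.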

This is most likely the reason for occasional strange failures on clusters where `jobid` is unset.

--
Ticket URL: https://bitbucket.org/einsteintoolkit/tickets/issues/2888/race-condition-writing-propertiesini-in