[ET Trac] #2543: Consolidate data formats to simplify postprocessing

Mon Aug 2 08:40:17 CDT 2021

#2543: Consolidate data formats to simplify postprocessing

 Reporter: Wolfgang Kastaun
   Status: new
Milestone: 
  Version: development version
     Type: enhancement
 Priority: minor
Component: 

Currently, writing postprocessing tools for ET is unnecessarily difficult because the required information has to be collected from many locations, the code has to accommodate competing standards, and sometimes it has to resort to guesses based on heuristics. Below is a list of improvements from the postprocessing viewpoint; it is not complete and can be extended over time.


1. Table of contents for grid variable output. Each output folder should contain a machine-readable file that keeps track of all files containing grid data, with a list of variables and the available timesteps for each variable. Of course, this should distinguish between 1D, 2D and 3D output. Currently, one has to open all files and parse their content for metadata, which can be very slow with HDF5. The issue is especially problematic when using one file per group.
2. The same for reduction output.
3. A machine-readable file with all parameters and their values, including those that are not set in the parfile and take their default values. The values should be plain, fully evaluated values; postprocessing code should not have to emulate the handmade programming language that parfiles have become. Each folder with a restart should have one such file with a standard name in a standard location.
4. The reductions thorn should also output enough information to convert norm1/average into volume integrals, i.e. a scalar x such that `volume integral = x * average`.
5. Unique extensions. There should be exactly one extension for each type of file, across all standard thorns that produce output. In particular, just adding `.h5` is not enough. For example, 3D data currently has the extension `xyz.h5` or just `.h5`, and multipole data can have the extension `.h5` as well.
6. Simfactory should also provide machine-readable metadata about restarts and simulation folders. It should be possible to easily obtain a tree-like structure of the various restarts, complete with iteration ranges.
7. One standard format for timeseries. Currently reductions, 0D output, multipoles, and AHorizonFinder all have their own formats; multipoles even has two formats. Timeseries files should also contain metadata with the range of available iterations and times.
8. Settle on one format for each type of data and deprecate the rest, unless several formats are really needed, e.g. for performance reasons. A prime example is 2D ASCII output, which is inferior to HDF5 in every way. If deprecation is not possible, tools that convert all competing formats into a canonical format after the simulation might help. Another duplication of effort is caused by the one-per-group/one-per-variable duality.
9. There should be support for adding arbitrary metadata to simulations. For example, initial data thorns could add model properties such as BNS masses, spins, separations. There should be an API such that code can add metadata, and some standard location where all metadata is collected in a machine- and human-readable format such as JSON. This should be on the simulation level, such that new metadata can be added during restarts, but existing metadata is immutable. Each thorn should have its own metadata namespace.
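
To make item 9 more concrete, here is a minimal Python sketch of an append-only metadata store with one namespace per thorn. Everything in it is an assumption made for illustration: the class name `SimulationMetadata`, the file name `metadata.json`, the thorn name `MyInitialData` and the keys are hypothetical, and no such API currently exists in the toolkit.

```python
import json
import tempfile
from pathlib import Path


class SimulationMetadata:
    """Hypothetical append-only metadata store backed by a single JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # A restart picks up whatever earlier restarts already recorded.
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {}

    def add(self, thorn, key, value):
        """Add a new entry to the namespace of `thorn`; existing keys are immutable."""
        namespace = self.data.setdefault(thorn, {})
        if key in namespace:
            raise KeyError(f"{thorn}::{key} is already set and cannot be changed")
        namespace[key] = value
        self.path.write_text(json.dumps(self.data, indent=2, sort_keys=True))


def demo():
    """Simulate an initial run and one restart writing to the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.json"  # hypothetical standard location

        first_run = SimulationMetadata(path)
        first_run.add("MyInitialData", "mass_1", 1.4)

        restart = SimulationMetadata(path)
        restart.add("MyInitialData", "separation_km", 45.0)  # new key: allowed
        try:
            restart.add("MyInitialData", "mass_1", 1.5)  # existing key: rejected
            overwritten = True
        except KeyError:
            overwritten = False

        return json.loads(path.read_text()), overwritten
```

Calling `demo()` returns the content of the JSON file after both runs, together with a flag that shows whether the attempt to overwrite an existing value succeeded. A real implementation would also need to deal with concurrent writers and with a C/Fortran-facing API, which this sketch ignores.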

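The table of contents from items 1 and 2 could look roughly like the following Python sketch, which builds an index by dimensionality and variable and serializes it as JSON. The layout of the index, the function name `build_index` and the file names in the example are assumptions made for illustration, not an existing format.

```python
import json


def build_index(records):
    """Build a table of contents from (filename, dimension, variable, iteration) records.

    The layout is hypothetical: dimension -> variable -> files and iterations.
    """
    index = {}
    for filename, dim, variable, iteration in records:
        entry = index.setdefault(dim, {}).setdefault(
            variable, {"files": [], "iterations": []}
        )
        if filename not in entry["files"]:
            entry["files"].append(filename)
        if iteration not in entry["iterations"]:
            entry["iterations"].append(iteration)
    # Sort so that the file content does not depend on the order of the records.
    for variables in index.values():
        for entry in variables.values():
            entry["files"].sort()
            entry["iterations"].sort()
    return index


# Hypothetical records, as an output thorn could collect them while writing.
records = [
    ("rho.xy.h5", "2D", "HYDROBASE::rho", 256),
    ("rho.xy.h5", "2D", "HYDROBASE::rho", 0),
    ("rho.xyz.h5", "3D", "HYDROBASE::rho", 0),
    ("rho.x.h5", "1D", "HYDROBASE::rho", 0),
]

index = build_index(records)
toc_text = json.dumps(index, indent=2, sort_keys=True)
```

With such a file in each output folder, a postprocessing tool can answer "which iterations of which variable are available in 2D" by reading `toc_text` alone, without opening any HDF5 file.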

--
Ticket URL: https://bitbucket.org/einsteintoolkit/tickets/issues/2543/consolidate-data-formats-to-simplify