[Users] visualization-friendly HDF5 format

Roland Haas rhaas at aei.mpg.de
Thu Aug 20 02:49:52 CDT 2015


Hello all,

> One of the problems is that readers currently have to iterate over all
> datasets present to get to some information they need. Getting the list
> of all datasets by itself can take quite a while on a regular hdf5 file,
> but then readers also have to look at attributes within each dataset.
> While all of this is necessary to visualize all data in a given file,
> most of the time not all of the data is actually necessary, and
> certainly not for just 'opening the file'.
For postprocessing, what you usually need is a list of the timesteps in
the file, along with which variables and times exist in / correspond to
those timesteps. For each dataset you then need to know pretty much all
of the attributes to be able to properly place the dataset when reading
it from disk and to decide which map/refinement level etc. it is on.
For postprocessing, a simple table that just concatenates all attribute
information from all datasets in all files that make up the output (eg
one set of files for one variable) is sufficient: one is usually fine
even if this table becomes large (GB), since it is still small compared
to the full dataset. For visualization/quick tests the GB sized file
may be unsuitable.
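Something like this rough sketch is what I mean (Python with h5py; the
file names are made up and the attribute handling is illustrative, not
the actual CarpetIOHDF5 reader):

  import h5py

  def build_attribute_table(filenames):
      """Flat index: one (file, dataset, attributes) row per dataset.

      The 'concatenate everything' approach: fine for postprocessing,
      but the table can grow to GB size for large runs.
      """
      rows = []
      for fn in filenames:
          with h5py.File(fn, "r") as f:
              def visit(name, obj):
                  if isinstance(obj, h5py.Dataset):
                      rows.append((fn, name, dict(obj.attrs)))
              f.visititems(visit)
      return rows

  # hypothetical usage on one variable's set of output files:
  # table = build_attribute_table(["phi.file_0.h5", "phi.file_1.h5"])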

A good compromise would be (it seems to me) a hierarchical structure
that is indexed first by timestep, along with a table of times as a
function of timestep (since the timestep is arbitrary and just a
counter in Cactus). This would implicitly assume that the grid
structure is identical between variables, which need not be true; in
fact I had a case where it was not. This works well for postprocessing
since one typically steps through the timeseries sequentially.

An alternative that may be easier for visualization is to index by
variable name first.
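To make this concrete, a minimal sketch of what such an index could
look like (Python/h5py; the group and attribute names, and the layout
itself, are made up for illustration and are not an existing
CarpetIOHDF5 format):

  import h5py
  import numpy as np

  with h5py.File("metadata.h5", "w") as f:
      idx = f.create_group("index")
      # times as a function of the (arbitrary) Cactus timestep counter
      times = np.array([(0, 0.0), (256, 1.5), (512, 3.0)],
                       dtype=[("iteration", "i8"), ("time", "f8")])
      idx.create_dataset("times", data=times)
      # timestep-first ordering; the variable-first alternative would
      # simply swap the two levels of the hierarchy
      g = idx.create_group("it=0/phi")
      g.attrs["maps"] = 1
      g.attrs["refinement_levels"] = 3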

I agree that rather than making up some set of data out of thin air we
should try to get some input from the target audience (visualization
tools, postprocessing needs). For visualization, asking the yt
developers seems like a good idea; we can also check what eg the VisIt
reader needs to provide to VisIt (which is per-timestep and
per-variable information, I think), and we could see what the F5 format
stores, given that its author is also involved in visualization
toolkits.

> operations a reader needs to be fast:
> - list of variables (at a given iteration)
> - list of time steps / iterations
> - AMR structure for one given iteration (all maps, rls and components)
All agreed.
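With an index like the sketch above, all three operations reduce to
cheap group listings (again Python/h5py, hypothetical layout):

  import h5py

  with h5py.File("metadata.h5", "r") as h5:
      idx = h5["index"]
      # list of time steps / iterations
      iterations = [k for k in idx if k.startswith("it=")]
      # list of variables at a given iteration
      variables = list(idx["it=0"])
      # AMR structure for that iteration (maps, rls, components)
      amr = {v: dict(idx["it=0/" + v].attrs) for v in variables}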

> Regardless of these additional meta-data, we already established the
> need for a meta-data file, effectively a copy of all datasets but
I would actually prefer if this information were included in the HDF5
files themselves (replicated in each of them if possible) and written
at the time the HDF5 file is created. We have some metadata of this
type in the index files that CarpetIOHDF5 can write, yet they are not
sufficient for very large datasets (since they do not reduce the number
of files one has to parse).
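As a sketch of the replication idea (assuming the index stays small
enough to copy into every output file at write time; file names are
made up):

  import h5py

  with h5py.File("metadata.h5", "r") as src:
      for fn in ("phi.file_0.h5", "phi.file_1.h5"):
          with h5py.File(fn, "a") as dst:
              # copy the whole index group into the data file itself,
              # so readers never need a separate metadata file
              if "index" not in dst:
                  src.copy("index", dst)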

> without the actual data. Am I remembering correctly that the idea was to
> write this at run-time (eventually - right now it could be generated as
> post-processing)?
Yes, I would very much like it to be written at runtime.

> Is this all readers would need?
I suspect that there will always be use cases where the information is
arranged in just the "wrong" way to be convenient. So the best we can
do is support "typical", "well known" use cases.

Yours,
Roland


