[Users] Another restart from checkpoint failure

Jakob Hansen jakobidetsortehul at gmail.com
Thu Feb 17 22:47:54 CST 2011


Thank you for the suggestions,

Oleg, your h5repack script did the trick and I can now resume my simulation,
thanks :)

Cheers,
Jakob

2011/2/18 Oleg Korobkin <korobkin at phys.lsu.edu>

> Hi Jakob,
>
> I had the same problem some time ago. Since your HDF5 files are
> compressed, Cactus tries to unpack them before reading, and this
> operation is, for some reason, very memory-intensive. Try repacking
> your checkpoint files before restarting to reduce the compression
> level to zero. You can use the standard h5repack utility for that.
> Here's a small script that uncompresses all checkpoint files and puts
> them in an $OUTDIR directory:
>
> ----------------
> #!/bin/bash
>
> ITER=5555            # which iteration to use
> OUTDIR=$SCRATCH/tmp  # output directory
> for f in checkpoint.chkpt.it_${ITER}.file_*.h5; do
>   echo "processing $f..."
>   # -f NONE rewrites the file with all compression filters removed
>   h5repack -i "$f" -o "$OUTDIR/$f" -f NONE
> done
> ----------------
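>
> To sanity-check the repacked files before restarting, something like
> this works (a minimal sketch, assuming the same $ITER and $OUTDIR as
> above):
>
> ----------------
> #!/bin/bash
>
> ITER=5555
> OUTDIR=$SCRATCH/tmp
> for f in $OUTDIR/checkpoint.chkpt.it_${ITER}.file_*.h5; do
>   # h5ls exits non-zero if the file cannot be opened or traversed
>   h5ls "$f" > /dev/null 2>&1 && echo "$f: ok" || echo "$f: BROKEN"
> done
> ----------------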
>
> This will make your checkpoint files 5-10 times larger, but because they
> don't need to be unpacked in RAM, less memory is required to read them.
>
> Also, try the following:
> 1. set (see the parameter-file excerpt below):
>   CarpetIOHDF5::open_one_input_file_at_a_time = yes
>   IO::abort_on_io_errors = yes
> 2. use fewer cores per node.
> 3. try running on a larger number of processors.
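>
> In the parameter file this would look like the excerpt below (a sketch;
> IO::recover and IO::recover_dir are shown only as an example of pointing
> the restart at the repacked checkpoints, and the path is a placeholder):
>
> ----------------
> # open input files one at a time to lower peak memory during recovery
> CarpetIOHDF5::open_one_input_file_at_a_time = yes
> # abort immediately on I/O errors instead of carrying on
> IO::abort_on_io_errors = yes
> # recover from the repacked checkpoint files (example path)
> IO::recover     = "autoprobe"
> IO::recover_dir = "/scratch/tmp"
> ----------------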
>
> Cheers,
>  - Oleg Korobkin
>
> Frank Loeffler wrote:
> > Hi,
> >
> > On Thu, Feb 17, 2011 at 03:24:04PM +0900, Jakob Hansen wrote:
> >>   #005: H5Zdeflate.c line 133 in H5Z_filter_deflate(): memory allocation
> >> failed for deflate uncompression
> >>     major: Resource unavailable
> >>     minor: No space available for allocation
> >
> >> Any ideas on the possible cause and a solution to this?
> >
> > Looking at these error messages, I suggest you first do an h5ls/h5dump
> > (see the sketch below) to check whether the files are actually ok. If
> > they are, the next thing I would suspect is that reading the files
> > takes more memory than is available, as indicated by the message above.
> > In that case, try the workarounds mentioned in this thread:
> >
> > http://lists.einsteintoolkit.org/pipermail/users/2011-February/000852.html
> >
> > If this also doesn't help, I would suggest trying again with more
> > available memory, e.g. using more nodes.
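> >
> > For that check, a minimal sketch (assuming the usual
> > checkpoint.chkpt.it_*.file_*.h5 naming) would be:
> >
> > ----------------
> > #!/bin/bash
> > # -H dumps only the header/metadata, so no dataset is read into RAM;
> > # a damaged file will normally make h5dump fail here
> > for f in checkpoint.chkpt.it_*.file_*.h5; do
> >   h5dump -H "$f" > /dev/null || echo "$f appears damaged"
> > done
> > ----------------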
> >
> > Frank