[Users] Error to write PittNull during checkpointing

Yosef Zlochower yosef at astro.rit.edu
Sun Jul 29 16:00:54 CDT 2012


I wonder if this hack may help.

In SphericalHarmonicDecomp_DumpMetric
add a blocking IO operation before the Decompose3D calls.
Perhaps something like:

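{
    /* Use the thorn's out_dir if it is set, otherwise fall back to io_out_dir. */
    const char *outdir = *out_dir ? out_dir : io_out_dir;
    char filename[BUFFSIZE];
    snprintf(filename, sizeof filename,
             "%s/obs_%d_test_io_ready", outdir, obs);
    /* Blocking append: this only returns once the file system responds. */
    FILE *file = fopen(filename, "a");
    assert(file);
    fprintf(file, "test_if_ready\n");
    fflush(file);
    fclose(file);
}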

If that doesn't help, then perhaps you can set SphericalHarmonicDecomp
to abort the run when this happens.
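
A minimal sketch of the delay-then-abort idea, in plain C rather than
actual thorn code. The helper name open_with_retry and its parameters
are hypothetical and not part of SphericalHarmonicDecomp; the delay is
in whole seconds and uses POSIX sleep().

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() (POSIX) */

/* Hypothetical helper: try to open `path` up to `max_tries` times,
 * waiting `delay_s` seconds between failed attempts so that a busy
 * file system has time to work through its backlog.
 * Returns NULL if every attempt failed. */
static FILE *open_with_retry(const char *path, const char *mode,
                             int max_tries, unsigned delay_s)
{
    assert(path && mode && max_tries > 0);
    for (int attempt = 1; attempt <= max_tries; ++attempt) {
        FILE *file = fopen(path, mode);
        if (file)
            return file;
        fprintf(stderr, "open of %s failed (attempt %d of %d)\n",
                path, attempt, max_tries);
        if (attempt < max_tries)
            sleep(delay_s);
    }
    return NULL;
}
```

A caller that gets NULL back could then stop the run instead of
silently skipping the output, which is what would otherwise leave a
gap in the decomposition file.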

On 07/29/2012 04:31 PM, Yosef Zlochower wrote:
> On 07/28/2012 07:38 AM, Jakob Hansen wrote:
>> Hi all,
>>
>> Thanks for the fast replies ^^
>>
>> 2012/7/27 Yosef Zlochower <yosef at astro.rit.edu>
>>
>>     Hi,
>>
>>      SphericalHarmonicDecomp should not be writing output at the
>>     same time as a checkpoint. Are you using an NFS mount? 
>>
>>
>> No, we're using Lustre filesystem.
>>
>>
>>     I noticed
>>     issues with NFS servers becoming unresponsive (due to a large number
>>     of blocking io operations) during a checkpoint. Perhaps right after
>>     a checkpoint, the server is still too busy.
>>
>>
>> Well, indeed this happens just after checkpointing, however not at 
>> every checkpoint and not for all output. I experienced this error 
>> twice on two different simulations, once for each simulation. In each 
>> case it affected the metric_obs_0_Decomp.h5 right after checkpointing :
>>
>> Case 1 : Output from ascii_output > gxx.asc :
>> 2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
>> 2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
>>                  nan 0.0000000000000000e+00 0.0000000000000000e+00
>> 2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
>> 2.7680400000000003e+02 -7.1871563990008602e-05 1.5797542778149928e-05
>>
>> In this case there was a checkpoint at time 276.408 : INFO 
>> (CarpetIOHDF5): Dumping periodic checkpoint at iteration 268032, 
>> simulation time 276.408
>>
> The NaN is just the way ascii_output lets you know it couldn't read
> the data.
>
>>
>> Case 2: Output from ascii_output > gxx.asc :
>> 3.2010000000000002e+02 -6.7572912816444132e-05 2.1144268516557760e-05
>> 3.2023200000000003e+02 -6.7570803118733184e-05 2.1156692762978387e-05
>>                  nan 0.0000000000000000e+00 0.0000000000000000e+00
>> 3.2049600000000004e+02 -6.7568973330671772e-05 2.1182079214383173e-05
>> 3.2062800000000004e+02 -6.7569118827631724e-05 2.1194791416536239e-05
>>
>> With a checkpoint at time 320.232 :  INFO (CarpetIOHDF5): Dumping 
>> periodic checkpoint at iteration 310528, simulation time 320.232
>>
>>
>> Also, in both cases it only affected the metric_obs_0_Decomp.h5 file, 
>> the other detection radius files, _1 and _2, had all data.
>>
>>
> Do the Caltech fixes help? If that doesn't work, then it may be that
> your IO system is saturated. Perhaps then a crude workaround would be
> to put a delay in after a checkpoint to give the IO system time to
> process its backlog of IO requests.
>> <snip>
>>
>>
>>
>> 2012/7/28 Erik Schnetter <schnetter at cct.lsu.edu>
>>
>>     Jakob
>>
>>     I have not heard about such a problem before.
>>
>>     When an HDF5 file is not properly closed, its content may be
>>     corrupted. (This will be addressed in the next major release.)
>>     There may be two reasons for this: either the file is not closed
>>     (which would be an error in the code), or there is a write error
>>     (e.g. you run out of disk space). The latter is the major reason
>>     for people encountering corrupted HDF5 files. Since you don't see
>>     error messages, this is either not the case, or these HDF5 output
>>     routines suppress these errors.
>>
>>     The thorn SphericalHarmonicDecomp implements its own HDF5 output
>>     routines and does not use Cactus. I see that it uses a
>>     non-standard way to determine whether the file exists, and that
>>     it does not check for errors when writing or closing. I think
>>     that HDF5 errors should cause prominent warnings in stdout and
>>     stderr (did you check?), and if you don't see these, the writing
>>     should have succeeded.
>>
>>
>> The errors I see on stderr are the ones I mentioned in my first mail :
>>
>>
>>
>> HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
>> #000: H5F.c line 1509 in H5Fopen(): unable to open file
>> major: File accessability
>> minor: Unable to open file
>> #001: H5F.c line 1300 in H5F_open(): unable to read superblock
>> major: File accessability
>> minor: Read failed
>>
>> .... etc.
>>
>> stdout does not produce any errors or warnings related to this.
>>
>>
>>     You mention checkpointing. Are you experiencing these problems
>>     right after recovery, i.e. during the first
>>     SphericalHarmonicDecomp HDF5 output afterwards?
>>
>>
>> No, this happened during the first run, not related to recovery.
>>
>>     In this case, did you maybe switch to a new directory where this
>>     file doesn't exist?
>>
>>     If not, then it may be the non-standard way in which the code
>>     determines whether the file already exists, combined with
>>     something that may be special about your file system.
>>
>>     (The "standard" way operates as follows: open the file as if it
>>     existed; if this fails, open it by creating it. The code works
>>     differently: it opens the file as a binary file. If this fails, the
>>     HDF5 file is created; if it succeeds, the file is closed and
>>     re-opened as an HDF5 file. Maybe the quick closing-then-reopening
>>     causes problems?)
>>
>>     -erik
>>
>>
>>
>>
>> 2012/7/28 Roland Haas <roland.haas at physics.gatech.edu>
>>
>>     Hello all,
>>
>>     >> I have not heard about such a problem before.
>>     I believe Nick Taylor at Caltech had similar issues. Bela has since
>>     fixed some bugs but had trouble actually committing them (he just saw
>>     your emails). I'll grab his changes and commit them.
>>
>>
>> Sounds interesting, I'm looking forward to applying the changes and
>> seeing if the problem disappears.
>>
>> Cheers,
>> Jakob
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
>
>
> -- 
> Dr. Yosef Zlochower
> Center for Computational Relativity and Gravitation
> Assistant Professor
> School of Mathematical Sciences
> Rochester Institute of Technology
> 85 Lomb Memorial Drive
> Rochester, NY 14623
>
> Office:74-2067
> Phone: +1 585-475-6103
>
> yosef at astro.rit.edu
>
> CONFIDENTIALITY NOTE: The information transmitted, including
> attachments, is intended only for the person(s) or entity to which it
> is addressed and may contain confidential and/or privileged material.
> Any review, retransmission, dissemination or other use of, or taking
> of any action in reliance upon this information by persons or entities
> other than the intended recipient is prohibited. If you received this
> in error, please contact the sender and destroy any copies of this
> information.
>
>

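For reference, the "standard" open-then-create sequence Erik describes
in the quoted text above can be sketched as follows. This is only an
analogy using plain C stdio so that it stands alone; with HDF5 the
corresponding calls would be H5Fopen followed, on failure, by
H5Fcreate. The function name open_or_create is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open `path` for read/write as if it existed;
 * only if that fails, create it. Sets *created to 1 when a new file
 * was made, 0 otherwise. Returns NULL if both steps fail. */
static FILE *open_or_create(const char *path, int *created)
{
    assert(path && created);
    *created = 0;

    /* Step 1: open the existing file without truncating it. */
    FILE *file = fopen(path, "r+b");
    if (!file) {
        /* Step 2: the open failed, so fall back to creating the file. */
        file = fopen(path, "w+b");
        if (file)
            *created = 1;
    }
    return file;
}
```

The point of this ordering is that there is no separate existence
test, and no close-then-reopen, between deciding that the file exists
and actually using it.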
