[Users] Stampede
Yosef Zlochower
yosef at astro.rit.edu
Fri May 2 07:08:26 CDT 2014
Hi
I have been having problems running on Stampede for a long time. I
couldn't get the latest stable ET to run because it would die during
checkpointing. I had to backtrack to the Orsted version (unfortunately,
that has a bug in the way the grid is set up, causing some of the
intermediate levels to span both black holes, wasting a lot of memory).
Even with Orsted, stalling is a real issue. Currently, my "solution" is
to run for 4 hours at a time. This would have been OK on Lonestar or
Ranger, because when I chained a bunch of runs, the next in line would
start almost right away, but on Stampede the delay is quite substantial.
I believe Jim Healy opened a ticket concerning the RIT issues with
running ET on Stampede.
On 05/02/2014 05:55 AM, Ian Hinder wrote:
> Hi all,
>
> Has anyone run into problems recently with Cactus jobs on Stampede?
> I've had jobs die when checkpointing, and also mysteriously hanging
> for no apparent reason. These might be separate problems. The
> checkpointing issue occurred when I submitted several jobs and they
> all started checkpointing at the same time after 3 hours. The hang
> happened after a few hours of evolution, with GDB reporting
>
>> MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>> at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
>> 296 for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
>
> Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've
> been in touch with support and they said the dying while checkpointing
> coincided with the filesystems being hit hard by my jobs, which makes
> sense, but they didn't see any problems in their logs, and they have
> no idea about the mysterious hang. I repeated the hanging job and it
> ran fine.
>
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder