[Users] Stampede

Yosef Zlochower yosef at astro.rit.edu
Fri May 2 07:08:26 CDT 2014


Hi

I have been having problems running on Stampede for a long time. I
couldn't get the latest stable ET to run because it would die during
checkpointing. I had to backtrack to the Orsted version (unfortunately,
that has a bug in the way the grid is set up, causing some of the
intermediate levels to span both black holes, wasting a lot of memory).
Even with Orsted, stalling is a real issue. Currently, my "solution" is
to run for 4 hours at a time. This would have been OK on Lonestar or
Ranger, because when I chained a bunch of runs, the next in line would
start almost right away, but on Stampede the delay is quite substantial.
I believe Jim Healy opened a ticket concerning the RIT issues with
running ET on Stampede.
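
For illustration only, chaining short runs like this can be sketched as
below. This is a minimal sketch, not the setup actually used here: it
assumes the machine's scheduler is SLURM and relies on the standard
sbatch options --time, --dependency and --parsable. The helper names
(sbatch_command, run_sbatch, submit_chain) and the job script name are
hypothetical, and the job script itself is assumed to restart from the
most recent checkpoint.

```python
import subprocess
from typing import Callable, List, Optional


def sbatch_command(script: str, walltime: str,
                   after_job: Optional[str]) -> List[str]:
    """Build one sbatch command line (hypothetical helper).

    If after_job is given, the new job is held until that job has ended.
    """
    cmd = ["sbatch", "--parsable", f"--time={walltime}"]
    if after_job is not None:
        # afterany: eligible to start once the earlier job has terminated,
        # regardless of its exit state (so a stalled-and-killed segment
        # does not block the rest of the chain).
        cmd.append(f"--dependency=afterany:{after_job}")
    cmd.append(script)
    return cmd


def run_sbatch(cmd: List[str]) -> str:
    """Submit the job and return its ID.

    With --parsable, sbatch prints "jobid" or "jobid;cluster".
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]


def submit_chain(script: str, segments: int, walltime: str = "04:00:00",
                 submit: Callable[[List[str]], str] = run_sbatch) -> List[str]:
    """Queue `segments` jobs, each depending on the one before it."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for _ in range(segments):
        previous = submit(sbatch_command(script, walltime, previous))
        job_ids.append(previous)
    return job_ids
```

Calling submit_chain("run_segment.sh", 6) would queue six 4-hour
segments up front. Whether that shortens the wait between segments
depends on the site's scheduling policy, so it may not help with the
delay described above.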


On 05/02/2014 05:55 AM, Ian Hinder wrote:
> Hi all,
>
> Has anyone run into problems recently with Cactus jobs on Stampede?
> I've had jobs die when checkpointing, and also mysteriously hanging 
> for no apparent reason.  These might be separate problems.  The 
> checkpointing issue occurred when I submitted several jobs and they 
> all started checkpointing at the same time after 3 hours.  The hang 
> happened after a few hours of evolution, with GDB reporting
>
>> MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>>   at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
>> 296    for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
>
> Unfortunately I didn't ask for a backtrace. I'm using mvapich2.  I've 
> been in touch with support and they said the dying while checkpointing 
> coincided with the filesystems being hit hard by my jobs, which makes 
> sense, but they didn't see any problems in their logs, and they have 
> no idea about the mysterious hang.  I repeated the hanging job and it 
> ran fine.
>
> -- 
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
