[ET Trac] [Einstein Toolkit] #349: pyc files when syncing

Sun Jun 19 12:36:43 CDT 2011

#349: pyc files when syncing
----------------------------+-----------------------------------------------
  Reporter:  barry.wardell  |       Owner:  mthomas
      Type:  defect         |      Status:  new    
  Priority:  major          |   Milestone:         
 Component:  SimFactory     |     Version:         
Resolution:                 |    Keywords:         
----------------------------+-----------------------------------------------

Comment (by eschnett):

 Replying to barry.wardell:

 > Replying to eschnett:

 > > You are introducing a new "paths" syntax to the sync command. I
 thought previously that one could just list multiple machines, and
 simfactory would sync to all of them -- apparently that isn't the case,
 that was lost in translation.

 > Yes, sorry for the confusion. This patch introduces three changes
 (switching to filter rules system, changing the behavior of sim sync with
 multiple arguments and removing the --sync-parfiles and --sync-sourcetree
 options) which should ideally be separated into separate issues for
 consideration. The reason I didn't do so was that the three changes
 naturally came at the same time in terms of the changes to the code. The
 last two can be restored to their original behavior if desired. However, I
 actually prefer the new behavior because:

 > I am much more likely to want to sync specific paths than to sync to
 multiple machines at once.

 I see. Myself, I'm much more likely to sync to different machines. For
 example, while debugging at scale, I may build and submit on three or four
 machines at once, to increase my chances of a job starting quickly.

 On the other hand, I don't sync only part of my source tree. Not having to
 do so is exactly the advantage that Simfactory is supposed to provide,
 because it can lead to strange errors when one forgets to sync a file. I
 usually find that the first sync is slow, but subsequent ones are much
 faster. If your experience is different, then we should introduce another
 high-level automatic mechanism instead of asking people to do the low-
 level file management themselves again. For example, Simfactory could
 remember the time of the last sync to another machine, and then look
 locally for files that changed since. This would avoid accessing Kraken's
 slow file system, and shouldn't be more than a line or two with find.
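
 As a rough illustration of the "files changed since the last sync" idea,
 here is a minimal Python sketch. The function name files_changed_since and
 the way the last sync time is passed in are assumptions made for this
 example; nothing like this exists in Simfactory yet. It only walks the
 local tree, so the remote file system is never touched.

```python
import os


def files_changed_since(root, last_sync_time):
    """Return paths under root (relative to root) modified after last_sync_time.

    last_sync_time is a POSIX timestamp; how Simfactory would store it per
    machine is left open here (hypothetical).
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in (".svn", ".git")]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) > last_sync_time:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)
```

 The resulting list could be handed to rsync via its --files-from option.
 Note that this sketch does not detect files that were deleted locally, so
 an occasional full sync would still be needed.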

 > The paths system provides more flexibility and control than the
 --sync-parfiles and --sync-sourcetree options did and makes them somewhat
 unnecessary. This flexibility is particularly useful on machines with
 slower filesystems where only syncing a specific path can save a lot of
 time.

 > What are other people's opinions on this?

 The main goals of Simfactory are not flexibility or control, but to
 provide safe and convenient default choices that work almost all the time.
 In many cases, people want flexibility and control only because something
 else is not working right -- in this case, sync is apparently too slow for
 you. I would thus suggest (1) a quick work-around for you that is,
 hopefully, temporary, and (2) trying to come up with a good solution that
 lets you "just sync" files without worrying about its performance. But I
 would not want to design much additional flexibility into Simfactory,
 because this makes it more difficult to learn and more dangerous to use.

 > > However, it seems that filter.cactus.rules contains only a list of
 top-level paths, and isn't supposed to contain any actual rules -- if it
 did contain rules, then the result would be confusing, because "sim sync"
 and "sim sync paths" would copy and/or delete different sets of files.
 Also, people may want to change this default list of paths, so there
 should probably also be a filter.cactus.local.rules... Should this list of
 paths instead be stored in an ini file, where there is already a mechanism
 to configure settings, and where simfactory could check that these are
 actually only path names and not accidentally patterns?

 > The idea is that if specific paths are not given, then the file
 filter.cactus.rules is read in and gives a default list of paths to be
 included. I agree that this should not contain any actual rules for the
 reason you give and have added a comment to this effect to the top of
 filter.cactus.rules. Any filter rules to be applied to all paths should be
 put in filter.rules.

 > If the user wants to modify this, they can add a .rsync.rules file in
 their Cactus base directory which is read in first and so will override
 anything in filter.cactus.rules. I don't really like storing these in ini
 files because that would be moving away from using rsync's filter rules
 system (unless simfactory parsed the ini file and generated the
 appropriate .rsync.rules file).

 Can we just put the filter rules into ini files and have Simfactory write
 out these configuration files? In this way, all configuration files are in
 a single place and are easy to find. This would make it easier for new
 users to configure their Simfactory, or to understand/copy a Simfactory
 setup from someone else.
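
 To make this concrete, here is a minimal sketch of how Simfactory might
 generate rsync filter rules from an ini file. The section name [sync], the
 keys paths and exclude, and the function rules_from_ini are all
 hypothetical; they are not existing Simfactory settings. The sketch also
 checks that the configured paths are plain top-level names and not
 accidentally patterns.

```python
import configparser

# Characters that would turn a path name into an rsync pattern.
GLOB_CHARS = set("*?[]")


def rules_from_ini(ini_text):
    """Build rsync filter rules from a hypothetical [sync] ini section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    paths = parser.get("sync", "paths").split()
    excludes = parser.get("sync", "exclude", fallback="").split()

    rules = []
    # Global excludes come first, because rsync applies the first matching rule.
    for pattern in excludes:
        rules.append("- " + pattern)
    for path in paths:
        # Reject patterns and nested paths; only plain top-level names pass.
        if GLOB_CHARS & set(path) or "/" in path.strip("/"):
            raise ValueError("not a plain top-level path: %r" % path)
        rules.append("+ /%s/***" % path.strip("/"))
    # Exclude every other top-level entry.
    rules.append("- /*")
    return "\n".join(rules) + "\n"
```

 With paths = arrangements par and exclude = *.pyc .svn/ in the ini file,
 this would write two exclude rules, two include rules, and a final
 catch-all exclude for the remaining top-level entries.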

 > > How do you expect people to use the "paths" mechanism? Can one give
 just top-level paths, or also directly paths deep into the hierarchy?
 Would you expect to do this regularly? If so, why? I find this somewhat
 dangerous, because people may miss transferring an updated file. Instead
 of telling simfactory what to do, the user currently tells simfactory
 his/her intent, e.g. "copy source files" or "copy parameter files", which
 are prerequisites to either building or submitting. Simfactory then deals
 with the details, ensuring things are done in a safe way. Would you find
 it inconvenient if you had to use an option to specify a pathname, e.g.
 "sim sync damiana -p par"?

 > The idea is that there are three modes of operation:

 > Without any paths specified we sync all paths given in
 filter.cactus.rules (and also include any modifications in the
 file .rsync.rules). This is essentially the same as what happened before.

 Sure.

 > With a list of paths given, only those paths are synchronized. Both
 filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any
 .rsync.rules files in the specified paths are read). For consistency, only
 toplevel paths are accepted in this mode.

 > With a single path given, only this path is synchronized. Both
 filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any
 .rsync.rules files in the specified paths are read). In this case, non-
 toplevel paths are allowed and handled appropriately.

 Do these last two rules mean that .svn files in these paths are then
 copied to the remote system? That would be strange.

 How important are per-directory .rsync.rules for you? Are these just a
 coincidence of your implementation using a relative path name in
 etc/filter.rules? Or do you use this for some purpose? Isn't it strange
 that only those rules in the exact directories that the user specifies are
 used, while any rules in subdirectories are ignored?

 I also wonder why you don't allow synchronising two non-toplevel
 directories at the same time. Assume you modified a source file and a
 parameter file, and want to copy over both. Currently, you would need to
 use two separate sync commands. If this is just due to an additional test,
 we can just leave it out. But I assume that it has more to do with rsync's
 path specifications, and you would need to call rsync multiple times. In
 this case, I would rather report a "not yet implemented" error message.
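
 As I understand the three modes quoted above, combined with the "not yet
 implemented" suggestion, the decision logic would look roughly like the
 following sketch. The function plan_sync and the mode names are made up
 for illustration and are not taken from the patch.

```python
def plan_sync(paths):
    """Classify a 'sim sync' request into one of the three quoted modes."""

    def is_toplevel(path):
        # A top-level path has no separator apart from leading/trailing ones.
        return "/" not in path.strip("/")

    if not paths:
        # Mode 1: no paths given, fall back to the default list.
        return ("default", [])
    if len(paths) == 1:
        # Mode 3: a single path may point anywhere in the tree.
        return ("single", list(paths))
    if all(is_toplevel(p) for p in paths):
        # Mode 2: several paths, all of them top-level.
        return ("toplevel", list(paths))
    # Several paths with at least one nested path would presumably need
    # multiple rsync calls, so report it instead of guessing.
    raise NotImplementedError(
        "syncing several non-toplevel paths at once is not yet implemented")
```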

 > The main case where I would expect to use this regularly is when syncing
 to a machine with a slow filesystem (e.g. Kraken) where simply checking which
 files need to be synced can sometimes take a long time. In fact, before
 now I often used rsync manually instead of 'sim sync' when I was syncing
 small changes often (e.g. when debugging a problem, setting up a new
 parameter file, etc.). I quite like how things work with this patch
 applied. We could add the --sync-sources and --sync-parfiles convenience
 options back, although I'm not sure if I would personally use them.

 > What is the advantage of using an option to specify a pathname?

 The advantage is that one can list multiple machines.

 I would approve the patch if you use an option to specify paths, so that
 we can continue to sync to multiple machines. I would strongly favour
 keeping the list of top-level directories in Simfactory ini files, because
 I have additional such paths, and being able to configure these is
 important to me, and since this is currently already working in Simfactory
 we don't need to change this. I would also prefer to have the rsync rules
 in a Simfactory ini file for the same reason.
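
 A minimal sketch of the command-line shape I have in mind, written with
 Python's argparse. This is an illustration only and not Simfactory's actual
 option parser; the function parse_sync_args and the long option name
 --path are assumptions. Positional arguments remain machine names, and
 each -p adds one path.

```python
import argparse


def parse_sync_args(argv):
    """Parse a hypothetical 'sim sync' command line."""
    parser = argparse.ArgumentParser(prog="sim sync")
    # Positional arguments are machines, so several can be listed at once.
    parser.add_argument("machines", nargs="*", help="machines to sync to")
    # Each -p/--path adds one path; without it, the default list is used.
    parser.add_argument("-p", "--path", dest="paths", action="append",
                        default=[], help="restrict the sync to this path")
    return parser.parse_args(argv)
```

 With this, "sim sync damiana kraken -p par" would sync only par to both
 machines, and "sim sync damiana" would behave as it does today.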

 However, it may also be good to have a bit more discussion about these
 approaches. Do people think that "sim sync" is too slow? Do you prefer
 .rsync.rules files over ini files, and if so, why?

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/349#comment:14>
Einstein Toolkit <http://einsteintoolkit.org>