[ET Trac] [Einstein Toolkit] #349: pyc files when syncing
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Sun Jun 19 12:36:43 CDT 2011
#349: pyc files when syncing
----------------------------+-----------------------------------------------
Reporter: barry.wardell | Owner: mthomas
Type: defect | Status: new
Priority: major | Milestone:
Component: SimFactory | Version:
Resolution: | Keywords:
----------------------------+-----------------------------------------------
Comment (by eschnett):
Replying to barry.wardell:
> Replying to eschnett:
> > You are introducing a new "paths" syntax to the sync command. I
thought previously that one could just list multiple machines, and
simfactory would sync to all of them -- apparently that isn't the case,
that was lost in translation.
> Yes, sorry for the confusion. This patch introduces three changes
(switching to filter rules system, changing the behavior of sim sync with
multiple arguments and removing the --sync-parfiles and --sync-sourcetree
options) which should ideally be separated into separate issues for
consideration. The reason I didn't do so was that the three changes
naturally came at the same time in terms of the changes to the code. The
last two can be restored to their original behavior if desired. However, I
actually prefer the new behavior because:
> I am much more likely to want to sync specific paths than to sync to
multiple machines at once.
I see. Myself, I'm much more likely to sync to different machines. For
example, while debugging at scale, I may build and submit on three or four
machines at once, to increase my chances of a job starting quickly.
On the other hand, I don't sync only part of my source tree. Not having to
do so is exactly the advantage that Simfactory is supposed to provide,
because it can lead to strange errors when one forgets to sync a file. I
usually find that the first sync is slow, but subsequent ones are much
faster. If your experience is different, then we should introduce another
high-level automatic mechanism instead of asking people to do the low-
level file management themselves again. For example, Simfactory could
remember the time of the last sync to another machine, and then look
locally for files that changed since. This would avoid accessing Kraken's
slow file system, and shouldn't be more than a line or two with find.
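A minimal sketch of the "files changed since the last sync" idea, written in Python rather than with find. The function name, the stored timestamp, and the decision to skip .svn and .git directories are assumptions for illustration, not existing Simfactory code:

```python
import os


def files_changed_since(root, last_sync_time):
    """Return paths (relative to root) modified after last_sync_time.

    last_sync_time is a Unix timestamp that Simfactory would have to
    record per remote machine; where it is stored is left open here.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: version-control metadata is never synced.
        dirnames[:] = [d for d in dirnames if d not in (".svn", ".git")]
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat avoids errors on dangling symlinks.
            if os.lstat(full).st_mtime > last_sync_time:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)
```

Only the local tree is scanned, so the remote file system is not touched until the resulting list is handed to rsync.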
> The paths system provides more flexibility and control than the
--sync-parfiles and --sync-sourcetree options did and makes them somewhat
unnecessary. This flexibility is particularly useful on machines with
slower filesystems where only syncing a specific path can save a lot of
time.
> What are other people's opinions on this?
The main goals of Simfactory are not flexibility or control, but to
provide safe and convenient default choices that work almost all the time.
In many cases, people want flexibility and control only because something
else is not working right -- in this case, sync is apparently too slow for
you. I would thus suggest (1) a quick work-around for you that is,
hopefully, temporary, and (2) trying to come up with a good solution that
lets you "just sync" files without worrying about its performance. But I
would not want to design much additional flexibility into Simfactory,
because this makes it more difficult to learn and more dangerous to use.
> > However, it seems that filter.cactus.rules contains only a list of
top-level paths, and isn't supposed to contain any actual rules -- if it
did contain rules, then the result would be confusing, because "sim sync"
and "sim sync paths" would copy and/or delete different sets of files.
Also, people may want to change this default list of paths, so there
should probably also be a filter.cactus.local.rules... Should this list of
paths instead be stored in an ini file, where there is already a mechanism
to configure settings, and where simfactory could check that these are
actually only path names and not accidentally patterns?
> The idea is that if specific paths are not given, then the file
filter.cactus.rules is read in and gives a default list of paths to be
included. I agree that this should not contain any actual rules for the
reason you give and have added a comment to this effect to the top of
filter.cactus.rules. Any filter rules to be applied to all paths should be
put in filter.rules.
> If the user wants to modify this, they can add a .rsync.rules file in
their Cactus base directory which is read in first and so will override
anything in filter.cactus.rules. I don't really like storing these in ini
files because that would be moving away from using rsync's filter rules
system (unless simfactory parsed the ini file and generated the
appropriate .rsync.rules file).
Can we just put the filter rules into ini files and have Simfactory write
out these configuration files? In this way, all configuration files are in
a single place and are easy to find. This would make it easier for new
users to configure their Simfactory, or to understand/copy a Simfactory
setup from someone else.
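A rough sketch of what "Simfactory writes out the rules" could look like. The [sync] section, its option names, and the example paths are hypothetical. Only the rsync filter syntax is real ("+ " and "- " prefixes, first match wins, "/***" matching a directory and everything below it). The sketch also rejects entries that look like patterns, as suggested earlier in this thread:

```python
import configparser

# Hypothetical ini fragment; section and option names are assumptions.
EXAMPLE_INI = """
[sync]
paths = arrangements, par, simfactory
exclude = *.pyc, .svn
"""


def split_list(value):
    """Split a comma-separated ini value into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def rules_from_ini(ini_text):
    """Generate the text of an rsync filter file from the ini settings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["sync"]
    lines = []
    # Excludes come first because rsync applies the first matching rule.
    for pattern in split_list(section.get("exclude", fallback="")):
        lines.append("- " + pattern)
    for path in split_list(section.get("paths", fallback="")):
        # Paths must be plain top-level names, not patterns.
        if any(ch in path for ch in "*?[") or "/" in path.strip("/"):
            raise ValueError("not a plain top-level path: " + path)
        lines.append("+ /" + path.strip("/") + "/***")
    # Everything else at the top level is left out.
    lines.append("- /*")
    return "\n".join(lines) + "\n"
```

The generated text could then be written to a temporary file and passed to rsync, so that users only ever edit the ini file.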
> > How do you expect people to use the "paths" mechanism? Can one give
just top-level paths, or also directly paths deep into the hierarchy?
Would you expect to do this regularly? If so, why? I find this somewhat
dangerous, because people may miss transferring an updated file. Instead
of telling simfactory what to do, the user currently tells simfactory
his/her intent, e.g. "copy source files" or "copy parameter files", which
are prerequisites to either building or submitting. Simfactory then deals
with the details, ensuring things are done in a safe way. Would you find
it inconvenient if you had to use an option to specify a pathname, e.g.
"sim sync damiana -p par"?
> The idea is that there are three modes of operation:
> Without any paths specified we sync all paths given in
filter.cactus.rules (and also include any modifications in the
file .rsync.rules). This is essentially the same as what happened before.
Sure.
> With a list of paths given, only those paths are synchronized. Both
filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any
.rsync.rules files in the specified paths are read). For consistency, only
toplevel paths are accepted in this mode.
> With a single path given, only this path is synchronized. Both
filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any
.rsync.rules files in the specified paths are read). In this case, non-
toplevel paths are allowed and handled appropriately.
Do these two last rules mean that .svn files in these paths are then
copied to the remote system? That would be strange.
How important are per-directory .rsync.rules for you? Are these just a
coincidence of your implementation using a relative path name in
etc/filter.rules? Or do you use this for some purpose? Isn't it strange
that only those rules in the exact directories that the user specifies are
used, while any rules in subdirectories are ignored?
I also wonder why you don't allow synchronising two non-toplevel
directories at the same time. Assume you modified a source file and a
parameter file, and want to copy over both. Currently, you would need to
use two separate sync commands. If this is just due to an additional test,
we can just leave it out. But I assume that it has more to do with rsync's
path specifications, and you would need to call rsync multiple times. In
this case, I would rather report a "not yet implemented" error message.
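To make the three modes concrete, here is a minimal sketch of the dispatch, including the "not yet implemented" error for several non-toplevel paths. The function name and the mode labels are made up for illustration and do not come from the patch:

```python
def select_sync_mode(paths):
    """Classify the requested paths into one of the three modes above."""
    if not paths:
        # No paths given: fall back to the default list of paths.
        return "default"
    if len(paths) == 1:
        # A single path may point anywhere in the hierarchy.
        return "single-path"
    # Several paths: only top-level names are accepted for now.
    nested = [p for p in paths if "/" in p.strip("/")]
    if nested:
        raise NotImplementedError(
            "syncing several non-toplevel paths at once is not yet "
            "implemented: " + ", ".join(nested)
        )
    return "toplevel-paths"
```

Reporting the unsupported combination explicitly tells the user that the limitation is an implementation detail rather than a usage error.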
> The main case where I would expect to use this regularly is when syncing
to a machine with a slow filesystem (e.g. Kraken) where simply checking which
files need to be synced can sometimes take a long time. In fact, before
now I often used rsync manually instead of 'sim sync' when I was syncing
small changes often (e.g. when debugging a problem, setting up a new
parameter file, etc.). I quite like how things work with this patch
applied. We could add the --sync-sources and --sync-parfiles convenience
options back, although I'm not sure if I would personally use them.
> What is the advantage of using an option to specify a pathname?
The advantage is that one can list multiple machines.
I would approve the patch if you use an option to specify paths, so that
we can continue to sync to multiple machines. I would strongly favour
keeping the list of top-level directories in Simfactory ini files, because
I have additional such paths, and being able to configure these is
important to me, and since this is currently already working in Simfactory
we don't need to change this. I would also prefer to have the rsync rules
in a Simfactory ini file for the same reason.
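A minimal sketch of the command-line shape proposed here, using Python's argparse: positional arguments remain machine names, and paths are given with a repeatable -p option. The long option name --path and the program name are assumptions, and this is not Simfactory's actual option parser:

```python
import argparse


def build_parser():
    """Build a parser where positionals are machines and -p adds a path."""
    parser = argparse.ArgumentParser(prog="sim-sync-sketch")
    parser.add_argument(
        "machines", nargs="+",
        help="one or more remote machines to sync to")
    parser.add_argument(
        "-p", "--path", dest="paths", action="append", default=[],
        help="restrict the sync to this path (may be repeated)")
    return parser


# Example: sync only the 'par' directory to two machines at once.
args = build_parser().parse_args(["damiana", "kraken", "-p", "par"])
```

With this layout, "sim sync damiana kraken" keeps its meaning of syncing everything to both machines, and an empty list of paths selects the default behaviour.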
However, it may also be good to have a bit more discussion about these
approaches. Do people think that "sim sync" is too slow? Do you prefer
.rsync.rules files over ini files, and if so, why?
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/349#comment:14>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit