Status: Draft (as of 2008-05-14)
Implement Google version control practices and work flow with Subversion.
A g4-compatible command line and a p4lib-compatible API are explicit non-goals. Instead, gvn provides Subversive solutions to the problems g4 and p4lib solve.
Stuff one needs to know to understand this doc: motivating examples, previous versions and problems, links to related projects/design docs, etc. You should mention related work outside of Google if applicable. Note: this is background; do not write about your design or ideas to solve problems here.
Start out with a paragraph with one or two sentences speaking broadly about the whys and wherefores of source control, and then how we turned around and gave all that up with g4 pending changelists. Then Mondrian came in to version those changes on the side. Now let's bring it back under one source control system.
Link to some svn docs, SCM patterns site, ...
Terminology
"source branch" is used in comments and docstrings all over the place, but it's confusing: to many, "trunk" is not conceptually a branch, and it sometimes isn't technically. "code line" is a better term, but what about "source"? It's the "source" from which the changebranch is copied, but it's also the "target" to which the change is eventually submitted.
Include very brief overview of svn, including copy/modify/merge model, branches are copies, atomic commits, arbitrary metadata, and so on.
One page high-level overview; put details in the next section and background in the previous section. Should be understandable by a new Google engineer not working on the project.
If you're feeling masochistic, see Appendix A: Where We've Been.
TODO(epg): document properties:
A changebranch is an entry in a project's changes directory in the
repository, associated metadata, and a branch. The default location
for changebranches is //changes, so user basil's change "foo" is
located at //changes/basil/foo, which contains
//changes/basil/foo/bar, the actual branch. "bar" is the basename of
the directory being copied.
Users refer to changebranches with the canonical change name rather
than the path in the repository. The canonical change name is of the
form "user/change-name@revnum". The canonical name of basil's "foo"
change is "basil/foo@17". Short forms "foo", "foo@17", and
"basil/foo" are all valid as well. gvn assumes the user's own changes
when the "user/" portion is missing, and assumes the HEAD revision in
the absence of "revnum".
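The naming rules above can be sketched as a small parser. This is only an illustration under assumptions: parse_change_name, ChangeName, and the regular expression are hypothetical and not part of gvn, and None is used here to stand for the HEAD revision.

```python
import re
from typing import NamedTuple, Optional


class ChangeName(NamedTuple):
    """Parsed change name (hypothetical type, not part of gvn)."""
    user: str
    name: str
    revision: Optional[int]  # None stands for HEAD in this sketch


# [user/]change-name[@revnum]
_CHANGE_NAME_RE = re.compile(
    r'^(?:(?P<user>[^/@]+)/)?(?P<name>[^/@]+)(?:@(?P<rev>\d+))?$')


def parse_change_name(text: str, default_user: str) -> ChangeName:
    """Expand a short or canonical change name into its three parts."""
    match = _CHANGE_NAME_RE.match(text)
    if match is None:
        raise ValueError('not a change name: %r' % (text,))
    rev = match.group('rev')
    return ChangeName(
        # A missing "user/" portion means the user's own change.
        match.group('user') or default_user,
        match.group('name'),
        # A missing "@revnum" means HEAD.
        int(rev) if rev is not None else None)
```

With basil as the default user, "foo", "foo@17", "basil/foo", and "basil/foo@17" all resolve to basil's "foo" change; only the forms carrying "@17" pin a revision.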
Creation of a changebranch is the same operation as later snapshots (updates to the changebranch); both are simply calls to gvn.changebranch.ChangeBranch.Branch. This creates any parent directories of the changebranch on the fly, meaning changebranch creation is transparent. gvn change and gvn snapshot open the working copy to the deepest common path of all paths to be changebranched. Branch starts a transaction, deletes the branch (if any), copies the source branch from the base revision of the deepest common path, and then applies the user's changes to the branch. Because working copies almost always have mixed revisions, the branch may have been copied from a different revision than the one a given file is based on. Branch deletes such files from the branch and copies them again from the correct revision. This behavior is identical to that of svn cp . URL.
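The copy steps Branch performs can be sketched as a pure planning function. This is a minimal illustration, not gvn's actual code: plan_branch, its arguments, and the tuple layout are hypothetical, and applying the user's local modifications on top of the copies is left out.

```python
def plan_branch(source, branch, base_rev, file_revs, branch_exists=False):
    """Plan the copies for one snapshot (hypothetical helper, not gvn API).

    source        -- repository path being branched, e.g. '/trunk'
    branch        -- repository path of the branch inside the changebranch
    base_rev      -- base revision of the deepest common path
    file_revs     -- {relative path: base revision} for each changed file
    branch_exists -- True when an earlier snapshot already created the branch

    Returns a list of (action, path, copyfrom_path, copyfrom_rev) tuples.
    """
    ops = []
    if branch_exists:
        # Re-branching starts from scratch: drop the previous branch.
        ops.append(('delete', branch, None, None))
    # Copy the whole source at the base revision of the common path.
    ops.append(('copy', branch, source, base_rev))
    for path, rev in sorted(file_revs.items()):
        if rev != base_rev:
            # Mixed-revision working copy: this file is based on another
            # revision, so replace it with a copy from that revision.
            ops.append(('replace', branch + '/' + path,
                        source + '/' + path, rev))
    return ops


if __name__ == '__main__':
    for op in plan_branch('/trunk', '/changes/basil/foo/trunk', 15,
                          {'Makefile': 16, 'testing/testcommon.py': 15}):
        print(op)
```

For a working copy at revision 15 with one file at revision 16, the plan is one copy of the source at 15 followed by one replace of that file from 16.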
gvn change takes the same options as svn for setting the log message
of the snapshot, but when none are specified and it runs the user's
editor, the log message of the previous snapshot (if any) is loaded
into the form, ready to become the log message of this snapshot. The
user may edit this, or leave it as is. gvn submit copies the log
message of the last snapshot to become the log message of the final
submitted revision. See gvn submit (TODO href me) below for future
plans.
gvn today maintains the list of changebranched files in the .gvnstate
file, with none of the journalling svn itself uses for wc operations.
Duplicating that would be pointless; instead we should add to svn a
general mechanism for applications to store additional metadata in the
Subversion metadata area (.svn directory).
Another disadvantage is that the smallest change to a changebranch involves sending all the diffs to the repository again. If the user wants to change only the change description, or change one file out of a hundred, the diffs for all hundred files must be transmitted.
gvn uses the list of changed paths on the latest (or specified) snapshot to determine which files are considered part of the change, and which action is taken for each. The problem with this is that actions implementing the changebranch are conflated with the actions the user took. Since the branch itself is always an Add or Replace, gvn has no way to know when the user is changing a property (e.g. svn:ignore or svn:mergeinfo) on the deepest common path itself, nor does it have a way to distinguish Replace files from Modified files.
We knew from the beginning that using a branch of any kind to manage pending changes only works for reasonably sized changes. Some changes are so large as to be unreviewable. For example, when importing or updating a third-party tree, reviewers can only check the log message and the list of files; they're not going to read over the entire tree. The same applies to large merges. In addition to the human problem, sending these large changes twice (once to the changebranch and again for the final submit) is an unreasonable burden. So, we need a way to create an artifact with certain metadata about the change without sending the change itself until final submit.
Additionally, we have the problems described in "Where We Are". We
can solve all these problems with one change to the model: when
snapshotting a changebranch, store our own copy of the path change
metadata, and only re-branch files as needed. We store this as a file
called state in the changebranch container (e.g.
//changes/epg/foo/state). This state file is a JSON serialization of
the new gvn.changebranch.State class.
For the class of changes we heuristically decide are "big" (or
maybe the user has to use an option), we just don't create the
branch part of the changebranch; we create //changes/epg/foo/state
but not //changes/epg/foo/branch. gvn review can still show the
changed files and how they were changed (and even where things are
being copied from, e.g. svn mv gigantic src/gigantic).
Now, for changes where we do branch to //changes/epg/foo/branch, we
do a full snapshot only at changebranch creation. On subsequent
snapshots, unless --force is specified, we don't snapshot files
unless NeedsSnapshot.
E.g.
    # wc is @15, Makefile is @16
    wc% gvn opened
    M Makefile
    M testing/testcommon.py
    wc% gvn change -c foo --non-interactive
    Sending Makefile
    Sending testing/testcommon.py
    Changed epg/foo@17.

    => same as today, copied //trunk@15 and applied diffs
    A /changes/epg/foo
    A /changes/epg/foo/state
    A /changes/epg/foo/branch (from:/trunk@15)
    R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)
    M /changes/epg/foo/branch/testing/testcommon.py

    but review looks at the path's action in the State object and
    prints an M not an R.

    wc% gvn opened
    --- foo
    M Makefile
    M testing/testcommon.py
    wc% gvn change -c foo --non-interactive -m 'new log message'
    Changed epg/foo@18.
The state file looks like:
    {
      "base": "trunk",
      "base_rev": 15,
      "paths": {
        "Makefile": { "action": "M", "base_rev": 16 },
        "testing/testcommon.py": { "action": "M" }
      }
    }
gvn notices that nothing NeedsSnapshot, and just tries to change the
svn:log property of the last snapshot. If that fails, it commits a
change just to state, updating all the snap entries to the last snap
revision. Hmm, or maybe it always does that, rather than trying to
change the log.
The State objects haven't changed, so we didn't have to re-send them
and gvn review has no trouble showing the correct information and
diffs. We didn't have to send any diffs, we're just making a new
revision with a new svn:log.
    wc% gvn opened
    --- foo
    *M Makefile
    M testing/testcommon.py
    wc% gvn change -c foo --non-interactive -m 'new log message'
    Sending Makefile
    Changed epg/foo@19.

    => notices only Makefile needs snapshot, so does this:
    M /changes/epg/foo/state
    R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)
gvn only had to send a new delta for Makefile and set the snap
revisions for everything else to 18:
    {
      "base": "trunk",
      "base_rev": 15,
      "paths": {
        "Makefile": { "action": "M", "base_rev": 16 },
        "testing/testcommon.py": { "action": "M", "snap_rev": 18 }
      }
    }
If some path had been copied:
    {
      "base": "trunk",
      "base_rev": 15,
      "paths": {
        "Makefile": { "action": "M", "base_rev": 16 },
        "testing/testcommon.py": { "action": "M", "snap_rev": 18 },
        "copy-example": { "action": "M", "copyfrom_path": "/trunk/foo", "copyfrom_rev": 14 }
      }
    }
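To show how a tool such as gvn review could list a change from the state file alone, without looking at the branch's own Add/Replace actions, here is a rough sketch. It reads the JSON layout shown above; review_lines and its output format are hypothetical and not gvn's real code.

```python
import json

# The state file from the copy example above.
STATE_JSON = '''
{"base": "trunk", "base_rev": 15, "paths": {
  "Makefile": {"action": "M", "base_rev": 16},
  "testing/testcommon.py": {"action": "M", "snap_rev": 18},
  "copy-example": {"action": "M", "copyfrom_path": "/trunk/foo",
                   "copyfrom_rev": 14}}}
'''


def review_lines(state_text):
    """Return one 'ACTION path' line per path recorded in the state."""
    state = json.loads(state_text)
    lines = []
    for path, info in sorted(state['paths'].items()):
        # The action comes from the state, so a file the branch shows as
        # Replaced is still listed with the action the user took.
        line = '%s %s' % (info['action'], path)
        if 'copyfrom_path' in info:
            line += ' (from:%s@%d)' % (info['copyfrom_path'],
                                       info['copyfrom_rev'])
        lines.append(line)
    return lines


if __name__ == '__main__':
    print('\n'.join(review_lines(STATE_JSON)))
```

Because every per-path fact lives in the state file, the same listing works for "big" changes that have a state file but no branch.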
Resolve short "URLs" (e.g. //tools) to real URLs based on project
settings and run svn checkout to create a working copy.
g4 analog: client + sync
Display the differences for locally modified paths.
g4 analog: diff
Add paths to or remove paths from a changebranch, or delete a changebranch entirely.
g4 analog: change
List local changebranches.
g4 analog: none?
List locally modified files, grouped by changebranch.
g4 analog: opened
Mail a changebranch review request.
g4 analog: mail
Show the change description and diffs for a change.
g4 analog: describe
Update a changebranch in the repository.
g4 analog: none
Mark a changebranch approved, reviewed.
g4 analog: approve
Submit the changes from a changebranch and remove the changebranch.
Since changebranches are like the private branches provided by
distributed systems like svk and Mercurial, it is likely that
users will want to snapshot frequently, with snapshot log messages
describing the change being snapshotted, rather than growing a
single change description with each snapshot. To facilitate
this, submit will have an option to run the user's editor on a
form with gvn log of all snapshots loaded, for easy massaging
into the final change description.
If the user has set the per-project run-presubmit option, this will run gvn presubmit . This is the only option which must be in the user project file; users themselves must make the decision to allow other committers to run arbitrary code on their systems. If an organization wants to make this decision for users, it can hack this check out internally.
g4 analog: submit
Run svn update to update the working copy.
g4 analog: sync
g4 analog: diff2
g4 analog: describe -s
Show the history of changes of a path, or a changebranch.
g4 analog: changes
Run svn import.
svn-vendor:
import [local-path] //third-party/subversion [tags]
Import local-path (defaults to .) to //third-party/subversion/import (creating directories as needed). local-path is distinguished from the repository path by its leading slashes: local-path can start with at most one /, while the repository path that follows it must start with two. If tags are listed, copy the new import tree to those tags. A tag starting with any number of / other than 0 or 2 is an error: 0 means it is treated as a path relative to //third-party/subversion, and 2 means it is treated as an absolute path. The tag path will be Replaced if it exists.
    cd svn-1.4.3 && gvn import //third-party/subversion 1.4.3
    gvn import svn-1.4.3 //third-party/subversion # no tags
    gvn import svn-1.4.3 //third-party/subversion collab/1.4.3 //foo/1.4.3
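The slash-counting rules for import arguments might be expressed as follows. This is a sketch under assumptions: leading_slashes, is_repository_path, and resolve_tag are hypothetical helpers, not existing gvn functions.

```python
def leading_slashes(path):
    """Count the '/' characters at the start of path."""
    return len(path) - len(path.lstrip('/'))


def is_repository_path(path):
    """Repository paths start with exactly two slashes (e.g. //foo)."""
    return leading_slashes(path) == 2


def resolve_tag(base, tag):
    """Turn a tag argument into an absolute // path.

    base is the import target, e.g. '//third-party/subversion'.
    """
    count = leading_slashes(tag)
    if count == 0:
        # Relative tag: lives under the import target.
        return base.rstrip('/') + '/' + tag
    if count == 2:
        # Absolute tag: used as given.
        return tag
    raise ValueError('tag must start with zero or two slashes: %r' % (tag,))


if __name__ == '__main__':
    base = '//third-party/subversion'
    for tag in ('1.4.3', 'collab/1.4.3', '//foo/1.4.3'):
        print(resolve_tag(base, tag))
```

Applied to the last example command, the two tags resolve to //third-party/subversion/collab/1.4.3 and //foo/1.4.3.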
    svn.ra.do_status to get list of files in target tree
    os.walk to get list from source tree
    sort lists
    for i in difflib.Differ().compare(a, b):
        if i[0] == '-':
            remove i[2:]
        elif i[0] == '+':
            add i[2:]
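The comparison step of that outline can be tried on plain lists of paths. The sketch below covers only the difflib part; gathering the two lists (svn.ra.do_status for the target tree, os.walk for the source tree) is omitted, and plan_import is a hypothetical name.

```python
import difflib


def plan_import(target_paths, source_paths):
    """Compare two path lists and return (removes, adds).

    target_paths -- paths currently in the repository import tree
    source_paths -- paths in the local tree being imported
    """
    removes, adds = [], []
    differ = difflib.Differ()
    for line in differ.compare(sorted(target_paths), sorted(source_paths)):
        if line.startswith('- '):
            # Only in the target tree: delete it from the repository.
            removes.append(line[2:])
        elif line.startswith('+ '):
            # Only in the source tree: add it to the repository.
            adds.append(line[2:])
        # Lines starting with '  ' (present in both) and '? ' (Differ's
        # intraline hints) need no add or remove.
    return removes, adds


if __name__ == '__main__':
    print(plan_import(['README', 'old.c', 'src/main.c'],
                      ['README', 'new.c', 'src/main.c']))
```

Paths present in both trees are neither added nor removed here; sending their content changes would be a separate step.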
g4 todo is basically:
    changes_options['status'] = 'pending'
    changes_options['long_output'] = True
    pending_changes = p4.changes([], **changes_options)
    for change in pending_changes:
        changelist = g4utils.GoogleChangelistDescription(change.Description())
        if user in changelist.Reviewers():
            yield change
The equivalent for gvn would be:
    for user_path in //changes:
        for change in //changes/user-path:
            if user in change.reviewers:
                yield change
p4 changes -s pending is probably very fast, whereas we're talking
about multiple round trips for the gvn equivalent. So, this is
probably too slow, and we'll need to index the changebranches.
Mondrian should already have this index. So, gvn todo could just use
this. If Mondrian has no API, we can add one while we're adding
Subversion support.
Actually, we might be able to do this quickly with svn_ra_do_status or svn_ra_do_update, if the new svn_depth_t stuff will allow us to ask for //changes/*/* .
These commands are pure pass-through to svn, with // paths translated to full URLs.
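As a rough picture of that translation, the sketch below expands a // path against a project URL. It assumes the project setting is a single base URL and that the // path is appended to it; the real mapping in gvn may differ, and resolve_short_url and the example URL are hypothetical.

```python
def resolve_short_url(short_url, project_url):
    """Expand a // path into a full URL (hypothetical helper).

    Assumes project_url is the URL that '//' stands for.
    """
    if not short_url.startswith('//') or short_url.startswith('///'):
        raise ValueError('not a short URL: %r' % (short_url,))
    base = project_url.rstrip('/')
    suffix = short_url[2:]
    # '//' alone names the project URL itself.
    return base + '/' + suffix if suffix else base


if __name__ == '__main__':
    # The URL below is a placeholder, not a real repository.
    print(resolve_short_url('//tools', 'https://svn.example.com/repo'))
```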
gvn.errors.User-derived exceptions represent errors that could
originate from a user, though of course an application may reasonably
catch some of these and use it to know something, e.g. that a path in
the repository does not exist. These exception classes have a code
member, which may be used as an exit code.
gvn.errors.Internal-derived exceptions represent errors caused by
callers, perhaps intentionally, e.g. to indicate that a string is not
a short URL.
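The split between the two families might look roughly like this. Only the User and Internal names and the code member come from the description above; the common Error base, the NoSuchPath and NotShortURL subclasses, the specific code values, and the run wrapper are made up for illustration.

```python
import sys


class Error(Exception):
    """Hypothetical common base class for this sketch."""


class User(Error):
    """Errors that could originate from a user."""
    code = 1  # usable as a process exit code (value is illustrative)


class Internal(Error):
    """Errors caused by callers, perhaps intentionally."""


class NoSuchPath(User):
    """Hypothetical: a repository path does not exist."""
    code = 2


class NotShortURL(Internal):
    """Hypothetical: a string is not a short URL."""


def run(command):
    """Call command and return an exit code for the process."""
    try:
        command()
    except User as error:
        # User errors become a message plus the exception's exit code.
        print('gvn: %s' % (error,), file=sys.stderr)
        return error.code
    # Internal errors are not caught here; they propagate to the caller.
    return 0
```

A command-line entry point could then end with sys.exit(run(main)), while library code catches Internal-derived exceptions where it expects them.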
Holds user configuration bits such as commands for running an
editor or showing a diff. Also holds the apr_hash_t of svn_config_t
objects used by svn libraries. Has functions for finding and
returning ProjectConfig objects, which hold user configuration about
a project, e.g. the project URL.
Represents a connection to a repository. It holds the username
used to open the connection, the URL, and functions to turn paths
into URLs, get the head revision, get revisions (svn log), and get
information about a repository path. Dirent represents information
about a path. Revision represents a revision, with ChangedPath
objects for each path changed in that revision.
Holds the project metadata from the repository, i.e. location of
change branches, gvn mail template, and how many lines of unified
diff to include in review mails. Holds a Repository object for the
underlying repository.
Represents a working copy. Has path manipulation functions,
svn_wc_status and svn_wc_entry wrappers, wcprop/post-commit
wrappers, a map of working copy paths to changebranches for all
changebranched paths, and a Project object. All input and output
paths are relative to the top of the working copy, except for the
path manipulation functions, which translate between
absolute/wc-relative and repository/local paths.
This function drives a commit. Callers provide a callback which
returns a gvn.commit.EditorAction for a repository path. Drive calls
this object, passing it the commit editor baton. OpenOrMkdir opens a
directory if it exists or creates it if not; Copy copies a path;
Delete deletes a path. gvn.wc.Edit sends a local modification to the
repository and schedules this action with WorkingCopy's post-commit
queue.
Represents a changebranch. This includes all information about
the change (such as a list of gvn.repository.ChangedPath objects)
and methods for updating, deleting, or submitting the change.
Represents a gvn command (e.g. gvn submit), including options
(parsed from the command line using Option and OptionParser, a
subclass of the standard Python optparse.OptionParser).
Context discovers and holds information about the context in which
the command executes. It holds a Config object (from gvn.config.Get
and modified based on command-line options and environment
variables), a gvn.wc.WorkingCopy object (if available), and a
gvn.project.Project object thereby discovered.
It will be better one day.
presubmit allows a user with write access to a directory to execute arbitrary code on the systems of all users that checkout that directory and have the presubmit feature enabled.
N/A ?
N/A
N/A ?
??? Test coverage is not so good (39%), but improving all the time.
N/A
...
The code is well-commented and docstringed. Minimal user
documentation in the form of the gvn help command is available. We
also have a preliminary manual.
...
Rewrite to fix "awkward flow of concepts" --dchristian
Subversion has a feature called "changelists", inspired by
Perforce's pending changelists. They are not quite the same: a
changelist in Subversion is identified by an arbitrary string chosen
by the user and is local to a working copy; a pending changelist in
Perforce is identified by a globally unique integer in the same series
as submitted changelists ("revisions" in Subversion) and is stored in
the depot (the actual changed files are local to the client). To use
these, gvn would have to send around (changelist name, hostname,
working copy path) triples. The versioning and verification
requirements would go unmet.
The most obvious way to satisfy the versioning and verification requirements is to track the change in the Subversion repository rather than only in the working copy; i.e. use a branch. This addresses the last three requirements at once: any user can review the change by diffing branches, changes to the change are versioned because they are commits to the branch, and a tool can diff the branch to the submitted revision to verify the submitted revision is the same as a change that was reviewed. This leaves only the multiple changes requirement, which is a local issue.
One way to implement a changebranch is to branch from the working copy URL at its base revision and switch the working copy to that branch. This rules out the multiple changes requirement, because the entire working copy is switched to a single branch. Additionally, it is very expensive; it requires a slow crawl across the entire working copy, not just the changed files.
Instead, gvn can switch only the files contained in this change to
the branch. This satisfies all requirements, but introduces more
issues, among them:
gvn commits that to the changebranch, the file is gone from the
working copy. It has to be resurrected somehow in order to submit the
deletion to the source branch.
How do you manage updates? Other people will be changing the
changebranched files. If the user naively runs svn update, these
changes will be missed. Instead the user must always run a special
gvn command, which must update these files itself. This can't be
accomplished by switching all files back to the source branch,
updating, and switching back; at no point are the "update" changes
*local*; assuming no conflicts, these changes will come in at the
switch to source branch and then leave at the switch back to the
changebranch.
Instead, gvn must perform a merge from the base revision to the
update-target revision of the file on source branch to the local
file. This must be a separate merge for each file in the change, as
not all updated files are switched! This is, of course, ridiculously
expensive.
gvn must exit. What about all these merged files? Does it commit
what it can?
gvn must track the "update" state for each file.
libsvn_wc.
One obvious way around all this is to keep a separate, "shadow" working copy of the changebranch, so the "real" working copy wouldn't have any switched files at all. This opens whole new cans of worms.
So, we changed the model: Re-branch at every snapshot. By re-branching every time, all the updating problems vanish (along with a giant pile of code, of which Erik and I were increasingly distrustful (and we weren't even finished writing it!)). Since we don't have any special update requirements, we no longer have to switch. That saves us from all that fragility. Of course, this model brings with it some of its own problems...