gvn Versions Nicely

Status: Draft (as of 2008-05-14)

Eric Gillespie <epg@google.com>

Modified:

Objective

Implement Google version control practices and work flow with Subversion.

code review
- Allow users to maintain multiple labelled changes (in the form of a list of modified files) in a single working copy.
- Allow other users to review a change before it is submitted to a code line.
- Version changes to the change over time so users can track the evolution of a change in response to reviews. This also makes pending changes easy to back up.
- Provide a mechanism to verify that a change submitted to a code line was reviewed.
presubmit
improved user experience over svn
- short-hand URLs
- easy tracking of changes with changebranches
Provide a Python API for people to use instead of p4lib. This will be a mix of modules in the gvn and svn packages; it is as yet unclear how much we will wrap as high level gvn interfaces.

A g4-compatible command line and a p4lib-compatible API are explicit non-goals. Instead, gvn provides Subversive solutions to the problems g4 and p4lib solve.

Background

Stuff one needs to know to understand this doc: motivating examples, previous versions and problems, links to related projects/design docs, etc. You should mention related work outside of Google if applicable. Note: this is background; do not write about your design or ideas to solve problems here.

Start out with a paragraph with one or two sentences speaking broadly about the whys and wherefores of source control, and then how we turned around and gave all that up with g4 pending changelists. Then Mondrian came in to version those changes on the side. Now let's bring it back under one source control system.

Link to some svn docs, SCM patterns site, ...

Terminology

"source branch" is used in comments and docstrings all over the place, but it's confusing: to many, "trunk" is not conceptually a branch, and it sometimes isn't technically. "code line" is a better term, but what about "source"? It's the "source" from which the changebranch is copied, but it's also the "target" to which the change is eventually submitted.

Include very brief overview of svn, including copy/modify/merge model, branches are copies, atomic commits, arbitrary metadata, and so on.

Overview

One page high-level overview; put details in the next section and background in the previous section. Should be understandable by a new Google engineer not working on the project.

Detailed Design

The changebranch

If you're feeling masochistic, see Appendix A: Where We've Been.

TODO(epg): document properties:

revprops

gvn:approve* (on snapshots)
gvn:submitted (on snapshots)
gvn:block-this-commit (on txn)
gvn:bypass-hooks (on submits)
gvn:change (on submits)
gvn:bug (on snapshots and submits)

node props

gvn:project (on project roots)
gvn:superusers (on repository root)

Where We Are

A changebranch is an entry in a project's changes directory in the repository, associated metadata, and a branch. The default location for changebranches is //changes, so user basil's change "foo" is located at //changes/basil/foo, which contains //changes/basil/foo/bar, the actual branch. "bar" is the basename of the directory being copied.

Users refer to changebranches with the canonical change name rather than the path in the repository. The canonical change name is of the form "user/change-name@revnum". The canonical name of basil's "foo" change is "basil/foo@17". Short forms "foo", "foo@17", and "basil/foo" are all valid as well. gvn assumes the user's own changes when the "user/" portion is missing, and assumes the HEAD revision in the absence of "revnum".
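
As an illustration of the short forms above, here is a minimal sketch of a parser for change names. The function name parse_change_name and its default_user parameter are hypothetical, not part of gvn; a returned revnum of None stands for HEAD.

```python
import re

# Optional "user/" prefix, a change name, and an optional "@revnum" suffix.
_CHANGE_NAME = re.compile(
    r'^(?:(?P<user>[^/@]+)/)?(?P<name>[^/@]+)(?:@(?P<rev>\d+))?$')


def parse_change_name(text, default_user):
    """Return (user, name, revnum); revnum is None when HEAD is meant."""
    match = _CHANGE_NAME.match(text)
    if match is None:
        raise ValueError('not a change name: %r' % (text,))
    # A missing "user/" portion means the caller's own change.
    user = match.group('user') or default_user
    rev = match.group('rev')
    return user, match.group('name'), int(rev) if rev is not None else None
```

Called with default_user set to the invoking user, "foo", "foo@17", "basil/foo", and "basil/foo@17" all resolve to a full (user, name, revnum) triple.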

Creation of a changebranch is the same operation as later snapshots (updates to the changebranch); both are simply calls to gvn.changebranch.ChangeBranch.Branch. This creates any parent directories of the changebranch on the fly, meaning changebranch creation is transparent. gvn change and gvn snapshot open the working copy to the deepest common path of all paths to be changebranched. Branch starts a transaction, deletes the branch (if any), copies the source branch from the base revision of the deepest common path, and then applies the user's changes to the branch. Because working copies almost always have mixed revisions, the branch may have been copied from a different revision than what should be used for a file. Branch deletes such files from the branch and copies them with the correct revision. This behavior is identical to that of svn cp . URL.
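
The mixed-revision handling described above can be sketched as a planning step that decides which copies a snapshot needs. This is only an illustration: plan_branch and its arguments are hypothetical names, and the real Branch method drives a commit editor rather than returning a list.

```python
def plan_branch(source, branch, base_rev, file_revs):
    """Return the copy operations for one snapshot.

    source    -- repository path of the code line, e.g. '/trunk'
    branch    -- repository path of the branch to (re)create
    base_rev  -- base revision of the deepest common path
    file_revs -- {path relative to source: base revision in the working copy}
    """
    # The branch itself is copied from the code line at base_rev.
    ops = [('A', branch, source, base_rev)]
    for path, rev in sorted(file_revs.items()):
        if rev != base_rev:
            # The copy brought in the wrong revision of this file, so it is
            # replaced with a copy from the revision the working copy has.
            ops.append(('R', '%s/%s' % (branch, path),
                        '%s/%s' % (source, path), rev))
    return ops
```

For a working copy at revision 15 with Makefile at revision 16, this yields one Add of the branch from /trunk@15 and one Replace of Makefile from /trunk/Makefile@16.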

gvn change takes the same options as svn for setting the log message of the snapshot, but when none are specified and it runs the user's editor, the log message of the previous snapshot (if any) is loaded into the form, ready to become the log message of this snapshot. The user may edit this, or leave it as is. gvn submit copies the log message of the last snapshot to become the log message of the final submitted revision. See gvn submit (TODO href me) below for future plans.

Future plans: gvn today maintains the list of changebranched files in the .gvnstate file, with none of the journalling svn itself uses for wc operations. Duplicating that would be pointless; instead we should add to svn a general mechanism for applications to store additional metadata in the Subversion metadata area (.svn directory).

Another disadvantage is that the smallest change to a changebranch involves sending all the diffs to the repository again. If the user wants to change only the change description, or change one file out of a hundred, the diffs for all hundred files must be transmitted.

gvn uses the list of changed paths on the latest (or specified) snapshot to determine which files are considered part of the change, and which action is taken for each. The problem with this is that actions implementing the changebranch are conflated with the actions the user took. Since the branch itself is always an Add or Replace, gvn has no way to know when the user is changing a property (e.g. svn:ignore or svn:mergeinfo) on the deepest common path itself, nor does it have a way to distinguish Replace files from Modified files.

Where We're Going

We knew from the beginning that using a branch of any kind to manage pending changes only works for reasonably sized changes. Some changes are so large as to be unreviewable. For example, when importing or updating a third-party tree, reviewers can only check the log message and the list of files; they're not going to read over the entire tree. The same applies to large merges. In addition to the human problem, sending these large changes twice (once to the changebranch and again for the final submit) is an unreasonable burden. So, we need a way to create an artifact with certain metadata about the change without sending the change itself until final submit.

Additionally, we have the problems described in "Where We Are". We can solve all these problems with one change to the model: when snapshotting a changebranch, store our own copy of the path change metadata, and only re-branch files as needed. We store this as a file called state in the changebranch container (e.g. //changes/epg/foo/state). This state file is a JSON serialization of the new gvn.changebranch.State class.

For the class of changes we heuristically decide are "big" (or maybe the user has to use an option), we just don't create the branch part of the changebranch; we create //changes/epg/foo/state but not //changes/epg/foo/branch. gvn review can still show the changed files and how they were changed (and even where things are being copied from, e.g. svn mv gigantic src/gigantic).

Now, for changes where we do branch to //changes/epg/foo/branch, we do a full snapshot only at changebranch creation. On subsequent snapshots, unless --force is specified, we snapshot only the files for which NeedsSnapshot is true. E.g.

# wc is @15, Makefile is @16

wc% gvn opened
 M    Makefile
 M    testing/testcommon.py

wc% gvn change -c foo --non-interactive
Sending Makefile
Sending testing/testcommon.py
Changed epg/foo@17.

=> same as today, copied //trunk@15 and applied diffs
   A /changes/epg/foo
   A /changes/epg/foo/state
   A /changes/epg/foo/branch (from:/trunk@15)
   R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)
   M /changes/epg/foo/branch/testing/testcommon.py

   but review looks at the path's action in the State object and
   prints an M not an R.

wc% gvn opened
--- foo
 M    Makefile
 M    testing/testcommon.py

wc% gvn change -c foo --non-interactive -m 'new log message'
Changed epg/foo@18.

The state file looks like:

{"base": "trunk", "base_rev": 15, "paths": {
   "Makefile": {
     "action": "M",
     "base_rev": 16
   },
   "testing/testcommon.py": {
     "action": "M"
   }
 }
}

gvn notices that nothing NeedsSnapshot, and just tries to change the svn:log property of the last snapshot. If that fails, it commits a change just to state, updating all the snap entries to the last snap revision. Hmm, or maybe it always does that, rather than trying to change the log.

The State objects haven't changed, so we didn't have to re-send them and gvn review has no trouble showing the correct information and diffs. We didn't have to send any diffs, we're just making a new revision with a new svn:log.

wc% gvn opened
--- foo
*M    Makefile
 M    testing/testcommon.py
wc% gvn change -c foo --non-interactive -m 'new log message'
Sending Makefile
Changed epg/foo@19.

=> notices only Makefile needs snapshot, so does this:
   M /changes/epg/foo/state
   R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)

gvn only had to send a new delta for Makefile and set the snap revisions for everything else to 18:

{"base": "trunk", "base_rev": 15, "paths": {
   "Makefile": {
     "action": "M",
     "base_rev": 16
   },
   "testing/testcommon.py": {
     "action": "M",
     "snap_rev": 18
   }
 }
}

If some path had been copied:

{"base": "trunk", "base_rev": 15, "paths": {
   "Makefile": {
     "action": "M",
     "base_rev": 16
   },
   "testing/testcommon.py": {
     "action": "M",
     "snap_rev": 18
   },
   "copy-example": {
     "action": "M",
     "copyfrom_path": "/trunk/foo",
     "copyfrom_rev": 14
   }
 }
}
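
The state files shown above could be modelled roughly as follows. This is a sketch under assumptions: the field names are taken from the examples, but the PathState helper, the method names, and the choice to omit unset fields are all hypothetical; the actual gvn.changebranch.State class may look quite different.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class PathState:
    """Per-path entry; hypothetical helper, fields as in the examples."""
    action: str
    base_rev: Optional[int] = None
    snap_rev: Optional[int] = None
    copyfrom_path: Optional[str] = None
    copyfrom_rev: Optional[int] = None


@dataclass
class State:
    base: str
    base_rev: int
    paths: Dict[str, PathState] = field(default_factory=dict)

    def to_json(self):
        """Serialize, omitting per-path fields that are unset."""
        paths = {}
        for path, entry in self.paths.items():
            paths[path] = {k: v for k, v in asdict(entry).items()
                           if v is not None}
        return json.dumps({'base': self.base, 'base_rev': self.base_rev,
                           'paths': paths}, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        paths = {p: PathState(**v) for p, v in raw['paths'].items()}
        return cls(raw['base'], raw['base_rev'], paths)
```

Loading one of the examples with State.from_json and writing it back with to_json gives an equivalent JSON document, so gvn review could read path actions from this object instead of from the snapshot's changed paths.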

Commands

checkout (co)

Resolve short "URLs" (e.g. //tools) to real URLs based on project settings and run svn checkout to create a working copy.

Future plans: Take an option specifying a "sparse checkout spec" specifying exactly what to check out, rather than checking out the whole tree.

g4 analog: client + sync

diff (di)

Display the differences for locally modified paths.

g4 analog: diff

change (ch)

Add paths to or remove paths from a changebranch, or delete a changebranch entirely.

g4 analog: change

changes

List local changebranches.

g4 analog: none?

opened

List locally modified files, grouped by changebranch.

g4 analog: opened

mail

Mail a changebranch review request.

g4 analog: mail

review

Show the change description and diffs for a change.

g4 analog: describe

snapshot

Update a changebranch in the repository.

g4 analog: none

approve (ack)

Mark a changebranch approved, reviewed.

Future plans: At the moment, approved == reviewed, but we will probably change that. The big difference between the two under g4 is that you can submit without a 'looks good' (though you get nag mail) but not without an approval.

g4 analog: approve

submit

Submit the changes from a changebranch and remove the changebranch.

Future plans:

Since changebranches are like the private branches provided by distributed systems like svk and Mercurial, it is likely that users will want to snapshot frequently, with snapshot log messages describing the change being snapshotted, rather than growing a single change description with each snapshot. To facilitate this, submit will have an option to run the user's editor on a form with gvn log of all snapshots loaded, for easy massaging into the final change description.

If the user has set the per-project run-presubmit option, this will run gvn presubmit. This is the only option which must be in the user project file; users themselves must make the decision to allow other committers to run arbitrary code on their systems. If an organization wants to make this decision for users, it can hack this check out internally.

g4 analog: submit

update (up)

Run svn update to update the working copy.

g4 analog: sync

rdiff (rdi)

g4 analog: diff2

describe (desc)

g4 analog: describe -s

log

Show the history of changes of a path, or a changebranch.

Future plans: Will show gvn review status properties for each change.

g4 analog: changes

import

Run svn import.

Future plans: replace svn-vendor:

import [local-path] //third-party/subversion [tags]

Import local-path (defaults to .) to //third-party/subversion/import (creating directories as needed). local-path is distinguished from the repository path by its leading slashes: local-path can start with at most one /, and the repository path after it must start with two. If tags are listed, copy the new import tree to those tags. A tag starting with any number of / other than 0 or 2 is an error: 0 means it is treated as a path relative to //third-party/subversion, and 2 means it is treated as an absolute path. The tag path will be Replaced if it exists.

  cd svn-1.4.3 && gvn import //third-party/subversion 1.4.3
  gvn import svn-1.4.3 //third-party/subversion # no tags
  gvn import svn-1.4.3 //third-party/subversion collab/1.4.3 //foo/1.4.3
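
The tag rule described above (0 leading slashes means relative, 2 means absolute, anything else is an error) can be sketched as follows. resolve_tag and its parameter names are hypothetical, chosen only for this illustration.

```python
def resolve_tag(tag, import_root):
    """Return the repository path a tag argument refers to.

    tag         -- tag as given on the command line
    import_root -- the repository path being imported to,
                   e.g. '//third-party/subversion'
    """
    slashes = len(tag) - len(tag.lstrip('/'))
    if slashes == 0:
        # Relative to the import root.
        return import_root.rstrip('/') + '/' + tag
    if slashes == 2:
        # Already an absolute repository path.
        return tag
    raise ValueError('tag must start with 0 or 2 slashes: %r' % (tag,))
```

With //third-party/subversion as the import root, the tags from the last example line resolve to //third-party/subversion/collab/1.4.3 and //foo/1.4.3.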

  svn.ra.do_status to get list of files in target tree
  os.walk to get list from source tree
  sort lists
  for i in difflib.Differ().compare(a, b):
    if i[0] == '-':
      remove i[2:]
    elif i[0] == '+':
      add i[2:]
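
A runnable version of the comparison step above might look like the following sketch. It takes the two path lists as plain arguments, leaving out how they are gathered (svn.ra.do_status and os.walk in the outline); plan_import is a hypothetical name.

```python
import difflib


def plan_import(target_paths, source_paths):
    """Yield ('remove' | 'add', path) pairs that turn target into source.

    target_paths -- paths currently in the repository tree
    source_paths -- paths found in the local tree being imported
    """
    a = sorted(target_paths)
    b = sorted(source_paths)
    for line in difflib.Differ().compare(a, b):
        # Differ prefixes each line with a two-character code; lines
        # starting with ' ' (unchanged) or '?' (hints) are ignored.
        if line[0] == '-':
            yield ('remove', line[2:])
        elif line[0] == '+':
            yield ('add', line[2:])
```

Paths present in both lists produce no action here; deciding which of those files have changed content is a separate step not covered by this sketch.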

todo

Future plans:

g4 todo is basically:

  changes_options['status'] = 'pending'
  changes_options['long_output'] = True
  pending_changes = p4.changes([], **changes_options)
  for change in pending_changes:
    changelist = g4utils.GoogleChangelistDescription(change.Description())
    if user in changelist.Reviewers():
      yield change

The equivalent for gvn would be:

  for user_path in //changes:
    for change in //changes/user_path:
      if user in change.reviewers:
        yield change

p4 changes -s pending is probably very fast, whereas we're talking about multiple round trips for the gvn equivalent. So, this is probably too slow, and we'll need to index the changebranches.

Mondrian should already have this index. So, gvn todo could just use this. If Mondrian has no API, we can add one while we're adding Subversion support.

Actually, we might be able to do this quickly with svn_ra_do_status or svn_ra_do_update, if the new svn_depth_t stuff will allow us to ask for //changes/*/*.

Subversion Commands

These commands are pure pass-through to svn, with // paths translated to full URLs.

add
blame (praise, annotate, ann)
cat
cleanup
copy (cp)
delete (del, remove, rm)
export
info
list (ls)
lock
merge
mergeinfo
mkdir
move (mv, rename, ren)
propdel (pdel, pd)
propedit (pedit, pe)
propget (pget, pg)
proplist (plist, pl)
propset (pset, ps)
resolved
revert
status (stat, st)
switch (sw)
unlock
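
A pass-through command could be built along these lines. This is a sketch with assumptions stated up front: translate_args and svn_argv are hypothetical names, and // is assumed to be relative to a single root URL taken from the project settings, which may not match how gvn actually resolves short URLs.

```python
def translate_args(args, root_url):
    """Replace every argument starting with // by a full URL."""
    root = root_url.rstrip('/')
    translated = []
    for arg in args:
        if arg.startswith('//'):
            rest = arg[2:]
            translated.append(root + '/' + rest if rest else root)
        else:
            translated.append(arg)
    return translated


def svn_argv(command, args, root_url):
    """Return the argv for running the equivalent svn command."""
    return ['svn', command] + translate_args(args, root_url)
```

The resulting list can be handed to subprocess.call. Note that this naive version would also rewrite an option value that happens to start with //.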

Code

class gvn.errors.Root

Associated classes:

gvn.errors.User
gvn.errors.Internal
...

gvn.errors.User-derived exceptions represent errors that could originate from a user, though of course an application may reasonably catch some of these and use them to learn something, e.g. that a path in the repository does not exist. These exception classes have a code member, which may be used as an exit code.

gvn.errors.Internal-derived exceptions represent errors caused by callers, perhaps intentionally, e.g. to indicate that a string is not a short URL.
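
The split between the two families might be used like this. Only Root, User, Internal, and the code member come from the description above; the PathNotFound and NotShortURL subclasses, the code values, and the run helper are hypothetical.

```python
import sys


class Root(Exception):
    """Base class for all gvn errors."""


class User(Root):
    """Errors that could originate from a user; code doubles as exit code."""
    code = 1


class Internal(Root):
    """Errors caused by callers."""


class PathNotFound(User):      # hypothetical subclass
    code = 2


class NotShortURL(Internal):   # hypothetical subclass
    pass


def run(command):
    """Run command() and map User errors to an exit code."""
    try:
        command()
    except User as e:
        sys.stderr.write('gvn: %s\n' % (e,))
        return e.code
    return 0
```

Internal errors are deliberately not caught in run: they indicate a caller bug or are meant to be handled closer to where they are raised.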

class gvn.config.Config

Associated classes:

gvn.config.ProjectConfig

Holds user configuration bits such as commands for running an editor or showing a diff. Also holds the apr_hash_t of svn_config_t objects used by svn libraries. Has functions for finding and returning ProjectConfig objects, which hold user configuration about a project, e.g. the project URL.

class gvn.repository.Repository

Associated classes:

gvn.repository.Dirent
gvn.repository.Revision
gvn.repository.ChangedPath

Represents a connection to a repository. It holds the username used to open the connection, the URL, and functions to turn paths into URLs, get the head revision, get revisions (svn log), and get information about a repository path. Dirent represents information about a path. Revision represents a revision, with ChangedPath objects for each path changed in that revision.

class gvn.project.Project

Holds the project metadata from the repository, i.e. location of changebranches, gvn mail template, and how many lines of unified diff to include in review mails. Holds a Repository object for the underlying repository.

class gvn.wc.WorkingCopy

Represents a working copy. Has path manipulation functions, svn_wc_status and svn_wc_entry wrappers, wcprop/post-commit wrappers, a map of working copy paths to changebranches for all changebranched paths, and a Project object. All input and output paths are relative to the top of the working copy, except for the path manipulation functions, which translate between absolute/wc-relative and repository/local paths.

def gvn.commit.Drive

Associated classes:

gvn.commit.EditorAction
gvn.commit.OpenOrMkdir
gvn.commit.Copy
gvn.commit.Delete
gvn.wc.Edit

This function drives a commit. Callers provide a callback which returns a gvn.commit.EditorAction for a repository path. Drive calls this object, passing it the commit editor baton. OpenOrMkdir opens a directory if it exists or creates it if not; Copy copies a path; Delete deletes a path. gvn.wc.Edit sends a local modification to the repository and schedules this action with WorkingCopy's post-commit queue.
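
The callback protocol described above can be illustrated with a toy model. Everything here is an assumption made for the sketch: the signatures are invented, and a plain list stands in for the commit editor baton, recording what a real editor would be asked to do.

```python
class EditorAction(object):
    """Base class; subclasses act on one repository path."""
    def __call__(self, path, baton):
        raise NotImplementedError


class OpenOrMkdir(EditorAction):
    def __init__(self, exists):
        self.exists = exists

    def __call__(self, path, baton):
        # Open the directory if it exists, otherwise create it.
        baton.append(('open' if self.exists else 'mkdir', path))


class Copy(EditorAction):
    def __init__(self, src, rev):
        self.src, self.rev = src, rev

    def __call__(self, path, baton):
        baton.append(('copy', path, self.src, self.rev))


class Delete(EditorAction):
    def __call__(self, path, baton):
        baton.append(('delete', path))


def Drive(paths, callback, baton):
    """Ask callback for an action per path and apply it to the baton."""
    for path in sorted(paths):
        action = callback(path)
        action(path, baton)
```

Keeping the per-path decision in a callback lets callers such as changebranch creation and submit share the driving loop while supplying different actions.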

class gvn.changebranch.ChangeBranch

Represents a changebranch. This includes all information about the change (such as a list of gvn.repository.ChangedPaths) and methods for updating, deleting, or submitting the change.

class gvn.cmdline.Command

Associated classes:

gvn.cmdline.Option
gvn.cmdline.OptionParser
gvn.cmdline.Context

Represents a gvn command (e.g. gvn submit), including options parsed from the command line using Option and OptionParser (a subclass of the standard Python optparse.OptionParser). Context discovers and holds information about the context in which the command executes. It holds a Config object (from gvn.config.Get and modified based on command-line options and environment variables), a gvn.wc.WorkingCopy object (if available), and a gvn.project.Project object thereby discovered.

Caveats

It will be better one day.

Security Considerations

presubmit allows a user with write access to a directory to execute arbitrary code on the systems of all users who check out that directory and have the presubmit feature enabled.

Privacy Considerations

N/A ?

Standards

N/A

Logging Plan

N/A ?

Testing Plan

??? Test coverage is not so good (39%), but improving all the time.

Monitoring Plan

N/A

Internationalization Plan

...

Documentation Plan

The code is well-commented and docstringed. Minimal user documentation in the form of the gvn help command is available. We also have a preliminary manual.

Work Estimates

...

Appendix A: Where We've Been

Rewrite to fix "awkward flow of concepts" --dchristian

Subversion has a feature called "changelists", inspired by Perforce's pending changelists. They are not quite the same: a changelist in Subversion is identified by an arbitrary string chosen by the user and is local to a working copy; a pending changelist in Perforce is identified by a globally unique integer in the same series as submitted changelists ("revisions" in Subversion) and is stored in the depot (the actual changed files are local to the client). To use these, gvn would have to send around (changelist name, hostname, working copy path) triples. The versioning and verification requirements would go unmet.

The most obvious way to satisfy the versioning and verification requirements is to track the change in the Subversion repository rather than only in the working copy; i.e. use a branch. This addresses the last three requirements at once: any user can review the change by diffing branches, changes to the change are versioned because they are commits to the branch, and a tool can diff the branch to the submitted revision to verify the submitted revision is the same as a change that was reviewed. This leaves only the multiple changes requirement, which is a local issue.

One way to implement a changebranch is to branch from the working copy URL at its base revision and switch the working copy to that branch. This makes the multiple changes requirement impossible to satisfy. Additionally, it is very expensive; it requires a slow crawl across the entire working copy, not just the changed files.

Instead, gvn can switch only the files contained in this change to the branch. This satisfies all requirements, but introduces more issues, among them:

How do you changebranch a directory? If the directory has many descendants (in the filesystem, not in history), they must all be switched, which may be very expensive. What about when the user wants to make a change to the "ui" directory, but the file "ui/foo.c" is already involved in an unrelated change? This situation is impossible to implement with this model.
How do you changebranch adds? Create the added file in the changebranch (a commit), then switch the locally added file to the branch? To submit, switch the file back to the path on the source branch? But it doesn't exist yet, so that won't work. But somehow we'd have to turn the locally modified (or unmodified) state back into a local add on the source branch.
How do you changebranch deletes? Switching the removed file to the changebranch is simple enough, but once gvn commits that to the changebranch, the file is gone from the working copy. It has to be resurrected somehow in order to submit the deletion to the source branch.
How do you manage updates? Other people will be changing the changebranched files. If the user naively runs svn update, these changes will be missed. Instead the user must always run a special gvn command, which must update these files itself. This can't be accomplished by switching all files back to the source branch, updating, and switching back; at no point are the "update" changes *local*; assuming no conflicts, these changes will come in at the switch to source branch and then leave at the switch back to the changebranch.

Instead, gvn must perform a merge from the base revision to the update-target revision of the file on source branch to the local file. This must be a separate merge for each file in the change, as not all updated files are switched! This is, of course, ridiculously expensive.
What happens if there are conflicts after that merge (not an infrequent occurrence when updating)? Only the user can resolve them, so gvn must exit. What about all these merged files? Does it commit what it can?
Whether it leaves all "updated" files uncommitted or only the conflicted ones, how do those "updates" then get committed? Scribble out some state in a ".gvn" file?
What happens when the user edits a previously-unmodified file and puts it onto one of these "updated" changebranches? The file in the working copy is up-to-date, but its copy on the changebranch is not. Squirrel away the user's changed version, revert the file, switch it to the source branch, write the unmodified contents of the file, commit, and restore the user's modified file? What about when the user reverts the changed file, or moves it to a different changebranch? gvn must track the "update" state for each file.
What happens if the user had changes on a changebranched file not yet committed to the changebranch and the user runs this "update" command? Commit the user's changes along with the "update" changes, removing the ability to specify the log message on the snapshot? Force the user to snapshot first?
Finally, all the above adds up to fragility. It's a lot of moving parts, lots of edge cases. It's very difficult to get it all right. As we never finished implementing this model, it seems likely to have even more complications we hadn't yet encountered. This is all compounded by the working copy's infamous fragility, and the daunting spaghetti that is libsvn_wc.

One obvious way around all this is to keep a separate, "shadow" working copy of the changebranch, so the "real" working copy wouldn't have any switched files at all. This opens whole new cans of worms.

So, we changed the model: Re-branch at every snapshot. By re-branching every time, all the updating problems vanish (along with a giant pile of code, of which Erik and I were increasingly distrustful (and we weren't even finished writing it!)). Since we don't have any special update requirements, we no longer have to switch. That saves us from all that fragility. Of course, this model brings with it some of its own problems...