gvn Versions Nicely

Status: Draft   (as of 2008-05-14)

Modified:

Objective

Implement Google version control practices and workflow with Subversion.

g4-compatible command-line and p4lib-compatible API are explicit non-goals. Instead, gvn provides Subversive solutions to the problems g4 and p4lib solve.

Background

Stuff one needs to know to understand this doc: motivating examples, previous versions and problems, links to related projects/design docs, etc. You should mention related work outside of Google if applicable. Note: this is background; do not write about your design or ideas to solve problems here.

Start out with a paragraph with one or two sentences speaking broadly about the whys and wherefores of source control, and then how we turned around and gave all that up with g4 pending changelists. Then Mondrian came in to version those changes on the side. Now let's bring it back under one source control system.

Link to some svn docs, SCM patterns site, ...

Terminology

"source branch" is used in comments and docstrings all over the place, but it's confusing: to many, "trunk" is not conceptually a branch, and it sometimes isn't technically. "code line" is a better term, but what about "source"? It's the "source" from which the changebranch is copied, but it's also the "target" to which the change is eventually submitted.

Include very brief overview of svn, including copy/modify/merge model, branches are copies, atomic commits, arbitrary metadata, and so on.

Overview

One page high-level overview; put details in the next section and background in the previous section. Should be understandable by a new Google engineer not working on the project.

Detailed Design

The changebranch

If you're feeling masochistic, see Appendix A: Where We've Been.

TODO(epg): document properties:

  • revprops
  • node props

    Where We Are

    A changebranch is an entry in a project's changes directory in the repository, associated metadata, and a branch. The default location for changebranches is //changes, so user basil's change "foo" is located at //changes/basil/foo, which contains //changes/basil/foo/bar, the actual branch. "bar" is the basename of the directory being copied.

    Users refer to changebranches with the canonical change name rather than the path in the repository. The canonical change name is of the form "user/change-name@revnum". The canonical name of basil's "foo" change is "basil/foo@17". Short forms "foo", "foo@17", and "basil/foo" are all valid as well. gvn assumes the user's own changes when the "user/" portion is missing, and assumes the HEAD revision in the absence of "revnum".
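
The name resolution described above can be sketched roughly as follows. This is only an illustration, not gvn's actual parser: the ChangeName tuple, the parse_change_name function, and the regular expression are all hypothetical names, and the sketch assumes neither the user nor the change name contains "/" or "@".

```python
import re
from typing import NamedTuple, Optional


class ChangeName(NamedTuple):
    """Hypothetical holder for the parts of a canonical change name."""
    user: str
    name: str
    revnum: Optional[int]  # None stands for "use HEAD"


# Assumed grammar: optional "user/", a change name, optional "@revnum".
_CHANGE_NAME = re.compile(
    r'^(?:(?P<user>[^/@]+)/)?(?P<name>[^/@]+)(?:@(?P<rev>\d+))?$')


def parse_change_name(text: str, default_user: str) -> ChangeName:
    """Expand a short form such as "foo" or "foo@17" to its full parts."""
    match = _CHANGE_NAME.match(text)
    if match is None:
        raise ValueError('not a change name: %r' % (text,))
    # A missing "user/" portion means the caller's own change.
    user = match.group('user') or default_user
    rev = match.group('rev')
    # A missing "@revnum" means HEAD, represented here as None.
    return ChangeName(user, match.group('name'),
                      int(rev) if rev is not None else None)
```

With this sketch, parse_change_name("foo", "basil") and parse_change_name("basil/foo", "epg") both name basil's "foo" change at HEAD.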

    Creation of a changebranch is the same operation as later snapshots (updates to the changebranch); both are simply calls to gvn.changebranch.ChangeBranch.Branch. This creates any parent directories of the changebranch on the fly, meaning changebranch creation is transparent. gvn change and gvn snapshot open the working copy to the deepest common path of all paths to be changebranched. Branch starts a transaction, deletes the branch (if any), copies the source branch from the base revision of the deepest common path, and then applies the user's changes to the branch. Because working copies almost always have mixed revisions, the branch may have been copied from a different revision than what should be used for a file. Branch deletes such files from the branch and copies them with the correct revision. This behavior is identical to that of svn cp . URL.
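
As a rough illustration of the "deepest common path" step, the sketch below computes it for a list of working-copy-relative paths. It assumes every path names a file (so its parent directory is the candidate) and uses "" for the top of the working copy; deepest_common_path is a hypothetical helper, not a gvn function.

```python
import posixpath


def deepest_common_path(paths):
    """Return the deepest directory containing every file in `paths`.

    Paths are assumed to be wc-relative, "/"-separated file paths.
    The empty string stands for the top of the working copy.
    """
    if not paths:
        raise ValueError('need at least one path')
    parents = [posixpath.dirname(p) for p in paths]
    # Any file at the top level forces the common path to be the top.
    if '' in parents:
        return ''
    return posixpath.commonpath(parents)
```

For the Makefile and testing/testcommon.py example used later in this document, the result would be the top of the working copy.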

    gvn change takes the same options as svn for setting the log message of the snapshot, but when none are specified and it runs the user's editor, the log message of the previous snapshot (if any) is loaded into the form, ready to become the log message of this snapshot. The user may edit this, or leave it as is. gvn submit copies the log message of the last snapshot to become the log message of the final submitted revision. See gvn submit (TODO href me) below for future plans.

    Future plans: gvn today maintains the list of changebranched files in the .gvnstate file, with none of the journalling svn itself uses for wc operations. Duplicating that would be pointless; instead we should add to svn a general mechanism for applications to store additional metadata in the Subversion metadata area (.svn directory).

    Another disadvantage is that the smallest change to a changebranch involves sending all the diffs to the repository again. If the user wants to change only the change description, or change one file out of a hundred, the diffs for all hundred files must be transmitted.

    gvn uses the list of changed paths on the latest (or specified) snapshot to determine which files are considered part of the change, and which action is taken for each. The problem with this is that actions implementing the changebranch are conflated with the actions the user took. Since the branch itself is always an Add or Replace, gvn has no way to know when the user is changing a property (e.g. svn:ignore or svn:mergeinfo) on the deepest common path itself, nor does it have a way to distinguish Replace files from Modified files.

    Where We're Going

    We knew from the beginning that using a branch of any kind to manage pending changes only works for reasonably sized changes. Some changes are so large as to be unreviewable. For example, when importing or updating a third-party tree, reviewers can only check the log message and the list of files; they're not going to read over the entire tree. The same applies to large merges. In addition to the human problem, sending these large changes twice (once to the changebranch and again for the final submit) is an unreasonable burden. So, we need a way to create an artifact with certain metadata about the change without sending the change itself until final submit.

    Additionally, we have the problems described in "Where We Are". We can solve all these problems with one change to the model: when snapshotting a changebranch, store our own copy of the path change metadata, and only re-branch files as needed. We store this as a file called state in the changebranch container (e.g. //changes/epg/foo/state). This state file is a JSON serialization of the new gvn.changebranch.State class.

    For the class of changes we heuristically decide are "big" (or maybe the user has to use an option), we just don't create the branch part of the changebranch; we create //changes/epg/foo/state but not //changes/epg/foo/branch. gvn review can still show the changed files and how they were changed (and even where things are being copied from, e.g. svn mv gigantic src/gigantic).

    Now, for changes where we do branch to //changes/epg/foo/branch, we do a full snapshot only at changebranch creation. On subsequent snapshots, unless --force is specified, we don't snapshot files unless NeedsSnapshot. E.g.

    # wc is @15, Makefile is @16
    
    wc% gvn opened
     M    Makefile
     M    testing/testcommon.py
    
    wc% gvn change -c foo --non-interactive
    Sending Makefile
    Sending testing/testcommon.py
    Changed epg/foo@17.
    
    => same as today, copied //trunk@15 and applied diffs
       A /changes/epg/foo
       A /changes/epg/foo/state
       A /changes/epg/foo/branch (from:/trunk@15)
       R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)
       M /changes/epg/foo/branch/testing/testcommon.py
    
       but review looks at the path's action in the State object and
       prints an M not an R.
    
    wc% gvn opened
    --- foo
     M    Makefile
     M    testing/testcommon.py
    
    wc% gvn change -c foo --non-interactive -m 'new log message'
    Changed epg/foo@18.
    

    The state file looks like:

    {"base": "trunk", "base_rev": 15, "paths": {
       "Makefile": {
         "action": "M",
         "base_rev": 16
       },
       "testing/testcommon.py": {
         "action": "M"
       }
     }
    }
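
A minimal sketch of how such a state file might round-trip through JSON is shown here. The State and PathState classes below are hypothetical stand-ins (the real gvn.changebranch.State may look quite different); only the field names are taken from the state file examples in this section.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PathState:
    """Per-path entry; optional fields are omitted from the JSON when unset."""
    action: str
    base_rev: Optional[int] = None
    snap_rev: Optional[int] = None
    copyfrom_path: Optional[str] = None
    copyfrom_rev: Optional[int] = None

    def to_dict(self):
        return {k: v for k, v in vars(self).items() if v is not None}


@dataclass
class State:
    """Stand-in for the state stored at //changes/<user>/<change>/state."""
    base: str
    base_rev: int
    paths: Dict[str, PathState] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({
            'base': self.base,
            'base_rev': self.base_rev,
            'paths': {p: s.to_dict() for p, s in self.paths.items()},
        }, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> 'State':
        data = json.loads(text)
        paths = {p: PathState(**entry) for p, entry in data['paths'].items()}
        return cls(data['base'], data['base_rev'], paths)
```

Leaving unset fields out of the serialized form keeps the file as small as the examples shown here.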
    

    gvn notices that nothing NeedsSnapshot, and just tries to change the svn:log property of the last snapshot. If that fails, it commits a change just to state, updating all the snap entries to the last snap revision. Hmm, or maybe it always does that, rather than trying to change the log.

    The State objects haven't changed, so we didn't have to re-send them and gvn review has no trouble showing the correct information and diffs. We didn't have to send any diffs; we're just making a new revision with a new svn:log.

    wc% gvn opened
    --- foo
    *M    Makefile
     M    testing/testcommon.py
    wc% gvn change -c foo --non-interactive -m 'new log message'
    Sending Makefile
    Changed epg/foo@19.
    
    => notices only Makefile needs snapshot, so does this:
       M /changes/epg/foo/state
       R /changes/epg/foo/branch/Makefile (from:/trunk/Makefile@16)
    

    gvn only had to send a new delta for Makefile and set the snap revisions for everything else to 18:

    {"base": "trunk", "base_rev": 15, "paths": {
       "Makefile": {
         "action": "M",
         "base_rev": 16
       },
       "testing/testcommon.py": {
         "action": "M",
         "snap_rev": 18
       }
     }
    }
    

    If some path had been copied:

    {"base": "trunk", "base_rev": 15, "paths": {
       "Makefile": {
         "action": "M",
         "base_rev": 16
       },
       "testing/testcommon.py": {
         "action": "M",
         "snap_rev": 18
       },
       "copy-example": {
         "action": "M",
         "copyfrom_path": "/trunk/foo",
         "copyfrom_rev": 14
       }
     }
    }
    

    Commands

    checkout (co)

    Resolve short "URLs" (e.g. //tools) to real URLs based on project settings and run svn checkout to create a working copy.
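
The short-URL resolution could look something like the sketch below. It assumes the project settings supply a single root URL and that "//" simply stands for that root; resolve_short_url and the example URL are hypothetical, not gvn's real API or configuration.

```python
def resolve_short_url(path: str, project_url: str) -> str:
    """Turn a short "URL" such as //tools into a full repository URL.

    `project_url` is assumed to come from the project settings.
    Anything that does not start with "//" is returned unchanged.
    """
    if not path.startswith('//'):
        return path
    root = project_url.rstrip('/')
    rest = path[2:].strip('/')
    return root + '/' + rest if rest else root
```

For example, with a project URL of https://svn.example.com/repo (a made-up value), //tools would resolve to https://svn.example.com/repo/tools, which is then handed to svn checkout.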

    Future plans: Take an option specifying a "sparse checkout spec" specifying exactly what to check out, rather than checking out the whole tree.

    g4 analog: client + sync

    diff (di)

    Display the differences for locally modified paths.

    g4 analog: diff

    change (ch)

    Add paths to or remove paths from a changebranch, or delete a changebranch entirely.

    g4 analog: change

    changes

    List local changebranches.

    g4 analog: none?

    opened

    List locally modified files, grouped by changebranch.

    g4 analog: opened

    mail

    Mail a changebranch review request.

    g4 analog: mail

    review

    Show the change description and diffs for a change.

    g4 analog: describe

    snapshot

    Update a changebranch in the repository.

    g4 analog: none

    approve (ack)

    Mark a changebranch approved, reviewed.

    Future plans: At the moment, approved == reviewed, but we will probably change that. The big difference between the two under g4 is that you can submit without a 'looks good' (though you get nag mail) but not without an approval.

    g4 analog: approve

    submit

    Submit the changes from a changebranch and remove the changebranch.

    Future plans:

    Since changebranches are like the private branches provided by distributed systems like svk and Mercurial, it is likely that users will want to snapshot frequently, with snapshot log messages describing the change being snapshotted, rather than growing a single change description with each snapshot. To facilitate this, submit will have an option to run the user's editor on a form with gvn log of all snapshots loaded, for easy massaging into the final change description.

    If the user has set the per-project run-presubmit option, this will run gvn presubmit. This is the only option which must be in the user project file; users themselves must make the decision to allow other committers to run arbitrary code on their systems. If an organization wants to make this decision for users, it can hack this check out internally.

    g4 analog: submit

    update (up)

    Run svn update to update the working copy.

    g4 analog: sync

    rdiff (rdi)

    g4 analog: diff2

    describe (desc)

    g4 analog: describe -s

    log

    Show the history of changes of a path, or a changebranch.

    Future plans: Will show gvn review status properties for each change.

    g4 analog: changes

    import

    Run svn import.

    Future plans: replace svn-vendor:

    import [local-path] //third-party/subversion [tags]

    Import local-path (defaults to .) to //third-party/subversion/import (creating directories as needed). local-path is distinguished from a repository path by its leading slashes: a local path starts with at most one /, whereas the repository path that follows it must start with two. If tags are listed, copy the new import tree to those tags. A tag with no leading / is treated as a path relative to //third-party/subversion; a tag with two leading slashes is treated as an absolute path; any other number of leading slashes is an error. The tag path will be Replaced if it exists.
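
The tag rule above might be implemented along these lines. This is a sketch only: resolve_tag is a hypothetical helper, and it returns short // paths rather than full URLs.

```python
def resolve_tag(tag: str, import_root: str) -> str:
    """Map a tag argument to a short repository path.

    No leading slash: relative to `import_root`.
    Two leading slashes: already an absolute repository path.
    Any other number of leading slashes: an error.
    """
    slashes = len(tag) - len(tag.lstrip('/'))
    if slashes == 0:
        return import_root.rstrip('/') + '/' + tag
    if slashes == 2:
        return tag
    raise ValueError('tag must start with zero or two slashes: %r' % (tag,))
```

Under these assumptions, the tag 1.4.3 given with //third-party/subversion would resolve to //third-party/subversion/1.4.3, while //foo/1.4.3 would be used as is.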

      cd svn-1.4.3 && gvn import //third-party/subversion 1.4.3
      gvn import svn-1.4.3 //third-party/subversion # no tags
      gvn import svn-1.4.3 //third-party/subversion collab/1.4.3 //foo/1.4.3
      
      svn.ra.do_status to get list of files in target tree
      os.walk to get list from source tree
      sort lists
      for i in difflib.Differ().compare(a, b):
        if i[0] == '-':
          remove i[2:]
        elif i[0] == '+':
          add i[2:]
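
A runnable version of the comparison step in the pseudocode above might look like this. It takes the two path lists as plain arguments instead of obtaining them from svn.ra.do_status and os.walk, and plan_import is a hypothetical name chosen for the sketch.

```python
import difflib


def plan_import(target_paths, source_paths):
    """Return (removes, adds) needed to make the target tree match the source.

    `target_paths` are paths currently in the repository import tree;
    `source_paths` are paths found in the local tree being imported.
    """
    removes, adds = [], []
    diff = difflib.Differ().compare(sorted(target_paths), sorted(source_paths))
    for line in diff:
        if line.startswith('- '):
            removes.append(line[2:])   # only in the target: delete it
        elif line.startswith('+ '):
            adds.append(line[2:])      # only in the source: add it
        # Lines starting with '  ' (unchanged) or '? ' (hints) are ignored.
    return removes, adds
```

Since only membership matters here, plain set differences would give the same answer; difflib is kept to stay close to the pseudocode.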
      

    todo

    Future plans:

    g4 todo is basically:

      changes_options['status'] = 'pending'
      changes_options['long_output'] = True
      pending_changes = p4.changes([], **changes_options)
      for change in pending_changes:
        changelist = g4utils.GoogleChangelistDescription(change.Description())
        if user in changelist.Reviewers():
          yield change
      

    The equivalent for gvn would be:

      for user_path in //changes:
        for change in //changes/user_path:
          if user in change.reviewers:
            yield change
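
To make the shape of that loop concrete, here is a sketch that runs against an in-memory stand-in for //changes. The nested dictionary layout (owner, then change name, then a set of reviewers) is purely an assumption for illustration; a real implementation would need repository round trips at each level, which is the cost discussed below.

```python
def todo(changes, user):
    """Yield "owner/change" names of changes that list `user` as a reviewer.

    `changes` is a hypothetical in-memory model of //changes:
    {owner: {change_name: set_of_reviewers}}.
    """
    for owner in sorted(changes):               # one listing of //changes
        for name in sorted(changes[owner]):     # one listing per owner
            if user in changes[owner][name]:    # one metadata read per change
                yield '%s/%s' % (owner, name)
```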
      

    p4 changes -s pending is probably very fast, whereas we're talking about multiple round trips for the gvn equivalent. So, this is probably too slow, and we'll need to index the changebranches.

    Mondrian should already have this index. So, gvn todo could just use this. If Mondrian has no API, we can add one while we're adding Subversion support.

    Actually, we might be able to do this quickly with svn_ra_do_status or svn_ra_do_update, if the new svn_depth_t stuff will allow us to ask for //changes/*/* .

    Subversion Commands

    These commands are pure pass-through to svn, with // paths translated to full URLs.

    Code

    class gvn.errors.Root

    Associated classes:

    gvn.errors.User-derived exceptions represent errors that could originate from a user, though of course an application may reasonably catch some of these and use it to know something, e.g. that a path in the repository does not exist. These exception classes have a code member, which may be used as an exit code.

    gvn.errors.Internal-derived exceptions represent errors caused by callers, perhaps intentionally, e.g. to indicate that a string is not a short URL.
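
The split between the two families might be sketched as below. Root, User, and Internal mirror the names in this section; the NoSuchPath and NotShortURL subclasses, the numeric codes, and the exit_status helper are hypothetical and only illustrate how a command-line entry point could use the code member.

```python
class Root(Exception):
    """Base of the hierarchy (mirrors gvn.errors.Root)."""


class User(Root):
    """Errors that could originate from a user; carries an exit code."""
    code = 1  # hypothetical default exit code


class Internal(Root):
    """Errors caused by callers, sometimes raised intentionally."""


class NoSuchPath(User):       # hypothetical subclass
    code = 3                  # hypothetical value


class NotShortURL(Internal):  # hypothetical subclass
    pass


def exit_status(func):
    """Run `func`; map User-derived errors to their exit code.

    Internal-derived errors are deliberately not caught here, so a
    caller further up can handle them (or they surface as bugs).
    """
    try:
        func()
    except User as e:
        return e.code
    return 0
```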

    class gvn.config.Config

    Associated classes:

    Holds user configuration bits such as commands for running an editor or showing a diff. Also holds the apr_hash_t of svn_config_t objects used by svn libraries. Has functions for finding and returning ProjectConfig objects, which hold user configuration about a project, e.g. the project URL.

    class gvn.repository.Repository

    Associated classes:

    Represents a connection to a repository. It holds the username used to open the connection, the URL, and functions to turn paths into URLs, get the head revision, get revisions (svn log), and get information about a repository path. Dirent represents information about a path. Revision represents a revision, with ChangedPath objects for each path changed in that revision.

    class gvn.project.Project

    Holds the project metadata from the repository, i.e. location of changebranches, gvn mail template, and how many lines of unified diff to include in review mails. Holds a Repository object for the underlying repository.

    class gvn.wc.WorkingCopy

    Represents a working copy. Has path manipulation functions, svn_wc_status and svn_wc_entry wrappers, wcprop/post-commit wrappers, a map of working copy paths to changebranches for all changebranched paths, and a Project object. All input and output paths are relative to the top of the working copy, except for the path manipulation functions, which translate between absolute/wc-relative and repository/local paths.

    def gvn.commit.Drive

    Associated classes:

    This function drives a commit. Callers provide a callback which returns a gvn.commit.EditorAction for a repository path. Drive calls this object, passing it the commit editor baton. OpenOrMkdir opens a directory if it exists or creates it if not; Copy copies a path; Delete deletes a path. gvn.wc.Edit sends a local modification to the repository and schedules this action with WorkingCopy's post-commit queue.

    class gvn.changebranch.ChangeBranch

    Represents a changebranch. This includes all information about the change (such as a list of gvn.repository.ChangedPaths) and methods for updating, deleting, or submitting the change.

    class gvn.cmdline.Command

    Associated classes:

    Represents a gvn command (e.g. gvn submit), including options parsed from the command line using Option and OptionParser (a subclass of the standard Python optparse.OptionParser). Context discovers and holds information about the context in which the command executes. It holds a Config object (from gvn.config.Get and modified based on command-line options and environment variables), a gvn.wc.WorkingCopy object (if available), and a gvn.project.Project object thereby discovered.

    Caveats

    It will be better one day.

    Security Considerations

    presubmit allows a user with write access to a directory to execute arbitrary code on the systems of all users who check out that directory and have the presubmit feature enabled.

    Privacy Considerations

    N/A ?

    Standards

    N/A

    Logging Plan

    N/A ?

    Testing Plan

    ??? Test coverage is not so good (39%), but improving all the time.

    Monitoring Plan

    N/A

    Internationalization Plan

    ...

    Documentation Plan

    The code is well-commented and docstringed. Minimal user documentation in the form of the gvn help command is available. We also have a preliminary manual.

    Work Estimates

    ...

    Appendix A: Where We've Been

    Rewrite to fix "awkward flow of concepts" --dchristian

    Subversion has a feature called "changelists", inspired by Perforce's pending changelists. They are not quite the same: a changelist in Subversion is identified by an arbitrary string chosen by the user and is local to a working copy; a pending changelist in Perforce is identified by a globally unique integer in the same series as submitted changelists ("revisions" in Subversion) and is stored in the depot (the actual changed files are local to the client). To use these, gvn would have to send around (changelist name, hostname, working copy path) triples. The versioning and verification requirements would go unmet.

    The most obvious way to satisfy the versioning and verification requirements is to track the change in the Subversion repository rather than only in the working copy; i.e. use a branch. This addresses the last three requirements at once: any user can review the change by diffing branches, changes to the change are versioned because they are commits to the branch, and a tool can diff the branch to the submitted revision to verify the submitted revision is the same as a change that was reviewed. This leaves only the multiple changes requirement, which is a local issue.

    One way to implement a changebranch is to branch from the working copy URL at its base revision and switch the working copy to that branch. This defeats the multiple changes requirement, since the whole working copy is tied to one branch. Additionally, it is very expensive; it requires a slow crawl across the entire working copy, not just the changed files.

    Instead, gvn can switch only the files contained in this change to the branch. This satisfies all requirements, but introduces more issues, among them:

    One obvious way around all this is to keep a separate, "shadow" working copy of the changebranch, so the "real" working copy wouldn't have any switched files at all. This opens whole new cans of worms.

    So, we changed the model: Re-branch at every snapshot. By re-branching every time, all the updating problems vanish (along with a giant pile of code, of which Erik and I were increasingly distrustful (and we weren't even finished writing it!)). Since we don't have any special update requirements, we no longer have to switch. That saves us from all that fragility. Of course, this model brings with it some of its own problems...