bzrlib.dirstate : API documentation

DirState objects record the state of a directory and its bzr metadata.

Pseudo EBNF grammar for the state file. Fields are separated by NULLs, and lines by NL. The field delimiters are ommitted in the grammar, line delimiters are not - this is done for clarity of reading. All string data is in utf8.

MINIKIND = "f" | "d" | "l" | "a" | "r" | "t";
NL = "\n";
NULL = "\0";
WHOLE_NUMBER = {digit}, digit;
BOOLEAN = "y" | "n";
REVISION_ID = a non-empty utf8 string;

dirstate format = header line, full checksum, row count, parent details,
 ghost_details, entries;
header line = "#bazaar dirstate flat format 3", NL;
full checksum = "crc32: ", ["-"], WHOLE_NUMBER, NL;
row count = "num_entries: ", WHOLE_NUMBER, NL;
parent_details = WHOLE NUMBER, {REVISION_ID}* NL;
ghost_details = WHOLE NUMBER, {REVISION_ID}*, NL;
entries = {entry};
entry = entry_key, current_entry_details, {parent_entry_details};
entry_key = dirname,  basename, fileid;
current_entry_details = common_entry_details, working_entry_details;
parent_entry_details = common_entry_details, history_entry_details;
common_entry_details = MINIKIND, fingerprint, size, executable
working_entry_details = packed_stat
history_entry_details = REVISION_ID;
executable = BOOLEAN;
size = WHOLE_NUMBER;
fingerprint = a nonempty utf8 sequence with meaning defined by minikind.

Given this definition, the following is useful to know:

entry (aka row) - all the data for a given key.
entry[0]: The key (dirname, basename, fileid)
entry[0][0]: dirname
entry[0][1]: basename
entry[0][2]: fileid
entry[1]: The tree(s) data for this path and id combination.
entry[1][0]: The current tree
entry[1][1]: The second tree

For an entry for a tree, we have (using tree 0 - current tree) to demonstrate:

entry[1][0][0]: minikind
entry[1][0][1]: fingerprint
entry[1][0][2]: size
entry[1][0][3]: executable
entry[1][0][4]: packed_stat

OR (for non tree-0):

entry[1][1][4]: revision_id

There may be multiple rows at the root, one per id present in the root, so the in memory root row is now:

self._dirblocks[0] -> ('', [entry ...]),

and the entries in there are:

entries[0][0]: ''
entries[0][1]: ''
entries[0][2]: file_id
entries[1][0]: The tree data for the current tree for this fileid at /
etc.

Kinds:

'r' is a relocated entry: This path is not present in this tree with this
    id, but the id can be found at another location. The fingerprint is
    used to point to the target location.
'a' is an absent entry: In that tree the id is not present at this path.
'd' is a directory entry: This path in this tree is a directory with the
    current file id. There is no fingerprint for directories.
'f' is a file entry: As for directory, but it's a file. The fingerprint is
    the sha1 value of the file's canonical form, i.e. after any read
    filters have been applied to the convenience form stored in the working
    tree.
'l' is a symlink entry: As for directory, but a symlink. The fingerprint is
    the link target.
't' is a reference to a nested subtree; the fingerprint is the referenced
    revision.

Ordering:

The entries on disk and in memory are ordered according to the following keys:

directory, as a list of components
filename
file-id

--- Format 1 had the following different definition: ---

rows = dirname, NULL, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL,
    WHOLE NUMBER (* size *), NULL, packed stat, NULL, sha1|symlink target,
    {PARENT ROW}
PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL,
    basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL,
    SHA1

PARENT ROW's are emitted for every parent that is not in the ghosts details line. That is, if the parents are foo, bar, baz, and the ghosts are bar, then each row will have a PARENT ROW for foo and baz, but not for bar.

In any tree, a kind of 'moved' indicates that the fingerprint field (which we treat as opaque data specific to the 'kind' anyway) has the details for the id of this row in that tree.

I'm strongly tempted to add a id->path index as well, but I think that where we need id->path mapping; we also usually read the whole file, so I'm going to skip that for the moment, as we have the ability to locate via bisect any path in any tree, and if we lookup things by path, we can accumulate an id->path mapping as we go, which will tend to match what we looked for.

I plan to implement this asap, so please speak up now to alter/tweak the design - and once we stabilise on this, I'll update the wiki page for it.

The rationale for all this is that we want fast operations for the common case (diff/status/commit/merge on all files) and extremely fast operations for the less common but still occurs a lot status/diff/commit on specific files). Operations on specific files involve a scan for all the children of a path, in every involved tree, which the current format did not accommodate. ----

Design priorities:

Fast end to end use for bzr's top 5 uses cases. (commmit/diff/status/merge/???)
fall back current object model as needed.
scale usably to the largest trees known today - say 50K entries. (mozilla is an example of this)

Locking:

Eventually reuse dirstate objects across locks IFF the dirstate file has not been modified, but will require that we flush/ignore cached stat-hit data because we won't want to restat all files on disk just because a lock was acquired, yet we cannot trust the data after the previous lock was released.

Memory representation:

vector of all directories, and vector of the childen ?
  i.e.
    root_entrie = (direntry for root, [parent_direntries_for_root]),
    dirblocks = [
    ('', ['data for achild', 'data for bchild', 'data for cchild'])
    ('dir', ['achild', 'cchild', 'echild'])
    ]
   - single bisect to find N subtrees from a path spec
   - in-order for serialisation - this is 'dirblock' grouping.
   - insertion of a file '/a' affects only the '/' child-vector, that is, to
     insert 10K elements from scratch does not generates O(N^2) memoves of a
     single vector, rather each individual, which tends to be limited to a
     manageable number. Will scale badly on trees with 10K entries in a
     single directory. compare with Inventory.InventoryDirectory which has
     a dictionary for the children. No bisect capability, can only probe for
     exact matches, or grab all elements and sort.
   - What's the risk of error here? Once we have the base format being processed
     we should have a net win regardless of optimality. So we are going to
     go with what seems reasonable.

open questions:

Maybe we should do a test profile of the core structure - 10K simulated searches/lookups/etc?

Objects for each row? The lifetime of Dirstate objects is current per lock, but see above for possible extensions. The lifetime of a row from a dirstate is expected to be very short in the optimistic case: which we are optimising for. For instance, subtree status will determine from analysis of the disk data what rows need to be examined at all, and will be able to determine from a single row whether that file has altered or not, so we are aiming to process tens of thousands of entries each second within the dirstate context, before exposing anything to the larger codebase. This suggests we want the time for a single file comparison to be < 0.1 milliseconds. That would give us 10000 paths per second processed, and to scale to 100 thousand we'll another order of magnitude to do that. Now, as the lifetime for all unchanged entries is the time to parse, stat the file on disk, and then immediately discard, the overhead of object creation becomes a significant cost.

Figures: Creating a tuple from 3 elements was profiled at 0.0625 microseconds, whereas creating a object which is subclassed from tuple was 0.500 microseconds, and creating an object with 3 elements and slots was 3 microseconds long. 0.1 milliseconds is 100 microseconds, and ideally we'll get down to 10 microseconds for the total processing - having 33% of that be object creation is a huge overhead. There is a potential cost in using tuples within each row which is that the conditional code to do comparisons may be slower than method invocation, but method invocation is known to be slow due to stack frame creation, so avoiding methods in these tight inner loops in unfortunately desirable. We can consider a pyrex version of this with objects in future if desired.

Class	SHA1Provider	An interface for getting sha1s of a file.
Class	DefaultSHA1Provider	A SHA1Provider that reads directly from the filesystem.
Class	DirState	Record directory and metadata state for fast access.
Function	py_update_entry	Update the entry based on what is actually on disk.
Class	ProcessEntryPython	No class docstring; 5/7 methods documented

Parameters	state	The dirstate this entry is in.
	entry	This is the dirblock entry for the file in question.
	abspath	The path on disk for this file.
	stat_value	The stat value done on the path.
Returns	None, or The sha1 hexdigest of the file (40 bytes) or link target of a symlink.

b.dirstate : module documentation