Introduction

This is probably the most active bit of the site, i post random thoughts, events and articles to this blog. It is ridden with personal issues and technical items alike, while being constantly out of date. Read at your own risk. If you are interested in my person (odd enough as that may sound), probably read something about me. If you are only interested in technical articles, see tech. Poetry has a completely separate blog of its own right. Older articles (as well as postings on the previous blogs i had) are to be found in archive. That’s probably it.

Recent Entries

26.08.2011: soc reloaded: outcomes

soc reloaded: Outcomes

This blog is kind of a final report for the summer. I had a progress report drafted for about two weeks, so let me paste that here just for the record. To read about the actual results, please skip to the “State of the Code” section below.

The Last Progress Report

This is the second (and last) progress update for this summer of code project. It was written something like a week before the pencil’s down, but I got disheartened and instead of finishing and posting, I went on to code some more. Here it goes…

Since my last report, I have decided to turn somewhat more radical again. The original plan was to stick with the darcs codebase and do most (all) of the work within that, based primarily on writing tests for the testsuite and not exposing anything of the new functionality in a user-visible fashion. I changed my mind about this. The main reason was that the test environment, as it is, makes certain properties hard to express: a typical test-suite works with assertions (HUnit) and invariants (QC). In this environment, expressing ideas like “the displayed patches are aesthetically pleasing” or “the files in the repository have reasonable shape” is impractical at best.

An alternative would have been to make myself a playground using the darcs library to expose the new code. But the fact is, our current codebase is entrenched in all kinds of legacy issues, like handling filenames and duplicated code. It makes the experimenter’s life harder than necessary, and it also involves rebuilding a whole lot of code that I never use, over and over.

All in all, I made a somewhat bold decision to cut everything that lived under Darcs.Patch (plus a few dependencies, as few as possible) into a new library, which I named patchlib, in the best tradition of cmdlib, pathlib and fslib. At that point, I also removed custom file path handling from that portion of code, removed the use of a custom Printer (a pretty-printer implementation) module and a made few other incompatible changes.

Of course, the testing code went along. The net result, at least for me, was an ability to build and test a much smaller piece of self-contained code. It also allowed me to experiment with APIs a bit, where those were used all over darcs, which made it, within the big codebase, impossible to advance without expending disproportionate amount of work on every change. Of course, part of that will be paid back when we decide to port darcs over to use patchlib.

I originally planned this report for the start of this week, but then I got caught in a big refactor of the ApplyMonad/Apply classes (again). The refactor was triggered by the need to pretty-print patches, which is not a completely easy task (made more complicated by the fact that UUIDs are meaningless for the user, so formatting patches without context is essentially useless now). Anyway, I am now much happier with how the ApplyMonad class looks (the ApplyMonadBase thing was genuinely hideous… good riddance). As a net result the ApplyMonad class and, even more importantly, the ApplyMonad transformer (used in applyTo{Tree,State} among other things) is substantially easier to use on the client side (while maybe very slightly harder on the provider side, it is also much clearer, IMHO). Overall win.

As for formatting and summarising patches, I have created a new Display class (I plan to nuke the existing viewing classes more or less, when I manage to make meaningful Display instances for V1 Prim). The API lets you format patches based on their ending or starting context. Since bits of the patches need to be fetched from the hashed store, the display needs to run in a LoadMonad. Since any reasonable patch formatter also needs to pass state around (and since the actual type of state passed around will be different for different Prim implementations, we hide the state in a type family of monad transformers; this also opens the option to use something else than StateT when appropriate).

(This is the end of the “progress” post. The remainder more or less describes the end state of the project.)

State of the Code

You can look at the code in my darcs repositories. Specifically, to play around with a “demo”, you need pathlib, fslib, patchlib, cmdlib and gorsvet. A bit more about the individual libraries:

About patchlib

One of the main problems of the darcs codebase today is insufficient modularity, and patchlib is an experiment in an attempt to address that concern. Efforts to bring about significant improvement of the situation from “inside” (by restructuring existing code) have to date failed. Even though there have been local improvements, the overall problem stubbornly persists.

Hand in hand with modularity problems come issues with unclear and underspecified (both internal end external) APIs. Since the separation between different components of darcs is blurry at best, the pressure to introduce clean, testable interfaces is minimal. The external library, on the other hand, is forced to put up a presentable façade. Luckily, the Patch subhierarchy in darcs is, compared to remainder of darcs, in a fairly good shape in this respect.

Eventually, patchlib should provide leverage to work with darcs(-style) patches, including at least:

With a homogenous set of APIs, mostly mediated by type classes and appropriate instances, to:

About V3 Prims in patchlib

The version 3 primitives in their current incarnation in patchlib have the following traits (based on the list from the proposal):

About gorsvet

Gorsvet is a toy implementation of a repository layer that uses V3 primitive patches. (Un)fortunately, the V3 prims violate fundamental assumptions made by the repository and command layers of darcs, which means that an integration is substantially more expensive than fits a single SoC project. However, as outlined above, it is useful to be able to play around with the V3 prim patches in a realistic environment. Therefore, gorsvet. I made it into a rather thin user shell based on cmdlib and a prototype repository layer. The UI more or less resembles darcs (without interactivity, since that’d be superfluous for a tool of this scope). You are of course welcome to try the experimental tool out: the online help should give you an idea how to use it.

One thing I would like to discuss a bit here is the repository format. Since the patch types are incompatible anyway, we are fully liberated from backward compatibility considerations. The next darcs repository format can be designed from scratch, keeping in mind the shortcomings of the previous two formats. The implementation in gorsvet is a peek at what the result might look like. Anyway, we still need a better “composite” patch layer, which represents conflicts (and sits one level above primitive patches), since the current (version 2) composite patches in darcs are quite unsatisfactory. That also means we have plenty of time to play around with the repository format, which is more or less independent of the composite patch format.

As for the format itself, I went for as simple as possible (but no simpler). So far, I have 2 files and a hashed store: .gorsvet/hashed is a sha256-based, uncompressed “dumping grounds” for stuff of all kinds. The files are (not implemented as of this writing) .gorsvet/index (with the same purpose as _darcs/index — fast, efficient working copy access) and .gorsvet/meta — a small set of root pointers. This has two very beneficial side effects: very atomic oupdates and transactional semantics (free transaction rollback). Compression and garbage collection of the hashed store can then be sorted out separately (and does not affect semnatics of the repository either way). There are currently 4 root pointers in meta: shadow, pristine, inventory and order. The pristine is the same thing as in darcs, shadow is a similar thing, but at any time reflects the current state of the working copy (it is automatically updated from working every time, before anything else happens with the repository). The reason is mainly that the working copy has entirely wrong semantics for diffing V3 primitive patches: most importantly, UUID tracking is implemented through shadow.

The remaining two root pointers represent two new data structures (and inventory is different from what darcs calls an inventory). The order simply lists patchinfo hashes (a handle for each patch that does not change through commutation) in the application order for the repository: in this sense, order replaces hashed_inventory known from darcs. It is pretty compact, but on its own also useless (since it give us no way to get the patches themselves). The inventory then, on the other hand, is an efficient map (currently a sorted pair list, written and read as a Data.Map) from patchinfo hashes to the current patch storage hashes. Moreover, the patchinfo objects are, unlike in darcs, stored in the hashed store as separate entities, and the patchinfo hash can be used to efficiently fetch the patchinfo itself. Therefore, the named patch can be assembled from inventory and order puts the patch into correct context.

Apart from storing and showing things, I have also implemented a “pull” command for gorsvet, but it’s currently fairly unusable, since any conflict automatically means failure (there is no layer to handle conflicts; we could use version 2 darcs patches, but I think it would constitute a dangerous slippery slope: we definitely want a more solid implementation, and also want to avoid a double transition).

Future Work

The obvious future work lies in the conflict handling. There are two main options in this regard: either re-engineer a patch-level, commute-based representation of conflicts (in the spirit of mergers and conflictors), as V3 “composite” patches, or alternatively, use a non-patch based mechanism for tracking conflicts and resolutions. It’s still somewhat early to decide which is a better choice, and they come with different trade-offs. Nevertheless, the decision, and the implementation, constitute a major step towards darcs 3.

The other major piece of work that remains is the repository format: in this area, I have done some research in both the previous and this year’s project, but there are no definitive answers, even less an implementation. I think we now have a number of good ideas on how to approach this. We do need to sort out a few issues though, and the decision on the conflict layer also influences the shape of the repository.

Each of these two open problems is probably about the size of an ambitious SoC project. On top of that, a lot of integration work needs to happen to actually make real use of the advancements. We shall see how much time and resources can be found for advancing this cause, but I am relatively optimistic: the primitive level has turned out fairly well, and to me it seems that shedding the shackles of legacy code sprawl can boost the project as a whole significantly forward.

(PS: While I agree, on the theoretical level, that nuking significant amounts of legacy code carries non-trivial risks, for a small volunteer project like darcs it is imperative to be fun. And a trench war against legacy code is not fun. Writing new things and exploring possibilities, on the other hand, is fun. Which means we need a bit more of the latter and a bit less of the former, even though the project sits in the more conservative camp — afterall, we handle rather precious data… Even when taking natural conservativism into account, a well-motivated, honest subsystem rewrite is better than a half-hearted, someone-has-to-do-it maintenance of a piece of code that everyone hates…)

posted 26.08.2011 1:40 pm tags:

12.07.2011: soc reloaded: progress 1

soc reloaded: progress 1

Oops! There has been no update for a long while from me, although I have been busy with code/patches. So far, I have tackled two areas: generalising and cleaning up the existing Patch testsuite, so I could apply it to the in-progress V3 Prim code later. This has been quite successful, although it took a little longer than I would have liked. With the new structure, (QC) properties for single patches, commutes and merges can be applied to any concrete patch type that supports the respective operations. Therefore, I now have coverage of both V3 Prims as standalone patches (for single patch and commute properties) and also when used with RealPatch (the implementation of non-primitive darcs 2 patches).

The latter part was then to make all these tests pass. Since I finished the respective Read/Show instances yesterday, all tests pass. Commute, apply and friends have been done couple days before that. So the next step is to write more tests that can demonstrate where the code needs to be augmented.

Hunk storage

I have slightly changed the actual (on-disk) hunk format slightly. For now, “detached” hunk-text storage is not quite supported, I am keeping that post-midterms. But the format still counts on that being possible. We do need a new monad (class) for writing patches though, since the Show instance is somewhat inadequate: the detached storage needs to be handled somehow.

Anyway, the format now looks like this:

hunk NNN .whitespace_encoded_old .new_text

(we use the same method for encoding whitespace as we do in filenames here). We might want to change to a format that’s faster to produce/parse, since hunks typically do contain whitespace. On the other hand, only very short hunks will be encoded in this form. Also, an empty string is encoded as “!”. So the hunk text (old and new) can take following forms: .whitespace_encoded, ! (empty) or “@”. The last form, @ should take 2 parameters, a hash and length, like e.g. this: @123456789ABCD<...>:65000. That means that most of the time, we can ignore the hash and only fool around with the numbers (offset and lengths) when commuting patches. Of course, applying them is a different matter: nevertheless, we still do have a substantial advantage over V1 prims there, since each hunk is a simple splice/catenate operation on bytestrings. With V1 prims, we had to chop up the hunk at newlines and remove the +/- signs.

As for implementation, this means we need to abstract commute over a monad class, which besides commutation failure can express an “fetch text for this hash” operation. This might be simpler than it was in the Apply case, a lot of code had to be modified to accomodate for V3 Prims, since the existing commute runs in the Maybe monad and we can make Maybe an instance of CommuteMonad. Nevertheless, to make actual use of this, toplevel code that runs commutes will definitely need to be modified, and in effect, all the intermediate code that relies on Maybe for commutation will need to be modified as well. This part could become actually more hairy than was the case with Apply.

The question of coalesce

Apart from apply and commute, there is currently one more “core” operation in Darcs patches: coalesce. This operation takes two primitive patches and decides whether they can be merged into a single primitive patch. This can only happen if the patches do not commute. Unfortunately, coalesce does not preserve commutation behaviour: move a b :> move b c gets coalesced into move a c, which modifies its commutation behaviour with (add, remove, move) patches mentioning b. On the other hand, coalesce is normally only used during operations like amend-record, rebase and when handling the “pending” patch. All these operations modify the identity of a patch, and therefore it shouldn’t matter much that coalesce fails to commute with commute.

On the downside, coalesce is currently a first-class operation that cannot be derived from the remaining. Most importantly, it is “redundant” with the diff operation, that constructs patches from two states. The problem is that with current (V1) prims, there is no diff operation for some patches. If we had reliable diff for all patch types, we could implement coalesce in terms of diff, commute and apply (pseudocode):

coalesce context (a :> b) | isJust (commute (a :> b)) =
    diff context (apply context $ a :> b)
                          | otherwise = a :> b

This would work as long as there was a deterministic diff operation, i.e. one with the property (for a being a primitive patch) that diff ctx (apply ctx a) == a for all a. This diff operation does not need (and indeed cannot be made) universal over different patch types, but fortunately that doesn’t matter. We can always call it with a specific patch type as one of its inputs:

diff TextHunk ctx1 ctx2

I believe this operation should be possible to have, and it would also allow significant improvements in the “record” user experience: darcs could try various differs on a pair of states and offer “better” patches than just hunks (like eg. replace, move, etc.).

The ability to implement coalesce in terms of other operations is important because even more than commute, the implementation of coalesce is O(n^2), with n being the number of different primitive patch types: it needs to take into account any pair of types. With the above approach, since coalesce always yields a single resulting patch, we can implement it as follows:

An UI digression

This is not GSoC related: you can skip this section if all you care about is GSoC…

Anyway, when I am talking about user interface. I think a substantial improvement in the way “semantic” patches work would come with the ability to infer those patches “automatically”. In the spirit of the above diff operation that is available with every patch type. However, it is hard to do a semantic “diff” on significantly divergent repository states. This is simply because further changes obscure the relationships of entities in question. When the only thing that changes is a “mv”, it is easy to detect. But when you also edit the file, it is no longer possible to tell for sure if this is a move or a new object.

What would help substantially, then, would be to be able to run diff much more often. This prompts a workflow change. Of course, we cannot ask people to commit every time they do a small change. However, we could ask them to “amend” an “in-progress” patch when they do. This would be especially useful if they can be coached into stashing their changes before and after things like applying “sed” to whole codebase, moving around files etc. This would be basically a supercharged version of “darcs mv”: it would say “I did something, you figure what is the right way to represent it”. It adds the burden of having to call this both before and after the “contentious” operation. But it also adds significant benefits (IMHO).

The other thing that would this kind of “open” patch (that you keep adding things to, until you are satisfied, and then you commit) allow is progressive (time-sensitive) revert. This is something that I would be completely sold on: if I kept telling the VCS to note down my changes reasonably often, I could get, in exchange, a whole-repository (but still granular) undo operation.

(It is not hard to imagine that you could also have more than one “open” patch at a time, sorting changes into different buckets for semi-related changes. The UI for that one would be more tricky though.)

The story of Apply

To get back to GSoC though. For what it’s worth, the test-suite part of the work and a sketch of the V3 prim implementation is already in the screened branch. The changes to the Apply class are almost ready for getting into screened as well; they are currently available as patch635.

The basic challenge with Apply was that V1 prims and V3 prims apply to different kinds of things. While V1 applies to a filesystem tree (basically your working copy), this is not the case with V3 anymore. The V3 state is modelled as a map from UUIDs to Objects. However, it would be extremely uncool to have two different “Apply” type classes to apply different kinds of patches: this would also mean that all higher layers of patch code would need to duplicate their apply implementations. Not cool.

Therefore, associated types to the rescue: I have added an ApplyState associated type to the Apply class. The Prim level patches then decide what they can be applied to (currently a Tree, from hashed-storage, for V1 and an ObjectMap, which still needs to be fleshed out in more detail, for V3). Any higher levels inherit their apply state from the prims. Cool.

Of course, it’s not as simple, since we actually have to implement that cool apply method. This was traditionally (well, since I merged my new annotate code; not that traditional I guess) been done through the ApplyMonad class. Now ApplyMonad used to have operations like “create a file”, “create a directory” or “write this bytestring to this file”. That’s cool for V1 prims, but not very useful for V3 prims. So ApplyMonad needed to be generalised over multiple apply states. This forced a multi-parameter type class, since there are no functional dependencies to save us (and therefore associated types do not apply either). This is because we expect some monads (e.g. IO) to be instances of both ObjectMap- and Tree-based ApplyMonad. In general, I didn’t want to limit this to special monads, although we might go for that option later if it turns out to be superior.

Anyway, the ApplyMonad class is a bit of a meta-class, since the actually useful set operations is different for different apply state representations. But since the methods can carry their own constraints, I have added a couple of fully generic methods (get current state, set current state and the like) and a set that only applies to ObjectMap and one that only applies to Tree. This doesn’t seem to pose significant trouble. Haskell doesn’t seem to have higher-order type classes that would solve this maybe slightly more elegantly. (I don’t even know if they are possible. Don’t crucify me if they aren’t.)

Anyway, long story short, we now have a single apply method in a single Apply class, that works on both V1 and V3 prims, as witnessed by running the same set of tests, which sometimes do invoke apply, on both implementations.

The story of Commute

There isn’t much of a story behind this one so far. As I outlined above, there are things coming here as well, but they are not required to allow V3 prims per se: only needed for the detached storage optimisation. The commute as it is implemented works, at least as far as tests are concerned. This goes hand in hand with some kind of StoreMonad and LoadMonad abstraction, that will actually allow us to implement the detached storage. The CommuteMonad can then be a

class (LoadMonad m) => CommuteMonad m where
    commuteFail :: m a

kind of deal. A LoadMonad superclass constraint can (should) appear on ApplyMonad as well. For now, the instances don’t need to be too elaborate (they can simply fail to fetch anything at all, which will work just fine for V1 prims).

The story of unsafeInterleaveIO

I don’t like unsafeInterleaveIO. At all. My last summer of code was, after a fashion, about removing a significant source of unsafe, ugly and outright dangerous lazy IO from darcs. I believe it was a significant success. Now the LoadMonad / StoreMonad abstraction has a potential to rid us of another source of lazy IO in darcs: currently, the reading of patch content is unsafeInterleaveIO`d and the rest of the code treats the repository as a kind-of-pure data structure built of patches. Of course, the unsafeInterleaveIO is unsafe because it breaks the type system. Since darcs uses it a lot, there is no telling which value is actually pure. Arguing about runtime behaviour of lazy code is hard enough when it is actually pure. Random IO thunks lurking inside pure values (kind of like an alien in Sigourney’s stomach) turn it into a nightmare.

This will involve a lot of code being lifted into a monad, fortunately a significantly restricted monad. In practice, it’ll be IO more often than not (although in the testsuite, it’ll probably be the like of Maybe, or StateT Maybe). What matters is not that the code will execute in IO, but it is statically well protected: it cannot access the IO monad and it has to be explicit about side effects (like loading things from disk). Therefore, we give things types that actually reflect impurity, without allowing them to spin out of control with side effects (like they could if they were simply in the IO monad). A static type system win.

Conclusion

Ok, I guess this is more than enough for today. I’ll try to keep you folks more informed about the progress in the second half of the endeavour. On the other hand, I am a bit worried that these posts are more useful as a note-to-self resource than for general reading by others. Well, let’s hope, dear reader, that you found at least something of this post useful and/or interesting.

posted 12.07.2011 12:27 pm tags:

24.05.2011: soc reloaded: darcs primitive patches version 3

soc reloaded: darcs primitive patches version 3

This year, I have accepted Eric’s invitation to submit a proposal to Google Summer of Code again. I am not going to repeat the proposal itself here, so please read that as well. This post is more about filling in the technical details and setting out a plan of work.

First a clarification: even though there are no V2 prims, I call those V3, because the V1 prims have slightly (and somewhat confusingly) different semantics in Darcs1 and Darcs2 repositories; if nothing else filename encoding has changed incompatibly. There have been some commute rule changes as well, although I am not sure this wasn’t retroactively changed even for Darcs1 repos. Not important anyway.

In the rest of the post, I will lay out how I anticipate the V3 prims to work and further what I intend to implement and roughly how.

Objects

Primitive patches shall operate on a collection of abstract objects (which define the “pristine” state), each object in the collection being uniquely identified. Objects come into existence the first time they are referenced and they are never destroyed. We assign a type to each object (and object patches get corresponding arrow types).

I imagine there would be a few object types: binary, text, directory. We can add “bugs” objects and stuff like that later. The patch types should be monomorphic to simplify things. We can share implementations between different patch types if they are identical apart from their type.

Directories

A directory object is a map from names (strings) to object ids. (I say map and not bimap since there seems no good reason to prevent multiple manifestations of a single object.) We should however take care to avoid loops in the structure and such. We could even tie hardlinking to this, although that’s probably pretty useless in practice. We definitely should take care to avoid loops and similar abominations in the directory structure.

Among other repository properties, we keep a “root” object — this is the UUID of a (directory) object that represents the root of the working copy of the repository. The directory can map names to things, like text or binary files, or other directories.

Akin to the “root” object, we may want to keep track of a “preferences” object as well. Again, this would be just an UUID of a directory object. The directory object could then list individual preference files.

Patches

Some examples:

Patches of different types on the same object clearly don’t need to commute. If there is a binary -> binary patch and a text -> text patch affecting the same object, they can never change their order. In fact, a -> b patches for a != b can’t realistically commute with any a -> a patch. This should drastically reduce the exponential number of commute rules we’d otherwise need to deal with, and should make the primitive commute function much more modular. In fact, only a -> a patches for same a become involved in the exponential commute definition blowup. This should be manageable.

Moreover, if we impose a map from patches to the object they affect, we can also trivially commute patches that affect different objects. We will need to generalise this later, however, since some patch types may involve multiple objects (even though our type system can’t handle that yet, either), or even involve a list of objects variable under commute.

Patch application needs to obey the type restrictions of course.

We will possibly want to attach a UUID to each primitive patch as well, so it can be readily identified. Of course, this increases the space overhead, but presumably not exceedingly so.

Hunks

The basic patch type is the hunk: the representation may be identical for both binary and text objects. What is not the same is how binary and text hunks are obtained. For text objects, we should use a whitespace-sensitive diffing algorithm (line diff, most likely; either the one we already have in darcs, or alternatively patience diff). For binary objects, we can use one of the binary delta algorithms. It may be prudent to disallow commute for binary hunks, too.

The format is still not defined, although there is a “first shot” at http://web.mornfall.net/blog/patch_formats.html. But in the end, we probably want a somewhat different format anyway, or an additional hunk type, because we apparently want both removal and addition to happen in a single Prim, for commute to make more sense. So the format could instead look like:

hunk 123 "old text" "new text"

or

hunk 123 old_hash new_hash

with the quoted-string version being a text-escaped, “inline” variant of the patch to be used when length of old\_text + new\_text is less than two times the length of a hash. Even though patches of this textual form are not of constant byte width, that doesn’t matter since other patch types cannot be made to be that way either (like replace, or manifest). Any “substantial” text that is the actual problematic part to parse is moved away by indirecting it through hashes.

Multi-object patches

Until now, we restricted ourselves to patches that affect a single object. This may be genuinely impractical for patches that move around things, be it complete files (move) or pieces of content (hunk move). We want such patches to commute as a single unit, either commuting completely or not at all. This could be achieved differently, by adding a concept of atomic patch group. I am not entirely sure if that is right or not, but it currently seems like the more complicated option.

Therefore, we can go on adding multi-object patches. Presumably, the correct type would be (a, b) -> (c, d). Most commonly of course (i.e. in the two abovementioned cases), this would end up being (a, a) -> (a, a).

Generic commute rules

Let’s assume a function

touches :: Prim -> [UUID]

we can say that

commute (a :> b) | null (touches a `intersect` touches b) = (b :> a)

Now we can also add the type restrictions. We demand that for each touched object, all the types in both patches match for the commute to be allowed.

commute (a :> b) | not (a `typematch` b) = fail

Where

typematch a b = all match (touches a `union` touches b)
    where match x | type a x /= type b x = False
                  | fst (type a x) /= snd (type a x) = False
                  | fst (type b x) /= snd (type b x) = False
                  | otherwise = True

With type :: Patch -> UUID -> (ObjectType, ObjectType).

A few more restrictions will need to be applied in case we need a patch type that may operate on object collections such as a “subtree”, whose composition can change in time. I have a rather vague idea that some kind of general mechanism could be used for designating such an object collection could be used, which could then be built into the framework, making it also possible to define that kind of patches without introducing exponentially big commute definitions.

Adding new patch/object types

With the generic commute rules, it becomes possible (and easy) to add new object types and corresponding patch types to the system, without ending up in an exponential tangle. One such object type could be “haskell” (holding a representation of a Haskell AST), or “bug” (for in-repo bug tracker, ala bugs everywhere). Another useful object type could be “set”, keeping a sorted set of lines, or a “changelog”, keeping a timestamped list of textual chunks.

Of course, there are other options to achieve a similar effect, but I like how this “falls out” more or less naturally from the design.

Optimisations

The suggested patch representation allows for some optimisations in the way patches are stored. This could include per-object buckets, detached hunk storage (which is more or less implied by the hunk format) or the like. Per-object buckets are slightly complicated by multi-object patches, but probably not ruled out. With per-object buckets, UUIDs based on hashing and minimal context for prims may be feasible, granting a number of desirable security and convenience properties to the system as a whole.

Issues

There’s a couple of issues that have been raised about the proposed design so far. I think the major one is that merge behaviour might be somewhat surprising in some respects, especially if root and preference object UUIDs are picked fresh, instead of being hardcoded: it is not clear how to manage these cases. On the repository level, any two repositories with disjoint sets of live UUIDs merge cleanly. However, they don’t necessarily merge cleanly on the working directory level.

The most counter-intuitive case seems to be when you initialise an empty repository and then immediately pull from somewhere: you get a conflict here. A possible solution: darcs init should create a repo with no objects. Any and all objects come into existence through patches. And once you have two patches that add a root object, you get a reasonable conflict that can be reasonably resolved. (The preferred solution probably entailing recursive merge of the two trees, with conflicting leaves being renamed out of way; this would constitute “conflict markup” and would need to be explicitly recorded, of course, like with other conflicts… another desirable solution may be renaming one of the roots and making it a child of the other, confirming the other as the repository root.)

About preferences, that’s a bit more tricky. The tradeoff is slightly different, since pivoting repository roots sanely is probably significantly more valuable than it is for preferences. More discussion is probably needed on how to exactly arrange this, but the exact details are not important for the project. Most of the available options can be implemented relatively painlessly on top of the proposed infrastructure.

Battle Plan

So in the above, I have outlined the design of what I intend to implement. I think I am now reasonably happy with the abstract level of design. There is of course a number of implementation challenges to be solved.

Nevertheless, since it is really getting late now, so I will keep the “battle plan” for another post. I intend to spell out roughly what and how I intend to do on the implementation level, and break this down into a couple of phases, time-wise. A very quick summary for now…

The core of the implementation work will happen directly in the darcs repository, since each “version” of primitive patches comes with a separate module hierarchy (even if there is only V1 so far). I can work in Darcs.Patch.Prim.V3 more or less without disturbing other work or the release process. I expect that coverage of the newly written code will come mostly from QC and unit tests, which will likely live in the darcs source tree as well.

An external, experimental “client” may be an option, depending on how things go. It would probably implement a simple repository format without any of the higher level patch types (i.e. no conflict handling etc.), mostly for demonstration purposes.

A semi-independent vein of the implementation work would entail pushing out lazy IO out of the patch code in darcs per se. This is tied in with decoupling the hunk content (the text or binary blobs) from the patches themselves in memory. Even though it may well be possible to achieve using IO interleaving, an explicit approach will be more transparent, presumably much easier to reason about and, ultimately, debug, than the current lazy IO code.

posted 24.05.2011 12:47 am tags:

20.09.2010: italy

Italy

Yesterday, I have returned from a trip to Italy. The main event was the SEFM (Software Engineering & Formal Methods) 2010 conference, held in CNR in Pisa, the side event was a visit to Florence, Pisa, Lucca, Bologna and Venice. Although I have been initially a bit sceptical (as I usually am), partly due to travelling alone, the trip turned out to be one of the best thing that happened to me recently. I had really an awesome time, met lots of great people, had lots of fun, the weather was fabulous… The cities of northern Italy are beautiful, the conference was great as well — in particular, I really enjoyed both the tutorials I attended: first about Orc by Jayadev Misra and the second by David Harel about computational biology was even better. David really is a great speaker and the topic is quite intriguing as well.

Florence & Arrival

The choice of taking a train (instead of flying) turned out to be quite lucky one, since it enabled me to stop at various places along the way. I took an afternoon connection to Wien on Saturday and from there a night train to Florence, where I arrived around 6:30 am on Sunday. The city has been quite empty in the early morning, so I had a chance to walk around the deserted city centre during sunrise. I even managed to peek into the Basilica di Santa Maria del Fiore before leaving. Overall, the architecture of Italian cities is really nice to look at.

Anyway, in the afternoon I took a train to Pisa, checked in at the hotel and spent part of the afternoon trying to find an open grocery store. Traversed a good part of residential areas of Pisa, finally finding an open shop near the railway station. Thus refreshed, I looked around a bit more, going through the nicer historical parts of the town and crashing back at the hotel later in the evening.

Pisa, The Conference

Next day was the first conference day, mostly filled by tutorials and chatting with other attendees. I have made acquaintances with a few fellow PhD students and researchers from various parts of the world (including two colleagues from Prague). On Tuesday, we had the first set of conference sessions and in the evening, we had a dinner in a local pizzeria with Emilia (Technion, Israel but originally from St. Peterburg) and Abel (BME, Hungary) and Faraz (UCF, US, originally from India), and after the dinner I walked around Pisa with Faraz a bit. We went to see the Piazza dei Miracoli at night, with its Leaning Tower and the Duomo di Pisa. More sessions followed on Wednesday and in the afternoon, we attended the social event, which consisted of a trip to Lucca, where we were received by a guide and have been shown around the city, which has been really nice as well, and the guide did a really good job too. The trip was followed by a dinner in a villa outside of Lucca — the only complaint I could possibly raise about it, I guess, was that there was actually too much food to handle. Which is not much of a complaint really. Most of the afternoon, I spent with Natallia, who’s currently a post-doc researcher at CWI (Netherlands), has a PhD from Trento and a masters from her native Belarus. I much enjoyed both the event and the company.

On Thursday, the model checking session of the conference took place — this being the session where I gave my presentation as well. The fact that this was in the morning after the social event was somewhat inconvenient, but I think overall it went quite smoothly and I am fairly happy about it. In fact, a couple people later told me that they enjoyed the talk, which made me really quite happy. That is the sort of thing that can motivate you to continue the hard work (cough) and all. I had a bit of a conversation with Kirsten Winter (ITEE / University of Queensland, Australia) and later Dimitra Giannakopoulou (NASA, US) and both were very encouraging of my current and (hopefully) future work. I suppose that is as much an incentive to carry through with one’s PhD as one can hope for.

Moreover, a couple of people have asked me about the Red Hat funding for my formal methods research and whether and how the cooperation works. I have explained my plan for applying my research on the real code that I work on, which has met with a lot of approval and enthusiasm. I certainly hope that the goals are attainable and that it will be possible for me to continue to work on this as my PhD continues. There is definitely a lot of interest from the academia in seeing further real-world formal method applications.

Departure, Bologna & Venice

Later that day, I went to the town centre for a bit, and to take some daylight photos… Met later for another dinner with Emilia and Abel (same place, that’s how conservative we are), walk around the town, and a bit of Italian ice-cream. That was the last evening in Pisa, and I crashed to bed and in the morning packed up and checked out.

I had arranged previously with Enrico, a long time friend of mine and also a fellow Debian Developer, to meet in Bologna. I took a noon train to Florence, then Prato and finally arrived in Bologna in the afternoon, where Enrico and Yuwei picked me up. They walked me through the city centre of Bologna and shown me through some nice hidden places (which I would never think of visiting, were there not for my experienced guides). Among other things, we have seen a medic’s lecture room from 17th century, a city council art gallery and visited the San Petronio Basilica, one of the largest churches in the world, built throughout the 14th and 15th centuries. Later in the evening, we had a dinner with members of the local amateur astronomy club (which turned out to be more of an all-sorts-geek club) and a tour of their observatory-museum. My thanks to Enrico, Yuwei and also the members of the club for another great evening. We finally retreated to Enrico’s place (where he stays while in Bologna, and where his parents live).

The next day (that is Saturday) was split in two parts, first I spent with Enrico and Yuwei and stayed for lunch with Enrico’s parents (which was very nice!). In the afternoon, Enrico drove me to a nearby railway station and I took a train to Venice. I very much enjoyed the stay in Bologna, in a big part thanks to my hosts who really went out of their way to make my stay pleasant: thanks again, and looking forward to seeing you again.

After lunch, I departed for Venice, which is about 2 hours of a train ride away from Bologna — I arrived around 4 pm, to the Venezia Santa Lucia station. The weather (which was sunny and pleasantly warm until then) went on a strike though, so it was overcast and every now and when it went on to rain for a while. Overall, I guess that was sort of a mixed blessing: on one hand, it was a bit cold and wet (and I wasn’t very well equipped for that), on the other, there probably weren’t as many tourists in the city as there would be on a sunny day. Sadly, the San Marco was already closed when I arrived, so I had to omit that one. Nevertheless, I have seen most of the city, also quite a few remote bits of it that were virtually deserted on a Saturday without sun (like the seaside wall of the Arsenal, where I did not meet a single person).

Back Home

Late in the evening (around half past midnight), I took a train back to mainland (Mestre) where I boarded a night train to Wien. Finally landed in Brno a bit before noon.

Overall, as I already said a couple of times, it was a very nice trip. The late summer in Italy and meeting everyone was amazing. I am of course a bit sad and melancholic since it’s all over, but then, I am off to Netherlands in a week for another conference, which will be probably very different, but hopefully also nice. Sadly, due to all the travelling, I am going to have some issues at the conservatory (I still don’t have a viable timetable, eg.) but nevertheless, it was definitely worth it. I also hope to meet the folks that I have befriended again at some point — which doesn’t seem that unlikely, since Abel is actually attending the same conference that I am, in Netherlands next week.

The trip also sort of made me realize that my attachment to this place (which I call home, more or less) is not as tight as I have been thinking. It may, after all, be a good idea to move elsewhere, maybe for a couple of months or maybe a year or two. But that it is still a long way from here.

posted 20.09.2010 2:20 pm

14.08.2009: soc final report

soc final report

As you may have noticed, I skipped last progress report. That’s because I was busy with coding work. Anyway, let’s summarise what happened throughout the GSoC program. The basic idea was to improve darcs support for large repositories — large in terms of number of working files. The primary goal was to improve the code that diffs the working tree against the pristine cache: an operation, that is very common, and in darcs 2.2 and earlier was also painfully slow (depending on circumstances).

Deliverables

So what has been delivered? There is a number of outputs of the project:

The first two items on the list are done for good. The diffing code that was a major goal of the project is stable and part of a stable darcs release. It is supported by a independent hackage library, hashed-storage.

For efficient retrieval over HTTP… there is support for creating and using hashed packs in hashed-storage 0.4. This still needs some work to be put in practice, both on darcs-hs and on hashed-storage. I will continue working on this in the fall. The current status of code in hashed-storage 0.4 is basically proof-of-concept (comes with unit tests though). There is also a concept document, designing storage for darcs. Either way, I am trying to not rush things on this, since once we put this into the stable darcs, we will have to live with it for a long time, and mistakes now would be expensive later.

As for darcs-benchmark — I have uploaded version 0.1 to hackage and it should be ready for wider circulation. Please try it out and let me know if you have any issues with it (cabal install darcs-benchmark).

Work in Progress

Since pencils down is approaching quickly, there will stay a number of in-progress items. There are still some bugs to fix in darcs-hs, and hashed-storage needs a haddock cleanup and at least one pass over the API before I can upload 0.4. But let’s see what 0.4 brings over 0.3.7. During a review process that Ganesh started, we have identified a number of weak spots in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.

On the API level, we have generalised the Tree out of IO into a generic Monad, to help with testing infrastructure. The Index API has been simplified and, more importantly, made much safer. With this done, it turned out that the index reading code is needlessly general, and I have simplified it, improving index performance along the way (in the process, I also changed the layout… twice… the magic word versioning mechanism really paid off).

I have also refactored the Hash type, and removed all the hacks that deal with the size prefix on darcs hashes. This means we won’t be able to write out new trees of this kind anymore (but no big loss, old darcs can read repositories without the size prefix as well). Along with the original Hash refactor, I have changed the internal hash representation, cutting hash length from 64 to 32 bytes (this is still SHA256, but is using full 8 bits of every byte, unlike ascii-hex which carries 4 bits of useful data per byte). This representation is also used in the index.

The overall result is, that the index size has shrunk considerably — it is now, on average, smaller than a git index for equivalent repository, even though we use stronger hashes (but yes, git index is more than a simple cache).

The even more interesting outcome is that, with darcs-hs, I can now run “whatsnew” on a 100k-file repository in 0.35 seconds — a mere .1 second (or 40%) slower than equivalent git diff. With 100% Haskell (well, the sha256 implementation is in C, but this code is not tripped in the optimal case). For most realistic repository sizes, the difference is going to be negligible. Oh, and to get an idea how big a 100k-file repository is, the Linux kernel tree has about 28k files in it.

Future Work

Recently, it has been noted, that for old-fashioned repositories, darcs whatsnew has regressed in performance by a fair amount. This is true, but it’s not clear if it is worth addressing properly. For 2.3.1, the easy fix is to restore the old code path when the repository is not hashed. However, I have already removed unsafeDiff in darcs-hs and I want to keep it that way, so we either need to start treating old-fashioned as second-class, or either come up with a more systematic fix.

As for hashed-storage, I want to do further work on packs. Maybe even add code to read git repositories. Moreover, there is an ongoing review of hashed-storage which may bring further ideas for changes. We also have to make some decisions about what to do about darcs in the future — how to.

Conclusions

It has definitely been a productive summer. Lots of work has been done on hashed-storage, and it seems that with a little further work, it can provide solid underpinning to future darcs versions. Apart from major performance improvements, it exposes a new standalone component with its own set of tests, improving test coverage of what constitutes darcs.

posted 14.08.2009 11:35 pm tags:

29.07.2009: soc progress 10

soc progress 10

Wow, tenth progress report. That’s a long time now. Still a day late, but getting on track slowly (slower than expected I guess). Last week has seen the release of darcs 2.3. Nonetheless, this release has seen partial integration of hashed-storage work. Unfortunately, the release was also a major motivation drain — RMing darcs is a lot of (fairly hard) work.

On the unsafeDiff front, I’m at a standstill (there are still the same 2 instances that I reported about last week). I haven’t progressed on the HPC issue either, but that can wait and it’s not directly relevant to SoC.

I have worked on profiling my last-week’s check/repair port. Fortunately, the implementation does not leak memory, and is slightly slower (some 20%) than the old version. Most of the extra goes into filepath juggling (due to representation differences between darcs and hashed-storage). See my darcs-users post for more details about this.

In other news, I have forked an 0.3 branch of hashed-storage and flipped the switch to 0.4 in mainline. The first compatibility-breaker is the Hash type refactor, which improved things a fair bit. Also, we now represent hashes in-memory (and in the binary index) by 32-byte binary vectors (represented as bytestrings) instead of using the 64-byte ascii-hex representation. Also, this has removed a lot of Maybe wrapping, since the new Hash type comes with a “NoHash” constructor.

Other than that, I have taken a break after the release. I still have some motivational issues, but I should improve my output rate over the next week. The major todo on is file benchmarking (since Eric keeps nagging me about this, and I still haven’t figured how to do this properly) and actual implementation of the (internal) pack format.

Hashed-storage changes:

And darcs-hs changes:

posted 29.07.2009 10:01 pm tags:

25.07.2009: patch formats

Patch Formats

I have been thinking for a while, that a completely new “repository” format (an experimental one) would be in place for darcs 2.4. I have previously outlined a way I’d like to go about building up new things within the darcs 2.x series. Now a darcs repository has two basic “components”: the “file” part of the layout: truly a repository format, and a “patch format”: which determines not only how patches are written out to disk, but more importantly, their exact semantics. Once you set up a “patch format”, this is set in stone and repositories with different patch types cannot exchange patches between them (at least not without an in-between conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use a different patch format. The case of darcs-1 vs “hashed” repositories, as darcs calls them, is only on the file level though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (I will from now on refer to repository and patch format as two orthogonal things, as they mostly are.)

Now I have been working on a packed repository format… one that would allow to store the repository — regardless of patch format used — in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: a patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types, and some more complicated ones — conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about these — we’ll focus on the primitive patches for now.

The primitives in darcs are addfile, rmfile, addir, rmdir, move, replace and most importantly hunk. (You should be able to look up what these roughly represent somewhere else.)

Let’s address these addfile and friend patches, that create and remove files or directories. Obviously, addfile foo.txt and a different addfile foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on addfile foo.txt — which means that if you pull two branches together with nontrivial files of the same name, you are going to end up with a massive conflict (at least in terms of darcs data structures) for virtually no reason.

So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:

hunk fileid 1
- a
+ b

This means that hunks would exist without any dependence on addfile: the abstract file would pop into existence with first hunk touching its identity. Of course this would be no good, since you just lost the relation between a working copy and whatever darcs tracks. To put that relation back on track, we add two patch types:

manifest fileid ./file/path
demanifest fileid ./file/path

A manifest patch will tell darcs to associate the fileid with your working file at ./file/path. The inverse operation is demanifest, and that would remove the association: and your working copy file. The abstract identity continues to exist just fine, and can be manifested again (under the same or different filepath). Basically, this completely de-couples the “hunk-space” from the “filepath-space” — manifest/demanifest/move(/adddir/removedir) patches commute completely freely with hunk/replace patches. To make the de-coupling complete, you want a “manifest” of a non-existent fileid to pop that fileid into existence as well. No problem.

Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to the directory tree manipulation: and this is good and well, since it allows us to solve conflicts on both of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.

As for the hunk format itself, there is also a number of issues: it uses a GNU-patch-like format with ‘+’ or ‘-’ sign in front of each line. It will usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either of these may be empty). Parsing this format is not quite simple, you have to look up all the newlines, chop off the ‘+’ and ‘-’ signs etc. Lots of work for darcs.

hunk ./foo.txt 1
- a
+ b

Now a friendly-to-parse alternative would be to use a chunk kind of patch:

chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n

the 0 and the 2 are an offset and length (in bytes), respectively. What follows the patch header is a monolithic block of text to be removed from (or pasted at) the given offset (and the block is of a given length). This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).

But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:

chunk fileid - 0 2 hash
chunk fileid + 0 2 hash

Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care for a while. We now have a chunk format that is O(1) in size (well O(logn) for purists, given the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-‘/’+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use binary diff algorithm if we wanted…)

We will need to dereference that hash sometimes of course: when we want to actually apply the patch, and when we want to show it to user. The latter is a non-issue, since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation issues. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into hashed pristine (plus some wibbling of garbage collection, see my previous post about that).

Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks — this would add about 6M of uncompressed (but relatively compressible) text to the inventories (considering about 150 bytes per a chunk patch: 2x sha256 + a little). That is about factor 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less-stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).

I should probably also note that this would save some bits on conflictor representation (no copies of hunk data) and it should also solve the “big initial patch” problem — the patch itself would be O(n) in number of files (instead of O(n) in number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the amount of time to go into more details just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably address them to our darcs-users@ mailing list).

posted 25.07.2009 2:03 pm tags:

23.07.2009: darcs 2.3.0

Darcs 2.3.0

I’d like to announce immediate availability of a new stable release of darcs, 2.3.0. There is a number of improvements and bugfixes over the previous stable release, 2.2. Moreover, work has been done to improve performance of “darcs whatsnew” for large repositories.

As in the past, there are two source tarballs available. As of this release, the cabal-based build is preferred, and the autoconf build is deprecated. You can obtain the source tarballs at these addresses:

The build instructions are available in the enclosed README file in those tarballs.

Moreover, if you have cabal-install available, you can install latest stable release of darcs by issuing the following commands (no tarballs needed):

    $ cabal update  
    $ cabal install darcs

This should give you a darcs binary in ~/.cabal/bin — you should probably add this to your PATH. More detailed instructions for installing on Windows are available near the end of this announcement.

What’s New

This is a summary of important changes since the last stable release (2.2):

A number of issues has been resolved since 2.2 as well. See http://bugs.darcs.net/issueN for details on bug number N.

The question of GHC 6.8

Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.

Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.

Installing on Windows

To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:

After installing both, you should have an “MSYS” icon: run MSYS and in the terminal window type (the $ character denotes the prompt, do not repeat it):

$ cabal update  
$ cabal install darcs -f-curl

This should download, compile and install all required dependencies and also darcs itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).

posted 23.07.2009 3:27 pm tags:

23.07.2009: soc progress 9

soc progress 9

Uh, I have slipped by two days again — well, almost. The problem is that in the last two days, I have been somewhat busy with e-mail at darcs-users@… — a sudden peak of traffic hit the list. Basically, two threads created a lot of traffic, the David’s darcs one started by me, and the cheap in-repo branching one. And since this was lot of writing on my part, I didn’t quite feel like writing also the report. But I’m still a day closer to the original Tuesday (and it’s still not even noon here, so that’s more like a day and a half!). I’ll try again to get on track next week.

I have worked some more on the packed format bits in hashed-storage, and I have worked on getting rid of “unsafeDiff” in darcs-hs. The latter is down to 2 uses, one in “darcs replace” and the other in Darcs.Resolution. The replace one should be relatively easy, actually. The resolution one hopefully too, but we have no testsuite coverage of the external merge, which would be better added before I break it… For the changes already done, I have ported over check/repair (I still need to run this through a heap profiler, since I suspect I introduced some space leaks) and “darcs remove” which is now also neater and more readable (at least I think it is). And since we never use unsafeDiff to compare working against pristine anymore, I have nuked a bunch of code that deals with timestamps.

I have also done a little work on HPC reports for darcs itself: I put a current HPC report online, and there’s some automation in place, so it should get updated whenever I push into the repository (it is supposed to be a 1:1 mirror of http://darcs.net). There are still some issues, like the TIXDIR taking over a gigabyte of space (I know how to fix this, but haven’t had the time yet — it involves a relatively easy support in ShellHarness).

Also on the mainline darcs (and this is not strictly SoC work, but worth reporting anyway), we have removed autoconf support. Finally. There are still some ongoing issues, like we need the user manual to be built by Cabal (we kept some bits of the GNUmakefile to address this for now) and some other features are currently not available with cabal (see also this thread).

Hashed-storage changes:

And darcs-hs changes:

posted 23.07.2009 11:14 am tags:

17.07.2009: soc progress 8

soc progress 8

So I have managed to reduce the report lag by a few days, expect the next one on the (regular) Tuesday. Hopefully. Anyway, since the last report, a lot has happened.

I got a new laptop a few weeks back (a Lenovo X200 — I installed the base system on 30th of June), which enabled me to play around with more resource-hungry software than before (this beast has 2G of RAM and two 2.4GHz CPUs). The thing I tried was installing VirtualBox and a copy of Windows XP inside (I have an academic license, hopefully I am within the rules). This worked fairly well (the VM is pretty fast, actually) and enabled me to play with darcs on a win32 system. After installing the Haskell Platform (congrats to all the folks who worked on this: this piece of software is awesome) and MSYS, I got cabal install darcs-beta rolling. Turned out that darcs 2.3 didn’t work very well on windows (well, about 30 tests failed, and darcs whatsnew was completely unusable). It took about two days time to kick all the parts into obedience: I patched all of mmap, hashed-storage and darcs. The net result is that darcs now passes its testsuite on win32 (this never quite worked before). In the meantime, I have already released new beta, darcs 2.3 beta 4 which has all the win32 fixes in it. In addition, I made the hashed-storage testsuite pass on Windows, which should make it much easier in the future to track down problems.

I have also made some coding progress on hashed-storage: first, the packed format prototyping is advancing, and before next report, I hope to have a workable prototype. Some code already went into hashed-storage repository, and there’s already some testing coverage. Which brings me to the other thing, that is that I have improved test coverage of hashed-storage significantly (we now have 10 QC properties and 18 testcases that together cover of 77% toplevels, 66% alternatives and 83% expressions, as reported by HPC).

In darcs-hs, I have done some work on optimising darcs show contents --match, as this seems to be an important operation for repository browsers (or at least for trac, but I imagine others may benefit from this). I have also figured that large part of the tracdarcs inefficiency comes from calling darcs show contents on way old revisions: based on this, Lele has implemented an optimisation that instead looks at the latest revision that has a given version of the file, and for HEAD revisions omits —match altogether. I have deployed the optimised version on my own tracdarcs instance (for another project) and I can confirm that it is much more usable than before. I’ll set up a test instance for GHC repository just for the kicks of it. (Maybe we could then get Hackage folks to install tracdarcs into their trac instances? Now that would be cool.)

The summary of hashed-storage changes for the week:

and for darcs-hs:

posted 17.07.2009 2:55 pm tags: