Search Results: "Richard Braakman"

24 March 2013

Lars Wirzenius: Two new t-shirt designs

I've made two new designs for Trunk Tees, my Cafepress store. Thank you to Richard Braakman for suggesting the .* one. Here are all the older designs as well:

8 June 2012

Lars Wirzenius: Obnam 1.0 (backup software); a story in many words

tl;dr: Version 1.0 of Obnam, my snapshotting, de-duplicating, encrypting backup program, is released. See the end of this announcement for the details.

Where we see the hero in his formative years; parental influence

From the very beginning, my computing life has involved backups. In 1984, when I was 14, my father was an independent telecommunications consultant, which meant he needed a personal computer for writing reports. He bought a Luxor ABC-802, a Swedish computer with a Z80 microprocessor and two floppy drives. My father also taught me how to use it. When I needed to save files, he gave me not one, but two floppies, and explained that I should store my files on one, and then copy them to the other every now and then. Later on, over the years, I've made backups from a hard disk (30 megabytes!) to a stack of floppies, to a tape drive installed into a floppy interface (400 megabytes!), to a DAT drive, and various other media. It was always a bit tedious.

The start of the quest; lengthy justification for NIH

In 2004, I decided to do a full backup by burning a copy of all my files onto CD-R disks. It took me most of the day. Afterwards, I sat admiring the large stack of disks, and realized that I would not ever do that again. I'm too lazy for that. That I had done it once was an aberration in the space-time continuum. Switching to DVD-Rs instead of CD-Rs would reduce the number of disks to burn, but not enough: it would still take a stack of them. I needed something much better.

I had a little experience with tape drives, and that was enough to convince me that I didn't want them. Tape drives are expensive hardware, and the tapes also cost money. If the drive goes bad, you have to get a compatible one, or all your backups are toast. The price per gigabyte was coming down fast for hard drives, and it was clear that they were about to be very competitive with tapes in price.

I looked for backup programs that I could use for disk-based backups. rsync, of course, was the obvious choice, but there were others. I ended up doing what many geeks do: I wrote my own wrapper around rsync. There are hundreds, possibly thousands, of such wrappers around the Internet. I also got the idea that doing a startup to provide online backup space would be a really cool thing. However, I didn't really do anything about that until 2007. More on that later.

The rsync wrapper script I wrote used hardlinked directory trees to provide a backup history (sketched below), though not in the smart way that backuppc does it. The hardlinks were wonderful, because they were cheap, and provided de-duplication. They were also quite cumbersome when I needed to move my backups to a new disk the first time. It turned out that a lot of tools deal very badly with directory trees with large numbers of hardlinks.

I also decided I wanted encrypted backups. This led me to find duplicity, which is a nice program that does encrypted backups, but I had issues with some of its limitations. To fix those limitations, I would have had to re-design and possibly re-implement the entire program. The biggest limitation was that it treated backups as a full backup, plus a sequence of incremental backups, which were deltas against the previous backup. Delta-based incrementals make sense for tape drives: you run a full backup once, then incremental deltas every day. When enough time has passed since the full backup, you do a new full backup, and then future incrementals are based on that. Repeat forever.
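To make the hardlink trick concrete, here is a minimal sketch of that kind of wrapper, in the spirit of my old script but not the actual code; the paths are invented and error handling is omitted:

# A sketch of a hardlink-snapshot wrapper around rsync; paths and
# naming are made up for illustration.
import os
import subprocess
import time

def snapshot(source, backup_root):
    # Each run creates a new timestamped generation directory.
    dest = os.path.join(backup_root, time.strftime("%Y-%m-%dT%H:%M:%S"))
    latest = os.path.join(backup_root, "latest")
    cmd = ["rsync", "-a", "--delete"]
    if os.path.exists(latest):
        # --link-dest makes rsync hardlink any file that is unchanged
        # since the previous generation, instead of copying it again.
        cmd.append("--link-dest=" + os.path.realpath(latest))
    cmd += [source + "/", dest + "/"]
    subprocess.check_call(cmd)
    # Repoint "latest" at the generation we just made.
    tmp = latest + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(dest, tmp)
    os.rename(tmp, latest)

# e.g. snapshot("/home/liw", "/mnt/backup")

Each run costs only the space of the changed files, which is why the hardlinks felt so cheap.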
I decided that the full-plus-incrementals scheme makes no sense for disk-based backups. If I have already backed up a file, there's no point in making me back it up again, since it's already there on the same hard disk. It makes even less sense for online backups, since doing a new full backup would require me to transmit all the data all over again, even though it's already on the server.

The first battle

I could not find a program that did what I wanted to do, and like every good NIHolic, I started writing my own. After various aborted attempts, I started for real in 2006. Here is the first commit message:
revno: 1
committer: Lars Wirzenius <liw@iki.fi>
branch nick: wibbr
timestamp: Wed 2006-09-06 18:35:52 +0300
message:
  Initial commit.
wibbr was the placeholder name for Obnam until we came up with something better. "We" was myself and Richard Braakman, who was going to be doing the backup startup with me. We eventually founded the company near the end of 2006, and started doing business in 2007. However, we did not do very much business, and ran out of money in September 2007. We ended the backup startup experiment. That's when I took a job with Canonical, and Obnam became a hobby project of mine: I still wanted a good backup tool.

In September 2007, Obnam was working, but it was not very good. For example, it was quite slow and wasteful of backup space. That version of Obnam used deltas, based on the rsync algorithm, to back up only changes. It did not require the user to do full and incremental backups manually, but essentially created an endless sequence of incrementals. It was possible to remove any generation, and Obnam would manage the deltas as necessary, keeping the ones needed for the remaining generations, and removing the rest. Obnam made it look as if each generation was independent of the others. The wasteful part was the way in which metadata about files was stored: each generation stored the full list of filenames and their permissions and other inode fields. This turned out to be bigger than my daily delta.

The lost years; getting lost in the forest

For the next two years, I did a little work on Obnam, but I did not make progress very fast. I changed the way metadata was stored, for example, but I picked another bad way of doing it: the new way was essentially building a tree of directory and file nodes, and any unchanged subtrees were shared between generations (a toy sketch of this sharing scheme follows the commit table below). This reduced the space overhead per generation, but made it quite slow to look up the metadata for any one file.

The final battle; finding cows in the forest

In 2009 I decided to leave Canonical, and after that my Obnam hobby picked up speed again. Below is a table of the number of commits per year, from the very first commit (bzr log -n0 | awk '/timestamp:/ {print $3}' | sed 's/-.*//' | uniq -c | awk '{print $2, $1}' | tac):
2006 466
2007 353
2008 402
2009 467
2010 616
2011 790
2012 282
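As promised above, here is a toy model of the "shared subtrees" metadata scheme from the lost years; it is nothing like Obnam's actual code, just the sharing idea:

# A toy model of generation metadata as a tree of nodes, where making
# a new generation copies only the path to whatever changed and shares
# every untouched subtree with older generations.

class Node(object):
    def __init__(self, meta, children=None):
        self.meta = meta                # permissions, inode fields, ...
        self.children = children or {}  # name -> Node

def update(root, path, meta):
    # Return a new root with meta set at path. Only the nodes on the
    # path are copied; everything off the path is shared.
    if not path:
        return Node(meta, root.children if root else {})
    children = dict(root.children) if root else {}
    children[path[0]] = update(children.get(path[0]), path[1:], meta)
    return Node(root.meta if root else None, children)

gen1 = update(None, ["home", "liw", "notes.txt"], {"size": 10})
gen1 = update(gen1, ["etc", "fstab"], {"size": 100})
gen2 = update(gen1, ["home", "liw", "notes.txt"], {"size": 12})
# The unchanged /etc subtree is literally the same object in both
# generations; only the path to the changed file was copied.
assert gen2.children["etc"] is gen1.children["etc"]

The assert at the end is the point: an unchanged subtree is stored once and referenced by every generation that contains it.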
During most of 2010 and 2011 I was unemployed, and happily hacking Obnam, while moving to another country twice. I don't recommend that as a way to hack on hobby projects, but it worked for me.

After Canonical, I decided to tackle the way Obnam stores data from a new angle. Richard told me about the copy-on-write (or COW) B-trees that btrfs uses, originally designed by Ohad Rodeh (see his paper for details), and I started reading about that. It turned out that they're pretty ideal for backups: each B-tree stores data about one generation. To start a new generation, you clone the previous generation's B-tree, and make any modifications you need. I implemented the B-tree library myself, in Python. I wanted something that was flexible about how and where I stored data, which the btrfs implementation did not seem to give me. (Also, I worship at the altar of NIH.)

With the B-trees, doing file deltas from the previous generation no longer made any sense. I realized that it was, in any case, a better idea to store file data in chunks, and re-use chunks in different generations as needed (a toy sketch of the chunk scheme appears near the end of this announcement). This makes it much easier to manage changes to files: with deltas, you need to keep a long chain of deltas and apply many deltas to reconstruct a particular version. With lists of chunks, you just get the chunks you need.

The spin-off franchise; lost in a maze of dependencies, all alike

In the process of developing Obnam, I have split off a number of helper programs and libraries. I have found it convenient to keep these split off, since I've been able to use them in other projects as well. However, it turns out that those installing Obnam don't like this: it would probably make sense to have a fat release with Obnam and all dependencies, but I haven't bothered to do that yet.

The blurb; readers advised about blatant marketing

The strong points of Obnam are, I think: Backups may be stored on local hard disks (e.g., USB drives), on any locally mounted network file shares (NFS, SMB, almost anything with remotely POSIX-like semantics), or on any SFTP server you have access to. What's not so strong is backing up online over SFTP, particularly with long round-trip times to the server, or many small files to back up. That performance is Obnam's weakest part. I hope to fix that in the future, but I don't want to delay 1.0 for it.

The big news; readers sighing in relief

I am now ready to release version 1.0 of Obnam. Finally. It's been a long project, much longer than I expected, and much longer than was really sensible. However, it's ready now. It's not bug free, and it's not as fast as I would like, but it's time to declare it ready for general use. If nothing else, this will get more people to use it, and they'll find the remaining problems faster than I can on my own. I have packaged Obnam for Debian; it is in unstable, and will hopefully get into wheezy before the Debian freeze. I provide packages built for squeeze from my own repository; see the download page. The changes in the 1.0 release compared to the previous one:

The future; not including winning lottery numbers

I expect to get a flurry of bug reports in the near future as new people try Obnam. It will take a bit of effort dealing with that. Help is, of course, welcome! After that, I expect to be mainly working on Obnam performance for the foreseeable future. There may also be a FUSE filesystem interface for restoring from backups, and a continuous backup version of Obnam. Plus other features, too.
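And here is the toy sketch of chunk-based storage promised earlier; it is not Obnam's real on-disk format, and the fixed chunk size and SHA-1 chunk ids are arbitrary choices for illustration:

# Split file data into chunks, store each chunk once under a
# content-derived name, and record only the list of chunk ids per
# file in each generation.
import hashlib
import os

CHUNK_SIZE = 64 * 1024

def backup_file(path, chunk_dir):
    # Returns the list of chunk ids a generation would record.
    ids = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunk_id = hashlib.sha1(data).hexdigest()
            chunk_path = os.path.join(chunk_dir, chunk_id)
            if not os.path.exists(chunk_path):  # re-use existing chunks
                with open(chunk_path, "wb") as out:
                    out.write(data)
            ids.append(chunk_id)
    return ids

def restore_file(ids, chunk_dir, out_path):
    # Reconstruct a file by concatenating its chunks; there is no
    # delta chain to replay, whichever generation the list came from.
    with open(out_path, "wb") as out:
        for chunk_id in ids:
            with open(os.path.join(chunk_dir, chunk_id), "rb") as f:
                out.write(f.read())

Restoring any generation just fetches the chunks its file entries list, which is the whole advantage over delta chains.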
I make no promises about how fast new features and optimizations will happen: Obnam is a hobby project for me, and I work on it only in my free time. Also, I have a bunch of things that are on hold until I get Obnam into shape, and I may decide to do one of those things before the next big Obnam push.

Where; the trail of an errant hacker

I've developed Obnam in a number of physical locations, and I thought it might be interesting to list them: Espoo, Helsinki, Vantaa, Kotka, Raahe, Oulu, Tampere, Cambridge, Boston, Plymouth, London, Los Angeles, Auckland, Wellington, Christchurch, Portland, New York, Edinburgh, Manchester, San Giorgio di Piano. I've also hacked on Obnam in trains, on planes, and once on a ship, but only for a few minutes on the ship before I got seasick.

Thank you; sincerely

13 March 2011

Lars Wirzenius: DPL elections: candidate counts

Out of curiosity, and because it is Sunday morning and I have a cold and can't get my brain to do anything tricky, I counted the number of candidates in each year's DPL elections.
Year Count Names
1999 4 Joseph Carter, Ben Collins, Wichert Akkerman, Richard Braakman
2000 4 Ben Collins, Wichert Akkerman, Joel Klecker, Matthew Vernon
2001 4 Branden Robinson, Anand Kumria, Ben Collins, Bdale Garbee
2002 3 Branden Robinson, Raphaël Hertzog, Bdale Garbee
2003 4 Moshe Zadka, Bdale Garbee, Branden Robinson, Martin Michlmayr
2004 3 Martin Michlmayr, Gergely Nagy, Branden Robinson
2005 6 Matthew Garrett, Andreas Schuldei, Angus Lees, Anthony Towns, Jonathan Walther, Branden Robinson
2006 7 Jeroen van Wolffelaar, Ari Pollak, Steve McIntyre, Anthony Towns, Andreas Schuldei, Jonathan (Ted) Walther, Bill Allombert
2007 8 Wouter Verhelst, Aigars Mahinovs, Gustavo Franco, Sam Hocevar, Steve McIntyre, Raphaël Hertzog, Anthony Towns, Simon Richter
2008 3 Marc Brockschmidt, Raphaël Hertzog, Steve McIntyre
2009 2 Stefano Zacchiroli, Steve McIntyre
2010 4 Stefano Zacchiroli, Wouter Verhelst, Charles Plessy, Margarita Manterola
2011 1 Stefano Zacchiroli (no vote yet)
Winners are indicated by boldface. I expect Zack to win over "None Of The Above", so I went ahead and boldfaced him already, even though there has not yet been a vote this year. The median number of candidates is 4.

1 March 2008

Anthony Towns: Been a while...

So, sometime over the past few weeks I clocked up ten years as a Debian developer:
From: Anthony Towns <aj@humbug.org.au>
Subject: Wannabe maintainer.
Date: Sun, 8 Feb 1998 18:35:28 +1000 (EST)
To: new-maintainer@debian.org
Hello world,
I'd like to become a debian maintainer.
I'd like an account on master, and for it to be subscribed to the
debian-private list.
My preferred login on master would have been aj, but as that's taken
ajt or atowns would be great.
I've run a debian system at home for half a year, and a system at work
for about two months. I've run Linux for two and a half years at home,
two years at work. I've been active in my local linux users' group for
just over a year. I've written a few programs, and am part way through
packaging the distributed.net personal proxy for Debian (pending
approval for non-free distribution from distributed.net).
I've read the Debian Social Contract.
My PGP public key is attached, and also available as
<http://azure.humbug.org.au/~aj/aj_key.asc>.
If there's anything more you need to know, please email me.
Thanks in advance.
Cheers,
aj
-- 
Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. PGP encrypted mail preferred.
On Netscape GPLing their browser: ``How can you trust a browser that
ANYONE can hack? For the secure choice, choose Microsoft.''
        -- <oryx@pobox.com> in a comment on slashdot.org
Apparently that also means I’ve clocked up ten and a half years as a Debian user; I think my previous two years of Linux (mid-95 to mid-97) were split between Slackware and Red Hat, though I couldn’t say for sure at this point. There have already been a few other grand ten-year reviews, such as Joey Hess’s twenty-part serial, or LWN’s week-by-week review, or ONLamp’s interview with Bruce Perens, Eric Raymond and Michael Tiemann on ten years of “open source”. I don’t think I’m going to try matching that sort of depth, though, so here are some of my highlights (after the break).
Hrm, this is going on longer than I’d hoped. Oh well, to be continued!