Manoj Srivastava: The way of the wolf

Continuous Automated Build and Integration Environment. Cabie is a multi-platform, multi-CM client/server application providing both command-line and web-based access to real-time build monitoring and execution information. Cabie builds jobs based upon configuration information stored in MySQL and will support virtually any build that can be invoked from the command line. It provides a centralized collection point for all builds, with web-based dynamic access; the collector is SQL-based and provides information for all projects under Cabie's control. Cabie can be integrated with bug tracking and test systems with some effort, depending on the complexity of those systems. Since most companies create build systems from the ground up, Cabie was designed not to require rewriting scripted builds, but instead to integrate existing build scripts into a smart collector. It provides rapid email notification and RSS integration to quickly handle build issues, and it can run builds in parallel or in series, poll jobs, or drive scripted nightly builds. Cabie is well suited to agile development in an environment that requires multiple languages and tools. It supports Perforce, Subversion and CVS; the use of a backend broker allows anyone with Perl skills to write support for additional CM systems.

The nice people at YoLinux have provided a tutorial for the process. I did have to make some changes to get things working (mostly in line with the changes recommended in the tutorial, but not exactly the same). I have sent the patches upstream, but upstream is not sure how much of it they can use, since there has been major progress since the last release. Upstream is nice and responsive, and has added support in unreleased versions for using virtual machines to run the builds in (they use that to do the Solaris/Windows builds), improved the web interface using (shudder) PHP, and all kinds of neat stuff.
With the mailtrainer command, one can specify leaving out a certain percentage of the training set in the learn phase, and then run a second pass over the skipped mails to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the two least significant digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run.
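A minimal sketch of how that hold-out regexp could be generated, assuming mailtrainer.crm's --validate=regex option and corpus files named by message number (the exact flag spelling and the alternation syntax are assumptions; adjust to your own layout):

    #!/bin/bash
    # Pick two random two-digit endings and reserve the matching messages
    # (roughly 2% of the corpus) for the validation pass.
    d1=$(printf '%02d' $((RANDOM % 100)))
    d2=$(printf '%02d' $((RANDOM % 100)))
    regex="[${d1:0:1}][${d1:1:1}][_][_]|[${d2:0:1}][${d2:1:1}][_][_]"
    mailtrainer.crm --good=Ham/ --spam=Spam/ --validate="$regex"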
An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding Spamassassin into the mix only improves the numbers.

Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains an unsure message, and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle. The table below starts with data from a newly minted .css file.
Date | Corpus Size | Ham Count | Ham Correct | Ham Accuracy (%) | Spam Count | Spam Correct | Spam Accuracy (%) | Overall Count | Overall Correct | Overall Accuracy (%) | Validation Regexp
---|---|---|---|---|---|---|---|---|---|---|---
Wed Oct 31 10:22:23 UTC 2007 | 43319 | 492 | 482 | 97.967480 | 374 | 374 | 100.000000 | 866 | 856 | 98.845270 | [1][6][_][_] [0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007 | 43330 | 490 | 482 | 98.367350 | 378 | 378 | 100.000000 | 868 | 860 | 99.078340 | [3][7][_][_] [2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007 | 43334 | 491 | 483 | 98.370670 | 375 | 375 | 100.000000 | 866 | 858 | 99.076210 | [2][0][_][_] [7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007 | 43345 | 492 | 482 | 97.967480 | 376 | 376 | 100.000000 | 868 | 858 | 98.847930 | [1][2][_][_] [0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007 | 43390 | 490 | 480 | 97.959180 | 379 | 379 | 100.000000 | 869 | 859 | 98.849250 | [4][1][_][_] [6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007 | 43394 | 491 | 482 | 98.167010 | 375 | 375 | 100.000000 | 866 | 857 | 98.960740 | [3][1][_][_] [7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007 | 43400 | 490 | 483 | 98.571430 | 377 | 377 | 100.000000 | 867 | 860 | 99.192620 | [4][6][_][_] [6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007 | 43409 | 490 | 485 | 98.979590 | 377 | 377 | 100.000000 | 867 | 862 | 99.423300 | [3][7][_][_] [7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007 | 43421 | 490 | 486 | 99.183670 | 379 | 379 | 100.000000 | 869 | 865 | 99.539700 | [7][2][_][_] [9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007 | 43423 | 490 | 489 | 99.795920 | 378 | 378 | 100.000000 | 868 | 867 | 99.884790 | [4][0][_][_] [8][3][_][_]
    $ ssh alioth
    $ cd /git/collab-maint
    $ ./setup-repository pkg-mdadm 'mdadm Debian packaging'
    $ exit
    $ apt-get source --download-only mdadm
    $ mkdir mdadm && cd mdadm
    $ git init
    $ git remote add origin ssh://git.debian.org/git/collab-maint/pkg-mdadm
    $ git config branch.master.merge refs/heads/master

Now we can use git-pull and git-push, except the remote repository is empty and we can't pull from there yet. We'll save that for later. Instead, we tell the repository about upstream's Git repository. I am giving you the git.debian.org URL, though, simply because I don't want the upstream repository (which lives on an ADSL line) hammered in response to this blog post:
    $ git remote add upstream-repo git://git.debian.org/git/pkg-mdadm/mdadm

Since we're using the upstream branch of the pkg-mdadm repository as source (and don't want all the other mess I created in that repository), we'll first limit the set of branches to be fetched (I could have used the -t option in the above git-remote command, but I prefer to make it explicit that we're doing things slightly differently to protect upstream's ADSL line).
    $ git config remote.upstream-repo.fetch \
        +refs/heads/upstream:refs/remotes/upstream-repo/upstream

And now we can pull down upstream's history and create a local branch off it. The "no common commits" warning can be safely ignored, since we don't have any commits at all at that point (so there can't be any in common between the local and remote repository), but we know what we're doing, even to the point that we can forcefully give birth to a branch; that works because we do not have a HEAD commit yet (our repository is still empty):
    $ git fetch upstream-repo
    warning: no common commits
    [...]
    # in the real world, we'd be branching off upstream-repo/master
    $ git checkout -b upstream upstream-repo/upstream
    warning: You appear to be on a branch yet to be born.
    warning: Forcing checkout of upstream-repo/upstream.
    Branch upstream set up to track remote branch refs/remotes/upstream-repo/upstream.
    $ git branch
    * upstream
    $ ls | wc -l
    77
    $ git tag -s mdadm-2.6.3+200709292116+4450e59 4450e59
    $ git checkout -b master mdadm-2.6.3+200709292116+4450e59
    $ zcat ../mdadm_2.6.3+200709292116+4450e59-2.diff.gz | git apply

The local tree is now "debianised", but Git does not know about the new and changed files, which you can verify with git-status. We will split the changes made by Debian's diff.gz across several branches.
    $ git checkout -b upstream-patches mdadm-2.6.3+200709292116+4450e59
    M       Makefile
    M       ReadMe.c
    M       mdadm.8
    M       mdadm.conf.5
    M       mdassemble.8
    M       super1.c
    $ git add super1.c          # 444682
    $ git commit -s
    # i now branch off master, but that's the same as 4450e59 actually
    # i just do it so i can make this point
    $ git checkout -b deb/conffile-location master
    $ git add Makefile ReadMe.c mdadm.8 mdadm.conf.5 mdassemble.8
    $ git commit -s
    $ git checkout -b deb/initramfs master
    $ git add debian/initramfs/*
    $ git commit -s
    $ git checkout -b deb/docs master
    $ git add RAID5_versus_RAID10.txt md.txt rootraiddoc.97.html
    $ git commit -s
    # and finally, the ./debian/ directory:
    $ git checkout master
    $ chmod +x debian/rules
    $ git add debian
    $ git commit -s
    $ git branch
      deb/conffile-location
      deb/docs
      deb/initramfs
    * master
      upstream
      upstream-patches

At this time, we push our work so it won't get lost if, at this moment, aliens land on the house, or any other completely plausible event of apocalypse descends upon you. We'll push our work to git.debian.org (the origin, which is the default destination and thus need not be specified) by using git-push --all, which conveniently pushes all local branches, thus including the upstream code. You may not want to push the upstream code, but I prefer it, since it makes it easier to work with the repository, and most of the objects are needed for the other branches anyway: after all, we branched off the upstream branch. Specifying --tags instead of --all pushes tags instead of heads (branches); you couldn't have guessed that! See this thread if you (rightfully) think that one should be able to do this in a single command (which is not git push refs/heads/* refs/tags/*).
    $ git push --all
    $ git push --tags

Done. Well, almost.
    $ git checkout -b build mdadm-2.6.3+200709292116+4450e59

Now we're ready to build, and the following procedure should really be automated. I thus write it like a script, called poor-mans-gitbuild, which takes as optional argument the name of the (upstream) tag to use, defaulting to upstream (the tip):
    #!/bin/sh
    set -eu
    git checkout master
    debver=$(dpkg-parsechangelog | sed -ne 's,Version: ,,p')
    git checkout build
    git merge ${1:-upstream}
    git merge upstream-patches
    git merge master
    for b in $(git for-each-ref --format='%(refname)' refs/heads/deb/*); do
      git merge -- $b
    done
    git tag -s debian/$debver
    debuild   # will ignore .git automatically
    git checkout master

Note how we are merging each branch in turn, instead of using the octopus merge strategy (which would create a commit with more than two parents), for reasons outlined in this post. An octopus merge would actually work in our situation, but it will not always work, so better safe than sorry (although you could still achieve the same result). If you discover during the build that you forgot something, or the build script failed to run, just remove the tag, undo the merges, check out the branch to which you need to commit to fix the issue, and then repeat the above build process:
    $ git tag -d debian/$debver
    $ git checkout build
    $ git reset --hard upstream
    $ git checkout master
    $ editor debian/rules   # or whatever
    $ git add debian/rules
    $ git commit -s
    $ poor-mans-gitbuild

Before you upload, it's a good idea to invoke gitk --all and verify that all goes according to plan:
    $ git push origin build tag debian/2.6.3+200709292116+4450e59-3

Now take your dog for a walk, or play outside, or do something else not involving a computer or entertainment device.
    $ git checkout upstream-patches
    $ git-apply < patch-from-lunar.diff   # 444682 again
    $ git commit --author 'Jérémy Bobbio <lunar@debian.org>' -s
    # this should also be automated, see below
    $ git checkout master
    $ dch -i
    $ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
    2.6.3+200709292116+4450e59-3
    $ git commit -s debian/changelog
    $ poor-mans-gitbuild
    $ git push
    $ git push origin tag debian/2.6.3+200709292116+4450e59-3

That first git-push may require a short explanation: without any arguments, git-push updates only the intersection of local and remote branches, so it would never push a new local branch (such as build above), but it updates all existing ones; thus, you cannot inadvertently publish a local branch. Tags still need to be published explicitly.
    $ git checkout -b tmp/start-arrays-rework master

Unfortunately (or fortunately), fixing this issue will require work on two branches, since the initramfs script and hook are maintained in a separate branch. There are (again) two ways in which we can (sensibly) approach this:
    $ git merge master deb/initramfs
    $ editor debian/mdadm-raid              # [...]
    $ git commit -s debian/mdadm-raid
    $ editor debian/initramfs/script.local-top   # [...]
    $ git commit -s debian/initramfs/script.local-top
    [many hours of iteration pass...]
    [...until you are done]
    $ git checkout -b tmp/start-arrays-rework-init master
    # for each commit $c in tmp/start-arrays-rework
    # applicable to the master branch:
    $ git cherry-pick $c
    $ git checkout -b tmp/start-arrays-rework-initramfs deb/initramfs
    # for each commit $c in tmp/start-arrays-rework
    # applicable to the deb/initramfs branch:
    $ git cherry-pick $c

This is assuming that all your commits are logical units. If you find several commits which would better be bundled together into a single commit, this is the time to do it:
    $ git cherry-pick --no-commit <commit7>
    $ git cherry-pick --no-commit <commit4>
    $ git cherry-pick --no-commit <commit5>
    $ git commit -s

Before we now merge this into the official branches, let me briefly intervene and introduce the concept of a fast-forward. Git will "fast-forward" a branch to a new tip if it decides that no merge is needed. In the above example, we branched a temporary branch (T) off the tip of an official branch (O) and then worked on the temporary one. If we now merge the temporary one into the official one, Git determines that it can actually squash the ancestry into a single line and push the official branch tip to the same ref as the temporary branch tip. In cheap (poor man's) ASCII notation:
    - - - O                              - - - = - - OT
           \         >> merge T >>
            - - T    >>  into O >>

This works because no new commits have been made on top of O (if there were any, we might be able to rebase, but let's not go there quite yet; rebasing is how you shoot yourself in the foot with Git). Thus we can simply do the following:
    $ git checkout deb/initramfs
    $ git merge tmp/start-arrays-rework-initramfs
    $ git checkout master
    $ git merge tmp/start-arrays-rework-init

and test/build/push the result. Or well, since you are not an mdadm maintainer (We^W I have open job positions! Applications welcome!), you'll want to submit your work as patches via email:
    $ git format-patch -s -M origin/master

This will create a number of files in the current directory, one for each commit you made since origin/master. Assuming each commit is a logical unit, you can now submit these to an email address. The --compose option lets you write an introductory message, which is optional:
    $ git send-email --compose --to your@email.address <file1> <file2> <...>

Once you've verified that everything is alright, swap your email address for the bug number (or the pkg-mdadm-devel list address). Thanks (in advance) for your contribution! Of course, you may also be working on a feature that you want to go upstream, in which case you'd probably branch off upstream-patches (if it depends on a patch not yet in upstream's repository), or upstream (if it does not):
    $ git checkout -b tmp/cool-feature upstream
    [...]
    $ git fetch upstream-repo
    $ git checkout upstream
    $ git merge upstream-repo/master

We could just as well have executed git-pull, which with the default configuration would have done the same; however, I prefer to separate the process into fetching and merging. Now comes the point when many Git people think about rebasing. And in fact, rebasing is exactly what you should be doing, iff you're still working on an unpublished branch, such as the previous tmp/cool-feature off upstream. By rebasing your branch onto the updated upstream branch, you are making sure that your patch will apply cleanly when upstream tries it, because potential merge conflicts would be handled by you as part of the rebase, rather than by upstream:
    $ git checkout tmp/cool-feature
    $ git rebase upstream

What rebasing does is quite simple, actually: it takes every commit you made since you branched off the parent branch and records the diff and commit message. Then, for each diff/commit-message pair, it creates a new commit on top of the new parent branch tip, thus rewriting history and orphaning all your original commits. Thus, you should only do this if your branch has never been published; otherwise you would leave people who cloned from your published branch with orphans.
If this still does not make sense, try it out: create a (source) repository, make a commit (with a meaningful commit message), branch B off the tip, make a commit on top of B (with a meaningful message), clone that repository, and return to the source repository. There, check out the master, make a commit (with a meaningful message), check out B, rebase it onto the tip of master, make a commit (with a meaningful message), and now git-pull from the clone; use gitk to figure out what's going on.

So you should almost never rebase a published branch, and since all your branches outside of the tmp/* namespace are published on git.debian.org, you should not rebase those. But then again, Pierre actually rebases a published branch in his workflow, and he does so with reason: his patches branch is just a collection of branches to go upstream, from which upstream cherry-picks or which upstream merges, but which no one tracks (or should be tracking). But we can't (or at least will not at this point) do this for our feature branches (though we could treat upstream-patches that way), so we have to merge. At first, it suffices to merge the new upstream into the long-living build branch and to call poor-mans-gitbuild, but if you run into merge conflicts or find that upstream's changes affect the functionality contained in your feature branches, you need to actually fix those. For instance, let's say that upstream started providing md.txt (which I previously provided in the deb/docs branch); then I need to fix that branch:
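For the avoidance of doubt, here is that experiment spelled out as commands (the repository and file names are mine, not from the post; it assumes a Git recent enough that checking out B in the clone creates a local tracking branch):

    # Set up a source repository with a branch B ahead of master.
    git init source && cd source
    echo one > file && git add file && git commit -m 'first commit on master'
    git checkout -b B
    echo two >> file && git commit -am 'first commit on B'
    # Clone it and track B, as a second party would.
    cd .. && git clone source clone
    (cd clone && git checkout B)
    # Back in the source repository: advance master, then rebase B onto it.
    cd source
    git checkout master
    echo three > other && git add other && git commit -m 'second commit on master'
    git checkout B
    git rebase master            # B's commit is rewritten; the original is orphaned
    # Now pull in the clone and inspect the history.
    cd ../clone
    git pull                     # merges the old B tip with its rebased twin
    gitk --all                   # the same change now appears twice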
    $ git checkout deb/docs
    $ git rm md.txt
    $ git commit -s

That was easy, since I could evade the conflict. But what if upstream made a change to the Makefile, which got in the way of my configuration file location change? Then I'd have to merge upstream into deb/conffile-location, resolve the conflicts, and commit the change:
    $ git checkout deb/conffile-location
    $ git merge upstream
    CONFLICT!
    $ git-mergetool
    $ git commit -s

When all conflicts have been resolved, I can prepare a new release, as before:
    $ git checkout master
    $ dch -i
    $ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
    2.6.3+200709292116+4450e59-3
    $ git commit -s debian/changelog
    $ poor-mans-gitbuild
    $ git push
    $ git push origin tag debian/2.6.3+200709292116+4450e59-3

Note that Git often appears smart about commits that percolated upstream: since upstream included the two commits in upstream-patches in his 2.6.4 release, my upstream-patches branch got effectively annihilated, and Git was smart enough to figure that out without a conflict. But before you rejoice, let it be told that this does not always work.
    $ git checkout -b maint/lenny debian/2.7.6-1

You might do this to celebrate the release, or you may wait until the need arises. We've already left the domain of reality ("lenny" is not yet released), so the following is just theory. Now, assume that a security bug is found in mdadm 2.7.6 after "lenny" was released. Upstream is already on mdadm 2.7.8, and commits deadbeef and c0ffee fix the security issue; then you'd cherry-pick them into the maint/lenny branch:
    $ git checkout upstream
    $ git pull
    $ git checkout maint/lenny
    $ git cherry-pick deadbeef
    $ git cherry-pick c0ffee

If there are no merge conflicts (which you'd resolve with git-mergetool), we can just go ahead and prepare the new package:
    $ dch -i
    $ dpkg-parsechangelog | sed -ne 's,Version: ,,p'
    2.7.6-1lenny1
    $ git commit -s debian/changelog
    $ poor-mans-gitbuild
    $ git push origin maint/lenny
    $ git push origin tag debian/2.7.6-1lenny1
    $ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
        gidNumber=800 keyFingerPrint \
      | sed -rne ':s;/^dn:/bl;n;bs;:l;n;/^keyFingerPrint:/p;bs' \
      | wc -l
    1049

This actually seems enough, as I do not recall any new maintainers being added since the last call for votes, which gives 1049 as well. Andreas told me to count the number of entries in LDAP with GID 800 and an associated key in the Debian keyring. Manoj's dvt-quorum script also takes the Debian keyrings (GPG and PGP) into account, so I did the same:
    $ ldapsearch -xLLLH ldap://db.debian.org -b ou=users,dc=debian,dc=org \
        gidNumber=800 keyFingerPrint \
      | sed -rne ':s;/^dn:/bl;n;bs;:l;n;/^keyFingerPrint:/s,keyFingerPrint: ,,p;bs' \
      | sort -u > ldapfprs
    $ rsync -az --progress \
        keyring.debian.org::keyrings/keyrings/debian-keyring.gpg \
        ./debian-keyring.gpg
    $ gpg --homedir . --no-default-keyring --keyring debian-keyring.gpg \
        --no-options --always-trust --no-permission-warning \
        --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
        --fast-list-mode --fixed-list-mode --with-colons --list-keys \
      | sed -rne 's,^fpr:::::::::([[:xdigit:]]+):,\1,p' \
      | sort -u > gpgfprs
    $ rsync -az --progress \
        keyring.debian.org::keyrings/keyrings/debian-keyring.pgp \
        ./debian-keyring.pgp
    $ gpg --homedir . --no-default-keyring --keyring debian-keyring.pgp \
        --no-options --always-trust --no-permission-warning \
        --no-auto-check-trustdb --armor --rfc1991 --fingerprint \
        --fast-list-mode --fixed-list-mode --list-keys \
      | sed -rne 's,^[[:space:]]+Key fingerprint = ,,;T;s,[[:space:]]+,,gp' \
      | sort -u > pgpfprs
    $ sort ldapfprs pgpfprs gpgfprs | uniq -c \
      | egrep -c '^[[:space:]]+2[[:space:]]'
    1048

MAN OVER BOARD! Who's the black sheep?

Update: In the initial post, I forgot the option --fixed-list-mode and hit a minor bug in gnupg. I have since updated the above commands. Thus, there is no more black sheep, and the rest of this post only lingers here for posterity.
    $ while read i; do
        grep "^$i$" pgpfprs gpgfprs || echo $i >&2
      done < ldapfprs > /dev/null

which returns 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5:
    $ gpg --list-keys 9BF093BC475BABF8B6AEA5F6D7C3F131AB2A91F5
    pub   4096R/AB2A91F5 2004-08-20
    uid                  James Troup <james@nocrew.org>

That is our very own keyring master, James Troup. So has James subverted the project? Is he actually not a Debian developer? Given the position(s) he holds, does that mean that the project is doomed? Ha! I am so tempted to end right here, but since my readers are used to getting all the facts, here's the deal: James is so special that he gets to be the only one to have a key in our GPG keyring which can be used for encryption, or so I found out as I was researching this. Now this bug in gnupg actually causes his fingerprint not to be printed. Until this is fixed (if ever), simply leave out --fast-list-mode in the above commands.

NP: Oceansize: Effloresce
Arch (tla) invokes the ~/.arch-params/hook script at various points in its operation. Enough information is passed in to make this mechanism one of the most flexible I have had the pleasure to work with. In my hook script, I do the following things (a sketch of such a hook appears after the list):

- If a mirror of the archive (named, by convention, with a -MIRROR suffix) is defined, the script updates the mirror now, and logs an informational message to the screen.
- If the commit touches a ./debian directory that belongs to one of my packages, then the script sends a cleaned-up change log by mail to packages.qa.debian.org. People can subscribe to the mailing list set up for each package to get commit logs, if they so desire.
- [...] ./debian/control for all my packages.
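Here is a minimal sketch of such a hook, assuming tla's convention of passing the action name as the first argument and details in ARCH_* environment variables (ARCH_ARCHIVE, ARCH_REVISION, and so on); the mirror check and the echoed message are illustrative, not my actual script:

    #!/bin/sh
    # ~/.arch-params/hook -- invoked by tla with the action name as $1.
    set -e
    case "$1" in
      commit)
        # If a mirror of this archive is registered (conventionally named
        # with a -MIRROR suffix), update it now and log to the screen.
        if tla archives | grep -q "$ARCH_ARCHIVE-MIRROR"; then
            tla archive-mirror "$ARCH_ARCHIVE"
            echo "I: pushed $ARCH_REVISION to $ARCH_ARCHIVE-MIRROR"
        fi
        ;;
    esac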
My SpamAssassin configuration raises the Bayes database size limit and disables automatic expiry:

    bayes_expiry_max_db_size 4000000
    bayes_auto_expire 0

I also have regularly updated spam rules from the SpamAssassin Rules Emporium (SARE) to improve the efficiency of the rules; my current user_prefs is available as an example.
Initial training
I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails.
I have created a couple of scripts to train the discriminators from scratch using the extant Spam corpus; these scripts are also used for re-learning, for instance, when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discriminators. I generally run them from the ~/.spamassassin/ and ~/var/lib/crm114 (the latter contains my CRM114 setup) directories.
I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; this Spamassassin learning script delivers chunks of 50 messages for learning.
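The alternating-chunk idea looks roughly like this (the corpus paths are the ones named above; the chunk logic is my reconstruction, not the actual script):

    #!/bin/bash
    # Alternate 50-message chunks of ham and spam through sa-learn.
    set -u
    cd /backup/classify/Done
    ham=(Ham/*) spam=(Spam/*)
    chunk=50
    for ((i = 0; i < ${#ham[@]} || i < ${#spam[@]}; i += chunk)); do
        h=("${ham[@]:i:chunk}")
        s=("${spam[@]:i:chunk}")
        # Feed a ham chunk, then a spam chunk, and repeat.
        if [ ${#h[@]} -gt 0 ]; then sa-learn --ham  "${h[@]}";  fi
        if [ ${#s[@]} -gt 0 ]; then sa-learn --spam "${s[@]}"; fi
    done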
With CRM114, I have discovered that it is not a good idea to stop learning based on the number of times the corpus has been gone over, since stopping before all messages in the corpus are correctly handled is also disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the job nicely; running it under screen is highly recommended.
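At its core, that presumably amounts to an invocation along these lines (the option names follow mailtrainer.crm's documented interface, and the numbers are chosen to match the corpus described above; treat this as a sketch, not the actual script):

    # Repeat "forever"; mailtrainer stops once it sees a streak of correct
    # classifications longer than the whole corpus (about 41,000 messages).
    mailtrainer.crm --good=/backup/classify/Done/Ham/ \
                    --spam=/backup/classify/Done/Spam/ \
                    --repeat=1000 --streak=42000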
Routine updates
Coming back to where we left off, we had mail (mbox format)
folders called ham and/or junk sitting in the
local mail delivery directory, which were ready to be used for
training either CRM114 or
Spamassassin or both.
There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting. It processes a bunch of mail folders, which are supposed to contain mail that is either all ham or all spam, as indicated by the command-line arguments. We go looking through every mail, and for any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out the mail-gathering headers, save the mail, one to a file, and train the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus).
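The train-on-error loop could look roughly like this. It is a sketch under explicit assumptions (procmail's formail to split the mbox, spamassassin -e exiting non-zero on spam, and a configured CRM114 whose mailreaver.crm adds an X-CRM114-Status header and accepts --good/--spam), and it omits the header stripping and corpus archiving that mail-process also does:

    #!/bin/bash
    # Usage: mail-process-sketch {ham|spam} folder.mbox
    set -u
    KIND=$1 mbox=$2
    export KIND

    train_one() {
        msg=$(mktemp); cat > "$msg"
        if [ "$KIND" = spam ]; then
            # spamassassin -e exits 0 for ham: in a spam folder that is an error.
            if spamassassin -e < "$msg" > /dev/null; then
                sa-learn --spam "$msg" > /dev/null
            fi
            crm mailreaver.crm < "$msg" | grep -qi '^x-crm114-status: spam' ||
                crm mailreaver.crm --spam < "$msg" > /dev/null
        else
            # Non-zero exit means "judged spam": in a ham folder that is an error.
            if ! spamassassin -e < "$msg" > /dev/null; then
                sa-learn --ham "$msg" > /dev/null
            fi
            crm mailreaver.crm < "$msg" | grep -qi '^x-crm114-status: good' ||
                crm mailreaver.crm --good < "$msg" > /dev/null
        fi
        rm -f "$msg"
    }
    export -f train_one
    formail -s bash -c train_one < "$mbox"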
The second script, called mproc, is a convenience front-end; it takes no arguments and just calls mail-process with the proper command-line arguments, feeding it the ham and junk folders in sequence. So, after human classification, just calling mproc does the training.
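For illustration, such a front-end might be as small as this (the folder locations are my guesses, not from the post):

    #!/bin/sh
    # mproc: train both filters from the human-sorted mbox folders.
    set -eu
    maildir=$HOME/Mail
    mail-process ham  "$maildir/ham"
    mail-process spam "$maildir/junk"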
This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.