Search Results: "rleigh"

15 October 2010

Enrico Zini: Award winning code

Yuwei and I had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/ We played with the gmp24 dataset, kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON format. The idea was to find ways of making it easier to look at the data and make sense of it. This is the story of what we did, including the code we wrote. The original dataset has several JSON files, so the first task was to put them all together:
#!/usr/bin/python
# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import os
res = []
for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)
print simplejson.dumps(res)
The results however were not ordered by date, as GMP had to use several accounts to tweet: Twitter kept putting Greater Manchester Police in jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work. Here is the code to sort the JSON data by time:
#!/usr/bin/python
# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import datetime as dt
all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
simplejson.dump(all_recs, sys.stdout)
I then wanted to play with Tf-idf for extracting the most important words of every tweet:
#!/usr/bin/python
# tfidf - Annotate JSON elements with Tf-idf extracted keywords
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Read all the tweets
records = simplejson.load(sys.stdin)
# All the tweets by ID
byid = dict(((x["id"], x) for x in records))
# Stopwords we ignore
stopwords = set(["by", "it", "and", "of", "in", "a", "to"])
# Tokenising engine
re_num = re.compile(r"^\d+$")
re_word = re.compile(r"(\w+)")
def tokenise(tweet):
    "Extract tokens from a tweet"
    for tok in tweet["text"].split():
        tok = tok.strip().lower()
        if re_num.match(tok): continue
        mo = re_word.match(tok)
        if not mo: continue
        if mo.group(1) in stopwords: continue
        yield mo.group(1)
# Extract tokens from tweets
tokenised = dict(((x["id"], list(tokenise(x))) for x in records))
# Aggregate token counts
aggregated = {}
for d in byid.iterkeys():
    for t in tokenised[d]:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1
def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok])
# Annotate tweets with keywords
res = []
for name, tweet in byid.iteritems():
    doc = tokenised[name]
    keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
    tweet["keywords"] = keywords
    res.append(tweet)
simplejson.dump(res, sys.stdout)
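To make the scoring concrete, here is a minimal sketch of the same idea on a made-up three-tweet corpus (all the data here is invented for illustration). A token scores high when it is frequent in its own tweet but rare overall; like the script above, it uses total occurrence counts rather than document frequencies for the idf denominator:
#!/usr/bin/python
# Toy Tf-idf example on an invented corpus
import math
docs = [["car", "stolen", "stockport"],
        ["car", "found", "leigh"],
        ["dog", "lost", "stockport"]]
# Count total occurrences of every token across the corpus
agg = {}
for doc in docs:
    for tok in doc:
        agg[tok] = agg.get(tok, 0) + 1
def tfidf(doc, tok):
    "Compute the TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(docs)) / agg[tok])
# "stolen" occurs once in the corpus and "car" twice, so "stolen" scores higher:
print tfidf(docs[0], "stolen"), tfidf(docs[0], "car")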
I thought this was producing a nice summary of every tweet, but nobody was particularly interested, so we moved on to adding categories to the tweets. Thanks to Yuwei, who put together some useful keyword sets, we managed to annotate each tweet with a place name (e.g. "Stockport"), a social place name (e.g. "pub", "bank") and a social category (e.g. "man", "woman", "landlord"...). The code is simple; the biggest part of the work was the dictionary of keywords:
#!/usr/bin/python
# categorise - Annotate JSON elements with categories
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
# Copyright (C) 2010  Yuwei Lin <yuwei@ylin.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester
placenames = ["Altrincham", "Sale West",
"Altrincham", "Ashton upon Mersey", "Bowdon", "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village",
"Ashton-under-Lyne",
"Ashton Hurst", "Ashton St Michael", "Ashton Waterloo", "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter",
"Blackley", "Broughton",
"Broughton", "Charlestown", "Cheetham", "Crumpsall", "Harpurhey", "Higher Blackley", "Kersal",
"Bolton North East",
"Astley Bridge", "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell", "Tonge with the Haulgh",
"Bolton South East",
"Farnworth", "Great Lever", "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth",
"Bolton West",
"Atherton", "Heaton", "Lostock", "Horwich", "Blackrod", "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor", "Westhoughton South",
"Bury North",
"Church", "East", "Elton", "Moorside", "North Manor", "Ramsbottom", "Redvales", "Tottington",
"Bury South",
"Besses", "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West", "St Mary", "Sedgley", "Unsworth",
"Cheadle",
"Bramhall North", "Bramhall South", "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green", "Stepping Hill",
"Denton", "Reddish",
"Audenshaw", "Denton North East", "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South",
"Hazel Grove",
"Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove", "Marple North", "Marple South", "Offerton",
"Heywood", "Middleton",
"Bamford", "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood", "North Middleton", "South Middleton", "West Heywood", "West Middleton",
"Leigh",
"Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East", "Leigh South", "Leigh West", "Lowton East", "Tyldesley",
"Makerfield",
"Abram", "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley", "Worsley Mesnes",
"Manchester Central",
"Ancoats", "Clayton", "Ardwick", "Bradford", "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston",
"Manchester", "Gorton",
"Fallowfield", "Gorton North", "Gorton South", "Levenshulme", "Longsight", "Rusholme", "Whalley Range",
"Manchester", "Withington",
"Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat", "Withington",
"Oldham East", "Saddleworth",
"Alexandra", "Crompton", "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James", "St Mary", "Shaw", "Waterhead",
"Oldham West", "Royton",
"Chadderton Central", "Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale", "Royton North", "Royton South", "Werneth",
"Rochdale",
"Balderstone", "Kirkholt", "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone", "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge", "Wardle", "West Littleborough",
"Salford", "Eccles",
"Claremont", "Eccles", "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North", "Swinton South", "Weaste", "Seedley",
"Stalybridge", "Hyde",
"Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth", "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South",
"Stockport",
"Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath", "Heatons North", "Heatons South", "Manor",
"Stretford", "Urmston",
"Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton", "Gorse Hill", "Longford", "Stretford", "Urmston",
"Wigan",
"Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton", "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central", "Wigan West",
"Worsley", "Eccles South",
"Barton", "Boothstown", "Ellenbrook", "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton", "Worsley",
"Wythenshawe", "Sale East",
"Baguley", "Brooklands", "Northenden", "Priory", "Sale Moor", "Sharston", "Woodhouse Park"]
# Manual coding from Yuwei
placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton",
"Trafford", "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill",
"Prestwich", "Sale", "Kearsley", ])
placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton",
"Tameside", "Blackley", "City centre", "Airport", "South Manchester",
"Rochdale", "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton",
"Chadderton", "Bury", "Ancoats", "Whalley Range", "West Yorkshire",
"Fallowfield", "New Moston", "Denton", "Stretford", "Eccles", "Pendleton",
"Leigh", "Altrincham", "Sale", "Prestwich", "Kearsley", "Hulme", "Withington",
"Moss Side", "Milnrow", "outskirt of Manchester City Centre", "Newton Heath",
"Wythenshawe", "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley",
"Higher Ince", "Clayton", "Higher Blackley", "Lowton", "Droylsden",
"Partington", "Cheetham Hill", "Benchill", "Longsight", "Didsbury",
"Westhoughton"])
# Social categories from Yuwei
soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly",
"patient", "taxi driver", "neighbour", "male", "tenant", "landlord", "child",
"children", "immigrant", "female", "workmen", "boy", "girl", "foster parents",
"next of kin"]
for i in range(100):
    soccat.append("%d-year-old" % i)
    soccat.append("%d-years-old" % i)
# Types of social locations from Yuwei
socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop",
"property", "credit card", "supermarket", "garden", "phone box", "theatre",
"toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk",
"hotel", "restaurant", "cafe", "petrol station", "bank", "school",
"university"]
extras = { "placename": placenames, "soccat": soccat, "socloc": socloc }
# Normalise keyword lists
for k, v in extras.iteritems():
    # Remove duplicates and sort by length, so that longer, more specific
    # keywords are matched first; store the result back into extras
    extras[k] = sorted(set(v), key=len, reverse=True)
# Add keywords
def add_categories(tweet):
    text = tweet["text"].lower()
    for field, categories in extras.iteritems():
        for cat in categories:
            if cat.lower() in text:
                tweet[field] = cat
                break
    return tweet
# Read all the tweets and add categories
records = (add_categories(x) for x in simplejson.load(sys.stdin))
simplejson.dump(list(records), sys.stdout)
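As a quick check of what the annotation does, here is the kind of result you get by appending a test call to the script above (the tweet text is invented for illustration). Because each keyword list is sorted by length, "car park" wins over "park"; note that the match is a plain substring test, so for instance "man" would also match inside "Manchester":
tweet = add_categories({ "text": "Man arrested in a pub car park in Stockport" })
print tweet
# { "text": ..., "placename": "Stockport", "soccat": "man", "socloc": "car park" }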
All these scripts form a nice processing chain: each script takes a list of JSON records, adds some bits and passes the result on. In order to see what we had so far, here is a simple script to convert the JSON tweets to CSV so they can be viewed in a spreadsheet:
#!/usr/bin/python
# Convert the JSON tweets to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import csv
rows = ["id", "created_at", "text", "keywords", "placename"]
writer = csv.writer(sys.stdout)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    writer.writerow([rec[row] for row in rows])
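The file names below are made up, but assuming each script is saved under the name of its task, the whole chain runs as a single pipeline (the merge script scans the current directory; all the others read JSON on standard input and write it on standard output):
./merge.py | ./sortbydate.py | ./tfidf.py | ./categorise.py | ./tocsv.py > gmp24.csv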
At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had the most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play. This script reads all the JSON tweets and builds a Xapian index from them:
#!/usr/bin/python
# toxapian - Index JSON tweets in Xapian
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem("english")
indexer = xapian.TermGenerator()
indexer.set_stemmer(stemmer)
indexer.set_database(db)
data = simplejson.load(sys.stdin)
for rec in data:
    doc = xapian.Document()
    doc.set_data(str(rec["id"]))
    indexer.set_document(doc)
    indexer.index_text_without_positions(rec["text"])
    # Index categories as categories
    if "placename" in rec:
        doc.add_boolean_term("XP" + rec["placename"].lower())
    if "soccat" in rec:
        doc.add_boolean_term("XS" + rec["soccat"].lower())
    if "socloc" in rec:
        doc.add_boolean_term("XL" + rec["socloc"].lower())
    db.add_document(doc)
db.flush()
# Also save the whole dataset so we know where to find it later if we want to
# show the details of an entry
simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
And this is a simple command line tool to query the database:
#!/usr/bin/python
# xgrep - Command line tool to query the GMP24 tweet Xapian database
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
qp.add_boolean_prefix("place", "XP")
qp.add_boolean_prefix("soc", "XS")
qp.add_boolean_prefix("loc", "XL")
query = qp.parse_query(sys.argv[2],
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
    xapian.QueryParser.FLAG_WILDCARD |
    xapian.QueryParser.FLAG_PURE_NOT |
    xapian.QueryParser.FLAG_SPELLING_CORRECTION |
    xapian.QueryParser.FLAG_AUTO_SYNONYMS)
enquire = xapian.Enquire(db)
enquire.set_query(query)
count = 40
matches = enquire.get_mset(0, count)
estimated = matches.get_matches_estimated()
print "%d/%d results" % (matches.size(), estimated)
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
for m in matches:
    rec = data[m.document.get_data()]
    print rec["text"]
print "%d/%d results" % (matches.size(), matches.get_matches_estimated())
total = db.get_doccount()
estimated = matches.get_matches_estimated()
print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)
Neat! Now that we had a proper index supporting all sorts of cool things, like stemming, tag clouds, full text search with complex queries, lookup of similar documents, keyword suggestions and so on, it was only fair to put together a web service to share it with the other people at the event. It helped that I had already written similar code for apt-xapian-index and dde before. Here is the server, quickly built on Bottle. The very last line starts the server, and it is where you can configure the listening interface and port.
#!/usr/bin/python
# xserve - Make the GMP24 tweet Xapian database available on the web
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import bottle
from bottle import route, post
from cStringIO import StringIO
import cPickle as pickle
import simplejson
import sys
import os, os.path
import xapian
import urllib
import math
bottle.debug(True)
DBNAME = sys.argv[1]
QUERYLOG = os.path.join(DBNAME, "queries.txt")
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
prefixes = { "place": "XP", "soc": "XS", "loc": "XL" }
prefix_desc = { "place": "Place name", "soc": "Social category", "loc": "Social location" }
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
for k, v in prefixes.iteritems():
    qp.add_boolean_prefix(k, v)
def make_query(qstring):
    return qp.parse_query(qstring,
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
        xapian.QueryParser.FLAG_WILDCARD |
        xapian.QueryParser.FLAG_PURE_NOT |
        xapian.QueryParser.FLAG_SPELLING_CORRECTION |
        xapian.QueryParser.FLAG_AUTO_SYNONYMS)
@route("/")
def index():
    query = urllib.unquote_plus(bottle.request.GET.get("q", ""))
    out = StringIO()
    print >>out, '''
<html>
<head>
<title>Query</title>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript">
$(function() {
    $("#queryfield")[0].focus()
});
</script>
</head>
<body>
<h1>Search</h1>
<form method="POST" action="/query">
Keywords: <input type="text" name="query" value="%s" id="queryfield">
<input type="submit">
<a href="http://xapian.org/docs/queryparser.html">Help</a>
</form>''' % query
    print >>out, '''
<p>Example: "car place:wigan"</p>

<p>Available prefixes:</p>

<ul>
'''
    for pfx in prefixes.keys():
        print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx])
    print >>out, '''
</ul>
'''
    oldqueries = []
    if os.path.exists(QUERYLOG):
        total = db.get_doccount()
        fd = open(QUERYLOG, "r")
        while True:
            try:
                q = pickle.load(fd)
            except EOFError:
                break
            oldqueries.append(q)
        fd.close()
        def print_query(q):
            count = q["count"]
            print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total)
        print >>out, "<p>Last 10 queries:</p><ul>"
        for q in oldqueries[:-11:-1]:
            print_query(q)
        print >>out, "</ul>"
        # Remove duplicates
        oldqueries = dict(((x["q"], x) for x in oldqueries)).values()
        print >>out, "<table>"
        print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>"
        print >>out, "<tr><td>"
        print >>out, "<ul>"
        oldqueries.sort(key=lambda x:x["count"], reverse=True)
        for q in oldqueries[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td><td>"
        print >>out, "<ul>"
        nonempty = [x for x in oldqueries if x["count"] > 0]
        nonempty.sort(key=lambda x:x["count"])
        for q in nonempty[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td></tr>"
        print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
@route("/query")
@route("/query/")
@post("/query")
@post("/query/")
def query():
    query = bottle.request.POST.get("query", bottle.request.GET.get("query", ""))
    enquire = xapian.Enquire(db)
    enquire.set_query(make_query(query))
    count = 40
    matches = enquire.get_mset(0, count)
    estimated = matches.get_matches_estimated()
    total = db.get_doccount()
    out = StringIO()
    print >>out, '''
<html>
<head><title>Results</title></head>
<body>
<h1>Results for "<b>%s</b>"</h1>
''' % query
    if estimated == 0:
        print >>out, "No results found."
    else:
        # Give as results the first 30 documents; also use them as the key
        # ones to use to compute relevant terms
        rset = xapian.RSet()
        for m in enquire.get_mset(0, 30):
            rset.add_document(m.document.get_docid())
        # Compute the tag cloud
        class NonTagFilter(xapian.ExpandDecider):
            def __call__(self, term):
                return not term[0].isupper() and not term[0].isdigit()
        cloud = []
        maxscore = None
        for res in enquire.get_eset(40, rset, NonTagFilter()):
            # Normalise the score in the interval [0, 1]
            weight = math.log(res.weight)
            if maxscore is None: maxscore = weight
            tag = res.term
            cloud.append([tag, float(weight) / maxscore])
        max_weight = cloud[0][1]
        min_weight = cloud[-1][1]
        cloud.sort(key=lambda x:x[0])
        def mklink(query, term):
            return "/query?query=%s" % urllib.quote_plus(query + " and " + term)
        print >>out, "<h2>Tag cloud</h2>"
        print >>out, "<blockquote>"
        for term, weight in cloud:
            size = 100 + 100.0 * (weight - min_weight) / (max_weight - min_weight)
            print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term)
        print >>out, "</blockquote>"
        print >>out, "<h2>Results</h2>"
        print >>out, "<p><a href='/'>Search again</a></p>"
        print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total)
        print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated)
        print >>out, "<ul>"
        for m in matches:
            rec = data[m.document.get_data()]
            print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"])
        print >>out, "</ul>"
        fd = open(QUERYLOG, "a")
        qinfo = dict(q=query, count=estimated)
        pickle.dump(qinfo, fd)
        fd.close()
    print >>out, '''
<a href="/">Search again</a>

</body>
</html>'''
    return out.getvalue()
@route("/item/:id")
@route("/item/:id/")
def show(id):
    rec = data[id]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Result %s</title></head>
<body>
<h1>Raw JSON record for tweet %s</h1>
<pre>''' % (rec["id"], rec["id"])
    print >>out, simplejson.dumps(rec, indent=" ")
    print >>out, '''
</pre>
</body>
</html>'''
    return out.getvalue()
@route("/catinfo/:name")
@route("/catinfo/:name/")
def catinfo(name):
    prefix = prefixes[name]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Values for %s</title></head>
<body>
''' % name
    terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)]
    terms.sort(key=lambda x:x[1], reverse=True)
    # terms is sorted by decreasing frequency
    freq_max = terms[0][1]
    freq_min = terms[-1][1]
    def mklink(name, term):
        return "/query?query=%s" % urllib.quote_plus(name + ":" + term)
    # Build tag cloud
    print >>out, "<h1>Tag cloud</h1>"
    print >>out, "<blockquote>"
    for term, freq in sorted(terms[:20], key=lambda x:x[0]):
        size = 100 + 100.0 * (freq - freq_min) / (freq_max - freq_min)
        print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term)
    print >>out, "</blockquote>"
    print >>out, "<h1>All terms</h1>"
    print >>out, "<table>"
    print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>"
    for term, freq in terms:
        print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term)
    print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
# Change here for bind host and port
bottle.run(host="0.0.0.0", port=8024)
...and then we presented our work and ended up winning the contest. This was the story of how we wrote this set of award-winning code.

19 March 2006

Clint Adams: This report is flawed, but it sure is fun

91D63469DFdnusinow1243
63DEB0EC31eloy
55A965818Fvela1243
4658510B5Amyon2143
399B7C328Dluk31-2
391880283Canibal2134
370FE53DD9opal4213
322B0920C0lool1342
29788A3F4Cjoeyh
270F932C9Cdoko
258768B1D2sjoerd
23F1BCDB73aurel3213-2
19E02FEF11jordens1243
18AB963370schizo1243
186E74A7D1jdassen(Ks)1243
1868FD549Ftbm3142
186783ED5Efpeters1--2
1791B0D3B7edd-213
16E07F1CF9rousseau321-
16248AEB73rene1243
158E635A5Erafl
14C0143D2Dbubulle4123
13D87C6781krooger(P)4213
13A436AD25jfs(P)
133D08B612msp
131E880A84fjp4213
130F7A8D01nobse
12F1968D1Bdecklin1234
12E7075A54mhatta
12D75F8533joss1342
12BF24424Csrivasta1342
12B8C1FA69sto
127F961564kobold
122A30D729pere4213
1216D970C6eric12--
115E0577F2mpitt
11307D56EDnoel3241
112BE16D01moray1342
10BC7D020Aformorer-1--
10A7D91602apollock4213
10A51A4FDDgcs
10917A225Ejordi
104B729625pvaneynd3123
10497A176Dloic
962F1A57Fpa3aba
954FD2A58glandium1342
94A5D72FErafael
913FEFC40fenio-1--
90AFC7476rra1243
890267086duck31-2
886A118E6ch321-
8801EA932joey1243
87F4E0E11waldi-123
8514B3E7Cflorian21--
841954920fs12--
82A385C57mckinstry21-3
825BFB848rleigh1243
7BC70A6FFpape1---
7B70E403Bari1243
78E2D213Ajochen(Ks)
785FEC17Fkilian
784FB46D6lwall1342
7800969EFsmimram-1--
779CC6586haas
75BFA90ECkohda
752B7487Esesse2341
729499F61sho1342
71E161AFBbarbier12--
6FC05DA69wildfire(P)
6EEB6B4C2avdyk-12-
6EDF008C5blade1243
6E25F2102mejo1342
6D1C41882adeodato(Ks)3142
6D0B433DFross12-3
6B0EBC777piman1233
69D309C3Brobert4213
6882A6C4Bkov
66BBA3C84zugschlus4213
65662C734mvo
6554FB4C6petere-1-2
637155778stratus
62D9ACC8Elars1243
62809E61Ajosem
62252FA1Afrank2143
61CF2D62Amicah
610FA4CD1cjwatson2143
5EE6DC66Ajaldhar2143
5EA59038Esgran4123
5E1EE3FB1md4312
5E0B8B2DEjaybonci
5C9A5B54Esesse(Ps,Gs) 2341
5C4CF8EC3twerner
5C2FEE5CDacid213-
5C09FD35Atille
5C03C56DFrfrancoise---1
5B7CDA2DCxam213-
5A20EBC50cavok4214
5808D0FD0don1342
5797EBFABenrico1243
55230514Asjackman
549A5F855otavio-123
53DC29B41pdm
529982E5Avorlon1243
52763483Bmkoch213-
521DB31C5smr2143
51BF8DE0Fstigge312-
512CADFA5csmall3214
50A0AC927lamont
4F2CF01A8bdale
4F095E5E4mnencia
4E9F2C747frankie
4E9ABFCD2devin2143
4E81E55C1dancer2143
4E38E7ACFhmh(Gs)1243
4E298966Djrv(P)
4DF5CE2B4huggie12-3
4DD982A75speedblue
4C671257Ddamog-1-2
4C4A3823Ekmr4213
4C0B10A5Bdexter
4C02440B8js1342
4BE9F70EAtb1342
4B7D2F063varenet-213
4A3F9E30Eschultmc1243
4A3D7B9BClawrencc2143
4A1EE761Cmadcoder21--
49DE1EEB1he3142
49D928C9Bguillem1---
49B726B71racke
490788E11jsogo2143
4864826C3gotom4321
47244970Bkroeckx2143
45B48FFAEmarga2143
454E672DEisaac1243
44B3A135Cerich1243
44597A593agmartin4213
43FCC2A90amaya1243
43F3E6426agx-1-2
43EF23CD6sanvila1342
432C9C8BDwerner(K)
4204DDF1Baquette
400D8CD16tolimar12--
3FEC23FB2bap34-1
3F972BE03tmancill4213
3F801A743nduboc1---
3EBEDB32Bchrsmrtn4123
3EA291785taggart2314
3E4D47EC1tv(P)
3E19F188Etroyh1244
3DF6807BEsrk4213
3D2A913A1psg(P)
3D097A261chrisb
3C6CEA0C9adconrad1243
3C20DF273ondrej
3B5444815ballombe1342
3B1DF9A57cate2143
3AFA44BDDweasel(Ps,Gs) 1342
3AA6541EEbrlink1442
3A824B93Fasac3144
3A71C1E00turbo
3A2D7D292seb128
39ED101BFmbanck3132
3969457F0joostvb2143
389BF7E2Bkobras1--2
386946D69mooch12-3
374886B63nathans
36F222F1Fedelhard
36D67F790foka
360B6B958geiger
3607559E6mako
35C33C1B8dirson
35921B5D8ajmitch
34C1A5BE5sjq
3431B38BApxt312-
33E7B4B73lmamane2143
327572C47ucko1342
320021490schepler1342
31DEB8EAEgoedson
31BF2305Akrala(Gs)3142
319A42D19dannf21-4
3174FEE35wookey3124
3124B26F3mfurr21-3
30A327652tschmidt312-
3090DD8D5ingo3123
30813569Fjeroen1141
30644FAB7bas1332
30123F2F2gareuselesinge1243
300530C24bam1234
2FD6645ABrmurray-1-2
2F95C2F6Dchrism(P)
2F9138496graham(Gs)3142
2F5D65169jblache1332
2F28CD102absurd
2F2597E04samu
2F0B27113patrick
2EFA6B9D5hamish(P)3142
2EE0A35C7risko4213
2E91CD250daigo
2D688E0A7qjb-21-
2D4BE1450prudhomm
2D2A6B810joussen
2CFD42F26dilinger
2CEE44978dburrows1243
2CD4C0D9Dskx4213
2BFB880A3zeevon
2BD8B050Droland3214
2B74952A9alee
2B4D6DE13paul
2B345BDD3neilm1243
2B28C5995bod4213
2B0FA4F49schoepf
2B0DDAF42awoodland
2A8061F32osamu4213
2A21AD4F9tviehmann1342
299E81DA0kaplan
2964199E2fabbe3142
28DBFEC2Fpelle
28B8D7663ametzler1342
28B143975martignlo
288C7C1F793sam2134
283E5110Fovek
2817A996Atfheen
2807CAC25abi4123
2798DD95Cpiefel
278D621B4uwe-1--
26FF0ABF2rcw2143
26E8169D2hertzog3124
26C0084FCchrisvdb
26B79D401filippo-1--
267756F5Dfrn2341
25E2EB5B4nveber123-
25C6153ADbroonie1243
25B713DF0djpig1243
250ECFB98ccontavalli(Gs)
250064181paulvt
24F71955Adajobe21-3
24E2ECA5Ajmm4213
2496A1827srittau
23E8DCCC0maxx1342
23D97C149mstone(P)2143
22DB65596dz321-
229F19BD1meskes
21F41B907marillat1---
21EB2DE66boll
21557BC10kraai1342
2144843F5lolando1243
210656584voc
20D7CA701steinm
205410E97horms
1FC992520tpo-14-
1FB0DFE9Bgildor
1FAEEB4A9neil1342
1F7E8BC63cedric21--
1F2C423BCzack1332
1F0199162kreckel4214
1ECA94FA8ishikawa2143
1EAAC62DFcyb---1
1EA2D2C41malattia-312
1E77AC835bcwhite(P)
1E66C9BB0tach
1E145F334mquinson2143
1E0BA04C1treinen321-
1DFE80FB2tali
1DE054F69azekulic(P)
1DC814B09jfs
1CB467E27kalfa
1C9132DDByoush-21-
1C87FFC2Fstevenk-1--
1C2CE8099knok321-
1BED37FD2henning(Ks)1342
1BA0A7EB5treacy(P)
1B7D86E0Fcmb4213
1B62849B3smarenka2143
1B3C281F4alain2143
1B25A5CF1omote
1ABA0E8B2sasa
1AB474598baruch2143
1AB2A91F5troup1--2
1A827CEDEafayolle(Gs)
1A6C805B9zorglub2134
1A674A359maehara
1A57D8BF7drew2143
1A269D927sharky
1A1696D2Blfousse1232
19BF42B07zinoviev--12
19057B5D3vanicat2143
18E950E00mechanix
18BB527AFgwolf1132
18A1D9A1Fjgoerzen
18807529Bultrotter2134
1872EB4E5rcardenes
185EE3E0Eangdraug12-3
1835EB2FFbossekr
180C83E8Eigloo1243
17B8357E5andreas212-
17B80220Dsjr(Gs)1342
17796A60Bsfllaw1342
175CB1AD2toni1---
1746C51F4klindsay
172D03CB1kmuto4231
171473F66ttroxell13-4
16E76D81Dseanius1243
16C63746Dhector
16C5F196Bmalex4213
16A9F3C38rkrishnan
168021CE4ron---1
166F24521pyro-123
1631B4819anfra
162EEAD8Bfalk1342
161326D40jamessan13-4
1609CD2C0berin--1-
15D8CDA7Bguus1243
15D8C12EArganesan
15D64F870zobel
159EF5DBCbs
157F045DCcamm
1564EE4B6hazelsct
15623FC45moronito4213
1551BE447torsten
154AD21B5warmenhoven
153BBA490sjg
1532005DAseamus
150973B91pjb2143
14F83C751kmccarty12-3
14DB97694khkim
14CD6E3D2wjl4213
14A8854E6weinholt1243
14950EAA6ajkessel
14298C761robertc(Ks)
142955682kamop
13FD29468bengen-213
13FD25C84roktas3142
13B047084madhack
139CCF0C7tagoh3142
139A8CCE2eugen31-2
138015E7Ethb1234
136B861C1bab2143
133FC40A4mennucc13214
12C0FCD1Awdg4312
12B05B73Arjs
1258D8781grisu31-2
1206C5AFDchewie-1-1
1200D1596joy2143
11C74E0B7alfs
119D03486francois4123
118EA3457rvr
1176015EDevo
116BD77C6alfie
112AA1DB8jh
1128287E8daf
109FC015Cgodisch
106468DEBfog--12
105792F34rla-21-
1028AF63Cforcer3142
1004DA6B4bg66
0.zufus-1--
0.zoso-123
0.ykomatsu-123
0.xtifr1243
0.xavier-312
0.wouter2143
0.will-132
0.warp1342
0.voss1342
0.vlm2314
0.vleeuwen4312
0.vince2134
0.ukai4123
0.tytso-12-
0.tjrc14213
0.tats-1-2
0.tao1--2
0.stone2134
0.stevegr1243
0.smig-1-2
0.siggi1-44
0.shaul4213
0.sharpone1243
0.sfrost1342
0.seb-21-
0.salve4213
0.ruoso1243
0.rover--12
0.rmayr-213
0.riku4123
0.rdonald12-3
0.radu-1--
0.pzn112-
0.pronovic1243
0.profeta321-
0.portnoy12-3
0.porridge1342
0.pmhahn4123
0.pmachard1--2
0.pkern3124
0.pik1--2
0.phil4213
0.pfrauenf4213
0.pfaffben2143
0.p21243
0.ossk1243
0.oohara1234
0.ohura-213
0.nwp1342
0.noshiro4312
0.noodles2134
0.nomeata2143
0.noahm3124
0.nils3132
0.nico-213
0.ms3124
0.mpalmer2143
0.moth3241
0.mlang2134
0.mjr1342
0.mjg591342
0.merker2--1
0.mbuck2143
0.mbrubeck1243
0.madduck4123
0.mace-1-2
0.luther1243
0.luigi4213
0.lss-112
0.lightsey1--2
0.ley-1-2
0.ldrolez--1-
0.lange4124
0.kirk1342
0.killer1243
0.kelbert-214
0.juanma2134
0.jtarrio1342
0.jonas4312
0.joerg1342
0.jmintha-21-
0.jimmy1243
0.jerome21--
0.jaqque1342
0.jaq4123
0.jamuraa4123
0.iwj1243
0.ivan2341
0.hsteoh3142
0.hilliard4123
0.helen1243
0.hecker3142
0.hartmans1342
0.guterm312-
0.gniibe4213
0.glaweh4213
0.gemorin4213
0.gaudenz3142
0.fw2134
0.fmw12-3
0.evan1--2
0.ender4213
0.elonen4123
0.eevans13-4
0.ean-1--
0.dwhedon4213
0.duncf2133
0.ds1342
0.dparsons1342
0.dlehn1243
0.dfrey-123
0.deek1--2
0.davidw4132
0.davidc1342
0.dave4113
0.daenzer1243
0.cupis1---
0.cts-213
0.cph4312
0.cmc2143
0.clebars2143
0.chaton-21-
0.cgb-12-
0.calvin-1-2
0.branden1342
0.brad4213
0.bnelson1342
0.blarson1342
0.benj3132
0.bayle-213
0.baran1342
0.az2134
0.awm3124
0.atterer4132
0.andressh1---
0.amu1--2
0.akumria-312
0.ajt1144
0.ajk1342
0.agi2143
0.adric2143
0.adejong1243
0.adamm12--
0.aba1143