Search Results: "rleigh"

15 October 2010

Enrico Zini: Award winning code

Yuwei and I had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/ We played with the gmp24 dataset, kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON format. The idea was to find ways of making it easier to look at the data and make sense of it. This is the story of what we did, including the code we wrote. The original dataset has several JSON files, so the first task was to put them all together:
#!/usr/bin/python
# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import os
res = []
for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)
print simplejson.dumps(res)
The results however were not ordered by date, as GMP had to use several accounts to tweet: Twitter kept putting Greater Manchester Police in jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work. Here is the code to sort the JSON data by time:
#!/usr/bin/python
# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import datetime as dt
all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
simplejson.dump(all_recs, sys.stdout)
I then wanted to play with Tf-idf for extracting the most important words of every tweet:
#!/usr/bin/python
# tfidf - Annotate JSON elements with Tf-idf extracted keywords
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Read all the tweets
records = simplejson.load(sys.stdin)
# All the tweets by ID
byid = dict(((x["id"], x) for x in records))
# Stopwords we ignore
stopwords = set(["by", "it", "and", "of", "in", "a", "to"])
# Tokenising engine
re_num = re.compile(r"^\d+$")
re_word = re.compile(r"(\w+)")
def tokenise(tweet):
    "Extract tokens from a tweet"
    for tok in tweet["text"].split():
        tok = tok.strip().lower()
        if re_num.match(tok): continue
        mo = re_word.match(tok)
        if not mo: continue
        if mo.group(1) in stopwords: continue
        yield mo.group(1)
# Extract tokens from tweets
tokenised = dict(((x["id"], list(tokenise(x))) for x in records))
# Aggregate token counts
aggregated = {}
for d in byid.iterkeys():
    for t in tokenised[d]:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1
def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok])
# Annotate tweets with keywords
res = []
for name, tweet in byid.iteritems():
    doc = tokenised[name]
    keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
    tweet["keywords"] = keywords
    res.append(tweet)
simplejson.dump(res, sys.stdout)
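To make the scoring concrete, here is a minimal sketch of the same idea on a made-up three-tweet corpus (all the data here is invented for illustration). A token scores high when it is frequent in its own tweet but rare overall; like the script above, it uses total occurrence counts rather than document frequencies for the idf denominator:
#!/usr/bin/python
# Toy Tf-idf example on an invented corpus
import math
docs = [["car", "stolen", "stockport"],
        ["car", "found", "leigh"],
        ["dog", "lost", "stockport"]]
# Count total occurrences of every token across the corpus
agg = {}
for doc in docs:
    for tok in doc:
        agg[tok] = agg.get(tok, 0) + 1
def tfidf(doc, tok):
    "Compute the TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(docs)) / agg[tok])
# "stolen" occurs once in the corpus and "car" twice, so "stolen" scores higher:
print tfidf(docs[0], "stolen"), tfidf(docs[0], "car")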
I thought this was producing a nice summary of every tweet, but nobody was particularly interested, so we moved on to adding categories to the tweets. Thanks to Yuwei, who put together some useful keyword sets, we managed to annotate each tweet with a place name (e.g. "Stockport"), a social place name (e.g. "pub", "bank") and a social category (e.g. "man", "woman", "landlord"...). The code is simple; the biggest part of the work was the dictionary of keywords:
#!/usr/bin/python
# categorise - Annotate JSON elements with categories
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
# Copyright (C) 2010  Yuwei Lin <yuwei@ylin.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester
placenames = ["Altrincham", "Sale West",
"Altrincham", "Ashton upon Mersey", "Bowdon", "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village",
"Ashton-under-Lyne",
"Ashton Hurst", "Ashton St Michael", "Ashton Waterloo", "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter",
"Blackley", "Broughton",
"Broughton", "Charlestown", "Cheetham", "Crumpsall", "Harpurhey", "Higher Blackley", "Kersal",
"Bolton North East",
"Astley Bridge", "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell", "Tonge with the Haulgh",
"Bolton South East",
"Farnworth", "Great Lever", "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth",
"Bolton West",
"Atherton", "Heaton", "Lostock", "Horwich", "Blackrod", "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor", "Westhoughton South",
"Bury North",
"Church", "East", "Elton", "Moorside", "North Manor", "Ramsbottom", "Redvales", "Tottington",
"Bury South",
"Besses", "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West", "St Mary", "Sedgley", "Unsworth",
"Cheadle",
"Bramhall North", "Bramhall South", "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green", "Stepping Hill",
"Denton", "Reddish",
"Audenshaw", "Denton North East", "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South",
"Hazel Grove",
"Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove", "Marple North", "Marple South", "Offerton",
"Heywood", "Middleton",
"Bamford", "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood", "North Middleton", "South Middleton", "West Heywood", "West Middleton",
"Leigh",
"Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East", "Leigh South", "Leigh West", "Lowton East", "Tyldesley",
"Makerfield",
"Abram", "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley", "Worsley Mesnes",
"Manchester Central",
"Ancoats", "Clayton", "Ardwick", "Bradford", "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston",
"Manchester", "Gorton",
"Fallowfield", "Gorton North", "Gorton South", "Levenshulme", "Longsight", "Rusholme", "Whalley Range",
"Manchester", "Withington",
"Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat", "Withington",
"Oldham East", "Saddleworth",
"Alexandra", "Crompton", "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James", "St Mary", "Shaw", "Waterhead",
"Oldham West", "Royton",
"Chadderton Central", "Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale", "Royton North", "Royton South", "Werneth",
"Rochdale",
"Balderstone", "Kirkholt", "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone", "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge", "Wardle", "West Littleborough",
"Salford", "Eccles",
"Claremont", "Eccles", "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North", "Swinton South", "Weaste", "Seedley",
"Stalybridge", "Hyde",
"Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth", "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South",
"Stockport",
"Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath", "Heatons North", "Heatons South", "Manor",
"Stretford", "Urmston",
"Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton", "Gorse Hill", "Longford", "Stretford", "Urmston",
"Wigan",
"Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton", "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central", "Wigan West",
"Worsley", "Eccles South",
"Barton", "Boothstown", "Ellenbrook", "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton", "Worsley",
"Wythenshawe", "Sale East",
"Baguley", "Brooklands", "Northenden", "Priory", "Sale Moor", "Sharston", "Woodhouse Park"]
# Manual coding from Yuwei
placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton",
"Trafford", "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill",
"Prestwich", "Sale", "Kearsley", ])
placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton",
"Tameside", "Blackley", "City centre", "Airport", "South Manchester",
"Rochdale", "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton",
"Chadderton", "Bury", "Ancoats", "Whalley Range", "West Yorkshire",
"Fallowfield", "New Moston", "Denton", "Stretford", "Eccles", "Pendleton",
"Leigh", "Altrincham", "Sale", "Prestwich", "Kearsley", "Hulme", "Withington",
"Moss Side", "Milnrow", "outskirt of Manchester City Centre", "Newton Heath",
"Wythenshawe", "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley",
"Higher Ince", "Clayton", "Higher Blackley", "Lowton", "Droylsden",
"Partington", "Cheetham Hill", "Benchill", "Longsight", "Didsbury",
"Westhoughton"])
# Social categories from Yuwei
soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly",
"patient", "taxi driver", "neighbour", "male", "tenant", "landlord", "child",
"children", "immigrant", "female", "workmen", "boy", "girl", "foster parents",
"next of kin"]
for i in range(100):
    soccat.append("%d-year-old" % i)
    soccat.append("%d-years-old" % i)
# Types of social locations from Yuwei
socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop",
"property", "credit card", "supermarket", "garden", "phone box", "theatre",
"toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk",
"hotel", "restaurant", "cafe", "petrol station", "bank", "school",
"university"]
extras = { "placename": placenames, "soccat": soccat, "socloc": socloc }
# Normalise keyword lists
for k, v in extras.iteritems():
    # Remove duplicates and sort by length, so that longer, more specific
    # keywords are matched first; store the result back into extras
    extras[k] = sorted(set(v), key=len, reverse=True)
# Add keywords
def add_categories(tweet):
    text = tweet["text"].lower()
    for field, categories in extras.iteritems():
        for cat in categories:
            if cat.lower() in text:
                tweet[field] = cat
                break
    return tweet
# Read all the tweets and add categories
records = (add_categories(x) for x in simplejson.load(sys.stdin))
simplejson.dump(list(records), sys.stdout)
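As a quick check of what the annotation does, here is the kind of result you get by appending a test call to the script above (the tweet text is invented for illustration). Because each keyword list is sorted by length, "car park" wins over "park"; note that the match is a plain substring test, so for instance "man" would also match inside "Manchester":
tweet = add_categories({ "text": "Man arrested in a pub car park in Stockport" })
print tweet
# { "text": ..., "placename": "Stockport", "soccat": "man", "socloc": "car park" }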
All these scripts form a nice processing chain: each script takes a list of JSON records, adds some bits and passes the result on. In order to see what we had so far, here is a simple script to convert the JSON tweets to CSV so they can be viewed in a spreadsheet:
#!/usr/bin/python
# Convert the JSON tweets to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import csv
rows = ["id", "created_at", "text", "keywords", "placename"]
writer = csv.writer(sys.stdout)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    writer.writerow([rec[row] for row in rows])
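The file names below are made up, but assuming each script is saved under the name of its task, the whole chain runs as a single pipeline (the merge script scans the current directory; all the others read JSON on standard input and write it on standard output):
./merge.py | ./sortbydate.py | ./tfidf.py | ./categorise.py | ./tocsv.py > gmp24.csv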
At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had the most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play. This script reads all the JSON tweets and builds a Xapian index from them:
#!/usr/bin/python
# toxapian - Index JSON tweets in Xapian
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem("english")
indexer = xapian.TermGenerator()
indexer.set_stemmer(stemmer)
indexer.set_database(db)
data = simplejson.load(sys.stdin)
for rec in data:
    doc = xapian.Document()
    doc.set_data(str(rec["id"]))
    indexer.set_document(doc)
    indexer.index_text_without_positions(rec["text"])
    # Index categories as categories
    if "placename" in rec:
        doc.add_boolean_term("XP" + rec["placename"].lower())
    if "soccat" in rec:
        doc.add_boolean_term("XS" + rec["soccat"].lower())
    if "socloc" in rec:
        doc.add_boolean_term("XL" + rec["socloc"].lower())
    db.add_document(doc)
db.flush()
# Also save the whole dataset so we know where to find it later if we want to
# show the details of an entry
simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
And this is a simple command line tool to query the database:
#!/usr/bin/python
# xgrep - Command line tool to query the GMP24 tweet Xapian database
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
qp.add_boolean_prefix("place", "XP")
qp.add_boolean_prefix("soc", "XS")
qp.add_boolean_prefix("loc", "XL")
query = qp.parse_query(sys.argv[2],
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
    xapian.QueryParser.FLAG_WILDCARD |
    xapian.QueryParser.FLAG_PURE_NOT |
    xapian.QueryParser.FLAG_SPELLING_CORRECTION |
    xapian.QueryParser.FLAG_AUTO_SYNONYMS)
enquire = xapian.Enquire(db)
enquire.set_query(query)
count = 40
matches = enquire.get_mset(0, count)
estimated = matches.get_matches_estimated()
print "%d/%d results" % (matches.size(), estimated)
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
for m in matches:
    rec = data[m.document.get_data()]
    print rec["text"]
print "%d/%d results" % (matches.size(), matches.get_matches_estimated())
total = db.get_doccount()
estimated = matches.get_matches_estimated()
print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)
Neat! Now that we had a proper index supporting all sorts of cool things, like stemming, tag clouds, full text search with complex queries, lookup of similar documents, keyword suggestions and so on, it was only fair to put together a web service to share it with the other people at the event. It helped that I had already written similar code for apt-xapian-index and dde before. Here is the server, quickly built on Bottle. The very last line starts the server, and it is where you can configure the listening interface and port.
#!/usr/bin/python
# xserve - Make the GMP24 tweet Xapian database available on the web
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import bottle
from bottle import route, post
from cStringIO import StringIO
import cPickle as pickle
import simplejson
import sys
import os, os.path
import xapian
import urllib
import math
bottle.debug(True)
DBNAME = sys.argv[1]
QUERYLOG = os.path.join(DBNAME, "queries.txt")
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
prefixes = { "place": "XP", "soc": "XS", "loc": "XL" }
prefix_desc = { "place": "Place name", "soc": "Social category", "loc": "Social location" }
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
for k, v in prefixes.iteritems():
    qp.add_boolean_prefix(k, v)
def make_query(qstring):
    return qp.parse_query(qstring,
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
        xapian.QueryParser.FLAG_WILDCARD |
        xapian.QueryParser.FLAG_PURE_NOT |
        xapian.QueryParser.FLAG_SPELLING_CORRECTION |
        xapian.QueryParser.FLAG_AUTO_SYNONYMS)
@route("/")
def index():
    query = urllib.unquote_plus(bottle.request.GET.get("q", ""))
    out = StringIO()
    print >>out, '''
<html>
<head>
<title>Query</title>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript">
$(function() {
    $("#queryfield")[0].focus()
});
</script>
</head>
<body>
<h1>Search</h1>
<form method="POST" action="/query">
Keywords: <input type="text" name="query" value="%s" id="queryfield">
<input type="submit">
<a href="http://xapian.org/docs/queryparser.html">Help</a>
</form>''' % query
    print >>out, '''
<p>Example: "car place:wigan"</p>

<p>Available prefixes:</p>

<ul>
'''
    for pfx in prefixes.keys():
        print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx])
    print >>out, '''
</ul>
'''
    oldqueries = []
    if os.path.exists(QUERYLOG):
        total = db.get_doccount()
        fd = open(QUERYLOG, "r")
        while True:
            try:
                q = pickle.load(fd)
            except EOFError:
                break
            oldqueries.append(q)
        fd.close()
        def print_query(q):
            count = q["count"]
            print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total)
        print >>out, "<p>Last 10 queries:</p><ul>"
        for q in oldqueries[:-11:-1]:
            print_query(q)
        print >>out, "</ul>"
        # Remove duplicates
        oldqueries = dict(((x["q"], x) for x in oldqueries)).values()
        print >>out, "<table>"
        print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>"
        print >>out, "<tr><td>"
        print >>out, "<ul>"
        oldqueries.sort(key=lambda x:x["count"], reverse=True)
        for q in oldqueries[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td><td>"
        print >>out, "<ul>"
        nonempty = [x for x in oldqueries if x["count"] > 0]
        nonempty.sort(key=lambda x:x["count"])
        for q in nonempty[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td></tr>"
        print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
@route("/query")
@route("/query/")
@post("/query")
@post("/query/")
def query():
    query = bottle.request.POST.get("query", bottle.request.GET.get("query", ""))
    enquire = xapian.Enquire(db)
    enquire.set_query(make_query(query))
    count = 40
    matches = enquire.get_mset(0, count)
    estimated = matches.get_matches_estimated()
    total = db.get_doccount()
    out = StringIO()
    print >>out, '''
<html>
<head><title>Results</title></head>
<body>
<h1>Results for "<b>%s</b>"</h1>
''' % query
    if estimated == 0:
        print >>out, "No results found."
    else:
        # Give as results the first 30 documents; also use them as the key
        # ones to use to compute relevant terms
        rset = xapian.RSet()
        for m in enquire.get_mset(0, 30):
            rset.add_document(m.document.get_docid())
        # Compute the tag cloud
        class NonTagFilter(xapian.ExpandDecider):
            def __call__(self, term):
                return not term[0].isupper() and not term[0].isdigit()
        cloud = []
        maxscore = None
        for res in enquire.get_eset(40, rset, NonTagFilter()):
            # Normalise the score in the interval [0, 1]
            weight = math.log(res.weight)
            if maxscore is None: maxscore = weight
            tag = res.term
            cloud.append([tag, float(weight) / maxscore])
        max_weight = cloud[0][1]
        min_weight = cloud[-1][1]
        cloud.sort(key=lambda x:x[0])
        def mklink(query, term):
            return "/query?query=%s" % urllib.quote_plus(query + " and " + term)
        print >>out, "<h2>Tag cloud</h2>"
        print >>out, "<blockquote>"
        for term, weight in cloud:
            size = 100 + 100.0 * (weight - min_weight) / (max_weight - min_weight)
            print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term)
        print >>out, "</blockquote>"
        print >>out, "<h2>Results</h2>"
        print >>out, "<p><a href='/'>Search again</a></p>"
        print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total)
        print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated)
        print >>out, "<ul>"
        for m in matches:
            rec = data[m.document.get_data()]
            print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"])
        print >>out, "</ul>"
        fd = open(QUERYLOG, "a")
        qinfo = dict(q=query, count=estimated)
        pickle.dump(qinfo, fd)
        fd.close()
    print >>out, '''
<a href="/">Search again</a>

</body>
</html>'''
    return out.getvalue()
@route("/item/:id")
@route("/item/:id/")
def show(id):
    rec = data[id]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Result %s</title></head>
<body>
<h1>Raw JSON record for tweet %s</h1>
<pre>''' % (rec["id"], rec["id"])
    print >>out, simplejson.dumps(rec, indent=" ")
    print >>out, '''
</pre>
</body>
</html>'''
    return out.getvalue()
@route("/catinfo/:name")
@route("/catinfo/:name/")
def catinfo(name):
    prefix = prefixes[name]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Values for %s</title></head>
<body>
''' % name
    terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)]
    terms.sort(key=lambda x:x[1], reverse=True)
    # terms is sorted by decreasing frequency
    freq_max = terms[0][1]
    freq_min = terms[-1][1]
    def mklink(name, term):
        return "/query?query=%s" % urllib.quote_plus(name + ":" + term)
    # Build tag cloud
    print >>out, "<h1>Tag cloud</h1>"
    print >>out, "<blockquote>"
    for term, freq in sorted(terms[:20], key=lambda x:x[0]):
        size = 100 + 100.0 * (freq - freq_min) / (freq_max - freq_min)
        print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term)
    print >>out, "</blockquote>"
    print >>out, "<h1>All terms</h1>"
    print >>out, "<table>"
    print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>"
    for term, freq in terms:
        print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term)
    print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
# Change here for bind host and port
bottle.run(host="0.0.0.0", port=8024)
...and then we presented our work and ended up winning the contest. This was the story of how we wrote this set of award-winning code.

19 March 2006

Clint Adams: This report is flawed, but it sure is fun

91D63469DFdnusinow1243
63DEB0EC31eloy
55A965818Fvela1243
4658510B5Amyon2143
399B7C328Dluk31-2
391880283Canibal2134
370FE53DD9opal4213
322B0920C0lool1342
29788A3F4Cjoeyh
270F932C9Cdoko
258768B1D2sjoerd
23F1BCDB73aurel3213-2
19E02FEF11jordens1243
18AB963370schizo1243
186E74A7D1jdassen(Ks)1243
1868FD549Ftbm3142
186783ED5Efpeters1--2
1791B0D3B7edd-213
16E07F1CF9rousseau321-
16248AEB73rene1243
158E635A5Erafl
14C0143D2Dbubulle4123
13D87C6781krooger(P)4213
13A436AD25jfs(P)
133D08B612msp
131E880A84fjp4213
130F7A8D01nobse
12F1968D1Bdecklin1234
12E7075A54mhatta
12D75F8533joss1342
12BF24424Csrivasta1342
12B8C1FA69sto
127F961564kobold
122A30D729pere4213
1216D970C6eric12--
115E0577F2mpitt
11307D56EDnoel3241
112BE16D01moray1342
10BC7D020Aformorer-1--
10A7D91602apollock4213
10A51A4FDDgcs
10917A225Ejordi
104B729625pvaneynd3123
10497A176Dloic
962F1A57Fpa3aba
954FD2A58glandium1342
94A5D72FErafael
913FEFC40fenio-1--
90AFC7476rra1243
890267086duck31-2
886A118E6ch321-
8801EA932joey1243
87F4E0E11waldi-123
8514B3E7Cflorian21--
841954920fs12--
82A385C57mckinstry21-3
825BFB848rleigh1243
7BC70A6FFpape1---
7B70E403Bari1243
78E2D213Ajochen(Ks)
785FEC17Fkilian
784FB46D6lwall1342
7800969EFsmimram-1--
779CC6586haas
75BFA90ECkohda
752B7487Esesse2341
729499F61sho1342
71E161AFBbarbier12--
6FC05DA69wildfire(P)
6EEB6B4C2avdyk-12-
6EDF008C5blade1243
6E25F2102mejo1342
6D1C41882adeodato(Ks)3142
6D0B433DFross12-3
6B0EBC777piman1233
69D309C3Brobert4213
6882A6C4Bkov
66BBA3C84zugschlus4213
65662C734mvo
6554FB4C6petere-1-2
637155778stratus
62D9ACC8Elars1243
62809E61Ajosem
62252FA1Afrank2143
61CF2D62Amicah
610FA4CD1cjwatson2143
5EE6DC66Ajaldhar2143
5EA59038Esgran4123
5E1EE3FB1md4312
5E0B8B2DEjaybonci
5C9A5B54Esesse(Ps,Gs) 2341
5C4CF8EC3twerner
5C2FEE5CDacid213-
5C09FD35Atille
5C03C56DFrfrancoise---1
5B7CDA2DCxam213-
5A20EBC50cavok4214
5808D0FD0don1342
5797EBFABenrico1243
55230514Asjackman
549A5F855otavio-123
53DC29B41pdm
529982E5Avorlon1243
52763483Bmkoch213-
521DB31C5smr2143
51BF8DE0Fstigge312-
512CADFA5csmall3214
50A0AC927lamont
4F2CF01A8bdale
4F095E5E4mnencia
4E9F2C747frankie
4E9ABFCD2devin2143
4E81E55C1dancer2143
4E38E7ACFhmh(Gs)1243
4E298966Djrv(P)
4DF5CE2B4huggie12-3
4DD982A75speedblue
4C671257Ddamog-1-2
4C4A3823Ekmr4213
4C0B10A5Bdexter
4C02440B8js1342
4BE9F70EAtb1342
4B7D2F063varenet-213
4A3F9E30Eschultmc1243
4A3D7B9BClawrencc2143
4A1EE761Cmadcoder21--
49DE1EEB1he3142
49D928C9Bguillem1---
49B726B71racke
490788E11jsogo2143
4864826C3gotom4321
47244970Bkroeckx2143
45B48FFAEmarga2143
454E672DEisaac1243
44B3A135Cerich1243
44597A593agmartin4213
43FCC2A90amaya1243
43F3E6426agx-1-2
43EF23CD6sanvila1342
432C9C8BDwerner(K)
4204DDF1Baquette
400D8CD16tolimar12--
3FEC23FB2bap34-1
3F972BE03tmancill4213
3F801A743nduboc1---
3EBEDB32Bchrsmrtn4123
3EA291785taggart2314
3E4D47EC1tv(P)
3E19F188Etroyh1244
3DF6807BEsrk4213
3D2A913A1psg(P)
3D097A261chrisb
3C6CEA0C9adconrad1243
3C20DF273ondrej
3B5444815ballombe1342
3B1DF9A57cate2143
3AFA44BDDweasel(Ps,Gs) 1342
3AA6541EEbrlink1442
3A824B93Fasac3144
3A71C1E00turbo
3A2D7D292seb128
39ED101BFmbanck3132
3969457F0joostvb2143
389BF7E2Bkobras1--2
386946D69mooch12-3
374886B63nathans
36F222F1Fedelhard
36D67F790foka
360B6B958geiger
3607559E6mako
35C33C1B8dirson
35921B5D8ajmitch
34C1A5BE5sjq
3431B38BApxt312-
33E7B4B73lmamane2143
327572C47ucko1342
320021490schepler1342
31DEB8EAEgoedson
31BF2305Akrala(Gs)3142
319A42D19dannf21-4
3174FEE35wookey3124
3124B26F3mfurr21-3
30A327652tschmidt312-
3090DD8D5ingo3123
30813569Fjeroen1141
30644FAB7bas1332
30123F2F2gareuselesinge1243
300530C24bam1234
2FD6645ABrmurray-1-2
2F95C2F6Dchrism(P)
2F9138496graham(Gs)3142
2F5D65169jblache1332
2F28CD102absurd
2F2597E04samu
2F0B27113patrick
2EFA6B9D5hamish(P)3142
2EE0A35C7risko4213
2E91CD250daigo
2D688E0A7qjb-21-
2D4BE1450prudhomm
2D2A6B810joussen
2CFD42F26dilinger
2CEE44978dburrows1243
2CD4C0D9Dskx4213
2BFB880A3zeevon
2BD8B050Droland3214
2B74952A9alee
2B4D6DE13paul
2B345BDD3neilm1243
2B28C5995bod4213
2B0FA4F49schoepf
2B0DDAF42awoodland
2A8061F32osamu4213
2A21AD4F9tviehmann1342
299E81DA0kaplan
2964199E2fabbe3142
28DBFEC2Fpelle
28B8D7663ametzler1342
28B143975martignlo
288C7C1F793sam2134
283E5110Fovek
2817A996Atfheen
2807CAC25abi4123
2798DD95Cpiefel
278D621B4uwe-1--
26FF0ABF2rcw2143
26E8169D2hertzog3124
26C0084FCchrisvdb
26B79D401filippo-1--
267756F5Dfrn2341
25E2EB5B4nveber123-
25C6153ADbroonie1243
25B713DF0djpig1243
250ECFB98ccontavalli(Gs)
250064181paulvt
24F71955Adajobe21-3
24E2ECA5Ajmm4213
2496A1827srittau
23E8DCCC0maxx1342
23D97C149mstone(P)2143
22DB65596dz321-
229F19BD1meskes
21F41B907marillat1---
21EB2DE66boll
21557BC10kraai1342
2144843F5lolando1243
210656584voc
20D7CA701steinm
205410E97horms
1FC992520tpo-14-
1FB0DFE9Bgildor
1FAEEB4A9neil1342
1F7E8BC63cedric21--
1F2C423BCzack1332
1F0199162kreckel4214
1ECA94FA8ishikawa2143
1EAAC62DFcyb---1
1EA2D2C41malattia-312
1E77AC835bcwhite(P)
1E66C9BB0tach
1E145F334mquinson2143
1E0BA04C1treinen321-
1DFE80FB2tali
1DE054F69azekulic(P)
1DC814B09jfs
1CB467E27kalfa
1C9132DDByoush-21-
1C87FFC2Fstevenk-1--
1C2CE8099knok321-
1BED37FD2henning(Ks)1342
1BA0A7EB5treacy(P)
1B7D86E0Fcmb4213
1B62849B3smarenka2143
1B3C281F4alain2143
1B25A5CF1omote
1ABA0E8B2sasa
1AB474598baruch2143
1AB2A91F5troup1--2
1A827CEDEafayolle(Gs)
1A6C805B9zorglub2134
1A674A359maehara
1A57D8BF7drew2143
1A269D927sharky
1A1696D2Blfousse1232
19BF42B07zinoviev--12
19057B5D3vanicat2143
18E950E00mechanix
18BB527AFgwolf1132
18A1D9A1Fjgoerzen
18807529Bultrotter2134
1872EB4E5rcardenes
185EE3E0Eangdraug12-3
1835EB2FFbossekr
180C83E8Eigloo1243
17B8357E5andreas212-
17B80220Dsjr(Gs)1342
17796A60Bsfllaw1342
175CB1AD2toni1---
1746C51F4klindsay
172D03CB1kmuto4231
171473F66ttroxell13-4
16E76D81Dseanius1243
16C63746Dhector
16C5F196Bmalex4213
16A9F3C38rkrishnan
168021CE4ron---1
166F24521pyro-123
1631B4819anfra
162EEAD8Bfalk1342
161326D40jamessan13-4
1609CD2C0berin--1-
15D8CDA7Bguus1243
15D8C12EArganesan
15D64F870zobel
159EF5DBCbs
157F045DCcamm
1564EE4B6hazelsct
15623FC45moronito4213
1551BE447torsten
154AD21B5warmenhoven
153BBA490sjg
1532005DAseamus
150973B91pjb2143
14F83C751kmccarty12-3
14DB97694khkim
14CD6E3D2wjl4213
14A8854E6weinholt1243
14950EAA6ajkessel
14298C761robertc(Ks)
142955682kamop
13FD29468bengen-213
13FD25C84roktas3142
13B047084madhack
139CCF0C7tagoh3142
139A8CCE2eugen31-2
138015E7Ethb1234
136B861C1bab2143
133FC40A4mennucc13214
12C0FCD1Awdg4312
12B05B73Arjs
1258D8781grisu31-2
1206C5AFDchewie-1-1
1200D1596joy2143
11C74E0B7alfs
119D03486francois4123
118EA3457rvr
1176015EDevo
116BD77C6alfie
112AA1DB8jh
1128287E8daf
109FC015Cgodisch
106468DEBfog--12
105792F34rla-21-
1028AF63Cforcer3142
1004DA6B4bg66
0.zufus-1--
0.zoso-123
0.ykomatsu-123
0.xtifr1243
0.xavier-312
0.wouter2143
0.will-132
0.warp1342
0.voss1342
0.vlm2314
0.vleeuwen4312
0.vince2134
0.ukai4123
0.tytso-12-
0.tjrc14213
0.tats-1-2
0.tao1--2
0.stone2134
0.stevegr1243
0.smig-1-2
0.siggi1-44
0.shaul4213
0.sharpone1243
0.sfrost1342
0.seb-21-
0.salve4213
0.ruoso1243
0.rover--12
0.rmayr-213
0.riku4123
0.rdonald12-3
0.radu-1--
0.pzn112-
0.pronovic1243
0.profeta321-
0.portnoy12-3
0.porridge1342
0.pmhahn4123
0.pmachard1--2
0.pkern3124
0.pik1--2
0.phil4213
0.pfrauenf4213
0.pfaffben2143
0.p21243
0.ossk1243
0.oohara1234
0.ohura-213
0.nwp1342
0.noshiro4312
0.noodles2134
0.nomeata2143
0.noahm3124
0.nils3132
0.nico-213
0.ms3124
0.mpalmer2143
0.moth3241
0.mlang2134
0.mjr1342
0.mjg591342
0.merker2--1
0.mbuck2143
0.mbrubeck1243
0.madduck4123
0.mace-1-2
0.luther1243
0.luigi4213
0.lss-112
0.lightsey1--2
0.ley-1-2
0.ldrolez--1-
0.lange4124
0.kirk1342
0.killer1243
0.kelbert-214
0.juanma2134
0.jtarrio1342
0.jonas4312
0.joerg1342
0.jmintha-21-
0.jimmy1243
0.jerome21--
0.jaqque1342
0.jaq4123
0.jamuraa4123
0.iwj1243
0.ivan2341
0.hsteoh3142
0.hilliard4123
0.helen1243
0.hecker3142
0.hartmans1342
0.guterm312-
0.gniibe4213
0.glaweh4213
0.gemorin4213
0.gaudenz3142
0.fw2134
0.fmw12-3
0.evan1--2
0.ender4213
0.elonen4123
0.eevans13-4
0.ean-1--
0.dwhedon4213
0.duncf2133
0.ds1342
0.dparsons1342
0.dlehn1243
0.dfrey-123
0.deek1--2
0.davidw4132
0.davidc1342
0.dave4113
0.daenzer1243
0.cupis1---
0.cts-213
0.cph4312
0.cmc2143
0.clebars2143
0.chaton-21-
0.cgb-12-
0.calvin-1-2
0.branden1342
0.brad4213
0.bnelson1342
0.blarson1342
0.benj3132
0.bayle-213
0.baran1342
0.az2134
0.awm3124
0.atterer4132
0.andressh1---
0.amu1--2
0.akumria-312
0.ajt1144
0.ajk1342
0.agi2143
0.adric2143
0.adejong1243
0.adamm12--
0.aba1143