Some months ago I came across
Sketch Engine, a website that offers a collection of pre-loaded corpora in several languages and, among other things, the ability to automatically extract
collocation information from them.
You can get a 30-day free trial account if you want to check it out. I thought it was really cool, but a bit pricey: 55 euros/year for an individual and 1080 euros/year for a site (up to 50 employees and students), and those are the academic licenses! And I don’t even have any real need for it.
So I thought it would be an interesting project to build something similar, albeit focusing only on the ‘word sketching’ part, as described in
this paper.
After a weekend I got it working, although I didn’t spend a second on making it look good, as you can see:
Collocations and other stuff.
For now only the corpus of the State of the Union addresses is loaded, with almost 400,000 words. You can select that corpus, click on sketch and get the sketch of any word, for example,
the ‘word sketch’ for problem.
We can see that the adjective used most often with ‘problem’ is ‘serious’, although if we look at the relative frequency it’s ‘complex’. The verb that most often takes ‘problem’ as its object is ‘solve’, followed by ‘approach’, ‘address’ and ‘deal’. You can also click on the numbers to see the actual sentences in which these words appear, for example,
‘serious problem’.
So how does it work? First of all, it does part-of-speech tagging using
Apertium. Once the text is POS-tagged, we apply a set of ‘regular expression’-like rules to identify the relations between words, such as:
*DUAL
=a_modifier/modifies
2:[tag=adj] [tag=n] 0,2 1:[tag=n] [tag!=n]
This rule expresses the relation between adjectives and the nouns they modify, matching sentences like ‘the red ball’ and ‘the red football ball’. Each relation is stored in the database with extra info about its position in the text. Once the database is created, accessing it to display the sketch and concordance information is really simple.
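To make the idea more concrete, here is a minimal sketch of how such a rule could be matched and stored. It is not the code behind the site. It assumes my own reading of the rule (an adjective, then zero to two nouns, then the head noun, then any token that is not a noun), simplified tag names, and a hypothetical one-table SQLite schema; the function names are made up for illustration. Only the a_modifier direction of the relation is stored.

```python
import sqlite3
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag); tag names such as "adj" and "n" are assumptions


def find_modifier_pairs(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (noun_index, adjective_index) pairs matching the rule as read above."""
    pairs = []
    for i, (_, tag) in enumerate(tokens):
        if tag != "adj":
            continue
        # Walk over the run of nouns that directly follows the adjective.
        j = i + 1
        while j < len(tokens) and tokens[j][1] == "n":
            j += 1
        run_length = j - (i + 1)
        # Need 1 to 3 nouns (0-2 intervening plus the head noun) and a
        # following token that is not a noun (so j must be inside the list).
        if 1 <= run_length <= 3 and j < len(tokens):
            pairs.append((j - 1, i))
    return pairs


def store_relations(conn: sqlite3.Connection, sentence_id: int, tokens: List[Token]) -> None:
    """Store each matched relation with its sentence and token position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relation ("
        "name TEXT, head TEXT, dependent TEXT, sentence INTEGER, position INTEGER)"
    )
    for noun_i, adj_i in find_modifier_pairs(tokens):
        conn.execute(
            "INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
            ("a_modifier", tokens[noun_i][0], tokens[adj_i][0], sentence_id, noun_i),
        )


def sketch(conn: sqlite3.Connection, word: str) -> List[Tuple[str, int]]:
    """Return the adjectives modifying `word`, most frequent first."""
    rows = conn.execute(
        "SELECT dependent, COUNT(*) FROM relation "
        "WHERE name = 'a_modifier' AND head = ? "
        "GROUP BY dependent ORDER BY COUNT(*) DESC, dependent",
        (word,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sentences = [
        [("the", "det"), ("red", "adj"), ("ball", "n"), ("bounced", "vblex")],
        [("a", "det"), ("red", "adj"), ("football", "n"), ("ball", "n"), ("rolled", "vblex")],
        [("a", "det"), ("big", "adj"), ("ball", "n"), (".", "sent")],
    ]
    for sentence_id, tokens in enumerate(sentences):
        store_relations(conn, sentence_id, tokens)
    print(sketch(conn, "ball"))
```

Keeping the sentence and position columns next to each relation is what would make the concordance view cheap: clicking on a count only needs a lookup of the stored sentence ids.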
The site and auxiliary tools were written in around 1400 lines of Python/Django. I am still not sure what to do with this. If anyone is interested in adding some corpora to it, continuing development or anything else, please let me know.