During a JAWS for Windows training, I was introduced to the Research It feature of that screen reader. Research It is a quick way to use web scraping to make working with complex web pages easier: it extracts specific information from websites that do not offer an API. For instance, you can look up a word in an online dictionary, or quickly check the status of a delivery. Strictly speaking, this feature does not belong in a screen reader, but it is a very helpful tool to have at your fingertips. Research It uses XQuery (actually, XQilla) to do all the heavy lifting, which also means that Research It rulesets should theoretically be usable on other platforms.

I was immediately hooked, because I have always had a love for XPath, and XQuery code reads as almost self-explanatory to me. I just like the syntax and semantics. So I immediately checked out XQilla on Debian and found #821329 and #821330, which were promptly fixed by Tommi Vainikainen. Thanks to him for the really quick response!

Unfortunately, making xqilla:parse-html available and upgrading to the latest upstream version is not enough to use XQilla on Linux with the typical webpages out there. Xerces-C++, which XQilla uses to fetch web resources, does not support HTTPS URLs at the moment. I filed #821380 to ask for HTTPS support in Xerces-C to be enabled by default. And even with HTTPS support enabled in Xerces-C, the xqilla:parse-html function (which is based on HTML Tidy) fails on a lot of real-world webpages I tried. Manually upgrading the six-year-old version of HTML Tidy in Debian to the latest from GitHub (tidy-html5, #810951) did not help much either.
Python to the rescue

XQuery is still a very nice language for extracting information from markup documents. XQilla just has a bit of a hard time dealing with the typical HTML documents out there; after all, it was designed to deal with well-formed XML documents. So I decided to build myself a little wrapper around XQilla which fetches web resources with the Python Requests package and cleans the HTML with BeautifulSoup (which uses lxml to do the actual HTML parsing). The output of BeautifulSoup can then be passed to XQilla as the context document. This is a fairly crazy hack, but it has worked quite reliably so far. As an example, here is my rule for searching code on GitHub.
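To illustrate the cleanup step, here is a minimal sketch (not from the original post, and the input markup is made up): BeautifulSoup with the lxml parser turns tag soup into a properly nested tree, and its serialization is what gets handed to XQilla as the context document.

```python
from bs4 import BeautifulSoup

# Tag soup of the kind real websites serve: unclosed <p> and <b> elements.
messy = "<html><body><p>first<p>second <b>bold</body></html>"

# lxml repairs the nesting; str(soup) is then well-formed markup.
soup = BeautifulSoup(messy, "lxml")
print(soup)
```

Every element comes back with a matching closing tag, which is exactly the property XQilla needs.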
The scrape function automatically determines the XQuery filename from the name of the calling function. Here is the Python half of the GitHub code search rule:
```python
from click import argument, group

@group()
def xq():
    """Web scraping for command-line users."""
    pass

@xq.group('github.com')
def github():
    """Quick access to github.com."""
    pass

@github.command('code_search')
@argument('language')
@argument('query')
def github_code_search(language, query):
    """Search for source code."""
    scrape(get='https://github.com/search',
           params={'l': language, 'q': query, 'type': 'code'})
```
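The caller-name trick is plain Python stack introspection. A stripped-down sketch (the helper name here is made up for illustration):

```python
from inspect import currentframe

def xquery_filename():
    # Look one stack frame up: the name of the function that called us
    # becomes the basename of the XQuery file to run.
    return currentframe().f_back.f_code.co_name + ".xq"

def github_code_search():
    return xquery_filename()

print(github_code_search())  # prints "github_code_search.xq"
```

This is why a rule function only needs to be named after its XQuery file; no explicit wiring between the two is required.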
That is all I need to implement a custom web scraping rule: a few lines of Python to specify how and where to fetch the page from, and an XQuery file that specifies how to mangle the document content. Thanks to the Python click package, the various entry points of my web scraping script can easily be invoked from the command line. And here is how github_code_search.xq looks:
```xquery
declare function local:source-lines($table as node()*) as xs:string*
{
  for $tr in $table/tr return normalize-space(data($tr))
};

let $results := html//div[@id="code_search_results"]/div[@class="code-list"]
for $div in $results/div
let $repo := data($div/p/a[1])
let $file := data($div/p/a[2])
let $link := resolve-uri(data($div/p/a[2]/@href))
return (concat($repo, ": ", $file),
        $link,
        local:source-lines($div//table),
        "---------------------------------------------------------------")
```
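For comparison, roughly the same extraction can be written in Python with lxml's XPath support. The markup below is a made-up miniature of GitHub's code search results, just to keep the sketch self-contained; it is not the real page structure:

```python
from lxml import html

# Made-up miniature of the code search results markup,
# just enough structure for the sketch to run on.
page = html.fromstring("""
<div id="code_search_results"><div class="code-list">
  <div>
    <p><a href="/prof7bit/LazPackager">prof7bit/LazPackager</a>
       <a href="/prof7bit/LazPackager/blob/master/lazpackagerdebian.pas">lazpackagerdebian.pas</a></p>
    <table><tr><td>205 </td><td>+ 'mv ../rules debian/' + LF</td></tr></table>
  </div>
</div></div>""")

for div in page.xpath('//div[@id="code_search_results"]/div[@class="code-list"]/div'):
    repo, filename = (a.text_content() for a in div.xpath('p/a'))
    print(repo + ": " + filename)
    for tr in div.xpath('.//table/tr'):
        # The Python equivalent of XQuery's normalize-space(data($tr)).
        print(" ".join(tr.text_content().split()))
```

It works, but the XQuery version expresses the same intent more directly, which is the whole point of keeping XQilla in the loop.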
Here is a sample invocation:
```
fx:~/xq% ./xq.py github.com
Usage: xq.py github.com [OPTIONS] COMMAND [ARGS]...

  Quick access to github.com.

Options:
  --help  Show this message and exit.

Commands:
  code_search  Search for source code.
fx:~/xq% ./xq.py github.com code_search Pascal '"debian/rules"'
prof7bit/LazPackager: frmlazpackageroptionsdeb.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/frmlazpackageroptionsdeb.pas
230 procedure TFDebianOptions.BtnPreviewRulesClick(Sender: TObject);
231 begin
232   ShowPreview('debian/rules', EdRules.Text);
233 end;
234
235 procedure TFDebianOptions.BtnPreviewChangelogClick(Sender: TObject);
---------------------------------------------------------------
prof7bit/LazPackager: lazpackagerdebian.pas
https://github.com/prof7bit/LazPackager/blob/cc3e35e9bae0c5a582b0b301dcbb38047fba2ad9/lazpackagerdebian.pas
205 + 'mv ../rules debian/' + LF
206 + 'chmod +x debian/rules' + LF
207 + 'mv ../changelog debian/' + LF
208 + 'mv ../copyright debian/' + LF
---------------------------------------------------------------
```
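One detail of the scrape implementation worth spelling out: external XQuery variables are handed to xqilla as repeated -v NAME VALUE argument pairs, which scrape builds from the xquery_vars dict with itertools.chain.from_iterable. In isolation:

```python
from itertools import chain

# scrape() flattens xquery_vars into xqilla's repeated -v NAME VALUE options.
xquery_vars = {'l': 'Pascal', 'q': '"debian/rules"'}
args = list(chain.from_iterable(['-v', k, v] for k, v in xquery_vars.items()))
print(args)  # ['-v', 'l', 'Pascal', '-v', 'q', '"debian/rules"']
```

Each dict entry expands into its own three-element chunk, so any number of variables can be passed through a single command line.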
For the impatient, here is the implementation of scrape:

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, ResultSet
from inspect import currentframe
from itertools import chain
from os import path
from os.path import abspath, dirname
from subprocess import PIPE, run
from tempfile import NamedTemporaryFile

import requests


def scrape(get=None, post=None, find_all=None,
           xquery_name=None, xquery_vars={}, **kwargs):
    """Execute an XQuery file.

    When either get or post is specified, fetch the resource and run it
    through BeautifulSoup, passing it as context to the XQuery.
    If find_all is given, wrap the result of executing find_all on the
    BeautifulSoup in an artificial HTML body.
    If xquery_name is not specified, the caller's function name is used.
    xquery_name combined with extension ".xq" is searched in the directory
    where this Python script resides and executed with XQilla.
    kwargs are passed to get or post calls.  Typical extra keywords would be:
    params -- To pass extra parameters to the URL.
    data -- For HTTP POST.
    """
    response = None
    url = None
    context = None
    if get is not None:
        response = requests.get(get, **kwargs)
    elif post is not None:
        response = requests.post(post, **kwargs)
    if response is not None:
        response.raise_for_status()
        context = BeautifulSoup(response.text, 'lxml')
        dtd = next(context.descendants)
        if type(dtd) is Doctype:
            dtd.extract()
        if find_all is not None:
            context = context.find_all(find_all)
        url = response.url
    if xquery_name is None:
        xquery_name = currentframe().f_back.f_code.co_name
    cmd = ['xqilla']
    if context is not None:
        if type(context) is BeautifulSoup:
            soup = context
            context = NamedTemporaryFile(mode='w')
            print(soup, file=context)
            context.flush()
            cmd.extend(['-i', context.name])
        elif isinstance(context, (list, ResultSet)):
            tags = context
            context = NamedTemporaryFile(mode='w')
            print('<html><body>', file=context)
            for item in tags:
                print(item, file=context)
            print('</body></html>', file=context)
            context.flush()
            cmd.extend(['-i', context.name])
    cmd.extend(chain.from_iterable(['-v', k, v]
                                   for k, v in xquery_vars.items()))
    if url is not None:
        cmd.extend(['-b', url])
    cmd.append(abspath(path.join(dirname(__file__), xquery_name + ".xq")))
    output = run(cmd, stdout=PIPE).stdout.decode('utf-8')
    if type(context) is NamedTemporaryFile:
        context.close()
    print(output, end='')
```

The full source for xq can be found on GitHub. The project is just two days old, so I have only implemented three scraping rules as of now. However, adding new rules has deliberately been made easy, so that I can write up a few lines of code whenever I find something on the web that I'd like to scrape from the command line. If you find this "framework" useful, make sure to share your insights with me. And if you implement your own scraping rules for a public service, consider sharing those as well. If you have any comments or questions, send me mail. Oh, and by the way, I am now also on Twitter as @blindbird23.