XQuery 1.0 and XSLT 2.0: news and how to use them in
Debian
(long post about the second generation of the XSL family
and how to use the related languages in Debian)
The 2nd generation of the XSL family
At the beginning of 2007, the W3C released a family of interrelated
specifications: XPath 2.0, XSLT 2.0, and XQuery 1.0. The 3
specifications are based on the very same underlying data model, which
(finally!) supports typed values and exploits them to some extent, for
example by permitting some form of static type checking in XPath and
XQuery. Types usually come from XML Schema, with both built-in and
user-defined types supported.
A brief overview of the 3 specifications and what they change in
the state of the art follows.
XPath 2.0
XPath 2.0 basically consists of XPath 1.0 + 3 new major features.
The first one is a revamp of the language features that deal with
types, in order to better integrate them with the new typed data
model. XPath now has operators to check whether a value belongs to a
given type (instance of; so yes, type information is kept at run
time), and to cast values from one type to another (cast as). The
basic new type constructor is the sequence, which replaces the old
node-set and solves many annoying issues, such as the impossibility
of having a node ordering other than document order.
The second feature is the improvement of the standard library of
functions, which was ridiculously small in XPath 1.0; it is much
better now. Additionally, 2nd generation languages (XSLT 2.0 and
XQuery) now support the ability to define functions which will then
be visible to inner (XPath) expressions, pushing the possibilities of
plain old XPath even further. Functions can now also declare types in
their signatures (for their arguments and return value), and untyped
arguments will be automatically cast to them upon invocation. Hence,
even if you are not using an implementation which assigns types to
your XML trees, once you "enter" the typed world by calling a typed
function (almost all standard library functions are decently typed)
you will be able to stay there, avoiding annoying casts everywhere.
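As a small illustration of that automatic cast, here is a sketch that writes a query with a typed user-defined function to a file. The file name typed.xq and the function name local:double are made up for this example; the query is written in XQuery because XPath alone cannot declare functions.

```shell
#!/bin/sh
# Write a tiny query with a typed user-defined function.
# local:double is a hypothetical name; "local" is XQuery's predeclared
# namespace prefix for functions declared in a main module.
# The quoted 'EOF' keeps the shell from expanding $n.
cat > typed.xq <<'EOF'
declare function local:double($n as xs:integer) as xs:integer {
  $n * 2
};
(: the element content is untyped, so it is cast to xs:integer
   when the function is called :)
local:double(<n>21</n>)
EOF
```

Inside the function body $n is a real xs:integer, so no explicit cast is needed there.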
Finally, XPath 2.0 has turned into a powerful purely functional
language and is now powered by constructs like conditionals, for-each
loops, existential/universal quantifiers, and
existentially-quantified comparison operators for sequences. Here is
a complex expression to whet your appetite (or scare you away ...);
comments come as (: smiley faces :)
for $book in /bookshelf//books
return
  if ((every $author in $book/authors/author
       satisfies $author/nativeLang eq "it_IT")
      and $book/lang eq "it_IT")
  then $book
  else ()
(: think about the trouble of writing this in XSLT/XPath 1.0 ... :)
XSLT 2.0
XSLT 2.0 is what I would call a "bug fix release" of XSLT 1.0 + the
routine reworking of the language to deal with typed values, which is
not significantly different from what has been done for XPath 2.0.
The fixed "bugs" are several, starting from the annoying issue of
result tree fragments. They are basically tree snippets that in XSLT
1.0 you were able to generate for future use. Unfortunately they were
not thaaat reusable, given that you were not even able to navigate
them with XPath operators! Now the specification is much clearer and
distinguishes final result trees from ordinary variables, which can
now contain sequences of (navigable) tree nodes.
Another important "bug" fixed is the new ability to output multiple
documents with a single XSLT stylesheet (via the xsl:result-document
instruction): it marks the end of stupid extra post-processing steps
to be added in a pipeline after an XSLT processor.
Other minor "bugs" fixed are a limited amount of backtracking
capabilities among imported templates, regular expression support
directly in the language, and powerful grouping constructs along the
lines of SQL's GROUP BY (but much more powerful). Here is a template
snippet exploiting the latter feature:
<xsl:for-each-group select="*" group-starting-with="h1">
  <div>
    <xsl:apply-templates select="current-group()" />
  </div>
</xsl:for-each-group>
XQuery
XQuery is the end of the chains imposed by XML-based syntaxes. Why
the heck one has to use an XML syntax (as in the above snippet) just
because she is manipulating XML trees is one of the mysteries of XML
technologies which has always been floating around in my head.
XQuery is (supposedly) the SQL equivalent for databases of XML
documents, but is actually much more than that. I depict it in my
head as the XML manipulation language with a syntax I can finally
stand. Technically it is XPath 2.0 (say 80% of the whole language) +
some extra ingredients (say 20%); so remember that every XPath 2.0
expression is also an XQuery expression.
The main extra ingredient is the so-called FLWOR expression (to be
read: "flower expression", which in addition to the "smiley faces"
used for comments gives a "back to 1968" flavour to the language ...
erm FLWOR/flower/flavour, no pun intended). A FLWOR expression is
very similar to SQL's SELECT-FROM-WHERE: it lets you generate a tuple
stream by iterating on sequences (F: for clauses), bind expression
values to names (L: let clauses), filter out tuples which do not
satisfy a required condition (W: where clause), order the surviving
tuples (O: order by clause), and finally return a sequence built
using the residual tuple stream (R: return clause).
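To make the five clauses concrete, here is a minimal sketch of a query that uses all of them, written to a file from the shell so that it can be handed to an XQuery processor. The file names (books.xml, flwor.xq) and the element names (bookshelf, book, title, price, cheap) are assumptions made up for illustration; they do not come from any real schema.

```shell
#!/bin/sh
# Hypothetical input document: element names are invented for this sketch.
cat > books.xml <<'EOF'
<bookshelf>
  <book><title>B title</title><price>10</price></book>
  <book><title>A title</title><price>20</price></book>
  <book><title>C title</title><price>99</price></book>
</bookshelf>
EOF

# One FLWOR expression using all five clauses; the quoted 'EOF' keeps
# the shell from expanding $b and $t.
cat > flwor.xq <<'EOF'
for $b in doc("books.xml")/bookshelf/book   (: F: iterate over a sequence :)
let $t := $b/title                          (: L: bind a value to a name :)
where xs:decimal($b/price) lt 30            (: W: drop unwanted tuples :)
order by string($t)                         (: O: sort the survivors :)
return <cheap>{ string($t) }</cheap>        (: R: build the result :)
EOF
```

With the Debian setup described later in this post, flwor.xq can be passed to Saxon in place of query.xq.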
The other interesting extra ingredient is the ability to build the
XML snippets you want to manipulate. Within XQuery you do that using
plain XML syntax (the only place where a sane-minded programmer
actually wants to see it!), which also supports a classical
interpolation mechanism to embed expressions which will be evaluated
inside XML snippets, and also the other way around. A canonical
XQuery example is:
for $t in doc("books.xml")//title,
    $e in doc("reviews.xml")//entry
where $t = $e/title
return <review>{ $t, $e/remarks }</review>
(: braces denote the escaping context where XQuery expressions will
   be evaluated inside snippets; plain XML syntax is used for the
   other way around :)
But remember: XQuery for XML is much more than SQL for RDBMS. Thanks
to the implicit templating mechanism implemented by interpolation,
and to several language features fostering modularity (user-defined
functions, library modules, the XPath 2.0 standard library, ...), you
can basically do with it any kind of XML manipulation you can
imagine.
I don't think I will ever write another single line of XML output in
DOM or XSLT myself ...
Cool, how can I use it in Debian?
... if only this stuff were decently supported in the open
source world.
Last time I checked, the author of most parts of the GNOME toolchain
for dealing with XML (libxml2, libxslt, ...) had no intention of
implementing XSLT 2.0, let alone XQuery. This comes as no surprise:
the whole GNOME XML toolkit is written in C, and XSLT 2.0 / XQuery
have reached a level of complexity and formal specification which
usually calls for a higher-level approach. So on the GNOME side we
are stuck.
The other open source implementations of XSLT 2.0 / XQuery I'm
aware of are
Saxon and
Galax.
Saxon is an XSLT 2.0 and XQuery implementation written in Java, which
is unfortunate per se. Additionally, it is also unfortunate that it
is only partially open source. Indeed, Saxon is split into SaxonB
(for "basic"), which is open source under the Mozilla Public License,
and SaxonSA, which is commercial. While SaxonSA is a fully
conformant, XML Schema-aware processor with support for static
typing, SaxonB is a basic-conformance processor with no type-aware
features and is actually much less optimized than SaxonSA. This is
annoying. (Much more annoying is the fact that SaxonSA's author is
the only editor of the XSLT 2.0 specification and that, instead of
his mail address, the specification includes a URL pointing to the
website selling SaxonSA ...)
Galax is an open source (IBM CPL / Lucent license) OCaml
implementation of XQuery. It is not fully conformant to the
specification (though it gets quite close), but it is type-aware and
implements static typing.
The only XQuery / XSLT 2.0 implementation available in Debian at the
time of writing is SaxonB. The binary package is libsaxonb-java;
kudos to Michael Koch and the Debian Java Maintainers for having
packaged it (and for having put up with some annoying pings of mine,
see bug #408842).
To execute XQuery code you just have to
aptitude install
libsaxonb-java, prepare a
query.xq file containing
your query, and then execute something like:
CLASSPATH=/usr/share/java/saxonb.jar \
java net.sf.saxon.Query query.xq
Note that the information in README.Debian still refers to old Saxon
versions; see bug #465894, which proposes a more up-to-date
README.Debian.
Similarly, to perform an XSLT 2.0 transformation you have to do
something like:
CLASSPATH=/usr/share/java/saxonb.jar \
java net.sf.saxon.Transform -ext:off -s:input.xml -xsl:style.xsl -o:output.xml
Do not remove the -ext:off flag when processing untrusted
stylesheets! See bug #465885 for the reason.
I've written some handier (1-liner) shell script helpers which remove
the need to invoke java manually. They are attached to bug #465894
and I've proposed their addition to the saxonb package. Using them,
the above invocations become:
saxonb-xquery query.xq
saxonb-xslt -ext:off -s:input.xml -xsl:style.xsl -o:output.xml
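For the curious, a helper of this kind can be very small. The following is only a sketch of what such a wrapper might look like, not the scripts attached to the bug report: the function names saxonb_query_sketch and saxonb_xslt_sketch and the JAVA and SAXON_JAR variables are assumptions of mine, while the jar path and class names are the ones used in the manual invocations above.

```shell
#!/bin/sh
# Sketch only: NOT the helpers shipped in the Debian package.
# JAVA and SAXON_JAR are hypothetical override knobs for this example.
SAXON_JAR=${SAXON_JAR:-/usr/share/java/saxonb.jar}

# Run an XQuery file through SaxonB, forwarding all arguments untouched.
saxonb_query_sketch() {
    "${JAVA:-java}" -classpath "$SAXON_JAR" net.sf.saxon.Query "$@"
}

# Same idea for XSLT 2.0 transformations.
saxonb_xslt_sketch() {
    "${JAVA:-java}" -classpath "$SAXON_JAR" net.sf.saxon.Transform "$@"
}
```

After sourcing the file, saxonb_query_sketch query.xq would behave like the longer CLASSPATH-based command, and "$@" passes flags such as -ext:off through unchanged.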
Galax in Debian ... well, I've been planning to package it for a long
time (the ITP was filed some months ago: bug #447984), and the
authors sent me a newer version than what is available online to
gather feedback before the long overdue final release. Unfortunately
I've been lagging behind in finishing the packaging (which is tricky
due to the need for binding libraries for several different
languages: OCaml is native, but there is also Java, for example).
Hopefully this post will give me some renewed motivation for
finishing the work ...
References
- Debian package w3-recs, ships the whole
list of W3C Recommendations for offline consultation; it includes
all the specifications we have discussed in this post
- Saxon: home page of
Saxon, an XSLT 2.0 / XQuery processor written in Java
- Galax: home page of
Galax, an XQuery processor written in OCaml
- Debian package libsaxonb-java,
ships SaxonB as a Debian package
Acknowledgements
Thanks to
godog for his
helpful comments.
Update: the helpers I've proposed have been accepted into the
official package; I've also made their manpages available. The
pending changes to README.Debian have been accepted as well. Kudos
again to Michael Koch for his quick feedback on my patches!