Google Summer of Code 2011 gave a big boost to the development of the
shogun
machine learning toolbox. In case you have never heard of shogun or
machine learning:
Machine learning involves algorithms that do "intelligent" and even automatic
data processing. It is nowadays used everywhere, e.g. to do face detection in
your camera, compress the speech in your mobile phone, power the
recommendations in your favourite online shop, predict the solubility of molecules
in water or the location of genes in humans, to name just a few examples.
Interested? Then you should give it a try. Some very simple examples stemming
from a sub-branch of machine learning called
supervised learning
illustrate how objects represented by two-dimensional vectors can be classified as good or bad by learning a so-called
support vector machine. I suggest installing the python_modular
interface of shogun and running the example
interactive_svm_demo.py,
which is also included in the source tarball. Two images illustrating the training of a support vector machine follow:
Now back to Google Summer of Code: Google sponsored 5 talented students who worked
hard on various subjects. As a result we now have a new core developer and various new features
implemented in shogun: interfaces to new languages like Java, C#, Ruby and Lua
written by Baozeng; a model selection framework written by Heiko Strathmann;
many dimension reduction techniques written by Sergey Lisitsyn; Gaussian
Mixture Model estimation written by Alesis Novik; and a full-fledged online
learning framework developed by Shashwat Lal Das. All of this work has already been
integrated in the newly released
shogun 1.0.0. In case you
want to know more about the students' projects, continue reading below, but
before going into more detail I would like to summarize my experience with GSoC 2011.
My Experience with Google Summer of Code
We were a first-time organization, i.e. taking part in GSoC for the first time.
Having received a great many student applications, we were very happy to hear that
we got 5 very talented students accepted,
but we still had to reject about
60 students (only a 7% acceptance rate!). This was an extremely tough decision for us. Each of us
ended up scoring the students, and even then we had many ties. So in the end we
raised the bar by requiring contributions even before the actual GSoC started.
This way we already got many improvements like more complete I/O functions,
nicely polished ROC and other evaluation routines, new machine learning
algorithms like Gaussian naive Bayes and averaged perceptron, and many bugfixes.
The quality of the contributions and the independence of the students helped us
come up with the final selection of
five.
I personally played the role of administrator and (co-)mentor and scheduled
regular (usually monthly) IRC meetings with mentors and students. For other org admins or mentors who want to get into GSoC, here are my lessons learned:
- Set up the infrastructure for your project before GSoC: We transitioned from svn to git (on github) just before GSoC started. While it was a bit tough to work with git in the beginning, it quickly paid off (patch reviewing and discussions on github were really much easier). We did not have proper regression tests running daily during most of GSoC, which left a number of issues undetected for quite some time. Now that we have buildbots running I keep wondering how we could survive for so long without them :-)
- Even though all of our students worked very independently, you want to mentor them very closely in the beginning such that they write code that you like to see in your project, following coding style, utilizing already existing helper routines. We did this and it simplified our lives later - we could mostly accept patches as is.
- Expect contributions from external parties: We had contributions to shogun's ruby and csharp interfaces/examples. Ensure that you have some spare manpower to review such additional patches.
- Expect critical code review by your students and be open to restructuring the code. As a long-term contributor you probably no longer realize whether your class design / code structure is hard to digest. Newcomers like GSoC students will notice immediately when they stumble upon inconsistencies. When they discover such issues, discuss with them how to resolve them and don't be afraid of making even bigger changes in the early GSoC phase (though not so big that they hinder the work of all of your students). We had quite some structural improvement in shogun due to several suggestions by our students. Overall the project improved drastically, and not just w.r.t. additions.
- As a mentor, work with your student on the project. Yes, get your hands dirty too. This way you are much more of a help to the student when things get stuck, and it will be much easier for you to answer difficult questions.
- As a mentor, try to answer the questions your students have within a few hours. This keeps the students motivated and you excited that they are doing a great job.
Now please read on to learn about the newly implemented features:
Dimension Reduction Techniques
Sergey Lisitsyn (Mentor: Christian Widmer)
Dimensionality reduction is the process of finding a low-dimensional representation of high-dimensional data while maintaining the core essence of the data. As one of the most important practical issues in applied machine learning, it is widely used for preprocessing real data. With a strong focus on memory requirements and speed, Sergey implemented the following dimension reduction techniques:
See below for some nice illustrations of dimension reduction/embedding techniques.
Cross-Validation Framework
Heiko Strathmann (Mentor: Soeren Sonnenburg)
Nearly every learning machine has parameters which have to be determined manually. Before Heiko started his project, one had to implement cross-validation by hand using (nested) for-loops. In his highly involved project Heiko extended shogun's core to register parameters and ultimately made cross-validation possible. He implemented different model selection schemes (train/validation/test split, n-fold cross-validation, stratified cross-validation, etc.) and created some
examples for illustration. Note that various performance measures are
available to measure how "good" a model is. The figure below shows the
area under the receiver operating characteristic curve as an example.
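To give a rough idea of what stratified n-fold cross-validation does, here is a minimal plain-Python sketch. It is not shogun's API: the names `stratified_kfold` and `cross_validate` and the `train_and_score` callback are made up for illustration, and the round-robin assignment is just one simple way to keep class proportions roughly equal across folds.

```python
from collections import defaultdict


def stratified_kfold(labels, n_folds):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Deal the indices of each class round-robin so every fold gets its share.
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


def cross_validate(train_and_score, labels, n_folds):
    """Average the score of a user-supplied callback over all folds.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    trains on the first index list and returns a score on the second.
    """
    scores = [
        train_and_score(train_idx, test_idx)
        for train_idx, test_idx in stratified_kfold(labels, n_folds)
    ]
    return sum(scores) / len(scores)
```

Model selection then amounts to calling `cross_validate` once per candidate parameter setting and keeping the setting with the best average score, which is exactly the (nested) for-loop one previously had to write by hand.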
Interfaces to the Java, C#, Lua and Ruby Programming Languages
Baozeng (Mentor: Mikio Braun and Soeren Sonnenburg)
Baozeng implemented swig-typemaps that enable transfer of objects native to the language one wants to interface to. In his project, he added support for Java, Ruby, C# and Lua. His knowledge of swig helped us to drastically simplify shogun's typemaps for existing languages like octave and python and to resolve other corner-case type issues. The addition of these typemaps brings a high-performance and versatile machine learning toolbox to these languages. It should be noted that shogun objects trained in e.g. python can be serialized to disk and then loaded from any other language, say lua or java. We hope this helps users working in multiple-language environments.
Note that the syntax is very similar across all languages used, compare for yourself - various examples for all languages (
python,
octave,
java,
lua,
ruby, and
csharp) are available.
Large-Scale Learning Framework and Integration of Vowpal Wabbit
Shashwat Lal Das (Mentor: John Langford and Soeren Sonnenburg)
Shashwat introduced support for 'streaming' features into shogun. That is, instead of shogun's traditional way of requiring all data to be in memory, features can now be streamed from e.g. disk, enabling the use of massive data sets. He implemented support for dense and sparse vector based input streams as well as strings, and converted existing online learning methods to use this framework. He was particularly careful and even made it possible to emulate streaming from in-memory features. Finally, he integrated (parts of) vowpal wabbit, which is a very fast large-scale online learning algorithm based on SGD.
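The idea behind streaming features can be illustrated with a small plain-Python sketch. This is not shogun or vowpal wabbit code: the function names and the `label idx:val idx:val ...` line format are assumptions chosen for the example, and a simple perceptron stands in for the online learner. The point is that examples are parsed lazily one at a time, so only the weight vector has to fit in memory.

```python
def stream_examples(lines):
    """Lazily parse 'label idx:val idx:val ...' lines, one example at a time.

    `lines` can be an open file, so the data set never has to be loaded
    into memory as a whole. The line format is an assumption of this sketch.
    """
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        label = float(parts[0])
        features = {}
        for token in parts[1:]:
            idx, val = token.split(":")
            features[int(idx)] = float(val)
        yield label, features


def online_perceptron(examples):
    """Make a single pass over a stream of (label, sparse_features) pairs."""
    weights = {}
    mistakes = 0
    for label, features in examples:
        score = sum(weights.get(i, 0.0) * v for i, v in features.items())
        if label * score <= 0:
            # Misclassified (or undecided): move the weights towards the label.
            mistakes += 1
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) + label * v
    return weights, mistakes
```

Because `stream_examples` accepts any iterable of strings, passing a plain list of lines emulates streaming from in-memory data, while passing an open file object streams from disk.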
Expectation Maximization Algorithms for Gaussian Mixture Models
Alesis Novik (Mentor: Vojtech Franc)
The Expectation-Maximization algorithm is well known in the machine learning community. The goal of this project was the robust implementation of the Expectation-Maximization algorithm for Gaussian Mixture Models. Several computational tricks have been applied to address numerical and stability issues, like
- Representing covariance matrices as their SVD
- Doing operations in log domain to avoid overflow/underflow
- Setting minimum variances to avoid singular Gaussians
- Merging/splitting of Gaussians
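Two of these tricks, working in the log domain and enforcing a minimum variance, can be sketched for the one-dimensional case as follows. This is a minimal plain-Python illustration and not the code that went into shogun; the function names and the value of `MIN_VARIANCE` are assumptions made for the example.

```python
import math

MIN_VARIANCE = 1e-6  # hypothetical floor that keeps a Gaussian from collapsing


def log_sum_exp(values):
    """Compute log(sum(exp(v) for v in values)) without overflow or underflow."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_gaussian(x, mean, variance):
    """Log density of a 1-D Gaussian, with the variance clamped from below."""
    variance = max(variance, MIN_VARIANCE)
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each mixture component."""
    log_joint = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    log_norm = log_sum_exp(log_joint)
    return [math.exp(lj - log_norm) for lj in log_joint]
```

For a point far away from all components the individual densities underflow to zero when computed directly, which would lead to a division by zero in the normalization. Subtracting the largest log value before exponentiating keeps the responsibilities well defined.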
An illustrative example of estimating one- and two-dimensional Gaussians follows below.
Final Remarks
All in all, this year's GSoC has given the SHOGUN project a great push forward and we hope that this will translate into an increased user base and numerous external contributions. Also, we hope that by providing bindings for many languages, we can provide a neutral ground for machine learning implementations and that way bring together communities centered around different programming languages. All that's left to say is that, given the great experiences from this year, we'd be more than happy to participate in GSoC 2012.