I'm glad to announce that the current release of
Theano (0.8.2) is now in Debian unstable; it's on its way into the testing branch and the Debian derivatives, heading for Debian 9.
The
Debian package is maintained on behalf of the Debian Science Team.
We have a binary package with the modules in the Python 2.7 import path (python-theano), if you want or need to stick to that branch a little longer (as a matter of fact, in the current
popcon stats it's the most installed package), and a package running on the default Python 3 version (python3-theano).
The comprehensive
documentation is available for offline usage in another binary package (theano-doc).
Although Theano builds its extensions at run time and therefore all binary packages contain the same code, the source package generates architecture-specific packages
so that the exhaustive test suite can run on all the architectures to detect whether there are problems somewhere (
#824116).
what's this?
In a nutshell, Theano is a computer algebra system (CAS) and expression compiler, which is implemented in Python as a library.
It is named after a classical Greek female mathematician, and it's developed at the LISA lab (located at MILA, the Montreal Institute for Learning Algorithms) at the Université de Montréal.
Theano tightly integrates multi-dimensional arrays (N-dimensional, ND-array) from NumPy (
numpy.ndarray
), which are broadly used in Scientific Python for the representation of numeric data.
It features a declarative Python-based language with symbolic operations for the functional definition of mathematical expressions, which allows one to create functions that compute values for them.
Internally the expressions are represented as directed graphs with nodes for variables and operations.
The internal compiler then optimizes those graphs for stability and speed, and generates high-performance native machine code to evaluate resp. compute these mathematical expressions.
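To make the graph idea concrete, here's a toy sketch in plain Python (hypothetical classes for illustration, not Theano's actual internals): an expression is represented as a directed graph of variable, constant and operation nodes, a trivial optimization pass (constant folding) rewrites the graph, and evaluation walks it.

```python
# Toy expression graph (NOT Theano's real internals): variables and
# operations are nodes in a directed graph, an "optimizer" rewrites
# the graph, and evaluation walks it.

class Var:
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]

class Const:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

def optimize(node):
    """Fold additions of two constants into a single constant node."""
    if isinstance(node, Add):
        left, right = optimize(node.left), optimize(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return node

# x + (2 + 3) is rewritten to x + 5 before evaluation
expr = Add(Var('x'), Add(Const(2), Const(3)))
opt = optimize(expr)
print(opt.eval({'x': 10}))  # 15
```

Theano's real optimizer applies many such graph rewrites for numerical stability and speed before handing the graph to a code-generating linker.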
One of the main features of Theano is that it's also capable of computing on GPUs (graphics processing units), i.e. on ordinary graphics cards (e.g. the developers are using a GeForce GTX Titan X for benchmarks).
Today's GPUs have become very powerful parallel floating-point devices which can be employed for scientific computations as well as for 3D video games
.
The acronym "GPGPU" (general purpose graphical processor unit) refers to special cards like NVIDIA's Tesla
, which can be used in the same way (more on that below).
Thus, Theano is a high-performance number cruncher with its own computing engine, which can be used for large-scale scientific computations.
If you haven't come across Theano as a Python-using professional mathematician: it's also one of the most prevalent frameworks around for implementing deep learning applications (training multi-layered, "deep" artificial neural networks, DNNs)
, and it has been developed with a focus on machine learning from the ground up.
There are several higher-level user interfaces built on top of Theano (for DNNs: Keras, Lasagne, Blocks, and others; or for probabilistic programming in Python: PyMC3).
I'll try to get some of them into Debian, too.
helper scripts
Both binary packages ship three convenience scripts,
theano-cache
,
theano-test
, and
theano-nose
.
Instead of them being copied into
/usr/bin
, which would result in a
binaries-have-conflict violation, the scripts are to be found in
/usr/share/python-theano
(python3-theano respectively), so that both module packages of Theano can be installed at the same time.
The scripts could be run directly from these folders, e.g. do
$ python /usr/share/python-theano/theano-nose
to achieve that.
If you're going to use them heavily, you could add the directory of the flavour you prefer (Python 2 or Python 3) to the
$PATH
environment variable manually by either typing e.g.
$ export PATH=/usr/share/python-theano:$PATH
on the prompt, or save that line into
~/.bashrc
.
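If you'd rather set this up once and for all, a couple of lines in ~/.bashrc do the job; these lines are just a suggestion for the Python 2 flavour, adapt the path for Python 3:

```shell
# put the helper scripts of the preferred flavour on the PATH ...
export PATH=/usr/share/python-theano:$PATH
# ... or leave the PATH alone and define aliases instead:
alias theano-nose='python /usr/share/python-theano/theano-nose'
alias theano-cache='python /usr/share/python-theano/theano-cache'
```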
Manpages aren't available for these little helper scripts
, but you could always get info on what they do and which arguments they accept by invoking them with the
-h
(for
theano-nose
) resp.
help
flag (for
theano-cache
).
running the tests
On some occasions you might want to run the test suite of the installed library, e.g. to check whether everything runs fine on your GPU hardware.
There are two different ways to run the tests (either way you need to have
python-nose resp. python3-nose
installed).
One is, you could launch the test suite by doing
$ python -c 'import theano; theano.test()'
(or the same with
python3
to test the other flavour); that's the same as what the helper script
theano-test
does.
However, run that way, some particular tests might raise errors even for the group of known failures.
Known failures are excluded from being errors if you run the tests by
theano-nose
, which is a wrapper around nosetests, so this is probably always the better choice.
You can run this convenience script with the option
--theano
on the installed library, or from the source package root, which you could pull by
$ sudo apt-get source theano
(there you also have the option to use
bin/theano-nose
).
The script accepts options for nosetests, so you might run it with
-v
to increase verbosity.
For the tests the configuration switch
config.device
must be set to
cpu
.
This will nevertheless also include the GPU tests when a properly accessible device is detected; the setting is a little misleading in the sense that it doesn't mean "run everything on the CPU".
You're on the safe side if you always run it like this:
$ THEANO_FLAGS=device=cpu theano-nose
, even if you've set
config.device
to
gpu
in your
~/.theanorc
.
Depending on the available hardware and the BLAS implementation used (see below), it can take quite a long time to run the whole test suite through; on the Core i5 in my laptop that takes around an hour, even excluding the GPU-related tests (which perform pretty fast, though).
Theano features a couple of switches to manipulate the default configuration for optimization and compilation.
There is a trade-off between optimization resp. compilation costs and the performance of the test suite, and it turned out the test suite performs quicker with less graph optimization.
There are two different values available for
config.optimizer
:
fast_run
toggles maximal
optimization, while
fast_compile
runs only a minimal set of graph optimization features.
These settings are used by the general
mode switches for
config.mode
, which is either
FAST_RUN
by default, or
FAST_COMPILE
.
The default mode
FAST_RUN
(optimizer=fast_run, linker=cvm) needs around 72 minutes on my lower mid-level machine (on un-optimized BLAS).
Setting
mode=FAST_COMPILE
(optimizer=fast_compile, linker=py) brings some boost for the test suite's performance: it runs the whole suite in 46 minutes.
The downside of that is that C code compilation is disabled in this mode by using the linker
py
, and the GPU-related tests are not included either.
I've played around with using the optimizer
fast_compile
with some of the other linkers (
c|py
and
cvm
, and their versions without garbage collection) as an alternative to
FAST_COMPILE
, aiming at minimal optimization but with machine code compilation incl. GPU testing.
But in my experience,
fast_compile
with any linker other than
py
results in some new errors and failures of some tests on amd64, and this might be the case on other architectures, too.
By the way, another useful feature is
DebugMode
for
config.mode
, which verifies the correctness of all optimizations and compares the C to Python results.
If you want to have detailed info on the configuration settings of Theano, do
$ python -c 'import theano; print theano.config' | less
, and check out the chapter
config in the library documentation.
cache maintenance
Theano isn't a JIT (just-in-time) compiler like Numba, which generates native machine code in the memory and executes it immediately, but it saves the generated native machine code into
compiledirs.
The reason for doing it that way is quite practical, as the docs explain: the persistent cache on disk makes it possible to avoid generating code for the same operation again, and to avoid compiling again when different operations generate the same code.
The compiledirs by default are located within
$(HOME)/.theano/
.
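The caching scheme can be pictured with a small toy analogue (a hypothetical sketch, nothing from Theano's real code): generated "code" is keyed by a hash of its source and kept on disk, so that a later request with the same source is a cache hit and skips the generation step.

```python
# Toy analogue of Theano's compiledir caching (illustrative only):
# key generated "code" by a hash of its source, keep it on disk,
# and reuse it when the same source turns up again.

import hashlib
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="toy-compiledir-")

def compile_cached(source):
    key = hashlib.sha256(source.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):  # cache hit: skip "compilation"
        with open(path) as f:
            return f.read(), True
    compiled = source.upper()  # stand-in for real code generation
    with open(path, "w") as f:
        f.write(compiled)
    return compiled, False

first, hit1 = compile_cached("add(x, y)")   # miss: generates and stores
second, hit2 = compile_cached("add(x, y)")  # hit: read back from disk
print(first == second, hit1, hit2)  # True False True
```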
After some time the folder becomes quite large, and might look something like this:
$ ls ~/.theano
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.11+-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12rc1-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.1+-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.2-64
compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--3.5.2rc1-64
If the Python version in use has changed, like in this example, you might want to purge the obsolete caches.
For working with the cache resp. the compiledirs, the helper
theano-cache
comes in handy.
If you invoke it without any arguments, the current cache location is printed out, like
~/.theano/compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12-64
(the script is run from
/usr/share/python-theano
).
So, the compiledirs for the old Python versions in this example (11+ and 12rc1) can be removed to free the space they occupy.
All compiledirs resp. cache directories, meaning the whole cache, can be erased with
$ theano-cache basecompiledir purge
; the effect is the same as performing
$ rm -rf ~/.theano
.
You might want to do that e.g. if you're using different hardware, like when you got yourself another graphics card.
Or habitually, from time to time, when the compiledirs fill up so much that processing slows down, with the hard disk being very busy all the time, if you don't have an SSD available.
For example, build chroots carrying (mainly) the tests completely compiled through on the default Python 2 and Python 3 consume around 1.3 GB of disk space (see
here).
BLAS implementations
Theano needs a level 3 implementation of
BLAS (Basic Linear Algebra Subprograms) for operations on vectors (one-dimensional mathematical objects) and matrices (two-dimensional objects) carried out on the CPU; level 3 covers the matrix-matrix operations.
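The central level 3 routine is GEMM, which computes C <- alpha*A*B + beta*C. As a rough illustration of the operation (a naive sketch; optimized BLAS implementations compute the same thing with blocking, SIMD and threads):

```python
# Naive pure-Python version of what the level 3 BLAS routine GEMM
# computes: C <- alpha * (A @ B) + beta * C.  Optimized BLAS libraries
# (OpenBLAS, ATLAS, ...) compute exactly this, just much faster.

def gemm(alpha, A, B, beta, C):
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(gemm(1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```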
NumPy is already built on BLAS and pulls in the standard implementation (libblas3, source package: lapack), but Theano links to it directly instead of using NumPy as an intermediate layer, to reduce the computational overhead.
For this, Theano needs the development headers, and the binary packages pull in libblas-dev by default, unless a development package of another BLAS
implementation (like OpenBLAS or ATLAS) is already installed, or gets pulled in with them (providing the virtual package
libblas.so
).
The linker flags could be manipulated directly through the configuration switch
config.blas.ldflags
, which is by default set to
-L/usr/lib -lblas -lblas
.
By the way, if you set it to an empty value, Theano falls back to using BLAS through NumPy, if you want that for some reason.
On Debian, there is a very convenient way to switch between BLAS implementations by the alternatives mechanism.
If you have several alternative implementations installed at the same time, you can switch from one to another easily by just doing:
$ sudo update-alternatives --config libblas.so
There are 3 choices for the alternative libblas.so (providing /usr/lib/libblas.so).
Selection Path Priority Status
------------------------------------------------------------
* 0 /usr/lib/openblas-base/libblas.so 40 auto mode
1 /usr/lib/atlas-base/atlas/libblas.so 35 manual mode
2 /usr/lib/libblas/libblas.so 10 manual mode
3 /usr/lib/openblas-base/libblas.so 40 manual mode
Press <enter> to keep the current choice[*], or type selection number:
The implementations perform differently on different hardware, so you might want to take the time to compare which one does best on your processor (the other packages are libatlas-base-dev and libopenblas-dev), and choose that one to optimize your system.
If you want to squeeze out everything there is for carrying out Theano's computations on the CPU, another option is to compile an optimized version of a BLAS library especially for your processor.
I'm going to write another blog posting on this issue.
The binary packages of Theano ship the script
check_blas.py
to check how well a BLAS implementation performs with it, and whether everything works right.
That script is located in the
misc
subfolder of the library, you could locate it by doing
$ dpkg -L python-theano | grep check_blas
(or for the package python3-theano accordingly), and run it with the Python interpreter.
By default the script puts out a lot of info, like a huge performance comparison reference table, the current setting of
blas.ldflags
, the compiledir, the setting of
floatX
, OS information, the GCC version, the current NumPy configuration towards BLAS, the NumPy location and version, whether Theano linked directly or used the NumPy binding, and finally and most importantly, the execution time.
If just the execution time is needed for quick performance comparisons, this script can be invoked with
-q
.
Theano on CUDA
The function compiler of Theano works with alternative backends to carry out the computations, like the ones for graphics cards.
Currently, there are two different backends for GPU processing available, one docks onto NVIDIA's CUDA (Compute Unified Device Architecture) technology
, and another one for
libgpuarray, which is also developed by the Theano developers in parallel.
The
libgpuarray library is an interesting alternative for Theano: it's a GPU tensor (multi-dimensional mathematical object) array library written in C, with Python bindings based on Cython, which has the advantage of also running on OpenCL
.
OpenCL, unlike CUDA
, is fully free software, vendor neutral, and overcomes the limitation of the CUDA toolkit being available only for amd64 and the ppc64el port (see
here).
I've opened an
ITP on libgpuarray and we'll see if and how this works out.
Another reason why it would be great to have it available is that CUDA currently appears to run into problems with GCC 6
.
More on that, soon.
Here's a little checklist for setting up your CUDA device, so that you don't have to experience something like this:
$ THEANO_FLAGS=device=gpu,floatX=float32 python ./cat_dog_classifier.py
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)
hardware check
For running Theano on CUDA you need an NVIDIA graphics card which is capable of doing that.
You can recheck if your device is supported by CUDA
here.
If the hardware isn't too old (CUDA support started with the GeForce 8 and Quadro X series) or too exotic, I think it fails to work only in exceptional cases.
You can check your model and if the device is present in the system on the bare hardware level by doing this:
$ lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940M] (rev a2)
If a line like this doesn't get returned, your device most probably is broken, or not properly connected (ouch).
If
rev ff
appears at the end of the line, that means the device is switched off, i.e. powered down.
This might happen if you have a laptop with Optimus graphics hardware, and the related drivers have switched off the unoccupied device to save energy
.
kernel module
Running CUDA applications requires the proprietary
NVIDIA driver kernel module to be loaded into the kernel and working.
If you haven't already installed it for another purpose, the NVIDIA driver and the CUDA toolkit are both in the non-free section of the Debian archive, which is not enabled by default.
To get non-free packages you have to add
non-free
(and it's better to do so, also
contrib
) to your package source in
/etc/apt/sources.list
, which might then look like this:
deb http://httpredir.debian.org/debian/ testing main contrib non-free
After doing that, run
$ sudo apt-get update
to update the package lists, and there you go with the non-free packages.
The headers of the running kernel are needed to compile modules; you can get them together with the NVIDIA kernel module package by running:
$ sudo apt-get install linux-headers-$(uname -r) nvidia-kernel-dkms build-essential
DKMS will then build the NVIDIA module for the kernel and do some other things on the system.
When the installation has finished, it's generally advised to reboot the system completely.
troubleshooting
If you have problems with the CUDA device, it's advised to verify if the following things concerning the NVIDIA driver resp. kernel module are in order:
blacklist nouveau
Check if the default Nouveau kernel module driver (which blocks the NVIDIA module) for some reason still gets loaded by doing
$ lsmod | grep nouveau
.
If nothing gets returned, that's good.
If it's still in the kernel, just add
blacklist nouveau
to
/etc/modprobe.d/blacklist.conf
, and update the booting ramdisk with
sudo update-initramfs -u
afterwards.
Then reboot once more; after that, the module shouldn't get loaded anymore.
rebuild kernel module
If the module hasn't been properly compiled for some reason, you can trigger a rebuild of the NVIDIA kernel module with
$ sudo dpkg-reconfigure nvidia-kernel-dkms
.
When you're about to send your hardware in for repair because everything looks all right but the device just isn't working, that could really help (own experience).
After the rebuild of the module or modules (if you have a few kernel packages installed) has completed, you could recheck if the module really is available by running:
$ sudo modinfo nvidia-current
filename: /lib/modules/4.4.0-1-amd64/updates/dkms/nvidia-current.ko
alias: char-major-195-*
version: 352.79
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
vermagic: 4.4.0-1-amd64 SMP mod_unload modversions
parm: NVreg_Mobile:int
It should look something similar to this when everything is all right.
reload kernel module
When there are problems with the GPU, maybe the kernel module isn't properly loaded.
You could recheck if the module has been properly loaded by doing
$ lsmod | grep nvidia
nvidia_uvm 73728 0
nvidia 8540160 1 nvidia_uvm
drm 356352 7 i915,drm_kms_helper,nvidia
The kernel module could be loaded resp. reloaded with
$ sudo nvidia-modprobe
(that tool is from the package nvidia-modprobe).
unsupported graphics card
Be sure that your graphics card is supported by the current driver kernel module.
If you have bought new hardware, it's quite possible for this to turn out to be a problem.
You can get the version of the current NVIDIA driver with:
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.79 Wed Jan 13 16:17:53 PST 2016
GCC version: gcc version 5.3.1 20160528 (Debian 5.3.1-21)
Then, google the version number like
nvidia 352.79
, this should get you onto an official driver download page
like this.
There, check for what's to be found under "Supported Products".
If you're stuck with that, there are two options: wait until the driver in Debian gets updated, or replace it with the latest driver package from NVIDIA.
That's possible to do, but something more for experienced users.
occupied graphics card
The CUDA driver cannot work while the graphics card is busy, e.g. with processing the graphical display of your X.Org server.
Which kernel driver is actually used to process the desktop can be examined with this command:
$ grep '(II).*([0-9]):' /var/log/Xorg.0.log
[ 37.700] (II) intel(0): Using Kernel Mode Setting driver: i915, version 1.6.0 20150522
[ 37.700] (II) intel(0): SNA compiled: xserver-xorg-video-intel 2:2.99.917-2 (Vincent Cheng <vcheng@debian.org>)
...
[ 39.808] (II) intel(0): switch to mode 1920x1080@60.0 on eDP1 using pipe 0, position (0, 0), rotation normal, reflection none
[ 39.810] (II) intel(0): Setting screen physical size to 508 x 285
[ 67.576] (II) intel(0): EDID vendor "CMN", prod id 5941
[ 67.576] (II) intel(0): Printing DDC gathered Modelines:
[ 67.576] (II) intel(0): Modeline "1920x1080"x0.0 152.84 1920 1968 2000 2250 1080 1083 1088 1132 -hsync -vsync (67.9 kHz eP)
This example shows that the rendering of the desktop is performed by the graphics device of the Intel CPU, which is just what's needed for running CUDA applications on your NVIDIA graphics card, if you don't have another one.
nvidia-cuda-toolkit
With the
Debian package of the CUDA toolkit everything pretty much runs out of the box for Theano.
Just install it with
apt-get
, and you're ready to go, the CUDA backend is the default one.
PyCUDA is also a suggested dependency of the binary packages; it can be pulled in together with the CUDA toolkit.
The up-to-date CUDA release 7.5 is of course available; with that you have Maxwell architecture support, so that you can run Theano on e.g. a GeForce GTX Titan X with 6.2 TFLOPS of single precision performance
at an affordable price.
CUDA 8
is
around the corner with support for the new Pascal architecture
.
For example, the GeForce GTX 1080 high-end gaming graphics card already delivers 8.23 TFLOPS
.
When it comes to professional GPGPU hardware like the Tesla P100 there is much more computational power available, scalable by multiplication of cores resp. cards up to genuine little supercomputers which fit on a desk, like the DGX-1
.
Theano can use multiple GPUs for calculations, to work with highly scaled hardware; I'll write another blog post on this issue.
Theano on the GPU
It's not difficult to run
Theano on the GPU.
Only single precision floating point numbers (float32) are supported on the GPU, but that is sufficient for deep learning applications.
Theano uses double precision floats (float64) by default, so you have to set the configuration variable
config.floatX
to
float32
, as written above, either with the
THEANO_FLAGS
environment variable or better in your
.theanorc
file, if you're going to use the GPU a lot.
Switching to the GPU actually happens with the
config.device
configuration variable, which must be set to either
gpu
or
gpu0
,
gpu1
etc., to choose a particular one if multiple devices are available.
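Instead of exporting THEANO_FLAGS on every invocation, both settings can go into the configuration file; here's a minimal sketch of a ~/.theanorc for that (values as discussed above):

```ini
# ~/.theanorc: compute on the first GPU with single precision by default
[global]
device = gpu
floatX = float32
```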
Here's a little test script
check1.py, it's taken from the docs and slightly altered.
You can run that script either with
python
or
python3
(there was a single test failure on the Python 3 package, so the Python 2 library might currently be a little more stable).
For comparison, here's an example of how it performs on my hardware, one time on the CPU and another time on the GPU:
$ THEANO_FLAGS=floatX=float32 python ./check1.py
[Elemwise exp,no_inplace (<TensorType(float32, vector)>)]
Looping 1000 times took 4.481719 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=floatX=float32,device=gpu python ./check1.py
Using gpu device 0: GeForce 940M (CNMeM is disabled, cuDNN not available)
[GpuElemwise exp,no_inplace (<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise exp,no_inplace .0)]
Looping 1000 times took 1.164906 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
If you got a result like this you're ready to go with Theano on Debian, training computer vision classifiers or whatever you want to do with it.
I'll write more on what Theano can be used for, soon.