Search Results: "huber"

4 May 2021

Erich Schubert: Machine Learning Lecture Recordings

I have uploaded most of my Machine Learning lecture to YouTube. The slides are in English, but the audio is in German. Some very basic contents (e.g., a demo of standard k-means clustering) were left out from this advanced class, and instead only a link to recordings from an earlier class were given. In this class, I wanted to focus on the improved (accelerated) algorithms instead. These are not included here (yet). I believe there are some contents covered in this class you will find nowhere else (yet). The first unit is pretty long (I did not split it further yet). The later units are shorter recordings. ML F1: Principles in Machine Learning ML F2/F3: Correlation does not Imply Causation & Multiple Testing Problem ML F4: Overfitting beranpassung ML F5: Fluch der Dimensionalit t Curse of Dimensionality ML F6: Intrinsische Dimensionalit t Intrinsic Dimensionality ML F7: Distanzfunktionen und hnlichkeitsfunktionen ML L1: Einf hrung in die Klassifikation ML L2: Evaluation und Wahl von Klassifikatoren ML L3: Bayes-Klassifikatoren ML L4: N chste-Nachbarn Klassifikation ML L5: N chste Nachbarn und Kerndichtesch tzung ML L6: Lernen von Entscheidungsb umen ML L7: Splitkriterien bei Entscheidungsb umen ML L8: Ensembles und Meta-Learning: Random Forests und Gradient Boosting ML L9: Support Vector Machinen - Motivation ML L10: Affine Hyperebenen und Skalarprodukte Geometrie f r SVMs ML L11: Maximum Margin Hyperplane die breitest m gliche Stra e ML L12: Training Support Vector Machines ML L13: Non-linear SVM and the Kernel Trick ML L14: SVM Extensions and Conclusions ML L15: Motivation of Neural Networks ML L16: Threshold Logic Units ML L17: General Artificial Neural Networks ML L18: Learning Neural Networks with Backpropagation ML L19: Deep Neural Networks ML L20: Convolutional Neural Networks ML L21: Recurrent Neural Networks and LSTM ML L22: Conclusion Classification ML U1: Einleitung Clusteranalyse ML U2: Hierarchisches Clustering ML U3: Accelerating HAC mit Anderberg s Algorithmus ML U4: k-Means Clustering ML U5: Accelerating k-Means Clustering ML U6: Limitations of k-Means Clustering ML U7: Extensions of k-Means Clustering ML U8: Partitioning Around Medoids (k-Medoids) ML U9: Gaussian Mixture Modeling (EM Clustering) ML U10: Gaussian Mixture Modeling Demo ML U11: BIRCH and BETULA Clustering ML U12: Motivation Density-Based Clustering (DBSCAN) ML U13: Density-reachable and density-connected (DBSCAN Clustering) ML U14: DBSCAN Clustering ML U15: Parameterization of DBSCAN ML U16: Extensions and Variations of DBSCAN Clustering ML U17: OPTICS Clustering ML U18: Cluster Extraction from OPTICS Plots ML U19: Understanding the OPTICS Cluster Order ML U20: Spectral Clustering ML U21: Biclustering and Subspace Clustering ML U22: Further Clustering Approaches

21 February 2021

Erich Schubert: My first Rust crate: faster kmedoids clustering

I have written my first Rust crate: kmedoids. Python users can use the wrapper package kmedoids. It implements k-medoids clustering, and includes our new FasterPAM algorithm that drastically reduces the computational overhead. As long as you can afford to compute the distance matrix of your data set, clustering it with k-medoids is now feasible even for large k. (If your data is continuous and you are interested in minimizing squared errors, k-means surely remains the better choice!) My take on Rust so far: Will I use it more? I don t know. Probably if I need extreme performance, but I likely would not want to do everything my self in a pedantic language. So community is key, and I do not see Rust shine there.

13 August 2020

Erich Schubert: Publisher MDPI lies to prospective authors

The publisher MDPI is a spammer and lies. If you upload a paper draft to arXiv, MDPI will send spam to the authors to solicit submission. Within minutes of an upload I received the following email (sent by MDPI staff, not some overly eager new editor):
We read your recent manuscript "[...]" on
arXiv, and sincerely invite you to submit it to our journal Future
Internet, if it has not been published or submitted elsewhere.
Future Internet (ISSN 1999-5903, indexed by Scopus, Ei compendex,
*ESCI*-Web of Science) is a journal on Internet technologies and the
information society. It maintains a rigorous and fast peer review system
with a median publication time of 35 days from submission to online
publication, and 3 days from acceptance to publication. The journal
scope is shown here:
Editorial Board:
Since Future Internet is an open access journal there is a publication
fee. Your paper will be published, with a 20% discount (amounting to 200
CHF), and provided that it is accepted after our standard peer-review
First of all, the email begins with a lie. Because this paper clearly states that it is submitted elsewhere. Also, it fits other journals much better, and if they had read even just the abstract, they would have known. This is predatory behavior by MDPI. Clearly, it is just about getting as many submissions as possible. The journal charges 1000 CHF (next year, 1400 CHF) to publish the papers. Its about the money. Also, there have been reports that MDPI ignores the reviews, and always publishes even when reviewers recommended rejection The reviewer requests I have received from MDPI came with unreasonable deadlines, which will not allow for a thorough peer review. Hence I asked to not ever be emailed by them again. I must assume that many other qualified reviewers do the same. MDPI boasts in their 2019 annual report a median time to first decision of 19 days in my discipline the typical time window to ask for reviews is at least a month (for shorter conference papers, not full journal articles), because professors tend to have lots of other duties, hence they need more flexibility. Above paper has been submitted in March, and is now under review for 4 months already. This is an annoying long time window, and I would appreciate if this were less, but it shows how extremely short the MDPI time frame is. They also claim 269.1k submissions and 106.2k published papers, so the acceptance rate is around 40% on average, and assuming that there are some journals with higher standards there then some must have acceptance rates much higher than this. I d assume that many reputable journals have 40% desk-rejection rate for papers that are not even on-topic The average cost to authors is given as 1144 CHF (after discounts, 25% waived feeds etc.), so they, so we are talking about 120 million CHF of revenue from authors. Is that what you want academic publishing to be? I am not happy with some of the established publishers such as Elsevier that also overcharge universities heavily. I do think we need to change academic publishing, and arXiv is a big improvement here. But I do not respect publishers such as MDPI that lie and send spam.

17 May 2020

Erich Schubert: Contact Tracing Apps are Useless

Some people believe that automatic contact tracing apps will help contain the Coronavirus epidemic. They won t. Sorry to bring the bad news, but IT and mobile phones and artificial intelligence will not solve every problem. In my opinion, those that promise to solve these things with artificial intelligence / mobile phones / apps / your-favorite-buzzword are at least overly optimistic and blinder Aktionismus (*), if not naive, detachted from reality, or fraudsters that just want to get some funding. (*) there does not seem to be an English word for this doing something just for the sake of doing something, without thinking about whether it makes sense to do so Here are the reasons why it will not work:
  1. Signal quality. Forget detecting proximity with Bluetooth Low Energy. Yes, there are attempts to use BLE beacons for indoor positioning. But these use that you can learn fingerprints of which beacons are visible at which points, combined with additional information such as movement sensors and history (you do not teleport around in a building). BLE signals and antennas apparently tend to be very prone to orientation differences, signal reflections, and of course you will not have the idealized controlled environment used in such prototypes. The contacts have a single device, and they move this is not comparable to indoor positioning. I strongly doubt you can tell whether you are close to someone, or not.
  2. Close vs. protection. The app cannot detect protection in place. Being close to someone behind a plexiglass window or even a solid wall is very different from being close otherwise. You will get a lot of false contacts this way. That neighbor that you have never seen living in the appartment above will likely be considered a close contact of yours, as you sleep next to each other every day
  3. Low adoption rates. Apparently even in technology affine Singapore, fewer than 20% of people installed the app. That does not even mean they use it regularly. In Austria, the number is apparently below 5%, and people complain that it does not detect contact But in order for this approach to work, you will need Chinese-style mass surveillance that literally puts you in prison if you do not install the app.
  4. False alerts. Because of these issues, you will get false alerts, until you just do not care anymore.
  5. False sense of security. Honestly: the app does not pretect you at all. All it tries to do is to make the tracing of contacts easier. It will not tell you reliably if you have been infected (as mentioned above, too many false positives, too few users) nor that you are relatively safe (too few contacts included, too slow testing and reporting). It will all be on the quality of about 10 days ago you may or may not have contact with someone that tested positive, please contact someone to expose more data to tell you that it is actually another false alert .
  6. Trust. In Germany, the app will be operated by T-Systems and SAP. Not exactly two companies that have a lot of fans SAP seems to be one of the most hated software around. Neither company is known for caring about privacy much, but they are prototypical for business first . Its trust the cat to keep the cream. Yes, I know they want to make it open-source. But likely only the client, and you will still have to trust that the binary in the app stores is actually built from this source code, and not from a modified copy. As long as the name T-Systems and SAP are associated to the app, people will not trust it. Plus, we all know that the app will be bad, given the reputation of these companies at making horrible software systems
  7. Too late. SAP and T-Systems want to have the app ready in mid June. Seriously, this must be a joke? It will be very buggy in the beginning (because it is SAP!) and it will not be working reliably before end of July. There will not be a substantial user before fall. But given the low infection rates in Germany, nobody will bother to install it anymore, because the perceived benefit is 0 one the infection rates are low.
  8. Infighting. You may remember that there was the discussion before that there should be a pan-european effort. Except that in the end, everybody fought everybody else, countries went into different directions and they all broke up. France wanted a centralized systems, while in Germany people pointed out that the users will not accept this and only a distributed system will have a chance. That failed effort was known as Pan-European Privacy-Preserving Proximity Tracing (PEPP-PT) vs. Decentralized Privacy-Preserving Proximity Tracing (DP-3T) , and it turned out to have become a big clusterfuck . And that is just the tip of the iceberg.
Iceleand, probably the country that handled the Corona crisis best (they issued a travel advisory against Austria, when they were still happily spreading the virus at apres-ski; they massively tested, and got the infections down to almost zero within 6 weeks), has been experimenting with such an app. Iceland as a fairly close community managed to have almost 40% of people install their app. So did it help? No: The technology is more or less I wouldn t say useless [ ] it wasn t a game changer for us. The contact tracing app is just a huge waste of effort and public money. And pretty much the same applies to any other attempts to solve this with IT. There is a lot of buzz about solving the Corona crisis with artificial intelligence: bullshit! That is just naive. Do not speculate about magic power of AI. Get the data, understand the data, and you will see it does not help. Because its real data. Its dirty. Its late. Its contradicting. Its incomplete. It is all what AI currently can not handle well. This is not image recognition. You have no labels. Many of the attempts in this direction already fail at the trivial 7-day seasonality you observe in the data For example, the widely known John Hopkins Has the curve flattened trend has a stupid, useless indicator based on 5 day averages. And hence you get the weekly up and downs due to weekends. They show pretty up and down indicators. But these are affected mostly by the day of the week. And nobody cares. Notice that they currently even have big negative infections in their plots? There is no data on when someone was infected. Because such data simply does not exist. What you have is data when someone tested positive (mostly), when someone reported symptons (sometimes, but some never have symptoms!), and when someone dies (but then you do not know if it was because of Corona, because of other issues that became just worse because of Corona, or hit by a car without any relation to Corona). The data that we work with is incredibly delayed, yet we pretend it is live . Stop reading tea leaves. Stop pretending AI can save the world from Corona.

30 May 2016

Daniel Stender: My work for Debian in May

No double posting this time ;-) I've got not so much spare time this month to spend on Debian, but I could work on the following packages: This series of blog postings also includes little introductions of and into new packages in the archive. This month there is: Pyinfra Pyinfra is a new project which is currently still in development state. It has been already pointed out in an interesting German article1, and is now available as package maintained within the Python Applications Team. It's currently a one man production by Nick Barrett, and eagerly developed in the past weeks (we're currently at 0.1~dev24). Pyinfra is a remote server configuration/provisioning/service deployment tool which belongs in the same software category like Puppet or Ansible2. It's for provisioning one or an array of remote servers with software packages and to configure them. Pyinfra runs agentless like Ansible, that means for using it nothing special (like a daemon) has to run on targeted servers. It's written to be used for provisioning POSIX compatible Linux systems and has alternatives when it comes to special features like package managers (e.g. supports apt as well as yum). The documentation could be found in usr/share/doc/pyinfra/html/. Here's a little crash course on how to use Pyinfra: The pyinfra CLI tool is used on the command line like this, deploy scripts, single operations or facts (see below) could be used on a single server or a multitude of remote servers:
$ pyinfra -i <inventory script/single host> <deploy script>
$ pyinfra -i <inventory script/single host> --run <operation>
$ pyinfra -i <inventory script/single host> --facts <fact>
Remote servers which are operated on must provide a working shell and they must be reachable by SSH. For connecting, --port, --user, --password, --key/--key-password and --sudo flags are available, --sudo to gain superuser rights. Root access or sudo rights of course have to be already set up. By the way, localhost could be operated on the same way. Single operations are organized in modules like "apt", "files", "init", "server" etc. With the --run option they could be used individually on servers like follows, e.g. server.user adds a new user on a single targeted system (-v adds verbosity to the pyinfra run):
$ pyinfra -i --run server.user sam --user root --key ~/.ssh/sshkey --key-password 123456 -v
Multiple servers can be grouped in inventories, which hold the targeted hosts and data associated with them, like e.g. an inventory file would contain lists like this:
Group designators must be all caps. A higher level of grouping are the file names of inventory scripts, thus COMPUTE_SERVERS and DATABASE_SERVERS can be referenced to at the same time by the group designator farm1. Plus, all servers are automatically added to the group all. And, inventory scripts should be stored in the subfolder inventory/ in the project directory. Inventory files then could be used instead of specific IP addresses like this, the single operation then gets performed on all given machines in
$ pyinfra -i inventory/  --run server.user sam --user root --key ~/.ssh/sshkey --key-password=123456 -v
Deployment scripts could be used together with group data files in the subfolder group_data/ in the project directory. For example, a group_data/ designates all servers given in inventory/ (by the way, designates all servers), and contains the random attribute user_name (attributes must be lowercase), next to authentication data for the whole inventory group:
user_name = 'sam'
ssh_user = 'root'
ssh_key = '~/.ssh/sshkey'
ssh_key_password = '123456'
The random attribute can be picked up by a deployment script using like follows, user_name could be used again for e.g. server.user(), like this:
from pyinfra import host
from pyinfra.modules import server
This deploy, the ensemble of inventory file, group data file and deployment script (usually placed top level in the project folder) then could be run that way:
$ pyinfra -i inventory/
You have guessed it, since deployment scripts are Python scripts they are fully programmable (please regard that Pyinfra is build & runs on Python 3 on Debian), and that's the main advantage point with this piece of software. Quite handy for that come Pyinfra facts, functions which check different things on remote systems and return information as Python data. Like e.g. deb_packages returns a dictionary of installed packages from a remote apt based server:
$ pyinfra -i --fact deb_packages --user root --key ~/.ssh/sshkey --key-password=123456
        "libdebconfclient0": "0.192",
        "python-debian": "0.1.27",
        "libavahi-client3": "0.6.31-5",
        "dbus": "1.8.20-0+deb8u1",
        "libustr-1.0-1": "1.0.4-3+b2",
        "sed": "4.2.2-4+b1",
Using facts, Pyinfra reveals its full potential. For example, a deployment script could go like this, linux.distribution() returns a dict containing the installed distribution:
from pyinfra import host
from pyinfra.modules import apt
if host.fact.linux_distribution['name'] == 'Debian':
   apt.packages(packages='gummi', present=True, update=True)
elif host.fact.linux_distribution['name'] == 'CentOS':
I'll spare more sophisticated examples to keep this introduction simple. Beyond fancy deployment scripts, Pyinfra features an own API by which it could be programmed from the outside, and much more. But maybe that's enough to introduce Pyinfra. That are the usage basics. Pyinfra is a brand new project and it remains to be seen whether the developer can keep on further developing the tool like he does these days. For a private project it's insane to attempt to become a contender for the established "big" free configuration management tools and frameworks, but, if Puppet has become too complex in the meanwhile or not3, I really don't think that's the point here. Pyinfra follows an own approach in being programmable the way which it is. And it's definitely not harm to have it in the toolbox already, not trying to replace nothing. Brainstorm After the first package has been in experimental, the Brainstorm library from Swiss AI research institute IDSIA4 is now available as python3-brainstorm in unstable. Brainstorm is a lean, easy-to-use library for setting up deep learning networks (multiple layered artificial neural networks) for machine learning applications like for image and speech recognition or natural language processing. To set up a working training network for a classifier for handwritten digits like the MNIST dataset (a usual "hello world") just takes a couple of lines, like an example demonstrates. The package is maintained within the Debian Python Modules Team. The Debian package ships a couple of examples in /usr/share/python3-brainstorm/examples (the data/ and examples/ folders of the upstream tarball are combined here). Among them there are5: The current documentation in /usr/share/doc/python3-brainstorm/html/ isn't complete yet (several chapters are under construction), but there's a walkthrough on the CIFAR-10 example. The MNIST example has been extended by Github user pinae, and has been explained in German C't recently6. What are the perspectives for further development? Like Zhou Mo confirmed, there are a couple of deep learning frameworks around having a rather poor outlook since there have been abandoned after being completed as PhD projects. There's really no point for thriving to have them all in Debian, like the ITP of Minerva has been given up partly for this reason, there weren't any commits since 08/2015 (and because cuDNN isn't available and most likely won't). Brainstorm, 0.5 have been released 05/2015, also was a PhD project as IDSIA. It's stated on Github that the project is "under active development", but the rather sparse project page on the other side expresses the "hope the community will help us to further improve Brainstorm". This sentence much often implies that the developers are not actively working on the project. But there are recent commits and it looks that upstream is active and could be reached when there are problems, and that the project is active. So I don't think we're riding a dead horse, here. The downside for Brainstorm in Debian is, it seems that the libraries which are needed for GPU accelerated processing can't be fully provided. Pycuda is available, but scikit-cuda (an additional library which provides wrappers for CUDA features like CUBLAS, CUFFT and CUSOLVER) is not and won't be, because the CULA Dense Toolkit (scikit-cuda also contains wrappers for also that) is not available freely as source. Because of that, a dependency against pycuda, not even as Suggests (it's non-free), has been spared. Without GPU acceleration, Brainstorm computes the matrices on openBLAS using a Cython wrapper on the NumpyHandler, and the PyCudaHandler couldn't be used. openBLAS makes pretty good use of the available hardware (it distributes over all available CPU cores), but it's not yet possible to run Brainstorm full throttle using available floating point devices to reduce training times, which becomes crucial when the projects are getting bigger. Brainstorm belongs to the number of deep learning frameworks already being or becoming available in Debian. Currently there is: I've checked over Microsoft's CNTK, but although it's also set free recently I have my doubts if that could be included. Apparently there are dependencies against non-free software and most likely other issues. So much for a little update on the state of deep learning in Debian, please excuse if my radar misses something.

  1. Tim Sch rmann: "Schlangen l: Automatisiertes Service-Deployment mit Pyinfra". In: IT-Administrator 05/2016, pp. 90-95.
  2. For a comparison of configuration management software like this, see B wetter/Johannsen/Steig: "Baukastensysteme: Konfigurationsmanagement mit Open-Source-Software". In: iX 04/2016, pp. 94-99 (please excuse the prevalence of German articles in the pointers, I've just have them at hand).
  3. On the points of critique on Puppet, see Martin Loschwitz: "David gegen Goliath Zwei Welten treffen aufeinander: Puppet und Ansible". In Linux-Magazin 01/2016, 50-54.
  4. See the interview with IDSIA's deep learning guru J rgen Schmidhuber in German C't 2014/09, p. 148
  5. The examples scripts need some more finetuning. To run the data creation scripts in place the environment variable BRAINSTORM_DATA_DIR could be set, but the trained networks are currently tried to write in place. So please copy the scripts into some workspace if you want to try them out. I'll patch the example scripts to run out-of-the-box, soon.
  6. Johannes Merkert: "Ziffernlerner. Ein k nstliches neuronales Netz selber gebaut". In: C't 2016/06, p. 142-147. Web:
  7. See Ramon Wartala: "Tiefensch rfe: Deep learning mit NVIDIAs Jetson-TX1-Board und dem Caffe-Framework". In: iX 06/2016, pp. 100-103

1 March 2016

Erich Schubert: Stop abusing lambda expressions - this is not functional programming

I know, all the Scala fanboys are going to hate me now. But:
Stop overusing lambda expressions.
Most of the time when you are using lambdas, you are not even doing functional programming, because you often are violating one key rule of functional programming: no side effects.
For example:
is of course very cute to use, and is (wow) 10 characters shorter than:
for (Object o : collection) System.out.println(o);
but this is not functional programming because it has side effects.
What you are doing are anonymous methods/objects, using a shorthand notion. It's sometimes convenient, it is usually short, and unfortunately often unreadable, once you start cramming complex problems into this framework.
It does not offer efficiency improvements, unless you have the propery of side-effect freeness (and a language compiler that can exploit this, or parallelism that can then call the function concurrently in arbitrary order and still yield the same result).
Here is an examples of how to not use lambdas:
DZone Java 8 Factorial (with boilerplate such as the Pair class omitted):
Stream<Pair> allFactorials = Stream.iterate(
  new Pair(BigInteger.ONE, BigInteger.ONE),
  x -> new Pair(
return allFactorials.filter(
  (x) -> x.num.equals(num)).findAny().get().value;
When you are fresh out of the functional programming class, this may seem like a good idea to you... (and in contrast to the examples mentioned above, this is really a functional program).
But such code is a pain to read, and will not scale well either. Rewriting this to classic Java yields:
BigInteger cur = BigInteger.ONE, acc = BigInteger.ONE;
while(cur.compareTo(num) <= 0)  
  cur = cur.add(BigInteger.ONE); // Unfortunately, BigInteger is immutable!
  acc = acc.multiply(cur);
return acc;
Sorry, but the traditional loop is much more readable. It will still not perform very well (because of BigInteger not being designed for efficiency - it does not even make sense to allow BigInteger for num - the factorial of 2**63-1, the maximum of a Java long, needs 1020 bytes to store, i.e. about 500 exabyte.
For some, I did some benchmarking. One hundred random values num (of course the same for all methods) from the range 1 to 1000.
I also included this even more traditional version:
BigInteger acc = BigInteger.ONE;
for(long i = 2; i <=x; i++)  
  acc = acc.multiply(BigInteger.valueOf(i));
return acc;
Here are the results (Microbenchmark, using JMH, 10 warum iterations, 20 measurement iterations of 1 second each):
functional    1000     100  avgt   20  9748276,035   222981,283  ns/op
biginteger    1000     100  avgt   20  7920254,491   247454,534  ns/op
traditional   1000     100  avgt   20  6360620,309   135236,735  ns/op
As you can see, this "functional" approach above is about 50% slower than the classic for-loop. This will be mostly due to the Pair and additional BigInteger objects created and garbage collected.
Apart from being substantially faster, the iterative approach is also much simpler to follow. (To some extend it is faster because it is also easier for the compiler!)
There was a recent blog post by Robert Br utigam that discussed exception throwing in Java lambdas and the pitfalls associated with this. The discussed approach involves abusing generics for throwing unknown checked exceptions in the lambdas, ouch.

Don't get me wrong. There are cases where the use of lambdas is perfectly reasonable. There are also cases where it adheres to the "functional programming" principle. For example, a stream.filter(x ->"John Doe")) can be a readable shorthand when selecting or preprocessing data. If it is really functional (side-effect free), then it can safely be run in parallel and give you some speedup.
Also, Java lambdas were carefully designed, and the hotspot VM tries hard to optimize them. That is why Java lambdas are not closures - that would be much less performant. Also, the stack traces of Java lambdas remain somewhat readable (although still much worse than those of traditional code). This blog post by Takipi showcases how bad the stacktraces become (in the Java example, the stream function is more to blame than the actual lambda - nevertheless, the actual lambda application shows up as the cryptic LmbdaMain$$Lambda$1/821270929.apply(Unknown Source) without line number information). Java 8 added new bytecodes to be able to optimize Lambdas better - earlier JVM-based languages may not yet make good use of this.
But you really should use lambdas only for one-liners. If it is a more complex method, you should give it a name to encourage reuse and improve debugging.
Beware of the cost of .boxed() streams!
And do not overuse lambdas. Most often, non-Lambda code is just as compact, and much more readable. Similar to foreach-loops, you do lose some flexibility compared to the "raw" APIs such as Iterators:
for(Iterator<Something>> it = collection.iterator(); it.hasNext(); )  
  Something s =;
  if (someTest(s)) continue; // Skip
  if (otherTest(s)) it.remove(); // Remove
  if (thirdTest(s)) process(s); // Call-out to a complex function
  if (fourthTest(s)) break; // Stop early
In many cases, this code is preferrable to the lambda hacks we see pop up everywhere these days. Above code is efficient, and readable.
If you can solve it with a for loop, use a for loop!
Code quality is not measured by how much functionality you can do without typing a semicolon or a newline!
On the contrary: the key ingredient to writing high-performance code is the memory layout (usually) - something you need to do low-level.
Instead of going crazy about Lambdas, I'm more looking forward to real value types (similar to a struct in C, reference-free objects) maybe in Java 9 (Project Valhalla), as they will allow reducing the memory impact for many scenarios considerably. I'd prefer a mutable design, however - I understand why this is proposed, but the uses cases I have in mind become much less elegant when having to overwrite instead of modify all the time.

26 February 2016

Erich Schubert: Protect your file server from the Locky trojan

The "Locky" trojan and similar trojans apparently can cause havoc on your file servers (you may have heard the reports of hospitals that had to pay thousands of dollars to be able to decrypt their files).
Obviously, this is a good reason to double-check you backups.
But as a Linux admin, you may want to consider additional security measures. Here is one suggestion (untested, because I do not run a Samba file server):
Enable logging in the Samba file server, and monitor the log file for the known file names created by Locky. I.e. files named .locky or _Locky_recover_instructions.txt.
If a user creates such a file, immediately ban his IP from accessing your file server, and send out an alert to the admin and the affected user.
This probably won't prevent much damage from the users PC, but it should at least prevent it from doing much on your file server.
There also exist security modules such as "samba-virusfilter" that could probably be extended to cover this, too.

Sorry, I cannot provide you step-by-step instruction because I am a Linux-only user. I do not run a Samba file server. I have only had conversations with friends about this trojan.

Erich Schubert: Protect your file server from the Locky trojan

The "Locky" trojan and similar trojans apparently can cause havoc on your file servers (you may have heard the reports of hospitals that had to pay thousands of dollars to be able to decrypt their files).
Obviously, this is a good reason to double-check you backups.
But as a Linux admin, you may want to consider additional security measures. Here is one suggestion (untested, because I do not run a Samba file server):
Enable logging in the Samba file server, and monitor the log file for the known file names created by Locky. I.e. files named .locky or _Locky_recover_instructions.txt.
If a user creates such a file, immediately ban his IP from accessing your file server, and send out an alert to the admin and the affected user.
This probably won't prevent much damage from the users PC, but it should at least prevent it from doing much on your file server.
There also exist security modules such as "samba-virusfilter" that could probably be extended to cover this, too.

Sorry, I cannot provide you step-by-step instruction because I am a Linux-only user. I do not run a Samba file server. I have only had conversations with friends about this trojan.

14 January 2016

Lunar: Reproducible builds: week 37 in Stretch cycle

What happened in the reproducible builds effort between January 3rd and January 9th 2016:

Toolchain fixes David Bremner uploaded dh-elpa/0.0.18 which adds a --fix-autoload-date option (on by default) to take autoload dates from changelog. Lunar updated and sent the patch adding the generation of .buildinfo to dpkg.

Packages fixed The following packages have become reproducible due to changes in their build dependencies: aggressive-indent-mode, circe, company-mode, db4o, dh-elpa, editorconfig-emacs, expand-region-el, f-el, geiser, hyena, js2-mode, markdown-mode, mono-fuse, mysql-connector-net, openbve, regina-normal, sml-mode, vala-mode-el. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues, but not all of them: Patches submitted which have not made their way to the archive yet:
  • #809780 on flask-restful by Chris Lamb: implement support for SOURCE_DATE_EPOCH in the build system.
  • #810259 on avfs by Chris Lamb: implement support for SOURCE_DATE_EPOCH in the build system.
  • #810509 on apt by Mattia Rizzolo: ensure a stable file order is given to the linker. Add 2 more armhf build nodes provided by Vagrant Cascadian. This added 7 more armhf builder jobs. We now run around 900 tests of armhf packages each day. (h01ger) The footer of each page now indicates by which Jenkins jobs build it. (h01ger)

diffoscope development diffoscope 45 has been released on January 4th. It features huge memory improvements when comparing large files, several fixes of squashfs related issues that prevented comparing two Tails images, and improve the file list of tar and cpio archive to be more precise and consistent over time. It also fixes a typo that prevented the Mach-O to work (Rainer M ller), improves comparisons of ELF files when specified on the command line, and solves a few more encoding issues.

Package reviews 134 reviews have been removed, 30 added and 37 updated in the previous week. 20 new fail to build from source issues were reported by Chris Lamb and Chris West. prebuilder will now skip installing diffoscope to save time if the build results are identical. (Reiner Herrmann)

27 November 2015

Erich Schubert: ELKI 0.7.0 on Maven and GitHub

Version 0.7.0 of our data mining toolkit ELKI is now available on the project homepage, GitHub and Maven.
You can also clone this example project to get started easily.
What is new in ELKI 0.7.0? Too much, see the release notes, please!
What is ELKI exactly?
ELKI is a Java based data mining toolkit. We focus on cluster analysis and outlier detection, because there are plenty of tools available for classification already. But there is a kNN classifier, and a number of frequent itemset mining algorithms in ELKI, too.
ELKI is highly modular. You can combine almost everything with almost everything else. In particular, you can combine algorithms such as DBSCAN, with arbitrary distance functions, and you can choose from many index structures to accelerate the algorithm. But because we separate them well, you can add a new index, or a new distance function, or a new data type, and still benefit from the other parts. In other tools such as R, you cannot easily add a new distance function into an arbitrary algorithm and get good performance - all the fast code in R is written in C and Fortran; and cannot be easily extended this way. In ELKI, you can define a new data type, new distance function, new index, and still use most algorithms. (Some algorithms may have prerequisites that e.g. your new data type does not fulfill, of course).
ELKI is also very fast. Of course a good C code can be faster - but then it usually is not as modular and easy to extend anymore.
ELKI is documented. We have JavaDoc, and we annotate classes with their scientific references (see a list of all references we have). So you know which algorithm a class is supposed to implement, and can look up details there. This makes it very useful for science.
ELKI is not: a turnkey solution. It aims at researchers, developers and data scientists. If you have a SQL database, and want to do a point-and-click analysis of your data, please get a business solution instead with commercial support.

29 September 2015

Erich Schubert: Ubuntu broke Java because of Unity

Unity, that is the Ubuntu user interface, that nobody else uses. Since it is a Ubuntu-only thing, few applications have native support for its OSX-style hipster "global" menus. For Java, someone once wrote a hack called java-swing-ayatana, or "jayatana", that is preloaded into the JVM via the environment variable JAVA_TOOL_OPTIONS. The hacks seems to be unmaintained now. Unfortunately, this hack seems to be broken now (Google has thousands of problem reports), and causes a NullPointerException or similar crashes in many applications; likely due to a change in OpenJDK 8. Now all Java Swing applications appear to be broken for Ubuntu users, if they have the jayatana package installed. Congratulations! And of couse, you see bug reports everywhere. Matlab seems to no longer work for some, NetBeans appears to have issues, and I got a number of bug reports on ELKI because of Ubuntu. Thank you, not.

11 May 2015

Julien Danjou: My interview about software tests and Python

I've recently been contacted by Johannes Hubertz, who is writing a new book about Python in German called "Softwaretests mit Python" which will be published by Open Source Press, Munich this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview that I gave for his book. Johannes translated to German and it will be included in Johannes' book, and I decided to publish it on my blog today. Following is the original version. How did you come to Python?
I don't recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn't really like Perl and was not getting a good grip on its object system. As soon as I found an idea to work on if I remember correctly that was rebuildd I started to code in Python, learning the language at the same time. I liked how Python worked, and how fast I was to able to develop and learn it, so I decided to keep using it for my next projects. I ended up diving into Python core for some reasons, even doing things like briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack. OpenStack is a cloud computing platform entirely written in Python. So I've been writing Python every day since working on it. That's what pushed me to write The Hacker's Guide to Python in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python. It had a great success, has even been translated in Chinese and Korean, so I'm currently working on a second edition of the book. It has been an amazing adventure! Zen of Python: Which line is the most important for you and why?
I like the "There should be one and preferably only one obvious way to do it". The opposite is probably something that scared me in languages like Perl. But having one obvious way to do it is and something I tend to like in functional languages like Lisp, which are in my humble opinion, even better at that. For a python newbie, what are the most difficult subjects in Python?
I haven't been a newbie since a while, so it's hard for me to say. I don't think the language is hard to learn. There are some subtlety in the language itself when you deeply dive into the internals, but for beginners most of the concept are pretty straight-forward. If I had to pick, in the language basics, the most difficult thing would be around the generator objects (yield). Nowadays I think the most difficult subject for new comers is what version of Python to use, which libraries to rely on, and how to package and distribute projects. Though things get better, fortunately. When did you start using Test Driven Development and why?
I learned unit testing and TDD at school where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was losing my time. Which I actually was, since I was writing disposable programs that's the only thing you do at school. Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs I already fixed. That recalled me about unit tests and that it may be a good idea to start using them to stop fixing the same things over and over again. For a few years, I wrote less Python and more C code and Lua (for the awesome window manager), and I didn't use any testing. I probably lost hundreds of hours testing manually and fixing regressions that was a good lesson. Though I had good excuses at that time it is/was way harder to do testing in C/Lua than in Python. Since that period, I have never stopped writing "tests". When I started to hack on OpenStack, the project was adopting a "no test? no merge!" policy due to the high number of regressions it had during the first releases. I honestly don't think I could work on any project that does not have at least a minimal test coverage. It's impossible to hack efficiently on a code base that you're not able to test in just a simple command. It's also a real problem for new comers in the open source world. When there are no test, you can hack something and send a patch, and get a "you broke this" in response. Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn't break anything! In the end, it's just too much frustration to work on non tested projects as I demonstrated in my study of whisper source code. What do you think to be the most often seen pitfalls of TDD and how to avoid them best?
The biggest problems are when and at what rate writing tests. On one hand, some people starts to write too precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean that you should not do test at all, but you should probably start with a light coverage, until you are pretty sure that you're not going to rip every thing and start over. On the other hand, some people postpone writing tests for ever, and end up with no test all or a too thin layer of test. Which makes the project with a pretty low coverage. Basically, your test coverage should reflect the state of your project. If it's just starting, you should build a thin layer of test so you can hack it on it easily and remodel it if needed. The more your project grow, the more you should make it sold and lay more tests. Having too detailed tests is painful to make the project evolve at the start. Having not enough in a big project makes it painful to maintain it. Do you think, TDD fits and scales well for the big projects like OpenStack?
Not only I think it fits and scales well, but I also think it's just impossible to not use TDD in such big projects. When unit and functional tests coverage was weak in OpenStack at its beginning it was just impossible to fix a bug or write a new feature without breaking a lot of things without even noticing. We would release version N, and a ton of old bugs present in N-2 but fixed in N-1 were reopened. For big projects, with a lot of different use cases, configuration options, etc, you need belt and braces. You cannot throw code in a repository thinking it's going to work ever, and you can't afford to test everything manually at each commit. That's just insane.

4 May 2015

Lunar: Reproducible builds: first week in Stretch cycle

Debian Jessie has been released on April 25th, 2015. This has opened the Stretch development cycle. Reactions to the idea of making Debian build reproducibly have been pretty enthusiastic. As the pace is now likely to be even faster, let's see if we can keep everyone up-to-date on the developments. Before the release of Jessie The story goes back a long way but a formal announcement to the project has only been sent in February 2015. Since then, too much work has happened to make a complete report, but to give some highlights:
  • New variations are now tested: umask, kernel version, domain name, and timezone. We might only be missing CPU type and current date now.
  • Many improvements to the test system on and the pages showing the results.
  • Now not only packages from unstable are tested but also those in testing and experimental.
  • When rescheduling packages for testing, the build products can be kept and the IRC channel gets a notification when its over.
  • binutils version 2.25-6 is now built with the --enable-deterministic-archives flag. Making ar, strip and others create deterministic static libraries.
  • Number of identified issues has grown from about 80 to 123 today.
Lunar did a pretty improvised lightning talk during the Mini-DebConf in Lyon. This past week It seems changes were pilling behind the curtains given the amount of activity that happened in just one week. Toolchain fixes
  • Niels Thykier uploaded debhelper/9.20150501 which includes fixes to dh_makeshlibs (#774100), dh_icons (#774102), dh_usrlocal (#775020). Patches written by Lunar.
  • Helmut Grohne uploaded doxygen/ which will not generate timestamps in HTML by default. Kudos to akira for bringing the issue upstream.
  • Kenneth J. Pronovici uploaded epydoc/3.0.1+dfsg-6 adding a --no-include-build-time option. Patch by Jelmer Vernooij.
  • David Pr vot uploaded php-apigen/2.8.1+dfsg-2 which now has reproducible output.
  • C dric Boutillier uploaded ruby-prawn/2.0.1+dfsg-1 which now produce a deterministic output when using gradients. Patch by Lunar.
  • Jelmer Vernooij uploaded samba/2:4.1.17+dfsg-4 which contains a patch by Matthieu Patou making the output of pidl (from libparse-pidl-perl) reproducible.
  • Dmitry Shachnev uploaded sphinx/1.3.1-1 in experimental which should produce deterministic output. The original patch from Chris Lamb has inspired the upstream fix.
  • gregor herrmann uploaded libextutils-depends-perl/0.404-1 which makes ExtUtils::Depends output deterministic. Original patch by Reiner Herrmann.
  • Niko Tyni uploaded perl/5.20.2-4 which makes the output of Pod::Man reproducible. Nice team work visible on #780259.
We also rebased the experimental version of debhelper twice to merge the latest set of changes. Lunar submitted a patch to add a -creation-date to genisoimage. Reiner Herrmann opened #783938 to request making -notimestamp the default behavior for javadoc. Juan Picca submitted a patch to add a --use-date flag to texi2html. Packages fixed The following packages became reproducible due to changes of their build dependencies: apport, batctl, cil, commons-math3, devscripts, disruptor, ehcache, ftphs, gtk2hs-buildtools, haskell-abstract-deque, haskell-abstract-par, haskell-acid-state, haskell-adjunctions, haskell-aeson, haskell-aeson-pretty, haskell-alut, haskell-ansi-terminal, haskell-async, haskell-attoparsec, haskell-augeas, haskell-auto-update, haskell-binary-conduit, haskell-hscurses, jsch, ledgersmb, libapache2-mod-auth-mellon, libarchive-tar-wrapper-perl, libbusiness-onlinepayment-payflowpro-perl, libcapture-tiny-perl, libchi-perl, libcommons-codec-java, libconfig-model-itself-perl, libconfig-model-tester-perl, libcpan-perl-releases-perl, libcrypt-unixcrypt-perl, libdatetime-timezone-perl, libdbd-firebird-perl, libdbix-class-resultset-recursiveupdate-perl, libdbix-profile-perl, libdevel-cover-perl, libdevel-ptkdb-perl, libfile-tail-perl, libfinance-quote-perl, libformat-human-bytes-perl, libgtk2-perl, libhibernate-validator-java, libimage-exiftool-perl, libjson-perl, liblinux-prctl-perl, liblog-any-perl, libmail-imapclient-perl, libmocked-perl, libmodule-build-xsutil-perl, libmodule-extractuse-perl, libmodule-signature-perl, libmoosex-simpleconfig-perl, libmoox-handlesvia-perl, libnet-frame-layer-ipv6-perl, libnet-openssh-perl, libnumber-format-perl, libobject-id-perl, libpackage-pkg-perl, libpdf-fdf-simple-perl, libpod-webserver-perl, libpoe-component-pubsub-perl, libregexp-grammars-perl, libreply-perl, libscalar-defer-perl, libsereal-encoder-perl, libspreadsheet-read-perl, libspring-java, libsql-abstract-more-perl, libsvn-class-perl, libtemplate-plugin-gravatar-perl, libterm-progressbar-perl, libterm-shellui-perl, libtest-dir-perl, libtest-log4perl-perl, libtext-context-eitherside-perl, libtime-warp-perl, libtree-simple-perl, libwww-shorten-simple-perl, libwx-perl-processstream-perl, libxml-filter-xslt-perl, libxml-writer-string-perl, libyaml-tiny-perl, mupen64plus-core, nmap, openssl, pkg-perl-tools, quodlibet, r-cran-rjags, r-cran-rjson, r-cran-sn, r-cran-statmod, ruby-nokogiri, sezpoz, skksearch, slurm-llnl, stellarium. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which did not make their way to the archive yet: Improvements to Mattia Rizzolo has been working on compressing logs using gzip to save disk space. The web server would uncompress them on-the-fly for clients which does not accept gzip content. Mattia Rizzolo worked on a new page listing various breakage: missing or bad debbindiff output, missing build logs, unavailable build dependencies. Holger Levsen added a new execution environment to run debbindiff using dependencies from testing. This is required for packages built with GHC as the compiler only understands interfaces built by the same version. debbindiff development Version 17 has been uploaded to unstable. It now supports comparing ISO9660 images, dictzip files and should compare identical files much faster. Documentation update Various small updates and fixes to the pages about PDF produced by LaTeX, DVI produced by LaTeX, static libraries, Javadoc, PE binaries, and Epydoc. Package reviews Known issues have been tagged when known to be deterministic as some might unfortunately not show up on every single build. For example, two new issues have been identified by building with one timezone in April and one in May. RD and help2man add current month and year to the documentation they are producing. 1162 packages have been removed and 774 have been added in the past week. Most of them are the work of proper automated investigation done by Chris West. Summer of code Finally, we learned that both akira and Dhole were accepted for this Google Summer of Code. Let's welcome them! They have until May 25th before coding officialy begins. Now is the good time to help them feel more comfortable by sharing all these little bits of knowledge on how Debian works.

3 May 2015

Erich Schubert: @Zigo: Why I don't package Hadoop myself

A quick reply to Zigo's post:
Well, I looked at the Bigtop efforts because I needed Hadoop packages. But they are not very useful. They have lots of issues (including empty packages, naming conflicts etc.).
I filed a few bugs, and I even uploaded my fixes to Github. Some of that went unnoticed, because Sean Owen of Cloudera decided to remove all Debian packaging from Spark. But in the end, even with these fixes, the resulting packages do not live up to Debian quality standards (not to say, they would outright violate policy).
If you wanted to package Hadoop properly, you should ditch Apache Bigtop, and instead use the existing best practises for packaging. Using any of the Bigtop work just makes your job harder, by pulling in additional dependencies like their modified Groovy.
But whatever you do, you will be stuck in .jar dependency hell. Whatever you look at, it pulls in another batch of dependencies, that all need to be properly packaged, too. Here is the dependency chain of Hadoop:
[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0:compile
[INFO]    +-
[INFO]    +- org.mortbay.jetty:jetty:jar:6.1.26:compile
[INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO]    +- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO]    +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO]       \- asm:asm:jar:3.1:compile
[INFO]    +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    +- commons-codec:commons-codec:jar:1.4:compile
[INFO]    +- commons-io:commons-io:jar:2.4:compile
[INFO]    +- commons-lang:commons-lang:jar:2.6:compile
[INFO]    +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO]    +- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO]    +- javax.servlet.jsp:jsp-api:jar:2.1:compile
[INFO]    +- log4j:log4j:jar:1.2.17:compile
[INFO]    +-
[INFO]    +- javax.servlet:servlet-api:jar:2.5:compile
[INFO]    +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO]    +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO]    +- tomcat:jasper-runtime:jar:5.5.23:compile
[INFO]    +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    +- io.netty:netty:jar:3.6.2.Final:compile
[INFO]    +- xerces:xercesImpl:jar:2.9.1:compile
[INFO]       \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO]    \- org.htrace:htrace-core:jar:3.0.4:compile
[INFO] +- org.apache.hadoop:hadoop-auth:jar:2.6.0:compile
[INFO]    +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO]    +- org.apache.httpcomponents:httpclient:jar:4.2.5:compile
[INFO]       \- org.apache.httpcomponents:httpcore:jar:4.2.4:compile
[INFO]    +-
[INFO]       +-
[INFO]       +-
[INFO]       \-
[INFO]    +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
[INFO]       +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO]       \- jline:jline:jar:0.9.94:compile
[INFO]    \- org.apache.curator:curator-framework:jar:2.6.0:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.6.0:compile
[INFO]    +- org.apache.hadoop:hadoop-annotations:jar:2.6.0:compile
[INFO]       \-
[INFO]    +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO]    +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO]    +- commons-net:commons-net:jar:3.1:compile
[INFO]    +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO]    +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO]       +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO]       +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO]          \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO]             +-
[INFO]             \- javax.activation:activation:jar:1.1:compile
[INFO]       +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO]       \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO]    +-
[INFO]       \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:compile
[INFO]    +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO]       +- commons-digester:commons-digester:jar:1.8:compile
[INFO]          \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO]       \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO]    +- org.apache.avro:avro:jar:1.7.4:compile
[INFO]       +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO]       \- org.xerial.snappy:snappy-java:jar:
[INFO]    +-
[INFO]    +- com.jcraft:jsch:jar:0.1.42:compile
[INFO]    +- org.apache.curator:curator-client:jar:2.6.0:compile
[INFO]    +- org.apache.curator:curator-recipes:jar:2.6.0:compile
[INFO]    +-
[INFO]    \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO]       \- org.tukaani:xz:jar:1.0:compile
[INFO] +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
[INFO]    +- org.apache.commons:commons-math:jar:2.1:compile
[INFO]    +- tomcat:jasper-compiler:jar:5.5.23:compile
[INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
[INFO]       \- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
[INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
[INFO]       \- ant:ant:jar:1.6.5:compile
[INFO]    +- commons-el:commons-el:jar:1.0:compile
[INFO]    +- hsqldb:hsqldb:jar:
[INFO]    +- oro:oro:jar:2.0.8:compile
[INFO]    \- org.eclipse.jdt:core:jar:3.1.1:compile
So the first step for packaging Hadoop would be to check which of these dependencies are not yet packaged in Debian... I guess 1/3 is not.
Maybe, we should just rip out some of these dependencies with a cluebat. For the stupid reason of making a webfrontend (which doesn't provide a lot of functionality, and I doubt many people use it at all), Hadoop embeds not just one web server, but two: Jetty and Netty...
Things would also be easier if e.g. S3 support, htrace, the web frontend, and different data serializations were properly put into modules. Then you could postpose S3 support, for example.
As I said, the deeper you dig, the crazier it gets.
If the OpenDataPlatform efforts of Hortonworks, Pivotal and IBM were anything but a marketing gag, they would try to address these technical issues. Instead, they make things worse by specifying yet another fatter core, including Ambari, Apaches attempt to automatically make a mess out of your servers - essentially, they are now adding the ultimate root shell, for all those cases where unaudited puppet commands and "curl sudo bash" was not bad enough:
  command1 = as_sudo(["cat,"/etc/passwd"]) + "   grep user"
(from the Ambari python documentation)
The closer you look, the more you want to rather die than use this.
P.S. I have updated the libtrove3-java package (Java collections for primitive types; but no longer the fastest such library), so that it is now in the local maven repository (/usr/share/maven-repo) and that it can be rebuilt reproducible (the build user name is no longer in the jar manifest).

28 April 2015

Thomas Goirand: @Erich Schubert: why not trying to package Hadoop in Debian?

Erich, As a follow-up on your blog post, where you complain about the state of Hadoop. First, I couldn t agree more with all you wrote. All of it! But why not trying to get Hadoop in Debian, rather than only complaining about the state of things? I have recently packaged and uploaded Sahara, which is OpenStack big data as a service (in other words: running Hadoop as a service on an OpenStack cloud). Its working well, though it was a bit frustrating to discover exactly what you complained about: the operating system cloud image needed to run within Sahara can only be downloaded as a pre-built image, which is impossible to check. It would have been so much work to package Hadoop that I just gave up (and frankly, packaging all of OpenStack in Debian is enough work for a single person doing the job so no, I don t have time to do it myself). OpenStack Sahara already provides the reproducible deployment system which you seem to wish. We only need Hadoop itself.

26 April 2015

Erich Schubert: Your big data toolchain is a big security risk!

This post is a follow-up to my earlier post on the "sad state of sysadmin in the age of containers". While I was drafting this post, that story got picked up by HackerNews, Reddit and Twitter, sending a lot of comments and emails my way. Surprisingly many of the comments are supportive of my impression - I would have expected to see much more insults along the lines "you just don't like my-favorite-tool, so you rant against using it". But a lot of people seem to share my concerns. Thanks, you surprised me!
Here is the new rant post, in the slightly different context of big data:

Everybody is doing "big data" these days. Or at least, pretending to do so to upper management. A lot of the time, there is no big data. People do more data anylsis than before, and therefore stick the "big data" label on them to promote themselves and get green light from management, isn't it?
"Big data" is not a technical term. It is a business term, referring to any attempt to get more value out of your business by analyzing data you did not use before. From this point of view, most of such projects are indeed "big data" as in "data-driven revenue generation" projects. It may be unsatisfactory to those interested in the challenges of volume and the other "V's", but this is the reality how the term is used.
But even in those cases where the volume and complexity of the data would warrant the use of all the new toys tools, people overlook a major problem: security of their systems and of their data.

The currently offered "big data technology stack" is all but secure. Sure, companies try to earn money with security add-ons such as Kerberos authentication to sell multi-tenancy, and with offering their version of Hadoop (their "Hadoop distribution").
The security problem is deep inside the "stack". It comes from the way this world ticks: the world of people that constantly follow the latest tool-of-the-day. In many of the projects, you no longer have mostly Linux developers that co-function as system administrators, but you see a lot of Apple iFanboys now. They live in a world where technology is outdated after half a year, so you will not need to support product longer than that. They love reinstalling their development environment frequently - because each time, they get to change something. They also live in a world where you would simply get a new model if your machine breaks down at some point. (Note that this will not work well for your big data project, restarting it from scratch every half year...)
And while Mac users have recently been surprisingly unaffected by various attacks (and unconcerned about e.g. GoToFail, or the fail to fix the rootpipe exploit) the operating system is not considered to be very secure. Combining this with users who do not care is an explosive mixture...
This type of developer, who is good at getting a prototype website for a startup kicking in a short amount of time, rolling out new features every day to beta test on the live users is what currently makes the Dotcom 2.0 bubble grow. It's also this type of user that mainstream products aim at - he has already forgotten what was half a year ago, but is looking for the next tech product to announced soon, and willing to buy it as soon as it is available...
This attitude causes a problem at the very heart of the stack: in the way packages are built, upgrades (and safety updates) are handled etc. - nobody is interested in consistency or reproducability anymore.
Someone commented on my blog that all these tools "seem to be written by 20 year old" kids. He probably is right. It wouldn't be so bad if we had some experienced sysadmins with a cluebat around. People that have experience on how to build systems that can be maintained for 10 years, and securely deployed automatically, instead of relying on puppet hacks, wget and unzipping of unsigned binary code.
I know that a lot of people don't want to hear this, but:
Your Hadoop system contains unsigned binary code in a number of places, that people downloaded, uploaded and redownloaded a countless number of times. There is no guarantee that .jar ever was what people think it is.
Hadoop has a huge set of dependencies, and little of this has been seriously audited for security - and in particular not in a way that would allow you to check that your binaries are built from this audited code anyway.
There might be functionality hidden in the code that just sits there and waits for a system with a hostname somewhat like "" to start looking for its command and control server to steal some key data from your company. The way your systems are built they probably do not have much of a firewall guarding against such. Much of the software may be constantly calling home, and your DevOps would not notice (nor would they care, anyway).
The mentality of "big data stacks" these days is that of Windows Shareware in the 90s. People downloading random binaries from the Internet, not adequately checked for security (ever heard of anybody running an AntiVirus on his Hadoop cluster?) and installing them everywhere.
And worse: not even keeping track of what they installed over time, or how. Because the tools change every year. But what if that developer leaves? You may never be able to get his stuff running properly again!
I predict that within the next 5 years, we will have a number of security incidents in various major companies. This is industrial espionage heaven. A lot of companies will cover it up, but some leaks will reach mass media, and there will be a major backlash against this hipster way of stringing together random components.
There is a big "Hadoop bubble" growing, that will eventually burst.
In order to get into a trustworthy state, the big data toolchain needs to:
  • Consolidate. There are too many tools for every job. There are even too many tools to manage your too many tools, and frontends for your frontends.
  • Lose weight. Every project depends on way too many other projects, each of which only contributes a tiny fragment for a very specific use case. Get rid of most dependencies!
  • Modularize. If you can't get rid of a dependency, but it is still only of interest to a small group of users, make it an optional extension module that the user only has to install if he needs this particular functionality.
  • Buildable. Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background. Test your builds offline, with a clean build directory, and document them! Everything must be rebuildable by any sysadmin in a reproducible way, so he can ensure a bug fix is really applied.
  • Distribute. Do not rely on binary downloads from your CDN as sole distribution channel. Instead, encourage and support alternate means of distribution, such as the proper integration in existing and trusted Linux distributions.
  • Maintain compatibility. successful big data projects will not be fire-and-forget. Eventually, they will need to go into production and then it will be necessary to run them over years. It will be necessary to migrate them to newer, larger clusters. And you must not lose all the data while doing so.
  • Sign. Code needs to be signed, end-of-story.
  • Authenticate. All downloads need to come with a way of checking the downloaded files agree with what you uploaded.
  • Integrate. The key feature that makes Linux systems so very good at servers is the all-round integrated software management. When you tell the system to update - and you have different update channels available, such as a more conservative "stable/LTS" channel, a channel that gets you the latest version after basic QA, and a channel that gives you the latest versions shortly after their upload to help with QA. It covers almost all software on your system, so it does not matter whether the security fix is in your kernel, web server, library, auxillary service, extension module, scripting language etc. - it will pull this fix and update you in no time.
Now you may argue that Hortonworks, Cloudera, Bigtop etc. already provide packages. Well ... they provide crap. They have something they call a "package", but it fails by any quality standards. Technically, a Wartburg is a car; but not one that would pass todays safety regulations...
For example, they only support Ubuntu 12.04 - a three year old Ubuntu is the latest version they support... Furthermore, these packages are roughly the same. Cloudera eventually handed over their efforts to "the community" (in other words, they gave up on doing it themselves, and hoped that someone else would clean up their mess); and Hortonworks HDP (any maybe Pivotal HD, too) is derived from these efforts, too. Much of what they do is offering some extra documentation and training for the packages they built using Bigtop with minimal effort.
The "spark" .deb packages of Bigtop, for example, are empty. They forgot to include the .jars in the package. Do I really need to give more examples of bad packaging decisions? All bigtop packages now depend on their own version of groovy - for a single script. Instead of rewriting this script in an already required language - or in a way that it would run on the distribution-provided groovy version - they decided to make yet another package, bigtop-groovy.
When I read about Hortonworks and IBM announcing their "Open Data Platform", I could not care less. As far as I can tell, they are only sticking their label on the existing tools anyway. Thus, I'm also not surprised that Cloudera and MapR do not join this rebranding effort - given the low divergence of Hadoop, who would need such a label anyway?
So why does this matter? Essentially, if anything does not work, you are currently toast. Say there is a bug in Hadoop that makes it fail to process your data. Your business is belly-up because of that, no data is processed anymore, your are vegetable. Who is going to fix it? All these "distributions" are built from the same, messy, branch. There is probably only a dozen of people around the world who have figured this out well enough to be able to fully build this toolchain. Apparently, none of the "Hadoop" companies are able to support a newer Ubuntu than 2012.04 - are you sure they have really understood what they are selling? I have doubts. All the freelancers out there, they know how to download and use Hadoop. But can they get that business-critical bug fix into the toolchain to get you up and running again? This is much worse than with Linux distributions. They have build daemons - servers that continuously check they can compile all the software that is there. You need to type two well-documented lines to rebuild a typical Linux package from scratch on your workstation - any experienced developer can follow the manual, and get a fix into the package. There are even people who try to recompile complete distributions with a different compiler to discover compatibility issues early that may arise in the future.
In other words, the "Hadoop distribution" they are selling you is not code they compiled themselves. It is mostly .jar files they downloaded from unsigned, unencrypted, unverified sources on the internet. They have no idea how to rebuild these parts, who compiled that, and how it was built. At most, they know for the very last layer. You can figure out how to recompile the Hadoop .jar. But when doing so, your computer will download a lot of binaries. It will not warn you of that, and they are included in the Hadoop distributions, too.
As is, I can not recommend to trust your business data into Hadoop.
It is probably okay to copy the data into HDFS and play with it - in particular if you keep your cluster and development machines isolated with strong firewalls - but be prepared to toss everything and restart from scratch. It's not ready yet for prime time, and as they keep on adding more and more unneeded cruft, it does not look like it will be ready anytime soon.

One more examples of the immaturity of the toolchain:
The scala package from cannot be cleanly installed as an upgrade to the old scala package that already exists in Ubuntu and Debian (and the distributions seem to have given up on compiling a newer Scala due to a stupid Catch-22 build process, making it very hacky to bootstrap scala and sbt compilation).
And the "upstream" package also cannot be easily fixed, because it is not built with standard packaging tools, but with an automagic sbt helper that lacks important functionality (in particular, access to the Replaces: field, or even cleaner: a way of splitting the package properly into components) instead - obviously written by someone with 0 experience in packaging for Ubuntu or Debian; and instead of using the proven tools, he decided to hack some wrapper that tries to automatically do things the wrong way...

I'm convinced that most "big data" projects will turn out to be a miserable failure. Either due to overmanagement or undermanagement, and due to lack of experience with the data, tools, and project management... Except that - of course - nobody will be willing to admit these failures. Since all these projects are political projects, they by definition must be successful, even if they never go into production, and never earn a single dollar.

12 March 2015

Erich Schubert: The sad state of sysadmin in the age of containers

System administration is in a sad state. It in a mess.
I'm not complaining about old-school sysadmins. They know how to keep systems running, manage update and upgrade paths.
This rant is about containers, prebuilt VMs, and the incredible mess they cause because their concept lacks notions of "trust" and "upgrades".
Consider for example Hadoop. Nobody seems to know how to build Hadoop from scratch. It's an incredible mess of dependencies, version requirements and build tools.
None of these "fancy" tools still builds by a traditional make command. Every tool has to come up with their own, incomptaible, and non-portable "method of the day" of building.
And since nobody is still able to compile things from scratch, everybody just downloads precompiled binaries from random websites. Often without any authentication or signature.
NSA and virus heaven. You don't need to exploit any security hole anymore. Just make an "app" or "VM" or "Docker" image, and have people load your malicious binary to their network.
The Hadoop Wiki Page of Debian is a typical example. Essentially, people have given up in 2010 to be able build Hadoop from source for Debian and offer nice packages.
To build Apache Bigtop, you apparently first have to install puppet3. Let it download magic data from the internet. Then it tries to run sudo puppet to enable the NSA backdoors (for example, it will download and install an outdated precompiled JDK, because it considers you too stupid to install Java.) And then hope the gradle build doesn't throw a 200 line useless backtrace.
I am not joking. It will try to execute commands such as e.g.
/bin/bash -c "wget ; dpkg -x ./scala-2.10.3.deb /"
Note that it doesn't even install the package properly, but extracts it to your root directory. The download does not check any signature, not even SSL certificates. (Source: Bigtop puppet manifests)
Even if your build would work, it will involve Maven downloading unsigned binary code from the internet, and use that for building.
Instead of writing clean, modular architecture, everything these days morphs into a huge mess of interlocked dependencies. Last I checked, the Hadoop classpath was already over 100 jars. I bet it is now 150, without even using any of the HBaseGiraphFlumeCrunchPigHiveMahoutSolrSparkElasticsearch (or any other of the Apache chaos) mess yet.
Stack is the new term for "I have no idea what I'm actually using".
Maven, ivy and sbt are the go-to tools for having your system download unsigned binary data from the internet and run it on your computer.
And with containers, this mess gets even worse.
Ever tried to security update a container?
Essentially, the Docker approach boils down to downloading an unsigned binary, running it, and hoping it doesn't contain any backdoor into your companies network.
Feels like downloading Windows shareware in the 90s to me.
When will the first docker image appear which contains the Ask toolbar? The first internet worm spreading via flawed docker images?

Back then, years ago, Linux distributions were trying to provide you with a safe operating system. With signed packages, built from a web of trust. Some even work on reproducible builds.
But then, everything got Windows-ized. "Apps" were the rage, which you download and run, without being concerned about security, or the ability to upgrade the application to the next version. Because "you only live once".
Update: it was pointed out that this started way before Docker: Docker is the new 'curl sudo bash' . That's right, but it's now pretty much mainstream to download and run untrusted software in your "datacenter". That is bad, really bad. Before, admins would try hard to prevent security holes, now they call themselves "devops" and happily introduce them to the network themselves!

22 January 2015

Erich Schubert: Year 2014 in Review as Seen by a Trend Detection System

We ran our trend detection tool Signi-Trend (published at KDD 2014) on news articles collected for the year 2014. We removed the category of financial news, which is overrepresented in the data set. Below are the (described) results, from the top 50 trends (I will push the raw result to appspot if possible due to file limits).
I have highlighted the top 10 trends in bold, but otherwise ordered them chronologically.
Updated: due to an error in a regexp, I had filtered out too many stories. The new results use more articles.

2014-01-29: Obama's state of the union address
2014-02-07: Sochi Olympics gay rights protests
2014-02-08: Sochi Olympics first results
2014-02-19: Violence in Ukraine and Maidan in Kiev
2014-02-20: Wall street reaction to Facebook buying WhatsApp
2014-02-22: Yanukovich leaves Kiev
2014-02-28: Crimea crisis begins
2014-03-01: Crimea crisis escalates futher
2014-03-02: NATO meeting on Crimea crisis
2014-03-04: Obama presents U.S. fiscal budget 2015 plan
2014-03-08: Malaysia Airlines MH-370 missing in South China Sea
2014-03-08: MH-370: many Chinese on board of missing airplane
2014-03-15: Crimean status referencum (upcoming)
2014-03-18: Crimea now considered part of Russia by Putin
2014-03-21: Russian stocks fall after U.S. sanctions.
2014-04-02: Chile quake and tsunami warning
2014-04-09: False positive? experience + views
2014-04-13: Pro-russian rebels in Ukraine's Sloviansk
2014-04-17: Russia-Ukraine crisis continues
2014-04-22: French deficit reduction plan pressure
2014-04-28: Soccer World Cup coverage: team lineups
2014-05-14: MERS reports in Florida, U.S.
2014-05-23: Russia feels sanctions impact
2014-05-25: EU elections
2014-06-06: World cup coverage
2014-06-13: Islamic state Camp Speicher massacre in Iraq
2014-06-14: Soccer world cup: Spain surprisingly destoyed by Netherlands
2014-07-05: Soccer world cup quarter finals
2014-07-17: Malaysian Airlines MH-17 shot down over Ukraine
2014-07-18: Russian blamed for 298 dead in airline downing
2014-07-19: Independent crash site investigation demanded
2014-07-20: Israel shelling Gaza causes 40+ casualties in a day
2014-08-07: Russia bans food imports from EU and U.S.
2014-08-08: Obama orders targeted air strikes in Iraq
2014-08-20: IS murders journalist James Foley, air strikes continue
2014-08-30: EU increases sanctions against Russia
2014-09-05: NATO summit with respect to IS and Ukraine conflict
2014-09-11: Scottish referendum upcoming - poll results are close
2014-09-23: U.N. on legality of U.S. air strikes in Syria against IS
2014-09-26: Star manager Bill Gross leaves Allianz/PIMCO for Janus
2014-10-22: Ottawa parliament shooting
2014-10-26: EU banking review
2014-11-05: U.S. Senate and governor elections
2014-11-12: Foreign exchange manipulation investigation results
2014-11-17: Japan recession
2014-12-11: CIA prisoner and U.S. torture centers revieled
2014-12-15: Sydney cafe hostage siege
2014-12-17: U.S. and Cuba relations improve unexpectedly
2014-12-18: Putin criticizes NATO, U.S., Kiev
2014-12-28: AirAsia flight QZ-8501 missing

As you can guess, we are really happy with this result - just like the result for 2013 it mentiones (almost) all the key events.
There probably is one "false positive" there: 2014-04-09 has a lot of articles talking about "experience" and "views", but not all refer to the same topic (we did not do topic modeling yet).
There are also some events missing that we would have liked to appear; many of these barely did not make it into the top 50, but do appear in the top 100, such as the Sony cyberattack (#51) and the Fergusson riots on November 11 (#66).
You can also explore the results online in a snapshot.

16 January 2015

Erich Schubert: Year 2014 in Review as Seen by a Trend Detection System

We ran our trend detection tool Signi-Trend (published at KDD 2014) on news articles collected for the year 2014. We removed the category of financial news, which is overrepresented in the data set. Below are the (described) results, from the top 50 trends (I will push the raw result to appspot if possible due to file limits). The top 10 trends are highlighted in bold.
2014-01-29: Obama's State of the Union address
2014-02-05..23: Sochi Olympics (11x, including the four below)
2014-02-07: Gay rights protesters arrested at Sochi Olympics
2014-02-08: Sochi Olympics begins
2014-02-16: Injuries in Sochi Extreme Park
2014-02-17: Men's Snowboard cross finals called of because of fog
2014-02-19: Violence in Ukraine and Kiev
2014-02-22: Yanukovich leaves Kiev
2014-02-23: Sochi Olympics close
2014-02-28: Crimea crisis begins
2014-03-01..06: Crimea crisis escalates futher (3x)
2014-03-08: Malaysia Airlines machine missing in South China Sea (2x)
2014-03-18: Crimea now considered part of Russia by Putin
2014-03-28: U.N. condemns Crimea's secession
2014-04-17..18: Russia-Ukraine crisis continues (3x)
2014-04-20: South Korea ferry accident
2014-05-18: Cannes film festival
2014-05-25: EU elections
2014-06-13: Islamic state fighting in Iraq
2014-06-16: U.S. talks to Iran about Iraq
2014-07-17..19: Malaysian airline shot down over Ukraine (3x)
2014-07-20: Israel shelling Gaza kills 40+ in a day
2014-08-07: Russia bans EU food imports
2014-08-20: Obama orders U.S. air strikes in Iraq against IS
2014-08-30: EU increases sanctions against Russia
2014-09-04: NATO summit
2014-09-23: Obama orders more U.S. air strikes against IS
2014-10-16: Ebola case in Dallas
2014-10-24: Ebola patient in New York is stable
2014-11-02: Elections: Romania, and U.S. rampup
2014-11-05: U.S. Senate elections
2014-11-25: Ferguson prosecution
2014-12-08: IOC Olympics sport additions
2014-12-11: CIA prisoner center in Thailand
2014-12-15: Sydney cafe hostage siege
2014-12-17: U.S. and Cuba relations improve unexpectedly
2014-12-19: North Korea blamed for Sony cyber attack
2014-12-28: AirAsia flight 8501 missing

13 January 2015

Erich Schubert: Big data predictions for 2015

My big data predictions for 2015:
  1. Big data will continue to fail to deliver for most companies.
    This has several reasons, including in particular: 1: lack of data to analyze that actually benefits from big data tools and approaches (and which is not better analyzed with traditional tools). 2: lack of talent, and failure to attract analytics talent. 3: stuck in old IT, and too inflexible to allow using modern tools (if you want to use big data, you will need a flexible "in-house development" type of IT that can install tools, try them, abandon them, without going up and down the management chains) 4: too much marketing. As long as big data is being run by the marketing department, not by developers, it will fail.
  2. Project consolidation: we have seen hundreds of big data software projects the last years. Plenty of them on Apache, too. But the current state is a mess, there is massive redundancy, and lots and lots of projects are more-or-less abandoned. Cloudera ML, for example, is dead: superseded by Oryx and Oryx 2. More projects will be abandoned, because we have way too many (including much too many NoSQL databases, that fail to outperform SQL solutions like PostgreSQL). As is, we have dozens of competing NoSQL databases, dozens of competing ML tools, dozens of everything.
  3. Hype: the hype will continue, but eventually (when there is too much negative press on the term "big data" due to failed projects and inflated expectations) move on to other terms. The same is also happening to "data science", so I guess the next will be "big analytics", "big intelligence" or something like that.
  4. Less openness: we have seen lots of open-source projects. However, many decided to go with Apache-style licensing - always ready to close down their sharing, and no longer share their development. In 2015, we'll see this happen more often, as companies try to make money off their reputation. At some point, copyleft licenses like GPL may return to popularity due to this.