This post is a follow-up to my earlier post on the
"sad state of sysadmin in the age of containers".
While I was drafting this post, that story got picked up by HackerNews, Reddit
and Twitter, sending a lot of comments and emails my way.
Surprisingly, many of the comments were supportive of my impression - I would
have expected many more insults along the lines of "you just don't like
my-favorite-tool, so you rant against using it". But a lot of people seem to share
my concerns. Thanks, you surprised me!
Here is the new
rant post, in the slightly different context of big data:
Everybody is doing "big data" these days. Or at least, pretending to do so
to upper management. A lot of the time, there is no big data. People do more
data analysis than before, and therefore stick the "big data" label on it to
promote themselves and get a green light from management, don't they?
"Big data" is not a technical term. It is a business term, referring to any
attempt to get more value out of your business by analyzing data
you did not use before. From this point of view, most such projects
are indeed "big data" as in "data-driven revenue generation" projects.
It may be unsatisfying to those interested in the challenges of volume and
the other "V's", but this is the reality of how the term is used.
But even in those cases where the volume and complexity of the data would
warrant the use of all the new toys and tools, people overlook
a major problem: the security of their systems and of their data.
The currently offered "big data technology stack" is anything but secure.
Sure, companies try to earn money with security add-ons such as Kerberos
authentication (to sell multi-tenancy), and by offering their own version of
Hadoop (their "Hadoop distribution").
The security problem is deep inside the "stack". It comes from the way this
world ticks: the world of people that constantly follow the latest
tool-of-the-day. In many projects, you no longer have mostly Linux
developers that double as system administrators; instead you see a lot of
Apple iFanboys now. They live in a world where technology is outdated after
half a year, so you will not need to support a product longer than that. They
love reinstalling their development environment frequently - because each time,
they get to change something. They also live in a world where you would simply
get a new model if your machine breaks down at some point. (Note that this will
not work well for your big data project: restarting it from scratch every half
year is not an option.)
And while Mac users have recently been surprisingly unaffected by various attacks
(and unconcerned about e.g. GoToFail, or the failure to fix known vulnerabilities),
the operating system is considered to be very secure.
Combining this with users who do not care is an accident waiting to happen.
This type of developer - good at getting a prototype website for a startup
up and kicking in a short amount of time, rolling out new features every day to beta
test on the live users - is what currently makes the Dotcom 2.0 bubble grow.
It's also this type of user that mainstream products aim at: he has already
forgotten what happened half a year ago, but is looking for the next tech product
to be announced soon, and is willing to buy it as soon as it is available...
This attitude causes a problem at the very heart of the stack:
in the way packages are built, upgrades (and security updates) are handled, etc.
- nobody is interested in consistency or reproducibility anymore.
Someone commented on my blog that all these tools "seem to be written
by 20 year old" kids. He is probably right. It wouldn't be so bad if we had
some experienced sysadmins with a cluebat around - people who have
experience in building systems that can be maintained for 10 years
and deployed securely and automatically, instead of relying on Puppet hacks,
wget, and unzipping unsigned binary code.
I know that a lot of people don't want to hear this, but:
Your Hadoop system contains unsigned binary code in a number of
places - code that people have downloaded, uploaded and re-downloaded countless
times. There is no guarantee that any given .jar ever was what people
think it is.
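To make the point concrete: a signed .jar carries signature entries (META-INF/*.SF) that can at least be checked for. Here is a minimal sketch - the installation path is my assumption, not part of any Hadoop tooling - that lists the jars on a system which carry no signature at all:

```python
#!/usr/bin/env python
"""Report .jar files that carry no signature entries (META-INF/*.SF).

Rough illustration only: an unsigned jar is not necessarily malicious,
and a signed jar is only as trustworthy as the key that signed it.
"""
import os
import zipfile

HADOOP_HOME = "/opt/hadoop"  # assumption: wherever your jars actually live

def is_signed(jar_path):
    """A signed jar contains at least one META-INF/*.SF signature file."""
    try:
        with zipfile.ZipFile(jar_path) as jar:
            return any(name.startswith("META-INF/") and name.endswith(".SF")
                       for name in jar.namelist())
    except zipfile.BadZipFile:
        return False  # not even a valid zip archive

unsigned = []
for root, _dirs, files in os.walk(HADOOP_HOME):
    for name in files:
        if name.endswith(".jar"):
            path = os.path.join(root, name)
            if not is_signed(path):
                unsigned.append(path)

print("%d unsigned jar(s) found:" % len(unsigned))
for path in unsigned:
    print("  " + path)
```

Even a signed jar only proves who signed it, not that the code was audited or built from the sources you think it was - but at least it would be a verifiable starting point.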
Hadoop has a huge set of dependencies, and little of it has been
seriously audited for security - and in particular not in a way that would
allow you to check that your binaries were actually built from that audited code.
There might be functionality hidden in the code that just sits there and waits
for a system with a hostname something like "yourcompany.com" to start looking
for its command and control server, ready to steal key data from your company.
The way your systems are built, they probably do not have much of a firewall
guarding against this. Much of the software may be constantly calling home,
and your DevOps would not notice (nor would they care, anyway).
The mentality of "big data stacks" these days is that of Windows shareware
in the 90s: people downloading random binaries from the Internet, not
adequately checked for security (ever heard of anybody running an antivirus
on their Hadoop cluster?), and installing them everywhere.
And worse: not even keeping track of what they installed over time, or how,
because the tools change every year. But what if that developer leaves? You may
never be able to get his stuff running properly again!
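Keeping track would not even be hard. A few lines are enough to snapshot what is installed and with which checksums, so you can at least diff later what changed between installs. A sketch, assuming your stack lives somewhere under /opt (the path and manifest name are mine):

```python
#!/usr/bin/env python
"""Write a manifest (sha256 + path) of every jar below a directory.

Sketch only: run it after every install or upgrade and keep the manifests
in version control, so you can at least diff what changed and when.
"""
import hashlib
import os

STACK_ROOT = "/opt"  # assumption: wherever the "big data stack" was unpacked

def sha256(path, bufsize=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(bufsize), b""):
            digest.update(chunk)
    return digest.hexdigest()

with open("stack-manifest.txt", "w") as manifest:
    for root, _dirs, files in os.walk(STACK_ROOT):
        for name in sorted(files):
            if name.endswith(".jar"):
                path = os.path.join(root, name)
                manifest.write("%s  %s\n" % (sha256(path), path))
```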
I predict that within the next 5 years, we will have a number of security
incidents in various major companies. This is industrial espionage
heaven. A lot of companies will cover it up, but some leaks will reach
the mass media, and there will be a major backlash against this hipster way
of stringing together random components.
There is a big "Hadoop bubble" growing, and it will eventually burst.
In order to get into a trustworthy state, the big data toolchain needs to:
- Consolidate. There are too many tools for every job. There are even
too many tools to manage your too many tools, and frontends for your frontends.
- Lose weight. Every project depends on way too many
other projects, each of which only contributes a tiny fragment for a
very specific use case. Get rid of most dependencies!
- Modularize. If you can't get rid of a dependency, but it is
still only of interest to a small group of users, make it an optional
extension module that the user only has to install if he actually needs it.
- Buildable. Make sure that everybody can build everything
from scratch, without having to rely on Maven or Ivy or SBT downloading
something automagically in the background. Test your builds offline,
with a clean build directory, and document them! Everything must be
rebuildable by any sysadmin in a reproducible way, so he can ensure a
bug fix is really applied.
- Distribute. Do not rely on binary downloads from your CDN
as the sole distribution channel. Instead, encourage and support alternate
means of distribution, such as proper integration into existing
and trusted Linux distributions.
- Maintain compatibility. Successful big data projects will
not be fire-and-forget. Eventually, they will need to go into
production and then it will be necessary to run them over years.
It will be necessary to migrate them to newer, larger clusters. And you
must not lose all the data while doing so.
- Sign. Code needs to be signed, end-of-story.
- Authenticate. All downloads need to come with a way of checking that the
downloaded files match what was actually uploaded (see the sketch after this list).
- Integrate. The key feature that makes Linux systems so well suited as servers
is the all-round integrated software management. You have different update
channels available, such as a more conservative "stable/LTS" channel, a channel
that gets you the latest version after basic QA, and a channel that gives you
the latest versions shortly after their upload to help with QA. When you tell
the system to update, it covers almost all software on your system, so it does
not matter whether the security fix is in your kernel, web server, library,
auxiliary service, extension module, scripting language etc. - it will pull in
the fix and update you in no time.
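None of the above is rocket science. Checking a download against a published checksum, for example, is a few lines of code; the hard part is that the checksum itself has to come from a trusted, signed source - which is exactly what the current toolchain does not give you. A minimal sketch (the checksum file layout is an assumption; formats vary between projects):

```python
#!/usr/bin/env python
"""Verify a downloaded release against a published SHA-512 checksum file.

Sketch only: this proves the file matches the checksum, not that the
checksum (or the code behind it) is trustworthy.
"""
import hashlib
import re
import sys

def sha512(path, bufsize=1 << 20):
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(bufsize), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(checksum_path):
    """Extract the first 128-hex-digit string from the checksum file,
    whatever its exact layout (formats differ between projects)."""
    with open(checksum_path) as handle:
        blob = handle.read().replace(" ", "").replace("\n", "")
    match = re.search(r"[0-9a-fA-F]{128}", blob)
    if not match:
        raise ValueError("no SHA-512 digest found in " + checksum_path)
    return match.group(0).lower()

if __name__ == "__main__":
    artifact, checksum_file = sys.argv[1], sys.argv[2]
    if sha512(artifact) == expected_digest(checksum_file):
        print("OK: checksum matches")
    else:
        sys.exit("MISMATCH: do not install this file")
```

And for the "sign" part the same applies: verifying a GPG signature on a release is easy; it just has to be done consistently, and the keys have to be managed and trusted.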
Now you may argue that Hortonworks, Cloudera, Bigtop etc. already provide
packages. Well ... they provide crap. They have something they call a "package",
but it fails any quality standard. Technically, a rusty old clunker
is also a car - but not one that would pass today's safety regulations...
For example, they only support Ubuntu 12.04 - a three-year-old Ubuntu is the
latest version they support... Furthermore, these packages are all roughly
the same. Cloudera eventually handed over their efforts to "the community" (in
other words, they gave up on doing it themselves, and hoped that someone else
would clean up their mess); and Hortonworks HDP (and maybe Pivotal HD, too)
is derived from these efforts as well.
Much of what they do is offer some extra documentation and training for
the packages they built using Bigtop with minimal effort.
The "spark" .deb
packages of Bigtop, for example, are empty. They
forgot to include the .jar
s in the package. Do I really need to give
more examples of bad packaging decisions? All bigtop packages now depend on
their own version of groovy - for a single script. Instead of rewriting this
script in an already required language - or in a way that it would run
on the distribution-provided groovy version - they decided to make yet another
When I read about Hortonworks and IBM announcing their "Open Data Platform",
I could not care less. As far as I can tell, they are only sticking their label
on the existing tools anyway. Thus, I'm also not surprised that Cloudera and
MapR do not join this rebranding effort - given the low divergence of Hadoop, who
would need such a label anyway?
So why does this matter? Essentially, if anything does not work,
you are currently toast. Say there is a bug in Hadoop that makes it fail to
process your data. Your business is belly-up because of that: no data is processed
anymore, you are a vegetable. Who is going to fix it? All these "distributions"
are built from the same messy branch. There are probably only a dozen people
around the world who have figured this out well enough to be able to fully build
this toolchain. Apparently, none of the "Hadoop" companies
are able to support a newer Ubuntu than 12.04 - are you sure
they have really understood what they are selling? I have my doubts. All the
freelancers out there know how to download and use Hadoop. But
can they get that business-critical bug fix into the toolchain to get you up
and running again? This is much worse than with Linux distributions. They have
build daemons - servers that continuously check that they can compile all the
software that is there. You need to type two well-documented lines (essentially,
apt-get source followed by dpkg-buildpackage) to rebuild a typical
Linux package from scratch on your workstation - any experienced developer can
follow the manual and get a fix into the package. There are even people who
try to recompile complete distributions with
a different compiler
to discover compatibility issues early that may
arise in the future.
In other words, the "Hadoop distribution" they are selling you is not
code they compiled themselves. It is mostly .jar files they downloaded
from unsigned, unencrypted, unverified sources on the internet. They have no idea
how to rebuild these parts, who compiled them, or how they were built. At most,
they know this for the very last layer. Sure, you can figure out how to recompile
the Hadoop .jar - but while doing so, your computer will download a lot of binaries.
It will not warn you about that, and these binaries end up in the Hadoop
distributions, too.
As it stands, I cannot recommend entrusting your business data to Hadoop.
It is probably okay to copy the data into HDFS and play with it - in particular
if you keep your cluster and development machines isolated behind strong
firewalls - but be prepared to toss everything and restart from scratch. It is
not ready for prime time yet, and as they keep on adding more and more unneeded
cruft, it does not look like it will be ready anytime soon.
One more example of the immaturity of the toolchain:
The scala package from scala-lang.org cannot be cleanly installed as
an upgrade to the old scala package that already exists in Ubuntu and
Debian (and the distributions seem to have given up on compiling a newer Scala
due to a stupid Catch-22 build process, making it very hacky to bootstrap
scala and sbt compilation).
And the "upstream" package also cannot be easily fixed, because it is not built
with standard packaging tools, but with an automagic sbt helper that lacks
important functionality (in particular, access to the Replaces: field,
or even cleaner: a way of splitting the package properly into components)
instead - obviously written by someone with 0 experience in packaging for
Ubuntu or Debian; and instead of using the proven tools, he decided to hack
some wrapper that tries to automatically do things the wrong way...
I'm convinced that most "big data" projects will turn out to be miserable
failures - either due to overmanagement or undermanagement, and due to a
lack of experience with the data, the tools, and project management...
Except that - of course - nobody will be willing to admit these failures.
Since all these projects are political projects, they by definition must
be successful, even if they never go into production and never earn a single
cent.