Search Results: "tincho"

12 May 2017

Mart n Ferrari: 6 days to SunCamp

Only six more days to go before SunCamp! If you are still considering it, hurry up, you might still find cheap tickets for the low season. It will be a small event (about 20-25 people), with a more intimate atmosphere than DebConf. There will be people fixing RC bugs, preparing stuff for after the release, or just discussing with other Debian folks. There will be at least one presentation from a local project, and surely some members of nearby communities will join us for the day like they did last year. See you all in Lloret! Comment

10 March 2017

Mart n Ferrari: SunCamp happening again this May!

As I announced in mailing lists a few days ago, the Debian SunCamp (DSC2017) is happening again this May. SunCamp different to most other Debian events. Instead of a busy schedule of talks, SunCamp focuses on the hacking and socialising aspect, without making it just a Debian party/vacation.
DSC2016 - Hacking and discussing
The idea is to have 4 very productive days, staying in a relaxing and comfy environment, working on your own projects, meeting with your team, or presenting to fellow Debianites your most recent pet project.
DSC2016 - Tincho talking about Prometheus
We have tried to make this event the simplest event possible, both for organisers and attendees. There will be no schedule, except for the meal times at the hotel. But these can be ignored too, there is a lovely bar that serves snacks all day long, and plenty of restaurants and caf s around the village. DSC2016 - Hacking and discussing The SunCamp is an event to get work done, but there will be time for relaxing and socialising too.
DSC2016 - Well deserved siesta
DSC2016 - Playing P tanque
Do you fancy a hack-camp in a place like this? Swimming pool Caf Caf  terrace One of the things that makes the event simple, is that we have negotiated a flat price for accommodation that includes usage of all the facilities in the hotel, and optionally food. We will give you a booking code, and then you arrange your accommodation as you please, you can even stay longer if you feel like it! The rooms are simple but pretty, and everything has been renovated very recently. Room Room view We are not preparing a talks programme, but we will provide the space and resources for talks if you feel inclined to prepare one. You will have a huge meeting room, divided in 4 areas to reduce noise, where you can hack, have team discussions, or present talks. Hacklab Hacklab Do you want to see more pictures? Check the full gallery
Debian SunCamp 2017 Hotel Anabel, LLoret de Mar, Province of Girona, Catalonia, Spain May 18-21, 2017

Tempted already? Head to the wikipage and register now, it is only 2 months away! Please try to reserve your room before the end of March. The hotel has reserved a number of rooms for us until that time. You can reserve a room after March, but we can't guarantee the hotel will still have free rooms. Comment

2 March 2017

Mart n Ferrari: Prometheus in Jessie(bpo) and Stretch

Prometheus logo Just over 2 years ago, I started packaging Prometheus for Debian. It was a daunting task, mainly because the Golang ecosystem is so young, almost none of its dependencies were packaged, and upstream practices are very different from Debian's. Today I want to announce that this is complete. The main part of the Prometheus stack is available in Debian testing, which very soon will become the Stretch release: These are available for a bunch of architectures (sadly, not all of them), and are the most important building blocks to deploy Prometheus in your network. I have also finished preparing backports for all the required dependencies, and jessie-backports has a complete Prometheus stack too! Adding to these, the archive already has the client libraries for Go, Perl, Python, and Django; and a bunch of other exporters. Except for the Postgres exporter, most of these are going to be in the Stretch release, and I plan to prepare backports for Jessie too: Note that not all of these have been packaged by me, luckily other Debianites are also working on Prometheus packaging! I am confident that Prometheus is going to become one of the main monitoring tools in the near future, and I am very happy that Debian is the first distribution to offer official packages for it. If you are interested, there is still lots of work ahead. Patches, bug reports, and co-maintainers are welcome! Update 3/3/2017: Today the Perl client library was uploaded to unstable, and it is waiting for ftp-master approval. Comment

17 November 2016

Mart n Ferrari: Replacing proprietary Play Services with MicroG

For over a year now, I have been using CyanogenMod in my Nexus phone. At first I just installed some bundle that brought all the proprietary Google applications and libraries, but later I decided that I wanted more control over it, so I did a clean install with no proprietary stuff. This was not so great at the beginning, because the base system lacks the geolocation helpers that allow you to get a position in seconds (using GSM antennas and Wifi APs). And so, every time I opened OsmAnd (a great maps application, free software and free maps), I would have to wait minutes for the GPS radio to locate enough satellites. Shortly after, I found about the UnifiedNLP project that provided a drop-in replacement for the location services, using pluggable location providers. This worked great, and you could choose to use the Apple or Mozilla on-line providers, or off-line databases that you could download or build yourself. This worked well for most of my needs, and I was happy about it. I also had F-droid for installing free software applications, DAVdroid for contacts and calendar synchronisation, K9 for reading email, etc. I still needed some proprietary apps, but most of the stuff in my phone was Free Software. But sadly, more and more application developers are buying into the vendor lock-in that is Google Play Services, which is a set of APIs that offer some useful functionality (including the location services), but that require non-free software that is not part of the AOSP (Android Open-Source Project). Mostly, this is because they make push notifications a requirement, or because they want you to be able to buy stuff from the application. This is not limited to proprietary software. Most notably, the Signal project refuses to work without these libraries, or even to distribute the pre-compiled binaries on any platform that is not Google Play! (This is one of many reason why I don't recommend Signal to anybody). And of course, many very useful services that people use every day require you to install proprietary applications, which care much less about your choices of non-standard platforms. For the most part, I had been able to just get the package files for these applications1 from somewhere, and have the functionality I wanted. Some apps would just work perfectly, others would complain about the lack of Play Services, but offer a degraded experience. You would not get notifications unless you re-opened the application, stuff like that. But they worked. Lately, some of the closed-source apps I sometimes use stopped working altogether. So, tired of all this, I decided to give the MicroG project a try.

MicroG MicroG is a direct descendant of UnifiedNLP and the NOGAPPS project. I had known about it for a while, but the installation procedures always put me off. LWN published an article about them recently, and so I decided to spend a good chunk of time making it work. This blog post is to help making this a bit less painful. Some prerequisites:
  • No Gapps installed.
  • Rooted phone (at least for the mechanism I describe here).
  • Working ADB with root access to the phone.
  • UnifiedNLP needs to be uninstalled (MicroG carries its own version of it).

Signature spoofing The main problem with the installation, is that MicroG needs to pretend to be the original Google bundle. It has to show the same name, but most importantly, it has to spoof its cryptographic signatures so other applications do not realise it is not the real thing. OmniROM and MarshROM (other alternative firmwares for Android) provide support for signature spoofing out of the box. If you are running these, go ahead and install MicroG, it will be very easy! Sadly, the CyanogenMod refused allowing signature spoofing, citing security concerns2, so most users will have to go the hard way. The options for enabling this feature are basically two: either patch some core libraries H4xx0r style, or use the "Xposed framework". Since I still don't really understand what this Xposed thing is, and it is one of these projects that distributes files on XDA forums, I decided to go the dirty way.

Patching the ROM Note: this method requires a rooted phone, java, and adb. The MicroG wiki links to three different projects for patching your ROM, but turns out that two of them would not work at all with CyanogenMod 11 and 13 (the two versions I tried), because the system is "odexed" (whatever that means, the Android ecosystem is really annoying). I actually upgraded CM to version 13 just to try this, so a couple of hours went there, with no results. The project that did work is Haystack by Lanchon, which seems to be the cleaner and better developed of the three. Also the one with the most complex documentation. In a nutshell, you need to download a bunch of files from the phone, apply a series of patches to it, and then reupload it. To obtain the files and place them in the maguro (the codename for my phone) directory:
$ ./pull-fileset maguro
Now you need to apply some patches with the patch-fileset script. The patches are located in the patches directory:
$ ls patches
sigspoof-core
sigspoof-hook-1.5-2.3
sigspoof-hook-4.0
sigspoof-hook-4.1-6.0
sigspoof-hook-7.0
sigspoof-ui-global-4.1-6.0
sigspoof-ui-global-7.0
The patch-fileset script takes these parameters:
patch-fileset <patch-dir> <api-level> <input-dir> [ <output-dir> ]
Note that this requires knowing your Android version, and the API level. If you don't specify an output directory, it will append the patch name to the input directory name. Note that you should also check the output of the script for any warnings or errors! First, you need to apply the patch that hooks into your specific version of Android (6.0, API level 23 in my case):
$ ./patch-fileset patches/sigspoof-hook-4.1-6.0 23 maguro
>>> target directory: maguro__sigspoof-hook-4.1-6.0
[...]
*** patch-fileset: success
Now, you need to add the core spoofing module:
$ ./patch-fileset patches/sigspoof-core 23 maguro__sigspoof-hook-4.1-6.0
>>> target directory: maguro__sigspoof-hook-4.1-6.0__sigspoof-core
[...]
*** patch-fileset: success
And finally, add the UI elements that let you enable or disable the signature spoofing:
$ ./patch-fileset patches/sigspoof-ui-global-4.1-6.0 23 maguro__sigspoof-hook-4.1-6.0__sigspoof-core
>>> target directory: maguro__sigspoof-hook-4.1-6.0__sigspoof-core__sigspoof-ui-global-4.1-6.0
[...]
*** patch-fileset: success
Now, you have a bundle ready to upload to your phone, and you do that with the push-fileset script:
$ ./push-fileset maguro__sigspoof-hook-4.1-6.0__sigspoof-core__sigspoof-ui-global-4.1-6.0
After this, reboot your phone, go to settings / developer settings, and at the end you should find a checkbx for "Allow signature spoofing" which you should now enable.

Installing MicroG Now that the difficult part is done, the rest of the installation is pretty easy. You can add the MicroG repository to F-Droid and install the rest of the modules from there. Check the installation guide for all the details. Once all the parts are in place, and after a last reboot, you should find a MicroG settings icon that will check that everything is working correctly, and give you the choice of which components to enable. So far, other applications believe this phone has nothing weird, I get quick GPS fixes, push notifications seem to work... Not bad at all for such a young project! Hope this is useful. I would love to hear your feedback in the comments!

  1. Which is a pretty horrible thing, having to download from fishy sites because Google refuses to let you use the marketplace without their proprietary application. I used to use a chrome extension to trick Google Play into believing your phone was downloading it, and so you could get the APK file, but now that I have no devices running stock Android, that does not work any more.
  2. Actually, I am still a bit concerned about this, because it is not completely clear to me how protected this is against malicious actors (I would love to get feedback on this!).
3 comments

28 August 2016

Mart n Ferrari: On speaking at community conferences

Many people reading this have already suffered me talking to them about Prometheus. In personal conversation, or in the talks I gave at DebConf15 in Heidelberg, the Debian SunCamp in Lloret de Mar, BRMlab in Prague, and even at a talk on a different topic at the RABS in Cluj-Napoca. Since their public announcement, I have been trying to support the project in the ways I could: by packaging it for Debian, and by spreading the word. Last week the first ever Prometheus conference took place, so this time I did the opposite thing and I spoke about Debian to Prometheus people: I gave a 5 minutes lightning talk on Debian support for Prometheus. What blew me away was the response I've got: from this tiny non-talk I prepared in 30 minutes, many people stopped me later to thank me for packaging Prometheus, and for Debian in general. They told me they were using my packages, gave me feedback, and even some RFPs! At the post-conference beers, I had quite a few interesting discussions about Debian, release processes, library versioning, and culture clashes with upstream. I was expecting some Debian-bashing (we are old-fashioned, slow, etc.), instead I had intelligent and enriching conversations. To me, this enforces once again my support and commitment to community conferences, where nobody is a VIP and everybody has something to share. It also taught me the value of intersecting different groups, even when there seems to be little in common. Comment

3 May 2016

Mart n Ferrari: New sources for contributors.debian.org

Many people might not be aware of it, but since a couple of years ago, we have an excellent tool for tracking and recognising contributors to the Debian Project: Debian Contributors Debian is a big project, and there are many people working that do not have great visibility, specially if they are not DDs or DMs. We are all volunteers, so it is very important that everybody gets credited for their work. No matter how small or unimportant they might think their work is, we need to recognise it! One great feature of the system is that anybody can sign up to provide a new data source. If you have a way to create a list of people that is helping in your project, you can give them credit! If you open the Contributors main page, you will get a list of all the groups with recent activity, and the people credited for their work. The data sources page gives information about each data source and who administers it. For example, my Contributors page shows the many ways in which the system recognises me, all the way back to 2004! That includes commits to different projects, bug reports, and package uploads. I have been maintaining a few of the data sources that track commits to Git and Subversion repositories: The last two are a bit problematic, as they group together all commits to the respective VCS repositories without distinguishing to which sub-projects the contributions were made. The Go and Perl groups' contributions are already extracted from that big pile of data, but it would be much nicer if each substantial packaging team had their own data source. Sadly, my time is limited, so this is were you come into the picture! If you are a member of a team, and want to help with this effort, adopt a new data source. You can be providing commit logs, but it is not limited to that; think of translators, event volunteers, BSP attendants, etc. The initial work is very small, and there is almost no maintenance. There is information on how to contribute here and here, but I would be more than happy to guide you if you contact me. Comment

9 April 2016

Mart n Ferrari: Come to SunCamp this May!

Do you fancy a hack-camp in a place like this? Swimming pool As you might have heard by now, Ana (Guerrero) and I are organising a small Debian event this spring: the Debian SunCamp 2016. It is going to be different to most other Debian events. Instead of a busy schedule of talks, SunCamp will focus on the hacking and socialising aspect. We have tried to make this event the simplest event possible, both for organisers and attendees. There will be no schedule, except for the meal times at the hotel. But these can be ignored too, there is a lovely bar that serves snacks all day long, and plenty of restaurants and caf s around the village. Caf Caf  terrace One of the things that makes the event simple, is that we have negotiated a flat price for accommodation that includes usage of all the facilities in the hotel, and optionally food. We will give you a booking code, and then you arrange your accommodation as you please, you can even stay longer if you feel like it! The rooms are simple but pretty, and everything has been renovated very recently. Room Room view We are not preparing a talks programme, but we will provide the space and resources for talks if you feel inclined to prepare one. You will have a huge meeting room, divided in 4 areas to reduce noise, where you can hack, have team discussions, or present talks. Hacklab Hacklab Of course, some people will prefer to move their discussions to the outdoor area. Outside Sun   lounge Or just skip the discussion, and have a good time with your Debian friends, playing p tanque, pool, air hockey, arcades, or board games. P tanque Snookers Do you want to see more pictures? Check the full gallery
Debian SunCamp 2016 Hotel Anabel, LLoret de Mar, Province of Girona, Catalonia, Spain May 26-29, 2016

Tempted already? Head to the wikipage and register now, it is only 7 weeks away! Please reserve your room before the end of April. The hotel has reserved a number of rooms for us until that time. You can reserve a room after April, but we can't guarantee the hotel will still have free rooms. Comment

5 March 2016

Mart n Ferrari: Serendipity

So I was reading G+, and saw there a post by Bernd Zeimetz about some "marble machine". Which turns out to be a very cool device that is programmed to play a single tune, and it is just mesmerising to watch: So, naturally, I click through to see if there is more music made with this machine. It turns out the machine has been on the making for a while, and the first complete song (the one embedded above) was released only a few days ago. It is obviously porn for nerds, and Wired had already posted an article about it. So instead I found a band called like the machine: Wintergatan, which sounds pretty great. It took me a while to realise the guy who built the machine is one of the members of the band. They even have a page collecting all the videos about the machine. After a while, and noticing the suggestions from Youtube, I realise that two of the members of Wintergatan were previously in Detektivbyr n, which is another band I love, and about which I wrote a post on this very blog, 7.5 years ago!1. So the sad news is that Detektivbyran disbanded, the good news is that this guy keeps making great music, now with insane machines. I only discovered Detektivbyran in the first place thanks to an article the -now sadly defunct- Coilhouse Magazine. I find this 8-year long loop that closes unexpectedly during a late-night idle browsing session pretty amusing.

  1. I keep telling my friends that I was a hipster before it was cool to do so...
Comment

13 February 2016

Mart n Ferrari: Why I am not touching node.js

Dear node.js/node-webkit people, what's the matter with you? I wanted to try out some stuff that requires node-webkit. So I try to use npm to download, build and install it, like CPAN would do. But then I see that the nodewebkit package is just a stub that downloads a 37MB file (using HTTP without TLS) containing pre-compiled binaries. Are you guys out of your minds? This is enough for me to never again get close to node.js and friends. I had already heard some awful stories, but this is just insane. Update: This was a rage post, and not really saying anything substantial, but I want to quote here the first comment reply from Christian as I think it is much more interesting:
I see a recurring pattern here: developers create something that they think is going to simplify distributing things by quite a bit, because what other people have been doing is way too complicated[tm]. In the initial iteration it usually turns out to have three significant problems:
  • a nightmare from a security perspective
  • there's no concern for stability, because the people working on it are devs and not people who have certain reliability requirements
  • bundling other software is often considered a good idea, because it makes life so much easier[tm]
Given a couple of years they start to catch up where the rest of the world (e.g. GNU/Linux distributions) is - at least to some extent. And then their solution is actually of a similar complexity compared to what is in use by other people, because they slowly realize that what they're trying to do is actually not that simple and that other people who have done things before were not stupid and the complexity is actually necessary... Just give node.js 4 or 5 more years or so, then you won't be so disappointed. ;-) I've seen this pattern over and over again:
  • the Ruby ecosystem was a catastrophe for quite some time (just ask e.g. the packages of Ruby software in distributions 4 or 5 years ago), and only now would I actually consider using things written in Ruby for anything productive
  • docker and vagrant were initially not designed with security in mind when it came to downloading images - only in the last one or two years have there actually been solutions available that even do the most basic cryptographic verification
  • the entire node.js ecosystem mess you're currently describing
Then the next new fad comes along in the development world and everything starts over again. The only thing that's unclear is what the next hype following this pattern is going to be.
And I also want to quote myself with what I think are some things you could do to improve this situation:
  1. You want to make sure your build is reproducible (in the most basic sense of being re-buildable from scratch), that you are not building upon scraps of code that nobody knows where they came from, or which version they are. If possible, at the package level don't vendor dependencies, depend on the user having the other dependencies pre-installed. Building should be a mainly automatic task. Automation tools then can take care of that (cpan, pip, npm).
  2. By doing this you are playing well with distributions, your software becomes available to people that can not live on the bleeding edge, and need to trust the traceability of the source code, stability, patching of security issues, etc.
  3. If you must download a binary blob, for example what Debian non-free does for Adobe Flashplayer, then for the love of all that is sacred, use TLS and verify checksums!
8 comments

7 November 2015

Mart n Ferrari: Tales from the SRE trenches: Coda

This is the fifth -and last- part in a series of articles about SRE, based on the talk I gave in the Romanian Association for Better Software. Previous articles can be found here: part 1, part 2, part 3, and part 4.

Other lessons learned Apart from these guidelines for SRE I have described in previous posts, there were a few other practices I learned at google that I believe could benefit most operations teams out there (and developers too!).

Automate everything It's worth repeating, it is not sustainable to do things manually. When your manual work grows linearly with your usage, either your service is not growing enough, or you will need to hire more people than there are available in the market.

Involve Ops in systems design Nobody knows better than Ops the challenges of a real production environment, getting early feedback from your operations team can save you a lot of trouble down the road. They might give you insights about scalability, capacity planning, redundancy, and the very real limitations of hardware.

Give Ops time and space to innovate If there is so much emphasis on limiting the amount of operational load put on SRE, that freed time is not only used for creating monitoring dashboards or writing documentation. As I said in the beginning, SRE is an engineering team. And when engineers are not running around putting fires off all day long, they get creative. Many interesting innovations come from SREs even when there is no requirement for them.

Write design documents Design documents are not only useful for writing software. Fleshing out the details of any system on paper is a very useful way to find problems early in the process, communicate with your team exactly what are you building, and convince them why it is a great idea.

Review your code, version your changes This is something that is ubiquitous at Google, and not exclusive to SRE: everything is committed into source control: big systems, small scripts, configuration files. It might not seem worth it at the beginning, but having complete history of all your production environment is invaluable. And with source control, another rule comes into play: no change is committed without first being reviewed by a peer. It can be frustrating having to find a reviewer and wait for the approval for every small change, and some times reviews will be suboptimal, but once you get used to it, the benefits greatly out-weights any annoyances.

Before touching, measure Monitoring is not only for sending alerts, it is an essential tool of SRE. It allows you to gauge your availability, evaluate trends, forecast growth, perform forensic analysis. Also very importantly, it should give you the data needed to take decisions based on reality and not on guesses. Before optimising, find really where the bottleneck is; do not decide which database instance to use until you saw the utilisation of the past few months, etc. And speaking of that... We really need to have a chat about your current monitoring system. But that's for another day.
I hope I did not bore you too much with these walls of text! Some people told me directly they were enjoying these posts, so soon I will be writing more about related topics. In particular I would like to write about modern monitoring and Prometheus. I would love to hear your comments, questions, and criticisms. Comment

4 November 2015

Mart n Ferrari: Tales from the SRE trenches: What can SRE do for you?

This is the fourth part in a series of articles about SRE, based on the talk I gave in the Romanian Association for Better Software. Previous articles can be found here: part 1, part 2, and part 3. As a side note, I am writing this from Heathrow Airport, about to board my plane to EZE. Buenos Aires, here I come!

So, what does SRE actually do? Enough about how to keep the holy wars under control and how to get work away from Ops. Let's talk about some of the things that SRE does. Obviously, SRE runs your service, performing all the traditional SysAdmin duties needed for your cat pictures to reach their consumers: provisioning, configuration, resource allocation, etc. They use specialised tools to monitor the service, and get alerted as soon as a problem is detected. They are also the ones waking up to fix your bugs in the middle of the night. But that is not the whole story: reliability is not only impacted by new launches. Suddenly usage grows, hardware fails, networks lose packets, solar flares flip your bits... When working at scale, things are going to break more often than not. You want these breakages to affect your availability as little as possible, there are three strategies you can apply to this end: minimise the impact of each failure, recover quickly, and avoid repetition. And of course, the best strategy of them all: preventing outages from happening at all.

Minimizing the impact of failures A single failure that takes the whole service down will affect severely your availability, so the first strategy as an SRE is to make sure your service is fragmented and distributed across what is called "failure domains". If the data center catches fire, you are going to have a bad day, but it is a lot better if only a fraction of your users depend on that one data center, while the others keep happily browsing cat pictures. SREs spend a lot of their time planning and deploying systems that span the globe to maximise redundancy while keeping latencies at reasonable levels.

Recovering quickly Many times, retrying after a failure is actually the best option. So another strategy is to automatically restart in milliseconds any piece of your system that fails. This way, less users are affected, while a human has time to investigate and fix the real problem. If a human needs to intervene, it is crucial that they get notified as fast as possible, and that they have quick access to all the information that is needed to solve the problem: detailed documentation of the production environment, meaningful and extensive monitoring, play-books1, etc. After all, at 2 AM you probably don't remember in which country the database shard lives, or what were the series of commands to redirect all the traffic to a different location. Implementing monitoring and automated alerts, writing documentation and play-books, and practising for disaster are other areas where SREs devote much effort.

Avoiding repetition Outages happen, pages ring, problems get fixed, but it should always be a chance for learning and improving. A page should require a human to think about a problem and find a solution. If the same problem keeps appearing, the human is not needed any more: the problem needs to be fixed at the root. Another key aspect of dealing with incidents is to write post-mortems2. Every incident should have one, and these should be tools for learning, not finger-pointing. Post-mortems can be an excellent tool, if people are honest about their mistakes, they are not used to blame other people, and issues are followed up by bug reports and discussions.

Preventing outages Of course nobody can prevent hard drives from failing, but there are certain classes of outages that can be forecasted with careful analysis. Usage spikes can bring a service down, but an SRE team will ensure that the systems are load-tested at higher-than-normal rates. They could also be prepared to quickly scale the service, provided the monitoring system will alert as soon as a certain threshold is reached. Monitoring is actually the key part in this: measuring relevant metrics for all the components of a system, and following their trends over time, SREs can completely avoid many outages: latencies growing out of acceptable bounds, disks filling up, progressive degradation of components, are all examples of typical problems automatically monitored (and alerted on) in an SRE team.
This is getting pretty long, so the next post will be the last one of these series, with some extra tips from SRE experience that I am sure can be applied in many places.

  1. A play-book is a document containing critical information needed when dealing with a specific alert: possible causes, hints for troubleshooting, links to more documentation. Some monitoring systems will automatically add a play-book URL to every alert sent, and you should be doing it too.
  2. Similarly to their real-life counterparts, a post-mortem is the process of examining a dead body (the outage), gutting it out and trying to understand what caused its demise.
Comment

29 October 2015

Mart n Ferrari: Tales from the SRE trenches: SREs are not firefighters

This is the third part in a series of articles about SRE, based on the talk I gave in the Romanian Association for Better Software. If you haven't already, you might want to read part 1 and part 2 first. In this post, I talk about some strategies to avoid drowning SRE people in operational work.

SREs are not firefighters As I said before, it is very important that there is trust and respect between Dev and Ops. If Ops is seen as just an army of firefighters that will gladly put away fires at any time of the night, there are less incentives to make good software. Conversely, if Dev is shielded from the realities of the production environment, Ops will tend to think of them as delusional and untrustworthy.

Common staffing pool for SRE and SWE The first tool to combat this at Google -and possibly a difficult one to implement in small organisations- is to have single headcount budgets for Dev and Ops. That means that the more SREs you need to support your service, the less developers you have to write it. Combined with this, Google offers the possibility to SREs to move freely between teams, or even to transfer out to SWE. Because of this, a service that is painful to support will see the most senior SREs leaving and will only be able to afford less experienced hires. All this is a strong incentive to write good quality software, to work closely and to listen to the Ops people.

Share 5% of operational work with the SWE team On the other hand, it is necessary that developers see first hand how the service works in production, understand the problems, and share the pain of things failing. To this end, SWEs are expected to take on a small fraction of the operational work from SRE: handling tickets, being on-call, performing deployments, or managing capacity. This results in better communication among the teams and a common understanding of priorities.

Cap operational load at 50% One very uncommon rule from SRE at Google, is that SREs are not supposed to spend more than half of their time on "operational work". SREs are supposed to be spending their time on automation, monitoring, forecasting growth... Not on repeatedly fixing manually issues that stem from bad systems.

Excess operational work overflows to SWE If an SRE team is found to be spending too much time on operational work, that extra load is automatically reassigned to the development team, so SRE can keep doing their non-operational duties. On extreme cases, a service might be deemed too unstable to maintain, and SRE support is completely removed: it means the development team now has to carry pagers and do all the operational work themselves. It is a nuclear option, but the threat of it happening is a strong incentive to keep things sane.
The next post will be less about how to avoid unwanted work and more about the things that SRE actually do, and how these make things better. Comment

27 October 2015

Mart n Ferrari: Tales from the SRE trenches: Dev vs Ops

This is the second part in a series of articles about SRE, based on the talk I gave in the Romanian Association for Better Software. On the first part, I introduced briefly what is SRE. Today, I present some concrete ways in which SRE tried to make things better, by stopping the war between developers and SysAdmins.

Dev vs Ops: the eternal battle So, it starts at looking at the problem: how to increase the reliability of the service? It turns out that some of the biggest sources of outages are new launches: a new feature that seemed innocuous somehow managed to bring the whole site down. Devs want to launch, and Ops want to have a quiet weekend, and this is were the struggle begins. When launches are problematic, bureaucracy is put in place to minimise the risks: launch reviews, checklists, long-lived canaries. This is followed by development teams finding ways of side-stepping those hurdles. Nobody is happy. One of the key aspects of SRE is to avoid this conflict completely, by changing the incentives, so these pressures between development and operations disappear. At Google, they achieve this with a few different strategies:

Have an SLA for your service Before any service can be supported by SRE, it has to be determined what is the level of availability that it must achieve to make the users and the company happy: this is called the Service Level Agreement (SLA). The SLA will define how availability is measured (for example, percentage of queries handled successfully in less than 50ms during the last quarter), and what is the minimum acceptable value for it (the Service Level Objective, or SLO). Note that this is a product decision, not a technical one. This number is very important for an SRE team and its relationship with the developers. It is not taken lightly, and it should be measured and enforced rigorously (more on that later). Only a few things on earth really require 100% availability (pacemakers, for example), and achieving really high availability is very costly. Most of us are dealing with more mundane affairs, and in the case of websites, there are many other things that fail pretty often: network glitches, OS freezes, browsers being slow, etc. So an SLO of 100% is almost never a good idea, and in most cases it is impossible to reach. In places like Google an SLO of "five nines" (99.999%) is not uncommon, and this means that the service can't fail completely for more than 5 minutes across a whole year!

Measure and report performance against SLA/SLO Once you have a defined SLA and SLO, it is very important that these are monitored accurately and reported constantly. If you wait for the end of the quarter to produce a hand-made report, the SLA is useless, as you only know you broke it when it is too late. You need automated and accurate monitoring of your service level, and this means that the SLA has to be concrete and actionable. Fuzzy requirements that can't be measured are just a waste of time. This is a very important tool for SRE, as it allows to see the progression of the service over time, detect capacity issues before they become outages, and at the same time show how much downtime can be taken without breaking the SLA. Which brings us to one core aspect of SRE:

Use error budgets and gate launches on them If SLO is the minimum rate of availability, then the result of calculating 1 - SLO is what fraction of the time a service can fail without failing out of the SLA. This is called an error budget, and you get to use it the way you want it. If the service is flaky (e.g. it fails consistently 1 of every 10000 requests), most of that budget is just wasted and you won't have any margin for launching riskier changes. On the other hand, a stable service that does not eat the budget away gives you the chance to bet part of it on releasing more often, and getting your new features quicker to the user. The moment the error budget is spent, no more launches are allowed until the average goes back out of the red. Once everyone can see how the service is performing against this agreed contract, many of the traditional sources of conflict between development and operations just disappear. If the service is working as intended, then SRE does not need to interfere on new feature launches: SRE trusts the developers' judgement. Instead of stopping a launch because it seems risky or under-tested, there are hard numbers that take the decisions for you. Traditionally, Devs get frustrated when they want to release, but Ops won't accept it. Ops thinks there will be problems, but it is difficult to back this feeling with hard data. This fuels resentment and distrust, and management is never pleased. Using error budgets based on already established SLAs means there is nobody to get upset at: SRE does not need to play bad cop, and SWE is free to innovate as much as they want, as long as things don't break. At the same time, this provides a strong incentive for developers to avoid risking their budget in poorly-prepared launches, to perform staged deployments, and to make sure the error budget is not wasted by recurrent issues.
That's all for today. The next article will continue delving into how traditional tensions between Devs and Ops are played in the SRE world. Comment

24 October 2015

Mart n Ferrari: Tales from the SRE trenches - Part 1

A few weeks ago, I was offered the opportunity to give a guest talk in the Romanian Association for Better Software. RABS is a group of people interested in improving the trade, and regularly hold events where invited speakers give presentations on a wide array of topics. The speakers are usually pretty high-profile, so this is quite a responsibility! To make things more interesting, much of the target audience works on enterprise software, under Windows platforms. Definitely outside my comfort zone! Considering all this, we decided the topic was going to be about Site Reliability Engineering (SRE), concentrating on some aspects of it which I believe could be useful independently of the kind of company you are working for. I finally gave the talk last Monday, and the audience seemed to enjoy it, so I am going to post here my notes, hopefully some other people will like it too.

Why should I care? I prepared this thinking of an audience of software engineers, so why would anyone want to hear about this idea that only seems to be about making the life of the operations people better? The thing is, having your work as a development team supported by an SRE team will also benefit you. This is not about empowering Ops to hit you harder when things blow apart, but to have a team that is your partner. A partner that will help you grow, handle the complexities of a production environment so you can concentrate on cool features, and that will get out of the way when things are running fine. A development team may seem to only care about adding features that will drive more and more users to your service. But an unreliable service is a service that loses users, so you should care about reliability. And what better to have a team has Reliability on their name?

What is SRE? SRE means Site Reliability Engineering, Reliability Engineering applied to "sites". Wikipedia defines Reliability Engineering as:
[..] engineering that emphasizes dependability in the life-cycle management of a product.
This is, historically, a branch of engineering that made possible to build devices that will work as expected even when their components were inherently unreliable. It focused on improving component reliability, establishing minimum requirements and expectations, and a heavy usage of statistics to predict failures and understand underlying problems. SRE started as a concept at Google about 12 years ago, when Ben Treynor joined the company and created the SRE team from a group of 7 production engineers. There is no good definition of what Site Reliability Engineering means; while the term and some of its ideas are clearly inspired in the more traditional RE, he defines SRE with these words1:
Fundamentally, it's what happens when you ask a software engineer to design an operations function.

Only hire coders After reading that quote it is not surprising that the first item in the SRE checklist2, is to only hire people who can code properly for SRE roles. Writing software is a key part of being SRE. But this does not mean that there is no separation between development and operations, nor that SRE is a fancy(er) name for DevOps3. It means treating operations as a software engineering problem, using software to solve problems that used to be solved by hand, implementing rigorous testing and code reviewing, and taking decisions based on data, not just hunches. It also implies that SREs can understand the product they are supporting, and that there is a common ground and respect between SREs and software engineers (SWEs). There are many things that make SRE what it is, some of these only make sense within a special kind of company like Google: many different development and operations teams, service growth that can't be matched by hiring, and more importantly, firm commitment from top management to implement these drastic rules. Therefore, my focus here is not to preach on how everybody should adopt SRE, but to extract some of the most useful ideas that can be applied in a wider array of situations. Nevertheless, I will first try to give an overview of how SRE works at Google.
That's it for today. In the next post I will talk about how to end the war between developers and SysAdmins. Stay tuned!

  1. http://www.site-reliability-engineering.info/2014/04/what-is-site-reliability-engineering.html
  2. SRE checklist extracted from Treynor's talk at SREcon14: https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre
  3. By the way, I am still not sure what DevOps mean, it seems that everyone has a different definition for it.
Comment

14 October 2015

Norbert Preining: On Lars Wirzenius, Mart n Ferrari, and Debian

Now that the tsunami that followed my recent blog post has passed, I can read through all the nice comments and blogs of other people. I didn t expect that just writing up some links to posts on LKML, plus adding my personal opinion on what has happened and about my interpretation concerning Debian could gather so much attention. Anyway, let us look at the blog posts written in the aftermath. debian-newspeak-coc In particular do I want to answer to two blogs: Lars Wirzenius blog and Mart n Ferrari s blog, both of which are excellent examples of insults. I also want to mention that although the previous two are bad examples, there are fortunately also those who are able to resist primitive insults, in particular Gunnar Wolf (direct link to the blog does not work), and Miriam Ruiz blog, which contradict my blog and propose a different view, without pulling the knife. Lars Wirzenius Let us start with Lars (ah sorry, I have to call him Wirzenius here, strange custom has suddenly popped up in Debian!):
I think Preining mis-characterises what happens on the Linux kernel mailing list, and in Debian, in free software development in general, and what Sarah Sharp has said and done. I continue to think he is wrong about everything on this topic.
I am very surprised that Lars^WWirzenius is in possession of a crystal ball that allows him to see and evaluate my attitude towards all these items, considering that I have presented links to posts on the LKML, and yes, my interpretation of the matter. Mind that the whole text amounts to about 130 lines on my screen, while my personal opinion was stated in two paragraphs of total 9 lines plus some interspersed comments. And in these maybe 20 lines I didn t make general statements about open source, kernel development etc etc. Lars, I know you have closed your blog for comments so I couldn t ask you but please, can you send me one of your crystal balls if you have more of them? At least he has managed to keep a bit of proper writing and disagreed with my statement on Debian (whether there is fun or not in Debian). He is of course free to do that, but please, don t rob me the right to state that I think Debian has changed. All this dispute centers around people not being capable to distinguish two things: One, being against the Code of Conduct due to the inclusion of administrative actions without clear definitions, and Two, being pro offensive behavior and and insults. Now, dear Lars^WWirzenius, please listen: I never advocated abusive behavior or insults, nor do I defend it. (Did you hear that!) I simply opposed the Code of Conduct as ruling instrument. And what kind of emails I got due to my opposition was far outside the Code of Conduct you are so strongly defending. So please, stay at the facts, and stop insulting me. Thanks. Mart n Ferrari Concerning Mart n I don t have much to say but please, stop spreading lies. You stated:
Once again, he s complaining about how the fun from Debian has been lost because making sexist jokes, or treating other people like shit is not allowed any more.
Could you please come up with a reference to this? Or are you just interpreting? I am very disappointed about this level of discussion between Debian Developers. You not even cared to answer my comment on your blog. Should I say something clear here you should be happy that this has not been written on a Debian mailing list, otherwise I hope the Code of Conduct hammer would hit you.
So yes, it seems that at least these two blogs underline exactly what my opinion is: Communication culture in Debian has changed to be more company-like. Probably due to the ever increasing amount of developers that are paid for their work on Debian (Ubuntu), the pressure to follow US company codes has taken a firm grip, so firm that even stating my diverging opinion is already enough to get branded. Good Future!

8 October 2015

Gunnar Wolf: On evolving communities and changing social practices

I will join Lars and Tincho in stating this, and presenting a version contrary to what Norbert portraits. I am very glad and very proud that the community I am most involved in, the Debian project, has kept its core identity over the years, at least for the slightly-over-a-decade I have been involved in it. And I am very glad and very proud that being less aggressive, more welcoming and in general more respectful to each other does not counter this. When I joined Debian, part of the mantra chants we had is that in order to join a Free Software project you had to grow a thick skin, as sooner or later we'd all be exposed to flamefests. But, yes, the median age of the DD was way lower back then I don't have the data at hand, but IIRC I have always been close to our median. Which means we are all growing old and grumpy. But old and wiser. A very successful, important and dear subproject to many of us is the Debian Women Project. Its original aim was, as the name shows, to try to reduce the imbalance between men and women participants in Debian IIRC back in 2004 we had 3 female DDs, and >950 male DDs. Soon, the project started morphing into pushing all of Debian to be less hostile, more open to contributions from any- and everyone (as today our diversity statement reads). And yes, we are still a long, long, long way from reaching equality. But we have done great steps. And not just WRT women, but all of the different minorities, as well as to diverging opinions within our community. Many people don't enjoy us abiding by a code of conduct; I also find it irritating sometimes to have to abide by certain codes if we mostly know each other and know we won't be offended by a given comment... Or will we? So, being more open and more welcoming also means being more civil. I cannot get myself to agree with Linus' quote, when he says that respect is not just given to everybody but must be earned. We should always start, and I enjoy feeling that in Debian this is becoming the norm, by granting respect to everybody And not losing it, even if things get out of hand. Thick skins are not good for communication.

Mart n Ferrari: Pining for the good ol' days

Today, Norbert Preining wrote a blog post syndicated in planet talking about Sarah Sharp quitting Linux development, and making a parallel with the Debian community. Once again, he's complaining about how the fun from Debian has been lost because making sexist jokes, or treating other people like shit is not allowed any more. He seems to think the LKML is the ideal environment and that Debian should be more like it. He also celebrates that Ms Sharp12 left Linux, because evidently the freedom to be a jerk to each other is more important than the contributions that she -or any other person that finds that atmosphere toxic, like mjg- has produced or could have produced in the future.
I would like to tell readers of Planet Debian that this view is not the norm in Debian any more. We are working hard to make the project more inclusive, fun, and welcoming to everybody. Sadly, some people are still pining for the good ol' days, and won't give up so easily on their privilege.

  1. Whom he calls "one more SJW", I guess then that after all, this is just about ethics in FOSS development.
  2. At this point, using the term SJW in a discussion should equate to a Godwin, and mean that you lost the argument.
6 comments

Norbert Preining: Looking at the facts: Sarah Sharp s crusade

Much has been written around the internet about this geeky kernel maintainer Sarah Sharp who left kernel development. I have now spent two hours reading through lkml posts, and want to summarize a few mails from the long thread, since most of the usual news sites just rewrap the original blog of hers without adding any background. darth-cookie The whole thread evolved out call for stable kernel review by Greg Kroah-Hartman where he complained about too many patches that are not actually in rc1 before going into stable:
<rant>
  I'm sitting on top of over 170 more patches that have been marked for
  the stable releases right now that are not included in this set of
  releases.  The fact that there are this many patches for stable stuff
  that are waiting to be merged through the main -rc1 merge window cycle
  is worrying to me.
  [...]
from where it developed into a typical Linus rant on people flagging crap for stable, followed by some jokes:
On Fri, Jul 12, 2013 at 8:47 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> I tend to hold things off after -rc4 because you scare me more than Greg
> does ;-)
 
Have you guys *seen* Greg? The guy is a freakish giant. He *should*
scare you. He might squish you without ever even noticing.
 
              Linus
and Ingo Molnar giving advice to Greg KH:
So Greg, if you want it all to change, create some _real_ threat: be frank 
with contributors and sometimes swear a bit. That will cut your mailqueue 
in half, promise!
with Greg KH taking a funny position in answering:
Ok, I'll channel my "inner Linus" and take a cue from my kids and start
swearing more.
Up to now a pretty decent and normal thread with some jokes and poking, nobody minded, and reading through it I had a good time. The thread continues with a discussion on requirements what to submit to stable, and some side threads on particular commits. And then, out of the blue, Social Justice Warrior (SJW) Sarah Sharp pops in with a very important contribution:
Seriously, guys?  Is this what we need in order to get improve -stable?
Linus Torvalds is advocating for physical intimidation and violence.
Ingo Molnar and Linus are advocating for verbal abuse.
 
Not *fucking* cool.  Violence, whether it be physical intimidation,
verbal threats or verbal abuse is not acceptable.  Keep it professional
on the mailing lists.
 
Let's discuss this at Kernel Summit where we can at least yell at each
other in person.  Yeah, just try yelling at me about this.  I'll roar
right back, louder, for all the people who lose their voice when they
get yelled at by top maintainers.  I won't be the nice girl anymore.
Onto which Linus answers in a great way:
That's the spirit.
 
Greg has taught you well. You have controlled your fear. Now, release
your anger. Only your hatred can destroy me.
 
Come to the dark side, Sarah. We have cookies.
On goes Sarah, gearing up in her SJW mode and starting to rant:
However, I am serious about this.  Linus, you're one of the worst
offenders when it comes to verbally abusing people and publicly tearing
their emotions apart.
 
http://marc.info/?l=linux-kernel&m=135628421403144&w=2
http://marc.info/?l=linux-acpi&m=136157944603147&w=2
 
I'm not going to put up with that shit any more.
Linus himself made clear what he thinks of her:
Trust me, there's a really easy way for me to curse at people: if you
are a maintainer, and you make excuses for your bugs rather than
trying to fix them, I will curse at *YOU*.
 
Because then the problem really is you.
[...]
It is easy to verify what Linus said, by reading the above two links and the answers of the maintainers, both agreed that it was their failure and were sorry. (Mauro s answer, Rafael s answer) It is just the geeky SJW that was not even attacked (who would dare to attack a woman nowadays?). The overall reaction to her by the maintainers can be exemplified by Thomas Gleixner s post:
Just for the record. I got grilled by Linus several times over the
last years and I can't remember a single instance where it was
unjustified.
What follows is a nearly endless discussion with Sarah meandering around, changing permanently her opinion what is acceptable. Linus tried to explain to her in simple words, without success, she continues to rant around. Here arguments are so weak I had nothing but good laugh:
> Sarah, that's a pretty potent argument by Linus, that "acting 
> professionally" risks replacing a raw but honest culture with a
> polished but dishonest culture - which is harmful to developing
> good technology.
> 
> That's a valid concern. What's your reply to that argument?
 
I don't feel the need to comment, because I feel it's a straw man
argument.  I feel that way because I disagree with the definition of
professionalism that people have been pushing.
 
To me, being "professional" means treating each other with respect.  I
can show emotion, express displeasure, be direct, and still show respect
for my fellow developers.
 
For example, I find the following statement to be both direct and
respectful, because it's criticizing code, not the person:
 
"This code is SHIT!  It adds new warnings and it's marked for stable
when it's clearly *crap code* that's not a bug fix.  I'm going to revert
this merge, and I expect a fix from you IMMEDIATELY."
 
The following statement is not respectful, because it targets the
person:
 
"Seriously, Maintainer.  Why are you pushing this kind of *crap* code to
me again?  Why the hell did you mark it for stable when it's clearly
not a bug fix?  Did you even try to f*cking compile this?"
Fortunately, she was immediately corrected and Ingo Molnar wrote an excellent refutation (starting another funny thread) of all her emails, statements, accusations (all of the email is a good read):
_That_ is why it might look to you as if the person was
attacked, because indeed the actions of the top level maintainer were
wrong and are criticised.
 
... and now you want to 'shut down' the discussion. With all due respect,
you started it, you have put out various heavy accusations here and elsewhere,
so you might as well take responsibility for it and let the discussion be
brought to a conclusion, wherever that may take us, compared to your initial view?
(He retracted that last statement, though I don t see a reason for it) Last but not least, let us return to her blog post, where she states herself that:
FYI, comments will be moderated by someone other than me. As this is my blog, not a
government entity, I have the right to replace any comment I feel like with 
 fart fart fart fart .
and she made lots of use of it, I counted at least 10 instances. She seems to remove or fart fart fart any comment that is not in line with her opinion. Further evidence is provided by this post on lkml. Everyone is free to have his own opinion (sorry, his/her), and I am free to form my own opinion on Sarah Sharp by just simply reading the facts. I am more than happy that one more SJW has left Linux development, as the proliferation of cleaning of speech from any personality has taken too far a grip. Coming to my home-base in Debian, unfortunately there is no one in the position and the state of mind of Linus, so we are suffering the same stupidities imposed by social justice worriers and some brainless feminists (no, don t get me wrong, these are two independent attributes. I do NOT state that feminism is brainless) that Linus and the maintainer crew was able to fend of this time. I finish with my favorite post from that thread, by Steven Rosted (from whom I also stole the above image!):
On Tue, 2013-07-16 at 18:37 -0700, Linus Torvalds wrote:
> Emotions aren't bad. Quite the reverse. 
 
Spock and Dr. Sheldon Cooper strongly disagree.
Post Scriptum (after a bike ride) The last point by Linus is what I criticize most on Debian nowdays, it has become a sterilized over-governed entity, where most fun is gone. Making fun is practically forbidden, since there is the slight chance that some minority somewhere on this planet might feel hurt, and by this we are breaking the CoC. Emotions are restricted to the Happy happy how nice we are and how good we are level of US and also Japanese self-reenforcement society. Post Post Scriptum I just read Sarah Sharp s post on What makes a good community? , and without giving a full account or review, I am just pi**ed by the usage of the word microaggressions I can only recommend everyone to read this article and this article to get a feeling how bs the idea of microaggressions has taken over academia and obviously not only academia. Post3 Scriptum I am happy to see Lars Wirzenius, Gunnar Wolf, and Mart n Ferrari opposing my view. I agree with them that my comments concerning Debian are not mainstream in Debian something that is not very surprising, though, and I think it is great that they have fun in Debian, like many other contributors. Post4 Scriptum Although nobody will read this, here is a great response from a female developer:
[...] To Linus: You're a hero to many of us. Don't change. Please. You DO
NOT need to take time away from doing code to grow a pair of breasts
and judge people's emotional states: [...]
Nothing to add here!

31 August 2015

Mart&iacute;n Ferrari: Romania

It's been over 2 years since I decided to start a new, nomadic life. I had the idea of blogging about this experience as it happened, but not only I am incredibly lazy when it comes to writing, most of the time I have been too busy just enjoying this lifestyle! The TL;DR version of these last 2 years: And now, I am in Cluj-Napoca, Romania.
View from my window
View from my window
2 comments

Mart&iacute;n Ferrari: IkiWiki

I haven't posted in a very long time. Not only because I suck at this, but also because IkiWiki decided to stop working with OpenID, so I can't use the web interface any more to post.. Very annoying. Already spent a good deal of time trying to find a solution, without any success.. I really don't want to migrate to another software again, but this is becoming a showstopper for me. 7 comments

Next.