Search Results: "francois"

21 February 2012

Francois Marier: Putting a limit on Apache and PHP memory usage

A little while ago, we ran into memory problems on mahara.org. It turned out to be due to the GD library having issues with large (as in height and width, not file size) images.

What we discovered is that the PHP memory limit (which is set to a fairly low value) only applies to actual PHP code, not to C libraries like GD that are called from PHP. It's not obvious which PHP functions are implemented as external C calls and therefore fall outside the control of the interpreter, but anything that sounds like it's wrapping some other library is probably not implemented in PHP and is worth looking at.

To put a cap on the memory usage of Apache, we set process limits for the main Apache process and all of its children using ulimit.

Unfortunately, the limit we really wanted to change (resident memory or "-m") isn't implemented in the Linux kernel. So what we settled on was to limit the total virtual memory that an Apache process (or sub-process) can consume using "ulimit -v".

On a Debian box, this can be done by adding this to the bottom of /etc/default/apache2:
ulimit -v 1048576
for a limit of 1GB of virtual memory.

You can ensure that it works by setting it first to a very low value and then loading one of your PHP pages and seeing it die with some kind of malloc error.
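For example, here is a rough sketch of such a test (the 64 MB value and the URL are placeholders; pick a page that does some real work):
# in /etc/default/apache2, temporarily use a very low cap
ulimit -v 65536

# then restart Apache and request a PHP page
/etc/init.d/apache2 restart
curl -i http://localhost/some-php-page.php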

I'm curious to know what other people do to prevent runaway Apache processes.

14 January 2012

Francois Marier: Debugging OpenWRT routers by shipping logs to a remote syslog server

Trying to debug problems with consumer-grade routers is notoriously difficult due to a lack of decent debugging information. It's quite hard to know what's going on without at least a few good error messages.

Here is how I made my OpenWRT-based Gargoyle router send its log messages to a network server running rsyslog.

Server Configuration

Given that the router (192.168.1.1) will be sending its log messages on UDP port 514, I started by opening that port in my firewall:
iptables -A INPUT -s 192.168.1.1 -p udp --dport 514 -j ACCEPT
Then I enabled the UDP module for rsyslog and redirected messages to a separate log file (so that it doesn't fill up /var/log/syslog) by putting the following (a modified version of these instructions) in /etc/rsyslog.d/10-gargoyle-router.conf:
$ModLoad imudp
$UDPServerRun 514
:fromhost-ip, isequal, "192.168.1.1" /var/log/gargoyle-router.log
& ~
The name of the file is important because this configuration snippet needs to be loaded before the directive which writes to /var/log/syslog for the discard statement (the "& ~" line) to work correctly.
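This works because rsyslog reads the files in /etc/rsyslog.d/ in lexical order, so the 10- prefix sorts before the default rules file (for example, 50-default.conf on systems that ship one):
$ ls /etc/rsyslog.d/
10-gargoyle-router.conf  50-default.conf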

Router Configuration

Finally, I followed the instructions on the Gargoyle wiki to get the router to forward its log messages to my server (192.168.1.2).

After logging into the router via ssh, I ran the following commands:
uci set system.@system[0].log_ip=192.168.1.2
uci set system.@system[0].conloglevel=7
uci commit
before rebooting the router.


Now whenever I have to troubleshoot network problems, I can keep a terminal open on my server and get some visibility on what the router is doing:
tail -f /var/log/gargoyle-router.log

22 December 2011

Dirk Eddelbuettel: Rcpp 0.9.8

A new release 0.9.8 of Rcpp is now on CRAN and will also get into Debian shortly (once I finish building R 2.14.1). This release contains a few incremental changes. Romain, sponsored by the Open Source Programs Office at Google, had released a new package int64 bringing larger integers to R, and this is now supported by Rcpp as well. John Chambers contributed some code to have Reference Classes extend existing C++ classes (typically brought in via Rcpp Modules). Jelmer Ypma sent us a patch to add an Rcout device not unlike cout, but aligned with R's io buffering. We added some more unit tests, and made a few small fixes here or there. The complete NEWS entry is below; more details are in the ChangeLog file in the package and on the Rcpp Changelog page.
0.9.8   2011-12-21
    o   wrap now handles 64 bit integers (int64_t, uint64_t) and containers 
        of them, and Rcpp now depends on the int64 package (also on CRAN).
        This work has been sponsored by the Google Open Source Programs
        Office.
    o   Added setRcppClass() function to create extended reference classes 
        with an interface to a C++ class (typically via Rcpp Module) which
        can have R-based fields and methods in addition to those from the C++.
    o   Applied patch by Jelmer Ypma which adds an output stream class
        'Rcout' not unlike std::cout, but implemented via Rprintf to
        cooperate with R and its output buffering.
        
    o   New unit tests for pf(), pnf(), pchisq(), pnchisq() and pcauchy()
    o   XPtr constructor now checks for corresponding type in SEXP
    o   Updated vignettes for use with updated highlight package
    o   Update linking command for older fastLm() example using external 
        Armadillo
Thanks to CRANberries, you can also look at a diff to the previous release 0.9.7. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page.

15 December 2011

Francois Marier: Installing Etherpad on Debian/Ubuntu

Etherpad is an excellent Open Source web application for collaborative text editing. Like Google Docs, it allows you to share documents with others through a secret URL or to set up private documents for which people need a login.

It's a little tricky to install so here's how I did it.

Build a Debian package

Because the official repository is not kept up to date, you must build the package yourself:
  1. Grab the master branch from the official git repository:
    git clone git://github.com/ether/pad.git etherpad
  2. Build the package:
    dpkg-buildpackage -us -uc

Now, install some of its dependencies:
apt-get install --no-install-recommends dbconfig-common python-uno mysql-server

before installing the .deb you built:
dpkg -i etherpad_1.1.deb
apt-get install --no-install-recommends -f

Application configuration

You will likely need to change a few minor things in the default configuration at /etc/etherpad/etherpad.local.properties:
useHttpsUrls = true
customBrandingName = ExamplePad
customEmailAddress = etherpad@example.com
topdomains = etherpad.example.com,your.external.ip.address,127.0.0.1,localhost,localhost.localdomain

Nginx configuration

If you use Nginx as your web server of choice, create a vhost file in /etc/nginx/sites-available/etherpad:
server {
    listen 443;
    server_name etherpad.example.com *.etherpad.example.com;
    add_header Strict-Transport-Security max-age=15768000;

    ssl on;
    ssl_certificate /etc/ssl/certs/etherpad.example.com.crt;
    ssl_certificate_key /etc/ssl/certs/etherpad.example.com.pem;

    ssl_session_timeout 5m;
    ssl_session_cache shared:SSL:1m;

    ssl_protocols TLSv1;
    ssl_ciphers RC4-SHA:HIGH:!kEDH;
    ssl_prefer_server_ciphers on;

    access_log /var/log/nginx/etherpad.access.log;
    error_log /var/log/nginx/etherpad.error.log;

    location / {
        proxy_pass http://localhost:9000/;
        proxy_set_header Host $host;
    }
}

and then enable it and restart Nginx:
ln -s /etc/nginx/sites-available/etherpad /etc/nginx/sites-enabled/etherpad
/etc/init.d/nginx restart

Apache configuration

If you prefer to use Apache instead, make sure that the required modules are enabled:
a2enmod proxy
a2enmod proxy_http
a2enmod ssl
a2enmod headers

and then create a vhost file in /etc/apache2/sites-available/etherpad:
<VirtualHost *:443>
ServerName etherpad.example.com
ServerAlias *.etherpad.example.com

SSLEngine on
SSLCertificateFile /etc/apache2/ssl/etherpad.example.com.crt
SSLCertificateKeyFile /etc/apache2/ssl/etherpad.example.com.pem
SSLCertificateChainFile /etc/apache2/ssl/etherpad.example.com-chain.pem

SSLProtocol TLSv1
SSLHonorCipherOrder On
SSLCipherSuite RC4-SHA:HIGH:!kEDH
Header add Strict-Transport-Security: "max-age=15768000"

<Proxy *>
Order deny,allow
Allow from all
</Proxy>

Alias /sitemap.xml /ep/tag/\?format=sitemap
Alias /static /usr/share/etherpad/etherpad/src/static

ProxyPreserveHost On
SetEnv proxy-sendchunked 1
ProxyRequests Off
ProxyPass / http://localhost:9000/
ProxyPassReverse / http://localhost:9000/
</VirtualHost>

before enabling that new vhost and restarting Apache:
a2ensite etherpad
apache2ctl configtest
apache2ctl graceful

DNS setup

The final step is to create these two DNS entries to point to your web server:
  • *.etherpad.example.com
  • etherpad.example.com

Also, as a precaution against an OpenOffice/LibreOffice-related bug, I suggest that you add the following entry to your web server's /etc/hosts file to avoid flooding your DNS resolver with bogus queries:
127.0.0.1 localhost.(none) localhost.(none).fulldomain.example.com
where fulldomain.example.com is the search base defined in /etc/resolv.conf.

Other useful instructions

Here are the most useful pages I used while setting this up:

4 December 2011

Francois Marier: Optimising PNG files

I have written about using lossless optimisation techniques to reduce the size of images before, but I recently learned of a few other tools to further reduce the size of PNG images.

Basic optimisation

While you could use Smush.it to manually optimise your images, if you want a single Open Source tool you can use in your scripts, optipng is the most effective one:
optipng -o9 image.png

Removing unnecessary chunks

While not as effective as optipng in its basic optimisation mode, pngcrush can be used to remove unnecessary chunks from PNG files:
pngcrush -q -rem gAMA -rem alla -rem text image.png image.crushed.png
Depending on the software used to produce the original PNG file, this can yield significant savings so I usually start with this.

Reducing the colour palette

When optimising images uploaded by users, it's not possible to know whether or not the palette size can be reduced without too much quality degradation. On the other hand, if you are optimising your own images, it might be worth trying this lossy optimisation technique.

For example, this image went from 7.2 kB to 5.2 kB after running it through pngnq:
pngnq -f -n 32 -s 3 image.png

Re-compressing final image

Most PNG writers use zlib to compress the final output but it turns out that there are better algorithms to do this.

Using AdvanceCOMP I was able to bring the same image as above from 5.1kB to 4.6kB:
advpng -z -4 image.png
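Putting the lossless steps together, here is a small shell sketch that runs the three tools above, in the order described, over every PNG in the current directory:
#!/bin/sh
# chunk removal first, then optipng, then a final advpng pass
for f in *.png; do
    pngcrush -q -rem gAMA -rem alla -rem text "$f" "$f.crushed" && mv "$f.crushed" "$f"
    optipng -o9 "$f"
    advpng -z -4 "$f"
done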

When the source image is an SVG

Another thing I noticed while optimising PNG files is that rendering a PNG of the right size straight from an SVG file produces a smaller result than exporting a large PNG from that same SVG and then resizing the PNG to smaller sizes.

Here's how you can use Inkscape to generate an 80x80 PNG:
inkscape --without-gui --export-width=80 --export-height=80 --export-png=80.png image.svg

14 November 2011

Francois Marier: Ideal OpenSSL configuration for Apache and nginx

After recently reading a number of SSL/TLS-related articles, I decided to experiment and look for the ideal OpenSSL configuration for Apache (using mod_ssl since I haven't tried mod_gnutls yet) and nginx.

By "ideal" I mean that this configuration needs to be compatible with most user agents likely to interact with my website as well as being fast and secure.

Here is what I came up with for Apache:
SSLProtocol TLSv1
SSLHonorCipherOrder On
SSLCipherSuite RC4-SHA:HIGH:!kEDH
and for nginx:
ssl_protocols  TLSv1;
ssl_ciphers RC4-SHA:HIGH:!kEDH;
ssl_prefer_server_ciphers on;

Cipher and protocol selection

In terms of choosing a cipher to use, this configuration does three things:

Testing tools

The main tool I used while testing various configurations was the SSL Labs online tool. The CipherFox extension for Firefox was also quite useful to quickly identify the selected cipher.

Of course, you'll want to make sure that your configuration works in common browsers, but you should also test with tools like wget, curl and httping. Many of the online monitoring services are based on these.
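For example, a few quick checks along those lines (replace example.com with your own host):
# confirm that a plain TLSv1 handshake succeeds and see the negotiated cipher
openssl s_client -connect example.com:443 -tls1 < /dev/null

# make sure simple HTTP clients still work
wget -O /dev/null https://example.com/
curl -sI https://example.com/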

Other considerations

To increase the performance and security of your connections, you should ensure that the following features are enabled:
Note: If you have different SSL-enabled name-based vhosts on the same IP address (using SNI), make sure that their SSL cipher and protocol settings are identical.

6 November 2011

Dirk Eddelbuettel: Rcpp talk at Seattle RUG next month

The Seattle R User Group was kind enough to invite me to give a talk about R, C++ and Rcpp. So if you can make it to the Thomas building of the Fred Hutchinson Cancer Research Center in Seattle, WA, on December 7, I would love to see you there. I have some ideas about freshening up the presentation(s) based on material Romain and I have used in the past. This should make the why as well as the how a little clearer; now I just have to find some time to put this together. And if there are particular aspects you would like to see covered, please do get in touch with me.

1 November 2011

Francois Marier: Adding X-Content-Security-Policy headers in a Django application

Content Security Policy is a proposed HTTP extension which allows websites to restrict the external content that can be displayed by visiting web browsers. By expressing a set of rules to be enforced by the browser, a website is able to prevent the injection of outside resources by malicious users.

While adding support for the March 2011 draft in Libravatar, I looked at three different approaches.

Controlling the headers in the application

The first approach I considered was to have the Django application output all of the headers, which is what the django-csp module does. Unfortunately, I need to be able to vary the policy between pages (the views in Libravatar have different requirements) and that's one of the things that hasn't been implemented yet in that module.

Producing the same headers by hand is fairly simple:
response = render_to_response('app/view.html')
response['X-Content-Security-Policy'] = "allow 'self'"
return response
but it would mean adding a bit of code to every view and/or writing a custom wrapper for render_to_response().

Setting a default header in Apache

Ideally, I'd like to be able to set a default header in Apache using mod_headers and then override it as needed inside the application.

The first problem with this solution is that it's not possible (as far as I can tell) for a Django application to override a header set by Apache.
The second problem is that mod_headers doesn't have an action that adds/sets a header only if it didn't already exist. It does have append and merge actions which could in theory be used to add extra terms to the policy but it unfortunately uses a different separator (the comma) from the CSP spec (which uses semi-colons).

Always set headers in Apache

While I would have liked to get the second approach working, in the end, I included all of the CSP directives within the main Apache config file:
Header set X-Content-Security-Policy: "allow 'self'; options inline-script; img-src 'self' data:"

<Location /account/confirm_email>
Header set X-Content-Security-Policy: "allow 'self'; options inline-script; img-src *"
</Location>

<Location /tools/check>
Header set X-Content-Security-Policy: "allow 'self'; options inline-script; img-src *"
</Location>
The first Header call sets a default policy which is later overridden based on the path to the Django view that's being used.
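To confirm that each path gets the intended policy, you can inspect the response headers with curl (a quick sketch; the hostname shown is illustrative, the paths are the ones from the config above):
curl -sI https://www.libravatar.org/ | grep X-Content-Security-Policy
curl -sI https://www.libravatar.org/tools/check | grep X-Content-Security-Policy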

Related technologies

If you are interested in Content Security Policy, you may also want to look into Application Boundaries Enforcer (part of the NoScript Firefox extension) for more security rules that can be supplied by the server and enforced client-side.

It's also worth mentioning the excellent Request Policy extension which solves the same problem by letting users whitelist the cross-site requests they want to allow.

22 October 2011

Francois Marier: Reducing the size of Apache 301 and 302 responses

Looking through the Libravatar access logs, I found that most of the traffic we currently serve consists of 302 redirects to Gravatar. Optimising that path is therefore very important.

While Apache allows admins to provide custom error pages for things like 404 or 500, it's not quite that straightforward for 30x return codes.

Standard 301 / 302 responses

By default, Apache (and most web servers out there) returns a fairly large HTML page along with a 30x redirection. Try it for yourself by disabling automatic redirections in Firefox (Preferences → Advanced → General → Accessibility) or by installing the Request Policy add-on.

The 302 responses sent by Libravatar looked like this:
$ curl -i http://cdn.libravatar.org/avatar/12345678901234567890123456789012
HTTP/1.1 302 Found
Date: Wed, 21 Sep 2011 01:51:52 GMT
Server: Apache
Cache-Control: max-age=86400
Location: http://www.gravatar.com/avatar/12345678901234567890123456789012.jpg?r=g&s=80&d=http://cdn.libravatar.org/nobody/80.png
Vary: Accept-Encoding
Content-Length: 310
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.gravatar.com/avatar/12345678901234567890123456789012.jpg?r=g&s=80&d=http://cdn.libravatar.org/nobody/80.png">here</a>.</p>
</body></html>
As you can see, the body of the response is just as large as the headers and isn't really necessary.

Body-less 301 responses

After reading about the ErrorDocument directive, I created an empty file called 302 in the root of the web server and included this directive in my vhost configuration file:
ErrorDocument 302 /302
which made the responses look like this:
$ curl -i http://example.com/redir
HTTP/1.1 302 Found
Date: Wed, 21 Sep 2011 03:39:26 GMT
Server: Apache
Last-Modified: Wed, 21 Sep 2011 03:39:17 GMT
ETag: "8024d-0-4ad6b52201036"
Accept-Ranges: bytes
Content-Length: 0
Content-Type: text/plain

This one does have a completely empty body. However, there's an important problem with this solution: the Location header is missing! There's not much point in reducing the size of the redirect if it no longer works.

Custom 302 response page

The next thing I tried (and ended up settling on) is this:
ErrorDocument 302 " "
which results in a 1-byte response (a single space) in the body:
$ curl -i http://example.com/redir
HTTP/1.1 302 Found
Date: Wed, 21 Sep 2011 03:37:50 GMT
Server: Apache
Location: http://www.example.com
Vary: Accept-Encoding
Content-Length: 1
Content-Type: text/html; charset=iso-8859-1

There is still a little bit of unnecessary information in this response (character set, Vary and Server headers), but it's a major improvement over the original.

If you know of any other ways to reduce this further, please leave a comment!

3 October 2011

Francois Marier: Three Firefox extensions to enhance SSL security

There has been a lot of talk recently questioning the trust authorities that underpin the SSL/TLS world. After a few high-profile incidents, it is clear that there is something wrong with this structure.

While some people have suggested that DNSSEC might solve this problem, here are three Firefox add-ons that can be used today to enhance the security of HTTPS:

Unlike the Convergence approach which completely takes over certificate handling, all three of the above add-ons can be used together.

8 June 2011

Francois Marier: Sample Python application using Libgearman

Gearman is a distributed queue with several language bindings.

While Gearman has a nice Python implementation (python-gearman) of the client and worker, I chose to use the libgearman bindings (python-libgearman) directly since they are already packaged for Debian (as python-gearman.libgearman).

Unfortunately, these bindings are not very well documented, so here's the sample application I wished I had seen before I started.

Using the command-line tools

Before diving into the Python bindings, you should make sure that you can get a quick application working on the command line (using the gearman-tools package).

Here's a very simple worker which returns verbatim the input it receives:
gearman -w -f myfunction cat
and here is the matching client:
gearman -f myfunction 'test'
You can have a look at the status of the queues in the server by connecting to gearmand via telnet (port 4730) and issuing the status command.
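For example, netcat works just as well as telnet for a one-off query; each line of output lists a function name followed by its total, running and available-worker counts, terminated by a lone dot:
$ echo status | nc localhost 4730
myfunction      0       0       1
.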

Using the Python libgearman bindings

Once your gearman setup is working (debugging is easier with the command-line tools), you can roll the gearman connection code into your application.

Here's a simple Python worker which returns what it receives:
#!/usr/bin/python

from gearman import libgearman

def work(job):
    workload = job.get_workload()
    return workload

gm_worker = libgearman.Worker()
gm_worker.add_server('localhost')
gm_worker.add_function('myfunction', work)

while True:
    gm_worker.work()
and a matching client:
#!/usr/bin/python

from gearman import libgearman

gm_client = libgearman.Client()
gm_client.add_server('localhost')

result = gm_client.do('myfunction', 'test')
print result
This should behave in exactly the same way as the command-line examples above.

Returning job errors

If you want to expose to the client errors in the processing done by the worker, modify the worker like this:
#!/usr/bin/python

from gearman import libgearman

def work(job):
    workload = job.get_workload()
    if workload == 'fail':
        job.send_fail()
    return workload

gm_worker = libgearman.Worker()
gm_worker.add_server('localhost')
gm_worker.add_function('myfunction', work)

while True:
    gm_worker.work()
and the client this way:
#!/usr/bin/python

from gearman import libgearman

gm_client = libgearman.Client()
gm_client.add_server('localhost')

result = gm_client.do('myfunction', 'fail')
print result
License

The above source code is released under the following terms:
CC0
To the extent possible under law, Francois Marier has waived all copyright and related or neighboring rights to this sample libgearman Python application. This work is published from: New Zealand.

30 May 2011

Francois Marier: Code reviews with Gerrit and Gitorious

The Mahara project has just moved to mandatory code reviews for every commit that gets applied to core code.

Here is a description of how Gerrit Code Review, the peer-review system used by Android, was retrofitted into our existing git repository on Gitorious.

(If you want to know more about Gerrit, listen to this FLOSS Weekly interview.)

Replacing existing Gitorious committers with a robot

The first thing to do was to log into Gitorious and remove commit rights from everyone in the main repository. Then I created a new maharabot account with a password-less SSH key (stored under /home/gerrit/.ssh/) and made that new account the sole committer.

This is to ensure that nobody pushes to the repository by mistake since all of these changes would be overwritten by Gerrit.
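Creating a password-less key like that is a one-liner (a sketch; adjust the key type and path to whatever your setup uses):
ssh-keygen -t rsa -N '' -f /home/gerrit/.ssh/id_rsa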

Basic Gerrit installation

After going through the installation instructions, I logged into the Gerrit admin interface and created a new "mahara" project.

I picked the "merge if necessary" submit action because "cherry-pick" would disable dependency tracking which is quite a handy feature.

Reverse proxy using Nginx

Since we wanted to offer Gerrit over HTTPS, I decided to run it behind an Nginx proxy. This is the Nginx configuration I ended up with:
server {
    listen 443;
    server_name reviews.mahara.org;
    add_header Strict-Transport-Security max-age=15768000;

    ssl on;
    ssl_certificate /etc/ssl/certs/reviews.mahara.org.crt;
    ssl_certificate_key /etc/ssl/certs/reviews.mahara.org.pem;

    ssl_session_timeout 5m;
    ssl_session_cache shared:SSL:1m;

    ssl_protocols TLSv1;
    ssl_ciphers HIGH:!ADH;
    ssl_prefer_server_ciphers on;

    location / {
        proxy_pass http://127.0.0.1:8081;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
    }
}


Things to note:
Mail setup

To enable Gerrit to email reviewers and committers, I installed Postfix and used "reviews.mahara.org" as the "System mail name".

Then I added the following to /home/gerrit/mahara_reviews/etc/gerrit.config:
[user]
email = "gerrit@reviews.mahara.org"
to fix the From address in outgoing emails.

Init script and cron

Following the installation instructions, I created these symlinks:
ln -s /home/gerrit/mahara_reviews/bin/gerrit.sh /etc/init.d/gerrit
cd /etc/rc2.d && ln -s ../init.d/gerrit S19gerrit
cd /etc/rc3.d && ln -s ../init.d/gerrit S19gerrit
cd /etc/rc4.d && ln -s ../init.d/gerrit S19gerrit
cd /etc/rc5.d && ln -s ../init.d/gerrit S19gerrit
cd /etc/rc0.d && ln -s ../init.d/gerrit K21gerrit
cd /etc/rc1.d && ln -s ../init.d/gerrit K21gerrit
cd /etc/rc6.d && ln -s ../init.d/gerrit K21gerrit
and put the following settings into /etc/default/gerritcodereview:
GERRIT_SITE=/home/gerrit/mahara_reviews
GERRIT_USER=gerrit
GERRIT_WAR=/home/gerrit/gerrit.war
to automatically start and stop Gerrit.

I also added a cron job in /etc/cron.d/gitcleanup to ensure that the built-in git repository doesn't get bloated:
MAILTO=admin@example.com
20 4 * * * gerrit GIT_DIR=/home/gerrit/mahara_reviews/git/mahara.git git gc --quiet

Configuration enhancements

To allow images in change requests to be displayed inside the browser, I marked them as safe in /home/gerrit/mahara_reviews/etc/gerrit.config:
[mimetype "image/*"]
safe = true

Another thing I did to enhance the review experience was to enable the gitweb repository browser:
apt-get install gitweb

and to make checkouts faster by enabling anonymous Git access:
[gerrit]
canonicalGitUrl = git://reviews.mahara.org/git/
[download]
scheme = ssh
scheme = anon_http
scheme = anon_git

which requires that you have a git daemon running and listening on port 9418:
apt-get install git-daemon-run
ln -s /home/gerrit/mahara_reviews/git/mahara.git /var/cache/git/
touch /home/gerrit/mahara_reviews/git/mahara.git/git-daemon-export-ok

Finally, I included the Mahara branding in the header and footer of each page by providing valid XHTML fragments in /home/gerrit/mahara_reviews/etc/GerritSiteHeader.html and GerritSiteFooter.html.

Initial import and replication

Once Gerrit was fully working, I performed the initial code import by using my administrator account to push the existing Gitorious branches to the internal git repository:
git remote add gerrit ssh://username@reviews.mahara.org:29418/mahara
git push gerrit 1.2_STABLE
git push gerrit 1.3_STABLE
git push gerrit master
Note that I had to temporarily disable "Require Change IDs" in the project settings in order to import the old commits which didn't have these.

To replicate the internal Gerrit repository back to Gitorious, I created a new /home/gerrit/mahara_reviews/etc/replication.config file:
[remote "gitorious"]
url = gitorious.org:mahara/$ name .git
push = +refs/heads/*:refs/heads/*
push = +refs/tags/*:refs/tags/*
(The $ name variable is required even when you have a single project.)

Contributor instructions

This is how developers can get a working checkout of our code now:
git clone git://gitorious.org/mahara/mahara.git
cd mahara
git remote add gerrit ssh://username@reviews.mahara.org:29418/mahara
git fetch gerrit
scp -p -P 29418 reviews.mahara.org:hooks/commit-msg .git/hooks/
and this is how they can submit local changes to Gerrit:
git push gerrit HEAD:refs/for/master

Anybody can submit change requests or comment on them but make sure you do not have the Cookie Pie Firefox extension installed or you will be unable to log into Gerrit.

6 April 2011

Dirk Eddelbuettel: Rcpp workshop / master class on April 28 in Chicago

I realized I never announced this on the blog, so without further ado....

Rcpp Workshop in Chicago on April 28, 2011

This year's R/Finance conference will be preceded by a full-day masterclass on Rcpp and related topics which will be held on Thursday, April 28, 2011, on the University of Illinois at Chicago campus.

Join Dirk Eddelbuettel and Romain Francois for six hours of detailed and hands-on instructions and discussions around Rcpp, inline, RInside, RcppArmadillo and other packages---in an intimate small-group setting. The full-day format allows us to combine a morning introductory session with a more advanced afternoon session while leaving room for sufficient breaks. There will be about six hours of instructions, a one-hour lunch break and two half-hour coffee breaks.

Morning session: "A hands-on introduction to R and C++"

The morning session will provide a practical introduction to the Rcpp package (and other related packages). The focus will be on simple and straightforward applications of Rcpp in order to extend R and/or to significantly accelerate the execution of simple functions. The tutorial will cover the inline package which permits embedding of self-contained C, C++ or Fortran code in R scripts. We will also discuss RInside to embed R code in C++ applications, as well as standard Rcpp extension packages such as RcppArmadillo for linear algebra and RcppGSL.

Afternoon session: "Advanced R and C++ topics"

This afternoon tutorial will provide a hands-on introduction to more advanced Rcpp features. It will cover topics such as writing packages that use Rcpp, how 'Rcpp modules' and the new R ReferenceClasses interact, and how 'Rcpp sugar' lets us write C++ code that is often as expressive as R code. Another possible topic, time permitting, may be writing glue code to extend Rcpp to other C++ projects. We also hope to leave some time to discuss problems brought by the class participants.

Prerequisites

Knowledge of R as well as general programming knowledge; C or C++ knowledge is helpful but not required. Users should bring a laptop set up so that R packages can be built. That means on Windows, Rtools needs to be present and working, and on OS X the Xcode package should be installed.

Registration

Registration is available via the R/Finance conference at

http://www.RinFinance.com/register/
or directly at RegOnline
http://www.regonline.com/930153
The cost is USD 500 for the whole day, and space will be limited.

Questions

Please contact us directly at RomainAndDirk@r-enthusiasts.com.

3 April 2011

Francois Marier: Encrypted system backup to DVD

Inspired by World Backup Day, I decided to take a backup of my laptop. Thanks to using a free operating system I don't have to backup any of my software, just configuration and data files, which fit on a single DVD.

In order to avoid worrying too much about secure storage and disposal of these backups, I have decided to encrypt them using a standard encrypted loopback filesystem.

(Feel free to leave a comment if you can suggest an easier way of doing this.)

Cryptmount setup

Install cryptmount:
apt-get install cryptmount
and setup two encrypted mount points in /etc/cryptmount/cmtab:
backup {
    dev=/backup.dat
    dir=/backup
    fstype=ext2 fsoptions=defaults cipher=aes

    keyfile=/backup.key
    keyhash=sha1 keycipher=des3
}

testbackup {
    dev=/cdrom/backup.dat
    dir=/backup
    fstype=ext2 fsoptions=defaults cipher=aes

    keyfile=/cdrom/backup.key
    keyhash=sha1 keycipher=des3
}
Initialize the encrypted filesystem

Make sure you have at least 4.3 GB of free disk space on / and then run:
mkdir /backup
dd if=/dev/zero of=/backup.dat bs=1M count=4096
cryptmount --generate-key 32 backup
cryptmount --prepare backup
mkfs.ext2 -m 0 /dev/mapper/backup
cryptmount --release backup

Burn the data to a DVD

Mount the newly created partition:
cryptmount backup
and then copy the files you want to /backup/ before unmounting that partition:
cryptmount -u backup
Finally, use your favourite DVD-burning program to burn these two files:
  • /backup.dat
  • /backup.key

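For example, with growisofs (any burning tool will do; adjust the device name to match your burner):
growisofs -Z /dev/dvd -R -J /backup.dat /backup.key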
Test your backup

Before deleting these two files, test the DVD you've just burned by mounting it:
mount /cdrom
cryptmount testbackup
and looking at a random sampling of the files contained in /backup.
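If you still have the originals on disk, you can go beyond spot checks and compare the whole tree (a sketch; /path/to/originals is a placeholder):
diff -rq /path/to/originals /backup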

Once you are satisfied that your backup is fine, unmount the DVD:
cryptmount -u testbackup
umount /cdrom
and remove the temporary files:
rm /backup.dat /backup.key

29 March 2011

Francois Marier: Preventing man-in-the-middle attacks on fetchmail and postfix

Recent attacks against the DNS infrastructure have exposed the limitations of relying on TLS/SSL certificates for securing connections on the Internet.

Given that typical mail servers don't rotate their keys very often, it's not too cumbersome to hardcode their fingerprints and prevent your mail software from connecting to them should the certificate change. This is similar to how most people use ssh: assume that the certificate is valid on the first connection, but be careful if the certificate changes afterwards.

Fetchmail

Here's how to specify a certificate for a POP/IMAP server (Gmail in this example).

First of all, you need to download the server certificate:

openssl s_client -connect pop.gmail.com:995 -showcerts
openssl s_client -connect imap.gmail.com:993 -showcerts

Then copy the output of that command to a file, say gmail.out, and extract its md5 fingerprint:

openssl x509 -fingerprint -md5 -noout -in gmail.out

Once you have the fingerprint, add it to your ~/.fetchmailrc:

poll pop.gmail.com protocol pop3 user "remoteusername" is "localusername" password "mypassword" fetchall ssl sslproto ssl3 sslfingerprint "12:34:AB:CD:56:78:EF:12:34:AB:CD:56:78:EF:12:34"

Postfix

Similarly, to detect changes to the certificate on your outgoing mail server (used as a smarthost on your local postfix instance), save its certificate to a file (say isp.out) and extract its sha1 fingerprint:

openssl s_client -connect mail.yourisp.net:465 -showcerts
openssl x509 -fingerprint -sha1 -noout -in isp.out

Then add the fingerprint to /etc/postfix/main.cf:

relayhost = mail.isp.net
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_auth_enable = yes
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = fingerprint
smtp_tls_mandatory_ciphers = high
smtp_tls_mandatory_protocols = !SSLv2, !SSLv3
smtp_tls_fingerprint_digest = sha1
smtp_tls_fingerprint_cert_match =
12:34:AB:CD:56:78:EF:90:12:AB:CD:34:56:EF:78:90:AB:CD:12:34

25 March 2011

Dirk Eddelbuettel: R inside Qt: A simple RInside application

The RInside package makes it pretty simple and straightforward to embed R, the wonderful statistical programming environment and language, inside of a C++ application. This uses both the robust embedding API provided by R itself, and the higher-level abstractions from our Rcpp package. A number of examples are shown on this blog both here and here as well as on the RInside page; and the source package actually contains well over a dozen complete examples which cover anything from simple examples to parallel computing via MPI.

Beginning users sometimes ask about how to use RInside inside larger projects. And as I had meant to experiment with embedding inside of the powerful Qt framework anyway, I started to dabble a little. A first result is now in the SVN sources of RInside.

My starting point was the classic tkdensity demo that comes with R itself. It is a good point of departure as Tcl/Tk makes it very portable---in fact it should run on every platform that runs R---and quite expressive. And having followed some of the GUI experiments around R over the years, I have also seen various re-implementations using different GUI frameworks. And so I am adding mine to this body of work:

[Screenshot: Example of embedding R via RInside into a Qt C++ application: density estimation for a mixture]

The problem I addressed first was actual buildability. For the RInside examples, Romain and I provide a Makefile that just works by making calls to R itself to learn about flags for R, Rcpp and RInside such that all required headers and libraries are found. That is actually relatively straightforward (and documented in our vignettes) but a little intimidating at first---which is why a ready-made Makefile is a good thing. Qt of course uses qmake and the .pro files to encode / resolve dependencies. So task one was to map what our Makefile does into its variables. Turns out that wasn't all that hard:
## -*- mode: Makefile; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
##
## Qt usage example for RInside, inspired by the standard 'density
## sliders' example for other GUI toolkits
##
## Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois
TEMPLATE =              app
HEADERS =               qtdensity.h 
SOURCES =               qtdensity.cpp main.cpp
QT +=                   svg
## comment this out if you need a different version of R, 
## and set set R_HOME accordingly as an environment variable
R_HOME =                $$system(R RHOME)
## include headers and libraries for R 
RCPPFLAGS =             $$system($$R_HOME/bin/R CMD config --cppflags)
RLDFLAGS =              $$system($$R_HOME/bin/R CMD config --ldflags)
RBLAS =                 $$system($$R_HOME/bin/R CMD config BLAS_LIBS)
RLAPACK =               $$system($$R_HOME/bin/R CMD config LAPACK_LIBS)
## if you need to set an rpath to R itself, also uncomment
#RRPATH =               -Wl,-rpath,$$R_HOME/lib
## include headers and libraries for Rcpp interface classes
RCPPINCL =              $$system($$R_HOME/bin/Rscript -e \'Rcpp:::CxxFlags\(\)\')
RCPPLIBS =              $$system($$R_HOME/bin/Rscript -e \'Rcpp:::LdFlags\(\)\')
## for some reason when building with Qt we get this each time
## so we turn unused parameter warnings off
RCPPWARNING =           -Wno-unused-parameter 
## include headers and libraries for RInside embedding classes
RINSIDEINCL =           $$system($$R_HOME/bin/Rscript -e \'RInside:::CxxFlags\(\)\')
RINSIDELIBS =           $$system($$R_HOME/bin/Rscript -e \'RInside:::LdFlags\(\)\')
## compiler etc settings used in default make rules
QMAKE_CXXFLAGS +=       $$RCPPWARNING $$RCPPFLAGS $$RCPPINCL $$RINSIDEINCL
QMAKE_LFLAGS +=         $$RLDFLAGS $$RBLAS $$RLAPACK $$RCPPLIBS $$RINSIDELIBS
## addition clean targets
QMAKE_CLEAN +=          qtdensity Makefile
The double dollar signs and escaping of parentheses are a little tedious, but hey, it works and expands the compiler and linker flags such that everything just works. The code itself is pretty straightforward too. We instantiate the RInside object as well as the main Qt application object. We then instantiate a new object of class QtDensity that will launch the main widget; it is given a reference to the RInside object.
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois
#include <QApplication>
#include "qtdensity.h"
int main(int argc, char *argv[])
{
    RInside R(argc, argv);              // create an embedded R instance
    QApplication app(argc, argv);
    QtDensity qtdensity(R);
    return app.exec();
}
The definition of the main object is pretty simple: a few private variables, and a few functions to interact with the GUI and get values from the radio buttons, slider or input field---as well as functions to update the chart or re-draw the random variables.
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois
#ifndef QTDENSITY_H
#define QTDENSITY_H
#include <RInside.h>
#include <QMainWindow>
#include <QHBoxLayout>
#include <QSlider>
#include <QSpinBox>
#include <QLabel>
#include <QTemporaryFile>
#include <QSvgWidget>
class QtDensity : public QMainWindow
{
    Q_OBJECT

public:
    QtDensity(RInside & R);

private slots:
    void getBandwidth(int bw);
    void getKernel(int kernel);
    void getRandomDataCmd(QString txt);
    void runRandomDataCmd(void);

private:
    void setupDisplay(void);    // standard GUI boilerplate of arranging things
    void plot(void);            // run a density plot in R and update the display
    void filterFile(void);      // modify the richer SVG produced by R
    QSvgWidget *m_svg;          // the SVG device
    RInside & m_R;              // reference to the R instance passed to constructor
    QString m_tempfile;         // name of file used by R for plots
    QString m_svgfile;          // another temp file, this time from Qt
    int m_bw, m_kernel;         // parameters used to estimate the density
    QString m_cmd;              // random draw command string
};
#endif
Lastly, no big magic in the code either (apart from the standard magic provided by RInside). A bit of standard GUI layouting, and then some functions to pick values from the inputs as well as to compute / update the output.

One issue is worth mentioning. The screenshot and code show the second version of this little application. I built a first one using a standard portable network graphics (png) file. That was fine, but not crisp as png is a pixel format, so I went back and experimented with scalable vector graphics (svg) instead. One can create svg output with R in a number of ways, one of which is the cairoDevice package by Michael Lawrence (who also wrote RGtk2 and good chunks of Ggobi). Now, it turns out that Qt displays the so-called SVG tiny standard whereas R creates a fuller SVG format. Some discussion with Michael revealed that one can modify the svg file suitably (which is what the function filterFile below does) and it all works. Well: almost. There is a bug (and Michael thinks it is in the SVG rendering) in which the density estimate does not get clipped to the plotting region.
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Qt usage example for RInside, inspired by the standard 'density
// sliders' example for other GUI toolkits -- this time with SVG
//
// Copyright (C) 2011  Dirk Eddelbuettel and Romain Francois
#include <QtGui>
#include "qtdensity.h"
QtDensity::QtDensity(RInside & R) : m_R(R)
{
    m_bw = 100;                 // initial bandwidth, will be scaled by 100 so 1.0
    m_kernel = 0;               // initial kernel: gaussian
    m_cmd = "c(rnorm(100,0,1), rnorm(50,5,1))"; // simple mixture
    m_R["bw"] = m_bw;           // pass bandwidth to R, and have R compute a temp.file name
    m_tempfile = QString::fromStdString(Rcpp::as<std::string>(m_R.parseEval("tfile <- tempfile()")));
    m_svgfile = QString::fromStdString(Rcpp::as<std::string>(m_R.parseEval("sfile <- tempfile()")));
    m_R.parseEvalQ("library(cairoDevice)");
    setupDisplay();
}

void QtDensity::setupDisplay(void)
{
    QWidget *window = new QWidget;
    window->setWindowTitle("Qt and RInside demo: density estimation");

    QSpinBox *spinBox = new QSpinBox;
    QSlider *slider = new QSlider(Qt::Horizontal);
    spinBox->setRange(5, 200);
    slider->setRange(5, 200);
    QObject::connect(spinBox, SIGNAL(valueChanged(int)), slider, SLOT(setValue(int)));
    QObject::connect(slider, SIGNAL(valueChanged(int)), spinBox, SLOT(setValue(int)));
    spinBox->setValue(m_bw);
    QObject::connect(spinBox, SIGNAL(valueChanged(int)), this, SLOT(getBandwidth(int)));

    QLabel *cmdLabel = new QLabel("R command for random data creation");
    QLineEdit *cmdEntry = new QLineEdit(m_cmd);
    QObject::connect(cmdEntry,  SIGNAL(textEdited(QString)), this, SLOT(getRandomDataCmd(QString)));
    QObject::connect(cmdEntry,  SIGNAL(editingFinished()), this, SLOT(runRandomDataCmd()));

    QGroupBox *kernelRadioBox = new QGroupBox("Density Estimation kernel");
    QRadioButton *radio1 = new QRadioButton("&Gaussian");
    QRadioButton *radio2 = new QRadioButton("&Epanechnikov");
    QRadioButton *radio3 = new QRadioButton("&Rectangular");
    QRadioButton *radio4 = new QRadioButton("&Triangular");
    QRadioButton *radio5 = new QRadioButton("&Cosine");
    radio1->setChecked(true);

    QVBoxLayout *vbox = new QVBoxLayout;
    vbox->addWidget(radio1);
    vbox->addWidget(radio2);
    vbox->addWidget(radio3);
    vbox->addWidget(radio4);
    vbox->addWidget(radio5);
    kernelRadioBox->setMinimumSize(260,140);
    kernelRadioBox->setMaximumSize(260,140);
    kernelRadioBox->setSizePolicy(QSizePolicy::Fixed, QSizePolicy::Fixed);
    kernelRadioBox->setLayout(vbox);

    QButtonGroup *kernelGroup = new QButtonGroup;
    kernelGroup->addButton(radio1, 0);
    kernelGroup->addButton(radio2, 1);
    kernelGroup->addButton(radio3, 2);
    kernelGroup->addButton(radio4, 3);
    kernelGroup->addButton(radio5, 4);
    QObject::connect(kernelGroup, SIGNAL(buttonClicked(int)), this, SLOT(getKernel(int)));

    m_svg = new QSvgWidget();
    runRandomDataCmd();         // also calls plot()

    QGroupBox *estimationBox = new QGroupBox("Density estimation bandwidth (scaled by 100)");
    QHBoxLayout *spinners = new QHBoxLayout;
    spinners->addWidget(spinBox);
    spinners->addWidget(slider);
    QVBoxLayout *topright = new QVBoxLayout;
    topright->addLayout(spinners);
    topright->addWidget(cmdLabel);
    topright->addWidget(cmdEntry);
    estimationBox->setMinimumSize(360,140);
    estimationBox->setMaximumSize(360,140);
    estimationBox->setSizePolicy(QSizePolicy::Fixed, QSizePolicy::Fixed);
    estimationBox->setLayout(topright);

    QHBoxLayout *upperlayout = new QHBoxLayout;
    upperlayout->addWidget(kernelRadioBox);
    upperlayout->addWidget(estimationBox);

    QHBoxLayout *svglayout = new QHBoxLayout;
    svglayout->addWidget(m_svg);

    QVBoxLayout *outer = new QVBoxLayout;
    outer->addLayout(upperlayout);
    outer->addLayout(svglayout);
    window->setLayout(outer);
    window->show();
}

void QtDensity::plot(void)
{
    const char *kernelstrings[] = { "gaussian", "epanechnikov", "rectangular", "triangular", "cosine" };
    m_R["bw"] = m_bw;
    m_R["kernel"] = kernelstrings[m_kernel]; // that passes the string to R
    std::string cmd1 = "Cairo(width=6,height=6,pointsize=10,surface='svg',filename=tfile); "
                       "plot(density(y, bw=bw/100, kernel=kernel), xlim=range(y)+c(-2,2), main=\"Kernel: ";
    std::string cmd2 = "\"); points(y, rep(0, length(y)), pch=16, col=rgb(0,0,0,1/4));  dev.off()";
    std::string cmd = cmd1 + kernelstrings[m_kernel] + cmd2; // stick the selected kernel in the middle
    m_R.parseEvalQ(cmd);
    filterFile();               // we need to simplify the svg file for display by Qt
    m_svg->load(m_svgfile);
}

void QtDensity::getBandwidth(int bw)
{
    if (bw != m_bw) {
        m_bw = bw;
        plot();
    }
}

void QtDensity::getKernel(int kernel)
{
    if (kernel != m_kernel) {
        m_kernel = kernel;
        plot();
    }
}

void QtDensity::getRandomDataCmd(QString txt)
{
    m_cmd = txt;
}

void QtDensity::runRandomDataCmd(void)
{
    std::string cmd = "y <- " + m_cmd.toStdString();
    m_R.parseEvalQ(cmd);
    plot();                     // after each random draw, update plot with estimate
}

void QtDensity::filterFile()
{
    // cairoDevice creates richer SVG than Qt can display
    // but per Michael Lawrence, a simple trick is to s/symbol/g/ which we do here
    QFile infile(m_tempfile);
    infile.open(QFile::ReadOnly);
    QFile outfile(m_svgfile);
    outfile.open(QFile::WriteOnly | QFile::Truncate);

    QTextStream in(&infile);
    QTextStream out(&outfile);
    QRegExp rx1("<symbol");
    QRegExp rx2("</symbol");
    while (!in.atEnd()) {
        QString line = in.readLine();
        line.replace(rx1, "<g"); // so '<symbol' becomes '<g ...'
        line.replace(rx2, "</g");// and '</symbol' becomes '</g'
        out << line << "\n";
    }
    infile.close();
    outfile.close();
}
What the little application does is actually somewhat neat for so few lines. One key feature is that the generated data can be specified directly by an R expression, which allows for mixtures (as shown, and as is the default). With that it is easy to see how many points are needed in the second hump to make the estimate multi-modal, how much of a distance between both centers is needed, and so on. Obviously, the effect of the chosen kernel and bandwidth can also be visualized. And with the chart being a scalable vector graphics display, we can resize and scale at will and it still looks crisp. The code (for both the simpler png variant and the svg version shown here) is in the SVN repository for RInside and will be in the next release. Special thanks to Michael Lawrence for patiently working through some svg woes with me over a few emails.

Update: Some typos fixed.

Update 2: Two URLs corrected.

13 March 2011

Francois Marier: Setting up RAID on an existing Debian/Ubuntu installation

I run RAID1 on all of the machines I support. While such hard disk mirroring is not a replacement for having good working backups, it means that a single drive failure is not going to force me to have to spend lots of time rebuilding a machine.

The best possible time to set this up is of course when you first install the operating system. The Debian installer will set everything up for you if you choose that option and Ubuntu has alternate installation CDs which allow you to do the same.

This post documents the steps I followed to retrofit RAID1 into an existing Debian squeeze installation, getting a mirrored setup after the fact.

Overview

Before you start, make sure the following packages are installed:
apt-get install mdadm rsync initramfs-tools
Then go through these steps:
  1. Partition the new drive.
  2. Create new degraded RAID arrays.
  3. Install GRUB2 on both drives.
  4. Copy existing data onto the new drive.
  5. Reboot using the RAIDed drive and test system.
  6. Wipe the original drive by adding it to the RAID array.
  7. Test booting off of the original drive.
  8. Resync drives.
  9. Test booting off of the new drive.
  10. Reboot with the two drives and resync the array.
(My instructions are mostly based on this old tutorial but also on this more recent one.)

1- Partition the new drive

Once you have connected the new drive (/dev/sdb), boot into your system and use one of cfdisk or fdisk to display the partition information for the existing drive (/dev/sda on my system).

The idea is to create partitions of the same size on the new drive. (If the new drive is bigger, leave the rest of the drive unpartitioned.)

Partition types should all be: fd (or "linux raid autodetect").
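For MBR partition tables, one way to duplicate the layout of the existing drive is to copy its partition table with sfdisk (a sketch; double-check the device names first, since getting them backwards would overwrite your good drive):
sfdisk -d /dev/sda | sfdisk /dev/sdb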

2- Create new degraded RAID arrays

The newly partitioned drive, consisting of a root and a swap partition, can be added to new RAID1 arrays using mdadm:
mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 missing /dev/sdb2
and formatted like this:
mkswap /dev/md1
mkfs.ext4 /dev/md0
Specify these devices explicitly in /etc/mdadm/mdadm.conf:
DEVICE /dev/sda* /dev/sdb*
and append the RAID arrays to the end of that file:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
dpkg-reconfigure mdadm
You can check the status of your RAID arrays at any time by running this command:
cat /proc/mdstat

3- Install GRUB2 on both drives

The best way to ensure that GRUB2, the default bootloader in Debian and Ubuntu, is installed on both drives is to reconfigure its package:
dpkg-reconfigure grub-pc
and select both /dev/sda and /dev/sdb (but not /dev/md0) as installation targets.

This should cause the init ramdisk (/boot/initrd.img-2.6.32-5-amd64) and the grub menu (/boot/grub/grub.cfg) to be rebuilt with RAID support.
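You can verify that RAID support really made it into the new init ramdisk with lsinitramfs (shipped by initramfs-tools):
lsinitramfs /boot/initrd.img-2.6.32-5-amd64 | grep md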

4- Copy existing data onto the new drive

Copy everything that's on the existing drive onto the new one using rsync:
mkdir /tmp/mntroot
mount /dev/md0 /tmp/mntroot
rsync -auHxv --exclude=/proc/* --exclude=/sys/* --exclude=/tmp/* /* /tmp/mntroot/

5- Reboot using the RAIDed drive and test system

Before rebooting, open /tmp/mntroot/etc/fstab, and change /dev/sda1 and /dev/sda2 to /dev/md0 and /dev/md1 respectively.

Then reboot and from within the GRUB menu, hit "e" to enter edit mode and make sure that you will be booting off of the new disk:
set root='(md/0)'
linux /boot/vmlinuz-2.6.32-5-amd64 root=/dev/md0 ro quiet
Once the system is up, you can check that the root partition is indeed using the RAID array by running mount and looking for something like:
/dev/md0 on / type ext4 (rw,noatime,errors=remount-ro)

6- Wipe the original drive by adding it to the RAID array

Once you have verified that everything is working on /dev/sdb, it's time to change the partition types on /dev/sda to fd and to add the original drive to the degraded RAID array:
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
You'll have to wait until the two partitions are fully synchronized but you can check the sync status using:
watch -n1 cat /proc/mdstat

7- Test booting off of the original drive

Once the sync is finished, update the boot loader menu:
update-grub
and shut the system down:
shutdown -h now
before physically disconnecting /dev/sdb and turning the machine back on to test booting with only /dev/sda present.

After a successful boot, shut the machine down and plug the second drive back in before powering it up again.

8- Resync drives

If everything works, you should see the following after running cat /proc/mdstat:
md0 : active raid1 sda1[1]
280567040 blocks [2/1] [_U]
indicating that the RAID array is incomplete and that the second drive is not part of it.

To add the second drive back in and start the sync again:
mdadm /dev/md0 -a /dev/sdb1

9- Test booting off of the new drive

To complete the testing, shut the machine down, pull /dev/sda out and try booting with /dev/sdb only.

10- Reboot with the two drives and resync the array

Once you are satisfied that it works, reboot with both drives plugged in and re-add the first drive to the array:
mdadm /dev/md0 -a /dev/sda1
Your setup is now complete and fully tested.

Ongoing maintenance

I recommend making sure the two RAIDed drives stay in sync by enabling periodic RAID checks. The easiest way is to enable the checks that are built into the Debian package:
dpkg-reconfigure mdadm
but you can also create a weekly or monthly cronjob which does the following:
echo "check" > /sys/block/md0/md/sync_action
Something else you should seriously consider is to install the smartmontools package and run weekly SMART checks by putting something like this in your /etc/smartd.conf:
/dev/sda -a -d ata -o on -S on -s (S/../.././02 L/../../6/03)
/dev/sdb -a -d ata -o on -S on -s (S/../.././02 L/../../6/03)
These checks, performed by the hard disk controllers directly, could warn you of imminent failures ahead of time. Personally, when I start seeing errors in the SMART log (smartctl -a /dev/sda), I order a new drive straight away.

31 January 2011

Stefano Zacchiroli: who the bloody hell cares about Debian

A down-under talk on the role of Debian, A.D. 2011

I'm back from LCA 2011, which I've attended to share some thoughts about the role that Debian plays in the Free Software ecosystem, 18 years after its inception (yes, we are that old^Welder). The talk title---Who the bloody hell cares about Debian? [1]---was meant to be rather provocative. The idea was indeed to challenge the meme that, in the era of distributions that release every 6 months, a distribution with a release cycle of circa 2 years (like Debian, considering the past 5 years) is no longer a project that deserves your attention. Is it really the case? In the talk I (obviously) claim it is not, using two main arguments.

The first argument is based on the observation that Debian offers a set of pretty rare, if not unique, features among mainstream FOSS distributions. Those features consist of a mix of technical and "political" aspects: (1) a focus on package quality, with no distinction between first and second class packages; (2) a strong culture of software freedom, which refuses to offer non-free software (or firmware) by default to users and distribution developers (as parts of the infrastructure used to make Debian); (3) independence from commercial interests, with no single company or entity that could claim to babysit Debian; and (4) a decision making model based on a weighted sum of do-ocracy and democracy, which implies that by doing (rather than talking) everyone has a chance to have an impact on Debian. Considering all that and looking at the most popular FOSS distributions, one can easily identify Debian as one of the few remaining players who both care about Free Software and can be trusted to make choices not driven by profit. Mind you, I've nothing against companies in general and I'm very well aware that many FOSS companies carry a good deal of the burden of developing and promoting Free Software. Nonetheless, in days in which it is striking how quickly FOSS-friendly companies can become very much FOSS-unfriendly, I can't help putting my trust and efforts into community-driven projects, better if with no attached company label whatsoever. Furthermore, having distributions like Debian around can encourage other company-backed distributions to demand more independence and clarification about the relationships between the community and the backing company.

The second argument about the relevance of Debian is more pragmatic and rather straightforward: Debian is the root of a huge tree of derived distributions (AKA "derivatives"), more than 120 according to popular distribution indexing sites. Each Debian derivative focuses its attention on, and directs its people power to, customizing Debian for a specific target, and builds entirely upon Debian work for all parts that do not need customization. That possibility is one key advantage of Free Software, after all. A well-known example is Ubuntu, which is probably the most popular Debian derivative, enjoying a user base way larger than that of Debian itself. Ubuntu is heavily customized with respect to Debian and still had (at the time of Natty) only about 25% of packages which either differed from their Debian counterparts or did not exist in Debian at all. Other Debian derivatives tend to be way less customized than that. Either way, if you are running a Debian derivative, chances are that you heavily depend on Debian and on its well-being. (Yes, it is so even if you didn't know about it, sorry.)
Let's now fast-forward to the end of the talk, skipping my usual comments on how to keep the whole tree of Debian derivatives sustainable and beneficial for Free Software as a whole (i.e. by reducing as much as possible the viscosity of patch flow along the derivatives tree).

Feedback

People's reaction to such a provocative talk has been positive and we enjoyed a fairly long Q&A session discussing the topics mentioned above as well as other Debian-related topics. A (commented) summary of the feedback is reported below. Judging from the recurrent questions and suggestions I've received while doing Debian talks over the past 8 months, I dare say that the LCA feedback is fairly representative of the feelings of many Debian users.
  • Some people believe that Debian is too silent about its role with respect to derivatives. I agree: we should communicate clearly about this and ask our derivatives to do the same. Initiatives such as the Derivatives Front Desk and, more recently, the Derivatives Census seem to go in the right direction.
  • More generally, users seem to believe that we have a tendency to undersell what Debian has to offer and, in particular, the testing distribution (whose name is too scary in comparison to the unique balance of up-to-date and tested software that it offers).
  • There is a clear interest in rolling distributions among GNU/Linux users and Debian enthusiasts are no exception. Mentioning CUT seems to invariably whet the appetite of our users.
  • On the other hand, there are also inquiries about the support period for Debian stable releases (currently about 3.5 years) and the possibility of having it extended. Once more, recent work in progress by the security team seems to be going in the right direction. As another potential underselling problem, not all users seem to be aware that Debian security support covers the whole archive, whereas other LTS offerings do not.
  • Various users are enthusiastic about the free firmware achievement for Squeeze and happy about the way we communicated it. Questions about the purpose of other "free firmware" distributions with respect to Debian invariably arise, although it's not up to us to answer those.
  • There is also interest in cross-distribution collaboration, probably triggered by a natural generalization of the invitation to collaborate across derivatives. I've been asked about Debian's participation in the AppStream thingie and I've been happy to reiterate that it has been an important milestone in cross-distro collaboration.
Slides of the talk are available, while a video will eventually be posted at http://linuxconfau.blip.tv.
[1] Kudos to Francois Marier for suggesting the title. If the title doesn't ring a bell for you (it didn't for me in the beginning), you might want to check out the story of this ad.

Update: minor rephrasing in the 4th paragraph

Francois Marier: Keeping a log of branch updates on a git server

Using a combination of bad luck and some of the more advanced git options (a forced push, for example), it is possible to mess up a centralised repository by accidentally pushing a branch and overwriting the existing branch pointer (or "head") on the server.

If you know where the head was pointing prior to that push, recovering it is a simple matter of running this on the server:
git update-ref refs/heads/branchname commit_id
However, if you don't know the previous commit ID, then you pretty much have to dig through the history using git log.
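For example, and assuming the overwritten commits are still reachable from some other branch or tag, something along these lines can help locate the lost commit ID on the server (a rough sketch, not a precise recipe):
git log --all --oneline | less
If the commits have become unreachable from every ref, git fsck can still dig them up:
git fsck --unreachable | grep commit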

Enabling a server-side reflog

One option to prevent this from happening is to simply enable the reflog on the server, where it is disabled by default in bare repositories.

Simply add this to your git config file on the server:
[core]
logallrefupdates = true
and then whenever a head is updated, an entry will be added to the reflog.
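The same setting can also be flipped with git config instead of editing the file by hand, and once the reflog is in place, recovering from a bad push boils down to looking up the previous position of the head. A minimal sketch (branchname is a placeholder):
git config core.logAllRefUpdates true
git reflog show branchname
git update-ref refs/heads/branchname branchname@{1}
Here branchname@{1} is reflog syntax for the value the head had just before its last update.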

26 January 2011

Francois Marier: Serving pre-compressed files using Apache

The easiest way to compress the data that is being served to the visitors of your web application is to make use of mod_deflate. Once you have enabled that module and provided it with a suitable configuration, it will compress all relevant files on the fly as it is serving them.
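For reference, a minimal mod_deflate configuration (assuming the module has been enabled, e.g. with a2enmod deflate on Debian) looks something like this:
AddOutputFilterByType DEFLATE text/html text/css text/javascript application/javascript
This is only a sketch; the exact list of MIME types to compress is up to you.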

Given that I was already going to minify my Javascript and CSS files ahead of time (i.e. not using mod_pagespeed), I figured that there must be a way for me to serve gzipped files directly.

"Compiling" Static Files

I decided to treat my web application like a C program. After all, it starts as readable source code and ends up as an unreadable binary file.

So I created a Makefile to minify and compress all CSS and Javascript files using YUI Compressor and gzip:

all: build

# yui-compressor's -o option accepts a substitution pattern:
# '.css$:.css.css' turns foo.css into a minified foo.css.css
# next to the original.
build:
	find static/css -type f -name "[^.]*.css" -execdir yui-compressor -o '.css$$:.css.css' '{}' \;
	find static/js -type f -name "[^.]*.js" -execdir yui-compressor -o '.js$$:.js.js' '{}' \;
	cd static/css && for f in *.css.css ; do gzip -c $$f > `basename $$f .css`.gz ; done
	cd static/js && for f in *.js.js ; do gzip -c $$f > `basename $$f .js`.gz ; done

# Remove all generated files.
clean:
	find static/css -name "*.css.css" -delete
	find static/js -name "*.js.js" -delete
	find static/css -name "*.css.gz" -delete
	find static/js -name "*.js.gz" -delete
	find -name "*.pyc" -delete

This leaves the original files intact and adds minified .css.css and .js.js files as well as minified and compressed .css.gz and .js.gz files.

How browsers advertise gzip support

The nice thing about serving compressed content is that browsers which support receiving gzipped content (almost all of them nowadays) include the following HTTP header in their requests:
Accept-Encoding: gzip,deflate
(Incidentally, if you want to test what browsers without gzip support see, just browse to about:config in Firefox and remove what's in the network.http.accept-encoding variable.)
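Alternatively, curl can simulate both kinds of browsers without touching any browser settings. For example (the URL is a placeholder for one of your own stylesheets):
curl -sI http://localhost/static/css/style.css | grep -i content-encoding
curl -sI -H 'Accept-Encoding: gzip,deflate' http://localhost/static/css/style.css | grep -i content-encoding
Once the configuration below is in place, only the second request should show a Content-Encoding: gzip header.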

Serving compressed files to clients

To serve different files to different browsers, all that's needed is to enable MultiViews in our Apache configuration (as suggested on the Apache mailing list):

<Directory /var/www/static/css>
AddEncoding gzip gz
ForceType text/css
Options +MultiViews
SetEnv force-no-vary
Header set Cache-Control "private"
</Directory>

<Directory /var/www/static/js>
AddEncoding gzip gz
ForceType text/javascript
Options +MultiViews
SetEnv force-no-vary
Header set Cache-Control "private"
</Directory>

The ForceType directive is there to force the MIME type (as described in this solution) and to make sure that browsers (including Firefox) render the files rather than prompting to download them to disk.

As for the SetEnv directive, it turns out that Internet Explorer does not cache most files served with a Vary header (which Apache adds during content negotiation), so we must make sure that header gets stripped out before the response goes out.

Finally, the Cache-Control headers are set to private to prevent intermediate/transparent proxies from caching our CSS and Javascript files, while allowing browsers to do so. If intermediate proxies start caching compressed content, they may incorrectly serve it to clients without gzip support.
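As a final sanity check that the negotiation and the header tweaks all work together, something along these lines can be used (again, the URL is a placeholder):
curl -sI -H 'Accept-Encoding: gzip,deflate' http://localhost/static/css/style.css | grep -iE 'content-encoding|content-type|vary|cache-control'
You should see Content-Encoding: gzip, Content-Type: text/css and Cache-Control: private in the output, and no Vary header.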
