Search Results: "kilobyte"

21 July 2021

Sean Whitton: Delivering Common Lisp executables using Consfigurator

I realised this week that my recent efforts to improve how Consfigurator makes the fork(2) system call have also created a way to install executables to remote systems which will execute arbitrary Common Lisp code. Distributing precompiled programs using free software implementations of the Common Lisp standard tends to be more of a hassle than with a lot of other high level programming languages. Executables will often be hundreds of megabytes in size even if your codebase is just a few megabytes, because the whole interactive Common Lisp environment gets bundled along with your program's code. Commercial Common Lisp implementations manage to do better, as I understand it, by knowing how to shake out unused code paths. Consfigurator's new mechanism uploads only changed source code, which might be only kilobytes in size, and updates the executable on the remote system. So it should be useful for deploying Common Lisp-powered web services, and the like.

Here's how it works. When you use Consfigurator you define an ASDF system, analogous to a Python package or Perl distribution, called your "consfig". This defines HOST objects to represent the machines that you'll use Consfigurator to manage, and any custom properties, functions those properties call, etc. An ASDF system can depend upon other systems; for example, every consfig depends upon Consfigurator itself. When you execute Consfigurator deployments, Consfigurator uploads the source code of any ASDF systems that have changed since you last deployed this host, starts up Lisp on the remote machine, and loads up all the systems. Now the remote Lisp image is in a similarly clean state to when you've just started up Lisp on your laptop and loaded up the libraries you're going to use. Only then are the actual deployment instructions sent on stdin.
What I've done this week is insert an extra step for the remote Lisp image in between loading up all the ASDF systems and reading the deployment from stdin: the image calls fork(2) and establishes a pipe to communicate with the child process. The child process can be sent Lisp forms to evaluate, but for each Lisp form it receives it will actually fork again, and have its child process evaluate the form. Thus, going into the deployment, the original remote Lisp image has the capability to have arbitrary Lisp forms evaluated in a context in which all that has happened is that a statically defined set of ASDF systems has been loaded; the child processes never see the full deployment instructions sent on stdin. Further, the child process responsible for actually evaluating the Lisp form received from the first process first forks off another child process and sets up its own control pipe, such that it too has the capability to have arbitrary Lisp forms evaluated in a cleanly loaded context, no matter what else it might put in its memory in the meantime. (Things are set up such that the child processes responsible for actually evaluating the Lisp forms never see the Lisp forms received for evaluation by other child processes, either.) So suppose now we have an ASDF system :com.silentflame.cool-web-service, and there is a function (start-server PORT) which we should call to start listening for connections. Then we can make our consfig depend upon that ASDF system, and do something like this:
CONSFIG> (deploy-these ((:ssh :user "root") :sbcl) server.example.org
           ;; Set up Apache to proxy requests to our service.
           (apache:https-vhost ...)
           ;; Now apply a property to dump the image.
           (image-dumped "/usr/local/bin/cool-web-service"
                         '(cool-web-service:start-server 1234)))
Consfigurator will: SSH to server.example.org; upload all the ASDF source for your consfig and its dependencies; compile and load that code into a remote SBCL process; call fork(2) and set up the control pipe; receive the applications of APACHE:HTTPS-VHOST and IMAGE-DUMPED shown above from your laptop, on stdin; apply the APACHE:HTTPS-VHOST property to ensure that Apache is proxying connections to port 1234; and send a request into the control pipe to have the child process fork again and dump an executable which, when started, will evaluate the form (cool-web-service:start-server 1234). And that form will get evaluated in a pristine Lisp image, where the only meaningful things that have happened are that some ASDF systems have been loaded and a single fork(2) has taken place. You'd probably need to add some other properties to add some mechanism for actually invoking /usr/local/bin/cool-web-service and restarting it when the executable is updated.

(Background: the primary reason why Consfigurator's remote Lisp images need to call fork(2) is that they need to do things like setuid from root to other accounts and enter chroots without getting stuck in those contexts. Previously we forked right before entering such contexts, but that meant that Consfigurator deployments could never be multithreaded, because it might later be necessary to fork, and you can't usually do that once you've got more than one thread running. So now we fork before doing anything else, so that the parent can then go multithreaded if desired, but can still execute subdeployments in contexts like chroots by sending Lisp forms to evaluate in those contexts into the control pipe.)
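For readers more familiar with Unix than Lisp, the double-fork arrangement can be sketched in Python. This is only an illustration of the pattern, not Consfigurator's actual Lisp implementation, and the one-form-per-line protocol here is invented for the example:

```python
import os
import tempfile

def make_clean_forker():
    """Fork a child that waits on a pipe; for each request it receives,
    the child forks again and the grandchild evaluates the form.  The
    child itself evaluates nothing, so its image stays pristine."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                     # child: the "clean" image
        os.close(write_fd)
        with os.fdopen(read_fd) as requests:
            for form in requests:    # one request per line
                if os.fork() == 0:   # grandchild does the dirty work
                    exec(form)
                    os._exit(0)
                os.wait()            # child only reaps; stays pristine
        os._exit(0)                  # EOF on the pipe: all done
    os.close(read_fd)
    return pid, os.fdopen(write_fd, "w", buffering=1)

# The parent may now go multithreaded; every form sent down the pipe is
# evaluated in a fresh copy of the pristine child image.
out = tempfile.mktemp()
pid, control = make_clean_forker()
control.write(f"open({out!r}, 'w').write('dumped')\n")
control.close()
os.waitpid(pid, 0)
```

Because the child only ever reaps grandchildren, its memory stays exactly as it was right after the statically defined systems were loaded, which is what makes dumping a clean executable from it possible.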

30 April 2020

Chris Lamb: Free software activities in April 2020

Here is my monthly update covering what I have been doing in the free software world during April 2020 (previous month's report). Looking it over prior to publishing, I am surprised how much I got done this month; I felt that I was not only failing to do all the extra things I had planned, but was doing far less than normal. But let us go easy on ourselves; nobody is nailing this. In addition, I did more hacking on the Lintian static analysis tool for Debian packages:
Reproducible builds One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third-parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the Reproducible Builds effort is to ensure no flaws have been introduced during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised. The initiative is proud to be a member project of the Software Freedom Conservancy, a not-for-profit 501(c)(3) charity focused on ethical technology and user freedom. Conservancy acts as a corporate umbrella allowing projects to operate as non-profit initiatives without managing their own corporate structure. If you like the work of the Conservancy or the Reproducible Builds project, please consider becoming an official supporter. Elsewhere in our tooling, I made the following changes to diffoscope, our in-depth and content-aware diff utility that can locate and diagnose reproducibility issues, including preparing and uploading versions 139, 140, 141 and 142 to Debian: Lastly, I made a large number of changes to our website and documentation in the following categories:
Debian LTS This month I have contributed 18 hours to Debian Long Term Support (LTS) and 7 hours on its sister Extended LTS project. You can find out more about the project via the following video:
Debian I only filed three bugs in April, including one against snapshot.debian.org to report that a Content-Type HTTP header is missing when downloading .deb files (#956471) and to report build failures in the macs & ruby-enumerable-statistics packages:

1 June 2016

Antoine Beaupré: (Still) working too much on the computer

I have been using Workrave to try to force me to step away from the computer regularly, to work around Repetitive Strain Injury (RSI) issues that have plagued my life on the computer intermittently in the last decade. Workrave itself is only marginally effective at getting me away from the machine: like any warning system, it suffers from alarm fatigue as you frenetically click the dismiss button every time a Workrave warning pops up. However, it has other uses.

Analyzing data input

In the past, I have used Workrave to document how I work too much on the computer, but never went through more serious processing of the vast data store that Workrave accumulates about mouse movements and keystrokes. Interested in knowing how much my leave from Koumbit affected time spent on the computer, I decided to look into this again. It turns out I am working as much, if not more, on the computer since I took that "time off":

[graph: per-machine keystrokes per day and average]

We can see here that I type a lot on the computer. Normal days range from 10,000 to 60,000 keystrokes, with extremes at around 100,000 keystrokes per day. The average seems to fluctuate around 30,000 to 40,000 keystrokes per day, but rises sharply around the end of the second quarter of this very year. For those unfamiliar with the underlying technology, one keystroke is roughly one byte I put on the computer. So the average of 40,000 keystrokes is 40 kilobytes (KB) per day on the computer. That means about 15 MB over a year, or about 150 MB (or about 143 MiB if you want to be picky about it) over the course of the last decade. That is a lot of typing. I originally thought this could have been only because I type more now, as opposed to using the mouse more previously. Unfortunately, Workrave also tracks general "active time", which we can also examine:

[graph: per-machine active time per day and average]

Here we see that I work around 4 hours a day continuously on the computer. That is active time: not just login-to-logout time. In other words, the time where I look away from the computer and think for a while, jot down notes in my paper agenda or otherwise step away from the computer for small breaks is not counted here. Notice how some days go up to 12 hours and how recently the average went up to 7 hours of continuous activity. So we can clearly see that I basically work more on the computer now than I ever did in the last 7 years.
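The back-of-the-envelope arithmetic above (one keystroke ≈ one byte) is easy to check:

```python
keystrokes_per_day = 40_000                # the observed daily average
kb_per_day = keystrokes_per_day / 1000     # ~40 KB typed per day
mb_per_year = kb_per_day * 365 / 1000      # ~14.6 MB per year
mb_per_decade = mb_per_year * 10           # ~146 MB per decade
print(kb_per_day, round(mb_per_year, 1), round(mb_per_decade))
```

The prose rounds this up slightly to 15 MB per year and 150 MB per decade.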
This is a problem: one of the reasons for this time off was to step away from the computer, and it seems I have failed.
Update: it turns out the graph was skewed towards the last samples. I went easier on the keyboard in the last few days and things have significantly improved: [graph: per-machine keystrokes per day and average, 3 days later]
Another interesting thing we can see is when I switched from using my laptop to using the server as my main workstation, around early 2011, which is about the time marcos was built. Now that marcos has been turned into a home cinema downstairs, I went back to using my laptop as my main computing device, in late 2015. We can also clearly see when I stopped using Koumbit machines near the end of 2015 as well.

Further improvements and struggle for meaning

The details of how the graph was produced are explained at the end of this article. This is all quite clunky: it doesn't help that the Workrave data structure is not easily parsable and is easily corruptible. It would be best if each data point was on its own separate line; the files would be longer, granted, but much easier to parse. Furthermore, I do not like the perl/awk/gnuplot data processing pipeline much. It doesn't allow me to do interesting analysis like averages, means and linear regressions easily. It could be interesting to rewrite the tools in Python to allow better graphs and easier data analysis, using the tools I learned in 2015-09-28-fun-with-batteries. Finally, this looks only at keystrokes and non-idle activity. It could be more interesting to look at idle/active times and generally the amount of time spent on the computer each day. And while it is interesting to know that I basically write a small book on the computer every day (according to Wikipedia, 120 KB is about the size of a small pocket book), it is mostly meaningless if all that stuff is machine-readable code. Where is, after all, the meaning in all those shell commands and programs we constantly input on our keyboards, in the grand scheme of human existence? Most of those bytes are bound to be destroyed by garbage collection (my shell's history) or catastrophic backup failures.
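The Python rewrite mused about here could start with a small parser and some basic statistics. A sketch, assuming the workrave-*.dat layout used by the gnuplot scripts below (a timestamp in the first field and a keystroke count in field 28):

```python
from statistics import mean

def load_keystrokes(path, column=27):
    """Parse a workrave-*.dat file: timestamp in field 1, keystroke
    count assumed in field 28 (0-indexed 27, as in the gnuplot scripts).
    Returns a list of (timestamp, keystrokes) pairs."""
    days = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) > column:
                days.append((int(fields[0]), int(fields[column])))
    return days

def trend(days):
    """Mean keystrokes plus a least-squares slope over sample index: a
    crude stand-in for gnuplot's sbezier smoothing."""
    xs = range(len(days))
    ys = [k for _, k in days]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return my, slope
```

Feeding this the summed workrave.dat would give the overall average, and the sign of the slope would show whether the typing load is trending up or down.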
While the creative works from the 16th century can still be accessed and used by others, the data in some software programs from the 1990s is already inaccessible. - Lawrence Lessig
But is my shell history relevant? Looking back at old posts on this blog, one has to wonder if the battery life of the Thinkpad 380z laptop or how much e-waste I threw away in 2005 will be of any historical value in 20 years, assuming the data survives that long.

How this graph was made

I was happy to find that Workrave has some contrib scripts to do such processing. Unfortunately, those scripts are not shipped with the Debian package, so I requested that they be included (#825982). There were also some fixes necessary to make the scripts work at all: first, there was a syntax error in the Perl script. Then, since my data is so old, there was bound to be some data corruption in there: incomplete entries or just plain broken data. I had lines that were all NULL characters, typical of power failures or disk corruption. So I have made a patch to fix that script (#826021). But this wasn't enough: while this processes data on the current machine fine, it doesn't deal with multiple machines very well. In the last 7 years of data I could find, I was using 3 different machines: this server (marcos), my laptop (angela) and Koumbit's office servers (koumbit). I ended up modifying the contrib scripts to be able to collate that data meaningfully. First, I copied over the data from Koumbit into a local fake-koumbit directory. Second, I mounted marcos' home directory locally with SSHFS:
sshfs anarc.at:/home/anarcat marcos
I also made this script to sum up datasets:
#!/usr/bin/perl -w
use List::MoreUtils 'pairwise';
$| = 1;
my %data = ();
while (<>) {
    my @fields = split;
    my $date = shift @fields;
    if (defined($data{$date})) {
        my @t = pairwise { $a + $b } @{$data{$date}}, @fields;
        $data{$date} = \@t;
    }
    else {
        $data{$date} = \@fields;
    }
}
foreach my $d ( sort keys %data ) {
    print "$d @{$data{$d}}\n";
}
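The same per-date column summing could also be done in Python, which the author notes elsewhere would make further analysis easier. A minimal equivalent sketch of the script above:

```python
import sys
from collections import defaultdict

def sum_datasets(lines):
    """Sum whitespace-separated numeric columns for lines sharing a
    date key, mirroring the Perl sum.pl script above."""
    data = defaultdict(list)
    for line in lines:
        date, *fields = line.split()
        nums = [int(f) for f in fields]
        if data[date]:
            data[date] = [a + b for a, b in zip(data[date], nums)]
        else:
            data[date] = nums
    return {d: data[d] for d in sorted(data)}

if __name__ == "__main__":
    # Same pipeline shape: cat workrave-*.dat | sort | python sum.py
    for date, cols in sum_datasets(sys.stdin).items():
        print(date, *cols)
```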
Then I made a modified version of the Gnuplot script that processes all those files together:
#!/usr/bin/gnuplot
set title "Workrave"
set ylabel "Keystrokes per day"
set timefmt "%Y%m%d%H%M"
#set xrange [450000000:*]
set format x "%Y-%m-%d"
set xtics rotate
set xdata time
set terminal svg
set output "workrave.svg"
plot "workrave-angela.dat" using 1:28 title "angela", \
     "workrave-marcos.dat" using 1:28 title "marcos", \
     "workrave-koumbit.dat" using 1:28 title "koumbit", \
     "workrave-sum.dat" using 1:2 smooth sbezier linewidth 3 title "average"
#plot "workrave-angela.dat" using 1:28 smooth sbezier title "angela", \
#     "workrave-marcos.dat" using 1:28 smooth sbezier title "marcos", \
#     "workrave-koumbit.dat" using 1:28 smooth sbezier title "koumbit"
And finally, I made a small shell script to glue this all together:
#!/bin/sh
perl workrave-dump > workrave-$(hostname).dat
HOME=$HOME/marcos perl workrave-dump > workrave-marcos.dat
HOME=$PWD/fake-koumbit perl workrave-dump > workrave-koumbit.dat
# remove idle days as they skew the average
sed -i '/ 0$/d' workrave-*.dat
# per-day granularity
sed -i 's/^\(........\)....\? /\1 /' workrave-*.dat
# sum up all graphs
cat workrave-*.dat | sort | perl sum.pl > workrave.dat
./gnuplot-workrave-anarcat
I used a different gnuplot script to generate the activity graph:
#!/usr/bin/gnuplot
set title "Workrave"
set ylabel "Active hours per day"
set timefmt "%Y%m%d%H%M"
#set xrange [450000000:*]
set format x "%Y-%m-%d"
set xtics rotate
set xdata time
set terminal svg
set output "workrave.svg"
plot "workrave-angela.dat"  using 1:($23/3600) title "angela", \
     "workrave-marcos.dat"  using 1:($23/3600) title "marcos", \
     "workrave-koumbit.dat" using 1:($23/3600) title "koumbit", \
     "workrave.dat"         using 1:($23/3600) title "average" smooth sbezier linewidth 3
#plot "workrave-angela.dat" using 1:28 smooth sbezier title "angela", \
#     "workrave-marcos.dat" using 1:28 smooth sbezier title "marcos", \
#     "workrave-koumbit.dat" using 1:28 smooth sbezier title "koumbit"

2 April 2016

Petter Reinholdtsen: syslog-trusted-timestamp - chain of trusted timestamps for your syslog

Two years ago, I had a look at trusted timestamping options available, and among other things noted a still open bug in the tsget script included in openssl that made it harder than necessary to use openssl as a trusted timestamping client. A few days ago I was told the Norwegian government office DIFI is close to releasing their own trusted timestamp service, and in the process I was happy to learn about a replacement for the tsget script using only curl:
openssl ts -query -data "/etc/shells" -cert -sha256 -no_nonce \
    | curl -s -H "Content-Type: application/timestamp-query" \
           --data-binary "@-" http://zeitstempel.dfn.de > etc-shells.tsr
openssl ts -reply -text -in etc-shells.tsr
This produces a binary timestamp file (etc-shells.tsr) which can be used to verify that the content of the file /etc/shells, with the calculated sha256 hash, existed at the point in time when the request was made. The last command extracts the content of etc-shells.tsr in human readable form. The idea behind such a timestamp is to be able to prove, using cryptography, that the content of a file has not changed since the file was stamped. To verify that the file on disk matches the public key signature in the timestamp file, run the following commands. They make sure you have the required certificate for the trusted timestamp service available and use it to compare the file content with the timestamp. In production, one should of course use a better method to verify the service certificate.
wget -O ca-cert.txt https://pki.pca.dfn.de/global-services-ca/pub/cacert/chain.txt
openssl ts -verify -data /etc/shells -in etc-shells.tsr -CAfile ca-cert.txt -text
Wikipedia has a lot more information about trusted timestamping and linked timestamping, and there are several trusted timestamping services around, both as commercial services and as free and public services. Among the latter are the zeitstempel.dfn.de service mentioned above and the freetsa.org service linked to from the Wikipedia article. I believe the DIFI service should show up on https://tsa.difi.no, but it is not available to the public at the moment. I hope this will change when it goes into production. The RFC 3161 trusted timestamping protocol standard is even implemented in LibreOffice, Microsoft Office and Adobe Acrobat, making it possible to verify when a document was created. I would find it useful to be able to use such a trusted timestamp service to verify that my stored syslog files have not been tampered with. This is not a new idea; I found one example implemented on the Endian network appliances, where the configuration of such a feature was described in 2012. But I could not find any free implementation of such a feature when I searched, so I decided to try to build a prototype named syslog-trusted-timestamp. My idea is to generate a timestamp of the old log files after they are rotated, and store the timestamp in the new log file just after rotation. This will form a chain that would make it possible to see if any old log files are tampered with. But syslog is bad at handling kilobytes of binary data, so I decided to base64 encode the timestamp and add an ID and line sequence numbers to the base64 data to make it possible to reassemble the timestamp file again. To use it, simply run it like this:
syslog-trusted-timestamp /path/to/list-of-log-files
This will fetch a timestamp from one or more timestamp services (which services, exactly, is not yet decided nor implemented) for each listed file and send it to the syslog using logger(1). To verify a timestamp, the same program is used with the --verify option:
syslog-trusted-timestamp --verify /path/to/log-file /path/to/log-with-timestamp
The verification step is not yet well designed. The current implementation depends on the file path being unique and unchanging, and this is not a solid assumption. It also uses the process number as the timestamp ID, and this is bound to create ID collisions. I hope to have time to come up with a better way to handle timestamp IDs and verification later. Please check out the prototype for syslog-trusted-timestamp on github and send suggestions and improvements, or let me know if there already exists a similar system for timestamping logs, to allow me to join forces with others with the same interest. As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.
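The base64-plus-ID-and-sequence-number scheme described above can be sketched as follows; the line format here is invented for illustration and is not the prototype's actual format:

```python
import base64

def chunk_timestamp(tsr_bytes, stamp_id, width=60):
    """Split a binary .tsr timestamp into syslog-sized base64 lines,
    each tagged with an ID and a sequence number so the original file
    can be reassembled later."""
    b64 = base64.b64encode(tsr_bytes).decode("ascii")
    nchunks = (len(b64) + width - 1) // width
    return [f"TSTAMP {stamp_id} {i} {b64[i * width:(i + 1) * width]}"
            for i in range(nchunks)]

def reassemble(lines, stamp_id):
    """Recover the binary timestamp from (possibly reordered or
    interleaved) log lines carrying the given ID."""
    parts = {}
    for line in lines:
        tag, sid, seq, data = line.split()
        if tag == "TSTAMP" and sid == stamp_id:
            parts[int(seq)] = data
    return base64.b64decode("".join(parts[i] for i in sorted(parts)))
```

Each returned line would then be handed to logger(1); because every line carries the ID and sequence number, reassembly survives reordering and interleaving with other log traffic.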

1 March 2016

Vincent Sanders: Hope is tomorrow's veneer over today's disappointment.

Recently I have been very hopeful about the 96boards HiKey SBC and, as Evan Esar predicted, I have therefore been very disappointed. I was given a HiKey after a Linaro Connect event some time ago by another developer who could not get the system working usefully, and this is the tale of what followed.
The Standard Design
This board design was presented as Linaro creating a standard for the 64bit Single Board Computer (SBC) market. So I had expected that a project with such lofty goals would have considered many factors and provided at least as good a solution as the existing 32bit boards.

The lamentable hubris of creating a completely new form factor unfortunately sets a pattern for the whole enterprise. Given the aim towards "makers" I would have accepted that the system would not be an ATX PC-size motherboard, but mini/micro/nano and pico ITX have been available for several years.

If opting for a smaller "credit card" form factor why not use one of the common ones that have already been defined by systems such as the Raspberry Pi B+? Instead now every 96Board requires special cases and different expansion boards.

Not content with defining their own form factor, the design also uses an 8-18V supply; this is the only SBC I own that is not fed from a 5V supply. I understand that a system might require more current than a micro USB connector can provide, but, for example, the Banana Pi manages with a DC barrel jack readily capable of delivering 25W, which would seem more than enough.

The new form factor forced the I/O connectors to be placed differently to other SBCs, despite the opportunity to concentrate all connectors on one edge (like ATX designs) and avoid the issues where two or three sides are used. The 96Board design instead puts connectors on two edges, removing any possible benefit this might have given.

The actual I/O connectors specified are rather strange. There is a mandate for HDMI, removing the possibility of future display technology changes. The odd USB arrangement of two single sockets instead of a stacked pair seems to be an attempt to keep height down, but the expansion headers and CPU heatsink mean this is largely moot.

The biggest issue, though, is mandating WIFI but not Ethernet (even as an option); everything else in the design I could logically understand, but this makes no sense. It means the design is not useful for many applications without adding USB devices.

Expansion is presented as a 2mm pitch DIL socket for "low speed" signals and a high density connector for "high speed" signals. The positioning and arrangement of these connectors proffered an opportunity to improve upon previous SBC designs which was not taken. The use of 2mm pitch and low voltage signals instead of the more traditional 2.54mm pitch 3.3v signals means that most maker type applications will need adapting from the popular Raspberry Pi and Arduino style designs.

In summary, the design appears to have been a Linaro project to favour one of their members, which took a Hisilicon Android phone reference design, put it onto a board with no actual thought beyond getting it done, and then afterwards attempted to turn that into a specification. This has simply not worked as an approach.

My personal opinion is that this specification is fatally flawed and is a direct cause of the bizarre situation where the "consumer" specification exists alongside the "enterprise" edition, which itself has an option of a microATX form factor anyhow!
The Implementation

If we ignore the specification appearing to be nothing more than a codification of the original HiKey design, we can look at the HiKey as an implementation.

Initially the board required modifying to add headers to attach a USB to 1.8V LVTTL serial adaptor on the UART0 serial port. Once Andy Simpkins had made this change for me I was able to work through the instructions and attempt to install a bootloader and OS image.

The initial software was essentially HiSilicon vendor code using the Android fastboot system to configure booting. There was no source, and the Ubuntu OS images were an obvious afterthought to the Android images. Just getting these images installed required a great deal of effort, repetition and debugging. It was such a dreadful experience that it signalled the commencement of one of the repeated hiatuses throughout this project; the allure of 64 bit ARM computing has its limits, even for me.

When I returned to the project I attempted to use the system from the on-board eMMC, but the pre-built binary-only kernel and OS image were very limited. Building a replacement kernel, or even modules for the existing one, proved fruitless, and the system was dreadfully unstable.

I wanted to use the system as a builder for some Open Source projects but the system instability ruled this out. I considered attempting to use virtualisation which would also give better system isolation for builder systems. By using KVM running a modern host kernel and OS as a guest this would also avoid issues with the host systems limitations. At which point I discovered the system had no virtualisation enabled apparently because the bootloader lacked support.

In addition to these software issues there were hardware problems: despite forcing the use of USB for all additional connectivity, the USB implementation was dreadful. For a start, all USB peripherals have to run at the same speed! One cannot mix full speed (12Mbit) and high speed (480Mbit) devices, which makes adding a USB Ethernet and SATA device challenging when you cannot use a keyboard.

And, because I needed more challenges, only one of the USB root hubs was functional. In effect this made the console serial port critical, as it was the only reliable way to reconfigure the system without a keyboard or network link (and WIFI was not reliable either).

After another long pause in proceedings I decided that I should house all the components together and that perhaps being out on my bench might be the cause of some instability. I purchased a powered Amazon basics USB 2 hub, an Ethernet adaptor and a USB 2 SATA interface in the hope of accessing some larger mass storage.

The USB hub power supply was 12V DC which matched the Hikey requirements so I worked out I could use a single 4A capable supply and run a 3.5inch SATA hard drive too. I designed a laser cut enclosure and mounted all the components. As it turned out I only had a 2.5inch hard drive handy so the enclosure is a little over size. If I were redoing this design I would attempt to make it fit in 1U of height and be mountable in a 19inch rack instead it is an 83mm high (under 2U) box.

A new software release had also become available which purported to use a UEFI bootloader. I struggled to install this version, unsuccessfully, made somewhat more problematic by the undocumented change from UART0 (unpopulated header) to UART3 on the low speed 2mm pitch header. The system seemed to start the kernel, which panicked and hung, whether booting from eMMC or SD card. Once again the project went on hold after spending tens of hours trying to make progress.
Third time's a charm

As the year rolled to a close I was once again persuaded to look at the HiKey. I followed the much improved instructions and installed the shiny new November software release, which appears to have been made for the re-release of the HiKey through LeMaker. This time I obtained a Debian "jessie" system that booted from the eMMC.

Having a booted system meant I could finally try and use it. I had decided to try and use the system to host virtual machines used as builders within the NetSurf CI system.

The basic OS uses a mixture of normal Debian packages with some replacements from Linaro repositories. I would have preferred to see more use of Debian packages, even if they were from the backports repositories, but on the positive side it is good to see the use of Debian instead of Ubuntu.

The kernel is a heavily patched 3.18, built in a predominantly monolithic (without modules) manner, with the usual exceptions such as the mali and wifi drivers (both of which appear to be binary blobs). The use of a non-mainline-capable SoC means the standard generic distribution kernels cannot be used, and unless Linaro choose to distribute a kernel with the needed features built in, it is necessary to compile your own from sources.

The default install has a linaro user, which I renamed to my user, and I ensured all the ssh keys and passwords on the system were changed. This is an important step when using such pre-supplied images, as often a booted system is identical to every other copy.

To access mass storage my only option was via USB; indeed, to add any additional expansion that is the only choice. The first issue here is that the USB host support is compiled in, so when the host ports are initialised it is not possible to select a speed other than 12Mbit. The speed is changed to 480Mbit by using a user space application found in the user's home directory (why this is not a tool provided by a package and installed in sbin I do not know).

When the usb_speed tool is run there is a chance that the previously enumerated devices will be rescanned, and what was /dev/sda becomes /dev/sdb. If this happens there is a high probability that the system must be rebooted to prevent random crashes due to the "zombie" device.

Because the speed change operation is unreliable it cannot be reliably placed in the boot sequence so this must be executed by hand on each boot to get access to the mass storage.

The NetSurf project already uses an x86_64 virtual host system which runs an LVM physical volume from which we allocate logical volumes for each VM. I initially hoped to do this with the HiKey, but as soon as I tried to use a logical volume with a VM the system locked up with nothing shown on the console. I did not really try very hard to discover why, and instead simply used files on disc for virtual drives, which seemed to work.

To provide reliable network access I used a USB attached Ethernet device; this, like the mass storage, suffered from unreliable enumeration and for similar reasons could not be automated, requiring manual use of the serial console to start the system.

Once the system was started I needed to install the guest VM. I had hoped I might be able to install locally from Debian install media, as I do for x86, using the libvirt tools. After a great deal of trial and error I was finally forced to abandon this approach when I discovered the Linaro kernel lacks iso9660 support, so installing from standard media was not possible.

Instead I used the instructions provided by Leif Lindholm to create a virtual machine image on my PC and copied the result across. These instructions are great, except that I used version 2.5 of Qemu instead of 2.2, which had no negative effect. I also installed the Debian backports for Jessie to get an up to date 4.3 kernel.

After copying the image to the Hikey I started it by hand from the command line as a four core virtual machine and was successfully able to log in. The guest would operate for up to a day before stopping with output such as

$
Message from syslogd@ciworker13 at Jan 29 07:45:28 ...
kernel:[68903.702501] BUG: soft lockup - CPU#0 stuck for 27s! [mv:24089]

Message from syslogd@ciworker13 at Jan 29 07:45:28 ...

kernel:[68976.958028] BUG: soft lockup - CPU#2 stuck for 74s! [swapper/2:0]

Message from syslogd@ciworker13 at Jan 29 07:47:39 ...
kernel:[69103.199724] BUG: soft lockup - CPU#3 stuck for 99s! [swapper/3:0]

Message from syslogd@ciworker13 at Jan 29 07:53:21 ...
kernel:[69140.321145] BUG: soft lockup - CPU#3 stuck for 30s! [rs:main Q:Reg:505]

Message from syslogd@ciworker13 at Jan 29 07:53:21 ...
kernel:[69192.880804] BUG: soft lockup - CPU#0 stuck for 21s! [jbd2/vda3-8:107]

Message from syslogd@ciworker13 at Jan 29 07:53:21 ...
kernel:[69444.805235] BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]

Message from syslogd@ciworker13 at Jan 29 07:55:21 ...
kernel:[69570.177600] BUG: soft lockup - CPU#1 stuck for 112s! [systemd:1]

Timeout, server 192.168.7.211 not responding.

After this output the host system (never mind the guest!) would not respond and had to be power cycled.

Once I changed to single-core operation the system would run for some time until the host suffered from the dreaded kernel OOM killer. I was at a loss as to why the OOM killer was running, as the VM was only allocated half the physical memory (512MB), leaving the host what I presumed to be an adequate amount.

By adding a 512MB swap file the system was able to push out the few hundred kilobytes it wanted to swap and became stable. The swap file of course has to be enabled by hand, as the external storage is unreliable and unavailable at boot.

I converted the qemu command line to a libvirt config using the virsh tool:
virsh domxml-from-native qemu-argv cmdln.args
The converted configuration required manual editing to get a working system, but now I have a libvirt-based VM guest I can control along with all my other VMs using the virt-manager tool.

This system is now stable and has been in production use for a month at the time of writing. The one guest VM is a single-core 512MB aarch64 system which takes over 1100 seconds (some 18 minutes) to do what a Banana Pi 2, a dual-core 1GB-memory 32-bit native ARM system, manages in 300 seconds.

It seems the single core limited memory system with USB SATA attached storage is very, very slow.

I briefly attempted to run the CI system job natively within the host system, but within minutes it crashed hard and required a power cycle to retrieve; it had also broken the UEFI boot. I must thank Leif for walking me through recovering the system, otherwise I would have needed to start over.
Conclusions
I must stress these conclusions and observations are my own and do not represent anyone else.

My main conclusions are:

This project has taken almost a year to get to its final state and has been one of the least enjoyable within that time. The only reason I have a running system at the end of it is sheer bloody-mindedness: after spending hundreds of hours of my free time I was not prepared to see it all go to waste.

To be fair, the road I travelled is now much smoother and if the application is suited to having a mobile phone without a screen then the Hikey probably works as a solution. For me, however, the Hikey product with the current hardware and software limitations is not something I would recommend in preference to other options.

12 January 2016

Bits from Debian: New Debian Developers and Maintainers (November and December 2015)

The following contributors got their Debian Developer accounts in the last two months: The following contributors were added as Debian Maintainers in the last two months: Congratulations!

29 December 2015

David Bremner: Converting PDFs to DJVU

Today I was wondering about converting a pdf made from a scan of a book into djvu, hopefully to reduce the size without too much loss of quality. My initial experiments with pdf2djvu were a bit discouraging, so I invested some time building gsdjvu in order to be able to run djvudigital. Watching the messages from djvudigital I realized that the reason it was achieving so much better compression was that it was using black and white for the foreground layer by default. I also figured out that the default 300dpi looks crappy since my source document is apparently 600dpi. I then went back and compared djvudigital to pdf2djvu a bit more carefully. My not-very-scientific conclusions: Perhaps most compellingly, the output from pdf2djvu has sensible metadata and is searchable in evince. Even with the --words option, the output from djvudigital is not. This is possibly related to error messages like
Can't build /Identity.Unicode /CIDDecoding resource. See gs_ciddc.ps .
It could well be my fault, because building gsdjvu involved guessing at corrections for several errors. Some of these issues have to do with building software from 2009 (the instructions suggest building with ghostscript 8.64) in a modern toolchain; others I'm not sure about. There was an upload of gsdjvu in February of 2015, somewhat to my surprise. AT&T has more or less crippled the project by licensing it under the CPL, which means binaries are not distributable, hence motivation to fix all the rough edges is minimal.
Version                                          kilobytes per page   position in figure
Original PDF                                     80.9                 top
pdf2djvu --dpi=450                               92.0                 not shown
pdf2djvu --monochrome --dpi=450                  27.5                 second from top
pdf2djvu --monochrome --dpi=600 --loss-level=50  21.3                 second from bottom
djvudigital --dpi=450                            29.4                 bottom
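For reference, a comparison like the one in the table above can be scripted; this sketch builds the pdf2djvu invocations used there and computes the kilobytes-per-page metric (constructing the command lines is runnable anywhere, though actually invoking pdf2djvu of course requires it to be installed):

```python
import subprocess

def pdf2djvu_cmd(src, dst, dpi, monochrome=False, loss_level=None):
    """Build a pdf2djvu command line matching the variants tested above."""
    cmd = ["pdf2djvu", "--dpi={}".format(dpi)]
    if monochrome:
        cmd.append("--monochrome")
    if loss_level is not None:
        cmd.append("--loss-level={}".format(loss_level))
    cmd += ["-o", dst, src]
    return cmd

def kb_per_page(size_bytes, pages):
    """The metric from the table: output size in kilobytes per page."""
    return round(size_bytes / 1024.0 / pages, 1)

# e.g. subprocess.check_call(pdf2djvu_cmd("scan.pdf", "scan.djvu", 600,
#                                         monochrome=True, loss_level=50))
```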
[Figure: djvu-compare.png — visual comparison of the versions listed in the table]

14 December 2014

Enrico Zini: html5-sse

HTML5 Server-sent events I have a Django view that runs a slow script server-side, and streams the script output to Javascript. This is the bit of code that runs the script and turns the output into a stream of events:
import fcntl
import os
import select

def stream_output(proc):
    '''
    Take a subprocess.Popen object and generate its output, line by line,
    annotated with "stdout" or "stderr". At process termination it generates
    one last element: ("result", return_code) with the return code of the
    process.
    '''
    fds = [proc.stdout, proc.stderr]
    bufs = [b"", b""]
    types = ["stdout", "stderr"]
    # Set both pipes as non-blocking
    for fd in fds:
        fcntl.fcntl(fd, fcntl.F_SETFL, fcntl.fcntl(fd, fcntl.F_GETFL) | os.O_NONBLOCK)
    # Multiplex stdout and stderr with different prefixes
    while len(fds) > 0:
        s = select.select(fds, (), ())
        for fd in s[0]:
            idx = fds.index(fd)
            buf = fd.read()
            if len(buf) == 0:
                # EOF on this pipe: drop it, flushing any buffered partial
                # line, and keep fds/bufs/types aligned
                fds.pop(idx)
                evtype = types.pop(idx)
                rest = bufs.pop(idx)
                if len(rest) != 0:
                    yield evtype, rest
            else:
                bufs[idx] += buf
                lines = bufs[idx].split(b"\n")
                bufs[idx] = lines.pop()
                for l in lines:
                    yield types[idx], l
    res = proc.wait()
    yield "result", res
I used to just serialize its output and stream it to JavaScript, then monitor onreadystatechange on the XMLHttpRequest object browser-side, but then it started failing on Chrome, which won't trigger onreadystatechange until something like a kilobyte of data has been received. I didn't want to stream a kilobyte of padding just to work around this, so it was time to try out Server-sent events. See also this. This is the Django view that sends the events:
class HookRun(View):
    def get(self, request):
        proc = run_script(request)
        def make_events():
            for evtype, data in utils.stream_output(proc):
                if evtype == "result":
                    yield "event: {}\ndata: {}\n\n".format(evtype, data)
                else:
                    yield "event: {}\ndata: {}\n\n".format(evtype, data.decode("utf-8", "replace"))
        return http.StreamingHttpResponse(make_events(), content_type='text/event-stream')

    @method_decorator(never_cache)
    def dispatch(self, *args, **kwargs):
        return super().dispatch(*args, **kwargs)
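The text/event-stream framing this view emits is simple: an `event:` line, one or more `data:` lines, and a blank line terminating each event. A standalone sketch of that framing (a hypothetical helper, not part of the original code):

```python
def format_sse(evtype, data):
    """Frame one server-sent event. A multi-line payload becomes several
    data: lines, which the browser re-joins with newlines in e.data."""
    lines = str(data).split("\n")
    return ("event: {}\n".format(evtype)
            + "".join("data: {}\n".format(l) for l in lines)
            + "\n")

print(format_sse("stdout", "hello"))
```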
And this is the template that renders it:
{% extends "base.html" %}
{% load i18n %}
{% block head_resources %}
{{ block.super }}
<style type="text/css">
.out {
    font-family: monospace;
    padding: 0;
    margin: 0;
}
.stdout {}
.stderr { color: red; }
.result {}
.ok { color: green; }
.ko { color: red; }
</style>
{# Polyfill for IE, typical... https://github.com/remy/polyfills/blob/master/EventSource.js #}
<script src="{{ STATIC_URL }}js/EventSource.js"></script>
<script type="text/javascript">
$(function() {
    // Manage spinners and other ajax-related feedback
    $(document).nav();
    $(document).nav("ajax_start");
    var out = $("#output");
    var event_source = new EventSource("{% url 'session_hookrun' name=name %}");
    event_source.addEventListener("open", function(e) {
      //console.log("EventSource open:", arguments);
    });
    event_source.addEventListener("stdout", function(e) {
      out.append($("<p>").attr("class", "out stdout").text(e.data));
    });
    event_source.addEventListener("stderr", function(e) {
      out.append($("<p>").attr("class", "out stderr").text(e.data));
    });
    event_source.addEventListener("result", function(e) {
      if (+e.data == 0)
          out.append($("<p>").attr("class", "result ok").text("{% trans 'Success' %}"));
      else
          out.append($("<p>").attr("class", "result ko").text("{% trans 'Script failed with code' %} " + e.data));
      event_source.close();
      $(document).nav("ajax_end");
    });
    event_source.addEventListener("error", function(e) {
      // There is an annoyance here: e does not contain any kind of error
      // message.
      out.append($("<p>").attr("class", "result ko").text("{% trans 'Error receiving script output from the server' %}"));
      console.error("EventSource error:", arguments);
      event_source.close();
      $(document).nav("ajax_end");
    });
});
</script>
{% endblock %}
{% block content %}
<h1>{% trans "Processing..." %}</h1>
<div id="output">
</div>
{% endblock %}
It's simple enough, it seems reasonably well supported besides needing a polyfill for IE and, astonishingly, it even works!

24 March 2014

Andrew Shadura: Tired of autotools? Try this: mk-configure

mk-configure is a project which tries to be autotools done right. Instead of supporting an exceedingly large number of platforms, modern and ancient, at the cost of generating unreadable multi-kilobyte shell scripts, mk-configure aims at better support for fewer platforms, but those which are really in use today. One of the main differences of this project is that it avoids code generation as much as possible. The author of mk-configure, Aleksey Cheusov, a NetBSD hacker from Belarus, uses NetBSD make (bmake) and shell script snippets instead of monstrous libraries written in m4 interleaved with shell scripts. As a result, there's no need for a separate package configuration step or for bootstrapping a configure script; everything is done by just running bmake, or a convenience wrapper for it, mkcmake, which prepends the proper library path to the bmake arguments so you don't have to specify it yourself. Today, mk-configure is already powerful enough to replace autotools for most projects, and what is missing from it can easily be done by hacking the Makefile, which would otherwise be quite simple. Try it for your project, you may really like it. I already did. And report bugs.

29 November 2013

Axel Beckert: PDiffs are still useful

probably just not as default. I do agree with Richi and with Michael that disabling PDiffs by default gives the big majority of Debian Testing or Unstable users a speedier package list update. I'm though not sure if disabling PDiffs by default would Additionally I want to remind you that PDiffs per se are nothing bad and should continue to be supported: So yes, disabling PDiffs by default is probably ok, but the feature must be kept available for those who haven't a 100 MBit/s fibre connection into their homes or are sitting just one hop away from the next Debian mirror (like me at work :-). Oh, and btw., for the very same reasons I'm also a big fan of debdelta, which is approximately the same as PDiffs, just not for package lists but for binary packages. Using debdelta I was able to speed up my download rates over EDGE to up to a virtual 100 kB/s, i.e. by a factor of four (depending on the packages). Just imagine a LibreOffice minor update at a 15 kB/s average download rate. ;-) And all these experiences were made not with a high-performance CPU but with the approximately five-year-old Intel Atom processor of my ASUS EeePC 900A. So I used PDiffs and debdelta even despite a slight performance penalty for applying the diffs and deltas.
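The "virtual" download rate follows directly from the delta-to-package size ratio: if only the delta crosses the wire, the effective rate scales by the inverse of that ratio. A toy calculation (the package and delta sizes are hypothetical, chosen to illustrate the factor-four case):

```python
def virtual_rate(link_kbps, full_size_kb, delta_size_kb):
    """Effective download rate when only a delta the size of
    delta_size_kb has to be fetched to obtain full_size_kb of package."""
    return link_kbps * full_size_kb / delta_size_kb

# A 25 kB/s EDGE link fetching a delta one quarter the package size:
print(virtual_rate(25, 4000, 1000))  # -> 100.0 kB/s, i.e. a factor of four
```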

5 September 2013

Vincent Sanders: Strive for continuous improvement, instead of perfection.


Kim Collins was perhaps thinking more about physical improvement but his advice holds well for software.

A lot has been written about the problems around software engineers wanting to rewrite a codebase because of "legacy" issues. Experience has taught me that refactoring is a generally better solution than rewriting because you always have something that works and can be released if necessary.

Although that observation applies to the whole of a project, sometimes the technical debt in component modules means they need reimplementing. Within NetSurf we have historically had problems when such a change was done because of the large number of supported platforms and configurations.
History
A year ago I implemented a Continuous Integration (CI) solution for NetSurf which, combined with our switch to Git for revision control, has transformed our process, making several important refactors and rewrites possible while keeping us confident about overall project stability.

I know it has been a year because the VPS hosting bill from Mythic turned up and we are still a very happy customer. We have taken the opportunity to extend the infrastructure to add additional build systems, which is still within the NetSurf project's means.

Over the last twelve months the CI system has attempted over 100,000 builds, including the project's libraries and browser. Each commit causes an attempt to build for eight platforms, in multiple configurations with multiple compilers. Because of this the response time to a commit is dependent on the slowest build slave (the Mac mini OS X Leopard system).

Currently this means a browser build, not including the support libraries, completes in around 450 seconds. The eleven support libraries range from 30 to 330 seconds each. This gives a reasonable response time for most operations. The worst case involves changing the core buildsystem which causes everything to be rebuilt from scratch taking some 40 minutes.

The CI system has gained capability since it was first set up; there are now jobs that:
Downsides
It has not all been positive though: the administration required to keep the builds running has been more than expected, and it has highlighted just how costly supporting all our platforms is. When I say costly I do not just refer to the physical requirements of providing build slaves but, more importantly, the time required.

Some examples include:
  • Procuring the hardware, installing the operating system and configuring the build environment for the OS X slaves
  • Getting the toolchain built and installed for cross compilation
  • Dealing with software upgrades and updates on the systems
  • Solving version issues between interacting parts; especially limiting is the lack of Java 1.6 on PPC OS X, preventing Jenkins updates
This administration is not interesting to me and consumes time which could otherwise be spent improving the browser, though the benefits of having the system are considered by the development team to outweigh the disadvantages.

The highlighting of the costs of supporting so many platforms has led us to reevaluate their future viability. Certainly the PPC OS X port is in gravest danger of being imminently dropped, and when its build slave's drive failed it was only saved because there were actual users.

There is also the question of the BeOS platform, which we are currently unable to build with the CI system at all, as it cannot be targeted for cross compilation and cannot run a sufficiently complete Java implementation to run a Jenkins slave.

An unexpected side effect of publishing every CI build has been that many non-developer users are directly downloading and using these builds. In some cases we get messages to the mailing list about a specific build while the rest of the job is still ongoing.

Despite the prominent warning on the download area and clear explanation on the mailing lists we still get complaints and opinions about what we should be "allowing" in terms of stability and features in these builds. For anyone else considering allowing general access to CI builds I would recommend a very clear statement of intent and having a policy prepared for when users ignore the statement.
Tools
Using jenkins has also been a learning experience. It is generally great but there are some issues I have which, while not insurmountable, are troubling:
Configuration and history cannot easily be stored in a revision control system.
This means our system has to be restored from a backup in case of failure and I cannot simply redeploy it from scratch.

Job filtering, especially for matrix jobs with many combinations, is unnecessarily complicated.
This requires the use of a single-line text "combination filter", a Java expression limiting which combinations are built. An interface allowing the user to graphically select from a grid similar to the output tables showing success would be preferable. Such a tool could even generate the textual combination filter if that's easier.

This is especially problematic for the main browser job, which has options for label (platform that can compile the build), JavaScript enablement, compiler and frontend (the windowing toolkit, if you prefer; e.g. the linux label can build both gtk and framebuffer). The filter for this job is several kilobytes of text which, due to the first issue, has to be cut and pasted by hand.
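As a sketch of the grid-selection idea, here is a hypothetical generator that turns a list of allowed axis combinations into the kind of textual combination filter a Jenkins matrix job expects (the axis names are taken from the browser job described above; the exact expression syntax is an assumption based on the Groovy-style filters matrix jobs use):

```python
def combination_filter(allowed):
    """Build a matrix combination filter string from a list of dicts,
    each mapping axis name -> required value for one allowed combination."""
    clauses = []
    for combo in allowed:
        terms = ['{}=="{}"'.format(axis, value)
                 for axis, value in sorted(combo.items())]
        clauses.append("(" + " && ".join(terms) + ")")
    return " || ".join(clauses)

print(combination_filter([
    {"label": "linux", "frontend": "gtk"},
    {"label": "linux", "frontend": "framebuffer"},
]))
```

A grid UI would only need to collect the ticked cells into that list of dicts; the several-kilobyte filter text then never has to be edited by hand.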

Handling of simple Makefile-based projects is rudimentary.
This has been worked around mainly by creating shell scripts to perform the builds. These scripts are checked into the repositories so they are readily modified. Initially we had the text in each job but that quickly became unmanageable.

Output parsing is limited.
Fortunately several plugins are available which mitigate this issue but I cannot help feeling that they ought to be integrated by default.

Website output is not readily modifiable.
Instead of perhaps providing a default CSS file, with all generated content using that styling, someone with knowledge of Java must write a plugin to change any part of the look and feel of the Jenkins tool. I understand this helps all Jenkins instances look like the same program, but it means integrating Jenkins into the rest of our project's web site is not straightforward.
Conclusion
In conclusion I think the CI system is an invaluable tool for almost any non-trivial software project, but the implementation costs with current tools should not be underestimated.

18 June 2013

Daniel Pocock: RSA Key Sizes: 2048 or 4096 bits?

Many people are taking a fresh look at IT security strategies in the wake of the NSA revelations. One of the issues that comes up is the need for stronger encryption, using public key cryptography instead of just passwords. This is sometimes referred to as certificate authentication, but certificates are just one of many ways to use public key technology. One of the core decisions in this field is the key size. Most people have heard that 1024 bit RSA keys have been cracked and are not used any more for web sites or PGP. The next most fashionable number after 1024 appears to be 2048, but a lot of people have also been skipping that and moving to 4096 bit keys. This has led to some confusion as people try to make decisions about which smartcards to use, which type of CA certificate to use, etc. The discussion here is exclusively about RSA key pairs, although the concepts are similar for other algorithms (although key lengths are not equivalent).
The case for using 2048 bits instead of 4096 bits
  • Some hardware (many smart cards, some card readers, and some other devices such as Polycom phones) don't support anything bigger than 2048 bits.
  • Uses less CPU than a longer key during encryption and authentication
  • Using less CPU means using less battery power (important for mobile devices)
  • Uses less storage space: while not an issue on disk, this can be an issue in small devices like smart cards that measure their RAM in kilobytes rather than gigabytes
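The CPU cost difference in the list above can be put in rough numbers: schoolbook modular exponentiation costs on the order of n^3 in the modulus length n, so doubling the key size multiplies private-key operation cost by about eight. This is an illustrative model only; real implementations (CRT, Karatsuba, etc.) change the constants:

```python
def relative_cost(bits, reference_bits=2048):
    """Crude O(n^3) model of RSA private-key operation cost,
    normalised to a 2048 bit key. Illustrative only."""
    return (bits / reference_bits) ** 3

print(relative_cost(4096))  # -> 8.0: roughly eight times the work of 2048
```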
So there are some clear benefits of using 2048 bit keys and not just jumping on the 4096 bit key bandwagon.
The case for using 4096 bits
  • For some types of attack, security does not just double, it grows exponentially; 4096 bits is significantly more secure in this scenario. If an attack is found that allows a 2048 bit key to be hacked in 100 hours, that does not imply that a 4096 bit key can be hacked in 200 hours. The hack that breaks a 2048 bit key in 100 hours may still need many years to crack a single 4096 bit key
  • Some types of key (e.g. an OpenPGP primary key which is signed by many other people) are desirable to keep for an extended period of time, perhaps 10 years or more. In this context, the hassle of replacing all those signatures may be quite high and it is more desirable to have a long-term future-proof key length.
The myth of certificate expiration
Many types of public key cryptography, such as X.509, offer an expiry feature. This is not just a scheme to force you to go back to the certificate authority and pay more money every 12 months. It provides a kind of weak safety net in the case where somebody is secretly using an unauthorised copy of the key or a certificate that the CA issued to an imposter. However, the expiry doesn't eliminate future algorithmic compromises. If, in the future, an attacker succeeds in finding a shortcut to break 2048 bit keys, then they would presumably crack the root certificate as easily as they crack the server certificates and then, using their shiny new root key, they would be in a position to issue new server certificates with extended expiry dates. Therefore, the expiry feature alone doesn't protect against abuse of the key in the distant future. It does provide some value though: forcing people to renew certificates periodically allows the industry to bring in new minimum key length standards from time to time. In practical terms, content signed with a 2048 bit key today will not be valid indefinitely. Imagine in the year 2040 you want to try out a copy of some code you released with a digital signature in 2013. In 2040, that signature may not be trustworthy: most software in that era would probably see the key and tell you there is no way you can trust it. The NIST speculates that 2048 bit keys will be valid up to about the year 2030, so that implies that any code you sign with a 2048 bit key today will have to be resigned with a longer key in the year 2029. You would do that re-signing in the 2048 bit twilight period while you still trust the old signature. Fortunately, there are likely to be few projects where such old code will be in demand.
4096 in practice
One of the reasons I decided to write this blog is the fact that some organisations have made 4096 bit keys very prominent (although nobody has made them mandatory as far as I am aware). Debian's guide to key creation currently recommends 4096 bit keys (although it doesn't explicitly mandate their use), Fedora's archive keys are all 4096 bit keys, and the CACert.org project has developed a 4096 bit root. These developments may leave people feeling a little bit naked if they have to use a shorter 2048 bit key for any of the reasons suggested above (e.g. for a wider choice of smart cards and compatibility with readers). It has also resulted in some people spending time looking for 4096 bit smart cards and compatible readers when they may be better off just using 2048 bits and investing their time in other security improvements. In fact, the "risk" of using only 2048 rather than 4096 bits in the smartcard may well be far outweighed by the benefits of hardware security (especially if a smartcard reader with pin-pad is used). My own conclusion is that 2048 is not a dead duck, and using this key length remains a valid decision that is very likely to remain so for the next 5 years at least. The US NIST makes a similar recommendation and suggests it will be safe until 2030, although it is the minimum key length they have recommended. My feeling is that the Debian preference for 4096 bit PGP keys is not based solely on security; rather, it is also influenced by the fact that Debian is a project run by volunteers. Given this background, there is a perception that if everybody migrates from 1024 to 2048, then there would be another big migration effort to move all users from 2048 to 4096, and that those two migrations could be combined into a single effort going directly from 1024 to 4096, reducing the future workload of the volunteers who maintain the keyrings.
This is a completely rational decision for administrative reasons, but it is not a decision that questions the security of using 2048 bit keys today. Therefore, people should not see Debian's preference for 4096 bit keys as a hint that 2048 bit keys are fundamentally flawed. Unlike the Debian keys (which are user keys), the CACert.org roots and Fedora archive signing keys are centrally managed keys with a long lifetime, and none of the benefits of using 2048 bit keys is a compelling factor in those use cases.
Practical issues to consider when choosing key length
The choice of using 2048 or 4096 is not pre-determined, and it can be balanced with a range of other decisions:
  • Key lifetime: is it a long life key, such as an X.509 root for an in-house CA or an OpenPGP primary key? Or is it just for a HTTPS web server or some other TLS server that can be replaced every two years?
  • Is it for a dedicated application (e.g. a closed user group all using the same software supporting 4096 bit) or is it for a widespread user base where some users need to use 2048 bit due to old software/hardware?
  • Is it necessary to use the key(s) in a wide variety of smartcard readers?
  • Is it a mobile application (where battery must be conserved) or a server that is likely to experience heavy load?

17 April 2013

Joey Hess: Template Haskell on impossible architectures

Imagine you had an excellent successful Kickstarter campaign, and during it a lot of people asked for an Android port to be made of the software. Which is written in Haskell. No problem, you'd think -- the user interface can be written as a local webapp, which will be nicely platform agnostic and so make it easy to port. Also, it's easy to promise a lot of stuff during a Kickstarter campaign. Keeps the graph going up. What could go wrong? So, rather later you realize there is no Haskell compiler for Android. At all. But surely there will be eventually. And so you go off and build the webapp. Since Yesod seems to be the pinnacle of type-safe Haskell web frameworks, you use it. Hmm, there's this Template Haskell stuff that it uses a lot, but it only makes compiles a little slow, and the result is cool, so why not. Then, about half-way through the project, it seems time to get around to this Android port. And, amazingly, a Haskell compiler for Android has appeared in the meantime. Like the Haskell community has your back. (Which they generally seem to.) It's early days and rough, lots of libraries need to be hacked to work, but it only takes around 2 weeks to get a port of your program that basically works. But, no webapp. Cause nobody seems to know how to make a cross-compiling Haskell compiler do Template Haskell. (Even building a fully native compiler on Android doesn't do the trick. Perhaps you missed something though.) At this point you can give up and write a separate Android UI (perhaps using these new Android JNI bindings for Haskell that have also appeared in the meantime). Or you can procrastinate for a while, and mull it over; consider rewriting the webapp to not use Yesod but some other framework that doesn't need Template Haskell. Eventually you might think this: If I run ghc -ddump-splices when I'm building my Yesod code, I can see all the thousands of lines of delicious machine generated Haskell code. 
I just have to paste that in, in place of the Template Haskell that generated it, and I'll get a program I can build on Android! What could go wrong? And you even try it, and yeah, it seems to work. For small amounts of code that you paste in and carefully modify and get working. Not a whole big, constantly improving webapp where every single line of html gets converted to types and lambdas that are somehow screamingly fast. So then, let's automate this pasting. And so the EvilSplicer is born! That's a fairly general-purpose Template Haskell splicer. First do a native build with -ddump-splices output redirected to a log file. Run the EvilSplicer to fix up the code. Then run an Android cross-compile. But oh, the caveats. There are so many ways this can go wrong.. Anyway, if you struggle with it, or pay me vast quantities of money, your program will, eventually, link. And that's all I can promise for now.
PS, I hope nobody will ever find this blog post useful in their work. PPS, Also, if you let this put you off Haskell in any way .. well, don't. You just might want to wait a year or so before doing Haskell on Android.

9 February 2013

Matthew Garrett: Samsung laptop bug is not Linux specific

I bricked a Samsung laptop today. Unlike most of the reported cases of Samsung laptops refusing to boot, I never booted Linux on it - all experimentation was performed under Windows. It seems that the bug we've been seeing is simultaneously simpler in some ways and more complicated in others than we'd previously realised.

So, some background. The original belief was that the samsung-laptop driver was doing something that caused the system to stop working. This driver was coded to a Samsung specification in order to support certain laptop features that weren't accessible via any standardised mechanism. It works by searching a specific area of memory for a Samsung-specific signature. If it finds it, it follows a pointer to a table that contains various magic values that need to be written in order to trigger some system management code that actually performs the requested change. This is unusual in this day and age, but not unique. The problem is that the magic signature is still present on UEFI systems, but attempting to use the data contained in the table causes problems.

We're not quite sure what those problems are yet. Originally we assumed that the magic values we wrote were causing the problem, so the samsung-laptop driver was patched to disable it on UEFI systems. Unfortunately, this doesn't actually fix the problem - it just avoids the easiest way of triggering it. It turns out that it wasn't the writes that caused the problem, it was what happened next. Performing the writes triggered a hardware error of some description. The Linux kernel caught and logged this. In the old days, people would often never see these logs - the system would then be frozen and it would be impossible to access the hard drive, so they never got written to disk. There's code in the kernel to make this easier on UEFI systems. Whenever a severe error is encountered, the kernel copies recent messages to the UEFI variable storage space. They're then available to userspace after a reboot, allowing more accurate diagnostics of what caused the crash.

That crash dump takes about 10K of UEFI storage space. Microsoft require that Windows 8 systems have at least 64K of storage space available. We only keep one crash dump - if the system crashes again it'll simply overwrite the existing one rather than creating another. This is all completely compatible with the UEFI specification, and Apple actually do something very similar on their hardware. Unfortunately, it turns out that some Samsung laptops will fail to boot if too much of the variable storage space is used. We don't know what "too much" is yet, but writing a bunch of variables from Windows is enough to trigger it. I put some sample code here - it writes out 36 variables each containing a kilobyte of random data. I ran this as an administrator under Windows and then rebooted the system. It never came back.

This is pretty obviously a firmware bug. Writing UEFI variables is expressly permitted by the specification, and there should never be a situation in which an OS can fill the variable store in such a way that the firmware refuses to boot the system. We've seen similar bugs in Intel's reference code in the past, but they were all fixed early last year. For now the safest thing to do is not to use UEFI on any Samsung laptops. Unfortunately, if you're using Windows, that'll require you to reinstall it from scratch.


14 April 2012

Axel Beckert: Automatically hardlinking duplicate files under /usr/share/doc with APT

On my everyday netbook (a very reliable first generation ASUS EeePC 701 4G) the disk (4 GB as the product name suggests :-) is nearly always close to full. TL;DWTR? Jump directly to the HowTo. :-) So I came up with a few techniques to save some more disk space. Installing localepurge was one of the earliest. Another one was to implement aptitude filters to do interactively what deborphan does non-interactively. Yet another one is to use du and friends a lot; ncdu is definitely my favourite du-like tool in the meanwhile. Using du and friends I often noticed how much disk space /usr/share/doc takes up. But since I value the contents of /usr/share/doc a lot, I condemn how Nokia solved that on the N900: they let APT delete all files and directories under /usr/share/doc (including the copyright files!) via some package named docpurge. I also dislike Ubuntu's solution of truncating the shipped changelog files (you can still get the remainder of the files on the web somewhere), as they're an important source of information for me. So when aptitude showed me that some package suddenly wanted to use up quite some more disk space, I noticed that the new package version included the upstream changelog twice. So I started searching for duplicate files under /usr/share/doc. There are quite some tools to find duplicate files in Debian; hardlink seemed most appropriate for this case. First I just looked for duplicate files per package, which even on that less than four gigabytes installation on my EeePC found nine packages which shipped at least one file twice. As recommended, I rather opted for a corresponding Lintian check. Niels Thykier kindly implemented such a check in Lintian, and its findings are reported as the tags duplicate-changelog-files (Severity: normal, from Lintian 2.5.2 on) and duplicate-files (Severity: minor, experimental, from Lintian 2.5.0 on).
Nevertheless, some source packages generate several binary packages and all of them (of course) ship the same, in some cases quite large, (Debian) changelog file. So I found myself running hardlink /usr/share/doc now and then to gain some more free disk space. But as I run Sid and package upgrades happen more than daily, I came to the conclusion that I should run this command more or less after each aptitude run, i.e. automatically. Taking localepurge's APT hook as an example, I added the following content as /etc/apt/apt.conf.d/98-hardlink-doc to my system:
// Hardlink identical docs, changelogs, copyrights, examples, etc
DPkg
{
  Post-Invoke {"if [ -x /usr/bin/hardlink ]; then /usr/bin/hardlink -t /usr/share/doc; else exit 0; fi";};
};
So now installing a package which contains duplicate files looks like this:
~ # aptitude install perl-tk
The following NEW packages will be installed:
  perl-tk 
0 packages upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 2,522 kB of archives. After unpacking 6,783 kB will be used.
Get: 1 http://ftp.ch.debian.org/debian/ sid/main perl-tk i386 1:804.029-1.2 [2,522 kB]
Fetched 2,522 kB in 1s (1,287 kB/s)  
Selecting previously unselected package perl-tk.
(Reading database ... 121849 files and directories currently installed.)
Unpacking perl-tk (from .../perl-tk_1%3a804.029-1.2_i386.deb) ...
Processing triggers for man-db ...
Setting up perl-tk (1:804.029-1.2) ...
Mode:     real
Files:    15423
Linked:   3 files
Compared: 14724 files
Saved:    7.29 KiB
Duration: 4.03 seconds
localepurge: Disk space freed in /usr/share/locale: 0 KiB
localepurge: Disk space freed in /usr/share/man: 0 KiB
localepurge: Disk space freed in /usr/share/gnome/help: 0 KiB
localepurge: Disk space freed in /usr/share/omf: 0 KiB
Total disk space freed by localepurge: 0 KiB
Sure, that wasn't the most space-saving example, but on some installations I saved around 100 MB of disk space that way, and I still haven't found a case where this caused unwanted damage. (Use this advice at your own risk, though. Pointers to potential problems welcome. :-)

21 August 2011

Vincent Sanders: A year of entropy

It has been a couple of years now since the release of the Entropy Key. Around a year ago we finally managed to have enough stock on hand that I obtained a real production unit and installed it in my border router.

I installed the Debian packages, configured the ekeyd into EGD server mode and installed the EGD client packages on my other machines and pretty much forgot about it.

The recent new release of the ekey host software (version 1.1.4) reminded me that I had been quietly collecting statistics for almost a whole year and had some munin graphs to share.

The munin graphs of the generated output are pretty dull. Aside from the minor efficiency improvement in the 1.1.3 release installed mid December, the generated rate has been a flat 3.93 kilobytes a second.
The temperature sensor on the Entropy key shows a good correlation with the on-board CPU thermal sensors within the host system.
The host border router/server is a busy box which provides most network services, including secure LDAP and SSL web services; it showed no sign of lacking entropy at any point in the year.
The site's main file server and compile engine is a 4-core, 8-gigabyte system with 12 drives. This system is heavily used, with high load almost all the time, but without the EGD client running it has almost no entropy available.
The next system is my personal workstation. This machine often gets rebooted and is usually turned off overnight which is why there are gaps in the graph and odd discontinuities. Nonetheless entropy is always available just like the rest of my systems ;-)
And almost as a "control" here is a file server on the same network which has not been running EGD client (Ok, Ok already it was misconfigured and I am an idiot ;-)
In conclusion it seems an entropy key can keep at least this small network completely filled up with all the entropy it needs without much fuss. YAY!

Christian Perrier: 10 years being Debian Developer - part 1: before 1992

So, on July 31st 2011, it was exactly 10 years since I became a Debian developer. What happened in the meantime? What led me to this? What turned this unskilled dude into a sometimes quite visible contributor to one of the major free software projects? If you're interested in that, please read on. Otherwise, well, this is just yet another "bubulle talks about self" post and you can skip it. Well, first of all, how did I end up being a DD? And first of firsts, how did I end up not being a random user of Windows, playing games on his desktop computer at work? I am not a computer scientist, an "informaticien" as we sometimes say in French (most often, that's what people who are not in computing say... thinking that all folks working more or less closely with computers know everything about them). I graduated with a PhD in Materials Science in 1989, after conducting research on "Influence of Yttrium Oxide dispersion on the strength of titanium alloys" at Onera, a French public research institution for aerospace and defence. I was then hired, still at Onera (where I'm still working, 25 years after starting my PhD work), to lead the Mechanical Testing Laboratory in the Materials Science department. So, my lab had many big machines designed to conduct creep and fatigue tests, often at interestingly high temperatures such as 2000°C, on samples of various materials that are used in aircraft structures, engines, etc., or in various space thingies (remember Hermès, the European shuttle?) or in various "things that fly but just one way and you shouldn't be there when they land". So, we had computers handling data acquisition for these tests, and I became involved with designing data acquisition setups, or even programming data acquisition programs (one of mine, written in Forth, ran all Onera creep tests until 2004 and successfully passed Y2K because I knew that 2000 was a leap year.
It could even pass 2100, as I knew this is not a leap year. :-) One day, I had to buy a modem in order to communicate with our Italian counterpart (CIRA) and exchange test results. Then I had to learn how modems work in MS-DOS (yeah...). Then, I discovered "online" resources... indeed that strange world that was then called "BBS" (Bulletin Board Systems), those mysterious things run by a happy few who were using modems at their home place to communicate in "forums" and everything related to technology. PC-Board, Remote Access, Fidonet, etc. became familiar to me in those days, back in 1988-1990. So familiar that I ended up running my own BBS at home, killing our phone bills and using very sophisticated techniques such as US Robotics "High Speed Transfer" modems that could be used at 19200 bps asymmetrically for very high speed transfers of kilobytes and kilobytes of useless MS-DOS "freeware" and "shareware". Indeed, my very first BBS didn't even have one of these: it was using a cheaper V.22bis modem operating at 2400 bps. I was waiting for my order of a sophisticated HST modem to arrive from the USA through obscure import channels meant to circumvent the French telephone company regulations that were forbidding the use of "unapproved" systems, in order to protect the famous French Telephone System from interference brought by Bad American Material. Bubulle System was born. During those years, I discovered very interesting things: computers can run together once you draw a wire between them.
That's called a Local Area Network, and you can even transfer data at 1 Mbps between two computers, assuming you don't forget to put terminating resistors at the end of the line on these funny 10-baseT connectors at the back of your home-assembled PC that was using 2MB of RAM (bought for very cheap through the help of an American friend who was in touch with some Chinese folks who were selling 256kb RAM sticks for half the retail price... assuming you were willing to drive to a mysterious storage place close to Charles de Gaulle airport or Eurodisney). Hello Gordon. Yes, I know, we're still friends on Facebook and you're probably still using that weird programming language which you were, IIRC, the only person in the world to use. 2MB, that should be enough for barely anything, including running *multitasking* software where, miracle, I could run two tasks at the same time on my one and only PC at home. Miracle: I don't have to shut down my BBS if I want to read forums on my friends' BBSes. Yay for Desqview/386! Multitasking for dummies. Still, I have to build one of these "networks" at home. Elizabeth won't like that, as it means one more box (home-made, of course) in our living room and some more wires. And why the hell is it running 24/7? So that friends can visit the BBS, darling... And I can even communicate with them: I write a message, I get an answer the day after, and so on. In one full week, we can have a great conversation that would have taken *minutes* to have in real life. Isn't this the miracle of technology? And, yes, this is a good reason for having phone bills rising to 500 Francs/month: people can "exchange" programs through my BBS, that horrible white box running in the living room. Often, these programs are written for free and some of them are even given with "source code", which allows people to *modify* them. And that even makes friends, you know?
Imagine that some day we have to move from one house to another: then I can just call out "who wants to help bubulle move?", and probably a dozen of (sometimes scary but always nice and polite... and sometimes even showered) geeks will pop up and happily carry boxes full of my vinyl LP collection all day long. An entire world of friends. "bubulle", you say, dear? What's that? That's my nickname. It was invented by one of these friends, a really strange guy named René Cougnenc, who wrote this "free software" program named BBTH, which allows you to use modems to connect to BBSes, and even to those many "Minitel" BBS we have in France, thanks to our wonderful world-leading technology using V.23 communication. Many people know me as "bubulle" because, you know, Perrier water has bubbles, and René liked Gaston Lagaffe's fish companion, whom he named "bubulle". René, I love you. You chose to leave this world back in 1998. We'll no longer have our "p'tits midis" in Antony where you were showing me the marvels of what's coming in the next episode. All this happened between about 1988 and 1992, doing all these mysterious things at home (between Jean-Baptiste's and Sophie's diapers) while still working with data acquisition MS-DOS machines at work. How did all this end up with me becoming a Debian Developer? You'll know in the next episode..:-)

15 March 2011

Colin Watson: Wubi bug 693671

I spent most of last week working on Ubuntu bug 693671 ("wubi install will not boot - phase 2 stops with: Try (hd0,0): NTFS5"), which was quite a challenge to debug since it involved digging into parts of the Wubi boot process I'd never really touched before. Since I don't think much of this is very well-documented, I'd like to spend a bit of time explaining what was involved, in the hope that it will help other developers in the future. Wubi is a system for installing Ubuntu into a file in a Windows filesystem, so that it doesn't require separate partitions and can be uninstalled like any other Windows application. The purpose of this is to make it easy for Windows users to try out Ubuntu without the need to worry about repartitioning, before they commit to a full installation. Wubi started out as an external project, and initially patched the installer on the fly to do all the rather unconventional things it needed to do; we integrated it into Ubuntu 8.04 LTS, which involved turning these patches into proper installer facilities that could be accessed using preseeding, so that Wubi only needs to handle the Windows user interface and other Windows-specific tasks. Anyone familiar with a GNU/Linux system's boot process will immediately see that this isn't as simple as it sounds. Of course, ntfs-3g is a pretty solid piece of software so we can handle the Windows filesystem without too much trouble, and loopback mounts are well-understood so we can just have the initramfs loop-mount the root filesystem. Where are you going to get the kernel and initramfs from, though? Well, we used to copy them out to the NTFS filesystem so that GRUB could read them, but this was overly complicated and error-prone. When we switched to GRUB 2, we could instead use its built-in loopback facilities, and we were able to simplify this. So all was more or less well, except for the elephant in the room. How are you going to load GRUB? 
In a Wubi installation, NTLDR (or BOOTMGR in Windows Vista and newer) still owns the boot process. Ubuntu is added as a boot menu option using BCDEdit. You might then think that you can just have the Windows boot loader chain-load GRUB. Unfortunately, NTLDR only loads 16 sectors - 8192 bytes - from disk. GRUB won't fit in that: the smallest core.img you can generate at the moment is over 18 kilobytes. Thus, you need something that is small enough to be loaded by NTLDR, but that is intelligent enough to understand NTFS to the point where it can find a particular file in the root directory of a filesystem, load boot loader code from it, and jump to that. The answer for this was GRUB4DOS. Most of GRUB4DOS is based on GRUB Legacy, which is not of much interest to us any more, but it includes an assembly-language program called GRLDR that supports doing this very thing for FAT, NTFS, and ext2. In Wubi, we build GRLDR as wubildr.mbr, and build a specially-configured GRUB core image as wubildr. Now, the messages shown in the bug report suggested a failure either within GRLDR or very early in GRUB. The first thing I did was to remember that GRLDR has been integrated into the grub-extras ntldr-img module suitable for use with GRUB 2, so I tried building wubildr.mbr from that; no change, but this gave me a modern baseline to work on. OK; now to try QEMU (you can use tricks like qemu -hda /dev/sda if you're very careful not to do anything that might involve writing to the host filesystem from within the guest, such as recursively booting your host OS ... [update: Tollef Fog Heen and Zygmunt Krynicki both point out that you can use the -snapshot option to make this safer]). No go; it hung somewhere in the middle of NTLDR. Still, I could at least insert debug statements, copy the built wubildr.mbr over to my test machine, and reboot for each test, although it would be slow and tedious. Couldn't I? 
Well, yes, I mostly could, but that 8192-byte limit came back to bite me, along with an internal 2048-byte limit that GRLDR allocates for its NTFS bootstrap code. There were only a few spare bytes. Something like this would more or less fit, to print a single mark character at various points so that I could see how far it was getting:
	pushal
	xorw	%bx, %bx	/* video page 0 */
	movw	$0x0e4d, %ax	/* print 'M' */
	int	$0x10
	popal
In a few places, if I removed some code I didn't need on my test machine (say, CHS compatibility), I could even fit in cheap and nasty code to print a single register in hex (as long as you didn't mind 'A' to 'F' actually being ':' to '?' in ASCII; and note that this is real-mode code, so the loop counter is %cx not %ecx):
	/* print %edx in dumbed-down hex */
	pushal
	xorw	%bx, %bx
	movb	$0xe, %ah
	movw	$8, %cx
1:
	roll	$4, %edx
	movb	%dl, %al
	andb	$0xf, %al
	addb	$0x30, %al	/* '0' offset; no A-F fixup, hence ':'..'?' */
	int	$0x10
	loop	1b
	popal
After a considerable amount of work tracking down problems by bisection like this, I also observed that GRLDR's NTFS code bears quite a bit of resemblance in its logical flow to GRUB 2's NTFS module, and indeed the same person wrote much of both. Since I knew that the latter worked, I could use it to relieve my brain of trying to understand assembly code logic directly, and could compare the two to look for discrepancies. I did find a few of these, and corrected a simple one. Testing at this point suggested that the boot process was getting as far as GRUB but still wasn't printing anything. I removed some Ubuntu patches which quieten down GRUB's startup: still nothing - so I switched my attentions to grub-core/kern/i386/pc/startup.S, which contains the first code executed from GRUB's core image. Code before the first call to real_to_prot (which switches the processor into protected mode) succeeded, while code after that point failed. Even more mysteriously, code added to real_to_prot before the actual switch to protected mode failed too. Now I was clearly getting somewhere interesting, but what was going on? What I really wanted was to be able to single-step, or at least see what was at the memory location it was supposed to be jumping to. Around this point I was venting on IRC, and somebody asked if it was reproducible in QEMU. Although I'd tried that already, I went back and tried again. Ubuntu's qemu is actually built from qemu-kvm, and if I used qemu -no-kvm then it worked much better. Excellent! Now I could use GDB:
(gdb) target remote | qemu -gdb stdio -no-kvm -hda /dev/sda
This let me run until the point when NTLDR was about to hand over control, then interrupt and set a breakpoint at 0x8200 (the entry point of startup.S). This revealed that the address that should have been real_to_prot was in fact garbage. I set a breakpoint at 0x7c00 (GRLDR's entry point) and stepped all the way through to ensure it was doing the right thing. In the process it was helpful to know that GDB and QEMU don't handle real mode very well between them. Single-stepping showed that GRLDR was loading the entirety of wubildr correctly and jumping to it. The first instruction it jumped to wasn't in startup.S, though, and then I remembered that we prefix the core image with grub-core/boot/i386/pc/lnxboot.S. Stepping through this required a clear head since it copies itself around and changes segment registers a few times. The interesting part was at real_code_2, where it copies a sector of the kernel to the target load address, and then checks a known offset to find out whether the "kernel" is in fact GRUB rather than a Linux kernel. I checked that offset by hand, and there was the smoking gun. GRUB recently acquired Reed-Solomon error correction on its core image, to allow it to recover from other software writing over sectors in the boot track. This moved the magic number lnxboot.S was checking somewhat further into the core image, after the first sector. lnxboot.S couldn't find it because it hadn't copied it yet! A bit of adjustment and all was well again. The lesson for me from all of this has been to try hard to get an interactive debugger working. Really hard. It's worth quite a bit of up-front effort if it saves you from killing neurons stepping through pages of code by hand. I think the real-mode debugging tricks I picked up should be useful for working on GRUB in the future.


12 January 2011

Mike Hommey: Attempting to track I/O with systemtap

There are several ways a program can hit the disk, and it can be hard to know exactly what's going on, especially when you want to take into account the kernel caches. This includes, but is not limited to: There are various ways to track system calls (e.g. strace), allowing you to track what a program is doing with files and directories, but that doesn't tell you whether you're actually hitting the disk or not. There are also various ways to track block I/O (e.g. blktrace), allowing you to track actual I/O on the block devices, but it is then hard to back-track what part of which files or directories these I/O relate to. To the best of my knowledge, there are unfortunately no tools to do such tracking easily. Systemtap, however, allows you to access the kernel's internals and to gather almost any kind of information from any place in a running kernel. The downside is that you need to know how the kernel works internally to gather the data you need; that will limit the focus of this post. I had been playing, in the past, with Taras' script, which he used a while ago to track I/O during startup. Unfortunately, it became clear something was missing in the picture, so I had to investigate what's going on in the kernel. Setting up systemtap On Debian systems, you need to install the following packages: That should be enough to pull all the required dependencies. You may want to add yourself to the stapdev and stapusr groups if you don't want to run systemtap as root. If, like in my case, you don't have enough space left for all the files in /usr/lib/debug, you can trick dpkg into not unpacking files you don't need:
# echo path-exclude /usr/lib/debug/lib/modules/*/kernel/drivers/* > /etc/dpkg/dpkg.cfg.d/kernel-dbg
The file in /etc/dpkg/dpkg.cfg.d can obviously be named as you like, and you can adjust the path-exclude pattern to what you (don't) want. In the above case, kernel drivers' debugging symbols will be ignored. Please note that this feature requires dpkg version 1.15.8 or greater.

Small digression

One of the first problems I had with Taras' script is that systemtap would complain that it doesn't know the kernel.function("ext4_get_block") probe. It is due to a very unfortunate misfeature of systemtap, where the kernel.* probes refer to whatever is in the vmlinux image. Module probes have a separate namespace, namely module("name").*. So for the ext4_get_block() function, this means you need to set a probe for either kernel.function("ext4_get_block") or module("ext4").function("ext4_get_block"), depending on how your kernel was compiled. And you can't even use both in your script, because systemtap will complain about either being an unknown probe.

Tracking the right thing

I was very recently pointed to a Red Hat document containing 4 useful systemtap scripts ported from dtrace, which gives me a good occasion to explain the issue at hand with the first of these scripts. This script attempts to track I/O by following read() and write() system calls. Which is not tracking I/O; it is merely tracking some system calls (Taras' script had the same kind of problem with read()/write() induced I/O). You could do the very same with existing tools like strace, and that wouldn't even require any system privileges. To demonstrate that the script doesn't actually track the use of storage devices as the document claims, consider the following source code:
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ensure the resulting binary is decently sized */
const char dummy[1024 * 1024] = "a";

int main(int argc, char *argv[])
{
  struct stat st;
  char buf[65536];
  int i, fd = open(argv[0], O_RDONLY);
  for (i = 0; i < 100000; i++) {
    read(fd, buf, 65536);
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 65536);
    lseek(fd, 512 * 1024, SEEK_SET);
  }
  close(fd);
  return 0;
}
All it does is read some parts of the executable a lot of times (notice the trick to make the executable size at least 1MB). Not a lot of programs will actually do something as bold as reading the same data again and again (though we could probably be surprised), but this easily demonstrates the problem. Here is what the output of the systemtap script looks like for this program (stripping other, irrelevant processes):
         process     read   KB tot    write   KB tot
            test   400002 25600001        0        0
Now, do you really believe 25GB have actually been read from the disk? (Note that the read() count seems odd as well, as there should only be around 200,000 calls.)

Read-ahead, page cache and mmap()

What the kernel actually does is that a read() on a file is going to check the page cache first. If there is nothing corresponding to the read() request in the page cache, then it goes down to the disk, and fills the page cache. But as loading only a few bytes or kilobytes from the disk could be wasteful, the kernel also reads a few more blocks ahead, apparently with some heuristic. But read() and write() aren't the sole way a program may hit the disk. On UNIX systems, a file can be mapped in memory with the mmap() system call, and all accesses in the memory range corresponding to this mapping will be reflected on the file. There are exceptions, depending on how the mapping is established, but let's keep it simple. There is a lot of literature on the subject if you want to document yourself on mmap(). The way the kernel reads from the file, however, is quite similar to that of read(), and uses the page cache and read-ahead. The systemtap script debunked above doesn't track these. I'll skip write accesses, because for now, they haven't been in my scope.

Tracking some I/O with systemtap

What I've been trying to track so far has been limited to disk reads, which happen to be the only accesses occurring on shared library files. Programs and shared libraries are first read() from, so that the dynamic linker gets the ELF headers and knows what to load, and then are mmap()ed following PT_LOAD entries in these headers. As far as my investigation in the Linux kernel code goes, fortunately, both kinds of access, before they end up actually hitting the disk, go through the __do_page_cache_readahead kernel function (this function was tracked in Taras' script).
Unfortunately, while it is called with an offset and a number of pages to read for a given file, it turns out the last pages in that number are not necessarily read from the disk. I don't know for sure, but some might even already be in the page cache; that certainly had an effect on my observations. Going further down, we reach the VFS layer, which ends up being file-system specific. But fortunately, a bunch of (common) file-systems actually share page mapping code, and commonly use the mpage_readpage and mpage_readpages functions, both calling do_mpage_readpage to do the actual work. And this function seems to be properly called only once for each page not already in the page cache. If my reading of the kernel source is right, this however is not really where it ends, and do_mpage_readpage doesn't actually hit the disk. It seems to only gather some information (basically, a mapping between storage blocks and memory) that is going to be submitted to the block I/O layer, which itself may do some fancy stuff with it, such as reordering the I/Os depending on other requests it got from other processes, the position of the disk heads, etc. And when I say do_mpage_readpage doesn't actually hit the disk, I'm again simplifying the issue, because it actually might, as there might be a need to read some metadata from the disk to know where some blocks are located. But tracking metadata reads is much harder, and I haven't investigated it. Anyway, skipping metadata, going further down after do_mpage_readpage is hard, because it's difficult to back-track which block I/O is related to which read-ahead, corresponding to what read at what position in which file. do_mpage_readpage already has part of this problem, because it is not called with any reference to the corresponding file. But __do_page_cache_readahead is. So knowing all the above, here is my script, the one I used to get the most recent Firefox startup data you can find in my last posts:
global targetpid;
global file_path;

probe begin {
  targetpid = target();
}

probe kernel.function("__do_page_cache_readahead") {
  if (targetpid == pid())
    file_path[tid()] = d_path(&$filp->f_path);
}

probe kernel.function("do_mpage_readpage") {
  if (targetpid == pid() && (tid() in file_path)) {
    now = gettimeofday_us();
    printf("%d %s %d\n", now, file_path[tid()], $page->index*4096);
  }
}

probe kernel.function("__do_page_cache_readahead").return {
  if (targetpid == pid())
    delete file_path[tid()];
}
This script needs to be used with a command given to systemtap, with the -c option, such as in the following command line:
# stap readpage.stp -c firefox
Each line of output represents a page (i.e. 4,096 bytes) being read, and contains a timestamp, the name of the file being read, and the offset within the file. As discussed above, do_mpage_readpage is not really the place where the I/O actually occurs, so the timestamps are not entirely accurate, and the actual read order from disk might be slightly different, but it still is a quite reliable view, in that the result should be reproducible with the same files even when they're not located on the same blocks on disk, provided their page cache status is the same when starting the program. This systemtap script ignores writes, as well as metadata accesses (including but not limited to inodes, dentries, bitmap blocks and indirect blocks). It also doesn't account for accesses to files opened with the O_DIRECT flag or similar constructs (raw devices, etc.).

Read-ahead in action

Back to the small example program: my systemtap script records 102 page accesses, that is, 417,792 bytes, much less than the actual binary size on my system (1,055,747 bytes). We are still far from the 25GB figure from the other systemtap script. But we are also far from the 128KiB the program actively reads (64KiB twice, leaving a hole between the two blocks). At this point, it is important to note that the ELF headers and program code all fit within a single page (4 KiB), and following the 1MiB section corresponding to the dummy variable, there are only 5,219 bytes of other ELF sections, including the .dynamic section. So even counting everything the dynamic linker needs to read, and the program code itself, we're still far from what my systemtap script records. Grouping consecutive blocks with consecutive timestamps, here is what can be observed:
  Offset    Length
       0    16,384
 983,040    73,728
  16,384   262,144
 524,288    65,536
(By now, you should have guessed why I wanted that big hole between the read()s; if you want to reproduce at home, I suggest you also use my page cache flush helper.) As earlier investigations showed, the first accesses by the dynamic loader are to read the ELF headers and the .dynamic section. As mentioned above, these are really small. Yet, the kernel actually reads much more: 16KiB at the beginning of the file when reading the ELF headers, and 72KiB at the end when reading the .dynamic section. Subsequent accesses from the dynamic loader are obviously already covered by these accesses. The next accesses are those due to the program itself. The program actively reads 64KiB at the beginning of the file, then 64KiB starting at offset 524,288. For the first read, the kernel already had 16KiB in the page cache, so it didn't read them again, but instead of reading just the remainder, it reads 256KiB. For the second read, however, it only reads the requested 64KiB. As you can see, this is far from being something like "you wanted n KiB, I'll read that fixed amount now". Further testing with different read patterns (e.g. changing the hole size, the read size, or reading from the dummy variable directly instead of read()ing the binary) is left as an exercise to the reader.
