I spent most of last week working on Ubuntu bug 693671
("wubi install will not boot - phase 2 stops with: Try (hd0,0): NTFS5"), which was quite a challenge to debug since it involved digging into parts of the Wubi boot process I'd never really touched before. Since I don't think much of this is very well-documented, I'd like to spend a bit of time explaining what was involved, in the hope that it will help other developers in the future.
is a system for installing Ubuntu into a file in a Windows filesystem, so that it doesn't require separate partitions and can be uninstalled like any other Windows application. The purpose of this is to make it easy for Windows users to try out Ubuntu without the need to worry about repartitioning, before they commit to a full installation. Wubi started out as an external project, and initially patched the installer on the fly to do all the rather unconventional things it needed to do; we integrated it into Ubuntu 8.04 LTS, which involved turning these patches into proper installer facilities that could be accessed using preseeding, so that Wubi only needs to handle the Windows user interface and other Windows-specific tasks.
Anyone familiar with a GNU/Linux system's boot process will immediately see that this isn't as simple as it sounds. Of course, ntfs-3g
is a pretty solid piece of software so we can handle the Windows filesystem without too much trouble, and loopback mounts are well-understood so we can just have the initramfs loop-mount the root filesystem. Where are you going to get the kernel and initramfs from, though? Well, we used to copy them out to the NTFS filesystem so that GRUB could read them, but this was overly complicated and error-prone. When we switched to GRUB 2, we could instead use its built-in loopback facilities, and we were able to simplify this. So all was more or less well, except for the elephant in the room. How are you going to load GRUB?
In a Wubi installation, NTLDR (or BOOTMGR in Windows Vista and newer) still owns the boot process. Ubuntu is added as a boot menu option using BCDEdit. You might then think that you can just have the Windows boot loader chain-load GRUB. Unfortunately, NTLDR only loads 16 sectors - 8192 bytes - from disk. GRUB won't fit in that: the smallest core.img you can generate at the moment is over 18 kilobytes. Thus, you need something that is small enough to be loaded by NTLDR, but that is intelligent enough to understand NTFS to the point where it can find a particular file in the root directory of a filesystem, load boot loader code from it, and jump to that. The answer for this was GRUB4DOS
. Most of GRUB4DOS is based on GRUB Legacy, which is not of much interest to us any more, but it includes an assembly-language program called GRLDR that supports doing this very thing for FAT, NTFS, and ext2. In Wubi, we build GRLDR as
, and build a specially-configured GRUB core image as
Now, the messages shown in the bug report suggested a failure either within GRLDR or very early in GRUB. The first thing I did was to remember that GRLDR has been integrated into the grub-extras
module suitable for use with GRUB 2, so I tried building
from that; no change, but this gave me a modern baseline to work on. OK; now to try QEMU (you can use tricks like
qemu -hda /dev/sda
if you're very careful not to do anything that might involve writing to the host filesystem from within the guest, such as recursively booting your host OS ... [update: Tollef Fog Heen and Zygmunt Krynicki both point out that you can use the
option to make this safer]). No go; it hung somewhere in the middle of NTLDR. Still, I could at least insert debug statements, copy the built
over to my test machine, and reboot for each test, although it would be slow and tedious. Couldn't I?
Well, yes, I mostly could, but that 8192-byte limit came back to bite me, along with an internal 2048-byte limit that GRLDR allocates for its NTFS bootstrap code. There were only a few spare bytes. Something like this would more or less fit, to print a single mark character at various points so that I could see how far it was getting:
xorw %bx, %bx /* video page 0 */
movw $0x0e4d, %ax /* print 'M' */
In a few places, if I removed some code I didn't need on my test machine (say, CHS compatibility), I could even fit in cheap and nasty code to print a single register in hex (as long as you didn't mind 'A' to 'F' actually being ':' to '?' in ASCII; and note that this is real-mode code, so the loop counter is
/* print %edx in dumbed-down hex */
xorw %bx, %bx
movb $0xe, %ah
movw $8, %cx
roll $4, %edx
movb %dl, %al
andb $0xf, %al
After a considerable amount of work tracking down problems by bisection like this, I also observed that GRLDR's NTFS code bears quite a bit of resemblance in its logical flow to GRUB 2's NTFS module, and indeed the same person wrote much of both. Since I knew that the latter worked, I could use it to relieve my brain of trying to understand assembly code logic directly, and could compare the two to look for discrepancies. I did find a few of these, and corrected a simple one. Testing at this point suggested that the boot process was getting as far as GRUB but still wasn't printing anything. I removed some Ubuntu patches which quieten down GRUB's startup: still nothing - so I switched my attentions to grub-core/kern/i386/pc/startup.S
, which contains the first code executed from GRUB's core image. Code before the first call to
(which switches the processor into protected mode) succeeded, while code after that point failed. Even more mysteriously, code added to
the actual switch to protected mode failed too. Now I was clearly getting somewhere interesting, but what was going on? What I really wanted was to be able to single-step, or at least see what was at the memory location it was supposed to be jumping to.
Around this point I was venting on IRC, and somebody asked if it was reproducible in QEMU. Although I'd tried that already, I went back and tried again. Ubuntu's
is actually built from qemu-kvm, and if I used
then it worked much better. Excellent! Now I could use GDB:
(gdb) target remote qemu -gdb stdio -no-kvm -hda /dev/sda
This let me run until the point when NTLDR was about to hand over control, then interrupt and set a breakpoint at
(the entry point of
). This revealed that the address that should have been
was in fact garbage. I set a breakpoint at
(GRLDR's entry point) and stepped all the way through to ensure it was doing the right thing. In the process it was helpful to know that GDB and QEMU don't handle real mode very well between them
. Useful tricks here were:
set architecture i8086 before disassembling real-mode code (and
set architecture i386 to switch back).
- GDB prints addresses relative to the current segment base, but if you want to enter an address then you need to calculate a linear address yourself. For example, breakpoints must be set at
(CS << 4) + IP, rather than just at
Single-stepping showed that GRLDR was loading the entirety of
correctly and jumping to it. The first instruction it jumped to wasn't in
, though, and then I remembered that we prefix the core image with grub-core/boot/i386/pc/lnxboot.S
. Stepping through this required a clear head since it copies itself around and changes segment registers a few times. The interesting part was at
, where it copies a sector of the kernel to the target load address, and then checks a known offset to find out whether the "kernel" is in fact GRUB rather than a Linux kernel. I checked that offset by hand, and there was the smoking gun. GRUB recently acquired Reed-Solomon error correction on its core image, to allow it to recover from other software writing over sectors in the boot track. This moved the magic number
was checking somewhat further into the core image, after the first sector.
couldn't find it because it hadn't copied it yet! A bit of adjustment
and all was well again.
The lesson for me from all of this has been to try hard to get an interactive debugger working. Really hard. It's worth quite a bit of up-front effort if it saves you from killing neurons stepping through pages of code by hand. I think the real-mode debugging tricks I picked up should be useful for working on GRUB in the future.