Search Results: "Kyle McMartin"

21 January 2009

Kyle McMartin: fudcon f11

I was in Boston last week for FUDcon F11. As usual, despite bringing several tonnes of camera gear, I couldn’t be bothered to pull it out aside from a few (terrible) casual shots. But hey, I’m not big on being spontaneous. Anyway, the two hackfest days were really productive (as were the days I managed to make it out to the Red Hat office). Sadly, we didn’t make it into MIT for the Sunday hackfest day, due to the weather and the Sunday bus schedule in Arlington.

Instead of DaveJ doing his usual “what’s going on in the kernel for F$next” session, I pitched that session and Dave held a session on dracut, his and Jeremy’s new framework for building initramfs. My session was well attended and got a lot of questions about various things like the new drivers/staging policy (though that’s a post for another time).

The rest of the time was productively spent poking at updates for the kernel for F9 and F10, rebasing them to 2.6.28, and poking at 2.6.29-rc1 for rawhide. This meant btrfs, while not ready for your critical data, was in mainline! After pushing out the new build, I spent most of the weekend doing some btrfs testing, and pushing out a new btrfsprogs build. Once they were composed into rawhide, I believe Fedora became the first to ship the mainline working btrfs in a distro.

All said, it was a fairly productive weekend, and FUDPub on Saturday night was a great chance to catch up with a bunch of my coworkers who I rarely get to see. I kind of wish I had bothered to set up an umbrella and get some decent shots of my friends, colleagues and coworkers having a raucous good time, but oh well, there’s always next release.

off-topic

I absolutely hate being political, and desperately hate posting about life or whatever, since I don’t care about yours and don’t see why you should care about mine, but the callous arrogance of these striking transit workers here in Ottawa is beginning to irritate me. These are people who’ve threatened to picket students attempting to get to school if the universities accepted funding from the municipal government to continue to run replacement shuttles, and (though they’ve reached a side agreement now) had threatened to picket over attempts to hire more drivers for the ParaTranspo service, which provides shuttle services for the elderly and less-able-bodied residents of Ottawa. My point is, one of the shops I walked past on the high street yesterday said it best: Caution, possibly NSFW depending on your point of view.

Given that I live in a (fairly) suburban part of Ottawa, this means I get to enjoy a forty minute walk to the nearest (competent) grocery store or coffee shop, and haven’t bothered much. But hey, wonderful North American urban planning strikes again. Huzzah.

7 September 2008

Kyle McMartin: CONFIG_PRINTK_TIME, what is time?

GMsoft reported a few weeks ago that kernels on his A500 were hanging on startup with CONFIG_PRINTK_TIME enabled. Knowing that all this does is prefix the kernel messages with a timestamp, I was interested to find out how this could possibly be causing a hard hang.

Obviously, the first thing to do is to try and reproduce the problem. But I was completely unable to reproduce it on my RP3440… How very strange. Ok, well, let’s poke at the A500 and see if it will happen there. It does! Spooky…[1]

Well, now I was really interested. Let’s see what CONFIG_PRINTK_TIME actually does… In kernel/printk.c::vprintk, around the if (printk_time) section, we see that after printing the priority level tag, such as KERN_INFO, etc., we attempt to print a timestamp obtained with cpu_clock. The returned value from cpu_clock is in nanoseconds, so before it’s printed, it is munged into two smaller integers, the whole-seconds portion, and the decimal portion.

Ok, this gives us a great place to start looking for ways PA-RISC could be tripping up on these codepaths. The first was a fairly fruitless search of the cpu_clock call chain, which turned up nothing suspicious, aside from a maze of CONFIG_ options. Turned out, on non-x86, this code reduced into some fairly trivial stuff, none of which could really have been causing a hang.

However, we now had a basis for a fairly good hunch. If it wasn’t the cpu_clock going awry, the do_div routine or sprintf must have been causing it. A quick boot-test to comment out the do_div routine and replace it with a fixed value resulted in a working system. Hooray.

Then it hit me like a freight train, when I saw what was being printed. This was the banner line… the very first thing printed in init/main.c, right after jumping into the kernel in virtual mode at start_kernel.



Linux version 2.6.27-rc5-00283-g70bb089 (kyle@shortfin) (gcc version 4.3.1 (GCC) ) #2 SMP Sat Sep 6 19:45:05 PDT 2008
FP[0] enabled: Rev 1 Model 20
The 64-bit Kernel has started…
console [ttyB0] enabled
Initialized PDC Console for debugging.

Seeing the “FP[0] enabled” line immediately caused me to smack my head at the obviousness of the problem. We were attempting to do a division, which, because of how the kernel and libgcc are compiled, is attempting to use the fpu. However, this is faulting on the very first printk in the kernel, well before any of the architecture-specific initialization is done. A quick hack removing any printks before we initialized the fpu fixed the problem as well.

But dirty hacks are not appropriate for mainline. I thought long and hard about a nice way to fix this, but, really, open coding firmware calls in assembly didn’t strike my fancy. There is, however, another easy way to solve it. I ended up replacing the jump to start_kernel in head.S with my own function to turn on the fpu, and called start_kernel from there. Kind of ugly, but at least the fix is entirely contained to arch/parisc, instead of leaking all over the tree.

The patch is available, but I’ve been too busy to push it this last release-cycle (and didn’t really want to tempt fate by pushing a not-quite-actually-serious-fix outside of -rc1 time).

This has been another post brought to you by the maintainer of an inconsequential architecture. We do hope you enjoy it.

1. This ended up being due to either 1) the fpu being enabled by firmware on the PA8800, or 2) the fact that I was doing warm resets instead of cold starts.
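
For anyone who wants to see the timestamp munging in isolation, here is a minimal userspace sketch of splitting a nanosecond count into whole seconds and a fractional part. The function name format_printk_time is hypothetical, and the "[%5lu.%06lu] " layout is an assumption modelled on the familiar dmesg prefix rather than a copy of kernel/printk.c. In the kernel the 64-bit division goes through do_div, which is the step that tripped over the not-yet-enabled fpu in the post above.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Userspace sketch only: format_printk_time() is a hypothetical helper,
 * not a kernel function. It splits a nanosecond timestamp into whole
 * seconds and a microsecond fraction, then formats it dmesg-style.
 */
static int format_printk_time(char *buf, size_t len, uint64_t ns)
{
    /* 64-bit divide and remainder; the kernel uses do_div() for this. */
    uint64_t secs = ns / 1000000000ULL;
    unsigned long rem_ns = (unsigned long)(ns % 1000000000ULL);

    /* The nanosecond remainder is truncated to microseconds for display. */
    return snprintf(buf, len, "[%5lu.%06lu] ",
                    (unsigned long)secs, rem_ns / 1000);
}
```

Calling it with a buffer of 32 bytes and a value such as 12345678901 yields a prefix of twelve seconds and 345678 microseconds.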

31 May 2008

Kyle McMartin: it’s like a party in my compiler, and i wasn’t invited

I’ve spent more or less the last few days of spare time trying to figure out why gcc-4.3 built kernels were so unhappy on parisc. Basically the symptom was that any IPv4 networking operation would only complete once in a while; for example, ping would drop 90% of packets. This turned out to be a really troublesome bug to track down, and the fix was only 7 characters long.

gcc-4.2 wasn’t problematic, so that provided an interesting base. A bit of thinking ruled out some obvious parts of the kernel that would likely not be an issue. For instance, ARP appeared to be fine, so it likely wasn’t an issue at the network driver level. So rebuilding net/built-in.o on gcc-4.2, copying it to the gcc-4.3 tree, and rebuilding the tree using it resulted in a working kernel.

Ok, excellent, we know it’s likely an issue in net/, what else can we rule out, and what do we know? ICMP is affected, so it’s probably not a TCP problem… Rebuilding the kernel with IPv6 enabled, and testing ping6? Ok, works. So it looks like an IPv4 issue. Test that assertion… yup, net/ipv4/built-in.o compiled with gcc-4.2 works. Ok, peachy. What now?

Bisecting the contents of the directory results in ip_output.c being the problematic file… Not all that helpfully, the differences between the code generated by 4.2 and 4.3 are extensive. Well, ok, let’s try turning off a variety of the added compiler options… nope, no luck, but the file works when compiled at -O0.

Next, I bisected the file until I found the problematic function (unfortunately for me starting in the middle, it ended up being the first function in the file). That function was in a chain of inline functions, eventually calling some architecture specific inline assembly for ip_fast_csum. Ok, that looks like our canary in the coal mine… Look it up… SIGH. The routine touches memory, but didn’t have “memory” in the list of clobbers. Adding the clobber and everything magically works again. Wasn’t that a party?
[ Of course, this makes things sound almost easy, except that at around 10 minutes to boot a test kernel, reboot, and boot a working kernel, this gets extremely tedious. I also didn't bother to talk about how many hours I probably wasted twiddling compiler flags trying to figure out which optimization pass might be broken... ]
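
To make the missing clobber concrete, here is a rough sketch of the pattern, not the actual parisc ip_fast_csum. The helper sum_words is hypothetical, and the asm body is deliberately empty so the snippet builds with GCC or Clang on any target; the C loop stands in for what the real assembly would do. The point is only the last field of the asm statement: without "memory" there, the compiler is free to assume the asm never looks at the buffer and may delay or reorder stores to it.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for an arch-specific checksum helper. The asm
 * body is empty; only the operand and clobber lists matter here.
 */
static inline uint32_t sum_words(const uint32_t *p, unsigned int n)
{
    uint32_t sum = 0;
    unsigned int i;

    /*
     * Broken form:  __asm__ __volatile__("" : : "r"(p));
     * The pointer is passed in a register, so the compiler has no idea
     * the asm reads the memory behind it.
     *
     * Fixed form: the "memory" clobber says the asm may read or write
     * arbitrary memory, so pending stores to the buffer must be
     * completed before this point and cached values are discarded.
     */
    __asm__ __volatile__("" : : "r"(p) : "memory");

    /* Plain C in place of the real assembly loop. */
    for (i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

A narrower alternative is to describe the buffer as an "m" input operand, which tells the compiler exactly which memory is read; the blanket clobber is simply the smaller change.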

17 April 2008

Kyle McMartin: debugging strace

A bug was reported on linux-parisc@ last week, that “ls -l” was whinging about “Operation not supported” once per file. Easy, I thought, I’ll just break out strace, figure out which syscall it is, and Bob’s your uncle, we’ll have it fixed. Not so, unfortunately, as strace was bloody broken.

upeek: ptrace(PTRACE_PEEKUSER,2606,4294967292,0): Input/output error

Again, no big deal, some kernel change must have broken ptrace(2) on me. git logs didn’t shed any light on it, so I pulled an old strace binary out of an etch chroot and tried that, which failed as well. Wonderful. But… curious… it still works inside the chroot. At this point I was a little bit beyond annoyed, but the fact that it worked in the chroot was a big clue.

Anyway, long story short, strace decides to assume that you can have up to 32 possible syscall arguments (wtf?#1) which is all well and good and generally harmless. However, it also decides to yank MAX_ARGS of them out with ptrace(PEEK_USER, …) if strace is unaware of the syscall prototype. Again, this would be mostly harmless, on every other architecture. On parisc though, argument registers count down from %r26 through %r20, which is the syscall number.

Instrumenting the PEEK_USER path through sys_ptrace in the kernel made it pretty obvious when a consecutive request was suddenly 0xfffffffc (which is 0-4, especially suspicious given the previous values of 4, 8, 16…) that I needed to find a loop in ptrace. Searching through all the HPPA specific syscall code found it quickly,



for (i = 0; i < tcp->u_nargs; i++)
    if (upeek(pid, PT_GR26-4*i, &tcp->u_arg[i]) < 0)
        return -1;

where “i” is 0 .. u_nargs, which is set to either the known number of syscall args, or MAX_ARGS (which, remember, was 32). There, the culprit found, and with a simple one line fix, strace works again. In case you’re curious, the reason things worked inside the chroot is that the syscall being strace’d was dependent on libselinux being loaded.



syscall_298(0x400e0fbc, 0x58, 0xfb444310, 0, 0x409878ac, 0x409878ac) = -1 (errno 2)

Which is statfs64. The moral of the story here? strace is out of date and doesn’t yet support recently added syscalls. Also, the code quality is pretty atrocious.

Oh, and the problems with ls? getxattr was returning -EOPNOTSUPP because the filesystem doesn’t support xattrs. Or, more correctly, because that config option was disabled. This seems to be causing laughs on the buildds too. Doesn’t anyone check errno anymore? Now back to our regularly scheduled hacking.

wtf?#1: Generally speaking, more than 5 uint32_t syscall args is frowned upon, since that makes the poor register-starved i386 pass args on the stack.
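
The offset arithmetic behind that 0xfffffffc can be sketched in a few lines. Everything named here is an assumption for illustration: PT_GR26_OFFSET is taken to be 104 (register 26 times 4 bytes in a 32-bit register layout), which is consistent with the wraparound described above but is not copied from the parisc headers, and clamp_nargs is a hypothetical guard, not necessarily the one line fix that went into strace.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define PT_GR26_OFFSET  104  /* assumed: 26 * 4 bytes */
#define MAX_ARGS        32   /* strace's upper bound, per the post */
#define PARISC_ARG_REGS 6    /* %r26 down to %r21; %r20 is the syscall number */

/* Offset requested for argument i, as the unsigned value ptrace sees. */
static uint32_t arg_offset(unsigned int i)
{
    return (uint32_t)(PT_GR26_OFFSET - 4 * (int)i);
}

/* Hypothetical guard: never read more registers than carry arguments. */
static unsigned int clamp_nargs(unsigned int nargs)
{
    return nargs > PARISC_ARG_REGS ? PARISC_ARG_REGS : nargs;
}
```

Under these assumptions arg_offset(26) is 0 and arg_offset(27) wraps to 4294967292, the same number that shows up in the upeek error, so a loop running to MAX_ARGS walks straight off the front of the register block.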

16 January 2008

Kyle McMartin: first post.

I’ve climbed up above the tagclouds and appear to be back in the “blogosphere” whatever that is. I hate that “blog” word. I also hate the word “prosumer.” Spent last weekend in Raleigh at FUDCon which was a lot of fun. I doff my fedora to the organizers. It was great to finally have the opportunity to meet a lot of my coworkers who I haven’t yet had a chance to run into at conferences. Poked a bit at btrfs in my idle cycles… managed to make it fall-over-go-boom on parisc64. Fixing that up should keep me busy in my spare time for the next little while… Oh. And if anyone would like to make me a new disembodied head^W^Whackergotchi, I’d very much appreciate it.