My main desktop/server machine (running Debian sid) at home has been running XFS on mdadm raid-1 on a pair of SSDs for the last few years. A few days ago, one of the SSDs died.
I've been planning to switch to ZFS as the root filesystem for a while now, so instead of just replacing the failed drive, I took the opportunity to convert it.
NOTE: at this point in time, ZFS On Linux does NOT support TRIM for either datasets or zvols on SSD. There's a patch almost ready (TRIM/Discard support from Nexenta, #3656), so I'm betting on that getting merged before it becomes an issue for me.
Here's the procedure I came up with:
1. Buy new disks, shutdown machine, install new disks, reboot.
The details of this stage are unimportant; the only thing to note is that I'm switching from mdadm RAID-1 with two SSDs to ZFS with two mirrored pairs (RAID-10) on four SSDs (Crucial MX300 275G at around $100 AUD each, they're hard to resist). Buying four 275G SSDs is slightly more expensive than buying two of the 525G models, but will perform a lot better.
When installed in the machine, they ended up as /dev/sdp, /dev/sdq, /dev/sdr, and /dev/sds. I'll be using the symlinks in /dev/disk/by-id/ for the zpool, but for partitioning and setup, it's easiest to use the /dev/sd? device nodes.
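For reference, you can see which by-id symlinks correspond to which device nodes with something like this (the ata-Crucial prefix matches my drives; adjust the pattern for yours):
# list the whole-disk symlinks, filtering out the -partN entries
ls -l /dev/disk/by-id/ata-Crucial* | grep -v part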
2. Partition the disks identically with gpt partition tables, using gdisk and sgdisk.
I need:
- A small partition (type EF02, 1MB) for grub to install itself in; needed when BIOS-booting from gpt disks.
- A small partition (type EF00, 1GB) for the EFI System partition. I'm not currently booting with UEFI but I want the option to move to it later.
- A small partition (type 8300, 2GB) for /boot. I want /boot on a separate partition to make it easier to recover from problems that might occur with future upgrades. 2GB might seem excessive, but as this is my tftp & dhcp server I can't rely on network boot for rescues, so I want to be able to put rescue ISO images in there and boot them with grub and memdisk. This will be mdadm RAID-1, with 4 copies.
- A larger partition (type 8200, 4GB) for swap. With 4 identically partitioned SSDs, I'll end up with 16GB of swap (using zswap for block-device-backed compressed RAM swap).
- A large partition (type bf07, 210GB) for my rootfs.
- A small partition (type bf08, 2GB) to provide ZIL for my HDD zpools.
- A larger partition (type bf09, 32GB) to provide L2ARC for my HDD zpools.
ZFS On Linux uses partition type bf07 ("Solaris Reserved 1") natively, but doesn't seem to care what the partition types are for ZIL and L2ARC. I arbitrarily used bf08 ("Solaris Reserved 2") and bf09 ("Solaris Reserved 3") for easy identification. I'll set these up later, once I've got the system booted. I don't want to risk breaking my existing zpools by taking away their ZIL and L2ARC (and forgetting to zpool remove them, which I might possibly have done once) if I have to repartition.
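For the record, detaching ZIL and L2ARC devices from a pool before repartitioning looks something like this (the pool and device names here are hypothetical):
# remove the log (ZIL) and cache (L2ARC) devices from an existing HDD pool
zpool remove tank ata-OLD_SSD_SERIAL-part6
zpool remove tank ata-OLD_SSD_SERIAL-part7
zpool status tank    # confirm the log and cache vdevs are gone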
I used gdisk to interactively set up the partitions:
# gdisk -l /dev/sdp
GPT fdisk (gdisk) version 1.0.1
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdp: 537234768 sectors, 256.2 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 4234FE49-FCF0-48AE-828B-3C52448E8CBD
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 537234734
Partitions will be aligned on 8-sector boundaries
Total free space is 6 sectors (3.0 KiB)
Number  Start (sector)    End (sector)  Size        Code  Name
   1                40            2047  1004.0 KiB  EF02  BIOS boot partition
   2              2048         2099199  1024.0 MiB  EF00  EFI System
   3           2099200         6293503  2.0 GiB     8300  Linux filesystem
   4           6293504        14682111  4.0 GiB     8200  Linux swap
   5          14682112       455084031  210.0 GiB   BF07  Solaris Reserved 1
   6         455084032       459278335  2.0 GiB     BF08  Solaris Reserved 2
   7         459278336       537234734  37.2 GiB    BF09  Solaris Reserved 3
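If you'd rather script the partitioning than do it interactively, something along these lines with sgdisk should produce an equivalent layout (a sketch, not the exact commands I ran; partition alignment may differ slightly, so double-check the sizes and type codes against the table above):
sgdisk --zap-all /dev/sdp
sgdisk -n1:0:+1M   -t1:EF02 -c1:"BIOS boot partition" \
       -n2:0:+1G   -t2:EF00 -c2:"EFI System" \
       -n3:0:+2G   -t3:8300 -c3:"Linux filesystem" \
       -n4:0:+4G   -t4:8200 -c4:"Linux swap" \
       -n5:0:+210G -t5:BF07 -c5:"Solaris Reserved 1" \
       -n6:0:+2G   -t6:BF08 -c6:"Solaris Reserved 2" \
       -n7:0:0     -t7:BF09 -c7:"Solaris Reserved 3" \
       /dev/sdp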
I then cloned the partition table to the other three SSDs with this little script:
clone-partitions.sh
#! /bin/bash
src='sdp'
targets=( 'sdq' 'sdr' 'sds' )
for tgt in "${targets[@]}"; do
    # copy the partition table to the target, then give it its own unique GUIDs
    sgdisk --replicate="/dev/$tgt" "/dev/$src"
    sgdisk --randomize-guids "/dev/$tgt"
done
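It's worth confirming the layout was copied correctly before going any further; a quick check might look like this:
for d in sdp sdq sdr sds ; do
    sgdisk -p "/dev/$d" | tail -n 7    # just the partition rows
done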
3. Create the mdadm array for /boot, the zpool, and the root filesystem.
Most rootfs-on-ZFS guides that I've seen say to call the pool rpool, then create a dataset called "$(hostname)-1" and then create a ROOT dataset under that. So on my machine, that would be rpool/ganesh-1/ROOT. Some reverse the order of hostname and the rootfs dataset, for rpool/ROOT/ganesh-1.
There might be uses for this naming scheme in other environments, but not in mine. And, to me, it looks ugly. So I'll use just $(hostname)/root for the rootfs, i.e. ganesh/root.
I wrote a script to automate it, figuring I'd probably have to do it several times in order to optimise performance. Also, I wanted to document the procedure for future reference, and have scripts that would be trivial to modify for other machines.
create.sh
#! /bin/bash
exec &> ./create.log
hn="$(hostname -s)"
base='ata-Crucial_CT275MX300SSD1_'
md='/dev/md0'
md_part=3
md_parts=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${md_part}) )
zfs_part=5
# 4 disks, so use the top half and bottom half for the two mirrors.
zmirror1=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | head -n 2) )
zmirror2=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | tail -n 2) )
# create /boot raid array
mdadm --create "$md" \
    --bitmap=internal \
    --raid-devices=4 \
    --level 1 \
    --metadata=0.90 \
    "${md_parts[@]}"
mkfs.ext4 "$md"
# create zpool
zpool create -o ashift=12 "$hn" \
    mirror "${zmirror1[@]}" \
    mirror "${zmirror2[@]}"
# create zfs rootfs
zfs set compression=on "$hn"
zfs set atime=off "$hn"
zfs create "$hn/root"
zpool set bootfs="$hn/root" "$hn"
# mount the new /boot under the zfs root
mkdir -p "/$hn/root/boot"
mount "$md" "/$hn/root/boot"
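At this point it's worth sanity-checking the result before copying any data over; something like this (ganesh is my pool/hostname, output will obviously vary):
zpool status ganesh       # both mirror vdevs online, no errors
zfs list -o name,used,avail,mountpoint
cat /proc/mdstat          # md0 active, raid1 with 4 devices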
If you want or need other ZFS datasets (e.g. for /home, /var, etc.) then create them here in this script. Or you can do that later, after you've got the system up and running on ZFS.
If you run mysql or postgresql, read the various tuning guides for how to get best performance for databases on ZFS (they both need their own datasets with particular recordsize and other settings). If you download Linux ISOs or anything with bit-torrent, avoid COW fragmentation by setting up a dataset to download into with recordsize=16K and configure your BT client to move the downloads to another directory on completion.
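For example, a torrent download dataset could be created along these lines (the dataset name and mountpoint here are made up; use whatever fits your layout):
zfs create -o recordsize=16K -o mountpoint=/var/tmp/torrents ganesh/torrents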
I did the database setup after I got my system booted on ZFS. For my db, I stopped the postgres service, renamed /var/lib/postgresql to /var/lib/p, created the new datasets with:
zfs create -o recordsize=8K -o logbias=throughput -o mountpoint=/var/lib/postgresql \
    -o primarycache=metadata ganesh/postgres
zfs create -o recordsize=128k -o logbias=latency -o mountpoint=/var/lib/postgresql/9.6/main/pg_xlog \
    -o primarycache=metadata ganesh/pg-xlog
followed by an rsync of the data back into place, and then started postgres again.
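That rsync was just a straight copy of the renamed directory into the new datasets; something like this, assuming the same rename as above:
rsync -aHAXS /var/lib/p/ /var/lib/postgresql/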
4. rsync my current system to it.
Log out all user sessions, and shut down all services that write to the disk (postfix, postgresql, mysql, apache, asterisk, docker, etc). If you haven't booted into recovery/rescue/single-user mode, then you should be as close to it as possible: everything non-essential should be stopped. I chose not to boot to single-user in case I needed access to the web to look things up while I did all this (this machine is my internet gateway).
Then:
hn="$(hostname -s)"
time rsync -avxHAXS -h -h --progress --stats --delete / /boot/ "/$hn/root/"
After the rsync, my 130GB of data from XFS was compressed to 91GB on ZFS with transparent lz4 compression.
Run the rsync again if (as I did) you realise you forgot to shut down postfix (causing newly arrived mail to not be on the new setup) or something.
You can do a (very quick & dirty) performance test now by running zpool scrub "$hn", then watch zpool status "$hn". As there should be no errors to correct, you should get scrub speeds approximating the combined sequential read speed of all vdevs in the pool. In my case, I got around 500-600M/s. I was kind of expecting closer to 800M/s, but that's good enough: the Crucial MX300s aren't the fastest drives available (but they're great for the price), and ZFS is optimised for reliability more than speed. The scrub took about 3 minutes to scan all 91GB. My HDD zpools get around 150 to 250M/s, depending on whether they have mirror or RAID-Z vdevs and on what kind of drives they have.
For real benchmarking, use bonnie++ or fio.
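A minimal fio run on the new pool might look something like this (the job parameters are just an illustration, not a tuned benchmark; run it from a directory on the pool, as fio creates its test files in the current directory):
fio --name=seqread --rw=read --bs=1M --size=4G --numjobs=4 --group_reporting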
5. Prepare the new rootfs for chroot, chroot into it, edit /etc/fstab and /etc/default/grub.
This script bind mounts /proc, /sys, /dev, and /dev/pts before chrooting:
chroot.sh
#! /bin/sh
hn="$(hostname -s)"
for i in proc sys dev dev/pts ; do
    mount -o bind "/$i" "/${hn}/root/$i"
done
chroot "/${hn}/root"
Change /etc/fstab (on the new zfs root) to have the zfs root and the ext4-on-raid-1 /boot:
/ganesh/root / zfs defaults 0 0
/dev/md0 /boot ext4 defaults,relatime,nodiratime,errors=remount-ro 0 2
I haven't bothered with setting up the swap at this point. That's trivial and I can do it after I've got the system rebooted with its new ZFS rootfs (which reminds me, I still haven't done that :).
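When I do get around to it, it will presumably be something along these lines (a sketch only; in my layout the swap partition is number 4 on each disk, and equal priorities let the kernel stripe across them):
for d in /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_*-part4 ; do
    mkswap "$d"
    echo "$d none swap sw,pri=10 0 0" >> /etc/fstab
done
swapon -a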
Add boot=zfs to the GRUB_CMDLINE_LINUX variable in /etc/default/grub. On my system, that's:
GRUB_CMDLINE_LINUX="iommu=noagp usbhid.quirks=0x1B1C:0x1B20:0x408 boot=zfs"
NOTE: If you end up needing to run rsync again as in step 4 above, copy /etc/fstab and /etc/default/grub to the old root filesystem first so your edits don't get overwritten. I suggest copying them to /etc/fstab.zfs and /etc/default/grub.zfs.
6. Install grub
Here's where things get a little complicated. Running grub-install on /dev/sd[pqrs] is fine; we created the type ef02 partition for it to install itself into.
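In practice that's just a short loop over the four disks, something like:
for d in /dev/sdp /dev/sdq /dev/sdr /dev/sds ; do
    grub-install "$d"
done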
But running update-grub to generate the new /boot/grub/grub.cfg will fail with an error like this:
/usr/sbin/grub-probe: error: failed to get canonical path of '/dev/ata-Crucial_CT275MX300SSD1_163313AADD8A-part5'.
IMO, that's a bug in grub-probe: it should look in /dev/disk/by-id/ if it can't find what it's looking for in /dev/.
I fixed that problem with this script:
fix-ata-links.sh
#! /bin/sh
# create top-level /dev/ata-* symlinks matching the names in /dev/disk/by-id/,
# so that grub-probe can find the devices the zpool refers to
cd /dev
ln -s /dev/disk/by-id/ata-Crucial* .
After that, update-grub works fine.
NOTE: you will have to add udev rules to create these symlinks, or run this script on every boot, otherwise you'll get that error every time you run update-grub in future.
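A udev rule to do the same thing might look roughly like this (an untested sketch; the filename is arbitrary, and it mirrors what the standard persistent-storage rules do for /dev/disk/by-id/, just without the disk/by-id/ prefix, for every ata block device):
# /etc/udev/rules.d/99-ata-dev-links.rules
SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", SYMLINK+="$env{ID_BUS}-$env{ID_SERIAL}"
SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="partition", SYMLINK+="$env{ID_BUS}-$env{ID_SERIAL}-part%n"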
7. Prepare to reboot
Unmount proc, sys, dev/pts, dev, the new raid /boot, and the new zfs filesystems. Set the mountpoint for the new rootfs to /:
umount-zfs-root.sh
#! /bin/sh
hn="$(hostname -s)"
md="/dev/md0"
for i in dev/pts dev sys proc ; do
    umount "/${hn}/root/$i"
done
umount "$md"
zfs umount "${hn}/root"
zfs umount "${hn}"
zfs set mountpoint=/ "${hn}/root"
zfs set canmount=off "${hn}"
8. Reboot
Remember to configure the BIOS to boot from your new disks.
The system should boot up with the new rootfs; no rescue disk required as in some other guides, because the rsync and chroot stuff has already been done.
9. Other notes
10. Useful references
Reading these made it much easier to come up with my own method. Highly recommended.