Antoine Beaupré: BTRFS notes
I'm not a fan of BTRFS. This page serves as a reminder of why,
but also a cheat sheet to figure out basic tasks in a BTRFS
environment because those are not obvious to me, even after
repeatedly having to deal with them.
Content warning: there might be mentions of ZFS.
Stability concerns
I'm worried about BTRFS stability, which has been historically
... changing. RAID-5 and RAID-6 are still marked unstable, for
example. It's kind of a lucky guess whether your current kernel will
behave properly with your planned workload. For example, in Linux
4.9, RAID-1 and RAID-10 were marked as "mostly OK" with a note that
says:
Needs to be able to create two copies always. Can get stuck in
irreversible read-only mode if only one copy can be made.
Even as of now, RAID-1 and RAID-10 have this note:
The simple redundancy RAID levels utilize different mirrors in a way
that does not achieve the maximum performance. The logic can be
improved so the reads will spread over the mirrors evenly or based
on device congestion.
Granted, that's not a stability concern anymore, just performance. A
reviewer of a draft of this article actually claimed that BTRFS only
reads from one of the drives, which hopefully is inaccurate, but goes
to show how confusing all this is.
There are other warnings in the Debian wiki that are quite
scary. Even the legendary Arch wiki still has a warning on top of its
BTRFS page.
Even if those issues are now fixed, it can be hard to tell when they
were fixed. There is a changelog by feature but it explicitly
warns that it doesn't know "which kernel version it is considered
mature enough for production use", so it's also useless for this.
It would have been much better if BTRFS had been released into the
world only once those bugs were completely fixed. Or, at least, if
features had been announced only when they were stable, not just "we
merged to mainline, good luck". Even now, we get mixed messages in the
official BTRFS documentation, which says "The Btrfs code base is
stable" (main page) while at the same time clearly listing
unstable parts in the status page (currently RAID56).
There are much harsher BTRFS critics than me out there so I
will stop here, but let's just say that I feel a little uncomfortable
trusting server data with full RAID arrays to BTRFS. But surely, for a
workstation, things should just work smoothly... Right? Well, let's
see the snags I hit.
My BTRFS test setup
Before I go any further, I should probably clarify how I am testing
BTRFS in the first place.
The reason I tried BTRFS is that I was ... let's just say "strongly
encouraged" by the LWN editors to install Fedora for the
terminal emulators series.
That, in turn, meant the setup was done with BTRFS, because that was
somewhat the default in Fedora 27 (or did I want to experiment? I
don't remember, it's been too long already).
So Fedora was set up on my 1TB HDD and, with encryption, the partition
table looks like this:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931,5G 0 disk
sda1 8:1 0 200M 0 part /boot/efi
sda2 8:2 0 1G 0 part /boot
sda3 8:3 0 7,8G 0 part
fedora_swap 253:5 0 7.8G 0 crypt [SWAP]
sda4 8:4 0 922,5G 0 part
fedora_crypt 253:4 0 922,5G 0 crypt /
(This might not entirely be accurate: I rebuilt this from the Debian
side of things.)
This is pretty straightforward, except for the swap partition:
normally, I just treat swap like any other logical volume and create
it as a logical volume. This is now just speculation, but I bet it was
set up this way because swap file support was only added to BTRFS in
Linux 5.0.
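(For the record, on a recent enough kernel, a BTRFS swap file can
apparently be set up roughly like this. This is a sketch based on the
documented procedure, not something I have run on this machine; the
path and size are made up:

# the swap file must not be copy-on-write, so disable CoW before writing to it
truncate -s 0 /swapfile
chattr +C /swapfile
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

There are extra caveats, like keeping the file uncompressed and on a
single device, so I'm not claiming this is the whole story.)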
I fully expect BTRFS experts to yell at me now because this is an old
setup and BTRFS is so much better now, but that's exactly the point
here. That setup is not that old (2018? old? really?), and migrating
to a new partition scheme isn't exactly practical right now. But let's
move on to more practical considerations.
No builtin encryption
BTRFS aims at replacing the entire mdadm, LVM, and ext4
stack with a single entity, and adding new features like
deduplication, checksums and so on.
Yet there is one feature it is critically missing: encryption. See, my
typical stack is actually mdadm, LUKS, and then LVM and
ext4. This is convenient because I have only a single volume to
decrypt.
If I were to use BTRFS on servers, I'd need to have one LUKS volume
per-disk. For a simple RAID-1 array, that's not too bad: one extra
key. But for large RAID-10 arrays, this gets really unwieldy.
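To illustrate, here is roughly what that per-disk LUKS stacking would
look like for a two-disk BTRFS RAID-1. This is only a sketch, with
made-up device names:

# one LUKS container per member disk...
cryptsetup luksFormat /dev/sda4
cryptsetup luksFormat /dev/sdb4
cryptsetup open /dev/sda4 crypt_sda4
cryptsetup open /dev/sdb4 crypt_sdb4
# ...and BTRFS does the mirroring above the crypto layer
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt_sda4 /dev/mapper/crypt_sdb4

Each of those luksFormat/open pairs is another passphrase prompt (or
keyfile) at boot, which is exactly the annoyance I'm complaining
about.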
The obvious BTRFS alternative, ZFS, supports encryption out of
the box and mixes it above the disks so you only have one passphrase
to enter. The main downside of ZFS encryption is that it happens above
the "pool" level so you can typically see filesystem names (and
possibly snapshots, depending on how it is built), which is not the
case with a more traditional stack.
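For comparison, ZFS native encryption is just a per-dataset property;
a minimal sketch, assuming an existing pool named rpool:

# encryption is inherited by children, so one passphrase covers the subtree
zfs create -o encryption=on -o keyformat=passphrase rpool/encrypted
zfs create rpool/encrypted/home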
Subvolumes, filesystems, and devices
I find BTRFS's architecture to be utterly confusing. In the
traditional LVM stack (which is itself kind of confusing if you're new
to that stuff), you have those layers:
- disks: let's say /dev/nvme0n1 and nvme1n1
- RAID arrays with mdadm: let's say the above disks are joined in a
  RAID-1 array in /dev/md1
- volume groups or VG with LVM: the above RAID device (technically a
  "physical volume" or PV) is assigned into a VG, let's call it
  vg_tbbuild05 (multiple PVs can be added to a single VG, which is
  why that abstraction exists)
- LVM logical volumes: out of that volume group, "virtual partitions"
  or "logical volumes" are created; that is where your filesystem
  lives
- filesystem, typically with ext4: that's your normal filesystem,
  which treats the logical volume as just another block device
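To make that concrete, here is roughly how such a stack gets
assembled. This is a sketch, with device names and sizes made up to
match the example below:

# RAID-1 array over two partitions
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1p3 /dev/nvme1n1p3
# encryption layer on top of the array
cryptsetup luksFormat /dev/md1
cryptsetup open /dev/md1 crypt_dev_md1
# LVM: physical volume, volume group, then logical volumes
pvcreate /dev/mapper/crypt_dev_md1
vgcreate vg_tbbuild05 /dev/mapper/crypt_dev_md1
lvcreate -L 30G -n root vg_tbbuild05
lvcreate -L 1.5T -n srv vg_tbbuild05
# finally, the filesystems
mkfs.ext4 /dev/vg_tbbuild05/root
mkfs.ext4 /dev/vg_tbbuild05/srv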
A typical server setup would look like this:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 1.7T 0 disk
nvme0n1p1 259:1 0 8M 0 part
nvme0n1p2 259:2 0 512M 0 part
md0 9:0 0 511M 0 raid1 /boot
nvme0n1p3 259:3 0 1.7T 0 part
md1 9:1 0 1.7T 0 raid1
crypt_dev_md1 253:0 0 1.7T 0 crypt
vg_tbbuild05-root 253:1 0 30G 0 lvm /
vg_tbbuild05-swap 253:2 0 125.7G 0 lvm [SWAP]
vg_tbbuild05-srv 253:3 0 1.5T 0 lvm /srv
nvme0n1p4 259:4 0 1M 0 part
I stripped the other nvme1n1 disk because it's basically the same.
Now, if we look at my BTRFS-enabled workstation, which doesn't even
have RAID, we have the following:
- disk: /dev/sda with, again, /dev/sda4 being where BTRFS lives
- filesystem: fedora_crypt, which is, confusingly, kind of like a
  volume group. It's where everything lives. I think.
- subvolumes: home, root, /, etc. Those are actually the things that
  get mounted. You'd think you'd mount a filesystem, but no, you mount
  a subvolume. That is backwards.
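For reference, creating such a layout from scratch looks roughly like
this (a sketch; as far as I know, this is not exactly how the Fedora
installer did it):

mkfs.btrfs -L fedora /dev/mapper/fedora_crypt
# mount the top-level subvolume (id 5) to create the others inside it
mount /dev/mapper/fedora_crypt /mnt
btrfs subvolume create /mnt/root
btrfs subvolume create /mnt/home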
It looks something like this to lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931,5G 0 disk
sda1 8:1 0 200M 0 part /boot/efi
sda2 8:2 0 1G 0 part /boot
sda3 8:3 0 7,8G 0 part [SWAP]
sda4 8:4 0 922,5G 0 part
fedora_crypt 253:4 0 922,5G 0 crypt /srv
Notice how we don't see all the BTRFS volumes here? Maybe it's because
I'm mounting this from the Debian side, but lsblk
definitely gets
confused here. I frankly don't quite understand what's going on, even
after repeatedly looking around the rather dismal
documentation. But that's what I gather from the following
commands:
root@curie:/home/anarcat# btrfs filesystem show
Label: 'fedora' uuid: 5abb9def-c725-44ef-a45e-d72657803f37
Total devices 1 FS bytes used 883.29GiB
devid 1 size 922.47GiB used 916.47GiB path /dev/mapper/fedora_crypt
root@curie:/home/anarcat# btrfs subvolume list /srv
ID 257 gen 108092 top level 5 path home
ID 258 gen 108094 top level 5 path root
ID 263 gen 108020 top level 258 path root/var/lib/machines
I only got to that point through trial and error. Notice how I use an
existing mountpoint to list the related subvolumes. If I try to use
the filesystem path, the one that's listed in filesystem show, I
fail:
root@curie:/home/anarcat# btrfs subvolume list /dev/mapper/fedora_crypt
ERROR: not a btrfs filesystem: /dev/mapper/fedora_crypt
ERROR: can't access '/dev/mapper/fedora_crypt'
Maybe I just need to use the label? Nope:
root@curie:/home/anarcat# btrfs subvolume list fedora
ERROR: cannot access 'fedora': No such file or directory
ERROR: can't access 'fedora'
This is really confusing. I don't even know if I understand this
right, and I've been staring at this all afternoon. Hopefully, the
lazyweb will correct me eventually.
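One tool that helps a little is findmnt, which at least shows the
subvol= mount option for each mounted BTRFS filesystem, something
lsblk won't do. Something like:

findmnt -t btrfs -o TARGET,SOURCE,OPTIONS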
(As an aside, why are they called "subvolumes"? If something is a
"sub" of "something else", that "something else" must exist,
right? But no, BTRFS doesn't have "volumes", it only has
"subvolumes". Go figure. Presumably the filesystem still holds
"files", though; at least empirically it doesn't seem like it lost
anything so far.)
In any case, at least I can refer to this section in the future, the
next time I fumble around the btrfs command line, as I surely will. I
will possibly even update this section as I get better at it, or based
on my readers' judicious feedback.
Mounting BTRFS subvolumes
So how did I even get to that point? I have this in my /etc/fstab,
on the Debian side of things:
UUID=5abb9def-c725-44ef-a45e-d72657803f37 /srv btrfs defaults 0 2
This thankfully ignores all the subvolume nonsense because it relies
on the UUID. mount tells me that's actually the "root" (/?)
subvolume:
root@curie:/home/anarcat# mount | grep /srv
/dev/mapper/fedora_crypt on /srv type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)
Let's see if I can mount the other volumes I have on there. Remember
that subvolume list showed I had home, root, and var/lib/machines.
Let's try root:
mount -o subvol=root /dev/mapper/fedora_crypt /mnt
Interestingly, root is not the same as /, it's a different
subvolume! It seems to be the Fedora root (/, really) filesystem. No
idea what is happening here. I also have a home subvolume, let's
mount it too, for good measure:
mount -o subvol=home /dev/mapper/fedora_crypt /mnt/home
Note that lsblk doesn't notice those two new mountpoints, and that's
normal: it only lists block devices, and subvolumes (rather
inconveniently, I'd say) do not show up as devices:
root@curie:/home/anarcat# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931,5G 0 disk
sda1 8:1 0 200M 0 part
sda2 8:2 0 1G 0 part
sda3 8:3 0 7,8G 0 part
sda4 8:4 0 922,5G 0 part
fedora_crypt 253:4 0 922,5G 0 crypt /srv
This is really, really confusing. Maybe I did something wrong in the
setup. Maybe it's because I'm mounting it from outside Fedora. Either
way, it just doesn't feel right.
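If I wanted those mounts to persist, the fstab entries would
presumably look like this (untested sketch, reusing the UUID from
above):

UUID=5abb9def-c725-44ef-a45e-d72657803f37 /mnt      btrfs defaults,subvol=root 0 2
UUID=5abb9def-c725-44ef-a45e-d72657803f37 /mnt/home btrfs defaults,subvol=home 0 2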
No disk usage per volume
If you want to see what's taking up space in one of those subvolumes,
tough luck:
root@curie:/home/anarcat# df -h /srv /mnt /mnt/home
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora_crypt 923G 886G 31G 97% /srv
/dev/mapper/fedora_crypt 923G 886G 31G 97% /mnt
/dev/mapper/fedora_crypt 923G 886G 31G 97% /mnt/home
(Notice, in passing, that it looks like the same filesystem is mounted
in different places. In that sense, you'd expect /srv and /mnt
(and /mnt/home?!) to be exactly the same, but no: they are entirely
different directory structures, which I will not call "filesystems"
here because everyone's head will explode in sparks of confusion.)
Yes, disk space is shared (that's the Size and Avail columns,
makes sense). But nope, no cookie for you: they all show the same
Used column, so you need to actually walk the entire filesystem to
figure out how much space each subvolume takes.
(For future reference, that's basically:
root@curie:/home/anarcat# time du -schx /mnt/home /mnt /srv
124M /mnt/home
7.5G /mnt
875G /srv
883G total
real 2m49.080s
user 0m3.664s
sys 0m19.013s
And yes, that was painfully slow.)
ZFS actually has some oddities in that regard, but at least it tells
me how much disk each volume (and snapshot) takes:
root@tubman:~# time df -t zfs -h
Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/debian 3.5T 1.4G 3.5T 1% /
rpool/var/tmp 3.5T 384K 3.5T 1% /var/tmp
rpool/var/spool 3.5T 256K 3.5T 1% /var/spool
rpool/var/log 3.5T 2.0G 3.5T 1% /var/log
rpool/home/root 3.5T 2.2G 3.5T 1% /root
rpool/home 3.5T 256K 3.5T 1% /home
rpool/srv 3.5T 80G 3.5T 3% /srv
rpool/var/cache 3.5T 114M 3.5T 1% /var/cache
bpool/BOOT/debian 571M 90M 481M 16% /boot
real 0m0.003s
user 0m0.002s
sys 0m0.000s
That's 56360 times faster, by the way.
But yes, that's not fair: those in the know will know there's a
different command to do what df does with BTRFS filesystems, the
btrfs filesystem usage command:
root@curie:/home/anarcat# time btrfs filesystem usage /srv
Overall:
Device size: 922.47GiB
Device allocated: 916.47GiB
Device unallocated: 6.00GiB
Device missing: 0.00B
Used: 884.97GiB
Free (estimated): 30.84GiB (min: 27.84GiB)
Free (statfs, df): 30.84GiB
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:906.45GiB, Used:881.61GiB (97.26%)
/dev/mapper/fedora_crypt 906.45GiB
Metadata,DUP: Size:5.00GiB, Used:1.68GiB (33.58%)
/dev/mapper/fedora_crypt 10.00GiB
System,DUP: Size:8.00MiB, Used:128.00KiB (1.56%)
/dev/mapper/fedora_crypt 16.00MiB
Unallocated:
/dev/mapper/fedora_crypt 6.00GiB
real 0m0,004s
user 0m0,000s
sys 0m0,004s
Almost as fast as ZFS's df! Good job. But wait. That doesn't actually
tell me usage per subvolume. Notice it's filesystem usage, not
subvolume usage, which unhelpfully refuses to exist. That command
only shows that one filesystem's internal statistics, which are pretty
opaque. You can also appreciate that it's wasting 6GB of
"unallocated" disk space there: I probably did something Very Wrong
and should be punished by Hacker News. I also wonder why it has 1.68GB
of "metadata" used...
At this point, I just really want to throw that thing out of the
window and restart from scratch. I don't really feel like learning the
BTRFS internals, as they seem oblique and completely bizarre to me. It
feels a little like the state of PHP now: it's actually pretty solid,
but built upon so many layers of cruft that I still feel it corrupts
my brain every time I have to deal with it (needle or haystack first?
anyone?)...
Conclusion
I find BTRFS utterly confusing and I'm worried about its
reliability. I think a lot of work is needed on usability and
coherence before I even consider running this anywhere other than in a
lab, and that's really too bad, because there are really nice features
in BTRFS that would greatly help my workflow. (I want to use
filesystem snapshots as high-performance, high-frequency backups.)
So now I'm experimenting with OpenZFS. It's so much simpler, it just
works, and it's rock solid. After this 8-minute read, I had a
good understanding of how ZFS works. Here's the 30-second overview:
- vdev: a RAID array
- zpool: a volume group of vdevs
- datasets: normal filesystems (or block devices, if you want to use
  another filesystem on top of ZFS)
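A minimal sketch of how those pieces fit together, with made-up device
names:

# a pool with a single mirror vdev (the ZFS equivalent of RAID-1)
zpool create rpool mirror /dev/nvme0n1p3 /dev/nvme1n1p3
# datasets are created inside the pool and get mounted automatically
zfs create -o mountpoint=/srv rpool/srv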
There are also other special volumes, like caches and logs, that
you can (really easily, compared to LVM caching) use to tweak your
setup. You might also want to look at recordsize or ashift
to better fit the filesystem to your workload (or to deal with
drives lying about their sector size, I'm looking at you Samsung), but
that's it.
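Again as a sketch, with hypothetical devices, this is roughly what
those knobs look like:

# ashift has to be chosen at vdev creation time (2^12 = 4096-byte sectors)
zpool create -o ashift=12 rpool mirror /dev/nvme0n1p3 /dev/nvme1n1p3
# log (SLOG) and cache (L2ARC) devices can be added later
zpool add rpool log /dev/sdc1
zpool add rpool cache /dev/sdd1
# recordsize is a per-dataset property
zfs set recordsize=1M rpool/srv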
Running ZFS on Linux currently involves building kernel modules from
scratch on every host, which I think is pretty bad. But I was able to
set up a ZFS-only server using this excellent documentation
without too much trouble.
I'm hoping some day the copyright issues are resolved and we can at
least ship binary packages, but the politics (e.g. convincing Debian
that it is the right thing to do) and the logistics (e.g. DKMS
auto-builders? is that even a thing? how about signed DKMS packages?
fun-fun-fun!) seem really impractical. Who knows, maybe hell will
freeze over (again) and Oracle will fix the CDDL. I
personally think that we should just completely ignore this
problem (which wasn't even supposed to be a problem) and ship
binary packages directly, but I'm a pragmatist and do not always fit
well with the free software fundamentalists.
All of this to say that, short term, we don't have a reliable,
advanced filesystem/logical disk manager in Linux. And that's really
too bad.