computing:btrfsreminders [2026/02/08 16:06] (current) - oemb1905
=== Overview of Design Model ===
Encrypting the home partition is essential because it ensures that the pool key is never directly exposed; it's behind LUKS on the boot volume, and the sysadmin keeps this credential stored in KeePassXC offsite. Thus, the physical layer is protected by LUKS with integrity. As for PAM's mounting utilities, I use [[https://
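The link above is truncated in this revision, but for context, a LUKS-protected home of the kind described is typically wired up through ''/etc/crypttab'' and ''/etc/fstab''. The mapper name, UUID, and filesystem below are hypothetical placeholders, not the author's actual layout:
<code bash>
# /etc/crypttab -- hypothetical entry: prompt for the LUKS passphrase at boot
# <name>       <device>                                    <key>  <options>
securehome     UUID=00000000-0000-0000-0000-000000000000   none   luks

# /etc/fstab -- mount the unlocked mapper device as /home
/dev/mapper/securehome   /home   ext4   defaults   0   2
</code>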
=== Installation Instructions ===
echo "Here are the RAM usage stats ..." >> $LOG
free -h
| - | |||
| - | #echo "Here are the zfs stats ..." >> $LOG | ||
| - | #zpool status -v | ||
| - | #zpool iostat -v | ||
| - | #zpool list -v | ||
| - | #zfs list -ro space | ||
echo "Here are the btrfs stats for the vm pool ..." >> $LOG
btrfs filesystem show /mnt/vm
btrfs filesystem df /mnt/vm
btrfs filesystem usage /mnt/vm
btrfs device usage /mnt/vm
btrfs scrub status /mnt/vm
btrfs device stats /mnt/vm
btrfs device stats /mnt/vm -c
mount | grep /mnt/vm
dmesg | grep -i btrfs | tail -n 40
dmesg | grep -E '
btrfs fi show /mnt/vm | grep -i missing
btrfs fi df -h /mnt/vm
btrfs fi usage -T /mnt/vm
btrfs qgroup show /mnt/vm 2>/dev/null
btrfs subvolume list -a /mnt/vm
btrfs balance status /mnt/vm
echo "Here are the btrfs stats for the wh pool ..." >> $LOG
btrfs filesystem show /mnt/wh
btrfs filesystem df /mnt/wh
btrfs filesystem usage /mnt/wh
btrfs device usage /mnt/wh
btrfs scrub status /mnt/wh
btrfs device stats /mnt/wh
btrfs device stats /mnt/wh -c
mount | grep /mnt/wh
dmesg | grep -i btrfs | tail -n 40
dmesg | grep -E '
btrfs fi show /mnt/wh | grep -i missing
btrfs fi df -h /mnt/wh
btrfs fi usage -T /mnt/wh
btrfs qgroup show /mnt/wh 2>/dev/null
btrfs subvolume list -a /mnt/wh
btrfs balance status /mnt/wh
for disk in \
</code>
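If you want the report generated on a schedule, a root cron entry is one option. The script path below is a hypothetical stand-in for wherever you saved the script above, while ''/mnt/vm'' and ''/mnt/wh'' are the mount points it already monitors:
<code bash>
# hypothetical root crontab (crontab -e as root)
# run the stats script daily at 06:00
0 6 * * *  /usr/local/bin/btrfs-stats.sh
# start a monthly scrub of each pool so `btrfs scrub status` has fresh results
0 3 1 * *  /usr/bin/btrfs scrub start /mnt/vm
0 3 1 * *  /usr/bin/btrfs scrub start /mnt/wh
</code>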
If you use the script above, you will also need to ''
<code bash>
root@net:~# free -h
Mem:
Swap:
root@net:~# /
UUID:
Scrub started:
Status:
Duration:
Total to scrub:
Rate:
Error summary:
</code>
To test or compare your new pool's speed to your prior setup, or just to obtain some benchmarks, I recommend using ''fio'':
sudo apt install fio
sudo fio --name=seqread --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/
sudo fio --name=seqwrite --rw=write --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/
sudo fio --name=seqread --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/
sudo fio --name=seqwrite --rw=write --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/
| + | |||
With zfs on my production server, I found I was still getting the read speed of a single hard drive, despite the presumed parallelization benefits of having 8 enterprise SAS SSDs in a RAID10 pool. Since I migrated to btrfs, the speeds are near the hardware-level caps. Here's the read test:
| + | |||
<code bash>
seqread: (g=0): rw=read, bs=(R) 128KiB-128KiB,
...
fio-3.39
Starting 8 processes
seqread: Laying out IO file (1 file / 4096MiB)
Jobs: 8 (f=8): [R(8)][100.0%][r=5797MiB/
seqread: (groupid=0, jobs=8): err= 0: pid=2279596:
  read: IOPS=42.1k, BW=5264MiB/
    slat (usec): min=11, max=28981, avg=106.92, stdev=402.06
    clat (usec): min=43, max=53886, avg=5831.38,
     lat (usec): min=183, max=53910, avg=5938.30,
    clat percentiles (usec):
     | 30.00th=[ 3064], 40.00th=[ 4113], 50.00th=[ 5080], 60.00th=[ 6063],
     | 70.00th=[ 7242], 80.00th=[ 8717], 90.00th=[11469],
     | 99.00th=[21365],
     | 99.99th=[46924]
   bw (  MiB/s): min= 4508, max= 6109, per=100.00%,
   lat (usec)   :
   lat (msec)   :
   lat (msec)   :
  cpu          : usr=2.22%, sys=28.68%, ctx=252056, majf=0, minf=8262
  IO depths    :

Run status group 0 (all jobs):
   READ: bw=5264MiB/
</code>
| + | |||
Here's the write test:
<code bash>
seqwrite: (g=0): rw=write, bs=(R) 128KiB-128KiB,
...
fio-3.39
Starting 8 processes
seqwrite: Laying out IO file (1 file / 4096MiB)
Jobs: 6 (f=6): [W(6),
seqwrite: (groupid=0, jobs=8): err= 0: pid=2279720:
  write: IOPS=12.2k, BW=1529MiB/
    slat (usec): min=38, max=33255, avg=595.61, stdev=1120.04
    clat (usec): min=176, max=96135, avg=18562.40,
     lat (usec): min=264, max=96296, avg=19158.01,
    clat percentiles (usec):
     | 30.00th=[14222],
     | 70.00th=[19792],
     | 99.00th=[53216],
     | 99.99th=[79168]
   bw (  MiB/s): min= 1074, max= 2563, per=100.00%,
   lat (usec)   :
   lat (msec)   :
   lat (msec)   :
  cpu          : usr=2.07%, sys=64.47%, ctx=142306, majf=0, minf=20562
  IO depths    :

Run status group 0 (all jobs):
  WRITE: bw=1529MiB/
</code>
In lay terms, these reports confirm that read speed is 5,520 MB/s (5.5 GB/s) and write speed is 1,603 MB/s (1.6 GB/s): roughly a 4x improvement for reads and a 2x improvement for writes compared to zfs. For whatever reason, zfs was not benefiting from the parallelization. It's possible that I could get zfs to perform better with tinkering, but why bother? Every major upgrade I have to re-compile it with dkms against the new kernel headers, which takes forever. Additionally,
| + | --- // | ||