-------------------------------------------
  * **btrfsreminders**
  * **Jonathan Haack**
  * **Haack's Networking**
-------------------------------------------
  
//btrfsreminders//
  
-------------------------------------------
  
=== Introduction ===
This tutorial is for Debian users who want to create a JBOD pool using BTRFS subvolumes, and its RAID10 equivalent. These setups are common and useful for virtualization environments and for hosting multiple services, whether for serious home hobbyist use or small-business production. They are not designed for enterprise or large-scale deployments.
  
=== Overview of Design Model ===
Encrypting the home partition is essential because it ensures that the pool key is never directly exposed; it sits behind LUKS on the boot volume, and the sysadmin keeps that credential stored offsite in KeePassXC. Thus, the physical layer is protected by LUKS with integrity. For the PAM mounting piece, I use [[https://jasonschaefer.com/encrypting-home-dir-decrypting-on-login-pam/|this method]] because it allows for easy remote reboots: there is no need to enter an FDE key at a post-BIOS FDE splash screen or to log in to IPMI each time. Instead, you encrypt home and unlock it after a remote reboot inside a screen session with ''screen'', then ''su - user'', and then detach from the session with ctrl-a-d. In short, this method provides two advantages: a LUKS-encrypted location for keys/credentials that is not exposed if a physical compromise takes place, and the use of built-in PAM and simple UNIX login infrastructure to avoid cumbersome BIOS/IPMI-level FDE unlocking after each reboot.
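For reference, the libpam-mount side of that method comes down to a single volume entry in ''/etc/security/pam_mount.conf.xml'' that unlocks the home LUKS container with your login password and mounts it at login. This is only a minimal sketch (the username and UUID below are placeholders; follow the linked article for the full procedure):

<code xml>
<!-- /etc/security/pam_mount.conf.xml, inside the <pam_mount> element -->
<!-- unlock the LUKS container holding the user's home with the login password and mount it -->
<volume user="user" fstype="crypt"
        path="/dev/disk/by-uuid/REPLACE-WITH-HOME-LUKS-UUID"
        mountpoint="/home/user" />
</code>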
  
=== Installation Instructions ===
Let's install btrfs, LUKS, and identify your hard drives:
  
  sudo apt-get install cryptsetup libpam-mount btrfs*
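To identify the drives and grab their stable IDs, something like the following works (a generic sketch, not specific to this box; the wwn-style paths used later in this page come from ''/dev/disk/by-id''):

  lsblk -o NAME,SIZE,MODEL,SERIAL,WWN
  ls -l /dev/disk/by-id/ | grep -v part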
  btrfs property set /mnt/wh compression zstd:3
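
To confirm the property stuck, you can read it back with:

  btrfs property get /mnt/wh compression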
      
=== Maintenance and Monitoring ===
Once that's done and you've rebooted and tested things a few times, you can safely make a mount script for remote rebooting. This way, you reboot, log in to your user and detach, run a simple script to unlock and mount the BTRFS subvolumes ... and you are done! Create the script with ''nano /usr/local/bin/btrfs-mount-datasets.sh'', lock it down with ''chmod 750 /usr/local/bin/btrfs-mount-datasets.sh'', and enter something like:
  
  mount -o compress=zstd:3,noatime,autodefrag,space_cache=v2,discard=async,commit=120 /dev/mapper/hdd1 /mnt/wh
      
This script is designed to be run manually post-reboot. In order: you reboot, log in to the admin user via ssh, unlock the crypt key directory with ''screen'' and then ''su - sexa'', and detach with ctrl-a-d. After detaching, simply run the mount script with ''/bin/bash /usr/local/bin/btrfs-mount-datasets.sh''. In the weeks ahead, it is essential to regularly scrub the pools. For that, put the following commands on a cronjob (a sample crontab entry follows the commands below):

  /usr/bin/btrfs scrub start /mnt/vm
  /usr/bin/btrfs scrub start /mnt/wh

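A sample root crontab entry for that (the monthly 03:00 schedule is just my suggestion; pick whatever cadence suits your pools):

  # crontab -e as root: scrub both pools on the first of each month
  0 3 1 * * /usr/bin/btrfs scrub start /mnt/vm
  0 3 1 * * /usr/bin/btrfs scrub start /mnt/wh
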
To check the status, you use:

  /usr/bin/btrfs scrub status /mnt/vm
  /usr/bin/btrfs scrub status /mnt/wh

In addition to scrubbing, I compiled a slew of commands to assess pool health more granularly. I put this script on a cronjob that runs every hour and sends me a statistics report (a sample cron entry follows the script):

<code bash>
#!/bin/bash
# Hourly pool-health report: prints to stdout (so cron can mail it) and appends to a log.
DATE=$(date +"%Y%m%d-%H:%M:%S")
LOG="/root/vitals.log"

# Duplicate everything below into the log file while still emitting it on stdout.
exec > >(tee -a "$LOG") 2>&1

echo "=== Vitals report $DATE ==="

echo "Here are the RAM usage stats ..."
free -h

echo "Here are the btrfs stats for the vm pool ..."
btrfs filesystem show /mnt/vm
btrfs filesystem df /mnt/vm
btrfs filesystem usage /mnt/vm
btrfs device usage /mnt/vm
btrfs scrub status /mnt/vm
btrfs device stats /mnt/vm
btrfs device stats /mnt/vm -c
mount | grep /mnt/vm
dmesg | grep -i btrfs | tail -n 40
dmesg | grep -E 'sd[acdefgh]' | tail -n 30
btrfs fi show /mnt/vm | grep -i missing
btrfs fi df -h /mnt/vm
btrfs fi usage -T /mnt/vm
btrfs qgroup show /mnt/vm 2>/dev/null
btrfs subvolume list -a /mnt/vm
btrfs balance status /mnt/vm

echo "Here are the btrfs stats for the wh pool ..."
btrfs filesystem show /mnt/wh
btrfs filesystem df /mnt/wh
btrfs filesystem usage /mnt/wh
btrfs device usage /mnt/wh
btrfs scrub status /mnt/wh
btrfs device stats /mnt/wh
btrfs device stats /mnt/wh -c
mount | grep /mnt/wh
dmesg | grep -i btrfs | tail -n 40
dmesg | grep -E 'sd[acdefgh]' | tail -n 30
btrfs fi show /mnt/wh | grep -i missing
btrfs fi df -h /mnt/wh
btrfs fi usage -T /mnt/wh
btrfs qgroup show /mnt/wh 2>/dev/null
btrfs subvolume list -a /mnt/wh
btrfs balance status /mnt/wh

# SAS/SCSI drives report "Current Drive Temperature" in smartctl -a output.
for disk in \
  /dev/disk/by-id/wwn-0x5002538a98416870 \
  /dev/disk/by-id/wwn-0x5002538a98356f30 \
  /dev/disk/by-id/wwn-0x5002538a983571d0 \
  /dev/disk/by-id/wwn-0x5002538a0840a300 \
  /dev/disk/by-id/wwn-0x5002538a98356500 \
  /dev/disk/by-id/wwn-0x5002538a98356590 \
  /dev/disk/by-id/wwn-0x5002538a084065d0 \
  /dev/disk/by-id/wwn-0x5002538a98357220 \
  /dev/disk/by-id/wwn-0x5000c500d775df03 \
  /dev/disk/by-id/wwn-0x5000c500d7694517 \
  /dev/disk/by-id/wwn-0x5000c500d7771943 \
  /dev/disk/by-id/wwn-0x5000c500cb1689e3; do
  temp=$(sudo smartctl -a "$disk" | grep 'Current Drive Temperature' | awk '{print $4}')
  echo "$disk: ${temp:-N/A}°C"
done

# SATA SSDs report temperature as SMART attribute 194 instead.
for disk in \
  /dev/disk/by-id/ata-SATA_SSD_22100512800207 \
  /dev/disk/by-id/ata-SATA_SSD_22100512800205; do
  temp=$(sudo smartctl -a "$disk" | grep '^194 Temperature_Celsius' | head -n 1 | awk '{print $10}')
  echo "$disk: ${temp:-N/A}°C"
done
</code>
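How the hourly report actually reaches you depends on your cron and mail setup; one common approach is to let cron mail the script's stdout. A sketch, assuming you saved the script as ''/usr/local/bin/btrfs-vitals.sh'' (that path and the address below are my own placeholders, not part of the original setup):

  # root crontab: run the vitals script at the top of every hour; cron mails the output
  MAILTO=you@example.org
  0 * * * * /bin/bash /usr/local/bin/btrfs-vitals.sh
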
If you use the script above, you will also need to run ''sudo apt install smartmontools'' to get the hard drive temperatures. You will also need to adjust the disk IDs to match your drives and check the output from grep and awk, as those particular strings came after hours of trial and error on these particular hard drives. Again, there are no entries in fstab because this tutorial presumes your hardware is offsite and that you must keep your volumes physically secure. This is why we mount the boot OS's home directory manually post-reboot to make the key directory available (using screen as described above), and then manually unlock the LUKS volumes and mount the BTRFS R10 subvolumes (with the mount script above) after a successful reboot. This balances security and convenience. The best part? Gone are the terrible, non-native zfs speeds and RAM consumption!

<code bash>
root@net:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           377Gi       151Gi       2.6Gi       131Gi       357Gi       225Gi
Swap:             0B          0B          0B
root@net:~# /usr/bin/btrfs scrub status /mnt/wh
UUID:             b2867f1b-cfb3-4597-ac5b-3a48f5eb1d04
Scrub started:    Sun Feb  1 08:24:42 2026
Status:           finished
Duration:         4:16:49
Total to scrub:   14.33TiB
Rate:             948.10MiB/s
Error summary:    no errors found
</code>
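Relatedly, if the temperature loops in the vitals script print empty values on your hardware, dump the raw SMART output first and find the right line to grep/awk for your particular drives (the device path here is just a placeholder):

  sudo smartctl -a /dev/disk/by-id/wwn-0xEXAMPLE | grep -iE 'temperature'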
To test or compare your new pool's speed to your prior setup, or just to obtain some benchmarks, I recommend using ''fio''. Here's what you can do:

  sudo apt install fio
  sudo fio --name=seqread --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/mnt/vm/testfile
  sudo fio --name=seqwrite --rw=write --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/mnt/vm/testfile
  sudo fio --name=seqread --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/mnt/wh/testfile
  sudo fio --name=seqwrite --rw=write --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/mnt/wh/testfile
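
If you also want a random-I/O number, which is closer to typical VM workloads, the same flags work with a smaller block size; this variant is my own addition, not part of the runs below:

  sudo fio --name=randread --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 --size=4g --numjobs=8 --runtime=60 --group_reporting --filename=/mnt/vm/testfile

Remember to remove the testfiles (''rm /mnt/vm/testfile /mnt/wh/testfile'') once you are done benchmarking.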
With zfs on my production server, I found I was still getting the read speed of a single hard drive, despite the presumed parallelization benefits of having 8 enterprise SAS SSDs in an R10 pool?! Since I migrated to BTRFS, the speeds are near the hardware-level caps. Here's the read test:

<code bash>
seqread: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=32
...
fio-3.39
Starting 8 processes
seqread: Laying out IO file (1 file / 4096MiB)
Jobs: 8 (f=8): [R(8)][100.0%][r=5797MiB/s][r=46.4k IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=8): err= 0: pid=2279596: Sun Feb  8 08:58:58 2026
  read: IOPS=42.1k, BW=5264MiB/s (5520MB/s)(32.0GiB/6225msec)
    slat (usec): min=11, max=28981, avg=106.92, stdev=402.06
    clat (usec): min=43, max=53886, avg=5831.38, stdev=4638.03
     lat (usec): min=183, max=53910, avg=5938.30, stdev=4655.29
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  553], 10.00th=[  783], 20.00th=[ 1778],
     | 30.00th=[ 3064], 40.00th=[ 4113], 50.00th=[ 5080], 60.00th=[ 6063],
     | 70.00th=[ 7242], 80.00th=[ 8717], 90.00th=[11469], 95.00th=[14615],
     | 99.00th=[21365], 99.50th=[24773], 99.90th=[34341], 99.95th=[39060],
     | 99.99th=[46924]
   bw (  MiB/s): min= 4508, max= 6109, per=100.00%, avg=5373.48, stdev=52.77, samples=96
   iops        : min=36066, max=48878, avg=42987.83, stdev=422.13, samples=96
  lat (usec)   : 50=0.01%, 250=0.27%, 500=3.68%, 750=5.41%, 1000=4.25%
  lat (msec)   : 2=7.99%, 4=17.26%, 10=46.94%, 20=12.77%, 50=1.41%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.22%, sys=28.68%, ctx=252056, majf=0, minf=8262
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=5264MiB/s (5520MB/s), 5264MiB/s-5264MiB/s (5520MB/s-5520MB/s), io=32.0GiB (34.4GB), run=6225-6225msec
</code>

Here's the write test:
<code bash>
seqwrite: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=32
...
fio-3.39
Starting 8 processes
seqwrite: Laying out IO file (1 file / 4096MiB)
Jobs: 6 (f=6): [W(6),_(2)][95.5%][w=1611MiB/s][w=12.9k IOPS][eta 00m:01s]
seqwrite: (groupid=0, jobs=8): err= 0: pid=2279720: Sun Feb  8 08:59:26 2026
  write: IOPS=12.2k, BW=1529MiB/s (1603MB/s)(32.0GiB/21431msec); 0 zone resets
    slat (usec): min=38, max=33255, avg=595.61, stdev=1120.04
    clat (usec): min=176, max=96135, avg=18562.40, stdev=10492.59
     lat (usec): min=264, max=96296, avg=19158.01, stdev=10752.61
    clat percentiles (usec):
     |  1.00th=[ 3490],  5.00th=[ 5342], 10.00th=[ 6849], 20.00th=[11600],
     | 30.00th=[14222], 40.00th=[15139], 50.00th=[15795], 60.00th=[16581],
     | 70.00th=[19792], 80.00th=[24773], 90.00th=[33817], 95.00th=[40633],
     | 99.00th=[53216], 99.50th=[57410], 99.90th=[67634], 99.95th=[71828],
     | 99.99th=[79168]
   bw (  MiB/s): min= 1074, max= 2563, per=100.00%, avg=1790.51, stdev=55.14, samples=311
   iops        : min= 8597, max=20509, avg=14323.95, stdev=441.12, samples=311
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.14%, 4=1.40%, 10=15.01%, 20=53.91%, 50=27.91%
  lat (msec)   : 100=1.60%
  cpu          : usr=2.07%, sys=64.47%, ctx=142306, majf=0, minf=20562
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
  
Run status group 0 (all jobs):
  WRITE: bw=1529MiB/s (1603MB/s), 1529MiB/s-1529MiB/s (1603MB/s-1603MB/s), io=32.0GiB (34.4GB), run=21431-21431msec
</code>
  
In lay terms, these reports confirm that the read speed is 5,520 MB/s (5.5 GB/s) and the write speed is 1,603 MB/s (1.6 GB/s). That is roughly a 4x improvement for reads and a 2x improvement for writes compared to zfs. For whatever reason, zfs was not benefiting from the parallelization. It's possible that I could get zfs to perform better with tinkering, but why bother? Every major upgrade I have to re-compile it with dkms against the new kernel headers, which takes forever. Additionally, zfs gobbles up my RAM (with BTRFS, the box sits at about 40% RAM usage by comparison). Lastly, I am not a fan of zfs-send / zfs-receive or its snapshotting tools ... I use rsync and rsnapshot for my needs and have no use for them. So, although it might be possible to fix the old zfs setup, there's no value in doing so. Natively supported filesystems work out of the box and don't require janky re-compilation with dkms. Also, the benchmarks above speak for themselves!
  
 --- //[[alerts@haacksnetworking.org|oemb1905]] 2026/02/08 15:49//