
  • vmserver
  • Jonathan Haack
  • Haack's Networking
  • netcmnd@jonathanhaack.com

This tutorial documents the steps I took to create an entry-level enterprise VM server using a minimal Debian install on a SuperMicro host with 96GB of RAM and two 8-core dual-thread CPUs (32 threads total). I estimate that this system can handle up to 28 VMs with 1 CPU core and 3GB RAM each, and/or 4 larger VMs with 8 CPU cores and 16GB RAM each. My intent is to offer BigBlueButton instances to a few small schools and/or educational organizations. I began setup with a 120GB SSD boot volume on which I installed Debian Bullseye. For the boot volume, I wanted some physical protection over the contents stored at the Data Center where I keep the server, so I made the / partition only 32GB and saved the rest for a LUKS crypt holding my home directory. If you are unclear on how to use pam_mount and LUKS together to unlock a home directory crypt, check Jason's tutorial here. Since the server lives at a Data Center, this way I can reboot the remote system easily and still have some content protection if the physical device is compromised. I doubt that will happen, but why not - it just works.

I also had 7 usable leftover bays, so I devoted another 1TB drive to a second crypt just for kicks, leaving 6 drives for a zfs pool. My preference is to have LUKS underneath zfs, which I did as follows:

cryptsetup luksFormat /dev/sda
cryptsetup luksOpen /dev/sda sdafc11
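
The remaining five pool drives get the same two commands. A loop along these lines would be equivalent to doing it by hand (this is only a sketch - the drive names match the ones used later in this tutorial, and the suffix convention is explained just below):

for dev in sdb sdc sde sdf sdh; do
  cryptsetup luksFormat /dev/$dev
  # grab the new LUKS UUID and use its last four characters as the label suffix
  uuid=$(blkid -s UUID -o value /dev/$dev)
  cryptsetup luksOpen /dev/$dev "${dev}${uuid: -4}"
done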

To keep track of which devices were used and how, I appended the last four characters of the block ID to the crypt label, as you see above. To find the corresponding block ID, just ls -lah /dev/disk/by-uuid and/or run blkid. I repeated this for each of the 6 drives (sdb, sdc, etc.) I intended to mirror and pool with zfs. Once these crypts were all created and opened (not mounted, just opened), I created the zfs pool as follows:

sudo apt install zfsutils-linux zfs-dkms   # ZFS lives in Debian's contrib repo
zpool create -m /mnt/vms vms -f mirror sdafc11 sdb9322 mirror sdc8a33 sdh6466 mirror sde5b44 sdf8055
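
Before moving on, a few read-only commands confirm the layout (these complement the df -h check described next):

zpool status vms   # shows the three mirror vdevs and their member names
zpool list vms     # reports raw size, allocation, and free space
df -h /mnt/vms     # confirms the mountpoint and usable capacity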

To make sure the pool was created correctly, I ran df -h to check its mountpoint and size. I paired two 2TB drives, two 1TB drives, and another two 1TB drives for roughly 4TB usable, since the other 4TB goes to redundancy on the zfs mirrors (the RAID1 equivalent). I do not want these drives to unlock automatically with a key file for two reasons: 1) security (I suppose I could stash a key in the extra crypt, but nah … why?) and 2) if a drive breaks, the whole system could potentially fail to boot. I also don't want to use the mountpoint=legacy or mountpoint=none options because they change how zfs behaves in ways I don't like.

However, this means that the automatic pool import zfs performs on boot via zfs-import-cache.service will fail. After much searching for a way to stop this failure while preserving the otherwise automatic zfs features, I determined that there was no reliable way to do it. The closest was setting cachefile=none once the pool was created, which did stop the pool and the service from starting, but when I pointed the cachefile back at its default location, the pool never mounted again on boot. For this reason, I decided that waiting 10 seconds and letting zfs-import-cache.service fail at boot was entirely acceptable. So, here is how I set everything up on the server after a reboot:

sudo -i
screen
su - user       # pam_mount unlocks /home for the host's primary user and the spare 1TB vault
# ctrl-a d      detaches from the screen session

After unlocking my home directory and the spare 1TB vault, the next step is to unlock each LUKS volume; I decided a simple shell script would suffice, which looks like this:

cryptsetup luksOpen /dev/disk/by-uuid/2702e690-...-0c4267a6fc11 sdafc11
cryptsetup luksOpen /dev/disk/by-uuid/e3b568ad-...-cdc5dedb9322 sdb9322
cryptsetup luksOpen /dev/disk/by-uuid/d353e727-...-e4d66a9b8a33 sdc8a33
cryptsetup luksOpen /dev/disk/by-uuid/352660ca-...-5a8beae15b44 sde5b44
cryptsetup luksOpen /dev/disk/by-uuid/fa1a6109-...-f46ce1cf8055 sdf8055
cryptsetup luksOpen /dev/disk/by-uuid/86da0b9f-...-13bc38656466 sdh6466

Obviously not all of my drives have … in the middle of the block ID - this is just me obfuscating because I am paranoid. Also, even though I used short names like sda, sdb, etc., for convenience when setting up the LUKS volumes, I deliberately did not do so here, because those names can change across reboots. That is why I did two things: first, I appended the last four characters of the block ID to each LUKS device name, and second, I had the script open each drive by its block ID rather than its short name. This ensures that if, for some odd reason, Debian decides to rename sda to sdb or whatever, the LUKS volumes will still open properly. Also note that since the zpool was created with the LUKS names, there is no way the pool could start accidentally or incorrectly with the short names (which some users report on SE but which makes no sense to me). At any rate, I simply copy/paste the password 6 times, and then re-import the pool once the volumes are opened:

zpool import vms
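
Since the copy/paste above suggests one shared passphrase on all six drives, the unlock script could also prompt once and feed that passphrase to each luksOpen before importing the pool. This is only a sketch, not what I actually run - the UUIDs are the truncated placeholders from the script above, and it assumes an interactively-set passphrase that is identical on every drive:

#!/bin/bash
# sketch: prompt for the shared LUKS passphrase once, unlock all six drives, import the pool
read -r -s -p "LUKS passphrase: " PASS; echo
while read -r name uuid; do
  printf '%s' "$PASS" | cryptsetup luksOpen "/dev/disk/by-uuid/$uuid" "$name" --key-file=-
done <<'EOF'
sdafc11 2702e690-...-0c4267a6fc11
sdb9322 e3b568ad-...-cdc5dedb9322
sdc8a33 d353e727-...-e4d66a9b8a33
sde5b44 352660ca-...-5a8beae15b44
sdf8055 fa1a6109-...-f46ce1cf8055
sdh6466 86da0b9f-...-13bc38656466
EOF
unset PASS
zpool import vms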

Altogether, rebooting this server takes about 4 minutes of wait time, plus about 2 minutes to mount my home directory and 1TB vault and re-import my zpool. I know this sounds a bit old school, but it ensures that I don't have to travel 63 miles to my data center to figure out what happened if a drive fails. In most failure scenarios I will know what caused it, and since I can still boot, I can respond, whereas more complex setups (like key-file unlocking of LUKS on a failed drive) may simply fail to boot. And because I have offsite backups of the VM and hard drive .img files, I can easily destroy the zpool and run a smaller pool with only 4 drives (to keep uptime and production going) while I order a new pair of drives and schedule time to visit the center and replace them. Also, if/when the server does not start - remember, that could be a hard drive, or it could be something else - this workflow means you always know what happened (unless the boot volume crashes). Of course, the center has remote KVM and so on, but why rely on such kludgy access and increase the chances of failure … this is cleaner, with fewer points of failure. I do realize, however, that this won't scale to 100X this size, or even 50X, but again - this is entry-level enterprise for advanced self-hosters and/or robust residential use cases that just exceed normal residential parameters.

Phew … that was a mouthful … but if I don't write it down, I won't remember what I did lmao. Criticism welcome. Also, I should have started with this, but I only began calculating mid-game, when some 3TB drives failed and I was getting PME errors often. Total power dissipation is estimated at 150W for the host, 250-300W for the fkn RAM, and 50-100W for the HDDs, which is close enough to 500W total, and the build was tested and rebooted multiple times for stability before going into the Data Center for production. Conversely, when I set up a 20TB / 10TB-actual zpool, it failed repeatedly, sometimes powering up 4 drives, other times 6, sometimes none. Some of this, but not all, was SED drive related - more on that later. GG.

– Alternate Setup –

I am now tinkering with native zfs encryption. So, I destroyed the LUKS pools above and made a zfs pool with the same command, but using the regular short names and no LUKS. After that, I created two encrypted datasets, each of which unlocks/mounts by pulling a dd-generated key from the spare 1TB crypt (which itself is only unlocked by hand after boot). This will be better if a hard drive in the pool fails and I need to do hot fixes. Here are the commands I ran after using the above-mentioned zpool command on the regular short names:

dd if=/dev/random of=/mnt/vault/example.key bs=1 count=32
zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///mnt/vault/example.key pool/dataset
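
To confirm the dataset really is encrypted and pointed at the right key file, the relevant properties can be read back (substitute the actual pool/dataset name):

zfs get encryption,keyformat,keylocation,keystatus pool/dataset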

When you create the dataset on the running system, zfs also loads the key and mounts it for you as a courtesy, but after a reboot you need to do the following:

zfs load-key pool/dataset
zfs mount pool/dataset
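
As more encrypted datasets are added, this load-key/mount step is easy to script; the dataset names below are placeholders, and the sketch assumes the vault holding the key files has already been unlocked:

#!/bin/bash
# sketch: load keys and mount each encrypted dataset once the key vault is available
for ds in pool/dataset1 pool/dataset2; do   # placeholder dataset names
  zfs load-key "$ds" && zfs mount "$ds"
done
# or, since all keylocations live on the vault, simply:
# zfs load-key -a && zfs mount -a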

As noted, a short shell script like the sketch above handles this as more datasets are created. As for backing up the .img files, I do the following. Bear in mind that cp handles sparse files natively; it seeks over the holes in an .img file so the copy is no larger than the space actually used, rather than what is allocated (as long as the target filesystem also supports sparse files). However, scp does not have this functionality and rsync is slow. So, I settled on two approaches - live copies and sane/offline copies:

nano /usr/local/bin/sane-image-maker.sh
chmod 750 /usr/local/bin/sane-image-maker.sh

In the script, enter the following (adjust as needed):

#!/bin/bash
# sane (offline) backup: shut the guest down cleanly, copy its disk image, compress, restart
DATE=$(date +"%Y%m%d-%H:%M:%S")
virsh shutdown image1
# virsh shutdown returns immediately, so poll until the guest is actually off
until virsh domstate image1 | grep -q "shut off"; do sleep 5; done
cp -ar /mnt/vms/image1.img /backups/image1.img.bak_SANE_$DATE
tar --use-compress-program=pbzip2 -cSf /backups/tarballs/image1_$DATE.tar.bz2 /backups/image1.img.bak_SANE_$DATE
virsh start image1

I also run a "quick" (live) version of this script, with the virsh commands commented out, daily at 1am. These live daily backups are fairly stable, since virt-manager/libvirt instructs the guest to settle active processes while the .img file is copied with cp -ar, preserving as much sanity in the image as possible. Sane (offline) imaging is still better, but not always realistic, so I only run the sane backups on a weekly schedule at 3:30am; that said, I tested many live images and all of them rebooted in virt-manager just fine. There are three sources of backups and restore points (a sample crontab follows the list):

  • 1) zfs runs snapshots on all pools/datasets 4 times a day, every 6 hours on cron
  • 2) the above script provides 1 per day quick image, 1 per week sane image on separate pool
  • 3) an offsite host runs rsync against the tarballs directory, pulling all images offsite daily
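
A crontab along these lines would match the schedule described above. It is illustrative only: the quick-image-maker.sh name, the weekly day, and the snapshot naming are assumptions, and the zfs snapshots could just as well come from a dedicated script.

# illustrative root crontab (script names and the weekly day are assumptions)
0 1 * * *    /usr/local/bin/quick-image-maker.sh    # daily live copy at 1:00am
30 3 * * 0   /usr/local/bin/sane-image-maker.sh     # weekly sane copy at 3:30am (Sunday)
0 */6 * * *  /usr/sbin/zfs snapshot -r vms@auto-$(date +\%Y\%m\%d-\%H\%M)   # snapshots every 6 hours
# the offsite rsync pull runs from the remote host's own crontab, not this one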

Okay, this should allow me to restore functionality to almost any VM system with a corrupted .img file in 2-4 hours, often much less.
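
For reference, the tarball route of a restore might look something like this (names and dates are placeholders, and it assumes the sane script above created the archive; rolling back a zfs snapshot is the other, faster option when the pool itself is healthy):

# illustrative restore from a weekly sane tarball; <DATE> is a placeholder
cd /backups/tarballs
tar --use-compress-program=pbzip2 -xSf image1_<DATE>.tar.bz2   # GNU tar strips the leading / on extraction
mv backups/image1.img.bak_SANE_<DATE> /mnt/vms/image1.img
virsh start image1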

oemb1905 2021/11/03 11:35
