Haack's Wiki

vmserver
Jonathan Haack
Haack's Networking
netcmnd@jonathanhaack.com

vmserver

I am currently running a Supermicro 6028U-TRTP+ with dual 12-core Xeon E5-2650 CPUs at 2.2GHz and 384GB of RAM. The primary pool is four two-way mirrors of Samsung enterprise SSDs, and the backup pool is two two-way mirrors of 16TB platters. All drives use SAS. I am using a 500W PSU. I estimated that the RAM would draw about 5-10W a stick and the mobo about 100W, with the drives consuming most of the rest at roughly 18-22W per drive.

The next step was to install Debian on the bare metal to control and manage the virtualization environment. The virtualization stack is virsh and kvm/qemu. As for the file system and drive formatting, I used LUKS and pam_mount to open an encrypted home partition and map the home directory. I use this encrypted home directory to store keys for the zfs pool and/or other sensitive data, thus protecting them behind LUKS encryption. Additionally, I create natively encrypted zfs datasets within each of the pools, which are unlocked by the keys on the LUKS home partition.

Instead of tracking down each disk's ID on your initial build, create the pool with the short device names, then export it and re-import it by ID:

zpool create -m /mnt/pool pool -f mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh
zpool export pool
zpool import -d /dev/disk/by-id pool
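
With eight or more disks it is easy to list one twice or leave a mirror without a partner. Here is a minimal sketch of a helper that assembles the create command and checks for both mistakes. The function name build_zpool_create is my own invention, and it only prints the command so it can be reviewed before being run by hand:

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a "zpool create" command that pairs
# the given disks into two-way mirrors. It refuses an odd number of disks
# or a disk that is listed more than once.
build_zpool_create() {
    pool=$1; mountpoint=$2; shift 2
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "error: need an even number of disks" >&2
        return 1
    fi
    dupes=$(printf '%s\n' "$@" | sort | uniq -d)
    if [ -n "$dupes" ]; then
        echo "error: disk listed more than once: $dupes" >&2
        return 1
    fi
    cmd="zpool create -f -m $mountpoint $pool"
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}
```

For example, build_zpool_create pool /mnt/pool sda sdb sdc sdd prints a create command with two mirrors, which can then be followed by the export and by-id import shown above.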

Once the pool is created, you can create your encrypted datasets. To do so, I made some unlock keys with the dd command and placed the keys in a hidden directory inside that LUKS encrypted home partition I mentioned above:

dd if=/dev/random of=/mnt/vault/example.key bs=1 count=32
zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///mnt/vault/example.key pool/dataset
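
A raw key must be exactly 32 bytes, so it is worth confirming the size before pointing a dataset at it. The following is a small sketch, not the method used on this server: the function name make_raw_key is hypothetical, and it reads from /dev/urandom rather than /dev/random so that it never blocks.

```shell
#!/bin/sh
# Hypothetical helper: create a 32-byte raw key that only its owner can
# read, then confirm the size before it is handed to "zfs create".
# Assumption: /dev/urandom is used here instead of /dev/random.
make_raw_key() {
    keyfile=$1
    ( umask 077
      dd if=/dev/urandom of="$keyfile" bs=32 count=1 2>/dev/null ) || return 1
    size=$(wc -c < "$keyfile" | tr -d ' ')
    if [ "$size" -ne 32 ]; then
        echo "error: $keyfile is $size bytes, expected 32" >&2
        return 1
    fi
    echo "$keyfile"
}
```

On success the function prints the path of the key, which is the path to use in keylocation=file://...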

When the system reboots, the pool will import automatically but the encrypted datasets won't mount, because their keys won't be available until you mount the LUKS home partition by logging in as the user that holds the keys. For security reasons, this must be done manually or it defeats the entire purpose. So, once the administrator has logged in as that user in a screen session (remember, it is using pam_mount), they simply detach from that session and then load the keys and mount the datasets as follows:

zfs load-key pool/dataset
zfs mount pool/dataset
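
When several datasets need unlocking, the two commands above can be wrapped in a loop. This is only a sketch: the function name unlock_datasets and the ZFS variable are my own, and setting ZFS="echo zfs" turns it into a dry run that prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: load the key for, then mount, each dataset named on
# the command line. Set ZFS="echo zfs" for a dry run that only prints.
ZFS=${ZFS:-/usr/sbin/zfs}

unlock_datasets() {
    for ds in "$@"; do
        $ZFS load-key "$ds" || { echo "load-key failed for $ds" >&2; return 1; }
        $ZFS mount "$ds"    || { echo "mount failed for $ds" >&2; return 1; }
    done
}
```

Usage would be something like unlock_datasets pool/vm1dataset pool/vm2dataset, run after the home partition holding the keys has been mounted.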

If you have a lot of data sets, you can make a simple script to load them all at once, etc. Since we have zfs, it's a good idea to run some snapshots. To do that, I created a small shell script with the following commands and then set it to run 4 times a day, or every 6 hours:

# Timestamp shared by every snapshot taken in this run
DATE=$(date +"%Y%m%d-%H:%M:%S")
/usr/sbin/zfs snapshot -r pool/vm1dataset@backup_$DATE
/usr/sbin/zfs snapshot -r pool/vm2dataset@backup_$DATE
/usr/sbin/zfs snapshot pool@backup_$DATE

Make sure to manage your snapshots and retain only as many as you need, since a large number of snapshots will impact performance. If you need to zap all of them and start over, you can use this command:

zfs list -H -o name -t snapshot | xargs -n1 zfs destroy
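
A gentler alternative is to keep the newest few snapshots of a dataset and destroy only the older ones. Here is a minimal sketch, assuming the snapshots are listed oldest first (-s creation) and only for the dataset itself (-d 1). The function names drop_newest and prune_snapshots are hypothetical.

```shell
#!/bin/sh
# Hypothetical retention helper. ZFS can be overridden for testing.
ZFS=${ZFS:-/usr/sbin/zfs}

# Print every input line except the last $1 lines.
drop_newest() {
    awk -v keep="$1" '
        { line[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print line[i] }
    '
}

# Destroy all but the newest $2 snapshots of dataset $1.
prune_snapshots() {
    dataset=$1; keep=$2
    $ZFS list -H -o name -t snapshot -s creation -d 1 "$dataset" |
        drop_newest "$keep" |
        while read -r snap; do
            $ZFS destroy "$snap"
        done
}
```

For example, prune_snapshots pool/vm1dataset 8 would keep two days of snapshots at four per day. It is worth running the zfs list pipeline without the destroy loop first to see what would be removed.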

Off-site full backups are essential, but they take a long time to download, so it's best to keep the images as small as possible. When using cp in your workflow, make sure to specify --sparse=always. Before powering the VM back up, run virt-sparsify on the image to free the blocks on the host that are not actually used in the VM. For the VM to mark those blocks as empty, make sure you are running fstrim within the VM. If you want the ls command to show the size of the virtual disk that remains after the zeroing, run qemu-img convert on it, which writes a new copy of the image without the ballooned size. The new purged virtual hard disk image can then be copied to a backup directory, where it can be tarballed and compressed to further reduce its size. I use BSD tar with pbzip2 compression, which makes ridiculously small images. GNU tar glitches with the script for some reason. BSD tar can be installed with sudo apt install libarchive-tools. I made a script to automate all of those steps for a qcow2 image, and I also adapted it to work for raw images.

vm-bu-production-QCOW-loop.sh
vm-bu-production-RAW-loop.sh
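
Separately from the two scripts above, here is a bare-bones sketch of the shrink-and-archive steps for a single powered-off qcow2 image. It is not a copy of those scripts: the function name shrink_and_archive, the RUN variable, and the output naming are all my own assumptions, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the shrink-and-archive steps for one powered-off qcow2 image.
# RUN=echo (the default here) only prints each command; set RUN= (empty)
# to actually execute them.
RUN=${RUN-echo}

shrink_and_archive() {
    image=$1; backup_dir=$2
    name=$(basename "$image")
    # Release blocks that the guest has already trimmed with fstrim.
    $RUN virt-sparsify --in-place "$image"
    # Write a compact copy so that ls reports the reduced size.
    $RUN qemu-img convert -O qcow2 "$image" "$backup_dir/$name"
    # Archive the copy with bsdtar, compressing through pbzip2.
    $RUN bsdtar -c --use-compress-program pbzip2 \
        -f "$backup_dir/$name.tar.bz2" -C "$backup_dir" "$name"
}
```

Running shrink_and_archive /mnt/vms/students/new.qcow2 /backups in the default dry-run mode prints the three commands in order, which makes it easy to check paths before letting it touch a real image.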

On the off-site backup machine, I originally would pull the tarballs down using a one line rsync script. I would adjust the cron timing of the rsync script to work well with when the tarballs are created.

sudo rsync -av --log-file=/home/logs/backup-of-vm-tarballs.log --ignore-existing -e 'ssh -i /home/user/.ssh/id_rsa' root@domain.com:/backups/tarballs/ /media/user/Backups/

Since then, I've switched to using rsnapshot to pull down the tarballs in some cases. The rsnapshot configurations can be found here:

Rsnapshot Scripts

– Network Bridge Setup / VMs –

Up until now, I've covered how to provision the machines with virt-manager, how to back up the machines on the physical host, and how to pull those backups to an off-site workstation. Now I will discuss how to assign each VM an external IP. The first step is to provision the physical host with a virtual switch (wrongly called a bridge) to which VMs can connect. To do this, I kept it simple and used ifup, the bridge-utils package, and some manual editing of /etc/network/interfaces.

sudo apt install bridge-utils
sudo brctl addbr br0
sudo nano /etc/network/interfaces

Now that you have created the virtual switch, you need to reconfigure your physical host's /etc/network/interfaces file to use the switch. In my case, I used one IP for the host itself and another for the switch, meaning that two ethernet cables are plugged into my physical host. I did this so that if I hose my virtual switch settings, I still have a separate connection to the box. Here's the configuration in interfaces:

#eth0 [1st physical port]
auto enp8s0g0
iface enp8s0g0 inet static
  address 8.25.76.160
  netmask 255.255.255.0
  gateway 8.25.76.1
  dns-nameservers 8.8.8.8

#eth1 [2nd physical port]
auto enp8s0g1
iface enp8s0g1 inet manual

auto br0
iface br0 inet static
  address 8.25.76.159
  netmask 255.255.255.0
  gateway 8.25.76.1
  bridge_ports enp8s0g1
  dns-nameservers 8.8.8.8
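
A mistyped interface name in this file is easy to miss and can leave the port down. Before restarting networking, a quick consistency check along these lines can help. It is a sketch only, the function name check_interfaces is hypothetical, and it only verifies that every name on an auto line also has an iface stanza.

```shell
#!/bin/sh
# Hypothetical helper: print interface names that appear on an "auto" line
# but have no matching "iface" stanza in the given interfaces file.
# Returns non-zero if any such name is found.
check_interfaces() {
    awk '
        $1 == "auto"  { for (i = 2; i <= NF; i++) auto[$i] = 1 }
        $1 == "iface" { iface[$2] = 1 }
        END {
            for (name in auto)
                if (!(name in iface)) { print name; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

Run it as check_interfaces /etc/network/interfaces. No output and a zero exit status mean every auto line has a matching stanza.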

After that, either reboot or run systemctl restart networking.service to apply the changes. Execute ip a and you should see both external IPs on two separate interfaces, and you should see "master br0 state UP" in the output for the second interface, enp8s0g1. You should also run some ping 8.8.8.8 and ping google.com tests to confirm you can route. If anyone wants to do this in a home, small business, or other non-public facing environment, you can easily use DHCP and provision the home/small business server's interfaces file as follows:

auto eth1
iface eth1 inet manual

auto br0
iface br0 inet dhcp
      bridge_ports eth1

The above home version allows, for example, users to have a virtual machine that gets an IP address on your LAN, which makes ssh/xrdp access far easier. If you have any trouble routing on the physical host, it could be that you do not have nameservers set up. If that's the case, do the following:

  echo nameserver 8.8.8.8 > /etc/resolv.conf
  systemctl restart networking.service

Now that the virtual switch is set up, I can provision VMs and connect them to the virtual switch br0 in virt-manager. You can provision the VMs within the GUI using X passthrough, or use the command line. First, create a virtual disk of your desired size by executing sudo qemu-img create -f raw /mnt/vms/students/new.img 1000G and then run something like this:

sudo virt-install --name=new.img \
--os-type=Linux \
--os-variant=debian10 \
--vcpus=1 \
--ram=2048 \
--disk path=/mnt/vms/students/new.img \
--graphics spice \
--location=/mnt/vms/isos/debian-11.4.0-amd64-netinst.iso \
--network bridge=br0

The machine will open in virt-viewer, but if you lose the connection you can reconnect easily with:

virt-viewer --connect qemu:///system --wait new.img
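
If you provision VMs often, the disk creation and the install can be tied together so the disk path is only typed once. This is a sketch under my own assumptions: the function name provision_vm, the VM_DIR and RUN variables, and the argument order are all hypothetical, and it defaults to a dry run that prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper: create a raw disk and start an install that uses it.
# RUN=echo (the default here) only prints; set RUN= (empty) to execute.
RUN=${RUN-echo}
VM_DIR=${VM_DIR:-/mnt/vms/students}

# Arguments: VM name, disk size, installer ISO, os-variant.
provision_vm() {
    name=$1; size=$2; iso=$3; variant=$4
    disk="$VM_DIR/$name.img"
    $RUN qemu-img create -f raw "$disk" "$size"
    $RUN virt-install --name "$name" --os-variant "$variant" \
        --vcpus 1 --memory 2048 \
        --disk "path=$disk,format=raw" \
        --graphics spice --location "$iso" --network bridge=br0
}
```

For example: provision_vm new 1000G /mnt/vms/isos/debian-11.4.0-amd64-netinst.iso debian10. Check the printed commands, then rerun with RUN= set to empty and root privileges to actually create the VM.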

Once you finish installation, configure the guestOS interfaces file sudo nano /etc/network/interfaces with the IP you intend to assign it. You should have something like this:

auto epr1
iface epr1 inet static
  address 8.25.76.158
  netmask 255.255.255.0
  gateway 8.25.76.1
  dns-nameservers 8.8.8.8

If you are creating VMs attached to a virtual switch on the smaller home/business environment, then adjust the guest OS by executing sudo nano /etc/network/interfaces and then something like this recipe:

auto epr1
iface epr1 inet dhcp

If your guest OS uses Ubuntu, you will need to do extra steps to ensure that the guestOS can route. This is because Ubuntu-based distros have deprecated ifupdown in favor of netplan and disabled manual editing of /etc/resolv.conf. So, either you want to learn netplan syntax and make interface changes using its YAML derivative, or you can install the optional resolvconf package to restore ifupdown functionality. To do this, adjust the VM provision script above (or use the virt-manager GUI with X passthrough) to temporarily use NAT then override Ubuntu defaults and restore ifupdown functionality as follows:

sudo apt install ifupdown
sudo apt remove --purge netplan.io
sudo apt install resolvconf
sudo nano /etc/resolvconf/resolv.conf.d/tail
# add this line to the file, then save: nameserver 8.8.8.8
sudo systemctl restart networking.service

You should once again execute ping 8.8.8.8 and ping google.com to confirm you can route within the guest OS. Sometimes, I find a reboot is required. At this stage, you have a physical host configured with a virtual switch, and one VM provisioned to use the switch with its own external IP. Both the physical host and the guest OS in this scenario are public facing, so take precautions to secure each by checking listening services with netstat -tulpn and/or using a firewall. The main thing to configure at this point is ssh access, so you no longer need to rely on the virt-viewer console, which is slow. To do that, you will need to add packages (if you used the netinst.iso). To make that easy, I keep the sources.list on my primary business server:

wget https://haacksnetworking.org/sources.list

Once you grab the sources.list file, install openssh-server and exchange keys, you can now use a shell to ssh into the guestOS henceforward. This means that at this point you are now in a position to create VMs and various production environments at will or start working on the one you just created. Another thing to consider is to create base VMs that have interfaces and ssh access all ready to go, and then leverage those to make new instances using cp. Alternately, you can power down a base VM and then clone it as follows:

virt-clone \
--original=clean \
--name=sequoia \
--file=/mnt/vms/students/sequoia.img
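
To stamp out several VMs from the same powered-down base, the clone command can be looped. This is a sketch only: the function name clone_vms and the RUN and VM_DIR variables are my own, and the default is a dry run that prints each virt-clone command.

```shell
#!/bin/sh
# Hypothetical wrapper: clone one base VM into several named copies.
# RUN=echo (the default here) only prints; set RUN= (empty) to execute.
RUN=${RUN-echo}
VM_DIR=${VM_DIR:-/mnt/vms/students}

# Arguments: base VM name, then one or more new VM names.
clone_vms() {
    base=$1; shift
    for name in "$@"; do
        $RUN virt-clone --original "$base" --name "$name" \
            --file "$VM_DIR/$name.img"
    done
}
```

For example, clone_vms clean sequoia prints the same command as above. Adding more names after sequoia produces one clone per name.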

The purpose of this project was to create my own virtualized VPS infrastructure (using KVM and VMs), to run my own production environments and for clients, students, and family. Here's a few to check out:

That's all folks! Well … except for one more thing. When I first did all of this, I was convinced that zfs should be within LUKS, as it was difficult for me to let go of LUKS / full disk encryption. I've now decided that's insane for one primary reason. Namely, by putting zfs (or any file system) within LUKS, you lose the hot-swappability that you have when zfs (or regular RAID) runs directly on the hardware. That would mean that replacing a hard drive would require an entire server rebuild, which is insane. However, it is arguably more secure that way, so in case budget and time permit, I've retained how I put zfs inside LUKS in the passage that follows. Proceed at your own risk lol.

– LUKS FIRST, ZFS SECOND - (LEGACY SETUP, NOT CURRENT) –

My initial idea was to do LUKS first, then zfs, meaning 6 drives could be mirrors in zfs and I would keep 1 as a spare LUKS crypt for keys, other crap, etc. To create the LUKS crypts, I did the following 6 times, each time appending the last 4 characters of the drive's UUID to the LUKS crypt name:

cryptsetup luksFormat /dev/sda
cryptsetup luksOpen /dev/sda sdafc11
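
The naming convention is simple enough to script. Here is a minimal sketch, where the function name mapper_name is hypothetical and the UUID in the usage note is a made-up placeholder. After formatting, cryptsetup luksUUID /dev/sda prints the real UUID.

```shell
#!/bin/sh
# Hypothetical helper: build the crypt name from a short device name and
# the last four characters of the LUKS UUID.
mapper_name() {
    short=$1; uuid=$2
    suffix=$(printf '%s' "$uuid" | tail -c 4)
    echo "$short$suffix"
}
```

For example, mapper_name sda 00000000-0000-0000-0000-00000000fc11 prints sdafc11, which is the name to pass to cryptsetup luksOpen.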

You then make sure to use the LUKS label names when making the zpool, not the short names, which can change at times during reboots. I did this as follows:

sudo apt install zfsutils-linux bridge-utils
zpool create -m /mnt/vms vms -f mirror sdafc11 sdb9322 mirror sdc8a33 sdh6466 mirror sde5b44 sdf8055

By default, ZFS runs its mount commands at boot. This is a problem if you don't also use auto-unlocking and key files with LUKS to unlock at boot (and/or a custom script that unlocks). The problem, in this use case, is that ZFS will try to mount the volumes before they are unlocked. The two other options are the none/legacy modes, both of which rely on you mounting the volume using traditional methods. But the whole point of finally using zfs was to not use traditional methods lol, so I investigated whether there was a fix. The closest thing to a fix is setting cachefile=none, but this a) hosed the pool once and b) requires resetting, rebooting again, and/or manually re-mounting the pool, any of which defeats the point. Key files, cache file adjustments, and none/legacy were all no-gos for me, so in the end I decided to tolerate that zfs would fail at boot and that I would zpool import the pool afterwards.

sudo -i
screen
su - user [pam_mount unlocks /home for physical host primary user and the spare 1TB vault]
ctrl-a-d [detaches from screen]

After unlocking my home directory and the spare 1TB vault, the next step is to unlock each LUKS volume. I decided a simple shell script, mount-luks.sh, would suffice. It looks like this:

cryptsetup luksOpen /dev/disk/by-uuid/2702e690-…-0c4267a6fc11 sdafc11
cryptsetup luksOpen /dev/disk/by-uuid/e3b568ad-…-cdc5dedb9322 sdb9322
cryptsetup luksOpen /dev/disk/by-uuid/d353e727-…-e4d66a9b8a33 sdc8a33
cryptsetup luksOpen /dev/disk/by-uuid/352660ca-…-5a8beae15b44 sde5b44
cryptsetup luksOpen /dev/disk/by-uuid/fa1a6109-…-f46ce1cf8055 sdf8055
cryptsetup luksOpen /dev/disk/by-uuid/86da0b9f-…-13bc38656466 sdh6466

This script simply opens each LUKS crypt so long as you enter or copy/paste your HD password 6 times. After that, one has to re-mount the pool / rebuild the quasi RAID1 mirror/logical volumes with the import command as follows once the volumes are opened:

zpool import vms

Rebooting in this manner takes about 3-5 minutes for the host, and 2 minutes to screen into my user name, detach, and run the mount LUKS script to mount the pools/datasets, etc. Again, I ultimately rejected this because you cannot use zfs tools when hard drives fail with this setup.

— oemb1905 2022/11/12 12:39
