  * **Jonathan Haack**
  * **Haack's Networking**
  * **webmaster@haacksnetworking.org**
  
-------------------------------------------
  
//vmserver//
  
-------------------------------------------

This tutorial covers how to set up a production server intended to be used as a virtualization stack for a small business or educator. I am currently running a Supermicro 6028U-TRTP+ with dual 12-core Xeon E5-2650 CPUs at 2.2GHz, 384GB of RAM, four two-way mirrors of Samsung enterprise SSDs for the primary vdev, and two two-way mirrors of 16TB platters for the backup vdev. All drives use SAS. I am using a 500W PSU: I estimated the RAM at roughly 5-10W per stick, the motherboard at about 100W, and the drives consuming most of the rest at roughly 18-22W per drive. The next step was to install Debian on the bare metal to control and manage the virtualization environment. The virtualization stack is virsh and kvm/qemu. As for the file system and drive formatting, I used LUKS and pam_mount to open an encrypted home partition and mapped home directory. I use this encrypted home directory to store keys for the zfs pool and/or other sensitive data, thus protecting them behind full disk encryption. Additionally, I create encrypted zfs datasets within each of the vdevs that are unlocked by the keys on the LUKS home partition. Instead of tracking each disk UUID down on your initial build, create the pool with short device names, then export it and re-import it by id:
  zpool create -m /mnt/pool pool -f mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh
  zpool export pool
  zpool import -d /dev/disk/by-id pool
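
To confirm the mirrors came back referenced by stable ids rather than the short sdX names, a quick check is:

  zpool status pool
  zpool list pool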

Once the pool is created, you can create your encrypted datasets. To do so, I made some unlock keys with the dd command and placed the keys in a hidden directory inside that LUKS-encrypted home partition I mentioned above:

  dd if=/dev/random of=/secure/area/example.key bs=1 count=32
  zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///mnt/vault/example.key pool/dataset
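
To verify that the dataset really was created encrypted and that its key is currently loaded, you can inspect the relevant zfs properties:

  zfs get encryption,keystatus,keylocation pool/dataset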

When the system reboots, the pool will mount automatically but the encrypted datasets won't, because the LUKS keys won't be available until you mount the home partition by logging in to the user that holds the keys. For security reasons, this must be done manually or it defeats the entire purpose. So, once the administrator has logged in to that user in a screen session (remember, it is using pam_mount), they simply detach from that session and then load the keys and mount the datasets as follows:

  zfs load-key pool/dataset
  zfs mount pool/dataset
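
If you have several encrypted datasets, a short loop (or simply ''zfs load-key -a'' followed by ''zfs mount -a'') can bring them all up in one pass. A minimal sketch, assuming the two dataset names used in the snapshot script below:

  #!/bin/bash
  # load the key and mount each encrypted dataset in turn
  for ds in pool/vm1dataset pool/vm2dataset; do
      /usr/sbin/zfs load-key "$ds"
      /usr/sbin/zfs mount "$ds"
  done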
      
If you have a lot of datasets, a simple script like the sketch above can load and mount them all at once. Since we have zfs, it's a good idea to take regular snapshots. To do that, I created a small shell script with the following commands and set it to run 4 times a day, or every 6 hours:
  
  DATE=$(date +"%Y%m%d-%H:%M:%S")
  /usr/sbin/zfs snapshot -r pool/vm1dataset@backup_$DATE
  /usr/sbin/zfs snapshot -r pool/vm2dataset@backup_$DATE
  /usr/sbin/zfs snapshot pool@backup_$DATE
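
To run the script every 6 hours, a root crontab entry along these lines works (the script path here is just a placeholder):

  0 */6 * * * /root/scripts/zfs-snapshots.sh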

Make sure to manage your snapshots and only retain as many as you actually need, since large numbers of snapshots will impact performance. If you need to zap all of them and start over, you can use this command:

  zfs list -H -o name -t snapshot | xargs -n1 zfs destroy
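
If you would rather prune than destroy everything, a sketch like the following keeps only the 24 most recent ''@backup_'' snapshots of a dataset (assuming the naming scheme from the script above and GNU coreutils):

  zfs list -H -t snapshot -o name -s creation -d 1 pool/vm1dataset \
    | grep '@backup_' | head -n -24 | xargs -r -n1 zfs destroy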

Off-site //full// backups are essential, but they take a long time to download. For that reason, it's best to keep the images as small as possible. When using ''cp'' in your workflow, make sure to specify ''--sparse=always''. Before powering the virtual hard disk back up, run ''virt-sparsify'' on the image to free up blocks that are allocated on the host but not actually used in the VM. In order for the VM to mark those blocks as empty, ensure that you are running fstrim within the VM. If you want the ls command to show only the size the virtual disk actually uses after the zeroing, you will need to run ''qemu-img create'' on it, which creates a new copy of the image without the ballooned size. The purged virtual hard disk image can then be copied to a backup directory, where it can be compressed and tarballed to further reduce its size. I use BSD tar with pbzip2 compression, which produces remarkably small images; GNU tar glitches with the script for some reason. BSD tar can be installed with ''sudo apt install libarchive-tools''. I made a script to automate all of those steps for a qcow2 image, and I also adapted it to work for raw images.

[[https://repo.haacksnetworking.org/haacknet/haackingclub/-/blob/main/scripts/virtualmachines/vm-bu-production-QCOW-loop.sh|vm-bu-production-QCOW-loop.sh]] \\
[[https://repo.haacksnetworking.org/haacknet/haackingclub/-/blob/main/scripts/virtualmachines/vm-bu-production-RAW-loop.sh|vm-bu-production-RAW-loop.sh]]
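
The linked scripts are the authoritative versions, but the core of the workflow for a single qcow2 image looks roughly like this sketch (VM name and paths are placeholders):

  #!/bin/bash
  VM=examplevm
  SRC=/mnt/pool/vms/$VM.qcow2
  DST=/backups/tarballs
  virsh shutdown $VM                            # request a graceful shutdown; wait until the guest is fully off
  virt-sparsify --compress $SRC $DST/$VM.qcow2  # write a sparsified, compressed copy into the backup directory
  cd $DST && bsdtar -cf - $VM.qcow2 | pbzip2 > $VM.tar.bz2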

On the off-site backup machine, I originally pulled the tarballs down using a one-line rsync script, adjusting its cron timing to work well with when the tarballs are created:

  sudo rsync -av --log-file=/home/logs/backup-of-vm-tarballs.log --ignore-existing -e 'ssh -i /home/user/.ssh/id_rsa' root@domain.com:/backups/tarballs/ /media/user/Backups/
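
A cron entry on the off-site machine, offset a few hours from when the tarballs are written, might look like this (the time and script path are placeholders):

  0 5 * * * /root/scripts/pull-vm-tarballs.sh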
      
Since then, I've switched to using rsnapshot to pull down the tarballs in some cases. The rsnapshot configurations can be found here:

[[https://repo.haacksnetworking.org/haacknet/haackingclub/-/tree/main/scripts/rsnapshot|Rsnapshot Scripts]]
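
The gist of a pull-style rsnapshot setup on the backup machine is a handful of directives in ''/etc/rsnapshot.conf'' (fields must be tab-separated; the paths and host below are placeholders, and the linked configs are the real ones):

  snapshot_root   /media/user/Backups/rsnapshot/
  retain  daily   7
  retain  weekly  4
  backup  root@domain.com:/backups/tarballs/      vmhost/

Then ''rsnapshot daily'' and ''rsnapshot weekly'' are run from cron.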

****

-- Network Bridge Setup for VMs --

Up until now, I've covered how to provision the machines with virt-manager, how to back up the machines on the physical host, and how to pull those backups to an off-site workstation. Now I will discuss how to assign each VM an external IP. The first step is to provision the physical host with a virtual switch (wrongly called a bridge) to which VMs can connect. To do this, I kept it simple and used ''ifup'', the ''bridge-utils'' package, and some manual editing of ''/etc/network/interfaces''.
      
  sudo apt install bridge-utils
  sudo brctl addbr br0
  sudo nano /etc/network/interfaces
  
Now that you have created the virtual switch, you need to reconfigure your physical host's ''/etc/network/interfaces'' file to use it. In my case, I used one IP for the host itself and another for the switch, meaning that two ethernet cables are plugged into my physical host. I did this so that if I hose my virtual switch settings, I still have a separate connection to the box. Here's the configuration in ''interfaces'':
  
  #eth0 [1st physical port]
  auto enp8s0g0
  iface enp8s0g0 inet static
    address 8.25.76.160
    netmask 255.255.255.0
    gateway 8.25.76.1
    nameserver 8.8.8.8
  
  #eth1 [2nd physical port]
  auto enp8s0g1
  iface enp8s0g1 inet manual

  auto br0
  iface br0 inet static
    address 8.25.76.159
    netmask 255.255.255.0
    gateway 8.25.76.1
    bridge_ports enp8s0g1
    nameserver 8.8.8.8

After that, either reboot or run ''systemctl restart networking.service'' to make the changes current. Execute ''ip a'' and you should see both external IPs on two separate interfaces, with ''br0 state UP'' in the output for the second interface ''enp8s0g1''. You should also run some ''ping 8.8.8.8'' and ''ping google.com'' tests to confirm you can route. If you want to do this in a home, small business, or other non-public-facing environment, you can easily use dhcp and provision the home/small business server's ''interfaces'' file as follows:

  auto eth1
  iface eth1 inet manual

  auto br0
  iface br0 inet dhcp
        bridge_ports eth1

The above home version lets users, for example, run a virtual machine that gets an IP address on your LAN, which makes ssh/xrdp access far easier. If you have any trouble routing on the physical host, it could be that you do not have nameservers set up. If that's the case, do the following:

    echo nameserver 8.8.8.8 > /etc/resolv.conf
    systemctl restart networking.service

Now that the virtual switch is set up, I can provision VMs and connect them to ''br0'' in virt-manager. You can provision the VMs within the GUI using X passthrough, or use the command line. First, create a virtual disk of your desired size by executing ''sudo qemu-img create -f raw new.img 1000G'' and then run something like this:

  sudo virt-install --name=new.img \
  --os-type=Linux \
  --os-variant=debian10 \
  --vcpus=1 \
  --ram=2048 \
  --disk path=/mnt/vms/students/new.img \
  --graphics spice \
  --location=/mnt/vms/isos/debian-11.4.0-amd64-netinst.iso \
  --network bridge:br0

The machine will open in virt-viewer, but if you lose the connection you can reconnect easily with:

  virt-viewer --connect qemu:///system --wait new.img
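
Once the install is done, routine lifecycle management can be handled with virsh on the host, for example:

  virsh list --all          # show every defined VM and its state
  virsh autostart new.img   # mark this VM to start automatically when the host boots
  virsh shutdown new.img    # graceful shutdown
  virsh start new.img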
      
Once you finish the installation, configure the guest OS's interfaces file (''sudo nano /etc/network/interfaces'') with the IP you intend to assign it. You should have something like this:

  auto epr1
  iface epr1 inet static
    address 8.25.76.158
    netmask 255.255.255.0
    gateway 8.25.76.1
    nameserver 8.8.8.8

If you are creating VMs attached to a virtual switch in the smaller home/business environment, then adjust the guest OS by executing ''sudo nano /etc/network/interfaces'' with something like this recipe:

  auto epr1
  iface epr1 inet dhcp

If your guest OS runs Ubuntu, you will need a few extra steps to ensure that the guest can route. This is because Ubuntu-based distros have deprecated ''ifupdown'' in favor of ''netplan'' and disabled manual editing of ''/etc/resolv.conf''. So, either learn netplan syntax and make interface changes in its YAML format, or install the optional ''resolvconf'' package to restore ''ifupdown'' functionality. To do the latter, adjust the VM provision script above (or use the virt-manager GUI with X passthrough) to temporarily use NAT, then override the Ubuntu defaults and restore ''ifupdown'' functionality as follows:

  sudo apt install ifupdown
  sudo apt remove --purge netplan.io
  sudo apt install resolvconf
  sudo nano /etc/resolvconf/resolv.conf.d/tail
  <nameserver 8.8.8.8>
  systemctl restart networking.service
  
You should once again execute ''ping 8.8.8.8'' and ''ping google.com'' to confirm you can route within the guest OS. If it fails, reboot and try again. It's a good idea at this point to check ''netstat -tulpn'' on both the host and in any VMs to ensure only approved services are listening. When I first began spinning up machines, I would make template machines and then use ''virt-clone'' to make new machines, which I would then tweak for the new use case. You always get ssh host key warnings that way, and it is just kind of cumbersome and not clean. Over time, I found out how to pass preseed.cfg files to Debian through virt-install, so now I simply spin up new images with the desired parameters and the preseed.cfg file passes nameservers, network configuration details, and ssh keys into the newly created machine. Although related, that topic stands on its own, so I wrote up the steps I took over at [[computing:preseed]]. One other thing people might want to do is enable some type of GUI-based monitoring tool for the physical host, like munin, cacti, smokeping, etc., in order to monitor snmp or other characteristics of the VMs. If so, make sure you only run those web administration panels locally and/or block 443/80 in a firewall. You will want to put the physical host behind a vpn, as I've documented in [[computing:vpnserver-debian]], and then just access it by its internal IP. This completes the tutorial on setting up a virtualization stack with virsh and qemu/kvm.
  
 --- //[[webmaster@haacksnetworking.org|oemb1905]] 2024/02/17 20:46//