-------------------------------------------
  
//vmserver//

-------------------------------------------

I was given a dual 8-core Xeon SuperMicro server (32 threads) with 8 HD bays in use, 96GB RAM, 8x 6TB Western Digital drives in a zfs RAID1 mirror layout (24TB usable), and a 120GB SSD boot volume tucked behind the front power panel, running non-GUI Debian. (Thanks to Kilo Sierra for the donation.) My first job was to calculate whether my PSU was up to the task I intended for it. I used a 500W PSU. From my calculations, the RAM would draw around 360W at capacity but would rarely come close to that, the HDs would often (especially at boot) hit up to 21.3W per drive, or around 150W total, excluding the boot SSD, and the motherboard would draw about 100W, putting me at roughly 610W. Since I did not expect the RAM, HDs, and other physical components to hit peak consumption concurrently, I considered it safe to proceed, figuring no more than around 75% of that ceiling would be used at any one time. The next step was to install the physical host OS (Debian) and set up the basics of the system (hostname, DNS, basic package installs, etc.). On the 120GB SSD boot volume, I used a LUKS / pam_mount encrypted home directory, where I could store keys for the zfs pool and/or other sensitive data. I used a nifty trick to first create the pool simply with short names, and then change them to block IDs without making the pool creation syntax cumbersome.
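
Summarizing the power estimate above:

  RAM ~360W + HDs ~150W + motherboard ~100W  =  ~610W theoretical peak
  expected concurrent draw at ~75% of that   =  ~457W, under the 500W PSU rating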

**Update**: I am now running a newer server with 48 threads, 12 hard drive bays, 384GB RAM, 4 two-way mirrors of Samsung enterprise SSDs for the primary vm zpool, and 2 two-way mirrors of 16TB platters for the backup zpool and for some mailservers. These drives are now SAS rather than SATA. The server can handle up to 1.5TB of RAM.

  zpool create -m /mnt/pool pool -f mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh
  zpool export pool
  zpool import -d /dev/disk/by-id pool
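
To sanity-check that the vdevs now reference persistent by-id paths rather than short names, inspect the pool:

  zpool status pool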

Now that the pool was created, I created encrypted datasets, which is the zfs term for encrypted file storage inside the pool. Each dataset unlocks by pulling a dd-generated key from the encrypted (and separate) home partition on the SSD boot volume. I set up the keys/datasets as follows:

  dd if=/dev/random of=/secure/area/example.key bs=1 count=32
  zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///mnt/vault/example.key pool/dataset
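
Since the key is just a 32-byte file, it is also worth restricting its permissions (path from the example above):

  chmod 400 /secure/area/example.key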

When you create a dataset on the running instance, zfs also mounts it for you as a courtesy, but upon reboot you need to load the key and then mount the dataset using zfs commands. In my case, I created three datasets (one for raw isos, one for disk images, and a last one for backup sparse tarballs). Each one is unlocked and mounted as follows:

  zfs load-key pool/dataset
  zfs mount pool/dataset
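
The unlock script mentioned below is then just those two commands repeated per dataset; a minimal sketch, with illustrative dataset names:

  #!/bin/bash
  # unlock-datasets.sh - load keys and mount the encrypted datasets after reboot
  for ds in pool/isos pool/images pool/backups; do
      /usr/sbin/zfs load-key $ds
      /usr/sbin/zfs mount $ds
  done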
      
Once I created all the datasets, I made a script that loads the keys and mounts all of them (as sketched above), then rebooted and tested it for functionality. Upon verifying that the datasets worked, I could feel comfortable creating VMs again, since the hard drive images for those VMs would be stored in encrypted zfs datasets. My next task was to set up snapshots within zfs, which handle routine rollbacks and smaller errors/mistakes. I did that with a small script that runs via cron 4 times a day, or every 6 hours:
  
  # snapshot the VM datasets recursively, plus the pool's root dataset
  DATE=`date +"%Y%m%d-%H:%M:%S"`
  /usr/sbin/zfs snapshot -r pool/vm1dataset@backup_$DATE
  /usr/sbin/zfs snapshot -r pool/vm2dataset@backup_$DATE
  /usr/sbin/zfs snapshot pool@backup_$DATE
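
A matching cron entry for the every-6-hours schedule might look like this (script name and log path are illustrative):

  0 */6 * * * /usr/local/bin/zfs-snapshots.sh >> /root/zfs-snapshots.log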

The snapshots allow me to perform rollbacks when end-users make mistakes, e.g., deleting an instructional video after a class session. To delete all snapshots and start over, run:

  zfs list -H -o name -t snapshot | xargs -n1 zfs destroy
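
To roll a dataset back to a particular snapshot instead (names here are illustrative), list its snapshots and then roll back; note that rolling back past the most recent snapshot requires ''-r'', which destroys the newer snapshots:

  zfs list -t snapshot pool/vm1dataset
  zfs rollback pool/vm1dataset@backup_20230617-06:00:00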

Of course, off-site backups are essential. To do this, I use a small script that powers down the VM, uses ''cp'' with the ''--sparse=always'' flag to preserve space, and then uses tar with pbzip2 (''sudo apt install pbzip2'') compression to save even more space. From my research, bsdtar seems to honor sparsity better than gnutar, so install that with ''sudo apt install libarchive-tools''. The ''cp'' step is not optional, moreover, for tar will not work directly on an ''.img'' file. Here's a small shell script with a loop for multiple VMs within the same directory. I also added a command at the end that deletes any tarballs older than 180 days.

  DATE=`date +"%Y%m%d-%H:%M:%S"`
  IMG="vm1.img  vm2.img"
  for i in $IMG;
  do
  # ask the guest to shut down, then copy its disk with sparseness preserved
  virsh shutdown $i
  wait
  cd /mnt/vms/backups
  cp -ar --sparse=always /mnt/vms/students/$i /mnt/vms/backups/SANE_$i.bak
  wait
  # bring the guest back up, then compress the sparse copy with bsdtar/pbzip2
  virsh start $i
  bsdtar --use-compress-program=pbzip2 -Scf SANE_$i.tar.bz2 SANE_$i.bak
  mv /mnt/vms/backups/SANE_$i.tar.bz2 /mnt/vms/backups/tarballs/$i:_SANE_$DATE:_.tar.bz2
  rm /mnt/vms/backups/SANE_$i.bak
  done
  # prune tarballs older than 180 days
  find /mnt/vms/backups/tarballs -type f -mtime +180 -delete

The script above can be downloaded here [[https://repo.haacksnetworking.org/oemb1905/haackingclub/-/blob/master/scripts/sane-vm-backup.sh|sane-vm-backup.sh]]. I use multiple copies of the loop script for related groups of VMs on the same physical host, and then stagger when they run with cron to limit simultaneous read/write time as follows:

  #backup student machines, client machines
  00 03 1,15 * * /usr/local/bin/sane-vm-backup-students.sh >> /root/sane-vm-backup-students.log
  00 03 2,16 * * /usr/local/bin/sane-vm-backup-clients.sh >> /root/sane-vm-backup-clients.log

On the off-site backup machine, I pull the tarballs down using a one-line rsync script. I adjust the cron timing of the rsync script to work well with when the tarballs are created.

  sudo rsync -av --log-file=/home/logs/backup-of-vm-tarballs.log --ignore-existing -e 'ssh -i /home/user/.ssh/id_rsa' root@domain.com:/backups/tarballs/ /media/user/Backups/
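
I run that from cron on the backup box, offset from the host-side jobs above; the script name and timing here are only an illustration:

  00 06 2,16 * * /usr/local/bin/pull-vm-tarballs.sh >> /home/logs/pull-vm-tarballs.log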

The off-site backup workstation uses rsnapshot, which provides me with months of restore points and thus version control for if/when errors are not caught immediately.
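
For reference, rsnapshot is driven by its own cron schedule on that workstation; a sketch with illustrative retention levels (each level must match a ''retain'' line in ''/etc/rsnapshot.conf''):

  00 07 * * * /usr/bin/rsnapshot daily
  00 08 * * 1 /usr/bin/rsnapshot weekly
  00 09 1 * * /usr/bin/rsnapshot monthly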

****

-- Network Bridge Setup / VMs --

Up until now, I've covered how to prepare storage for the machines, how to back up the machines on the physical host, and how to pull those backups to an off-site workstation. Now I will discuss how to assign each VM an external IP. The first step is to provision the physical host with a virtual switch (wrongly called a bridge) to which VMs can connect. To do this, I kept it simple and used the ''ifupdown'' and ''bridge-utils'' packages plus some manual editing of ''/etc/network/interfaces''.
      
  sudo apt install bridge-utils
  sudo brctl addbr br0
  sudo nano /etc/network/interfaces
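
On newer systems, the same bridge can also be created with iproute2 instead of ''brctl''; a quick sketch:

  sudo ip link add name br0 type bridge
  sudo ip link set br0 up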
  
Now that you have created the virtual switch, you need to reconfigure your physical host's ''/etc/network/interfaces'' file to use it. In my case, I used one IP for the host itself and another for the switch, meaning that two ethernet cables are plugged into my physical host. I did this so that if I hose my virtual switch settings, I still have a separate connection to the box. Here's the configuration in ''interfaces'':

  #eth0  [1st physical port]
  auto enp8s0f0
  iface enp8s0f0 inet static
    address 8.25.76.160
    netmask 255.255.255.0
    gateway 8.25.76.1
    dns-nameservers 8.8.8.8

  #eth1 [2nd physical port]
  auto enp8s0g1
  iface enp8s0g1 inet manual

  auto br0
  iface br0 inet static
    address 8.25.76.159
    netmask 255.255.255.0
    gateway 8.25.76.1
    bridge_ports enp8s0g1
    dns-nameservers 8.8.8.8

After that, either reboot or run ''systemctl restart networking.service'' to make the changes current. Execute ''ip a'' and you should see both external IPs on two separate interfaces, and you should see ''br0 state UP'' in the output for the second interface ''enp8s0g1''. You should also run some ''ping 8.8.8.8'' and ''ping google.com'' tests to confirm you can route. If anyone wants to do this in a home, small business, or other non-public-facing environment, you can easily use dhcp and provision the home/small business server's ''interfaces'' file as follows:

  auto eth1
  iface eth1 inet manual

  auto br0
  iface br0 inet dhcp
        bridge_ports eth1

The above home version allows, for example, users to have a virtual machine that gets an IP address on your LAN, which makes ssh/xrdp access far easier. If you have any trouble routing on the physical host, it could be that you do not have nameservers set up. If that's the case, do the following:

    echo nameserver 8.8.8.8 > /etc/resolv.conf
    systemctl restart networking.service

Now that the virtual switch is set up, I can provision VMs and connect them to ''br0'' in virt-manager. You can provision the VMs within the GUI using X passthrough, or use the command line. First, create a virtual disk of your desired size by executing ''sudo qemu-img create -f raw new.img 1000G'' and then run something like this:

  sudo virt-install --name=new.img \
  --os-type=Linux \
  --os-variant=debian10 \
  --vcpus=1 \
  --ram=2048 \
  --disk path=/mnt/vms/students/new.img \
  --graphics spice \
  --location=/mnt/vms/isos/debian-11.4.0-amd64-netinst.iso \
  --network bridge=br0
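
Once the install kicks off, you can confirm the domain was defined and is running (domain name from the example above):

  sudo virsh list --all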

The machine will open in virt-viewer, but if you lose the connection you can reconnect easily with:

  virt-viewer --connect qemu:///system --wait new.img

Once you finish installation, configure the guest OS ''interfaces'' file (''sudo nano /etc/network/interfaces'') with the IP you intend to assign it. You should have something like this:

  auto epr1
  iface epr1 inet static
    address 8.25.76.158
    netmask 255.255.255.0
    gateway 8.25.76.1
    dns-nameservers 8.8.8.8

If you are creating VMs attached to a virtual switch in the smaller home/business environment, then adjust the guest OS by executing ''sudo nano /etc/network/interfaces'' and using something like this recipe:

  auto epr1
  iface epr1 inet dhcp

If your guest OS is Ubuntu, you will need a few extra steps to ensure that the guest can route. This is because Ubuntu-based distros have deprecated ''ifupdown'' in favor of ''netplan'' and disabled manual editing of ''/etc/resolv.conf''. So, either learn netplan syntax and make the interface changes in its YAML format (a sketch follows the commands below), or install the optional ''resolvconf'' package to restore ''ifupdown'' functionality. To do the latter, adjust the VM provision script above (or use the virt-manager GUI with X passthrough) to temporarily use NAT, then override the Ubuntu defaults and restore ''ifupdown'' functionality as follows:

  sudo apt install ifupdown
  sudo apt remove --purge netplan.io
  sudo apt install resolvconf
  sudo nano /etc/resolvconf/resolv.conf.d/tail
  <nameserver 8.8.8.8>
  systemctl restart networking.service
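
If you would rather keep netplan on the Ubuntu guest, a static configuration roughly equivalent to the ''interfaces'' recipe above would live in something like ''/etc/netplan/01-netcfg.yaml'' (file name and interface name are illustrative), followed by ''sudo netplan apply''; newer netplan versions prefer a ''routes'' entry to the deprecated, but still accepted, ''gateway4'':

  network:
    version: 2
    ethernets:
      enp1s0:
        addresses: [8.25.76.158/24]
        gateway4: 8.25.76.1
        nameservers:
          addresses: [8.8.8.8]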

You should once again execute ''ping 8.8.8.8'' and ''ping google.com'' to confirm you can route within the guest OS. Sometimes I find a reboot is required. At this stage, you now have a physical host configured with a virtual switch, and one VM provisioned to use that switch with its own external IP. Both the physical host and the guest OS in this scenario are public facing, so take precautions to properly secure each by checking listening services (''netstat -tulpn'') and/or utilizing a firewall. The main thing to configure at this point is ssh access, so you no longer need to rely on the virt-viewer console, which is slow. To do that, you will need to add packages (if you used the netinst iso). To make that easy, I keep a ''sources.list'' on my primary business server:

  wget https://haacksnetworking.org/sources.list

Once you grab the ''sources.list'' file, install ''openssh-server'', and exchange keys, you can ssh into the guest OS from a shell henceforward. At this point you are in a position to create VMs and various production environments at will, or to start working on the one you just created. Another thing to consider is creating base VMs that have ''interfaces'' and ''ssh'' access all ready to go, and then leveraging those to make new instances using ''cp''. Alternately, you can power down a base VM and then clone it as follows:

  virt-clone \
  --original=clean \
  --name=sequoia \
  --file=/mnt/vms/students/sequoia.img

The purpose of this project was to create my own virtualized VPS infrastructure (using KVM and VMs) to run production environments for myself, my clients, students, and family. Here are a few to check out:

  * [[https://nextcloud.haacksnetworking.org|Haack's Networking - Nextcloud Talk Instance]]
  * [[https://mrhaack.org|GNU/Linux Social - Mastodon Instance]]
  * [[http://space.hackingclub.org|My Daughter's Space Website]]
  * [[http://bianca.hackingclub.org|A Student's Pentesting Website]]

That's all folks! Well ... except for one more thing. When I first did all of this, I was convinced that zfs should sit within LUKS, as it was difficult for me to let go of LUKS / full disk encryption. I've now decided that's insane for one primary reason: by putting zfs (or any file system) within LUKS, you lose the hot-swappability that you have when zfs (or regular RAID) runs directly on the hardware. That would mean that replacing a hard drive would require an entire server rebuild, which is insane. However, it is arguably more secure that way, so if budget and time permit, I've retained how I put zfs inside LUKS in the passage that follows. Proceed at your own risk lol.

-- LUKS FIRST, ZFS SECOND - (LEGACY SETUP, NOT CURRENT) --

My initial idea was to do LUKS first, then zfs, meaning 6 drives would be mirrors in zfs and I would keep 1 as a spare LUKS crypt for keys, other crap, etc. To create the LUKS crypts, I did the following 6 times, each time appending the last 4 digits of the block ID to the LUKS crypt name:

  cryptsetup luksFormat /dev/sda
  cryptsetup luksOpen /dev/sda sdafc11

You then make sure to use the LUKS label names when making the zpool, not the short names, which can change between reboots. I did this as follows:

  sudo apt install zfsutils-linux bridge-utils
  zpool create -m /mnt/vms vms -f mirror sdafc11 sdb9322 mirror sdc8a33 sdh6444 mirror sde5b55 sdf8066

ZFS by default executes its mount commands at boot. This is a problem if you don't use auto-unlocking and key files with LUKS to also unlock at boot (and/or a custom script that unlocks), because ZFS will try to mount the volumes before they are unlocked. The two other options are the none/legacy mountpoint modes, both of which rely on you mounting the volume using traditional methods. But the whole point of using zfs, finally, was to not use traditional methods lol, so I investigated whether there was a fix. The closest thing to a fix is setting ''cachefile=none'' so the pool is not cached for import at boot, but this a) hosed the pool once and b) requires resetting the property, rebooting again, and/or manually re-mounting the pool, either of which defeats the point. Using key files, cache file adjustments, and/or none/legacy mountpoints were all no-gos for me, so in the end I decided to tolerate that zfs would fail at boot and that I would ''zpool import'' the pool afterwards.
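
For reference, the cachefile experiment mentioned above amounts to the following property change (shown here only for completeness):

  zpool set cachefile=none vms

With that abandoned, the reboot workflow I settled on looks like this: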

  sudo -i
  screen
  su - user [pam_mount unlocks /home for physical host primary user and the spare 1TB vault]
  ctrl-a-d [detaches from screen]

After unlocking my home directory and the spare 1TB vault, the next step is to unlock each LUKS volume, for which I decided a simple shell script (''mount-luks.sh'') would suffice. It looks like this:
  
  cryptsetup luksOpen /dev/disk/by-uuid/2702e690-…-0c4267a6fc11 sdafc11
  cryptsetup luksOpen /dev/disk/by-uuid/e3b568ad-…-cdc5dedb9322 sdb9322
  cryptsetup luksOpen /dev/disk/by-uuid/d353e727-…-e4d66a9b8a33 sdc8a33
  cryptsetup luksOpen /dev/disk/by-uuid/352660ca-…-5a8beae15b44 sde5b44
  cryptsetup luksOpen /dev/disk/by-uuid/fa1a6109-…-f46ce1cf8055 sdf8055
  cryptsetup luksOpen /dev/disk/by-uuid/86da0b9f-…-13bc38656466 sdh6466
  
This script simply opens each LUKS crypt, so long as you enter or copy/paste your HD password 6 times. After that, one has to re-import the pool (rebuilding the quasi-RAID1 mirrors/logical volumes) once the volumes are opened:

  zpool import vms

Rebooting in this manner takes about 3-5 minutes for the host, and another 2 minutes to screen into my user, detach, and run the mount-luks script to unlock the crypts and re-import the pool. Again, I ultimately rejected this approach because you cannot use zfs tools to hot-swap hard drives when they fail with this setup.
  
 --- //[[jonathan@haacksnetworking.org|oemb1905]] 2022/11/12 12:39//