  • monitoringvitals
  • Jonathan Haack
  • Haack's Networking
  • netcmnd@jonathanhaack.com

This tutorial is for Debian GNU/Linux users who want to regularly monitor the temperature and SMART health of their hard drives, along with a handful of helpful ZFS reports. Any production server I build includes these scripts and techniques. I set the vitals script to email me each hour, so that I can catch temperature surges and/or SMART failures in time to remedy them. The first thing to do is install the tools with sudo apt install smartmontools. Here are some miniature scripts that you can adapt to query important information about your drives:

#!/bin/bash
DATE=`date +"%Y%m%d-%H:%M:%S"`
LOG="/root/vitals.log"
echo "Jonathan, at $(date), your vitals for $(hostname -f) were as follows:" > $LOG
#temp
echo "" >> $LOG
echo "Here are the hard drive temperatures ..." >> $LOG
for disk in \
  /dev/disk/by-id/wwn-0x5002538a98416870 \
  /dev/disk/by-id/wwn-0x5002538a98356f30 \
  /dev/disk/by-id/wwn-0x5002538a983571d0 \
  /dev/disk/by-id/wwn-0x5002538a0840a300 \
  /dev/disk/by-id/wwn-0x5002538a98356500 \
  /dev/disk/by-id/wwn-0x5002538a98356590 \
  /dev/disk/by-id/wwn-0x5002538a084065d0 \
  /dev/disk/by-id/wwn-0x5002538a98357220 \
  /dev/disk/by-id/wwn-0x5000c500d775df03 \
  /dev/disk/by-id/wwn-0x5000c500d7694517 \
  /dev/disk/by-id/wwn-0x5000c500d7771943 \
  /dev/disk/by-id/wwn-0x5000c500d785d267; do
  temp=$(sudo smartctl -a "$disk" | grep 'Current Drive Temperature' | awk '{print $4}')
  # Fall back to N/A when the pattern matched nothing.
  echo "$disk: ${temp:-N/A}°C" >> $LOG
done
for disk in \
  /dev/disk/by-id/ata-SATA_SSD_22100512800207 \
  /dev/disk/by-id/ata-SATA_SSD_22100512800205; do
  temp=$(sudo smartctl -a "$disk" | grep '^194 Temperature_Celsius' | head -n 1 | awk '{print $10}')
  # Fall back to N/A when the pattern matched nothing.
  echo "$disk: ${temp:-N/A}°C" >> $LOG
done
echo "" >> $LOG
echo "Here are the SMART Test results ..." >> $LOG
#vms (8) then warehouse (4) then cache (1)
for disk in \
  /dev/disk/by-id/wwn-0x5002538a98416870 \
  /dev/disk/by-id/wwn-0x5002538a98356f30 \
  /dev/disk/by-id/wwn-0x5002538a983571d0 \
  /dev/disk/by-id/wwn-0x5002538a0840a300 \
  /dev/disk/by-id/wwn-0x5002538a98356500 \
  /dev/disk/by-id/wwn-0x5002538a98356590 \
  /dev/disk/by-id/wwn-0x5002538a084065d0 \
  /dev/disk/by-id/wwn-0x5002538a98357220 \
  /dev/disk/by-id/wwn-0x5000c500d775df03 \
  /dev/disk/by-id/wwn-0x5000c500d7694517 \
  /dev/disk/by-id/wwn-0x5000c500d7771943 \
  /dev/disk/by-id/wwn-0x5000c500d785d267; do
  health=$(sudo smartctl -H "$disk" | grep -i 'SMART Health Status' | awk -F': ' '{print $2}')
  # Fall back to N/A when the pattern matched nothing.
  echo "$disk: Health: ${health:-N/A}" >> $LOG
done
for disk in \
  /dev/disk/by-id/ata-SATA_SSD_22100512800207 \
  /dev/disk/by-id/ata-SATA_SSD_22100512800205; do
  health=$(sudo smartctl -H "$disk" | grep -i -E 'health.*(PASSED|FAILED|UNKNOWN)' | awk -F': ' '{print $2}')
  # Fall back to N/A when the pattern matched nothing.
  echo "$disk: Health: ${health:-N/A}" >> $LOG
done
echo "" >> $LOG
echo "Here's the output of df ..." >> $LOG
df -h >> $LOG
#pool health
echo "" >> $LOG
echo "Here is the health of the pool ..." >> $LOG
zpool status -v >> $LOG
#pool list
zpool list -v >> $LOG
#ram available
free -h >> $LOG
#pool status
zpool iostat -v >> $LOG
#pool list
zfs list -ro space >> $LOG
#email report
mail -s "[$(hostname -f)]-vitals-$(date)" alerts@haacksnetworking.org < $LOG
rm -f /tmp/zfs-send-stats.lock
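
The script above lists the same twelve WWN paths twice. As a minimal sketch of one way to avoid that, the list can live in a Bash array and be handed to a small loop helper. The names pool_disks, for_each_disk, and fake_probe are hypothetical, and the array shows only two of the paths:

```bash
#!/bin/bash
# Two of the disks from the script above; extend the array with the rest.
pool_disks=(
  /dev/disk/by-id/wwn-0x5002538a98416870
  /dev/disk/by-id/wwn-0x5000c500d775df03
)

# Hypothetical helper: run a probe function on each disk and print "disk: result".
for_each_disk() {
  local probe=$1; shift
  local disk
  for disk in "$@"; do
    printf '%s: %s\n' "$disk" "$("$probe" "$disk")"
  done
}

# Stand-in probe so the sketch runs without smartctl; swap in a real query.
fake_probe() { echo "checked"; }

for_each_disk fake_probe "${pool_disks[@]}"
```

With the list in one place, adding or replacing a drive means editing a single array rather than every loop.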

In many cases, I need a CLI-based version of this that skips the SMART health checks and prints to standard output in real time. For that simpler use case, I remove the health checks, the log file, and the email, and simplify the script as follows:

#!/bin/bash
zpool status -v
zpool iostat -v
zpool list -v
zfs list -ro space
free -h
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  temp=$(sudo smartctl -a "$disk" | grep '^194 Temperature_Celsius' | head -n 1 | awk '{print $10}')
  # Fall back to N/A when the pattern matched nothing.
  echo "$disk: ${temp:-N/A}°C"
done
done
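
The temperature and health loops differ only in the pattern they search for, so both formats can be handled by one parser per report. The following is a sketch rather than a drop-in replacement: parse_temp and parse_health are hypothetical helper names, and the patterns cover only the two formats used in this tutorial (the "Current Drive Temperature" and "SMART Health Status" lines, and the ATA attribute 194 and overall-health lines). Each helper reads smartctl output on standard input and prints N/A when nothing matches:

```bash
#!/bin/bash
# Hypothetical helper: read `smartctl -a` output on stdin, print the temperature.
parse_temp() {
  awk '
    /^Current Drive Temperature:/ { print $4; found = 1; exit }
    $1 == "194" && $2 == "Temperature_Celsius" { print $10; found = 1; exit }
    END { if (!found) print "N/A" }
  '
}

# Hypothetical helper: read `smartctl -H` output on stdin, print the health verdict.
parse_health() {
  awk -F': *' '
    /SMART Health Status/ || /overall-health self-assessment/ { print $2; found = 1; exit }
    END { if (!found) print "N/A" }
  '
}
```

Inside a loop these would be used as temp=$(sudo smartctl -a "$disk" | parse_temp) and health=$(sudo smartctl -H "$disk" | parse_health), which lets a single disk list cover drives from both vendors.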

Now, as you can see, the first script has two different blocks each for the temperature and SMART health reports. This is because different hardware can and will format its SMART output slightly differently. To find out what your hardware reports, run smartctl as follows:

sudo smartctl -a /dev/disk/by-id/wwn-0x5000c500d775df03 | grep -i -E 'Temperature|Temp'
sudo smartctl -x /dev/disk/by-id/wwn-0x5000c500e6c7ac59 | grep -i -E 'Temperature|Temp'
sudo smartctl -a /dev/disk/by-id/wwn-0x5000c500d775df03 | grep -i -E 'Min|Max'
sudo smartctl -x /dev/disk/by-id/wwn-0x5000c500e6c7ac59 | grep -i -E 'Min|Max'  
sudo smartctl -H /dev/disk/by-id/wwn-0x5002538a98416870

The -a flag prints all SMART information for the drive, while -x prints the extended report, which adds non-SMART device information. The -H flag shows the format of the drive's health output. All of these can be used to fine-tune the grep searches in the scripts above. In my case, both the health and temperature reports required two different patterns depending on which vendor made the drive. It is also important to know when to take action. For the SMART health checks, this is easy: a failure is reported directly in the output. For temperature, however, you need to know the minimum and maximum temperatures your drives permit. For that, I crafted a script to query those values:

for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  # Capture the limit first so that an empty result can fall back to N/A.
  max=$(sudo smartctl -x "$disk" | grep -m1 'Min/Max Temperature Limit' | grep -o '[0-9]\+ Celsius' | awk '{print $1}')
  echo "$disk: Max Permitted Temp: ${max:-N/A}°C" >> /var/log/drive-temps.log
done
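
Once the limits are known, a current reading can be compared against them automatically. The sketch below makes several assumptions: parse_max_limit and check_temp are hypothetical helpers, the limit line has the "Min/Max Temperature Limit" form searched for above, and the 5 degree warning margin is an arbitrary default that you should tune to your hardware:

```bash
#!/bin/bash
# Hypothetical helper: read `smartctl -x` output on stdin, print the upper limit or N/A.
parse_max_limit() {
  local max
  max=$(grep -m1 'Min/Max Temperature Limit' | grep -o '[0-9]\+ Celsius' | awk '{print $1}')
  echo "${max:-N/A}"
}

# Hypothetical helper: warn when temp is within margin degrees of max.
# Arguments: disk name, current temp, max temp, optional margin (assumed default 5).
# Returns 1 on a warning so callers can react, 0 otherwise.
check_temp() {
  local disk=$1 temp=$2 max=$3 margin=${4:-5}
  # Skip non-numeric readings such as N/A.
  if ! [[ $temp =~ ^[0-9]+$ && $max =~ ^[0-9]+$ ]]; then
    echo "$disk: no usable reading"
    return 0
  fi
  if (( temp >= max - margin )); then
    echo "$disk: WARNING ${temp}°C is within ${margin}°C of the ${max}°C limit"
    return 1
  fi
  echo "$disk: OK ${temp}°C (limit ${max}°C)"
}
```

For example, check_temp "$disk" 82 85 prints a warning and returns 1, while check_temp "$disk" 40 85 reports OK.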

Again, depending on what the -x report shows for your drives, the grep patterns in the limit-query script might need adjusting so that the string search matches the vendor's output. After running it, you can easily see which values should cause alarm and take action when needed. For another server I run, I had some hard drives that insisted on going to sleep after every reboot. For the sake of their health and performance, I preferred that they stay spinning. So, I made a script that uses hdparm and smartctl to ensure the drives are set to not sleep:

#!/bin/bash
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  /usr/sbin/smartctl -s standby,off -n never "$disk"
  /sbin/hdparm -B 255 "$disk" 2>/dev/null || true
done

To verify the sleep and idle settings are working, you can check one drive as follows:

sudo smartctl -i -n standby /dev/disk/by-id/wwn-0x5000c500e6c7ac59

If you want to check the whole batch of drives you made settings for, then use:

for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  sudo smartctl -i -n standby "$disk"
done
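
For a quick summary across many drives, the output of smartctl -i -n standby can be reduced to one short line per drive. This is a minimal sketch: power_state is a hypothetical helper, and the message formats it matches ("Power mode is:" or "Power mode was:", and "Device is in STANDBY mode") are assumptions based on recent smartmontools releases, so compare them with what your version prints:

```bash
#!/bin/bash
# Hypothetical helper: read `smartctl -i -n standby` output on stdin and
# print "asleep", the reported power mode, or "unknown".
power_state() {
  awk '
    /Device is in (STANDBY|SLEEP)/ { print "asleep"; found = 1; exit }
    /^Power mode (is|was):/ {
      sub(/^Power mode (is|was): */, "")
      print; found = 1; exit
    }
    END { if (!found) print "unknown" }
  '
}
```

Inside the loop above, echo "$disk: $(sudo smartctl -i -n standby "$disk" | power_state)" would then print one line per drive instead of the full identity report.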

These scripts and commands provide easy ways to access or confirm hard drive information. They can serve as the basis for scripts that monitor temperature and health, or be adapted to any other task that SMART reports provide information for. In my case, the first script runs every hour and emails me its report.
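
One common way to get the hourly run is cron. The sketch below assumes the vitals script was saved as /root/vitals.sh, which is a hypothetical path, and uses a hypothetical helper, cron_entry, to print an entry in the /etc/cron.d format (which includes a user field):

```bash
#!/bin/bash
# Hypothetical script location; use wherever you saved the vitals script.
VITALS_SCRIPT="/root/vitals.sh"

# Print a cron.d-style entry: minute 0 of every hour, run as root.
cron_entry() {
  printf '0 * * * * root %s\n' "$1"
}

cron_entry "$VITALS_SCRIPT"
```

Redirecting that output as root to a file such as /etc/cron.d/vitals would schedule the report at the top of every hour. Note that cron ignores files in /etc/cron.d whose names contain a dot.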

oemb1905 2025/04/13 00:16

computing/monitorvitals.txt · Last modified: 2025/04/13 00:39 by oemb1905