This tutorial is for Debian GNU/Linux users who want to regularly monitor the temperature and SMART health of their hard drives, along with a slew of helpful ZFS reports. Any production server I build includes these scripts and techniques. I set the vitals script to email me every hour, with the idea that I will catch temperature surges and/or SMART failures in time to remedy them. The first thing to do is install the tools with sudo apt install smartmontools. Here are some miniature scripts that you can adapt to query important information about your drives:
#!/bin/bash
# Hourly vitals report: drive temperatures, SMART health, disk usage, and ZFS status.
LOG="/root/vitals.log"

# vms (8) then warehouse (4) then cache (1)
# These drives report "Current Drive Temperature" and "SMART Health Status".
WWN_DISKS=(
  /dev/disk/by-id/wwn-0x5002538a98416870
  /dev/disk/by-id/wwn-0x5002538a98356f30
  /dev/disk/by-id/wwn-0x5002538a983571d0
  /dev/disk/by-id/wwn-0x5002538a0840a300
  /dev/disk/by-id/wwn-0x5002538a98356500
  /dev/disk/by-id/wwn-0x5002538a98356590
  /dev/disk/by-id/wwn-0x5002538a084065d0
  /dev/disk/by-id/wwn-0x5002538a98357220
  /dev/disk/by-id/wwn-0x5000c500d775df03
  /dev/disk/by-id/wwn-0x5000c500d7694517
  /dev/disk/by-id/wwn-0x5000c500d7771943
  /dev/disk/by-id/wwn-0x5000c500d785d267
)
# These drives report attribute 194 and an overall-health PASSED/FAILED line.
ATA_DISKS=(
  /dev/disk/by-id/ata-SATA_SSD_22100512800207
  /dev/disk/by-id/ata-SATA_SSD_22100512800205
)

echo "Jonathan, at $(date), your vitals for $(hostname -f) were as follows:" > "$LOG"

# temp
echo "" >> "$LOG"
echo "Here are the hard drive temperatures ..." >> "$LOG"
for disk in "${WWN_DISKS[@]}"; do
  temp=$(sudo smartctl -a "$disk" | grep 'Current Drive Temperature' | awk '{print $4}')
  # The pipeline exits 0 even when grep matches nothing, so default an empty result here.
  echo "$disk: ${temp:-N/A}°C" >> "$LOG"
done
for disk in "${ATA_DISKS[@]}"; do
  temp=$(sudo smartctl -a "$disk" | grep '^194 Temperature_Celsius' | head -n 1 | awk '{print $10}')
  echo "$disk: ${temp:-N/A}°C" >> "$LOG"
done

# SMART health
echo "" >> "$LOG"
echo "Here are the SMART Test results ..." >> "$LOG"
for disk in "${WWN_DISKS[@]}"; do
  health=$(sudo smartctl -H "$disk" | grep -i 'SMART Health Status' | awk -F': ' '{print $2}')
  echo "$disk: Health: ${health:-N/A}" >> "$LOG"
done
for disk in "${ATA_DISKS[@]}"; do
  health=$(sudo smartctl -H "$disk" | grep -i -E 'health.*(PASSED|FAILED|UNKNOWN)' | awk -F': ' '{print $2}')
  echo "$disk: Health: ${health:-N/A}" >> "$LOG"
done

# disk usage
echo "" >> "$LOG"
echo "Here's the output of df ..." >> "$LOG"
df -h >> "$LOG"

# pool health
echo "" >> "$LOG"
echo "Here is the health of the pool ..." >> "$LOG"
zpool status -v >> "$LOG"
# pool list
zpool list -v >> "$LOG"
# ram available
free -h >> "$LOG"
# pool I/O statistics
zpool iostat -v >> "$LOG"
# dataset space usage
zfs list -ro space >> "$LOG"

# email report
mail -s "[$(hostname -f)]-vitals-$(date)" alerts@haacksnetworking.org < "$LOG"
# clear the lock file if present
rm -f /tmp/zfs-send-stats.lock
In many cases, I need a CLI-based version of this that skips the SMART health checks and prints to standard output in real time. For that simpler use case, I remove the SMART health checks and the log file, and simplify the script as follows:
#!/bin/bash
# Print ZFS status, memory, and drive temperatures straight to standard output.
zpool status -v
zpool iostat -v
zpool list -v
zfs list -ro space
free -h
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  temp=$(sudo smartctl -a "$disk" | grep '^194 Temperature_Celsius' | head -n 1 | awk '{print $10}')
  # The pipeline exits 0 even when grep matches nothing, so default an empty result here.
  echo "$disk: ${temp:-N/A}°C"
done
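Since the same temperature lookup recurs in every loop, the parsing could be factored into one function. The following is a minimal sketch, not part of the scripts above: parse_temp is a hypothetical helper that reads smartctl -a output on standard input and understands the two line formats that the grep patterns in this tutorial target. Drives that use any other format would need another rule.

```shell
#!/bin/bash
# parse_temp (hypothetical helper): read `smartctl -a` output on stdin and
# print the current temperature in Celsius, or "N/A" if neither of the two
# formats handled in this tutorial is present.
parse_temp() {
  awk '
    # Format 1: "Current Drive Temperature:     35 C"
    /^Current Drive Temperature/ { print $4; found = 1; exit }
    # Format 2: "194 Temperature_Celsius ... -  36 (Min/Max 20/48)"
    /^194 Temperature_Celsius/   { print $10; found = 1; exit }
    # Nothing matched: print the same placeholder the scripts use.
    END { if (!found) print "N/A" }
  '
}
```

With this in place, a loop body would shrink to something like temp=$(sudo smartctl -a "$disk" | parse_temp), and the same loop could cover drives of both kinds.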
Now, as you can see, the first script has two different blocks each for the temperature and SMART reports. This is because different hardware can and will use slightly different formats in its SMART reports. To find out what your hardware reports, run smartctl as follows:
sudo smartctl -a /dev/disk/by-id/wwn-0x5000c500d775df03 | grep -i -E 'Temperature|Temp'
sudo smartctl -x /dev/disk/by-id/wwn-0x5000c500e6c7ac59 | grep -i -E 'Temperature|Temp'
sudo smartctl -a /dev/disk/by-id/wwn-0x5000c500d775df03 | grep -i -E 'Min|Max'
sudo smartctl -x /dev/disk/by-id/wwn-0x5000c500e6c7ac59 | grep -i -E 'Min|Max'
sudo smartctl -H /dev/disk/by-id/wwn-0x5002538a98416870
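The health line from the -H command varies between drives in the same way the temperature line does. As a hedged sketch, the two health formats matched by the vitals script could be normalized with a pair of hypothetical helpers, parse_health and health_is_good. Neither appears in the original scripts, and the exact wording of the health lines should be checked against your own smartctl output.

```shell
#!/bin/bash
# parse_health (hypothetical helper): read `smartctl -H` output on stdin and
# print the health verdict, or "N/A" if no known health line is found.
parse_health() {
  awk -F': *' '
    # Format 1: "SMART Health Status: OK"
    /SMART Health Status/ { print $2; found = 1; exit }
    # Format 2: "SMART overall-health self-assessment test result: PASSED"
    /SMART overall-health self-assessment test result/ { print $2; found = 1; exit }
    END { if (!found) print "N/A" }
  '
}

# health_is_good (hypothetical helper): succeed only for a passing verdict.
health_is_good() {
  case "$1" in
    OK|PASSED) return 0 ;;
    *)         return 1 ;;
  esac
}
```

A report could then flag anything that is not a clean pass, for example: health_is_good "$health" || echo "$disk needs attention: $health".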
The -a flag prints all SMART information about the drive, while -x prints all SMART and non-SMART information. The -H flag shows you the format of the drive health output. All of these can be used to fine-tune the grep searches in the scripts above to your needs. In my case, both the SMART and temperature reports required two different patterns depending on which vendor made the drive.

It is also important to know when to take action. For the SMART health checks, this is easy to identify, as the output will report a failure. For temperature, however, you need to know the minimum and maximum temperatures your drives permit. For that, I crafted a script to query those values:
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  # Take the upper value from the "Min/Max Temperature Limit" line.
  max=$(sudo smartctl -x "$disk" | grep -m1 'Min/Max Temperature Limit' | grep -o '[0-9]\+ Celsius' | awk '{print $1}')
  # The pipeline exits 0 even when nothing matches, so default an empty result to N/A.
  echo "$disk: Max Permitted Temp: ${max:-N/A}°C" >> /var/log/drive-temps.log
done
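Once the limit is known, comparing it with the current reading is simple arithmetic. The sketch below is illustrative only: max_temp_limit and temp_status are hypothetical helpers, and the default margin of 10 degrees below the limit is an arbitrary assumption, not a vendor recommendation.

```shell
#!/bin/bash
# max_temp_limit (hypothetical helper): read `smartctl -x` output on stdin and
# print the upper value of the "Min/Max Temperature Limit" line, or "N/A".
max_temp_limit() {
  local max
  max=$(grep -m1 'Min/Max Temperature Limit' | grep -o '[0-9]\+ Celsius' | awk '{print $1}')
  echo "${max:-N/A}"
}

# temp_status CURRENT MAX [MARGIN] (hypothetical helper): print WARN when the
# current temperature is within MARGIN degrees of MAX, OK when it is cooler,
# and UNKNOWN when any value is missing or not a whole number.
# MARGIN defaults to 10, which is an assumed value for illustration.
temp_status() {
  local current=$1 max=$2 margin=${3:-10} v
  for v in "$current" "$max" "$margin"; do
    case "$v" in
      ''|*[!0-9]*) echo "UNKNOWN"; return 0 ;;
    esac
  done
  if [ "$current" -ge "$((max - margin))" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}
```

For example, temp_status 78 85 prints WARN because 78 is within 10 degrees of 85, while temp_status 40 85 prints OK.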
Again, depending on what the -x report above showed, the grep pattern may need adjusting so that the search string matches the vendor's output for that drive. After running this script, you can easily see which values should cause alarm and take action when needed.

On another server I run, some hard drives insisted on going to sleep after every reboot. For the sake of their health and performance, I preferred that they stay spinning. So, I made a script that uses hdparm and smartctl to ensure the drives are set to not sleep:
#!/bin/bash
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  /usr/sbin/smartctl -s standby,off -n never "$disk"
  /sbin/hdparm -B 255 "$disk" 2>/dev/null || true
done
To verify the sleep and idle settings are working, you can check one drive as follows:
sudo smartctl -i -n standby /dev/disk/by-id/wwn-0x5000c500e6c7ac59
If you want to check the whole batch of drives you made settings for, then use:
for disk in \
  /dev/disk/by-id/wwn-0x5000c500e6db45ea \
  /dev/disk/by-id/wwn-0x5000c500e6c7ac59 \
  /dev/disk/by-id/wwn-0x5000c5007443b754 \
  /dev/disk/by-id/wwn-0x5000c50074445f2c \
  /dev/disk/by-id/wwn-0x5000c500f204f775 \
  /dev/disk/by-id/wwn-0x5000cca28de719cc \
  /dev/disk/by-id/ata-Fanxiang_S301_1TB_MX-00000000000000486 \
  /dev/disk/by-id/ata-KINGSTON_SH103S3120G_50026B724505838C; do
  sudo smartctl -i -n standby "$disk"
done
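The batch check prints a full information section per drive, which gets long with eight disks. As a rough sketch, the output could be reduced to one word or phrase per drive with a hypothetical helper, power_state. The strings it matches ("Device is in STANDBY mode" and "Power mode is:" or "Power mode was:") are assumptions about smartctl's wording and should be confirmed against your own output before relying on it.

```shell
#!/bin/bash
# power_state (hypothetical helper): read `smartctl -i -n standby` output on
# stdin and print a short summary of the drive's power mode.
# The matched strings are assumed from typical smartctl output.
power_state() {
  awk '
    # Printed when the drive is asleep and smartctl declines to wake it.
    /Device is in STANDBY mode/ { print "STANDBY"; found = 1; exit }
    # Printed when the drive is awake, e.g. "Power mode is:    ACTIVE or IDLE".
    /^Power mode (is|was):/ {
      sub(/^Power mode (is|was):[ \t]*/, "")
      print; found = 1; exit
    }
    END { if (!found) print "UNKNOWN" }
  '
}
```

Inside the loop above this would be used as echo "$disk: $(sudo smartctl -i -n standby "$disk" | power_state)", giving one line per drive.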
These scripts and commands provide easy ways to access or confirm hard drive information, and they can be adapted to monitor temperature, health, or anything else that SMART reports provide information for. In my case, I have the output of the first script emailed to me every hour.
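The tutorial does not show how the hourly run is scheduled. One possible approach, sketched here under assumptions, is a cron.d entry. The script path /root/vitals.sh and the file name /etc/cron.d/vitals are hypothetical, and cron_entry is an illustrative helper rather than anything from the original setup.

```shell
#!/bin/bash
# Hypothetical locations; neither path appears in the tutorial.
SCRIPT="/root/vitals.sh"
CRON_FILE="/etc/cron.d/vitals"

# cron_entry SCRIPT (hypothetical helper): print a cron.d line that runs
# SCRIPT as root at minute 0 of every hour. Files in /etc/cron.d carry a
# user field between the schedule and the command.
cron_entry() {
  printf '0 * * * * root %s\n' "$1"
}

# To install, run as root:
#   cron_entry "$SCRIPT" > "$CRON_FILE"
```

Printing the line instead of writing it directly makes it easy to inspect before it lands in /etc/cron.d.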
— oemb1905 2025/04/13 00:16