Are your Time Machine backups really working?

August 13, 2012

Apple built a great system into Time Machine to tell the user if their machine is not backing up. First is the menubar icon. It turns into an exclamation mark if the last backup failed, and clicking on it gives the user a quick (if very generic) description of the problem. After 10 days without a backup, it also pops up an alert window saying your machine has not been backed up in 10 days. This works great for me and my personal laptop. It has my data on it, so I pay attention to whether it is backed up or not. But in a corporate environment, users aren’t as concerned about their data. That is, until they accidentally delete something and need to get it back. Then they just expect the backups to have been working and want to know why you were not doing your job.

In an environment where I need to maintain over 125 computers, about 50 of which run Time Machine and whose screens I never see, I need a better way to monitor what’s going on with Time Machine. If you are using Time Machine in a corporate environment you probably (hopefully) are using a central server that all your machines back up to. This makes the job of monitoring these backups easier since they are all in one place. We wrote a shell script that goes through the backup directory and mounts each sparsebundle in order to examine its contents. For each sparsebundle it determines how many backups there are, how much space they use, and when the most recent backup was performed, and then e-mails the results. It also includes any warnings or errors at the top, such as “so and so has not been backed up for 14 days.”

In order to run the script we set up a cron job as the root user and use the MAILTO= statement in the cron file to indicate who gets the e-mail (you can specify multiple addresses separated by “,”). The main script, checkTMBackups.sh, does all the legwork of mounting the various disk images and parsing the information in them. It includes a script called functions.sh, which provides some special logging functionality. I did this because I needed a way to log “normal” messages and “critical” messages at the same time, but then output the critical messages all grouped together before any of the normal messages. This makes it easier to spot (potential) problems.
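The idea behind the two logs can be sketched in a few lines. This is only a minimal illustration, not the real functions.sh (which is listed in full further down): note, alert and flush are made-up names, and the users in the sample messages are invented.

```shell
# Sketch of two-buffer logging: messages and alerts are collected
# separately while the script runs, then the alerts are printed first.
# note/alert/flush are illustrative names, not part of functions.sh.

NL='
'
Notes=""
Alerts=""

# Append a normal message to the Notes buffer.
note()  { Notes="${Notes:+$Notes$NL}$1"; }

# Append a critical message to the Alerts buffer.
alert() { Alerts="${Alerts:+$Alerts$NL}*** CRITICAL - $1"; }

# Print alerts (to stderr) before normal messages (to stdout).
flush() {
	[ -n "$Alerts" ] && echo "$Alerts" >&2
	[ -n "$Notes" ] && echo "$Notes"
	return 0
}

# Sample messages with invented users, logged in mixed order.
note  "alice has 42 backups."
alert "bob has not been backed up in 14 days."
note  "carol has 17 backups."
flush 2>&1
```

Even though the alert was logged between the two normal messages, the line about bob is printed first, followed by the lines for alice and carol.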

checkTMBackups.sh

#!/bin/bash
#

#
# Source the helper functions in.
#
source "/usr/local/bin/functions.sh"
SetLogPrefix "Time Machine"

#
# Change these values to suit your needs.
#
BackupTMAgeAlert="14"
TMPath="/Volumes/HDCBackups"

#
# You shouldn't need to edit anything below this line.
#
ATTACH_PARMS="-readonly -noverify -noautofsck -noautoopen -quiet"
PATH="$PATH:/usr/sbin:/sbin"

#
# Check the volume with the given name in $1
#
function CheckVolume
{
	#
	# Get short name first; the alert messages below need it.
	#
	BackupName=`echo "$1" | cut -f1 -d. | cut -f1 -d_`

	#
	# Check if volume is in use, try for 5 minutes.
	#
	i=0
	while [ $i -lt 10 ]; do
		LSOF=`lsof | grep "$1/bands"`
		if [ -z "$LSOF" ]; then break; fi
		sleep 30
		i=$[$i + 1]
	done
	if [ $i -eq 10 ]; then
		LogAlert "$BackupName: Volume in use, cannot mount."
		return 1
	fi

	#
	# Try to mount the volume quietly.
	#
	mkdir -p /tmp/mount
	hdiutil attach -mountpoint /tmp/mount $ATTACH_PARMS "$TMPath/$1"
	RESULT="$?"
	if [ $RESULT != 0 ]; then
		LogAlert "$BackupName: Could not mount volume ($RESULT)."
		rmdir /tmp/mount
		return 1
	fi

	#
	# Check if the backup has finished one cycle yet.
	#
	VolumeName=`ls -1 /tmp/mount/Backups.backupdb | grep -v "^\." | head -n1`
	if [ ! -e "/tmp/mount/Backups.backupdb/$VolumeName/Latest" ]; then
		ls "/tmp/mount/Backups.backupdb/$VolumeName"
		LogAlert "$BackupName: Has not finished full backup cycle yet."
		hdiutil detach -quiet /tmp/mount
		rmdir /tmp/mount
		return 1
	fi

	#
	# Get the name of the first drive backed up and then check
	# when it was last backed up.
	#
	LastMod=`stat -L -f "%m" "/tmp/mount/Backups.backupdb/$VolumeName/Latest"`
	CurDate=`date "+%s"`
	DaysSinceBackup=$[$[$CurDate / 86400] - $[$LastMod / 86400]]

	#
	# Check when it was first backed up.
	#
	FirstName=`ls -1 "/tmp/mount/Backups.backupdb/$VolumeName" | grep -v "^\." | head -n1`
	LastMod=`stat -L -f "%m" "/tmp/mount/Backups.backupdb/$VolumeName/$FirstName"`
	DaysSinceFirstBackup=$[$[$CurDate / 86400] - $[$LastMod / 86400]]

	#
	# Get the number of backups that are around.
	#
	NumberOfBackups=`ls -1 "/tmp/mount/Backups.backupdb/$VolumeName" | grep -v inProgress | grep -cv Latest`

	#
	# Determine space used and total.
	#
	SizeAllowed=`df -H | grep /tmp/mount | awk '{print $2}'`
	SizeOfBackup=`df -H | grep /tmp/mount | awk '{print $3}'`
	BackupUsed=$[$[`df | grep /tmp/mount | awk '{print $3}'` * 100] / `df | grep /tmp/mount | awk '{print $2}'`]

	#
	# Unmount and detach the image.
	#
	hdiutil detach -quiet /tmp/mount
	rmdir /tmp/mount

	#
	# Log the information about this backup.
	#
	if [ $DaysSinceBackup -gt $BackupTMAgeAlert ]; then
		LogAlert "$BackupName has not been backed up in $DaysSinceBackup days."
	fi
	LogMessage "$BackupName has $NumberOfBackups backups. $SizeOfBackup of $SizeAllowed (${BackupUsed}%). First/last backup was $DaysSinceFirstBackup/$DaysSinceBackup days ago."

	return 0;
}

#
# Check all volumes.
#
for f in "$TMPath"/*.sparsebundle; do
	backup=${f:$[${#TMPath} + 1]}
	CheckVolume "$backup"
done

#
# Determine total usage for time machine.
#
SizeAvail=`df -H | grep "$TMPath" | awk '{print $4}'`
BackupUsed=`du -skch "$TMPath"/*.sparsebundle | tail -n1 | awk '{print $1}'`
LogMessage "Total backup space used for time machine $BackupUsed ($SizeAvail available)."

DumpAlertLog
DumpLog

functions.sh

#!/bin/sh
#
# This file provides common functions I use. I make no guarantees that
# any of it will work.
#
# Copyright (c) 2010 Daniel Hazelbaker
#
# Version 1.0 - 2010/02/19
#

######################################################################
#
# Functions to provide logging information.
#
######################################################################

Log_Messages=""
Log_Alerts=""
Log_Prefix=""
Log_AlertPrefix="*** CRITICAL -"

#
# Set the prefix used when logging messages.
#
function SetLogPrefix
{
	Log_Prefix="$1"
}

#
# Set the prefix used when logging alerts.
#
function SetLogAlertPrefix
{
	Log_AlertPrefix="$1"
}

#
# Log a simple message.
#
function LogMessage
{
	local msg

	if [ -n "$Log_Prefix" ]; then
		msg="["`date "+%F %T"`" $Log_Prefix] $1"
	else
		msg="["`date "+%F %T"`"] $1"
	fi

	LogMessageRaw "$msg"
}

#
# Append an already formatted line ($1) to the message log.
#
function LogMessageRaw
{
	if [ -n "$1" ]; then
		if [ "$Log_Messages"X == "X" ]; then
			Log_Messages="$1"
		else
			Log_Messages="$Log_Messages
$1"
		fi
	fi
}

#
# Display the log messages
#
function DumpLog
{
	if [ "$Log_Messages"X != "X" ]; then
		echo "$Log_Messages"
		echo ""
	fi
}

#
# Log a critical alert
#
function LogAlert
{
	local msg=""

	if [ -n "$Log_Prefix" ]; then
		msg="["`date "+%F %T"`" $Log_Prefix]"
	else
		msg="["`date "+%F %T"`"]"
	fi
	if [ -n "$Log_AlertPrefix" ]; then
		msg="$msg $Log_AlertPrefix"
	fi
	msg="$msg $1"

	LogAlertRaw "$msg"
}

#
# Append an already formatted line ($1) to the alert log.
#
function LogAlertRaw
{
	if [ -n "$1" ]; then
		if [ "$Log_Alerts"X == "X" ]; then
			Log_Alerts="$1"
		else
			Log_Alerts="$Log_Alerts
$1"
		fi
	fi
}

#
# Display the alert messages
#
function DumpAlertLog
{
	if [ "$Log_Alerts"X != "X" ]; then
		echo "$Log_Alerts" >&2
		echo "" >&2
	fi
}

Cron job file

MAILTO="daniel@mailinator.com,rharman@mailinator.com"
0 1 * * Sun,Wed /usr/local/bin/checkTMBackups.sh 2>&1

The above cron job will send e-mails to both daniel and rharman each time the script runs. An e-mail is sent whether or not any problems were detected. The script runs at 1am every Sunday and Wednesday. Both scripts should be installed in the /usr/local/bin folder (you may need to create this path on your system). To edit root’s cron job list you can use the command:

EDITOR=nano sudo crontab -e

The “EDITOR=nano” part tells it to use the editor called nano; otherwise you will be stuck with vi, which is a pain if you are not used to it. sudo tells it to run as root (it will ask for your login password) and crontab -e instructs it to edit that user’s crontab. Note: crontabs work on Snow Leopard for sure. I can’t say for sure if they work on Lion or Mountain Lion yet. If they do not, you will have to use launchd to configure the schedule, but hopefully they just work.
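If you do end up needing launchd, a job definition along the following lines should give roughly the same schedule. Treat it as an untested sketch: the label com.example.checktmbackups and the log path are placeholders I made up, and since launchd has no equivalent of MAILTO the output is written to a file instead of being e-mailed.

```shell
# Sketch: print a launchd job definition that mirrors the cron entry
# (1:00 on Sunday and Wednesday). The label and log path are made-up
# placeholders. launchd has no MAILTO, so output goes to a log file.
print_plist() {
	cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.example.checktmbackups</string>
	<key>ProgramArguments</key>
	<array>
		<string>/usr/local/bin/checkTMBackups.sh</string>
	</array>
	<key>StartCalendarInterval</key>
	<array>
		<dict>
			<key>Weekday</key><integer>0</integer>
			<key>Hour</key><integer>1</integer>
			<key>Minute</key><integer>0</integer>
		</dict>
		<dict>
			<key>Weekday</key><integer>3</integer>
			<key>Hour</key><integer>1</integer>
			<key>Minute</key><integer>0</integer>
		</dict>
	</array>
	<key>StandardOutPath</key>
	<string>/var/log/checkTMBackups.log</string>
	<key>StandardErrorPath</key>
	<string>/var/log/checkTMBackups.log</string>
</dict>
</plist>
EOF
}

print_plist
```

The output would be saved as a root-owned file in /Library/LaunchDaemons (for example com.example.checktmbackups.plist) and loaded with sudo launchctl load followed by that path. Weekday 0 is Sunday and 3 is Wednesday, matching the Sun,Wed in the cron entry.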

You will occasionally get warnings about not being able to mount a sparsebundle (if you have a lot of users, as I do, one or two of them will come up with the warning on each run). This message can occur for two reasons. The first is that the sparsebundle is in use by a client (i.e. a backup is happening right now). Most of our machines get left on overnight, so this isn’t uncommon. The script tries to work through this problem by re-trying 10 times at 30-second intervals. The second cause is that the sparsebundle needs a file system check. Because we mount the file system read-only, the mount will abort if the file system is not clean, since it is not allowed to fix the problems. I usually ignore these messages unless I notice them in two or three e-mails in a row, and then I follow up and look into it.
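The retry behaviour could also be pulled out into a small helper so the number of attempts and the delay are easy to change. The sketch below is not part of the script above: retry_until and bundle_is_free are names I invented for illustration, and the variables inside retry_until are global because plain sh has no local.

```shell
# Sketch: run a check up to $1 times, waiting $2 seconds between tries.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
# retry_until is an illustrative name, not part of the original script.
retry_until() {
	attempts=$1
	delay=$2
	shift 2
	i=0
	while [ "$i" -lt "$attempts" ]; do
		if "$@"; then
			return 0
		fi
		i=$((i + 1))
		# Skip the sleep after the last failed attempt.
		[ "$i" -lt "$attempts" ] && sleep "$delay"
	done
	return 1
}

# A bundle counts as free when lsof shows no open files under bands/.
# Same test as the script uses; defined here but not run in the demo.
bundle_is_free() {
	! lsof 2>/dev/null | grep -q "$1/bands"
}

# Demo with a check that always fails and no delay between tries.
retry_until 3 0 false || echo "gave up after 3 attempts"
```

In the real script the call would look like retry_until 10 30 bundle_is_free "$1". Unlike the loop in CheckVolume, this version does not sleep after the final failed attempt, so it gives up 30 seconds sooner.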