What happens when you put a tech, a Mac and a cat together?
  • Are your Time Machine backups really working?

    Posted on August 13th, 2012 daniel 9 comments

    Apple built a great system into Time Machine to tell users when their machine is not backing up. First is the menu bar icon: it turns into an exclamation mark if the last backup failed, and clicking on it gives quick access to what the problem is (in a very generic way). After 10 days without a backup, it pops up an alert window telling you that your machine has not been backed up in 10 days. This works great for me and my personal laptop. It has my data on it, so I pay attention to whether it is backed up or not. But in a corporate environment, users aren’t as concerned about their data. That is, until they accidentally delete something and need to get it back. Then they just expect the backups to have been working and want to know why you were not doing your job.

    In an environment where I need to maintain over 125 computers, about 50 of which run Time Machine and whose screens I never see, I need a better way to monitor what’s going on with Time Machine. If you are using Time Machine in a corporate environment you probably (hopefully) are using a central server that all your machines back up to. This makes the job of monitoring these backups easier since they are all in one place. We wrote a shell script that goes through the backup directory and mounts each sparsebundle in order to examine its contents and determine how many backups it holds, how much space they use, and when the most recent backup was performed, and then e-mails the results. It also includes any warnings or errors at the top, such as “so and so has not been backed up for 14 days.”

    To run the script we set up a cron job as the root user and use the MAILTO= statement in the cron file to indicate who gets the e-mail (you can specify multiple addresses separated by “,”). The main script, checkTMBackups.sh, does all the legwork of mounting the various disk images and parsing the information in them. It includes a script called functions.sh which provides some special logging functionality. I did this because I needed a way to log “normal” messages and “critical” messages at the same time, but then output the critical messages all grouped together before any of the normal messages. This makes it easier to spot (potential) problems.

    checkTMBackups.sh

    #!/bin/bash
    #
    
    #
    # Source the helper functions in.
    #
    source "/usr/local/bin/functions.sh"
    SetLogPrefix "Time Machine"
    
    #
    # Change these values to suit your needs.
    #
    BackupTMAgeAlert="14"
    TMPath="/Volumes/HDCBackups"
    
    #
    # You shouldn't need to edit anything below this line.
    #
    ATTACH_PARMS="-readonly -noverify -noautofsck -noautoopen -quiet"
    PATH="$PATH:/usr/sbin:/sbin"
    
    #
    # Check the volume with the given name in $1
    #
    function CheckVolume
    {
    	#
    	# Check if volume is in use, try for 5 minutes.
    	#
    	i=0
    	while [ $i -lt 10 ]; do
    		LSOF=`lsof | grep "$1/bands"`
    		if [ -z "$LSOF" ]; then break; fi
    		sleep 30
    		i=$[$i + 1]
    	done
    	if [ $i -eq 10 ]; then
    		LogAlert "$1: Volume in use, cannot mount."
    		return 1
    	fi
    
    	#
    	# Get short name
    	#
    	BackupName=`echo "$1" | cut -f1 -d. | cut -f1 -d_`
    
    	#
    	# Try to mount the volume quietly.
    	#
    	mkdir -p /tmp/mount
    	hdiutil attach -mountpoint /tmp/mount $ATTACH_PARMS "$TMPath/$1"
    	RESULT="$?"
    	if [ $RESULT != 0 ]; then
    		LogAlert "$BackupName: Could not mount volume ($RESULT)."
    		rmdir /tmp/mount
    		return 1
    	fi
    
    	#
    	# Check if the backup has finished one cycle yet.
    	#
    	VolumeName=`ls -1 /tmp/mount/Backups.backupdb | grep -v "^\." | head -n1`
    	if [ ! -e "/tmp/mount/Backups.backupdb/$VolumeName/Latest" ]; then
    		ls "/tmp/mount/Backups.backupdb/$VolumeName"
    		LogAlert "$BackupName: Has not finished full backup cycle yet."
    		hdiutil detach -quiet /tmp/mount
    		rmdir /tmp/mount
    		return 1
    	fi
    
    	#
    	# Get the name of the first drive backed up and then check
    	# when it was last backed up.
    	#
    	LastMod=`stat -L -f "%m" "/tmp/mount/Backups.backupdb/$VolumeName/Latest"`
    	CurDate=`date "+%s"`
    	DaysSinceBackup=$[$[$CurDate / 86400] - $[$LastMod / 86400]]
    
    	#
    	# Check when it was first backed up.
    	#
    	FirstName=`ls -1 "/tmp/mount/Backups.backupdb/$VolumeName" | grep -v "^\." | head -n1`
    	LastMod=`stat -L -f "%m" "/tmp/mount/Backups.backupdb/$VolumeName/$FirstName"`
    	DaysSinceFirstBackup=$[$[$CurDate / 86400] - $[$LastMod / 86400]]
    
    	#
    	# Get the number of backups that are around.
    	#
    	NumberOfBackups=`ls -1 "/tmp/mount/Backups.backupdb/$VolumeName" | grep -v inProgress | grep -cv Latest`
    
    	#
    	# Determine space used and total.
    	#
    	SizeAllowed=`df -H | grep /tmp/mount | awk '{print $2}'`
    	SizeOfBackup=`df -H | grep /tmp/mount | awk '{print $3}'`
    	BackupUsed=$[$[`df | grep /tmp/mount | awk '{print $3}'` * 100] / `df | grep /tmp/mount | awk '{print $2}'`]
    
    	#
    	# Unmount and detach the image.
    	#
    	hdiutil detach -quiet /tmp/mount
    	rmdir /tmp/mount
    
    	#
    	# Log the information about this backup.
    	#
    	if [ $DaysSinceBackup -gt $BackupTMAgeAlert ]; then
    		LogAlert "$BackupName has not been backed up in $DaysSinceBackup days."
    	fi
    	LogMessage "$BackupName has $NumberOfBackups backups. $SizeOfBackup of $SizeAllowed (${BackupUsed}%). First/last backup was $DaysSinceFirstBackup/$DaysSinceBackup days ago."
    
    	return 0;
    }
    
    #
    # Check every sparsebundle in the backup directory.
    #
    for f in "$TMPath"/*.sparsebundle; do
    	backup=${f:$[${#TMPath} + 1]}
    	CheckVolume "$backup"
    done
    
    #
    # Determine total usage for time machine.
    #
    SizeAvail=`df -H | grep "$TMPath" | awk '{print $4}'`
    BackupUsed=`du -skch "$TMPath"/*.sparsebundle | tail -n1 | awk '{print $1}'`
    LogMessage "Total backup space used for time machine $BackupUsed ($SizeAvail available)."
    
    DumpAlertLog
    DumpLog
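
The age check in CheckVolume boils down to integer arithmetic on epoch timestamps. Here is a minimal sketch of just that step. The names days_between and is_stale are illustrative and do not appear in the original scripts, and the sketch uses $(( )) in place of the older $[ ] form.

```shell
#!/bin/sh
# Sketch only: days_between and is_stale are illustrative names that do
# not appear in checkTMBackups.sh.

# Number of day boundaries (UTC) crossed between two epoch timestamps.
# Each timestamp is floored to a whole day first, as the script does.
days_between() {
	echo $(( $2 / 86400 - $1 / 86400 ))
}

# Succeed when the last backup ($1) is more than $3 days before now ($2).
is_stale() {
	[ "$(days_between "$1" "$2")" -gt "$3" ]
}

# Example: a backup 15 days before "now" trips a 14-day threshold.
now=$(date "+%s")
if is_stale $(( now - 15 * 86400 )) "$now" 14; then
	echo "backup is stale"
fi
```

Because both timestamps are floored before subtracting, the result counts UTC midnights crossed rather than elapsed 24-hour periods, so a backup taken one second before midnight UTC reads as one day old a second later. For a 14-day alert threshold that rounding hardly matters.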

    functions.sh

    #!/bin/sh
    #
    # This file provides common functions I use. I make no guarantees that
    # any of it will work.
    #
    # Copyright (c) 2010 Daniel Hazelbaker
    #
    # Version 1.0 - 2010/02/19
    #
    
    ######################################################################
    #
    # Functions to provide logging information.
    #
    ######################################################################
    
    Log_Messages=""
    Log_Alerts=""
    Log_Prefix=""
    Log_AlertPrefix="*** CRITICAL -"
    
    #
    # Set the prefix used when logging messages.
    #
    function SetLogPrefix
    {
    	Log_Prefix="$1"
    }
    
    #
    # Set the prefix used when logging alerts.
    #
    function SetLogAlertPrefix
    {
    	Log_AlertPrefix="$1"
    }
    
    #
    # Log a simple message.
    #
    function LogMessage
    {
    	local msg
    
    	if [ -n "$Log_Prefix" ]; then
    		msg="["`date "+%F %T"`" $Log_Prefix] $1"
    	else
    		msg="["`date "+%F %T"`"] $1"
    	fi
    
    	LogMessageRaw "$msg"
    }
    #
    # Append an already formatted message ($1) to the message buffer.
    #
    function LogMessageRaw
    {
    	if [ -n "$1" ]; then
    		if [ -z "$Log_Messages" ]; then
    			Log_Messages="$1"
    		else
    			Log_Messages="$Log_Messages
    $1"
    		fi
    	fi
    }
    
    #
    # Display the log messages
    #
    function DumpLog
    {
    	if [ "$Log_Messages"X != "X" ]; then
    		echo "$Log_Messages"
    		echo ""
    	fi
    }
    
    #
    # Log a critical alert
    #
    function LogAlert
    {
    	local msg=""
    
    	if [ -n "$Log_Prefix" ]; then
    		msg="["`date "+%F %T"`" $Log_Prefix]"
    	else
    		msg="["`date "+%F %T"`"]"
    	fi
    	if [ -n "$Log_AlertPrefix" ]; then
    		msg="$msg $Log_AlertPrefix"
    	fi
    	msg="$msg $1"
    
    	LogAlertRaw "$msg"
    }
    #
    # Append an already formatted alert ($1) to the alert buffer.
    #
    function LogAlertRaw
    {
    	if [ -n "$1" ]; then
    		if [ -z "$Log_Alerts" ]; then
    			Log_Alerts="$1"
    		else
    			Log_Alerts="$Log_Alerts
    $1"
    		fi
    	fi
    }
    
    #
    # Display the alert messages
    #
    function DumpAlertLog
    {
    	if [ "$Log_Alerts"X != "X" ]; then
    		echo "$Log_Alerts" >&2
    		echo "" >&2
    	fi
    }
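
To see the effect of the two buffers without reading through the whole file, here is a condensed sketch of the same idea. It is a stand-in and not a copy of functions.sh: the names log_msg, log_alert and dump_logs are made up for the example, timestamps are left out, and everything goes to standard output (the original sends alerts to standard error, which the cron job merges with 2>&1).

```shell
#!/bin/sh
# Sketch only: a condensed stand-in for functions.sh with made-up names.
NL='
'
messages=""
alerts=""

# Append a normal message to its buffer, one message per line.
log_msg() {
	messages="${messages:+$messages$NL}$1"
}

# Append a critical message to a separate buffer.
log_alert() {
	alerts="${alerts:+$alerts$NL}*** CRITICAL - $1"
}

# Print every alert first, then a blank line, then the normal messages.
dump_logs() {
	if [ -n "$alerts" ]; then
		printf '%s\n\n' "$alerts"
	fi
	if [ -n "$messages" ]; then
		printf '%s\n' "$messages"
	fi
}

# Messages are logged in the order they occur...
log_msg "alice has 42 backups."
log_alert "bob has not been backed up in 15 days."
log_msg "carol has 17 backups."
# ...but the alert is printed ahead of both normal messages.
dump_logs
```

The alert about bob comes out first even though it was logged second, which is what puts problems at the top of the e-mail.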

    Cron job file

    MAILTO="daniel@mailinator.com,rharman@mailinator.com"
    0 1 * * Sun,Wed /usr/local/bin/checkTMBackups.sh 2>&1

    The above cron job will send e-mails to both daniel and rharman each time the script runs. An e-mail will be sent whether or not any problems were detected. The script runs at 1am every Sunday and Wednesday. Both scripts should be installed in the /usr/local/bin folder (you may need to create this path on your system). To edit root’s cron job list you can use the command

    EDITOR=nano sudo crontab -e

    The “EDITOR=nano” part tells it to use the editor called nano; otherwise you will be stuck with vi, which is a pain if you are not used to it. sudo tells it to run as root (it will ask for your login password) and crontab -e instructs it to edit that user’s crontab (root’s, in this case). Note: crontabs work on Snow Leopard for sure. I can’t say for sure yet whether they work on Lion or Mountain Lion. If they do not, you will have to use launchd to schedule the script instead, but hopefully they just work.

    With a lot of users, as I have, one or two sparsebundles on each run will come up with a warning about not being able to mount. This message can occur for two reasons. The first is that the sparsebundle is in use by a client (i.e. the backup is happening right now). Most of our machines get left on overnight so this isn’t uncommon. The script tries to work through this problem by re-checking up to 10 times at 30-second intervals. The second cause is that the sparsebundle needs a file system check. Because we mount the file system read-only, the mount aborts if the file system is not clean, since it is not allowed to fix the problems. I usually ignore these messages unless I notice the same one two or three e-mails in a row; then I follow up and look into it.
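
The wait-and-retry behaviour can be pulled out into a small reusable helper. This is a sketch under assumptions, not the code the script uses: retry and bundle_is_idle are hypothetical names, and bundle_is_idle simply wraps the same lsof check that CheckVolume performs.

```shell
#!/bin/sh
# Sketch only: retry and bundle_is_idle are hypothetical helper names.

# Run a command until it succeeds, up to $1 attempts, sleeping $2 seconds
# between tries. Returns 0 on success, 1 if every attempt failed.
retry() {
	retry_max="$1"
	retry_delay="$2"
	shift 2
	retry_i=0
	while [ "$retry_i" -lt "$retry_max" ]; do
		if "$@"; then
			return 0
		fi
		retry_i=$((retry_i + 1))
		# Skip the sleep after the final failed attempt.
		if [ "$retry_i" -lt "$retry_max" ]; then
			sleep "$retry_delay"
		fi
	done
	return 1
}

# Succeeds when no process has files open under the bundle's bands
# directory (the same lsof check the script performs).
bundle_is_idle() {
	! lsof 2>/dev/null | grep -F -q "$1/bands"
}

# Example with a 0-second delay so it finishes immediately.
retry 3 0 true && echo "succeeded"
```

With these in place, the check in the script would read something like `retry 10 30 bundle_is_idle "$1"`. Unlike the original loop, this version does not sleep after the last failed attempt, so the worst case is 4.5 minutes instead of 5.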

    Download Scripts

  • Lion introduces Time Machine volume size limit

    Posted on September 5th, 2011 daniel No comments

    So being that it was a holiday and we had the chance to shut down servers to run updates and look into problems, we did just that. Nearly all our updates ran without a hitch so that was nice. We had one problem to look into though.  We have a Snow Leopard server running as a Time Machine backup server for our network. It has worked great since Leopard when Time Machine first came out. With the release of Lion we now have 3 machines (my boss’ and mine) that are running Lion. The rest we have held off on upgrading until Apple fixes the data loss issues related to non-local storage volumes.

    These three Lion machines have not been able to run a Time Machine backup to the TM server, but they would back up everywhere else. The only difference was that the backup server’s storage was connected via SCSI, so we decided to pull all 16 drives out of the array and drop a temp drive in to make sure it wasn’t a SCSI issue. Worked like a charm, so it isn’t SCSI. We also noticed the old array was partitioned with APM (Apple Partition Map), so we figured that had to be the issue and decided to wipe out all the backed up data and re-partition as GUID, since that should have fixed it. No dice. Formatted our 28TB backup array and still couldn’t back up. Kept getting a random error about DiskImageHelper crashing while trying to create the sparsebundle.

    After an hour and a half of trying everything I could think of, I finally tracked it down. Lion’s Time Machine cannot handle a target volume of 17TB or more. If I format the array under 17TB it’s fine. If it is 17TB or more it fails, every single time. For now we are working around this problem by creating a 25.5TB volume for all the working computers and a 2.5TB volume for the 3 Lion computers. I don’t have an easy way to test whether this is a network volume related issue or just a 17TB issue, period. I don’t have a way to plug any of our Lion machines into the array, and I’m not going to upgrade any computers until the data-loss bug is fixed, as that is a pretty glaring issue that Apple has so far ignored (10.7.1 and counting).
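
If you want a backup server to warn you before a volume crosses that line, a size check is easy to script. The sketch below assumes the 17TB figure observed above (it is not a documented Apple limit) and assumes decimal terabytes; the function names are made up for the example.

```shell
#!/bin/sh
# Sketch only: the 17 TB default comes from the observation in this post,
# not from Apple documentation, and decimal terabytes are an assumption.

# Succeeds when a volume of $1 one-kilobyte blocks is at or above the
# limit of $2 terabytes (default 17).
volume_exceeds_limit() {
	size_kb="$1"
	limit_tb="${2:-17}"
	[ $((size_kb * 1024)) -ge $((limit_tb * 1000000000000)) ]
}

# Total size of the filesystem holding path $1, in 1K blocks, taken from
# the second column of POSIX-format df output.
volume_size_kb() {
	df -Pk "$1" | awk 'NR == 2 { print $2 }'
}

# Example: a 28 TB array (in 1K blocks) is over the limit, 2.5 TB is not.
if volume_exceeds_limit 27343750000; then
	echo "28TB volume: too large for Lion clients"
fi
if ! volume_exceeds_limit 2441406250; then
	echo "2.5TB volume: fine"
fi
```

On a real server the first argument would come from something like `volume_size_kb /Volumes/HDCBackups`.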

  • Mac OS X Lion: Auto-save or self-destruct?

    Posted on July 29th, 2011 daniel 25 comments

    Update: Read the 10.7.2 update here (sorry, doesn’t really fix the problem).

    Update 2: Read the latest from Apple here.

    Update 3: Apple has resolved the issue with the 10.7.3 update.

    Mac OS X Lion includes a new feature called auto-save, coupled tightly with versions. This feature is designed to automatically save versions of your documents as you make changes. Doing so allows you to roll back changes if you decide you didn’t like something you added, modified or deleted. I like it, very nice feature. Auto-save also means you don’t need to click save on your document. Ever. Even if you want to discard the changes you are going to make, you can’t just click “Don’t Save” anymore.

    For a home user, this isn’t that big a deal. Home users don’t often do destructive editing that they intend to throw away. Home users are writing school papers, letters to friends, editing pictures to be printed, etc. Yes they do sometimes want to make changes and discard them, but far less often than they are going to want everything they do saved every step of the way.

    Business users often do destructive editing that they don’t want saved. Form letter templates? Logos that need to be resized depending on what you are printing on? Keynote templates for how the message in the series should look? Some of these (such as the keynote example) by nature you will duplicate and start editing because you intend to keep it.  But not always. Suppose you want to open this current series’ Keynote template and see how an image you want to use will look on it. You drop the image in and you go “nah, don’t like it” and close the window. You have just saved that image into the central template. Yes people can go back a version but now there is confusion, “huh, is this really the template? what happened, is anything else changed I need to fix?”

    We have all kinds of standard form letter templates that we use for various things. People open them all the time, put in a person’s name, change a few lines of text, add a personalized goodbye, hit print and then close the window. They have just modified that letter template for every other user on the network and they don’t even realize it because it didn’t ask them to save. What if they added something pretty harsh to the letter and then another person opens it up, changes out the name and just hits print because they know that template is perfect for this letter from the last time they had used it. They just printed out a letter with a really harsh statement intended for somebody else entirely.

    The recipe for self-destruct comes from the fact that, while this by itself is just annoying for business users, the fact that Lion does not support versioned files on network volumes (even an AFP share on a Lion Server) creates disaster. Currently, when you make changes to a file that resides on a server and close it, the original file is modified but no previous version is saved that you can revert to when you open it up next and realize your mistake. Want to test this out for yourself?

    1. Take a screenshot with Command-Shift-4 and move that beautiful image over to your file server.
    2. Open the image up in Preview.
    3. Select an area of the image and hit Command-K to crop it.
    4. Print the image (this step really doesn’t matter, but it’s here to give you an idea of why you would do this in the first place).
    5. Close the image.
    6. Open the image again and try to find a way to get your original image back. Buh-bye, no more pretty screen-shot.
    7. Cry.

    Version information seems to be stored in the root folder of the volume in a folder called “/.DocumentRevisions-V100”. My guess is that a bug (or maybe an oversight?) in Lion means that clients do not have access to this folder to store and retrieve version information. Or maybe it is storing the version information on the local hard drive but then looking it back up on the remote hard drive (or vice-versa).

    Either way, it means that due to auto-save and the lack of versions, anytime you open a document you are forced into doing destructive editing that will be saved when you close the window without thinking about it. What this means for us is that we cannot and will not roll out Lion to our desktop users until this bug is fixed. I normally try to give Apple the benefit of the doubt. When they change something core I normally force myself to try it for a month before deciding whether I will hate it or like it. Most of the time I end up liking it. The rest of the time I generally don’t hate it but just live with it. This is one of those times that Apple really dropped the ball and blew it. I’m not sure how you release an OS with such a gaping data loss hole in it. I have already contacted our business rep at Apple to inform them of this issue so we’ll see what happens.

    What I would like to see (and told our rep) is the ability to disable the auto-save feature via MCX or some other method. Versions would still work and it would still save versions automatically in the background as I go, but somehow flag them as temporary (in case of a crash). When I close the document I would then be greeted with my old friend the “do you wish to save” dialog. If I click “Don’t Save” then those temporary versions are discarded and the document remains unchanged on the hard drive. If I click “Save” then those temporary versions are re-flagged as permanent and the document on the hard drive (or server) is updated to my latest changes.

    What happens in case of an application crash? That is what those temporary flags are for. When the application re-launches, or even another user opens the same document, it would detect those temporary versions and prompt the user about it. “Do you want to continue editing or discard changes?”

    To be honest though, what I expect is that Apple will simply fix the server-related bug and leave auto-save on without any way to turn it off.  That will mean some major headaches for me in I.T.  Our users can probably learn to adjust to the new auto-save functionality, but they will be forced into a very annoying work-flow:

    1. Open template letter
    2. Be prompted with the “document is locked” dialog and click the “duplicate” button.
    3. Enter a new filename for the document (presumably, I haven’t done this yet).
    4. Make the changes to the template.
    5. Print those temporary changes.
    6. Close the file.
    7. Remember to delete this new file that was created.

    Sorry Apple, but that’s just not going to happen. We’re going to be littered with all kinds of temporary files if that’s the case. Or worse, somebody accidentally hits unlock and edits the original. Or somebody intentionally unlocks the original to make some changes, and then before it is auto-locked again somebody else opens it to print off a custom copy and those changes are saved permanently.

  • Snow Leopard + Time Machine Errors

    Posted on January 30th, 2010 daniel No comments

    The previous blog post has solved nearly all of our network Time Machine errors. Those it hasn’t solved have been related to Snow Leopard machines, all of which were running 10.6.1. I can’t say for sure that that was the issue, but here is what I did to get TM running over the network on those Snow Leopard machines. Some of these steps, namely unbinding from the directory, may not be necessary. I definitely have to unbind at the moment, because all our bound clients use our own software update server, which is still running 10.5.x and so does not provide the 10.6.x updates.

    • Unbind client from LDAP directory
    • Open Terminal
      • sudo rm -rf /Library/Managed\ Preferences
      • sudo rm -rf /Library/Preferences/com.apple.TimeMachine.plist
      • Close Terminal
    • Open Keychain Access (in your Applications/Utilities folder)
      • Go to the System keychain and remove the entry for your backup server.
      • Close Keychain Access
    • Reboot client
    • Log in and run all Software Updates.
    • Reboot client and continue running Software Update until all updates have been installed, even ones you don’t think you need (i.e. Camera Raw, etc.)
    • Once all updates have been installed, re-join client to the LDAP directory and reboot client.
    • Log in as the user you want Time Machine to run backups as, then start backups.

  • Time Machine Access Denied

    Posted on December 16th, 2009 daniel No comments

    Okay, so we had one computer exhibiting the horrid -5023 error when performing Time Machine backups in a network environment. We finally resolved it.

    First, here is our setup. We have an Open Directory database in place that identifies which machines are supposed to perform Time Machine backups and where they back up to. Our OD setup does not require (nor allow) client computers to authenticate, as that always caused us nothing but problems. For the most part, they work fine. Occasionally we run into problems. Until this one machine, simply resetting a client’s TM settings would get it working again. This time, we got the -5023 error and were stuck.

    IP 10.0.2.82 - - [11/Dec/2009:17:51:13 0100] "Login daniel" -5023 0 0
    IP 10.0.2.82 - - [11/Dec/2009:17:51:13 0100] "Logout daniel" -5023 0 0
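
With many clients it helps to pull these entries out of the log automatically. The sketch below assumes the entries come from the file server's access log, that the real file uses plain ASCII double quotes around the action, and that the client address is the second whitespace-separated field, as in the two lines above. The function names are made up, and the log is read from standard input so no file path is assumed.

```shell
#!/bin/sh
# Sketch only: assumes log lines shaped like
#   IP 10.0.2.82 - - [date] "Login daniel" -5023 0 0
# with ASCII double quotes; function names are illustrative.

# Count entries on stdin whose result code is -5023.
count_5023() {
	grep -c -- '" -5023 ' || true
}

# List each distinct client address that hit a -5023 error.
ips_with_5023() {
	grep -- '" -5023 ' | awk '{ print $2 }' | sort -u
}

# Example using the two entries shown above.
sample='IP 10.0.2.82 - - [11/Dec/2009:17:51:13 0100] "Login daniel" -5023 0 0
IP 10.0.2.82 - - [11/Dec/2009:17:51:13 0100] "Logout daniel" -5023 0 0'
printf '%s\n' "$sample" | count_5023
printf '%s\n' "$sample" | ips_with_5023
```

The `|| true` is there because grep exits non-zero when it finds no matches, even though it still prints a count of 0.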

    Here is what we finally did to resolve it.

    First stage: Unbind the client from Open Directory. From Workgroup Manager also delete the client computer record. Then reboot the client computer. Set up TM manually to point somewhere; it doesn’t really matter where. Then turn TM back off so it un-configures itself.

    Second stage: Reboot the client again and change the local user’s password to anything else (say “password1”). Change the directory password as well to “password1”. Open up Keychain Access and delete any Time Machine passwords or other passwords referencing your TM backup server. Now go to the TM backup server and delete the old backup image.

    Third stage: Reboot the client again and re-join the domain. You must do this as the user who is to be backed up (i.e. the primary user). From Workgroup Manager put the computer back in and re-add it to the TM backup group. Reboot again to pick up the Managed Client settings. Log in again as the primary user and attempt to start a TM backup. It should work at this point. Cancel the backup.

    Fourth stage: Perform the first 3 stages again, only this time change the password back to the original password and don’t cancel the backup. Everything should work. It may work to change the password back to the original immediately after changing it to “password1”, but I can’t be sure. Feel free to try it.