Time Machine – Too many clients for one volume

April 15, 2014
So at our site, we have 63 workstations doing Time Machine backups to a Linux box (using netatalk). The Linux box provided 30TB of RAID5 storage over 16 spindles of 2TB drives via SCSI. After about a month of backups the server started bogging down. It wasn’t unusual to see over 10 clients backing up concurrently, not because they all kicked off at once, but because each backup was taking so long that they piled up on top of each other. The load average on the machine was pushing 8 (roughly eight processes running or waiting to run at any given time). Data throughput to the array was LOW, on the order of 5-6MB/s, when with that many clients it should easily have been pushing 60+ MB/s.
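
For the curious, here is a minimal sketch of how you could watch those two numbers on a Linux backup server; it samples the load average from /proc/loadavg and write throughput from /proc/diskstats. This isn’t what we ran at the time, and the device name (sda) and sample interval are just placeholders to adjust for your own array.

```python
#!/usr/bin/env python
# Rough monitoring sketch: print the 1-minute load average and the write
# throughput of one block device by sampling /proc/diskstats twice.
# DEVICE and INTERVAL are assumptions; use your array device (e.g. md0, sdb).
import time

DEVICE = "sda"          # hypothetical device name for the RAID array
SECTOR_BYTES = 512      # /proc/diskstats counts 512-byte sectors
INTERVAL = 10           # seconds between samples

def sectors_written(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])   # 10th field: sectors written
    raise ValueError("device %s not found" % device)

def load_average():
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])   # 1-minute load average

before = sectors_written(DEVICE)
time.sleep(INTERVAL)
after = sectors_written(DEVICE)

mb_per_s = (after - before) * SECTOR_BYTES / float(INTERVAL) / (1024 * 1024)
print("load average: %.2f  write throughput: %.1f MB/s"
      % (load_average(), mb_per_s))
```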

By the end of the second month we were looking at 15-20 clients trying to back up concurrently, each taking over an hour to run, with a load average pushing over 30. The backup server was basically useless, and it was making the desktop computers run more slowly too (not sure why, but somehow Time Machine taking longer to finish was causing a noticeable slowdown on the clients).

I did a lot of research on RAID types, spindle counts, etc. There seemed to be no “silver bullet” answer. All the knowledgeable-sounding people said the same thing: try various layouts and see what works best. Seeing as Time Machine was running so slowly that some computers were taking all day (8 hours) to complete a single backup, I figured I didn’t have much to lose, so I wiped the storage array and started over.

I rebuilt the storage array as three 6TB RAID5 volumes of 4 spindles each. The remaining 4 spindles became an 8TB striped RAID set for our file server. All 63 clients were distributed across the three backup volumes. It has now been a month, and the load average is consistently under 2. Throughput to the array easily pushes 40MB/s when one or two clients are backing up large files.
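
As a quick sanity check on the capacity math (just a sketch, assuming 2TB per spindle): a 4-spindle RAID5 set loses one spindle’s worth of space to parity, giving 6TB usable, and the 4-spindle stripe gives the full 8TB.

```python
# Back-of-envelope capacity check for the new layout (2TB spindles assumed).
SPINDLE_TB = 2

def raid5_usable(spindles, size_tb=SPINDLE_TB):
    # RAID5 loses one spindle's worth of capacity to parity
    return (spindles - 1) * size_tb

def stripe_usable(spindles, size_tb=SPINDLE_TB):
    # RAID0 stripe has no redundancy, so all capacity is usable
    return spindles * size_tb

backup_volumes = [raid5_usable(4) for _ in range(3)]   # three 4-spindle RAID5 sets
file_server = stripe_usable(4)                         # one 4-spindle stripe

print("backup volumes:", backup_volumes, "TB each")         # [6, 6, 6]
print("total backup capacity:", sum(backup_volumes), "TB")  # 18
print("file server stripe:", file_server, "TB")             # 8
print("spindles used:", 3 * 4 + 4)                          # 16
```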

It seems the bottleneck was the sheer amount of random access to the array. Time Machine does a lot of reading back off the volume to determine which files need to be backed up. With one big volume, whenever a file was written, all 16 drives had to seek to that write position and write the data, blocking every other read (and write) request until it finished. Now a write only ties up 4 drives at a time. In fairness, this is an old array; it does do command re-ordering, but apparently not very well, and a newer array might handle the mixed workload better.
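
To make that concrete, here is a toy back-of-envelope model (an illustration, not a measurement): the wider the stripe, the larger the fraction of the array a single client’s write ties up, and the less room there is for everyone else’s reads.

```python
# Toy model of spindle contention: what fraction of the array does one
# client's write lock up, and how many independent volumes are there?
# (In our case the fourth 4-spindle set is the file-server stripe,
# not a backup volume, but the contention picture is the same.)
def contention(total_spindles, stripe_width):
    volumes = total_spindles // stripe_width
    blocked_fraction = stripe_width / float(total_spindles)
    return volumes, blocked_fraction

for width in (16, 4):
    volumes, blocked = contention(16, width)
    print("stripe width %2d: %d volume(s), a single write blocks %3.0f%% "
          "of the spindles" % (width, volumes, blocked * 100))

# stripe width 16: 1 volume(s), a single write blocks 100% of the spindles
# stripe width  4: 4 volume(s), a single write blocks  25% of the spindles
```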

To summarize what we are running: a 16-drive SCSI RAID array, with Western Digital Green drives (i.e. not RAID edition, not high speed), backing up 63 desktop computers. Current stats show that every machine has backed up within the last 7 days, and all desktops are still on the default schedule of one backup per hour. All it cost was a little storage space. Each machine is limited to 250GB by default, with a few specific ones given higher limits; that keeps “run-away” backups from hogging everything. We had a few machines that were otherwise eating 1.5TB each after only 2 months of backups, thanks to virtual machines stored on the desktops. Now, with all 63 computers backing up for a month, total used storage is nearly 9TB, leaving another 9TB still available.
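
For reference, one way to enforce a per-client cap like that on the server side is netatalk’s volsizelimit volume option, which caps the volume size reported to the Mac (the value is in MiB). This is a hypothetical AppleVolumes.default entry in netatalk 2.2-style syntax, with the path and share name made up rather than taken from our actual config:

```
# AppleVolumes.default (netatalk 2.2-style syntax) -- hypothetical example,
# not our actual config. "tm" advertises the share as a Time Machine target;
# volsizelimit caps the volume size reported to clients in MiB
# (240000 MiB is about 250GB).
/srv/backups/tm1 "TimeMachine-1" options:tm,usedots,upriv volsizelimit:240000
```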