Ever since we moved to the paid version of VMware I had noticed some oddities in speed, mostly in the backups from the VDR appliance. Sometimes a VM would back up in about 30 minutes; other times it would time out after 10 hours (i.e. it ran for 10 hours, the backup window closed, and the job was cancelled). At one point I was at home and noticed one of the Linux VMs taking forever to respond and just feeling extremely slow. I connected in remotely and saw that the load average was up around 80. For non-Unix users, that is roughly the average number of processes either running or waiting to run (on Linux it also counts processes blocked in uninterruptible sleep, which usually means disk I/O). This number should really never stay much above the number of CPUs you have installed.
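As a quick illustration (the exact thresholds are a rule of thumb, not a vendor spec), you can spot this condition on a Linux guest by comparing the 1-minute load average against the CPU count:

```shell
#!/bin/sh
# Compare the 1-minute load average with the number of CPUs.
# A load persistently far above the CPU count means work is queueing;
# on Linux the figure also includes tasks in uninterruptible sleep,
# which is why a stalled disk inflates it so dramatically.
cpus=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "1-min load: ${load1}  CPUs: ${cpus}"
```

In my case the load was ~80 on a VM with far fewer CPUs, and almost all of those tasks were stuck waiting on disk rather than actually burning CPU.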
I did some snooping and noticed everything was waiting on disk I/O. I ran a disk speed test and found I was getting 103 KB/s. Yes, KB, not MB. Obviously there was a problem. The next day in the office I did some more digging on the VMware hosts with a newly learned utility called esxtop and found that 2 of the 8 iSCSI paths to our VM storage were extremely slow. The 6 healthy paths were running at about 80 MB/s; the 2 problem paths were around 1.3 MB/s. Thus began a roughly six-week endeavor to track down the cause.
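A crude sequential-throughput check like the one I ran can be done with dd (the scratch path below is just an example; `conv=fdatasync` forces the data to actually reach disk before dd reports a speed, instead of measuring the page cache):

```shell
#!/bin/sh
# Write 256 MB of zeros and report throughput. conv=fdatasync makes dd
# flush to disk before printing its rate, so the number reflects real
# storage speed rather than memory speed.
dd if=/dev/zero of=/tmp/disktest.bin bs=1M count=256 conv=fdatasync
# Read it back; note this read may be served from cache unless the
# cache is dropped first.
dd if=/tmp/disktest.bin of=/dev/null bs=1M
rm -f /tmp/disktest.bin
```

Anything in the low-hundreds-of-KB/s range on SAN-backed storage, as I saw, points at the storage path rather than the guest.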
The final conclusion, reached this weekend, was that I had configured iSCSI incorrectly on the ESX hosts. At the same time we moved to paid VMware we also bought extra NICs for the hosts so we would have redundancy, giving us 2 iSCSI NICs per host. I had configured each of the 2 iSCSI adapters as its own VMkernel port (correct) but had them sharing the same vSwitch (apparently not correct). The reason I did this is that I also needed in-VM access to the “SAN” network, so I had created a VM network in that vSwitch that shared the 2 NICs for load balancing, while the iSCSI adapters were each set to use only one NIC.
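For reference, the supported layout binds each iSCSI VMkernel port to exactly one uplink with no teaming. A sketch of what that looks like from the command line on a more recent ESXi host (portgroup, vmnic, vmk, and adapter names here are placeholders, not from my actual hosts):

```shell
# Override the vSwitch teaming per iSCSI portgroup so each one has
# exactly one active uplink (names are illustrative placeholders):
esxcli network vswitch standard portgroup policy failover set \
    -p iSCSI-1 --active-uplinks vmnic2
esxcli network vswitch standard portgroup policy failover set \
    -p iSCSI-2 --active-uplinks vmnic3
# Then bind each VMkernel port to the software iSCSI adapter:
esxcli iscsi networkportal add -A vmhba33 -n vmk1
esxcli iscsi networkportal add -A vmhba33 -n vmk2
```

The key point is the 1:1 mapping: each VMkernel iSCSI port sees exactly one physical NIC, so the multipathing layer (not NIC teaming) is what provides the redundancy.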
Evidently that is not a supported configuration, even though the host reported everything as compliant. Somehow my setup caused the iSCSI adapters to funnel nearly all traffic through a single NIC instead of balancing across both. My best guess is that all outgoing data was being sent via one NIC (under two different MAC addresses) while ESX expected some of the return traffic to arrive on the second NIC, causing really weird behavior.
Moral of the story: after you set up your iSCSI subsystem and have everything talking, do some benchmarking. Switch your connection method from round-robin to fixed and test each path one at a time to make sure they are all working properly. A path may be active but not performing as expected. In hindsight I should have done this anyway, even just to confirm details like link speed. I could have had the same problem simply from one NIC somewhere linking up at 100 Mbit instead of gigabit, which would also have caused strange slowdowns.
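On newer ESXi versions this round-robin-to-fixed testing can be scripted with esxcli (the device ID and path name below are placeholders for illustration):

```shell
# Switch a device from round-robin to the fixed path selection policy
# (device ID is a placeholder):
esxcli storage nmp device set -d naa.600000000000000000000001 \
    --psp VMW_PSP_FIXED
# List the available paths for that device:
esxcli storage nmp path list -d naa.600000000000000000000001
# Pin one path as preferred, benchmark, then repeat for the next path:
esxcli storage nmp psp fixed deviceconfig set \
    -d naa.600000000000000000000001 -p vmhba33:C0:T0:L0
```

Pinning each path in turn and running the same disk benchmark against it is what would have exposed my two 1.3 MB/s paths immediately, instead of letting round-robin hide them behind the six good ones.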
I also found there were some issues with the two switches linking the various devices together, causing a bottleneck. Once the configuration issue above was sorted out I no longer had the “absolutely terrible” throughput, but I would occasionally see mediocre throughput (though not enough to destroy the world). Again using esxtop, I traced this to times when people copied large files from the file server: doing so saturated the link between the switches and slowed down the ESX machines, though not nearly as much as the misconfiguration had. So in addition to basic tests on each path, I also recommend looking at what else might be sharing any physical network paths you depend on and stress-testing those paths. I now have 2 switches with 10-gig uplink ports waiting for the next holiday to be installed.
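One way to stress an inter-switch link today is with a tool like iperf3 (host addresses are placeholders; this assumes you have a spare machine behind each switch):

```shell
# On a machine behind one switch, start a listener:
iperf3 -s
# From a machine behind the other switch, push several parallel
# streams for 30 seconds so the traffic crosses the uplink:
iperf3 -c 192.168.1.50 -P 4 -t 30
```

If the aggregate rate tops out near the uplink's line rate while storage traffic is also flowing, the inter-switch link is the contention point, which is exactly the pattern esxtop revealed during the large file copies.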
Here is a link to a page that talks about esxtop, with some extra links from there for more information. Get to know this tool; it will be invaluable when it comes to troubleshooting your ESX hosts.