Tuesday, April 3, 2018

Resync Improvements in vSAN 6.6

Having seen some of the resync times on vSAN 6.0, I welcomed the news that vSAN 6.6 can weigh which route is more expensive (in terms of rebuild time and data to sync) when a host drops out and later rejoins the cluster. I had never seen it for myself until today. I am in the middle of a vSAN upgrade and have to update the HBA firmware on my controllers. I switched to the N2215 HBA for my vSAN and need the cards on firmware supported by the VMware Compatibility Guide (VCG). The current firmware level is 9.00.02.00, and the vSAN Health hardware check shows that either 11.00.02.00 or 13.00.02.00 is supported.

I knew this upgrade was going to take a while, with several firmware versions on the host needing an update. Firmware upgrades are notoriously slow because of the critical nature of the components; many safety checks are in place to make sure things are working. In addition to the HBA firmware I also had UEFI firmware, DSA, and IMM, which is IBM/Lenovo speak for an out-of-band server management card.

vSAN resync times were top of mind as I approached this update, so I wanted to increase the vSAN repair delay: the time vSAN waits before it starts rebuilding components when a host is absent from the cluster, which defaults to 60 minutes. This can be done through either the GUI or the CLI, and I will show both methods here. In the web client, go to the host level, choose Advanced System Settings, and edit the value for VSAN.ClomRepairDelay. In my case I changed it from 60 to 90 minutes to give myself more time:

The VMware KB article for this shows that even if you change this setting via the GUI you will still need to SSH into each vSAN host and restart the CLOMD service for the change to take effect. Use the following command:
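On ESXi 6.x that restart is normally issued through the CLOM daemon's init script. What follows is a minimal sketch, assuming the script sits at /etc/init.d/clomd; the `restart_clomd` wrapper and its path check are my own illustration, not something from the KB.

```shell
# Run over SSH on each vSAN host.
# Assumption: the CLOM daemon's init script is /etc/init.d/clomd (usual on ESXi 6.x).
restart_clomd() {
  # Hypothetical helper: $1 optionally overrides the init script path.
  init_script="${1:-/etc/init.d/clomd}"
  if [ ! -x "$init_script" ]; then
    # Wrong host or a different path: say so rather than fail silently.
    echo "clomd init script not found at $init_script" >&2
    return 1
  fi
  "$init_script" restart
}

restart_clomd || echo "clomd was not restarted on this host" >&2
```

On a real host the whole thing boils down to `/etc/init.d/clomd restart`; the wrapper only makes a typo or a wrong SSH session obvious.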



If you prefer the CLI method, you can change the settings using the following commands via SSH on each vSAN host:
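The advanced setting maps to the /VSAN/ClomRepairDelay option of `esxcfg-advcfg`. Here is a short sketch, assuming a stock ESXi shell where `esxcfg-advcfg` is on the PATH; the `REPAIR_DELAY_MINUTES` variable and the availability check are illustrative additions of mine.

```shell
# Run over SSH on each vSAN host.
# Assumption: esxcfg-advcfg is on the PATH, as on a stock ESXi shell.
REPAIR_DELAY_MINUTES=90   # the default is 60

if command -v esxcfg-advcfg >/dev/null 2>&1; then
  # Set the new repair delay, then read it back to confirm it took.
  esxcfg-advcfg -s "$REPAIR_DELAY_MINUTES" /VSAN/ClomRepairDelay
  esxcfg-advcfg -g /VSAN/ClomRepairDelay
else
  echo "esxcfg-advcfg not found; run this on the ESXi host" >&2
fi
```

The value should be the same on every host in the cluster, so repeat this on each one.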



Then restart the service.
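Because the setting and the CLOMD restart both have to be repeated on every host, the two steps can be chained and pushed out from a workstation. This is only a sketch: the host names are hypothetical, root SSH access to each host is assumed, and `apply_to_hosts` is a helper name I made up. It prints the commands by default so you can review them before running anything.

```shell
# Hypothetical host names; replace with your own vSAN hosts.
VSAN_HOSTS="esx01.example.com esx02.example.com esx03.example.com"
# Set the repair delay to 90 minutes, then restart CLOMD so it takes effect.
REMOTE_CMD="esxcfg-advcfg -s 90 /VSAN/ClomRepairDelay && /etc/init.d/clomd restart"

apply_to_hosts() {
  # $1 = "run" executes over SSH; anything else only prints the commands.
  for host in $VSAN_HOSTS; do
    if [ "${1:-}" = "run" ]; then
      ssh "root@$host" "$REMOTE_CMD"
    else
      echo "ssh root@$host '$REMOTE_CMD'"
    fi
  done
}

apply_to_hosts preview   # switch to: apply_to_hosts run
```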



After that piece was taken care of I started the upgrades. Firmware upgrades are like watching paint dry, but I wanted to keep an eye on the systems as they upgraded. You never know what might happen if you turn away from the screen; you could miss an important message or error popping up. After booting into BOMC, which is a live Linux environment, the upgrades started. They took quite a while, as expected, and after 6 reboots I knew I was unlikely to make the 90 minute window I had given myself. It would be close, but with boot time and so on I wouldn't make the cutoff. For those unfamiliar with vSAN, ESXi boots take longer because the vSAN service needs to start up, initialize the disks, and then run checksums/data checks before the rest of the hypervisor can finish loading.

I knew that things would be better than what I had experienced on 6.0, but I wasn't quite prepared for how much better. I once experienced a 48 hour resync on 6.0. Keep in mind that doesn't mean data was lost or VMs were down. It was simply due to the design of our vSAN array: running FTT=1 (failures to tolerate), I was left with only 1 good copy of "some" VM components. Again, this doesn't affect all VMs either; it depends on component placement, which host went down, SPBM (Storage Policy Based Management) settings, and so on.

Contrast that with what happened today on vSAN 6.6.1. When the host came back up after the firmware upgrades, I took it out of maintenance mode and looked at the vSAN health section to see what would happen. I went in to see how much data was resyncing and it started out at 4TB, but as I clicked refresh I watched that value shrink rapidly as vSAN evaluated the data from the missing host that had just come back "online" to the array. Within a few seconds the number dropped to 1TB, then 900GB, then 40GB, and a few minutes later the array was healthy with full redundancy attained once more. The 40GB it eventually settled on was the amount of differential change the rest of the SSDs in my array had seen in the time that Host 1 was offline. All other data had been evaluated very quickly and determined to be usable, and only the stale data had to be refreshed.
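The same shrinking total can be watched from the command line instead of clicking refresh in the health view. A rough sketch, assuming the `esxcli vsan debug` namespace is present (to my knowledge it arrived with vSAN 6.6); the polling interval and loop count are arbitrary values I picked.

```shell
# Run over SSH on any host in the vSAN cluster.
# Assumption: the "esxcli vsan debug" namespace exists (vSAN 6.6 and later, to my knowledge).
POLL_SECONDS=10   # arbitrary polling interval

if command -v esxcli >/dev/null 2>&1; then
  # Print the resync summary a few times to watch the totals shrink.
  for attempt in 1 2 3 4 5 6; do
    esxcli vsan debug resync summary get
    sleep "$POLL_SECONDS"
  done
else
  echo "esxcli not found; run this on an ESXi host" >&2
fi
```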

As I mentioned to a friend, it is one thing to hear about the feature improvements of vSAN, and another thing entirely to witness them firsthand. I want to give the vSAN team huge kudos for the way they have continued to improve the product and make it enterprise-ready. They continue to deliver the improvements that customers expect in an enterprise-class array. I hope you found this post useful. Thanks for reading. Cheers!
