I was recently responsible for setting up a nested VCF deployment. My task was to do some day-2 configuration on an already deployed VCF environment which had been powered off and saved as a vApp. Little did I know that after powering on the vApp, I would have to deal with a total VSAN outage. After power-on, none of the hosts could see each other, so no VSAN cluster could be formed. On top of that, the deployment had no other storage available, so all the VMs (including vCenter and SDDC Manager) were inaccessible.

After a bit of troubleshooting, I confirmed the issue wasn’t with the unicast configuration or general connectivity between the hosts. I could ping between the vmk adapters without a problem, and the UUIDs of all the hosts showed up fine in each host’s unicast table. It turned out the issue was actually with the services enabled on the vmk adapters on each host: no VSAN service was enabled on any of them. Unfortunately, as the adapters were connected to a vDS, I could not easily fix that through the UI (vDS management requires an online vCenter)…
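For reference, the checks described above can be run with standard esxcli and vmkping commands on each host. The adapter name and peer IP below are examples, not values from the original environment:

```shell
# List the unicast agent table - the UUIDs of all other hosts should appear here
esxcli vsan cluster unicastagent list

# Show which vmk adapters VSAN believes it is using for traffic
esxcli vsan network list

# Test connectivity between VSAN vmk adapters
# (vmk2 and the peer address are placeholders - use your own)
vmkping -I vmk2 192.168.10.12
```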

the fix

I will spare you the details of how I went through hundreds of esxcli and similar commands before discovering what had to be done. The answer turned out to be pretty simple and involves running just a single command on each affected host:

esxcli network ip interface tag add -i vmk2 -t VSAN

As you can probably tell, this command adds a VSAN “tag” to the target vmk adapter, which enables the VSAN service on it. That really is all. Once I ran it on all the hosts, the VSAN cluster recovered and I could power the VMs back on.
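To confirm the change took effect on each host, you can check the adapter’s tags and the overall cluster state. Again, vmk2 is just an example adapter name:

```shell
# Show the tags (enabled services) on the adapter - VSAN should now be listed
esxcli network ip interface tag get -i vmk2

# Verify the host has formed/rejoined the VSAN cluster
esxcli vsan cluster get
```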

What’s worth noting is that this command can also be used to assign any other tag, in other words to enable any other service on the vmk adapter. It’s pretty useful if you ever end up in a similar situation.
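The same esxcli namespace covers the other services too: tags such as Management, VMotion or vSphereProvisioning can be added and removed in exactly the same way. The adapter name below is an example:

```shell
# Enable vMotion on vmk1
esxcli network ip interface tag add -i vmk1 -t VMotion

# Disable it again
esxcli network ip interface tag remove -i vmk1 -t VMotion
```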