Nexus Crash

As is typical in the world of IT, problems have a way of sneaking up on you when you least expect them, then viciously attacking you with a billy club.  Often this happens when you are asleep, on vacation, severely inebriated, or have already worked 40 hours straight with no sleep.  In my case, Super Bowl Sunday at around 8:30pm was my time to get the stick.  And get it I did.

For reasons too sad to warrant comment, and far too irritating to explain in a family forum like this, our ESX host servers all became disconnected from our SAN array.  The root problem was something else at layer 2 and was resolved quickly, but the virtual world was not so quick to recover.  In retrospect, the problem was not a bad one, but when you’ve been drinking and can’t see the obvious answer you tend to dig the hole you’ve fallen into deeper rather than climb promptly out.

By way of background, we are currently running vSphere 4.0, with a few servers having 32GB of memory and 8 cores, and a few having 512GB of memory and 24 cores.  All ESX hosts are SAN booting using iSCSI initiators on a dedicated layer-2 network.  We use Nexus 1000v soft switches and have our ESX hosts trunked using 802.1q to our Core (6506-E switches running VS-S720-10G supervisors).  Everything is redundant (duplicate trunks to each Core switch) and uses ether-channel with mac-pinning.  So there you have that, for what it’s worth.  Now back to the crashed servers.
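For anyone picturing the uplink side of that, the trunking and mac-pinning bits live in an ethernet-type port-profile on the 1000v VSM.  Here is a rough sketch only, with a made-up profile name and VLAN numbers rather than our actual configuration:

port-profile type ethernet system-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 10-30
  channel-group auto mode on mac-pinning
  no shutdown
  system vlan 10,11
  state enabled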

We rebooted all of the ESX host servers, and with the exception of some FSCK-complaining they all came up quite nicely.  The problem was that none of the virtual machines came up.  Let me add that we have the domain controllers, DHCP, DNS, etc. on these hosts.  Crap.

So the first thing I did in my addled state was to add DHCP scopes to the DHCP servers at another office across the country, and point the VLANs off “that-a-way” by changing the ip helper-address on each VLAN on the Core.  That got DHCP and DNS back online.  As you can probably guess by now, I was MacGyver-ing the situation nicely, but really didn’t need to.  That’s one of the problems when you’re in the trenches: you tend to think in terms of right-now instead of root cause.
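The helper-address swing itself is a one-liner per SVI on the Core.  Something along these lines, where the VLAN number and the far-end DHCP server address are placeholders and not our real ones:

configure terminal
 interface Vlan20
  no ip helper-address 10.1.1.10
  ip helper-address 172.16.5.10
 end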

The next thing I did was to start bringing up the virtual machines one-by-one using the command line on the ESX hosts.  Why?  Because I had no domain authentication and the vSphere Client uses domain authentication.  Here is where someone in a live talk would be interrupting me to point out that the vSphere Client can always be logged into using the root user of the hosts, even when domain authentication is set up for all users.  Yes, that is true and it would have been handy to know at the time.

In order to bring up the virtual machines, I had to first find the proper name by issuing:

vmware-cmd -l

from the command line.  This command can take a while to run, especially if you have a lot of VMs sitting around, so go get a cup of coffee.
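What you get back is a list of the full paths to each registered VM’s .vmx file, one per line, along these lines (the datastore and VM names here are hypothetical):

/vmfs/volumes/datastore1/dc01/dc01.vmx
/vmfs/volumes/datastore1/dns01/dns01.vmx
/vmfs/volumes/datastore1/app01/app01.vmx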

Once I had that list I prioritized the machines I wanted up first, and issued the:

vmware-cmd /path/to/server-name.vmx start

command on each one, using the .vmx path from the list above.  That should have been the end of the boot-up drama, but it wasn’t.  As it turns out, a message popped up (and I don’t remember the exact phrasing) to the effect of “you need to interact with the virtual machine” before it would finish booting.  So, now I issued the:

vmware-cmd /path/to/server-name.vmx answer

command and got something that looked about like this:

Virtual machine message 0:
msg.uuid.altered:This virtual machine may have been moved
or copied.
In order to configure certain management and networking
features VMware ESX needs to know which.
Did you move this virtual machine, or did you copy it?
If you don't know, answer "I copied it".
0. Cancel (Cancel)
1. I _moved it (I _moved it)
2. I _copied it (I _copied it) [default]

Well, I didn’t know, so I selected the default option (I copied it) and went on my way.  That is fine in almost every circumstance and it got all of my servers booted up.  It did not, however, entirely fix the problem.  In fact, even though all of my servers were booted, none could talk or be reached on the network.
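If you have more than a handful of VMs to wake up this way, the list and the start command chain together easily enough from the service console.  A quick sketch, assuming a bash shell on the ESX host; you still have to run the answer command by hand against any VM that stalls on the moved-or-copied question:

vmware-cmd -l | while read -r vmx; do
  vmware-cmd "$vmx" start
done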

This is where a little familiarity with the Nexus 1000v soft switches comes in handy.  Very briefly, the architecture is made up of two parts: the VSM, or Virtual Supervisor Module, and the VEM, or Virtual Ethernet Module.  The VSM corresponds roughly to the supervisor module in a physical chassis switch, and the VEMs are the line cards.  The interesting bit to remember for our discussion is that the VSMs (at least two for redundancy) are also virtual machines.
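You can see that relationship directly from the VSM by issuing show module: the VSMs show up as supervisor modules and each host’s VEM shows up as a line card.  The output looks roughly like this (the module numbers and status values are illustrative, not from our environment):

Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
3    248    Virtual Ethernet Module           NA                  ok
4    248    Virtual Ethernet Module           NA                  ok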

Some of you may have guessed already what the problem turned out to be, and are probably chortling self-righteously to yourselves right about now.  For the rest of us, here’s what happened:

I figured out the log-in-using-root thing and got the vSphere Client back up and running (oh, not before having to restart a few services on the Virtual Center Server, which is not a virtual machine, by the way.  I’m not totally crazy!).  Once I got that far I could log in to the Nexus VSM and look at the DVS to see what was going on.  All of my uplink ports (except for the ones having to do with control, packet, vmkernel, etc.) were in an “UP Blocked” state.

The short-term fix (again, the MacGyver job) was to create a standard switch on all hosts and migrate all critical VMs to that switch.  That didn’t, however, fix the problem permanently, and besides, we like the Nexus switches and wanted to use them.  With that in mind, and with a day or two to normalize the old sleep patterns, I set up a call with VMware support.  This actually took longer than I expected since I had to wait for a call-back from a Nexus Engineer, and they are apparently as rare as honest sales-people or unicorns.  That said, I did get a call back and we proceeded to troubleshoot the problem.
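For completeness, the throwaway standard switch can be stood up from the ESX command line as well.  A sketch with made-up vSwitch, port-group, VLAN, and vmnic names; the VMs then get pointed at the new port group from the client:

esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -A "Temp-VM-Network" vSwitch1
esxcfg-vswitch -v 20 -p "Temp-VM-Network" vSwitch1
esxcfg-vswitch -L vmnic2 vSwitch1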

One thing that surprised me was that it took the Nexus Engineer a bit longer than I would have thought to find the problem, but even once he did, it took longer to get resolution because we had to get Cisco involved.  The problem, as it turns out, was licensing.

When you license the Nexus, you receive a PAK and you use that to install the VSM.  Once you do that, you have to request your license using the Host UID of the now-installed VSM.  Cisco then sends you a license key that you install from the command line of the VSM.  This is all somewhat standard and not surprising.  What was surprising was that we would have to do this at all considering we had been licensed at the highest level (Enterprise, superdy-duperty cool or something) for years.
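On the VSM side, that request-and-install dance looks roughly like the following; the hostname, host ID string, and license filename are stand-ins rather than real values:

nexus-vsm# show license host-id
License hostid: VDH=1234567890123456789
nexus-vsm# copy scp://user@fileserver/n1kv_license.lic bootflash:
nexus-vsm# install license bootflash:n1kv_license.lic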

What happened was that the copy vSphere made in order to get each virtual machine back up after our crash changed the Host UID of the VSM virtual machine(s).  Thus, the license keys were no longer valid and all host uplink ports went into a blocked state.  (I’ll save you the obvious gripe I have with the Nexus not offering any kind of command-line message about our licensing being hosed.)  This is where we had to get Cisco Licensing involved, as we had to send them the old license key files and the new Host-UID information so that they could generate new keys.  Considering I was on the phone with them for only 15 minutes, it was as pleasant an experience as I’ve ever had dealing with Cisco’s Licensing department.  At least that’s something.
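To be fair to the gripe, the VSM will at least confess if you ask it directly.  Comparing the current host ID and the license state against the key files you were originally issued is the quickest way to spot the mismatch; both of these are standard NX-OS show commands:

show license host-id
show license usage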

After fixing the licensing, the ports unblocked and I went through the tedium of adding back adapters to the Nexus, moving servers, etc.  At the end of the day, however, it is all back to normal and working.  There are a lot of lessons learned here, and you’ll no doubt pull your own, but the one overriding thing to be on the lookout for is that, under certain circumstances, if your Nexus VSMs are part of a crash and come back up, look to licensing first before troubleshooting anything else.  Oh, and try to schedule your major system crashes for a more convenient time… when you’re sober.  Just saying.