Monday, May 9, 2016

XenServer Host Crashed - What to do?

Good info


It’s 3 o’clock in the morning. Do you know where your XenServer Poolmaster is? Your client calls you, frantic, and you start a GoToMeeting to see what’s wrong. If the Poolmaster is down, any of several issues could be the cause. Maybe a network glitch caused the Citrix XenServer Poolmaster to fence itself from the rest of the farm. The same thing can happen after a power outage or other catastrophic failure. Fencing is the normal defense mechanism built into XenServer, and in the consulting world we see this scenario often. You can’t simply reboot the Poolmaster to bring it back online, and restarting the toolstack will do you no good. There is a specific process to follow, so let’s walk through it.

If you’ve tried to connect to the pool from the XenCenter console and it failed, your Poolmaster may be down. Verify this by dropping to a command prompt on one of the pool hosts and issuing a command like “xe host-list” to see if you get a coherent response. If you get an error message like “Cannot perform operation as the host is running in emergency mode”, then your Poolmaster is almost certainly down.
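If you would rather script that check, here is a minimal sketch. The function name check_pool is my own, not part of the xe CLI, and the sketch assumes xe exits with a non-zero status when the command fails.

```shell
#!/bin/sh
# check_pool is a hypothetical helper name, not an xe subcommand.
# Assumption: xe returns a non-zero exit status when host-list fails.
check_pool() {
    # Capture stdout and stderr so the error text can be inspected.
    if output=$(xe host-list 2>&1); then
        echo "pool is responding"
        return 0
    fi
    case "$output" in
        *"emergency mode"*)
            echo "host is in emergency mode: Poolmaster is likely down" ;;
        *)
            echo "xe host-list failed: $output" ;;
    esac
    return 1
}
```

Run check_pool from the CLI of any pool member. A return status of 1 together with the emergency mode message is the cue to move on to the recovery steps.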

How do I get the Poolmaster back up?
This is easier said than done. First you’ve got to promote another server in the farm to become the Poolmaster, so that it can take over pool operations. From that server’s CLI, type the command “xe pool-emergency-transition-to-master”, which will transition it to be the new Poolmaster. If the command runs successfully, you can reconnect the other pool servers to the new Poolmaster by issuing the command “xe pool-recover-slaves”. If pool management is working again, you should now be able to run “xe host-list” and get a valid response.
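The promotion sequence above can be sketched as one small wrapper. The name promote_to_master is hypothetical; the three xe commands are the ones quoted in the paragraph. The wrapper stops at the first failure so that you don’t try to recover the other hosts from a server that never became Poolmaster.

```shell
#!/bin/sh
# promote_to_master is a hypothetical wrapper name.
# Run it on the surviving pool member you want to promote.
promote_to_master() {
    # Step 1: make this host the new Poolmaster.
    xe pool-emergency-transition-to-master || return 1
    # Step 2: point the remaining pool members at the new Poolmaster.
    xe pool-recover-slaves || return 1
    # Step 3: confirm pool management answers again.
    xe host-list
}
```

If the last command prints the host list without an error, the pool is back under management.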

Now that the pool is back online, how do I fix the failed Poolmaster?

1). First you have to figure out which server in the environment has failed. To do this, run the command “xe host-list params=uuid,name-label,host-metrics-live”. Any servers that come back with “host-metrics-live = false” have failed. Take note of the UUID of each failed server.
2). Second, you must determine which VMs were running on that failed server. You can do this by running the command “xe vm-list is-control-domain=false resident-on=UUID_of_failed_server”. Once you’ve determined which VMs were running, you need to reset their power state so that they can start on another server. To do this, run the command “xe vm-reset-powerstate resident-on=UUID_of_failed_server --force --multiple”. The VMs in question should now show up as halted in the XenCenter console. Start each of the VMs, and they should boot up on the surviving pool member servers.
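The two steps above can be sketched as a pair of helpers. The function names list_failed_hosts and reset_vms_on_host are my own; the xe commands inside them are the ones quoted in the steps. The host UUID is passed in by hand on purpose, so that a person confirms which server really failed before any power state is reset.

```shell
#!/bin/sh
# list_failed_hosts and reset_vms_on_host are hypothetical helper names.

# Step 1: show every host with its liveness flag.
# Hosts reporting "host-metrics-live = false" have failed.
list_failed_hosts() {
    xe host-list params=uuid,name-label,host-metrics-live
}

# Step 2: list the VMs recorded on the failed host, then mark them halted.
# Only do this once you are sure the host is really down.
reset_vms_on_host() {
    host_uuid=$1
    if [ -z "$host_uuid" ]; then
        echo "usage: reset_vms_on_host HOST_UUID" >&2
        return 2
    fi
    xe vm-list is-control-domain=false resident-on="$host_uuid" || return 1
    xe vm-reset-powerstate resident-on="$host_uuid" --force --multiple
}
```

Run list_failed_hosts first, copy the UUID of the failed server, and pass it to reset_vms_on_host. After that the VMs should appear as halted and can be started again.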
For more information on various issues you can run into during this process, check out the official Citrix whitepaper here:
What are some root causes for my XenServer Poolmaster going down?
The usual suspects include your network. If the Poolmaster loses connectivity to some of the other XenServer hosts in your environment, it can fence itself and go offline as a built-in defense mechanism. Poolmaster fencing is a typical symptom of network problems, so check with the network team before you pass go.