
[Solved] VMware HA Fails after Upgrade to ESXi 4.1

Last Updated: 8/11/10

If your HA won't enable properly on your VMware vSphere ESXi 4.1 host after a recent upgrade, then you are not alone. I am having the same problem.

More Info:  (skip to the bottom if you just want the solution!)

Error Message: Configuration Issues - HA agent on [esx hostname] in cluster [cluster name] has an error : Error while running health check script.

The logs at /var/log/messages from the console show this additional info:

:Invoke]Command to invoke is /opt/vmware/aam/bin/aamPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname=[esx hostname] -uname=VMkernel -cmd=monitornodes -domain=vmware

error 'App'] [VpxaVMAP::Invoke] Command /opt/vmware/aam/bin/aamPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname=[esx hostname] -uname=VMkernel -cmd=monitornodes -domain=vmware failed with error 3

Failure location:  08/11/10 22:08:26 [myexit]   function main::myexit called from line 360
Aug 11 22:08:27  08/11/10 22:08:26 [myexit] VMwareresult=failure 
Total time for script to complete:  0 minute(s) and 0 second(s)


Note: I removed some text from the logs to make it easier to read.
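To pull just the failing health check lines out of the log, a small filter like the one below may help. This is only a sketch: the function name find_ha_errors is made up for illustration, and it assumes the log lives at /var/log/messages as described above and that grep is available in the host's shell.

```shell
# Hypothetical helper: show log lines where the HA health check script failed.
# The log path defaults to /var/log/messages; pass another path as the first argument.
find_ha_errors() {
    log="${1:-/var/log/messages}"
    # Keep only lines that mention the health check script AND report a failure.
    grep 'aam_config_util.pl' "$log" | grep 'failed with error'
}

# Example call on the host (uses the default log path):
#   find_ha_errors
```

If this prints nothing, the health check is probably not the script that is failing, and this article may not apply to your host.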

Hopefully, I will have a solution after contacting VMware support later this week.


UPDATE:

I figured out a little more. From the logs above you can see that the HA health check script, which runs every 30 seconds, is located at /opt/vmware/aam/ha/aam_config_util.pl. I compared that script on two servers, one with HA working and one with HA not working. On the working ESX host the script has a date stamp of 4/13/2010. On the ESX host with broken HA the script has a date stamp of 4/22/2009. All the other files in the folder /opt/vmware/aam/ha are also about a year behind those on the working ESX host. It appears that the 4.1 upgrade did not install all the correct HA files.
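To make the same comparison on your own hosts, you can list the HA folder with modification dates on each host and compare the output. A minimal sketch, assuming the folder is /opt/vmware/aam/ha as in the logs above; the function name list_ha_files is only for illustration.

```shell
# Hypothetical helper: list the HA agent files with their modification dates.
# The directory defaults to /opt/vmware/aam/ha; pass another path as the first argument.
list_ha_files() {
    dir="${1:-/opt/vmware/aam/ha}"
    ls -l "$dir"
}

# Example call on each host (uses the default directory):
#   list_ha_files
```

Run it on a host where HA works and on the host where it fails. If the dates on the broken host are about a year older, you are likely seeing the same problem.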




The next interesting thing is that I found a script called VMware-aam-ha-uninstall.sh in /opt/vmware/aam. It makes me wonder: if I execute the uninstall script and reconfigure HA, will my problems go away? Am I brave enough to find out?


UPDATED AGAIN: (Solution)

Yep, it does fix the problem. Turn off HA for the cluster, run the uninstall script, and restart services. Here are the commands to fix the issue.

From the ESXi 4.1 SSH console (you can enable SSH from Configuration > Security Profile > Properties):

/opt/vmware/aam/VMware-aam-ha-uninstall.sh
services.sh stop
services.sh start

Re-enable HA for the cluster and click "Reconfigure for VMware HA" on the host if it does not reconfigure automatically.
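If you would rather not run the uninstall script blindly, the step can be wrapped in a small check that the script actually exists first. This is only a sketch under assumptions: the function name run_ha_uninstall is made up, and it assumes the uninstall script is a plain shell script that can be run with sh.

```shell
# Hypothetical wrapper: run the HA uninstall script only if it is present.
# The path defaults to the location found above; pass another path as the first argument.
run_ha_uninstall() {
    script="${1:-/opt/vmware/aam/VMware-aam-ha-uninstall.sh}"
    if [ ! -f "$script" ]; then
        echo "uninstall script not found: $script" >&2
        return 1
    fi
    # Assumption: the uninstall script is a shell script runnable with sh.
    sh "$script"
}

# Example on the host, with HA already turned off for the cluster:
#   run_ha_uninstall && services.sh stop && services.sh start
```

The check only guards against a missing file. You still need to turn off HA for the cluster before running it and re-enable HA afterwards.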

I figured this out by applying the solution from VMware KB article 1007234, which addresses a different HA issue with the same fix. If you want more detail, read that article.




Keywords: esxi, esx, 4.0 to 4.1, upgrade, ha error, cluster error, error while running health check script