How to resolve HA Agent problems on a VMware cluster

Consider yourself lucky if you've never gotten the VMware HA message "An error occurred during configuration of the HA Agent on the host." If you have, you probably know that the ways to fix the error are extremely limited. Here is a method that worked for me.

Current methods

The current methods of troubleshooting this issue are to check that DNS is working properly, to check that the FT_HOSTS file in /etc/opt/vmware/aam is properly written for the hosts involved in your VMware Cluster, and to disable and re-enable VMware HA within the VMware Cluster.
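
The name-resolution check can be scripted. The following is a minimal sketch, assuming getent is available on the service console; the function name check_resolves is made up for illustration. Note that getent hosts follows the system resolver order, so a name that is only in /etc/hosts also counts as resolving.

```shell
#!/bin/sh
# check_resolves HOST...: report which of the given host names resolve.
# Hypothetical helper, not part of any VMware tool.
# Returns 0 only if every name resolves (or if no names are given).
check_resolves() {
    rc=0
    for host in "$@"; do
        if getent hosts "$host" >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            rc=1
        fi
    done
    return $rc
}
```

Call it with the names of the hosts in the cluster, for example check_resolves esx1 esx2, where esx1 and esx2 stand in for your own host names, and run it on each host so that every host is shown to resolve every other.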

New method

The new method assumes that the VMware HA configuration is somehow at fault. I began to think this was the case when I noticed that the /opt/vmware/aam/ha/VMap process was not terminating on a reset of VMware HA. This process, as seen from the output of ps ax issued from the service console command line interface, should not exist when VMware HA is disabled. However, in my configuration it did exist. I also noticed I had problems reestablishing VMware HA after a recent reboot of a server caused by a faulty UPS. DNS was working, FT_HOSTS looked correct, and disabling and re-enabling VMware HA did no good.

Here are the steps that I followed to fix it:

  1. Log in to the service console of each problem host and make sure that VMware HA is stopped using: service vmware-aam stop
  2. Ensure there are no VMware HA processes running by using: ps ax | grep aam | grep -v grep
  3. If processes exist, kill them using the Process ID returned by the previous command (first column) as the PID: kill -9 PID
  4. Issue the following command via the service console, including the parentheses: (cd /etc/opt/vmware/aam; mkdir .old; mv * .old; mv .[a-z]* .old)
  5. Using the Virtual Infrastructure Client click on the Host, then the Summary tab, and then Reconfigure for VMware HA.
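
Steps 1 to 4 can be collected into a small script. This is only a sketch under stated assumptions, not a tested tool: the function names are invented for illustration, and step 4 is written as a loop instead of the two mv commands, so .old is never moved into itself and hidden files that do not start with a lowercase letter are archived too. The script does not kill anything; if HA processes remain it lists them and stops, so step 3 stays manual.

```shell
#!/bin/sh
# Sketch of steps 1 to 4; run as root on the service console.
# Function names are illustrative, not part of any VMware tool.

# Step 2: reduce `ps ax` output (read from stdin) to VMware HA (aam) processes.
filter_aam() {
    grep aam | grep -v grep
}

# Step 4: move every entry in directory $1, hidden or not, into $1/.old.
archive_config() {
    (
        cd "$1" || exit 1
        mkdir -p .old || exit 1
        for f in * .[!.]*; do
            # Skip unmatched glob patterns and the archive directory itself.
            if [ "$f" = ".old" ] || { [ ! -e "$f" ] && [ ! -L "$f" ]; }; then
                continue
            fi
            mv "$f" .old/ || exit 1
        done
    )
}

# Steps 1, 2 and 4 in order. If aam processes are still running, print them
# and stop so they can be killed by hand (step 3) before rerunning.
reset_ha_config() {
    dir=${1:-/etc/opt/vmware/aam}
    service vmware-aam stop
    if ps ax | filter_aam; then
        echo "aam processes still running; kill them (kill -9 PID) and rerun" >&2
        return 1
    fi
    archive_config "$dir"
}
```

After sourcing the file, run reset_ha_config on each problem host, then carry out step 5 in the Virtual Infrastructure Client as before.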

Voilà, VMware HA restarts and works properly! This solution may be seen as overkill, as it forces VMware HA to recreate all of its configuration files. I may have been able to just remove the .vmware_fdport file and then Reconfigure for VMware HA, but I did not try that option. I bring up this possibility because the .vmware_fdport file is NOT present on my now-running VMware HA-enabled hosts.
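
For anyone who wants to try the lighter option first, here is a sketch of it. It is untested as a fix, for the reason given above. The function name and the .bak suffix are made up for illustration, and the file is renamed, not deleted, so it can be put back.

```shell
#!/bin/sh
# Untested alternative: set aside only .vmware_fdport instead of the whole
# configuration directory. Name and .bak suffix are illustrative.
set_aside_fdport() {
    dir=${1:-/etc/opt/vmware/aam}
    if [ -e "$dir/.vmware_fdport" ]; then
        mv "$dir/.vmware_fdport" "$dir/.vmware_fdport.bak"
    else
        echo "no .vmware_fdport in $dir" >&2
        return 1
    fi
}
```

As with the full procedure, VMware HA should be stopped first, and Reconfigure for VMware HA run from the client afterwards.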

Now I have what looks to be a foolproof way to get VMware HA to start back up and protect my investment.