Unable to Fail Over from one TMG node to another when using NLB in a Virtual Environment
Introduction
This post covers a scenario in which a TMG administrator was trying to simulate a failover before putting the environment into production. The TMG nodes were installed in a third-party virtual environment. TMG was using integrated NLB in unicast mode, and the external TMG adapter was connected to a layer 2 switch. To simulate the failover, the administrator disabled the NICs on one of the nodes. The unexpected result was that the other node suddenly stopped accepting connections from external users.
Troubleshooting
When dealing with NLB, always remember to start with the basics. Here are some good references on that:
- Binding order - http://support.microsoft.com/kb/894564
- VLAN Tagging Issues - http://support.microsoft.com/kb/2286940
- Connectivity Issues - http://technet.microsoft.com/en-us/library/cc783135(WS.10).aspx
- General Windows NLB Troubleshooting - http://download.microsoft.com/download/3/2/3/32386822-8fc5-4cf1-b81d-4ee136cca2c5/NLB_Troubleshooting_Guide.htm
For this particular scenario, data was collected on the server whose NIC was not disabled, and the traces showed something very interesting.
Up to frame number 1072 in the trace, traffic was normal. After that, a huge volume of RARP traffic appears, and this RARP traffic is a little odd: the source and target MAC addresses are the same, and the target IP address is 0.0.0.0.
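To make the pattern easier to spot in a large capture, the check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used in the original investigation: it assumes raw Ethernet II frames without a VLAN tag, takes "source and target MAC" to mean the sender and target hardware address fields inside the RARP payload, and the function names `is_odd_rarp` and `build_rarp_frame` are hypothetical helpers introduced here.

```python
import struct

ETHERTYPE_RARP = 0x8035      # EtherType assigned to RARP
ETH_HEADER_LEN = 14          # destination MAC + source MAC + EtherType
RARP_PAYLOAD_LEN = 28        # payload size for Ethernet/IPv4 addresses


def is_odd_rarp(frame: bytes) -> bool:
    """Return True for a RARP frame whose sender and target hardware
    addresses are identical and whose target IP address is 0.0.0.0.

    Assumption: 'frame' is an untagged Ethernet II frame.
    """
    if len(frame) < ETH_HEADER_LEN + RARP_PAYLOAD_LEN:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_RARP:
        return False
    # RARP shares the ARP layout: htype, ptype, hlen, plen, oper,
    # sender MAC, sender IP, target MAC, target IP.
    payload = frame[ETH_HEADER_LEN:ETH_HEADER_LEN + RARP_PAYLOAD_LEN]
    _htype, _ptype, hlen, plen, _oper, sha, _spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", payload
    )
    if hlen != 6 or plen != 4:
        return False
    return sha == tha and tpa == b"\x00\x00\x00\x00"


def build_rarp_frame(sender_mac: bytes, target_mac: bytes, target_ip: bytes) -> bytes:
    """Hypothetical helper that builds a sample RARP request for testing."""
    eth = b"\xff" * 6 + sender_mac + struct.pack("!H", ETHERTYPE_RARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), oper=3 (RARP request)
    rarp = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4, 3,
        sender_mac, b"\x00" * 4, target_mac, target_ip,
    )
    return eth + rarp


if __name__ == "__main__":
    mac = bytes.fromhex("02bf0a000001")  # made-up address for illustration
    print(is_odd_rarp(build_rarp_frame(mac, mac, b"\x00" * 4)))
```

Frames exported from a capture as raw bytes could be passed through `is_odd_rarp` to count how many match and from which frame number the burst begins.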