Wednesday, August 25, 2010

Windows Server 2003 loses default gateway with multiple NICs

I have a server with two network interfaces, each connected to a different network. Both interfaces are assigned addresses and gateways via DHCP. One is a fast gigE connection to a private test/NAS LAN, and the other is a connection to the intranet and to the world.

The connection to the world is through a crappy router/firewall that has difficulty handling full bandwidth transfers without quivering to a temporary state of stasis. Over the last few months, and in particular over the last week, I had noticed that the server machine was losing its default gateway connection to the world. So replies to all incoming connections were being routed back through the lower-priority (higher-metric) gateway on the test LAN -- and going to a black hole. Odd. Users annoyed, too.

It turns out Windows Server 2003 has a feature called "dead gateway detection" -- if the highest-priority (lowest-metric) gateway requires more than X retries on a send, Windows flags it as bad, drops it from the routing table, and moves on to the next default gateway. Here are some Microsoft links on the setting:

Interestingly, Windows essentially supports "fail-over" to secondary gateways, but does not appear to "fail-back" to the original gateway later once its connection is restored. So Windows has no way to recover the lost connections once the condition has triggered the fail-over. Since MS themselves recommend disabling this feature as part of "hardening" the TCP/IP stack (to prevent a DoS vector), I went ahead and did this. Tweak the following registry settings:
  1. Open the key HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
  2. Add/set the DWORD value "EnableDeadGWDetect" to 0
  3. Reboot the system to apply the change.
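
For anyone who would rather script the registry change than click through regedit, here is a minimal sketch in Python that only builds the equivalent `reg add` command line. The helper name `build_reg_add_command` is my own invention for illustration; the key path and value name are the ones from the steps above. It prints the command instead of running it, on the assumption that you will paste it into an elevated prompt on the server yourself and then reboot.

```python
from typing import List

# Key and value taken from the steps above; everything else is illustrative.
KEY_PATH = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "EnableDeadGWDetect"


def build_reg_add_command(enabled: bool = False) -> List[str]:
    """Return the reg.exe argument list that sets EnableDeadGWDetect.

    enabled=False writes 0 (dead gateway detection off), True writes 1.
    """
    return [
        "reg", "add", KEY_PATH,
        "/v", VALUE_NAME,          # value name
        "/t", "REG_DWORD",         # value type
        "/d", "1" if enabled else "0",
        "/f",                      # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    # Only print the command; running it is left to the administrator.
    print(" ".join(build_reg_add_command()))
```

The change still needs the reboot from step 3 before it takes effect.
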
An alternative is simply to have only one default gateway on your TCP/IP stack -- which is the approach many network gurus advise. With only one gateway to work with, "dead gateway detection" does not activate. I did this as well, and set up numerous persistent direct routes for those connections that needed to go through the test/NAS LAN as opposed to the rest of the world.
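
To give an idea of what such a persistent route looks like, here is a rough sketch in Python that assembles a `route -p add` command for one destination network. The function name `build_persistent_route` and the addresses in the example are hypothetical placeholders, not my actual network layout; substitute your own subnet and the gateway on the test/NAS LAN side.

```python
import ipaddress
from typing import List


def build_persistent_route(network: str, gateway: str, metric: int = 1) -> List[str]:
    """Return the route.exe argument list for a persistent static route.

    network is in CIDR form (e.g. "192.168.50.0/24"); gateway is the
    next hop reachable through the NIC the traffic should leave on.
    """
    net = ipaddress.ip_network(network)   # raises ValueError if host bits are set
    gw = ipaddress.ip_address(gateway)    # raises ValueError on a malformed address
    return [
        "route", "-p", "add",             # -p keeps the route across reboots
        str(net.network_address),
        "mask", str(net.netmask),
        str(gw),
        "metric", str(metric),
    ]


if __name__ == "__main__":
    # Hypothetical example: reach 192.168.50.0/24 via a gateway at 192.168.10.1.
    print(" ".join(build_persistent_route("192.168.50.0/24", "192.168.10.1", metric=5)))
```

Run once per destination, this leaves the intranet-facing NIC as the only one with a default gateway while the listed networks still go out through the test LAN.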

So far with these changes, things are going well.