The connection to the world is through a crappy router/firewall that has difficulty handling full-bandwidth transfers without quivering to a temporary state of stasis. Over the last few months, and in particular over the last week, I noticed that the server machine was losing its default gateway connection to the world. As a result, replies to all incoming connections were being routed back through the lower-priority (higher-metric) gateway on the test LAN -- and going to a black hole. Odd. Users annoyed, too.
It turns out Windows Server 2003 has a feature called "dead gateway detection" -- if the highest-priority (lowest-metric) default gateway requires more than X retries on a send, Windows flags it as bad, drops it from the routing table, and moves on to the next default gateway. Here are some Microsoft links on the setting:
Interestingly, Windows essentially supports "fail-over" to secondary gateways, but does not appear to "fail back" to the original gateway later, once its connection is restored. So Windows has no way to recover the lost connections once the condition has triggered the fail-over. Since MS themselves recommend disabling this feature as part of "hardening" the TCP/IP stack (to prevent a DoS vector), I went ahead and did this. Tweak the following registry settings:
- HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
- Add/set the DWORD value "EnableDeadGWDetect" to 0
- Reboot the system to apply the change.
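If you would rather not click through regedit by hand, the same change can be packaged as a .reg file and imported. Here is a minimal sketch in Python that just builds the file text. The function name `build_reg_file` and the output file name are my own inventions for illustration, and I am assuming the usual "Version 5.00" .reg layout written as UTF-16; treat it as a starting point rather than a tested deployment tool.

```python
# Sketch: generate a .reg file that sets EnableDeadGWDetect to 0.
# build_reg_file and the output file name are hypothetical helpers,
# not part of any Microsoft tooling.

KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def build_reg_file(value_name: str = "EnableDeadGWDetect", data: int = 0) -> str:
    """Return the text of a .reg file that sets one DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # DWORD data is written as eight hexadecimal digits.
        f'"{value_name}"=dword:{data:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Assumption: regedit is given a UTF-16 file with a BOM,
    # which is the form its own exports use.
    with open("disable_dead_gw_detect.reg", "w", encoding="utf-16", newline="") as f:
        f.write(build_reg_file())
```

Double-clicking the resulting file (or importing it through regedit) should apply the value; the reboot is still needed afterwards.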
So far with these changes, things are going well.
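For anyone curious why the fail-over never healed itself, here is a toy model of the behaviour I observed. It is emphatically not Windows' actual algorithm: the class names, the `max_retries` threshold, and the metric values are all made up for illustration. It only shows the shape of the problem, namely that a gateway flagged as dead stays dead even after it starts working again.

```python
# Toy model of fail-over without fail-back. All names and numbers here
# are hypothetical; this is not how the Windows TCP/IP stack is written.
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    metric: int            # lower metric = preferred gateway
    failed_sends: int = 0
    dead: bool = False


class ToyRouter:
    def __init__(self, gateways, max_retries=3):
        # Keep gateways ordered by preference (lowest metric first).
        self.gateways = sorted(gateways, key=lambda g: g.metric)
        self.max_retries = max_retries

    def active(self):
        """Return the most preferred gateway not flagged as dead."""
        for g in self.gateways:
            if not g.dead:
                return g
        return None

    def report_send(self, ok):
        """Record the outcome of one send through the active gateway."""
        g = self.active()
        if g is None:
            return
        if ok:
            g.failed_sends = 0
        else:
            g.failed_sends += 1
            if g.failed_sends > self.max_retries:
                g.dead = True  # nothing ever clears this flag: no fail-back


router = ToyRouter([Gateway("wan", 10), Gateway("test-lan", 20)])
for _ in range(4):                 # the flaky router stalls for a while
    router.report_send(ok=False)
print(router.active().name)        # now using the test LAN gateway
```

In this sketch, once "wan" is flagged, later successful sends go through "test-lan" and never revisit "wan", which matches the black-hole symptom described above.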