Internet Explorer (specifically version 8, and possibly other versions as well) has an annoying bug that affects HTTP 1.1 persistent connections, particularly with chatty AJAX web applications.
I discovered the bug while troubleshooting a problem for a client. The problem presented itself as a web server or application error: users would sporadically see the application “freeze” for exactly five minutes, then see the results they expected. The problem was unrelated to load (it happened with only one user) and occurred in all areas of the application (i.e. there was no pattern in what the user was doing when it occurred). The logs were suspiciously devoid of any error or other indication of a problem at the time of the occurrence. Requests from other users’ browsers during this period worked correctly.
The absence of errors in the logs was particularly perplexing. The problem occurred in an environment using Apache 2.2 (worker MPM) on HP-UX with mod_jk talking to a tcServer instance. Both the web server and the app server were being load balanced by a Cisco ACE. Apache was running mod_jk, mod_log_forensic, and mod_dumpio, all set to log literally everything. The tcServer was set to full debug logging. The Cisco ACE reported that the web and app server load balancing groups were fully available. And yet, every few hours of testing, the application would “freeze.” No log entries on the web server showed any request taking more than 500ms to complete, much less the five minutes observed by the browser.
Packet captures recorded from the web server finally revealed what was happening. The web server was configured with HTTP 1.1 enabled and a five-second keepalive timeout. Internet Explorer would open an HTTP 1.1 persistent connection, and a series of AJAX requests could be seen between the browser and the web server. Eventually, there would be a five-second break in activity. Apache, behaving appropriately, would send a FIN, ACK to the browser, attempting to close the connection. The browser would respond with an ACK. At this point, the web server has the socket in FIN_WAIT_2 (it passes through FIN_WAIT_1 between sending its FIN and receiving the ACK), waiting for Internet Explorer to complete closing the connection by sending its own FIN to the web server.
Except that never happened. In all the cases where the “freeze” occurred, Internet Explorer would send another AJAX request across the socket within a few hundred milliseconds of sending the ACK response to the web server’s FIN, ACK. Because the web server had the socket in FIN_WAIT_2, it technically received the traffic from the browser but did nothing with it, because the client had already acknowledged the connection close request.
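The server side of that exchange can be sketched as a tiny state machine. This is a simplified model of the standard TCP active-close path only (it leaves out simultaneous close, timers, and resets), and the event names and the `replay` helper are illustrative, not taken from any real TCP stack:

```python
from enum import Enum


class State(Enum):
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN_WAIT_1"
    FIN_WAIT_2 = "FIN_WAIT_2"
    TIME_WAIT = "TIME_WAIT"


# (current state, event) -> next state, for the side that closes first.
# Event names are made up for this sketch.
TRANSITIONS = {
    (State.ESTABLISHED, "send_fin"): State.FIN_WAIT_1,
    (State.FIN_WAIT_1, "recv_ack_of_fin"): State.FIN_WAIT_2,
    (State.FIN_WAIT_2, "recv_fin"): State.TIME_WAIT,
}


def step(state, event):
    """Return the next state; events with no matching transition leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def replay(events, state=State.ESTABLISHED):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


# The sequence from the capture: server FIN, browser ACK, then more data
# from the browser instead of the browser's own FIN.
final_state = replay(["send_fin", "recv_ack_of_fin", "recv_data"])
print(final_state.name)  # FIN_WAIT_2
```

In this model the late request changes nothing: the connection only leaves FIN_WAIT_2 when the peer's FIN arrives, which is the step Internet Explorer never took.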
Internet Explorer waits five minutes for an AJAX response before timing out and will retry the request once. The freezing of the application was caused by the browser waiting the full five minutes, timing out, then retrying the request. This second request, five minutes after the first one, would succeed because it was sent across a new socket connection. From the user’s perspective, it appeared as though the application froze for five minutes. In reality, the application server never got the first request, because Internet Explorer misbehaved by sending packets across a TCP connection after it had itself acknowledged the close request.
Solution: with this particular application, disabling HTTP 1.1 on the web server was acceptable. I suspect that increasing the keepalive timeout (perhaps to 60 seconds or more) would significantly reduce the probability of IE misbehaving, but I haven’t tested that configuration.
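One rough way to reason about the untested timeout idea is to look at the idle gaps between one user's requests and count how many end just after the keepalive timeout expires, since those are the requests that could race the server's FIN. The sketch below is only that, a sketch: the function name, the 500ms window, and the sample timestamps are all assumptions for illustration, and it measures idle time from the previous request rather than from the end of the previous response.

```python
def risky_gaps(request_times_ms, keepalive_timeout_ms, window_ms=500):
    """Count idle gaps that end within window_ms after the keepalive timeout.

    request_times_ms: sorted request timestamps (milliseconds) for one client.
    The window is a guess at how long the race with the server's FIN lasts.
    """
    count = 0
    for earlier, later in zip(request_times_ms, request_times_ms[1:]):
        gap = later - earlier
        if keepalive_timeout_ms <= gap <= keepalive_timeout_ms + window_ms:
            count += 1
    return count


# Made-up timestamps, not data from the original incident.
sample = [0, 1200, 6400, 7000, 12300, 40000, 100200]

print(risky_gaps(sample, keepalive_timeout_ms=5000))   # 2
print(risky_gaps(sample, keepalive_timeout_ms=60000))  # 1
```

Run against real access-log timestamps, a count like this would at least show whether a longer timeout moves the risky window away from the application's usual polling rhythm. It does not prove the race goes away.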
A more important takeaway from troubleshooting this problem: packet captures are the number one tool for understanding what is happening on the network. Packet captures never lie. If you have two or more networked components that are not working correctly and you’re not sure where the problem is, don’t guess: packet captures always reveal the truth. Browsers, thick clients, application logs (anything above layer 5 of the network stack) will inevitably deceive you about what is happening under the covers. A working understanding of TCP and a packet capture and analysis tool take the guesswork out of finding the source of so many types of problems in distributed environments.
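As an illustration of the kind of check a capture makes possible, here is a minimal sketch that scans one connection's packets for the pattern described above: the client sending payload after it has acknowledged the server's FIN. The `Packet` record and its fields are hypothetical, standing in for whatever your capture tool exports, the sample packets are invented, and sequence-number wraparound is ignored.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    time: float         # seconds since the start of the capture
    sender: str         # "client" or "server"
    flags: frozenset    # TCP flags, e.g. frozenset({"FIN", "ACK"})
    seq: int
    ack: int
    payload_len: int


def data_after_acked_fin(packets: List[Packet]) -> List[Packet]:
    """Return client packets that carry data after the client ACKed the server's FIN."""
    fin_ack_expected = None  # ack number that acknowledges the server's FIN
    fin_acked = False
    offenders = []
    for p in packets:
        if p.sender == "server" and "FIN" in p.flags:
            # A FIN consumes one sequence number after any payload it carries.
            fin_ack_expected = p.seq + p.payload_len + 1
        elif p.sender == "client" and fin_ack_expected is not None:
            if "ACK" in p.flags and p.ack >= fin_ack_expected:
                fin_acked = True
            if fin_acked and p.payload_len > 0:
                offenders.append(p)
    return offenders


ACK = frozenset({"ACK"})
PSH_ACK = frozenset({"PSH", "ACK"})
FIN_ACK = frozenset({"FIN", "ACK"})

# Invented packets shaped like the exchange described in this post.
capture = [
    Packet(0.000, "client", PSH_ACK, seq=1, ack=1, payload_len=300),    # AJAX request
    Packet(0.050, "server", PSH_ACK, seq=1, ack=301, payload_len=500),  # response
    Packet(0.051, "client", ACK, seq=301, ack=501, payload_len=0),
    Packet(5.050, "server", FIN_ACK, seq=501, ack=301, payload_len=0),  # keepalive expired
    Packet(5.051, "client", ACK, seq=301, ack=502, payload_len=0),      # FIN acknowledged
    Packet(5.300, "client", PSH_ACK, seq=301, ack=502, payload_len=280) # request on dying socket
]

offenders = data_after_acked_fin(capture)
for p in offenders:
    print(f"t={p.time:.3f}s: client sent {p.payload_len} bytes after ACKing the server's FIN")
```

The application logs in this story could never have shown that last packet, because the request it carried never reached Apache's request handling.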