I’ve just spent the afternoon trying to get a simple TCP connection established.
The story is that I was configuring Nagios to check some HTTP services that are behind a Cisco PIX firewall. Nagios kept on getting timeout errors – when I tried by hand with telnet, I was getting the same problem, but only from the machine running Nagios. After setting up some logging on the firewall, I could see lines like:
%PIX-6-106015: Deny TCP (no connection) from a.b.c.d/xxxx to e.f.g.h/yyyy flags SYN on interface outside
According to the Cisco documentation, this means that a packet with the flags shown (in this case SYN) has arrived that doesn’t match an existing connection. But hold on – the SYN flag should be set in the first packet of a TCP connection, so by definition there is no connection yet.
After lots of packet sniffing with tcpdump, and examining the dumps with the marvelous Ethereal, I spotted that the SYN packets logged on the machine that couldn’t connect also had the ECE (ECN-Echo) and CWR (Congestion Window Reduced) bits set, whereas packets from a working machine had only SYN set.
After that, the mystery was quickly solved – it turns out that a significant number of TCP stacks out there either drop or send RST for SYN packets that have the ECN bits set. The version of the software on my PIX must be one of them.
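For anyone staring at a raw dump, here is a minimal Python sketch of how the TCP flags byte breaks down. The bit values are the standard TCP header ones; the function names are my own invention for illustration, and nothing here comes from the PIX or from tcpdump itself.

```python
# Standard TCP flag bits, lowest bit first (low byte of the flags field).
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
    (0x40, "ECE"),  # ECN-Echo
    (0x80, "CWR"),  # Congestion Window Reduced
]


def decode_tcp_flags(flags_byte):
    """Return the names of the flags set in a TCP flags byte."""
    return [name for mask, name in TCP_FLAGS if flags_byte & mask]


def is_ecn_setup_syn(flags_byte):
    """True for an initial SYN that also asks for ECN (SYN+ECE+CWR, no ACK)."""
    # Mask covers CWR, ECE, ACK and SYN; only ACK must be clear.
    return (flags_byte & 0xD2) == 0xC2


if __name__ == "__main__":
    for flags in (0x02, 0xC2):
        print(hex(flags), decode_tcp_flags(flags), is_ecn_setup_syn(flags))
```

A plain SYN is 0x02, while the SYN from a machine with ECN switched on is 0xC2, which is the kind of packet my firewall was refusing.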
Although the ideal fix would be to update the software on the PIX, it’s half-a-planet away, and I’m loath to mess with it too much. The quick fix was to add the following line to /etc/sysctl.conf on my Debian machine that was having problems:
net.ipv4.tcp_ecn = 0
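To check what a Linux machine is currently doing, the live value can be read from /proc. This is only a rough sketch: the helper names are hypothetical, and the meanings listed for 1 and 2 are taken from the current Linux kernel sysctl documentation rather than from anything I tested here (older kernels may only know 0 and 1).

```python
from pathlib import Path

# Live value of the net.ipv4.tcp_ecn sysctl on Linux.
TCP_ECN_PATH = Path("/proc/sys/net/ipv4/tcp_ecn")

# Meanings as described in the current kernel documentation (an assumption
# for older kernels, which may not support the value 2).
MEANINGS = {
    0: "disabled: ECN is neither requested nor accepted",
    1: "enabled: ECN is requested on outgoing connections too",
    2: "passive: ECN is accepted on incoming connections only",
}


def describe_tcp_ecn(raw):
    """Turn the raw file contents (e.g. '0\\n') into a readable description."""
    value = int(raw.strip())
    return MEANINGS.get(value, f"unknown value {value}")


def current_tcp_ecn(path=TCP_ECN_PATH):
    """Describe the current setting, or return None if it cannot be read."""
    try:
        return describe_tcp_ecn(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_tcp_ecn())
```

The line in /etc/sysctl.conf is applied at boot; running `sysctl -p` as root loads it straight away.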
Now everything’s fine, but it makes me wonder what other weird network behaviour can be explained by this.
(See www.icir.org/floyd/ecnProblems.html for more details on the problem.)