TCP problems


I see Glenn Fiedler constantly struggling with people who repeatedly leave replies claiming he knows nothing and should always be using TCP over UDP.

Yet anyone developing a realtime application will laugh uncontrollably at such claims.

Skype runs on UDP, many multiplayer FPS games run on UDP. VoIP runs on UDP.

There are tons of resources dedicated to the strengths of TCP, so I'm going to focus on its weaknesses instead.

 

TCP is built for reliability, not performance

This is the key point about TCP: performance was never part of its design. TCP's main design goals are:

  • Reliability: Packets are guaranteed to arrive.
  • In-order: Packets arrive in the same order they were sent.
  • Relative security: TCP sessions are hard (but not impossible) to hijack. A long time ago TCP's initial sequence numbers were not randomized, which made session hijacking very easy. But that was a long time ago.
  • Congestion control: Packet loss is assumed to be caused by the other system not having enough resources to process all incoming packets. When packet loss gets high, TCP "plays nice" with the other end by reducing the number of packets being sent, or even temporarily halting communication (in simple terms, giving the other end "some time to breathe"). This assumption is archaic in 2014.

The thing is, when we make games (or other real-time applications), we often don't care about any of this (except for security, when authentication against a central server comes into play; but the level of security needed there goes far beyond mere session-hijacking protection).

As Glenn shows in Deterministic Lockstep, there are tricks to build an acceptably reliable, in-order "protocol" on top of UDP without much more overhead than a TCP packet header.

RakNet even has its own "reliability layer" written on top of UDP to mimic TCP's reliability and in-order properties, but with performance in mind.
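To give a sense of how little overhead such a layer needs, here is a minimal sketch of the kind of header and bookkeeping involved. It is my own illustration (the names and the 0.1s timeout are invented), not RakNet's or Glenn's actual code; note the 8-byte header versus TCP's minimum of 20 bytes:

```cpp
#include <cstdint>

// Hypothetical 8-byte header for a reliability layer on top of UDP.
// Field names and sizes are illustrative; this is NOT RakNet's actual format.
struct ReliableHeader
{
    uint16_t sequence; // incremented with every datagram we send
    uint16_t ack;      // newest sequence number received from the other end
    uint32_t ackBits;  // bitfield acking the 32 sequences preceding 'ack'
};

// Compares sequence numbers while handling 16-bit wrap-around,
// so the counter can run forever.
bool sequenceGreaterThan( uint16_t s1, uint16_t s2 )
{
    return ( ( s1 > s2 ) && ( s1 - s2 <= 32768 ) ) ||
           ( ( s1 < s2 ) && ( s2 - s1 >  32768 ) );
}

// Sender side: anything not acked within a timeout is simply resent.
bool shouldResend( double timeSentSec, double nowSec, double timeoutSec = 0.1 )
{
    return ( nowSec - timeSentSec ) > timeoutSec;
}
```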

 

Traffic congestion in the 21st century

A slightly less common but highly annoying argument from the people claiming Glenn doesn't know anything is that an aggressive UDP protocol doesn't play nice with the rest of the Internet.

Admittedly, if we are overly aggressive and don't even bother to throttle when we have plenty of evidence that the other end is drowning in data, our UDP-based protocol will misbehave at large scale. You would be surprised how quickly a modern computer can be brought to its knees with not that many packets per second.
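To be clear, playing nice doesn't require TCP; a UDP sender can throttle itself with something as simple as a token bucket. A rough sketch, with made-up names and numbers:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical token-bucket throttle for a UDP sender, so an "aggressive"
// protocol still backs off when there is clear evidence the other end
// can't keep up. All values here are invented for illustration.
class SendThrottle
{
public:
    explicit SendThrottle( double bytesPerSecond )
        : mRate( bytesPerSecond ), mBucket( bytesPerSecond ) {}

    // Call once per tick with the elapsed time in seconds.
    void update( double dt )
    {
        // Refill the bucket, capped at one second's worth of data.
        mBucket = std::min( mBucket + mRate * dt, mRate );
    }

    // Returns true if we're allowed to send 'bytes' right now.
    bool trySend( std::size_t bytes )
    {
        if( mBucket < static_cast<double>( bytes ) )
            return false;
        mBucket -= static_cast<double>( bytes );
        return true;
    }

    // Halve our send rate when the receiver reports it is drowning in data.
    void onReceiverOverflow() { mRate = std::max( mRate * 0.5, 8.0 * 1024.0 ); }

private:
    double mRate;   // bytes per second we allow ourselves to send
    double mBucket; // bytes currently available in the bucket
};
```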

However, their argument blindly assumes that packet loss happens only because the other end lacks the resources to process the packets. Personally, when I see that assumption, I take it as an indication that the person has no field experience, that their knowledge comes from books written in the 80s and 90s (and sadly, the 2000s), or that they are lucky enough to work in an Ethernet-only environment.

There is a reason my mom keeps complaining that the tablet I gave her is "shit". The places she usually roams around have poor WiFi signal. With 1 bar, we're lucky if Google loads at all. With 2 bars, Google manages to open, but God have mercy on your soul if you attempt to browse anything else. I'm sure you've been there.

But surprisingly enough, when I developed a remote-desktop streaming application on top of UDP (our custom protocol is aggressive; it's meant to be used over a LAN rather than the Internet, although it works fine over the Internet too), I could watch the video being streamed in real time… with 1 bar. ONE BAR.

I can barely open Google with 1 bar, yet somehow I could watch a 720p stream I developed on my own in real time (with hiccups and artifacts*). At two bars there was barely a hiccup.

So, what is happening?

 

*If I turned on my crappy reliability layer (which I'm not proud of) written on top of UDP, the hiccups were more frequent and the stalls longer, but there were no artifacts at one bar. At two bars I just experienced a few hiccups and some stalls.

 

On wireless-induced packet loss, TCP does exactly the opposite

Packet loss can occur because of noise and interference, not just lack of processing resources. On a cable-only network, noise is very rare unless a cable is faulty and must be replaced. Noise can also appear when a cable is tangled with too many other cables, which could happen in a poorly designed data center; though I doubt a data center engineer or sysadmin would make such a mistake.

However, on any radio-based network (WiFi, 3G, LTE, Bluetooth, etc.) interference is very common. The best solution to this problem is to resend the same data again and again, to improve the odds that at least one copy punches through the noise. The more aggressive, the better.

This is exactly the opposite of what TCP does! TCP interprets the packet loss as a sign that the other device can't process all the packets, and reduces its congestion window. This makes TCP send fewer packets. Because we're sending fewer packets, even fewer useful copies make it through the interference, and the chance of the congestion window shrinking further or being fully reset gets even bigger.

By sending fewer and fewer packets over a WiFi link with a poor signal, the chances of reaching 99-100% packet loss keep getting higher.
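To make that feedback loop concrete, here is a grossly simplified, Reno-style model of the congestion window (real stacks such as CUBIC or Compound TCP differ in the details, but they all shrink the window on loss, no matter why the packet was lost):

```cpp
#include <algorithm>

// Grossly simplified model of a Reno-style congestion window.
// The point: every loss event shrinks the window, regardless of
// whether the loss came from congestion or from radio interference.
struct SimplifiedCongestionWindow
{
    double cwnd     = 10.0; // congestion window, in packets
    double ssthresh = 64.0; // slow-start threshold, in packets

    void onAck()
    {
        if( cwnd < ssthresh )
            cwnd += 1.0;        // slow start: grow fast
        else
            cwnd += 1.0 / cwnd; // congestion avoidance: grow slowly
    }

    void onLossDetected()
    {
        // Loss is assumed to mean congestion: halve the window.
        ssthresh = std::max( cwnd / 2.0, 2.0 );
        cwnd     = ssthresh;
    }

    void onTimeout()
    {
        // Repeated losses on a noisy link end up here: back to square one.
        ssthresh = std::max( cwnd / 2.0, 2.0 );
        cwnd     = 1.0;
    }
};
```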

With this in mind, it's no wonder that when your tablet browser is stuck on a blank loading screen, hitting refresh with your thumb every 6 seconds increases your chances of actually loading the website you wanted. **Shock**

You could try to tune the congestion window, which is a very advanced setting on Windows and requires root privileges on Linux. And even then you would just be tweaking one of the many congestion control algorithms TCP may use, all of which assume that on packet loss, fewer packets should be sent.

 

Resource constraints vs interference

This is far from a solved problem, as we don't have reliable means to determine whether packet loss is induced by interference or by resource constraints (though intuitively, if resending more aggressively after packet loss only produces even greater packet loss, we're resource-bound and should send less; otherwise we were interference-bound and should keep resending more frequently). To make things worse, a packet may travel through both cable and air, and end up both resource- and interference-bound.
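If I were to turn that intuition into code, a probe might look something like the sketch below; the strategy and thresholds are entirely made up for illustration:

```cpp
// Sketch of the heuristic described above: temporarily resend more
// aggressively; if loss gets *worse*, assume we're resource-bound and
// back off, otherwise assume interference and keep the higher resend
// rate. The thresholds are invented for illustration.
enum class LossCause
{
    Unknown,
    ResourceBound,     // the other end (or a hop) can't keep up: send less
    InterferenceBound  // a noisy radio link: keep resending aggressively
};

LossCause probeLossCause( double lossBeforeProbe, double lossDuringProbe )
{
    const double epsilon = 0.02; // 2% tolerance, arbitrary

    if( lossDuringProbe > lossBeforeProbe + epsilon )
        return LossCause::ResourceBound;
    if( lossDuringProbe <= lossBeforeProbe )
        return LossCause::InterferenceBound;
    return LossCause::Unknown;
}
```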

Only the router at each hop knows whether the data will travel to the next hop over cable or over the air, so this would have to be a shared responsibility.

If you think your WiFi router is clever enough to use custom TCP settings tuned for interference-induced packet loss, I have an aggressive UDP-based protocol that proves otherwise. Most manufacturers embed a slightly customized version of whatever TCP code they could legally grab (usually FreeBSD/NetBSD, sometimes the Linux kernel), tweak a few TCP settings after thorough experimentation, and call it a day. Then they blame the inherent unreliability of radio signals. You're getting one or two bars, after all.

That’s something entirely reasonable. Touching TCP code that is already proven to work is a minefield (you don’t want to be blamed of breaking it, do you?); and it’s not really their place to modify it much beyond of what the standard specifications allow.

A WiFi router that aggressively resends packets also increases manufacturing costs: it needs more RAM to buffer packets in case they have to be resent aggressively.

Even if the router aggressively resends the data to the receiver, the sender endpoint's TCP will still receive ACKs at a lower rate and keep an undesirably low congestion window. Ouch!

The TCP protocol doesn't even contemplate a flag that says "sorry bro, this ACK is late, but keep sending at your full rate. I'm on poor WiFi; my router is buffering all your packets to compensate". Which is easier said than done: something that straightforward would be prone to DoS attacks whenever someone lies. With IP spoofing, I could tell a dozen systems to respond to a targeted IP address at their full sending capacity; unless slow start took priority over such hints, which defeats the point.

 

TL;DR (Too long; didn’t read)

Stop glorifying TCP. It isn't glorious. It has many flaws, and it is suitable for what it was designed for: reliable, in-order communication over cable. It has decades of battle-testing that make it very reliable. We could even argue that it isn't suitable even for what it was intended for (i.e. web browsing, chat, downloading files) when running on a WiFi tablet or an LTE phone. It is not suited for real-time applications. Stop recommending TCP everywhere. Choose the right protocol for the right needs. Sometimes those needs can be very broad, and compromises have to be made.

For example, with my remote desktop software, where you can control a desktop application from your tablet (it was made specifically for a client, so I can't share it publicly), I could safely assume most users would run it over a wireless LAN, so I can be aggressive with UDP. Or I could have added a config option (i.e. profiles: Home vs Internet) just in case someone wants to use it over the Internet.
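Such a profile option could be as simple as the sketch below; the names and values are invented, and the option itself is hypothetical:

```cpp
// Hypothetical "Home vs Internet" profile switch for the streaming protocol:
// over a wireless LAN we can afford to be aggressive; over the open Internet
// we throttle and resend less. All values are invented for illustration.
enum class NetworkProfile
{
    HomeLan, // assume interference-induced loss: resend aggressively
    Internet // assume congestion-induced loss: play nice
};

struct ProtocolSettings
{
    double maxSendRateMbps;
    int    redundantResends; // how many extra copies of each datagram
    bool   backOffOnLoss;    // throttle when packet loss climbs
};

ProtocolSettings settingsFor( NetworkProfile profile )
{
    if( profile == NetworkProfile::HomeLan )
        return { 40.0, 2, false };
    return { 5.0, 0, true };
}
```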

But don’t blindly follow the TCP-only stream.

Food for thought: processing power keeps increasing and processing software can be optimized, but radio signals won’t be getting stronger.