By Art Reisman
CTO – APconnections
As a general rule, when a network router sees more packets than it can send or receive on a link, it will drop the extra packets. Intuitively, when your router is dropping packets, one would assume that the perceived slow down, per user, would be just a gradual shift slower.
What happens in reality is far worse…
1) Distant users get spiraling slower responses.
Martin Roth, a colleague of ours who founded one of the top performance analysis companies in the world, provided this explanation:
“Any device which is dropping packets “favors” streams with the shortest round trip time, because (according to the TCP protocol) the time after which a lost packet is recovered is depending on the round trip time. So when a company in Copenhagen/Denmark has a line to Australia and a line to Germany on the same internet router, and this router is discarding packets because of bandwidth limits/policing, the stream to Australia is getting much bigger “holes” per lost packet (up to 3 seconds) than the stream to Germany or another office in Copenhagen. This effect then increases when the TCP window size to Australia is reduced (because of the retransmissions), so there are fewer bytes per round trip and more holes between to round trips.”
In the screen shot above (courtesy of avenida.dk), the Bandwidth limit is 10 Mbit (= 1 Mbyte/s net traffic), so everything on top of that will get discarded. The problem is not the discards, this is standard TCP behaviour, but the connections that are forcefully closed because of the discards. After the peak in closed connections, there is a “dip” in bandwidth utilization, because we cut too many connections.
2) Once you hit a congestion point, where your router is forced to drop packets, overall congestion actually gets worse before it gets better.
When applications don’t get a response due to a dropped packet, instead of backing off and waiting, they tend to start sending re-tries, and this is why you may have noticed prolonged periods (3o seconds or more) of no service on a congested network. We call this the rolling brown out. Think of this situation as sort of a doubling down on bandwidth at the moment of congestion. Instead of easing into a full network and lightly bumping your head, all the devices demanding bandwidth ramp up their requests at precisely the moment when your network is congested, resulting in an explosion of packet dropping until everybody finally gives up.
How do you remedy outages caused by Congestion?
We have written extensively about solutions to prevent bottlenecks. Here is a quick summary with links:
1) The most obvious being to increase the size of your link.
2) Enforce rate limits per user.
3) Wse something more sophisticated like a Netequalizer, a device that is designed to specifically counter the effects of congestion.
From Martin Roth of Avenida.dk
“With NetEqualizer we may get the same number of discards, but we get fewer connections closed, because we “kick” the few connections with the high bandwidth, so we do not get the “dip” in bandwidth utilization.
The graphs (above) were recorded using 1 second intervals, so here you can see the bandwidth is reached. In a standard SolarWinds graph with 10 minute averages the bandwidth utilization would be under 20% and the customer would not know they are hitting the limit.”
The excerpt below was a message from a reseller who had been struggling with congestion issues at a hotel, he tried basic rate limits on his router first. Rate Limits will buy you some time , but on an oversold network you can still hit the congestion point, and for this you need a smarter device.
“…NetEq delivered a 500% gain in available bandwidth by eliminating rate caps, possible through a mix of connection limits and Equalization. Both are necessary. The hotel went from 750 Kbit max per accesspoint (entire hotel lobby fights over 750Kbit; divided between who knows how many users) to 7Mbit or more available bandwidth for single users with heavy needs.