The intent of this article to is to help set appropriate expectations of using a caching server on an uncontrolled Internet link. There are some great speed gains to be had with a caching server; however, caching alone will not remedy a heavily congested Internet connection.
Are you going down the path of using a caching server (such as Squid) to decrease peak usage load on a congested Internet link?
You might be surprised to learn that Internet link congestion cannot be mitigated with a caching server alone. Contention can only be eliminated by:
1) Increasing bandwidth
2) Some form of bandwidth control
3) Or a combination of 1) and 2)
A common assumption about caching is that somehow you will be able to cache a large portion of common web content – such that a significant amount of your user traffic will not traverse your backbone to your provider. Unfortunately, caching a large portion of web content to attain a significant hit ratio is not practical, and here is why:
Lets say your Internet trunk delivers 100 megabits and is heavily saturated prior to implementing caching or a bandwidth control solution. What happens when you add a caching server to the mix?
From our experience, a good hit rate to cache will likely not exceed 10 percent. Yes, we have heard claims of 50 percent, but have not seen this in practice. We assume this is an urban myth or just a special case.
Why is the hit rate at best only 10 percent?
Because the Internet is huge relative to a cache, and you can only cache a tiny fraction of total Internet content. Even Google, with billions invested in data storage, does not come close. You can attempt to keep trending popular content in the cache, but the majority of access requests to the Internet will tend to be somewhat random and impossible to anticipate. Yes, a good number of hits might hit the Yahoo home page and read the popular articles, but many users more are going to do unique things. For example, common hits like email and Facebook are all very different for each user, and cannot be maintained in the cache. User hobbies are also all different, and thus they traverse different web pages and watch different videos. The point is you can’t anticipate this data and keep it in a local cache any more reliably than guessing the weather long term. You can get a small statistical advantage, and that accounts for the 10 percent that you get right.
Note: Without a statistical advantage your hit rate would be effectively be 0.
Even with caching at a 10 percent hit rate, your link traffic will not decline.
With caching in place, any gain in efficiency will be countered by a corresponding increase in total usage. Why is this?
If you assume a 10 percent hit rate to cache, you will end up getting a 10 percent increase in Internet usage and thus, if your pipe to the Internet was near congestion when you put the caching solution in, it will still be congested. Yes, the hits to cache will be fast and amazing, but the 90 percent of the hits that do not come from the cache will equal 100 percent of your Internet link. The resulting effect will be that 90 percent of your Internet accesses will be sluggish due to the congested link.
Another way to understand is by practical example.
Let’s start with a very congested 100 megabit Internet link. Web hits are slow, YouTube takes forever, email responses are slow, and Skype calls break up. To solve these issues, you put in a caching server.
Now 10 percent of your hits come from cache, but since you did nothing to mitigate overall bandwidth usage, your users will simply eat up the extra 10 percent from cache and then some. It is like giving a drug addict a free hit of their preferred drug. If you serve up a fast YouTube, it will just encourage more YouTube usage.
Even with a good caching solution in place, if somebody tries to access Grandma’s Facebook page, it will have to come over the congested link, and it may time out and not load right away. Or, if somebody makes a Skype call it will still be slow. In other words, the 90 percent of the hits not in cache are still slow even though some video and some pages play fast, so the question is:
If 10 percent of your traffic is really fast, and 90 percent is doggedly slow, did your caching solution help?
The answer is yes, of course it helped, 10 percent of users are getting nice, uninterrupted YouTube. It just may not seem that way when the complaints keep rolling in. :)