Editor’s Note: It was a long road to get here (building the NetEqualizer Caching Option (NCO) a new feature offered on the NE3000 & NE4000), and for those following in our footsteps or just curious on the intricacies of YouTube caching, we have laid open the details.
This evening, I’m burning the midnight oil. I’m monitoring Internet link statistics at a state university with several thousand students hammering away on their residential network. Our bandwidth controller, along with our new NetEqualizer Caching Option (NCO), which integrates Squid for caching, has been running continuously for several days and all is stable. From the stats I can see, about 1,000 YouTube videos have been played out of the local cache over the past several hours. Without the caching feature installed, most of the YouTube videos would have played anyway, but there would be interruptions as the Internet link coughed and choked with congestion. Now, with NCO running smoothly, the most popular videos will run without interruptions.
Getting the NetEqualizer Caching Option to this stable product was a long and winding road. Here’s how we got there.
First, some background information on the initial problem.
To use a Squid proxy server, your network administrator must put hooks in your router so that all Web requests go the Squid proxy server before heading out to the Internet. Sometimes the Squid proxy server will have a local copy of the requested page, but most of the time it won’t. When a local copy is not present, it sends your request on to the Internet to get the page (for example the Yahoo! home page) on your behalf. The squid server will then update a local copy of the page in its cache (storage area) while simultaneously sending the results back to you, the original requesting user. If you make a subsequent request to the same page, the Squid will quickly check it to see if the content has been updated since it stored away the first time, and if it can, it will send you a local copy. If it detects that the local copy is no longer valid (the content has changed), then it will go back out to the Internet and get a new copy.
Now, if you add a bandwidth controller to the mix, things get interesting quickly. In the case of the NetEqualizer, it decides when to invoke fairness based on the congestion level of the Internet trunk. However, with the bandwidth controller unit (BCU) on the private side of the Squid server, the actual Internet traffic cannot be distinguished from local cache traffic. The setup looks like this:
The BCU in this example won’t know what is coming from cache and what is coming from the Internet. Why? Because the data coming from the Squid cache comes over the same path as the new Internet data. The BCU will erroneously think all the traffic is coming from the Internet and will shape cached traffic as well as Internet traffic, thus defeating the higher speeds provided by the cache.
In this situation, the obvious solution would be to switch the position of the BCU to a setup like this:
This configuration would be fine except that now all the port 80 HTTP traffic (cached or not) will appear like it is coming from the Squid proxy server and your BCU will not be able to do things like put rate limits on individual users.
Fortunately, with the our NetEqualizer 5.0 release, we’ve created an integration with NetEqualizer and co-resident Squid (our NetEqualizer Caching Option) such that everything works correctly. (The NetEqualizer still sees and acts on all traffic as if it were between the user and the Internet. This required some creative routing and actual bug fixes to the bridging and routing in the Linux kernel. We also had to develop a communication module between the NetEqualizer and the Squid server so the NetEqualizer gets advance notice when data is originating in cache and not the Internet.)
Which do you need, Bandwidth Control or Caching?
At this point, you may be wondering if Squid caching is so great, why not just dump the BCU and be done with the complexity of trying to run both? Well, while the Squid server alone will do a fine job of accelerating the access times of large files such as video when they can be fetched from cache, a common misconception is that there is a big relief on your Internet pipe with the caching server. This has not been the case in our real world installations.
The fallacy for caching as panacea for all things congested assumes that demand and overall usage is static, which is unrealistic. The cache is of finite size and users will generally start watching more YouTube videos when they see improvements in speed and quality (prior to Squid caching, they might have given up because of slowness), including videos that are not in cache. So, the Squid server will have to fetch new content all the time, using additional bandwidth and quickly negating any improvements. Therefore, if you had a congested Internet pipe before caching, you will likely still have one afterward, leading to slow access for many e-mail, Web chat and other non-cachable content. The solution is to include a bandwidth controller in conjunction with your caching server. This is what NetEqualizer 5.0 now offers.
In no particular order, here is a list of other useful information — some generic to YouTube caching and some just basic notes from our engineering effort. This documents the various stumbling blocks we had to overcome.
It seemed that the URL tags on these files change with each access, like a counter, and a normal Squid server is fooled into believing the files have changed. By default, when a file changes, a caching server goes out and gets the new copy. In the case of YouTube files, the content is almost always static. However, the caching server thinks they are different when it sees the changing file names. Without modifications, the default Squid caching server will re-retrieve the YouTube file from the source and not the cache because the file names change. (Read more on caching YouTube with Squid…).
2. We had to move to a newer Linux kernel to get a recent of version of Squid (2.7) which supports the hooks for YouTube caching.
A side effect was that the new kernel destabilized some of the timing mechanisms we use to implement bandwidth control. These subtle bugs were not easily reproduced with our standard load generation tools, so we had to create a new simulation lab capable of simulating thousands of users accessing the Internet and YouTube at the same time. Once we built this lab, we were able to re-create the timing issues in the kernel and have them patched.
3. It was necessary to set up a firewall re-direct (also on the NetEqualizer) for port 80 traffic back to the Squid server.
This configuration, and the implementation of an extra bridge, were required to get everything working. The details of the routing within the NetEqualizer were customized so that we would be able to see the correct IP addresses of Internet sources and users when shaping. (As mentioned above, if you do not take care of this, all IPs (traffic) will appear as if they are coming from the Proxy server.
4. The firewall has a table called ConnTrack (not be confused with NetEqualizer connection tracking but similar).
The connection tracking table on the firewall tends to fill up and crash the firewall, denying new requests for re-direction if you are not careful. If you just go out and make the connection table randomly enormous that can also cause your system to lock up. So, you must measure and size this table based on experimentation. This was another reason for us to build our simulation lab.
5. There was also the issue of the Squid server using all available Linux file descriptors.
Linux comes with a default limit for security reasons, and when the Squid server hit this limit (it does all kinds of file reading and writing keeping descriptors open), it locks up.
Tuning changes that we made to support Caching with Squid
a. To limit the file size of a cached object of 2 megabytes (2MB) to 40 megabytes (40MB)
- minimum_object_size 2000000 bytes
- maximum_object_size 40000000 bytes
If you allow smaller cached objects it will rapidly fill up your cache and there is little benefit to caching small pages.
b. We turned off the Squid keep reading flag
- quick_abort_min 0 KB
- quick_abort_max 0 KB
This flag when set continues to read a file even if the user leave the page, for example when watching a video if the user aborts on their browser the Squid cache continues to read the file. I suppose this could now be turned back on, but during testing it was quite obnoxious to see data transfers talking place to the squid cache when you thought nothing was going on.
c. We also explicitly told the Squid what DNS servers to use in its configuration file. There was some evidence that without this the Squid server may bog down, but we never confirmed it. However, no harm is done by setting these parameters.
d. You have to be very careful to set the cache size not to exceed your actual capacity. Squid is not smart enough to check your real capacity, so it will fill up your file system space if you let it, which in turn causes a crash. When testing with small RAM disks less than four gigs of cache, we found that the Squid logs will also fill up your disk space and cause a lock up. The logs are refreshed once a day on a busy system. With a large amount of pages being accessed, the log will use close to one (1) gig of data quite easily, and then to add insult to injury, the log back up program makes a back up. On a normal-sized caching system there should be ample space for logs
e. Squid has a short-term buffer not related to caching. It is just a buffer where it stores data from the Internet before sending it to the client. Remember all port 80 (HTTP) requests go through the squid, cached or not, and if you attempt to control the speed of a transfer between Squid and the user, it does not mean that the Squid server slows the rate of the transfer coming from the Internet right away. With the BCU in line, we want the sender on the Internet to back off right away if we decide to throttle the transfer, and with the Squid buffer in between the NetEqualizer and the sending host on the Internet, the sender would not respond to our deliberate throttling right away when the buffer was too large (Link to Squid caching parameter).
f. How to determine the effectiveness of your YouTube caching statistics?
I use the Squid client cache statistics page. Down at the bottom there is a entry that lists hits verses requests.
- ICP : 0 Queries, 0 Hits (0%)
- HTTP: 21990877 Requests, 3812 Hits (0%)
At first glance, it may appear that the hit rate is not all that effective, but let’s look at these stats another way. A simple HTTP page generates about 10 HTTP requests for perhaps 80K bytes of data total. A more complex page may generate 500k. For example, when you go to the CNN home page there are quite a few small links, and each link increments the HTTP counter. On the other hand, a YouTube hit generates one hit for about 20 megabits of data. So, if I do a little math based on bytes cached we get, the summary of HTTP hits and requests above does not account for total data. But, since our cache is only caching Web pages from two megabits to 40 megabits, with an estimated average of 20 megabits, this gives us about 400 gigabytes of regular HTTP and 76 Gigabytes of data that came from the cache. Abut 20 percent of all HTTP data came from cache by this rough estimate, which is a quite significant.
Stuck on Desert Island, Do You Take Your Caching Server or Your Netequalizer ?March 15, 2014 — netequalizer
Caching is a great idea and works well, but I’ll take my NetEqualizer with me if forced to choose between the two on my remote island with a satellite link.
Yes there are a few circumstances where a caching server might have a nice impact. Our most successful deployments are in educational environments where the same video is watched repeatedly as an assignment; but for most wide open installations ,expectations of performance far outweigh reality. Lets have at look at what works and also drill down on expectations that are based on marginal assumptions.
From my personal archive of experience here are some of the expectations attributed to caching that perhaps are a bit too optimistic.
“Most of my users go to their Yahoo or Face Book home page every day when they log in and that is the bulk of all they do”
– I doubt this customer’s user base is that conformist :), and they’ll find out once they install their caching solution. But even if true, only some of the content on Face Book and Yahoo is static. A good portion of these pages are by default dynamic, and ever-changing with content. They are marked as Dynamic in their URLs which means the bulk of the page must be reloaded each time. For example, in order for caching to have an impact , the users in this scenario would have to stick to their home pages , and not look at friend photo’s or other pages.
” We expect to see a 30 percent hit rate when we deploy our cache.”
You won’t see a 30 percent hit rate, unless somebody designs some specific robot army to test your cache, hitting the same pages over and over again. Perhaps, on IOS update day, you might see a bulk of your hits going to the same large file and have a significant performance boost for a day. But overall you will be doing well if you get a 3 or 4 percent hit rate.
” I expect the cache hits to take pressure off my Internet Link”
Assuming you want your average user to experience a fast loading Internet, this is where you really want your NetEqualizer ( or similar intelligent bandwidth controller) over your caching engine. The smart bandwidth controller can re-arrange traffic on the fly insuring Interactive hits get the best response. A caching engine does not have that intelligence.
Let’s suppose you have a 100 megabit link to the Internet ,and you install a cache engine that effectively gets a 6 percent hit rate. That would be exceptional hit rate.
So what is the end user experience with a 6 percent hit rate compared to pre-cache ?
-First off, it is not the hit rate that matters when looking at total bandwidth. Much of those hits will likely be smallish image files from the Yahoo home page or common sites, that account for less than 1 percent of your actual traffic. Most of your traffic is likely dominated by large file downloads and only a portion of those may be coming from cache.
– A 6 percent hit rate means that 94 percent miss rate , and if your Internet was slow from congestion before the caching server it will still be slow 94 percent of the time.
– Putting in a caching server would be like upgrading your bandwidth from 100 megabits to 104 megabits to relieve congestion. That cache hits may add to the total throughput in your reports, but the 100 megabit bottleneck is still there, and to the end user, there is little or no difference in user perception at this point. A portion of your Internet access is still marginal or unusable during peak times, and other than the occasional web page or video loading nice and snappy , users are getting duds most of the time.
Even the largest caching server is insignificant in how much data it can store.
– The Internet is Vast and your Cache is not. Think of a tiny Ant standing on top of Mount Everest. YouTube puts up 100 hours of new content every minute of every day. A small commercial caching server can store about 1/1000 of what YouTube uploads in day, not to mention yesterday and the day before and last year. It’s just not going to be in your cache.
So why is a NetEqualizer bandwidth controller so much more superior than a caching server when changing user perception of speed? Because the NetEqualizer is designed to keep Internet access from crashing , and this is accomplished by reducing the large file transfers and video download footprints during peak times. Yes these videos and downloads may be slow or sporadic, but they weren’t going to work anyway, so why let them crush the interactive traffic ? In the end caching and equalizing are not perfect, but from real world trials the equalizer changes the user experience from slow to fast for all Interactive transactions, caching is hit or miss ( pun intended).