A Novel Idea on How to Cache Data Completely Transparently


By Art Reisman

Recently I got a call from a customer claiming our Squid proxy was not retrieving videos from cache when expected.

This prompted me to set up a test in our lab where I watched  four videos over and over. With each iteration, I noticed that the proxy would  sometimes go out and fetch a new copy of a video, even though the video was already in the local cache, thus confirming the customer’s observation.

Why does this happen?

I have not delved into the specific Squid code yet, but I think it has to do with the dynamic redirection performed by YouTube in the cloud and the way the Squid proxy interprets the URL. If you look closely at YouTube URLs, there is a CGI component in the name: the word “watch” followed by a question mark “?”. The URLs are not static. Even though I may be watching the same YouTube video on successive tries, the cloud is serving the actual video from a different place each time, and so the Squid proxy thinks it is new.

Since caching stale copies of data is a big no-no, my Squid proxy, when in doubt, errs on the side of caution and fetches a new copy.

The other hassle with using a caching proxy server is the complexity of setting up port redirection (special routing rules). By definition, the proxy must fake out the client making the request for the video. Getting this redirection to work requires some intimate network knowledge and good troubleshooting skills.

My solution for the above issues is to just toss the traditional Squid proxy altogether and invent something easier to use.

Note: I have run the following idea by the naysayers (all of my friends who think I am nuts), and yes, there are still some holes in it. I’ll present their points after I make my case.

My caching idea

To get my thought process started, I tossed all that traditional tomfoolery with re-direction and URL name caching out the window.

My caching idea is to cache streams of data without regard to URL or filename.  Basically, this would require a device to save off streams of characters as they happen.  I am already very familiar with implementing this technology; we do it with our CALEA probe.  We have already built technology that can capture raw streams of data, store, and then index them, so this does not need to be solved.

Figuring out if a subsequent stream matched a stored stream would be a bit more difficult but not impossible.

The benefits of this stream-based caching scheme as I see them:

1) No routing or redirection needed; the device could be plugged into any network link by any weekend warrior.

2) No URL confusion.  Even if a stream (video) was kicked off from a different URL, the proxy device would recognize the character stream coming across the wire to be the same as a stored stream in the cache, and then switch over to the cached stream when appropriate, thus saving the time and energy of fetching the rest of the data from across the Internet.

The pure beauty of this solution is that just about any consumer could plug it in without any networking or routing knowledge.

How this could be built

Some rough details on how this would be implemented…

The proxy would cache the most recent 10,000 streams.

1) A stream would be defined as continuous data transferred in one direction from one IP and port to another IP and port.

2) The stream would terminate and be stored when the port changed.

3) The server would compare the beginning of each new stream to streams already in cache, perhaps the first several thousand characters. If there was a match, it would fake out the sender and receiver, step into the middle, and continue sending the data.
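The three steps above can be sketched in code. This is a hypothetical toy, not a shipping product: the 10,000-stream limit comes from the text, while the 4 KB prefix length and the class/method names are illustrative assumptions.

```python
# Toy sketch of prefix-based stream caching as described above.
# PREFIX_LEN is an assumed fingerprint size; the article only says
# "perhaps the first several thousand characters".

PREFIX_LEN = 4096          # bytes of each stream used as a fingerprint
MAX_STREAMS = 10_000       # cache the most recent 10,000 streams

class StreamCache:
    def __init__(self):
        self._streams = {}  # prefix bytes -> full stream bytes
        self._order = []    # insertion order, for evicting the oldest

    def store(self, data: bytes) -> None:
        """Store a completed stream, evicting the oldest if full."""
        key = data[:PREFIX_LEN]
        if key not in self._streams:
            if len(self._order) >= MAX_STREAMS:
                oldest = self._order.pop(0)
                del self._streams[oldest]
            self._order.append(key)
        self._streams[key] = data

    def match(self, prefix: bytes):
        """Return a cached stream whose opening bytes match, else None."""
        if len(prefix) < PREFIX_LEN:
            return None     # not enough data seen yet to decide
        return self._streams.get(prefix[:PREFIX_LEN])
```

Once `match()` returns a stored stream, the device would take over serving the remaining bytes from cache, which is exactly where the two flaws discussed below come into play.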

What could go wrong

Now for the major flaws in this technology that must be overcome.

1) Since there is no title on the stream from the sender, there would always be the chance that a match was a coincidence. For example, an advertisement prepended to multiple YouTube videos might fool the caching server: the initial sequence of bytes would match the advertisement and not the video that follows.

2) Since we would be interrupting a client-server transaction mid-stream, the server would have to be cut off in the middle of the stream when the proxy took over. That might get ugly as the server tries to keep sending. Faking an ACK back to the sending server would not be viable either, as the server would continue to send data, which is exactly what the cache is trying to prevent.

The next step (after I fix our traditional URL-matching problem for the customer) is to build an experimental version of stream-based caching.

Stay tuned to see if I can get this idea to work!

YouTube Caching Results: detailed analysis from live systems


Since the release of YouTube caching support on our NetEqualizer bandwidth controller, we have been able to review several live systems in the field. Below we go over the basic hit rate of YouTube videos and explain in detail how this affects the user experience. The analysis below is based on an actual snapshot from a mid-sized state university with a 64-gigabyte cache and approximately 2,000 students in residence.

The Squid Proxy server provides a wide range of statistics. You can easily spend hours examining them and become exhausted with MSOS, an acronym for “meaningless stat overload syndrome”.  To save you some time we are going to look at just one stat from one report.  From the Squid Statistics Tab on the NetEqualizer, we selected the Cache Client List option. This report shows individual Cache stats for all clients on your network. At the very bottom is a summary report totaling all squid stats and hits for all clients.

TOTALS

  • ICP : 0 Queries, 0 Hits (0%)
  • HTTP: 21990877 Requests, 3812 Hits (0%)

At first glance it appears as if the ratio of actual cache hits (3,812) to HTTP requests (21,990,877) is extremely low. As with all statistics, the obvious conclusion can be misleading. First off, the NetEqualizer cache is deliberately tuned NOT to cache HTTP objects smaller than 2 megabytes. This is done for a couple of reasons:

1) Generally, there is no advantage to caching small Web pages, as they normally load up quickly on systems with NetEqualizer fairness in place. They already have priority.

2) With a few exceptions for popular web sites, small web hits are widely varied and fill up the cache, taking away space that we would like to use for our target content: YouTube videos.

Breaking down the amount of data in a typical web site versus a YouTube hit

It is true that web sites today can often exceed a megabyte. However, rarely does a 2-megabyte web site load as a single hit; it comprises many sub-links, each of which generates a web hit in the summary statistics. A simple HTTP page typically triggers about 10 HTTP requests for perhaps 100K bytes of data total. A more complex page may generate 500K. For example, when you go to the CNN home page there are quite a few small links, and each link increments the HTTP counter. On the other hand, a YouTube hit generates one hit for about 20 megabytes of data. When we look at actual data cached instead of total Web hits, the ratio of cached to not cached is quite different.

Our cache setup is also designed to cache only objects from 2 megabytes to 40 megabytes, with an estimated average of 20 megabytes. When we look at actual data cached (instead of hits), this gives us about 400 gigabytes of regular HTTP data, of which about 76 gigabytes came from the cache. Conservatively, about 10 percent of all HTTP data came from cache by this rough estimate. This number is much more significant than the raw HTTP hit statistics reveal.
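The back-of-the-envelope arithmetic above can be reproduced directly from the report figures. The 20 MB average object size is the article's own estimate; the raw byte-weighted ratio works out to roughly 19 percent, which the text rounds conservatively downward.

```python
# Hit-ratio estimate weighted by data volume rather than by request
# count, using the figures quoted in the article.

http_requests = 21_990_877   # total HTTP requests in the summary report
cache_hits = 3_812           # requests served from cache
avg_object_mb = 20           # estimated average size of a cached video

# By hit count, the ratio looks negligible (far below 1%):
hit_ratio_by_count = cache_hits / http_requests

# Weighted by bytes: 3,812 hits * ~20 MB each is ~76 GB served from
# cache, against roughly 400 GB of comparable HTTP data overall.
cached_gb = cache_hits * avg_object_mb / 1000
total_cacheable_gb = 400
hit_ratio_by_data = cached_gb / total_cacheable_gb   # roughly 0.19
```

The point of the sketch is that the per-request summary hides the fact that each cached object is thousands of times larger than a typical small web hit.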

Even more telling is the effect these hits have on the user experience.

YouTube streaming data, although not the majority of data on this customer system, is very time-sensitive while at the same time being very bandwidth-intensive. The subtle boost made possible by caching 10 percent of the data on this system has a discernible effect on the user experience. Think about it: if 10 percent of your experience on the Web is video, and you were resigned to it timing out and bogging down, you will notice the difference when those YouTube videos play through to completion, even if only half of them come from cache.

For a more detailed technical overview of NetEqualizer YouTube caching (NCO) click here.

Setting Up a Squid Proxy Caching Co-Resident with a Bandwidth Controller


Editor’s Note: It was a long road to get here (building the NetEqualizer Caching Option (NCO), a new feature offered on the NE3000 & NE4000), and for those following in our footsteps, or just curious about the intricacies of YouTube caching, we have laid open the details.

This evening, I’m burning the midnight oil. I’m monitoring Internet link statistics at a state university with several thousand students hammering away on their residential network. Our bandwidth controller, along with our new NetEqualizer Caching Option (NCO), which integrates Squid for caching, has been running continuously for several days and all is stable. From the stats I can see, about 1,000 YouTube videos have been played out of the local cache over the past several hours. Without the caching feature installed, most of the YouTube videos would have played anyway, but there would be interruptions as the Internet link coughed and choked with congestion. Now, with NCO running smoothly, the most popular videos will run without interruptions.

Getting the NetEqualizer Caching Option to this stable product was a long and winding road.  Here’s how we got there.

First, some background information on the initial problem.

To use a Squid proxy server, your network administrator must put hooks in your router so that all Web requests go to the Squid proxy server before heading out to the Internet. Sometimes the Squid proxy server will have a local copy of the requested page, but most of the time it won’t. When a local copy is not present, it sends your request on to the Internet to get the page (for example, the Yahoo! home page) on your behalf. The Squid server will then store a local copy of the page in its cache (storage area) while simultaneously sending the results back to you, the original requesting user. If you make a subsequent request for the same page, Squid will quickly check to see whether the content has changed since it was stored the first time, and if it can, it will send you the local copy. If it detects that the local copy is no longer valid (the content has changed), then it will go back out to the Internet and get a new copy.
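The fetch/validate/serve flow described above can be condensed into a short sketch. This is a minimal illustration of the general caching-proxy logic, not Squid's actual code; the store, revalidation check, and fetch functions are all placeholders.

```python
# Minimal sketch of the proxy decision flow described above.
# fetch_from_origin() and origin_modified_since() are placeholders,
# not Squid APIs.

import time

cache = {}  # url -> (stored_at timestamp, content)

def fetch_from_origin(url: str) -> str:
    """Placeholder for an upstream HTTP fetch."""
    return f"<content of {url}>"

def origin_modified_since(url: str, stored_at: float) -> bool:
    """Placeholder revalidation (think If-Modified-Since)."""
    return False  # assume content is still fresh for this sketch

def handle_request(url: str) -> str:
    entry = cache.get(url)
    if entry is not None:
        stored_at, content = entry
        if not origin_modified_since(url, stored_at):
            return content                 # fast path: serve from cache
    # Miss or stale: fetch, keep a local copy, relay to the client.
    content = fetch_from_origin(url)
    cache[url] = (time.time(), content)
    return content
```

The YouTube problem discussed throughout this article lives entirely in the `cache.get(url)` line: if the URL changes on every play, the lookup never hits.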

Now, if you add a bandwidth controller to the mix, things get interesting quickly. In the case of the NetEqualizer, it decides when to invoke fairness based on the congestion level of the Internet trunk. However, with the bandwidth controller unit (BCU) on the private side of the Squid server, the actual Internet traffic cannot be distinguished from local cache traffic. The setup looks like this:

Internet->router->Squid->bandwidth controller->users

The BCU in this example won’t know what is coming from cache and what is coming from the Internet. Why? Because the data coming from the Squid cache comes over the same path as the new Internet data. The BCU will erroneously think all the traffic is coming from the Internet and will shape cached traffic as well as Internet traffic, thus defeating the higher speeds provided by the cache.

In this situation, the obvious solution would be to switch the position of the BCU to a setup like this:

Internet->router->bandwidth controller->Squid->users

This configuration would be fine except that now all the port 80 HTTP traffic (cached or not) will appear like it is coming from the Squid proxy server and your BCU will not be able to do things like put rate limits on individual users.

Fortunately, with our NetEqualizer 5.0 release, we’ve created an integration between NetEqualizer and co-resident Squid (our NetEqualizer Caching Option) such that everything works correctly. (The NetEqualizer still sees and acts on all traffic as if it were between the user and the Internet. This required some creative routing and actual bug fixes to the bridging and routing in the Linux kernel. We also had to develop a communication module between the NetEqualizer and the Squid server so the NetEqualizer gets advance notice when data is originating in cache rather than the Internet.)

Which do you need, Bandwidth Control or Caching?

At this point, you may be wondering: if Squid caching is so great, why not just dump the BCU and be done with the complexity of trying to run both? Well, while the Squid server alone will do a fine job of accelerating access times for large files such as video when they can be fetched from cache, a common misconception is that a caching server brings big relief to your Internet pipe. This has not been the case in our real-world installations.

The fallacy of caching as a panacea for all things congested assumes that demand and overall usage are static, which is unrealistic. The cache is of finite size, and users will generally start watching more YouTube videos when they see improvements in speed and quality (prior to caching, they might have given up because of slowness), including videos that are not in cache. So the Squid server will have to fetch new content all the time, using additional bandwidth and quickly negating any improvement. Therefore, if you had a congested Internet pipe before caching, you will likely still have one afterward, leading to slow access for e-mail, Web chat, and other non-cacheable content. The solution is to run a bandwidth controller in conjunction with your caching server, which is what NetEqualizer 5.0 now offers.

In no particular order, here is a list of other useful information — some generic to YouTube caching and some just basic notes from our engineering effort. This documents the various stumbling blocks we had to overcome.

1. There was the issue of just getting a standard Squid server to cache YouTube files.

It seemed that the URL tags on these files change with each access, like a counter, and a normal Squid server is fooled into believing the files have changed. By default, when a file changes, a caching server goes out and gets a new copy. In the case of YouTube files, the content is almost always static; the caching server only thinks it is different because it sees the changing file names. Without modifications, the default Squid caching server will re-retrieve the YouTube file from the source, not the cache, because the file names change. (Read more on caching YouTube with Squid…).
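The standard fix in Squid 2.7 is a store-URL rewrite helper: an external program that maps each volatile CDN URL onto a stable internal key, so repeated plays of the same video hit the same cache object. The sketch below is in the spirit of the third-party YouTube helper scripts; the exact URL pattern, the `id=` parameter, and the internal hostname are illustrative assumptions, not YouTube's actual current scheme.

```python
# Sketch of a Squid 2.7 storeurl_rewrite-style helper. Squid runs the
# helper as a long-lived process, feeding one request per line on stdin
# and reading the rewritten store URL back on stdout. The URL pattern
# below is an assumption for illustration.

import re

VIDEO_RE = re.compile(
    r"^https?://[^/]+/videoplayback\?.*?\bid=([0-9a-f]+)")

def store_url(url: str) -> str:
    """Map a volatile video URL onto a stable internal cache key."""
    m = VIDEO_RE.match(url)
    if m:
        # Same video id -> same store URL, regardless of which CDN
        # host or query-string counter the request arrived with.
        return f"http://video-cache.internal/id={m.group(1)}"
    return url  # everything else is stored under its original URL
```

With a helper like this wired in via Squid's configuration, the changing hostnames and counters no longer defeat the cache lookup.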

2. We had to move to a newer Linux kernel to get a recent version of Squid (2.7), which supports the hooks for YouTube caching.

A side effect was that the new kernel destabilized some of the timing mechanisms we use to implement bandwidth control. These subtle bugs were not easily reproduced with our standard load generation tools, so we had to create a new simulation lab capable of simulating thousands of users accessing the Internet and YouTube at the same time. Once we built this lab, we were able to re-create the timing issues in the kernel and have them patched.

3. It was necessary to set up a firewall re-direct (also on the NetEqualizer) for port 80 traffic back to the Squid server.

This configuration, and the implementation of an extra bridge, were required to get everything working. The details of the routing within the NetEqualizer were customized so that we would be able to see the correct IP addresses of Internet sources and users when shaping. (As mentioned above, if you do not take care of this, all traffic will appear as if it is coming from the proxy server.)

4. The firewall has a table called conntrack (not to be confused with NetEqualizer connection tracking, though similar).

The connection tracking table on the firewall tends to fill up and crash the firewall, denying new requests for redirection, if you are not careful. If you simply make the connection table enormous, that can also cause your system to lock up. So you must measure and size this table based on experimentation. This was another reason to build our simulation lab.
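As a starting point before that experimentation, a rough sizing calculation helps bracket the table size. All the figures below (connections per user, headroom, bytes per entry) are illustrative assumptions; real values should come from measurement on your own hardware.

```python
# Back-of-the-envelope sizing for a conntrack-style table. The
# per-user connection count, headroom factor, and per-entry memory
# cost are assumptions for illustration only.

def conntrack_table_size(concurrent_users: int,
                         conns_per_user: int = 50,
                         headroom: float = 2.0) -> int:
    """Estimate a table-size setting with safety headroom."""
    return int(concurrent_users * conns_per_user * headroom)

def table_memory_mb(entries: int, bytes_per_entry: int = 350) -> float:
    """Approximate kernel memory consumed by the table."""
    return entries * bytes_per_entry / (1024 * 1024)

# e.g. 2,000 resident students -> a 200,000-entry table, ~67 MB of
# kernel memory at an assumed 350 bytes per entry.
size = conntrack_table_size(2000)
mem = table_memory_mb(size)
```

The memory estimate is the reason "randomly enormous" backfires: an oversized table pins kernel memory the rest of the system needs.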

5. There was also the issue of the Squid server using all available Linux file descriptors.

Linux comes with a default limit for security reasons, and when the Squid server hit this limit (it does all kinds of file reading and writing, keeping many descriptors open), it locked up.
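The per-process descriptor limit Squid runs into here can be inspected and raised programmatically. A minimal sketch, assuming a Linux system: raising the soft limit up to the existing hard limit needs no privileges, while raising the hard limit itself requires root (or an ulimit/init-system change). The target of 65536 is an illustrative choice, not a Squid requirement.

```python
# Inspect and raise the per-process open-file-descriptor limit, the
# Linux setting behind the Squid lock-up described above. Only the
# soft limit is raised here, and only up to the current hard limit.

import resource

def raise_fd_limit(target: int = 65536) -> int:
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return max(new_soft, soft)
```

Squid must of course also be configured to take advantage of the higher limit (later Squid versions expose this in their configuration).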

Tuning changes that we made to support Caching with Squid

a. Limit the file size of a cached object to between 2 megabytes (2MB) and 40 megabytes (40MB):

  • minimum_object_size 2000000 bytes
  • maximum_object_size 40000000 bytes

If you allow smaller cached objects, they will rapidly fill up your cache, and there is little benefit to caching small pages.

b. We turned off the Squid keep-reading flag:

  • quick_abort_min 0 KB
  • quick_abort_max 0 KB

When set, this flag tells Squid to continue reading a file even if the user leaves the page; for example, if a user watching a video aborts in their browser, the Squid cache continues to read the file. I suppose this could now be turned back on, but during testing it was quite obnoxious to see data transfers taking place to the Squid cache when you thought nothing was going on.

c. We also explicitly told the Squid what DNS servers to use in its configuration file. There was some evidence that without this the Squid server may bog down, but we never confirmed it. However, no harm is done by setting these parameters.

  • dns_nameservers   x.x.x.x

d. You have to be very careful to set the cache size so it does not exceed your actual capacity. Squid is not smart enough to check your real capacity, so it will fill up your file system if you let it, which in turn causes a crash. When testing with small RAM disks (less than four gigabytes of cache), we found that the Squid logs will also fill up your disk space and cause a lock-up. The logs are rotated once a day on a busy system. With a large number of pages being accessed, the log can easily approach one gigabyte, and then, to add insult to injury, the log backup program makes a backup. On a normal-sized caching system there should be ample space for logs.

e. Squid has a short-term buffer not related to caching. It is just a buffer where it stores data from the Internet before sending it to the client. Remember, all port 80 (HTTP) requests go through Squid, cached or not, and throttling the transfer between Squid and the user does not mean Squid immediately slows the transfer coming from the Internet. With the BCU in line, we want the sender on the Internet to back off right away when we decide to throttle a transfer; with a large Squid buffer sitting between the NetEqualizer and the sending host, the sender would not respond to our deliberate throttling right away (Link to Squid caching parameter).

f. How do you determine the effectiveness of your YouTube caching?

I use the Squid client cache statistics page. Down at the bottom there is an entry that lists hits versus requests.

TOTALS

  • ICP : 0 Queries, 0 Hits (0%)
  • HTTP: 21990877 Requests, 3812 Hits (0%)

At first glance, it may appear that the hit rate is not all that effective, but let’s look at these stats another way. A simple HTTP page generates about 10 HTTP requests for perhaps 80K bytes of data total. A more complex page may generate 500K. For example, when you go to the CNN home page there are quite a few small links, and each link increments the HTTP counter. On the other hand, a YouTube hit generates one hit for about 20 megabytes of data. So the summary of HTTP hits and requests above does not account for total data. Since our cache only stores objects from 2 megabytes to 40 megabytes, with an estimated average of 20 megabytes, a little math based on bytes cached gives us about 400 gigabytes of regular HTTP data, of which about 76 gigabytes came from the cache. About 20 percent of all HTTP data came from cache by this rough estimate, which is quite significant.

NetEqualizer YouTube Caching a Win for Net Neutrality


Over the past few years, much of the controversy over net neutrality has ultimately stemmed from the longstanding rift between carriers and content providers. Commercial content providers such as NetFlix have entire business models that rely on relatively unrestricted bandwidth access for their customers, which has led to an enormous increase in the amount of bandwidth being used. In response to these extreme bandwidth loads and associated costs, ISPs have tried all types of schemes to limit and restrict total usage, including deep packet inspection (DPI), layer-7 shaping, and preferential treatment based on content or fees.

While in many cases effective, most of these efforts have been mired in controversy with respect to net neutrality. However, caching is the one exception.

Up to this point, caching has proven to be the magic bullet that can benefit both ISPs and consumers (faster access to videos, etc.) while respecting net neutrality. To illustrate this, we’ll run caching through the gauntlet of questions that have been raised about these other solutions in regard to a violation of net neutrality. In the end, it comes up clean.

1. Does caching involve deep introspection of user traffic without their knowledge (like layer-7 shaping and DPI)?

No.

2. Does Caching perform any form of preferential treatment based on content?

No.

3. Does caching perform any form of preferential treatment based on fees?

No.

Yet, despite avoiding these pitfalls, caching has still proven to be extremely effective, allowing Internet providers to manage increasing customer demands without infringing upon customers’ rights or quality of service. It was these factors that led APconnections to develop our most recent NetEqualizer feature, YouTube caching.

For more on this feature, or caching in general, check out our new NetEqualizer YouTube Caching FAQ post.

Enhance Your Internet Service With YouTube Caching


Have you ever wondered why certain videos on YouTube seem to run more smoothly than others? Over the years, I’ve consistently noticed that some videos on my home connection will run without interruption while others are as slow as molasses. Upon further consideration, I determined a simple common denominator for the videos that play without interruption — they’re popular. In other words, they’re trending. And, the opposite is usually true for the slower videos.

To ensure better performance, my Internet provider keeps a local copy of the popular YouTube content (caching), and when I watch a trending video, they send me the stream from their local cache. However, if I request a video that’s not contained in their current cache, I’m sent over the broader Internet to the actual YouTube content servers. When this occurs, my video streams are located off the provider’s local network and my pipe can be restricted. Therefore, the most likely cause for the slower video stream is traffic congestion at peak hours.

Considering this, caching video is usually a win-win for the ISP and Internet consumer. Here’s why…

Benefits of Caching Video for the ISP

Last-mile connections from the point of presence to the customer are usually not overloaded, especially on a wired or fiber network such as a cable operator. Caching video allows a provider to keep traffic on their last mile and hence doesn’t clog the provider’s exchange point with the broader Internet. Adding bandwidth to the exchange point is expensive, but caching video will allow you to provide a higher class of service without the large recurring costs.

Benefits of ISP-Level Caching for the Internet Consumer

Put simply, the benefit is an overall better video-viewing experience. Most consumers couldn’t care less about the technical details behind the quality of their Internet service. What matters is the quality itself. In this competitive market, with rising expectations for video service, the ISP needs every advantage it can get.

Why Target YouTube for Caching?

YouTube video is very bandwidth intensive and relatively stable content. By stable, we mean once posted, the video content does not get changed or edited. This makes it a prime candidate for effective caching.

Should an ISP Cache All Of The Data It Can?

While caching everything is the default setting for most Squid caching servers, we recommend caching only the popular free video sites such as YouTube. This involves some selective filtering, but caching everything generically can cause problems, with some secure sites not functioning correctly.

Note: With Squid Proxy you’ll need a third party module to cache YouTube.

How Will Caching Work with My NetEqualizer or Other Bandwidth Control Device?

You’ll need to put your caching server in transparent mode and run it on the private side of your NetEqualizer.

NetEqualizer Placement with caching server

Related Article: Fourteen Tips to Make Your WISP More Profitable
