Stick a Fork in Third Party Caching (Squid Proxy)


I was going through our blog archives recently and noticed that many of the caching articles we promoted circa 2011 are still getting hits. Most of that traffic comes from less developed countries, where bandwidth is expensive relative to the Western world. I hope that businesses and ISPs hoping for a caching miracle will find this article, because it applies to all third-party caching engines, not just the one we used to offer as an add-on to the NetEqualizer.

So why do I make such a bold statement about third-party caching becoming obsolete?

#1) There have been recent changes in the way Google provides YouTube content which make caching it almost impossible. All of their YouTube videos are generated dynamically and broken up into segments to allow differential, custom advertising. (I yearn for the days without the ads!)

#2) Almost all pages and files on the Internet are marked “do not cache” in their HTTP headers. Some of them would cache effectively anyway, but you must assume the designer plans on making dynamic, on-the-fly changes to their content. Caching an obsolete page and delivering it to an end user could result in serious issues, and perhaps even a lawsuit, if you cause some form of economic harm by ignoring the “do not cache” directive.
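
For example, a typical response from a dynamic site carries headers like these (an illustrative response; exact values vary from site to site):

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Cache-Control: no-store, no-cache, must-revalidate
    Pragma: no-cache
    Expires: 0

The Cache-Control line is the modern directive; Pragma and Expires are older equivalents kept around for backward compatibility. A well-behaved proxy honors all of them and simply refuses to store the page.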

#3) Streaming content, as well as most web content, is now encrypted, and since we are not the NSA, we do not have a back door to decrypt it and serve it from our caching engines.

As you may have noticed, I have been careful to say that caching is obsolete on third-party caching engines, not on all caching engines. So what gives?

Some of the larger content providers, such as Netflix, will work with larger ISPs to host caching servers for their proprietary, encrypted content. This is a win-win for both Netflix and the last-mile ISP. There are some restrictions on which ISPs Netflix will support with this technology. The point is that it is Netflix providing the caching engine, for their content only, with their proprietary software; a third-party engine cannot offer this service. Other content providers may offer similar technology, but for now you can stick a fork in any generic third-party caching server.

A Novel Idea on How to Cache Data Completely Transparently


By Art Reisman

Recently I got a call from a customer claiming our Squid proxy was not retrieving videos from cache when expected.

This prompted me to set up a test in our lab where I watched four videos over and over. With each iteration, I noticed that the proxy would sometimes go out and fetch a new copy of a video, even though the video was already in the local cache, thus confirming the customer’s observation.

Why does this happen?

I have not delved into the specific Squid code yet, but I think it has to do with the dynamic redirection performed by YouTube in the cloud, and the way the Squid proxy interprets the URL. If you look closely at YouTube URLs, there is a CGI component in the name: the word “watch” followed by a question mark “?”. The URLs are not static. Even though I may be watching the same YouTube video on successive tries, the cloud is fetching the actual video from a different place each time, and so the Squid proxy thinks it is new.
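
To illustrate with made-up URLs (the real parameters are generated dynamically and vary per session), a cache keyed on the full request URL treats two fetches of the same video as different objects whenever any part of the query string changes:

    # Minimal sketch of why URL-keyed caching misses on dynamic URLs.
    # The host and the "signature" parameter are hypothetical stand-ins
    # for the dynamically generated values YouTube actually uses.
    cache = {}

    def lookup(url):
        return cache.get(url)  # keyed on the full URL, query string included

    first = "http://video.example.com/watch?v=abc123&signature=111"
    second = "http://video.example.com/watch?v=abc123&signature=222"

    cache[first] = b"...video bytes..."
    print(lookup(second))  # None -- same video, different URL: a cache miss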

Since caching old copies of data is a big no-no, my Squid proxy, when in doubt, errs on the side of caution and fetches a new copy.

The other hassle with using a proxy caching server is the complexity of setting up port redirection (special routing rules). By definition, the proxy must fake out the client making the request for the video. Getting this redirection to work requires some intimate network knowledge and good troubleshooting techniques.
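
To give a feel for what is involved, a typical transparent Squid setup looks something like this (a minimal sketch; the interface name and ports are assumptions, and newer Squid releases use “intercept” where older ones used “transparent”):

    # squid.conf: listen in interception mode
    http_port 3128 intercept

    # On the gateway: silently divert outbound web traffic to Squid
    iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-ports 3128

Getting the return path, DNS, and any asymmetric routing right on a real network is where the intimate knowledge comes in.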

My solution for the above issues is to just toss the traditional Squid proxy altogether and invent something easier to use.

Note: I have run the following idea by the naysayers (all of my friends who think I am nuts), and yes, there are still some holes in it. I’ll present their points after I make my case.

My caching idea

To get my thought process started, I tossed all that traditional tomfoolery with redirection and URL-name caching out the window.

My caching idea is to cache streams of data without regard to URL or filename. Basically, this would require a device to save off streams of characters as they happen. I am already very familiar with implementing this technology; we do it with our CALEA probe. We have already built technology that can capture raw streams of data, store them, and then index them, so that part of the problem is already solved.

Figuring out if a subsequent stream matched a stored stream would be a bit more difficult but not impossible.

The benefits of this stream-based caching scheme as I see them:

1) No routing or redirection needed; the device could be plugged into any network link by any weekend warrior.

2) No URL confusion.  Even if a stream (video) was kicked off from a different URL, the proxy device would recognize the character stream coming across the wire to be the same as a stored stream in the cache, and then switch over to the cached stream when appropriate, thus saving the time and energy of fetching the rest of the data from across the Internet.

The pure beauty of this solution is that just about any consumer could plug it in without any networking or routing knowledge.

How this could be built

Some rough details on how this would be implemented (with a code sketch after the list)…

The proxy would cache the most recent 10,000 streams.

1) A stream would be defined as occurring when continuous data was transferred in one direction from an IP and port to another IP and port.

2) The stream would terminate and be stored when the port changed.

3) The server would compare the beginning parts of new streams to streams already in cache, perhaps the first several thousand characters.  If there was a match, it would fake out the sender and receiver and step in the middle and continue sending the data.
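
Here is a minimal sketch of that bookkeeping in Python. The names, the fingerprint length, and the eviction policy are my own assumptions; the real device would be reassembling raw packets rather than receiving tidy function calls.

    from collections import OrderedDict

    FINGERPRINT_LEN = 4096  # "the first several thousand characters"
    MAX_STREAMS = 10_000    # cache the most recent 10,000 streams

    class StreamCache:
        def __init__(self):
            # finished streams, keyed by their first FINGERPRINT_LEN bytes
            self.streams = OrderedDict()
            # streams still being assembled, keyed by flow
            # flow = (src_ip, src_port, dst_ip, dst_port)
            self.open_flows = {}

        def on_data(self, flow, data):
            """Rule 1: accumulate continuous one-directional data per flow."""
            self.open_flows.setdefault(flow, bytearray()).extend(data)

        def on_flow_end(self, flow):
            """Rule 2: when the port changes, terminate and store the stream."""
            stream = bytes(self.open_flows.pop(flow, b""))
            if len(stream) < FINGERPRINT_LEN:
                return  # too short to fingerprint reliably
            key = stream[:FINGERPRINT_LEN]
            self.streams[key] = stream
            self.streams.move_to_end(key)
            if len(self.streams) > MAX_STREAMS:
                self.streams.popitem(last=False)  # evict the oldest

        def match(self, prefix):
            """Rule 3: if a new stream's opening bytes match a stored stream,
            return the stored copy so the proxy can take over serving it."""
            if len(prefix) < FINGERPRINT_LEN:
                return None
            return self.streams.get(prefix[:FINGERPRINT_LEN])

The fingerprint is simply the stream’s opening bytes, so a lookup is a single dictionary probe, and the OrderedDict gives cheap “most recent 10,000 streams” eviction.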

What could go wrong

Now for the major flaws in this technology that must be overcome.

1) Since there is no title on the stream from the sender, there would always be the chance that a match was a coincidence. For example, an advertisement prepended to multiple YouTube videos might fool the caching server: the initial sequence of bytes would match the advertisement and not the video that follows (the snippet after this list demonstrates the problem).

2) Since we would be interrupting a client-server transaction mid-stream, the server would have to be cut off in the middle of the stream when the proxy took over. That could get ugly as the server tries to keep sending. Faking an ACK back to the sending server would not be viable either, as the sending server would continue to send data, which is exactly what the cache is trying to prevent.
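
Flaw #1 is easy to reproduce against the sketch above: if two different videos are served behind the same advertisement, their opening bytes, and therefore their fingerprints, are identical, and the cache hands back the wrong stream. (The byte strings here are made up, of course.)

    # Hypothetical illustration of the coincidence problem: a shared ad
    # preamble makes two different streams fingerprint identically.
    ad = b"A" * 4096                  # the ad fills the fingerprint window
    video_1 = ad + b"...cat video..."
    video_2 = ad + b"...dog video..."

    cache = StreamCache()
    flow = ("1.2.3.4", 50000, "5.6.7.8", 80)
    cache.on_data(flow, video_1)
    cache.on_flow_end(flow)

    # The prefix of video_2 matches, so the cache hands back video_1.
    print(cache.match(video_2) == video_1)  # True -- a false positive

Any fix would have to fingerprint deeper into the stream, or keep matching past the shared preamble before committing to the cached copy.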

The next step (after I fix our traditional URL-matching problem for the customer) is to build an experimental version of stream-based caching.

Stay tuned to see if I can get this idea to work!
