A Novel Idea on How to Cache Data Completely Transparently


By Art Reisman

Recently I got a call from a customer claiming our Squid proxy was not retrieving videos from cache when expected.

This prompted me to set up a test in our lab where I watched four videos over and over. With each iteration, I noticed that the proxy would sometimes go out and fetch a new copy of a video, even though the video was already in the local cache, thus confirming the customer’s observation.

Why does this happen?

I have not delved into the specific Squid code yet, but I think it has to do with the dynamic redirection performed by YouTube in the cloud, and the way the Squid proxy interprets the URL. If you look closely at YouTube URLs, there is a CGI component in the name: the word “watch” followed by a question mark “?”. The URLs are not static. Even though I may be watching the same YouTube video on successive tries, the cloud is serving the actual video from a different place each time, and so the Squid proxy thinks it is new.
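To make the problem concrete, here is a small illustration of why a URL-keyed cache misses (the URLs are invented for the example; real YouTube links carry many more parameters):

    from urllib.parse import urlparse, parse_qs

    # Two requests for the same video, handed out by different cloud servers.
    # These URLs are made up purely for illustration.
    url_a = "http://r3.cache.example.com/watch?v=abc123&signature=AAA"
    url_b = "http://r7.cache.example.com/watch?v=abc123&signature=BBB"

    # A cache keyed on the full URL treats these as two different objects...
    print(url_a == url_b)  # False -> cache miss, the video is fetched again

    # ...even though the video id buried in the query string is identical.
    id_a = parse_qs(urlparse(url_a).query)["v"]
    id_b = parse_qs(urlparse(url_b).query)["v"]
    print(id_a == id_b)    # True -> it is the same video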

Since caching old copies of data is a big no-no, my Squid proxy, when in doubt, errs on the side of caution and fetches a new copy.

The other hassle with using a caching proxy server is the complexity of setting up port re-direction (special routing rules). By definition, the proxy must fake out the client making the request for the video. Getting this re-direction to work requires some intimate network knowledge and good troubleshooting techniques.
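For a sense of what that setup involves: a typical transparent-proxy configuration on Linux pushes web traffic into Squid with a routing rule along these lines (eth0 and Squid’s default port 3128 are the usual values, but every network differs):

    iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-ports 3128

One wrong interface name or one missing return route and the whole thing silently fails, which is exactly the troubleshooting burden I mean.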

My solution for the above issues is to just toss the traditional Squid proxy altogether and invent something easier to use.

Note: I have run the following idea by the naysayers (all of my friends who think I am nuts), and yes, there are still some holes in it. I’ll present their points after I present my case.

My caching idea

To get my thought process started, I tossed all that traditional tomfoolery with re-direction and URL name caching out the window.

My caching idea is to cache streams of data without regard to URL or filename. Basically, this would require a device to save off streams of characters as they happen. I am already very familiar with implementing this technology; we do it with our CALEA probe. We have already built technology that can capture raw streams of data, store them, and then index them, so that part does not need to be solved.

Figuring out if a subsequent stream matched a stored stream would be a bit more difficult but not impossible.
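One cheap way to do the comparison would be to hash the opening bytes of every stored stream and look new streams up by the same hash. Here is a minimal sketch of that idea; the 4 KB window and all the names are my own assumptions, not a finished design:

    import hashlib

    PREFIX_LEN = 4096  # assumed comparison window: the first 4 KB of a stream

    def fingerprint(data: bytes) -> str:
        # Hash only the opening bytes so lookups stay cheap.
        return hashlib.sha256(data[:PREFIX_LEN]).hexdigest()

    stored = {}  # fingerprint -> complete cached stream

    def remember(stream: bytes):
        stored[fingerprint(stream)] = stream

    def find_match(opening_bytes: bytes):
        # Not enough data seen yet? Then we cannot decide.
        if len(opening_bytes) < PREFIX_LEN:
            return None
        return stored.get(fingerprint(opening_bytes))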

The benefits of this stream-based caching scheme as I see them:

1) No routing or redirection needed; the device could be plugged into any network link by any weekend warrior.

2) No URL confusion. Even if a stream (video) was kicked off from a different URL, the proxy device would recognize the character stream coming across the wire as identical to a stored stream in the cache and switch over to the cached stream when appropriate, thus saving the time and energy of fetching the rest of the data from across the Internet.

The pure beauty of this solution is that just about any consumer could plug it in without any networking or routing knowledge.

How this could be built

Some rough details on how this would be implemented…

The proxy would cache the most recent 10,000 streams.

1) A stream would be defined as occurring when continuous data was transferred in one direction from an IP and port to another IP and port.

2) The stream would terminate and be stored when the port changed.

3) The caching server would compare the beginning parts of new streams, perhaps the first several thousand characters, to streams already in cache. If there was a match, it would fake out the sender and receiver, step into the middle, and continue sending the data (see the sketch below).
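Pulling those three rules together, the bookkeeping might look something like this. This is a minimal sketch under my own assumptions; the 10,000-stream limit and 4 KB window come from the rough numbers above, and every name is a placeholder:

    from collections import OrderedDict

    MAX_STREAMS = 10_000   # "cache the most recent 10,000 streams"
    PREFIX_LEN = 4_096     # "the first several thousand characters"

    open_streams = {}      # (src_ip, src_port, dst_ip, dst_port) -> bytearray
    cache = OrderedDict()  # opening bytes -> complete stored stream

    def on_data(src_ip, src_port, dst_ip, dst_port, payload: bytes):
        # Rule 1: continuous one-directional data between one IP:port
        # and another IP:port belongs to a single stream.
        key = (src_ip, src_port, dst_ip, dst_port)
        open_streams.setdefault(key, bytearray()).extend(payload)

    def on_port_change(src_ip, src_port, dst_ip, dst_port):
        # Rule 2: when the port changes, the stream terminates and is stored.
        data = bytes(open_streams.pop((src_ip, src_port, dst_ip, dst_port), b""))
        if data:
            cache[data[:PREFIX_LEN]] = data
            while len(cache) > MAX_STREAMS:
                cache.popitem(last=False)  # evict the oldest stream

    def match_new_stream(opening_bytes: bytes):
        # Rule 3: compare the beginning of a new stream to stored streams.
        # A hit means the proxy could step in and serve the rest from cache.
        return cache.get(bytes(opening_bytes[:PREFIX_LEN]))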

What could go wrong

Now for the major flaws in this technology that must be overcome.

1) Since there is no title on the stream from the sender, there would always be the chance that a match was a coincidence. For example, an advertisement prepended to multiple YouTube videos might fool the caching server: the initial sequence of bytes would match the advertisement and not the video that follows.
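Here is that flaw in miniature, using the same first-few-kilobytes comparison sketched earlier (the byte strings are obviously contrived):

    import hashlib

    PREFIX_LEN = 4096
    ad = b"AD" * 3000                        # a shared 6,000-byte pre-roll ad
    cat_video = ad + b"...cat video payload..."
    news_clip = ad + b"...news clip payload..."

    def fingerprint(data):
        return hashlib.sha256(data[:PREFIX_LEN]).hexdigest()

    # Both streams open with the ad, so their fingerprints collide, and the
    # cache would happily serve the wrong video once the ad finishes.
    print(fingerprint(cat_video) == fingerprint(news_clip))  # True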

2) Since we would be interrupting a client-server transaction mid-stream, the server would have to be cut off in the middle of the stream when the proxy took over. That might get ugly as the server tries to keep sending. Faking an ACK back to the sending server would not be viable either, as the server would then continue to send data, which is exactly what the cache is trying to prevent.
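The only escape I can think of, and this is pure speculation on my part rather than anything worked out, is to forge a TCP reset toward the server at the moment of takeover, so it tears down its side of the connection while the proxy keeps feeding the client from cache. With a packet library such as scapy, the forged reset might look like:

    from scapy.all import IP, TCP, send  # requires the scapy package

    def cut_off_server(server_ip, server_port, client_ip, client_port, seq):
        # Pretend to be the client and reset the connection. 'seq' must be
        # the sequence number the server currently expects from the client,
        # or the server will ignore the reset. Hypothetical helper, untested.
        rst = IP(src=client_ip, dst=server_ip) / TCP(
            sport=client_port, dport=server_port, flags="R", seq=seq)
        send(rst, verbose=False)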

The next step (after I fix our traditional URL-matching problem for the customer) is to build an experimental version of stream-based caching.

Stay tuned to see if I can get this idea to work!
