Editor's Note: Over the past few weeks we have been tuning and testing our caching engine, working closely with some of the developers who contribute to the Squid open source project.
The following are some of my observations and discoveries about Squid caching from our testing process.
Our primary mission was to make sure YouTube files cache correctly (which we have done). One of the tricky aspects of caching a YouTube file is that many of these files are considered dynamic content. Basically, this means their content contains a portion that may change with each access. Sometimes the URL itself is just a pointer to a server where the content is generated fresh with each new access.
An extreme example of dynamic content would be your favorite stock quote site. During the business day much of the information on these pages changes constantly, so it is obsolete within seconds. A poorly designed caching engine would do much more harm than good if it served up out-of-date stock quotes.
Caching engines by default try not to cache dynamic content, and for good reason. There are two different methods a caching server uses to decide whether or not to cache a page:
1) The web designer can explicitly set flags, sent along with the page in its HTTP response headers, that tell caching engines whether a page is safe to cache or not.
In a recent test I set up a crawler to walk through the Excite web site and all its URLs. I use this crawler to create load in our test lab as well as to fill up our caching engine with repeatable content. I set my Squid configuration file to cache all content smaller than 4k. Normally this would generate a great many cache hits, but for some reason none of the Excite content would cache. Upon further analysis, our Squid consultant found the problem.
“I have completed the initial analysis. The problem is the excite.com
server(s). All of the “200 OK” excite.com responses that I have seen
among the first 100+ requests contain Cache-Control headers that
prohibit their caching by shared caches. There appears to be only two
kinds of Cache-Control values favored by excite:
Cache-Control: no-store, no-cache, must-revalidate, post-check=0,
Both are deadly for a shared Squid cache like yours. Squid has options
to overwrite most of these restrictions, but you should not do that for
all traffic as it will likely break some sites.”
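To make the consultant's point concrete, here is a minimal sketch of how a shared cache might read a Cache-Control header and refuse to cache the response. This is not Squid's actual logic. The function names and the BLOCKING set are my own assumptions for illustration, and the parser ignores quoted arguments that contain commas.

```python
from typing import Dict, Optional

# Assumption for this sketch: any of these directives makes a response
# off-limits to a shared cache. Strictly speaking, "no-cache" allows storage
# as long as every reuse is revalidated; it is treated as blocking here.
BLOCKING = {"no-store", "no-cache", "private"}


def parse_cache_control(header_value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing or doubled commas
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_may_store(header_value: str) -> bool:
    """Return True when no blocking directive is present."""
    return BLOCKING.isdisjoint(parse_cache_control(header_value))


# The header value exactly as it appears (truncated) in the quote above
excite_header = "no-store, no-cache, must-revalidate, post-check=0,"
print(shared_cache_may_store(excite_header))
print(shared_cache_may_store("public, max-age=3600"))
```

With the header from the quote, the check fails on both no-store and no-cache, which matches why none of the Excite pages were cached.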
2) The second method is more passive than deliberate directives. Caching engines look at the actual URL of a page for clues about its permanence. A “?” in the URL implies dynamic content and is generally a red flag to the caching server. And herein lies the issue with caching YouTube files: almost all of them have a “?” embedded in their URL.
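The “?” check can be sketched in a few lines. This is only an illustration of the heuristic, not how Squid implements it. The function name and the example.com URLs are made up for the example.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Heuristic: a URL carrying a query string is assumed to be dynamic."""
    return bool(urlsplit(url).query)


# Hypothetical URLs, for illustration only
print(looks_dynamic("http://example.com/images/logo.png"))
print(looks_dynamic("http://example.com/watch?v=abc123"))
```

The weakness is visible right away: the second URL may point to a video that never changes, yet the heuristic flags it anyway.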
Fortunately, YouTube videos are normally permanent and unchanging once they are uploaded. I am still getting a handle on these pages, but it seems the dynamic part is used to insert different advertisements at the front of the video. Our Squid caching server uses a normalizing technique to keep the root of the URL consistent and thus serve up the correct base YouTube video every time. Over the past two years we have had to replace our normalization technique twice in order to consistently cache YouTube files.
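The general idea behind URL normalization can be sketched as follows. This is not our production technique, and it does not reflect YouTube's real URL layout. It assumes a single hypothetical query parameter, "v", identifies the video and that everything else in the query string varies per request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for this sketch: only these parameters identify the content.
STABLE_PARAMS = {"v"}


def normalize(url: str) -> str:
    """Build a cache key that drops query parameters assumed to be volatile."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in STABLE_PARAMS
    )
    # Lowercase the host and discard the fragment so equivalent URLs match
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Two hypothetical requests for the same video with different ad parameters
first = normalize("http://Video.Example.com/watch?ad=17&v=abc123")
second = normalize("http://video.example.com/watch?v=abc123&ad=42")
print(first == second)
```

Both requests map to the same key, so the second one can be served from the cache. The fragile part is knowing which parameters are stable, and that list has to be revisited whenever the site changes its URLs.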