A Novel Idea on How to Cache Data Completely Transparently


By Art Reisman

Recently I got a call from a customer claiming our Squid proxy was not retrieving videos from cache when expected.

This prompted me to set up a test in our lab where I watched  four videos over and over. With each iteration, I noticed that the proxy would  sometimes go out and fetch a new copy of a video, even though the video was already in the local cache, thus confirming the customer’s observation.

Why does this happen?

I have not delved into the specific Squid code yet, but I think it has to do with the dynamic redirection performed by YouTube in the cloud, and the way the Squid proxy interprets the URL. If you look closely at YouTube URLs, there is a CGI component in the name: the word "watch" followed by a question mark "?". The URLs are not static. Even though I may be watching the same YouTube video on successive tries, the cloud is getting the actual video from a different place each time, and so the Squid proxy thinks it is new.

Since caching old copies of data is a big no-no, my Squid proxy, when in doubt, errs on the side of caution and fetches a new copy.

The other hassle with using a caching proxy server is the complexity of setting up port redirection (special routing rules). By definition, the proxy must fake out the client making the request for the video. Getting this redirection to work requires some intimate network knowledge and good troubleshooting techniques.

My solution for the above issues is to just toss the traditional Squid proxy altogether and invent something easier to use.

Note: I have run the following idea by the naysayers (all of my friends who think I am nuts), and yes, there are still some holes in it. I'll present their points after I present my case.

My caching idea

To get my thought process started, I tossed all that traditional tomfoolery with redirection and URL-based caching out the window.

My caching idea is to cache streams of data without regard to URL or filename. Basically, this would require a device to save off streams of characters as they happen. I am already very familiar with implementing this technology; we do it with our CALEA probe. We have already built technology that can capture raw streams of data, store them, and then index them, so this part does not need to be solved.

Figuring out if a subsequent stream matched a stored stream would be a bit more difficult but not impossible.

The benefits of this stream-based caching scheme as I see them:

1) No routing or redirection needed; the device could be plugged into any network link by any weekend warrior.

2) No URL confusion.  Even if a stream (video) was kicked off from a different URL, the proxy device would recognize the character stream coming across the wire to be the same as a stored stream in the cache, and then switch over to the cached stream when appropriate, thus saving the time and energy of fetching the rest of the data from across the Internet.

The pure beauty of this solution is that just about any consumer could plug it in without any networking or routing knowledge.

How this could be built

Some rough details on how this would be implemented…

The proxy would cache the most recent 10,000 streams.

1) A stream would be defined as occurring when continuous data was transferred in one direction from an IP and port to another IP and port.

2) The stream would terminate and be stored when the port changed.

3) The server would compare the beginning parts of new streams to streams already in the cache, perhaps the first several thousand characters. If there was a match, it would fake out the sender and receiver, step into the middle, and continue sending the data from the cache.
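To make the matching step concrete, here is a minimal sketch in Python of the kind of prefix fingerprinting described above. It is only an illustration of the idea under assumed parameters (a 4 KB prefix, an in-memory table, a 10,000-stream limit); it is not code from any existing NetEqualizer or Squid product, and a real device would operate on raw packets, handle reordering, and store payloads on disk.

```python
# Minimal sketch of prefix-based stream matching (illustrative only).
# Assumption: a stream "matches" a cached one if its first PREFIX_BYTES are identical.
import hashlib

PREFIX_BYTES = 4096          # how much of the stream head we fingerprint (assumed)
MAX_CACHED_STREAMS = 10_000  # "cache the most recent 10,000 streams"

class StreamCache:
    def __init__(self):
        self.by_fingerprint = {}   # prefix hash -> full cached payload

    def fingerprint(self, data: bytes) -> str:
        return hashlib.sha256(data[:PREFIX_BYTES]).hexdigest()

    def store(self, payload: bytes) -> None:
        if len(self.by_fingerprint) >= MAX_CACHED_STREAMS:
            # naive eviction: drop the oldest-inserted entry
            self.by_fingerprint.pop(next(iter(self.by_fingerprint)))
        self.by_fingerprint[self.fingerprint(payload)] = payload

    def match(self, head: bytes):
        """Return the cached payload if the new stream's head matches a saved stream."""
        if len(head) < PREFIX_BYTES:
            return None            # not enough data yet to decide
        return self.by_fingerprint.get(self.fingerprint(head))

# Usage sketch: buffer the head of each new stream; once PREFIX_BYTES have arrived,
# either serve the rest from cache or keep recording until the stream terminates.
cache = StreamCache()
cache.store(b"A" * 100_000)             # pretend this was a previously recorded video stream
hit = cache.match(b"A" * PREFIX_BYTES)  # a later stream arriving with the same head
print("cache hit" if hit else "cache miss")
```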

What could go wrong

Now for the major flaws in this technology that must be overcome.

1) Since there is no title on the stream from the sender, there would always be a chance that a match was a coincidence. For example, an advertisement prepended to multiple YouTube videos might fool the caching server: the initial sequence of bytes would match the advertisement and not the video that follows.

2) Since we would be interrupting a client-server transaction mid-stream, the server would have to be cut off in the middle of the stream when the proxy took over. That might get ugly as the server tries to keep sending. Faking an ACK back to the sending server would not be viable either, as the sending server would continue to send data, which is exactly what the cache is trying to prevent.

The next step (after I fix the traditional URL-matching problem for the customer) is to build an experimental version of stream-based caching.

Stay tuned to see if I can get this idea to work!

The World’s Biggest Caching Server


Caching solutions are used in all shapes and sizes to speed up Internet data retrieval. From your desktop keeping a local copy of the last web page viewed, to your cable company keeping an entire library of NetFlix movies,  there is a broad diversity in the scope and size of  caching solutions.

So, what is the biggest caching server out there? Moreover, if I found the world's largest caching server, would it store just a tiny, microscopic subset of the total data available on the public Internet? Is it possible that somebody has actually cached everything on the Internet? A caching server the size of the Internet seems absurd, but I decided to investigate anyway, and so, with an open mind, I set out to find the biggest caching server in the world. Below I have detailed my research and findings.

As always, I started with Google, but not in the traditional sense. If you think about Google, they seem to have every public page on the Internet indexed. That is a huge amount of data, and I suspect they are the world's biggest caching server. Asserting that Google is the world's largest caching server seemed logical, but somewhat hollow and unsubstantiated, so my next step was to quantify the assertion.

In a weird twist of logic, I figured the best way to estimate how much data Google actually stores would be to determine what data is not stored in Google.

I would need to find a good way to stumble onto some truly random web pages without using Google to find them, and then specifically test whether Google knew about those pages by asking it to search for unique, deeply rooted text strings within those sites.

Rather than ramble too much, I’ll just walk through one of my experiments below.

To find a random web site, I started with one of those random web site stumblers. As advertised, it took me to a random web site titled "Finest Polynesian Tiki Objects". From there, I looked for unique text strings on the Tiki site. The idea here is to find a sentence of text that is not likely to be found anywhere but on this site; in essence, something deep enough that it is not a deliberately indexed title already submitted to Google. I poked around on the Tiki site and found some seemingly innocuous text on their merchant page: "Presenting Genuine Witco Art – every piece will come with a scanned". I put that exact string in my Google search box and presto, there it was.

[Screenshot: Google search result returning the Tiki merchant page]

Wow, it looks like Google has this somewhat random page archived and indexed, because it came up in my search.

A sample this small is not large enough to extrapolate from and draw conclusions, so I repeated my experiment a few more times. Here are more samples of what I found…

Try number two.

Random Web Site

http://www.genarowlandsband.com/contact.php

Search String In Google

“For booking or general whatnot, contact Bob. Heck, just write to say hello if you feel like it.”

[Screenshot: Google search result returning the genarowlandsband.com contact page]

It worked again: Google found the exact page from a search on a string buried deep within the page.

And then I did it again.

[Screenshot: Google search result returning a third random page]

And again Google found the page.

The conclusion is that Google has cached close to 100 percent of the publicly accessible text on the Internet. In fairness to Google's competitors, they also found the same web pages using the same search terms.

So how much data is cached in terms of a raw number?


There are plenty of public statistics on the number of web sites and pages connected to the Internet, and there is also data detailing the average size of a web page. What I have not determined is how much of the video and image content is cached by Google. I do know they are working on image search engines, but for now, to be conservative, I'll base my estimate on text only.

So, roughly, there are 15 billion web pages, and the average amount of text per page is 25 thousand bytes. (Note: most of the Web is video and images; text is actually a small percentage.)

So to get a final number, I multiply 15 billion (15,000,000,000) by 25 thousand (25,000) and I get…

375,000,000,000,000 bytes cached (roughly 375 terabytes)…
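For anyone who wants to adjust the assumptions, the same back-of-envelope math fits in a few lines of Python; the page count and bytes-per-page figures are simply the rough numbers quoted above, not measured values.

```python
# Back-of-envelope estimate of Google's cached text (illustrative assumptions).
pages = 15_000_000_000        # rough count of public web pages
text_bytes_per_page = 25_000  # rough average amount of text per page

total_bytes = pages * text_bytes_per_page
print(f"{total_bytes:,} bytes cached")              # 375,000,000,000,000 bytes
print(f"about {total_bytes / 1e12:.0f} terabytes")  # about 375 terabytes
```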


Notice that in these searches the name of the site (or the band) never appears in my search string; there is nothing to tip off the Google search engine about what I am looking for, and presto!

Caching Success: Urban Myth or Reality?


Editor's Note:

Caching is a bit overrated as a means of eliminating congestion and speeding up Internet access. Yes, there are some nice caching tricks that create fleeting illusions of speed, but in the end, caching alone will fail to mitigate problems due to congestion. The following article, adapted from our November 2011 posting, details why.

You might be surprised to learn that Internet link congestion cannot be mitigated with a caching server alone. Contention can only be eliminated by:

1) Increasing bandwidth

2) Some form of intelligent bandwidth control

3) Or a combination of 1) and 2)

A common assumption about caching is that you will be able to cache a large portion of common web content, such that with a decent caching solution a significant amount of your user traffic will not traverse your backbone. Unfortunately, our real-world experience has shown that, after the implementation of a caching solution, the overall congestion on your Internet link shows no improvement.

For example, let's take the case of an Internet trunk that delivers 100 megabits and is heavily saturated prior to implementing a caching solution. What happens when you add a caching server to the mix?

From our experience, a good hit rate to cache will likely not exceed 5 percent. Yes, we have heard claims of 50 percent, but we have not seen this in practice and suspect it is just best-case vendor hype or a very specialized solution targeted at NetFlix (not general caching). We have been selling a caching solution and discussing other caching solutions with customers for almost 3 years, and like any urban myth, claims of high-percentage cache hits are impossible to track down.

Why is the hit rate at best only 5 percent?

The Internet is huge relative to a cache, and you can only cache a tiny fraction of total Internet content. Even Google, with billions invested in data storage, does not come close. You can attempt to keep trending popular content in the cache, but the majority of requests to the Internet tend to be somewhat random and impossible to anticipate. Yes, a good number of hits resolve locally to a Yahoo home page, but many more users are going to do unique things. For example, common hits like email and Facebook are all very different for each user and are not a shared resource that can be maintained in the cache. User hobbies are also all different, so users traverse different web pages and watch different videos. The point is that you cannot anticipate this data and keep it in a local cache any more reliably than you can guess the weather long term. You can get a small statistical advantage, and that accounts for the 5 percent that you get right.


Even with caching at a 5 percent hit rate, your backbone link usage will not decline.

With caching in place, any gain in efficiency will be countered by a corresponding increase in total usage. Why is this?

If you assume an optimistic 10 percent hit rate to cache, you will end up getting a boost and obviously handle 10 percent more traffic than you did prior to caching; however, your main pipe won't see any relief.

This is worth repeating: if you cache 10 percent of your data, that does not mean your Internet pipe usage will go from 100 percent to 90 percent; it is not a zero-sum game. The net effect is that your main pipe will remain 100 percent full, and you will get 10 percent on top of that from your cache. Thus your net usage to the Internet appears to be 110 percent. The problem is that you still have a congested pipe, and the web pages and files that are not stored in cache will still suffer the associated slowness; you have not solved your congestion problem!
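A toy model makes the same point numerically. The numbers below (a 100-megabit pipe, a 10 percent hit rate, and demand that grows to fill whatever capacity appears) are illustrative assumptions, not measurements from any particular network:

```python
# Toy model: a cache offloads hits, but demand expands to fill the pipe,
# so the Internet link stays saturated. All numbers are illustrative.
PIPE_MBPS = 100      # capacity of the Internet link
HIT_RATE  = 0.10     # optimistic fraction of traffic served from cache

# Demand grows until the non-cached portion alone saturates the pipe.
total_delivered = PIPE_MBPS / (1 - HIT_RATE)   # ~111 Mbps seen by users
from_cache      = total_delivered * HIT_RATE   # ~11 Mbps served locally
over_the_pipe   = total_delivered - from_cache # still 100 Mbps: still congested

print(f"delivered to users: {total_delivered:.0f} Mbps")
print(f"served from cache:  {from_cache:.0f} Mbps")
print(f"over the Internet:  {over_the_pipe:.0f} Mbps (pipe remains full)")
```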

Perhaps I am beating a dead horse with examples, but just one more.

Let’s start with a very congested 100 megabit Internet link. Web hits are slow, YouTube takes forever, email responses are slow, and Skype calls break up. To solve these issues, you put in a caching server.

Now 10 percent of your hits come from cache, but since you did nothing to mitigate overall bandwidth usage, your users will simply eat up the extra 10 percent from cache and then some. It is like giving a drug addict a free hit of their preferred drug. If you serve up a fast YouTube, it will just encourage more YouTube usage.

Even with a good caching solution in place, if somebody tries to access Grandma’s Facebook page, it will have to come over the congested link, and it may time out and not load right away. Or, if somebody makes a Skype call it will still be slow. In other words, the 90 percent of the hits not in cache are still slow even though some video and some pages play fast, so the question is:

If 10 percent of your traffic is really fast, and 90 percent is doggedly slow, did your caching solution help?

The answer is yes, of course it helped, 10 percent of users are getting nice, uninterrupted YouTube. It just may not seem that way when the complaints keep rolling in. :)

Ever Wonder Why Your Video (YouTube) Over the Internet is Slow Sometimes?


By: Art Reisman


Art Reisman is the CTO of APconnections. He is Chief Architect on the NetGladiator and NetEqualizer product lines.

I live in a nice suburban neighborhood with both DSL and Cable service options for my Internet. My speed tests always show better than 10 megabits of download speed, and yet sometimes, a basic YouTube or iTunes download just drags on forever. Calling my provider to complain about broken promises of Internet speed is futile. Their call center people in India have the patience of saints; they will wear me down with politeness despite my rudeness and screaming. Although I do want to believe in some kind of Internet Santa Claus, I know firsthand that streaming unfettered video for all is just not going to happen. Below I'll break down some of the limitations of video over the Internet, and explain some of the seemingly strange anomalies behind various video performance problems.

The factors dictating the quality of video over the Internet are:

1) How many customers are sharing the link between your provider and the rest of the Internet

Believe it or not, your provider pays a fee to connect up to the Internet. Perhaps not in the same exact way a consumer does, but the more traffic they pass to and from the rest of the Internet, the more it costs them. There are times when their connection to the Internet is saturated, at which point all of their customers will experience slower service of some kind.

2) The server(s) where the video is located

It is possible that the content-hosting site has overloaded servers whose disk drives are just not fast enough to maintain decent quality. This is usually what your operator will claim, regardless of whether it is their fault or not. :)

3) The link from the server to the Internet location of your provider

Somewhere between the content video server and your provider there could be a bottleneck.

4) The “last mile”  link between you and your provider (is it dedicated or shared?)

For most cable and DSL customers, you have a direct wire back to your provider. For wireless broadband, it is a completely different story. You are likely sharing the airwaves to your nearest tower with many customers.

So why is my video slow sometimes for YouTube but not for NetFlix?

The reason I can watch some NetFlix movies, and a good number of popular YouTube videos, without any issues on my home system is that my provider uses a trick called caching to host some content locally. By hosting the video content locally, the provider can ensure that items 2 and 3 (above) are not an issue. Many urban cable operators also have a dedicated wire from their office to your residence, which eliminates issues with item 4 (above).

Basically, caching is nothing new for a cable operator. Even before the Internet, cable operators had movies on demand that you could purchase. With movies on demand, cable operators maintained a server with local copies of popular movies in their main office, and when you called them they would actually throw a switch of some kind and send the movie down the coaxial cable from their office to your house. Caching today is a bit more sophisticated than that but follows the same principles. When you watch a NetFlix movie, or YouTube video that is hosted on your provider’s local server (cache),  the cable company can send the video directly down the wire to your house. In most setups, you don’t share your local last mile wire, and hence the movie plays without contention.

Caching is great, and through predictive management (guessing what is going to be used the most), your provider often has the content you want in a local copy and so it downloads quickly.  However, should you truly surf around to get random or obscure YouTube videos, your chances of a slower video will increase dramatically, as it is not likely to be stored in your provider’s cache.

Try This: The next time you watch a (not popular) YouTube video that is giving you problems, kill it and try a popular trending video. More often than not, the popular trending video will run without interruption. If you repeat this experiment a few times and get the same results, you can be certain that your provider is caching some video to speed up your experience.

In case you need more proof that this is "top of mind" for Internet providers, check out the January 1st, 2012 CED Magazine article on the Top Broadband 50 for 2011 (read the whole article here). #25 (enclosed below) is tied to improving video over the Internet.

#25: Feeding the video frenzy with CDNs

So everyone wants their video anywhere, anytime and on any device. One way of making sure that video is poised for rapid deployment is through content delivery networks. The prime example of a cable CDN is the Comcast Content Distribution Network (CCDN), which allows Comcast to use its national backbone to tie centralized storage libraries to regional and local cache servers.

Of course, not every cable operator can afford the grand-scale CDN build-out that Comcast is undertaking, but smaller MSOs can enjoy some of the same benefits through partnerships. – MR

Our Take on Network Instruments' Fifth Annual State of the Network Global Study


Editor's Note: Network Instruments released their "Fifth Annual State of the Network Global Study" on March 13th, 2012. You can read their full study here. Their results were based on responses by 163 network engineers, IT directors, and CIOs in North America, Asia, Europe, Africa, Australia, and South America. Responses were collected from October 22, 2011 to January 3, 2012.

What follows is our take (or my $.02) on the key findings around Bandwidth Management and Bandwidth Monitoring from the study.

Finding #1: Over the next two years, more than one-third of respondents expect bandwidth consumption to increase by more than 50%.

Part of me says "well, duh!" but that is only because we hear this from many of our customers. So I guess if you were an executive, far removed from the day-to-day, this would be an important thing to have pointed out to you. Basically, this is your wake-up call (if you are not already awake) to listen to your Network Admins who keep asking you to allocate funds to the network. Now is the time to make your case for more bandwidth to your CEO/President/head guru. Get together the budget and resources to build out your network in anticipation of this growth, so that you are not caught off guard. Because if you don't, someone else will do it for you.

Finding #2: 41% stated network and application delay issues took more than an hour to resolve.

You can and should certainly put monitoring on your network to be able to see and react to delays. However, another way to look at this, admittedly biased by my bandwidth-shaping background, is to get rid of the delays!

If you are still running an unshaped network, you are missing out on maximizing your existing resource. Think about how smoothly traffic flows on roads, because there are smoothing algorithms (traffic lights) and rules (speed limits) that dictate how traffic moves, hence “traffic shaping.” Now, imagine driving on roads without any shaping in place. What would you do when you got to a 4-way intersection? Whether you just hit the accelerator to speed through, or decided to stop and check out the other traffic probably depends on your risk-tolerance and aggression profile. And the result would be that you make it through OK (live) or get into an ugly crash (and possibly die).

Similarly, your network traffic, when unshaped, can live (getting through without delays) or die (getting stuck waiting in a queue) trying to get to its destination. Whether you look at deep packet inspection, rate limiting, equalizing, or a home-grown solution, you should definitely look into bandwidth shaping. Find a solution that makes sense to you, will solve your network delay issues, and gives you a good return-on-investment (ROI). That way, your Network Admins can spend less time trying to find out the source of the delay.

Finding #3: Video must be dealt with.

24% believe video traffic will consume more than half of all bandwidth in 12 months.
47% say implementing and measuring QoS for video is difficult.
49% have trouble allocating and monitoring bandwidth for video.

Again, no surprise if you have been anywhere near a network in the last 2 years. YouTube use has exploded and become the norm on both consumer and business networks. Add that to the use of video conferencing in the workplace to replace travel, and Netflix or Hulu to watch movies and TV, and you can see that video demand (and consumption) has risen sharply.

Unfortunately, there is no quick, easy fix to make sure that video runs smoothly on your network. However, a combination of solutions can help you to make video run better.

1) Get more bandwidth.

This is just a basic fact-of-life. If you are running a network of < 10Mbps, you are going to have trouble with video, unless you only have one (1) user on your network. You need to look at your contention ratio and size your network appropriately.

2) Cache static video content.

Caching is a good start, especially for static content such as YouTube videos. One caveat to this: do not expect caching to solve network congestion problems (read more about that here), as users will quickly consume any bandwidth that caching has freed up. Caching will help when a video has gone viral and everyone on your network is accessing it repeatedly.

3) Use bandwidth shaping to prioritize business-critical video streams (servers).

If you have a designated video-streaming server, you can define rules in your bandwidth shaper to prioritize this server. The risk of this strategy is that you could end up giving all your bandwidth to video; you can reduce that risk by rate capping the bandwidth portioned out to video, as sketched below.
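For the curious, here is a rough idea of what rate capping means mechanically, written as a minimal token-bucket sketch in Python. It is a conceptual toy, not NetEqualizer code or any particular vendor's shaper, and the 30 Mbps cap and 1 MB burst allowance are arbitrary example values.

```python
# Toy token-bucket rate cap (conceptual sketch, not a product implementation).
# Traffic for the prioritized video server passes immediately while tokens remain;
# beyond the cap, packets would be queued or delayed.
import time

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8          # refill rate in bytes per second
        self.capacity = burst_bytes       # maximum burst size in bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True                   # send now: within the video rate cap
        return False                      # over the cap: hold the packet for later

# Example: cap video traffic at 30 Mbps with a 1 MB burst allowance.
video_cap = TokenBucket(rate_bps=30_000_000, burst_bytes=1_000_000)
print(video_cap.allow(1500))              # a typical 1500-byte packet passes
```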

As I said, this is just my take on the findings. What do you see? Do you have a different take? Let us know!

YouTube Dominates Video Viewership in U.S.


Editor’s Note: Updated July 27th, 2011 with material from www.pewinternet.org:

YouTube studies are continuing to confirm what I'm sure we all are seeing: Americans are creating, sharing and viewing video online more than ever, according to a Pew Research Center Internet & American Life Project study released Tuesday.

According to Pew, fully 71% of online Americans use video-sharing sites such as YouTube and Vimeo, up from 66% a year earlier. The use of video-sharing sites on any given day also jumped five percentage points, from 23% of online Americans in May 2010 to 28% in May 2011.  This figure (28%) is slightly lower than the 33% Video Metrix reported in June, but is still significant.

To download or read the full study, click on this link:  http://pewinternet.org/Reports/2011/Video-sharing-sites/Report.aspx

———————————————————————————————————————————————————

YouTube viewership in May 2011 was approximately 33 percent of video viewed on the Internet in the U.S., according to data from the comScore Video Metrix released on June 17, 2011.

Google sites, driven primarily by video viewing at YouTube.com, ranked as the top online video content property in May with 147.2 million unique viewers, which was 83 percent of the total unique viewers tracked.  Google Sites had the highest number of viewing sessions with more than 2.1 billion, and highest time spent per viewer at 311 minutes, crossing the five-hour mark for the first time.

To read more on the data released by comScore, click here.  comScore, Inc. (NASDAQ: SCOR) is a global leader in measuring the digital world and preferred source of digital business analytics. For more information, please visit www.comscore.com/companyinfo.

This trend further confirms why our NetEqualizer Caching Option (NCO) is geared to caching YouTube videos. While NCO will cache any file sized from 2MB-40MB traversing port 80, the main target content is YouTube.  To read more about the NetEqualizer Caching Option to see if it’s a fit for your organization, read our YouTube Caching FAQ or contact Sales at sales@apconnections.net.

YouTube Caching Results: Detailed Analysis from Live Systems


Since the release of YouTube caching support on our NetEqualizer bandwidth controller, we have been able to review several live systems in the field. Below we will go over the basic hit rate of YouTube videos and explain in detail how this affects the user experience. The analysis below is based on an actual snapshot from a mid-sized state university with a 64-gigabyte cache and approximately 2,000 students in residence.

The Squid proxy server provides a wide range of statistics. You can easily spend hours examining them and become exhausted with MSOS, an acronym for "meaningless stat overload syndrome". To save you some time, we are going to look at just one stat from one report. From the Squid Statistics tab on the NetEqualizer, we selected the Cache Client List option. This report shows individual cache stats for all clients on your network. At the very bottom is a summary report totaling all Squid stats and hits for all clients.

TOTALS

  • ICP : 0 Queries, 0 Hits (0%)
  • HTTP: 21990877 Requests, 3812 Hits (0%)

At first glance, it appears as if the ratio of actual cache hits (3,812) to HTTP requests (21,990,877) is extremely low. As with all statistics, the obvious conclusion can be misleading. First off, the NetEqualizer cache is deliberately tuned to NOT cache HTTP objects smaller than 2 megabytes. This is done for a couple of reasons:

1) Generally, there is no advantage to caching small Web pages, as they normally load up quickly on systems with NetEqualizer fairness in place. They already have priority.

2) With a few exceptions for popular web sites, small web hits are widely varied and would fill up the cache, taking away space that we would like to use for our target content: YouTube videos.

Breaking down the amount of data in a typical web site versus a YouTube hit

It is true that web sites today can often exceed a megabyte. However, rarely does a 2-megabyte web site load as a single hit; it is comprised of many sub-links, each of which generates a web hit in the summary statistics. A simple HTTP page typically triggers about 10 HTTP requests for perhaps 100K bytes of data total; a more complex page may generate 500K. For example, when you go to the CNN home page there are quite a few small links, and each link increments the HTTP counter. On the other hand, a YouTube hit generates one request for about 20 megabytes of data. When we start to look at actual data cached instead of total web hits, the ratio of cached to not cached is quite different.

Our cache setup is also designed to only cache objects from 2 megabytes to 40 megabytes, with an estimated average of 20 megabytes. When we look at actual data cached (instead of hits), this gives us roughly 400 gigabytes of regular HTTP data, of which about 76 gigabytes came from the cache. Conservatively, about 10 percent of all HTTP data came from cache by this rough estimate. This number is much more significant than the raw HTTP hit statistics reveal.
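The arithmetic behind that rough estimate can be reproduced in a few lines. The 20-megabyte average video size is the estimate stated above; the 18 KB average size for an ordinary HTTP request is my own back-solved assumption, chosen so the total comes out near 400 gigabytes.

```python
# Rough reconstruction of the cache-volume estimate (illustrative assumptions only).
http_requests  = 21_990_877   # total HTTP requests from the Squid summary
cache_hits     = 3_812        # cache hits from the same summary

avg_request_kb = 18           # assumed average size of an ordinary HTTP request
avg_video_mb   = 20           # estimated average size of a cached YouTube video

total_http_gb  = http_requests * avg_request_kb / 1_000_000   # ~400 GB of HTTP data
cached_gb      = cache_hits * avg_video_mb / 1_000            # ~76 GB served from cache

# The article rounds the resulting share down, conservatively, to about 10 percent.
print(f"total HTTP data:   ~{total_http_gb:.0f} GB")
print(f"served from cache: ~{cached_gb:.0f} GB ({cached_gb / total_http_gb:.0%} of the total)")
```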

Even more telling is the effect these hits have on the user experience.

YouTube streaming data, although not the majority of data on this customer's system, is very time-sensitive while at the same time being very bandwidth-intensive. The subtle boost made possible by caching 10 percent of the data on this system has a discernible effect on the user experience. Think about it: if 10 percent of your experience on the Web is video, and you were resigned to it timing out and bogging down, you will notice the difference when those YouTube videos play through to completion, even if only half of them come from cache.

For a more detailed technical overview of NetEqualizer YouTube caching (NCO) click here.
