Caching solutions come in all shapes and sizes, and all of them exist to speed up Internet data retrieval. From your desktop keeping a local copy of the last web page viewed, to your cable company keeping an entire library of Netflix movies, there is broad diversity in the scope and size of caching solutions.
So, what is the biggest caching server out there? Moreover, if I found the world’s largest caching server, would it store just a tiny, microscopic subset of the total data available from the public Internet? Is it possible that somebody has actually cached the entire Internet? A caching server the size of the Internet seems absurd, but I decided to investigate anyway, and so, with an open mind, I set out to find the biggest caching server in the world. Below I have detailed my research and findings.
As always, I started with Google, but not in the traditional sense. If you think about Google, they seem to have every public page on the Internet indexed. That is a huge amount of data, and I suspect they are the world’s biggest caching server. Asserting that Google is the world’s largest caching server seems logical, but somewhat hollow and unsubstantiated, so my next step was to quantify my assertion.
To figure out how much data Google actually stores, I took a weird twist of logic: I figured the best way to estimate the size of the stored data would be to determine what data is not stored in Google.
I would need a good way to stumble onto some truly random web pages without using Google to find them, and then test whether Google knew about those pages by asking it to search for unique, deep-rooted text strings within those sites.
Rather than ramble too much, I’ll just walk through one of my experiments below.
To find a random Web site, I started with one of those random web site stumblers. As advertised, it took me to a random web site titled “Finest Polynesian Tiki Objects”. From there, I looked for unique text strings on the Tiki site. The idea here is to find a sentence of text that is not likely to be found anywhere but on this site; in essence, something buried deep enough that it is not a deliberately indexed title already submitted to Google. I poked around on the Tiki site and found some seemingly innocuous text on their merchant page: “Presenting Genuine Witco Art – every piece will come with a scanned”. I put that exact string in my Google search box and, presto, there it was.
Wow, it looks like Google has this somewhat random page archived and indexed, because it came up in my search.
A single data point is not enough to extrapolate from and draw conclusions, so I repeated my experiment a few more times. Here are more samples of what I found.
Try number two.
Random Web Site
Search String In Google
“For booking or general whatnot, contact Bob. Heck, just write to say hello if you feel like it.”
It worked again: Google found the exact page from a search on a string buried deep on the page.
And then I did it again.
And again Google found the page.
My conclusion, at least from these samples, is that Google has cached close to 100 percent of the publicly accessible text on the Internet. In fairness to Google’s competitors, they also found the same Web pages using the same search terms.
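For anyone who wants to put a number on how far a handful of lucky searches can be stretched, here is a minimal sketch. It is my own illustration, not part of the experiment itself, and it assumes the sampled pages are truly random and independent. The function name and the 95 percent confidence level are hypothetical choices. It asks: if every one of n random pages was found, what is the smallest coverage fraction that is still plausible?

```python
def coverage_lower_bound(trials, confidence=0.95):
    """Smallest coverage fraction still consistent with `trials` hits in a row.

    Assumes independent random samples that were ALL found. If the true
    coverage is p, the chance of `trials` straight hits is p ** trials,
    so the bound is the p at which that chance drops to (1 - confidence).
    """
    if trials < 1:
        raise ValueError("need at least one trial")
    return (1.0 - confidence) ** (1.0 / trials)


# Illustrative sample sizes only; the post describes a handful of searches.
for n in (3, 10, 59):
    print(f"{n:>3} hits in a row -> coverage is at least {coverage_lower_bound(n):.1%}")
```

Under these assumptions, three straight hits only rule out coverage below roughly 37 percent, and it takes about 59 straight hits to support a claim of 95 percent or better.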
So how much data is cached in terms of a raw number?
There are plenty of public statistics for the number of Web sites and pages connected to the Internet, and there is also data detailing the average size of a Web page. What I have not determined is how much of the video and images are cached by Google. I do know they are working on image search engines, but for now, to be conservative, I’ll base my estimates on text only.
So, roughly, there are 15 billion Web pages, and the average amount of text per page is 25 thousand bytes. (Note that most of the Web is video and images; text is actually a small percentage.)
So to get a final number, I multiply 15 billion (15,000,000,000) by 25 thousand (25,000), and I get…
375,000,000,000,000 bytes cached…
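If you want to check the multiplication, or plug in your own guesses, here is a minimal sketch of the same back-of-the-envelope estimate. The two inputs are the rough figures quoted above, and the constant and function names are just labels I made up for illustration.

```python
# Rough inputs taken from the estimate above; both are approximations.
PAGES = 15_000_000_000         # about 15 billion web pages
TEXT_BYTES_PER_PAGE = 25_000   # about 25 thousand bytes of text per page


def cached_text_bytes(pages, bytes_per_page):
    """Total bytes of cached text if every page is stored once."""
    return pages * bytes_per_page


total = cached_text_bytes(PAGES, TEXT_BYTES_PER_PAGE)
print(f"{total:,} bytes")
print(f"{total / 10**12:,.0f} terabytes (decimal, 10^12 bytes each)")
```

In decimal units that works out to 375 terabytes of text, before counting any video or images.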
Notice that the name of the site or the band does not appear in my search string. There is nothing to tip off the Google search engine about what I am looking for, and presto!