Google’s index is smaller than we think – and might not grow at all
Google’s index size is not as straightforward as it might seem, and SEOs need to understand why.
There are many false claims on the internet about Google having trillions of pages in its index (example: https://www.tennessean.com/story/money/tech/2014/05/02/jj-rosen-popular-search-engines-skim-surface/8636081/). That’s not correct.
The problem is a small but important detail: the difference between the pages Google knows about and the pages Google actually indexes.
Crawling versus Indexing
Google discovered 130,000,000,000,000 (130 trillion) web pages in 2016. I also found figures for how many pages Google had discovered in 2008 and 2013. Plotted on a chart, the number of discovered pages seems to grow exponentially.
Google discovers a lot of links but doesn’t crawl all of them.
“Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!
We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.“
That’s the challenge I describe in “The problem with spam and search”: the internet is full of trash.
“With hundreds of billions of webpages in our index serving billions of queries every day, perhaps it’s not too surprising that there continue to be bad actors who try to manipulate search ranking. In fact, we observed that more than 25 Billion pages we discover each day are spammy.“
In the post, I come to the conclusion that roughly 30% of the pages Google discovers are spam.
So, we can agree that “found links” doesn’t equal “index size”. Google’s index is actually much smaller than the number of discovered links.
“The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size. It’s like the index in the back of a book – with an entry for every word seen on every webpage we index. When we index a webpage, we add it to the entries for all of the words it contains.” (Source: https://www.google.com/search/howsearchworks/crawling-indexing/)
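To get a feel for the gap between discovered and indexed pages, here is a rough back-of-the-envelope sketch. It compares the 130 trillion discovered pages from 2016 with the "hundreds of billions" in the index. The lower and upper bounds for "hundreds of billions" are my own assumption, since Google gives no exact number, and the two figures come from different points in time, so treat the result as an order of magnitude only.

```python
# Back-of-the-envelope comparison of discovered vs. indexed pages.
DISCOVERED_PAGES = 130e12  # 130 trillion pages Google knew about (2016 figure)

# "Hundreds of billions" is not an exact number; these bounds are assumptions.
INDEX_LOW = 100e9
INDEX_HIGH = 999e9


def indexed_share(indexed: float, discovered: float) -> float:
    """Return indexed pages as a percentage of discovered pages."""
    return indexed / discovered * 100


low = indexed_share(INDEX_LOW, DISCOVERED_PAGES)
high = indexed_share(INDEX_HIGH, DISCOVERED_PAGES)
print(f"Indexed share of discovered pages: {low:.2f}% to {high:.2f}%")
# prints "Indexed share of discovered pages: 0.08% to 0.77%"
```

Under these assumptions, fewer than 1 in 100 discovered pages would make it into the index.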
But how fast is its index growing? And is it growing at all?
Index size is not the goal anymore
I took the data from 3 sources to reconstruct Google’s index growth over the last 20 years. The index doesn’t seem to grow exponentially; its growth looks roughly linear.
However, I don’t think Google’s index can scale indefinitely. It grew 26x from 2000 to 2006. If Google’s public data is correct, though, it grew only 3.8x from 2008 to 2017. So, growth is slowing down. Google is seeing diminishing returns (maybe intentionally, maybe not).
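The two multiples cover periods of different length, so they are easier to compare as yearly rates. The sketch below converts each one into a compound annual growth rate. It assumes steady compounding within each period, which is a simplification of mine and not something the underlying data claims.

```python
def annual_growth_rate(multiple: float, years: int) -> float:
    """Compound annual growth rate (in percent) implied by a total growth multiple."""
    return (multiple ** (1 / years) - 1) * 100


# Multiples and years as quoted in the text above.
early = annual_growth_rate(26, 2006 - 2000)   # 26x over 6 years
late = annual_growth_rate(3.8, 2017 - 2008)   # 3.8x over 9 years

print(f"2000-2006: about {early:.0f}% per year")
print(f"2008-2017: about {late:.0f}% per year")
```

Under that assumption, the index grew by roughly 72% per year in the early period and by roughly 16% per year in the later one. A falling percentage rate like this is what you would expect from growth that is closer to linear than exponential.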
As I wrote in “The end of crawling and the beginning of API Indexing”: “Google needs to keep the index as small as possible while making sure it includes only the best results. Think about it. Having a huge index is just a vanity goal. The quality of indexed results is what matters. Anything else is inefficient.”
This seems to be confirmed by a paper from 2016 called “Estimating search engine index size variability: a 9-year longitudinal study”. In the study, researchers spent 9 years (!) observing the indexes of Google and other search engines. What they found is that search engine indexes don’t grow linearly. They might not grow at all.
“The largest peak in the Google index estimates is about 49.4 billion documents, measured in mid-December 2011. Occasionally, estimates are as low as under 2 billion pages (e.g. 1.96 billion pages in the Google index on November 24, 2014), but such troughs in the graph are usually short-lived, and followed by a return to high numbers (e.g., to 45.7 billion pages in the Google index on January 5, 2015).“
Note how the Panda updates decrease Google’s index size. I think that’s why Google stopped reporting the index size: they realized it didn’t matter as much as index quality. They found all the thin and low-quality content in the index and decided to prioritize quality over size.
There was a time when the index size was a huge indicator of quality for search engines. But those times are long over.
Google’s mission is “to organize the world’s information and make it universally accessible and useful.” – not to index the world’s information.