The end of crawling and the beginning of API indexing
Google became the most successful startup in history by crawling the web, building an index of web pages, and ranking them based on popularity. Now I see signs of a potential paradigm shift from crawling to indexing APIs. In the future, search engines might not come out to get content. Webmasters might bring it to them.
The exploding growth of the web, Google’s recent indexing problems, and Bing’s and Google’s developments in this space lead me to think that crawling the web might eventually be replaced by indexing APIs.
To prepare, webmasters should start experimenting with the indexing APIs that Bing and Google already offer.
The goal of search engines, at least for Google, is to “organize the world’s information and make it universally accessible and useful”. But the web is exploding in size. Google discovers hundreds of billions of web pages, most of which are spam. It’s also tricky to adapt to all the different coding frameworks in use. And lastly, crawling the web is not cheap.
Google needs to keep the index as small as possible while making sure it includes only the best results. Think about it. Having a huge index is just a vanity goal. The quality of indexed results is what matters. Anything else is inefficient.
That’s why Google doesn’t like spending much time crawling and rendering low-quality sites. They just fill the index with trash. That’s why pruning works well. It’s also the basis for the idea of crawl budget.
For now, the classic four-step search engine process still consists of crawl > render > index > rank. In the future, I see the first two steps significantly simplified by indexing APIs, and I have 4 good reasons for this thesis.
4 good reasons to use indexing APIs over crawling
Crawling is an essential part of information retrieval and the success of search engines. Why would they change their ways?
Spam was a problem for search engines from the start. As I wrote in the problem with spam and search, spam can be a lethal threat for Google because it wastes so many crawl resources and provides a bad experience for searchers. On top of that, spammers are getting more sophisticated. Google’s algorithms need to keep up.
Link spam was always one of the biggest issues. Now that Google is becoming better at understanding semantics and user satisfaction, they rely less and less on links for ranking. However, they do still rely very much on links for indexing.
Indexing APIs could solve a big part of the spam issue because they create a bottleneck. Indexing is more controllable. And which spammer in their right mind would submit spam straight to Google? That’s like a thief trying to steal from the police.
Search engines could use certain signals to decide which content to accept and which sources to throttle to prevent API spam, such as:
- Verified age of the account
- Site impressions
- Quality of submitted content
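To make the idea tangible, here is a minimal sketch of how such signals could be turned into a daily submission quota. Everything in it is an assumption for illustration: the `SiteSignals` fields, the weights, the 0.2 quality threshold, and the base and cap values are made up and do not describe how any search engine actually throttles submissions.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Hypothetical per-site signals; none of these names come from a real API."""
    verified_days: int       # days since the site was verified
    daily_impressions: int   # search impressions per day
    quality_score: float     # 0.0 (spam) to 1.0 (excellent), from past submissions


def daily_quota(signals: SiteSignals, base: int = 100, cap: int = 10_000) -> int:
    """Return how many URLs a site may submit today (illustrative weights only)."""
    # Sites with a history of low-quality submissions are throttled to zero.
    if signals.quality_score < 0.2:
        return 0
    # Older verified accounts earn more trust, saturating after one year.
    age_factor = min(signals.verified_days / 365, 1.0)
    # More impressions mean more headroom, saturating at 100k per day.
    reach_factor = min(signals.daily_impressions / 100_000, 1.0)
    trust = signals.quality_score * (0.5 * age_factor + 0.5 * reach_factor)
    return min(cap, base + int(trust * (cap - base)))
```

In this toy model, a brand-new site starts at the base quota, and a long-verified, high-traffic site with clean submissions reaches the cap. The point is the bottleneck: the search engine decides how much each source may push.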
Fewer rendering issues
Indexing APIs could be a solution because they provide webmasters the opportunity to submit the fully rendered HTML. Search engines wouldn’t have to worry about rendering as much.
This could open a vulnerability for cloaking but in the end, it’s the same challenge Google faces with dynamic rendering today. Google seems to be able to solve it.
Several factors decide how often and what Google crawls, for example, the popularity of a URL and how often it changes (source).
“As the web grows to many billions of documents and search engine indexes scale similarly, the cost of re-crawling every document all the time becomes increasingly high.” (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34570.pdf)
But since 2019, we’ve seen recurring indexing bugs.
Part of the challenge is Google’s transition to mobile-first indexing. Even though Google confirmed that there is one single index, the needed crawl resources must have grown significantly because Google needs to assess both versions (desktop and mobile) of a site.
Indexing APIs would be much more resource-sparing. Google wouldn’t have to ping servers, figure out the canonical state of a URL, or follow robots.txt directives. Schedulers wouldn’t have to figure out how often to come back to crawl a URL. They just render, index, and rank the content that webmasters want to be indexed.
In 2011, Jeff Dean gave a presentation you should watch. He explains the complicated architecture of Google’s indexing services and the challenges of storing the web.
Jeff mentions some factors that impact a search engine’s index size:
- # of documents
- Queries processed per day
- Metadata (additional data about documents)
- Update frequency
- Average crawl time per document
I tried to come up with a rough idea of how much money Google spends on crawling and indexing the web (not even speaking about rendering and ranking) but failed to do so. It’s simply too complex. But there’s another way.
In 2012, Michael Nielsen spent $580 on crawling 250m web pages, but back then the average web page weighed about 310 KB and the SERPs were a lot closer to 10 blue links than they are nowadays.
Instead, I looked at Google’s data center CapEx (capital expenditures), meaning how much Google reports spending on data centers. This gives us a rough estimate of the cost of the whole process: building and filling an index, rendering pages, ranking them, and everything that involves images, maps, and other search verticals.
In 2018, Google spent $9b on data centers out of a total CapEx of $25.5b (source), so roughly 35% went into data centers. They planned to spend $13b in 2019 but in the first quarter, CapEx turned out to be $21.8b, which means data center costs were around $7.6b, already more than half of what was planned. To be clear, I’m not sure how much of that $7.6b went into data centers for Google Cloud versus Google Search.
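For anyone who wants to retrace the numbers, here is the back-of-the-envelope math as a short script. It only uses the figures quoted above; the one assumption (also marked in the code) is that the rounded 2018 data center share of 35% carries over to 2019.

```python
# Figures quoted in the paragraph above, in billions of US dollars.
DATA_CENTER_SPEND_2018 = 9.0
TOTAL_CAPEX_2018 = 25.5
CAPEX_2019_SO_FAR = 21.8
PLANNED_DATA_CENTER_SPEND_2019 = 13.0

# Share of 2018 CapEx that went into data centers.
share_2018 = DATA_CENTER_SPEND_2018 / TOTAL_CAPEX_2018

# Assumption: the 2018 share (rounded to 35%) also holds for 2019.
estimated_2019 = 0.35 * CAPEX_2019_SO_FAR

# How far along the planned 2019 data center budget that estimate already is.
share_of_plan = estimated_2019 / PLANNED_DATA_CENTER_SPEND_2019

print(f"2018 data center share: {share_2018:.1%}")
print(f"Estimated 2019 data center spend: ${estimated_2019:.1f}b")
print(f"Share of the 2019 plan: {share_of_plan:.0%}")
```

The script lands on a 2018 share of about 35%, an estimate of roughly $7.6b, and a bit under 60% of the planned $13b.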
When COVID hit the world, Google announced it would slow hiring and spend less on data centers. The announcement came with a little detail that says a lot. Pichai, Alphabet’s CEO, said:
“We are reevaluating the pace of our investment plans for the remainder of 2020. Beyond hiring, we continue to invest, but will be recalibrating the focus and pace of our investments in areas like data centers and machines, and non business essential marketing and travel.”
Recalibrating investments in data centers? That’s core infrastructure!
Here’s what I think: COVID provided a good opportunity to announce actions that had to be taken either way. I think data center costs turned out to be closer to $30b in 2019 than the projected $13b. If that holds true, data center costs made up almost 19% of total revenue ($30b / $161b).
Google’s total revenue in 2019 was $161b. Gross profit was $89b, which is only a 16% increase YoY compared to the annual increase of 18% the previous year. Google needs to keep its profitability up, and saving money by replacing web crawling with indexing APIs would be one way to do that.
What SEOs can do today
The big question is “how can we prepare?” I have 4 recommendations for you.
1. Try out Bing’s indexing API
“We believe that enabling this change will trigger a fundamental shift in the way that search engines, such as Bing, retrieve and are notified of new and updated content across the web. Instead of Bing monitoring often RSS and similar feeds or frequently crawling websites to check for new pages, discover content changes and/or new outbound links, websites will notify the Bing directly about relevant URLs changing on their website. This means that eventually search engines can reduce crawling frequency of sites to detect changes and refresh the indexed content.” (https://blogs.bing.com/webmaster/january-2019/bingbot-Series-Get-your-content-indexed-fast-by-now-submitting-up-to-10,000-URLs-per-day-to-Bing)
Bing is a bit ahead of the curve on this one: they already launched an indexing API in March 2019, called the “content submission API”, with a 10,000 URL starter limit. You can expand that quota considerably but need permission on an individual basis.
Bing offers two versions of the indexing API: Adaptive URL submission and Batch Adaptive URL submission. The latter allows you to submit URLs in batches.
If you use Botify, you can leverage their partnership with Bing. The platform will submit the content for you, even beyond the 10K URL limit.
The content submission API also allows you to get pages out of the index, for example by 404ing them (not my preferred method; it should be a 410 technically).
Interesting: Bing clearly states that you can shoot more URLs through the API the longer your site is verified.
“The daily quota per site will be determined based on the site verified age in Bing Webmaster tool, site impressions and other signals that are available to Bing” (https://blogs.bing.com/webmaster/january-2019/bingbot-Series-Get-your-content-indexed-fast-by-now-submitting-up-to-10,000-URLs-per-day-to-Bing)
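If you want to try batch submission yourself, a minimal sketch could look like the one below. The endpoint and the `siteUrl` / `urlList` payload follow my understanding of Bing's Webmaster API documentation, so please verify them against the current docs before relying on them. The batch size of 500, the default quota, and the helper names are my own assumptions. The code only builds the requests and does not send anything.

```python
import json
import urllib.request
from typing import Iterator, List

# Assumption: endpoint and payload shape as I understand Bing's Webmaster API docs.
BING_ENDPOINT = "https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlbatch"


def chunk(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_batch_request(site_url: str, urls: List[str], api_key: str) -> urllib.request.Request:
    """Build (but do not send) one batch submission request."""
    payload = json.dumps({"siteUrl": site_url, "urlList": urls}).encode("utf-8")
    return urllib.request.Request(
        f"{BING_ENDPOINT}?apikey={api_key}",
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def build_all_requests(
    site_url: str,
    urls: List[str],
    api_key: str,
    batch_size: int = 500,      # assumption: check the documented per-request limit
    daily_quota: int = 10_000,  # starter limit mentioned above
) -> List[urllib.request.Request]:
    """Respect the daily quota, then split the remaining URLs into batches."""
    allowed = urls[:daily_quota]
    return [build_batch_request(site_url, batch, api_key) for batch in chunk(allowed, batch_size)]
```

Each request could then be sent with `urllib.request.urlopen(request)` once you have a real API key from Bing Webmaster Tools.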
2. Play around with Google’s (limited) indexing API
Google limits its indexing API to jobs and streaming events at the moment but specifies that the limitation is temporary:
“Currently, the Indexing API can only be used to crawl pages with either JobPosting or BroadcastEvent embedded in a VideoObject.” (https://developers.google.com/search/apis/indexing-api/v3/quickstart)
Google’s documentation also makes the difference between XML sitemaps and the indexing API clear: “We recommend using the Indexing API instead of sitemaps because the Indexing API prompts Googlebot to crawl your pages sooner than updating the sitemap and pinging Google.”
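As a rough sketch of what a call to Google's Indexing API looks like: as far as I understand the v3 documentation, you POST a small JSON notification with the URL and a type of `URL_UPDATED` or `URL_DELETED` to the `urlNotifications:publish` endpoint. The helper name and the `access_token` parameter are my own. Getting an OAuth 2.0 token for a service account with the indexing scope is not shown here. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Endpoint as I understand Google's Indexing API v3 documentation.
GOOGLE_ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"


def build_notification(url: str, access_token: str, deleted: bool = False) -> urllib.request.Request:
    """Build (but do not send) a notification that a URL was updated or removed."""
    body = {"url": url, "type": "URL_DELETED" if deleted else "URL_UPDATED"}
    return urllib.request.Request(
        GOOGLE_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: a valid OAuth 2.0 bearer token obtained elsewhere.
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )
```

Note that the notification carries only the URL and the type of change, so Googlebot still fetches the page itself afterwards.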
XML sitemaps tell a search engine that something on a URL has changed but Google still has to visit the URL and crawl it. Content submission APIs (maybe a better term for indexing APIs) allow you to send the full content to search engines.
I expect Google to roll out broader applications beyond jobs and broadcasting soon. Some people didn’t want to wait and tried submitting regular (non-job or event) content… and it works.
Tobias Willmann ran an experiment and got URLs indexed within minutes. It seems, though, that at the moment you can only get pages indexed, not removed. That’s also the experience the folks from RankMath had.
3. RankMath’s WordPress plugin
For WordPress users, RankMath’s plugin could also be a viable solution to test and eventually use Google’s indexing API. It’s very similar to what Bing’s plugin does.
This touches on an interesting point: the partnership between WordPress and Google. It deserves an article of its own, but for indexing APIs it could be a major lever: WordPress powers ~37% of the web (source). For Google, it’s a great bottleneck to plug into and cover a huge part of the web at once. I could see partnerships with hosters and domain registrars follow.
4. David Sottimano’s Node.js template
David Sottimano wrote a cool Node.js template for Google’s indexing API.
“Google’s new Indexing API support page say ‘it can only be used to crawl pages with either job posting or livestream structured data’, but of course I was curious and it turns out that we can get regular pages crawled as well, and damn fast.” (David Sottimano)
Crawling the web is not sustainable
XML sitemaps were the first steps toward a less crawl-dependent indexing process. Indexing APIs are the next step. XML sitemaps only tell search engines when a URL changed but not what changed. Indexing APIs take it to the next level by submitting the whole content to search engines.
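To illustrate the difference, here is a small sketch that puts the two side by side. The sitemap entry uses the standard `loc` and `lastmod` elements. The content submission payload is a purely hypothetical shape I made up for illustration and not the format of any real API.

```python
import json
import xml.etree.ElementTree as ET


def sitemap_entry(loc: str, lastmod: str) -> str:
    """A sitemap <url> entry: says which URL changed and when, nothing more."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(url, encoding="unicode")


def content_payload(loc: str, html: str) -> str:
    """Hypothetical content submission payload: ships the rendered HTML itself."""
    return json.dumps({"url": loc, "html": html})


print(sitemap_entry("https://example.com/a", "2020-05-01"))
print(content_payload("https://example.com/a", "<html><body>New copy</body></html>"))
```

With the first, a search engine still has to come and crawl the page. With the second, the content already sits in the request.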
I can’t see search engines stopping crawling completely, but I can see them reducing it to a minimum. Google has always been good at incentivizing webmasters to follow their demands, simply by having the finger on the traffic floodgates.
Finds of the week
Search Engine Land How to show up on Google Discover: Google’s latest guidance
Google seems to be looking at way more than just large images and a mobile friendly experience for Discover. Doesn’t shock me, I’m sure there’s some sort of relevance algorithm behind the machine.
Search Engine Journal Do We Have the Math to Truly Decode Google’s Algorithms?
Thought-provoking and interesting article on the fallacies of correlation studies, with comments from a real statistician. I was a bit hesitant to read this at first but I found a lot of value in it.
Marketing Profs An Easy Three-Step Headline Formula That Grabs Customer Attention in Just Five Minutes (With Examples)
Writing clearly is hard. This formula provides a great framework to get people to click on your headlines without coming across as too clickbaity.
Animalz Why Wirecutter Wins: Opinionated Content
The value of expert opinions is high. I notice that all the time when it comes to software: people want a recommendation from an expert, not a list of 10 things they have to pick from.
Youtube Crawl Budget: SEO Mythbusting
Very helpful crawl budget mythbusting from Martin Splitt. Found some good ones in there that help to put crawl budget into perspective.