The problem with Spam and Search

Spam is one of the biggest problems for Google and Search in general.

Search engines inevitably find a lot of spam as they crawl the web. According to Google’s latest spam report, we should assume that a large share of what they discover is actually spam.

Spam doesn’t cost much to generate but still seems to pay off (otherwise, why would people spam?).

It won’t take you long to find spam even under Google’s own carpet (see screenshot below).

I’ve worked at three companies that had to deal with this problem. At Dailymotion, we got flooded with spam. At Atlassian, public Jira, Confluence, Bitbucket, and Trello instances were loaded with spam. At G2, we don’t face it to the same extent because we vet all reviews, but some companies still try to game the system a bit.

It’s just super hard to control UGC (user-generated content) at scale if you don’t vet every instance or entry.

There is a law on the internet for this: if someone could gain an advantage by spamming a platform on the internet, it will be spammed.

This leads us to two questions:

  1. What do you do about it?
  2. How harmful is it to your business?

At Dailymotion, spam did harm our business from an SEO and user experience perspective. At Atlassian, it wasn’t really a problem (except for Trello) because most public instances didn’t drive a lot of high-quality traffic. At G2, we can identify and contain it pretty well.

For Google, it’s a lethal threat.

First, spam wastes Google’s resources. Every crawled page technically costs money. If that page doesn’t end up as a valuable result in Google’s index, it’s a waste of money.

Second, spammy results are obviously bad results for users. Sometimes they even violate copyrights. If Google became a cesspool for illegal software, movies, and other files, it could face costly lawsuits, not to mention users leaving Google for another, “cleaner” search engine.

Third, spammers get more sophisticated. The fear of AI-generated spam that Google can’t detect is real. One of the reasons Google wants to stay on top of machine learning is to tell whether content was created by a machine or a human. No wonder Google has some of the best ML researchers in the world.

Google has to figure out how to keep its index clean from spam.

Google Web Spam report

From Google:

With hundreds of billions of webpages in our index serving billions of queries every day, perhaps it’s not too surprising that there continue to be bad actors who try to manipulate search ranking. In fact, we observed that more than 25 Billion pages we discover each day are spammy. That’s a lot of spam and it goes to show the scale, persistence, and the lengths that spammers are willing to go. We’re very serious about making sure that your chance of encountering spammy pages in Search is as small as possible. Our efforts have helped ensure that more than 99% of visits from our results lead to spam-free experiences.

First off, 25 billion spammy pages per day is quite the number. So, let’s do some quick math to figure out how much spam Google is likely to fight.

25 billion spam pages per day make 750 billion per month (25b * 30).

By 2013, Google had discovered 30 trillion pages; by 2016, 130 trillion. That’s 100 trillion new pages in three years, so Google discovers a bit over 30 trillion pages per year (probably more, given its technology gets more efficient over time) or, rounded down, about 2.5 trillion per month. 750 billion divided by 2.5 trillion is 0.3, which means that roughly 30% of the pages Google discovers are spam. The numbers aren’t perfect, but I assume we can say spam grows faster than “good content”. And it makes sense logically because good content takes much longer to create than spam.
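To make the quick math easy to check, here is a minimal sketch in Python. The inputs are the figures quoted in this article; the 30-day month, the even spread of growth between 2013 and 2016, and all variable and function names are my own simplifying assumptions, not anything Google publishes.

```python
# Back-of-the-envelope estimate: what share of newly discovered pages is spam?
# Inputs come from the article; the 30-day month and the even growth between
# 2013 and 2016 are simplifying assumptions.

SPAM_PAGES_PER_DAY = 25e9      # "more than 25 Billion pages" per day
PAGES_KNOWN_2013 = 30e12       # 30 trillion
PAGES_KNOWN_2016 = 130e12      # 130 trillion
DAYS_PER_MONTH = 30


def monthly_spam_share(pages_per_year: float) -> float:
    """Spam pages per month divided by pages discovered per month."""
    spam_per_month = SPAM_PAGES_PER_DAY * DAYS_PER_MONTH
    discovered_per_month = pages_per_year / 12
    return spam_per_month / discovered_per_month


# Rounded input used in the text: 30 trillion pages per year.
rounded_share = monthly_spam_share(30e12)

# Unrounded input: 100 trillion new pages spread over three years.
pages_per_year = (PAGES_KNOWN_2016 - PAGES_KNOWN_2013) / 3
unrounded_share = monthly_spam_share(pages_per_year)

print(f"rounded:   {rounded_share:.0%}")
print(f"unrounded: {unrounded_share:.0%}")
```

With the rounded input the share comes out at 30%; with the unrounded one it is closer to 27%. Either way, the order of magnitude is the same.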

Sorting through all that spam is a major challenge for Google.

In 2019, we generated more than 90 million messages to website owners to let them know about issues, problems that may affect their site’s appearance on Search results and potential improvements that can be implemented. Of all messages, about 4.3 million were related to manual actions, resulting from violations of our Webmaster Guidelines.

4.3 million manual actions sounds small in comparison to 25 billion pages but very large when we take “manual” literally. There’s no way a team of humans sends out that many messages, which means Google must use some sort of automation.
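A rough sketch of why “manual” can’t be taken literally, again in Python. The two totals come from the quote above; spreading them evenly over 365 days, around the clock, is my own assumption for illustration.

```python
# How many manual-action messages would have to go out per day and per minute?
# Totals are from Google's 2019 figures quoted above; the even spread over
# the year is an assumption for illustration only.

MESSAGES_TOTAL_2019 = 90e6            # all messages to website owners
MANUAL_ACTION_MESSAGES_2019 = 4.3e6   # messages related to manual actions

manual_share = MANUAL_ACTION_MESSAGES_2019 / MESSAGES_TOTAL_2019
per_day = MANUAL_ACTION_MESSAGES_2019 / 365
per_minute = per_day / (24 * 60)

print(f"share of all messages: {manual_share:.1%}")
print(f"per day:               {per_day:,.0f}")
print(f"per minute:            {per_minute:.1f}")
```

That works out to nearly 12,000 messages a day, or about eight every minute, nonstop.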

Algorithms are not enough

From Google:

Working on improvements to our language understanding and other search systems is only part of why Google remains so helpful. Equally important is our ability to fight spam. Without our spam-fighting systems and teams, the quality of Search would be reduced--it would be a lot harder to find helpful information you can trust.

Google does admit to relying on humans for certain kinds of spam. Machines can still be tricked, and machine learning is not yet good enough to catch every trick.

One of the best pieces of evidence is John Mueller’s recent response to a Tweet:

If time doesn’t play a role for spam, I wonder how good Google’s anti-spam algorithms really are. Shouldn’t they be able to “devalue” spam just like Google claims to devalue bad links? If so, why should webmasters take action?

For marketers, all there is to do is make it as easy as possible for Google to crawl your site. It’s likely Google will eventually catch on to link schemes and other spam tactics. It’s unlikely they will pull back on investing in spam-fighting methods, which means they will get smarter, which means they will detect schemes… eventually. “When” is another question.