What makes content valuable in an AI world?

Last week, three things happened that are all related:

1/ The editor-in-chief at Tom’s Hardware, Avram Piltch, published an op-ed arguing that SGE is a “50-megaton bomb” that will destroy the web ecosystem if launched in its current form. [link]

2/ The Verge published an article about how SEO has flooded the web with garbage text. [link]

3/ 8,000 Subreddits went dark to protest platform changes. [link]

I highly recommend reading all 3 articles because they circle around the same problem: the value of content when anyone can use AI to create it.

It’s no accident that these three things are happening simultaneously. The web is facing massive disruption from AI. In a world where anyone can create content with AI, the big question is what still makes content valuable.

If you’re short on time, just scroll to the “Bringing it all together” section at the end. If you’re curious about the details, I picked each article apart and commented on it.

SGE is not ready to go live

Avram Piltch, editor-in-chief at Tom’s Hardware, makes several good points (and a few less good ones) about the potential impact of SGE (bolding mine):

If Google were to roll its SGE experience out of beta and make it the default, it would be detonating a 50-megaton bomb on the free and open web. Many publishers, who rely on Google referrals for the majority of their visits, would fold within a few months. Others would cut resources and retreat behind paywalls. Small businesses that rely on organic search placement to sell their products and services would have to either pay for advertising or, if they cannot afford it, close up shop.

Remember the last sentence because we’ll return to it in the next article.

I agree the effect of SGE on the web ecosystem could be devastating. Companies could, for the first time, seriously consider taking their sites off Google. Why stay in Google’s index when Google uses your content for direct answers and sends you no traffic in return?

The core problem is that websites are no longer front and center in SGE - the AI answer is:

By “putting websites front-and-center,” Google is referring to the block of three related-link thumbnails that sometimes (but not always) appear to the right of its SGE answer. These are a fig leaf to publishers, but they’re not always the best resources (they don’t match the top organic results) and few people are going to click them, having gotten their “answer” in the SGE text.

The quality of linked sites in the SGE carousel is not great. This is an area where Google needs to tighten the screws.

Some information is factually incorrect, which is very risky for medical, legal or financial queries:

I highlighted the text in blue because it is dangerously wrong. Google’s bot says that “the American Cancer Society recommends that men and women should be screened for colorectal cancer starting at age 50.” However, the American Cancer Society’s own website says that screenings should start at age 45, so this misleading “fact” probably came from elsewhere.

Giving medical advice without a license is illegal, and Google is clearly giving medical advice here. I expect Google to avoid heavily regulated YMYL topics in the final launch.

Another example is Piltch’s poor experience searching for “best GPU” in SGE, which spits out a mix of wrong and unrelated information.

When I repeated the search for “best GPU”, which Piltch criticized for featuring low-authority results in the AI carousel, I actually found Tom’s Hardware in second place in the carousel - but with a review of the Gigabyte RTX 4090 instead of the best GPU article.

Google SGE results for "best GPU"

Because of how SGE works, a poor choice of corroborating results leads to a poor AI answer. I’m not yet sure why that’s happening: by default, Google should pick high-quality results to form AI answers, but that’s clearly not the case here.

When I searched “best GPU” on Bing chat, I got a good reference to Tom’s Hardware and a much better answer. It’s to the point, lists actual GPUs and has references to websites in the answer.

Bing AI results for "best GPU"

SGE also tends to return answers word for word from the corroborating results, aka the cited websites:

Even worse, the answers in Google’s SGE boxes are frequently plagiarized, often word-for-word, from the related links. Depending on what you search for, you may find a paragraph taken from just one source or get a whole bunch of sentences and factoids from different articles mashed together into a plagiarism stew.

This is a tricky one. Where does plagiarism start, and where does it end? Meta descriptions and Featured Snippets, for example, are also copied word for word from websites. But there it’s clear which site they come from because the link is right above or below.

And this plays into one of the biggest problems the SGE beta has right now: corroborating results aren’t enough as attribution. Google needs to embrace what Neeva, You.com, and Bing are doing: put a citation right in the answer. (By the way, Bard still doesn’t include citations either.)

Where is the added value in SGE? What does the AI answer provide that users can’t get on a website?

Instead of taking content from a site and showing it to users, Google could simply send users to the right passage on a page with the answer to what they’re looking for - as they already do with Featured Snippets.
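
That kind of deep link is already technically possible: the Text Fragments URL syntax lets a link point to a specific passage on a page, which the browser then scrolls to and highlights. A made-up example (domain and passage are purely illustrative):

    https://www.example.com/best-gpus/#:~:text=the%20best%20graphics%20card%20for%20most%20people

Featured Snippets already use this mechanism in Chrome, so SGE could point to the passage instead of reproducing it in its own answer box.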

My take is that Google wants to combine information from several sites to form one best answer. But, again, if one site already gives all the information, the only added value is saving users a click - which is not nothing (removing friction matters) but comes at the cost of damaging publisher ad revenue. It’s much more likely that Google feels threatened by ChatGPT and the new Bing and sees no other way but to give answers directly.

Piltch also makes false assumptions about how good LLMs can actually get and how hard it is to define authority:

No matter how advanced LLMs get, they will never be the primary source of facts or advice and can only repurpose what people have done.

This is a common misconception about LLMs and generative AI. The argument is basically that LLMs just regurgitate what humans have already written and play it back in a different way. But if that were the case, why do LLMs hallucinate? Hallucinations happen precisely because models generate new combinations instead of copying existing text. LLMs learn and replay information in a way similar to humans: we also learn, connect, and apply knowledge.

Where Piltch is right is that generative AI can’t go out and test-drive a car (yet) or install a GPU in a computer and benchmark it (yet). But you can bet that generative AI can become much better at reporting trends in the stock market, interpreting data, or recognizing patterns in human behavior - all of which produce first-party facts.

It makes total sense that someone who has been reviewing CPUs for 15 years on a website that specializes in CPUs should have their AMD Ryzen review rank higher than someone with no authority on the topic.

This hints right at the problem: what is authority? Is it tenure, or time spent on the job? Shouldn’t it simply be the thoroughness of the review and the quality of the arguments? I do understand that years of experience can matter, but should they be all that matters?

Piltch’s argument shows how difficult it is to measure the fuzzy concept of authority.

But one statement he makes early on in the article captures the Zeitgeist of Search and content on the web:

For example, when I searched “best bicycle,” Google’s SGE answer, combined with its shopping links and other cruft took up the first 1,360 vertical pixels of the display before I could see the first actual search result.

Piltch complains about ads and “cruft” on Google, but Tom’s Hardware does exactly the same. It’s so overloaded with ads that I couldn’t even read the piece on my smartphone. The page crashed before I was able to get all the way to the end because a huge ad overlay with all sorts of links to dubious articles popped up on my screen.

When I used my laptop to read the piece, I was asked if I wanted to enable browser notifications (no), and right away a huge popup asked me if I wanted to sign up for the email newsletter. Can I just read the content, please?

The ad load on sites like Tom's Hardware is massive

Tom’s Hardware is not alone. Publishers share the same problem as Google: to grow, they must show more ads over time. As a result, the user experience suffers.

But on the publisher side, only a few big ones, like the New York Times or the Wall Street Journal, can make the alternative - charging for content - work. The long tail of publishers struggles to charge for content. Journalists are flocking to Substack: what’s the incentive to write for a newspaper when they can charge readers directly?

Neeva tried charging searchers a subscription but closed up shop because users are habituated to searching on Google, and Google has massive distribution advantages: the Apple default-search deal and owning Chrome, Gmail, and YouTube.

What can Google do against the flood of poor AI content?

Mia Sato published an article on The Verge about companies that depend on Google traffic and have to optimize for it (I chose not to comment on some of the one-dimensional coverage of SEO).

Many small businesses (and large ones) depend on SEO traffic because advertising isn’t profitable for them (bolding mine):

Search is more essential than it ever has been for Get Bullish. Facebook used to account for a significant amount of Get Bullish profit, but after Apple introduced the “Ask App Not to Track” option in 2021, ads on the social media platform are no longer profitable. Dziura still runs Facebook ads, but they break even at best, she says. The Get Bullish app is available to shoppers, too, but Google Search is essential to the business.

If Search traffic went away because of SGE, it could indeed have severe consequences for the web, as Piltch wrote in the previous article, because there are no alternatives.

Site owners optimize their sites because it works:

Dziura’s DIY SEO work is working for her in some regards. A Google search for “feminist gifts” surfaces Get Bullish halfway down the first page of results, below Amazon and SEO-bait lists by Cosmopolitan and Town & Country Magazine but above competing small businesses. People searching for categories of items like “funny kitchen towels” and “inappropriate socks” land on Get Bullish, in addition to the shoppers looking for the shop by name.

A big part of site optimization revolves around content. But what happens when everyone can create good content with AI tools?

Online shoppers will increasingly encounter computer-generated text and images, likely without any indication of AI tools

This is exactly what will happen and has already started happening: everyone will use AI tools. As a result, Google will have a harder time ranking search results because so many sites will have good content. Everyone will skyscrape each other - copy competing content and slightly outdo it - all the time.

8,000 Subreddits went dark to protest against API price increases and tight deadlines, and it’s noticeable in Google Search:

Over 8,000 subreddits have gone dark to protest Reddit’s upcoming API changes, and it’s shown me just how much I rely on Reddit to find useful, human-sounding information in my Google search results.

It’s a well-known trend that people search on Google for answers on Reddit. So many users append “reddit” to their queries that Google embeds Reddit answers for certain queries right in the search results:

With Google’s generally poor search results nowadays, appending “reddit” has long been the default way I search for almost anything (and no, I’m not ready to get my info from an AI chatbot, either). But given the sheer volume of subreddits that are currently unavailable — including some of the most-subscribed subreddits — clicking through many Reddit links in search results takes me to a message saying the subreddit is private.

And even if you don’t rely on the Reddit trick like I do, Reddit links often show up at the top of search results anyway, meaning that many people who don’t regularly use the platform have probably found some useful information on the site.

The reason is not just that Google Search has become over-commercialized but that Search results are sorted algorithmically, not by humans. On Reddit, you have well-moderated subforums (Subreddits) that surface and discuss valuable content. Reddit is the qualitative, human-curated counterpart to Google’s quantitative, algorithmic results.

Community vetting and validation are something neither SGE nor ChatGPT can provide from scratch:

Sure, Google can provide me answers for any one of those needs. Other sites have great guides for Tears of the Kingdom. Google surfaced some potentially-useful videos for my pocket door problem (on YouTube, of course). And searching “best new music” brought up many lists I could look through.

But none of those have the conversational and community elements that makes Reddit so dang useful. I like perusing the comments below a post to see other recommendations, points of view, and other links to relevant resources, and then seeing other people discuss the merits of those additions to the thread.

Together with user verification - something all platforms are currently investing in - UGC platforms have an edge over Google Search, which depends more on website content.

Bringing it all together

The reason for Reddit’s aggressive API price hike is AI. As the OG forum of the web, Reddit was never able to monetize its treasure trove of valuable content. But now most large LLMs are trained on Reddit’s content, and Reddit wants to capture some of that value.

Over the last few years - and especially now that we’re moving out of a zero-interest-rate economy - many companies, Twitter and many big publishers among them, have run into the same issue: they have good content but a hard time monetizing it. On top of that, their data is being used to train LLMs.

As I wrote in AI copyright could lead to new Marketing opportunities:

Of the 45 terabytes of text GPT-3 was trained on, 60% came from Common Crawl, 22% from WebText2 (which is built from outgoing links on Reddit), 8% from books and 3% from Wikipedia. In other words, the majority of input to GPT-3 and other generative AI comes from the open web.

Now that Google might use their content word for word, big platforms and publishers see their business models threatened.

Content is increasingly hard to monetize. Tom’s Hardware and most other publishers are plastered with ads these days because it’s getting harder to grow revenue over time. Only a few big publishers, like the New York Times or the Wall Street Journal, manage to charge consumers directly.

Right now, Google captures most of the value through ads and sheer habit. In September 2022, they even launched a feature called “Discussions and forums”, which shows answers from Reddit (and others) in a highlighted format for certain queries. [link]

Many companies - SMBs, DTC companies, publishers - wouldn’t be profitable without SEO traffic, as Mia Sato points out in her article about garbage content. But SGE is another level of value capture for Google, one that might leave too little for the web ecosystem.

Google, on the other hand, is facing competition from Bing/ChatGPT and a flood of AI-generated content that is good enough to dilute content as a ranking signal. If Google doesn’t figure out how to give good AI answers while still sending traffic to the web, its business model might be in jeopardy as well. A good start for improving the current version would be adding references, reducing plagiarism, and picking better results for the AI carousel.

But what the open web also needs is a) a way to opt out of having your data used for training, b) a meta tag to be excluded from SGE, and c) more transparency from the big LLM developers about where their training data comes from, so site owners can choose not to provide their data for free.
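
To make a) and b) concrete, here is a rough sketch of what such controls could look like if they followed the existing robots.txt and robots meta tag conventions. The user-agent and directive names are purely illustrative - nothing like this is officially supported today:

    # robots.txt - hypothetical opt-out for AI training crawlers
    User-agent: hypothetical-ai-training-bot
    Disallow: /

    <!-- hypothetical robots meta directive to keep a page out of SGE answers -->
    <meta name="robots" content="nosnippet, no-sge">

Whether Google and the big LLM developers would actually honor directives like these is exactly where the transparency demand in c) comes in.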