AI copyright could lead to new Marketing opportunities

New copyright lawsuits against AI companies could turn the tide and empower big content platforms to train their own models.

One boring but important factor makes a big difference in the conversation about how AI might change Marketing: copyright.

Copyright lawsuits have the power to decide whether content platforms like Quora, Yelp or IMDB will benefit from generative AI or be disrupted by it.

Recently, several generative AI companies have been sued for copyright infringement. The plaintiffs claim AI models have been trained on their work without compensation or licensing in three central lawsuits:

1/ In January, a group of visual artists sued Stability AI, Midjourney and DeviantArt.

2/ The same lawyers filed a class action lawsuit against GitHub for scraping copyrighted source code in November.

3/ Getty Images filed a separate class action suit against Stability AI, accusing the makers of Stable Diffusion of having scraped 5 billion images from sites like Getty, Shopify, Tumblr and Flickr without a license.

Some images generated by Stable Diffusion carried the Getty Images watermark, indicating the model could have been trained on copyrighted images. It's hard to verify whether that happened, since a trained AI model has no underlying database of source images to look up. An analysis of a sample of Stable Diffusion’s dataset concluded that about 1% of it might have come from watermarked photos. (source)

So far, we have assumed OpenAI, Stability AI and other makers of generative AI could dethrone big content platforms. But if the copyright lawsuits establish that generative AI makers need to pay for training data, the power would shift to content platforms. Aggregators like Google would lose power since content platforms could train models on their own data and provide superior AI experiences.

Who owns AI training data?

Content ownership is a complex problem. Google makes money with website content but doesn't own it, and webmasters can opt out of showing up in search results. UGC platforms like Facebook, Yelp or IMDB don't technically own reviews but can use them in many ways as specified in their terms of service (which no one reads). Since user-generated content is openly available and can be found by crawling the web, it's a tricky question whether AI companies should be allowed to train their models on it or not.

It's unclear what data DALL-E 2 was trained on, but we know GPT-3 and Stable Diffusion were trained largely on data from Common Crawl, a non-profit organization that crawls billions of pages on the web and publishes them for free. Of the 45 terabytes of text GPT-3 was trained on, 60% came from Common Crawl*, 22% from WebText 2 (which is built from pages linked to on Reddit), 16% from two book corpora (8% each) and 3% from Wikipedia. In other words, the majority of input to GPT-3 and other generative AI comes from the open web.

* fun fact: Google's Danny Sullivan is an advisor to Common Crawl

Technically, webmasters can block Common Crawl's bot in their robots.txt and opt out of having their content used to train AI models, but a different non-profit crawler could be just around the corner. Keeping up with new crawlers and managing the opt-out process is tedious.
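To make the opt-out concrete, here is a minimal sketch in Python. It assumes Common Crawl's crawler identifies itself with the user-agent token CCBot; the example.com URL and the SomeOtherBot name are made up for illustration. The robots.txt rules sit in a string, and Python's standard-library parser checks who may crawl what.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block Common Crawl's bot, allow everyone else.
# "CCBot" is assumed to be the user-agent token Common Crawl uses.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "https://example.com/reviews/1"

ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)

print("CCBot allowed:", ccbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

The second check shows the limitation: a rule that names one crawler does nothing against a new crawler with a different name, and robots.txt only works if the crawler chooses to respect it.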

If the lawsuits go nowhere, everything would go on as before. Content creators might get the chance to opt out of AI training, but for the most part, models would keep getting better. However, if the plaintiffs succeed, AI companies would either have to train their models on copyright-free content (most of which is old) or pay for licenses.

What paid training data means for big platforms

Enforcing licenses for AI training data would come with big consequences for the development of AI models and big platforms.

First, paying for training data might significantly slow down the evolution of generative AI. New funds might have to be raised, partnerships made, and strategy roadmaps adjusted. A slower pace is not necessarily bad because it would give us time to adapt laws and regulations.

Second, paying for training data makes model training even more expensive. Right now, estimates of the training cost for GPT-3 range from $4M to $12M, while others say it can be done for as little as $500,000. Training Stable Diffusion cost only $600K, and the cost seems to be going down over time as more specialized GPUs become available and energy costs trend down (source 1, source 2). But we're not there yet, and training and running AI models remain costly.

Third, companies that already have a lot of content might choose to train generative AI on their content to get a competitive advantage. Paying for training data could lead to a shift of power from the generative AI makers to data owners.

Shutterstock released a new generative image feature last week. While competitor Getty Images banned AI-generated images and sued AI companies, Shutterstock is leaning into the change. The stock image platform announced that it would compensate creators for training an OpenAI model on their content, but the monetization model is still unclear.

Another company has also announced that it will create content with the assistance of AI: Buzzfeed. In a memo, Founder and CEO Jonah Peretti wrote:

If the past 15 years of the internet have been defined by algorithmic feeds that curate and recommend content, the next 15 years will be defined by AI and data helping create, personalize, and animate the content itself. Our industry will expand beyond AI-powered curation (feeds), to AI-powered creation (content). AI opens up a new era of creativity, where creative humans like us play a key role providing the ideas, cultural currency, inspired prompts, IP, and formats that come to life using the newest technologies. (source)

Buzzfeed, though, is not a platform. It's an integrator that creates the content itself, as opposed to aggregators like Shutterstock, Yelp, IMDB or G2 that get the content from their users. When it comes to training data, aggregators have a big leg up over integrators since they have more content, in a more uniform format, that they can use to train models. Integrators, on the other hand, would be on the receiving end of AI models and likely don't have enough data to train their own.

Marketing implications from paid AI training data

If big content platforms train their own AI models on their content, each of them could release a ChatGPT-like interface that competes with OpenAI, Stability AI, Jasper and others.

For example:

  • Stock photo sites would compete with DALL-E 2, Stable Diffusion and other generative AI providers
  • Wikipedia, Quora & co would compete with ChatGPT and other AI chatbots
  • GitHub would likely remain the destination for AI-assisted coding but might get competition from Stack Overflow, Bitbucket, GitLab and other developer platforms
  • Spotify, Apple Music and other music streaming platforms could train models on their content and provide alternatives to playlists with fine-tuned recommendations based on inputs ("Play me a song like 'Die For You' but faster")
  • IMDB would train a model on its movie reviews and give users perfect movie recommendations based on their mood
  • Yelp, Tripadvisor and Google would train a model on their local reviews and give users perfect restaurant, bar and hotel recommendations based on their location and taste
  • G2 would train a model on its software reviews and give users perfect software recommendations based on their stack

Aggregators would turn into integrators by training AI on the content they have today. The platforms we thought AI would replace might become more powerful.

The big platforms might still leverage SEO and similar growth channels to drive traffic, but the customer experience would be superior because visitors could simply engage with a user interface à la ChatGPT and get their answers directly. As a result, word of mouth, direct traffic and brand awareness might increase significantly, driven by an outstanding user experience.

At the same time, if the big content platforms trained their own models and released ChatGPT-like user interfaces, Google would face significant challenges and lose even more market share for vertical-specific searches. Search as a whole could become more fragmented.