Protecting Your Content in the Age of AI

Introduction

The rise of Artificial Intelligence (AI) has brought a host of opportunities and challenges, especially for newspaper publishers who are increasingly moving their operations online. One of the most pressing issues is protecting copyrighted content from AI web crawlers like OpenAI’s GPTBot. In this blog post, we’ll explore the current landscape, how publishers are responding, and what you can do to safeguard your content.


The Current Landscape: A Quick Overview

The AI Web Crawlers

OpenAI’s GPTBot is a web crawler designed to collect data to train its popular chatbot, ChatGPT. However, it’s not the only one; other tech giants like Google and Microsoft have their own versions. These bots crawl the web to collect massive amounts of information, which is then used to train large language models (LLMs).

The Response from Publishers

According to a Business Insider article, 70 of the world’s top 1,000 websites have already moved to block GPTBot. The list includes giants like Amazon, The New York Times, and CNN. A study by Originality.ai found that 9.2% of the top 1,000 websites were blocking GPTBot within the first 14 days of its launch.

The Legal and Ethical Quandary

The use of web crawlers has raised concerns about copyright infringement. Several lawsuits are already in the works, and there’s increasing awareness about the ownership of data these crawlers use.


How Are Publishers Responding?

The Robots.txt File

Many websites are using a simple but effective tool called robots.txt to block AI web crawlers. This plain-text file, placed at the root of a site, tells crawlers which pages they may or may not crawl. OpenAI has stated that GPTBot will respect the rules set in robots.txt. Keep in mind, though, that robots.txt is advisory: it only stops crawlers that choose to honor it.
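
For example, per OpenAI’s documentation, adding these two lines to your robots.txt blocks GPTBot from the entire site:

User-agent: GPTBot
Disallow: /

You can also scope the block to particular sections of your site by listing specific paths after Disallow instead of the bare /.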

Paywall Mechanisms

According to a Digiday article, publishers are using different types of paywalls, such as JavaScript-based and CDN-based, to protect their content. At Our-Hometown, all of our websites use CDN-based, server-side paywalls to protect our customers’ intellectual property.

As a real-world example of the robots.txt approach, here is The New York Times’ robots.txt file, which blocks GPTBot, CCBot (Common Crawl’s bot), and other crawlers site-wide while still steering legitimate crawlers with granular Allow and Disallow rules:

User-agent: *
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /puzzles/leaderboards/invite/*
Disallow: /svc
Allow: /svc/crosswords
Allow: /svc/games
Allow: /svc/letter-boxed
Allow: /svc/spelling-bee
Allow: /svc/vertex
Allow: /svc/wordle
Disallow: /video/embedded/*
Disallow: /search
Disallow: /multiproduct/
Disallow: /hd/
Disallow: /inyt/
Disallow: /*?*query=
Disallow: /*.pdf$
Disallow: /*?*login=
Disallow: /*?*searchResultPosition=
Disallow: /*?*campaignId=
Disallow: /*?*mcubz=
Disallow: /*?*smprod=
Disallow: /*?*ProfileID=
Disallow: /*?*ListingID=
Disallow: /wirecutter/wp-admin/
Disallow: /wirecutter/*.zip$
Disallow: /wirecutter/*.csv$
Disallow: /wirecutter/deals/beta
Disallow: /wirecutter/data-requests
Disallow: /wirecutter/search
Disallow: /wirecutter/*?s=
Disallow: /wirecutter/*&xid=
Disallow: /wirecutter/*?q=
Disallow: /wirecutter/*?l=
Disallow: /search
Disallow: /*?*smid=
Disallow: /*?*partner=
Disallow: /*?*utm_source=
Allow: /wirecutter/*?*utm_source=
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-Agent: omgili
Disallow: /

User-Agent: omgilibot
Disallow: /

User-agent: Twitterbot
Allow: /*?*smid=

Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/collections.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/video.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/cooking.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/recipe-collects.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/regions.xml
Sitemap: https://www.nytimes.com/sitemaps/new/best-sellers.xml
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap
Sitemap: https://www.nytimes.com/wirecutter/sitemapindex.xml

What Can You Do?

1. Implement a Robust Robots.txt File

If you haven’t already, create a robots.txt file and disallow the user agents of known AI web crawlers. This is the first line of defense.
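
Building on the two-line example earlier, a reasonable starting point is to disallow every AI crawler you know of. GPTBot and CCBot both appear in The New York Times file above; Google-Extended is Google’s opt-out token for AI training. These tokens change as vendors launch new crawlers, so verify each one against the vendor’s current documentation:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /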

2. Evaluate Your Paywall Technology

Ensure that your paywall is actually effective against AI bots. A JavaScript-based paywall typically delivers the full article text to the browser and merely hides it with client-side code, so a crawler reading the raw HTML can still collect everything. If that describes your setup, consider switching to a CDN-based, server-side paywall, which withholds the content at the source.
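
To illustrate the server-side principle, here is a minimal sketch (not Our-Hometown’s actual implementation; the Flask app, the subscriber_session cookie, and the in-memory article store are all hypothetical stand-ins for a real subscription system):

from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical article store; a real site would query a database or CMS.
ARTICLES = {
    "council-meeting": {
        "teaser": "The city council met Tuesday night to discuss...",
        "body": "Full article text, sent only to subscribers.",
    }
}

def is_subscriber(req):
    # Placeholder check; a real paywall would validate a signed session
    # token against the subscription system.
    return req.cookies.get("subscriber_session") == "valid"

@app.route("/articles/<slug>")
def article(slug):
    story = ARTICLES.get(slug)
    if story is None:
        abort(404)
    if is_subscriber(request):
        return story["teaser"] + "\n\n" + story["body"]
    # Non-subscribers, including crawlers, never receive the body at all.
    return story["teaser"]

The key design point is that the article body never leaves the server for an unauthenticated request, so there is nothing in the response for a crawler to harvest.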

3. Monitor Traffic

Keep an eye on your website’s traffic and server logs to detect unusual spikes or unfamiliar user agents, which could be a sign of web crawlers bypassing your defenses.
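
One low-effort way to do this is to scan your web server’s access log for known crawler user-agent tokens. A sketch, assuming an Nginx- or Apache-style combined log format and a hypothetical log path:

import re
from collections import Counter

# User-agent substrings to watch for; extend this list as new crawlers appear.
AI_BOTS = ["GPTBot", "CCBot", "Google-Extended"]

LOG_PATH = "/var/log/nginx/access.log"  # adjust for your server

# In the combined log format, the user agent is the final quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"$')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = UA_PATTERN.search(line.strip())
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_BOTS:
            if bot in user_agent:
                counts[bot] += 1

for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")

If a bot you have disallowed in robots.txt keeps showing up here, it is ignoring the file, and you will need to block it at the CDN or firewall level instead.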


Conclusion

The advent of AI web crawlers like GPTBot poses new challenges for newspaper publishers in protecting their online content. However, by taking proactive steps and staying informed, you can better safeguard your copyrighted material in this digital age.

This post was written in collaboration with Claude by Anthropic.
