News Publishers Block Bots For ChatGPT's Live Coverage

Independent Publishers Group asks its members to quickly stop GPTBot and Google Bard from visiting their websites.

Glenn Burgess

ChatGPT’s Impact on News Publishers Grows as It Upgrades to Real-time News, Prompting Concerns from UK Independent Publishers Alliance.

The UK’s Independent Publishers Alliance is advising its members to prevent OpenAI and Google from accessing their websites as ChatGPT prepares to read up-to-date news stories instead of relying on outdated information from a two-year-old database.

Until now, ChatGPT could only provide information up to September 2021, which was when its training data stopped. However, ChatGPT Plus and Enterprise users can now access current and reliable information from the chatbot, and this feature will soon be available to all users. OpenAI also plans to include direct links to the sources.

This change means that users can ask ChatGPT questions about current events, with responses based on content from news publishers worldwide. This might reduce traffic for these publishers if people can get information they need without visiting original sources.

It could be an extension of the trend of “zero-click searches,” where search engine results directly provide answers without users needing to click on articles that originally reported the information.

Publishers are facing ongoing challenges as they decide whether to block ChatGPT’s bot and similar crawlers from Google and Bing from using their content for training datasets.

OpenAI initially provided guidelines for publishers to opt out of scraping in August. Recently, both Google and Bing have also explained how publishers can opt out of data collection, without risking being blocked from appearing in their search results.

Google & Bing Let Publishers Opt Out of AI Training in Search

On September 22, Bing took the lead by informing publishers about their new options to gain more control over how their content is utilized in the era of AI. Bing, a search engine owned by Microsoft, has integrated AI bot Bing Chat into its search results, leveraging OpenAI’s technology through a substantial multi-year, multi-billion dollar investment.

Bing Chat’s responses typically include links to sources, with many leading to Microsoft’s news aggregator, MSN, based on Press Gazette’s inquiries.

If publishers take no action, their content will continue to be used as sources for Bing Chat. Content marked with “NOCACHE” may potentially be included in Bing Chat responses, but only URLs, snippets, and titles will be displayed and used for training the AI model. Content marked as “NOARCHIVE” will not be included, linked to, or used for training purposes.

Bing also reassured publishers that using the “NOCACHE” or “NOARCHIVE” tags would not impact how their content appears in Bing’s search results, assuring that it will still be visible there.

About a week later, Google acknowledged that publishers expressed a desire for more choice and control over how their content is used in emerging generative AI applications. In response, Google introduced Google-Extended, a tool that allows publishers to manage access to content on their websites and decide whether they want to contribute to improving Google Bard, its AI-driven chat tool.

Danielle Romain, Google’s VP for trust, emphasised the value to the tech platform of publishers permitting their content to be used, highlighting that by using Google-Extended to control content access, website administrators can choose whether to enhance the accuracy and capabilities of these AI models over time.

‘Why let them take it for free?’

The UK’s Independent Publishers Alliance has strongly advised its members to take immediate action to block ChatGPT from crawling their websites. Several reasons underpin this recommendation:

Cost Implications: Smaller publishers may face increased hosting costs if the number of bot visits to their sites surges significantly.
Plagiarism Prevention: Blocking ChatGPT helps prevent potential plagiarism issues that can arise when generative AI tools regurgitate content.
Negotiating Power: The alliance believes that publishers may have more bargaining power to negotiate compensation for their content if they choose to opt out. Allowing use for free might weaken their position in potential legal actions or licensing discussions.

Chris Dicker, a board member for the Independent Publishers Alliance, pointed out that currently, no one benefits from allowing ChatGPT to use their content for free. He highlighted concerns that if publishers don’t take action now, regulators might later question why they should pay for content when publishers initially permitted it for free.

Dicker emphasised the importance of taking a stand now, as publishers have faced similar challenges with big tech companies in the past. He described a familiar pattern where tech companies initially offer attractive deals to engage with a select few larger publishers, and gradually more gets taken away.

Dicker compared this strategy to “how to boil a frog,” where you start with warm water that the frog likes and gradually raise the temperature until it’s too late for the frog to escape.

He believes that if all sites collectively block OpenAI and/or Bard, these AI systems would be limited to knowledge from 2021, forcing them to negotiate with publishers for content use. With ChatGPT about to update its database for all users, Dicker sees this as a critical moment for publishers to assert their position.

However, not all publishing bodies share this viewpoint. Sajeeda Merali, the chief executive of the Professional Publishers Association, which represents various media businesses, argued that opting out might not be a practical option.

She suggested that if ChatGPT aims to become a significant entry point for digital information like Google search, creating barriers to negotiation might not be in the best interest of publishers. She emphasized the importance of maintaining open channels for dialogue and collaboration rather than imposing restrictions.

‘Inevitability’ that content is crawled & learned from

The pros and cons of blocking access to AI crawlers like ChatGPT were discussed at the Digiday Publishing Summit in Florida last month. One publishing executive shared their experience, mentioning that they initially opted out but later had doubts.

They realised that their content was already widely available on multiple syndication apps and websites where it could be crawled, making their blocking efforts seem somewhat futile. They acknowledged that it was almost inevitable for AI systems to ingest, crawl, and learn from their content.

However, they saw value in their decision as a starting point for future negotiations with OpenAI and other companies. It could serve as leverage in discussions, allowing them to say they’d be willing to lift the block if they can reach a mutually beneficial agreement.

Luke Budka, an AI strategist at Definition, highlighted the complexity of the situation for publishers in the realm of generative AI. There are many moving parts & potential pitfalls.

For instance, publishers must be cautious when managing crawlers, ensuring they allow Googlebot to crawl their site while disallowing Google-Extended, which scrapes information to train AI models like Bard. Mistakes in this regard could lead to sites disappearing from search results.

Additionally, publishers need to tag their content with “NOARCHIVE” if they don’t want to contribute to training Microsoft’s generative AI foundation models. They also need to separately block OpenAI (GPTBot) and Anthropic.

Some prominent publishers, such as the New York Times, Reuters, Bloomberg, CNBC, and The Athletic, moved quickly to block OpenAI’s ChatGPT, especially considering ongoing legal disputes. ABC took similar actions by blocking GPTBot and disallowing Google-Extended access to their content.

Big news names block GPTBot

The decision of whether to block AI training bots like ChatGPT has prompted varied responses from publishers. Here’s a breakdown of the situation:

Since August, 44% of 1,123 news publishers in an ongoing survey by homepages.news have chosen to block ChatGPT’s trawling through robots.txt, which specifies which parts of a website trawlers are allowed to access.
Among UK publishers, some that have blocked GPTBot include the Daily Mail, The Sun, The Guardian, Belfast Telegraph, Daily Herald, Newsquest’s Daily Echo, The Economist, and The National.
Others like the Daily Mirror, The Times, The Telegraph, The Spectator, Daily Record, BBC, Belfast News Letter, Bellingcat, Evening Standard, The Independent, The i, New Scientist, New Statesman, Reuters, The Scotsman, and Unherd still allow their sites to be trawled.
In the US, major players such as ABC News, Axios, New York Times, Bloomberg, CBS News, CNBC, CNN, and many more have blocked GPTBot.
A separate tracker from the Independent Publishers Alliance, monitoring 4,919 domains, shows that 12% of these sites were blocking robots.txt for ChatGPT as of October 1st. Recent additions to this list include the FT, The Sun, and the i.
Of the domains blocking ChatGPT, one is also blocking training for Google Bard: tech site Venture Beat.

There is no clear consensus on whether it’s better for publishers to block all AI training bots or not. Some publishers believe they should be compensated for providing training data, while others consider the potential value of being included in AI-generated results, similar to classic Google search results.

The challenge is that manually adding every AI system to the robots.txt file can be cumbersome, and some hope for government regulations to streamline this process, potentially discussed in upcoming AI ethics conferences.