News outlets are limiting the Internet Archive’s access to their journalism

TL;DR

Many major news publishers, especially local outlets, are blocking the Internet Archive from crawling their websites. This development raises concerns about the preservation of journalistic history and access for researchers.

More than 340 US local news websites have begun blocking the Internet Archive’s web crawlers, according to recent analysis, marking a significant shift in how news publishers are managing their online content’s accessibility and preservation.

Since January, the number of news sites disallowing Internet Archive bots has increased by over 140, bringing the total to more than 380. Most are local outlets owned by major publishers such as USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing, which cite concerns about AI scraping and intellectual property protection.

The Internet Archive, which runs the Wayback Machine, has responded by emphasizing its efforts to prevent abuse, including limiting bulk downloads and monitoring bot activity. Publishers remain wary despite these measures, and some explicitly disallow known Internet Archive bots in their robots.txt files.

Researchers and journalists rely heavily on the Archive for long-term access to historical news content, especially in regions with limited current reporting. The situation reflects a long-running tension between content preservation and intellectual property rights, now complicated by concerns over AI training data.
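For readers who want to see what a robots.txt block looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent tokens `archive.org_bot` and `ia_archiver` are assumptions (the article does not list which bots publishers name), and the sample file and `blocked_agents` helper are hypothetical, written only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; the article does not say which bots are named.
ARCHIVE_AGENTS = ("archive.org_bot", "ia_archiver")

# Hypothetical robots.txt for illustration only, not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, url: str, agents=ARCHIVE_AGENTS) -> list[str]:
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT, "https://example.com/news/story.html"))
```

To check a live site, the same parser could be pointed at the site's own robots.txt with `set_url()` and `read()` instead of `parse()`. Note that robots.txt is a voluntary convention: it tells a crawler what the publisher wants, and it only has an effect when the crawler chooses to honor it.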

Why It Matters

This development threatens the long-term preservation of local journalism, which is vital for historical record-keeping, research, and accountability. Crawler blocks by major local news outlets could leave gaps in the digital archive, affecting historians, journalists, and citizens seeking to understand past events. It also signals a broader shift in how news organizations view and control their digital content, especially amid economic pressures and legal concerns over AI use.

Background

Since early 2026, reports have highlighted growing tensions between the Internet Archive and news publishers over content scraping and intellectual property rights. In January, Nieman Lab identified over 240 news sites disallowing Internet Archive bots, primarily owned by large media companies. The trend has since accelerated, with over 380 sites now blocking access, mostly in the United States. The controversy is rooted in fears that AI companies might scrape news content for training models, although no direct evidence of such scraping has been publicly confirmed by publishers. The Internet Archive has taken steps to address these concerns but continues to face resistance from some publishers, especially in the local news sector.

“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term.”

— Edward McCain, journalism librarian at the University of Missouri

“We are in conversation with many publishers and appreciate the opportunity to address their concerns.”

— Mark Graham, director of the Wayback Machine

“This is the same fight that everybody has been having with the Internet Archive since its inception.”

— Meredith Broussard, data journalist and NYU professor

What Remains Unclear

It remains unclear whether AI companies have actually scraped any news publisher’s content from the Wayback Machine; no direct evidence has been publicly confirmed. How much these restrictions will affect long-term preservation and research is also not yet known, and negotiations between the Internet Archive and publishers are ongoing.

What’s Next

Discussions between the Internet Archive and publishers are expected to continue and could produce policy adjustments or technical solutions that address publishers’ concerns. In the coming months, the things to watch are the number of sites blocking access and any legal developments on content rights and AI training data.

Key Questions

Why are news publishers blocking the Internet Archive?

Many publishers cite concerns over content scraping by AI companies and protecting their intellectual property as reasons for blocking web crawlers from archiving their sites.

Could this affect the availability of historical news content?

Yes, restricting access could lead to gaps in the digital archive, impacting researchers, journalists, and the public’s ability to access long-term news records.

Has the Internet Archive confirmed any content scraping by AI companies?

No, there is no publicly confirmed evidence that AI companies have scraped news content from the Wayback Machine. The restrictions are primarily based on publisher concerns.

What are the Internet Archive’s responses to these restrictions?

The Internet Archive has implemented measures to prevent abuse, including limiting bulk downloads and monitoring bot activity, and is engaging in discussions with publishers to address their concerns.

Source: Hacker News

Author

The Blogger Team
