Keeping the Web Up Under the Weight of AI Crawlers

If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers. You're not alone. Operators everywhere have observed a drastic increase in automated traffic (bots), and in most cases they attribute much or all of this new traffic to AI companies.

Background

AI systems, in particular Large Language Models (LLMs) and generative AI (genAI), rely on compiling as much information as possible from relevant sources (e.g., "texts written in English" or "photographs") in order to build a functional and persuasive model that users will later interact with. While AI companies distinguish themselves in part by what data their models are trained on, possibly the greatest source of information, one freely available to all of us, is the open web.

To gather up all that data, companies and researchers use automated programs called scrapers (sometimes referred to by the more general term "bots") to "crawl" over the links available between various webpages and save the types of information they're tasked with as they go. Scrapers are tools with a long, and often beneficial, history: services like search engines, the Internet Archive, and all kinds of scientific research rely on them.

When scrapers are not deployed thoughtfully, however, they can contribute to higher hosting costs, lower performance, and even site outages, particularly when site operators see so many of them in operation at the same time. In the long run all this may lead to some sites shutting down rather than bearing the brunt of it.

For-profit AI companies must ensure they do not poison the well of the open web they rely on in a short-sighted rush for training data.

Bots: Read the Room

There are existing best practices that those who use scrapers should follow. When bots and their operators ignore these guideposts, it signals to site operators, sometimes explicitly, that they can or should cut off the bots' access. Ignoring them can also impede performance and, in the worst case, take a site down for all users. Some companies appear to follow these practices most of the time, but we see increasing reports and evidence of new bots that don't.

First, where possible, scrapers should follow instructions given in a site's robots.txt file, whether those are to back off to a certain crawling rate, exclude certain paths, or not to crawl the site at all.
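As a rough illustration of the first practice, here is a minimal sketch of a crawler consulting robots.txt rules before fetching, using Python's standard urllib.robotparser module. The bot name "ExampleBot", the example.org URLs, and the robots.txt content are all hypothetical. Note that Crawl-delay is a widely seen but non-standard directive, so a real crawler may need other signals for pacing as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this from
# the site's /robots.txt before requesting anything else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def delay_seconds(parser: RobotFileParser, user_agent: str,
                  default: float = 1.0) -> float:
    """Use the site's Crawl-delay if one is given, else a polite default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawler built along these lines would call may_fetch before every request and sleep for delay_seconds between requests to the same site.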

Second, bots should send their requests with a clearly labeled User Agent string which indicates their operator, their purpose, and a means of contact.
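A labeled User Agent string might be assembled as in the sketch below. There is no single required format, so the layout here (a name and version followed by a parenthesized comment with a "+" contact URL) is only one common convention, and the bot name, operator, purpose, and URL are hypothetical placeholders.

```python
from urllib.request import Request


def build_user_agent(bot_name: str, version: str, operator: str,
                     purpose: str, contact_url: str) -> str:
    """Assemble a User Agent string naming the operator, purpose, and contact."""
    return (f"{bot_name}/{version} "
            f"(+{contact_url}; operator={operator}; purpose={purpose})")


def build_request(url: str, user_agent: str) -> Request:
    """Create a request object carrying the labeled User-Agent header.

    Nothing is sent here; the request would be passed to urlopen() later.
    """
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical values for illustration only.
EXAMPLE_UA = build_user_agent(
    bot_name="ExampleBot",
    version="1.0",
    operator="Example Research Lab",
    purpose="search-indexing",
    contact_url="https://example.org/bot-info",
)
```

The point of the contact URL is that a site operator who sees this string in their logs can find out who is crawling and how to ask them to slow down or stop.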

Third, those running scrapers should provide a process for site operators to request back-offs, rate caps, and exclusions, and to report problematic behavior, through the contact information or response forms linked in the User Agent string.

Mitigations for Site Operators

Of course, if you're running a website dealing with a flood of crawling traffic, waiting for those bots to change their behavior for the better might not be realistic. Here are a few suggested, if imperfect, mitigations based in part on our own sometimes frustrating experiences.

First, use a caching layer. In most cases a Content Delivery Network (CDN) or an "edge platform" (essentially a newer iteration of a CDN) can provide this for you, and some services offer a free tier for non-commercial users. There are also a number of great projects if you prefer to self-host. Some of the tools we've used for caching include Varnish, Memcached, and Redis.
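To show the idea behind a caching layer, here is a minimal in-process sketch: a rendered page is kept for a fixed time-to-live, so repeated requests for the same path do not trigger the expensive render again. This is not how a CDN or any of the tools named above works internally. The TTLCache class and the render callback are hypothetical names for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Keep rendered pages in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_render(self, path: str, render: Callable[[str], str]) -> str:
        """Return the cached body for path, rendering it only if missing or expired."""
        now = self.clock()
        hit = self._store.get(path)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the expensive render
        body = render(path)
        self._store[path] = (now + self.ttl, body)
        return body
```

With a 60-second time-to-live, a burst of a thousand crawler requests for one page within a minute would cost a single render rather than a thousand.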

Second, convert to static content to prevent resource-intensive database reads. In some cases this may reduce the need for caching.
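One way to picture the conversion to static content is a small export step that renders each page once and writes it to disk, leaving a plain file server to handle requests with no database involved. The sketch below assumes a hypothetical render function and a trusted list of site paths. It is not tied to any particular framework or static site generator.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def export_static(paths: Iterable[str], render: Callable[[str], str],
                  out_dir: Path) -> List[Path]:
    """Render each URL path once and write it as <out_dir>/<path>/index.html.

    Assumes the paths come from a trusted source such as the site's own
    page list, not from user input.
    """
    written: List[Path] = []
    for url_path in paths:
        relative = url_path.strip("/")
        if relative:
            target = out_dir / relative / "index.html"
        else:
            target = out_dir / "index.html"  # the site root
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(url_path), encoding="utf-8")
        written.append(target)
    return written
```

The export would be rerun whenever content changes, so the cost of rendering scales with how often you publish rather than with how often bots visit.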

Third, use targeted rate limiting to slow down bots without taking your whole site down. Be aware that this can get difficult when scrapers try to disguise themselves with misleading User Agent strings or by spreading a fleet of crawlers out across many IP addresses.
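Targeted rate limiting is often implemented with a token bucket per client, sketched below. Each client key gets a small burst allowance that refills at a steady rate, and requests beyond it are refused (for example with an HTTP 429 response). The RateLimiter class is a hypothetical illustration. What to use as the client key (an IP address, a network range, a User Agent string) is left open, since no single choice holds up against crawlers that disguise themselves.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Bucket:
    tokens: float   # requests this client may still make right now
    updated: float  # clock reading when tokens were last refilled


class RateLimiter:
    """Token bucket limiter that tracks each client key separately."""

    def __init__(self, rate_per_second: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock  # injectable so tests do not need to sleep
        self._buckets: Dict[str, Bucket] = {}

    def allow(self, client_key: str) -> bool:
        """Return True if this client may make a request now."""
        now = self.clock()
        bucket = self._buckets.get(client_key)
        if bucket is None:
            # New clients start with a full burst allowance.
            bucket = Bucket(tokens=self.burst, updated=now)
            self._buckets[client_key] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

Because each key has its own bucket, one aggressive crawler running out of tokens does not slow down anyone else.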

Other mitigations such as client-side validation (e.g. CAPTCHAs or proof-of-work) and fingerprinting carry privacy and usability trade-offs, and we warn against deploying them without careful forethought.

Where Do We Go From Here?

To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers.
