Quo Vadis, Crawlers? Progress And What’s Next On Safeguarding Our Infrastructure
One year ago, the Wikimedia Foundation reported a significant increase in bot traffic to the Wikimedia projects, largely coming from crawlers that extract content to train generative AI systems. We described the impact of these crawlers and introduced our action plan to ensure fairer use of our resources. Let’s take a look at the progress we’ve made on protecting our infrastructure, what we’ve learned along the way, and next steps.
Recap: High demand, increased strain, less visibility
As generative AI increasingly draws from high-quality, human-created content, automated traffic has risen sharply on Wikimedia sites. While Wikimedia content is free, the infrastructure that serves it is not. Crawlers tend to access every part of the Wikimedia ecosystem – articles, media files, and developer platforms – risking overload of our systems and degrading the experience of our readers and contributors. At the same time, LLM-powered features such as search summaries or chatbots are making it less likely that users know the source of information or follow links, as recent studies have found. Across the web, publishers are seeing more bot traffic and fewer human users – a trend we’re also observing. This creates an imbalance: increased extraction of content, with fewer people contributing back to sustain it.
What does an open access model look like when so many don’t play by the rules? What do we need to change in order to enable – and enforce – sustainable use of our infrastructure? These and other questions have shaped our approach. Rather than asking, “How can we prevent reuse?” we’ve been thinking about ways to enable sustainable, responsible reuse.
Prioritizing access for humans and mission-oriented traffic
We ensure fair usage by prioritizing access for our readers and our content and technical contributors, blocking abusive traffic, and asking companies that want to access our data at scale to use our Wikimedia Enterprise services, which are designed for high-volume use cases, instead of scraping pages or overusing community resources.
To achieve this, we have updated our robot policy to set expectations, improved our bot detection and defense tools, and are investing in our API infrastructure to enable central management, improved governance and developer experience for our preferred ways of access.
Readers, contributors, responsible bots, and abusive bots all share the same access points to our websites and infrastructure. We have therefore orchestrated our work with maximum care to minimize impact on our reading and editing community, with the ultimate goal of not impeding any person from accessing our projects.
As a result of this work, we’re currently blocking or throttling about 25% of all automated requests that are coming from crawlers that don’t adhere to our policies (up to billions of requests per day). As we continue to improve our detection mechanisms, we expect this number to increase. Earlier this month, we also began rolling out global rate limits for API traffic, with a second rollout phase planned for April 2026.
Both crawling the site and using the APIs are still possible for anyone within the limits of the robot policy. Scraping at higher rates is generally restricted. Obtaining higher rate limits for the APIs, however, is easily possible and a preferred way of access. The rule of thumb is: the stronger the identification provided, the higher the limit granted. As we aim to minimize impact on our technical community, multiple options exist for technical contributors to identify their bots and tools and receive higher limits if needed. Bot owners who are unsure how to get the access they need can contact the Wikimedia Foundation.
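To illustrate what “strong identification” can look like in practice, here is a minimal sketch of a polite API client that announces who it is via a descriptive User-Agent string and self-throttles its request rate. The class, the User-Agent format, and the one-request-per-second default are illustrative assumptions following common crawler best practice, not Wikimedia’s official requirements or limits.

```python
import time

class PoliteClient:
    """Hypothetical sketch of a self-identifying, self-throttling bot client."""

    def __init__(self, tool_name, version, contact, min_interval=1.0):
        # A descriptive User-Agent (tool name, version, contact info)
        # lets site operators identify the bot in their server logs.
        self.user_agent = f"{tool_name}/{version} ({contact})"
        self.min_interval = min_interval  # seconds between requests
        self._last_request = 0.0

    def headers(self):
        # Headers to attach to every HTTP request this bot makes.
        return {"User-Agent": self.user_agent}

    def wait_before_request(self):
        # Sleep just long enough to stay under the self-imposed rate,
        # then record when this request was allowed to proceed.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

client = PoliteClient("ExampleWikiBot", "1.0", "ops@example.org")
print(client.headers()["User-Agent"])
```

A client like this makes it trivial for an operator to attribute traffic in logs and, where a site offers it, to grant the identified bot a higher rate limit.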
Good bot, bad bot, human? Differentiating legitimate users from abusive bots
A prerequisite to prioritizing access for humans and mission-oriented traffic and preventing abuse is the ability to differentiate legitimate users (bots and humans alike) from abusive bots. In the past, abusive bots were fewer in number, and easier to identify. And traditional web crawlers like search engine bots followed best practices: slowing down if the server started returning errors, and making efforts to be easily identified in server logs. They also brought visitors back to the sites, by indexing and showing pages in search results, so everyone benefited. In addition, the Wikimedia communities rely on their own bots and bespoke tools to support and speed up workflows from content creation to vandalism patrols.
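The “slowing down if the server started returning errors” behavior of well-behaved crawlers is typically implemented as exponential backoff with jitter. The sketch below is an illustrative example of that technique; the function names, the base delay, and the cap are hypothetical values, not part of any Wikimedia policy.

```python
import random

def should_retry(status_code):
    # Back off and retry on rate limiting (429) and server errors (5xx);
    # other status codes are either success or a client-side problem.
    return status_code == 429 or 500 <= status_code < 600

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Double the maximum wait on each attempt, capped, and pick a
    # random point in that window ("full jitter") so that many
    # clients retrying at once don't hammer the server in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A crawler using this would sleep for `backoff_delay(attempt)` seconds after each failed request before trying again, which is exactly the load-shedding courtesy that abusive bots skip.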
This new generation of bots, however, routinely ignores historical precedent and behaves badly: sending requests as fast as possible, spoofing the identities of real web browsers, and circumventing rate limits. Thinking about bots as adversarial was a new experience for us, and forced us through many iterations of improving our bot detection.
Bots that cover their tracks: A predatory business model
Many modern bots operate outside of the established rules of the Internet, ignoring limits imposed by site owners and extracting data as fast as possible with no regard for the health of the host websites. In response, website operators have started to impose stricter rate limits on requests coming from datacenters and individual sources. As a consequence, crawling operators have resorted to using a shady network of so-called “residential proxies” – companies that sell access to people’s own home or mobile connections – to hide their data extraction among legitimate browsing traffic. In this new world, there is little a website operator can do to stop the flood without also challenging human users, as these networks can span hundreds of millions of IP addresses. You might have noticed that, on a lot of websites, you’re now asked to “verify you’re a human” before being allowed access; these networks are the most likely cause of this shift in behavior, and why community-centric knowledge sites like ours (and OpenStreetMap) try their best to do the same while respecting our users’ right to not be tracked extensively.
Looking ahead: Responding to threats and exploring the opportunities of reuse
Over the coming months, we aim to further improve our detection systems to scale for rapidly changing bot behavior (such as residential proxies); continue to roll out and fine-tune API rate limits; and invest in our API infrastructure. This includes completing work on a dedicated Attribution API to make it easy for reusers to provide pathways for discovery. We have also started working on improving our media infrastructure, aiming to make the platform more resilient when extensive scraping occurs.
As we’re planning the next phase of this work, we’re also looking at opportunities beyond protecting our infrastructure: we want to explore ways to ensure that content reuse is sustainable long-term, including helping drive contributors back, and to further improve the discoverability and developer experience around the APIs, our preferred channels for access.
While more work is needed to complete this initiative and protect against new forms of abuse, we have made great progress so far. This would not have been possible without the support of our amazing technical community – many thanks to everyone who updated their code to follow the latest best practices, gave feedback, asked questions, helped fellow developers, or reported bugs!
We will continue to share about this work on mailing lists and blog posts – stay tuned for the next update!