
Misinformation Sites Are 6x More Likely to Feed AI. Here's Why They're Winning.

A new study finds that 60% of reputable news sites block at least one AI crawler, while misinformation sites leave the door wide open. Your llms.txt file is doing nothing to stop it.

OpenFound Team

Content Team

Apr 25, 2026 · 9 min read

A new study reveals an uncomfortable truth about the future of AI: reputable news sites are actively blocking AI crawlers, while misinformation sites are leaving the front door wide open. An analysis of over 4,000 websites found that 60% of reputable domains disallow at least one AI crawler, compared with just 9.1% of misinformation sites. Reputable sites are therefore more than six times as likely to block AI crawlers, which leaves the sources that spread inaccuracies far more available to the AI models that will soon answer the world's questions. If you've been meticulously crafting an llms.txt file, you've been focusing on the wrong thing. It's time to pay attention to what's really happening.
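The headline ratio is easy to sanity-check from the two percentages quoted above. The short Python sketch below uses only those two figures; the variable names are ours, not the study's. It computes the ratio two ways: how much more likely a reputable site is to block at least one AI crawler, and how much more likely a misinformation site is to block none.

```python
# Figures quoted in the article: share of sites disallowing at least one AI crawler.
reputable_block_rate = 0.60
misinfo_block_rate = 0.091

# How much more likely a reputable site is to block at least one AI crawler.
block_ratio = reputable_block_rate / misinfo_block_rate

# Share of sites in each group that block no AI crawler at all.
reputable_open_rate = 1 - reputable_block_rate
misinfo_open_rate = 1 - misinfo_block_rate
open_ratio = misinfo_open_rate / reputable_open_rate

print(f"Blocking ratio (reputable vs. misinformation): {block_ratio:.1f}x")
print(f"Fully open ratio (misinformation vs. reputable): {open_ratio:.1f}x")
```

The "six times" figure corresponds to the first ratio, the gap in blocking behavior. Measured by the share of sites that block nothing, the gap is smaller but still substantial.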

The Great llms.txt Misconception

For the past year, the GEO (Generative Engine Optimization) community has been buzzing about llms.txt. The idea, proposed by fast.ai's Jeremy Howard, was simple and elegant: publish a new file at the site root, in the spirit of robots.txt, that gives LLMs a curated guide to a site's content. Much of the community assumed it would also become the new standard for controlling AI access. Brands have spent time and resources creating these files, believing they were setting up a gatekeeper for their valuable content.

"The data is in, and the conclusion is brutal: llms.txt is currently an illusion for access control."

A 90-day study by OtterlyAI found that, out of more than 62,000 AI bot visits, only about 0.1% requested the /llms.txt file. To put that in perspective, the average content page on the same site received three times as many AI bot visits as the file specifically designed to guide them. Major AI companies like OpenAI, Google, and Anthropic have not confirmed that they use the file, and as Google's John Mueller stated in 2025, 'none of the AI services have said they're using llms.txt.' Your own server logs will likely tell the same story: AI crawlers rarely, if ever, request it.
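You can run a rough version of this check on your own logs. The Python sketch below is a minimal illustration, not OtterlyAI's methodology. It assumes your server writes the common "combined" access log format and that AI crawlers identify themselves honestly in the User-Agent header. The marker list, helper names, and sample lines are all hypothetical.

```python
import re
from collections import Counter

# Assumption: these substrings identify AI crawlers in the User-Agent header.
AI_BOT_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

# Pulls the request path and user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_bot_paths(lines):
    """Count requested paths across lines whose user agent looks like an AI bot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and any(m in match.group("agent") for m in AI_BOT_MARKERS):
            counts[match.group("path")] += 1
    return counts


def llms_txt_share(counts):
    """Fraction of AI bot requests that asked for /llms.txt."""
    total = sum(counts.values())
    return counts["/llms.txt"] / total if total else 0.0


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [25/Apr/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.6 - - [25/Apr/2026:10:01:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [25/Apr/2026:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [25/Apr/2026:10:03:00 +0000] "GET /llms.txt HTTP/1.1" 404 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.9 - - [25/Apr/2026:10:04:00 +0000] "GET /llms.txt HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]

counts = count_ai_bot_paths(SAMPLE_LOG)
print(f"AI bot requests: {sum(counts.values())}")
print(f"/llms.txt share of AI bot requests: {llms_txt_share(counts):.1%}")
```

To use it on real data, pass an open log file instead of SAMPLE_LOG. Keep in mind that user-agent strings can be spoofed, so treat the result as an estimate.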

llms.txt is not a gatekeeper. It’s a navigation aid with no enforcement. Treating it as a content protection tool gives you a false sense of security while your most valuable asset is being scraped for training purposes.

The Real Gatekeeper: robots.txt Still Reigns Supreme

So, where is the real battle for AI access being fought? In the same place it's been for 30 years: robots.txt. While llms.txt is often pitched as a way to give nuanced instructions about how content can be used (e.g., allow for citation but not for training), robots.txt is a far blunter, but currently more effective, instrument. It tells crawlers which paths they may not fetch, and compliant crawlers stay out. Reputable companies are using it aggressively.

The study from arXiv, 'Is Misinformation More Open?', highlights an alarming trend. Reputable news organizations, which invest heavily in factual, high-quality content, are putting up walls. They are adding directives to their robots.txt files to block common AI crawlers, including:

GPTBot (OpenAI)
ClaudeBot (Anthropic)
PerplexityBot (Perplexity)
Google-Extended (Google's robots.txt token for AI training use, not a separate crawler)

The data shows reputable sites block an average of 15.5 AI user agents. Misinformation sites block fewer than one. The gap is widening, with AI blocking by reputable sites jumping from 23% in 2023 to nearly 60% by May 2025.
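For a sense of what such a block looks like in practice, here is a small sketch using Python's standard-library urllib.robotparser. The robots.txt text is an illustrative example we wrote, not a file taken from the study, and example.com is a placeholder. The parser is only a stand-in for how a compliant crawler would read the rules; each real crawler applies its own implementation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: shut out four AI user agents, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site gives the same answer with these rules.
url = "https://example.com/news/story"
for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Googlebot"):
    status = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {status}")
```

With these rules the four AI agents are reported as blocked, while Googlebot falls through to the wildcard group and stays allowed. Traditional search indexing is unaffected.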

Why Misinformation Is 'Winning' the Training Data War

This isn't an accident. It's a strategic divergence. Reputable publishers view their content as intellectual property. They want to control its use, ensure proper attribution, and explore licensing models. Blocking AI crawlers is a rational, defensive business move to protect their most valuable asset from being commoditized without compensation. They are playing the long game, waiting for legal and ethical frameworks to mature.

Misinformation sites, however, have a completely different business model. Their goal is not protection; it is amplification. For them, being ingested, summarized, and repeated by a Large Language Model is a feature, not a bug. They thrive on reach, and AI is the most powerful amplification vector the world has ever seen. By leaving their doors open, they are ensuring their narratives, conspiracies, and inaccuracies are fed directly into the models that are increasingly becoming a primary source of information for millions.

Your AIO Strategy Must Start With robots.txt

This creates an urgent mandate for every brand, publisher, and website owner. You cannot afford to be passive. Your Generative Engine Optimization strategy starts not with a new file, but with the old one.

Step 1: Audit robots.txt Immediately. This is your primary line of defense. You need to make a conscious decision about which AI crawlers, if any, you want to allow. Disallowing Google-Extended, for example, tells Google not to use your content for training its generative AI models, while Googlebot can still crawl your site for traditional search indexing.
Step 2: Use llms.txt as a Statement of Intent, Not a Fortress. Even though nothing enforces it, creating an llms.txt file is still a good practice. Think of it as where robots.txt was in 2001: an emerging standard with clear directional importance. Use it to explicitly state your content usage preferences. This documents your position for a future where enforcement mechanisms and legal precedents are more defined.
Step 3: Adopt a Proactive GEO Strategy. Blocking everything is a tactic, not a strategy. The smartest brands are developing a 'Generative Architecture' for their sites. As noted by experts at Derivatex, the goal is to allow AI inference access to public educational content you want cited, while protecting proprietary data. This is the core of true Generative Engine Optimization (GEO).
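The audit in Step 1 and the selective approach in Step 3 can be sketched together. The example below is a rough illustration under stated assumptions. The audit_robots helper, the agent list, and the /guides/ and /reports/ paths are hypothetical, and Python's urllib.robotparser applies rules in file order, which may differ from how a given crawler resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the AI user agents you want an explicit decision about.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot")


def audit_robots(robots_txt, paths, agents=AI_AGENTS):
    """Return {agent: {path: allowed}} for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: {path: parser.can_fetch(agent, path) for path in paths}
        for agent in agents
    }


# Hypothetical selective policy: no training access, citation access to
# public guides only, and no rule at all for ClaudeBot or CCBot.
SELECTIVE_POLICY = """\
User-agent: GPTBot
User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /reports/
Allow: /

User-agent: *
Allow: /
"""

report = audit_robots(SELECTIVE_POLICY, ["/guides/intro", "/reports/q1"])
for agent, results in report.items():
    summary = ", ".join(
        f"{path} {'allowed' if ok else 'blocked'}" for path, ok in results.items()
    )
    print(f"{agent}: {summary}")
```

In this sample policy, ClaudeBot and CCBot come back as allowed everywhere because no rule names them. That is exactly the kind of unmade decision an audit is meant to surface. To check a live site, paste the contents of its robots.txt into audit_robots along with the paths you care about.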

Take Control of Your AI Visibility

The asymmetry in AI training data is one of the most critical issues facing the web today. While reputable sources are building walls, misinformation sources are laying out a welcome mat. The result is a potential feedback loop that could pollute our shared information ecosystem for years to come.

Stop hoping llms.txt will solve this for you. It won't. The tool that matters right now is robots.txt. Audit it. Use it. Then, build a sophisticated strategy that goes beyond simply blocking. Your brand's visibility and the integrity of AI depend on it. At OpenFound, we're building the tools and analytics to help you navigate this new reality. Explore our research on the GEO Index and see more insights on our blog.

Frequently Asked Questions

What is the main difference between robots.txt and llms.txt?

robots.txt controls access: it tells crawlers which pages on your site they may not visit, and compliant crawlers obey. llms.txt offers guidance about your content and how you would like it used (e.g., for training or citation), but it cannot block access and is not currently enforced by major AI platforms.

Does llms.txt help with AI rankings or visibility?

No. Data from the OtterlyAI GEO study shows there is no positive correlation between having an llms.txt file and increased AI crawler activity or visibility. Crawlers are discovering content through standard web pages and sitemaps, and major AI providers have not confirmed they use it for ranking or answer generation.

Should I block all AI crawlers in my robots.txt?

It depends on your strategy. Blocking training crawlers like 'Google-Extended' or 'GPTBot' can prevent your content from being used to train models without compensation. However, you might want to allow inference crawlers like 'PerplexityBot' if you want your content cited in real-time answers. A selective approach is often best.

What are the most common AI crawler user agents to block?

Key AI user agents to consider managing in your robots.txt include GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI training bot), PerplexityBot (Perplexity), and CCBot (Common Crawl).

Is blocking AI crawlers with robots.txt legally binding?

The Robots Exclusion Protocol (and by extension, robots.txt) is a voluntary standard. While reputable crawlers honor it, less ethical ones may not. However, it puts your intent to deny access on record, which may matter in legal challenges or DMCA notices regarding copyright.
