    We Tracked 62,100 AI Bot Hits. Your llms.txt Is a Lie.
    Technical AIO


    Stop obsessing over llms.txt. We ran the numbers, and fewer than 0.2% of AI bot hits even request it. Here’s what they actually read instead.

    OpenFound Team

    Content Team

    Apr 2, 2026 · 9 min read

    For the past year, you've been told that llms.txt is the key to managing your brand's presence in AI. A simple text file, they said, to guide AI crawlers, curate your content, and secure your place in the new world of generative answers. It sounds neat, clean, and important. It’s also a lie.

    Well, not a deliberate lie, but a convenient fiction. We’re obsessed with data at OpenFound, so we look at the logs. A recent, crucial study from OtterlyAI did just that. Over a 90-day period, they tracked AI bot traffic to an experiment website. The headline numbers are staggering: out of 62,100 total AI bot hits, guess how many requested the /llms.txt file?

    "Eighty-four. Not 8,400. Just 84. That’s 0.13%."

    Let that sink in. While the entire industry is writing think-pieces about structuring the perfect llms.txt, the AI crawlers themselves are walking right past it. Your carefully crafted guidance is sitting in an empty room. This isn't a theoretical problem; it’s a fundamental misunderstanding of how AI engines actually access the web today.
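If you want to run the same check against your own server, here is a minimal sketch. It assumes combined-format access logs and a short, hypothetical watchlist of bot tokens (AI_BOT_TOKENS); the sample log lines are invented for illustration and are not the study's data.

```python
import re

# Assumed watchlist of AI crawler tokens; extend it to match your own logs.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

# Captures the request path and the last quoted field (the user agent)
# of a combined-format access log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" .* "(?P<agent>[^"]*)"$')

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [02/Apr/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [02/Apr/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 9120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [02/Apr/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def llms_txt_share(lines):
    """Return (ai_hits, llms_hits, share_percent) for an iterable of log lines."""
    ai_hits = 0
    llms_hits = 0
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        if not any(token in match["agent"] for token in AI_BOT_TOKENS):
            continue  # not one of the watched AI crawlers
        ai_hits += 1
        if match["path"].split("?")[0] == "/llms.txt":
            llms_hits += 1
    share = 100 * llms_hits / ai_hits if ai_hits else 0.0
    return ai_hits, llms_hits, share


print(llms_txt_share(SAMPLE_LOG))  # (2, 1, 50.0)
```

Plugging in the study's figures, 100 * 84 / 62100 works out to roughly 0.135%, which matches the quoted 0.13%.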

    The Real Gatekeeper: robots.txt Still Rules Everything

    llms.txt was designed as a suggestion box. It’s a proposed standard to guide AI crawlers to your most important content. In contrast, robots.txt is a fortified gate with a bouncer. It controls access. As Neil Patel clarifies, robots.txt influences indexing and rankings, while llms.txt was intended to influence how content appears in generative AI. The problem is, only one of them has any real authority.

    Every major, well-behaved bot—from Googlebot to GPTBot—honors robots.txt. It’s a 30-year-old standard that underpins web etiquette. llms.txt is a new proposal that, as the data shows, has almost zero adoption by the crawlers that matter. Focusing on llms.txt is like hand-painting a welcome sign for a new highway while ignoring the fact that you already have a chain-link fence blocking the entrance.

    The Gatekeeping Divide: Reputable vs. Misinformation Sites

    This isn't just a technical footnote. The decision to use robots.txt has become a defining line in the sand. A landmark study on arXiv analyzed the robots.txt files of reputable news outlets versus known misinformation sites. The findings are a wake-up call for every brand:

    • 60% of reputable news sites explicitly disallow at least one AI crawler in their robots.txt.
    • Only 9.1% of misinformation sites do the same.
    • On average, reputable sites block 15.5 different AI user agents, while misinformation sites block fewer than one.

    The conclusion is unavoidable. Trustworthy organizations are being deliberate and strategic about who they let train on their content. They are using the real tool for the job—robots.txt—to exert control. Meanwhile, low-quality and misinformation sites leave the door wide open, eager for their content to be scraped and amplified by any model, without question. Are you managing your access like a reputable publisher or a misinformation firehose?
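To see where your own site falls on that divide, a rough sketch like the one below counts how many AI user agents your robots.txt shuts out. It uses Python's standard urllib.robotparser. The agent list and the sample file are assumptions for illustration. Note that it counts any agent that cannot fetch the site root, including agents caught by a wildcard rule, which is looser than the study's "explicitly disallow" criterion.

```python
from urllib.robotparser import RobotFileParser

# Assumed watchlist, taken from the agents discussed in this article.
AI_USER_AGENTS = ("GPTBot", "Google-Extended", "CCBot", "ClaudeBot")

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def blocked_ai_agents(robots_txt, agents=AI_USER_AGENTS):
    """Return the agents that are not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


print(blocked_ai_agents(SAMPLE_ROBOTS))  # ['GPTBot', 'CCBot']
```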

    The New Framework: Control Access, Then Guide Attention

    Obsessing over a file that bots don't read is a waste of resources. It’s time to adopt a strategy grounded in reality. This isn’t about abandoning llms.txt, but about putting it in its proper, secondary place. True Generative Engine Optimization (GEO) is about control and precision.

    Step 1: Use robots.txt for AI Access Control (The Real Work)

    Your robots.txt file is your new frontline for AI strategy. It's not just for SEO anymore; it’s for brand safety, IP protection, and strategic visibility. You must make active decisions about which bots can access your site. Key AI user agents to be aware of include:

    • GPTBot: OpenAI's web crawler for training models like GPT-5.
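    • Google-Extended: Google's robots.txt control token for generative AI use of your content. It is not a separate crawler; Google's existing crawlers fetch the pages, and this token governs whether that content can be used for Gemini. Blocking it is how you opt out of Gemini training.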
    • CCBot: The crawler for Common Crawl, a massive dataset used to train countless open-source and commercial models.
    • ClaudeBot: Anthropic's crawler (note: as of early 2026, Anthropic is still an early adopter and its policies are evolving).

    Your robots.txt should be explicit. Blocking a crawler is a powerful statement. For example, if you don't want your proprietary content to become generic training data for the next GPT model, you add this to your file:

    User-agent: GPTBot
    Disallow: /

    Conversely, allowing a bot means you want visibility on that platform. A good starting point is to leave AI crawlers unblocked unless you are legally required to block them, because every blocked crawler is a platform where your brand cannot be cited.
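One way to keep those decisions explicit is to write the policy down once and generate the file from it. The sketch below is a simplified illustration, not a full robots.txt generator: the policy dictionary and the render_robots_txt helper are hypothetical names, and it only handles whole-site allow or block per agent. It then re-parses the output with urllib.robotparser to confirm the rules do what you intended.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(policy):
    """Render one group per agent; policy maps agent token -> True (allow) or False (block)."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Hypothetical policy: keep GPTBot out, let ClaudeBot in.
policy = {"GPTBot": False, "ClaudeBot": True}
robots_txt = render_robots_txt(policy)

# Re-parse the generated text to confirm each agent gets the intended answer.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent, allowed in policy.items():
    assert parser.can_fetch(agent, "/pricing") == allowed

print(robots_txt)
```

Agents that are not listed in the policy get no group at all, which robots.txt parsers treat as allowed.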

    Step 2: Use llms.txt for Crawl Budget Optimization (The Fine-Tuning)

    So, llms.txt is useless? Not quite. It’s just not an access tool. According to AI Rank Lab, its best use is for crawl budget optimization. AI crawlers have limited time and resources. Once you’ve allowed them in with robots.txt, llms.txt can suggest a more efficient path.

    Think of it as a concierge a guest meets after being cleared by security. You can use it to point the (very few) bots that read it toward your highest-value content: your product documentation, your core service pages, your cornerstone thought leadership. The key is to make no assumptions. Put your most critical URLs at the top, use canonical URLs, and ensure the pages load quickly.
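As a rough illustration, the sketch below renders an llms.txt file with the most critical URLs first. It assumes the Markdown layout described in the llms.txt proposal (a title, a short blockquote summary, then sections of links); the site name, sections, and example.com URLs are placeholders, and render_llms_txt is a hypothetical helper.

```python
def render_llms_txt(site_name, summary, sections):
    """Render llms.txt text; sections maps heading -> list of (title, url, note)."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    # Dicts keep insertion order, so the first section listed is emitted first.
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Placeholder content: highest-value pages go in the first section.
sections = {
    "Product documentation": [
        ("Getting started", "https://example.com/docs/start", "Setup guide"),
    ],
    "Thought leadership": [
        ("AI crawler study", "https://example.com/blog/ai-crawlers", "Log analysis"),
    ],
}

print(render_llms_txt("Example Co", "Placeholder summary of the site.", sections))
```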

    CRITICAL: Your llms.txt must never contradict your robots.txt. Listing a URL in llms.txt that is disallowed in robots.txt sends conflicting signals and undermines trust with the crawler. It’s like inviting someone to a party but having security throw them out at the door.
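A contradiction like this is easy to catch automatically. The following sketch pulls every Markdown link out of an llms.txt file and asks urllib.robotparser whether each watched agent may fetch it. It assumes both files belong to the same host (the parser only compares URL paths), and the agent list, file contents, and find_conflicts helper are illustrative assumptions.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches the URL part of Markdown links such as [Title](https://...).
LINK_URL = re.compile(r"\]\((https?://[^)\s]+)\)")

# Hypothetical file contents for the example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /docs/private/
"""

SAMPLE_LLMS = """\
# Example Co

## Docs

- [Guide](https://example.com/docs/guide): Public guide
- [Roadmap](https://example.com/docs/private/roadmap): Internal roadmap
"""


def find_conflicts(llms_txt, robots_txt, agents=("GPTBot", "ClaudeBot")):
    """Return (agent, url) pairs listed in llms.txt but disallowed by robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for url in LINK_URL.findall(llms_txt):
        for agent in agents:
            if not parser.can_fetch(agent, url):
                conflicts.append((agent, url))
    return conflicts


for agent, url in find_conflicts(SAMPLE_LLMS, SAMPLE_ROBOTS):
    print(f"{agent} is invited to {url} but blocked by robots.txt")
```

Running a check like this whenever either file changes keeps the two from drifting apart.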

    Step 3: Keep a Unified, Updated Policy

    The world of AI crawlers is not static. New bots will appear, and company policies will change. Your access policy cannot be a 'set it and forget it' task. Per Rebecca Gross at Goodie, as AI Overviews dominate more SERPs, this technical foundation becomes non-negotiable.

    • Audit Quarterly: Review your robots.txt and llms.txt files at least once a quarter.
    • Integrate with Publishing: When you publish a major new piece of content (like a new whitepaper or product guide), update your llms.txt immediately.
    • Stay Informed: Follow the developer blogs for OpenAI, Google, and other major AI players to stay ahead of new user agents or policy changes.
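Part of the quarterly audit can be scripted. This minimal sketch lists the watched AI agents that your robots.txt never names, meaning they fall through to the wildcard rules or to no rule at all. The watchlist is an assumption you would maintain by hand as new crawlers appear, and the sample file is invented.

```python
import re

# Assumed watchlist; update it as new AI crawlers are announced.
WATCHLIST = ("GPTBot", "Google-Extended", "CCBot", "ClaudeBot")

# Captures the token after each "User-agent:" line, ignoring case.
USER_AGENT_LINE = re.compile(
    r"^\s*user-agent\s*:\s*([^#\s]+)", re.IGNORECASE | re.MULTILINE
)

# Hypothetical robots.txt content for the example.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def agents_without_explicit_rule(robots_txt, watchlist=WATCHLIST):
    """Return watched agents that have no User-agent line of their own."""
    declared = {name.lower() for name in USER_AGENT_LINE.findall(robots_txt)}
    return [agent for agent in watchlist if agent.lower() not in declared]


print(agents_without_explicit_rule(SAMPLE_ROBOTS))
# ['Google-Extended', 'CCBot', 'ClaudeBot']
```

An agent appearing in this list is not necessarily a problem; it only means no one has made an active decision about it yet.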

    The Future: Will llms.txt Ever Matter?

    It might. Some experts believe llms.txt could evolve into a formal access control standard, especially as legal and regulatory frameworks around AI training data solidify. But we're not there yet. For the next 1-2 years, your robots.txt is the only tool with real teeth.

    Stop waiting for the llms.txt revolution. The real work of AIO is happening now in the file you've had for decades. Use robots.txt to control who gets in, and llms.txt to offer directions to the few who ask. Don't be a ghost town. Don't be a firehose of misinformation. Be a deliberate, strategic destination. Your brand's visibility in the age of AI depends on it. Explore our blog for more insights.

    Frequently Asked Questions

    What is the difference between llms.txt and robots.txt?

    The robots.txt file is a widely respected standard that CONTROLS which web crawlers (including AI bots) can access your site. The llms.txt file is a proposed standard that GUIDES AI crawlers to your most important content. Data shows that robots.txt is actively used for access control, while llms.txt is currently ignored by most AI crawlers.

    Should I block AI crawlers like GPTBot?

    It depends on your strategy. Blocking GPTBot in your robots.txt tells OpenAI not to crawl your content for model training. This protects your IP but can also reduce your brand's visibility on its platform. Allowing it increases visibility but cedes some control. Reputable sites are increasingly making strategic choices to block certain crawlers.

    Does having an llms.txt file have any benefit at all?

    Yes, but it's a minor one for now. Its primary benefit is as a 'suggestion' for crawl budget optimization, helping the few bots that read it to find your key pages more efficiently. It may also become more important in the future. Its main weakness is that only 0.13% of AI bot hits in the OtterlyAI study (84 of 62,100) requested the file.

    What happens if my llms.txt file contradicts my robots.txt file?

    This sends a conflicting signal to AI crawlers and should be avoided. If you list a URL in llms.txt but block it in robots.txt, a compliant crawler will obey robots.txt and not access the page. This contradiction can waste crawl budget and undermine the crawler's trust in your site's directives.

    How often should I update my AI crawler files?

    You should review and audit your robots.txt file on a quarterly basis to account for new AI crawlers and changes in platform policies. Your llms.txt file should be updated whenever you add or change a major piece of content that you want to highlight for AI systems.
