4 Out of 5 Websites Are Invisible to AI. Microsoft Says It's Their Own Fault.
At AdExchanger's Prog AI conference in Las Vegas last week, Microsoft AI VP Nikhil Kolar delivered a blunt message to publishers: stop blocking AI crawlers, or accept that your content will not exist in the AI era.
His data point was stark. Four out of five websites actively block AI bots through robots.txt, meta tags, or server-level restrictions. By that measure, roughly 80% of the web is invisible to ChatGPT, Google AI Overviews, Perplexity, and any other AI engine that relies on web content for grounding.
Kolar's framing was unsympathetic. "Your business is closed," he said, referring to publishers who block bots. The implication was clear: if AI engines cannot read your content, your content effectively does not exist for a growing share of information discovery.
But publishers are not backing down. And the standoff between AI companies and content owners is reshaping how visibility works on the internet.
The Numbers Behind the Blockade
The 80% figure came from Microsoft's own crawling data. When Microsoft's AI systems attempt to access websites for grounding (the process of checking AI-generated answers against real web content), they are blocked four times out of five.
This is not limited to small websites. Major publishers, media companies, and enterprise brands are all participating in the blockade. The reasons vary. Some publishers block AI bots to protect copyrighted content. Others block because they want to negotiate licensing deals before granting access. Still others block out of principle: they do not want their content used to train AI models without compensation.
The blocking mechanisms range from simple robots.txt directives (disallowing GPTBot, Google-Extended, PerplexityBot, and other AI-specific crawlers) to more sophisticated server-side detection that identifies AI crawler user agents and serves them different content or blocks them entirely.
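To make the robots.txt side of this concrete, here is a minimal sketch that checks which AI crawler tokens a given robots.txt blocks. It uses Python's standard urllib.robotparser; the token list and the sample file are illustrative assumptions, not a complete or current inventory of AI crawlers.

```python
from urllib.robotparser import RobotFileParser

# Illustrative tokens taken from the article; vendors add and rename these.
AI_TOKENS = ["GPTBot", "Google-Extended", "PerplexityBot"]


def blocked_ai_tokens(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI tokens that may not fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in AI_TOKENS if not parser.can_fetch(token, path)]


# Hypothetical robots.txt: two AI crawlers blocked, everyone else allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_ai_tokens(SAMPLE))  # ['GPTBot', 'PerplexityBot']
```

This only covers the robots.txt layer. Meta tags and server-side user-agent detection would need separate checks.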
The Publisher Counterargument
At the same Prog AI event, Jonathan Roberts of People Inc. presented the publisher side of the debate. People Inc. blocks 30,000 to 35,000 crawlers per day and grants access to only 38.
Roberts argued that blocking is not about hostility toward AI. It is about leverage. By restricting access, publishers create scarcity. Scarcity creates negotiating power. And negotiating power leads to licensing deals.
People Inc. participates in Microsoft's Publisher Content Marketplace, which licenses publisher content specifically for AI grounding (not training). The marketplace started with a handful of premium publishers and has grown to eight, with ambitions to encompass the entire open web.
The distinction between training and grounding matters. Training is the process of building AI models using large datasets. Grounding is the process of checking AI-generated answers against current, real-world sources. Publishers are generally more willing to license content for grounding than for training, because grounding requires ongoing access (which commands ongoing payments) while training is a one-time use.
Microsoft's Marketplace Play
Microsoft's Publisher Content Marketplace is positioned as the solution to the blocking problem. Instead of publishers blocking bots and AI companies scraping without permission, the marketplace creates a commercial relationship.
The economics are revealing. Kolar noted that all of Microsoft's AI computing runs on Azure. From Microsoft's perspective, licensing publisher content for grounding is "not a cost" in the traditional sense. It is a business arrangement that keeps content flowing into AI systems while compensating publishers.
But the marketplace has limitations. With only eight publishers currently participating, it represents a tiny fraction of the web. And the terms are opaque. Publishers who join are essentially betting that Microsoft's marketplace will become the dominant channel for AI content licensing, a bet that assumes Google, OpenAI, and Perplexity will either participate in the marketplace or create their own equivalents.
The Real Question: Who Controls AI Visibility?
The bot-blocking debate is really a power struggle over who controls how content appears in AI-generated answers.
In the old search model, the answer was simple: Google controlled visibility through its ranking algorithm. If you wanted to be found, you optimized for Google's algorithm. Google decided what ranked and what did not.
In the AI search model, control is fragmented across multiple actors:
- AI companies (OpenAI, Google, Microsoft, Perplexity) control which content their models synthesize into answers. They decide what sources to surface and how to weight them.
- Publishers control whether AI crawlers can access their content at all. Through robots.txt and server-level blocking, they can make themselves invisible to specific AI engines.
- Users are gaining control through features like Google's Preferred Sources, which allows users to designate specific websites as preferred sources for AI-generated answers.
- Standards bodies are working on protocols like WebMCP, which would give websites more granular control over how AI agents interact with their content.
The result is a complex negotiation where no single party has complete control. AI companies need content to generate answers. Publishers need distribution to remain relevant. Users want accurate, trustworthy answers. And everyone is trying to capture value in a rapidly shifting landscape.
What the Blocking Data Means for Your Website
If you run a website, the 80% blocking rate is both a warning and an opportunity.
The Warning
If your website blocks AI bots, you are in the majority. But majority behavior is not always optimal behavior. The 80% blocking rate means that AI engines are working with a severely limited content pool. The 20% of websites that allow AI crawling have a disproportionate influence on what AI models surface in their answers.
If your competitors allow AI crawling and you do not, your competitors will appear in AI-generated answers and you will not. For a growing share of search queries (especially informational and research queries), this means your competitors will be discovered and you will be invisible.
The Opportunity
The 80% blocking rate also means that simply allowing AI crawlers gives you a relative advantage. If only one in five websites is accessible to AI engines, being in that one-in-five group puts you ahead of the vast majority of the web.
This is not an argument for blindly opening your site to every AI crawler. It is an argument for making a deliberate, strategic choice about AI visibility rather than defaulting to blocking because it feels safe.
The Strategic Framework
The decision about whether to block AI crawlers should be based on three factors:
1. Content type. If your content is purely functional (product documentation, FAQs, service descriptions), allowing AI crawling is almost certainly beneficial. AI engines will surface your content when users ask relevant questions. If your content is editorial, creative, or proprietary, the calculus is different. You may want to restrict access while pursuing licensing deals.
2. Business model. If your business depends on being discovered (e-commerce, SaaS, lead generation), AI visibility is a growth channel and you should optimize for it. If your business depends on content as a product (subscriptions, paywalled journalism), blocking may be the right short-term strategy while you negotiate licensing.
3. Competitive landscape. Check whether your competitors allow AI crawling. If they do and you do not, you are ceding AI visibility to them. If none of your competitors allow AI crawling, you can gain a first-mover advantage by opening up.
The Middle Ground: Selective Access
The binary choice between "block everything" and "allow everything" is a false dichotomy. Most websites can benefit from a selective approach:
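- Allow grounding crawlers (crawlers that check AI answers against your content) but block training crawlers (crawlers that feed your content into model training). This works only where a vendor separates the two. OpenAI, for example, documents GPTBot for training and OAI-SearchBot for search, while Google-Extended is a single robots.txt token that, per Google's documentation, covers both Gemini training and grounding.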
- Allow specific AI engines while blocking others. If your audience uses ChatGPT but not Perplexity, you might allow GPTBot while blocking PerplexityBot.
- Allow access to public content while restricting premium or paywalled content. robots.txt rules are path-based, so this works best when premium content lives under its own URL prefix.
- Use AI crawler-specific rate limits to prevent excessive crawling without blocking access entirely.
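The first three options in the list above can be sketched as a generated robots.txt. This is a minimal illustration, not a recommended policy: build_robots_txt is a hypothetical helper, and the blocked agent and the /premium/ prefix are assumptions chosen to mirror the examples in the list.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents: list[str], restricted_paths: list[str]) -> str:
    """Render a robots.txt that blocks the named agents everywhere and
    keeps every other crawler out of the restricted paths only."""
    lines: list[str] = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("User-agent: *")
    # Specific Disallow rules come before the catch-all Allow, because
    # Python's parser applies the first rule that matches a path.
    lines += [f"Disallow: {path}" for path in restricted_paths]
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


# Hypothetical policy: block PerplexityBot, keep /premium/ off-limits to all.
policy = build_robots_txt(
    blocked_agents=["PerplexityBot"],
    restricted_paths=["/premium/"],
)

# Verify the rendered policy behaves as intended before deploying it.
parser = RobotFileParser()
parser.parse(policy.splitlines())

print(policy)
print(parser.can_fetch("GPTBot", "/blog/post"))         # True
print(parser.can_fetch("GPTBot", "/premium/story"))     # False
print(parser.can_fetch("PerplexityBot", "/blog/post"))  # False
```

Keep in mind that robots.txt is advisory. It expresses a policy to crawlers that choose to honor it and does not enforce anything on its own.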
The technical implementation is not complicated. The strategy is what requires thought.
What Happens Next
The standoff between AI companies and publishers will not be resolved quickly. Here is what to expect over the next 6-12 months:
- More licensing deals. Microsoft's Publisher Content Marketplace will expand. Google and OpenAI will announce their own licensing programs. Publishers who have been blocking bots will have more options for monetizing access.
- More sophisticated blocking. The current robots.txt approach is blunt. New standards (including WebMCP) will give publishers more granular control over what AI agents can do with their content.
- More fragmentation. Different AI engines will develop different relationships with publishers. Some will license content. Others will rely on fair use arguments. The result will be an uneven landscape where your content appears in some AI engines but not others.
- More user control. Google's Preferred Sources is the first example of users influencing AI source selection. Expect more features that let users control what appears in their AI-generated answers.
The Takeaway
The 80% blocking rate tells us something important about the current state of AI search: most of the web has opted out. Whether that is a smart strategic move or a costly mistake depends on your business model, content type, and competitive landscape.
Microsoft says publishers should open up. Publishers say they should negotiate first. Both are right from their own perspective. The companies that will win are the ones that make a deliberate, informed decision rather than defaulting to either extreme.
If you have not audited your AI crawler policy recently, now is the time. Check your robots.txt. Check your server logs for crawler activity. Check whether your competitors are visible in AI-generated answers. The data is there. The choice is yours.
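For the server-log part of that audit, a rough sketch like the following counts requests from AI crawlers by user agent. It assumes the common "combined" access log format; the token list and the sample log lines are made up for illustration. Google-Extended is left out on purpose, since it is a robots.txt token and does not appear as a user agent in request logs.

```python
import re
from collections import Counter

# Illustrative user-agent tokens; extend with the crawlers you care about.
AI_CRAWLER_TOKENS = ["GPTBot", "PerplexityBot"]

# Combined log format ends with: "request" status bytes "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_ai_crawler_hits(log_lines: list[str]) -> Counter:
    """Count requests per AI crawler token found in the user-agent field."""
    hits: Counter = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Made-up log lines for demonstration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /b HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.9 - - [01/Jan/2025:00:00:03 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
]

print(count_ai_crawler_hits(SAMPLE_LOG))
```

User-agent strings can be spoofed, so treat these counts as a first pass rather than proof of who is crawling.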