What is Robots.txt?

Robots.txt is a plain-text file at the root of a website that tells search engine crawlers which pages or sections they may and may not crawl. Compliant crawlers fetch it before requesting any other page on the site.

Key directives:

- User-agent: specifies which crawler the rules that follow apply to (* matches all crawlers)
- Allow: explicitly permits crawling of a path
- Disallow: blocks crawling of a path
- Sitemap: points crawlers to your XML sitemap
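Put together, a minimal robots.txt using all four directives might look like this (the domain and paths are placeholders, not recommendations):

```
# Hypothetical robots.txt at https://www.example.com/robots.txt
User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://www.example.com/sitemap.xml
```

Note that the Sitemap directive is independent of any User-agent group and takes a full absolute URL.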

Modern robots.txt should also address AI crawlers:

- GPTBot (OpenAI/ChatGPT)
- ClaudeBot (Anthropic/Claude)
- PerplexityBot (Perplexity AI)
- Google-Extended (Google's token for opting content out of AI training; GoogleOther is a separate general-purpose Google crawler)
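A site that decides to opt out of AI crawling can add a per-agent block for each bot. A sketch, assuming the agent names above are current (they change over time, so verify against each vendor's documentation before relying on them):

```
# Opt out of AI crawlers while leaving search crawlers unaffected
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Each User-agent group is evaluated separately, so crawlers not listed here still fall back to the site's general rules.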

Common mistakes:

1. Accidentally blocking important pages or directories
2. Blocking CSS/JS files (prevents search engines from rendering pages correctly)
3. Not including a Sitemap directive
4. Using robots.txt to hide pages (it doesn't prevent indexing; use a noindex directive for that)
5. Blocking AI crawlers without making an intentional decision
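Mistake 2 usually comes from an overly broad Disallow rule. A minimal sketch of the fix, assuming a hypothetical /assets/ directory for CSS and JS:

```
# Too broad: blocks all CSS/JS and breaks page rendering
# User-agent: *
# Disallow: /assets/

# Better: block only what must stay uncrawled, keep assets open
User-agent: *
Disallow: /assets/private/
Allow: /assets/
```

Google resolves Allow/Disallow conflicts by the most specific (longest) matching path, so the narrower Disallow wins only for /assets/private/ while the rest of /assets/ stays crawlable.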

Important: robots.txt only controls crawling, not indexing. A page blocked by robots.txt can still appear in search results, typically as a bare URL with no snippet, if other pages link to it.
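For pages that must stay out of search results, the standard mechanism is a noindex directive, and it only works if the page is NOT blocked in robots.txt (the crawler has to fetch the page to see the tag):

```
<!-- In the <head> of the page to exclude from search results -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the same signal can be sent as an `X-Robots-Tag: noindex` HTTP response header.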

Example

A website's robots.txt accidentally contains 'Disallow: /blog/', blocking all blog content from being crawled. Over the following weeks, the blog posts lose their visibility in Google. Removing the Disallow line and requesting a recrawl in Google Search Console (GSC) fixes it.
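The failure above can be reproduced with Python's standard-library robots.txt parser (the site and URLs are placeholders):

```python
from urllib import robotparser

# Simulate the broken robots.txt from the example above
broken = """User-agent: *
Disallow: /blog/
"""

rp = robotparser.RobotFileParser()
rp.parse(broken.splitlines())

# Every blog URL is now off-limits to compliant crawlers...
print(rp.can_fetch("Googlebot", "https://example.com/blog/seo-tips"))  # False
# ...while the rest of the site is unaffected
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))        # True
```

Running a check like this against a staging copy of robots.txt before deploying is a cheap way to catch an accidental Disallow.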