What is Robots.txt?

Robots.txt is a plain-text file at the root of a website that tells search engine crawlers which pages or sections they may and may not crawl. Compliant crawlers fetch it before requesting any other page on the site.

Key directives:

- User-agent: specifies which crawler the rules that follow apply to (* matches all crawlers)
- Allow: explicitly permits crawling of a path
- Disallow: blocks crawling of a path
- Sitemap: points crawlers to your XML sitemap
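Put together, a minimal robots.txt using all four directives might look like this (the domain and paths are placeholders, not recommendations):

```
# Hypothetical robots.txt at https://www.example.com/robots.txt
User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://www.example.com/sitemap.xml
```

Note that the Sitemap directive is independent of any User-agent group and takes a full absolute URL.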

Modern robots.txt should also address AI crawlers:

- GPTBot (OpenAI/ChatGPT)
- ClaudeBot (Anthropic/Claude)
- PerplexityBot (Perplexity AI)
- Google-Extended (Google's token for opting content out of AI training; GoogleOther is a separate general-purpose Google crawler)
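A site that decides to opt out of AI crawling can add a per-agent block for each bot. A sketch, assuming the agent names above are current (they change over time, so verify against each vendor's documentation before relying on them):

```
# Opt out of AI crawlers while leaving search crawlers unaffected
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Each User-agent group is evaluated separately, so crawlers not listed here still fall back to the site's general rules.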

Common mistakes:

1. Accidentally blocking important pages or directories
2. Blocking CSS/JS files (prevents search engines from rendering pages correctly)
3. Not including a Sitemap directive
4. Using robots.txt to hide pages (it doesn't prevent indexing; use a noindex directive for that)
5. Blocking AI crawlers without making an intentional decision
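Mistake 2 usually comes from an overly broad Disallow rule. A minimal sketch of the fix, assuming a hypothetical /assets/ directory for CSS and JS:

```
# Too broad: blocks all CSS/JS and breaks page rendering
# User-agent: *
# Disallow: /assets/

# Better: block only what must stay uncrawled, keep assets open
User-agent: *
Disallow: /assets/private/
Allow: /assets/
```

Google resolves Allow/Disallow conflicts by the most specific (longest) matching path, so the narrower Disallow wins only for /assets/private/ while the rest of /assets/ stays crawlable.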

Important: robots.txt only controls crawling, not indexing. A page blocked by robots.txt can still appear in search results, typically as a bare URL with no snippet, if other pages link to it.
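For pages that must stay out of search results, the standard mechanism is a noindex directive, and it only works if the page is NOT blocked in robots.txt (the crawler has to fetch the page to see the tag):

```
<!-- In the <head> of the page to exclude from search results -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the same signal can be sent as an `X-Robots-Tag: noindex` HTTP response header.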

Example

A website's robots.txt accidentally contains 'Disallow: /blog/', blocking all blog content from being crawled. Over the following weeks, the blog posts lose their visibility in Google. Removing the Disallow line and requesting a recrawl in Google Search Console (GSC) fixes it.
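The failure above can be reproduced with Python's standard-library robots.txt parser (the site and URLs are placeholders):

```python
from urllib import robotparser

# Simulate the broken robots.txt from the example above
broken = """User-agent: *
Disallow: /blog/
"""

rp = robotparser.RobotFileParser()
rp.parse(broken.splitlines())

# Every blog URL is now off-limits to compliant crawlers...
print(rp.can_fetch("Googlebot", "https://example.com/blog/seo-tips"))  # False
# ...while the rest of the site is unaffected
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))        # True
```

Running a check like this against a staging copy of robots.txt before deploying is a cheap way to catch an accidental Disallow.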