What is Robots.txt?
Robots.txt is a plain-text file at the root of a website that tells search engine crawlers which pages or sections they may and may not crawl. It's the first file well-behaved crawlers check before crawling a site.
Key directives:
- User-agent: Specifies which crawler the rule applies to (* for all)
- Allow: Explicitly permits crawling of a path
- Disallow: Blocks crawling of a path
- Sitemap: Points crawlers to your XML sitemap
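Taken together, a minimal robots.txt might look like the sketch below (the paths and sitemap URL are placeholders, not recommendations):

```
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
```

Note that the Sitemap directive takes an absolute URL and is independent of any User-agent group, so it can appear anywhere in the file.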
Modern robots.txt should also address AI crawlers:
- GPTBot (OpenAI/ChatGPT)
- ClaudeBot (Anthropic/Claude)
- PerplexityBot (Perplexity AI)
- Google-Extended (Google's token for controlling use of content in AI training)
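A site that wants to opt out of AI training while remaining visible in search can give each AI crawler its own group. A sketch using commonly documented tokens (Google-Extended is the token Google documents for AI-training opt-out; verify current token names in each vendor's documentation, as they change over time):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```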
Common mistakes:
1. Accidentally blocking important pages or directories
2. Blocking CSS/JS files (prevents rendering)
3. Not including a sitemap directive
4. Using robots.txt to hide pages (it doesn't prevent indexing — use noindex for that)
5. Blocking AI crawlers without making a deliberate decision
Important: robots.txt only controls crawling, not indexing. A page can still appear in search results (with limited info) even if blocked by robots.txt.
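Because robots.txt cannot remove a URL from the index, deindexing is signaled on the page itself, for example with a robots meta tag:

```
<meta name="robots" content="noindex">
```

(or the equivalent X-Robots-Tag: noindex HTTP header). Counterintuitively, the page must stay crawlable for this to work: if robots.txt blocks the URL, crawlers never see the noindex signal.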
Example
A website's robots.txt accidentally contains 'Disallow: /blog/', blocking all blog content from being crawled. Over the following weeks, the blog posts gradually drop out of Google's results. Removing the disallow line and requesting a recrawl in Google Search Console (GSC) fixes it.
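Mistakes like this can be caught before deploying. Python's standard-library urllib.robotparser offers a quick way to check how a rule set is interpreted (note that it applies rules in first-match order, whereas Google's parser uses longest-match precedence, so results can differ on mixed Allow/Disallow rules). This sketch reproduces the accidental /blog/ block; the URL paths are illustrative:

```python
from urllib.robotparser import RobotFileParser

# The accidental rule set from the example above.
rules = """\
User-agent: *
Disallow: /blog/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Every blog URL is now blocked for all crawlers...
print(rp.can_fetch("Googlebot", "/blog/some-post"))  # False
# ...while the rest of the site is unaffected.
print(rp.can_fetch("Googlebot", "/about"))           # True
```

Running a check like this against a staging copy of robots.txt before each release is a cheap safeguard against accidental blocks.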