robots.txt

One-liner: A text file at the root of a website that tells search engine crawlers which parts of the site they may or may not crawl.

🎯 What Is It?

robots.txt is a publicly accessible file located at https://example.com/robots.txt that implements the Robots Exclusion Protocol (RFC 9309). It tells web crawlers which parts of a site they are asked not to crawl.

⚠️ CRITICAL: robots.txt is NOT a security mechanism; it's a polite suggestion. Malicious actors ignore it and use it as a reconnaissance tool.
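Compliance is entirely voluntary: a well-behaved crawler fetches the file and checks it before requesting each URL, while anything else can simply skip the check. A minimal sketch of that check using Python's standard urllib.robotparser (the crawler name and paths are placeholders):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask it for permission before
# crawling individual URLs. Nothing enforces this -- it is an honor system.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the file

for path in ("/public/page.html", "/admin/login"):
    allowed = rp.can_fetch("MyCrawler/1.0", f"https://example.com{path}")
    print(path, "-> crawl allowed" if allowed else "-> disallowed")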

🔬 How It Works

Basic Structure

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml

Directives

| Directive    | Purpose                                                  | Example                                  |
|--------------|----------------------------------------------------------|------------------------------------------|
| User-agent:  | Selects which crawler the rules apply to                 | User-agent: Googlebot                    |
| Disallow:    | Blocks a path from being crawled                         | Disallow: /admin/                        |
| Allow:       | Permits a specific path within a broader Disallow        | Allow: /public/                          |
| Crawl-delay: | Seconds to wait between requests                         | Crawl-delay: 5                           |
| Sitemap:     | Absolute URL of the sitemap                              | Sitemap: https://example.com/sitemap.xml |
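To see how these directives behave in practice, the Basic Structure example above can be fed into Python's standard-library parser (a convenient reference, though it only does simple prefix matching and ignores wildcards; site_maps() requires Python 3.8+):

from urllib.robotparser import RobotFileParser

# Parse the example rules directly from a string and query each directive.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/public/index.html"))  # True  -- explicitly allowed
print(rp.can_fetch("*", "/admin/users"))        # False -- matches Disallow: /admin/
print(rp.crawl_delay("*"))                      # 10
print(rp.site_maps())                           # ['https://example.com/sitemap.xml']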

Wildcard Matching

# Block all .pdf files
User-agent: *
Disallow: /*.pdf$

# Block all URLs with "private" anywhere
Disallow: /*private*

# Block query parameters
Disallow: /*?
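Wildcard support is crawler-dependent: Googlebot and other major crawlers honor * and $, while simpler parsers (including Python's urllib.robotparser) treat rules as plain prefixes. A hand-rolled sketch, purely for illustration, of how the wildcard syntax maps onto regular expressions:

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # "*" matches any run of characters; "$" anchors the end of the URL.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + regex)

tests = [
    ("/*.pdf$",    "/files/report.pdf"),      # matches: ends in .pdf
    ("/*.pdf$",    "/files/report.pdf?v=2"),  # no match: "$" anchors the end
    ("/*private*", "/docs/private-notes"),    # matches: "private" anywhere
    ("/*?",        "/search?q=test"),         # matches: URL contains a query string
]

for pattern, path in tests:
    blocked = bool(robots_pattern_to_regex(pattern).match(path))
    print(pattern, path, "-> blocked" if blocked else "-> not blocked")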

🚨 Security Implications

❌ Common Misuse

# DON'T DO THIS - You're giving attackers a roadmap!
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /config/
Disallow: /database/
Disallow: /.env
Disallow: /api/internal/

Why this is bad: robots.txt is world-readable, so every Disallow line advertises a path worth probing, while the directives themselves do nothing to block access. In effect it hands attackers a self-reported map of exactly what you would rather they didn't find.

✅ Proper Approach

Keep robots.txt limited to generic crawl hints (duplicate content, query parameters, crawl rate). Protect sensitive areas with authentication, and use noindex meta tags or X-Robots-Tag response headers for reachable pages that shouldn't appear in search results, rather than naming those paths in a public file (see the Defense sketch below).

πŸ›‘οΈ Blue Team Perspective

Detection

Watch for clients that fetch robots.txt and then immediately request the disallowed paths, and for claimed search-engine user agents hitting paths a compliant crawler would skip.

Defense

1. Don't document sensitive paths in robots.txt
2. Use proper access controls, not obscurity (see the sketch after this list)
3. Monitor robots.txt requests for reconnaissance attempts
4. Remove outdated entries that leak information
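To make point 2 concrete, here is a minimal standard-library sketch of protecting a sensitive path with authentication plus an X-Robots-Tag noindex header, instead of advertising it in robots.txt (the path, port, and credentials are placeholders):

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Demo only: a sensitive path protected by HTTP Basic auth and marked
# noindex via a response header, with no mention of it in robots.txt.
EXPECTED_AUTH = "Basic " + base64.b64encode(b"admin:change-me").decode()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/admin/"):
            self.send_response(404)
            self.end_headers()
            return
        if self.headers.get("Authorization") != EXPECTED_AUTH:
            # Real enforcement: unauthenticated clients get a 401, crawler or not.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="admin"')
            self.end_headers()
            return
        self.send_response(200)
        # Belt and braces: tell compliant crawlers not to index it anyway.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        self.wfile.write(b"Admin area")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()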

SIEM Query Example

# Detect access to robots.txt-disallowed paths by clients that don't claim
# to be a known crawler (note: user agents are trivially spoofed)
index=web sourcetype=access_combined (uri_path="/admin/*" OR uri_path="/backup/*")
| stats count by src_ip, uri_path, user_agent
| where NOT match(user_agent, "(?i)googlebot|bingbot")
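Because user agents are trivially spoofed, a claimed Googlebot or Bingbot hit on a disallowed path is worth verifying. Both Google and Bing document reverse-DNS verification of crawler IPs; a hedged Python sketch (the IP is illustrative, from a published Googlebot range):

import socket

# Verify a crawler claim: reverse-resolve the source IP, check the domain,
# then forward-resolve the hostname and confirm it maps back to the same IP.
def verify_crawler_ip(ip, valid_suffixes=(".googlebot.com", ".google.com", ".search.msn.com")):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS (PTR)
        if not hostname.endswith(valid_suffixes):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward-confirm
    except (socket.herror, socket.gaierror):
        return False

print(verify_crawler_ip("66.249.66.1"))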

βš”οΈ Red Team Perspective

Reconnaissance Value

# Fetch robots.txt
curl https://target.com/robots.txt

# Look for juicy targets:
# - /admin, /backup, /config
# - /api/internal, /dev, /staging
# - File patterns: *.sql, *.bak, *.env
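A quick way to act on this during an authorized engagement is to script it: pull every Disallow entry and see which paths actually respond. A stdlib-only sketch (the target URL is a placeholder; wildcard rules are skipped for simplicity):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

TARGET = "https://target.com"

# Fetch robots.txt and extract the Disallow'd paths
with urlopen(f"{TARGET}/robots.txt", timeout=10) as resp:
    robots = resp.read().decode("utf-8", errors="replace")

paths = [
    line.split(":", 1)[1].strip()
    for line in robots.splitlines()
    if line.lower().startswith("disallow:")
]

# Probe each literal path and report the status code
for path in paths:
    if not path or "*" in path:
        continue
    try:
        with urlopen(Request(f"{TARGET}{path}", method="HEAD"), timeout=10) as r:
            print(r.status, path)
    except HTTPError as err:  # 401/403/404 responses are still informative
        print(err.code, path)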

Google Dorking

# Find sites with sensitive robots.txt entries
inurl:robots.txt intext:admin
inurl:robots.txt intext:backup
inurl:robots.txt intext:password

Tools

curl / wget for quick retrieval, Nmap's http-robots.txt NSE script, and Burp Suite, which parses robots.txt while crawling. Discovered paths feed naturally into directory brute-forcers such as gobuster or dirb.

📂 Real-World Examples

Good Example

# Prevents duplicate content indexing
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Crawl-delay: 1
Sitemap: https://example.com/sitemap.xml

Bad Example (Information Leakage)

# Actual leak from a real company (anonymized)
User-agent: *
Disallow: /admin-panel/
Disallow: /old-site-backup/
Disallow: /customer-data/
Disallow: /.git/

🎤 Interview Questions

🎤 Interview STAR Example

Situation: During recon on an authorized pentest, found that the target's robots.txt disclosed an /admin-backup/ path containing SQL dumps.
Task: Exploit the misconfiguration as part of the authorized engagement.
Action: Accessed the disallowed path directly (robots.txt imposes no access control), found SQL backup files containing customer PII, and documented it as a High-severity finding.
Result: The client removed the sensitive paths from robots.txt, put authentication in front of the directory, and moved the backups offline.

✅ Best Practices

📚 References