Sitemap

One-liner: An XML file that lists all important pages on a website to help search engine crawlers discover and index content efficiently.

🎯 What Is It?

A sitemap (specifically sitemap.xml) is a structured XML file, conventionally served at https://example.com/sitemap.xml, that gives search engine crawlers a roadmap of a website's content. It lists page URLs along with optional metadata (last modification date, change frequency, relative priority) to improve crawling efficiency and SEO.

🔬 How It Works

Basic Structure

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  
  <url>
    <loc>https://example.com/products/</loc>
    <lastmod>2024-01-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  
  <url>
    <loc>https://example.com/blog/post-1/</loc>
    <lastmod>2024-01-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>

</urlset>
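To make the structure above concrete, here is a minimal sketch of reading such a file with Python's standard library. The `SAMPLE` string is a shortened copy of the example, and `parse_urlset` is a hypothetical helper name, not part of any sitemap tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the <urlset> element in the example above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/products/</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
"""

def parse_urlset(xml_text):
    """Return one dict per <url> entry; optional children that are absent map to None."""
    # Encode to bytes so the parser honours the encoding in the XML declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": url.findtext("sm:priority", default=None, namespaces=NS),
        })
    return entries

entries = parse_urlset(SAMPLE)
for entry in entries:
    print(entry["loc"], entry["lastmod"])
```

Every element lives in the sitemap namespace, so lookups need the `sm:` prefix mapping; a bare `find("url")` would return nothing.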

XML Elements

| Element | Description | Example |
|---|---|---|
| <loc> | Full URL of the page | https://example.com/page |
| <lastmod> | Last modification date | 2024-01-15 |
| <changefreq> | How often content changes | daily, weekly, monthly |
| <priority> | Relative importance (0.0-1.0) | 1.0 (highest), 0.5 (medium) |
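The value constraints in the table can be checked mechanically. The sketch below is one possible validator, assuming each entry has already been parsed into a dict of strings; `validate_entry` is a hypothetical name. It accepts only full dates and date-times for `lastmod`, which is a simplification of the W3C Datetime format the protocol allows (year-only and year-month values would be rejected here).

```python
from datetime import datetime

# Allowed <changefreq> values defined by the sitemaps.org protocol.
ALLOWED_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

def validate_entry(entry):
    """Return a list of problems found in one parsed <url> entry (empty list if none)."""
    problems = []

    loc = entry.get("loc") or ""
    if not loc.startswith(("http://", "https://")):
        problems.append("loc must be an absolute http(s) URL")

    lastmod = entry.get("lastmod")
    if lastmod is not None:
        try:
            # "Z" is rewritten because fromisoformat only accepts it on Python 3.11+.
            datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"lastmod is not an ISO 8601 date: {lastmod!r}")

    changefreq = entry.get("changefreq")
    if changefreq is not None and changefreq not in ALLOWED_CHANGEFREQ:
        problems.append(f"unknown changefreq: {changefreq!r}")

    priority = entry.get("priority")
    if priority is not None:
        try:
            in_range = 0.0 <= float(priority) <= 1.0
        except ValueError:
            in_range = False
        if not in_range:
            problems.append(f"priority must be between 0.0 and 1.0: {priority!r}")

    return problems

good = {"loc": "https://example.com/", "lastmod": "2024-01-15",
        "changefreq": "daily", "priority": "1.0"}
bad = {"loc": "/page", "lastmod": "15/01/2024",
       "changefreq": "sometimes", "priority": "1.5"}
print(validate_entry(good))
print(validate_entry(bad))
```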

📊 Sitemap Types

1. XML Sitemap (Most Common)

2. HTML Sitemap

3. Image Sitemap

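<!-- Requires xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" on the enclosing <urlset> -->
<url>
  <loc>https://example.com/photo.html</loc>
  <image:image>
    <image:loc>https://example.com/photo.jpg</image:loc>
    <image:title>Photo Title</image:title>
  </image:image>
</url>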
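The `image:` prefix only resolves if the enclosing `<urlset>` declares the image namespace. As a rough sketch of how that declaration ends up in a generated file, the following builds an image sitemap with `xml.etree.ElementTree`; `build_image_sitemap` and its input shape are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Register prefixes so the output uses xmlns="..." and xmlns:image="..."
# instead of auto-generated ns0/ns1 prefixes.
ET.register_namespace("", SM_NS)
ET.register_namespace("image", IMAGE_NS)

def build_image_sitemap(pages):
    """Build an image sitemap; `pages` maps a page URL to a list of image URLs."""
    urlset = ET.Element(f"{{{SM_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")

image_sitemap = build_image_sitemap(
    {"https://example.com/photo.html": ["https://example.com/photo.jpg"]}
)
print(image_sitemap)
```

A video sitemap could be generated the same way by swapping in the video namespace and its element names.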

4. Video Sitemap

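<!-- Requires xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" on the enclosing <urlset> -->
<!-- Abbreviated: Google also requires <video:description> and either <video:content_loc> or <video:player_loc> -->
<url>
  <loc>https://example.com/video.html</loc>
  <video:video>
    <video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
    <video:title>Video Title</video:title>
  </video:video>
</url>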

🚨 Security Implications

⚠️ Information Disclosure

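Sitemaps can reveal paths that were never meant to be public, such as admin panels, internal API endpoints, and staging environments.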

Red Team Reconnaissance

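# Fetch sitemap
curl -s https://target.com/sitemap.xml

# Probe common locations and print the HTTP status for each
for path in /sitemap.xml /sitemap_index.xml /sitemap1.xml /sitemap-product.xml; do
  curl -s -o /dev/null -w "%{http_code} ${path}\n" "https://target.com${path}"
done

# Look for:
# - /admin, /api, /dev paths
# - High-value targets (dashboards, reports)
# - Parameter patterns for fuzzing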
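Once a sitemap has been fetched during an authorized engagement, the "look for" items can be triaged automatically. This is a small sketch, assuming the URLs have already been extracted into a list; `triage` and `INTERESTING_SEGMENTS` are hypothetical names, and the sample URLs are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Path hints taken from the "Look for" notes above; extend for a real engagement.
INTERESTING_SEGMENTS = {"admin", "api", "dev"}

def triage(urls):
    """Summarize sitemap URLs: top-level sections, interesting hits, query parameter names."""
    sections = Counter()
    interesting = []
    param_names = set()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        # Count URLs per first path segment to see how the site is laid out.
        sections[segments[0] if segments else "/"] += 1
        # Flag any URL with a path segment that matches a hint exactly.
        if INTERESTING_SEGMENTS.intersection(s.lower() for s in segments):
            interesting.append(url)
        # Collect parameter names as candidates for later fuzzing.
        param_names.update(name for name, _ in parse_qsl(parts.query, keep_blank_values=True))
    return sections, interesting, sorted(param_names)

sample_urls = [
    "https://target.com/",
    "https://target.com/products/?id=1&sort=asc",
    "https://target.com/admin/dashboard/",
    "https://target.com/products/item?id=2",
]
sections, interesting, param_names = triage(sample_urls)
print(sections.most_common())
print(interesting)
print(param_names)
```

Matching on whole path segments rather than substrings keeps a page like /developers/ from being flagged by the "dev" hint.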

Blue Team Detection

Monitor for:
1. Unauthorized changes to sitemap.xml
2. Sitemap poisoning (malicious URLs injected)
3. Excessive sitemap requests (reconnaissance)
4. Sensitive paths accidentally included
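Items 1 and 2 can be approximated by comparing the live sitemap against a known-good baseline. The sketch below assumes both URL lists are already available (for example from a nightly job) and that the site has a single expected hostname; `diff_sitemap` is a hypothetical helper.

```python
from urllib.parse import urlsplit

def diff_sitemap(baseline_urls, current_urls, expected_host):
    """Compare two URL lists and flag newly added URLs that point off the expected host."""
    baseline, current = set(baseline_urls), set(current_urls)
    added = sorted(current - baseline)
    return {
        "added": added,
        "removed": sorted(baseline - current),
        # A new entry on a foreign host is a strong hint of sitemap poisoning.
        "off_host": [u for u in added if urlsplit(u).hostname != expected_host],
    }

baseline = ["https://example.com/", "https://example.com/products/"]
current = [
    "https://example.com/",
    "https://example.com/blog/post-1/",
    "https://evil.example.net/login",
]
report = diff_sitemap(baseline, current, expected_host="example.com")
print(report)
```

Any non-empty "added" or "removed" list would then be reconciled against the deployment log before raising an alert.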

📂 Real-World Examples

Good Example

<!-- Public content only -->
<url>
  <loc>https://example.com/</loc>
  <priority>1.0</priority>
</url>
<url>
  <loc>https://example.com/products/</loc>
  <priority>0.8</priority>
</url>

Bad Example (Leaking Sensitive Paths)

<!-- DON'T DO THIS -->
<url>
  <loc>https://example.com/admin/dashboard/</loc>
</url>
<url>
  <loc>https://example.com/api/internal/users/</loc>
</url>
<url>
  <loc>https://example.com/staging/</loc>
</url>
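One way to avoid the bad example is to filter URLs before the sitemap is generated. The following is a minimal sketch using a deny-list built from the paths shown above; `public_only` and `BLOCKED_PREFIXES` are hypothetical names, and a real site would maintain its own list.

```python
from urllib.parse import urlsplit

# Prefixes taken from the bad example above.
BLOCKED_PREFIXES = ("/admin/", "/api/internal/", "/staging/")

def public_only(urls, blocked=BLOCKED_PREFIXES):
    """Return only the URLs whose path does not fall under a blocked prefix."""
    kept = []
    for url in urls:
        path = urlsplit(url).path or "/"
        # Add a trailing slash so "/admin" is treated the same as "/admin/".
        if not path.endswith("/"):
            path += "/"
        if not path.startswith(blocked):
            kept.append(url)
    return kept

candidates = [
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/admin/dashboard/",
    "https://example.com/api/internal/users/",
    "https://example.com/staging",
    "https://example.com/administrator-guide/",
]
print(public_only(candidates))
```

Filtering only keeps paths out of the sitemap; the paths themselves still need authentication, since they remain reachable by anyone who guesses them.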

🔧 Sitemap Management

Tools to Generate Sitemaps

Submitting to Search Engines

Google Search Console: https://search.google.com/search-console
Bing Webmaster Tools: https://www.bing.com/webmasters

Or add to robots.txt:
Sitemap: https://example.com/sitemap.xml
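Crawlers discover the sitemap by reading that `Sitemap:` line, and the same lookup can be reproduced with Python's standard library. This is a short sketch that assumes Python 3.8 or later (for `site_maps()`); the `ROBOTS_TXT` content and the `sitemaps_from_robots` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; only the Sitemap line comes from the note above.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

def sitemaps_from_robots(robots_text):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

print(sitemaps_from_robots(ROBOTS_TXT))
```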

Sitemap Index (for large sites)

<!-- sitemap_index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
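An index is needed because the sitemap protocol caps a single sitemap file at 50,000 URLs (and 50 MB uncompressed). The sketch below shows one way to split a large URL list and build the matching index; the `sitemap-N.xml` file naming, `chunk`, and `build_index` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemaps.org protocol

def chunk(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of `urls`, each at most `size` items long."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]

def build_index(child_sitemap_urls):
    """Return a <sitemapindex> document listing the given child sitemap URLs."""
    ET.register_namespace("", SM_NS)
    index = ET.Element(f"{{{SM_NS}}}sitemapindex")
    for child in child_sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SM_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SM_NS}}}loc").text = child
    return ET.tostring(index, encoding="unicode")

# Hypothetical site with 120,001 product pages.
all_urls = [f"https://example.com/products/{i}/" for i in range(120_001)]
parts = list(chunk(all_urls))
children = [f"https://example.com/sitemap-{n}.xml" for n in range(1, len(parts) + 1)]
index_xml = build_index(children)
print(len(parts), [len(p) for p in parts])
```

Each slice in `parts` would be written out as its own `<urlset>` file at the corresponding child URL.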

🎤 Interview Questions

🎤 Interview STAR Example

Situation: Pentest revealed client's sitemap.xml included admin dashboard and internal API endpoints.
Task: Report as finding and recommend remediation.
Action: Documented as Medium severity information disclosure. Recommended removing non-public URLs from sitemap and implementing authentication on admin paths.
Result: Client updated sitemap to exclude sensitive paths. No admin URLs discoverable via search engines post-fix.

✅ Best Practices
