
# Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

## Overview

The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.

## How It Works

1. Fetches bot lists from multiple public sources
2. Generates blocking configurations for each platform (see the sketch below)
3. Updates configurations daily via GitHub Actions
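
The exact sources and output layout live in `badbots.py`; as a rough sketch of steps 1 and 2 (the source URL, function names, and output path below are illustrative placeholders, not the script's actual values), the generator boils down to fetching a list of User-Agent substrings and rendering them into a server-specific file:

```python
# Minimal sketch of "fetch a bot list, emit an Nginx map" -- the URL and
# output path are placeholders, not what badbots.py actually uses.
import urllib.request

BOT_LIST_URL = "https://example.com/bad-user-agents.txt"  # placeholder source

def fetch_bots(url: str) -> list[str]:
    """Download the list and return one bot name per non-empty line."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

def write_nginx_map(bots: list[str], path: str) -> None:
    """Render the bots as a $bad_bot map, in the format shown below."""
    with open(path, "w") as f:
        f.write("map $http_user_agent $bad_bot {\n    default 0;\n")
        for bot in bots:
            f.write(f'    "~*{bot}" 1;\n')
        f.write("}\n")

if __name__ == "__main__":
    write_nginx_map(fetch_bots(BOT_LIST_URL), "bots.conf")
```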

## Generated Files

| Platform | File | Format |
|----------|------|--------|
| Nginx | `bots.conf` | Map directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |

## Nginx Bot Blocker

The Nginx configuration uses a `map` directive:

```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot" 1;
    "~*SemrushBot" 1;
    "~*MJ12bot" 1;
    "~*DotBot" 1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```

### Integration

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```

## Apache Bot Blocker

Uses ModSecurity rules:

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```

## HAProxy Bot Blocker

Uses ACL rules:

```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```

## Blocked Bot Categories

The following categories of bots are blocked by default:

### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot

### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI

### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot

### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers

## Customization

### Add Custom Bots

Edit the generated file or add your own patterns:

```nginx
# Nginx: add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: add a ModSecurity rule
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```

### Whitelist Bots

For Nginx, allow specific bots:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;     # Allow Google
    "~*AhrefsBot" 1;     # Block Ahrefs
}
```

### Allow All Bots for Specific Paths

A server-level `if` cannot be overridden per location, so apply the check only in the locations that should block bots and leave it out of the paths that should stay open:

```nginx
server {
    # Bot blocking applies to ordinary locations
    location / {
        if ($bad_bot) {
            return 403;
        }
    }

    # No bot check here, so all bots may reach this path
    location /public-api {
    }
}
```

## Generate Manually

Run the script to regenerate bot lists:

```bash
python badbots.py
```

The script supports fallback lists if primary sources are unavailable.
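
The fallback behaviour can be pictured as trying each source in turn and using a small bundled list only when every download fails; a minimal sketch (the URLs and bundled entries below are illustrative, not the script's real sources):

```python
# Sketch of source fallback: try each URL in order and fall back to a
# built-in list if all downloads fail. URLs are placeholders.
import urllib.request
from urllib.error import URLError

SOURCES = [
    "https://example.com/primary-bot-list.txt",    # placeholder
    "https://example.org/secondary-bot-list.txt",  # placeholder
]
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"]

def load_bot_list() -> list[str]:
    for url in SOURCES:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                lines = resp.read().decode("utf-8").splitlines()
            bots = [line.strip() for line in lines if line.strip()]
            if bots:
                return bots
        except URLError:
            continue  # source unavailable, try the next one
    return FALLBACK_BOTS  # every source failed; use the bundled list
```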

## Monitoring

### Log Blocked Bots

Enable logging to track blocked requests. Note that `access_log` is only allowed inside an `if` at location level, so place this snippet inside a `location` block:

```nginx
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```

### Analyze Bot Traffic

```bash
# Count blocked bot requests by User-Agent; in the default "combined" log
# format, $9 is the status code and $12 is the first word of the User-Agent.
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20
```

## Best Practices

1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download them from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking together with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked; see the sketch below.
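
For point 4, a quick smoke test is to send one request with a blocked User-Agent and one with a normal browser string and compare the status codes. A minimal sketch (the target URL is a placeholder for your own server):

```python
# Expect 403 for a blocked User-Agent and a non-403 for a browser-like one.
import urllib.request
from urllib.error import HTTPError

TARGET = "https://example.com/"  # placeholder: point this at your server

def status_for(user_agent: str) -> int:
    req = urllib.request.Request(TARGET, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

print("blocked UA:", status_for("AhrefsBot"))    # expect 403
print("browser UA:", status_for("Mozilla/5.0"))  # expect 200
```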

::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do not block major search engines.
:::