
# Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

## Overview

The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.

## How It Works

1. Fetches bot lists from multiple public sources
2. Generates blocking configurations for each platform (see the sketch below)
3. Updates configurations daily via GitHub Actions
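
The exact sources and output layout live in `badbots.py`; as a rough sketch of steps 1 and 2 (the source URL, function names, and output path below are illustrative placeholders, not the script's actual values), the generator boils down to fetching a list of User-Agent substrings and rendering them into a server-specific file:

```python
# Minimal sketch of "fetch a bot list, emit an Nginx map" -- the URL and
# output path are placeholders, not what badbots.py actually uses.
import urllib.request

BOT_LIST_URL = "https://example.com/bad-user-agents.txt"  # placeholder source

def fetch_bots(url: str) -> list[str]:
    """Download the list and return one bot name per non-empty line."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

def write_nginx_map(bots: list[str], path: str) -> None:
    """Render the bots as a $bad_bot map, in the format shown below."""
    with open(path, "w") as f:
        f.write("map $http_user_agent $bad_bot {\n    default 0;\n")
        for bot in bots:
            f.write(f'    "~*{bot}" 1;\n')
        f.write("}\n")

if __name__ == "__main__":
    write_nginx_map(fetch_bots(BOT_LIST_URL), "bots.conf")
```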

## Generated Files

| Platform | File | Format |
|----------|------|--------|
| Nginx | `bots.conf` | Map directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |

## Nginx Bot Blocker

The Nginx configuration uses a `map` directive:

```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot" 1;
    "~*SemrushBot" 1;
    "~*MJ12bot" 1;
    "~*DotBot" 1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```

### Integration

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```

## Apache Bot Blocker

Uses ModSecurity rules:

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```

## HAProxy Bot Blocker

Uses ACL rules:

```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```

## Blocked Bot Categories

The following categories of bots are blocked by default:

### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot

### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI

### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot

### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers

## Customization

### Add Custom Bots

Edit the generated file or add your own patterns:

```nginx
# Nginx: add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: add a ModSecurity rule
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```

### Whitelist Bots

For Nginx, allow specific bots:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;     # Allow Google
    "~*AhrefsBot" 1;     # Block Ahrefs
}
```

### Allow All Bots for Specific Paths

A server-level `if` cannot be overridden per location, so apply the check only in the locations that should block bots and leave it out of the paths that should stay open:

```nginx
server {
    # Bot blocking applies to ordinary locations
    location / {
        if ($bad_bot) {
            return 403;
        }
    }

    # No bot check here, so all bots may reach this path
    location /public-api {
    }
}
```

## Generate Manually

Run the script to regenerate bot lists:

```bash
python badbots.py
```

The script supports fallback lists if primary sources are unavailable.
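
The fallback behaviour can be pictured as trying each source in turn and using a small bundled list only when every download fails; a minimal sketch (the URLs and bundled entries below are illustrative, not the script's real sources):

```python
# Sketch of source fallback: try each URL in order and fall back to a
# built-in list if all downloads fail. URLs are placeholders.
import urllib.request
from urllib.error import URLError

SOURCES = [
    "https://example.com/primary-bot-list.txt",    # placeholder
    "https://example.org/secondary-bot-list.txt",  # placeholder
]
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"]

def load_bot_list() -> list[str]:
    for url in SOURCES:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                lines = resp.read().decode("utf-8").splitlines()
            bots = [line.strip() for line in lines if line.strip()]
            if bots:
                return bots
        except URLError:
            continue  # source unavailable, try the next one
    return FALLBACK_BOTS  # every source failed; use the bundled list
```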

## Monitoring

### Log Blocked Bots

Enable logging to track blocked requests. Note that `access_log` is only allowed inside an `if` at location level, so place this snippet inside a `location` block:

```nginx
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```

### Analyze Bot Traffic

```bash
# Count blocked bot requests by User-Agent; in the default "combined" log
# format, $9 is the status code and $12 is the first word of the User-Agent.
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20
```

## Best Practices

1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download them from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking together with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked; see the sketch below.
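
For point 4, a quick smoke test is to send one request with a blocked User-Agent and one with a normal browser string and compare the status codes. A minimal sketch (the target URL is a placeholder for your own server):

```python
# Expect 403 for a blocked User-Agent and a non-403 for a browser-like one.
import urllib.request
from urllib.error import HTTPError

TARGET = "https://example.com/"  # placeholder: point this at your server

def status_for(user_agent: str) -> int:
    req = urllib.request.Request(TARGET, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

print("blocked UA:", status_for("AhrefsBot"))    # expect 403
print("browser UA:", status_for("Mozilla/5.0"))  # expect 200
```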

::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do not block major search engines.
:::