# Bad Bot Detection
This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.
## Overview
The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform (a simplified sketch of this step is shown below)
3. Updates configurations daily via GitHub Actions
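At a high level, the generator downloads User-Agent substrings and templates them into each platform's syntax. The following is a minimal sketch of that flow for the Nginx output only; it is not the actual `badbots.py` implementation, and the source URL, fallback list, and output path are illustrative assumptions:
```python
"""Illustrative sketch of a bot-list generator (not the real badbots.py)."""
import urllib.request

# Hypothetical source: a plain list with one User-Agent substring per line.
SOURCE_URL = "https://example.com/bad-bots.txt"
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "GPTBot"]


def fetch_bots() -> list[str]:
    """Download the bot list, falling back to a built-in list if the source is down."""
    try:
        with urllib.request.urlopen(SOURCE_URL, timeout=10) as resp:
            text = resp.read().decode("utf-8")
        return [ln.strip() for ln in text.splitlines()
                if ln.strip() and not ln.startswith("#")]
    except OSError:
        return FALLBACK_BOTS


def write_nginx_map(bots: list[str], path: str = "nginx/bots.conf") -> None:
    """Emit an Nginx map file that marks each listed User-Agent as a bad bot."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    lines += [f'    "~*{bot}" 1;' for bot in bots]
    lines.append("}")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_nginx_map(fetch_bots())
```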
## Generated Files
| Platform | File | Format |
|----------|------|--------|
| Nginx | `bots.conf` | Map directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |
## Nginx Bot Blocker
The Nginx configuration uses a map directive:
```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*DotBot"     1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```
### Integration
```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
## Apache Bot Blocker
Uses ModSecurity rules:
```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
## HAProxy Bot Blocker
Uses ACL rules:
```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
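The file passed with `-f` is a plain pattern list: one case-insensitive regular expression per line, matched against the `User-Agent` header. Its exact contents come from the generated `bots.acl`, but it is expected to look roughly like this:
```text
AhrefsBot
SemrushBot
MJ12bot
GPTBot
```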
## Blocked Bot Categories
The following categories of bots are blocked by default:
### SEO/Marketing Crawlers
- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot
### AI/ML Crawlers
- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI
### Scrapers
- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot
### Malicious Bots
- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization
### Add Custom Bots
Edit the generated file for your platform, or append your own patterns:
```nginx
# Nginx: add inside the map block in bots.conf
"~*MyCustomBot" 1;
```
```apache
# Apache: add a rule with a unique id
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403,msg:'Custom Bad Bot Blocked'"
```
### Whitelist Bots
For Nginx, override specific bots to `0`. Regex entries in a `map` are evaluated in order of appearance, so place allow entries before block entries:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths
A server-level `if ($bad_bot)` runs before the location is selected, so it cannot be overridden per path. Instead, apply the check only in the locations that need it:
```nginx
location / {
    if ($bad_bot) {
        return 403;
    }
}

location /public-api {
    # No bot check here, so all User-Agents are allowed on this path
}
```
## Generate Manually
Run the script to regenerate bot lists:
```bash
python badbots.py
```
The script supports fallback lists if primary sources are unavailable.
## Monitoring
### Log Blocked Bots
Enable logging to track blocked requests:
```nginx
# Inside a location block ("access_log" is not valid in a server-level "if")
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```
### Analyze Bot Traffic
```bash
# Count blocked requests by User-Agent (field numbers assume the default
# "combined" log format; $12 is only the first word of the quoted User-Agent)
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20
```
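Because the field positions depend on the log format and the User-Agent is a quoted, multi-word field, a small parser is more robust than field counting. A minimal sketch, assuming the default `combined` log format and the log path used above:
```python
"""Count blocked (403) requests per User-Agent in an Nginx access log."""
import re
from collections import Counter

# Matches the default "combined" format: the status follows the quoted request
# line, and the User-Agent is the last quoted field.
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        if m and m.group(1) == "403":
            counts[m.group("ua")] += 1

for ua, hits in counts.most_common(20):
    print(f"{hits:6d}  {ua}")
```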
## Best Practices
1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked; a quick smoke test is sketched below.
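One way to smoke-test is to send a request with a blocked User-Agent and one with a normal browser User-Agent and compare status codes. A minimal sketch, assuming the site is reachable at `https://example.com/` and that `AhrefsBot` is on the blocked list:
```python
"""Smoke-test bot blocking: a blocked UA should get 403, a normal UA should not."""
import urllib.error
import urllib.request

SITE = "https://example.com/"  # replace with your own host
CASES = {
    "blocked bot": "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
    "normal browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}

for label, user_agent in CASES.items():
    req = urllib.request.Request(SITE, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses raise HTTPError
    print(f"{label}: HTTP {status}")
```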
::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::