# Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

## Overview

The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.

## How It Works

1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform
3. Updates configurations daily via GitHub Actions

## Generated Files

| Platform | File | Format |
|----------|------|--------|
| Nginx | `bots.conf` | Map directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |

## Nginx Bot Blocker

The Nginx configuration uses a `map` directive:

```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot" 1;
    "~*SemrushBot" 1;
    "~*MJ12bot" 1;
    "~*DotBot" 1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```

### Integration

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```

## Apache Bot Blocker

Uses ModSecurity rules:

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```

## HAProxy Bot Blocker

Uses ACL rules:

```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```

## Blocked Bot Categories

The following categories of bots are blocked by default:

### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot

### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI

### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot

### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers

## Customization

### Add Custom Bots

Edit the generated file or add your own patterns:

```nginx
# Nginx: Add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: Add rule
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```

### Whitelist Bots

For Nginx, allow specific bots. Regular expressions in a `map` are evaluated in the order they appear, so list allow entries before broader block patterns:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;   # Allow Google
    "~*AhrefsBot" 1;   # Block Ahrefs
}
```

### Allow All Bots for Specific Paths

A server-level `return` fires before any location is matched, so to exempt a path, move the check into the location blocks and leave it out of the paths that should stay open:

```nginx
# No bot check here, so bots are allowed on this path
location /public-api {
    # ... normal handling
}

location / {
    if ($bad_bot) {
        return 403;
    }
    # ... normal handling
}
```

## Generate Manually

Run the script to regenerate bot lists:

```bash
python badbots.py
```

The script supports fallback lists if primary sources are unavailable.

## Monitoring

### Log Blocked Bots

Enable logging to track blocked requests:

```nginx
# Inside a location block (access_log is not valid in a server-level if)
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```

### Analyze Bot Traffic

```bash
# Count blocked bot requests; field numbers assume the default "combined" log format
# ($9 = status code, $12 = first token of the User-Agent string)
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
    sort | uniq -c | sort -rn | head -20
```

## Best Practices

1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases (a local update sketch is included at the end of this guide).
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking together with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked, as shown below.
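
To check the configuration before rolling it out, you can replay requests with different User-Agent strings and compare the status codes. This is a minimal sketch assuming the Nginx integration shown above and a site at `https://example.com` (a placeholder, substitute your own host); the exact bot names depend on the generated list.

```bash
# Blocked crawler User-Agent: expect 403
curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" https://example.com/

# Regular browser User-Agent: expect 200
curl -s -o /dev/null -w "%{http_code}\n" \
  -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" https://example.com/

# Major search engines are not in the default lists: expect 200
curl -s -o /dev/null -w "%{http_code}\n" \
  -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/
```
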
::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::
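
The daily refresh described under How It Works can also be reproduced on your own schedule, for example from a cron job. The script below is a hypothetical sketch: it assumes the repository is checked out at `/opt/waf_patterns`, that Nginx includes the generated file from `/etc/nginx/waf/bots.conf`, and that only the Nginx output is deployed. Adjust paths and platforms to your setup.

```bash
#!/usr/bin/env bash
# Hypothetical daily update job: pull the latest lists, regenerate the
# blocking rules, deploy the Nginx output, and reload only if the config is valid.
set -euo pipefail

cd /opt/waf_patterns            # assumed checkout location
git pull --ff-only              # pick up the daily upstream updates
python badbots.py               # regenerate bots.conf / bots.acl / bots.toml

cp nginx/bots.conf /etc/nginx/waf/bots.conf   # assumed include path
nginx -t && systemctl reload nginx            # reload only after the syntax check passes
```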