# Bad Bot Detection
This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.
## Overview
The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform (a simplified sketch of this step is shown below)
3. Updates configurations daily via GitHub Actions
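At a high level, the generator downloads User-Agent substrings and templates them into each platform's syntax. The following is a minimal sketch of that flow for the Nginx output only; it is not the actual `badbots.py` implementation, and the source URL, fallback list, and output path are illustrative assumptions:
```python
"""Illustrative sketch of a bot-list generator (not the real badbots.py)."""
import urllib.request

# Hypothetical source: a plain list with one User-Agent substring per line.
SOURCE_URL = "https://example.com/bad-bots.txt"
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "GPTBot"]


def fetch_bots() -> list[str]:
    """Download the bot list, falling back to a built-in list if the source is down."""
    try:
        with urllib.request.urlopen(SOURCE_URL, timeout=10) as resp:
            text = resp.read().decode("utf-8")
        return [ln.strip() for ln in text.splitlines()
                if ln.strip() and not ln.startswith("#")]
    except OSError:
        return FALLBACK_BOTS


def write_nginx_map(bots: list[str], path: str = "nginx/bots.conf") -> None:
    """Emit an Nginx map file that marks each listed User-Agent as a bad bot."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    lines += [f'    "~*{bot}" 1;' for bot in bots]
    lines.append("}")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_nginx_map(fetch_bots())
```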
## Generated Files
| Platform | File | Format |
|----------|------|--------|
| Nginx | `bots.conf` | Map directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |
## Nginx Bot Blocker
The Nginx configuration uses a map directive:
```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*DotBot"     1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```
### Integration
```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
## Apache Bot Blocker
Uses ModSecurity rules:
```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
## HAProxy Bot Blocker
Uses ACL rules:
```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
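The file passed with `-f` is a plain pattern list: one case-insensitive regular expression per line, matched against the `User-Agent` header. Its exact contents come from the generated `bots.acl`, but it is expected to look roughly like this:
```text
AhrefsBot
SemrushBot
MJ12bot
GPTBot
```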
## Blocked Bot Categories
The following categories of bots are blocked by default:
### SEO/Marketing Crawlers
- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot
### AI/ML Crawlers
- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI
### Scrapers
- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot
### Malicious Bots
- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization
### Add Custom Bots
Edit the generated file for your platform, or append your own patterns:
```nginx
# Nginx: add inside the map block in bots.conf
"~*MyCustomBot" 1;
```
```apache
# Apache: add a rule with a unique id
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403,msg:'Custom Bad Bot Blocked'"
```
### Whitelist Bots
For Nginx, override specific bots to `0`. Regex entries in a `map` are evaluated in order of appearance, so place allow entries before block entries:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths
A server-level `if ($bad_bot)` runs before the location is selected, so it cannot be overridden per path. Instead, apply the check only in the locations that need it:
```nginx
location / {
    if ($bad_bot) {
        return 403;
    }
}

location /public-api {
    # No bot check here, so all User-Agents are allowed on this path
}
```
## Generate Manually
Run the script to regenerate bot lists:
```bash
python badbots.py
```
The script supports fallback lists if primary sources are unavailable.
## Monitoring
### Log Blocked Bots
Enable logging to track blocked requests:
```nginx
# Inside a location block ("access_log" is not valid in a server-level "if")
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```
### Analyze Bot Traffic
```bash
# Count blocked requests by User-Agent (field numbers assume the default
# "combined" log format; $12 is only the first word of the quoted User-Agent)
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20
```
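Because the field positions depend on the log format and the User-Agent is a quoted, multi-word field, a small parser is more robust than field counting. A minimal sketch, assuming the default `combined` log format and the log path used above:
```python
"""Count blocked (403) requests per User-Agent in an Nginx access log."""
import re
from collections import Counter

# Matches the default "combined" format: the status follows the quoted request
# line, and the User-Agent is the last quoted field.
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        if m and m.group(1) == "403":
            counts[m.group("ua")] += 1

for ua, hits in counts.most_common(20):
    print(f"{hits:6d}  {ua}")
```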
## Best Practices
1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked; a quick smoke test is sketched below.
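One way to smoke-test is to send a request with a blocked User-Agent and one with a normal browser User-Agent and compare status codes. A minimal sketch, assuming the site is reachable at `https://example.com/` and that `AhrefsBot` is on the blocked list:
```python
"""Smoke-test bot blocking: a blocked UA should get 403, a normal UA should not."""
import urllib.error
import urllib.request

SITE = "https://example.com/"  # replace with your own host
CASES = {
    "blocked bot": "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
    "normal browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}

for label, user_agent in CASES.items():
    req = urllib.request.Request(SITE, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses raise HTTPError
    print(f"{label}: HTTP {status}")
```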
::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::