# Bad Bot Detection
This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.
## Overview
The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform
3. Updates configurations daily via GitHub Actions
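
For a quick look at the kind of input the script consumes, you can fetch one of the upstream sources directly. The raw URL below assumes the list is published at the repository root as `robots.txt`, which may change:

```bash
# Inspect the upstream ai.robots.txt source list (the raw path is an
# assumption and may move if the upstream repo is reorganized)
curl -s https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt | head -n 20
```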
## Generated Files
| Platform | File        | Format            |
|----------|-------------|-------------------|
| Nginx    | `bots.conf` | Map directive     |
| Apache   | `bots.conf` | ModSecurity rules |
| Traefik  | `bots.toml` | Middleware config |
| HAProxy  | `bots.acl`  | ACL patterns      |
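
After a run, you can confirm the outputs exist. The `waf_patterns/<platform>/` layout below is an assumption based on the include path used in the Nginx integration example later in this guide:

```bash
# Verify the generated blocklists (directory layout assumed from
# the include path shown in the Nginx integration example)
ls -l waf_patterns/nginx/bots.conf \
      waf_patterns/apache/bots.conf \
      waf_patterns/traefik/bots.toml \
      waf_patterns/haproxy/bots.acl
```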
## Nginx Bot Blocker
The Nginx configuration uses a map directive:
```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot" 1;
    "~*SemrushBot" 1;
    "~*MJ12bot" 1;
    "~*DotBot" 1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```
### Integration
```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
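
To verify the block is active, spoof a listed User-Agent with `curl` (replace `example.com` with your own host):

```bash
# A blocked bot's User-Agent should get a 403
curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" https://example.com/
# A regular browser User-Agent should pass (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" https://example.com/
```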
## Apache Bot Blocker
The Apache configuration uses ModSecurity rules:
```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
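
These rules only take effect if ModSecurity is loaded. A quick sanity check before relying on them (module and service names can vary by distro and ModSecurity version):

```bash
# Confirm a ModSecurity module is loaded (e.g. security2_module)
apachectl -M | grep -i security
# Validate the configuration before reloading
apachectl configtest
```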
## HAProxy Bot Blocker
The HAProxy configuration uses ACL rules:
```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
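
Before reloading, let HAProxy validate the configuration, which also parses the referenced ACL file:

```bash
# -c checks the config (and the bots.acl file it references) without starting
haproxy -c -f /etc/haproxy/haproxy.cfg
```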
## Blocked Bot Categories
The following categories of bots are blocked by default:
### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot
### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI
### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot
### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization
### Add Custom Bots
Edit the generated file or add your own patterns:
```nginx
# Nginx: Add to bots.conf
"~*MyCustomBot" 1;
```
```apache
# Apache: Add rule (phase and status included for consistency
# with the generated rules)
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```
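
After editing, validate and reload, then confirm the new pattern is enforced. `MyCustomBot` is the hypothetical agent from the snippets above; replace `example.com` with your host:

```bash
# Nginx: validate the edited config and reload
nginx -t && nginx -s reload
# The custom pattern should now be blocked (expect 403)
curl -s -o /dev/null -w "%{http_code}\n" -A "MyCustomBot" https://example.com/
```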
### Whitelist Bots
For Nginx, allow specific bots:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths
An empty `if` block does not cancel the server-level check, so exempt a path by folding the URI test into a second map and blocking on that variable instead:

```nginx
# In http block
map $uri $blocked_bot {
    "~^/public-api" 0;        # never block this prefix
    default         $bad_bot; # otherwise defer to the User-Agent map
}

# In server block
if ($blocked_bot) {
    return 403;
}
```
## Generate Manually
Run the script to regenerate bot lists:
```bash
python badbots.py
```
The script supports fallback lists if primary sources are unavailable.
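
To see what a run changed, diff the regenerated outputs; the `waf_patterns/` path is an assumption based on the include path shown earlier:

```bash
# Regenerate, then review what changed (output directory assumed)
python badbots.py
git diff --stat -- waf_patterns/
```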
## Monitoring
### Log Blocked Bots
Enable logging to track blocked requests (note that `access_log` inside `if` is only valid within a `location` block):
```nginx
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```
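
With the dedicated log in place, you can watch blocks arrive in real time:

```bash
# Follow blocked-bot requests as they happen
tail -f /var/log/nginx/blocked_bots.log
```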
### Analyze Bot Traffic
```bash
# Count blocked bot requests by User-Agent (combined log format;
# split on quotes so multi-word User-Agents stay intact)
grep ' 403 ' /var/log/nginx/access.log | \
  awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20
```
## Best Practices
1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases.

2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.

3. **Combine with Rate Limiting**: Use bot blocking with rate limiting for comprehensive protection.

4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked.
::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::