# Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

## Overview

The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform
3. Updates configurations daily via GitHub Actions
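The fetch-and-generate flow above can be sketched in Python. This is a minimal illustration, not `badbots.py`'s actual API: the function names and the hardcoded pattern list are assumptions.

```python
# Sketch of the badbots.py flow: take a list of bot User-Agent
# patterns and emit an Nginx map-style bots.conf body.
# This pattern list is a hardcoded stand-in; the real script
# fetches lists from public sources such as ai.robots.txt.
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"]

def generate_nginx_map(bots):
    """Render a map block marking each bot's User-Agent as bad."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    for bot in bots:
        lines.append(f'    "~*{bot}" 1;')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_nginx_map(FALLBACK_BOTS))
```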
## Generated Files

| Platform | File        | Format            |
|----------|-------------|-------------------|
| Nginx    | `bots.conf` | Map directive     |
| Apache   | `bots.conf` | ModSecurity rules |
| Traefik  | `bots.toml` | Middleware config |
| HAProxy  | `bots.acl`  | ACL patterns      |
## Nginx Bot Blocker
The Nginx configuration uses a `map` directive:

```nginx
# In the http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*DotBot"     1;
    # ... more bots
}

# In the server block
if ($bad_bot) {
    return 403;
}
```
### Integration

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
## Apache Bot Blocker

Uses ModSecurity rules:

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
## HAProxy Bot Blocker

Uses ACL rules:

```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
## Blocked Bot Categories

The following categories of bots are blocked by default:

### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot

### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI

### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot

### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization

### Add Custom Bots

Edit the generated file or add your own patterns:

```nginx
# Nginx: add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: add a rule with a unique id
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```
### Whitelist Bots

For Nginx, allow specific bots:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths

An empty `if` block does not cancel a rule set elsewhere, so apply the check only in the locations where blocking is wanted:

```nginx
# Apply the bot check per location instead of server-wide
location / {
    if ($bad_bot) {
        return 403;
    }
}

location /public-api {
    # No bot check here: all bots are allowed
}
```
## Generate Manually

Run the script to regenerate the bot lists:

```bash
python badbots.py
```

The script supports fallback lists if the primary sources are unavailable.
## Monitoring

### Log Blocked Bots

Enable logging to track blocked requests (`access_log` inside `if` is only valid when the `if` is in a `location` context):

```nginx
location / {
    if ($bad_bot) {
        access_log /var/log/nginx/blocked_bots.log;
        return 403;
    }
}
```
### Analyze Bot Traffic

```bash
# Count blocked bot requests by User-Agent
# ($12 assumes the default combined log format; adjust for yours)
grep ' 403 ' /var/log/nginx/access.log | \
    awk '{print $12}' | sort | uniq -c | sort -rn | head -20
```
## Best Practices

1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked.

::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::
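For best practice 3, the bot map can be paired with Nginx's `limit_req`. A minimal sketch; the zone name, size, and rates are illustrative values to tune for your traffic:

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    # One shared zone keyed by client IP, 10 requests/second
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        if ($bad_bot) {
            return 403;
        }
        location / {
            limit_req zone=perip burst=20 nodelay;
        }
    }
}
```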