# Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

## Overview

The `badbots.py` script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
1. Fetches bot lists from public sources:
   - [ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)
   - Various community-maintained bot lists
2. Generates blocking configurations for each platform
3. Updates configurations daily via GitHub Actions
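The fetch-and-generate flow above can be sketched in Python. This is a minimal illustration, not `badbots.py`'s actual API: the function names and the hardcoded pattern list are assumptions.

```python
# Sketch of the badbots.py flow: take a list of bot User-Agent
# patterns and emit an Nginx map-style bots.conf body.
# This pattern list is a hardcoded stand-in; the real script
# fetches lists from public sources such as ai.robots.txt.
FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"]

def generate_nginx_map(bots):
    """Render a map block marking each bot's User-Agent as bad."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    for bot in bots:
        lines.append(f'    "~*{bot}" 1;')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_nginx_map(FALLBACK_BOTS))
```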
## Generated Files

| Platform | File        | Format            |
|----------|-------------|-------------------|
| Nginx    | `bots.conf` | Map directive     |
| Apache   | `bots.conf` | ModSecurity rules |
| Traefik  | `bots.toml` | Middleware config |
| HAProxy  | `bots.acl`  | ACL patterns      |
## Nginx Bot Blocker
The Nginx configuration uses a `map` directive:

```nginx
# In the http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*DotBot"     1;
    # ... more bots
}

# In the server block
if ($bad_bot) {
    return 403;
}
```
### Integration

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
## Apache Bot Blocker

Uses ModSecurity rules:

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
## HAProxy Bot Blocker

Uses ACL rules:

```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
## Blocked Bot Categories

The following categories of bots are blocked by default:

### SEO/Marketing Crawlers

- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot

### AI/ML Crawlers

- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI

### Scrapers

- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot

### Malicious Bots

- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization

### Add Custom Bots

Edit the generated file or add your own patterns:

```nginx
# Nginx: add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: add a rule with a unique id
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```
### Whitelist Bots

For Nginx, allow specific bots:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths

An empty `if` block does not cancel a rule set elsewhere, so apply the check only in the locations where blocking is wanted:

```nginx
# Apply the bot check per location instead of server-wide
location / {
    if ($bad_bot) {
        return 403;
    }
}

location /public-api {
    # No bot check here: all bots are allowed
}
```
## Generate Manually

Run the script to regenerate the bot lists:

```bash
python badbots.py
```

The script supports fallback lists if the primary sources are unavailable.
## Monitoring

### Log Blocked Bots

Enable logging to track blocked requests (`access_log` inside `if` is only valid when the `if` is in a `location` context):

```nginx
location / {
    if ($bad_bot) {
        access_log /var/log/nginx/blocked_bots.log;
        return 403;
    }
}
```
### Analyze Bot Traffic

```bash
# Count blocked bot requests by User-Agent
# ($12 assumes the default combined log format; adjust for yours)
grep ' 403 ' /var/log/nginx/access.log | \
    awk '{print $12}' | sort | uniq -c | sort -rn | head -20
```
## Best Practices

1. **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download from releases.
2. **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
3. **Combine with Rate Limiting**: Use bot blocking with rate limiting for comprehensive protection.
4. **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring) is not blocked.

::: warning
Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do **not** block major search engines.
:::
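For best practice 3, the bot map can be paired with Nginx's `limit_req`. A minimal sketch; the zone name, size, and rates are illustrative values to tune for your traffic:

```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    # One shared zone keyed by client IP, 10 requests/second
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        if ($bad_bot) {
            return 403;
        }
        location / {
            limit_req zone=perip burst=20 nodelay;
        }
    }
}
```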