Krawl Architecture
Overview
Krawl is a cloud-native deception honeypot server built on FastAPI. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.
Tech Stack
| Layer | Technology |
|---|---|
| Backend | FastAPI, Uvicorn, Python 3.11 |
| ORM / DB | SQLAlchemy 2.0, SQLite (WAL mode) |
| Templating | Jinja2 (server-side rendering) |
| Reactivity | Alpine.js 3.14 |
| Partial Updates | HTMX 2.0 |
| Charts | Chart.js 3.9 (doughnut), custom SVG radar |
| Maps | Leaflet 1.9 + CartoDB dark tiles |
| Scheduling | APScheduler |
| Container | Docker (python:3.11-slim), Helm/K8s ready |
Directory Structure
Krawl/
├── src/
│ ├── app.py # FastAPI app factory + lifespan
│ ├── config.py # YAML + env config loader
│ ├── dependencies.py # DI providers (templates, DB, client IP)
│ ├── database.py # DatabaseManager singleton
│ ├── models.py # SQLAlchemy ORM models
│ ├── tracker.py # In-memory + DB access tracking
│ ├── logger.py # Rotating file log handlers
│ ├── deception_responses.py # Attack detection + fake responses
│ ├── sanitizer.py # Input sanitization
│ ├── generators.py # Random content generators
│ ├── wordlists.py # JSON wordlist loader
│ ├── geo_utils.py # IP geolocation API
│ ├── ip_utils.py # IP validation
│ │
│ ├── routes/
│ │ ├── honeypot.py # Trap pages, credential capture, catch-all
│ │ ├── dashboard.py # Dashboard page (Jinja2 SSR)
│ │ ├── api.py # JSON API endpoints
│ │ └── htmx.py # HTMX HTML fragment endpoints
│ │
│ ├── middleware/
│ │ ├── deception.py # Path traversal / XXE / cmd injection detection
│ │ └── ban_check.py # Banned IP enforcement
│ │
│ ├── tasks/ # APScheduler background jobs
│ │ ├── analyze_ips.py # IP categorization scoring
│ │ ├── fetch_ip_rep.py # Geolocation + blocklist enrichment
│ │ ├── db_dump.py # Database export
│ │ ├── memory_cleanup.py # In-memory list trimming
│ │ └── top_attacking_ips.py # Top attacker caching
│ │
│ ├── tasks_master.py # Task discovery + APScheduler orchestrator
│ ├── firewall/ # Banlist export (iptables, raw)
│ ├── migrations/ # Schema migrations (auto-run)
│ │
│ └── templates/
│ ├── jinja2/
│ │ ├── base.html # Layout + CDN scripts
│ │ └── dashboard/
│ │ ├── index.html # Main dashboard page
│ │ └── partials/ # 13 HTMX fragment templates
│ ├── html/ # Deceptive trap page templates
│ └── static/
│ ├── css/dashboard.css
│ └── js/
│ ├── dashboard.js # Alpine.js app controller
│ ├── map.js # Leaflet map
│ ├── charts.js # Chart.js doughnut
│ └── radar.js # SVG radar chart
│
├── config.yaml # Application configuration
├── wordlists.json # Attack patterns + fake credentials
├── Dockerfile # Container build
├── docker-compose.yaml # Local orchestration
├── entrypoint.sh # Container startup (gosu privilege drop)
├── kubernetes/ # K8s manifests
└── helm/ # Helm chart
Application Entry Point
src/app.py uses the FastAPI application factory pattern with an async lifespan manager:
Startup Shutdown
│ │
├─ Initialize logging └─ Log shutdown
├─ Initialize SQLite DB
├─ Create AccessTracker
├─ Load webpages file (optional)
├─ Store config + tracker in app.state
├─ Start APScheduler background tasks
└─ Log dashboard URL
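The startup and shutdown ordering above can be sketched with a plain async context manager. This is a minimal illustration, not the code in src/app.py: the `events` list, the dictionary stand-ins for the tracker and config, and the `SimpleNamespace` app object are all hypothetical, and only the standard library is used so the sketch runs on its own.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace

events = []  # records the order of lifecycle steps, for illustration only


@asynccontextmanager
async def lifespan(app):
    # Startup: mirror the steps listed in the diagram.
    events.append("init logging")
    events.append("init database")
    app.state.tracker = {"requests": 0}         # stand-in for AccessTracker
    app.state.config = {"secret_path": "test"}  # stand-in for the loaded config
    events.append("start scheduler")
    try:
        yield  # the application serves requests while suspended here
    finally:
        # Shutdown: the diagram lists a single step.
        events.append("log shutdown")


async def demo():
    app = SimpleNamespace(state=SimpleNamespace())  # hypothetical app stand-in
    async with lifespan(app):
        events.append("serving")
    return app


app = asyncio.run(demo())
```

With FastAPI, a function of this shape is passed as `FastAPI(lifespan=lifespan)`, and anything stored on `app.state` during startup is available to request handlers.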
Request Pipeline
Request
│
▼
┌──────────────────────┐
│ BanCheckMiddleware │──→ IP banned? → Return 500
└──────────┬───────────┘
▼
┌──────────────────────┐
│ DeceptionMiddleware │──→ Attack detected? → Fake error response
└──────────┬───────────┘
▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
▼
┌───────────────────────┐
│ Route Matching │
│ (ordered by priority)│
│ │
│ 1. Static files │ /{secret}/static/*
│ 2. Dashboard router │ /{secret}/ (prefix-based)
│ 3. API router │ /{secret}/api/* (prefix-based)
│ 4. HTMX router │ /{secret}/htmx/* (prefix-based)
│ 5. Honeypot router │ /* (catch-all)
└───────────────────────┘
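The middleware ordering in the diagram can be modelled as nested function wrappers, where the outermost wrapper sees the request first. The sketch below uses plain dictionaries instead of real request and response objects. The ban list, the detection patterns, the Server header values, and the status code of the fake error are assumptions made for illustration.

```python
import random
import re

BANNED_IPS = {"203.0.113.9"}                              # hypothetical ban list
ATTACK_PATTERN = re.compile(r"\.\./|<!ENTITY|;\s*cat\s")  # toy patterns only
SERVER_HEADERS = ["nginx/1.18.0", "Apache/2.4.41"]        # hypothetical values


def handler(request):
    # Stand-in for route matching: always serves a trap page.
    return {"status": 200, "headers": {}, "body": "trap page"}


def server_header(next_call):
    def wrapped(request):
        response = next_call(request)
        response["headers"]["Server"] = random.choice(SERVER_HEADERS)
        return response
    return wrapped


def deception(next_call):
    def wrapped(request):
        if ATTACK_PATTERN.search(request["path"]):
            # Short-circuit: the route handler is never reached.
            return {"status": 500, "headers": {}, "body": "fake error"}
        return next_call(request)
    return wrapped


def ban_check(next_call):
    def wrapped(request):
        if request["ip"] in BANNED_IPS:
            return {"status": 500, "headers": {}, "body": ""}
        return next_call(request)
    return wrapped


# Outermost first, matching the order in the diagram.
pipeline = ban_check(deception(server_header(handler)))

banned = pipeline({"ip": "203.0.113.9", "path": "/"})
attack = pipeline({"ip": "198.51.100.1", "path": "/../../etc/passwd"})
normal = pipeline({"ip": "198.51.100.1", "path": "/admin"})
```

A banned or attacking client is answered before route matching runs, so in this model only the normal request reaches the handler and receives a randomized Server header.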
Prefix-Based Routing
Dashboard, API, and HTMX routers are mounted with prefix=f"/{secret}" in app.py. This means:
- Route handlers define paths without the secret (e.g., @router.get("/api/all-ips"))
- FastAPI prepends the secret automatically (e.g., GET /a1b2c3/api/all-ips)
- The honeypot catch-all /{path:path} only matches paths that don't start with the secret
- No _is_dashboard_path() checks needed: the prefix handles access scoping
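To make the effect of the prefix concrete, the following sketch resolves a path to the router that would handle it. It models ordered matching with a hypothetical `resolve` function rather than using FastAPI itself, and it assumes that anything the secret-prefixed routers do not claim falls through to the honeypot.

```python
def resolve(path: str, secret: str) -> str:
    """Return the name of the router that would handle `path`."""
    prefix = f"/{secret}"
    # Require an exact match or a following slash, so that
    # "/a1b2c3x/..." is not mistaken for the secret prefix.
    if path == prefix or path.startswith(prefix + "/"):
        rest = path[len(prefix):] or "/"
        if rest.startswith("/static/"):
            return "static"
        if rest.startswith("/api/"):
            return "api"
        if rest.startswith("/htmx/"):
            return "htmx"
        if rest == "/":
            return "dashboard"
    # Everything else is handled by the catch-all.
    return "honeypot"


with_secret = resolve("/a1b2c3/api/all-ips", "a1b2c3")
without_secret = resolve("/api/all-ips", "a1b2c3")
```

The same handler path, /api/all-ips, reaches the real API only when the secret is present. Without it the request lands in the honeypot.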
Route Architecture
Honeypot Routes (routes/honeypot.py)
| Method | Path | Response |
|---|---|---|
| GET | /{path:path} | Trap page with random links (catch-all) |
| HEAD | /{path:path} | 200 OK |
| POST | /{path:path} | Credential capture |
| GET | /admin, /login | Fake login form |
| GET | /wp-admin, /wp-login.php | Fake WordPress login |
| GET | /phpmyadmin | Fake phpMyAdmin |
| GET | /robots.txt | Honeypot paths advertised |
| GET/POST | /api/search, /api/sql | SQL injection honeypot |
| POST | /api/contact | XSS detection endpoint |
| GET | /.env, /credentials.txt | Fake sensitive files |
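The catch-all trap page keeps crawlers busy by linking to further paths that resolve to the catch-all again. A minimal sketch of such a generator follows. The function names, the link format, and the HTML layout are hypothetical and not taken from generators.py.

```python
import html
import random
import string


def random_slug(rng, length=8):
    """Return a random lowercase path segment."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def trap_page(path, link_count=5, seed=None):
    """Build a page whose links all lead to more generated pages."""
    rng = random.Random(seed)  # a seed makes the output reproducible in tests
    links = [f"/{random_slug(rng)}/{random_slug(rng)}" for _ in range(link_count)]
    items = "\n".join(f'<li><a href="{link}">{link}</a></li>' for link in links)
    title = html.escape(path)  # never echo the requested path unescaped
    page = f"<html><body><h1>{title}</h1><ul>\n{items}\n</ul></body></html>"
    return links, page


links, page = trap_page("/<probe>", link_count=3, seed=1)
```

Because every generated link is itself an unknown path, following any of them produces another page of the same kind, which is what makes the trap effectively infinite.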
Dashboard Routes (routes/dashboard.py)
| Method | Path | Response |
|---|---|---|
| GET | / | Server-rendered dashboard (Jinja2) |
API Routes (routes/api.py)
| Method | Path | Response |
|---|---|---|
| GET | /api/all-ips | Paginated IP list with stats |
| GET | /api/attackers | Paginated attacker IPs |
| GET | /api/ip-stats/{ip} | Single IP detail |
| GET | /api/credentials | Captured credentials |
| GET | /api/honeypot | Honeypot trigger counts |
| GET | /api/top-ips | Top requesting IPs |
| GET | /api/top-paths | Most requested paths |
| GET | /api/top-user-agents | Top user agents |
| GET | /api/attack-types-stats | Attack type distribution |
| GET | /api/attack-types | Paginated attack log |
| GET | /api/raw-request/{id} | Full HTTP request |
| GET | /api/get_banlist | Export ban rules |
HTMX Fragment Routes (routes/htmx.py)
Each returns a server-rendered Jinja2 partial (hx-swap="innerHTML"):
| Path | Template |
|---|---|
| /htmx/honeypot | honeypot_table.html |
| /htmx/top-ips | top_ips_table.html |
| /htmx/top-paths | top_paths_table.html |
| /htmx/top-ua | top_ua_table.html |
| /htmx/attackers | attackers_table.html |
| /htmx/credentials | credentials_table.html |
| /htmx/attacks | attack_types_table.html |
| /htmx/patterns | patterns_table.html |
| /htmx/ip-detail/{ip} | ip_detail.html |
Database Schema
┌─────────────────┐ ┌──────────────────┐
│ AccessLog │ │ AttackDetection │
├─────────────────┤ ├──────────────────┤
│ id (PK) │◄────│ access_log_id(FK)│
│ ip (indexed) │ │ attack_type │
│ path │ │ matched_pattern │
│ user_agent │ └──────────────────┘
│ method │
│ is_suspicious │ ┌──────────────────┐
│ is_honeypot │ │CredentialAttempt │
│ timestamp │ ├──────────────────┤
│ raw_request │ │ id (PK) │
└─────────────────┘ │ ip (indexed) │
│ path, username │
┌─────────────────┐ │ password │
│ IpStats │ │ timestamp │
├─────────────────┤ └──────────────────┘
│ ip (PK) │
│ total_requests │ ┌──────────────────┐
│ first/last_seen │ │ CategoryHistory │
│ country_code │ ├──────────────────┤
│ city, lat, lon │ │ id (PK) │
│ asn, asn_org │ │ ip (indexed) │
│ isp, reverse │ │ old_category │
│ is_proxy │ │ new_category │
│ is_hosting │ │ timestamp │
│ list_on (JSON) │ └──────────────────┘
│ category │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘
SQLite config: WAL mode, 30s busy timeout, file permissions 600.
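These three settings can be applied with the standard sqlite3 module, as in the sketch below. The `open_database` helper and the temporary path are illustrative. The project itself goes through SQLAlchemy and a DatabaseManager singleton, which are not reproduced here.

```python
import os
import sqlite3
import tempfile


def open_database(path):
    """Open a SQLite file with WAL, a 30 s busy timeout and 0600 permissions."""
    # timeout=30 makes the driver wait up to 30 s on a locked database.
    conn = sqlite3.connect(path, timeout=30)
    # WAL lets readers proceed while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("PRAGMA busy_timeout=30000")  # same limit, in milliseconds
    os.chmod(path, 0o600)  # owner read/write only
    return conn, mode


db_path = os.path.join(tempfile.mkdtemp(), "krawl.db")
conn, journal_mode = open_database(db_path)
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
```

The journal mode is stored in the database file, so it persists across connections, while the busy timeout applies per connection and has to be set each time.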
Frontend Architecture
base.html
├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
├── Static: dashboard.css
│
└── dashboard/index.html (extends base)
│
├── Stats cards ──────────── Server-rendered on page load
├── Suspicious table ─────── Server-rendered on page load
│
├── Overview tab (Alpine.js x-show)
│ ├── Honeypot table ───── HTMX hx-get on load
│ ├── Top IPs table ────── HTMX hx-get on load
│ ├── Top Paths table ──── HTMX hx-get on load
│ ├── Top UA table ─────── HTMX hx-get on load
│ └── Credentials table ── HTMX hx-get on load
│
└── Attacks tab (Alpine.js x-show, lazy init)
├── Attackers table ──── HTMX hx-get on load
├── Map ──────────────── Leaflet (init on tab switch)
├── Chart ────────────── Chart.js (init on tab switch)
├── Attack types table ─ HTMX hx-get on load
└── Patterns table ───── HTMX hx-get on load
Responsibility split:
- Alpine.js — Tab state, modals, dropdowns, lazy initialization
- HTMX — Table pagination, sorting, IP detail expansion
- Leaflet — Interactive map with category-colored markers
- Chart.js — Doughnut chart for attack type distribution
- Custom SVG — Radar charts for IP category scores
Background Tasks
Managed by TasksMaster (APScheduler). Tasks are auto-discovered from src/tasks/.
| Task | Schedule | Purpose |
|---|---|---|
| analyze_ips | Every 1 min | Score IPs into categories (attacker, crawler, user) |
| fetch_ip_rep | Every 5 min | Enrich IPs with geolocation + blocklist data |
| db_dump | Configurable | Export database backups |
| memory_cleanup | Periodic | Trim in-memory lists |
| top_attacking_ips | Periodic | Cache top attackers |
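Auto-discovery of this kind can be done by importing every module in a directory and keeping those that expose an entry point. The sketch below assumes each task module defines a callable `main()`, as fetch_ip_rep does elsewhere in this document. The `INTERVAL_SECONDS` attribute and the `discover_tasks` function are hypothetical and may not match how TasksMaster reads schedules.

```python
import importlib.util
import pathlib
import tempfile


def discover_tasks(tasks_dir):
    """Import every *.py file in tasks_dir that defines a callable main()."""
    tasks = {}
    for file in sorted(pathlib.Path(tasks_dir).glob("*.py")):
        if file.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(file.stem, file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if callable(getattr(module, "main", None)):
            # INTERVAL_SECONDS is a hypothetical per-task attribute.
            interval = getattr(module, "INTERVAL_SECONDS", 60)
            tasks[file.stem] = (module.main, interval)
    return tasks


# Demo with two throwaway files: one task and one helper without main().
demo_dir = pathlib.Path(tempfile.mkdtemp())
(demo_dir / "analyze_ips.py").write_text(
    "INTERVAL_SECONDS = 60\n\ndef main():\n    return 'scored'\n"
)
(demo_dir / "helpers.py").write_text("VALUE = 1\n")
found = discover_tasks(demo_dir)
```

Each discovered pair could then be registered with APScheduler through `scheduler.add_job(func, "interval", seconds=interval)`.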
IP Categorization Model
Each IP is scored across 4 categories based on:
- HTTP method distribution (risky methods ratio)
- Robots.txt violations
- Request timing anomalies (coefficient of variation)
- User-Agent diversity
- Attack URL detection
Categories: attacker, bad_crawler, good_crawler, regular_user, unknown
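Two pieces of this model lend themselves to a short illustration: the coefficient of variation of request timing, and the choice of a final category from per-category scores. The sketch below is an assumption-laden outline. The threshold, the fallback to unknown, and both function names are hypothetical, and the weighting in analyze_ips is not shown in this document.

```python
import statistics


def coefficient_of_variation(timestamps):
    """Std-dev of inter-request gaps divided by their mean (0.0 if undefined)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return 0.0
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean > 0 else 0.0


def pick_category(scores, threshold=1.0):
    """Return the highest-scoring category, or 'unknown' below the threshold."""
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


regular_cv = coefficient_of_variation([0, 10, 20, 30, 40])  # evenly spaced
bursty_cv = coefficient_of_variation([0, 3, 40, 42, 120])   # irregular gaps
category = pick_category({"attacker": 3.0, "good_crawler": 1.0})
```

A value near zero means the requests arrive at almost fixed intervals, which is typical of scripted clients. Human browsing tends to produce much larger values.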
Configuration
config.yaml with environment variable overrides (KRAWL_{FIELD}):
server:
port: 5000
delay: 100 # Response delay (ms)
dashboard:
secret_path: "test" # Auto-generates if null
database:
path: "data/krawl.db"
retention_days: 30
crawl:
infinite_pages_for_malicious: true
max_pages_limit: 250
ban_duration_seconds: 600
behavior:
probability_error_codes: 0 # 0-100%
canary:
token_url: null # External canary alert URL
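Environment overrides of this kind are usually applied after the YAML file is parsed. The sketch below assumes that {FIELD} is the upper-cased section and key joined by an underscore (for example KRAWL_SERVER_PORT). That naming is a guess and may not match config.py. The `DEFAULTS` dictionary stands in for the parsed YAML, and both function names are hypothetical.

```python
import copy

# Stand-in for values parsed from config.yaml.
DEFAULTS = {
    "server": {"port": 5000, "delay": 100},
    "crawl": {"infinite_pages_for_malicious": True, "max_pages_limit": 250},
}


def coerce(raw, current):
    """Convert an environment string to the type of the existing value."""
    if isinstance(current, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    return raw


def apply_env_overrides(config, environ, prefix="KRAWL"):
    """Return a copy of config with matching environment values applied."""
    merged = copy.deepcopy(config)
    for section, values in merged.items():
        for key, current in values.items():
            name = f"{prefix}_{section}_{key}".upper()  # assumed naming scheme
            if name in environ:
                values[key] = coerce(environ[name], current)
    return merged


env = {
    "KRAWL_SERVER_PORT": "8080",
    "KRAWL_CRAWL_INFINITE_PAGES_FOR_MALICIOUS": "false",
}
config = apply_env_overrides(DEFAULTS, env)
```

In a real deployment `os.environ` would be passed in place of `env`. Taking the mapping as a parameter keeps the function easy to test.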
Logging
Three rotating log files (1MB max, 5 backups each):
| Logger | File | Content |
|---|---|---|
| krawl.app | logs/krawl.log | Application events, errors |
| krawl.access | logs/access.log | HTTP access, attack detections |
| krawl.credentials | logs/credentials.log | Captured login attempts |
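A setup matching the table can be built with the standard library's RotatingFileHandler. The sketch below is illustrative rather than a copy of logger.py: the `build_logger` helper and the log format are assumptions, 1MB is read as 1024 * 1024 bytes, and the files are written to a temporary directory instead of logs/.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(name, path):
    """Create a logger that rotates at 1 MiB and keeps 5 backups."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of parent loggers
    handler = RotatingFileHandler(path, maxBytes=1024 * 1024, backupCount=5)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
loggers = {
    name: build_logger(name, os.path.join(log_dir, filename))
    for name, filename in [
        ("krawl.app", "krawl.log"),
        ("krawl.access", "access.log"),
        ("krawl.credentials", "credentials.log"),
    ]
}
loggers["krawl.access"].info("GET /admin from 198.51.100.7")
```

With `backupCount=5`, each logger keeps at most six files on disk: the active file plus five rotated copies.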
Docker
FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]
Key Data Flows
Honeypot Request
Client → BanCheck → DeceptionMiddleware → HoneypotRouter
│
┌─────────┴──────────┐
│ tracker.record() │
│ ├─ in-memory ++ │
│ ├─ detect attacks │
│ └─ DB persist │
└────────────────────┘
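The three steps inside tracker.record() can be outlined with a counter and a SQLite table. This is a simplified sketch: the class below is a hypothetical stand-in for the real AccessTracker, its table has only a few of the AccessLog columns, and its attack check is a single toy substring test.

```python
import sqlite3
from collections import Counter


class AccessTracker:
    """Toy tracker: in-memory counts plus one persisted row per request."""

    def __init__(self, conn):
        self.conn = conn
        self.ip_counts = Counter()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS access_log ("
            "id INTEGER PRIMARY KEY, ip TEXT, path TEXT, is_suspicious INTEGER)"
        )

    def record(self, ip, path):
        self.ip_counts[ip] += 1          # 1. in-memory counter
        suspicious = "../" in path       # 2. toy attack detection
        with self.conn:                  # 3. persist; commits on success
            self.conn.execute(
                "INSERT INTO access_log (ip, path, is_suspicious) VALUES (?, ?, ?)",
                (ip, path, int(suspicious)),
            )
        return suspicious


tracker = AccessTracker(sqlite3.connect(":memory:"))
first = tracker.record("198.51.100.7", "/admin")
second = tracker.record("198.51.100.7", "/../../etc/passwd")
rows = tracker.conn.execute(
    "SELECT path, is_suspicious FROM access_log ORDER BY id"
).fetchall()
```

Keeping the counters in memory allows cheap per-request checks, while the database row is what the dashboard and background tasks read later.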
Dashboard Load
Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
→ Alpine.js init → HTMX fires hx-get for each table
→ User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
→ Leaflet fetches /api/all-ips → plots markers
→ Chart.js fetches /api/attack-types-stats → renders doughnut
IP Enrichment Pipeline
APScheduler (every 5 min)
└─ fetch_ip_rep.main()
├─ DB: get unenriched IPs (limit 50)
├─ ip-api.com → geolocation (country, city, ASN, coords)
├─ iprep.lcrawl.com → blocklist memberships
└─ DB: update IpStats with enriched data
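The batch structure of this pipeline can be sketched with the two lookups passed in as functions, which keeps the example free of network calls. The `enrich_batch` function, the fake lookups, and the policy of skipping an IP when a lookup fails are all assumptions. The real task's error handling is not described here.

```python
def enrich_batch(unenriched_ips, geo_lookup, blocklist_lookup, limit=50):
    """Enrich at most `limit` IPs and return the data to store per IP."""
    results = {}
    for ip in unenriched_ips[:limit]:
        try:
            geo = geo_lookup(ip)
            lists = blocklist_lookup(ip)
        except Exception:
            continue  # leave the IP unenriched so a later run can retry it
        results[ip] = {**geo, "list_on": lists}
    return results


def fake_geo(ip):
    # Hypothetical stand-in for the geolocation API.
    if ip == "192.0.2.66":
        raise TimeoutError("lookup failed")
    return {"country_code": "NL", "city": "Amsterdam"}


def fake_blocklists(ip):
    # Hypothetical stand-in for the blocklist service.
    return ["example-list"] if ip.endswith(".7") else []


pending = ["198.51.100.7", "192.0.2.66", "203.0.113.5"]
enriched = enrich_batch(pending, fake_geo, fake_blocklists, limit=50)
```

Capping each run at a fixed batch size keeps the task within the rate limits of external services, and the five-minute schedule works through any backlog over successive runs.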