From 44235b232c587f7de6db12c303945548154fb49b Mon Sep 17 00:00:00 2001
From: Lorenzo Venerandi
Date: Tue, 17 Feb 2026 14:34:48 +0100
Subject: [PATCH] docs: add architecture documentation for Krawl project

---
 docs/architecture.md | 372 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 372 insertions(+)
 create mode 100644 docs/architecture.md

diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..75b7296
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,372 @@
+# Krawl Architecture
+
+## Overview
+
+Krawl is a cloud-native deception honeypot server built on **FastAPI**. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.
+
+## Tech Stack
+
+| Layer | Technology |
+|-------|-----------|
+| **Backend** | FastAPI, Uvicorn, Python 3.11 |
+| **ORM / DB** | SQLAlchemy 2.0, SQLite (WAL mode) |
+| **Templating** | Jinja2 (server-side rendering) |
+| **Reactivity** | Alpine.js 3.14 |
+| **Partial Updates** | HTMX 2.0 |
+| **Charts** | Chart.js 3.9 (doughnut), custom SVG radar |
+| **Maps** | Leaflet 1.9 + CartoDB dark tiles |
+| **Scheduling** | APScheduler |
+| **Container** | Docker (python:3.11-slim), Helm/K8s ready |
+
+## Directory Structure
+
+```
+Krawl/
+├── src/
+│   ├── app.py                    # FastAPI app factory + lifespan
+│   ├── config.py                 # YAML + env config loader
+│   ├── dependencies.py           # DI providers (templates, DB, client IP)
+│   ├── database.py               # DatabaseManager singleton
+│   ├── models.py                 # SQLAlchemy ORM models
+│   ├── tracker.py                # In-memory + DB access tracking
+│   ├── logger.py                 # Rotating file log handlers
+│   ├── deception_responses.py    # Attack detection + fake responses
+│   ├── sanitizer.py              # Input sanitization
+│   ├── generators.py             # Random content generators
+│   ├── wordlists.py              # JSON wordlist loader
+│   ├── geo_utils.py              # IP geolocation API
+│   ├── ip_utils.py               # IP validation
+│   │
+│   ├── routes/
+│   │   ├── honeypot.py           # Trap pages, credential capture, catch-all
+│   │   ├── dashboard.py          # Dashboard page (Jinja2 SSR)
+│   │   ├── api.py                # JSON API endpoints
+│   │   └── htmx.py               # HTMX HTML fragment endpoints
+│   │
+│   ├── middleware/
+│   │   ├── deception.py          # Path traversal / XXE / cmd injection detection
+│   │   └── ban_check.py          # Banned IP enforcement
+│   │
+│   ├── tasks/                    # APScheduler background jobs
+│   │   ├── analyze_ips.py        # IP categorization scoring
+│   │   ├── fetch_ip_rep.py       # Geolocation + blocklist enrichment
+│   │   ├── db_dump.py            # Database export
+│   │   ├── memory_cleanup.py     # In-memory list trimming
+│   │   └── top_attacking_ips.py  # Top attacker caching
+│   │
+│   ├── tasks_master.py           # Task discovery + APScheduler orchestrator
+│   ├── firewall/                 # Banlist export (iptables, raw)
+│   ├── migrations/               # Schema migrations (auto-run)
+│   │
+│   └── templates/
+│       ├── jinja2/
+│       │   ├── base.html         # Layout + CDN scripts
+│       │   └── dashboard/
+│       │       ├── index.html    # Main dashboard page
+│       │       └── partials/     # 13 HTMX fragment templates
+│       ├── html/                 # Deceptive trap page templates
+│       └── static/
+│           ├── css/dashboard.css
+│           └── js/
+│               ├── dashboard.js  # Alpine.js app controller
+│               ├── map.js        # Leaflet map
+│               ├── charts.js     # Chart.js doughnut
+│               └── radar.js      # SVG radar chart
+│
+├── config.yaml                   # Application configuration
+├── wordlists.json                # Attack patterns + fake credentials
+├── Dockerfile                    # Container build
+├── docker-compose.yaml           # Local orchestration
+├── entrypoint.sh                 # Container startup (gosu privilege drop)
+├── kubernetes/                   # K8s manifests
+└── helm/                         # Helm chart
+```
+
+## Application Entry Point
+
+`src/app.py` uses the **FastAPI application factory** pattern with an async lifespan manager:
+
+```
+Startup                                    Shutdown
+  │                                          │
+  ├─ Initialize logging                      └─ Log shutdown
+  ├─ Initialize SQLite DB
+  ├─ Create AccessTracker
+  ├─ Load webpages file (optional)
+  ├─ Store config + tracker in app.state
+  ├─ Start APScheduler background tasks
+  └─ Log dashboard URL
+```
+
+## Request Pipeline
+
+```
+        Request
+           │
+           ▼
+┌──────────────────────┐
+│ BanCheckMiddleware   │──→ IP banned? → Return 500
+└──────────┬───────────┘
+           ▼
+┌──────────────────────┐
+│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
+└──────────┬───────────┘
+           ▼
+┌───────────────────────┐
+│ ServerHeaderMiddleware│──→ Add random Server header
+└──────────┬────────────┘
+           ▼
+┌───────────────────────┐
+│  Route Matching       │
+│  (ordered by priority)│
+│                       │
+│  1. Static files      │ /{secret}/static/*
+│  2. Dashboard router  │ /{secret}/         (prefix-based)
+│  3. API router        │ /{secret}/api/*    (prefix-based)
+│  4. HTMX router       │ /{secret}/htmx/*   (prefix-based)
+│  5. Honeypot router   │ /*                 (catch-all)
+└───────────────────────┘
+```
+
+### Prefix-Based Routing
+
+Dashboard, API, and HTMX routers are mounted with `prefix=f"/{secret}"` in `app.py`. This means:
+- Route handlers define paths **without** the secret (e.g., `@router.get("/api/all-ips")`)
+- FastAPI prepends the secret automatically (e.g., `GET /a1b2c3/api/all-ips`)
+- The honeypot catch-all `/{path:path}` only matches paths that **don't** start with the secret
+- No `_is_dashboard_path()` checks needed: the prefix handles access scoping
+
+## Route Architecture
+
+### Honeypot Routes (`routes/honeypot.py`)
+
+| Method | Path | Response |
+|--------|------|----------|
+| `GET` | `/{path:path}` | Trap page with random links (catch-all) |
+| `HEAD` | `/{path:path}` | 200 OK |
+| `POST` | `/{path:path}` | Credential capture |
+| `GET` | `/admin`, `/login` | Fake login form |
+| `GET` | `/wp-admin`, `/wp-login.php` | Fake WordPress login |
+| `GET` | `/phpmyadmin` | Fake phpMyAdmin |
+| `GET` | `/robots.txt` | Honeypot paths advertised |
+| `GET/POST` | `/api/search`, `/api/sql` | SQL injection honeypot |
+| `POST` | `/api/contact` | XSS detection endpoint |
+| `GET` | `/.env`, `/credentials.txt` | Fake sensitive files |
+
+### Dashboard Routes (`routes/dashboard.py`)
+
+| Method | Path | Response |
+|--------|------|----------|
+| `GET` | `/` | Server-rendered dashboard (Jinja2) |
+
+### API Routes (`routes/api.py`)
+
+| Method | Path | Response |
+|--------|------|----------|
+| `GET` | `/api/all-ips` | Paginated IP list with stats |
+| `GET` | `/api/attackers` | Paginated attacker IPs |
+| `GET` | `/api/ip-stats/{ip}` | Single IP detail |
+| `GET` | `/api/credentials` | Captured credentials |
+| `GET` | `/api/honeypot` | Honeypot trigger counts |
+| `GET` | `/api/top-ips` | Top requesting IPs |
+| `GET` | `/api/top-paths` | Most requested paths |
+| `GET` | `/api/top-user-agents` | Top user agents |
+| `GET` | `/api/attack-types-stats` | Attack type distribution |
+| `GET` | `/api/attack-types` | Paginated attack log |
+| `GET` | `/api/raw-request/{id}` | Full HTTP request |
+| `GET` | `/api/get_banlist` | Export ban rules |
+
+### HTMX Fragment Routes (`routes/htmx.py`)
+
+Each returns a server-rendered Jinja2 partial (`hx-swap="innerHTML"`):
+
+| Path | Template |
+|------|----------|
+| `/htmx/honeypot` | `honeypot_table.html` |
+| `/htmx/top-ips` | `top_ips_table.html` |
+| `/htmx/top-paths` | `top_paths_table.html` |
+| `/htmx/top-ua` | `top_ua_table.html` |
+| `/htmx/attackers` | `attackers_table.html` |
+| `/htmx/credentials` | `credentials_table.html` |
+| `/htmx/attacks` | `attack_types_table.html` |
+| `/htmx/patterns` | `patterns_table.html` |
+| `/htmx/ip-detail/{ip}` | `ip_detail.html` |
+
+## Database Schema
+
+```
+┌─────────────────┐     ┌──────────────────┐
+│ AccessLog       │     │ AttackDetection  │
+├─────────────────┤     ├──────────────────┤
+│ id (PK)         │◄────│ access_log_id(FK)│
+│ ip (indexed)    │     │ attack_type      │
+│ path            │     │ matched_pattern  │
+│ user_agent      │     └──────────────────┘
+│ method          │
+│ is_suspicious   │     ┌──────────────────┐
+│ is_honeypot     │     │CredentialAttempt │
+│ timestamp       │     ├──────────────────┤
+│ raw_request     │     │ id (PK)          │
+└─────────────────┘     │ ip (indexed)     │
+                        │ path, username   │
+┌─────────────────┐     │ password         │
+│ IpStats         │     │ timestamp        │
+├─────────────────┤     └──────────────────┘
+│ ip (PK)         │
+│ total_requests  │     ┌──────────────────┐
+│ first/last_seen │     │ CategoryHistory  │
+│ country_code    │     ├──────────────────┤
+│ city, lat, lon  │     │ id (PK)          │
+│ asn, asn_org    │     │ ip (indexed)     │
+│ isp, reverse    │     │ old_category     │
+│ is_proxy        │     │ new_category     │
+│ is_hosting      │     │ timestamp        │
+│ list_on (JSON)  │     └──────────────────┘
+│ category        │
+│ category_scores │
+│ analyzed_metrics│
+│ manual_category │
+└─────────────────┘
+```
+
+**SQLite config:** WAL mode, 30s busy timeout, file permissions 600.
+
+## Frontend Architecture
+
+```
+base.html
+  ├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
+  ├── Static: dashboard.css
+  │
+  └── dashboard/index.html (extends base)
+        │
+        ├── Stats cards ──────────── Server-rendered on page load
+        ├── Suspicious table ─────── Server-rendered on page load
+        │
+        ├── Overview tab (Alpine.js x-show)
+        │     ├── Honeypot table ───── HTMX hx-get on load
+        │     ├── Top IPs table ────── HTMX hx-get on load
+        │     ├── Top Paths table ──── HTMX hx-get on load
+        │     ├── Top UA table ─────── HTMX hx-get on load
+        │     └── Credentials table ── HTMX hx-get on load
+        │
+        └── Attacks tab (Alpine.js x-show, lazy init)
+              ├── Attackers table ──── HTMX hx-get on load
+              ├── Map ──────────────── Leaflet (init on tab switch)
+              ├── Chart ────────────── Chart.js (init on tab switch)
+              ├── Attack types table ─ HTMX hx-get on load
+              └── Patterns table ───── HTMX hx-get on load
+```
+
+**Responsibility split:**
+- **Alpine.js**: Tab state, modals, dropdowns, lazy initialization
+- **HTMX**: Table pagination, sorting, IP detail expansion
+- **Leaflet**: Interactive map with category-colored markers
+- **Chart.js**: Doughnut chart for attack type distribution
+- **Custom SVG**: Radar charts for IP category scores
+
+## Background Tasks
+
+Managed by `TasksMaster` (APScheduler). Tasks are auto-discovered from `src/tasks/`.
+
+| Task | Schedule | Purpose |
+|------|----------|---------|
+| `analyze_ips` | Every 1 min | Score IPs into categories (attacker, crawler, user) |
+| `fetch_ip_rep` | Every 5 min | Enrich IPs with geolocation + blocklist data |
+| `db_dump` | Configurable | Export database backups |
+| `memory_cleanup` | Periodic | Trim in-memory lists |
+| `top_attacking_ips` | Periodic | Cache top attackers |
+
+### IP Categorization Model
+
+Each IP is scored across 4 categories based on:
+- HTTP method distribution (risky methods ratio)
+- Robots.txt violations
+- Request timing anomalies (coefficient of variation)
+- User-Agent diversity
+- Attack URL detection
+
+Categories: `attacker`, `bad_crawler`, `good_crawler`, `regular_user`, `unknown`
+
+## Configuration
+
+`config.yaml` with environment variable overrides (`KRAWL_{FIELD}`):
+
+```yaml
+server:
+  port: 5000
+  delay: 100  # Response delay (ms)
+
+dashboard:
+  secret_path: "test"  # Auto-generates if null
+
+database:
+  path: "data/krawl.db"
+  retention_days: 30
+
+crawl:
+  infinite_pages_for_malicious: true
+  max_pages_limit: 250
+  ban_duration_seconds: 600
+
+behavior:
+  probability_error_codes: 0  # 0-100%
+
+canary:
+  token_url: null  # External canary alert URL
+```
+
+## Logging
+
+Three rotating log files (1MB max, 5 backups each):
+
+| Logger | File | Content |
+|--------|------|---------|
+| `krawl.app` | `logs/krawl.log` | Application events, errors |
+| `krawl.access` | `logs/access.log` | HTTP access, attack detections |
+| `krawl.credentials` | `logs/credentials.log` | Captured login attempts |
+
+## Docker
+
+```dockerfile
+FROM python:3.11-slim
+# Non-root user: krawl:1000
+# Volumes: /app/logs, /app/data, /app/exports
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]
+```
+
+## Key Data Flows
+
+### Honeypot Request
+
+```
+Client → BanCheck → DeceptionMiddleware → HoneypotRouter
+                                                │
+                                      ┌─────────┴──────────┐
+                                      │ tracker.record()   │
+                                      │   ├─ in-memory ++  │
+                                      │   ├─ detect attacks│
+                                      │   └─ DB persist    │
+                                      └────────────────────┘
+```
+
+### Dashboard Load
+
+```
+Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
+        → Alpine.js init → HTMX fires hx-get for each table
+        → User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
+            → Leaflet fetches /api/all-ips → plots markers
+            → Chart.js fetches /api/attack-types-stats → renders doughnut
+```
+
+### IP Enrichment Pipeline
+
+```
+APScheduler (every 5 min)
+  └─ fetch_ip_rep.main()
+       ├─ DB: get unenriched IPs (limit 50)
+       ├─ ip-api.com → geolocation (country, city, ASN, coords)
+       ├─ iprep.lcrawl.com → blocklist memberships
+       └─ DB: update IpStats with enriched data
+```
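
Review note on the IP Enrichment Pipeline flow: the four steps in that diagram (select a bounded batch, look up geolocation, look up blocklist memberships, write back) can be sketched as below. This is a minimal illustration and not the code in `fetch_ip_rep.py`. `IpRecord`, `enrich_batch`, `fake_geo`, `fake_blocklist`, and the `enriched` flag are hypothetical names introduced here; the patch does not say how unenriched rows are identified. The two remote services are passed in as plain callables so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

BATCH_LIMIT = 50  # taken from the flow: "get unenriched IPs (limit 50)"


@dataclass
class IpRecord:
    """Hypothetical stand-in for an IpStats row (subset of its columns)."""
    ip: str
    country_code: Optional[str] = None
    city: Optional[str] = None
    list_on: Dict[str, object] = field(default_factory=dict)
    enriched: bool = False  # assumption: some marker separates pending rows


# Each lookup takes an IP string and returns a dict of enrichment data.
Lookup = Callable[[str], Dict[str, object]]


def enrich_batch(
    records: List[IpRecord],
    geo_lookup: Lookup,
    blocklist_lookup: Lookup,
    limit: int = BATCH_LIMIT,
) -> int:
    """Enrich at most `limit` pending records; return how many succeeded."""
    pending = [r for r in records if not r.enriched][:limit]
    done = 0
    for rec in pending:
        try:
            geo = geo_lookup(rec.ip)
            lists = blocklist_lookup(rec.ip)
        except Exception:
            # Leave the record pending so a later scheduled run can retry it.
            continue
        rec.country_code = geo.get("country_code")
        rec.city = geo.get("city")
        rec.list_on = dict(lists)
        rec.enriched = True
        done += 1
    return done


def fake_geo(ip: str) -> Dict[str, object]:
    """Offline placeholder for the geolocation service."""
    return {"country_code": "ZZ", "city": "Example City"}


def fake_blocklist(ip: str) -> Dict[str, object]:
    """Offline placeholder for the blocklist service."""
    return {"example_blocklist": True}
```

Calling `enrich_batch(records, fake_geo, fake_blocklist)` on 60 pending records processes 50 on the first call and the remaining 10 on the next, which mirrors how a fixed batch limit combined with a 5-minute schedule would work through a backlog.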