# Krawl Architecture

## Overview

Krawl is a cloud-native deception honeypot server built on **FastAPI**. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.

## Tech Stack

| Layer | Technology |
|-------|-----------|
| **Backend** | FastAPI, Uvicorn, Python 3.11 |
| **ORM / DB** | SQLAlchemy 2.0, SQLite (WAL mode) |
| **Templating** | Jinja2 (server-side rendering) |
| **Reactivity** | Alpine.js 3.14 |
| **Partial Updates** | HTMX 2.0 |
| **Charts** | Chart.js 3.9 (doughnut), custom SVG radar |
| **Maps** | Leaflet 1.9 + CartoDB dark tiles |
| **Scheduling** | APScheduler |
| **Container** | Docker (python:3.11-slim), Helm/K8s ready |

## Directory Structure

```
Krawl/
├── src/
│   ├── app.py                      # FastAPI app factory + lifespan
│   ├── config.py                   # YAML + env config loader
│   ├── dependencies.py             # DI providers (templates, DB, client IP)
│   ├── database.py                 # DatabaseManager singleton
│   ├── models.py                   # SQLAlchemy ORM models
│   ├── tracker.py                  # In-memory + DB access tracking
│   ├── logger.py                   # Rotating file log handlers
│   ├── deception_responses.py      # Attack detection + fake responses
│   ├── sanitizer.py                # Input sanitization
│   ├── generators.py               # Random content generators
│   ├── wordlists.py                # JSON wordlist loader
│   ├── geo_utils.py                # IP geolocation API
│   ├── ip_utils.py                 # IP validation
│   │
│   ├── routes/
│   │   ├── honeypot.py             # Trap pages, credential capture, catch-all
│   │   ├── dashboard.py            # Dashboard page (Jinja2 SSR)
│   │   ├── api.py                  # JSON API endpoints
│   │   └── htmx.py                 # HTMX HTML fragment endpoints
│   │
│   ├── middleware/
│   │   ├── deception.py            # Path traversal / XXE / cmd injection detection
│   │   └── ban_check.py            # Banned IP enforcement
│   │
│   ├── tasks/                      # APScheduler background jobs
│   │   ├── analyze_ips.py          # IP categorization scoring
│   │   ├── fetch_ip_rep.py         # Geolocation + blocklist enrichment
│   │   ├── db_dump.py              # Database export
│   │   ├── memory_cleanup.py       # In-memory list trimming
│   │   └── top_attacking_ips.py    # Top attacker caching
│   │
│   ├── tasks_master.py             # Task discovery + APScheduler orchestrator
│   ├── firewall/                   # Banlist export (iptables, raw)
│   ├── migrations/                 # Schema migrations (auto-run)
│   │
│   └── templates/
│       ├── jinja2/
│       │   ├── base.html           # Layout + CDN scripts
│       │   └── dashboard/
│       │       ├── index.html      # Main dashboard page
│       │       └── partials/       # 13 HTMX fragment templates
│       ├── html/                   # Deceptive trap page templates
│       └── static/
│           ├── css/dashboard.css
│           └── js/
│               ├── dashboard.js    # Alpine.js app controller
│               ├── map.js          # Leaflet map
│               ├── charts.js       # Chart.js doughnut
│               └── radar.js        # SVG radar chart
│
├── config.yaml                     # Application configuration
├── wordlists.json                  # Attack patterns + fake credentials
├── Dockerfile                      # Container build
├── docker-compose.yaml             # Local orchestration
├── entrypoint.sh                   # Container startup (gosu privilege drop)
├── kubernetes/                     # K8s manifests
└── helm/                           # Helm chart
```

## Application Entry Point

`src/app.py` uses the **FastAPI application factory** pattern with an async lifespan manager:

```
Startup                                    Shutdown
  │                                           │
  ├─ Initialize logging                       └─ Log shutdown
  ├─ Initialize SQLite DB
  ├─ Create AccessTracker
  ├─ Load webpages file (optional)
  ├─ Store config + tracker in app.state
  ├─ Start APScheduler background tasks
  └─ Log dashboard URL
```

## Request Pipeline

```
Request
  │
  ▼
┌──────────────────────┐
│  BanCheckMiddleware  │──→ IP banned? → Return 500
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
└──────────┬───────────┘
           ▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│ Route Matching        │
│ (ordered by priority) │
│                       │
│ 1. Static files       │  /{secret}/static/*
│ 2. Dashboard router   │  /{secret}/ (prefix-based)
│ 3. API router         │  /{secret}/api/* (prefix-based)
│ 4. HTMX router        │  /{secret}/htmx/* (prefix-based)
│ 5. Honeypot router    │  /* (catch-all)
└───────────────────────┘
```

### Prefix-Based Routing

Dashboard, API, and HTMX routers are mounted with `prefix=f"/{secret}"` in `app.py`. This means:

- Route handlers define paths **without** the secret (e.g., `@router.get("/api/all-ips")`)
- FastAPI prepends the secret automatically (e.g., `GET /a1b2c3/api/all-ips`)
- The honeypot catch-all `/{path:path}` only matches paths that **don't** start with the secret
- No `_is_dashboard_path()` checks are needed; the prefix handles access scoping

## Route Architecture

### Honeypot Routes (`routes/honeypot.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/{path:path}` | Trap page with random links (catch-all) |
| `HEAD` | `/{path:path}` | 200 OK |
| `POST` | `/{path:path}` | Credential capture |
| `GET` | `/admin`, `/login` | Fake login form |
| `GET` | `/wp-admin`, `/wp-login.php` | Fake WordPress login |
| `GET` | `/phpmyadmin` | Fake phpMyAdmin |
| `GET` | `/robots.txt` | Honeypot paths advertised |
| `GET/POST` | `/api/search`, `/api/sql` | SQL injection honeypot |
| `POST` | `/api/contact` | XSS detection endpoint |
| `GET` | `/.env`, `/credentials.txt` | Fake sensitive files |

### Dashboard Routes (`routes/dashboard.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/` | Server-rendered dashboard (Jinja2) |

### API Routes (`routes/api.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/api/all-ips` | Paginated IP list with stats |
| `GET` | `/api/attackers` | Paginated attacker IPs |
| `GET` | `/api/ip-stats/{ip}` | Single IP detail |
| `GET` | `/api/credentials` | Captured credentials |
| `GET` | `/api/honeypot` | Honeypot trigger counts |
| `GET` | `/api/top-ips` | Top requesting IPs |
| `GET` | `/api/top-paths` | Most requested paths |
| `GET` | `/api/top-user-agents` | Top user agents |
| `GET` | `/api/attack-types-stats` | Attack type distribution |
| `GET` | `/api/attack-types` | Paginated attack log |
| `GET` | `/api/raw-request/{id}` | Full HTTP request |
| `GET` | `/api/get_banlist` | Export ban rules |

### HTMX Fragment Routes (`routes/htmx.py`)

Each returns a server-rendered Jinja2 partial (`hx-swap="innerHTML"`):

| Path | Template |
|------|----------|
| `/htmx/honeypot` | `honeypot_table.html` |
| `/htmx/top-ips` | `top_ips_table.html` |
| `/htmx/top-paths` | `top_paths_table.html` |
| `/htmx/top-ua` | `top_ua_table.html` |
| `/htmx/attackers` | `attackers_table.html` |
| `/htmx/credentials` | `credentials_table.html` |
| `/htmx/attacks` | `attack_types_table.html` |
| `/htmx/patterns` | `patterns_table.html` |
| `/htmx/ip-detail/{ip}` | `ip_detail.html` |

## Database Schema

```
┌─────────────────┐     ┌──────────────────┐
│    AccessLog    │     │ AttackDetection  │
├─────────────────┤     ├──────────────────┤
│ id (PK)         │◄────│ access_log_id(FK)│
│ ip (indexed)    │     │ attack_type      │
│ path            │     │ matched_pattern  │
│ user_agent      │     └──────────────────┘
│ method          │
│ is_suspicious   │     ┌──────────────────┐
│ is_honeypot     │     │CredentialAttempt │
│ timestamp       │     ├──────────────────┤
│ raw_request     │     │ id (PK)          │
└─────────────────┘     │ ip (indexed)     │
                        │ path, username   │
┌─────────────────┐     │ password         │
│     IpStats     │     │ timestamp        │
├─────────────────┤     └──────────────────┘
│ ip (PK)         │
│ total_requests  │     ┌──────────────────┐
│ first/last_seen │     │ CategoryHistory  │
│ country_code    │     ├──────────────────┤
│ city, lat, lon  │     │ id (PK)          │
│ asn, asn_org    │     │ ip (indexed)     │
│ isp, reverse    │     │ old_category     │
│ is_proxy        │     │ new_category     │
│ is_hosting      │     │ timestamp        │
│ list_on (JSON)  │     └──────────────────┘
│ category        │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘
```

**SQLite config:** WAL mode, 30s busy timeout, file permissions 600.

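
The SQLite settings listed above can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module rather than SQLAlchemy to stay self-contained, and the function name `open_honeypot_db` is hypothetical; the actual `DatabaseManager` may apply these settings differently.

```python
import os
import sqlite3
import stat
import tempfile


def open_honeypot_db(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite file with WAL, a 30 s busy timeout, and 0600 permissions.

    Hypothetical helper for illustration; not taken from the Krawl source.
    """
    # Create the file up front so permissions are tight before any data is written.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.close(fd)
    os.chmod(path, 0o600)  # also tightens a pre-existing file

    conn = sqlite3.connect(path, timeout=30)   # timeout is in seconds
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=30000")  # same 30 s limit, in milliseconds
    return conn


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "krawl.db")
    connection = open_honeypot_db(db_path)
    print(connection.execute("PRAGMA journal_mode").fetchone()[0])
    print(oct(stat.S_IMODE(os.stat(db_path).st_mode)))
    connection.close()
```

WAL mode suits this workload because request handlers write access logs continuously while dashboard queries read concurrently.
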

## Frontend Architecture

```
base.html
├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
├── Static: dashboard.css
│
└── dashboard/index.html (extends base)
    │
    ├── Stats cards ──────────── Server-rendered on page load
    ├── Suspicious table ─────── Server-rendered on page load
    │
    ├── Overview tab (Alpine.js x-show)
    │   ├── Honeypot table ───── HTMX hx-get on load
    │   ├── Top IPs table ────── HTMX hx-get on load
    │   ├── Top Paths table ──── HTMX hx-get on load
    │   ├── Top UA table ─────── HTMX hx-get on load
    │   └── Credentials table ── HTMX hx-get on load
    │
    └── Attacks tab (Alpine.js x-show, lazy init)
        ├── Attackers table ──── HTMX hx-get on load
        ├── Map ──────────────── Leaflet (init on tab switch)
        ├── Chart ────────────── Chart.js (init on tab switch)
        ├── Attack types table ─ HTMX hx-get on load
        └── Patterns table ───── HTMX hx-get on load
```

**Responsibility split:**

- **Alpine.js**: Tab state, modals, dropdowns, lazy initialization
- **HTMX**: Table pagination, sorting, IP detail expansion
- **Leaflet**: Interactive map with category-colored markers
- **Chart.js**: Doughnut chart for attack type distribution
- **Custom SVG**: Radar charts for IP category scores

## Background Tasks

Managed by `TasksMaster` (APScheduler). Tasks are auto-discovered from `src/tasks/`.

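
Auto-discovery of this kind can be sketched with `importlib`. This is an illustration under assumptions, not the `TasksMaster` implementation: it assumes each task module exposes a `main()` callable (the source shows this only for `fetch_ip_rep`), the name `discover_tasks` is hypothetical, and registration with APScheduler is omitted.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Dict


def discover_tasks(tasks_dir: Path) -> Dict[str, Callable[[], object]]:
    """Load each non-private .py file in tasks_dir and collect its main() callable.

    Hypothetical helper; the main() convention is an assumption.
    """
    tasks: Dict[str, Callable[[], object]] = {}
    for file in sorted(tasks_dir.glob("*.py")):
        if file.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(file.stem, file)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        entry = getattr(module, "main", None)
        if callable(entry):
            tasks[file.stem] = entry  # modules without main() are ignored
    return tasks
```

A scheduler wrapper could then iterate over the returned mapping and register each callable with its own interval.
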

| Task | Schedule | Purpose |
|------|----------|---------|
| `analyze_ips` | Every 1 min | Score IPs into categories (attacker, crawler, user) |
| `fetch_ip_rep` | Every 5 min | Enrich IPs with geolocation + blocklist data |
| `db_dump` | Configurable | Export database backups |
| `memory_cleanup` | Periodic | Trim in-memory lists |
| `top_attacking_ips` | Periodic | Cache top attackers |

### IP Categorization Model

Each IP is scored across 4 categories based on:

- HTTP method distribution (risky methods ratio)
- Robots.txt violations
- Request timing anomalies (coefficient of variation)
- User-Agent diversity
- Attack URL detection

Categories: `attacker`, `bad_crawler`, `good_crawler`, `regular_user`, `unknown`

## Configuration

`config.yaml` with environment variable overrides (`KRAWL_{FIELD}`):

```yaml
server:
  port: 5000
  delay: 100                    # Response delay (ms)

dashboard:
  secret_path: "test"           # Auto-generates if null

database:
  path: "data/krawl.db"
  retention_days: 30

crawl:
  infinite_pages_for_malicious: true
  max_pages_limit: 250
  ban_duration_seconds: 600

behavior:
  probability_error_codes: 0    # 0-100%

canary:
  token_url: null               # External canary alert URL
```

## Logging

Three rotating log files (1MB max, 5 backups each):

| Logger | File | Content |
|--------|------|---------|
| `krawl.app` | `logs/krawl.log` | Application events, errors |
| `krawl.access` | `logs/access.log` | HTTP access, attack detections |
| `krawl.credentials` | `logs/credentials.log` | Captured login attempts |

## Docker

```dockerfile
FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]
```

## Key Data Flows

### Honeypot Request

```
Client → BanCheck → DeceptionMiddleware → HoneypotRouter
                                                │
                                      ┌─────────┴──────────┐
                                      │ tracker.record()   │
                                      │  ├─ in-memory ++   │
                                      │  ├─ detect attacks │
                                      │  └─ DB persist     │
                                      └────────────────────┘
```

### Dashboard Load

```
Browser → GET /{secret}/
  → SSR initial stats + Jinja2 render
  → Alpine.js init
  → HTMX fires hx-get for each table
  → User clicks Attacks tab
  → setTimeout → init Leaflet + Chart.js
  → Leaflet fetches /api/all-ips → plots markers
  → Chart.js fetches /api/attack-types-stats → renders doughnut
```

### IP Enrichment Pipeline

```
APScheduler (every 5 min)
  └─ fetch_ip_rep.main()
       ├─ DB: get unenriched IPs (limit 50)
       ├─ ip-api.com → geolocation (country, city, ASN, coords)
       ├─ iprep.lcrawl.com → blocklist memberships
       └─ DB: update IpStats with enriched data
```
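
The enrichment pipeline above can be sketched as a pure function with the two external lookups injected. This is a hedged illustration, not the code in `fetch_ip_rep.py`: the name `enrich_batch`, the callable signatures, and the skip-on-failure behavior are assumptions, and the lookups are passed in so the sketch runs without network access. The `list_on` and `country_code` keys mirror the `IpStats` columns.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List

# Hypothetical signatures standing in for the ip-api.com and blocklist calls.
GeoLookup = Callable[[str], Dict[str, object]]
BlocklistLookup = Callable[[str], List[str]]


def enrich_batch(
    unenriched_ips: Iterable[str],
    geo_lookup: GeoLookup,
    blocklist_lookup: BlocklistLookup,
    limit: int = 50,
) -> Dict[str, Dict[str, object]]:
    """Return enrichment records for at most `limit` IPs; failed lookups are skipped."""
    enriched: Dict[str, Dict[str, object]] = {}
    for ip in islice(unenriched_ips, limit):
        try:
            record = dict(geo_lookup(ip))              # country, city, ASN, coords
            record["list_on"] = blocklist_lookup(ip)   # blocklist memberships
        except Exception:
            # Assumption: leave the IP unenriched so a later run can retry it.
            continue
        enriched[ip] = record
    return enriched


if __name__ == "__main__":
    # Stub lookups using documentation-range addresses (RFC 5737).
    fake_geo = lambda ip: {"country_code": "ZZ", "city": "Example"}
    fake_lists = lambda ip: ["example-blocklist"]
    print(enrich_batch(["203.0.113.7", "198.51.100.2"], fake_geo, fake_lists))
```

Capping each run at a fixed batch size keeps the job within the rate limits that third-party lookup services typically impose.
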