Files
krawl.es/docs/architecture.md
2026-02-17 14:34:48 +01:00

15 KiB

Krawl Architecture

Overview

Krawl is a cloud-native deception honeypot server built on FastAPI. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.

Tech Stack

Layer Technology
Backend FastAPI, Uvicorn, Python 3.11
ORM / DB SQLAlchemy 2.0, SQLite (WAL mode)
Templating Jinja2 (server-side rendering)
Reactivity Alpine.js 3.14
Partial Updates HTMX 2.0
Charts Chart.js 3.9 (doughnut), custom SVG radar
Maps Leaflet 1.9 + CartoDB dark tiles
Scheduling APScheduler
Container Docker (python:3.11-slim), Helm/K8s ready

Directory Structure

Krawl/
├── src/
│   ├── app.py                    # FastAPI app factory + lifespan
│   ├── config.py                 # YAML + env config loader
│   ├── dependencies.py           # DI providers (templates, DB, client IP)
│   ├── database.py               # DatabaseManager singleton
│   ├── models.py                 # SQLAlchemy ORM models
│   ├── tracker.py                # In-memory + DB access tracking
│   ├── logger.py                 # Rotating file log handlers
│   ├── deception_responses.py    # Attack detection + fake responses
│   ├── sanitizer.py              # Input sanitization
│   ├── generators.py             # Random content generators
│   ├── wordlists.py              # JSON wordlist loader
│   ├── geo_utils.py              # IP geolocation API
│   ├── ip_utils.py               # IP validation
│   │
│   ├── routes/
│   │   ├── honeypot.py           # Trap pages, credential capture, catch-all
│   │   ├── dashboard.py          # Dashboard page (Jinja2 SSR)
│   │   ├── api.py                # JSON API endpoints
│   │   └── htmx.py               # HTMX HTML fragment endpoints
│   │
│   ├── middleware/
│   │   ├── deception.py          # Path traversal / XXE / cmd injection detection
│   │   └── ban_check.py          # Banned IP enforcement
│   │
│   ├── tasks/                    # APScheduler background jobs
│   │   ├── analyze_ips.py        # IP categorization scoring
│   │   ├── fetch_ip_rep.py       # Geolocation + blocklist enrichment
│   │   ├── db_dump.py            # Database export
│   │   ├── memory_cleanup.py     # In-memory list trimming
│   │   └── top_attacking_ips.py  # Top attacker caching
│   │
│   ├── tasks_master.py           # Task discovery + APScheduler orchestrator
│   ├── firewall/                 # Banlist export (iptables, raw)
│   ├── migrations/               # Schema migrations (auto-run)
│   │
│   └── templates/
│       ├── jinja2/
│       │   ├── base.html                     # Layout + CDN scripts
│       │   └── dashboard/
│       │       ├── index.html                # Main dashboard page
│       │       └── partials/                 # 13 HTMX fragment templates
│       ├── html/                             # Deceptive trap page templates
│       └── static/
│           ├── css/dashboard.css
│           └── js/
│               ├── dashboard.js              # Alpine.js app controller
│               ├── map.js                    # Leaflet map
│               ├── charts.js                 # Chart.js doughnut
│               └── radar.js                  # SVG radar chart
│
├── config.yaml               # Application configuration
├── wordlists.json             # Attack patterns + fake credentials
├── Dockerfile                 # Container build
├── docker-compose.yaml        # Local orchestration
├── entrypoint.sh              # Container startup (gosu privilege drop)
├── kubernetes/                # K8s manifests
└── helm/                      # Helm chart

Application Entry Point

src/app.py uses the FastAPI application factory pattern with an async lifespan manager:

Startup                              Shutdown
  │                                    │
  ├─ Initialize logging                └─ Log shutdown
  ├─ Initialize SQLite DB
  ├─ Create AccessTracker
  ├─ Load webpages file (optional)
  ├─ Store config + tracker in app.state
  ├─ Start APScheduler background tasks
  └─ Log dashboard URL

Request Pipeline

        Request
          │
          ▼
┌──────────────────────┐
│  BanCheckMiddleware  │──→ IP banned? → Return 500
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
└──────────┬───────────┘
           ▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│     Route Matching    │
│  (ordered by priority)│
│                       │
│  1. Static files      │  /{secret}/static/*
│  2. Dashboard router  │  /{secret}/          (prefix-based)
│  3. API router        │  /{secret}/api/*     (prefix-based)
│  4. HTMX router       │  /{secret}/htmx/*   (prefix-based)
│  5. Honeypot router   │  /* (catch-all)
└───────────────────────┘

Prefix-Based Routing

Dashboard, API, and HTMX routers are mounted with prefix=f"/{secret}" in app.py. This means:

  • Route handlers define paths without the secret (e.g., @router.get("/api/all-ips"))
  • FastAPI prepends the secret automatically (e.g., GET /a1b2c3/api/all-ips)
  • The honeypot catch-all /{path:path} only matches paths that don't start with the secret
  • No _is_dashboard_path() checks needed — the prefix handles access scoping

Route Architecture

Honeypot Routes (routes/honeypot.py)

Method Path Response
GET /{path:path} Trap page with random links (catch-all)
HEAD /{path:path} 200 OK
POST /{path:path} Credential capture
GET /admin, /login Fake login form
GET /wp-admin, /wp-login.php Fake WordPress login
GET /phpmyadmin Fake phpMyAdmin
GET /robots.txt Honeypot paths advertised
GET/POST /api/search, /api/sql SQL injection honeypot
POST /api/contact XSS detection endpoint
GET /.env, /credentials.txt Fake sensitive files

Dashboard Routes (routes/dashboard.py)

Method Path Response
GET / Server-rendered dashboard (Jinja2)

API Routes (routes/api.py)

Method Path Response
GET /api/all-ips Paginated IP list with stats
GET /api/attackers Paginated attacker IPs
GET /api/ip-stats/{ip} Single IP detail
GET /api/credentials Captured credentials
GET /api/honeypot Honeypot trigger counts
GET /api/top-ips Top requesting IPs
GET /api/top-paths Most requested paths
GET /api/top-user-agents Top user agents
GET /api/attack-types-stats Attack type distribution
GET /api/attack-types Paginated attack log
GET /api/raw-request/{id} Full HTTP request
GET /api/get_banlist Export ban rules

HTMX Fragment Routes (routes/htmx.py)

Each returns a server-rendered Jinja2 partial (hx-swap="innerHTML"):

Path Template
/htmx/honeypot honeypot_table.html
/htmx/top-ips top_ips_table.html
/htmx/top-paths top_paths_table.html
/htmx/top-ua top_ua_table.html
/htmx/attackers attackers_table.html
/htmx/credentials credentials_table.html
/htmx/attacks attack_types_table.html
/htmx/patterns patterns_table.html
/htmx/ip-detail/{ip} ip_detail.html

Database Schema

┌─────────────────┐     ┌──────────────────┐
│   AccessLog     │     │ AttackDetection   │
├─────────────────┤     ├──────────────────┤
│ id (PK)         │◄────│ access_log_id(FK)│
│ ip (indexed)    │     │ attack_type      │
│ path            │     │ matched_pattern  │
│ user_agent      │     └──────────────────┘
│ method          │
│ is_suspicious   │     ┌──────────────────┐
│ is_honeypot     │     │CredentialAttempt │
│ timestamp       │     ├──────────────────┤
│ raw_request     │     │ id (PK)          │
└─────────────────┘     │ ip (indexed)     │
                        │ path, username   │
┌─────────────────┐     │ password         │
│    IpStats      │     │ timestamp        │
├─────────────────┤     └──────────────────┘
│ ip (PK)         │
│ total_requests  │     ┌──────────────────┐
│ first/last_seen │     │ CategoryHistory  │
│ country_code    │     ├──────────────────┤
│ city, lat, lon  │     │ id (PK)          │
│ asn, asn_org    │     │ ip (indexed)     │
│ isp, reverse    │     │ old_category     │
│ is_proxy        │     │ new_category     │
│ is_hosting      │     │ timestamp        │
│ list_on (JSON)  │     └──────────────────┘
│ category        │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘

SQLite config: WAL mode, 30s busy timeout, file permissions 600.

Frontend Architecture

base.html
  ├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
  ├── Static: dashboard.css
  │
  └── dashboard/index.html (extends base)
      │
      ├── Stats cards ──────────── Server-rendered on page load
      ├── Suspicious table ─────── Server-rendered on page load
      │
      ├── Overview tab (Alpine.js x-show)
      │   ├── Honeypot table ───── HTMX hx-get on load
      │   ├── Top IPs table ────── HTMX hx-get on load
      │   ├── Top Paths table ──── HTMX hx-get on load
      │   ├── Top UA table ─────── HTMX hx-get on load
      │   └── Credentials table ── HTMX hx-get on load
      │
      └── Attacks tab (Alpine.js x-show, lazy init)
          ├── Attackers table ──── HTMX hx-get on load
          ├── Map ──────────────── Leaflet (init on tab switch)
          ├── Chart ────────────── Chart.js (init on tab switch)
          ├── Attack types table ─ HTMX hx-get on load
          └── Patterns table ───── HTMX hx-get on load

Responsibility split:

  • Alpine.js — Tab state, modals, dropdowns, lazy initialization
  • HTMX — Table pagination, sorting, IP detail expansion
  • Leaflet — Interactive map with category-colored markers
  • Chart.js — Doughnut chart for attack type distribution
  • Custom SVG — Radar charts for IP category scores

Background Tasks

Managed by TasksMaster (APScheduler). Tasks are auto-discovered from src/tasks/.

Task Schedule Purpose
analyze_ips Every 1 min Score IPs into categories (attacker, crawler, user)
fetch_ip_rep Every 5 min Enrich IPs with geolocation + blocklist data
db_dump Configurable Export database backups
memory_cleanup Periodic Trim in-memory lists
top_attacking_ips Periodic Cache top attackers

IP Categorization Model

Each IP is scored across 4 categories based on:

  • HTTP method distribution (risky methods ratio)
  • Robots.txt violations
  • Request timing anomalies (coefficient of variation)
  • User-Agent diversity
  • Attack URL detection

Categories: attacker, bad_crawler, good_crawler, regular_user, unknown

Configuration

config.yaml with environment variable overrides (KRAWL_{FIELD}):

server:
  port: 5000
  delay: 100                    # Response delay (ms)

dashboard:
  secret_path: "test"           # Auto-generates if null

database:
  path: "data/krawl.db"
  retention_days: 30

crawl:
  infinite_pages_for_malicious: true
  max_pages_limit: 250
  ban_duration_seconds: 600

behavior:
  probability_error_codes: 0    # 0-100%

canary:
  token_url: null               # External canary alert URL

Logging

Three rotating log files (1MB max, 5 backups each):

Logger File Content
krawl.app logs/krawl.log Application events, errors
krawl.access logs/access.log HTTP access, attack detections
krawl.credentials logs/credentials.log Captured login attempts

Docker

FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]

Key Data Flows

Honeypot Request

Client → BanCheck → DeceptionMiddleware → HoneypotRouter
                                              │
                                    ┌─────────┴──────────┐
                                    │ tracker.record()    │
                                    │   ├─ in-memory ++   │
                                    │   ├─ detect attacks │
                                    │   └─ DB persist     │
                                    └────────────────────┘

Dashboard Load

Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
       → Alpine.js init → HTMX fires hx-get for each table
       → User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
       → Leaflet fetches /api/all-ips → plots markers
       → Chart.js fetches /api/attack-types-stats → renders doughnut

IP Enrichment Pipeline

APScheduler (every 5 min)
  └─ fetch_ip_rep.main()
       ├─ DB: get unenriched IPs (limit 50)
       ├─ ip-api.com → geolocation (country, city, ASN, coords)
       ├─ iprep.lcrawl.com → blocklist memberships
       └─ DB: update IpStats with enriched data