Lorenzo Venerandi 44235b232c docs: add architecture documentation for Krawl project

2026-02-17 14:34:48 +01:00

Krawl Architecture

Overview

Krawl is a cloud-native deception honeypot server built on FastAPI. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.

Tech Stack

Layer	Technology
Backend	FastAPI, Uvicorn, Python 3.11
ORM / DB	SQLAlchemy 2.0, SQLite (WAL mode)
Templating	Jinja2 (server-side rendering)
Reactivity	Alpine.js 3.14
Partial Updates	HTMX 2.0
Charts	Chart.js 3.9 (doughnut), custom SVG radar
Maps	Leaflet 1.9 + CartoDB dark tiles
Scheduling	APScheduler
Container	Docker (python:3.11-slim), Helm/K8s ready

Directory Structure

Krawl/
├── src/
│   ├── app.py                    # FastAPI app factory + lifespan
│   ├── config.py                 # YAML + env config loader
│   ├── dependencies.py           # DI providers (templates, DB, client IP)
│   ├── database.py               # DatabaseManager singleton
│   ├── models.py                 # SQLAlchemy ORM models
│   ├── tracker.py                # In-memory + DB access tracking
│   ├── logger.py                 # Rotating file log handlers
│   ├── deception_responses.py    # Attack detection + fake responses
│   ├── sanitizer.py              # Input sanitization
│   ├── generators.py             # Random content generators
│   ├── wordlists.py              # JSON wordlist loader
│   ├── geo_utils.py              # IP geolocation API
│   ├── ip_utils.py               # IP validation
│   │
│   ├── routes/
│   │   ├── honeypot.py           # Trap pages, credential capture, catch-all
│   │   ├── dashboard.py          # Dashboard page (Jinja2 SSR)
│   │   ├── api.py                # JSON API endpoints
│   │   └── htmx.py               # HTMX HTML fragment endpoints
│   │
│   ├── middleware/
│   │   ├── deception.py          # Path traversal / XXE / cmd injection detection
│   │   └── ban_check.py          # Banned IP enforcement
│   │
│   ├── tasks/                    # APScheduler background jobs
│   │   ├── analyze_ips.py        # IP categorization scoring
│   │   ├── fetch_ip_rep.py       # Geolocation + blocklist enrichment
│   │   ├── db_dump.py            # Database export
│   │   ├── memory_cleanup.py     # In-memory list trimming
│   │   └── top_attacking_ips.py  # Top attacker caching
│   │
│   ├── tasks_master.py           # Task discovery + APScheduler orchestrator
│   ├── firewall/                 # Banlist export (iptables, raw)
│   ├── migrations/               # Schema migrations (auto-run)
│   │
│   └── templates/
│       ├── jinja2/
│       │   ├── base.html                     # Layout + CDN scripts
│       │   └── dashboard/
│       │       ├── index.html                # Main dashboard page
│       │       └── partials/                 # 13 HTMX fragment templates
│       ├── html/                             # Deceptive trap page templates
│       └── static/
│           ├── css/dashboard.css
│           └── js/
│               ├── dashboard.js              # Alpine.js app controller
│               ├── map.js                    # Leaflet map
│               ├── charts.js                 # Chart.js doughnut
│               └── radar.js                  # SVG radar chart
│
├── config.yaml               # Application configuration
├── wordlists.json             # Attack patterns + fake credentials
├── Dockerfile                 # Container build
├── docker-compose.yaml        # Local orchestration
├── entrypoint.sh              # Container startup (gosu privilege drop)
├── kubernetes/                # K8s manifests
└── helm/                      # Helm chart

Application Entry Point

src/app.py uses the FastAPI application factory pattern with an async lifespan manager:

Startup                              Shutdown
  │                                    │
  ├─ Initialize logging                └─ Log shutdown
  ├─ Initialize SQLite DB
  ├─ Create AccessTracker
  ├─ Load webpages file (optional)
  ├─ Store config + tracker in app.state
  ├─ Start APScheduler background tasks
  └─ Log dashboard URL
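
The startup and shutdown order above can be sketched with a plain async context manager. This is a framework-agnostic illustration rather than the project's actual app.py: the `events` list, the `SimpleNamespace` stand-in for the FastAPI app, and the placeholder tracker and config values are all assumptions made for the example.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


@asynccontextmanager
async def lifespan(app, events):
    """Run the startup steps in order, yield while serving, then log shutdown."""
    events.append("init logging")
    events.append("init database")
    # Hypothetical stand-ins for AccessTracker and the loaded configuration.
    app.state.tracker = {"requests": 0}
    app.state.config = {"secret_path": "test"}
    events.append("start scheduler")
    events.append("log dashboard url")
    try:
        yield
    finally:
        # Runs even if serving raised, mirroring the single shutdown step.
        events.append("log shutdown")


async def run_once():
    app = SimpleNamespace(state=SimpleNamespace())
    events = []
    async with lifespan(app, events):
        events.append("serving")
    return app, events


app, events = asyncio.run(run_once())
```

In FastAPI itself the lifespan callable receives only the app; the extra `events` argument exists here solely so the ordering can be inspected.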

Request Pipeline

        Request
          │
          ▼
┌──────────────────────┐
│  BanCheckMiddleware  │──→ IP banned? → Return 500
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
└──────────┬───────────┘
           ▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│     Route Matching    │
│  (ordered by priority)│
│                       │
│  1. Static files      │  /{secret}/static/*
│  2. Dashboard router  │  /{secret}/          (prefix-based)
│  3. API router        │  /{secret}/api/*     (prefix-based)
│  4. HTMX router       │  /{secret}/htmx/*    (prefix-based)
│  5. Honeypot router   │  /* (catch-all)
└───────────────────────┘
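
The ordering in the diagram can be illustrated as nested handlers, with the outermost wrapper seeing the request first. This is a minimal sketch using plain dictionaries for requests and responses; the banned IP, attack markers, Server header values, and the status code of the fake error are hypothetical placeholders, not values taken from Krawl.

```python
import random

# Hypothetical sample data for the illustration.
SERVER_HEADERS = ["nginx/1.18.0", "Apache/2.4.41", "Microsoft-IIS/10.0"]
BANNED_IPS = {"203.0.113.9"}
ATTACK_MARKERS = ("../", "<!ENTITY", ";cat ")


def route(request):
    """Stand-in for route matching; always serves a trap page."""
    return {"status": 200, "headers": {}, "body": "trap page"}


def server_header(next_handler):
    def handler(request):
        response = next_handler(request)
        response["headers"]["Server"] = random.choice(SERVER_HEADERS)
        return response
    return handler


def deception(next_handler):
    def handler(request):
        # Short-circuit with a fake error when the path looks like an attack.
        if any(marker in request["path"] for marker in ATTACK_MARKERS):
            return {"status": 500, "headers": {}, "body": "fake error"}
        return next_handler(request)
    return handler


def ban_check(next_handler):
    def handler(request):
        # Banned clients never reach the later stages.
        if request["ip"] in BANNED_IPS:
            return {"status": 500, "headers": {}, "body": ""}
        return next_handler(request)
    return handler


# Outermost first: BanCheck -> Deception -> ServerHeader -> routing.
pipeline = ban_check(deception(server_header(route)))
```

Calling `pipeline({"ip": "198.51.100.1", "path": "/admin"})` passes through all three stages and returns the trap page with a randomly chosen Server header.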

Prefix-Based Routing

Dashboard, API, and HTMX routers are mounted with prefix=f"/{secret}" in app.py. This means:

Route handlers define paths without the secret (e.g., @router.get("/api/all-ips"))
FastAPI prepends the secret automatically (e.g., GET /a1b2c3/api/all-ips)
The honeypot catch-all /{path:path} is matched last, so it only receives requests that no secret-prefixed route has already handled
No _is_dashboard_path() checks needed — the prefix handles access scoping
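
To make the priority order concrete, here is a small sketch of how a path might be resolved when the secret is `a1b2c3`. The `resolve` function is hypothetical and only mimics the documented ordering; the real application relies on FastAPI's router registration rather than string checks like these.

```python
def resolve(path: str, secret: str) -> str:
    """Return the name of the router that would handle `path`."""
    prefix = f"/{secret}"
    if path.startswith(f"{prefix}/static/"):
        return "static"
    if path in (prefix, f"{prefix}/"):
        return "dashboard"
    if path.startswith(f"{prefix}/api/"):
        return "api"
    if path.startswith(f"{prefix}/htmx/"):
        return "htmx"
    # Everything else, including /api/search without the secret,
    # falls through to the honeypot catch-all.
    return "honeypot"
```

Note that `/api/search` (a honeypot endpoint) and `/a1b2c3/api/all-ips` (a real API endpoint) never collide, because only the latter carries the secret prefix.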

Route Architecture

Honeypot Routes (`routes/honeypot.py`)

Method	Path	Response
`GET`	`/{path:path}`	Trap page with random links (catch-all)
`HEAD`	`/{path:path}`	200 OK
`POST`	`/{path:path}`	Credential capture
`GET`	`/admin`, `/login`	Fake login form
`GET`	`/wp-admin`, `/wp-login.php`	Fake WordPress login
`GET`	`/phpmyadmin`	Fake phpMyAdmin
`GET`	`/robots.txt`	Honeypot paths advertised
`GET/POST`	`/api/search`, `/api/sql`	SQL injection honeypot
`POST`	`/api/contact`	XSS detection endpoint
`GET`	`/.env`, `/credentials.txt`	Fake sensitive files

Dashboard Routes (`routes/dashboard.py`)

Method	Path	Response
`GET`	`/`	Server-rendered dashboard (Jinja2)

API Routes (`routes/api.py`)

Method	Path	Response
`GET`	`/api/all-ips`	Paginated IP list with stats
`GET`	`/api/attackers`	Paginated attacker IPs
`GET`	`/api/ip-stats/{ip}`	Single IP detail
`GET`	`/api/credentials`	Captured credentials
`GET`	`/api/honeypot`	Honeypot trigger counts
`GET`	`/api/top-ips`	Top requesting IPs
`GET`	`/api/top-paths`	Most requested paths
`GET`	`/api/top-user-agents`	Top user agents
`GET`	`/api/attack-types-stats`	Attack type distribution
`GET`	`/api/attack-types`	Paginated attack log
`GET`	`/api/raw-request/{id}`	Full HTTP request
`GET`	`/api/get_banlist`	Export ban rules

HTMX Fragment Routes (`routes/htmx.py`)

Each route returns a server-rendered Jinja2 partial that HTMX swaps into the page (hx-swap="innerHTML"):

Path	Template
`/htmx/honeypot`	`honeypot_table.html`
`/htmx/top-ips`	`top_ips_table.html`
`/htmx/top-paths`	`top_paths_table.html`
`/htmx/top-ua`	`top_ua_table.html`
`/htmx/attackers`	`attackers_table.html`
`/htmx/credentials`	`credentials_table.html`
`/htmx/attacks`	`attack_types_table.html`
`/htmx/patterns`	`patterns_table.html`
`/htmx/ip-detail/{ip}`	`ip_detail.html`

Database Schema

┌─────────────────┐     ┌──────────────────┐
│   AccessLog     │     │ AttackDetection  │
├─────────────────┤     ├──────────────────┤
│ id (PK)         │◄────│ access_log_id(FK)│
│ ip (indexed)    │     │ attack_type      │
│ path            │     │ matched_pattern  │
│ user_agent      │     └──────────────────┘
│ method          │
│ is_suspicious   │     ┌──────────────────┐
│ is_honeypot     │     │CredentialAttempt │
│ timestamp       │     ├──────────────────┤
│ raw_request     │     │ id (PK)          │
└─────────────────┘     │ ip (indexed)     │
                        │ path, username   │
┌─────────────────┐     │ password         │
│    IpStats      │     │ timestamp        │
├─────────────────┤     └──────────────────┘
│ ip (PK)         │
│ total_requests  │     ┌──────────────────┐
│ first/last_seen │     │ CategoryHistory  │
│ country_code    │     ├──────────────────┤
│ city, lat, lon  │     │ id (PK)          │
│ asn, asn_org    │     │ ip (indexed)     │
│ isp, reverse    │     │ old_category     │
│ is_proxy        │     │ new_category     │
│ is_hosting      │     │ timestamp        │
│ list_on (JSON)  │     └──────────────────┘
│ category        │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘

SQLite config: WAL mode, 30s busy timeout, file permissions 600.
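
The three settings above can be reproduced with the standard library alone. The sketch below uses `sqlite3` directly instead of SQLAlchemy, writes to a temporary directory, and assumes a POSIX filesystem for the permission bits; `open_database` is a hypothetical helper name.

```python
import os
import sqlite3
import stat
import tempfile


def open_database(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite file with WAL, a 30 s busy timeout, mode 600."""
    conn = sqlite3.connect(path, timeout=30)   # timeout is the busy timeout in seconds
    conn.execute("PRAGMA journal_mode=WAL")    # write-ahead logging
    os.chmod(path, 0o600)                      # owner read/write only
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "krawl.db")
conn = open_database(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
file_mode = stat.S_IMODE(os.stat(db_path).st_mode)
```

WAL mode lets readers proceed while a single writer appends, which suits a workload where request handlers write continuously and the dashboard reads.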

Frontend Architecture

base.html
  ├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
  ├── Static: dashboard.css
  │
  └── dashboard/index.html (extends base)
      │
      ├── Stats cards ──────────── Server-rendered on page load
      ├── Suspicious table ─────── Server-rendered on page load
      │
      ├── Overview tab (Alpine.js x-show)
      │   ├── Honeypot table ───── HTMX hx-get on load
      │   ├── Top IPs table ────── HTMX hx-get on load
      │   ├── Top Paths table ──── HTMX hx-get on load
      │   ├── Top UA table ─────── HTMX hx-get on load
      │   └── Credentials table ── HTMX hx-get on load
      │
      └── Attacks tab (Alpine.js x-show, lazy init)
          ├── Attackers table ──── HTMX hx-get on load
          ├── Map ──────────────── Leaflet (init on tab switch)
          ├── Chart ────────────── Chart.js (init on tab switch)
          ├── Attack types table ─ HTMX hx-get on load
          └── Patterns table ───── HTMX hx-get on load

Responsibility split:

Alpine.js — Tab state, modals, dropdowns, lazy initialization
HTMX — Table pagination, sorting, IP detail expansion
Leaflet — Interactive map with category-colored markers
Chart.js — Doughnut chart for attack type distribution
Custom SVG — Radar charts for IP category scores

Background Tasks

Managed by TasksMaster (APScheduler). Tasks are auto-discovered from src/tasks/.

Task	Schedule	Purpose
`analyze_ips`	Every 1 min	Score IPs into categories (attacker, crawler, user)
`fetch_ip_rep`	Every 5 min	Enrich IPs with geolocation + blocklist data
`db_dump`	Configurable	Export database backups
`memory_cleanup`	Periodic	Trim in-memory lists
`top_attacking_ips`	Periodic	Cache top attackers
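
Auto-discovery of this kind is commonly done by importing every module in a directory and keeping those that expose an entry point. The following is a sketch under that assumption: `discover_tasks` is a hypothetical function, and the `main()` convention is inferred from the `fetch_ip_rep.main()` call shown in the IP Enrichment Pipeline section. Scheduling the discovered callables with APScheduler is left out.

```python
import importlib.util
from pathlib import Path


def discover_tasks(tasks_dir: str) -> dict:
    """Import each .py file in tasks_dir and collect modules that define main()."""
    tasks = {}
    for file in sorted(Path(tasks_dir).glob("*.py")):
        if file.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(file.stem, file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        entry_point = getattr(module, "main", None)
        if callable(entry_point):
            tasks[file.stem] = entry_point
    return tasks
```

Modules without a callable `main` are ignored, so helper files can live alongside tasks without being scheduled.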

IP Categorization Model

Each IP is scored across 4 categories based on:

HTTP method distribution (risky methods ratio)
Robots.txt violations
Request timing anomalies (coefficient of variation)
User-Agent diversity
Attack URL detection

Categories: attacker, bad_crawler, good_crawler, regular_user, unknown
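
Two of the signals listed above are simple enough to sketch. The functions below are illustrative only: the set of "risky" methods and the handling of very short histories are assumptions, and how Krawl weights these values into category scores is not shown here.

```python
from statistics import mean, pstdev

# Assumed set of risky HTTP methods for the illustration.
RISKY_METHODS = {"PUT", "DELETE", "PATCH", "TRACE", "CONNECT"}


def timing_cv(timestamps):
    """Coefficient of variation (std / mean) of the gaps between requests."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(intervals) < 2:
        return 0.0  # not enough data to measure variation
    average = mean(intervals)
    if average == 0:
        return 0.0
    return pstdev(intervals) / average


def risky_method_ratio(methods):
    """Fraction of requests that used a risky HTTP method."""
    if not methods:
        return 0.0
    return sum(m.upper() in RISKY_METHODS for m in methods) / len(methods)
```

A coefficient of variation near zero means requests arrive at almost perfectly regular intervals, which is more typical of scripted clients than of people browsing.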

Configuration

Settings are read from config.yaml and can be overridden with environment variables named KRAWL_{FIELD}:

server:
  port: 5000
  delay: 100                    # Response delay (ms)

dashboard:
  secret_path: "test"           # Auto-generates if null

database:
  path: "data/krawl.db"
  retention_days: 30

crawl:
  infinite_pages_for_malicious: true
  max_pages_limit: 250
  ban_duration_seconds: 600

behavior:
  probability_error_codes: 0    # 0-100%

canary:
  token_url: null               # External canary alert URL
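
One way to layer environment overrides on top of file defaults is sketched below. The variable naming used here (`KRAWL_<SECTION>_<KEY>`, for example `KRAWL_SERVER_PORT`) is an assumption made for the example; the document only states the `KRAWL_{FIELD}` pattern, so the real mapping may differ. Only integer settings are covered.

```python
import os

# A subset of the defaults shown above, limited to integer values.
DEFAULTS = {
    "server": {"port": 5000, "delay": 100},
    "crawl": {"max_pages_limit": 250, "ban_duration_seconds": 600},
}


def apply_env_overrides(config, environ):
    """Return a copy of config with KRAWL_<SECTION>_<KEY> values applied."""
    merged = {section: dict(values) for section, values in config.items()}
    for section, values in merged.items():
        for key, current in values.items():
            name = f"KRAWL_{section}_{key}".upper()  # assumed naming scheme
            if name in environ:
                # Cast to the type of the default (int in this sketch).
                values[key] = type(current)(environ[name])
    return merged


config = apply_env_overrides(DEFAULTS, os.environ)
```

Passing the environment as an argument keeps the function easy to test with a plain dictionary.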

Logging

Three rotating log files (1 MB max, 5 backups each):

Logger	File	Content
`krawl.app`	`logs/krawl.log`	Application events, errors
`krawl.access`	`logs/access.log`	HTTP access, attack detections
`krawl.credentials`	`logs/credentials.log`	Captured login attempts
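
A rotation policy like this maps directly onto `logging.handlers.RotatingFileHandler`. The sketch below is not Krawl's logger.py: `build_logger` is a hypothetical helper, the log format is a placeholder, 1 MB is taken as 1024 * 1024 bytes, and files go to a temporary directory rather than `logs/`.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(name, path):
    """Create a logger that rotates at 1 MB and keeps 5 backups."""
    handler = RotatingFileHandler(
        path, maxBytes=1024 * 1024, backupCount=5, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep each stream in its own file only
    return logger


log_dir = tempfile.mkdtemp()
loggers = {
    name: build_logger(name, os.path.join(log_dir, filename))
    for name, filename in [
        ("krawl.app", "krawl.log"),
        ("krawl.access", "access.log"),
        ("krawl.credentials", "credentials.log"),
    ]
}
loggers["krawl.credentials"].info("login attempt captured")
```

Disabling propagation prevents, for instance, captured credentials from also appearing in the application log through a parent logger.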

Docker

FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]

Key Data Flows

Honeypot Request

Client → BanCheck → DeceptionMiddleware → HoneypotRouter
                                              │
                                    ┌─────────┴───────────┐
                                    │ tracker.record()    │
                                    │   ├─ in-memory ++   │
                                    │   ├─ detect attacks │
                                    │   └─ DB persist     │
                                    └─────────────────────┘
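
The three steps inside the tracker can be sketched as follows. This is a simplified stand-in, not the real tracker.py: the class shape, the `record(ip, path)` signature, the toy attack patterns, and the `persist` callback that replaces the database write are all hypothetical.

```python
from collections import Counter

# Toy patterns for the illustration; the real ones come from wordlists.json.
ATTACK_PATTERNS = {"path_traversal": "../", "sql_injection": "' OR "}


class AccessTracker:
    def __init__(self, persist):
        self.ip_counts = Counter()
        self.persist = persist  # callable that stores one access row

    def record(self, ip, path):
        # 1. Update the in-memory counter.
        self.ip_counts[ip] += 1
        # 2. Detect attacks by matching known patterns against the path.
        attacks = [name for name, marker in ATTACK_PATTERNS.items() if marker in path]
        # 3. Persist the access together with the detection result.
        self.persist({
            "ip": ip,
            "path": path,
            "is_suspicious": bool(attacks),
            "attack_types": attacks,
        })
        return attacks
```

For a quick experiment, `AccessTracker(rows.append)` with `rows = []` collects the rows that would otherwise be written to the database.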

Dashboard Load

Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
       → Alpine.js init → HTMX fires hx-get for each table
       → User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
       → Leaflet fetches /api/all-ips → plots markers
       → Chart.js fetches /api/attack-types-stats → renders doughnut

IP Enrichment Pipeline

APScheduler (every 5 min)
  └─ fetch_ip_rep.main()
       ├─ DB: get unenriched IPs (limit 50)
       ├─ ip-api.com → geolocation (country, city, ASN, coords)
       ├─ iprep.lcrawl.com → blocklist memberships
       └─ DB: update IpStats with enriched data
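
The batch logic of this pipeline can be sketched without any network access by injecting the two lookups as callables. Everything here is illustrative: `enrich_batch` is a hypothetical function, a dictionary stands in for the IpStats table, and "unenriched" is assumed to mean that `country_code` is still empty.

```python
def enrich_batch(ip_stats, geo_lookup, blocklist_lookup, limit=50):
    """Enrich up to `limit` IPs that have no geolocation yet; return those IPs."""
    pending = [
        ip for ip, row in ip_stats.items() if row.get("country_code") is None
    ][:limit]
    for ip in pending:
        row = ip_stats[ip]
        row.update(geo_lookup(ip))            # country, city, ASN, coordinates
        row["list_on"] = blocklist_lookup(ip)  # blocklist memberships
    return pending
```

Capping each run at 50 addresses keeps a single scheduled job short and spreads requests to the external services across successive runs.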
