# Krawl Architecture
## Overview
Krawl is a cloud-native deception honeypot server built on **FastAPI**. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.
## Tech Stack
| Layer | Technology |
|-------|-----------|
| **Backend** | FastAPI, Uvicorn, Python 3.11 |
| **ORM / DB** | SQLAlchemy 2.0, SQLite (WAL mode) |
| **Templating** | Jinja2 (server-side rendering) |
| **Reactivity** | Alpine.js 3.14 |
| **Partial Updates** | HTMX 2.0 |
| **Charts** | Chart.js 3.9 (doughnut), custom SVG radar |
| **Maps** | Leaflet 1.9 + CartoDB dark tiles |
| **Scheduling** | APScheduler |
| **Container** | Docker (python:3.11-slim), Helm/K8s ready |
## Directory Structure
```
Krawl/
├── src/
│   ├── app.py                    # FastAPI app factory + lifespan
│   ├── config.py                 # YAML + env config loader
│   ├── dependencies.py           # DI providers (templates, DB, client IP)
│   ├── database.py               # DatabaseManager singleton
│   ├── models.py                 # SQLAlchemy ORM models
│   ├── tracker.py                # In-memory + DB access tracking
│   ├── logger.py                 # Rotating file log handlers
│   ├── deception_responses.py    # Attack detection + fake responses
│   ├── sanitizer.py              # Input sanitization
│   ├── generators.py             # Random content generators
│   ├── wordlists.py              # JSON wordlist loader
│   ├── geo_utils.py              # IP geolocation API
│   ├── ip_utils.py               # IP validation
│   │
│   ├── routes/
│   │   ├── honeypot.py           # Trap pages, credential capture, catch-all
│   │   ├── dashboard.py          # Dashboard page (Jinja2 SSR)
│   │   ├── api.py                # JSON API endpoints
│   │   └── htmx.py               # HTMX HTML fragment endpoints
│   │
│   ├── middleware/
│   │   ├── deception.py          # Path traversal / XXE / cmd injection detection
│   │   └── ban_check.py          # Banned IP enforcement
│   │
│   ├── tasks/                    # APScheduler background jobs
│   │   ├── analyze_ips.py        # IP categorization scoring
│   │   ├── fetch_ip_rep.py       # Geolocation + blocklist enrichment
│   │   ├── db_dump.py            # Database export
│   │   ├── memory_cleanup.py     # In-memory list trimming
│   │   └── top_attacking_ips.py  # Top attacker caching
│   │
│   ├── tasks_master.py           # Task discovery + APScheduler orchestrator
│   ├── firewall/                 # Banlist export (iptables, raw)
│   ├── migrations/               # Schema migrations (auto-run)
│   │
│   └── templates/
│       ├── jinja2/
│       │   ├── base.html         # Layout + CDN scripts
│       │   └── dashboard/
│       │       ├── index.html    # Main dashboard page
│       │       └── partials/     # 13 HTMX fragment templates
│       ├── html/                 # Deceptive trap page templates
│       └── static/
│           ├── css/dashboard.css
│           └── js/
│               ├── dashboard.js  # Alpine.js app controller
│               ├── map.js        # Leaflet map
│               ├── charts.js     # Chart.js doughnut
│               └── radar.js      # SVG radar chart
├── config.yaml                   # Application configuration
├── wordlists.json                # Attack patterns + fake credentials
├── Dockerfile                    # Container build
├── docker-compose.yaml           # Local orchestration
├── entrypoint.sh                 # Container startup (gosu privilege drop)
├── kubernetes/                   # K8s manifests
└── helm/                         # Helm chart
```
## Application Entry Point
`src/app.py` uses the **FastAPI application factory** pattern with an async lifespan manager:
```
Startup                                     Shutdown
  │                                           │
  ├─ Initialize logging                       └─ Log shutdown
  ├─ Initialize SQLite DB
  ├─ Create AccessTracker
  ├─ Load webpages file (optional)
  ├─ Store config + tracker in app.state
  ├─ Start APScheduler background tasks
  └─ Log dashboard URL
```
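
The startup and shutdown phases map onto the two halves of an async context manager. The sketch below illustrates that shape only and is not Krawl's actual `app.py`: `AccessTracker` is an empty placeholder class, the config dict is hard-coded, and a `SimpleNamespace` stands in for the FastAPI app so the example runs with the standard library alone.

```python
import asyncio
import logging
from contextlib import asynccontextmanager
from types import SimpleNamespace

logger = logging.getLogger("krawl.app")


class AccessTracker:
    """Hypothetical placeholder for the tracker created at startup."""


@asynccontextmanager
async def lifespan(app):
    # Startup: everything before `yield` runs before the first request.
    logging.basicConfig(level=logging.INFO)
    app.state.config = {"port": 5000, "secret_path": "test"}  # placeholder values
    app.state.tracker = AccessTracker()
    logger.info(
        "Dashboard at http://localhost:%s/%s/",
        app.state.config["port"],
        app.state.config["secret_path"],
    )
    yield
    # Shutdown: everything after `yield` runs once the server stops.
    logger.info("Shutting down")


async def demo() -> bool:
    # SimpleNamespace mimics the `app.state` attribute of a FastAPI app.
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app):
        return isinstance(app.state.tracker, AccessTracker)


started = asyncio.run(demo())
```

With FastAPI installed, the same function would be passed as `FastAPI(lifespan=lifespan)`; database initialization and scheduler startup would sit alongside the tracker setup before the `yield`.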
## Request Pipeline
```
Request
           │
           ▼
┌──────────────────────┐
│  BanCheckMiddleware  │──→ IP banned? → Return 500
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
└──────────┬───────────┘
           │
           ▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│    Route Matching     │
│ (ordered by priority) │
│                       │
│ 1. Static files       │ /{secret}/static/*
│ 2. Dashboard router   │ /{secret}/ (prefix-based)
│ 3. API router         │ /{secret}/api/* (prefix-based)
│ 4. HTMX router        │ /{secret}/htmx/* (prefix-based)
│ 5. Honeypot router    │ /* (catch-all)
└───────────────────────┘
```
### Prefix-Based Routing
Dashboard, API, and HTMX routers are mounted with `prefix=f"/{secret}"` in `app.py`. This means:
- Route handlers define paths **without** the secret (e.g., `@router.get("/api/all-ips")`)
- FastAPI prepends the secret automatically (e.g., `GET /a1b2c3/api/all-ips`)
- The honeypot catch-all `/{path:path}` only matches paths that **don't** start with the secret
- No `_is_dashboard_path()` checks needed — the prefix handles access scoping
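
The prefix mechanics described above can be illustrated without FastAPI. The toy router below is a hypothetical stand-in (`MiniRouter`, `mount`, and `dispatch` are illustrative names, not Krawl or FastAPI APIs): handlers register paths without the secret, mounting prepends it, and anything outside the secret prefix falls through to the trap page.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[], str]


class MiniRouter:
    """Toy router: handlers register paths without knowing the final prefix."""

    def __init__(self) -> None:
        self.routes: Dict[str, Handler] = {}

    def get(self, path: str) -> Callable[[Handler], Handler]:
        def register(handler: Handler) -> Handler:
            self.routes[path] = handler
            return handler
        return register


api = MiniRouter()


@api.get("/api/all-ips")  # declared without the secret
def all_ips() -> str:
    return "ip list"


def mount(router: MiniRouter, prefix: str) -> Dict[str, Handler]:
    # Mounting prepends the prefix to every registered path.
    return {prefix + path: handler for path, handler in router.routes.items()}


def dispatch(path: str, table: Dict[str, Handler], prefix: str) -> Optional[str]:
    if path in table:
        return table[path]()
    if path == prefix or path.startswith(prefix + "/"):
        return None  # under the secret but unmatched; real behaviour not modelled
    return "trap page"  # honeypot catch-all


secret = "a1b2c3"
table = mount(api, f"/{secret}")
```

Here `dispatch("/a1b2c3/api/all-ips", table, "/a1b2c3")` reaches the API handler, while the same path without the secret is treated like any other probe and receives the trap page.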
## Route Architecture
### Honeypot Routes (`routes/honeypot.py`)
| Method | Path | Response |
|--------|------|----------|
| `GET` | `/{path:path}` | Trap page with random links (catch-all) |
| `HEAD` | `/{path:path}` | 200 OK |
| `POST` | `/{path:path}` | Credential capture |
| `GET` | `/admin`, `/login` | Fake login form |
| `GET` | `/wp-admin`, `/wp-login.php` | Fake WordPress login |
| `GET` | `/phpmyadmin` | Fake phpMyAdmin |
| `GET` | `/robots.txt` | Honeypot paths advertised |
| `GET/POST` | `/api/search`, `/api/sql` | SQL injection honeypot |
| `POST` | `/api/contact` | XSS detection endpoint |
| `GET` | `/.env`, `/credentials.txt` | Fake sensitive files |
### Dashboard Routes (`routes/dashboard.py`)
| Method | Path | Response |
|--------|------|----------|
| `GET` | `/` | Server-rendered dashboard (Jinja2) |
### API Routes (`routes/api.py`)
| Method | Path | Response |
|--------|------|----------|
| `GET` | `/api/all-ips` | Paginated IP list with stats |
| `GET` | `/api/attackers` | Paginated attacker IPs |
| `GET` | `/api/ip-stats/{ip}` | Single IP detail |
| `GET` | `/api/credentials` | Captured credentials |
| `GET` | `/api/honeypot` | Honeypot trigger counts |
| `GET` | `/api/top-ips` | Top requesting IPs |
| `GET` | `/api/top-paths` | Most requested paths |
| `GET` | `/api/top-user-agents` | Top user agents |
| `GET` | `/api/attack-types-stats` | Attack type distribution |
| `GET` | `/api/attack-types` | Paginated attack log |
| `GET` | `/api/raw-request/{id}` | Full HTTP request |
| `GET` | `/api/get_banlist` | Export ban rules |
### HTMX Fragment Routes (`routes/htmx.py`)
Each endpoint returns a server-rendered Jinja2 partial, which the page swaps in with `hx-swap="innerHTML"`:
| Path | Template |
|------|----------|
| `/htmx/honeypot` | `honeypot_table.html` |
| `/htmx/top-ips` | `top_ips_table.html` |
| `/htmx/top-paths` | `top_paths_table.html` |
| `/htmx/top-ua` | `top_ua_table.html` |
| `/htmx/attackers` | `attackers_table.html` |
| `/htmx/credentials` | `credentials_table.html` |
| `/htmx/attacks` | `attack_types_table.html` |
| `/htmx/patterns` | `patterns_table.html` |
| `/htmx/ip-detail/{ip}` | `ip_detail.html` |
## Database Schema
```
┌─────────────────┐     ┌──────────────────┐
│    AccessLog    │     │ AttackDetection  │
├─────────────────┤     ├──────────────────┤
│ id (PK)         │◄────│ access_log_id(FK)│
│ ip (indexed)    │     │ attack_type      │
│ path            │     │ matched_pattern  │
│ user_agent      │     └──────────────────┘
│ method          │
│ is_suspicious   │     ┌──────────────────┐
│ is_honeypot     │     │CredentialAttempt │
│ timestamp       │     ├──────────────────┤
│ raw_request     │     │ id (PK)          │
└─────────────────┘     │ ip (indexed)     │
                        │ path, username   │
┌─────────────────┐     │ password         │
│     IpStats     │     │ timestamp        │
├─────────────────┤     └──────────────────┘
│ ip (PK)         │
│ total_requests  │     ┌──────────────────┐
│ first/last_seen │     │ CategoryHistory  │
│ country_code    │     ├──────────────────┤
│ city, lat, lon  │     │ id (PK)          │
│ asn, asn_org    │     │ ip (indexed)     │
│ isp, reverse    │     │ old_category     │
│ is_proxy        │     │ new_category     │
│ is_hosting      │     │ timestamp        │
│ list_on (JSON)  │     └──────────────────┘
│ category        │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘
```
**SQLite config:** WAL mode, 30s busy timeout, file permissions 600.
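
As a rough illustration of those three settings, the sketch below applies them with the standard-library `sqlite3` module. Krawl itself goes through SQLAlchemy, so this shows the underlying SQLite behaviour rather than the project's `database.py`; `open_db` and the temporary path are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_db(path: str) -> sqlite3.Connection:
    # timeout=30 makes a blocked connection wait up to 30 s for locks.
    conn = sqlite3.connect(path, timeout=30)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Restrict the database file to its owner (rw-------).
    os.chmod(path, 0o600)
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "krawl.db")
conn = open_db(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
conn.close()
```

The journal mode is stored in the database file, so later connections to the same file also run in WAL mode; the timeout, by contrast, is per connection.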
## Frontend Architecture
```
base.html
├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
├── Static: dashboard.css
└── dashboard/index.html (extends base)
    ├── Stats cards ──────────── Server-rendered on page load
    ├── Suspicious table ─────── Server-rendered on page load
    ├── Overview tab (Alpine.js x-show)
    │   ├── Honeypot table ───── HTMX hx-get on load
    │   ├── Top IPs table ────── HTMX hx-get on load
    │   ├── Top Paths table ──── HTMX hx-get on load
    │   ├── Top UA table ─────── HTMX hx-get on load
    │   └── Credentials table ── HTMX hx-get on load
    └── Attacks tab (Alpine.js x-show, lazy init)
        ├── Attackers table ──── HTMX hx-get on load
        ├── Map ──────────────── Leaflet (init on tab switch)
        ├── Chart ────────────── Chart.js (init on tab switch)
        ├── Attack types table ─ HTMX hx-get on load
        └── Patterns table ───── HTMX hx-get on load
```
**Responsibility split:**
- **Alpine.js** — Tab state, modals, dropdowns, lazy initialization
- **HTMX** — Table pagination, sorting, IP detail expansion
- **Leaflet** — Interactive map with category-colored markers
- **Chart.js** — Doughnut chart for attack type distribution
- **Custom SVG** — Radar charts for IP category scores
## Background Tasks
Managed by `TasksMaster` (APScheduler). Tasks are auto-discovered from `src/tasks/`.
| Task | Schedule | Purpose |
|------|----------|---------|
| `analyze_ips` | Every 1 min | Score IPs into categories (`attacker`, `bad_crawler`, `good_crawler`, `regular_user`) |
| `fetch_ip_rep` | Every 5 min | Enrich IPs with geolocation + blocklist data |
| `db_dump` | Configurable | Export database backups |
| `memory_cleanup` | Periodic | Trim in-memory lists |
| `top_attacking_ips` | Periodic | Cache top attackers |
### IP Categorization Model
Each IP is scored across four categories, using these signals:
- HTTP method distribution (risky methods ratio)
- Robots.txt violations
- Request timing anomalies (coefficient of variation)
- User-Agent diversity
- Attack URL detection
Categories: `attacker`, `bad_crawler`, `good_crawler`, `regular_user`, `unknown`
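
Of the signals listed, the timing one is the least self-explanatory. The sketch below shows one common way to compute a coefficient of variation over the gaps between requests; `timing_cv` is a hypothetical helper, and how Krawl thresholds or weights the value is not specified here.

```python
from statistics import mean, pstdev
from typing import Sequence


def timing_cv(timestamps: Sequence[float]) -> float:
    """Coefficient of variation (std / mean) of the gaps between requests."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return 0.0  # too few requests to say anything about regularity
    avg = mean(gaps)
    if avg == 0:
        return 0.0  # all requests share one timestamp
    return pstdev(gaps) / avg


bot_like = timing_cv([0.0, 1.0, 2.0, 3.0, 4.0])     # perfectly regular gaps
human_like = timing_cv([0.0, 0.5, 7.0, 7.2, 30.0])  # bursty, irregular gaps
```

A value near zero means the requests arrive at machine-regular intervals, while irregular browsing tends to produce a value closer to or above one.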
## Configuration
Settings live in `config.yaml`; each field can be overridden by an environment variable named `KRAWL_{FIELD}`:
```yaml
server:
  port: 5000
  delay: 100                    # Response delay (ms)
dashboard:
  secret_path: "test"           # Auto-generates if null
database:
  path: "data/krawl.db"
  retention_days: 30
crawl:
  infinite_pages_for_malicious: true
  max_pages_limit: 250
  ban_duration_seconds: 600
behavior:
  probability_error_codes: 0    # 0-100%
canary:
  token_url: null               # External canary alert URL
```
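
One way such overrides could be layered on top of the YAML values is sketched below. The exact variable naming is an assumption: the sketch uses `KRAWL_<SECTION>_<KEY>` in upper case (for example `KRAWL_SERVER_PORT`), which this document does not confirm, and `apply_env_overrides` is a hypothetical helper rather than the loader in `config.py`.

```python
import copy
from typing import Any, Dict, Mapping


def apply_env_overrides(config: Dict[str, Any], env: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `config` with KRAWL_<SECTION>_<KEY> values applied."""
    merged = copy.deepcopy(config)
    for section, values in merged.items():
        for key, current in values.items():
            raw = env.get(f"KRAWL_{section}_{key}".upper())  # assumed naming scheme
            if raw is None:
                continue
            # Coerce the string to the type of the YAML default.
            if isinstance(current, bool):
                values[key] = raw.strip().lower() in {"1", "true", "yes"}
            elif isinstance(current, int):
                values[key] = int(raw)
            else:
                values[key] = raw
    return merged


defaults = {
    "server": {"port": 5000, "delay": 100},
    "crawl": {"infinite_pages_for_malicious": True},
}
env = {
    "KRAWL_SERVER_PORT": "8080",
    "KRAWL_CRAWL_INFINITE_PAGES_FOR_MALICIOUS": "false",
}
config = apply_env_overrides(defaults, env)
```

In a real process, `os.environ` would be passed as `env`. Checking `bool` before `int` matters because `bool` is a subclass of `int` in Python.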
## Logging
Three rotating log files (1 MB max, 5 backups each):
| Logger | File | Content |
|--------|------|---------|
| `krawl.app` | `logs/krawl.log` | Application events, errors |
| `krawl.access` | `logs/access.log` | HTTP access, attack detections |
| `krawl.credentials` | `logs/credentials.log` | Captured login attempts |
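
A setup along these lines can be built with the standard library's `RotatingFileHandler`. The sketch below is an illustration rather than the contents of `logger.py`: `setup_loggers`, the log format, and the temporary directory are assumptions, while the logger names, file names, size limit, and backup count come from the table above.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

LOG_FILES = {
    "krawl.app": "krawl.log",
    "krawl.access": "access.log",
    "krawl.credentials": "credentials.log",
}


def setup_loggers(log_dir: str) -> None:
    os.makedirs(log_dir, exist_ok=True)
    for name, filename in LOG_FILES.items():
        handler = RotatingFileHandler(
            os.path.join(log_dir, filename),
            maxBytes=1024 * 1024,  # rotate once the file reaches about 1 MB
            backupCount=5,         # keep up to five rotated files per log
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        log = logging.getLogger(name)
        log.setLevel(logging.INFO)
        log.addHandler(handler)
        log.propagate = False  # keep each stream in its own file


log_dir = os.path.join(tempfile.mkdtemp(), "logs")
setup_loggers(log_dir)
logging.getLogger("krawl.credentials").info("login attempt captured")
```

Disabling propagation keeps captured credentials out of the application log, so each file can be shipped or retained under its own policy.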
## Docker
```dockerfile
FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]
```
## Key Data Flows
### Honeypot Request
```
Client → BanCheck → DeceptionMiddleware → HoneypotRouter
                                       ┌─────────┴──────────┐
                                       │  tracker.record()  │
                                       │  ├─ in-memory ++   │
                                       │  ├─ detect attacks │
                                       │  └─ DB persist     │
                                       └────────────────────┘
```
### Dashboard Load
```
Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
        → Alpine.js init → HTMX fires hx-get for each table
        → User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
            → Leaflet fetches /api/all-ips → plots markers
            → Chart.js fetches /api/attack-types-stats → renders doughnut
```
### IP Enrichment Pipeline
```
APScheduler (every 5 min)
  └─ fetch_ip_rep.main()
       ├─ DB: get unenriched IPs (limit 50)
       ├─ ip-api.com → geolocation (country, city, ASN, coords)
       ├─ iprep.lcrawl.com → blocklist memberships
       └─ DB: update IpStats with enriched data
```