# Krawl Architecture

## Overview

Krawl is a cloud-native deception honeypot server built on **FastAPI**. It creates realistic fake web applications (admin panels, login pages, fake credentials) to attract, detect, and analyze malicious crawlers and attackers while wasting their resources with infinite spider-trap pages.

## Tech Stack

| Layer | Technology |
|-------|-----------|
| **Backend** | FastAPI, Uvicorn, Python 3.11 |
| **ORM / DB** | SQLAlchemy 2.0, SQLite (WAL mode) |
| **Templating** | Jinja2 (server-side rendering) |
| **Reactivity** | Alpine.js 3.14 |
| **Partial Updates** | HTMX 2.0 |
| **Charts** | Chart.js 3.9 (doughnut), custom SVG radar |
| **Maps** | Leaflet 1.9 + CartoDB dark tiles |
| **Scheduling** | APScheduler |
| **Container** | Docker (python:3.11-slim), Helm/K8s ready |

## Directory Structure

```
Krawl/
├── src/
│   ├── app.py                    # FastAPI app factory + lifespan
│   ├── config.py                 # YAML + env config loader
│   ├── dependencies.py           # DI providers (templates, DB, client IP)
│   ├── database.py               # DatabaseManager singleton
│   ├── models.py                 # SQLAlchemy ORM models
│   ├── tracker.py                # In-memory + DB access tracking
│   ├── logger.py                 # Rotating file log handlers
│   ├── deception_responses.py    # Attack detection + fake responses
│   ├── sanitizer.py              # Input sanitization
│   ├── generators.py             # Random content generators
│   ├── wordlists.py              # JSON wordlist loader
│   ├── geo_utils.py              # IP geolocation API
│   ├── ip_utils.py               # IP validation
│   │
│   ├── routes/
│   │   ├── honeypot.py           # Trap pages, credential capture, catch-all
│   │   ├── dashboard.py          # Dashboard page (Jinja2 SSR)
│   │   ├── api.py                # JSON API endpoints
│   │   └── htmx.py               # HTMX HTML fragment endpoints
│   │
│   ├── middleware/
│   │   ├── deception.py          # Path traversal / XXE / cmd injection detection
│   │   └── ban_check.py          # Banned IP enforcement
│   │
│   ├── tasks/                    # APScheduler background jobs
│   │   ├── analyze_ips.py        # IP categorization scoring
│   │   ├── fetch_ip_rep.py       # Geolocation + blocklist enrichment
│   │   ├── db_dump.py            # Database export
│   │   ├── memory_cleanup.py     # In-memory list trimming
│   │   └── top_attacking_ips.py  # Top attacker caching
│   │
│   ├── tasks_master.py           # Task discovery + APScheduler orchestrator
│   ├── firewall/                 # Banlist export (iptables, raw)
│   ├── migrations/               # Schema migrations (auto-run)
│   │
│   └── templates/
│       ├── jinja2/
│       │   ├── base.html         # Layout + CDN scripts
│       │   └── dashboard/
│       │       ├── index.html    # Main dashboard page
│       │       └── partials/     # 13 HTMX fragment templates
│       ├── html/                 # Deceptive trap page templates
│       └── static/
│           ├── css/dashboard.css
│           └── js/
│               ├── dashboard.js  # Alpine.js app controller
│               ├── map.js        # Leaflet map
│               ├── charts.js     # Chart.js doughnut
│               └── radar.js      # SVG radar chart
│
├── config.yaml                   # Application configuration
├── wordlists.json                # Attack patterns + fake credentials
├── Dockerfile                    # Container build
├── docker-compose.yaml           # Local orchestration
├── entrypoint.sh                 # Container startup (gosu privilege drop)
├── kubernetes/                   # K8s manifests
└── helm/                         # Helm chart
```

## Application Entry Point

`src/app.py` uses the **FastAPI application factory** pattern with an async lifespan manager:

```
Startup                                    Shutdown
  │                                           │
  ├─ Initialize logging                       └─ Log shutdown
  ├─ Initialize SQLite DB
  ├─ Create AccessTracker
  ├─ Load webpages file (optional)
  ├─ Store config + tracker in app.state
  ├─ Start APScheduler background tasks
  └─ Log dashboard URL
```

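The startup and shutdown ordering above can be sketched with a plain async context manager, which is the shape FastAPI accepts for its `lifespan` argument. This is a minimal, framework-free sketch: the `events` list, the dictionary stand-ins for `AccessTracker` and the config, and the `SimpleNamespace` app object are all hypothetical and exist only to make the ordering visible.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


@asynccontextmanager
async def lifespan(app):
    """Run startup steps, yield while the app serves, then run shutdown."""
    events = app.state.events  # hypothetical list used only to show ordering
    events.append("init logging")
    events.append("init database")
    app.state.tracker = {"requests": 0}  # stand-in for AccessTracker
    app.state.config = {"port": 5000}    # stand-in for the loaded config
    events.append("start scheduler")
    try:
        yield  # the application handles requests here
    finally:
        events.append("log shutdown")


async def demo():
    # A bare namespace plays the role of the FastAPI app in this sketch.
    app = SimpleNamespace(state=SimpleNamespace(events=[]))
    async with lifespan(app):
        app.state.events.append("serving")
    return app.state.events


events = asyncio.run(demo())
```

With FastAPI installed, the same function could be passed as `FastAPI(lifespan=lifespan)`; whether `src/app.py` structures its steps exactly this way is not stated in this document.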
## Request Pipeline

```
Request
           │
           ▼
┌──────────────────────┐
│  BanCheckMiddleware  │──→ IP banned? → Return 500
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ DeceptionMiddleware  │──→ Attack detected? → Fake error response
└──────────┬───────────┘
           ▼
┌───────────────────────┐
│ ServerHeaderMiddleware│──→ Add random Server header
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│    Route Matching     │
│  (ordered by priority)│
│                       │
│ 1. Static files       │  /{secret}/static/*
│ 2. Dashboard router   │  /{secret}/            (prefix-based)
│ 3. API router         │  /{secret}/api/*       (prefix-based)
│ 4. HTMX router        │  /{secret}/htmx/*      (prefix-based)
│ 5. Honeypot router    │  /*                    (catch-all)
└───────────────────────┘
```

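The short-circuiting behavior of the pipeline can be illustrated without any framework. The sketch below is an assumption-laden toy: the ban list, the attack markers, the list of `Server` values, and the status code of the fake error are all hypothetical, and requests and responses are plain dictionaries instead of Starlette objects.

```python
import random

BANNED_IPS = {"203.0.113.7"}                     # hypothetical ban list
ATTACK_MARKERS = ("../", "<!ENTITY", ";cat ")    # hypothetical patterns
SERVER_HEADERS = ["nginx/1.18.0", "Apache/2.4.41", "Microsoft-IIS/10.0"]


def route(request):
    # Stand-in for route matching; always serves a trap page here.
    return {"status": 200, "headers": {}, "body": "trap page"}


def server_header(request, call_next):
    response = call_next(request)
    response["headers"]["Server"] = random.choice(SERVER_HEADERS)
    return response


def deception(request, call_next):
    if any(marker in request["path"] for marker in ATTACK_MARKERS):
        # Status code chosen for illustration only.
        return {"status": 500, "headers": {}, "body": "fake error"}
    return call_next(request)


def ban_check(request, call_next):
    if request["ip"] in BANNED_IPS:
        return {"status": 500, "headers": {}, "body": ""}
    return call_next(request)


def handle(request):
    # Outermost first, matching the diagram: ban check, deception, header, routes.
    return ban_check(
        request,
        lambda r: deception(r, lambda r2: server_header(r2, route)),
    )
```

Calling `handle({"ip": "198.51.100.1", "path": "/admin"})` walks all three layers and reaches `route`, while a banned IP or a path containing `../` returns early without touching the later layers.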
### Prefix-Based Routing

Dashboard, API, and HTMX routers are mounted with `prefix=f"/{secret}"` in `app.py`. This means:

- Route handlers define paths **without** the secret (e.g., `@router.get("/api/all-ips")`)
- FastAPI prepends the secret automatically (e.g., `GET /a1b2c3/api/all-ips`)
- The honeypot catch-all `/{path:path}` only matches paths that **don't** start with the secret
- No `_is_dashboard_path()` checks are needed; the prefix handles access scoping

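The effect of the prefix can be modeled with a small lookup table. This is a simplified sketch using exact-match paths and a hypothetical secret `a1b2c3`; it imitates the resolution order described above and does not use FastAPI's router.

```python
SECRET = "a1b2c3"  # hypothetical dashboard secret

# Handlers declare paths without the secret; the prefix is added at mount time.
PREFIXED_ROUTES = {
    "/": "dashboard",
    "/api/all-ips": "api",
    "/htmx/top-ips": "htmx",
}


def mount(routes, prefix):
    """Return a copy of the route table with the prefix prepended to each path."""
    return {prefix + path: name for path, name in routes.items()}


ROUTE_TABLE = mount(PREFIXED_ROUTES, f"/{SECRET}")


def resolve(path):
    # Prefixed routes are checked first; everything else falls to the catch-all.
    return ROUTE_TABLE.get(path, "honeypot")
```

Here `resolve("/a1b2c3/api/all-ips")` yields `"api"`, while the unprefixed `resolve("/api/all-ips")` falls through to `"honeypot"`, which is why an attacker probing `/api/...` never reaches the real API.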
## Route Architecture

### Honeypot Routes (`routes/honeypot.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/{path:path}` | Trap page with random links (catch-all) |
| `HEAD` | `/{path:path}` | 200 OK |
| `POST` | `/{path:path}` | Credential capture |
| `GET` | `/admin`, `/login` | Fake login form |
| `GET` | `/wp-admin`, `/wp-login.php` | Fake WordPress login |
| `GET` | `/phpmyadmin` | Fake phpMyAdmin |
| `GET` | `/robots.txt` | Honeypot paths advertised |
| `GET/POST` | `/api/search`, `/api/sql` | SQL injection honeypot |
| `POST` | `/api/contact` | XSS detection endpoint |
| `GET` | `/.env`, `/credentials.txt` | Fake sensitive files |

### Dashboard Routes (`routes/dashboard.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/` | Server-rendered dashboard (Jinja2) |

### API Routes (`routes/api.py`)

| Method | Path | Response |
|--------|------|----------|
| `GET` | `/api/all-ips` | Paginated IP list with stats |
| `GET` | `/api/attackers` | Paginated attacker IPs |
| `GET` | `/api/ip-stats/{ip}` | Single IP detail |
| `GET` | `/api/credentials` | Captured credentials |
| `GET` | `/api/honeypot` | Honeypot trigger counts |
| `GET` | `/api/top-ips` | Top requesting IPs |
| `GET` | `/api/top-paths` | Most requested paths |
| `GET` | `/api/top-user-agents` | Top user agents |
| `GET` | `/api/attack-types-stats` | Attack type distribution |
| `GET` | `/api/attack-types` | Paginated attack log |
| `GET` | `/api/raw-request/{id}` | Full HTTP request |
| `GET` | `/api/get_banlist` | Export ban rules |

### HTMX Fragment Routes (`routes/htmx.py`)

Each route returns a server-rendered Jinja2 partial (`hx-swap="innerHTML"`):

| Path | Template |
|------|----------|
| `/htmx/honeypot` | `honeypot_table.html` |
| `/htmx/top-ips` | `top_ips_table.html` |
| `/htmx/top-paths` | `top_paths_table.html` |
| `/htmx/top-ua` | `top_ua_table.html` |
| `/htmx/attackers` | `attackers_table.html` |
| `/htmx/credentials` | `credentials_table.html` |
| `/htmx/attacks` | `attack_types_table.html` |
| `/htmx/patterns` | `patterns_table.html` |
| `/htmx/ip-detail/{ip}` | `ip_detail.html` |

## Database Schema

```
┌─────────────────┐     ┌──────────────────┐
│    AccessLog    │     │ AttackDetection  │
├─────────────────┤     ├──────────────────┤
│ id (PK)         │◄────│ access_log_id(FK)│
│ ip (indexed)    │     │ attack_type      │
│ path            │     │ matched_pattern  │
│ user_agent      │     └──────────────────┘
│ method          │
│ is_suspicious   │     ┌──────────────────┐
│ is_honeypot     │     │CredentialAttempt │
│ timestamp       │     ├──────────────────┤
│ raw_request     │     │ id (PK)          │
└─────────────────┘     │ ip (indexed)     │
                        │ path, username   │
┌─────────────────┐     │ password         │
│     IpStats     │     │ timestamp        │
├─────────────────┤     └──────────────────┘
│ ip (PK)         │
│ total_requests  │     ┌──────────────────┐
│ first/last_seen │     │ CategoryHistory  │
│ country_code    │     ├──────────────────┤
│ city, lat, lon  │     │ id (PK)          │
│ asn, asn_org    │     │ ip (indexed)     │
│ isp, reverse    │     │ old_category     │
│ is_proxy        │     │ new_category     │
│ is_hosting      │     │ timestamp        │
│ list_on (JSON)  │     └──────────────────┘
│ category        │
│ category_scores │
│ analyzed_metrics│
│ manual_category │
└─────────────────┘
```

**SQLite config:** WAL mode, 30s busy timeout, file permissions 600.

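The three SQLite settings listed above map onto two pragmas and one `chmod`. The following is a minimal sketch using the standard-library `sqlite3` module instead of SQLAlchemy, so it shows the settings themselves and not how Krawl's `DatabaseManager` applies them; the `open_database` helper and the temporary path are hypothetical.

```python
import os
import sqlite3
import stat
import tempfile


def open_database(path):
    """Open SQLite with WAL journaling, a 30 s busy timeout, and 0600 perms."""
    conn = sqlite3.connect(path, timeout=30)
    conn.execute("PRAGMA journal_mode=WAL")     # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=30000")   # milliseconds to wait on a lock
    os.chmod(path, 0o600)                       # owner read/write only
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "krawl.db")
conn = open_database(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
file_mode = stat.S_IMODE(os.stat(db_path).st_mode)
conn.close()
```

WAL mode is persistent once set on a database file, whereas the busy timeout is per connection and has to be applied each time a connection is opened.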
## Frontend Architecture

```
base.html
  ├── CDN: Leaflet, Chart.js, HTMX, Alpine.js (deferred)
  ├── Static: dashboard.css
  │
  └── dashboard/index.html (extends base)
        │
        ├── Stats cards ──────────── Server-rendered on page load
        ├── Suspicious table ─────── Server-rendered on page load
        │
        ├── Overview tab (Alpine.js x-show)
        │   ├── Honeypot table ───── HTMX hx-get on load
        │   ├── Top IPs table ────── HTMX hx-get on load
        │   ├── Top Paths table ──── HTMX hx-get on load
        │   ├── Top UA table ─────── HTMX hx-get on load
        │   └── Credentials table ── HTMX hx-get on load
        │
        └── Attacks tab (Alpine.js x-show, lazy init)
            ├── Attackers table ──── HTMX hx-get on load
            ├── Map ──────────────── Leaflet (init on tab switch)
            ├── Chart ────────────── Chart.js (init on tab switch)
            ├── Attack types table ─ HTMX hx-get on load
            └── Patterns table ───── HTMX hx-get on load
```

**Responsibility split:**

- **Alpine.js**: Tab state, modals, dropdowns, lazy initialization
- **HTMX**: Table pagination, sorting, IP detail expansion
- **Leaflet**: Interactive map with category-colored markers
- **Chart.js**: Doughnut chart for attack type distribution
- **Custom SVG**: Radar charts for IP category scores

## Background Tasks

Managed by `TasksMaster` (APScheduler). Tasks are auto-discovered from `src/tasks/`.

| Task | Schedule | Purpose |
|------|----------|---------|
| `analyze_ips` | Every 1 min | Score IPs into categories (attacker, crawler, user) |
| `fetch_ip_rep` | Every 5 min | Enrich IPs with geolocation + blocklist data |
| `db_dump` | Configurable | Export database backups |
| `memory_cleanup` | Periodic | Trim in-memory lists |
| `top_attacking_ips` | Periodic | Cache top attackers |

### IP Categorization Model

Each IP is scored across 4 categories (`attacker`, `bad_crawler`, `good_crawler`, `regular_user`) based on:

- HTTP method distribution (risky methods ratio)
- Robots.txt violations
- Request timing anomalies (coefficient of variation)
- User-Agent diversity
- Attack URL detection

Category values: `attacker`, `bad_crawler`, `good_crawler`, `regular_user`, plus `unknown`

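Of the signals above, the timing anomaly is the one with a precise formula: the coefficient of variation is the standard deviation of the gaps between requests divided by their mean. A minimal sketch follows; the `timing_cv` helper, the minimum-sample rule, and the reading that a low value suggests machine-like regularity are assumptions, since the document does not say how `analyze_ips` turns the value into a score.

```python
from statistics import mean, pstdev


def timing_cv(timestamps):
    """Coefficient of variation (stddev / mean) of gaps between requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None  # not enough data to judge regularity
    avg = mean(gaps)
    if avg == 0:
        return None  # all requests share one timestamp; ratio is undefined
    return pstdev(gaps) / avg


bot_like = timing_cv([0, 10, 20, 30, 40])    # evenly spaced requests
human_like = timing_cv([0, 2, 30, 31, 90])   # irregular bursts
```

Evenly spaced timestamps give a coefficient of zero, while the irregular series gives a value above one, so a threshold somewhere in between could separate the two in this toy data.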
## Configuration

`config.yaml` with environment variable overrides (`KRAWL_{FIELD}`):

```yaml
server:
  port: 5000
  delay: 100                      # Response delay (ms)

dashboard:
  secret_path: "test"             # Auto-generates if null

database:
  path: "data/krawl.db"
  retention_days: 30

crawl:
  infinite_pages_for_malicious: true
  max_pages_limit: 250
  ban_duration_seconds: 600

behavior:
  probability_error_codes: 0      # 0-100%

canary:
  token_url: null                 # External canary alert URL
```

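The `KRAWL_{FIELD}` override rule can be sketched as a merge of defaults with the environment. This is a hedged illustration only: the flat field names, the exact variable names such as `KRAWL_PORT`, and the type-casting rule are assumptions, and the real loader in `config.py` may map nested YAML keys differently.

```python
import os

# Hypothetical flattened subset of the fields shown in config.yaml.
DEFAULTS = {"port": 5000, "delay": 100, "secret_path": None}


def load_config(defaults, environ=os.environ):
    """Return defaults with KRAWL_<FIELD> environment overrides applied."""
    config = dict(defaults)
    for field, default in defaults.items():
        raw = environ.get(f"KRAWL_{field.upper()}")
        if raw is None:
            continue
        # Cast to the default's type when there is one; otherwise keep the string.
        config[field] = type(default)(raw) if default is not None else raw
    return config


config = load_config(DEFAULTS, {"KRAWL_PORT": "8080", "KRAWL_SECRET_PATH": "panel"})
```

Passing the environment as a parameter keeps the function testable without mutating `os.environ`. Boolean fields would need explicit parsing, because `bool("false")` is `True` in Python.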
## Logging

Three rotating log files (1 MB max, 5 backups each):

| Logger | File | Content |
|--------|------|---------|
| `krawl.app` | `logs/krawl.log` | Application events, errors |
| `krawl.access` | `logs/access.log` | HTTP access, attack detections |
| `krawl.credentials` | `logs/credentials.log` | Captured login attempts |

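The table above corresponds to three named loggers, each with its own rotating handler. A minimal sketch with the standard library follows; the `build_logger` helper, the log format, and the temporary directory are hypothetical, and whether "1 MB" means 2**20 or 10**6 bytes in `logger.py` is an assumption.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

MAX_BYTES = 1024 * 1024  # assumed meaning of "1 MB"
BACKUP_COUNT = 5


def build_logger(name, path):
    """Create a logger that rotates its file at MAX_BYTES, keeping 5 backups."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()  # stands in for the logs/ directory
loggers = {
    name: build_logger(name, os.path.join(log_dir, filename))
    for name, filename in [
        ("krawl.app", "krawl.log"),
        ("krawl.access", "access.log"),
        ("krawl.credentials", "credentials.log"),
    ]
}
loggers["krawl.access"].info("GET /wp-admin from 198.51.100.1")
```

Keeping credentials in a separate logger means captured login attempts can be shipped or retained independently of the general access log.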
## Docker

```dockerfile
FROM python:3.11-slim
# Non-root user: krawl:1000
# Volumes: /app/logs, /app/data, /app/exports
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000", "--app-dir", "src"]
```

## Key Data Flows

### Honeypot Request

```
Client → BanCheck → DeceptionMiddleware → HoneypotRouter
                                                │
                                      ┌─────────┴──────────┐
                                      │ tracker.record()   │
                                      │  ├─ in-memory ++   │
                                      │  ├─ detect attacks │
                                      │  └─ DB persist     │
                                      └────────────────────┘
```

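The three steps inside `tracker.record()` can be sketched as a small class. This is an illustrative stand-in and not the code in `tracker.py`: the method signature, the two regular expressions, and the `persist` callable that replaces the database write are all hypothetical (the document says the real patterns come from `wordlists.json`).

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only.
ATTACK_PATTERNS = {
    "path_traversal": re.compile(r"\.\./"),
    "sql_injection": re.compile(r"(?i)union\s+select"),
}


class AccessTracker:
    def __init__(self, persist):
        self.ip_counts = Counter()
        self.persist = persist  # callable standing in for the DB write

    def record(self, ip, path):
        self.ip_counts[ip] += 1  # in-memory counter
        # Detect attacks by matching the path against each known pattern.
        attacks = [name for name, rx in ATTACK_PATTERNS.items() if rx.search(path)]
        self.persist({"ip": ip, "path": path, "attacks": attacks})  # DB persist
        return attacks


rows = []
tracker = AccessTracker(persist=rows.append)
tracker.record("198.51.100.1", "/index.html")
found = tracker.record("198.51.100.1", "/files?name=../../etc/passwd")
```

Injecting the persistence step as a callable keeps the in-memory counting and detection logic testable without a database.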
### Dashboard Load

```
Browser → GET /{secret}/ → SSR initial stats + Jinja2 render
        → Alpine.js init → HTMX fires hx-get for each table
        → User clicks Attacks tab → setTimeout → init Leaflet + Chart.js
            → Leaflet fetches /api/all-ips → plots markers
            → Chart.js fetches /api/attack-types-stats → renders doughnut
```

### IP Enrichment Pipeline

```
APScheduler (every 5 min)
  └─ fetch_ip_rep.main()
       ├─ DB: get unenriched IPs (limit 50)
       ├─ ip-api.com → geolocation (country, city, ASN, coords)
       ├─ iprep.lcrawl.com → blocklist memberships
       └─ DB: update IpStats with enriched data
```

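The batch structure of the enrichment job can be sketched with the two external services replaced by injected callables, so the example runs offline. The `enrich_batch` function, the stub lookups, and their return shapes are hypothetical; only the limit of 50 and the `list_on` field name come from this document.

```python
def enrich_batch(pending_ips, geo_lookup, blocklist_lookup, limit=50):
    """Enrich up to `limit` IPs using two injected lookup callables."""
    enriched = {}
    for ip in pending_ips[:limit]:
        record = dict(geo_lookup(ip))             # country, city, ASN, coords
        record["list_on"] = blocklist_lookup(ip)  # blocklist memberships
        enriched[ip] = record
    return enriched


# Stub lookups so the sketch runs offline; real code would call the two services.
def fake_geo(ip):
    return {"country_code": "ZZ", "city": "Nowhere", "asn": 64500}


def fake_blocklists(ip):
    return ["example-blocklist"] if ip.endswith(".7") else []


pending = [f"203.0.113.{n}" for n in range(1, 61)]
result = enrich_batch(pending, fake_geo, fake_blocklists)
```

Capping each run at a fixed batch size spreads a backlog of unenriched IPs over several scheduler ticks, which also keeps the number of outbound API calls per run bounded.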