- A modern, customizable zero-dependencies honeypot server designed to detect and track malicious activity through deceptive web pages, fake credentials, and canary tokens.
-
-
-## Demo
-Tip: crawl the `robots.txt` paths for additional fun
-### Krawl URL: [http://demo.krawlme.com](http://demo.krawlme.com)
-### View the dashboard [http://demo.krawlme.com/das_dashboard](http://demo.krawlme.com/das_dashboard)
-
-## What is Krawl?
-
-**Krawl** is a cloud‑native deception server designed to detect, delay, and analyze malicious web crawlers and automated scanners.
-
-It creates realistic fake web applications filled with low‑hanging fruit such as admin panels, configuration files, and exposed fake credentials to attract and identify suspicious activity.
-
-By wasting attacker resources, Krawl helps clearly distinguish malicious behavior from legitimate crawlers.
-
-It features:
-
-- **Spider Trap Pages**: Infinite random links to waste crawler resources based on the [spidertrap project](https://github.com/adhdproject/spidertrap)
-- **Fake Login Pages**: WordPress, phpMyAdmin, admin panels
-- **Honeypot Paths**: Advertised in robots.txt to catch scanners
-- **Fake Credentials**: Realistic-looking usernames, passwords, API keys
-- **[Canary Token](#customizing-the-canary-token) Integration**: External alert triggering
-- **Real-time Dashboard**: Monitor suspicious activity
-- **Customizable Wordlists**: Easy JSON-based configuration
-- **Random Error Injection**: Mimic real server behavior
-
-
-
-## 🚀 Quick Start
-## Helm Chart
-
-Install with default values
-
-```bash
-helm install krawl oci://ghcr.io/blessedrebus/krawl-chart \
- --namespace krawl-system \
- --create-namespace
-```
-
-Install with custom [canary token](#customizing-the-canary-token)
-
-```bash
-helm install krawl oci://ghcr.io/blessedrebus/krawl-chart \
- --namespace krawl-system \
- --create-namespace \
- --set config.canaryTokenUrl="http://your-canary-token-url"
-```
-
-To access the deception server
-
-```bash
-kubectl get svc krawl -n krawl-system
-```
-
-Once the EXTERNAL-IP is assigned, access your deception server at:
-
-```
-http://:5000
-```
-
-## Kubernetes / Kustomize
-Apply all manifests with
-
-```bash
-kubectl apply -f https://raw.githubusercontent.com/BlessedRebuS/Krawl/refs/heads/main/manifests/krawl-all-in-one-deploy.yaml
-```
-
-Retrieve dashboard path with
-```bash
-kubectl get secret krawl-server -n krawl-system -o jsonpath='{.data.dashboard-path}' | base64 -d
-```
-
-Or clone the repo and apply the `manifest` folder with
-
-```bash
-kubectl apply -k manifests
-```
-
-## Docker
-Run Krawl as a docker container with
-
-```bash
-docker run -d \
- -p 5000:5000 \
- -e CANARY_TOKEN_URL="http://your-canary-token-url" \
- --name krawl \
- ghcr.io/blessedrebus/krawl:latest
-```
-
-## Docker Compose
-Run Krawl with docker-compose in the project folder with
-
-```bash
-docker-compose up -d
-```
-
-Stop it with
-
-```bash
-docker-compose down
-```
-
-## Python 3.11+
-
-Clone the repository
-
-```bash
-git clone https://github.com/blessedrebus/krawl.git
-cd krawl/src
-```
-Run the server
-```bash
-python3 server.py
-```
-
-Visit
-
-`http://localhost:5000`
-
-To access the dashboard
-
-`http://localhost:5000/`
-
-## Configuration via Environment Variables
-
-To customize the deception server installation several **environment variables** can be specified.
-
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `PORT` | Server listening port | `5000` |
-| `DELAY` | Response delay in milliseconds | `100` |
-| `LINKS_MIN_LENGTH` | Minimum random link length | `5` |
-| `LINKS_MAX_LENGTH` | Maximum random link length | `15` |
-| `LINKS_MIN_PER_PAGE` | Minimum links per page | `10` |
-| `LINKS_MAX_PER_PAGE` | Maximum links per page | `15` |
-| `MAX_COUNTER` | Initial counter value | `10` |
-| `CANARY_TOKEN_TRIES` | Requests before showing canary token | `10` |
-| `CANARY_TOKEN_URL` | External canary token URL | None |
-| `DASHBOARD_SECRET_PATH` | Custom dashboard path | Auto-generated |
-| `PROBABILITY_ERROR_CODES` | Error response probability (0-100%) | `0` |
-| `SERVER_HEADER` | HTTP Server header for deception | `Apache/2.2.22 (Ubuntu)` |
-
-## robots.txt
-The actual (juicy) robots.txt configuration is the following
-
-```txt
-Disallow: /admin/
-Disallow: /api/
-Disallow: /backup/
-Disallow: /config/
-Disallow: /database/
-Disallow: /private/
-Disallow: /uploads/
-Disallow: /wp-admin/
-Disallow: /phpMyAdmin/
-Disallow: /admin/login.php
-Disallow: /api/v1/users
-Disallow: /api/v2/secrets
-Disallow: /.env
-Disallow: /credentials.txt
-Disallow: /passwords.txt
-Disallow: /.git/
-Disallow: /backup.sql
-Disallow: /db_backup.sql
-```
-
-## Honeypot pages
-Requests to common admin endpoints (`/admin/`, `/wp-admin/`, `/phpMyAdmin/`) return a fake login page. Any login attempt triggers a 1-second delay to simulate real processing and is fully logged in the dashboard (credentials, IP, headers, timing).
-
-
-
-
-
-Requests to paths like `/backup/`, `/config/`, `/database/`, `/private/`, or `/uploads/` return a fake directory listing populated with “interesting” files, each assigned a random file size to look realistic.
-
-
-
-The `.env` endpoint exposes fake database connection strings, **AWS API keys**, and **Stripe secrets**. It intentionally returns an error due to the `Content-Type` being `application/json` instead of plain text, mimicking a “juicy” misconfiguration that crawlers and scanners often flag as information leakage.
-
-
-
-The pages `/api/v1/users` and `/api/v2/secrets` show fake users and random secrets in JSON format
-
-
-
-
-
-
-The pages `/credentials.txt` and `/passwords.txt` show fake users and random secrets
-
-
-
-
-
-
-## Customizing the Canary Token
-To create a custom canary token, visit https://canarytokens.org
-
-and generate a “Web bug” canary token.
-
-This optional token is triggered when a crawler fully traverses the webpage until it reaches 0. At that point, a URL is returned. When this URL is requested, it sends an alert to the user via email, including the visitor’s IP address and user agent.
-
-
-To enable this feature, set the canary token URL [using the environment variable](#configuration-via-environment-variables) `CANARY_TOKEN_URL`.
-
-## Customizing the wordlist
-
-Edit `wordlists.json` to customize fake data for your use case
-
-```json
-{
- "usernames": {
- "prefixes": ["admin", "root", "user"],
- "suffixes": ["_prod", "_dev", "123"]
- },
- "passwords": {
- "prefixes": ["P@ssw0rd", "Admin"],
- "simple": ["test", "password"]
- },
- "directory_listing": {
- "files": ["credentials.txt", "backup.sql"],
- "directories": ["admin/", "backup/"]
- }
-}
-```
-
-or **values.yaml** in the case of helm chart installation
-
-## Dashboard
-
-Access the dashboard at `http://:/`
-
-The dashboard shows:
-- Total and unique accesses
-- Suspicious activity detection
-- Top IPs, paths, and user-agents
-- Real-time monitoring
-
-The attackers' triggered honeypot path and the suspicious activity (such as failed login attempts) are logged
-
-
-
-The top IP Addresses is shown along with top paths and User Agents
-
-
-
-### Retrieving Dashboard Path
-
-Check server startup logs or get the secret with
-
-```bash
-kubectl get secret krawl-server -n krawl-system \
- -o jsonpath='{.data.dashboard-path}' | base64 -d && echo
-```
-
-## 🤝 Contributing
-
-Contributions welcome! Please:
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Submit a pull request (explain the changes!)
-
-
-
-
-## ⚠️ Disclaimer
-
-**This is a deception/honeypot system.**
-Deploy in isolated environments and monitor carefully for security events.
-Use responsibly and in compliance with applicable laws and regulations.
-
-## Star History
-
+
Krawl
+
+
+
+
+
+
+
+
+ A modern, customizable web honeypot server designed to detect and track malicious activity from attackers and web crawlers through deceptive web pages, fake credentials, and canary tokens.
+
+
+## Demo
+Tip: crawl the `robots.txt` paths for additional fun
+### Krawl URL: [http://demo.krawlme.com](http://demo.krawlme.com)
+### View the dashboard [http://demo.krawlme.com/das_dashboard](http://demo.krawlme.com/das_dashboard)
+
+## What is Krawl?
+
+**Krawl** is a cloud‑native deception server designed to detect, delay, and analyze attackers, malicious web crawlers, and automated scanners.
+
+It creates realistic fake web applications filled with low‑hanging fruit such as admin panels, configuration files, and exposed fake credentials to attract and identify suspicious activity.
+
+By wasting attacker resources, Krawl helps clearly distinguish malicious behavior from legitimate crawlers.
+
+It features:
+
+- **Spider Trap Pages**: Infinite random links to waste crawler resources based on the [spidertrap project](https://github.com/adhdproject/spidertrap)
+- **Fake Login Pages**: WordPress, phpMyAdmin, admin panels
+- **Honeypot Paths**: Advertised in robots.txt to catch scanners
+- **Fake Credentials**: Realistic-looking usernames, passwords, API keys
+- **[Canary Token](#customizing-the-canary-token) Integration**: External alert triggering
+- **Random server headers**: Confuse attacks that target specific server software and versions
+- **Real-time Dashboard**: Monitor suspicious activity
+- **Customizable Wordlists**: Easy JSON-based configuration
+- **Random Error Injection**: Mimic real server behavior
+
+
+
+
+
+## 🚀 Installation
+
+### Docker Run
+
+Run Krawl with the latest image:
+
+```bash
+docker run -d \
+ -p 5000:5000 \
+ -e KRAWL_PORT=5000 \
+ -e KRAWL_DELAY=100 \
+ -e KRAWL_DASHBOARD_SECRET_PATH="/my-secret-dashboard" \
+ -e KRAWL_DATABASE_RETENTION_DAYS=30 \
+ --name krawl \
+ ghcr.io/blessedrebus/krawl:latest
+```
+
+Access the server at `http://localhost:5000`
+
+### Docker Compose
+
+Create a `docker-compose.yaml` file:
+
+```yaml
+services:
+ krawl:
+ image: ghcr.io/blessedrebus/krawl:latest
+ container_name: krawl-server
+ ports:
+ - "5000:5000"
+ environment:
+ - CONFIG_LOCATION=config.yaml
+ - TZ="Europe/Rome"
+ volumes:
+ - ./config.yaml:/app/config.yaml:ro
+ - krawl-data:/app/data
+ restart: unless-stopped
+
+volumes:
+ krawl-data:
+```
+
+Run with:
+
+```bash
+docker-compose up -d
+```
+
+Stop with:
+
+```bash
+docker-compose down
+```
+
+### Kubernetes
+**Krawl is also available natively on Kubernetes**. Installation can be done either [via manifest](kubernetes/README.md) or [using the helm chart](helm/README.md).
+
+## Use Krawl to Ban Malicious IPs
+Krawl uses a reputation-based system to classify attacker IP addresses. Every five minutes, Krawl exports the identified malicious IPs to a `malicious_ips.txt` file.
+
+This file can either be mounted from the Docker container into another system or downloaded directly via `curl`:
+
+```bash
+curl https://your-krawl-instance/api/download/malicious_ips.txt
+```
+
+This file can be used to [update a set of firewall rules](https://www.allthingstech.ch/using-opnsense-and-ip-blocklists-to-block-malicious-traffic), for example on OPNsense, pfSense, or iptables, enabling automatic blocking of malicious IPs.
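As a minimal sketch of consuming the exported list, the snippet below turns each entry of `malicious_ips.txt` into an iptables `DROP` rule string; the one-IP-per-line format matches the file Krawl exports, while the exact rule layout is an illustrative assumption.

```python
def iptables_rules(blocklist_text: str) -> list[str]:
    """Build one iptables DROP rule per non-empty, non-comment line."""
    rules = []
    for line in blocklist_text.splitlines():
        ip = line.strip()
        if not ip or ip.startswith("#"):
            continue  # skip blanks and comments
        rules.append(f"iptables -I INPUT -s {ip} -j DROP")
    return rules

if __name__ == "__main__":
    sample = "203.0.113.7\n198.51.100.23\n"
    for rule in iptables_rules(sample):
        print(rule)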
+
+## IP Reputation
+Krawl [runs periodic tasks that analyze recent traffic](src/tasks/analyze_ips.py) to build and continuously update an IP reputation score. Each active IP address is evaluated against multiple behavioral indicators and classified as an attacker, crawler, or regular user. Thresholds are fully customizable.
+
+
+
+The analysis includes:
+- **Risky HTTP methods usage** (e.g. POST, PUT, DELETE ratios)
+- **Robots.txt violations**
+- **Request timing anomalies** (bursty or irregular patterns)
+- **User-Agent consistency**
+- **Attack URL detection** (e.g. SQL injection, XSS patterns)
+
+Each signal contributes to a weighted scoring model that assigns a reputation category:
+- `attacker`
+- `bad_crawler`
+- `good_crawler`
+- `regular_user`
+- `unknown` (for insufficient data)
+
+The resulting scores and metrics are stored in the database and used by Krawl to drive dashboards, reputation tracking, and automated mitigation actions such as IP banning or firewall integration.
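A weighted scoring model of this kind can be sketched as below; the signal names mirror the analyzer's indicators, but the weights, cut-off scores, and minimum-request rule are illustrative assumptions, not Krawl's actual values.

```python
# Illustrative weights per behavioral signal (assumed, not Krawl's real ones).
WEIGHTS = {
    "risky_method_ratio": 2.0,   # POST/PUT/DELETE share of requests
    "robots_violations": 3.0,    # hits on paths disallowed in robots.txt
    "timing_anomaly": 1.5,       # bursty or irregular request timing
    "user_agent_changes": 1.0,   # distinct User-Agents from one IP
    "attack_urls": 4.0,          # SQLi/XSS patterns in requested URLs
}

def classify(signals: dict[str, float], total_requests: int,
             declared_bot: bool = False) -> str:
    """Combine weighted signals into one reputation category."""
    if total_requests < 5:
        return "unknown"              # insufficient data to judge
    score = sum(WEIGHTS.get(name, 0.0) * value
                for name, value in signals.items())
    if score >= 5.0:
        return "attacker"
    if score >= 2.0:
        return "bad_crawler"
    if declared_bot:
        return "good_crawler"         # well-behaved, self-identified bot
    return "regular_user"
```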
+
+## Forward server header
+If Krawl is deployed behind a reverse proxy such as NGINX, the **server header** should be forwarded using the following configuration in your proxy:
+
+```nginx
+location / {
+ proxy_pass https://your-krawl-instance;
+ proxy_pass_header Server;
+}
+```
+
+## API
+Krawl relies on the following external APIs:
+- https://iprep.lcrawl.com (IP Reputation)
+- https://nominatim.openstreetmap.org/reverse (Reverse IP Lookup)
+- https://api.ipify.org (Public IP discovery)
+- http://ident.me (Public IP discovery)
+- https://ifconfig.me (Public IP discovery)
+
+## Configuration
+Krawl uses a **configuration hierarchy** in which **environment variables take precedence over the configuration file**. This approach is recommended for Docker deployments and quick out-of-the-box customization.
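The precedence rule can be sketched as a small resolver: an environment variable beats the value parsed from `config.yaml`, which beats the built-in default. The `KRAWL_*` naming follows the table below; the integer-coercion shortcut is a simplification of whatever per-option typing the real loader performs.

```python
import os

# Built-in fallbacks (values taken from the documented defaults).
DEFAULTS = {"port": 5000, "delay": 100, "max_counter": 10}

def resolve(name: str, file_config: dict):
    """Precedence: env var KRAWL_<NAME> > config.yaml value > default."""
    env_value = os.environ.get(f"KRAWL_{name.upper()}")
    if env_value is not None:
        # a real loader would coerce types per option; int() covers this sketch
        return int(env_value) if env_value.isdigit() else env_value
    if name in file_config:
        return file_config[name]
    return DEFAULTS[name]
```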
+
+### Configuration via Environment Variables
+
+| Environment Variable | Description | Default |
+|----------------------|-------------|---------|
+| `CONFIG_LOCATION` | Path to yaml config file | `config.yaml` |
+| `KRAWL_PORT` | Server listening port | `5000` |
+| `KRAWL_DELAY` | Response delay in milliseconds | `100` |
+| `KRAWL_SERVER_HEADER` | HTTP Server header for deception | `""` |
+| `KRAWL_LINKS_LENGTH_RANGE` | Link length range as `min,max` | `5,15` |
+| `KRAWL_LINKS_PER_PAGE_RANGE` | Links per page as `min,max` | `10,15` |
+| `KRAWL_CHAR_SPACE` | Characters used for link generation | `abcdefgh...` |
+| `KRAWL_MAX_COUNTER` | Initial counter value | `10` |
+| `KRAWL_CANARY_TOKEN_URL` | External canary token URL | None |
+| `KRAWL_CANARY_TOKEN_TRIES` | Requests before showing canary token | `10` |
+| `KRAWL_DASHBOARD_SECRET_PATH` | Custom dashboard path | Auto-generated |
+| `KRAWL_PROBABILITY_ERROR_CODES` | Error response probability (0-100%) | `0` |
+| `KRAWL_DATABASE_PATH` | Database file location | `data/krawl.db` |
+| `KRAWL_DATABASE_RETENTION_DAYS` | Days to retain data in database | `30` |
+| `KRAWL_HTTP_RISKY_METHODS_THRESHOLD` | Threshold for risky HTTP methods detection | `0.1` |
+| `KRAWL_VIOLATED_ROBOTS_THRESHOLD` | Threshold for robots.txt violations | `0.1` |
+| `KRAWL_UNEVEN_REQUEST_TIMING_THRESHOLD` | Coefficient of variation threshold for timing | `0.5` |
+| `KRAWL_UNEVEN_REQUEST_TIMING_TIME_WINDOW_SECONDS` | Time window for request timing analysis in seconds | `300` |
+| `KRAWL_USER_AGENTS_USED_THRESHOLD` | Threshold for detecting multiple user agents | `2` |
+| `KRAWL_ATTACK_URLS_THRESHOLD` | Threshold for attack URL detection | `1` |
+| `KRAWL_INFINITE_PAGES_FOR_MALICIOUS` | Serve infinite pages to malicious IPs | `true` |
+| `KRAWL_MAX_PAGES_LIMIT` | Maximum page limit for crawlers | `250` |
+| `KRAWL_BAN_DURATION_SECONDS` | Ban duration in seconds for rate-limited IPs | `600` |
+
+For example:
+
+```bash
+# Set canary token
+export CONFIG_LOCATION="config.yaml"
+export KRAWL_CANARY_TOKEN_URL="http://your-canary-token-url"
+
+# Set number of pages range (min,max format)
+export KRAWL_LINKS_PER_PAGE_RANGE="5,25"
+
+# Set analyzer thresholds
+export KRAWL_HTTP_RISKY_METHODS_THRESHOLD="0.2"
+export KRAWL_VIOLATED_ROBOTS_THRESHOLD="0.15"
+
+# Set custom dashboard path
+export KRAWL_DASHBOARD_SECRET_PATH="/my-secret-dashboard"
+```
+
+Example of a Docker run with env variables:
+
+```bash
+docker run -d \
+ -p 5000:5000 \
+ -e KRAWL_PORT=5000 \
+ -e KRAWL_DELAY=100 \
+ -e KRAWL_CANARY_TOKEN_URL="http://your-canary-token-url" \
+ --name krawl \
+ ghcr.io/blessedrebus/krawl:latest
+```
+
+### Configuration via config.yaml
+You can use the [config.yaml](config.yaml) file for more advanced configurations, such as Docker Compose or Helm chart deployments.
+
+# Honeypot
+Below is a complete overview of the Krawl honeypot’s capabilities.
+
+## robots.txt
+The actual (juicy) robots.txt configuration [is the following](src/templates/html/robots.txt).
+
+## Honeypot pages
+Requests to common admin endpoints (`/admin/`, `/wp-admin/`, `/phpMyAdmin/`) return a fake login page. Any login attempt triggers a 1-second delay to simulate real processing and is fully logged in the dashboard (credentials, IP, headers, timing).
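The fake-login behavior described above (delay about one second, record the attempt, always reject) can be sketched as follows; the handler signature and log structure are illustrative assumptions, not Krawl's internal API.

```python
import time

def handle_login(ip: str, username: str, password: str,
                 log: list, delay_s: float = 1.0) -> int:
    """Fake login handler: stall, log the submitted credentials, reject."""
    time.sleep(delay_s)                       # simulate real processing
    log.append({"ip": ip, "user": username, "pass": password})
    return 401                                # every attempt fails
```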
+
+
+
+
+Requests to paths like `/backup/`, `/config/`, `/database/`, `/private/`, or `/uploads/` return a fake directory listing populated with “interesting” files, each assigned a random file size to look realistic.
+
+
+
+The `.env` endpoint exposes fake database connection strings, **AWS API keys**, and **Stripe secrets**. It intentionally returns an error due to the `Content-Type` being `application/json` instead of plain text, mimicking a “juicy” misconfiguration that crawlers and scanners often flag as information leakage.
+
+The `/server` page displays randomly generated fake error information for each known server.
+
+
+
+The pages `/api/v1/users` and `/api/v2/secrets` show fake users and random secrets in JSON format
+
+
+
+The pages `/credentials.txt` and `/passwords.txt` show fake users and random secrets
+
+
+
+Pages such as `/users`, `/search`, `/contact`, `/info`, `/input`, and `/feedback`, along with APIs like `/api/sql` and `/api/database`, are designed to lure attackers into performing attacks such as **SQL injection** or **XSS**.
+
+
+
+Automated tools like **SQLMap** will receive a different randomized database error on each request, increasing scan noise and confusing the attacker. All detected attacks are logged and displayed in the dashboard.
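Randomized database errors of this kind can be produced by sampling from a pool of templates on each request, as sketched below; the error strings are invented examples in the style of common DBMS messages, not Krawl's actual responses.

```python
import random

# Invented error templates styled after common DBMS messages.
ERROR_TEMPLATES = [
    "ERROR 1064 (42000): You have an error in your SQL syntax near '{frag}'",
    "ORA-00933: SQL command not properly ended at '{frag}'",
    "PostgreSQL ERROR: syntax error at or near \"{frag}\"",
    "Warning: mysqli_query(): (HY000/1045): Access denied near '{frag}'",
]

def fake_sql_error(payload: str) -> str:
    """Pick a random template and echo a fragment of the attacker's input."""
    frag = payload[:20]
    return random.choice(ERROR_TEMPLATES).format(frag=frag)
```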
+
+## Customizing the Canary Token
+To create a custom canary token, visit https://canarytokens.org and generate a “Web bug” canary token.
+
+This optional token is triggered when a crawler fully traverses the spider trap until the page counter reaches 0. At that point, a URL is returned. When this URL is requested, it sends an alert to the user via email, including the visitor’s IP address and user agent.
+
+
+To enable this feature, set the canary token URL [using the environment variable](#configuration-via-environment-variables) `KRAWL_CANARY_TOKEN_URL`.
+
+## Customizing the wordlist
+
+Edit `wordlists.json` to customize fake data for your use case
+
+```json
+{
+ "usernames": {
+ "prefixes": ["admin", "root", "user"],
+ "suffixes": ["_prod", "_dev", "123"]
+ },
+ "passwords": {
+ "prefixes": ["P@ssw0rd", "Admin"],
+ "simple": ["test", "password"]
+ },
+ "directory_listing": {
+ "files": ["credentials.txt", "backup.sql"],
+ "directories": ["admin/", "backup/"]
+ }
+}
+```
+
+For Helm chart installations, customize the wordlists via **values.yaml** instead.
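As an illustration, the `usernames` section of such a wordlist can be expanded into fake credentials by crossing every prefix with every suffix; the combination logic here is an assumption for demonstration, not Krawl's exact generator.

```python
import itertools

# Sample wordlist mirroring the JSON structure shown above.
wordlists = {
    "usernames": {
        "prefixes": ["admin", "root", "user"],
        "suffixes": ["_prod", "_dev", "123"],
    },
}

def fake_usernames(wl: dict) -> list[str]:
    """Cross every username prefix with every suffix."""
    u = wl["usernames"]
    return [p + s for p, s in itertools.product(u["prefixes"], u["suffixes"])]
```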
+
+## Dashboard
+
+Access the dashboard at `http://<HOST>:<PORT>/<DASHBOARD_SECRET_PATH>`
+
+The dashboard shows:
+- Total and unique accesses
+- Suspicious activity and attack detection
+- Top IPs, paths, user-agents and GeoIP localization
+- Real-time monitoring
+
+The attackers’ access to the honeypot endpoint and related suspicious activities (such as failed login attempts) are logged.
+
+Krawl also implements a scoring system designed to distinguish between malicious and legitimate behavior on the website.
+
+
+
+The top IP addresses are shown along with the top paths and user agents
+
+
+
+
+
+## 🤝 Contributing
+
+Contributions welcome! Please:
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Submit a pull request (explain the changes!)
+
+
+
+
+## ⚠️ Disclaimer
+
+**This is a deception/honeypot system.**
+Deploy in isolated environments and monitor carefully for security events.
+Use responsibly and in compliance with applicable laws and regulations.
+
+## Star History
+
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 0000000..c29ebe4
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,46 @@
+# Krawl Honeypot Configuration
+
+server:
+ port: 5000
+ delay: 100 # Response delay in milliseconds
+
+ # manually set the server header; if null, a random one will be used.
+ server_header: null
+
+links:
+ min_length: 5
+ max_length: 15
+ min_per_page: 5
+ max_per_page: 10
+ char_space: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
+ max_counter: 10
+
+canary:
+ token_url: null # Optional canary token URL
+ token_tries: 10
+
+dashboard:
+ # if null, a random path will be auto-generated
+ # can be set to "/dashboard" or similar <-- note this MUST include a leading forward slash
+ # secret_path: /super-secret-dashboard-path
+ secret_path: null
+
+database:
+ path: "data/krawl.db"
+ retention_days: 30
+
+behavior:
+ probability_error_codes: 0 # 0-100 percentage
+
+analyzer:
+ http_risky_methods_threshold: 0.1
+ violated_robots_threshold: 0.1
+ uneven_request_timing_threshold: 0.5
+ uneven_request_timing_time_window_seconds: 300
+ user_agents_used_threshold: 2
+ attack_urls_threshold: 1
+
+crawl:
+ infinite_pages_for_malicious: true
+ max_pages_limit: 250
+ ban_duration_seconds: 600
\ No newline at end of file
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 1612864..233692b 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -1,5 +1,4 @@
-version: '3.8'
-
+---
services:
krawl:
build:
@@ -8,27 +7,26 @@ services:
container_name: krawl-server
ports:
- "5000:5000"
+ environment:
+ - CONFIG_LOCATION=config.yaml
+ # set this to change timezone, alternatively mount /etc/timezone or /etc/localtime based on the time system management of the host environment
+ # - TZ=${TZ}
volumes:
- ./wordlists.json:/app/wordlists.json:ro
- environment:
- - PORT=5000
- - DELAY=100
- - LINKS_MIN_LENGTH=5
- - LINKS_MAX_LENGTH=15
- - LINKS_MIN_PER_PAGE=10
- - LINKS_MAX_PER_PAGE=15
- - MAX_COUNTER=10
- - CANARY_TOKEN_TRIES=10
- - PROBABILITY_ERROR_CODES=0
- - SERVER_HEADER=Apache/2.2.22 (Ubuntu)
- # Optional: Set your canary token URL
- # - CANARY_TOKEN_URL=http://canarytokens.com/api/users/YOUR_TOKEN/passwords.txt
- # Optional: Set custom dashboard path (auto-generated if not set)
- # - DASHBOARD_SECRET_PATH=/my-secret-dashboard
+ - ./config.yaml:/app/config.yaml:ro
+ - ./logs:/app/logs
+ - ./exports:/app/exports
+ - data:/app/data
restart: unless-stopped
- healthcheck:
- test: ["CMD", "python3", "-c", "import requests; requests.get('http://localhost:5000')"]
- interval: 30s
- timeout: 5s
- retries: 3
- start_period: 10s
+ develop:
+ watch:
+ - path: ./Dockerfile
+ action: rebuild
+ - path: ./src/
+ action: sync+restart
+ target: /app/src
+ - path: ./docker-compose.yaml
+ action: rebuild
+
+volumes:
+ data:
diff --git a/docs/coding-guidelines.md b/docs/coding-guidelines.md
new file mode 100644
index 0000000..1e13575
--- /dev/null
+++ b/docs/coding-guidelines.md
@@ -0,0 +1,90 @@
+### Coding Standards
+
+**Style & Structure**
+- Prefer longer, explicit code over compact one-liners
+- Always include docstrings for functions/classes + inline comments
+- Strongly prefer OOP-style code (classes over functional/nested functions)
+- Strong typing throughout (dataclasses, TypedDict, Enums, type hints)
+- Value future-proofing and expanded usage insights
+
+**Data Design**
+- Use dataclasses for internal data modeling
+- Typed JSON structures
+- Functions return fully typed objects (no loose dicts)
+- Snapshot files in JSON or YAML
+- Human-readable fields (e.g., `sql_injection`, `xss_attempt`)
+
+**Templates & UI**
+- Don't mix large HTML/CSS blocks in Python code
+- Prefer Jinja templates for HTML rendering
+- Clean CSS, minimal inline clutter, readable template logic
+
+**Writing & Documentation**
+- Markdown documentation
+- Clear section headers
+- Roadmap/Phase/Feature-Session style documents
+
+**Logging**
+- Use the logging singleton found in `src/logger.py`
+- Setup logging at app start:
+ ```
+ initialize_logging()
+ app_logger = get_app_logger()
+ access_logger = get_access_logger()
+ credential_logger = get_credential_logger()
+ ```
+
+**Preferred Pip Packages**
+- API/Web Server: Simple Python
+- HTTP: Requests
+- SQLite: Sqlalchemy
+- Database Migrations: Alembic
+
+### Error Handling
+- Custom exception classes for domain-specific errors
+- Consistent error response formats (JSON structure)
+- Logging severity levels (ERROR vs WARNING)
+
+### Configuration
+- `.env` for secrets (never committed)
+- Maintain `.env.example` in each component for documentation
+- Typed config loaders using dataclasses
+- Validation on startup
+
+### Containerization & Deployment
+- Explicit Dockerfiles
+- Production-friendly hardening (distroless/slim when meaningful)
+- Use git branch as tag
+
+### Dependency Management
+- Use `requirements.txt` and virtual environments (`python3 -m venv venv`)
+- Use path `venv` for all virtual environments
+- Pin dependencies to version ranges (or exact versions when a specific release is required)
+- Activate venv before running code (unless in Docker)
+
+### Testing Standards
+- Manual testing preferred for applications
+- **tests:** Use shell scripts with curl/httpie for simulation and attack scripts.
+- tests should be located in `tests` directory
+
+### Git Standards
+
+**Branch Strategy:**
+- `master` - Production-ready code only
+- `beta` - Public pre-release testing
+- `dev` - Main development branch, integration point
+
+**Workflow:**
+- Feature work branches off `dev` (e.g., `feature/add-scheduler`)
+- Merge features back to `dev` for testing
+- Promote `dev` → `beta` for public testing (when applicable)
+- Promote `beta` (or `dev`) → `master` for production
+
+**Commit Messages:**
+- Use conventional commit format: `feat:`, `fix:`, `docs:`, `refactor:`, etc.
+- Keep commits atomic and focused
+- Write clear, descriptive messages
+
+**Tagging:**
+- Tag releases on `master` with semantic versioning (e.g., `v1.2.3`)
+- Optionally tag beta releases (e.g., `v1.2.3-beta.1`)
\ No newline at end of file
diff --git a/entrypoint.sh b/entrypoint.sh
new file mode 100644
index 0000000..fe3ef45
--- /dev/null
+++ b/entrypoint.sh
@@ -0,0 +1,8 @@
+#!/bin/sh
+set -e
+
+# Fix ownership of mounted directories
+chown -R krawl:krawl /app/logs /app/data /app/exports 2>/dev/null || true
+
+# Drop to krawl user and run the application
+exec gosu krawl "$@"
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 3fe5d8a..2e3ae94 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,8 +2,8 @@ apiVersion: v2
name: krawl-chart
description: A Helm chart for Krawl honeypot server
type: application
-version: 0.1.2
-appVersion: "1.0.0"
+version: 1.0.0
+appVersion: 1.0.0
keywords:
- honeypot
- security
@@ -13,3 +13,4 @@ maintainers:
home: https://github.com/blessedrebus/krawl
sources:
- https://github.com/blessedrebus/krawl
+icon: https://raw.githubusercontent.com/blessedrebus/krawl/main/img/krawl-svg.svg
\ No newline at end of file
diff --git a/helm/README.md b/helm/README.md
new file mode 100644
index 0000000..ae57261
--- /dev/null
+++ b/helm/README.md
@@ -0,0 +1,356 @@
+# Krawl Helm Chart
+
+A Helm chart for deploying the Krawl honeypot application on Kubernetes.
+
+## Prerequisites
+
+- Kubernetes 1.19+
+- Helm 3.0+
+- Persistent Volume provisioner (optional, for database persistence)
+
+## Installation
+
+
+### Helm Chart
+
+Install with default values:
+
+```bash
+helm install krawl oci://ghcr.io/blessedrebus/krawl-chart \
+ --version 1.0.0 \
+ --namespace krawl-system \
+ --create-namespace
+```
+
+Or create a minimal `values.yaml` file:
+
+```yaml
+service:
+ type: LoadBalancer
+ port: 5000
+
+timezone: "Europe/Rome"
+
+ingress:
+ enabled: true
+ className: "traefik"
+ hosts:
+ - host: krawl.example.com
+ paths:
+ - path: /
+ pathType: Prefix
+
+config:
+ server:
+ port: 5000
+ delay: 100
+ dashboard:
+ secret_path: null # Auto-generated if not set
+
+database:
+ persistence:
+ enabled: true
+ size: 1Gi
+```
+
+Install with custom values:
+
+```bash
+helm install krawl oci://ghcr.io/blessedrebus/krawl-chart \
+ --version 1.0.0 \
+ --namespace krawl-system \
+ --create-namespace \
+ -f values.yaml
+```
+
+To access the deception server:
+
+```bash
+kubectl get svc krawl -n krawl-system
+```
+
+Once the EXTERNAL-IP is assigned, access your deception server at `http://<EXTERNAL-IP>:5000`
+
+### Add the repository (if applicable)
+
+```bash
+helm repo add krawl https://github.com/BlessedRebuS/Krawl
+helm repo update
+```
+
+### Install from OCI Registry
+
+```bash
+helm install krawl oci://ghcr.io/blessedrebus/krawl-chart --version 1.0.0
+```
+
+Or with a specific namespace:
+
+```bash
+helm install krawl oci://ghcr.io/blessedrebus/krawl-chart --version 1.0.0 -n krawl --create-namespace
+```
+
+### Install the chart locally
+
+```bash
+helm install krawl ./helm
+```
+
+### Install with custom values
+
+```bash
+helm install krawl ./helm -f values.yaml
+```
+
+### Install in a specific namespace
+
+```bash
+helm install krawl ./helm -n krawl --create-namespace
+```
+
+## Configuration
+
+The following table lists the main configuration parameters of the Krawl chart and their default values.
+
+### Global Settings
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `replicaCount` | Number of pod replicas | `1` |
+| `image.repository` | Image repository | `ghcr.io/blessedrebus/krawl` |
+| `image.tag` | Image tag | `latest` |
+| `image.pullPolicy` | Image pull policy | `Always` |
+
+### Service Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `service.type` | Service type | `LoadBalancer` |
+| `service.port` | Service port | `5000` |
+| `service.externalTrafficPolicy` | External traffic policy | `Local` |
+
+### Ingress Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `ingress.enabled` | Enable ingress | `true` |
+| `ingress.className` | Ingress class name | `traefik` |
+| `ingress.hosts[0].host` | Ingress hostname | `krawl.example.com` |
+
+### Server Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.server.port` | Server port | `5000` |
+| `config.server.delay` | Response delay in milliseconds | `100` |
+| `config.server.timezone` | IANA timezone (e.g., "America/New_York") | `null` |
+
+### Links Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.links.min_length` | Minimum link length | `5` |
+| `config.links.max_length` | Maximum link length | `15` |
+| `config.links.min_per_page` | Minimum links per page | `10` |
+| `config.links.max_per_page` | Maximum links per page | `15` |
+| `config.links.char_space` | Character space for link generation | `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789` |
+| `config.links.max_counter` | Maximum counter value | `10` |
+
+### Canary Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.canary.token_url` | Canary token URL | `null` |
+| `config.canary.token_tries` | Number of canary token tries | `10` |
+
+### Dashboard Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.dashboard.secret_path` | Secret dashboard path (auto-generated if null) | `null` |
+
+### API Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.api.server_url` | API server URL | `null` |
+| `config.api.server_port` | API server port | `8080` |
+| `config.api.server_path` | API server path | `/api/v2/users` |
+
+### Database Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.database.path` | Database file path | `data/krawl.db` |
+| `config.database.retention_days` | Data retention in days | `30` |
+| `database.persistence.enabled` | Enable persistent volume | `true` |
+| `database.persistence.size` | Persistent volume size | `1Gi` |
+| `database.persistence.accessMode` | Access mode | `ReadWriteOnce` |
+
+### Behavior Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.behavior.probability_error_codes` | Error code probability (0-100) | `0` |
+
+### Analyzer Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.analyzer.http_risky_methods_threshold` | HTTP risky methods threshold | `0.1` |
+| `config.analyzer.violated_robots_threshold` | Violated robots.txt threshold | `0.1` |
+| `config.analyzer.uneven_request_timing_threshold` | Uneven request timing threshold | `0.5` |
+| `config.analyzer.uneven_request_timing_time_window_seconds` | Time window for request timing analysis | `300` |
+| `config.analyzer.user_agents_used_threshold` | User agents threshold | `2` |
+| `config.analyzer.attack_urls_threshold` | Attack URLs threshold | `1` |
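+
+These thresholds can be tuned per deployment. For instance, a sketch that widens the timing window and tolerates one extra user agent before flagging (values are illustrative):
+
+```yaml
+config:
+  analyzer:
+    uneven_request_timing_time_window_seconds: 600
+    user_agents_used_threshold: 3
+```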
+
+### Crawl Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `config.crawl.infinite_pages_for_malicious` | Infinite pages for malicious crawlers | `true` |
+| `config.crawl.max_pages_limit` | Maximum pages limit for legitimate crawlers | `250` |
+| `config.crawl.ban_duration_seconds` | IP ban duration in seconds | `600` |
+
+### Resource Limits
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `resources.limits.cpu` | CPU limit | `500m` |
+| `resources.limits.memory` | Memory limit | `256Mi` |
+| `resources.requests.cpu` | CPU request | `100m` |
+| `resources.requests.memory` | Memory request | `64Mi` |
+
+### Autoscaling
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `autoscaling.enabled` | Enable horizontal pod autoscaling | `false` |
+| `autoscaling.minReplicas` | Minimum replicas | `1` |
+| `autoscaling.maxReplicas` | Maximum replicas | `1` |
+| `autoscaling.targetCPUUtilizationPercentage` | Target CPU utilization | `70` |
+| `autoscaling.targetMemoryUtilizationPercentage` | Target memory utilization | `80` |
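+
+Note that the default `maxReplicas` of `1` means enabling autoscaling alone will not add pods; raise it explicitly (replica counts below are illustrative):
+
+```yaml
+autoscaling:
+  enabled: true
+  minReplicas: 1
+  maxReplicas: 3
+```
+
+With `database.persistence.accessMode` at its default of `ReadWriteOnce`, additional replicas can mount the database volume only while scheduled on the same node, so scaling beyond one pod may require a `ReadWriteMany`-capable storage class.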
+
+### Network Policy
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `networkPolicy.enabled` | Enable network policy | `true` |
+
+### Retrieving Dashboard Path
+
+When `config.dashboard.secret_path` is unset, Krawl generates a random dashboard path at startup. Retrieve it from the server logs:
+
+```bash
+kubectl logs -n krawl-system -l app.kubernetes.io/name=krawl
+```
+
+## Usage Examples
+
+### Basic Installation
+
+```bash
+helm install krawl ./helm
+```
+
+### Installation with Custom Domain
+
+```bash
+helm install krawl ./helm \
+ --set ingress.hosts[0].host=honeypot.example.com
+```
+
+### Enable Canary Tokens
+
+```bash
+helm install krawl ./helm \
+ --set config.canary.token_url=https://canarytokens.com/your-token
+```
+
+### Configure Custom API Endpoint
+
+```bash
+helm install krawl ./helm \
+ --set config.api.server_url=https://api.example.com \
+ --set config.api.server_port=443
+```
+
+### Create Values Override File
+
+Create `custom-values.yaml`:
+
+```yaml
+config:
+ server:
+ port: 8080
+ delay: 500
+ canary:
+ token_url: https://your-canary-token-url
+ dashboard:
+ secret_path: /super-secret-path
+ crawl:
+ max_pages_limit: 500
+ ban_duration_seconds: 3600
+```
+
+Then install:
+
+```bash
+helm install krawl ./helm -f custom-values.yaml
+```
+
+## Upgrading
+
+```bash
+helm upgrade krawl ./helm
+```
+
+## Uninstalling
+
+```bash
+helm uninstall krawl
+```
+
+## Troubleshooting
+
+### Check chart syntax
+
+```bash
+helm lint ./helm
+```
+
+### Dry run to verify values
+
+```bash
+helm install krawl ./helm --dry-run --debug
+```
+
+### Check deployed configuration
+
+```bash
+kubectl get configmap krawl-config -o yaml
+```
+
+### View pod logs
+
+```bash
+kubectl logs -l app.kubernetes.io/name=krawl
+```
+
+## Chart Files
+
+- `Chart.yaml` - Chart metadata
+- `values.yaml` - Default configuration values
+- `templates/` - Kubernetes resource templates
+ - `deployment.yaml` - Krawl deployment
+ - `service.yaml` - Service configuration
+ - `configmap.yaml` - Application configuration
+ - `pvc.yaml` - Persistent volume claim
+ - `ingress.yaml` - Ingress configuration
+ - `hpa.yaml` - Horizontal pod autoscaler
+ - `network-policy.yaml` - Network policies
+
+## Support
+
+For issues and questions, please visit the [Krawl GitHub repository](https://github.com/BlessedRebuS/Krawl).
diff --git a/helm/templates/configmap.yaml b/helm/templates/configmap.yaml
index c50ab75..f81d319 100644
--- a/helm/templates/configmap.yaml
+++ b/helm/templates/configmap.yaml
@@ -5,14 +5,36 @@ metadata:
labels:
{{- include "krawl.labels" . | nindent 4 }}
data:
- PORT: {{ .Values.config.port | quote }}
- DELAY: {{ .Values.config.delay | quote }}
- LINKS_MIN_LENGTH: {{ .Values.config.linksMinLength | quote }}
- LINKS_MAX_LENGTH: {{ .Values.config.linksMaxLength | quote }}
- LINKS_MIN_PER_PAGE: {{ .Values.config.linksMinPerPage | quote }}
- LINKS_MAX_PER_PAGE: {{ .Values.config.linksMaxPerPage | quote }}
- MAX_COUNTER: {{ .Values.config.maxCounter | quote }}
- CANARY_TOKEN_TRIES: {{ .Values.config.canaryTokenTries | quote }}
- PROBABILITY_ERROR_CODES: {{ .Values.config.probabilityErrorCodes | quote }}
- SERVER_HEADER: {{ .Values.config.serverHeader | quote }}
- CANARY_TOKEN_URL: {{ .Values.config.canaryTokenUrl | quote }}
+ config.yaml: |
+ # Krawl Honeypot Configuration
+ server:
+ port: {{ .Values.config.server.port }}
+ delay: {{ .Values.config.server.delay }}
+ links:
+ min_length: {{ .Values.config.links.min_length }}
+ max_length: {{ .Values.config.links.max_length }}
+ min_per_page: {{ .Values.config.links.min_per_page }}
+ max_per_page: {{ .Values.config.links.max_per_page }}
+ char_space: {{ .Values.config.links.char_space | quote }}
+ max_counter: {{ .Values.config.links.max_counter }}
+ canary:
+ token_url: {{ .Values.config.canary.token_url | toYaml }}
+ token_tries: {{ .Values.config.canary.token_tries }}
+ dashboard:
+ secret_path: {{ .Values.config.dashboard.secret_path | toYaml }}
+ database:
+ path: {{ .Values.config.database.path | quote }}
+ retention_days: {{ .Values.config.database.retention_days }}
+ behavior:
+ probability_error_codes: {{ .Values.config.behavior.probability_error_codes }}
+ analyzer:
+ http_risky_methods_threshold: {{ .Values.config.analyzer.http_risky_methods_threshold }}
+ violated_robots_threshold: {{ .Values.config.analyzer.violated_robots_threshold }}
+ uneven_request_timing_threshold: {{ .Values.config.analyzer.uneven_request_timing_threshold }}
+ uneven_request_timing_time_window_seconds: {{ .Values.config.analyzer.uneven_request_timing_time_window_seconds }}
+ user_agents_used_threshold: {{ .Values.config.analyzer.user_agents_used_threshold }}
+ attack_urls_threshold: {{ .Values.config.analyzer.attack_urls_threshold }}
+ crawl:
+ infinite_pages_for_malicious: {{ .Values.config.crawl.infinite_pages_for_malicious }}
+ max_pages_limit: {{ .Values.config.crawl.max_pages_limit }}
+ ban_duration_seconds: {{ .Values.config.crawl.ban_duration_seconds }}
diff --git a/helm/templates/deployment.yaml b/helm/templates/deployment.yaml
index b0aeb6d..f24261c 100644
--- a/helm/templates/deployment.yaml
+++ b/helm/templates/deployment.yaml
@@ -38,30 +38,49 @@ spec:
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
- containerPort: {{ .Values.config.port }}
+ containerPort: {{ .Values.config.server.port }}
protocol: TCP
- envFrom:
- - configMapRef:
- name: {{ include "krawl.fullname" . }}-config
env:
- - name: DASHBOARD_SECRET_PATH
- valueFrom:
- secretKeyRef:
- name: {{ include "krawl.fullname" . }}
- key: dashboard-path
+ - name: CONFIG_LOCATION
+ value: "config.yaml"
+ {{- if .Values.timezone }}
+ - name: TZ
+ value: {{ .Values.timezone | quote }}
+ {{- end }}
volumeMounts:
+ - name: config
+ mountPath: /app/config.yaml
+ subPath: config.yaml
+ readOnly: true
- name: wordlists
mountPath: /app/wordlists.json
subPath: wordlists.json
readOnly: true
+ {{- if .Values.database.persistence.enabled }}
+ - name: database
+ mountPath: /app/data
+ {{- end }}
{{- with .Values.resources }}
resources:
{{- toYaml . | nindent 12 }}
{{- end }}
volumes:
+ - name: config
+ configMap:
+ name: {{ include "krawl.fullname" . }}-config
- name: wordlists
configMap:
name: {{ include "krawl.fullname" . }}-wordlists
+ {{- if .Values.database.persistence.enabled }}
+ - name: database
+ {{- if .Values.database.persistence.existingClaim }}
+ persistentVolumeClaim:
+ claimName: {{ .Values.database.persistence.existingClaim }}
+ {{- else }}
+ persistentVolumeClaim:
+ claimName: {{ include "krawl.fullname" . }}-db
+ {{- end }}
+ {{- end }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
diff --git a/helm/templates/pvc.yaml b/helm/templates/pvc.yaml
new file mode 100644
index 0000000..ec73af2
--- /dev/null
+++ b/helm/templates/pvc.yaml
@@ -0,0 +1,17 @@
+{{- if and .Values.database.persistence.enabled (not .Values.database.persistence.existingClaim) }}
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: {{ include "krawl.fullname" . }}-db
+ labels:
+ {{- include "krawl.labels" . | nindent 4 }}
+spec:
+ accessModes:
+ - {{ .Values.database.persistence.accessMode }}
+ {{- if .Values.database.persistence.storageClassName }}
+ storageClassName: {{ .Values.database.persistence.storageClassName }}
+ {{- end }}
+ resources:
+ requests:
+ storage: {{ .Values.database.persistence.size }}
+{{- end }}
diff --git a/helm/templates/secret.yaml b/helm/templates/secret.yaml
deleted file mode 100644
index 798289c..0000000
--- a/helm/templates/secret.yaml
+++ /dev/null
@@ -1,16 +0,0 @@
-{{- $secret := (lookup "v1" "Secret" .Release.Namespace (include "krawl.fullname" .)) -}}
-{{- $dashboardPath := "" -}}
-{{- if and $secret $secret.data -}}
- {{- $dashboardPath = index $secret.data "dashboard-path" | b64dec -}}
-{{- else -}}
- {{- $dashboardPath = printf "/%s" (randAlphaNum 32) -}}
-{{- end -}}
-apiVersion: v1
-kind: Secret
-metadata:
- name: {{ include "krawl.fullname" . }}
- labels:
- {{- include "krawl.labels" . | nindent 4 }}
-type: Opaque
-stringData:
- dashboard-path: {{ $dashboardPath | quote }}
diff --git a/helm/values.yaml b/helm/values.yaml
index a095632..fb9be82 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -3,7 +3,7 @@ replicaCount: 1
image:
repository: ghcr.io/blessedrebus/krawl
pullPolicy: Always
- tag: "latest"
+ tag: "1.0.0"
imagePullSecrets: []
nameOverride: "krawl"
@@ -49,6 +49,11 @@ resources:
cpu: 100m
memory: 64Mi
+# Container timezone configuration
+# Set this to change timezone (e.g., "America/New_York", "Europe/Rome")
+# If not set, container will use its default timezone
+timezone: ""
+
autoscaling:
enabled: false
minReplicas: 1
@@ -62,19 +67,53 @@ tolerations: []
affinity: {}
-# Application configuration
+# Application configuration (config.yaml structure)
config:
- port: 5000
- delay: 100
- linksMinLength: 5
- linksMaxLength: 15
- linksMinPerPage: 10
- linksMaxPerPage: 15
- maxCounter: 10
- canaryTokenTries: 10
- probabilityErrorCodes: 0
- serverHeader: "Apache/2.2.22 (Ubuntu)"
-# canaryTokenUrl: set-your-canary-token-url-here
+ server:
+ port: 5000
+ delay: 100
+ links:
+ min_length: 5
+ max_length: 15
+ min_per_page: 10
+ max_per_page: 15
+ char_space: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
+ max_counter: 10
+ canary:
+ token_url: null # Set your canary token URL here
+ token_tries: 10
+ dashboard:
+ secret_path: null # Auto-generated if not set, or set to "/my-secret-dashboard"
+ database:
+ path: "data/krawl.db"
+ retention_days: 30
+ behavior:
+ probability_error_codes: 0
+ analyzer:
+ http_risky_methods_threshold: 0.1
+ violated_robots_threshold: 0.1
+ uneven_request_timing_threshold: 0.5
+ uneven_request_timing_time_window_seconds: 300
+ user_agents_used_threshold: 2
+ attack_urls_threshold: 1
+ crawl:
+ infinite_pages_for_malicious: true
+ max_pages_limit: 250
+ ban_duration_seconds: 600
+
+# Database persistence configuration
+database:
+ # Persistence configuration
+ persistence:
+ enabled: true
+ # Storage class name (use default if not specified)
+ # storageClassName: ""
+ # Access mode for the persistent volume
+ accessMode: ReadWriteOnce
+ # Size of the persistent volume
+ size: 1Gi
+ # Optional: Use existing PVC
+ # existingClaim: ""
networkPolicy:
enabled: true
@@ -268,6 +307,17 @@ wordlists:
- .git/
- keys/
- credentials/
+ server_headers:
+ - Apache/2.2.22 (Ubuntu)
+ - nginx/1.18.0
+ - Microsoft-IIS/10.0
+ - LiteSpeed
+ - Caddy
+ - Gunicorn/20.0.4
+ - uvicorn/0.13.4
+ - Express
+ - Flask/1.1.2
+ - Django/3.1
error_codes:
- 400
- 401
diff --git a/img/admin-page.png b/img/admin-page.png
index ba82843..790e3c3 100644
Binary files a/img/admin-page.png and b/img/admin-page.png differ
diff --git a/img/api-secrets-page.png b/img/api-secrets-page.png
deleted file mode 100644
index 77b47c8..0000000
Binary files a/img/api-secrets-page.png and /dev/null differ
diff --git a/img/api-users-page.png b/img/api-users-page.png
deleted file mode 100644
index 6746594..0000000
Binary files a/img/api-users-page.png and /dev/null differ
diff --git a/img/credentials-and-passwords.png b/img/credentials-and-passwords.png
new file mode 100644
index 0000000..acb134a
Binary files /dev/null and b/img/credentials-and-passwords.png differ
diff --git a/img/credentials-page.png b/img/credentials-page.png
deleted file mode 100644
index bc3fffa..0000000
Binary files a/img/credentials-page.png and /dev/null differ
diff --git a/img/dashboard-1.png b/img/dashboard-1.png
index ad11dd8..4479914 100644
Binary files a/img/dashboard-1.png and b/img/dashboard-1.png differ
diff --git a/img/dashboard-2.png b/img/dashboard-2.png
index 65c0766..e6a208d 100644
Binary files a/img/dashboard-2.png and b/img/dashboard-2.png differ
diff --git a/img/dashboard-3.png b/img/dashboard-3.png
new file mode 100644
index 0000000..e7b24df
Binary files /dev/null and b/img/dashboard-3.png differ
diff --git a/img/env-page.png b/img/env-page.png
deleted file mode 100644
index a738732..0000000
Binary files a/img/env-page.png and /dev/null differ
diff --git a/img/geoip_dashboard.png b/img/geoip_dashboard.png
new file mode 100644
index 0000000..6825be7
Binary files /dev/null and b/img/geoip_dashboard.png differ
diff --git a/img/ip-reputation.png b/img/ip-reputation.png
new file mode 100644
index 0000000..9119e63
Binary files /dev/null and b/img/ip-reputation.png differ
diff --git a/img/krawl-svg.svg b/img/krawl-svg.svg
new file mode 100644
index 0000000..2d15e51
--- /dev/null
+++ b/img/krawl-svg.svg
@@ -0,0 +1,95 @@
+
+
+
+
diff --git a/img/passwords-page.png b/img/passwords-page.png
deleted file mode 100644
index c9ca2f0..0000000
Binary files a/img/passwords-page.png and /dev/null differ
diff --git a/img/server-and-env-page.png b/img/server-and-env-page.png
new file mode 100644
index 0000000..700c39d
Binary files /dev/null and b/img/server-and-env-page.png differ
diff --git a/img/sql_injection.png b/img/sql_injection.png
new file mode 100644
index 0000000..8eb8ad3
Binary files /dev/null and b/img/sql_injection.png differ
diff --git a/img/users-and-secrets.png b/img/users-and-secrets.png
new file mode 100644
index 0000000..f99297e
Binary files /dev/null and b/img/users-and-secrets.png differ
diff --git a/kubernetes/README.md b/kubernetes/README.md
new file mode 100644
index 0000000..d803496
--- /dev/null
+++ b/kubernetes/README.md
@@ -0,0 +1,47 @@
+### Kubernetes
+
+Apply all manifests with:
+
+```bash
+kubectl apply -f https://raw.githubusercontent.com/BlessedRebuS/Krawl/refs/heads/main/kubernetes/krawl-all-in-one-deploy.yaml
+```
+
+Or clone the repo and apply the manifest:
+
+```bash
+kubectl apply -f kubernetes/krawl-all-in-one-deploy.yaml
+```
+
+Access the deception server:
+
+```bash
+kubectl get svc krawl -n krawl-system
+```
+
+Once the EXTERNAL-IP is assigned, access your deception server at `http://<EXTERNAL-IP>:5000`.
+
+### Retrieving Dashboard Path
+
+When `dashboard.secret_path` is left unset in the ConfigMap, the dashboard path is auto-generated at startup. Retrieve it from the server logs:
+
+```bash
+kubectl logs -n krawl-system -l app.kubernetes.io/name=krawl
+```
+
+### From Source (Python 3.11+)
+
+Clone the repository:
+
+```bash
+git clone https://github.com/blessedrebus/krawl.git
+cd krawl/src
+```
+
+Run the server:
+
+```bash
+python3 server.py
+```
+
+Visit `http://localhost:5000` and access the dashboard at `http://localhost:5000/<dashboard-path>`, where the path is printed in the server startup logs.
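+
+The server reads its settings from the file named by the `CONFIG_LOCATION` environment variable (the Kubernetes manifests set it to `config.yaml`). A minimal local override, using keys from the chart's ConfigMap, might look like:
+
+```yaml
+server:
+  port: 5000
+  delay: 100
+dashboard:
+  secret_path: /my-secret-dashboard
+```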
diff --git a/kubernetes/krawl-all-in-one-deploy.yaml b/kubernetes/krawl-all-in-one-deploy.yaml
index 0362220..767c080 100644
--- a/kubernetes/krawl-all-in-one-deploy.yaml
+++ b/kubernetes/krawl-all-in-one-deploy.yaml
@@ -4,369 +4,226 @@ kind: Namespace
metadata:
name: krawl-system
---
+# Source: krawl-chart/templates/network-policy.yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+ name: krawl
+ namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
+spec:
+ podSelector:
+ matchLabels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ policyTypes:
+ - Ingress
+ - Egress
+ ingress:
+ - from:
+ - podSelector: {}
+ - namespaceSelector: {}
+ - ipBlock:
+ cidr: 0.0.0.0/0
+ ports:
+ - port: 5000
+ protocol: TCP
+ egress:
+ - ports:
+ - protocol: TCP
+ - protocol: UDP
+ to:
+ - namespaceSelector: {}
+ - ipBlock:
+ cidr: 0.0.0.0/0
+---
+# Source: krawl-chart/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: krawl-config
namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
data:
- PORT: "5000"
- DELAY: "100"
- LINKS_MIN_LENGTH: "5"
- LINKS_MAX_LENGTH: "15"
- LINKS_MIN_PER_PAGE: "10"
- LINKS_MAX_PER_PAGE: "15"
- MAX_COUNTER: "10"
- CANARY_TOKEN_TRIES: "10"
- PROBABILITY_ERROR_CODES: "0"
-# CANARY_TOKEN_URL: set-your-canary-token-url-here
+ config.yaml: |
+ # Krawl Honeypot Configuration
+ server:
+ port: 5000
+ delay: 100
+ links:
+ min_length: 5
+ max_length: 15
+ min_per_page: 10
+ max_per_page: 15
+ char_space: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
+ max_counter: 10
+ canary:
+ token_url: null
+ token_tries: 10
+ dashboard:
+ secret_path: null
+ database:
+ path: "data/krawl.db"
+ retention_days: 30
+ behavior:
+ probability_error_codes: 0
+ analyzer:
+ http_risky_methods_threshold: 0.1
+ violated_robots_threshold: 0.1
+ uneven_request_timing_threshold: 0.5
+ uneven_request_timing_time_window_seconds: 300
+ user_agents_used_threshold: 2
+ attack_urls_threshold: 1
+ crawl:
+ infinite_pages_for_malicious: true
+ max_pages_limit: 250
+ ban_duration_seconds: 600
---
+# Source: krawl-chart/templates/wordlists-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: krawl-wordlists
namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
data:
wordlists.json: |
- {
- "usernames": {
- "prefixes": [
- "admin",
- "user",
- "developer",
- "root",
- "system",
- "db",
- "api",
- "service",
- "deploy",
- "test",
- "prod",
- "backup",
- "monitor",
- "jenkins",
- "webapp"
- ],
- "suffixes": [
- "",
- "_prod",
- "_dev",
- "_test",
- "123",
- "2024",
- "_backup",
- "_admin",
- "01",
- "02",
- "_user",
- "_service",
- "_api"
- ]
- },
- "passwords": {
- "prefixes": [
- "P@ssw0rd",
- "Passw0rd",
- "Admin",
- "Secret",
- "Welcome",
- "System",
- "Database",
- "Secure",
- "Master",
- "Root"
- ],
- "simple": [
- "test",
- "demo",
- "temp",
- "change",
- "password",
- "admin",
- "letmein",
- "welcome",
- "default",
- "sample"
- ]
- },
- "emails": {
- "domains": [
- "example.com",
- "company.com",
- "localhost.com",
- "test.com",
- "domain.com",
- "corporate.com",
- "internal.net",
- "enterprise.com",
- "business.org"
- ]
- },
- "api_keys": {
- "prefixes": [
- "sk_live_",
- "sk_test_",
- "api_",
- "key_",
- "token_",
- "access_",
- "secret_",
- "prod_",
- ""
- ]
- },
- "databases": {
- "names": [
- "production",
- "prod_db",
- "main_db",
- "app_database",
- "users_db",
- "customer_data",
- "analytics",
- "staging_db",
- "dev_database",
- "wordpress",
- "ecommerce",
- "crm_db",
- "inventory"
- ],
- "hosts": [
- "localhost",
- "db.internal",
- "mysql.local",
- "postgres.internal",
- "127.0.0.1",
- "db-server-01",
- "database.prod",
- "sql.company.com"
- ]
- },
- "applications": {
- "names": [
- "WebApp",
- "API Gateway",
- "Dashboard",
- "Admin Panel",
- "CMS",
- "Portal",
- "Manager",
- "Console",
- "Control Panel",
- "Backend"
- ]
- },
- "users": {
- "roles": [
- "Administrator",
- "Developer",
- "Manager",
- "User",
- "Guest",
- "Moderator",
- "Editor",
- "Viewer",
- "Analyst",
- "Support"
- ]
- },
- "directory_listing": {
- "files": [
- "admin.txt",
- "test.exe",
- "backup.sql",
- "database.sql",
- "db_backup.sql",
- "dump.sql",
- "config.php",
- "credentials.txt",
- "passwords.txt",
- "users.csv",
- ".env",
- "id_rsa",
- "id_rsa.pub",
- "private_key.pem",
- "api_keys.json",
- "secrets.yaml",
- "admin_notes.txt",
- "settings.ini",
- "database.yml",
- "wp-config.php",
- ".htaccess",
- "server.key",
- "cert.pem",
- "shadow.bak",
- "passwd.old"
- ],
- "directories": [
- "uploads/",
- "backups/",
- "logs/",
- "temp/",
- "cache/",
- "private/",
- "config/",
- "admin/",
- "database/",
- "backup/",
- "old/",
- "archive/",
- ".git/",
- "keys/",
- "credentials/"
- ]
- },
- "error_codes": [
- 400,
- 401,
- 403,
- 404,
- 500,
- 502,
- 503
- ]
- }
+    {"api_keys":{"prefixes":["sk_live_","sk_test_","api_","key_","token_","access_","secret_","prod_",""]},"applications":{"names":["WebApp","API Gateway","Dashboard","Admin Panel","CMS","Portal","Manager","Console","Control Panel","Backend"]},"databases":{"hosts":["localhost","db.internal","mysql.local","postgres.internal","127.0.0.1","db-server-01","database.prod","sql.company.com"],"names":["production","prod_db","main_db","app_database","users_db","customer_data","analytics","staging_db","dev_database","wordpress","ecommerce","crm_db","inventory"]},"directory_listing":{"directories":["uploads/","backups/","logs/","temp/","cache/","private/","config/","admin/","database/","backup/","old/","archive/",".git/","keys/","credentials/"],"files":["admin.txt","test.exe","backup.sql","database.sql","db_backup.sql","dump.sql","config.php","credentials.txt","passwords.txt","users.csv",".env","id_rsa","id_rsa.pub","private_key.pem","api_keys.json","secrets.yaml","admin_notes.txt","settings.ini","database.yml","wp-config.php",".htaccess","server.key","cert.pem","shadow.bak","passwd.old"]},"emails":{"domains":["example.com","company.com","localhost.com","test.com","domain.com","corporate.com","internal.net","enterprise.com","business.org"]},"error_codes":[400,401,403,404,500,502,503],"passwords":{"prefixes":["P@ssw0rd","Passw0rd","Admin","Secret","Welcome","System","Database","Secure","Master","Root"],"simple":["test","demo","temp","change","password","admin","letmein","welcome","default","sample"]},"server_headers":["Apache/2.2.22 (Ubuntu)","nginx/1.18.0","Microsoft-IIS/10.0","LiteSpeed","Caddy","Gunicorn/20.0.4","uvicorn/0.13.4","Express","Flask/1.1.2","Django/3.1"],"usernames":{"prefixes":["admin","user","developer","root","system","db","api","service","deploy","test","prod","backup","monitor","jenkins","webapp"],"suffixes":["","_prod","_dev","_test","123","2024","_backup","_admin","01","02","_user","_service","_api"]},"users":{"roles":["Administrator","Developer","Manager","User","Guest","Moderator","Editor","Viewer","Analyst","Support"]}}
---
+# Source: krawl-chart/templates/pvc.yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: krawl-db
+ namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
+spec:
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 1Gi
+---
+# Source: krawl-chart/templates/service.yaml
+apiVersion: v1
+kind: Service
+metadata:
+ name: krawl
+ namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
+spec:
+ type: LoadBalancer
+ externalTrafficPolicy: Local
+ sessionAffinity: ClientIP
+ sessionAffinityConfig:
+ clientIP:
+ timeoutSeconds: 10800
+ ports:
+ - port: 5000
+ targetPort: http
+ protocol: TCP
+ name: http
+ selector:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+---
+# Source: krawl-chart/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
- name: krawl-server
+ name: krawl
namespace: krawl-system
labels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
replicas: 1
selector:
matchLabels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
template:
metadata:
labels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
spec:
containers:
- - name: krawl
- image: ghcr.io/blessedrebus/krawl:latest
+ - name: krawl-chart
+ image: "ghcr.io/blessedrebus/krawl:1.0.0"
imagePullPolicy: Always
ports:
- - containerPort: 5000
- name: http
+ - name: http
+ containerPort: 5000
protocol: TCP
- envFrom:
- - configMapRef:
- name: krawl-config
+ env:
+ - name: CONFIG_LOCATION
+ value: "config.yaml"
volumeMounts:
+ - name: config
+ mountPath: /app/config.yaml
+ subPath: config.yaml
+ readOnly: true
- name: wordlists
mountPath: /app/wordlists.json
subPath: wordlists.json
readOnly: true
+ - name: database
+ mountPath: /app/data
resources:
- requests:
- memory: "64Mi"
- cpu: "100m"
- limits:
- memory: "256Mi"
- cpu: "500m"
+ limits:
+ cpu: 500m
+ memory: 256Mi
+ requests:
+ cpu: 100m
+ memory: 64Mi
volumes:
+ - name: config
+ configMap:
+ name: krawl-config
- name: wordlists
configMap:
name: krawl-wordlists
+ - name: database
+ persistentVolumeClaim:
+ claimName: krawl-db
---
-apiVersion: v1
-kind: Service
-metadata:
- name: krawl-server
- namespace: krawl-system
- labels:
- app: krawl-server
-spec:
- type: LoadBalancer
- ports:
- - port: 5000
- targetPort: 5000
- protocol: TCP
- name: http
- selector:
- app: krawl-server
----
+# Source: krawl-chart/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
- name: krawl-ingress
+ name: krawl
namespace: krawl-system
- annotations:
- nginx.ingress.kubernetes.io/rewrite-target: /
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
- ingressClassName: nginx
+ ingressClassName: traefik
rules:
- - host: krawl.example.com # Change to your domain
- http:
- paths:
- - path: /
- pathType: Prefix
- backend:
- service:
- name: krawl-server
- port:
- number: 5000
- # tls:
- # - hosts:
- # - krawl.example.com
- # secretName: krawl-tls
----
-apiVersion: networking.k8s.io/v1
-kind: NetworkPolicy
-metadata:
- name: krawl-network-policy
- namespace: krawl-system
-spec:
- podSelector:
- matchLabels:
- app: krawl-server
- policyTypes:
- - Ingress
- - Egress
- ingress:
- - from:
- - podSelector: {}
- - namespaceSelector: {}
- - ipBlock:
- cidr: 0.0.0.0/0
- ports:
- - protocol: TCP
- port: 5000
- egress:
- - to:
- - namespaceSelector: {}
- - ipBlock:
- cidr: 0.0.0.0/0
- ports:
- - protocol: TCP
- - protocol: UDP
----
-# Optional: HorizontalPodAutoscaler for auto-scaling
-apiVersion: autoscaling/v2
-kind: HorizontalPodAutoscaler
-metadata:
- name: krawl-hpa
- namespace: krawl-system
-spec:
- scaleTargetRef:
- apiVersion: apps/v1
- kind: Deployment
- name: krawl-server
- minReplicas: 1
- maxReplicas: 5
- metrics:
- - type: Resource
- resource:
- name: cpu
- target:
- type: Utilization
- averageUtilization: 70
- - type: Resource
- resource:
- name: memory
- target:
- type: Utilization
- averageUtilization: 80
+ - host: "krawl.example.com"
+ http:
+ paths:
+ - path: /
+ pathType: Prefix
+ backend:
+ service:
+ name: krawl
+ port:
+ number: 5000
diff --git a/kubernetes/manifests/configmap.yaml b/kubernetes/manifests/configmap.yaml
index 431b9a3..cdf6f1b 100644
--- a/kubernetes/manifests/configmap.yaml
+++ b/kubernetes/manifests/configmap.yaml
@@ -1,17 +1,44 @@
+# Source: krawl-chart/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: krawl-config
namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
data:
- PORT: "5000"
- DELAY: "100"
- LINKS_MIN_LENGTH: "5"
- LINKS_MAX_LENGTH: "15"
- LINKS_MIN_PER_PAGE: "10"
- LINKS_MAX_PER_PAGE: "15"
- MAX_COUNTER: "10"
- CANARY_TOKEN_TRIES: "10"
- PROBABILITY_ERROR_CODES: "0"
- SERVER_HEADER: "Apache/2.2.22 (Ubuntu)"
-# CANARY_TOKEN_URL: set-your-canary-token-url-here
\ No newline at end of file
+ config.yaml: |
+ # Krawl Honeypot Configuration
+ server:
+ port: 5000
+ delay: 100
+ links:
+ min_length: 5
+ max_length: 15
+ min_per_page: 10
+ max_per_page: 15
+ char_space: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
+ max_counter: 10
+ canary:
+ token_url: null
+ token_tries: 10
+ dashboard:
+ secret_path: null
+ database:
+ path: "data/krawl.db"
+ retention_days: 30
+ behavior:
+ probability_error_codes: 0
+ analyzer:
+ http_risky_methods_threshold: 0.1
+ violated_robots_threshold: 0.1
+ uneven_request_timing_threshold: 0.5
+ uneven_request_timing_time_window_seconds: 300
+ user_agents_used_threshold: 2
+ attack_urls_threshold: 1
+ crawl:
+ infinite_pages_for_malicious: true
+ max_pages_limit: 250
+ ban_duration_seconds: 600
diff --git a/kubernetes/manifests/deployment.yaml b/kubernetes/manifests/deployment.yaml
index 0552eba..4c87a73 100644
--- a/kubernetes/manifests/deployment.yaml
+++ b/kubernetes/manifests/deployment.yaml
@@ -1,44 +1,61 @@
+# Source: krawl-chart/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
- name: krawl-server
+ name: krawl
namespace: krawl-system
labels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
replicas: 1
selector:
matchLabels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
template:
metadata:
labels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
spec:
containers:
- - name: krawl
- image: ghcr.io/blessedrebus/krawl:latest
+ - name: krawl-chart
+ image: "ghcr.io/blessedrebus/krawl:1.0.0"
imagePullPolicy: Always
ports:
- - containerPort: 5000
- name: http
+ - name: http
+ containerPort: 5000
protocol: TCP
- envFrom:
- - configMapRef:
- name: krawl-config
+ env:
+ - name: CONFIG_LOCATION
+ value: "config.yaml"
volumeMounts:
+ - name: config
+ mountPath: /app/config.yaml
+ subPath: config.yaml
+ readOnly: true
- name: wordlists
mountPath: /app/wordlists.json
subPath: wordlists.json
readOnly: true
+ - name: database
+ mountPath: /app/data
resources:
- requests:
- memory: "64Mi"
- cpu: "100m"
- limits:
- memory: "256Mi"
- cpu: "500m"
+ limits:
+ cpu: 500m
+ memory: 256Mi
+ requests:
+ cpu: 100m
+ memory: 64Mi
volumes:
+ - name: config
+ configMap:
+ name: krawl-config
- name: wordlists
configMap:
name: krawl-wordlists
+ - name: database
+ persistentVolumeClaim:
+ claimName: krawl-db
diff --git a/kubernetes/manifests/hpa.yaml b/kubernetes/manifests/hpa.yaml
deleted file mode 100644
index 10bab0c..0000000
--- a/kubernetes/manifests/hpa.yaml
+++ /dev/null
@@ -1,26 +0,0 @@
-# Optional: HorizontalPodAutoscaler for auto-scaling
-apiVersion: autoscaling/v2
-kind: HorizontalPodAutoscaler
-metadata:
- name: krawl-hpa
- namespace: krawl-system
-spec:
- scaleTargetRef:
- apiVersion: apps/v1
- kind: Deployment
- name: krawl-server
- minReplicas: 1
- maxReplicas: 5
- metrics:
- - type: Resource
- resource:
- name: cpu
- target:
- type: Utilization
- averageUtilization: 70
- - type: Resource
- resource:
- name: memory
- target:
- type: Utilization
- averageUtilization: 80
diff --git a/kubernetes/manifests/ingress.yaml b/kubernetes/manifests/ingress.yaml
index f5a6efc..5134798 100644
--- a/kubernetes/manifests/ingress.yaml
+++ b/kubernetes/manifests/ingress.yaml
@@ -1,24 +1,23 @@
+# Source: krawl-chart/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
- name: krawl-ingress
+ name: krawl
namespace: krawl-system
- annotations:
- nginx.ingress.kubernetes.io/rewrite-target: /
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
- ingressClassName: nginx
+ ingressClassName: traefik
rules:
- - host: krawl.example.com # Change to your domain
- http:
- paths:
- - path: /
- pathType: Prefix
- backend:
- service:
- name: krawl-server
- port:
- number: 5000
- # tls:
- # - hosts:
- # - krawl.example.com
- # secretName: krawl-tls
+ - host: "krawl.example.com"
+ http:
+ paths:
+ - path: /
+ pathType: Prefix
+ backend:
+ service:
+ name: krawl
+ port:
+ number: 5000
diff --git a/kubernetes/manifests/kustomization.yaml b/kubernetes/manifests/kustomization.yaml
index 8f41776..4a5fcd9 100644
--- a/kubernetes/manifests/kustomization.yaml
+++ b/kubernetes/manifests/kustomization.yaml
@@ -5,6 +5,7 @@ resources:
- namespace.yaml
- configmap.yaml
- wordlists-configmap.yaml
+ - pvc.yaml
- deployment.yaml
- service.yaml
- network-policy.yaml
diff --git a/kubernetes/manifests/network-policy.yaml b/kubernetes/manifests/network-policy.yaml
index e765b36..7068531 100644
--- a/kubernetes/manifests/network-policy.yaml
+++ b/kubernetes/manifests/network-policy.yaml
@@ -1,29 +1,35 @@
+# Source: krawl-chart/templates/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
- name: krawl-network-policy
+ name: krawl
namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
podSelector:
matchLabels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
policyTypes:
- - Ingress
- - Egress
+ - Ingress
+ - Egress
ingress:
- - from:
- - podSelector: {}
- - namespaceSelector: {}
- - ipBlock:
- cidr: 0.0.0.0/0
- ports:
- - protocol: TCP
- port: 5000
+ - from:
+ - podSelector: {}
+ - namespaceSelector: {}
+ - ipBlock:
+ cidr: 0.0.0.0/0
+ ports:
+ - port: 5000
+ protocol: TCP
egress:
- - to:
- - namespaceSelector: {}
- - ipBlock:
- cidr: 0.0.0.0/0
- ports:
- - protocol: TCP
- - protocol: UDP
+ - ports:
+ - protocol: TCP
+ - protocol: UDP
+ to:
+ - namespaceSelector: {}
+ - ipBlock:
+ cidr: 0.0.0.0/0
diff --git a/kubernetes/manifests/pvc.yaml b/kubernetes/manifests/pvc.yaml
new file mode 100644
index 0000000..526093d
--- /dev/null
+++ b/kubernetes/manifests/pvc.yaml
@@ -0,0 +1,16 @@
+# Source: krawl-chart/templates/pvc.yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: krawl-db
+ namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
+spec:
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 1Gi
diff --git a/kubernetes/manifests/service.yaml b/kubernetes/manifests/service.yaml
index 8db65b4..1b73cc0 100644
--- a/kubernetes/manifests/service.yaml
+++ b/kubernetes/manifests/service.yaml
@@ -1,16 +1,25 @@
+# Source: krawl-chart/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
- name: krawl-server
+ name: krawl
namespace: krawl-system
labels:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
spec:
type: LoadBalancer
+ externalTrafficPolicy: Local
+ sessionAffinity: ClientIP
+ sessionAffinityConfig:
+ clientIP:
+ timeoutSeconds: 10800
ports:
- port: 5000
- targetPort: 5000
+ targetPort: http
protocol: TCP
name: http
selector:
- app: krawl-server
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
diff --git a/kubernetes/manifests/wordlists-configmap.yaml b/kubernetes/manifests/wordlists-configmap.yaml
index 4ff0b5d..279410e 100644
--- a/kubernetes/manifests/wordlists-configmap.yaml
+++ b/kubernetes/manifests/wordlists-configmap.yaml
@@ -1,205 +1,13 @@
+# Source: krawl-chart/templates/wordlists-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: krawl-wordlists
namespace: krawl-system
+ labels:
+ app.kubernetes.io/name: krawl
+ app.kubernetes.io/instance: krawl
+ app.kubernetes.io/version: "1.0.0"
data:
wordlists.json: |
- {
- "usernames": {
- "prefixes": [
- "admin",
- "user",
- "developer",
- "root",
- "system",
- "db",
- "api",
- "service",
- "deploy",
- "test",
- "prod",
- "backup",
- "monitor",
- "jenkins",
- "webapp"
- ],
- "suffixes": [
- "",
- "_prod",
- "_dev",
- "_test",
- "123",
- "2024",
- "_backup",
- "_admin",
- "01",
- "02",
- "_user",
- "_service",
- "_api"
- ]
- },
- "passwords": {
- "prefixes": [
- "P@ssw0rd",
- "Passw0rd",
- "Admin",
- "Secret",
- "Welcome",
- "System",
- "Database",
- "Secure",
- "Master",
- "Root"
- ],
- "simple": [
- "test",
- "demo",
- "temp",
- "change",
- "password",
- "admin",
- "letmein",
- "welcome",
- "default",
- "sample"
- ]
- },
- "emails": {
- "domains": [
- "example.com",
- "company.com",
- "localhost.com",
- "test.com",
- "domain.com",
- "corporate.com",
- "internal.net",
- "enterprise.com",
- "business.org"
- ]
- },
- "api_keys": {
- "prefixes": [
- "sk_live_",
- "sk_test_",
- "api_",
- "key_",
- "token_",
- "access_",
- "secret_",
- "prod_",
- ""
- ]
- },
- "databases": {
- "names": [
- "production",
- "prod_db",
- "main_db",
- "app_database",
- "users_db",
- "customer_data",
- "analytics",
- "staging_db",
- "dev_database",
- "wordpress",
- "ecommerce",
- "crm_db",
- "inventory"
- ],
- "hosts": [
- "localhost",
- "db.internal",
- "mysql.local",
- "postgres.internal",
- "127.0.0.1",
- "db-server-01",
- "database.prod",
- "sql.company.com"
- ]
- },
- "applications": {
- "names": [
- "WebApp",
- "API Gateway",
- "Dashboard",
- "Admin Panel",
- "CMS",
- "Portal",
- "Manager",
- "Console",
- "Control Panel",
- "Backend"
- ]
- },
- "users": {
- "roles": [
- "Administrator",
- "Developer",
- "Manager",
- "User",
- "Guest",
- "Moderator",
- "Editor",
- "Viewer",
- "Analyst",
- "Support"
- ]
- },
- "directory_listing": {
- "files": [
- "admin.txt",
- "test.exe",
- "backup.sql",
- "database.sql",
- "db_backup.sql",
- "dump.sql",
- "config.php",
- "credentials.txt",
- "passwords.txt",
- "users.csv",
- ".env",
- "id_rsa",
- "id_rsa.pub",
- "private_key.pem",
- "api_keys.json",
- "secrets.yaml",
- "admin_notes.txt",
- "settings.ini",
- "database.yml",
- "wp-config.php",
- ".htaccess",
- "server.key",
- "cert.pem",
- "shadow.bak",
- "passwd.old"
- ],
- "directories": [
- "uploads/",
- "backups/",
- "logs/",
- "temp/",
- "cache/",
- "private/",
- "config/",
- "admin/",
- "database/",
- "backup/",
- "old/",
- "archive/",
- ".git/",
- "keys/",
- "credentials/"
- ]
- },
- "error_codes": [
- 400,
- 401,
- 403,
- 404,
- 500,
- 502,
- 503
- ]
- }
+ {"api_keys":{"prefixes":["sk_live_","sk_test_","api_","key_","token_","access_","secret_","prod_",""]},"applications":{"names":["WebApp","API Gateway","Dashboard","Admin Panel","CMS","Portal","Manager","Console","Control Panel","Backend"]},"databases":{"hosts":["localhost","db.internal","mysql.local","postgres.internal","127.0.0.1","db-server-01","database.prod","sql.company.com"],"names":["production","prod_db","main_db","app_database","users_db","customer_data","analytics","staging_db","dev_database","wordpress","ecommerce","crm_db","inventory"]},"directory_listing":{"directories":["uploads/","backups/","logs/","temp/","cache/","private/","config/","admin/","database/","backup/","old/","archive/",".git/","keys/","credentials/"],"files":["admin.txt","test.exe","backup.sql","database.sql","db_backup.sql","dump.sql","config.php","credentials.txt","passwords.txt","users.csv",".env","id_rsa","id_rsa.pub","private_key.pem","api_keys.json","secrets.yaml","admin_notes.txt","settings.ini","database.yml","wp-config.php",".htaccess","server.key","cert.pem","shadow.bak","passwd.old"]},"emails":{"domains":["example.com","company.com","localhost.com","test.com","domain.com","corporate.com","internal.net","enterprise.com","business.org"]},"error_codes":[400,401,403,404,500,502,503],"passwords":{"prefixes":["P@ssw0rd","Passw0rd","Admin","Secret","Welcome","System","Database","Secure","Master","Root"],"simple":["test","demo","temp","change","password","admin","letmein","welcome","default","sample"]},"server_headers":["Apache/2.2.22 
(Ubuntu)","nginx/1.18.0","Microsoft-IIS/10.0","LiteSpeed","Caddy","Gunicorn/20.0.4","uvicorn/0.13.4","Express","Flask/1.1.2","Django/3.1"],"usernames":{"prefixes":["admin","user","developer","root","system","db","api","service","deploy","test","prod","backup","monitor","jenkins","webapp"],"suffixes":["","_prod","_dev","_test","123","2024","_backup","_admin","01","02","_user","_service","_api"]},"users":{"roles":["Administrator","Developer","Manager","User","Guest","Moderator","Editor","Viewer","Analyst","Support"]}}
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..b3f9b03
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,13 @@
+# Krawl Honeypot Dependencies
+# Install with: pip install -r requirements.txt
+
+# Configuration
+PyYAML>=6.0
+
+# Database ORM
+SQLAlchemy>=2.0.0,<3.0.0
+
+# Scheduling
+APScheduler>=3.11.2
+
+requests>=2.32.5
diff --git a/src/analyzer.py b/src/analyzer.py
new file mode 100644
index 0000000..7f29662
--- /dev/null
+++ b/src/analyzer.py
@@ -0,0 +1,342 @@
+#!/usr/bin/env python3
+from sqlalchemy import select
+from typing import Optional
+from database import get_database, DatabaseManager
+from zoneinfo import ZoneInfo
+from pathlib import Path
+from datetime import datetime, timedelta
+import re
+import urllib.parse
+from wordlists import get_wordlists
+from config import get_config
+from logger import get_app_logger
+import requests
+
+"""
+Functions for user activity analysis
+"""
+
+app_logger = get_app_logger()
+
+
+class Analyzer:
+ """
+ Analyzes users activity and produces aggregated insights
+ """
+
+ def __init__(self, db_manager: Optional[DatabaseManager] = None):
+ """
+ Initialize the analyzer.
+
+ Args:
+ db_manager: Optional DatabaseManager for persistence.
+ If None, will use the global singleton.
+ """
+ self._db_manager = db_manager
+
+ @property
+ def db(self) -> Optional[DatabaseManager]:
+ """
+ Get the database manager, lazily initializing if needed.
+
+ Returns:
+ DatabaseManager instance or None if not available
+ """
+ if self._db_manager is None:
+ try:
+ self._db_manager = get_database()
+ except Exception:
+ pass
+ return self._db_manager
+
+ # def infer_user_category(self, ip: str) -> str:
+
+ # config = get_config()
+
+ # http_risky_methods_threshold = config.http_risky_methods_threshold
+ # violated_robots_threshold = config.violated_robots_threshold
+ # uneven_request_timing_threshold = config.uneven_request_timing_threshold
+ # user_agents_used_threshold = config.user_agents_used_threshold
+ # attack_urls_threshold = config.attack_urls_threshold
+ # uneven_request_timing_time_window_seconds = config.uneven_request_timing_time_window_seconds
+
+ # app_logger.debug(f"http_risky_methods_threshold: {http_risky_methods_threshold}")
+
+ # score = {}
+ # score["attacker"] = {"risky_http_methods": False, "robots_violations": False, "uneven_request_timing": False, "different_user_agents": False, "attack_url": False}
+ # score["good_crawler"] = {"risky_http_methods": False, "robots_violations": False, "uneven_request_timing": False, "different_user_agents": False, "attack_url": False}
+ # score["bad_crawler"] = {"risky_http_methods": False, "robots_violations": False, "uneven_request_timing": False, "different_user_agents": False, "attack_url": False}
+ # score["regular_user"] = {"risky_http_methods": False, "robots_violations": False, "uneven_request_timing": False, "different_user_agents": False, "attack_url": False}
+
+ # #1-3 low, 4-6 mid, 7-9 high, 10-20 extreme
+ # weights = {
+ # "attacker": {
+ # "risky_http_methods": 6,
+ # "robots_violations": 4,
+ # "uneven_request_timing": 3,
+ # "different_user_agents": 8,
+ # "attack_url": 15
+ # },
+ # "good_crawler": {
+ # "risky_http_methods": 1,
+ # "robots_violations": 0,
+ # "uneven_request_timing": 0,
+ # "different_user_agents": 0,
+ # "attack_url": 0
+ # },
+ # "bad_crawler": {
+ # "risky_http_methods": 2,
+ # "robots_violations": 7,
+ # "uneven_request_timing": 0,
+ # "different_user_agents": 5,
+ # "attack_url": 5
+ # },
+ # "regular_user": {
+ # "risky_http_methods": 0,
+ # "robots_violations": 0,
+ # "uneven_request_timing": 8,
+ # "different_user_agents": 3,
+ # "attack_url": 0
+ # }
+ # }
+
+ # accesses = self.db.get_access_logs(ip_filter = ip, limit=1000)
+ # total_accesses_count = len(accesses)
+ # if total_accesses_count <= 0:
+ # return
+
+ # # Set category as "unknown" for the first 5 requests
+ # if total_accesses_count < 3:
+ # category = "unknown"
+ # analyzed_metrics = {}
+ # category_scores = {"attacker": 0, "good_crawler": 0, "bad_crawler": 0, "regular_user": 0, "unknown": 0}
+ # last_analysis = datetime.now(tz=ZoneInfo('UTC'))
+ # self._db_manager.update_ip_stats_analysis(ip, analyzed_metrics, category, category_scores, last_analysis)
+ # return 0
+
+ # #--------------------- HTTP Methods ---------------------
+
+ # get_accesses_count = len([item for item in accesses if item["method"] == "GET"])
+ # post_accesses_count = len([item for item in accesses if item["method"] == "POST"])
+ # put_accesses_count = len([item for item in accesses if item["method"] == "PUT"])
+ # delete_accesses_count = len([item for item in accesses if item["method"] == "DELETE"])
+ # head_accesses_count = len([item for item in accesses if item["method"] == "HEAD"])
+ # options_accesses_count = len([item for item in accesses if item["method"] == "OPTIONS"])
+ # patch_accesses_count = len([item for item in accesses if item["method"] == "PATCH"])
+
+ # if total_accesses_count > http_risky_methods_threshold:
+ # http_method_attacker_score = (post_accesses_count + put_accesses_count + delete_accesses_count + options_accesses_count + patch_accesses_count) / total_accesses_count
+ # else:
+ # http_method_attacker_score = 0
+
+ # #print(f"HTTP Method attacker score: {http_method_attacker_score}")
+ # if http_method_attacker_score >= http_risky_methods_threshold:
+ # score["attacker"]["risky_http_methods"] = True
+ # score["good_crawler"]["risky_http_methods"] = False
+ # score["bad_crawler"]["risky_http_methods"] = True
+ # score["regular_user"]["risky_http_methods"] = False
+ # else:
+ # score["attacker"]["risky_http_methods"] = False
+ # score["good_crawler"]["risky_http_methods"] = True
+ # score["bad_crawler"]["risky_http_methods"] = False
+ # score["regular_user"]["risky_http_methods"] = False
+
+ # #--------------------- Robots Violations ---------------------
+ # #respect robots.txt and login/config pages access frequency
+ # robots_disallows = []
+ # robots_path = Path(__file__).parent / "templates" / "html" / "robots.txt"
+ # with open(robots_path, "r") as f:
+ # for line in f:
+ # line = line.strip()
+ # if not line:
+ # continue
+ # parts = line.split(":")
+
+ # if parts[0] == "Disallow":
+ # parts[1] = parts[1].rstrip("/")
+ # #print(f"DISALLOW {parts[1]}")
+ # robots_disallows.append(parts[1].strip())
+
+ # #if 0 100% sure is good crawler, if >10% of robots violated is bad crawler or attacker
+ # violated_robots_count = len([item for item in accesses if any(item["path"].rstrip("/").startswith(disallow) for disallow in robots_disallows)])
+ # #print(f"Violated robots count: {violated_robots_count}")
+ # if total_accesses_count > 0:
+ # violated_robots_ratio = violated_robots_count / total_accesses_count
+ # else:
+ # violated_robots_ratio = 0
+
+ # if violated_robots_ratio >= violated_robots_threshold:
+ # score["attacker"]["robots_violations"] = True
+ # score["good_crawler"]["robots_violations"] = False
+ # score["bad_crawler"]["robots_violations"] = True
+ # score["regular_user"]["robots_violations"] = False
+ # else:
+ # score["attacker"]["robots_violations"] = False
+ # score["good_crawler"]["robots_violations"] = False
+ # score["bad_crawler"]["robots_violations"] = False
+ # score["regular_user"]["robots_violations"] = False
+
+ # #--------------------- Requests Timing ---------------------
+ # #Request rate and timing: steady, throttled, polite vs attackers' bursty, aggressive, or oddly rhythmic behavior
+ # timestamps = [datetime.fromisoformat(item["timestamp"]) for item in accesses]
+ # now_utc = datetime.now(tz=ZoneInfo('UTC'))
+ # timestamps = [ts for ts in timestamps if now_utc - ts <= timedelta(seconds=uneven_request_timing_time_window_seconds)]
+ # timestamps = sorted(timestamps, reverse=True)
+
+ # time_diffs = []
+ # for i in range(0, len(timestamps)-1):
+ # diff = (timestamps[i] - timestamps[i+1]).total_seconds()
+ # time_diffs.append(diff)
+
+ # mean = 0
+ # variance = 0
+ # std = 0
+ # cv = 0
+ # if time_diffs:
+ # mean = sum(time_diffs) / len(time_diffs)
+ # variance = sum((x - mean) ** 2 for x in time_diffs) / len(time_diffs)
+ # std = variance ** 0.5
+ # cv = std/mean
+ # app_logger.debug(f"Mean: {mean} - Variance {variance} - Standard Deviation {std} - Coefficient of Variation: {cv}")
+
+ # if cv >= uneven_request_timing_threshold:
+ # score["attacker"]["uneven_request_timing"] = True
+ # score["good_crawler"]["uneven_request_timing"] = False
+ # score["bad_crawler"]["uneven_request_timing"] = False
+ # score["regular_user"]["uneven_request_timing"] = True
+ # else:
+ # score["attacker"]["uneven_request_timing"] = False
+ # score["good_crawler"]["uneven_request_timing"] = False
+ # score["bad_crawler"]["uneven_request_timing"] = False
+ # score["regular_user"]["uneven_request_timing"] = False
+
+ # #--------------------- Different User Agents ---------------------
+ # #Header Quality and Consistency: Crawlers tend to use complete and consistent headers, attackers might miss, fake, or change headers
+ # user_agents_used = [item["user_agent"] for item in accesses]
+ # user_agents_used = list(dict.fromkeys(user_agents_used))
+ # #print(f"User agents used: {user_agents_used}")
+
+ # if len(user_agents_used) >= user_agents_used_threshold:
+ # score["attacker"]["different_user_agents"] = True
+ # score["good_crawler"]["different_user_agents"] = False
+ # score["bad_crawler"]["different_user_agentss"] = True
+ # score["regular_user"]["different_user_agents"] = False
+ # else:
+ # score["attacker"]["different_user_agents"] = False
+ # score["good_crawler"]["different_user_agents"] = False
+ # score["bad_crawler"]["different_user_agents"] = False
+ # score["regular_user"]["different_user_agents"] = False
+
+ # #--------------------- Attack URLs ---------------------
+
+ # attack_urls_found_list = []
+
+ # wl = get_wordlists()
+ # if wl.attack_patterns:
+ # queried_paths = [item["path"] for item in accesses]
+
+ # for queried_path in queried_paths:
+ # # URL decode the path to catch encoded attacks
+ # try:
+ # decoded_path = urllib.parse.unquote(queried_path)
+ # # Double decode to catch double-encoded attacks
+ # decoded_path_twice = urllib.parse.unquote(decoded_path)
+ # except Exception:
+ # decoded_path = queried_path
+ # decoded_path_twice = queried_path
+
+ # for name, pattern in wl.attack_patterns.items():
+ # # Check original, decoded, and double-decoded paths
+ # if (re.search(pattern, queried_path, re.IGNORECASE) or
+ # re.search(pattern, decoded_path, re.IGNORECASE) or
+ # re.search(pattern, decoded_path_twice, re.IGNORECASE)):
+ # attack_urls_found_list.append(f"{name}: {pattern}")
+
+ # #remove duplicates
+ # attack_urls_found_list = set(attack_urls_found_list)
+ # attack_urls_found_list = list(attack_urls_found_list)
+
+ # if len(attack_urls_found_list) > attack_urls_threshold:
+ # score["attacker"]["attack_url"] = True
+ # score["good_crawler"]["attack_url"] = False
+ # score["bad_crawler"]["attack_url"] = False
+ # score["regular_user"]["attack_url"] = False
+ # else:
+ # score["attacker"]["attack_url"] = False
+ # score["good_crawler"]["attack_url"] = False
+ # score["bad_crawler"]["attack_url"] = False
+ # score["regular_user"]["attack_url"] = False
+
+ # #--------------------- Calculate score ---------------------
+
+ # attacker_score = good_crawler_score = bad_crawler_score = regular_user_score = 0
+
+ # attacker_score = score["attacker"]["risky_http_methods"] * weights["attacker"]["risky_http_methods"]
+ # attacker_score = attacker_score + score["attacker"]["robots_violations"] * weights["attacker"]["robots_violations"]
+ # attacker_score = attacker_score + score["attacker"]["uneven_request_timing"] * weights["attacker"]["uneven_request_timing"]
+ # attacker_score = attacker_score + score["attacker"]["different_user_agents"] * weights["attacker"]["different_user_agents"]
+ # attacker_score = attacker_score + score["attacker"]["attack_url"] * weights["attacker"]["attack_url"]
+
+ # good_crawler_score = score["good_crawler"]["risky_http_methods"] * weights["good_crawler"]["risky_http_methods"]
+ # good_crawler_score = good_crawler_score + score["good_crawler"]["robots_violations"] * weights["good_crawler"]["robots_violations"]
+ # good_crawler_score = good_crawler_score + score["good_crawler"]["uneven_request_timing"] * weights["good_crawler"]["uneven_request_timing"]
+ # good_crawler_score = good_crawler_score + score["good_crawler"]["different_user_agents"] * weights["good_crawler"]["different_user_agents"]
+ # good_crawler_score = good_crawler_score + score["good_crawler"]["attack_url"] * weights["good_crawler"]["attack_url"]
+
+ # bad_crawler_score = score["bad_crawler"]["risky_http_methods"] * weights["bad_crawler"]["risky_http_methods"]
+ # bad_crawler_score = bad_crawler_score + score["bad_crawler"]["robots_violations"] * weights["bad_crawler"]["robots_violations"]
+ # bad_crawler_score = bad_crawler_score + score["bad_crawler"]["uneven_request_timing"] * weights["bad_crawler"]["uneven_request_timing"]
+ # bad_crawler_score = bad_crawler_score + score["bad_crawler"]["different_user_agents"] * weights["bad_crawler"]["different_user_agents"]
+ # bad_crawler_score = bad_crawler_score + score["bad_crawler"]["attack_url"] * weights["bad_crawler"]["attack_url"]
+
+ # regular_user_score = score["regular_user"]["risky_http_methods"] * weights["regular_user"]["risky_http_methods"]
+ # regular_user_score = regular_user_score + score["regular_user"]["robots_violations"] * weights["regular_user"]["robots_violations"]
+ # regular_user_score = regular_user_score + score["regular_user"]["uneven_request_timing"] * weights["regular_user"]["uneven_request_timing"]
+ # regular_user_score = regular_user_score + score["regular_user"]["different_user_agents"] * weights["regular_user"]["different_user_agents"]
+ # regular_user_score = regular_user_score + score["regular_user"]["attack_url"] * weights["regular_user"]["attack_url"]
+
+ # score_details = f"""
+ # Attacker score: {attacker_score}
+ # Good Crawler score: {good_crawler_score}
+ # Bad Crawler score: {bad_crawler_score}
+ # Regular User score: {regular_user_score}
+ # """
+ # app_logger.debug(score_details)
+
+ # analyzed_metrics = {"risky_http_methods": http_method_attacker_score, "robots_violations": violated_robots_ratio, "uneven_request_timing": mean, "different_user_agents": user_agents_used, "attack_url": attack_urls_found_list}
+ # category_scores = {"attacker": attacker_score, "good_crawler": good_crawler_score, "bad_crawler": bad_crawler_score, "regular_user": regular_user_score}
+ # category = max(category_scores, key=category_scores.get)
+ # last_analysis = datetime.now(tz=ZoneInfo('UTC'))
+
+ # self._db_manager.update_ip_stats_analysis(ip, analyzed_metrics, category, category_scores, last_analysis)
+
+ # return 0
+
+ # def update_ip_rep_infos(self, ip: str) -> list[str]:
+ # api_url = "https://iprep.lcrawl.com/api/iprep/"
+ # params = {
+ # "cidr": ip
+ # }
+ # headers = {
+ # "Content-Type": "application/json"
+ # }
+
+ # response = requests.get(api_url, headers=headers, params=params)
+ # payload = response.json()
+
+ # if payload["results"]:
+ # data = payload["results"][0]
+
+ # country_iso_code = data["geoip_data"]["country_iso_code"]
+ # asn = data["geoip_data"]["asn_autonomous_system_number"]
+ # asn_org = data["geoip_data"]["asn_autonomous_system_organization"]
+ # list_on = data["list_on"]
+
+ # sanitized_country_iso_code = sanitize_for_storage(country_iso_code, 3)
+ # sanitized_asn = sanitize_for_storage(asn, 100)
+ # sanitized_asn_org = sanitize_for_storage(asn_org, 100)
+ # sanitized_list_on = sanitize_dict(list_on, 100000)
+
+ # self._db_manager.update_ip_rep_infos(ip, sanitized_country_iso_code, sanitized_asn, sanitized_asn_org, sanitized_list_on)
+
+ # return
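The commented-out `infer_user_category` above scores request timing by the coefficient of variation (standard deviation over mean) of inter-request gaps: steady, evenly paced traffic scores near zero, while bursty or oddly rhythmic traffic scores high. A minimal standalone sketch of that calculation (the function name is illustrative; the real code works on `datetime` objects filtered to a configurable time window, and the 0.5 threshold mirrors the analyzer default):

```python
def coefficient_of_variation(timestamps: list[float]) -> float:
    """CV (std/mean) of inter-arrival gaps between consecutive requests.

    Standalone sketch of the timing heuristic commented out in analyzer.py;
    returns 0.0 when there are too few samples to compute a gap.
    """
    # Inter-arrival gaps between consecutive timestamps
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return 0.0
    mean = sum(gaps) / len(gaps)
    if mean == 0:
        return 0.0
    # Population variance, as in the original commented-out code
    variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return (variance ** 0.5) / mean


# Evenly spaced requests (a polite crawler) vs. a burst followed by a pause
print(coefficient_of_variation([0, 1, 2, 3, 4]))          # 0.0
print(coefficient_of_variation([0, 0.1, 0.2, 10]) > 0.5)  # True
```

With the analyzer's default `uneven_request_timing_threshold` of 0.5, the first pattern would read as machine-steady and the second as uneven, human-like or bursty traffic.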
diff --git a/src/config.py b/src/config.py
index 7c6714c..3e5983f 100644
--- a/src/config.py
+++ b/src/config.py
@@ -1,50 +1,261 @@
#!/usr/bin/env python3
import os
+import sys
from dataclasses import dataclass
+from pathlib import Path
from typing import Optional, Tuple
+from zoneinfo import ZoneInfo
+import time
+from logger import get_app_logger
+import requests
+import yaml
@dataclass
class Config:
"""Configuration class for the deception server"""
+
port: int = 5000
delay: int = 100 # milliseconds
+ server_header: str = ""
links_length_range: Tuple[int, int] = (5, 15)
links_per_page_range: Tuple[int, int] = (10, 15)
- char_space: str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
+ char_space: str = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
max_counter: int = 10
canary_token_url: Optional[str] = None
canary_token_tries: int = 10
dashboard_secret_path: str = None
- api_server_url: Optional[str] = None
- api_server_port: int = 8080
- api_server_path: str = "/api/v2/users"
probability_error_codes: int = 0 # Percentage (0-100)
- server_header: str = "Apache/2.2.22 (Ubuntu)"
+
+ # Crawl limiting settings - for legitimate vs malicious crawlers
+ max_pages_limit: int = (
+ 100 # Max pages limit for good crawlers and regular users (and bad crawlers/attackers if infinite_pages_for_malicious is False)
+ )
+ infinite_pages_for_malicious: bool = True # Infinite pages for malicious crawlers
+ ban_duration_seconds: int = 600 # Ban duration in seconds for IPs exceeding limits
+
+ # Database settings
+ database_path: str = "data/krawl.db"
+ database_retention_days: int = 30
+
+ # Analyzer settings
+ http_risky_methods_threshold: float = None
+ violated_robots_threshold: float = None
+ uneven_request_timing_threshold: float = None
+ uneven_request_timing_time_window_seconds: float = None
+ user_agents_used_threshold: float = None
+ attack_urls_threshold: float = None
+
+ _server_ip: Optional[str] = None
+ _server_ip_cache_time: float = 0
+ _ip_cache_ttl: int = 300
+
+ def get_server_ip(self, refresh: bool = False) -> Optional[str]:
+ """
+ Get the server's own public IP address.
+ Excludes requests from the server itself from being tracked.
+ """
+
+ current_time = time.time()
+
+ # Check if cache is valid and not forced refresh
+ if (
+ self._server_ip is not None
+ and not refresh
+ and (current_time - self._server_ip_cache_time) < self._ip_cache_ttl
+ ):
+ return self._server_ip
+
+ try:
+ # Try multiple external IP detection services (fallback chain)
+ ip_detection_services = [
+ "https://api.ipify.org", # Plain text response
+ "http://ident.me", # Plain text response
+ "https://ifconfig.me", # Plain text response
+ ]
+
+ ip = None
+ for service_url in ip_detection_services:
+ try:
+ response = requests.get(service_url, timeout=5)
+ if response.status_code == 200:
+ ip = response.text.strip()
+ if ip:
+ break
+ except Exception:
+ continue
+
+ if not ip:
+ get_app_logger().warning(
+ "Could not determine server IP from external services. "
+ "All IPs will be tracked (including potential server IP)."
+ )
+ return None
+
+ self._server_ip = ip
+ self._server_ip_cache_time = current_time
+ return ip
+
+ except Exception as e:
+ get_app_logger().warning(
+ f"Could not determine server IP address: {e}. "
+ "All IPs will be tracked (including potential server IP)."
+ )
+ return None
+
+ def refresh_server_ip(self) -> Optional[str]:
+ """
+ Force refresh the cached server IP.
+ Use this if you suspect the IP has changed.
+
+ Returns:
+ New server IP address or None if unable to determine
+ """
+ return self.get_server_ip(refresh=True)
@classmethod
- def from_env(cls) -> 'Config':
- """Create configuration from environment variables"""
+ def from_yaml(cls) -> "Config":
+ """Create configuration from YAML file"""
+ config_location = os.getenv("CONFIG_LOCATION", "config.yaml")
+ config_path = Path(__file__).parent.parent / config_location
+
+ try:
+ with open(config_path, "r") as f:
+ data = yaml.safe_load(f)
+ except FileNotFoundError:
+ print(
+ f"Error: Configuration file '{config_path}' not found.", file=sys.stderr
+ )
+ print(
+ f"Please create a config.yaml file or set CONFIG_LOCATION environment variable.",
+ file=sys.stderr,
+ )
+ sys.exit(1)
+ except yaml.YAMLError as e:
+ print(
+ f"Error: Invalid YAML in configuration file '{config_path}': {e}",
+ file=sys.stderr,
+ )
+ sys.exit(1)
+
+ if data is None:
+ data = {}
+
+ # Extract nested values with defaults
+ server = data.get("server", {})
+ links = data.get("links", {})
+ canary = data.get("canary", {})
+ dashboard = data.get("dashboard", {})
+ api = data.get("api", {})
+ database = data.get("database", {})
+ behavior = data.get("behavior", {})
+ analyzer = data.get("analyzer") or {}
+ crawl = data.get("crawl", {})
+
+ # Handle dashboard_secret_path - auto-generate if null/not set
+ dashboard_path = dashboard.get("secret_path")
+ if dashboard_path is None:
+ dashboard_path = f"/{os.urandom(16).hex()}"
+ else:
+ # ensure the dashboard path starts with a /
+ if dashboard_path[:1] != "/":
+ dashboard_path = f"/{dashboard_path}"
+
return cls(
- port=int(os.getenv('PORT', 5000)),
- delay=int(os.getenv('DELAY', 100)),
+ port=server.get("port", 5000),
+ delay=server.get("delay", 100),
+ server_header=server.get("server_header", ""),
links_length_range=(
- int(os.getenv('LINKS_MIN_LENGTH', 5)),
- int(os.getenv('LINKS_MAX_LENGTH', 15))
+ links.get("min_length", 5),
+ links.get("max_length", 15),
),
links_per_page_range=(
- int(os.getenv('LINKS_MIN_PER_PAGE', 10)),
- int(os.getenv('LINKS_MAX_PER_PAGE', 15))
+ links.get("min_per_page", 10),
+ links.get("max_per_page", 15),
),
- char_space=os.getenv('CHAR_SPACE', 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'),
- max_counter=int(os.getenv('MAX_COUNTER', 10)),
- canary_token_url=os.getenv('CANARY_TOKEN_URL'),
- canary_token_tries=int(os.getenv('CANARY_TOKEN_TRIES', 10)),
- dashboard_secret_path=os.getenv('DASHBOARD_SECRET_PATH', f'/{os.urandom(16).hex()}'),
- api_server_url=os.getenv('API_SERVER_URL'),
- api_server_port=int(os.getenv('API_SERVER_PORT', 8080)),
- api_server_path=os.getenv('API_SERVER_PATH', '/api/v2/users'),
- probability_error_codes=int(os.getenv('PROBABILITY_ERROR_CODES', 5)),
- server_header=os.getenv('SERVER_HEADER', 'Apache/2.2.22 (Ubuntu)')
+ char_space=links.get(
+ "char_space",
+ "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
+ ),
+ max_counter=links.get("max_counter", 10),
+ canary_token_url=canary.get("token_url"),
+ canary_token_tries=canary.get("token_tries", 10),
+ dashboard_secret_path=dashboard_path,
+ probability_error_codes=behavior.get("probability_error_codes", 0),
+ database_path=database.get("path", "data/krawl.db"),
+ database_retention_days=database.get("retention_days", 30),
+ http_risky_methods_threshold=analyzer.get(
+ "http_risky_methods_threshold", 0.1
+ ),
+ violated_robots_threshold=analyzer.get("violated_robots_threshold", 0.1),
+ uneven_request_timing_threshold=analyzer.get(
+ "uneven_request_timing_threshold", 0.5
+ ), # coefficient of variation
+ uneven_request_timing_time_window_seconds=analyzer.get(
+ "uneven_request_timing_time_window_seconds", 300
+ ),
+ user_agents_used_threshold=analyzer.get("user_agents_used_threshold", 2),
+ attack_urls_threshold=analyzer.get("attack_urls_threshold", 1),
+ infinite_pages_for_malicious=crawl.get(
+ "infinite_pages_for_malicious", True
+ ),
+ max_pages_limit=crawl.get("max_pages_limit", 250),
+ ban_duration_seconds=crawl.get("ban_duration_seconds", 600),
)
+
+
+def __get_env_from_config(field_name: str) -> str:
+
+ env = field_name.upper().replace(".", "_").replace("-", "__").replace(" ", "_")
+
+ return f"KRAWL_{env}"
+
+
+def override_config_from_env(config: Config) -> None:
+ """Override configuration fields from KRAWL_-prefixed environment variables"""
+
+ for field in config.__dataclass_fields__:
+
+ env_var = __get_env_from_config(field)
+ if env_var in os.environ:
+
+ get_app_logger().info(
+ f"Overriding config '{field}' from environment variable '{env_var}'"
+ )
+ try:
+ field_type = config.__dataclass_fields__[field].type
+ env_value = os.environ[env_var]
+ if field_type == int:
+ setattr(config, field, int(env_value))
+ elif field_type == float:
+ setattr(config, field, float(env_value))
+ elif field_type == bool:
+ # Handle boolean values (case-insensitive: true/false, yes/no, 1/0)
+ setattr(config, field, env_value.lower() in ("true", "yes", "1"))
+                    elif field_type == Tuple[int, int]:
+                        parts = env_value.split(",")
+                        if len(parts) == 2:
+                            setattr(config, field, (int(parts[0]), int(parts[1])))
+                        else:
+                            raise ValueError(f"Expected 'min,max' pair, got '{env_value}'")
+                    else:
+                        setattr(config, field, env_value)
+ except Exception as e:
+ get_app_logger().error(
+ f"Error overriding config '{field}' from environment variable '{env_var}': {e}"
+ )
+
+
+_config_instance = None
+
+
+def get_config() -> Config:
+ """Get the singleton Config instance"""
+ global _config_instance
+ if _config_instance is None:
+ _config_instance = Config.from_yaml()
+
+ override_config_from_env(_config_instance)
+
+ return _config_instance
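The `override_config_from_env` hunk above derives each environment variable name from the dataclass field name via `__get_env_from_config`. A standalone sketch of that naming rule (reimplemented here for illustration, outside the patch):

```python
# Sketch of the KRAWL_* naming convention used by override_config_from_env;
# mirrors __get_env_from_config from the patch above.
def env_var_for(field_name: str) -> str:
    env = field_name.upper().replace(".", "_").replace("-", "__").replace(" ", "_")
    return f"KRAWL_{env}"

print(env_var_for("canary_token_url"))      # KRAWL_CANARY_TOKEN_URL
print(env_var_for("ban_duration_seconds"))  # KRAWL_BAN_DURATION_SECONDS
```

So setting `KRAWL_MAX_PAGES_LIMIT=500` in the environment overrides the `max_pages_limit` field loaded from YAML.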
diff --git a/src/database.py b/src/database.py
new file mode 100644
index 0000000..36cc7e1
--- /dev/null
+++ b/src/database.py
@@ -0,0 +1,1602 @@
+#!/usr/bin/env python3
+
+"""
+Database singleton module for the Krawl honeypot.
+Provides SQLAlchemy session management and database initialization.
+"""
+
+import os
+import stat
+from datetime import datetime, timedelta
+from typing import Optional, List, Dict, Any
+from zoneinfo import ZoneInfo
+
+from sqlalchemy import create_engine, func, distinct, case, event, or_
+from sqlalchemy.orm import sessionmaker, scoped_session, Session
+from sqlalchemy.engine import Engine
+
+from ip_utils import is_local_or_private_ip, is_valid_public_ip
+
+
+@event.listens_for(Engine, "connect")
+def set_sqlite_pragma(dbapi_connection, connection_record):
+ """Enable WAL mode and set busy timeout for SQLite connections."""
+ cursor = dbapi_connection.cursor()
+ cursor.execute("PRAGMA journal_mode=WAL")
+ cursor.execute("PRAGMA busy_timeout=30000")
+ cursor.close()
+
+
+from models import (
+ Base,
+ AccessLog,
+ CredentialAttempt,
+ AttackDetection,
+ IpStats,
+ CategoryHistory,
+)
+from sanitizer import (
+ sanitize_ip,
+ sanitize_path,
+ sanitize_user_agent,
+ sanitize_credential,
+ sanitize_attack_pattern,
+)
+
+from logger import get_app_logger
+
+applogger = get_app_logger()
+
+
+class DatabaseManager:
+ """
+ Singleton database manager for the Krawl honeypot.
+
+ Handles database initialization, session management, and provides
+ methods for persisting access logs, credentials, and attack detections.
+ """
+
+ _instance: Optional["DatabaseManager"] = None
+
+ def __new__(cls) -> "DatabaseManager":
+ if cls._instance is None:
+ cls._instance = super().__new__(cls)
+ cls._instance._initialized = False
+ return cls._instance
+
+ def initialize(self, database_path: str = "data/krawl.db") -> None:
+ """
+ Initialize the database connection and create tables.
+
+ Args:
+ database_path: Path to the SQLite database file
+ """
+ if self._initialized:
+ return
+
+ # Create data directory if it doesn't exist
+ data_dir = os.path.dirname(database_path)
+ if data_dir and not os.path.exists(data_dir):
+ os.makedirs(data_dir, exist_ok=True)
+
+ # Create SQLite database with check_same_thread=False for multi-threaded access
+ database_url = f"sqlite:///{database_path}"
+ self._engine = create_engine(
+ database_url,
+ connect_args={"check_same_thread": False},
+ echo=False, # Set to True for SQL debugging
+ )
+
+ # Create session factory with scoped_session for thread safety
+ session_factory = sessionmaker(bind=self._engine)
+ self._Session = scoped_session(session_factory)
+
+ # Create all tables
+ Base.metadata.create_all(self._engine)
+
+ # Run automatic migrations for backward compatibility
+ self._run_migrations(database_path)
+
+ # Set restrictive file permissions (owner read/write only)
+ if os.path.exists(database_path):
+ try:
+ os.chmod(database_path, stat.S_IRUSR | stat.S_IWUSR) # 600
+ except OSError:
+ # May fail on some systems, not critical
+ pass
+
+ self._initialized = True
+
+ def _run_migrations(self, database_path: str) -> None:
+ """
+ Run automatic migrations for backward compatibility.
+ Adds missing columns that were added in newer versions.
+
+ Args:
+ database_path: Path to the SQLite database file
+ """
+ import sqlite3
+
+ try:
+ conn = sqlite3.connect(database_path)
+ cursor = conn.cursor()
+
+ # Check if latitude/longitude columns exist
+ cursor.execute("PRAGMA table_info(ip_stats)")
+ columns = [row[1] for row in cursor.fetchall()]
+
+ migrations_run = []
+
+ # Add latitude column if missing
+ if "latitude" not in columns:
+ cursor.execute("ALTER TABLE ip_stats ADD COLUMN latitude REAL")
+ migrations_run.append("latitude")
+
+ # Add longitude column if missing
+ if "longitude" not in columns:
+ cursor.execute("ALTER TABLE ip_stats ADD COLUMN longitude REAL")
+ migrations_run.append("longitude")
+
+ if migrations_run:
+ conn.commit()
+ applogger.info(
+ f"Auto-migration: Added columns {', '.join(migrations_run)} to ip_stats table"
+ )
+
+ conn.close()
+ except Exception as e:
+ applogger.error(f"Auto-migration failed: {e}")
+ # Don't raise - allow app to continue even if migration fails
+
+ @property
+ def session(self) -> Session:
+ """Get a thread-local database session."""
+ if not self._initialized:
+ raise RuntimeError(
+ "DatabaseManager not initialized. Call initialize() first."
+ )
+ return self._Session()
+
+ def close_session(self) -> None:
+ """Close the current thread-local session."""
+ if self._initialized:
+ self._Session.remove()
+
+ def persist_access(
+ self,
+ ip: str,
+ path: str,
+ user_agent: str = "",
+ method: str = "GET",
+ is_suspicious: bool = False,
+ is_honeypot_trigger: bool = False,
+ attack_types: Optional[List[str]] = None,
+ matched_patterns: Optional[Dict[str, str]] = None,
+ ) -> Optional[int]:
+ """
+ Persist an access log entry to the database.
+
+ Args:
+ ip: Client IP address
+ path: Requested path
+ user_agent: Client user agent string
+ method: HTTP method (GET, POST, HEAD)
+ is_suspicious: Whether the request was flagged as suspicious
+ is_honeypot_trigger: Whether a honeypot path was accessed
+ attack_types: List of detected attack types
+ matched_patterns: Dict mapping attack_type to matched pattern
+
+ Returns:
+ The ID of the created AccessLog record, or None on error
+ """
+ session = self.session
+ try:
+ # Create access log with sanitized fields
+ access_log = AccessLog(
+ ip=sanitize_ip(ip),
+ path=sanitize_path(path),
+ user_agent=sanitize_user_agent(user_agent),
+ method=method[:10],
+ is_suspicious=is_suspicious,
+ is_honeypot_trigger=is_honeypot_trigger,
+ timestamp=datetime.now(),
+ )
+ session.add(access_log)
+ session.flush() # Get the ID before committing
+
+ # Add attack detections if any
+ if attack_types:
+ matched_patterns = matched_patterns or {}
+ for attack_type in attack_types:
+ detection = AttackDetection(
+ access_log_id=access_log.id,
+ attack_type=attack_type[:50],
+ matched_pattern=sanitize_attack_pattern(
+ matched_patterns.get(attack_type, "")
+ ),
+ )
+ session.add(detection)
+
+ # Update IP stats
+ self._update_ip_stats(session, ip)
+
+ session.commit()
+ return access_log.id
+
+ except Exception as e:
+ session.rollback()
+ # Log error but don't crash - database persistence is secondary to honeypot function
+ applogger.critical(f"Database error persisting access: {e}")
+ return None
+ finally:
+ self.close_session()
+
+ def persist_credential(
+ self,
+ ip: str,
+ path: str,
+ username: Optional[str] = None,
+ password: Optional[str] = None,
+ ) -> Optional[int]:
+ """
+ Persist a credential attempt to the database.
+
+ Args:
+ ip: Client IP address
+ path: Login form path
+ username: Submitted username
+ password: Submitted password
+
+ Returns:
+ The ID of the created CredentialAttempt record, or None on error
+ """
+ session = self.session
+ try:
+ credential = CredentialAttempt(
+ ip=sanitize_ip(ip),
+ path=sanitize_path(path),
+ username=sanitize_credential(username),
+ password=sanitize_credential(password),
+ timestamp=datetime.now(),
+ )
+ session.add(credential)
+ session.commit()
+ return credential.id
+
+ except Exception as e:
+ session.rollback()
+ applogger.critical(f"Database error persisting credential: {e}")
+ return None
+ finally:
+ self.close_session()
+
+ def _update_ip_stats(self, session: Session, ip: str) -> None:
+ """
+ Update IP statistics (upsert pattern).
+
+ Args:
+ session: Active database session
+ ip: IP address to update
+ """
+ sanitized_ip = sanitize_ip(ip)
+ now = datetime.now()
+
+ ip_stats = session.query(IpStats).filter(IpStats.ip == sanitized_ip).first()
+
+ if ip_stats:
+ ip_stats.total_requests += 1
+ ip_stats.last_seen = now
+ else:
+ ip_stats = IpStats(
+ ip=sanitized_ip, total_requests=1, first_seen=now, last_seen=now
+ )
+ session.add(ip_stats)
+
+ def update_ip_stats_analysis(
+ self,
+ ip: str,
+ analyzed_metrics: Dict[str, object],
+ category: str,
+ category_scores: Dict[str, int],
+ last_analysis: datetime,
+ ) -> None:
+ """
+ Update IP statistics (ip is already persisted).
+ Records category change in history if category has changed.
+
+ Args:
+ ip: IP address to update
+            analyzed_metrics: metric values computed by the analyzer
+ category: inferred category
+ category_scores: inferred category scores
+ last_analysis: timestamp of last analysis
+
+ """
+ applogger.debug(
+ f"Analyzed metrics {analyzed_metrics}, category {category}, category scores {category_scores}, last analysis {last_analysis}"
+ )
+ applogger.info(f"IP: {ip} category has been updated to {category}")
+
+ session = self.session
+ sanitized_ip = sanitize_ip(ip)
+ ip_stats = session.query(IpStats).filter(IpStats.ip == sanitized_ip).first()
+
+ # Check if category has changed and record it
+ old_category = ip_stats.category
+ if old_category != category:
+ self._record_category_change(
+ sanitized_ip, old_category, category, last_analysis
+ )
+
+ ip_stats.analyzed_metrics = analyzed_metrics
+ ip_stats.category = category
+ ip_stats.category_scores = category_scores
+ ip_stats.last_analysis = last_analysis
+
+ try:
+ session.commit()
+ except Exception as e:
+ session.rollback()
+ applogger.error(f"Error updating IP stats analysis: {e}")
+
+ def manual_update_category(self, ip: str, category: str) -> None:
+ """
+ Update IP category as a result of a manual intervention by an admin
+
+ Args:
+ ip: IP address to update
+ category: selected category
+
+ """
+ session = self.session
+ sanitized_ip = sanitize_ip(ip)
+ ip_stats = session.query(IpStats).filter(IpStats.ip == sanitized_ip).first()
+
+ # Record the manual category change
+ old_category = ip_stats.category
+ if old_category != category:
+ self._record_category_change(
+ sanitized_ip, old_category, category, datetime.now()
+ )
+
+ ip_stats.category = category
+ ip_stats.manual_category = True
+
+ try:
+ session.commit()
+ except Exception as e:
+ session.rollback()
+ applogger.error(f"Error updating manual category: {e}")
+
+ def _record_category_change(
+ self,
+ ip: str,
+ old_category: Optional[str],
+ new_category: str,
+ timestamp: datetime,
+ ) -> None:
+ """
+ Internal method to record category changes in history.
+ Only records if there's an actual change from a previous category.
+
+ Args:
+ ip: IP address
+ old_category: Previous category (None if first categorization)
+ new_category: New category
+ timestamp: When the change occurred
+ """
+ # Don't record initial categorization (when old_category is None)
+ # Only record actual category changes
+ if old_category is None:
+ return
+
+ session = self.session
+ try:
+ history_entry = CategoryHistory(
+ ip=ip,
+ old_category=old_category,
+ new_category=new_category,
+ timestamp=timestamp,
+ )
+ session.add(history_entry)
+ session.commit()
+ except Exception as e:
+ session.rollback()
+ applogger.error(f"Error recording category change: {e}")
+
+ def get_category_history(self, ip: str) -> List[Dict[str, Any]]:
+ """
+ Retrieve category change history for a specific IP.
+
+ Args:
+ ip: IP address to get history for
+
+ Returns:
+ List of category change records ordered by timestamp
+ """
+ session = self.session
+ try:
+ sanitized_ip = sanitize_ip(ip)
+ history = (
+ session.query(CategoryHistory)
+ .filter(CategoryHistory.ip == sanitized_ip)
+ .order_by(CategoryHistory.timestamp.asc())
+ .all()
+ )
+
+ return [
+ {
+ "old_category": h.old_category,
+ "new_category": h.new_category,
+ "timestamp": h.timestamp.isoformat(),
+ }
+ for h in history
+ ]
+ finally:
+ self.close_session()
+
+ def update_ip_rep_infos(
+ self,
+ ip: str,
+ country_code: str,
+ asn: str,
+ asn_org: str,
+ list_on: Dict[str, str],
+ city: Optional[str] = None,
+ latitude: Optional[float] = None,
+ longitude: Optional[float] = None,
+ ) -> None:
+ """
+ Update IP rep stats
+
+ Args:
+ ip: IP address
+ country_code: IP address country code
+ asn: IP address ASN
+ asn_org: IP address ASN ORG
+ list_on: public lists containing the IP address
+ city: City name (optional)
+ latitude: Latitude coordinate (optional)
+ longitude: Longitude coordinate (optional)
+
+ """
+ session = self.session
+ try:
+ sanitized_ip = sanitize_ip(ip)
+ ip_stats = session.query(IpStats).filter(IpStats.ip == sanitized_ip).first()
+ if ip_stats:
+ ip_stats.country_code = country_code
+ ip_stats.asn = asn
+ ip_stats.asn_org = asn_org
+ ip_stats.list_on = list_on
+ if city:
+ ip_stats.city = city
+ if latitude is not None:
+ ip_stats.latitude = latitude
+ if longitude is not None:
+ ip_stats.longitude = longitude
+ session.commit()
+        except Exception:
+ session.rollback()
+ raise
+ finally:
+ self.close_session()
+
+ def get_unenriched_ips(self, limit: int = 100) -> List[str]:
+ """
+ Get IPs that don't have complete reputation data yet.
+ Returns IPs without country_code, city, latitude, or longitude data.
+ Excludes RFC1918 private addresses and other non-routable IPs.
+
+ Args:
+ limit: Maximum number of IPs to return
+
+ Returns:
+ List of IP addresses without complete reputation data
+ """
+ from sqlalchemy.exc import OperationalError
+
+ session = self.session
+ try:
+ # Try to query including latitude/longitude (for backward compatibility)
+ try:
+ ips = (
+ session.query(IpStats.ip)
+ .filter(
+ or_(
+ IpStats.country_code.is_(None),
+ IpStats.city.is_(None),
+ IpStats.latitude.is_(None),
+ IpStats.longitude.is_(None),
+ ),
+ ~IpStats.ip.like("10.%"),
+ ~IpStats.ip.like("172.16.%"),
+ ~IpStats.ip.like("172.17.%"),
+ ~IpStats.ip.like("172.18.%"),
+ ~IpStats.ip.like("172.19.%"),
+                    ~IpStats.ip.like("172.2_.%"),  # '_' wildcard: 172.20.x through 172.29.x
+ ~IpStats.ip.like("172.30.%"),
+ ~IpStats.ip.like("172.31.%"),
+ ~IpStats.ip.like("192.168.%"),
+ ~IpStats.ip.like("127.%"),
+ ~IpStats.ip.like("169.254.%"),
+ )
+ .limit(limit)
+ .all()
+ )
+ except OperationalError as e:
+ # If latitude/longitude columns don't exist yet, fall back to old query
+ if "no such column" in str(e).lower():
+ ips = (
+ session.query(IpStats.ip)
+ .filter(
+ or_(IpStats.country_code.is_(None), IpStats.city.is_(None)),
+ ~IpStats.ip.like("10.%"),
+ ~IpStats.ip.like("172.16.%"),
+ ~IpStats.ip.like("172.17.%"),
+ ~IpStats.ip.like("172.18.%"),
+ ~IpStats.ip.like("172.19.%"),
+ ~IpStats.ip.like("172.2_.%"),
+ ~IpStats.ip.like("172.30.%"),
+ ~IpStats.ip.like("172.31.%"),
+ ~IpStats.ip.like("192.168.%"),
+ ~IpStats.ip.like("127.%"),
+ ~IpStats.ip.like("169.254.%"),
+ )
+ .limit(limit)
+ .all()
+ )
+ else:
+ raise
+
+ return [ip[0] for ip in ips]
+ finally:
+ self.close_session()
+
+ def get_access_logs(
+ self,
+ limit: int = 100,
+ offset: int = 0,
+ ip_filter: Optional[str] = None,
+ suspicious_only: bool = False,
+ since_minutes: Optional[int] = None,
+ ) -> List[Dict[str, Any]]:
+ """
+ Retrieve access logs with optional filtering.
+
+ Args:
+ limit: Maximum number of records to return
+ offset: Number of records to skip
+ ip_filter: Filter by IP address
+ suspicious_only: Only return suspicious requests
+ since_minutes: Only return logs from the last N minutes
+
+ Returns:
+ List of access log dictionaries
+ """
+ session = self.session
+ try:
+ query = session.query(AccessLog).order_by(AccessLog.timestamp.desc())
+
+ if ip_filter:
+ query = query.filter(AccessLog.ip == sanitize_ip(ip_filter))
+ if suspicious_only:
+ query = query.filter(AccessLog.is_suspicious == True)
+ if since_minutes is not None:
+ cutoff_time = datetime.now() - timedelta(minutes=since_minutes)
+ query = query.filter(AccessLog.timestamp >= cutoff_time)
+
+ logs = query.offset(offset).limit(limit).all()
+
+ return [
+ {
+ "id": log.id,
+ "ip": log.ip,
+ "path": log.path,
+ "user_agent": log.user_agent,
+ "method": log.method,
+ "is_suspicious": log.is_suspicious,
+ "is_honeypot_trigger": log.is_honeypot_trigger,
+ "timestamp": log.timestamp.isoformat(),
+ "attack_types": [d.attack_type for d in log.attack_detections],
+ }
+ for log in logs
+ ]
+ finally:
+ self.close_session()
+
+ def get_credential_attempts(
+ self, limit: int = 100, offset: int = 0, ip_filter: Optional[str] = None
+ ) -> List[Dict[str, Any]]:
+ """
+ Retrieve credential attempts with optional filtering.
+
+ Args:
+ limit: Maximum number of records to return
+ offset: Number of records to skip
+ ip_filter: Filter by IP address
+
+ Returns:
+ List of credential attempt dictionaries
+ """
+ session = self.session
+ try:
+ query = session.query(CredentialAttempt).order_by(
+ CredentialAttempt.timestamp.desc()
+ )
+
+ if ip_filter:
+ query = query.filter(CredentialAttempt.ip == sanitize_ip(ip_filter))
+
+ attempts = query.offset(offset).limit(limit).all()
+
+ return [
+ {
+ "id": attempt.id,
+ "ip": attempt.ip,
+ "path": attempt.path,
+ "username": attempt.username,
+ "password": attempt.password,
+ "timestamp": attempt.timestamp.isoformat(),
+ }
+ for attempt in attempts
+ ]
+ finally:
+ self.close_session()
+
+ def get_ip_stats(self, limit: int = 100) -> List[Dict[str, Any]]:
+ """
+ Retrieve IP statistics ordered by total requests.
+
+ Args:
+ limit: Maximum number of records to return
+
+ Returns:
+ List of IP stats dictionaries
+ """
+ session = self.session
+ try:
+ stats = (
+ session.query(IpStats)
+ .order_by(IpStats.total_requests.desc())
+ .limit(limit)
+ .all()
+ )
+
+ return [
+ {
+ "ip": s.ip,
+ "total_requests": s.total_requests,
+ "first_seen": s.first_seen.isoformat() if s.first_seen else None,
+ "last_seen": s.last_seen.isoformat() if s.last_seen else None,
+ "country_code": s.country_code,
+ "city": s.city,
+ "asn": s.asn,
+ "asn_org": s.asn_org,
+ "reputation_score": s.reputation_score,
+ "reputation_source": s.reputation_source,
+ "analyzed_metrics": s.analyzed_metrics,
+ "category": s.category,
+ "manual_category": s.manual_category,
+ "last_analysis": (
+ s.last_analysis.isoformat() if s.last_analysis else None
+ ),
+ }
+ for s in stats
+ ]
+ finally:
+ self.close_session()
+
+ def get_ip_stats_by_ip(self, ip: str) -> Optional[Dict[str, Any]]:
+ """
+ Retrieve IP statistics for a specific IP address.
+
+ Args:
+ ip: The IP address to look up
+
+ Returns:
+ Dictionary with IP stats or None if not found
+ """
+ session = self.session
+ try:
+            # 'row' avoids shadowing the stdlib 'stat' module imported above
+            row = session.query(IpStats).filter(IpStats.ip == ip).first()
+
+            if not row:
+                return None
+
+            # Get category history for this IP
+            category_history = self.get_category_history(ip)
+
+            return {
+                "ip": row.ip,
+                "total_requests": row.total_requests,
+                "first_seen": row.first_seen.isoformat() if row.first_seen else None,
+                "last_seen": row.last_seen.isoformat() if row.last_seen else None,
+                "country_code": row.country_code,
+                "city": row.city,
+                "asn": row.asn,
+                "asn_org": row.asn_org,
+                "list_on": row.list_on or {},
+                "reputation_score": row.reputation_score,
+                "reputation_source": row.reputation_source,
+                "analyzed_metrics": row.analyzed_metrics or {},
+                "category": row.category,
+                "category_scores": row.category_scores or {},
+                "manual_category": row.manual_category,
+                "last_analysis": (
+                    row.last_analysis.isoformat() if row.last_analysis else None
+                ),
+                "category_history": category_history,
+            }
+ finally:
+ self.close_session()
+
+ def get_attackers_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 25,
+ sort_by: str = "total_requests",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of attacker IPs ordered by specified field.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (total_requests, first_seen, last_seen)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with attackers list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ # Validate sort parameters
+ valid_sort_fields = {"total_requests", "first_seen", "last_seen"}
+ sort_by = sort_by if sort_by in valid_sort_fields else "total_requests"
+ sort_order = (
+ sort_order.lower() if sort_order.lower() in {"asc", "desc"} else "desc"
+ )
+
+ # Get total count of attackers
+ total_attackers = (
+ session.query(IpStats).filter(IpStats.category == "attacker").count()
+ )
+
+ # Build query with sorting
+ query = session.query(IpStats).filter(IpStats.category == "attacker")
+
+ if sort_by == "total_requests":
+ query = query.order_by(
+ IpStats.total_requests.desc()
+ if sort_order == "desc"
+ else IpStats.total_requests.asc()
+ )
+ elif sort_by == "first_seen":
+ query = query.order_by(
+ IpStats.first_seen.desc()
+ if sort_order == "desc"
+ else IpStats.first_seen.asc()
+ )
+ elif sort_by == "last_seen":
+ query = query.order_by(
+ IpStats.last_seen.desc()
+ if sort_order == "desc"
+ else IpStats.last_seen.asc()
+ )
+
+ # Get paginated attackers
+ attackers = query.offset(offset).limit(page_size).all()
+
+ total_pages = (total_attackers + page_size - 1) // page_size
+
+ return {
+ "attackers": [
+ {
+ "ip": a.ip,
+ "total_requests": a.total_requests,
+ "first_seen": (
+ a.first_seen.isoformat() if a.first_seen else None
+ ),
+ "last_seen": a.last_seen.isoformat() if a.last_seen else None,
+ "country_code": a.country_code,
+ "city": a.city,
+ "latitude": a.latitude,
+ "longitude": a.longitude,
+ "asn": a.asn,
+ "asn_org": a.asn_org,
+ "reputation_score": a.reputation_score,
+ "reputation_source": a.reputation_source,
+ "category": a.category,
+ "category_scores": a.category_scores or {},
+ }
+ for a in attackers
+ ],
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total_attackers": total_attackers,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_all_ips_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 25,
+ sort_by: str = "total_requests",
+ sort_order: str = "desc",
+ categories: Optional[List[str]] = None,
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of all IPs (or filtered by categories) ordered by specified field.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (total_requests, first_seen, last_seen)
+ sort_order: Sort order (asc or desc)
+ categories: Optional list of categories to filter by
+
+ Returns:
+ Dictionary with IPs list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ # Validate sort parameters
+ valid_sort_fields = {"total_requests", "first_seen", "last_seen"}
+ sort_by = sort_by if sort_by in valid_sort_fields else "total_requests"
+ sort_order = (
+ sort_order.lower() if sort_order.lower() in {"asc", "desc"} else "desc"
+ )
+
+ # Build query with optional category filter
+ query = session.query(IpStats)
+ if categories:
+ query = query.filter(IpStats.category.in_(categories))
+
+ # Get total count
+ total_ips = query.count()
+
+ # Apply sorting
+ if sort_by == "total_requests":
+ query = query.order_by(
+ IpStats.total_requests.desc()
+ if sort_order == "desc"
+ else IpStats.total_requests.asc()
+ )
+ elif sort_by == "first_seen":
+ query = query.order_by(
+ IpStats.first_seen.desc()
+ if sort_order == "desc"
+ else IpStats.first_seen.asc()
+ )
+ elif sort_by == "last_seen":
+ query = query.order_by(
+ IpStats.last_seen.desc()
+ if sort_order == "desc"
+ else IpStats.last_seen.asc()
+ )
+
+ # Get paginated IPs
+ ips = query.offset(offset).limit(page_size).all()
+
+ total_pages = (total_ips + page_size - 1) // page_size
+
+ return {
+ "ips": [
+ {
+ "ip": ip.ip,
+ "total_requests": ip.total_requests,
+ "first_seen": (
+ ip.first_seen.isoformat() if ip.first_seen else None
+ ),
+ "last_seen": ip.last_seen.isoformat() if ip.last_seen else None,
+ "country_code": ip.country_code,
+ "city": ip.city,
+ "latitude": ip.latitude,
+ "longitude": ip.longitude,
+ "asn": ip.asn,
+ "asn_org": ip.asn_org,
+ "reputation_score": ip.reputation_score,
+ "reputation_source": ip.reputation_source,
+ "category": ip.category,
+ "category_scores": ip.category_scores or {},
+ }
+ for ip in ips
+ ],
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_ips,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_dashboard_counts(self) -> Dict[str, int]:
+ """
+ Get aggregate statistics for the dashboard (excludes local/private IPs and server IP).
+
+ Returns:
+ Dictionary with total_accesses, unique_ips, unique_paths,
+ suspicious_accesses, honeypot_triggered, honeypot_ips
+ """
+ session = self.session
+ try:
+ # Get server IP to filter it out
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ # Get all accesses first, then filter out local IPs and server IP
+ all_accesses = session.query(AccessLog).all()
+
+ # Filter out local/private IPs and server IP
+ public_accesses = [
+ log for log in all_accesses if is_valid_public_ip(log.ip, server_ip)
+ ]
+
+ # Calculate counts from filtered data
+ total_accesses = len(public_accesses)
+ unique_ips = len(set(log.ip for log in public_accesses))
+ unique_paths = len(set(log.path for log in public_accesses))
+ suspicious_accesses = sum(1 for log in public_accesses if log.is_suspicious)
+ honeypot_triggered = sum(
+ 1 for log in public_accesses if log.is_honeypot_trigger
+ )
+ honeypot_ips = len(
+ set(log.ip for log in public_accesses if log.is_honeypot_trigger)
+ )
+
+ # Count unique attackers from IpStats (matching the "Attackers by Total Requests" table)
+ unique_attackers = (
+ session.query(IpStats).filter(IpStats.category == "attacker").count()
+ )
+
+ return {
+ "total_accesses": total_accesses,
+ "unique_ips": unique_ips,
+ "unique_paths": unique_paths,
+ "suspicious_accesses": suspicious_accesses,
+ "honeypot_triggered": honeypot_triggered,
+ "honeypot_ips": honeypot_ips,
+ "unique_attackers": unique_attackers,
+ }
+ finally:
+ self.close_session()
+
+ def get_top_ips(self, limit: int = 10) -> List[tuple]:
+ """
+ Get top IP addresses by access count (excludes local/private IPs and server IP).
+
+ Args:
+ limit: Maximum number of results
+
+ Returns:
+ List of (ip, count) tuples ordered by count descending
+ """
+ session = self.session
+ try:
+ # Get server IP to filter it out
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ results = (
+ session.query(AccessLog.ip, func.count(AccessLog.id).label("count"))
+ .group_by(AccessLog.ip)
+ .order_by(func.count(AccessLog.id).desc())
+ .all()
+ )
+
+ # Filter out local/private IPs and server IP, then limit results
+ filtered = [
+ (row.ip, row.count)
+ for row in results
+ if is_valid_public_ip(row.ip, server_ip)
+ ]
+ return filtered[:limit]
+ finally:
+ self.close_session()
+
+ def get_top_paths(self, limit: int = 10) -> List[tuple]:
+ """
+ Get top paths by access count.
+
+ Args:
+ limit: Maximum number of results
+
+ Returns:
+ List of (path, count) tuples ordered by count descending
+ """
+ session = self.session
+ try:
+ results = (
+ session.query(AccessLog.path, func.count(AccessLog.id).label("count"))
+ .group_by(AccessLog.path)
+ .order_by(func.count(AccessLog.id).desc())
+ .limit(limit)
+ .all()
+ )
+
+ return [(row.path, row.count) for row in results]
+ finally:
+ self.close_session()
+
+ def get_top_user_agents(self, limit: int = 10) -> List[tuple]:
+ """
+ Get top user agents by access count.
+
+ Args:
+ limit: Maximum number of results
+
+ Returns:
+ List of (user_agent, count) tuples ordered by count descending
+ """
+ session = self.session
+ try:
+ results = (
+ session.query(
+ AccessLog.user_agent, func.count(AccessLog.id).label("count")
+ )
+ .filter(AccessLog.user_agent.isnot(None), AccessLog.user_agent != "")
+ .group_by(AccessLog.user_agent)
+ .order_by(func.count(AccessLog.id).desc())
+ .limit(limit)
+ .all()
+ )
+
+ return [(row.user_agent, row.count) for row in results]
+ finally:
+ self.close_session()
+
+ def get_recent_suspicious(self, limit: int = 20) -> List[Dict[str, Any]]:
+ """
+ Get recent suspicious access attempts (excludes local/private IPs and server IP).
+
+ Args:
+ limit: Maximum number of results
+
+ Returns:
+ List of access log dictionaries with is_suspicious=True
+ """
+ session = self.session
+ try:
+ # Get server IP to filter it out
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ logs = (
+ session.query(AccessLog)
+ .filter(AccessLog.is_suspicious == True)
+ .order_by(AccessLog.timestamp.desc())
+ .all()
+ )
+
+ # Filter out local/private IPs and server IP
+ filtered_logs = [
+ log for log in logs if is_valid_public_ip(log.ip, server_ip)
+ ]
+
+ return [
+ {
+ "ip": log.ip,
+ "path": log.path,
+ "user_agent": log.user_agent,
+ "timestamp": log.timestamp.isoformat(),
+ }
+ for log in filtered_logs[:limit]
+ ]
+ finally:
+ self.close_session()
+
+ def get_honeypot_triggered_ips(self) -> List[tuple]:
+ """
+ Get IPs that triggered honeypot paths with the paths they accessed
+ (excludes local/private IPs and server IP).
+
+ Returns:
+ List of (ip, [paths]) tuples
+ """
+ session = self.session
+ try:
+ # Get server IP to filter it out
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ # Get all honeypot triggers grouped by IP
+ results = (
+ session.query(AccessLog.ip, AccessLog.path)
+ .filter(AccessLog.is_honeypot_trigger == True)
+ .all()
+ )
+
+ # Group paths by IP, filtering out local/private IPs and server IP
+ ip_paths: Dict[str, List[str]] = {}
+ for row in results:
+ # Skip invalid IPs
+ if not is_valid_public_ip(row.ip, server_ip):
+ continue
+ if row.ip not in ip_paths:
+ ip_paths[row.ip] = []
+ if row.path not in ip_paths[row.ip]:
+ ip_paths[row.ip].append(row.path)
+
+ return [(ip, paths) for ip, paths in ip_paths.items()]
+ finally:
+ self.close_session()
+
+ def get_recent_attacks(self, limit: int = 20) -> List[Dict[str, Any]]:
+ """
+ Get recent access logs that have attack detections.
+
+ Args:
+ limit: Maximum number of results
+
+ Returns:
+ List of access log dicts with attack_types included
+ """
+ session = self.session
+ try:
+ # Get access logs that have attack detections
+ logs = (
+ session.query(AccessLog)
+ .join(AttackDetection)
+ .order_by(AccessLog.timestamp.desc())
+ .limit(limit)
+ .all()
+ )
+
+ return [
+ {
+ "ip": log.ip,
+ "path": log.path,
+ "user_agent": log.user_agent,
+ "timestamp": log.timestamp.isoformat(),
+ "attack_types": [d.attack_type for d in log.attack_detections],
+ }
+ for log in logs
+ ]
+ finally:
+ self.close_session()
+
+ def get_honeypot_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "count",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of honeypot-triggered IPs with their paths.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (count or ip)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with honeypots list and pagination info
+ """
+ session = self.session
+ try:
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ offset = (page - 1) * page_size
+
+ # Get honeypot triggers grouped by IP
+ results = (
+ session.query(AccessLog.ip, AccessLog.path)
+ .filter(AccessLog.is_honeypot_trigger == True)
+ .all()
+ )
+
+ # Group paths by IP, filtering out invalid IPs
+ ip_paths: Dict[str, List[str]] = {}
+ for row in results:
+ if not is_valid_public_ip(row.ip, server_ip):
+ continue
+ if row.ip not in ip_paths:
+ ip_paths[row.ip] = []
+ if row.path not in ip_paths[row.ip]:
+ ip_paths[row.ip].append(row.path)
+
+ # Create list and sort
+ honeypot_list = [
+ {"ip": ip, "paths": paths, "count": len(paths)}
+ for ip, paths in ip_paths.items()
+ ]
+
+ if sort_by == "count":
+ honeypot_list.sort(
+ key=lambda x: x["count"], reverse=(sort_order == "desc")
+ )
+ else: # sort by ip
+ honeypot_list.sort(
+ key=lambda x: x["ip"], reverse=(sort_order == "desc")
+ )
+
+ total_honeypots = len(honeypot_list)
+ paginated = honeypot_list[offset : offset + page_size]
+ total_pages = (total_honeypots + page_size - 1) // page_size
+
+ return {
+ "honeypots": paginated,
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_honeypots,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_credentials_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "timestamp",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of credential attempts.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (timestamp, ip, username)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with credentials list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ # Validate sort parameters
+ valid_sort_fields = {"timestamp", "ip", "username"}
+ sort_by = sort_by if sort_by in valid_sort_fields else "timestamp"
+ sort_order = (
+ sort_order.lower() if sort_order.lower() in {"asc", "desc"} else "desc"
+ )
+
+ total_credentials = session.query(CredentialAttempt).count()
+
+ # Build query with sorting
+ query = session.query(CredentialAttempt)
+
+ if sort_by == "timestamp":
+ query = query.order_by(
+ CredentialAttempt.timestamp.desc()
+ if sort_order == "desc"
+ else CredentialAttempt.timestamp.asc()
+ )
+ elif sort_by == "ip":
+ query = query.order_by(
+ CredentialAttempt.ip.desc()
+ if sort_order == "desc"
+ else CredentialAttempt.ip.asc()
+ )
+ elif sort_by == "username":
+ query = query.order_by(
+ CredentialAttempt.username.desc()
+ if sort_order == "desc"
+ else CredentialAttempt.username.asc()
+ )
+
+ credentials = query.offset(offset).limit(page_size).all()
+ total_pages = (total_credentials + page_size - 1) // page_size
+
+ return {
+ "credentials": [
+ {
+ "ip": c.ip,
+ "username": c.username,
+ "password": c.password,
+ "path": c.path,
+ "timestamp": c.timestamp.isoformat() if c.timestamp else None,
+ }
+ for c in credentials
+ ],
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_credentials,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_top_ips_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "count",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of top IP addresses by access count.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (count or ip)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with IPs list and pagination info
+ """
+ session = self.session
+ try:
+ from config import get_config
+
+ config = get_config()
+ server_ip = config.get_server_ip()
+
+ offset = (page - 1) * page_size
+
+ results = (
+ session.query(AccessLog.ip, func.count(AccessLog.id).label("count"))
+ .group_by(AccessLog.ip)
+ .all()
+ )
+
+ # Filter out local/private IPs and server IP, then sort
+ filtered = [
+ {"ip": row.ip, "count": row.count}
+ for row in results
+ if is_valid_public_ip(row.ip, server_ip)
+ ]
+
+ if sort_by == "count":
+ filtered.sort(key=lambda x: x["count"], reverse=(sort_order == "desc"))
+ else: # sort by ip
+ filtered.sort(key=lambda x: x["ip"], reverse=(sort_order == "desc"))
+
+ total_ips = len(filtered)
+ paginated = filtered[offset : offset + page_size]
+ total_pages = (total_ips + page_size - 1) // page_size
+
+ return {
+ "ips": paginated,
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_ips,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_top_paths_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "count",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of top paths by access count.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (count or path)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with paths list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ results = (
+ session.query(AccessLog.path, func.count(AccessLog.id).label("count"))
+ .group_by(AccessLog.path)
+ .all()
+ )
+
+ # Create list and sort
+ paths_list = [{"path": row.path, "count": row.count} for row in results]
+
+ if sort_by == "count":
+ paths_list.sort(
+ key=lambda x: x["count"], reverse=(sort_order == "desc")
+ )
+ else: # sort by path
+ paths_list.sort(key=lambda x: x["path"], reverse=(sort_order == "desc"))
+
+ total_paths = len(paths_list)
+ paginated = paths_list[offset : offset + page_size]
+ total_pages = (total_paths + page_size - 1) // page_size
+
+ return {
+ "paths": paginated,
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_paths,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_top_user_agents_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "count",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of top user agents by access count.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (count or user_agent)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with user agents list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ results = (
+ session.query(
+ AccessLog.user_agent, func.count(AccessLog.id).label("count")
+ )
+ .filter(AccessLog.user_agent.isnot(None), AccessLog.user_agent != "")
+ .group_by(AccessLog.user_agent)
+ .all()
+ )
+
+ # Create list and sort
+ ua_list = [
+ {"user_agent": row.user_agent, "count": row.count} for row in results
+ ]
+
+ if sort_by == "count":
+ ua_list.sort(key=lambda x: x["count"], reverse=(sort_order == "desc"))
+ else: # sort by user_agent
+ ua_list.sort(
+ key=lambda x: x["user_agent"], reverse=(sort_order == "desc")
+ )
+
+ total_uas = len(ua_list)
+ paginated = ua_list[offset : offset + page_size]
+ total_pages = (total_uas + page_size - 1) // page_size
+
+ return {
+ "user_agents": paginated,
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_uas,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+ def get_attack_types_paginated(
+ self,
+ page: int = 1,
+ page_size: int = 5,
+ sort_by: str = "timestamp",
+ sort_order: str = "desc",
+ ) -> Dict[str, Any]:
+ """
+ Retrieve paginated list of detected attack types with access logs.
+
+ Args:
+ page: Page number (1-indexed)
+ page_size: Number of results per page
+ sort_by: Field to sort by (timestamp, ip, attack_type)
+ sort_order: Sort order (asc or desc)
+
+ Returns:
+ Dictionary with attacks list and pagination info
+ """
+ session = self.session
+ try:
+ offset = (page - 1) * page_size
+
+ # Validate sort parameters
+ valid_sort_fields = {"timestamp", "ip", "attack_type"}
+ sort_by = sort_by if sort_by in valid_sort_fields else "timestamp"
+ sort_order = (
+ sort_order.lower() if sort_order.lower() in {"asc", "desc"} else "desc"
+ )
+
+ # Get all access logs with attack detections
+ # distinct() avoids duplicate rows when a log has several detections
+ query = session.query(AccessLog).join(AttackDetection).distinct()
+
+ if sort_by == "timestamp":
+ query = query.order_by(
+ AccessLog.timestamp.desc()
+ if sort_order == "desc"
+ else AccessLog.timestamp.asc()
+ )
+ elif sort_by == "ip":
+ query = query.order_by(
+ AccessLog.ip.desc() if sort_order == "desc" else AccessLog.ip.asc()
+ )
+
+ logs = query.all()
+
+ # Convert to attack list
+ attack_list = [
+ {
+ "ip": log.ip,
+ "path": log.path,
+ "user_agent": log.user_agent,
+ "timestamp": log.timestamp.isoformat() if log.timestamp else None,
+ "attack_types": [d.attack_type for d in log.attack_detections],
+ }
+ for log in logs
+ ]
+
+ # Sort by attack_type if needed (this must be done post-fetch since it's in a related table)
+ if sort_by == "attack_type":
+ attack_list.sort(
+ key=lambda x: x["attack_types"][0] if x["attack_types"] else "",
+ reverse=(sort_order == "desc"),
+ )
+
+ total_attacks = len(attack_list)
+ paginated = attack_list[offset : offset + page_size]
+ total_pages = (total_attacks + page_size - 1) // page_size
+
+ return {
+ "attacks": paginated,
+ "pagination": {
+ "page": page,
+ "page_size": page_size,
+ "total": total_attacks,
+ "total_pages": total_pages,
+ },
+ }
+ finally:
+ self.close_session()
+
+
+# Module-level singleton instance
+_db_manager = DatabaseManager()
+
+
+def get_database() -> DatabaseManager:
+ """Get the database manager singleton instance."""
+ return _db_manager
+
+
+def initialize_database(database_path: str = "data/krawl.db") -> None:
+ """Initialize the database system."""
+ _db_manager.initialize(database_path)
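Every paginated helper above ends with the same arithmetic: a 1-indexed offset, a list slice, and ceiling division for the page count. A minimal standalone sketch of that sort-then-slice pattern (the `paginate` name and field names are illustrative, not part of the patch):

```python
from typing import Any, Dict, List


def paginate(items: List[Dict[str, Any]], page: int, page_size: int,
             sort_key: str, descending: bool = True) -> Dict[str, Any]:
    """Sort a result list and slice out one page, mirroring the helpers above."""
    ordered = sorted(items, key=lambda x: x[sort_key], reverse=descending)
    offset = (page - 1) * page_size  # pages are 1-indexed
    total = len(ordered)
    total_pages = (total + page_size - 1) // page_size  # ceiling division
    return {
        "items": ordered[offset:offset + page_size],
        "pagination": {"page": page, "page_size": page_size,
                       "total": total, "total_pages": total_pages},
    }
```

This keeps the pagination metadata shape identical across endpoints, so the dashboard can render any of the paginated lists with one component.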
diff --git a/src/generators.py b/src/generators.py
index 16c0c32..fd29f38 100644
--- a/src/generators.py
+++ b/src/generators.py
@@ -9,6 +9,7 @@ import string
import json
from templates import html_templates
from wordlists import get_wordlists
+from config import get_config
def random_username() -> str:
@@ -21,10 +22,10 @@ def random_password() -> str:
"""Generate random password"""
wl = get_wordlists()
templates = [
- lambda: ''.join(random.choices(string.ascii_letters + string.digits, k=12)),
+ lambda: "".join(random.choices(string.ascii_letters + string.digits, k=12)),
lambda: f"{random.choice(wl.password_prefixes)}{random.randint(100, 999)}!",
lambda: f"{random.choice(wl.simple_passwords)}{random.randint(1000, 9999)}",
- lambda: ''.join(random.choices(string.ascii_lowercase, k=8)),
+ lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
]
return random.choice(templates)()
@@ -37,10 +38,19 @@ def random_email(username: str = None) -> str:
return f"{username}@{random.choice(wl.email_domains)}"
+def random_server_header() -> str:
+ """Generate random server header from wordlists"""
+ config = get_config()
+ if config.server_header:
+ return config.server_header
+ wl = get_wordlists()
+ return random.choice(wl.server_headers)
+
+
def random_api_key() -> str:
"""Generate random API key"""
wl = get_wordlists()
- key = ''.join(random.choices(string.ascii_letters + string.digits, k=32))
+ key = "".join(random.choices(string.ascii_letters + string.digits, k=32))
return random.choice(wl.api_key_prefixes) + key
@@ -80,14 +90,16 @@ def users_json() -> str:
users = []
for i in range(random.randint(3, 8)):
username = random_username()
- users.append({
- "id": i + 1,
- "username": username,
- "email": random_email(username),
- "password": random_password(),
- "role": random.choice(wl.user_roles),
- "api_token": random_api_key()
- })
+ users.append(
+ {
+ "id": i + 1,
+ "username": username,
+ "email": random_email(username),
+ "password": random_password(),
+ "role": random.choice(wl.user_roles),
+ "api_token": random_api_key(),
+ }
+ )
return json.dumps({"users": users}, indent=2)
@@ -95,20 +107,28 @@ def api_keys_json() -> str:
"""Generate fake api_keys.json with random data"""
keys = {
"stripe": {
- "public_key": "pk_live_" + ''.join(random.choices(string.ascii_letters + string.digits, k=24)),
- "secret_key": random_api_key()
+ "public_key": "pk_live_"
+ + "".join(random.choices(string.ascii_letters + string.digits, k=24)),
+ "secret_key": random_api_key(),
},
"aws": {
- "access_key_id": "AKIA" + ''.join(random.choices(string.ascii_uppercase + string.digits, k=16)),
- "secret_access_key": ''.join(random.choices(string.ascii_letters + string.digits + '+/', k=40))
+ "access_key_id": "AKIA"
+ + "".join(random.choices(string.ascii_uppercase + string.digits, k=16)),
+ "secret_access_key": "".join(
+ random.choices(string.ascii_letters + string.digits + "+/", k=40)
+ ),
},
"sendgrid": {
- "api_key": "SG." + ''.join(random.choices(string.ascii_letters + string.digits, k=48))
+ "api_key": "SG."
+ + "".join(random.choices(string.ascii_letters + string.digits, k=48))
},
"twilio": {
- "account_sid": "AC" + ''.join(random.choices(string.ascii_lowercase + string.digits, k=32)),
- "auth_token": ''.join(random.choices(string.ascii_lowercase + string.digits, k=32))
- }
+ "account_sid": "AC"
+ + "".join(random.choices(string.ascii_lowercase + string.digits, k=32)),
+ "auth_token": "".join(
+ random.choices(string.ascii_lowercase + string.digits, k=32)
+ ),
+ },
}
return json.dumps(keys, indent=2)
@@ -116,51 +136,70 @@ def api_keys_json() -> str:
def api_response(path: str) -> str:
"""Generate fake API JSON responses with random data"""
wl = get_wordlists()
-
+
def random_users(count: int = 3):
users = []
for i in range(count):
username = random_username()
- users.append({
- "id": i + 1,
- "username": username,
- "email": random_email(username),
- "role": random.choice(wl.user_roles)
- })
+ users.append(
+ {
+ "id": i + 1,
+ "username": username,
+ "email": random_email(username),
+ "role": random.choice(wl.user_roles),
+ }
+ )
return users
-
+
responses = {
- '/api/users': json.dumps({
- "users": random_users(random.randint(2, 5)),
- "total": random.randint(50, 500)
- }, indent=2),
- '/api/v1/users': json.dumps({
- "status": "success",
- "data": [{
- "id": random.randint(1, 100),
- "name": random_username(),
- "api_key": random_api_key()
- }]
- }, indent=2),
- '/api/v2/secrets': json.dumps({
- "database": {
- "host": random.choice(wl.database_hosts),
- "username": random_username(),
- "password": random_password(),
- "database": random_database_name()
+ "/api/users": json.dumps(
+ {
+ "users": random_users(random.randint(2, 5)),
+ "total": random.randint(50, 500),
},
- "api_keys": {
- "stripe": random_api_key(),
- "aws": 'AKIA' + ''.join(random.choices(string.ascii_uppercase + string.digits, k=16))
- }
- }, indent=2),
- '/api/config': json.dumps({
- "app_name": random.choice(wl.application_names),
- "debug": random.choice([True, False]),
- "secret_key": random_api_key(),
- "database_url": f"postgresql://{random_username()}:{random_password()}@localhost/{random_database_name()}"
- }, indent=2),
- '/.env': f"""APP_NAME={random.choice(wl.application_names)}
+ indent=2,
+ ),
+ "/api/v1/users": json.dumps(
+ {
+ "status": "success",
+ "data": [
+ {
+ "id": random.randint(1, 100),
+ "name": random_username(),
+ "api_key": random_api_key(),
+ }
+ ],
+ },
+ indent=2,
+ ),
+ "/api/v2/secrets": json.dumps(
+ {
+ "database": {
+ "host": random.choice(wl.database_hosts),
+ "username": random_username(),
+ "password": random_password(),
+ "database": random_database_name(),
+ },
+ "api_keys": {
+ "stripe": random_api_key(),
+ "aws": "AKIA"
+ + "".join(
+ random.choices(string.ascii_uppercase + string.digits, k=16)
+ ),
+ },
+ },
+ indent=2,
+ ),
+ "/api/config": json.dumps(
+ {
+ "app_name": random.choice(wl.application_names),
+ "debug": random.choice([True, False]),
+ "secret_key": random_api_key(),
+ "database_url": f"postgresql://{random_username()}:{random_password()}@localhost/{random_database_name()}",
+ },
+ indent=2,
+ ),
+ "/.env": f"""APP_NAME={random.choice(wl.application_names)}
DEBUG={random.choice(['true', 'false'])}
APP_KEY=base64:{''.join(random.choices(string.ascii_letters + string.digits, k=32))}=
DB_CONNECTION=mysql
@@ -172,7 +211,7 @@ DB_PASSWORD={random_password()}
AWS_ACCESS_KEY_ID=AKIA{''.join(random.choices(string.ascii_uppercase + string.digits, k=16))}
AWS_SECRET_ACCESS_KEY={''.join(random.choices(string.ascii_letters + string.digits + '+/', k=40))}
STRIPE_SECRET={random_api_key()}
-"""
+""",
}
return responses.get(path, json.dumps({"error": "Not found"}, indent=2))
@@ -180,11 +219,13 @@ STRIPE_SECRET={random_api_key()}
def directory_listing(path: str) -> str:
"""Generate fake directory listing using wordlists"""
wl = get_wordlists()
-
+
files = wl.directory_files
dirs = wl.directory_dirs
-
- selected_files = [(f, random.randint(1024, 1024*1024))
- for f in random.sample(files, min(6, len(files)))]
-
+
+ selected_files = [
+ (f, random.randint(1024, 1024 * 1024))
+ for f in random.sample(files, min(6, len(files)))
+ ]
+
return html_templates.directory_listing(path, dirs, selected_files)
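The fake-secret builders reformatted above all follow one pattern: a recognizable vendor prefix plus a random alphanumeric body of a plausible length. A standalone sketch of that pattern (the `fake_secret` helper is illustrative; only the prefixes and lengths come from the patch):

```python
import random
import string


def fake_secret(prefix: str, alphabet: str, length: int) -> str:
    """Recognizable vendor prefix + random body, as in api_keys_json() above."""
    return prefix + "".join(random.choices(alphabet, k=length))


# Illustrative fakes in the same shape as the generators above
fake_aws_key = fake_secret("AKIA", string.ascii_uppercase + string.digits, 16)
fake_stripe_key = fake_secret("pk_live_", string.ascii_letters + string.digits, 24)
```

Matching the real formats (prefix and length) is what makes the bait credentials trip automated secret scanners.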
diff --git a/src/geo_utils.py b/src/geo_utils.py
new file mode 100644
index 0000000..d11f01c
--- /dev/null
+++ b/src/geo_utils.py
@@ -0,0 +1,113 @@
+#!/usr/bin/env python3
+"""
+Geolocation utilities for reverse geocoding and city lookups.
+"""
+
+import requests
+from typing import Optional
+from logger import get_app_logger
+
+app_logger = get_app_logger()
+
+# Simple city name cache to avoid repeated API calls
+_city_cache = {}
+
+
+def reverse_geocode_city(latitude: float, longitude: float) -> Optional[str]:
+ """
+ Reverse geocode coordinates to get city name using Nominatim (OpenStreetMap).
+
+ Args:
+ latitude: Latitude coordinate
+ longitude: Longitude coordinate
+
+ Returns:
+ City name or None if not found
+ """
+ # Check cache first
+ cache_key = f"{latitude},{longitude}"
+ if cache_key in _city_cache:
+ return _city_cache[cache_key]
+
+ try:
+ # Use Nominatim reverse geocoding API (free, no API key required)
+ url = "https://nominatim.openstreetmap.org/reverse"
+ params = {
+ "lat": latitude,
+ "lon": longitude,
+ "format": "json",
+ "zoom": 10, # City level
+ "addressdetails": 1,
+ }
+ headers = {"User-Agent": "Krawl-Honeypot/1.0"} # Required by Nominatim ToS
+
+ response = requests.get(url, params=params, headers=headers, timeout=5)
+ response.raise_for_status()
+
+ data = response.json()
+ address = data.get("address", {})
+
+ # Try to get city from various possible fields
+ city = (
+ address.get("city")
+ or address.get("town")
+ or address.get("village")
+ or address.get("municipality")
+ or address.get("county")
+ )
+
+ # Cache the result
+ _city_cache[cache_key] = city
+
+ if city:
+ app_logger.debug(f"Reverse geocoded {latitude},{longitude} to {city}")
+
+ return city
+
+ except requests.RequestException as e:
+ app_logger.warning(f"Reverse geocoding failed for {latitude},{longitude}: {e}")
+ return None
+ except Exception as e:
+ app_logger.error(f"Error in reverse geocoding: {e}")
+ return None
+
+
+def get_most_recent_geoip_data(results: list) -> Optional[dict]:
+ """
+ Extract the most recent geoip_data from API results.
+ Results are assumed to be sorted by record_added (most recent first).
+
+ Args:
+ results: List of result dictionaries from IP reputation API
+
+ Returns:
+ Most recent geoip_data dict or None
+ """
+ if not results:
+ return None
+
+ # The first result is the most recent (sorted by record_added)
+ most_recent = results[0]
+ return most_recent.get("geoip_data")
+
+
+def extract_city_from_coordinates(geoip_data: dict) -> Optional[str]:
+ """
+ Extract city name from geoip_data using reverse geocoding.
+
+ Args:
+ geoip_data: Dictionary containing location_latitude and location_longitude
+
+ Returns:
+ City name or None
+ """
+ if not geoip_data:
+ return None
+
+ latitude = geoip_data.get("location_latitude")
+ longitude = geoip_data.get("location_longitude")
+
+ if latitude is None or longitude is None:
+ return None
+
+ return reverse_geocode_city(latitude, longitude)
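`reverse_geocode_city` keys its cache on the `"lat,lon"` string so repeated lookups of the same coordinates never hit Nominatim twice, and it caches negative results as well. A minimal sketch of that memoization (the `cached_lookup` and `fetch` names are illustrative; `fetch` stands in for the HTTP call):

```python
from typing import Callable, Dict, Optional

_cache: Dict[str, Optional[str]] = {}


def cached_lookup(lat: float, lon: float,
                  fetch: Callable[[float, float], Optional[str]]) -> Optional[str]:
    """Memoize by coordinate string, as reverse_geocode_city does above."""
    key = f"{lat},{lon}"
    if key in _cache:
        return _cache[key]
    city = fetch(lat, lon)
    _cache[key] = city  # negative (None) results are cached too
    return city
```

Caching None is deliberate: coordinates that resolve to no city would otherwise trigger a fresh API call on every access-log row.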
diff --git a/src/handler.py b/src/handler.py
index c93b78b..0a6abb2 100644
--- a/src/handler.py
+++ b/src/handler.py
@@ -10,11 +10,17 @@ from urllib.parse import urlparse, parse_qs
from config import Config
from tracker import AccessTracker
+from analyzer import Analyzer
from templates import html_templates
from templates.dashboard_template import generate_dashboard
from generators import (
- credentials_txt, passwords_txt, users_json, api_keys_json,
- api_response, directory_listing
+ credentials_txt,
+ passwords_txt,
+ users_json,
+ api_keys_json,
+ api_response,
+ directory_listing,
+ random_server_header,
)
from wordlists import get_wordlists
from sql_errors import generate_sql_error_response, get_sql_response_with_data
@@ -24,9 +30,11 @@ from server_errors import generate_server_error
class Handler(BaseHTTPRequestHandler):
"""HTTP request handler for the deception server"""
+
webpages: Optional[List[str]] = None
config: Config = None
tracker: AccessTracker = None
+ analyzer: Analyzer = None
counter: int = 0
app_logger: logging.Logger = None
access_logger: logging.Logger = None
@@ -35,28 +43,40 @@ class Handler(BaseHTTPRequestHandler):
def _get_client_ip(self) -> str:
"""Extract client IP address from request, checking proxy headers first"""
# Headers might not be available during early error logging
- if hasattr(self, 'headers') and self.headers:
+ if hasattr(self, "headers") and self.headers:
# Check X-Forwarded-For header (set by load balancers/proxies)
- forwarded_for = self.headers.get('X-Forwarded-For')
+ forwarded_for = self.headers.get("X-Forwarded-For")
if forwarded_for:
# X-Forwarded-For can contain multiple IPs, get the first (original client)
- return forwarded_for.split(',')[0].strip()
-
+ return forwarded_for.split(",")[0].strip()
+
# Check X-Real-IP header (set by nginx and other proxies)
- real_ip = self.headers.get('X-Real-IP')
+ real_ip = self.headers.get("X-Real-IP")
if real_ip:
return real_ip.strip()
-
+
# Fallback to direct connection IP
return self.client_address[0]
def _get_user_agent(self) -> str:
"""Extract user agent from request"""
- return self.headers.get('User-Agent', '')
+ return self.headers.get("User-Agent", "")
+
+ def _get_category_by_ip(self, client_ip: str) -> str:
+ """Get the category of an IP from the database"""
+ return self.tracker.get_category_by_ip(client_ip)
+
+ def _get_page_visit_count(self, client_ip: str) -> int:
+ """Get current page visit count for an IP"""
+ return self.tracker.get_page_visit_count(client_ip)
+
+ def _increment_page_visit(self, client_ip: str) -> int:
+ """Increment page visit counter for an IP and return new count"""
+ return self.tracker.increment_page_visit(client_ip)
def version_string(self) -> str:
"""Return custom server version for deception."""
- return self.config.server_header
+ return random_server_header()
def _should_return_error(self) -> bool:
"""Check if we should return an error based on probability"""
@@ -71,53 +91,61 @@ class Handler(BaseHTTPRequestHandler):
if not error_codes:
error_codes = [400, 401, 403, 404, 500, 502, 503]
return random.choice(error_codes)
-
+
def _parse_query_string(self) -> str:
"""Extract query string from the request path"""
parsed = urlparse(self.path)
return parsed.query
-
+
def _handle_sql_endpoint(self, path: str) -> bool:
"""
Handle SQL injection honeypot endpoints.
Returns True if the path was handled, False otherwise.
"""
# SQL-vulnerable endpoints
- sql_endpoints = ['/api/search', '/api/sql', '/api/database']
-
+ sql_endpoints = ["/api/search", "/api/sql", "/api/database"]
+
base_path = urlparse(path).path
if base_path not in sql_endpoints:
return False
-
+
try:
# Get query parameters
query_string = self._parse_query_string()
-
+
# Log SQL injection attempt
client_ip = self._get_client_ip()
user_agent = self._get_user_agent()
-
+
# Always check for SQL injection patterns
- error_msg, content_type, status_code = generate_sql_error_response(query_string or "")
-
+ error_msg, content_type, status_code = generate_sql_error_response(
+ query_string or ""
+ )
+
if error_msg:
# SQL injection detected - log and return error
- self.access_logger.warning(f"[SQL INJECTION DETECTED] {client_ip} - {base_path} - Query: {query_string[:100] if query_string else 'empty'}")
+ self.access_logger.warning(
+ f"[SQL INJECTION DETECTED] {client_ip} - {base_path} - Query: {query_string[:100] if query_string else 'empty'}"
+ )
self.send_response(status_code)
- self.send_header('Content-type', content_type)
+ self.send_header("Content-type", content_type)
self.end_headers()
self.wfile.write(error_msg.encode())
else:
# No injection detected - return fake data
- self.access_logger.info(f"[SQL ENDPOINT] {client_ip} - {base_path} - Query: {query_string[:100] if query_string else 'empty'}")
+ self.access_logger.info(
+ f"[SQL ENDPOINT] {client_ip} - {base_path} - Query: {query_string[:100] if query_string else 'empty'}"
+ )
self.send_response(200)
- self.send_header('Content-type', 'application/json')
+ self.send_header("Content-type", "application/json")
self.end_headers()
- response_data = get_sql_response_with_data(base_path, query_string or "")
+ response_data = get_sql_response_with_data(
+ base_path, query_string or ""
+ )
self.wfile.write(response_data.encode())
-
+
return True
-
+
except BrokenPipeError:
# Client disconnected
return True
@@ -126,120 +154,66 @@ class Handler(BaseHTTPRequestHandler):
# Still send a response even on error
try:
self.send_response(500)
- self.send_header('Content-type', 'application/json')
+ self.send_header("Content-type", "application/json")
self.end_headers()
self.wfile.write(b'{"error": "Internal server error"}')
except:
pass
return True
- def generate_page(self, seed: str) -> str:
+ def generate_page(self, seed: str, page_visit_count: int) -> str:
"""Generate a webpage containing random links or canary token"""
+
random.seed(seed)
num_pages = random.randint(*self.config.links_per_page_range)
- html = f"""<!DOCTYPE html>
-<html>
-<head>
-<title>Krawl</title>
-</head>
-<body>
-<h1>Krawl me! 🕸</h1>
-<p>{Handler.counter}</p>
-</body>
-</html>
-"""
+ # Check if this is a good crawler by IP category from database
+ ip_category = self._get_category_by_ip(self._get_client_ip())
+ # Determine if we should apply crawler page limit based on config and IP category
+ should_apply_crawler_limit = False
+ if self.config.infinite_pages_for_malicious:
+ if (
+ ip_category in ("good_crawler", "regular_user")
+ and page_visit_count >= self.config.max_pages_limit
+ ):
+ should_apply_crawler_limit = True
+ else:
+ if (
+ ip_category in ("good_crawler", "bad_crawler", "attacker")
+ and page_visit_count >= self.config.max_pages_limit
+ ):
+ should_apply_crawler_limit = True
+
+ # If the crawl limit applies, return a simple page with no links
+ if should_apply_crawler_limit:
+ return html_templates.main_page(
+ Handler.counter, "
Crawl limit reached.
"
+ )
+
+ # Build the content HTML
+ content = ""
+
+ # Add canary token if needed
if Handler.counter <= 0 and self.config.canary_token_url:
- html += f"""
+ content += f"""