Krawl

A modern, customizable web honeypot server designed to detect and track malicious activity from attackers and web crawlers through deceptive web pages, fake credentials, and canary tokens.

What is Krawl? • Installation • Honeypot Pages • Dashboard • Todo • Contributing

/api/download/malicious_ips.txt ``` This file enables automatic blocking of malicious traffic across various platforms. You can use it to update firewall rules on: * [OPNsense and pfSense](https://www.allthingstech.ch/using-opnsense-and-ip-blocklists-to-block-malicious-traffic) * [RouterOS](https://rentry.co/krawl-routeros) * [IPtables](plugins/iptables/README.md) and [Nftables](plugins/nftables/README.md) * [Fail2Ban](plugins/fail2ban/README.md) ## IP Reputation Krawl [uses tasks that analyze recent traffic to build and continuously update an IP reputation](src/tasks/analyze_ips.py) score. It runs periodically and evaluates each active IP address based on multiple behavioral indicators to classify it as an attacker, crawler, or regular user. Thresholds are fully customizable. ![ip reputation](img/ip-reputation.png) The analysis includes: - **Risky HTTP methods usage** (e.g. POST, PUT, DELETE ratios) - **Robots.txt violations** - **Request timing anomalies** (bursty or irregular patterns) - **User-Agent consistency** - **Attack URL detection** (e.g. SQL injection, XSS patterns) Each signal contributes to a weighted scoring model that assigns a reputation category: - `attacker` - `bad_crawler` - `good_crawler` - `regular_user` - `unknown` (for insufficient data) The resulting scores and metrics are stored in the database and used by Krawl to drive dashboards, reputation tracking, and automated mitigation actions such as IP banning or firewall integration. ## Forward server header If Krawl is deployed behind a proxy such as NGINX the **server header** should be forwarded using the following configuration in your proxy: ```bash location / { proxy_pass https://your-krawl-instance; proxy_pass_header Server; } ``` ## API Krawl uses the following APIs - http://ip-api.com (IP Data) - https://iprep.lcrawl.com (IP Reputation) - https://nominatim.openstreetmap.org/reverse (Reverse IP Lookup) - https://api.ipify.org (Public IP discovery) - http://ident.me (Public IP discovery) - https://ifconfig.me (Public IP discovery) ## Configuration Krawl uses a **configuration hierarchy** in which **environment variables take precedence over the configuration file**. This approach is recommended for Docker deployments and quick out-of-the-box customization. ### Configuration via Enviromental Variables | Environment Variable | Description | Default | |----------------------|-------------|---------| | `CONFIG_LOCATION` | Path to yaml config file | `config.yaml` | | `KRAWL_PORT` | Server listening port | `5000` | | `KRAWL_DELAY` | Response delay in milliseconds | `100` | | `KRAWL_SERVER_HEADER` | HTTP Server header for deception | `""` | | `KRAWL_LINKS_LENGTH_RANGE` | Link length range as `min,max` | `5,15` | | `KRAWL_LINKS_PER_PAGE_RANGE` | Links per page as `min,max` | `10,15` | | `KRAWL_CHAR_SPACE` | Characters used for link generation | `abcdefgh...` | | `KRAWL_MAX_COUNTER` | Initial counter value | `10` | | `KRAWL_CANARY_TOKEN_URL` | External canary token URL | None | | `KRAWL_CANARY_TOKEN_TRIES` | Requests before showing canary token | `10` | | `KRAWL_DASHBOARD_SECRET_PATH` | Custom dashboard path | Auto-generated | | `KRAWL_PROBABILITY_ERROR_CODES` | Error response probability (0-100%) | `0` | | `KRAWL_DATABASE_PATH` | Database file location | `data/krawl.db` | | `KRAWL_EXPORTS_PATH` | Path where firewalls rule sets are exported | `exports` | | `KRAWL_BACKUPS_PATH` | Path where database dump are saved | `backups` | | `KRAWL_BACKUPS_CRON` | cron expression to control backup job schedule | `*/30 * * * *` | | `KRAWL_BACKUPS_ENABLED` | Boolean to enable db dump job | `true` | | `KRAWL_DATABASE_RETENTION_DAYS` | Days to retain data in database | `30` | | `KRAWL_HTTP_RISKY_METHODS_THRESHOLD` | Threshold for risky HTTP methods detection | `0.1` | | `KRAWL_VIOLATED_ROBOTS_THRESHOLD` | Threshold for robots.txt violations | `0.1` | | `KRAWL_UNEVEN_REQUEST_TIMING_THRESHOLD` | Coefficient of variation threshold for timing | `0.5` | | `KRAWL_UNEVEN_REQUEST_TIMING_TIME_WINDOW_SECONDS` | Time window for request timing analysis in seconds | `300` | | `KRAWL_USER_AGENTS_USED_THRESHOLD` | Threshold for detecting multiple user agents | `2` | | `KRAWL_ATTACK_URLS_THRESHOLD` | Threshold for attack URL detection | `1` | | `KRAWL_INFINITE_PAGES_FOR_MALICIOUS` | Serve infinite pages to malicious IPs | `true` | | `KRAWL_MAX_PAGES_LIMIT` | Maximum page limit for crawlers | `250` | | `KRAWL_BAN_DURATION_SECONDS` | Ban duration in seconds for rate-limited IPs | `600` | For example ```bash # Set canary token export CONFIG_LOCATION="config.yaml" export KRAWL_CANARY_TOKEN_URL="http://your-canary-token-url" # Set number of pages range (min,max format) export KRAWL_LINKS_PER_PAGE_RANGE="5,25" # Set analyzer thresholds export KRAWL_HTTP_RISKY_METHODS_THRESHOLD="0.2" export KRAWL_VIOLATED_ROBOTS_THRESHOLD="0.15" # Set custom dashboard path export KRAWL_DASHBOARD_SECRET_PATH="/my-secret-dashboard" ``` Example of a Docker run with env variables: ```bash docker run -d \ -p 5000:5000 \ -e KRAWL_PORT=5000 \ -e KRAWL_DELAY=100 \ -e KRAWL_CANARY_TOKEN_URL="http://your-canary-token-url" \ --name krawl \ ghcr.io/blessedrebus/krawl:latest ``` ### Configuration via config.yaml You can use the [config.yaml](config.yaml) file for more advanced configurations, such as Docker Compose or Helm chart deployments. # Honeypot Below is a complete overview of the Krawl honeypot’s capabilities ## robots.txt The actual (juicy) robots.txt configuration [is the following](src/templates/html/robots.txt). ## Honeypot pages ### Common Login Attempts Requests to common admin endpoints (`/admin/`, `/wp-admin/`, `/phpMyAdmin/`) return a fake login page. Any login attempt triggers a 1-second delay to simulate real processing and is fully logged in the dashboard (credentials, IP, headers, timing). ![admin page](img/admin-page.png) ### Common Misconfiguration Paths Requests to paths like `/backup/`, `/config/`, `/database/`, `/private/`, or `/uploads/` return a fake directory listing populated with “interesting” files, each assigned a random file size to look realistic. ![directory-page](img/directory-page.png) ### Environment File Leakage The `.env` endpoint exposes fake database connection strings, **AWS API keys**, and **Stripe secrets**. It intentionally returns an error due to the `Content-Type` being `application/json` instead of plain text, mimicking a "juicy" misconfiguration that crawlers and scanners often flag as information leakage. ### Server Error Information The `/server` page displays randomly generated fake error information for each known server. ![server and env page](img/server-and-env-page.png) ### API Endpoints with Sensitive Data The pages `/api/v1/users` and `/api/v2/secrets` show fake users and random secrets in JSON format ![users and secrets](img/users-and-secrets.png) ### Exposed Credential Files The pages `/credentials.txt` and `/passwords.txt` show fake users and random secrets ![credentials and passwords](img/credentials-and-passwords.png) ### SQL Injection and XSS Detection Pages such as `/users`, `/search`, `/contact`, `/info`, `/input`, and `/feedback`, along with APIs like `/api/sql` and `/api/database`, are designed to lure attackers into performing attacks such as **SQL injection** or **XSS**. ![sql injection](img/sql_injection.png) Automated tools like **SQLMap** will receive a different randomized database error on each request, increasing scan noise and confusing the attacker. All detected attacks are logged and displayed in the dashboard. ### Path Traversal Detection Krawl detects and responds to **path traversal** attempts targeting common system files like `/etc/passwd`, `/etc/shadow`, or Windows system paths. When an attacker tries to access sensitive files using patterns like `../../../etc/passwd` or encoded variants (`%2e%2e/`, `%252e`), Krawl returns convincing fake file contents with realistic system users, UIDs, GIDs, and shell configurations. This wastes attacker time while logging the full attack pattern. ### XXE (XML External Entity) Injection The `/api/xml` and `/api/parser` endpoints accept XML input and are designed to detect **XXE injection** attempts. When attackers try to exploit external entity declarations (`:/` The dashboard shows: - Total and unique accesses - Suspicious activity and attack detection - Top IPs, paths, user-agents and GeoIP localization - Real-time monitoring The attackers’ access to the honeypot endpoint and related suspicious activities (such as failed login attempts) are logged. Krawl also implements a scoring system designed to distinguish between malicious and legitimate behavior on the website. ![dashboard-1](img/dashboard-1.png) The top IP Addresses is shown along with top paths and User Agents ![dashboard-2](img/dashboard-2.png) ![dashboard-3](img/dashboard-3.png) ## 🤝 Contributing Contributions welcome! Please: 1. Fork the repository 2. Create a feature branch 3. Make your changes 4. Submit a pull request (explain the changes!)

## ⚠️ Disclaimer **This is a deception/honeypot system.** Deploy in isolated environments and monitor carefully for security events. Use responsibly and in compliance with applicable laws and regulations. ## Star History