docs: add comprehensive documentation for API, backups, canary token, dashboard, honeypot, reverse proxy, and wordlist customization

2026-03-01 21:20:33 +01:00
parent b6fbcabdee
commit 2e4e494636
8 changed files with 161 additions and 175 deletions
--- a/docs/api.md
+++ b/docs/api.md
@@ -0,0 +1,9 @@
+# API
+
+Krawl uses the following external APIs:
+- http://ip-api.com (IP Data)
+- https://iprep.lcrawl.com (IP Reputation)
+- https://nominatim.openstreetmap.org/reverse (Reverse geocoding)
+- https://api.ipify.org (Public IP discovery)
+- http://ident.me (Public IP discovery)
+- https://ifconfig.me (Public IP discovery)
--- a/docs/backups.md
+++ b/docs/backups.md
@@ -0,0 +1,10 @@
+# Enable Database Dump Job for Backups
+
+To enable the database dump job, set the following variables (*config file example*):
+
+```yaml
+backups:
+    path: "backups" # directory where backups are saved
+    cron: "*/30 * * * *" # cron schedule for the dump job (here: every 30 minutes)
+    enabled: true
+```
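+
+The `cron` value appears to be a standard five-field cron expression; this is an assumption based on the example above, not something confirmed by the Krawl source. Under that assumption, a once-a-day dump at 03:00 could be sketched as:
+
+```yaml
+backups:
+    path: "backups"      # same directory as in the example above
+    cron: "0 3 * * *"    # assumed five-field syntax: minute 0, hour 3, every day
+    enabled: true
+```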
--- a/docs/canary-token.md
+++ b/docs/canary-token.md
@@ -0,0 +1,10 @@
+# Customizing the Canary Token
+
+To create a custom canary token, visit https://canarytokens.org and generate a "Web bug" canary token.
+
+This optional token is triggered when a crawler fully traverses the webpage until it reaches 0. At that point, the canary token URL is returned. When this URL is requested, an email alert is sent to the user, including the visitor's IP address and user agent.
+
+To enable this feature, set the canary token URL [using the environment variable](../README.md#configuration-via-enviromental-variables) `KRAWL_CANARY_TOKEN_URL`.
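+
+As a minimal sketch, assuming Krawl runs under Docker Compose, the variable could be set as shown below. The service name `krawl` and the token URL are placeholders, not values taken from the project.
+
+```yaml
+services:
+    krawl:
+        environment:
+            # Placeholder: paste the "Web bug" URL generated on canarytokens.org
+            KRAWL_CANARY_TOKEN_URL: "https://canarytokens.example/your-token"
+```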
--- a/docs/dashboard.md
+++ b/docs/dashboard.md
@@ -0,0 +1,21 @@
+# Dashboard
+
+Access the dashboard at `http://<server-ip>:<port>/<dashboard-path>`
+
+The dashboard shows:
+- Total and unique accesses
+- Suspicious activity and attack detection
+- Top IPs, paths, user agents, and GeoIP location
+- Real-time monitoring
+
+Attacker access to the honeypot endpoints and related suspicious activity (such as failed login attempts) is logged.
+
+Krawl also implements a scoring system designed to distinguish between malicious and legitimate behavior on the website.
+
+![dashboard-1](../img/dashboard-1.png)
+
+The top IP addresses are shown along with the top paths and user agents.
+
+![dashboard-2](../img/dashboard-2.png)
+
+![dashboard-3](../img/dashboard-3.png)
--- a/docs/honeypot.md
+++ b/docs/honeypot.md
@@ -0,0 +1,52 @@
+# Honeypot
+
+Below is a complete overview of the Krawl honeypot's capabilities.
+
+## robots.txt
+The deliberately "juicy" `robots.txt` that Krawl serves is [available here](../src/templates/html/robots.txt).
+
+## Honeypot pages
+
+### Common Login Attempts
+Requests to common admin endpoints (`/admin/`, `/wp-admin/`, `/phpMyAdmin/`) return a fake login page. Any login attempt triggers a 1-second delay to simulate real processing and is fully logged in the dashboard (credentials, IP, headers, timing).
+
+![admin page](../img/admin-page.png)
+
+### Common Misconfiguration Paths
+Requests to paths like `/backup/`, `/config/`, `/database/`, `/private/`, or `/uploads/` return a fake directory listing populated with "interesting" files, each assigned a random file size to look realistic.
+
+![directory-page](../img/directory-page.png)
+
+### Environment File Leakage
+The `.env` endpoint exposes fake database connection strings, **AWS API keys**, and **Stripe secrets**. It intentionally returns an error due to the `Content-Type` being `application/json` instead of plain text, mimicking a "juicy" misconfiguration that crawlers and scanners often flag as information leakage.
+
+### Server Error Information
+The `/server` page displays randomly generated fake error information for each known server.
+
+![server and env page](../img/server-and-env-page.png)
+
+### API Endpoints with Sensitive Data
+The endpoints `/api/v1/users` and `/api/v2/secrets` return fake users and random secrets in JSON format.
+
+![users and secrets](../img/users-and-secrets.png)
+
+### Exposed Credential Files
+The pages `/credentials.txt` and `/passwords.txt` show fake users and random secrets.
+
+![credentials and passwords](../img/credentials-and-passwords.png)
+
+### SQL Injection and XSS Detection
+Pages such as `/users`, `/search`, `/contact`, `/info`, `/input`, and `/feedback`, along with APIs like `/api/sql` and `/api/database`, are designed to lure attackers into performing attacks such as **SQL injection** or **XSS**.
+
+![sql injection](../img/sql_injection.png)
+
+Automated tools like **SQLMap** will receive a different randomized database error on each request, increasing scan noise and confusing the attacker. All detected attacks are logged and displayed in the dashboard.
+
+### Path Traversal Detection
+Krawl detects and responds to **path traversal** attempts targeting common system files like `/etc/passwd`, `/etc/shadow`, or Windows system paths. When an attacker tries to access sensitive files using patterns like `../../../etc/passwd` or encoded variants (`%2e%2e/`, `%252e`), Krawl returns convincing fake file contents with realistic system users, UIDs, GIDs, and shell configurations. This wastes attacker time while logging the full attack pattern.
+
+### XXE (XML External Entity) Injection
+The `/api/xml` and `/api/parser` endpoints accept XML input and are designed to detect **XXE injection** attempts. When attackers try to exploit external entity declarations (`<!ENTITY`, `<!DOCTYPE`, `SYSTEM`) or reference entities to access local files, Krawl responds with realistic XML responses that appear to process the entities successfully. The honeypot returns fake file contents, simulated entity values (like `admin_credentials` or `database_connection`), or realistic error messages, making the attack appear successful while fully logging the payload.
+
+### Command Injection Detection
+Pages like `/api/exec`, `/api/run`, and `/api/system` simulate command execution endpoints vulnerable to **command injection**. When attackers attempt to inject shell commands using patterns like `; whoami`, `| cat /etc/passwd`, or backticks, Krawl responds with realistic command outputs. For example, `whoami` returns fake usernames like `www-data` or `nginx`, while `uname` returns fake Linux kernel versions. Network commands like `wget` or `curl` simulate downloads or return "command not found" errors, creating believable responses that delay and confuse automated exploitation tools.
--- a/docs/reverse-proxy.md
+++ b/docs/reverse-proxy.md
@@ -0,0 +1,25 @@
+# Example Usage Behind Reverse Proxy
+
+You can configure a reverse proxy so that all web requests land on the Krawl page by default, while your real content stays behind a secret URL. For example:
+
+```nginx
+location / {
+    proxy_pass https://your-krawl-instance;
+    proxy_pass_header Server;
+}
+
+location /my-hidden-service {
+    proxy_pass https://my-hidden-service;
+    proxy_pass_header Server;
+}
+```
+
+Alternatively, you can create several "interesting"-looking subdomains that point to Krawl. For example:
+
+- admin.example.com
+- portal.example.com
+- sso.example.com
+- login.example.com
+- ...
+
+Additionally, you may configure your reverse proxy to forward all non-existing subdomains (e.g. nonexistent.example.com) to one of these domains so that any crawlers that are guessing domains at random will automatically end up at your Krawl instance.
--- a/docs/wordlist.md
+++ b/docs/wordlist.md
@@ -0,0 +1,22 @@
+# Customizing the Wordlist
+
+Edit `wordlists.json` to customize the fake data for your use case:
+
+```json
+{
+  "usernames": {
+    "prefixes": ["admin", "root", "user"],
+    "suffixes": ["_prod", "_dev", "123"]
+  },
+  "passwords": {
+    "prefixes": ["P@ssw0rd", "Admin"],
+    "simple": ["test", "password"]
+  },
+  "directory_listing": {
+    "files": ["credentials.txt", "backup.sql"],
+    "directories": ["admin/", "backup/"]
+  }
+}
+```
+
+For Helm chart installations, edit **values.yaml** instead.