# Architecture Overview
This document explains how the different pieces of SecureLens fit together — what each layer does, why it exists, and how data flows through the system.
---
## High-Level Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                            CLIENT                            │
│    (Next.js Frontend / Swagger UI / curl / API consumer)     │
└───────────────────────────────┬──────────────────────────────┘
                                │  HTTP requests
┌──────────────────────────────────────────────────────────────┐
│                     FASTAPI APPLICATION                      │
│                                                              │
│ ┌───────────────┐  ┌───────────────┐   ┌─────────────────┐   │
│ │  Auth Router  │  │  Scan Router  │   │ Code Scan Router│   │
│ │    /auth/*    │  │     /scan     │   │  /code-scan/*   │   │
│ └───────┬───────┘  └───────┬───────┘   └────────┬────────┘   │
│         │                  │                    │            │
│         ▼                  ▼                    ▼            │
│  ┌──────────────┐  ┌─────────────────┐   ┌──────────────┐    │
│  │ Auth Service │  │ Scanner Service │   │ Orchestrator │    │
│  │ JWT + Users  │  │ 5 check layers  │   │ 3-phase agent│    │
│  └──────┬───────┘  └───────┬─────────┘   └──────┬───────┘    │
│         │                  │                    │            │
└─────────┼──────────────────┼────────────────────┼────────────┘
          │                  │                    │
          ▼                  ▼                    ▼
   ┌─────────────┐   ┌──────────────┐      ┌──────────────┐
   │ PostgreSQL  │   │ Target URLs  │      │ GitHub API   │
   │  Database   │   │ (live scans) │      │ + Gemini AI  │
   └─────────────┘   └──────────────┘      └──────────────┘
```
---
## Application Layers
### 1. FastAPI Application (`app/main.py`)
This is the entry point. It creates the FastAPI app, registers all the routers, sets up CORS, and configures the lifespan (startup/shutdown logic like creating database tables).
FastAPI is async from top to bottom. Every request handler is an `async def` function, which means the server can handle many concurrent requests without blocking on I/O — critical for a system that makes lots of external HTTP calls.
The app listens on port `8000` and serves:
- A REST API for all functionality
- An interactive Swagger UI at `/docs`
- An OpenAPI schema at `/openapi.json`
---
### 2. Routers (`app/routers/`)
Routers are just groups of related endpoints. FastAPI uses them to keep the codebase organised.
| File | What It Handles |
|---|---|
| `auth.py` | Register, login, get current user |
| `scan.py` | Website URL scanning |
| `history.py` | Reading and deleting past scan results |
| `code_scan.py` | GitHub repo scanning + AI chat |
| `health.py` | Health check endpoints |
Routers don't contain business logic. They receive the request, call the appropriate service, and return the result. They're thin by design.
---
### 3. Services (`app/services/`)
Services contain the actual business logic.
#### `scanner/` — The Website Scanner
A collection of five independent checkers, each responsible for one "layer" of security:
- `transport.py` — Checks if the site uses HTTPS and implements HSTS correctly
- `ssl_checker.py` — Validates the SSL certificate (expiry, chain, TLS version)
- `headers.py` — Checks for the presence and correct configuration of security headers (CSP, X-Frame-Options, etc.)
- `cookies.py` — Checks session cookies for HttpOnly, Secure, and SameSite flags
- `exposure.py` — Probes for exposed sensitive paths like `/admin`, `/.env`, `/phpinfo.php`
Each checker runs independently. The scan router calls all of them, collects their results, passes them to the scoring engine, then sends everything through the AI service for enhancement.
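As a rough illustration of this fan-out, the sketch below runs two stand-in checkers and merges their findings into one list. The names `check_headers`, `check_cookies`, and `run_checks`, the issue dictionaries, and the short list of required headers are all hypothetical, and running the checkers concurrently with `asyncio.gather` is an assumption; the real modules may be structured differently.

```python
import asyncio

# Hypothetical subset of headers a checker might require.
REQUIRED_HEADERS = ("content-security-policy", "x-frame-options")


async def check_headers(headers: dict) -> list:
    """Report each required security header that is absent."""
    present = {name.lower() for name in headers}
    return [
        {"layer": "headers", "issue": f"missing {name}"}
        for name in REQUIRED_HEADERS
        if name not in present
    ]


async def check_cookies(set_cookie_values: list) -> list:
    """Report cookies lacking the HttpOnly, Secure, or SameSite attributes."""
    issues = []
    for raw in set_cookie_values:
        parts = [part.strip() for part in raw.split(";")]
        cookie_name = parts[0].split("=", 1)[0]
        attrs = {part.split("=", 1)[0].lower() for part in parts[1:]}
        for flag in ("httponly", "secure", "samesite"):
            if flag not in attrs:
                issues.append({"layer": "cookies", "issue": f"{cookie_name} lacks {flag}"})
    return issues


async def run_checks(headers: dict, set_cookie_values: list) -> list:
    """Run every checker concurrently and flatten the per-checker results."""
    per_checker = await asyncio.gather(
        check_headers(headers),
        check_cookies(set_cookie_values),
    )
    return [issue for result in per_checker for issue in result]


issues = asyncio.run(
    run_checks(
        {"X-Frame-Options": "DENY"},
        ["session=abc; HttpOnly; SameSite=Lax"],
    )
)
```

Because each stand-in checker only returns a list of findings, adding another layer in this sketch means passing one more coroutine to `asyncio.gather`.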
#### `code_scanner/` — The Code Scanner Agent
Contains the three-phase AI pipeline. See [ai-agent.md](./ai-agent.md) for a full explanation.
- `orchestrator.py` — The main pipeline class (Triage → Analysis → Summary)
- `github_client.py` — Handles all GitHub API communication
#### `ai.py` — Website Scanner AI Layer
Standalone functions that use Gemini to enhance the website scanner's results: `enhance_security_issues()`, `chat_with_scan_context()`, `generate_threat_narrative()`.
#### `scoring.py` — The Scoring Engine
A pure Python function that takes the list of issues from all scanners, applies weights based on severity, and produces a 0–100 score and an A–F letter grade. No AI is involved here, so the result is deterministic and consistent.
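A deterministic scorer of this kind could look roughly like the sketch below. The severity weights and grade boundaries are illustrative assumptions, not the values used in `scoring.py`, and `score_issues` is a hypothetical name.

```python
# Assumed penalty per severity level; the real weights may differ.
SEVERITY_WEIGHTS = {"critical": 25, "high": 15, "medium": 8, "low": 3}

# Assumed grade boundaries as (minimum score, letter), highest first.
GRADE_BOUNDARIES = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (50, "E")]


def score_issues(issues: list) -> tuple:
    """Subtract a weighted penalty per issue from 100 and map to a letter grade."""
    penalty = sum(SEVERITY_WEIGHTS.get(issue["severity"], 0) for issue in issues)
    score = max(0, 100 - penalty)  # never drop below zero
    for minimum, letter in GRADE_BOUNDARIES:
        if score >= minimum:
            return score, letter
    return score, "F"


result = score_issues(
    [{"severity": "high"}, {"severity": "medium"}, {"severity": "low"}]
)
```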
---
### 4. Schemas (`app/schemas/`)
Pydantic models that define the shape of every request and response. FastAPI uses these for automatic validation, serialisation, and documentation generation.
If a request body doesn't match the schema, FastAPI returns a `422` automatically without your handler even being called.
Key schemas:
- `auth.py`: `RegisterRequest`, `LoginRequest`, `TokenResponse`, `UserResponse`
- `scan.py`: `ScanRequest`, `ScanResponse`, `IssueDetail`
- `code_scan.py`: `CodeScanRequest`, `CodeScanResponse`, `VulnerabilityIssue`, `CodeChatRequest`, `CodeChatResponse`
---
### 5. Models (`app/models/`)
SQLAlchemy ORM models — the Python representation of database tables.
- `user.py` — The `User` table (id, email, username, hashed_password, created_at)
- `scan.py` — The `ScanResult` table (id, user_id, url, score, grade, full result JSON)
These are what get stored in PostgreSQL. The code scanner's results are *not* stored in the database in the current version — they're kept in an in-memory dict in `code_scan.py`.
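The in-memory store amounts to a module-level dictionary keyed by scan ID. The sketch below assumes a `scan_store` dict with hypothetical `save_scan` and `get_scan` helpers; results held this way are lost when the process restarts and are not shared between worker processes.

```python
import uuid
from typing import Optional

# Module-level dict: lives only as long as the server process.
scan_store: dict = {}


def save_scan(result: dict) -> str:
    """Store a finished scan under a fresh UUID and return that ID."""
    scan_id = str(uuid.uuid4())
    scan_store[scan_id] = result
    return scan_id


def get_scan(scan_id: str) -> Optional[dict]:
    """Look up an earlier scan, e.g. to give a chat endpoint its context."""
    return scan_store.get(scan_id)


scan_id = save_scan({"summary": "example", "issues": []})
```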
---
### 6. Middleware (`app/middleware/`)
- `auth.py` — The `get_current_user` dependency. Any endpoint that requires authentication uses this. It validates the JWT token from the `Authorization` header and returns the user object.
- `rate_limiter.py` — SlowAPI configuration. Limits the number of requests per IP per minute.
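To make the token check concrete, here is a standard-library sketch of HS256 signing and verification behind an `Authorization: Bearer <token>` header. It is not the project's `get_current_user` code or the JWT library it uses; `create_token`, `verify_token`, and `subject_from_header` are hypothetical names, and the claim set (`sub`, `exp`) is an assumption.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, secret: str) -> str:
    """HMAC-SHA256 over 'header.payload', base64url-encoded."""
    digest = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return _b64url(digest)


def create_token(subject: str, secret: str, ttl_seconds: int = 3600) -> str:
    """Build a signed token carrying the subject and an expiry time."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": subject, "exp": int(time.time()) + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_sign(signing_input, secret)}"


def verify_token(token: str, secret: str) -> Optional[dict]:
    """Return the claims if the signature matches and the token is unexpired."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(f"{header}.{payload}", secret)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims.get("exp", 0) < time.time():
        return None
    return claims


def subject_from_header(authorization: str, secret: str) -> Optional[str]:
    """Extract the token from 'Bearer <token>' and return its subject, if valid."""
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    claims = verify_token(token.strip(), secret)
    return claims["sub"] if claims else None


token = create_token("user-42", secret="dev-secret")
```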
---
### 7. Utils (`app/utils/`)
- `auth.py` — Low-level JWT functions: creating tokens, verifying tokens, hashing passwords, checking passwords
- `validators.py` — URL validation and SSRF protection. Before scanning any URL, we check that it does not point at a private IP address or localhost; otherwise attackers could use our scanner to probe internal networks.
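A minimal sketch of such a check, using only the standard library, is shown here. `is_safe_target` is a hypothetical name and the exact rules in `validators.py` may differ; in particular, this version only inspects literal IP addresses and the `localhost` name, whereas a fuller validator would also resolve hostnames and check every resulting address.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_target(url: str) -> bool:
    """Reject URLs that are not http(s) or that point at internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real validator would also resolve the hostname
        # and apply the same checks to every address it resolves to.
        return True
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
        or address.is_unspecified
    )
```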
---
## Data Flow — Code Scan Request
This is exactly what happens when you call `POST /code-scan/analyze`:
```
1. Request arrives at FastAPI
2. Pydantic validates the body → CodeScanRequest(repo_url, github_token, branch)
3. Router creates a CodeScanOrchestrator instance
4. GitHubClient.get_repo_tree() → fetches all file paths via GitHub Trees API
   ├── Makes 1-2 GitHub API calls (uses token for auth)
   └── Returns: ["app/page.js", "app/users/page.js", "package.json", ...]
5. orchestrator.triage_files() → sends file list to Gemini
   ├── 1 Gemini API call with all filenames
   └── Returns: ["app/users/page.js", "middleware.ts", ...] (5 files)
6. orchestrator.analyze_files() → fetches and scans each file
   ├── GitHubClient.get_file_content() × 5 (concurrent, async)
   ├── Gemini generate_content() × 5 (concurrent, async, behind Semaphore)
   └── Returns: [VulnerabilityIssue, VulnerabilityIssue, ...]
7. orchestrator.generate_summary() → writes executive summary
   ├── 1 Gemini API call with all vulnerability data
   └── Returns: "The repository presents a moderate risk..."
8. Router creates CodeScanResponse with a UUID scan_id
9. scan_store[scan_id] = response (saved in-memory for chat)
10. Response returned to client (JSON)
```
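Step 6 bounds its concurrent work with a semaphore. The sketch below shows that pattern in isolation: `analyze_file` is a stand-in that sleeps instead of calling GitHub or Gemini, and the limit of 2 is an assumed value, not the one the orchestrator uses. The two counters exist only to make the cap observable.

```python
import asyncio

MAX_CONCURRENT = 2  # assumed limit; the real value is not stated here

in_flight = 0
peak_in_flight = 0


async def analyze_file(path: str, semaphore: asyncio.Semaphore) -> dict:
    """Stand-in for 'fetch one file and ask the model about it'."""
    global in_flight, peak_in_flight
    async with semaphore:  # at most MAX_CONCURRENT coroutines pass this point
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
        await asyncio.sleep(0.01)  # placeholder for the external API calls
        in_flight -= 1
        return {"file": path, "issues": []}


async def analyze_files(paths: list) -> list:
    """Start one coroutine per file; the semaphore throttles how many run at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(analyze_file(p, semaphore) for p in paths))


results = asyncio.run(analyze_files(["a.js", "b.js", "c.ts", "d.py", "e.go"]))
```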
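Total external API calls, assuming triage selects 5 files: 6-7 GitHub (1-2 for the tree, 5 file fetches) + 7 Gemini (1 triage, 5 analyses, 1 summary) = ~13-14 calls per scan.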
---
## Database
We use PostgreSQL in production (via Docker Compose) and SQLite in local development.
The connection is managed by SQLAlchemy's async engine. All database operations use `async with get_db() as session:`, so queries are awaited rather than blocking the event loop.
Migrations are managed by Alembic. To run migrations:
```bash
alembic upgrade head
```
The tables are also auto-created on startup in development mode (the `create_all()` call in `main.py`'s lifespan function).
---
## Environment Configuration
All configuration is driven by the `.env` file. The `config.py` file uses Pydantic's `BaseSettings` to read it:
```python
class Settings(BaseSettings):
    gemini_api_key: str | None = None
    database_url: str = "sqlite+aiosqlite:///./securelens.db"
    jwt_secret: str = "change-me-in-production"
    # ...
```
If a required variable (a field with no default) is missing, Pydantic raises a validation error at startup rather than letting the app fail later at runtime.
See `.env.example` for the full list of options, or the Configuration section in [README.md](../README.md).
---
## Docker Setup
The `docker-compose.yml` runs two services:
```
backend  ← FastAPI app (port 8000)
db       ← PostgreSQL (port 5432, internal only)
```
The backend container reads `DATABASE_URL` from `.env` and connects to the `db` container over the internal Docker network. PostgreSQL data persists in a Docker volume across restarts.
To rebuild from scratch:
```bash
docker compose down -v # removes containers AND the data volume
docker compose up --build
```