docs/architecture.md

# Architecture Overview

This document explains how the different pieces of SecureLens fit together — what each layer does, why it exists, and how data flows through the system.

---

## High-Level Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        CLIENT                                │
│  (Next.js Frontend / Swagger UI / curl / API consumer)       │
└───────────────────────────────┬──────────────────────────────┘
                                │  HTTP requests
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                    FASTAPI APPLICATION                       │
│                                                              │
│  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐   │
│  │  Auth Router  │  │  Scan Router  │  │ Code Scan Router│   │
│  │  /auth/*      │  │  /scan        │  │  /code-scan/*   │   │
│  └───────┬───────┘  └───────┬───────┘  └────────┬────────┘   │
│          │                  │                    │           │
│          ▼                  ▼                    ▼           │
│  ┌──────────────┐  ┌─────────────────┐  ┌──────────────┐     │
│  │ Auth Service │  │ Scanner Service │  │  Orchestrator│     │
│  │ JWT + Users  │  │ 5 check layers  │  │ 3-phase agent│     │
│  └──────┬───────┘  └───────┬─────────┘  └──────┬───────┘     │
│         │                  │                    │            │
└─────────┼──────────────────┼────────────────────┼────────────┘
          │                  │                    │
          ▼                  ▼                    ▼
   ┌─────────────┐   ┌──────────────┐    ┌──────────────┐
   │ PostgreSQL  │   │ Target URLs  │    │  GitHub API  │
   │  Database   │   │ (live scans) │    │  + Gemini AI │
   └─────────────┘   └──────────────┘    └──────────────┘
```

---

## Application Layers

### 1. FastAPI Application (`app/main.py`)

This is the entry point. It creates the FastAPI app, registers all the routers, sets up CORS, and configures the lifespan (startup/shutdown logic like creating database tables).

FastAPI is async from top to bottom. Every request handler is an `async def` function, which means the server can handle many concurrent requests without blocking on I/O — critical for a system that makes lots of external HTTP calls.

The app listens on port `8000` and serves:
- A REST API for all functionality
- An interactive Swagger UI at `/docs`
- An OpenAPI schema at `/openapi.json`

---

### 2. Routers (`app/routers/`)

Routers are just groups of related endpoints. FastAPI uses them to keep the codebase organised.

| File | What It Handles |
|---|---|
| `auth.py` | Register, login, get current user |
| `scan.py` | Website URL scanning |
| `history.py` | Reading and deleting past scan results |
| `code_scan.py` | GitHub repo scanning + AI chat |
| `health.py` | Health check endpoints |

Routers don't contain business logic. They receive the request, call the appropriate service, and return the result. They're thin by design.

---

### 3. Services (`app/services/`)

Services contain the actual business logic.

#### `scanner/` — The Website Scanner

A collection of five independent checkers, each responsible for one "layer" of security:

- `transport.py` — Checks if the site uses HTTPS and implements HSTS correctly
- `ssl_checker.py` — Validates the SSL certificate (expiry, chain, TLS version)
- `headers.py` — Checks for the presence and correct configuration of security headers (CSP, X-Frame-Options, etc.)
- `cookies.py` — Checks session cookies for HttpOnly, Secure, and SameSite flags
- `exposure.py` — Probes for exposed sensitive paths like `/admin`, `/.env`, `/phpinfo.php`

Each checker runs independently. The scan router calls all of them, collects their results, passes them to the scoring engine, then sends everything through the AI service for enhancement.

#### `code_scanner/` — The Code Scanner Agent

Contains the three-phase AI pipeline. See [ai-agent.md](./ai-agent.md) for a full explanation.

- `orchestrator.py` — The main pipeline class (Triage → Analysis → Summary)
- `github_client.py` — Handles all GitHub API communication

#### `ai.py` — Website Scanner AI Layer

Standalone functions that use Gemini to enhance the website scanner's results: `enhance_security_issues()`, `chat_with_scan_context()`, `generate_threat_narrative()`.

#### `scoring.py` — The Scoring Engine

A pure Python function that takes the list of issues from all scanners, applies weights based on severity, and produces a 0–100 score and an A–F letter grade. No AI involved here — it's deterministic and consistent.

---

### 4. Schemas (`app/schemas/`)

Pydantic models that define the shape of every request and response. FastAPI uses these for automatic validation, serialisation, and documentation generation.

If a request body doesn't match the schema, FastAPI returns a `422` automatically without your handler even being called.

Key schemas:

- `auth.py` — `RegisterRequest`, `LoginRequest`, `TokenResponse`, `UserResponse`
- `scan.py` — `ScanRequest`, `ScanResponse`, `IssueDetail`
- `code_scan.py` — `CodeScanRequest`, `CodeScanResponse`, `VulnerabilityIssue`, `CodeChatRequest`, `CodeChatResponse`

---

### 5. Models (`app/models/`)

SQLAlchemy ORM models — the Python representation of database tables.

- `user.py` — The `User` table (id, email, username, hashed_password, created_at)
- `scan.py` — The `ScanResult` table (id, user_id, url, score, grade, full result JSON)

These are what get stored in PostgreSQL. The code scanner's results are *not* stored in the database in the current version — they're kept in an in-memory dict in `code_scan.py`.

---

### 6. Middleware (`app/middleware/`)

- `auth.py` — The `get_current_user` dependency. Any endpoint that requires authentication uses this. It validates the JWT token from the `Authorization` header and returns the user object.
- `rate_limiter.py` — SlowAPI configuration. Limits the number of requests per IP per minute.

---

### 7. Utils (`app/utils/`)

- `auth.py` — Low-level JWT functions: creating tokens, verifying tokens, hashing passwords, checking passwords
- `validators.py` — URL validation and SSRF protection. Before scanning any URL, we check it's not a private IP address or localhost, which would let attackers use our scanner to probe internal networks

---

## Data Flow — Code Scan Request

This is exactly what happens when you call `POST /code-scan/analyze`:

```
1. Request arrives at FastAPI
      │
2. Pydantic validates the body → CodeScanRequest(repo_url, github_token, branch)
      │
3. Router creates a CodeScanOrchestrator instance
      │
4. GitHubClient.get_repo_tree() → fetches all file paths via GitHub Trees API
      │
      ├── Makes 1-2 GitHub API calls (uses token for auth)
      └── Returns: ["app/page.js", "app/users/page.js", "package.json", ...]
      │
5. orchestrator.triage_files() → sends file list to Gemini
      │
      ├── 1 Gemini API call with all filenames
      └── Returns: ["app/users/page.js", "middleware.ts", ...] (5 files)
      │
6. orchestrator.analyze_files() → fetches and scans each file
      │
      ├── GitHubClient.get_file_content() × 5 (concurrent, async)
      ├── Gemini generate_content() × 5 (concurrent, async, behind Semaphore)
      └── Returns: [VulnerabilityIssue, VulnerabilityIssue, ...]
      │
7. orchestrator.generate_summary() → writes executive summary
      │
      ├── 1 Gemini API call with all vulnerability data
      └── Returns: "The repository presents a moderate risk..."
      │
8. Router creates CodeScanResponse with a UUID scan_id
      │
9. scan_store[scan_id] = response  (saved in-memory for chat)
      │
10. Response returned to client (JSON)
```

Total external API calls: 2-3 GitHub + 7 Gemini = ~9-10 calls per scan.

---

## Database

We use PostgreSQL in production (via Docker Compose) and SQLite in local development.

The connection is managed by SQLAlchemy's async engine. All database operations use `async with get_db() as session:` — they never block.

Migrations are managed by Alembic. To run migrations:

```bash
alembic upgrade head
```

The tables are also auto-created on startup in development mode (the `create_all()` call in `main.py`'s lifespan function).

---

## Environment Configuration

All configuration is driven by the `.env` file. The `config.py` file uses Pydantic's `BaseSettings` to read it:

```python
class Settings(BaseSettings):
    gemini_api_key: str | None = None
    database_url: str = "sqlite+aiosqlite:///./securelens.db"
    jwt_secret: str = "change-me-in-production"
    # ...
```

If a required variable is missing, Pydantic raises an error on startup — not silently at runtime.

See `.env.example` for the full list of options, or the Configuration section in [README.md](../README.md).

---

## Docker Setup

The `docker-compose.yml` runs two services:

```
backend   ← FastAPI app (port 8000)
db        ← PostgreSQL (port 5432, internal only)
```

The backend container reads `DATABASE_URL` from `.env` and connects to the `db` container over the internal Docker network. PostgreSQL data persists in a Docker volume across restarts.

To rebuild from scratch:

```bash
docker compose down -v   # removes containers AND the data volume
docker compose up --build
```