Malin 60c9b495ae fix: AI worker crash-proof + GDPR/hosting/accessibility analysis
AI worker fixes (root cause of "nothing reaches Replicate"):
- Worker task died silently — no exception handler around while loop
- Added try/except around entire loop body with exc_info logging
- Added watchdog task that restarts dead workers every 10 seconds
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
  subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
  last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
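The worker hardening described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `_assess_one` and `ensure_workers_alive` are names taken from the commit message, but their bodies here are stand-ins, and `WorkerPool`, `worker`, `watchdog`, and `demo` are hypothetical helpers.

```python
import asyncio
import logging

log = logging.getLogger("ai_worker")


async def _assess_one(job):
    """Top-level stand-in for the real per-job call (body is hypothetical)."""
    if job == "bad":
        raise ValueError("simulated failure")
    return f"assessed:{job}"


async def worker(queue, results):
    """Process jobs forever; a failing job is logged and never kills the loop."""
    while True:
        job = await queue.get()
        try:
            results.append(await _assess_one(job))
        except Exception:
            # CancelledError is a BaseException (Python 3.8+), so a
            # deliberate cancel still propagates and stops the worker.
            log.error("job %r failed", job, exc_info=True)
        finally:
            queue.task_done()


class WorkerPool:
    """Hypothetical holder for the single worker task."""

    def __init__(self, queue, results):
        self.queue = queue
        self.results = results
        self.task = None

    def ensure_workers_alive(self):
        """Start the worker if it never ran or has finished; True if (re)started."""
        if self.task is None or self.task.done():
            self.task = asyncio.create_task(worker(self.queue, self.results))
            return True
        return False


async def watchdog(pool, interval=10.0):
    """Re-check the worker every `interval` seconds and restart it if dead."""
    while True:
        pool.ensure_workers_alive()
        await asyncio.sleep(interval)


async def demo():
    queue, results = asyncio.Queue(), []
    pool = WorkerPool(queue, results)
    pool.ensure_workers_alive()
    for job in ("a", "bad", "b"):
        queue.put_nowait(job)
    await queue.join()              # returns once all three jobs are handled
    alive = not pool.task.done()    # the failing job did not end the loop
    pool.task.cancel()
    return results, alive
```

Calling `ensure_workers_alive()` from both the watchdog and the request handler means a dead worker is revived either on the next 10-second tick or on the next batch POST, whichever comes first.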

site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
  Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: html lang missing, images without alt count,
  skip nav link, empty links, inputs without labels
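For illustration, the accessibility quick scan listed above could look something like this using only the standard library. It is a sketch under assumptions: the class name, the result keys, and the exact heuristics (for example, treating a `#` link whose text contains "skip" as the skip-nav link) are hypothetical and may differ from what site_analyzer.py does.

```python
from html.parser import HTMLParser


class QuickA11yScan(HTMLParser):
    """Collects the quick-scan signals in a single pass over the markup."""

    def __init__(self):
        super().__init__()
        self.html_lang_missing = True
        self.images_without_alt = 0
        self.has_skip_link = False
        self.empty_links = 0
        self._label_for = set()   # ids referenced by <label for="...">
        self._inputs = []         # (id, has_aria_label) per visible input
        self._link = None         # state while inside an <a> element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html" and (a.get("lang") or "").strip():
            self.html_lang_missing = False
        elif tag == "img":
            if "alt" not in a:
                self.images_without_alt += 1
            elif self._link is not None:
                # An image with alt text gives its parent link a name.
                self._link["text"] += a["alt"] or ""
        elif tag == "a":
            self._link = {"href": a.get("href") or "", "text": ""}
        elif tag == "label" and a.get("for"):
            self._label_for.add(a["for"])
        elif tag == "input" and a.get("type", "text") not in ("hidden", "submit", "button"):
            self._inputs.append((a.get("id"), bool(a.get("aria-label"))))

    def handle_data(self, data):
        if self._link is not None:
            self._link["text"] += data

    def handle_endtag(self, tag):
        if tag != "a" or self._link is None:
            return
        text = self._link["text"].strip()
        if not text:
            self.empty_links += 1
        elif self._link["href"].startswith("#") and "skip" in text.lower():
            self.has_skip_link = True
        self._link = None

    @property
    def inputs_without_labels(self):
        # Resolved at the end because a <label> may follow its <input>.
        return sum(1 for id_, aria in self._inputs
                   if not aria and id_ not in self._label_for)


def quick_scan(html):
    """Return the scan results as a plain dict (key names are assumptions)."""
    parser = QuickA11yScan()
    parser.feed(html)
    parser.close()
    return {
        "html_lang_missing": parser.html_lang_missing,
        "images_without_alt": parser.images_without_alt,
        "has_skip_link": parser.has_skip_link,
        "empty_links": parser.empty_links,
        "inputs_without_labels": parser.inputs_without_labels,
    }
```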

Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
  accessibility_issues[]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 18:01:34 +02:00

DomGod — Domain Intelligence Dashboard

Dockerized dashboard for filtering, enriching, scoring, and exporting leads from a 72M-domain dataset.

Quick start

docker compose up --build

Open http://localhost:6677

On first boot, the container downloads domains.parquet (~GB) and caches it in ./data/. Subsequent restarts skip the download.
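The download-once behaviour can be pictured with a sketch like the one below. The function name `ensure_parquet`, the injectable `fetch` parameter, and the `.part` temporary file are assumptions made for illustration; the container's real startup logic may be structured differently.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_parquet(data_dir, url, fetch=urlretrieve):
    """Download domains.parquet into data_dir unless it is already cached.

    Returns (path, downloaded) where `downloaded` is False on a cache hit.
    """
    target = Path(data_dir) / "domains.parquet"
    if target.exists():
        return target, False            # later restarts skip the download

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete cached file on the next boot.
    partial = target.with_name(target.name + ".part")
    fetch(url, partial)
    partial.replace(target)
    return target, True
```

Passing `fetch` as a parameter keeps the function easy to exercise without network access.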

Environment variables (docker-compose.yml)

| Variable | Default | Description |
|---|---|---|
| DATA_DIR | /data | Where parquet + sqlite live |
| PARQUET_URL | GitHub raw URL | Source parquet |
| CONCURRENCY_LIMIT | 50 | Parallel enrichment workers |
| SCORE_THRESHOLD | 60 | "Hot lead" threshold |
| TARGET_TLDS | es,com,net | TLDs to prioritise |
| TARGET_COUNTRIES | ES,GB,DE,FR,RO,PT,AD,IT | Countries for scoring bonus |
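As a rough sketch of how these variables might be read at startup, the snippet below parses them with the documented defaults. The `Settings` dataclass and `load_settings` function are hypothetical names, and `PARQUET_URL` is left as `None` when unset because its actual default URL is not given here.

```python
import os
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Settings:
    data_dir: str
    parquet_url: Optional[str]
    concurrency_limit: int
    score_threshold: int
    target_tlds: Tuple[str, ...]
    target_countries: Tuple[str, ...]


def _csv(value):
    """Split a comma-separated value, dropping blanks and stray whitespace."""
    return tuple(part.strip() for part in value.split(",") if part.strip())


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        data_dir=env.get("DATA_DIR", "/data"),
        parquet_url=env.get("PARQUET_URL"),  # real default not shown in the README
        concurrency_limit=int(env.get("CONCURRENCY_LIMIT", "50")),
        score_threshold=int(env.get("SCORE_THRESHOLD", "60")),
        target_tlds=_csv(env.get("TARGET_TLDS", "es,com,net")),
        target_countries=_csv(env.get("TARGET_COUNTRIES", "ES,GB,DE,FR,RO,PT,AD,IT")),
    )
```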

Scoring

| Signal | Points |
|---|---|
| Domain is live | +20 |
| SSL expiry < 30 days | +15 |
| No valid SSL | +15 |
| Known CMS detected | +15 |
| No MX record | +10 |
| IP in target country | +10 |
| Shared hosting server | +10 |
| Local business keywords in title | +5 |

Max score: 100. Hot ≥ 80, Warm 50-79, Cold < 50.
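A minimal sketch of the scoring arithmetic, using the point values from the table above. The signal keys and the `score` and `tier` function names are hypothetical; only the point values and the Hot and Cold bounds come from this README, with everything in between treated as Warm.

```python
# Points per signal, copied from the scoring table (key names are assumed).
POINTS = {
    "live": 20,
    "ssl_expiring_soon": 15,     # SSL expiry < 30 days
    "no_valid_ssl": 15,
    "known_cms": 15,
    "no_mx": 10,
    "ip_in_target_country": 10,
    "shared_hosting": 10,
    "local_business_title": 5,
}


def score(signals):
    """Sum the points for each distinct signal present, capped at 100."""
    return min(100, sum(POINTS[name] for name in set(signals)))


def tier(value):
    """Map a score to its tier: hot >= 80, cold < 50, warm otherwise."""
    if value >= 80:
        return "hot"
    if value >= 50:
        return "warm"
    return "cold"
```

For example, a live domain on a known CMS with no valid SSL scores 20 + 15 + 15 = 50, which lands at the bottom of the Warm band.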

API

GET  /api/stats
GET  /api/domains?tld=es&page=1&limit=100&live_only=false
POST /api/enrich/batch      { "domains": ["example.com"] }
GET  /api/enrich/status
POST /api/enrich/pause
POST /api/enrich/resume
POST /api/enrich/retry
GET  /api/enriched?min_score=60&cms=wordpress&country=ES
GET  /api/export?tier=hot   (streams CSV)
POST /api/score/run
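The endpoints above can be called from a small client along these lines. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the one from Quick start, and sending the batch body as JSON with a `Content-Type: application/json` header is assumed rather than documented. Response formats are not specified here, so the helpers only build the requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:6677"


def enriched_url(base=BASE_URL, **filters):
    """Build a GET /api/enriched URL from keyword filters such as min_score."""
    return f"{base}/api/enriched?{urlencode(filters)}"


def batch_request(domains, base=BASE_URL):
    """Build the POST /api/enrich/batch request for a list of domains."""
    body = json.dumps({"domains": list(domains)}).encode("utf-8")
    return Request(
        f"{base}/api/enrich/batch",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )
```

With the dashboard running, `urllib.request.urlopen(batch_request(["example.com"]))` would submit the batch, and `urlopen(enriched_url(min_score=60, cms="wordpress", country="ES"))` would fetch the filtered results.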