commit 5ad8259c752172aa2b3b06e768738563749bd969
site_analyzer.py (new):
- Fresh scrape with timing, page size, server, CMS detection
- Lorem ipsum detection (16 phrases incl. user's example)
- Placeholder content detection (hello world, sample page, etc.)
- Analytics: GA4, GTM, Facebook Pixel, Hotjar, Clarity
- Webmaster: Google Search Console, Bing, Yandex verification tags
- sitemap.xml and robots.txt check + Googlebot block detection
- Mobile viewport check, word count, image/script count
- Full contact extraction: emails, phones, WhatsApp, social links
- Kit Digital signal detection
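The lorem ipsum and placeholder checks above can be sketched as simple phrase scans over the fetched page text. This is illustrative only: the real site_analyzer.py ships 16 lorem ipsum phrases and its own placeholder list, and the function names here are assumptions, not the actual API.

```python
import re

# Illustrative phrase lists -- the real module ships 16 lorem ipsum
# phrases; these are just representative samples.
LOREM_PHRASES = [
    "lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "sed do eiusmod tempor incididunt",
]
PLACEHOLDER_PHRASES = [
    "hello world",
    "sample page",
    "under construction",
    "coming soon",
]

def find_phrases(page_text: str, phrases: list) -> list:
    """Return every phrase found in the page, matching case-insensitively
    and tolerating arbitrary whitespace/newlines between words."""
    text = re.sub(r"\s+", " ", page_text).lower()
    return [p for p in phrases if p in text]

def content_quality(page_text: str) -> dict:
    """Bundle both checks the way the analyzer reports them."""
    return {
        "lorem_ipsum_matches": find_phrases(page_text, LOREM_PHRASES),
        "placeholder_matches": find_phrases(page_text, PLACEHOLDER_PHRASES),
    }
```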
AI worker fix:
- No longer requires pre-enrichment — works on ANY selected domain
- Does fresh site_analyzer scrape then calls Gemini with full context
- Stores site_analysis JSON alongside AI assessment
- Upserts into enriched_domains even if domain was never enriched
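The "upsert even if never enriched" behaviour can be done with a single SQLite `INSERT ... ON CONFLICT DO UPDATE`. A minimal sketch, assuming an illustrative schema (the real column names may differ):

```python
import json
import sqlite3

def upsert_ai_result(conn: sqlite3.Connection, domain: str,
                     site_analysis: dict, ai_assessment: dict) -> None:
    """Insert the row if the domain was never enriched; otherwise
    update the existing row in place (schema is an assumption)."""
    conn.execute(
        """
        INSERT INTO enriched_domains (domain, site_analysis, ai_assessment)
        VALUES (?, ?, ?)
        ON CONFLICT(domain) DO UPDATE SET
            site_analysis = excluded.site_analysis,
            ai_assessment = excluded.ai_assessment
        """,
        (domain, json.dumps(site_analysis), json.dumps(ai_assessment)),
    )
    conn.commit()
```

`ON CONFLICT(domain)` requires a unique constraint (e.g. primary key) on the `domain` column.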
Gemini prompt now includes:
- Complete technical snapshot (load time, size, server, SSL)
- Full SEO signals (sitemap, robots, analytics, webmaster verified)
- Content quality (lorem ipsum matches, placeholder matches)
- Kit Digital signals
- All extracted contacts
- 500-word page text sample
- Outputs: summary, site_quality_score/10, content_issues[],
urgency_signals[], performance_notes, seo_status,
best_contact_channel+value, all_contacts, ES pitch,
services_needed, outreach_notes
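Assembling that context into a prompt is mostly string formatting over the analyzer result. A sketch under assumed field names (the actual keys and wording in the Gemini prompt are not shown in this commit):

```python
def build_prompt(analysis: dict) -> str:
    """Assemble the model context from a site_analyzer result.
    Every key below is an illustrative name, not the exact one used."""
    lines = [
        f"Domain: {analysis['domain']}",
        f"Load time: {analysis['load_time_ms']} ms, size: {analysis['page_bytes']} bytes",
        f"Server: {analysis['server']}, SSL valid: {analysis['ssl_valid']}",
        f"Sitemap: {analysis['has_sitemap']}, robots.txt: {analysis['has_robots']}",
        f"Lorem ipsum matches: {', '.join(analysis['lorem_matches']) or 'none'}",
        f"Contacts: {analysis['contacts']}",
        "",
        "Page text sample (first 500 words):",
        " ".join(analysis["page_text"].split()[:500]),  # 500-word cap
        "",
        "Respond with JSON: summary, site_quality_score (0-10), "
        "content_issues[], urgency_signals[], best_contact_channel.",
    ]
    return "\n".join(lines)
```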
UI: rich AI modal with summary banner, quality grid, content issues,
urgency signals, full contact list, technical snapshot
Fixes: correct Replicate token, ai_queue status='running' bug
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# DomGod — Domain Intelligence Dashboard
Dockerized dashboard for filtering, enriching, scoring, and exporting leads from a 72M-domain dataset.
## Quick start

```sh
docker compose up --build
```

On first boot, the container downloads `domains.parquet` (~GB) and caches it in `./data/`. Subsequent restarts skip the download.
## Environment variables (docker-compose.yml)

| Variable | Default | Description |
|---|---|---|
| `DATA_DIR` | `/data` | Where parquet + sqlite live |
| `PARQUET_URL` | GitHub raw URL | Source parquet |
| `CONCURRENCY_LIMIT` | `50` | Parallel enrichment workers |
| `SCORE_THRESHOLD` | `60` | "Hot lead" threshold |
| `TARGET_TLDS` | `es,com,net` | TLDs to prioritise |
| `TARGET_COUNTRIES` | `ES,GB,DE,FR,RO,PT,AD,IT` | Countries for scoring bonus |
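These variables are overridden in the compose file. A hedged fragment showing the shape (the service name `domgod` and volume layout are assumptions, not taken from the actual file):

```yaml
# docker-compose.yml (fragment) -- service name and volumes are assumed
services:
  domgod:
    build: .
    environment:
      DATA_DIR: /data
      CONCURRENCY_LIMIT: "50"
      SCORE_THRESHOLD: "60"
      TARGET_TLDS: es,com,net
      TARGET_COUNTRIES: ES,GB,DE,FR,RO,PT,AD,IT
    volumes:
      - ./data:/data
```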
## Scoring
| Signal | Points |
|---|---|
| Domain is live | +20 |
| SSL expiry < 30 days | +15 |
| No valid SSL | +15 |
| Known CMS detected | +15 |
| No MX record | +10 |
| IP in target country | +10 |
| Shared hosting server | +10 |
| Local business keywords in title | +5 |
Max score: 100. Hot ≥ 80, Warm 50–79, Cold < 50.
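The table above translates directly into an additive scoring function. A minimal sketch; the signal keys are illustrative names, not the enrichment pipeline's actual field names:

```python
def score_domain(sig: dict) -> int:
    """Additive score mirroring the table; keys are assumptions."""
    score = 0
    if sig.get("is_live"):                  score += 20  # domain is live
    if sig.get("ssl_days_left", 999) < 30:  score += 15  # SSL expiring soon
    if not sig.get("ssl_valid"):            score += 15  # no valid SSL
    if sig.get("cms"):                      score += 15  # known CMS detected
    if not sig.get("has_mx"):               score += 10  # no MX record
    if sig.get("country_in_targets"):       score += 10  # IP in target country
    if sig.get("shared_hosting"):           score += 10  # shared hosting server
    if sig.get("local_keywords"):           score += 5   # local keywords in title
    return score

def tier(score: int) -> str:
    """Hot >= 80, Warm 50-79, Cold < 50."""
    return "hot" if score >= 80 else "warm" if score >= 50 else "cold"
```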
## API

```
GET  /api/stats
GET  /api/domains?tld=es&page=1&limit=100&live_only=false
POST /api/enrich/batch   { "domains": ["example.com"] }
GET  /api/enrich/status
POST /api/enrich/pause
POST /api/enrich/resume
POST /api/enrich/retry
GET  /api/enriched?min_score=60&cms=wordpress&country=ES
GET  /api/export?tier=hot   (streams CSV)
POST /api/score/run
```
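Client-side, the filtered endpoints are plain query strings. A small helper sketch using the documented parameters (the base URL and port are assumptions; check your compose file):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8000"  # assumed host/port, not from the repo

def enriched_url(min_score: int = 60, cms: str = "", country: str = "") -> str:
    """Build a GET /api/enriched URL from the documented filters,
    omitting filters that are left empty."""
    params = {"min_score": min_score}
    if cms:
        params["cms"] = cms
    if country:
        params["country"] = country
    return f"{BASE}/api/enriched?{urlencode(params)}"

def export_url(tier: str = "hot") -> str:
    """GET /api/export streams a CSV filtered by lead tier."""
    return f"{BASE}/api/export?{urlencode({'tier': tier})}"
```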