# feat: initial Dockerized domain intelligence dashboard
# - FastAPI backend with DuckDB pushdown queries on 72M parquet
# - Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
# - Resumable parquet download with HTTP Range support
# - Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
# - Single-file Alpine.js + Chart.js dashboard on port 6677
# - SQLite enrichment DB with job queue and scores tables
# - Dockerized with persistent /data volume
# Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 2026-04-13 16:22:30 +02:00
import os
# feat: persistent DuckDB index, new filters, pagination fix, enrich UX
# - Build /data/domains.duckdb on first run (tld+parts columns + ART index)
#   → TLD filter goes from ~60s full scan to <100ms index lookup
#   → System still works (slower) while index builds in background
# - New /api/domains params: alpha_only, no_sld, keyword
#   → alpha_only: domains with only letters (no hyphens/numbers)
#   → no_sld: parts=2, excludes com.es / net.es patterns
#   → keyword: LIKE '%term%' niche search
# - /api/domains and /api/enriched now return total count for pagination
# - Pagination: shows total matches, page X of Y, Next disabled at last page
# - Enrich button: toast notifications instead of alert(), error handling
# - Select all on page button, clear selection button
# - Stats/TLD breakdown cached after first load (no repeat full scan)
# - Header shows index build status (building → ready)
# Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# 2026-04-13 17:00:08 +02:00
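The filter semantics above can be sketched as WHERE-clause fragments for the indexed DuckDB `domains` table. This is an illustrative sketch only: `build_filters` is a hypothetical helper, not this module's actual query builder, and the exact SQL used by `_domains_sync` may differ.

```python
def build_filters(tld=None, alpha_only=False, no_sld=False, keyword=None):
    """Assemble (where_sql, params) for the indexed DuckDB domains table."""
    conditions, params = [], []
    if tld:
        conditions.append("tld = ?")  # served by the tld index, not a full scan
        params.append(tld.lower())
    if alpha_only:
        # every label is letters only: no hyphens, no digits
        conditions.append("regexp_matches(domain, '^[a-z]+(\\.[a-z]+)+$')")
    if no_sld:
        # exactly two labels (foo.es), which excludes com.es / net.es third-level names
        conditions.append("parts = 2")
    if keyword:
        conditions.append("domain LIKE ?")  # '%term%' niche search
        params.append(f"%{keyword}%")
    where = " WHERE " + " AND ".join(conditions) if conditions else ""
    return where, params
```

The same fragment list feeds both the COUNT query (for pagination totals) and the paged data query, which is why the real code keeps separate `params_count` / `params_data` lists.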
import asyncio
import logging
import aiosqlite
import duckdb
from pathlib import Path
logger = logging.getLogger(__name__)
DATA_DIR = Path(os.getenv("DATA_DIR", "/data"))
PARQUET_PATH = DATA_DIR / "domains.parquet"
DUCKDB_PATH = DATA_DIR / "domains.duckdb"
SQLITE_PATH = DATA_DIR / "enrichment.db"

SCHEMA = """
CREATE TABLE IF NOT EXISTS enriched_domains (
    domain TEXT PRIMARY KEY,
    is_live INTEGER DEFAULT 0,
    status_code INTEGER,
    ssl_valid INTEGER DEFAULT 0,
    ssl_expiry_days INTEGER,
    cms TEXT,
    has_mx INTEGER DEFAULT 0,
    ip_country TEXT,
    page_title TEXT,
    server TEXT,
    enriched_at TEXT,
    error TEXT,
-- feat: Gemini AI assessment, Kit Digital detection, contact extraction
-- Kit Digital detection (enricher.py):
-- - Scans img src/alt/srcset for digitalizadores, kit-digital, fondos-europeos etc
-- - Scans page text for Kit Digital, Agente Digitalizador, Next Generation EU, PRTR
-- - Scans links for acelerapyme.es, red.es, kit-digital refs
-- - +20 score bonus for Kit Digital confirmed sites (proven IT buyers)
-- Contact extraction (enricher.py):
-- - Pulls mailto/tel/wa.me links from HTML
-- - Extracts email addresses via regex, phone numbers (ES format)
-- - Detects social media links (FB, IG, LinkedIn, Twitter, TikTok)
-- - Stored as JSON in contact_info column
-- Gemini via Replicate (replicate_ai.py):
-- - Assesses lead quality (HOT/WARM/COLD), Kit Digital confirmation
-- - Identifies best contact channel + actual value (email/phone/WA)
-- - Writes Spanish cold-call/email pitch angle
-- - Lists services likely needed + outreach notes
-- - 3 concurrent requests, 90s timeout, JSON output parsing
-- DB: migration adds kit_digital, kit_digital_signals, contact_info,
-- ai_assessment, ai_lead_quality, ai_pitch, ai_contact_channel/value,
-- ai_queue table
-- UI: Kit Digital 🏅 badge, AI quality pill (clickable modal with full
-- assessment), contact chips (email/phone/WA/social), AI Assess button,
-- Kit Digital only filter, AI queue status in enrichment tab
-- Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
-- 2026-04-13 17:25:06 +02:00
    score INTEGER DEFAULT 0,
    kit_digital INTEGER DEFAULT 0,
    kit_digital_signals TEXT,
    contact_info TEXT,
    ai_assessment TEXT,
    ai_lead_quality TEXT,
    ai_pitch TEXT,
    ai_contact_channel TEXT,
    ai_contact_value TEXT,
-- feat: deep site analysis engine + fix AI assess for any domain
-- site_analyzer.py (new):
-- - Fresh scrape with timing, page size, server, CMS detection
-- - Lorem ipsum detection (16 phrases incl. user's example)
-- - Placeholder content detection (hello world, sample page, etc.)
-- - Analytics: GA4, GTM, Facebook Pixel, Hotjar, Clarity
-- - Webmaster: Google Search Console, Bing, Yandex verification tags
-- - sitemap.xml and robots.txt check + Googlebot block detection
-- - Mobile viewport check, word count, image/script count
-- - Full contact extraction: emails, phones, WhatsApp, social links
-- - Kit Digital signal detection
-- AI worker fix:
-- - No longer requires pre-enrichment — works on ANY selected domain
-- - Does fresh site_analyzer scrape then calls Gemini with full context
-- - Stores site_analysis JSON alongside AI assessment
-- - Upserts into enriched_domains even if domain was never enriched
-- Gemini prompt now includes:
-- - Complete technical snapshot (load time, size, server, SSL)
-- - Full SEO signals (sitemap, robots, analytics, webmaster verified)
-- - Content quality (lorem ipsum matches, placeholder matches)
-- - Kit Digital signals
-- - All extracted contacts
-- - 500-word page text sample
-- - Outputs: summary, site_quality_score/10, content_issues[],
--   urgency_signals[], performance_notes, seo_status,
--   best_contact_channel+value, all_contacts, ES pitch,
--   services_needed, outreach_notes
-- UI: rich AI modal with summary banner, quality grid, content issues,
-- urgency signals, full contact list, technical snapshot
-- Fixes: correct Replicate token, ai_queue status='running' bug
-- Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
-- 2026-04-13 17:46:01 +02:00
    ai_assessed_at TEXT,
    -- added 2026-04-17
    site_analysis TEXT,
    prescreen_status TEXT,
    niche TEXT,
    site_type TEXT,
    -- added 2026-04-18
    prescreen_at TEXT,
    ip TEXT,
    load_time_ms INTEGER
);

CREATE TABLE IF NOT EXISTS job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    domain TEXT UNIQUE NOT NULL,
    status TEXT DEFAULT 'pending',
    created_at TEXT DEFAULT (datetime('now')),
    started_at TEXT,
    completed_at TEXT,
    error TEXT
);
CREATE TABLE IF NOT EXISTS ai_queue (
    domain TEXT PRIMARY KEY,
    status TEXT DEFAULT 'pending',
    created_at TEXT DEFAULT (datetime('now')),
    completed_at TEXT,
    -- added 2026-04-14
    error TEXT,
    language TEXT DEFAULT 'ES'
);
CREATE TABLE IF NOT EXISTS scores (
    domain TEXT PRIMARY KEY,
    score INTEGER NOT NULL,
    scored_at TEXT DEFAULT (datetime('now'))
);
"""
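The contact extraction described in the Gemini/Kit Digital changelog boils down to a few regexes over raw HTML. Below is a minimal sketch; the patterns and the `extract_contacts` helper are assumptions for illustration (the ES phone format and the mailto/tel/wa.me schemes follow the commit notes), not the actual enricher.py code.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Spanish numbers: optional +34 prefix, then 9 digits starting with 6/7/8/9
PHONE_ES_RE = re.compile(r"(?:\+34[\s.-]?)?[6789]\d{2}[\s.-]?\d{3}[\s.-]?\d{3}")
LINK_RE = re.compile(r'href=["\'](mailto:|tel:|https?://wa\.me/)([^"\']+)["\']')

def extract_contacts(html: str) -> dict:
    """Pull emails, ES-format phones, and mailto/tel/wa.me links from raw HTML."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(html))),
        # normalize phones by stripping separators before deduplicating
        "phones": sorted({re.sub(r"[\s.-]", "", m) for m in PHONE_ES_RE.findall(html)}),
        "links": [scheme + rest for scheme, rest in LINK_RE.findall(html)],
    }
```

The resulting dict is what would land as JSON in the `contact_info` column defined above.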
# Columns added after the initial release; applied as migrations on existing DBs
_MIGRATIONS = [
    "ALTER TABLE enriched_domains ADD COLUMN kit_digital INTEGER DEFAULT 0",
    "ALTER TABLE enriched_domains ADD COLUMN kit_digital_signals TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN contact_info TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_assessment TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_lead_quality TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_pitch TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_contact_channel TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_contact_value TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN ai_assessed_at TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN site_analysis TEXT",
    "CREATE TABLE IF NOT EXISTS ai_queue (domain TEXT PRIMARY KEY, status TEXT DEFAULT 'pending', created_at TEXT DEFAULT (datetime('now')), completed_at TEXT, error TEXT)",
    # added 2026-04-14
    "ALTER TABLE ai_queue ADD COLUMN language TEXT DEFAULT 'ES'",
    # added 2026-04-17
    "ALTER TABLE enriched_domains ADD COLUMN prescreen_status TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN niche TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN site_type TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN prescreen_at TEXT",
    # added 2026-04-18
    "ALTER TABLE enriched_domains ADD COLUMN ip TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN load_time_ms INTEGER",
    # feat: BeautyLeads B2B cosmetics frontend on port 7788
    # New service (app/beauty_main.py) sharing the same /data volume:
    # - Separate FastAPI app running on port 7788
    # - beauty_ai.py: brand universe scan (~650 brands), portfolio match
    #   detection against OUR_BRANDS, Gemini B2B assessment prompt in Spanish
    #   returning quality/categories/dist_matches/outreach_email
    # - beauty_queue table + beauty_lead_quality/beauty_assessment columns
    #   in enriched_domains (with migrations)
    # - Endpoints: /api/beauty/assess/batch, /api/beauty/leads,
    #   /api/beauty/status, /api/beauty/export, /api/beauty/reset
    # - Static frontend: Browse (beauty/ecommerce pre-filtered, no CMS/SSL/KD
    #   columns), Validator, B2B Pipeline (brand chips, expandable outreach),
    #   Pre-screen, Export CSV
    # - docker-compose: second 'beauty' service with shared data volume
    # - Dockerfile: expose 7788 alongside 6677
    # Also: add 'error' prescreen_status handling + UI (orange stat box,
    # filter option) for 4xx/5xx HTTP responses
    # Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    # 2026-05-04 19:31:10 +02:00
    "ALTER TABLE enriched_domains ADD COLUMN beauty_lead_quality TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN beauty_assessment TEXT",
    "ALTER TABLE enriched_domains ADD COLUMN beauty_assessed_at TEXT",
    """CREATE TABLE IF NOT EXISTS beauty_queue (
        domain TEXT PRIMARY KEY,
        status TEXT DEFAULT 'pending',
        created_at TEXT DEFAULT (datetime('now')),
        completed_at TEXT,
        error TEXT
    )""",
]
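The lorem ipsum and placeholder checks from the site-analysis changelog amount to case-insensitive phrase matching over the scraped page text. A sketch under the assumption that site_analyzer.py keeps plain phrase lists; the lists below are illustrative stand-ins, not the full 16-phrase set.

```python
LOREM_PHRASES = ["lorem ipsum", "dolor sit amet", "consectetur adipiscing"]
PLACEHOLDER_PHRASES = ["hello world", "sample page", "coming soon",
                       "under construction", "default web site page"]

def content_issues(page_text: str) -> dict:
    """Flag boilerplate text that marks an unfinished or abandoned site."""
    text = page_text.lower()
    lorem = [p for p in LOREM_PHRASES if p in text]
    placeholder = [p for p in PLACEHOLDER_PHRASES if p in text]
    return {
        "lorem_matches": lorem,
        "placeholder_matches": placeholder,
        "looks_unfinished": bool(lorem or placeholder),
    }
```

Sites that trip these checks are exactly the urgency signals the AI assessment prompt feeds to Gemini: a live domain with placeholder content is a strong lead for web services.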
# Index build state
_index_ready = False
_index_building = False
_index_total = 0
# Cached stats (TLD breakdown is expensive; compute once)
_tld_cache: list = []
_total_cache: int = 0
async def init_db():
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        # WAL mode: concurrent reads don't block on writes; write lock held briefly
        await db.execute("PRAGMA journal_mode=WAL")
        await db.execute("PRAGMA busy_timeout=30000")
        await db.executescript(SCHEMA)
        # Run migrations (safe to re-run; existing columns are skipped silently)
        for sql in _MIGRATIONS:
            try:
                await db.execute(sql)
            except Exception:
                pass
        await db.commit()
# ── DuckDB persistent index ──────────────────────────────────────────────────
def _build_index_sync():
    global _index_ready, _index_building, _index_total
    _index_building = True
    try:
        conn = duckdb.connect(str(DUCKDB_PATH))
        conn.execute("SET threads=4")
        conn.execute("SET memory_limit='2GB'")
        # Check if already built
        try:
            n = conn.execute("SELECT COUNT(*) FROM domains").fetchone()[0]
            if n > 0:
                _index_total = n
                _index_ready = True
                _index_building = False
                logger.info("DuckDB index already ready (%d rows)", n)
                conn.close()
                return
        except Exception:
            pass
        logger.info("Building DuckDB index from parquet (one-time, ~2-3 min)...")
        conn.execute("""
            CREATE OR REPLACE TABLE domains AS
            SELECT
                domain,
                lower(regexp_extract(domain, '\\.([^.]+)$', 1)) AS tld,
                len(string_split(domain, '.')) AS parts
            FROM read_parquet(?)
        """, [str(PARQUET_PATH)])
        conn.execute("CREATE INDEX IF NOT EXISTS idx_tld ON domains(tld)")
        _index_total = conn.execute("SELECT COUNT(*) FROM domains").fetchone()[0]
        conn.close()
        _index_ready = True
        logger.info("DuckDB index built: %d rows", _index_total)
    except Exception as e:
        logger.error("DuckDB index build failed: %s", e)
    finally:
        _index_building = False


async def build_duckdb_index():
    # Run the blocking build in a thread so the event loop stays responsive
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, _build_index_sync)


def index_status() -> dict:
    return {
        "ready": _index_ready,
        "building": _index_building,
        "total": _index_total,
    }
# ── Domain queries ───────────────────────────────────────────────────────────
def _domains_sync(tld, page, limit, alpha_only, no_sld, keyword):
    conditions = []
    params_count = []
    params_data = []
    if _index_ready:
        source = "domains"
    else:
        source = f"read_parquet('{PARQUET_PATH}')"
    def _add(clause, val=None):
        conditions.append(clause)
        if val is not None:
            params_count.append(val)
            params_data.append(val)
    if tld:
        if _index_ready:
            _add("tld = ?", tld.lower().lstrip("."))
        else:
            _add("lower(regexp_extract(domain, '\\.([^.]+)$', 1)) = ?", tld.lower().lstrip("."))
    if no_sld:
        if _index_ready:
            _add("parts = 2")
        else:
            _add("len(string_split(domain, '.')) = 2")
    if alpha_only:
        _add("NOT regexp_matches(domain, '[^a-zA-Z.]')")
    if keyword:
        _add("domain LIKE ?", f"%{keyword.lower()}%")

    where = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    offset = (page - 1) * limit

    if _index_ready:
        conn = duckdb.connect(str(DUCKDB_PATH), read_only=True)
    else:
        conn = duckdb.connect(":memory:")
        conn.execute("SET threads=4")

    total = conn.execute(f"SELECT COUNT(*) FROM {source} {where}", params_count).fetchone()[0]
    rows = conn.execute(
        f"SELECT domain FROM {source} {where} LIMIT {limit} OFFSET {offset}", params_data
    ).fetchall()
    conn.close()
    return total, [r[0] for r in rows]
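The `_add` helper above grows one WHERE clause incrementally while keeping the bound-parameter lists in step with it. A minimal, dependency-free sketch of the same pattern (sqlite3 stands in for DuckDB here, and the table plus sample rows are made up for illustration):

```python
import sqlite3

# Accumulate WHERE clauses; only predicates with a bound value push a parameter.
conditions: list[str] = []
params: list = []

def _add(clause: str, val=None) -> None:
    conditions.append(clause)
    if val is not None:
        params.append(val)

_add("tld = ?", "es")
_add("parts = 2")  # literal predicate, no bound value

where = ("WHERE " + " AND ".join(conditions)) if conditions else ""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (domain TEXT, tld TEXT, parts INTEGER)")
conn.executemany("INSERT INTO domains VALUES (?,?,?)", [
    ("foo.es", "es", 2), ("bar.com.es", "es", 3), ("baz.com", "com", 2),
])
rows = conn.execute(f"SELECT domain FROM domains {where}", params).fetchall()
```

Keeping the clause list and parameter list in the same helper is what prevents the classic off-by-one between `?` placeholders and values.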
async def get_domains(tld=None, page=1, limit=100, alpha_only=False, no_sld=False, keyword=None, live_only=False):
    loop = asyncio.get_event_loop()
    total, domain_list = await loop.run_in_executor(
        None, _domains_sync, tld, page, limit, alpha_only, no_sld, keyword
    )
    if not domain_list:
        return total, []
    placeholders = ",".join("?" * len(domain_list))
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        db.row_factory = aiosqlite.Row
        async with db.execute(
            f"SELECT * FROM enriched_domains WHERE domain IN ({placeholders})",
            domain_list,
        ) as cur:
            enriched_map = {r["domain"]: dict(r) async for r in cur}

    results = []
    for d in domain_list:
        row = enriched_map.get(d, {"domain": d})
        if live_only and not row.get("is_live"):
            continue
        results.append(row)
    return total, results
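`get_domains` overlays SQLite enrichment rows onto the parquet page without losing the page order. The merge step in isolation (the sample domains and enrichment row are hypothetical):

```python
# Page of domains keeps its order; enriched rows (keyed by domain) replace the
# bare stubs where they exist, and live_only drops rows that never responded.
domain_list = ["a.es", "b.es", "c.es"]
enriched_map = {"b.es": {"domain": "b.es", "is_live": 1, "score": 72}}
live_only = False

results = []
for d in domain_list:
    row = enriched_map.get(d, {"domain": d})  # stub for un-enriched domains
    if live_only and not row.get("is_live"):
        continue
    results.append(row)
```

Note that `total` still counts all matches in the parquet, so `live_only` filtering only thins the current page, not the reported total.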
# ── Stats ────────────────────────────────────────────────────────────────────
def _tld_stats_sync() -> tuple[int, list]:
    if _index_ready:
        conn = duckdb.connect(str(DUCKDB_PATH), read_only=True)
        total = conn.execute("SELECT COUNT(*) FROM domains").fetchone()[0]
        rows = conn.execute("""
            SELECT tld, COUNT(*) AS cnt FROM domains
            WHERE tld != ''
            GROUP BY tld ORDER BY cnt DESC LIMIT 20
        """).fetchall()
        conn.close()
    else:
        p = str(PARQUET_PATH)
        conn = duckdb.connect(":memory:")
        conn.execute("SET threads=4")
        total = conn.execute(f"SELECT COUNT(*) FROM read_parquet('{p}')").fetchone()[0]
        rows = conn.execute(f"""
            SELECT lower(regexp_extract(domain, '\\.([^.]+)$', 1)) AS tld, COUNT(*) AS cnt
            FROM read_parquet('{p}')
            GROUP BY tld ORDER BY cnt DESC LIMIT 20
        """).fetchall()
        conn.close()
    return total, [{"tld": r[0], "count": r[1]} for r in rows]
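The fallback branch derives the TLD with `regexp_extract(domain, '\.([^.]+)$', 1)`, i.e. everything after the last dot. The same extraction in plain Python, for reference:

```python
import re

# Python equivalent of the DuckDB expression used in the slow-path branch:
# capture the segment after the final dot, lowercased; empty if there is no dot.
def extract_tld(domain: str) -> str:
    m = re.search(r"\.([^.]+)$", domain)
    return m.group(1).lower() if m else ""
```

This is why `shop.com.es` counts under `es`, not `com.es`; second-level suffixes are handled separately via the `parts` column.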
async def get_stats():
    global _tld_cache, _total_cache
    # Compute TLD breakdown once and cache it
    if not _tld_cache:
        loop = asyncio.get_event_loop()
        _total_cache, _tld_cache = await loop.run_in_executor(None, _tld_stats_sync)
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        async with db.execute("SELECT COUNT(*) FROM enriched_domains") as cur:
            enriched = (await cur.fetchone())[0]
        threshold = int(os.getenv("SCORE_THRESHOLD", "60"))
        async with db.execute("SELECT COUNT(*) FROM enriched_domains WHERE score >= ?", (threshold,)) as cur:
            hot_leads = (await cur.fetchone())[0]
feat: Gemini AI assessment, Kit Digital detection, contact extraction
Kit Digital detection (enricher.py):
- Scans img src/alt/srcset for digitalizadores, kit-digital, fondos-europeos etc
- Scans page text for Kit Digital, Agente Digitalizador, Next Generation EU, PRTR
- Scans links for acelerapyme.es, red.es, kit-digital refs
- +20 score bonus for Kit Digital confirmed sites (proven IT buyers)
Contact extraction (enricher.py):
- Pulls mailto/tel/wa.me links from HTML
- Extracts email addresses via regex, phone numbers (ES format)
- Detects social media links (FB, IG, LinkedIn, Twitter, TikTok)
- Stored as JSON in contact_info column
Gemini via Replicate (replicate_ai.py):
- Assesses lead quality (HOT/WARM/COLD), Kit Digital confirmation
- Identifies best contact channel + actual value (email/phone/WA)
- Writes Spanish cold-call/email pitch angle
- Lists services likely needed + outreach notes
- 3 concurrent requests, 90s timeout, JSON output parsing
DB: migration adds kit_digital, kit_digital_signals, contact_info,
ai_assessment, ai_lead_quality, ai_pitch, ai_contact_channel/value,
ai_queue table
UI: Kit Digital 🏅 badge, AI quality pill (clickable modal with full
assessment), contact chips (email/phone/WA/social), AI Assess button,
Kit Digital only filter, AI queue status in enrichment tab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:25:06 +02:00
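A rough sketch of the contact-extraction idea described in the notes above. The real patterns live in enricher.py and are not shown in this file, so both regexes below are assumptions, not the shipped ones:

```python
import re

# Hypothetical patterns illustrating the approach: pull email addresses and
# Spanish-format phone numbers out of raw HTML (the shipped enricher.py
# patterns may differ).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ES_PHONE_RE = re.compile(r"(?:\+34[\s-]?)?[6789]\d{2}[\s-]?\d{3}[\s-]?\d{3}")

html = '<a href="mailto:info@acme.es">info@acme.es</a> Tel: +34 612 345 678'
emails = sorted(set(EMAIL_RE.findall(html)))   # dedupe mailto href vs. link text
phones = ES_PHONE_RE.findall(html)
```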
        async with db.execute("SELECT COUNT(*) FROM enriched_domains WHERE kit_digital=1") as cur:
            kit_digital_count = (await cur.fetchone())[0]
        async with db.execute("SELECT status, COUNT(*) FROM job_queue GROUP BY status") as cur:
            q = {r[0]: r[1] async for r in cur}
    return {
        "total_domains": _total_cache,
        "enriched": enriched,
        "hot_leads": hot_leads,
        "kit_digital_count": kit_digital_count,
        "tld_breakdown": _tld_cache,
        "index_status": index_status(),
        "queue": {
            "pending": q.get("pending", 0),
            "running": q.get("running", 0),
            "done": q.get("done", 0),
            "failed": q.get("failed", 0),
        },
    }
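The total counts returned alongside each page are what let the dashboard render "page X of Y" and disable Next on the last page. The arithmetic, as a sketch (`page_info` is an illustrative helper, not part of this codebase):

```python
import math

# Derive the page count and Next-button state from a total-match count.
def page_info(total: int, page: int, limit: int = 100) -> dict:
    pages = max(1, math.ceil(total / limit))  # at least one page, even when empty
    return {"pages": pages, "has_next": page < pages}
```

Clamping to a minimum of one page keeps the "page 1 of 1" display sane when a filter matches nothing.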
# ── Enrichment helpers ───────────────────────────────────────────────────────
async def get_enriched(min_score=0, cms=None, country=None, kit_digital=None,
                       ai_only=False, lead_quality=None,
                       prescreen_status=None, niche=None, site_type=None,
                       keyword=None, tld=None,
                       page=1, limit=100):
    offset = (page - 1) * limit
    conditions = ["score >= ?"]
    params: list = [min_score]
    if cms:
        conditions.append("cms = ?")
        params.append(cms)
    if country:
        conditions.append("ip_country = ?")
        params.append(country.upper())
    if kit_digital is not None:
        conditions.append("kit_digital = ?")
        params.append(1 if kit_digital else 0)
    if ai_only:
        conditions.append("ai_lead_quality IS NOT NULL")
    if lead_quality:
        conditions.append("ai_lead_quality = ?")
        params.append(lead_quality.upper())
    if prescreen_status == "none":
        conditions.append("prescreen_status IS NULL")
    elif prescreen_status:
        conditions.append("prescreen_status = ?")
        params.append(prescreen_status)
    if niche:
        conditions.append("niche = ?")
        params.append(niche)
    if site_type:
        conditions.append("site_type = ?")
        params.append(site_type)
    if keyword:
        kw = f"%{keyword.lower()}%"
        conditions.append("(LOWER(domain) LIKE ? OR LOWER(COALESCE(page_title, '')) LIKE ?)")
        params.extend([kw, kw])
    if tld:
        tld_clean = tld.lower().lstrip(".")
        conditions.append("LOWER(domain) LIKE ?")
        params.append(f"%.{tld_clean}")
    where = "WHERE " + " AND ".join(conditions)
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        db.row_factory = aiosqlite.Row
        async with db.execute(
            f"SELECT * FROM enriched_domains {where} ORDER BY score DESC LIMIT ? OFFSET ?",
            params + [limit, offset],
        ) as cur:
            rows = [dict(r) async for r in cur]
        async with db.execute(
            f"SELECT COUNT(*) FROM enriched_domains {where}", params
        ) as cur:
            total = (await cur.fetchone())[0]
        return total, rows
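# The filter-building pattern above (accumulate conditions + bound params in
# lockstep, then join into one WHERE clause) can be exercised synchronously.
# A minimal sketch with the stdlib sqlite3 module standing in for aiosqlite;
# the table shape and sample rows below are illustrative, not the real schema.

```python
import sqlite3


def build_where(min_score=0, cms=None, tld=None):
    # Same lockstep accumulation as get_enriched(): one "?" per param.
    conditions, params = ["score >= ?"], [min_score]
    if cms:
        conditions.append("cms = ?")
        params.append(cms)
    if tld:
        conditions.append("LOWER(domain) LIKE ?")
        params.append(f"%.{tld.lower().lstrip('.')}")
    return "WHERE " + " AND ".join(conditions), params


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE enriched_domains (domain TEXT, score INT, cms TEXT)")
con.executemany(
    "INSERT INTO enriched_domains VALUES (?, ?, ?)",
    [("foo.es", 80, "wordpress"), ("bar.com", 40, None), ("baz.es", 90, "wordpress")],
)
where, params = build_where(min_score=50, cms="wordpress", tld="es")
rows = con.execute(
    f"SELECT domain FROM enriched_domains {where} ORDER BY score DESC", params
).fetchall()
print([r[0] for r in rows])  # ['baz.es', 'foo.es']
```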
async def queue_ai(domains: list[str], language: str = "ES"):
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        await db.executemany(
            """INSERT INTO ai_queue (domain, language) VALUES (?, ?)
               ON CONFLICT(domain) DO UPDATE SET language = excluded.language, status = 'pending'""",
            [(d, language) for d in domains],
        )
        await db.commit()


async def get_ai_queue_status():
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        async with db.execute("SELECT status, COUNT(*) FROM ai_queue GROUP BY status") as cur:
            rows = {r[0]: r[1] async for r in cur}
        return {
            "pending": rows.get("pending", 0),
            "running": rows.get("running", 0),
            "done": rows.get("done", 0),
            "failed": rows.get("failed", 0),
            "total": sum(rows.values()),
        }
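# The re-queue semantics of queue_ai() hinge on the ON CONFLICT upsert: an
# already-processed domain gets its status reset to 'pending' when enqueued
# again. A minimal sketch with stdlib sqlite3; the ai_queue columns here are
# an assumption trimmed to the ones referenced in this module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE ai_queue (
           domain TEXT PRIMARY KEY,
           language TEXT,
           status TEXT DEFAULT 'pending'
       )"""
)


def enqueue(domains, language="ES"):
    # ON CONFLICT needs the unique constraint on domain (PRIMARY KEY here).
    con.executemany(
        """INSERT INTO ai_queue (domain, language) VALUES (?, ?)
           ON CONFLICT(domain) DO UPDATE SET
               language = excluded.language, status = 'pending'""",
        [(d, language) for d in domains],
    )


enqueue(["a.es", "b.es"])
con.execute("UPDATE ai_queue SET status='done' WHERE domain='a.es'")
enqueue(["a.es"], language="EN")  # re-queue resets status back to 'pending'
print(con.execute("SELECT language, status FROM ai_queue WHERE domain='a.es'").fetchone())
# → ('EN', 'pending')
```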
async def save_ai_assessment(domain: str, assessment: dict, site_analysis: dict | None = None):
    import json as _json

    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        # Upsert into enriched_domains (domain may not exist yet if assessed before full enrichment)
        await db.execute(
            """INSERT INTO enriched_domains (domain) VALUES (?) ON CONFLICT(domain) DO NOTHING""",
            (domain,),
        )
        await db.execute(
            """UPDATE enriched_domains SET
                   ai_assessment = ?, ai_lead_quality = ?, ai_pitch = ?,
                   ai_contact_channel = ?, ai_contact_value = ?, ai_assessed_at = datetime('now'),
                   site_analysis = ?
               WHERE domain = ?""",
            (
                _json.dumps(assessment),
                assessment.get("lead_quality"),
                assessment.get("pitch_angle"),
                assessment.get("best_contact_channel"),
                assessment.get("best_contact_value"),
                _json.dumps(site_analysis) if site_analysis else None,
                domain,
            ),
        )
        # Update contact_info + kit_digital from site_analysis if available.
        # Gemini's kit_digital_confirmed is the authoritative verdict; it can
        # override a false positive from the heuristic scanner.
        if site_analysis:
            contacts = {
                "emails": site_analysis.get("emails", []),
                "phones": site_analysis.get("phones", []),
                "whatsapp": site_analysis.get("whatsapp", []),
                "social": site_analysis.get("social_links", []),
            }
            # Prefer Gemini's explicit verdict; fall back to the heuristic if null
            ai_kit = assessment.get("kit_digital_confirmed")
            kit_val = int(ai_kit) if ai_kit is not None else int(site_analysis.get("kit_digital", False))
feat: deep site analysis engine + fix AI assess for any domain
site_analyzer.py (new):
- Fresh scrape with timing, page size, server, CMS detection
- Lorem ipsum detection (16 phrases incl. user's example)
- Placeholder content detection (hello world, sample page, etc.)
- Analytics: GA4, GTM, Facebook Pixel, Hotjar, Clarity
- Webmaster: Google Search Console, Bing, Yandex verification tags
- sitemap.xml and robots.txt check + Googlebot block detection
- Mobile viewport check, word count, image/script count
- Full contact extraction: emails, phones, WhatsApp, social links
- Kit Digital signal detection
AI worker fix:
- No longer requires pre-enrichment — works on ANY selected domain
- Does fresh site_analyzer scrape then calls Gemini with full context
- Stores site_analysis JSON alongside AI assessment
- Upserts into enriched_domains even if domain was never enriched
Gemini prompt now includes:
- Complete technical snapshot (load time, size, server, SSL)
- Full SEO signals (sitemap, robots, analytics, webmaster verified)
- Content quality (lorem ipsum matches, placeholder matches)
- Kit Digital signals
- All extracted contacts
- 500-word page text sample
- Outputs: summary, site_quality_score/10, content_issues[],
  urgency_signals[], performance_notes, seo_status,
  best_contact_channel+value, all_contacts, ES pitch,
  services_needed, outreach_notes
UI: rich AI modal with summary banner, quality grid, content issues,
urgency signals, full contact list, technical snapshot
Fixes: correct Replicate token, ai_queue status='running' bug
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:46:01 +02:00
await db.execute(
    """UPDATE enriched_domains SET
           kit_digital = ?, kit_digital_signals = ?, contact_info = ?
       WHERE domain = ?""",
    (
        kit_val,
        _json.dumps(site_analysis.get("kit_digital_signals", [])),
        _json.dumps(contacts),
        domain,
    ),
)
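The lorem-ipsum and placeholder detection that site_analyzer.py performs (per the commit message above) can be sketched roughly as below. The function name and the phrase lists are illustrative assumptions — the real module checks 16 lorem phrases; only a few representative ones appear here.

```python
import re

# Hypothetical sketch of placeholder-content detection; phrase lists are a
# reduced stand-in for site_analyzer.py's real 16-phrase set.
LOREM_PHRASES = ["lorem ipsum", "dolor sit amet", "consectetur adipiscing"]
PLACEHOLDER_PHRASES = ["hello world", "sample page", "coming soon", "under construction"]

def detect_placeholder_content(page_text: str) -> dict:
    """Return which lorem/placeholder phrases appear in the page text."""
    text = re.sub(r"\s+", " ", page_text).lower()
    lorem = [p for p in LOREM_PHRASES if p in text]
    placeholder = [p for p in PLACEHOLDER_PHRASES if p in text]
    return {
        "lorem_matches": lorem,
        "placeholder_matches": placeholder,
        "looks_placeholder": bool(lorem or placeholder),
    }
```

A simple substring scan over whitespace-normalized, lowercased text is enough here; no HTML parsing is needed once the page text has been extracted.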
feat: Gemini AI assessment, Kit Digital detection, contact extraction
Kit Digital detection (enricher.py):
- Scans img src/alt/srcset for digitalizadores, kit-digital, fondos-europeos, etc.
- Scans page text for Kit Digital, Agente Digitalizador, Next Generation EU, PRTR
- Scans links for acelerapyme.es, red.es, kit-digital refs
- +20 score bonus for Kit Digital confirmed sites (proven IT buyers)
Contact extraction (enricher.py):
- Pulls mailto/tel/wa.me links from HTML
- Extracts email addresses via regex, phone numbers (ES format)
- Detects social media links (FB, IG, LinkedIn, Twitter, TikTok)
- Stored as JSON in contact_info column
Gemini via Replicate (replicate_ai.py):
- Assesses lead quality (HOT/WARM/COLD), Kit Digital confirmation
- Identifies best contact channel + actual value (email/phone/WA)
- Writes Spanish cold-call/email pitch angle
- Lists services likely needed + outreach notes
- 3 concurrent requests, 90s timeout, JSON output parsing
DB: migration adds kit_digital, kit_digital_signals, contact_info,
ai_assessment, ai_lead_quality, ai_pitch, ai_contact_channel/value,
ai_queue table
UI: Kit Digital 🏅 badge, AI quality pill (clickable modal with full
assessment), contact chips (email/phone/WA/social), AI Assess button,
Kit Digital only filter, AI queue status in enrichment tab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:25:06 +02:00
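The contact extraction described above (emails, ES-format phones, social links) can be sketched as follows. The regexes and the social-domain list are assumptions for illustration, not the enricher.py originals.

```python
import re

# Hedged sketch of contact extraction; patterns are illustrative assumptions.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ES_PHONE_RE = re.compile(r"(?:\+34[\s.-]?)?[6789]\d{2}[\s.-]?\d{3}[\s.-]?\d{3}")
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "linkedin.com", "twitter.com", "tiktok.com")

def extract_contacts(html: str) -> dict:
    """Pull emails, Spanish-format phones and social links out of raw HTML."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    phones = sorted(set(m.strip() for m in ES_PHONE_RE.findall(html)))
    socials = sorted(set(
        url for url in re.findall(r'href="([^"]+)"', html)
        if any(host in url for host in SOCIAL_HOSTS)
    ))
    return {"emails": emails, "phones": phones, "social": socials}
```

Deduping through `set()` matters because contact details typically appear both in `mailto:`/`tel:` attributes and in the visible link text.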
await db.execute(
    "UPDATE ai_queue SET status='done', completed_at=datetime('now') WHERE domain=?",
    (domain,),
)
await db.commit()
async def save_prescreen_results(results: list[dict]):
    """Upsert prescreen HTTP results and/or DeepSeek niche/type classifications."""
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        for r in results:
            domain = r.get("domain")
            if not domain:
                continue
            niche = r.get("niche")
            site_type = r.get("type")  # DeepSeek returns "type" key
            if niche or site_type:
                # Upsert niche/type — works even if the row was never enriched
                await db.execute(
                    """INSERT INTO enriched_domains (domain, niche, site_type)
                       VALUES (?, ?, ?)
                       ON CONFLICT(domain) DO UPDATE SET
                           niche = excluded.niche,
                           site_type = excluded.site_type""",
                    (domain, niche, site_type),
                )
            else:
                # Prescreen status upsert — create row if it doesn't exist yet
                await db.execute(
                    """INSERT INTO enriched_domains (domain, prescreen_status, prescreen_at, page_title)
                       VALUES (?, ?, datetime('now'), ?)
                       ON CONFLICT(domain) DO UPDATE SET
                           prescreen_status = excluded.prescreen_status,
                           prescreen_at = excluded.prescreen_at,
                           page_title = COALESCE(page_title, excluded.page_title)""",
                    (domain, r.get("prescreen_status"), r.get("title")),
                )
        await db.commit()
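The prescreen upsert above uses `COALESCE(page_title, excluded.page_title)` so a title captured once is never overwritten by a later NULL. A minimal in-memory demo of that pattern, with a reduced stand-in schema (an assumption, not the real table):

```python
import sqlite3

# In-memory sketch of the upsert pattern used above (reduced schema):
# COALESCE(page_title, excluded.page_title) keeps a title once one is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enriched_domains (domain TEXT PRIMARY KEY, prescreen_status TEXT, page_title TEXT)"
)

def upsert(domain, status, title):
    conn.execute(
        """INSERT INTO enriched_domains (domain, prescreen_status, page_title)
           VALUES (?, ?, ?)
           ON CONFLICT(domain) DO UPDATE SET
               prescreen_status = excluded.prescreen_status,
               page_title = COALESCE(page_title, excluded.page_title)""",
        (domain, status, title),
    )

upsert("example.es", "ok", "Example S.L.")
upsert("example.es", "error", None)  # status updated, existing title preserved
row = conn.execute("SELECT prescreen_status, page_title FROM enriched_domains").fetchone()
```

After the second call `row` is `("error", "Example S.L.")`: the status tracks the latest prescreen while the first non-NULL title survives.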
async def queue_domains(domains: list[str]):
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        await db.executemany(
            "INSERT OR IGNORE INTO job_queue (domain) VALUES (?)",
            [(d,) for d in domains],
        )
        await db.commit()
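Queueing the same domain twice is harmless because `INSERT OR IGNORE` silently drops rows that would violate the primary key. A minimal in-memory sketch (the one-column schema is an assumption):

```python
import sqlite3

# INSERT OR IGNORE dedupes against the PRIMARY KEY: duplicate queue
# submissions become no-ops instead of errors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_queue (domain TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')")
conn.executemany(
    "INSERT OR IGNORE INTO job_queue (domain) VALUES (?)",
    [("a.es",), ("b.es",), ("a.es",)],  # "a.es" submitted twice
)
count = conn.execute("SELECT COUNT(*) FROM job_queue").fetchone()[0]
```

`count` comes back as 2, not 3, so re-selecting already-queued domains in the UI never double-enriches them.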
feat: BeautyLeads B2B cosmetics frontend on port 7788
New service (app/beauty_main.py) sharing the same /data volume:
- Separate FastAPI app running on port 7788
- beauty_ai.py: brand universe scan (~650 brands), portfolio match
detection against OUR_BRANDS, Gemini B2B assessment prompt in Spanish
returning quality/categories/dist_matches/outreach_email
- beauty_queue table + beauty_lead_quality/beauty_assessment columns
in enriched_domains (with migrations)
- Endpoints: /api/beauty/assess/batch, /api/beauty/leads,
/api/beauty/status, /api/beauty/export, /api/beauty/reset
- Static frontend: Browse (beauty/ecommerce pre-filtered, no CMS/SSL/KD
columns), Validator, B2B Pipeline (brand chips, expandable outreach),
Pre-screen, Export CSV
- docker-compose: second 'beauty' service with shared data volume
- Dockerfile: expose 7788 alongside 6677
Also: add 'error' prescreen_status handling + UI (orange stat box,
filter option) for 4xx/5xx HTTP responses
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:31:10 +02:00
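The portfolio-match detection mentioned above (scanning a site against OUR_BRANDS) can be sketched as a case-insensitive substring scan. The brand names and function below are illustrative assumptions, not the beauty_ai.py originals.

```python
# Hedged sketch of portfolio matching; OUR_BRANDS entries are hypothetical
# placeholders standing in for the real ~650-brand universe.
OUR_BRANDS = ["Salerm", "Montibello", "Eva Professional"]

def portfolio_matches(page_text: str) -> list[str]:
    """Return the portfolio brands mentioned on the page (case-insensitive)."""
    text = page_text.lower()
    return [brand for brand in OUR_BRANDS if brand.lower() in text]
```

A distributor page listing carried brands then yields the overlap directly, which is what feeds the "brand chips" in the B2B Pipeline view.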
async def queue_beauty(domains: list[str]):
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        await db.executemany(
            "INSERT OR IGNORE INTO beauty_queue (domain) VALUES (?)",
            [(d,) for d in domains],
        )
        await db.commit()


async def get_beauty_queue_status():
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        async with db.execute("SELECT status, COUNT(*) FROM beauty_queue GROUP BY status") as cur:
            rows = {r[0]: r[1] async for r in cur}
        return {
            "pending": rows.get("pending", 0),
            "running": rows.get("running", 0),
            "done": rows.get("done", 0),
            "failed": rows.get("failed", 0),
            "total": sum(rows.values()),
        }


async def save_beauty_assessment(domain: str, assessment: dict):
    import json as _json
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        await db.execute(
            "INSERT INTO enriched_domains (domain) VALUES (?) ON CONFLICT(domain) DO NOTHING",
            (domain,),
        )
        await db.execute(
            """UPDATE enriched_domains SET
                   beauty_lead_quality = ?, beauty_assessment = ?, beauty_assessed_at = datetime('now')
               WHERE domain = ?""",
            (assessment.get("lead_quality"), _json.dumps(assessment), domain),
        )
        await db.execute(
            "UPDATE beauty_queue SET status='done', completed_at=datetime('now') WHERE domain=?",
            (domain,),
        )
        await db.commit()


async def get_beauty_leads(quality: str = None, country: str = None,
                           page: int = 1, limit: int = 100):
    import json as _json
    offset = (page - 1) * limit
    conditions = ["beauty_lead_quality IS NOT NULL"]
    params: list = []
    if quality:
        conditions.append("beauty_lead_quality = ?")
        params.append(quality.upper())
    if country:
        conditions.append("ip_country = ?")
        params.append(country.upper())
    where = "WHERE " + " AND ".join(conditions)
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        db.row_factory = aiosqlite.Row
        async with db.execute(
            f"SELECT * FROM enriched_domains {where} "
            f"ORDER BY CASE beauty_lead_quality WHEN 'HOT' THEN 1 WHEN 'WARM' THEN 2 ELSE 3 END "
            f"LIMIT ? OFFSET ?",
            params + [limit, offset],
        ) as cur:
            rows = [dict(r) async for r in cur]
        async with db.execute(f"SELECT COUNT(*) FROM enriched_domains {where}", params) as cur:
            total = (await cur.fetchone())[0]
        # Parse beauty_assessment JSON inline
        for r in rows:
            try:
                r["_beauty"] = _json.loads(r.get("beauty_assessment") or "{}")
            except Exception:
                r["_beauty"] = {}
        return total, rows
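The HOT/WARM/COLD ordering above relies on a `CASE` expression that maps each label to a numeric sort rank. An in-memory demo with a reduced two-column schema (an assumption for brevity):

```python
import sqlite3

# CASE turns the quality label into a sort key, so HOT rows surface first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enriched_domains (domain TEXT, beauty_lead_quality TEXT)")
conn.executemany(
    "INSERT INTO enriched_domains VALUES (?, ?)",
    [("c.es", "COLD"), ("h.es", "HOT"), ("w.es", "WARM")],
)
rows = conn.execute(
    "SELECT domain FROM enriched_domains "
    "ORDER BY CASE beauty_lead_quality WHEN 'HOT' THEN 1 WHEN 'WARM' THEN 2 ELSE 3 END"
).fetchall()
```

The `ELSE 3` branch also soaks up any unexpected label, so malformed assessments sink to the bottom instead of breaking the sort.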
async def get_queue_status():
    async with aiosqlite.connect(SQLITE_PATH, timeout=30) as db:
        async with db.execute("SELECT status, COUNT(*) FROM job_queue GROUP BY status") as cur:
            rows = {r[0]: r[1] async for r in cur}
        pending = rows.get("pending", 0)
        running = rows.get("running", 0)
        done = rows.get("done", 0)
        failed = rows.get("failed", 0)
        total = sum(rows.values())
        rate = int(os.getenv("CONCURRENCY_LIMIT", "50"))
        # Rough ETA heuristic: assume ~rate/10 domains complete per second
        eta_seconds = (pending + running) / max(rate / 10, 1) if (pending + running) > 0 else None
        return {"total": total, "pending": pending, "running": running,
                "done": done, "failed": failed, "eta_seconds": eta_seconds}