Port 80 is often firewalled (drops packets → ConnectTimeout) rather than
refused (ConnectError). Previously, ConnectTimeout fell through to the
generic except branch and skipped the https retry, marking everything dead.
Now ConnectError + RemoteProtocolError + ConnectTimeout all trigger an
https retry. ReadTimeout still marks dead (server responded on connect
but was too slow).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rotate across 7 real browser UAs to avoid bot detection
- Any 2xx/3xx/4xx/5xx response = server is UP = live (only no-response = dead)
- Parking signals still checked on 200/203 body content
- Previously, 403/404 responses incorrectly marked live servers as dead
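The classification rule can be sketched like this (the function name and the parking keyword list are illustrative stand-ins, not the real signal lists):

```python
# Illustrative subset of parking-page signals; the real list is longer.
PARKING_KEYWORDS = ("domain is for sale", "parked free", "buy this domain")

def classify_response(status_code: int, body: str = "") -> str:
    """Any HTTP response at all means the server is up; only no response = dead."""
    if status_code in (200, 203):
        # Only these statuses carry body content worth scanning for parking pages.
        lowered = body.lower()
        if any(kw in lowered for kw in PARKING_KEYWORDS):
            return "parked"
    return "live"  # 2xx/3xx/4xx/5xx alike: the server answered
```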
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix only retried on ConnectError. Servers that accept TCP on port 80
but hang, return protocol errors, or time out also need the https fallback.
Now any exception on http triggers https retry. Shorter http timeout (4s)
avoids wasting time on non-responsive port 80.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds rescan_dead flag that causes _filter_unvalidated to treat
previously-dead domains as needing a fresh check. Useful after
fixing the http/https detection bug.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Many modern servers refuse HTTP connections entirely. The validator was
only trying http://, causing HTTPS-only sites to be wrongly marked dead.
Now falls back to https:// on ConnectError. Also increased timeouts slightly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. prescreener.py: classify_with_deepseek now retries on 429 with
exponential back-off (5s → 10s → 20s → 40s, up to 4 attempts);
same back-off also covers other transient errors.
2. main.py: prescreen batches run sequentially with a 3s gap instead
of asyncio.gather (parallel). Parallel batches caused the second
batch to always hit the 429 rate limit, leaving most domains
unclassified (only the smaller last batch succeeded).
3. index.html: prescreenSelected() now clears this.domains before
calling _fetch() so Alpine re-renders the full table with the
updated niche/type values; also updates the notify hint to mention
the expected 1-2 min wait.
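The back-off in (1) can be sketched as a generic retry wrapper (a sketch, assuming a pluggable async `call`; the real code wraps the Replicate request and checks for 429 specifically):

```python
import asyncio

async def call_with_backoff(call, retries=4, base_delay=5.0):
    """Initial attempt plus up to `retries` retries, sleeping 5s, 10s, 20s, 40s."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # 5 * 2**0, 2**1, ... → 5s, 10s, 20s, 40s
            await asyncio.sleep(base_delay * (2 ** attempt))
```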
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepSeek-R1 is too slow for synchronous Replicate wait; it returns 202
with a prediction URL instead of the completed output. Added polling loop:
- POST with Prefer: wait=60
- If 202 or status=starting/processing, poll urls.get every 2s up to 90×
(~3 min ceiling)
- On succeeded, use the final response data as normal
- On failed/canceled/timeout, log and return []
Also guards against output=None before calling str.join().
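The polling loop roughly follows this shape (a sketch: `create` and `get` stand in for the HTTP calls, and the dict keys mirror Replicate's prediction payload):

```python
import time

def poll_prediction(create, get, interval=2.0, max_polls=90):
    """Create a prediction, then poll until it leaves starting/processing."""
    pred = create()                      # POST with Prefer: wait=60 upstream
    for _ in range(max_polls):           # 90 x 2s ≈ 3 min ceiling
        status = pred.get("status")
        if status == "succeeded":
            output = pred.get("output")
            return output if output is not None else []  # guard output=None
        if status in ("failed", "canceled"):
            return []                    # logged upstream in the real code
        time.sleep(interval)
        pred = get(pred["urls"]["get"])  # re-fetch the prediction
    return []                            # timed out
```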
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _parse_classify_output stripped <think> block before searching for JSON.
DeepSeek-R1 often puts the JSON array inside the think block (especially
when it "decides" mid-reasoning), so stripping it first destroyed the data.
Fix: search full output first, then inside <think>, then stripped — three
fallback strategies with info logging at each step.
2. Phase 2 save used bare UPDATE WHERE domain=? which silently does nothing
if the domain row doesn't exist yet in enriched_domains.
Fix: replace with INSERT ... ON CONFLICT DO UPDATE (true upsert).
Also adds logger.info lines so container logs show raw DeepSeek output
and parse result count for easy debugging.
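The three-strategy search in (1) can be sketched like this (a simplified sketch; the real parser also logs which strategy matched):

```python
import json
import re

def parse_classify_output(raw: str):
    """Find a JSON array in model output: full text first, then inside <think>,
    then with the <think> block stripped."""
    think = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    candidates = [raw]                                   # 1. full output
    if think:
        candidates.append(think.group(1))                # 2. inside <think>
        candidates.append(                               # 3. <think> stripped
            re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
        )
    for text in candidates:
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                continue  # fall through to the next strategy
    return []
```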
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 (no AI credits): httpx checks every selected domain concurrently
(30 parallel) with real browser UA — detects live/dead/parked/redirect.
Parked: keyword scan in body/title + known parking host redirect check.
Results saved to DB immediately; dead/parked never reach DeepSeek.
Phase 2 (single DeepSeek call): all live-site titles + snippets bundled
into ONE Replicate/DeepSeek-R1 request → returns niche + type for every
domain in batch (up to 80 per call, parallelised if more).
- app/prescreener.py (new): _check_one(), prescreen_domains(),
classify_with_deepseek(), parking signal lists, same-domain redirect logic
- app/db.py: prescreen_status/niche/site_type/prescreen_at columns +
migrations; save_prescreen_results() upsert helper
- app/main.py: POST /api/prescreen/batch endpoint
- app/static/index.html:
- 🔍 Pre-screen button (disabled while running, shows spinner)
- Niche + Type columns in Browse and Leads tables (.pni/.pty pills)
- Prescreen status colour dot (●) when niche not yet set
- prescreening state flag; result toast shows per-status counts
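The 30-parallel fan-out in Phase 1 can be sketched with a semaphore (`check_one` here is a stand-in for the real per-domain checker):

```python
import asyncio

async def prescreen_domains(domains, check_one, limit=30):
    """Check every domain concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(domain):
        async with sem:  # blocks when `limit` checks are already running
            return await check_one(domain)

    return await asyncio.gather(*(bounded(d) for d in domains))
```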
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- db.py: add `language` column to ai_queue; migration; queue_ai() accepts
language param and re-queues with ON CONFLICT UPDATE so changing language works
- main.py: batch and single assess endpoints accept `language` from request body
- enricher.py: ai_worker_loop reads language column, passes to _assess_one()
- replicate_ai.py: assess_domain() and _build_prompt() accept language param;
OUTPUT LANGUAGE section injected into prompt so Gemini writes pitch/email in
the requested language (EN/ES/RO)
- index.html: flag dropdown (🇪🇸/🇬🇧/🇷🇴) next to AI Assess button; aiLang
state default ES; language sent in all batch assessment requests
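The re-queue-with-ON CONFLICT behaviour can be sketched with sqlite3 (the ai_queue schema here is a minimal stand-in, not the real table definition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_queue (domain TEXT PRIMARY KEY, language TEXT, status TEXT)"
)

def queue_ai(conn, domain, language="ES"):
    """Queue a domain; re-queuing with a different language updates the row."""
    conn.execute(
        """INSERT INTO ai_queue (domain, language, status)
           VALUES (?, ?, 'queued')
           ON CONFLICT(domain) DO UPDATE
           SET language = excluded.language, status = 'queued'""",
        (domain, language),
    )
```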
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Prompt now describes complete agency capabilities (everything web-related)
- Concrete pitch examples with business name + specific problem references
- New mandatory output fields: outreach_email (3-4 sentence ready-to-send ES)
and email_subject (specific subject line)
- HOT/WARM/COLD scoring guide based on site deficiency count
- Modal: pitch box replaced with full outreach email + subject + Copy button
- max_output_tokens raised to 6000
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- max_output_tokens 2048→4096 (main truncation fix)
- page snippet 2000→800 chars, search results capped at 600 chars
- JSON schema reordered: lead_quality/pitch_angle/services_needed first,
so most important fields survive even if output is truncated
- RULES block in prompt: placeholder = HOT lead, pitch_angle is MANDATORY,
services_needed must have ≥2 items, keep values ≤15 words to avoid truncation
- _parse_output: truncated JSON repair — closes open [] and {} brackets
and strips trailing incomplete key-value before retrying json.loads
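The repair step can be sketched like this (a best-effort sketch; the real _parse_output interleaves this with its other fallbacks):

```python
import json

def repair_truncated_json(text: str):
    """Repair truncated JSON: try as-is, then drop the trailing incomplete
    fragment; in both cases close any still-open [] / {} brackets."""
    cut = text[: text.rfind(",")] if "," in text else text
    for candidate in (text, cut):
        closers, in_string, escape = [], False, False
        for ch in candidate:  # track brackets, ignoring those inside strings
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                closers.append("}")
            elif ch == "[":
                closers.append("]")
            elif ch in "}]" and closers:
                closers.pop()
        if in_string:
            candidate += '"'  # close a string cut off mid-value
        try:
            return json.loads(candidate + "".join(reversed(closers)))
        except json.JSONDecodeError:
            continue
    return None
```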
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scorer: dead sites capped at 5 (was scoring HOT from SSL/CMS signals)
2. Kit Digital: require explicit kit-digital/agente-digitalizador signals;
generic EU logo patterns (fondos-europeos, logo-ue, cofinanciado) removed.
Gemini kit_digital_confirmed now overwrites heuristic in DB.
3. Browse table: social links replaced with compact coloured icon badges
(fb/ig/in/x/tt/yt) linked to the profile URLs
4. site_analyzer: added has_gmb / gmb_url detection (Maps embed, Place links,
LocalBusiness schema); fed to Gemini prompt
5. scorer: +5 no-social, +3 reachable contact; Gemini prompt includes GMB and
social media management as sellable services; modal shows GMB/social status
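The GMB detection in (4) can be sketched with regex signals (the patterns are illustrative approximations of the three signals named above):

```python
import re

# Illustrative patterns for the three GMB signals: Maps embed, Place link, schema.
GMB_URL_PATTERNS = [
    r"google\.com/maps/embed",   # Maps iframe embed
    r"google\.com/maps/place/",  # Place link
]
LOCALBUSINESS_RE = r'"@type"\s*:\s*"LocalBusiness"'

def detect_gmb(html: str):
    """Return (has_gmb, gmb_url or None) from raw page HTML."""
    for pattern in GMB_URL_PATTERNS:
        match = re.search(r'https?://(?:www\.)?' + pattern + r'[^"\'\s<>]*', html)
        if match:
            return True, match.group(0)
    if re.search(LOCALBUSINESS_RE, html):
        return True, None  # schema markup present, but no direct URL to report
    return False, None
```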
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track aiSt.done across poll cycles; re-fetch current page whenever
the done count increases while on the browse tab.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The f-string in _build_prompt was never closed — the }} at the end of the
JSON template was missing its closing """. Python consumed the entire
rest of the file as f-string content, then tried to evaluate the
{\s\S} regex braces as an f-string expression, giving
"unexpected character after line continuation character".
Also bundles the earlier timeout fixes (SSL handshake, DNS, analyze_site
90s cap, _assess_one 180s cap, worker reset of stale running jobs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI worker fixes (root cause of "nothing reaches Replicate"):
- Worker task died silently — no exception handler around while loop
- Added try/except around entire loop body with exc_info logging
- Added watchdog task that restarts dead workers every 10 seconds
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
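The restart-if-dead check used by the watchdog and by ensure_workers_alive() can be sketched like this (a single-worker sketch; names and logging are illustrative):

```python
import asyncio
import logging

log = logging.getLogger("ai_worker")

def ensure_worker_alive(loop_coro, current):
    """Return a live worker task, restarting (and logging why) if it died."""
    if current is not None and not current.done():
        return current  # still running: nothing to do
    if current is not None and current.done() and not current.cancelled():
        exc = current.exception()  # surface the silent death
        if exc:
            log.error("worker died: %r — restarting", exc)
    return asyncio.ensure_future(loop_coro())
```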
site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: html lang missing, images without alt count,
skip nav link, empty links, inputs without labels
Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
accessibility_issues[]
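A subset of the accessibility quick-scan can be sketched with the stdlib HTMLParser (a sketch covering three of the signals listed above; class and method names are illustrative):

```python
from html.parser import HTMLParser

class A11yScan(HTMLParser):
    """Quick scan: missing html lang, imgs without alt, inputs without labels."""

    def __init__(self):
        super().__init__()
        self.html_lang_missing = False
        self.imgs_without_alt = 0
        self.input_ids = []
        self.label_fors = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.html_lang_missing = "lang" not in attrs
        elif tag == "img" and "alt" not in attrs:
            self.imgs_without_alt += 1
        elif tag == "input":
            self.input_ids.append(attrs.get("id"))
        elif tag == "label" and attrs.get("for"):
            self.label_fors.add(attrs["for"])

    def unlabeled_inputs(self):
        # An input counts as labeled only if some <label for=...> targets its id.
        return sum(1 for i in self.input_ids if i not in self.label_fors)
```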
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Build /data/domains.duckdb on first run (tld+parts columns + ART index)
→ TLD filter goes from ~60s full scan to <100ms index lookup
→ System still works (slower) while index builds in background
- New /api/domains params: alpha_only, no_sld, keyword
→ alpha_only: domains with only letters (no hyphens/numbers)
→ no_sld: parts=2, excludes com.es / net.es patterns
→ keyword: LIKE '%term%' niche search
- /api/domains and /api/enriched now return total count for pagination
- Pagination: shows total matches, page X of Y, Next disabled at last page
- Enrich button: toast notifications instead of alert(), error handling
- Select all on page button, clear selection button
- Stats/TLD breakdown cached after first load (no repeat full scan)
- Header shows index build status (building → ready)
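The filter composition for /api/domains can be sketched as a parameterised query builder (a sketch assuming a domains table with name/tld/parts columns; clause syntax targets DuckDB):

```python
def build_domain_query(tld=None, alpha_only=False, no_sld=False, keyword=None):
    """Compose a parameterised WHERE clause for the /api/domains filters."""
    clauses, params = [], []
    if tld:
        clauses.append("tld = ?")  # hits the index instead of a full scan
        params.append(tld)
    if alpha_only:
        clauses.append("regexp_matches(name, '^[a-z]+$')")  # letters only
    if no_sld:
        clauses.append("parts = 2")  # excludes com.es / net.es patterns
    if keyword:
        clauses.append("name LIKE ?")  # niche keyword search
        params.append(f"%{keyword}%")
    where = " AND ".join(clauses) or "1=1"
    return f"SELECT name FROM domains WHERE {where}", params
```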
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FastAPI backend with DuckDB pushdown queries on 72M parquet
- Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
- Resumable parquet download with HTTP Range support
- Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
- Single-file Alpine.js + Chart.js dashboard on port 6677
- SQLite enrichment DB with job queue and scores tables
- Dockerized with persistent /data volume
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>