1. prescreener.py: classify_with_deepseek now retries on 429 with
exponential back-off (5s → 10s → 20s → 40s, up to 4 attempts); the
same back-off also covers other transient errors (sketch after this list).
2. main.py: prescreen batches now run sequentially with a 3s gap instead
of in parallel via asyncio.gather. Parallel batches made the second
batch hit the 429 rate limit every time, leaving most domains
unclassified (only the smaller final batch succeeded).
3. index.html: prescreenSelected() now clears this.domains before
calling _fetch() so Alpine re-renders the full table with the
updated niche/type values; also updates the notify hint to mention
the expected 1-2 min wait.
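
A minimal sketch of the back-off from item 1 (assuming httpx; the helper name
and the exact attempt/delay split are illustrative):

    import asyncio
    import httpx

    async def _with_backoff(send, retries: int = 4):
        # one initial try plus up to 4 retries, sleeping 5s, 10s, 20s, 40s between them
        delay = 5.0
        for attempt in range(retries + 1):
            try:
                resp = await send()
                if resp.status_code != 429:
                    return resp
            except httpx.TransportError:
                pass  # other transient errors share the same back-off
            if attempt < retries:
                await asyncio.sleep(delay)
                delay *= 2
        return None  # exhausted: caller treats None as a failed classification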
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepSeek-R1 is too slow for Replicate's synchronous wait mode; the API returns
202 with a prediction URL instead of the completed output. Added a polling loop:
- POST with Prefer: wait=60
- If 202 or status=starting/processing, poll urls.get every 2s up to 90×
(~3 min ceiling)
- On succeeded, use the final response data as normal
- On failed/canceled/timeout, log and return []
Also guards against output=None before calling str.join().
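
Approximate shape of the loop (field names such as status and urls.get follow
Replicate's prediction object; the model URL and payload are placeholders):

    import asyncio
    import httpx

    async def _run_prediction(client: httpx.AsyncClient, url: str, payload: dict, token: str):
        headers = {"Authorization": f"Bearer {token}", "Prefer": "wait=60"}
        pred = (await client.post(url, json=payload, headers=headers)).json()
        for _ in range(90):  # poll every 2s, ~3 min ceiling
            if pred.get("status") in ("succeeded", "failed", "canceled"):
                break
            await asyncio.sleep(2)
            pred = (await client.get(pred["urls"]["get"], headers=headers)).json()
        if pred.get("status") != "succeeded":
            return []  # failed / canceled / timed out: log and skip
        return pred.get("output") or []  # output can be None, so guard before joining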
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _parse_classify_output stripped <think> block before searching for JSON.
DeepSeek-R1 often puts the JSON array inside the think block (especially
when it "decides" mid-reasoning), so stripping it first destroyed the data.
Fix: search the full output first, then inside <think>, then the stripped
text — three fallback strategies with info logging at each step (sketch below).
2. Phase 2 save used bare UPDATE WHERE domain=? which silently does nothing
if the domain row doesn't exist yet in enriched_domains.
Fix: replace with INSERT ... ON CONFLICT DO UPDATE (true upsert).
Also adds logger.info lines so container logs show raw DeepSeek output
and parse result count for easy debugging.
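
The three-strategy search from fix 1, roughly (regexes are illustrative):

    import json, logging, re

    log = logging.getLogger(__name__)

    def _parse_classify_output(raw: str) -> list:
        think = re.search(r"<think>([\s\S]*?)</think>", raw)
        stripped = re.sub(r"<think>[\s\S]*?</think>", "", raw)
        candidates = (("full", raw),
                      ("think", think.group(1) if think else ""),
                      ("stripped", stripped))
        for label, text in candidates:
            m = re.search(r"\[[\s\S]*\]", text)  # first JSON array in this candidate
            if not m:
                continue
            try:
                result = json.loads(m.group(0))
                log.info("parsed %d items from %s output", len(result), label)
                return result
            except json.JSONDecodeError:
                continue
        return []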
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 (no AI credits): httpx checks every selected domain concurrently
(30 parallel) with real browser UA — detects live/dead/parked/redirect.
Parked: keyword scan in body/title + known parking host redirect check.
Results saved to DB immediately; dead/parked never reach DeepSeek.
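
A minimal sketch of the bounded concurrency (the parking signal shown is a
placeholder for the real keyword lists):

    import asyncio
    import httpx

    UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."  # real browser UA string

    async def prescreen_domains(domains: list[str], limit: int = 30) -> dict[str, str]:
        sem = asyncio.Semaphore(limit)  # at most 30 checks in flight

        async def check(client, domain):
            async with sem:
                try:
                    r = await client.get(f"http://{domain}", headers={"User-Agent": UA})
                except httpx.HTTPError:
                    return domain, "dead"
                if "domain is for sale" in r.text.lower():  # placeholder parking signal
                    return domain, "parked"
                return domain, "live"

        async with httpx.AsyncClient(follow_redirects=True, timeout=15) as client:
            results = await asyncio.gather(*(check(client, d) for d in domains))
        return dict(results)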
Phase 2 (single DeepSeek call): all live-site titles + snippets bundled
into ONE Replicate/DeepSeek-R1 request → returns niche + type for every
domain in batch (up to 80 per call, parallelised if more).
- app/prescreener.py (new): _check_one(), prescreen_domains(),
classify_with_deepseek(), parking signal lists, same-domain redirect logic
- app/db.py: prescreen_status/niche/site_type/prescreen_at columns +
migrations; save_prescreen_results() upsert helper
- app/main.py: POST /api/prescreen/batch endpoint
- app/static/index.html:
- 🔍 Pre-screen button (disabled while running, shows spinner)
- Niche + Type columns in Browse and Leads tables (.pni/.pty pills)
- Prescreen status colour dot (●) when niche not yet set
- prescreening state flag; result toast shows per-status counts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- db.py: add `language` column to ai_queue; migration; queue_ai() accepts a
language param and re-queues with ON CONFLICT UPDATE so changing the language
works (sketch after this list)
- main.py: batch and single assess endpoints accept `language` from request body
- enricher.py: ai_worker_loop reads language column, passes to _assess_one()
- replicate_ai.py: assess_domain() and _build_prompt() accept language param;
OUTPUT LANGUAGE section injected into prompt so Gemini writes pitch/email in
the requested language (EN/ES/RO)
- index.html: flag dropdown (🇪🇸/🇬🇧/🇷🇴) next to AI Assess button; aiLang
state default ES; language sent in all batch assessment requests
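
The re-queue upsert, roughly (SQLite syntax; assumes ai_queue has a unique
domain column and a status column):

    def queue_ai(con, domain: str, language: str = "es") -> None:
        # re-queuing an already-queued domain just updates its language
        con.execute(
            """
            INSERT INTO ai_queue (domain, language, status)
            VALUES (?, ?, 'pending')
            ON CONFLICT(domain) DO UPDATE SET
                language = excluded.language,
                status = 'pending'
            """,
            (domain, language),
        )
        con.commit()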
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Prompt now describes complete agency capabilities (everything web-related)
- Concrete pitch examples with business name + specific problem references
- New mandatory output fields: outreach_email (a ready-to-send 3-4 sentence
email in Spanish) and email_subject (a specific subject line)
- HOT/WARM/COLD scoring guide based on site deficiency count
- Modal: pitch box replaced with full outreach email + subject + Copy button
- max_output_tokens raised to 6000
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- max_output_tokens 2048→4096 (main truncation fix)
- page snippet 2000→800 chars, search results capped at 600 chars
- JSON schema reordered: lead_quality/pitch_angle/services_needed first,
so most important fields survive even if output is truncated
- RULES block in prompt: placeholder = HOT lead, pitch_angle is MANDATORY,
services_needed must have ≥2 items, keep values ≤15 words to avoid truncation
- _parse_output: truncated JSON repair — closes open [] and {} brackets
and strips trailing incomplete key-value before retrying json.loads
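
The bracket repair, roughly (a best-effort heuristic that ignores brackets
inside string values, not a full parser):

    import json, re

    def _repair_truncated_json(text: str):
        # drop a trailing incomplete "key": value fragment, then close open brackets
        trimmed = re.sub(r',\s*"[^"]*"?\s*:?\s*[^,\[\]{}]*$', "", text)
        stack = []
        for ch in trimmed:
            if ch in "[{":
                stack.append(ch)
            elif ch in "]}" and stack:
                stack.pop()
        trimmed += "".join("]" if c == "[" else "}" for c in reversed(stack))
        try:
            return json.loads(trimmed)
        except json.JSONDecodeError:
            return None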
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scorer: dead sites capped at 5 points (they were scoring HOT from SSL/CMS signals)
2. Kit Digital: require explicit kit-digital/agente-digitalizador signals;
generic EU logo patterns (fondos-europeos, logo-ue, cofinanciado) removed.
Gemini kit_digital_confirmed now overwrites heuristic in DB.
3. Browse table: social links replaced with compact coloured icon badges
(fb/ig/in/x/tt/yt) linked to the profile URLs
4. site_analyzer: added has_gmb / gmb_url detection (Maps embed, Place links,
LocalBusiness schema); fed to the Gemini prompt (sketch after this list)
5. scorer: +5 no-social, +3 reachable contact; Gemini prompt includes GMB and
social media management as sellable services; modal shows GMB/social status
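
The GMB check from item 4 is roughly a few pattern matches over the fetched
HTML (patterns are illustrative):

    import re

    def detect_gmb(html: str) -> tuple[bool, str | None]:
        # Maps embed or Google Maps/Place link -> has_gmb + gmb_url
        for pat in (r'https://www\.google\.com/maps/embed[^"\']*',
                    r'https://(?:goo\.gl/maps|maps\.google\.com|www\.google\.com/maps/place)[^"\']*'):
            m = re.search(pat, html)
            if m:
                return True, m.group(0)
        # LocalBusiness schema counts as a signal even without a profile URL
        if re.search(r'"@type"\s*:\s*"LocalBusiness"', html):
            return True, None
        return False, None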
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track aiSt.done across poll cycles; re-fetch the current page whenever
the done count increases while on the browse tab.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The f-string in _build_prompt was never closed — the }} at the end of the
JSON template was missing the closing """. Python consumed the entire
rest of the file as f-string content, then tried to evaluate the
{\s\S} regex braces as an f-string expression, giving
"unexpected character after line continuation character".
Also bundles the earlier timeout fixes (SSL handshake, DNS, analyze_site
90s cap, _assess_one 180s cap, worker reset of stale running jobs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI worker fixes (root cause of "nothing reaches Replicate"):
- The worker task died silently — there was no exception handler around the while loop
- Added try/except around the entire loop body with exc_info logging
- Added a watchdog task that restarts dead workers every 10 seconds (sketch after this list)
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: missing html lang, count of images without alt, skip-nav
link presence, empty links, inputs without labels
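
The accessibility quick-scan amounts to a handful of counts over the parsed
DOM (a sketch, assuming BeautifulSoup; heuristics are simplified):

    from bs4 import BeautifulSoup

    def quick_a11y_scan(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        root = soup.find("html")
        return {
            "lang_missing": not (root and root.get("lang")),
            "imgs_without_alt": sum(1 for i in soup.find_all("img") if not i.get("alt")),
            "empty_links": sum(1 for a in soup.find_all("a") if not a.get_text(strip=True)),
            "has_skip_nav": any((a.get("href") or "").startswith("#")
                                and "skip" in a.get_text().lower()
                                for a in soup.find_all("a")),
            "inputs_without_label": sum(
                1 for i in soup.find_all("input")
                if i.get("type") not in ("hidden", "submit")
                and not i.get("aria-label")
                and not (i.get("id") and soup.find("label", attrs={"for": i["id"]}))),
        }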
Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
accessibility_issues[]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Build /data/domains.duckdb on first run (tld+parts columns + ART index;
sketch after this list)
→ TLD filter goes from ~60s full scan to <100ms index lookup
→ System still works (slower) while the index builds in the background
- New /api/domains params: alpha_only, no_sld, keyword
→ alpha_only: domains with only letters (no hyphens/numbers)
→ no_sld: parts=2, excludes com.es / net.es patterns
→ keyword: LIKE '%term%' niche search
- /api/domains and /api/enriched now return total count for pagination
- Pagination: shows total matches, page X of Y, Next disabled at last page
- Enrich button: toast notifications instead of alert(), error handling
- Select all on page button, clear selection button
- Stats/TLD breakdown cached after first load (no repeat full scan)
- Header shows index build status (building → ready)
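
First-run index build, roughly (duckdb Python API; the column derivations are
simplified):

    import duckdb

    def build_index(parquet_path: str, db_path: str = "/data/domains.duckdb") -> None:
        con = duckdb.connect(db_path)
        # materialise tld + parts once so later filters hit the index, not the parquet
        con.execute(f"""
            CREATE TABLE IF NOT EXISTS domains AS
            SELECT domain,
                   regexp_extract(domain, '([^.]+)$', 1) AS tld,
                   len(string_split(domain, '.')) AS parts
            FROM read_parquet('{parquet_path}')
        """)
        con.execute("CREATE INDEX IF NOT EXISTS idx_tld ON domains (tld)")  # ART index
        con.close()

After this, a WHERE tld = 'es' filter resolves via the ART index instead of
scanning the parquet, which is where the ~60s-to-<100ms drop comes from.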
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FastAPI backend with DuckDB pushdown queries on 72M parquet
- Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
- Resumable parquet download with HTTP Range support
- Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
- Single-file Alpine.js + Chart.js dashboard on port 6677
- SQLite enrichment DB with job queue and scores tables
- Dockerized with persistent /data volume
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>