Commit Graph

24 Commits

2f0959b8e8 fix: smart routing in Browse — enrichment filters use /api/enriched, discovery uses /api/domains
Root cause: loadDomains() always hit /api/domains (DuckDB 72M rows) and filtered
niche/site_type/prescreen_status client-side on a random page of 100 domains —
virtually none had been classified, so Live+Beauty+Ecommerce always returned 0.

- loadDomains() now routes to /api/enriched when any enrichment filter is active
  (prescreen_status, niche, site_type, country) — all filters are server-side SQLite
- Falls back to /api/domains only when no enrichment filters are set (discovery mode)
- alpha_only and no_sld supported in both modes:
  - DuckDB: existing regex support
  - SQLite: LIKE patterns (no hyphens/digits) + dot-count (no SLD)
- Add alpha_only/no_sld params to /api/enriched endpoint and get_enriched()
- Fix stale d.classified reference in prescreenOne toast
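The SQLite-side alpha_only / no_sld filters described above can be sketched as follows. Table and column names are assumptions for illustration, and GLOB is used instead of LIKE for the character-class part, since plain LIKE has no digit/hyphen classes:

```python
import sqlite3

# Hypothetical sketch of the server-side SQLite filters; the real
# queries in db.py may differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE enriched_domains (domain TEXT)")
con.executemany(
    "INSERT INTO enriched_domains VALUES (?)",
    [("belleza.es",), ("my-shop.es",), ("store24.es",), ("shop.com.es",)],
)

where = []
alpha_only, no_sld = True, True
if alpha_only:
    # letters only: exclude any hyphen or digit anywhere in the name
    where.append("domain NOT GLOB '*[-0-9]*'")
if no_sld:
    # exactly one dot => no com.es / co.uk style second-level TLD
    where.append("length(domain) - length(replace(domain, '.', '')) = 1")

sql = "SELECT domain FROM enriched_domains WHERE " + " AND ".join(where)
rows = [r[0] for r in con.execute(sql)]
print(rows)  # → ['belleza.es']
```

Only `belleza.es` survives: the hyphenated and digit-bearing names fail the GLOB test, and `shop.com.es` has two dots.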

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 08:53:54 +02:00
daccb99a0c fix: prescreen returns immediately after HTTP check, DeepSeek runs in background
Previously /api/prescreen/batch blocked for 4-10 minutes waiting for Replicate/
DeepSeek, so the browser connection timed out and zero results were saved.

- Phase 1 (HTTP check) runs synchronously and saves results immediately
- Phase 2 (DeepSeek classify) fires as asyncio.create_task and runs in background
- Response is returned to client as soon as phase 1 completes (~30-90s)
- Frontend toast shows "classifying N in background" so user knows niche/type
  will appear shortly without waiting
- Each DeepSeek sub-batch saves independently so partial results are preserved
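The return-early pattern above can be sketched like this (all names hypothetical — a stand-in for the real handler): the endpoint awaits the fast HTTP phase, schedules the slow classification as a background task, and responds without waiting for it.

```python
import asyncio

async def classify_in_background(live):
    await asyncio.sleep(0.05)           # stand-in for the DeepSeek call
    classify_in_background.done = live  # each sub-batch would save here

async def prescreen_batch(domains):
    live = list(domains)                # stand-in for phase-1 HTTP check
    task = asyncio.create_task(classify_in_background(live))
    # keep a reference so the task isn't garbage-collected mid-flight
    prescreen_batch.pending = task
    return {"checked": len(live), "classifying_in_background": len(live)}

async def main():
    resp = await prescreen_batch(["a.es", "b.es"])
    await prescreen_batch.pending       # awaited here only for the demo
    return resp

resp = asyncio.run(main())
print(resp)
```

Note the kept task reference: asyncio only holds weak references to tasks, so a fire-and-forget `create_task` without one can be collected before it finishes.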

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 08:28:26 +02:00
7ec0304dea feat: add Validate Selected button, Alpha only and No SLD filters to beauty Browse
- /api/validate/batch endpoint: HTTP-check only (no DeepSeek), accepts up to 500 domains
- Validate Selected bulk button: runs validate in 500-domain chunks, shows live/dead summary
- Alpha only checkbox: passes alpha_only=true to /api/domains to exclude hyphens/numbers
- No SLD checkbox: passes no_sld=true to /api/domains to skip com.es / co.uk style domains
- Both flags wired into loadDomains() and resetFilters()
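The bulk button itself lives in the frontend JS; the chunking logic it applies can be sketched in Python (the `validate_batch` stub is hypothetical, standing in for POST /api/validate/batch):

```python
def chunked(items, size=500):
    """Yield fixed-size chunks; the validate endpoint caps at 500 domains."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def validate_batch(domains):
    # stand-in for the HTTP-check-only endpoint
    live = len(domains) // 2
    return {"live": live, "dead": len(domains) - live}

selected = [f"d{i}.es" for i in range(1234)]
totals = {"live": 0, "dead": 0}
for chunk in chunked(selected):          # 500 + 500 + 234
    res = validate_batch(chunk)
    totals["live"] += res["live"]
    totals["dead"] += res["dead"]
print(totals)
```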

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 07:59:32 +02:00
5672b61b5e fix: beauty Browse uses /api/domains (DuckDB) like main DomGod
- loadDomains() now calls /api/domains (72M domain index) instead of /api/enriched
- keyword and TLD filters are server-side (DuckDB); prescreen_status, niche,
  site_type, country are client-side — same pattern as main DomGod _fetch()
- "Not checked" now correctly finds domains that exist in DuckDB but have never
  been pre-screened (no row in enriched_domains, so no prescreen_status)
- results info shows "X shown · Y matching · page N" to reflect DuckDB total vs
  client-side-filtered visible count

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 07:49:47 +02:00
db93401a81 fix: remove prescreen tab, use bulk bar select+prescreen only
- drop standalone Pre-screen tab (textarea upload) — confusing duplicate
- bulk bar Pre-screen Selected button is the only entry point now
- add prescreening flag with loading state on button + double-click guard
- remove dead prescreenInput/prescreenRunning/prescreenResult state vars and runPrescreen()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 20:00:50 +02:00
ad03107f0d fix: beauty frontend server-side filtering and bulk actions
- add keyword and tld params to get_enriched() in db.py (LIKE on domain + page_title)
- forward keyword/tld through /api/enriched in beauty_main.py
- rewrite beauty/index.html loadDomains() to pass all filters server-side via URLSearchParams
- track domainsTotal from API response for correct pagination display
- add Pre-screen Selected and B2B Assess Selected bulk action buttons
- add per-row Screen and Assess buttons
- goSearch() resets to page 1 before fetching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:44:34 +02:00
a7dd7927b9 feat: BeautyLeads B2B cosmetics frontend on port 7788
New service (app/beauty_main.py) sharing the same /data volume:
- Separate FastAPI app running on port 7788
- beauty_ai.py: brand universe scan (~650 brands), portfolio match
  detection against OUR_BRANDS, Gemini B2B assessment prompt in Spanish
  returning quality/categories/dist_matches/outreach_email
- beauty_queue table + beauty_lead_quality/beauty_assessment columns
  in enriched_domains (with migrations)
- Endpoints: /api/beauty/assess/batch, /api/beauty/leads,
  /api/beauty/status, /api/beauty/export, /api/beauty/reset
- Static frontend: Browse (beauty/ecommerce pre-filtered, no CMS/SSL/KD
  columns), Validator, B2B Pipeline (brand chips, expandable outreach),
  Pre-screen, Export CSV
- docker-compose: second 'beauty' service with shared data volume
- Dockerfile: expose 7788 alongside 6677

Also: add 'error' prescreen_status handling + UI (orange stat box,
filter option) for 4xx/5xx HTTP responses

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:31:10 +02:00
db95876db2 fix: SQLite database locked errors + add error status for 4xx/5xx
SQLite locking:
- Enable WAL journal mode in init_db (readers don't block writers)
- Set busy_timeout=30000ms in init_db
- Add timeout=30 to every aiosqlite.connect() across db.py, validator.py,
  enricher.py, main.py so connections wait up to 30s instead of crashing

Error status:
- 4xx/5xx HTTP responses are now prescreen_status='error' (server alive
  but broken/blocking) instead of 'live'
- Added 'error' counter to validator stats and orange Error stat box in UI
- Added ps-error CSS class (orange) and filter option in Browse tab
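The locking fix above boils down to three knobs. The service uses aiosqlite, but stdlib sqlite3 takes the same PRAGMAs, so a minimal sketch (file path and helper name are illustrative):

```python
import os
import sqlite3
import tempfile

def connect(path, timeout=30):
    # timeout: how long this connection waits on a locked database
    con = sqlite3.connect(path, timeout=timeout)
    # WAL: readers no longer block writers (and vice versa)
    con.execute("PRAGMA journal_mode=WAL")
    # busy_timeout (ms): belt-and-braces wait at the SQLite level
    con.execute("PRAGMA busy_timeout=30000")
    return con

path = os.path.join(tempfile.gettempdir(), "enrich_demo.db")
con = connect(path)
mode = con.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # → wal
con.close()
for suffix in ("", "-wal", "-shm"):     # clean up the demo files
    try:
        os.remove(path + suffix)
    except FileNotFoundError:
        pass
```

WAL needs a file-backed database (`:memory:` ignores it), which is why the demo uses a temp file.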

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 07:10:45 +02:00
f8ab910eca feat: add rescan dead domains checkbox to validator
Adds rescan_dead flag that causes _filter_unvalidated to treat
previously-dead domains as needing a fresh check. Useful after
fixing the http/https detection bug.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 20:12:59 +02:00
8f387cada2 feat: bulk validator tab + status/niche/type browse filters
- New app/validator.py: background HTTP checker for entire dataset
  - 50 concurrent checks, skips already-validated domains
  - Extracts prescreen_status, server, IP, load_time_ms
  - start/stop/status API at /api/validator/start|stop|status

- New dedicated "Validator 🔬" tab with stats grid, TLD filter,
  Start/Stop controls, live progress indicator

- Browse tab: "Live" column replaced with "Status" dot (color-coded
  ● from prescreen_status, falls back to is_live)
- Browse tab: new Status / Niche / Type filter dropdowns

- db.py: added ip TEXT + load_time_ms INTEGER columns + migrations;
  get_enriched() supports prescreen_status/niche/site_type filters

- main.py: /api/enriched extended with prescreen_status/niche/site_type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 08:27:24 +02:00
468d76387d fix: 429 retry, sequential batching, force UI refresh after prescreen
1. prescreener.py: classify_with_deepseek now retries on 429 with
   exponential back-off (5s → 10s → 20s → 40s, up to 4 attempts);
   same back-off also covers other transient errors.

2. main.py: prescreen batches run sequentially with a 3s gap instead
   of asyncio.gather (parallel). Parallel batches caused the second
   batch to always hit the 429 rate limit, leaving most domains
   unclassified (only the smaller last batch succeeded).

3. index.html: prescreenSelected() now clears this.domains before
   calling _fetch() so Alpine re-renders the full table with the
   updated niche/type values; also updates the notify hint to mention
   the expected 1-2 min wait.
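The retry policy in point 1 can be sketched as below — up to 4 attempts with a doubling delay (5s base in production, shrunk here so the demo runs instantly); `flaky` is a stand-in for the Replicate call:

```python
import asyncio

async def with_backoff(call, attempts=4, base=5.0):
    delay = base
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except RuntimeError:            # e.g. HTTP 429 or transient error
            if attempt == attempts:
                raise                   # out of attempts: propagate
            await asyncio.sleep(delay)
            delay *= 2                  # 5s -> 10s -> 20s

fails = {"n": 2}
async def flaky():
    if fails["n"] > 0:
        fails["n"] -= 1
        raise RuntimeError("429 rate limited")
    return {"niche": "beauty"}

result = asyncio.run(with_backoff(flaky, base=0.01))
print(result)  # succeeds on the third attempt
```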

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:52:39 +02:00
7fc510f903 feat: two-phase pre-screening with HTTP check + DeepSeek batch classification
Phase 1 (no AI credits): httpx checks every selected domain concurrently
(30 parallel) with real browser UA — detects live/dead/parked/redirect.
Parked: keyword scan in body/title + known parking host redirect check.
Results saved to DB immediately; dead/parked never reach DeepSeek.

Phase 2 (single DeepSeek call): all live-site titles + snippets bundled
into ONE Replicate/DeepSeek-R1 request → returns niche + type for every
domain in batch (up to 80 per call, parallelised if more).

- app/prescreener.py (new): _check_one(), prescreen_domains(),
  classify_with_deepseek(), parking signal lists, same-domain redirect logic
- app/db.py: prescreen_status/niche/site_type/prescreen_at columns +
  migrations; save_prescreen_results() upsert helper
- app/main.py: POST /api/prescreen/batch endpoint
- app/static/index.html:
  - 🔍 Pre-screen button (disabled while running, shows spinner)
  - Niche + Type columns in Browse and Leads tables (.pni/.pty pills)
  - Prescreen status colour dot (●) when niche not yet set
  - prescreening state flag; result toast shows per-status counts
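The parked-detection part of phase 1 can be sketched like this. The keyword and host lists are illustrative only, not the actual signal lists in app/prescreener.py:

```python
PARKING_KEYWORDS = ("domain for sale", "this domain is parked", "buy this domain")
PARKING_HOSTS = ("sedoparking.com", "parkingcrew.net", "dan.com")

def classify_response(final_host, title, body):
    """Label a fetched page: redirect to a known parking host, or
    parking keywords in title/body, mean 'parked'; otherwise 'live'."""
    text = (title + " " + body).lower()
    if any(h in final_host for h in PARKING_HOSTS):
        return "parked"
    if any(k in text for k in PARKING_KEYWORDS):
        return "parked"
    return "live"

print(classify_response("ww1.sedoparking.com", "", ""))                   # parked
print(classify_response("tienda.es", "Tienda", "This domain is parked"))  # parked
print(classify_response("tienda.es", "Tienda de belleza", "Bienvenidos")) # live
```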

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:22:45 +02:00
63f961dc80 feat: add Leads tab and Hide Assessed filter in Browse
- db.py: get_enriched() accepts ai_only + lead_quality params
- main.py: /api/enriched exposes ai_only + lead_quality query params;
  new /api/export/leads endpoint produces CSV with contacts + pitch
- index.html:
  - New "Leads 🤖" tab shows all AI-assessed domains with contacts
    (quality/country/limit filters, per-row 📋 copy email, 🔍 modal,
    CSV export, pagination, auto-refreshes every 3s)
  - Browse: "Hide assessed" checkbox filters out already-processed
    domains so you can focus on fresh targets
  - Poll cycle refreshes Leads tab when active

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 18:57:15 +02:00
22eae3f9b7 feat: add EN/ES/RO language selector for AI pitch generation
- db.py: add `language` column to ai_queue; migration; queue_ai() accepts
  language param and re-queues with ON CONFLICT UPDATE so changing language works
- main.py: batch and single assess endpoints accept `language` from request body
- enricher.py: ai_worker_loop reads language column, passes to _assess_one()
- replicate_ai.py: assess_domain() and _build_prompt() accept language param;
  OUTPUT LANGUAGE section injected into prompt so Gemini writes pitch/email in
  the requested language (EN/ES/RO)
- index.html: flag dropdown (🇪🇸/🇬🇧/🇷🇴) next to AI Assess button; aiLang
  state default ES; language sent in all batch assessment requests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 08:39:27 +02:00
88c27bfff5 feat: full-service agency pitch — outreach email + subject, richer Gemini brief
- Prompt now describes complete agency capabilities (everything web-related)
- Concrete pitch examples with business name + specific problem references
- New mandatory output fields: outreach_email (3-4 sentence ready-to-send ES)
  and email_subject (specific subject line)
- HOT/WARM/COLD scoring guide based on site deficiency count
- Modal: pitch box replaced with full outreach email + subject + Copy button
- max_output_tokens raised to 6000

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 08:34:37 +02:00
d62e4e986e feat: web search for contacts, copyright year, contact page scan, CMS/age from Gemini
- replicate_ai: DDG pre-search runs before every Gemini call; top 4 snippets
  injected into prompt so Gemini can find phone/email not on homepage
- replicate_ai: explicit instructions to use search results for contact lookup
- replicate_ai: new output fields cms_detected + site_last_updated
- site_analyzer: copyright year extracted from footer (© / copyright pattern)
- site_analyzer: Last-Modified from HTTP header + OG meta tag
- site_analyzer: scans /contacto /contact /contactanos /sobre-nosotros for
  additional emails/phones (parallel with sitemap/robots fetch)
- index.html: modal shows CMS (AI-detected), Last Updated (red if pre-2021)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 08:22:14 +02:00
dad910b6b0 feat: 5 fixes — dead site scoring, Kit Digital precision, social icons, GMB detection, social/GMB weighting
1. scorer: dead sites capped at 5 (was scoring HOT from SSL/CMS signals)
2. Kit Digital: require explicit kit-digital/agente-digitalizador signals;
   generic EU logo patterns (fondos-europeos, logo-ue, cofinanciado) removed.
   Gemini kit_digital_confirmed now overwrites heuristic in DB.
3. Browse table: social links replaced with compact coloured icon badges
   (fb/ig/in/x/tt/yt) linked to the profile URLs
4. site_analyzer: added has_gmb / gmb_url detection (Maps embed, Place links,
   LocalBusiness schema); fed to Gemini prompt
5. scorer: +5 no-social, +3 reachable contact; Gemini prompt includes GMB and
   social media management as sellable services; modal shows GMB/social status

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 07:21:02 +02:00
793aea8a5f fix: auto-refresh browse results when AI assessments complete
Track aiSt.done across poll cycles; re-fetch current page whenever
the done count increases while on the browse tab.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:13:35 +02:00
5bef587ca0 fix: add timeouts to SSL/DNS blocking calls, reset stuck AI jobs on startup
- SSL handshake: set socket timeout before wrap_socket (prevents indefinite hang)
- SSL executor: asyncio.wait_for(..., timeout=12)
- DNS gethostbyname: asyncio.wait_for(..., timeout=6)
- analyze_site: hard 90s timeout wrapper
- _assess_one: hard 180s ceiling via asyncio.timeout()
- ai_worker_loop: reset 'running' → 'pending' on startup (clears crashed-session jobs)
- Add POST /api/ai/reset endpoint + UI button to unstick jobs without restart
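The timeout pattern above — a blocking call moved onto the default executor and bounded with asyncio.wait_for so it can never hang the event loop — looks roughly like this (the lookup body is a stand-in for the real SSL/DNS work):

```python
import asyncio
import time

def blocking_lookup():
    time.sleep(0.05)          # stand-in for socket/SSL handshake work
    return "93.184.216.34"

async def lookup_with_timeout(timeout=6):
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(None, blocking_lookup), timeout=timeout
        )
    except asyncio.TimeoutError:
        return None           # treat a hang as "no answer", not a crash

ip = asyncio.run(lookup_with_timeout())
print(ip)  # → 93.184.216.34
```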

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 18:11:27 +02:00
60c9b495ae fix: AI worker crash-proof + GDPR/hosting/accessibility analysis
AI worker fixes (root cause of "nothing reaches Replicate"):
- Worker task died silently — no exception handler around while loop
- Added try/except around entire loop body with exc_info logging
- Added watchdog task that restarts dead workers every 10 seconds
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
  subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
  last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
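The watchdog idea described above can be sketched as follows (interval shrunk from 10s, names hypothetical): if the worker task has died, retrieve its exception and start a fresh one.

```python
import asyncio

restarts = 0

async def worker():
    while True:
        await asyncio.sleep(0.01)   # stand-in for queue processing
        raise RuntimeError("boom")  # simulate a silent crash

async def watchdog(state, interval=0.03, rounds=3):
    global restarts
    for _ in range(rounds):
        await asyncio.sleep(interval)
        if state["task"].done():
            state["task"].exception()  # retrieve it (real code logs here)
            state["task"] = asyncio.create_task(worker())
            restarts += 1

async def main():
    state = {"task": asyncio.create_task(worker())}
    await watchdog(state)
    state["task"].cancel()
    try:
        await state["task"]
    except (RuntimeError, asyncio.CancelledError):
        pass

asyncio.run(main())
print("watchdog restarted the worker", restarts, "time(s)")
```

Calling `.exception()` on the dead task matters: it marks the exception as retrieved (and gives the real code something to log), avoiding asyncio's "exception was never retrieved" warning.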

site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
  Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: html lang missing, images without alt count,
  skip nav link, empty links, inputs without labels

Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
  accessibility_issues[]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 18:01:34 +02:00
5ad8259c75 feat: deep site analysis engine + fix AI assess for any domain
site_analyzer.py (new):
- Fresh scrape with timing, page size, server, CMS detection
- Lorem ipsum detection (16 phrases incl. user's example)
- Placeholder content detection (hello world, sample page, etc.)
- Analytics: GA4, GTM, Facebook Pixel, Hotjar, Clarity
- Webmaster: Google Search Console, Bing, Yandex verification tags
- sitemap.xml and robots.txt check + Googlebot block detection
- Mobile viewport check, word count, image/script count
- Full contact extraction: emails, phones, WhatsApp, social links
- Kit Digital signal detection

AI worker fix:
- No longer requires pre-enrichment — works on ANY selected domain
- Does fresh site_analyzer scrape then calls Gemini with full context
- Stores site_analysis JSON alongside AI assessment
- Upserts into enriched_domains even if domain was never enriched

Gemini prompt now includes:
- Complete technical snapshot (load time, size, server, SSL)
- Full SEO signals (sitemap, robots, analytics, webmaster verified)
- Content quality (lorem ipsum matches, placeholder matches)
- Kit Digital signals
- All extracted contacts
- 500-word page text sample
- Outputs: summary, site_quality_score/10, content_issues[],
  urgency_signals[], performance_notes, seo_status,
  best_contact_channel+value, all_contacts, ES pitch,
  services_needed, outreach_notes

UI: rich AI modal with summary banner, quality grid, content issues,
    urgency signals, full contact list, technical snapshot

Fixes: correct Replicate token, ai_queue status='running' bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:46:01 +02:00
faca4b6e1a feat: Gemini AI assessment, Kit Digital detection, contact extraction
Kit Digital detection (enricher.py):
- Scans img src/alt/srcset for digitalizadores, kit-digital, fondos-europeos etc
- Scans page text for Kit Digital, Agente Digitalizador, Next Generation EU, PRTR
- Scans links for acelerapyme.es, red.es, kit-digital refs
- +20 score bonus for Kit Digital confirmed sites (proven IT buyers)

Contact extraction (enricher.py):
- Pulls mailto/tel/wa.me links from HTML
- Extracts email addresses via regex, phone numbers (ES format)
- Detects social media links (FB, IG, LinkedIn, Twitter, TikTok)
- Stored as JSON in contact_info column
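Illustrative regexes for that extraction (the actual patterns in enricher.py may differ — in particular the ES phone pattern here, 9 digits starting 6/7/8/9 with an optional +34 prefix, is an assumption):

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ES_PHONE_RE = re.compile(r"(?:\+34[\s.-]?)?[6789]\d{2}[\s.-]?\d{3}[\s.-]?\d{3}")

html = """
<a href="mailto:info@tienda.es">Escríbenos</a>
Llámanos al +34 612 345 678 o visita
<a href="https://wa.me/34612345678">WhatsApp</a>
"""

contact_info = {
    "emails": sorted(set(EMAIL_RE.findall(html))),
    "phones": sorted(set(p.strip() for p in ES_PHONE_RE.findall(html))),
    "whatsapp": re.findall(r"wa\.me/\d+", html),
}
print(json.dumps(contact_info))  # serialized into the contact_info column
```

Note the phone regex also fires on the digits inside the wa.me URL — a reminder that real extraction code needs de-duplication against links it has already classified.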

Gemini via Replicate (replicate_ai.py):
- Assesses lead quality (HOT/WARM/COLD), Kit Digital confirmation
- Identifies best contact channel + actual value (email/phone/WA)
- Writes Spanish cold-call/email pitch angle
- Lists services likely needed + outreach notes
- 3 concurrent requests, 90s timeout, JSON output parsing

DB: migration adds kit_digital, kit_digital_signals, contact_info,
    ai_assessment, ai_lead_quality, ai_pitch, ai_contact_channel/value,
    ai_queue table

UI: Kit Digital 🏅 badge, AI quality pill (clickable modal with full
    assessment), contact chips (email/phone/WA/social), AI Assess button,
    Kit Digital only filter, AI queue status in enrichment tab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:25:06 +02:00
7acff12242 feat: persistent DuckDB index, new filters, pagination fix, enrich UX
- Build /data/domains.duckdb on first run (tld+parts columns + ART index)
  → TLD filter goes from ~60s full scan to <100ms index lookup
  → System still works (slower) while index builds in background
- New /api/domains params: alpha_only, no_sld, keyword
  → alpha_only: domains with only letters (no hyphens/numbers)
  → no_sld: parts=2, excludes com.es / net.es patterns
  → keyword: LIKE '%term%' niche search
- /api/domains and /api/enriched now return total count for pagination
- Pagination: shows total matches, page X of Y, Next disabled at last page
- Enrich button: toast notifications instead of alert(), error handling
- Select all on page button, clear selection button
- Stats/TLD breakdown cached after first load (no repeat full scan)
- Header shows index build status (building → ready)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:00:08 +02:00
b2e7a2f2db feat: initial Dockerized domain intelligence dashboard
- FastAPI backend with DuckDB pushdown queries on 72M parquet
- Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
- Resumable parquet download with HTTP Range support
- Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
- Single-file Alpine.js + Chart.js dashboard on port 6677
- SQLite enrichment DB with job queue and scores tables
- Dockerized with persistent /data volume
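The resume half of that Range-based download can be sketched without any network code (helper name and `.part` convention are assumptions): if a partial file exists, request only the remaining bytes; the downloader then appends on a 206 Partial Content response.

```python
import os
import tempfile

def resume_headers(path):
    """Build the HTTP Range header for resuming a partial download."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return {"Range": f"bytes={os.path.getsize(path)}-"}
    return {}                      # nothing on disk: full download

path = os.path.join(tempfile.gettempdir(), "domains.parquet.part")
with open(path, "wb") as f:
    f.write(b"\0" * 1024)          # simulate 1 KiB already downloaded

hdrs = resume_headers(path)
print(hdrs)  # → {'Range': 'bytes=1024-'}
os.remove(path)
```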

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:22:30 +02:00