Port 80 is often firewalled (drops packets → ConnectTimeout) rather than
refused (ConnectError). Previously, ConnectTimeout fell through to the
generic except branch and skipped the https retry, marking everything dead.
Now ConnectError + RemoteProtocolError + ConnectTimeout all trigger an
https retry. ReadTimeout still marks dead (server responded on connect
but was too slow).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rotate across 7 real browser UAs to avoid bot detection
- Any 2xx/3xx/4xx/5xx response = server is UP = live (only no-response = dead)
- Parking signals still checked on 200/203 body content
- Previously, 403/404 responses incorrectly marked live servers as dead
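The classification rule can be sketched like this (the function name and the parking keyword list are illustrative stand-ins, not the real signal lists):

```python
# Illustrative subset of parking-page signals; the real list is longer.
PARKING_KEYWORDS = ("domain is for sale", "parked free", "buy this domain")

def classify_response(status_code: int, body: str = "") -> str:
    """Any HTTP response at all means the server is up; only no response = dead."""
    if status_code in (200, 203):
        # Only these statuses carry body content worth scanning for parking pages.
        lowered = body.lower()
        if any(kw in lowered for kw in PARKING_KEYWORDS):
            return "parked"
    return "live"  # 2xx/3xx/4xx/5xx alike: the server answered
```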
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix only retried on ConnectError. Servers that accept TCP on port 80
but hang, return protocol errors, or time out also need the https fallback.
Now any exception on http triggers https retry. Shorter http timeout (4s)
avoids wasting time on non-responsive port 80.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds rescan_dead flag that causes _filter_unvalidated to treat
previously-dead domains as needing a fresh check. Useful after
fixing the http/https detection bug.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Many modern servers refuse HTTP connections entirely. The validator was
only trying http://, causing HTTPS-only sites to be wrongly marked dead.
Now falls back to https:// on ConnectError. Also increased timeouts slightly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. prescreener.py: classify_with_deepseek now retries on 429 with
exponential back-off (5s → 10s → 20s → 40s, up to 4 attempts);
same back-off also covers other transient errors.
2. main.py: prescreen batches run sequentially with a 3s gap instead
of asyncio.gather (parallel). Parallel batches caused the second
batch to always hit the 429 rate limit, leaving most domains
unclassified (only the smaller last batch succeeded).
3. index.html: prescreenSelected() now clears this.domains before
calling _fetch() so Alpine re-renders the full table with the
updated niche/type values; also updates the notify hint to mention
the expected 1-2 min wait.
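The back-off in (1) can be sketched as a generic retry wrapper (a sketch, assuming a pluggable async `call`; the real code wraps the Replicate request and checks for 429 specifically):

```python
import asyncio

async def call_with_backoff(call, retries=4, base_delay=5.0):
    """Initial attempt plus up to `retries` retries, sleeping 5s, 10s, 20s, 40s."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # 5 * 2**0, 2**1, ... → 5s, 10s, 20s, 40s
            await asyncio.sleep(base_delay * (2 ** attempt))
```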
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepSeek-R1 is too slow for synchronous Replicate wait; it returns 202
with a prediction URL instead of the completed output. Added polling loop:
- POST with Prefer: wait=60
- If 202 or status=starting/processing, poll urls.get every 2s up to 90×
(~3 min ceiling)
- On succeeded, use the final response data as normal
- On failed/canceled/timeout, log and return []
Also guards against output=None before calling str.join().
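The polling loop roughly follows this shape (a sketch: `create` and `get` stand in for the HTTP calls, and the dict keys mirror Replicate's prediction payload):

```python
import time

def poll_prediction(create, get, interval=2.0, max_polls=90):
    """Create a prediction, then poll until it leaves starting/processing."""
    pred = create()                      # POST with Prefer: wait=60 upstream
    for _ in range(max_polls):           # 90 x 2s ≈ 3 min ceiling
        status = pred.get("status")
        if status == "succeeded":
            output = pred.get("output")
            return output if output is not None else []  # guard output=None
        if status in ("failed", "canceled"):
            return []                    # logged upstream in the real code
        time.sleep(interval)
        pred = get(pred["urls"]["get"])  # re-fetch the prediction
    return []                            # timed out
```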
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _parse_classify_output stripped <think> block before searching for JSON.
DeepSeek-R1 often puts the JSON array inside the think block (especially
when it "decides" mid-reasoning), so stripping it first destroyed the data.
Fix: search full output first, then inside <think>, then stripped — three
fallback strategies with info logging at each step.
2. Phase 2 save used bare UPDATE WHERE domain=? which silently does nothing
if the domain row doesn't exist yet in enriched_domains.
Fix: replace with INSERT ... ON CONFLICT DO UPDATE (true upsert).
Also adds logger.info lines so container logs show raw DeepSeek output
and parse result count for easy debugging.
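The three-strategy search in (1) can be sketched like this (a simplified sketch; the real parser also logs which strategy matched):

```python
import json
import re

def parse_classify_output(raw: str):
    """Find a JSON array in model output: full text first, then inside <think>,
    then with the <think> block stripped."""
    think = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    candidates = [raw]                                   # 1. full output
    if think:
        candidates.append(think.group(1))                # 2. inside <think>
        candidates.append(                               # 3. <think> stripped
            re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
        )
    for text in candidates:
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                continue  # fall through to the next strategy
    return []
```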
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 (no AI credits): httpx checks every selected domain concurrently
(30 parallel) with real browser UA — detects live/dead/parked/redirect.
Parked: keyword scan in body/title + known parking host redirect check.
Results saved to DB immediately; dead/parked never reach DeepSeek.
Phase 2 (single DeepSeek call): all live-site titles + snippets bundled
into ONE Replicate/DeepSeek-R1 request → returns niche + type for every
domain in batch (up to 80 per call, parallelised if more).
- app/prescreener.py (new): _check_one(), prescreen_domains(),
classify_with_deepseek(), parking signal lists, same-domain redirect logic
- app/db.py: prescreen_status/niche/site_type/prescreen_at columns +
migrations; save_prescreen_results() upsert helper
- app/main.py: POST /api/prescreen/batch endpoint
- app/static/index.html:
- 🔍 Pre-screen button (disabled while running, shows spinner)
- Niche + Type columns in Browse and Leads tables (.pni/.pty pills)
- Prescreen status colour dot (●) when niche not yet set
- prescreening state flag; result toast shows per-status counts
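The 30-parallel fan-out in Phase 1 can be sketched with a semaphore (`check_one` here is a stand-in for the real per-domain checker):

```python
import asyncio

async def prescreen_domains(domains, check_one, limit=30):
    """Check every domain concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(domain):
        async with sem:  # blocks when `limit` checks are already running
            return await check_one(domain)

    return await asyncio.gather(*(bounded(d) for d in domains))
```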
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- db.py: add `language` column to ai_queue; migration; queue_ai() accepts
language param and re-queues with ON CONFLICT UPDATE so changing language works
- main.py: batch and single assess endpoints accept `language` from request body
- enricher.py: ai_worker_loop reads language column, passes to _assess_one()
- replicate_ai.py: assess_domain() and _build_prompt() accept language param;
OUTPUT LANGUAGE section injected into prompt so Gemini writes pitch/email in
the requested language (EN/ES/RO)
- index.html: flag dropdown (🇪🇸/🇬🇧/🇷🇴) next to AI Assess button; aiLang
state default ES; language sent in all batch assessment requests
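The re-queue-with-ON CONFLICT behaviour can be sketched with sqlite3 (the ai_queue schema here is a minimal stand-in, not the real table definition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_queue (domain TEXT PRIMARY KEY, language TEXT, status TEXT)"
)

def queue_ai(conn, domain, language="ES"):
    """Queue a domain; re-queuing with a different language updates the row."""
    conn.execute(
        """INSERT INTO ai_queue (domain, language, status)
           VALUES (?, ?, 'queued')
           ON CONFLICT(domain) DO UPDATE
           SET language = excluded.language, status = 'queued'""",
        (domain, language),
    )
```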
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Prompt now describes complete agency capabilities (everything web-related)
- Concrete pitch examples with business name + specific problem references
- New mandatory output fields: outreach_email (3-4 sentence ready-to-send ES)
and email_subject (specific subject line)
- HOT/WARM/COLD scoring guide based on site deficiency count
- Modal: pitch box replaced with full outreach email + subject + Copy button
- max_output_tokens raised to 6000
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- max_output_tokens 2048→4096 (main truncation fix)
- page snippet 2000→800 chars, search results capped at 600 chars
- JSON schema reordered: lead_quality/pitch_angle/services_needed first,
so most important fields survive even if output is truncated
- RULES block in prompt: placeholder = HOT lead, pitch_angle is MANDATORY,
services_needed must have ≥2 items, keep values ≤15 words to avoid truncation
- _parse_output: truncated JSON repair — closes open [] and {} brackets
and strips trailing incomplete key-value before retrying json.loads
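The repair step can be sketched like this (a best-effort sketch; the real _parse_output interleaves this with its other fallbacks):

```python
import json

def repair_truncated_json(text: str):
    """Repair truncated JSON: try as-is, then drop the trailing incomplete
    fragment; in both cases close any still-open [] / {} brackets."""
    cut = text[: text.rfind(",")] if "," in text else text
    for candidate in (text, cut):
        closers, in_string, escape = [], False, False
        for ch in candidate:  # track brackets, ignoring those inside strings
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                closers.append("}")
            elif ch == "[":
                closers.append("]")
            elif ch in "}]" and closers:
                closers.pop()
        if in_string:
            candidate += '"'  # close a string cut off mid-value
        try:
            return json.loads(candidate + "".join(reversed(closers)))
        except json.JSONDecodeError:
            continue
    return None
```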
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scorer: dead sites capped at 5 (was scoring HOT from SSL/CMS signals)
2. Kit Digital: require explicit kit-digital/agente-digitalizador signals;
generic EU logo patterns (fondos-europeos, logo-ue, cofinanciado) removed.
Gemini kit_digital_confirmed now overwrites heuristic in DB.
3. Browse table: social links replaced with compact coloured icon badges
(fb/ig/in/x/tt/yt) linked to the profile URLs
4. site_analyzer: added has_gmb / gmb_url detection (Maps embed, Place links,
LocalBusiness schema); fed to Gemini prompt
5. scorer: +5 no-social, +3 reachable contact; Gemini prompt includes GMB and
social media management as sellable services; modal shows GMB/social status
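The GMB detection in (4) can be sketched with regex signals (the patterns are illustrative approximations of the three signals named above):

```python
import re

# Illustrative patterns for the three GMB signals: Maps embed, Place link, schema.
GMB_URL_PATTERNS = [
    r"google\.com/maps/embed",   # Maps iframe embed
    r"google\.com/maps/place/",  # Place link
]
LOCALBUSINESS_RE = r'"@type"\s*:\s*"LocalBusiness"'

def detect_gmb(html: str):
    """Return (has_gmb, gmb_url or None) from raw page HTML."""
    for pattern in GMB_URL_PATTERNS:
        match = re.search(r'https?://(?:www\.)?' + pattern + r'[^"\'\s<>]*', html)
        if match:
            return True, match.group(0)
    if re.search(LOCALBUSINESS_RE, html):
        return True, None  # schema markup present, but no direct URL to report
    return False, None
```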
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track aiSt.done across poll cycles; re-fetch current page whenever
the done count increases while on the browse tab.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The f-string in _build_prompt was never closed — the }} at the end of the
JSON template was missing its closing """. Python consumed the entire
rest of the file as f-string content, then tried to evaluate the
{\s\S} regex braces as an f-string expression, giving
"unexpected character after line continuation character".
Also bundles the earlier timeout fixes (SSL handshake, DNS, analyze_site
90s cap, _assess_one 180s cap, worker reset of stale running jobs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI worker fixes (root cause of "nothing reaches Replicate"):
- Worker task died silently — no exception handler around while loop
- Added try/except around entire loop body with exc_info logging
- Added watchdog task that restarts dead workers every 10 seconds
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
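The restart-if-dead check used by the watchdog and by ensure_workers_alive() can be sketched like this (a single-worker sketch; names and logging are illustrative):

```python
import asyncio
import logging

log = logging.getLogger("ai_worker")

def ensure_worker_alive(loop_coro, current):
    """Return a live worker task, restarting (and logging why) if it died."""
    if current is not None and not current.done():
        return current  # still running: nothing to do
    if current is not None and current.done() and not current.cancelled():
        exc = current.exception()  # surface the silent death
        if exc:
            log.error("worker died: %r — restarting", exc)
    return asyncio.ensure_future(loop_coro())
```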
site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: html lang missing, images without alt count,
skip nav link, empty links, inputs without labels
Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
accessibility_issues[]
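A subset of the accessibility quick-scan can be sketched with the stdlib HTMLParser (a sketch covering three of the signals listed above; class and method names are illustrative):

```python
from html.parser import HTMLParser

class A11yScan(HTMLParser):
    """Quick scan: missing html lang, imgs without alt, inputs without labels."""

    def __init__(self):
        super().__init__()
        self.html_lang_missing = False
        self.imgs_without_alt = 0
        self.input_ids = []
        self.label_fors = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.html_lang_missing = "lang" not in attrs
        elif tag == "img" and "alt" not in attrs:
            self.imgs_without_alt += 1
        elif tag == "input":
            self.input_ids.append(attrs.get("id"))
        elif tag == "label" and attrs.get("for"):
            self.label_fors.add(attrs["for"])

    def unlabeled_inputs(self):
        # An input counts as labeled only if some <label for=...> targets its id.
        return sum(1 for i in self.input_ids if i not in self.label_fors)
```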
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Build /data/domains.duckdb on first run (tld+parts columns + ART index)
→ TLD filter goes from ~60s full scan to <100ms index lookup
→ System still works (slower) while index builds in background
- New /api/domains params: alpha_only, no_sld, keyword
→ alpha_only: domains with only letters (no hyphens/numbers)
→ no_sld: parts=2, excludes com.es / net.es patterns
→ keyword: LIKE '%term%' niche search
- /api/domains and /api/enriched now return total count for pagination
- Pagination: shows total matches, page X of Y, Next disabled at last page
- Enrich button: toast notifications instead of alert(), error handling
- Select all on page button, clear selection button
- Stats/TLD breakdown cached after first load (no repeat full scan)
- Header shows index build status (building → ready)
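The filter composition for /api/domains can be sketched as a parameterised query builder (a sketch assuming a domains table with name/tld/parts columns; clause syntax targets DuckDB):

```python
def build_domain_query(tld=None, alpha_only=False, no_sld=False, keyword=None):
    """Compose a parameterised WHERE clause for the /api/domains filters."""
    clauses, params = [], []
    if tld:
        clauses.append("tld = ?")  # hits the index instead of a full scan
        params.append(tld)
    if alpha_only:
        clauses.append("regexp_matches(name, '^[a-z]+$')")  # letters only
    if no_sld:
        clauses.append("parts = 2")  # excludes com.es / net.es patterns
    if keyword:
        clauses.append("name LIKE ?")  # niche keyword search
        params.append(f"%{keyword}%")
    where = " AND ".join(clauses) or "1=1"
    return f"SELECT name FROM domains WHERE {where}", params
```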
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FastAPI backend with DuckDB pushdown queries on 72M parquet
- Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
- Resumable parquet download with HTTP Range support
- Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
- Single-file Alpine.js + Chart.js dashboard on port 6677
- SQLite enrichment DB with job queue and scores tables
- Dockerized with persistent /data volume
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>