1. prescreener.py: classify_with_deepseek now retries on 429 with
exponential back-off (5s → 10s → 20s → 40s, up to 4 attempts); the
same back-off also covers other transient errors (sketch after this list).
2. main.py: prescreen batches now run sequentially with a 3s gap instead
of in parallel via asyncio.gather. Parallel batches made the second
batch hit the 429 rate limit every time, leaving most domains
unclassified (only the smaller final batch succeeded).
3. index.html: prescreenSelected() now clears this.domains before
calling _fetch() so Alpine re-renders the full table with the
updated niche/type values; also updates the notify hint to mention
the expected 1-2 min wait.
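
A minimal sketch of the back-off from item 1 (assuming httpx; the helper name
and the exact attempt/delay split are illustrative):

    import asyncio
    import httpx

    async def _with_backoff(send, retries: int = 4):
        # one initial try plus up to 4 retries, sleeping 5s, 10s, 20s, 40s between them
        delay = 5.0
        for attempt in range(retries + 1):
            try:
                resp = await send()
                if resp.status_code != 429:
                    return resp
            except httpx.TransportError:
                pass  # other transient errors share the same back-off
            if attempt < retries:
                await asyncio.sleep(delay)
                delay *= 2
        return None  # exhausted: caller treats None as a failed classification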
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepSeek-R1 is too slow for Replicate's synchronous wait mode; the API returns
202 with a prediction URL instead of the completed output. Added a polling loop:
- POST with Prefer: wait=60
- If 202 or status=starting/processing, poll urls.get every 2s up to 90×
(~3 min ceiling)
- On succeeded, use the final response data as normal
- On failed/canceled/timeout, log and return []
Also guards against output=None before calling str.join().
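
Approximate shape of the loop (field names such as status and urls.get follow
Replicate's prediction object; the model URL and payload are placeholders):

    import asyncio
    import httpx

    async def _run_prediction(client: httpx.AsyncClient, url: str, payload: dict, token: str):
        headers = {"Authorization": f"Bearer {token}", "Prefer": "wait=60"}
        pred = (await client.post(url, json=payload, headers=headers)).json()
        for _ in range(90):  # poll every 2s, ~3 min ceiling
            if pred.get("status") in ("succeeded", "failed", "canceled"):
                break
            await asyncio.sleep(2)
            pred = (await client.get(pred["urls"]["get"], headers=headers)).json()
        if pred.get("status") != "succeeded":
            return []  # failed / canceled / timed out: log and skip
        return pred.get("output") or []  # output can be None, so guard before joining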
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _parse_classify_output stripped <think> block before searching for JSON.
DeepSeek-R1 often puts the JSON array inside the think block (especially
when it "decides" mid-reasoning), so stripping it first destroyed the data.
Fix: search the full output first, then inside <think>, then the stripped
text — three fallback strategies with info logging at each step (sketch below).
2. Phase 2 save used bare UPDATE WHERE domain=? which silently does nothing
if the domain row doesn't exist yet in enriched_domains.
Fix: replace with INSERT ... ON CONFLICT DO UPDATE (true upsert).
Also adds logger.info lines so container logs show raw DeepSeek output
and parse result count for easy debugging.
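
The three-strategy search from fix 1, roughly (regexes are illustrative):

    import json, logging, re

    log = logging.getLogger(__name__)

    def _parse_classify_output(raw: str) -> list:
        think = re.search(r"<think>([\s\S]*?)</think>", raw)
        stripped = re.sub(r"<think>[\s\S]*?</think>", "", raw)
        candidates = (("full", raw),
                      ("think", think.group(1) if think else ""),
                      ("stripped", stripped))
        for label, text in candidates:
            m = re.search(r"\[[\s\S]*\]", text)  # first JSON array in this candidate
            if not m:
                continue
            try:
                result = json.loads(m.group(0))
                log.info("parsed %d items from %s output", len(result), label)
                return result
            except json.JSONDecodeError:
                continue
        return []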
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 (no AI credits): httpx checks every selected domain concurrently
(30 parallel) with real browser UA — detects live/dead/parked/redirect.
Parked: keyword scan in body/title + known parking host redirect check.
Results saved to DB immediately; dead/parked never reach DeepSeek.
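
A minimal sketch of the bounded concurrency (the parking signal shown is a
placeholder for the real keyword lists):

    import asyncio
    import httpx

    UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."  # real browser UA string

    async def prescreen_domains(domains: list[str], limit: int = 30) -> dict[str, str]:
        sem = asyncio.Semaphore(limit)  # at most 30 checks in flight

        async def check(client, domain):
            async with sem:
                try:
                    r = await client.get(f"http://{domain}", headers={"User-Agent": UA})
                except httpx.HTTPError:
                    return domain, "dead"
                if "domain is for sale" in r.text.lower():  # placeholder parking signal
                    return domain, "parked"
                return domain, "live"

        async with httpx.AsyncClient(follow_redirects=True, timeout=15) as client:
            results = await asyncio.gather(*(check(client, d) for d in domains))
        return dict(results)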
Phase 2 (single DeepSeek call): all live-site titles + snippets bundled
into ONE Replicate/DeepSeek-R1 request → returns niche + type for every
domain in batch (up to 80 per call, parallelised if more).
- app/prescreener.py (new): _check_one(), prescreen_domains(),
classify_with_deepseek(), parking signal lists, same-domain redirect logic
- app/db.py: prescreen_status/niche/site_type/prescreen_at columns +
migrations; save_prescreen_results() upsert helper
- app/main.py: POST /api/prescreen/batch endpoint
- app/static/index.html:
- 🔍 Pre-screen button (disabled while running, shows spinner)
- Niche + Type columns in Browse and Leads tables (.pni/.pty pills)
- Prescreen status colour dot (●) when niche not yet set
- prescreening state flag; result toast shows per-status counts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- db.py: add `language` column to ai_queue; migration; queue_ai() accepts a
language param and re-queues with ON CONFLICT UPDATE so changing the language
works (sketch after this list)
- main.py: batch and single assess endpoints accept `language` from request body
- enricher.py: ai_worker_loop reads language column, passes to _assess_one()
- replicate_ai.py: assess_domain() and _build_prompt() accept language param;
OUTPUT LANGUAGE section injected into prompt so Gemini writes pitch/email in
the requested language (EN/ES/RO)
- index.html: flag dropdown (🇪🇸/🇬🇧/🇷🇴) next to AI Assess button; aiLang
state default ES; language sent in all batch assessment requests
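
The re-queue upsert, roughly (SQLite syntax; assumes ai_queue has a unique
domain column and a status column):

    def queue_ai(con, domain: str, language: str = "es") -> None:
        # re-queuing an already-queued domain just updates its language
        con.execute(
            """
            INSERT INTO ai_queue (domain, language, status)
            VALUES (?, ?, 'pending')
            ON CONFLICT(domain) DO UPDATE SET
                language = excluded.language,
                status = 'pending'
            """,
            (domain, language),
        )
        con.commit()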
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Prompt now describes complete agency capabilities (everything web-related)
- Concrete pitch examples with business name + specific problem references
- New mandatory output fields: outreach_email (a ready-to-send 3-4 sentence
email in Spanish) and email_subject (a specific subject line)
- HOT/WARM/COLD scoring guide based on site deficiency count
- Modal: pitch box replaced with full outreach email + subject + Copy button
- max_output_tokens raised to 6000
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- max_output_tokens 2048→4096 (main truncation fix)
- page snippet 2000→800 chars, search results capped at 600 chars
- JSON schema reordered: lead_quality/pitch_angle/services_needed first,
so most important fields survive even if output is truncated
- RULES block in prompt: placeholder = HOT lead, pitch_angle is MANDATORY,
services_needed must have ≥2 items, keep values ≤15 words to avoid truncation
- _parse_output: truncated JSON repair — closes open [] and {} brackets
and strips trailing incomplete key-value before retrying json.loads
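
The bracket repair, roughly (a best-effort heuristic that ignores brackets
inside string values, not a full parser):

    import json, re

    def _repair_truncated_json(text: str):
        # drop a trailing incomplete "key": value fragment, then close open brackets
        trimmed = re.sub(r',\s*"[^"]*"?\s*:?\s*[^,\[\]{}]*$', "", text)
        stack = []
        for ch in trimmed:
            if ch in "[{":
                stack.append(ch)
            elif ch in "]}" and stack:
                stack.pop()
        trimmed += "".join("]" if c == "[" else "}" for c in reversed(stack))
        try:
            return json.loads(trimmed)
        except json.JSONDecodeError:
            return None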
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scorer: dead sites capped at 5 points (they were scoring HOT from SSL/CMS signals)
2. Kit Digital: require explicit kit-digital/agente-digitalizador signals;
generic EU logo patterns (fondos-europeos, logo-ue, cofinanciado) removed.
Gemini kit_digital_confirmed now overwrites heuristic in DB.
3. Browse table: social links replaced with compact coloured icon badges
(fb/ig/in/x/tt/yt) linked to the profile URLs
4. site_analyzer: added has_gmb / gmb_url detection (Maps embed, Place links,
LocalBusiness schema); fed to the Gemini prompt (sketch after this list)
5. scorer: +5 no-social, +3 reachable contact; Gemini prompt includes GMB and
social media management as sellable services; modal shows GMB/social status
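
The GMB check from item 4 is roughly a few pattern matches over the fetched
HTML (patterns are illustrative):

    import re

    def detect_gmb(html: str) -> tuple[bool, str | None]:
        # Maps embed or Google Maps/Place link -> has_gmb + gmb_url
        for pat in (r'https://www\.google\.com/maps/embed[^"\']*',
                    r'https://(?:goo\.gl/maps|maps\.google\.com|www\.google\.com/maps/place)[^"\']*'):
            m = re.search(pat, html)
            if m:
                return True, m.group(0)
        # LocalBusiness schema counts as a signal even without a profile URL
        if re.search(r'"@type"\s*:\s*"LocalBusiness"', html):
            return True, None
        return False, None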
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track aiSt.done across poll cycles; re-fetch the current page whenever
the done count increases while on the browse tab.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The f-string in _build_prompt was never closed — the }} at the end of the
JSON template was missing the closing """. Python consumed the entire
rest of the file as f-string content, then tried to evaluate the
{\s\S} regex braces as an f-string expression, giving
"unexpected character after line continuation character".
Also bundles the earlier timeout fixes (SSL handshake, DNS, analyze_site
90s cap, _assess_one 180s cap, worker reset of stale running jobs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI worker fixes (root cause of "nothing reaches Replicate"):
- The worker task died silently — there was no exception handler around the while loop
- Added try/except around the entire loop body with exc_info logging
- Added a watchdog task that restarts dead workers every 10 seconds (sketch after this list)
- ensure_workers_alive() called on every /api/ai/assess/batch POST
- _assess_one() is now a top-level function (not closure) — avoids
subtle scoping bugs with async inner functions in while loops
- /api/ai/debug endpoint: shows worker alive status, task exception,
last 10 queue entries — browse to /api/ai/debug to diagnose
- /api/ai/worker/restart endpoint + UI button
- "Restart AI worker" button + "Debug AI queue" link in enrichment tab
site_analyzer.py — new signals:
- IP resolution + ip-api.com for ASN, org, ISP, host country
- EU hosting detection (27 EU + EEA + adequacy countries)
- GDPR: detects Cookiebot, OneTrust, CookiePro, Osano, Iubenda,
Borlabs, CookieYes, Complianz, Usercentrics + text signals
- Privacy policy and GDPR text presence
- Accessibility: missing html lang, count of images without alt, skip-nav
link presence, empty links, inputs without labels
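
The accessibility quick-scan amounts to a handful of counts over the parsed
DOM (a sketch, assuming BeautifulSoup; heuristics are simplified):

    from bs4 import BeautifulSoup

    def quick_a11y_scan(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        root = soup.find("html")
        return {
            "lang_missing": not (root and root.get("lang")),
            "imgs_without_alt": sum(1 for i in soup.find_all("img") if not i.get("alt")),
            "empty_links": sum(1 for a in soup.find_all("a") if not a.get_text(strip=True)),
            "has_skip_nav": any((a.get("href") or "").startswith("#")
                                and "skip" in a.get_text().lower()
                                for a in soup.find_all("a")),
            "inputs_without_label": sum(
                1 for i in soup.find_all("input")
                if i.get("type") not in ("hidden", "submit")
                and not i.get("aria-label")
                and not (i.get("id") and soup.find("label", attrs={"for": i["id"]}))),
        }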
Gemini prompt additions:
- Hosting section: IP, ASN, org/ISP, EU vs non-EU flag
- GDPR section: cookie tool, notice, privacy policy
- Accessibility section: all quick-scan results
- New output fields: hosting_notes, gdpr_compliance,
accessibility_issues[]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Build /data/domains.duckdb on first run (tld+parts columns + ART index;
sketch after this list)
→ TLD filter goes from ~60s full scan to <100ms index lookup
→ System still works (slower) while the index builds in the background
- New /api/domains params: alpha_only, no_sld, keyword
→ alpha_only: domains with only letters (no hyphens/numbers)
→ no_sld: parts=2, excludes com.es / net.es patterns
→ keyword: LIKE '%term%' niche search
- /api/domains and /api/enriched now return total count for pagination
- Pagination: shows total matches, page X of Y, Next disabled at last page
- Enrich button: toast notifications instead of alert(), error handling
- Select all on page button, clear selection button
- Stats/TLD breakdown cached after first load (no repeat full scan)
- Header shows index build status (building → ready)
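
First-run index build, roughly (duckdb Python API; the column derivations are
simplified):

    import duckdb

    def build_index(parquet_path: str, db_path: str = "/data/domains.duckdb") -> None:
        con = duckdb.connect(db_path)
        # materialise tld + parts once so later filters hit the index, not the parquet
        con.execute(f"""
            CREATE TABLE IF NOT EXISTS domains AS
            SELECT domain,
                   regexp_extract(domain, '([^.]+)$', 1) AS tld,
                   len(string_split(domain, '.')) AS parts
            FROM read_parquet('{parquet_path}')
        """)
        con.execute("CREATE INDEX IF NOT EXISTS idx_tld ON domains (tld)")  # ART index
        con.close()

After this, a WHERE tld = 'es' filter resolves via the ART index instead of
scanning the parquet, which is where the ~60s-to-<100ms drop comes from.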
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FastAPI backend with DuckDB pushdown queries on 72M parquet
- Async enrichment worker: HTTP, SSL, DNS MX, CMS fingerprint, ip-api.com
- Resumable parquet download with HTTP Range support
- Lead scoring engine (max 100 pts, target countries ES,GB,DE,FR,RO,PT,AD,IT)
- Single-file Alpine.js + Chart.js dashboard on port 6677
- SQLite enrichment DB with job queue and scores tables
- Dockerized with persistent /data volume
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>