# Error Handling & Recovery System ## Legend | Symbol | Meaning | | Abbrev | Meaning | |--------|---------|---|--------|---------| | 🔄 | retry/recovery | | err | error | | ⚠ | warning/caution | | rec | recovery | | ✅ | success/fixed | | ctx | context | | 🔧 | repair/fix | | fail | failure | ## Error Classification & Response ```yaml Severity_Levels: CRITICAL [10]: Data loss, security breach, prod down Response: Immediate stop, alert, rollback, incident response Recovery: Manual intervention required, full investigation HIGH [7-9]: Build failure, test failure, deployment issues Response: Stop workflow, notify user, suggest fixes Recovery: Automated retry w/ backoff, alternative paths MEDIUM [4-6]: Warning conditions, perf degradation Response: Continue w/ warning, log for later review Recovery: Attempt optimization, monitor for escalation LOW [1-3]: Info messages, style violations, minor optimizations Response: Note in output, continue execution Recovery: Background fixes, cleanup on completion Error_Categories: Transient: Network_Timeouts: "MCP server unreachable, API timeouts" Resource_Busy: "File locked, system overloaded" Rate_Limits: "API quota exceeded, temporary blocks" Strategy: Exponential backoff retry, circuit breaker pattern Permanent: Syntax_Errors: "Invalid code, malformed input" Permission_Denied: "Access restrictions, auth failures" Not_Found: "Missing files, invalid paths" Strategy: No retry, immediate fallback or user guidance Context: Configuration: "Missing env vars, incorrect settings" State_Conflicts: "Dirty working tree, merge conflicts" Version_Mismatch: "Incompatible versions, deprecated APIs" Strategy: Validation, helpful error msgs, setup guidance Resource: Memory: "Out of memory, insufficient resources" Disk_Space: "Storage full, temp space unavailable" API_Limits: "Rate limits, quota exceeded" Strategy: Resource monitoring, cleanup, queue management ``` ## Intelligent Retry Strategies ```yaml Retry_Logic: Exponential_Backoff: Base_Delay: "1 second" Max_Delay: "60 seconds" Max_Attempts: "3 for transient, 1 for permanent" Jitter: "±25% randomization to avoid thundering herd" Adaptive_Strategy: Network_Errors: "Retry w/ longer timeout" Rate_Limits: "Wait for reset period + retry" Resource_Busy: "Short delay + retry w/ alternative" Permanent: "No retry, immediate fallback" Circuit_Breaker: Failure_Threshold: "3 consecutive failures" Recovery_Time: "5 minutes before re-enabling" Health_Check: "Lightweight test before full retry" ``` ## MCP Server Failover ```yaml Failover_Hierarchy: Context7_Failure: Primary: "C7 documentation lookup" Fallback_1: "WebSearch official docs" Fallback_2: "Local cache if available" Fallback_3: "Continue w/ warning + note limitation" Sequential_Failure: Primary: "Sequential thinking server" Fallback_1: "Native step-by-step analysis" Fallback_2: "Simplified linear approach" Fallback_3: "Manual breakdown w/ user input" Magic_Failure: Primary: "Magic UI component generation" Fallback_1: "Search existing components in project" Fallback_2: "Generate basic template manually" Fallback_3: "Provide implementation guidance" Puppeteer_Failure: Primary: "Browser automation & testing" Fallback_1: "Manual test instructions" Fallback_2: "Static analysis alternatives" Fallback_3: "Skip browser-specific operations" Server_Health_Monitoring: Availability_Check: Frequency: "Every 5 minutes during active use" Timeout: "3 seconds per check" Circuit_Breaker: "Disable after 3 consecutive failures" Recovery_Check: "Re-enable after 5 minutes" Performance_Degradation: Slow_Response: ">30s response time" Action: "Switch to faster alternative if available" Notification: "Inform user of performance impact" ``` ## Recovery Strategies ```yaml Automatic_Recovery: Retry_Mechanisms: Simple: "Up to 3 attempts with 1s delay" Exponential: "1s, 2s, 4s, 8s delays with jitter" Circuit_Breaker: "Stop retries after threshold failures" Fallback_Patterns: Alternative_Commands: "Use native tools if MCP fails" Degraded_Functionality: "Skip non-essential features" Cached_Results: "Use previous successful outputs" State_Management: Checkpoints: "Save state before risky operations" Rollback: "Automatic revert to last known good state" Cleanup: "Remove partial results on failure" Manual_Recovery: User_Guidance: Clear_Error_Messages: "What failed, why, how to fix" Action_Items: "Specific steps user can take" Documentation_Links: "Relevant help resources" Intervention_Points: Confirmation: "Ask before destructive operations" Override: "Allow user to skip validation warnings" Custom: "Accept user-provided alternative approaches" Recovery_Tools: Diagnostic: "Commands to investigate failures" Repair: "Automated fixes for common issues" Reset: "Return to clean state for fresh start" ``` ## Context Preservation ```yaml Operation_Checkpoints: Before_Risky_Operations: Create_Checkpoint: "Save current state before destructive ops" Include: "File states, working directory, command context" Location: ".claudedocs/checkpoints/checkpoint-" During_Command_Chains: Intermediate_Results: "Save results after each successful step" Context_Handoff: "Pass validated context to next command" Rollback_Points: "Mark safe restoration points" Failure_Recovery: Partial_Completion: "Preserve completed work" State_Analysis: "Determine safe rollback point" User_Options: "Present recovery choices" Context_Resilience: Session_State: Persistent_Storage: "Maintain state across interruptions" Auto_Save: "Periodic context snapshots" Recovery: "Restore from last known good state" Command_Chain_Recovery: Failed_Step_Isolation: "Don't lose previous successful steps" Alternative_Paths: "Suggest different approaches for failed step" Partial_Results: "Use completed work in recovery strategy" ``` ## Proactive Error Prevention ```yaml Pre_Execution_Validation: Environment_Check: Required_Tools: "Verify dependencies before starting" Permissions: "Check access rights for planned operations" Disk_Space: "Ensure adequate space for outputs" Network: "Verify connectivity for remote operations" Conflict_Detection: File_Locks: "Check for locked files before editing" Git_State: "Verify clean working tree for git ops" Process_Conflicts: "Detect conflicting background processes" Resource_Availability: Memory_Usage: "Ensure adequate RAM for large operations" CPU_Load: "Warn if system under heavy load" Token_Budget: "Estimate token usage vs available quota" Risk_Assessment: Operation_Scoring: Data_Loss_Risk: "1-10 scale based on destructiveness" Reversibility: "Can operation be undone?" Scope_Impact: "How many files/systems affected?" Mitigation_Requirements: High_Risk: "Require explicit confirmation + backup" Medium_Risk: "Warn user + create checkpoint" Low_Risk: "Proceed w/ monitoring" ``` ## Command-Specific Recovery ```yaml Build_Failures: Clean_Retry: "Remove artifacts, clear cache, rebuild" Dependency_Issues: "Update lockfiles, reinstall packages" Compilation_Errors: "Suggest fixes, alternative approaches" Test_Failures: Flaky_Tests: "Retry failed tests, identify unstable tests" Environment_Issues: "Reset test state, check prerequisites" Coverage_Gaps: "Generate missing tests, update thresholds" Deploy_Failures: Health_Check_Failures: "Rollback, investigate logs" Resource_Constraints: "Scale up, optimize deployment" Configuration_Issues: "Validate settings, check secrets" Analysis_Failures: Tool_Unavailable: "Fallback to alternative analysis tools" Large_Codebase: "Reduce scope, batch processing" Permission_Issues: "Guide user through access setup" ``` ## Enhanced Error Reporting ```yaml Intelligent_Error_Messages: Root_Cause_Analysis: Technical_Details: "Specific error codes, stack traces" User_Context: "What user was trying to accomplish" Environmental_Factors: "System state, recent changes" Actionable_Guidance: Immediate_Steps: "What user can do right now" Alternative_Approaches: "Different ways to achieve goal" Prevention: "How to avoid this error in future" Context_Preservation: Session_Info: "Command history, current state" Relevant_Files: "Which files were being processed" System_State: "Git status, dependency versions" Error_Learning: Pattern_Recognition: Frequent_Issues: "Track commonly occurring errors" User_Patterns: "Learn user-specific failure modes" System_Patterns: "Identify environment-specific issues" Adaptive_Responses: Personalized_Suggestions: "Based on user's history" Proactive_Warnings: "Predict likely issues" Automated_Fixes: "Apply known solutions automatically" ``` ## Integration with Commands ```yaml Pre_Execution_Validation: Prerequisites: "Check required tools, permissions, resources" Environment: "Validate configuration, network connectivity" State: "Ensure clean starting state, no conflicts" During_Execution: Monitoring: "Track progress, resource usage, early warnings" Checkpointing: "Save state at critical milestones" Health_Checks: "Validate system state during long operations" Post_Execution: Verification: "Confirm expected outcomes achieved" Cleanup: "Remove temporary files, release resources" Reporting: "Document successes, failures, lessons learned" Error_Reporting_Format: Structured: "Consistent error message format across commands" Actionable: "Include specific steps for resolution" Contextual: "Provide relevant system and environment information" Traceable: "Include operation ID for troubleshooting" ``` ## Configuration & Customization ```yaml User_Preferences: Recovery_Style: "Conservative (safe) vs Aggressive (fast)" Retry_Limits: "Maximum attempts for different error types" Notification: "How and when to alert user of issues" Automation_Level: "How much recovery to attempt automatically" Project_Settings: Critical_Operations: "Commands that require extra safety" Acceptable_Risk: "Tolerance for failures in development vs production" Resource_Limits: "Maximum time, memory, network usage" Dependencies: "Critical external services that must be available" Environment_Adaptation: Development: "More aggressive retries, helpful error messages" Staging: "Balanced approach, thorough logging" Production: "Conservative recovery, immediate alerting" CI_CD: "Fast failure, detailed diagnostic information" ``` ## Usage Examples ```yaml Network_Failure_Scenario: Error: "Context7 server timeout during docs lookup" Recovery: "Auto-fallback to WebSearch → local cache" User_Experience: "⚠ Using cached docs (Context7 unavailable)" File_Lock_Scenario: Error: "Cannot edit file (locked by another process)" Recovery: "Wait 5s → retry → suggest alternatives" User_Experience: "Retrying in 5s... or try manual edit" Command_Chain_Failure: Error: "Step 3 of 5 fails in build workflow" Recovery: "Preserve steps 1-2, suggest alternatives for 3" User_Experience: "Build partially complete. Alternative approaches: ..." ``` --- *Error Handling v1.0 - Comprehensive error recovery and resilience for SuperClaude*