SuperClaude/.claude/commands/shared/error-handling.yml

# Error Handling & Recovery System

## Legend
| Symbol | Meaning | | Abbrev | Meaning |
|--------|---------|---|--------|---------|
| 🔄 | retry/recovery | | err | error |
| ⚠ | warning/caution | | rec | recovery |
| ✅ | success/fixed | | ctx | context |
| 🔧 | repair/fix | | fail | failure |

## Error Classification & Response

```yaml
Severity_Levels:
  CRITICAL [10]:
    Scope: Data loss, security breach, prod down
    Response: Immediate stop, alert, rollback, incident response
    Recovery: Manual intervention required, full investigation

  HIGH [7-9]:
    Scope: Build failure, test failure, deployment issues
    Response: Stop workflow, notify user, suggest fixes
    Recovery: Automated retry w/ backoff, alternative paths

  MEDIUM [4-6]:
    Scope: Warning conditions, perf degradation
    Response: Continue w/ warning, log for later review
    Recovery: Attempt optimization, monitor for escalation

  LOW [1-3]:
    Scope: Info messages, style violations, minor optimizations
    Response: Note in output, continue execution
    Recovery: Background fixes, cleanup on completion

Error_Categories:
  Transient:
    Network_Timeouts: "MCP server unreachable, API timeouts"
    Resource_Busy: "File locked, system overloaded"
    Rate_Limits: "API quota exceeded, temporary blocks"
    Strategy: Exponential backoff retry, circuit breaker pattern

  Permanent:
    Syntax_Errors: "Invalid code, malformed input"
    Permission_Denied: "Access restrictions, auth failures"
    Not_Found: "Missing files, invalid paths"
    Strategy: No retry, immediate fallback or user guidance

  Context:
    Configuration: "Missing env vars, incorrect settings"
    State_Conflicts: "Dirty working tree, merge conflicts"
    Version_Mismatch: "Incompatible versions, deprecated APIs"
    Strategy: Validation, helpful error msgs, setup guidance

  Resource:
    Memory: "Out of memory, insufficient resources"
    Disk_Space: "Storage full, temp space unavailable"
    API_Limits: "Rate limits, quota exceeded"
    Strategy: Resource monitoring, cleanup, queue management
```
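The severity bands and category strategies above can be expressed as a small lookup. The following is a minimal Python sketch, not part of SuperClaude itself: the names `Severity`, `severity_for`, `stops_workflow`, and `CATEGORY_STRATEGY` are hypothetical, and only the score ranges and strategy text come from the block above.

```python
from enum import Enum


class Severity(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


# Strategy text copied from Error_Categories; the dict itself is illustrative.
CATEGORY_STRATEGY = {
    "Transient": "Exponential backoff retry, circuit breaker pattern",
    "Permanent": "No retry, immediate fallback or user guidance",
    "Context": "Validation, helpful error msgs, setup guidance",
    "Resource": "Resource monitoring, cleanup, queue management",
}


def severity_for(score: int) -> Severity:
    """Map a 1-10 score onto the bands listed under Severity_Levels."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be in 1..10, got {score}")
    if score == 10:
        return Severity.CRITICAL
    if score >= 7:
        return Severity.HIGH
    if score >= 4:
        return Severity.MEDIUM
    return Severity.LOW


def stops_workflow(severity: Severity) -> bool:
    """HIGH and CRITICAL halt the workflow; MEDIUM and LOW continue."""
    return severity in (Severity.HIGH, Severity.CRITICAL)
```

A caller might compute `severity_for(8)` for a build failure, see that `stops_workflow` returns `True`, and then consult `CATEGORY_STRATEGY["Transient"]` to decide whether a retry is worth attempting.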

## Intelligent Retry Strategies

```yaml
Retry_Logic:
  Exponential_Backoff:
    Base_Delay: "1 second"
    Max_Delay: "60 seconds"
    Max_Attempts: "3 for transient errors; 1 (no retry) for permanent errors"
    Jitter: "±25% randomization to avoid thundering herd"

  Adaptive_Strategy:
    Network_Errors: "Retry w/ longer timeout"
    Rate_Limits: "Wait for reset period + retry"
    Resource_Busy: "Short delay + retry w/ alternative"
    Permanent: "No retry, immediate fallback"

Circuit_Breaker:
  Failure_Threshold: "3 consecutive failures"
  Recovery_Time: "5 minutes before re-enabling"
  Health_Check: "Lightweight test before full retry"
```
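One way to read the retry and circuit-breaker settings above is sketched below in Python. This is an illustration under assumptions rather than the actual implementation: `backoff_delay`, `CircuitBreaker`, and `call_with_retry` are hypothetical names, treating `TimeoutError` and `ConnectionError` as the transient errors is an assumption, and the jittered delay is clamped to `Max_Delay` as a design choice. The constants (1s base, 60s cap, ±25% jitter, 3 attempts, 3 failures, 5 minutes) are taken from the block.

```python
import random
import time

BASE_DELAY = 1.0   # Base_Delay
MAX_DELAY = 60.0   # Max_Delay
JITTER = 0.25      # ±25%


def backoff_delay(attempt, rng=random.random):
    """Delay before the retry that follows failed attempt `attempt` (1-based)."""
    raw = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
    factor = 1 + JITTER * (2 * rng() - 1)  # uniform in [0.75, 1.25)
    return min(raw * factor, MAX_DELAY)


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, allows a probe after `recovery_time`."""

    def __init__(self, threshold=3, recovery_time=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery_time = recovery_time
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.recovery_time

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def call_with_retry(op, breaker, max_attempts=3,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Retry transient errors with backoff; any other exception propagates at once."""
    for attempt in range(1, max_attempts + 1):
        if not breaker.allow():
            raise RuntimeError("circuit open; skipping call")
        try:
            result = op()
        except retry_on:
            breaker.record_failure()
            if attempt == max_attempts:
                raise
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
```

Errors outside `retry_on` are not caught, which matches the "No retry, immediate fallback" rule for permanent errors. Once the recovery time has elapsed, the first call acts as the lightweight health check: a success closes the breaker, a failure reopens it.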

## MCP Server Failover

```yaml
Failover_Hierarchy:
  Context7_Failure:
    Primary: "C7 documentation lookup"
    Fallback_1: "WebSearch official docs"
    Fallback_2: "Local cache if available"
    Fallback_3: "Continue w/ warning + note limitation"

  Sequential_Failure:
    Primary: "Sequential thinking server"
    Fallback_1: "Native step-by-step analysis"
    Fallback_2: "Simplified linear approach"
    Fallback_3: "Manual breakdown w/ user input"

  Magic_Failure:
    Primary: "Magic UI component generation"
    Fallback_1: "Search existing components in project"
    Fallback_2: "Generate basic template manually"
    Fallback_3: "Provide implementation guidance"

  Puppeteer_Failure:
    Primary: "Browser automation & testing"
    Fallback_1: "Manual test instructions"
    Fallback_2: "Static analysis alternatives"
    Fallback_3: "Skip browser-specific operations"

Server_Health_Monitoring:
  Availability_Check:
    Frequency: "Every 5 minutes during active use"
    Timeout: "3 seconds per check"
    Circuit_Breaker: "Disable after 3 consecutive failures"
    Recovery_Check: "Re-enable after 5 minutes"

  Performance_Degradation:
    Slow_Response: ">30s response time"
    Action: "Switch to faster alternative if available"
    Notification: "Inform user of performance impact"
```
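The failover hierarchy above amounts to trying an ordered list of providers and recording why each one was skipped. A minimal Python sketch follows. The function name `lookup_with_failover` and the provider callables are hypothetical stand-ins, not real MCP client APIs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Provider = Tuple[str, Callable[[str], Optional[str]]]


def lookup_with_failover(
    query: str, providers: Sequence[Provider]
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each provider in order; return (provider_name, result, warnings).

    A provider counts as failed if it raises or returns None. When every
    provider fails, name and result are None and the final warning notes
    the limitation (the Fallback_3 behaviour).
    """
    warnings: List[str] = []
    for name, fetch in providers:
        try:
            result = fetch(query)
        except Exception as exc:  # any failure moves on to the next fallback
            warnings.append(f"{name} failed: {exc}")
            continue
        if result is not None:
            return name, result, warnings
        warnings.append(f"{name} returned nothing")
    warnings.append("all providers failed; continuing with limitation noted")
    return None, None, warnings
```

For the `Context7_Failure` chain, the list would hold the C7 lookup, a web search, and a local cache reader, in that order. The returned warnings can feed a message such as the "⚠ Using cached docs" notice.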

## Recovery Strategies

```yaml
Automatic_Recovery:
  Retry_Mechanisms:
    Simple: "Up to 3 attempts with 1s delay"
    Exponential: "Delays double from 1s (1s, 2s, 4s, ...), capped at 60s, with jitter"
    Circuit_Breaker: "Stop retries after threshold failures"

  Fallback_Patterns:
    Alternative_Commands: "Use native tools if MCP fails"
    Degraded_Functionality: "Skip non-essential features"
    Cached_Results: "Use previous successful outputs"

  State_Management:
    Checkpoints: "Save state before risky operations"
    Rollback: "Automatic revert to last known good state"
    Cleanup: "Remove partial results on failure"

Manual_Recovery:
  User_Guidance:
    Clear_Error_Messages: "What failed, why, how to fix"
    Action_Items: "Specific steps user can take"
    Documentation_Links: "Relevant help resources"

  Intervention_Points:
    Confirmation: "Ask before destructive operations"
    Override: "Allow user to skip validation warnings"
    Custom: "Accept user-provided alternative approaches"

  Recovery_Tools:
    Diagnostic: "Commands to investigate failures"
    Repair: "Automated fixes for common issues"
    Reset: "Return to clean state for fresh start"
```
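The `Cleanup: "Remove partial results on failure"` rule can be illustrated with a write-then-rename pattern. The sketch below is one possible approach, not the documented mechanism: the helper name `atomic_write` and the `.partial-` prefix are assumptions.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path, mode="w"):
    """Write to a temp file beside `path`; publish on success, delete on failure."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".partial-")
    try:
        with os.fdopen(fd, mode) as handle:
            yield handle
        # Only reached when the body finished without raising.
        os.replace(tmp, path)
    except BaseException:
        # Remove the partial result and leave the original file untouched.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Because the original file is replaced only after the body completes, a failure midway leaves the last known good content in place and no stray partial file behind.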

## Context Preservation

```yaml
Operation_Checkpoints:
  Before_Risky_Operations:
    Create_Checkpoint: "Save current state before destructive ops"
    Include: "File states, working directory, command context"
    Location: ".claudedocs/checkpoints/checkpoint-<timestamp>"

  During_Command_Chains:
    Intermediate_Results: "Save results after each successful step"
    Context_Handoff: "Pass validated context to next command"
    Rollback_Points: "Mark safe restoration points"

  Failure_Recovery:
    Partial_Completion: "Preserve completed work"
    State_Analysis: "Determine safe rollback point"
    User_Options: "Present recovery choices"

Context_Resilience:
  Session_State:
    Persistent_Storage: "Maintain state across interruptions"
    Auto_Save: "Periodic context snapshots"
    Recovery: "Restore from last known good state"

  Command_Chain_Recovery:
    Failed_Step_Isolation: "Don't lose previous successful steps"
    Alternative_Paths: "Suggest different approaches for failed step"
    Partial_Results: "Use completed work in recovery strategy"
```
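A checkpoint taken before a risky operation could be as simple as copying the affected files into the checkpoint location named above. The Python sketch below assumes exactly that. The function names, the timestamp format, and the choice to snapshot only an explicit file list are illustrative assumptions; the source specifies only the `.claudedocs/checkpoints/checkpoint-<timestamp>` location.

```python
import shutil
import time
from pathlib import Path


def create_checkpoint(root, files, stamp=None):
    """Copy `files` (paths relative to `root`) into a new checkpoint directory."""
    root = Path(root)
    stamp = stamp or time.strftime("%Y%m%dT%H%M%S")  # assumed timestamp format
    target = root / ".claudedocs" / "checkpoints" / f"checkpoint-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    for rel in files:
        dst = target / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(root / rel, dst)
    return target


def restore_checkpoint(root, checkpoint):
    """Copy every file in `checkpoint` back to its original place under `root`."""
    root = Path(root)
    checkpoint = Path(checkpoint)
    for src in checkpoint.rglob("*"):
        if src.is_file():
            dst = root / src.relative_to(checkpoint)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
```

Calling `create_checkpoint` before each step of a command chain gives the rollback points described above. On failure, `restore_checkpoint` with the most recent safe checkpoint reverts only the files that were snapshotted.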

## Proactive Error Prevention

```yaml
Pre_Execution_Validation:
  Environment_Check:
    Required_Tools: "Verify dependencies before starting"
    Permissions: "Check access rights for planned operations"
    Disk_Space: "Ensure adequate space for outputs"
    Network: "Verify connectivity for remote operations"

  Conflict_Detection:
    File_Locks: "Check for locked files before editing"
    Git_State: "Verify clean working tree for git ops"
    Process_Conflicts: "Detect conflicting background processes"

  Resource_Availability:
    Memory_Usage: "Ensure adequate RAM for large operations"
    CPU_Load: "Warn if system under heavy load"
    Token_Budget: "Estimate token usage vs available quota"

Risk_Assessment:
  Operation_Scoring:
    Data_Loss_Risk: "1-10 scale based on destructiveness"
    Reversibility: "Can operation be undone?"
    Scope_Impact: "How many files/systems affected?"

  Mitigation_Requirements:
    High_Risk: "Require explicit confirmation + backup"
    Medium_Risk: "Warn user + create checkpoint"
    Low_Risk: "Proceed w/ monitoring"
```
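The scoring inputs and mitigation tiers above can be combined as follows. This is a hedged Python sketch: the source does not say how the three factors are weighted or where the High/Medium/Low cut-offs fall, so the `+2` penalty for irreversible operations, the `+1` penalty for more than 10 affected files, and the 7/4 thresholds are all assumptions chosen for illustration. Only the mitigation text comes from the block.

```python
from dataclasses import dataclass

# Mitigation text copied from Mitigation_Requirements above.
MITIGATION = {
    "High": "Require explicit confirmation + backup",
    "Medium": "Warn user + create checkpoint",
    "Low": "Proceed w/ monitoring",
}


@dataclass
class Operation:
    data_loss_risk: int   # 1-10 scale based on destructiveness
    reversible: bool      # can the operation be undone?
    files_affected: int   # scope of impact


def risk_level(op: Operation) -> str:
    """Return "High", "Medium", or "Low" (weights and thresholds are assumed)."""
    score = op.data_loss_risk
    if not op.reversible:
        score += 2        # assumed penalty for irreversible operations
    if op.files_affected > 10:
        score += 1        # assumed penalty for wide scope
    score = min(score, 10)
    if score >= 7:
        return "High"
    if score >= 4:
        return "Medium"
    return "Low"


def required_mitigation(op: Operation) -> str:
    return MITIGATION[risk_level(op)]
```

Under these assumed weights, an irreversible operation touching 50 files with a base risk of 6 is rated High and requires confirmation plus a backup.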

## Command-Specific Recovery

```yaml
Build_Failures:
  Clean_Retry: "Remove artifacts, clear cache, rebuild"
  Dependency_Issues: "Update lockfiles, reinstall packages"
  Compilation_Errors: "Suggest fixes, alternative approaches"

Test_Failures:
  Flaky_Tests: "Retry failed tests, identify unstable tests"
  Environment_Issues: "Reset test state, check prerequisites"
  Coverage_Gaps: "Generate missing tests, update thresholds"

Deploy_Failures:
  Health_Check_Failures: "Rollback, investigate logs"
  Resource_Constraints: "Scale up, optimize deployment"
  Configuration_Issues: "Validate settings, check secrets"

Analysis_Failures:
  Tool_Unavailable: "Fallback to alternative analysis tools"
  Large_Codebase: "Reduce scope, batch processing"
  Permission_Issues: "Guide user through access setup"
```

## Enhanced Error Reporting

```yaml
Intelligent_Error_Messages:
  Root_Cause_Analysis:
    Technical_Details: "Specific error codes, stack traces"
    User_Context: "What user was trying to accomplish"
    Environmental_Factors: "System state, recent changes"

  Actionable_Guidance:
    Immediate_Steps: "What user can do right now"
    Alternative_Approaches: "Different ways to achieve goal"
    Prevention: "How to avoid this error in future"

  Context_Preservation:
    Session_Info: "Command history, current state"
    Relevant_Files: "Which files were being processed"
    System_State: "Git status, dependency versions"

Error_Learning:
  Pattern_Recognition:
    Frequent_Issues: "Track commonly occurring errors"
    User_Patterns: "Learn user-specific failure modes"
    System_Patterns: "Identify environment-specific issues"

  Adaptive_Responses:
    Personalized_Suggestions: "Based on user's history"
    Proactive_Warnings: "Predict likely issues"
    Automated_Fixes: "Apply known solutions automatically"
```

## Integration with Commands

```yaml
Pre_Execution_Validation:
  Prerequisites: "Check required tools, permissions, resources"
  Environment: "Validate configuration, network connectivity"
  State: "Ensure clean starting state, no conflicts"

During_Execution:
  Monitoring: "Track progress, resource usage, early warnings"
  Checkpointing: "Save state at critical milestones"
  Health_Checks: "Validate system state during long operations"

Post_Execution:
  Verification: "Confirm expected outcomes achieved"
  Cleanup: "Remove temporary files, release resources"
  Reporting: "Document successes, failures, lessons learned"

Error_Reporting_Format:
  Structured: "Consistent error message format across commands"
  Actionable: "Include specific steps for resolution"
  Contextual: "Provide relevant system and environment information"
  Traceable: "Include operation ID for troubleshooting"
```
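The four properties under `Error_Reporting_Format` (structured, actionable, contextual, traceable) suggest a single report type shared by all commands. The Python sketch below shows one possible shape. The class name `ErrorReport`, its field names, the 8-character operation ID, and the rendered layout are assumptions, not a format defined by SuperClaude.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ErrorReport:
    what_failed: str                                  # structured: same fields everywhere
    why: str
    steps: List[str]                                  # actionable: resolution steps
    context: Dict[str, str] = field(default_factory=dict)  # contextual: system state
    operation_id: str = field(                        # traceable: ID for troubleshooting
        default_factory=lambda: uuid.uuid4().hex[:8]
    )

    def render(self) -> str:
        lines = [
            f"[op {self.operation_id}] {self.what_failed}",
            f"Cause: {self.why}",
            "Next steps:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, 1)]
        if self.context:
            lines.append("Context:")
            lines += [f"  {key}: {value}" for key, value in self.context.items()]
        return "\n".join(lines)
```

A command would build one `ErrorReport` per failure and print `render()`, so every message states what failed, why, and how to fix it in the same order.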

## Configuration & Customization

```yaml
User_Preferences:
  Recovery_Style: "Conservative (safe) vs Aggressive (fast)"
  Retry_Limits: "Maximum attempts for different error types"
  Notification: "How and when to alert user of issues"
  Automation_Level: "How much recovery to attempt automatically"

Project_Settings:
  Critical_Operations: "Commands that require extra safety"
  Acceptable_Risk: "Tolerance for failures in development vs production"
  Resource_Limits: "Maximum time, memory, network usage"
  Dependencies: "Critical external services that must be available"

Environment_Adaptation:
  Development: "More aggressive retries, helpful error messages"
  Staging: "Balanced approach, thorough logging"
  Production: "Conservative recovery, immediate alerting"
  CI_CD: "Fast failure, detailed diagnostic information"
```

## Usage Examples

```yaml
Network_Failure_Scenario:
  Error: "Context7 server timeout during docs lookup"
  Recovery: "Auto-fallback to WebSearch → local cache"
  User_Experience: "⚠ Using cached docs (Context7 unavailable)"

File_Lock_Scenario:
  Error: "Cannot edit file (locked by another process)"
  Recovery: "Wait 5s → retry → suggest alternatives"
  User_Experience: "Retrying in 5s... or try manual edit"

Command_Chain_Failure:
  Error: "Step 3 of 5 fails in build workflow"
  Recovery: "Preserve steps 1-2, suggest alternatives for 3"
  User_Experience: "Build partially complete. Alternative approaches: ..."
```

---
*Error Handling v1.0 - Comprehensive error recovery and resilience for SuperClaude*