# Error Recovery & Resilience Patterns

## Error Classification & Response

```yaml
Error Severity Levels:
  CRITICAL [10]:
    Examples: Data loss, security breach, production down
    Response: Immediate stop, alert, rollback, incident response
    Recovery: Manual intervention required, full investigation

  HIGH [7-9]:
    Examples: Build failure, test failure, deployment issues
    Response: Stop workflow, notify user, suggest fixes
    Recovery: Automated retry with backoff, alternative paths

  MEDIUM [4-6]:
    Examples: Warning conditions, performance degradation
    Response: Continue with warning, log for later review
    Recovery: Attempt optimization, monitor for escalation

  LOW [1-3]:
    Examples: Info messages, style violations, minor optimizations
    Response: Note in output, continue execution
    Recovery: Background fixes, cleanup on completion

Error Categories:
  Transient:
    Examples: Network timeouts, temporary resource unavailability
    Strategy: Exponential backoff retry, circuit breaker pattern

  Config:
    Examples: Missing env vars, incorrect settings, permissions
    Strategy: Validation, helpful error messages, setup guidance

  Logic:
    Examples: Code bugs, algorithm errors, edge cases
    Strategy: Fallback implementations, graceful degradation

  Resource:
    Examples: Out of memory, disk space, API rate limits
    Strategy: Resource monitoring, cleanup, queue management
```
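
The severity bands above can be turned into a small lookup. The following is a minimal sketch, not a definitive implementation: the names `Severity`, `classify`, and `should_stop` are hypothetical, and the only facts taken from the table are the score ranges and which levels stop the workflow.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels from the table above (names are illustrative)."""
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


def classify(score: int) -> Severity:
    """Map a 1-10 severity score onto the bands listed in the table."""
    if not 1 <= score <= 10:
        raise ValueError(f"severity score must be 1-10, got {score}")
    if score == 10:
        return Severity.CRITICAL
    if score >= 7:
        return Severity.HIGH
    if score >= 4:
        return Severity.MEDIUM
    return Severity.LOW


def should_stop(severity: Severity) -> bool:
    """HIGH and CRITICAL stop the workflow; MEDIUM and LOW continue."""
    return severity in (Severity.HIGH, Severity.CRITICAL)
```

A caller would typically classify first and branch on `should_stop(...)` before deciding whether to attempt automatic recovery.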

## Recovery Strategies

```yaml
Automatic Recovery:
  Retry Mechanisms:
    Simple: Up to 3 attempts with 1s delay
    Exponential: 1s, 2s, 4s, 8s delays with jitter
    Circuit Breaker: Stop retrying once failures reach a threshold
    
  Fallback Patterns:
    Alternative Commands: Use native tools if MCP fails
    Degraded Functionality: Skip non-essential features
    Cached Results: Use previous successful outputs
    
  State Management:
    Checkpoints: Save state before risky operations
    Rollback: Automatic revert to last known good state
    Cleanup: Remove partial results on failure

Manual Recovery:
  User Guidance:
    Clear Error Messages: What failed, why, how to fix
    Action Items: Specific steps user can take
    Documentation Links: Relevant help resources
    
  Intervention Points:
    Confirmation: Ask before destructive operations
    Override: Allow user to skip validation warnings
    Custom: Accept user-provided alternative approaches
    
  Recovery Tools:
    Diagnostic: Commands to investigate failures
    Repair: Automated fixes for common issues
    Reset: Return to clean state for fresh start
```
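
As an illustration of the retry mechanisms listed above, here is a minimal sketch that combines exponential backoff with jitter and a failure-count circuit breaker. All names (`backoff_delays`, `CircuitBreaker`, `call_with_retry`) and the default values (10% jitter, threshold of 3) are assumptions made for the example; only the 1s, 2s, 4s, 8s schedule comes from the text.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, retries: int = 4, jitter: float = 0.1,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... each scaled by a random +/- jitter fraction."""
    rng = rng or random.Random()
    for attempt in range(retries):
        delay = base * (2 ** attempt)
        yield delay * (1 + rng.uniform(-jitter, jitter))


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    """Opens once the failure count reaches `threshold` (assumed default: 3)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


def call_with_retry(operation: Callable[[], T], breaker: CircuitBreaker,
                    delays: Iterable[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, sleeping between failed attempts until delays run out
    or the breaker opens; the last exception is re-raised in either case."""
    if breaker.is_open:
        raise CircuitOpenError("circuit open; call not attempted")
    delay_iter = iter(delays)
    while True:
        try:
            result = operation()
        except Exception:
            breaker.record_failure()
            delay = next(delay_iter, None)
            if breaker.is_open or delay is None:
                raise
            sleep(delay)
        else:
            breaker.record_success()
            return result
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production breaker would usually also reset after a cool-down period, which is omitted here.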

## Command-Specific Recovery

```yaml
Build Failures:
  Clean & Retry: Remove artifacts, clear cache, rebuild
  Dependency Issues: Update lockfiles, reinstall packages
  Compilation Errors: Suggest fixes, alternative approaches
  
Test Failures:
  Flaky Tests: Retry failed tests, identify unstable tests
  Environment Issues: Reset test state, check prerequisites
  Coverage Gaps: Generate missing tests, update thresholds
  
Deploy Failures:
  Health Check Failures: Rollback, investigate logs
  Resource Constraints: Scale up, optimize deployment
  Configuration Issues: Validate settings, check secrets
  
Analysis Failures:
  Tool Unavailable: Fallback to alternative analysis tools
  Large Codebase: Reduce scope, process in batches
  Permission Issues: Guide user through access setup

MCP Server Failures:
  Connection Issues: Retry with exponential backoff
  Timeout: Reduce request complexity, use native tools
  Rate Limiting: Queue requests, implement backoff
  Service Unavailable: Fallback to cached results or native tools
```
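
Several entries above share one shape: try the preferred path, then fall back (for example MCP server, then native tools, then cached results). A minimal sketch of that fallback chain follows. `run_with_fallbacks` and the strategy labels are hypothetical; the real commands and tools are not specified in this document.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def run_with_fallbacks(
    strategies: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (label, callable) in order and return the first success.

    Returns the label alongside the result so callers can report which
    path was used. Raises RuntimeError listing every failure if none succeed.
    """
    errors: List[str] = []
    for label, strategy in strategies:
        try:
            return label, strategy()
        except Exception as exc:  # broad by design: any failure triggers fallback
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Returning the label makes degraded operation visible, so the output can note that a fallback was used rather than silently substituting results.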

## Error Monitoring & Learning

```yaml
Error Tracking:
  Frequency: Count occurrence of specific error types
  Patterns: Identify common failure sequences
  Trends: Monitor error rates over time
  Context: Track environmental factors during failures
  
Failure Analysis:
  Root Cause: Automated analysis of failure chains
  Prevention: Suggest preventive measures
  Optimization: Identify improvement opportunities
  Documentation: Update guides based on common issues
  
Learning System:
  Success Patterns: Learn from successful recovery strategies
  User Preferences: Remember user's preferred recovery methods
  Environment Adaptation: Adjust strategies based on project context
  Continuous Improvement: Update recovery logic based on outcomes
```
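
The frequency and pattern tracking described above could be sketched as follows. This is one possible approach, assuming errors are identified by a short string type; `ErrorTracker` and its `window` parameter are hypothetical names introduced for the example.

```python
from collections import Counter, deque
from typing import Deque, Tuple


class ErrorTracker:
    """Count error types and the short sequences in which they occur."""

    def __init__(self, window: int = 2) -> None:
        self.frequency: Counter = Counter()   # error type -> occurrences
        self.sequences: Counter = Counter()   # tuple of consecutive types -> occurrences
        self._recent: Deque[str] = deque(maxlen=window)

    def record(self, error_type: str) -> None:
        """Record one error and update the sliding-window sequence counts."""
        self.frequency[error_type] += 1
        self._recent.append(error_type)
        if len(self._recent) == self._recent.maxlen:
            self.sequences[tuple(self._recent)] += 1

    def most_common_sequence(self) -> Tuple[str, ...]:
        """Return the most frequent failure sequence, or () if none recorded."""
        top = self.sequences.most_common(1)
        return top[0][0] if top else ()
```

Trend and context tracking would need timestamps and environment metadata on each record, which this sketch leaves out.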

## Integration with Commands

```yaml
Pre-Execution Validation:
  Prerequisites: Check required tools, permissions, resources
  Environment: Validate configuration, network connectivity
  State: Ensure clean starting state, no conflicts
  
During Execution:
  Monitoring: Track progress, resource usage, early warnings
  Checkpointing: Save state at critical milestones
  Health Checks: Validate system state during long operations
  
Post-Execution:
  Verification: Confirm expected outcomes achieved
  Cleanup: Remove temporary files, release resources
  Reporting: Document successes, failures, lessons learned
  
Error Reporting Format:
  Structured: Consistent error message format across commands
  Actionable: Include specific steps for resolution
  Contextual: Provide relevant system and environment information
  Traceable: Include operation ID for troubleshooting
```
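
To make the error reporting format concrete, here is a minimal sketch of a report object covering the four properties above: structured, actionable, contextual, traceable. The field names and the choice of JSON output are assumptions; the document does not prescribe a schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ErrorReport:
    """Illustrative error report; field names are hypothetical."""
    command: str                          # which command failed
    message: str                          # what failed and why
    resolution_steps: List[str]           # actionable: specific steps to resolve
    context: Dict[str, str] = field(default_factory=dict)  # contextual info
    # traceable: a random operation ID generated per report
    operation_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize with a consistent key order across commands."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

For example, `ErrorReport("build", "dependency install failed", ["Reinstall packages"], {"environment": "ci"}).to_json()` yields one JSON object that a log pipeline or a user-facing message can both consume.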

## Configuration & Customization

```yaml
User Preferences:
  Recovery Style: Conservative (safe) vs Aggressive (fast)
  Retry Limits: Maximum attempts for different error types
  Notification: How and when to alert user of issues
  Automation Level: How much recovery to attempt automatically
  
Project Settings:
  Critical Operations: Commands that require extra safety
  Acceptable Risk: Tolerance for failures in development vs production
  Resource Limits: Maximum time, memory, network usage
  Dependencies: Critical external services that must be available
  
Environment Adaptation:
  Development: More aggressive retries, helpful error messages
  Staging: Balanced approach, thorough logging
  Production: Conservative recovery, immediate alerting
  CI/CD: Fast failure, detailed diagnostic information
```
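
The environment adaptation above could be expressed as per-environment policies with user overrides. The sketch below is illustrative only: `RecoveryPolicy`, `policy_for`, and every numeric value are assumptions chosen to reflect the relative ordering in the list (more retries in development, conservative in production, fast failure in CI/CD), not settings defined by this document.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical knobs; the values below are placeholders, not defaults."""
    max_retries: int
    auto_recover: bool
    alert_immediately: bool


POLICIES: Dict[str, RecoveryPolicy] = {
    "development": RecoveryPolicy(max_retries=5, auto_recover=True, alert_immediately=False),
    "staging": RecoveryPolicy(max_retries=3, auto_recover=True, alert_immediately=False),
    "production": RecoveryPolicy(max_retries=1, auto_recover=False, alert_immediately=True),
    "ci/cd": RecoveryPolicy(max_retries=0, auto_recover=False, alert_immediately=False),
}


def policy_for(environment: str,
               overrides: Optional[Dict[str, object]] = None) -> RecoveryPolicy:
    """Look up the base policy for an environment and apply user overrides."""
    try:
        base = POLICIES[environment.lower()]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    return replace(base, **(overrides or {}))
```

Keeping the policies frozen and applying overrides through `replace` means a user preference never mutates the shared defaults.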

---
*Error recovery: Resilient command execution with intelligent failure handling*