Day 11: Chaos Testing Framework

What I Built

  • Chaos testing framework with 6 failure scenarios
  • IBKR connection loss simulation
  • Database timeout handling
  • Graceful backoff under rate limiting
  • AI model timeout fallback
  • Stale market data detection
  • Network jitter retry logic
  • Chaos report generation

Code Highlight

# Chaos testing scenarios
from unittest.mock import patch

# retry_with_backoff is the project's retry helper (defined elsewhere, not shown here)


class ChaosScenario:
    """Base class for chaos test scenarios."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    async def execute(self, *args, **kwargs):
        """Run the scenario; subclasses must override this."""
        raise NotImplementedError


# Example: Network Jitter Scenario
class NetworkJitterScenario(ChaosScenario):
    def __init__(self):
        super().__init__(
            "Network Jitter",
            "Intermittent network failures; retry logic recovers",
        )

    # Awaited from a test marked @pytest.mark.asyncio. The marker belongs on
    # the test function: under pytest's default naming rules this method is
    # not collected as a test.
    async def execute(self, ibkr_client):
        """Simulate network jitter (fails then succeeds)."""
        call_count = 0

        async def flaky_submit_order(*args, **kwargs):
            # First call raises, every later call succeeds
            nonlocal call_count
            call_count += 1
            if call_count == 1:
                raise ConnectionError("Network jitter")
            return {"orderId": "12345"}

        with patch.object(ibkr_client, "submit_order", side_effect=flaky_submit_order):
            # Retry should succeed on second attempt
            result = await retry_with_backoff(
                ibkr_client.submit_order,
                max_retries=3,
                base_delay=0.1,
                exceptions=(ConnectionError,),
            )
            assert result["orderId"] == "12345"
            assert call_count == 2  # one failure plus one successful retry
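
The retry_with_backoff helper called in the scenario is not shown in this post. A minimal sketch of what such a helper might look like follows, assuming the keyword arguments used in the call (max_retries, base_delay, exceptions). The exponential doubling, the small random jitter, and the choice to count max_retries as retries after the first attempt are all assumptions for illustration, not the project's actual implementation.

```python
import asyncio
import random


async def retry_with_backoff(
    func,
    *args,
    max_retries: int = 3,
    base_delay: float = 0.1,
    exceptions: tuple = (Exception,),
    **kwargs,
):
    """Await func(*args, **kwargs), retrying when a listed exception is raised.

    Hypothetical sketch: max_retries counts retries after the first attempt,
    and the delay doubles each time (base_delay, 2 * base_delay, ...).
    """
    for attempt in range(max_retries + 1):
        try:
            return await func(*args, **kwargs)
        except exceptions:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Exponential backoff plus a little jitter to avoid retry bursts
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

With this shape, the jitter scenario's first call raises ConnectionError, the helper sleeps for roughly base_delay, and the second call returns the order, which is what the two assertions check.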

Architecture Decision

Chaos testing is crucial for production reliability. Rather than testing only happy paths, the framework simulates real-world failures to verify that the system degrades gracefully. It uses pytest fixtures for clean test isolation and focuses on the six most critical failure modes: external service outages (IBKR connection loss), database timeouts, rate limits, AI model timeouts, stale market data, and network jitter.
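
To make "graceful degradation" concrete, here is a small sketch of the AI model timeout case: the call is bounded with asyncio.wait_for, and a conservative default is returned when the model is too slow. The names analyze_with_fallback, slow_model, and fast_model, and the shape of the returned dictionaries, are hypothetical and not taken from the project.

```python
import asyncio


async def analyze_with_fallback(model_call, fallback, timeout: float = 1.0):
    """Await model_call(), returning `fallback` if it exceeds `timeout` seconds.

    Hypothetical helper for illustration only.
    """
    try:
        return await asyncio.wait_for(model_call(), timeout=timeout)
    except asyncio.TimeoutError:
        # Degrade gracefully: use a conservative default instead of failing
        return fallback


async def slow_model():
    await asyncio.sleep(0.5)  # simulates a slow AI model
    return {"signal": "BUY", "source": "model"}


async def fast_model():
    return {"signal": "BUY", "source": "model"}
```

A chaos test for this failure mode could then await analyze_with_fallback(slow_model, fallback, timeout=0.05) and assert that the fallback value comes back rather than an exception.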

Testing Results

All 6 chaos tests pass, one for each critical failure scenario:

  • ✅ IBKR Connection Loss - ConnectionError handling
  • ✅ Database Timeout - AsyncSession timeout simulation
  • ✅ Rate Limiting - HTTPException 429 handling
  • ✅ AI Model Slowness - TimeoutError with fallback
  • ✅ Stale Market Data - Cache expiration detection
  • ✅ Network Jitter - Retry logic with backoff
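
A results list like this one could be produced by a small runner that executes each scenario and records whether it passed. The sketch below is one possible shape, assuming each scenario is supplied as a (name, coroutine function) pair. ScenarioResult, run_scenarios, and format_report are hypothetical names, and the report layout is invented for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    passed: bool
    error: str = ""


async def run_scenarios(scenarios):
    """Run each (name, coroutine function) pair and record pass/fail."""
    results = []
    for name, scenario in scenarios:
        try:
            await scenario()
            results.append(ScenarioResult(name, True))
        except Exception as exc:  # a failed assertion or an unhandled fault
            results.append(ScenarioResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results


def format_report(results):
    """Render the results as a plain-text chaos report."""
    passed = sum(r.passed for r in results)
    lines = [f"Chaos report: {passed}/{len(results)} scenarios passed"]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        detail = f" ({r.error})" if r.error else ""
        lines.append(f"  [{status}] {r.name}{detail}")
    return "\n".join(lines)
```

Catching exceptions per scenario keeps one failing scenario from hiding the results of the others, which matters when the point of the run is to see every failure mode at once.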

Next Steps

Day 12: Integration testing with full trade flow mocking and E2E test setup.


Follow @therealkamba on X for regular updates.