Resilience Package

The @hazeljs/resilience package provides fault-tolerance and resilience patterns for HazelJS microservices. It includes circuit breaker, retry, timeout, bulkhead, rate limiter, and metrics collection — all usable via decorators or programmatic API.

Purpose

In a microservices architecture, services depend on each other over the network. Networks are unreliable — services go down, become slow, or get overloaded. Without resilience patterns, a single failing service can cascade and take down your entire system. The @hazeljs/resilience package solves this by providing:

  • Circuit Breaker: Stops calling a failing service before it drags everything else down
  • Retry: Automatically re-attempts transient failures with configurable backoff
  • Timeout: Fails fast when a service is too slow, freeing resources
  • Bulkhead: Limits concurrent calls to isolate failures and prevent resource exhaustion
  • Rate Limiter: Controls request throughput with token bucket and sliding window strategies
  • Metrics: Tracks success/failure/latency per target, feeding into gateway canary decisions

Architecture

The package provides both a decorator API for declarative use and a programmatic API for advanced scenarios:

graph TD
  A["Your Service Method"] --> B["@WithCircuitBreaker"]
  B --> C["@WithRetry"]
  C --> D["@WithTimeout"]
  D --> E["@WithBulkhead"]
  E --> F["@WithRateLimit"]
  F --> G["Actual Call"]
  
  H["MetricsCollector"] --> B
  H --> I["SlidingWindow<br/>(Count / Time)"]
  
  style A fill:#3b82f6,stroke:#60a5fa,stroke-width:2px,color:#fff
  style B fill:#ef4444,stroke:#f87171,stroke-width:2px,color:#fff
  style C fill:#f59e0b,stroke:#fbbf24,stroke-width:2px,color:#fff
  style D fill:#8b5cf6,stroke:#a78bfa,stroke-width:2px,color:#fff
  style E fill:#10b981,stroke:#34d399,stroke-width:2px,color:#fff
  style F fill:#ec4899,stroke:#f472b6,stroke-width:2px,color:#fff
  style G fill:#6366f1,stroke:#818cf8,stroke-width:2px,color:#fff

Key Components

  1. CircuitBreaker: State machine (CLOSED → OPEN → HALF_OPEN) that prevents cascading failures
  2. RetryPolicy: Retries with fixed, linear, or exponential backoff and optional jitter
  3. Timeout: Promise-based timeout wrapper with cancellation
  4. Bulkhead: Concurrency limiter with queue support
  5. RateLimiter: Token bucket and sliding window strategies
  6. MetricsCollector: Tracks call statistics within a sliding window
  7. Decorators: @WithCircuitBreaker, @WithRetry, @WithTimeout, @WithBulkhead, @WithRateLimit, @Fallback

Installation

npm install @hazeljs/resilience

Quick Start

Decorator API

The decorator API lets you apply resilience patterns declaratively to any class method:

import { Injectable } from '@hazeljs/core';
import { WithCircuitBreaker, WithRetry, WithTimeout, WithBulkhead, Fallback } from '@hazeljs/resilience';

@Injectable()
class PaymentService {
  @WithCircuitBreaker({
    failureThreshold: 5,
    slidingWindow: { type: 'count', size: 20 },
    resetTimeout: 30_000,
    fallback: 'processPaymentFallback',
  })
  @WithRetry({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 })
  @WithTimeout(5000)
  @WithBulkhead({ maxConcurrent: 10, maxQueue: 50 })
  async processPayment(order: Order): Promise<PaymentResult> {
    return await this.paymentGateway.charge(order);
  }

  @Fallback('processPayment')
  async processPaymentFallback(order: Order): Promise<PaymentResult> {
    return { status: 'queued', message: 'Payment will be processed later' };
  }
}

When processPayment is called, the decorators execute in order: circuit breaker check → retry wrapper → timeout → bulkhead concurrency check → actual call. If the circuit is open, it immediately calls the fallback without touching the network.
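
Conceptually, the stacked decorators behave like the following nesting of the programmatic classes (a rough sketch of the wrapping order; the instances and names here are illustrative, not the code the decorators generate):

// Roughly equivalent nesting (illustrative; instances are hypothetical)
const result = await circuitBreaker.execute(() =>   // outermost: may short-circuit to the fallback
  retry.execute(() =>                               // re-attempts transient failures
    timeout.execute(() =>                           // bounds each attempt's duration
      bulkhead.execute(() =>                        // caps concurrent executions
        paymentGateway.charge(order)                // the actual call
      )
    )
  )
);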

Programmatic API

For cases where you need more control, use the classes directly:

import { CircuitBreaker, RetryPolicy, Timeout, Bulkhead, RateLimiter } from '@hazeljs/resilience';

// Circuit Breaker
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000,
  slidingWindow: { type: 'count', size: 20 },
});

const result = await breaker.execute(() => fetch('/api/data'));

// Listen to state changes
breaker.on('stateChange', (from, to) => {
  console.log(`Circuit breaker: ${from} -> ${to}`);
});

// Retry
const retry = new RetryPolicy({
  maxAttempts: 3,
  backoff: 'exponential',
  baseDelay: 1000,
  jitter: true,
});
const data = await retry.execute(() => fetch('/api/unstable'));

// Timeout
const timeout = new Timeout(5000);
const response = await timeout.execute(() => fetch('/api/slow'));

// Bulkhead
const bulkhead = new Bulkhead({ maxConcurrent: 10, maxQueue: 50 });
const result = await bulkhead.execute(() => intensiveOperation());

// Rate Limiter
const limiter = new RateLimiter({
  strategy: 'token-bucket',
  max: 100,
  window: 60000,
});
if (limiter.tryAcquire()) {
  await handleRequest();
}

Circuit Breaker

The circuit breaker prevents cascading failures by monitoring call success rates and temporarily blocking calls to unhealthy services.

How It Works

graph LR
  A["CLOSED<br/>(Normal)"] -->|"failures >= threshold"| B["OPEN<br/>(Blocking)"]
  B -->|"reset timeout"| C["HALF_OPEN<br/>(Testing)"]
  C -->|"success >= threshold"| A
  C -->|"failure"| B
  
  style A fill:#10b981,stroke:#34d399,stroke-width:2px,color:#fff
  style B fill:#ef4444,stroke:#f87171,stroke-width:2px,color:#fff
  style C fill:#f59e0b,stroke:#fbbf24,stroke-width:2px,color:#fff

  • CLOSED: All calls pass through. Failures are counted in the sliding window.
  • OPEN: All calls are immediately rejected. After resetTimeout, the breaker transitions to HALF_OPEN.
  • HALF_OPEN: A limited number of trial calls are allowed. If they succeed, the breaker closes. If they fail, it opens again.
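
For example, a short run of failures walks the breaker through these states (an illustrative sketch built on the API shown on this page; exact trigger counts depend on your sliding-window settings):

const breaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 10_000 });
breaker.on('stateChange', (from, to) => console.log(`${from} -> ${to}`));

// Three consecutive failures push the breaker from CLOSED to OPEN
for (let i = 0; i < 3; i++) {
  await breaker.execute(() => Promise.reject(new Error('boom'))).catch(() => {});
}

// While OPEN, calls are rejected immediately without reaching the target
await breaker.execute(() => fetch('/api/data')).catch(() => {
  console.log('rejected while the circuit is OPEN');
});

// Once resetTimeout elapses, the next call runs as a HALF_OPEN trial;
// enough successes close the circuit again.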

Configuration

const breaker = new CircuitBreaker({
  // When to open the circuit
  failureThreshold: 5,           // Number of failures to trigger OPEN
  failureRateThreshold: 50,      // Or: percentage failure rate to trigger OPEN

  // How long to wait before testing again
  resetTimeout: 30000,           // ms in OPEN state before trying HALF_OPEN

  // Sliding window for tracking failures
  slidingWindow: {
    type: 'count',               // 'count' or 'time'
    size: 20,                    // Last 20 calls (count) or 20s window (time)
  },

  // How many trial calls in HALF_OPEN
  halfOpenMaxCalls: 3,

  // Custom failure detection
  isFailure: (error) => {
    // Only count 5xx as failures, not 4xx
    return error.status >= 500;
  },
});

Events

breaker.on('stateChange', (from, to) => {
  console.log(`Circuit: ${from} -> ${to}`);
});

breaker.on('success', (duration) => {
  console.log(`Call succeeded in ${duration}ms`);
});

breaker.on('failure', (error, duration) => {
  console.log(`Call failed after ${duration}ms: ${error}`);
});

breaker.on('rejected', () => {
  console.log('Call rejected — circuit is OPEN');
});

Metrics

const metrics = breaker.getMetrics();
console.log(metrics);
// {
//   totalRequests: 150,
//   failureCount: 12,
//   failureRate: 8,
//   successCount: 138,
//   averageLatency: 45,
//   p99Latency: 120,
// }

Circuit Breaker Registry

Manage named circuit breakers across your application:

import { CircuitBreakerRegistry } from '@hazeljs/resilience';

// Get or create a named circuit breaker
const breaker = CircuitBreakerRegistry.getOrCreate('payment-service', {
  failureThreshold: 5,
  resetTimeout: 30000,
});

// Get an existing one
const same = CircuitBreakerRegistry.get('payment-service');

// List all registered breakers
const all = CircuitBreakerRegistry.getAll();

Retry Policy

Automatically retries failed operations with configurable backoff strategies.

Backoff Strategies

  • fixed: Same delay every time (e.g. 1s, 1s, 1s). Use case: simple retry with a constant delay.
  • linear: Delay increases linearly (e.g. 1s, 2s, 3s). Use case: gradually backing off.
  • exponential: Delay doubles each time (e.g. 1s, 2s, 4s, 8s). Use case: best for network calls, since it gives the service time to recover.
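
As a rough illustration of the strategies above, the per-attempt delay could be computed like this (a sketch, not the actual RetryPolicy internals):

// Illustrative delay calculation for retry attempt `attempt` (1 = first retry)
function computeDelay(
  attempt: number,
  backoff: 'fixed' | 'linear' | 'exponential',
  baseDelay: number,
  maxDelay: number,
  jitter: boolean,
): number {
  let delay = baseDelay;                          // 'fixed': 1s, 1s, 1s, ...
  if (backoff === 'linear') {
    delay = baseDelay * attempt;                  // 1s, 2s, 3s, ...
  } else if (backoff === 'exponential') {
    delay = baseDelay * 2 ** (attempt - 1);       // 1s, 2s, 4s, 8s, ...
  }
  delay = Math.min(delay, maxDelay);              // never exceed the configured cap
  // Full jitter: pick a random point between 0 and the computed delay
  return jitter ? Math.random() * delay : delay;
}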

Configuration

const retry = new RetryPolicy({
  maxAttempts: 3,                    // Total attempts (including first)
  backoff: 'exponential',           // 'fixed' | 'linear' | 'exponential'
  baseDelay: 1000,                  // Starting delay in ms
  maxDelay: 30000,                  // Cap on delay
  jitter: true,                     // Add randomness to prevent thundering herd

  // Only retry on certain errors
  retryOn: (error) => {
    return error.code === 'ECONNREFUSED' || error.status >= 500;
  },
});

const result = await retry.execute(async () => {
  return await fetch('https://api.example.com/data');
});

Retry with Circuit Breaker

Combine retry with circuit breaker for maximum resilience:

const breaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 30000 });
const retry = new RetryPolicy({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 });

// Retry wraps the circuit breaker: each failed attempt is recorded by the breaker,
// and if the circuit opens mid-retry, the remaining attempts fail fast
const result = await retry.execute(() =>
  breaker.execute(() => fetch('/api/data'))
);

Timeout

Enforces a time limit on operations, freeing resources when a service is too slow.

import { Timeout, TimeoutError } from '@hazeljs/resilience';

const timeout = new Timeout(5000); // 5 seconds

try {
  const result = await timeout.execute(async () => {
    return await fetch('/api/slow-endpoint');
  });
} catch (error) {
  if (error instanceof TimeoutError) {
    console.log('Request timed out after 5000ms');
  }
}

Or use the standalone helper:

import { withTimeout, TimeoutError } from '@hazeljs/resilience';

// withTimeout rejects with a TimeoutError if the promise doesn't settle within the limit
const result = await withTimeout(
  fetch('/api/slow-endpoint'),
  5000
);

Bulkhead

Limits concurrent executions to isolate failures and prevent resource exhaustion. Excess requests are queued up to a maximum, then rejected.

import { Bulkhead, BulkheadFullError } from '@hazeljs/resilience';

const bulkhead = new Bulkhead({
  maxConcurrent: 10,  // Max parallel executions
  maxQueue: 50,       // Max queued requests
  queueTimeout: 5000, // How long a queued request waits before being rejected
});

try {
  const result = await bulkhead.execute(async () => {
    return await processRequest();
  });
} catch (error) {
  if (error instanceof BulkheadFullError) {
    console.log('Service overloaded — try again later');
  }
}

// Check current state
console.log(bulkhead.getActiveCount());   // Current parallel executions
console.log(bulkhead.getQueueSize());     // Requests waiting in queue

Rate Limiter

Controls request throughput using token bucket or sliding window strategies.

Token Bucket

Allows bursts up to the bucket size, then refills at a steady rate:

import { RateLimiter } from '@hazeljs/resilience';

const limiter = new RateLimiter({
  strategy: 'token-bucket',
  max: 100,          // Bucket size (max burst)
  window: 60000,     // Refill window in ms
  refillRate: 10,    // Tokens added per second
});

if (limiter.tryAcquire()) {
  await handleRequest();
} else {
  // Inside an HTTP handler, where `res` is your framework's response object
  const retryAfter = limiter.getRetryAfterMs();
  res.status(429).header('Retry-After', String(Math.ceil(retryAfter / 1000))).send('Too Many Requests');
}
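
Conceptually, a token bucket behaves like the following sketch (an illustration of the algorithm, not the RateLimiter source):

// Minimal token-bucket sketch: `capacity` caps the burst size and
// `refillRate` tokens are added per second (names are illustrative).
class TokenBucketSketch {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity; // start full so an initial burst is allowed
  }

  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, never exceeding capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}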

Sliding Window

Tracks requests in a rolling time window:

const limiter = new RateLimiter({
  strategy: 'sliding-window',
  max: 100,          // Max requests per window
  window: 60000,     // Window size in ms
});
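
As a mental model, a sliding-window limiter keeps the timestamps of recent requests and rejects a call once the window already holds max entries (again a sketch, not the library's implementation):

// Minimal sliding-window sketch: allow at most `max` requests in any
// rolling `windowMs` period (illustrative only).
class SlidingWindowSketch {
  private timestamps: number[] = [];

  constructor(private max: number, private windowMs: number) {}

  tryAcquire(): boolean {
    const now = Date.now();
    // Drop timestamps that have fallen out of the rolling window
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.max) return false;
    this.timestamps.push(now);
    return true;
  }
}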

Metrics Collection

The MetricsCollector tracks call statistics within a sliding window, providing insights into service health:

import { MetricsCollector, MetricsRegistry } from '@hazeljs/resilience';

// Create a collector with a 60-second window
const collector = new MetricsCollector(60000);

// Record outcomes
collector.recordSuccess(45);  // 45ms latency
collector.recordFailure(120, 'timeout');

// Get aggregated metrics
const snapshot = collector.getSnapshot();
console.log(snapshot);
// {
//   totalRequests: 1500,
//   successCount: 1480,
//   failureCount: 20,
//   failureRate: 1.33,
//   averageLatency: 42,
//   p50Latency: 35,
//   p95Latency: 95,
//   p99Latency: 150,
// }

// Use the registry for named metrics
MetricsRegistry.getOrCreate('user-service');
MetricsRegistry.getOrCreate('order-service');

These metrics feed directly into the @hazeljs/gateway canary deployment engine for automated promotion and rollback decisions.

Combining Patterns

The real power comes from composing patterns together. Here's a production-ready service call:

Decorator Composition

@Injectable()
class OrderService {
  @WithCircuitBreaker({
    failureThreshold: 5,
    resetTimeout: 30000,
    fallback: 'createOrderFallback',
  })
  @WithRetry({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 })
  @WithTimeout(5000)
  @WithBulkhead({ maxConcurrent: 20, maxQueue: 100 })
  async createOrder(data: OrderData): Promise<Order> {
    return await this.httpClient.post('/orders', data);
  }

  @Fallback('createOrder')
  async createOrderFallback(data: OrderData): Promise<Order> {
    // Queue for later processing
    await this.queue.add('create-order', data);
    return { id: 'pending', status: 'queued' };
  }
}

Programmatic Composition

const breaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 30000 });
const retry = new RetryPolicy({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 });
const timeout = new Timeout(5000);
const bulkhead = new Bulkhead({ maxConcurrent: 20, maxQueue: 100 });

async function resilientCall<T>(fn: () => Promise<T>): Promise<T> {
  return bulkhead.execute(() =>
    retry.execute(() =>
      breaker.execute(() =>
        timeout.execute(fn)
      )
    )
  );
}

const result = await resilientCall(() => fetch('/api/orders'));

Integration with @hazeljs/gateway

The @hazeljs/gateway package uses @hazeljs/resilience internally for per-route protection. When you configure circuit breakers or rate limits in your gateway config, they use the same classes:

// gateway.config.ts
const gatewayConfig = () => ({
  gateway: {
    resilience: {
      defaultCircuitBreaker: {
        failureThreshold: parseInt(process.env.GATEWAY_CB_THRESHOLD || '5'),
        resetTimeout: parseInt(process.env.GATEWAY_CB_RESET_TIMEOUT || '30000'),
      },
      defaultTimeout: parseInt(process.env.GATEWAY_DEFAULT_TIMEOUT || '5000'),
    },
    routes: [
      {
        path: '/api/users/**',
        serviceName: 'user-service',
        circuitBreaker: { failureThreshold: 10 },
        rateLimit: { strategy: 'sliding-window', max: 100, window: 60000 },
      },
    ],
  },
});

Best Practices

  1. Order matters for decorators: Apply @WithCircuitBreaker outermost, then @WithRetry, then @WithTimeout. This ensures retries happen inside the circuit breaker's tracking.

  2. Set realistic timeouts: If a service normally responds in 50ms, waiting 5 seconds before giving up only ties up resources; something was already very wrong long before the timeout fired. Set timeouts close to your p99 latency plus a buffer.

  3. Use exponential backoff with jitter: Prevents the thundering herd problem where all retries hit the recovering service at the same time.

  4. Configure circuit breaker thresholds per service: A critical payment service might open after 3 failures, while a notification service can tolerate 10.

  5. Monitor metrics: Use MetricsCollector to track service health and feed data into dashboards or alerting systems.

  6. Always define fallbacks: For user-facing operations, define a fallback that returns a degraded response instead of an error.

  7. Use bulkheads for shared resources: If your service calls multiple downstream services, give each its own bulkhead so one slow service can't exhaust your entire connection pool (see the sketch below).
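
For example, giving each downstream dependency its own bulkhead keeps a slow service from starving the others (a sketch; the endpoints and limits are hypothetical):

// One bulkhead per downstream dependency (endpoints are hypothetical)
const inventoryBulkhead = new Bulkhead({ maxConcurrent: 10, maxQueue: 20 });
const shippingBulkhead = new Bulkhead({ maxConcurrent: 5, maxQueue: 10 });

// A slow shipping service can only exhaust its own slots;
// inventory calls keep flowing through their separate bulkhead.
const stock = await inventoryBulkhead.execute(() => fetch('/inventory/stock'));
const quote = await shippingBulkhead.execute(() => fetch('/shipping/quote'));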

What's Next?

  • Learn about Gateway Package for intelligent API routing with canary deployments
  • Explore Discovery Package for service registration and discovery
  • Check out Config Package for managing resilience settings via environment variables