Resilience Package

The @hazeljs/resilience package provides fault-tolerance and resilience patterns for HazelJS microservices. It includes circuit breaker, retry, timeout, bulkhead, rate limiter, and metrics collection — all usable via decorators or programmatic API.

Purpose

In a microservices architecture, services depend on each other over the network. Networks are unreliable — services go down, become slow, or get overloaded. Without resilience patterns, a single failing service can cascade and take down your entire system. The @hazeljs/resilience package solves this by providing:

  • Circuit Breaker: Stops calling a failing service before it drags everything else down
  • Retry: Automatically re-attempts transient failures with configurable backoff
  • Timeout: Fails fast when a service is too slow, freeing resources
  • Bulkhead: Limits concurrent calls to isolate failures and prevent resource exhaustion
  • Rate Limiter: Controls request throughput with token bucket and sliding window strategies
  • Metrics: Tracks success/failure/latency per target, feeding into gateway canary decisions

Architecture

The package provides both a decorator API for declarative use and a programmatic API for advanced scenarios:

graph TD
  A["Your Service Method"] --> B["@WithCircuitBreaker"]
  B --> C["@WithRetry"]
  C --> D["@WithTimeout"]
  D --> E["@WithBulkhead"]
  E --> F["@WithRateLimit"]
  F --> G["Actual Call"]
  
  H["MetricsCollector"] --> B
  H --> I["SlidingWindow<br/>(Count / Time)"]
  
  style A fill:#3b82f6,stroke:#60a5fa,stroke-width:2px,color:#fff
  style B fill:#ef4444,stroke:#f87171,stroke-width:2px,color:#fff
  style C fill:#f59e0b,stroke:#fbbf24,stroke-width:2px,color:#fff
  style D fill:#8b5cf6,stroke:#a78bfa,stroke-width:2px,color:#fff
  style E fill:#10b981,stroke:#34d399,stroke-width:2px,color:#fff
  style F fill:#ec4899,stroke:#f472b6,stroke-width:2px,color:#fff
  style G fill:#6366f1,stroke:#818cf8,stroke-width:2px,color:#fff

Key Components

  1. CircuitBreaker: State machine (CLOSED → OPEN → HALF_OPEN) that prevents cascading failures
  2. RetryPolicy: Retries with fixed, linear, or exponential backoff and optional jitter
  3. Timeout: Promise-based timeout wrapper with cancellation
  4. Bulkhead: Concurrency limiter with queue support
  5. RateLimiter: Token bucket and sliding window strategies
  6. MetricsCollector: Tracks call statistics within a sliding window
  7. Decorators: @WithCircuitBreaker, @WithRetry, @WithTimeout, @WithBulkhead, @WithRateLimit, @Fallback

Installation

npm install @hazeljs/resilience

Quick Start

Decorator API

The decorator API lets you apply resilience patterns declaratively to any class method:

import { Injectable } from '@hazeljs/core';
import { WithCircuitBreaker, WithRetry, WithTimeout, WithBulkhead, Fallback } from '@hazeljs/resilience';

@Injectable()
class PaymentService {
  @WithCircuitBreaker({
    failureThreshold: 5,
    slidingWindow: { type: 'count', size: 20 },
    resetTimeout: 30_000,
    fallback: 'processPaymentFallback',
  })
  @WithRetry({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 })
  @WithTimeout(5000)
  @WithBulkhead({ maxConcurrent: 10, maxQueue: 50 })
  async processPayment(order: Order): Promise<PaymentResult> {
    return await this.paymentGateway.charge(order);
  }

  @Fallback('processPayment')
  async processPaymentFallback(order: Order): Promise<PaymentResult> {
    return { status: 'queued', message: 'Payment will be processed later' };
  }
}

When processPayment is called, the decorators execute in order: circuit breaker check → retry wrapper → timeout → bulkhead concurrency check → actual call. If the circuit is open, it immediately calls the fallback without touching the network.
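
Conceptually, the stacked decorators behave like the following nesting of the programmatic classes (a rough sketch of the wrapping order; the instances and names here are illustrative, not the code the decorators generate):

// Roughly equivalent nesting (illustrative; instances are hypothetical)
const result = await circuitBreaker.execute(() =>   // outermost: may short-circuit to the fallback
  retry.execute(() =>                               // re-attempts transient failures
    timeout.execute(() =>                           // bounds each attempt's duration
      bulkhead.execute(() =>                        // caps concurrent executions
        paymentGateway.charge(order)                // the actual call
      )
    )
  )
);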

Programmatic API

For cases where you need more control, use the classes directly:

import { CircuitBreaker, RetryPolicy, Timeout, Bulkhead, RateLimiter } from '@hazeljs/resilience';

// Circuit Breaker
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 30000,
  slidingWindow: { type: 'count', size: 20 },
});

const result = await breaker.execute(() => fetch('/api/data'));

// Listen to state changes
breaker.on('stateChange', (from, to) => {
  console.log(`Circuit breaker: ${from} -> ${to}`);
});

// Retry
const retry = new RetryPolicy({
  maxAttempts: 3,
  backoff: 'exponential',
  baseDelay: 1000,
  jitter: true,
});
const data = await retry.execute(() => fetch('/api/unstable'));

// Timeout
const timeout = new Timeout(5000);
const response = await timeout.execute(() => fetch('/api/slow'));

// Bulkhead
const bulkhead = new Bulkhead({ maxConcurrent: 10, maxQueue: 50 });
const result = await bulkhead.execute(() => intensiveOperation());

// Rate Limiter
const limiter = new RateLimiter({
  strategy: 'token-bucket',
  max: 100,
  window: 60000,
});
if (limiter.tryAcquire()) {
  await handleRequest();
}

Circuit Breaker

The circuit breaker prevents cascading failures by monitoring call success rates and temporarily blocking calls to unhealthy services.

How It Works

graph LR
  A["CLOSED<br/>(Normal)"] -->|"failures >= threshold"| B["OPEN<br/>(Blocking)"]
  B -->|"reset timeout"| C["HALF_OPEN<br/>(Testing)"]
  C -->|"success >= threshold"| A
  C -->|"failure"| B
  
  style A fill:#10b981,stroke:#34d399,stroke-width:2px,color:#fff
  style B fill:#ef4444,stroke:#f87171,stroke-width:2px,color:#fff
  style C fill:#f59e0b,stroke:#fbbf24,stroke-width:2px,color:#fff

  • CLOSED: All calls pass through. Failures are counted in the sliding window.
  • OPEN: All calls are immediately rejected. After resetTimeout, the breaker transitions to HALF_OPEN.
  • HALF_OPEN: A limited number of trial calls are allowed. If they succeed, the breaker closes. If they fail, it opens again.
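
For example, a short run of failures walks the breaker through these states (an illustrative sketch built on the API shown on this page; exact trigger counts depend on your sliding-window settings):

const breaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 10_000 });
breaker.on('stateChange', (from, to) => console.log(`${from} -> ${to}`));

// Three consecutive failures push the breaker from CLOSED to OPEN
for (let i = 0; i < 3; i++) {
  await breaker.execute(() => Promise.reject(new Error('boom'))).catch(() => {});
}

// While OPEN, calls are rejected immediately without reaching the target
await breaker.execute(() => fetch('/api/data')).catch(() => {
  console.log('rejected while the circuit is OPEN');
});

// Once resetTimeout elapses, the next call runs as a HALF_OPEN trial;
// enough successes close the circuit again.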

Configuration

const breaker = new CircuitBreaker({
  // When to open the circuit
  failureThreshold: 5,           // Number of failures to trigger OPEN
  failureRateThreshold: 50,      // Or: percentage failure rate to trigger OPEN

  // How long to wait before testing again
  resetTimeout: 30000,           // ms in OPEN state before trying HALF_OPEN

  // Sliding window for tracking failures
  slidingWindow: {
    type: 'count',               // 'count' or 'time'
    size: 20,                    // Last 20 calls (count) or 20s window (time)
  },

  // How many trial calls in HALF_OPEN
  halfOpenMaxCalls: 3,

  // Custom failure detection
  isFailure: (error) => {
    // Only count 5xx as failures, not 4xx
    return error.status >= 500;
  },
});

Events

breaker.on('stateChange', (from, to) => {
  console.log(`Circuit: ${from} -> ${to}`);
});

breaker.on('success', (duration) => {
  console.log(`Call succeeded in ${duration}ms`);
});

breaker.on('failure', (error, duration) => {
  console.log(`Call failed after ${duration}ms: ${error}`);
});

breaker.on('rejected', () => {
  console.log('Call rejected — circuit is OPEN');
});

Metrics

const metrics = breaker.getMetrics();
console.log(metrics);
// {
//   totalRequests: 150,
//   failureCount: 12,
//   failureRate: 8,
//   successCount: 138,
//   averageLatency: 45,
//   p99Latency: 120,
// }

Circuit Breaker Registry

Manage named circuit breakers across your application:

import { CircuitBreakerRegistry } from '@hazeljs/resilience';

// Get or create a named circuit breaker
const breaker = CircuitBreakerRegistry.getOrCreate('payment-service', {
  failureThreshold: 5,
  resetTimeout: 30000,
});

// Get an existing one
const same = CircuitBreakerRegistry.get('payment-service');

// List all registered breakers
const all = CircuitBreakerRegistry.getAll();

Retry Policy

Automatically retries failed operations with configurable backoff strategies.

Backoff Strategies

  • fixed: Same delay every time (e.g. 1s, 1s, 1s). Use case: simple retry with a constant delay.
  • linear: Delay increases linearly (e.g. 1s, 2s, 3s). Use case: gradually backing off.
  • exponential: Delay doubles each time (e.g. 1s, 2s, 4s, 8s). Use case: best for network calls, since it gives the service time to recover.
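
As a rough illustration of the strategies above, the per-attempt delay could be computed like this (a sketch, not the actual RetryPolicy internals):

// Illustrative delay calculation for retry attempt `attempt` (1 = first retry)
function computeDelay(
  attempt: number,
  backoff: 'fixed' | 'linear' | 'exponential',
  baseDelay: number,
  maxDelay: number,
  jitter: boolean,
): number {
  let delay = baseDelay;                          // 'fixed': 1s, 1s, 1s, ...
  if (backoff === 'linear') {
    delay = baseDelay * attempt;                  // 1s, 2s, 3s, ...
  } else if (backoff === 'exponential') {
    delay = baseDelay * 2 ** (attempt - 1);       // 1s, 2s, 4s, 8s, ...
  }
  delay = Math.min(delay, maxDelay);              // never exceed the configured cap
  // Full jitter: pick a random point between 0 and the computed delay
  return jitter ? Math.random() * delay : delay;
}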

Configuration

const retry = new RetryPolicy({
  maxAttempts: 3,                    // Total attempts (including first)
  backoff: 'exponential',           // 'fixed' | 'linear' | 'exponential'
  baseDelay: 1000,                  // Starting delay in ms
  maxDelay: 30000,                  // Cap on delay
  jitter: true,                     // Add randomness to prevent thundering herd

  // Only retry on certain errors
  retryOn: (error) => {
    return error.code === 'ECONNREFUSED' || error.status >= 500;
  },
});

const result = await retry.execute(async () => {
  return await fetch('https://api.example.com/data');
});

Retry with Circuit Breaker

Combine retry with circuit breaker for maximum resilience:

const breaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 30000 });
const retry = new RetryPolicy({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 });

// Retry wraps the circuit breaker: each failed attempt is recorded by the breaker,
// and if the circuit opens mid-retry, the remaining attempts fail fast
const result = await retry.execute(() =>
  breaker.execute(() => fetch('/api/data'))
);

Timeout

Enforces a time limit on operations, freeing resources when a service is too slow.

import { Timeout, TimeoutError } from '@hazeljs/resilience';

const timeout = new Timeout(5000); // 5 seconds

try {
  const result = await timeout.execute(async () => {
    return await fetch('/api/slow-endpoint');
  });
} catch (error) {
  if (error instanceof TimeoutError) {
    console.log('Request timed out after 5000ms');
  }
}

Or use the standalone helper:

import { withTimeout, TimeoutError } from '@hazeljs/resilience';

// withTimeout rejects with a TimeoutError if the promise doesn't settle within the limit
const result = await withTimeout(
  fetch('/api/slow-endpoint'),
  5000
);

Bulkhead

Limits concurrent executions to isolate failures and prevent resource exhaustion. Excess requests are queued up to a maximum, then rejected.

import { Bulkhead, BulkheadFullError } from '@hazeljs/resilience';

const bulkhead = new Bulkhead({
  maxConcurrent: 10,  // Max parallel executions
  maxQueue: 50,       // Max queued requests
  queueTimeout: 5000, // How long a queued request waits before being rejected
});

try {
  const result = await bulkhead.execute(async () => {
    return await processRequest();
  });
} catch (error) {
  if (error instanceof BulkheadFullError) {
    console.log('Service overloaded — try again later');
  }
}

// Check current state
console.log(bulkhead.getActiveCount());   // Current parallel executions
console.log(bulkhead.getQueueSize());     // Requests waiting in queue

Rate Limiter

Controls request throughput using token bucket or sliding window strategies.

Token Bucket

Allows bursts up to the bucket size, then refills at a steady rate:

import { RateLimiter } from '@hazeljs/resilience';

const limiter = new RateLimiter({
  strategy: 'token-bucket',
  max: 100,          // Bucket size (max burst)
  window: 60000,     // Refill window in ms
  refillRate: 10,    // Tokens added per second
});

if (limiter.tryAcquire()) {
  await handleRequest();
} else {
  // Inside an HTTP handler, where `res` is your framework's response object
  const retryAfter = limiter.getRetryAfterMs();
  res.status(429).header('Retry-After', String(Math.ceil(retryAfter / 1000))).send('Too Many Requests');
}
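
Conceptually, a token bucket behaves like the following sketch (an illustration of the algorithm, not the RateLimiter source):

// Minimal token-bucket sketch: `capacity` caps the burst size and
// `refillRate` tokens are added per second (names are illustrative).
class TokenBucketSketch {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity; // start full so an initial burst is allowed
  }

  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, never exceeding capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}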

Sliding Window

Tracks requests in a rolling time window:

const limiter = new RateLimiter({
  strategy: 'sliding-window',
  max: 100,          // Max requests per window
  window: 60000,     // Window size in ms
});
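
As a mental model, a sliding-window limiter keeps the timestamps of recent requests and rejects a call once the window already holds max entries (again a sketch, not the library's implementation):

// Minimal sliding-window sketch: allow at most `max` requests in any
// rolling `windowMs` period (illustrative only).
class SlidingWindowSketch {
  private timestamps: number[] = [];

  constructor(private max: number, private windowMs: number) {}

  tryAcquire(): boolean {
    const now = Date.now();
    // Drop timestamps that have fallen out of the rolling window
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.max) return false;
    this.timestamps.push(now);
    return true;
  }
}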

Metrics Collection

The MetricsCollector tracks call statistics within a sliding window, providing insights into service health:

import { MetricsCollector, MetricsRegistry } from '@hazeljs/resilience';

// Create a collector with a 60-second window
const collector = new MetricsCollector(60000);

// Record outcomes
collector.recordSuccess(45);  // 45ms latency
collector.recordFailure(120, 'timeout');

// Get aggregated metrics
const snapshot = collector.getSnapshot();
console.log(snapshot);
// {
//   totalRequests: 1500,
//   successCount: 1480,
//   failureCount: 20,
//   failureRate: 1.33,
//   averageLatency: 42,
//   p50Latency: 35,
//   p95Latency: 95,
//   p99Latency: 150,
// }

// Use the registry for named metrics
MetricsRegistry.getOrCreate('user-service');
MetricsRegistry.getOrCreate('order-service');

These metrics feed directly into the @hazeljs/gateway canary deployment engine for automated promotion and rollback decisions.

Combining Patterns

The real power comes from composing patterns together. Here's a production-ready service call:

Decorator Composition

@Injectable()
class OrderService {
  @WithCircuitBreaker({
    failureThreshold: 5,
    resetTimeout: 30000,
    fallback: 'createOrderFallback',
  })
  @WithRetry({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 })
  @WithTimeout(5000)
  @WithBulkhead({ maxConcurrent: 20, maxQueue: 100 })
  async createOrder(data: OrderData): Promise<Order> {
    return await this.httpClient.post('/orders', data);
  }

  @Fallback('createOrder')
  async createOrderFallback(data: OrderData): Promise<Order> {
    // Queue for later processing
    await this.queue.add('create-order', data);
    return { id: 'pending', status: 'queued' };
  }
}

Programmatic Composition

const breaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 30000 });
const retry = new RetryPolicy({ maxAttempts: 3, backoff: 'exponential', baseDelay: 500 });
const timeout = new Timeout(5000);
const bulkhead = new Bulkhead({ maxConcurrent: 20, maxQueue: 100 });

async function resilientCall<T>(fn: () => Promise<T>): Promise<T> {
  return bulkhead.execute(() =>
    retry.execute(() =>
      breaker.execute(() =>
        timeout.execute(fn)
      )
    )
  );
}

const result = await resilientCall(() => fetch('/api/orders'));

Integration with @hazeljs/gateway

The @hazeljs/gateway package uses @hazeljs/resilience internally for per-route protection. When you configure circuit breakers or rate limits in your gateway config, they use the same classes:

// gateway.config.ts
const gatewayConfig = () => ({
  gateway: {
    resilience: {
      defaultCircuitBreaker: {
        failureThreshold: parseInt(process.env.GATEWAY_CB_THRESHOLD || '5'),
        resetTimeout: parseInt(process.env.GATEWAY_CB_RESET_TIMEOUT || '30000'),
      },
      defaultTimeout: parseInt(process.env.GATEWAY_DEFAULT_TIMEOUT || '5000'),
    },
    routes: [
      {
        path: '/api/users/**',
        serviceName: 'user-service',
        circuitBreaker: { failureThreshold: 10 },
        rateLimit: { strategy: 'sliding-window', max: 100, window: 60000 },
      },
    ],
  },
});

Best Practices

  1. Order matters for decorators: Apply @WithCircuitBreaker outermost, then @WithRetry, then @WithTimeout. This ensures retries happen inside the circuit breaker's tracking.

  2. Set realistic timeouts: If a service normally responds in 50ms, waiting 5 seconds before giving up only ties up resources; something was already very wrong long before the timeout fired. Set timeouts close to your p99 latency plus a buffer.

  3. Use exponential backoff with jitter: Prevents the thundering herd problem where all retries hit the recovering service at the same time.

  4. Configure circuit breaker thresholds per service: A critical payment service might open after 3 failures, while a notification service can tolerate 10.

  5. Monitor metrics: Use MetricsCollector to track service health and feed data into dashboards or alerting systems.

  6. Always define fallbacks: For user-facing operations, define a fallback that returns a degraded response instead of an error.

  7. Use bulkheads for shared resources: If your service calls multiple downstream services, give each its own bulkhead so one slow service can't exhaust your entire connection pool (see the sketch below).
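
For example, giving each downstream dependency its own bulkhead keeps a slow service from starving the others (a sketch; the endpoints and limits are hypothetical):

// One bulkhead per downstream dependency (endpoints are hypothetical)
const inventoryBulkhead = new Bulkhead({ maxConcurrent: 10, maxQueue: 20 });
const shippingBulkhead = new Bulkhead({ maxConcurrent: 5, maxQueue: 10 });

// A slow shipping service can only exhaust its own slots;
// inventory calls keep flowing through their separate bulkhead.
const stock = await inventoryBulkhead.execute(() => fetch('/inventory/stock'));
const quote = await shippingBulkhead.execute(() => fetch('/shipping/quote'));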

What's Next?

  • Learn about Gateway Package for intelligent API routing with canary deployments
  • Explore Discovery Package for service registration and discovery
  • Check out Config Package for managing resilience settings via environment variables