Retry Strategies

Reliability

Configure intelligent retry behavior with exponential backoff and Dead Letter Queue patterns.

Exponential Backoff

Increasing delays between retries

Smart Retry Logic

Retry only transient failures

Dead Letter Queue

Capture permanently failed jobs

Failure Monitoring

Track and alert on failures

Retry Configuration

Configure how jobs should be retried when they fail. By default, jobs are retried up to 3 times with exponential backoff.

import { platform } from '@/lib/platform'

const job = await platform.jobs.schedule({
  url: 'https://myapp.com/api/jobs/process',
  body: { data: 'value' },
  retries: {
    maxAttempts: 5,
    backoff: 'exponential',
    initialDelay: 60,       // 1 minute
    maxDelay: 3600,         // 1 hour max between retries
    retryOn: [500, 502, 503, 504], // Only retry server errors
  },
})

Retry Options

Property      Type                      Default               Description
maxAttempts   number                    3                     Maximum number of retry attempts (1-10)
backoff       "linear" | "exponential"  "exponential"         Backoff strategy between retries
initialDelay  number                    60                    Initial delay in seconds before the first retry
maxDelay      number                    3600                  Maximum delay in seconds between retries
retryOn       number[]                  [500, 502, 503, 504]  HTTP status codes that trigger a retry

Exponential Backoff

Exponential backoff increases the delay between retries, giving failing services time to recover while avoiding overwhelming them with requests.

// Exponential backoff formula (retry is the 1-based retry number):
// delay = min(initialDelay * 2^(retry - 1), maxDelay)

// Example with initialDelay=60, maxDelay=3600:
// Retry 1: 60 seconds (1 min)
// Retry 2: 120 seconds (2 min)
// Retry 3: 240 seconds (4 min)
// Retry 4: 480 seconds (8 min)
// Retry 5: 960 seconds (16 min)
// Retry 6: 1920 seconds (32 min)
// Retry 7: 3600 seconds (60 min; 3840 capped at maxDelay)

Exponential backoff
// Best for most use cases - prevents thundering herd
const job = await platform.jobs.schedule({
  url: 'https://myapp.com/api/jobs/webhook',
  body: { event: 'order.created' },
  retries: {
    maxAttempts: 5,
    backoff: 'exponential',
    initialDelay: 30,  // Start at 30 seconds
    maxDelay: 1800,    // Cap at 30 minutes
  },
})

// Retry timeline:
// Fail #1 -> wait 30s  -> Attempt #2
// Fail #2 -> wait 60s  -> Attempt #3
// Fail #3 -> wait 120s -> Attempt #4
// Fail #4 -> wait 240s -> Attempt #5
// Fail #5 -> move to DLQ
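
The timeline above can be reproduced with a small helper. This is a minimal sketch of the capped doubling rule, not the platform's internal implementation: `retryDelaySeconds` and `retrySchedule` are hypothetical names, jitter is ignored, and it assumes (following the timeline) that maxAttempts counts the first attempt, so 5 attempts produce 4 waits.

```typescript
// Hypothetical helper: delay before the n-th retry (n starts at 1), without jitter.
function retryDelaySeconds(retry: number, initialDelay: number, maxDelay: number): number {
  return Math.min(initialDelay * 2 ** (retry - 1), maxDelay)
}

// Waits between attempts, assuming maxAttempts includes the first attempt.
function retrySchedule(maxAttempts: number, initialDelay: number, maxDelay: number): number[] {
  const delays: number[] = []
  for (let retry = 1; retry < maxAttempts; retry++) {
    delays.push(retryDelaySeconds(retry, initialDelay, maxDelay))
  }
  return delays
}

console.log(retrySchedule(5, 30, 1800)) // [30, 60, 120, 240]
```

Printing the schedule for a configuration before deploying it is a quick way to check that the last wait is still acceptable for the job in question.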

Jitter

Sylphx automatically adds random jitter (up to 10%) to retry delays. This prevents synchronized retries when multiple jobs fail at the same time.
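
To illustrate the effect, here is a minimal sketch of jitter applied to a base delay. How the platform actually computes jitter is not documented here, so the sketch assumes a uniform random amount of up to 10% is added on top of the delay; `withJitter` is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical helper: add up to `ratio` (10% by default) of random extra delay.
// `random` is injectable so the result can be made deterministic in tests.
function withJitter(
  delaySeconds: number,
  ratio = 0.1,
  random: () => number = Math.random,
): number {
  return delaySeconds + delaySeconds * ratio * random()
}

// Ten jobs that failed at the same moment no longer retry at the same instant.
const jitteredDelays: number[] = []
for (let i = 0; i < 10; i++) {
  jitteredDelays.push(withJitter(60))
}
console.log(jitteredDelays.map((d) => d.toFixed(1)))
```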

Max Retry Limits

Set appropriate limits based on the criticality and nature of your job.

High retry count
// Critical jobs that must eventually succeed
const job = await platform.jobs.schedule({
  url: 'https://myapp.com/api/jobs/payment-webhook',
  body: { paymentId: 'pay_123' },
  retries: {
    maxAttempts: 10,        // Try hard to deliver
    backoff: 'exponential',
    initialDelay: 60,
    maxDelay: 3600,
  },
  // Also configure DLQ alerting
  deadLetterQueue: {
    enabled: true,
    webhookUrl: 'https://myapp.com/api/alerts/dlq',
  },
})
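
When choosing maxAttempts, it helps to know how long a job can keep retrying before it reaches the DLQ. The sketch below sums the capped delays for a configuration. It is an estimate under stated assumptions: maxAttempts is taken to include the first attempt (as in the retry timeline earlier), and jitter and handler execution time are ignored. `totalRetryWindowSeconds` is a hypothetical helper.

```typescript
// Hypothetical helper: total seconds spent waiting across all retries (no jitter).
function totalRetryWindowSeconds(
  maxAttempts: number,
  initialDelay: number,
  maxDelay: number,
): number {
  let total = 0
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += Math.min(initialDelay * 2 ** (retry - 1), maxDelay)
  }
  return total
}

// The payment webhook configuration above: 10 attempts, 60s initial, 1h cap.
const windowSeconds = totalRetryWindowSeconds(10, 60, 3600)
console.log(`${(windowSeconds / 3600).toFixed(2)} hours`) // "4.05 hours"
```

Under these assumptions the payment webhook above is retried for roughly four hours, which is worth comparing with how long the downstream outage you are guarding against typically lasts.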

Dead Letter Queue (DLQ)

Jobs that fail after all retry attempts are moved to the Dead Letter Queue. This allows you to investigate and manually reprocess failed jobs.

import { platform } from '@/lib/platform'

// Configure DLQ behavior when scheduling
const job = await platform.jobs.schedule({
  url: 'https://myapp.com/api/jobs/important-task',
  body: { taskId: 'task_123' },
  retries: {
    maxAttempts: 5,
    backoff: 'exponential',
  },
  deadLetterQueue: {
    enabled: true,
    webhookUrl: 'https://myapp.com/api/webhooks/dlq-alert',
    retentionDays: 30,
  },
})

DLQ Options

Property       Type     Default  Description
enabled        boolean  true     Enable the Dead Letter Queue for failed jobs
webhookUrl     string   (none)   URL to notify when a job moves to the DLQ
retentionDays  number   30       Days to retain failed jobs in the DLQ
maxSize        number   1000     Maximum number of jobs in the DLQ before the oldest are purged

Working with DLQ

Query DLQ
import { platform } from '@/lib/platform'

// List all jobs in the Dead Letter Queue
const dlqJobs = await platform.jobs.listDLQ({
  limit: 50,
  offset: 0,
})

for (const job of dlqJobs.items) {
  console.log({
    id: job.id,
    url: job.url,
    failedAt: job.failedAt,
    attempts: job.attempts,
    lastError: job.lastError,
  })
}
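
Because listDLQ takes limit and offset, reviewing a large queue means paging through it. The following is a minimal sketch of a generic pager that accepts the page fetcher as an argument, so it does not depend on the SDK. It assumes a page exposes its jobs under `items` (as in the example above) and that a page shorter than `limit` marks the end; `collectAll` and `Page` are hypothetical names.

```typescript
// Shape assumed from the listing example: a page exposes its jobs under `items`.
interface Page<T> {
  items: T[]
}

// Hypothetical helper: keep requesting pages until one comes back short.
async function collectAll<T>(
  fetchPage: (args: { limit: number; offset: number }) => Promise<Page<T>>,
  limit = 50,
): Promise<T[]> {
  const all: T[] = []
  for (let offset = 0; ; offset += limit) {
    const page = await fetchPage({ limit, offset })
    all.push(...page.items)
    if (page.items.length < limit) return all
  }
}

// Usage with the SDK call from the example above (not executed here):
// const jobs = await collectAll((args) => platform.jobs.listDLQ(args))
```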

Handling Transient vs Permanent Failures

Distinguish between transient failures (which should be retried) and permanent failures (which should not).

Proper error handling
// app/api/jobs/process-order/route.ts
import { platform } from '@/lib/platform'
import { NextRequest } from 'next/server'
// getOrder and processOrder are your own application functions (not shown)

export async function POST(req: NextRequest) {
  const isValid = await platform.jobs.verifyRequest(req)
  if (!isValid) {
    return new Response('Unauthorized', { status: 401 })
  }

  const { orderId } = await req.json()

  try {
    const order = await getOrder(orderId)

    // Permanent failure - order doesn't exist
    if (!order) {
      return new Response(JSON.stringify({
        error: 'Order not found',
        code: 'ORDER_NOT_FOUND',
      }), {
        status: 400, // 4xx = no retry
      })
    }

    // Permanent failure - invalid state
    if (order.status === 'cancelled') {
      return new Response(JSON.stringify({
        error: 'Order is cancelled',
        code: 'ORDER_CANCELLED',
      }), {
        status: 400, // 4xx = no retry
      })
    }

    await processOrder(order)

    return new Response('OK', { status: 200 })
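  } catch (err) {
    // Catch variables are `unknown` under strict TypeScript, so narrow before reading fields
    const error = err as { code?: string; status?: number; retryAfter?: number }

    // Transient failure - external service down
    if (error.code === 'ECONNREFUSED') {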
      return new Response(JSON.stringify({
        error: 'Payment service unavailable',
        code: 'SERVICE_UNAVAILABLE',
      }), {
        status: 503, // 5xx = will retry
      })
    }

    // Transient failure - rate limited
    if (error.status === 429) {
      return new Response(JSON.stringify({
        error: 'Rate limited',
        code: 'RATE_LIMITED',
        retryAfter: error.retryAfter,
      }), {
        status: 503, // 5xx = will retry
        headers: {
          // Fall back to 60 seconds (an arbitrary choice) if the upstream error has no retryAfter
          'Retry-After': String(error.retryAfter ?? 60),
        },
      })
    }

    // Unknown error - retry to be safe
    return new Response(JSON.stringify({
      error: 'Internal error',
      code: 'INTERNAL_ERROR',
    }), {
      status: 500, // 5xx = will retry
    })
  }
}

Return Appropriate Status Codes

  • 2xx: Success - job completed
  • 4xx: Permanent failure - do not retry (except 408, 429)
  • 5xx: Transient failure - will retry
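
The rule above can be written as a small predicate, which is handy for unit-testing a handler's responses. This is a sketch only: the platform makes the retry decision itself, `classifyStatus` is a hypothetical helper, and it checks nothing but the retryOn list. Whether 408 and 429 are retried without being listed in retryOn is not modeled here.

```typescript
// Default taken from the Retry Options table.
const DEFAULT_RETRY_ON = [500, 502, 503, 504]

type Outcome = 'success' | 'retry' | 'permanent-failure'

// Hypothetical helper: classify a handler response by its HTTP status code.
function classifyStatus(status: number, retryOn: number[] = DEFAULT_RETRY_ON): Outcome {
  if (status >= 200 && status < 300) return 'success'
  return retryOn.includes(status) ? 'retry' : 'permanent-failure'
}

console.log(classifyStatus(200)) // "success"
console.log(classifyStatus(503)) // "retry"
console.log(classifyStatus(400)) // "permanent-failure"
console.log(classifyStatus(429, [...DEFAULT_RETRY_ON, 429])) // "retry"
```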

Monitoring Failed Jobs

Set up monitoring and alerting for job failures to catch issues early.

Query failure metrics
import { platform } from '@/lib/platform'

// Get job failure statistics
const stats = await platform.jobs.getStats({
  period: '24h',
})

console.log({
  totalJobs: stats.total,
  completedJobs: stats.completed,
  failedJobs: stats.failed,
  failureRate: stats.failureRate, // failure rate; compared as a fraction below (0.1 = 10%)
  averageRetries: stats.averageRetries,
  dlqSize: stats.dlqSize,
})

// Get failure stats by URL
const urlStats = await platform.jobs.getStats({
  period: '24h',
  groupBy: 'url',
})

for (const stat of urlStats) {
  if (stat.failureRate > 0.1) { // > 10% failure rate
    console.warn(`High failure rate for ${stat.url}: ${stat.failureRate}`)
  }
}
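
The threshold check can also be pulled out into a pure function, which makes it easy to reuse in a scheduled health check and to test without calling the API. A minimal sketch, assuming each per-URL entry has `url` and `failureRate` fields as in the loop above and that failureRate is a fraction; `findUnhealthyUrls` and `UrlStat` are hypothetical names.

```typescript
// Fields assumed from the per-URL loop above; failureRate is treated as a fraction (0.1 = 10%).
interface UrlStat {
  url: string
  failureRate: number
}

// Hypothetical helper: URLs whose failure rate exceeds the threshold, worst first.
function findUnhealthyUrls(stats: UrlStat[], threshold = 0.1): UrlStat[] {
  return stats
    .filter((s) => s.failureRate > threshold)
    .sort((a, b) => b.failureRate - a.failureRate)
}

// Sample data for illustration only.
const sampleStats: UrlStat[] = [
  { url: 'https://myapp.com/api/jobs/webhook', failureRate: 0.02 },
  { url: 'https://myapp.com/api/jobs/process', failureRate: 0.25 },
]

for (const s of findUnhealthyUrls(sampleStats)) {
  console.warn(`High failure rate for ${s.url}: ${(s.failureRate * 100).toFixed(1)}%`)
}
```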

Best Practices

Make Jobs Idempotent

Jobs may run multiple times due to retries. Use idempotency keys to prevent duplicate processing.

Return Proper Status Codes

Use 4xx for permanent failures (no retry) and 5xx for transient failures (will retry).

Set Appropriate Timeouts

Configure timeouts shorter than your retry delay to prevent overlapping executions.

Monitor DLQ

Set up alerts for DLQ size and regularly review failed jobs to identify systemic issues.

Idempotency Pattern

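// Use idempotency keys in your job handler
// `db` is assumed to be your app's database client (Prisma-style API shown) and
// `processOrder` your own business logic; neither is part of the platform SDK.
import { NextRequest } from 'next/server'

export async function POST(req: NextRequest) {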
  const { orderId, idempotencyKey } = await req.json()

  // Check if already processed
  const existing = await db.processedJobs.findUnique({
    where: { idempotencyKey },
  })

  if (existing) {
    // Already processed - return success without reprocessing
    return new Response('Already processed', { status: 200 })
  }

  // Process the job
  await processOrder(orderId)

  // Record that we processed it
  await db.processedJobs.create({
    data: { idempotencyKey, processedAt: new Date() },
  })

  return new Response('OK', { status: 200 })
}
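
The handler above checks for the key, processes the job, and only then records it, so two deliveries of the same job that arrive close together could both pass the check. One common variation is to claim the key first and release it if processing fails. The sketch below shows the idea with an in-memory Set standing in for a database table with a unique constraint on the key; `claimKey` and `handleOnce` are hypothetical names, and a real handler would perform the claim as an atomic insert in its database.

```typescript
// In-memory stand-in for a table with a unique constraint on the idempotency key.
const claimedKeys = new Set<string>()

// Hypothetical helper: returns true only for the first caller that claims the key.
function claimKey(key: string): boolean {
  if (claimedKeys.has(key)) return false
  claimedKeys.add(key)
  return true
}

// Run `work` at most once per key; release the claim if it throws so a retry can run it again.
function handleOnce(key: string, work: () => void): 'processed' | 'duplicate' {
  if (!claimKey(key)) return 'duplicate'
  try {
    work()
    return 'processed'
  } catch (error) {
    claimedKeys.delete(key)
    throw error
  }
}

let runs = 0
console.log(handleOnce('order_123', () => { runs++ })) // "processed"
console.log(handleOnce('order_123', () => { runs++ })) // "duplicate"
console.log(runs) // 1
```

Releasing the claim on failure matters here: without it, a transient error on the first delivery would cause every later retry to be treated as a duplicate.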