@threepointone

Summary

This PR adds structured retry support to the Agents SDK. The goal is to replace ad-hoc retry logic with a consistent, configurable system that works across schedules, queues, MCP connections, and user code.

This is a proposal -- I'd love feedback on the API surface, defaults, and scope before we merge.

  • this.retry(fn, options?) -- retry any async operation with exponential backoff
  • queue(), schedule(), scheduleEvery() accept per-task { retry?: RetryOptions }
  • addMcpServer() accepts { retry?: RetryOptions } for connection retries
  • Class-level defaults via static options = { retry: { ... } }
  • Internal retries for workflow operations with DO-aware error detection
  • Bonus cleanup: getQueue(), getQueues(), getSchedule(), dequeue(), dequeueAll(), dequeueAllByCallback() made synchronous (they were async but only did sync SQL work)
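The "DO-aware error detection" bullet above can be pictured with a small classifier. This is a sketch only: it assumes that errors crossing a Durable Object boundary may carry boolean `retryable` and `overloaded` properties, and that an overloaded object should not be retried. The name `isRetryableSketch` is hypothetical, and the PR's actual `isErrorRetryable` may use different rules.

```typescript
// Hypothetical classifier, not the SDK's isErrorRetryable.
// Assumption: platform errors may expose `retryable` and `overloaded` flags.
function isRetryableSketch(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const flags = error as { retryable?: unknown; overloaded?: unknown };
  // Retry only when the platform marks the error retryable and the
  // object is not overloaded (retrying would add to the load).
  return flags.retryable === true && flags.overloaded !== true;
}

const transient = Object.assign(new Error("connection reset"), { retryable: true });
const overloaded = Object.assign(new Error("overloaded"), { retryable: true, overloaded: true });

console.log(isRetryableSketch(transient)); // true
console.log(isRetryableSketch(overloaded)); // false
console.log(isRetryableSketch(new Error("plain failure"))); // false
```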

Motivation

Agents interact with external services and platform APIs that fail transiently. Without structured retries, every failure is either fatal or requires hand-rolled retry logic. The cloudflare/actors library has well-tested retry primitives -- this brings similar reliability to the Agents SDK.

There is already a // TODO: add retries comment in index.ts for queue/schedule execution. This PR addresses that.

Design decisions

Internal-first primitives

The core retry functions (tryN, jitterBackoff, isErrorRetryable) live in src/retries.ts and are not re-exported from the package entry point. They are implementation details. Only RetryOptions (the type) and this.retry() (the method) are public. This lets us iterate on internals without breaking changes.

Full jitter backoff

We use the "Full Jitter" strategy from the AWS Architecture Blog: delay = random(0, min(2^attempt * base, max)). It has the best p99 latency for high-contention scenarios and is the simplest to implement.
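As an illustration of the formula above, here is a minimal sketch of a Full Jitter delay function. The name `fullJitterDelay`, the 0-based `attempt` numbering, and the injectable `random` parameter are assumptions made for this example; the actual `jitterBackoff` in src/retries.ts may be shaped differently.

```typescript
// Hypothetical sketch of "Full Jitter" backoff; not the SDK's jitterBackoff.
function fullJitterDelay(
  attempt: number, // assumed 0-based: 0 for the first retry
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random // injectable so tests can be deterministic
): number {
  // Exponential ceiling, capped at maxDelayMs.
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  // Uniform draw from [0, ceiling).
  return random() * ceiling;
}

// With a 100ms base and a 3000ms cap, the ceilings grow as
// 100, 200, 400, 800, 1600 and then stay at 3000.
for (let attempt = 0; attempt < 6; attempt++) {
  const ceiling = Math.min(3000, 100 * 2 ** attempt);
  console.log(`attempt ${attempt}: delay drawn from [0, ${ceiling}) ms`);
}
```

Because every delay is drawn from the full range starting at zero, retries from many callers spread out instead of clustering at the same instants.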

Retry options stored in SQLite

Per-task retry options are persisted as JSON in a retry_options TEXT column alongside the task. This ensures they survive DO hibernation. Schema migration uses the existing add-column-if-not-exists pattern.
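A rough sketch of the JSON round trip through a nullable TEXT column follows. The helper names `serializeRetryOptions` and `parseRetryOptions` and the `StoredRetryOptions` type are hypothetical, chosen for this example rather than taken from the PR. The last lines also show that a function-valued field is silently dropped by `JSON.stringify`.

```typescript
// Hypothetical shape of what gets stored; field names follow the PR's defaults table.
interface StoredRetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Tasks without retry options store NULL rather than an empty object.
function serializeRetryOptions(options: StoredRetryOptions | undefined): string | null {
  return options === undefined ? null : JSON.stringify(options);
}

// A NULL column means "no per-task options were given".
function parseRetryOptions(column: string | null): StoredRetryOptions | undefined {
  return column === null ? undefined : (JSON.parse(column) as StoredRetryOptions);
}

const stored = serializeRetryOptions({ maxAttempts: 5, baseDelayMs: 250 });
console.log(stored); // {"maxAttempts":5,"baseDelayMs":250}
console.log(parseRetryOptions(stored)?.maxAttempts); // 5

// Functions do not survive JSON serialization.
const withPredicate = JSON.stringify({ maxAttempts: 2, shouldRetry: () => true });
console.log(withPredicate); // {"maxAttempts":2}
```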

shouldRetry only on this.retry()

The shouldRetry predicate is available on this.retry() but not on queue()/schedule() because functions can't be serialized to SQLite. For scheduled/queued tasks, handle non-retryable errors inside the callback itself.
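To make the role of the predicate concrete, here is a minimal retry loop with an optional `shouldRetry` check. It is a sketch under assumptions: `retrySketch`, `SketchRetryOptions`, the `(error, attempt)` predicate signature, and the injectable `sleep` parameter are all hypothetical and are not the signatures of `this.retry()` or `tryN`.

```typescript
// Hypothetical options shape; the predicate signature is an assumption.
interface SketchRetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  shouldRetry?: (error: unknown, attempt: number) => boolean;
}

async function retrySketch<T>(
  fn: (attempt: number) => Promise<T>,
  options: SketchRetryOptions = {},
  // Injectable sleep so tests can skip real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  const { maxAttempts = 3, baseDelayMs = 100, maxDelayMs = 3000, shouldRetry } = options;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      const exhausted = attempt >= maxAttempts;
      const rejected = shouldRetry !== undefined && !shouldRetry(error, attempt);
      // Give up when the budget is spent or the predicate says "do not retry".
      if (exhausted || rejected) throw error;
      // Full jitter: uniform delay below an exponentially growing, capped ceiling.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * ceiling);
    }
  }
}

// Usage: retry transient failures, but stop at once on errors tagged permanent.
const isTransient = (error: unknown): boolean =>
  !(typeof error === "object" && error !== null && "permanent" in error);

let calls = 0;
retrySketch(
  async () => {
    calls++;
    if (calls < 3) throw new Error("transient failure");
    return "ok";
  },
  { shouldRetry: isTransient }
).then((value) => console.log(value, "after", calls, "calls"));
```

For queued or scheduled tasks, the same filtering would have to live inside the callback, for example by catching the permanent error there and returning normally.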

Eager validation

validateRetryOptions() runs at enqueue/schedule time. Invalid values (maxAttempts: 0, baseDelayMs: -1, NaN) throw immediately instead of failing hours later when the task executes. The validation does not check against default values for fields not provided -- this is an acceptable tradeoff documented in the design doc.
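A sketch of what eager validation could look like is shown below. The function name `validateRetryOptionsSketch`, the exact bounds, and the error messages are assumptions for illustration; only the rejected examples (maxAttempts: 0, baseDelayMs: -1, NaN) come from the description above. The cross-field check runs only when both delays are provided, which mirrors the tradeoff mentioned above.

```typescript
// Hypothetical options shape and validator; not the SDK's validateRetryOptions.
interface RetryOptionsSketch {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

function validateRetryOptionsSketch(options: RetryOptionsSketch): void {
  const { maxAttempts, baseDelayMs, maxDelayMs } = options;
  // Number.isInteger and Number.isFinite both reject NaN.
  if (maxAttempts !== undefined && (!Number.isInteger(maxAttempts) || maxAttempts < 1)) {
    throw new Error("retry.maxAttempts must be an integer >= 1");
  }
  if (baseDelayMs !== undefined && (!Number.isFinite(baseDelayMs) || baseDelayMs <= 0)) {
    throw new Error("retry.baseDelayMs must be a positive number");
  }
  if (maxDelayMs !== undefined && (!Number.isFinite(maxDelayMs) || maxDelayMs <= 0)) {
    throw new Error("retry.maxDelayMs must be a positive number");
  }
  // Cross-field check only when both values were given explicitly; a lone
  // baseDelayMs is not compared against the default maxDelayMs.
  if (baseDelayMs !== undefined && maxDelayMs !== undefined && baseDelayMs > maxDelayMs) {
    throw new Error("retry.baseDelayMs must not exceed retry.maxDelayMs");
  }
}

validateRetryOptionsSketch({ maxAttempts: 5, baseDelayMs: 250 }); // passes

try {
  validateRetryOptionsSketch({ maxAttempts: 0 });
} catch (error) {
  console.log((error as Error).message); // retry.maxAttempts must be an integer >= 1
}
```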

Sync getters

While working on this, we noticed getQueue(), getQueues(), getSchedule(), dequeue(), dequeueAll(), and dequeueAllByCallback() were async despite doing only synchronous SQL work. getSchedules() was already sync. We made them all consistent. This is backward compatible for callers that await the result -- await on a non-Promise value simply resolves to that value. Code that chains .then() directly on the return value would need updating.
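The compatibility argument can be checked with a small stand-in. `getQueueSketch`, `QueueItemSketch`, and the in-memory `rows` array are hypothetical and only imitate a getter that now returns its value synchronously.

```typescript
interface QueueItemSketch {
  id: string;
  callback: string;
}

// Stand-in for the synchronous SQL read.
const rows: QueueItemSketch[] = [{ id: "a1", callback: "sendEmail" }];

// After the change: a plain synchronous return instead of a Promise.
function getQueueSketch(id: string): QueueItemSketch | undefined {
  return rows.find((row) => row.id === id);
}

// Existing call sites that await the getter keep working unchanged.
async function existingCaller(): Promise<string | undefined> {
  const item = await getQueueSketch("a1");
  return item?.callback;
}

// New call sites can drop the await.
const direct = getQueueSketch("a1");
console.log(direct?.callback); // sendEmail

existingCaller().then((callback) => console.log(callback)); // sendEmail

// Chaining on the return value is the one pattern that no longer works:
// getQueueSketch("a1").then(...) fails because the result is not a Promise.
```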

What's included

| Area | Files | What changed |
| --- | --- | --- |
| Core primitives | src/retries.ts | RetryOptions, tryN, jitterBackoff, isErrorRetryable, validateRetryOptions |
| Agent class | src/index.ts | this.retry(), retry options on queue()/schedule(), class-level defaults, sync getters |
| MCP client | src/mcp/client.ts | Retry options on addMcpServer(), persisted in server_options |
| Unit tests | src/tests/retries.test.ts | 25 tests for primitives and validation |
| Integration tests | src/tests/retry-integration.test.ts | 14 tests exercising retry through the Agent runtime |
| Test agents | src/tests/agents/retry.ts | TestRetryAgent, TestRetryDefaultsAgent |
| Design doc | design/retries.md | Architecture, decisions, tradeoffs |
| User docs | docs/retries.md | Quick start, API reference, patterns, limitations |
| Doc updates | docs/scheduling.md, docs/queue.md, docs/mcp-client.md | Updated types and signatures |
| Playground demo | examples/playground/src/demos/core/ | Interactive retry demo with 3 scenarios |
| Changesets | .changeset/retry-utilities.md, .changeset/sync-getters.md | Minor (retries) + patch (sync getters) |

Defaults

| Setting | Default | Rationale |
| --- | --- | --- |
| maxAttempts | 3 | Enough for transient blips, not so many that a broken service blocks the agent |
| baseDelayMs | 100ms | Fast first retry for quick recovery |
| maxDelayMs | 3000ms | Cap at 3s to avoid blocking the DO event loop too long |

These apply to this.retry(), queue(), and schedule(). Workflow operations use 200ms base / 3s max. MCP connections use 500ms base / 5s max.
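One way to picture how these defaults combine with class-level and per-task options is a three-layer merge. This sketch assumes per-task options take precedence over class-level defaults, which in turn take precedence over the built-in values; `resolveRetrySettings`, `RetrySettings`, and `BUILT_IN_DEFAULTS` are hypothetical names, and the real merge in src/index.ts may differ.

```typescript
interface RetrySettings {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Built-in values from the defaults table.
const BUILT_IN_DEFAULTS: RetrySettings = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 3000
};

// Assumed precedence: per-task > class-level > built-in.
// `??` is used so that an explicit `undefined` does not erase a lower layer.
function resolveRetrySettings(
  classDefaults: Partial<RetrySettings> = {},
  perTask: Partial<RetrySettings> = {}
): RetrySettings {
  return {
    maxAttempts:
      perTask.maxAttempts ?? classDefaults.maxAttempts ?? BUILT_IN_DEFAULTS.maxAttempts,
    baseDelayMs:
      perTask.baseDelayMs ?? classDefaults.baseDelayMs ?? BUILT_IN_DEFAULTS.baseDelayMs,
    maxDelayMs:
      perTask.maxDelayMs ?? classDefaults.maxDelayMs ?? BUILT_IN_DEFAULTS.maxDelayMs
  };
}

// A class that raises maxAttempts, and a task that slows its first retry.
const resolved = resolveRetrySettings({ maxAttempts: 5 }, { baseDelayMs: 250 });
console.log(resolved); // { maxAttempts: 5, baseDelayMs: 250, maxDelayMs: 3000 }
```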

What this does NOT do

  • Dead-letter queue -- failed tasks are dequeued after exhausting retries. We log and route through onError() but don't persist failures. Worth discussing.
  • Circuit breaker -- no failure rate tracking across calls. Each task exhausts its retry budget independently. Could layer on later.
  • onRetryExhausted hook -- exhausted retries surface through the existing onError() path. A dedicated hook was considered but deferred to keep the API surface small.
  • Custom error types -- retry failures use standard Error. Custom error types for better classification are a follow-up.

Notes for reviewers

  1. Start with design/retries.md -- it explains the architecture, decisions, and tradeoffs in detail. The code will make more sense after reading it.

  2. src/retries.ts is the core -- ~140 lines. Everything else composes on top of tryN. If the primitives look right, the rest follows.

  3. Check the _flushQueue() and alarm handler changes in src/index.ts -- these are the most impactful integration points. They read retry_options from the DB row, merge with class-level defaults, and pass to tryN. Payload parsing was hoisted outside the retry loop to avoid repeated deserialization.

  4. The playground demo is functional -- run cd examples/playground && npm run dev and navigate to Core > Retry. Three interactive scenarios: flaky operation, shouldRetry filter, and queue with retry.

  5. Backward compatibility -- all new parameters are optional. The retry_options TEXT column is added via migration. The sync getter change is safe for callers that await the result (await on a non-Promise value resolves to that value).

  6. Open questions I'd like feedback on:

    • Are the defaults right? 3 attempts / 100ms base feels conservative. Should we go higher?
    • Should shouldRetry be available on queue()/schedule() via a string-based callback name pattern instead of a function?
    • Should we add a dead-letter mechanism now, or is logging + onError() sufficient for v1?

Introduce a retry system across the Agents SDK: add core primitives (jitterBackoff, tryN, isErrorRetryable) and expose this.retry() plus a RetryOptions type. Persist per-task retry options for queue() and schedule()/scheduleEvery() (new retry_options DB columns) and allow MCP server connection retry config. Validate retry options eagerly and provide class-level defaults via static options; add internal Durable Object-aware retries for workflow operations and MCP reconnection logic. Include extensive docs and examples (playground UI + demo agent), and unit/integration tests for retries.
@changeset-bot

changeset-bot bot commented Feb 9, 2026

🦋 Changeset detected

Latest commit: 2f866b3

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Minor

@pkg-pr-new

pkg-pr-new bot commented Feb 9, 2026

npm i https://pkg.pr.new/cloudflare/agents@874

commit: 2f866b3
