RFC: Retry utilities for the Agents SDK #874

threepointone · 2026-02-09T12:39:56Z

Summary

This PR adds structured retry support to the Agents SDK. The goal is to replace ad-hoc retry logic with a consistent, configurable system that works across schedules, queues, MCP connections, and user code.

This is a proposal -- I'd love feedback on the API surface, defaults, and scope before we merge.

- this.retry(fn, options?) -- retry any async operation with exponential backoff
- queue(), schedule(), scheduleEvery() accept per-task { retry?: RetryOptions }
- addMcpServer() accepts { retry?: RetryOptions } for connection retries
- Class-level defaults via static options = { retry: { ... } }
- Internal retries for workflow operations with DO-aware error detection
- Bonus cleanup: getQueue(), getQueues(), getSchedule(), dequeue(), dequeueAll(), dequeueAllByCallback() made synchronous (they were async but only did sync SQL work)

Motivation

Agents interact with external services and platform APIs that fail transiently. Without structured retries, every failure is either fatal or requires hand-rolled retry logic. The cloudflare/actors library has well-tested retry primitives -- this brings similar reliability to the Agents SDK.

There is already a // TODO: add retries comment in index.ts for queue/schedule execution. This PR addresses that.

Design decisions

Internal-first primitives

The core retry functions (tryN, jitterBackoff, isErrorRetryable) live in src/retries.ts and are not re-exported from the package entry point. They are implementation details. Only RetryOptions (the type) and this.retry() (the method) are public. This lets us iterate on internals without breaking changes.

Full jitter backoff

We use the "Full Jitter" strategy from the AWS Architecture Blog: delay = random(0, min(2^attempt * base, max)). It has the best p99 latency for high-contention scenarios and is the simplest to implement.
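The formula above can be sketched as a small function. This is an illustration only, not the code in src/retries.ts: the name fullJitterDelay, the 0-based attempt counter, and the injectable random source are assumptions made for the example.

```typescript
// Full Jitter: pick a uniformly random delay between 0 and the capped
// exponential ceiling. `attempt` is assumed to be 0-based in this sketch.
function fullJitterDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random // injectable so tests can be deterministic
): number {
  const ceiling = Math.min(2 ** attempt * baseDelayMs, maxDelayMs);
  return random() * ceiling;
}

// With base 100ms and max 3000ms the ceilings are 100, 200, 400, 800, 1600,
// and then stay capped at 3000.
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(attempt, Math.round(fullJitterDelay(attempt, 100, 3000)));
}
```

Because the delay is drawn from the whole range [0, ceiling), concurrent retriers spread out instead of waking up in lockstep.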

Retry options stored in SQLite

Per-task retry options are persisted as JSON in a retry_options TEXT column alongside the task. This ensures they survive DO hibernation. Schema migration uses the existing ADD COLUMN IF NOT EXISTS pattern.
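A rough sketch of the round trip through the TEXT column, assuming a NULL column means "no per-task options". The helper names serializeRetryOptions and parseRetryOptions are hypothetical and are not taken from the PR.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Serialize for storage in a TEXT column; null when no per-task options were given.
function serializeRetryOptions(retry?: RetryOptions): string | null {
  return retry === undefined ? null : JSON.stringify(retry);
}

// Parse the stored column value back; a NULL column is assumed to mean "use defaults".
function parseRetryOptions(column: string | null): RetryOptions | undefined {
  return column === null ? undefined : (JSON.parse(column) as RetryOptions);
}

const stored = serializeRetryOptions({ maxAttempts: 5, baseDelayMs: 250 });
console.log(stored); // {"maxAttempts":5,"baseDelayMs":250}
console.log(parseRetryOptions(stored));
```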

`shouldRetry` only on `this.retry()`

The shouldRetry predicate is available on this.retry() but not on queue()/schedule() because functions can't be serialized to SQLite. For scheduled/queued tasks, handle non-retryable errors inside the callback itself.
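To make the role of the predicate concrete, here is a minimal sketch of a retry loop that consults it. It is not the implementation behind this.retry(): the function name retryWithPredicate, the (error, attempt) predicate signature, the 1-based attempt counter, and the injectable wait function are all assumptions for illustration.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  // In-memory only: a function cannot be persisted as JSON.
  shouldRetry?: (error: unknown, attempt: number) => boolean;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithPredicate<T>(
  fn: (attempt: number) => Promise<T>,
  options: RetryOptions = {},
  wait: (ms: number) => Promise<void> = sleep
): Promise<T> {
  const { maxAttempts = 3, baseDelayMs = 100, maxDelayMs = 3000, shouldRetry } = options;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      const exhausted = attempt >= maxAttempts;
      const rejected = shouldRetry !== undefined && !shouldRetry(error, attempt);
      if (exhausted || rejected) throw error;
      // Full jitter delay before the next attempt.
      const ceiling = Math.min(2 ** (attempt - 1) * baseDelayMs, maxDelayMs);
      await wait(Math.random() * ceiling);
    }
  }
}

// Example predicate: retry only errors that carry a 5xx status.
const retryOn5xx = (error: unknown): boolean =>
  ((error as { status?: number } | undefined)?.status ?? 0) >= 500;

let calls = 0;
retryWithPredicate(
  async () => {
    calls++;
    if (calls < 3) throw Object.assign(new Error("upstream unavailable"), { status: 503 });
    return "ok";
  },
  { shouldRetry: retryOn5xx }
).then((result) => console.log(result, "after", calls, "calls")); // logs: ok after 3 calls
```

For queued or scheduled tasks, the equivalent is to catch the non-retryable error inside the callback and return normally, so the task is not retried.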

Eager validation

validateRetryOptions() runs at enqueue/schedule time. Invalid values (maxAttempts: 0, baseDelayMs: -1, NaN) throw immediately instead of failing hours later when the task executes. Validation only covers the fields the caller provides: omitted fields are not filled in from the defaults and re-checked in combination. This is an accepted tradeoff, documented in the design doc.
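A minimal sketch of what eager validation might look like. The exact rules are assumptions: the PR text only says that maxAttempts: 0, baseDelayMs: -1, and NaN are rejected, so the integer check and the "finite, non-negative" rule here are illustrative, as are the names validateRetryOptionsSketch and checkDelay.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Assumed rule: delays must be finite and non-negative when provided.
function checkDelay(name: string, value: number | undefined): void {
  if (value !== undefined && (!Number.isFinite(value) || value < 0)) {
    throw new RangeError(`${name} must be a finite, non-negative number, got ${value}`);
  }
}

// Only provided fields are checked; omitted fields are left for the defaults.
function validateRetryOptionsSketch(options: RetryOptions): void {
  const { maxAttempts } = options;
  if (maxAttempts !== undefined && (!Number.isInteger(maxAttempts) || maxAttempts < 1)) {
    throw new RangeError(`maxAttempts must be an integer >= 1, got ${maxAttempts}`);
  }
  checkDelay("baseDelayMs", options.baseDelayMs);
  checkDelay("maxDelayMs", options.maxDelayMs);
}

// Called at enqueue time, so a bad value fails here rather than when the task runs.
try {
  validateRetryOptionsSketch({ maxAttempts: 0 });
} catch (error) {
  console.log((error as Error).message); // maxAttempts must be an integer >= 1, got 0
}
```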

Sync getters

While working on this, we noticed that getQueue(), getQueues(), getSchedule(), dequeue(), dequeueAll(), and dequeueAllByCallback() were async despite doing only synchronous SQL work. getSchedules() was already sync, so we made the rest consistent with it. This is backward compatible for callers that await the result, since awaiting a non-Promise value simply yields that value. Code that calls .then() directly on the return value would need updating.
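The compatibility argument for awaiting call sites can be checked with a small stand-alone example. getQueueSketch is a hypothetical stand-in for a getter that changed from async to sync, not the SDK method itself.

```typescript
// Hypothetical stand-in for a getter that used to be async and is now sync.
function getQueueSketch(): string[] {
  return ["task-1", "task-2"];
}

async function countQueued(): Promise<number> {
  // An existing call site that awaits the result keeps working:
  // awaiting a non-Promise value resolves to the value itself.
  const items = await getQueueSketch();
  return items.length;
}

countQueued().then((count) => console.log(count)); // logs 2
```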

What's included

| Area | Files | What changed |
| --- | --- | --- |
| Core primitives | `src/retries.ts` | `RetryOptions`, `tryN`, `jitterBackoff`, `isErrorRetryable`, `validateRetryOptions` |
| Agent class | `src/index.ts` | `this.retry()`, retry options on `queue()`/`schedule()`, class-level defaults, sync getters |
| MCP client | `src/mcp/client.ts` | Retry options on `addMcpServer()`, persisted in `server_options` |
| Unit tests | `src/tests/retries.test.ts` | 25 tests for primitives and validation |
| Integration tests | `src/tests/retry-integration.test.ts` | 14 tests exercising retry through the Agent runtime |
| Test agents | `src/tests/agents/retry.ts` | `TestRetryAgent`, `TestRetryDefaultsAgent` |
| Design doc | `design/retries.md` | Architecture, decisions, tradeoffs |
| User docs | `docs/retries.md` | Quick start, API reference, patterns, limitations |
| Doc updates | `docs/scheduling.md`, `docs/queue.md`, `docs/mcp-client.md` | Updated types and signatures |
| Playground demo | `examples/playground/src/demos/core/` | Interactive retry demo with 3 scenarios |
| Changesets | `.changeset/retry-utilities.md`, `.changeset/sync-getters.md` | Minor (retries) + patch (sync getters) |

Defaults

| Setting | Default | Rationale |
| --- | --- | --- |
| `maxAttempts` | 3 | Enough for transient blips, not so many that a broken service blocks the agent |
| `baseDelayMs` | 100ms | Fast first retry for quick recovery |
| `maxDelayMs` | 3000ms | Cap at 3s to avoid blocking the DO event loop too long |

These apply to this.retry(), queue(), and schedule(). Workflow operations use 200ms base / 3s max. MCP connections use 500ms base / 5s max.
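One way to picture how the three layers combine is shown below. This is a sketch under an assumed precedence (per-task options over class-level defaults over the built-in values from the table); the name resolveRetryOptions and the per-field fallback are illustrative and not taken from the PR.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Built-in defaults from the table above.
const BUILT_IN_DEFAULTS: Required<RetryOptions> = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 3000,
};

// Assumed precedence: per-task > class-level > built-in, resolved field by field.
function resolveRetryOptions(
  perTask?: RetryOptions,
  classDefaults?: RetryOptions
): Required<RetryOptions> {
  return {
    maxAttempts: perTask?.maxAttempts ?? classDefaults?.maxAttempts ?? BUILT_IN_DEFAULTS.maxAttempts,
    baseDelayMs: perTask?.baseDelayMs ?? classDefaults?.baseDelayMs ?? BUILT_IN_DEFAULTS.baseDelayMs,
    maxDelayMs: perTask?.maxDelayMs ?? classDefaults?.maxDelayMs ?? BUILT_IN_DEFAULTS.maxDelayMs,
  };
}

// A task that overrides only maxAttempts on a class that overrides only baseDelayMs.
console.log(resolveRetryOptions({ maxAttempts: 5 }, { baseDelayMs: 200 }));
// { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 3000 }
```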

What this does NOT do

- Dead-letter queue -- failed tasks are dequeued after exhausting retries. We log and route through onError() but don't persist failures. Worth discussing.
- Circuit breaker -- no failure rate tracking across calls. Each task exhausts its retry budget independently. Could layer on later.
- onRetryExhausted hook -- exhausted retries surface through the existing onError() path. A dedicated hook was considered but deferred to keep the API surface small.
- Custom error types -- retry failures use standard Error. Custom error types for better classification are a follow-up.

Notes for reviewers

- Start with design/retries.md -- it explains the architecture, decisions, and tradeoffs in detail. The code will make more sense after reading it.
- src/retries.ts is the core -- ~140 lines. Everything else composes on top of tryN. If the primitives look right, the rest follows.
- Check the _flushQueue() and alarm handler changes in src/index.ts -- these are the most impactful integration points. They read retry_options from the DB row, merge with class-level defaults, and pass to tryN. Payload parsing was hoisted outside the retry loop to avoid repeated deserialization.
- The playground demo is functional -- run cd examples/playground && npm run dev and navigate to Core > Retry. Three interactive scenarios: flaky operation, shouldRetry filter, and queue with retry.
- Backward compatibility -- all new parameters are optional. The retry_options TEXT column is added via migration. The sync getter change is safe (await on a non-Promise resolves immediately).
- Open questions I'd like feedback on:
  - Are the defaults right? 3 attempts / 100ms base feels conservative. Should we go higher?
  - Should shouldRetry be available on queue()/schedule() via a string-based callback name pattern instead of a function?
  - Should we add a dead-letter mechanism now, or is logging + onError() sufficient for v1?

Introduce a retry system across the Agents SDK: add core primitives (jitterBackoff, tryN, isErrorRetryable) and expose this.retry() plus a RetryOptions type. Persist per-task retry options for queue() and schedule()/scheduleEvery() (new retry_options DB columns) and allow MCP server connection retry config. Validate retry options eagerly and provide class-level defaults via static options; internal Durable Object-aware retries were added for workflow operations and MCP reconnection logic. Includes extensive docs and examples (playground UI + demo agent), and unit/integration tests for retries.

changeset-bot · 2026-02-09T12:40:01Z

🦋 Changeset detected

Latest commit: 2f866b3

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
agents	Minor


pkg-pr-new · 2026-02-09T12:42:08Z

npm i https://pkg.pr.new/cloudflare/agents@874

commit: 2f866b3
