RFC: Retry utilities for the Agents SDK #874
Open
Summary
This PR adds structured retry support to the Agents SDK. The goal is to replace ad-hoc retry logic with a consistent, configurable system that works across schedules, queues, MCP connections, and user code.
This is a proposal -- I'd love feedback on the API surface, defaults, and scope before we merge.
- `this.retry(fn, options?)` -- retry any async operation with exponential backoff
- `queue()`, `schedule()`, `scheduleEvery()` accept per-task `{ retry?: RetryOptions }`
- `addMcpServer()` accepts `{ retry?: RetryOptions }` for connection retries
- `static options = { retry: { ... } }` for class-level defaults
- `getQueue()`, `getQueues()`, `getSchedule()`, `dequeue()`, `dequeueAll()`, `dequeueAllByCallback()` made synchronous (they were `async` but only did sync SQL work)

Motivation
Agents interact with external services and platform APIs that fail transiently. Without structured retries, every failure is either fatal or requires hand-rolled retry logic. The `cloudflare/actors` library has well-tested retry primitives -- this brings similar reliability to the Agents SDK.

There is already a `// TODO: add retries` comment in `index.ts` for queue/schedule execution. This PR addresses that.

Design decisions
Internal-first primitives
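As a rough picture of the kind of primitive this section refers to, here is a minimal sketch of a `tryN`-style retry loop. The signature, the `RetryOptions` field types, the `shouldRetry` arguments, and the fallback values are all assumptions made for illustration; the actual code in `src/retries.ts` may differ.

```typescript
// Hypothetical shapes: the PR names RetryOptions and tryN but does not show them.
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  shouldRetry?: (error: unknown, attempt: number) => boolean;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn up to maxAttempts times, sleeping between failed attempts.
async function tryN<T>(
  fn: (attempt: number) => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  // Placeholder fallbacks for this sketch only; not the SDK defaults.
  const { maxAttempts = 3, baseDelayMs = 10, maxDelayMs = 100, shouldRetry } = options;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      const exhausted = attempt === maxAttempts;
      const vetoed = shouldRetry !== undefined && !shouldRetry(error, attempt);
      if (exhausted || vetoed) break;
      // Full jitter: random delay in [0, min(2^attempt * base, max)).
      const ceiling = Math.min(2 ** attempt * baseDelayMs, maxDelayMs);
      await sleep(Math.random() * ceiling);
    }
  }
  throw lastError;
}

// Usage: an operation that fails twice, then succeeds on the third call.
let calls = 0;
tryN(
  async () => {
    calls++;
    if (calls < 3) throw new Error("transient");
    return "ok";
  },
  { maxAttempts: 5, baseDelayMs: 1, maxDelayMs: 5 }
).then((value) => console.log(value, calls)); // prints "ok 3"
```

Keeping a helper like this unexported means its signature can change freely as long as `this.retry()` and the `RetryOptions` type stay stable.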
The core retry functions (`tryN`, `jitterBackoff`, `isErrorRetryable`) live in `src/retries.ts` and are not re-exported from the package entry point. They are implementation details. Only `RetryOptions` (the type) and `this.retry()` (the method) are public. This lets us iterate on internals without breaking changes.

Full jitter backoff
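For illustration, the delay formula this section describes, `random(0, min(2^attempt * base, max))`, might be computed as in the sketch below. The helper `backoffCeiling`, the parameter order, the injectable `random` argument, and the 1-based attempt numbering are assumptions, not the actual contents of `src/retries.ts`.

```typescript
// Upper bound of the delay window for a given attempt (assumed 1-based).
function backoffCeiling(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(2 ** attempt * baseDelayMs, maxDelayMs);
}

// Full jitter: pick a uniformly random delay in [0, ceiling).
// `random` is injectable so tests can make the result deterministic.
function jitterBackoff(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random
): number {
  return random() * backoffCeiling(attempt, baseDelayMs, maxDelayMs);
}

// With base = 100ms and max = 3000ms the ceilings for attempts 1..6 are
// 200, 400, 800, 1600, 3000, 3000 (the cap applies from attempt 5 onward).
for (let attempt = 1; attempt <= 6; attempt++) {
  const ceiling = backoffCeiling(attempt, 100, 3000);
  const sample = jitterBackoff(attempt, 100, 3000);
  console.log(`attempt ${attempt}: ceiling ${ceiling}ms, sampled ${sample.toFixed(1)}ms`);
}
```

Because each delay is drawn from the whole window rather than clustered near the ceiling, clients that failed at the same moment are unlikely to retry at the same moment.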
We use the "Full Jitter" strategy from the AWS Architecture Blog: `delay = random(0, min(2^attempt * base, max))`. It has the best p99 latency for high-contention scenarios and is the simplest to implement.

Retry options stored in SQLite
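A minimal sketch of the JSON round trip for a `retry_options` column, assuming plain `JSON.stringify`/`JSON.parse` and `NULL` for tasks with no per-task options. The helper names and the `shouldRetry` field type are hypothetical. The sketch also shows why a function-valued option cannot be persisted this way: `JSON.stringify` silently omits it.

```typescript
// Hypothetical shape; field names follow the PR description.
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  shouldRetry?: (error: unknown) => boolean;
}

// Value to store in the TEXT column; null stands for "no per-task options".
function serializeRetryOptions(options?: RetryOptions): string | null {
  return options === undefined ? null : JSON.stringify(options);
}

// Value read back from the TEXT column when the task runs.
function parseRetryOptions(column: string | null): RetryOptions | undefined {
  return column === null ? undefined : (JSON.parse(column) as RetryOptions);
}

const stored = serializeRetryOptions({
  maxAttempts: 5,
  baseDelayMs: 200,
  shouldRetry: () => false, // functions are dropped by JSON.stringify
});

console.log(stored); // {"maxAttempts":5,"baseDelayMs":200}
console.log(parseRetryOptions(stored)?.shouldRetry); // undefined
```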
Per-task retry options are persisted as JSON in a `retry_options TEXT` column alongside the task. This ensures they survive Durable Object (DO) hibernation. Schema migration uses the existing `ADD COLUMN IF NOT EXISTS` pattern.

`shouldRetry` only on `this.retry()`

The `shouldRetry` predicate is available on `this.retry()` but not on `queue()`/`schedule()` because functions can't be serialized to SQLite. For scheduled/queued tasks, handle non-retryable errors inside the callback itself.

Eager validation
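A sketch of what eager validation along these lines could look like. The exact rules are assumptions: this version requires `maxAttempts` to be an integer of at least 1, rejects negative or non-finite delays, and compares `baseDelayMs` with `maxDelayMs` only when both are supplied, which is one reading of "does not check against default values for fields not provided".

```typescript
// Hypothetical shape; field names follow the PR description.
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Rejects negative, NaN, and infinite delays; undefined means "not provided".
function checkDelay(name: string, value: number | undefined): void {
  if (value !== undefined && (!Number.isFinite(value) || value < 0)) {
    throw new RangeError(`${name} must be a finite, non-negative number`);
  }
}

// Throws at enqueue/schedule time rather than when the task eventually runs.
function validateRetryOptions(options: RetryOptions): void {
  const { maxAttempts, baseDelayMs, maxDelayMs } = options;
  if (maxAttempts !== undefined && (!Number.isInteger(maxAttempts) || maxAttempts < 1)) {
    throw new RangeError("maxAttempts must be an integer >= 1");
  }
  checkDelay("baseDelayMs", baseDelayMs);
  checkDelay("maxDelayMs", maxDelayMs);
  // Cross-field check runs only when the caller provided both values;
  // a lone baseDelayMs is not compared against the default maxDelayMs.
  if (baseDelayMs !== undefined && maxDelayMs !== undefined && baseDelayMs > maxDelayMs) {
    throw new RangeError("baseDelayMs must not exceed maxDelayMs");
  }
}

// Each of the invalid examples from the text is rejected immediately.
for (const bad of [{ maxAttempts: 0 }, { baseDelayMs: -1 }, { maxDelayMs: NaN }]) {
  try {
    validateRetryOptions(bad);
  } catch (error) {
    console.log((error as Error).message);
  }
}
```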
`validateRetryOptions()` runs at enqueue/schedule time. Invalid values (`maxAttempts: 0`, `baseDelayMs: -1`, `NaN`) throw immediately instead of failing hours later when the task executes. The validation does not check against default values for fields not provided -- this is an acceptable tradeoff documented in the design doc.

Sync getters
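A small sketch of why existing `await` call sites keep working after a getter becomes synchronous. `getQueueSync` is a hypothetical stand-in with a made-up return shape, not the SDK method.

```typescript
// Hypothetical stand-in for a getter that used to be async and is now sync.
function getQueueSync(id: string): { id: string } | undefined {
  return id === "known" ? { id } : undefined;
}

async function existingCaller(): Promise<void> {
  // Code written against the old async signature: await on a plain value
  // just yields that value.
  const viaAwait = await getQueueSync("known");
  // New code can call the getter directly.
  const direct = getQueueSync("known");
  console.log(viaAwait?.id === direct?.id); // true
}

existingCaller();
```

One caveat worth keeping in mind: a caller that chains `.then()` directly on the return value, rather than using `await`, would break, because the returned value is no longer a Promise.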
While working on this, we noticed `getQueue()`, `getQueues()`, `getSchedule()`, `dequeue()`, `dequeueAll()`, and `dequeueAllByCallback()` were `async` despite doing only synchronous SQL work. `getSchedules()` was already sync. We made them all consistent. This is backward compatible for callers that `await` the result -- `await` on a non-Promise value simply yields that value.

What's included
- `src/retries.ts`: `RetryOptions`, `tryN`, `jitterBackoff`, `isErrorRetryable`, `validateRetryOptions`
- `src/index.ts`: `this.retry()`, retry options on `queue()`/`schedule()`, class-level defaults, sync getters
- `src/mcp/client.ts`: retry options for `addMcpServer()`, persisted in `server_options`
- `src/tests/retries.test.ts`
- `src/tests/retry-integration.test.ts`
- `src/tests/agents/retry.ts`: `TestRetryAgent`, `TestRetryDefaultsAgent`
- `design/retries.md`
- `docs/retries.md`
- `docs/scheduling.md`, `docs/queue.md`, `docs/mcp-client.md`
- `examples/playground/src/demos/core/`
- `.changeset/retry-utilities.md`, `.changeset/sync-getters.md`

Defaults
- `maxAttempts`
- `baseDelayMs`
- `maxDelayMs`

These apply to `this.retry()`, `queue()`, and `schedule()`. Workflow operations use 200ms base / 3s max. MCP connections use 500ms base / 5s max.

What this does NOT do
- [...] `onError()` but don't persist failures. Worth discussing.
- `onRetryExhausted` hook -- exhausted retries surface through the existing `onError()` path. A dedicated hook was considered but deferred to keep the API surface small.
- [...] `Error`. Custom error types for better classification are a follow-up.

Notes for reviewers
1. Start with `design/retries.md` -- it explains the architecture, decisions, and tradeoffs in detail. The code will make more sense after reading it.
2. `src/retries.ts` is the core -- ~140 lines. Everything else composes on top of `tryN`. If the primitives look right, the rest follows.
3. Check the `_flushQueue()` and alarm handler changes in `src/index.ts` -- these are the most impactful integration points. They read `retry_options` from the DB row, merge with class-level defaults, and pass the result to `tryN`. Payload parsing was hoisted outside the retry loop to avoid repeated deserialization.
4. The playground demo is functional -- run `cd examples/playground && npm run dev` and navigate to Core > Retry. There are three interactive scenarios: flaky operation, `shouldRetry` filter, and queue with retry.
5. Backward compatibility -- all new parameters are optional. The `retry_options TEXT` column is added via migration. The sync getter change is safe for callers that `await` the result (`await` on a non-Promise value simply yields that value).

Open questions I'd like feedback on:
- Should `shouldRetry` be available on `queue()`/`schedule()` via a string-based callback name pattern instead of a function?
- Is `onError()` sufficient for v1?