Combine workflow event replay and step bundles to do work inline where possible, deferring to the queue only for parallelism

Eager Processing of Steps & Incremental Event Replay

Date: March 2026

This is a major internal architecture change to how Workflow DevKit executes workflows and steps on the Vercel platform. It reduces function invocations and queue overhead by executing steps inline within the same function invocation as the workflow replay, rather than dispatching every step to a separate function via the queue.

Previous Architecture

The previous architecture used two separate routes, each backed by its own queue trigger:

Queue: __wkf_workflow_*  -->  /.well-known/workflow/v1/flow   (workflow replay in VM)
                                |
                          suspension (step needed)
                                |
                          queue step to __wkf_step_*
                                |
Queue: __wkf_step_*      -->  /.well-known/workflow/v1/step   (step execution in Node.js)
                                |
                          step completes
                                |
                          queue continuation to __wkf_workflow_*
                                v
                          (cycle repeats for each step)

Each step required 2 queue messages (step invoke + workflow continuation) and 2 function invocations, plus cold start overhead for each. A serial workflow with 10 steps needed ~21 function invocations.
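
The ~21 figure follows from counting one initial replay plus two invocations per step. A minimal sketch of that arithmetic (the helper name is illustrative and not part of the codebase):

```typescript
// Hypothetical helper: function invocations for a serial workflow under the
// previous two-route architecture.
function previousInvocations(serialSteps: number): number {
  const initialReplay = 1; // first flow-route run, which suspends at step 1
  const perStep = 2;       // one step-route invocation + one flow-route continuation
  return initialReplay + perStep * serialSteps;
}

const tenStepCost = previousInvocations(10); // 1 + 2 * 10 = 21
```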

New Architecture

The two routes are merged into a single handler at /.well-known/workflow/v1/flow using workflowEntrypoint(). The step route is no longer generated.

The handler runs an inline execution loop:

receive queue message
  |
  +-- if message has stepId: execute that step, queue workflow continuation, exit
  |
  v
replay workflow in VM
  |
  +-- workflow completed --> create run_completed event, exit
  +-- workflow failed   --> create run_failed event, exit
  |
  v
suspension with pending operations
  |
  +-- process hooks and waits (unchanged)
  |
  +-- 0 pending steps  --> return (waits/hooks only)
  +-- 1 pending step   --> execute inline, loop back to replay
  +-- N pending steps  --> queue N-1 to self (with stepId),
  |                        execute 1 inline, loop back to replay
  |
  +-- timeout check: if wall-clock time >= threshold,
  |   re-schedule self via queue and exit
  |
  v
(loop continues until completion, timeout, or non-step suspension)

A serial workflow with 10 steps now completes in 1 function invocation, provided it finishes before the timeout threshold (see Timeout Handling).
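
The control flow in the diagram can be sketched as a small loop. This is a simplified, synchronous model under stated assumptions: the LoopDeps interface and all names in it are hypothetical, hook and wait processing is omitted, and the real handler is asynchronous.

```typescript
type ReplayResult =
  | { kind: "completed" }
  | { kind: "failed" }
  | { kind: "suspended"; pendingSteps: string[] };

// Hypothetical dependencies injected into the loop.
interface LoopDeps {
  replay(): ReplayResult;                      // replay the workflow in the VM
  executeStep(stepId: string): void;           // run one step inline
  enqueue(message: { stepId?: string }): void; // queue a message to self
  now(): number;                               // wall-clock time in ms
}

type LoopOutcome = "completed" | "failed" | "waiting" | "rescheduled";

function runInlineLoop(deps: LoopDeps, thresholdMs: number): LoopOutcome {
  const startedAt = deps.now();
  for (;;) {
    // Timeout check before each replay iteration.
    if (deps.now() - startedAt >= thresholdMs) {
      deps.enqueue({}); // plain continuation, no stepId
      return "rescheduled";
    }
    const result = deps.replay();
    if (result.kind !== "suspended") return result.kind;
    const inlineStep = result.pendingSteps[0];
    if (inlineStep === undefined) return "waiting"; // hooks/waits only
    // Queue N-1 steps to the background, execute one inline.
    for (const stepId of result.pendingSteps.slice(1)) deps.enqueue({ stepId });
    deps.executeStep(inlineStep);
  }
}

// Usage: a serial 10-step workflow finishes without touching the queue.
const remaining = Array.from({ length: 10 }, (_, i) => `step-${i + 1}`);
const queued: Array<{ stepId?: string }> = [];
let replays = 0;
const outcome = runInlineLoop(
  {
    replay: (): ReplayResult => {
      replays += 1;
      return remaining.length === 0
        ? { kind: "completed" }
        : { kind: "suspended", pendingSteps: remaining.slice(0, 1) };
    },
    executeStep: () => { remaining.shift(); },
    enqueue: (message) => { queued.push(message); },
    now: () => 0, // frozen clock, so the timeout never fires
  },
  110_000,
);
```

In this model the single invocation performs 11 replays (one per step plus the final completing replay) and enqueues nothing.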

Inline Step Execution

After the workflow suspends with pending steps, the handler executes one step inline:

Create step_started event
Hydrate step input from the event log
Look up the step function via getStepFunction(stepName)
Execute the step function
Create step_completed or step_failed event
Loop back to workflow replay
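
A minimal sketch of the sequence above, using in-memory stand-ins. The event shape, the assumption that step input is stored on the step_created event, and the synchronous step function are all simplifications for illustration; the real executeStep() may differ.

```typescript
type StepEvent = { type: string; stepId: string; payload?: unknown };
type StepFn = (input: unknown) => unknown;

// Hypothetical in-memory stand-ins for the step registry and the event log.
const registry = new Map<string, StepFn>();
const eventLog: StepEvent[] = [];

function getStepFunction(stepName: string): StepFn | undefined {
  return registry.get(stepName);
}

function executeStepSketch(stepId: string, stepName: string): void {
  eventLog.push({ type: "step_started", stepId });
  // Hydrate the input (assumed here to live on the step_created event).
  const created = eventLog.find(
    (e) => e.type === "step_created" && e.stepId === stepId,
  );
  const fn = getStepFunction(stepName);
  try {
    if (!fn) throw new Error(`Step not found: ${stepName}`);
    const output = fn(created?.payload);
    eventLog.push({ type: "step_completed", stepId, payload: output });
  } catch (err) {
    eventLog.push({ type: "step_failed", stepId, payload: String(err) });
  }
  // The caller then loops back to workflow replay.
}

// Usage
registry.set("double", (input) => (input as number) * 2);
eventLog.push({ type: "step_created", stepId: "step-1", payload: 21 });
executeStepSketch("step-1", "double");
const recordedTypes = eventLog.map((e) => e.type);
```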

This logic lives in executeStep() in packages/core/src/runtime/step-executor.ts.

Background Steps (Parallel Execution)

When a workflow suspends with multiple pending steps (e.g., from Promise.all), the handler:

Creates step_created events for all pending steps
Queues N-1 steps back to __wkf_workflow_* with a stepId in the message payload
Executes 1 step inline
Loops back to replay

Each background step message is handled by a separate function invocation of the same handler. When a message arrives with stepId, the handler executes that specific step, queues a plain workflow continuation (without stepId), and exits. It does not replay the workflow inline — the step events (step_started/step_completed) need to be consumed by the workflow's event subscriptions during a fresh replay.
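
The two message shapes can be sketched as a single entry point. The runId field and the HandlerDeps interface are assumptions made for illustration; only stepId is named in this document.

```typescript
// runId is a hypothetical field; stepId is the optional field described above.
interface InvokeMessage { runId: string; stepId?: string }

// Hypothetical dependencies, injected so the sketch stays self-contained.
interface HandlerDeps {
  executeStep(runId: string, stepId: string): void;
  enqueue(message: InvokeMessage): void;
  replayLoop(runId: string): void;
}

function handleMessage(message: InvokeMessage, deps: HandlerDeps): "step" | "replay" {
  if (message.stepId !== undefined) {
    // Background step: run it, queue a plain continuation, and exit
    // without replaying the workflow in this invocation.
    deps.executeStep(message.runId, message.stepId);
    deps.enqueue({ runId: message.runId });
    return "step";
  }
  deps.replayLoop(message.runId);
  return "replay";
}

// Usage: record what each message kind triggers.
const calls: string[] = [];
const deps: HandlerDeps = {
  executeStep: (_runId, stepId) => { calls.push(`execute:${stepId}`); },
  enqueue: (m) => { calls.push(m.stepId === undefined ? "enqueue:continuation" : `enqueue:${m.stepId}`); },
  replayLoop: () => { calls.push("replay"); },
};
const first = handleMessage({ runId: "run-1", stepId: "step-2" }, deps);
const second = handleMessage({ runId: "run-1" }, deps);
```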

Convergence After Parallel Steps

After N parallel steps complete, up to N concurrent replay attempts may occur (each background step queues a continuation). The event-sourced architecture ensures safe convergence:

step_created idempotency --- duplicate creates return 409
step_completed idempotency --- only the first invocation to complete a step records the result (409 for duplicates)
Queue idempotency keys --- background step messages use correlationId as idempotency key
Deterministic replay --- all invocations produce the same result given the same event log

Multiple concurrent invocations may all successfully call step_started and begin executing the same step — step_started does not reject already-running steps (it succeeds and increments the attempt counter). However, only one step_completed event is recorded; the rest get 409 and exit.
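
A toy in-memory model of these two behaviors may help. The StepStore class is purely illustrative and not part of workflow-server: step_started always succeeds and bumps the attempt counter, while step_completed records only the first result.

```typescript
// Illustrative in-memory model; the real store is server-side.
class StepStore {
  private completed = new Map<string, unknown>();
  private attempts = new Map<string, number>();

  // step_started: always succeeds and increments the attempt counter.
  start(stepId: string): number {
    const next = (this.attempts.get(stepId) ?? 0) + 1;
    this.attempts.set(stepId, next);
    return next;
  }

  // step_completed: the first caller wins, duplicates get 409.
  complete(stepId: string, result: unknown): 200 | 409 {
    if (this.completed.has(stepId)) return 409;
    this.completed.set(stepId, result);
    return 200;
  }

  result(stepId: string): unknown {
    return this.completed.get(stepId);
  }
}

// Usage: three invocations race on the same step.
const store = new StepStore();
const racers = ["A", "B", "C"];
const attemptNumbers = racers.map(() => store.start("step-1"));
const statuses = racers.map((who) => store.complete("step-1", `from-${who}`));
```

Here all three starts succeed (attempts 1, 2, 3), but only the first completion is recorded.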

Known limitation: The redundant step_started calls inflate the attempt counter, which could cause premature "exceeded max retries" failures if many concurrent invocations race on the same step. In practice this is bounded by the number of parallel steps that just completed (e.g., a Promise.all with 10 steps produces at most 10 concurrent continuations).

A future improvement would be to add ne(status, 'running') to the step_started WHERE clause in workflow-server, so that only the first invocation succeeds and the rest get 409. This would require careful handling of the legitimate retry-after-SIGKILL case where a step was killed mid-execution and the queue re-delivers the message.

Incremental Event Loading

The handler caches the event log in memory across loop iterations. Instead of re-fetching the entire event log on each replay:

First iteration: full load via getAllWorkflowRunEventsWithCursor(), which returns both the events and the final pagination cursor
Subsequent iterations: getNewWorkflowRunEvents(runId, cursor) fetches only events created after the saved cursor and appends them to the cached array

For a 10-step serial workflow completing in one invocation, the 10th replay loads ~2 new events instead of re-fetching all ~30.
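
The caching pattern can be sketched as follows. FakeEventServer and EventCache are hypothetical stand-ins, and numeric cursors are used for brevity in place of the server's real cursor format.

```typescript
type WorkflowEvent = { id: number; type: string };
type Page = { events: WorkflowEvent[]; cursor: number | null };

// Hypothetical server: returns events after the cursor, plus a cursor that
// points at the last event even on the final page.
class FakeEventServer {
  private events: WorkflowEvent[] = [];

  append(type: string): void {
    this.events.push({ id: this.events.length + 1, type });
  }

  loadAfter(cursor: number | null): Page {
    const events = this.events.filter((e) => cursor === null || e.id > cursor);
    const last = this.events[this.events.length - 1];
    return { events, cursor: last ? last.id : null };
  }
}

class EventCache {
  private events: WorkflowEvent[] = [];
  private cursor: number | null = null;
  private server: FakeEventServer;

  constructor(server: FakeEventServer) {
    this.server = server;
  }

  // Fetches only events after the saved cursor (a full load the first time)
  // and returns how many events this refresh transferred.
  refresh(): number {
    const page = this.server.loadAfter(this.cursor);
    this.events.push(...page.events);
    if (page.cursor !== null) this.cursor = page.cursor;
    return page.events.length;
  }

  size(): number {
    return this.events.length;
  }
}

// Usage: 28 events exist before the final replay, then 2 more arrive.
const server = new FakeEventServer();
for (let i = 0; i < 28; i++) server.append("event");
const cache = new EventCache(server);
const firstFetch = cache.refresh();
server.append("step_started");
server.append("step_completed");
const secondFetch = cache.refresh();
```

The second refresh transfers 2 events while the cache still holds all 30.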

Server-Side Cursor Fix

The incremental loading depends on the server returning a cursor even on the final page of results (hasMore: false). Previously, workflow-server returned cursor: null when there were no more pages. This was fixed in the peter/fix-end-cursor branch to always return an eid:<eventId> cursor when there are events, aligning with world-local and world-postgres behavior.

If a World implementation does not return a cursor after the initial load, the handler logs an error and falls back to a full reload.

Timeout Handling

The inline execution loop checks wall-clock time before each replay iteration. If the elapsed time has reached a configurable threshold (default: 110 seconds, leaving headroom under a 120-second function limit), the handler re-schedules itself via the queue and returns.

The threshold is configurable via the WORKFLOW_V2_TIMEOUT_MS environment variable.

If a single step takes longer than the timeout threshold, the step runs to completion (or SIGKILL) — there is no interruption mechanism for in-progress step execution. This is the same behavior as the previous architecture.
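
A minimal sketch of the threshold check. The helper names are hypothetical, and falling back to the default for a missing or non-positive value is an assumption rather than documented behavior.

```typescript
const DEFAULT_TIMEOUT_MS = 110_000; // default threshold for a 120-second limit

// Reads WORKFLOW_V2_TIMEOUT_MS from an env-like record. Falling back to the
// default for unset or invalid values is an assumption of this sketch.
function resolveTimeoutMs(env: Record<string, string | undefined>): number {
  const parsed = Number(env["WORKFLOW_V2_TIMEOUT_MS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
}

// True when the handler should re-schedule itself instead of replaying again.
function shouldReschedule(startedAtMs: number, nowMs: number, thresholdMs: number): boolean {
  return nowMs - startedAtMs >= thresholdMs;
}

// Usage
const threshold = resolveTimeoutMs({ WORKFLOW_V2_TIMEOUT_MS: "50000" });
const keepGoing = !shouldReschedule(0, 49_999, threshold);
const mustYield = shouldReschedule(0, 50_000, threshold);
```

Because the check runs only between iterations, it cannot interrupt a step that is already executing.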

Queue Message Changes

The WorkflowInvokePayload schema has a new optional field:

stepId: z.string().optional()

When present, the handler executes that specific step instead of replaying the workflow, then queues a plain workflow continuation (without stepId) and exits. Background steps and step retries are queued with this field set.

The queue trigger configuration uses WORKFLOW_QUEUE_TRIGGER on the __wkf_workflow_* topic. The __wkf_step_* topic and its separate trigger are no longer generated.

Builder Changes

Base Builder

New method createCombinedBundle() in packages/builders/src/base-builder.ts:

Builds the step registrations bundle (same esbuild + SWC step mode as before)
Builds the workflow VM code string (same esbuild + SWC workflow mode as before)
Generates a combined route file that imports the step registrations and uses workflowEntrypoint(workflowCode)

No changes to the SWC plugin were needed. The two-pass build approach (separate step and workflow SWC modes) still applies.

Framework Builders

All framework builders were updated to use createCombinedBundle():

Next.js (eager and deferred/lazyDiscovery): replaces separate step + flow route generation
NestJS, Nitro, Standalone: replaces separate createStepsBundle() + createWorkflowsBundle() calls
SvelteKit, Astro: same, plus post-processing regex updated to match workflowEntrypoint
Vercel Build Output API (used by Nitro/Astro production): single flow.func/ with WORKFLOW_QUEUE_TRIGGER

Generated File Layout

.well-known/workflow/v1/
  flow/
    route.js                  # Handler (workflowEntrypoint)
    __step_registrations.js   # Step function registrations (side effects)
  webhook/
    [token]/
      route.js                # Webhook handler (unchanged)
  manifest.json               # Workflow/step/class manifest (unchanged)
  config.json                 # Functions config (single trigger)

The step/ directory is no longer generated.

Suspension Handler

handleSuspension() in packages/core/src/runtime/suspension-handler.ts creates events for all pending operations (hooks, step events, wait events) but does not queue step messages. It returns the pending step items so the handler can decide which to execute inline vs. queue to background.

Concerns and Edge Cases

VM Sandboxing

Workflow code still runs in a Node.js VM for determinism and sandboxing. Step code runs in the Node.js host context. The only change is that both happen within the same function invocation.

Bundle Size and Cold Start

The combined bundle is larger (contains both step code and workflow VM code). Cold start time increases slightly. The reduction in total function invocations more than compensates.

Step Retries

When an inline step fails with retries remaining:

RetryableError with explicit retryAfter delay: re-queue to self with stepId and delay
Transient errors with immediate retry: re-queue to self with stepId (delay = 1s)
FatalError: fail immediately
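
The three cases can be sketched as one decision function. The error classes below are local stand-ins (the real RetryableError and FatalError constructors and fields may differ), and decideRetry is a hypothetical name.

```typescript
// Local stand-ins for the error classes named above.
class RetryableError extends Error {
  retryAfterMs: number | undefined;
  constructor(message: string, retryAfterMs?: number) {
    super(message);
    this.retryAfterMs = retryAfterMs;
  }
}
class FatalError extends Error {}

type RetryDecision =
  | { action: "fail" }
  | { action: "requeue"; stepId: string; delayMs: number };

function decideRetry(err: unknown, stepId: string, retriesRemaining: number): RetryDecision {
  // FatalError, or no retries left: fail immediately.
  if (err instanceof FatalError || retriesRemaining <= 0) return { action: "fail" };
  // Explicit retryAfter: re-queue to self with stepId and that delay.
  if (err instanceof RetryableError && err.retryAfterMs !== undefined) {
    return { action: "requeue", stepId, delayMs: err.retryAfterMs };
  }
  // Transient error: re-queue to self with stepId and a 1 s delay.
  return { action: "requeue", stepId, delayMs: 1_000 };
}

// Usage
const explicit = decideRetry(new RetryableError("rate limited", 30_000), "step-1", 2);
const transient = decideRetry(new Error("socket hang up"), "step-1", 2);
const fatal = decideRetry(new FatalError("bad input"), "step-1", 2);
```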

Mixed Suspensions

A suspension may contain steps, hooks, and waits simultaneously. The handler creates events for all, executes any pending step inline, and returns with the wait timeout if applicable. The workflow will re-suspend on next replay for the still-pending hooks/waits.

Hook Conflicts

If a hook conflict is detected during suspension handling, the handler breaks the loop and returns { timeoutSeconds: 0 } for immediate re-invocation, same as the previous behavior.

Encryption Key Resolution

Encryption keys are resolved per-run during event loading and step execution, same as before.

Framework Support

All framework integrations have been updated: Next.js (eager and deferred/lazyDiscovery), NestJS, SvelteKit, Astro, Nitro/Nuxt/Hono/Express/Vite, and CLI standalone. The Vercel Build Output API builder (used by Nitro and Astro for production deploys) also uses the combined bundle with WORKFLOW_QUEUE_TRIGGER.

Non-Next.js Integration Challenges

Module Scope Duplication in Re-Bundled Output

Builders that use bundleFinalOutput: true (standalone CLI, Vercel Build Output API, NestJS) produce a single file where esbuild re-bundles the step registrations and the workflow runtime together. esbuild creates isolated module scopes for each source module, even within the same output file. This meant registerStepFunction and getStepFunction operated on different Map instances — steps were registered into one Map but looked up from another.

Fix: The step function registry (registeredSteps Map in @workflow/core/private) and the step context storage (contextStorage AsyncLocalStorage in @workflow/core/step/context-storage) were changed from module-scoped variables to globalThis singletons using Symbol.for. This ensures all esbuild module scopes share the same instances. The pattern was already used in the codebase for the World singleton and the class serialization registry.
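
A minimal sketch of the globalThis singleton pattern. The symbol key and helper names are hypothetical; the point is that Symbol.for returns the same symbol for the same string everywhere, so every module scope resolves to the same Map.

```typescript
type StepFn = (...args: unknown[]) => unknown;

// Hypothetical registry key, not the one used in the codebase.
const REGISTRY_KEY_NAME = "example.workflow.registeredSteps";

function sharedRegistry(): Map<string, StepFn> {
  // Symbol.for looks the key up in the global symbol registry, so separate
  // module scopes that call it with the same string get the same symbol.
  const key = Symbol.for(REGISTRY_KEY_NAME);
  const g = globalThis as unknown as Record<symbol, Map<string, StepFn> | undefined>;
  let registry = g[key];
  if (!registry) {
    registry = new Map<string, StepFn>();
    g[key] = registry;
  }
  return registry;
}

// These two functions stand in for code that esbuild placed in different
// module scopes; neither holds a module-level Map of its own.
function registerStepFunction(name: string, fn: StepFn): void {
  sharedRegistry().set(name, fn);
}
function getStepFunction(name: string): StepFn | undefined {
  return sharedRegistry().get(name);
}

// Usage
const add: StepFn = (a, b) => (a as number) + (b as number);
registerStepFunction("add", add);
const found = getStepFunction("add");
```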

Workflow Package CJS Export Condition

The workflow package's root export has "require": "./dist/typescript-plugin.cjs" for TypeScript editor plugin loading. When esbuild bundles with CJS format, it resolves import { defineHook } from 'workflow' via the require condition, getting the TS plugin instead of the API.

Fix: Added a "node" condition ("node": "./dist/index.js") before the "require" condition in the workflow package's exports. esbuild with conditions: ['node'] matches "node" first and uses the correct API entry. TypeScript's plugin loader doesn't use conditions: ['node'], so it still falls through to "require" for the TS plugin.
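
Why key order matters can be shown with a simplified model of conditional exports resolution, in which keys are tried in declaration order and the first active condition wins. The exports object below is abbreviated to the two conditions discussed here, and the condition sets are assumptions based on the description above rather than the resolvers' full behavior.

```typescript
// Simplified model: return the target of the first key, in declaration order,
// that is among the active conditions.
function resolveExport(
  exportsMap: Record<string, string>,
  activeConditions: Set<string>,
): string | undefined {
  for (const [condition, target] of Object.entries(exportsMap)) {
    if (activeConditions.has(condition)) return target;
  }
  return undefined;
}

// Abbreviated root export after the fix: "node" is declared before "require".
const rootExport: Record<string, string> = {
  node: "./dist/index.js",
  require: "./dist/typescript-plugin.cjs",
};

// esbuild bundling CJS with conditions: ['node'] (assumed active set).
const bundlerEntry = resolveExport(rootExport, new Set(["node", "require"]));
// A loader that does not apply the "node" condition (assumed active set).
const pluginEntry = resolveExport(rootExport, new Set(["require"]));
```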

Local World Concurrent Replay Interference

The local development world (world-local) processes queue messages with high concurrency (default: 1000). With the V2 combined handler, parallel steps generate multiple workflow continuation messages. When these are processed concurrently, each triggers a replay that sees in-flight events from other concurrent replays. This causes "unconsumed event" errors because the event consumer encounters events that don't match any subscriber in the current replay state.

In production (Vercel), this doesn't happen — each function invocation is isolated with its own event loading.

Fix: The world-testing embedded test server now defaults to WORKFLOW_LOCAL_QUEUE_CONCURRENCY=1, serializing queue processing to simulate the production isolation model. This prevents concurrent replays from interfering with each other.

ESM bundleFinalOutput and Dynamic Require Errors

When bundleFinalOutput: true is used with ESM format, esbuild bundles CJS dependencies (like debug) into the output. CJS require() calls are wrapped in esbuild's __require polyfill, which throws "Dynamic require of X is not supported" in ESM contexts where require is undefined. This affected all ESM-based framework builders (Nitro, NestJS, SvelteKit, Astro) that were switched to bundleFinalOutput: true during the V2 migration.

Fix: ESM builders use bundleFinalOutput: false with externalizeNonSteps: true, matching the pre-V2 behavior. The framework's own bundler (Vite, Rollup, Turbopack) handles dependency resolution. Only CJS builders (standalone CLI, Vercel Build Output API) use bundleFinalOutput: true, where require() is natively available.

Rollup Tree-Shaking of Step Registrations

When bundleFinalOutput: false is used with Nitro's rollup pipeline, the step registrations bundle (steps.mjs) only contains side-effect code (registerStepFunction calls) with no exports. Rollup tree-shakes the entire module because it has no used exports, removing all step registrations from the production bundle. This causes "Step not found" errors at runtime.

Fix: The steps bundle now exports a sentinel value (export const __steps_registered = true), and the combined route file imports it (import { __steps_registered } from './steps.mjs'). This gives rollup a used binding to track, preventing it from dropping the module and its side effects.

Concurrent Replay Interference with Multi-Batch Workflows

When the V2 inline execution loop advances through multiple suspension points (e.g., completes all steps in batch 1 and creates step_created events for batch 2), concurrent continuations from batch 1's background steps may replay the workflow and encounter batch 2's events without matching subscribers. The events have valid correlationIds but the concurrent replay hasn't reached the code path that creates those subscriptions, triggering the "Unconsumed event in event log" error.

Fix: The EventsConsumer's onUnconsumedEvent callback now returns true (skip) for step lifecycle events (step_created, step_started, step_completed, step_failed, step_retrying) whose correlationId has a corresponding step_created event in the log — confirming they're from a legitimate concurrent handler, not corruption. The EventsConsumer was extended to support a boolean return from onUnconsumedEvent: true advances past the event, false/undefined triggers the error. Orphaned events with unknown correlationIds still error as before.
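
The skip rule can be sketched as a predicate. The event shape (eventType, correlationId) and the two-argument signature are assumptions for illustration; the real onUnconsumedEvent callback may receive different arguments.

```typescript
// Assumed event shape; only correlationId is named in this document.
type LogEvent = { eventType: string; correlationId?: string };

const STEP_LIFECYCLE_EVENTS = new Set([
  "step_created",
  "step_started",
  "step_completed",
  "step_failed",
  "step_retrying",
]);

// Returns true to skip the event, false to raise the unconsumed-event error.
function onUnconsumedEvent(event: LogEvent, log: LogEvent[]): boolean {
  if (!STEP_LIFECYCLE_EVENTS.has(event.eventType)) return false;
  if (event.correlationId === undefined) return false;
  // Skip only when the log holds a step_created for the same correlationId.
  return log.some(
    (e) => e.eventType === "step_created" && e.correlationId === event.correlationId,
  );
}

// Usage
const log: LogEvent[] = [
  { eventType: "step_created", correlationId: "batch2-step1" },
  { eventType: "step_started", correlationId: "batch2-step1" },
];
const skipKnown = onUnconsumedEvent(log[1] as LogEvent, log);
const skipOrphan = onUnconsumedEvent({ eventType: "step_completed", correlationId: "unknown" }, log);
const skipOtherType = onUnconsumedEvent({ eventType: "hook_received", correlationId: "batch2-step1" }, log);
```

The "hook_received" event type in the usage is a made-up example of a non-step event.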

Stale V1 Artifacts in Build Caches

SvelteKit and Astro's build caches (including Vercel's) may preserve the old V1 step/ route directory from previous builds. When the V2 builder runs, it no longer generates step routes, but the stale files remain and cause build failures (e.g., importing the removed stepEntrypoint). Additionally, SvelteKit's beforeExit hook that patches .vc-config.json files for Vercel deployments was still trying to configure the non-existent step.func/ directory.

Fix: SvelteKit and Astro builders now clean up stale V1 step route directories during build. SvelteKit's Vercel deployment hook was updated to only configure the combined flow.func/ directory.

Next.js Canary Turbopack and Temp Files

The deferred (lazyDiscovery) Next.js builder writes build artifacts with a .temp extension to avoid HMR churn, then copies them to their final names. The V2 migration created __step_registrations.route.js.temp in the app/ directory. Canary Turbopack rejects this file as an "Unknown module type" because the .temp extension has no associated loader.

Fix: The step registrations file is written directly to its final name (__step_registrations.js) since it doesn't need the temp-file HMR mechanism. Only the route file uses temp naming.

Concurrent step_started Inflating Attempt Counter

When the V2 handler dispatches N parallel steps as background messages, each background step completion queues a workflow continuation. Up to N continuations may replay concurrently, and each may attempt to start the same not-yet-completed step (since step_started succeeds for already-running steps). Each call atomically increments the attempt counter. With N=5 parallel steps, the attempt counter can reach 5 on the first genuine execution — exceeding the default maxRetries + 1 = 4 threshold and prematurely failing the step with "exceeded max retries".

This is the same known limitation described in "Convergence After Parallel Steps" above, but with a concrete failure mode: promiseRaceStressTestWorkflow (which uses 5 parallel steps with Promise.race) consistently failed on Postgres tests.

Fix: The max retries check in executeStep() now only enforces when step.error exists — distinguishing actual retries (failed → retry with error) from concurrent first-attempt races (multiple handlers start the same step simultaneously without any prior failure). Concurrent starts are harmless since step_completed idempotency ensures only the first completion wins.
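
A minimal sketch of the adjusted check. The StepRecord shape and the exact comparison (attempt greater than maxRetries + 1) are assumptions inferred from the description above, with maxRetries = 3 implied by the stated threshold of 4.

```typescript
// Assumed step shape: attempt is the counter incremented by step_started,
// error is set only after a previous attempt actually failed.
interface StepRecord { attempt: number; error?: string }

function exceededMaxRetries(step: StepRecord, maxRetries: number): boolean {
  // Enforce the limit only for real retries; concurrent first-attempt
  // starts inflate the counter without recording an error.
  return step.error !== undefined && step.attempt > maxRetries + 1;
}

// Usage, with maxRetries = 3 (threshold of 4 attempts).
const racedFirstAttempt = exceededMaxRetries({ attempt: 5 }, 3);
const genuineExhaustion = exceededMaxRetries({ attempt: 5, error: "boom" }, 3);
const lastAllowedRetry = exceededMaxRetries({ attempt: 4, error: "boom" }, 3);
```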

Inline Step Execution with Pending Stream Operations

When a step writes to a WritableStream (e.g., via getWritable()), the stream data is piped to S3 through a background flushablePipe operation registered with waitUntil(). In V1, each step ran in a separate function invocation: the handler returned after the step, and waitUntil flushed the stream data to S3 before the invocation was torn down. The test's stream reader could then read the committed data.

In V2, the inline execution loop continues after the step completes (replay → next sleep → etc.), keeping the function alive and competing for event loop time with the background stream ops. The waitUntil ops don't get exclusive event loop time to flush, so the stream data never reaches S3 within the test's 60-second timeout. This caused outputStreamWorkflow and outputStreamInsideStepWorkflow to consistently time out on Vercel Prod.

Fix: executeStep() now returns hasPendingOps: true when the step had background operations (stream pipes). The V2 handler checks this flag after inline step execution: if pending ops exist, it breaks the inline loop, queues a workflow continuation, and returns — giving waitUntil exclusive control to flush the ops. This matches V1 behavior where each step was a separate function invocation. Steps without stream ops continue to execute inline without the extra queue round-trip.
