Orchestration
An orchestrator is an agent that runs other agents. It picks a task, spawns a worker session to do the work, reads the worker’s output, injects follow-up messages when needed, and keeps a record of what it started and why. The platform treats an orchestrator as a long-lived session: it doesn’t time out while its workers are busy, and it survives pod crashes.
This page describes the data model, the six tools an orchestrator calls, and the failure modes. Spawn permission is one kind of permission grant; the grant model itself lives in Permission grants. Session fundamentals live in Sessions and the scheduler; execution details (pod spec, sidecar) live in Architecture Overview.
Capability is per-agent, not a role
There is no binary orchestrator/worker distinction. Every agent starts as a worker. An agent becomes an orchestrator by holding one or more spawn grants — a row in `permission_grants` whose subject is the agent, whose `grant_type` is `spawn`, and whose `details` names a child agent:
```
{
  "agent_subject_id": "<parent agent>",
  "grant_type": "spawn",
  "details": { "child_agent_id": "<child agent>" },
  "scope": "persistent" | "session"
}
```

An agent with zero active spawn grants is a pure worker. An agent with one or more is an orchestrator with respect to exactly the named children. Spawning anything else returns `agent_not_permitted`.
Persistent grants come from the agent edit screen. Session grants come from the runtime request_grant dialog. Both flow through the same POST /api/workspaces/:slug/grants endpoint, which is user-authenticated — no agent can grant itself anything.
Edit screen
A “Can spawn” card on the agent edit page lists every other agent in the workspace with a checkbox. Toggling a box writes or revokes a spawn grant with `scope='persistent'`. The card sits below the repos card and above the schedule card.
Self-grant is rejected: the details.child_agent_id must not equal the agent_subject_id.
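A minimal sketch of the validation behind grant creation, assuming the row shape shown above. The helper name `create_spawn_grant` and the `GrantError` type are illustrative, not the real endpoint code:

```python
class GrantError(Exception):
    pass

def create_spawn_grant(agent_subject_id: str, child_agent_id: str,
                       scope: str = "persistent") -> dict:
    """Build a spawn-grant row, rejecting self-grants up front."""
    if scope not in ("persistent", "session"):
        raise GrantError("invalid_scope")
    if child_agent_id == agent_subject_id:
        # Self-grant is rejected: details.child_agent_id must not
        # equal agent_subject_id.
        raise GrantError("self_grant_not_allowed")
    return {
        "agent_subject_id": agent_subject_id,
        "grant_type": "spawn",
        "details": {"child_agent_id": child_agent_id},
        "scope": scope,
    }
```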
Auto-injected system prompt
When an agent has at least one active spawn grant at session creation, the Job watcher appends a fixed block to the agent’s system prompt. The agent doesn’t have to be told to read the block; it’s there on every session the orchestrator runs.
The exact text the orchestrator sees:
```
## Other agents you can spawn

You can start and supervise sessions of these agents:

- <agent-slug-1> — <agent-name-1>
- <agent-slug-2> — <agent-name-2>

If you need another agent not on this list, call request_grant with
grant_type='spawn' and the child_agent_id you want. The user will see
a dialog and can approve or deny.

Tools:
- spawn_session(agent_slug, prompt) → session_id. Starts a new session of an agent you're permitted to spawn.
- read_session(session_id, after_seq?) → { events, status, last_seq }. Pulls the child's event log. Pass after_seq to read only newer events.
- message_session(session_id, text). Sends text to a child as if it were a user message.
- await_children(session_ids) → { <session_id>: { status, result } }. Blocks until every listed child has reached `complete` or `failed`.

Rules:
- Children run in this workspace. Cross-workspace spawning is not allowed.
- Do not spawn children in a loop. If you need many similar tasks, write them as one prompt to a single child.
- Children inherit this workspace's git installations. They can clone and push the same repos this agent can.
```

The Job watcher interpolates the bullet list from the current set of active persistent spawn grants at pod creation time. The list is snapshotted to the pod env — adding a new spawn grant while a session is running does not change what the agent sees until the session restarts. Session-scope grants approved during the run don’t appear in the prompt either; the orchestrator discovers those by calling `spawn_session` and seeing it succeed. The “if you need another agent” line tells the agent to try anyway.
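The interpolation step can be sketched as plain string assembly from the snapshotted grant list. `render_spawn_block` is a hypothetical helper name; the real Job watcher code may differ:

```python
def render_spawn_block(children: list) -> str:
    """children: (slug, display name) pairs snapshotted from active
    persistent spawn grants at pod creation time."""
    lines = [
        "## Other agents you can spawn",
        "",
        "You can start and supervise sessions of these agents:",
        "",
    ]
    # One bullet per permitted child, matching the prompt format above.
    lines += [f"- {slug} — {name}" for slug, name in children]
    return "\n".join(lines)
```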
Parent and child
Every session has an optional parent.
```sql
ALTER TABLE sessions
  ADD COLUMN parent_session_id UUID REFERENCES sessions(id) ON DELETE SET NULL,
  ADD COLUMN parent_tool_use_id TEXT;
```

- `parent_session_id` is `NULL` for top-level sessions.
- `parent_tool_use_id` records which specific tool call spawned the child, so a parent with several open children can route messages back to the right conversation turn.
- A child inherits the parent’s workspace. Cross-workspace spawning is rejected at the api layer.
- Cycles are rejected at spawn time: a session cannot spawn an ancestor.
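One plausible reading of the cycle check, sketched here: walk the parent chain of the spawning session and reject the spawn if the target agent already owns an ancestor session. The in-memory `sessions` map and the function name stand in for the real sessions table and api code:

```python
def spawns_ancestor(sessions: dict, spawning_session: str,
                    target_agent: str) -> bool:
    """True if target_agent already owns a session on the parent chain
    of spawning_session (including spawning_session itself)."""
    cur = spawning_session
    while cur is not None:
        if sessions[cur]["agent"] == target_agent:
            return True
        cur = sessions[cur]["parent"]   # None for top-level sessions
    return False
```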
Runtime differences for orchestrator sessions
The Job watcher checks whether the session’s agent has any active persistent spawn grants. If yes, the pod spec changes:
| Property | Worker | Orchestrator |
|---|---|---|
| `activeDeadlineSeconds` | 3600 | unset (no hard deadline) |
| `restartPolicy` | `Never` | `OnFailure` |
| `backoffLimit` | 0 | 6 |
| Idle timeout | 15 min default | paused while children are active |
| Workspace volume | `emptyDir` | per-session PersistentVolumeClaim |
| Extra MCP tools exposed | none | spawn / read / message / await |
| System prompt addition | none | "Other agents you can spawn" block |
Both shapes share the same agent container image and the same wire event schema. The difference is the lifetime contract and which tools the agent sees.
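The table can be read as a conditional over the Job spec. A sketch, with `make_job_spec` as a hypothetical helper rather than the real watcher code:

```python
def make_job_spec(session_id: str, is_orchestrator: bool) -> dict:
    """Shape the pod spec per the worker/orchestrator table."""
    spec = {
        "restartPolicy": "OnFailure" if is_orchestrator else "Never",
        "backoffLimit": 6 if is_orchestrator else 0,
    }
    if is_orchestrator:
        # No hard deadline; per-session PVC holds the SDK transcript.
        spec["volumes"] = [{
            "name": "workspace",
            "persistentVolumeClaim": {"claimName": f"x1-session-{session_id}"},
        }]
    else:
        spec["activeDeadlineSeconds"] = 3600
        spec["volumes"] = [{"name": "workspace", "emptyDir": {}}]
    return spec
```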
Six operations
Everything an orchestrator does reduces to six operations. Each is a single MCP tool call; the sidecar translates the call into a platform action.
1. Spawn a child
```
spawn_session({
  agent_slug: "code-writer",
  prompt: "Refactor the checkout module to extract the validation logic",
  request_id: "t_042"
})
```

The sidecar POSTs:
```
POST /api/internal/sessions
{
  "workspace_slug": "...",
  "agent_slug": "code-writer",
  "parent_session_id": "...",
  "parent_tool_use_id": "t_042",
  "triggered_by": "orchestrator",
  "initial_prompt": "Refactor the checkout module..."
}
```

The api looks up the parent agent’s active spawn grants and checks that one names the requested child agent (either as `scope='persistent'` or as `scope='session'` with the current session id). If not, the call returns `agent_not_permitted`. Otherwise it creates a pending session and the Job watcher picks it up on the next tick.
```mermaid
sequenceDiagram
    participant O as Orchestrator agent
    participant OS as Orchestrator sidecar
    participant A as api
    participant JW as Job watcher
    participant C as Child pod
    O->>OS: spawn_session(agent_slug, prompt)
    OS->>A: POST /api/internal/sessions
    A->>A: check permission_grants
    A->>A: INSERT sessions (pending, parent_session_id=...)
    A-->>OS: { session_id }
    OS-->>O: { session_id }
    A->>JW: next tick
    JW->>C: create Job
```
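The api-side permission check can be sketched against the grant-row shape shown earlier. `check_spawn` is illustrative, and the `session_id` field used to bind a session-scoped grant to its session is an assumed detail of the grant row:

```python
def check_spawn(grants: list, parent_agent: str,
                parent_session: str, child_agent: str) -> bool:
    """A grant authorizes the spawn if it names the requested child and is
    either persistent or session-scoped to the calling session."""
    for g in grants:
        if (g["agent_subject_id"] == parent_agent
                and g["grant_type"] == "spawn"
                and g["details"]["child_agent_id"] == child_agent
                and (g["scope"] == "persistent"
                     or (g["scope"] == "session"
                         and g.get("session_id") == parent_session))):
            return True
    return False  # caller maps this to agent_not_permitted
```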
2. Read a child’s events
```
read_session({
  session_id: "019d...",
  after_seq: 42   // optional cursor
})
```

Returns:
```
{
  "status": "pending" | "running" | "complete" | "failed",
  "last_seq": 57,
  "events": [ { seq, type, payload, timestamp }, ... ]
}
```

The sidecar handles the call by querying the api’s internal endpoint `GET /api/internal/sessions/:id/events?after_seq=N`. Events come back oldest-first, up to a server-side cap of 1000 per call. The orchestrator uses `last_seq` as the next `after_seq` cursor.
read_session is the pull-based inspection path. It complements report_to_parent (below), which is push-based: workers voluntarily send messages when they need attention. An orchestrator can read at any time without the child having to do anything special.
Permission: the parent can read any session in its own workspace whose parent_session_id is the caller’s session id — nothing else. No reading of other orchestrators’ children.
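The cursor discipline above can be sketched as a polling loop. Here `read_session` is a stand-in callable for the MCP tool, and `drain_events` is a hypothetical helper:

```python
def drain_events(read_session, session_id: str) -> tuple:
    """Collect every event exactly once by advancing the after_seq cursor;
    return (events, terminal_status) once the child is done and drained."""
    events, cursor = [], 0
    while True:
        page = read_session(session_id=session_id, after_seq=cursor)
        events.extend(page["events"])
        cursor = page["last_seq"]
        # Terminal status alone isn't enough: keep paging until the
        # last page comes back empty, so no trailing events are missed.
        if page["status"] in ("complete", "failed") and not page["events"]:
            return events, page["status"]
```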
3. Report to parent (called by the child)
The child agent calls:
```
report_to_parent({
  text: "I found three call sites that use the old validator. Should I update all of them?",
  options: ["yes, update all", "list them first"]
})
```

The child sidecar publishes to `x1.session.{parent_session_id}.input` with the caller tagged:

```
{
  "text": "I found three call sites...",
  "from_session_id": "019d...",
  "from_agent_slug": "code-writer",
  "request_id": "parent_tool_use_id_from_spawn",
  "options": ["yes, update all", "list them first"]
}
```

The parent sidecar injects the message into its agent. The orchestrator sees it as a user message; the UI renders it with a chip showing the child agent’s name and a link to the child session. The `request_id` matches the `parent_tool_use_id` from the spawn, so the SDK routes the answer to the right tool call when the orchestrator responds.
report_to_parent is always enabled for a child that has a parent — it doesn’t need a grant.
4. Message a child
Section titled “4. Message a child”message_session({ session_id: "019d...", text: "Yes, update all three. Commit after each file so we can review."})The sidecar POSTs to the api’s internal endpoint, which publishes to x1.session.{child_id}.input. The child treats the orchestrator’s message exactly like a human operator’s.
Permission check is the same as read_session: the target session must have parent_session_id = orchestrator's session id.
5. Await children
Section titled “5. Await children”await_children({ session_ids: ["019d...", "019e..."] })Blocks until every listed child has reached complete or failed. The sidecar subscribes to each child’s event subject and resolves the tool call on the first terminal event per child. Returns a map of session_id → { status, result }.
Useful for “spawn N, wait for all, then aggregate.” For a push-based alternative, the orchestrator can sit idle and let report_to_parent messages drive it; the idle timer is paused while children are active either way.
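The spawn-then-await pattern, sketched for a handful of distinct children (the prompt rules above discourage looping over many similar tasks). `spawn_session` and `await_children` here are stand-in callables for the MCP tools, and `fan_out` is a hypothetical helper:

```python
def fan_out(spawn_session, await_children, tasks: list) -> dict:
    """tasks: (agent_slug, prompt) pairs. Spawn each, block until all
    are terminal, then separate out the failures for follow-up."""
    ids = [spawn_session(agent_slug=a, prompt=p) for a, p in tasks]
    results = await_children(session_ids=ids)   # blocks until terminal
    failed = [sid for sid in ids if results[sid]["status"] == "failed"]
    return {"results": results, "failed": failed}
```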
6. Cancel a child
Section titled “6. Cancel a child”cancel_session({ session_id: "019d..." })Flips the child’s session row to failed and terminates its pod. The orchestrator can call this on any child it spawned. The platform does not auto-cancel children when the parent completes — orphaned children run until they finish or the reaper catches them.
Resume after crash
Section titled “Resume after crash”Orchestrators pin their SDK session id to the platform session id. On pod restart, the agent container reads SESSION_ID from env, passes it to query({ resume: SESSION_ID, ... }), and the Claude Agent SDK rehydrates the conversation from the transcript on the pod’s persistent volume.
Orchestrator pods use per-session PVCs:
```yaml
volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: x1-session-{sessionId}
```

The PVC is created by the Job watcher when the agent has any active persistent spawn grant. The `restartPolicy: OnFailure` + `backoffLimit: 6` combination lets the pod come back after a node failure without the Job watcher having to intervene.
Worker pods do not use PVCs. They’re short-lived; a crashed worker is a failed session, not a restart.
What’s persisted
| Kind | Location |
|---|---|
| "Agent X can spawn Y" | `permission_grants` (`grant_type='spawn'`) |
| "I spawned X" | `sessions.parent_session_id` on the child |
| "X told me Y" | `session_events` on the parent (user message with `from_session_id`) |
| "I told X Y" | `session_events` on X (user message) |
| "X finished" | `session_events.type = 'session.completed'` on X |
| "My conversation so far" | Claude Agent SDK transcript on the PVC |
No separate “orchestration log” table. Recovery on restart: re-enumerate children of this session id via SELECT * FROM sessions WHERE parent_session_id = ?, resume the SDK transcript, carry on.
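The recovery step can be sketched as a grouping pass over the child rows. The row dicts stand in for the result of the `SELECT` above; `recover_children` is an illustrative name:

```python
def recover_children(rows: list, parent_id: str) -> dict:
    """Group this orchestrator's direct children by status, as a resumed
    orchestrator would after re-running the parent_session_id query."""
    out = {"pending": [], "running": [], "complete": [], "failed": []}
    for r in rows:
        if r["parent_session_id"] == parent_id:
            out[r["status"]].append(r["id"])
    return out
```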
UI rendering
A session detail page shows:
- Its own events in the main stream.
- A Children panel listing direct child sessions with status pills, linking to each child’s detail page.
- In the event stream, `user.message` events whose payload carries `from_session_id` render with a child-session chip (agent name, short session id, clickable). They still sort by `seq` with everything else.
The child session detail page has a breadcrumb back to its parent. No nested stream rendering — the parent’s page is the index, the child’s page is the full log.
Failure modes
Orchestrator pod dies mid-spawn. The child’s sessions row either doesn’t exist yet (transaction rolled back) or exists with status='pending' and no pod. The resumed orchestrator re-enumerates children; the Job watcher picks up the pending row and starts a pod. Idempotency on parent_tool_use_id prevents duplicate spawns — the api rejects a second spawn with the same (parent_session_id, parent_tool_use_id).
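The idempotency guard can be sketched with a dict standing in for a unique index on `(parent_session_id, parent_tool_use_id)`. The names `create_child` and `DuplicateSpawn` are illustrative:

```python
class DuplicateSpawn(Exception):
    pass

def create_child(index: dict, parent_session_id: str,
                 parent_tool_use_id: str, session_id: str) -> str:
    """Register a spawn; reject a second spawn with the same key, as the
    api does for repeated (parent_session_id, parent_tool_use_id)."""
    key = (parent_session_id, parent_tool_use_id)
    if key in index:
        raise DuplicateSpawn(f"already spawned as {index[key]}")
    index[key] = session_id
    return session_id
```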
Child sidecar dies while running. The parent stops receiving report_to_parent messages. read_session still works — it reads from DB — so the orchestrator can poll until it sees a terminal status. A reaper in the api flips children whose pod has been gone more than N minutes to status='failed' and emits a synthetic session.failed event.
Orchestrator dies with children still running. Children keep running; their events keep flowing to NATS and landing in session_events. When the orchestrator resumes, it reads the backlog via read_session; an await_children call on children that have already reached a terminal state returns immediately.
Infinite spawn loop. Depth is capped at one for now: spawn_session rejects calls from any session whose parent_session_id is non-null. Deep nesting is out of scope until we have a use case.
Cross-workspace spawn. Rejected at the api layer. spawn_session returns workspace_mismatch if the requested agent’s workspace doesn’t match the orchestrator’s.
Grant revoked mid-session. The allowlist is snapshotted into pod env when the Job is created. Revoking a spawn grant while a session is running does not retroactively disallow spawns already enumerated in the agent’s system prompt. It does gate future spawn_session calls at the api — the next spawn returns agent_not_permitted even if the agent’s prompt still lists the now-removed child. The agent may be confused. Documented, not fixed.
Dangling grant references. If the child agent named in a spawn grant’s `details` is deleted, nothing cascades: `child_agent_id` is not a foreign key, because it lives inside jsonb rather than in a column of its own. A daily sweep in the permissions domain sets `revoked_at` on grants that reference deleted agents.
Out of scope
Intentional non-goals:
- Multi-level nesting. Orchestrators cannot spawn orchestrators. Two levels only.
- Cross-workspace orchestration. A worker spawned by an orchestrator lives in the same workspace.
- Broadcast messaging. No “message all children” primitive. Orchestrators loop over session ids.
- Automatic child cancellation on parent completion. The orchestrator explicitly calls `cancel_session` if it wants children stopped.
Permission model
Orchestrators run with the same identity as the user who started them. Spawning a child uses the same installation_id resolution as any other session — the child’s pod gets git credentials via the same sidecar → api → GitHub App path. There is no separate “orchestrator service account.”