In-cluster registry
Every agent image — the platform’s runtime-core, the shipped presets, and the workspace-authored images admins create in the image catalog — lives in a container registry running inside the same cluster as the x1agent control plane. This doc specifies the deployment, the namespacing scheme, how images are built, and the boundary between public-registry pulls and in-cluster pushes.
Companion docs:
- Runtime images — what `runtime-core` is and how admin images `FROM` it.
- Siblings — how images reference sibling images, which are typically pulled from public registries through the registry’s pull-through cache.
Why in-cluster
Three reasons it is not optional:
- Startup latency. Session pods are short-lived and spawn frequently. Pulling an admin-authored image from Docker Hub on every session start is a visible user-perceived delay. A node-local registry pulls once and serves to every subsequent pod.
- No external dependency. x1agent is deployable in air-gapped environments. The registry exists in-cluster so no image pull ever depends on outbound internet access.
- Push target for Kaniko builds. When a workspace admin saves a new image version, the Kaniko build Job pushes the result somewhere. That somewhere is the in-cluster registry.
Deployment
v1 ships with `registry:2` (the official CNCF Distribution image). Minimal and sufficient.
```mermaid
graph LR
subgraph ns[namespace: x1agent-infra]
reg[registry deployment<br/>image: registry:2<br/>replicas: 1]
svc[Service: x1-registry<br/>ClusterIP :5000]
pvc[PVC: x1-registry-data<br/>20Gi default]
end
subgraph pods[session pods across all workspaces]
ap[agent pod]
end
reg -. mounts .- pvc
svc -. routes to .- reg
ap -- pulls images --> svc
```
Deployment shape:
- One `Deployment` with `replicas: 1` (`registry:2` does not support multi-replica scale-out without shared storage and coordination; a single replica is fine for a single cluster).
- One `PersistentVolumeClaim` for storage. Default 20Gi; sizing guidance below.
- One `Service` (ClusterIP) exposing `:5000`.
- One `ConfigMap` with the registry’s `config.yml`, which configures pull-through mirrors for `docker.io`, `ghcr.io`, and `quay.io`.
- No `Ingress`. The registry is never exposed outside the cluster.
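As a sketch, the registry’s `config.yml` (held in the ConfigMap above) might look like the following, based on Distribution’s documented schema. One caveat worth noting: a single Distribution instance proxies exactly one upstream via `proxy.remoteurl`, so mirroring all three upstreams in practice means one registry process (or config) per upstream; `docker.io` is shown here.

```yaml
# Sketch only — field names follow Distribution's config reference.
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry   # backed by the x1-registry-data PVC
  delete:
    enabled: true                      # manifest deletion must be on for GC to free space
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io   # one upstream per instance
```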
The registry runs in an `x1agent-infra` namespace alongside Postgres and NATS. Helm values land under `deploy/helm/registry/` (see the repository’s `deploy/` tree once Phase 1 is built). devspace picks it up automatically during `mise run dev`.
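Hypothetical Helm values for `deploy/helm/registry/` — the key names below are illustrative, not the shipped chart’s schema:

```yaml
# Illustrative values sketch; matches the deployment shape described above.
image: registry:2
replicas: 1
service:
  name: x1-registry
  port: 5000
persistence:
  claimName: x1-registry-data
  size: 20Gi          # bump for production; see Storage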
Storage
Images grow without bound unless actively managed. Sizing:
- Platform images (runtime-core + presets): ~500 MB compressed per version. Expect 3–5 versions retained at any time → ~2.5 GB.
- Workspace images: depend heavily on language. A `python-django` image `FROM runtime-core` is typically 250–400 MB compressed; a full-stack polyglot image can hit 800 MB. Expect 5–10 versions per image.
- A 20-workspace cluster with 5 images per workspace at 5 versions each at 400 MB = 200 GB. Pick a PVC size that matches your expected scale; 20 GB is enough for single-workspace dev, 100+ GB is the right ballpark for a production cluster.
Garbage collection runs as a weekly CronJob that invokes registry garbage-collect against the registry’s config. Blob GC frees space by deleting blobs no longer referenced by any manifest. Before GC runs, the Job deletes manifests older than the per-image retention policy (retain last N versions, where N is configurable per workspace; default 5).
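A sketch of the weekly GC CronJob, assuming the data PVC name used elsewhere in this doc (`x1-registry-data`); the config ConfigMap name is hypothetical. The manifest-pruning step that enforces the retention policy is elided, and Distribution’s docs recommend running GC while the registry is read-only or quiesced:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: x1-registry-gc
  namespace: x1agent-infra
spec:
  schedule: "0 3 * * 0"          # weekly, Sunday 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: gc
              image: registry:2
              # Documented Distribution invocation; frees blobs unreferenced by any manifest.
              command: ["registry", "garbage-collect", "/etc/docker/registry/config.yml"]
              volumeMounts:
                - name: data
                  mountPath: /var/lib/registry
                - name: config
                  mountPath: /etc/docker/registry
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: x1-registry-data
            - name: config
              configMap:
                name: x1-registry-config   # hypothetical name
```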
Namespacing
Images are named with a two-level namespace that maps to authorization:
- `<service>/x1agent/<name>:<version>` — platform-maintained images
- `<service>/ws/<workspace-id>/<name>:<version>` — workspace-authored images
- `<service>/mirror/<upstream-registry>/<path>:<tag>` — pull-through cache

Where `<service>` is the in-cluster address (`x1-registry.x1agent-infra.svc.cluster.local:5000` internally; resolved via the cluster’s DNS).
Platform images. Written only by the platform’s CI pipeline. Read by every session pod.
Workspace images. Written only by the API’s image build controller on behalf of admins of that workspace. Read only by session pods running in that workspace. Cross-workspace pull is blocked at the API/authz level; the registry itself treats workspaces as plain path prefixes. See Access control below.
Pull-through mirror. Public images (e.g. `postgres:16` declared as a sibling) are not re-pushed; they are fetched from their upstream registry by the in-cluster registry and cached at `<service>/mirror/docker.io/library/postgres:16`. The first session that uses a given public image pays the upstream pull cost; every subsequent session on the same cluster hits the cache.
The admin UI and the API always present the short form (`postgres:16`); the pod-spec translator rewrites it to the mirror form at pod-spec generation time so pulls stay in-cluster.
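Concretely, a sibling declared as `postgres:16` might land in the generated pod spec as follows (registry address as above; the container name is illustrative):

```yaml
# Fragment of a generated session pod spec — sketch, not the translator's actual output.
containers:
  - name: sibling-postgres
    image: x1-registry.x1agent-infra.svc.cluster.local:5000/mirror/docker.io/library/postgres:16
```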
Access control
In v1, auth on the registry itself is intentionally minimal — the registry’s ClusterIP is only reachable from inside the cluster, and K8s NetworkPolicy restricts which pods can talk to it. Fine-grained RBAC (per-workspace push/pull tokens) is a follow-up once a second cluster is stood up and cross-cluster replication becomes relevant.
Until then:
- The API has write credentials (mounted as a K8s Secret) for the registry. It uses them to push built images and to write retention metadata.
- Session pods have read credentials. They use a workspace-scoped image pull secret bound into the pod spec.
- Workspaces cannot read each other’s images. Enforced at the API (image catalog endpoints are workspace-scoped) and at the pod-spec level (a session’s pull secret only includes its workspace’s namespace).
- Workspaces cannot write directly. Image writes always go through the Kaniko build controlled by the API; no endpoint lets a user push a raw image.
Admission controllers (OPA/Kyverno) in x1agent deployments can further restrict which registries session pods can pull from. The default policy permits only the in-cluster registry and its mirror prefix.
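A hedged Kyverno sketch of such a default policy — policy and rule names are illustrative, and a single host-prefix pattern covers both the in-cluster namespaces and the `mirror/` prefix, since they share the same registry address:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-session-registries   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: in-cluster-registry-only
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Session pods may only pull from the in-cluster registry."
        pattern:
          spec:
            containers:
              - image: "x1-registry.x1agent-infra.svc.cluster.local:5000/*"
```

A real deployment would scope the `match` block to session namespaces and add an `initContainers` pattern; this sketch shows only the shape of the allow-list.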
Build pipeline
Images are built by Kaniko — a widely used K8s-native builder. Kaniko runs as an unprivileged container, reads a Dockerfile, builds the image layer-by-layer, and pushes to a registry. No Docker daemon, no privileged containers, no host socket.
```mermaid
sequenceDiagram
participant UI as Browser
participant API
participant K8s
participant Kaniko as Kaniko Job
participant Reg as in-cluster registry
participant N as NATS
UI->>API: POST /images/:id/versions { dockerfile, siblings_yaml }
API->>API: validate, compute dockerfile_hash
API->>K8s: create Kaniko Job with Dockerfile in a ConfigMap
K8s->>Kaniko: schedule pod
Kaniko->>Reg: pull FROM runtime-core (if needed)
Kaniko->>Kaniko: build layers
Kaniko->>Reg: push ws/<id>/<name>:<version>
Kaniko-->>K8s: exit 0 (or exit 1 on failure)
K8s-->>API: Job status watcher
API->>API: update agent_image_versions row (status = succeeded/failed)
API->>N: publish image.build.<id>.status
N->>UI: status event (in-flight progress)
```
Key properties:
- One Kaniko Job per build. Jobs are not reused. Each version’s build is its own pod; logs are kept as the Job’s pod logs plus a copy via Postgres’s `agent_image_versions.log_ref` pointer to a blob store.
- ConfigMap holds the Dockerfile. Kaniko mounts it from a ConfigMap that’s created per-build and deleted on success. No build-context tarball is shipped around; the Dockerfile is the entire context in v1 (admins cannot `COPY` local files from their laptop). If `COPY` from the repo becomes a need, we ship a git-context fetcher as a Phase 3 follow-up — sidestepping it for v1 keeps the surface small.
- Log streaming. The API subscribes to the Kaniko pod’s logs and forwards lines to NATS subject `x1.image.build.<version_id>.logs`. The admin UI’s build-log viewer subscribes the same way the session event viewer subscribes to session events — same pattern, same infrastructure.
- Concurrency. Per workspace, one build at a time by default; a global concurrency cap prevents runaway clusters during a mass rebuild. Configurable in the platform config.
- Cache. Kaniko supports a layer cache on a shared volume; in v1 we run without cache (every build is clean). Cached builds are a follow-up optimization.
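Putting the properties above together, the per-build Job the API creates might look roughly like this — resource names and the exact Kaniko flags are assumptions, not the controller’s actual output:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: image-build-<version-id>        # hypothetical naming scheme
  namespace: x1agent-infra
spec:
  backoffLimit: 0                       # one attempt; failures surface to the API's watcher
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - --dockerfile=/workspace/Dockerfile
            - --context=dir:///workspace   # the ConfigMap is the entire build context in v1
            - --destination=x1-registry.x1agent-infra.svc.cluster.local:5000/ws/<workspace-id>/<name>:<version>
            - --insecure                   # plain-HTTP in-cluster registry, per v1's minimal auth
          volumeMounts:
            - name: dockerfile
              mountPath: /workspace
      volumes:
        - name: dockerfile
          configMap:
            name: image-build-<version-id>   # per-build ConfigMap, deleted on success
```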
Scenarios
Admin creates a new Python/Django image
- Admin clicks Images → New, fills in name, Dockerfile, siblings YAML. Save.
- API validates, creates an `agent_images` row and a first `agent_image_versions` row with `status: pending`.
- API creates a Kaniko Job and a ConfigMap containing the Dockerfile.
- Kaniko pulls `x1agent/runtime-core:v1` from the registry, builds the image, pushes to `ws/<id>/python-django:v1`.
- API watcher updates the row to `status: succeeded` and sets `built_ref` to the registry reference with digest.
- Admin’s UI flips from “building” to “ready” — the image is now selectable on agent edit screens.
Admin edits the Dockerfile
- Edit the Dockerfile text. Save.
- API computes the new hash, creates a new `agent_image_versions` row with `status: pending`. The previous version stays around.
- Kaniko builds, pushes as `ws/<id>/python-django:v2`.
- Admin flips `current_version_id` to v2 (or the flip happens automatically, depending on workspace policy).

Rollback is changing `current_version_id` back. The built image is still in the registry; no rebuild needed.
Session uses a shipped preset
- Workspace admin assigns `x1agent/preset-python-django:v1` to an agent. No Dockerfile is authored by the admin.
- Preset images are owned by the platform team and pushed by CI; workspaces can consume them but not edit them.
Session references a public image as a sibling
- Agent’s siblings declare `image: postgres:16`.
- API rewrites to `<reg>/mirror/docker.io/library/postgres:16` in the pod spec.
- Registry’s pull-through cache fetches `postgres:16` from Docker Hub on first use; subsequent sessions hit the cache.
Future
- Harbor for multi-cluster replication, RBAC, vulnerability scanning. Migration path: push images to Harbor, change the registry Service to point at it. Clients (session pods, Kaniko) don’t care which OCI registry sits behind the Service.
- Cosign signing for platform-maintained images. `x1agent/runtime-core` gets signed by a keyless GitHub Actions workflow; an admission policy verifies signatures on pull.
- Layer cache for Kaniko builds. Shared cache volume, content-addressed. Cuts incremental build times from 30s–3min to seconds.
- Image provenance (SLSA). Build attestations recorded alongside each version.
None of these ship in v1. Plain `registry:2` + Kaniko + a namespacing scheme that maps to workspace authorization is enough to prove the architecture.
Summary
- One in-cluster `registry:2` Deployment, one PVC, one Service.
- Namespacing: `x1agent/<name>` for platform, `ws/<id>/<name>` for workspaces, `mirror/*` for pull-through.
- Admin-authored images are built by Kaniko Jobs, pushed to the workspace namespace, and read-only for session pods.
- Pull-through cache serves public sibling images.
- No external exposure, no cross-workspace reads, no direct user writes.