A persistent, cost-bounded, audit-grade environment for long-running AI coding work — where context survives sessions, costs can't run away, and every decision leaves a trail. Field-tested across months of real multi-session campaigns; published here as a whitepaper plus working reference implementation.
A reference architecture, not a product. Six composable pillars, a small set of conventions, and ~600 lines of glue that turn any AI coding agent into a contractor with a project log instead of a goldfish.
The diagrams below describe the pattern. The code in the reference implementation is permissively licensed and was built around VS Code, GitHub Copilot Chat, an object-storage backend, and a local open-weights model on a consumer GPU — but nothing here is bound to those choices. Swap the host editor, swap the cloud, swap the local model. The pattern stands.
Modern AI coding agents are stateless. Every conversation starts cold. Memory is what the host platform decides to keep. Costs are unbounded. Audit is whatever the chat transcript happens to retain. Multi-day campaigns degrade into "we already discussed this — read the scrollback" loops, until eventually the scrollback gets compacted by an opaque summarizer and detail vanishes.
Agent OS treats the agent like a contractor who clocks in every day: they don't remember yesterday's work from their own head, they read the project log. The log lives on disk, in a vector brain, and in immutable cloud storage. The agent's job is to keep that log honest and to read it before acting.
User memory auto-loads every turn (~200 lines, ~30KB). Reference memory loads on demand. Session memory is scratch. Repo memory holds verified facts. A semantic vector brain handles fuzzy recall across tens of thousands of chunks.
A local open-weights model on a consumer GPU compresses verbose memory, distills brain recalls, generates cold-start briefs, and writes rolling summaries that pre-empt host-side compaction — all without spending a token of paid context. The expensive model thinks; the free model summarizes.
Nightly object-storage sync of every workspace byte. Time-based immutability blocks any delete, overwrite, even with the master key. Cool tier auto-tiers to Archive after 30 days. ~200 GB stored at well under a dollar a month.
Kill-switch sentinel + spend-gate script + cloud budget alert. Every billable script must clear all three before touching money. Default ceiling configurable. Failure mode = abort, never silently overspend.
Weekly snapshot to vault/brain/. Two newest kept hot. Plus a sealed .SEALED.zip on local disk with sentinel files preventing the agent from mistaking it for live data. Restoration: rename + expand, three steps.
Per-session journals + a one-line audit trail + the local-model distiller produce a ~3KB curated brief on demand. The successor agent reads it instead of guessing from a lossy auto-summary. Context survives compaction, restart, model swap.
Honest version: the local-compute pillar carries weight, and "consumer GPU" hides a wide range. Here is what to expect at each profile. Everything else in Agent OS (memory tiers, vault, guardrails, audit-trail, continuity discipline) works on any machine that runs your editor.
| Profile | Local model | Distill speed | Verdict |
|---|---|---|---|
| Workstation, 16 GB+ VRAM RTX 4080/5080/5090, M3/M4 Max, similar |
14B coder model, full quality | ~3–15 sec / summary | Reference-implementation experience. |
| Mid-range laptop, 8 GB VRAM RTX 3060/4060, M2 Pro, etc. |
7B coder model, q4 quantized | ~10–30 sec / summary | Workable. Summaries shorter, slightly noisier. |
| Integrated graphics no discrete GPU |
3B model on CPU, or hosted small model | ~30–90 sec / summary | Slow but functional. Or skip local entirely and use a cheap hosted model — the spend gate keeps it bounded. |
| Phone-class compute | — | — | Not the target. The pattern needs some available compute for distillation. |
The five non-compute pillars don't care about the GPU. If you can run an editor and reach object storage, you can run Memory + Vault + Guardrails + Brain backup + Continuity unchanged. Local compute is the cherry on top, not the load-bearing wall.
The pattern starts on a single workstation, but the semantic-brain pillar is designed to network. Run the brain as a containerized service exposing HTTP and Server-Sent Events, and any number of agents on any number of machines hit the same recall surface — no per-seat duplication of the corpus, no drift between team members, no re-embedding when a new agent joins.
remember() calls. Always writable.recall() queries both collections in parallel, merges by distance, returns top-N. Caller never sees the seam.remember by one agent is queryable by every other within seconds.claim() / release() / handoff() / pulse_others() so agents don't step on each other on a shared task.This is how the same architecture scales from "one developer, one machine, one Copilot" to "a team of humans + agents working a multi-month codebase together" without redesigning anything below the brain layer.
Plenty of good work exists in this space. Calling out the baselines so the contribution here is precise — Agent OS is the combination, not any one piece.
.cursorrules/memories/..aider.chat.historyWhat's distinctive here: the tiered-memory + WORM-vault + fail-closed cost gate + local-model distiller + networked brain — assembled and disciplined as one architecture, with conventions that keep the agent honest. Each individual pillar has prior art. The integration is the point.
Three commands sketch the pattern. The agent reconstructs working state from durable artifacts — no scrollback dive, no host summary required:
# 1. Load routing rules + preferences (auto-loaded; this is the manual form) Get-Content /memories/00-routing.md # 2. Generate a curated cold-start brief — local model, ~60 sec, zero-cost pwsh tools/qwen-coldstart.ps1 # 3. Read the latest session journal (compressed if >24h old) pwsh tools/latest-revival.ps1 -N 1
Paths shown are from the reference implementation. Equivalent flows compose easily on macOS / Linux / any shell.
Long-running coding campaigns. When work spans weeks across many sessions and any detail loss costs hours of rediscovery.
Regulated environments. WORM-immutable audit trail with cryptographic time-based retention. Every session, every decision, every command — restorable for 7 years.
Multi-agent handoffs. When one agent finishes and a different model picks up. Curated brief + audit-trail + repo memory mean the receiving agent doesn't start cold.
Cost-sensitive autonomy. When the agent has authority over billable resources. Three-layer fail-closed cost gate plus outside-in budget alert means autonomy without surprise bills.