What this is

A reference architecture, not a product. Six composable pillars, a small set of conventions, and ~600 lines of glue that turn any AI coding agent into a contractor with a project log instead of a goldfish.

The diagrams below describe the pattern. The code in the reference implementation is permissively licensed and was built around VS Code, GitHub Copilot Chat, an object-storage backend, and a local open-weights model on a consumer GPU — but nothing here is bound to those choices. Swap the host editor, swap the cloud, swap the local model. The pattern stands.

The problem this solves

Modern AI coding agents are stateless. Every conversation starts cold. Memory is what the host platform decides to keep. Costs are unbounded. Audit is whatever the chat transcript happens to retain. Multi-day campaigns degrade into "we already discussed this — read the scrollback" loops, until eventually the scrollback gets compacted by an opaque summarizer and detail vanishes.

Agent OS treats the agent like a contractor who clocks in every day: they don't remember yesterday's work from their own head, they read the project log. The log lives on disk, in a vector brain, and in immutable cloud storage. The agent's job is to keep that log honest and to read it before acting.

Six pillars

Memory

Tiered, durable, agent-readable

User memory auto-loads every turn (~200 lines, ~30KB). Reference memory loads on demand. Session memory is scratch. Repo memory holds verified facts. A semantic vector brain handles fuzzy recall across tens of thousands of chunks.

Local compute

Zero-cost compression and distillation

A local open-weights model on a consumer GPU compresses verbose memory, distills brain recalls, generates cold-start briefs, and writes rolling summaries that pre-empt host-side compaction — all without spending a token of paid context. The expensive model thinks; the free model summarizes.

Vault

WORM-immutable backup, 7-year retention

Nightly object-storage sync of every workspace byte. Time-based immutability blocks any delete or overwrite, even with the master key. The Cool tier auto-tiers to Archive after 30 days. ~200 GB stored at well under a dollar a month.

Guardrails

Three layers, fail-closed

Kill-switch sentinel + spend-gate script + cloud budget alert. Every billable script must clear all three before touching money. Default ceiling configurable. Failure mode = abort, never silently overspend.
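A minimal Python sketch of the first two layers may make "fail-closed" concrete. The reference implementation uses a PowerShell script (spend-gate.ps1); this is not that script. The function names, the `read_spend` callback, and reusing exit code 99 for a spend-gate abort are assumptions for illustration. The third layer, the cloud budget alert, is configured on the provider side and is not shown.

```python
import math
import sys
from pathlib import Path
from typing import Callable, Tuple

KILL_SWITCH = Path(".qwen-killswitch")  # sentinel file name used in this document
EXIT_BLOCKED = 99  # exit code the document gives for a kill-switch abort


def gate_allows(
    read_spend: Callable[[], float],
    ceiling: float,
    kill_switch: Path = KILL_SWITCH,
) -> Tuple[bool, str]:
    """Return (allowed, reason). Every uncertain case resolves to 'blocked'."""
    if kill_switch.exists():
        return False, f"kill switch present: {kill_switch}"
    try:
        spend = float(read_spend())
    except Exception as exc:
        # Fail closed: an unreadable bill is treated the same as an exceeded one.
        return False, f"could not read month-to-date spend: {exc}"
    if not math.isfinite(spend) or spend < 0:
        return False, f"implausible spend value: {spend!r}"
    if spend >= ceiling:
        return False, f"spend {spend:.2f} is at or above ceiling {ceiling:.2f}"
    return True, f"spend {spend:.2f} is below ceiling {ceiling:.2f}"


def require_gate(read_spend: Callable[[], float], ceiling: float) -> None:
    """Call at the top of a billable script; exits instead of returning False."""
    allowed, reason = gate_allows(read_spend, ceiling)
    if not allowed:
        print(f"ABORT: {reason}", file=sys.stderr)
        sys.exit(EXIT_BLOCKED)
```

The design choice worth noting is that the error path and the over-budget path are the same path: if the spend lookup throws, returns nothing, or returns NaN, the script aborts rather than proceeding on a guess.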

Brain backup

Two cloud copies + one sealed local

Weekly snapshot to vault/brain/. Two newest kept hot. Plus a sealed .SEALED.zip on local disk with sentinel files preventing the agent from mistaking it for live data. Restoration: rename + expand, three steps.
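One way the sentinel convention could be honored by an agent's file search is sketched below in Python. The sentinel name DO-NOT-READ.txt comes from the reference implementation; the function name `iter_live_files` and the rule "a directory containing the sentinel is skipped along with everything beneath it" are assumptions, not a description of the actual tooling.

```python
import os
from pathlib import Path
from typing import Iterator

SENTINEL = "DO-NOT-READ.txt"  # sentinel file name used in this document


def iter_live_files(root: Path, sentinel: str = SENTINEL) -> Iterator[Path]:
    """Yield files under root, skipping any directory that holds the sentinel."""
    for dirpath, dirnames, filenames in os.walk(root):
        if sentinel in filenames:
            # Clearing dirnames in place stops os.walk from descending further.
            dirnames[:] = []
            continue
        for name in filenames:
            yield Path(dirpath) / name
```

A search or indexing tool built on a walker like this never sees the sealed archive, so a stale backup cannot be mistaken for live data.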

Continuity

Cold-start in 60 seconds

Per-session journals + a one-line audit trail + the local-model distiller produce a ~3KB curated brief on demand. The successor agent reads it instead of guessing from a lossy auto-summary. Context survives compaction, restart, model swap.

Architecture at a glance

Hot context
User memory
Auto-loaded every turn. Routing rules, preferences, audit-trail tail.
/memories/*.md
Session memory
Per-conversation scratch. Cleared on close.
/memories/session/
Repo memory
Codebase facts. Append-only, verified at write time.
/memories/repo/
Warm context
Reference memory
Loaded on routing trigger. Topic-scoped per project area. Listed but not auto-included.
/memories/reference/
Semantic brain
Tens of thousands of chunks. Fuzzy recall across SDKs, documentation, prior work, and decisions.
vector store · cloud-hosted
Cold-start brief
Local-model-distilled ~3KB summary of state files + last N audit rows + newest session journal.
coldstart distiller
Cold storage
Draft container
Mutable working backup. 30-day soft-delete window. Versioning + change feed.
object storage · Cool tier
Vault container
WORM-immutable 7yr. Snapshots, one-time promotes. Lifecycle to Archive at 30d.
object storage · Archive tier
Sealed local
.SEALED.zip with DO-NOT-READ.txt sentinel + .gitignore. Agent searches skip it.
_sealed-brain-backups/
Guardrails
Kill switch
Touch a sentinel file. All billable scripts abort with exit code 99 until it is removed.
.qwen-killswitch
Spend gate
Checks cloud month-to-date spend vs ceiling before any billable op. Fail-closed.
spend-gate.ps1
Cloud budget
Email alerts at 50% / 75% / 100% actual + 100% forecasted. Outside-in safety net.
provider budgets API
Discipline
Audit-trail
Every session writes one row. Open / closed / locked status. Survives all compaction.
audit-trail.md
Session journals
Per-session detail dossier. Concrete, dense, restorable. Archived when stale.
REVIVAL-*.md
Fix-at-root reflex
Encoded rule: notice bug in shared infra → fix source first, never route around.
fix-at-root-reflex.md

By the numbers

<$1
Azure Blob storage
7 yr
WORM retention
~3 KB
Cold-start brief
3
Independent fail-safes

Hardware reality

Honest version: the local-compute pillar carries weight, and "consumer GPU" hides a wide range. Here is what to expect at each profile. Everything else in Agent OS (memory tiers, vault, guardrails, audit-trail, continuity discipline) works on any machine that runs your editor.

| Profile | Local model | Distill speed | Verdict |
|---|---|---|---|
| Workstation, 16 GB+ VRAM (RTX 4080/5080/5090, M3/M4 Max, similar) | 14B coder model, full quality | ~3–15 sec / summary | Reference-implementation experience. |
| Mid-range laptop, 8 GB VRAM (RTX 3060/4060, M2 Pro, etc.) | 7B coder model, q4 quantized | ~10–30 sec / summary | Workable. Summaries shorter, slightly noisier. |
| Integrated graphics (no discrete GPU) | 3B model on CPU, or hosted small model | ~30–90 sec / summary | Slow but functional. Or skip local entirely and use a cheap hosted model; the spend gate keeps it bounded. |
| Phone-class compute | n/a | n/a | Not the target. The pattern needs some available compute for distillation. |

The five non-compute pillars don't care about the GPU. If you can run an editor and reach object storage, you can run Memory + Vault + Guardrails + Brain backup + Continuity unchanged. Local compute is the cherry on top, not the load-bearing wall.
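The hardware profiles above can be read as a small lookup. The Python sketch below is illustrative only: the `pick_profile` name is hypothetical, and the cutoffs at 16 GB and 8 GB are an assumption about how to treat machines that fall between the listed profiles.

```python
from typing import NamedTuple


class DistillProfile(NamedTuple):
    model: str
    seconds_per_summary: str


def pick_profile(vram_gb: float) -> DistillProfile:
    """Map available VRAM to the nearest profile listed in this section."""
    if vram_gb >= 16:
        return DistillProfile("14B coder model, full quality", "~3-15")
    if vram_gb >= 8:
        return DistillProfile("7B coder model, q4 quantized", "~10-30")
    # Integrated graphics or less: CPU inference or a cheap hosted model.
    return DistillProfile("3B model on CPU, or hosted small model", "~30-90")
```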

Scaling beyond one workstation: the networked brain

The pattern starts on a single workstation, but the semantic-brain pillar is designed to network. Run the brain as a containerized service exposing HTTP and Server-Sent Events, and any number of agents on any number of machines hit the same recall surface — no per-seat duplication of the corpus, no drift between team members, no re-embedding when a new agent joins.

Brain service
Frozen read collection
Curated corpus baked into the image at build time. Sealed vector segment, fast queries, deterministic across deploys. Restart-safe.
prebuilt-index/ in container
Sidecar write collection
Lazily created at first write in the same persistent client. Holds new memories, ad-hoc corpus loads, per-agent remember() calls. Always writable.
side-collection · same DB file
Merge at query time
Each recall() queries both collections in parallel, merges by distance, returns top-N. Caller never sees the seam.
recall.py merge layer
Multi-agent
Shared HTTP/SSE
Multiple agents, running the same model or different ones, connect to one brain URL. Every remember() call by one agent is queryable by every other agent within seconds.
/sse · /tool/recall · /tool/remember
Coordination primitives
Optional layer adds claim() / release() / handoff() / pulse_others() so agents don't step on each other on a shared task.
multi-agent coord wedge
Audit-trail still wins
Even with a shared brain, the canonical decision log is the per-repo audit-trail. Brain is for fuzzy recall; audit-trail is for ground truth.
audit-trail.md (per workspace)
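The merge-at-query-time step might look roughly like the Python sketch below. It is not the reference recall.py: the `Hit` shape, the `Search` callback signature, and the function name are assumptions. It also assumes both collections share one embedding model and one distance metric, since distances from different metrics cannot be compared directly.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Hit = Tuple[float, str]  # (distance, chunk text); a smaller distance is a closer match
Search = Callable[[str, int], List[Hit]]  # hypothetical per-collection query function


def recall(query: str, frozen: Search, sidecar: Search, top_n: int = 5) -> List[Hit]:
    """Query both collections in parallel and return the top_n closest hits overall."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(search, query, top_n) for search in (frozen, sidecar)]
        hits = [hit for future in futures for hit in future.result()]
    # Each side returns up to top_n hits, so the merged top_n is never short-changed.
    return heapq.nsmallest(top_n, hits, key=lambda hit: hit[0])
```

Asking each collection for the full `top_n` rather than half matters: if all the best matches sit in the sidecar, the caller still gets them.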

This is how the same architecture scales from "one developer, one machine, one Copilot" to "a team of humans + agents working a multi-month codebase together" without redesigning anything below the brain layer.

How a session actually runs

  1. Trigger. User types a routing keyword. Agent loads scoped memory, runs brain pulse, fetches latest revival doc.
  2. Work. Edits, searches, runs commands, calls scripts. Branding stamp + version bump on every touched file.
  3. Checkpoint. Every ~5 exchanges, brain checkpoint. Discoveries written to durable memory immediately, not "noted for later."
  4. Touched-folder log. Anything outside the workspace gets registered so the next backup grabs it.
  5. Close. Audit-trail row + REVIVAL doc + brain remember(). Session memory archived.
  6. Overnight. Scheduled tasks run vault sync, weekly brain snapshot, and rolling-window memory archive — all gated by spend-gate.
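The audit-trail write in the Close step could be as small as the Python sketch below. The three statuses (open, closed, locked) come from this document; the pipe-delimited row layout, the function name, and the session ID format are assumptions, since the actual layout of audit-trail.md is not specified here.

```python
from datetime import date
from pathlib import Path
from typing import Optional

VALID_STATUS = ("open", "closed", "locked")  # the three statuses named in this document


def append_audit_row(
    trail: Path,
    session_id: str,
    status: str,
    summary: str,
    day: Optional[date] = None,
) -> str:
    """Append one single-line row to the audit trail and return it."""
    if status not in VALID_STATUS:
        raise ValueError(f"status must be one of {VALID_STATUS}, got {status!r}")
    # Collapse whitespace and strip pipes so the row stays one greppable line.
    one_line = " ".join(summary.replace("|", "/").split())
    stamp = (day or date.today()).isoformat()
    row = f"| {stamp} | {session_id} | {status} | {one_line} |"
    with trail.open("a", encoding="utf-8") as handle:
        handle.write(row + "\n")
    return row
```

Opening the file in append mode keeps earlier rows untouched, which matches the one-row-per-session convention.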

What it replaces

Default agent experience

  • Cold start every chat. "Read the scrollback."
  • Auto-compaction picks what survives. Lossy by surprise.
  • Chat-only memory. Disk reality drifts from agent reality.
  • No cost ceiling. Bug in a loop = morning email from billing.
  • Backups are someone else's job.
  • Audit = scrollback search.

Agent OS

  • Cold start reads a 3KB curated brief. Zero context loss.
  • Compaction can happen safely — durable memory survives it.
  • Memory tiers map to disk, vector brain, and immutable cloud.
  • Three-layer cost gate. Worst case = abort, never overspend.
  • Nightly azcopy + weekly brain snapshot + sealed local copy.
  • Audit-trail with one row per session. Always greppable.

How this fits next to existing tools

Plenty of good work exists in this space. Calling out the baselines so the contribution here is precise — Agent OS is the combination, not any one piece.

GitHub Copilot Chat memory
tiered memory · per workspace
The actual host platform behind the reference implementation. Provides the memory tiers Agent OS leans on. What's missing out of the box: durable cold storage, cost gates, the discipline conventions, and local-model distillation.
Cursor · .cursorrules
project-scoped instructions
Solves preference persistence and routing. No durable journal, no audit trail, no cold-start protocol, no cost gates. Sits at the same tier as the routing rules in /memories/.
Cline / Roo Code
custom-instructions + agentic loops
Strong on agentic tool use. Memory is per-conversation. Same gap as default Copilot: no durable layer between conversations, no cost gate, no immutable audit.
Aider · .aider.chat.history.md
conversation log + git commits
Audit-as-git is excellent — every change is a commit. Closest existing analog to the audit-trail pillar. Doesn't have the brain, the vault, or the cost gates.
Continue.dev
configurable IDE assistant
Configurable model routing and context providers. Comparable to the routing layer. Memory and audit story is lighter; no cold-storage tier.
LangChain / LlamaIndex memory
programmatic memory abstractions
Library-level building blocks for memory. You'd assemble something like the memory + brain pillars on top of these. Doesn't speak to discipline conventions or cost gates — that's outside their scope.
AutoGen · CrewAI · smolagents
multi-agent orchestration
Solves how agents coordinate during a session. Agent OS is concerned with what survives between sessions. Complementary, not competing: wire any of these on top of the brain layer for multi-agent work.

What's distinctive here: the tiered-memory + WORM-vault + fail-closed cost gate + local-model distiller + networked brain — assembled and disciplined as one architecture, with conventions that keep the agent honest. Each individual pillar has prior art. The integration is the point.

Try the cold-start protocol

Three commands sketch the pattern. The agent reconstructs working state from durable artifacts — no scrollback dive, no host summary required:

# 1. Load routing rules + preferences (auto-loaded; this is the manual form)
Get-Content /memories/00-routing.md

# 2. Generate a curated cold-start brief — local model, ~60 sec, zero-cost
pwsh tools/qwen-coldstart.ps1

# 3. Read the latest session journal (compressed if >24h old)
pwsh tools/latest-revival.ps1 -N 1

Paths shown are from the reference implementation. Equivalent flows compose easily on macOS / Linux / any shell.
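For readers outside PowerShell, the input-gathering half of the protocol might be sketched in Python as follows. This is not a port of qwen-coldstart.ps1 or latest-revival.ps1: the function names and the section headings in the output are hypothetical, and the call to the local model that turns this bundle into the ~3KB brief is deliberately left out.

```python
from pathlib import Path
from typing import List, Optional


def tail_lines(path: Path, n: int) -> List[str]:
    """Return the last n lines of a text file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines[-n:] if n > 0 else []


def newest_journal(journal_dir: Path, pattern: str = "REVIVAL-*.md") -> Optional[Path]:
    """Return the most recently modified session journal, or None if there is none."""
    return max(journal_dir.glob(pattern), key=lambda p: p.stat().st_mtime, default=None)


def build_coldstart_input(audit_trail: Path, journal_dir: Path, audit_rows: int = 10) -> str:
    """Bundle recent audit rows and the newest journal as raw input for a distiller."""
    parts: List[str] = []
    if audit_trail.exists():
        rows = "\n".join(tail_lines(audit_trail, audit_rows))
        parts.append(f"## Recent audit rows\n{rows}")
    journal = newest_journal(journal_dir)
    if journal is not None:
        body = journal.read_text(encoding="utf-8")
        parts.append(f"## Latest journal: {journal.name}\n{body}")
    return "\n\n".join(parts)
```

Everything here reads from disk, so the result is the same whether or not the previous chat's scrollback still exists.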

Where this could matter

Long-running coding campaigns. When work spans weeks across many sessions and any detail loss costs hours of rediscovery.

Regulated environments. WORM-immutable audit trail with time-based retention. Every session, every decision, every command: restorable for 7 years.

Multi-agent handoffs. When one agent finishes and a different model picks up. Curated brief + audit-trail + repo memory mean the receiving agent doesn't start cold.

Cost-sensitive autonomy. When the agent has authority over billable resources. A three-layer fail-closed cost gate, with the cloud budget alert as its outside-in layer, means autonomy without surprise bills.