Engram Memory for Agents on OCI - Part 1. The Why and the What

 



Copyright: Sanjay Basu


1. Why Memory, and Why Now

Long context is not memory. This is the sentence I keep wanting to staple to people’s foreheads at conferences. Yes, frontier models will happily accept two million tokens of input. No, that does not mean stuffing every prior conversation into the prompt is a good idea, even if you can afford the bill, which most companies cannot.

The empirical case against the long-context-as-memory pattern is by now embarrassingly well documented. The original Lost in the Middle paper from Stanford showed that retrieval accuracy collapses when relevant facts are buried in the middle of a long context window. Every follow-up study since (NoLiMa, Michelangelo, RULER, the whole genre) has confirmed the same shape. Effective context length is much smaller than nominal context length. The model’s attention is not democratic. It cares about the beginning, it cares about the end, and the middle goes to the same place socks go in the dryer.

Then you have the economics. Every time you replay the entire history into the prompt, you pay for the entire history. Anthropic and OpenAI both meter input tokens, which means a chatty user with a long-running thread is effectively a slowly leaking faucet running directly into your billing account. Caching helps. It does not solve the problem.

And then there is the worst category of failure, which is when long context appears to work. The model gives a coherent answer, you ship the feature, and three weeks later a customer complains because the agent told them to set up an integration that the user explicitly disabled four conversations ago. The relevant fact was in the context. The model just did not weight it properly. This is the failure mode that gets people fired, and it is structurally inevitable in any architecture that treats memory as an undifferentiated firehose of past tokens.

What you actually want is closer to how human declarative memory works. Episodic experiences come in noisy, the brain runs an asynchronous consolidation process during sleep and quiet wakefulness, and what survives is a small set of dense, semantically organized traces that can be retrieved on demand. The hippocampus does the encoding, the neocortex stores the long-term result, and the inconvenient stuff gets forgotten on purpose because forgetting is a feature.

The Tonegawa group calls these surviving traces engram cells. The dynamics are not static. Recent work at Buffalo and elsewhere has shown that engrams refine themselves over the consolidation period, with neurons dropping in and out of the ensemble until the memory becomes selective rather than vague. The point is that biological memory is a pipeline, not a tape.

Engram, the Weaviate product, is essentially a software architecture that takes this seriously. The idea is to have a memory service that is structurally separate from the conversation, runs asynchronously, extracts atomic facts from raw inputs, reconciles them against what is already known, and persists only the durable result. The agent then queries this store at retrieval time. Same shape as the brain. Different substrate.

2. The Engram Pattern in One Page

Strip away the marketing and an engram-style memory service is built around four primitives. I am going to use Weaviate’s naming because it is the cleanest in the field right now, but the same shape shows up in MemGPT, Letta, mem0, and a half dozen other research prototypes. Convergent evolution, as it were.

Topics. Natural language descriptions of what kinds of facts are worth remembering. UserPreferences, ConversationSummary, TaskGoal, Experience. Topics function as magnets. The extractor only pulls out information that matches one of them. Without topics, you are just summarizing everything, which is the same problem you started with.

Scopes. Who can see this memory. User-scoped means the memory belongs to one specific user and is hard-isolated. Project-scoped means everyone using the agent can see it (useful for shared learning). Property-scoped is for soft isolation, like tagging a memory with a conversation_id so it can be filtered if needed but is not strictly walled off.

Pipelines. A graph of asynchronous steps. Extract pulls memories from raw input. TransformWithContext queries the existing store and decides whether the new memory is a duplicate, an update, a contradiction, or a genuinely new fact. Buffer holds intermediate results until some condition is met. Commit is the only step that actually writes to durable storage.

Bounded vs unbounded. Some topics should produce at most one memory per scope. A user profile, for instance. The system enforces that constraint at write time. Unbounded topics can have many memories per scope and are the default.
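
To make the primitives concrete, here is a minimal sketch of how topic and scope declarations might look in configuration. The field names and values are illustrative assumptions, not the actual Engram API; the point is that the opinions (what to remember, who sees it, whether it is bounded) are written down before any memory exists.

```python
# Illustrative only: hypothetical topic/scope declarations for an
# engram-style memory service. Field names are assumptions, not the
# actual Weaviate Engram API.
MEMORY_TOPICS = [
    {
        "name": "UserPreferences",
        "description": "Stable preferences the user states about how they want "
                       "the agent to behave (tone, format, defaults).",
        "scope": "user",      # hard-isolated to a single user
        "bounded": True,      # at most one memory per scope; updates overwrite
    },
    {
        "name": "Experience",
        "description": "Lessons learned during task execution that should change "
                       "how future runs plan or delegate work.",
        "scope": "project",   # shared across every user of the agent
        "bounded": False,     # many memories per scope; the default
    },
]
```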

The reason this is a useful framing, and not just another vector database with a fancier API, is that it forces you to make decisions up front about what you actually care about remembering. Most agent failures I see in customer engagements are not failures of retrieval. They are failures of curation. The system remembered everything, including the noise, and could not tell what was signal at retrieval time. Engram-style topics are an opinion about signal expressed at write time. That opinion is the whole game.


3. Why Build This on OCI Instead of Just Using Engram

Reasonable question. The answer has three parts and one of them is uncomfortable.

First, the comfortable part. A lot of the customers I talk to at Oracle have data residency, sovereign deployment, or regulatory constraints that make any externally hosted memory service a non-starter. If your customer data lives in a German tenancy because of GDPR Article 44 or a Saudi tenancy because of National Cybersecurity Authority requirements, then routing every chatbot turn through a third-party SaaS to extract memories is structurally impossible. You need the memory service inside your own compartment, ideally on the same network as the database.

Second, the slightly less comfortable part. If you already have Oracle Database 23ai running, you already have a vector database. That is not marketing copy, that is just SQL. AI Vector Search ships with the VECTOR data type, HNSW and IVF indexes, native embedding model integration via ONNX, and a hybrid vector index that combines lexical and semantic search in one query. Bolting on a separate vector store is a thing people do, but the joins get awkward, the consistency guarantees get weaker, and you end up explaining to auditors why your customer’s name lives in two systems with two backup policies and two access control models. With 23ai you can keep the memories next to the row that owns them and write a single SQL query that filters on tenant, joins to a customer record, and ranks by cosine distance, all in one pass.

Third, the uncomfortable part. AgentSpec exists, is open source under Apache 2.0 and UPL, and is genuinely good. Oracle published it last fall as a framework-agnostic declarative language for defining agents, with adapters for LangGraph, AutoGen, CrewAI, and the WayFlow reference runtime. It does for agent definitions roughly what ONNX did for neural network architectures, which is to say it makes them portable. If you have ever ported a multi-agent system from one framework to another and emerged with any hair left, you know why this matters.

Building on OCI lets us write the agents once in AgentSpec, run them on whichever runtime makes sense (WayFlow for production, LangGraph for development, anything else if requirements change), persist memories in a database that already speaks vectors, and keep everything inside one tenancy with one IAM policy and one set of audit logs. That is the value proposition. It is not the only valid architecture. It is just the one that does not require you to apologize to your security team.

The engram is the abstraction. The pipeline is the contract. The database is the substrate. The agent is the consumer. Mix these up and the result is what we politely call vibes-based architecture.

4. The Service Architecture

Here is the high-level shape of what we are going to build. I want to walk through this carefully because the temptation when designing a memory service is to collapse the architecture into one box labeled memory and ship it. That box is where the failures hide.


4.1 The Write Path

When an agent or application has new raw data (a chat turn, a tool call, an event, anything), it sends a small payload to the Memory Write Service. This service does almost nothing. It validates the payload, attaches a scope (user, project, or property), assigns an ingestion ID, writes the raw payload to OCI Object Storage for audit purposes, and publishes a message onto an OCI Streaming topic called memory.raw. Then it returns. Total latency on this path should be under 50 milliseconds. The agent does not wait for memory extraction. That is the whole point.

The reason for the streaming layer is twofold. One, it gives you durability for free. OCI Streaming is Kafka-compatible and the messages are persisted before the producer gets an ack, so if the extraction worker crashes mid-process you can replay. Two, it gives you ordering. Streaming partitions are ordered by key, and if you key on user_id or conversation_id, you get strict in-order processing of all memory updates for that scope, which matters more than you would think when one fact contradicts another.
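
Here is a minimal sketch of that write path, using a plain Kafka producer against OCI Streaming's Kafka-compatible endpoint. The endpoint, credentials, and payload fields are placeholders; the part that matters is keying the message on user_id so all updates for one scope land on one ordered partition.

```python
import json
import uuid
from confluent_kafka import Producer

# Placeholder connection details for the Kafka-compatible OCI Streaming endpoint.
producer = Producer({
    "bootstrap.servers": "cell-1.streaming.<region>.oci.oraclecloud.com:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "<tenancy>/<user>/<stream-pool-ocid>",
    "sasl.password": "<auth-token>",
})

def ingest(raw_turn: dict, user_id: str, scope: str = "user") -> str:
    """Validate, tag, and publish a raw payload; return the ingestion ID."""
    ingestion_id = str(uuid.uuid4())
    payload = {
        "ingestion_id": ingestion_id,
        "scope": scope,
        "user_id": user_id,
        "raw": raw_turn,
    }
    # The raw payload would also be written to Object Storage here for audit.
    # Keying on user_id keeps all updates for one user on one ordered partition.
    producer.produce("memory.raw", key=user_id, value=json.dumps(payload))
    producer.poll(0)  # serve delivery callbacks without blocking the caller
    return ingestion_id
```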

4.2 The Pipeline Orchestrator

This is where AgentSpec earns its keep. The orchestrator runs a WayFlow runtime that has been configured with one Flow per memory pipeline. Each Flow is a DAG of nodes that map directly onto the engram pattern (extract, transform, buffer, consolidate, commit). The orchestrator subscribes to the memory.raw topic, picks up new messages, and kicks off the appropriate Flow based on which group the message belongs to.

Why WayFlow and not just hand-rolled Python? Because the Flow definition is a declarative JSON object that can be checked into version control, diffed in pull requests, and swapped out at runtime without redeploying the service. When your CTO asks why you changed the consolidation prompt and you can show them the git history of the Flow definition, that is a good day. When you cannot, it is a different day.
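
A sketch of what such a version-controlled definition could look like follows. This is not the real WayFlow Flow schema, which Part 2 covers; it only illustrates the shape of a DAG that maps onto the engram steps and can be diffed in a pull request.

```python
# Hypothetical pipeline definition in the shape the orchestrator might consume.
# Node types mirror the engram pattern; the schema itself is an assumption.
USER_PREFERENCES_FLOW = {
    "name": "user_preferences_pipeline",
    "trigger": {"stream": "memory.raw", "topic_filter": "UserPreferences"},
    "nodes": [
        {"id": "extract", "type": "llm_extract",
         "prompt_ref": "prompts/extract_user_prefs@v3"},
        {"id": "transform", "type": "transform_with_context",
         "depends_on": ["extract"]},
        {"id": "buffer", "type": "buffer",
         "flush": {"max_items": 10, "max_age_seconds": 300},
         "depends_on": ["transform"]},
        {"id": "commit", "type": "commit",
         "depends_on": ["buffer"]},
    ],
}
```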


4.3 The Workers

Each pipeline node, when triggered, dispatches its actual work to a stateless worker. We can run these as OCI Functions for the lightweight steps (extract, buffer flush) and as OKE pods for the heavier ones that need more memory or longer-lived connections (TransformWithContext, which has to query the database for related memories before deciding what to do).

The workers call out to two AI services. OCI Generative AI handles the embedding generation (Cohere Embed v4, 1536 dimensions, multimodal if you need it) and the lighter LLM tool calls. For more involved reasoning, like the LLM-as-judge logic in TransformWithContext, you can either keep using Command A on the on-demand endpoint or stand up a dedicated AI cluster with a larger model. The AgentSpec configuration abstracts this away, which is convenient when you want to A/B test reasoning models without rewriting workers.
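
For the embedding step, a worker can call OCI Generative AI directly through the Python SDK. The sketch below assumes the on-demand serving mode; the endpoint and model ID are placeholders you would look up in your region's model catalog.

```python
import oci

# Placeholder endpoint; the region and model ID depend on your tenancy.
config = oci.config.from_file()
client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config,
    service_endpoint="https://inference.generativeai.<region>.oci.oraclecloud.com",
)

def embed(texts: list[str], compartment_id: str) -> list[list[float]]:
    """Return one embedding vector per input text."""
    details = oci.generative_ai_inference.models.EmbedTextDetails(
        inputs=texts,
        serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
            model_id="cohere.embed-v4.0",  # placeholder model identifier
        ),
        compartment_id=compartment_id,
    )
    return client.embed_text(details).data.embeddings
```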

4.4 The Substrate

Oracle Database 23ai is the bottom of the stack and it is doing more work than the diagram suggests. The schema (which we will look at properly in Part 2) holds the actual memory rows, the embeddings, the scope metadata, the audit trail, and the pipeline run log. HNSW indexes accelerate similarity search. The hybrid vector index lets you do a lexical-plus-semantic search in one shot, which is useful when memories contain proper nouns that you really do not want to lose to embedding fuzziness.

Multi-tenancy is enforced at the database layer using pluggable databases or, for lighter isolation, schema-per-tenant. The key insight is that scope filtering happens in SQL, not in application code, which means you cannot accidentally leak across tenants by forgetting a parameter in a Python function. Auditors love this. So do you, the morning after a security review.
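
As a preview of what that looks like in practice (the full schema comes in Part 2), here is a stripped-down sketch of the memory table and its HNSW index. Column names and the tenant column are assumptions; the load-bearing parts are the VECTOR column and the fact that tenant and scope are ordinary filterable columns.

```python
import oracledb

# Minimal sketch of the memory table and vector index in Oracle Database 23ai.
# Column names and the isolation scheme are assumptions; Part 2 has the real schema.
DDL = [
    """
    CREATE TABLE memories (
        memory_id     RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
        tenant_id     VARCHAR2(64)  NOT NULL,
        scope_type    VARCHAR2(16)  NOT NULL,   -- user | project | property
        scope_id      VARCHAR2(128) NOT NULL,
        topic         VARCHAR2(64)  NOT NULL,
        content       CLOB          NOT NULL,
        embedding     VECTOR(1536, FLOAT32),
        source_ingest RAW(16),                  -- provenance: which raw input
        created_at    TIMESTAMP DEFAULT SYSTIMESTAMP
    )""",
    """
    CREATE VECTOR INDEX memories_hnsw_idx ON memories (embedding)
        ORGANIZATION INMEMORY NEIGHBOR GRAPH
        DISTANCE COSINE
        WITH TARGET ACCURACY 95""",
]

with oracledb.connect(user="memsvc", password="<secret>", dsn="<db23ai-dsn>") as conn:
    with conn.cursor() as cur:
        for stmt in DDL:
            cur.execute(stmt)
```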

5. The Read Path

Synchronous, fast, no surprises. The Memory Query Service accepts a search string and a scope, generates an embedding (or pulls one from cache if the query is repeated), and runs a SQL query against the memories table that filters by scope and orders by VECTOR_DISTANCE. Optional reranking goes through Cohere Rerank 3.5 if precision matters more than latency.
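
Sketched in python-oracledb, under the assumption of the table from the previous section, the read path is one parameterized query: scope filtering in the WHERE clause, cosine distance in the ORDER BY.

```python
import array
import oracledb

oracledb.defaults.fetch_lobs = False  # return CLOB content as str

QUERY = """
    SELECT content,
           VECTOR_DISTANCE(embedding, :qvec, COSINE) AS dist
      FROM memories
     WHERE tenant_id  = :tenant_id
       AND scope_type = :scope_type
       AND scope_id   = :scope_id
     ORDER BY dist
     FETCH FIRST :k ROWS ONLY
"""

def search_memories(conn, query_embedding, tenant_id, scope_type, scope_id, k=5):
    """Return (content, distance) pairs for the k nearest memories in scope."""
    qvec = array.array("f", query_embedding)  # float32 vector bind
    with conn.cursor() as cur:
        cur.execute(QUERY, qvec=qvec, tenant_id=tenant_id,
                    scope_type=scope_type, scope_id=scope_id, k=k)
        return cur.fetchall()
```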

The interesting design decision here is what to expose to the agent. Two patterns work in practice.

The first pattern is implicit retrieval. Every time the agent gets a new user message, the application backend runs a memory query before assembling the prompt, takes any memories above a similarity threshold, and stuffs them into the system prompt. The agent never knows the memory service exists. This is the right pattern for most consumer chatbots because it is simple, predictable, and the failure mode (no relevant memories returned) just means the agent behaves the way it would have without memory.
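
A sketch of the implicit pattern, reusing the search_memories helper from the previous section (the distance threshold is an assumption you would tune):

```python
def build_system_prompt(base_prompt, conn, query_embedding, tenant_id, user_id,
                        max_distance=0.25):
    """Fold relevant user-scoped memories into the system prompt, or leave it alone."""
    hits = search_memories(conn, query_embedding, tenant_id, "user", user_id, k=5)
    relevant = [content for content, dist in hits if dist <= max_distance]
    if not relevant:
        return base_prompt  # no memories: the agent behaves as it would without memory
    memory_block = "\n".join(f"- {m}" for m in relevant)
    return f"{base_prompt}\n\nKnown context about this user:\n{memory_block}"
```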

The second pattern is explicit retrieval as a tool call. The agent has a search_memories tool in its tool list and calls it whenever it thinks recalling something would help. This gives the agent more control, which is what you want for a complex reasoning agent that runs long loops. It also means the agent is going to get retrieval wrong sometimes, search for the wrong things, and waste tokens. The trade is autonomy for unpredictability. Pick your fighter.
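
The explicit pattern exposes the same query as a tool. The definition below uses a generic function-calling schema for illustration; the AgentSpec binding is covered in Part 2.

```python
# Hypothetical tool definition for explicit retrieval, in the common
# JSON-schema function-calling shape. AgentSpec's own tool syntax differs.
SEARCH_MEMORIES_TOOL = {
    "name": "search_memories",
    "description": "Search long-term memory for facts relevant to the current "
                   "task. Use this before asking the user for information they "
                   "may have already provided.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "Natural language description of what to recall"},
            "topic": {"type": "string",
                      "description": "Optional topic filter, e.g. UserPreferences"},
        },
        "required": ["query"],
    },
}
```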

In our reference architecture we expose both, because AgentSpec lets us define the search_memories tool once and bind it into multiple agents with different policies. The expensive part is not exposing the tool. It is deciding which agents get to use it.

6. Why This Matters Most for Multi-Agent Systems

Single-agent chatbots can get away with a lot. They are essentially one conversation, one context window, and one LLM. Memory matters but the surface area is small. Multi-agent systems are the place where the engram pattern stops being a nice-to-have and becomes structurally required.

Consider a typical agentic RAG setup. A planner agent receives a user query, decomposes it into subtasks, and dispatches them to specialist agents. A research agent does retrieval. A coding agent writes scripts. A critic agent reviews the output. Each of these runs in its own context window. Each of them learns things during execution that would be useful for future runs. None of them naturally share that knowledge.

Without a shared memory layer, the system relearns the same lessons every time. The research agent searches for “comedy movies” as a literal text query when it should have used a genre filter, the planner notices, the user complains, and the next time the system runs the exact same flow plays out again. Forever. The Greeks had a word for this and it was Sisyphus.

With an engram-style shared memory layer scoped to the project, the lesson learned in one run becomes available to all future runs. The Experience topic captures the consolidated insight (“when filtering by genre, use the genres property, not a near-text query”) and the next planner agent retrieves that memory before deciding how to delegate. This is what people mean when they say agents can learn. It is not online learning in the gradient-descent sense. It is structured note-taking with retrieval, and it works.

The catch is that doing this safely in a multi-tenant deployment is non-trivial. You absolutely do not want User A’s feedback teaching the agent to behave differently for User B unless you have explicitly designed for that. The scope mechanism is what saves you here. Project-scoped memories for shared learning, user-scoped memories for personal context, never the twain shall meet unless you opt in.

7. What You Get When This Works

A few things, some obvious and some less so.

Latency goes down on the write path because the agent is no longer waiting for memory consolidation. Latency goes down on the read path because you are searching tens or hundreds of curated memories per scope rather than thousands of raw conversation turns. Token costs go down because the prompt gets shorter when only relevant memories are retrieved. Accuracy goes up because the model is no longer doing implicit summarization in its head every turn.

More subtly, debuggability goes way up. Every memory has a provenance (which raw input produced it, which pipeline run, which extraction step, which prompt version). When the agent gives a wrong answer, you can trace it back to a specific memory, look at the source data, and decide whether the bug is in the model, the prompt, the topic configuration, or the source data itself. Try doing that with a long-context system. I will wait.

The audit story also gets dramatically better, which the security and compliance teams will notice before anyone else does. You can answer questions like “what does the system know about this user, and where did it learn it” with a single SQL query. Right-to-be-forgotten requests become a delete plus cascade rather than a forensic exercise.

Finally, and this is the one that matters most for production systems, the memory service is independently scalable. The write path scales with conversation volume. The read path scales with retrieval frequency. The pipeline workers scale with extraction load. The database scales with total memory count. None of these are coupled to your agent runtime, which means you can evolve each piece without touching the others.

8. What You Give Up

Honesty obliges. Not all of this is free.

You give up freshness. Because the write path is asynchronous, there is a window (typically a few seconds, sometimes longer if the pipeline is busy) during which a fact has been said but is not yet retrievable. For most use cases this is fine. Most of the time you do not need to retrieve a memory that was created in the last few seconds because that information is still in the active context. But you need to design around it. The pattern of “user tells the agent their name and the agent immediately searches memory for the name and fails to find it” is a real failure mode if you do not think about it.

You give up some control over what gets remembered. The extraction step is an LLM tool call, which means it has the usual failure modes (hallucinations, omissions, occasional weirdness). You can tune the topic descriptions and the extract prompt to mitigate this, but you cannot eliminate it. The system will sometimes remember things you did not intend it to remember and sometimes fail to remember things you wanted it to. This is a feature of working with statistical models. Embrace it or build a deterministic system instead, but do not pretend you can have both.

You add operational complexity. A memory service is a stateful distributed system with an LLM in the loop, and that means the failure modes are interesting in the way that engineering managers fear. You will have pipeline runs that get stuck. You will have transform steps that produce nonsensical reconciliations. You will have buffer flushes that fire at the wrong time. None of these are unsolvable. All of them require monitoring, alerting, and the kind of operational discipline that early-stage teams do not always have.

9. Where We Go From Here

That is the why and the what. We have established that long context is not memory, that engram-style memory services solve real problems for production agentic systems, and that OCI plus AgentSpec plus Oracle Database 23ai gives us a coherent stack to build on. We have walked through the architecture at a level where you can sketch it on a whiteboard and explain it to your director.

Part 2 is the how. We will define the AgentSpec configurations for the multi-agent system, write the actual schema for the memory store in Oracle 23ai, implement the extract and transform workers, set up the pipeline orchestration, and walk through the SQL that does the synchronous retrieval. We will also handle some of the gnarly bits, such as how to do the bounded-topic constraint at the database level without a race condition, how to make the embedding generation idempotent, and how to wire up multi-tenancy so it cannot be bypassed by an off-by-one in application code.

And we will argue, gently, that the right way to build agentic systems in 2026 is to stop thinking of memory as an afterthought and start thinking of it as the substrate that makes everything else possible. The Tonegawa lab figured this out for biological brains a decade ago. We are still catching up.

In Part 2, we put the engram in the database, the pipeline in WayFlow, the agents in AgentSpec, and the whole thing in a single OCI compartment. Then we run it.

Continued in Part 2: The Code








