LLMs are like someone with perfect language skills but total amnesia. Every time you start a new API call, the model has zero recollection of anything you have ever said unless you explicitly replay the history in the prompt. Memory systems solve this by giving your application a structured way to carry context forward, ranging from simple chat buffers (a short-term notepad) to vector databases (a searchable long-term filing cabinet).

Without memory, every LLM interaction starts fresh:
Without Memory                    With Memory
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User: "I'm John"                  User: "I'm John"
AI: "Nice to meet you!"           AI: "Nice to meet you, John!"
User: "What's my name?"           User: "What's my name?"
AI: "I don't know."               AI: "Your name is John."
The simplest approach: keep the last N messages. Think of it like a whiteboard in a meeting room — you can only see what is currently written, and when you run out of space you erase the oldest notes to make room for new ones.
Pitfall: Using message count instead of token count. A buffer of “last 20 messages” sounds safe, but if one message contains a pasted document with 8,000 tokens, you will blow past context limits. Always enforce a token budget, not just a message count. See TokenBufferMemory below for the correct approach.
import time
from dataclasses import dataclass, field
from typing import Optional

from openai import OpenAI

client = OpenAI()


@dataclass
class Message:
    role: str
    content: str
    timestamp: float = field(default_factory=time.time)


class BufferMemory:
    """Simple buffer memory - keeps last N messages.

    Fast and simple, but drops old context entirely. Good for short
    conversations; breaks down on long multi-turn sessions where early
    context matters (e.g., "remember I said I'm a vegetarian" from
    30 messages ago).
    """

    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages
        self.messages: list[Message] = []

    def add(self, role: str, content: str):
        """Add a message to memory"""
        self.messages.append(Message(role=role, content=content))

        # Trim if exceeds max
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_messages(self) -> list[dict]:
        """Get messages for LLM context"""
        return [
            {"role": m.role, "content": m.content}
            for m in self.messages
        ]

    def clear(self):
        """Clear all memory"""
        self.messages = []


# Usage
memory = BufferMemory(max_messages=10)


def chat(user_input: str) -> str:
    # Add user message to memory
    memory.add("user", user_input)

    # Create messages with history
    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ] + memory.get_messages()

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    # Add assistant response to memory
    memory.add("assistant", assistant_message)
    return assistant_message
import tiktoken


class TokenBufferMemory:
    """Buffer memory with token limit.

    The production-safe version of BufferMemory. Instead of counting
    messages, it counts tokens and evicts the oldest messages when the
    budget is exceeded. This prevents the context window overflow that
    message-count buffers are prone to.
    """

    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.messages: list[Message] = []

    def _count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def _total_tokens(self) -> int:
        return sum(
            self._count_tokens(m.content) + 4  # +4 for message overhead
            for m in self.messages
        )

    def add(self, role: str, content: str):
        self.messages.append(Message(role=role, content=content))

        # Trim oldest messages until under token limit
        while self._total_tokens() > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)

    def get_messages(self) -> list[dict]:
        return [
            {"role": m.role, "content": m.content}
            for m in self.messages
        ]
Compress conversation history into summaries. This is like taking meeting minutes — instead of recording every word spoken, you capture the key decisions, facts, and action items. The trade-off is that you lose detail (exact phrasing, nuance) but gain the ability to maintain context across much longer conversations.
Cost consideration: Summary memory makes an extra LLM call every time it summarizes. If your conversations are short (under 20 turns), the summarization cost may exceed the savings from a smaller context window. Use summary memory for sessions that regularly exceed your token budget.
class SummaryMemory:
    """Memory that summarizes older messages.

    When the buffer fills up, the oldest half of messages are compressed
    into a running summary via a cheap model call (gpt-4o-mini). The
    summary is prepended to subsequent requests so the model retains
    awareness of earlier context.
    """

    def __init__(
        self,
        buffer_size: int = 10,
        summary_interval: int = 5
    ):
        self.buffer_size = buffer_size
        self.summary_interval = summary_interval
        self.messages: list[Message] = []
        self.summary: Optional[str] = None
        self.messages_since_summary = 0

    def add(self, role: str, content: str):
        self.messages.append(Message(role=role, content=content))
        self.messages_since_summary += 1

        # Create summary when buffer is full
        if len(self.messages) > self.buffer_size:
            self._update_summary()

    def _update_summary(self):
        """Summarize older messages"""
        # Keep the newest half of the buffer verbatim; summarize the rest
        keep = max(1, self.buffer_size // 2)
        to_summarize = self.messages[:-keep]
        if not to_summarize:
            return

        # Format messages for summarization
        conversation = "\n".join([
            f"{m.role}: {m.content}"
            for m in to_summarize
        ])

        # Create summary
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation, preserving key facts, decisions, and context."
                },
                {"role": "user", "content": conversation}
            ]
        )
        new_summary = response.choices[0].message.content

        # Combine with existing summary if present
        if self.summary:
            self.summary = f"{self.summary}\n\nUpdated: {new_summary}"
        else:
            self.summary = new_summary

        # Keep only recent messages
        self.messages = self.messages[-keep:]
        self.messages_since_summary = 0

    def get_context(self) -> list[dict]:
        """Get context for LLM"""
        context = []

        # Add summary if exists
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })

        # Add recent messages
        context.extend([
            {"role": m.role, "content": m.content}
            for m in self.messages
        ])
        return context


# Usage
memory = SummaryMemory(buffer_size=10)


def chat_with_summary(user_input: str) -> str:
    memory.add("user", user_input)

    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ] + memory.get_context()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    memory.add("assistant", assistant_message)
    return assistant_message
Store and retrieve memories semantically. If buffer memory is a whiteboard and summary memory is meeting minutes, vector memory is a searchable filing cabinet. Every piece of information gets indexed by its meaning (via embeddings), so you can retrieve the three most relevant memories for any given query — even if the exact words are different.
Pitfall: Storing everything as vector memory. Embedding every single message is expensive and creates noise. A message like “ok sounds good” has zero retrieval value but costs an embedding API call. Filter for substantive content before storing: facts, preferences, decisions, and instructions — not acknowledgments and filler.
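One way to apply this filter is a cheap heuristic gate in front of the embedding call. The sketch below is an illustration only: `is_substantive`, the `FILLER` phrase list, and the four-word minimum are hypothetical choices, not part of the classes in this article, and a production system might use a small classifier instead.

```python
import re

# Hypothetical filler phrases; tune this list for your own traffic.
FILLER = {
    "ok", "okay", "thanks", "thank you", "sounds good", "ok sounds good",
    "got it", "sure", "yes", "no", "great", "cool",
}


def is_substantive(text: str, min_words: int = 4) -> bool:
    """Return True if a message looks worth the cost of an embedding call."""
    # Lowercase and strip punctuation so "Ok, sounds good!" matches the list.
    normalized = re.sub(r"[^\w\s']", "", text.lower()).strip()
    if not normalized or normalized in FILLER:
        return False
    # Digits often signal facts (dates, quantities, IDs) even in short messages.
    if any(ch.isdigit() for ch in normalized):
        return True
    return len(normalized.split()) >= min_words
```

Calling this before storing a message skips the embedding request for acknowledgments while still keeping short but factual messages such as order numbers.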
import json
from datetime import datetime
from typing import Optional

import numpy as np
from openai import OpenAI

client = OpenAI()


class VectorMemory:
    """Long-term memory using vector embeddings.

    Each memory is embedded into a high-dimensional vector. At retrieval
    time, the query is also embedded and we find the closest memories via
    cosine similarity. This means "What is my dietary preference?" can
    retrieve a memory stored as "User said they are vegetarian" -- even
    though the words are completely different.
    """

    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.memories: list[dict] = []
        self.embeddings: list[np.ndarray] = []

    def _get_embedding(self, text: str) -> np.ndarray:
        response = client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return np.array(response.data[0].embedding)

    def add(
        self,
        content: str,
        metadata: Optional[dict] = None
    ):
        """Add a memory"""
        embedding = self._get_embedding(content)
        memory = {
            "content": content,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {}
        }
        self.memories.append(memory)
        self.embeddings.append(embedding)

    def search(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7
    ) -> list[dict]:
        """Search for relevant memories"""
        if not self.memories:
            return []

        query_embedding = self._get_embedding(query)

        # Calculate cosine similarities
        similarities = []
        for i, emb in enumerate(self.embeddings):
            similarity = np.dot(query_embedding, emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(emb)
            )
            similarities.append((i, similarity))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Return top results above threshold
        results = []
        for idx, score in similarities[:top_k]:
            if score >= threshold:
                memory = self.memories[idx].copy()
                memory["similarity"] = score
                results.append(memory)
        return results

    def add_conversation(self, role: str, content: str):
        """Add a conversation turn as memory"""
        self.add(
            content=f"{role}: {content}",
            metadata={"type": "conversation", "role": role}
        )

    def save(self, path: str):
        """Save memories to file"""
        data = {
            "memories": self.memories,
            "embeddings": [e.tolist() for e in self.embeddings]
        }
        with open(path, "w") as f:
            json.dump(data, f)

    def load(self, path: str):
        """Load memories from file"""
        with open(path) as f:
            data = json.load(f)
        self.memories = data["memories"]
        self.embeddings = [np.array(e) for e in data["embeddings"]]


# Usage with LLM
vector_memory = VectorMemory()


def chat_with_long_term_memory(user_input: str) -> str:
    # Search for relevant memories
    relevant_memories = vector_memory.search(user_input, top_k=3)

    # Build context from memories
    memory_context = ""
    if relevant_memories:
        memory_context = "Relevant memories:\n" + "\n".join([
            f"- {m['content']} (similarity: {m['similarity']:.2f})"
            for m in relevant_memories
        ])

    # Add current message to memory
    vector_memory.add_conversation("user", user_input)

    # Generate response
    messages = [
        {
            "role": "system",
            "content": f"You are a helpful assistant with long-term memory.\n\n{memory_context}"
        },
        {"role": "user", "content": user_input}
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    # Add response to memory
    vector_memory.add_conversation("assistant", assistant_message)
    return assistant_message
Track facts about specific entities. Think of this like a CRM for your AI — it maintains a structured profile for every person, place, or thing mentioned in conversation. When the user says “I work at Google and my manager is Sarah,” the entity memory creates entries for the user, Google, and Sarah with their attributes and relationships.
When to use entity memory vs. vector memory: Entity memory excels at structured facts (“Alice’s role is VP of Engineering”). Vector memory excels at unstructured recall (“That time we discussed the pros and cons of microservices”). Production systems typically need both.
import json
import time
from dataclasses import dataclass, field
from typing import Optional

# Assumes `client = OpenAI()` from the earlier examples


@dataclass
class EntityInfo:
    entity_type: str
    attributes: dict = field(default_factory=dict)
    last_updated: float = field(default_factory=time.time)


class EntityMemory:
    """Memory for tracking entities and their attributes.

    Uses LLM extraction to pull structured entity data from freeform
    text. Each entity gets a name, type, and key-value attributes that
    are updated over time as new information surfaces in the conversation.
    """

    def __init__(self):
        self.entities: dict[str, EntityInfo] = {}

    def update_entity(
        self,
        name: str,
        entity_type: str,
        attributes: dict
    ):
        """Update or create entity"""
        if name in self.entities:
            self.entities[name].attributes.update(attributes)
            self.entities[name].last_updated = time.time()
        else:
            self.entities[name] = EntityInfo(
                entity_type=entity_type,
                attributes=attributes
            )

    def get_entity(self, name: str) -> Optional[EntityInfo]:
        return self.entities.get(name)

    def extract_entities_from_text(self, text: str) -> list[dict]:
        """Use LLM to extract entities from text"""
        # JSON mode returns a single JSON object, so the prompt asks for an
        # object with an "entities" key, matching the parsing below.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract entities and their attributes from the text.
Return JSON: {"entities": [{"name": "...", "type": "person|place|thing", "attributes": {...}}]}"""
                },
                {"role": "user", "content": text}
            ],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("entities", [])

    def process_and_store(self, text: str):
        """Extract and store entities from text"""
        entities = self.extract_entities_from_text(text)
        for entity in entities:
            self.update_entity(
                name=entity["name"],
                entity_type=entity["type"],
                attributes=entity.get("attributes", {})
            )

    def get_context(self) -> str:
        """Get entity context for LLM"""
        if not self.entities:
            return ""

        context = "Known entities:\n"
        for name, info in self.entities.items():
            attrs = ", ".join(
                f"{k}: {v}" for k, v in info.attributes.items()
            )
            context += f"- {name} ({info.entity_type}): {attrs}\n"
        return context


# Usage
entity_memory = EntityMemory()


def chat_with_entity_memory(user_input: str) -> str:
    # Extract entities from user input
    entity_memory.process_and_store(user_input)

    # Get entity context
    entity_context = entity_memory.get_context()

    messages = [
        {
            "role": "system",
            "content": f"You are a helpful assistant.\n\n{entity_context}"
        },
        {"role": "user", "content": user_input}
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    # Also extract entities from response
    entity_memory.process_and_store(assistant_message)
    return assistant_message
Combine all memory types for maximum effectiveness. This is the architecture that production AI assistants actually use: you would not build a house with just a hammer, and you should not build a memory system with just one memory type.

The hybrid approach mirrors how human memory works: you remember the last few minutes in vivid detail (buffer), have a general sense of what happened earlier today (summary), can search through years of experiences when prompted (vector), and maintain an address book of key people and facts (entity).
class HybridMemory:
    """Combines buffer, summary, vector, and entity memory.

    Each memory type handles a different time horizon:
    - Buffer: last few minutes (exact recent messages)
    - Summary: current session (compressed overview)
    - Vector: all time (semantic search over full history)
    - Entity: all time (structured facts about key entities)
    """

    def __init__(
        self,
        buffer_size: int = 10,
        vector_threshold: float = 0.75
    ):
        self.buffer = BufferMemory(max_messages=buffer_size)
        self.summary = SummaryMemory(buffer_size=buffer_size * 2)
        self.vector = VectorMemory()
        self.entity = EntityMemory()
        self.vector_threshold = vector_threshold

    def add(self, role: str, content: str):
        """Add message to all memory systems"""
        # Short-term
        self.buffer.add(role, content)
        self.summary.add(role, content)

        # Long-term
        self.vector.add_conversation(role, content)

        # Entity extraction
        self.entity.process_and_store(content)

    def get_context(self, query: str) -> dict:
        """Get comprehensive context for query"""
        # Recent messages (short-term)
        recent = self.buffer.get_messages()

        # Summary of older conversation
        summary = self.summary.summary

        # Relevant long-term memories
        long_term = self.vector.search(query, top_k=3, threshold=self.vector_threshold)

        # Entity knowledge
        entities = self.entity.get_context()

        return {
            "recent_messages": recent,
            "summary": summary,
            "long_term_memories": long_term,
            "entities": entities
        }

    def build_messages(
        self,
        query: str,
        system_prompt: str
    ) -> list[dict]:
        """Build message list for LLM"""
        context = self.get_context(query)

        # Build system message with context
        system_content = system_prompt
        if context["summary"]:
            system_content += f"\n\nConversation summary:\n{context['summary']}"
        if context["long_term_memories"]:
            memories = "\n".join([
                f"- {m['content']}"
                for m in context["long_term_memories"]
            ])
            system_content += f"\n\nRelevant memories:\n{memories}"
        if context["entities"]:
            system_content += f"\n\n{context['entities']}"

        messages = [{"role": "system", "content": system_content}]
        messages.extend(context["recent_messages"])
        return messages

    def save(self, path: str):
        """Persist memory to disk"""
        self.vector.save(f"{path}_vector.json")
        # Add more persistence as needed

    def load(self, path: str):
        """Load memory from disk"""
        self.vector.load(f"{path}_vector.json")


# Usage
memory = HybridMemory(buffer_size=10)


def chat_with_hybrid_memory(user_input: str) -> str:
    messages = memory.build_messages(
        query=user_input,
        system_prompt="You are a helpful assistant with perfect memory."
    )

    # Add current user message
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    # Update memory
    memory.add("user", user_input)
    memory.add("assistant", assistant_message)
    return assistant_message
Dumping the entire memory context into the system message bloats token usage and drowns out the user’s actual question. Instead, retrieve only the top 3-5 most relevant memories for each request. Treat memory retrieval like a database query, not a full table scan.
Forgetting to handle contradictory memories
If the user says “I’m a vegetarian” in message 5 and “I eat chicken now” in message 50, naive retrieval might surface both. Your system needs a recency bias or explicit override mechanism so newer information takes precedence over older memories.
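A minimal sketch of a recency override, assuming each retrieved memory has already been tagged with a topic by some upstream extraction step. The `Fact` dataclass, its `topic` field, and `resolve_conflicts` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    topic: str          # hypothetical grouping key, e.g. "diet"
    content: str
    message_index: int  # position in the conversation; higher means newer


def resolve_conflicts(facts: list[Fact]) -> list[Fact]:
    """Keep only the newest fact per topic (a simple recency override)."""
    newest: dict[str, Fact] = {}
    for fact in facts:
        current = newest.get(fact.topic)
        if current is None or fact.message_index > current.message_index:
            newest[fact.topic] = fact
    # Return in chronological order so the prompt reads naturally.
    return sorted(newest.values(), key=lambda f: f.message_index)
```

With the example above, the message-50 statement about eating chicken would replace the message-5 vegetarian statement, while facts on unrelated topics pass through unchanged.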
No memory expiration or cleanup
Memories accumulate over time. A year-old preference might be stale, and irrelevant memories create noise that degrades retrieval quality. Implement TTLs, decay scoring (older memories rank lower), or periodic cleanup to keep your memory store healthy.
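The simplest of these options, a TTL, can be sketched in a few lines. This assumes each memory dict carries a `created_at` field in epoch seconds, which is a hypothetical field for this example (the `VectorMemory` class in this article stores an ISO timestamp string instead).

```python
import time
from typing import Optional


def prune_expired(
    memories: list[dict],
    ttl_seconds: float,
    now: Optional[float] = None,
) -> list[dict]:
    """Return only the memories that are younger than the TTL."""
    # `now` is injectable so the function is easy to test deterministically.
    now = time.time() if now is None else now
    return [m for m in memories if now - m["created_at"] <= ttl_seconds]
```

Running this as a periodic job keeps the store bounded; if you also hold embeddings in a parallel list, they need to be filtered with the same condition so the two stay aligned.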
Embedding model mismatch between store and retrieval
If you store memories with text-embedding-3-small and later switch to text-embedding-3-large, all your cosine similarities become meaningless — different models produce incompatible vector spaces. Pin your embedding model version and re-embed everything if you upgrade.
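One way to catch this mistake early is to record the embedding model name alongside the vectors and refuse to load a store that was built with a different model. The helpers below are a sketch under that assumption; `pack_store` and `unpack_store` are hypothetical names and operate on plain dicts that you would then serialize however you like.

```python
def pack_store(
    embedding_model: str,
    memories: list[dict],
    embeddings: list[list[float]],
) -> dict:
    """Bundle memories with the name of the model that embedded them."""
    return {
        "embedding_model": embedding_model,
        "memories": memories,
        "embeddings": embeddings,
    }


def unpack_store(
    data: dict, expected_model: str
) -> tuple[list[dict], list[list[float]]]:
    """Return (memories, embeddings), or raise if the model does not match."""
    stored_model = data.get("embedding_model")
    if stored_model != expected_model:
        raise ValueError(
            f"Store was embedded with {stored_model!r}, but the application "
            f"is configured for {expected_model!r}; re-embed before loading."
        )
    return data["memories"], data["embeddings"]
```

Failing loudly at load time is a deliberate choice here: silently mixing vector spaces produces plausible-looking but meaningless similarity scores.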
You are building a customer support chatbot that needs to remember user preferences across sessions spanning weeks. Walk me through your memory architecture.
Strong Answer:
I would use a hybrid memory system with three tiers. Tier one is a buffer memory for the current conversation session — the last 10-20 messages in full detail, kept in the context window. This handles the immediate conversational flow: “as I mentioned earlier in this chat.” Tier two is a summary memory that compresses completed sessions into a paragraph-length summary. When a session ends, I run the full conversation through a cheap model like gpt-4o-mini with a summarization prompt and store the result. When the user returns next week, I inject the session summary into the system prompt. Tier three is an entity memory backed by a simple key-value store — facts about this specific user extracted via structured extraction: their name, plan tier, preferred language, past issues, product preferences.
The entity memory is the most important for cross-session persistence. Summaries are lossy — you lose nuance and specific details. But entity facts like “user prefers email over phone” or “user is on the Enterprise plan” are precise and stable. I would extract entities after each conversation turn using a structured extraction prompt and upsert into a database keyed by user ID.
For the retrieval path, when a user starts a new session, I build context in layers: first the entity facts (cheap, always relevant), then the most recent session summary (moderate tokens), and optionally a vector search over all past conversation turns if the user references something specific that is not in the summary. The vector search is the expensive fallback, not the primary path.
The critical production concern is token budget management. If a user has 50 past sessions, you cannot inject all 50 summaries. I cap the injected context at roughly 2,000 tokens of memory and prioritize recency. The entity store has no token cost until serialized, so it scales to thousands of facts cheaply.
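The layered, budget-capped context described in this answer could look roughly like the sketch below. The function names are hypothetical, and `rough_token_count` is a deliberately crude stand-in (about four characters per token) that you would replace with a real tokenizer.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def build_memory_context(
    entity_facts: list[str],
    session_summaries: list[str],  # ordered newest first
    budget_tokens: int = 2000,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Select memory snippets in priority order until the budget is spent."""
    selected: list[str] = []
    used = 0
    # Entity facts go first, then summaries from newest to oldest.
    for item in entity_facts + session_summaries:
        cost = count_tokens(item)
        if used + cost > budget_tokens:
            break  # stop at the first item that does not fit
        selected.append(item)
        used += cost
    return selected
```

Stopping at the first item that does not fit keeps the priority order strict; skipping it and trying smaller, older items would pack the budget more tightly at the cost of recency.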
Follow-up: How do you handle entity conflicts, for example when the user said they prefer Python in January but then said they switched to Rust in March?

This is a real problem, and the naive approach of just overwriting creates issues because sometimes both facts are true in different contexts. My approach is to timestamp every entity fact and include a “source” field recording which conversation it came from. When there is a conflict, I keep both entries but mark the older one as potentially stale. During context construction, I present the most recent value but add a note like “previously expressed preference for Python.” This lets the LLM handle the nuance naturally: if the user asks about Python, the model knows they have history with it even though their current preference is Rust. For truly contradictory facts like “user’s email is X” versus “user’s email is Y,” I keep only the most recent because those are definitional, not preferential. The key design principle is: preferences are additive, identity facts are replacement.
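The "preferences are additive, identity facts are replacement" rule can be sketched as a small versioned store. Everything here is illustrative: `VersionedFacts`, `FactVersion`, and the `identity_keys` set are hypothetical names, and the sketch assumes upserts arrive in chronological order.

```python
from dataclasses import dataclass


@dataclass
class FactVersion:
    value: str
    timestamp: float
    source: str          # e.g. the ID of the conversation it came from
    stale: bool = False


class VersionedFacts:
    def __init__(self, identity_keys: set[str]):
        # Keys listed here are replaced outright; all others keep history.
        self.identity_keys = identity_keys
        self.facts: dict[str, list[FactVersion]] = {}

    def upsert(self, key: str, value: str, timestamp: float, source: str) -> None:
        new = FactVersion(value, timestamp, source)
        if key in self.identity_keys:
            self.facts[key] = [new]  # identity fact: replacement
            return
        history = self.facts.setdefault(key, [])
        for old in history:
            old.stale = True         # preference: keep, but mark as stale
        history.append(new)

    def describe(self, key: str) -> str:
        """Render the current value plus a note about earlier values."""
        history = self.facts.get(key, [])
        if not history:
            return ""
        line = f"{key}: {history[-1].value}"
        previous = [v.value for v in history[:-1]]
        if previous:
            line += f" (previously: {', '.join(previous)})"
        return line
```

The string returned by `describe` is what would be injected into the prompt, so the model sees the current preference and the older one in a single line.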
What are the trade-offs between buffer memory and summary memory, and when does each one fail?
Strong Answer:
Buffer memory keeps the last N messages verbatim. Its strength is zero information loss within the window — every detail, every nuance, every instruction the user gave is preserved exactly. Its failure mode is the cliff edge: message N+1 pushes message 1 out entirely. There is no graceful degradation. If the user gave a critical instruction 15 messages ago and your buffer is 10 messages, that instruction is gone forever. I have seen this cause real production bugs where an agent “forgets” a user constraint mid-conversation because the constraint was stated early and fell out of the buffer.
Summary memory compresses older messages using an LLM summarization call. Its strength is that nothing is completely forgotten — key facts survive in compressed form. Its failure mode is lossy compression. The summarizer decides what is “important” and it can get that wrong. Specific numbers, exact quotes, nuanced conditions like “only do X if Y and Z are both true” tend to get flattened or dropped during summarization. I have seen summary memory turn “the customer wants a refund only if the item is defective and was purchased within 30 days” into “the customer wants a refund” — losing the critical conditions.
The deeper trade-off is cost versus fidelity. Buffer memory is free — no extra API calls. Summary memory costs you a summarization call every time the buffer fills up, and each summarization can itself consume 1,000-2,000 tokens. For a high-volume application with thousands of concurrent conversations, those summarization calls add up. A hybrid approach works best: buffer for recent messages, summary for older messages, and entity extraction for critical facts you cannot afford to lose during summarization.
One nuance most people miss: the summarization quality depends heavily on the prompt. A generic “summarize this conversation” produces generic summaries. A prompt like “extract all commitments, constraints, preferences, and unresolved questions from this conversation” produces dramatically more useful summaries for downstream use.
Follow-up: If you are using a token-based buffer instead of a message-count buffer, what edge cases do you need to handle?

The main edge case is a single message that is extremely long, for instance when a user pastes in a 5,000-token document and asks “summarize this.” If your token budget is 4,000 tokens, that one message alone exceeds the buffer. You need a policy: do you truncate the message, reject it, or temporarily expand the buffer? My approach is to never truncate user messages silently because it breaks the user’s trust and can cause nonsensical responses. Instead, I flag that the message exceeds the memory budget, process it with a dedicated summarization step, and store the summary in place of the original. Another edge case is the overhead accounting: each message has metadata overhead beyond just the content tokens. The OpenAI chat format adds roughly 4 tokens per message for role tags and delimiters, and the system message has additional overhead. If you are not accounting for this, your token count will be systematically under-estimated and you will occasionally hit context window limits in production.
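The oversize-message policy from this answer can be expressed as a small gate in front of the buffer. This is a sketch only: `prepare_for_buffer` is a hypothetical helper, and the tokenizer and summarizer are passed in as callables so no particular API is assumed.

```python
from typing import Callable

PER_MESSAGE_OVERHEAD = 4  # rough per-message allowance, as described above


def prepare_for_buffer(
    content: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[str], str],
) -> tuple[str, bool]:
    """Return (text_to_store, was_summarized) for an incoming message."""
    cost = count_tokens(content) + PER_MESSAGE_OVERHEAD
    if cost <= max_tokens:
        return content, False
    # Too large for the whole buffer: store a summary instead of truncating.
    return summarize(content), True
```

The boolean flag lets the caller tell the user that the original text was condensed. Note that the sketch does not verify that the summary itself fits the budget; a real implementation would need to check that too.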
How does vector memory work for LLM agents, and what are the failure modes of cosine similarity search over conversation history?
Strong Answer:
Vector memory works by embedding each conversation turn (or extracted fact) into a high-dimensional vector using a model like text-embedding-3-small, then storing those vectors in an index. When the agent needs to recall something, you embed the current query and find the stored vectors with the highest cosine similarity. The retrieved memories are injected into the LLM’s context as additional information. This gives you semantic recall — “what did we discuss about pricing?” can find a conversation from three weeks ago where the user mentioned budget constraints, even though the word “pricing” never appeared.
The first failure mode is the “semantic gap” problem. Embedding models capture semantic similarity, but not all relevant information is semantically similar to the query. If the user asks “what is my account number?” the relevant memory might be “user: my account is 12345” — which is semantically close. But if the relevant memory is “agent: I have verified your identity and updated the record” — that is contextually relevant but semantically distant from “account number.” You miss it entirely.
The second failure mode is the threshold problem. A cosine similarity threshold of 0.7 sounds reasonable, but the typical score range depends on the model: some embedding models compress most text pairs into a narrow band (text-embedding-ada-002 scores commonly land around 0.7-0.9 even for loosely related text), while others spread scores much lower. With a compressed score range you get a lot of false positives: memories that score 0.75 but are not actually relevant. And lowering the threshold to 0.6 floods you with noise. The quality of retrieval is extremely sensitive to this threshold, and the optimal value varies by domain, embedding model, and even the length of the text being embedded.
The third failure mode is temporal blindness. Vector similarity has no concept of time. A preference the user expressed 6 months ago has the same retrieval priority as one from yesterday, unless you explicitly incorporate recency into the scoring. In practice, I add a time-decay factor to the similarity score: final_score = similarity * decay_factor(age) where the decay might be exponential with a half-life of 30 days.
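The decay formula above, with an exponential half-life, can be written out as follows. The function names and the `(memory_id, similarity, age_days)` tuple layout are assumptions made for this sketch.

```python
def decayed_score(
    similarity: float, age_days: float, half_life_days: float = 30.0
) -> float:
    """final_score = similarity * decay_factor(age), with exponential decay."""
    decay = 0.5 ** (age_days / half_life_days)  # halves every half-life
    return similarity * decay


def rerank_by_recency(
    candidates: list[tuple[str, float, float]],
    half_life_days: float = 30.0,
) -> list[str]:
    """Order (memory_id, similarity, age_days) triples by decayed score."""
    scored = [
        (decayed_score(sim, age, half_life_days), memory_id)
        for memory_id, sim, age in candidates
    ]
    scored.sort(reverse=True)
    return [memory_id for _, memory_id in scored]
```

With a 30-day half-life, a six-month-old memory keeps only about 1/64 of its similarity score, so a moderately similar memory from yesterday will outrank it.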
The fourth and most subtle failure mode is embedding drift. If you switch embedding models or upgrade to a new version, all your stored vectors become incompatible with new query vectors. You have to re-embed your entire memory store, which for a long-running agent could be millions of entries.
Follow-up: You mentioned hybrid approaches. How would you combine vector memory with keyword search for an agent’s long-term memory?

The pattern I use is reciprocal rank fusion. You run two parallel searches: one via vector similarity over embeddings, one via BM25 keyword search over the raw text of memories. Each search produces a ranked list. You merge the lists using RRF: for each memory, its fused score is the sum of 1 / (k + rank_in_list) across both lists, where k is typically 60. This naturally boosts memories that appear in both lists while still surfacing memories that only one method finds. The practical value is significant. Vector search finds semantically related memories, while keyword search finds exact matches (account numbers, product names, error codes) that embedding models routinely miss because they compress those into generic “technical term” embeddings. In my experience, hybrid search improves recall by 15-25% over pure vector search for agent memory retrieval. The cost is minimal: BM25 is essentially free compared to the embedding API call.
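The fusion step itself is short. The sketch below takes the two ranked lists as plain lists of memory IDs (best first) and assumes the vector and BM25 searches have already been run elsewhere; `reciprocal_rank_fusion` is a name chosen for this example.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based, so the top result contributes 1 / (k + 1).
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Usage: IDs are illustrative placeholders.
vector_hits = ["m1", "m2", "m3"]
keyword_hits = ["m3", "m1", "m4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because only ranks are used, the raw cosine and BM25 scores never need to be normalized against each other, which is the main practical appeal of this method.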
Design a memory system for a multi-turn coding assistant that helps developers across multiple projects over months.
Strong Answer:
This is a challenging design because the memory needs to be both project-specific and developer-specific. I would structure it around three memory stores. First, a per-project knowledge base that stores: tech stack preferences (this project uses TypeScript, React, Prisma), architectural decisions (we chose event sourcing for the order service), coding conventions (we use kebab-case file names, prefer functional components), and known issues (the auth middleware has a bug with refresh tokens that we are working around). This is extracted from conversations and stored as structured facts, not raw conversation text.
Second, a developer profile that persists across all projects: experience level, preferred explanation depth, communication style preferences, timezone, and frequently used tools. This is the entity memory from the earlier discussion but scoped to the person, not the project.
Third, a conversation-level working memory that uses a summary chain. Each session produces a summary that includes: what was accomplished, what was left unresolved, and any decisions that were made. When the developer returns to the same project, I load the project knowledge base and the last 3 session summaries.
The retrieval strategy is layered. For every query, I always include: the project knowledge base (these are stable facts, usually under 500 tokens), the developer profile (50-100 tokens), and the last session summary (200-300 tokens). Then I do a vector search over all past conversation turns for this project if the query seems to reference something specific. Total memory injection stays under 2,000 tokens in the common case.
The most important production detail is how facts get into the project knowledge base. I do not rely on the developer explicitly telling the assistant to “remember” things. Instead, after every session, I run an extraction prompt: “What new facts about the project, its architecture, its conventions, or its known issues were discussed? Return as structured JSON.” This passive extraction is what makes the memory feel magical — the assistant just “knows” things without the user having to repeat themselves.
Follow-up: How do you handle stale information in the project knowledge base, for example when the project migrated from JavaScript to TypeScript three months ago but the old fact is still in memory?

Staleness is the hardest problem in persistent memory systems. My approach has three layers. First, every fact has a confidence score that decays over time and gets boosted when the fact is confirmed by new conversations. A fact that has not been referenced or confirmed in 60 days drops to low confidence and gets deprioritized in retrieval. Second, I run a contradiction detection step during extraction. If the new session says “we migrated to TypeScript,” the extraction prompt is designed to also flag any existing facts that this contradicts, in this case “project uses JavaScript.” The old fact gets archived, not deleted, so you maintain a changelog. Third, when the developer explicitly corrects the assistant (“no, we use TypeScript now”), I treat that as a high-priority fact update that immediately overwrites the old value. The key insight is that you need both passive decay and active correction, because some facts become stale silently (nobody mentions the language anymore) while others are explicitly superseded.
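Passive decay and active correction can be combined in a small fact store, sketched below. The names `ProjectFact`, `confidence`, and `record` are hypothetical, days are plain integers for simplicity, and the linear decay curve is one arbitrary choice among many.

```python
from dataclasses import dataclass


@dataclass
class ProjectFact:
    key: str
    value: str
    last_confirmed_day: int  # day number of the most recent confirmation
    archived: bool = False


def confidence(
    fact: ProjectFact, today: int, stale_after_days: int = 60
) -> float:
    """Linear decay from 1.0 to 0.0 over `stale_after_days` (hypothetical curve)."""
    age = today - fact.last_confirmed_day
    return max(0.0, 1.0 - age / stale_after_days)


def record(facts: list[ProjectFact], key: str, value: str, today: int) -> None:
    """Confirm a matching fact, or archive contradicted ones and add the new value."""
    for fact in facts:
        if fact.key != key or fact.archived:
            continue
        if fact.value == value:
            fact.last_confirmed_day = today  # confirmation resets the decay
            return
        fact.archived = True                 # contradicted: keep as changelog
    facts.append(ProjectFact(key, value, today))
```

Archived facts stay in the list, so the history of a key can still be reconstructed; retrieval would filter on `archived` and sort by `confidence`.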