Conversation Memory System Design¶
Table of Contents¶
- Overview
- Goals
- Architecture
- Memory Strategies
- Integration Points
- Implementation Plan
- Phase 2.2: Semantic Search & RAG
Overview¶
The conversation memory system enables harombe agents to maintain context across sessions, remember past interactions, and provide continuity in multi-turn conversations.
Goals¶
- Persistence - Conversations survive application restarts
- Efficiency - Fast retrieval without loading entire history
- Scalability - Handle long conversations with token limits
- Simplicity - SQLite backend, no external dependencies
- Extensibility - Easy to swap backends (PostgreSQL, Redis) later
Architecture¶
Components¶
┌─────────────────────────────────────────────────┐
│ Agent │
│ ┌─────────────────────────────────────────┐ │
│ │ Memory Manager │ │
│ │ - Session management │ │
│ │ - Message filtering │ │
│ │ - Context windowing │ │
│ │ - Summarization │ │
│ └──────────────┬──────────────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────────────┐ │
│ │ Storage Backend │ │
│ │ - SQLite database │ │
│ │ - CRUD operations │ │
│ │ - Indexing │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
Storage Schema (SQLite)¶
-- Sessions table
CREATE TABLE sessions (
id TEXT PRIMARY KEY, -- UUID
created_at TIMESTAMP NOT NULL, -- Session creation time
updated_at TIMESTAMP NOT NULL, -- Last activity
metadata TEXT, -- JSON: user, tags, etc.
system_prompt TEXT -- System prompt used
);
-- Messages table
CREATE TABLE messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_id TEXT NOT NULL, -- FK to sessions
role TEXT NOT NULL, -- user, assistant, system, tool
content TEXT, -- Message content
tool_calls TEXT, -- JSON: tool calls if any
tool_call_id TEXT, -- For tool responses
name TEXT, -- Tool name for tool messages
created_at TIMESTAMP NOT NULL, -- Message timestamp
FOREIGN KEY (session_id) REFERENCES sessions(id) ON DELETE CASCADE
);
-- Indexes for performance
CREATE INDEX idx_messages_session ON messages(session_id);
CREATE INDEX idx_messages_created ON messages(created_at);
CREATE INDEX idx_sessions_updated ON sessions(updated_at);
Memory Manager API¶
class MemoryManager:
"""High-level memory management."""
def create_session(self, system_prompt: str, metadata: dict) -> str:
"""Create a new conversation session. Returns session_id."""
def load_session(self, session_id: str) -> list[Message]:
"""Load conversation history for a session."""
def save_message(self, session_id: str, message: Message) -> None:
"""Save a message to the session."""
def get_recent_messages(
self,
session_id: str,
max_tokens: int = 4096
) -> list[Message]:
"""Get recent messages within token limit."""
def list_sessions(self, limit: int = 10) -> list[dict]:
"""List recent sessions with metadata."""
def delete_session(self, session_id: str) -> None:
"""Delete a session and all its messages."""
def prune_old_sessions(self, days: int = 30) -> int:
"""Delete sessions older than N days. Returns count deleted."""
Memory Strategies¶
1. Simple Windowing (Phase 2.1)¶
Load the most recent N messages that fit within token limit:
def get_recent_messages(session_id, max_tokens):
messages = load_messages(session_id, order='DESC', limit=100)
result = []
total_tokens = 0
for msg in reversed(messages): # Oldest to newest
tokens = estimate_tokens(msg)
if total_tokens + tokens > max_tokens:
break
result.append(msg)
total_tokens += tokens
return result
Pros: Simple, fast, predictable Cons: May lose important context from earlier in conversation
2. Summarization (Future)¶
Summarize old messages to compress history:
def get_context_with_summary(session_id, max_tokens):
# Get all messages
all_messages = load_messages(session_id)
# If within limit, return as-is
if estimate_tokens(all_messages) <= max_tokens:
return all_messages
# Otherwise, summarize old context
old_msgs = all_messages[:-20] # All but recent 20
recent_msgs = all_messages[-20:]
summary = summarize_conversation(old_msgs)
return [
Message(role="system", content=f"Previous context: {summary}"),
*recent_msgs
]
Pros: Preserves important context, better continuity Cons: More complex, requires LLM call, lossy compression
3. Hybrid (Future)¶
Combine summarization with selective important message retention:
- Keep system messages always
- Summarize routine exchanges
- Flag and retain important messages (user decisions, key facts)
- Keep recent N messages verbatim
Integration Points¶
Agent Class Modifications¶
class Agent:
def __init__(
self,
llm: LLMClient,
tools: list[Tool],
memory_manager: MemoryManager | None = None,
session_id: str | None = None,
...
):
self.memory = memory_manager
self.session_id = session_id
# Load history if session exists
if self.memory and self.session_id:
self.state = self._load_session_state()
else:
self.state = AgentState(system_prompt)
async def run(self, query: str) -> str:
# Add user message
self.state.add_user_message(query)
# Save to memory if enabled
if self.memory:
self.memory.save_message(
self.session_id,
self.state.messages[-1]
)
# ... ReAct loop ...
# Save assistant response
if self.memory:
self.memory.save_message(
self.session_id,
self.state.messages[-1]
)
return response
CLI Integration¶
New commands in harombe chat:
/sessions List recent conversation sessions
/load <session> Load and continue a previous session
/save Force save current session
/history Show conversation history for current session
/clear Clear current session history (but keep in DB)
/forget Delete current session from memory
Configuration Schema¶
memory:
enabled: true # Enable conversation memory
storage_path: ~/.harombe/memory.db # SQLite database location
max_history_tokens: 4096 # Token limit for context
auto_prune_days: 30 # Auto-delete old sessions
strategy: simple # simple, summarization, hybrid
Implementation Plan¶
Phase 2.1: Basic Memory (Complete)¶
- ✅ Design architecture (this document)
- ✅ Implement SQLite storage backend
- ✅ Implement simple windowing strategy
- ✅ Integrate with Agent class
- ✅ Add CLI commands
- ✅ Add configuration
- ✅ Write tests
- ✅ Update documentation
Phase 2.2: Semantic Search & RAG (Complete)¶
- ✅ Design vector store architecture
- ✅ Implement embedding service (sentence-transformers, Ollama)
- ✅ Implement ChromaDB vector store
- ✅ Integrate with MemoryManager (auto-embedding)
- ✅ Add semantic search capabilities
- ✅ Implement RAG for agent
- ✅ Write comprehensive tests
- ✅ Update documentation and examples
Phase 2.3: Advanced Memory (Future)¶
- Implement summarization strategy
- Add message importance scoring
- Implement hybrid strategy
- Export/import sessions
- Alternative backends (PostgreSQL, Redis)
- Cloud storage adapters
File Structure¶
src/harombe/memory/
├── __init__.py
├── storage.py # SQLite backend implementation
├── manager.py # MemoryManager class
├── strategies.py # Windowing/summarization strategies
└── schema.py # Pydantic models for session/message
tests/memory/
├── __init__.py
├── test_storage.py
├── test_manager.py
└── test_strategies.py
Token Estimation¶
Simple heuristic for token counting:
def estimate_tokens(message: Message) -> int:
"""Rough token estimate: ~4 chars per token."""
text = message.content or ""
# Add tool call overhead
if message.tool_calls:
for tc in message.tool_calls:
text += json.dumps(tc.arguments)
return len(text) // 4
For production, consider using tiktoken library for accurate counts.
Error Handling¶
- Session not found: Create new session automatically
- Database locked: Retry with exponential backoff
- Token limit exceeded: Fall back to most recent messages only
- Corrupted data: Log error, continue without memory
Testing Strategy¶
- Unit tests: Storage CRUD operations
- Integration tests: Agent + memory workflows
- Performance tests: Large conversation history
- Concurrency tests: Multiple agents sharing storage
Security Considerations¶
- No encryption in Phase 2.1 (local SQLite only)
- Future: Add encryption at rest for sensitive conversations
- Access control: Not needed for single-user local setup
- PII handling: Covered in Phase 2.3 (Privacy Router)
Migration Path¶
For users upgrading:
- Memory is opt-in via config
- Existing agents work without changes
- No data migration needed (fresh start)
- Old sessions can be imported via CLI tool (future)
Phase 2.2: Semantic Search & RAG¶
Overview¶
Phase 2.2 extends the memory system with semantic search capabilities using vector embeddings. This enables agents to find relevant context from past conversations even when the exact wording differs, powering Retrieval-Augmented Generation (RAG) for more intelligent, context-aware responses.
Architecture Extension¶
┌───────────────────────────────────────────────────────────────────┐
│ Agent with RAG │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Memory Manager │ │
│ │ - Session management │ │
│ │ - Message filtering & windowing │ │
│ │ - Semantic search (NEW) │ │
│ │ - RAG context retrieval (NEW) │ │
│ └────────┬────────────────────────────┬─────────────────────┘ │
│ │ │ │
│ ┌────────▼─────────────────┐ ┌──────▼───────────────────────┐ │
│ │ Storage Backend │ │ Vector Store (ChromaDB) │ │
│ │ - SQLite database │ │ - Embeddings storage │ │
│ │ - Message CRUD │ │ - Similarity search │ │
│ └──────────────────────────┘ │ - HNSW indexing │ │
│ └──────▲───────────────────────┘ │
│ │ │
│ ┌──────┴───────────────────────┐ │
│ │ Embedding Client │ │
│ │ - sentence-transformers │ │
│ │ - Ollama (optional) │ │
│ └──────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Components¶
1. Embedding Client¶
Generates vector embeddings (numerical representations) for text:
class EmbeddingClient(Protocol):
async def embed(self, texts: list[str]) -> list[list[float]]:
"""Generate embeddings for multiple texts."""
async def embed_single(self, text: str) -> list[float]:
"""Generate embedding for a single text."""
@property
def dimension(self) -> int:
"""Embedding dimension (e.g., 384 for all-MiniLM-L6-v2)."""
Implementations:
- SentenceTransformerEmbedding (default) - Local, privacy-first
- Model:
sentence-transformers/all-MiniLM-L6-v2 - Dimension: 384
-
No API calls, runs locally on CPU or GPU
-
OllamaEmbedding - Uses Ollama for embeddings
- Leverages existing Ollama infrastructure
- Larger models available (e.g.,
nomic-embed-text)
2. Vector Store¶
Stores and searches embeddings using approximate nearest neighbor algorithms:
class VectorStore(Protocol):
def add(
self,
ids: list[str],
embeddings: list[list[float]],
documents: list[str],
metadata: list[dict[str, Any]],
) -> None:
"""Add embeddings to the store."""
def search(
self,
query_embedding: list[float],
top_k: int = 10,
where: dict[str, Any] | None = None,
) -> tuple[list[str], list[str], list[dict], list[float]]:
"""Search for similar embeddings. Returns (ids, docs, metadata, distances)."""
Implementation: ChromaDBVectorStore
- Lightweight, embedded database
- Uses HNSW (Hierarchical Navigable Small World) for fast search
- Cosine similarity for distance metric
- Supports metadata filtering (e.g., by session_id)
- Persistent or in-memory storage
3. Enhanced Memory Manager¶
Extended with semantic search capabilities:
class MemoryManager:
def __init__(
self,
storage_path: Path,
max_history_tokens: int = 4096,
embedding_client: EmbeddingClient | None = None,
vector_store: VectorStore | None = None,
):
# Enable semantic search if both components provided
self.semantic_search_enabled = (
embedding_client is not None and vector_store is not None
)
def save_message(self, session_id: str, message: Message) -> int:
"""Save message to SQLite AND auto-embed to vector store."""
message_id = self.storage.save_message(record)
# Auto-embed if semantic search enabled
if self.semantic_search_enabled and message.content:
self._embed_message(message_id, session_id, message)
return message_id
async def search_similar(
self,
query: str,
top_k: int = 5,
session_id: str | None = None,
min_similarity: float | None = None,
) -> list[Message]:
"""Search for semantically similar messages."""
# Generate query embedding
query_embedding = await self.embedding_client.embed_single(query)
# Search vector store with optional session filter
where = {"session_id": session_id} if session_id else None
ids, docs, metadata, distances = self.vector_store.search(
query_embedding=query_embedding,
top_k=top_k,
where=where,
)
# Filter by similarity threshold and convert to Messages
results = []
for doc, meta, distance in zip(docs, metadata, distances):
similarity = 1.0 - distance # Convert distance to similarity
if min_similarity and similarity < min_similarity:
continue
results.append(Message(role=meta["role"], content=doc))
return results
async def get_relevant_context(
self,
query: str,
max_tokens: int = 2048,
session_id: str | None = None,
) -> list[Message]:
"""Get relevant context within token budget."""
# Over-fetch candidates
candidates = await self.search_similar(
query=query,
top_k=20,
session_id=session_id,
)
# Filter by token budget
results = []
current_tokens = 0
for msg in candidates:
msg_tokens = estimate_tokens(msg)
if current_tokens + msg_tokens > max_tokens:
break
results.append(msg)
current_tokens += msg_tokens
return results
4. RAG-Enabled Agent¶
Agent retrieves relevant context before LLM calls:
class Agent:
def __init__(
self,
llm: LLMClient,
tools: list[Tool],
memory_manager: MemoryManager | None = None,
session_id: str | None = None,
enable_rag: bool = False,
rag_top_k: int = 5,
rag_min_similarity: float = 0.7,
):
self.enable_rag = enable_rag
self.rag_top_k = rag_top_k
self.rag_min_similarity = rag_min_similarity
async def run(self, user_message: str) -> str:
# Load conversation history
state = self._load_history()
# Retrieve relevant context if RAG enabled
rag_context = None
if self.enable_rag and self.memory_manager:
rag_context = await self._retrieve_rag_context(user_message)
# Inject context into user message
if rag_context:
enhanced_message = self._format_message_with_context(
user_message, rag_context
)
state.add_user_message(enhanced_message)
else:
state.add_user_message(user_message)
# Save ORIGINAL message (without RAG context) to memory
if self.memory_manager:
self.memory_manager.save_message(
self.session_id,
Message(role="user", content=user_message),
)
# ... ReAct loop ...
def _format_message_with_context(
self,
user_message: str,
context: list[Message],
) -> str:
"""Format enhanced message with retrieved context."""
lines = [
"RELEVANT CONTEXT FROM PAST CONVERSATIONS:",
"---",
]
for msg in context:
role = msg.role.upper()
content = msg.content[:200] # Truncate long messages
if len(msg.content) > 200:
content += "..."
lines.append(f"[{role}]: {content}")
lines.extend([
"---",
"",
"Now answer the current user question using the context above if relevant.",
"",
f"USER QUESTION: {user_message}",
])
return "\n".join(lines)
Embedding Schema¶
Embeddings are stored with metadata for filtering and retrieval:
{
"id": "msg_12345", # Unique identifier
"embedding": [0.1, 0.2, ...], # 384-dim vector (for all-MiniLM)
"document": "message text", # Original text
"metadata": {
"session_id": "abc-123", # Session identifier
"message_id": 12345, # Database message ID
"role": "user", # Message role
"timestamp": 1234567890, # Unix timestamp (optional)
}
}
Configuration¶
memory:
enabled: true
storage_path: ~/.harombe/memory.db
max_history_tokens: 4096
# Vector store configuration
vector_store:
enabled: true
backend: chromadb # Only chromadb supported now
embedding_model: sentence-transformers/all-MiniLM-L6-v2 # Model to use
embedding_provider: sentence-transformers # Local embeddings
persist_directory: ~/.harombe/vectors # Storage directory (null = in-memory)
collection_name: harombe_embeddings # Collection name
# RAG configuration
rag:
enabled: true
top_k: 5 # Number of similar messages to retrieve
min_similarity: 0.7 # Similarity threshold (0.0-1.0)
How It Works¶
- Message saved → Automatically embedded and stored in vector database
- User query → If RAG enabled, retrieve similar messages
- Context injection → Format retrieved messages into prompt
- LLM call → Agent generates response with enhanced context
- Response saved → Stored in both SQLite and vector store
Performance Characteristics¶
- Embedding generation: ~10-50ms per message (CPU), ~1-5ms (GPU)
- Vector search: Sub-millisecond for <10K messages, ~10ms for 100K+
- Storage overhead: ~1.5KB per message (384-dim float32 embedding)
- Memory usage: ChromaDB loads index into RAM (~2MB per 1K messages)
Use Cases¶
- Cross-session knowledge: "What did we discuss about Python last week?"
- Related topics: Find messages about similar topics with different wording
- Context-aware responses: Agent recalls relevant past information
- Knowledge base: Search entire conversation history semantically
- Multi-agent memory: Multiple agents sharing a knowledge pool
Testing Strategy¶
- Unit tests: Embedding clients, vector store operations
- Integration tests: MemoryManager with semantic search
- Agent tests: RAG functionality with mocked LLM
- Performance tests: Large-scale embedding and retrieval
- Similarity tests: Verify semantic matching works correctly
Privacy & Security¶
- Local embeddings - sentence-transformers runs entirely offline
- No API calls - All processing happens on your hardware
- Data isolation - Vector store stored locally alongside SQLite
- Optional encryption - ChromaDB supports encryption at rest (future)
Backward Compatibility¶
- Semantic search is opt-in via config
- All new parameters have defaults
- Existing code works without changes
- Memory can be enabled without semantic search
- Semantic search requires both
embedding_clientandvector_store
Future Enhancements¶
- Alternative embeddings: Support for OpenAI, Cohere, custom models
- Hybrid search: Combine keyword and semantic search
- Metadata indexing: Filter by date, user, tags, importance
- Automatic importance scoring: Identify key messages for retention
- Multi-modal embeddings: Support for images, audio (CLIP, etc.)
- Backfill optimization: Batch embedding for large existing datasets