Conversation Memory System Design¶

Table of Contents¶

Overview
Goals
Architecture
Memory Strategies
Integration Points
Implementation Plan
Phase 2.2: Semantic Search & RAG

Overview¶

The conversation memory system enables harombe agents to maintain context across sessions, remember past interactions, and provide continuity in multi-turn conversations.

Goals¶

Persistence - Conversations survive application restarts
Efficiency - Fast retrieval without loading entire history
Scalability - Handle long conversations within model token limits
Simplicity - SQLite backend, no external dependencies
Extensibility - Easy to swap backends (PostgreSQL, Redis) later

Architecture¶

Components¶

┌─────────────────────────────────────────────────┐
│                    Agent                        │
│  ┌─────────────────────────────────────────┐   │
│  │         Memory Manager                   │   │
│  │  - Session management                    │   │
│  │  - Message filtering                     │   │
│  │  - Context windowing                     │   │
│  │  - Summarization                         │   │
│  └──────────────┬──────────────────────────┘   │
│                 │                               │
│  ┌──────────────▼──────────────────────────┐   │
│  │         Storage Backend                  │   │
│  │  - SQLite database                       │   │
│  │  - CRUD operations                       │   │
│  │  - Indexing                              │   │
│  └─────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘

Storage Schema (SQLite)¶

-- Sessions table
CREATE TABLE sessions (
    id TEXT PRIMARY KEY,              -- UUID
    created_at TIMESTAMP NOT NULL,    -- Session creation time
    updated_at TIMESTAMP NOT NULL,    -- Last activity
    metadata TEXT,                    -- JSON: user, tags, etc.
    system_prompt TEXT                -- System prompt used
);

-- Messages table
CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,         -- FK to sessions
    role TEXT NOT NULL,               -- user, assistant, system, tool
    content TEXT,                     -- Message content
    tool_calls TEXT,                  -- JSON: tool calls if any
    tool_call_id TEXT,                -- For tool responses
    name TEXT,                        -- Tool name for tool messages
    created_at TIMESTAMP NOT NULL,    -- Message timestamp
    FOREIGN KEY (session_id) REFERENCES sessions(id) ON DELETE CASCADE
);

-- Indexes for performance
CREATE INDEX idx_messages_session ON messages(session_id);
CREATE INDEX idx_messages_created ON messages(created_at);
CREATE INDEX idx_sessions_updated ON sessions(updated_at);
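
One detail the schema above depends on: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set on each connection, so without it the `ON DELETE CASCADE` clause is ignored. The following is a minimal sketch using the standard `sqlite3` module, with an in-memory database and an abbreviated copy of the schema. The `open_db` helper is hypothetical and not part of harombe.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Abbreviated copy of the schema above (tool columns and indexes omitted).
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    metadata TEXT,
    system_prompt TEXT
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    role TEXT NOT NULL,
    content TEXT,
    created_at TIMESTAMP NOT NULL,
    FOREIGN KEY (session_id) REFERENCES sessions(id) ON DELETE CASCADE
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Hypothetical helper: open a connection with foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
now = datetime.now(timezone.utc).isoformat()
session_id = str(uuid.uuid4())

conn.execute(
    "INSERT INTO sessions (id, created_at, updated_at) VALUES (?, ?, ?)",
    (session_id, now, now),
)
conn.execute(
    "INSERT INTO messages (session_id, role, content, created_at) VALUES (?, ?, ?, ?)",
    (session_id, "user", "hello", now),
)

# Deleting the session should cascade to its messages.
conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
remaining = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(remaining)
```

With the pragma enabled, no message rows remain after the session is deleted.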

Memory Manager API¶

class MemoryManager:
    """High-level memory management."""

    def create_session(self, system_prompt: str, metadata: dict) -> str:
        """Create a new conversation session. Returns session_id."""

    def load_session(self, session_id: str) -> list[Message]:
        """Load conversation history for a session."""

    def save_message(self, session_id: str, message: Message) -> None:
        """Save a message to the session."""

    def get_recent_messages(
        self,
        session_id: str,
        max_tokens: int = 4096
    ) -> list[Message]:
        """Get recent messages within token limit."""

    def list_sessions(self, limit: int = 10) -> list[dict]:
        """List recent sessions with metadata."""

    def delete_session(self, session_id: str) -> None:
        """Delete a session and all its messages."""

    def prune_old_sessions(self, days: int = 30) -> int:
        """Delete sessions older than N days. Returns count deleted."""

Memory Strategies¶

1. Simple Windowing (Phase 2.1)¶

Load the most recent messages that fit within the token limit:

def get_recent_messages(session_id, max_tokens):
    # Newest first, capped at the last 100 messages
    messages = load_messages(session_id, order='DESC', limit=100)

    result = []
    total_tokens = 0

    for msg in messages:  # Newest to oldest, so the latest turns are kept
        tokens = estimate_tokens(msg)
        if total_tokens + tokens > max_tokens:
            break
        result.append(msg)
        total_tokens += tokens

    result.reverse()  # Restore chronological order
    return result

Pros: Simple, fast, predictable
Cons: May lose important context from earlier in conversation
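
A self-contained sketch of the windowing idea, walking backwards from the newest message so that the most recent turns are the ones kept. The `Message` dataclass and the `window` function are simplified stand-ins for illustration, not the project's actual types.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Simplified stand-in for the real message model."""
    role: str
    content: str


def estimate_tokens(message: Message) -> int:
    # Same rough heuristic as the design: about 4 characters per token.
    return len(message.content) // 4


def window(messages: list[Message], max_tokens: int) -> list[Message]:
    """Return the newest messages that fit in max_tokens, oldest first.

    `messages` is expected in chronological order.
    """
    kept: list[Message] = []
    total = 0
    for msg in reversed(messages):  # newest to oldest
        tokens = estimate_tokens(msg)
        if total + tokens > max_tokens:
            break
        kept.append(msg)
        total += tokens
    kept.reverse()  # back to chronological order
    return kept


history = [
    Message("user", "a" * 40),       # ~10 tokens
    Message("assistant", "b" * 40),  # ~10 tokens
    Message("user", "c" * 40),       # ~10 tokens
]
recent = window(history, max_tokens=25)
print([m.content[0] for m in recent])
```

With a budget of 25 tokens, only the last two messages fit, and they come back in their original order.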

2. Summarization (Future)¶

Summarize old messages to compress history:

def get_context_with_summary(session_id, max_tokens):
    # Get all messages
    all_messages = load_messages(session_id)

    # If within limit, return as-is (estimate_tokens takes a single message)
    if sum(estimate_tokens(m) for m in all_messages) <= max_tokens:
        return all_messages

    # Otherwise, summarize old context
    old_msgs = all_messages[:-20]  # All but recent 20
    recent_msgs = all_messages[-20:]

    # Nothing older than the recent window, so there is nothing to summarize
    if not old_msgs:
        return recent_msgs

    summary = summarize_conversation(old_msgs)

    return [
        Message(role="system", content=f"Previous context: {summary}"),
        *recent_msgs
    ]

Pros: Preserves important context, better continuity
Cons: More complex, requires LLM call, lossy compression
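
A runnable sketch of the summarization strategy, with the LLM call replaced by a trivial stand-in so the control flow can be exercised. `context_with_summary`, `naive_summarize`, and the `keep_recent` parameter are hypothetical names; a real implementation would call the configured LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Simplified stand-in for the real message model."""
    role: str
    content: str


def estimate_tokens(message: Message) -> int:
    return len(message.content) // 4  # rough heuristic, ~4 chars per token


def context_with_summary(
    messages: list[Message],
    max_tokens: int,
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 20,
) -> list[Message]:
    """Replace everything but the last keep_recent messages with a summary."""
    if sum(estimate_tokens(m) for m in messages) <= max_tokens:
        return list(messages)

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:  # nothing beyond the recent window to compress
        return recent

    summary = summarize(old)
    return [Message("system", f"Previous context: {summary}"), *recent]


def naive_summarize(messages: list[Message]) -> str:
    # Stand-in for an LLM call; only reports how much was dropped.
    return f"{len(messages)} earlier messages omitted"


history = [Message("user", "x" * 400) for _ in range(30)]
context = context_with_summary(history, max_tokens=1000, summarize=naive_summarize)
print(len(context), context[0].content)
```

Note that the result is not guaranteed to fit in `max_tokens`: here the 20 retained messages alone are estimated at about 2000 tokens. Applying the windowing strategy to the retained messages is one way to enforce the limit.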

3. Hybrid (Future)¶

Combine summarization with selective important message retention:

Keep system messages always
Summarize routine exchanges
Flag and retain important messages (user decisions, key facts)
Keep recent N messages verbatim
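
The four rules above could be combined roughly as follows. This is only a sketch: the `important` flag, the `hybrid_context` function, and the `summarize` callback are assumptions made for illustration, since the design does not yet specify how importance is recorded.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Simplified stand-in; `important` is a hypothetical flag."""
    role: str
    content: str
    important: bool = False


def hybrid_context(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 2,
) -> list[Message]:
    cut = max(len(messages) - keep_recent, 0)
    older, recent = messages[:cut], messages[cut:]

    system = [m for m in older if m.role == "system"]        # always kept
    flagged = [m for m in older if m.important and m.role != "system"]
    routine = [m for m in older if not m.important and m.role != "system"]

    context = list(system)
    if routine:  # compress routine exchanges into one summary message
        context.append(
            Message("system", f"Summary of earlier exchanges: {summarize(routine)}")
        )
    context.extend(flagged)  # important messages kept verbatim
    context.extend(recent)   # recent messages kept verbatim
    return context


history = [
    Message("system", "You are helpful."),
    Message("user", "hi"),
    Message("assistant", "hello"),
    Message("user", "Use PostgreSQL for the backend.", important=True),
    Message("assistant", "Noted."),
    Message("user", "What next?"),
]
context = hybrid_context(history, summarize=lambda msgs: f"{len(msgs)} routine messages")
print([m.role for m in context])
```

One trade-off in this sketch is that flagged messages are placed after the summary, so their exact position relative to the summarized exchanges is lost.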

Integration Points¶

Agent Class Modifications¶

class Agent:
    def __init__(
        self,
        llm: LLMClient,
        tools: list[Tool],
        memory_manager: MemoryManager | None = None,
        session_id: str | None = None,
        ...
    ):
        self.memory = memory_manager
        self.session_id = session_id

        # Load history if session exists
        if self.memory and self.session_id:
            self.state = self._load_session_state()
        else:
            self.state = AgentState(system_prompt)

    async def run(self, query: str) -> str:
        # Add user message
        self.state.add_user_message(query)

        # Save to memory if enabled
        if self.memory:
            self.memory.save_message(
                self.session_id,
                self.state.messages[-1]
            )

        # ... ReAct loop ...

        # Save assistant response
        if self.memory:
            self.memory.save_message(
                self.session_id,
                self.state.messages[-1]
            )

        return response

CLI Integration¶

New commands in harombe chat:

/sessions          List recent conversation sessions
/load <session>    Load and continue a previous session
/save              Force save current session
/history           Show conversation history for current session
/clear             Clear current session history (but keep in DB)
/forget            Delete current session from memory

Configuration Schema¶

memory:
  enabled: true # Enable conversation memory
  storage_path: ~/.harombe/memory.db # SQLite database location
  max_history_tokens: 4096 # Token limit for context
  auto_prune_days: 30 # Auto-delete old sessions
  strategy: simple # simple, summarization, hybrid
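
A sketch of how the `memory` section might be validated once it has been parsed into a dictionary. It uses a standard-library dataclass to stay dependency-free; the project may well use Pydantic models instead, and `MemoryConfig` and `load_memory_config` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path

VALID_STRATEGIES = {"simple", "summarization", "hybrid"}


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical typed view of the `memory` config section."""
    enabled: bool = True
    storage_path: Path = Path("~/.harombe/memory.db").expanduser()
    max_history_tokens: int = 4096
    auto_prune_days: int = 30
    strategy: str = "simple"


def load_memory_config(raw: dict) -> MemoryConfig:
    """Build a MemoryConfig from an already-parsed mapping and validate it."""
    values = dict(raw)
    if "storage_path" in values:
        values["storage_path"] = Path(values["storage_path"]).expanduser()

    config = MemoryConfig(**values)  # unknown keys raise TypeError

    if config.strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown memory strategy: {config.strategy!r}")
    if config.max_history_tokens <= 0:
        raise ValueError("max_history_tokens must be positive")
    if config.auto_prune_days < 0:
        raise ValueError("auto_prune_days must not be negative")
    return config


config = load_memory_config({"strategy": "hybrid", "max_history_tokens": 2048})
print(config.strategy, config.max_history_tokens, config.auto_prune_days)
```

Omitted keys fall back to the defaults shown in the configuration above.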

Implementation Plan¶

Phase 2.1: Basic Memory (Complete)¶

✅ Design architecture (this document)
✅ Implement SQLite storage backend
✅ Implement simple windowing strategy
✅ Integrate with Agent class
✅ Add CLI commands
✅ Add configuration
✅ Write tests
✅ Update documentation

Phase 2.2: Semantic Search & RAG (Complete)¶

✅ Design vector store architecture
✅ Implement embedding service (sentence-transformers, Ollama)
✅ Implement ChromaDB vector store
✅ Integrate with MemoryManager (auto-embedding)
✅ Add semantic search capabilities
✅ Implement RAG for agent
✅ Write comprehensive tests
✅ Update documentation and examples

Phase 2.3: Advanced Memory (Future)¶

Implement summarization strategy
Add message importance scoring
Implement hybrid strategy
Export/import sessions
Alternative backends (PostgreSQL, Redis)
Cloud storage adapters

File Structure¶

src/harombe/memory/
├── __init__.py
├── storage.py          # SQLite backend implementation
├── manager.py          # MemoryManager class
├── strategies.py       # Windowing/summarization strategies
└── schema.py           # Pydantic models for session/message

tests/memory/
├── __init__.py
├── test_storage.py
├── test_manager.py
└── test_strategies.py

Token Estimation¶

Simple heuristic for token counting:

import json

def estimate_tokens(message: Message) -> int:
    """Rough token estimate: ~4 chars per token."""
    text = message.content or ""

    # Add tool call overhead
    if message.tool_calls:
        for tc in message.tool_calls:
            text += json.dumps(tc.arguments)

    return len(text) // 4

For production, consider a real tokenizer (for example, the tiktoken library for OpenAI-compatible models) for more accurate counts.

Error Handling¶

Session not found: Create new session automatically
Database locked: Retry with exponential backoff
Token limit exceeded: Fall back to most recent messages only
Corrupted data: Log error, continue without memory
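
The "retry with exponential backoff" policy for a locked database could look like the sketch below. `with_retry` and its parameters are hypothetical; the `sleep` argument is injectable only so the example can run without actually waiting.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Run operation, retrying on 'database is locked' with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            is_last = attempt == attempts - 1
            if "locked" not in str(exc) or is_last:
                raise  # unrelated error, or out of retries
            sleep(base_delay * (2 ** attempt))  # 0.05s, 0.1s, 0.2s, ...
    raise RuntimeError("unreachable")


# Demonstration with an operation that fails twice, then succeeds.
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise sqlite3.OperationalError("database is locked")
    return "ok"


result = with_retry(flaky, sleep=delays.append)
print(result, delays)
```

The `timeout` argument of `sqlite3.connect` already makes a connection wait for a lock before raising, so a retry wrapper like this is a second line of defence rather than a replacement for it.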

Testing Strategy¶

Unit tests: Storage CRUD operations
Integration tests: Agent + memory workflows
Performance tests: Large conversation history
Concurrency tests: Multiple agents sharing storage

Security Considerations¶

No encryption in Phase 2.1 (local SQLite only)
Future: Add encryption at rest for sensitive conversations
Access control: Not needed for single-user local setup
PII handling: Covered in Phase 2.3 (Privacy Router)

Migration Path¶

For users upgrading:

Memory is opt-in via config
Existing agents work without changes
No data migration needed (fresh start)
Old sessions can be imported via CLI tool (future)

Phase 2.2: Semantic Search & RAG¶

Overview¶

Phase 2.2 extends the memory system with semantic search capabilities using vector embeddings. This enables agents to find relevant context from past conversations even when the exact wording differs, powering Retrieval-Augmented Generation (RAG) for more intelligent, context-aware responses.

Architecture Extension¶

┌───────────────────────────────────────────────────────────────────┐
│                           Agent with RAG                          │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │              Memory Manager                                │   │
│  │  - Session management                                      │   │
│  │  - Message filtering & windowing                           │   │
│  │  - Semantic search (NEW)                                   │   │
│  │  - RAG context retrieval (NEW)                             │   │
│  └────────┬────────────────────────────┬─────────────────────┘   │
│           │                            │                          │
│  ┌────────▼─────────────────┐  ┌──────▼───────────────────────┐ │
│  │   Storage Backend        │  │   Vector Store (ChromaDB)    │ │
│  │   - SQLite database      │  │   - Embeddings storage       │ │
│  │   - Message CRUD         │  │   - Similarity search        │ │
│  └──────────────────────────┘  │   - HNSW indexing            │ │
│                                 └──────▲───────────────────────┘ │
│                                        │                          │
│                                 ┌──────┴───────────────────────┐ │
│                                 │   Embedding Client           │ │
│                                 │   - sentence-transformers    │ │
│                                 │   - Ollama (optional)        │ │
│                                 └──────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘

Components¶

1. Embedding Client¶

Generates vector embeddings (numerical representations) for text:

from typing import Protocol

class EmbeddingClient(Protocol):
    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for multiple texts."""

    async def embed_single(self, text: str) -> list[float]:
        """Generate embedding for a single text."""

    @property
    def dimension(self) -> int:
        """Embedding dimension (e.g., 384 for all-MiniLM-L6-v2)."""

Implementations:

SentenceTransformerEmbedding (default) - Local, privacy-first
  Model: sentence-transformers/all-MiniLM-L6-v2
  Dimension: 384
  No API calls, runs locally on CPU or GPU
OllamaEmbedding - Uses Ollama for embeddings
  Leverages existing Ollama infrastructure
  Larger models available (e.g., nomic-embed-text)
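
For tests that should not download a model, a deterministic toy client with the same three members as the protocol can be handy. The `HashingEmbedding` class below is a hypothetical stand-in: it hashes words into buckets, so it only reflects word overlap and is not a substitute for sentence-transformers or Ollama embeddings.

```python
import asyncio
import hashlib
import math


class HashingEmbedding:
    """Toy embedding client for tests; captures word overlap only."""

    def __init__(self, dimension: int = 16) -> None:
        self._dimension = dimension

    @property
    def dimension(self) -> int:
        return self._dimension

    async def embed(self, texts: list[str]) -> list[list[float]]:
        return [self._embed_one(text) for text in texts]

    async def embed_single(self, text: str) -> list[float]:
        return (await self.embed([text]))[0]

    def _embed_one(self, text: str) -> list[float]:
        vector = [0.0] * self._dimension
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self._dimension
            vector[bucket] += 1.0
        # L2-normalize so cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        return [v / norm for v in vector]


client = HashingEmbedding()
vec = asyncio.run(client.embed_single("hello world"))
print(len(vec))
```

Because the output depends only on the input text, repeated calls return identical vectors, which keeps test assertions stable.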

2. Vector Store¶

Stores and searches embeddings using approximate nearest neighbor algorithms:

from typing import Any, Protocol

class VectorStore(Protocol):
    def add(
        self,
        ids: list[str],
        embeddings: list[list[float]],
        documents: list[str],
        metadata: list[dict[str, Any]],
    ) -> None:
        """Add embeddings to the store."""

    def search(
        self,
        query_embedding: list[float],
        top_k: int = 10,
        where: dict[str, Any] | None = None,
    ) -> tuple[list[str], list[str], list[dict], list[float]]:
        """Search for similar embeddings. Returns (ids, docs, metadata, distances)."""

Implementation: ChromaDBVectorStore

Lightweight, embedded database
Uses HNSW (Hierarchical Navigable Small World) for fast search
Cosine similarity for distance metric
Supports metadata filtering (e.g., by session_id)
Persistent or in-memory storage
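
To illustrate the contract of `add` and `search`, here is a brute-force in-memory store with the same method shapes. It is an exact linear scan rather than an HNSW index and does not use ChromaDB at all; `InMemoryVectorStore` is a hypothetical stand-in that may be useful for unit tests.

```python
import math
from typing import Any


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class InMemoryVectorStore:
    """Exact (non-approximate) stand-in following the VectorStore shape."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], str, dict[str, Any]]] = {}

    def add(self, ids, embeddings, documents, metadata) -> None:
        for id_, emb, doc, meta in zip(ids, embeddings, documents, metadata):
            self._rows[id_] = (emb, doc, meta)

    def search(self, query_embedding, top_k=10, where=None):
        scored = []
        for id_, (emb, doc, meta) in self._rows.items():
            # Equality-only metadata filter, e.g. {"session_id": "a"}.
            if where and any(meta.get(k) != v for k, v in where.items()):
                continue
            scored.append((cosine_distance(query_embedding, emb), id_, doc, meta))
        scored.sort(key=lambda row: row[0])  # closest first
        top = scored[:top_k]
        return (
            [row[1] for row in top],
            [row[2] for row in top],
            [row[3] for row in top],
            [row[0] for row in top],
        )


store = InMemoryVectorStore()
store.add(
    ids=["m1", "m2", "m3"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    documents=["about cats", "about dogs", "about both"],
    metadata=[{"session_id": "a"}, {"session_id": "a"}, {"session_id": "b"}],
)
ids, docs, meta, distances = store.search([1.0, 0.0], top_k=2, where={"session_id": "a"})
print(ids, distances)
```

The session filter excludes `m3`, and the remaining results come back ordered from smallest to largest cosine distance.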

3. Enhanced Memory Manager¶

Extended with semantic search capabilities:

class MemoryManager:
    def __init__(
        self,
        storage_path: Path,
        max_history_tokens: int = 4096,
        embedding_client: EmbeddingClient | None = None,
        vector_store: VectorStore | None = None,
    ):
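        # Keep references for use in search_similar()
        self.embedding_client = embedding_client
        self.vector_store = vector_store

        # Enable semantic search if both components provided
        self.semantic_search_enabled = (
            embedding_client is not None and vector_store is not None
        )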

    def save_message(self, session_id: str, message: Message) -> int:
        """Save message to SQLite AND auto-embed to vector store."""
        # `record` is the storage-layer form of `message` (construction omitted here)
        message_id = self.storage.save_message(record)

        # Auto-embed if semantic search enabled
        if self.semantic_search_enabled and message.content:
            self._embed_message(message_id, session_id, message)

        return message_id

    async def search_similar(
        self,
        query: str,
        top_k: int = 5,
        session_id: str | None = None,
        min_similarity: float | None = None,
    ) -> list[Message]:
        """Search for semantically similar messages."""
        # Generate query embedding
        query_embedding = await self.embedding_client.embed_single(query)

        # Search vector store with optional session filter
        where = {"session_id": session_id} if session_id else None
        ids, docs, metadata, distances = self.vector_store.search(
            query_embedding=query_embedding,
            top_k=top_k,
            where=where,
        )

        # Filter by similarity threshold and convert to Messages
        results = []
        for doc, meta, distance in zip(docs, metadata, distances):
            similarity = 1.0 - distance  # Cosine distance -> cosine similarity
            if min_similarity is not None and similarity < min_similarity:
                continue
            results.append(Message(role=meta["role"], content=doc))

        return results

    async def get_relevant_context(
        self,
        query: str,
        max_tokens: int = 2048,
        session_id: str | None = None,
    ) -> list[Message]:
        """Get relevant context within token budget."""
        # Over-fetch candidates
        candidates = await self.search_similar(
            query=query,
            top_k=20,
            session_id=session_id,
        )

        # Filter by token budget
        results = []
        current_tokens = 0
        for msg in candidates:
            msg_tokens = estimate_tokens(msg)
            if current_tokens + msg_tokens > max_tokens:
                break
            results.append(msg)
            current_tokens += msg_tokens

        return results

4. RAG-Enabled Agent¶

Agent retrieves relevant context before LLM calls:

class Agent:
    def __init__(
        self,
        llm: LLMClient,
        tools: list[Tool],
        memory_manager: MemoryManager | None = None,
        session_id: str | None = None,
        enable_rag: bool = False,
        rag_top_k: int = 5,
        rag_min_similarity: float = 0.7,
    ):
        self.memory_manager = memory_manager
        self.session_id = session_id
        self.enable_rag = enable_rag
        self.rag_top_k = rag_top_k
        self.rag_min_similarity = rag_min_similarity

    async def run(self, user_message: str) -> str:
        # Load conversation history
        state = self._load_history()

        # Retrieve relevant context if RAG enabled
        rag_context = None
        if self.enable_rag and self.memory_manager:
            rag_context = await self._retrieve_rag_context(user_message)

        # Inject context into user message
        if rag_context:
            enhanced_message = self._format_message_with_context(
                user_message, rag_context
            )
            state.add_user_message(enhanced_message)
        else:
            state.add_user_message(user_message)

        # Save ORIGINAL message (without RAG context) to memory
        if self.memory_manager:
            self.memory_manager.save_message(
                self.session_id,
                Message(role="user", content=user_message),
            )

        # ... ReAct loop ...

    def _format_message_with_context(
        self,
        user_message: str,
        context: list[Message],
    ) -> str:
        """Format enhanced message with retrieved context."""
        lines = [
            "RELEVANT CONTEXT FROM PAST CONVERSATIONS:",
            "---",
        ]

        for msg in context:
            role = msg.role.upper()
            content = msg.content[:200]  # Truncate long messages
            if len(msg.content) > 200:
                content += "..."
            lines.append(f"[{role}]: {content}")

        lines.extend([
            "---",
            "",
            "Now answer the current user question using the context above if relevant.",
            "",
            f"USER QUESTION: {user_message}",
        ])

        return "\n".join(lines)
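
To make the injected prompt concrete, the formatting logic can be pulled out into a standalone function and run on two sample messages. This mirrors the method above; the free function `format_message_with_context`, its `max_chars` parameter, and the simplified `Message` dataclass are introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Simplified stand-in for the real message model."""
    role: str
    content: str


def format_message_with_context(
    user_message: str,
    context: list[Message],
    max_chars: int = 200,
) -> str:
    lines = ["RELEVANT CONTEXT FROM PAST CONVERSATIONS:", "---"]
    for msg in context:
        content = msg.content[:max_chars]  # truncate long messages
        if len(msg.content) > max_chars:
            content += "..."
        lines.append(f"[{msg.role.upper()}]: {content}")
    lines.extend([
        "---",
        "",
        "Now answer the current user question using the context above if relevant.",
        "",
        f"USER QUESTION: {user_message}",
    ])
    return "\n".join(lines)


prompt = format_message_with_context(
    "Which database did we pick?",
    [
        Message("user", "Let's use SQLite for storage."),
        Message("assistant", "Agreed, SQLite keeps things simple."),
    ],
)
print(prompt)
```

The printed prompt starts with the context header, lists each retrieved message as `[ROLE]: text`, and ends with the original question on a `USER QUESTION:` line.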

Embedding Schema¶

Embeddings are stored with metadata for filtering and retrieval:

{
    "id": "msg_12345",           # Unique identifier
    "embedding": [0.1, 0.2, ...], # 384-dim vector (for all-MiniLM)
    "document": "message text",   # Original text
    "metadata": {
        "session_id": "abc-123",  # Session identifier
        "message_id": 12345,      # Database message ID
        "role": "user",           # Message role
        "timestamp": 1234567890,  # Unix timestamp (optional)
    }
}

Configuration¶

memory:
  enabled: true
  storage_path: ~/.harombe/memory.db
  max_history_tokens: 4096

  # Vector store configuration
  vector_store:
    enabled: true
    backend: chromadb # Only chromadb supported now
    embedding_model: sentence-transformers/all-MiniLM-L6-v2 # Model to use
    embedding_provider: sentence-transformers # Local embeddings
    persist_directory: ~/.harombe/vectors # Storage directory (null = in-memory)
    collection_name: harombe_embeddings # Collection name

  # RAG configuration
  rag:
    enabled: true
    top_k: 5 # Number of similar messages to retrieve
    min_similarity: 0.7 # Similarity threshold (0.0-1.0)

How It Works¶

Message saved → Automatically embedded and stored in vector database
User query → If RAG enabled, retrieve similar messages
Context injection → Format retrieved messages into prompt
LLM call → Agent generates response with enhanced context
Response saved → Stored in both SQLite and vector store

Performance Characteristics¶

Embedding generation: ~10-50ms per message (CPU), ~1-5ms (GPU)
Vector search: Sub-millisecond for <10K messages, ~10ms for 100K+
Storage overhead: ~1.5KB per message (384-dim float32 embedding)
Memory usage: ChromaDB loads index into RAM (~2MB per 1K messages)
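
The storage figure follows from simple arithmetic: 384 float32 values at 4 bytes each is 1536 bytes, or 1.5 KiB per message. The sketch below reproduces that calculation for raw vectors only; index structures, documents, and metadata add to it, which is presumably why the memory estimate above is somewhat higher. The function name is hypothetical.

```python
DIMENSION = 384        # all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_messages: int, dimension: int = DIMENSION) -> int:
    """Raw vector storage only; excludes index, documents, and metadata."""
    return num_messages * dimension * BYTES_PER_FLOAT32


per_message_kib = embedding_bytes(1) / 1024
per_1k_mib = embedding_bytes(1_000) / (1024 ** 2)
per_100k_mib = embedding_bytes(100_000) / (1024 ** 2)

print(f"{per_message_kib:.2f} KiB per message")
print(f"{per_1k_mib:.2f} MiB per 1K messages")
print(f"{per_100k_mib:.1f} MiB per 100K messages")
```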

Use Cases¶

Cross-session knowledge: "What did we discuss about Python last week?"
Related topics: Find messages about similar topics with different wording
Context-aware responses: Agent recalls relevant past information
Knowledge base: Search entire conversation history semantically
Multi-agent memory: Multiple agents sharing a knowledge pool

Testing Strategy¶

Unit tests: Embedding clients, vector store operations
Integration tests: MemoryManager with semantic search
Agent tests: RAG functionality with mocked LLM
Performance tests: Large-scale embedding and retrieval
Similarity tests: Verify semantic matching works correctly

Privacy & Security¶

Local embeddings - sentence-transformers runs entirely offline
No API calls - All processing happens on your hardware
Data isolation - Vector store stored locally alongside SQLite
Optional encryption - Encryption at rest for the vector store (future)

Backward Compatibility¶

Semantic search is opt-in via config
All new parameters have defaults
Existing code works without changes
Memory can be enabled without semantic search
Semantic search requires both embedding_client and vector_store

Future Enhancements¶

Alternative embeddings: Support for OpenAI, Cohere, custom models
Hybrid search: Combine keyword and semantic search
Metadata indexing: Filter by date, user, tags, importance
Automatic importance scoring: Identify key messages for retention
Multi-modal embeddings: Support for images, audio (CLIP, etc.)
Backfill optimization: Batch embedding for large existing datasets