Reference Architecture: Multi-Node Cluster with Privacy Routing¶

A 2-3 node cluster that routes queries based on complexity and privacy sensitivity.

Overview¶

This architecture distributes workload across multiple machines. Simple questions stay on a lightweight local node, complex analysis goes to a powerful server, and privacy-sensitive queries never leave the local network. The privacy router makes the routing decision transparent to users.

Hardware Layout¶

Node	Hardware	Model	Tier	Role
Laptop	MacBook M2, 16 GB	`llama3.1:8b`	0	Coordinator + fast queries
Server A	Linux, 64 GB RAM, RTX 4090	`llama3.1:70b`	2	Complex analysis
Server B (optional)	Linux, 32 GB RAM	`codellama:34b`	1	Code tasks

Architecture¶

┌─────────────────────────────────────┐
│  Laptop (Coordinator, Tier 0)       │
│  harombe chat                       │
│  ┌─────────────┐  ┌──────────────┐  │
│  │Privacy Router│  │Cluster Router│  │
│  │ PII filter   │  │ Complexity   │  │
│  └──────┬──────┘  └──────┬───────┘  │
│         │                │          │
│  ┌──────▼──────┐         │          │
│  │ Ollama      │         │          │
│  │ llama3.1:8b │         │          │
│  └─────────────┘         │          │
└──────────────────────────┼──────────┘
                           │ LAN only
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐  ┌─▼───────────┐
       │  Server A   │  │  Server B   │
       │  Tier 2     │  │  Tier 1     │
       │  70b model  │  │  34b model  │
       └─────────────┘  └─────────────┘

Setup¶

1. Configure the Coordinator (Laptop)¶

# harombe.yaml (laptop)

model:
  name: llama3.1:8b

privacy:
  mode: hybrid
  pii_detection: true
  sensitivity_threshold: 0.5
  cloud_provider:
    # "cloud" here is Server A on LAN, not public cloud
    base_url: http://server-a.local:11434
    model: llama3.1:70b

cluster:
  enabled: true
  node_id: laptop
  tier: 0
  bind_host: "0.0.0.0"
  bind_port: 8000
  nodes:
    - id: server-a
      host: server-a.local
      port: 8000
      tier: 2
      capabilities:
        - reasoning
        - analysis
    - id: server-b
      host: server-b.local
      port: 8000
      tier: 1
      capabilities:
        - code

memory:
  enabled: true
  backend: sqlite
  embedding:
    model: nomic-embed-text
    provider: ollama

tools:
  shell: true
  filesystem: true
  web_search: true
  confirm_dangerous: true

security:
  hitl:
    enabled: true

2. Configure Server A¶

# harombe.yaml (Server A)

model:
  name: llama3.1:70b

cluster:
  enabled: true
  node_id: server-a
  tier: 2
  bind_host: "0.0.0.0"
  bind_port: 8000

ollama:
  host: http://localhost:11434

3. Configure Server B¶

# harombe.yaml (Server B)

model:
  name: codellama:34b

cluster:
  enabled: true
  node_id: server-b
  tier: 1
  bind_host: "0.0.0.0"
  bind_port: 8000

ollama:
  host: http://localhost:11434

4. Start Nodes¶

# On each machine:
harombe start

5. Chat from Coordinator¶

# On laptop:
harombe chat

Query Routing Examples¶

Query	Complexity	Privacy	Routed To
"What time is it?"	Simple	None	Laptop (Tier 0)
"Summarize this report"	Medium	None	Server B (Tier 1)
"Analyze codebase architecture"	Complex	None	Server A (Tier 2)
"Review employee records in HR/"	Complex	PII detected	Laptop (local-only)

Privacy Routing Behavior¶

The privacy router intercepts queries before cluster routing:

PII detected (names, SSNs, medical data) → forced to local node, regardless of complexity
Sensitivity above threshold → local processing only
Clean queries → routed normally by complexity tier

This ensures sensitive data never leaves the coordinator machine, even in a cluster.

Health and Failover¶

Nodes send heartbeats every 30 seconds
Circuit breaker opens after 3 consecutive failures
If a tier is unavailable, the query falls back to the next available tier
The coordinator can handle all queries locally if both servers go down

Monitoring¶

# Check cluster status
harombe cluster status

# View routing statistics
harombe chat
You> /privacy