Multi-Node Cluster Setup Guide¶

Step-by-step guide for setting up a 2-3 node Harombe cluster.

Overview¶

Harombe clusters route queries to different nodes based on task complexity. Simple questions go to fast/lightweight nodes, while complex analysis tasks go to powerful nodes with larger models.

Prerequisites¶

On each machine:

Python 3.11+
Ollama installed and running
Network connectivity between all machines (port 8000 by default)

Architecture¶

┌─────────────────────────────────┐
│  Coordinator (your laptop)      │
│  - Runs harombe chat            │
│  - Routes queries to nodes      │
│  - Handles failover             │
└───────┬──────────┬──────────────┘
        │          │
        ▼          ▼
┌──────────┐  ┌──────────┐
│  Node A  │  │  Node B  │
│  Tier 0  │  │  Tier 2  │
│  3b model│  │  72b model│
└──────────┘  └──────────┘

Step 1: Set Up Worker Nodes¶

Repeat on each machine that will serve as a worker node.

Install Harombe¶

pip install harombe

Pull the model¶

Choose a model appropriate for this node's hardware:

# Lightweight node (4-8GB VRAM)
ollama pull qwen2.5:3b

# Medium node (16GB VRAM)
ollama pull qwen2.5:14b

# Powerful node (48GB+ VRAM)
ollama pull qwen2.5:72b

Configure the node¶

Create ~/.harombe/harombe.yaml:

model:
  name: qwen2.5:14b # The model this node runs

server:
  host: 0.0.0.0 # Listen on all interfaces
  port: 8000

ollama:
  host: http://localhost:11434

Start the node¶

harombe start

Verify it's accessible¶

From another machine:

curl http://<node-ip>:8000/health

You should see:

{ "status": "ok", "model": "qwen2.5:14b" }

Step 2: Configure the Coordinator¶

On the machine where you'll run harombe chat, create ~/.harombe/harombe.yaml:

model:
  name: qwen2.5:7b # Local model (optional, for simple queries)
  temperature: 0.7

agent:
  max_steps: 10

tools:
  shell: true
  filesystem: true
  web_search: true
  confirm_dangerous: true

cluster:
  routing:
    prefer_local: true # Use lowest-latency node when possible
    fallback_strategy: graceful # Fall back to other tiers if preferred unavailable
    load_balance: true # Distribute across same-tier nodes

  nodes:
    - name: laptop
      host: localhost
      port: 8000
      model: qwen2.5:7b
      tier: 0 # Fast: simple queries

    - name: workstation
      host: 192.168.1.100
      port: 8000
      model: qwen2.5:14b
      tier: 1 # Medium: balanced workloads

    - name: server
      host: 192.168.1.200
      port: 8000
      model: qwen2.5:72b
      tier: 2 # Powerful: complex analysis

Step 3: Verify the Cluster¶

# Check cluster status
harombe cluster status

# Test connectivity to all nodes
harombe cluster test

# View performance metrics
harombe cluster metrics

Expected output from harombe cluster status:

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Name        ┃ Host                    ┃ Tier ┃ Model         ┃ Status    ┃ Latency ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ laptop      │ localhost:8000          │ 0    │ qwen2.5:7b    │ available │ 1.2ms   │
│ workstation │ 192.168.1.100:8000      │ 1    │ qwen2.5:14b   │ available │ 5.3ms   │
│ server      │ 192.168.1.200:8000      │ 2    │ qwen2.5:72b   │ available │ 12.1ms  │
└─────────────┴─────────────────────────┴──────┴───────────────┴───────────┴─────────┘

Step 4: Use the Cluster¶

harombe chat

The router automatically selects the best node:

"What is Python?" → Tier 0 (laptop, fast response)
"Explain async/await in Python" → Tier 1 (workstation, balanced)
"Refactor this code, write tests, and explain trade-offs" → Tier 2 (server, powerful model)

Tier Guidelines¶

Tier	Use Case	Typical Hardware	Model Size
0	Simple queries, quick factual answers	Laptop, Mac Mini	1-7B
1	Moderate analysis, explanations	Desktop, workstation	7-30B
2	Complex reasoning, code generation, large context	Server, cloud GPU	30-72B+

Tiers are user-defined — assign based on your judgment.

Fallback Behavior¶

When the preferred tier is unavailable:

Graceful (default): Tries adjacent tiers. If tier 2 is down, tries tier 1, then tier 0.
Strict: Only uses the recommended tier. Returns an error if unavailable.

Troubleshooting¶

Node shows "unavailable"¶

# Check if the node is running
curl http://<node-ip>:8000/health

# Check Ollama is running on the node
curl http://<node-ip>:11434/api/tags

# Check firewall/network
ping <node-ip>

High latency¶

Ensure nodes are on the same network (LAN preferred)
Check for network congestion
Use harombe cluster metrics to identify bottlenecks

Circuit breaker open¶

After repeated failures, the circuit breaker prevents traffic to a failing node. It automatically tests recovery after 60 seconds. Check the node's health and restart if necessary.

Security Considerations¶

Cluster traffic is unencrypted by default. Use SSH tunnels or VPN for sensitive data.
Set auth_token on remote nodes for basic authentication.
Run nodes behind a firewall — don't expose port 8000 to the internet.