GUIDE · INFRASTRUCTURE

Register a heterogeneous fleet, route by available VRAM.

machines-server treats every workstation, app host and edge node as a first-class entity with hardware profile, available models and live telemetry. This guide walks you through adding a second machine and routing agents to whichever node has the model loaded and the headroom to run it.

Why route across machines

One workstation with a 24 GB GPU is enough for most solo development. A second machine starts to matter as soon as you want a 70B-class model resident on one node and a small fast model resident on another, or when you want a reviewer agent on a CPU-only host while the writer agent stays on the GPU machine. machines-server is how those agents find each other.

1. Describe each machine in YAML

Drop one file per machine into the machines/ directory at the root of your machines-server checkout. The format is plain YAML; the operator owns it.

machines/workstation.yaml
# machines/workstation.yaml
name: workstation
role: gpu-host
hardware:
  cpu: 16-core x86_64
  ram_gb: 128
  gpu:
    - model: 24GB consumer GPU
      vram_gb: 24
    - model: 24GB consumer GPU
      vram_gb: 24
storage_gb: 8000
network:
  hostname: workstation.lan
  port: 11434          # Ollama
models:
  - llama3.1:70b
  - qwen2.5:32b
  - nomic-embed-text
tags: [primary, gpu, training]

And a second one for a CPU-only host:

machines/app-host.yaml
# machines/app-host.yaml
name: app-host
role: cpu-host
hardware:
  cpu: 16-core x86_64
  ram_gb: 64
  gpu: []              # no GPU, CPU-only
storage_gb: 2000
network:
  hostname: app-host.lan
  port: 11434
models:
  - phi3:mini
  - qwen2.5:3b
tags: [secondary, cpu]
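
With both files in place, machines/ holds one YAML per node:

terminal
ls machines/
# app-host.yaml  workstation.yaml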

The required keys are name, role, hardware, network.hostname and models. Everything else is optional and shows up unmodified in the API response — handy for tagging ("primary", "training", "edge"), grouping by site, or pinning licence terms.
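
If you want to catch a malformed file before machines-server reads it, the required keys are easy to pre-flight in a few lines. A minimal sketch, not part of machines-server (validate_machine.py and check_machine_yaml are our names; requires PyYAML):

validate_machine.py
import sys

import yaml  # PyYAML

REQUIRED = ["name", "role", "hardware", "models"]

def check_machine_yaml(path: str) -> list[str]:
    """Return a list of problems; empty means the required keys are present."""
    with open(path) as f:
        doc = yaml.safe_load(f)
    problems = [f"missing key: {key}" for key in REQUIRED if key not in doc]
    # network.hostname is required too, but sits one level down.
    if "hostname" not in doc.get("network", {}):
        problems.append("missing key: network.hostname")
    return problems

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for problem in check_machine_yaml(path):
            print(f"{path}: {problem}")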

2. Confirm machines-server sees them

machines-server hot-reloads on read. No restart, no compose rebuild. After dropping the YAML:

terminal
# machines-server is hot-reloading: drop a new YAML and curl will see
# the machine on the next /api/machines call. Sanity check:
curl -s http://127.0.0.1:3038/api/machines | jq '.machines[].name'

# Output:
# "workstation"
# "app-host"

3. Watch live telemetry

Each machine periodically reports CPU, RAM, GPU VRAM (via nvidia-smi), and which models are currently loaded by Ollama. The aggregate endpoint:

terminal
curl -s http://127.0.0.1:3038/api/infrastructure | jq '.'

What you get back:

response
{
  "machines": [
    {
      "name": "workstation",
      "role": "gpu-host",
      "status": "online",
      "network": { "hostname": "workstation.lan", "port": 11434 },
      "telemetry": {
        "cpu_percent": 12.4,
        "ram_used_gb": 38.1,
        "gpu": [
          { "vram_used_gb": 18.2, "vram_free_gb": 5.8, "utilisation": 47 },
          { "vram_used_gb":  0.5, "vram_free_gb": 23.5, "utilisation":  0 }
        ]
      },
      "loaded_models": ["llama3.1:70b"]
    },
    {
      "name": "app-host",
      "role": "cpu-host",
      "status": "online",
      "network": { "hostname": "app-host.lan", "port": 11434 },
      "telemetry": { "cpu_percent": 4.0, "ram_used_gb": 12.3 },
      "loaded_models": ["phi3:mini"]
    }
  ]
}

Telemetry is collected by an agent process running on each host (the machines-server sidecar). If a machine drops off the network, status moves to "offline" on the next poll cycle; agents looking for it should fall back to a peer.
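
What "fall back to a peer" means is policy, not mechanism, and machines-server leaves it to you. One simple policy: retry against any online machine that shares a tag with the one that went dark. A sketch, assuming the tags key from step 1 passes through to /api/infrastructure like every other optional key (online_peers_of is our name):

peers.py
import requests

def online_peers_of(name: str) -> list[dict]:
    """Online machines sharing at least one tag with `name`."""
    infra = requests.get("http://127.0.0.1:3038/api/infrastructure", timeout=5).json()
    machines = {m["name"]: m for m in infra["machines"]}
    tags = set(machines[name].get("tags", []))
    return [
        m for m in machines.values()
        if m["name"] != name
        and m["status"] == "online"
        and tags & set(m.get("tags", []))
    ]

Note that in the two-machine fleet above the tag sets are disjoint ({primary, gpu, training} vs {secondary, cpu}), so neither host is a peer of the other; tag by capability if you want this policy to have somewhere to fall.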

4. Route agents to the right machine

The simplest router picks the machine with the most free VRAM that already has the requested model loaded. Loading a large model from cold can take 10–60 s, so preferring a host where the model is already resident is the dominant optimisation.

router.py
import requests

def pick_machine_for(model: str, min_free_vram_gb: float = 0):
    """Find a machine that has the model loaded and enough VRAM headroom."""
    infra = requests.get("http://127.0.0.1:3038/api/infrastructure", timeout=5).json()
    candidates = []
    for m in infra["machines"]:
        if m["status"] != "online":
            continue
        if model not in m.get("loaded_models", []):
            continue
        gpu = m.get("telemetry", {}).get("gpu", [])
        # No GPU (or no gpu telemetry) means free == 0, so CPU-only hosts
        # only qualify when min_free_vram_gb is left at its default of 0.
        free = max((g["vram_free_gb"] for g in gpu), default=0)
        if free >= min_free_vram_gb:
            candidates.append((free, m["name"], m["network"]["hostname"]))
    if not candidates:
        raise RuntimeError(f"no machine has {model} with {min_free_vram_gb} GB free")
    candidates.sort(reverse=True)        # most-free first
    _, name, host = candidates[0]
    return name, host

# Usage: route a heavy job to whichever GPU host has the headroom.
if __name__ == "__main__":
    name, host = pick_machine_for("llama3.1:70b", min_free_vram_gb=4)
    print(f"sending request to {name} at {host}")

Plug the result into your agent's request — Cortex, your custom client, or a dispatcher that hands the job over via chat (@workstation-agent please run this).
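
To close the loop, here is what the hand-off can look like over Ollama's /api/generate endpoint, reusing pick_machine_for from router.py. The dispatch helper is our sketch, and it assumes the machine's Ollama port is the 11434 declared under network.port in both YAML files:

dispatch.py
import requests

from router import pick_machine_for

def dispatch(model: str, prompt: str, min_free_vram_gb: float = 4) -> str:
    """Route a prompt to the best host, then run it via Ollama."""
    name, host = pick_machine_for(model, min_free_vram_gb)
    print(f"sending request to {name} at {host}")
    resp = requests.post(
        f"http://{host}:11434/api/generate",  # port from network.port
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # cold 70B-class generations can be slow
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(dispatch("llama3.1:70b", "Summarise this diff in one paragraph."))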

Going further