Why route across machines
One workstation with a 24 GB GPU is enough for most solo development. Two machines start mattering as soon as you want a 70B-class model resident on one and a small fast model resident on another, or when you want a reviewer agent on a CPU-only host while the writer agent stays on the GPU machine. machines-server is how those agents find each other.
1. Describe each machine in YAML
Drop one file per machine into the machines/ directory at the root of your consciousness-server checkout. The format is plain YAML; the operator owns it.
# machines/workstation.yaml
name: workstation
role: gpu-host
hardware:
  cpu: 16-core x86_64
  ram_gb: 128
  gpu:
    - model: 24GB consumer GPU
      vram_gb: 24
    - model: 24GB consumer GPU
      vram_gb: 24
  storage_gb: 8000
network:
  hostname: workstation.lan
  port: 11434  # Ollama
models:
  - llama3.1:70b
  - qwen2.5:32b
  - nomic-embed-text
tags: [primary, gpu, training]
And a second one for a CPU-only host:
# machines/app-host.yaml
name: app-host
role: cpu-host
hardware:
  cpu: 16-core x86_64
  ram_gb: 64
  gpu: []  # no GPU, CPU-only
  storage_gb: 2000
network:
  hostname: app-host.lan
  port: 11434
models:
  - phi3:mini
  - qwen2.5:3b
tags: [secondary, cpu]
The required keys are name, role, hardware, network.hostname and models. Everything else is optional and shows up unmodified in the API response — handy for tagging ("primary", "training", "edge"), grouping by site, or pinning licence terms.
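For reference, a minimal file that supplies only the required keys looks like this (the name, hostname, and model are placeholders, not a real host):

```yaml
# machines/minimal.yaml — only the required keys
name: minimal
role: cpu-host
hardware:
  cpu: 8-core x86_64
network:
  hostname: minimal.lan
models:
  - phi3:mini
```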
2. Confirm machines-server sees them
machines-server hot-reloads on read. No restart, no compose rebuild. After dropping the YAML:
# machines-server is hot-reloading: drop a new YAML and curl will see
# the machine on the next /api/infrastructure call. Sanity check:
curl -s http://127.0.0.1:3038/api/machines | jq '.machines[].name'
# Output:
# "workstation"
# "app-host" 3. Watch live telemetry
Each machine periodically reports CPU, RAM, GPU VRAM
(via nvidia-smi), and which models are
currently loaded by Ollama. The aggregate endpoint:
curl -s http://127.0.0.1:3038/api/infrastructure | jq '.'
What you get back:
{
  "machines": [
    {
      "name": "workstation",
      "role": "gpu-host",
      "status": "online",
      "telemetry": {
        "cpu_percent": 12.4,
        "ram_used_gb": 38.1,
        "gpu": [
          { "vram_used_gb": 18.2, "vram_free_gb": 5.8, "utilisation": 47 },
          { "vram_used_gb": 0.5, "vram_free_gb": 23.5, "utilisation": 0 }
        ]
      },
      "loaded_models": ["llama3.1:70b"]
    },
    {
      "name": "app-host",
      "role": "cpu-host",
      "status": "online",
      "telemetry": { "cpu_percent": 4.0, "ram_used_gb": 12.3 },
      "loaded_models": ["phi3:mini"]
    }
  ]
}
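A quick worked example of consuming that shape: the sketch below (the function name is mine, not part of the API) totals free VRAM across all online machines, using the numbers from the sample response above.

```python
def total_free_vram_gb(infra: dict) -> float:
    """Sum vram_free_gb across all GPUs on all online machines."""
    total = 0.0
    for m in infra["machines"]:
        if m["status"] != "online":
            continue
        # CPU-only hosts have no "gpu" key in telemetry; treat as zero.
        for g in m.get("telemetry", {}).get("gpu", []):
            total += g["vram_free_gb"]
    return total

# Condensed version of the sample /api/infrastructure response above:
sample = {
    "machines": [
        {"name": "workstation", "status": "online",
         "telemetry": {"gpu": [{"vram_free_gb": 5.8}, {"vram_free_gb": 23.5}]}},
        {"name": "app-host", "status": "online", "telemetry": {}},
    ]
}
print(total_free_vram_gb(sample))  # 5.8 + 23.5 across the two workstation GPUs
```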
Telemetry is collected by an agent process running on each host (the machines-server sidecar). If a machine drops off the network, status moves to "offline" on the next poll cycle; agents looking for it should fall back to a peer.
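One way to implement that fallback, assuming you treat any online machine with the same role as an acceptable peer (the helper name and the role-matching policy are mine, not machines-server behaviour):

```python
def pick_with_fallback(infra: dict, preferred: str) -> dict:
    """Return the preferred machine if online, else an online peer with the same role."""
    machines = {m["name"]: m for m in infra["machines"]}
    want = machines[preferred]
    if want["status"] == "online":
        return want
    # Preferred host dropped off the network: fall back to a same-role peer.
    for m in infra["machines"]:
        if m["name"] != preferred and m["role"] == want["role"] and m["status"] == "online":
            return m
    raise RuntimeError(f"{preferred} is offline and no {want['role']} peer is online")

sample = {"machines": [
    {"name": "workstation", "role": "gpu-host", "status": "offline"},
    {"name": "workstation-2", "role": "gpu-host", "status": "online"},
]}
print(pick_with_fallback(sample, "workstation")["name"])  # workstation-2
```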
4. Route agents to the right machine
The simplest router picks the machine with the most VRAM free that already has the requested model loaded. Loading a model from cold can take 10–60 s on a large model; using a host where it's already resident is the dominant optimisation.
import requests

def pick_machine_for(model: str, min_free_vram_gb: float = 0):
    """Find a machine that has the model loaded and enough VRAM headroom."""
    infra = requests.get("http://127.0.0.1:3038/api/infrastructure", timeout=5).json()
    candidates = []
    for m in infra["machines"]:
        if m["status"] != "online":
            continue
        if model not in m.get("loaded_models", []):
            continue
        gpu = m.get("telemetry", {}).get("gpu", [])
        free = max((g["vram_free_gb"] for g in gpu), default=0)
        if free >= min_free_vram_gb:
            candidates.append((free, m["name"], m["network"]["hostname"]))
    if not candidates:
        raise RuntimeError(f"no machine has {model} with {min_free_vram_gb} GB free")
    candidates.sort(reverse=True)  # most-free first
    _, name, host = candidates[0]
    return name, host

# Usage: route a heavy job to whichever GPU host has the headroom.
name, host = pick_machine_for("llama3.1:70b", min_free_vram_gb=4)
print(f"sending request to {name} at {host}")
Plug the result into your agent's request — Cortex, your custom client, or a dispatcher that hands the job over via chat (@workstation-agent please run this).
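If the target host speaks the standard Ollama HTTP API (the YAML above exposes port 11434), plugging the routed host in can look like this sketch — the helper name is mine, and the commented lines assume pick_machine_for from the previous step:

```python
def ollama_generate_payload(host: str, model: str, prompt: str, port: int = 11434):
    """Build the URL and JSON body for a non-streaming Ollama /api/generate call."""
    url = f"http://{host}:{port}/api/generate"
    body = {"model": model, "prompt": prompt, "stream": False}
    return url, body

# Wiring it to the router (requires the `requests` package and a live host):
#   name, host = pick_machine_for("llama3.1:70b", min_free_vram_gb=4)
#   url, body = ollama_generate_payload(host, "llama3.1:70b", "Summarise this diff")
#   print(requests.post(url, json=body, timeout=120).json()["response"])

print(ollama_generate_payload("workstation.lan", "llama3.1:70b", "hi")[0])
# http://workstation.lan:11434/api/generate
```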
Going further
- Platform / Machines page → for the full machines-server feature surface (MCP support, AUTH_MODE, future use cases beyond servers).
- Secure deployment → when machines on different sites need signed requests over a VPN.
- Write a custom agent → that picks tasks based on which machine it lives on.