Pooled Apple Silicon at Scale
Multiple Mac Minis networked together — each contributing its unified memory and Apple GPU cores to a shared inference pool. Run the largest open-source models and serve them to an entire organisation, privately and without cloud costs.
- 150–400+ tokens/second — pooled Apple Silicon throughput
- 405B+ model parameters — runs the largest open-source LLMs
- Concurrent users — as many as the local network and cluster throughput can carry
- Starting from ~$5,000 — cost scales with the number of nodes
Built for: organisations · schools · research teams · anyone needing private LLM serving for multiple users simultaneously
How the cluster works
- Node 1 (Primary) — orchestrates inference and handles API requests
- Node 2 — contributes its unified memory and GPU cores to the pool
- Node 3+ — each additional node adds memory and compute to the pool
- Gigabit switch — low-latency inter-node communication
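In practice, a pool like this can be assembled with llama.cpp's RPC backend: each secondary node runs an rpc-server worker, and the primary launches llama-server pointed at those workers so model layers are spread across the pooled memory. Below is a minimal launch sketch for the primary node; the IP addresses, port, and model path are placeholders, not values from this page.

```python
# Minimal launch sketch for the primary node, assuming llama.cpp's
# rpc-server is already running on each secondary node, e.g.:
#   rpc-server --host 0.0.0.0 --port 50052
# IPs, port, and model path below are hypothetical placeholders.
import subprocess

WORKERS = ["192.168.1.11:50052", "192.168.1.12:50052"]  # Node 2, Node 3

subprocess.run([
    "llama-server",
    "-m", "models/model.gguf",     # GGUF model on the primary's disk
    "--rpc", ",".join(WORKERS),    # offload layers across pooled workers
    "-ngl", "99",                  # push all layers to the GPU backends
    "--host", "0.0.0.0",           # serve to the whole local network
    "--port", "8080",              # OpenAI-compatible HTTP endpoint
])
```

Any machine on the LAN can then send requests to the primary node's port 8080.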
- Pooled Apple Silicon — unified memory shared across nodes via llama.cpp's RPC backend
- Serve models to your whole team or organisation over the local network
- Extremely power efficient — ~8–35 W per node
- Silent, desktop-sized — no special server room needed
- Start with 2 nodes, expand anytime
- OpenAI-compatible API — drop-in for existing tools and SDKs
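Because the endpoint speaks the OpenAI wire format, existing client libraries work unchanged. Here is a minimal sketch using the official openai Python package; the host address and model name are placeholders for whatever the cluster actually serves.

```python
# Point the stock OpenAI client at the cluster's primary node.
# The base_url host and the model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.10:8080/v1",  # primary node on the LAN
    api_key="not-needed",                    # local server ignores the key
)

response = client.chat.completions.create(
    model="llama-3.1-405b-q4",  # placeholder: whatever model the cluster loads
    messages=[{"role": "user", "content": "Summarise our onboarding doc."}],
)
print(response.choices[0].message.content)
```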
Specifications (per node)
| Specification | Detail |
|---|---|
| Unit | Apple Mac Mini (M4 or M4 Pro) |
| Memory per node | 24–64 GB unified (pooled across cluster) |
| Inference speed | ~40–80 tok/s per node · 150–400+ tok/s pooled |
| Models supported | Up to 405B+ parameters (across the full cluster) |
| Networking | Gigabit Ethernet · low-latency RPC between nodes |
| Power per node | ~8–35 W (exceptionally efficient) |
| OS | macOS (latest) · llama.cpp RPC stack |
| Minimum nodes | 2 (expandable) |
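As a rough sizing check for the 405B+ claim: at ~4-bit quantization, weights take about 0.5 bytes per parameter, plus headroom for KV cache and runtime buffers. The figures in the sketch below are illustrative assumptions, not measured values.

```python
# Back-of-envelope node count for a quantized model. The 0.5 bytes/param
# (Q4) and 20% overhead figures are assumptions; real usage varies with
# context length and quantization scheme.
import math

def nodes_needed(params_billion: float, node_gb: int = 64,
                 bytes_per_param: float = 0.5, overhead: float = 1.2) -> int:
    weights_gb = params_billion * bytes_per_param   # weight memory in GB
    total_gb = weights_gb * overhead                # + KV cache / buffers
    return math.ceil(total_gb / node_gb)

print(nodes_needed(405))  # -> 4 nodes of 64 GB for a 405B model at ~Q4
```

By that estimate, a 405B model at 4-bit fits in roughly four 64 GB nodes, consistent with pooling memory across the cluster.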