The Architecture — How a Multi-Agent System Actually Works
Why One Agent Isn't Enough
The first mistake everyone makes: "I'll just set up one AI agent and have it do everything."
That's like hiring one person to do CS, accounting, marketing, inventory management, and wholesale sales. Even if they were brilliant, they'd be context-switching constantly and doing everything poorly.
The architecture that works is specialized agents with a shared brain.
The Multi-Agent Architecture
┌─────────────────────────────┐
│ 🧠 AURELIO (Central Hub) │
│ Strategy + Orchestration │
│ Cloud VPS (EU) │
└───────────┬─────────────────┘
│ Tailscale Mesh
┌───────────┬───────────┼───────────┬───────────────┐
│ │ │ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌───────▼───────┐
│🦋Mafalda│ │📊Ferland│ │🏪 Gala │ │💻Donatel│ │🏖️ Olivia │
│ CS │ │ Finance │ │ Retail │ │Digital │ │Merchandising │
│ │ │ │ │ │ │Marketing│ │& Wholesale │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └───────┬───────┘
│ │ │ │ │
└───────────┴───────────┴───────────┴───────────────┘
Mac Mini (secondary host)
┌─────────────────────────────┐
│ SHARED BRAIN (1,341 files) │
│ rsync ↔ every 30 minutes │
│ + MCP Server (44 tools) │
└─────────────────────────────┘
The production topology: Aurelio (hub/strategy) runs on a cloud VPS in the EU. The five domain agents run on a dedicated Mac Mini. All connected via Tailscale encrypted mesh. A shared brain syncs bidirectionally every 30 minutes via rsync.
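The 30-minute brain sync described above can be sketched in Python. This is a hedged illustration, not the production script: the paths and host names are hypothetical, and `--update` (skip files newer on the receiving side) stands in for whatever conflict policy your deployment actually uses.

```python
import shlex

# Hypothetical paths -- substitute your own hub and node brain directories.
HUB_BRAIN = "aurelio@hub-vps:/srv/brain/"
LOCAL_BRAIN = "/Users/agents/brain/"

def rsync_cmd(src: str, dst: str) -> list[str]:
    """Build an rsync invocation for one direction of the brain sync."""
    return [
        "rsync",
        "-azu",           # archive mode, compress, skip files newer on receiver
        "--exclude", ".git/",
        src,
        dst,
    ]

# A bidirectional sync is two one-way passes, e.g. scheduled from cron:
# */30 * * * * sync-brain.sh
pull = rsync_cmd(HUB_BRAIN, LOCAL_BRAIN)
push = rsync_cmd(LOCAL_BRAIN, HUB_BRAIN)
print(shlex.join(pull))
```

Skipping files that are newer on the receiver is the simplest way to make two naive passes converge without clobbering fresh writes on either side; a real deployment would want a proper conflict policy on top.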
Design Principles
1. Each Agent Has a Clear Domain
Every agent owns a specific area of the business. This means:
- Clear responsibility: If something goes wrong with inventory, you know which agent to look at
- Focused context: The CS agent doesn't need to know about wholesale pricing formulas
- Independent scaling: You can run the CS agent on a beefier model (Opus) while the reporting agent runs on a cheaper one (Sonnet/Flash)
2. Agents Communicate Through the Hub
In the ideal architecture, agents communicate through a central orchestrator — this prevents chaos and creates an audit trail. In practice, the coordination often flows through team messaging channels (Slack, WhatsApp) where both agents and humans participate. The key is that every cross-domain request is logged and traceable, whether it routes through the hub directly or through a shared channel.
Example flow:
1. CS Agent receives complaint: "My order hasn't arrived"
2. CS Agent checks Shopify order status → sees it's marked "fulfilled"
3. CS Agent escalates to Ops Agent via hub: "Order #4521 marked fulfilled but customer says not received"
4. Ops Agent checks 3PL tracking → finds package stuck in customs
5. Ops Agent responds to hub with status update
6. Hub routes back to CS Agent with full context
7. CS Agent drafts customer response with accurate tracking info and ETA
All of this happens in seconds. No human needed unless confidence is low.
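The hub-routing pattern behind that flow can be sketched in a few lines. This is a minimal illustration, not OpenClaw's actual API: the `HubMessage` shape and agent names are assumptions for the example.

```python
from dataclasses import dataclass, field

# Hypothetical message shape -- illustrative only, not a real framework API.
@dataclass
class HubMessage:
    sender: str
    recipient: str
    subject: str
    context: dict = field(default_factory=dict)

class Hub:
    """Minimal orchestrator: routes cross-domain requests and keeps an audit trail."""
    def __init__(self) -> None:
        self.audit_log: list[HubMessage] = []

    def route(self, msg: HubMessage) -> HubMessage:
        self.audit_log.append(msg)  # every cross-domain hop is logged
        return msg

hub = Hub()
# Step 3 of the flow: CS escalates a fulfilled-but-missing order to Ops.
hub.route(HubMessage(
    sender="cs_agent", recipient="ops_agent",
    subject="Order #4521 marked fulfilled but customer says not received",
))
# Steps 5-6: Ops replies through the hub with carrier status and full context.
hub.route(HubMessage(
    sender="ops_agent", recipient="cs_agent",
    subject="Package stuck in customs",
    context={"order": "#4521", "carrier_status": "customs_hold"},
))
print(len(hub.audit_log))  # both hops are in the audit trail
```

The point of the pattern is that the hub sees, and records, every cross-domain hop, which is what makes the flow auditable.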
3. Shared Knowledge, Separate Personalities
All agents share:
- Product catalog (descriptions, pricing, inventory levels)
- Brand voice guidelines (tone, vocabulary, do's and don'ts)
- Customer history (past orders, past issues, lifetime value)
- Business rules (return policy, shipping times, discount limits)
But each agent has its own:
- SOUL.md — personality, expertise, communication style
- Tools — only the integrations it needs
- Decision thresholds — when to act autonomously vs. escalate to human
4. Human-in-the-Loop by Default
This is critical. The system is designed with graduated autonomy:
| Confidence Level | Action |
|---|---|
| > 95% | Agent acts autonomously |
| 80-95% | Agent acts but flags for async human review |
| 60-80% | Agent drafts response, human must approve |
| < 60% | Agent escalates to human with full context |
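The table above maps directly to a dispatch function. A minimal sketch, with tier names chosen for the example:

```python
def autonomy_tier(confidence: float) -> str:
    """Map a 0-100 confidence score to the graduated-autonomy action."""
    if confidence > 95:
        return "act_autonomously"
    if confidence >= 80:
        return "act_and_flag_for_review"
    if confidence >= 60:
        return "draft_for_human_approval"
    return "escalate_to_human"

print(autonomy_tier(92))  # falls in the 80-95% band
```

Raising autonomy over time then means nothing more than lowering these boundary values as the audit log shows the agent earning trust in each band.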
Over time, as the system learns and you trust it more, you adjust these thresholds upward. In a typical deployment:
- Month 1: 40% autonomous
- Month 3: 65% autonomous
- Month 6: 82% autonomous
- Month 12: 91% autonomous (current)
The remaining 9% are genuinely complex cases that benefit from human judgment — legal issues, VIP customers, PR-sensitive situations.
5. Everything is Logged and Auditable
Every agent action is logged:
- What decision was made
- What data was used
- What confidence level was assigned
- Whether a human reviewed it
- What the outcome was
This isn't just good practice — it's a GDPR requirement if you're operating in Europe, and it's how the system learns and improves.
The Technology Stack
At the core, the system runs on OpenClaw — an open-source AI agent framework that connects LLMs to real-world actions.
| Component | What We Use | Why |
|---|---|---|
| Agent Runtime | OpenClaw | Open source, local-first, 50+ integrations, active community |
| Primary LLM | Mixed: GPT-5.4 (hub), Opus 4.6 (CS), Qwen 3.6 Plus free (4 agents) | Best model per role; free-tier for 4 of 7 agents |
| Secondary LLMs | GPT-5.4 / Codex / Gemini Flash | Cost-effective for simpler tasks, coding agents |
| Hosting | Dedicated server (Hetzner) | €40/mo, full control, EU data residency |
| Agent-to-Agent | ACP (Agent Communication Protocol) | Native cross-agent coordination |
| Messaging | WhatsApp, Slack, Email | Meet teams where they already work |
| Memory | Local files + structured knowledge base | No vendor lock-in, full data ownership |
| Monitoring | Built-in heartbeats + cron jobs | Self-healing, auto-restart |
| ERP / Accounting | Holded | Invoicing, payment reconciliation, ledger |
| Expense Management | Payhawk | Corporate cards, expense tracking, bank statements |
| Knowledge Base | Notion + Brain Context Tree | 409 docs organized by domain, auto-indexed |
| Social Listening | Agent-Reach + bird CLI | Twitter/X monitoring, multi-platform scanning |
| Semantic Search | Exa | Better than Google for competitive research |
| Image Generation | Krea AI | Product shots, creative assets |
| Power-User Layer | Claude Code + MCP | Founder's direct interface — 44 tools, slash commands, subagents |
Total system cost: €352/month all-in (infrastructure + LLM access + subscriptions). See Ch.12 for the complete cost breakdown.
Compare this to roughly €6,500/month in equivalent saved labor hours (62 hours/week × €21/h loaded operational labor, plus founder opportunity cost). That is a verifiable 18:1 ratio, calculated in Ch.12 with every assumption on the table.
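The arithmetic behind the comparison is easy to reproduce. One assumption is added here: a weekly-to-monthly factor of 52/12; the gap between the hourly math and the €6,500 figure is the founder opportunity cost mentioned above.

```python
# Reproduce the headline ratio from the stated assumptions.
hours_per_week = 62
loaded_rate_eur = 21        # EUR/hour loaded operational labor
weeks_per_month = 52 / 12   # assumption: average weeks per month

labor_saved = hours_per_week * loaded_rate_eur * weeks_per_month
system_cost = 352           # EUR/month all-in

print(round(labor_saved))                  # hourly labor alone, before opportunity cost
print(round(6500 / system_cost, 1))        # ratio at the rounded EUR 6,500 figure
```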
What You Need to Get Started
Minimum viable setup (1 agent):
- A Linux server or VPS (€20-40/month)
- An LLM API key (Anthropic, OpenAI, or similar)
- OpenClaw installed and configured
- One channel connection (Slack, WhatsApp, or email)
- 2-4 hours for initial setup
Full deployment (8 agents):
- VPS (€15/month) + Mac Mini as secondary compute (~€22/month amortized)
- LLM API access (MiniMax M2.5 primary + Sonnet/Opus fallbacks)
- All channel connections configured
- Integrations with your existing tools (Shopify, Klaviyo, etc.)
- 2-4 weeks for full implementation and tuning
- A second compute node (Mac Mini or equivalent) for agent isolation
A note on agent roles: The production deployment documented in this playbook runs six agents across six domains: CS, Finance, Retail, Digital Marketing, Merchandising & Wholesale, and a central hub for strategy and orchestration. In practice, you'll adapt these to your org chart. Wholesale was absorbed into Merchandising because the same operational patterns apply (account management, order flow, pricing). Finance is now a standalone domain. The architecture is modular by design — start with the domains that hurt most, split or merge as you scale.
6. Confidence Scoring
Every agent includes a standardized confidence framework that drives graduated autonomy:
| Confidence | Action | Example |
|---|---|---|
| > 95% | Act autonomously | Standard tracking query, stock check |
| 80-95% | Act + flag [REVIEW] | Return within policy, payment reminder |
| 60-80% | Draft for human approval | Complaint response, discount request |
| < 60% | Escalate with full context | Legal issue, VIP escalation |
Agents self-report confidence with every action: `[Confidence: 92%] Responding to tracking query.` This creates an auditable record and enables systematic threshold adjustment over time.
7. Audit Logging
Every agent action is logged to a structured JSONL audit trail:
```json
{"timestamp":"2026-03-28T21:21:14+00:00","agent":"mafalda","action":"cs_ticket_response","confidence":"94%","data":"order_tracking","human_review":false}
```
Monthly rotation. GDPR-compliant. The audit log is the foundation for measuring autonomy rates and detecting quality regressions.
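Measuring the autonomy rate from that trail is a few lines of standard-library Python. The sample records below are invented for illustration, in the JSONL shape shown above:

```python
import json

# Sample audit-trail lines (invented), one JSON object per line.
lines = [
    '{"timestamp":"2026-03-28T21:21:14+00:00","agent":"mafalda",'
    '"action":"cs_ticket_response","confidence":"94%",'
    '"data":"order_tracking","human_review":false}',
    '{"timestamp":"2026-03-28T21:25:02+00:00","agent":"mafalda",'
    '"action":"cs_ticket_response","confidence":"58%",'
    '"data":"complaint","human_review":true}',
]

records = [json.loads(line) for line in lines]
autonomous = sum(1 for r in records if not r["human_review"])
print(f"{autonomous / len(records):.0%}")  # share of actions taken without review
```

Run monthly over the rotated log, the same loop gives the autonomy-rate trend (the 40% → 91% progression above) and flags quality regressions when the rate moves the wrong way.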
The next eight chapters walk through each agent in detail — what it does, how it's configured, and the specific results it delivers.
The Implementation Kit has production templates, scripts, and a 30-day deployment calendar. Everything in this playbook — packaged to build with.
Get the Kit — €299 →