Open Source · Apache 2.0 · Privacy by Design

Distributed AI Inference.
Yours. Private. Affordable.

MoE Sovereign orchestrates freely available language models into a powerful Mixture-of-Experts system — distributed across consumer and legacy hardware, fully self-hosted, zero cloud dependency.

12 Expert Domains
20 Precision Tools
4 Cache Layers
0 Cloud Calls
7+ GPU Nodes

What is MoE Sovereign?

Instead of a single massive model on an expensive GPU, many specialized models are coordinated — each running on the hardware best suited for it.

The Problem

Modern Large Language Models like GPT-4 or Claude require five- to six-figure investments in GPU hardware alone for self-hosting — and create a permanent cloud dependency with corresponding privacy risks. For businesses, research institutions, and privacy-conscious users, that is simply not an option.

The Solution: Mixture of Experts

MoE Sovereign distributes inference across a cluster of more affordable nodes. Each request is analyzed by an intelligent planner, routed to the appropriate expert models, and the results are synthesized by a merger model. The outcome: quality comparable to cloud LLMs — with full data control at a fraction of the cost.

The API is fully OpenAI-compatible, so existing tools like Open WebUI, Claude Code, or any OpenAI SDK integration work without modification.

Data Sovereignty

All data stays on your infrastructure. Not a single API call leaves your network — Privacy by Design from the ground up.

Affordability

Instead of one expensive high-end GPU: many affordable consumer GPUs and repurposed enterprise hardware like Tesla M10 and K80 cards.

Scalability

New GPU nodes are simply added to the configuration. Load distribution and VRAM management happen automatically.

Openness

Fully open source under Apache 2.0. No vendor lock-in, no hidden costs, no proprietary stack.

Everything in One Stack

MoE Sovereign ships with all components needed for a production-ready AI infrastructure.

OpenAI-Compatible API

Drop-in replacement for OpenAI endpoints. Every tool that works with /v1/chat/completions works immediately.

12 Expert Domains

Specialized models for math, code, medicine, law, translation, data analysis and more — dynamically routed via LangGraph.

MCP Precision Tools

20 deterministic tools for math, date arithmetic, unit conversion, cryptography and German law — no hallucinations.

Knowledge Graph (GraphRAG)

Neo4j-based knowledge database with automatic extraction and 2-hop traversal for deep contextual knowledge.
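A 2-hop traversal can be illustrated in plain Python over a toy adjacency map — a conceptual sketch only; the real system queries Neo4j, and the entities and edges here are invented:

```python
from collections import deque

# Toy knowledge graph as an adjacency map; in the real system this
# lives in Neo4j and is queried via Cypher.
GRAPH = {
    "Kubernetes": ["Container", "Orchestration"],
    "Container": ["Docker"],
    "Orchestration": ["Scheduler"],
    "Docker": ["OCI"],
}

def two_hop_context(start: str, max_hops: int = 2) -> set[str]:
    """Collect every entity reachable within max_hops of the start node."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in GRAPH.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(sorted(two_hop_context("Kubernetes")))
# → ['Container', 'Docker', 'Orchestration', 'Scheduler']
```

Note that "OCI" is three hops away and therefore excluded — bounding the traversal keeps the injected context small and relevant.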

4-Layer Caching

Semantic similarity search (ChromaDB) + Redis plan cache + GraphRAG cache + performance scores for minimal latency.

Private Web Search

SearXNG integration for real-time research without tracking. Results are incorporated before the response is generated.

User Management & Budgets

SQLite-based user and API key management with token budgets, permissions, and full audit log via Kafka.

Prometheus & Grafana

5 pre-built dashboards for operations, model performance, cache hit rates, tool calls and hardware metrics.

Web Admin Interface

FastAPI + Jinja2 dashboard for live configuration of experts, models, system prompts and users — no container restart needed.

Event Streaming (Kafka)

Apache Kafka in KRaft mode for audit logs, async GraphRAG learning and feedback loop — no Zookeeper overhead.

Vision & Multimodal

Tier-2 experts for image, screenshot and document analysis. Base64 images are passed directly in /v1/chat/completions.
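Building such a request might look like this; the content-parts layout follows the common OpenAI `image_url` convention and is an assumption about MoE Sovereign's exact schema:

```python
import base64
import json

# The image bytes would normally be read from disk; a few raw
# placeholder bytes stand in here.
image_bytes = b"\x89PNG\r\n\x1a\n"
b64 = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "model": "moe-orchestrator",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
}

# This dict is what gets POSTed to /v1/chat/completions.
print(json.dumps(payload)[:60])
```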

Skills System

17 built-in skills (PDF, DOCX, XLSX, code review, MCP builder and more) plus synchronization with upstream skill libraries.

The Processing Pipeline

LangGraph orchestrates a deterministic state graph that routes every request through up to eleven processing nodes.
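The flow can be sketched as a plain transition map — the node names below are hypothetical, and the actual graph is built with LangGraph's StateGraph, including conditional edges for cache hits and tier escalation:

```python
# Hypothetical default edges; the real LangGraph graph has up to
# eleven nodes and branches conditionally (cache hit, escalation).
PIPELINE = {
    "auth": "semantic_cache",
    "semantic_cache": "planner",
    "planner": "expert_t1",
    "expert_t1": "confidence_check",
    "confidence_check": "merger",   # or a Tier-2 expert on low confidence
    "merger": "respond",
    "respond": None,
}

def walk(start: str = "auth") -> list[str]:
    """Follow the default edge from each node until the graph ends."""
    path, node = [], start
    while node is not None:
        path.append(node)
        node = PIPELINE[node]
    return path

print(" → ".join(walk()))
```

Because the graph is deterministic, the same request class always takes the same path — which is what makes per-node caching and metrics meaningful.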

Services at a Glance

Deployed services and their roles
| Service | Technology | Port | Role |
|---|---|---|---|
| Orchestrator | FastAPI + LangGraph | 8002 | OpenAI-compatible API endpoint and pipeline controller |
| Admin UI | FastAPI + Jinja2 | 8088 | Web dashboard for configuration and monitoring |
| MCP Server | FastMCP (Python) | 8003 | 20 deterministic precision tools |
| ChromaDB | ChromaDB | 8001 | Vector database for semantic caching |
| Redis | Redis Stack | 6379 | Plan cache, performance scores, LangGraph checkpoints |
| Neo4j | Neo4j 5 Community | 7474 / 7687 | Knowledge graph for GraphRAG |
| Kafka | Apache Kafka (KRaft) | 9092 | Event streaming, audit log, feedback loop |
| Prometheus | Prometheus | 9090 | Metrics collection (API, GPU, containers, host) |
| Grafana | Grafana | 3001 | 5 pre-provisioned dashboards |
| SearXNG | SearXNG | 8888 | Privacy-respecting meta search engine |

Two-Tier Model Architecture

The system distinguishes two model tiers to dynamically balance speed and quality:

Tier properties and escalation criteria
| Tier | Parameters | VRAM (4-bit) | Use case | Escalation trigger |
|---|---|---|---|---|
| T1 | ≤ 20 B | 8–16 GB | Fast first opinion, handles most requests | Confidence < 0.65 |
| T2 | > 20 B | 16–40 GB | Complex reasoning, low-confidence escalations | Final answer |

4-Layer Caching Architecture

| Layer | Cache | Details | TTL |
|---|---|---|---|
| L1 | Semantic Cache | ChromaDB vector search; cosine distance < 0.15 → direct hit, full pipeline skipped | permanent |
| L2 | Plan Cache | Redis: planner LLM output; saves ~1,600 tokens per cache hit | 30 minutes |
| L3 | GraphRAG Cache | Redis: Neo4j context queries; avoids redundant graph traversals | 1 hour |
| L4 | Performance Scores | Redis: model ratings per category; Laplace smoothing for expert routing | permanent |
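The layered lookup behaves like a cascade. The sketch below compresses it to two layers, with in-memory dicts standing in for ChromaDB and Redis, and exact-match lookup standing in for the real cosine-similarity search:

```python
# In-memory stand-ins for the real backends (ChromaDB / Redis).
semantic_cache: dict[str, str] = {}        # L1: full answers
plan_cache: dict[str, list[str]] = {}      # L2: planner output

def answer(query: str) -> str:
    # L1: a semantic hit skips the whole pipeline.
    if query in semantic_cache:
        return semantic_cache[query]
    # L2: reuse the planner's expert selection if available,
    # otherwise pretend to call the planner LLM.
    plan = plan_cache.get(query) or ["general"]
    plan_cache[query] = plan
    result = f"answered by {plan[0]}"      # pretend expert call
    semantic_cache[query] = result         # warm L1 for next time
    return result

print(answer("What is TCP?"))  # cold path: runs the pipeline
print(answer("What is TCP?"))  # warm path: served from L1
```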

12 Specialized Expert Domains

Every expert is optimized for its domain — with tailored system prompts, model selection and tier strategy.

Expert categories, tier assignment and use cases
| Category | Tier | Example Models | Domain | Special Feature |
|---|---|---|---|---|
| general | T2 | gemma3:27b, qwen3.5:35b | General knowledge, definitions, explanations | – |
| math | T1+T2 | phi4:14b, qwq:32b | Calculations, equations, statistics | MCP + SymPy |
| technical_support | T1+T2 | deepseek-coder-v2:16b, devstral:24b | IT, DevOps, Docker, networking, Linux | – |
| code_reviewer | T2 | devstral:24b, qwen3-coder:30b | Code review, security audits, refactoring | – |
| creative_writer | T2 | gemma3:27b, qwen3.5:35b | Content creation, marketing, storytelling | – |
| medical_consult | T1+T2 | phi4:14b, gemma3:27b | Medical information, symptoms, health topics | Critic node |
| legal_advisor | T2 | magistral:24b, command-r:35b | German law: BGB, StGB, HGB and more | Critic + MCP |
| translation | T2 | translategemma:27b, qwen3.5:35b | Professional translation, multiple languages | – |
| data_analyst | T1 | phi4:14b | Statistics, Pandas, data analysis, SQL | MCP stats |
| science | T2 | gemma3:27b | Chemistry, biology, physics, research | – |
| reasoning | T1+T2 | phi4:14b, deepseek-r1:32b | Complex logic, strategy, multi-step analysis | Thinking node |
| vision | T2 | Multimodal T2 models | Image, screenshot, document analysis | Base64 input |

The Confidence System

Every expert returns a confidence score alongside its answer. This score determines whether the result is used directly or escalated to a more capable Tier-2 model.
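Given the Tier-1 confidence threshold of 0.65 from the tier table above, the routing decision reduces to a simple check — the function names and Tier-2 call below are illustrative, not the actual implementation:

```python
CONFIDENCE_THRESHOLD = 0.65  # Tier-1 answers below this are escalated

def escalate_to_t2() -> str:
    """Stand-in for invoking a larger Tier-2 expert model."""
    return "tier-2 answer"

def route(tier1_answer: str, confidence: float) -> str:
    """Accept the Tier-1 answer or escalate to a Tier-2 model."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return tier1_answer
    return escalate_to_t2()

print(route("tier-1 answer", 0.9))  # → tier-1 answer
print(route("tier-1 answer", 0.4))  # → tier-2 answer
```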

Output Modes

The model field controls the desired output style:

Available output modes
| Model ID | Mode | Description |
|---|---|---|
| moe-orchestrator | Standard | Full answers with explanations |
| moe-orchestrator-code | Code | Code output only, no prose |
| moe-orchestrator-concise | Concise | Maximum 120 words, no padding |
| moe-orchestrator-research | Research | Deep analysis with citations |
| moe-orchestrator-report | Report | Structured report with sections |
| moe-orchestrator-agent | Agent | Tool-use optimized for agents |
| moe-orchestrator-plan | Plan | Task planning with step list |

Deterministic Tools Without Hallucinations

LLMs hallucinate on calculations, date arithmetic and legal paragraphs. MCP Precision Tools replace these with exact, verifiable computations.

✦ Mathematics

  • calculate – Safe arithmetic evaluation
  • solve_equation – SymPy equation solver
  • prime_factorize – Prime factorization
  • gcd_lcm – Greatest common divisor / LCM
  • roman_numeral – Arabic ↔ Roman numerals

📅 Date & Time

  • date_diff – Difference between dates
  • date_add – Add/subtract from a date
  • day_of_week – Calculate day of week

📏 Units & Statistics

  • unit_convert – km, miles, kg, lb, °C, °F …
  • statistics_calc – Mean, median, stdev, percentiles

🔒 Cryptography & Encoding

  • hash_text – MD5, SHA-256, SHA-512
  • base64_codec – Base64 encode/decode

🌐 Networking

  • subnet_calc – CIDR analysis, netmask, broadcast

📜 Text & Patterns

  • regex_extract – Apply regular expressions
  • text_analyze – Word count, characters, sentences
  • json_query – JSONPath extraction

⚖ German Law

  • legal_search_laws – Search across statutes
  • legal_get_law_overview – Law overview
  • legal_get_paragraph – Fetch specific paragraphs
  • legal_fulltext_search – Full-text (BGB, StGB …)
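Two of the tools above are easy to express as deterministic functions. These are illustrative sketches, not the actual MCP tool implementations:

```python
from datetime import date
from math import gcd

def date_diff(a: str, b: str) -> int:
    """Days between two ISO dates — computed, never guessed."""
    return abs((date.fromisoformat(b) - date.fromisoformat(a)).days)

def gcd_lcm(a: int, b: int) -> tuple[int, int]:
    """Greatest common divisor and least common multiple."""
    g = gcd(a, b)
    return g, a * b // g

print(date_diff("2024-01-01", "2024-03-01"))  # → 60 (2024 is a leap year)
print(gcd_lcm(12, 18))                        # → (6, 36)
```

The point of routing such questions to tools instead of the LLM: the answer is either exactly right or a visible error, never a plausible-sounding fabrication.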

Hardware Others Called Scrap

The premise: consumer GPUs and decommissioned enterprise cards can run stably under full load. Distributed inference does not require cutting-edge hardware.

N1

AMD Ryzen / Consumer RTX

  • CPU: AMD Ryzen 5 5600G
  • RAM: 64 GB DDR4
  • GPUs: 3× RTX 2060 12 GB + 2× RTX 3060 12 GB
  • Storage: 1 TB SATA SSD
  • PSU: 1,200 W ATX
  • OS / Runtime: Debian 13 + Ollama (Docker CE)
Total VRAM: 60 GB
N2

Intel i5 / Tesla M10

  • CPU: Intel Core i5-4590
  • RAM: 32 GB DDR3
  • GPUs: 1× Tesla M10 (4× 8 GB = 32 GB)
  • Storage: 512 GB SATA SSD
  • PSU: 1,000 W
  • OS / Runtime: Debian 13 + Ollama (Docker CE)
Total VRAM: 32 GB
N3

AMD Athlon II / Tesla M10

  • CPU: AMD Athlon II X2 270
  • RAM: 16 GB DDR3
  • GPUs: 1× Tesla M10 (4× 8 GB = 32 GB)
  • Storage: 512 GB SATA SSD
  • PSU: 550 W
  • OS / Runtime: Debian 13 + Ollama (Docker CE)
Total VRAM: 32 GB
N4/5

Gigabyte G431-MM0 HPC × 2

  • CPU: AMD EPYC Embedded 3151
  • RAM: 128 GB DDR4 ECC
  • GPUs: 3× Tesla M10 per node (96 GB each)
  • Form factor: 4U HPC Server, 10× PCIe Gen3
  • Storage: 1 TB SATA SSD
  • OS / Runtime: Debian 13 + Ollama (Docker CE)
VRAM per node: 96 GB · Combined: 192 GB
N6

Gigabyte G431-MM0 / Tesla K80

  • CPU: AMD EPYC Embedded 3151
  • RAM: 128 GB DDR4 ECC
  • GPUs: 7× Tesla K80 (2× 12 GB each = 168 GB)
  • Special: Ollama37 – CC37 fork for Kepler (K80)
  • Storage: 1 TB SATA SSD
  • OS / Runtime: Debian 13 + Ollama37 (Docker CE)
Total VRAM: 168 GB (Kepler architecture)
EXP

Experiment: Thin Client eGPU

  • Base: Dell Wyse Thin Client
  • Mod: MiniPCI → PCIe x16 + 3D-printed eGPU enclosure
  • GPU: Tesla M10 (4× 8 GB = 32 GB)
  • RAM: 16 GB DDR3
  • Result: gpt-oss:20b at 15 tokens/s ✓
  • PSU: 550 W external
Proof of Concept: ✓ Successful

Motivation: Privacy by design, independence, and putting hardware others called scrap to work on a cutting-edge problem.

OpenAI-Compatible from Day One

MoE Sovereign behaves like the OpenAI API. Any existing integration — Open WebUI, Claude Code, LangChain, LlamaIndex — works without code changes.

Quick Start with cURL

POST /v1/chat/completions

```bash
curl -X POST http://<YOUR-SERVER>:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR-API-KEY>" \
  -d '{
    "model": "moe-orchestrator",
    "messages": [
      {"role": "user", "content": "Explain the difference between TCP and UDP"}
    ],
    "stream": false
  }'
```

Streaming Response

Streaming via SSE

```bash
curl -X POST http://<YOUR-SERVER>:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR-API-KEY>" \
  -d '{
    "model": "moe-orchestrator-code",
    "messages": [{"role": "user", "content": "Write a Python Fibonacci function"}],
    "stream": true
  }'
```

Python with the openai Library

OpenAI SDK drop-in

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<YOUR-SERVER>:8002/v1",
    api_key="<YOUR-API-KEY>"
)

response = client.chat.completions.create(
    model="moe-orchestrator-research",
    messages=[{"role": "user", "content": "Analyze the trade-offs of Kubernetes"}]
)
print(response.choices[0].message.content)
```

Feedback & Learning Loop

POST /v1/feedback

```bash
curl -X POST http://<YOUR-SERVER>:8002/v1/feedback \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR-API-KEY>" \
  -d '{
    "response_id": "chatcmpl-abc123",
    "rating": 5,
    "comment": "Very precise answer"
  }'
```

Feedback is processed via Kafka and feeds into expert scoring with Laplace smoothing — the system learns which models perform best for each category over time.
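The Laplace-smoothed score can be sketched as follows — add-one smoothing is the standard form; the rating scale and exact formula used by the system are assumptions:

```python
def laplace_score(successes: int, trials: int) -> float:
    """Add-one smoothing: unseen models start at 0.5 instead of 0/0."""
    return (successes + 1) / (trials + 2)

# A model with no feedback yet is neither trusted nor distrusted...
print(laplace_score(0, 0))    # → 0.5
# ...while accumulating good ratings pulls the score toward 1.
print(laplace_score(9, 10))   # → 0.8333...
```

The virtue of smoothing here: a brand-new expert model is not locked out of routing by an empty track record, and a single bad rating cannot crater an established one.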

Full API reference, authentication, budget management and more are covered in the documentation. Read the full documentation ↗

Current State & Roadmap

The project is in active development. Hardware and infrastructure are stable; the focus is now on fine-tuning expert configuration.

Phase 1: Hardware & Infrastructure

All GPU nodes assembled, Ollama cluster configured, services (Redis, Neo4j, ChromaDB, Kafka, Prometheus, Grafana) operational. VRAM management optimized, inference running stably.

🔄 Phase 2: Expert Fine-Tuning (current)

Identifying best models per task type, optimizing system prompts, building expert templates in the MoE Portal, testing MCP tool integration end-to-end.

📋 Phase 3: Open-Source Release

Publication on GitHub under Apache 2.0 license once all core features are established and thoroughly tested. Full documentation available via MkDocs at docs.moe-admin.de.

License: Apache 2.0 · Language: Python + FastAPI + LangGraph · Minimum hardware: 1 GPU with ≥ 8 GB VRAM