vLLM Semantic Router
vLLM Semantic Router is a research-driven project focused on frontier problems in LLM routing and the token economy. We build system-level intelligence for Mixture-of-Models (MoM): deciding how to capture the right signals, select the right model path, enforce the right policy, and spend the right token budget for each request.
The project sits between clients and model backends as an Envoy External
Processor (ext_proc), turning routing from ad hoc application logic into an
observable, configurable control plane for multi-model systems.
Research Focus
We use the project to answer a small set of hard systems questions:
- How do we capture missing signals from requests, responses, users, and runtime context?
- How do we compose those signals into robust routing and policy decisions?
- How do multiple models collaborate as a system instead of serving as isolated endpoints?
- How do we optimize latency, spend, and tool usage as part of a practical token economy?
- How do we add safety, feedback, and observability without fragmenting the serving stack?
Core System
Signal-Driven Decision Engine
Captures and combines 9 types of request signals to make intelligent routing decisions:
| Signal Type | Description | Use Case |
|---|---|---|
| keyword | Pattern matching with AND/OR operators | Fast rule-based routing for specific terms |
| embedding | Semantic similarity using embeddings | Intent detection and semantic understanding |
| domain | MMLU domain classification (14 categories) | Academic and professional domain routing |
| fact_check | ML-based fact-checking requirement detection | Identify queries needing fact verification |
| user_feedback | User satisfaction and feedback classification | Handle follow-up messages and corrections |
| preference | LLM-based route preference matching | Complex intent analysis via external LLM |
| language | Multi-language detection (100+ languages) | Route queries to language-specific models |
| context | Token-count based context classification | Route short/long context requests to suitable models |
| complexity | Query difficulty classification (easy/medium/hard) | Match model capability to task difficulty |
How it works: Signals are extracted from requests, combined using AND/OR operators in decision rules, and used to select the best model and configuration.
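The extraction-then-rules flow above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: the signal names, the `DecisionRule` shape, and the toy extractors are assumptions for the example, not the project's actual API, and the real system uses ML classifiers rather than the keyword and length heuristics stubbed in here.

```python
from dataclasses import dataclass

def extract_signals(request: dict) -> dict:
    """Derive a few of the signal types described above from a request.
    The heuristics below are toy stand-ins for the real classifiers."""
    text = request["prompt"]
    tokens = len(text.split())
    return {
        "keyword:code": any(k in text.lower() for k in ("python", "function", "bug")),
        "context:long": tokens > 2000,          # stand-in for context classification
        "complexity:hard": tokens > 200,        # stand-in for difficulty classification
    }

@dataclass
class DecisionRule:
    """A rule fires if ANY of its groups has ALL signals true (OR of ANDs)."""
    model: str
    any_of: list  # list of AND-groups of signal names

    def matches(self, signals: dict) -> bool:
        return any(all(signals.get(s, False) for s in group) for group in self.any_of)

# Hypothetical rule set; model names are illustrative.
RULES = [
    DecisionRule(model="code-specialist", any_of=[["keyword:code"]]),
    DecisionRule(model="long-context-model", any_of=[["context:long", "complexity:hard"]]),
]

def route(request: dict, default: str = "general-model") -> str:
    """Pick the first rule whose signals match, else fall back to a default."""
    signals = extract_signals(request)
    for rule in RULES:
        if rule.matches(signals):
            return rule.model
    return default
```

First match wins here for simplicity; a production engine would also carry per-decision configuration (model parameters, plugin toggles) alongside the selected model.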
Plugin Chain Architecture
Extensible plugin system for request/response processing:
| Plugin Type | Description | Use Case |
|---|---|---|
| semantic-cache | Semantic similarity-based caching | Reduce latency and costs for similar queries |
| jailbreak | Adversarial prompt detection | Block prompt injection and jailbreak attempts |
| pii | Personally identifiable information detection | Protect sensitive data and ensure compliance |
| system_prompt | Dynamic system prompt injection | Add context-aware instructions per route |
| header_mutation | HTTP header manipulation | Control routing and backend behavior |
| hallucination | Token-level hallucination detection | Real-time fact verification during generation |
How it works: Plugins form a processing chain; each plugin can inspect and modify requests and responses, and each can be enabled or disabled per decision.
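The chain pattern can be sketched as follows. This is an illustrative minimum, not the project's real plugin interface: the `Plugin` base class, the hook names, and the toy PII pattern are assumptions made for the example.

```python
class Plugin:
    """Hypothetical plugin interface: hooks for both directions of traffic."""
    enabled = True

    def on_request(self, req: dict) -> dict:
        return req

    def on_response(self, resp: dict) -> dict:
        return resp

class PIIPlugin(Plugin):
    """Toy PII control: mask one fixed pattern before the backend sees it.
    A real detector would use ML/NER, not string replacement."""
    def on_request(self, req: dict) -> dict:
        req["prompt"] = req["prompt"].replace("555-0100", "[REDACTED]")
        return req

class HeaderMutationPlugin(Plugin):
    """Toy header mutation: attach a routing header the backend can act on."""
    def on_request(self, req: dict) -> dict:
        req.setdefault("headers", {})["x-route"] = "default"
        return req

def run_request_chain(plugins: list, req: dict) -> dict:
    """Each enabled plugin sees the output of the previous one."""
    for plugin in plugins:
        if plugin.enabled:  # the per-decision enable/disable switch
            req = plugin.on_request(req)
    return req
```

Because every plugin receives the previous plugin's output, ordering matters: safety controls like jailbreak and PII detection typically run before caching or header mutation.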
Key Benefits
A Control Plane for LLM Routing
- Policy instead of hard-coded branches: Move routing logic out of application code into reusable signals, decisions, and configuration.
- Capability-aware selection: Route by task shape, risk, and quality requirements instead of defaulting every request to one model.
A Practical Token Economy Layer
- Spend budget where it matters: Reserve premium models, long context, and tool calls for the requests that need them.
- Reduce waste without collapsing quality: Use semantic caching, context-aware routing, and explicit policy to control latency and token spend.
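One of the mechanisms named above, semantic caching, can be sketched with a toy in-memory cache. Everything here is an assumption for illustration: the bag-of-words "embedding" stands in for a real embedding model, and the 0.9 threshold and `SemanticCache` API are invented for the example.

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words vector; a real cache would call an embedding model."""
    vec: dict = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an
    old one, so the request never spends tokens on a model call."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold   # similarity needed to count as a hit
        self.entries: list = []      # (embedding, cached response)

    def get(self, prompt: str):
        query = embed(prompt)
        for vec, response in self.entries:
            if cosine(query, vec) >= self.threshold:
                return response      # hit: skip the backend entirely
        return None                  # miss: caller routes to a model

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))
```

The threshold is the quality/savings dial: lowering it saves more tokens but risks serving a cached answer to a question that only looks similar.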
Governance in the Request Path
- Built-in safety and compliance: Apply jailbreak, PII, hallucination, prompt, and header controls at the same layer that makes routing decisions.
- Observable decisions: Keep routing and policy outcomes auditable so teams can tune behavior with data instead of guesswork.
A Research Surface That Can Ship
- Fast experimentation: Add new signals, algorithms, and plugins without rewriting the serving path.
- Production alignment: Connect experimentation, observability, and deployment in one maintained system.
Use Cases
- Multi-model inference gateways: Route to specialized models based on capability, context, and policy.
- Cost-aware copilots: Balance quality, latency, and spend for internal assistants and developer tooling.
- Safety-sensitive assistants: Enforce PII, jailbreak, and hallucination controls in the live request path.
- Research platforms: Evaluate routing policies, collect feedback signals, and iterate on model collaboration strategies.
Start Here
- Overview for project goals, semantic routing concepts, and collective intelligence.
- Installation for setup, deployment options, and configuration.
- Fleet Simulator for planning GPU fleets, evaluating routing strategies, and reading the guide PDF.
- Capacities for signals, projections, decisions, plugins, algorithms, and global controls.
- Proposals for design work that has not yet been folded into the stable docs set.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.