vLLM Semantic Router
vLLM Semantic Router is a research-driven project focused on frontier problems in LLM routing and the token economy. We build system-level intelligence for Mixture-of-Models (MoM): deciding how to capture the right signals, select the right model path, enforce the right policy, and spend the right token budget for each request.
The project sits between clients and model backends as an Envoy External
Processor (ext_proc), turning routing from ad hoc application logic into an
observable, configurable control plane for multi-model systems.
Research Focus
We use the project to answer a small set of hard systems questions:
- How do we capture missing signals from requests, responses, users, and runtime context?
- How do we compose those signals into robust routing and policy decisions?
- How do multiple models collaborate as a system instead of serving as isolated endpoints?
- How do we optimize latency, spend, and tool usage as part of a practical token economy?
- How do we add safety, feedback, and observability without fragmenting the serving stack?
Core System
Signal-Driven Decision Engine
Captures and combines 9 types of request signals to make intelligent routing decisions:
| Signal Type | Description | Use Case |
|---|---|---|
| keyword | Pattern matching with AND/OR operators | Fast rule-based routing for specific terms |
| embedding | Semantic similarity using embeddings | Intent detection and semantic understanding |
| domain | MMLU domain classification (14 categories) | Academic and professional domain routing |
| fact_check | ML-based fact-checking requirement detection | Identify queries needing fact verification |
| user_feedback | User satisfaction and feedback classification | Handle follow-up messages and corrections |
| preference | LLM-based route preference matching | Complex intent analysis via external LLM |
| language | Multi-language detection (100+ languages) | Route queries to language-specific models |
| context | Token-count based context classification | Route short/long context requests to suitable models |
| complexity | Query difficulty classification (easy/medium/hard) | Match model capability to task difficulty |
How it works: Signals are extracted from requests, combined using AND/OR operators in decision rules, and used to select the best model and configuration.
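The extraction-then-rules flow above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: the signal names, the `DecisionRule` shape, and the toy extractors are assumptions for the example, not the project's actual API, and the real system uses ML classifiers rather than the keyword and length heuristics stubbed in here.

```python
from dataclasses import dataclass

def extract_signals(request: dict) -> dict:
    """Derive a few of the signal types described above from a request.
    The heuristics below are toy stand-ins for the real classifiers."""
    text = request["prompt"]
    tokens = len(text.split())
    return {
        "keyword:code": any(k in text.lower() for k in ("python", "function", "bug")),
        "context:long": tokens > 2000,          # stand-in for context classification
        "complexity:hard": tokens > 200,        # stand-in for difficulty classification
    }

@dataclass
class DecisionRule:
    """A rule fires if ANY of its groups has ALL signals true (OR of ANDs)."""
    model: str
    any_of: list  # list of AND-groups of signal names

    def matches(self, signals: dict) -> bool:
        return any(all(signals.get(s, False) for s in group) for group in self.any_of)

# Hypothetical rule set; model names are illustrative.
RULES = [
    DecisionRule(model="code-specialist", any_of=[["keyword:code"]]),
    DecisionRule(model="long-context-model", any_of=[["context:long", "complexity:hard"]]),
]

def route(request: dict, default: str = "general-model") -> str:
    """Pick the first rule whose signals match, else fall back to a default."""
    signals = extract_signals(request)
    for rule in RULES:
        if rule.matches(signals):
            return rule.model
    return default
```

First match wins here for simplicity; a production engine would also carry per-decision configuration (model parameters, plugin toggles) alongside the selected model.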
Plugin Chain Architecture
Extensible plugin system for request/response processing:
| Plugin Type | Description | Use Case |
|---|---|---|
| semantic-cache | Semantic similarity-based caching | Reduce latency and costs for similar queries |
| jailbreak | Adversarial prompt detection | Block prompt injection and jailbreak attempts |
| pii | Personally identifiable information detection | Protect sensitive data and ensure compliance |
| system_prompt | Dynamic system prompt injection | Add context-aware instructions per route |
| header_mutation | HTTP header manipulation | Control routing and backend behavior |
| hallucination | Token-level hallucination detection | Real-time fact verification during generation |
How it works: Plugins form a processing chain; each plugin can inspect and modify requests and responses, and each can be enabled or disabled per decision.
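The chain pattern can be sketched as follows. This is an illustrative minimum, not the project's real plugin interface: the `Plugin` base class, the hook names, and the toy PII pattern are assumptions made for the example.

```python
class Plugin:
    """Hypothetical plugin interface: hooks for both directions of traffic."""
    enabled = True

    def on_request(self, req: dict) -> dict:
        return req

    def on_response(self, resp: dict) -> dict:
        return resp

class PIIPlugin(Plugin):
    """Toy PII control: mask one fixed pattern before the backend sees it.
    A real detector would use ML/NER, not string replacement."""
    def on_request(self, req: dict) -> dict:
        req["prompt"] = req["prompt"].replace("555-0100", "[REDACTED]")
        return req

class HeaderMutationPlugin(Plugin):
    """Toy header mutation: attach a routing header the backend can act on."""
    def on_request(self, req: dict) -> dict:
        req.setdefault("headers", {})["x-route"] = "default"
        return req

def run_request_chain(plugins: list, req: dict) -> dict:
    """Each enabled plugin sees the output of the previous one."""
    for plugin in plugins:
        if plugin.enabled:  # the per-decision enable/disable switch
            req = plugin.on_request(req)
    return req
```

Because every plugin receives the previous plugin's output, ordering matters: safety controls like jailbreak and PII detection typically run before caching or header mutation.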
Key Benefits
A Control Plane for LLM Routing
- Policy instead of hard-coded branches: Move routing logic out of application code into reusable signals, decisions, and configuration.
- Capability-aware selection: Route by task shape, risk, and quality requirements instead of defaulting every request to one model.
A Practical Token Economy Layer
- Spend budget where it matters: Reserve premium models, long context, and tool calls for the requests that need them.
- Reduce waste without collapsing quality: Use semantic caching, context-aware routing, and explicit policy to control latency and token spend.
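One of the mechanisms named above, semantic caching, can be sketched with a toy in-memory cache. Everything here is an assumption for illustration: the bag-of-words "embedding" stands in for a real embedding model, and the 0.9 threshold and `SemanticCache` API are invented for the example.

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words vector; a real cache would call an embedding model."""
    vec: dict = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an
    old one, so the request never spends tokens on a model call."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold   # similarity needed to count as a hit
        self.entries: list = []      # (embedding, cached response)

    def get(self, prompt: str):
        query = embed(prompt)
        for vec, response in self.entries:
            if cosine(query, vec) >= self.threshold:
                return response      # hit: skip the backend entirely
        return None                  # miss: caller routes to a model

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))
```

The threshold is the quality/savings dial: lowering it saves more tokens but risks serving a cached answer to a question that only looks similar.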
Governance in the Request Path
- Built-in safety and compliance: Apply jailbreak, PII, hallucination, prompt, and header controls at the same layer that makes routing decisions.
- Observable decisions: Keep routing and policy outcomes auditable so teams can tune behavior with data instead of guesswork.
A Research Surface That Can Ship
- Fast experimentation: Add new signals, algorithms, and plugins without rewriting the serving path.
- Production alignment: Connect experimentation, observability, and deployment in one maintained system.
Use Cases
- Multi-model inference gateways: Route to specialized models based on capability, context, and policy.
- Cost-aware copilots: Balance quality, latency, and spend for internal assistants and developer tooling.
- Safety-sensitive assistants: Enforce PII, jailbreak, and hallucination controls in the live request path.
- Research platforms: Evaluate routing policies, collect feedback signals, and iterate on model collaboration strategies.
Start Here
- Overview for project goals, semantic routing concepts, and collective intelligence.
- Installation for setup, deployment options, and configuration.
- Fleet Simulator for planning GPU fleets, evaluating routing strategies, and reading the guide PDF.
- Capacities for signals, projections, decisions, plugins, algorithms, and global controls.
- Proposals for design work that has not yet been folded into the stable docs set.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.