vLLM Semantic Router

vLLM Semantic Router is a research-driven project focused on frontier problems in LLM routing and the token economy. We build system-level intelligence for Mixture-of-Models (MoM): deciding how to capture the right signals, select the right model path, enforce the right policy, and spend the right token budget for each request.

The project sits between clients and model backends as an Envoy External Processor (ext_proc), turning routing from ad hoc application logic into an observable, configurable control plane for multi-model systems.
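On the Envoy side, that positioning looks roughly like the following external-processing filter configuration. This is an illustrative sketch, not the project's shipped config: the cluster name `semantic-router` and the processing modes are assumptions.

```yaml
http_filters:
  - name: envoy.filters.http.ext_proc
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
      grpc_service:
        envoy_grpc:
          # Assumed name for the cluster pointing at the router's gRPC endpoint.
          cluster_name: semantic-router
      processing_mode:
        # Send request headers and buffered bodies to the router so it can
        # classify the prompt before choosing a backend.
        request_header_mode: SEND
        request_body_mode: BUFFERED
```

Because the router runs as an ext_proc server rather than inside the application, routing policy can change without redeploying clients or model backends.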

Research Focus

We use the project to answer a small set of hard systems questions:

  1. How do we capture missing signals from requests, responses, users, and runtime context?
  2. How do we compose those signals into robust routing and policy decisions?
  3. How do multiple models collaborate as a system instead of serving as isolated endpoints?
  4. How do we optimize latency, spend, and tool usage as part of a practical token economy?
  5. How do we add safety, feedback, and observability without fragmenting the serving stack?

Core System

Signal-Driven Decision Engine

Captures and combines 9 types of request signals to make intelligent routing decisions:

| Signal Type | Description | Use Case |
| --- | --- | --- |
| keyword | Pattern matching with AND/OR operators | Fast rule-based routing for specific terms |
| embedding | Semantic similarity using embeddings | Intent detection and semantic understanding |
| domain | MMLU domain classification (14 categories) | Academic and professional domain routing |
| fact_check | ML-based fact-checking requirement detection | Identify queries needing fact verification |
| user_feedback | User satisfaction and feedback classification | Handle follow-up messages and corrections |
| preference | LLM-based route preference matching | Complex intent analysis via external LLM |
| language | Multi-language detection (100+ languages) | Route queries to language-specific models |
| context | Token-count based context classification | Route short/long context requests to suitable models |
| complexity | Query difficulty classification (easy/medium/hard) | Match model capability to task difficulty |

How it works: Signals are extracted from requests, combined using AND/OR operators in decision rules, and used to select the best model and configuration.
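The extraction-then-composition step can be sketched as a small rule evaluator. This is an illustrative model of the idea, not the project's actual API: the rule shape, signal names, and model names below are all assumptions.

```python
# Sketch: decision rules combine named signals with AND/OR operators,
# and the first matching decision selects the model.

def evaluate(rule, signals):
    """Recursively evaluate an AND/OR rule tree against extracted signals."""
    op = rule.get("operator")
    if op == "AND":
        return all(evaluate(r, signals) for r in rule["conditions"])
    if op == "OR":
        return any(evaluate(r, signals) for r in rule["conditions"])
    # Leaf condition: compare a single signal's value.
    return signals.get(rule["signal"]) == rule["equals"]

def route(decisions, signals, default_model="general-model"):
    """Return the model of the first decision whose rule matches."""
    for decision in decisions:
        if evaluate(decision["rule"], signals):
            return decision["model"]
    return default_model

decisions = [
    {"model": "math-specialist",
     "rule": {"operator": "AND", "conditions": [
         {"signal": "domain", "equals": "math"},
         {"signal": "complexity", "equals": "hard"}]}},
    {"model": "long-context-model",
     "rule": {"signal": "context", "equals": "long"}},
]

signals = {"domain": "math", "complexity": "hard", "context": "short"}
print(route(decisions, signals))  # math-specialist
```

Keeping the rules as data rather than code is what lets routing live in configuration: new decisions are added by editing the rule set, not the serving path.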

Plugin Chain Architecture

Extensible plugin system for request/response processing:

| Plugin Type | Description | Use Case |
| --- | --- | --- |
| semantic-cache | Semantic similarity-based caching | Reduce latency and costs for similar queries |
| jailbreak | Adversarial prompt detection | Block prompt injection and jailbreak attempts |
| pii | Personally identifiable information detection | Protect sensitive data and ensure compliance |
| system_prompt | Dynamic system prompt injection | Add context-aware instructions per route |
| header_mutation | HTTP header manipulation | Control routing and backend behavior |
| hallucination | Token-level hallucination detection | Real-time fact verification during generation |

How it works: Plugins form a processing chain. Each plugin can inspect or modify requests and responses, and each can be enabled or disabled per decision.
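The chain-of-plugins pattern can be sketched as follows. The class names, hooks, and toy redaction logic here are assumptions for illustration, not the project's plugin interface.

```python
import re

# Sketch: each plugin sees the previous plugin's output and may modify
# the request or reject it; plugins can be toggled per routing decision.

class Plugin:
    name = "base"
    def on_request(self, request):
        return request  # default: pass through unchanged

class JailbreakGuard(Plugin):
    name = "jailbreak"
    def on_request(self, request):
        # Toy check standing in for a real adversarial-prompt classifier.
        if "ignore previous instructions" in request["prompt"].lower():
            raise PermissionError("blocked by jailbreak plugin")
        return request

class PIIFilter(Plugin):
    name = "pii"
    def on_request(self, request):
        # Toy redaction: mask anything shaped like an email address.
        request["prompt"] = re.sub(r"\S+@\S+", "[REDACTED]", request["prompt"])
        return request

def run_chain(plugins, request, enabled):
    """Run the enabled plugins in order, threading the request through."""
    for plugin in plugins:
        if enabled.get(plugin.name, True):
            request = plugin.on_request(request)
    return request

chain = [JailbreakGuard(), PIIFilter()]
out = run_chain(chain, {"prompt": "Email me at a@b.com"}, enabled={})
print(out["prompt"])  # Email me at [REDACTED]
```

The per-decision `enabled` map is the key design point: the same chain definition serves every route, while each routing decision chooses which safety and transformation steps actually run.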

Key Benefits

A Control Plane for LLM Routing

  • Policy instead of hard-coded branches: Move routing logic out of application code into reusable signals, decisions, and configuration.
  • Capability-aware selection: Route by task shape, risk, and quality requirements instead of defaulting every request to one model.

A Practical Token Economy Layer

  • Spend budget where it matters: Reserve premium models, long context, and tool calls for the requests that need them.
  • Reduce waste without collapsing quality: Use semantic caching, context-aware routing, and explicit policy to control latency and token spend.
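The semantic-caching idea behind these savings can be sketched with a toy example. The bag-of-words "embedding" and the 0.8 threshold below are placeholders for illustration; a real deployment would use a learned embedding model and a tuned threshold.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a new query is similar enough."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: forward to a model backend

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))  # Paris
```

A hit skips the model call entirely, so near-duplicate queries cost a cache lookup instead of a full generation.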

Governance in the Request Path

  • Built-in safety and compliance: Apply jailbreak, PII, hallucination, prompt, and header controls at the same layer that makes routing decisions.
  • Observable decisions: Keep routing and policy outcomes auditable so teams can tune behavior with data instead of guesswork.

A Research Surface That Can Ship

  • Fast experimentation: Add new signals, algorithms, and plugins without rewriting the serving path.
  • Production alignment: Connect experimentation, observability, and deployment in one maintained system.

Use Cases

  • Multi-model inference gateways: Route to specialized models based on capability, context, and policy.
  • Cost-aware copilots: Balance quality, latency, and spend for internal assistants and developer tooling.
  • Safety-sensitive assistants: Enforce PII, jailbreak, and hallucination controls in the live request path.
  • Research platforms: Evaluate routing policies, collect feedback signals, and iterate on model collaboration strategies.

Start Here

  • Overview for project goals, semantic routing concepts, and collective intelligence.
  • Installation for setup, deployment options, and configuration.
  • Fleet Simulator for planning GPU fleets, evaluating routing strategies, and reading the guide PDF.
  • Capacities for signals, projections, decisions, plugins, algorithms, and global controls.
  • Proposals for design work that has not yet been folded into the stable docs set.

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.