System-level intelligence

Signal before scale

Get early access / Open white paper

System-brain routing: signal-led, entropy-aware, ruthlessly clear.

Signals: 13

13 signal families spanning intent, safety, modality, context, and preference.

Selection: 12

12 selectors across symbolic policy, latency heuristics, reinforcement learning, and ML routing.

Surfaces: 3

One architecture across cpu-local, amd-local, and ci-k8s.

Quick start

One supported local path. Copy the installer, run it, then open the dashboard.

Install locally in one line.

The supported first-run path is a single installer that sets up the CLI and local serve flow on macOS and Linux.

One-liner install: macOS / Linux

curl -fsSL https://vllm-semantic-router.com/zh-Hans/install.sh | bash

Installs into ~/.local/share/vllm-sr, writes ~/.local/bin/vllm-sr, and keeps Windows on the manual pip flow in the docs.

Full installation guide

Research

Papers behind the router.

Research threads that trace the router's evolving ideas across safety, multimodality, orchestration, and system design.

2026 / Paper / POSITION PAPER

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

vLLM Semantic Router Team

arXiv Technical Report

We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.

Read paper

2026 / Paper / VISION PAPER

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

arXiv Technical Report

We synthesize the project’s recent routing, fleet, multimodal, and governance results into the Workload-Router-Pool (WRP) architecture, connecting signal-driven routing to a full-stack inference optimization framework and outlining future research directions across workload, router, and pool design.

Read paper

2026 / Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.

Read paper

2026 / Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.

Read paper

2026 / Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.

Read paper

2026 / Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.

Read paper

2026 / Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.

Read paper

2026 / Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.

Read paper

2026 / Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than a pure GPU generation upgrade.

Read paper

2026 / Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

arXiv Technical Report

We show how probabilistic ML predicates in policy languages can silently co-fire on the same query, and implement conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.

Read paper

2026 / Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We show that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model’s performance on persistent user-specific queries while cutting effective inference cost by 96%.

Read paper

2025 / Paper

When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

NeurIPS - MLForSys

We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.

Read paper

2025 / Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

We present category-aware semantic caching, in which similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture that separates in-memory HNSW search from external document storage.

Read paper

2025 / Paper

Semantic Inference Routing Protocol (SIRP)

Huamin Chen, Luay Jalil

Internet Engineering Task Force (IETF)

This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.

Read paper

2025 / Paper

Multi-Provider Extensions for Agentic AI Inference APIs

H. Chen, L. Jalil, N. Cocker

Internet Engineering Task Force (IETF) - Network Management Research Group

This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.

Read paper

Papers that frame how the router sees, decides, and scales.

See all papers and talks

Core logic

Charting the LLM system brain.

A research-driven stack for uncharted territory, probing the frontier where signals, policies, and models converge into one intelligence layer.

Signal extraction

Encoder signals turn raw requests into legible semantic state.

Decision engine

Neural signals meet symbolic rules in auditable routing logic.
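One way to picture the decision engine described above is a list of symbolic rules evaluated over classifier confidences, with a trace kept for auditing. The sketch below is illustrative only: `Rule`, `decide`, the signal names, the thresholds, and the model names are all hypothetical and are not taken from the project's configuration or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type: signal name -> classifier confidence in [0, 1].
Signals = Dict[str, float]

@dataclass
class Rule:
    name: str
    condition: Callable[[Signals], bool]  # symbolic predicate over neural scores
    target: str                           # model or action to route to

def decide(signals: Signals, rules: List[Rule], default: str) -> Tuple[str, List[str]]:
    """Return the first matching rule's target plus a trace of evaluated rules."""
    trace: List[str] = []
    for rule in rules:
        matched = rule.condition(signals)
        trace.append(f"{rule.name}={'hit' if matched else 'miss'}")
        if matched:
            return rule.target, trace
    trace.append("default")
    return default, trace

# Assumed rule set and thresholds, chosen only for the example.
rules = [
    Rule("block_jailbreak", lambda s: s.get("jailbreak", 0.0) >= 0.8, "reject"),
    Rule("math_to_reasoner", lambda s: s.get("domain:math", 0.0) >= 0.6, "reasoning-model"),
]

target, trace = decide(
    {"jailbreak": 0.05, "domain:math": 0.91}, rules, default="general-model"
)
```

First-match ordering keeps the outcome easy to explain: the trace lists exactly which rules were checked before the decision was made.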

Plugin chain

Cache, safety, rewrite, and tracing attach as composable behaviors.
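A composable plugin chain can be sketched as an ordered list of functions that each take a request and return a new one. This is a minimal illustration under stated assumptions: the plugin names, the request shape, and `run_chain` are hypothetical and do not reflect the router's actual plugin interface.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Plugin = Callable[[Request], Request]

def rewrite_prompt(req: Request) -> Request:
    """Hypothetical rewrite step: trim surrounding whitespace from the prompt."""
    return {**req, "prompt": req["prompt"].strip()}

def tag_safety(req: Request) -> Request:
    """Hypothetical safety step: flag prompts containing a blocked phrase."""
    flagged = "ignore previous instructions" in req["prompt"].lower()
    return {**req, "flagged": flagged}

def trace(req: Request) -> Request:
    """Hypothetical tracing step: record which keys the chain has produced so far."""
    return {**req, "trace": sorted(k for k in req if k != "trace")}

def run_chain(req: Request, plugins: List[Plugin]) -> Request:
    """Apply each plugin in order; every plugin returns a new request dict."""
    for plugin in plugins:
        req = plugin(req)
    return req

result = run_chain(
    {"prompt": "  Summarize this document.  "},
    [rewrite_prompt, tag_safety, trace],
)
```

Because each step shares one signature, behaviors can be added, removed, or reordered without touching the others.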

Intent-to-policy compile

Natural language intent compiles into neural-symbolic policy before execution begins.

Frontier LLM systems

Research drives the stack itself, exploring frontier LLM systems beyond settled patterns.

Full dashboard support

Operate routing, topology, controls, and runtime feedback from one integrated dashboard.

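Routing blueprint

How the system works

Explore signal extraction, decision logic, and model routing behavior through interactive demos.

Shannon mapping

A structural mapping from communication theory to the routing pipeline.

The user request is the raw source message before encoding.

Built on encoder models

Encoder-driven intelligence

Dedicated encoder models extract semantics from every request: understanding intent, ranking relevance, and classifying content across modalities in real time.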

Signal surfaces

Sequence classification, token labeling, embeddings, and reranking collapse into one system-intelligence layer.

SEQ_CLS: Sequence classification for domain, jailbreak, fact-check, and feedback routing.

TOKEN: Token labeling for PII and safety-sensitive spans that need localized intervention.

EMBED: Embedding and rerank paths for semantic cache, similarity search, and candidate scoring.

Hugging Face Models

MOD

Multimodal

Detects text, image, and audio inputs and routes each to the appropriate modality model.

Input

"Is machine learning related to AI?"

Tokenizer

[CLS] Is machine learning related to AI? [SEP]

Embedding

h₀ = Token Emb + Segment Emb + Position Emb

Encoder Block (×N)

ATTN: Multi-Head Attention

NORM: Add & Norm

FFN: Feed-Forward

NORM: Add & Norm

Signals

CLS

Sentence-Level (CLS Token): [CLS] → Linear Head → "computer science". TaskType: SEQ_CLS

Domain, Jailbreak, Fact-check, Feedback, Modality
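The sentence-level path above ([CLS] → Linear Head → label) can be sketched in plain Python. This is a minimal illustration, assuming a toy 3-dimensional [CLS] vector and made-up weights; the `classify` helper, the label set beyond "computer science", and all numbers are hypothetical rather than the router's real head.

```python
import math
from typing import List, Tuple

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(
    cls_vector: List[float],
    weights: List[List[float]],  # one weight row per label
    bias: List[float],
    labels: List[str],
) -> Tuple[str, float]:
    """Apply a linear head to the [CLS] vector and return (label, probability)."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

# Toy values, assumed for illustration only.
labels = ["computer science", "law", "medicine"]
cls_vector = [0.9, -0.2, 0.1]
weights = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
bias = [0.0, 0.0, 0.0]

label, confidence = classify(cls_vector, weights, bias, labels)
```

The returned probability is the kind of confidence score a downstream routing rule could threshold on.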

BIO

Token-Level (Per Token): Each token → BIO Label → O O B-LOC I-LOC O. TaskType: TOKEN_CLS

PII Detection
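To show how per-token BIO labels such as `O O B-LOC I-LOC O` become spans that can be redacted or inspected, here is a small decoding sketch. The `decode_bio` helper and the example sentence are hypothetical; only the label sequence comes from the diagram above.

```python
from typing import List, Optional, Tuple

def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span in progress, if any.
        nonlocal current_type, current_tokens
        if current_type is not None and current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # end the span and treat this token as outside any entity.
            flush()
    flush()
    return spans

# Example sentence is assumed; the labels match the diagram.
tokens = ["I", "visited", "New", "York", "yesterday"]
labels = ["O", "O", "B-LOC", "I-LOC", "O"]
spans = decode_bio(tokens, labels)
```

Here `spans` holds a single `LOC` entity covering "New York", which is the localized span a PII step would act on.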

EMB

Bi-Encoder: mean-pooling(h₁..hₙ) → [0.23, -0.41, 0.87, ...]. TaskType: EMBEDDING

Semantic Cache, Similarity, Complexity-CL, Jailbreak-CL
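The bi-encoder path above (mean-pool the token states, then compare vectors) can be sketched as follows. This is a simplified illustration: it pools over every token without a padding mask, and the 2-dimensional vectors and the `CACHE_THRESHOLD` value are assumptions made for the example, not project defaults.

```python
import math
from typing import List

def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Average the per-token vectors h_1..h_n into one sentence embedding."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy token states, assumed for illustration only.
query = mean_pool([[1.0, 0.0], [0.0, 1.0]])
cached = mean_pool([[0.4, 0.6], [0.6, 0.4]])
unrelated = mean_pool([[1.0, -1.0], [1.0, -1.0]])

CACHE_THRESHOLD = 0.9  # hypothetical similarity cutoff for a cache hit
cache_hit = cosine(query, cached) >= CACHE_THRESHOLD
cache_miss = cosine(query, unrelated) >= CACHE_THRESHOLD
```

Because queries and candidates are encoded independently, cached embeddings can be computed once and reused across requests.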

RER

Cross-Encoder: [CLS] query [SEP] candidate [SEP] → score. TaskType: CROSS_LEARNING

Rerank, Multi-Modal

BIE

Bi-Encoder embedding

Encodes queries and candidates independently into dense vectors for similarity search and semantic caching.

XCE

Cross-Encoder learning

Scores query-candidate pairs with joint cross-attention for high-precision reranking.

CLS

Classification

Classifiers for domain, jailbreak, PII, and fact-checking, built on in-house BERT models and covering multiple signals.

ATT

Full attention

Bidirectional attention across tokens and sentences: full context in both directions, with no causal mask.
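The difference between full (bidirectional) attention and causal masking is easiest to see in the masks themselves. The sketch below is a generic illustration that ignores padding; the helper names are hypothetical, and a value of 1 at row i, column j means token i may attend to token j.

```python
from typing import List

def bidirectional_mask(n: int) -> List[List[int]]:
    """Encoder-style mask: every token can attend to every other token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n: int) -> List[List[int]]:
    """Decoder-style mask: token i can attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

full = bidirectional_mask(3)
causal = causal_mask(3)
```

With the bidirectional mask, the first token already sees the last one, which is why an encoder can classify a request from its complete context in a single pass.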

2DM

2DMSE

Adaptively adjusts the number of embedding layers and dimensions at inference time, balancing compute and accuracy on demand.

MRL

Truncates embedding vectors to any dimension without retraining, balancing accuracy and speed per request.
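Truncation of this kind can be sketched as keeping the leading dimensions and re-normalizing. This is a minimal illustration, assuming the embedding model was trained with a Matryoshka-style objective so that the leading dimensions remain meaningful on their own; `truncate_embedding` and the example vector are hypothetical.

```python
import math
from typing import List

def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Toy 4-dimensional embedding, assumed for illustration only.
full_vector = [3.0, 4.0, 12.0, 84.0]
fast = truncate_embedding(full_vector, 2)     # cheaper comparison, less detail
precise = truncate_embedding(full_vector, 4)  # full dimensionality
```

Re-normalizing after truncation keeps cosine and dot-product comparisons on the same scale regardless of which dimension a request picks.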

Contributors

Meet our team

The people behind vLLM Semantic Router

Maintainer

Huamin Chen

Distinguished Engineer @Red Hat

Maintainer

Chen Wang

Senior Staff Research Scientist @IBM

Maintainer

Yue Zhu

Staff Research Scientist @IBM

Maintainer

Xunzhuo Liu

Intelligent Routing @vLLM

Committer

Senan Zedan

R&D Manager @Red Hat

Committer

samzong

AI Infrastructure / Cloud-Native PM @DaoCloud

Committer

Liav Weiss

Software Engineer @Red Hat

Committer

Asaad Balum

Senior Software Engineer @Red Hat

Committer

Yehudit

Software Engineer @Red Hat

Committer

Noa Limoy

Software Engineer @Red Hat

Committer

JaredforReal

Software Engineer @Z.ai

Committer

Srinivas A

Software Engineer @Yokogawa

Committer

carlory

Open Source Engineer @DaoCloud

Committer

Yossi Ovadia

Senior Principal Engineer @Red Hat

Committer

Jintao Zhang

Senior Software Engineer @Kong

Committer

yuluo-yx

Individual Contributor

Committer

cryo-zd

Individual Contributor

Committer

OneZero-Y

Individual Contributor

Committer

aeft

Individual Contributor

Committer

Hao Wu

Individual Contributor

Committer