System-level intelligence

Signal before scale

System-brain routing: signal-led, entropy-aware, ruthlessly clear.

Signals: 13

13 signal families spanning intent, safety, modality, context, and preference.

Selection: 12

12 selectors across symbolic policy, latency heuristics, reinforcement learning, and ML routing.

Surfaces: 03

One architecture across cpu-local, amd-local, and ci-k8s.

Quick start

One supported local path. Copy the installer, run it, then open the dashboard.

Install locally in one line.

The supported first-run path is a single installer that sets up the CLI and local serve flow on macOS and Linux.

One-liner install (macOS / Linux)
curl -fsSL https://vllm-semantic-router.com/zh-Hans/install.sh | bash

Installs into ~/.local/share/vllm-sr, writes ~/.local/bin/vllm-sr, and keeps Windows on the manual pip flow in the docs.

Research

Papers behind the router.

Research threads that trace the router's evolving ideas across safety, multimodality, orchestration, and system design.

2026 / Paper (Position Paper)

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

vLLM Semantic Router Team

arXiv Technical Report

We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.

2026 / Paper (Vision Paper)

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

arXiv Technical Report

We synthesize the project’s recent routing, fleet, multimodal, and governance results into the Workload-Router-Pool (WRP) architecture, connecting signal-driven routing to a full-stack inference optimization framework and outlining future research directions across workload, router, and pool design.

2026 / Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.

2026 / Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.

2026 / Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.
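As a rough illustration of the routing rule this abstract describes, the sketch below picks the cheapest candidate whose estimated success probability meets a target reliability threshold. The model names, costs, capability scores, and the `estimate_success` heuristic are hypothetical placeholders, not the paper's estimator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical model identifier
    cost: float        # hypothetical relative cost per agent step
    capability: float  # hypothetical capability score in [0, 1]


def estimate_success(candidate: Candidate, difficulty: float) -> float:
    """Toy estimator (an assumption): success drops once difficulty exceeds capability."""
    shortfall = max(0.0, difficulty - candidate.capability)
    return max(0.0, min(1.0, 1.0 - shortfall))


def route(candidates: Sequence[Candidate], difficulty: float, threshold: float) -> Candidate:
    """Return the cheapest candidate meeting the threshold, else the most capable one."""
    eligible = [c for c in candidates if estimate_success(c, difficulty) >= threshold]
    if eligible:
        return min(eligible, key=lambda c: c.cost)
    # No candidate is reliable enough: fall back to the strongest model.
    return max(candidates, key=lambda c: c.capability)


CANDIDATES = [
    Candidate("small-vlm", cost=1.0, capability=0.5),
    Candidate("medium-vlm", cost=4.0, capability=0.8),
    Candidate("large-vlm", cost=10.0, capability=0.95),
]
```

With these placeholder numbers, an easy step stays on the small model and a harder step escalates only as far as the threshold requires.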

2026 / Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.

2026 / Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.

2026 / Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.

2026 / Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than a pure GPU generation upgrade.
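Read literally, the statement above corresponds to the proportionality below, where $\eta(W)$ denotes tokens per watt at serving context window $W$. The symbol $\eta$ is introduced here for illustration and is not taken from the paper.

```latex
\eta(W) \propto \frac{1}{W}
\quad\Longrightarrow\quad
\frac{\eta(2W)}{\eta(W)} \approx \frac{1}{2}
```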

2026 / Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

arXiv Technical Report

We show how probabilistic ML predicates in policy languages can silently co-fire on the same query, and implement conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.
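To make the co-firing problem concrete, here is a minimal sketch, not the Semantic Router DSL's syntax or implementation. It assumes each predicate yields an independent confidence score and fires when that score strictly exceeds a threshold; the predicate names and scores are hypothetical. Normalizing the scores with a softmax makes them sum to 1, so at most one can exceed 0.5.

```python
import math
from typing import Dict, List


def fired(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Names of predicates whose score strictly exceeds the threshold."""
    return sorted(name for name, score in scores.items() if score > threshold)


def softmax(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Normalize scores so they compete for a shared probability mass of 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - peak) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical independent classifier scores for one query.
raw = {"math": 0.9, "code": 0.8, "chat": 0.1}

conflicting = fired(raw)                           # two predicates co-fire
resolved = fired(softmax(raw, temperature=0.1))    # at most one can fire
```

A low temperature sharpens the distribution; with a high temperature no predicate may clear 0.5, which a real policy would need to handle with a default route.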

2026 / Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We show that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model’s performance on persistent user-specific queries while cutting effective inference cost by 96%.

2025 / Paper

When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

NeurIPS - MLForSys

We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.

2025 / Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

We present category-aware semantic caching, in which similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture that separates in-memory HNSW search from external document storage.
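The per-category idea can be sketched as follows. This is an illustrative toy, not the paper's design: the category names, thresholds, and TTLs are hypothetical, a linear scan stands in for the HNSW index, and quotas are omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CachePolicy:
    similarity_threshold: float  # minimum cosine similarity for a hit
    ttl_seconds: float           # maximum entry age


# Hypothetical policies: strict and long-lived for code, loose and short-lived for chitchat.
POLICIES = {
    "code": CachePolicy(similarity_threshold=0.95, ttl_seconds=3600.0),
    "chitchat": CachePolicy(similarity_threshold=0.80, ttl_seconds=60.0),
}


@dataclass
class Entry:
    category: str
    embedding: List[float]
    response: str
    created_at: float


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup(entries: List[Entry], category: str, query: List[float], now: float) -> Optional[str]:
    """Return the best unexpired same-category response above that category's threshold."""
    policy = POLICIES[category]
    best, best_sim = None, policy.similarity_threshold
    for entry in entries:
        if entry.category != category or now - entry.created_at > policy.ttl_seconds:
            continue
        sim = cosine(entry.embedding, query)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best.response if best else None
```

The same near-duplicate query can therefore hit in one category and miss in another, which is the behavior the abstract describes.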

2025 / Paper

Semantic Inference Routing Protocol (SIRP)

Huamin Chen, Luay Jalil

Internet Engineering Task Force (IETF)

This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.

2025 / Paper

Multi-Provider Extensions for Agentic AI Inference APIs

H. Chen, L. Jalil, N. Cocker

Internet Engineering Task Force (IETF) - Network Management Research Group

This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.

Core logic

Charting the LLM system brain.

A research-driven stack for uncharted territory, probing the frontier where signals, policies, and models converge into one intelligence layer.

Signal extraction

Encoder signals turn raw requests into legible semantic state.

Decision engine

Neural signals meet symbolic rules in auditable routing logic.

Plugin chain

Cache, safety, rewrite, and tracing attach as composable behaviors.

Intent-to-policy compile

Natural language intent compiles into neural-symbolic policy before execution begins.

Frontier LLM systems

Research drives the stack itself, exploring frontier LLM systems beyond settled patterns.

Full dashboard support

Operate routing, topology, controls, and runtime feedback from one integrated dashboard.

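Routing blueprint

How the system works

Understand signal extraction, decision logic, and model routing behavior through interactive demos.

Shannon mapping

A structural mapping from communication theory to the routing pipeline.

The user request is the raw source message before encoding.

Built on encoder models

Encoder-driven intelligence

Specialized encoder models extract semantics from every request: understanding intent, ranking relevance, and classifying content across modalities in real time.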

Signal surfaces

Sequence classification, token labeling, embeddings, and reranking collapse into one system-intelligence layer.

SEQ_CLS: Sequence classification for domain, jailbreak, fact-check, and feedback routing.
TOKEN: Token labeling for PII and safety-sensitive spans that need localized intervention.
EMBED: Embedding and rerank paths for semantic cache, similarity search, and candidate scoring.
MOD: Multimodal. Detects text, image, and audio inputs and routes them to the appropriate modality model.

[Diagram: encoder pipeline from request to signals]

Input: "Is machine learning related to AI?"
Tokenizer: [CLS] Is machine learning related to AI? [SEP]
Embedding: h₀ = Σ (Token Emb, Segment Emb, Position Emb)
Encoder Block (×N): Multi-Head Attention (ATTN), Add & Norm (NORM), Feed-Forward (FFN), Add & Norm (NORM)

Signals
CLS: Sentence-Level (CLS Token). [CLS] → Linear Head → "computer science". TaskType: SEQ_CLS. Used for: Domain, Jailbreak, Fact-check, Feedback, Modality.
BIO: Token-Level (Per Token). Each token → BIO Label → O O B-LOC I-LOC O. TaskType: TOKEN_CLS. Used for: PII Detection.
EMB: Bi-Encoder. mean-pooling(h₁..hₙ) → [0.23, -0.41, 0.87, ...]. TaskType: EMBEDDING. Used for: Semantic Cache, Similarity, Complexity-CL, Jailbreak-CL.
RER: Cross-Encoder. [CLS] query [SEP] candidate [SEP] → score. TaskType: CROSS_LEARNING. Used for: Rerank, Multi-Modal.
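The token-level (BIO) path in the diagram above emits one label per token. A minimal sketch of how such labels could be grouped into spans for localized intervention is shown below; the function name and the example tokens are hypothetical, and only the label sequence comes from the diagram.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labeled tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span in progress, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span: treat as outside.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hypothetical tokens paired with the label sequence shown in the diagram.
example = decode_bio(
    ["I", "visited", "New", "York", "yesterday"],
    ["O", "O", "B-LOC", "I-LOC", "O"],
)
```

Each recovered span can then be masked or redacted on its own, rather than rejecting the whole request.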
BIE

Bi-Encoder embeddings

Encodes queries and candidates independently into dense vectors for similarity search and semantic caching.

XCE

Cross-Encoder learning

Scores query-candidate pairs with joint cross-attention for high-precision reranking.

CLS

Classification

Classifiers built on in-house BERT models for domain, jailbreak, PII, and fact-check detection, covering multiple signals.

ATT

Full attention

Bidirectional attention across tokens and sentences: full context in both directions, with no causal mask.

2DM

2DMSE

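Adaptively adjusts the number of embedding layers and dimensions at inference time, balancing compute against accuracy on demand.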

MRL

MRL

Truncates embedding vectors to any dimension without retraining, balancing accuracy and speed per request.
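Assuming MRL here refers to Matryoshka Representation Learning, truncation amounts to keeping a prefix of the vector and renormalizing it so cosine similarity remains meaningful. The helper below is a hypothetical sketch, not the router's API, and it only preserves quality for embeddings from a model trained with such an objective.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm > 0 else head


# Toy 3-dimensional vector truncated to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

A per-request policy could pick `dim` from a small set of supported sizes, trading accuracy for speed as the text above describes.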

Contributors

Meet our team

The outstanding people behind vLLM Semantic Router

Huamin Chen (Maintainer)

Distinguished Engineer @Red Hat

Chen Wang (Maintainer)

Senior Staff Research Scientist @IBM

Yue Zhu (Maintainer)

Staff Research Scientist @IBM

Xunzhuo Liu (Maintainer)

Intelligent Routing @vLLM

Senan Zedan (Committer)

R&D Manager @Red Hat

samzong (Committer)

AI Infrastructure / Cloud-Native PM @DaoCloud

Liav Weiss (Committer)

Software Engineer @Red Hat

Asaad Balum (Committer)

Senior Software Engineer @Red Hat

Yehudit (Committer)

Software Engineer @Red Hat

Noa Limoy (Committer)

Software Engineer @Red Hat

JaredforReal (Committer)

Software Engineer @Z.ai

Srinivas A (Committer)

Software Engineer @Yokogawa

carlory (Committer)

Open Source Engineer @DaoCloud

Yossi Ovadia (Committer)

Senior Principal Engineer @Red Hat

Jintao Zhang (Committer)

Senior Software Engineer @Kong

yuluo-yx (Committer)

Individual Contributor

cryo-zd (Committer)

Individual Contributor

OneZero-Y (Committer)

Individual Contributor

aeft (Committer)

Individual Contributor

Hao Wu (Committer)

Individual Contributor

Qiping Pan (Committer)

Individual Contributor

Maintainers, committers, and contributors across research, infrastructure, and open-source operations.

View all team members
Documentation

Architecture, written to be used.

Install, configure, train, and operate from one dense documentation graph.

Docs index
Community

Research and builders in one loop.

Papers, working groups, and contributors evolve the same system in public.

Community routes