System-Level Intelligence

Understand the Signals First
Then Talk Scale

Understand each request before scaling the system: signal-driven, with uncertainty as the yardstick, so every decision stays clear.

Signals · 14

14 core signals, spanning 5 heuristic detectors and 9 learned detectors.

Selectors · 12

12 selectors, covering symbolic policies, latency heuristics, reinforcement learning, and ML-based routing.

Papers · 16

16 research papers, covering routing, systems, security, and multimodality.

Quick Start

There is a single officially supported local launch path: copy the install command, run it, and you are in the console.

One command, running locally.

The first-run path converges on a single install script that sets up the CLI and the local service flow on macOS and Linux.

One-command install (macOS / Linux)
curl -fsSL https://vllm-semantic-router.com/zh-Hans/install.sh | bash

Installs to ~/.local/share/vllm-sr by default and writes ~/.local/bin/vllm-sr; Windows still uses the manual pip installation described in the docs.

Research

These papers form the router's underlying thinking.

From security and multimodality to orchestration and systems design, these research threads continue to shape how vLLM Semantic Router evolves.

2026 / Position Paper

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

vLLM Semantic Router Team

arXiv Technical Report

We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.

2026 / Vision Paper

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

arXiv Technical Report

We synthesize the project’s recent routing, fleet, multimodal, and governance results into the Workload-Router-Pool (WRP) architecture, connecting signal-driven routing to a full-stack inference optimization framework and outlining future research directions across workload, router, and pool design.

2026 / Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.

2026 / Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.

2026 / Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.
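The abstract describes the routing loop only at a high level; a minimal sketch of threshold routing over a hypothetical model table (the model names, costs, and reliability curves below are illustrative assumptions, not values from the paper) might look like:

```python
# Hypothetical model tiers: (name, cost per step, reliability as a function
# of estimated action difficulty in [0, 1]). Illustrative numbers only.
MODELS = [
    ("small-vlm",  1.0, lambda d: 0.98 - 0.50 * d),
    ("medium-vlm", 4.0, lambda d: 0.99 - 0.20 * d),
    ("large-vlm", 12.0, lambda d: 0.995 - 0.05 * d),
]

def route_step(difficulty: float, target: float = 0.90) -> str:
    """Pick the cheapest model whose predicted reliability at this
    difficulty still meets the target threshold."""
    for name, cost, reliability in sorted(MODELS, key=lambda m: m[1]):
        if reliability(difficulty) >= target:
            return name
    return MODELS[-1][0]  # fall back to the strongest model

print(route_step(0.1))  # easy step: the cheapest model suffices
print(route_step(0.8))  # hard step: escalates to the strongest tier
```

The design point this illustrates: cost savings come entirely from the difficulty estimator being well calibrated, since the routing rule itself is a one-line threshold scan.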

2026 / Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.

2026 / Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.
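The simulator itself is not reproduced here, but the queueing-theory core such a planner builds on can be sketched with the textbook Erlang-C formula (a standard M/M/c result, not the paper's method): given an arrival rate and per-replica service rate, find the smallest pool whose probability of queueing stays under a budget.

```python
from math import factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """Erlang-C: probability an arrival must wait in an M/M/c queue,
    with offered_load = lambda / mu (in Erlangs)."""
    if offered_load >= servers:
        return 1.0  # unstable regime: everyone waits
    top = offered_load**servers / factorial(servers) * servers / (servers - offered_load)
    bottom = sum(offered_load**k / factorial(k) for k in range(servers)) + top
    return top / bottom

def size_pool(arrival_rate: float, service_rate: float, wait_budget: float) -> int:
    """Smallest replica count keeping P(wait) under the budget."""
    load = arrival_rate / service_rate
    c = max(1, int(load) + 1)  # start just above the stability boundary
    while erlang_c(c, load) > wait_budget:
        c += 1
    return c

# e.g. 8 req/s, each replica serves 1 req/s, keep P(wait) under 5%
print(size_pool(8.0, 1.0, 0.05))
```

A real planner would map P(wait) through service-time distributions to a P99 TTFT target; this sketch only shows why pool size grows sublinearly in load near the tail.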

2026 / Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.
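The shape of that derivation can be hinted at with a toy scan over candidate short/long boundaries on an empirical workload distribution. Everything here is an illustrative assumption: the cost model (replica cost proportional to the context each pool must support) is a stand-in, not FleetOpt's analytical solution.

```python
def cheapest_two_pool(lengths, boundaries, cost_per_token=1.0):
    """Toy version: evaluate each candidate boundary against the empirical
    CDF of request lengths and return the cheapest split."""
    n = len(lengths)
    max_len = max(lengths)
    best = None
    for b in boundaries:
        short_frac = sum(1 for x in lengths if x <= b) / n  # CDF at boundary
        # Short pool pays for context up to b; long pool pays for the max.
        cost = short_frac * b * cost_per_token + (1 - short_frac) * max_len * cost_per_token
        if best is None or cost < best[1]:
            best = (b, cost)
    return best[0]

lengths = [200] * 90 + [8000] * 10       # 90% short, 10% long requests
print(cheapest_two_pool(lengths, [256, 1024, 4096]))
```

With a heavily short-skewed workload, the scan favors a tight boundary, which matches the intuition that most capacity should sit in a cheap short-context pool.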

2026 / Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than a pure GPU generation upgrade.
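Taken at face value, the 1/W law says tokens-per-watt scales inversely with the serving context window; the scaling itself is a two-line numeric check (the constant k is an illustrative hardware factor, not a measured value):

```python
def tokens_per_watt(window: int, k: float = 1.0e6) -> float:
    """1/W scaling: efficiency halves each time the context window doubles.
    k is an illustrative hardware constant, not a measurement."""
    return k / window

# Doubling the window from 8K to 16K halves tokens-per-watt:
assert tokens_per_watt(8_192) == 2 * tokens_per_watt(16_384)
print(tokens_per_watt(8_192), tokens_per_watt(16_384))
```

This is why the paper frames routing topology (keeping short requests on short-window pools) as a larger efficiency lever than a hardware refresh: the topology changes W per request, a new GPU only changes k.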

2026 / Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

arXiv Technical Report

We show how probabilistic ML predicates in policy languages can silently co-fire on the same query, and implement conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.
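The co-firing problem and the softmax-style fix can be sketched in a few lines (the predicate names, scores, and margin are made up for illustration; the real DSL semantics are in the paper):

```python
import math

def softmax(scores):
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def fire_one(predicate_scores, margin=0.2):
    """Instead of letting every predicate above its own threshold co-fire,
    normalize scores with softmax and fire only the winner, requiring a
    margin over the runner-up before committing."""
    probs = softmax(predicate_scores)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    winner, p1 = ranked[0]
    _, p2 = ranked[1]
    return winner if p1 - p2 >= margin else None  # None: ambiguous, escalate

# Two predicates that would naively co-fire on the same query:
print(fire_one({"jailbreak": 0.91, "code_request": 0.88}))  # ambiguous
print(fire_one({"jailbreak": 2.5, "code_request": 0.3}))    # clear winner
```

The margin check is the interesting design choice: it converts "two rules silently both matched" into an explicit ambiguous outcome the policy author must handle.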

2026 / Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We show that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model’s performance on persistent user-specific queries while cutting effective inference cost by 96%.

2026 / Paper (RAG Verification)

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

arXiv Technical Report

We present a real-time verification component for long-document RAG that processes contexts up to 32K tokens, balancing latency and grounding coverage so interactive systems can detect unsupported answers without falling back to truncated checks.

2025 / Paper

When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

NeurIPS - MLForSys

We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.

2025 / Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

We present a category-aware semantic caching where similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture separating in-memory HNSW search from external document storage.
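A minimal sketch of the per-category idea, with made-up thresholds and TTLs (the paper's hybrid HNSW/document-store architecture is not reproduced here):

```python
import time

# Illustrative per-category policies; the thresholds and TTLs are
# placeholder values, not the ones reported in the paper.
POLICIES = {
    "code":    {"threshold": 0.95, "ttl": 60.0},
    "general": {"threshold": 0.85, "ttl": 600.0},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

class CategoryCache:
    """Semantic cache whose similarity threshold and TTL vary by category."""
    def __init__(self):
        self.entries = []  # (category, embedding, answer, stored_at)

    def put(self, category, emb, answer, now=None):
        now = time.time() if now is None else now
        self.entries.append((category, emb, answer, now))

    def get(self, category, emb, now=None):
        now = time.time() if now is None else now
        policy = POLICIES[category]
        for cat, e, answer, t in self.entries:
            if cat != category or now - t > policy["ttl"]:
                continue  # wrong category or expired under this TTL
            if cosine(e, emb) >= policy["threshold"]:
                return answer
        return None
```

Usage: a "code" query needs near-exact similarity and expires quickly, while a "general" query tolerates looser matches and lives longer, which is the whole point of making the policy a function of category rather than a global constant.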

2025 / Paper

Semantic Inference Routing Protocol (SIRP)

Huamin Chen, Luay Jalil

Internet Engineering Task Force (IETF)

This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.

2025 / Paper

Multi-Provider Extensions for Agentic AI Inference APIs

H. Chen, L. Jalil, N. Cocker

Internet Engineering Task Force (IETF) - Network Management Research Group

This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.

Core Architecture

How the brain of an LLM system is built.

From signals and projections to policies and model selection, this is a research-grade routing stack built for real systems.

Signal Extraction

Heuristic rules and learned detectors turn raw requests into a computable routing state.

Projection Coordination

Bucketing, scoring, and mapping results are consolidated into reusable routing facts for downstream decisions.

Decision Engine

Signals and projection results converge in auditable symbolic rules to produce explicit routing decisions.

Plugin Chain

Caching, safety, rewriting, and tracing attach to the same pipeline as composable capabilities.
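The four stages can be sketched as one pass over a request. Everything below is a placeholder for illustration: the signal names, bucketing rules, and model names are assumptions, and keyword heuristics stand in for the real detectors.

```python
def extract_signals(request: str) -> dict:
    """Stage 1: heuristics + learned detectors -> computable routing state."""
    return {
        "domain": "code" if "def " in request or "bug" in request else "general",
        "length": len(request),
    }

def project(signals: dict) -> dict:
    """Stage 2: bucket and score raw signals into reusable routing facts."""
    return {
        "domain": signals["domain"],
        "size_bucket": "long" if signals["length"] > 200 else "short",
    }

def decide(facts: dict) -> str:
    """Stage 3: auditable symbolic rules over the projected facts."""
    if facts["domain"] == "code":
        return "code-model"
    return "general-model-large" if facts["size_bucket"] == "long" else "general-model-small"

def plugins(request: str, decision: str) -> str:
    """Stage 4: composable plugin chain (tracing stands in for the chain)."""
    return f"{decision} (traced)"

def route(request: str) -> str:
    return plugins(request, decide(project(extract_signals(request))))

print(route("please fix this bug in my parser"))
```

The value of the split is auditability: every decision can be replayed from the projected facts alone, without re-running the detectors.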

Frontier Systems Research

This is not a thin routing wrapper; it is a research stack that keeps evolving around next-generation LLM systems.

Console and Operations

Topology, routing, the control plane, and runtime feedback can all be observed and operated from a single console.

Routing Blueprint

How the System Works

Understand signal extraction, projection coordination, decision logic, and model routing behavior through an interactive demo.

The Shannon Mapping

A structural mapping from communication theory to the routing pipeline.

The user request is the raw source message before encoding.

Encoder-Model Driven

Understand first, then generate

Purpose-trained encoders first read intent, rank relevance, and identify modality, then hand the results to generation models.

Signal Entry Surface

Sequence classification, token tagging, embedding retrieval, and reranking converge into a single layer of system intelligence.

SEQ_CLS: sequence classification handles domain detection, jailbreak detection, fact-checking, and feedback routing.
TOKEN: token tagging locates PII and high-risk spans so interventions can stay local.
EMBED: the embedding and reranking pipeline powers semantic caching, similarity retrieval, and candidate scoring.
MOD: multimodality detects text, image, and audio inputs and routes them to the appropriate modality model.

[Interactive encoder walkthrough]
Input: "Is machine learning related to AI?"
Tokenizer: [CLS] Is machine learning related to AI ? [SEP]
Embedding: token + segment + position embeddings, summed into h₀
Encoder block (×N): multi-head attention → add & norm → feed-forward → add & norm
Signal heads:
- CLS, sentence-level: [CLS] → linear head → "computer science" (TaskType: SEQ_CLS; domain, jailbreak, fact-check, feedback, modality)
- BIO, token-level: each token → BIO label, e.g. O O B-LOC I-LOC O (TaskType: TOKEN_CLS; PII detection)
- EMB, bi-encoder: mean-pooling(h₁..hₙ) → [0.23, -0.41, 0.87, ...] (TaskType: EMBEDDING; semantic cache, similarity, Complexity-CL, Jailbreak-CL)
- RER, cross-encoder: [CLS] query [SEP] candidate [SEP] → score (TaskType: CROSS_LEARNING; rerank, multimodal)
BIE

Bi-Encoder Embeddings

Encodes queries and candidates independently into dense vectors for similarity search and semantic caching.

XCE

Cross-Encoder Learning

Scores query-candidate pairs jointly with cross-attention for high-precision reranking.

CLS

Classification

In-house BERT-based classifiers for domain, jailbreak, PII, and fact-checking, covering multiple signals.

ATT

Full Attention

Bidirectional attention across tokens and sentences: full two-way context rather than a causal mask.

2DM

2DMSE

Adaptively adjusts embedding layer count and dimensionality at inference time, balancing compute and accuracy on demand.

MRL

MRL

Truncate embedding vectors to any dimension without retraining, trading accuracy against speed per request.
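MRL-style truncation can be illustrated in a few lines: slice the leading dimensions of an embedding and renormalize before cosine comparison. This is a pure-Python sketch of the general technique, not the project's implementation.

```python
def truncate(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` dimensions and
    re-normalize so cosine similarity remains meaningful."""
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # both inputs unit-normalized

full_a = [0.6, 0.8, 0.0, 0.0]
full_b = [0.8, 0.6, 0.0, 0.0]
a2, b2 = truncate(full_a, 2), truncate(full_b, 2)
print(cosine(a2, b2))  # similarity computed at the reduced dimension
```

The per-request trade-off is then just a choice of `dim`: small for cheap cache lookups, full for high-precision retrieval, with no second model.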

Contributors

The people making this system real

From research to infrastructure, this project is driven by a group of people who keep shipping.

Huamin Chen · Maintainer · Distinguished Engineer @Red Hat
Chen Wang · Maintainer · Senior Staff Research Scientist @IBM
Yue Zhu · Maintainer · Staff Research Scientist @IBM
Xunzhuo Liu · Maintainer · Intelligent Routing @vLLM
Senan Zedan · Committer · R&D Manager @Red Hat
samzong · Committer · AI Infrastructure / Cloud-Native PM @DaoCloud
Liav Weiss · Committer · Software Engineer @Red Hat
Asaad Balum · Committer · Senior Software Engineer @Red Hat
Yehudit · Committer · Software Engineer @Red Hat
Noa Limoy · Committer · Software Engineer @Red Hat
JaredforReal · Committer · Software Engineer @Z.ai
Srinivas A · Committer · Software Engineer @Yokogawa
carlory · Committer · Open Source Engineer @DaoCloud
Yossi Ovadia · Committer · Senior Principal Engineer @Red Hat
Jintao Zhang · Committer · Senior Software Engineer @Kong
yuluo-yx · Committer · Individual Contributor
cryo-zd · Committer · Individual Contributor
OneZero-Y · Committer · Individual Contributor
aeft · Committer · Individual Contributor
Hao Wu · Committer · Individual Contributor
Qiping Pan · Committer · Individual Contributor

Maintainers, committers, and contributors together span research, infrastructure, and open-source collaboration.

View the team roster
Documentation

Architecture, configuration, and operations: all written down clearly.

From installation and configuration through training and operations, the whole path is gathered into one coherent documentation map.

Browse the docs
Community

Researchers and builders, in the same feedback loop.

Papers, working groups, and contributors collaborate in the open and iterate continuously around the same system.

Visit the community