Full-day Tutorial: Networking for Stateful LLM Inference
The tutorial will take place online.
- Junchen Jiang (University of Chicago)
- Yuhan Liu (University of Chicago)
- Zhuohan Gu (MIT)
- Qizheng Zhang (Stanford University)
- Chen Wang (IBM)
- Yue Zhu (IBM)
- Ruoyu Qin (Tsinghua University)
- Yuwei An (Carnegie Mellon University)
- Xiangfeng Zhu (University of Washington)
- Kuntai Du (TensorMesh, Inc)
- Huaizheng Zhang (ByteDance, Inc)
- Shaoting Feng (University of Chicago)
| Time | Topic |
|---|---|
| 08:45 — 09:15 | Overview of Distributed LLM Inference Systems and LMCache |
| 09:15 — 10:00 | Setup and One-Click Installation of LMCache |
| Session A | Feel the Speedup |
| 10:00 — 10:30 | Build an Agent Application with LMCache — session 1 |
| 10:30 — 11:00 | Morning coffee break |
| 11:00 — 11:30 | Build an Agent Application with LMCache — session 2 |
| 11:30 — 12:15 | Visualize the Speedup Using the LMCache Grafana Dashboard |
| 12:15 — 12:45 | LMCache and vLLM Integration |
| 12:45 — 14:00 | Lunch break |
| Session B | Dive into the Details |
| 14:00 — 14:30 | Run Disaggregated Prefill with LMCache |
| 14:30 — 15:00 | Code Walkthrough: Mooncake Backend Integration in LMCache |
| 15:00 — 15:45 | Implement a Simple KV Cache Compression Algorithm in LMCache |
| 15:45 — 16:15 | Afternoon coffee break |
| 16:15 — 16:45 | Run Multi-Modality Models with LMCache |
| 16:45 — 17:15 | Autoscaling with LMCache |
| 17:15 — 18:00 | Q&A and Wrap-up |
This tutorial offers a comprehensive, hands-on introduction to LMCache, a high-performance KV cache management layer for distributed LLM inference. The morning begins with an overview of distributed LLM inference systems and a one-click installation of LMCache. Session A, "Feel the Speedup," focuses on experiencing LMCache's performance benefits: building agentic and retrieval-augmented generation (RAG) applications, visualizing the resulting speedups with the LMCache Grafana dashboard, and integrating LMCache with vLLM. After lunch, Session B, "Dive into the Details," goes deeper into technical topics such as KV cache sharing, disaggregated prefill [3], Mooncake storage backend integration [7], KV cache compression [1, 6], and multi-modality support. The day concludes with a session on autoscaling and an open Q&A and wrap-up.
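To make the idea of KV cache sharing and reuse concrete before the hands-on sessions, here is a minimal, self-contained sketch of a prefix-keyed cache. All names here (`ToyPrefixKVCache`, `put`, `get`) are illustrative inventions for this tutorial description, not LMCache's actual API:

```python
import hashlib

class ToyPrefixKVCache:
    """Illustrative prefix-keyed KV store: maps a hash of a token-id
    prefix to a previously computed KV blob, so a request that repeats
    a stored prefix can skip prefill for those tokens."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens):
        # Hash the token-id prefix to obtain a fixed-size lookup key.
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, kv_blob):
        self._store[self._key(tokens)] = kv_blob

    def get(self, tokens):
        # Return the cached KV for the longest stored prefix of `tokens`.
        for end in range(len(tokens), 0, -1):
            blob = self._store.get(self._key(tokens[:end]))
            if blob is not None:
                return tokens[:end], blob
        return [], None

cache = ToyPrefixKVCache()
cache.put([1, 2, 3], "kv-for-123")
hit, blob = cache.get([1, 2, 3, 4, 5])
# hit == [1, 2, 3]: only tokens [4, 5] would still need prefill
```

Real systems like LMCache additionally handle paged GPU memory, eviction, and cross-node transfer; this sketch only shows the lookup idea.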
Stateful LLM inference, where prior inputs are reused for future responses, is now ubiquitous in applications such as agent workflows, multi-turn chat, and RAG. This creates a distributed, network-intensive workload: shared context, often encoded in large KV caches (multiple gigabytes per request), must be transferred across GPU nodes or fetched from remote storage. Efficient KV cache management and transfer is thus critical. Prior work tackles this via KV reuse (e.g., AttentionStore [4], CacheGen [1]), disaggregated prefill [3], and RAG-aware caching [2, 5]. This tutorial presents LMCache, which enables stateful LLM optimizations and new research.
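The "multiple gigabytes" figure is easy to check with back-of-envelope arithmetic. As a sketch, for a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16, no grouped-query attention):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; one entry per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-2-7B-style config: 32 layers, 32 KV heads, head dim 128, fp16.
gib = kv_cache_bytes(32, 32, 128, seq_len=8192) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB for a single 8K-token context
```

Models that use grouped-query attention shrink `num_kv_heads` and hence the cache, but long contexts and many concurrent requests still push total KV state into the multi-gigabyte range.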
- Overview of Distributed LLM Inference Systems and LMCache
- One-Click Installation of LMCache
- Build an Agent Application with LMCache
- Build a Retrieval-Augmented Generation Application with LMCache
- Visualize the Speedup Using the LMCache Grafana Dashboard
- Run KV Cache Sharing Across Nodes
- Run Disaggregated Prefill with LMCache (XpYd)
- Implement a Simple KV Cache Compression Algorithm in LMCache
- Run Multi-Modality Models with LMCache
- Autoscaling with LMCache
- LMCache and vLLM Integration
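As a taste of the compression topic above, here is a minimal per-token symmetric int8 quantization sketch in NumPy. It is purely illustrative: production KV cache compression (e.g., CacheGen [1] or KIVI [6]) is considerably more sophisticated, and none of these function names come from LMCache itself:

```python
import numpy as np

def quantize_int8(kv):
    # Per-token (per-row) symmetric quantization: one fp32 scale per row.
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 128)).astype(np.float32)  # toy KV tensor
q, s = quantize_int8(kv)
kv_hat = dequantize_int8(q, s)
# int8 storage is roughly 4x smaller than fp32 (ignoring the per-row scales),
# and the per-entry reconstruction error is bounded by scale / 2.
```

The tutorial session walks through plugging a scheme like this into LMCache's cache pipeline, trading a small accuracy loss for less network and storage traffic.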
- This tutorial targets a general audience interested in LLM inference and networking for LLM inference.
- Prior knowledge of LLMs is recommended but not required.
- Attendees should bring their own laptops capable of connecting to virtual machines.
Junchen Jiang is an Associate Professor of Computer Science at the University of Chicago. His research interests are networked systems and their intersections with machine learning. He received his bachelor’s degree from Tsinghua University (Yao Class) in 2011 and his Ph.D. from CMU in 2017. He has received two Google Faculty Research Awards, an NSF CAREER Award, a Best Paper Award at ACM EuroSys, and a CMU Computer Science Doctoral Dissertation Award. https://people.cs.uchicago.edu/~junchenj/
Yuhan Liu is a fifth-year Ph.D. student at the University of Chicago, co-advised by Junchen Jiang and Shan Lu. Her research focuses on building an efficient KV cache layer for serving LLMs, including KV cache compression, dynamic blending of KV caches, and cross-model KV cache sharing. She also leads LMCache and vLLM Production Stack, the open-source KV cache layer of vLLM.
Zhuohan Gu is a first-year PhD student at MIT EECS. His research interests lie broadly in computer systems and machine learning, with a recent focus on ML/LLM inference and AI infrastructure. He works on high-performance KV cache management in LLM serving, including KV cache compression, P/D disaggregation, and KV blending/editing. Zhuohan is also a community builder for two open-source projects LMCache and vLLM production stack as a member of LMCache Lab. He graduated from the University of Chicago in 2025 with a BS in Mathematics and Computer Science.
Qizheng Zhang is a third-year PhD student in the computer science department at Stanford University, advised by Kunle Olukotun. His research interest is broadly in computer systems (networking, distributed systems, architecture) and machine learning, with a recent focus on advancing the performance-cost trade-off of agentic AI applications.
Chen Wang is a Senior Research Scientist at the IBM T.J. Watson Research Center. Her research interests lie in Kubernetes, container cloud resource management, cloud-native AI platforms, GenAI inference/fine-tuning systems, and applying AI/ML to cloud management. She is an open-source advocate, a Kubernetes contributor, an Open Source Summit speaker, and a KubeCon speaker/reviewer. She has authored work cited over 900 times, with an h-index of 15, and has received multiple prestigious honors at IBM Research, including the Outstanding Technical Achievement Award and the Outstanding Innovation Achievement Award. She was recognized as one of IBM's youngest Master Inventors, with over 45 filed patents. She obtained an MS and a Ph.D. in Electrical & Computer Engineering from Carnegie Mellon University (CMU).
Yue Zhu is a Staff Research Scientist specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contributed to sustainability for foundation models and to scalable, efficient storage solutions.
Ruoyu Qin is a Ph.D. student in the Department of Computer Science and Technology at Tsinghua University, working in the MADSys Lab under the supervision of Prof. Mingxing Zhang. His research focuses on efficient machine learning systems, with an emphasis on large-scale distributed LLM deployments. Currently, he is an AI infrastructure research intern at Moonshot AI, a startup specializing in LLM technologies. At Moonshot, he led the development of Mooncake, a KVCache-centric disaggregated inference system. His paper on Mooncake received the Erik Riedel Best Paper Award at FAST’25.
Yuwei An is a graduate student at Carnegie Mellon University, working with Prof. Beidi Chen. His research interests lie at the intersection of LLM efficiency and systems for LLMs. He received his bachelor's degree from Tsinghua University in 2023. Personal website: https://oasis-git.github.io/
Xiangfeng Zhu is a Ph.D. student at Paul G. Allen School of Computer Science and Engineering at the University of Washington, advised by Arvind Krishnamurthy and Ratul Mahajan. His research interests are broadly in computer systems and networking. His recent works have focused on microservices, service meshes, and application networking.
Kuntai Du is the Chief Scientist at TensorMesh, Inc. After working with the vLLM team for more than a year, he joined TensorMesh to deploy LLM serving systems at scale. His research covers disaggregated prefilling, KV cache offloading and sharing, and KV-cache-based prefill-decode acceleration. This research is instantiated in two open-source projects: LMCache, a glue layer between the inference engine and the infrastructure, and vLLM Production Stack, a performant production-level deployment built on vLLM.
Huaizheng Zhang is a Research Scientist at ByteDance, USA. He received his Ph.D. from Nanyang Technological University, Singapore. His research focuses on Machine Learning Systems (MLSys) and Large Language Model Infrastructure (LLM Infra). He has published 20+ papers in top conferences such as NeurIPS, AAAI, and ACM MM, and serves as a reviewer for ICLR, NeurIPS, ICML, and other leading venues. His open-source MLSys projects on GitHub have accumulated 7,000+ stars and are widely adopted by the community.
Shaoting Feng is a pre-doctoral master's student at the University of Chicago, advised by Junchen Jiang. His research focuses on systems for large language models (LLMs) and computer networking. He is also a core contributor and maintainer of LMCache and vLLM Production Stack.
[1] Liu, Yuhan, et al. "CacheGen: KV cache compression and streaming for fast large language model serving." Proceedings of the ACM SIGCOMM 2024 Conference. 2024.
[2] Yao, Jiayi, et al. "CacheBlend: Fast large language model serving for RAG with cached knowledge fusion." Proceedings of the Twentieth European Conference on Computer Systems. 2025.
[3] Zhong, Yinmin, et al. "DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving." 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 2024.
[4] Gao, Bin, et al. "Cost-efficient large language model serving for multi-turn conversations with CachedAttention." 2024 USENIX Annual Technical Conference (USENIX ATC 24). 2024.
[5] Jin, Chao, et al. "RAGCache: Efficient knowledge caching for retrieval-augmented generation." arXiv preprint arXiv:2404.12457 (2024).
[6] Liu, Zirui, et al. "KIVI: A tuning-free asymmetric 2-bit quantization for KV cache." arXiv preprint arXiv:2402.02750 (2024).
[7] Qin, Ruoyu, et al. "Mooncake: A KVCache-centric disaggregated architecture for LLM serving." arXiv preprint arXiv:2407.00079 (2024).