ACM SIGCOMM 2022, Amsterdam, The Netherlands

ACM SIGCOMM 2022 TUTORIAL: In-Network Machine Learning using Taurus

For the slides and recordings of the panel and keynote, and instructions for the tutorial, please check here:

Tutorial Program

  • Monday August 22, 2022 9:00 am - 5:00 pm CEST (Room: Administratiezaal)

  • 9:00 am - 10:30 am CEST Session I

    An overview of the Taurus data-plane architecture and the Spatial language

  • 10:30 am - 11:00 am CEST Break

  • 11:00 am - 12:30 pm CEST Session II

    Hands-on exercises using the in-network ML development environment based on Taurus

    - ML processing pipeline
    - Compiler
    - Behavioral model
    - Lab exercises

  • 12:30 pm - 1:30 pm CEST Lunch

  • 1:30 pm - 3:00 pm CEST Session III

    Keynote: Systems for ML and ML for Systems: A Virtuous Cycle

    Speaker: Kunle Olukotun, Stanford University and SambaNova Systems
    Abstract: This talk is about the virtuous interplay between machine learning (ML) and systems. I will show examples of how systems optimized for ML computation can be used to train more accurate and capable ML models and how these ML models can be used to improve upon the ad-hoc heuristics used in system design and management. These improved systems can then be used to train better ML models. The latest trend in ML is the development of Foundation models. Foundation models are large pretrained models that have obtained state-of-the-art quality in natural language processing, vision, speech, and other areas. These models are challenging to train and serve because they are characterized by billions of parameters, irregular data access (sparsity) and irregular control flow. I will explain how Reconfigurable Dataflow Accelerators (RDAs) can be designed to accelerate foundation models with these characteristics. SambaNova Systems is using RDA technology to achieve record-setting performance on foundation models. I will describe how the RDAs can also be used to build Taurus, an intelligent network data plane that enables ML models to be used to manage computer networks at full line-rate bandwidths. In particular, a Taurus prototype detects two orders of magnitude more events in a security application than a state-of-the-art system based on conventional network technology.

  • 3:00 pm - 3:30 pm CEST Break

  • 3:45 pm - 5:00 pm CEST Session IV

    Panel: Role of ML in networking and vice-versa

    Panelists: Kunle Olukotun, Jennifer Rexford, Nick Feamster, Minlan Yu
    Moderator: Muhammad Shahbaz

Call For Participation

This tutorial will expose attendees to the exciting new world of in-network machine learning (ML) enabled by Taurus. Through lectures and lab exercises, attendees will learn the internals of Taurus, write per-packet ML applications (using KMeans, DNNs, and LSTMs) in Spatial, and test them using Taurus’s behavioral model, thereby gaining hands-on experience with in-network machine learning.

We will present the design of the Taurus switch, emphasizing how its MapReduce block enables per-packet ML by supporting new computational primitives inside the switch. We will also provide an overview of the Spatial language and, through a series of exercises, show how to write ML applications using P4 + Spatial and compile them to the Taurus switch. By the end of the tutorial, attendees will be able to build and run novel per-packet ML models in Spatial and evaluate them using the Taurus behavioral model with Mininet, a virtual network environment.


Maintaining strict security and service-level objectives (SLOs) in next-generation hyperscale datacenter, enterprise, and edge networks demands that compute-intensive management and control decisions be made on the current state of the entire network (e.g., topology, queue sizes, and link and server loads) and applied per-packet at line rate, in a fast and intelligent way [1]. A delay of even a few microseconds in today’s (petabit-bisection-bandwidth) networks would result in (a) missing millions of anomalous packets [1], (b) saturated switch queues and congestion [6], (c) excessive retransmissions due to packet drops [7], and (d) imbalanced traffic and server loads [8], ultimately resulting in loss of revenue, higher operating costs, and unsatisfied end users.
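
The scale involved can be checked with back-of-the-envelope arithmetic. The sketch below assumes an aggregate rate of 1 Pb/s, a 5 µs decision delay, and 500-byte packets; all three values are illustrative assumptions rather than figures from the cited work, and the result counts every packet that passes during the delay, not only the anomalous ones.

```python
def packets_during_delay(bandwidth_bps: int, delay_us: int, packet_bytes: int) -> int:
    """Whole packets that cross the network while a decision is still pending."""
    # Bits transmitted during the delay (delay is given in microseconds).
    bits_in_flight = bandwidth_bps * delay_us // 1_000_000
    return bits_in_flight // (8 * packet_bytes)


# Assumed values, for illustration only.
PETABIT_PER_SECOND = 10**15
missed = packets_during_delay(PETABIT_PER_SECOND, delay_us=5, packet_bytes=500)
print(missed)
```

Under these assumptions, roughly 1.25 million packets pass before a delayed decision can take effect, which is consistent with the "millions of packets" order of magnitude.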

Unfortunately, the dominant solutions available today are either fast-yet-dumb or slow-but-intelligent. Network operators run services (like load balancing, anomaly detection, and congestion control) using switches and routers, which can react in nanoseconds to network conditions [1]. However, these devices are designed for routing packets and have a constrained programming model (e.g., match-action tables or MATs [6]), which limits these services to simple heuristics. Conversely, control-plane servers (managing the network) can make complicated data-driven decisions [1]. However, the round trip (10 µs or more) between the controller and switch fundamentally limits the control plane’s reaction speed, even with fast packet IO (e.g., Intel’s DPDK) and dedicated hardware (e.g., TPU or GPU).

We believe it is now time to bridge this gap between speed and intelligence. To do so, we present and open-source Taurus, a novel data-plane switch architecture for per-packet ML (published at ASPLOS ’22 [1] and winner of the IETF/IRTF ANRP Prize ’22). Taurus extends the Protocol-Independent Switch Architecture (PISA) [5] with a new MapReduce (MR) block, based on a spatial SIMD architecture that supports a variety of ML models [3]. The block is accompanied by an open-source language, Spatial [2], which, along with the P4 language [4], specifies the various components of the Taurus switch.
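
To give a feel for why map and reduce primitives are enough for per-packet inference, the following minimal sketch expresses one dense neural-network layer as an elementwise map followed by a sum reduction. It is plain Python, not Spatial, and it is not the Taurus implementation; the function names, feature values, and weights are hypothetical and chosen only for illustration.

```python
from functools import reduce
import operator


def neuron(features, weights, bias):
    """One neuron: map (elementwise multiply), reduce (sum), then ReLU."""
    products = map(operator.mul, features, weights)  # map step
    total = reduce(operator.add, products, bias)     # reduce step, seeded with the bias
    return max(0.0, total)                           # ReLU activation


def dense_layer(features, weight_rows, biases):
    """A layer is an outer map over independent neurons."""
    return [neuron(features, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical per-packet feature vector and made-up model parameters.
features = [1.0, 2.0, 3.0]
weight_rows = [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]]
biases = [1.0, -2.0]

scores = dense_layer(features, weight_rows, biases)
print(scores)
```

Each neuron is independent of the others, and each multiply inside a neuron is independent too, so both levels of the computation can in principle be laid out in parallel on SIMD-style hardware.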


A rough outline follows:

  • Session I: An overview of the Taurus data-plane architecture and the Spatial language.

  • Session II: Hands-on exercises using the in-network ML development environment based on Taurus.

    • ML processing pipeline
    • Compiler
    • Behavioral model
    • Lab exercises
  • Session III: Keynote: Systems for ML and ML for Systems: A Virtuous Cycle, by Kunle Olukotun.

  • Session IV: A panel of luminaries from ML, architecture, and networking, brought together for the first time at SIGCOMM to discuss the role of ML in networking.

Audience Expectations and Prerequisites

Attendees are not expected to have any prior knowledge of the P4 or Spatial languages; the understanding needed to finish the lab exercises will be provided during the tutorial. However, attendees should note the following: (a) Attendees must bring their own laptops. (b) We will provide a VM image containing the required packages and tools, which attendees will run on their machines. (c) We will provide detailed handouts to help follow the tutorial.


  • Tushar Swamy

    Stanford University

    • Bio:

      Tushar Swamy is a Ph.D. candidate in the Electrical Engineering Department at Stanford University, where he is advised by Kunle Olukotun. His research is at the intersection of machine learning, networking, and architecture, where he develops the hardware/software stack for data-plane-based machine learning infrastructure and services. Tushar received the IETF/IRTF ANRP Prize '22 for his work on ML-capable switches and was named a Goldwater Scholar in 2014.

  • Annus Zulfiqar

    Purdue University

    • Bio:

      Annus Zulfiqar is a Ph.D. candidate and Ross Fellow in the Computer Science Department at Purdue University, where he is advised by Muhammad Shahbaz. His research focuses on designing next-generation hardware/software abstractions and architectures for emerging workloads (e.g., in-network machine learning). Before joining Purdue, he worked as a Design Engineer at the Center for Advanced Research in Engineering (CARE), Pakistan, where he designed Wi-Fi/Ethernet/LTE-capable IoT Sensor Node Networks for Industrial Machine Telemetry. He received his undergraduate degree in Electrical Engineering from the National University of Sciences and Technology (NUST), Pakistan.

  • Alex Rucker

    Stanford University

    • Bio:

      Alex Rucker is a fifth-year Ph.D. student at Stanford University studying Electrical Engineering and advised by Kunle Olukotun. His research interests are centered on cross-stack (language, compiler, and hardware) optimizations that can make coarse-grained reconfigurable hardware a reality for applications beyond dense linear algebra. At Stanford, he has worked on bringing sparse applications (Capstan) and applications with irregular control flow (Revet) to vectorized reconfigurable dataflow hardware. He previously received his BS in Electrical and Computer Engineering from Cornell in 2017.

  • Muhammad Shahbaz

    Purdue University

    • Bio:

      Muhammad Shahbaz is a Kevin C. and Suzanne L. Kahn New Frontiers Assistant Professor in Computer Science at Purdue University. His research focuses on the design and development of domain-specific abstractions, compilers, and architectures for emerging workloads (including machine learning and self-driving networks). Shahbaz received his Ph.D. and M.A. in Computer Science from Princeton University and B.E. in Computer Engineering from the National University of Sciences and Technology (NUST). Before joining Purdue, Shahbaz worked as a postdoc at Stanford University and a Research Assistant at Georgia Tech and the University of Cambridge. Shahbaz has built open-source systems, including Pisces, SDX, and NetFPGA-10G, that are widely used in industry and academia. He received the Facebook, Google, and Intel Research Awards; IETF/IRTF ANRP Prize; ACM SOSR Systems Award; APNet Best Paper Award; Best of CAL Paper Award; Internet2 Innovation Award; and Outstanding Graduate Teaching Assistant Award.

  • Kunle Olukotun

    Stanford University

    • Bio:

      Kunle Olukotun is the Cadence Design Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is well known as a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. Olukotun founded SambaNova Systems (to build AI hardware and integrated systems to run AI applications from the data center to the cloud) and Afara Websystems (to develop high-throughput, low-power multicore processors for server systems). The Afara multicore processor, called Niagara, was acquired by Sun Microsystems. Niagara-derived processors now power all Oracle SPARC-based servers. Olukotun currently directs the Stanford Pervasive Parallelism Lab (PPL), which seeks to proliferate the use of heterogeneous parallelism in all application areas using Domain Specific Languages (DSLs). Olukotun is a member of the Data Analytics for What’s Next (DAWN) Lab, which is developing infrastructure for usable machine learning. Olukotun is an ACM Fellow and IEEE Fellow for contributions to multiprocessors on a chip and multi-threaded processor design and is the recipient of the 2018 IEEE Harry H. Goode Memorial Award. Olukotun received his Ph.D. in Computer Engineering from The University of Michigan.


[1] Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun. Taurus: A Data Plane Architecture for Per-Packet ML. In ACM ASPLOS ’22.

[2] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A Language and Compiler for Application Accelerators. In ACM/SIGPLAN PLDI ’18.

[3] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A Reconfigurable Architecture for Parallel Patterns. In ACM/IEEE ISCA ’17.

[4] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. P4: Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer Communication Review 44, 3 (2014), 87–95.

[5] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. In ACM SIGCOMM ’13.

[6] Francis Y Yan, Jestin Ma, Greg D Hill, Deepti Raghavan, Riad S Wahby, Philip Levis, and Keith Winstein. Pantheon: The Training Ground for Internet Congestion-Control Research. In USENIX ATC ’18.

[7] Mo Dong, Qingxi Li, Doron Zarchy, P Brighten Godfrey, and Michael Schapira. PCC: Re-architecting Congestion Control for Consistent High Performance. In USENIX NSDI ’15.

[8] Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese. CONGA: Distributed Congestion-aware Load Balancing for Datacenters. In ACM SIGCOMM ’14.