Adoption of Software-Defined Networking (SDN) in critical environments, such as factory automation, avionics, and smart-grid networks, will require in-band control. In such networks, the out-of-band control model prevalent in data center deployments is inapplicable due to high wiring costs and installation effort. Existing designs for seamlessly enabling an in-band control plane cater only to single-controller operation, assume proprietary switch modifications, and/or require a large number of manual configuration steps, making them non-resilient to failures and hard to deploy.
To address these concerns, we design two almost fully automated bootstrapping schemes for a multi-controller in-band network control plane that is resilient to link, switch, and controller failures. One assumes hybrid OpenFlow/legacy switches running (R)STP, and the other uses an incremental approach that circumvents (R)STP. We implement both schemes as OpenDaylight extensions and empirically evaluate their performance with respect to: the time required for the bootstrapping procedure to converge; the time required to dynamically extend the network; and the resulting flow table occupancy. The proposed schemes enable fast bootstrapping of a robust, in-band managed network with support for seamless redundancy of control flows and network extensions, while ensuring interoperability with off-the-shelf switches. The presented schemes were demonstrated successfully in an operational industrial network with critical fail-safe requirements.
We propose a push-based approach to network monitoring that allows the detection, within the dataplane, of traffic aggregates. Notifications from the switch to the controller are sent only if required, avoiding the transmission and processing of unnecessary data. Furthermore, the dataplane iteratively refines the responsible IP prefixes, allowing the controller to receive information at a flexible granularity. We implemented our solution, Elastic Trie, in P4 for two different FPGA devices, and evaluated it with packet traces from an ISP backbone. Our approach can spot changes in traffic patterns and detect, with 95% accuracy, hierarchical heavy hitters using less than 8KB of memory or superspreaders using less than 300KB. Additionally, it reduces controller-dataplane communication overhead by up to two orders of magnitude with respect to state-of-the-art solutions.
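To make the iterative refinement idea concrete, the following Python sketch models how a prefix node whose counter crosses a threshold is split into two more-specific children, so that reports converge on the responsible prefixes. It is an illustration only, not the P4/FPGA dataplane implementation; the node structure, interval handling, and threshold are assumptions.

```python
# Minimal sketch of push-based, iteratively refined prefix monitoring.
THRESHOLD = 1000  # packets per measurement interval (illustrative)

class PrefixNode:
    def __init__(self, prefix_bits=""):
        self.prefix_bits = prefix_bits   # e.g. "1010" = 4-bit prefix
        self.count = 0
        self.children = None             # None until the node is refined

def ip_to_bits(ip):
    return "".join(f"{int(o):08b}" for o in ip.split("."))

def update(root, src_ip):
    """Walk to the most refined node covering src_ip and count the packet."""
    bits, node = ip_to_bits(src_ip), root
    while node.children is not None:
        node = node.children[int(bits[len(node.prefix_bits)])]
    node.count += 1

def end_of_interval(node, reports):
    """Refine heavy prefixes; report those that cannot be refined further."""
    if node.children is not None:
        for child in node.children:
            end_of_interval(child, reports)
    elif node.count >= THRESHOLD:
        if len(node.prefix_bits) < 32:
            node.children = (PrefixNode(node.prefix_bits + "0"),
                             PrefixNode(node.prefix_bits + "1"))
        else:
            reports.append(node.prefix_bits)  # push notification to controller
    node.count = 0
```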
There is a semantic gap between the high-level intents of network operators and the low-level configurations that achieve those intents. Previous works tried to bridge the gap using verification or synthesis techniques, both of which require formal specifications of the intended behavior that are rarely available, or even known, in the real world. This paper discusses an alternative approach for bridging the gap, namely inferring the high-level intents from the low-level network behavior. Specifically, we provide Anime, a framework and a tool that, given a set of observed forwarding behaviors, automatically infers a set of possible intents that best describe all observations. Our results show that Anime can infer high-quality intents from low-level forwarding behavior with acceptable performance.
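As a toy illustration of the inference problem (not Anime's actual model or algorithms), the sketch below summarizes observed per-destination forwarding decisions into the most specific destination prefix per next-hop; the observation format and output are assumptions chosen only to convey the flavor of intent inference.

```python
# Toy intent inference: generalize (destination, next-hop) observations
# into the tightest covering destination prefix per next-hop.
import ipaddress
from collections import defaultdict

def infer_prefix_intents(observations):
    """observations: iterable of (dst_ip, next_hop) string pairs."""
    by_nexthop = defaultdict(list)
    for dst, nh in observations:
        by_nexthop[nh].append(ipaddress.ip_address(dst))
    intents = {}
    for nh, addrs in by_nexthop.items():
        # Start from a /32 and widen until every observation is covered.
        net = ipaddress.ip_network(f"{addrs[0]}/32")
        while not all(a in net for a in addrs):
            net = net.supernet()
        intents[nh] = net
    return intents

obs = [("10.1.0.5", "r2"), ("10.1.3.9", "r2"), ("192.168.1.1", "r3")]
print(infer_prefix_intents(obs))   # {'r2': 10.1.0.0/22, 'r3': 192.168.1.1/32}
```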
Network functions (NFs) are critical components of the network data plane, and their efficiency is important to the whole network's end-to-end performance. We identify three types of runtime redundant logic in NFs once they are deployed with concrete configured rules. We propose to use compiler techniques (e.g., program slicing, constant propagation, dead code elimination, symbolic execution) to optimize away this redundancy. We implement a prototype named NFReducer using LLVM. Our evaluation on two IDSes shows that eliminating the redundant logic significantly improves their packet processing rate.
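The following toy example (not NFReducer itself, which operates on LLVM IR) illustrates how concrete configured rules expose redundant logic: once the deployed configuration fixes a flag, constant propagation and dead code elimination can specialize the NF and drop the unreachable branch. The NF, its configuration keys, and the rules are invented for illustration.

```python
# Before/after view of configuration-driven specialization of a toy IDS-like NF.
from collections import namedtuple

Packet = namedtuple("Packet", ["dst_port", "payload"])

def deep_scan(payload):
    """Stand-in for an expensive payload-inspection routine."""
    return b"attack" in payload

def nf_generic(pkt, cfg):
    if cfg["inspect_payload"]:               # configuration-dependent branch
        if deep_scan(pkt.payload):
            return "DROP"
    if pkt.dst_port in cfg["blocked_ports"]:
        return "DROP"
    return "FORWARD"

# With the deployed configuration {"inspect_payload": False, ...}, constant
# propagation makes the first branch statically false, and dead code
# elimination removes the unreachable scan logic entirely:

def nf_specialized(pkt, blocked_ports=frozenset({23, 445})):  # illustrative rules
    if pkt.dst_port in blocked_ports:
        return "DROP"
    return "FORWARD"
```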
In distributed storage systems, erasure coding (EC) is a crucial technology for achieving high fault tolerance with lower storage overhead than data replication. EC can reconstruct missing data by downloading parity data from surviving machines. However, the EC download streams multiplex the available network I/O at the receiving end, leading to substantially lower data reconstruction speed. In this paper, we present NetEC, a novel in-network acceleration system that fully offloads EC to programmable switching ASICs. NetEC avoids this multiplexing by aggregating the download streams on the switch, thus significantly improving reconstruction speed. NetEC addresses three key challenges: computation offloading of complex EC operations, rate synchronization of multiple download streams, and deep payload inspection/assembly. We implement NetEC on hardware programmable switches. Our evaluation shows that, compared to HDFS-EC, NetEC improves the reconstruction rate by 2.7x-9.0x and eliminates CPU overheads, with low switch memory usage.
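The sketch below conveys the stream-aggregation idea in much simplified form, using single-parity (XOR) coding instead of Reed-Solomon and plain Python instead of a switching ASIC: the switch folds the survivors' chunks together as they pass through, so the requester receives one reconstructed stream rather than several competing ones.

```python
# Simplified on-path aggregation of EC download streams (single-parity XOR).
def switch_aggregate(streams):
    """streams: list of k iterators yielding equal-sized byte chunks from
    the surviving machines.  Yields reconstructed chunks of the lost block."""
    for chunks in zip(*streams):
        out = bytearray(len(chunks[0]))
        for chunk in chunks:                 # fold each survivor's chunk in
            for i, b in enumerate(chunk):
                out[i] ^= b
        yield bytes(out)

# Example: data blocks d1, d2 and parity p = d1 XOR d2; if d1 is lost, it is
# rebuilt on the fly from the d2 and p streams.
d2 = [b"\x01\x02", b"\x03\x04"]
p  = [b"\x11\x22", b"\x33\x44"]
reconstructed_d1 = b"".join(switch_aggregate([iter(d2), iter(p)]))
print(reconstructed_d1.hex())   # 10203040
```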
This paper challenges the common assumption that SDN networks should be run only at the lowest layers of the stack, i.e., L2 and L3. Using data center networks that provide virtualized services as a use case, we show how state-of-the-art solutions already employ some application-level processing via a central controller. With this in mind, we question whether the lessons learned from a decade of SDN networking can also be extended to the upper layers. We make the case for a full-stack SDN framework that encompasses all protocol layers in the network stack, and call for further research in the area.
Unexplained performance degradation is one of the most severe problems in data center networks. The increasing scale of the network makes it even harder to maintain good performance for all users with a low-cost solution. Our system, SpiderMon, monitors network performance and debugs performance failures inside the network with little overhead. SpiderMon provides a two-phase solution that runs in the data plane. In the monitoring phase, it keeps track of the performance of every flow in the network; upon detecting performance problems, it triggers a debugging phase that uses a causality analyzer to find the root cause of the degradation. To implement these two phases, SpiderMon exploits the capabilities of high-speed programmable switches (e.g., per-packet monitoring, stateful memory). We prototype SpiderMon on the BMv2 P4 software switch, and our preliminary evaluation shows that SpiderMon is able to quickly find the root cause of performance degradation problems with minimal overhead: it achieves nearly zero overhead during the monitoring phase and efficiently collects relevant data from switches during the debugging phase.
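A rough Python model of the two-phase workflow is shown below. The actual system implements this in the switch data plane; the record format, latency objective, and window size here are illustrative assumptions.

```python
# Two-phase model: cheap always-on per-flow monitoring, plus a debugging
# phase triggered only when a flow's performance degrades.
LATENCY_SLO_US = 500   # illustrative per-hop latency objective

monitor = {}           # flow_id -> latest per-hop latency (monitoring phase)
telemetry_log = []     # (timestamp, switch, queue, flow_id, queue_depth)

def on_packet(ts, switch, queue, flow_id, hop_latency_us, queue_depth):
    monitor[flow_id] = hop_latency_us
    telemetry_log.append((ts, switch, queue, flow_id, queue_depth))
    if hop_latency_us > LATENCY_SLO_US:        # trigger the debugging phase
        return debug(ts, switch, queue, flow_id)
    return None

def debug(ts, switch, queue, victim, window_us=1000):
    """Collect flows that shared the congested queue around the violation;
    these are the candidates fed to the causality analysis."""
    suspects = {f for (t, s, q, f, depth) in telemetry_log
                if s == switch and q == queue and f != victim
                and abs(t - ts) <= window_us and depth > 0}
    return {"victim": victim, "switch": switch, "queue": queue,
            "suspects": sorted(suspects)}
```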
Recent verification works have found numerous bugs in P4 programs. While it is obvious that bugs are undesirable, it is currently not known what effects these bugs have in practice. In this paper we take a first look at the exploitation potential of such bugs: we first examine how three different targets behave when unspecified behaviours are triggered, finding a range of potentially exploitable behaviours; we then use these to attack two concrete programs. We find that the security impact of such exploits can be high, but that the severity of the attack depends on the target.
In emerging SD-WAN environments, network packets are heavily encapsulated to support various policy-driven network slices and secure connectivity. While such heavy encapsulation increases the chance of IP fragmentation in SD-WAN traffic, existing mechanisms to avoid IP fragmentation, such as path MTU discovery, TSO/GSO, and pre-fragmentation, are often not sufficient. This makes the IP reassembly network function (NF) an important ingredient of the SD-WAN architecture, and maximizing its processing efficiency is one of the critical design criteria for resource-constrained SD-WAN gateways. In this paper, we show that the state-of-the-art design of the IP reassembly NF, which is based on a per-packet run-to-completion model, does not utilize the underlying hardware caches effectively, and we therefore re-architect the IP reassembly NF to make it more cache-friendly. The new design leverages well-known software engineering techniques such as loop fission, loop unrolling, and prefetch-friendly data structures. Using a prototype evaluation, we show that the redesigned IP reassembly NF can support 25-35% higher packet processing throughput than the state-of-the-art design. The broader lesson learned is that the predominantly adopted per-packet run-to-completion model may not be desirable for complex NFs or NF chains due to instruction cache pressure.
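The structural change behind loop fission can be sketched as follows: instead of running every reassembly stage to completion per packet, each stage is applied across a whole batch, which keeps the per-stage instruction and data footprint hot. The stage names and stubs below are placeholders; the actual NF is far more involved and adds explicit prefetching inside each fissioned loop.

```python
# Run-to-completion vs. loop-fission structure for a batched reassembly NF.

# Minimal stubs so the sketch runs; real stages are much more complex.
def parse_fragment(pkt):    return pkt
def lookup_flow(frag):      return {"frags": []}   # stand-in hash-table lookup
def insert_fragment(c, f):  c["frags"].append(f)
def maybe_emit(c):          return None

def reassemble_run_to_completion(packets):
    for pkt in packets:                              # all stages per packet
        frag = parse_fragment(pkt)                   # stage 1
        ctx = lookup_flow(frag)                      # stage 2
        insert_fragment(ctx, frag)                   # stage 3
        maybe_emit(ctx)                              # stage 4

def reassemble_loop_fission(packets):
    frags = [parse_fragment(p) for p in packets]     # stage 1 over the batch
    ctxs = [lookup_flow(f) for f in frags]           # stage 2 over the batch
    for ctx, frag in zip(ctxs, frags):               # stages 3-4 over the batch
        insert_fragment(ctx, frag)
        maybe_emit(ctx)
```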
As modern switches become increasingly powerful, flexible, and programmable, network operators have an ever greater need to monitor their behavior. Many existing systems provide the ability to observe and analyze traffic that arrives at switches, but do not provide visibility into the experience of packets within the switch. To fill this gap, we present PacketScope, a network telemetry system that lets us peek inside network switches to ask a suite of useful queries about how switches modify, drop, delay, and forward packets. PacketScope gives network operators an intuitive and powerful Spark-like dataflow language to express these queries. To minimize PacketScope's overhead on switch metadata, our compiler uses a "tag little, compute early" strategy that tags packets with metadata as they move through the switch pipeline and computes query results as early as possible to free up pipeline resources for later processing. PacketScope also combines information from the ingress and egress pipelines to answer aggregate queries about packets dropped due to a full queue.
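The kind of query such a dataflow language targets can be sketched in plain Python over hypothetical per-packet records (this is not PacketScope's language, schema, or compiler output): filter the packets a switch dropped because an egress queue was full, then aggregate the count per queue.

```python
# Illustrative in-switch telemetry query: drops per egress queue.
from collections import Counter

def drops_per_queue(events):
    """events: iterable of dicts describing one packet's experience in the switch."""
    return Counter(e["egress_queue"]
                   for e in events
                   if e["dropped"] and e["drop_reason"] == "queue_full")

events = [
    {"egress_queue": 3, "dropped": True,  "drop_reason": "queue_full"},
    {"egress_queue": 3, "dropped": True,  "drop_reason": "queue_full"},
    {"egress_queue": 1, "dropped": False, "drop_reason": None},
]
print(drops_per_queue(events))   # Counter({3: 2})
```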
Recent architectures of the mobile packet core advocate the separation of the control and dataplane components, with all signaling messages being processed by the control plane entities. This paper presents the design, implementation, and evaluation of TurboEPC, a redesign of the mobile packet core that revisits the division of work between the control and data planes. In TurboEPC, the control plane offloads a small amount of user state to programmable dataplane switches, with which the switches can correctly process a subset of signaling messages within the dataplane itself. The messages offloaded to the dataplane in TurboEPC constitute a significant fraction of the total signaling traffic in the packet core, and handling them on dataplane switches closer to the end-user improves both control plane processing throughput and latency. We implemented the TurboEPC design using P4-based software and hardware switches. Compared to the traditional EPC, the TurboEPC hardware prototype improves throughput and latency by up to 102x and 98%, respectively, when the switch hardware stores the state of 65K concurrent users, and by 22x and 97%, respectively, when the switch CPU is busy forwarding dataplane traffic at line rate.
The rapid adoption of Reconfigurable Optical Add-Drop Multiplexers (ROADMs) is setting the stage for dynamic reconfiguration of the network topology in optical backbones. The conventional approach to enabling programmability in the physical layer requires solving a cross-layer optimization formulation that captures the interplay between the IP and optical layers. However, as the network scales, the complexity and run time of cross-layer optimization formulations grow prohibitively, resulting in heuristic-based solutions that sacrifice optimality for scalability. We propose a flow-based graph abstraction, called OptFlow, that is able to find the optimal allocation faster than a cross-layer optimization formulation. The key idea in OptFlow is to abstract topology programmability as "network flows," enabling service providers to use multi-commodity flow formulations, such as conventional Traffic Engineering, to solve a cross-layer optimization. OptFlow augments the physical graph and uses it as input to an unmodified flow-based Traffic Engineering algorithm, capturing a variety of IP-layer optimization goals such as maximum throughput, minimum hop count, and max-min fairness. Due to its flow-based nature, OptFlow inherently provides an abstraction for consistent network updates. To validate our key assumptions in OptFlow, we build a small testbed prototype consisting of four ROADMs. To evaluate optimality and run time on large networks, we simulate five WAN topologies with up to 100 nodes and 390 links. Our results show that OptFlow matches the throughput performance of an optimal cross-layer formulation but has faster computation times. The run-time speed-up of OptFlow increases as the network scales, with up to 8x faster execution times in our simulations.
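A much-simplified sketch of the augmentation idea follows, using networkx and an invented three-node topology rather than OptFlow's actual formulation: candidate optical circuits appear as ordinary capacitated edges, so an unmodified flow computation (here plain max-flow) decides jointly how to route traffic and which circuits to light.

```python
# Augment the physical graph with candidate circuits and run an unmodified
# flow algorithm over it.  Topology, capacities, and labels are illustrative.
import networkx as nx

G = nx.DiGraph()
# Existing IP-layer links (already-lit circuits).
G.add_edge("A", "B", capacity=100, kind="ip")
G.add_edge("B", "C", capacity=40,  kind="ip")
# A candidate circuit the optical layer could provide between A and C.
G.add_edge("A", "C", capacity=100, kind="optical-candidate")

flow_value, flow = nx.maximum_flow(G, "A", "C")
to_provision = [(u, v) for u, nbrs in flow.items() for v, f in nbrs.items()
                if f > 0 and G[u][v]["kind"] == "optical-candidate"]
print(flow_value, to_provision)   # 140 [('A', 'C')]
```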
Recent work introduced load-balancing algorithms that dynamically pick the best path entirely in the data plane, to react to traffic dynamics on a small timescale. This paper takes the next step: balancing load dynamically across multiple paths in the data plane. The design of such a load-balancing primitive raises interesting challenges due to the hardware constraints of the data plane. We show that these constraints create practical problems for Weighted-Cost MultiPath (WCMP), which replicates hash-table entries in proportion to the weight of each path. Under these hardware constraints, naïve implementations of WCMP take a long time to converge to new weights. We then present a hash-based data structure that achieves adaptive traffic splitting in programmable data planes. Our data structure carefully partitions the arithmetic operations required to (a) split traffic in proportion to the path weights and (b) update the path weights, by leveraging a multi-stage pipeline and stateful ALUs. By doing so, it performs accurate splitting and efficient updates at line rate. We implement our data structure in P4, and our preliminary evaluation shows a significant reduction in flow completion time compared to other data-plane load-balancing schemes such as HULA.
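For context, the WCMP primitive the paper starts from can be sketched as follows (an illustration with assumed parameters, not the paper's new data structure): entries are replicated in proportion to path weights and a flow hash indexes the table, which is exactly why a weight change forces a large fraction of entries to be rewritten and convergence is slow on hardware.

```python
# Baseline WCMP: weighted splitting via replicated hash-table entries.
import zlib

def build_wcmp_table(weights, table_size=64):
    """weights: {path_id: weight}.  Returns a table of path_ids whose entry
    counts approximate the weights (rounding refinements omitted)."""
    total = sum(weights.values())
    table = []
    for path, w in weights.items():
        table += [path] * max(1, round(table_size * w / total))
    return table[:table_size]

def pick_path(table, five_tuple):
    h = zlib.crc32(repr(five_tuple).encode())   # stand-in for the switch hash
    return table[h % len(table)]

table = build_wcmp_table({"path0": 3, "path1": 1})
print(pick_path(table, ("10.0.0.1", "10.0.0.2", 6, 1234, 80)))
```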
Network applications often define policies to manage network traffic based on its attributes (e.g., a service chain, valid next-hops, permission flags). These policies match against packets' attributes in switches before being applied. However, prior approaches to identifying attributes all incur a high memory cost in the data plane. This paper presents MEME, a scheme that clusters the attributes in packets to reduce memory usage. MEME also leverages match-action tables and reconfigurable parsers on modern hardware switches to achieve 87.7% lower memory usage, and applies a graph algorithm to achieve 1-2 orders of magnitude faster compilation time than the prior state of the art [12]. These performance gains pave the way for deployment of a real system desired by the world's largest Internet Exchange Points.
While programmable switches provide operators with much-needed control over the network, they also increase the potential sources of packet-processing errors. Bugs can happen anywhere: in the P4 program, in the controller installing rules into tables, or in the compiler that maps the P4 program onto the resource-constrained switch pipelines. Most of these bugs manifest themselves only after certain sequences of packets with certain combinations of rules in the tables. Tracking each packet's execution path through the P4 program, i.e., the sequence of tables hit and the actions applied, directly in the data plane is useful for localizing such bugs as they occur in real time. The fact that programmable data planes require P4 programs to be loop-free and can perform simple integer arithmetic makes them amenable to Ball-Larus encoding, a well-known technique for profiling execution paths in software programs that can efficiently encode all N paths in a single ⌈log(N)⌉-bit variable. However, for real-world P4 programs, the path variable can get quite large, making it inefficient for integer arithmetic at line rate. Moreover, the encoding could require a subset of tables that would otherwise have no data dependency to update the same variable. By carefully breaking up the P4 program into disjoint partitions and tracking each partition's execution path separately, we show how to minimally augment P4 programs to track the execution path of each packet.
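A small worked example of Ball-Larus numbering on an illustrative loop-free table graph (not a real P4 program) shows how per-edge increments encode every root-to-exit path as a unique integer in [0, N), which then fits in a ⌈log(N)⌉-bit variable.

```python
# Ball-Larus path numbering on a small, illustrative control-flow DAG.
edges = {                      # node -> ordered successors
    "start": ["tblA", "tblB"],
    "tblA":  ["tblC"],
    "tblB":  ["tblC", "drop"],
    "tblC":  ["exit"],
    "drop":  ["exit"],
    "exit":  [],
}

num_paths, increment = {}, {}

def count(node):
    """num_paths[node] = paths from node to exit; assigns per-edge increments."""
    if node in num_paths:
        return num_paths[node]
    if not edges[node]:
        num_paths[node] = 1
        return 1
    total = 0
    for succ in edges[node]:
        increment[(node, succ)] = total    # value added when this edge is taken
        total += count(succ)
    num_paths[node] = total
    return total

count("start")   # 3 distinct paths through this graph

def path_id(path):
    return sum(increment[(path[i], path[i + 1])] for i in range(len(path) - 1))

print(path_id(["start", "tblA", "tblC", "exit"]))   # 0
print(path_id(["start", "tblB", "tblC", "exit"]))   # 1
print(path_id(["start", "tblB", "drop", "exit"]))   # 2
```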
OpenFlow feature support differs between devices due to device-specific hardware constraints. OpenFlow places the burden of addressing these differences on the controller, which increases development cost and restricts device interoperability. This paper investigates reducing this burden by algorithmically transforming an existing ruleset to fit an incompatible fixed-function pipeline, thereby improving device interoperability. Existing rule-fitting schemes in the literature require metadata to link rules between tables into a path through the pipeline, but not all pipelines support metadata. We developed a novel approach that does not rely on any particular pipeline feature, such as metadata, and that considers the pipeline's constraints, including both the matches and the actions available. This paper presents our implementation, including ruleset preprocessing techniques, methods of transforming rules, and how we use a partially constrained Boolean satisfiability problem to select from these transformations and build the final ruleset. While future work remains towards real-world deployment, our approach demonstrates that fitting rulesets to fixed-function pipelines without metadata is feasible, and that our techniques to reduce the size of the problem are beneficial.
Bloom filters are used in many networking applications to answer set membership queries at low cost, but they suffer from false positives. We study Bloom filter constructions that, when representing a set of size up to d taken from a finite universe of size n, completely avoid false positives. We suggest memory-efficient Bloom filter constructions with a false-positive-free zone that allow larger sets to be represented, with memory that grows linearly in the set size. Our first construction relies on Orthogonal Latin Square (OLS) codes, and the second on representing elements through values of polynomials defined modulo primes. Beyond Bloom filters supporting set membership, we also consider sketches that allow more general functionality, such as flow size estimation. In particular, we show the applicability of the false-positive-free zone to accurate size estimation in the well-known Count-Min sketch. We compare the new constructions to existing approaches through analytical and experimental evaluations, demonstrating their superiority.
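For reference, the baseline Count-Min sketch to which the false-positive-free zone is applied can be sketched as below. This is the standard construction with illustrative parameters; unlike the paper's constructions, it can still overestimate flow sizes due to hash collisions.

```python
# Standard Count-Min sketch for flow-size estimation.
import hashlib

class CountMin:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Per-row hash: salt the hash with the row number.
        h = hashlib.blake2b(key.encode(), salt=bytes([row])).digest()
        return int.from_bytes(h[:8], "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Minimum over rows bounds the overestimation from collisions.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

cm = CountMin()
cm.add("10.0.0.1->10.0.0.2", 3)
print(cm.estimate("10.0.0.1->10.0.0.2"))   # >= 3, typically exactly 3
```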