Network failures and performance issues are inevitable in large-scale data center networks, and they cause major outages that take services down. When troubleshooting, operators need a telemetry system to understand what is going on in the network. However, building a telemetry system that answers a diverse set of queries is expensive and technically challenging, mainly because of the overhead incurred in collecting and processing a huge amount of data. To address this problem, we developed tools that look across network entities (e.g., hosts, network core) to find the right division of labor, so that operators obtain fine-grained information on short timescales.

Self-managing networks aim to minimize human involvement, but the underlying software and hardware ecosystem is becoming increasingly complex. In such large and complex systems, runtime bugs and failures are inevitable. The aim of this work is to (1) detect bugs that occur at runtime and (2) provide the information essential for fixing them. The key challenge is to design a practically deployable bug detection system that consumes minimal system resources and detects bugs in near real time. My group addresses this challenge with novel network primitives that follow software/hardware co-design principles.

Network control is largely derived from measurement, but today measurement remains decoupled from control, with a human in the middle of the control loop introducing uncertainty and the possibility of errors. In contrast, operators could use fine-grained measurements to automate network control at scale. Based on this observation, my goal is to get humans out of the way by deriving and tightly integrating measurement, inference, and control operations. Such derived operations monitor metrics relevant to a high-level policy goal (e.g., traffic engineering, routing), perform the necessary analysis, and execute the resulting decisions. In this work, I close the control loop by synthesizing distributed low-level operations that realize a high-level policy. The push is not only toward making things work correctly through debugging, but also toward automating the division of labor among network entities.
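
As a rough illustration of the measurement, inference, and control structure described above, the sketch below runs one step of a toy closed loop for a traffic-engineering goal. It is not the system developed in this work: the `Policy` class, the function names, the utilization cap, and the "shift traffic to the least-loaded link" heuristic are all hypothetical choices made only to show the three stages chained together without a human in between.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Policy:
    """Hypothetical high-level goal: keep every link below a utilization cap."""
    max_utilization: float


def measure(load_bps: Dict[str, float],
            capacity_bps: Dict[str, float]) -> Dict[str, float]:
    """Measurement: turn raw per-link load counters into utilization."""
    return {link: load_bps[link] / capacity_bps[link] for link in load_bps}


def infer(utilization: Dict[str, float], policy: Policy) -> List[str]:
    """Inference: list the links that violate the policy, worst first."""
    hot = [link for link, u in utilization.items()
           if u > policy.max_utilization]
    return sorted(hot, key=lambda link: utilization[link], reverse=True)


def control(hot_links: List[str],
            utilization: Dict[str, float]) -> List[Tuple[str, str]]:
    """Control: emit (from_link, to_link) actions that shift traffic
    from each violating link to the least-loaded link (toy heuristic)."""
    if not hot_links:
        return []
    coolest = min(utilization, key=utilization.get)
    return [(link, coolest) for link in hot_links if link != coolest]


def control_loop_step(load_bps: Dict[str, float],
                      capacity_bps: Dict[str, float],
                      policy: Policy) -> List[Tuple[str, str]]:
    """One iteration of the loop: measure, infer, then act."""
    utilization = measure(load_bps, capacity_bps)
    return control(infer(utilization, policy), utilization)


if __name__ == "__main__":
    # Assumed example values: three 10 Gb/s links and an 80% cap.
    load = {"a": 9e9, "b": 2e9, "c": 8.5e9}
    capacity = {"a": 10e9, "b": 10e9, "c": 10e9}
    print(control_loop_step(load, capacity, Policy(max_utilization=0.8)))
```

In a real deployment each stage would presumably be distributed across hosts and switches rather than run in one process; the single-function loop here only makes the data flow between the stages explicit.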