In a previous blog post, I wrote about the analogy of medical telemetry to network telemetry and the requirement for telemetry solutions to offer metrics at a high granularity (specific) and at short intervals (frequent). This post will explore which network metrics are required for your network telemetry solution to tell you how customers experience your network.
Telemetry as a concept is useful at all layers and levels of the network.
Application-layer telemetry, which is becoming the norm in the microservice world with systems like Prometheus, can provide detailed insight into the behavior of live systems and how they interact with other systems. For example, one of our services tracks the latency to Google Datastore for every query, and there are a LOT of queries.
Looking a layer lower, one could imagine a telemetry solution that provides insight into the behavior of network-specific applications such as OSPF or BGP. This could include tracking the time required for each routing table computation or events for each new route that is installed or simply the CPU and memory usage for these specific processes.
At the data plane, telemetry solutions could provide insight into the queue depth, interface drops, fabric drops, forwarding table changes etc. Many of these change too quickly to observe with polling at longer intervals.
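To make the polling point concrete, here is a minimal sketch in Python. The queue depth values, the five-second burst and the 30-second polling interval are all made up for illustration; the `poll` helper is hypothetical and simply picks out the instantaneous values a poller would see.

```python
def poll(samples, interval):
    """Return the instantaneous values a poller would see.

    `samples` holds one reading per second; `interval` is the polling
    period in seconds. This is a hypothetical helper for illustration.
    """
    return samples[::interval]


# Hypothetical queue depth: one sample per second for 60 seconds,
# mostly idle with a short burst between t=10s and t=14s.
queue_depth = [2] * 60
for t in range(10, 15):
    queue_depth[t] = 950

fine = poll(queue_depth, 1)     # 1-second telemetry
coarse = poll(queue_depth, 30)  # 30-second polling

print("peak seen at 1s intervals: ", max(fine))
print("peak seen at 30s intervals:", max(coarse))
```

With this data the 30-second poller only samples at t=0 and t=30, so the burst never shows up in what it reports.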
With perhaps the exception of the application level, all of the above can be considered types of network telemetry. But notice that they provide insight into particular points in the network – aspects of individual nodes or devices. Insight at these levels is very important for optimization and debugging but provides limited insight into what actually matters – the network’s performance from the user’s point of view.
The purpose of the network is to deliver data to its users. Detailed insight at any particular node in the network does not provide the end-to-end picture of how your customers experience the network. In order to understand and optimize the network as experienced by your customers, I believe there are three network metrics that matter most:
These conceptually simple metrics can largely characterize the network’s performance. Of course, like any other type of telemetry, the utility of the metrics is directly related to the granularity and frequency at which they are measured. A network telemetry solution that doesn’t measure by individual users cannot provide you insight into how individual users experience the network.
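As a rough illustration of why per-user granularity matters, the Python sketch below compares a network-wide average with per-user averages. The user names and latency values are invented, and `mean_by_user` is a hypothetical helper, not part of any particular telemetry product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (user, latency in ms) samples. One user is having a bad time.
samples = [
    ("alice", 12.0), ("alice", 14.0),
    ("bob", 11.0), ("bob", 13.0),
    ("carol", 180.0), ("carol", 220.0),
    ("dave", 10.0), ("dave", 12.0),
]


def mean_by_user(samples):
    """Group latency samples by user and return each user's mean latency."""
    grouped = defaultdict(list)
    for user, latency_ms in samples:
        grouped[user].append(latency_ms)
    return {user: mean(values) for user, values in grouped.items()}


overall = mean(latency_ms for _, latency_ms in samples)
per_user = mean_by_user(samples)

print(f"network-wide mean: {overall:.1f} ms")
for user, latency_ms in sorted(per_user.items()):
    print(f"{user}: {latency_ms:.1f} ms")
```

The network-wide number describes nobody in particular: it is far worse than what three of the users see and far better than what the fourth sees. Only the per-user view shows who is actually affected.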
Additionally, it is extremely important to understand the statistical distribution within each of these network metrics to truly understand network performance. Jitter, which is latency variation, is a familiar concept to many networking people. What is perhaps less well known is the requirement to understand the variation found in these other network metrics. In a future blog post I will talk more about this concept, but for now, consider the chart below.
Chart: ‘Series and 4 Time Period Average’ by Dan Siemon.
The chart above shows three different views of the data. The blue line is the actual metric samples taken from the network. Look at the red line: it shows the average over four intervals. This type of averaging is very common in network monitoring solutions and is implicit in long sample periods (e.g. 5m, 1m). Notice the drastic difference between the blue line and the red line at time 3. By only looking at the average, it’s easy to think the network has a much lower utilization than is actually the case. As I will discuss in a future post, plotting the distribution or additional percentiles goes a long way to show you what’s really happening in your network.
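The same effect is easy to reproduce with a few lines of Python. The utilization values below are invented for illustration and are not the data behind the chart; `window_average` and the nearest-rank `percentile` function are hypothetical helpers, and nearest-rank is just one of several common percentile definitions.

```python
import math
from statistics import mean

# Hypothetical link utilization samples, in percent, one per interval.
utilization = [20, 25, 95, 20, 30, 25, 20, 90]


def window_average(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


averaged = window_average(utilization, 4)

print("raw peak:        ", max(utilization))
print("4-period average:", averaged)
print("median (p50):    ", percentile(utilization, 50))
print("p95:             ", percentile(utilization, 95))
```

In this made-up series the four-period averages sit around 40 percent even though individual samples reach 90 percent and above. Reporting a high percentile next to the median keeps those peaks visible.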
“Network Telemetry is a powerful concept that can make understanding, debugging and optimizing your network possible in ways that didn’t exist before.”
However, it is important to keep in mind that the network’s goal is to deliver data to its users. Measuring network metrics related to individual network elements can contribute to debugging and optimization but offers little insight into the performance of the network as experienced by its users.