Traditional network monitoring tools are not sufficient to ensure your network delivers a good user experience because they measure what is, at best, a proxy for the customer's real network experience. This post explains the difference between traditional network monitoring and subscriber QoE-based monitoring.
The purpose of any computer network is to provide access to services at a quality level that is acceptable to the user of the network. Poor quality of experience (QoE) causes customer churn and drives up support costs.
Users of the network don’t care how many routers, switches, radios or cables it takes for them to get to the services they care about. Yet, almost all network monitoring tools focus on the network infrastructure itself, not the subscriber’s traffic. Before we look at the alternative, let’s look at how traditional network monitoring works.
Traditional Network Monitoring
Most network management systems (NMSs) make use of a few simple tools to get information about the network. The simplest and most common of these are ICMP ping and SNMP.
ICMP pings are used to obtain simple, coarse measurements of reachability and latency. The NMS sends a network node an ICMP echo request message and measures how long it takes to receive an ICMP echo reply. If no reply is received, the target is either down or unreachable; if only some replies are lost, the packet loss rate can be calculated.
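To make the ping-based measurement concrete, here is a minimal sketch of how a series of echo results could be reduced to the reachability, latency, and loss figures described above. The function name `summarize_pings` and the convention of recording a missing reply as `None` are assumptions for illustration, not how any particular NMS stores its results; sending the probes themselves is not shown.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize a series of ICMP echo results.

    Each entry is the round-trip time in milliseconds for one echo request,
    or None if no echo reply arrived before the timeout (hypothetical convention).
    """
    if not rtts_ms:
        raise ValueError("need at least one echo result")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    if not replies:
        # No replies at all: the target is down or unreachable.
        return {"reachable": False, "loss_pct": loss_pct}
    return {
        "reachable": True,
        "loss_pct": loss_pct,
        "min_ms": min(replies),
        "avg_ms": mean(replies),
        "max_ms": max(replies),
    }


# Five echo requests to one device, one of which got no reply.
print(summarize_pings([4.8, 5.1, None, 5.3, 4.9]))
```

Note that every number here describes the path between the NMS and the device being pinged, which is exactly the limitation the rest of this post explores.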
SNMP is a management protocol supported by almost all network elements. It provides the ability to extract information such as the number of interfaces, counters on the interfaces and other system information such as CPU and memory usage.
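Interface throughput is usually derived from SNMP octet counters rather than read directly: the NMS polls a counter such as IF-MIB's `ifInOctets` (a 32-bit Counter32) or `ifHCInOctets` (a 64-bit Counter64) twice and divides the difference by the polling interval. The sketch below assumes the two counter values have already been fetched; the helper name `rate_bps` is hypothetical.

```python
def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             counter_bits: int = 32) -> float:
    """Average bit rate between two polls of an SNMP octet counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 2 ** counter_bits
    # Modular difference handles a single wrap of the counter between polls.
    # It cannot detect multiple wraps or a counter reset after a reboot.
    delta = (curr_octets - prev_octets) % modulus
    return delta * 8 / interval_s


# 37.5 MB transferred over a 5-minute poll interval.
print(rate_bps(0, 37_500_000, 300))
# A 32-bit counter that wrapped between two polls one second apart.
print(rate_bps(4_294_967_000, 704, 1))
```

Two caveats follow from the arithmetic. A Counter32 wraps after 2^32 octets, which takes only about 34 seconds at a sustained 1 Gb/s, so the 64-bit counters are needed on fast links. And the result is an average over the whole interval, so short bursts and dips are invisible.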
Figure 1 shows a very simple WISP network with an NMS system. The red lines show a logical view of the communication between the NMS node and the managed network elements.
Typically, the NMS sends one or more pings and/or performs SNMP queries for system information every few minutes. ICMP, SNMP, and other traditional network monitoring tools are an important and necessary part of your network management arsenal, but it's important to understand the limitations of this type of monitoring. Specifically, it tells you far less about the end-user experience than you might think.
Problems and Limitations
Fundamentally, traditional network monitoring approaches focus on the elements that make up the network, such as routers and switches. It is common to use metrics related to these elements, such as the latency to an element or its throughput, as a proxy for the per-subscriber experience. There are several reasons why this approach is a rough approximation at best and very misleading at worst.
Networks are often built with redundant paths. This can easily lead to situations where the metrics, such as latency, reported by the NMS system don’t measure the same path as customer traffic. Consider the network in the figure below.
In this network, there are two redundant paths to a customer, and the routing protocol has selected the top path for subscriber traffic while the NMS traffic follows the bottom path. If the top path becomes congested or starts dropping packets, NMS-based measurements provide no insight into it because, from the NMS's perspective, there is no problem.
Understanding subscriber QoE requires metrics on a per-subscriber basis. To see why, consider an AP with ten users and a total throughput of 25 Mb/s. From these two numbers, the most that can be concluded is that the average user gets 2.5 Mb/s. The reality could be that a single user is getting 20 Mb/s while the other nine users each get only about 0.55 Mb/s. That is, you have one very happy customer and nine unhappy ones.
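The arithmetic above can be checked directly. This small sketch contrasts the AP-level average with what per-subscriber numbers reveal; the function name and the 1 Mb/s "floor" used to count struggling subscribers are hypothetical choices made only for illustration.

```python
from statistics import mean


def summarize_throughput(rates_mbps, floor_mbps):
    """Contrast the AP-level average with the per-subscriber picture."""
    below = [r for r in rates_mbps if r < floor_mbps]
    return {
        "total_mbps": sum(rates_mbps),
        "average_mbps": mean(rates_mbps),
        "worst_mbps": min(rates_mbps),
        "subscribers_below_floor": len(below),
    }


# The example from the text: one user at 20 Mb/s, nine at about 0.55 Mb/s.
rates = [20.0] + [0.55] * 9
print(summarize_throughput(rates, floor_mbps=1.0))
```

The total (about 25 Mb/s) and the average (about 2.5 Mb/s) look unremarkable, while the per-subscriber view shows nine of ten users below the hypothetical floor.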
This simple example could be addressed with per-SM (subscriber module) counters in the AP, but that is not possible for more advanced QoE metrics such as latency and TCP retransmissions. In general, network-element-based management does not have the metric granularity required to say anything meaningful about per-subscriber QoE.
NMSs typically send ICMP echo requests or SNMP queries every one to five minutes. Between these monitoring intervals, no new information is obtained, so events or changes that are shorter than the monitoring interval can go unseen. For example, a 30-second outage that ruined a customer's game session can go completely undetected with 1-minute ICMP ping intervals. In general, the more frequently measurements are taken, the more accurate the estimate of the 'true' QoE.
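A quick simulation shows why a 30-second outage can slip between polls. This sketch assumes each poll is instantaneous and that polls run at 0, T, 2T, and so on; the function name and the outage window (70 s to 100 s) are hypothetical.

```python
def polls_during_outage(poll_interval_s: int, outage_start_s: int,
                        outage_end_s: int, horizon_s: int) -> list:
    """Return the poll times that fall inside an outage window.

    Assumes instantaneous polls at 0, T, 2T, ... up to horizon_s.
    An empty result means the outage went unnoticed by the poller.
    """
    polls = range(0, horizon_s + 1, poll_interval_s)
    return [t for t in polls if outage_start_s <= t < outage_end_s]


# A 30-second outage from t=70 s to t=100 s, observed over five minutes.
print(polls_during_outage(60, 70, 100, 300))  # → []
print(polls_during_outage(10, 70, 100, 300))  # → [70, 80, 90]
```

With 1-minute polling, no probe lands inside the outage, so the monitoring record shows nothing at all.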
It is important to note that the answer to this problem isn't as simple as increasing the frequency of ICMP echo requests or SNMP queries. In both cases, the more frequently these operations are performed, the more management traffic is added to the network. Also, SNMP queries are resource-intensive for many network elements, which can make it impossible to query them at short intervals.
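The growth in management traffic is linear in the polling rate, which a back-of-the-envelope calculation makes clear. This sketch assumes Linux `ping`'s default 56-byte payload over IPv4 (84 bytes per packet, counted once for the request and once for the reply), ignores link-layer framing, and uses a hypothetical network of 500 polled devices; SNMP traffic would come on top of this.

```python
# Bytes per ping exchange: 20-byte IPv4 header + 8-byte ICMP header
# + 56-byte payload = 84 bytes, for both the echo request and the echo reply.
PING_BYTES = 2 * (20 + 8 + 56)


def probe_overhead_bps(devices: int, interval_s: float,
                       bytes_per_probe: int = PING_BYTES) -> float:
    """Management traffic from pinging every device once per interval."""
    return devices * bytes_per_probe * 8 / interval_s


# Hypothetical 500-device network at several polling intervals.
for interval in (300, 60, 10, 1):
    print(f"{interval:>3} s interval: {probe_overhead_bps(500, interval):,.0f} b/s")
```

Under these assumptions, 5-minute polling costs about 2.2 kb/s while 1-second polling costs about 672 kb/s, a 300-fold increase for the same set of probes.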
Subscriber QoE is about the experience from the subscriber's point of view. Somewhat unfortunately for ISPs, the subscriber's point of view includes network elements in their home, such as their Wi-Fi router and the end device itself. For example, subscribers cannot distinguish poor QoE caused by AP-to-SM problems from poor Wi-Fi in their home. In both cases, they are likely to blame their ISP and are more likely to churn or drive up your support costs with complaints.
Interacting with in-home network elements is outside the domain of traditional NMS approaches and is technically impractical due to the presence of home NAT gateways.
Differentiated Network Treatment
Perhaps the most insidious way that network element based monitoring can lead to an incorrect view of subscriber QoE comes from traffic being treated differently by network elements on the path. The figure below provides a logical view of how this can occur.
Notice that in this example, the traffic from the NMS follows the same path through the network as subscriber traffic; however, due to default or configured queueing policies, it is treated differently. This can lead to situations where the NMS reports low latency while the actual subscriber packets experience high latency. Specifically, we have seen real-world examples where subscriber TCP traffic was directed to different queues than ICMP traffic. In this situation, the subscriber TCP traffic experienced over 50 ms of latency while the NMS was reporting latency to the AP of only 5-10 ms.
It is important to note that many types of network equipment have default policies that direct traffic through different queues based on things like the IP DSCP mark. As a result, this type of problem can occur in your network even if you haven’t done any special network configuration.
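To see how a DSCP-based default policy splits traffic, recall that the DSCP value is the upper six bits of the IPv4 ToS (or IPv6 Traffic Class) byte. The classifier below is a minimal sketch with a hypothetical queue map; real equipment defaults differ by vendor, and some devices also classify on other fields such as the protocol. The point is only that two packets on the same path can wait in different queues.

```python
def dscp_from_tos(tos_byte: int) -> int:
    """DSCP is the upper six bits of the IPv4 ToS / IPv6 Traffic Class byte."""
    return (tos_byte >> 2) & 0x3F


# Hypothetical default queue map; actual defaults vary by vendor and model.
QUEUE_BY_DSCP = {
    46: "priority",         # EF (Expedited Forwarding)
    48: "network-control",  # CS6
    56: "network-control",  # CS7
}


def queue_for(tos_byte: int) -> str:
    """Pick an egress queue from the packet's DSCP mark (toy classifier)."""
    return QUEUE_BY_DSCP.get(dscp_from_tos(tos_byte), "best-effort")


# Probes marked CS6 (ToS 0xC0) vs. unmarked subscriber TCP (ToS 0x00).
print(queue_for(0xC0))  # → network-control
print(queue_for(0x00))  # → best-effort
```

If the best-effort queue backs up while the network-control queue stays empty, probe latency stays low no matter how long subscriber packets wait.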
Subscriber QoE Monitoring
Subscriber QoE monitoring takes a fundamentally different approach from network-element-based monitoring. Instead of monitoring network elements and inferring the subscriber's QoE, this approach extracts QoE metrics directly from the subscriber's traffic. This avoids the pitfalls discussed earlier and provides a critical complement to existing network-element-based monitoring solutions.
To understand how this works, consider the figure above. There are three flows to three different devices in a subscriber's home. A subscriber QoE-based monitoring solution can analyze the traffic of all of these flows in real time to extract QoE metrics. Using this approach, important QoE metrics such as latency, loss, jitter, and throughput can be calculated on a per-tower, per-sector, and per-subscriber basis, giving you direct insight into the subscriber experience.
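One generic way to get latency from subscriber traffic without sending probes is passive TCP round-trip timing: note when a data segment passes the monitoring point and when the ACK covering it comes back. The sketch below illustrates that general technique under simplifying assumptions (no 32-bit sequence wraparound, no SACK or TCP timestamps, ACKs matched only when they end exactly at a segment boundary); it is not Preseem's implementation, and the class and method names are hypothetical. If the monitoring point sits upstream of the access network, each sample covers the path to the subscriber's device and back, home Wi-Fi included, plus any delayed-ACK time on the device.

```python
from typing import Dict, Hashable, List, Optional


class PassiveRttTracker:
    """Passive RTT estimation from TCP data segments and the ACKs that cover them.

    Sketch only: ignores sequence wraparound, SACK, and TCP timestamps.
    """

    def __init__(self) -> None:
        # flow -> {end_seq: [send_time, retransmitted]}
        self.pending: Dict[Hashable, Dict[int, List]] = {}
        self.samples: Dict[Hashable, List[float]] = {}

    def on_data(self, flow: Hashable, seq: int, length: int, ts: float) -> None:
        """Record a data segment seen heading toward the subscriber."""
        if length == 0:
            return  # pure ACKs carry no data to time
        end_seq = seq + length
        segs = self.pending.setdefault(flow, {})
        if end_seq in segs:
            segs[end_seq][1] = True  # seen before: treat as a retransmission
        else:
            segs[end_seq] = [ts, False]

    def on_ack(self, flow: Hashable, ack: int, ts: float) -> Optional[float]:
        """Process an ACK from the subscriber; return an RTT sample if one is valid."""
        segs = self.pending.get(flow, {})
        sample = None
        entry = segs.get(ack)
        if entry is not None and not entry[1]:
            # Karn's rule: only time segments that were sent exactly once.
            sample = ts - entry[0]
            self.samples.setdefault(flow, []).append(sample)
        # Cumulative ACK: every segment ending at or before `ack` is acknowledged.
        for end_seq in [e for e in segs if e <= ack]:
            del segs[end_seq]
        return sample


tracker = PassiveRttTracker()
flow = ("203.0.113.10", 443, "100.64.1.20", 51514)  # hypothetical server -> subscriber flow
tracker.on_data(flow, seq=1000, length=1448, ts=10.000)
tracker.on_data(flow, seq=2448, length=1448, ts=10.001)
rtt = tracker.on_ack(flow, ack=3896, ts=10.042)
print(f"RTT sample: {rtt * 1000:.1f} ms")  # about 41 ms
```

Because the samples come from the subscriber's own packets, they reflect whatever queue those packets actually waited in, which sidesteps the probe-versus-traffic mismatch described in the previous section.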
The advantages of the subscriber QoE-based monitoring approach include:
- No new traffic is added to the access network. QoE information is obtained by processing the subscriber’s traffic.
- QoE metrics cover the entire subscriber experience including home Wi-Fi quality.
- By measuring the subscriber’s traffic directly you can be sure that network QoS configuration is not causing misleading results.
- Metrics can be extracted with a high granularity (per-subscriber and per-IP address) and with a higher frequency than ICMP or SNMP allow.
- Ability to target network upgrades to areas of the network with poor QoE, and perhaps more importantly, delay upgrades to parts of the network that have a good QoE even though they have high load.
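Per-subscriber metrics become actionable once they are rolled up along the network hierarchy. The sketch below assumes a hypothetical inventory that maps each subscriber to a tower and sector, plus made-up RTT samples, and computes the median RTT at each level so that poorly performing sectors and subscribers stand out.

```python
from collections import defaultdict
from statistics import median

# Hypothetical inventory: subscriber -> (tower, sector).
TOPOLOGY = {
    "sub-1": ("tower-A", "A1"),
    "sub-2": ("tower-A", "A1"),
    "sub-3": ("tower-A", "A2"),
    "sub-4": ("tower-B", "B1"),
}


def rollup(rtt_samples_ms):
    """Median RTT per subscriber, per sector, and per tower."""
    levels = {
        "subscriber": defaultdict(list),
        "sector": defaultdict(list),
        "tower": defaultdict(list),
    }
    for sub, samples in rtt_samples_ms.items():
        tower, sector = TOPOLOGY[sub]
        levels["subscriber"][sub].extend(samples)
        levels["sector"][(tower, sector)].extend(samples)
        levels["tower"][tower].extend(samples)
    return {
        level: {key: median(vals) for key, vals in groups.items()}
        for level, groups in levels.items()
    }


# Made-up RTT samples in milliseconds.
samples = {
    "sub-1": [20, 22, 25],
    "sub-2": [80, 95, 90],
    "sub-3": [18, 19],
    "sub-4": [30, 35],
}
result = rollup(samples)
worst_sector = max(result["sector"], key=result["sector"].get)
print(result["tower"]["tower-A"], worst_sector, result["subscriber"]["sub-2"])
```

With these made-up numbers, tower-A's median of 23.5 ms looks healthy, yet sector A1 sits at 52.5 ms and sub-2 at 90 ms. The coarser the aggregation, the more a struggling subscriber is hidden, which is why the per-subscriber level matters when deciding where an upgrade will actually help.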
A network that delivers a good QoE has happier customers that are less likely to churn and less likely to call support. Traditional NMS tools are based on monitoring network elements such as routers and switches and inferring the quality that the subscriber receives. This approach cannot measure the entire subscriber experience and can often be wildly misleading for many reasons.
Subscriber QoE based monitoring focuses directly on QoE metrics instead of network elements. This enables WISPs to understand the QoE the network delivers down to a tower, sector and subscriber level. With this information in hand, WISPs can better plan their network upgrades and identify troublesome areas of the network.
Preseem is a network telemetry platform with a focus on providing WISPs with the QoE insights & optimization needed to improve customer satisfaction, reduce churn and lower support costs.