Quality of Experience Monitoring vs. Traditional NMS Solutions


by Dan Siemon | July 28, 2017 | Networking

How Quality of Experience Monitoring Differs from Traditional NMS Tools

Traditional network monitoring tools are not enough to ensure your network delivers a good user experience. That’s because they measure what is, at best, a proxy for the customer’s real network experience. This post explains the difference between traditional network monitoring and subscriber quality of experience monitoring, or QoE monitoring.

The purpose of computer networks is to provide access to services at a quality level acceptable to the user. Poor QoE causes customer churn while also driving up support costs.

Network users don’t care how many routers, switches, radios, or cables it takes to get the services they want. However, most network monitoring tools focus on the network infrastructure itself, not subscriber traffic. Before we look at the alternative, let’s look at how traditional network monitoring works.

Traditional Network Monitoring

Most network management systems (NMS) make use of a few simple tools to get information about the network. The simplest and most common of these are ICMP ping and SNMP.

An ICMP ping measures reachability and latency by sending an ICMP echo request message to a network node and timing how long it takes to receive the ICMP echo reply. If no response is received, the target is either down or unreachable. If only some of the messages go unanswered, the packet loss rate can be calculated from the number sent and the number received.
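As a rough illustration of the arithmetic an NMS performs on a batch of pings, here is a minimal Python sketch. The function name `summarize_pings` and the sample values are hypothetical, and a lost ping is represented as `None`; a real NMS would obtain these values from its own ICMP probes.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of ping results.

    Each entry is a round-trip time in milliseconds, or None when no
    echo reply arrived before the timeout.
    """
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    received = len(replies)
    return {
        # Any reply at all means the node is reachable.
        "reachable": received > 0,
        # Loss is the share of echo requests that went unanswered.
        "loss_pct": 100.0 * (sent - received) / sent if sent else 0.0,
        # Latency is averaged over the replies that did arrive.
        "avg_rtt_ms": mean(replies) if replies else None,
    }


# Hypothetical batch: five echo requests, one of which went unanswered.
summary = summarize_pings([12.0, 14.0, None, 13.0, 13.0])
print(summary)
```

Note that every number here describes the path between the NMS and the node, not the traffic of any particular subscriber.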

SNMP (Simple Network Management Protocol) is a management protocol that the NMS uses to retrieve information from a network element, such as the number of interfaces, the traffic counters on those interfaces, and other system information like CPU and memory usage.
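To show how interface counters typically become a throughput figure, the sketch below converts two readings of a 32-bit octet counter (such as the standard `ifInOctets` object) into an average bit rate. The function name and the sample readings are hypothetical, and the sketch assumes the counter wrapped at most once between polls.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps back to zero here


def average_bps(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Average bits per second between two readings of a 32-bit octet counter.

    Assumes the counter wrapped at most once during the interval.
    """
    delta_octets = (curr_octets - prev_octets) % COUNTER32_MODULUS
    return delta_octets * 8 / interval_s


# Hypothetical readings taken five minutes apart; the counter wrapped in between.
rate = average_bps(4_294_967_000, 37_499_704, 300)
print(f"{rate / 1_000_000:.1f} Mb/s")
```

The result is an average over the whole polling interval for the whole interface, so short bursts and per-subscriber differences are invisible in it.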

Figure 1 shows a very simple wireless ISP (WISP) network with an NMS. The red lines show a logical view of the communication between the NMS node and the managed network elements.

Diagram of WISP network with NMS Solution

Typically, the NMS sends one ping or a series of pings, issues SNMP queries for system information, or both, every few minutes. ICMP, SNMP, and other traditional network monitoring tools are an important and required part of your network management arsenal. However, it’s important to understand the limitations of this type of network monitoring. It tells you far less about the end user experience than you might think.

Problems and Limitations

Traditional network monitoring approaches focus on the elements that make up the network, such as routers and switches. Related metrics like the latency to the element or throughput are a proxy for the per-subscriber experience. There are several reasons why this approach is a rough approximation at best, and at worst, very misleading.

Network Paths

Networks often have redundant paths. This can easily lead to situations where the metrics, such as latency reported by the NMS, don’t measure the same path as customer traffic. Consider the network in the figure below.

Diagram of ISP network with redundant paths

In this network, there are two redundant paths to a customer. The routing protocol has selected the top path for subscriber traffic, while the NMS traffic follows the bottom path. If the top path gets congested or starts dropping packets, the NMS-based measurements provide no insight into this. That’s because, from the NMS’s perspective, there is no problem.

Granularity

Understanding subscriber QoE requires metrics on a per-subscriber basis. To see why, consider an access point (AP) with 10 users and a total throughput of 25Mb/s. From these two numbers, the most that can be concluded is that the average user gets 2.5Mb/s. In reality, a single user could be getting 20Mb/s while the other nine each get about 0.56Mb/s. That means you’ve got one really happy customer and nine unhappy ones.
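The AP example above can be checked with a few lines of Python. The per-user list is hypothetical: one user at 20Mb/s and the other nine sharing the remaining 5Mb/s equally.

```python
from statistics import mean

# Hypothetical per-subscriber throughput in Mb/s for one AP:
# one heavy user, and nine users splitting the remaining 5 Mb/s.
per_user_mbps = [20.0] + [5.0 / 9] * 9

total_mbps = sum(per_user_mbps)
average_mbps = mean(per_user_mbps)
users_below_1mbps = sum(1 for t in per_user_mbps if t < 1.0)

print(f"total:   {total_mbps:.1f} Mb/s")
print(f"average: {average_mbps:.1f} Mb/s")
print(f"users below 1 Mb/s: {users_below_1mbps} of {len(per_user_mbps)}")
```

The total and the average match what an AP-level counter would report, yet nine of the ten users are far below that average.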

Per-subscriber-module (SM) counters in the AP could address this simple example. However, counters like these cannot capture more advanced QoE metrics such as latency and TCP retransmissions. In general, network element-based management doesn’t have the metric granularity required to say anything intelligent about per-subscriber QoE.

Frequency

An NMS typically sends ICMP echo requests or SNMP queries every one to five minutes. No new information is obtained between polls, so events shorter than the polling interval can be missed entirely. For example, a 30-second outage that ruined a customer’s game session can go completely undetected with one-minute ICMP ping intervals. In general, more frequent measurements increase the accuracy of QoE estimates.
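The 30-second outage example can be sketched as follows. This is a simplified model with hypothetical function names: probes are assumed to fire at exact multiples of the polling interval, and the probability estimate assumes the outage starts at a uniformly random point relative to the probe schedule.

```python
def probes_in_outage(outage_start_s: int, outage_len_s: int,
                     interval_s: int, horizon_s: int) -> list:
    """Return the probe times (in seconds) that fall inside the outage window.

    Probes are assumed to fire at 0, interval_s, 2 * interval_s, ...
    """
    outage_end_s = outage_start_s + outage_len_s
    return [t for t in range(0, horizon_s, interval_s)
            if outage_start_s <= t < outage_end_s]


def detection_probability(outage_len_s: float, interval_s: float) -> float:
    """Chance that at least one probe lands inside the outage, assuming the
    outage start is uniformly random relative to the probe schedule."""
    return min(1.0, outage_len_s / interval_s)


# A 30-second outage that starts 70 seconds into a 10-minute window.
missed = probes_in_outage(70, 30, 60, 600)  # one-minute polling
caught = probes_in_outage(70, 30, 10, 600)  # ten-second polling
print(missed, caught, detection_probability(30, 60))
```

With one-minute polling, no probe falls inside this particular outage, and under the uniform-start assumption a 30-second outage is seen only about half the time. Even when a probe does land inside the window, a single lost ping is weak evidence on its own.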

The answer to this problem isn’t as simple as increasing the frequency of ICMP echo requests or SNMP queries. In both cases, increasing the frequency of these operations just adds more traffic to the network. Also, for many network elements, SNMP queries are resource-intensive, making it impossible to query at short intervals.

Incomplete Picture

Subscriber QoE is about the experience from the subscriber’s point of view. Somewhat unfortunately for ISPs, the subscriber point of view includes network elements in their home. These include their Wi-Fi router and the end device itself. For example, subscribers can’t distinguish poor QoE caused by AP-to-SM problems from poor QoE caused by bad Wi-Fi in their home. In both cases, they’re likely to blame their ISP and are more likely to churn or drive up your support costs with complaints.

Interacting with in-home network elements is outside the domain of traditional NMS approaches. It’s also technically impractical due to the presence of home NAT gateways.

Differentiated Network Treatment

The most insidious way that network element-based monitoring can lead to an incorrect view of subscriber QoE comes from traffic being treated differently by network elements on the path. The figure below provides a logical view of how this can occur.

Diagram of network with poor queueing policies

Notice that in this example, the traffic from the NMS follows the same path through the network as subscriber traffic. However, due to default or configured queueing policies, the traffic is treated differently. This can lead to situations where the NMS reports low latency but the actual subscriber packets experience high latency.

We’ve seen real-world examples where subscriber TCP traffic was directed to different queues than ICMP traffic. In this situation, the subscriber TCP traffic experienced over 50ms of latency while the NMS was reporting latency to the AP at only 5-10ms.

Many types of network equipment have default policies that direct traffic through different queues based on things like the IP DSCP mark. As a result, this type of problem can occur in your network even if you haven’t done any special network configuration.
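A toy model shows how two queues on the same link can produce very different latency. The sketch below estimates how long a newly arrived packet waits behind the bytes already queued. The backlog sizes and effective drain rates are hypothetical numbers chosen only for illustration, not measurements from any real device.

```python
def queueing_delay_ms(backlog_bytes: int, drain_rate_mbps: float) -> float:
    """Time in milliseconds for a newly arrived packet to wait behind
    backlog_bytes in a queue draining at drain_rate_mbps."""
    return backlog_bytes * 8 / (drain_rate_mbps * 1_000_000) * 1000


# Hypothetical queue holding the NMS's ICMP traffic: nearly empty.
icmp_delay_ms = queueing_delay_ms(backlog_bytes=1_500, drain_rate_mbps=2.0)

# Hypothetical queue holding subscriber TCP traffic: deep backlog.
tcp_delay_ms = queueing_delay_ms(backlog_bytes=125_000, drain_rate_mbps=18.0)

print(f"ICMP queue delay: {icmp_delay_ms:.1f} ms")
print(f"TCP queue delay:  {tcp_delay_ms:.1f} ms")
```

In this model, the ping sees a delay of a few milliseconds while subscriber packets on the same path wait roughly ten times longer, which is the kind of gap described above.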

Subscriber Quality of Experience Monitoring

Subscriber QoE monitoring takes a fundamentally different approach compared to network element-based monitoring. Instead of monitoring network elements and inferring the subscriber’s QoE, this approach directly extracts QoE metrics from the subscriber’s traffic. This avoids the pitfalls discussed earlier. It also provides a critical augmentation of existing network element-based monitoring solutions.

Diagram of Quality of Experience monitoring on a network

To understand how this works, consider the figure above. There are three flows to three different devices in a subscriber’s home. A subscriber quality of experience monitoring solution can analyze the traffic for all of these flows in real time to extract QoE metrics.

Using this approach, important QoE metrics such as latency, loss, jitter, and throughput can be calculated on a per-tower, sector, and subscriber basis. This gives you the insight into subscriber experience that you need.
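As a minimal sketch of that roll-up, suppose the monitoring system has already extracted latency samples from subscriber traffic. The sample records, identifiers, and the helper `mean_latency_by` below are all hypothetical; how a real product extracts and stores these measurements is not described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical passive measurements: (tower, sector, subscriber, latency_ms)
samples = [
    ("tower-1", "sector-a", "sub-100", 18.0),
    ("tower-1", "sector-a", "sub-100", 22.0),
    ("tower-1", "sector-a", "sub-101", 95.0),
    ("tower-1", "sector-b", "sub-200", 15.0),
]


def mean_latency_by(samples, key_len):
    """Average latency grouped by the first key_len fields of each sample.

    key_len=1 groups by tower, 2 by (tower, sector),
    3 by (tower, sector, subscriber).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[:key_len]].append(sample[3])
    return {key: mean(values) for key, values in groups.items()}


per_tower = mean_latency_by(samples, 1)
per_sector = mean_latency_by(samples, 2)
per_subscriber = mean_latency_by(samples, 3)

for key, latency_ms in per_subscriber.items():
    print(key, f"{latency_ms:.1f} ms")
```

In this made-up data set the tower-level average looks moderate, while the per-subscriber view shows that one subscriber is experiencing far higher latency than the others.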

The advantages of a subscriber QoE-based monitoring approach include:

No new traffic is added to the access network. QoE information is obtained by processing the subscriber’s traffic.
QoE metrics cover the entire subscriber experience, including home Wi-Fi quality.
By measuring the subscriber’s traffic directly, you know that network QoS configuration is not causing misleading results.
Metrics can be extracted with a high granularity (per-subscriber and per-IP address) and with a higher frequency than ICMP or SNMP allow.
You can target network upgrades to areas of the network with poor QoE. You can also delay upgrades to parts of the network that have good QoE even though they have high load.

Summary

A network that delivers good QoE has happier customers who are less likely to leave and less likely to call support. Traditional NMS tools are based on monitoring network elements such as routers and switches, and inferring the quality that the subscriber receives. This approach can’t measure the entire subscriber experience and, for the reasons described above, can be wildly misleading.

Subscriber quality of experience monitoring focuses directly on QoE metrics instead of network elements. This enables WISPs to understand the QoE the network delivers down to a tower, sector and subscriber level. With this information in hand, WISPs can better plan their network upgrades and identify troublesome areas of the network.

About Preseem

Preseem is a network telemetry platform with a focus on providing WISPs with the QoE insights and optimization needed to improve customer satisfaction, reduce churn and lower support costs.