Network Monitoring

Networks are invisible until they break. You don't think about the path between your application and its users until that path fails—and then nothing else matters. A perfectly healthy server running flawless code becomes useless the moment the network between it and users fails. The application didn't break. The path broke.

Network monitoring watches these paths. It tracks everything from the switch in your server rack to the undersea cables carrying traffic across oceans. Since every distributed system depends entirely on network reliability, network monitoring provides the visibility that makes troubleshooting possible.

Why Network Monitoring Matters

Networks fail in ways that server monitoring and application monitoring can't see. Your dashboards show green—CPU normal, memory fine, application responding to health checks—but users report the site is down. Without network monitoring, you're blind to an entire category of failure.

Connectivity failures happen when networks become unreachable due to equipment failure, misconfiguration, or physical damage. A backhoe cuts a fiber line. A router reboots unexpectedly. A firewall rule blocks legitimate traffic. Even brief connectivity loss disrupts user sessions and breaks service availability.

Performance degradation creates problems that seem mysterious without network visibility. The network is "up" but congested. Packets are flowing but slowly. Users experience timeouts and errors while your application logs show nothing wrong. The application isn't the problem—the path is.

Capacity exhaustion occurs when bandwidth utilization approaches interface limits. A 1 Gbps link carrying 950 Mbps isn't down, but it's dropping packets and creating latency. Understanding network capacity prevents bottlenecks and informs decisions about when to upgrade.

Security threats often appear first in network traffic. DDoS attacks, data exfiltration, lateral movement by attackers—network monitoring provides the first indication of many security incidents.

Core Network Metrics

Network health comes down to a handful of fundamental measurements:

Bandwidth utilization measures throughput as a percentage of interface capacity. A 1 Gbps interface passing 800 Mbps operates at 80% utilization. High sustained utilization indicates congestion, though brief spikes are normal.

Packet loss occurs when network devices drop packets due to congestion, errors, or configuration problems. This metric matters more than most people realize—even 1-2% packet loss significantly impacts TCP performance. Lost packets trigger retransmissions and congestion control algorithms that dramatically reduce throughput.
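
To see why small loss rates hurt so much, here is a back-of-the-envelope estimate using the well-known Mathis approximation for steady-state TCP throughput; the MSS and RTT values are illustrative assumptions, not measurements:

```python
# Rough TCP throughput estimate using the Mathis approximation:
#   throughput <= (MSS / RTT) * (C / sqrt(loss_rate)), C ~ 1.22 for standard TCP
# Values below are illustrative, not measured.
from math import sqrt

MSS_BYTES = 1460          # typical Ethernet MSS
RTT_SECONDS = 0.05        # 50 ms round-trip time
C = 1.22                  # constant from the Mathis et al. model

for loss in (0.0001, 0.001, 0.01, 0.02):
    throughput_bps = (MSS_BYTES * 8 / RTT_SECONDS) * (C / sqrt(loss))
    print(f"{loss:.2%} loss -> ~{throughput_bps / 1e6:.1f} Mbit/s per connection")
```

At 0.01% loss a single connection can sustain tens of megabits per second; at 1-2% loss the same connection is limited to a few megabits, regardless of how much bandwidth is available.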

Latency measures round-trip time for packets traveling between points. Under 50ms feels instant. Over 200ms feels sluggish. Over 500ms breaks interactive applications entirely.

Jitter quantifies variation in latency. Consistent 100ms latency works better than latency bouncing between 50ms and 150ms. Video calls and voice calls are particularly sensitive—jitter creates the choppy, robotic audio that makes conversations painful.
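
As an illustration, a minimal sketch of computing latency and jitter statistics from a list of round-trip-time samples (the sample values here are made up):

```python
# Summarize latency and jitter from a series of RTT measurements (milliseconds).
# One common jitter measure is the mean absolute difference between consecutive samples.
from statistics import mean, stdev

rtt_ms = [48.2, 51.7, 49.9, 95.3, 50.4, 52.1, 49.0]   # made-up samples

avg_latency = mean(rtt_ms)
jitter = mean(abs(b - a) for a, b in zip(rtt_ms, rtt_ms[1:]))

print(f"average latency: {avg_latency:.1f} ms")
print(f"std deviation:   {stdev(rtt_ms):.1f} ms")
print(f"jitter (mean difference between consecutive samples): {jitter:.1f} ms")
```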

Errors including CRC errors, frame errors, and interface errors indicate physical layer problems. Bad cables. Failing network cards. Electromagnetic interference from nearby equipment. These errors often precede complete failures.

ICMP-Based Monitoring

The simplest form of network monitoring uses Internet Control Message Protocol (ICMP)—a protocol from the 1980s that still works remarkably well for basic reachability testing.

Ping sends ICMP echo requests and waits for echo replies. It's essentially shouting "ARE YOU THERE?" across the network and listening for a response. Simple, but effective. Ping tells you whether a host is reachable and how long the round trip takes.
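
A basic reachability check can simply wrap the system ping command. The sketch below assumes a Linux-style ping that accepts -c (count) and -W (timeout in seconds):

```python
# Basic reachability probe by shelling out to the system ping command.
# Assumes Linux-style flags: -c <count>, -W <timeout in seconds>.
import subprocess

def is_reachable(host: str, timeout_s: int = 2) -> bool:
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0   # exit code 0 means an echo reply came back

if __name__ == "__main__":
    for host in ("192.0.2.10", "8.8.8.8"):
        print(host, "reachable" if is_reachable(host) else "unreachable")
```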

Traceroute maps network paths by exploiting a clever trick: it sends packets with incrementally increasing TTL (Time To Live) values. Each router along the path decrements the TTL and, when it hits zero, returns an ICMP Time Exceeded message. The result reveals every hop packets take from source to destination. When something breaks, traceroute shows you exactly where in the path the problem lies.
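
The same trick can be sketched in a few lines: send probes with steadily increasing TTL and record which router answers with Time Exceeded. This is a simplified illustration (UDP probes and a raw ICMP receive socket, so it needs root privileges), not a replacement for the real traceroute:

```python
# Simplified traceroute: UDP probes with increasing TTL; routers that expire the
# TTL reply with ICMP Time Exceeded. Requires root for the raw ICMP socket.
import socket
import time

def trace(dest: str, max_hops: int = 30, timeout_s: float = 2.0) -> None:
    dest_ip = socket.gethostbyname(dest)
    for ttl in range(1, max_hops + 1):
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        recv.settimeout(timeout_s)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        start = time.monotonic()
        send.sendto(b"", (dest_ip, 33434))          # high UDP port, traceroute-style
        try:
            _, addr = recv.recvfrom(512)            # ICMP arrives from the expiring hop
            rtt_ms = (time.monotonic() - start) * 1000
            print(f"{ttl:2d}  {addr[0]:<15}  {rtt_ms:.1f} ms")
            if addr[0] == dest_ip:                  # destination answered: done
                break
        except socket.timeout:
            print(f"{ttl:2d}  *")
        finally:
            send.close()
            recv.close()
```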

Path MTU discovery determines the maximum packet size that can traverse a path without fragmentation. MTU mismatches cause subtle problems—packets get fragmented or dropped, degrading performance in ways that are difficult to diagnose without knowing to look.

ICMP has limitations. Many networks filter or rate-limit ICMP traffic, so ping working doesn't guarantee application traffic works. But for basic connectivity verification, ICMP remains valuable.

SNMP Monitoring

Simple Network Management Protocol (SNMP) provides standardized access to network device metrics:

Interface statistics expose bytes in/out, packets in/out, errors, and discards for every network interface. SNMP makes bandwidth monitoring possible across equipment from different vendors—Cisco, Juniper, Arista—all speaking the same protocol.
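
As a sketch of how that works in practice, the example below polls a standard IF-MIB counter (ifInOctets, OID 1.3.6.1.2.1.2.2.1.10) twice by shelling out to net-snmp's snmpget and computes utilization from the counter delta. The host address, community string, and interface index are placeholders:

```python
# Poll ifInOctets (IF-MIB, OID 1.3.6.1.2.1.2.2.1.10) twice via net-snmp's snmpget
# and estimate inbound utilization from the counter delta.
import subprocess
import time

HOST = "192.0.2.1"            # placeholder device address
COMMUNITY = "public"          # SNMPv2c community string (prefer SNMPv3 in production)
IF_INDEX = 2                  # interface index to watch
LINK_BPS = 1_000_000_000      # 1 Gbps interface

def if_in_octets() -> int:
    oid = f"1.3.6.1.2.1.2.2.1.10.{IF_INDEX}"
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, oid], text=True
    )
    return int(out.strip())

first = if_in_octets()
time.sleep(30)
second = if_in_octets()

bits_per_second = (second - first) * 8 / 30      # ignores 32-bit counter wrap
utilization = bits_per_second / LINK_BPS
print(f"inbound: {bits_per_second / 1e6:.1f} Mbit/s ({utilization:.1%} of capacity)")
```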

Device health metrics including CPU usage, memory utilization, temperature, and power supply status help predict equipment failures before they impact service. A switch running at 95% CPU or 45°C above normal won't fail immediately, but it's heading toward trouble.

Routing table monitoring tracks changes that might indicate instability or misconfiguration. Unexpected route changes often precede connectivity problems.

ARP and MAC address tables reveal which devices connect to which network segments. Useful for troubleshooting and for detecting unauthorized devices on your network.

SNMP comes in three versions. SNMPv1 and v2c are widely deployed but lack meaningful security—community strings (essentially passwords) travel in plaintext. SNMPv3 adds authentication and encryption, making it suitable for production networks where security matters.

Flow Monitoring

Flow-based monitoring analyzes traffic patterns rather than just interface statistics:

NetFlow, sFlow, and IPFIX export records about traffic flows—source address, destination address, ports, protocol, bytes transferred. Flow data answers questions interface statistics can't: What traffic is actually using your bandwidth?

Top talkers analysis identifies which hosts or applications consume the most capacity. When bandwidth utilization spikes, flow data reveals whether it's a backup job, a video conference, or a compromised host exfiltrating data.
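
A sketch of a top-talkers report over exported flow records (the records here are hypothetical dictionaries; a real deployment would parse NetFlow, sFlow, or IPFIX exports with a collector):

```python
# Aggregate bytes by source address to find top talkers. The flow records here
# are hypothetical; a real collector would decode NetFlow/sFlow/IPFIX exports.
from collections import Counter

flows = [
    {"src": "10.0.0.5",  "dst": "10.0.1.9", "dport": 443, "bytes": 1_200_000},
    {"src": "10.0.0.7",  "dst": "10.0.2.3", "dport": 22,  "bytes": 45_000},
    {"src": "10.0.0.5",  "dst": "10.0.3.8", "dport": 443, "bytes": 8_400_000},
    {"src": "10.0.0.12", "dst": "10.0.1.9", "dport": 53,  "bytes": 9_000},
]

by_source = Counter()
for flow in flows:
    by_source[flow["src"]] += flow["bytes"]

for src, total in by_source.most_common(3):
    print(f"{src:<12} {total / 1e6:6.1f} MB")
```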

Traffic classification categorizes flows by application. Understanding that 60% of bandwidth serves video streaming, 20% web traffic, and 10% file transfers informs Quality of Service policies and capacity planning.

Baseline deviations detect unusual patterns. A host that normally sends 10 MB/day suddenly sending 10 GB warrants investigation.
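
A minimal deviation check against a per-host baseline might look like the sketch below (the baseline figures are illustrative):

```python
# Flag hosts whose daily transfer volume is far above their historical baseline.
# Baseline history and today's figures are illustrative.
from statistics import mean, stdev

history_mb = {                       # daily MB sent over the past week, per host
    "10.0.0.5": [9, 11, 10, 12, 10, 9, 11],
    "10.0.0.7": [300, 280, 310, 295, 305, 290, 300],
}
today_mb = {"10.0.0.5": 10_000, "10.0.0.7": 310}   # today's observed volume

for host, samples in history_mb.items():
    baseline, spread = mean(samples), stdev(samples)
    if today_mb[host] > baseline + 5 * spread:     # crude threshold: well above normal
        print(f"{host}: {today_mb[host]} MB today vs ~{baseline:.0f} MB baseline -> investigate")
```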

BGP Monitoring

Border Gateway Protocol routes Internet traffic between autonomous systems. If you connect to the Internet, BGP health directly affects your reachability:

Peer status tracks whether BGP neighbors are established and exchanging routes. A failed BGP peer means lost connectivity to whatever networks that peer provided access to.

Route count monitoring detects route leaks (accidentally announcing routes you shouldn't) and route withdrawals (unexpectedly losing routes you need). Large BGP changes often precede connectivity problems visible to users.

AS path analysis examines the autonomous system path traffic takes. Unexpected paths might indicate BGP hijacking—where someone announces your IP space as their own—or simply inefficient routing adding unnecessary latency.

Route flapping occurs when routes repeatedly appear and disappear. Excessive flapping is self-reinforcing: other networks dampen flapping routes to protect themselves, making your routes less reachable even during the "up" periods.

DNS Monitoring

Domain Name System health affects every Internet service. If DNS breaks, nothing else matters—users can't reach services they can't resolve:

Query response time measures how long DNS servers take to answer. Slow DNS delays everything that follows, since applications can't connect to services they can't find.

Resolution success rate tracks whether queries receive correct answers. DNS failures make services unreachable even when those services are running perfectly, waiting patiently for requests that will never arrive.
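
A simple way to measure both metrics from a monitoring host is to time resolutions with the standard library, as in this sketch (the domain list is illustrative):

```python
# Time DNS resolution and track success rate using the system resolver.
# The domains below are illustrative.
import socket
import time

domains = ["example.com", "example.org", "does-not-exist.invalid"]
successes = 0

for name in domains:
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
        elapsed_ms = (time.monotonic() - start) * 1000
        successes += 1
        print(f"{name:<28} resolved in {elapsed_ms:.0f} ms")
    except socket.gaierror:
        print(f"{name:<28} resolution failed")

print(f"success rate: {successes}/{len(domains)}")
```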

Authoritative server monitoring for domains you host ensures your DNS servers respond correctly. If your authoritative DNS fails, your domain becomes invisible to the Internet.

Recursive resolver monitoring tracks the DNS servers your systems use for external lookups. Problems with your resolvers affect connectivity to every external service.

Wireless Network Monitoring

Wireless networks require monitoring dimensions that wired networks don't:

Signal strength and signal-to-noise ratio directly affect performance. Weak signals cause retransmissions and slow throughput. Users experience this as "the WiFi is slow" without understanding why.

Channel utilization measures spectrum usage. Multiple access points on the same channel compete for airtime, reducing performance for everyone.

Access point health including connected client counts, authentication rates, and roaming patterns reveals whether wireless infrastructure handles current load.

Interference detection identifies non-WiFi sources degrading performance. Microwave ovens, Bluetooth devices, and cordless phones can all interfere with WiFi, creating intermittent problems that are maddening to troubleshoot without proper monitoring.

Application-Layer Network Monitoring

Network monitoring extends beyond infrastructure to application protocols:

HTTP/HTTPS monitoring tracks response times, status codes, and TLS handshake performance. This answers whether web services are actually reachable and responsive, not just whether the underlying network is up.
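
A basic HTTPS check with the standard library, timing the full request and recording the status code (the host and path are placeholders):

```python
# Minimal HTTPS health check: time the request and record the status code.
# The host and path are placeholders.
import http.client
import time

HOST, PATH = "www.example.com", "/"

start = time.monotonic()
conn = http.client.HTTPSConnection(HOST, timeout=5)
try:
    conn.request("GET", PATH)                    # TCP connect, TLS handshake, send request
    response = conn.getresponse()
    elapsed_ms = (time.monotonic() - start) * 1000
    healthy = 200 <= response.status < 400
    print(f"{HOST}{PATH} -> {response.status} in {elapsed_ms:.0f} ms "
          f"({'OK' if healthy else 'ALERT'})")
finally:
    conn.close()
```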

Database connection monitoring verifies that applications can actually connect to databases and execute queries. Database connectivity problems often manifest as application timeouts without clear errors.

API endpoint monitoring tests critical operations end-to-end. Can you actually call the payment API and get a response? This combines network and application monitoring into something closer to user experience.

Certificate monitoring ensures TLS certificates remain valid. Certificate expiration causes connection failures—browsers refuse to connect, API clients fail their TLS handshakes. Easy to prevent with monitoring, embarrassing when it happens.
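
Checking days until expiry takes only a few lines with the standard library ssl module, as sketched below (the hostname is a placeholder):

```python
# Report days until a server's TLS certificate expires. Hostname is a placeholder.
import socket
import ssl
import time

HOST, PORT = "www.example.com", 443

context = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()                      # parsed certificate as a dict

expires_at = ssl.cert_time_to_seconds(cert["notAfter"])   # e.g. 'Jun  1 12:00:00 2026 GMT'
days_left = (expires_at - time.time()) / 86400
print(f"{HOST}: certificate expires in {days_left:.0f} days")
if days_left < 30:
    print("ALERT: renew soon")
```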

Network Path Monitoring

Understanding how packets traverse networks helps pinpoint problems:

Multi-hop latency measurement identifies which segments introduce delay. If total latency is 200ms, knowing that 150ms occurs in one specific hop tells you exactly where to look.

Asymmetric routing detection identifies when forward and return paths differ. This can cause problems with stateful firewalls and makes troubleshooting confusing—the problem might be on a path you're not examining.

Packet capture using tools like tcpdump and Wireshark examines actual packets for deep troubleshooting. Not continuous monitoring, but invaluable during incident investigation when you need to see exactly what's happening on the wire.

Network Device Monitoring

Network equipment itself requires attention:

Switch and router health including CPU, memory, and temperature helps prevent device failures before they disrupt service.

Spanning Tree Protocol monitoring detects topology changes that might indicate loops or switching problems. STP issues can take down entire network segments.

VLAN configuration monitoring ensures network segmentation remains correct. VLAN misconfigurations cause unexpected connectivity or, worse, security boundary violations.

Port status tracking identifies when interfaces go down. A port flapping between up and down often indicates cable problems or device issues.

Cloud Network Monitoring

Cloud environments present distinct challenges:

Virtual network monitoring tracks cloud VPCs, subnets, route tables, and security groups. Cloud networking misconfiguration commonly causes connectivity problems that look mysterious until you check the configuration.
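
As one example of configuration-level monitoring, the sketch below uses the AWS boto3 SDK to flag security group rules open to the entire Internet (it assumes boto3 is installed and AWS credentials and a region are configured):

```python
# Flag EC2 security group rules that allow inbound traffic from 0.0.0.0/0.
# Assumes boto3 is installed and AWS credentials/region are configured.
import boto3

ec2 = boto3.client("ec2")

for group in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in group.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                port = rule.get("FromPort", "all")
                print(f"{group['GroupId']} ({group['GroupName']}): "
                      f"port {port} open to the Internet")
```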

Transit and peering connections between cloud regions or to on-premises networks require monitoring. These connections are single points of failure for hybrid architectures.

Internet gateway monitoring verifies that cloud resources can reach the Internet and vice versa.

Throughput limits in cloud environments are often soft limits that trigger throttling when exceeded. Monitoring actual throughput against these limits prevents unexpected performance degradation.

Best Practices

Effective network monitoring requires more than deploying tools:

Monitor from multiple locations. Network problems are often location-specific. A service might be unreachable from Europe while working fine in North America. Single-location monitoring creates blind spots.

Establish baselines. You can't detect anomalies without knowing what's normal. Understand typical bandwidth utilization, latency ranges, and traffic patterns for your network.

Alert on trends, not just thresholds. Gradually increasing latency or error rates often indicate emerging problems. Catching these trends before they become outages is the goal.
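
One lightweight way to catch a trend is to fit a slope over a rolling window of samples, as in this sketch (it uses statistics.linear_regression, available in Python 3.10+; the latency series is made up):

```python
# Alert on a rising latency trend rather than a fixed threshold.
# Uses statistics.linear_regression (Python 3.10+); the samples are made up.
from statistics import linear_regression

latency_ms = [52, 51, 55, 58, 61, 63, 67, 70, 74, 79]   # one sample per minute
minutes = list(range(len(latency_ms)))

slope, intercept = linear_regression(minutes, latency_ms)
print(f"latency trending at {slope:+.1f} ms per minute")

if slope > 1.0:                       # arbitrary example threshold
    print("ALERT: latency rising steadily even though no absolute threshold was crossed")
```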

Document your topology. Understanding how networks interconnect is essential for interpreting monitoring data. Keep documentation current as networks change.

Test your monitoring. Intentionally create network problems in test environments. Verify that monitoring detects them. Monitoring that doesn't alert on real problems provides false confidence.
