A metric tells you the patient has a fever. A log tells you they ate the fish.
Log monitoring collects the discrete events that systems generate—the specific errors, the exact requests, the authentication attempts, the state changes—and makes them searchable, analyzable, and alertable. While metrics answer "how much" and "how fast," logs answer "what exactly" and "why."
Why Logs Matter
Every interesting thing that happens in a system leaves a trace. A web server records each HTTP request. An application catches an exception and writes the stack trace. A database logs a slow query. A firewall notes a blocked connection attempt.
These traces are evidence. When something goes wrong—and something always goes wrong—logs are how you reconstruct what happened. Metrics might show a spike in errors at 3:47 PM. Logs show you the specific error message, which user triggered it, what they were trying to do, and what the system state was when it failed.
This is the fundamental value: logs provide narrative where metrics provide numbers.
Log Levels
Not all logs are equally urgent. Standard levels categorize importance:
DEBUG captures detailed diagnostic information—function calls, variable values, internal state. Useful during development, usually too verbose for production.
INFO documents normal operations. Service started. User logged in. Report generated. These create an audit trail of what the system did.
WARNING indicates concerning conditions that don't prevent operation. Disk space getting low. Deprecated API called. Retry attempted. Worth watching, not worth waking someone.
ERROR signals failures. Request couldn't complete. Exception caught. Operation failed. These typically trigger alerts.
CRITICAL indicates severe problems threatening the service itself. Database unreachable. Memory exhausted. Security breach detected. These demand immediate response.
The hierarchy matters for filtering. In production, you might alert on ERROR and CRITICAL while keeping INFO and WARNING available for investigation. DEBUG stays off unless you're actively hunting a specific problem.
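As a sketch, here is how that hierarchy looks with Python's standard logging module; the logger name and the messages are illustrative:

```python
import logging

# In production you might set the level to INFO so DEBUG is suppressed,
# and route ERROR/CRITICAL to an alerting handler.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

user_id = "u-4821"  # illustrative value

log.debug("cart contents: %s", {"sku": "A1", "qty": 2})  # dev-time detail; filtered out here
log.info("user %s logged in", user_id)                   # normal operation, audit trail
log.warning("disk usage at 85%")                         # concerning, not blocking
log.error("payment provider returned 502")               # failure; typically alerts
log.critical("database unreachable")                     # service-threatening; page someone
```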
Structured vs. Unstructured
How logs are formatted determines how useful they become.
Unstructured logs are free-form text:
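For example, a login failure might be recorded like this (the timestamp, username, and IP are invented for illustration):

```
2024-05-14 14:02:11 ERROR Login failed for user jsmith from 203.0.113.42: invalid password (attempt 3)
```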
Human-readable, but extracting the username or IP requires parsing the text.
Structured logs use consistent formats like JSON:
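The same event as a structured record might look like this; the field names are illustrative, and what matters is that they stay consistent across services:

```json
{
  "timestamp": "2024-05-14T14:02:11Z",
  "level": "ERROR",
  "event": "login_failed",
  "user": "jsmith",
  "source_ip": "203.0.113.42",
  "reason": "invalid_password",
  "attempt": 3
}
```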
Now you can query "all login failures from this IP range" without writing regex. You can aggregate "failed logins per user" without text parsing. The machine can read it efficiently while humans can still understand it.
The investment in structured logging pays dividends every time you investigate an incident.
Log Aggregation
Distributed systems scatter logs across dozens or hundreds of machines. A single request might touch ten services, each generating its own logs.
Log aggregation solves this by collecting logs from everywhere into one searchable place.
Log shippers like Filebeat, Fluentd, or Logstash run on each server, reading log files and forwarding them to central collectors. They buffer when networks are slow and retry when delivery fails.
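Conceptually, a shipper tails a file, buffers what it reads, and retries delivery. Here is a toy Python sketch of that loop; the collector URL is hypothetical, and real shippers add checkpointing, backpressure, and TLS:

```python
import json
import time
import urllib.request

# Hypothetical collector endpoint; real shippers are configured declaratively.
COLLECTOR_URL = "http://logs.internal:8080/ingest"
buffer: list[str] = []

def ship(lines: list[str]) -> bool:
    """Try to deliver a batch of log lines; return True on success."""
    body = json.dumps({"lines": lines}).encode()
    req = urllib.request.Request(COLLECTOR_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=5):
            return True
    except OSError:
        return False  # network problem: keep the batch buffered

def follow(path: str) -> None:
    """Tail a log file, buffering lines and retrying delivery when it fails."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if line:
                buffer.append(line.rstrip("\n"))
            if buffer and ship(buffer):
                buffer.clear()
            elif not line:
                time.sleep(1)  # nothing new, or delivery failed: wait and retry
```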
Central collection means engineers search one interface instead of SSHing to individual servers. When a user reports a problem, you search the central logs for their request ID instead of guessing which server handled it.
Time synchronization is critical. If server clocks drift, event sequences become garbled—you're reading a mystery novel with the pages shuffled. NTP or similar protocols keep clocks aligned so logs from different servers actually correlate.
Storage Tiers
Logs accumulate fast. A busy system generates gigabytes per day; a large fleet, terabytes per week. Storing everything in fast, searchable indexes forever isn't economical.
Storage tiers balance access speed against cost:
Hot storage keeps recent logs (typically 7-30 days) in fast indexes like Elasticsearch. These are the logs you search constantly—current incidents, recent deployments, active investigations.
Warm storage archives older logs (30-90 days) in slower but cheaper storage. Still searchable, just not instant.
Cold storage preserves ancient logs (90+ days) in archival systems like S3 Glacier. Good for compliance requirements and historical analysis, but not designed for regular access.
Retention policies automatically age logs through these tiers and eventually delete them. The policy balances storage costs against "how far back might we ever need to look?"
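As a minimal sketch of the idea, assuming illustrative tier boundaries rather than any particular tool's policy format:

```python
from datetime import datetime, timedelta, timezone

# Illustrative boundaries; real values depend on cost, query patterns,
# and compliance requirements.
TIERS = [
    ("hot",  timedelta(days=30)),    # fast, fully indexed, searched constantly
    ("warm", timedelta(days=90)),    # cheaper storage, slower queries
    ("cold", timedelta(days=365)),   # archival, e.g. object storage
]

def tier_for(log_time: datetime, now: datetime) -> str | None:
    """Return the storage tier for a log of this age, or None if the
    retention policy says it should be deleted."""
    age = now - log_time
    for name, max_age in TIERS:
        if age <= max_age:
            return name
    return None  # older than every tier: delete

now = datetime(2024, 5, 14, tzinfo=timezone.utc)
print(tier_for(datetime(2024, 5, 1, tzinfo=timezone.utc), now))  # "hot"
print(tier_for(datetime(2023, 1, 1, tzinfo=timezone.utc), now))  # None -> delete
```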
Searching and Analyzing
Collected logs are only valuable if you can find what you need.
Full-text search finds logs containing specific words. Search "connection timeout" to find every instance across all services.
Field-based filtering narrows by specific values: all errors from the payment service, all requests from a specific user, all database queries over one second.
Aggregations count and summarize. How many errors per service per hour? What's the distribution of response times by endpoint? Which error messages appear most frequently?
Time-series views reveal patterns. Error counts per minute, graphed over time, show whether a problem is growing, stable, or resolving.
The combination enables investigation workflows: start broad ("show me all errors in the last hour"), filter down ("just the payment service"), correlate ("what else happened at the same time?"), and understand.
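A rough Python sketch of that workflow over structured records; the tiny in-memory list stands in for a real log index, where these queries would run in the backend:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Stand-in for a log index; values are invented.
logs = [
    {"timestamp": "2024-05-14T14:02:11Z", "level": "ERROR", "service": "payments", "message": "connection timeout"},
    {"timestamp": "2024-05-14T14:02:15Z", "level": "INFO",  "service": "payments", "message": "retry succeeded"},
    {"timestamp": "2024-05-14T14:03:02Z", "level": "ERROR", "service": "checkout", "message": "connection timeout"},
]

def parse_ts(record):
    return datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))

now = datetime(2024, 5, 14, 15, 0, tzinfo=timezone.utc)

# 1. Start broad: all errors in the last hour.
errors = [r for r in logs if r["level"] == "ERROR" and now - parse_ts(r) < timedelta(hours=1)]

# 2. Filter down: just the payments service.
payment_errors = [r for r in errors if r["service"] == "payments"]

# 3. Aggregate: which error messages appear most often, by service?
by_service = Counter((r["service"], r["message"]) for r in errors)
print(by_service.most_common(5))
```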
Log-Based Alerting
Logs can trigger alerts when specific patterns appear:
Pattern matching alerts on specific content. "OutOfMemoryError" in application logs? Alert immediately.
Threshold alerts trigger on volume. More than 100 errors per minute? Something's wrong.
Anomaly detection identifies unusual patterns without explicit thresholds. Error rate is 10x the normal baseline for this time of day? Flag it.
Correlation rules catch specific combinations. Database connection error followed by application timeout within 30 seconds? That's a recognizable failure pattern.
Deduplication prevents alert storms. One alert saying "500 occurrences of this error in the last 5 minutes" beats 500 individual alerts.
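As an illustration, a minimal evaluation pass over one minute of structured records might combine threshold checks, pattern matching, and deduplication like this; the threshold value and field names are assumptions:

```python
from collections import Counter

# Hypothetical threshold; real systems make these configurable per signal.
ERRORS_PER_MINUTE_THRESHOLD = 100

def evaluate(window_records):
    """Evaluate one minute of structured log records and return deduplicated alerts.

    `window_records` is a list of dicts with at least "level" and "message".
    """
    errors = [r for r in window_records if r["level"] in ("ERROR", "CRITICAL")]
    alerts = []

    # Threshold alert: too many errors in the window, regardless of content.
    if len(errors) > ERRORS_PER_MINUTE_THRESHOLD:
        alerts.append(f"{len(errors)} errors in the last minute (threshold {ERRORS_PER_MINUTE_THRESHOLD})")

    # Pattern matching with deduplication: one alert per distinct message,
    # with a count, instead of one alert per occurrence.
    for message, count in Counter(r["message"] for r in errors).items():
        if "OutOfMemoryError" in message:
            alerts.append(f"{count} occurrences of '{message}' in the last minute")

    return alerts
```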
Parsing and Enrichment
Raw logs often need processing before they're useful.
Parsing extracts structure from text. A web server log line becomes separate fields: timestamp, IP, HTTP method, path, status code, duration.
Enrichment adds context. IP addresses get geographic locations attached. User IDs get account information. This context enables queries like "errors from enterprise customers" or "traffic from this country."
Normalization standardizes formats across sources. Different services might log timestamps differently, or use different field names for the same concept. Normalization makes them queryable together.
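A small Python sketch of all three steps on one invented access-log line; the log format, field names, and IP lookup table are illustrative:

```python
import re

LINE = '203.0.113.42 [14/May/2024:14:02:11 +0000] "GET /api/orders HTTP/1.1" 500 0.312'

PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<duration>[\d.]+)'
)

# Parsing: turn free text into named fields.
record = PATTERN.match(LINE).groupdict()

# Enrichment: attach context from another source (a lookup table here;
# a GeoIP database or user directory in practice).
KNOWN_RANGES = {"203.0.113.": "test-network"}
record["network"] = next(
    (name for prefix, name in KNOWN_RANGES.items() if record["ip"].startswith(prefix)),
    "unknown",
)

# Normalization: standardize field names and types across sources.
record["status"] = int(record["status"])
record["duration_ms"] = round(float(record.pop("duration")) * 1000)

print(record)
```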
Common Use Cases
Error investigation: Stack traces and request context in error logs diagnose bugs. What was the user doing? What was the application state? What sequence of events led here?
Performance analysis: Duration fields in logs reveal slow operations. That database query taking 10 seconds? The log shows exactly which query.
Security monitoring: Failed login patterns, unusual access, potential intrusion indicators—all visible in logs.
Audit trails: Who changed what, when. Required for compliance, invaluable for investigating unauthorized changes.
User behavior: How users actually interact with the system. Where do they struggle? What features do they use?
Integration with Other Observability
Logs work best alongside metrics and traces.
Metric to log drill-down: A dashboard shows a CPU spike. Click through to logs from that timeframe to see what caused it.
Trace to log correlation: A distributed trace shows a request took 5 seconds. The logs from each service it touched explain why.
Log-derived metrics: Count error logs to create error rate metrics. Parse durations from logs to track latency. The boundary between logs and metrics blurs productively.
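For instance, a sketch that derives both kinds of metrics from structured records; field names and values are invented:

```python
from statistics import quantiles

# Structured records with a numeric duration field (illustrative values).
records = [
    {"level": "ERROR", "duration_ms": 5100},
    {"level": "INFO",  "duration_ms": 120},
    {"level": "INFO",  "duration_ms": 95},
    {"level": "INFO",  "duration_ms": 240},
]

# Error rate: fraction of records at ERROR or above in this window.
error_rate = sum(r["level"] in ("ERROR", "CRITICAL") for r in records) / len(records)

# Latency percentile parsed out of the same records.
p95_ms = quantiles([r["duration_ms"] for r in records], n=20)[-1]

print(f"error_rate={error_rate:.2%} p95={p95_ms:.0f}ms")
```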
Best Practices
Structure your logs. JSON or similar formats. Consistent field names across services. The investment in structured logging pays back every investigation.
Include context. Request IDs, user IDs, operation details. A log should be understandable without reading the code that generated it.
Use levels appropriately. Normal operations aren't errors. Save WARNING and ERROR for actually concerning conditions. Otherwise you create the noise that hides the signal.
Protect sensitive data. Don't log passwords, credit card numbers, or personal information. Implement redaction (see the sketch after this list).
Correlate across services. Trace IDs or correlation IDs connecting related logs across services turn fragmented evidence into coherent stories.
Log what matters. The tension: log too little and you're blind during incidents; log too much and you drown in noise while paying for storage you can't usefully search. Find the balance.
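As a sketch of the redaction point above, using Python's standard logging filters; the card-number pattern is illustrative, and real redaction needs patterns tuned to your own data (tokens, emails, national IDs):

```python
import logging
import re

# Scrub things that look like card numbers before a record is emitted.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

class RedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = CARD_PATTERN.sub("[REDACTED]", str(record.msg))
        return True  # keep the record, just with sensitive content removed

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")
log.addFilter(RedactionFilter())

log.info("charge declined for card 4111 1111 1111 1111")
# -> "charge declined for card [REDACTED]"
```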