Database Connection Checks

Databases form the backbone of most applications, storing everything from user accounts to transaction records. When databases fail, applications often fail completely—not gracefully, not partially, but completely. Database connection checks verify databases accept connections and execute queries correctly, providing early warning before problems cascade into outages.

The Layers of Database Health

Database health isn't binary. A database can be running but not accepting connections. It can accept connections but not execute queries. It can execute queries but return stale data. Each layer of checking reveals different truths:

Layer 1: Is the process alive? A TCP port check (3306 for MySQL, 5432 for PostgreSQL, 1433 for SQL Server) confirms the database server process is running and network connectivity exists. This is the minimum bar.

Layer 2: Can you get in? Authentication testing verifies credentials work and the database accepts new connections. Connection pools might be exhausted. Authentication systems might be failing. The process running doesn't mean you can connect.

Layer 3: Can it do work? Query execution—even something as simple as SELECT 1—proves the query engine functions. A database can accept connections while being unable to process SQL due to corrupted state or resource exhaustion.

Layer 4: Is the data current? For replicated databases, replication lag determines whether you're reading reality or history. A replica that's hours behind is technically working but practically useless.

A successful connection proves the database server is running. It proves nothing about whether it can actually do its job.
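As an illustration, here is a minimal sketch of those four layers as sequential probes, assuming a PostgreSQL database and the psycopg2 driver; the hostname, credentials, and 30-second lag threshold are placeholders, not recommendations.

import socket
import psycopg2

HOST, PORT = "db.example.internal", 5432  # hypothetical endpoint
DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder

def check_layers():
    # Layer 1: is the process alive? A plain TCP connect to the database port.
    socket.create_connection((HOST, PORT), timeout=3).close()

    # Layer 2: can you get in? Authentication happens inside connect().
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # Layer 3: can it do work? The simplest possible query.
            cur.execute("SELECT 1")
            cur.fetchone()

            # Layer 4: is the data current? Replay lag when run on a replica
            # (NULL on a primary; can overstate lag when the primary is idle).
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            lag = cur.fetchone()[0]
            if lag is not None and lag > 30:  # example threshold
                raise RuntimeError(f"replica is {lag:.0f}s behind")
    finally:
        conn.close()

Running the layers in order also tells you where to start looking: if layer 2 fails, layers 3 and 4 are moot.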

Query Execution Testing

Beyond connections, verify the database actually executes queries:

Simple queries like SELECT 1 or SELECT NOW() require minimal processing but confirm the SQL engine works. If these fail, nothing else will succeed.

Health check tables provide more realistic testing. A dedicated table with one row allows queries like SELECT status FROM health_check WHERE id = 1, verifying read operations against actual data.

Write-then-read tests verify both operations work. Write test data, read it back, delete it. This catches scenarios where reads work but writes silently fail—a particularly insidious failure mode.

Transaction testing begins a transaction, performs operations, then rolls back. This validates transaction handling without modifying production data.

Query execution time reveals problems before they become critical. A query that normally completes in 10 milliseconds but suddenly takes 5 seconds is screaming that something is wrong.
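The write-then-read, rollback, and timing ideas can be combined into a single probe. A sketch, assuming PostgreSQL with psycopg2, a monitoring role allowed to write to the health_check table described above, and an arbitrary probe id and latency threshold:

import time
import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder
SLOW_MS = 100  # example threshold; tune to your normal query latency

def write_read_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        start = time.monotonic()
        with conn.cursor() as cur:
            # Write a probe row, then read it back.
            cur.execute(
                "INSERT INTO health_check (id, status) VALUES (%s, %s)",
                (999, "probe"),  # probe id is an assumption
            )
            cur.execute("SELECT status FROM health_check WHERE id = %s", (999,))
            row = cur.fetchone()
        elapsed_ms = (time.monotonic() - start) * 1000

        if row is None or row[0] != "probe":
            raise RuntimeError("read did not return the value just written")
        if elapsed_ms > SLOW_MS:
            print(f"WARN: write-then-read took {elapsed_ms:.1f} ms")
    finally:
        conn.rollback()  # discard the probe row; production data is untouched
        conn.close()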

Replication: The Hidden Failure Mode

Databases often use replication for redundancy and read scaling. Replication failures are dangerous precisely because they're invisible to applications:

Replication lag measures the delay between primary and replica. Replicas always lag slightly—that's physics. But minutes or hours of lag means applications reading from replicas see the past, not the present. A customer updates their email address, checks their profile, and sees the old one. They update again. Now you have two pending updates and a confused customer.

Broken replication might not be immediately obvious. The replica keeps serving queries from its existing data. Nothing errors. Nothing alerts. The primary and replica simply drift apart until the divergence becomes catastrophic.

Failover readiness determines whether a replica can become primary without data loss. A replica that's hours behind can't safely take over—you'd lose hours of transactions. Monitor whether your backup plan is actually viable.
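On the primary side, a lag probe can read pg_stat_replication directly. A sketch, assuming PostgreSQL 10 or later (for the replay_lag column), psycopg2, and placeholder connection details and threshold:

import psycopg2

PRIMARY_DSN = "host=primary.example.internal dbname=app user=monitor password=secret"  # placeholder
MAX_LAG_SECONDS = 60  # example threshold

def replication_lag_check():
    conn = psycopg2.connect(PRIMARY_DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # One row per connected replica; replay_lag is an interval, NULL while idle.
            cur.execute(
                "SELECT application_name, EXTRACT(EPOCH FROM replay_lag) "
                "FROM pg_stat_replication"
            )
            rows = cur.fetchall()
        if not rows:
            print("WARN: no replicas are streaming from this primary")
        for name, lag in rows:
            if lag is not None and lag > MAX_LAG_SECONDS:
                print(f"WARN: replica {name} is {lag:.0f}s behind")
    finally:
        conn.close()

The empty-result case matters as much as the lag numbers: a replica that has stopped replicating disappears from this view entirely.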

Connection Pool Health

Connection pools reuse database connections for efficiency. Pool problems cause intermittent failures that are maddeningly hard to diagnose:

Pool exhaustion happens when all connections are in use. New requests wait. Then they timeout. The database is fine—your application just can't reach it.

Wait times reveal approaching exhaustion. If applications increasingly wait for available connections, you're heading toward failures even if you haven't hit them yet.

Connection leaks occur when application code acquires connections but never returns them. The pool slowly drains. Everything works until it doesn't.

Saturation alerts should fire well before exhaustion. If you have 100 connections and regularly use 95, you're one traffic spike away from outages.
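Client-side pools generally expose their own in-use, idle, and wait-time counters, and those are the first place to look. The sketch below instead measures saturation from the server's side, assuming PostgreSQL and psycopg2, with an example 80% alert threshold:

import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder
SATURATION_ALERT = 0.80  # alert well before exhaustion; example threshold

def connection_saturation_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # Approximate: includes background workers on newer PostgreSQL versions.
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            in_use = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
        utilization = in_use / limit
        if utilization >= SATURATION_ALERT:
            print(f"WARN: {in_use}/{limit} connections in use ({utilization:.0%})")
        return utilization
    finally:
        conn.close()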

Database-Specific Monitoring

Each database system has metrics that reveal health beyond basic connectivity:

MySQL/MariaDB: Replication status via SHOW SLAVE STATUS (SHOW REPLICA STATUS on newer versions), InnoDB buffer pool hit ratio, slow query counts, table lock wait times.

PostgreSQL: Replication lag via pg_stat_replication, autovacuum status, dead tuple accumulation, connection counts against max_connections.

MongoDB: Replica set status, oplog size and window (determines how far behind a replica can fall and still catch up), cache hit ratio.

Redis: Memory usage relative to maxmemory, key eviction rates, persistence status, replication lag.

SQL Server: Buffer cache hit ratio, page life expectancy, deadlock frequency, Always On availability group health.

Generic connection checks tell you a database is alive. Database-specific metrics tell you whether it's healthy.
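As one concrete example, the PostgreSQL dead-tuple item above might be checked like this, assuming psycopg2, a placeholder DSN, and an arbitrary 20% threshold:

import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder
DEAD_TUPLE_RATIO = 0.20  # flag tables where more than 20% of rows are dead; example

def dead_tuple_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum "
                "FROM pg_stat_user_tables "
                "WHERE n_live_tup + n_dead_tup > 0"
            )
            for relname, live, dead, last_vacuum in cur.fetchall():
                if dead / (live + dead) > DEAD_TUPLE_RATIO:
                    print(f"WARN: {relname} has {dead} dead tuples "
                          f"(last autovacuum: {last_vacuum})")
    finally:
        conn.close()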

Locks, Blocks, and Deadlocks

Database locks coordinate concurrent access. When they go wrong, everything slows down:

Lock wait times measure how long queries wait for locks held by other queries. High wait times mean queries are fighting each other for resources.

Deadlocks occur when Query A waits for Query B which waits for Query A. Neither can proceed. Databases detect and break deadlocks by killing one query, but frequent deadlocks indicate design problems—usually transactions that acquire locks in inconsistent orders.

Long-running queries are often the culprits. A query running for minutes might hold locks that block dozens of other queries. Those blocked queries queue up, each holding their own locks, creating cascading delays.

Blocking chains show the full picture: Query A blocks B blocks C blocks D. Kill the root blocker and the chain clears. But you need visibility to find it.
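In PostgreSQL, the chain can be reconstructed from pg_stat_activity and pg_blocking_pids() (available since 9.6). A sketch with placeholder connection details:

import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder

BLOCKING_SQL = """
SELECT blocked.pid, blocked.query, blocking.pid, blocking.query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
"""

def blocking_chain_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            cur.execute(BLOCKING_SQL)
            for blocked_pid, blocked_q, blocking_pid, blocking_q in cur.fetchall():
                print(f"pid {blocking_pid} blocks pid {blocked_pid}: "
                      f"{(blocking_q or '')[:60]!r} is blocking {(blocked_q or '')[:60]!r}")
    finally:
        conn.close()

Walking these edges from blocked to blocker leads to the root of the chain, which is usually the long-running query worth investigating or killing.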

Resource Exhaustion

Databases consume resources. When resources run out, failures cascade:

Memory is critical because databases cache data in RAM. Insufficient memory forces constant disk reads. Performance doesn't degrade linearly—it falls off a cliff.

Disk space seems obvious but trips up organizations constantly. Databases can't write when disks fill. Transaction logs grow. Temporary tables accumulate. Alert at 80% full, not 99%.

CPU spikes indicate intensive processing or poorly optimized queries. A query missing an index might scan millions of rows, consuming CPU that should serve other requests.

Cache hit ratios reveal whether your memory allocation is sufficient. Hit ratios above 95% mean most queries find data in memory. Below that, you're hitting disk constantly.
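For PostgreSQL, the buffer cache hit ratio can be computed from pg_stat_database counters. A sketch, using the 95% guideline above as the threshold and a placeholder DSN:

import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder
MIN_HIT_RATIO = 0.95  # guideline from above; adjust for your workload

def cache_hit_ratio_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # Buffer cache hits vs. reads that had to go to disk, across all databases.
            cur.execute("SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database")
            hits, reads = cur.fetchone()
        total = (hits or 0) + (reads or 0)
        ratio = (hits or 0) / total if total else 1.0
        if ratio < MIN_HIT_RATIO:
            print(f"WARN: cache hit ratio {ratio:.1%} is below {MIN_HIT_RATIO:.0%}")
        return ratio
    finally:
        conn.close()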

Backup Validation

Backups are useless if they don't restore. Many organizations discover this at the worst possible moment:

Completion verification confirms backups finish on schedule. Backup jobs fail silently more often than anyone wants to admit.

Size tracking catches problems. A backup suddenly half its normal size didn't capture everything. A backup twice its normal size might indicate bloat or include unintended data.

Age monitoring ensures recent backups exist. If your newest backup is two weeks old, your backup system is broken regardless of whether anyone noticed.

Restore testing is the only true validation. Periodically restore backups to test systems. Run verification queries. Confirm the data is complete and consistent. Organizations that skip this step discover their backups are corrupt or incomplete during crisis recovery—when it's too late.

Location verification confirms backups are stored safely. Backups on the same disk as the database don't survive disk failures. Backups in the same datacenter don't survive datacenter failures.
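The completion, size, and age checks lend themselves to a simple script. A sketch for file-based backups; the directory, filename pattern, and thresholds are all assumptions, and nothing here replaces actual restore testing:

import time
from pathlib import Path

BACKUP_DIR = Path("/backups/app-db")  # hypothetical backup location
MAX_AGE_HOURS = 26                    # daily backups plus slack; example
SIZE_DEVIATION = 0.5                  # flag backups that shrink or grow by more than 50%

def backup_check():
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        raise RuntimeError("no backups found at all")

    newest = dumps[-1]
    age_hours = (time.time() - newest.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        print(f"WARN: newest backup {newest.name} is {age_hours:.0f}h old")

    # Compare the newest backup's size against the previous one.
    if len(dumps) >= 2:
        prev_size = dumps[-2].stat().st_size
        new_size = newest.stat().st_size
        if prev_size and abs(new_size - prev_size) / prev_size > SIZE_DEVIATION:
            print(f"WARN: backup size changed from {prev_size} to {new_size} bytes")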

High Availability Validation

High availability setups need testing, not trust:

Failover testing proves automated failover actually works. Simulate primary failures periodically. If you've never tested failover, you don't have high availability—you have hope.

Split-brain detection catches scenarios where multiple nodes believe they're primary. Both accept writes. Data diverges. Reconciliation becomes a nightmare.

Quorum monitoring applies to clustered databases. Losing enough nodes to lose quorum means writes stop. The cluster is technically still running but can't do useful work.

Cluster communication verification ensures all members can reach each other. Network partitions can isolate nodes, reducing redundancy even when individual nodes are healthy.
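A split-brain probe can be as simple as asking every node whether it believes it is the primary and alerting when the answer is not exactly one. The sketch below assumes a PostgreSQL streaming-replication cluster, psycopg2, and placeholder node names; clustered systems with their own membership APIs should use those instead:

import psycopg2

NODES = ["node1.example.internal", "node2.example.internal", "node3.example.internal"]  # placeholders
DSN_TEMPLATE = "host={host} dbname=app user=monitor password=secret connect_timeout=3"

def split_brain_check():
    primaries, unreachable = [], []
    for host in NODES:
        try:
            conn = psycopg2.connect(DSN_TEMPLATE.format(host=host))
            with conn.cursor() as cur:
                cur.execute("SELECT NOT pg_is_in_recovery()")
                if cur.fetchone()[0]:
                    primaries.append(host)
            conn.close()
        except psycopg2.OperationalError:
            unreachable.append(host)

    if len(primaries) > 1:
        print(f"CRITICAL: split brain, multiple primaries: {primaries}")
    elif not primaries:
        print("CRITICAL: no node is accepting writes")
    if unreachable:
        print(f"WARN: unreachable nodes: {unreachable}")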

Read/Write Split Monitoring

Applications often route reads to replicas and writes to the primary:

Read replica availability verification ensures applications have somewhere to send read queries. If replicas fail, reads either fail or suddenly hit the primary, potentially overwhelming it.

Write primary health confirmation ensures the database can accept writes. Some failure modes leave reads working while writes fail—applications can display data but not change it.

Routing verification confirms applications send queries to the right place. Misconfigured routing that sends writes to read-only replicas causes confusing errors.

Lag tolerance determines whether replica lag matters for your application. Some applications tolerate stale reads. Others require strong consistency and should not use replicas at all.
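Routing verification can be as simple as asking each endpoint which role it is playing. The sketch below assumes PostgreSQL endpoints, psycopg2, and placeholder DSNs; on MySQL the analogous signal is the read_only system variable:

import psycopg2

WRITE_DSN = "host=primary.example.internal dbname=app user=monitor password=secret"  # placeholder
READ_DSN = "host=replica.example.internal dbname=app user=monitor password=secret"   # placeholder

def _in_recovery(dsn):
    conn = psycopg2.connect(dsn, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0]
    finally:
        conn.close()

def routing_check():
    # The write endpoint must not be a read-only replica.
    if _in_recovery(WRITE_DSN):
        print("CRITICAL: write endpoint points at a read-only replica")
    # The read endpoint should be a replica; otherwise reads silently load the primary.
    if not _in_recovery(READ_DSN):
        print("WARN: read endpoint is serving from the primary")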

Security Monitoring

Database security requires ongoing verification:

Privilege changes warrant alerts. New superuser accounts or unexpected privilege escalations might indicate compromise.

Access pattern anomalies reveal potential data exfiltration. A service account suddenly querying entire customer tables instead of individual records needs investigation.

Encryption verification confirms encryption-at-rest and encryption-in-transit remain enabled. Configuration changes—intentional or accidental—can disable encryption silently.

Audit log integrity ensures you can investigate incidents. Gaps in audit logs prevent forensic analysis.
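As one example of privilege monitoring, the current superuser list can be compared against an expected set. A PostgreSQL sketch; the expected set and DSN are assumptions:

import psycopg2

DSN = "host=db.example.internal dbname=app user=monitor password=secret"  # placeholder
EXPECTED_SUPERUSERS = {"postgres"}  # accounts allowed to hold superuser; example

def superuser_check():
    conn = psycopg2.connect(DSN, connect_timeout=3)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT rolname FROM pg_roles WHERE rolsuper")
            current = {row[0] for row in cur.fetchall()}
        unexpected = current - EXPECTED_SUPERUSERS
        if unexpected:
            print(f"WARN: unexpected superuser accounts: {sorted(unexpected)}")
    finally:
        conn.close()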

Geographic Distribution

Globally distributed databases need location-aware monitoring:

Regional latency measurements reveal user experience. Users far from database servers experience slower response times regardless of how fast the database itself operates.

Cross-region replication lag is inherently higher due to network latency between regions. Set appropriate thresholds—what's alarming within a datacenter might be normal between continents.

Regional independence verification confirms one region's failure doesn't cascade to others. Properly configured distributed databases isolate failures. Improperly configured ones don't.
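A regional latency probe simply times a trivial query against each regional endpoint from a given probe location. A sketch with hypothetical endpoints and psycopg2; in practice you would run it from probes in several regions:

import time
import psycopg2

REGION_DSNS = {  # hypothetical per-region endpoints for the same logical database
    "us-east": "host=db-us-east.example.internal dbname=app user=monitor password=secret",
    "eu-west": "host=db-eu-west.example.internal dbname=app user=monitor password=secret",
}

def regional_latency_check():
    results = {}
    for region, dsn in REGION_DSNS.items():
        start = time.monotonic()
        try:
            conn = psycopg2.connect(dsn, connect_timeout=5)
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            conn.close()
            results[region] = (time.monotonic() - start) * 1000
        except psycopg2.OperationalError:
            results[region] = None  # region unreachable from this probe location
    for region, ms in results.items():
        print(f"{region}: " + ("unreachable" if ms is None else f"{ms:.0f} ms"))
    return results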

Key Takeaways

Database health has layers—connection, authentication, query execution, replication—and each layer can fail independently while the others appear healthy. Replication lag is invisible to applications but causes users to see stale data, making it one of the most dangerous failure modes to miss. Connection pool exhaustion causes application failures even when the database itself works perfectly. Query performance degradation is often gradual, making monitoring essential to catch problems before they become outages. Backup validation through actual restore testing is the only way to know backups work—discovering backup problems during recovery is discovering them too late.
