Heartbeat Monitoring

When a backup job fails, no user complains. No error page appears. No timeout triggers. The job simply doesn't run, and the silence is indistinguishable from success—until the day you need those backups and discover they stopped three weeks ago.

Heartbeat monitoring exists because silence lies.

The Inversion

Traditional monitoring asks: "Are you there?" It pings servers, checks endpoints, waits for responses. If a service answers, it's alive. If it doesn't, something's wrong.

Heartbeat monitoring inverts this relationship entirely. Instead of asking systems if they're alive, you wait for them to prove it. The monitored process must actively reach out—send a ping, hit an endpoint, update a timestamp—to demonstrate it ran successfully. If that signal stops arriving, something has failed.

This is a dead man's switch. The train operator must press the button every 30 seconds. If they become incapacitated, they stop pressing, and the brakes engage automatically. The system doesn't ask "are you okay?" It assumes you're not okay unless you keep proving otherwise.

For scheduled jobs, batch processes, and periodic tasks, this inversion is essential. These systems don't respond to requests—they wake up on schedules, do their work, and go silent again. Traditional monitoring can't see them. Heartbeat monitoring can.

What Heartbeats Catch

Scheduler failures: If cron dies, if the Kubernetes CronJob controller crashes, if the cloud scheduler hiccups—jobs simply don't start. Nothing breaks visibly. Heartbeat monitoring notices the silence.

Silent crashes: A job starts at 2 AM, processes half the data, hits an unhandled exception, and dies. The scheduler thinks it ran. Logs might be lost or buried. But the completion heartbeat never arrives.

Accidental misconfiguration: Someone disables a job "temporarily" and forgets. Someone deletes a schedule. Someone deploys a config change that breaks job registration. The jobs stop running, and without heartbeat monitoring, no one knows until consequences accumulate.

Resource starvation: The system is technically healthy—CPU looks fine, memory isn't exhausted—but something prevents jobs from starting. Heartbeat monitoring doesn't care why. It only cares that the expected signal didn't arrive.

When Jobs Should Heartbeat

After successful completion is the most common pattern. A daily report job runs at 2 AM, finishes around 2:05 AM, and sends a heartbeat. The monitoring system expects that heartbeat by 2:30 AM. If it doesn't arrive, something went wrong.
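A minimal sketch of the completion pattern in Python. The names `run_with_heartbeat` and `send_heartbeat` are hypothetical; in a real job the sender would make a request to whatever endpoint your monitoring setup provides.

```python
from typing import Callable


def run_with_heartbeat(job: Callable[[], None],
                       send_heartbeat: Callable[[], None]) -> None:
    """Run the job and send a heartbeat only if it finishes without raising."""
    job()             # any exception propagates, so the heartbeat is skipped
    send_heartbeat()  # reached only after successful completion


def nightly_report() -> None:
    # Stand-in for the real work (hypothetical job).
    print("report generated")


def print_heartbeat() -> None:
    # Stand-in sender (hypothetical); replace with a call to your monitor.
    print("heartbeat sent")


run_with_heartbeat(nightly_report, print_heartbeat)
```

The point of the ordering is that a crash anywhere inside the job leaves the monitor waiting, which is exactly the signal you want.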

At start and completion helps distinguish "job never started" from "job started but died midway." If both heartbeats are missing, the job never started, which points to the scheduler. If only the completion heartbeat is missing, the job started but crashed or hung during execution.
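The start-and-completion reasoning can be written down as a small decision function. This is an illustrative sketch; the `Diagnosis` enum and `diagnose` function are hypothetical names, and the choice to treat a lone completion heartbeat as a success is an assumption.

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "job completed"
    DIED_MIDWAY = "started but never completed: suspect the job"
    NEVER_STARTED = "no heartbeat at all: suspect the scheduler"


def diagnose(start_seen: bool, completion_seen: bool) -> Diagnosis:
    """Classify a run from which of its two heartbeats arrived in time."""
    if completion_seen:
        # Assumption: a completion heartbeat implies the job ran, even if
        # the start heartbeat was lost in transit.
        return Diagnosis.OK
    if start_seen:
        return Diagnosis.DIED_MIDWAY
    return Diagnosis.NEVER_STARTED


print(diagnose(start_seen=True, completion_seen=False).value)
```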

Periodically during execution works for long-running jobs. A job that processes millions of records for hours should heartbeat every few minutes to prove it's making progress, not hung in a deadlock or spinning on a failed database connection.
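For a long-running job, one possible shape is to heartbeat from inside the processing loop, so the signal only goes out while records are actually being completed. The function name, the five-minute default, and the injectable `clock` are all assumptions made for this sketch.

```python
import time
from typing import Callable, Iterable


def process_with_progress_heartbeats(
    records: Iterable[object],
    handle: Callable[[object], None],
    send_heartbeat: Callable[[], None],
    every_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Process records, heartbeating once `every_seconds` has elapsed.

    The heartbeat is sent only after a record finishes, so a handler that
    hangs or deadlocks stops the heartbeats as well.
    """
    last_beat = clock()
    processed = 0
    for record in records:
        handle(record)
        processed += 1
        now = clock()
        if now - last_beat >= every_seconds:
            send_heartbeat()
            last_beat = now
    return processed
```

Sending the heartbeat from a separate timer thread would be simpler, but it would keep firing even when the main loop is stuck, which defeats the purpose.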

With execution metrics elevates heartbeats from "I'm alive" to "I'm alive and here's what happened": row counts, error counts, execution duration. When these values suddenly change (90% fewer rows processed, 10x more errors), something is wrong even if the heartbeat arrived.
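A rough sketch of checking heartbeat metrics against a baseline run. The `RunMetrics` fields mirror the metrics named above; the thresholds and the `looks_anomalous` name are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    rows_processed: int
    errors: int
    duration_seconds: float  # carried along for reporting; not checked here


def looks_anomalous(current: RunMetrics, baseline: RunMetrics,
                    min_row_ratio: float = 0.5,
                    max_error_ratio: float = 5.0) -> bool:
    """Flag a run whose row or error counts drift far from a baseline run."""
    too_few_rows = current.rows_processed < baseline.rows_processed * min_row_ratio
    # A zero-error baseline is treated as one error so the ratio stays usable.
    too_many_errors = current.errors > max(baseline.errors, 1) * max_error_ratio
    return too_few_rows or too_many_errors


baseline = RunMetrics(rows_processed=10_000, errors=2, duration_seconds=300.0)
tonight = RunMetrics(rows_processed=1_000, errors=2, duration_seconds=290.0)
print(looks_anomalous(tonight, baseline))  # prints True: 90% fewer rows
```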

Timing and Grace

Heartbeat timing requires two numbers: the expected interval and the grace period.

The expected interval matches the job schedule. Daily job? Daily heartbeat. Hourly sync? Hourly heartbeat.

The grace period accommodates normal variation. If a job usually takes 5 minutes but sometimes takes 20 during heavy load, the grace period should be generous enough to avoid false alarms. Thirty minutes past expected completion gives breathing room without masking genuine failures.
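The two numbers combine into a single deadline. The sketch below assumes the deadline is measured from the previous heartbeat (last heartbeat + interval + grace); a monitor could equally anchor it to the schedule itself. `is_overdue` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_heartbeat: datetime, interval: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """True once `now` is past last_heartbeat + interval + grace."""
    return now > last_heartbeat + interval + grace


last = datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc)   # yesterday's heartbeat
daily = timedelta(days=1)
grace = timedelta(minutes=30)

# Deadline is 02:35 UTC on Jan 2 under these assumptions.
print(is_overdue(last, daily, grace, datetime(2024, 1, 2, 2, 30, tzinfo=timezone.utc)))
print(is_overdue(last, daily, grace, datetime(2024, 1, 2, 2, 36, tzinfo=timezone.utc)))
```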

Timezones matter. A job scheduled for 2 AM Eastern Time runs at 07:00 UTC in winter (EST, UTC-5) and 06:00 UTC in summer (EDT, UTC-4). If the monitoring system holds a fixed UTC expectation, it will be an hour off for half the year, so define the heartbeat schedule in the same timezone as the job. Get this wrong and you'll chase phantom alerts twice a year.
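The daylight-saving shift is easy to confirm with Python's standard `zoneinfo` module (Python 3.9+, which relies on the system timezone database or the `tzdata` package). The dates and the helper name are arbitrary choices for this example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")


def utc_time_of_local_run(year: int, month: int, day: int, hour: int) -> datetime:
    """Convert a wall-clock run time in US Eastern time to UTC."""
    local = datetime(year, month, day, hour, tzinfo=EASTERN)
    return local.astimezone(timezone.utc)


winter = utc_time_of_local_run(2024, 1, 15, 2)  # EST is in effect
summer = utc_time_of_local_run(2024, 7, 15, 2)  # EDT is in effect
print(winter.hour, summer.hour)  # prints "7 6"
```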

The Gap in Traditional Monitoring

Consider a backup server. Traditional uptime monitoring pings it every minute—"are you there?"—and the server responds. Green checkmarks everywhere. The server is up.

But the backup jobs haven't run in two weeks. The scheduler configuration was accidentally overwritten during a deployment. The server is technically healthy. It just isn't doing its job.

Log monitoring could catch this if someone noticed the absence of backup log entries. But absence is hard to notice in high-volume environments. You're looking for something that isn't there, in a sea of things that are.

Infrastructure monitoring shows healthy CPU, memory, and disk. Nothing looks wrong because nothing is actively failing. The failure mode is inaction, and inaction doesn't show up on dashboards.

Heartbeat monitoring fills this gap by expecting action. It doesn't trust silence. It requires proof of life.

Implementation

The simplest implementation: an HTTP endpoint that records timestamps. Jobs POST to it after completion. A background process checks for overdue heartbeats and triggers alerts.
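As a rough illustration of that simplest form, here is a standard-library sketch: an in-memory store, an HTTP handler that records a timestamp on POST, and a check a background loop could call. The URL scheme `/heartbeat/<job-name>`, the class names, and the single shared `max_age_seconds` are assumptions; a real deployment would also need persistence and alert delivery.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeartbeatStore:
    """Thread-safe record of the last heartbeat time per job."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seen = {}

    def record(self, job, now=None):
        with self._lock:
            self._last_seen[job] = time.time() if now is None else now

    def overdue(self, expected_jobs, max_age_seconds, now=None):
        """Return expected jobs whose last heartbeat is too old or missing."""
        now = time.time() if now is None else now
        with self._lock:
            return [job for job in expected_jobs
                    if now - self._last_seen.get(job, float("-inf")) > max_age_seconds]


STORE = HeartbeatStore()


class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical URL scheme: POST /heartbeat/<job-name>
        prefix = "/heartbeat/"
        if self.path.startswith(prefix) and len(self.path) > len(prefix):
            STORE.record(self.path[len(prefix):])
            self.send_response(204)
        else:
            self.send_response(404)
        self.end_headers()


def serve(port=8000):
    """Blocks forever; run it in its own process or thread."""
    HTTPServer(("127.0.0.1", port), HeartbeatHandler).serve_forever()
```

Passing the list of expected jobs into `overdue` matters: a job that has never sent a heartbeat has no entry in the store, and it should still be reported.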

Dedicated services like Healthchecks.io, Cronitor, and Dead Man's Snitch offer exactly this with polished interfaces and reliable alerting. General monitoring platforms like Datadog and New Relic include heartbeat monitoring among their capabilities.

However you implement it, several details matter:

Retry heartbeat delivery. If the monitoring endpoint is temporarily unreachable, jobs should retry rather than silently failing to heartbeat. Network blips shouldn't trigger false alerts.
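One way to retry delivery is exponential backoff around the send call. In this sketch `send` is any callable that raises `OSError` on network failure (which covers `urllib.error.URLError` and the built-in connection and timeout errors); the attempt count and delays are illustrative assumptions.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], None], attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` up to `attempts` times, doubling the delay between tries.

    Returns True on the first success and False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            send()
            return True
        except OSError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False
```

Keep the total retry time well inside the grace period, otherwise a late but successful heartbeat can still arrive after the alert has fired.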

Authenticate heartbeats. Public endpoints need tokens or secrets. Otherwise, malicious actors—or misconfigured systems—could send fake heartbeats masking real failures.
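A minimal sketch of token checking on the receiving side, assuming one shared secret per job. The `TOKENS` table and its contents are hypothetical; `hmac.compare_digest` is used so the comparison time does not leak how much of the token matched.

```python
import hmac

# Hypothetical per-job secrets; in practice load these from a secret store.
TOKENS = {"nightly-backup": "example-token-change-me"}


def is_authorized(job: str, presented_token: str) -> bool:
    """Accept a heartbeat only if the token matches the one issued to the job."""
    expected = TOKENS.get(job)
    if expected is None:
        return False  # unknown job: reject rather than create it implicitly
    return hmac.compare_digest(presented_token.encode(), expected.encode())


print(is_authorized("nightly-backup", "example-token-change-me"))
print(is_authorized("nightly-backup", "guess"))
```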

Monitor the heartbeat monitor. If your heartbeat monitoring service goes down, it can't receive heartbeats or send alerts. The heartbeat monitor should itself heartbeat to an external service. Turtles all the way down, but necessary.

Test failure detection. Intentionally skip a scheduled job. Verify the alert fires. It's not enough to hope the system works—prove it works before you need it.
