
Overview

A dead-man’s switch is a safety mechanism that triggers an alert when an expected event doesn’t happen. For CronJobs, this means alerting when a job fails to run on its schedule.
Traditional monitoring waits for failures; dead-man’s switch monitoring detects the absence of success, catching:
  • Misconfigured schedules
  • Resource quota exhaustion preventing job creation
  • Controller or cluster issues
  • Accidentally suspended CronJobs

How It Works

The dead-man’s switch analyzer checks if enough time has elapsed since the last successful job execution:
// From internal/analyzer/sla.go:161-250
func (a *analyzer) CheckDeadManSwitch(ctx context.Context, 
    cronJob *batchv1.CronJob, 
    config *v1alpha1.DeadManSwitchConfig) (*DeadManResult, error)
1. Fetch Last Execution

Query the database for the last successful execution of the CronJob:
lastSuccess, _ := a.store.GetLastSuccessfulExecution(ctx, cronJobNN)
if lastSuccess != nil {
    result.TimeSinceSuccess = time.Since(lastSuccess.CompletionTime)
}
2. Calculate Expected Interval

Determine how often the job should run, either from:
  • Fixed interval: maxTimeSinceLastSuccess (e.g., 25h for daily jobs)
  • Auto-detected: Parse cron schedule and add buffer
if config.AutoFromSchedule != nil && config.AutoFromSchedule.Enabled {
    interval, _ := parseScheduleInterval(cronJob.Spec.Schedule)
    buffer := config.AutoFromSchedule.Buffer  // Default: 1h
    expectedInterval = interval + buffer
}
3. Calculate Missed Schedules

Count how many scheduled runs were missed:
missedCount := int32(timeSinceLastRun / expectedInterval)
threshold := config.AutoFromSchedule.MissedScheduleThreshold  // Default: 1

if missedCount >= threshold {
    result.Triggered = true
}
4. Dispatch Alert

If triggered, create an alert with severity critical (configurable via severityOverrides):
type: DeadManTriggered
severity: critical
message: "No jobs have run for 26h. Missed 1 scheduled run(s)"

Configuration

Fixed Interval

Specify a fixed time window:
deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 25h
maxTimeSinceLastSuccess (duration): Alert if no successful execution within this duration.

Choosing the right value:
  • For daily jobs (0 0 * * *): Use 25h (24h schedule + 1h buffer)
  • For hourly jobs (0 * * * *): Use 75m (60m schedule + 15m buffer)
  • For weekly jobs (0 0 * * 0): Use 169h (168h schedule + 1h buffer)
Always add a buffer to account for execution time and scheduling jitter. Kubernetes doesn’t guarantee exact schedule timing.

Auto-Detection from Schedule

Automatically parse the cron schedule:
deadManSwitch:
  enabled: true
  autoFromSchedule:
    enabled: true
    buffer: 1h
    missedScheduleThreshold: 2
autoFromSchedule.enabled (bool): Enable auto-detection from the CronJob’s schedule field.

autoFromSchedule.buffer (duration, default: 1h): Extra time added to the detected interval. For a daily job (0 0 * * *), the detected interval is 24h. With a 1h buffer, the total expected interval is 25h.

autoFromSchedule.missedScheduleThreshold (int32, default: 1): Number of missed schedules before alerting. Set to 2 to allow one missed run (useful for flaky jobs):
missedScheduleThreshold: 2  # Alert only after 2 consecutive misses

Schedule Parsing

The analyzer parses cron expressions using the standard 5-field format:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday=0)
│ │ │ │ │
* * * * *
schedule: "0 0 * * *"
autoFromSchedule:
  enabled: true
  buffer: 1h
# Interval: 24h, Expected: 25h
Auto-detection may produce incorrect intervals for:
  • Irregular schedules (e.g., 0 0 1,15 * *, which runs on the 1st and 15th of the month)
  • Timezone-specific schedules with DST transitions
  • Very complex cron expressions
For these cases, use maxTimeSinceLastSuccess with a manually calculated interval.
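For example, with 0 0 1,15 * * the longest regular gap runs from the 15th to the 1st of the next month (up to 17 days in a 31-day month), so a fixed window sized to that worst case avoids false alarms. A sketch (the exact numbers are illustrative):

```yaml
deadManSwitch:
  enabled: true
  # 0 0 1,15 * *: worst-case gap is 17 days (15th -> 1st of a 31-day month)
  maxTimeSinceLastSuccess: 410h  # 408h worst case + 2h buffer
```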

Caching

Schedule parsing is expensive. The analyzer uses an LRU cache to avoid repeated parsing:
// From internal/analyzer/sla.go:18-33
var (
    scheduleCache     *lru.Cache[string, cron.Schedule]
    scheduleCacheOnce sync.Once
)

func getScheduleCache() *lru.Cache[string, cron.Schedule] {
    scheduleCacheOnce.Do(func() {
        cache, _ := lru.New[string, cron.Schedule](1000)
        scheduleCache = cache
    })
    return scheduleCache
}
The cache stores up to 1000 unique schedules. This is sufficient for most clusters (CronJobs typically share common schedules like 0 0 * * *).

Alert Behavior

When the dead-man’s switch triggers:

Alert Message

No jobs have run for 26h15m. Missed 1 scheduled run(s) (threshold: 1, expected interval: 25h)
The message includes:
  • Time elapsed since last run
  • Number of missed schedules
  • Configured threshold
  • Expected interval

Alert Severity

Default severity is critical. Override via:
alerting:
  severityOverrides:
    deadManTriggered: warning

Duplicate Suppression

Once triggered, the alert remains active until:
  • A job succeeds (clears the alert)
  • The alert is manually cleared
  • The monitor is deleted
Duplicate suppression (default: 1h) prevents re-alerting within the suppression window.

Edge Cases

No Execution History

For newly created CronJobs with no execution history:
// From internal/analyzer/sla.go:226-230
if lastExec == nil {
    if cronJob.CreationTimestamp.Add(expectedInterval).Before(time.Now()) {
        elapsed := time.Since(cronJob.CreationTimestamp.Time)
        missedCount = int32(elapsed / expectedInterval)
    }
}
The analyzer calculates missed schedules from the CronJob’s creation time.

Suspended CronJobs

If suspendedHandling.pauseMonitoring is true (default), dead-man’s switch checks are skipped for suspended CronJobs to avoid false alarms.
suspendedHandling:
  pauseMonitoring: true  # Don't alert while suspended
  alertIfSuspendedFor: 168h  # But alert if suspended > 7 days

Timezone Handling

The analyzer uses the CronJob’s spec.timeZone field (Kubernetes 1.25+) when parsing schedules:
// From internal/controller/cronjobmonitor_controller.go:878-883
loc := time.UTC
if timezone != nil && *timezone != "" {
    if l, err := time.LoadLocation(*timezone); err == nil {
        loc = l
    }
}
For CronJobs without a timezone, UTC is assumed.

Examples

Critical Daily Backup

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
  namespace: databases
spec:
  selector:
    matchLabels:
      type: backup
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h  # Daily with 1h buffer
  alerting:
    channelRefs:
      - name: pagerduty-dba
        severities: [critical]
    severityOverrides:
      deadManTriggered: critical  # Wake up on-call

Flexible Reporting Jobs

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: weekly-reports
spec:
  selector:
    matchLabels:
      type: report
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 2h
      missedScheduleThreshold: 2  # Allow 1 missed run
  alerting:
    severityOverrides:
      deadManTriggered: warning  # Non-critical reports

High-Frequency Jobs

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: frequent-sync
spec:
  selector:
    matchNames:
      - every-5min-sync
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 10m  # 5m schedule + 5m buffer
  alerting:
    alertDelay: 2m  # Wait 2m to avoid transient alerts

Monitoring the Monitor

Expose dead-man’s switch metrics via Prometheus:
# Time since last successful run
guardian_cronjob_last_success_timestamp_seconds

# Dead-man switch status (1 = triggered, 0 = ok)
guardian_cronjob_deadman_triggered
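These metrics can back an external alert as a second layer of defense. A sketch of a Prometheus alerting rule assuming the metric names above (the rule name and the cronjob label are illustrative, not guaranteed by the exporter):

```yaml
groups:
  - name: guardian-deadman
    rules:
      - alert: CronJobDeadManTriggered
        expr: guardian_cronjob_deadman_triggered == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Dead-man's switch triggered for {{ $labels.cronjob }}"
```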

Troubleshooting

Problem: Auto-detection calculates the wrong interval for schedules like 0 0 1 * * (monthly).
Solution: Use a fixed maxTimeSinceLastSuccess:
deadManSwitch:
  maxTimeSinceLastSuccess: 744h  # 31 days
Problem: Dead-man’s switch triggers during scheduled downtime.
Solution: Configure maintenance windows:
maintenanceWindows:
  - name: monthly-upgrade
    schedule: "0 0 1 * *"
    duration: 4h
    suppressAlerts: true
Problem: Alert fires immediately for a brand new CronJob.
Solution: The analyzer waits for expectedInterval to elapse from the creation time before alerting. If you see immediate alerts, check:
  • Is the CronJob actually running? (kubectl get jobs)
  • Is the schedule valid? (kubectl describe cronjob)

Next Steps

SLA Tracking

Monitor success rates and detect regressions

Alert Configuration

Customize alert behavior and routing

Suggested Fixes

Automatically suggest remediation actions