Overview
The CronJobMonitor is a namespaced Custom Resource that defines monitoring configuration for Kubernetes CronJobs. It enables dead-man’s switch monitoring, SLA tracking, alerting, and data retention policies.
API Group: guardian.illenium.net/v1alpha1
Kind: CronJobMonitor
Scope: Namespaced
Basic Example
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
  alerting:
    enabled: true
    channelRefs:
      - name: slack-ops
        severities: [critical, warning]
Spec Fields
selector
Specifies which CronJobs to monitor. An empty selector matches all CronJobs in the monitor’s namespace.
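As an illustration of the empty-selector default, a minimal monitor might look like this (the name and namespace are placeholders, not from this reference):

```yaml
# Matches every CronJob in the monitor's own namespace because
# spec.selector is left empty.
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: all-jobs          # illustrative name
  namespace: staging      # illustrative namespace
spec:
  selector: {}
```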
matchLabels
Selects CronJobs by exact label matches.
matchLabels:
  tier: critical
  app: backup

matchExpressions
Selects CronJobs by label expressions. Each expression is a LabelSelectorRequirement. Supported operators: In, NotIn, Exists, DoesNotExist.
matchExpressions:
  - key: tier
    operator: In
    values: [critical, high]
  - key: environment
    operator: NotIn
    values: [dev, test]

matchNames
Explicitly lists CronJob names to monitor. Only valid when watching a single namespace.
matchNames:
  - daily-backup
  - weekly-report

namespaces
Explicitly lists namespaces to watch for CronJobs. If empty and namespaceSelector is not set, watches only the monitor’s namespace.
namespaces:
  - production
  - staging

namespaceSelector
Selects namespaces by labels. CronJobs in matching namespaces will be monitored. Uses the standard LabelSelector format.
namespaceSelector:
  matchLabels:
    env: production

An all-namespaces option watches CronJobs in all namespaces (except globally ignored ones); it takes precedence over namespaces and namespaceSelector.

deadManSwitch
Configures dead-man’s switch behavior to alert if CronJobs don’t execute successfully within expected intervals.
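As a sketch, a schedule-derived variant could look like the fragment below. The boolean shape of autoFromSchedule is an assumption: this reference names the field but does not show its type.

```yaml
deadManSwitch:
  enabled: true
  # Assumed boolean form; derives the expected interval from the
  # CronJob's own schedule instead of a fixed duration.
  autoFromSchedule: true
```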
enabled
Turns on dead-man’s switch monitoring.

maxTimeSinceLastSuccess
Alerts if there is no successful execution within this duration. Use for fixed-interval monitoring. Mutually exclusive with autoFromSchedule.
Example: "25h" for daily jobs with a 1-hour buffer.

autoFromSchedule
Auto-calculates the expected interval from the CronJob’s schedule. Alternative to maxTimeSinceLastSuccess.

sla
Configures SLA tracking for success rates and execution durations.
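For instance, a stricter SLA than the basic example, combining the success-rate and duration fields that also appear in the Complete Example (all values here are illustrative):

```yaml
sla:
  enabled: true
  minSuccessRate: 99              # alert below 99% success
  windowDays: 30                  # 30-day rolling window
  maxDuration: 15m                # alert if any run exceeds 15 minutes
  durationRegressionThreshold: 25 # alert if P95 grows 25% over baseline
  durationBaselineWindowDays: 14
```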
enabled
Turns on SLA tracking and alerting.

minSuccessRate
Minimum acceptable success rate percentage. Alerts when the success rate falls below this threshold.
Validation: range 0-100.

windowDays
Rolling window in days for the success rate calculation.
Validation: minimum value is 1.

maxDuration
Maximum acceptable execution duration. Alerts if any job execution exceeds this duration.
Example: "30m" for 30 minutes.

durationRegressionThreshold
Alerts if P95 duration increases by this percentage compared to the baseline.
Validation: range 1-1000. Example: 50 means alert if P95 increases by 50% or more.

durationBaselineWindowDays
Number of days for calculating the baseline duration.
Validation: minimum value is 1.

suspendedHandling
Configures behavior for suspended CronJobs.
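This reference does not expand the section’s properties, but the Complete Example below uses two of them. An illustrative fragment, with comments inferred from the field names rather than stated in this reference:

```yaml
suspendedHandling:
  pauseMonitoring: true       # presumably pauses checks while suspended
  alertIfSuspendedFor: 168h   # presumably alerts after a week of suspension
```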
maintenanceWindows
Defines scheduled maintenance periods during which alerts can be suppressed.
name
Identifies this maintenance window.

schedule
Cron expression defining when the window starts.
Example: "0 2 * * 0" for every Sunday at 2 AM.

duration
How long the maintenance window lasts.
Example: "4h" for 4 hours.

timezone
Timezone for the schedule.
Example: "America/New_York".

suppressAlerts
Whether to suppress alerts during this window.
maintenanceWindows:
  - name: weekly-maintenance
    schedule: "0 2 * * 0"
    duration: 4h
    timezone: America/New_York
    suppressAlerts: true
alerting
Configures alert channels and behavior.
enabled
Turns on alerting for this monitor.

channelRefs
References to cluster-scoped AlertChannel custom resources. Each item names a channel and lists the severities routed to it (see the examples).
Specifies what context to include in alert notifications.
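A sketch of a context configuration using the fields documented below (the values are illustrative):

```yaml
includeContext:
  logs: true
  logLines: 500       # within the validated 1-10000 range
  events: true
  podStatus: true
  suggestedFixes: true
```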
logs
Include pod logs in alerts.

logLines
Number of log lines to include.
Validation: range 1-10000.

Two further log options select a specific container name for logs (defaults to the first container) and whether to include init container logs.

events
Include Kubernetes events.

podStatus
Include pod status details.

suggestedFixes
Include suggested fixes based on failure patterns.

suppressDuplicatesFor
Prevents re-sending the same alert within this time window.

alertDelay
Delays alert dispatch to allow transient issues to resolve. If the issue resolves (e.g., the next job succeeds) before the delay expires, the alert is cancelled and never sent. Useful for flaky jobs.
Example: "5m" waits 5 minutes before sending failure alerts.

severityOverrides
Customizes severity for specific alert types. Only critical and warning are valid.
missedSchedule
Severity for missed schedule alerts.
Validation: enum critical, warning.

jobFailed
Severity for job failure alerts.
Validation: enum critical, warning.

slaBreached
Severity for SLA breach alerts.
Validation: enum critical, warning.

deadManTriggered
Severity for dead-man’s switch alerts.
Validation: enum critical, warning.

durationRegression
Severity for duration regression alerts.
Validation: enum critical, warning.

severityOverrides:
  jobFailed: critical
  slaBreached: warning
  deadManTriggered: critical
Defines custom patterns for suggesting fixes based on failure context. These are merged with built-in patterns, with custom patterns taking priority.
name
Identifies this pattern. Use built-in names like "oom-killed" to override built-in patterns.

match
Match criteria; at least one field must be specified.

exitCode
Matches specific exit codes. Example: 137 for OOM killed.

logPattern
Matches log content using regex. Example: "connection timed out|ETIMEDOUT".

Further match fields cover the container termination reason as an exact, case-insensitive match (example: "OOMKilled"), the termination reason as a regex (example: ".*Killed.*"), and event messages as a regex.

suggestion
Fix text to display. Supports Go template variables:
{{.Namespace}} - CronJob namespace
{{.Name}} - CronJob name
{{.ExitCode}} - Container exit code
{{.Reason}} - Termination reason
{{.JobName}} - Job name

priority
Determines evaluation order (higher = checked first). Built-in patterns use priorities 1-100. Use values >100 to override built-ins.
suggestedFixPatterns:
  - name: custom-oom
    match:
      exitCode: 137
    suggestion: "Container OOM killed. Increase memory in {{.Namespace}}/{{.Name}}"
    priority: 150
  - name: connection-timeout
    match:
      logPattern: "connection timed out|ETIMEDOUT"
    suggestion: "Network timeout. Check connectivity to external services."
    priority: 50
dataRetention
Configures data lifecycle management for this monitor’s execution history.
retentionDays
Overrides global retention for this monitor’s execution history. If not set, uses the global history-retention.default-days setting.
Validation: minimum value is 1.

onCronJobDeletion
Defines behavior when a monitored CronJob is deleted.
Validation: enum retain, purge, purge-after-days.
retain: keep all historical data
purge: immediately delete all data
purge-after-days: wait before purging (requires purgeAfterDays)

purgeAfterDays
Days to wait before purging data. Required when onCronJobDeletion is purge-after-days.
Validation: minimum value is 0.

onRecreation
Defines behavior when a CronJob is recreated (detected via a UID change).
Validation: enum retain, reset.
retain: keep the old history
reset: delete history from the old UID

storeLogs
Enables storing job logs in the database. If not set, uses the global --storage.log-storage-enabled setting.

logRetentionDays
How long to keep stored logs. If not set, uses the same value as retentionDays.
Validation: minimum value is 1.

maxLogSizeKB
Maximum log size to store per execution, in KB. If not set, uses the global --storage.max-log-size-kb setting.
Validation: minimum value is 1.

storeEvents
Enables storing Kubernetes events in the database. If not set, uses the global --storage.event-storage-enabled setting.

dataRetention:
  retentionDays: 60
  onCronJobDeletion: purge-after-days
  purgeAfterDays: 7
  onRecreation: retain
  storeLogs: true
  logRetentionDays: 30
  maxLogSizeKB: 200
  storeEvents: true
Status Fields
The status subresource provides real-time monitoring state and metrics.

The generation last processed by the controller.

The monitor’s operational state. Values: Initializing, Active, Degraded, Error.

When the controller last reconciled this monitor.
Aggregate counts across all monitored CronJobs.
Total number of CronJobs being monitored.
Number of healthy CronJobs.
Number of CronJobs with warnings.
Number of CronJobs in critical state.
Number of suspended CronJobs.
Number of CronJobs with currently running jobs.
Total number of active alerts across all CronJobs.
Per-CronJob status information.
CronJob name.
CronJob namespace.
Health status. Values: healthy, warning, critical, suspended, unknown.

Whether the CronJob is suspended.
When the last Job succeeded.
When the last Job failed.
Duration of the last completed Job.
When the next Job will be created.
SLA metrics for this CronJob.
Success rate percentage.
Total number of job runs.
Number of successful runs.
Number of failed runs.
Average execution duration in seconds.
50th percentile (median) duration in seconds.
95th percentile duration in seconds.
99th percentile duration in seconds.
Currently running jobs for this CronJob.
Current alerts for this CronJob.
Alert type (e.g., “JobFailed”, “MissedSchedule”, “SLABreached”).
Alert severity (“critical” or “warning”).
Human-readable alert description.
When the alert became active.
When the alert was last sent to channels.
Container exit code (for JobFailed alerts).
Failure reason (e.g., “OOMKilled”, “Error”).
Actionable guidance for resolving the alert.
Standard Kubernetes condition array. Common condition types:
Ready: Monitor is operational and tracking CronJobs
Progressing: Monitor is initializing or updating
Degraded: Monitor is experiencing issues but operational
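To make the status fields concrete, here is a hypothetical snapshot. The field keys (phase, cronJobs, health, and so on) are illustrative guesses only; the phase values, health values, and condition types are the ones documented above.

```yaml
# Hypothetical status snapshot; keys are assumptions, not from this reference.
status:
  phase: Active
  cronJobs:
    - name: daily-backup
      namespace: production
      health: warning
  conditions:
    - type: Ready
      status: "True"
```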
Complete Example
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: full-featured
  namespace: production
spec:
  selector:
    matchExpressions:
      - key: tier
        operator: In
        values: [critical, high]
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    enabled: true
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
  suspendedHandling:
    pauseMonitoring: true
    alertIfSuspendedFor: 168h
  maintenanceWindows:
    - name: weekly-maintenance
      schedule: "0 2 * * 0"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true
  alerting:
    enabled: true
    channelRefs:
      - name: pagerduty-oncall
        severities: [critical]
      - name: slack-ops
        severities: [critical, warning]
    severityOverrides:
      jobFailed: critical
      slaBreached: warning
      missedSchedule: warning
      deadManTriggered: critical
      durationRegression: warning
    suppressDuplicatesFor: 1h
    alertDelay: 5m
    includeContext:
      logs: true
      logLines: 100
      events: true
      podStatus: true
      suggestedFixes: true
    suggestedFixPatterns:
      - name: custom-oom
        match:
          exitCode: 137
        suggestion: "OOM killed. Increase memory for {{.Namespace}}/{{.Name}}"
        priority: 150
  dataRetention:
    retentionDays: 60
    onCronJobDeletion: purge-after-days
    purgeAfterDays: 7
    onRecreation: retain
    storeLogs: true
    logRetentionDays: 30
    maxLogSizeKB: 200
    storeEvents: true