Troubleshooting - CronJob Guardian

This guide covers common issues you may encounter when using CronJob Guardian and how to resolve them.

Monitor Not Discovering CronJobs

Symptoms

CronJobMonitor shows totalCronJobs: 0 even though CronJobs exist in the namespace.

Possible Causes and Solutions

Verify the monitor is in Active phase

kubectl get cronjobmonitor my-monitor -n production

If the phase is not Active, check the monitor’s conditions:

kubectl describe cronjobmonitor my-monitor -n production

Check the selector configuration

Ensure your selector matches the CronJob labels:

# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your selector matches
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 10 selector
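
A direct way to test a selector is to pass the same labels to kubectl and see which CronJobs come back. The sketch below assumes the monitor uses plain matchLabels. The function name and the label app=nightly-backup are hypothetical examples, not part of Guardian.

```shell
# List the CronJobs in a namespace that carry the given labels.
# Usage: cronjobs_matching <namespace> <label-selector>
cronjobs_matching() {
  kubectl get cronjob -n "$1" -l "$2" -o name
}

# Example (hypothetical label):
#   cronjobs_matching production app=nightly-backup
```

Empty output means no CronJob in that namespace carries those labels, so the monitor has nothing to discover.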

Verify namespace permissions

For cluster-wide or multi-namespace monitoring, ensure the Guardian service account has proper RBAC permissions:

kubectl auth can-i list cronjobs --as=system:serviceaccount:cronjob-guardian:cronjob-guardian-controller -n production

The command should print yes.
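
Listing is rarely the only permission a controller needs. The sketch below repeats the same kubectl auth can-i check for several verbs and resources. The verb and resource lists are assumptions about what Guardian requires, so adjust them to the ClusterRole shipped with your installation.

```shell
# Report whether the Guardian service account may get, list, and watch
# CronJobs and Jobs in one namespace. Verbs and resources are assumptions.
# Usage: check_guardian_rbac <namespace>
check_guardian_rbac() {
  ns=$1
  sa=system:serviceaccount:cronjob-guardian:cronjob-guardian-controller
  for resource in cronjobs jobs; do
    for verb in get list watch; do
      # can-i exits non-zero on "no", so keep going and print the answer.
      answer=$(kubectl auth can-i "$verb" "$resource" --as="$sa" -n "$ns" 2>/dev/null) || true
      echo "$verb $resource: ${answer:-unknown}"
    done
  done
}

# Example: check_guardian_rbac production
```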

Check ignored namespaces

Verify the namespace isn’t in the ignored list:

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 ignored-namespaces

If monitoring fails silently, check the controller logs:

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller

Alerts Not Being Sent

Symptoms

Jobs are failing, but no alerts are received in Slack, PagerDuty, etc.

Diagnosis Steps

Check if alerts are active in the monitor status

kubectl describe cronjobmonitor my-monitor -n production | grep -A 10 "Active Alerts"

If alerts are listed here, the monitor is detecting failures.

Verify AlertChannel is ready

kubectl get alertchannel -A

Ensure the READY column shows true. If it shows false, check the AlertChannel status:

kubectl describe alertchannel slack-alerts

Test the alert channel

Send a test alert:

curl -X POST http://localhost:8080/api/v1/channels/slack-alerts/test

Or via the dashboard: Channels → select channel → Send Test Alert.
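
The curl commands in this guide target localhost:8080. If the API is not already exposed on your machine, a port-forward is one way to reach it. The sketch below takes the service name and port from the API Server section of this guide. The helper function names are hypothetical.

```shell
# Forward the Guardian API service to localhost:8080 in the background.
guardian_port_forward() {
  kubectl port-forward -n cronjob-guardian svc/cronjob-guardian-api 8080:8080 &
}

# Call an API path through the forwarded port.
# Usage: guardian_api <path> [extra curl arguments...]
guardian_api() {
  path=$1
  shift
  curl -sS "$@" "http://localhost:8080/api/v1/${path}"
}

# Example:
#   guardian_port_forward; sleep 2
#   guardian_api channels/slack-alerts/test -X POST
```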

Check alert channel references in the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 channelRefs

Ensure the channel name matches the AlertChannel resource name.

Verify severity routing

alerting:
  channelRefs:
    - name: slack-alerts
      severities: [critical, warning]  # Ensure this includes the alert severity

If you only route critical but the alert is warning, it won’t be sent to that channel.

Check rate limiting

kubectl get alertchannel slack-alerts -o yaml | grep -A 5 rateLimiting

If you’re hitting rate limits, increase them or reduce alert volume.

Common Alert Channel Issues

Slack Webhook Invalid

kubectl describe alertchannel slack-alerts

If you see errors like invalid webhook URL or 401 Unauthorized:

Verify the webhook URL is correct
Regenerate the webhook in Slack if necessary
Update the secret:

kubectl delete secret slack-webhook -n cronjob-guardian
kubectl create secret generic slack-webhook \
  --namespace cronjob-guardian \
  --from-literal=url=https://hooks.slack.com/services/YOUR/NEW/URL
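
Deleting the secret before recreating it leaves a short window in which no secret exists. As an alternative sketch, the manifest can be rendered client-side and applied over the existing object. This uses only standard kubectl flags. Whether Guardian picks up the new value without a restart is not confirmed here, and the function name is hypothetical.

```shell
# Replace the Slack webhook secret in place instead of deleting it first.
# Usage: update_slack_webhook <webhook-url>
update_slack_webhook() {
  kubectl create secret generic slack-webhook \
    --namespace cronjob-guardian \
    --from-literal=url="$1" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}

# Example: update_slack_webhook https://hooks.slack.com/services/YOUR/NEW/URL
```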

PagerDuty Routing Key Invalid

Check the PagerDuty routing key:

kubectl get secret pagerduty-key -n cronjob-guardian -o jsonpath='{.data.routing-key}' | base64 -d

Ensure it matches your PagerDuty service integration key.

Dead-Man’s Switch Not Triggering

Symptoms

CronJob hasn’t run in days, but no dead-man’s switch alert is sent.

Solutions

Verify dead-man’s switch is enabled

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 deadManSwitch

Ensure enabled: true.

Check the time window

deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 25h  # Must exceed CronJob schedule

If your job runs daily, the window should be > 24h (e.g., 25h with buffer).

Verify the job has succeeded at least once

Dead-man’s switch requires at least one successful run. Check execution history:

curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
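
As a cross-check, Kubernetes records the last successful completion in the CronJob status (the status.lastSuccessfulTime field of batch/v1). The sketch below reads that field and computes the elapsed hours to compare against maxTimeSinceLastSuccess. It shows the Kubernetes view, which may differ from Guardian's stored history. The function names are hypothetical.

```shell
# Print the last successful completion time Kubernetes recorded for a CronJob.
# Empty output means the status has no successful run on record.
# Usage: cronjob_last_success <namespace> <cronjob-name>
cronjob_last_success() {
  kubectl get cronjob "$2" -n "$1" -o jsonpath='{.status.lastSuccessfulTime}'
}

# Whole hours elapsed between two Unix timestamps.
# Usage: hours_between <then-epoch> <now-epoch>
hours_between() {
  echo $(( ($2 - $1) / 3600 ))
}

# Example (GNU date is assumed for the -d option):
#   then=$(date -d "$(cronjob_last_success production my-job)" +%s)
#   hours_between "$then" "$(date +%s)"
```

If the result is smaller than the configured window, the switch has no reason to fire yet.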

Check if the CronJob is suspended

kubectl get cronjob my-job -n production -o jsonpath='{.spec.suspend}'

If this prints true and the monitor sets suspendedHandling.pauseMonitoring: true, the dead-man’s switch is paused while the CronJob stays suspended.

SLA Not Updating

Symptoms

Success rate and duration metrics are stale or show 0%.

Solutions

Check if SLA analyzer is running

curl http://localhost:8080/api/v1/health | jq '.analyzerEnabled'

The command should print true.

Verify execution history is being stored

curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions

If the response is empty, check the storage configuration (see Storage Issues).

Check SLA recalculation interval

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep sla-recalculation-interval

Default is 5 minutes. Metrics update on this schedule.

Verify SLA is enabled for the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 sla

Storage Issues

Database Connection Errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "storage\|database"

Common errors:

SQLite: Permission Denied

Ensure the persistent volume is mounted and writable by the controller. Start by confirming the persistence settings:

# values.yaml
storage:
  type: sqlite
  sqlite:
    persistence:
      enabled: true
      storageClass: standard
      accessModes:
        - ReadWriteOnce

PostgreSQL: Connection Refused

Verify the PostgreSQL service is reachable:

kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h postgres.default.svc.cluster.local -U guardian -d guardian

If connection fails, check:

PostgreSQL is running: kubectl get pod -l app=postgresql
Service exists: kubectl get svc postgres
Credentials are correct in the secret
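
To separate a network problem from a credentials problem, a probe that needs no password can help. The sketch below runs pg_isready, which ships in the postgres image, against the same host. Port 5432 is the PostgreSQL default and is assumed here. The pod name pg-probe and the function name are hypothetical.

```shell
# Probe whether PostgreSQL accepts connections, without needing credentials.
# Usage: pg_reachable [host] [port]
pg_reachable() {
  host=${1:-postgres.default.svc.cluster.local}
  port=${2:-5432}
  kubectl run pg-probe --rm -i --restart=Never --image=postgres:15 -- \
    pg_isready -h "$host" -p "$port"
}

# Example: pg_reachable
```

If pg_isready reports that the server is accepting connections while psql still fails, the credentials in the secret are the more likely cause.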

Execution History Not Stored

Check storage health

curl http://localhost:8080/api/v1/admin/storage-stats

Ensure healthy: true.

Verify storage backend is configured

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 10 storage

Check for write errors in logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "failed to.*execution"

High Memory Usage

Symptoms

Guardian controller pod is OOMKilled or uses excessive memory.

Solutions

Reduce log storage

Disable log storage or reduce max log size:

# values.yaml
storage:
  logStorageEnabled: false  # Or reduce maxLogSizeKB
  maxLogSizeKB: 100  # Default: 200

Shorten retention period

historyRetention:
  defaultDays: 7  # Reduce from 30

Reduce monitored CronJobs

If monitoring hundreds of CronJobs, consider splitting into multiple monitors or using more specific selectors.

Increase memory limits

# values.yaml
resources:
  limits:
    memory: 512Mi  # Increase from default
  requests:
    memory: 256Mi

Controller Crashes or Restarts

Diagnosis

# Check pod status
kubectl get pod -n cronjob-guardian

# View recent logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=100

# Check previous logs if pod restarted
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --previous
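
Before reading logs, it can help to see how often each pod restarted and why its container last terminated. The sketch below uses standard pod status fields and assumes the controller is the first container in the pod. The function name is hypothetical.

```shell
# Print name, restart count, and last termination reason for each pod in the
# Guardian namespace. A reason of OOMKilled points to the memory limit
# rather than a panic.
guardian_restart_summary() {
  kubectl get pod -n cronjob-guardian -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}

# Example: guardian_restart_summary
```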

Common Causes

Panic in Reconciliation Loop

Look for panic stack traces in logs. If you find a bug, report it with:

Full stack trace
CronJobMonitor YAML that triggered the panic
Guardian version

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -A 50 "panic"

Leader Election Issues

If running multiple replicas:

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "leader"

Ensure only one pod is the leader at a time.
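
If Guardian uses Lease-based leader election, which is the common default for controllers but is an assumption here, the current holder can be read from the Lease object instead of from the logs. The Lease name is not known from this guide, so the sketch lists every Lease in the namespace. The function name is hypothetical.

```shell
# Show Lease objects, their current holders, and last renewal times
# in the Guardian namespace.
guardian_leader() {
  kubectl get lease -n cronjob-guardian \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity,RENEWED:.spec.renewTime
}

# Example: guardian_leader
```

A holder that changes on every call, or a renewal time that stops advancing, suggests the replicas are contending for or losing the lease.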

API Server Not Responding

Symptoms

Dashboard is unreachable or API requests time out.

Solutions

Verify API is enabled

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 api

Ensure enabled: true.

Check API service

kubectl get svc -n cronjob-guardian cronjob-guardian-api

Verify endpoints exist:

kubectl get endpoints -n cronjob-guardian cronjob-guardian-api

Test API from within cluster

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-api.cronjob-guardian.svc.cluster.local:8080/api/v1/health

Check controller logs for API errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "api server"

Prometheus Metrics Not Scraped

Verify ServiceMonitor

kubectl get servicemonitor -n cronjob-guardian

If using Prometheus Operator, ensure the ServiceMonitor is created and matches your Prometheus selector:

kubectl get prometheus -A -o yaml | grep serviceMonitorSelector

Test Metrics Endpoint

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-controller-metrics.cronjob-guardian.svc.cluster.local:8443/metrics

Common Configuration Mistakes

Incorrect Namespace for AlertChannel

AlertChannels are cluster-scoped, so they don’t have a namespace:

# Wrong
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts
  namespace: production  # Remove this

# Correct
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts  # No namespace

Selector Doesn’t Match Any CronJobs

# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your matchLabels align
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 matchLabels

Wrong Timezone in Maintenance Windows

Use IANA timezone names, not abbreviations:

# Wrong
timezone: PST

# Correct
timezone: America/Los_Angeles
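
A quick local check can catch abbreviations before they reach a monitor. The sketch below is a heuristic, not Guardian's own validation. It accepts UTC or an Area/Location name and, when a tz database is installed, also requires a matching entry. ZONEINFO_DIR is a hypothetical override introduced for this example.

```shell
# Heuristic check that a timezone looks like an IANA name.
# Usage: is_iana_tz <name>
is_iana_tz() {
  tz=$1
  zoneinfo=${ZONEINFO_DIR:-/usr/share/zoneinfo}
  case "$tz" in
    UTC) return 0 ;;
    */*) ;;          # Area/Location form, e.g. America/Los_Angeles
    *) return 1 ;;   # bare abbreviations such as PST
  esac
  # When a tz database is installed, also require a matching entry.
  [ ! -d "$zoneinfo" ] || [ -f "$zoneinfo/$tz" ]
}

# Example:
#   is_iana_tz America/Los_Angeles && echo "looks valid"
#   is_iana_tz PST || echo "use an IANA name instead"
```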

Getting Help

If you’re still stuck:

Check Logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200

Describe Resources

kubectl describe cronjobmonitor my-monitor -n production
kubectl describe alertchannel slack-alerts

Check Events

kubectl get events -n cronjob-guardian --sort-by='.lastTimestamp'

Report Issues

Open an issue on GitHub with:

Guardian version
Kubernetes version
Relevant logs and configuration

Debugging Checklist

Before reporting an issue, gather this information:

# Guardian version
kubectl get deployment -n cronjob-guardian cronjob-guardian-controller -o jsonpath='{.spec.template.spec.containers[0].image}'

# Kubernetes version
kubectl version

# Monitor status
kubectl describe cronjobmonitor my-monitor -n production

# Controller logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200 > guardian-logs.txt

# Health status
curl http://localhost:8080/api/v1/health

# Storage stats
curl http://localhost:8080/api/v1/admin/storage-stats


​Next Steps

Production Setup

API Reference
