This guide covers common issues you may encounter when using CronJob Guardian and how to resolve them.
Monitor Not Discovering CronJobs
Symptoms
CronJobMonitor shows totalCronJobs: 0 even though CronJobs exist in the namespace.
Possible Causes and Solutions
Verify the monitor is in Active phase
kubectl get cronjobmonitor my-monitor -n production
If the phase is not Active, check the monitor’s conditions:
kubectl describe cronjobmonitor my-monitor -n production
Check the selector configuration
Ensure your selector matches the CronJob labels:
# List CronJobs with labels
kubectl get cronjob -n production --show-labels
# Check if your selector matches
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 10 selector
Verify namespace permissions
For cluster-wide or multi-namespace monitoring, ensure the Guardian service account has proper RBAC permissions:
kubectl auth can-i list cronjobs --as=system:serviceaccount:cronjob-guardian:cronjob-guardian-controller -n production
The command should return yes.
Check ignored namespaces
Verify the namespace isn’t in the ignored list:
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 ignored-namespaces
If monitoring fails silently, check the controller logs:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller
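The selector check above can also be done by asking the API server directly which CronJobs carry the labels you expect. The following is a minimal sketch, not part of Guardian: the function names are hypothetical helpers, and it assumes your monitor uses a simple label selector such as app=backup (an example label, not one from this guide).

```shell
# Print the names of CronJobs in a namespace that match a label selector.
# Usage: matching_cronjobs <namespace> <selector>
matching_cronjobs() {
  kubectl get cronjob -n "$1" -l "$2" -o name
}

# Report whether the selector matches any CronJob at all.
# Usage: check_selector <namespace> <selector>
check_selector() {
  count=$(matching_cronjobs "$1" "$2" | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "selector '$2' matches no CronJobs in namespace '$1'"
    return 1
  fi
  echo "selector '$2' matches $count CronJob(s) in namespace '$1'"
}
```

For example, check_selector production app=backup reports how many CronJobs in production carry that label. If it reports none while the CronJobs exist, the labels and the monitor's selector disagree.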
Alerts Not Being Sent
Symptoms
Jobs are failing, but no alerts are received in Slack, PagerDuty, etc.
Diagnosis Steps
Check if alerts are active in the monitor status
kubectl describe cronjobmonitor my-monitor -n production | grep -A 10 "Active Alerts"
If alerts are listed here, the monitor is detecting failures.
Verify AlertChannel is ready
kubectl get alertchannel -A
Ensure the READY column shows true. If false, check the AlertChannel status:
kubectl describe alertchannel slack-alerts
Test the alert channel
Send a test alert:
curl -X POST http://localhost:8080/api/v1/channels/slack-alerts/test
Or via the dashboard: Channels → select channel → Send Test Alert.
Check alert channel references in the monitor
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 channelRefs
Ensure the channel name matches the AlertChannel resource name.
Verify severity routing
alerting:
  channelRefs:
    - name: slack-alerts
      severities: [critical, warning]  # Ensure this includes the alert severity
If you only route critical but the alert is warning, it won’t be sent to that channel.
Check rate limiting
kubectl get alertchannel slack-alerts -o yaml | grep -A 5 rateLimiting
If you’re hitting rate limits, increase them or reduce alert volume.
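The severity routing rule from the steps above (an alert reaches a channel only if its severity is in that channel's severities list) can be checked from the command line. This is an illustrative sketch only: it assumes channelRefs sits under .spec.alerting in the CronJobMonitor, which this guide does not state, and both function names are hypothetical helpers rather than Guardian commands.

```shell
# Print the severities routed to one channel, space-separated.
# Assumes the alerting block lives under .spec (not confirmed by this guide).
# Usage: routed_severities <monitor> <namespace> <channel>
routed_severities() {
  kubectl get cronjobmonitor "$1" -n "$2" \
    -o jsonpath="{.spec.alerting.channelRefs[?(@.name==\"$3\")].severities[*]}"
}

# Return success if <severity> appears in the space-separated <routed> list.
# Usage: severity_is_routed <severity> "<routed list>"
severity_is_routed() {
  severity="$1"
  routed="$2"
  for s in $routed; do
    [ "$s" = "$severity" ] && return 0
  done
  return 1
}
```

For example, severity_is_routed warning "$(routed_severities my-monitor production slack-alerts)" succeeds only if warning alerts are routed to slack-alerts.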
Common Alert Channel Issues
Slack Webhook Invalid
kubectl describe alertchannel slack-alerts
If you see errors like invalid webhook URL or 401 Unauthorized:
Verify the webhook URL is correct
Regenerate the webhook in Slack if necessary
Update the secret:
kubectl delete secret slack-webhook -n cronjob-guardian
kubectl create secret generic slack-webhook \
--namespace cronjob-guardian \
--from-literal=url=https://hooks.slack.com/services/YOUR/NEW/URL
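Deleting and recreating the secret leaves a short window in which it does not exist. As an alternative sketch, the common kubectl idiom below renders the secret locally and applies it in one step. The function name is a hypothetical helper; the secret name, namespace, and url key are the ones used above.

```shell
# Create or update the Slack webhook secret without deleting it first.
# Usage: update_slack_webhook <webhook-url>
update_slack_webhook() {
  # --dry-run=client renders the Secret manifest locally instead of creating it;
  # kubectl apply then creates the secret or updates it in place.
  kubectl create secret generic slack-webhook \
    --namespace cronjob-guardian \
    --from-literal=url="$1" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}
```

Call it as update_slack_webhook "https://hooks.slack.com/services/YOUR/NEW/URL", then re-run kubectl describe alertchannel slack-alerts to see whether the error clears.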
Check the PagerDuty routing key:
kubectl get secret pagerduty-key -n cronjob-guardian -o jsonpath='{.data.routing-key}' | base64 -d
Ensure it matches your PagerDuty service integration key.
Dead-Man’s Switch Not Triggering
Symptoms
CronJob hasn’t run in days, but no dead-man’s switch alert is sent.
Solutions
Verify dead-man's switch is enabled
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 deadManSwitch
Ensure enabled: true.
Check the time window
deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 25h  # Must exceed CronJob schedule
If your job runs daily, the window should be > 24h (e.g., 25h with buffer).
Verify the job has succeeded at least once
Dead-man’s switch requires at least one successful run. Check execution history:
curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
Check if the CronJob is suspended
kubectl get cronjob my-job -n production -o jsonpath='{.spec.suspend}'
If this returns true and suspendedHandling.pauseMonitoring is set to true, the dead-man’s switch is paused while the CronJob is suspended.
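To see why the time window needs a buffer over the schedule interval, the check can be modelled as a simple comparison. This is a sketch of the idea, not Guardian's implementation: the function name is hypothetical, and the strict greater-than comparison is an assumption.

```shell
# Return success (0) if more than <max_hours> have passed since the last success.
# Timestamps are Unix epoch seconds.
# Usage: dead_man_expired <last_success_epoch> <now_epoch> <max_hours>
dead_man_expired() {
  elapsed=$(( $2 - $1 ))
  limit=$(( $3 * 3600 ))
  [ "$elapsed" -gt "$limit" ]
}
```

Consider a daily job whose next run finishes 10 minutes later than the previous one, so 87000 seconds separate the two successes. dead_man_expired 0 87000 24 succeeds, which would be a false alarm with a 24h window, while dead_man_expired 0 87000 25 does not.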
SLA Not Updating
Symptoms
Success rate and duration metrics are stale or show 0%.
Solutions
Check if SLA analyzer is running
curl http://localhost:8080/api/v1/health | jq '.analyzerEnabled'
Should return true.
Verify execution history is being stored
curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
If empty, check storage configuration.
Check SLA recalculation interval
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep sla-recalculation-interval
Default is 5 minutes. Metrics update on this schedule.
Verify SLA is enabled for the monitor
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 sla
Storage Issues
Database Connection Errors
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "storage\|database"
Common errors:
SQLite: Permission Denied
Ensure the persistent volume has correct permissions:
# values.yaml
storage:
  type: sqlite
  sqlite:
    persistence:
      enabled: true
      storageClass: standard
      accessModes:
        - ReadWriteOnce
PostgreSQL: Connection Refused
Verify the PostgreSQL service is reachable:
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
psql -h postgres.default.svc.cluster.local -U guardian -d guardian
If connection fails, check:
PostgreSQL is running: kubectl get pod -l app=postgresql
Service exists: kubectl get svc postgres
Credentials are correct in the secret
Execution History Not Stored
Check storage health
curl http://localhost:8080/api/v1/admin/storage-stats
Ensure healthy: true.
Verify storage backend is configured
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 10 storage
Check for write errors in logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "failed to.*execution"
High Memory Usage
Symptoms
Guardian controller pod is OOMKilled or uses excessive memory.
Solutions
Reduce log storage
Disable log storage or reduce max log size:
# values.yaml
storage:
  logStorageEnabled: false  # Or reduce maxLogSizeKB
  maxLogSizeKB: 100         # Default: 200
Shorten retention period
historyRetention:
  defaultDays: 7  # Reduce from 30
Reduce monitored CronJobs
If monitoring hundreds of CronJobs, consider splitting into multiple monitors or using more specific selectors.
Increase memory limits
# values.yaml
resources:
  limits:
    memory: 512Mi  # Increase from default
  requests:
    memory: 256Mi
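Before changing any of these settings, it helps to confirm that the restarts really are OOM kills. The sketch below reads each pod's last container termination reason from the standard Kubernetes pod status. The function names are hypothetical helpers; the namespace is the one used throughout this guide.

```shell
# Print "<pod><TAB><last termination reason>" for each pod in the namespace.
last_termination_reasons() {
  kubectl get pod -n cronjob-guardian \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# List only the pods whose last container termination was an OOM kill.
oom_killed_pods() {
  last_termination_reasons | awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

If oom_killed_pods prints nothing, the restarts probably have another cause. See the section on controller crashes.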
Controller Crashes or Restarts
Diagnosis
# Check pod status
kubectl get pod -n cronjob-guardian
# View recent logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=100
# Check previous logs if pod restarted
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --previous
Common Causes
Panic in Reconciliation Loop
Look for panic stack traces in logs. If you find a bug, report it with:
Full stack trace
CronJobMonitor YAML that triggered the panic
Guardian version
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -A 50 "panic"
Leader Election Issues
If running multiple replicas:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "leader"
Ensure only one pod is the leader at a time.
API Server Not Responding
Symptoms
Dashboard is unreachable or API requests time out.
Solutions
Verify API is enabled
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 api
Ensure enabled: true.
Check API service
kubectl get svc -n cronjob-guardian cronjob-guardian-api
Verify endpoints exist:
kubectl get endpoints -n cronjob-guardian cronjob-guardian-api
Test API from within cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://cronjob-guardian-api.cronjob-guardian.svc.cluster.local:8080/api/v1/health
Check controller logs for API errors
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "api server"
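The curl commands in this guide target localhost:8080. One way to reach the API from a workstation is a temporary port-forward to the cronjob-guardian-api service. This is an assumption about your setup, not something this guide prescribes, and check_api_health is a hypothetical helper name.

```shell
# Open a temporary port-forward, query the health endpoint, then clean up.
check_api_health() {
  kubectl port-forward -n cronjob-guardian svc/cronjob-guardian-api 8080:8080 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # crude wait for the tunnel to come up

  status=0
  curl -fsS http://localhost:8080/api/v1/health || status=$?

  # Stop the port-forward whether or not the request succeeded.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$status"
}
```

A non-zero exit status means the request failed. If the in-cluster test above succeeds but this one fails, the problem is more likely in the port-forward or local network than in the API itself.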
Prometheus Metrics Not Scraped
Verify ServiceMonitor
kubectl get servicemonitor -n cronjob-guardian
If using Prometheus Operator, ensure the ServiceMonitor is created and matches your Prometheus selector:
kubectl get prometheus -A -o yaml | grep serviceMonitorSelector
Test Metrics Endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://cronjob-guardian-controller-metrics.cronjob-guardian.svc.cluster.local:8443/metrics
Common Configuration Mistakes
Incorrect Namespace for AlertChannel
AlertChannels are cluster-scoped, so they don’t have a namespace:
# Wrong
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts
  namespace: production  # Remove this
# Correct
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts  # No namespace
Selector Doesn’t Match Any CronJobs
# List CronJobs with labels
kubectl get cronjob -n production --show-labels
# Check if your matchLabels align
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 matchLabels
Wrong Timezone in Maintenance Windows
Use IANA timezone names, not abbreviations:
# Wrong
timezone: PST
# Correct
timezone: America/Los_Angeles
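A rough local check for a timezone value is to look it up in the system's tz database. This is a heuristic sketch, not how Guardian validates the field: the function name is hypothetical, it assumes tzdata is installed under /usr/share/zoneinfo, and it deliberately accepts only Area/Location names, so valid single-word names such as UTC are rejected.

```shell
# Return success if <name> looks like an Area/Location IANA zone that
# exists in the local tz database.
# Usage: is_iana_timezone <name>
is_iana_timezone() {
  case "$1" in
    */*) ;;           # Area/Location form, e.g. America/Los_Angeles
    *) return 1 ;;    # abbreviations such as PST have no slash
  esac
  [ -f "/usr/share/zoneinfo/$1" ]
}
```

For example, is_iana_timezone PST fails, while is_iana_timezone America/Los_Angeles succeeds on a machine with tzdata installed.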
Getting Help
If you’re still stuck:
Check Logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200
Describe Resources
kubectl describe cronjobmonitor my-monitor -n production
kubectl describe alertchannel slack-alerts
Check Events
kubectl get events -n cronjob-guardian --sort-by='.lastTimestamp'
Report Issues
Open an issue on GitHub with:
Guardian version
Kubernetes version
Relevant logs and configuration
Debugging Checklist
Before reporting an issue, gather this information:
# Guardian version
kubectl get deployment -n cronjob-guardian cronjob-guardian-controller -o jsonpath='{.spec.template.spec.containers[0].image}'
# Kubernetes version
kubectl version
# Monitor status
kubectl describe cronjobmonitor my-monitor -n production
# Controller logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200 > guardian-logs.txt
# Health status
curl http://localhost:8080/api/v1/health
# Storage stats
curl http://localhost:8080/api/v1/admin/storage-stats
Next Steps
Production Setup: Best practices for production deployments
API Reference: Complete REST API documentation