This guide covers common issues you may encounter when using CronJob Guardian and how to resolve them.

Monitor Not Discovering CronJobs

Symptoms

CronJobMonitor shows totalCronJobs: 0 even though CronJobs exist in the namespace.

Possible Causes and Solutions

1

Verify the monitor is in the Active phase

kubectl get cronjobmonitor my-monitor -n production
If the phase is not Active, check the monitor’s conditions:
kubectl describe cronjobmonitor my-monitor -n production
2

Check the selector configuration

Ensure your selector matches the CronJob labels:
# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your selector matches
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 10 selector
3

Verify namespace permissions

For cluster-wide or multi-namespace monitoring, ensure the Guardian service account has proper RBAC permissions:
kubectl auth can-i list cronjobs --as=system:serviceaccount:cronjob-guardian:cronjob-guardian-controller -n production
Should return yes.
4

Check ignored namespaces

Verify the namespace isn’t in the ignored list:
kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 ignored-namespaces
If monitoring fails silently, check the controller logs:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller
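The selector check in step 2 can also be done with kubectl's own label filtering, which removes the guesswork of comparing YAML by eye. The helper below is a minimal sketch, assuming the monitor selects on plain labels; the function name `count_matching` and the label `app=backup` are hypothetical and not part of CronJob Guardian.

```shell
# count_matching NAMESPACE SELECTOR
# Prints how many CronJobs in NAMESPACE match the given label selector.
# The selector (e.g. app=backup) is a hypothetical example; substitute the
# labels from your own monitor's selector.
count_matching() {
  kubectl get cronjob -n "$1" -l "$2" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Example (commented out so the snippet is safe to source):
# count_matching production app=backup
```

A result of 0 suggests the labels in the selector do not exist on any CronJob in that namespace, which points at the selector rather than at Guardian itself.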

Alerts Not Being Sent

Symptoms

Jobs are failing, but no alerts are received in Slack, PagerDuty, etc.

Diagnosis Steps

1

Check if alerts are active in the monitor status

kubectl describe cronjobmonitor my-monitor -n production | grep -A 10 "Active Alerts"
If alerts are listed here, the monitor is detecting failures.
2

Verify AlertChannel is ready

kubectl get alertchannel -A
Ensure the READY column shows true. If it shows false, check the AlertChannel status:
kubectl describe alertchannel slack-alerts
3

Test the alert channel

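Send a test alert through the Guardian API (the command assumes the API is reachable at localhost:8080, for example through a port-forward):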
curl -X POST http://localhost:8080/api/v1/channels/slack-alerts/test
Or via the dashboard: Channels → select channel → Send Test Alert.
4

Check alert channel references in the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 channelRefs
Ensure the channel name matches the AlertChannel resource name.
5

Verify severity routing

alerting:
  channelRefs:
    - name: slack-alerts
      severities: [critical, warning]  # Ensure this includes the alert severity
If you only route critical but the alert is warning, it won’t be sent to that channel.
6

Check rate limiting

kubectl get alertchannel slack-alerts -o yaml | grep -A 5 rateLimiting
If you’re hitting rate limits, increase them or reduce alert volume.
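The test-alert request in step 3 targets localhost:8080, which only works if the Guardian API is forwarded to your machine. The sketch below wraps the port-forward and the request in one helper. It assumes the API is exposed by the cronjob-guardian-api service on port 8080, as used elsewhere in this guide; the function name `send_test_alert` is hypothetical.

```shell
# send_test_alert CHANNEL [LOCAL_PORT]
# Opens a temporary port-forward to the Guardian API service, posts a test
# alert for CHANNEL, then closes the tunnel again.
send_test_alert() {
  channel="$1"
  port="${2:-8080}"
  kubectl port-forward -n cronjob-guardian svc/cronjob-guardian-api "${port}:8080" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # crude wait for the tunnel; increase it on slow clusters
  curl -sS -X POST "http://localhost:${port}/api/v1/channels/${channel}/test"
  rc=$?
  kill "$pf_pid" 2>/dev/null || true
  return "$rc"
}

# Example (commented out so the snippet is safe to source):
# send_test_alert slack-alerts
```

The function returns curl's exit status, so a non-zero result indicates the request itself failed rather than the alert delivery.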

Common Alert Channel Issues

Slack Webhook Invalid

kubectl describe alertchannel slack-alerts
If you see errors like invalid webhook URL or 401 Unauthorized:
  1. Verify the webhook URL is correct
  2. Regenerate the webhook in Slack if necessary
  3. Update the secret:
kubectl delete secret slack-webhook -n cronjob-guardian
kubectl create secret generic slack-webhook \
  --namespace cronjob-guardian \
  --from-literal=url=https://hooks.slack.com/services/YOUR/NEW/URL
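If you would rather not delete the Secret first, a common kubectl idiom is to render the manifest locally and apply it, which updates the existing object in place. This is a sketch of that alternative, not a Guardian-specific procedure; the function name `rotate_slack_webhook` is hypothetical, and whether the controller picks up the new value without a restart is not covered here.

```shell
# rotate_slack_webhook URL
# Renders the Secret manifest client-side and applies it, so the existing
# Secret is updated in place instead of being deleted and recreated.
rotate_slack_webhook() {
  kubectl create secret generic slack-webhook \
    --namespace cronjob-guardian \
    --from-literal=url="$1" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example (commented out so the snippet is safe to source):
# rotate_slack_webhook https://hooks.slack.com/services/YOUR/NEW/URL
```

Because the Secret never disappears, there is no window in which the AlertChannel references a missing Secret.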

PagerDuty Routing Key Invalid

Check the PagerDuty routing key:
kubectl get secret pagerduty-key -n cronjob-guardian -o jsonpath='{.data.routing-key}' | base64 -d
Ensure it matches your PagerDuty service integration key.
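A frequent cause of a key that looks right but is rejected is a stray trailing newline stored in the Secret. The sketch below checks the decoded length without printing the key. It assumes routing keys are 32 characters long, which is the usual PagerDuty integration key format but should be treated as an assumption; the function name `routing_key_ok` is hypothetical.

```shell
# routing_key_ok
# Reads the decoded key on stdin and succeeds only if it is exactly 32 bytes.
# The expected length of 32 is an assumption; adjust it if your keys differ.
routing_key_ok() {
  n=$(wc -c | tr -d ' ')
  [ "$n" -eq 32 ]
}

# Example (commented out so the snippet is safe to source):
# kubectl get secret pagerduty-key -n cronjob-guardian \
#   -o jsonpath='{.data.routing-key}' | base64 -d | routing_key_ok \
#   && echo "length ok" || echo "unexpected length (stray newline?)"
```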

Dead-Man’s Switch Not Triggering

Symptoms

A CronJob hasn’t run in days, but no dead-man’s switch alert is sent.

Solutions

1

Verify dead-man's switch is enabled

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 deadManSwitch
Ensure enabled: true.
2

Check the time window

deadManSwitch:
  enabled: true
  maxTimeSinceLastSuccess: 25h  # Must be longer than the interval between scheduled runs
If your job runs daily, set the window above 24h (for example, 25h to leave a one-hour buffer).
3

Verify the job has succeeded at least once

Dead-man’s switch requires at least one successful run. Check execution history:
curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
4

Check if the CronJob is suspended

kubectl get cronjob my-job -n production -o jsonpath='{.spec.suspend}'
If this returns true and the monitor sets suspendedHandling.pauseMonitoring: true, the dead-man’s switch is paused for that CronJob.
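To judge whether the switch should already have fired, you can compare the configured window against the last success that Kubernetes itself recorded. The sketch below is only a rough cross-check: it uses the CronJob's status.lastSuccessfulTime field, which may differ from the execution history Guardian keeps. The function name `window_exceeded` is hypothetical.

```shell
# window_exceeded LAST_SUCCESS_EPOCH NOW_EPOCH WINDOW_HOURS
# Succeeds when more than WINDOW_HOURS have passed since the last success.
window_exceeded() {
  elapsed=$(( $2 - $1 ))
  [ "$elapsed" -gt $(( $3 * 3600 )) ]
}

# Example with GNU date (commented out so the snippet is safe to source):
# last=$(kubectl get cronjob my-job -n production -o jsonpath='{.status.lastSuccessfulTime}')
# window_exceeded "$(date -d "$last" +%s)" "$(date +%s)" 25 && echo "window exceeded"
```

If the window is exceeded and no alert arrived, the earlier steps (enabled flag, suspension, channel routing) are the next places to look.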

SLA Not Updating

Symptoms

Success rate and duration metrics are stale or show 0%.

Solutions

1

Check if SLA analyzer is running

curl http://localhost:8080/api/v1/health | jq '.analyzerEnabled'
Should return true.
2

Verify execution history is being stored

curl http://localhost:8080/api/v1/cronjobs/production/my-job/executions
If empty, check storage configuration.
3

Check SLA recalculation interval

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep sla-recalculation-interval
The default is 5 minutes, and SLA metrics only refresh on this interval.
4

Verify SLA is enabled for the monitor

kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 sla

Storage Issues

Database Connection Errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "storage\|database"
Common errors:

SQLite: Permission Denied

Ensure the persistent volume has correct permissions:
# values.yaml
storage:
  type: sqlite
  sqlite:
    persistence:
      enabled: true
      storageClass: standard
      accessModes:
        - ReadWriteOnce

PostgreSQL: Connection Refused

Verify the PostgreSQL service is reachable:
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h postgres.default.svc.cluster.local -U guardian -d guardian
If connection fails, check:
  • PostgreSQL is running: kubectl get pod -l app=postgresql
  • Service exists: kubectl get svc postgres
  • Credentials are correct in the secret
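When you only need to know whether the server accepts connections, a non-interactive probe is easier to script than an interactive psql session. The sketch below runs pg_isready, which ships in the postgres image, from a throwaway pod. The pod name pg-check and the function name `pg_reachable` are hypothetical, and port 5432 is assumed as the default.

```shell
# pg_reachable HOST [PORT]
# Runs pg_isready from a temporary pod. It only tests that the server accepts
# connections, so it cannot detect wrong credentials.
pg_reachable() {
  kubectl run pg-check --rm -i --restart=Never --image=postgres:15 -- \
    pg_isready -h "$1" -p "${2:-5432}"
}

# Example (commented out so the snippet is safe to source):
# pg_reachable postgres.default.svc.cluster.local
```

If this probe succeeds but Guardian still reports connection errors, the credentials in the secret are the more likely cause.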

Execution History Not Stored

1

Check storage health

curl http://localhost:8080/api/v1/admin/storage-stats
Ensure healthy: true.
2

Verify storage backend is configured

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 10 storage
3

Check for write errors in logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "failed to.*execution"

High Memory Usage

Symptoms

Guardian controller pod is OOMKilled or uses excessive memory.

Solutions

1

Reduce log storage

Disable log storage or reduce max log size:
# values.yaml
storage:
  logStorageEnabled: false  # Option 1: disable log storage entirely
  maxLogSizeKB: 100  # Option 2: keep log storage on and lower the cap (default: 200)
2

Shorten retention period

historyRetention:
  defaultDays: 7  # Reduce from 30
3

Reduce monitored CronJobs

If monitoring hundreds of CronJobs, consider splitting into multiple monitors or using more specific selectors.
4

Increase memory limits

# values.yaml
resources:
  limits:
    memory: 512Mi  # Increase from default
  requests:
    memory: 256Mi
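Before tuning any of these settings, it can help to confirm that the restarts really are caused by memory. The sketch below lists Guardian pods whose last container termination reason was OOMKilled, using standard pod status fields. The function name `oom_killed_pods` is hypothetical.

```shell
# oom_killed_pods
# Prints the names of pods in the cronjob-guardian namespace whose previous
# container termination reason was OOMKilled.
oom_killed_pods() {
  kubectl get pod -n cronjob-guardian \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk '/OOMKilled/ { print $1 }'
}

# Example (commented out so the snippet is safe to source):
# oom_killed_pods
```

Empty output means the last termination was not an OOM kill, in which case the crash diagnosis steps are a better starting point than the memory settings.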

Controller Crashes or Restarts

Diagnosis

# Check pod status
kubectl get pod -n cronjob-guardian

# View recent logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=100

# Check previous logs if pod restarted
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --previous

Common Causes

Panic in Reconciliation Loop

Look for panic stack traces in the logs:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -A 50 "panic"
If you find a bug, report it with:
  • Full stack trace
  • CronJobMonitor YAML that triggered the panic
  • Guardian version

Leader Election Issues

If running multiple replicas:
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "leader"
Ensure only one pod is the leader at a time.
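Log lines show leadership changes over time, while the current holder is usually recorded on a Lease object. The sketch below prints each Lease in the Guardian namespace with its holder. It assumes the controller uses Lease-based leader election, which is common for Kubernetes controllers but is not confirmed by this guide; the function name `lease_holders` is hypothetical.

```shell
# lease_holders
# Prints each Lease in the cronjob-guardian namespace and its current holder.
# Assumes Lease-based leader election; no output means no Lease exists there.
lease_holders() {
  kubectl get lease -n cronjob-guardian \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.holderIdentity}{"\n"}{end}'
}

# Example (commented out so the snippet is safe to source):
# lease_holders
```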

API Server Not Responding

Symptoms

Dashboard is unreachable or API requests time out.

Solutions

1

Verify API is enabled

kubectl get configmap cronjob-guardian-config -n cronjob-guardian -o yaml | grep -A 5 api
Ensure enabled: true.
2

Check API service

kubectl get svc -n cronjob-guardian cronjob-guardian-api
Verify endpoints exist:
kubectl get endpoints -n cronjob-guardian cronjob-guardian-api
3

Test API from within cluster

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-api.cronjob-guardian.svc.cluster.local:8080/api/v1/health
4

Check controller logs for API errors

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller | grep -i "api server"

Prometheus Metrics Not Scraped

Verify ServiceMonitor

kubectl get servicemonitor -n cronjob-guardian
If using Prometheus Operator, ensure the ServiceMonitor is created and matches your Prometheus selector:
kubectl get prometheus -A -o yaml | grep serviceMonitorSelector

Test Metrics Endpoint

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://cronjob-guardian-controller-metrics.cronjob-guardian.svc.cluster.local:8443/metrics

Common Configuration Mistakes

Incorrect Namespace for AlertChannel

AlertChannels are cluster-scoped, so they don’t have a namespace:
# Wrong
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts
  namespace: production  # Remove this

# Correct
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: slack-alerts  # No namespace

Selector Doesn’t Match Any CronJobs

# List CronJobs with labels
kubectl get cronjob -n production --show-labels

# Check if your matchLabels align
kubectl get cronjobmonitor my-monitor -n production -o yaml | grep -A 5 matchLabels

Wrong Timezone in Maintenance Windows

Use IANA timezone names, not abbreviations:
# Wrong
timezone: PST

# Correct
timezone: America/Los_Angeles
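A quick way to catch a bad timezone before applying a monitor is to look the name up in the local tz database. The sketch below assumes Guardian resolves names the same way the system zoneinfo directory does, which this guide does not confirm; the function name `is_iana_zone` and the ZONEINFO_DIR override are hypothetical.

```shell
# is_iana_zone NAME
# Succeeds if NAME exists as a file in the zoneinfo database.
# ZONEINFO_DIR defaults to the usual Linux/macOS location.
is_iana_zone() {
  [ -n "$1" ] && [ -f "${ZONEINFO_DIR:-/usr/share/zoneinfo}/$1" ]
}

# Example (commented out so the snippet is safe to source):
# is_iana_zone America/Los_Angeles && echo "valid" || echo "not an IANA zone name"
```

Note that a few legacy abbreviations such as EST exist as files in many tz databases, so passing this check is a necessary condition rather than proof that a name is a proper region-based zone.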

Getting Help

If you’re still stuck:

Check Logs

kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200

Describe Resources

kubectl describe cronjobmonitor my-monitor -n production
kubectl describe alertchannel slack-alerts

Check Events

kubectl get events -n cronjob-guardian --sort-by='.lastTimestamp'

Report Issues

Open an issue on GitHub with:
  • Guardian version
  • Kubernetes version
  • Relevant logs and configuration

Debugging Checklist

Before reporting an issue, gather this information:
# Guardian version
kubectl get deployment -n cronjob-guardian cronjob-guardian-controller -o jsonpath='{.spec.template.spec.containers[0].image}'

# Kubernetes version (the --short flag was removed in kubectl 1.28)
kubectl version

# Monitor status
kubectl describe cronjobmonitor my-monitor -n production

# Controller logs
kubectl logs -n cronjob-guardian deployment/cronjob-guardian-controller --tail=200 > guardian-logs.txt

# Health status
curl http://localhost:8080/api/v1/health

# Storage stats
curl http://localhost:8080/api/v1/admin/storage-stats

Next Steps

Production Setup

Best practices for production deployments

API Reference

Complete REST API documentation