Logo Wael's Digital Garden

🚨Alerting on Kyverno Policy Validation Failures with Loki

Context#

We needed to create an alert for Kyverno policy violations across our clusters. The goal was to catch when a resource fails validation (e.g., policy="require-requests-limits") and surface relevant metadata like the resource name, kind, and target namespace directly in the alert.

The Data#

Kyverno logs validation failures to stderr in JSON format. Below is a trimmed example of the relevant fields we extract:

{
  "action": "Enforce",
  "cluster": "pve-cluster-mon0",
  "kind": "Job",
  "message": "validation failed",
  "name": "cert-manager-startupapicheck",
  "namespace": "kyverno",
  "namespace_extracted": "cert-manager",
  "policy": "require-requests-limits",
  "resource_gvk": "batch/v1, Kind=Job"
}
JSON
{
  "action": "Enforce",
  "cluster": "pve-cluster-mon0",
  "kind": "Job",
  "message": "validation failed",
  "name": "cert-manager-startupapicheck",
  "namespace": "kyverno",
  "namespace_extracted": "cert-manager",
  "policy": "require-requests-limits",
  "resource_gvk": "batch/v1, Kind=Job"
}

Note: The namespace field in the log is "kyverno" (the source), while namespace_extracted contains the actual namespace of the offending resource (cert-manager in this case).

The LogQL Query#

To create a useful alert, we needed to:

  1. Filter for validation failed events in the kyverno namespace.
  2. Parse the JSON to turn log fields into labels.
  3. Rename namespace_extracted to namespace so the alert reflects the target namespace, not the logging namespace.
  4. Aggregate using sum by to expose specific dynamic labels in the final alert metrics.

Final Query:

sum by (cluster, policy, name, namespace, kind) (
  count_over_time(
    {namespace="kyverno"} 
    |= `validation failed` 
    | json 
    | label_format namespace=namespace_extracted [1m]
  )
) > 0
Code snippet
sum by (cluster, policy, name, namespace, kind) (
  count_over_time(
    {namespace="kyverno"} 
    |= `validation failed` 
    | json 
    | label_format namespace=namespace_extracted [1m]
  )
) > 0

Key Detail: The label_format must occur inside the count_over_time range vector to successfully rewrite the label before aggregation.

Alert Configuration#

Here is the final PrometheusRule definition:

groups:
  - name: KyvernoAlerts
    rules:
      - alert: KyvernoPolicyValidationFailed
        # We use the expression derived from RefId A in Grafana, 
        # discarding the Grafana-specific Reducer/Threshold blocks.
        expr: |
          sum by (cluster, policy, name, namespace, kind) (
            count_over_time(
              {namespace="kyverno"} 
              |= `validation failed` 
              | json 
              | label_format namespace=namespace_extracted [1m]
            )
          ) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Kyverno Policy '{{ $labels.policy }}' failed"
          description: >-
            The {{ $labels.kind }} '{{ $labels.name }}' in namespace '{{ $labels.namespace }}' 
            failed validation against policy '{{ $labels.policy }}' on {{ $labels.cluster }}.