🚨Alerting on Kyverno Policy Validation Failures with Loki
Context#
We needed to create an alert for Kyverno policy violations across our clusters. The goal was to catch when a resource fails validation (e.g., policy="require-requests-limits") and surface relevant metadata like the resource name, kind, and target namespace directly in the alert.
The Data#
Kyverno logs validation failures to stderr in JSON format. Below is a trimmed example of the relevant fields we extract:
{
"action": "Enforce",
"cluster": "pve-cluster-mon0",
"kind": "Job",
"message": "validation failed",
"name": "cert-manager-startupapicheck",
"namespace": "kyverno",
"namespace_extracted": "cert-manager",
"policy": "require-requests-limits",
"resource_gvk": "batch/v1, Kind=Job"
}{
"action": "Enforce",
"cluster": "pve-cluster-mon0",
"kind": "Job",
"message": "validation failed",
"name": "cert-manager-startupapicheck",
"namespace": "kyverno",
"namespace_extracted": "cert-manager",
"policy": "require-requests-limits",
"resource_gvk": "batch/v1, Kind=Job"
}
Note: The namespace field in the log is "kyverno" (the source), while namespace_extracted contains the actual namespace of the offending resource (cert-manager in this case).
The LogQL Query#
To create a useful alert, we needed to:
- Filter for
validation failedevents in thekyvernonamespace. - Parse the JSON to turn log fields into labels.
- Rename
namespace_extractedtonamespaceso the alert reflects the target namespace, not the logging namespace. - Aggregate using
sum byto expose specific dynamic labels in the final alert metrics.
Final Query:
sum by (cluster, policy, name, namespace, kind) (
count_over_time(
{namespace="kyverno"}
|= `validation failed`
| json
| label_format namespace=namespace_extracted [1m]
)
) > 0sum by (cluster, policy, name, namespace, kind) (
count_over_time(
{namespace="kyverno"}
|= `validation failed`
| json
| label_format namespace=namespace_extracted [1m]
)
) > 0
Key Detail: The label_format must occur inside the count_over_time range vector to successfully rewrite the label before aggregation.
Alert Configuration#
Here is the final PrometheusRule definition:
groups:
- name: KyvernoAlerts
rules:
- alert: KyvernoPolicyValidationFailed
# We use the expression derived from RefId A in Grafana,
# discarding the Grafana-specific Reducer/Threshold blocks.
expr: |
sum by (cluster, policy, name, namespace, kind) (
count_over_time(
{namespace="kyverno"}
|= `validation failed`
| json
| label_format namespace=namespace_extracted [1m]
)
) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Kyverno Policy '{{ $labels.policy }}' failed"
description: >-
The {{ $labels.kind }} '{{ $labels.name }}' in namespace '{{ $labels.namespace }}'
failed validation against policy '{{ $labels.policy }}' on {{ $labels.cluster }}.