Continuously scan your Kubernetes cluster for security issues using trivy operator

My colleagues @lukas.maikowski, @philipp.markiewka and @philipp.dziosa and I used our recent slack day @ cloudogu to get started with the trivy operator for kubernetes. Here’s what we achieved.

TL;DR it’s easy to scan your cluster for CVEs. But then what?
We were able to display the number of CVEs per image on a Grafana dashboard.
Any ideas on how to create alerts on new CVEs?
Edit 01/2023: The vulnerability numbers fluctuate a lot. How to smooth them out?

Installation

We started with our

and installed via Helm.
We had to configure serviceMonitors for kube-prometheus-stack to pick up metrics.
Some trivy scans OOM-failed so we had to increase the default memory limits.

helm upgrade -i trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--create-namespace \
--set="trivy.ignoreUnfixed=true" \
--set="serviceMonitor.enabled=true" \
--set="serviceMonitor.namespace=monitoring" \
--set="trivy.resources.requests.memory=1G" \
--set="trivy.resources.limits.memory=1G" \
--set-json='serviceMonitor.labels={ "release": "kube-prometheus-stack"}' \
--version v0.3.0
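
The install command above assumes that the chart repository is already registered under the alias aqua. Here is a minimal sketch of that prerequisite plus a quick post-install check. The function names are made up for illustration, and the repository URL is the one Aqua Security publishes for its Helm charts; adjust it if yours differs.

```shell
# Register the Aqua Security chart repository under the alias "aqua",
# which is what the "aqua/trivy-operator" chart reference relies on.
add_aqua_repo() {
  helm repo add aqua https://aquasecurity.github.io/helm-charts/
  helm repo update
}

# After the install: list the pods in the operator namespace and check
# that a ServiceMonitor carrying the kube-prometheus-stack release label
# exists in the monitoring namespace.
check_trivy_operator() {
  kubectl get pods --namespace trivy-system
  kubectl get servicemonitors --namespace monitoring \
    --selector release=kube-prometheus-stack
}
```

Run add_aqua_repo once before the helm upgrade and check_trivy_operator afterwards. If the second kubectl call returns nothing, Prometheus will most likely not pick up the metrics.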

Unfortunately, the Helm chart cannot yet be found on Artifact Hub. The available values can be found in the GitHub repo, though.
Edit 01/2023: The Helm chart is now available on Artifact Hub.

Usage

After installation, the operator scans all images in the cluster and stores the results in Kubernetes custom resources in the namespace of the respective pods. You can display them using, e.g.:

kubectl get vulnerabilityreports -oyaml
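
The full YAML gets long quickly. As a sketch, a compact overview across all namespaces could look like the following. It assumes the VulnerabilityReport resources expose .report.artifact and .report.summary fields as in the operator's v1alpha1 CRD, and the function name is only for illustration.

```shell
# Print one line per vulnerability report with image and severity counts.
# The column paths assume the v1alpha1 VulnerabilityReport schema.
list_vuln_summary() {
  kubectl get vulnerabilityreports --all-namespaces \
    --output custom-columns='NAMESPACE:.metadata.namespace,REPORT:.metadata.name,REPOSITORY:.report.artifact.repository,TAG:.report.artifact.tag,CRITICAL:.report.summary.criticalCount,HIGH:.report.summary.highCount'
}
```

Calling list_vuln_summary prints a table. Columns whose path does not exist in your CRD version show up as <none>.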

For those who use Lens, there’s also an extension.

However, we don’t use the CRs (yet?)

Metrics

Instead we visualize based on metrics, using Grafana.
There does not seem to be a dashboard yet (see this issue), so we built one:

Source: Grafana Dashboard for Trivy Operator · GitHub

Alerting

If you want to alert on CVEs, metrics would also be an option.

Unfortunately, the trivy operator metrics don’t include the actual CVE IDs, only the count.

So we can only send unspecific “new CVE for image” alerts.

We tried to alert on new CVEs, but ran into a PromQL limitation: increase() does not treat the creation of a new time series as an increase from zero (see increase() should consider creation of new timeseries as reset · Issue #1673 · prometheus/prometheus · GitHub).

increase(trivy_image_vulnerabilities{namespace="argocd-production", severity="Critical"}[5m]) 

So we would miss the CVEs that are already present when a new image version is first scanned.
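
One direction we have not validated: instead of increase(), compare the current value with the value 5 minutes ago, and separately catch series that did not exist 5 minutes ago using unless ... offset. The sketch below runs such an expression against the Prometheus HTTP API. PROM_URL and the function name are assumptions; the metric and severity label come from the query above.

```shell
# Assumed default; point this at your own Prometheus instance.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Existing series whose value went up compared to 5 minutes ago.
RISING='(trivy_image_vulnerabilities{severity="Critical"} - trivy_image_vulnerabilities{severity="Critical"} offset 5m) > 0'

# Series with a positive value now that had no sample 5 minutes ago,
# which is the case increase() does not count.
NEW_SERIES='(trivy_image_vulnerabilities{severity="Critical"} > 0) unless (trivy_image_vulnerabilities{severity="Critical"} offset 5m)'

# Run the combined expression as an instant query and print the JSON result.
query_new_critical_cves() {
  curl --silent --get "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=(${RISING}) or (${NEW_SERIES})"
}
```

The same expression could go into an alerting rule. Note that it would probably also fire whenever a report is deleted and recreated, because the series then looks new again.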

Any ideas?

Docs

The trivy docs themselves contain only a few pages about the operator, but there are impressive separate docs to be found here:

Summary

So, summing up: Trivy operator itself is awesome. What to do with the scan results still has to mature before it becomes intuitive.

Update 01/2023

After using the dashboard for some months, we noticed that the values fluctuate a lot with the number of vulnerability reports. How to smooth them out?
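
One untested idea, assuming the dips come from reports that are temporarily missing: take each series' maximum over a time window before summing. The helper below only prints such a PromQL expression so it can be pasted into a Grafana panel. The function name and the 6h default window are arbitrary choices for illustration.

```shell
# Print a PromQL expression that uses each series' maximum over a window
# before summing, so short gaps between reports do not appear as dips.
# $1: range window, defaults to 6h (a starting point, not a recommendation)
smoothed_vulns_query() {
  window="${1:-6h}"
  printf 'sum by (namespace, severity) (max_over_time(trivy_image_vulnerabilities[%s]))\n' "$window"
}
```

For example, smoothed_vulns_query 1h prints the expression with a one-hour window. The trade-off is that fixed or removed images keep counting until their last sample leaves the window, and an old and a new report for the same workload may be counted twice during that time.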


The chart can now be found on ArtifactHub.
