Helm-deployed agent crashing on launch: Helm-deployed ClusterRoles for operator, agent etc still use retina.io instead of retina.sh #1936

@ivucica

Describe the bug

The retina-agent service account in kube-system seemingly has no permission to list Retina's own custom resources, which are defined in the retina.sh API group.

EDIT Below, retina-agent-init is actually running the operator image instead of the init. My bad! The apigroups are, nonetheless, wrong.

Here's an excerpt from the retina-agent-init container of the retina-agent-* pod (the pod never reaches the retina-agent container):

[EDIT] This is from the retina-operator image, by accident. The bug still applies.

E1118 23:39:21.428138       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.MetricsConfiguration: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"metricsconfigurations\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1118 23:39:22.001828       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1118 23:39:22.001924       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1118 23:39:24.764710       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope
E1118 23:39:24.765201       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
W1118 23:39:26.796528       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "metricsconfigurations" in API group "retina.sh" at the cluster scope
E1118 23:39:26.796687       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.MetricsConfiguration: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"metricsconfigurations\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1118 23:39:26.887540       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1118 23:39:26.888200       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
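
The denials above repeat for the same few resources. As a rough aid for reading such logs, here is a small Python sketch (not part of Retina) that collapses the lines into unique (user, verb, resource, API group) tuples. The regular expression is written against the message format visible in this excerpt and is an assumption beyond it; the sample lines are trimmed copies of the lines above.

```python
import re

# Matches both the plain-quoted (W...) and backslash-escaped (E...) variants
# of the "forbidden" message seen in the excerpt.
DENIED = re.compile(
    r'User \\?"(?P<user>[^"\\]+)\\?" cannot (?P<verb>\w+) resource '
    r'\\?"(?P<resource>[^"\\]+)\\?" in API group \\?"(?P<group>[^"\\]*)\\?"'
)


def unique_denials(lines):
    """Return sorted unique (user, verb, resource, group) tuples found in lines."""
    found = set()
    for line in lines:
        m = DENIED.search(line)
        if m:
            found.add((m["user"], m["verb"], m["resource"], m["group"]))
    return sorted(found)


# Trimmed copies of log lines from the excerpt (timestamps and paths removed).
sample = [
    r'jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope',
    r'captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope',
    r'captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope',
]

for denial in unique_denials(sample):
    print(denial)
```

For the excerpt above this reduces to three distinct denials: metricsconfigurations and captures in retina.sh, and jobs in batch.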

This error is similar to #1122, but actually unrelated.

Examining the relevant ClusterRole object, retina-cluster-reader, I can see that several rules still reference retina.io, a group that has been gone since pull request #26 replaced retina.io with retina.sh in most places:

- apiGroups:
  - retina.io
  resources:
  - retinaendpoints
  verbs:
  - get
  - list
  - watch

- apiGroups:
  - retina.io
  resources:
  - retinaendpoints
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - retina.io
  resources:
  - metricsconfigurations
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - retina.io
  resources:
  - retinaendpoints/finalizers
  verbs:
  - update
- apiGroups:
  - retina.io
  resources:
  - retinaendpoints/status
  verbs:
  - get
  - patch
  - update
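
To illustrate why these rules do not help the agent: RBAC compares the request's API group against the rule's apiGroups as plain strings, so a grant on retina.io says nothing about retina.sh. The following is a deliberately simplified Python model of that matching (it is not the real Kubernetes authorizer and ignores resourceNames, subresource wildcards and non-resource URLs); the function name is hypothetical.

```python
def rule_allows(rule, api_group, resource, verb):
    """Return True if one PolicyRule-like dict permits the request (simplified)."""
    def matches(allowed, wanted):
        # "*" is the RBAC wildcard; otherwise the string must match exactly.
        return "*" in allowed or wanted in allowed

    return (
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
    )


# Rule as quoted above, with the stale group.
stale_rule = {
    "apiGroups": ["retina.io"],
    "resources": ["metricsconfigurations"],
    "verbs": ["create", "delete", "get", "list", "patch", "update", "watch"],
}
# Same rule, but with the group named in the "forbidden" errors.
fixed_rule = dict(stale_rule, apiGroups=["retina.sh"])

# The request the agent's service account was denied in the logs.
request = ("retina.sh", "metricsconfigurations", "list")
print(rule_allows(stale_rule, *request))  # False
print(rule_allows(fixed_rule, *request))  # True
```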

This is also happening in the operator ClusterRole object, retina-operator-role:

- apiGroups:
  - retina.io
  resources:
  - captures
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - retina.io
  resources:
  - captures/finalizers
  verbs:
  - update
- apiGroups:
  - retina.io
  resources:
  - captures/status
  verbs:
  - get
  - patch
  - update

The 'standard' (non-hubble) Helm deployment seems to define its RBAC rules with the correct API group:

- apiGroups:
  - retina.sh
  resources:
  - retinaendpoints
  verbs:
  - get
  - list
  - watch

- apiGroups:
  - retina.sh
  resources:
  - retinaendpoints
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - retina.sh
  resources:
  - metricsconfigurations
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - retina.sh
  resources:
  - retinaendpoints/finalizers
  verbs:
  - update
- apiGroups:
  - retina.sh
  resources:
  - retinaendpoints/status
  verbs:
  - get
  - patch
  - update

Even after fixing the API group in the agent role, there are still rules missing, as the retina-agent-init logs show:

[EDIT] This is from the retina-operator image, by accident. The bug still applies.

W1119 00:09:25.596718       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope
E1119 00:09:25.597944       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
W1119 00:09:25.601708       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1119 00:09:25.602039       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1119 00:09:26.899665       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1119 00:09:26.899927       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1119 00:09:26.914116       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope
E1119 00:09:26.914710       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
W1119 00:09:30.043144       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope
E1119 00:09:30.043616       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
W1119 00:09:30.049414       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1119 00:09:30.050010       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"
W1119 00:09:33.326252       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "jobs" in API group "batch" at the cluster scope
E1119 00:09:33.326494       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
W1119 00:09:33.636222       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "captures" in API group "retina.sh" at the cluster scope
E1119 00:09:33.636389       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1alpha1.Capture: failed to list *v1alpha1.Capture: captures.retina.sh is forbidden: User \"system:serviceaccount:kube-system:retina-agent\" cannot list resource \"captures\" in API group \"retina.sh\" at the cluster scope" logger="UnhandledError"

The standard deployment does not list these for retina-cluster-reader either, so it is likely also broken right now, unless this functionality is not enabled in the standard deployment:

  • apiGroup=batch (version v1), resource=jobs, verb=list (presumably also get, watch)
  • apiGroup=retina.sh (version v1alpha1), resource=captures, verb=list (presumably also get, watch)

I've added this:

- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - watch
  - list

and updated this:

- apiGroups:
  - retina.sh
  resources:
  - retinaendpoints
  - captures
  verbs:
  - get
  - list
  - watch
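
As a sanity check on the two rules above, this Python sketch tests them against the requests that were denied in the logs. It uses a simplified matching model (exact strings plus the "*" wildcard), not the real Kubernetes authorizer, and the helper name is hypothetical. Both list and watch are checked, since the failing reflectors report list and watch failures.

```python
def allowed(rules, api_group, resource, verb):
    """Simplified check: True if any rule lists the group, resource and verb."""
    def has(values, wanted):
        return "*" in values or wanted in values

    return any(
        has(r["apiGroups"], api_group)
        and has(r["resources"], resource)
        and has(r["verbs"], verb)
        for r in rules
    )


# The added and the updated rule, as quoted above.
rules = [
    {"apiGroups": ["batch"], "resources": ["jobs"],
     "verbs": ["get", "watch", "list"]},
    {"apiGroups": ["retina.sh"], "resources": ["retinaendpoints", "captures"],
     "verbs": ["get", "list", "watch"]},
]

# Requests denied in the second log excerpt, for both list and watch.
requests = [
    (group, resource, verb)
    for group, resource in [("batch", "jobs"), ("retina.sh", "captures")]
    for verb in ("list", "watch")
]

uncovered = [req for req in requests if not allowed(rules, *req)]
print(uncovered)  # []
```

Under this model the two rules cover every denial from the second excerpt; metricsconfigurations is not part of them and still depends on the API group fix in the existing rule.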

I have not observed errors in the operator's logs caused by the wrong ClusterRole apiGroup (retina.io); the pod reports healthy. I may have missed something, though.

To Reproduce
Steps to reproduce the behavior:

  1. Install the Helm chart oci://ghcr.io/microsoft/retina/charts/retina-hubble v0.0.33-dev-rc1 (the version declared in https://api.github.com/repos/microsoft/retina/releases/latest)
  2. Use kubectl logs to observe the errors in the pod
  3. Use kubectl edit clusterrole retina-cluster-reader -n kube-system and kubectl edit clusterrole retina-operator-role -n kube-system to see the incorrect apiGroup
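
Instead of opening each ClusterRole by hand in step 3, the stale group can also be searched for in bulk. The Python sketch below scans a ClusterRoleList document, in the shape produced by `kubectl get clusterroles -o json`, for rules that reference a given API group. The function name and the inline sample are hypothetical; in the sample, only retina-operator-role mirrors a role from this report, and example-ok-role is made up for contrast.

```python
import json


def roles_with_group(cluster_role_list, api_group):
    """Map ClusterRole name -> sorted resources granted under api_group."""
    hits = {}
    for role in cluster_role_list.get("items", []):
        resources = set()
        # "rules" may be absent or null (for example on aggregated roles).
        for rule in role.get("rules") or []:
            if api_group in (rule.get("apiGroups") or []):
                resources.update(rule.get("resources") or [])
        if resources:
            hits[role["metadata"]["name"]] = sorted(resources)
    return hits


# Tiny stand-in for `kubectl get clusterroles -o json` output.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {"metadata": {"name": "retina-operator-role"},
     "rules": [
       {"apiGroups": ["retina.io"], "resources": ["captures"],
        "verbs": ["get", "list", "watch"]},
       {"apiGroups": ["retina.io"], "resources": ["captures/status"],
        "verbs": ["get", "patch", "update"]}
     ]},
    {"metadata": {"name": "example-ok-role"},
     "rules": [
       {"apiGroups": ["retina.sh"], "resources": ["retinaendpoints"],
        "verbs": ["get", "list", "watch"]}
     ]}
  ]
}
""")

print(roles_with_group(sample, "retina.io"))
```

Against a real cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in.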

Expected behavior
The same API group is used consistently across the CRDs, the operator and agent code, and the ClusterRoles that grant the agent and operator their permissions on the cluster.

Screenshots
n/a

Platform (please complete the following information):

Additional context
This also breaks hubble-relay since it can't talk to the agent.
