Skip to content

Conversation

@karthikvetrivel
Copy link
Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is a part of this endeavor:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

@karthikvetrivel karthikvetrivel marked this pull request as draft October 2, 2025 16:05
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from fbbd6e4 to 391dd63 Compare October 3, 2025 14:17
@elezar elezar removed their request for review October 15, 2025 13:48
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 391dd63 to 427409e Compare November 3, 2025 17:43
@karthikvetrivel karthikvetrivel changed the title Add DRIVER_VERSION & FORCE_REINSTALL fields for avoiding driver reinstalls Check if NVIDIA kernel modules are loaded to avoid modprobe Nov 3, 2025
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 427409e to eae084f Compare November 14, 2025 15:01
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from 0fad456 to 12a3616 Compare December 4, 2025 22:03
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 8, 2025 20:58
Comment on lines +151 to +162
{{- if .Driver.Spec.Version }}
- name: DRIVER_VERSION
value: {{ .Driver.Spec.Version | quote }}
{{- end }}
{{- if .Driver.Spec.KernelModuleType }}
- name: KERNEL_MODULE_TYPE
value: {{ .Driver.Spec.KernelModuleType | quote }}
{{- end }}
{{- if .Driver.Spec.DriverType }}
- name: DRIVER_TYPE
value: {{ .Driver.Spec.DriverType | quote }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- do we not need to make similar changes to the DaemonSet that gets rendered by the ClusterPolicy controller?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are we expecting our controller to set the DRIVER_VERSION variable or are we just relying on this variable being set in our driver-container image itself? (I think we should just do the latter).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are we expecting our controller to set the DRIVER_VERSION variable or are we just relying on this variable being set in our driver-container image itself? (I think we should just do the latter).

For the init container, the controller must set it. For the driver container, we rely on the image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, makes sense that these variables need to always be added to the init container. I think my first question still stands -- do we need to make similar changes to the ClusterPolicy controller? Or is this already being set there?

Also, I notice that you only added the DRIVER_TYPE variable to the test files in internal/state/testdata/golden. Shouldn't we also be adding DRIVER_VERSION by default to the initContainer that gets rendered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit--we do need to make similar changes in the ClusterPolicy controller. They're in now.

Also, I notice that you only added the DRIVER_TYPE variable to the test files in internal/state/testdata/golden. Shouldn't we also be adding DRIVER_VERSION by default to the initContainer that gets rendered?

Looking into this now!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRIVER_VERSION is conditionally rendered in the template--the golden files only include DRIVER_VERSION when the test fixture sets .Spec.Version.

I've now updated the test fixtures to set the version in getMinimalDriverRenderData(), which means that the golden files now include.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 3 times, most recently from a33e6af to 7f1d75c Compare December 16, 2025 21:28
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 7f1d75c to 9a888c0 Compare December 17, 2025 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants