-
Notifications
You must be signed in to change notification settings - Fork 431
Add weekly forward compatibility testing #1884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
d55dd82 to
9dab00e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.
Key changes:
- Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
- Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
- Created get-latest-images.sh script to fetch latest commit-based images from component repositories
- Extended install-operator.sh to support device-plugin and mig-manager image overrides
- Added Slack notifications for test failures
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/forward-compatibility.yaml | New workflow for weekly forward compatibility testing with Slack notifications |
| .github/workflows/ci.yaml | Refactored to use reusable workflows instead of monolithic job definitions |
| .github/workflows/variables.yaml | New reusable workflow for shared CI variables |
| .github/workflows/e2e-tests.yaml | New reusable e2e test workflow with optional component image inputs |
| .github/workflows/image-builds.yaml | New reusable workflow for GPU operator image builds |
| .github/workflows/release.yaml | New reusable workflow for release processes |
| .github/workflows/golang-checks.yaml | New reusable workflow for Go checks and tests |
| .github/workflows/config-checks.yaml | New reusable workflow for configuration validation |
| .github/workflows/code-scanning.yaml | New reusable workflow for CodeQL scanning |
| .github/scripts/get-latest-images.sh | Script to fetch latest component images from NVIDIA repositories |
| tests/scripts/install-operator.sh | Extended to support device-plugin and mig-manager image overrides |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
84a0c31 to
54d6fc9
Compare
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # Use workflow_dispatch inputs if provided, otherwise fetch latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.
| toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }} | ||
| device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }} | ||
| mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a stab at this comment, please let me know what you think
| toolkit_image: | ||
| description: 'Override container-toolkit image' | ||
| required: false | ||
| type: string | ||
| device_plugin_image: | ||
| description: 'Override device-plugin image' | ||
| required: false | ||
| type: string | ||
| mig_manager_image: | ||
| description: 'Override mig-manager image' | ||
| required: false | ||
| type: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| with: | ||
| operator_image: ${{ needs.variables.outputs.operator_image_base }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, done!
| with: | ||
| commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} | ||
| operator_image_base: ${{ needs.variables.outputs.operator_image_base }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, done!
.github/workflows/e2e-tests.yaml
Outdated
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} | ||
| toolkit_image: ${{ inputs.toolkit_image }} | ||
| device_plugin_image: ${{ inputs.device_plugin_image }} | ||
| mig_manager_image: ${{ inputs.mig_manager_image }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for passing these variables as input to the variables workflow?
Split monolithic ci.yaml (448 lines) into 6 specialized workflows. Refactored ci.yaml as orchestrator (118 lines) with centralized variable calculation. All workflows support workflow_call, workflow_dispatch, and standalone execution. Eliminates duplicate variable calculations. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Implement automated weekly tests validating GPU Operator against latest container-toolkit, device-plugin, and mig-manager images from GHCR. - Add forward-compatibility.yaml workflow with Slack alerts - Create get-latest-images.sh for fetching latest commit-based tags - Extend e2e-tests.yaml and install-operator.sh for component overrides - Add variables.yaml reusable workflow for shared CI variables Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Replace individual component image environment variables with a values override file approach. Add helper scripts for generating values files from env vars and component images. Ensure clean separation between Helm values file (-f) and --set flags to avoid conflicts. This reduces workflow boilerplate by ~58% and makes adding new operands trivial (no workflow changes needed). Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
54d6fc9 to
203941a
Compare
Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.
Changes:
Components tested:
We could add other operands later, but I wanted to start with the core ones.
Schedule: Every Monday at 2 AM UTC