
Conversation

@jbtrystram
Member

Allow skipping the node image tests. This unblocks the node image pipeline when the QEMU artifact for the latest RHCOS build is not available.
Patch best viewed without whitespace changes.

@dustymabe
Member

the QEMU artifact for the latest RHCOS build is not available.

In what case does that happen? Did someone run with EARLY_ARCH_JOBS set?

@aaradhak
Member

aaradhak commented Dec 3, 2025

@dustymabe Looks like we have encountered this issue in the build-node-image job run today - https://jenkins-rhcos--prod-pipeline.apps.int.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/build-node-image/1235/

@dustymabe
Member

dustymabe commented Dec 3, 2025

https://jenkins-rhcos--prod-pipeline.apps.int.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/build/2573/parameters/ had EARLY_ARCH_JOBS set by @sdodson, which is why
https://jenkins-rhcos--prod-pipeline.apps.int.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/build-node-image/1235/ ended up failing.

Rather than skipping tests here, we can simply rerun the build-node-image job later. I don't see much value in skipping tests.

@dustymabe
Member

Actually, it looks like the real problem with https://jenkins-rhcos--prod-pipeline.apps.int.prod-stable-spoke1-dc-iad2.itup.redhat.com/job/build-node-image/1235/ is that x86_64 is trying to download different images than the other arches (again, because of EARLY_ARCH_JOBS). EARLY_ARCH_JOBS isn't really the entire problem here, though; it just exposes it.

I think the real problem is that we're not enforcing that we are running the test against the same RHCOS for all arches.

We probably should enforce that we download the same RHCOS qemu as the node image is based on.

i.e. we need to update

cosa buildfetch \
--arch=$arch --artifact qemu --url=s3://${s3_dir}/builds \
--aws-config-file \${AWS_BUILD_UPLOAD_CONFIG} --find-build-for-arch
to replace --find-build-for-arch with --build=$BUILDID, where $BUILDID is the RHCOS build ID that was used to build the node image.
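
For illustration, here's a minimal sketch of what that invocation could look like, assuming a BUILDID variable has already been resolved to the RHCOS build the node image was built from (the variable name and how it gets resolved are placeholders, not something this PR defines):

# Sketch only: BUILDID is assumed to already hold the RHCOS build ID
# the node image was derived from; resolving it is not shown here.
cosa buildfetch \
    --arch=$arch --artifact qemu --url=s3://${s3_dir}/builds \
    --aws-config-file \${AWS_BUILD_UPLOAD_CONFIG} --build=$BUILDID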

@jbtrystram
Member Author

We probably should enforce that we download the same RHCOS qemu as the node image is based on.

i.e. we need to update

cosa buildfetch \
--arch=$arch --artifact qemu --url=s3://${s3_dir}/builds \
--aws-config-file \${AWS_BUILD_UPLOAD_CONFIG} --find-build-for-arch
to replace --find-build-for-arch with --build=$BUILDID, where $BUILDID is the RHCOS build ID that was used to build the node image.

The issue @Roshan-R hit when working on this is that we upload incomplete builds for RHCOS, so you'd have to parse the meta file first to find a build with all the arches.
We're rebasing to the container image for the tests, so the build we boot with does not matter too much.
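
For context, a rough sketch (not part of this patch) of the kind of lookup that would be needed, assuming the standard coreos-assembler builds.json layout where each build entry lists its arches and the newest build comes first; the bucket path, the aws CLI usage, and the arch list are illustrative:

# Sketch only: print the newest build ID from builds.json that has all of
# the arches we care about. Assumes entries look like
# {"id": "...", "arches": [...]} and are ordered newest-first.
aws s3 cp "s3://${s3_dir}/builds/builds.json" - |
  jq -r --argjson want '["x86_64","aarch64","s390x","ppc64le"]' \
    'first(.builds[] | select((.arches // []) | ($want - . == [])) | .id)'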

@dustymabe
Member

The issue @Roshan-R hit when working on this is that we upload incomplete builds for rhcos. So you'd have to parse the meta file first to find a build with all the arches.

If the build-node-image job is running, it was (typically) triggered by a release job for the RHCOS base image the node image is being derived from. If we use the same RHCOS that we used to build the node image on top of (by using buildfetch --build=<build>), then we know the build isn't incomplete (i.e. the release job wouldn't have run if it was incomplete).

We're rebasing to the container image for the tests so the build we boot with does not matter too much.

👍

@sdodson

sdodson commented Dec 4, 2025

I don't have the understanding of the pipeline that anyone else here does, but it looks like a change triggered ART automation to kick off build-node-image independent of the rhel-9.6 build, build-arch, and release job completion. The build-node-image jobs that were subsequently triggered by the successful release job were just fine. Could the build, build-arch, and release jobs relevant to downstream build-node-image jobs just hold a lock that prevents starting new instances until complete?

Leveraging the early arch builds flag is highly valuable as it trims the overall pipeline duration by at least an hour.

dustymabe added a commit to dustymabe/fedora-coreos-pipeline that referenced this pull request Dec 4, 2025
…uild

This ensures we don't somehow pick up a different base qemu image than
what we were built on. It also eliminates some awkward race conditions
where a newer in progress RHCOS build was causing node image tests to
fail. xref: coreos#1268
@dustymabe
Member

I opened #1279

dustymabe added a commit that referenced this pull request Dec 5, 2025
…uild

This ensures we don't somehow pick up a different base qemu image than
what we were built on. It also eliminates some awkward race conditions
where a newer in progress RHCOS build was causing node image tests to
fail. xref: #1268
@jbtrystram
Member Author

If the build-node-image job is running it was (typically) triggered by a release job for the RHCOS base image the node image is being derived from.

Not necessarily. ART triggers the build-node-image job multiple times a day.

If we use the same RHCOS that we used to build the node image on top of (by using buildfetch --build=<build> then we know the build isn't incomplete (i.e. the release job wouldn't have run if it was incomplete).

Yes, but back when this job was written we were allowing incomplete builds to be released, which is why we had to use --find-build-for-arch.
