Skip to content

Conversation

@kolluria
Copy link
Contributor

@kolluria kolluria commented Dec 6, 2025

Summary

Implements a periodic cleanup routine to remove orphaned CNS finalizers from PVCs when CnsNodeVMBatchAttachment CRs are deleted before removing PVC finalizers, preventing stuck PVCs and blocked namespace deletions.

Problem

During namespace deletion or manual disk detachment scenarios, the CnsNodeVMBatchAttachment reconciler can remove the finalizer from the CRD before removing finalizers from associated PVCs. This causes:

  1. Stuck PVCs: PVCs remain in terminating state indefinitely with cns.vmware.com/pvc-protection finalizer
  2. Blocked Namespace Deletion: Namespaces cannot be deleted due to stuck PVCs
  3. Resource Leaks: Orphaned resources requiring manual intervention

Root Cause

The issue manifests in scenarios where a volume entry is removed from the CnsNodeVmBatchAttach spec and the underlying VM is deleted simulatenously.
In such scenarios, the finalizer from the LOST PVC never gets removed leaving it perpetually stuck in terminating state.

Solution

Introduced cleanupPVCs() cleanup routine that:

  1. Identifies Orphaned PVCs: Lists all PVCs with cns.vmware.com/pvc-protection finalizer
  2. Validates Against Active CRs: Checks if PVC is still referenced in any CnsNodeVMBatchAttachment CR spec
  3. Removes Orphaned Finalizers: Safely removes finalizers from PVCs not referenced in any active CR

It runs periodically similar to existing CnsRegisterVolume and CnsUnregisterVolume cleanup routines.

Testing Done

Test Scenarios

Scenario 1: PVC attached to a Pod ✓

Expected: Ignored by cleanup (not processed)
Result: PASS

Scenario 2: PVC actively attached to a VM ✓

Expected: Finalizer preserved (PVC still in CR spec)
Result: PASS

Scenario 3: Orphaned PVC ✓

Simulated this by manually adding the finalizer on the PVC.
Expected: Finalizer removed (PVC not in any CR spec)
Result: PASS

Scenario 4: Unused PVCs ✓

Expected: No Op
Result: Pass

Verification

Deleted all the PVCs and verified that that the PVCs that are still in use are stuck in Terminating state while both the unused PVCs and the orphaned PVCs got cleaned up.

> kt delete pvc --all --wait=false
persistentvolumeclaim "test-pvc-no-finalizer" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-orphaned-1" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-orphaned-2" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-orphaned-3" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-orphaned-4" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-orphaned-5" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-pvc-with-vm" deleted from cns-batch-attach-bug namespace
persistentvolumeclaim "test-vm-with-pvc-7ac30fca" deleted from cns-batch-attach-bug namespace

> kt get pvc
NAME                        STATUS        VOLUME                                           CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE
test-pvc-no-finalizer       Terminating   pvc-7748ed9a-4c8b-459b-8a26-f008f6bce985         1Gi        RWO            wcpglobal-storage-profile   <unset>                 28m
test-pvc-with-vm            Terminating   pvc-1cac9c21-5b0a-4a9f-ad43-1f3be9314be2         2Gi        RWO            wcpglobal-storage-profile   <unset>                 24m
test-vm-with-pvc-7ac30fca   Terminating   static-pv-6b70659e-fc03-4c63-a1a5-e3a372c4b981   16Gi       RWO            wcpglobal-storage-profile   <unset>                 9m26s

Test Results

All 4 test scenarios passed successfully:

Test Expected Behavior Result
PVC without finalizer Ignored ✓ PASS
PVC with active VM Finalizer preserved ✓ PASS
Orphaned PVC Finalizer removed ✓ PASS
Unused PVC No Op ✓ PASS

Safety Considerations

  • Conservative Approach: Only removes finalizers from PVCs NOT in any active CR spec
  • No False Positives: PVCs actively being processed by reconciler are preserved
  • Idempotent: Safe to run multiple times
  • Cluster-Wide Scan: Processes all namespaces to catch orphaned PVCs anywhere

Special Notes for Reviewer

  • This is a safety net mechanism for edge cases, not a replacement for proper finalizer management in the reconciler
  • The cleanup routine follows the exact same pattern as existing cleanUpCnsRegisterVolumeInstances and cleanUpCnsUnregisterVolumeInstances routines (infinite loop with sleep intervals)
  • Like other cleanup routines, this does not have a graceful shutdown mechanism - it follows the established pattern in the codebase

Release Note

Add periodic cleanup routine to remove orphaned CNS finalizers from PVCs when CnsNodeVMBatchAttachment CRs are deleted prematurely

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolluria

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 6, 2025
@kolluria kolluria marked this pull request as ready for review December 6, 2025 09:47
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2025
@kolluria
Copy link
Contributor Author

kolluria commented Dec 6, 2025

/assign @kolluria
/cc @skogta @akutz

@k8s-ci-robot
Copy link
Contributor

@kolluria: GitHub didn't allow me to request PR reviews from the following users: skogta.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @kolluria
/cc @skogta @akutz

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kolluria kolluria force-pushed the cleanup-batch-attach-pvcs branch from a32f8f8 to 18fa18d Compare December 6, 2025 09:50
@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #651

@kolluria kolluria force-pushed the cleanup-batch-attach-pvcs branch from 18fa18d to c560755 Compare December 6, 2025 12:01
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 6, 2025
@kolluria kolluria force-pushed the cleanup-batch-attach-pvcs branch from c560755 to d5b849c Compare December 6, 2025 12:28
@deepakkinni
Copy link
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #710

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #655

}

// map to hold all the PVCs that have the finalizer added by the batch attach reconciler
pvcMap := make(map[string]map[string]struct{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description of what this map holds will be helpful


// map to hold all the PVCs that have the finalizer added by the batch attach reconciler
pvcMap := make(map[string]map[string]struct{})
for _, pvc := range pvcList {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider only those PVCs which have deletion timestamp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be a more general purpose finalizer remover. We do not have to wait for the pvc to get deleted to cleanup - as long as a PVC has the finalizer but no batch attach CR, the finalizer can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there can be PVCs being used by pod VMs or attached via the old CnsNodeVMAttachment CR. We should not be removing finalizer from them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I see that older node vm attach and file access configs are also using this CR - but PodVMs won't have this finalizer on them.

Let me think more on how to address this.

@deepakkinni
Copy link
Collaborator

SUCCESS --- Jenkins Build #718

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants