Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix non control plane node deletion when the cluster has no control plane machines #11552

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

damdo
Copy link
Member

@damdo damdo commented Dec 10, 2024

/area machine

What this PR does / why we need it:

I'm working with a cluster where CAPI is installed on day-2 as a non-control-plane machine management system and hence the Cluster object spec.ControlPlaneRef is nil.

While issuing a deletion for a worker machine (managed by a CAPI MachineSet) I'm observing warnings like the following:

I1205 07:39:46.310195       1 machine_controller.go:359] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/ibmpowervs-machineset-5jgxh" namespace="openshift-cluster-api" name="ibmpowervs-machineset-5jgxh" reconcileID="796cc502-4b6d-49dd-b51f-0c0e9afe99c7" MachineSet="openshift-cluster-api/ibmpowervs-machineset" Cluster="openshift-cluster-api/p-mad02-powervs-5-quo-np4rs" Node="ibmpowervs-machineset-5jgxh" cause="no control plane members"

this prevents the associated Kubernetes Node from being deleted, leaving an orphaned node.

Upon investigation, the code decides whether to delete the Node by counting control plane machines in the cluster. If there are no active control plane machines, deletion is blocked. However, this logic fails to consider whether the node being deleted belongs to a control plane or worker machine. Deleting a worker node poses no risk, regardless of the control plane state.

This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.

Proposed Solution:
This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and it is the last remaining control plane machine. Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.

@k8s-ci-robot
Copy link
Contributor

@damdo: The label(s) area/bug cannot be applied, because the repository doesn't have them.

In response to this:

/area bug

What this PR does / why we need it:

I'm working with a cluster where CAPI is installed on day-2 as a non-control-plane machine management system and hence the Cluster object spec.ControlPlaneRef is nil.

While issuing a deletion for a worker machine (managed by a CAPI MachineSet) I'm observing warnings like the following:

I1205 07:39:46.310195       1 machine_controller.go:359] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/ibmpowervs-machineset-5jgxh" namespace="openshift-cluster-api" name="ibmpowervs-machineset-5jgxh" reconcileID="796cc502-4b6d-49dd-b51f-0c0e9afe99c7" MachineSet="openshift-cluster-api/ibmpowervs-machineset" Cluster="openshift-cluster-api/p-mad02-powervs-5-quo-np4rs" Node="ibmpowervs-machineset-5jgxh" cause="no control plane members"

this prevents the associated Kubernetes Node from being deleted, leaving an orphaned node.

Upon investigation, the code decides whether to delete the Node by counting control plane machines in the cluster. If there are no active control plane machines, deletion is blocked. However, this logic fails to consider whether the node being deleted belongs to a control plane or worker machine. Deleting a worker node poses no risk, regardless of the control plane state.

This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.

Proposed Solution:
This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and it is the last remaining control plane machine. Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Dec 10, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 10, 2024
@damdo damdo changed the title 🐛 Fix non-control-plane node deletion when the cluster has no control-plane nodes 🐛 Fix non-control-plane node deletion when the cluster has no control-plane machines Dec 10, 2024
@damdo damdo force-pushed the fix-non-cp-node-deletion branch from fad721a to c85a473 Compare December 10, 2024 09:37
@damdo damdo changed the title 🐛 Fix non-control-plane node deletion when the cluster has no control-plane machines 🐛 Fix non control plane node deletion when the cluster has no control plane machines Dec 10, 2024
@damdo
Copy link
Member Author

damdo commented Dec 10, 2024

/cc @fabriziopandini @sbueringer

@chrischdi
Copy link
Member

/test help

@k8s-ci-robot
Copy link
Contributor

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Copy link
Member

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

@damdo
Copy link
Member Author

damdo commented Dec 10, 2024

It looks like CI is happy

@damdo
Copy link
Member Author

damdo commented Dec 12, 2024

Who should I tag for a review @chrischdi ?

@damdo
Copy link
Member Author

damdo commented Dec 16, 2024

/area machine

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management and removed do-not-merge/needs-area PR is missing an area label labels Dec 16, 2024
@damdo
Copy link
Member Author

damdo commented Dec 16, 2024

/assign @enxebre

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments.
Also, please be aware that CAPI node deletion is best effort, and the ultimate responsible for removing a node is the CPI

@@ -2651,6 +2731,7 @@ func TestIsDeleteNodeAllowed(t *testing.T) {
},
},
},
extraMachines: []*clusterv1.Machine{cp1, cp2},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the testing struct slightly so we can pass in zero control plane machines, otherwise without this change there are always two control plane machines provided to each test entry.

@enxebre
Copy link
Member

enxebre commented Dec 16, 2024

Thanks Damiano, change lgtm pending Fabrizio feedback

@damdo damdo force-pushed the fix-non-cp-node-deletion branch from c85a473 to 80cf6ef Compare December 18, 2024 11:22
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo
Copy link
Member Author

damdo commented Dec 18, 2024

Thanks @fabriziopandini and @enxebre for your review.
I've addressed your feedback 👍

@damdo damdo force-pushed the fix-non-cp-node-deletion branch from 80cf6ef to 33c5b6d Compare December 18, 2024 15:42
@damdo
Copy link
Member Author

damdo commented Dec 18, 2024

/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-31-1-32-main

// number of remaining control plane members and whether or not this
// machine is one of them.
numOfControlPlaneMachines := len(machines.Filter(collections.ControlPlaneMachines(cluster.Name)))
if numOfControlPlaneMachines == 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not filtering by active machines anymore, what happens if this returns 2 because there's another cp machine but that one is also being deleted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants