🐛 Fix non control plane node deletion when the cluster has no control plane machines #11552
base: main
Conversation
Force-pushed from fad721a to c85a473.
/test help
@chrischdi: The specified target(s) for […] were not found. The following commands are available to trigger optional jobs: […] Use […]
/test pull-cluster-api-e2e-conformance-ci-latest-main
It looks like CI is happy.
Who should I tag for a review, @chrischdi?
/area machine
/assign @enxebre
Added a few comments.
Also, please be aware that CAPI node deletion is best effort; the ultimate responsibility for removing a node lies with the CPI.
```diff
@@ -2651,6 +2731,7 @@ func TestIsDeleteNodeAllowed(t *testing.T) {
 				},
 			},
 		},
+		extraMachines: []*clusterv1.Machine{cp1, cp2},
```
why do we need this now?
I've refactored the test struct slightly so we can pass in zero control plane machines; without this change, every test entry is always given two control plane machines. A sketch of the refactored shape is below.
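For context, a minimal sketch of how such a table-driven test could look after the refactor (field names other than `extraMachines`, and the helper values, are assumptions for illustration, not the actual test code):

```go
// Hypothetical shape of the refactored table in TestIsDeleteNodeAllowed:
// control plane machines are no longer injected into every entry, so a
// "no control plane machines" case is simply an entry with a nil slice.
testCases := []struct {
	name          string               // case description
	machine       *clusterv1.Machine   // the machine whose node deletion is checked
	extraMachines []*clusterv1.Machine // additional machines seeded into the fake client
	expectErr     error                // expected result of the deletion check
}{
	{
		name:          "worker machine, no control plane machines",
		machine:       workerMachine,
		extraMachines: nil, // previously cp1 and cp2 were always added here
	},
	{
		name:          "worker machine, two control plane machines",
		machine:       workerMachine,
		extraMachines: []*clusterv1.Machine{cp1, cp2},
	},
}
```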
Thanks Damiano, change lgtm pending Fabrizio's feedback.
Force-pushed from c85a473 to 80cf6ef.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: […] The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files: […]
Approvers can indicate their approval by writing […]
Thanks @fabriziopandini and @enxebre for your review.
Force-pushed from 80cf6ef to 33c5b6d.
/test pull-cluster-api-e2e-conformance-ci-latest-main
```go
// number of remaining control plane members and whether or not this
// machine is one of them.
numOfControlPlaneMachines := len(machines.Filter(collections.ControlPlaneMachines(cluster.Name)))
if numOfControlPlaneMachines == 1 {
```
We are not filtering by active machines anymore. What happens if this returns 2 because there's another control plane machine, but that one is also being deleted?
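For illustration, a sketch of the distinction being raised here, using the filters from `sigs.k8s.io/cluster-api/util/collections` (the exact call sites in the PR may differ):

```go
// All control plane machines, including ones that already have a
// deletionTimestamp set.
allCPMachines := machines.Filter(collections.ControlPlaneMachines(cluster.Name))

// Only control plane machines that are not being deleted.
activeCPMachines := machines.Filter(
	collections.ControlPlaneMachines(cluster.Name),
	collections.ActiveMachines,
)

// If len(allCPMachines) == 2 but one of them is terminating,
// len(activeCPMachines) == 1: counting all machines could allow the
// node of the last healthy control plane machine to be deleted.
```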
/area machine
What this PR does / why we need it:
I'm working with a cluster where CAPI is installed on day 2 as a non-control-plane machine management system, and hence the `Cluster` object's `spec.ControlPlaneRef` is nil. While issuing a deletion for a worker machine (managed by a CAPI MachineSet), I'm observing warnings like the following: […]
This prevents the associated Kubernetes Node from being deleted, leaving an orphaned node.
Upon investigation, the code decides whether to delete the Node by counting control plane machines in the cluster. If there are no active control plane machines, deletion is blocked. However, this logic fails to consider whether the node being deleted belongs to a control plane or worker machine. Deleting a worker node poses no risk, regardless of the control plane state.
This issue is critical for setups where the control plane isn’t managed by CAPI, as the control plane machine count will always be zero from CAPI’s perspective.
Proposed Solution:
This PR modifies the logic to block node deletion only if the node belongs to a control plane machine and that machine is the last remaining control plane machine (see the sketch below). Additionally, I adjusted the filtering to include all machines (not just non-deleting ones), providing a more accurate view of the cluster state, which aids both understanding and testing.
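A minimal sketch of the adjusted check described above, assuming the `util.IsControlPlaneMachine` helper and `collections.ControlPlaneMachines` filter from cluster-api; the surrounding variable and error names are illustrative, so see the actual diff for the final code:

```go
// Worker node deletion is always safe: only guard the case where the
// machine being deleted is itself a control plane machine.
if util.IsControlPlaneMachine(machine) {
	// Count all control plane machines (including deleting ones) to
	// decide whether this is the last remaining control plane member.
	numOfControlPlaneMachines := len(machines.Filter(collections.ControlPlaneMachines(cluster.Name)))
	if numOfControlPlaneMachines == 1 {
		// Blocking here prevents removing the last control plane node.
		return errLastControlPlaneNode
	}
}
// Clusters whose control plane is not managed by CAPI report zero
// control plane machines, so their worker nodes now delete cleanly.
```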