
Unable to use kube-vip-cloud-provider in CAPV clusters #3164

Open
ybizeul opened this issue Aug 21, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ybizeul

ybizeul commented Aug 21, 2024

/kind bug

What steps did you take and what happened:

  • Provision a new cluster with a control plane node and a worker node
  • Deploy kube-vip-cloud-provider
  • Configure IP ranges for the provider (see the sketch below these steps)
  • Create a new LoadBalancer service
  • The provider annotates the service correctly with an IP from the pool
  • kube-vip fails with:
time="2024-08-21T18:55:27Z" level=error msg="[endpoint] unable to find shortname from my-cluster-lbd6d"
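
For reference, a rough sketch of the kind of configuration used in these steps, assuming the kube-vip-cloud-provider defaults (a kubevip ConfigMap in kube-system); the IP range and Service name are illustrative, not the actual values from this cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  # Illustrative pool; replace with the range used in your environment
  range-global: 192.168.1.220-192.168.1.230
---
apiVersion: v1
kind: Service
metadata:
  name: example-lb   # illustrative name
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 80
    targetPort: 80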

What did you expect to happen:

I was expecting kube-vip to set the service's external IP to the IP given by kube-vip-cloud-provider.

Anything else you would like to add:

This seems to be related to kube-vip/kube-vip#723, which prevents provisioning of the IP on the service when node names aren't FQDNs (which is apparently what CAPV produces; I haven't found a way to change that yet, and adding a search domain leads to the same result).

I changed the kube-vip Pod image to v0.8.2 in cluster.yaml as follows:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
[...]
spec:
  kubeadmConfigSpec:
[...]
    files:
    - content: |
[...]
            image: ghcr.io/kube-vip/kube-vip:v0.8.2
[...]
      path: /etc/kubernetes/manifests/kube-vip.yaml

Environment:

  • Cluster-api-provider-vsphere version:
v1.11
  • Kubernetes version: (use kubectl version):
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.0
  • OS (e.g. from /etc/os-release):
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3975.2.0
VERSION_ID=3975.2.0
BUILD_ID=2024-08-05-2103
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3975.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3975.2.0:*:*:*:*:*:*:*"
@k8s-ci-robot added the kind/bug label on Aug 21, 2024
@lubronzhan
Contributor

This is more of a limitation in kube-vip, and as we discussed in Slack, it is fixed in kube-vip 0.8.

But even if kube-vip is bumped to 0.8, since the node name doesn't match the FQDN, CPI will fail to find the corresponding VM for the node.

I think we should revert this PR fcd243d

It was aimed at addressing the node name length issue, but if a customer has a search domain defined for the hostname and the length is within 64 characters, CPI will fail to initialize the node.

If you all think it's reasonable, I can create an issue to revert it.

@ybizeul
Author

ybizeul commented Aug 22, 2024

Thank you @lubronzhan.

I will also add my remarks from this morning.

We have that issue with node names, related to v0.6.4, which isn't a problem for the initial deployment and control plane HA.
But as it is, the kube-vip static pods are only deployed on control plane nodes anyway, which, as far as I know, don't have LB Service resources.

Maybe the CAPV-intended way to implement kube-vip for workloads is through a new DaemonSet on the workload cluster; the problem with that is that the control plane kube-vip has svc_enable set to true, so I'm afraid both will try to handle the new resources.

@lubronzhan
Contributor

> This is more of a limitation in kube-vip, and as we discussed in Slack, it is fixed in kube-vip 0.8.
>
> But even if kube-vip is bumped to 0.8, since the node name doesn't match the FQDN, CPI will fail to find the corresponding VM for the node.
>
> I think we should revert this PR fcd243d
>
> It was aimed at addressing the node name length issue, but if a customer has a search domain defined for the hostname and the length is within 64 characters, CPI will fail to initialize the node.
>
> If you all think it's reasonable, I can create an issue to revert it.

Ok, let me correct my earlier comment about the real issue: it's actually that /etc/hosts doesn't have the short-hostname IP entry, so kube-vip can't resolve it. This should be fixed in kube-vip 0.8 and is unrelated to reverting fcd243d.

> Maybe the CAPV-intended way to implement kube-vip for workloads is through a new DaemonSet on the workload cluster; the problem with that is that the control plane kube-vip has svc_enable set to true, so I'm afraid both will try to handle the new resources.

The current way supports externalTrafficPolicy: Cluster, which most people use; maybe that's why it's not removed. If you want to deploy a different set of kube-vip, you can modify your KubeadmControlPlane to remove svc_enable: true from the files section that contains the kube-vip manifest.
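
As a rough sketch of what that change looks like (heavily trimmed; the exact manifest comes from your cluster template, and the names here are illustrative), the kube-vip static pod embedded in the KubeadmControlPlane files section carries an svc_enable env entry, and removing it leaves Service handling to a separately deployed kube-vip:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster   # illustrative name
spec:
  kubeadmConfigSpec:
    files:
    - path: /etc/kubernetes/manifests/kube-vip.yaml
      content: |
        apiVersion: v1
        kind: Pod
        metadata:
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - name: kube-vip
            image: ghcr.io/kube-vip/kube-vip:v0.8.2
            env:
            - name: cp_enable
              value: "true"
            # Remove this entry so this kube-vip only manages the control
            # plane VIP and stops watching LoadBalancer Services:
            - name: svc_enable
              value: "true"
            # ... remaining env, args and volumeMounts omitted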

@ybizeul
Author

ybizeul commented Aug 22, 2024

But there is still the problem that 0.6.4 wouldn't work even with the Cluster policy, because of the short name bug, right?

@lubronzhan
Contributor

> But there is still the problem that 0.6.4 wouldn't work even with the Cluster policy, because of the short name bug, right?

Yes, so you need to upgrade to the new 0.8.

This PR fcd243d#diff-e76b4b2137138f55f29ce20dd0ab8287648f3d8eb23a841a2e5c51ff88949750R120 also adds the local_hostname to /etc/hosts. Maybe that's also one of the reasons why only the short hostname is in your /etc/hosts. Could you check that?

@chrischdi
Member

chrischdi commented Aug 23, 2024

Maybe also relevant: the kube-vip static pod currently runs with its own /etc/hosts:

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/templates/cluster-template.yaml#L179-L182

This is to work around the issue at:

An improved way may be a preKubeadmCommand that copies the original /etc/hosts and just adds kubernetes to it.

Note: this workaround is only required when kube-vip runs as static pod.

So instead of adding the static file /etc/kube-vip.hosts, we could maybe use a preKubeadmCommand like the following if this helps:

sed -E -e 's/^(127.0.0.1.*)/\1 kubernetes/g' /etc/hosts > /etc/kube-vip.hosts
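
A rough sketch of where such a preKubeadmCommand could live in the KubeadmControlPlane (only the relevant fields are shown, the name is illustrative, and it assumes the existing static /etc/kube-vip.hosts entry is dropped from files; this is an idea, not a tested change):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster   # illustrative name
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
    # Build /etc/kube-vip.hosts from the node's real /etc/hosts, appending the
    # "kubernetes" alias to the 127.0.0.1 entry, instead of shipping a static file
    - sed -E -e 's/^(127.0.0.1.*)/\1 kubernetes/g' /etc/hosts > /etc/kube-vip.hosts
    # ... existing files and kubeadm configuration omitted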

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 21, 2024
@sbueringer
Member

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Dec 27, 2024