
Frequent Reconciliation Caused by Minor Changes in HPA Status from Prometheus Scalers #6442

Open
Park-Jiyeonn opened this issue Dec 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments


Park-Jiyeonn commented Dec 23, 2024

Report

I am experiencing an issue where the status of the HPAs created by KEDA changes frequently because of minor differences in the Prometheus query results used by my scalers. For example, the status value for my CPU utilization query might change from 55.001 to 55.002, causing the KEDA operator to trigger its Reconcile process over and over.

Currently, I have a large number of Prometheus-based scalers in my cluster, and this frequent Reconciliation significantly impacts the operator's performance. When a new ScaledObject is created, it can take 10 minutes or more for its associated Reconciliation to be processed, because the operator is constantly handling updates triggered by these small status changes.

Is there a way to avoid such frequent Reconciliation for minor or insignificant status changes? Or is there a best practice for handling this situation when using Prometheus-based scalers?

Expected Behavior

The operator should not trigger Reconciliation for minor or insignificant changes in HPA status. Alternatively, there should be a way to configure a tolerance or threshold for such changes to avoid excessive Reconcile events.
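
To make the expectation concrete, here is a minimal sketch (not KEDA's actual implementation; the predicate name is made up for illustration) of how a controller-runtime event predicate on the watched HPA could drop update events where only the status changed, so that metric fluctuations alone would not re-queue the ScaledObject:

```go
// Sketch only: a predicate that ignores HPA updates where only .status changed.
package controllers

import (
	"reflect"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ignoreHPAStatusOnlyUpdates (hypothetical name) returns false for update
// events in which the HPA spec, labels and annotations are unchanged, i.e.
// only the status (current metrics, conditions, replica counts) moved.
var ignoreHPAStatusOnlyUpdates = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldHPA, okOld := e.ObjectOld.(*autoscalingv2.HorizontalPodAutoscaler)
		newHPA, okNew := e.ObjectNew.(*autoscalingv2.HorizontalPodAutoscaler)
		if !okOld || !okNew {
			return true // not an HPA; let other predicates decide
		}
		// Re-queue only when something other than the status changed.
		return !reflect.DeepEqual(oldHPA.Spec, newHPA.Spec) ||
			!reflect.DeepEqual(oldHPA.Labels, newHPA.Labels) ||
			!reflect.DeepEqual(oldHPA.Annotations, newHPA.Annotations)
	},
}
```

Such a predicate could be attached to the HPA watch via builder.WithPredicates; a tolerance-based variant could instead compare the reported metric values and ignore differences below a configurable threshold.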

Actual Behavior

The operator is repeatedly triggered for Reconciliation due to minor changes in the HPA status, significantly impacting performance and delaying the processing of new ScaledObjects.

Steps to Reproduce the Problem

  1. Create a ScaledObject using a Prometheus query that monitors CPU utilization.
  2. Observe that the status of the generated HPA changes frequently due to small fluctuations in the query result.
  3. Monitor the KEDA-operator logs and notice that Reconciliation is triggered frequently.
  4. Add many Prometheus-based scalers to the cluster, and attempt to create a new ScaledObject.
  5. Notice the delay in Reconciliation for the newly created ScaledObject due to the operator being overloaded.

Logs from KEDA operator

A large number of similar logs:

2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
......

KEDA Version

< 2.12.0

Kubernetes Version

< 1.28

Platform

Any

Scaler Details

Prometheus

Anything else?

If this issue is considered a bug and there is agreement on a potential solution, I am willing to contribute code to fix it. I would appreciate any guidance or suggestions from the maintainers on how to approach the fix, such as which parts of the codebase to modify or any specific design considerations to keep in mind.

Park-Jiyeonn added the bug (Something isn't working) label Dec 23, 2024
@LY-today

Encountered the same problem

@deefreak
Contributor

@Park-Jiyeonn I faced this issue last year, and we debugged and fixed it.
Here is the link to the issue: #5281
and the PR for the fix: #5282

The fix has been available since KEDA version 2.13.0, so if you upgrade to that version the problem should go away.

@Park-Jiyeonn
Author

@Park-Jiyeonn I faced this issue last year, and we debugged and fixed it. Here is the link to the issue: #5281 and the PR for the fix: #5282

The fix has been available since KEDA version 2.13.0, so if you upgrade to that version the problem should go away.

Thank you for your response! I've confirmed that the issue was indeed caused by the excessive number of Reconcile events. However, we are also encountering another problem: each Reconcile operation is very slow, and the QPS from the operator to the API server is consistently capped at 20. After some investigation, we realized this was due to the default QPS limit of 20 in the controller-runtime client, so we adjusted the client's QPS settings to resolve this bottleneck.
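
For reference, this is roughly where those limits live in a generic controller-runtime setup (a sketch assuming a plain manager-based main.go, not KEDA's actual code; the values are illustrative, not recommendations):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// controller-runtime defaults the rest.Config to QPS=20 / Burst=30 when
	// they are unset, which becomes the client-side throttle against the API
	// server. Raising them lets a large number of ScaledObjects be reconciled
	// without the client queuing its own requests.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // example value
	cfg.Burst = 200 // example value

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// ...register controllers here, then run the manager as usual.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The tradeoff is more load on the API server, so its own rate limits may need to be raised as well.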

Currently, we are using KEDA version 2.8 due to compatibility constraints with our cluster version. If you are aware of any other known performance issues or best practices for this version of KEDA, I would greatly appreciate it if you could share them with us. This would help us avoid potential problems in advance.

Thanks again for your help!

@deefreak
Contributor

We also ran into this rate limiting at the API server, which we resolved by increasing the limit.
We were using KEDA 2.10.0 at the time we observed these issues.
Apart from these, I don't think there was anything else.

Park-Jiyeonn changed the title from "Frequent Reconciliation Caused by Minor Changes in HPA Annotations from Prometheus Scalers" to "Frequent Reconciliation Caused by Minor Changes in HPA Status from Prometheus Scalers" Dec 30, 2024