
Frequent Reconciliation Caused by Minor Changes in HPA Status from Prometheus Scalers #6442

Open
Park-Jiyeonn opened this issue Dec 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments


Park-Jiyeonn commented Dec 23, 2024

Report

I am experiencing an issue where the status of the HPAs created by KEDA changes frequently because of minor differences in the Prometheus query results used by my scalers. For example, the status value for my CPU utilization query might change from 55.001 to 55.002, causing the KEDA operator to trigger its Reconcile process over and over.

Currently, I have a large number of Prometheus-based scalers in my cluster, and this frequent Reconciliation significantly impacts the operator's performance. When a new ScaledObject is created, it can take 10 minutes or more for its associated Reconciliation to be processed, because the operator is constantly handling updates triggered by these small status changes.

Is there a way to avoid such frequent Reconciliation for minor or insignificant status changes? Or is there a best practice for handling this situation when using Prometheus-based scalers?

Expected Behavior

The operator should not trigger Reconciliation for minor or insignificant changes in HPA status. Alternatively, there should be a way to configure a tolerance or threshold for such changes to avoid excessive Reconcile events.
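
To make the expectation concrete, here is a minimal sketch (not KEDA's actual implementation; the predicate name is made up for illustration) of how a controller-runtime event predicate on the watched HPA could drop update events where only the status changed, so that metric fluctuations alone would not re-queue the ScaledObject:

```go
// Sketch only: a predicate that ignores HPA updates where only .status changed.
package controllers

import (
	"reflect"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ignoreHPAStatusOnlyUpdates (hypothetical name) returns false for update
// events in which the HPA spec, labels and annotations are unchanged, i.e.
// only the status (current metrics, conditions, replica counts) moved.
var ignoreHPAStatusOnlyUpdates = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldHPA, okOld := e.ObjectOld.(*autoscalingv2.HorizontalPodAutoscaler)
		newHPA, okNew := e.ObjectNew.(*autoscalingv2.HorizontalPodAutoscaler)
		if !okOld || !okNew {
			return true // not an HPA; let other predicates decide
		}
		// Re-queue only when something other than the status changed.
		return !reflect.DeepEqual(oldHPA.Spec, newHPA.Spec) ||
			!reflect.DeepEqual(oldHPA.Labels, newHPA.Labels) ||
			!reflect.DeepEqual(oldHPA.Annotations, newHPA.Annotations)
	},
}
```

Such a predicate could be attached to the HPA watch via builder.WithPredicates; a tolerance-based variant could instead compare the reported metric values and ignore differences below a configurable threshold.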

Actual Behavior

The operator is repeatedly triggered for Reconciliation due to minor changes in the HPA status, significantly impacting performance and delaying the processing of new ScaledObjects.

Steps to Reproduce the Problem

  1. Create a ScaledObject using a Prometheus query that monitors CPU utilization.
  2. Observe that the status of the generated HPA changes frequently due to small fluctuations in the query result.
  3. Monitor the KEDA-operator logs and notice that Reconciliation is triggered frequently.
  4. Add many Prometheus-based scalers to the cluster, and attempt to create a new ScaledObject.
  5. Notice the delay in Reconciliation for the newly created ScaledObject due to the operator being overloaded.

Logs from KEDA operator

A large number of similar logs:

2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
2024-12-23T13:42:10Z INFO Reconciling ScaledObject {......}
......

KEDA Version

< 2.12.0

Kubernetes Version

< 1.28

Platform

Any

Scaler Details

Prometheus

Anything else?

If this issue is considered a bug and there is agreement on a potential solution, I am willing to contribute code to fix it. I would appreciate any guidance or suggestions from the maintainers on how to approach the fix, such as which parts of the codebase to modify or any specific design considerations to keep in mind.

Park-Jiyeonn added the bug (Something isn't working) label Dec 23, 2024
@LY-today

Encountered the same problem

@deefreak
Contributor

@Park-Jiyeonn I faced this issue last year, and we debugged and fixed it.
Here is the link to the issue: #5281
and the PR for the fix: #5282

The fix has been available since KEDA version 2.13.0, so if you upgrade to that version the problem should go away.

@Park-Jiyeonn
Author

@Park-Jiyeonn I faced this issue last year, and we debugged and fixed it. Here is the link to the issue: #5281 and the PR for the fix: #5282

The fix has been available since KEDA version 2.13.0, so if you upgrade to that version the problem should go away.

Thank you for your response! I've confirmed that the issue was indeed caused by the excessive number of Reconcile events. However, we are also encountering another problem: each Reconcile operation is very slow, and the QPS from the operator to the API server is consistently capped at 20. After some investigation, we realized this was due to the default QPS limit of 20 in the controller-runtime client, so we adjusted the client's QPS settings to resolve this bottleneck.
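
For reference, this is roughly where those limits live in a generic controller-runtime setup (a sketch assuming a plain manager-based main.go, not KEDA's actual code; the values are illustrative, not recommendations):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// controller-runtime defaults the rest.Config to QPS=20 / Burst=30 when
	// they are unset, which becomes the client-side throttle against the API
	// server. Raising them lets a large number of ScaledObjects be reconciled
	// without the client queuing its own requests.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // example value
	cfg.Burst = 200 // example value

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// ...register controllers here, then run the manager as usual.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The tradeoff is more load on the API server, so its own rate limits may need to be raised as well.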

Currently, we are using KEDA version 2.8 due to compatibility constraints with our cluster version. If you are aware of any other known performance issues or best practices for this version of KEDA, I would greatly appreciate it if you could share them with us. This would help us avoid potential problems in advance.

Thanks again for your help!

@deefreak
Contributor

We also ran into this rate limiting at the API server, which we resolved by increasing the limit.
We were using KEDA 2.10.0 at the time we observed these issues.
Apart from these, I don't think there was anything else.

Park-Jiyeonn changed the title from "Frequent Reconciliation Caused by Minor Changes in HPA Annotations from Prometheus Scalers" to "Frequent Reconciliation Caused by Minor Changes in HPA Status from Prometheus Scalers" Dec 30, 2024