DRA: Support scoring for devices and nodes in scheduling #4970

Open · 4 tasks
johnbelamaric opened this issue Nov 20, 2024 · 11 comments
Labels: sig/scheduling, wg/device-management

@johnbelamaric (Member)

Enhancement Description

DRA supports the concept of "under-specifying" a request. This gives the scheduler more flexibility to satisfy the request, increasing the likelihood of success in environments with scarce resources. For example, rather than asking for a specific device model, the user can ask for any one of a set of models, as long as it has some minimum specified amount of memory.

Currently, DRA uses a "first fit" algorithm during scheduling, which can lead to inefficient choices. Building on the example above: if the user asks for a device with at least 4GB of memory and the first device found has 80GB, that device will be chosen, even if another option with exactly 4GB exists. If scoring were available, the scheduler could evaluate the "waste" associated with each possible option and make a more efficient choice.
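
To make the "waste" idea concrete, here is a minimal sketch of least-waste device selection; the Device type, its fields, and the request shape are hypothetical illustrations, not the actual DRA API:

    package main

    import "fmt"

    // Hypothetical device representation; not the actual DRA API.
    type Device struct {
        Name     string
        MemoryGB int64
    }

    // leastWaste picks the feasible device that exceeds the requested
    // minimum by the smallest margin, rather than the first fit found.
    func leastWaste(devices []Device, minMemoryGB int64) (Device, bool) {
        var best Device
        bestWaste := int64(-1)
        found := false
        for _, d := range devices {
            if d.MemoryGB < minMemoryGB {
                continue // infeasible, skip
            }
            waste := d.MemoryGB - minMemoryGB
            if !found || waste < bestWaste {
                best, bestWaste, found = d, waste, true
            }
        }
        return best, found
    }

    func main() {
        devices := []Device{{"gpu-80g", 80}, {"gpu-4g", 4}}
        if d, ok := leastWaste(devices, 4); ok {
            fmt.Println("chose", d.Name) // "chose gpu-4g", not the 80GB first fit
        }
    }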

Scoring is also critical in other situations where there is optionality in how to satisfy a request. For instance, in #4816 the user is allowed to provide a list of preferences. While that works to choose the "best" option on a given node, in reality most nodes have homogeneous selections of devices. So, in a cluster where some nodes meet the first option in the list and others meet only the second, either kind of node could be chosen. If DRA could score the nodes based on whether they satisfy the first or second option, then preference could be given to the first option across nodes, not just within a node.
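
One way to express that cross-node preference, sketched below with hypothetical types (nothing here is existing DRA or scheduler API): score each node by the earliest preference it can satisfy, so nodes matching the first-listed option win across the whole cluster.

    package main

    import "fmt"

    // canSatisfy[i] reports whether the node can satisfy preference i;
    // a hypothetical stand-in for evaluating the request's preference list.
    type node struct {
        name       string
        canSatisfy []bool
    }

    // score gives higher values to nodes matching earlier preferences,
    // and 0 to nodes matching none.
    func score(n node) int {
        for i, ok := range n.canSatisfy {
            if ok {
                return len(n.canSatisfy) - i
            }
        }
        return 0
    }

    func main() {
        a := node{"node-a", []bool{false, true}} // meets only the second option
        b := node{"node-b", []bool{true, false}} // meets the first option
        fmt.Println(score(a), score(b))          // 1 2 -> prefer node-b
    }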

The last important place for scoring is to help with fragmentation and bin packing. This is especially relevant with the implementation of #4815 pending. The ability to dynamically choose partitions of a device can lead to fragmentation; scoring can alleviate that to some extent. It is not a complete solution to the problem, but it can help.

  • One-line enhancement description (can be used as a release note): Scoring support for improved Node and device selection during Pod scheduling
  • Kubernetes Enhancement Proposal: TBD
  • Discussion Link: TBD (live discussions at KubeCon)
  • Primary contact (assignee): @johnbelamaric
  • Responsible SIGs: sig-scheduling
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (x.y): 1.33
    • Beta release target (x.y): 1.34
    • Stable release target (x.y): 1.35
  • Alpha
    • KEP (k/enhancements) update PR(s):
    • Code (k/k) update PR(s):
    • Docs (k/website) update PR(s):

/assign @johnbelamaric
/cc @pohly @klueska @mortent @alculquicondor @wojtek-t
/sig scheduling

@k8s-ci-robot added the sig/scheduling label Nov 20, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Nov 20, 2024
@johnbelamaric (Member, Author)

/wg device-management

@k8s-ci-robot added the wg/device-management label Nov 20, 2024
@alculquicondor (Member)

cc @dom4ha

@johnbelamaric (Member, Author)

cc @asm582 @kannon92

Abhishek, I don't seem to have Olivier's github handle, please cc him as well.

@johnbelamaric (Member, Author)

cc @catblade

@kannon92 (Contributor)

cc @tardieu

@wojtek-t (Member)

Thanks for filing that, John. Let me just share my high-level thoughts about it.

I obviously sympathize with the goal and use case, but I would like us to spend time thinking about the implementation.
I think the primary option for many k8s folks would be to somehow incorporate it into the existing scheduler model, namely into the concept of priorities in the scheduler.
Saying that I'm not a fan of this model would be a euphemism - the scheduler's scoring model has, in my opinion, two primary problems (and a number of smaller ones):

  1. It's completely unintuitive for the end user and pretty much impossible to reason about.
    Given that the score is a weighted combination of individual scoring functions, as a user (despite the fact that I worked on that codebase in the past), having a set of feasible nodes, I can't predict which node will finally be chosen, because different functions pull in different directions.

  2. It has historically caused (and still somewhat causes) a bunch of pain from a performance/scalability perspective.
    With the introduction of even more complex rules, we would only add to this problem.

I haven't yet put enough thought into it, but I would like to explore a very different model: a cleaner decision tree. In that model, we stack-rank the preferences, and at a given level we immediately reject all options whose score is below the highest for that particular scoring function.
This would immediately solve the problem of intuitiveness and reasoning about the choice, and would also help with performance, since in the majority of cases we wouldn't need to compute all scores for all nodes.
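
A minimal sketch of that decision-tree filtering, with hypothetical types rather than scheduler framework APIs: scoring functions are applied in stack-rank order, each level keeps only the nodes achieving that level's maximum score, and once a single node survives, no further scores are computed.

    package main

    import "fmt"

    type scoreFunc func(node string) int

    // lexicographicFilter applies the stack-ranked scoring functions in
    // order, keeping only nodes that achieve the maximum score per level.
    func lexicographicFilter(nodes []string, ranked []scoreFunc) []string {
        for _, f := range ranked {
            if len(nodes) <= 1 {
                break // a single survivor: skip the remaining functions
            }
            scores := make(map[string]int, len(nodes))
            best := 0
            for i, n := range nodes {
                s := f(n)
                scores[n] = s
                if i == 0 || s > best {
                    best = s
                }
            }
            var next []string
            for _, n := range nodes {
                if scores[n] == best {
                    next = append(next, n)
                }
            }
            nodes = next
        }
        return nodes
    }

    func main() {
        nodes := []string{"node-a", "node-b", "node-c"}
        ranked := []scoreFunc{
            func(n string) int { // level 1: node-c scores lower, rejected
                if n == "node-c" {
                    return 0
                }
                return 1
            },
            func(n string) int { // level 2: node-a preferred among survivors
                if n == "node-a" {
                    return 2
                }
                return 1
            },
        }
        fmt.Println(lexicographicFilter(nodes, ranked)) // [node-a]
    }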

Obviously we have a compatibility problem, but given that we're talking about a new feature, maybe it's the right moment to seriously consider it now. And once we prove it out, maybe we will be able to somehow get rid of the old model in the future...

@johnbelamaric (Member, Author)

Thanks @wojtek-t, good points. A few thoughts.

I would characterize your first point as "predictability" - as a user I have an idea of what's going to happen. We can try to make that a goal. One factor to consider is who needs to influence the decision, and how that affects predictability. I can think of (at least) four different roles that probably want some say in the scoring:

  • The end user - they may want to prioritize for performance or cost, for example.
  • The cluster admin - they will want to control how their infrastructure is used. Scoring can be one mechanism to influence that (for example, should we have a way for them to prioritize one device type over another, for cost/contractual reasons?).
  • The cloud provider / node architect - they may have some reasons to affect placement to ensure the best bin packing on their infrastructure, or to ensure the best performance.
  • The device vendor - at the very least, they would be the source of performance data (device X is faster than device Y), if that becomes a scoring dimension.

During the KEP process we will have to figure out which of these we want to address, what scope of control each should have, how to manage the weights given to each, what APIs they need to influence decisions, and how to make all of that comprehensible and predictable. It should be fun :)

On your second point, were those issues present for purely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity? One thing I would put explicitly out of scope for this KEP is any kind of cross-pod affinity or gang scheduling; I think we need a different solution than our existing kube-scheduler for that. As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue). For example, if I know that the next workload will need a full GPU, and I can fit the current workload on a MIG on an in-use GPU with slightly less performance, I would probably choose the MIG. But in this KEP, I don't want to try to do that; it's already hard enough when we only make decisions for the single Pod we're looking at. That said, whatever we do should be useful for solutions like Kueue or a multi-Pod scheduler to leverage as needed.

@johnbelamaric (Member, Author)

cc @jingxu97

@dom4ha (Member) commented Nov 21, 2024

If we want to support both the end user's preference and the cluster admin's policy (to maintain high utilization), we probably cannot just stack-rank, as we may end up either listening to the end user's preference unconditionally or following the admin policy and ignoring the user's preference (at least in the simplest implementation).

Tuning weights might indeed be challenging, but on the other hand, higher predictability (sacrificing flexibility) could be achieved by picking more radical weights, without changing the whole model. For instance, all DRA factors could have much stronger weights than all non-DRA factors, without stack-ranking all existing plugin types. I bet the scheduler uses flat weights because it's hard to definitively say which factors are more and which are less important, so some level of unpredictability is probably inevitable.
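
A toy numeric illustration of those "radical" weights (all numbers and names below are made up): pick a DRA weight large enough that any DRA score difference outweighs all non-DRA plugins combined, while staying inside the existing weighted-sum model.

    package main

    import "fmt"

    const (
        maxPluginScore  = 100 // per-plugin score ceiling, as in kube-scheduler
        numOtherPlugins = 10  // assumed upper bound on non-DRA plugins
        // A DRA advantage of even 1 point beats all non-DRA scores combined.
        draWeight = maxPluginScore*numOtherPlugins + 1
    )

    func totalScore(draScore int, otherScores []int) int {
        total := draScore * draWeight
        for _, s := range otherScores {
            total += s // non-DRA plugins keep flat weight 1
        }
        return total
    }

    func main() {
        // A node with the better DRA score wins even if it loses everywhere else.
        fmt.Println(totalScore(2, []int{0, 0}) > totalScore(1, []int{100, 100})) // true
    }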

Another small comment is that the Autoscaler skips scoring, so its predictions may be suboptimal compared to the real scheduler's future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.

@wojtek-t (Member)

Thanks for your thoughts - let me share some further ones (some of which I already had before, some triggered by the comments above).

I would characterize your first point as "predictability"

Yes - it's exactly that. And I fully agree that there are multiple personas that would potentially like to affect it. In my mental model I had only 3 (workload owner, cluster admin, and provider), but indeed we may need to think about whether the provider role should be split. My thinking about how those 3 affect the decisions is:

  1. provider - exposes what "scoring functions" are actually available to cluster admins and workload owners
  2. cluster-admin - defines the default order on those functions and potentially some additional constraints (e.g. function X can't be used at all, function A has to be more important than B, ...)
  3. workload owner - can somehow affect the default order of scoring functions within the constraints provided by the admin

I'm happy to hear other thoughts on it, but this is how I was thinking about it.

On your second point, were those issues there for solely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity?

It's a bit of both:

  • for non-local decisions (like pod affinity), these are obviously much harder - and I'm fine with putting them out of scope for now
  • for local decisions, even if a single node is strictly better than every other one wrt the most important scoring function, we still potentially waste a bunch of resources computing all O(10) scores for every node

As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue)

I have a bunch of thoughts about that too, but I'm fine with making it out of scope. If we build a reasonable building block here, it should be enough. And given that we want to focus purely on the "intra-node" aspect, I think we're fine here.

Tuning weights might be indeed challenging, but on the other hand reaching higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model.

Technically that's true, but see my performance point above - in that model we may unnecessarily use an order of magnitude more resources. If we want predictability, let's not hack around to achieve it, but rather do it properly.

Another small comment is that Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.

CA is a bit different here, because it doesn't look only at a single node. It has an internal concept of scoring, but it's a bit different. I would like to achieve better unification here, but for the sake of making progress, I would also put that out of scope for now.
@x13n - for your thoughts if you disagree

@x13n (Member) commented Nov 22, 2024

I don't disagree, but will share my thoughts nevertheless :)

CA indeed has its own scoring mechanism. In fact, it has more than one. The most complex one was indeed score-based, and good luck reasoning about this one:

	priceSubScore := (totalNodePrice + stabilizationPrice) / (totalPodPrice + stabilizationPrice)
	// How well the node matches generic cluster needs
	nodeUnfitness := p.nodeUnfitness(preferredNode, nodeInfo.Node())

	// TODO: normalize node count against preferred node.
	supressedUnfitness := (nodeUnfitness-1.0)*(1.0-math.Tanh(float64(option.NodeCount-1)/15.0)) + 1.0

(https://github.com/kubernetes/autoscaler/blob/5458e1c208d87e988eeef59523b166e2a9b2d622/cluster-autoscaler/expander/price/price.go#L141-L146)

More recently we've introduced a way of composing different expanders (i.e. CA scoring functions) such that each one filters the set of options down to those preferred from its perspective, eventually leading to a single option being chosen. This is very similar to the decision-tree proposal above, and I agree it is much simpler to reason about (not only as a user, but also as a CA maintainer & operator).

I think there might be some opportunity to bring the two scoring mechanisms closer, especially if we build some kind of multi-pod scheduler. One might argue CA is already a multi-pod scheduler that can also create nodes. If there's no capacity in the cluster, the nodes created by CA are essentially the only option for pods to be scheduled, so scheduler scoring doesn't matter much in an autoscaled cluster.
