-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Readme doc for physical mode #49457
Conversation
1bf9563
to
34184d3
Compare
Signed-off-by: dentiny <[email protected]>
34184d3
to
32610f7
Compare
src/ray/common/cgroup/README.md
Outdated
@@ -0,0 +1,35 @@ | |||
### Ray core cgroup documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why heading 3 and heading 4, instead of 1 and 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I start from header 4 and going bigger (3 -> 2 -> 1). Do you have a strong preference on that?
|
||
- The feature is built upon cgroup, which only supports linux; | ||
- Only cgroup v2 is supported, meanwhile ray also requires application to have write permission and cgroup v2 be mounted in rw mode; | ||
- At the initial version, ray caps max resource usage via heuristric estimation (TODO: support user passed-in value). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this more clear on what "heuristric estimation" we use. Move this to ## Physical execution mode
on how to use.
e.g.
If physical execution mode is enabled, Ray uses cgroup to restrict resource usage. Now we only support using `memory` as cgroup `memory.max` to cap a task process (and all its subprocesses recursively)'s max memory usage.
For example,
@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
pass
obj = some_function.remote()
This function is limited by 500MiB memory usage, and if it tries to use more, it OOMs and fails.
Also add lifetimes of a cgroup vs task attempts. Describe how you put a (pre existing idle) worker into a cgroup, and when you move it out and remove a cgroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
src/ray/common/cgroup/README.md
Outdated
/ \ | ||
.../system_cgroup .../application_cgroup | ||
/ \ | ||
.../default_cgroup/ .../<task_id>_<attempt_id>_cgroup (*N) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] task_default, and task_<task_id>_<attempt_id> ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my latest commit I just call it default
; do we need to differentiate task and actor here? Since from users' perspective they're different concepts, and cgroup is actually exposed to users.
src/ray/common/cgroup/README.md
Outdated
- Only cgroup v2 is supported, meanwhile ray also requires application to have write permission and cgroup v2 be mounted in rw mode; | ||
- At the initial version, ray caps max resource usage via heuristric estimation (TODO: support user passed-in value). | ||
|
||
#### Implementation details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[general] I am a bit worried about cgroup leaks on (1) subprocesses holding the cgroup can't be deleted on task exit, (2) raylet crash. Do you have some good ideas on this?
For (1) we certainly don't want these procs. We already use linux subreaper to kill leaked procs via RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper and maybe we can have another flag like RAY_kill_child_processes_on_worker_exit_with_cgroup
so on cgroup exit we SIGKILL all procs in cgroup. WDYT?
For (2) I can't think of a good idea. If you SIGKILL raylet, all cgroups are destined to leak. Maybe we can just accept it. But if so we need to document it both in here and in the eventual user facing document.
Also we have a Q (3): do we want to restrict processes created by raylet to don't have permission to move their own cgroups? Well, Ray does not have a lot of security features so we might be OK with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) subprocesses holding the cgroup can't be deleted on task exit
I see, integrating the cgroup cleanup is a good idea, but we might need a way to get cgroup folder name with unknown PID. Or a hacky way is to have another setup of cgroups which cannot be deleted after task / actor completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For (2) I can't think of a good idea. If you SIGKILL raylet, all cgroups are destined to leak. Maybe we can just accept it. But if so we need to document it both in here and in the eventual user facing document.
I cannot think of a good way either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we have a Q (3): do we want to restrict processes created by raylet to don't have permission to move their own cgroups? Well, Ray does not have a lot of security features so we might be OK with this.
Don't think it's something we could handle at the moment; theoretically we are not container-ize user process, so it could access everything in the node / filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[general] I am a bit worried about cgroup leaks on (1) subprocesses holding the cgroup can't be deleted on task exit
At the initial version, I would like to implement a simple RAII-style cgroup setup and cleanup for raylet.
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments and suggestions. The goal is: another software engineer can implement it out with this readme file. hence, in tomorrow they can debug it as well.
|
||
obj = some_function.remote() | ||
``` | ||
This function is limited by 500MiB memory usage, and if it tries to use more, it OOMs and fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: what happens if you set memory=1 byte, and try to move a proc to such a cgroup? is it " proc move failure" or proc get killed instantly after moving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The later.
src/ray/common/cgroup/README.md
Outdated
Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications. | ||
|
||
A few benefits: | ||
- It prevents application from eating up unlimited resource to starve other applications running on the same node; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] I think these 2 bullet points are not parallel or mutual exclusive... and it's not "benefits" but more of a neutrual "behavior". Also it's not clear what is "application" in this line. does it mean non-Ray or other-Raylet processes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it's not clear what is "application" in this line. does it mean non-Ray or other-Raylet processes?
I renamed to it "user application".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it's not "benefits" but more of a neutrual "behavior"
Well it's the benefit / motivation compared to "no cgroup", which is the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these 2 bullet points are not parallel or mutual exclusive...
A second read, these two bullet points are expression the similar meaning, so I combine them into one.
src/ray/common/cgroup/README.md
Outdated
obj = some_function.remote() | ||
``` | ||
This function is limited by 500MiB memory usage, and if it tries to use more, it OOMs and fails. | ||
+ If a task / actor is not annotated with resource usage, ray caps max resource usage via heuristric estimation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make it dead clear on the "heuristric estimation"? that is, do we have any limitations on the default leaf cgroup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might mis-understand, but I added an implementation detail on how max resource is set.
src/ray/common/cgroup/README.md
Outdated
/ \ | ||
.../internal .../application | ||
/ \ | ||
.../default/ .../<task_id>_<attempt_id> (*N) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] remote the / after default ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: is it more clear to use a tree like shape?
/sys/fs/cgroup/ray_node_<node_id>
├── internal
└── application
├── default
└── <task_id>_<attempt_id> (*N)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also specify the attempt_id is raylet-wide, or per-task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also specify the attempt_id is raylet-wide, or per-task.
I haven't decided yet, leave a TODO to fill in later.
|
||
- Raylet is responsible to create cgroup folder at startup, and cleanup the folder at its destruction | ||
- Each ray node having their own cgroup folder, which contains the node id to differentiate with other raylet(s) | ||
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets overall max resource for all application processes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say the max resource = what. it should default to 0.8 total mem, or a user specified mem resource amount on node start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've already mentioned several lines above? Anyway add more wording as you wish.
src/ray/common/cgroup/README.md
Outdated
- Raylet is responsible to create cgroup folder at startup, and cleanup the folder at its destruction | ||
- Each ray node having their own cgroup folder, which contains the node id to differentiate with other raylet(s) | ||
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets overall max resource for all application processes | ||
- If a task / actor execute with their max resource specified, they will be placed in a dedicated cgroup, identified by the task id and attempt id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write about the attempt id here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering what are you expecting here? Anyway I added some wording on attempt id.
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
7859cd1
to
6e18292
Compare
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with a question left
|
||
- Each ray node having their own cgroup folder, which contains the node id to differentiate with other raylet(s); in detail, raylet is responsible to create cgroup folder `/sys/fs/cgroup/ray_node_<node_id>`, `/sys/fs/cgroup/ray_node_<node_id>/internal` and `/sys/fs/cgroup/ray_node_<node_id>/application` at startup, and cleans up the folder upon process exit; | ||
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets overall max resource for all application processes | ||
+ The max resource respects users' input on node start, or a heuristic value 80% of all logical resource will be taken |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reads "if user does not set, we cap to 0.8 * logical resources == 0.64 physical resources", while I think we really mean:
"if user does not set, we cap to 1 * logical resources == 0.8 physical resources" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't understand, I haven't mentioned any physical resource in the doc, why do you think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-worded a little, hopefully it gets better :)
Hi @rynewang , I'm wondering do I need anything else before merge? |
Signed-off-by: dentiny <[email protected]>
It's clear enough to me now. We can still reword it later but now it's already OK. |
Discussed with @rynewang offline.
It would be nice to have a short doc illustrating cgroup construction and lifecycle.