Ray Data checkpoint #49438
Comments
Thanks for posting this. I agree that checkpointing would be a useful feature for Ray Data users. Regarding the file-based approach, I think the main problem is that not all data sources are file-based or support seeking. Also we'll need to enable
Regarding the row-based approach, the overhead can be significant if the dataset has millions of rows. Also consider that both approaches are not hard to implement at the app level, i.e., by manually adding a filter op after the read. I would suggest doing so before we come up with a general solution.
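A minimal sketch of the app-level workaround suggested above, assuming each row carries a unique `id` column and that finished IDs are persisted to a local JSON file. Both are assumptions for illustration, not part of Ray Data, and the helper names `load_done_ids`, `mark_done`, and `make_resume_filter` are hypothetical. Plain dicts stand in for dataset rows so the logic can be checked without a cluster.

```python
import json
import os
import tempfile


def load_done_ids(path):
    """Return the set of row IDs recorded as finished, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def mark_done(path, ids):
    """Merge newly finished row IDs into the checkpoint file."""
    done = load_done_ids(path) | set(ids)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def make_resume_filter(path, key="id"):
    """Build a row predicate that keeps only rows not yet checkpointed."""
    done = load_done_ids(path)
    return lambda row: row[key] not in done


# Simulated run: plain dicts stand in for dataset rows.
ckpt = os.path.join(tempfile.mkdtemp(), "done_ids.json")
rows = [{"id": i, "value": i * 10} for i in range(5)]

mark_done(ckpt, [0, 1, 2])  # first run finished rows 0..2, then failed
keep = make_resume_filter(ckpt)
remaining = [r for r in rows if keep(r)]  # rows still to process on restart
```

With Ray Data, the same predicate could presumably be passed to `Dataset.filter` right after the read, e.g. `ds.filter(make_resume_filter(ckpt))`. Note that the set of finished IDs is captured in the predicate, which is where the overhead mentioned above shows up for datasets with millions of rows.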
ExecutionOptions.preserve_order seems unnecessary, right? The checkpoint just records whether the current record has been processed or not. File-level checkpoints seem to be sufficient for most scenarios.
What exactly do you want to checkpoint? From "requiring the filesystem to seek and read the file", I assume you want to checkpoint something like "for this file, all rows before row X have been finished". If so, you'll need to preserve the execution order. Alternatively, you can also checkpoint individual row numbers. That said, it's not the most critical problem. I think the main problem is how to support non-file-based data sources.
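To illustrate why a "all rows before row X" checkpoint depends on execution order, here is a small sketch comparing it with checkpointing individual row numbers. The row indices and the helper `contiguous_watermark` are hypothetical and only model the bookkeeping; nothing here calls Ray Data.

```python
def contiguous_watermark(finished):
    """Return the largest X such that rows 0..X-1 are all finished."""
    x = 0
    while x in finished:
        x += 1
    return x


# Rows complete out of order: 0, 1, 3, 4 are done, row 2 is still in flight.
finished_rows = {0, 1, 3, 4}
total_rows = 6

# A naive watermark ("highest finished row + 1") would skip row 2 on restart.
naive_watermark = max(finished_rows) + 1

# A safe watermark only advances over a gap-free prefix of finished rows.
safe_watermark = contiguous_watermark(finished_rows)

# Rows that each scheme would process after a restart.
resume_from_naive = [i for i in range(total_rows) if i >= naive_watermark]
resume_from_safe = [i for i in range(total_rows) if i >= safe_watermark]
resume_from_row_set = [i for i in range(total_rows) if i not in finished_rows]
```

In this sketch the naive watermark loses row 2, the safe watermark redoes rows 3 and 4, and the per-row set resumes exactly the unfinished rows at the cost of storing one entry per row.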
Yes, supporting a SQL data source, for example, is a problem. But isn't unstructured data mainly files?
Description
Background
We want to design a checkpoint feature for Ray Data and would like to see how the community views this issue.
Design catalog
File Level Checkpoint
Row Level Checkpoint
Use case
No response