Ray Data checkpoint #49438

Open
Jay-ju opened this issue Dec 25, 2024 · 4 comments
Labels
data (Ray Data-related issues), enhancement (request for new feature and/or capability), triage (needs triage)

Comments

@Jay-ju
Contributor

Jay-ju commented Dec 25, 2024

Description

Background
This issue proposes a checkpoint design for Ray Data; we would like to hear how the community views it.
Design outline

  • Ray Data jobs should not have to reprocess the full dataset when an unrecoverable error forces a restart
    • Support OneToOneOperator first, i.e. the various map-style operators: read, write, map, map_batches, limit, filter, etc.
    • AllToAllOperator (e.g. repartition/shuffle) will be considered later; it involves splitting data across batches and has more complex processing logic
  • Two levels of checkpoint are currently designed.
    • File-level checkpoint. A single file is processed in blocks, so some segments of a file may succeed while others fail. During recovery, Ray's Block data is re-read segment by segment, which requires the filesystem to support seeking into the file.
    • Row-level checkpoint. Record the row numbers that have already been processed; on restore, only the unprocessed rows are reprocessed. This gives finer granularity.
  • Personally, I prefer to implement the first type: the file-level checkpoint.
    • Row-level state is relatively large to store, which hurts write performance
    • At present, the minimum concurrency granularity of FileBasedDatasource readers such as read_parquet/read_csv is a single file; at block granularity it is mainly a segment (a few rows) of a single file
  • Constraints
    • Does not support repartition/shuffle

File Level Checkpoint

  • File level
[image: file-level checkpoint diagram]
  • In the diagram, from_where records which source file a block's rows come from, and offset records the row offset of those rows within that source file
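
To make the recovery path concrete, here is a minimal sketch of what such a per-block checkpoint record could look like. The field names from_where and offset follow the description above; num_rows, done, and the helper function are illustrative assumptions, not part of the proposal:

```python
from dataclasses import dataclass

@dataclass
class BlockCheckpoint:
    from_where: str  # source file the block's rows were read from
    offset: int      # row offset of the block within that source file
    num_rows: int    # rows in the block (assumed field, bounds the re-read)
    done: bool       # whether the block was fully processed and committed

def blocks_to_recover(records: list[BlockCheckpoint]) -> list[BlockCheckpoint]:
    # On restart, only uncommitted blocks are re-read, by seeking to
    # `offset` within `from_where` and reading `num_rows` rows.
    return [r for r in records if not r.done]
```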

Row Level Checkpoint

  • Row level: record <file_name, row_index> pairs
    • Starting from the source operator, the row numbers of processed data are recorded in the block metadata, flow with the data between upstream and downstream operators, and are written to external storage after the sink (a sketch follows the diagram)
[image: row-level checkpoint diagram]
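
A minimal sketch of the row-level store, assuming the <file_name, row_index> pairs are batched in memory and flushed to Parquet on external storage after the sink; the function name and storage format are illustrative only:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def flush_row_checkpoint(pairs: list[tuple[str, int]], path: str) -> None:
    # One <file_name, row_index> entry per processed row. With millions of
    # rows this table itself becomes large, which is the storage concern
    # noted above for row-level checkpoints.
    table = pa.table({
        "file_name": pa.array([p[0] for p in pairs], type=pa.string()),
        "row_index": pa.array([p[1] for p in pairs], type=pa.int64()),
    })
    pq.write_table(table, path)
```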

Use case

No response

@Jay-ju added the enhancement and triage labels on Dec 25, 2024
@raulchen
Contributor

Thanks for posting this. I agree that checkpointing would be a useful feature for Ray Data users.

Regarding the file-based approach, I think the main problem is that not all data sources are file-based and support seeking. Also, we'll need to enable ExecutionOptions.preserve_order, which could significantly degrade performance.

And regarding the row-based approach, the overhead can be significant if the dataset has millions of rows.

Also consider that both approaches are not hard to implement at the app level, i.e., by manually adding a filter op after the read. I would suggest doing that before we come up with a general solution.
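
For concreteness, a sketch of that app-level workaround. It assumes the failed run persisted its finished <path, row_index> pairs somewhere loadable (load_processed_keys and the row_index column are hypothetical); include_paths=True is the documented way to get a "path" column from Ray's file-based readers:

```python
import ray

def load_processed_keys(checkpoint_path: str) -> set:
    # Hypothetical helper: return the set of (path, row_index) pairs that
    # the failed run already wrote out.
    import pyarrow.parquet as pq
    t = pq.read_table(checkpoint_path)
    return set(zip(t["path"].to_pylist(), t["row_index"].to_pylist()))

processed = load_processed_keys("/tmp/ckpt/processed.parquet")

# include_paths=True stores each row's source file in a "path" column.
ds = ray.data.read_parquet("s3://bucket/input/", include_paths=True)

# App-level "checkpoint": a filter op right after the read that drops rows
# the previous run already finished. Assumes each row carries a row_index
# column (or that one is derived upstream).
ds = ds.filter(lambda row: (row["path"], row["row_index"]) not in processed)
```

Since filter is one of the map-style OneToOneOperators listed above, this composes with the rest of the pipeline without any special ordering requirements.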

@Jay-ju
Contributor Author

Jay-ju commented Dec 27, 2024

Also we'll need to enable ExecutionOptions.preserve_order, which could significantly degrade performance.

ExecutionOptions.preserve_order seems unnecessary, right? The checkpoint just records whether each record has been processed or not. File-level checkpoints seem sufficient for most scenarios.

@richardliaw added the data label on Dec 27, 2024
@raulchen
Contributor

ExecutionOptions.preserve_order seems unnecessary, right?

What exactly do you want to checkpoint? From "requiring the filesystem to seek and read the file", I assume you want to checkpoint something like "for this file, all rows before row X have been finished". If so, you'll need to preserve the execution order. Alternatively, you can also checkpoint individual row numbers.

That said, it's not the most critical problem. I think the main problem is how to support non-file-based data sources.
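
For reference, order preservation is a one-flag setting on the documented DataContext API; a minimal sketch:

```python
import ray

# Preserve deterministic block ordering so "all rows before row X are
# finished" is well-defined; this constrains pipelining and is the
# performance cost mentioned above.
ctx = ray.data.DataContext.get_current()
ctx.execution_options.preserve_order = True
```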

@Jay-ju
Contributor Author

Jay-ju commented Dec 30, 2024

Yes, supporting SQL data sources, for example, is a problem. But isn't unstructured data mainly file-based?
