Detect schema evolution or partition evolution for append DataFile #777

Open · ZENOTME opened this issue Dec 11, 2024 · 4 comments

ZENOTME (Contributor) commented Dec 11, 2024

After #349, we support appending DataFiles. But I found a check that may be missing: when we append a DataFile, schema evolution or partition evolution may happen in the table after the DataFile was generated, which makes the information in the DataFile invalid. E.g. the partition values in the DataFile become invalid when partition evolution happens, and lower_bound/upper_bound become invalid when schema evolution happens. So we need to detect the case where a DataFile is incompatible with the table.

For partition evolution, we have two ways to detect this:

  1. Ensure that the partition value schema matches the existing partition spec in terms of type. This is what we do now, but there are cases it can't detect, e.g. the partition spec <p1: int, p2: int> being reordered to <p2: int, p1: int>.
  2. Ensure that the partition value schema matches the existing partition spec in terms of field name or field id (see the sketch below).
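
A minimal standalone sketch of option 2, keyed on field id; the `SpecField` and `PartitionFieldValue` types below are made-up stand-ins (not the actual iceberg-rust structs), just to show why an id-based lookup handles a reordered spec:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins, not the actual iceberg-rust structs.
struct SpecField {
    field_id: i32,
    ty: &'static str, // e.g. "int", "long"
}

struct PartitionFieldValue {
    field_id: i32,
    ty: &'static str,
}

/// Accept the DataFile only if every partition value maps onto a spec field with the
/// same id and type. Because the lookup is by id rather than by position, a spec that
/// was merely reordered cannot silently swap values.
fn partition_matches_spec(values: &[PartitionFieldValue], spec: &[SpecField]) -> bool {
    let by_id: HashMap<i32, &SpecField> = spec.iter().map(|f| (f.field_id, f)).collect();
    values.len() == spec.len()
        && values
            .iter()
            .all(|v| by_id.get(&v.field_id).map_or(false, |f| f.ty == v.ty))
}

fn main() {
    // Spec <1000 p1: int, 1001 p2: int> later reordered to <1001 p2: int, 1000 p1: int>:
    // a purely positional type check still passes and silently swaps p1/p2, while the
    // id-based match keeps each value attached to the right field.
    let spec = vec![
        SpecField { field_id: 1001, ty: "int" },
        SpecField { field_id: 1000, ty: "int" },
    ];
    let data_file_partition = vec![
        PartitionFieldValue { field_id: 1000, ty: "int" },
        PartitionFieldValue { field_id: 1001, ty: "int" },
    ];
    assert!(partition_matches_spec(&data_file_partition, &spec));
}
```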

For schema evolution:

  1. Schema evolution may also lead to partition evolution, in which case the detection for partition values is the same as above.
  2. Check whether lower_bound/upper_bound match, using the field ID.

Based on the above analysis, we need to make the following fixes:

  • The partition in DataFile should include type information to facilitate validation, e.g. the field name and field id.
  • Append operations need to add validation checks for schema evolution: lower_bound and upper_bound (see the sketch below).
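
A rough sketch of what that bounds validation could look like, assuming lower_bounds/upper_bounds are keyed by field id and that only Iceberg's allowed type promotions (e.g. int → long, float → double) are accepted; the types here are illustrative stand-ins, not the actual iceberg-rust structs:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
}

/// Iceberg only allows a small set of promotions, e.g. int -> long and float -> double.
fn is_compatible(file_ty: PrimitiveType, table_ty: PrimitiveType) -> bool {
    use PrimitiveType::*;
    file_ty == table_ty || matches!((file_ty, table_ty), (Int, Long) | (Float, Double))
}

/// `bound_types` maps field id -> type the bound was written with;
/// `schema` maps field id -> type in the table schema at commit time.
fn validate_bounds(
    bound_types: &HashMap<i32, PrimitiveType>,
    schema: &HashMap<i32, PrimitiveType>,
) -> Result<(), String> {
    for (field_id, file_ty) in bound_types {
        match schema.get(field_id) {
            Some(table_ty) if is_compatible(*file_ty, *table_ty) => {}
            Some(table_ty) => {
                return Err(format!(
                    "field {field_id}: bound type {file_ty:?} is incompatible with schema type {table_ty:?}"
                ))
            }
            None => return Err(format!("field {field_id} no longer exists in the table schema")),
        }
    }
    Ok(())
}

fn main() {
    let bounds = HashMap::from([(1, PrimitiveType::Int), (2, PrimitiveType::Float)]);
    // Field 1 was promoted to long after the file was written: still compatible.
    let schema = HashMap::from([(1, PrimitiveType::Long), (2, PrimitiveType::Float)]);
    assert!(validate_bounds(&bounds, &schema).is_ok());
}
```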

I'm not sure whether my understanding is correct; please correct me if something is wrong. cc @Fokko @liurenjie1024 @Xuanwo

Fokko (Contributor) commented Dec 12, 2024

This is a very interesting question that I'm happy to elaborate on.

But there are cases it can't detect, e.g. the partition spec <p1: int, p2: int> being reordered to <p2: int, p1: int>.

This is true for V1 tables, where the field-IDs are omitted and not written to the files. Therefore, V1 tables have special rules:

[screenshot: the V1 partition spec evolution rules from the Iceberg spec]

There is an issue to create a dedicated API to evolve the partition spec that enforces these rules for V1 tables: #732. I think this would be a great thing to have, since violating these rules might brick the table or, even worse, cause data corruption.

For V2 tables, we use field-ID projection, where the reader reads and projects the files into the structs based on the field-IDs. This allows for re-ordering: when reading the files, values are placed into the correct position of the struct. The write order of fields doesn't make any difference for V2, as they will be re-ordered on read. Of course, I would suggest keeping the same order as the partition spec, to keep everyone sane.
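
A toy illustration of that projection step (the function and types here are made up for the example, not the actual reader code):

```rust
use std::collections::HashMap;

/// Match a row's values to the read schema by field id instead of by position, so the
/// order the writer used doesn't matter: each value lands where the read schema expects it.
fn project_by_field_id(
    file_ids: &[i32],  // field ids in the order they appear in the file
    file_row: &[i64],  // one row's values, in file order
    read_ids: &[i32],  // field ids in the order the read schema expects
) -> Option<Vec<i64>> {
    let by_id: HashMap<i32, i64> = file_ids
        .iter()
        .copied()
        .zip(file_row.iter().copied())
        .collect();
    read_ids.iter().map(|id| by_id.get(id).copied()).collect()
}

fn main() {
    // File written as <1: p1, 2: p2>; read schema re-ordered to <2: p2, 1: p1>.
    let projected = project_by_field_id(&[1, 2], &[10, 20], &[2, 1]).unwrap();
    assert_eq!(projected, vec![20, 10]);
}
```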

Ensure that the partition value schema matches the existing partition spec in terms of field name or field id.

This ties in with a discussion I had earlier this week with @c-thiel that resulted in #771.

My suggestion was to make the field-ID required regardless of the version (see #763). This is safe to do if we adhere to the limitations of partition evolution mentioned above. When reading V1 tables, we can sequentially add the IDs to each of the partition specs, starting at 1000: <1000 p1: int, 1001 p2: int>. This way we can fully rely on the field-IDs instead of the order for V1. We can never match these on names. Keep in mind also that this simplifies the case where someone has a V1 table, writes a couple of petabytes, and then upgrades it to a V2 table: we still have to correctly handle the old V1 DataFiles, since they are not rewritten.
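
A small sketch of that id-assignment idea; `RawPartitionField`, `PartitionField`, and `PARTITION_DATA_ID_START` here are made up for illustration, not the actual iceberg-rust types:

```rust
// V1 partition-spec metadata has no field ids; assign them sequentially from 1000
// when loading, so the rest of the code can rely on ids for both V1 and V2.
struct RawPartitionField {
    name: String,
    field_id: Option<i32>, // None for V1 metadata
}

struct PartitionField {
    name: String,
    field_id: i32,
}

const PARTITION_DATA_ID_START: i32 = 1000;

fn assign_field_ids(raw: Vec<RawPartitionField>) -> Vec<PartitionField> {
    raw.into_iter()
        .enumerate()
        .map(|(pos, f)| PartitionField {
            name: f.name,
            // Keep an explicit id if the metadata already has one (V2); otherwise
            // fall back to 1000 + position (V1).
            field_id: f.field_id.unwrap_or(PARTITION_DATA_ID_START + pos as i32),
        })
        .collect()
}

fn main() {
    let v1_spec = vec![
        RawPartitionField { name: "p1".into(), field_id: None },
        RawPartitionField { name: "p2".into(), field_id: None },
    ];
    let assigned = assign_field_ids(v1_spec);
    assert_eq!(assigned[0].name, "p1");
    assert_eq!(assigned[0].field_id, 1000);
    assert_eq!(assigned[1].field_id, 1001);
}
```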


The partition in DataFile should include type information to facilitate validation, e.g. the field name and field id.

I think that's a great thing to do anyway. It isn't super expensive, and it will avoid folks bricking their table. Preferably match by field-ID for both V1 and V2; otherwise, by order for V1 and by field-ID for V2.

Append operations need to add validation checks for schema evolution: lower_bound and upper_bound.

I'm not sure I fully understand this one. We know the type in the file, and we know what to project to. Iceberg currently has a fairly limited set of type promotions. This is because we encode the upper and lower bounds into binary: based on the number of bytes, we can safely determine whether it is a float (4 bytes) or a double (8 bytes), and whether we need to promote the type based on the current schema.
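
To make that concrete, here is a toy sketch (not the actual iceberg-rust code) of reading a bound for a column that is `double` in the current schema but may have been written before a float → double promotion, assuming the little-endian single-value encoding for float/double bounds and using only the byte length to tell the two apart:

```rust
/// Decode a lower/upper bound for a column whose current type is double.
/// 4 bytes means the value was written while the column was still a float;
/// 8 bytes means it was written after the promotion to double.
fn decode_double_bound(bytes: &[u8]) -> Option<f64> {
    match bytes.len() {
        4 => Some(f32::from_le_bytes(bytes.try_into().ok()?) as f64),
        8 => Some(f64::from_le_bytes(bytes.try_into().ok()?)),
        _ => None, // unexpected width for this column
    }
}

fn main() {
    let old_bound = 1.5f32.to_le_bytes(); // written before the promotion
    let new_bound = 2.5f64.to_le_bytes(); // written after the promotion
    assert_eq!(decode_double_bound(&old_bound), Some(1.5));
    assert_eq!(decode_double_bound(&new_bound), Some(2.5));
}
```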

We can do some cool stuff here. For example, if you query id >= 2^31+1, then we know that it doesn't fit into an int field. If you have promoted the id column over time, then we can skip the file based on the schema :) In PyIceberg/Java we have AboveMax/BelowMin to indicate this. This is done when we bind the evaluator to the schema. Looping in @sdd since he did a lot on this part 🥳

I know that this is a lot of text; hope this helps, and I'm always happy to elaborate.


Edit: I did some checks today, and it seems to work pretty well with minor modifications: #786

sdd (Contributor) commented Dec 14, 2024

We can do some cool stuff here. For example, if you query id >= 2^31+1, then we know that it doesn't fit into an int field. If you have promoted the id column over time, then we can skip the file based on the schema :) In PyIceberg/Java we have AboveMax/BelowMin to indicate this. This is done when we bind the evaluator to the schema. Looping in @sdd since he did a lot on this part 🥳

Thanks for looping me in, @Fokko. I hadn't seen AboveMax / BelowMin in the other impls before - pretty cool. I'll think about how we would go about implementing this in Rust - we don't do many conversions at the moment in Datum::to anyway - maybe we could add AboveMax / BelowMin to the PrimitiveLiteral enum. Something for the backlog :-)
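
Roughly the shape of that idea, sketched with hypothetical types rather than the actual `PrimitiveLiteral`/`Datum` code:

```rust
// When binding a predicate literal to an `int` column, a value outside i32's range
// collapses to AboveMax/BelowMin, and e.g. `col >= AboveMax` can then be evaluated
// to "matches nothing" without looking at any file statistics.
#[derive(Debug, PartialEq)]
enum BoundLiteral {
    Int(i32),
    AboveMax, // larger than any value the column type can hold
    BelowMin, // smaller than any value the column type can hold
}

fn bind_int_literal(value: i64) -> BoundLiteral {
    if value > i32::MAX as i64 {
        BoundLiteral::AboveMax
    } else if value < i32::MIN as i64 {
        BoundLiteral::BelowMin
    } else {
        BoundLiteral::Int(value as i32)
    }
}

/// `col >= literal` can never match if the literal is above the type's maximum.
fn gteq_can_match(literal: &BoundLiteral) -> bool {
    !matches!(literal, BoundLiteral::AboveMax)
}

fn main() {
    // id >= 2^31 + 1 bound against an `int` column: the literal becomes AboveMax,
    // so the predicate can be rejected from the schema alone.
    let lit = bind_int_literal((1i64 << 31) + 1);
    assert_eq!(lit, BoundLiteral::AboveMax);
    assert!(!gteq_can_match(&lit));
}
```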

Fokko (Contributor) commented Dec 15, 2024

I've actually started adding some more conversions; a good beginner task for a Rust novice like me :)

liurenjie1024 (Contributor) commented:

Thanks @ZENOTME for raising this.

The partition in DataFile should include type information to facilitate validation, e.g. the field name and field id.

Do you mean we should maintain a partition schema in the DataFile struct, or do you mean to maintain it in the serialized format?

Append operations need to add validation checks for schema evolution: lower_bound and upper_bound.

I'm also confused about this part. I think it's reasonable to validate that the arrow record batch's schema matches the current table schema we are using, but doing schema evolution at write time doesn't sound like a good idea to me. They should be two separate transaction updates.
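
A dependency-free sketch of that pre-write check, with `FieldDef` as a stand-in for the real arrow/iceberg schema types:

```rust
#[derive(PartialEq)]
struct FieldDef {
    name: String,
    data_type: String, // e.g. "Int64", "Utf8"
    nullable: bool,
}

/// Fail the append instead of silently evolving anything: schema evolution should be
/// its own transaction update, applied before the write.
fn check_batch_schema(batch: &[FieldDef], table: &[FieldDef]) -> Result<(), String> {
    if batch == table {
        Ok(())
    } else {
        Err("record batch schema does not match the current table schema".to_string())
    }
}

fn main() {
    let table = vec![FieldDef { name: "id".into(), data_type: "Int64".into(), nullable: false }];
    let batch = vec![FieldDef { name: "id".into(), data_type: "Int64".into(), nullable: false }];
    assert!(check_batch_schema(&batch, &table).is_ok());
}
```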
