
Add table statistics #1285

Open · ndrluis wants to merge 3 commits into main from add-statistics

Conversation

@ndrluis (Collaborator) commented on Nov 4, 2024

The Java expire snapshots process expires table statistics and partition statistics. I am implementing table statistics support so that our expire snapshots procedure is compatible with the Java implementation.
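For context, the cleanup this enables would look roughly like the sketch below; the helper name and the retained_snapshot_ids argument are illustrative assumptions, not code from this PR.

# Illustrative sketch: when snapshots are expired, drop the statistics entries
# that reference them, mirroring the Java expire-snapshots behaviour.
def statistics_for_retained_snapshots(statistics, retained_snapshot_ids):
    return [stat for stat in statistics if stat.snapshot_id in retained_snapshot_ids]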

ndrluis changed the title from "Add table statistics update" to "Add table statistics" on Nov 4, 2024
@ndrluis (Collaborator, Author) commented on Nov 4, 2024

I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation.

@Fokko @sungwy @kevinjqliu

ndrluis changed the title from "Add table statistics" to "WIP: Add table statistics" on Nov 4, 2024
@kevinjqliu (Contributor) left a comment:


Thanks for the PR! Added a few comments. I think it would also be helpful to include integration tests.

pyiceberg/table/metadata.py (resolved)
statistics_path: str = Field(alias="statistics-path")
file_size_in_bytes: int = Field(alias="file-size-in-bytes")
file_footer_size_in_bytes: int = Field(alias="file-footer-size-in-bytes")
blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")
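For reference, the fields above correspond to a statistics-file entry in the table metadata. A self-contained Pydantic-style sketch is shown below; the StatisticsFile fields mirror the diff, while the BlobMetadata fields are assumptions based on the Iceberg table spec, not copied from this PR.

from typing import Dict, List, Optional
from pydantic import BaseModel, Field

class BlobMetadata(BaseModel):
    # Blob metadata fields follow the Iceberg table-spec statistics structure (assumed here).
    type: str
    snapshot_id: int = Field(alias="snapshot-id")
    sequence_number: int = Field(alias="sequence-number")
    fields: List[int]
    properties: Optional[Dict[str, str]] = None

class StatisticsFile(BaseModel):
    # Mirrors the fields shown in the diff above.
    snapshot_id: int = Field(alias="snapshot-id")
    statistics_path: str = Field(alias="statistics-path")
    file_size_in_bytes: int = Field(alias="file-size-in-bytes")
    file_footer_size_in_bytes: int = Field(alias="file-footer-size-in-bytes")
    blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")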

The alter table builder.
"""
updates = (
RemoveStatisticsUpdate(
@kevinjqliu (Contributor):

Do you mind linking the Java implementation? Do we want to remove all stats?

@ndrluis (Collaborator, Author):

I understand that we want to remove the statistics of a specific snapshot, and I understand that we have one statistics file per snapshot.

The equivalent Java code is the SetStatistics class, which follows the same pattern as our ManageSnapshot class. This is the scenario I want to double-check, to make sure we follow the same pattern.
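To illustrate the ManageSnapshot-style builder pattern being discussed, here is a minimal, self-contained sketch. The SetStatisticsUpdate/RemoveStatisticsUpdate dataclasses are simplified stand-ins for the PR's update classes, and the transaction.apply hook is an assumption, not the PR's actual code.

from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class SetStatisticsUpdate:
    snapshot_id: int
    statistics: Any  # a StatisticsFile entry

@dataclass
class RemoveStatisticsUpdate:
    snapshot_id: int

class UpdateStatistics:
    """Queues statistics updates and applies them all at once on commit()."""

    def __init__(self, transaction) -> None:
        self._transaction = transaction
        self._updates: Tuple[Any, ...] = ()

    def set_statistics(self, snapshot_id: int, statistics_file) -> "UpdateStatistics":
        self._updates += (SetStatisticsUpdate(snapshot_id, statistics_file),)
        return self

    def remove_statistics(self, snapshot_id: int) -> "UpdateStatistics":
        self._updates += (RemoveStatisticsUpdate(snapshot_id),)
        return self

    def commit(self) -> None:
        # Hand the queued updates to the enclosing transaction (assumed hook).
        self._transaction.apply(self._updates)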

if update.snapshot_id != update.statistics.snapshot_id:
raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")

rest_statistics = [stat for stat in base_metadata.statistics if stat.snapshot_id != update.snapshot_id]
@kevinjqliu (Contributor):

nit: this can be a helper function to filter on snapshot_id

},
{
"snapshot-id": 3055729675574597004,
"statistics-path": "s3://a/b/stats.puffin",
@kevinjqliu (Contributor):

Does this file need to exist on disk?

@ndrluis (Collaborator, Author):

No, there is no validation in place. This is only used by clients that support puffin files and by the expire snapshots procedure, which removes this information from the metadata. If the user wants, they can also remove the file itself as part of the expire snapshots procedure.

ndrluis force-pushed the add-statistics branch 2 times, most recently from 9b15c86 to d16ef47, on November 10, 2024
ndrluis requested a review from kevinjqliu on November 10, 2024
ndrluis changed the title from "WIP: Add table statistics" to "Add table statistics" on November 10, 2024
ndrluis marked this pull request as ready for review on November 10, 2024
@ndrluis (Collaborator, Author) commented on Nov 10, 2024

@kevinjqliu could you please review it once more?

@kevinjqliu (Contributor) left a comment:


Added a few comments.

Do you know which engines can currently generate puffin files? It would be great to add an integration test with a Spark-generated puffin file.

table.update_statistics()
.set_statistics(snapshot_id1, statistics_file1)
.remove_statistics(snapshot_id2)
# Operations are applied on commit.
@kevinjqliu (Contributor):

nit: add .commit() instead of the comment

@kevinjqliu (Contributor):

or use snapshot_id=1

@ndrluis (Collaborator, Author):

I added the commit() and kept the comment consistent with the other examples.
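With that change, the documentation example reads roughly as follows. This is a sketch only: table, snapshot_id1, snapshot_id2, and statistics_file1 are placeholders from the example above, not defined here.

# Chained form; the queued operations are applied when commit() is called.
(
    table.update_statistics()
    .set_statistics(snapshot_id1, statistics_file1)
    .remove_statistics(snapshot_id2)
    .commit()
)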

Comment on lines 1150 to 1151
update.set_statistics(1, statistics_file)
update.remove_statistics(2)
@kevinjqliu (Contributor):

nit: replace 1/2 with snapshot_id1/snapshot_id2 to show the input relation

blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")


def reject_statistics(
@kevinjqliu (Contributor):

nit: how about filter_statistics_by_snapshot_id?
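For illustration, the renamed helper might look roughly like this, based on the filtering expression shown earlier; the parameter names and the StatisticsFile annotation are assumptions, not the PR's exact signature.

from typing import List

def filter_statistics_by_snapshot_id(
    statistics: List["StatisticsFile"],
    reject_snapshot_id: int,
) -> List["StatisticsFile"]:
    # Keep every statistics entry except those attached to the rejected snapshot.
    return [stat for stat in statistics if stat.snapshot_id != reject_snapshot_id]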

@ndrluis (Collaborator, Author) commented on Nov 12, 2024

> Do you know which engines can currently generate puffin files? It would be great to add an integration test with a Spark-generated puffin file.

@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update.

What do you think?

@kevinjqliu (Contributor) left a comment:

LGTM! Thanks for working on this!

Regarding the integration tests: since we're manipulating table metadata to add/remove table stats, it would be great to verify that another engine can interact with these stats. Not a hard blocker, though.

ndrluis mentioned this pull request on Nov 24, 2024