Delete orphan files #1200

Open
Tracked by #1065
sungwy opened this issue Sep 24, 2024 · 5 comments

@sungwy
Collaborator

sungwy commented Sep 24, 2024

Introduce a new API to delete orphan files for a given table

Feature reference: https://iceberg.apache.org/docs/1.5.1/maintenance/#delete-orphan-files

@omkenge
Contributor

omkenge commented Oct 29, 2024

Hi @sungwy,
I would like to work on this. May I?

@sungwy
Collaborator Author

sungwy commented Oct 29, 2024

Hey sure thing! I'll assign it to you @omkenge

@omkenge
Contributor

omkenge commented Nov 21, 2024

Orphan File Deletion in Iceberg Tables
Here's a step-by-step breakdown of the logic behind the process:

  1. List All Files in Storage
  2. Extract Referenced Files from Table Metadata
  3. Identify Orphan Files
    Compare the list of all files in storage with the list of files referenced by the Iceberg table.
    Orphan files are those that exist in storage but are not part of the current table metadata.
    They are found by subtracting the set of referenced files from the set of all files in storage.
  4. Delete Orphan Files
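
The four steps above can be sketched in plain Python. This is only an illustration of the set-subtraction logic, assuming the table lives in a local directory; `list_all_files`, `find_orphan_files`, and `delete_files` are hypothetical names and are not part of PyIceberg, and the set of referenced files (step 2) is assumed to be supplied by the caller.

```python
import os
from typing import Iterable, List, Set


def list_all_files(root: str) -> Set[str]:
    """Step 1: recursively list every file under a local directory."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def find_orphan_files(all_files: Iterable[str], referenced_files: Iterable[str]) -> Set[str]:
    """Step 3: files in storage minus files referenced by the table metadata."""
    return set(all_files) - set(referenced_files)


def delete_files(paths: Iterable[str], dry_run: bool = True) -> List[str]:
    """Step 4: delete the given files.

    With dry_run=True (the default) nothing is removed; the function only
    returns the sorted list of paths that would be deleted.
    """
    selected = []
    for path in sorted(paths):
        if not dry_run:
            os.remove(path)
        selected.append(path)
    return selected
```

Defaulting to a dry run is a design choice of this sketch: deleting files is irreversible, so it seems safer to let the caller inspect the candidate list first.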

What is your opinion on this?
@kevinjqliu @Fokko @sungwy

@kevinjqliu
Contributor

That looks generally correct to me. There are a few caveats, though: this assumes that the entire Iceberg table (metadata and data files) is in a single location and that no other files should exist there.

I think a good first step is to figure out all the files belonging to an Iceberg table: given a table, return all metadata and data file paths, including historical lineage, branches, and tags.
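
As a rough illustration of that first step, here is a sketch over a simplified in-memory model of table metadata. The `Manifest`, `Snapshot`, and `TableMetadata` classes and the `all_referenced_files` function are hypothetical stand-ins made up for this example, not PyIceberg types. The idea is to walk every snapshot the metadata still tracks rather than only the current one. Branches and tags are references to snapshots, so they are covered as long as the snapshots they point to are in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Manifest:
    path: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: str
    manifests: List[Manifest] = field(default_factory=list)


@dataclass
class TableMetadata:
    # Current and previous metadata JSON files.
    metadata_files: List[str]
    # Every snapshot still tracked, not just the current one.
    snapshots: List[Snapshot]
    # Branch or tag name -> snapshot id (kept only to show where refs point).
    refs: Dict[str, int] = field(default_factory=dict)


def all_referenced_files(table: TableMetadata) -> Set[str]:
    """Collect metadata, manifest-list, manifest, and data file paths."""
    paths = set(table.metadata_files)
    for snapshot in table.snapshots:
        paths.add(snapshot.manifest_list)
        for manifest in snapshot.manifests:
            paths.add(manifest.path)
            paths.update(manifest.data_files)
    return paths
```

Because the result is a set, a manifest or data file shared by several snapshots is reported once.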

@ndrluis
Collaborator

ndrluis commented Nov 24, 2024

@omkenge I believe you will need to wait for the merge of #1285. In the meantime, I will work on the partition statistics over the next few weeks. Before that, I believe we will be tracking all the files in the metadata (this needs to be double-checked). With that, you will be able to verify what could be removed.

Another point is the filesystem that will be responsible for scanning the directory. FileIO is not how we solve this, so we will need to use something else. Perhaps OpenDAL would be a good candidate. As a reference, you can see that the Java implementation uses the Hadoop filesystem.
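
One way to keep that choice open is to hide the directory scan behind a small interface, so the orphan-detection logic does not depend on which filesystem library ends up doing the listing. The sketch below is an assumption about how this could be structured, not an existing PyIceberg API: `FileLister`, `LocalFileLister`, and `scan_table_location` are hypothetical names, and only a local-disk implementation is shown. A backend built on OpenDAL or another library would need to provide the same `list_files` method.

```python
import os
from typing import Iterator, Protocol, Set


class FileLister(Protocol):
    """Anything that can recursively list the files under a location."""

    def list_files(self, location: str) -> Iterator[str]: ...


class LocalFileLister:
    """Local-disk implementation based on os.walk."""

    def list_files(self, location: str) -> Iterator[str]:
        for dirpath, _dirnames, filenames in os.walk(location):
            for name in filenames:
                yield os.path.join(dirpath, name)


def scan_table_location(lister: FileLister, location: str) -> Set[str]:
    """Return every file path the lister finds under the table location."""
    return set(lister.list_files(location))
```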
