feat: Add DeleteFileIndex to improve position delete lookup #2918
Conversation
jayceslesar left a comment
I basically left two nits; the existing integration tests are passing, which gives confidence, and the unit tests also look good here.
```python
if lower and upper and lower == upper:
    try:
        return lower.decode("utf-8")
    except (UnicodeDecodeError, AttributeError):
        pass
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider using contextlib.suppress here instead of the except pass
```python
from pydantic import Field
from sortedcontainers import SortedList
```
Unrelated to this PR, but I noticed that there is just one more occurrence of sortedcontainers outside of the tests. Might be interesting to see if we can get rid of it.
```python
completed_futures: SortedList[Future[list[ManifestFile]]] = SortedList(iterable=[], key=lambda f: futures_index[f])
```
last occurrence
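Not the PR's code, but one way that remaining `SortedList` usage might be dropped with only the standard library: collect futures as they complete and sort once by submission index at the end. This assumes the results are only needed after all futures finish (the `run_in_order` name and `tasks` parameter are illustrative):

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed


def run_in_order(tasks):
    """Run callables concurrently, returning results in submission order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in tasks]
        futures_index = {f: i for i, f in enumerate(futures)}
        completed: list[Future] = []
        for f in as_completed(futures):
            completed.append(f)  # arrival order, possibly out of submission order
        # Sort once instead of maintaining a SortedList incrementally.
        completed.sort(key=lambda f: futures_index[f])
        return [f.result() for f in completed]
```

The trade-off: a `SortedList` keeps the collection ordered while futures are still completing, whereas this sketch only restores order once everything is done.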
```python
self._ensure_indexed()
if not self._files:
    return []
start_idx = bisect_left(self._seqs, seq)
```
Nice!
note the old solution uses bisect_right, which might have been a bug
might be worth it to add an integration test scenario for this
Yes! The old `bisect_right` returned the insertion point to the right of the sorted sequences with the same key. This meant that any delete file with the same sequence number was excluded.
Per the Iceberg spec https://iceberg.apache.org/spec/#scan-planning: "The data file's data sequence number is less than or equal to the delete file's data sequence number"
This situation can occur when data files and position delete files are added in the same commit using the row-delta logic, i.e. a merge statement.
Looking at a test on the Java side, we can see Java is inclusive, returning the files with the same sequence number as part of the filter lookup.
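A quick standalone illustration of the difference (the sequence numbers are made up): `bisect_left` keeps delete files whose sequence number equals the data file's, matching the spec's "less than or equal to" rule, while `bisect_right` would skip them.

```python
from bisect import bisect_left, bisect_right

# Delete files sorted by data sequence number (illustrative values).
seqs = [1, 2, 2, 3]

# Find deletes that apply to a data file with sequence number 2.
# Per the spec, deletes with seq == 2 must apply to this data file.
data_seq = 2

inclusive = seqs[bisect_left(seqs, data_seq):]   # keeps the equal-seq deletes
exclusive = seqs[bisect_right(seqs, data_seq):]  # drops them (the old bug)
```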
Fokko left a comment
One small comment around Record, apart from that, it looks good! Thanks @geruh for splitting this logic out, looks much better!
Pull request overview
This PR introduces a DeleteFileIndex class to significantly improve the performance of position delete file lookup during scan planning. The implementation replaces the previous linear scanning approach with an efficient indexing strategy that organizes delete files by exact file path and by partition key, using binary search for sequence number filtering.
Changes:
- Added new `DeleteFileIndex` and `PositionDeletes` classes for efficient delete file indexing and retrieval
- Replaced the `_match_deletes_to_data_file` function with the `DeleteFileIndex.for_data_file` method
- Removed obsolete tests for the old implementation
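The indexing strategy described above can be sketched roughly as follows. This is an illustrative toy, not the PR's actual `DeleteFileIndex` API: the class names, tuple-based partition keys, and path-only return values are all invented for the example.

```python
from bisect import bisect_left
from collections import defaultdict


class PositionDeletesSketch:
    """Illustrative: delete files for one key, lazily sorted by sequence number."""

    def __init__(self) -> None:
        self._entries: list[tuple[int, str]] = []  # (sequence number, delete file path)
        self._sorted = False

    def add(self, seq: int, path: str) -> None:
        self._entries.append((seq, path))
        self._sorted = False

    def filter(self, data_seq: int) -> list[str]:
        # Sort once on first lookup, then binary-search on sequence number.
        if not self._sorted:
            self._entries.sort(key=lambda e: e[0])
            self._sorted = True
        seqs = [e[0] for e in self._entries]
        start = bisect_left(seqs, data_seq)  # inclusive: seq == data_seq matches
        return [path for _, path in self._entries[start:]]


class DeleteFileIndexSketch:
    """Illustrative two-level index: by exact data file path, then by partition key."""

    def __init__(self) -> None:
        self.by_path: defaultdict[str, PositionDeletesSketch] = defaultdict(PositionDeletesSketch)
        self.by_partition: defaultdict[tuple, PositionDeletesSketch] = defaultdict(PositionDeletesSketch)

    def for_data_file(self, data_seq: int, path: str, partition: tuple) -> list[str]:
        # Path-scoped deletes apply only to this file; partition-scoped
        # deletes apply to every data file in the partition.
        return self.by_path[path].filter(data_seq) + self.by_partition[partition].filter(data_seq)
```

The point of the two-level layout is that a lookup touches only the one path bucket and one partition bucket for the data file, instead of scanning every delete file linearly.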
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `pyiceberg/table/delete_file_index.py` | New module implementing `DeleteFileIndex` for indexing position deletes by path and partition, and `PositionDeletes` for lazy-sorted sequence number filtering using bisect |
| `pyiceberg/table/__init__.py` | Integrated `DeleteFileIndex` into `DataScan.plan_files()`, replacing the old `SortedList`-based approach; removed unused imports and the old `_match_deletes_to_data_file` function |
| `tests/table/test_delete_file_index.py` | Comprehensive test suite covering empty index, sequence number filtering, path-specific deletes, partitioned deletes, deletion vectors, lazy sorting, and immutability after indexing |
| `tests/table/test_init.py` | Removed tests for the old `_match_deletes_to_data_file` function and cleaned up unused imports |
kevinjqliu left a comment
LGTM! This is awesome
I think this speeds up matching delete files from O(n) -> O(1), and also evaluates lazily
Great start to unlock eq deletes and DVs
```python
    return evaluator.eval(delete_file)


def _referenced_data_file_path(delete_file: DataFile) -> str | None:
```
a little awkward that delete_file is of type DataFile, we can refactor this later perhaps
```python
def _create_deletion_vector(
    sequence_number: int = 1, file_path: str = "s3://bucket/data.parquet", spec_id: int = 0
) -> ManifestEntry:
    delete_file = DataFile.from_args(
        content=DataFileContent.POSITION_DELETES,
        file_path=f"s3://bucket/deletion-vector-{sequence_number}.puffin",
        file_format=FileFormat.PUFFIN,
        partition=Record(),
        record_count=10,
        file_size_in_bytes=100,
        lower_bounds={PATH_FIELD_ID: file_path.encode()},
        upper_bounds={PATH_FIELD_ID: file_path.encode()},
    )
    delete_file._spec_id = spec_id
    return ManifestEntry.from_args(status=ManifestEntryStatus.ADDED, sequence_number=sequence_number, data_file=delete_file)
```
ha, might as well add it to the source code 😄 we can follow up with this
Thanks @geruh, and thank you @Fokko @jayceslesar @copilot for the review!
Related to #2255.
Rationale for this change
This PR is a piece of the existing DFI PR in #2255. However, this rips out the existing delete->data matching behavior for deletes and indexes them for efficient lookup.
The previous implementation:
- built an `_InclusiveMetricsEvaluator` instance for each data file

Now we extend this workflow with a `DeleteFileIndex` that:
- indexes delete files by exact referenced data file path and by partition key, filtering on sequence number with binary search

This aligns with the Java implementation of the DeleteFileIndex, following the Python infra.
Are these changes tested?
New tests were added, and existing tests continue to pass.
Are there any user-facing changes?
No