feat: Add DeleteFileIndex to improve position delete lookup #2918
Conversation
jayceslesar left a comment
I basically left two nits; the existing integration tests are passing, which gives confidence, and the unit tests also look good here.
```python
if lower and upper and lower == upper:
    try:
        return lower.decode("utf-8")
    except (UnicodeDecodeError, AttributeError):
        pass
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider using contextlib.suppress here instead of the except pass
```python
from pydantic import Field
from sortedcontainers import SortedList
```
Unrelated to this PR, but I noticed that there is just one more occurrence of sortedcontainers outside of the tests. Might be interesting to see if we can get rid of it.
```python
completed_futures: SortedList[Future[list[ManifestFile]]] = SortedList(iterable=[], key=lambda f: futures_index[f])
```
last occurrence
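Not the PR's code, but one way that remaining `SortedList` usage might be dropped with only the standard library: collect futures as they complete and sort once by submission index at the end. This assumes the results are only needed after all futures finish (the `run_in_order` name and `tasks` parameter are illustrative):

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed


def run_in_order(tasks):
    """Run callables concurrently, returning results in submission order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in tasks]
        futures_index = {f: i for i, f in enumerate(futures)}
        completed: list[Future] = []
        for f in as_completed(futures):
            completed.append(f)  # arrival order, possibly out of submission order
        # Sort once instead of maintaining a SortedList incrementally.
        completed.sort(key=lambda f: futures_index[f])
        return [f.result() for f in completed]
```

The trade-off: a `SortedList` keeps the collection ordered while futures are still completing, whereas this sketch only restores order once everything is done.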
```python
self._ensure_indexed()
if not self._files:
    return []
start_idx = bisect_left(self._seqs, seq)
```
Nice!
note the old solution uses bisect_right, which might have been a bug
might be worth it to add an integration test scenario for this
Yes! The old `bisect_right` returned the insertion point to the right of the sorted sequences with the same key. This meant that any delete file with the same sequence number was excluded.
Per the Iceberg spec https://iceberg.apache.org/spec/#scan-planning: "The data file's data sequence number is less than or equal to the delete file's data sequence number"
This situation can occur when data files and position delete files are added in the same commit using the row-delta logic, i.e. a merge statement.
Looking at a test on the Java side, we can see Java is inclusive, returning the files with the same sequence number as part of the filter lookup.
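A quick standalone illustration of the difference (the sequence numbers are made up): `bisect_left` keeps delete files whose sequence number equals the data file's, matching the spec's "less than or equal to" rule, while `bisect_right` would skip them.

```python
from bisect import bisect_left, bisect_right

# Delete files sorted by data sequence number (illustrative values).
seqs = [1, 2, 2, 3]

# Find deletes that apply to a data file with sequence number 2.
# Per the spec, deletes with seq == 2 must apply to this data file.
data_seq = 2

inclusive = seqs[bisect_left(seqs, data_seq):]   # keeps the equal-seq deletes
exclusive = seqs[bisect_right(seqs, data_seq):]  # drops them (the old bug)
```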
Fokko left a comment
One small comment around Record, apart from that, it looks good! Thanks @geruh for splitting this logic out, looks much better!
Pull request overview
This PR introduces a DeleteFileIndex class to significantly improve the performance of position delete file lookup during scan planning. The implementation replaces the previous linear scanning approach with an efficient indexing strategy that organizes delete files by exact file path and by partition key, using binary search for sequence number filtering.
Changes:
- Added new `DeleteFileIndex` and `PositionDeletes` classes for efficient delete file indexing and retrieval
- Replaced the `_match_deletes_to_data_file` function with the `DeleteFileIndex.for_data_file` method
- Removed obsolete tests for the old implementation
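The indexing strategy described above can be sketched roughly as follows. This is an illustrative toy, not the PR's actual `DeleteFileIndex` API: the class names, tuple-based partition keys, and path-only return values are all invented for the example.

```python
from bisect import bisect_left
from collections import defaultdict


class PositionDeletesSketch:
    """Illustrative: delete files for one key, lazily sorted by sequence number."""

    def __init__(self) -> None:
        self._entries: list[tuple[int, str]] = []  # (sequence number, delete file path)
        self._sorted = False

    def add(self, seq: int, path: str) -> None:
        self._entries.append((seq, path))
        self._sorted = False

    def filter(self, data_seq: int) -> list[str]:
        # Sort once on first lookup, then binary-search on sequence number.
        if not self._sorted:
            self._entries.sort(key=lambda e: e[0])
            self._sorted = True
        seqs = [e[0] for e in self._entries]
        start = bisect_left(seqs, data_seq)  # inclusive: seq == data_seq matches
        return [path for _, path in self._entries[start:]]


class DeleteFileIndexSketch:
    """Illustrative two-level index: by exact data file path, then by partition key."""

    def __init__(self) -> None:
        self.by_path: defaultdict[str, PositionDeletesSketch] = defaultdict(PositionDeletesSketch)
        self.by_partition: defaultdict[tuple, PositionDeletesSketch] = defaultdict(PositionDeletesSketch)

    def for_data_file(self, data_seq: int, path: str, partition: tuple) -> list[str]:
        # Path-scoped deletes apply only to this file; partition-scoped
        # deletes apply to every data file in the partition.
        return self.by_path[path].filter(data_seq) + self.by_partition[partition].filter(data_seq)
```

The point of the two-level layout is that a lookup touches only the one path bucket and one partition bucket for the data file, instead of scanning every delete file linearly.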
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `pyiceberg/table/delete_file_index.py` | New module implementing `DeleteFileIndex` for indexing position deletes by path and partition, and `PositionDeletes` for lazy-sorted sequence number filtering using bisect |
| `pyiceberg/table/__init__.py` | Integrated `DeleteFileIndex` into `DataScan.plan_files()`, replacing the old `SortedList`-based approach; removed unused imports and the old `_match_deletes_to_data_file` function |
| `tests/table/test_delete_file_index.py` | Comprehensive test suite covering empty index, sequence number filtering, path-specific deletes, partitioned deletes, deletion vectors, lazy sorting, and immutability after indexing |
| `tests/table/test_init.py` | Removed tests for the old `_match_deletes_to_data_file` function and cleaned up unused imports |
kevinjqliu left a comment
LGTM! This is awesome
I think this speeds up matching delete files from O(n) -> O(1), and also evaluates lazily
Great start to unlock eq deletes and DVs
```python
    return evaluator.eval(delete_file)


def _referenced_data_file_path(delete_file: DataFile) -> str | None:
```
a little awkward that delete_file is of type DataFile, we can refactor this later perhaps
```python
def _create_deletion_vector(
    sequence_number: int = 1, file_path: str = "s3://bucket/data.parquet", spec_id: int = 0
) -> ManifestEntry:
    delete_file = DataFile.from_args(
        content=DataFileContent.POSITION_DELETES,
        file_path=f"s3://bucket/deletion-vector-{sequence_number}.puffin",
        file_format=FileFormat.PUFFIN,
        partition=Record(),
        record_count=10,
        file_size_in_bytes=100,
        lower_bounds={PATH_FIELD_ID: file_path.encode()},
        upper_bounds={PATH_FIELD_ID: file_path.encode()},
    )
    delete_file._spec_id = spec_id
    return ManifestEntry.from_args(status=ManifestEntryStatus.ADDED, sequence_number=sequence_number, data_file=delete_file)
```
ha, might as well add it to the source code 😄 we can follow up with this
Thanks @geruh, and thank you @Fokko @jayceslesar @copilot for the review!
Related to #2255.
Rationale for this change
This PR is a piece of the existing DFI PR in #2255. However, this rips out the existing delete->data matching behavior for deletes and indexes them for efficient lookup.
The previous implementation:
- built an `_InclusiveMetricsEvaluator` instance for each data file

Now we extend this workflow with a `DeleteFileIndex` that:
- indexes delete files by exact referenced data file path and by partition key, filtering on sequence number with binary search

This aligns with the Java implementation of the DeleteFileIndex, following the Python infra.
Are these changes tested?
New tests were added, and existing tests continue to pass.
Are there any user-facing changes?
No