@xiangfu0
Contributor

Motivation

  • Allow offline dimension tables to support UPSERT-like semantics so later segments can deterministically overwrite earlier rows with the same primary key instead of producing duplicates.
  • Prevent invalid table configuration combinations that would enable both upsert semantics and strict duplicate-key errors.

Description

  • Add an enableUpsert flag to DimensionTableConfig with a JSON-aware constructor, backwards-compatible constructors, and an isEnableUpsert() getter in pinot-spi.
  • Thread enableUpsert into DimensionTableDataManager: add an _enableUpsert field read from DimensionTableConfig, and throw on a duplicate key only when !_enableUpsert && _errorOnDuplicatePrimaryKey, in both the fast-lookup and memory-optimized loading paths.
  • When upsert is enabled, order segments deterministically via sortSegmentsForUpsert(...), which sorts by indexCreationTime and then by segmentName so that later segments overwrite earlier ones.
  • Add validateDimensionTableConfig(...) in TableConfigUtils.validate to reject configs that set both enableUpsert and errorOnDuplicatePrimaryKey.
  • Add and adjust tests in DimensionTableDataManagerTest (including testUpsertOverwritesDuplicatePrimaryKey and the new testUpsertDedupesAcrossSegments), add a validation test in TableConfigUtilsTest, add a createSegmentFromCsv test helper, and update existing tests to pass the new flag.
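
For readers who want the ordering and overwrite rule in one place, here is a minimal sketch of the behavior described above. It is not the code in this PR: SegmentSketch, UpsertLoadSketch, and the map-based lookup table are hypothetical stand-ins, and only the names sortSegmentsForUpsert, indexCreationTime, enableUpsert, and errorOnDuplicatePrimaryKey come from the description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a loaded segment; not a Pinot class.
final class SegmentSketch {
  final String name;
  final long indexCreationTime;
  final Map<String, String> rows; // primary key -> row payload

  SegmentSketch(String name, long indexCreationTime, Map<String, String> rows) {
    this.name = name;
    this.indexCreationTime = indexCreationTime;
    this.rows = rows;
  }
}

final class UpsertLoadSketch {
  // Sort by creation time, then by segment name as a tie-breaker,
  // so the load order does not depend on the input order.
  static List<SegmentSketch> sortSegmentsForUpsert(List<SegmentSketch> segments) {
    List<SegmentSketch> sorted = new ArrayList<>(segments);
    sorted.sort(Comparator.comparingLong((SegmentSketch s) -> s.indexCreationTime)
        .thenComparing(s -> s.name));
    return sorted;
  }

  // Build the lookup table. A duplicate key is an error only when upsert is
  // disabled and errorOnDuplicatePrimaryKey is set; otherwise the row loaded
  // later replaces the earlier one.
  static Map<String, String> load(List<SegmentSketch> segments, boolean enableUpsert,
      boolean errorOnDuplicatePrimaryKey) {
    List<SegmentSketch> ordered = enableUpsert ? sortSegmentsForUpsert(segments) : segments;
    Map<String, String> lookup = new LinkedHashMap<>();
    for (SegmentSketch segment : ordered) {
      for (Map.Entry<String, String> row : segment.rows.entrySet()) {
        String previous = lookup.put(row.getKey(), row.getValue());
        if (previous != null && !enableUpsert && errorOnDuplicatePrimaryKey) {
          throw new IllegalStateException("Duplicate primary key: " + row.getKey());
        }
      }
    }
    return lookup;
  }
}
```

In this sketch, calling load with enableUpsert set to true returns the row from the segment with the latest indexCreationTime for each key, whatever order the segments were passed in.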

Testing

  • Unit tests added or updated: DimensionTableDataManagerTest#testUpsertOverwritesDuplicatePrimaryKey and DimensionTableDataManagerTest#testUpsertDedupesAcrossSegments, plus an invalid-config check in TableConfigUtilsTest; these cover overwrite semantics and invalid config detection.
  • No automated test suites (for example mvn runs or CI) were executed as part of this change.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds upsert semantics to offline dimension tables, allowing later segments to deterministically overwrite earlier rows with the same primary key instead of producing duplicates. The implementation includes configuration validation to prevent conflicting settings.

Changes:

  • Added enableUpsert flag to DimensionTableConfig with JSON-aware and backwards-compatible constructors
  • Modified DimensionTableDataManager to sort segments by creation time when upsert is enabled and conditionally allow duplicate keys
  • Added validation in TableConfigUtils to reject configurations that enable both enableUpsert and errorOnDuplicatePrimaryKey
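
The conflicting-settings check in the last item can be sketched as below. This is an illustration under assumptions, not the validation code in the PR: DimConfigSketch is a hypothetical stand-in for DimensionTableConfig, and the exception type and message are guesses.

```java
// Hypothetical stand-in carrying only the two relevant flags; not the Pinot class.
final class DimConfigSketch {
  final boolean enableUpsert;
  final boolean errorOnDuplicatePrimaryKey;

  DimConfigSketch(boolean enableUpsert, boolean errorOnDuplicatePrimaryKey) {
    this.enableUpsert = enableUpsert;
    this.errorOnDuplicatePrimaryKey = errorOnDuplicatePrimaryKey;
  }
}

final class DimConfigValidationSketch {
  // Reject a config that asks for overwrite-on-duplicate and error-on-duplicate
  // at once. The exception type is an assumption made for this sketch.
  static void validateDimensionTableConfig(DimConfigSketch config) {
    if (config == null) {
      return; // nothing to validate when no dimension table config is present
    }
    if (config.enableUpsert && config.errorOnDuplicatePrimaryKey) {
      throw new IllegalStateException(
          "enableUpsert and errorOnDuplicatePrimaryKey cannot both be enabled");
    }
  }
}
```

Each flag on its own passes the check; only the combination is rejected.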

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pinot-spi/src/main/java/org/apache/pinot/spi/config/table/DimensionTableConfig.java | Adds _enableUpsert field with JSON-aware constructor and backwards-compatible overloads |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/TableConfigUtils.java | Adds validation method to reject conflicting upsert and error-on-duplicate settings |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java | Tests validation of invalid dimension table configuration combining upsert and error-on-duplicate |
| pinot-core/src/main/java/org/apache/pinot/core/data/manager/offline/DimensionTableDataManager.java | Implements upsert logic with segment sorting and conditional duplicate-key error handling |
| pinot-core/src/test/java/org/apache/pinot/core/data/manager/offline/DimensionTableDataManagerTest.java | Adds test helper and tests for upsert behavior across segments, updates existing tests with new parameter |

@codecov-commenter

codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 65.62500% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.22%. Comparing base (c899956) to head (5820a8e).
⚠️ Report is 12 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ata/manager/offline/DimensionTableDataManager.java | 77.77% | 3 Missing and 1 partial ⚠️ |
| ...e/pinot/spi/config/table/DimensionTableConfig.java | 42.85% | 3 Missing and 1 partial ⚠️ |
| ...he/pinot/segment/local/utils/TableConfigUtils.java | 57.14% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17536      +/-   ##
============================================
+ Coverage     63.21%   63.22%   +0.01%     
  Complexity     1476     1476              
============================================
  Files          3170     3170              
  Lines        189508   189552      +44     
  Branches      28997    29002       +5     
============================================
+ Hits         119789   119845      +56     
+ Misses        60417    60399      -18     
- Partials       9302     9308       +6     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.17% <65.62%> (-0.03%) ⬇️ |
| java-21 | 63.19% <65.62%> (+7.66%) ⬆️ |
| temurin | 63.22% <65.62%> (+0.01%) ⬆️ |
| unittests | 63.22% <65.62%> (+0.01%) ⬆️ |
| unittests1 | 55.57% <62.50%> (+0.01%) ⬆️ |
| unittests2 | 34.03% <21.87%> (-0.01%) ⬇️ |

@xiangfu0 xiangfu0 requested a review from Copilot January 21, 2026 17:28

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


private File createSegmentFromCsv(File csvFile, TableConfig tableConfig, Schema schema, String segmentName)
    throws Exception {
  File tableDataDir = new File(TEMP_DIR, OFFLINE_TABLE_NAME + "_upsert");
Copilot AI Jan 21, 2026

The hardcoded suffix '_upsert' in the directory name could cause collisions if createSegmentFromCsv is called multiple times in the same test or across tests. Consider using a unique directory name per invocation, such as appending the segment name or a timestamp.
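
One way to act on this suggestion is sketched below, assuming the helper can take its working directory from a small utility. SegmentDirSketch and uniqueTableDataDir are hypothetical names that do not appear in the PR; the sketch uses the JDK's Files.createTempDirectory, which appends a random component to the given prefix so that repeated calls do not collide.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical helper; not part of the PR.
final class SegmentDirSketch {
  // Returns a fresh directory under tempDir for each call. The segment name
  // goes into the prefix for readability; uniqueness comes from the random
  // suffix that Files.createTempDirectory adds.
  static File uniqueTableDataDir(File tempDir, String tableName, String segmentName) {
    try {
      Files.createDirectories(tempDir.toPath());
      return Files.createTempDirectory(tempDir.toPath(), tableName + "_" + segmentName + "_")
          .toFile();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Appending only the segment name would also work when every call in a test uses a distinct name; the random suffix additionally covers repeated calls with the same name.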

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
