Skip to content

LAST/FIRST/TAKE aggregation should support TEXT type and Scripts#5091

Open
LantaoJin wants to merge 9 commits intoopensearch-project:mainfrom
LantaoJin:pr/issues/5086
Open

LAST/FIRST/TAKE aggregation should support TEXT type and Scripts#5091
LantaoJin wants to merge 9 commits intoopensearch-project:mainfrom
LantaoJin:pr/issues/5086

Conversation

@LantaoJin
Copy link
Member

@LantaoJin LantaoJin commented Jan 30, 2026

Description

This PR includes following fixings:

  1. LAST/FIRST/TAKE aggregations on TEXT type should use root name:

fetchField can be used on TEXT type, but only with root name

  1. Use fetchScript API instead of fetchField for LAST/FIRST/TAKE aggregations on Script:

fetchField cannot be used in script pushdown, for example | eval new_name = upper(name) | stats last(new_name)

  1. LAST/FIRST/TAKE aggregations on ALIAS type should use alias name:

fetchSource cannot be used in Alias type, fetchField will dedupe the list if we use its original name.

  1. Pushdown LIMIT to TopHits Aggregation

  2. Use fetchField instead of fetchSource as possible

fetchField can return good format on date type

Related Issues

Resolves #5086

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ation

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • First/Last/Take aggregations now support text and alias-backed fields and improved script-based sorting in aggregation explain plans
  • Bug Fixes

    • Unified how top-hits return fields (explicit fields vs _source) and set _source:false where appropriate
    • Improved limit pushdown into top-hits to reduce returned data
  • Tests

    • Expanded integration tests, expected explain outputs, and test data mappings for text/date/alias scenarios

Walkthrough

Centralized TopHits construction and parsing; added script-sort and alias handling; enabled TopHits size pushdown; switched field-fetching between _source and fields; updated index mappings/test data for alias/text/date; and extended/added many integration tests and explain fixtures for FIRST/LAST/TAKE on text/alias fields.

Changes

Cohort / File(s) Summary
Aggregate analyzer & TopHits builder
opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
Introduced createTopHitsBuilder(...); refactored MIN/MAX/ARG_MAX/ARG_MIN/TAKE/FIRST/LAST/LITERAL_AGG to use unified TopHits construction with consistent sort/script/_source handling.
Predicate & script-sort utilities
opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
Added ScriptSortType utility and imports; added isStructField() and isAliasField() helpers to NamedFieldExpression; exposed script-sort type resolution for reuse.
TopHits parsing
opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java
Reworked parsing to prefer _source when present, fall back to fields, added helper methods for emptiness/leaf extraction, and adjusted single/merge/multi-hit handling.
Pushdown / scan adjustments
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java, opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java
Switched to external getScriptSortType utility; added TopHits size pushdown branch in AggPushDownAction.
MetricAggregationBuilder field fetch change
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/aggregation/dsl/MetricAggregationBuilder.java
Changed TopHits field fetch from fetchField() to fetchSource() API call.
Unit tests / expectations
opensearch/src/test/java/.../AggregateAnalyzerTest.java, opensearch/src/test/java/.../MetricAggregationBuilderTest.java
Updated expected DSL outputs: switched between fields and _source expressions and added _source:false assertions where applicable.
Integration tests
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteAliasFieldAggregationIT.java, integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java, integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java
Added/renamed alias-related constant, added new tests for FIRST/LAST/TAKE including text/alias cases, and adjusted test setup to load alias index.
Test data & mappings
integ-test/src/test/resources/alias.json, integ-test/src/test/resources/indexDefinitions/alias_index_mapping.json
Added original_text and original_date fields; added alias mappings (alias_text, alias_col, @timestamp); augmented test documents with original_text/original_date.
Expected explain fixtures
integ-test/src/test/resources/expectedOutput/calcite/**, .../calcite_no_pushdown/**, .../ppl/**
Updated many YAML expected plans to reflect top_hits field-selection changes (switches between explicit fields arrays and _source includes, _source:false additions) and added new explain fixtures (first/last text, take negative, alias scenarios).
Small test resource updates
integ-test/src/test/resources/expectedOutput/.../
Multiple formatting/whitespace adjustments and occasional duplicate/line changes in explain YAMLs to match new pushdown/field-selection outputs.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Query
    participant AA as AggregateAnalyzer
    participant PA as PredicateAnalyzer
    participant TBuild as TopHitsBuilder
    participant OS as OpenSearch API
    participant Parser as TopHitsParser

    Client->>AA: Request plan for FIRST/LAST/TAKE
    AA->>PA: Ask for script sort type when needed
    PA-->>AA: Return ScriptSortType
    AA->>TBuild: createTopHitsBuilder(size, sort, script, singleCol)
    TBuild->>TBuild: Configure sort/_source/fields/size/scriptFields
    AA->>OS: Push OpenSearch aggregation (top_hits) request
    OS-->>Parser: Return aggregation response (may include _source or fields)
    Parser->>Parser: Choose _source vs fields, extract leaf values or maps
    Parser-->>Client: Aggregation result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #4929: Overlapping TopHits aggregation/pushdown changes and planner adjustments.
  • PR #4979: Related AggregateAnalyzer refactor and TopHits/aggregation builder logic changes.
  • PR #4957: Related TopHitsParser and explain-fixtures updates affecting explain outputs.

Suggested labels

bugFix, calcite

Suggested reviewers

  • penghuo
  • ps48
  • kavithacm
  • derek-ho
  • joshuali925
  • GumpacG
  • Swiddis
  • Yury-Fridlyand
  • mengweieric
  • YANG-DB
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding TEXT type and Script support for LAST/FIRST/TAKE aggregations.
Description check ✅ Passed The description is well-structured and directly relates to the changeset, explaining five specific fixes including TEXT type support, script handling, alias type support, limit pushdown, and fetchField preference.
Linked Issues check ✅ Passed The PR addresses issue #5086 by implementing pushdown support for LAST/FIRST/TAKE aggregations on text-typed fields, along with script and alias support, matching the expected behavior in the linked issue.
Out of Scope Changes check ✅ Passed All changes are within scope: test additions verify new functionality, YAML files update expected outputs for the aggregation pushdown changes, and core implementation handles TEXT/script/alias types and limit pushdown as specified in issue #5086.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@LantaoJin LantaoJin added the enhancement New feature or request label Jan 30, 2026
@LantaoJin LantaoJin changed the title Use fetchSource API instead of fetchFields for LAST/FIRST/TAKE aggregation Use fetchSource API instead of fetchFields for LAST/FIRST/TAKE aggregation on TEXT type Feb 3, 2026
@LantaoJin LantaoJin changed the title Use fetchSource API instead of fetchFields for LAST/FIRST/TAKE aggregation on TEXT type LAST/FIRST/TAKE aggregation should support TEXT type and Scripts Feb 3, 2026
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteAliasFieldAggregationIT.java`:
- Line 171: Remove the leftover debug print System.out.println(actual) from the
test method in CalciteAliasFieldAggregationIT; locate the occurrence (the
System.out.println call printing the variable actual) and delete it (or replace
with an appropriate assertion or logger call if output is required for test
diagnostics) so the test contains no debug println statements.

In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java`:
- Around line 314-381: Add unit tests covering null/empty/invalid inputs for the
new text/script FIRST/LAST/TAKE functionality: create tests alongside
testFirstAggregationOnTextField, testLastAggregationOnTextField,
testFirstLastByGroupOnTextField, testFirstLastWithOtherAggregationsOnTextField
and testFirstLastMixedFields that (1) query rows where the text field (e.g.,
employer/email) contains NULL and assert FIRST/LAST return nulls via
verifySchema/verifyDataRows, (2) run queries that yield empty result sets and
assert schema is correct and verifyDataRows handles zero rows, and (3) run
invalid TAKE sizes (0 and negative) and assert the query fails with the expected
error (use executeQuery and assert exception or error message). Ensure tests use
the same helper methods (executeQuery, verifySchema, verifyDataRows) and include
descriptive names like testFirstWithNullTextField, testLastEmptyResultSet,
testTakeInvalidSize.

In
`@opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java`:
- Around line 842-848: The sort type is being derived from the whole row type
(helper.rowType) rather than the specific script expression type, so change the
call to getScriptSortType to use the expression's return type for rex.getKey():
obtain the expression type (e.g., from helper.inferScript(rex.getKey())'s
metadata or rex.getKey().getType()) and pass that type into getScriptSortType
instead of helper.rowType, then use that resulting
ScriptSortBuilder.ScriptSortType when calling
builder.sort(SortBuilders.scriptSort(...)); update any variable names
accordingly (referencing getScriptSortType, helper.inferScript(rex.getKey()),
rex.getKey(), builder.sort and ScriptSortBuilder.ScriptSortType).

In
`@opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java`:
- Around line 477-484: Rename the mistakenly named pattern variable
hotHitsAggBuilder to topHitsAggBuilder in AggPushDownAction where you check
builder instanceof TopHitsAggregationBuilder; update the instanceof declaration
and all subsequent references in that block (e.g., the size() comparisons and
hotHitsAggBuilder.size(size) call) so they use topHitsAggBuilder to match the
TopHitsAggregationBuilder class name.
🧹 Nitpick comments (3)
integ-test/src/test/resources/alias.json (1)

2-6: Consider adding an edge-case record for text/date fields.

The new values look realistic and match ISO-8601, but if alias tests need boundary coverage, consider one doc with an empty/whitespace-only text or a non-UTC offset timestamp.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java (1)

485-486: Update the comment to include TopHits.

The comment on line 485 should be updated to reflect that TopHits is now also supported.

📝 Suggested update
-    // now we only have Composite, Terms and MultiTerms bucket aggregations,
+    // now we only have Composite, Terms, MultiTerms bucket aggregations, and TopHits metric aggregation,
     // add code here when we could support more in the future.
opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java (1)

150-156: Consider handling List type in getLeafValue.

The recursive getLeafValue method handles nested Map values but doesn't account for List values. If the source contains a list (e.g., multi-value fields), this would return the list itself rather than extracting a leaf value.

Is this the intended behavior? If multi-value fields are expected in the source, you may need to handle List similarly:

private Object getLeafValue(Object object) {
  if (object instanceof Map map) {
    return getLeafValue(map.values().iterator().next());
  } else if (object instanceof List list && !list.isEmpty()) {
    return getLeafValue(list.get(0));
  } else {
    return object;
  }
}

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java`:
- Around line 844-851: The script-based sort added in AggregateAnalyzer (inside
the branch handling rex.getKey() instanceof RexCall || RexLiteral) currently
calls builder.sort(SortBuilders.scriptSort(script, sortType)) without applying
the requested sortOrder; update this branch to apply the same sortOrder used for
keyword sorts by setting the order on the ScriptSortBuilder (use the existing
sortOrder variable) before passing it to builder.sort, so script sorts honor
ascending/descending like the keyword path.
🧹 Nitpick comments (1)
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteAliasFieldAggregationIT.java (1)

158-193: Consider adding schema assertions for the new alias aggregation mix.
Data-row checks alone won’t catch type regressions across text/alias/numeric outputs; a verifySchemaInOrder(...) would tighten coverage.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.19-dev enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Pushdown not work when aggregate on text data type with function FIRST/LAST/TAKE

1 participant