Skip to content

feat(compact): add FieldListaggAgg aggregate function#224

Open
Zouxxyy wants to merge 9 commits intoalibaba:mainfrom
Zouxxyy:dev/listagg-fix
Open

feat(compact): add FieldListaggAgg aggregate function#224
Zouxxyy wants to merge 9 commits intoalibaba:mainfrom
Zouxxyy:dev/listagg-fix

Conversation

@Zouxxyy
Copy link
Copy Markdown

@Zouxxyy Zouxxyy commented Apr 13, 2026

Summary

Add listagg aggregate function for MergeTree compaction. Concatenates
string values with a configurable delimiter, with optional distinct mode
to deduplicate tokens.

Test plan

Unit tests cover null handling, empty strings, custom delimiters,
distinct dedup, and type validation.

🤖 Generated with [Claude Code]

Copilot AI review requested due to automatic review settings April 13, 2026 12:45
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.

@Zouxxyy Zouxxyy changed the title feat(compact): add FieldListaggAgg aggregate functions feat(compact): add FieldListaggAgg aggregate function Apr 13, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new listagg field aggregate for MergeTree compaction to concatenate string values with configurable delimiter support, including an optional distinct/dedup mode.

Changes:

  • Introduce FieldListaggAgg aggregator and wire it into FieldAggregatorFactory.
  • Add new field-level options for list-agg-delimiter and distinct via CoreOptions / Options.
  • Add unit tests covering delimiter/null/empty/distinct/type-validation scenarios.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/paimon/core/mergetree/compact/aggregate/field_listagg_agg.h Implements listagg aggregation, including distinct token deduplication logic.
src/paimon/core/mergetree/compact/aggregate/field_aggregator_factory.h Registers listagg in the aggregator factory.
src/paimon/core/core_options.h Adds CoreOptions accessors for listagg delimiter and distinct mode.
src/paimon/core/core_options.cpp Implements option parsing for delimiter and distinct mode.
include/paimon/defs.h Declares new option keys (distinct, list-agg-delimiter).
src/paimon/common/defs.cpp Defines the new option key constants.
src/paimon/core/mergetree/compact/aggregate/field_listagg_agg_test.cpp Adds unit tests for listagg behavior and validation.
src/paimon/CMakeLists.txt Adds the new test file to the test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Reject empty delimiter when distinct is enabled to prevent
  infinite loop in AggDistinctImpl (find("") always returns 0)
- Pre-reserve string capacity and use append() instead of +=
  in AggDistinctImpl to avoid repeated reallocations
- Replace assert() with gtest ADD_FAILURE() in test CreateOptions()
  to ensure failures are reported properly even under NDEBUG

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lxy-9602
Copy link
Copy Markdown
Collaborator

Great work! Thanks for enhancing the aggregation functionality!

// When delimiter is empty, fall back to whitespace split
if (delimiter.empty()) {
delimiter = " ";
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the Java logic is: if distinct is true, it applies a fallback mechanism (e.g. deduplication), otherwise it directly concatenates the values.

@lxy-9602
Copy link
Copy Markdown
Collaborator

By the way, your change helped us identify a bug in the Java implementation:
In distinct mode, it uses contains() to check for duplicates, which can cause false positives due to substring matching.
For example, agg("ab", "a") would incorrectly treat "a" as duplicate even though it's not.

We're preparing a PR today to fix this on the Java side — thanks for catching this discrepancy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants