HIVE-29368: More accurate pessimistic stats combining #6359

Draft
konstantinb wants to merge 14 commits into apache:master from konstantinb:HIVE-29368-pessimistic

Conversation

@konstantinb (Contributor) commented Mar 13, 2026

What changes were proposed in this pull request?

HIVE-29368: More accurate pessimistic stats combining

Why are the changes needed?

To address a flaw in the current PessimisticStatCombiner: it chooses the maximum NDV across the branches of an expression. For constant expressions this typically results in an NDV of 1, significantly skewing subsequent GROUP BY estimates. The PR implements the logic
min(rows, Sum NDV(branch_i))
suggested by @zabetak here: #6308 (review)
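As a standalone illustration (not the actual Hive code; the class name, method name, and plain-long inputs are assumptions for this sketch), the proposed rule could look like:

```java
public class PessimisticNdvSketch {
    // Pessimistic NDV for a multi-branch expression (e.g. CASE..WHEN):
    // the result cannot have more distinct values than the sum of the
    // branches' NDVs, nor more distinct values than there are rows.
    static long combineNdv(long numRows, long... branchNdvs) {
        long sum = 0;
        for (long ndv : branchNdvs) {
            sum += ndv; // the real code would use an overflow-safe add
        }
        return Math.min(numRows, sum);
    }

    public static void main(String[] args) {
        // Two constant branches (NDV 1 each) over 1000 rows:
        // the old max-based rule gives 1; the new rule gives 2.
        System.out.println(combineNdv(1000, 1, 1)); // prints 2
        // The sum is capped at the row count.
        System.out.println(combineNdv(10, 8, 7));   // prints 10
    }
}
```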

Does this PR introduce any user-facing change?

No

How was this patch tested?

Via new unit test code and a new .q file, plus verification of the changes to the existing impacted .q files. I will also run a batch of integration tests in a private Hive deployment within the next few days and will provide an update then.

… PessimisticStatCombiner to use numRows to identify "const NULL" ColStatistics instances
@konstantinb (Contributor, Author)

@zabetak @okumin
I was unable to find a way to distinguish between "const NULL" ColStatistics objects and objects that truly have an "unknown NDV" without changing the ColStatistics class or refactoring the StatEstimator interface, and making PessimisticStatCombiner aware of the total number of rows from the parent stats.

I believe that the latest changes to StatEstimator are minimally invasive and fully backwards-compatible. If this still feels "too intrusive", I'd appreciate some hints on possible alternatives.

@zabetak (Member) commented Mar 30, 2026

I was unable to find a way to distinguish between "const NULL" ColStatistics objects and objects that truly have an "unknown NDV" without changing the ColStatistics class or refactoring the StatEstimator interface, and making...

@konstantinb Since the previous version of the combiner didn't bother much about unknown NDV values, I am not too opinionated about handling the case where countDistinct is zero. Personally, I would be fine even with something simplistic like the following:

    if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
      result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
    }
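For context, `StatsUtils.safeAdd` above is Hive's overflow-guarded addition; a minimal standalone equivalent (assuming saturating semantics for non-negative counters, which is an assumption of this sketch) might be:

```java
public class SafeAddSketch {
    // Saturating addition for non-negative stats counters: if the sum of two
    // large NDVs would overflow a long, clamp to Long.MAX_VALUE instead.
    static long safeAdd(long a, long b) {
        if (a > Long.MAX_VALUE - b) {
            return Long.MAX_VALUE;
        }
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(safeAdd(35, 7));             // prints 42
        System.out.println(safeAdd(Long.MAX_VALUE, 1)); // clamps to Long.MAX_VALUE
    }
}
```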

From my perspective, missing NDVs are an edge case and not something that should appear too often during optimization.

Having said that, I am not against a more elaborate solution like modifying ColStatistics or the StatsEstimator interface as per previous commits, so I will defer the final decision to @konstantinb.

@sonarqubecloud

mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 40 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 6144 Data size: 183480 Basic stats: COMPLETE Column stats: COMPLETE
konstantinb (Contributor, Author)

@zabetak, this is a good example of the possible impact of NDV==0 on statistics estimates. Due to https://issues.apache.org/jira/browse/HIVE-29534, the Metastore NDV (35) of the column table_onetimestamp.val1 is getting ignored, making the NDV of the underlying ColStatistics object 0. Later, extractNDVGroupingColumns() sets this 0 value to 1. The new estimate of 6144 is far worse than the metastore value of 35, but could be far better than the original "1" in certain queries.

zabetak (Member)

Thanks for the explanation. Since the new estimate is not always better maybe we should rediscuss the change in extractNDVGroupingColumns or maybe address HIVE-29534 first.

@konstantinb (Contributor, Author)

From my perspective missing NDVs is an edge case and not something that should appear too often during optimization.

@zabetak, over the last few months of struggling with NDVs, I have identified various ways to trigger bad estimates with the current code. The following estimation change in an .out file immediately highlights a 35x underestimation of row count due to treating an NDV of 0 as a "true 0 value".

Too small NDV values can lead to severe underestimation of GROUP BY cardinality; too large NDVs can severely underestimate the cardinality of an IN() filter.

When dealing with very large tables (millions to billions of records), it can become prohibitively expensive to accurately track the number of unique values per column. This really makes missing (unknown) NDV a regular occurrence.

For every attempt to estimate NDV using numRows, I quickly discovered a counterexample of a skewed dataset that led to severe misestimation, causing either a performance problem or an outright query failure. I would be happy to provide those examples if you like.

In some of those cases, a better NDV estimation is possible, for example for a CASE..WHEN expression with multiple constant branches. In many other cases, it is genuinely safer to use the most pessimistic algorithm, which handles "unknown NDV" columns surprisingly well.

The very last commit has passing tests. I would appreciate it if you could take another look when you get a chance.

The combiner logic is very close to

    if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
      result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
    }

except for specific handling of NDV==0 for non-constant NULL columns.
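The NDV==0 disambiguation described here can be sketched in isolation (standalone illustration; the class and method names are assumptions, not the PR's actual code):

```java
public class UnknownNdvSketch {
    // NDV == 0 is ambiguous in ColStatistics: it can mean "unknown", or it can
    // be the true NDV of a "const NULL" column. The idea in the PR is to
    // disambiguate via the row count: a const-NULL column has numNulls == numRows.
    static boolean isUnknownNdv(long countDistinct, long numNulls, long numRows) {
        return countDistinct == 0 && numNulls != numRows;
    }

    public static void main(String[] args) {
        // 0 NDV and all 100 rows NULL: a const-NULL column, NDV genuinely 0.
        System.out.println(isUnknownNdv(0, 100, 100)); // prints false
        // 0 NDV but only some rows NULL: the NDV is simply unknown.
        System.out.println(isUnknownNdv(0, 5, 100));   // prints true
    }
}
```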

cc @okumin

@zabetak (Member) left a comment

@konstantinb I am more or less OK with the changes in PessimisticStatCombiner. I outlined a potential caveat, but it is not something that needs to be addressed immediately. I am a bit more skeptical about the changes in extractNDVGroupingColumns, so to expedite the PR maybe we should defer them to another patch. I know that there has been some back and forth with this PR, and I really appreciate the effort you have put into improving the stats, so I am trying to keep the scope limited in order to merge this sooner.

Comment on lines +43 to +45
    default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
      throw new UnsupportedOperationException("This estimator requires parentStats");
    }
Member


Do we need to change this to default?

Contributor Author

konstantinb commented Apr 3, 2026


This change eliminates the need to implement estimate(args) for the impacted estimators, while allowing other estimators that do not need numRows to keep their original estimate(args) implementation without having to implement estimate(args, numRows). In other words, it felt "cleaner" and reduced the number of Java files that required edits.

Comment on lines -2112 to +2117
    -    if (cs.getNumNulls() > 0) {
    +    // NDV needs to be adjusted if a column has a known NDV along with NULL values
    +    // or if a column happens to be "const NULL"
    +    if ((ndv > 0 && cs.getNumNulls() > 0) ||
    +        (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == parentStats.getNumRows())) {
Member


It seems that this change led to more .q.out changes and addresses an NDV issue in a slightly different place. Would it be possible to defer this to another patch/Jira so that we can converge and merge this PR sooner? I understand the intention of the change, but I am a bit reluctant to modify multiple things as part of the same change.

Member


Another reason I suggest deferring this to another patch is that I am not fully convinced we should actually do a +1 on the NDV for nulls. If we continue with the assumption used elsewhere that NULL is not a value and thus does not contribute to NDV, then we shouldn't modify the NDV; possibly we should set numNulls to 1.

Contributor Author

konstantinb commented Apr 3, 2026


@zabetak thank you for your feedback; I fully understand your concerns and the goal of reducing code changes in one PR.

The main problem with keeping this code 100% intact is that any ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and numNulls > 0 would yield a GROUP BY row-count estimate of 1, and this "1" estimate has proven to be the root cause of many severe underestimations.

This makes the method's logic directly correlated with the proposed changes to PessimisticStatCombiner.

Technically, the following two variants are feasible alternatives to the current PR code:

1. Adjust the NDV only when it is known to be positive:

    if (ndv > 0 && cs.getNumNulls() > 0) { ndv = StatsUtils.safeAdd(ndv, 1); }

^^ This adjusts NDV estimations except for columns with (potentially) unknown NDVs. Applying this change did not alter any already-modified .out files on this branch. I believe this is the safest possible change for this logic.

2. Completely remove the +1 adjustment for NULLs:

    ndvValues.add(cs.getCountDistint());

^^ This edit modifies only three files on the current branch.

If you firmly oppose any changes to extractNDVGroupingColumns, then we will likely need to go back to the drawing board on how to approach this whole problem differently.

Member


CASE WHEN statements can appear anywhere in the query plan, and the changes in the combiner make these stats better. I understand that there is some correlation with grouping columns, but not all queries have a GROUP BY clause, so I feel we could separate the two problems.

I am not opposed to changing extractNDVGroupingColumns but was thinking that it's probably better to do it in a separate PR. If we can't, then option 2 seems more appealing, although I couldn't find out why this +1 logic was introduced in the first place.

Overall, I trust your judgement since you have spent much more time on this topic, so I am happy to let you decide how you want to move forward.

Comment on lines +41 to +43
    // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
    hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);

Member


Note that if the CASE statement has one branch with unknown NDV and another that is the NULL constant, the result of the combiner will appear to other consumers as a NULL constant according to the criteria specified here. Probably an edge case, but I still wanted to highlight that it's hard to cover everything once NDV = 0 is a valid value.

Contributor Author


I have struggled with this too; technically, even in the code before these changes, this makes the combined stats for the column come out with numNulls == numRows. The "estimated" flag could be used to decide how much trust a consumer should put in such statistics entries.


@konstantinb (Contributor, Author)

@zabetak I've opened PR #6406, as per your suggestion, for HIVE-29534. Unfortunately, it triggers a massive set of changes to output files (178 in total).

@konstantinb konstantinb marked this pull request as draft April 3, 2026 03:34
@konstantinb (Contributor, Author)

@zabetak I think I have finally realized that while the combining changes make little sense without the computeNDVGroupingColumns() fixes, the actual problem in computeNDVGroupingColumns applies to non-combined columns too, and we should likely fix it first. I will create a new story for that isolated fix.
