HIVE-29368: More accurate pessimistic stats combining #6359

Draft
konstantinb wants to merge 14 commits into apache:master from konstantinb:HIVE-29368-pessimistic

Conversation

@konstantinb (Contributor) commented Mar 13, 2026

What changes were proposed in this pull request?

HIVE-29368: More accurate pessimistic stats combining

Why are the changes needed?

To address a flaw in the current PessimisticStatCombiner: it chooses the maximum NDV across the branches of an expression. For constant expressions this typically results in an NDV of 1, significantly skewing subsequent GROUP BY estimates. The PR implements the logic
min(rows, Sum NDV(branch_i))
suggested by @zabetak here: #6308 (review)
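As a standalone illustration (not the actual Hive code; the class name, method name, and plain-long inputs are assumptions for this sketch), the proposed rule could look like:

```java
public class PessimisticNdvSketch {
    // Pessimistic NDV for a multi-branch expression (e.g. CASE..WHEN):
    // the result cannot have more distinct values than the sum of the
    // branches' NDVs, nor more distinct values than there are rows.
    static long combineNdv(long numRows, long... branchNdvs) {
        long sum = 0;
        for (long ndv : branchNdvs) {
            sum += ndv; // the real code would use an overflow-safe add
        }
        return Math.min(numRows, sum);
    }

    public static void main(String[] args) {
        // Two constant branches (NDV 1 each) over 1000 rows:
        // the old max-based rule gives 1; the new rule gives 2.
        System.out.println(combineNdv(1000, 1, 1)); // prints 2
        // The sum is capped at the row count.
        System.out.println(combineNdv(10, 8, 7));   // prints 10
    }
}
```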

Does this PR introduce any user-facing change?

No

How was this patch tested?

Via new unit test code and a new .q file, plus verification of the changes to the existing impacted .q files. I will also run a batch of integration tests in a private Hive deployment within the next few days and will provide an update then.

… PessimisticStatCombiner to use numRows to identify "const NULL" ColStatistics instances
@konstantinb (Contributor, Author)

@zabetak @okumin
I was unable to find a way to distinguish between "const NULL" ColStatistics objects and objects that truly have an "unknown NDV" without changing the ColStatistics class or refactoring the StatEstimator interface, and making PessimisticStatCombiner aware of the total number of rows from the parent stats.

I believe that the latest changes to StatEstimator are minimally invasive and fully backwards-compatible. If this still feels "too intrusive", I'd appreciate some hints on possible alternatives.

@zabetak (Member) commented Mar 30, 2026

I was unable to find a way to distinguish between "const NULL" ColStatistics objects and objects that truly have an "unknown NDV" without changing the ColStatistics class or refactoring the StatEstimator interface, and making...

@konstantinb Since the previous version of the combiner didn't bother much about unknown NDV values, I am not too opinionated about handling the case where countDistinct is zero. Personally, I would be fine even with something simplistic like the following:

    if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
      result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
    }
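For context, `StatsUtils.safeAdd` above is Hive's overflow-guarded addition; a minimal standalone equivalent (assuming saturating semantics for non-negative counters, which is an assumption of this sketch) might be:

```java
public class SafeAddSketch {
    // Saturating addition for non-negative stats counters: if the sum of two
    // large NDVs would overflow a long, clamp to Long.MAX_VALUE instead.
    static long safeAdd(long a, long b) {
        if (a > Long.MAX_VALUE - b) {
            return Long.MAX_VALUE;
        }
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(safeAdd(35, 7));             // prints 42
        System.out.println(safeAdd(Long.MAX_VALUE, 1)); // clamps to Long.MAX_VALUE
    }
}
```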

From my perspective, missing NDVs are an edge case and not something that should appear too often during optimization.

Having said that, I am not against a more elaborate solution like modifying ColStatistics or the StatsEstimator interface as per previous commits, so I will defer the final decision to @konstantinb.

@sonarqubecloud

mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 40 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 6144 Data size: 183480 Basic stats: COMPLETE Column stats: COMPLETE
konstantinb (Contributor, Author)

@zabetak, this is a good example of the possible impact of NDV==0 on statistics estimates. Due to https://issues.apache.org/jira/browse/HIVE-29534, the Metastore NDV (35) of the column table_onetimestamp.val1 is getting ignored, making the NDV of the underlying ColStatistics object 0. Later, extractNDVGroupingColumns() sets this 0 value to 1. The new estimate of 6144 is far worse than the metastore value of 35, but could be far better than the original "1" in certain queries.

zabetak (Member)

Thanks for the explanation. Since the new estimate is not always better maybe we should rediscuss the change in extractNDVGroupingColumns or maybe address HIVE-29534 first.

@konstantinb (Contributor, Author)

From my perspective missing NDVs is an edge case and not something that should appear too often during optimization.

@zabetak, over the last few months of struggling with NDVs, I have identified various ways to trigger bad estimates with the current code. The following estimation change in an .out file immediately highlights a 35x underestimation of row count due to treating an NDV of 0 as a "true 0 value".

Too small NDV values can lead to severe underestimation of GROUP BY cardinality; too large NDVs can severely underestimate the cardinality of an IN() filter.

When dealing with very large tables (millions to billions of records), it can become prohibitively expensive to accurately track the number of unique values per column. This really makes missing (unknown) NDV a regular occurrence.

For every attempt to estimate NDV using numRows, I quickly discovered a counterexample of a skewed dataset that led to severe misestimation, causing either a performance problem or an outright query failure. I would be happy to provide those examples if you like.

In some of those cases, a better NDV estimation is possible, for example for a CASE..WHEN expression with multiple constant branches. In many other cases, it is genuinely safer to use the most pessimistic algorithm, which handles "unknown NDV" columns surprisingly well.

The very last commit has passing tests. I would appreciate it if you could take another look when you get a chance.

The combiner logic is very close to

    if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
      result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
    }

except for specific handling of NDV==0 for non-constant NULL columns.
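The NDV==0 disambiguation described here can be sketched in isolation (standalone illustration; the class and method names are assumptions, not the PR's actual code):

```java
public class UnknownNdvSketch {
    // NDV == 0 is ambiguous in ColStatistics: it can mean "unknown", or it can
    // be the true NDV of a "const NULL" column. The idea in the PR is to
    // disambiguate via the row count: a const-NULL column has numNulls == numRows.
    static boolean isUnknownNdv(long countDistinct, long numNulls, long numRows) {
        return countDistinct == 0 && numNulls != numRows;
    }

    public static void main(String[] args) {
        // 0 NDV and all 100 rows NULL: a const-NULL column, NDV genuinely 0.
        System.out.println(isUnknownNdv(0, 100, 100)); // prints false
        // 0 NDV but only some rows NULL: the NDV is simply unknown.
        System.out.println(isUnknownNdv(0, 5, 100));   // prints true
    }
}
```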

cc @okumin

@zabetak (Member) left a comment

@konstantinb I am more or less OK with the changes in PessimisticStatCombiner. I outlined a potential caveat, but it is not something that needs to be addressed immediately. I am a bit more skeptical about the changes in extractNDVGroupingColumns, so to expedite the PR maybe we should defer them to another patch. I know that there has been some back and forth with this PR, and I really appreciate the effort you have put into improving the stats, so I am trying to keep the scope limited in order to merge this sooner.

Comment on lines +43 to +45
    default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
      throw new UnsupportedOperationException("This estimator requires parentStats");
    }
Member


Do we need to change this to default?

Contributor Author

konstantinb commented Apr 3, 2026


This change eliminates the need to implement estimate(args) for the impacted estimators, while allowing other estimators that do not need numRows to keep their original estimate(args) implementation without having to implement estimate(args, numRows). In other words, it felt "cleaner" and reduced the number of Java files that required edits.

Comment on lines -2112 to +2117
    -    if (cs.getNumNulls() > 0) {
    +    // NDV needs to be adjusted if a column has a known NDV along with NULL values
    +    // or if a column happens to be "const NULL"
    +    if ((ndv > 0 && cs.getNumNulls() > 0) ||
    +        (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == parentStats.getNumRows())) {
Member


It seems that this change led to more .q.out changes and addresses an NDV issue in a slightly different place. Would it be possible to defer this to another patch/Jira so that we can converge and merge this PR sooner? I understand the intention of the change, but I am a bit reluctant to modify multiple things as part of the same change.

Member


Another reason I suggest deferring this to another patch is that I am not fully convinced we should actually do a +1 on the NDV for nulls. If we continue with the assumption used elsewhere that NULL is not a value and thus does not contribute to NDV, then we shouldn't modify the NDV; possibly we should set numNulls to 1.

Contributor Author

konstantinb commented Apr 3, 2026


@zabetak thank you for your feedback; I fully understand your concerns and the goal of reducing code changes in one PR.

The main problem with keeping this code 100% intact is that any ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and numNulls > 0 would yield a GROUP BY row-count estimate of 1, and this "1" estimate has proven to be the root cause of many severe underestimations.

This makes the method's logic directly correlated with the proposed changes to PessimisticStatCombiner.

Technically, the following two variants are feasible alternatives to the current PR code:

1. Adjust the NDV only when it is known to be positive:

    if (ndv > 0 && cs.getNumNulls() > 0) { ndv = StatsUtils.safeAdd(ndv, 1); }

^^ This adjusts NDV estimations except for columns with (potentially) unknown NDVs. Applying this change did not alter any already-modified .out files on this branch. I believe this is the safest possible change for this logic.

2. Completely remove the +1 adjustment for NULLs:

    ndvValues.add(cs.getCountDistint());

^^ This edit modifies only three files on the current branch.

If you firmly oppose any changes to extractNDVGroupingColumns, then we will likely need to go back to the drawing board on how to approach this whole problem differently.

Member


CASE WHEN statements can appear anywhere in the query plan, and the changes in the combiner make these stats better. I understand that there is some correlation with grouping columns, but not all queries have a GROUP BY clause, so I feel we could separate the two problems.

I am not opposed to changing extractNDVGroupingColumns but was thinking that it's probably better to do it in a separate PR. If we can't, then option 2 seems more appealing, although I couldn't find out why this +1 logic was introduced in the first place.

Overall, I trust your judgement since you have spent much more time on this topic, so I am happy to let you decide how you want to move forward.

Comment on lines +41 to +43
    // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
    hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);

Member


Note that if the CASE statement has one branch with unknown NDV and another that is the NULL constant, the result of the combiner will appear to other consumers as a NULL constant according to the criteria specified here. Probably an edge case, but I still wanted to highlight that it's hard to cover everything once NDV = 0 is a valid value.

Contributor Author


I have struggled with this too; technically, even in the code before these changes, this makes the combined stats for the column come out with numNulls == numRows. The "estimated" flag could be used to decide how much trust a consumer should put in such statistics entries.


@konstantinb (Contributor, Author)

@zabetak I've opened PR #6406, as per your suggestion, for HIVE-29534. Unfortunately, it triggers a massive set of changes to output files (178 in total).

@konstantinb konstantinb marked this pull request as draft April 3, 2026 03:34
@konstantinb (Contributor, Author)

@zabetak I think I have finally realized that while the combining changes make little sense without the computeNDVGroupingColumns() fixes, the actual problem in computeNDVGroupingColumns applies to non-combined columns too, and we should likely fix it first. I will create a new story for that isolated fix.
