9 changes: 7 additions & 2 deletions ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
@@ -1578,9 +1578,11 @@ public static ColStatistics getColStatisticsFromExpression(HiveConf conf, Statis
csList.add(cs);
}
if (csList.size() == engfd.getChildren().size()) {
Optional<ColStatistics> res = se.estimate(csList);
Optional<ColStatistics> res = se.estimate(csList, parentStats);
if (res.isPresent()) {
ColStatistics newStats = res.get();
// NDV cannot exceed numRows
newStats.setCountDistint(Math.min(newStats.getCountDistint(), numRows));
colType = colType.toLowerCase();
newStats.setColumnType(colType);
newStats.setColumnName(colName);
@@ -2109,7 +2111,10 @@ private static List<Long> extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
if (cs.getNumNulls() > 0) {
// NDV needs to be adjusted if a column has a known NDV along with NULL values
// or if a column happens to be "const NULL"
if ((ndv > 0 && cs.getNumNulls() > 0) ||
(ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == parentStats.getNumRows())) {
Comment on lines -2112 to +2117
Member:

It seems that this change led to more .q.out changes and addresses an NDV issue in a slightly different place. Would it be possible to defer this to another patch/Jira so that we can converge and merge this PR sooner? I understand the intention of the change, but I am a bit reluctant to modify multiple things as part of the same change.

Member:

Another reason I suggest deferring this to another patch is that I am not fully convinced we should actually do a +1 on the NDV for nulls. If we continue with the assumption used elsewhere that NULL is not a value and thus does not contribute to NDV, then we shouldn't modify the NDV; possibly we should set numNulls to 1 instead.

Contributor Author (@konstantinb, Apr 3, 2026):

@zabetak thank you for your feedback; I fully understand your concerns and the goal of reducing code changes in one PR.

The main problem with keeping this code 100% intact is that any ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and numNulls > 0 would produce a GROUP BY row-count estimate of 1; this "1" estimate has proven to be the root cause of many severe under-estimations.

This makes the method's logic directly correlated with the proposed changes to PessimisticStatCombiner.

Technically, the following two variants are feasible alternatives to the current PR code:

1. `if (ndv > 0 && cs.getNumNulls() > 0) { ndv = StatsUtils.safeAdd(ndv, 1); }`

   ^^ This one adjusts NDV estimations except for columns with (potentially) unknown NDVs. Applying this change did not cause changes to any already modified .out files of this branch. I believe this is the safest possible change for this logic.

2. Completely removing the +1 adjustment for NULLs: `ndvValues.add(cs.getCountDistint());`

   ^^ This code edit modifies only three files on the current branch.

If you firmly oppose any changes to extractNDVGroupingColumns, then we will likely need to get back to the drawing board on how to approach this whole problem in a different manner.
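The behaviors under discussion can be sketched side by side in isolation (a minimal standalone sketch with hypothetical helper names; the real logic is the loop in extractNDVGroupingColumns shown in the hunk above):

```java
public class NdvAdjustVariants {

    // Old behavior: +1 whenever the column has NULLs, so an unknown NDV (0)
    // silently becomes 1 -- the "1" estimate called out in the comment above.
    static long oldAdjust(long ndv, long numNulls) {
        return numNulls > 0 ? ndv + 1 : ndv;
    }

    // Variant 1: +1 only when the NDV is actually known (> 0);
    // an unknown NDV stays 0 instead of being promoted to 1.
    static long variant1(long ndv, long numNulls) {
        return (ndv > 0 && numNulls > 0) ? ndv + 1 : ndv;
    }

    // Variant 2: no +1 adjustment; NULL never contributes to NDV.
    static long variant2(long ndv, long numNulls) {
        return ndv;
    }

    public static void main(String[] args) {
        // Column with unknown NDV and some NULLs:
        System.out.println(oldAdjust(0, 7));  // 1 -- looks like a single-value column
        System.out.println(variant1(0, 7));   // 0 -- remains "unknown"
        // Column with known NDV 50 and some NULLs:
        System.out.println(variant1(50, 7));  // 51 -- NULL counted as one extra group
        System.out.println(variant2(50, 7));  // 50 -- NULL ignored
    }
}
```

The sketch uses plain `+ 1` for brevity; the real code uses `StatsUtils.safeAdd` to guard against overflow.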

Member:

CASE WHEN statements can appear everywhere in the query plan, and the changes in the combiner make these stats better. I understand that there is some correlation with grouping columns, but not all queries have a GROUP BY clause, so I feel we could separate the two problems.

I am not opposed to changing extractNDVGroupingColumns, but was thinking that it's probably better to do it in a separate PR. If we can't, then option 2 seems more appealing, although I couldn't find out why this +1 logic was introduced in the first place.

Overall, I trust your judgement; since you have spent much more time on this topic, I am happy to let you decide how you want to move forward.

ndv = StatsUtils.safeAdd(ndv, 1);
}
ndvValues.add(ndv);
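The new condition in the hunk above can be restated as a small standalone predicate (a simplified sketch; the real code lives in StatsUtils.extractNDVGroupingColumns and uses StatsUtils.safeAdd instead of plain addition):

```java
public class GroupingNdvSketch {

    // Mirrors the diff's condition: adjust the NDV when it is known and the
    // column has NULLs, or when the column is a non-estimated "const NULL"
    // (NDV == 0 and every row is NULL).
    static long adjustNdv(long ndv, long numNulls, boolean estimated, long numRows) {
        if ((ndv > 0 && numNulls > 0)
                || (ndv == 0 && !estimated && numNulls == numRows)) {
            ndv = ndv + 1; // real code: StatsUtils.safeAdd(ndv, 1)
        }
        return ndv;
    }

    public static void main(String[] args) {
        long numRows = 100;
        System.out.println(adjustNdv(5, 3, false, numRows));   // 6: known NDV plus NULL group
        System.out.println(adjustNdv(0, 3, false, numRows));   // 0: unknown NDV stays unknown
        System.out.println(adjustNdv(0, 100, false, numRows)); // 1: const NULL forms one group
        System.out.println(adjustNdv(0, 100, true, numRows));  // 0: estimated stats not trusted
    }
}
```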
@@ -21,16 +21,26 @@
import java.util.Optional;

import org.apache.hadoop.hive.ql.plan.ColStatistics;
import org.apache.hadoop.hive.ql.stats.StatsUtils;

/**
* Combines {@link ColStatistics} objects to provide the most pessimistic estimate.
*/
public class PessimisticStatCombiner {

private final long numRows;
private boolean inited;
private boolean hasUnknownNDV;
private ColStatistics result;

public PessimisticStatCombiner(long numRows) {
this.numRows = numRows;
}

public void add(ColStatistics stat) {
// NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);

Comment on lines +41 to +43
Member:

Note that if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant, the result of the combiner will appear to other consumers as a NULL constant according to the criteria specified here. Probably an edge case, but I still wanted to highlight that it's hard to cover everything from the moment that NDV = 0 is a valid value.

Contributor Author:

I have struggled with this too; technically, in the code before these changes, this makes the combined stats for the column come out with numNulls == numRows. The "estimated" flag could be used to decide how much trust a consumer should put in such statistics entries.

if (!inited) {
inited = true;
result = stat.clone();
@@ -41,8 +51,10 @@ public void add(ColStatistics stat) {
if (stat.getAvgColLen() > result.getAvgColLen()) {
result.setAvgColLen(stat.getAvgColLen());
}
if (stat.getCountDistint() > result.getCountDistint()) {
result.setCountDistint(stat.getCountDistint());
if (hasUnknownNDV) {
result.setCountDistint(0);
} else {
result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
}
if (stat.getNumNulls() < 0 || result.getNumNulls() < 0) {
result.setNumNulls(-1);
@@ -63,8 +75,8 @@ public void add(ColStatistics stat) {
result.setFilterColumn();
}
}

public Optional<ColStatistics> getResult() {
return Optional.of(result);

}
}
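The edge case raised in the thread above comes down to the classification predicate alone, which can be sketched in isolation (a simplified standalone model; the real check is the `hasUnknownNDV` line in PessimisticStatCombiner.add):

```java
public class CombinerNdvSketch {

    // Mirrors the combiner's check: NDV == 0 means "unknown",
    // unless every row is NULL (i.e. the column is a NULL constant).
    static boolean isUnknownNdv(long ndv, long numNulls, long numRows) {
        return ndv == 0 && numNulls != numRows;
    }

    public static void main(String[] args) {
        long numRows = 100;
        System.out.println(isUnknownNdv(0, 40, numRows));  // true: unknown NDV
        System.out.println(isUnknownNdv(0, 100, numRows)); // false: NULL constant
        System.out.println(isUnknownNdv(25, 40, numRows)); // false: NDV is known
        // Edge case from the review: combine one branch with unknown NDV and
        // one branch that is a NULL constant, and the combined result
        // (ndv == 0, numNulls == numRows) is indistinguishable from a NULL
        // constant for downstream consumers.
    }
}
```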
@@ -22,6 +22,7 @@
import java.util.Optional;

import org.apache.hadoop.hive.ql.plan.ColStatistics;
import org.apache.hadoop.hive.ql.plan.Statistics;

/**
* Enables statistics related computation on UDFs
@@ -39,5 +40,19 @@ public interface StatEstimator {
* @param argStats the statistics for every argument of the UDF
* @return {@link ColStatistics} estimate for the actual UDF.
*/
public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
throw new UnsupportedOperationException("This estimator requires parentStats");
}
Comment on lines +43 to +45
Member:

Do we need to change this to default?

Contributor Author (@konstantinb, Apr 3, 2026):

This change eliminates the need to implement estimate(args) for the impacted estimators, while allowing other estimators that do not need numRows to keep their original estimate(args) implementation without having to implement estimate(args, numRows). In other words, it felt "cleaner" and allowed me to reduce the number of Java files that required edits.


/**
* Computes the output statistics with access to parent statistics.
* Override this method when the estimator uses more info for accurate estimation.
*
* @param argStats the statistics for every argument of the UDF
* @param parentStats statistics from the parent operator
* @return {@link ColStatistics} estimate for the actual UDF.
*/
default Optional<ColStatistics> estimate(List<ColStatistics> argStats, Statistics parentStats) {
return estimate(argStats);
}
}
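The interface-evolution pattern used here, a richer overload whose default implementation bridges back to the legacy method, can be sketched in isolation (simplified types; the real interface is StatEstimator with ColStatistics and Statistics):

```java
import java.util.List;
import java.util.Optional;

public class EstimatorBridgeSketch {

    interface SketchEstimator {
        // Legacy entry point: older estimators implement only this.
        default Optional<String> estimate(List<String> argStats) {
            throw new UnsupportedOperationException("This estimator requires parentStats");
        }

        // Richer entry point: callers invoke this; the default bridges back to
        // the legacy method, so existing implementations work without edits.
        default Optional<String> estimate(List<String> argStats, long parentNumRows) {
            return estimate(argStats);
        }
    }

    // Old-style estimator: implements only the legacy signature, yet is still
    // callable through the new two-argument entry point.
    static class LegacyEstimator implements SketchEstimator {
        @Override
        public Optional<String> estimate(List<String> argStats) {
            return Optional.of("legacy:" + argStats.size());
        }
    }

    // New-style estimator: implements only the richer signature; calling the
    // legacy overload on it throws UnsupportedOperationException.
    static class RowAwareEstimator implements SketchEstimator {
        @Override
        public Optional<String> estimate(List<String> argStats, long parentNumRows) {
            return Optional.of("rows:" + parentNumRows);
        }
    }

    public static void main(String[] args) {
        System.out.println(new LegacyEstimator().estimate(List.of("a"), 10L).get());   // legacy:1
        System.out.println(new RowAwareEstimator().estimate(List.of("a"), 10L).get()); // rows:10
    }
}
```

This is why callers can uniformly invoke the two-argument form while only the estimators that actually consume parentStats need to change.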
@@ -26,6 +26,7 @@
import org.apache.hadoop.hive.ql.exec.vector.VectorizedExpressionsSupportDecimal64;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.plan.ColStatistics;
import org.apache.hadoop.hive.ql.plan.Statistics;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimator;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimatorProvider;
import org.apache.hadoop.hive.ql.stats.estimator.PessimisticStatCombiner;
@@ -89,8 +90,8 @@ public StatEstimator getStatEstimator() {
static class CoalesceStatEstimator implements StatEstimator {

@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner();
public Optional<ColStatistics> estimate(List<ColStatistics> argStats, Statistics parentStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner(parentStats.getNumRows());
for (int i = 0; i < argStats.size(); i++) {
combiner.add(argStats.get(i));
}
@@ -29,6 +29,7 @@
import org.apache.hadoop.hive.ql.exec.vector.VectorizedExpressionsSupportDecimal64;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.plan.ColStatistics;
import org.apache.hadoop.hive.ql.plan.Statistics;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimator;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimatorProvider;
import org.apache.hadoop.hive.ql.stats.estimator.PessimisticStatCombiner;
@@ -187,8 +188,8 @@ public StatEstimator getStatEstimator() {
static class IfStatEstimator implements StatEstimator {

@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner();
public Optional<ColStatistics> estimate(List<ColStatistics> argStats, Statistics parentStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner(parentStats.getNumRows());
combiner.add(argStats.get(1));
combiner.add(argStats.get(2));
return combiner.getResult();
@@ -25,6 +25,7 @@
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.plan.ColStatistics;
import org.apache.hadoop.hive.ql.plan.Statistics;
import org.apache.hadoop.hive.ql.stats.estimator.PessimisticStatCombiner;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimator;
import org.apache.hadoop.hive.ql.stats.estimator.StatEstimatorProvider;
@@ -143,8 +144,8 @@ public StatEstimator getStatEstimator() {
static class WhenStatEstimator implements StatEstimator {

@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner();
public Optional<ColStatistics> estimate(List<ColStatistics> argStats, Statistics parentStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner(parentStats.getNumRows());
for (int i = 1; i < argStats.size(); i += 2) {
combiner.add(argStats.get(i));
}