HIVE-29368: More accurate pessimistic stats combining #6359
konstantinb wants to merge 14 commits into apache:master from
Conversation
… still remains "unknown"
… PessimisticStatCombiner to use numRows to identify "const NULL" ColStatistics instances
@zabetak @okumin I believe that the latest changes for StatEstimator are minimally invasive and fully backwards-compatible. If this still feels "too intrusive", I'd appreciate some hints on possible alternatives.
@konstantinb Since the previous version of the combiner didn't bother much about unknown NDV values, I am not too opinionated about handling the case where countDistinct is zero. Personally, I would be fine even with something simplistic like the following:

if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
  result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
}

From my perspective, missing NDVs are an edge case and not something that should appear too often during optimization. Having said that, I am not against a more elaborate solution like modifying
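The simplistic rule quoted above can be sketched as a small standalone class. This is an illustrative sketch, not the real PessimisticStatCombiner: the class and method names are invented, `-1` is used here to mark an unknown NDV, and `safeAdd` mimics the overflow-saturating addition that Hive's StatsUtils.safeAdd provides.

```java
// Hypothetical sketch of the simplistic combining rule quoted above:
// sum per-branch NDVs only when both sides are known. Not Hive's real API.
public class NdvCombineSketch {
    /** Saturating add, loosely analogous to StatsUtils.safeAdd. */
    static long safeAdd(long a, long b) {
        long r = a + b;
        return (r < a || r < b) ? Long.MAX_VALUE : r; // clamp on overflow
    }

    /** Combine two NDVs; -1 marks an unknown NDV in this sketch. */
    static long combineNdv(long resultNdv, long branchNdv) {
        if (branchNdv >= 0 && resultNdv >= 0) {
            return safeAdd(resultNdv, branchNdv);
        }
        return resultNdv; // leave result untouched when either side is unknown
    }

    public static void main(String[] args) {
        System.out.println(combineNdv(3, 5));  // both known -> summed
        System.out.println(combineNdv(-1, 5)); // unknown stays unknown
    }
}
```

The point of the guard is that summing a known NDV with an unknown one would silently launder the unknown into a "known" value.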
…fixes, new .out files
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 40 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 6144 Data size: 183480 Basic stats: COMPLETE Column stats: COMPLETE
@zabetak, this is a good example of the possible impact of NDV==0 on statistics estimates. Due to https://issues.apache.org/jira/browse/HIVE-29534, the Metastore NDV (35) of the column table_onetimestamp.val1 gets ignored, making the NDV of the underlying ColStatistics object 0. Later, extractNDVGroupingColumns() sets this 0 value to 1. The new estimate of 6144 is far worse than the metastore value of 35, but could be far better than the original "1" in certain queries.
Thanks for the explanation. Since the new estimate is not always better maybe we should rediscuss the change in extractNDVGroupingColumns or maybe address HIVE-29534 first.
@zabetak, over the last few months of struggling with NDVs, I have identified various ways to trigger bad estimates with the current code. The estimation change in this .out file immediately highlights a 35x underestimation of row count due to treating an NDV of 0 as a "true 0 value". Too-small NDV values can lead to severe underestimation of GROUP BY cardinality; too-large NDVs can severely underestimate the cardinality of an IN() filter.

When dealing with very large tables (millions to billions of records), it can become prohibitively expensive to accurately track the number of unique values per column. This makes a missing (unknown) NDV a regular occurrence. For every attempt to estimate NDV using numRows, I quickly discovered a counterexample of a skewed dataset that led to severe misestimation, causing either a performance problem or an outright query failure. I would be happy to provide those examples if you like. In some of those cases, a better NDV estimation is possible, for example with a CASE..WHEN statement with multiple constants. In many other cases, it is genuinely safer to use the most pessimistic algorithm, which handles "unknown NDV" columns surprisingly well.

The very last commit has passing tests. I would appreciate it if you could take another look when you get a chance. The combiner logic is very close, except for the specific handling of NDV==0 for non-constant null columns. cc @okumin
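The 35x underestimation described above can be reproduced with a toy calculation. This is purely illustrative, with invented helper names and a deliberately naive GROUP BY estimator; the numbers follow the example above (a metastore NDV of 35 dropped to 0, then clamped to 1).

```java
// Illustrative only: how treating NDV == 0 as a literal value (clamped to 1)
// collapses a GROUP BY row-count estimate. Helper names are invented.
public class GroupByEstimateSketch {
    /** Naive GROUP BY cardinality: product of grouping-column NDVs, capped by numRows. */
    static long estimateGroupByRows(long numRows, long[] ndvs) {
        long est = 1;
        for (long ndv : ndvs) {
            long effective = (ndv == 0) ? 1 : ndv; // the clamp discussed above
            est = Math.min(numRows, est * effective);
        }
        return est;
    }

    public static void main(String[] args) {
        long numRows = 6144;
        // Metastore NDV preserved -> sensible estimate of 35 groups.
        System.out.println(estimateGroupByRows(numRows, new long[] {35}));
        // NDV lost to 0, clamped to 1 -> estimate collapses to a single group.
        System.out.println(estimateGroupByRows(numRows, new long[] {0}));
    }
}
```

With the NDV intact the estimator returns 35; once the NDV degrades to 0 the clamp drives the estimate down to 1, the 35x underestimation mentioned above.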
zabetak left a comment
@konstantinb I am more or less OK with the changes in PessimisticStatCombiner. I outlined a potential caveat, but not something that needs to be addressed immediately. I am a bit more skeptical about the changes in extractNDVGroupingColumns, so to expedite the PR maybe we should defer them to another patch. I know that there has been some back and forth with this PR, and I really appreciate the effort that you put into improving the stats, so I am trying to keep the scope limited to merge this sooner.
default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
  throw new UnsupportedOperationException("This estimator requires parentStats");
}
Do we need to change this to default?
This change eliminates the need to implement estimate(args) for the impacted estimators, while allowing other estimators that do not need numRows to keep their original estimate(args) implementation without having to implement estimate(args, numRows). I.e., it felt "cleaner" and allowed me to reduce the number of Java files that required edits.
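The interface-evolution pattern described here can be sketched in isolation. This is a hedged stand-in, not Hive's actual StatEstimator API: the interface, the `long[]`-based signatures, and both estimator classes are invented to show why the throwing default body lets old and new estimators each implement only one overload.

```java
// Minimal sketch of the default-method pattern: the legacy single-argument
// estimate() gets a throwing default body, so numRows-aware estimators can
// skip it, while legacy estimators keep working untouched. All names here
// are invented stand-ins, not Hive's real StatEstimator types.
import java.util.Optional;

public class DefaultEstimateSketch {
    interface Estimator {
        default Optional<Long> estimate(long[] argNdvs) {
            throw new UnsupportedOperationException("This estimator requires numRows");
        }
        default Optional<Long> estimate(long[] argNdvs, long numRows) {
            return estimate(argNdvs); // fall back to the legacy overload
        }
    }

    /** Legacy-style estimator: implements only the old overload. */
    static class MaxNdvEstimator implements Estimator {
        @Override
        public Optional<Long> estimate(long[] argNdvs) {
            long max = 0;
            for (long n : argNdvs) max = Math.max(max, n);
            return Optional.of(max);
        }
    }

    /** New-style estimator: implements only the numRows-aware overload. */
    static class SumCappedEstimator implements Estimator {
        @Override
        public Optional<Long> estimate(long[] argNdvs, long numRows) {
            long sum = 0;
            for (long n : argNdvs) sum += n;
            return Optional.of(Math.min(sum, numRows));
        }
    }
}
```

Calling the one-argument estimate() on a new-style estimator still throws, mirroring the guard in the diff above, but no existing class is forced to add a second method.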
if (cs.getNumNulls() > 0) {
  // NDV needs to be adjusted if a column has a known NDV along with NULL values
  // or if a column happens to be "const NULL"
  if ((ndv > 0 && cs.getNumNulls() > 0) ||
      (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == parentStats.getNumRows())) {
It seems that this change led to more .q.out changes and addresses an NDV issue in a slightly different place. Would it be possible to defer this to another patch/Jira so that we converge and merge this PR sooner? I understand the intention of the change, but I am a bit reluctant to modify multiple things as part of the same change.
Another reason why I suggest deferring this to another patch is that I am not fully convinced we should actually do a +1 on the NDV for nulls. If we continue with the assumption used elsewhere that null is not a value and thus does not contribute to NDV, then we shouldn't modify the NDV; possibly we should set numNulls to 1.
@zabetak thank you for your feedback; I fully understand your concerns and the goal of reducing code changes in one PR.
The main problem with keeping this code 100% intact is that any ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and numNulls > 0 would be estimated with a GROUP BY row count of 1; this "1" estimate has proven to be the root cause of many severe underestimations. That makes this method's logic directly correlated with the proposed changes to PessimisticStatCombiner.
Technically, the following two variants are feasible alternatives to the current PR code:

1. Adjusting NDV only when it is known:

   if (ndv > 0 && cs.getNumNulls() > 0) { ndv = StatsUtils.safeAdd(ndv, 1); }

   This adjusts NDV estimations except for columns with (potentially) unknown NDVs. Applying this change did not cause changes to any already modified .out files of this branch. I believe this is the safest possible change for this logic.

2. Completely removing the +1 adjustment for NULLs:

   ndvValues.add(cs.getCountDistint());

   This code edit modifies only three files on the current branch.
If you firmly oppose any changes to extractNDVGroupingColumns, then we will likely need to get back to the drawing board on how to approach this whole problem in a different manner.
CASE WHEN statements can appear everywhere in the query plan, and the changes in the combiner make these stats better. I understand that there is some correlation with grouping columns, but not all queries have a GROUP BY clause, so I feel we could separate the two problems.
I am not opposed to changing extractNDVGroupingColumns but was thinking that it's probably better to do it in a separate PR. If we can't, then option 2 seems more appealing, although I couldn't find out why this +1 logic was introduced in the first place.
Overall, I trust your judgement since you spend much more time on the topic so I am happy to let you decide how you want to move forward.
// NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);
Note that if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant, the result of the combiner will appear to other consumers as a NULL constant according to the criteria specified here. Probably an edge case, but I still wanted to highlight that it's hard to cover everything from the moment that NDV = 0 is a valid value.
I have struggled with this too; technically, in the code before these changes, this makes the combined stats for the column come out with numNulls == numRows. The "estimated" flag could be used to decide how much trust a consumer should put in such statistics entries.
@zabetak I've opened PR #6406 as per your suggestion for HIVE-29534.
@zabetak I think I have finally realized that while the combining changes make little sense without the computeNDVGroupingColumns() fixes, the actual problem in computeNDVGroupingColumns applies to non-combined columns too, and we should likely fix it first. I will create a new story for that isolated fix.
What changes were proposed in this pull request?
HIVE-29368: More accurate pessimistic stats combining
Why are the changes needed?
To address the flaw of the current PessimisticStatCombiner of choosing the maximum NDV across the branches of an expression. For constant expressions, this typically results in "1", significantly impacting subsequent GROUP BY estimates. The PR implements the logic min(rows, Sum NDV(branch_i)) suggested by @zabetak here: #6308 (review)
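The old max-based rule and the proposed sum-capped rule can be contrasted in a few lines. This is a hedged sketch with invented method names, not the real PessimisticStatCombiner; it only shows why the max rule degenerates to 1 for all-constant branches while min(rows, Sum NDV(branch_i)) does not.

```java
// Contrast of the old and proposed pessimistic NDV rules. Illustrative only:
// method names are invented, not the real PessimisticStatCombiner API.
public class PessimisticNdvSketch {
    /** Old rule: maximum branch NDV. */
    static long maxRule(long[] branchNdvs) {
        long max = 0;
        for (long n : branchNdvs) max = Math.max(max, n);
        return max;
    }

    /** Proposed rule: sum of branch NDVs, capped at the row count. */
    static long sumCappedRule(long numRows, long[] branchNdvs) {
        long sum = 0;
        for (long n : branchNdvs) sum += n;
        return Math.min(numRows, sum);
    }

    public static void main(String[] args) {
        // CASE WHEN with three constant branches, each contributing NDV == 1:
        long[] constants = {1, 1, 1};
        System.out.println(maxRule(constants));             // old rule -> 1
        System.out.println(sumCappedRule(1000, constants)); // new rule -> 3
    }
}
```

For a CASE expression whose branches are three distinct constants, the max rule reports a single distinct value, while the sum-capped rule correctly allows up to three without ever exceeding the row count.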
Does this PR introduce any user-facing change?
No
How was this patch tested?
Via unit test code and a new .q file; verified changes to the existing impacted .q files. Will also run a batch of integration testing in a private Hive implementation within the next few days and provide an update then.