HIVE-29368: More accurate pessimistic stats combining #6359
base: master
Changes from all commits
```diff
@@ -21,16 +21,26 @@
 import java.util.Optional;
 
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
+import org.apache.hadoop.hive.ql.stats.StatsUtils;
 
 /**
  * Combines {@link ColStatistics} objects to provide the most pessimistic estimate.
  */
 public class PessimisticStatCombiner {
 
+  private final long numRows;
   private boolean inited;
+  private boolean hasUnknownNDV;
   private ColStatistics result;
 
+  public PessimisticStatCombiner(long numRows) {
+    this.numRows = numRows;
+  }
+
   public void add(ColStatistics stat) {
+    // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
+    hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);
```
Comment on lines +41 to +43

Member: Note that if in the CASE statement you have one branch with unknown NDV and another that is the NULL constant, the result of the combiner will appear to other consumers as a NULL constant according to the criteria specified here. Probably an edge case, but I still wanted to highlight that it's hard to cover everything, given that NDV = 0 is a valid value.

Contributor (Author): I have struggled with this too; technically, in the code before these changes, this makes the combined stats for the column come out with numNulls == numRows. The "estimated" flag could be used to decide how much trust a consumer should put in such statistics entries.
```diff
 
     if (!inited) {
       inited = true;
       result = stat.clone();
 
@@ -41,8 +51,10 @@ public void add(ColStatistics stat) {
     if (stat.getAvgColLen() > result.getAvgColLen()) {
       result.setAvgColLen(stat.getAvgColLen());
     }
-    if (stat.getCountDistint() > result.getCountDistint()) {
-      result.setCountDistint(stat.getCountDistint());
+    if (hasUnknownNDV) {
+      result.setCountDistint(0);
+    } else {
+      result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint()));
     }
     if (stat.getNumNulls() < 0 || result.getNumNulls() < 0) {
       result.setNumNulls(-1);
 
@@ -63,8 +75,8 @@ public void add(ColStatistics stat) {
       result.setFilterColumn();
     }
   }
 
   public Optional<ColStatistics> getResult() {
     return Optional.of(result);
   }
 }
```
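The NDV-combining rule in this diff can be illustrated in isolation. Below is a minimal, hypothetical sketch (the class `NdvCombineSketch`, its fields, and the simplified `add` method are stand-ins for the real `PessimisticStatCombiner`/`ColStatistics` API, and `safeAdd` here only mirrors what `StatsUtils.safeAdd` is used for):

```java
// Hypothetical sketch of the pessimistic NDV combining rule:
// NDV == 0 means "unknown" unless the branch is a NULL constant (numNulls == numRows).
// Once any branch has an unknown NDV, the combined NDV is pinned to 0 (unknown);
// otherwise branch NDVs are summed with saturation instead of taking the max.
public class NdvCombineSketch {
  private final long numRows;
  private boolean hasUnknownNDV;
  private long combinedNdv;

  public NdvCombineSketch(long numRows) {
    this.numRows = numRows;
  }

  public void add(long ndv, long numNulls) {
    hasUnknownNDV = hasUnknownNDV || (ndv == 0 && numNulls != numRows);
    combinedNdv = hasUnknownNDV ? 0 : safeAdd(combinedNdv, ndv);
  }

  // Saturating add: on signed overflow, clamp to Long.MAX_VALUE.
  static long safeAdd(long a, long b) {
    long sum = a + b;
    return ((a ^ sum) & (b ^ sum)) < 0 ? Long.MAX_VALUE : sum;
  }

  public long getCombinedNdv() {
    return combinedNdv;
  }

  public static void main(String[] args) {
    // Two known branches: NDVs add up pessimistically (5 + 7 = 12).
    NdvCombineSketch known = new NdvCombineSketch(100);
    known.add(5, 0);
    known.add(7, 2);
    System.out.println(known.getCombinedNdv()); // 12

    // A NULL-constant branch (numNulls == numRows) does not poison the result.
    NdvCombineSketch withNullConst = new NdvCombineSketch(100);
    withNullConst.add(5, 0);
    withNullConst.add(0, 100);
    System.out.println(withNullConst.getCombinedNdv()); // 5

    // An unknown-NDV branch (NDV == 0, not a NULL constant) makes the result unknown.
    NdvCombineSketch unknown = new NdvCombineSketch(100);
    unknown.add(5, 0);
    unknown.add(0, 10);
    System.out.println(unknown.getCombinedNdv()); // 0
  }
}
```

This is the edge case the reviewer points out: an unknown-NDV branch combined with a NULL-constant branch yields NDV 0, which downstream consumers cannot distinguish from a genuine NULL constant.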
```diff
@@ -22,6 +22,7 @@
 import java.util.Optional;
 
 import org.apache.hadoop.hive.ql.plan.ColStatistics;
+import org.apache.hadoop.hive.ql.plan.Statistics;
 
 /**
  * Enables statistics related computation on UDFs
 
@@ -39,5 +40,19 @@ public interface StatEstimator {
    * @param argStats the statistics for every argument of the UDF
    * @return {@link ColStatistics} estimate for the actual UDF.
    */
-  public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+    throw new UnsupportedOperationException("This estimator requires parentStats");
+  }
```
Comment on lines +43 to +45

Member: Do we need to change this to `default`?

Contributor (Author): This change eliminates the need to implement estimate(args) for impacted estimators, while allowing other estimators that do not need numRows to keep their original estimate(args) implementation without having to implement estimate(args, numRows). I.e., it felt "cleaner" and allowed me to reduce the number of Java files that required edits.
```diff
+
+  /**
+   * Computes the output statistics with access to parent statistics.
+   * Override this method when the estimator uses more info for accurate estimation.
+   *
+   * @param argStats the statistics for every argument of the UDF
+   * @param parentStats statistics from the parent operator
+   * @return {@link ColStatistics} estimate for the actual UDF.
+   */
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats, Statistics parentStats) {
+    return estimate(argStats);
+  }
 }
```
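The default-method pattern the author describes can be sketched in isolation. In this minimal, self-contained sketch, `Estimator`, `ColStats`, and `Stats` are simplified hypothetical stand-ins for Hive's `StatEstimator`, `ColStatistics`, and `Statistics`:

```java
import java.util.List;
import java.util.Optional;

public class EstimatorDefaultsSketch {
  // Simplified stand-ins for ColStatistics / Statistics.
  record ColStats(long ndv) {}
  record Stats(long numRows) {}

  interface Estimator {
    // Old single-arg entry point, now a default so new-style estimators can skip it.
    default Optional<ColStats> estimate(List<ColStats> argStats) {
      throw new UnsupportedOperationException("This estimator requires parentStats");
    }

    // New entry point; falls back to the single-arg form for legacy estimators.
    default Optional<ColStats> estimate(List<ColStats> argStats, Stats parentStats) {
      return estimate(argStats);
    }
  }

  // Legacy estimator: implements only the one-arg form, still works via the fallback.
  static class LegacyEstimator implements Estimator {
    @Override
    public Optional<ColStats> estimate(List<ColStats> argStats) {
      return Optional.of(new ColStats(1));
    }
  }

  // New-style estimator: overrides only the two-arg form and uses the parent's numRows.
  static class RowAwareEstimator implements Estimator {
    @Override
    public Optional<ColStats> estimate(List<ColStats> argStats, Stats parentStats) {
      return Optional.of(new ColStats(parentStats.numRows()));
    }
  }

  public static void main(String[] args) {
    Stats parent = new Stats(42);
    // Legacy path: the two-arg call falls through to the one-arg implementation.
    System.out.println(new LegacyEstimator().estimate(List.of(), parent).get().ndv()); // 1
    // New path: uses parent statistics directly.
    System.out.println(new RowAwareEstimator().estimate(List.of(), parent).get().ndv()); // 42
  }
}
```

Callers can always invoke the two-arg form; estimators pay only for what they override, which is why the change avoids touching every existing implementation.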
Member: It seems that this change led to more .q.out changes and addresses an NDV issue in a slightly different place. Would it be possible to defer this to another patch/JIRA so that we converge and merge this PR sooner? I understand the intention of the change, but I am a bit reluctant to modify multiple things as part of the same change.

Member: Another reason why I suggest deferring this to another patch is that I am not fully convinced we should actually do a +1 on the NDV for nulls. If we continue with the assumption used elsewhere that null is not a value and thus does not contribute to the NDV, then we shouldn't modify the NDV; possibly we should set `numNulls` to 1 instead.
Contributor (Author): @zabetak thank you for your feedback; I fully understand your concerns and the goal of reducing code changes in one PR.

The main problem with keeping this code 100% intact is that any ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and numNulls > 0 would be estimated with a GROUP BY row count of 1, and this "1" estimate has proven to be the root cause of many severe under-estimations. That makes this method's logic directly correlated with the proposed changes to PessimisticStatCombiner.

Technically, the following two variants are feasible alternatives to the current PR code:

1. `if (ndv > 0 && cs.getNumNulls() > 0) { ndv = StatsUtils.safeAdd(ndv, 1); }` — this adjusts NDV estimations except for columns with (potentially) unknown NDVs. Applying this change did not cause changes to any already modified .out files of this branch; I believe this is the safest possible change for this logic.
2. `ndvValues.add(cs.getCountDistint());` — this code edit modifies only three files on the current branch.

If you firmly oppose any changes to extractNDVGroupingColumns, then we will likely need to go back to the drawing board on how to approach this whole problem in a different manner.
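For context, the effect of variant 1 can be sketched in isolation. The helper below is hypothetical; in Hive the adjustment would live inside extractNDVGroupingColumns, and `safeAdd` only mirrors what `StatsUtils.safeAdd` is used for:

```java
public class NdvNullAdjustSketch {
  // Saturating add, standing in for StatsUtils.safeAdd.
  static long safeAdd(long a, long b) {
    long sum = a + b;
    return ((a ^ sum) & (b ^ sum)) < 0 ? Long.MAX_VALUE : sum;
  }

  // Variant 1: count the NULLs as one extra grouping "value", but only when
  // the NDV is known (ndv > 0); an unknown NDV (0) is left untouched.
  static long adjustNdvForNulls(long ndv, long numNulls) {
    if (ndv > 0 && numNulls > 0) {
      return safeAdd(ndv, 1);
    }
    return ndv;
  }

  public static void main(String[] args) {
    System.out.println(adjustNdvForNulls(5, 10)); // 6: nulls form one extra group
    System.out.println(adjustNdvForNulls(5, 0));  // 5: no nulls, no adjustment
    System.out.println(adjustNdvForNulls(0, 10)); // 0: unknown NDV stays unknown
  }
}
```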
Member: CASE WHEN statements can appear anywhere in the query plan, and the changes in the combiner make those stats better. I understand that there is some correlation with grouping columns, but not all queries have a GROUP BY clause, so I feel we could separate the two problems.

I am not opposed to changing extractNDVGroupingColumns, but was thinking that it's probably better to do it in a separate PR. If we can't, then option 2 seems more appealing, although I couldn't find out why this +1 logic was introduced in the first place.

Overall, I trust your judgement since you have spent much more time on this topic, so I am happy to let you decide how you want to move forward.