Update explain to Include Singular, CROSS and User Collections #501

Merged
hadia206 merged 16 commits into main from Hadia/update_explain
Mar 24, 2026

Conversation

Contributor

@hadia206 hadia206 commented Mar 9, 2026

Extends pydough.explain() so it can explain:

  • Singular
  • CROSS
  • user-generated collections (range_collection, dataframe_collection)

Extends pydough.explain_term() so it can explain:

  • Singular
  • CROSS
  • UDF

@hadia206 hadia206 changed the title from "Update explain to include Singular/CROSS and user collections" to "Update explain to Include Singular/CROSS and User Collections" Mar 9, 2026
@hadia206 hadia206 marked this pull request as ready for review March 10, 2026 01:10
@hadia206 hadia206 requested review from a team, john-sanchez31, juankx-bodo and knassre-bodo and removed request for a team March 10, 2026 01:10
Contributor

@john-sanchez31 john-sanchez31 left a comment

LGTM!

@hadia206 hadia206 changed the title from "Update explain to Include Singular/CROSS and User Collections" to "Update explain to Include Singular, CROSS and User Collections" Mar 12, 2026
Contributor

@knassre-bodo knassre-bodo left a comment

A few things should be added in my opinion, but so far looks great @hadia206 :)

Comment thread pydough/exploration/explain.py Outdated
Comment on lines +282 to +300
if root is not None:
qualified_node = qualify_node(node, session)
else:
# If the root is None, it means that the node was an expression
# without information about its context.
lines.append(
f"Cannot call pydough.explain on {display_raw(node)}.\n"
"Did you mean to use pydough.explain_term?"
)
# No root in the tree (e.g. UnqualifiedGeneratedCollection, or a
# bare expression like LOWER(first_name + last_name)). Try to
# qualify anyway for generated collections. If it still fails,
# raise an exception.
try:
qualified_node = qualify_node(node, session)
except Exception:
lines.append(
f"Cannot call pydough.explain on {display_raw(node)}.\n"
"Did you mean to use pydough.explain_term?"
)
return "\n".join(lines)
Contributor

What if we just got rid of if root is not None and always did the try-except?
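
A minimal sketch of the simplification suggested here, with the root check dropped so that qualification is always attempted inside a try/except. The `qualify_node` and `display_raw` functions below are stand-in stubs written only for this example (they are not the PyDough implementations), and `explain_sketch` is a hypothetical name.

```python
def display_raw(node: object) -> str:
    # Stand-in stub: the real helper renders an unqualified node.
    return str(node)


def qualify_node(node: object, session: object) -> str:
    # Stand-in stub: pretend only strings starting with "TPCH."
    # can be qualified as collections; everything else fails.
    if isinstance(node, str) and node.startswith("TPCH."):
        return f"Qualified[{node}]"
    raise ValueError(f"cannot qualify {node!r}")


def explain_sketch(node: object, session: object = None) -> str:
    lines: list[str] = []
    # Always attempt qualification; on failure, emit the hint and stop.
    try:
        qualified_node = qualify_node(node, session)
    except Exception:
        lines.append(
            f"Cannot call pydough.explain on {display_raw(node)}.\n"
            "Did you mean to use pydough.explain_term?"
        )
        return "\n".join(lines)
    # Placeholder for the rest of the explanation logic.
    lines.append(f"Explaining {qualified_node}")
    return "\n".join(lines)
```

With this shape there is a single failure path, so generated collections without a root and bare expressions are handled by the same code.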

Comment thread tests/test_exploration.py
Comment thread tests/test_exploration.py
Comment thread pydough/exploration/explain.py Outdated
Comment on lines +363 to +381
if isinstance(prop, CartesianProductMetadata):
    child_name = prop.child_collection.name
    left_desc = (
        qualified_node.preceding_context.to_string()
        if qualified_node.preceding_context is not None
        else collection_name
    )
    lines.append(
        "This node is a CROSS join: every row of the left "
        "collection is paired with every row of the right "
        "collection."
    )
    lines.append(f"Left (parent): {left_desc}")
    lines.append(f"Right (child): {child_name}")
    lines.append(
        f"Metadata: {collection_name}.{property_name} -> {child_name}. "
        f"Call pydough.explain(graph['{collection_name}']['{property_name}']) "
        "to learn more."
    )
Contributor

I don't necessarily like this way of handling CartesianProductMetadata; what was wrong with using the same message as other sub-collection types, then relying on pydough.explain on the metadata to explain the distinction between it being a simple join vs general join vs cross join?

Contributor Author

@hadia206 hadia206 Mar 13, 2026

OK, I'll revert that. I started with the wrong CROSS test and went down a rabbit hole on that.

Comment thread tests/test_exploration.py
Comment thread pydough/exploration/explain.py Outdated
"Did you mean to use pydough.explain_term?"
)
# No root in the tree (e.g. UnqualifiedGeneratedCollection, or a
# bare expression like LOWER(first_name + last_name)). Try to
Contributor

Bare expressions shouldn't use explain, they should use explain_term since things like LOWER(first_name) can only be explained within a context.

Contributor Author

I had to do that because the check if isinstance(qualified_node, PyDoughExpressionQDAG): was causing a problem for collections.
I updated the code to handle it differently. See if that's okay now.

Comment thread tests/test_exploration.py
@hadia206 hadia206 requested a review from knassre-bodo March 16, 2026 21:10
Contributor

@knassre-bodo knassre-bodo left a comment

A few more things I'd like iterated on, though some of these can potentially be in followups.

Comment thread pydough/exploration/explain.py Outdated
Comment on lines +221 to +226
def cross_impl() -> UnqualifiedNode:
    return nations.CROSS(regions)


def cross_nations_impl() -> tuple[UnqualifiedNode, UnqualifiedNode]:
    return nations.CROSS(regions), nations
Contributor

What about another one like this, where the term is a CROSS:

return customers.WHERE(market_segment == 'BUILDING'), CROSS(nations.WHERE(region.name == 'ASIA'))

And one where there is ancestry stuff going on:

return customers.CALCULATE(cust_nationkey).CROSS(nations), cust_nationkey

Comment thread tests/test_exploration.py
Comment on lines +1548 to +1550
TPCH.nations.TPCH.regions.CALCULATE(COUNT(nations.comment))
TPCH.nations.TPCH.regions.WHERE(HAS(nations))
TPCH.nations.TPCH.regions.ORDER_BY(COUNT(nations).DESC())
Contributor

I don't know if this is displaying correctly... it should be TPCH.nations.CROSS(TPCH.regions).CALCULATE(...)

Not 100% sure how to fix this... that may need to be a followup as part of a larger refactoring

Contributor Author

CROSS collections render as TPCH.nations.TPCH.regions instead of TPCH.nations.CROSS(TPCH.regions)

Added a todo for now as I agree it's a larger refactoring effort.
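
To make the rendering difference concrete, here is a toy model, not the PyDough QDAG: `Node`, `to_string_flat`, and `to_string_cross` are hypothetical names invented for this sketch. It only illustrates that the friendlier string needs each step to remember whether it came from a CROSS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical toy node: `name` is the collection being accessed,
    # `parent` is the preceding context, `is_cross` marks a CROSS step.
    name: str
    parent: Optional["Node"] = None
    is_cross: bool = False

    def to_string_flat(self) -> str:
        # Rendering that ignores CROSS identity (the reported behavior).
        if self.parent is None:
            return self.name
        return f"{self.parent.to_string_flat()}.{self.name}"

    def to_string_cross(self) -> str:
        # Rendering that wraps the right-hand side in CROSS(...).
        if self.parent is None:
            return self.name
        prefix = self.parent.to_string_cross()
        if self.is_cross:
            return f"{prefix}.CROSS({self.name})"
        return f"{prefix}.{self.name}"


nations = Node("TPCH.nations")
crossed = Node("TPCH.regions", parent=nations, is_cross=True)
print(crossed.to_string_flat())
print(crossed.to_string_cross())
```

In the toy model the only extra state is one boolean per node; the TODO notes that carrying the equivalent information through the real QDAG is a larger refactoring.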

Comment thread tests/test_exploration.py Outdated
├─── SubCollection[customers]
└─── Where[RANKING(by=(account_balance.DESC(na_pos='last')), levels=1) == 1]

This child is plural with regards to the collection, meaning its scalar terms can only be accessed by the collection if they are aggregated.
Contributor

This is wrong... it shouldn't be plural. Some adjustments to the explain code may be required to get this to display the correct logic.

Comment thread tests/test_exploration.py Outdated

The term is the following expression: RANKING(by=(name.ASC(na_pos='first')))

This expression calls the window function 'RANKING' with the following arguments:
Contributor

We should clean this up so it displays nicely when there are no regular arguments, and also explains any by / per arguments, as well as other ancillary arguments (e.g. cumulative, frame, default, n_buckets)

Contributor Author

Made some changes, let me know if this is okay now.

Unified both the UDF and non-UDF branches with these improvements:

  • No positional args: says "This expression calls the window function 'NAME'." instead of "...with the following arguments:" followed by nothing
  • Ordering (by): new section lists each collation_arg when present
  • Partition levels (per): shown when levels is not None
  • Additional options: new section lists each kwarg (e.g. cumulative: True, allow_ties: False) when present
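
The four points above can be sketched as a single formatting helper. This is an illustration under assumptions: `explain_window_call` and its parameters are hypothetical names, and the section labels are modeled on the wording in the list rather than copied from the merged code.

```python
from typing import Any, Optional


def explain_window_call(
    name: str,
    args: list[str],
    by: list[str],
    levels: Optional[int],
    kwargs: dict[str, Any],
    is_udf: bool = False,
) -> list[str]:
    kind = "user-defined window function" if is_udf else "window function"
    lines: list[str] = []
    # Positional arguments: avoid a dangling "with the following arguments:".
    if args:
        lines.append(
            f"This expression calls the {kind} {name!r} with the following arguments:"
        )
        lines.extend(f"  {arg}" for arg in args)
    else:
        lines.append(f"This expression calls the {kind} {name!r}.")
    # Ordering (by): one line per collation argument, when present.
    if by:
        lines.append("Ordering (by):")
        lines.extend(f"  {term}" for term in by)
    # Partition levels (per): only shown when levels is not None.
    if levels is not None:
        lines.append(f"Partition levels (per): {levels}")
    # Additional options: one line per keyword argument, when present.
    if kwargs:
        lines.append("Additional options:")
        lines.extend(f"  {key}: {value}" for key, value in kwargs.items())
    return lines
```

For example, `explain_window_call("RANKING", [], ["name.ASC(na_pos='first')"], None, {})` starts with the short sentence and then lists the single ordering term.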

Contributor

I think the behavior of specific known ones (cumulative, frame, default, n_buckets) can be explained in the string.
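
One way to read this suggestion is a lookup table of descriptions for the known options, with unknown options falling back to a plain `key=value` line. The helper name `describe_options` is hypothetical, and the description strings are illustrative wording assumed for this sketch, not text from PyDough.

```python
from typing import Any

# Illustrative wording only: these descriptions are assumptions made
# for this sketch, not strings taken from the PyDough code base.
KNOWN_OPTIONS: dict[str, str] = {
    "cumulative": "whether the result accumulates over rows up to the current one",
    "frame": "the range of rows, relative to the current row, that is included",
    "default": "the value used when no row exists at the requested offset",
    "n_buckets": "the number of buckets the rows are divided into",
}


def describe_options(kwargs: dict[str, Any]) -> list[str]:
    lines: list[str] = []
    for key, value in kwargs.items():
        if key in KNOWN_OPTIONS:
            # Known option: append the explanation to the value.
            lines.append(f"{key}={value!r}: {KNOWN_OPTIONS[key]}")
        else:
            # Unknown option: show the raw value only.
            lines.append(f"{key}={value!r}")
    return lines
```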

Comment thread tests/test_exploration.py
Comment on lines +2296 to +2298
This expression calls the user-defined window function 'NVAL' with the following arguments:
name
1
Contributor

Same comment as earlier about other window function arguments

Comment thread tests/test_exploration.py
key

Description: Returns true if the argument is greater than zero.
This function is defined by the SQL macro: '{0} > 0'.
Contributor

Perhaps for macro-like UDFs we can follow this up with an example by:

  • Count how many arguments it has (from the arguments to the call)
  • Creating that many dummy variables (?a, ?b, ?c)
  • Injecting those into the SQL text via Python's .format method

So for this text it would be:

Suppose this function were called on arguments that are translated to the following in SQL: '?a'
Then the final SQL text for this function call would be: '?a > 0'
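
The three steps above can be sketched directly with `str.format`. The function names `macro_example` and `explain_macro` are hypothetical, and the sketch assumes the macro uses positional `{0}`, `{1}`, ... fields as in the example text.

```python
import string


def macro_example(macro: str, n_args: int) -> tuple[list[str], str]:
    if n_args > len(string.ascii_lowercase):
        raise ValueError("too many arguments for single-letter placeholders")
    # One dummy variable per call argument: ?a, ?b, ?c, ...
    dummies = [f"?{letter}" for letter in string.ascii_lowercase[:n_args]]
    # Inject the dummies into the SQL macro text via str.format.
    return dummies, macro.format(*dummies)


def explain_macro(macro: str, n_args: int) -> list[str]:
    dummies, sql = macro_example(macro, n_args)
    shown = ", ".join(repr(d) for d in dummies)
    return [
        "Suppose this function were called on arguments that are "
        f"translated to the following in SQL: {shown}",
        f"Then the final SQL text for this function call would be: {sql!r}",
    ]


for line in explain_macro("{0} > 0", 1):
    print(line)
```

Because the fields are positional, a macro that reorders its arguments, such as 'ABS({1} - {0}) <= {2}', is rendered with the dummies in the macro's order rather than the call's order.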

@hadia206 hadia206 requested a review from knassre-bodo March 23, 2026 21:32
Contributor

@knassre-bodo knassre-bodo left a comment

Great job @hadia206! Left a few final comments, but feel free to merge after

return [
f"This node accesses user-generated collection {self.name!r}.\n"
f"Columns: {', '.join(sorted(self.columns))}",
f"Unique columns: {', '.join(sorted(unique_terms))}",
Contributor

I think this one should be in verbose, but the type of user generated collection should not be (add that in the individual overrides)
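
A sketch of one reading of this comment: the unique-columns line becomes verbose-only, while each subclass override contributes a line naming its kind that is always shown. All class and method names here (`UserCollectionSketch`, `kind_line`, `explain_lines`) are hypothetical and exist only for this example.

```python
class UserCollectionSketch:
    # Hypothetical base class; not the PyDough user-collection class.
    def __init__(self, name: str, columns: list[str], unique: list[str]):
        self.name = name
        self.columns = columns
        self.unique = unique

    def kind_line(self) -> str:
        # Each kind of user-generated collection overrides this.
        raise NotImplementedError

    def explain_lines(self, verbose: bool = False) -> list[str]:
        lines = [
            f"This node accesses user-generated collection {self.name!r}.",
            self.kind_line(),  # always shown, supplied by the override
            f"Columns: {', '.join(sorted(self.columns))}",
        ]
        if verbose:
            # Uniqueness details only appear in verbose mode.
            lines.append(f"Unique columns: {', '.join(sorted(self.unique))}")
        return lines


class RangeSketch(UserCollectionSketch):
    def kind_line(self) -> str:
        return "This is a range collection."


class DataframeSketch(UserCollectionSketch):
    def kind_line(self) -> str:
        return "This is a dataframe collection."
```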

# TODO: when the collection is a CROSS, qualified_node.to_string()
# renders as e.g. "TPCH.nations.TPCH.regions" instead of the
# friendlier "TPCH.nations.CROSS(TPCH.regions)". Fixing this
# properly requires propagating CROSS identity through the QDAG
Contributor

Or finding a way to render the unqualified node that looks nicer, like how QDAG looks

Comment thread tests/test_exploration.py Outdated

The term is the following expression: RANKING(by=(name.ASC(na_pos='first')))

This expression calls the window function 'RANKING' with the following arguments:
Contributor

I think the behavior of specific known ones (cumulative, frame, default, n_buckets) can be explained in the string.

Comment thread tests/test_exploration.py Outdated
TPCH.regions.CALCULATE(nations.customers.WHERE(RANKING(by=(account_balance.DESC(na_pos='last')), levels=1) == 1).SINGULAR.account_balance)

To learn more about this child, you can try calling pydough.explain on the following:
TPCH.regions.nations.customers.WHERE(RANKING(by=(account_balance.DESC(na_pos='last')), levels=1) == 1).SINGULAR
Contributor

Oop, we should adjust the repr for SINGULAR in qdag to have the () at the end.

Comment thread tests/test_exploration.py Outdated
Comment on lines +2408 to +2409
Suppose this function were called on arguments that are translated to the following in SQL: '?a', '?b', '?c'
Then the final SQL text for this function call would be: 'ABS(?b - ?a) <= ?c'
Contributor

I think this part might be a verbose-only thing

@hadia206 hadia206 merged commit 158a0df into main Mar 24, 2026
15 checks passed
@hadia206 hadia206 deleted the Hadia/update_explain branch March 24, 2026 19:50

3 participants