Skip to content

Add support for Oracle databases & dialect in PyDough#484

Open
john-sanchez31 wants to merge 71 commits intomainfrom
John/oracle
Open

Add support for Oracle databases & dialect in PyDough#484
john-sanchez31 wants to merge 71 commits intomainfrom
John/oracle

Conversation

@john-sanchez31
Copy link
Contributor

Resolves #479

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@john-sanchez31 john-sanchez31 requested review from hadia206, juankx-bodo and knassre-bodo and removed request for a team March 9, 2026 15:50
Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work John.
Please see my comments below

self,
args: list[SQLGlotExpression],
types: list[PyDoughType],
) -> SQLGlotExpression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to explain why we needed to override the base, i.e. difference between this and the base?

)
match operator:
case pydop.DEFAULT_TO:
# sqlglot convert COALESCE in NVL for Oracle, which is fine for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# sqlglot convert COALESCE in NVL for Oracle, which is fine for
# sqlglot convert COALESCE to NVL for Oracle, which is fine for

sqlglot_expressions.Literal.string("YYYY-MM-DD"),
],
)
if isinstance(literal_expression.value, datetime.datetime):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime.datetime is a subclass of datetime.date, so a datetime value matches both checks. That means if you have a datetime it'll go to both and the second would silently overwrites the first.

So 2 things:
This should be if - elif and the check for datetime.datetime should come first

if isinstance(literal_expression.value, datetime.datetime):
    ...
elif isinstance(literal_expression.value, datetime.date):
    ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same should be applied to ANSI portion

"PyDough does not yet support datetime values with a timezone"
)
literal = sqlglot_expressions.Anonymous(
this="TO_DATE",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not TO_TIMESTAMP like oracle_transform_bindings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was using TO_TIMESTAMP but that gave me problems with some operations because of datatypes, this is more consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we've confirmed this is working as expected for timestamps, I'm ok with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Oracle it seems that DATE type stores year, month, day, hour, minute, and second. It does not store fractional seconds or time zone information, which TIMESTAMP does. But the issue is that TIMESTAMP has to be worked differently when adding and subtracting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. This means sub-second precision is silently truncated, and comparisons against TIMESTAMP columns will also drop those sub-seconds. So we need to document that in the code (why TO_DATE was chosen instead of TO_TIMESTAMP) and also for the user

# PYDOUGH CHANGE: ignore SYSTIMESTAMP since it is a special case
# of a column that should not be considered external
if isinstance(c.this, exp.Identifier)
and c.this.this != "SYSTIMESTAMP"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it special for other dialects too? Let's confirm that.
If it's not, we need to figure out how to make this change only for Oracle

or document it clearly that SYSTEMTIMESTAMP is not recommended to be used as column name because ...

external_columns = get_scope_external_columns(scope)
if not parent:
continue
if scope.external_columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be updated too?

Suggested change
if scope.external_columns:
if external_columns:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget this

pytest.param(DatabaseDialect.SNOWFLAKE, id="snowflake"),
pytest.param(DatabaseDialect.MYSQL, id="mysql"),
pytest.param(DatabaseDialect.POSTGRES, id="postgres"),
pytest.param(DatabaseDialect.ORACLE, id="oracle"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add Oracle to all_dialects_tpch_db_context as well.
See how it's used for division by zero test.
We need to start moving our tests to use that so that it's one function used for all. Perhaps you can do that for one test.

Pick your poison test_pipeline_e2e_oracle_tpch or test_pipeline_e2e_oracle_tpch_custom and I'll do the other one in my PR 🥲

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be able to start combining a few things across the PyDough pytest testing infrastructure, seeing as how we have a very high amount of duplicated code regarding our common test sets (tpch, tpch custom, defog, defog custom)

Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost everything looks good! A few of Hadia's comments still need to be addressed, and I'll withhold approval until CI is passing stably after the changes (at which point I'll do a final once-over in case we've missed anything), but great job @john-sanchez31!

"PyDough does not yet support datetime values with a timezone"
)
literal = sqlglot_expressions.Anonymous(
this="TO_DATE",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we've confirmed this is working as expected for timestamps, I'm ok with this.

@@ -420,6 +421,9 @@ def visit_join(self, join: Join) -> None:
if join_type == "SEMI" and join.cardinality.singular:
join_type == "INNER"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also fix the == to an = here, not sure how that one slipped through.

Comment on lines +349 to +352
connection: oracledb.connection
if connection := kwargs.pop("connection", None):
# If a connection object is provided, return it wrapped in
# DatabaseConnection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of pre-declaring the type, I'd assert that the type is oracledb.connection before returning.

return False

@property
def oracle_strftime_mapping(self) -> dict[str, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about this

}

PYDOP_TO_ORACLE_FUNC: dict[pydop.PyDoughExpressionOperator, str] = {
pydop.ABS: "ABS",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about LPAD and RPAD

QUARTER("2023-10-01"), # Q4
QUARTER("2023-11-15"), # Q4
QUARTER("2023-12-31"), # Q4
QUARTER(DATETIME("2023-01-15")), # Q1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes should not be necessary; functions like QUARTER should already generate such a cast if needed with make_datetime_arg

Comment on lines +2621 to +2622
quarters_since_1995=DATEDIFF("quarter", DATETIME("1995-01-01"), order_date),
quarters_until_2000=DATEDIFF("quarter", order_date, DATETIME("2000-01-01")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines 6 to 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to redo the sql generation because of this test causing it to fail.

Comment on lines +1 to +7
WITH _u_0 AS (
SELECT
oid AS _u_1
FROM main.organization
GROUP BY
1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, make sure to change that == to an = then redo the sql generation because of stuff like this

Comment on lines +40 to +41
supported_databases = {"postgres", "mysql", "sqlite", "snowflake"}
supported_databases = {"postgres", "mysql", "sqlite", "snowflake", "oracle"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, forgot to add BodoSQL here. Do you mind add that as well?

Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work John. Almost there.
I still have a couple of comments

this="CHAR",
expressions=[sqlglot_expressions.Literal.number(13)],
), # carriage return
sqlglot_expressions.Literal.string(" "),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add comment like others


replacement_expr: SQLGlotExpression = (
sqlglot_expressions.Literal.string("\\s") if len(args) == 1 else args[1]
sqlglot_expressions.Literal.string("\\s\\t\\r\\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in MySQL \\s means all of them. You don't need to specify each separately
The regex engine natively understands \\s as a shorthand for the entire set of whitespace characters, including tabs and newlines.

Comment on lines +283 to +286
if step_idx != 1:
raise ValueError(
"SLICE function currently only supports the step being integer literal 1 or absent."
)
Copy link
Contributor

@hadia206 hadia206 Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I mean and in this case you can be more specific with error message

if not isinstance(step, sqlglot_expressions.Null):
    if isinstance(step, sqlglot_expressions.Literal):
        try:
            step_idx = int(step.this)
        except ValueError:
            raise ValueError(
                "SLICE function currently only supports the step being integer literal 1 or absent, got non-integer literal."
            )
        if step_idx != 1:
            raise ValueError(
                "SLICE function currently only supports the step being integer literal 1 or absent, got value > 1."
            )
    else:
        raise ValueError(...)

if isinstance(types[0], StringType):
args = [
sqlglot_expressions.Coalesce(
this=arg if arg.this != "" else sqlglot_expressions.Null(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I had it the other way around. isinstance guard was suggested to avoid accessing .this on a non-Literal expression without type checking

Suggested change
this=arg if arg.this != "" else sqlglot_expressions.Null(),
this=arg if not (isinstance(arg, sqlglot_expressions.Literal) and arg.this == "") else sqlglot_expressions.Null(),
Suggested change
this=arg if arg.this != "" else sqlglot_expressions.Null(),
this=arg if not (isinstance(arg, sqlglot_expressions.Literal) and arg.this == "") else sqlglot_expressions.Null(),

uses TO_CHAR with Oracle-specific date format semantics.
"""
if len(args) == 1:
# Length defaults to 4000 which is the max length of a VARCHAR2 in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if arg has more than 4000? Would it truncate, error,...? We need to document that to the user

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the implementation, now uses TO_CHAR so the limit is not longer needed

bodo.user_logging.restore_default_bodo_verbose_logger()


@pytest.mark.bodosql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bodosql is not added all_dialects_tpch_db_context. Add it, if there're problems revert that delete.

)()

# DATABASE mode: Snowflake/Postgres throw errors, SQLite/MySQL return NULL
# DATABASE mode: Snowflake/Postgres/Oracle throw errors, SQLite/MySQL return NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check it for BodoSQL too, it was missed becuase BodoSQL was not added to all_dialects_tpch_db_context

pyproject.toml Outdated
@@ -46,6 +47,7 @@ mysql = ["mysql-connector-python==9.5.0"]
postgres = ["psycopg2-binary"]
server = ["fastapi", "httpx", "uvicorn"]
bodo = ["bodo>=2026.2", "bodosql>=2026.2", "pyiceberg[pyiceberg-core]==0.10.0"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the PR but I think this should be bodosql for consistency. Could you do that change too? Thanks!

"""
String joining must be implemented manually using the `||` operator with
explicit NULL handling via NVL to match CONCAT_WS semantics
for Oracle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE:
Other dialects use CONCAT_WS which skips NULLs ('a,b'), but this implementation uses NVL(arg, '') which replaces NULLs with empty string ('a,,b').
Since the other dialects agree on skip-NULL semantics, Oracle should match.

So lets update that and add a test case covering NULL arguments in JOIN_STRINGS

Also, document that in the user docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one, I tried JOIN_STRINGS(",", "a", None, "b") and in all dialects the result SQL was NULL, so oracle does match with Null values. Same when '' (empty_string) is used. Added a column on string_functions with None value.

Comment on lines +508 to +509
for arg in expr.iter_expressions():
quote_oracle_identifiers(arg, dialect)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought

Is this correct for replaced identifiers? After expr.replace(new_identifier), the subsequent for arg in expr.iter_expressions() iterates the detached expr's children rather than new_identifier's children.
This is ok. since Identifier nodes have no sub-expressions but is it safe to rely on this?

Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing bad jumps out to me; once everything is passing, and you address that one comment I left, feel free to merge.

Comment on lines +350 to +354
date: datetime.date
dt: datetime.datetime
if self._dialect == DatabaseDialect.ANSI:
if isinstance(literal_expression.value, datetime.date):
date: datetime.date = literal_expression.value
if isinstance(literal_expression.value, datetime.datetime):
dt = literal_expression.value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to move a lot of this function's core details into the bindings (e.g. create a self.bindings.convert_datetime_literal(...) so we can control it at a dialect level)

@john-sanchez31 john-sanchez31 requested a review from hadia206 March 20, 2026 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for Oracle databases & dialect in PyDough

3 participants