Add support for Oracle databases & dialect in PyDough#484
Add support for Oracle databases & dialect in PyDough#484john-sanchez31 wants to merge 71 commits intomainfrom
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
…iff days, trunc week, trunc day, join_strings, extract quarters, cast int
hadia206
left a comment
There was a problem hiding this comment.
Good work John.
Please see my comments below
pydough/sqlglot/transform_bindings/oracle_transform_bindings.py
Outdated
Show resolved
Hide resolved
| self, | ||
| args: list[SQLGlotExpression], | ||
| types: list[PyDoughType], | ||
| ) -> SQLGlotExpression: |
There was a problem hiding this comment.
Add comment to explain why we needed to override the base, i.e. difference between this and the base?
| ) | ||
| match operator: | ||
| case pydop.DEFAULT_TO: | ||
| # sqlglot convert COALESCE in NVL for Oracle, which is fine for |
There was a problem hiding this comment.
| # sqlglot convert COALESCE in NVL for Oracle, which is fine for | |
| # sqlglot convert COALESCE to NVL for Oracle, which is fine for |
| sqlglot_expressions.Literal.string("YYYY-MM-DD"), | ||
| ], | ||
| ) | ||
| if isinstance(literal_expression.value, datetime.datetime): |
There was a problem hiding this comment.
datetime.datetime is a subclass of datetime.date, so a datetime value matches both checks. That means if you have a datetime it'll go to both and the second would silently overwrites the first.
So 2 things:
This should be if - elif and the check for datetime.datetime should come first
if isinstance(literal_expression.value, datetime.datetime):
...
elif isinstance(literal_expression.value, datetime.date):
...
There was a problem hiding this comment.
The same should be applied to ANSI portion
| "PyDough does not yet support datetime values with a timezone" | ||
| ) | ||
| literal = sqlglot_expressions.Anonymous( | ||
| this="TO_DATE", |
There was a problem hiding this comment.
Why not TO_TIMESTAMP like oracle_transform_bindings?
There was a problem hiding this comment.
At first I was using TO_TIMESTAMP but that gave me problems with some operations because of datatypes, this is more consistent.
There was a problem hiding this comment.
As long as we've confirmed this is working as expected for timestamps, I'm ok with this.
There was a problem hiding this comment.
In Oracle it seems that DATE type stores year, month, day, hour, minute, and second. It does not store fractional seconds or time zone information, which TIMESTAMP does. But the issue is that TIMESTAMP has to be worked differently when adding and subtracting.
There was a problem hiding this comment.
Ok. This means sub-second precision is silently truncated, and comparisons against TIMESTAMP columns will also drop those sub-seconds. So we need to document that in the code (why TO_DATE was chosen instead of TO_TIMESTAMP) and also for the user
| # PYDOUGH CHANGE: ignore SYSTIMESTAMP since it is a special case | ||
| # of a column that should not be considered external | ||
| if isinstance(c.this, exp.Identifier) | ||
| and c.this.this != "SYSTIMESTAMP" |
There was a problem hiding this comment.
Is it special for other dialects too? Let's confirm that.
If it's not, we need to figure out how to make this change only for Oracle
or document it clearly that SYSTEMTIMESTAMP is not recommended to be used as column name because ...
| external_columns = get_scope_external_columns(scope) | ||
| if not parent: | ||
| continue | ||
| if scope.external_columns: |
There was a problem hiding this comment.
Shouldn't this be updated too?
| if scope.external_columns: | |
| if external_columns: |
| pytest.param(DatabaseDialect.SNOWFLAKE, id="snowflake"), | ||
| pytest.param(DatabaseDialect.MYSQL, id="mysql"), | ||
| pytest.param(DatabaseDialect.POSTGRES, id="postgres"), | ||
| pytest.param(DatabaseDialect.ORACLE, id="oracle"), |
There was a problem hiding this comment.
Add Oracle to all_dialects_tpch_db_context as well.
See how it's used for division by zero test.
We need to start moving our tests to use that so that it's one function used for all. Perhaps you can do that for one test.
Pick your poison test_pipeline_e2e_oracle_tpch or test_pipeline_e2e_oracle_tpch_custom and I'll do the other one in my PR 🥲
There was a problem hiding this comment.
Yes, we should be able to start combining a few things across the PyDough pytest testing infrastructure, seeing as how we have a very high amount of duplicated code regarding our common test sets (tpch, tpch custom, defog, defog custom)
knassre-bodo
left a comment
There was a problem hiding this comment.
Almost everything looks good! A few of Hadia's comments still need to be addressed, and I'll withhold approval until CI is passing stably after the changes (at which point I'll do a final once-over in case we've missed anything), but great job @john-sanchez31!
| "PyDough does not yet support datetime values with a timezone" | ||
| ) | ||
| literal = sqlglot_expressions.Anonymous( | ||
| this="TO_DATE", |
There was a problem hiding this comment.
As long as we've confirmed this is working as expected for timestamps, I'm ok with this.
| @@ -420,6 +421,9 @@ def visit_join(self, join: Join) -> None: | |||
| if join_type == "SEMI" and join.cardinality.singular: | |||
| join_type == "INNER" | |||
There was a problem hiding this comment.
Let's also fix the == to an = here, not sure how that one slipped through.
| connection: oracledb.connection | ||
| if connection := kwargs.pop("connection", None): | ||
| # If a connection object is provided, return it wrapped in | ||
| # DatabaseConnection |
There was a problem hiding this comment.
Instead of pre-declaring the type, I'd assert that the type is oracledb.connection before returning.
| return False | ||
|
|
||
| @property | ||
| def oracle_strftime_mapping(self) -> dict[str, str]: |
There was a problem hiding this comment.
Don't forget about this
| } | ||
|
|
||
| PYDOP_TO_ORACLE_FUNC: dict[pydop.PyDoughExpressionOperator, str] = { | ||
| pydop.ABS: "ABS", |
There was a problem hiding this comment.
What about LPAD and RPAD
| QUARTER("2023-10-01"), # Q4 | ||
| QUARTER("2023-11-15"), # Q4 | ||
| QUARTER("2023-12-31"), # Q4 | ||
| QUARTER(DATETIME("2023-01-15")), # Q1 |
There was a problem hiding this comment.
These changes should not be necessary; functions like QUARTER should already generate such a cast if needed with make_datetime_arg
| quarters_since_1995=DATEDIFF("quarter", DATETIME("1995-01-01"), order_date), | ||
| quarters_until_2000=DATEDIFF("quarter", order_date, DATETIME("2000-01-01")), |
There was a problem hiding this comment.
I think you may need to redo the sql generation because of this test causing it to fail.
| WITH _u_0 AS ( | ||
| SELECT | ||
| oid AS _u_1 | ||
| FROM main.organization | ||
| GROUP BY | ||
| 1 | ||
| ) |
There was a problem hiding this comment.
Yeah, make sure to change that == to an = then redo the sql generation because of stuff like this
| supported_databases = {"postgres", "mysql", "sqlite", "snowflake"} | ||
| supported_databases = {"postgres", "mysql", "sqlite", "snowflake", "oracle"} |
There was a problem hiding this comment.
Whoops, forgot to add BodoSQL here. Do you mind add that as well?
hadia206
left a comment
There was a problem hiding this comment.
Great work John. Almost there.
I still have a couple of comments
| this="CHAR", | ||
| expressions=[sqlglot_expressions.Literal.number(13)], | ||
| ), # carriage return | ||
| sqlglot_expressions.Literal.string(" "), |
There was a problem hiding this comment.
nit: add comment like others
|
|
||
| replacement_expr: SQLGlotExpression = ( | ||
| sqlglot_expressions.Literal.string("\\s") if len(args) == 1 else args[1] | ||
| sqlglot_expressions.Literal.string("\\s\\t\\r\\n") |
There was a problem hiding this comment.
I believe in MySQL \\s means all of them. You don't need to specify each separately
The regex engine natively understands \\s as a shorthand for the entire set of whitespace characters, including tabs and newlines.
| if step_idx != 1: | ||
| raise ValueError( | ||
| "SLICE function currently only supports the step being integer literal 1 or absent." | ||
| ) |
There was a problem hiding this comment.
This is what I mean and in this case you can be more specific with error message
if not isinstance(step, sqlglot_expressions.Null):
if isinstance(step, sqlglot_expressions.Literal):
try:
step_idx = int(step.this)
except ValueError:
raise ValueError(
"SLICE function currently only supports the step being integer literal 1 or absent, got non-integer literal."
)
if step_idx != 1:
raise ValueError(
"SLICE function currently only supports the step being integer literal 1 or absent, got value > 1."
)
else:
raise ValueError(...)
| if isinstance(types[0], StringType): | ||
| args = [ | ||
| sqlglot_expressions.Coalesce( | ||
| this=arg if arg.this != "" else sqlglot_expressions.Null(), |
There was a problem hiding this comment.
Apologies I had it the other way around. isinstance guard was suggested to avoid accessing .this on a non-Literal expression without type checking
| this=arg if arg.this != "" else sqlglot_expressions.Null(), | |
| this=arg if not (isinstance(arg, sqlglot_expressions.Literal) and arg.this == "") else sqlglot_expressions.Null(), |
| this=arg if arg.this != "" else sqlglot_expressions.Null(), | |
| this=arg if not (isinstance(arg, sqlglot_expressions.Literal) and arg.this == "") else sqlglot_expressions.Null(), |
| uses TO_CHAR with Oracle-specific date format semantics. | ||
| """ | ||
| if len(args) == 1: | ||
| # Length defaults to 4000 which is the max length of a VARCHAR2 in |
There was a problem hiding this comment.
What happens if arg has more than 4000? Would it truncate, error,...? We need to document that to the user
There was a problem hiding this comment.
Changed the implementation, now uses TO_CHAR so the limit is not longer needed
| bodo.user_logging.restore_default_bodo_verbose_logger() | ||
|
|
||
|
|
||
| @pytest.mark.bodosql |
There was a problem hiding this comment.
bodosql is not added all_dialects_tpch_db_context. Add it, if there're problems revert that delete.
tests/test_session.py
Outdated
| )() | ||
|
|
||
| # DATABASE mode: Snowflake/Postgres throw errors, SQLite/MySQL return NULL | ||
| # DATABASE mode: Snowflake/Postgres/Oracle throw errors, SQLite/MySQL return NULL |
There was a problem hiding this comment.
Need to check it for BodoSQL too, it was missed becuase BodoSQL was not added to all_dialects_tpch_db_context
pyproject.toml
Outdated
| @@ -46,6 +47,7 @@ mysql = ["mysql-connector-python==9.5.0"] | |||
| postgres = ["psycopg2-binary"] | |||
| server = ["fastapi", "httpx", "uvicorn"] | |||
| bodo = ["bodo>=2026.2", "bodosql>=2026.2", "pyiceberg[pyiceberg-core]==0.10.0"] | |||
There was a problem hiding this comment.
Not related to the PR but I think this should be bodosql for consistency. Could you do that change too? Thanks!
| """ | ||
| String joining must be implemented manually using the `||` operator with | ||
| explicit NULL handling via NVL to match CONCAT_WS semantics | ||
| for Oracle. |
There was a problem hiding this comment.
NOTE:
Other dialects use CONCAT_WS which skips NULLs ('a,b'), but this implementation uses NVL(arg, '') which replaces NULLs with empty string ('a,,b').
Since the other dialects agree on skip-NULL semantics, Oracle should match.
So lets update that and add a test case covering NULL arguments in JOIN_STRINGS
Also, document that in the user docs
There was a problem hiding this comment.
Not sure about this one, I tried JOIN_STRINGS(",", "a", None, "b") and in all dialects the result SQL was NULL, so oracle does match with Null values. Same when '' (empty_string) is used. Added a column on string_functions with None value.
| for arg in expr.iter_expressions(): | ||
| quote_oracle_identifiers(arg, dialect) |
There was a problem hiding this comment.
A thought
Is this correct for replaced identifiers? After expr.replace(new_identifier), the subsequent for arg in expr.iter_expressions() iterates the detached expr's children rather than new_identifier's children.
This is ok. since Identifier nodes have no sub-expressions but is it safe to rely on this?
knassre-bodo
left a comment
There was a problem hiding this comment.
Nothing bad jumps out to me; once everything is passing, and you address that one comment I left, feel free to merge.
| date: datetime.date | ||
| dt: datetime.datetime | ||
| if self._dialect == DatabaseDialect.ANSI: | ||
| if isinstance(literal_expression.value, datetime.date): | ||
| date: datetime.date = literal_expression.value | ||
| if isinstance(literal_expression.value, datetime.datetime): | ||
| dt = literal_expression.value |
There was a problem hiding this comment.
We may want to move a lot of this function's core details into the bindings (e.g. create a self.bindings.convert_datetime_literal(...) so we can control it at a dialect level)
Resolves #479