Arrow Expressions and Vortex DataSets #5801
Replies: 4 comments 4 replies
-
|
We need to support more substrait conversions in vortex. it would be better to use the vortex duckdb extension: con = duckdb.connect()
con.execute("INSTALL vortex; LOAD vortex;")
arr = pa.array([datetime(2024, 1, 1), datetime(2024, 6, 15), datetime(2024, 12, 31)])
table = pa.table({"ts": arr})
file_path = tmp_path / "test_timestamp.vortex"
vx.io.write(table, str(file_path))
# Use the vortex extension's vortex_scan function
result = con.execute(f"SELECT * FROM vortex_scan('{file_path}') WHERE ts > '2024-06-01'").fetchall()
assert len(result) == 2, f"Expected 2 rows, got {len(result)}: {result}"
# The results should be 2024-06-15 and 2024-12-31
timestamps = sorted([r[0] for r in result])
assert timestamps[0] == datetime(2024, 6, 15)
assert timestamps[1] == datetime(2024, 12, 31) |
Beta Was this translation helpful? Give feedback.
-
|
This the same problem we have elsewhere internally. I think @danking was working on fixing it but not sure what he settled on there. |
Beta Was this translation helpful? Give feedback.
-
|
I kinda think paultiq's last suggestion, recommend not using the Arrow Dataset API, is the best strategy (other than modifying the Arrow Dataset API). Even if we were to partition the expression and push parts down, as I describe in a sibling thread, we would still evaluate the "post-read" expression ourselves. I suspect we don't want to do that: we want duckdb to evaluate everything except for the expressions we can push down! |
Beta Was this translation helpful? Give feedback.
-
|
I think there are a few follow ups here. Let me know if I've missed anything. 1. Dataset Filter PushdownThe Dataset API clearly states that all filters must be applied by the Scanner:
Vortex Dataset should indeed uphold that API. For any expression where push-down into Vortex is not supported, it must be successfully evaluated after reading the batch from Vortex, and before returning to the caller. The problem is that Vortex will by default return StringView for string data, and ListView for list data. Both of which have disappointingly little support from the native Apache Arrow libraries. I think the fix here is for our Dataset API to default to not use the "view" arrays, and be configurable to do so when desired. 2. Vortex SubstraitThe reason we even involve substrait here is because PyArrow has no API to traverse an expression tree. Therefore the only way to unpack the expression in order for us to convert it to Vortex (or ideally partition it into supported / unsupported expressions) is to convert it to substrait. Substrait is an odd API. In theory it is a cross-system query/expression language. In practice, most systems define their own set of extension functions to model their exact semantics. Admittedly, the builtin extensions tend to follow Arrow semantics. Vortex's substrait support right now is rather lacking, so this conversion often fails. We should define Vortex <-> Substrait conversion as a Rust crate since we will want to support these conversions in Python, Java, C++, and other bindings. 3. Vortex DuckDBNow that Vortex ships as a core extension in DuckDB, we should expose our Vortex DuckDB integration through the PyVortex APIs, rather than using DuckDB + Arrow Dataset. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Issue Description
Vortex's compute differs significantly from Arrow compute in terms of supported predicates. #5781 provides one list where Vortex doesn't support things that Arrow supports... and #5765 is an example where Vortex's compute supports things Arrow doesn't.
Assuming that 5781 will take some time to resolve (to achieve parity with Arrow), this leads to a number of problems... especially with duckdb which will attempt to pushdown unsupported expressions. For example, the following query will raise an exception in duckdb:
* duckdb's CanPushdown assumes all Arrow sources support the same expressions.
I'm unclear on whether this is an Arrow problem, a Vortex problem or a duckdb problem:
One thing Vortex can do is add test cases for:
DuckDB Example
Arrow Example
All this does is demonstrate that certain kernels aren't implemented. The point here is not that the kernel isn't implement, but just to demonstrate why the above (duckdb) example occurs.
Parquet Works
Vortex Fails
Expected Behavior
.
Actual Behavior
.
Reproduction Steps
.
OS Version Information
Ubuntu 24.04
I acknowledge that:
```) on separate lines.Beta Was this translation helpful? Give feedback.
All reactions