[AURON #1724] Support binary input for Spark substring function #2262
lyne7-sc wants to merge 4 commits
ShreyeshArangath left a comment
Mostly LGTM, left a few comments
let start = if pos > 0 {
    pos - 1
} else if pos < 0 {
    total_len_i64 + pos
} else {
    0
}
.clamp(0, total_len_i64) as usize;
nit: we can split this up for better readability. Right now it looks like the clamp is only applied to the else block.
let raw_start = if pos > 0 { pos - 1 }
else if pos < 0 { total_len_i64 + pos }
else { 0 };
let start = raw_start.clamp(0, total_len_i64) as usize;
    0
}
.clamp(0, total_len_i64) as usize;
let end = (start as i64 + len).clamp(0, total_len_i64) as usize;
How are we planning to handle overflow for very large len?
In practice, pos and len come from Spark’s Int32 values cast to i64 (via NativeConverters), so overflow is likely unreachable here?
Still, I switched to saturating_add for safety here.
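For readers following the overflow question, here is a minimal sketch of the bounds computation with `saturating_add`. The helper name `substring_bounds` and its signature are hypothetical and are not taken from the PR. It mirrors the clamp-then-add shape of the snippet under review, with one deliberate difference that is my own choice: `end` is clamped to `[start, total_len]` so that a negative `len` cannot produce an inverted range.

```rust
/// Hypothetical helper (not the PR's actual code): computes the half-open
/// range [start, end) for a 1-based `pos` and a length `len`.
fn substring_bounds(total_len: usize, pos: i64, len: i64) -> (usize, usize) {
    let total_len_i64 = total_len as i64;

    // 1-based positive positions, negative positions counted from the end,
    // and 0 treated as the first element.
    let raw_start = if pos > 0 {
        pos - 1
    } else if pos < 0 {
        total_len_i64.saturating_add(pos)
    } else {
        0
    };
    let start = raw_start.clamp(0, total_len_i64);

    // saturating_add pins the sum at i64::MAX instead of overflowing,
    // and the clamp then brings it back inside the input.
    let end = start.saturating_add(len).clamp(start, total_len_i64);

    (start as usize, end as usize)
}
```

With a plain `+`, a call such as `substring_bounds(3, 2, i64::MAX)` would panic in debug builds and wrap in release builds; with the saturating version it yields the range up to the end of the input.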
    }
}

#[test]
let's add some more tests around the edge cases, examples:
- len == 0: expect empty string/bytes (not error).
- pos > total_len: expect empty.
- pos + len > total_len: expect clamp to end (e.g. substring("abc", 2, 100) == "bc").
- Empty input string and empty binary input: expect empty
Thanks, I’ve added the edge cases you mentioned.
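As an illustration of what such edge-case tests could look like, the sketch below uses a hypothetical byte-oriented helper, `substring_bytes`, which is not the function added in the PR. It only covers binary (byte-indexed) input; a `Utf8` path would need character-based positions, which this sketch does not attempt.

```rust
/// Hypothetical byte-oriented helper, written only for this example.
fn substring_bytes(input: &[u8], pos: i64, len: i64) -> &[u8] {
    let total = input.len() as i64;
    let raw_start = if pos > 0 {
        pos - 1
    } else if pos < 0 {
        total.saturating_add(pos)
    } else {
        0
    };
    let start = raw_start.clamp(0, total);
    // Clamping to [start, total] keeps the range valid even for len <= 0.
    let end = start.saturating_add(len).clamp(start, total);
    &input[start as usize..end as usize]
}

#[cfg(test)]
mod tests {
    use super::substring_bytes;

    #[test]
    fn len_zero_returns_empty() {
        assert_eq!(substring_bytes(b"abc", 1, 0), b"");
    }

    #[test]
    fn pos_past_end_returns_empty() {
        assert_eq!(substring_bytes(b"abc", 10, 2), b"");
    }

    #[test]
    fn len_past_end_is_clamped() {
        assert_eq!(substring_bytes(b"abc", 2, 100), b"bc");
    }

    #[test]
    fn empty_input_returns_empty() {
        assert_eq!(substring_bytes(b"", 1, 5), b"");
    }
}
```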
let start = if pos > 0 {
    pos - 1
} else if pos < 0 {
    total_len_i64 + pos
What if pos == i64::MIN? Will it cause any issues?
Same as above, switched to saturating_add here as well for safety.
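A small sketch of the arithmetic behind this question, assuming `total_len_i64` is never negative (it comes from a length). Under that assumption, adding any negative `pos`, including `i64::MIN`, stays within the `i64` range, so the saturating form is a defensive choice rather than a fix for a reachable overflow. The function name `negative_start` is hypothetical.

```rust
/// Hypothetical helper: start offset for a negative `pos`, before clamping.
/// Assumes `total_len_i64 >= 0`, so the sum is always >= i64::MIN.
fn negative_start(total_len_i64: i64, pos: i64) -> i64 {
    debug_assert!(total_len_i64 >= 0 && pos < 0);
    total_len_i64.saturating_add(pos)
}

/// Returns true when `a + b` does not fit in an i64.
fn add_overflows(a: i64, b: i64) -> bool {
    a.checked_add(b).is_none()
}
```

Here `add_overflows(3, i64::MIN)` is false, while `add_overflows(1, i64::MAX)` is true, which is why the `start + len` sum is the case where saturation actually changes the result.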
@ShreyeshArangath Thanks for your review! I’ve updated the implementation accordingly.
Which issue does this PR close?
Closes #1724
Rationale for this change
Spark substring supports both string and binary inputs, while Auron previously mapped it to DataFusion's Substr, which only handled string-compatible behavior and caused the Spark string/binary substring suite case to be excluded.
What changes are included in this PR?
- Spark_Substring ext function support for Utf8 and Binary inputs.
- Substring conversion through Spark_Substring and preserve the input data type.
- string / binary substring function test case.
Are there any user-facing changes?
Yes. Spark SQL substring now supports binary input in native execution, matching Spark behavior.
How was this patch tested?
spark_substring.