Skip to content

[AURON #1724] Support binary input for Spark substring function#2262

Open
lyne7-sc wants to merge 4 commits into
apache:masterfrom
lyne7-sc:fix/spark_substr
Open

[AURON #1724] Support binary input for Spark substring function#2262
lyne7-sc wants to merge 4 commits into
apache:masterfrom
lyne7-sc:fix/spark_substr

Conversation

@lyne7-sc
Copy link
Copy Markdown

Which issue does this PR close?

Closes #1724

Rationale for this change

Spark substring supports both string and binary inputs, while Auron previously mapped it to datafusion's Substr, which only handled string-compatible behavior and caused the Spark string/binary substring suite case to be excluded.

What changes are included in this PR?

  • Add native Spark_Substring ext function support for Utf8 and Binary inputs.
  • Route Spark Substring conversion through Spark_Substring and preserve the input data type.
  • Add native unit coverage for string and binary substring.
  • Re-enable string / binary substring function test case.

Are there any user-facing changes?

Yes. Spark SQL substring now supports binary input in native execution, matching Spark behavior.

How was this patch tested?

  • Added native unit tests for spark_substring.
  • Spark String Function Suite

Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, left a few comments

Comment on lines +181 to +188
let start = if pos > 0 {
pos - 1
} else if pos < 0 {
total_len_i64 + pos
} else {
0
}
.clamp(0, total_len_i64) as usize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can split this up for better readability, right now it looked like it was only getting applied to the else block

 let raw_start = if pos > 0 { pos - 1 }
                     else if pos < 0 { total_len_i64 + pos }
                     else { 0 };
let start = raw_start.clamp(0, total_len_i64) as usize;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated.

0
}
.clamp(0, total_len_i64) as usize;
let end = (start as i64 + len).clamp(0, total_len_i64) as usize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we planning to handle overflow for very large len?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, pos and len come from Spark’s Int32 values cast to i64 (via NativeConverters), so overflow is likely unreachable here?

Still, I switched to saturating_add for safety here.

}
}

#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some more tests around the edge cases, examples:

  • len == 0: expect empty string/bytes (not error).
  • pos > total_len: expect empty.
  • pos + len > total_len: expect clamp to end (e.g. substring("abc", 2, 100) == "bc").
  • Empty input string and empty binary input: expect empty

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I’ve added the edge cases you mentioned.

let start = if pos > 0 {
pos - 1
} else if pos < 0 {
total_len_i64 + pos
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if pos == i64.MIN? will it cause any issues?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, switched to satuating_add here as well for safety.

@lyne7-sc
Copy link
Copy Markdown
Author

@ShreyeshArangath Thanks for your review! I’ve updated the implementation accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Failed] Substring does not support binary input

2 participants