Skip to content

Conversation

@klin333
Copy link

@klin333 klin333 commented Nov 6, 2025

As noted in #822, nanoarrow_array_stream -> data.frame conversion is extremely slow, and suffers near exponential time scaling, probably due to extremely inefficient string ALTREP materialization.

Based on the reprex in that issue, the original time is

num_cols elapsed_with_arrow elapsed_without_arrow
10 2.4 secs 1.3 secs
20 3.2 secs 2.9 secs
40 6.1 secs 7.0 secs
80 12.7 secs 30.8 secs
160 26.9 secs 920.7 secs

After this fix, the elapsed time are:

num_cols elapsed_with_arrow elapsed_without_arrow
10 2.4 secs 1.3 secs
20 3.2 secs 2.5 secs
40 6.3 secs 5.3 secs
80 14.2 secs 12.4 secs
160 27.3 secs 25.6 secs

This slowness in convert_array_stream was previously noted in a comment in #219.

The intuition of this PR is that the original recreation of a single array_stream from the already collected batches of arrays, is probably very inefficient. I don't fully understand why we previously did not directly convert collected array to data.frame, given the function already exists. Maybe it's because we don't want to be binding batches of converted data.frame in R?

@klin333
Copy link
Author

klin333 commented Nov 6, 2025

ohhhhhh. i think i understand nanoarrow a bit better now.

I believe convert_array_stream currently doesn't use string ALTREP at all. Only convert_array uses ALTREP. In the PR change, by converting the collected array batches, I've actually changed to using ALTREP, but only for each converted batch. As soon as I vec_rbind the converted batches, ALTREP strings get materialized, and the resulting single data.frame does not use ALTREP.

However, even though vec_rbind forces materialization of ALTREP strings, ie negating the benefits of ALTREP, the total convert array batches path is still good.

Essentially, the PR convert array batches path is not great, but good enough. The convert_array_stream(basic_stream) path is just extremely bad with exponential time behaviour. Ie this PR qualifies only as a workaround, as opposed to root fix. convert_array_stream(basic_stream) should not be taking exponential time, and this PR does nothing to fix it.

Perhaps it's to do with R global string pool? Perhaps by using ALTREP in converted array batches, and concatenating n_batches of them, the global string pool only gets altered (n_batch * ncol) times? I have no idea how the convert_array_stream(basic_stream) path works, perhaps it results in the R global string pool being altered (n_rows * ncols) times?

@paleolimbot
Copy link
Member

Thank you for this! You are absolutely correct on all of this...something is non-linear with the conversion process that should be. I will investigate as soon as I can (hopefully today).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants