
Zb/illico fixes#4102

Open
zboldyga wants to merge 7 commits into scverse:ig/illico from zboldyga:zb/illico-fixes

Conversation

Contributor

@zboldyga zboldyga commented May 5, 2026

  • Closes #
  • Tests included or not required because:

Contributor Author

zboldyga commented May 5, 2026

@ilan-gold this is a work in progress, but I'm opening a draft to keep you in the loop. This PR is scoped only to fixes to the current illico integration. The 3 commits here correspond to distinct fixes.

As for the groupby change, I want to double-check this and refine it a bit more, but the prior version depends on illico's internal API/decisions and is slower than the proposed approach (by roughly a few seconds of runtime for some datasets I tested; possibly more for bigger ones).

@ilan-gold
Contributor

Nice, this looks like a great start (and a big cleanup over what I had!)

@zboldyga zboldyga marked this pull request as ready for review May 7, 2026 21:12
@zboldyga
Contributor Author

zboldyga commented May 7, 2026

@ilan-gold OK I reviewed carefully and I feel this is sufficient / complete for the scope of this PR.

Kind of a note to self here, but overall, if/when illico is more tightly integrated, the whole flow from X to the final stats can involve fewer operations. I suspect the unstack step won't be needed (the _to_iter function will be removed later, and its tests as well).

@ilan-gold
Contributor

@zboldyga This looks good overall. Just to understand, was this responsible for the performance difference?

@zboldyga
Contributor Author

zboldyga commented May 8, 2026

@ilan-gold I'm glad you asked haha. I ran a wider benchmark, then added some new alternatives, and found a better approach. But yes, the intention here is twofold: some minor test adjustments, plus a performance improvement so that scanpy's illico integration is closer to the speed and memory overhead of using illico directly.

Here's a benchmark across 14 real Perturb-seq datasets. Note that groupby_dict was the former approach, unstack_reindex was what I had prior to the latest commit (it was not helping, tbh), and streaming_loc is what I settled on (very simple: it just walks the data as it exists in memory, with little overhead).

Also, note that this latest approach is streaming-friendly, e.g. it will work with backed files.
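For intuition, the single-pass idea can be sketched roughly like this (a hypothetical minimal sketch, not the PR's actual implementation; `streaming_group_means` is a made-up name, and the real code in the integration may differ):

```python
import numpy as np
import pandas as pd

def streaming_group_means(X, labels):
    """Single pass over X, accumulating per-group sums and counts.

    Hypothetical sketch: X is an (n_rows, n_features) array (or any
    row-chunked source, e.g. a backed file read in blocks) and `labels`
    assigns each row to a group. Only the accumulators, sized
    (n_groups, n_features), are held in memory.
    """
    codes, groups = pd.factorize(labels)  # map group labels to integer codes
    sums = np.zeros((len(groups), X.shape[1]))
    counts = np.bincount(codes, minlength=len(groups))
    # np.add.at scatters each row into its group's running sum in one pass
    np.add.at(sums, codes, X)
    return pd.DataFrame(sums / counts[:, None], index=groups)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
means = streaming_group_means(X, ["a", "b", "a"])
print(means.loc["a"].tolist())  # mean of rows 0 and 2 -> [3.0, 4.0]
```

Because the accumulators never grow with `n_rows`, the same loop structure applies chunk by chunk to data that doesn't fit in memory, which is what makes this kind of approach compatible with backed files.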

Total time (ms)

Datasets ordered by n_cells (ascending).

| dataset | n_cells | n_groups | rows | unstack_reindex | groupby_dict | unstack_nosort | streaming_groupby | streaming_loc | vectorized_reshape |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| tian21crispra | 21,193 | 101 | 2,345,927 | 421.5 | 450.7 | 456.6 | 113.2 | 63.1 | 77.1 |
| wessels23 | 30,636 | 184 | 3,086,600 | 525.3 | 575.1 | 487.9 | 157.2 | 97.2 | 108.6 |
| tian21crispri | 32,300 | 185 | 4,552,110 | 823.8 | 892.8 | 860.6 | 216.1 | 121.1 | 151.5 |
| sunshine23 | 38,431 | 194 | 4,572,580 | 833.6 | 919.6 | 891.2 | 219.3 | 124.4 | 158.7 |
| frangieh21 | 57,419 | 233 | 5,425,871 | 345.8 | 1,040.6 | 342.2 | 265.7 | 145.0 | 179.6 |
| adamson16 | 62,673 | 89 | 1,834,824 | 349.6 | 366.6 | 325.7 | 88.3 | 51.8 | 64.5 |
| norman19 | 111,545 | 238 | 5,380,704 | 1,065.7 | 1,068.6 | 989.2 | 275.1 | 147.8 | 194.8 |
| nadig25hepg2 | 133,757 | 1,819 | 17,504,237 | 3,518.5 | 3,486.4 | 3,476.1 | 910.0 | 742.1 | 579.9 |
| replogle22rpe1 | 240,774 | 2,017 | 17,646,733 | 3,567.0 | 3,402.5 | 3,383.0 | 938.2 | 794.4 | 600.5 |
| nadig25jurkat | 258,202 | 2,138 | 18,987,578 | 3,851.3 | 3,780.7 | 3,730.6 | 1,054.9 | 843.0 | 641.3 |
| replogle22k562 | 308,646 | 1,972 | 16,886,236 | 3,436.3 | 3,223.0 | 3,355.5 | 907.3 | 769.3 | 570.5 |
| kaden25fibroblast | 447,301 | 1,837 | 51,419,467 | 11,115.9 | 10,479.0 | 10,587.2 | 2,539.9 | 1,361.4 | 1,713.6 |
| kaden25rpe1 | 850,225 | 1,837 | 53,102,159 | 12,466.9 | 10,826.8 | 12,016.2 | 2,681.6 | 1,391.7 | 1,798.3 |
| replogle22k562gwps | 1,985,724 | 9,676 | 79,807,648 | 15,481.5 | 15,477.3 | 15,374.2 | 4,342.7 | 3,629.2 | 2,608.3 |

phys_footprint delta (MB)

Peak memory, sampled at 5 ms intervals.

| dataset | n_cells | n_groups | rows | unstack_reindex | groupby_dict | unstack_nosort | streaming_groupby | streaming_loc | vectorized_reshape |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| tian21crispra | 21,193 | 101 | 2,345,927 | 209.2 | 210.8 | 195.2 | 177.1 | 39.1 | 60.9 |
| wessels23 | 30,636 | 184 | 3,086,600 | 240.1 | 298.8 | 246.1 | 238.7 | 50.0 | 79.5 |
| tian21crispri | 32,300 | 185 | 4,552,110 | 474.8 | 383.4 | 420.4 | 312.3 | 74.6 | 117.1 |
| sunshine23 | 38,431 | 194 | 4,572,580 | 480.6 | 384.8 | 416.1 | 313.9 | 74.5 | 118.1 |
| frangieh21 | 57,419 | 233 | 5,425,871 | 184.9 | 465.1 | 196.0 | 373.1 | 89.2 | 139.8 |
| adamson16 | 62,673 | 89 | 1,834,824 | 178.4 | 154.9 | 173.7 | 124.0 | 30.2 | 48.3 |
| norman19 | 111,545 | 238 | 5,380,704 | 472.5 | 445.2 | 458.0 | 369.1 | 86.7 | 139.8 |
| nadig25hepg2 | 133,757 | 1,819 | 17,504,237 | 1,490.0 | 1,343.6 | 1,492.7 | 985.7 | 280.8 | 318.6 |
| replogle22rpe1 | 240,774 | 2,017 | 17,646,733 | 1,528.1 | 1,387.2 | 1,529.2 | 994.8 | 282.7 | 322.9 |
| nadig25jurkat | 258,202 | 2,138 | 18,987,578 | 1,555.8 | 1,486.8 | 1,521.1 | 1,068.4 | 304.5 | 345.0 |
| replogle22k562 | 308,646 | 1,972 | 16,886,236 | 1,514.7 | 1,336.8 | 1,481.7 | 950.9 | 271.1 | 309.2 |
| kaden25fibroblast | 447,301 | 1,837 | 51,419,467 | 3,288.9 | 3,586.1 | 3,292.7 | 2,746.1 | 823.9 | 858.6 |
| kaden25rpe1 | 850,225 | 1,837 | 53,102,159 | 5,179.2 | 3,792.8 | 5,176.9 | 2,835.3 | 851.0 | 884.9 |
| replogle22k562gwps | 1,985,724 | 9,676 | 79,807,648 | 9,742.0 | 7,449.4 | 9,317.6 | 5,548.0 | 1,279.4 | 1,328.2 |

So we're saving both time and memory, and the code is pretty easy to read.
