Zb/illico fixes by zboldyga · Pull Request #4102 · scverse/scanpy

zboldyga · 2026-05-05T05:38:49Z

Closes #
Tests included or not required because:

Release notes not necessary because:

for more information, see https://pre-commit.ci

zboldyga · 2026-05-05T05:43:56Z

@ilan-gold this is a work in progress but opening a draft to keep you in the loop. This PR is scoped only to fixes to the current illico integration. The 3 commits here correspond with unique fixes

As for the groupby change, I want to double-check this and refine a bit more, but the prior version has dependencies on the illico internal api/decisions, and is slower than the proposed approach (~a few seconds runtime for some datasets I tested, could be more for bigger ones).

ilan-gold · 2026-05-05T09:20:31Z

Nice this looks like a great start (and a big cleanup over what I had!)

for more information, see https://pre-commit.ci

zboldyga · 2026-05-07T21:21:17Z

@ilan-gold OK I reviewed carefully and I feel this is sufficient / complete for the scope of this PR.

Kind of note to self here, but overall if/when illico is more tightly integrated the whole flow from X to final stats can involve fewer operations. I suspect the unstack step won't be needed (the _to_iter function will be removed later, and it's tests as well).

ilan-gold · 2026-05-08T09:57:31Z

@zboldyga This looks good overall. Just to understand, was this responsible for the performance difference?

zboldyga · 2026-05-08T22:07:14Z

@ilan-gold I'm glad you asked haha. I ran a wider benchmark, then added some new alternatives, and found a better approach. But yes, the intention here is twofold: there are some minor test adjustments, and a performance improvement so that scanpy's illico is closer to the speed + memory overhead of using illico directly.

Here's benchmarking across 14 real perturb-seq datasets. Note that groupby_dict was the former approach. unstack_reindex was what I had prior to the latest commit (it was not helping, tbh). And streaming_loc is what I settled on (very simple, just walks the data as it exists in memory and there's little overhead to doing this).

Also, note that this latest approach is streaming friendly, e.g. it will work with backed files.

Total time (ms)

Datasets ordered by n_cells (ascending).

dataset	n_cells	n_groups	rows	unstack_reindex	groupby_dict	unstack_nosort	streaming_groupby	streaming_loc	vectorized_reshape
tian21crispra	21,193	101	2,345,927	421.5	450.7	456.6	113.2	63.1	77.1
wessels23	30,636	184	3,086,600	525.3	575.1	487.9	157.2	97.2	108.6
tian21crispri	32,300	185	4,552,110	823.8	892.8	860.6	216.1	121.1	151.5
sunshine23	38,431	194	4,572,580	833.6	919.6	891.2	219.3	124.4	158.7
frangieh21	57,419	233	5,425,871	345.8	1,040.6	342.2	265.7	145.0	179.6
adamson16	62,673	89	1,834,824	349.6	366.6	325.7	88.3	51.8	64.5
norman19	111,545	238	5,380,704	1,065.7	1,068.6	989.2	275.1	147.8	194.8
nadig25hepg2	133,757	1,819	17,504,237	3,518.5	3,486.4	3,476.1	910.0	742.1	579.9
replogle22rpe1	240,774	2,017	17,646,733	3,567.0	3,402.5	3,383.0	938.2	794.4	600.5
nadig25jurkat	258,202	2,138	18,987,578	3,851.3	3,780.7	3,730.6	1,054.9	843.0	641.3
replogle22k562	308,646	1,972	16,886,236	3,436.3	3,223.0	3,355.5	907.3	769.3	570.5
kaden25fibroblast	447,301	1,837	51,419,467	11,115.9	10,479.0	10,587.2	2,539.9	1,361.4	1,713.6
kaden25rpe1	850,225	1,837	53,102,159	12,466.9	10,826.8	12,016.2	2,681.6	1,391.7	1,798.3
replogle22k562gwps	1,985,724	9,676	79,807,648	15,481.5	15,477.3	15,374.2	4,342.7	3,629.2	2,608.3

phys_footprint delta (MB)

Peak memory, sampled at 5 ms

dataset	n_cells	n_groups	rows	unstack_reindex	groupby_dict	unstack_nosort	streaming_groupby	streaming_loc	vectorized_reshape
tian21crispra	21,193	101	2,345,927	209.2	210.8	195.2	177.1	39.1	60.9
wessels23	30,636	184	3,086,600	240.1	298.8	246.1	238.7	50.0	79.5
tian21crispri	32,300	185	4,552,110	474.8	383.4	420.4	312.3	74.6	117.1
sunshine23	38,431	194	4,572,580	480.6	384.8	416.1	313.9	74.5	118.1
frangieh21	57,419	233	5,425,871	184.9	465.1	196.0	373.1	89.2	139.8
adamson16	62,673	89	1,834,824	178.4	154.9	173.7	124.0	30.2	48.3
norman19	111,545	238	5,380,704	472.5	445.2	458.0	369.1	86.7	139.8
nadig25hepg2	133,757	1,819	17,504,237	1,490.0	1,343.6	1,492.7	985.7	280.8	318.6
replogle22rpe1	240,774	2,017	17,646,733	1,528.1	1,387.2	1,529.2	994.8	282.7	322.9
nadig25jurkat	258,202	2,138	18,987,578	1,555.8	1,486.8	1,521.1	1,068.4	304.5	345.0
replogle22k562	308,646	1,972	16,886,236	1,514.7	1,336.8	1,481.7	950.9	271.1	309.2
kaden25fibroblast	447,301	1,837	51,419,467	3,288.9	3,586.1	3,292.7	2,746.1	823.9	858.6
kaden25rpe1	850,225	1,837	53,102,159	5,179.2	3,792.8	5,176.9	2,835.3	851.0	884.9
replogle22k562gwps	1,985,724	9,676	79,807,648	9,742.0	7,449.4	9,317.6	5,548.0	1,279.4	1,328.2

So we're saving time and memory, and the code is pretty easy to read

zboldyga and others added 4 commits May 4, 2026 21:19

illico test fix

9ce2138

wilcoxon-illico included in test_mean_in_log_space

8e76611

more robust and performant illico dataframe handling logic

4263b08

[pre-commit.ci] auto fixes from pre-commit.com hooks

2fde37e

for more information, see https://pre-commit.ci

zboldyga and others added 2 commits May 7, 2026 14:10

simpler comment, better variable names

806ea7d

[pre-commit.ci] auto fixes from pre-commit.com hooks

f5741c4

for more information, see https://pre-commit.ci

zboldyga marked this pull request as ready for review May 7, 2026 21:12

more optimal approach, simplified tests.

68caea9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zb/illico fixes#4102

Zb/illico fixes#4102
zboldyga wants to merge 7 commits intoscverse:ig/illicofrom
zboldyga:zb/illico-fixes

zboldyga commented May 5, 2026

Uh oh!

zboldyga commented May 5, 2026 •

edited

Loading

Uh oh!

ilan-gold commented May 5, 2026

Uh oh!

zboldyga commented May 7, 2026 •

edited

Loading

Uh oh!

ilan-gold commented May 8, 2026

Uh oh!

zboldyga commented May 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zboldyga commented May 5, 2026

Uh oh!

zboldyga commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilan-gold commented May 5, 2026

Uh oh!

zboldyga commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilan-gold commented May 8, 2026

Uh oh!

zboldyga commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Total time (ms)

phys_footprint delta (MB)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zboldyga commented May 5, 2026 •

edited

Loading

zboldyga commented May 7, 2026 •

edited

Loading

zboldyga commented May 8, 2026 •

edited

Loading