modularize optimization internals #7401

ben-schwen · 2025-10-28T17:32:57Z

Closes #5336 (Using Map instead of lapply turns GForce off)
Closes #5032 (lapply GForce opt could work also without .SD)
Closes #4305 (Move GForce tests to own script)
Towards #3815 (GForce optimisation could be more smart)
Closes #2934 (GForce as.double / as.numeric)
Closes #7404 (benchmark regression)
tests (a lot of them)

Adds arithmetic for GForce as requested in #3815, but does not add support for blocks in j such as d[, j={x<-x; .(min(x))}, by=y].
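To make the scope concrete, here is a minimal sketch of the kind of query the arithmetic support targets, next to the braced-block form that stays out of scope. The table DT and its columns g and x are invented for illustration. Whether a given expression is actually routed through GForce depends on the installed data.table version (verbose=TRUE reports it), so the sketch only shows that both spellings return the same values.

```r
library(data.table)

# Toy data, invented for this example
DT <- data.table(g = c("a", "a", "b", "b", "b"), x = c(1, 4, 2, 8, 5))

# Arithmetic between grouped aggregates: the shape discussed in #3815
res <- DT[, .(rng = max(x) - min(x)), by = g]

# A braced block in j: still valid, but not covered by this PR's optimization
blk <- DT[, {y <- x; list(rng = max(y) - min(y))}, by = g]

print(res)
print(blk)
```

Adding verbose=TRUE to either call is the usual way to see which optimization path, if any, was taken.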

codecov · 2025-10-28T17:51:29Z

Codecov Report

❌ Patch coverage is 99.25094% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.01%. Comparing base (a325db9) to head (c5e95d2).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
R/data.table.R	99.59%	1 Missing ⚠️
R/test.data.table.R	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7401      +/-   ##
==========================================
- Coverage   99.02%   99.01%   -0.02%     
==========================================
  Files          87       87              
  Lines       16803    16895      +92     
==========================================
+ Hits        16640    16728      +88     
- Misses        163      167       +4


github-actions · 2025-10-28T17:52:01Z

HEAD=modular_gforce stopped early for DT[by,verbose=TRUE] improved in #6296
HEAD=modular_gforce slower P<0.001 for memrecycle regression fixed in #5463
HEAD=modular_gforce slower P<0.001 for setDT improved in #5427

Generated via commit c5e95d2

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 57 seconds
Installing different package versions	44 seconds
Running and plotting the test cases	5 minutes and 3 seconds

man/test.Rd

R/data.table.R

ben-schwen · 2025-11-02T19:05:06Z

I'm also not sure about moving the tests to optimize.Rraw: it feels kind of wrong, and it is no longer needed now that test() has the new levels/optimization parameter.

NEWS.md

MichaelChirico · 2026-01-11T07:54:26Z

R/data.table.R

+        jvnames = c(jvnames, sdvars)
+      }
+      # Case 2e: Complex .SD usage - can't optimize
+      else if (any(all.vars(this) == ".SD")) {


Let's just drop this branch? since it's not yet supported.

MichaelChirico · 2026-01-11T08:00:25Z

R/data.table.R

+  jsub = as.call(ans)  # important no names here
+  jvnames = sdvars      # but here instead
+  list(jsub=jsub, jvnames=jvnames, funi=funi+1L)
+  # It may seem inefficient to construct a potentially long expression. But, consider calling


might be worth benchmarking this (atime?)... written 14 years ago, I wonder if it's still true 5176108

It's still true with nrow=1e6, ncol=10, but there is no noticeable difference between master and this PR.


library(atime)
library(data.table)

pkg.path <- '.'
limit <- 10

# Package editing function for atime
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

# Commits to compare
versions <- c(
  'master' = 'b3acef8',
  'PR' = '383b60a'
)

# Vary number of groups
N_groups <- 10^seq(1, 5, 0.25)
set.seed(42)
test_data <- lapply(setNames(nm = N_groups), function(n_groups) {
  n_rows <- 1e6
  n_cols <- 10
  dt <- data.table(grp = sample(n_groups, n_rows, replace = TRUE))
  for (i in seq_len(n_cols)) {
    set(dt, j = paste0("V", i), value = rnorm(n_rows))
  }
  dt
})

# Test 1: With optimize = 1 (GForce enabled)
gforce_opt1 <- atime_versions(
  pkg.path,
  N_groups,
  setup = {
    options(datatable.optimize = 1)
    DT <- test_data[[as.character(N)]]
  },
  expr = {
    data.table:::`[.data.table`(DT, , lapply(.SD, sum), by = grp)
  },
  seconds.limit = limit,
  verbose = TRUE,
  sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)

# Test 2: With optimize = 0 (GForce disabled)
gforce_opt0 <- atime_versions(
  pkg.path,
  N_groups,
  setup = {
    options(datatable.optimize = 0)
    DT <- test_data[[as.character(N)]]
  },
  expr = {
    data.table:::`[.data.table`(DT, , lapply(.SD, sum), by = grp)
  },
  seconds.limit = limit,
  verbose = TRUE,
  sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)

library(ggplot2)
opt1_data <- gforce_opt1$measurements
opt1_data$optimize <- "opt1"
opt0_data <- gforce_opt0$measurements
opt0_data$optimize <- "opt0"
combined_data <- rbind(opt1_data, opt0_data)

pdf("gforce_lapply_benchmark.pdf", width = 12, height = 6)
p <- ggplot(combined_data, aes(x = N, y = median, color = expr.name, linetype = optimize)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_log10("Number of groups") +
  scale_y_log10("Median time (seconds)") +
  labs(title = "DT[, lapply(.SD, sum), by=grp] Performance",
       subtitle = "Comparing optimize=0 vs optimize=1 across commits",
       color = "Version",
       linetype = "Optimization") +
  theme_bw() +
  theme(legend.position = "bottom")
print(p)
dev.off()

nice... wonder if benchmark.Rraw or something like that (a Wiki?) would be a good place to compile these kinds of low-level optimizations based on R's own logic...

(not required for this PR)

MichaelChirico · 2026-01-13T17:24:02Z

R/data.table.R

+  if (length(names(txt))>1L) .Call(Csetcharvec, names(txt), 2L, "")  # fixes bug #110
+  # support Map instead of lapply #5336
+  fun = if (jsub %iscall% "Map") txt[[1L]] else txt[[2L]]
+  if (fun %iscall% "function") {


Does it make sense to have a script of things that are only testable with the R 4.0.0+ parser? I guess it's pretty rare...

Suggested change

if (fun %iscall% "function") {

if (fun %iscall% "function") { # NB: '\(x)' only exists pre-parser, so it's also covered
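The point of the suggested comment can be checked at the parser level. The sketch below assumes R 4.1.0 or later, where the `\(x)` shorthand exists, and uses two hypothetical helpers, get_fun() and is_function_call(), as stand-ins for the diff's txt[[1L]]/txt[[2L]] selection and for data.table's internal %iscall%. It is not the PR's code, only an illustration that both spellings of an anonymous function parse to a call to `function`.

```r
# Pull the function argument out of a quoted lapply()/Map() call.
# lapply(X, FUN) has the function second; Map(f, ...) has it first.
get_fun <- function(jsub) {
  args <- as.list(jsub)[-1L]
  if (identical(jsub[[1L]], quote(Map))) args[[1L]] else args[[2L]]
}

# Stand-in for `fun %iscall% "function"`
is_function_call <- function(e) is.call(e) && identical(e[[1L]], as.name("function"))

f_long  <- get_fun(quote(lapply(.SD, function(x) x + 1)))
f_short <- get_fun(quote(Map(\(x) x + 1, .SD)))
f_name  <- get_fun(quote(lapply(.SD, sum)))

is_function_call(f_long)   # TRUE
is_function_call(f_short)  # TRUE: the shorthand is gone after parsing
is_function_call(f_name)   # FALSE: a bare symbol, not a `function` call
```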

MichaelChirico · 2026-01-13T17:40:31Z

R/data.table.R

+  jsubl = as.list.default(jsub)
+  oldjvnames = jvnames
+  jvnames = NULL  # TODO: not let jvnames grow, maybe use (number of lapply(.SD, .))*length(sdvars) + other jvars ?? not straightforward.
+  # Fix for #744. Don't use 'i' in for-loops. It masks the 'i' from the input!!


nit: i guess this is obsolete? there's no i in scope here and no eval() here

MichaelChirico · 2026-01-13T17:42:49Z

R/data.table.R

+  # Apply GForce
+  if (jsub %iscall% "list") {
+    GForce = TRUE
+    for (ii in seq.int(from=2L, length.out=length(jsub)-1L)) {


I changed another 2:length(...) to this seq.int() form, but I see 2:length(...) elsewhere and now I'm second-guessing -- I guess it's fine (better, even) to use the shorter 2:length(...)?
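For reference, the two forms only differ when the call has no arguments. A small base-R sketch, using made-up quoted calls rather than anything from the PR:

```r
empty_call <- quote(list())       # length 1: just the function name
two_args   <- quote(list(a, b))   # length 3

# With at least one argument the two forms visit the same indices
idx_colon <- 2:length(two_args)
idx_seq   <- seq.int(from = 2L, length.out = length(two_args) - 1L)

# With no arguments, the colon form counts down (2, 1) while seq.int() is empty
bad_colon <- 2:length(empty_call)
safe_seq  <- seq.int(from = 2L, length.out = length(empty_call) - 1L)

print(bad_colon)
print(length(safe_seq))
```

So the shorter 2:length(...) is safe only where the call is already known to have at least one argument.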

R/data.table.R

MichaelChirico · 2026-01-13T21:46:55Z

R/data.table.R

+        funi = massage_result$funi
+        jsubl[[i_]] = as.list(massage_result$jsub[-1L]) # just keep the '.' from list(.)
+        jn__ = massage_result$jvnames
+        if (isTRUE(nzchar(names(jsubl)[i_]))) {


IINM isTRUE() is only needed if names() might be NA, which is possible in general but shouldn't be here. Let's remove it?

oh no, it's for the is.null(names(.)) case, got it.
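A short base-R sketch of why the isTRUE() wrapper matters when names() is NULL. The quoted calls are invented for illustration:

```r
unnamed <- as.list(quote(list(a, b)))      # no names at all: names() is NULL
named   <- as.list(quote(list(p = a, b)))  # names() is c("", "p", "")

# Indexing NULL gives NULL, and nzchar(NULL) is logical(0)
raw_check <- nzchar(names(unnamed)[2L])

# A bare if() on logical(0) is an error ...
bare <- tryCatch(if (nzchar(names(unnamed)[2L])) "named" else "unnamed",
                 error = function(e) "error")

# ... while isTRUE() turns it into a plain FALSE
wrapped <- isTRUE(nzchar(names(unnamed)[2L]))

print(bare)
print(wrapped)
print(isTRUE(nzchar(names(named)[2L])))
```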

MichaelChirico · 2026-01-13T21:56:04Z

R/data.table.R

+        if (isTRUE(nzchar(names(jsubl)[i_]))) {
+          # Fix for #2311, prepend named list arguments of c() to that list's names. See tests 2283.*
+          jl__names = names(jl__) %||% rep("", length(jl__))
+          jl__hasname = nzchar(names(jl__))


?

Suggested change

jl__hasname = nzchar(names(jl__))

jl__hasname = nzchar(jl__names)
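The difference between the two lines shows up when the list has no names. In this sketch `%||%` is defined locally so it does not depend on the R version, and jl is a made-up stand-in for jl__:

```r
`%||%` <- function(x, y) if (is.null(x)) y else x

jl <- list(quote(sum(a)), quote(mean(b)))   # unnamed, so names(jl) is NULL

jl_names <- names(jl) %||% rep("", length(jl))

from_raw      <- nzchar(names(jl))   # logical(0): wrong length for indexing jl
from_fallback <- nzchar(jl_names)    # one FALSE per element

print(length(from_raw))
print(from_fallback)
```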

MichaelChirico · 2026-01-13T21:56:51Z

R/data.table.R

+        } else {
+          jn__ = names(jl__) %||% rep("", length(jl__))
+        }
+        idx = unlist(lapply(jl__, function(x) is.name(x) && x == ".I"))


maybe?

Suggested change

idx = unlist(lapply(jl__, function(x) is.name(x) && x == ".I"))

idx = vapply_1b(jl__, identical, quote(.I))
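A quick check that the two spellings agree, with one difference on empty input. vapply_1b is internal to data.table, so the definition below is only a stand-in with the assumed behavior (vapply with a logical template), and jl is made up:

```r
# Stand-in for data.table's internal helper (assumed behavior)
vapply_1b <- function(x, fun, ...) vapply(x, fun, NA, ...)

jl <- list(quote(.I), quote(x), quote(sum(.I)), ".I")

old <- unlist(lapply(jl, function(x) is.name(x) && x == ".I"))
new <- vapply_1b(jl, identical, quote(.I))

# On an empty list the lapply form collapses to NULL, vapply keeps logical(0)
old_empty <- unlist(lapply(list(), function(x) is.name(x) && x == ".I"))
new_empty <- vapply_1b(list(), identical, quote(.I))

print(old)
print(new)
```

Only the bare symbol matches in both versions; the call sum(.I) and the string ".I" do not.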

MichaelChirico · 2026-01-13T22:03:45Z

R/data.table.R

+  }
+
+  # Return result
+  if (!is_valid || !any_optimized) {


what's the difference between is_valid and any_optimized?

MichaelChirico · 2026-01-13T22:06:25Z

R/data.table.R

+        jvnames = c(jvnames, gsub("^[.]([N])$", "\\1", this))
+      } else {
+        # jvnames = c(jvnames, if (is.null(names(jsubl))) "" else names(jsubl)[i_])
+        is_valid = FALSE


this is_valid=FALSE; break style feels a bit like a goto statement where we have to jump past a bunch of other logic to see the flow; why not do early returns directly at the site of the "failure" instead?

Maybe something like this would help

unedited_expr = list(jsub=jsub, jvnames=oldjvnames, funi=funi, optimized=FALSE)
# ...
return(unedited_expr)
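As a rough illustration of the early-return shape, here is a toy rewriter in base R. Everything in it is hypothetical: optimize_j() is not a data.table function, and the sum-to-gsum rewrite merely imitates the idea of swapping in a grouped implementation.

```r
# Toy optimizer: rewrite list(sum(.), sum(.), ...) to use gsum, bail out otherwise
optimize_j <- function(jsub) {
  # Captured before any edits, so every failure site can return it directly
  unedited <- list(jsub = jsub, optimized = FALSE)

  if (!is.call(jsub) || !identical(jsub[[1L]], quote(list))) return(unedited)

  for (ii in seq_along(jsub)[-1L]) {
    this <- jsub[[ii]]
    if (!is.call(this) || !identical(this[[1L]], quote(sum))) return(unedited)
    this[[1L]] <- quote(gsum)
    jsub[[ii]] <- this
  }
  list(jsub = jsub, optimized = TRUE)
}

ok  <- optimize_j(quote(list(sum(a), sum(b))))
bad <- optimize_j(quote(list(sum(a), mean(b))))

print(ok$jsub)
print(bad$jsub)
```

Because unedited is built up front, a partial rewrite (the first element of the second call) never leaks into the returned expression, and no is_valid flag has to be threaded through the loop.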

MichaelChirico · 2026-01-13T22:08:08Z

R/data.table.R

+  # Pattern 3b: Map(fun, .SD)
+  # Only optimize if .SD appears exactly once to avoid cases like Map(rep, .SD, .SD)
+  else if (is.call(jsub) && jsub %iscall% "Map" && length(jsub) >= 3L && jsub[[3L]] == ".SD" && length(sdvars) &&
+           sum(vapply_1b(as.list(jsub), function(x) identical(x, quote(.SD)))) == 1L) {


My memory is it's good to avoid creating lambdas if you can avoid it

Suggested change

sum(vapply_1b(as.list(jsub), function(x) identical(x, quote(.SD)))) == 1L) {

sum(vapply_1b(as.list(jsub), identical, quote(.SD))) == 1L) {
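The "exactly once" guard can be tried on quoted calls in base R. count_sd() is a hypothetical helper that mirrors the condition using plain vapply, and it only looks at top-level arguments, as the condition in the diff does:

```r
# Count top-level occurrences of the symbol .SD in a quoted call
count_sd <- function(jsub) sum(vapply(as.list(jsub), identical, NA, quote(.SD)))

n_single <- count_sd(quote(Map(sum, .SD)))        # .SD once: candidate for optimization
n_double <- count_sd(quote(Map(rep, .SD, .SD)))   # .SD twice: left alone
n_nested <- count_sd(quote(Map(sum, .SD[1:2])))   # .SD only inside another call

print(c(n_single, n_double, n_nested))
```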

MichaelChirico

OK, I think last round of review here. Looking great!

modular optimization paths - init

ebd152d

ben-schwen added 13 commits October 29, 2025 09:17

make linter happy

71b21ab

move tests

8a9e727

add lapply(list(col1, col2, ...), fun) pattern

04e5782

turn on optimization

a8dde19

add type conversion support to GForce

67f2874

remove stale branch

2876ebe

add tests

c445c38

update man

5410e31

merge tests

dece1c6

polish test fun

5e1789d

add arithmetic

62f1c48

add AST walker and update tests

c47ec27

add tests

1d324d6

ben-schwen marked this pull request as ready for review November 2, 2025 18:01

ben-schwen requested a review from MichaelChirico as a code owner November 2, 2025 18:01

ben-schwen added 2 commits November 2, 2025 19:30

Merge branch 'master' into modular_gforce

6b54c1e

add NEWS

22cf35e

jangorecki reviewed Nov 2, 2025

man/test.Rd (outdated)

jangorecki reviewed Nov 2, 2025

R/data.table.R (outdated)

ben-schwen mentioned this pull request Nov 2, 2025

benchmark regression #7404

Closed

jangorecki reviewed Nov 3, 2025

NEWS.md (outdated)

ben-schwen added 5 commits November 3, 2025 09:45

make function name in massageSD more expressive

25a7e2e

rename levels argument to optimization

eb8056c

update docs

4544398

restore test nums

d40edb8

remove double tests

5e7efb7

ben-schwen added 5 commits January 9, 2026 16:14

fix NEWS numbering

cf6def5

remove trailing ws

0480ee5

clean up merge

4be8a24

update errors

0a07dba

Merge branch 'master' into modular_gforce

283ba85

MichaelChirico reviewed Jan 11, 2026

MichaelChirico and others added 7 commits January 11, 2026 00:13

try another round of unnesting

416bff0

Merge branch 'master' into modular_gforce

446b448

cleaning up my mess (typos, thinkos)

2986436

thinko

45a4575

restore fallthrough case

7a1b021

delint

383b60a

move the optimization comment into 'documentation' of .massageSD

62997d3