Add a cluster or by argument to center() and standardize() for unweighted Level 2 statistics #672
Description
First, I want to say how much I appreciate the recent and exciting work going into demean() and degroup() for handling multilevel and panel data (as seen in #520, #637, and #639).
I have a related feature request regarding multilevel data, specifically pertaining to grand-mean centering and standardizing Level 2 cluster variables in unbalanced long-format datasets.
Currently, when applying datawizard::center() or datawizard::standardize() to a Level 2 variable in long format, the functions calculate the mean and standard deviation across all rows. In an unbalanced design where clusters contribute different numbers of rows, these statistics are implicitly weighted by cluster size.
In multilevel modeling, we typically prefer the unweighted grand mean and SD (with each cluster counted exactly once) to center or standardize Level 2 variables.
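To make the distinction concrete, here is how I would write the two sets of statistics. The notation is my own shorthand, not anything from the datawizard documentation: $J$ clusters, cluster $j$ has $n_j$ rows and a single Level 2 value $x_j$, and $N = \sum_j n_j$ is the number of rows in the long data.

```latex
% Row-based ("weighted") statistics: what mean() and sd() return on the long vector
\bar{x}_{w} = \frac{\sum_{j=1}^{J} n_j x_j}{N},
\qquad
s_{w} = \sqrt{\frac{\sum_{j=1}^{J} n_j \left(x_j - \bar{x}_{w}\right)^2}{N - 1}}

% Cluster-based ("unweighted") statistics: each cluster counted once
\bar{x}_{uw} = \frac{1}{J} \sum_{j=1}^{J} x_j,
\qquad
s_{uw} = \sqrt{\frac{\sum_{j=1}^{J} \left(x_j - \bar{x}_{uw}\right)^2}{J - 1}}
```

For the five schools in the reprex, the row-based mean is $7420 / 167 \approx 44.43$, while the cluster-based mean is $150 / 5 = 30$. The two coincide only when all $n_j$ are equal.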
Proposal:
It would be incredibly helpful to add an argument like cluster (or by) to center() and standardize(). This would tell the function to compute the mean and SD from one value per cluster rather than from every row of the repeated vector, and then apply those unweighted statistics back to the long-format data.
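As a rough illustration of the behavior I have in mind, here is a minimal base R sketch. The helper `scale_level2()` and its arguments are hypothetical names I made up for this issue; they are not part of datawizard, and the sketch assumes the variable is constant within each cluster.

```r
# Hypothetical helper (not a datawizard function): centers, and optionally
# scales, a Level 2 variable using statistics computed once per cluster.
scale_level2 <- function(x, cluster, scale = FALSE) {
  stopifnot(length(x) == length(cluster))

  # Keep the first row of each cluster; assumes x is constant within cluster
  first_row <- !duplicated(cluster)
  x_l2 <- x[first_row]

  # Apply the cluster-level (unweighted) statistics back to every row
  out <- x - mean(x_l2, na.rm = TRUE)
  if (scale) {
    out <- out / sd(x_l2, na.rm = TRUE)
  }
  out
}

# Same unbalanced layout as the reprex: 5 schools, 167 rows
n_students <- c(2, 5, 10, 50, 100)
school <- rep(1:5, times = n_students)
funding <- rep(c(10, 20, 30, 40, 50), times = n_students)

funding_c_uw <- scale_level2(funding, school)
funding_z_uw <- scale_level2(funding, school, scale = TRUE)
```

A real implementation would presumably also need to decide what to do when a variable is not constant within a cluster (error, warn, or fall back to cluster means); the sketch simply takes the first row.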
Reprex:
Here is a quick reproducible example simulating unbalanced two-level data to demonstrate the difference between the weighted and unweighted statistics, alongside the current and proposed behavior.
``` r
library(dplyr)
library(datawizard)

# 1. Simulate unbalanced two-level data
set.seed(123)

df_long <-
  tibble(
    school = 1:5,
    funding = c(10, 20, 30, 40, 50),
    n_students = c(2, 5, 10, 50, 100)
  ) |>
  reframe(
    school = rep(school, n_students),
    funding = rep(funding, n_students)
  )

# 2. Show the difference in weighted and unweighted stats
w_stats <-
  df_long |>
  summarize(
    m_w = mean(funding),
    sd_w = sd(funding)
  )

uw_stats <-
  df_long |>
  distinct(school, funding) |>
  summarize(
    m_uw = mean(funding),
    sd_uw = sd(funding)
  )

stats <-
  bind_cols(w_stats, uw_stats) |>
  print()
#> # A tibble: 1 × 4
#>     m_w  sd_w  m_uw sd_uw
#>   <dbl> <dbl> <dbl> <dbl>
#> 1  44.4  8.33    30  15.8

# 3. Current and proposed datawizard centering behavior
df_long |>
  transmute(
    funding_c = center(funding),
    funding_c_w = funding - stats$m_w,
    funding_c_uw = funding - stats$m_uw
  )
#> # A tibble: 167 × 3
#>    funding_c  funding_c_w funding_c_uw
#>    <dw_trnsf>       <dbl>        <dbl>
#>  1 -34.43114        -34.4          -20
#>  2 -34.43114        -34.4          -20
#>  3 -24.43114        -24.4          -10
#>  4 -24.43114        -24.4          -10
#>  5 -24.43114        -24.4          -10
#>  6 -24.43114        -24.4          -10
#>  7 -24.43114        -24.4          -10
#>  8 -14.43114        -14.4            0
#>  9 -14.43114        -14.4            0
#> 10 -14.43114        -14.4            0
#> # ℹ 157 more rows

# 4. Current and proposed datawizard standardizing behavior
df_long |>
  transmute(
    funding_z = standardize(funding),
    funding_z_w = (funding - stats$m_w) / stats$sd_w,
    funding_z_uw = (funding - stats$m_uw) / stats$sd_uw
  )
#> # A tibble: 167 × 3
#>    funding_z  funding_z_w funding_z_uw
#>    <dw_trnsf>       <dbl>        <dbl>
#>  1 -4.132959        -4.13       -1.26
#>  2 -4.132959        -4.13       -1.26
#>  3 -2.932604        -2.93       -0.632
#>  4 -2.932604        -2.93       -0.632
#>  5 -2.932604        -2.93       -0.632
#>  6 -2.932604        -2.93       -0.632
#>  7 -2.932604        -2.93       -0.632
#>  8 -1.732249        -1.73        0
#>  9 -1.732249        -1.73        0
#> 10 -1.732249        -1.73        0
#> # ℹ 157 more rows
```

Created on 2026-03-09 with reprex v2.1.1