Add a `cluster` or `by` argument to `center()` and `standardize()` for unweighted Level 2 statistics #672

First, I want to say how much I appreciate the recent and exciting work going into `demean()` and `degroup()` for handling multilevel and panel data (as seen in #520, #637, and #639).

I have a related feature request regarding multilevel data, specifically pertaining to grand-mean centering and standardizing Level 2 cluster variables in unbalanced long-format datasets.

Currently, when `datawizard::center()` or `datawizard::standardize()` is applied to a Level 2 variable in long format, the mean and standard deviation are computed across all rows. In an unbalanced design where the cluster size $n_j$ varies, each cluster's value is repeated $n_j$ times, so the result is a mean and standard deviation implicitly weighted by cluster size. Clusters with more observations pull the statistics toward their own values.

In multilevel modeling, we typically prefer the unweighted grand mean and SD (computed from one value per cluster) when centering or standardizing Level 2 variables.
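To make the distinction concrete, here is the arithmetic in symbols. The notation is my own assumption for this issue: $x_j$ is the Level 2 value for cluster $j$, $J$ is the number of clusters, and $N = \sum_j n_j$ is the number of rows.

$$
\bar{x}_{w} = \frac{1}{N}\sum_{j=1}^{J} n_j x_j,
\qquad
s_{w} = \sqrt{\frac{1}{N-1}\sum_{j=1}^{J} n_j \left(x_j - \bar{x}_{w}\right)^2}
$$

$$
\bar{x}_{uw} = \frac{1}{J}\sum_{j=1}^{J} x_j,
\qquad
s_{uw} = \sqrt{\frac{1}{J-1}\sum_{j=1}^{J} \left(x_j - \bar{x}_{uw}\right)^2}
$$

With the five schools used in the reprex, $\bar{x}_{w} = 7420 / 167 \approx 44.4$, whereas $\bar{x}_{uw} = 150 / 5 = 30$.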

**Proposal:**
It would be incredibly helpful to add an argument like `cluster` (or `by`) to `center()` and `standardize()`. This would tell the function to compute the mean and SD from one value per cluster rather than from every row of the repeated vector, and then apply those unweighted statistics back to the long-format data.
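As a rough sketch of the intended behavior, here is a base R version. `standardize_l2()` is a hypothetical helper I made up for illustration, not an existing datawizard function, and the argument names are only placeholders for whatever the maintainers prefer.

``` r
# Hypothetical helper (not part of datawizard): sketches the proposed behavior.
standardize_l2 <- function(x, cluster, scale = TRUE) {
  # A Level 2 variable should be constant within each cluster
  n_values <- tapply(x, cluster, function(v) length(unique(v)))
  if (any(n_values > 1)) {
    stop("`x` varies within at least one cluster, so it is not a Level 2 variable.")
  }
  # Keep one value per cluster (the first row of each cluster)
  x_cluster <- x[!duplicated(cluster)]
  # Unweighted Level 2 statistics
  m <- mean(x_cluster, na.rm = TRUE)
  s <- if (scale) sd(x_cluster, na.rm = TRUE) else 1
  # Apply them back to the long-format vector
  (x - m) / s
}

# Same unbalanced design as in the reprex
n_students <- c(2, 5, 10, 50, 100)
school <- rep(1:5, times = n_students)
funding <- rep(c(10, 20, 30, 40, 50), times = n_students)

funding_c_uw <- standardize_l2(funding, school, scale = FALSE)
funding_z_uw <- standardize_l2(funding, school)
```

The sketch selects the first row of each cluster instead of calling `unique(x)`, so two clusters that happen to share the same value are both counted.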

**Reprex:**
Here is a quick reproducible example simulating unbalanced two-level data to demonstrate the difference between the weighted and unweighted statistics, alongside the current and proposed behavior.

``` r
library(dplyr)
library(datawizard)

# 1. Simulate unbalanced two-level data
set.seed(123)
df_long <- 
  tibble(
    school = 1:5,
    funding = c(10, 20, 30, 40, 50),
    n_students = c(2, 5, 10, 50, 100)
  ) |>
  reframe(
    school = rep(school, n_students),
    funding = rep(funding, n_students)
  )

# 2. Show the difference in weighted and unweighted stats
w_stats <- 
  df_long |>
  summarize(
    m_w = mean(funding),
    sd_w = sd(funding)
  )

uw_stats <- 
  df_long |>
  distinct(school, funding) |>
  summarize(
    m_uw = mean(funding),
    sd_uw = sd(funding)
  )

stats <- 
  bind_cols(w_stats, uw_stats) |> 
  print()
#> # A tibble: 1 × 4
#>     m_w  sd_w  m_uw sd_uw
#>   <dbl> <dbl> <dbl> <dbl>
#> 1  44.4  8.33    30  15.8

# 3. Current and proposed datawizard centering behavior
df_long |>
  transmute(
    funding_c = center(funding),
    funding_c_w = funding - stats$m_w,
    funding_c_uw = funding - stats$m_uw
  )
#> # A tibble: 167 × 3
#>    funding_c  funding_c_w funding_c_uw
#>    <dw_trnsf>       <dbl>        <dbl>
#>  1 -34.43114        -34.4          -20
#>  2 -34.43114        -34.4          -20
#>  3 -24.43114        -24.4          -10
#>  4 -24.43114        -24.4          -10
#>  5 -24.43114        -24.4          -10
#>  6 -24.43114        -24.4          -10
#>  7 -24.43114        -24.4          -10
#>  8 -14.43114        -14.4            0
#>  9 -14.43114        -14.4            0
#> 10 -14.43114        -14.4            0
#> # ℹ 157 more rows

# 4. Current and proposed datawizard standardizing behavior
df_long |> 
  transmute(
    funding_z = standardize(funding),
    funding_z_w = (funding - stats$m_w) / stats$sd_w,
    funding_z_uw = (funding - stats$m_uw) / stats$sd_uw
  )
#> # A tibble: 167 × 3
#>    funding_z  funding_z_w funding_z_uw
#>    <dw_trnsf>       <dbl>        <dbl>
#>  1 -4.132959        -4.13       -1.26 
#>  2 -4.132959        -4.13       -1.26 
#>  3 -2.932604        -2.93       -0.632
#>  4 -2.932604        -2.93       -0.632
#>  5 -2.932604        -2.93       -0.632
#>  6 -2.932604        -2.93       -0.632
#>  7 -2.932604        -2.93       -0.632
#>  8 -1.732249        -1.73        0    
#>  9 -1.732249        -1.73        0    
#> 10 -1.732249        -1.73        0    
#> # ℹ 157 more rows
```

<sup>Created on 2026-03-09 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

