Implementing various doc and refactoring changes for the new ApplyToCols #1962
rcap107 merged 42 commits into skrub-data:main from
Conversation
I discussed this PR IRL with @jeromedockes. There are some leftover points to address:
jeromedockes left a comment
thanks for the effort @rcap107 , the weird nature of applytocols makes the docstring hard to write. I think we should assume that most users don't care about the distinction between the 2 modes and the resulting different column order and different fitted attributes or the parameters that are only for the on each column mode (that is the premise of creating this class), so maybe anything that is related to that should be moved toward the end of its section 🤔
or maybe after all we should have kept the other 2 but this one should be stripped down to expose only what is in common (eg not the transformers) and have a super simple docstring
    -:class:`ApplyToCols` allows flexible manipulation of dataframes by automatically
    +|ApplyToCols| allows flexible manipulation of dataframes by automatically
not introduced by this PR but I think it should be fixed before the release: the content is out of order here. AFAICT this is the first time we are hearing about ApplyToCols in the user guide and we start with column rejection before an introduction to the class in general and basic usage. We have a link to basic usage but it should come before the advanced topic not after
I was thinking of adding a subsection in the introduction of the main features of skrub that mentions the basic use case of ApplyToCols
I moved the file about ApplyToCols to the section on default wrangling so now it's at the start of the user guide
    1 10.0 0.0 Rome 2024-05-15 13:46:02

    By default, the same transformer is applied to all selected columns. It is
here also I don't think 'by default' is the right way to think about it
also we are emphasizing this distinction too much. the first example should be just about 'look, this transformer was applied to some columns and the others were kept unchanged' -- and probably it should be a single-column transformer as that is the most common use case. next we can point out: if we have a transformer for single columns (most skrub encoders), like the DatetimeEncoder, a separate clone is applied to each column and you can find them in transformers_. for transformers that take 2d inputs (most scikit-learn transformers), all the selected columns are passed as a single dataframe to one transformer, and it is available as transformer_. the selection of the correct mode is done automatically by inspecting the transformer, but you can force it with the how parameter
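To make the two modes concrete, here is a rough sketch using plain pandas/scikit-learn; it is not skrub's actual implementation, just an illustration of "a clone per column" versus "one transformer for the whole sub-frame", with unselected columns passed through:

```python
# Illustrative sketch only; the transformers_/transformer_ names mirror the
# fitted attributes discussed above, not skrub's real code.
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0], "city": ["Rome", "Oslo", "Lima"]}
)
cols = ["A", "B"]  # selected columns; "city" is left untouched

# "each column" mode: a separate fitted clone per column (-> transformers_)
transformers_ = {c: clone(StandardScaler()).fit(df[[c]]) for c in cols}

# "sub-frame" mode: all selected columns go to one transformer (-> transformer_)
transformer_ = StandardScaler().fit(df[cols])

out = df.copy()
out[cols] = transformer_.transform(df[cols])
# out["city"] is identical to df["city"]: the unselected column passes through
```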
        return self._wrapped_transformer.get_feature_names_out(input_features)

    def __getattr__(self, name):
        if name == "transformers_":
    -    if name == "transformers_":
    +    if name == "transformers_" and isinstance(getattr(self, "_wrapped_transformer", None), ApplyToSubFrame):
we need to check that this is actually the reason. for example, it could be that the applytocols is not fitted
can I check that with `check_is_fitted(self, "transformer_")` or should I do it in some other way?
that will raise a NotFitted exception that you would have to catch, so maybe not the most convenient; just getattr should be easier I think
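A minimal sketch of the getattr-based guard being suggested; the class and attribute names here are hypothetical stand-ins, not skrub's real code:

```python
# Sketch: raise a helpful AttributeError for `transformers_` only when we can
# tell the estimator was actually fitted in the other ("sub-frame") mode.
class ApplyToColsSketch:
    def fit(self, wrapped):
        # in "sub-frame" mode a single fitted transformer is stored
        self._wrapped_transformer = wrapped
        self.transformer_ = wrapped
        return self

    def __getattr__(self, name):
        # read __dict__ directly to avoid recursing back into __getattr__
        if name == "transformers_" and "_wrapped_transformer" in self.__dict__:
            raise AttributeError(
                "This transformer was fitted in sub-frame mode; "
                "use `transformer_` instead of `transformers_`."
            )
        # not fitted, or a genuinely unknown attribute: standard error
        raise AttributeError(name)
```

Before `fit`, accessing `transformers_` falls through to the standard `AttributeError`, which matches the "it could be that the applytocols is not fitted" concern.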
we could also try to come up with a shorter phrase than "if transformer is a single-column transformer or how='cols'" to refer to the 2 modes of operation; it would help if we had a short way of saying 'it depends on the effective how'. we can have a fitted attribute
jeromedockes left a comment
thanks a lot @rcap107 . I think it is much simpler and quite understandable now, we're down to details
    The names of columns in the output dataframe that were created by one
    of the fitted transformers.

    Other Attributes
I like the idea but it seems sphinx simply omits the whole section from the output. let's just remove the heading, it's ok; it's already a bit better now that the condition is simplified without the how option
    be used with ``ApplyToCols``. For example, to apply a
    :class:`~sklearn.preprocessing.StandardScaler` to the numeric columns:

    >>> scaler = ApplyToCols(StandardScaler(), cols=["A", "B"])
can we use a PCA instead? eg having 3 numeric columns in the input and doing a PCA(n_components=2). I think it makes the distinction clearer
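The reviewer's point can be seen directly in scikit-learn, independent of skrub: with `PCA(n_components=2)` on three numeric columns, the columns must be seen together, so the output shape makes the "one transformer for the whole sub-frame" behaviour obvious:

```python
# PCA collapses 3 input columns into 2 output columns -- something a
# per-column transformer could never do, which is why it illustrates the
# sub-frame mode better than a StandardScaler.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 9.0],
              [4.0, 8.0, 12.0]])
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (4, 2): three columns in, two columns out
```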
| """ | ||
|
|
||
| check_is_fitted(self) | ||
| check_is_fitted( |
why do we need this change? I think the old check_is_fitted(self) should be enough
without this, `test_check_is_fitted_missing_fitted_attribute_transform` is failing because it's not raising the NotFitted exception, and I'm not sure what is going on that breaks it
i don't understand the test. we just need to test:
- don't fit and call transform -> not fitted error
- fit with single column transformer and access transformer_ -> custom message
- fit with regular transformer and access transformers_ -> custom message
- access some_other_attribute -> standard attributeerror message
check_is_fitted will just look for the presence of the attribute so check_is_fitted(transformer if hasattr transformer) does not seem like it should be there
    Transformed feature names.
    """
    check_is_fitted(self)
    check_is_fitted(
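Since the thread hinges on what `check_is_fitted` actually does, a quick demonstration: it simply looks for attributes ending in `_` that are set during `fit`, which is why the plain `check_is_fitted(self)` call should be enough here:

```python
# check_is_fitted raises NotFittedError when no trailing-underscore fitted
# attributes exist yet, and passes silently once fit() has set them.
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

scaler = StandardScaler()
raised = False
try:
    check_is_fitted(scaler)  # unfitted: no mean_, scale_, ... yet
except NotFittedError:
    raised = True

scaler.fit([[0.0], [2.0]])
check_is_fitted(scaler)  # passes now: mean_, scale_, ... exist
```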
    .. tip::

    All multi-column transformers provided by skrub can take skrub selectors as
many rather than all? tablevectorizer and many others don't have that parameter
or maybe the tip could be when a skrub transformer has a cols parameter to specify a column list, that can be a selector as well
    We use the |s.numeric| and |s.string| selectors to choose the respective columns:

    .. admonition:: Why ``sparse_output=False``?
can we use an OrdinalEncoder to avoid this discussion? unless you think this is the best place to include it.
also doing both seems redundant: I think we can remove the StandardScaler part
it's probably not the best place to do this, we should still have it somewhere, not sure where though
the reason for having the numeric encoder was to show that it's possible to put the transformers generated with applytocols in a pipeline, but I did not explain that well.
> it's probably not the best place to do this, we should still have it somewhere, not sure where though

maybe one of the howto / faq / explain this error thingies when we have them?
> the reason for having the numeric encoder was to show that it's possible to put the transformers generated with ApplyToCols in a pipeline, but I did not explain that well.

yes, but all transformers can be used in a pipeline; that's what the pipeline is for. so I don't think we need to repeat that here and we can keep it short instead. WDYT?
I think it's better to make it clear that combining transformers should be done like

    apply1 = ApplyToCols(transformer1())
    apply2 = ApplyToCols(transformer2())
    make_pipeline(apply1, apply2)

instead of

    pipe = make_pipeline(transformer1(), transformer2())
    ApplyToCols(pipe)
yeah I think it makes sense to have one entry that explains the sparse error, since invariably someone will smack into it
    .. _user_guide_multiple_columns:

    Operating over multiple columns at once with |ApplyToCols|
here also I don't get the "at once" part, rather transforming only some of the columns or something like that
"Transforming selected columns", "Transforming specific columns"?
yep either of those works!
    that is capable of handling this exception.

    Hence the ``SingleColumnTransformer`` class. It is originally a base class from which many transformers are
    inherited, but it can also be used to create new transformers. As long we specify
I don't understand this sentence
    For instance, we might want to create a custom transformer specialized in parsing zip codes of a certain
    format, that returns |RejectColumn| with a custom warning when the length of the provided
    zip code is incorrect:
we can reiterate here that the goal is that in this case ApplyToCols will pass the column through without transformation
    >>> ZipcodeParser().fit_transform(df["received"])
    Traceback (most recent call last):
    ...
    skrub._single_column_transformer.RejectColumn: This transformer only takes zip codes of length 5.
we show the exception being raised here but later we should show it resulting in the column being passed through. in the example the datetimeencoder can be replaced with the zipcode one
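A plain-Python sketch of that reject-then-passthrough mechanism; `RejectColumn`, `ZipcodeParser` and the wrapper function are simplified stand-ins for illustration, not skrub's actual classes:

```python
# Sketch: a single-column transformer raises RejectColumn for columns it
# cannot handle, and an ApplyToCols-style wrapper catches the rejection and
# returns the column unchanged.
import pandas as pd

class RejectColumn(Exception):
    """Signal that a column cannot be handled by this transformer."""

class ZipcodeParser:
    def fit_transform(self, col: pd.Series) -> pd.Series:
        if not col.astype(str).str.len().eq(5).all():
            raise RejectColumn("This transformer only takes zip codes of length 5.")
        return col.astype(str)

def apply_or_passthrough(transformer, col):
    try:
        return transformer.fit_transform(col)
    except RejectColumn:
        return col  # rejected: column passes through untransformed

good = pd.Series(["12345", "54321"])
bad = pd.Series(["1234", "54321"])  # one zip code has the wrong length
assert apply_or_passthrough(ZipcodeParser(), bad).equals(bad)
```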
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
Closes #1926