Skip to content

Implementing various doc and refactoring changes for the new ApplyToCols#1962

Merged
rcap107 merged 42 commits intoskrub-data:mainfrom
rcap107:leftover-fixes-applytocols
Mar 24, 2026
Merged

Implementing various doc and refactoring changes for the new ApplyToCols#1962
rcap107 merged 42 commits intoskrub-data:mainfrom
rcap107:leftover-fixes-applytocols

Conversation

@rcap107
Copy link
Copy Markdown
Member

@rcap107 rcap107 commented Mar 13, 2026

Closes #1926

@rcap107 rcap107 changed the title updating docs Implementing various doc and refactoring changes for the new ApplyToCols Mar 13, 2026
@rcap107
Copy link
Copy Markdown
Member Author

rcap107 commented Mar 16, 2026

I discussed this PR IRL with @jeromedockes. There are some leftover points to address:

  • we want to make ApplyToEachCol and ApplyToSubFrame private, so that users only need to learn about ApplyToCols and not worry about the difference between the two.
  • this means that the .skb.apply method needs to be fixed so that it returns ApplyToCols as estimator rather than one of the other two (but this will be done in a bug fix release)
  • the documentation should be simplified: the point of this change was to avoid that users would have to understand the difference between the two Apply* objects, so we should avoid having to explain that part for most examples (it would defeat the point)

Copy link
Copy Markdown
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the effort @rcap107 , the weird nature of applytocols makes the docstring hard to write. I think we should assume that most users don't care about the distinction between the 2 modes and the resulting different column order and different fitted attributes or the parameters that are only for the on each column mode (that is the premise of creating this class), so maybe anything that is related to that should be moved toward the end of its section 🤔

or maybe after all we should have kept the other 2 but this one should be stripped down to expose only what is in common (eg not the transformers) and have a super simple docstring

Comment thread CHANGES.rst Outdated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`ApplyToCols` allows flexible manipulation of dataframes by automatically
|ApplyToCols| allows flexible manipulation of dataframes by automatically
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not introduced by this PR but I think it should be fixed before the release: the content is out of order here. AFAICT this is the first time we are hearing about ApplyToCols in the user guide and we start with column rejection before an introduction to the class in general and basic usage. We have a link to basic usage but it should come before the advanced topic not after

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of adding a subsection in the introduction of the main features of skrub that mentions the basic use case of ApplyToCols

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the file about ApplyToCols to the section on default wrangling so now it's at the start of the user guide

Comment thread doc/modules/column_level_featurizing/advanced_columnwise_operations.rst Outdated
Comment thread doc/modules/column_level_featurizing/advanced_columnwise_operations.rst Outdated
Comment thread doc/modules/default_wrangling/apply_to_cols.rst
Comment thread skrub/_apply_to_cols.py Outdated
Comment thread skrub/_apply_to_cols.py
Comment thread skrub/_apply_to_cols.py Outdated
1 10.0 0.0 Rome 2024-05-15 13:46:02


By default, the same transformer is applied to all selected columns. It is
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also I don't think 'by default' is the right way to think about it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we are emphasizing too much this distinction, the first example should be just about 'look this transformer was applied to some columns and the others were kept unchanged' -- and propably it should be a single column transformer as that is the most common use case. next we can point out: if we have a transformer for single columns (most skrub encoders) like the DatetimeEncoder, a separate clone is applied to each column and you can find them in transformers_. for transformers that take 2d inputs (most scikt-learn transformers), all the selected columns are passed as a single dataframe to one transformer, and it is available as transformer_. the selection of the correct mode is done automatically by inspecting the transformer but you can force it with the how parameter

Comment thread skrub/_apply_to_cols.py Outdated
return self._wrapped_transformer.get_feature_names_out(input_features)

def __getattr__(self, name):
if name == "transformers_":
Copy link
Copy Markdown
Member

@jeromedockes jeromedockes Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name == "transformers_":
if name == "transformers_" and isinstance(getattr(self, "_wrapped_transformer", None), ApplyToSubFrame)

we need to check that this is actually the reason. for example, it could be that the applytocols is not fitted

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can I check that with

check_is_fitted(self, "transformer_")

or should I do it in some other way

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that will raise a notfitted exception that you would have to catch so maybe not the most convenient, just getattr should be easier I think

Comment thread skrub/_apply_to_cols.py Outdated
@jeromedockes
Copy link
Copy Markdown
Member

we could also try to come up with a shorter phrase than "if transformer is a single-column transformer or how='cols'" to refer to the 2 modes of operation, it would help if we have a short way of saying 'it depends on the effective how'. we can have a fitted attribute how_ that is either 'cols' or 'frame', or maybe say something like 'in columnwise mode' vs 'in single-frame mode', ... finding some way to distinguish the parts that are in common from those that differ clearly, and to streamline the discussion of the parts that are different

@rcap107 rcap107 added this to the Release 0.8.0 milestone Mar 18, 2026
@rcap107 rcap107 marked this pull request as ready for review March 20, 2026 10:28
Copy link
Copy Markdown
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @rcap107 . I think it is much simpler and quite understandable now, we're down to details

Comment thread skrub/_apply_to_cols.py
Comment thread skrub/_apply_to_cols.py Outdated
The names of columns in the output dataframe that were created by one
of the fitted transformers.

Other Attributes
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea but it seems sphinx simply omits the whole section from the output. let's just remove the heading it's ok, already a bit better now that the condition is simplified without the how option

Comment thread skrub/_apply_to_cols.py Outdated
be used with ``ApplyToCols``. For example, to apply a
:class:`~sklearn.preprocessing.StandardScaler` to the numeric columns:

>>> scaler = ApplyToCols(StandardScaler(), cols=["A", "B"])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a PCA instead? eg having 3 numeric columns in the input and doing a PCA(n_components=2). I think it makes the distinction clearer

Comment thread skrub/_apply_to_cols.py Outdated
"""

check_is_fitted(self)
check_is_fitted(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this change? I think the old check_is_fitted(self) should be enough

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this,test_check_is_fitted_missing_fitted_attribute_transform is failing because it's not raising the notfitted exception, and I'm not sure what is going on to break that

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand the test. we just need to test:

  • don't fit and call transform -> not fitted error
  • fit with single column transformer and access transformer_ -> custom message
  • fit with regular transformer and access transformers_ -> custom message
  • access some_other_attribute -> standard attributeerror message

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_is_fitted will just look for the presence of the attribute so check_is_fitted(transformer if hasattr transformer) does not seem like it should be there

Comment thread skrub/_apply_to_cols.py Outdated
Transformed feature names.
"""
check_is_fitted(self)
check_is_fitted(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment thread doc/modules/default_wrangling/apply_to_cols.rst Outdated

.. tip::

All multi-column transformers provided by skrub can take skrub selectors as
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many rather than all? tablevectorizer and many others don't have that parameter
or maybe the tip could be when a skrub transformer has a cols parameter to specify a column list, that can be a selector as well


We use the |s.numeric| and |s.string| selectors to choose the respective columns:

.. admonition:: Why ``sparse_output=False``?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a ordinalencoder to avoid this discussion? unless you think this is the best place to include it.

also doing the 2 seems redundant: I think we can remove the standardscaler part

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably not the best place to do this, we should still have it somewhere, not sure where though

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for having the numeric encoder was to show that it's possible to put the transformers generated with applytocols in a pipeline, but I did not explain that well.

Comment thread doc/modules/default_wrangling/apply_to_cols.rst Outdated
Comment thread doc/modules/default_wrangling/apply_to_cols.rst Outdated
@jeromedockes
Copy link
Copy Markdown
Member

jeromedockes commented Mar 20, 2026 via email

@jeromedockes
Copy link
Copy Markdown
Member

jeromedockes commented Mar 20, 2026 via email

@rcap107
Copy link
Copy Markdown
Member Author

rcap107 commented Mar 20, 2026

the reason for having the numeric encoder was to show that it's possible to put the transformers generated with applytocols in a pipeline, but I did not explain that well.
yes but all transformers can be used in a pipeline it's what the pipeline is for. so I don't think we need to repeat that here and we can keep it short instead WDYT

I think it's better to make it clear that combining transformers should be done like

apply1  = ApplyToCols(transformer1())
apply2 = ApplyToCols(transformer2())
make_pipeline(apply1, apply2)

instead of

pipe = make_pipeline(transformer1(), transformer2())
ApplyToCols(pipe)

@rcap107
Copy link
Copy Markdown
Member Author

rcap107 commented Mar 20, 2026

+.. admonition:: Why sparse_output=False? it's probably not the best place to do this, we should still have it somewhere, not sure where though
maybe one of the howto / faq / explain this error thingies when we have them?

yeah I think it makes sense to have one entry that explains the sparse error when invariably someone will smack into that

Comment thread skrub/_apply_to_cols.py Outdated

.. _user_guide_multiple_columns:

Operating over multiple columns at once with |ApplyToCols|
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also I don't get the "at once" part, rather transforming only some of the columns or something like that

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Transforming selected columns", "Transforming specific columns"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep either of those works!

Comment thread doc/modules/column_level_featurizing/advanced_columnwise_operations.rst Outdated
that is capable of handling this exception.

Hence the ``SingleColumnTransformer`` class. It is originally a base class from which many transformers are
inherited, but it can also be used to create new transformers. As long we specify
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this sentence

Comment thread doc/modules/column_level_featurizing/advanced_columnwise_operations.rst Outdated

For instance, we might want to create a custom transformer specialized in parsing zip codes of a certain
format, that returns |RejectColumn| with a custom warning when the length of the provided
zip code is incorrect:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can reiterate here that the goal is that in this case ApplyToCols will pass the column through without transformation

>>> ZipcodeParser().fit_transform(df["received"])
Traceback (most recent call last):
...
skrub._single_column_transformer.RejectColumn: This transformer only takes zip codes of length 5.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we show the exception being raised here but later we should show it resulting in the column being passed through. in the example the datetimeencoder can be replaced with the zipcode one

Copy link
Copy Markdown
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @rcap107

@rcap107 rcap107 merged commit 6a97b06 into skrub-data:main Mar 24, 2026
29 checks passed
@rcap107 rcap107 deleted the leftover-fixes-applytocols branch March 24, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Leftover fixes for ApplyToCols

2 participants