Implementing various doc and refactoring changes for the new ApplyToCols #1962
rcap107 merged 42 commits into skrub-data:main from
Conversation
I discussed this PR IRL with @jeromedockes. There are some leftover points to address:
jeromedockes left a comment
thanks for the effort @rcap107 , the weird nature of applytocols makes the docstring hard to write. I think we should assume that most users don't care about the distinction between the 2 modes and the resulting different column order and different fitted attributes or the parameters that are only for the on each column mode (that is the premise of creating this class), so maybe anything that is related to that should be moved toward the end of its section 🤔
or maybe after all we should have kept the other 2 but this one should be stripped down to expose only what is in common (eg not the transformers) and have a super simple docstring
    -:class:`ApplyToCols` allows flexible manipulation of dataframes by automatically
    +|ApplyToCols| allows flexible manipulation of dataframes by automatically
not introduced by this PR but I think it should be fixed before the release: the content is out of order here. AFAICT this is the first time we are hearing about ApplyToCols in the user guide and we start with column rejection before an introduction to the class in general and basic usage. We have a link to basic usage but it should come before the advanced topic not after
I was thinking of adding a subsection in the introduction of the main features of skrub that mentions the basic use case of ApplyToCols
I moved the file about ApplyToCols to the section on default wrangling so now it's at the start of the user guide
    1 10.0 0.0 Rome 2024-05-15 13:46:02

    By default, the same transformer is applied to all selected columns. It is
here also I don't think 'by default' is the right way to think about it
also we are emphasizing this distinction too much. the first example should be just about 'look, this transformer was applied to some columns and the others were kept unchanged' -- and probably it should be a single-column transformer as that is the most common use case. next we can point out: if we have a transformer for single columns (most skrub encoders), like the DatetimeEncoder, a separate clone is applied to each column and you can find them in transformers_. for transformers that take 2d inputs (most scikit-learn transformers), all the selected columns are passed as a single dataframe to one transformer, and it is available as transformer_. the selection of the correct mode is done automatically by inspecting the transformer, but you can force it with the how parameter
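To make the two modes concrete, here is a rough sketch using plain pandas/scikit-learn; it is not skrub's actual implementation, just an illustration of "a clone per column" versus "one transformer for the whole sub-frame", with unselected columns passed through:

```python
# Illustrative sketch only; the transformers_/transformer_ names mirror the
# fitted attributes discussed above, not skrub's real code.
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0], "city": ["Rome", "Oslo", "Lima"]}
)
cols = ["A", "B"]  # selected columns; "city" is left untouched

# "each column" mode: a separate fitted clone per column (-> transformers_)
transformers_ = {c: clone(StandardScaler()).fit(df[[c]]) for c in cols}

# "sub-frame" mode: all selected columns go to one transformer (-> transformer_)
transformer_ = StandardScaler().fit(df[cols])

out = df.copy()
out[cols] = transformer_.transform(df[cols])
# out["city"] is identical to df["city"]: the unselected column passes through
```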
        return self._wrapped_transformer.get_feature_names_out(input_features)

    def __getattr__(self, name):
        if name == "transformers_":
    -    if name == "transformers_":
    +    if name == "transformers_" and isinstance(getattr(self, "_wrapped_transformer", None), ApplyToSubFrame):
we need to check that this is actually the reason. for example, it could be that the applytocols is not fitted
can I check that with `check_is_fitted(self, "transformer_")` or should I do it in some other way?
that will raise a NotFitted exception that you would have to catch, so maybe not the most convenient; just getattr should be easier I think
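A minimal sketch of the getattr-based guard being suggested; the class and attribute names here are hypothetical stand-ins, not skrub's real code:

```python
# Sketch: raise a helpful AttributeError for `transformers_` only when we can
# tell the estimator was actually fitted in the other ("sub-frame") mode.
class ApplyToColsSketch:
    def fit(self, wrapped):
        # in "sub-frame" mode a single fitted transformer is stored
        self._wrapped_transformer = wrapped
        self.transformer_ = wrapped
        return self

    def __getattr__(self, name):
        # read __dict__ directly to avoid recursing back into __getattr__
        if name == "transformers_" and "_wrapped_transformer" in self.__dict__:
            raise AttributeError(
                "This transformer was fitted in sub-frame mode; "
                "use `transformer_` instead of `transformers_`."
            )
        # not fitted, or a genuinely unknown attribute: standard error
        raise AttributeError(name)
```

Before `fit`, accessing `transformers_` falls through to the standard `AttributeError`, which matches the "it could be that the applytocols is not fitted" concern.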
we could also try to come up with a shorter phrase than "if transformer is a single-column transformer or how='cols'" to refer to the 2 modes of operation; it would help if we had a short way of saying 'it depends on the effective how'. we can have a fitted attribute
jeromedockes left a comment
thanks a lot @rcap107 . I think it is much simpler and quite understandable now, we're down to details
    The names of columns in the output dataframe that were created by one
    of the fitted transformers.

    Other Attributes
I like the idea but it seems sphinx simply omits the whole section from the output. let's just remove the heading, it's ok; it's already a bit better now that the condition is simplified without the how option
    be used with ``ApplyToCols``. For example, to apply a
    :class:`~sklearn.preprocessing.StandardScaler` to the numeric columns:

    >>> scaler = ApplyToCols(StandardScaler(), cols=["A", "B"])
can we use a PCA instead? eg having 3 numeric columns in the input and doing a PCA(n_components=2). I think it makes the distinction clearer
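The reviewer's point can be seen directly in scikit-learn, independent of skrub: with `PCA(n_components=2)` on three numeric columns, the columns must be seen together, so the output shape makes the "one transformer for the whole sub-frame" behaviour obvious:

```python
# PCA collapses 3 input columns into 2 output columns -- something a
# per-column transformer could never do, which is why it illustrates the
# sub-frame mode better than a StandardScaler.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 9.0],
              [4.0, 8.0, 12.0]])
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (4, 2): three columns in, two columns out
```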
| """ | ||
|
|
||
| check_is_fitted(self) | ||
| check_is_fitted( |
why do we need this change? I think the old check_is_fitted(self) should be enough
without this, `test_check_is_fitted_missing_fitted_attribute_transform` is failing because it's not raising the NotFitted exception, and I'm not sure what is going on that breaks it
i don't understand the test. we just need to test:
- don't fit and call transform -> not fitted error
- fit with single column transformer and access transformer_ -> custom message
- fit with regular transformer and access transformers_ -> custom message
- access some_other_attribute -> standard attributeerror message
check_is_fitted will just look for the presence of the attribute so check_is_fitted(transformer if hasattr transformer) does not seem like it should be there
    Transformed feature names.
    """
    check_is_fitted(self)
    check_is_fitted(
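Since the thread hinges on what `check_is_fitted` actually does, a quick demonstration: it simply looks for attributes ending in `_` that are set during `fit`, which is why the plain `check_is_fitted(self)` call should be enough here:

```python
# check_is_fitted raises NotFittedError when no trailing-underscore fitted
# attributes exist yet, and passes silently once fit() has set them.
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

scaler = StandardScaler()
raised = False
try:
    check_is_fitted(scaler)  # unfitted: no mean_, scale_, ... yet
except NotFittedError:
    raised = True

scaler.fit([[0.0], [2.0]])
check_is_fitted(scaler)  # passes now: mean_, scale_, ... exist
```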
    .. tip::

    All multi-column transformers provided by skrub can take skrub selectors as
many rather than all? tablevectorizer and many others don't have that parameter
or maybe the tip could be when a skrub transformer has a cols parameter to specify a column list, that can be a selector as well
    We use the |s.numeric| and |s.string| selectors to choose the respective columns:

    .. admonition:: Why ``sparse_output=False``?
can we use an OrdinalEncoder to avoid this discussion? unless you think this is the best place to include it.
also doing both seems redundant: I think we can remove the StandardScaler part
it's probably not the best place to do this, we should still have it somewhere, not sure where though
the reason for having the numeric encoder was to show that it's possible to put the transformers generated with applytocols in a pipeline, but I did not explain that well.
> it's probably not the best place to do this, we should still have it somewhere, not sure where though

maybe one of the howto / faq / explain this error thingies when we have them?
> the reason for having the numeric encoder was to show that it's possible to put the transformers generated with ApplyToCols in a pipeline, but I did not explain that well.

yes, but all transformers can be used in a pipeline; that's what the pipeline is for. so I don't think we need to repeat that here and we can keep it short instead. WDYT?
I think it's better to make it clear that combining transformers should be done like

    apply1 = ApplyToCols(transformer1())
    apply2 = ApplyToCols(transformer2())
    make_pipeline(apply1, apply2)

instead of

    pipe = make_pipeline(transformer1(), transformer2())
    ApplyToCols(pipe)
yeah I think it makes sense to have one entry that explains the sparse error, since invariably someone will smack into it
    .. _user_guide_multiple_columns:

    Operating over multiple columns at once with |ApplyToCols|
here also I don't get the "at once" part, rather transforming only some of the columns or something like that
"Transforming selected columns", "Transforming specific columns"?
yep either of those works!
    that is capable of handling this exception.

    Hence the ``SingleColumnTransformer`` class. It is originally a base class from which many transformers are
    inherited, but it can also be used to create new transformers. As long we specify
I don't understand this sentence
    For instance, we might want to create a custom transformer specialized in parsing zip codes of a certain
    format, that returns |RejectColumn| with a custom warning when the length of the provided
    zip code is incorrect:
we can reiterate here that the goal is that in this case ApplyToCols will pass the column through without transformation
    >>> ZipcodeParser().fit_transform(df["received"])
    Traceback (most recent call last):
    ...
    skrub._single_column_transformer.RejectColumn: This transformer only takes zip codes of length 5.
we show the exception being raised here but later we should show it resulting in the column being passed through. in the example the datetimeencoder can be replaced with the zipcode one
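A plain-Python sketch of that reject-then-passthrough mechanism; `RejectColumn`, `ZipcodeParser` and the wrapper function are simplified stand-ins for illustration, not skrub's actual classes:

```python
# Sketch: a single-column transformer raises RejectColumn for columns it
# cannot handle, and an ApplyToCols-style wrapper catches the rejection and
# returns the column unchanged.
import pandas as pd

class RejectColumn(Exception):
    """Signal that a column cannot be handled by this transformer."""

class ZipcodeParser:
    def fit_transform(self, col: pd.Series) -> pd.Series:
        if not col.astype(str).str.len().eq(5).all():
            raise RejectColumn("This transformer only takes zip codes of length 5.")
        return col.astype(str)

def apply_or_passthrough(transformer, col):
    try:
        return transformer.fit_transform(col)
    except RejectColumn:
        return col  # rejected: column passes through untransformed

good = pd.Series(["12345", "54321"])
bad = pd.Series(["1234", "54321"])  # one zip code has the wrong length
assert apply_or_passthrough(ZipcodeParser(), bad).equals(bad)
```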
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
Closes #1926