recipipe package¶
Improved pipelines for data science projects.
SKLearn pipelines that are easy to declare and Pandas-compatible.
recipipe aliases¶
The recipipe package contains many aliases for ease of use. Here is a complete table with all the aliases.
Alias | Definition |
recipipe | |
astype | |
category | |
concat | |
extract | |
groupby | |
select | |
drop | |
dropna | |
dropna_rows | |
query | |
replace | |
reduce_memory | |
sum | |
target_encoder | |
from_sklearn | |
binarizer | from_sklearn(Binarizer()) |
impute | from_sklearn(SimpleImputer(strategy='constant')) |
indicator | from_sklearn(MissingIndicator(), col_format='INDICATOR({})') |
minmax | from_sklearn(MinMaxScaler()) |
onehot | from_sklearn(OneHotEncoder(sparse=False, handle_unknown='ignore')) |
robust_scale | from_sklearn(RobustScaler()) |
scale | from_sklearn(StandardScaler()) |
standarize | from_sklearn(StandardScaler()) |
sk_binarizer | |
sk_indicator | |
sk_inputer | |
sk_onehot | |
sk_scale | |
recipipe.core module¶
-
class
recipipe.core.
Recipipe
(steps=None, **kwargs)[source]¶ Bases:
sklearn.pipeline.Pipeline
Recipipe pipeline.
A Recipipe pipeline is an extension of an SKLearn pipeline. It adds some functionality that makes the creation of pipelines less painful. For example, the steps param is not required at construction time. You can add your transformers to the pipeline at any time using recipipe.core.Recipipe.add.
- Attr:
Same attributes as sklearn.pipeline.Pipeline.
-
__abstractmethods__
= frozenset({})¶
-
__add__
(transformer)[source]¶ Add a new step to the pipeline using the ‘+’ operator.
Note that this is exactly the same as calling recipipe.core.Recipipe.add. The Recipipe object is modified in place, that is, p = p + t is the same as p + t, where p is any Recipipe pipeline and t is any transformer.
See also
-
__init__
(steps=None, **kwargs)[source]¶ Create a Recipipe pipeline.
- Parameters
steps (list) – Same as in sklearn.pipeline.Pipeline.
kwargs – Same as in sklearn.pipeline.Pipeline: memory and verbose.
-
__module__
= 'recipipe.core'¶
-
add
(step)[source]¶ Add a new step to the pipeline.
You can add steps even if the pipeline is already fitted, so be careful.
- Parameters
step (Transformer or tuple(str, Transformer)) – The new step that you want to add to the pipeline. Any transformer is valid (an SKLearn transformer or a recipipe.core.RecipipeTransformer). If a tuple is given, the first element of the tuple is used as the name of the step in the pipeline.
- Returns
The pipeline. You can chain add methods: pipe.add(…).add(…)….
See also
-
steps
= None¶
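The behaviour described above for add and the '+' operator (optional steps, in-place modification, chaining) can be pictured with a toy class. This is only a sketch under stated assumptions: MiniPipe and its automatic step naming are hypothetical and are not recipipe's implementation.

```python
class MiniPipe:
    """Toy stand-in for a pipeline; NOT recipipe's implementation."""

    def __init__(self, steps=None):
        # Like Recipipe, steps are optional at construction time.
        self.steps = list(steps) if steps else []

    def add(self, step):
        # Accept either a bare transformer or a (name, transformer) tuple.
        if isinstance(step, tuple):
            name, transformer = step
        else:
            # Hypothetical auto-naming scheme, chosen only for this sketch.
            name, transformer = f"step{len(self.steps)}", step
        self.steps.append((name, transformer))
        return self  # returning self is what makes chaining possible

    def __add__(self, transformer):
        # '+' delegates to add, so the pipeline itself is modified.
        return self.add(transformer)


p = MiniPipe()
p.add(("lower", str.lower)).add(str.strip)  # chained add calls
same = p + str.title  # p is modified in place and returned
step_names = [name for name, _ in p.steps]
```

Because add returns the pipeline itself, `p + t` and `p = p + t` leave `p` in the same state, which is the property the documentation calls out.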
-
class
recipipe.core.
RecipipeTransformer
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Bases:
sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Base class of all Recipipe transformers.
-
__init__
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
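As a rough illustration of the column selection and naming rules listed above, the following sketch expands fnmatch patterns, applies an exclusion afterwards, and formats output names with col_format. The helper resolve_columns and the column names are hypothetical; whether recipipe also treats exclude entries as patterns, and the order in which it returns columns, are assumptions of this sketch.

```python
from fnmatch import fnmatchcase


def resolve_columns(df_columns, patterns, exclude=()):
    """Hypothetical helper: expand fnmatch patterns against column names."""
    selected = []
    for pattern in patterns:
        for col in df_columns:
            if fnmatchcase(col, pattern) and col not in selected:
                selected.append(col)
    # The exclusion is applied after the patterns have been resolved.
    return [c for c in selected
            if not any(fnmatchcase(c, e) for e in exclude)]


columns = ["age", "price_usd", "price_eur", "name"]
cols = resolve_columns(columns, ["price_*", "age"], exclude=["*_eur"])

# col_format uses "{}" as a placeholder for the input column name.
renamed = ["{}_new".format(c) for c in cols]
indicator_name = "INDICATOR({})".format("age")
```

The last line mirrors the col_format='INDICATOR({})' value used by the indicator alias.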
-
__module__
= 'recipipe.core'¶
-
fit
(df, y=None)[source]¶ Fit the transformer.
- Parameters
df (pandas.DataFrame) – Dataframe used to fit the transformation.
-
get_column_mapping
()[source]¶ Get the column mapping between the input and transformed DataFrame.
By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.
- Returns
A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: tuple:1 indicates that one output column has been created from several input columns, and 1:tuple indicates that several output columns come from one specific input column. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.
See also
recipipe.core.RecipipeTransformer.col_format
- Raise:
ValueError
if self.cols is None.
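The three shapes of mapping described above can be written as plain dicts. The column names here are made up for illustration, and output_columns is a hypothetical helper, not part of recipipe.

```python
# 1:1 mapping: each input column maps to one output column.
one_to_one = {"age": "age", "price": "price"}

# tuple:1 mapping: several input columns produce one output column.
many_to_one = {("first_name", "last_name"): "full_name"}

# 1:tuple mapping: one input column produces several output columns.
one_to_many = {"color": ("color=red", "color=blue")}


def output_columns(mapping):
    """Flatten the values of a mapping into a list of output names."""
    out = []
    for value in mapping.values():
        out.extend(value if isinstance(value, tuple) else (value,))
    return out


# Lists cannot be dict keys because they are not hashable.
try:
    {["first_name", "last_name"]: "full_name"}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```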
-
transform
(df_in)[source]¶ Transform DataFrame.
- Parameters
df_in (pandas.DataFrame) – Input DataFrame.
- Returns
Transformed DataFrame.
- Raise:
ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.
-
recipipe.transformers module¶
-
class
recipipe.transformers.
AsTypeTransformer
(*args, dtypes=None, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
-
__init__
(*args, dtypes=None, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
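The dtype parameter above relies on pandas.DataFrame.select_dtypes. The sketch below calls that Pandas method directly to show the difference between a plain value and a dict that is unpacked into keyword arguments. It assumes that a non-dict value ends up as the include argument, which the text above does not state explicitly; the DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [0.5, 1.5]})

# A plain value used as the include filter (assumption of this sketch).
numeric_cols = list(df.select_dtypes(include="number").columns)

# A dict is unpacked into keyword arguments: select_dtypes(**dtype).
dtype = dict(exclude="number")
non_numeric_cols = list(df.select_dtypes(**dtype).columns)
```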
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
CategoryEncoder
(*args, error_unknown=False, unknown_value=None, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnTransformer
-
__init__
(*args, error_unknown=False, unknown_value=None, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
ColumnGroupsTransformer
(*args, groups=False, cols_init=None, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Apply an N to 1 transformation to a group of columns.
-
__init__
(*args, groups=False, cols_init=None, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
get_column_mapping
()[source]¶ Get the column mapping between the input and transformed DataFrame.
By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.
- Returns
A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: tuple:1 indicates that one output column has been created from several input columns, and 1:tuple indicates that several output columns come from one specific input column. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.
See also
recipipe.core.RecipipeTransformer.col_format
- Raise:
ValueError
if self.cols is None.
-
-
class
recipipe.transformers.
ColumnTransformer
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Apply an operation to each input column.
This transformer only allows 1 to N relationships between input and output columns. If you want to create a column from two existing ones (such as concatenating one column with another), this transformer is not for you.
Note that the output number of rows of this transformer should be the same as in the input DataFrame. Row deletions are not supported here.
-
__module__
= 'recipipe.transformers'¶
-
fit
(df, y=None)[source]¶ Fit the transformer.
- Parameters
df (pandas.DataFrame) – Dataframe used to fit the transformation.
-
-
class
recipipe.transformers.
ColumnsTransformer
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Apply an operation to all the input columns at the same time.
This class does not do that much… It’s only particularly useful when working with transformers that return a numpy.ndarray, like the ones in SKLearn. This class deals with the creation of a DataFrame so you do not need to create it yourself.
Note that the output number of rows of this transformer should be the same as in the input DataFrame. Row deletions are not supported here.
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
ConcatTransformer
(*args, separator='', **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnGroupsTransformer
Concatenate string or non-string columns into a new string column.
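A minimal sketch of what an N to 1 concatenation with a separator looks like, using plain Python rows instead of a DataFrame. The helper concat_row is hypothetical, and casting non-string values with str() is an assumption about how non-string columns are handled.

```python
def concat_row(values, separator=""):
    """Join values of any type into one string (non-strings are cast with str)."""
    return separator.join(str(v) for v in values)


# Each tuple stands for the values of two input columns in one row.
rows = [("ES", 28001), ("FR", 75001)]
joined = [concat_row(row, separator="-") for row in rows]
```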
-
__init__
(*args, separator='', **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
DropNARowsTransformer
(*args, inplace=False, how='any', thresh=None, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
-
__init__
(*args, inplace=False, how='any', thresh=None, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
transform
(df, y=None)[source]¶ Transform DataFrame.
- Parameters
df_in (pandas.DataFrame) – Input DataFrame.
- Returns
Transformed DataFrame.
- Raise:
ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.
-
-
class
recipipe.transformers.
DropTransformer
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Drop the fitted columns and continue with the remaining ones.
-
__module__
= 'recipipe.transformers'¶
-
inverse_transform
(df)[source]¶ No inverse exists for a drop operation but…
Obviously, there is no way to get back the dropped columns, but it’s useful to have this operation defined to avoid errors when using this transformer in a pipeline. For that reason, the inverse is defined as the identity function.
- Parameters
df (pandas.DataFrame) – DataFrame to inverse transform.
- Returns
df without modifications.
- Return type
Identity function
-
transform
(df)[source]¶ Drop the fitted columns.
- Parameters
df (pandas.DataFrame) – DataFrame to drop columns from.
- Returns
Transformed DataFrame.
-
-
class
recipipe.transformers.
ExtractTransformer
(*args, pattern=None, flags=0, indicator=False, col_values=None, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnTransformer
Extract regex from string columns.
-
__init__
(*args, pattern=None, flags=0, indicator=False, col_values=None, **kwargs)[source]¶ Create an ExtractTransformer.
- Parameters
pattern (list or str) – Regex or list of regexes to extract. If a list is provided, only one extraction group per element is allowed.
flags (int) – Flags from the re module. Default: 0, no flags.
col_values (list of str) – Values of the output columns. If you are extracting more than one column from an input column a with col_values=["b", "c"], you will get two output columns named a=b and a=c.
indicator (bool) – Instead of capturing the string, return an indicator with 1 if there is a match and 0 otherwise.
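The pattern, flags and indicator parameters above can be pictured with the standard re module applied to a plain list of strings. This is a sketch, not the transformer's implementation: the extract helper is hypothetical, and returning None for values without a match is an assumption of this example.

```python
import re


def extract(values, pattern, flags=0, indicator=False):
    """Return the first capture group per value, or a 0/1 match indicator."""
    regex = re.compile(pattern, flags)
    out = []
    for value in values:
        match = regex.search(value)
        if indicator:
            out.append(1 if match else 0)
        else:
            out.append(match.group(1) if match else None)
    return out


names = ["Mr. Smith", "mrs. Jones", "Unknown"]
titles = extract(names, r"(mrs?)\.", flags=re.IGNORECASE)
has_title = extract(names, r"(mrs?)\.", flags=re.IGNORECASE, indicator=True)

# Output column names for col_values=["b", "c"] on an input column "a".
out_names = ["{}={}".format("a", v) for v in ["b", "c"]]
```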
-
__module__
= 'recipipe.transformers'¶
-
get_column_mapping
()[source]¶ Get the column mapping between the input and transformed DataFrame.
By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.
- Returns
A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: tuple:1 indicates that one output column has been created from several input columns, and 1:tuple indicates that several output columns come from one specific input column. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.
See also
recipipe.core.RecipipeTransformer.col_format
- Raise:
ValueError
if self.cols is None.
-
-
class
recipipe.transformers.
GroupByTransformer
(groupby, transformer, *args, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Apply a transformer on each group.
Example:
    # Normalize by group using the SKLearn StandardScaler.
    scaler = SklearnCreator(StandardScaler())
    Recipipe() + GroupByTransformer("group_column", scaler("num_column"))
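To make the idea of normalizing by group concrete without any dependency, here is a standard-library sketch that standardizes each value with the mean and standard deviation of its own group. The function standardize_by_group and the data are hypothetical. It uses the population standard deviation, as StandardScaler does, and it does not guard against groups with zero variance.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_group(groups, values):
    """Z-score each value using the mean and std of its own group."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    # "Fit" step: one (mean, std) pair per group.
    stats = {g: (mean(vs), pstdev(vs)) for g, vs in by_group.items()}
    # "Transform" step: apply the statistics of each row's group.
    return [(v - stats[g][0]) / stats[g][1] for g, v in zip(groups, values)]


groups = ["a", "a", "b", "b"]
values = [1.0, 3.0, 10.0, 30.0]
scaled = standardize_by_group(groups, values)
```

Both groups end up on the same scale even though their raw values differ by an order of magnitude.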
-
__init__
(groupby, transformer, *args, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
fit
(df, y=None)[source]¶ Fit the transformer.
- Parameters
df (pandas.DataFrame) – Dataframe used to fit the transformation.
-
transform
(df_in)[source]¶ Transform DataFrame.
- Parameters
df_in (pandas.DataFrame) – Input DataFrame.
- Returns
Transformed DataFrame.
- Raise:
ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.
-
-
class
recipipe.transformers.
PandasScaler
(*args, factor=1, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnTransformer
Standard scaler implemented with Pandas operations.
-
__init__
(*args, factor=1, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
QueryTransformer
(query, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
-
__init__
(query, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied later.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if the output column has the same name as the input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder that is substituted by the column name. For example, if you want to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
transform
(df_in)[source]¶ Transform DataFrame.
- Parameters
df_in (pandas.DataFrame) – Input DataFrame.
- Returns
Transformed DataFrame.
- Raise:
ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.
-
-
class
recipipe.transformers.
ReduceMemoryTransformer
(*args, deep=True, verbose=False, object_to_category=True, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Change data types in order to reduce memory usage.
This transformer iterates over all the columns of the DataFrame and, for numeric types, checks whether there is another data type that occupies less memory and can still store all the numbers of the column.
Another way to reduce the memory usage of a Pandas DataFrame is to create categories (pandas.Categorical). It’s not always possible to save memory by converting to a category. The ideal columns to convert to category are those with few distinct values. A column whose values are all unique (like an ID column) will surely occupy more memory as a category, so consider dropping those columns first or not applying the transformer to them.
In a DataFrame there may also be numeric columns that can be converted to categorical (especially columns of integer types). You should make those transformations manually if you want a column of category dtype from a numeric column. Note that recipipe.transformers.CategoryEncoder does not really return a Pandas category; it encodes the values as integers.
You should also note that this transformation is done in place, because the objective is to reduce memory. Keeping a copy in memory does not make much sense if you want to save memory.
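The two ideas described above (picking a narrower numeric type, and converting only low-cardinality columns to categories) can be sketched in plain Python. Both helpers are hypothetical: the list of signed integer types and the 0.5 uniqueness threshold are choices made for this illustration and say nothing about the rules the transformer actually applies.

```python
# Value ranges of the signed integer types considered in this sketch.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed type that holds all values."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in int64")


def worth_category(values, max_unique_ratio=0.5):
    """Hypothetical heuristic: few distinct values relative to the length."""
    return len(set(values)) / len(values) <= max_unique_ratio
```

An ID-like column, where every value is unique, fails the second check, which matches the advice above about not converting such columns.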
-
__init__
(*args, deep=True, verbose=False, object_to_category=True, **kwargs)[source]¶ Create a new reduce memory transformer.
- Parameters
deep (bool) – When verbose=True, the percentage of memory saved by the conversion is shown on the screen. It’s computed using the pandas.DataFrame.memory_usage method, and deep is passed as the argument of that method. deep=True is slower but more accurate. Note that it’s really slow on big DataFrames.
verbose (bool) – Prints information about the process on the screen. Shows the percentage of memory saved after the conversion.
object_to_category (bool) – If True, this method automatically converts objects to the Pandas category type.
-
__module__
= 'recipipe.transformers'¶
-
transform
(df, y=None)[source]¶ Transform DataFrame.
- Parameters
df_in (pandas.DataFrame) – Input DataFrame.
- Returns
Transformed DataFrame.
- Raise:
ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.
-
-
class
recipipe.transformers.
ReplaceTransformer
(*args, values=None, **kwargs)[source]¶ Bases:
recipipe.core.RecipipeTransformer
-
__init__
(*args, values=None, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if an output column has the same name as an input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.
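The column patterns, exclusions, and col_format described above can be sketched with the standard library alone. This is an illustration only: the helper output_names and the sample column names are hypothetical and not part of recipipe, and treating exclude entries as patterns is an assumption.

```python
from fnmatch import fnmatch

def output_names(df_columns, patterns, col_format="{}", exclude=()):
    """Hypothetical helper: match patterns, drop exclusions, rename."""
    # Keep the columns that match at least one Unix-style pattern.
    matched = [c for c in df_columns
               if any(fnmatch(c, p) for p in patterns)]
    # The exclusion is applied after the columns have been matched.
    kept = [c for c in matched
            if not any(fnmatch(c, e) for e in exclude)]
    # col_format substitutes "{}" by the input column name.
    return {c: col_format.format(c) for c in kept}

columns = ["price_eur", "price_usd", "color", "size"]
mapping = output_names(columns, ["price_*"], col_format="{}_new",
                       exclude=["price_usd"])
print(mapping)  # {'price_eur': 'price_eur_new'}
```

With the default col_format="{}" the output name equals the input name, which is the case where keep_original cannot preserve the original column.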
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
SelectTransformer
(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]¶ Bases:
recipipe.core.RecipipeTransformer
Select the fitted columns and ignore the rest of them.
-
__module__
= 'recipipe.transformers'¶
-
inverse_transform
(df)[source]¶ No inverse exists for a select operation but…
Obviously, there is no way to get back the non-selected columns, but it's useful to have this operation defined to avoid errors when using this transformer in a pipeline. For that reason, the inverse is defined as the identity function.
- Parameters
df (pandas.DataFrame) – DataFrame to inverse transform.
- Returns
df without modifications (identity function).
- Return type
pandas.DataFrame
-
transform
(df)[source]¶ Select the fitted columns.
- Parameters
df (pandas.DataFrame) – DataFrame to select columns from.
- Returns
Transformed DataFrame.
-
-
class
recipipe.transformers.
SklearnColumnWrapper
(sk_transformer, *args, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnTransformer
-
__init__
(sk_transformer, *args, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if an output column has the same name as an input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
get_column_mapping
()[source]¶ Get the column mapping between the input and transformed DataFrame.
By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.
- Returns
A dict in which the keys are input DataFrame column names and the values are output DataFrame column names. Both keys and values can be tuples: tuple:1 indicates that one output column has been created from several columns of the input DataFrame, and 1:tuple indicates that several output columns come from one specific column of the input DataFrame. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.
See also
recipipe.core.RecipipeTransformer.col_format
- Raise:
ValueError
if self.cols is None.
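The three shapes of mapping described above can be illustrated with plain dicts. All column names here are made up, and the helpers as_tuple and output_columns are hypothetical; they only show how a consumer might read such a mapping.

```python
# 1:1 mapping: each input column produces one output column.
one_to_one = {"age": "age", "height": "height"}

# tuple:1 mapping: one output column built from several input columns.
many_to_one = {("width", "length"): "area"}

# 1:tuple mapping: one input column expanded into several output columns,
# as a one-hot encoder would do.
one_to_many = {"color": ("color=red", "color=blue")}

def as_tuple(x):
    """Normalize a str or tuple entry to a tuple of column names."""
    return x if isinstance(x, tuple) else (x,)

def output_columns(col_map):
    """Collect every output column name, in insertion order."""
    out = []
    for value in col_map.values():
        out.extend(as_tuple(value))
    return out

print(output_columns(one_to_many))  # ['color=red', 'color=blue']

# Lists cannot be dict keys because they are not hashable.
try:
    {["width", "length"]: "area"}
except TypeError as err:
    print("TypeError:", err)
```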
-
-
class
recipipe.transformers.
SklearnColumnsWrapper
(sk_transformer, *args, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnsTransformer
-
__init__
(sk_transformer, *args, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if an output column has the same name as an input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
get_column_mapping
()[source]¶ Get the column mapping between the input and transformed DataFrame.
By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.
- Returns
A dict in which the keys are input DataFrame column names and the values are output DataFrame column names. Both keys and values can be tuples: tuple:1 indicates that one output column has been created from several columns of the input DataFrame, and 1:tuple indicates that several output columns come from one specific column of the input DataFrame. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.
See also
recipipe.core.RecipipeTransformer.col_format
- Raise:
ValueError
if self.cols is None.
-
-
class
recipipe.transformers.
SklearnCreator
(sk_transformer, **kwargs)[source]¶ Bases:
object
Utility class to generate SKLearn wrappers.
Use this class to reuse any existing SKLearn transformer in your recipipe pipelines.
- Parameters
sk_transformer (sklearn.TransformerMixin) – Any instance of an SKLearn transformer you want to incorporate in your recipipe pipelines.
Example:
    from sklearn.preprocessing import OneHotEncoder
    from recipipe import recipipe
    from recipipe.transformers import SklearnCreator

    # Create a onehot encoder using the SKLearn OneHotEncoder class.
    onehot = SklearnCreator(OneHotEncoder(sparse=False))
    # Now you can use the onehot variable as a transformer in a recipipe.
    recipipe() + onehot(dtype="string")
-
WRAPPERS
= {'column': <class 'recipipe.transformers.SklearnColumnWrapper'>, 'columns': <class 'recipipe.transformers.SklearnColumnsWrapper'>, 'fit_one_col': <class 'recipipe.transformers.SklearnFitOneWrapper'>}¶
-
__call__
(*args, wrapper='columns', **kwargs)[source]¶ Instantiate a SKLearn wrapper using a copy of the given transformer.
It's important to make this copy to avoid fitting the same transformer several times.
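Why the copy matters can be shown with a toy factory. This is a sketch under assumptions: ToyScaler and ToyCreator are hypothetical stand-ins, and copy.deepcopy is used here only for illustration; the mechanism recipipe actually uses to copy the transformer is not stated in this documentation.

```python
import copy

class ToyScaler:
    """Hypothetical stand-in for an SKLearn transformer."""
    def __init__(self):
        self.max_ = None

    def fit(self, values):
        self.max_ = max(values)
        return self

class ToyCreator:
    """Hypothetical factory: each call hands out a fresh copy."""
    def __init__(self, transformer):
        self.transformer = transformer

    def __call__(self):
        # Copy the template so that fitted state is never shared.
        return copy.deepcopy(self.transformer)

creator = ToyCreator(ToyScaler())
first = creator().fit([1, 2, 3])
second = creator().fit([10, 20])

print(first.max_, second.max_)   # 3 20
print(creator.transformer.max_)  # None
```

Without the copy, both pipeline steps would hold the same object and the second fit would overwrite the state learned by the first.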
-
__init__
(sk_transformer, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
__module__
= 'recipipe.transformers'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
class
recipipe.transformers.
SklearnFitOneWrapper
(sk_transformer, *args, **kwargs)[source]¶ Bases:
recipipe.transformers.SklearnColumnWrapper
Fit on a concatenation of all the columns, then apply to each column one by one.
This is useful when two or more columns are closely related. For example, when all those columns share the same category type.
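The fit-on-concatenation idea can be sketched without any dependency. ToyLabelEncoder and the sample table are hypothetical; the sketch only mirrors the description above and is not the wrapper's actual code.

```python
class ToyLabelEncoder:
    """Hypothetical encoder: maps each distinct value to an integer."""
    def fit(self, values):
        self.codes_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return [self.codes_[v] for v in values]

# Two related columns that share the same set of categories.
table = {
    "home_team": ["ajax", "betis", "celta"],
    "away_team": ["celta", "ajax", "ajax"],
}

# Fit once on the concatenation of all the columns...
encoder = ToyLabelEncoder().fit(table["home_team"] + table["away_team"])

# ...then apply the fitted encoder to each column, one by one.
encoded = {name: encoder.transform(col) for name, col in table.items()}
print(encoded)
# {'home_team': [0, 1, 2], 'away_team': [2, 0, 0]}
```

Because a single encoder is fitted, the same category gets the same code in every column, which would not be guaranteed if each column were fitted separately.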
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
SumTransformer
(*args, groups=False, cols_init=None, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnGroupsTransformer
Sum columns.
-
__module__
= 'recipipe.transformers'¶
-
-
class
recipipe.transformers.
TargetEncoderTransformer
(*args, target=None, min_samples_leaf=1, smoothing=1, **kwargs)[source]¶ Bases:
recipipe.transformers.ColumnTransformer
Target encoder.
Computed as described in: A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, by Daniele Micci-Barreca.
Code partially taken from: https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283
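A minimal, dependency-free sketch of the smoothed target mean, assuming the transformer blends the category mean with the global prior in the same way as the kernel linked above. The function target_encode and the sample data are hypothetical, and this has not been checked against the recipipe source.

```python
import math
from collections import defaultdict

def target_encode(categories, targets, min_samples_leaf=1, smoothing=1):
    """Return {category: smoothed target mean} for the given samples."""
    # Global mean of the target, used as the prior.
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {}
    for cat, n in counts.items():
        # Sigmoid weight: near 1 for frequent categories, lower for rare ones.
        weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoding[cat] = prior * (1 - weight) + (sums[cat] / n) * weight
    return encoding

cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 0]
enc = target_encode(cats, ys)
print(enc)
```

In this sketch a category seen exactly min_samples_leaf times gets a weight of 0.5, so its encoding is the midpoint between its own mean and the prior.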
-
__init__
(*args, target=None, min_samples_leaf=1, smoothing=1, **kwargs)[source]¶ Create a new transformer.
Column names can use Unix filename pattern matching (fnmatch).
- Parameters
*args (list of str) – List of columns the transformer will work on.
cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.
exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.
dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.
name (str) – Human-friendly name of the transformer.
keep_original (bool) – True if you want to keep the input columns used in the transformer in the transformed DataFrame, False if not. Note that, if an output column has the same name as an input column, the original input column will not be included even if keep_original is set to True. Default: False.
col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".
cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.
-
__module__
= 'recipipe.transformers'¶
-
recipipe.utils module¶
-
recipipe.utils.
add_to_map_dict
(col_map, k, v)[source]¶ Stores k and v in the given col_map.
If k is a string, col_map[k] += list(v). If k is a tuple, for each i in k: col_map[i] += list(v).
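The behaviour described above can be sketched as follows. The function add_to_map_dict_sketch is a hypothetical re-implementation, not the library code; in particular, treating a plain string v as a single value (rather than splitting it into characters) is an assumption.

```python
def add_to_map_dict_sketch(col_map, k, v):
    """Hypothetical re-implementation of the behaviour described above."""
    # A string key is handled as a one-element tuple of keys.
    keys = k if isinstance(k, tuple) else (k,)
    # Assumption: a plain string value counts as a single column name.
    values = list(v) if isinstance(v, (tuple, list)) else [v]
    for key in keys:
        col_map.setdefault(key, []).extend(values)

col_map = {}
add_to_map_dict_sketch(col_map, "color", ("color=red", "color=blue"))
add_to_map_dict_sketch(col_map, ("width", "length"), "area")
print(col_map)
# {'color': ['color=red', 'color=blue'], 'width': ['area'], 'length': ['area']}
```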
-
recipipe.utils.
default_params
(fun_kwargs, default_dict=None, **kwargs)[source]¶ Add to kwargs and/or default_dict the values of fun_kwargs.
This function allows the user to overwrite the default values of some parameters. In the following example, the user cannot give a value to the param a, because a would be passed twice to the function another_fun:
>>> def fun(**kwargs):
...     return another_fun(a="a", **kwargs)
You can solve this in two ways. The first one:
>>> def fun(a="a", **kwargs):
...     return another_fun(a=a, **kwargs)
Or using default_params:
>>> def fun(**kwargs):
...     kwargs = default_params(kwargs, a="a")
...     return another_fun(**kwargs)
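The merge order that makes this pattern work can be sketched as below. default_params_sketch is a hypothetical stand-in written for this illustration; the real function's return value and its handling of default_dict are assumptions here.

```python
def default_params_sketch(fun_kwargs, default_dict=None, **kwargs):
    """Hypothetical version: defaults first, caller's values win."""
    merged = dict(default_dict or {})
    merged.update(kwargs)      # defaults given as keyword arguments
    merged.update(fun_kwargs)  # values supplied by the caller override them
    return merged

def another_fun(a, b=0):
    return (a, b)

def fun(**kwargs):
    kwargs = default_params_sketch(kwargs, a="a")
    return another_fun(**kwargs)

print(fun())            # ('a', 0)
print(fun(a="z", b=1))  # ('z', 1)
```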
-
recipipe.utils.
fit_columns
(df, cols=None, dtype=None, raise_error=True, drop_duplicates=True)[source]¶ Fit columns to a DataFrame.
If neither cols nor dtype is given, df.columns is returned. If both cols and dtype are given, cols is applied first and dtype is then applied over the resulting columns.
Note that an empty list can be returned if df does not contain columns.
- Parameters
df (pandas.DataFrame) – DataFrame that is being fitted.
cols (list) – List of columns to fit. The names may contain Unix filename pattern matching (fnmatch) symbols.
dtype – Any value supported by pandas.DataFrame.select_dtypes.
raise_error (bool) – If True and no column in df matches a given column in cols, an exception is raised.
drop_duplicates (bool) – Remove duplicates keeping the order. Default: True.
- Returns
List of existing columns in df that satisfy the constraints of dtype and cols.
- Raises
An exception if raise_error is True and a column in cols does not match any column in df.
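A simplified sketch of the column-fitting step, working on a plain list of column names instead of a DataFrame and leaving out the dtype filter. The function fit_columns_sketch is hypothetical, and the use of ValueError is an assumption, since the exception type is not stated above.

```python
from fnmatch import fnmatchcase

def fit_columns_sketch(df_columns, cols=None, raise_error=True,
                       drop_duplicates=True):
    """Hypothetical, dtype-free version working on a list of names."""
    if cols is None:
        return list(df_columns)
    fitted = []
    for pattern in cols:
        matches = [c for c in df_columns if fnmatchcase(c, pattern)]
        if not matches and raise_error:
            # Assumption: the real function may raise a different type.
            raise ValueError(f"No column matches '{pattern}'")
        fitted.extend(matches)
    if drop_duplicates:
        # dict keys keep insertion order, so the first occurrence wins.
        fitted = list(dict.fromkeys(fitted))
    return fitted

columns = ["id", "price_eur", "price_usd", "color"]
print(fit_columns_sketch(columns, ["price_*", "color", "price_eur"]))
# ['price_eur', 'price_usd', 'color']
```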
-
recipipe.utils.
flatten_list
(cols_list, recursive=True)[source]¶ Take a list of lists and return a flattened list.
- Parameters
cols_list – an iterable of any quantity of str/tuple/list/set.
Example
>>> flatten_list(["a", ("b", set(["c"])), [["d"]]])
["a", "b", "c", "d"]
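One way to implement the flattening shown in the example is the recursive sketch below. flatten_list_sketch is hypothetical, and the meaning given to recursive=False (flatten only one level) is an assumption.

```python
def flatten_list_sketch(cols_list, recursive=True):
    """Hypothetical version: flatten nested list/tuple/set containers."""
    flat = []
    for item in cols_list:
        if isinstance(item, (list, tuple, set)):
            # Strings are not containers here, so they are kept whole.
            flat.extend(flatten_list_sketch(item) if recursive else item)
        else:
            flat.append(item)
    return flat

print(flatten_list_sketch(["a", ("b", set(["c"])), [["d"]]]))
# ['a', 'b', 'c', 'd']
print(flatten_list_sketch(["a", ["b", ["c"]]], recursive=False))
# ['a', 'b', ['c']]
```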
-
recipipe.utils.
is_categorical
(s, column=None)[source]¶ Check if a pandas Series or a column in a DataFrame is categorical.
- Parameters
s (pandas.Series or pandas.DataFrame) – Series to check, or DataFrame containing the column to check.
column (str) – Column name. If a column name is given, s is assumed to be a pandas.DataFrame.