recipipe package

Improved pipelines for data science projects.

SKLearn pipelines that are easy to declare and Pandas-compatible.

recipipe.greet()

Print a silly sentence to stdout.

recipipe aliases

The recipipe package contains many aliases for ease of use. Here is a complete table with all the aliases.

Alias            Definition
---------------  ------------------------------------------------------------------
recipipe         recipipe.core.Recipipe
astype           recipipe.transformers.AsTypeTransformer
category         recipipe.transformers.CategoryEncoder
concat           recipipe.transformers.ConcatTransformer
extract          recipipe.transformers.ExtractTransformer
groupby          recipipe.transformers.GroupByTransformer
select           recipipe.transformers.SelectTransformer
drop             recipipe.transformers.DropTransformer
dropna           recipipe.transformers.DropNARowsTransformer
dropna_rows      recipipe.transformers.DropNARowsTransformer
query            recipipe.transformers.QueryTransformer
replace          recipipe.transformers.ReplaceTransformer
reduce_memory    recipipe.transformers.ReduceMemoryTransformer
sum              recipipe.transformers.SumTransformer
target_encoder   recipipe.transformers.TargetEncoderTransformer
from_sklearn     recipipe.transformers.SklearnCreator
binarizer        from_sklearn(Binarizer())
impute           from_sklearn(SimpleImputer(strategy='constant'))
indicator        from_sklearn(MissingIndicator(), col_format='INDICATOR({})')
minmax           from_sklearn(MinMaxScaler())
onehot           from_sklearn(OneHotEncoder(sparse=False, handle_unknown='ignore'))
robust_scale     from_sklearn(RobustScaler())
scale            from_sklearn(StandardScaler())
standarize       from_sklearn(StandardScaler())
sk_binarizer     sklearn.preprocessing.Binarizer
sk_indicator     sklearn.impute.MissingIndicator
sk_inputer       sklearn.impute.SimpleImputer
sk_onehot        sklearn.preprocessing.OneHotEncoder
sk_scale         sklearn.preprocessing.StandardScaler

recipipe.core module

class recipipe.core.Recipipe(steps=None, **kwargs)

Bases: sklearn.pipeline.Pipeline

Recipipe pipeline.

A Recipipe pipeline is an extension of an SKLearn pipeline. It adds some functionality that makes the creation of pipelines less painful. For example, the steps param is not required at construction time. You can add your transformers to the pipeline at any time using recipipe.core.Recipipe.add.

Attr:

Same attributes as sklearn.pipeline.Pipeline.

__abstractmethods__ = frozenset({})
__add__(transformer)

Add a new step to the pipeline using the ‘+’ operator.

Note that this is exactly the same as calling recipipe.core.Recipipe.add. The Recipipe object itself is modified, that is, p = p + t has the same effect as p + t, where p is any Recipipe pipeline and t is any transformer.
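The in-place behaviour of ‘+’ can be illustrated with a toy class. This is a minimal sketch and not Recipipe's implementation: TinyPipeline and its steps list are hypothetical names, and the only behaviour taken from the text is that ‘+’ delegates to add and returns the same object.

```python
class TinyPipeline:
    """Hypothetical stand-in for a pipeline whose '+' mutates in place."""

    def __init__(self):
        self.steps = []

    def add(self, step):
        # Append the step and return self so that calls can be chained.
        self.steps.append(step)
        return self

    def __add__(self, step):
        # '+' delegates to add(), so the left operand itself is modified.
        return self.add(step)


p = TinyPipeline()
q = p + "scale"               # q is the same object as p
p + "onehot"                  # result discarded, but p is still modified
p.add("impute").add("drop")   # chaining works because add() returns self
```

Because the left operand is mutated, reusing a partially built pipeline in two places means both places see every step added later.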

__init__(steps=None, **kwargs)

Create a Recipipe pipeline.

Parameters
__module__ = 'recipipe.core'
add(step)

Add a new step to the pipeline.

You can add steps even if the pipeline is already fitted, so be careful.

Parameters

step (Transformer or tuple(str, Transformer)) – The new step that you want to add to the pipeline. Any transformer is valid (an SKLearn transformer or a recipipe.core.RecipipeTransformer). If a tuple is given, the first element of the tuple is used as the name of the step in the pipeline.

Returns

The pipeline. You can chain add methods: pipe.add(…).add(…)….

steps = None
class recipipe.core.RecipipeTransformer(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Base class of all Recipipe transformers.

name

Human-friendly name of the transformer.

Type

str

__init__(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.
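The column selection and renaming rules above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: resolve_columns is a hypothetical helper, and treating exclude entries as patterns and using case-sensitive matching are assumptions the text does not confirm.

```python
from fnmatch import fnmatchcase


def resolve_columns(df_columns, patterns, exclude=()):
    """Hypothetical helper: expand fnmatch patterns against column names."""
    selected = []
    for pattern in patterns:
        for col in df_columns:
            if fnmatchcase(col, pattern) and col not in selected:
                selected.append(col)
    # The exclusion is applied after the patterns have been resolved.
    return [c for c in selected
            if not any(fnmatchcase(c, e) for e in exclude)]


columns = ["age", "price_usd", "price_eur", "name"]
cols = resolve_columns(columns, ["price_*", "age"], exclude=["*_eur"])

# col_format="{}_new" appends "_new" to every generated column name.
renamed = ["{}_new".format(c) for c in cols]
```

Here cols keeps the order in which the patterns were given, which is one plausible reading of how positional column arguments and cols_init are combined.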

__module__ = 'recipipe.core'
fit(df, y=None)

Fit the transformer.

Parameters

df (pandas.DataFrame) – Dataframe used to fit the transformation.

get_column_mapping()

Get the column mapping between the input and transformed DataFrame.

By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.

Returns

A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: a tuple:1 entry indicates that one output column has been created from several columns of the input DataFrame, and a 1:tuple entry indicates that several output columns come from one specific column of the input DataFrame. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.

See also

recipipe.core.RecipipeTransformer.col_format

Raise:

ValueError if self.cols is None.
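The three shapes of mapping described for get_column_mapping can be written out as plain dicts. This is a sketch with made-up column names; output_columns is a hypothetical helper and is not part of recipipe.

```python
# 1:1 mapping (the default): each input column maps to one output column.
one_to_one = {"age": "age", "price": "price"}

# tuple:1, one output column built from several input columns.
many_to_one = {("first_name", "last_name"): "full_name"}

# 1:tuple, several output columns derived from one input column.
one_to_many = {"color": ("color=red", "color=blue")}


def output_columns(mapping):
    """Flatten the values of a mapping into a list of output column names."""
    out = []
    for value in mapping.values():
        out.extend(value if isinstance(value, tuple) else [value])
    return out


# Lists are unhashable, so they cannot be used as dict keys.
try:
    {["first_name", "last_name"]: "full_name"}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```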

inverse_transform(df_in)
transform(df_in)

Transform DataFrame.

Parameters

df_in (pandas.DataFrame) – Input DataFrame.

Returns

Transformed DataFrame.

Raise:

ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.

recipipe.transformers module

class recipipe.transformers.AsTypeTransformer(*args, dtypes=None, **kwargs)

Bases: recipipe.core.RecipipeTransformer

__init__(*args, dtypes=None, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
class recipipe.transformers.CategoryEncoder(*args, error_unknown=False, unknown_value=None, **kwargs)

Bases: recipipe.transformers.ColumnTransformer

__init__(*args, error_unknown=False, unknown_value=None, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
class recipipe.transformers.ColumnGroupsTransformer(*args, groups=False, cols_init=None, **kwargs)

Bases: recipipe.core.RecipipeTransformer

Apply an N to 1 transformation to a group of columns.

__init__(*args, groups=False, cols_init=None, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
get_column_mapping()

Get the column mapping between the input and transformed DataFrame.

By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.

Returns

A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: a tuple:1 entry indicates that one output column has been created from several columns of the input DataFrame, and a 1:tuple entry indicates that several output columns come from one specific column of the input DataFrame. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.

See also

recipipe.core.RecipipeTransformer.col_format

Raise:

ValueError if self.cols is None.

class recipipe.transformers.ColumnTransformer(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)

Bases: recipipe.core.RecipipeTransformer

Apply an operation to each input column.

This transformer only allows 1 to N relationships between input and output columns. If you want to create a column from two existing ones (like concatenate one column to another) this transformer is not for you.

Note that the output number of rows of this transformer should be the same as in the input DataFrame. No deletes are supported here.

__module__ = 'recipipe.transformers'
fit(df, y=None)

Fit the transformer.

Parameters

df (pandas.DataFrame) – Dataframe used to fit the transformation.

class recipipe.transformers.ColumnsTransformer(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)

Bases: recipipe.core.RecipipeTransformer

Apply an operation to all the input columns at the same time.

This class does not do that much… It's only particularly useful when working with transformers that return a numpy.ndarray, like the ones in SKLearn. This class deals with the creation of the DataFrame so you do not need to create it yourself.

Note that the output number of rows of this transformer should be the same as in the input DataFrame. No deletes are supported here.

__module__ = 'recipipe.transformers'
class recipipe.transformers.ConcatTransformer(*args, separator='', **kwargs)

Bases: recipipe.transformers.ColumnGroupsTransformer

Concatenate string or non-string columns into a new string column.

__init__(*args, separator='', **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
class recipipe.transformers.DropNARowsTransformer(*args, inplace=False, how='any', thresh=None, **kwargs)

Bases: recipipe.core.RecipipeTransformer

__init__(*args, inplace=False, how='any', thresh=None, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
inverse_transform(df)
transform(df, y=None)

Transform DataFrame.

Parameters

df (pandas.DataFrame) – Input DataFrame.

Returns

Transformed DataFrame.

Raise:

ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.

class recipipe.transformers.DropTransformer(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)

Bases: recipipe.core.RecipipeTransformer

Drop the fitted columns and continue with the remaining ones.

__module__ = 'recipipe.transformers'
inverse_transform(df)

No inverse exists for a drop operation but…

Obviously, there is no way to get back the dropped columns, but it's useful to have this operation defined to avoid errors when using this transformer in a pipeline. For that reason, the inverse is defined as the identity function.

Parameters

df (pandas.DataFrame) – DataFrame to inverse transform.

Returns

The identity function: df is returned without modifications.

Return type

pandas.DataFrame

transform(df)

Drop the fitted columns.

Parameters

df (pandas.DataFrame) – DataFrame to drop columns from.

Returns

Transformed DataFrame.

class recipipe.transformers.ExtractTransformer(*args, pattern=None, flags=0, indicator=False, col_values=None, **kwargs)

Bases: recipipe.transformers.ColumnTransformer

Extract regex from string columns.

__init__(*args, pattern=None, flags=0, indicator=False, col_values=None, **kwargs)

Create an ExtractTransformer.

Parameters
  • pattern (list or str) – Regex or list of regexes to extract. If a list is provided, only one extraction group per element is allowed.

  • flags (int) – Flags from the re module. Default: 0, no flags.

  • col_values (list of str) – Values of the output columns. If you are extracting more than one column from an input column a with col_values=["b", "c"], you will get two output columns named a=b and a=c.

  • indicator (bool) – Instead of capturing the string, return an indicator with 1 if there is a match and 0 otherwise.
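The pattern, flags and indicator parameters can be illustrated with the re module on a plain list of strings. This is a sketch, not the transformer's implementation: extract is a hypothetical helper, and returning None for non-matching values is an assumption.

```python
import re


def extract(values, pattern, flags=0, indicator=False):
    """Hypothetical helper: first capture group per string, or a 0/1 flag."""
    regex = re.compile(pattern, flags)
    out = []
    for value in values:
        match = regex.search(value)
        if indicator:
            out.append(1 if match else 0)
        else:
            out.append(match.group(1) if match else None)
    return out


titles = ["Mr. Smith", "Mrs. Jones", "Unknown"]
captured = extract(titles, r"(\w+)\.")
matched = extract(titles, r"(MRS?)\.", flags=re.IGNORECASE, indicator=True)

# Output names for an input column "a" and col_values=["b", "c"].
names = ["{}={}".format("a", v) for v in ["b", "c"]]
```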

__module__ = 'recipipe.transformers'
get_column_mapping()

Get the column mapping between the input and transformed DataFrame.

By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.

Returns

A dict in which the keys are the input DataFrame column names and the values are the output DataFrame column names. Both keys and values can be tuples: a tuple:1 entry indicates that one output column has been created from several columns of the input DataFrame, and a 1:tuple entry indicates that several output columns come from one specific column of the input DataFrame. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.

See also

recipipe.core.RecipipeTransformer.col_format

Raise:

ValueError if self.cols is None.

class recipipe.transformers.GroupByTransformer(groupby, transformer, *args, **kwargs)

Bases: recipipe.core.RecipipeTransformer

Apply a transformer on each group.

Example:

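# Normalize by group using the SKLearn StandardScaler.
from sklearn.preprocessing import StandardScaler
from recipipe.core import Recipipe
from recipipe.transformers import GroupByTransformer, SklearnCreator
scaler = SklearnCreator(StandardScaler())
pipe = Recipipe() + GroupByTransformer("group_column", scaler("num_column"))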
__init__(groupby, transformer, *args, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
fit(df, y=None)

Fit the transformer.

Parameters

df (pandas.DataFrame) – Dataframe used to fit the transformation.

fit_group(name, df)
inverse_transform(df_in)
inverse_transform_group(name, df)
transform(df_in)

Transform DataFrame.

Parameters

df_in (pandas.DataFrame) – Input DataFrame.

Returns

Transformed DataFrame.

Raise:

ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.

transform_group(name, df)
class recipipe.transformers.PandasScaler(*args, factor=1, **kwargs)

Bases: recipipe.transformers.ColumnTransformer

Standard scaler implemented with Pandas operations.

__init__(*args, factor=1, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
class recipipe.transformers.QueryTransformer(query, **kwargs)

Bases: recipipe.core.RecipipeTransformer

__init__(query, **kwargs)

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). This way you can, for example, exclude int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, the input column is not included even if keep_original is set to True. Default: False.

  • col_format (str) – New name of the columns. Use "{}" as a placeholder for the column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn't any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
transform(df_in)

Transform DataFrame.

Parameters

df_in (pandas.DataFrame) – Input DataFrame.

Returns

Transformed DataFrame.

Raise:

ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.

class recipipe.transformers.ReduceMemoryTransformer(*args, deep=True, verbose=False, object_to_category=True, **kwargs)

Bases: recipipe.core.RecipipeTransformer

Change data types in order to reduce memory usage.

This transformer iterates over all the columns of the DataFrame and, for numeric types, checks whether there is another data type that occupies less memory and can still store all the numbers in the column.

Another way to reduce the memory usage of a Pandas DataFrame is to create categories (pandas.Categorical). Converting to a category does not always save memory. The ideal columns to convert are those with few distinct values. A column in which all values are unique (like an ID column) will surely occupy more memory as a category, so consider dropping those columns first or not applying the transformer to them.

A DataFrame may also contain numeric columns that can be converted to categorical (especially columns of integer types). You should make those transformations manually if you want a column of category dtype from a numeric column. Note that recipipe.transformers.CategoryEncoder does not return a Pandas category; it encodes the values as integers.

Note also that this transformation is done in place, because the objective is to reduce memory. Keeping a copy in memory does not make sense if you want to save memory.

__init__(*args, deep=True, verbose=False, object_to_category=True, **kwargs)

Create a new reduce memory transformer.

Parameters
  • deep (bool) – When verbose=True, the percentage of memory saved by the conversion is printed. It is computed using the pandas.DataFrame.memory_usage method, and deep is passed as that method's deep argument. deep=True is slower but more accurate. Note that it is really slow on big DataFrames.

  • verbose (bool) – Print information about the process. Shows the percentage of memory saved after the conversion.

  • object_to_category (bool) – If True, object columns are automatically converted to the Pandas category type.
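The two memory-saving ideas described for this transformer, downcasting numeric columns and converting low-cardinality columns to categories, can be reproduced with plain Pandas calls. This is a sketch of the general technique and not the transformer's code; the column names and data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3] * 100,             # small integers stored as int64
    "label": ["yes", "no", "yes"] * 100,  # few distinct string values
})
before = df.memory_usage(deep=True).sum()

# Downcast the integer column to the smallest dtype that holds its values.
df["count"] = pd.to_numeric(df["count"], downcast="integer")
# Convert the low-cardinality string column to a Pandas category.
df["label"] = df["label"].astype("category")

after = df.memory_usage(deep=True).sum()
saved_pct = 100 * (before - after) / before
```

The saved_pct value plays the role of the percentage reported when verbose=True; memory_usage is called with deep=True so that the contents of string columns are counted.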

__module__ = 'recipipe.transformers'
transform(df, y=None)

Transform DataFrame.

Parameters

df (pandas.DataFrame) – Input DataFrame.

Returns

Transformed DataFrame.

Raise:

ValueError if the columns have not been fitted. Fit the transformer first to avoid this error.

class recipipe.transformers.ReplaceTransformer(*args, values=None, **kwargs)[source]

Bases: recipipe.core.RecipipeTransformer

__init__(*args, values=None, **kwargs)[source]

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, that input column is not kept even if keep_original is True. Default: False.

  • col_format (str) – Format of the new column names. "{}" is replaced by the original column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'
class recipipe.transformers.SelectTransformer(*args, cols_init=None, exclude=None, dtype=None, name=None, keep_original=False, col_format='{}', cols_not_found_error=False)[source]

Bases: recipipe.core.RecipipeTransformer

Select the fitted columns and ignore the rest of them.

__module__ = 'recipipe.transformers'
inverse_transform(df)[source]

No inverse exists for a select operation but…

Obviously, there is no way to get back the non-selected columns, but it’s useful to have this operation defined to avoid errors when using this transformer in a pipeline. For that reason, the inverse is defined as the identity function.

Parameters

df (pandas.DataFrame) – DataFrame to inverse transform.

Returns

df without modifications.

Return type

Identity function

transform(df)[source]

Select the fitted columns.

Parameters

df (pandas.DataFrame) – DataFrame to select columns from.

Returns

Transformed DataFrame.
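The select/identity-inverse pairing can be pictured with a tiny stand-in class. SelectSketch is hypothetical and skips the fitting logic of the real transformer; it only shows why an identity inverse lets a pipeline call inverse_transform on every step without failing.

```python
import pandas as pd


class SelectSketch:
    """Hypothetical stand-in: keeps the given columns, inverse is identity."""

    def __init__(self, *cols):
        self.cols = list(cols)

    def transform(self, df):
        # Keep only the selected columns, in the requested order.
        return df[self.cols]

    def inverse_transform(self, df):
        # The dropped columns cannot be recovered, so return df untouched.
        return df


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
step = SelectSketch("a", "c")
selected = step.transform(df)
restored = step.inverse_transform(selected)
print(list(restored.columns))
```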

class recipipe.transformers.SklearnColumnWrapper(sk_transformer, *args, **kwargs)[source]

Bases: recipipe.transformers.ColumnTransformer

__init__(sk_transformer, *args, **kwargs)[source]

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, that input column is not kept even if keep_original is True. Default: False.

  • col_format (str) – Format of the new column names. "{}" is replaced by the original column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
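The two ways of passing dtype can be tried directly against pandas.DataFrame.select_dtypes. The helper columns_for below is hypothetical and only mirrors the dispatch described for the dtype parameter (dict values are unpacked, anything else is passed as is).

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price": [1.5, 2.5], "name": ["a", "b"]})


def columns_for(df, dtype):
    """Hypothetical helper: resolve a dtype spec to a list of column names."""
    if isinstance(dtype, dict):
        # Dictionary unpacking, e.g. select_dtypes(exclude=int).
        selected = df.select_dtypes(**dtype)
    else:
        selected = df.select_dtypes(dtype)
    return list(selected.columns)


numeric_cols = columns_for(df, "number")
non_int_cols = columns_for(df, dict(exclude=int))
print(numeric_cols, non_int_cols)
```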

__module__ = 'recipipe.transformers'
get_column_mapping()[source]

Get the column mapping between the input and transformed DataFrame.

By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.

Returns

A dict in which the keys are input DataFrame column names and the values are output DataFrame column names. Both keys and values can be tuples: a tuple:1 entry indicates that one output column was created from several input columns, and a 1:tuple entry indicates that several output columns come from one input column. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.

See also

recipipe.core.RecipipeTransformer.col_format

Raises

ValueError if self.cols is None.
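The three mapping shapes can be written down as plain dicts. The column names below are made up for illustration; only the use of tuples as keys and values follows the description of the return value.

```python
# 1:1, one input column produces one output column.
one_to_one = {"price": "price_scaled"}

# tuple:1, several input columns produce a single output column.
many_to_one = {("a", "b"): "sum_a_b"}

# 1:tuple, one input column produces several output columns.
one_to_many = {"color": ("color=red", "color=blue")}

# Lists cannot be used as keys because they are not hashable.
try:
    bad_mapping = {["a", "b"]: "sum_a_b"}
except TypeError as error:
    list_key_error = str(error)
    print(list_key_error)
```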

class recipipe.transformers.SklearnColumnsWrapper(sk_transformer, *args, **kwargs)[source]

Bases: recipipe.transformers.ColumnsTransformer

__init__(sk_transformer, *args, **kwargs)[source]

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, that input column is not kept even if keep_original is True. Default: False.

  • col_format (str) – Format of the new column names. "{}" is replaced by the original column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.
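The col_format placeholder behaves like an ordinary Python format string. The sketch below assumes the names are built with str.format, which matches the "{}" syntax described above but is not confirmed by this document; the INDICATOR({}) pattern is the one listed for the indicator alias.

```python
input_cols = ["price", "qty"]

# Append a suffix to every generated column.
suffix_format = "{}_new"
suffixed = [suffix_format.format(col) for col in input_cols]

# Wrap the original name, as in the indicator alias.
indicator_format = "INDICATOR({})"
wrapped = [indicator_format.format(col) for col in input_cols]

# The default format leaves the name unchanged.
unchanged = ["{}".format(col) for col in input_cols]
print(suffixed, wrapped, unchanged)
```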

__module__ = 'recipipe.transformers'
get_column_mapping()[source]

Get the column mapping between the input and transformed DataFrame.

By default it returns a 1:1 map between the input and output columns. Make sure your transformer is fitted before calling this function.

Returns

A dict in which the keys are input DataFrame column names and the values are output DataFrame column names. Both keys and values can be tuples: a tuple:1 entry indicates that one output column was created from several input columns, and a 1:tuple entry indicates that several output columns come from one input column. Tuples are used instead of lists because lists are not hashable, so they cannot be keys in a dict.

See also

recipipe.core.RecipipeTransformer.col_format

Raises

ValueError if self.cols is None.

class recipipe.transformers.SklearnCreator(sk_transformer, **kwargs)[source]

Bases: object

Utility class to generate SKLearn wrappers.

Use this class to reuse any existing SKLearn transformer in your recipipe pipelines.

Parameters

sk_transformer (sklearn.base.TransformerMixin) – Any instance of an SKLearn transformer you want to incorporate in your recipipe pipelines.

Example:

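from sklearn.preprocessing import OneHotEncoder
from recipipe import recipipe
from recipipe.transformers import SklearnCreator

# Create a onehot encoder using the SKLearn OneHotEncoder class.
onehot = SklearnCreator(OneHotEncoder(sparse=False))
# Now you can use the onehot variable as a transformer in a recipipe.
recipipe() + onehot(dtype="string")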
WRAPPERS = {'column': <class 'recipipe.transformers.SklearnColumnWrapper'>, 'columns': <class 'recipipe.transformers.SklearnColumnsWrapper'>, 'fit_one_col': <class 'recipipe.transformers.SklearnFitOneWrapper'>}
__call__(*args, wrapper='columns', **kwargs)[source]

Instantiate a SKLearn wrapper using a copy of the given transformer.

It’s important to make this copy to avoid fitting the same transformer several times.
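Why the copy matters can be shown with a small stand-in. MeanCenter and CreatorSketch are hypothetical classes and copy.deepcopy is an assumption about how the copy is made (scikit-learn's sklearn.base.clone is another common choice); the point is only that each call hands out an independent, unfitted object.

```python
import copy


class MeanCenter:
    """Hypothetical transformer that remembers the mean it was fitted on."""

    def __init__(self):
        self.mean_ = None

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self


class CreatorSketch:
    """Hypothetical factory: every call returns a fresh copy of the template."""

    def __init__(self, transformer):
        self.transformer = transformer

    def __call__(self):
        return copy.deepcopy(self.transformer)


make = CreatorSketch(MeanCenter())
first = make().fit([1, 2, 3])
second = make().fit([10, 20, 30])
# Each copy keeps its own fitted state; the template stays unfitted.
print(first.mean_, second.mean_, make.transformer.mean_)
```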

__init__(sk_transformer, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

__module__ = 'recipipe.transformers'
__weakref__

list of weak references to the object (if defined)

class recipipe.transformers.SklearnFitOneWrapper(sk_transformer, *args, **kwargs)[source]

Bases: recipipe.transformers.SklearnColumnWrapper

Fit on a concatenation of all the columns, then apply to each column one by one.

This is useful when two or more columns are closely related, for example when all those columns share the same category type.
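The fit-once, apply-per-column idea can be sketched with pandas alone. The column names are made up, and encoding with pandas.Categorical stands in for whatever SKLearn transformer would be wrapped in practice.

```python
import pandas as pd

df = pd.DataFrame({
    "home_team": ["ajax", "psv", "ajax"],
    "away_team": ["psv", "feyenoord", "feyenoord"],
})
cols = ["home_team", "away_team"]

# "Fit": learn one shared vocabulary from the concatenation of all columns.
stacked = pd.concat([df[c] for c in cols], ignore_index=True)
categories = sorted(stacked.unique())

# "Apply one by one": encode each column with the shared vocabulary, so the
# same team gets the same code in both columns.
encoded = pd.DataFrame(
    {c: pd.Categorical(df[c], categories=categories).codes for c in cols}
)
print(categories)
print(encoded)
```

Fitting each column separately would assign codes independently per column, so "psv" could end up with different codes in home_team and away_team.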

__module__ = 'recipipe.transformers'
class recipipe.transformers.SumTransformer(*args, groups=False, cols_init=None, **kwargs)[source]

Bases: recipipe.transformers.ColumnGroupsTransformer

Sum columns.

__module__ = 'recipipe.transformers'
class recipipe.transformers.TargetEncoderTransformer(*args, target=None, min_samples_leaf=1, smoothing=1, **kwargs)[source]

Bases: recipipe.transformers.ColumnTransformer

Target encoder.

Computed as described in: A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, by Daniele Micci-Barreca.

Code partially taken from: https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283
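As a rough sketch of the smoothing scheme from the cited paper: each category mean is blended with the global mean (the prior), with a weight that grows with the number of samples in the category. The function target_encode below is hypothetical and is not the transformer's code; the parameter names min_samples_leaf and smoothing are taken from the signature above.

```python
import numpy as np
import pandas as pd


def target_encode(categories, target, min_samples_leaf=1, smoothing=1):
    """Return one smoothed target mean per category (illustrative only)."""
    stats = target.groupby(categories).agg(["mean", "count"])
    # Sigmoid weight: close to 1 for frequent categories, 0.5 when the
    # count equals min_samples_leaf, close to 0 for rarer ones.
    weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    prior = target.mean()
    return prior * (1 - weight) + stats["mean"] * weight


cat = pd.Series(["a", "a", "a", "b"])
y = pd.Series([1, 1, 0, 0])

encoding = target_encode(cat, y)
encoded_column = cat.map(encoding)
print(encoding)
```

In this toy data, category "b" has a single sample, so its encoding sits halfway between its own mean (0) and the prior (0.5).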

__init__(*args, target=None, min_samples_leaf=1, smoothing=1, **kwargs)[source]

Create a new transformer.

Column names can use Unix filename pattern matching (fnmatch).

Parameters
  • *args (list of str) – List of columns the transformer will work on.

  • cols_init (list of str) – List of columns the transformer will work on. If *args are provided, this list of columns is appended at the end.

  • exclude (list of str) – List of columns to exclude. The exclusion is applied after fitting the columns, so it can be used at the same time as *args and cols_init.

  • dtype (numpy.dtype, str, list of numpy.dtype or str, or dict) – This value is passed to pandas.DataFrame.select_dtypes. If a dict is given, the Pandas function is called with dictionary unpacking: select_dtypes(**dtype). In this way you can exclude, for example, int dtypes using dtype=dict(exclude=int). The columns returned by this method (executed on the DataFrame passed to the fit method) are the columns used in the transformation phase. When used in combination with *args or cols_init, the dtype filter is applied afterwards.

  • name (str) – Human-friendly name of the transformer.

  • keep_original (bool) – True if you want to keep the input columns used by the transformer in the transformed DataFrame, False if not. Note that if an output column has the same name as an input column, that input column is not kept even if keep_original is True. Default: False.

  • col_format (str) – Format of the new column names. "{}" is replaced by the original column name. For example, to append the string "_new" to all the generated columns, set col_format="{}_new". Default: "{}".

  • cols_not_found_error (bool) – Raise an error if there isn’t any match for any of the specified columns. Default: False.

__module__ = 'recipipe.transformers'

recipipe.utils module

recipipe.utils.add_to_map_dict(col_map, k, v)[source]

Stores k and v in the given col_map.

If k is a string, col_map[k] += list(v). If k is a tuple, for each i in k: col_map[i] += list(v).

Parameters
  • col_map (dict) – Dictionary in which the keys and values will be stored.

  • k (str or tuple) – Tuples will be split and col_map will contain a string key with a list of values taken from v.

  • v (str or tuple) – Value appended to each of the keys in k.
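The behavior described above can be approximated as follows. add_to_map_sketch is a hypothetical reimplementation, not the library function; in particular, treating a string v as a single value (instead of splitting it into characters) is an assumption.

```python
def add_to_map_sketch(col_map, k, v):
    """Append the values in v to the list stored under each key in k."""
    keys = (k,) if isinstance(k, str) else k
    # Assumption: a plain string is one value, not a sequence of characters.
    values = [v] if isinstance(v, str) else list(v)
    for key in keys:
        col_map.setdefault(key, []).extend(values)


col_map = {}
# One input column that produced two output columns.
add_to_map_sketch(col_map, "color", ("color=red", "color=blue"))
# Two input columns that produced one output column.
add_to_map_sketch(col_map, ("a", "b"), "a+b")
print(col_map)
```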

recipipe.utils.default_params(fun_kwargs, default_dict=None, **kwargs)[source]

Add to kwargs and/or default_dict the values of fun_kwargs.

This function allows the user to overwrite the default values of some parameters. In the following example, the user cannot pass a value for the parameter a, because a would then be passed twice to another_fun:

>>> def fun(**kwargs):
...     return another_fun(a="a", **kwargs)

You can solve this in two ways. The first one:

>>> def fun(a="a", **kwargs):
...     return another_fun(a=a, **kwargs)

Or using default_params:

>>> def fun(**kwargs):
...     kwargs = default_params(kwargs, a="a")
...     return another_fun(**kwargs)
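One plausible reading of this merge, written out as a runnable sketch: defaults come first and the values supplied by the caller win. default_params_sketch and another_fun are hypothetical stand-ins, and the precedence order is an assumption based on the description above.

```python
def default_params_sketch(fun_kwargs, default_dict=None, **kwargs):
    """Merge defaults with user kwargs; user-supplied values take priority."""
    merged = dict(default_dict or {})
    merged.update(kwargs)       # defaults given as keyword arguments
    merged.update(fun_kwargs)   # values passed by the caller override them
    return merged


def another_fun(a, b=0):
    return (a, b)


def fun(**kwargs):
    kwargs = default_params_sketch(kwargs, a="a")
    return another_fun(**kwargs)


print(fun())              # uses the default for a
print(fun(a="z", b=1))    # the caller overrides a without a duplicate-argument error
```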
recipipe.utils.fit_columns(df, cols=None, dtype=None, raise_error=True, drop_duplicates=True)[source]

Fit columns to a DataFrame.

If neither cols nor dtype is given, df.columns is returned. If both cols and dtype are given, cols is applied first and dtype is then applied over the resulting columns.

Note that an empty list can be returned if df does not contain columns.
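The order of operations can be sketched as follows. fit_columns_sketch is a hypothetical simplification (no error handling, no raise_error flag) that assumes case-sensitive fnmatch matching and uses dict.fromkeys to drop duplicates while keeping order.

```python
from fnmatch import fnmatchcase

import pandas as pd


def fit_columns_sketch(df, cols=None, dtype=None):
    """Resolve patterns first, then filter by dtype, then drop duplicates."""
    names = list(df.columns)
    if cols is not None:
        # Each pattern is expanded against the DataFrame columns.
        names = [c for pattern in cols for c in df.columns
                 if fnmatchcase(c, pattern)]
    if dtype is not None:
        typed = set(df.select_dtypes(dtype).columns)
        names = [c for c in names if c in typed]
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(names))


df = pd.DataFrame({"price_usd": [1.0], "price_label": ["cheap"], "qty": [3]})

all_cols = fit_columns_sketch(df)
price_cols = fit_columns_sketch(df, cols=["price_*", "price_usd"])
numeric_price_cols = fit_columns_sketch(df, cols=["price_*"], dtype="number")
print(all_cols, price_cols, numeric_price_cols)
```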

Parameters
  • df (pandas.DataFrame) – DataFrame that is being fitted.

  • cols (list) – List of columns to fit. The names may contain Unix filename pattern matching (fnmatch) symbols.

  • dtype – Any value supported by pandas.DataFrame.select_dtypes.

  • raise_error (bool) – If True and no column in df matches a given column in cols, an exception is raised.

  • drop_duplicates (bool) – Remove duplicates keeping the order. Default: True.

Returns

List of existing columns in df that satisfy the constraints of dtype and cols.

Raises
recipipe.utils.flatten_list(cols_list, recursive=True)[source]

Take a list of lists and return a flattened list.

Parameters

cols_list – an iterable of any quantity of str/tuple/list/set.

Example

>>> flatten_list(["a", ("b", set(["c"])), [["d"]]])
["a", "b", "c", "d"]
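A recursive flattening of this kind can be sketched in a few lines. flatten_sketch is hypothetical and always recurses; it does not model the recursive parameter of the real function.

```python
def flatten_sketch(items):
    """Flatten nested lists, tuples and sets into a single list."""
    flat = []
    for item in items:
        if isinstance(item, (list, tuple, set)):
            flat.extend(flatten_sketch(item))
        else:
            # Strings are treated as atoms, not as iterables of characters.
            flat.append(item)
    return flat


nested = ["a", ("b", {"c"}), [["d"]]]
flat_cols = flatten_sketch(nested)
print(flat_cols)
```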
recipipe.utils.is_categorical(s, column=None)[source]

Check if a pandas Series or a column in a DataFrame is categorical.

Parameters
  • s (pandas.Series or pandas.DataFrame) – Series to check, or DataFrame containing the column to check.

  • column (str) – Column name. If a column name is given, s is assumed to be a pandas.DataFrame.
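A check like this can be written with pandas.CategoricalDtype. The function below is a hypothetical sketch of the described behavior, not the library's implementation.

```python
import pandas as pd


def is_categorical_sketch(s, column=None):
    """True if the Series (or the named DataFrame column) has category dtype."""
    series = s if column is None else s[column]
    return isinstance(series.dtype, pd.CategoricalDtype)


df = pd.DataFrame({"color": ["red", "blue"], "qty": [1, 2]})
df["color"] = df["color"].astype("category")

print(is_categorical_sketch(df["color"]))       # Series form
print(is_categorical_sketch(df, column="qty"))  # DataFrame plus column name
```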

recipipe.utils.memory_usage_mb(df, *args, **kwargs)[source]

DataFrame memory usage in MB.
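A helper of this shape might forward its arguments to pandas.DataFrame.memory_usage and convert the total to megabytes. The sketch below is hypothetical; in particular, dividing by 1024 ** 2 (rather than 10 ** 6) is an assumption about what MB means here.

```python
import numpy as np
import pandas as pd


def memory_usage_mb_sketch(df, *args, **kwargs):
    """Total memory of df in MB; extra arguments go to DataFrame.memory_usage."""
    total_bytes = df.memory_usage(*args, **kwargs).sum()
    return total_bytes / 1024 ** 2


# 131072 int64 values occupy exactly 1024 * 1024 bytes.
df = pd.DataFrame({"x": np.arange(131072, dtype="int64")})
size_mb = memory_usage_mb_sketch(df, index=False)
print(size_mb)
```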