Python package to make data cleaning easy

I made a simple Python package to make data cleaning easy.

Zahash Z
4 min read · Feb 21, 2020


This article is a small tutorial on how to use the package.

‘dfcleaner’ has many small helper functions, such as sanitize(), remove_outliers(), spot_irrelevant_columns(), and many more.

It also has built-in logging so that you know exactly what changed in the pandas DataFrame before and after applying any function.

So, without any further delay, let’s get started with the tutorial.

Installation

dfcleaner is available on PyPI, so you can install it with pip.

OS X & Linux:

pip3 install dfcleaner

Windows:

pip install dfcleaner

Importing the Package

from dfcleaner import cleaner

We also need pandas to read CSV files and create DataFrames:

import pandas as pd

If you want to enable logging, simply set the corresponding variables as shown below.

Make sure that the folder specified by ‘LOG_DIR’ exists; otherwise, it will raise an error.

cleaner.ENABLE_LOGGING = True
cleaner.LOG_DIR = './logs'

Now, all the log files will be stored in the directory specified by ‘LOG_DIR’.

Usage Examples

Let’s look at each function one by one.

.sanitize(arr)

This function takes an array of strings and “sanitizes” them.

Now, what does “sanitizes” mean?

It means that, for each string in the array, this function …
— keeps only alphanumeric, space, and underscore characters
— replaces multiple consecutive spaces with a single space
— strips leading and trailing whitespace
— replaces spaces with underscores
— converts CamelCase to snake_case
— collapses multiple consecutive underscores again
— converts the whole string to lowercase

… in the same order as mentioned above.

This function is mainly designed to sanitize column names.

df.columns = cleaner.sanitize(df.columns)
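For reference, the steps above can be sketched in a few lines of standard-library Python. This is a rough re-implementation for illustration, not the package’s actual source:

```python
import re

def sanitize_sketch(arr):
    """Rough sketch of the steps cleaner.sanitize() performs, in order."""
    result = []
    for s in arr:
        s = re.sub(r"[^0-9a-zA-Z _]", "", s)            # keep alphanumerics, spaces, underscores
        s = re.sub(r" +", " ", s).strip()               # collapse and strip spaces
        s = s.replace(" ", "_")                         # spaces -> underscores
        s = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", s)  # CamelCase -> snake_case
        s = re.sub(r"_+", "_", s)                       # collapse repeated underscores
        result.append(s.lower())                        # lowercase everything
    return result

print(sanitize_sketch(["First Name", "Credit-Card#", "hasCreditCard"]))
# -> ['first_name', 'credit_card', 'has_credit_card']
```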

.change_dtypes(df, conversion_dictionary)

This function takes a pandas DataFrame and a conversion dictionary as input and changes the datatypes of the columns according to the dictionary.

Usage Scenario 1:

Suppose a column has values like ‘$25.99’, ‘$ 7.2’, ‘$29,347.32’, … and you want to extract the float number while discarding string characters like ‘$’ and ‘,’. Just include the column name as the key and float as the value in the conversion dictionary.
Eg: {‘col_name’: float}

Usage Scenario 2:

Suppose a column’s values are actually floats, but because of a single string character somewhere, the whole column’s dtype becomes Object.
Eg: a column has the values 27.6, 48.125, 24, ‘?’, 74.32

The ‘?’ is a string character in a numeric column standing in for a nan value. In this case, simply include the corresponding key-value pair in the conversion dict.
Eg: {‘col_name’: float}

df = cleaner.change_dtypes(df, {'col_1':float, 'col_2':'category'})
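The kind of coercion both scenarios describe can be sketched with the standard library alone. The coerce_numeric helper below is a hypothetical illustration of the idea, not part of dfcleaner (which operates on whole DataFrames):

```python
import re

def coerce_numeric(values):
    """Strip non-numeric characters ('$', ',', ...) and parse floats;
    unparseable entries like '?' become None (np.nan in the real package)."""
    out = []
    for v in values:
        m = re.search(r"\d[\d,]*\.?\d*", str(v))  # grab the numeric part, commas included
        if m:
            out.append(float(m.group().replace(",", "")))
        else:
            out.append(None)                      # no digits found -> missing value
    return out

print(coerce_numeric(["$25.99", "$ 7.2", "$29,347.32", "?"]))
# -> [25.99, 7.2, 29347.32, None]
```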

.remove_outliers(df, std_coeff=1.5, label_col=None)

This function takes a DataFrame and replaces all the outliers with np.nan.

If the target (label) column name is given, that column is excluded when checking for and removing outliers.

The outliers are determined based on the ‘std_coeff’ given as a parameter.

df = cleaner.remove_outliers(df, 1.5, 'target_col_name_if_any')
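The article doesn’t spell out the exact outlier rule, but a common ‘std_coeff’-based approach — and a reasonable mental model for this function — flags values more than std_coeff standard deviations from the mean. A standard-library sketch of that assumption (not the package’s actual code):

```python
import statistics

def remove_outliers_sketch(values, std_coeff=1.5):
    """Replace values more than std_coeff standard deviations from the
    mean with None (the real function uses np.nan on DataFrame columns)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v if abs(v - mean) <= std_coeff * std else None for v in values]

print(remove_outliers_sketch([10, 12, 11, 13, 100]))
# -> [10, 12, 11, 13, None]
```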

.suggest_conversion_dict(df)

This function checks whether any string columns can be converted to float, and whether any numeric (int or float) columns can be converted to categorical columns with the dtype ‘category’.

Finally, it returns a conversion dictionary that can be passed as an argument to the change_dtypes() function.

Eg:
— a column may have numeric values in string format, like ‘12.0’ and ‘15.7’
— a column may have special symbols or characters in place of np.nan, like ‘?’, ‘null’, or ‘na’, alongside normal numbers like 12, 14.26, 85.15, …
— because of a single string value, the whole column might have the ‘Object’ or ‘category’ datatype

Eg:
— a column named has_credit_card may have only the binary values 1 and 0
— this column can be considered categorical because the ratio of the number of unique values in the column to the total number of values is very small (smaller than a threshold of 0.01)

conversion_dict = cleaner.suggest_conversion_dict(df)
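The unique-value-ratio heuristic described above is easy to illustrate. This is a sketch of the idea using the 0.01 threshold mentioned, not the package’s exact implementation:

```python
def looks_categorical(values, threshold=0.01):
    """A column is a 'category' candidate when the ratio of unique
    values to total values falls below the threshold."""
    return len(set(values)) / len(values) < threshold

print(looks_categorical([0, 1] * 500))       # 2 / 1000 = 0.002 -> True
print(looks_categorical(list(range(1000))))  # 1000 / 1000 = 1.0 -> False
```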

.spot_irrelevant_columns(cols)

This function takes an array of strings and looks for names like name, ID, Surname, First name, Last name, customer ID, …

The idea behind it is that you pass df.columns as the argument and it selects columns that might be irrelevant (based on their names), as mentioned above.

The selected columns can be later dropped from the DataFrame.

cols_to_drop = cleaner.spot_irrelevant_columns(df.columns)
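The name-matching idea can be sketched as follows. The keyword list here is an assumption for illustration, not the package’s actual list:

```python
def spot_irrelevant_sketch(cols):
    """Flag columns whose (roughly sanitized) names look like identifiers.
    The keyword set is a guess at the kind of list the package uses."""
    keywords = {"name", "id", "surname", "first_name", "last_name", "customer_id"}
    return [c for c in cols if c.lower().replace(" ", "_") in keywords]

print(spot_irrelevant_sketch(["Name", "age", "customer ID", "salary"]))
# -> ['Name', 'customer ID']
```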

There are many more features to be added in the future. If you like the package and want to contribute, you are welcome to make a pull request.

Just go to its GitHub repository, fork it, modify it, and make a PR.

Also, if you like any specific function in the package, you can always look at the source code and copy the function you want :)

