Jumpstart your EDA with Pandas-Profiling

Bring your data to life with just one line of code

Panda profile.jpg from Wikimedia Commons

As a data science student, I am often reminded that data scientists can spend 60% to 80% of their time cleaning and managing data… and that’s why Exploratory Data Analysis is so important. EDA is not the most glamorous task, but it lays the foundation for the rest of the work you will do. It should be approached mindfully and methodically.

I am also interested in adding as many tools to my collection as I can, which is why I was so intrigued when I stumbled across Pandas Profiling.

So what does it do?

As the creators describe it:

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:

Type inference: detect the types of columns in a dataframe.

Essentials: type, unique values, missing values

Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

Most frequent values

Histogram

Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

Missing values matrix, count, heatmap and dendrogram of missing values

Text analysis learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.

File and Image analysis extract file sizes, creation dates and dimensions and scan for truncated images or those containing EXIF information.

All of it happens with just one line of code.

Before we get to that, I will run through a few of the usual EDA steps for comparison:

# import our good friends pandas and seaborn
import pandas as pd
import seaborn as sns

Now, I’ll read in a .csv file from this Kaggle dataset on heart attack possibility:

After reading in the data, I can begin to inspect the data using the following:

.head() to have a look at the first few rows of the DataFrame,

.shape to see how many rows and columns are there, and

.describe() to get more detail on the numerical values.
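As a rough sketch of those three calls, here is a tiny stand-in DataFrame. The rows are made-up placeholders, not values from the Kaggle file; only the column names come from this article. With the real data, the DataFrame would presumably come from pd.read_csv('heart.csv') instead.

```python
import pandas as pd

# Placeholder rows for illustration only; NOT taken from the Kaggle dataset.
# With the real file this would be: df = pd.read_csv('heart.csv')
df = pd.DataFrame({
    "age": [50, 61, 45, 58],
    "sex": [1, 0, 1, 1],
    "trestbps": [120, 140, 130, 150],
    "chol": [210, 260, 190, 240],
    "fbs": [0, 1, 0, 0],
    "target": [1, 0, 1, 0],
})

first_rows = df.head(3)      # first three rows of the DataFrame
n_rows, n_cols = df.shape    # shape is a (rows, columns) tuple, not a method
summary = df.describe()      # count, mean, std, min, quartiles, max per numeric column

print(first_rows)
print(f"{n_rows} rows, {n_cols} columns")
print(summary.loc[["mean", "min", "max"]])
```

Note that .shape is an attribute, so it takes no parentheses, while .head() and .describe() are methods.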

At first glance, I can see that this dataset is all numerical values, but a mix of continuous (‘age’, ‘trestbps’, ‘chol’) and probably categorical (‘sex’, ‘fbs’, ‘target’) variables.
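One quick way to back up that hunch is to count the distinct values in each column. The sketch below uses made-up placeholder rows and a hypothetical cutoff (max_categories) that I picked for illustration; neither comes from the original analysis, and a sensible cutoff depends on the dataset.

```python
import pandas as pd

# Placeholder rows for illustration only; NOT taken from the Kaggle dataset.
df = pd.DataFrame({
    "age": [50, 61, 45, 58, 67, 39],
    "sex": [1, 0, 1, 1, 0, 0],
    "trestbps": [120, 140, 130, 150, 110, 125],
    "chol": [210, 260, 190, 240, 280, 200],
    "fbs": [0, 1, 0, 0, 1, 0],
    "target": [1, 0, 1, 0, 0, 1],
})

# Number of distinct values per column, in column order
unique_counts = df.nunique()

# Hypothetical rule of thumb: few distinct values suggests a categorical code
max_categories = 5
likely_categorical = unique_counts[unique_counts <= max_categories].index.tolist()
likely_continuous = unique_counts[unique_counts > max_categories].index.tolist()

print("Probably categorical:", likely_categorical)
print("Probably continuous:", likely_continuous)
```

This is only a heuristic. A numeric column with few distinct values can still be a genuine measurement, so it is worth checking against the dataset's documentation.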

A quick seaborn pair plot can shed some light on those numbers, and also check for any obvious correlations:

sns.pairplot(df, dropna=True)

Well, that does show us a lot of plots… but what can we see?

From this high up, I can see some blobs that look like probable correlations, and dots that likely represent some categorical values. It is hard to discern much more at this level.
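When the plots are too dense to read, a numeric correlation table can help confirm what the blobs hint at. This is a minimal sketch using pandas' own .corr() method on made-up placeholder rows (not the Kaggle data), ranking the columns by the strength of their Pearson correlation with ‘target’.

```python
import pandas as pd

# Placeholder rows for illustration only; NOT taken from the Kaggle dataset.
df = pd.DataFrame({
    "age": [50, 61, 45, 58, 67, 39],
    "sex": [1, 0, 1, 1, 0, 0],
    "trestbps": [120, 140, 130, 150, 110, 125],
    "chol": [210, 260, 190, 240, 280, 200],
    "fbs": [0, 1, 0, 0, 1, 0],
    "target": [1, 0, 1, 0, 0, 1],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr(method="pearson")

# Absolute correlation of every other column with 'target', strongest first
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)

print(corr.round(2))
print(target_corr.round(2))
```

Passing method="spearman" or method="kendall" to .corr() gives the rank-based alternatives.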

Now let’s try this with Pandas Profiling.

There are several installation methods documented here. I will use pip install:

pip install pandas_profiling

In my Jupyter notebook, I import Pandas and Pandas-Profiling, then read in the .csv file.

import pandas as pd
import pandas_profiling
# read in the heart.csv
heart_df = pd.read_csv('heart.csv')

Next, I run the profile report:

heart_df.profile_report()

Pandas-Profiling does take a little time to run and returns a pretty long report. I will go over the highlights below with a gif and some screenshots.

With a few additional lines of code, I can create an even friendlier version for my Jupyter notebook or output the profile report as an HTML file:

from pandas_profiling import ProfileReport
profile = ProfileReport(heart_df, title='Heart Data Profiling Report', explorative=True, minimal=False)
# to notebook
profile.to_notebook_iframe()
# output to file
profile.to_file(output_file="heart_output.html")

Let’s take a closer look at the report sections

‘Overview’ provides a statistical breakdown of the DataFrame contents.

‘Variables’ displays an analysis of each variable.

Click on the “Toggle details” button for a more detailed look at a variable.

‘Interactions’ shows the relationship between numerical variables. You can select two variables to produce a scatter plot.

‘Correlations’ displays a heatmap of several types of correlations.

‘Missing Values’ shows a count or matrix of the missing values for each variable.

‘Sample’ is a stand-in for the df.head() and df.tail() methods.

‘Duplicate Rows’ shows any duplicate rows in the DataFrame.
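For comparison, a few of these sections have rough plain-pandas counterparts. The sketch below is my own approximation on a made-up placeholder DataFrame, not how the library computes its report: it counts missing values per column, lists duplicate rows, and stitches together a head-and-tail sample.

```python
import numpy as np
import pandas as pd

# Placeholder rows for illustration only; row 2 repeats row 1, and one 'chol' is missing.
df = pd.DataFrame({
    "age": [50, 61, 61, 45],
    "chol": [210.0, 260.0, 260.0, np.nan],
    "target": [1, 0, 0, 1],
})

# Rough counterpart to 'Missing Values': number of NaN entries per column
missing_per_column = df.isna().sum()

# Rough counterpart to 'Duplicate Rows': rows that repeat an earlier row
duplicate_rows = df[df.duplicated(keep="first")]

# Rough counterpart to 'Sample': the first and last rows together
sample = pd.concat([df.head(2), df.tail(2)])

print(missing_per_column)
print(duplicate_rows)
print(sample)
```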

In conclusion

Does Pandas Profiling replace all of the other methods of EDA?

No, of course not.

Is it fun to use?

Definitely!

Is it a bit slow?

Yes, it does take time to load and render the profile report, but I think it is an extremely useful tool for Exploratory Data Analysis. My advice is to start it up, pour a cup of coffee, and then sit down to a report that will help narrow your focus as you dive into the task at hand.
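If the wait gets too long on a bigger dataset, one common workaround is to profile a random sample of the rows rather than the whole DataFrame. This is a sketch of that idea, not something from my original run: the helper name sample_for_profiling, the max_rows default, and the synthetic big_df are all hypothetical.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=1000, seed=0):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # random_state makes the same rows come back on every run
    return df.sample(n=max_rows, random_state=seed)


# Synthetic stand-in for a large dataset (illustration only)
big_df = pd.DataFrame({"x": range(5000), "y": [i % 2 for i in range(5000)]})

small_df = sample_for_profiling(big_df, max_rows=1000)
print(len(big_df), "->", len(small_df))

# The sample could then be profiled as before (requires pandas_profiling):
# small_df.profile_report()
```

Keep in mind that a report built on a sample describes only that sample, so counts of missing values or duplicate rows will be approximate. Setting minimal=True in the ProfileReport call shown earlier is another option that should reduce the amount of computation.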

Have fun exploring!

Data Scientist, Musician, Cocktail Maker. Let’s connect: linkedin.com/in/dannmorr/