Bring your data to life with just one line of code
As a data science student, I am often reminded that data scientists can spend 60–80% of their time cleaning and managing data… and that’s why Exploratory Data Analysis is so important. EDA is not the most glamorous task, but it lays the foundation for the rest of the work you will do. It should be approached mindfully and methodically.
I am also interested in adding as many tools to my collection as I can, which is why I was so intrigued when I stumbled across Pandas Profiling.
So what does it do?
As the creators describe it:
Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics (if relevant for the column type) are presented in an interactive HTML report:
Type inference: detect the types of columns in a dataframe.
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
Missing values: matrix, count, heatmap and dendrogram of missing values
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
All of it happens with just one line of code.
Before we get to that, I will run through a few of the usual EDA steps for comparison:
# import our good friends pandas and seaborn
import pandas as pd
import seaborn as sns
Now, I’ll read in a .csv file from this Kaggle dataset on heart attack possibility:
After reading in the data, I can begin to inspect the data using the following:
.head()
to have a look at the first few rows of the DataFrame,
.shape
to see how many rows and columns there are, and
.describe()
to get more detail on the numerical values.
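The three inspection steps above can be sketched like this. Since the heart.csv file itself isn’t reproduced here, the snippet builds a tiny made-up frame with a few of the columns the article mentions:

```python
import pandas as pd

# Tiny stand-in for the Kaggle heart dataset (column names assumed from the article)
df = pd.DataFrame({
    'age':      [63, 37, 41, 56],
    'sex':      [1, 1, 0, 1],
    'trestbps': [145, 130, 130, 120],
    'chol':     [233, 250, 204, 236],
    'target':   [1, 1, 1, 0],
})

print(df.head(2))     # first few rows of the DataFrame
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # summary statistics for the numerical columns
```

With the real dataset you would replace the inline frame with pd.read_csv('heart.csv').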
At first glance, I can see that this dataset is all numerical values, but a mix of continuous (‘age’, ‘trestbps’, ‘chol’) and probably categorical (‘sex’, ‘fbs’, ‘target’).
A quick seaborn pair plot can shed some light on those numbers, and also check for any obvious correlations:
sns.pairplot(df, dropna=True)
From this high up, I can see some blobs that look like probable correlations, and dots that likely represent some categorical values. It is hard to discern much more at this level.
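To put numbers behind those blobs, a plain df.corr() flags strongly correlated pairs without any plotting. This is a sketch on a small made-up frame (the real heart.csv isn’t reproduced here):

```python
import pandas as pd

# Small stand-in frame with a few assumed heart.csv columns
df = pd.DataFrame({
    'age':      [63, 37, 41, 56, 57],
    'trestbps': [145, 130, 130, 120, 140],
    'chol':     [233, 250, 204, 236, 192],
})

corr = df.corr()  # Pearson by default; method='spearman' or 'kendall' also work
# keep only off-diagonal pairs with |r| above a chosen threshold
strong = corr.where((corr.abs() > 0.5) & (corr.abs() < 1.0))
print(strong)
```

Any cell that survives the filter is a pair worth a closer look in the pair plot.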
Now let’s try this with Pandas Profiling.
There are several installation methods documented here. I will use pip install:
pip install pandas_profiling
In my Jupyter notebook, I import Pandas and Pandas-Profiling, then read in the .csv file.
import pandas as pd
import pandas_profiling

# read in the heart.csv
heart_df = pd.read_csv('heart.csv')
Next, I run the profile report:
heart_df.profile_report()
Pandas-Profiling does take a little time to run and returns a pretty long report. I will go over the highlights below with a gif and some screenshots.
With a few additional lines of code, I can create an even friendlier version for my Jupyter notebook or output the profile report as an HTML file:
from pandas_profiling import ProfileReport

profile = ProfileReport(heart_df, title='Heart Data Profiling Report', explorative=True, minimal=False)

# to notebook
profile.to_notebook_iframe()

# output to file
profile.to_file(output_file="heart_output.html")
Let’s take a closer look at the report sections:
‘Overview’ provides a statistical breakdown of the DataFrame contents.
‘Variables’ displays an analysis of each variable.
Click on the “Toggle details” button for a more detailed look at a variable.
‘Interactions’ shows the relationship between numerical variables. You can select two variables to produce a scatter plot.
‘Correlations’ displays a heatmap of several types of correlations.
‘Missing Values’ shows a count or matrix of missing values for each variable.
‘Sample’ is a stand-in for the df.head() and df.tail() methods.
‘Duplicate Rows’ shows any duplicate rows in the DataFrame.
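For comparison, the ‘Missing Values’ and ‘Duplicate Rows’ sections can be reproduced by hand in plain pandas. A sketch on a small made-up frame:

```python
import pandas as pd

# Tiny example frame: one missing value, one exact duplicate row
df = pd.DataFrame({
    'age':  [63, 37, 63, None],
    'chol': [233, 250, 233, 236],
})

print(df.isna().sum())        # per-column missing-value counts ('Missing Values')
print(df.duplicated().sum())  # number of duplicate rows ('Duplicate Rows')
```

The report gathers all of these checks (and many more) into one document, which is exactly where the one-liner earns its keep.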
In conclusion
Does Pandas Profiling replace all of the other methods of EDA?
No, of course not.
Is it fun to use?
Definitely!
Is it a bit slow?
Yes, it does take time to load and render the profile report, but I think it is an extremely useful tool for Exploratory Data Analysis. My advice is to start it up, pour a cup of coffee, and sit down to a report that will help narrow your focus as you dive into the task at hand.
Have fun exploring!