Exploratory Data Analysis (EDA) with Python: Discovering Insights Before You Predict
- Samul Black

- Jan 7, 2024
- 8 min read
- Updated: Jul 23
In the world of data science, rushing into modeling without understanding your dataset is like jumping into a conversation without knowing the topic. This blog will walk you through the essential steps of Exploratory Data Analysis with Python, from visualizing distributions to identifying missing values and relationships between features.
Using the Titanic dataset as an example, we’ll not only explain the concepts but also show you the actual code to perform each analysis. So, if you're ready to stop guessing and start understanding your data — let's dive in.

Exploratory Data Analysis (EDA): Uncovering the Story Hidden in Your Data
Before diving into machine learning models or predictive analytics, it's essential to take a moment and truly understand the data you're working with. That’s where Exploratory Data Analysis (EDA) comes in — the process of making sense of data before making assumptions about it. Think of EDA as the detective phase of data science, where you're trying to understand what's going on, what's normal, what's unusual, and what questions you might ask next.
EDA is more than just crunching numbers. It's about looking at the data visually, summarizing it statistically, and spotting patterns or oddities that might otherwise go unnoticed. It’s the stage where you explore — not predict — and where you ask "what is this data trying to tell me?"
Using a combination of statistical summaries, graphical techniques, and relationship analyses, EDA gives you a holistic view of the dataset. It helps in uncovering trends, correlations, anomalies, and missing values — all the things that influence how the rest of your data pipeline will perform. Most importantly, it equips you to make informed decisions about how to clean, transform, and model your data going forward.
What Are the Goals of Exploratory Data Analysis (EDA)?
EDA doesn’t follow a strict checklist, but it does have a few core objectives:
Understanding the Data Structure: This includes exploring the size of the dataset, types of variables, data distributions, and recognizing things like missing values or duplicated records.
Detecting Patterns and Relationships: Whether it’s trends over time or correlations between variables, EDA allows you to visually and statistically explore how your features interact.
Spotting Anomalies and Outliers: Outliers can distort your analysis, and EDA helps you identify and decide how to handle them.
Formulating Hypotheses: Based on initial findings, you can begin to generate questions and hypotheses to explore in more depth during the modeling stage.
Core Techniques Used in Exploratory Data Analysis (EDA)
To accomplish these goals, data scientists rely on a wide range of techniques:
Descriptive Statistics - Get a quick snapshot of your dataset using metrics like mean, median, mode, range, and standard deviation.
Univariate Analysis - Focus on individual variables to understand their distribution. Tools like histograms, box plots, and bar charts come in handy here.
Bivariate and Multivariate Analysis - Examine how variables relate to one another through scatter plots, correlation matrices, and heatmaps. These reveal hidden patterns or relationships in the data.
Data Visualization - Charts like line graphs, violin plots, and pie charts help make data easier to interpret visually and communicate to non-technical stakeholders.
Outlier Detection - Spot outliers using methods like Z-score analysis or box plots. These data points might need to be excluded or separately analyzed, depending on the context.
Handling Missing Data - Use techniques like mean/median imputation, forward fill, or more advanced strategies to address incomplete data entries.
Dimensionality Reduction - Tools like Principal Component Analysis (PCA) and t-SNE help reduce the complexity of your dataset, making it easier to visualize and analyze high-dimensional data.
Clustering and Segmentation - Use unsupervised learning techniques like K-means clustering to group similar data points. This helps identify subgroups and behavioral patterns within the dataset. (A brief code sketch of outlier detection, PCA, and K-means follows this list.)
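Outlier detection, dimensionality reduction, and clustering don't reappear in the Titanic walkthrough later in this post, so here is a minimal sketch of all three on synthetic data. The column names, the 3-standard-deviation cutoff, and the choice of three clusters are illustrative assumptions, not recommendations for any particular dataset.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Synthetic numeric data (column names are illustrative only)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature_a": rng.normal(50, 10, 200),
    "feature_b": rng.normal(0, 1, 200),
    "feature_c": rng.exponential(5, 200),
})
# 1. Outlier detection: flag rows more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
outlier_mask = (z_scores.abs() > 3).any(axis=1)
print("Rows flagged as outliers:", outlier_mask.sum())
# 2. Dimensionality reduction: standardize, then project onto two principal components
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
# 3. Clustering: group the projected points with K-means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(components)
print("Cluster sizes:", pd.Series(labels).value_counts().to_dict())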
Frameworks and Libraries for Exploratory Data Analysis (EDA)
If you're working in Python, you’ll be happy to know that EDA is well-supported by several powerful and intuitive libraries:
Pandas – For data manipulation and summarization.
Matplotlib & Seaborn – For creating insightful and aesthetic visualizations.
Plotly – For interactive, browser-based data plots.
NumPy & SciPy – For statistical calculations and numerical operations.
Scikit-learn – For more advanced techniques like clustering and dimensionality reduction.
Sweetviz / Pandas-Profiling – For automated EDA reports that summarize data distributions, missing values, and correlations.
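As a quick illustration of the automated-report approach, the sketch below assumes sweetviz and ydata-profiling (the successor to pandas-profiling) are installed; both turn a DataFrame into a standalone HTML report in a couple of lines.
import seaborn as sns
import sweetviz as sv
from ydata_profiling import ProfileReport
df = sns.load_dataset('titanic')
# Sweetviz: analyze the DataFrame and write an interactive HTML report
sv.analyze(df).show_html("sweetviz_report.html")
# ydata-profiling: summarizes distributions, missing values, and correlations in one report
ProfileReport(df, title="Titanic Profiling Report").to_file("titanic_profile.html")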
Why Exploratory Data Analysis (EDA) Matters
Skipping or rushing through EDA is like trying to write a book without understanding your characters. It’s during this exploratory phase that you gain the most intuition about the data — how clean or messy it is, what surprises it holds, and where you should focus your energy going forward. Whether you're building a model, creating dashboards, or simply trying to understand a dataset for business decisions, EDA lays the groundwork for everything that follows.
Now that we've walked through the concepts and techniques of EDA, it's time to get our hands dirty. In the next section, we’ll demonstrate these principles in action using Python and some real-world data.
Exploratory Data Analysis on Titanic Dataset using Python
Let’s perform a detailed EDA step-by-step using the Titanic dataset with Python libraries such as pandas, matplotlib, seaborn, and numpy.
1. Import Libraries and Load Data
We start by importing the required Python libraries. pandas and numpy help with data handling and numerical operations. seaborn and matplotlib.pyplot are used for creating plots and charts. We set Seaborn's style to "whitegrid" for cleaner visuals.
The Titanic dataset is loaded using Seaborn’s built-in load_dataset() function. It includes information about passengers such as age, class, gender, and survival status. Using df.head(), we preview the first few rows to get an initial look at the data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set plot style
sns.set(style="whitegrid")
# Load Titanic dataset from Seaborn
df = sns.load_dataset('titanic')
# Display first 5 rows
df.head()
Output:
index | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | true | NaN | Southampton | no | false |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | false | C | Cherbourg | yes | false |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | false | NaN | Southampton | yes | true |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | false | C | Southampton | yes | false |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | true | NaN | Southampton | no | true |
2. Dataset Overview
Before diving into analysis, we quickly inspect the structure of the dataset. The shape tells us the number of rows and columns. df.info() shows data types and non-null counts for each column, helping us spot missing values. Finally, df.describe() provides summary statistics for both numeric and categorical features.
# Shape of the dataset
print("Shape of dataset:", df.shape)
# Data types and non-null counts
print("\nInfo:")
df.info()
# Summary statistics
print("\nSummary Statistics:")
df.describe(include='all')
Output:
Shape of dataset: (891, 15)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
Summary Statistics:
index | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
count | 891.0 | 891.0 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 889 | 891 | 891 | 891 | 203 | 889 | 891 | 891 |
unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 | 3 | 3 | 2 | 7 | 3 | 2 | 2 |
top | NaN | NaN | male | NaN | NaN | NaN | NaN | S | Third | man | true | C | Southampton | no | true |
freq | NaN | NaN | 577 | NaN | NaN | NaN | NaN | 644 | 491 | 537 | 537 | 59 | 644 | 549 | 537 |
mean | 0.384 | 2.309 | NaN | 29.699 | 0.523 | 0.382 | 32.204 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 0.487 | 0.836 | NaN | 14.526 | 1.103 | 0.806 | 49.693 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 0.0 | 1.0 | NaN | 0.42 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 0.0 | 2.0 | NaN | 20.125 | 0.0 | 0.0 | 7.9104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 0.0 | 3.0 | NaN | 28.0 | 0.0 | 0.0 | 14.4542 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 1.0 | 3.0 | NaN | 38.0 | 1.0 | 0.0 | 31.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 1.0 | 3.0 | NaN | 80.0 | 8.0 | 6.0 | 512.3292 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3. Checking Missing Values
Missing data can affect the quality of our analysis, so it's important to identify it early. Using isnull().sum(), we count missing values in each column and sort them in descending order. We then display only the columns that have missing entries.
# Count of missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing[missing > 0]
Output:
deck | 688 |
age | 177 |
embarked | 2 |
embark_town | 2 |
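Counting the gaps is only the first step. One common way to handle them, sketched below with illustrative choices rather than a prescription, is to fill age with its median, fill the two missing embarkation values with the most frequent port, and drop deck, which is missing for most rows. We work on a copy so the original df used in the sections that follow is left untouched.
# One possible cleaning pass on a copy of the data (illustrative choices)
df_clean = df.copy()
# Median imputation for the numeric age column
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())
# Fill the two missing embarkation entries with the most frequent value
df_clean['embarked'] = df_clean['embarked'].fillna(df_clean['embarked'].mode()[0])
df_clean['embark_town'] = df_clean['embark_town'].fillna(df_clean['embark_town'].mode()[0])
# deck is missing for 688 of 891 rows, so drop the column entirely
df_clean = df_clean.drop(columns=['deck'])
print("Remaining missing values:", df_clean.isnull().sum().sum())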
4. Univariate Analysis
Univariate analysis focuses on examining one variable at a time to understand its distribution and characteristics. It helps identify patterns, outliers, or imbalances in the data.
Categorical Variables
We start with categorical variables using Seaborn’s countplot:
The survived column shows how many passengers lived or died.
The sex column reveals the gender distribution.
The class column displays the spread across travel classes.
# Countplot of Survival
sns.countplot(x='survived', data=df)
plt.title("Survival Counts")
plt.show()
# Sex distribution
sns.countplot(x='sex', data=df)
plt.title("Gender Distribution")
plt.show()
# Passenger Class
sns.countplot(x='class', data=df)
plt.title("Passenger Class Distribution")
plt.show()
Outputs: three count plots showing survival counts, gender distribution, and passenger class distribution.
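If you want the exact counts behind these plots, value_counts() gives the same information numerically, using the columns already plotted above.
# Numeric counterparts of the count plots above
print(df['survived'].value_counts())
print(df['sex'].value_counts())
print(df['class'].value_counts())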
Numerical Variables
For numerical features, we use histograms with kernel density estimation (KDE) to visualize their distribution.
The age variable shows a concentration of passengers in their 20s and 30s, with fewer older passengers. The fare variable is right-skewed, with most passengers paying lower fares and a few paying significantly more.
# Distribution of age
sns.histplot(df['age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution")
plt.show()
# Distribution of fare
sns.histplot(df['fare'], kde=True)
plt.title("Fare Distribution")
plt.show()
Output: histograms with KDE curves for age and fare.
5. Bivariate Analysis
Bivariate analysis helps us explore the relationship between two variables. This is useful for understanding how one feature may influence another—especially the survived column in our case.
Survival by Gender: We use a grouped countplot to compare survival rates across male and female passengers.
Survival by Passenger Class: This plot shows how survival chances varied across different travel classes.
Age vs Fare: A scatter plot shows the relationship between age and fare, with color representing survival status. This helps identify clusters or patterns among survivors and non-survivors.
# Survival by Gender
sns.countplot(x='sex', hue='survived', data=df)
plt.title("Survival by Gender")
plt.show()
# Survival by Passenger Class
sns.countplot(x='class', hue='survived', data=df)
plt.title("Survival by Passenger Class")
plt.show()
# Age vs Fare
sns.scatterplot(x='age', y='fare', hue='survived', data=df)
plt.title("Age vs Fare by Survival")
plt.show()
Output: count plots of survival by gender and by class, plus a scatter plot of age vs fare colored by survival.
These visualizations offer key insights into which factors may have influenced survival on the Titanic.
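To put rough numbers on what the plots suggest, a simple pandas aggregation computes the survival rate for each group (a quick sketch; the exact figures are whatever your loaded data yields).
# Proportion of survivors by gender, by class, and by both together
print(df.groupby('sex')['survived'].mean())
print(df.groupby('class')['survived'].mean())
print(df.groupby(['sex', 'class'])['survived'].mean().unstack())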
6. Correlation Matrix (Numerical Features)
To understand how numerical features relate to each other, we compute a correlation matrix. This helps identify linear relationships—both positive and negative—between variables.
We first select only the numeric columns from the dataset, then compute their pairwise correlation. The heatmap visualizes these relationships, making it easy to spot strong correlations.
# Select only numerical columns
num_df = df.select_dtypes(include=['float64', 'int64'])
# Compute correlation matrix
corr = num_df.corr()
# Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
Output: a heatmap of the correlation matrix for the numeric features.
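To read the same information without the plot, you can sort each feature's correlation with survived directly, reusing the corr matrix computed above.
# Rank numeric features by their linear correlation with survival
print(corr['survived'].sort_values(ascending=False))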

7. Feature Relationships (Boxplots, Violin, etc.)
To explore how numerical features vary across categories, we use violin and box plots.
Age Distribution by Survival: The violin plot shows the distribution of age for survivors and non-survivors. It highlights both the spread and density of age values in each group.
Fare by Class: The box plot displays fare distributions across passenger classes. It shows medians, quartiles, and potential outliers, giving insights into pricing patterns based on class.
# Age Distribution by Survival
sns.violinplot(x='survived', y='age', data=df)
plt.title("Age Distribution by Survival")
plt.show()
# Fare by Class
sns.boxplot(x='class', y='fare', data=df)
plt.title("Fare by Passenger Class")
plt.show()
Output: a violin plot of age by survival and a box plot of fare by passenger class.
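Since the box plot makes the fare outliers visible, the standard 1.5 * IQR rule can count them explicitly (a sketch; the 1.5 multiplier is a common convention, not a requirement).
# Count fare values lying beyond 1.5 * IQR from the quartiles
q1, q3 = df['fare'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['fare'] < q1 - 1.5 * iqr) | (df['fare'] > q3 + 1.5 * iqr)]
print("Fare outliers by the 1.5*IQR rule:", len(outliers))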
Conclusion
Exploratory Data Analysis is more than just a preliminary step — it's the foundation of any meaningful data science project. Through statistical summaries and visual exploration, EDA uncovers hidden patterns, highlights data quality issues, and guides the choices you make in feature engineering and modeling.
By applying techniques like univariate and bivariate analysis, correlation heatmaps, and distribution plots, we’ve seen how a dataset can start to tell its story. In our Titanic dataset example, we observed the role of gender and class in survival rates, identified outliers in fare prices, and got a feel for the overall structure of the data.
As you move forward to model building or advanced analytics, remember: the insights you gather during EDA can make or break your results. Let the data speak — and listen carefully in this crucial phase.