
Exploratory Data Analysis (EDA) with Python: Discovering Insights Before You Predict

  • Writer: Samul Black
  • Jan 7, 2024
  • 8 min read

Updated: Jul 23

In the world of data science, rushing into modeling without understanding your dataset is like jumping into a conversation without knowing the topic. This blog will walk you through the essential steps of Exploratory Data Analysis with Python, from visualizing distributions to identifying missing values and relationships between features.

Using the Titanic dataset as an example, we’ll not only explain the concepts but also show you the actual code to perform each analysis. So, if you're ready to stop guessing and start understanding your data — let's dive in.


Exploratory Data Analysis (EDA): Uncovering the Story Hidden in Your Data

Before diving into machine learning models or predictive analytics, it's essential to take a moment and truly understand the data you're working with. That’s where Exploratory Data Analysis (EDA) comes in — the process of making sense of data before making assumptions about it. Think of EDA as the detective phase of data science, where you're trying to understand what's going on, what's normal, what's unusual, and what questions you might ask next.

EDA is more than just crunching numbers. It's about looking at the data visually, summarizing it statistically, and spotting patterns or oddities that might otherwise go unnoticed. It’s the stage where you explore — not predict — and where you ask "what is this data trying to tell me?"

Using a combination of statistical summaries, graphical techniques, and relationship analyses, EDA gives you a holistic view of the dataset. It helps in uncovering trends, correlations, anomalies, and missing values — all the things that influence how the rest of your data pipeline will perform. Most importantly, it equips you to make informed decisions about how to clean, transform, and model your data going forward.


What Are the Goals of Exploratory Data Analysis (EDA)?

EDA doesn’t follow a strict checklist, but it does have a few core objectives:


  1. Understanding the Data Structure: This includes exploring the size of the dataset, types of variables, data distributions, and recognizing things like missing values or duplicated records.

  2. Detecting Patterns and Relationships: Whether it’s trends over time or correlations between variables, EDA allows you to visually and statistically explore how your features interact.

  3. Spotting Anomalies and Outliers: Outliers can distort your analysis, and EDA helps you identify and decide how to handle them.

  4. Formulating Hypotheses: Based on initial findings, you can begin to generate questions and hypotheses to explore in more depth during the modeling stage.


Core Techniques Used in Exploratory Data Analysis (EDA)

To accomplish these goals, data scientists rely on a wide range of techniques:


  1. Descriptive Statistics - Get a quick snapshot of your dataset using metrics like mean, median, mode, range, and standard deviation.

  2. Univariate Analysis - Focus on individual variables to understand their distribution. Tools like histograms, box plots, and bar charts come in handy here.

  3. Bivariate and Multivariate Analysis - Examine how variables relate to one another through scatter plots, correlation matrices, and heatmaps. These reveal hidden patterns or relationships in the data.

  4. Data Visualization - Charts like line graphs, violin plots, and pie charts help make data easier to interpret visually and communicate to non-technical stakeholders.

  5. Outlier Detection - Spot outliers using methods like Z-score analysis or box plots. These data points might need to be excluded or separately analyzed, depending on the context (a combined code sketch covering techniques 5 through 8 follows this list).

  6. Handling Missing Data - Use techniques like mean/median imputation, forward fill, or more advanced strategies to address incomplete data entries.

  7. Dimensionality Reduction - Tools like Principal Component Analysis (PCA) and t-SNE help reduce the complexity of your dataset, making it easier to visualize and analyze high-dimensional data.

  8. Clustering and Segmentation - Use unsupervised learning techniques like K-means clustering to group similar data points. This helps identify subgroups and behavioral patterns within the dataset.
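
To make techniques 5 through 8 concrete, here is a minimal, self-contained sketch on synthetic data. It assumes scipy and scikit-learn are installed, and every variable name and number in it is illustrative rather than taken from a real dataset:

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic numeric data with a planted outlier and a planted gap
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(50, 10, size=(200, 4)),
                    columns=['f1', 'f2', 'f3', 'f4'])
data.loc[0, 'f1'] = 200      # outlier for technique 5
data.loc[5, 'f2'] = np.nan   # missing value for technique 6

# 5. Outlier detection: |z-score| > 3 flags extreme values
z = np.abs(stats.zscore(data['f1']))
print("Outlier rows:", data.index[z > 3].tolist())

# 6. Missing data: simple median imputation
data['f2'] = data['f2'].fillna(data['f2'].median())

# 7. Dimensionality reduction: 4 features down to 2 principal components
components = PCA(n_components=2).fit_transform(data)

# 8. Clustering: K-means groups the projected points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(components)
print("Cluster sizes:", np.bincount(labels))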


Frameworks and Libraries for Exploratory Data Analysis (EDA)

If you're working in Python, you’ll be happy to know that EDA is well-supported by several powerful and intuitive libraries:


  • Pandas – For data manipulation and summarization.

  • Matplotlib & Seaborn – For creating insightful and aesthetic visualizations.

  • Plotly – For interactive, browser-based data plots.

  • NumPy & SciPy – For statistical calculations and numerical operations.

  • Scikit-learn – For more advanced techniques like clustering and dimensionality reduction.

  • Sweetviz / Pandas-Profiling – For automated EDA reports that summarize data distributions, missing values, and correlations (a short usage sketch follows this list).
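
As a quick illustration of that last item, here is a hedged sketch using Sweetviz. It assumes the library is installed (pip install sweetviz), and the output file name is our own choice:

import seaborn as sns
import sweetviz as sv

df = sns.load_dataset('titanic')

# Build a single-page HTML report: per-column distributions,
# missing-value counts, and pairwise associations
report = sv.analyze(df)
report.show_html('titanic_report.html')  # file name is arbitrary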


Why Exploratory Data Analysis (EDA) Matters

Skipping or rushing through EDA is like trying to write a book without understanding your characters. It’s during this exploratory phase that you gain the most intuition about the data — how clean or messy it is, what surprises it holds, and where you should focus your energy going forward. Whether you're building a model, creating dashboards, or simply trying to understand a dataset for business decisions, EDA lays the groundwork for everything that follows.


Now that we've walked through the concepts and techniques of EDA, it's time to get our hands dirty. In the next section, we’ll demonstrate these principles in action using Python and some real-world data.


Exploratory Data Analysis on Titanic Dataset using Python

Let’s perform a detailed EDA step-by-step using the Titanic dataset with Python libraries such as pandas, matplotlib, seaborn, and numpy.


1. Import Libraries and Load Data

We start by importing the required Python libraries. pandas and numpy help with data handling and numerical operations. seaborn and matplotlib.pyplot are used for creating plots and charts. We set Seaborn's style to "whitegrid" for cleaner visuals.

The Titanic dataset is loaded using Seaborn’s built-in load_dataset() function. It includes information about passengers such as age, class, gender, and survival status. Using df.head(), we preview the first few rows to get an initial look at the data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set plot style
sns.set(style="whitegrid")

# Load Titanic dataset from Seaborn
df = sns.load_dataset('titanic')

# Display first 5 rows
df.head()

Output:

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True


2. Dataset Overview

Before diving into analysis, we quickly inspect the structure of the dataset. The shape tells us the number of rows and columns. df.info() shows data types and non-null counts for each column, helping us spot missing values. Finally, df.describe() provides summary statistics for both numeric and categorical features.

# Shape of the dataset
print("Shape of dataset:", df.shape)

# Data types and non-null counts
print("\nInfo:")
df.info()

# Summary statistics
print("\nSummary Statistics:")
df.describe(include='all')

Output:

Shape of dataset: (891, 15)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

Summary Statistics:

        survived      pclass     sex         age       sibsp       parch        fare embarked  class    who adult_male deck  embark_town alive alone
count   891.000000  891.000000   891  714.000000  891.000000  891.000000  891.000000      889    891    891        891  203          889   891   891
unique         NaN         NaN     2         NaN         NaN         NaN         NaN        3      3      3          2    7            3     2     2
top            NaN         NaN  male         NaN         NaN         NaN         NaN        S  Third    man       True    C  Southampton    no  True
freq           NaN         NaN   577         NaN         NaN         NaN         NaN      644    491    537        537   59          644   549   537
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   32.204208      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   49.693429      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000    0.000000      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000    7.910400      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   14.454200      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   31.000000      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000  512.329200      NaN    NaN    NaN        NaN  NaN          NaN   NaN   NaN

3. Checking Missing Values

Missing data can affect the quality of our analysis, so it's important to identify it early. Using isnull().sum(), we count missing values in each column and sort them in descending order. We then display only the columns that have missing entries.

# Count of missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing[missing > 0]

Output:

deck           688
age            177
embarked         2
embark_town      2
dtype: int64
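
How you address these gaps depends on context. As one reasonable strategy (not the only one), the sketch below fills age with the median, fills the two embarkation fields with their most frequent value, and drops the mostly empty deck column; the name df_clean is our own:

# One possible cleaning strategy: impute small gaps, drop the sparse column
df_clean = df.copy()

# age: 177 missing values, filled with the median age
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())

# embarked / embark_town: only 2 missing, filled with the most frequent value
df_clean['embarked'] = df_clean['embarked'].fillna(df_clean['embarked'].mode()[0])
df_clean['embark_town'] = df_clean['embark_town'].fillna(df_clean['embark_town'].mode()[0])

# deck: 688 of 891 values missing, so dropping the column is often better
df_clean = df_clean.drop(columns=['deck'])

print("Remaining missing values:", df_clean.isnull().sum().sum())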

4. Univariate Analysis

Univariate analysis focuses on examining one variable at a time to understand its distribution and characteristics. It helps identify patterns, outliers, or imbalances in the data.


Categorical Variables

We start with categorical variables using Seaborn’s countplot:


  • The survived column shows how many passengers lived or died.

  • The sex column reveals the gender distribution.

  • The class column displays the spread across travel classes.

# Countplot of Survival
sns.countplot(x='survived', data=df)
plt.title("Survival Counts")
plt.show()

# Sex distribution
sns.countplot(x='sex', data=df)
plt.title("Gender Distribution")
plt.show()

# Passenger Class
sns.countplot(x='class', data=df)
plt.title("Passenger Class Distribution")
plt.show()

Outputs:
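
The exact counts behind these plots can be read off numerically with value_counts(), as a quick companion check:

# Exact counts behind the three plots above
print(df['survived'].value_counts())  # 0 (died) vs 1 (survived)
print(df['sex'].value_counts())       # male vs female
print(df['class'].value_counts())     # First / Second / Third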


Numerical Variables

For numerical features, we use histograms with kernel density estimation (KDE) to visualize their distribution.

The age variable shows a concentration of passengers in their 20s and 30s, with fewer older passengers. The fare variable is right-skewed, with most passengers paying lower fares and a few paying significantly more.

# Distribution of age
sns.histplot(df['age'].dropna(), kde=True, bins=30)
plt.title("Age Distribution")
plt.show()

# Distribution of fare
sns.histplot(df['fare'], kde=True)
plt.title("Fare Distribution")
plt.show()

Output:
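
Beyond the visual impression, skew can be quantified. A short sketch follows; the values noted in the comments are approximate for this dataset:

# Skewness > 0 indicates a right (positive) skew
print("Age skewness: ", df['age'].skew())   # ~0.39 (mild)
print("Fare skewness:", df['fare'].skew())  # ~4.79 (strong)

# A log transform is one common way to compress a heavy right tail
sns.histplot(np.log1p(df['fare']), kde=True)
plt.title("Log(1 + Fare) Distribution")
plt.show()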


5. Bivariate Analysis

Bivariate analysis helps us explore the relationship between two variables. This is useful for understanding how one feature may influence another—especially the survived column in our case.


  1. Survival by Gender: We use a grouped countplot to compare survival rates across male and female passengers.

  2. Survival by Passenger Class: This plot shows how survival chances varied across different travel classes.

  3. Age vs Fare: A scatter plot shows the relationship between age and fare, with color representing survival status. This helps identify clusters or patterns among survivors and non-survivors.

# Survival by Gender
sns.countplot(x='sex', hue='survived', data=df)
plt.title("Survival by Gender")
plt.show()

# Survival by Passenger Class
sns.countplot(x='class', hue='survived', data=df)
plt.title("Survival by Passenger Class")
plt.show()

# Age vs Fare
sns.scatterplot(x='age', y='fare', hue='survived', data=df)
plt.title("Age vs Fare by Survival")
plt.show()

Output:

These visualizations offer key insights into which factors may have influenced survival on the Titanic.
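
To attach numbers to these plots, group-wise survival rates are a quick follow-up (the values in the comments are approximate):

# Mean of the 0/1 survived flag = survival rate per group
print(df.groupby('sex')['survived'].mean())    # female ~0.74, male ~0.19
print(df.groupby('class', observed=False)['survived'].mean())
# First ~0.63, Second ~0.47, Third ~0.24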


6. Correlation Matrix (Numerical Features)

To understand how numerical features relate to each other, we compute a correlation matrix. This helps identify linear relationships—both positive and negative—between variables.

We first select only the numeric columns from the dataset, then compute their pairwise correlation. The heatmap visualizes these relationships, making it easy to spot strong correlations.

# Select only numerical columns
num_df = df.select_dtypes(include=['float64', 'int64'])

# Compute correlation matrix
corr = num_df.corr()

# Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

Output:

(Correlation matrix heatmap of the numerical features)
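
As a follow-up, sorting the survived column of the matrix ranks the linear relationships; the values noted in the comment are approximate:

# Correlation of each numeric feature with survival, most negative first
print(corr['survived'].drop('survived').sort_values())
# pclass is the most negative (~ -0.34); fare the most positive (~ +0.26)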

7. Feature Relationships (Boxplots, Violin, etc.)

To explore how numerical features vary across categories, we use violin and box plots.


  1. Age Distribution by Survival: The violin plot shows the distribution of age for survivors and non-survivors. It highlights both the spread and density of age values in each group.

  2. Fare by Class: The box plot displays fare distributions across passenger classes. It shows medians, quartiles, and potential outliers, giving insights into pricing patterns based on class.

# Age Distribution by Survival
sns.violinplot(x='survived', y='age', data=df)
plt.title("Age Distribution by Survival")
plt.show()

# Fare by Class
sns.boxplot(x='class', y='fare', data=df)
plt.title("Fare by Passenger Class")
plt.show()

Output:
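
The outliers visible in the fare box plot can also be counted explicitly with the conventional 1.5 * IQR rule, sketched below:

# Flag fares beyond the upper Tukey fence: Q3 + 1.5 * IQR
q1, q3 = df['fare'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

outliers = df[df['fare'] > upper_fence]
print(f"Upper fence: {upper_fence:.2f}, outliers: {len(outliers)}")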


Conclusion

Exploratory Data Analysis is more than just a preliminary step — it's the foundation of any meaningful data science project. Through statistical summaries and visual exploration, EDA uncovers hidden patterns, highlights data quality issues, and guides the choices you make in feature engineering and modeling.

By applying techniques like univariate and bivariate analysis, correlation heatmaps, and distribution plots, we’ve seen how a dataset can start to tell its story. In our Titanic dataset example, we observed the role of gender and class in survival rates, identified outliers in fare prices, and got a feel for the overall structure of the data.

As you move forward to model building or advanced analytics, remember: the insights you gather during EDA can make or break your results. Let the data speak — and listen carefully in this crucial phase.

Get in touch for customized mentorship, research and freelance solutions tailored to your needs.
