Data Cleaning & Preparation
Real-world data is messy. Learn to handle missing values, fix data types, and remove duplicates.
This hands-on tutorial focuses on practical implementation of data cleaning & preparation concepts.
Module 4: Data Cleaning & Preparation
A commonly cited estimate is that data scientists spend up to 80% of their time cleaning data. Pandas gives you a comprehensive toolkit for turning messy, real-world datasets into clean, analyzable data.
Lesson 7: Handling Missing Data
In Pandas, missing data is represented as NaN (Not a Number) or None; missing datetime values show up as NaT (Not a Time).
Detecting Missing Data
- df.isnull(): Returns True for NaN values.
- df.notnull(): Returns True for non-missing values.
- df.isnull().sum(): Counts missing values per column.
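The detection methods above can be tried on a small frame. This is a minimal sketch; the column names and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing name, two missing scores.
df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "score": [10.0, np.nan, 95.0, np.nan],
})

# Boolean DataFrame: True wherever a value is missing.
missing_mask = df.isnull()

# Boolean DataFrame: True wherever a value is present.
present_mask = df.notnull()

# Series indexed by column name: number of missing values in each column.
missing_per_column = df.isnull().sum()

print(missing_per_column)
```

Because True counts as 1 when summed, chaining .sum() onto the boolean mask yields a per-column count.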
Fixing Missing Data
- Drop it: df.dropna() removes rows with any missing values.
- Fill it: df.fillna(value) replaces NaNs with a specific value (like 0 or the mean).
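To compare the two strategies side by side, here is a short sketch on a hypothetical three-row frame (the "city" and "temp" columns are assumptions for the example). Both methods return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing temperature.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "temp": [4.0, np.nan, 31.0],
})

# Drop it: the row containing NaN is removed entirely.
dropped = df.dropna()

# Fill it with a constant: only the "temp" column is targeted.
filled_zero = df.fillna({"temp": 0})

# Fill it with the column mean; mean() skips NaN, so this is (4 + 31) / 2.
filled_mean = df.fillna({"temp": df["temp"].mean()})

print(dropped)
print(filled_zero)
print(filled_mean)
```

Passing a dict to fillna() restricts the fill to the named columns, which avoids accidentally writing a number into a text column.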
Lesson 8: Data Types & Duplicates
Fixing Data Types
Sometimes numbers are loaded as strings.
- df['col'].astype(int): Converts a column to integer. This fails if the column contains NaN or values that cannot be parsed as integers.
- pd.to_datetime(df['date']): Converts date strings to datetime values.
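A small sketch of both conversions, assuming a hypothetical frame in which quantities and dates were loaded as strings:

```python
import pandas as pd

# Hypothetical sample data: every value arrived as a string.
df = pd.DataFrame({
    "qty": ["3", "7", "12"],
    "date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# Convert numeric strings to integers so arithmetic works as expected.
df["qty"] = df["qty"].astype(int)

# Convert date strings to datetime values, which unlocks the .dt accessor.
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
print(df["qty"].sum())
print(df["date"].dt.month.tolist())
```

If a column may contain unparseable values, astype(int) raises a ValueError; pd.to_numeric(df['col'], errors='coerce') is an alternative that turns those values into NaN instead.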
Handling Duplicates
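- df.duplicated(): Returns True for each row that repeats an earlier row (the first occurrence is marked False by default).
- df.drop_duplicates(): Removes duplicate rows, keeping the first occurrence by default.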
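The following sketch shows both calls on a hypothetical frame where one row appears twice:

```python
import pandas as pd

# Hypothetical sample data: the row (2, "ink") appears twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "item": ["pen", "ink", "ink", "pad"],
})

# Boolean Series: only the second copy of the repeated row is flagged.
dup_mask = df.duplicated()

# New DataFrame with the flagged row removed; original index labels are kept.
deduped = df.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```

Note that the result keeps the original index labels (0, 1, 3 here); chaining .reset_index(drop=True) renumbers them if that matters downstream.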
Practice: The Janitor
Challenge:
- Create a DataFrame with a column "Score" containing [10, 95, np.nan, 10].
- Replace the NaN with the median of the scores.
- Remove any duplicate rows.
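One possible solution to the challenge is sketched below; try it yourself first. The variable names are arbitrary, and other approaches work equally well.

```python
import numpy as np
import pandas as pd

# Step 1: build the DataFrame from the challenge.
df = pd.DataFrame({"Score": [10, 95, np.nan, 10]})

# Step 2: median() skips NaN, so it is computed over 10, 95, 10.
median_score = df["Score"].median()
df["Score"] = df["Score"].fillna(median_score)

# Step 3: drop repeated rows and renumber the index.
cleaned = df.drop_duplicates().reset_index(drop=True)

print(median_score)
print(cleaned)
```

Because the median is 10, the filled value duplicates the existing 10s, so only two rows (10 and 95) remain after deduplication.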
Quiz
Question 1 of 5: Which method would you use to count the total number of missing values in each column?
Key Takeaways
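✅ dropna() is aggressive because it discards entire rows; fillna() keeps them and is often the better choice when data is scarce.
✅ Always check df.dtypes; arithmetic on string columns raises errors or silently concatenates values.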
✅ drop_duplicates() is your friend for raw data.