Python

Data Cleaning & Preparation

Real-world data is messy. Learn to handle missing values, fix data types, and remove duplicates.

By TechCoder Team
Last updated: 2026-06-02

In a Nutshell

This hands-on tutorial focuses on putting data cleaning and preparation into practice with Pandas: handling missing values, fixing data types, and removing duplicates.

Module 4: Data Cleaning & Preparation

A commonly cited estimate is that data scientists spend up to 80% of their time cleaning data. Pandas gives you a comprehensive toolkit for turning messy, real-world datasets into clean data that is ready for analysis.


Lesson 7: Handling Missing Data

In Pandas, missing data is represented as NaN (Not a Number) or None.
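
As a small illustration of the statement above, the sketch below builds two throwaway Series (the names numeric and mixed are made up for this example) and shows that Pandas treats both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# None in a numeric column is converted to NaN, so the column becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)            # float64
print(numeric.isna().tolist())  # [False, True, False]

# np.nan and None are both reported as missing.
mixed = pd.Series([np.nan, None, 2.5])
print(mixed.isna().tolist())    # [True, True, False]
```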

Detecting Missing Data

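  • df.isnull(): Returns a Boolean DataFrame that is True wherever a value is missing (NaN or None). df.isna() is an alias that does the same thing.
  • df.notnull(): The inverse; returns True wherever a value is present.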
  • df.isnull().sum(): Counts missing values per column.
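
The three detection calls above can be tried on a tiny DataFrame. This is a minimal sketch; the column names and values are invented sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "score": [88.0, np.nan, 72.0, np.nan],
})

missing_mask = df.isnull()              # True where a value is missing
present_mask = df.notnull()             # the exact inverse of missing_mask
missing_per_column = df.isnull().sum()  # True counts as 1, so this counts gaps

print(missing_mask)
print(missing_per_column)  # name: 1 missing, score: 2 missing
```

Summing works because each True counts as 1, so the result is one count per column.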

Fixing Missing Data

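  1. Drop it: df.dropna() removes every row that contains at least one missing value.
  2. Fill it: df.fillna(value) replaces missing values with a value you choose, such as 0 or the column mean. Both methods return a new object, so assign the result if you want to keep it.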
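
Both options can be compared side by side. The sketch below uses invented sample data and fills the gap with the column mean; whether the mean is a sensible fill value depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one temperature reading is missing.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "temp": [4.0, np.nan, 31.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the gap with the column mean.
filled = df.copy()
filled["temp"] = filled["temp"].fillna(filled["temp"].mean())

print(len(dropped))             # 2
print(filled["temp"].tolist())  # [4.0, 17.5, 31.0]
```

Note that dropping loses the whole "Lima" row, including its valid city value, while filling keeps the row at the cost of an estimated temperature.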

Lesson 8: Data Types & Duplicates

Fixing Data Types

Sometimes numbers are loaded as strings.

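  • df['col'].astype(int): Converts a column to integers. It raises an error if the column contains missing values or text that is not a whole number.
  • pd.to_datetime(df['date']): Converts date strings to datetime values, which support date arithmetic and accessors such as .dt.year.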
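
To see why the conversion matters, here is a minimal sketch with invented columns named quantity and date. Before conversion, "+" joins the strings; after conversion, normal arithmetic and date accessors work.

```python
import pandas as pd

# Hypothetical data where every value was loaded as a string.
df = pd.DataFrame({
    "quantity": ["3", "12", "7"],
    "date": ["2024-01-15", "2024-02-01", "2024-03-20"],
})

# On strings, "+" concatenates instead of adding.
doubled_as_text = (df["quantity"] + df["quantity"]).tolist()
print(doubled_as_text)  # ['33', '1212', '77']

# Convert each column to the type it should have.
df["quantity"] = df["quantity"].astype(int)
df["date"] = pd.to_datetime(df["date"])

total = df["quantity"].sum()
months = df["date"].dt.month.tolist()
print(total)   # 22
print(months)  # [1, 2, 3]
```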

Handling Duplicates

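  • df.duplicated(): Returns a Boolean Series that is True for each row that repeats an earlier row. The first occurrence is marked False.
  • df.drop_duplicates(): Returns a new DataFrame without the duplicate rows, keeping the first occurrence by default.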
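
A short sketch of both calls on invented sample data, where the third row repeats the first:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "user": ["ana", "ben", "ana", "cy"],
    "plan": ["pro", "free", "pro", "free"],
})

dup_mask = df.duplicated()      # flags only the later copy
deduped = df.drop_duplicates()  # keeps the first occurrence

print(dup_mask.tolist())       # [False, False, True, False]
print(deduped.index.tolist())  # [0, 1, 3]
```

The surviving rows keep their original index labels; call reset_index(drop=True) on the result if you want them renumbered.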

Practice: The Janitor

Challenge:

  1. Create a DataFrame with a column "Score" containing [10, 95, np.nan, 10].
  2. Replace the NaN with the median of the scores.
  3. Remove any duplicate rows.

Quiz

Question 1 of 5

Which method would you use to count the total number of missing values in each column?

  • df.count_missing()
  • df.isnull().sum()
  • df.isna().count()
  • df.missing_vals()

Key Takeaways

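✅ dropna() is aggressive because it discards whole rows; fillna() keeps them, so it is often the better choice when a sensible fill value exists.
✅ Always check df.dtypes; numbers stored as strings break arithmetic (for example, "+" joins them instead of adding).
✅ drop_duplicates() is your friend for raw data.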