Module 4: Data Cleaning & Preparation

It is often said that data scientists spend up to 80% of their time cleaning data. Pandas gives you a comprehensive toolkit for turning messy, real-world datasets into clean data that is ready for analysis.

Lesson 7: Handling Missing Data

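In Pandas, missing data is represented as NaN (Not a Number) or None. In numeric columns, None is converted to NaN automatically, and missing dates appear as NaT (Not a Time).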

Detecting Missing Data

df.isnull(): Returns a DataFrame of booleans, True wherever a value is missing (df.isna() is an alias).
df.notnull(): Returns the opposite mask, True wherever a value is present.
df.isnull().sum(): Counts missing values per column.
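
As a quick illustration of the three calls above, here is a minimal sketch. The DataFrame and its column names ("name", "score") are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value in each column
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "score": [90.0, np.nan, 75.0, 60.0],
})

missing_mask = df.isnull()               # True where a value is missing
present_mask = df.notnull()              # True where a value is present
missing_per_column = df.isnull().sum()   # one count per column

print(missing_mask)
print(missing_per_column)
```

Summing works because True counts as 1 and False as 0, so each column's sum is its number of missing values.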

Fixing Missing Data

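Drop it: df.dropna() removes every row that contains at least one missing value.
Fill it: df.fillna(value) replaces missing values with a specific value (like 0 or the column mean).
Both methods return a new DataFrame and leave the original unchanged, so assign the result if you want to keep it.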
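
The two strategies can be compared side by side. This is a small sketch on invented data (the "city" and "temp" columns are assumptions for illustration), not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "temp": [4.0, np.nan, 31.0],
})

# Drop it: only rows with no missing values survive
dropped = df.dropna()

# Fill it: pass a dict to fill specific columns only
filled_zero = df.fillna({"temp": 0})
filled_mean = df.fillna({"temp": df["temp"].mean()})  # mean() skips NaN

print(dropped)
print(filled_mean)
```

Here dropping loses the Lima row entirely, while filling keeps it at the cost of an invented temperature. Which trade-off is acceptable depends on the analysis.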

Lesson 8: Data Types & Duplicates

Fixing Data Types

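Numbers are sometimes loaded as strings, for example when a CSV column is read as text. Arithmetic on such a column either raises an error or joins the strings together instead of adding them.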

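df['col'].astype(int): Returns the column converted to integers. It raises an error if any value is missing or is not a valid integer, so handle those first.
pd.to_datetime(df['date']): Converts date strings to datetime values, which support sorting, filtering, and date arithmetic.
Neither call modifies the DataFrame; assign the result back, as in df['col'] = df['col'].astype(int).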
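
A minimal sketch of both conversions follows. The column names "qty" and "date" and their values are invented for the example.

```python
import pandas as pd

# Hypothetical data where everything was loaded as text
df = pd.DataFrame({
    "qty": ["3", "7", "12"],
    "date": ["2024-01-15", "2024-02-01", "2024-03-10"],
})

print(df.dtypes)  # both columns start out as strings

# Assign the converted columns back to the DataFrame
df["qty"] = df["qty"].astype(int)
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)  # now an integer column and a datetime column

total_qty = df["qty"].sum()      # numeric addition now works
months = df["date"].dt.month     # datetime columns expose the .dt accessor
```

After the conversion, the sum is a number rather than a concatenated string, and date parts such as the month can be extracted directly.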

Handling Duplicates

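df.duplicated(): Returns True for each row that repeats an earlier row. The first occurrence is marked False.
df.drop_duplicates(): Returns a new DataFrame without the duplicate rows, keeping the first occurrence by default.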
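
The sketch below shows both methods on a small made-up table. The "user" and "plan" columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "user": ["ann", "bob", "ann", "cat"],
    "plan": ["pro", "free", "pro", "free"],
})

dup_mask = df.duplicated()        # flags the repeat, not the first occurrence
deduped = df.drop_duplicates()    # drops the flagged rows

# subset= compares only the listed columns when looking for duplicates
deduped_by_plan = df.drop_duplicates(subset=["plan"])

print(dup_mask)
print(deduped)
```

With subset=["plan"], only the first row for each plan is kept, which is useful when a single column should act as a unique key.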

Practice: The Janitor

Challenge:

Create a DataFrame with a column "Score" containing [10, 95, np.nan, 10].
Replace the NaN with the median of the scores.
Remove any duplicate rows.
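
One possible solution, offered as a sketch rather than the official answer. Try the challenge yourself before reading it. The variable names are arbitrary.

```python
import numpy as np
import pandas as pd

# Step 1: build the DataFrame from the challenge
df = pd.DataFrame({"Score": [10, 95, np.nan, 10]})

# Step 2: fill the NaN with the median (median() skips missing values)
median_score = df["Score"].median()
df["Score"] = df["Score"].fillna(median_score)

# Step 3: remove duplicate rows, keeping the first occurrence
cleaned = df.drop_duplicates()

print(cleaned)
```

Because the median of the three known scores is 10, the filled row becomes another duplicate of the first row and is removed in the last step.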

Quiz

Which method would you use to count the total number of missing values in each column?

df.count_missing()

df.isnull().sum()

df.isna().count()

df.missing_vals()

Key Takeaways

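✅ dropna() is aggressive because it discards whole rows; fillna() often preserves more data, provided the fill value makes sense for the column.
✅ Always check df.dtypes; doing math on numbers stored as strings causes errors or wrong results.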
✅ drop_duplicates() is your friend for raw data.