Data Cleaning & Preprocessing

Real-world data is messy. Learn how to handle missing values, duplicates, and prepare data for AI models.

By TechCoder Team
Last updated: 2026-06-02

In a Nutshell

This hands-on tutorial covers the practical side of data cleaning and preprocessing with pandas: handling missing values, removing duplicates, cleaning up strings, and converting data types.

Data scientists are often said to spend as much as 80% of their time cleaning data. Why? Because a model can only learn from the data it is given, so a model trained on bad data produces unreliable results.

1. Handling Missing Values (Imputation) 🧩

Missing data is common (e.g., a user didn't fill out their age).

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 25],
    'Salary': [50000, 60000, np.nan, 52000]
})

# Check for missing values
print(df.isnull().sum())

# Strategy 1: Drop rows with missing values
df_dropped = df.dropna()

# Strategy 2: Fill with Mean/Median (Imputation)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
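To make the trade-off between the two strategies concrete, the sketch below rebuilds the same small table and compares how many rows each approach keeps. The variable names `raw`, `dropped`, and `imputed` are illustrative and not part of the tutorial's own code.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this sketch runs on its own
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 25],
    'Salary': [50000, 60000, np.nan, 52000]
})

# Strategy 1: dropping removes every row that has at least one NaN
dropped = raw.dropna()

# Strategy 2: imputing keeps every row and fills the gaps instead
imputed = raw.fillna({
    'Age': raw['Age'].mean(),
    'Salary': raw['Salary'].median()
})

print(len(raw), len(dropped), len(imputed))  # 4 2 4
print(imputed.isnull().sum().sum())          # 0
```

On a table this small, dropping throws away half of the rows, which is why imputation is usually the safer choice when data is scarce.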

2. Removing Duplicates 👯‍♂️

Duplicate entries can skew your analysis.

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['A', 'B', 'B', 'C']
})

# Check for duplicates
print(df.duplicated().sum()) # 1

# Remove duplicates
df = df.drop_duplicates()
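Rows are only counted as duplicates when every column matches. When one column is meant to be unique (an ID, for instance), the `subset` and `keep` parameters of `duplicated()` and `drop_duplicates()` can help. The following is a minimal sketch; the `people` table and its `'B2'` value are made up for illustration.

```python
import pandas as pd

# Hypothetical data: ID 2 appears twice, but the Name differs between rows
people = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['A', 'B', 'B2', 'C']
})

# Full-row comparison finds nothing, because no two rows match in every column
print(people.duplicated().sum())               # 0

# Comparing on the ID column alone flags the second row with ID 2
print(people.duplicated(subset=['ID']).sum())  # 1

# keep='first' (the default) keeps the earliest row; keep='last' keeps the latest
first = people.drop_duplicates(subset=['ID'], keep='first')
last = people.drop_duplicates(subset=['ID'], keep='last')

print(first['Name'].tolist())  # ['A', 'B', 'C']
print(last['Name'].tolist())   # ['A', 'B2', 'C']
```

Which row to keep depends on the data: `keep='last'` is a reasonable pick when later rows are assumed to be corrections of earlier ones.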

3. String Manipulation 🧡

Text data often needs cleaning (removing whitespace, fixing typos).

df = pd.DataFrame({'City': [' New York ', 'Paris', 'london', 'New York']})

# Strip whitespace
df['City'] = df['City'].str.strip()

# Convert to title case
df['City'] = df['City'].str.title()

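# Replace values (matches whole cells only). No row here equals 'Ny', so this
# is a no-op on this data; it would catch an entry like 'NY' after title-casing.
df['City'] = df['City'].replace('Ny', 'New York')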
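Fixing typos usually means mapping several known bad spellings to one canonical value. One way to do that, sketched here with made-up values, is to pass a dictionary to `replace()`. The `cities` series and the `corrections` mapping are hypothetical; a real mapping would come from inspecting the column's unique values.

```python
import pandas as pd

# Hypothetical messy values: stray whitespace, mixed case, and one typo
cities = pd.Series([' new york', 'NY', 'Pariss', 'london '])

# Normalise whitespace and case first so the mapping needs fewer entries
cleaned = cities.str.strip().str.title()

# Map known variants and typos to a canonical spelling (whole-cell matches only)
corrections = {'Ny': 'New York', 'Pariss': 'Paris'}
cleaned = cleaned.replace(corrections)

print(cleaned.tolist())  # ['New York', 'New York', 'Paris', 'London']
```

Running `strip()` and `title()` before the mapping means `'NY'`, `'ny'`, and `' Ny '` all collapse to a single key.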

4. Data Type Conversion 🔄

Sometimes numbers are stored as strings (e.g., "$100").

df = pd.DataFrame({'Price': ['$100', '$200', '$300']})

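# Remove '$' and convert to float. regex=False treats '$' as a literal
# character rather than the regex end-of-string anchor.
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)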
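`astype(float)` raises an error as soon as one value cannot be parsed. For columns that may contain stray text, `pd.to_numeric` with `errors='coerce'` is a more forgiving option, since it converts unparseable entries to NaN. A minimal sketch, assuming a made-up `prices` series with a thousands separator and an `'N/A'` entry:

```python
import pandas as pd

# Hypothetical prices: a thousands separator, padding, and a non-numeric entry
prices = pd.Series(['$100', '$1,200', ' $300 ', 'N/A'])

# Strip the symbols literally (regex=False), then trim whitespace
stripped = (
    prices.str.replace('$', '', regex=False)
          .str.replace(',', '', regex=False)
          .str.strip()
)

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(stripped, errors='coerce')

print(numeric.tolist())      # [100.0, 1200.0, 300.0, nan]
print(numeric.isna().sum())  # 1
```

The resulting NaN values can then be dropped or imputed like any other missing value.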

Quiz

What is the best way to handle missing values if you have very little data?

Drop the rows
Fill with mean/median (Imputation)
Leave them as NaN
Delete the column

Key Takeaways

✅ Imputation saves data by filling missing values.
✅ String methods (.str) are powerful for cleaning text.
✅ Type conversion (.astype) ensures numbers are actually numbers.

What's Next?

Clean data is boring to look at in a table. Let's visualize it!

Next Chapter: Data Visualization.