Data Cleaning & Preprocessing
Real-world data is messy. Learn how to handle missing values, duplicates, and prepare data for AI models.
This hands-on tutorial focuses on practical implementation of data cleaning and preprocessing concepts.
A commonly cited estimate is that data scientists spend about 80% of their time cleaning data. Why? Because a model trained on bad data produces unreliable results.
1. Handling Missing Values (Imputation) 🧩
Missing data is common (e.g., a user didn't fill out their age).
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 25],
    'Salary': [50000, 60000, np.nan, 52000]
})
# Check for missing values
print(df.isnull().sum())
# Strategy 1: Drop rows with missing values
df_dropped = df.dropna()
# Strategy 2: Fill with Mean/Median (Imputation)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
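To make the trade-off between the two strategies concrete, here is a minimal sketch that counts how many rows survive each one. It reuses the sample DataFrame from above; the names `rows_after_drop`, `imputed`, and `rows_after_impute` are illustrative and not part of the original tutorial, and using the median for both columns is just one reasonable choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 25],
    'Salary': [50000, 60000, np.nan, 52000],
})

# Strategy 1: dropping removes every row that has at least one missing value
rows_after_drop = len(df.dropna())

# Strategy 2: imputing on a copy keeps every row and leaves df untouched
imputed = df.copy()
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())
imputed['Salary'] = imputed['Salary'].fillna(imputed['Salary'].median())
rows_after_impute = len(imputed)

print(rows_after_drop, rows_after_impute)  # 2 4
print(imputed)
```

On this tiny dataset, dropping loses half of the rows, while imputing keeps all four. The median is often preferred over the mean when a column contains outliers, because a single extreme value shifts the mean much more than the median.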
2. Removing Duplicates 👯‍♂️
Duplicate entries can skew your analysis.
df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['A', 'B', 'B', 'C']
})
# Check for duplicates
print(df.duplicated().sum()) # 1
# Remove duplicates
df = df.drop_duplicates()
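By default, duplicated() and drop_duplicates() compare entire rows, so two records that share an ID but differ in another column are not flagged. The sketch below assumes a slightly different dataset (the lowercase 'b' is made up for illustration) and uses the `subset` and `keep` parameters to deduplicate on the ID column only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['A', 'B', 'b', 'C'],  # same ID 2, but the names differ in case
})

# Whole-row comparison finds nothing, because no two rows are identical
full_row_dupes = int(df.duplicated().sum())

# Comparing on the ID column alone flags the second row with ID 2
id_dupes = int(df.duplicated(subset=['ID']).sum())

# Keep the first row for each ID and renumber the index
deduped = df.drop_duplicates(subset=['ID'], keep='first').reset_index(drop=True)

print(full_row_dupes, id_dupes)  # 0 1
print(deduped)
```

Which row to keep (`'first'`, `'last'`, or none with `keep=False`) depends on the data; 'first' is only a default assumption here.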
3. String Manipulation 🧵
Text data often needs cleaning (removing whitespace, fixing typos).
df = pd.DataFrame({'City': [' New York ', 'Paris', 'london', 'New York']})
# Strip whitespace
df['City'] = df['City'].str.strip()
# Convert to title case
df['City'] = df['City'].str.title()
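# Replace values: matches whole cells only, so a cell that is exactly 'Ny'
# (for example 'NY' after title-casing) becomes 'New York'
df['City'] = df['City'].replace('Ny', 'New York')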
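The sample data above contains no 'Ny' value, so the last step is easier to see with one more row. This sketch adds a hypothetical 'NY' entry (an assumption, not part of the original data) and chains the same three steps, using a dictionary so more spellings could be mapped later.

```python
import pandas as pd

# 'NY' is an extra, made-up entry added to show the replace step doing work
df = pd.DataFrame({'City': [' New York ', 'Paris', 'london', 'New York', 'NY']})

# Strip whitespace, then title-case: 'london' -> 'London', 'NY' -> 'Ny'
cleaned = df['City'].str.strip().str.title()

# Map known abbreviations to their canonical spelling (whole-cell match)
cleaned = cleaned.replace({'Ny': 'New York'})

unique_cities = sorted(cleaned.unique())
print(unique_cities)  # ['London', 'New York', 'Paris']
```

Checking the unique values before and after cleaning is a quick way to confirm that spelling variants have collapsed into one category.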
4. Data Type Conversion 🔄
Sometimes numbers are stored as strings (e.g., "$100").
df = pd.DataFrame({'Price': ['$100', '$200', '$300']})
# Remove the literal '$' and convert to float
# (regex=False so '$' is never treated as a regex end-of-string anchor)
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)
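.astype(float) raises an error as soon as one value cannot be parsed. For messier columns, one possible approach is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be handled like any other missing value. The values '$1,200' and 'N/A' below are invented for illustration.

```python
import pandas as pd

# Hypothetical messy prices: a thousands separator and a placeholder string
df = pd.DataFrame({'Price': ['$100', '$1,200', 'N/A']})

# Remove the currency symbol and thousands separator as literal characters
stripped = (
    df['Price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
)

# Anything that still is not a number becomes NaN instead of raising an error
df['Price'] = pd.to_numeric(stripped, errors='coerce')

print(df['Price'].tolist())
print(df['Price'].isnull().sum())  # 1
```

The coerced NaN can then be dropped or imputed using the strategies from section 1.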
Quiz
Question 1 of 3: What is the best way to handle missing values if you have very little data?
Key Takeaways
✅ Imputation saves data by filling missing values.
✅ String methods (.str) are powerful for cleaning text.
✅ Type conversion (.astype) ensures numbers are actually numbers.
What's Next?
Clean data is boring to look at in a table. Let's visualize it!
Next Chapter: Data Visualization.