Module 11: Pandas for Machine Learning

Before a model can learn from data, the data must be prepared. This module covers feature engineering: turning raw columns into inputs that a model can learn from more easily.

Lesson 23: Feature Engineering

One-Hot Encoding

One-hot encoding converts a categorical column with values like "Red" and "Blue" into one binary (0/1) column per category.

pd.get_dummies(df): Automatically one-hot encodes the string and categorical columns of df and leaves numeric columns unchanged.
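As a minimal sketch of what this looks like in practice, the example below encodes a small, made-up DataFrame. The column names and values are hypothetical, and dtype=int is passed only so the output shows 0/1 integers (recent pandas versions return True/False by default).

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Red", "Green"],
    "Price": [10, 15, 12, 9],
})

# Encode only the Color column; Price is numeric and stays as it is.
# dtype=int asks for 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new Color_Blue, Color_Green, and Color_Red columns, marking which color that row had.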

Scaling Features (Manual)

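Many ML models work better when numeric features are on a similar scale. You can rescale a column in Pandas by dividing it by its maximum, which maps non-negative values into the range 0 to 1: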

df['Age_Scaled'] = df['Age'] / df['Age'].max()
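The sketch below runs that line on a small, made-up Age column. For comparison it also shows min-max scaling, which is not part of this lesson but is a common variant: it shifts the smallest value to exactly 0 as well. The data and the Age_MinMax column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ages, for illustration only
df = pd.DataFrame({"Age": [20, 35, 50, 80]})

# Max scaling, as in the lesson: the largest value becomes 1.0
df["Age_Scaled"] = df["Age"] / df["Age"].max()

# Min-max scaling (an addition for comparison): smallest -> 0.0, largest -> 1.0
age_min, age_max = df["Age"].min(), df["Age"].max()
df["Age_MinMax"] = (df["Age"] - age_min) / (age_max - age_min)

print(df)
```

With max scaling the smallest age here maps to 0.25, not 0, so the column only spans the full 0 to 1 range when the raw data starts near zero.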

Lesson 24: Splitting Data

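Before training, we split data into a Training set (for the model to learn from) and a Test set (to evaluate it on data it has not seen). While scikit-learn usually does this, understanding how to do it in Pandas is useful for time-series splits, where rows must stay in chronological order instead of being shuffled.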

# First 80% for training
train_size = int(len(df) * 0.8)
train_set = df.iloc[:train_size]
test_set = df.iloc[train_size:]
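A positional split like this only respects time order if the rows are already sorted. The sketch below assumes a hypothetical DataFrame with date and sales columns, sorts it first, and then applies the same 80/20 slice.

```python
import pandas as pd

# Hypothetical daily sales data, for illustration only
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [5, 7, 6, 9, 11, 10, 12, 14, 13, 15],
})

# Sort chronologically so the test set holds the most recent rows
df = df.sort_values("date").reset_index(drop=True)

# First 80% of rows for training, the rest for testing
train_size = int(len(df) * 0.8)
train_set = df.iloc[:train_size]
test_set = df.iloc[train_size:]

print(len(train_set), len(test_set))
```

Every date in train_set comes before every date in test_set, so the model is never evaluated on data older than what it trained on.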

Practice: ML Prep

Challenge:

Create a DataFrame with a Color column (Red, Green, Blue) and Sales.
Use get_dummies to encode the colors.
Normalize the Sales column (divide by max sales) so that its values fall between 0 and 1.

Quiz

Question 1 of 5

What does pd.get_dummies() do?

Removes dummy variables

Converts categorical variables into binary (0/1) columns

Fills missing values with dummy text

Splits the dataframe into chunks

Key Takeaways

✅ pd.get_dummies is a quick way to one-hot encode categorical data for ML.
✅ Manual splitting is useful, especially for time-series data, but scikit-learn is the standard.
✅ Normalization puts features on a comparable scale, which helps many models perform well.