Module 11: NumPy for Machine Learning

Before you feed data into a machine learning model, you usually need to preprocess it. Raw features are often on different scales, contain missing values, or use categorical labels that most models cannot work with directly.

Lesson 23: Data Preprocessing

Normalization (Min-Max Scaling)

Min-max scaling maps each feature to the range 0 to 1. This keeps features with large ranges (like Salary) from overpowering features with small ranges (like Age) in models that are sensitive to scale. Formula: (x - min) / (max - min)
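
The formula can be applied to every column at once with broadcasting. The following is a minimal sketch, assuming a small made-up table whose columns are Age and Salary; the values and variable names are illustrative only. The guard for constant columns is an added assumption, since max - min would be 0 there.

```python
import numpy as np

# Hypothetical data: each row is a person, columns are Age and Salary.
data = np.array([[25, 40_000],
                 [35, 60_000],
                 [45, 100_000]], dtype=float)

# Per-column minimum and maximum (axis=0 reduces over the rows).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Assumption: a constant column would give max - min == 0,
# so replace a zero range with 1 to avoid dividing by zero.
col_range = col_max - col_min
col_range[col_range == 0] = 1.0

# Broadcasting applies (x - min) / (max - min) column by column.
normalized = (data - col_min) / col_range
print(normalized)
```

After scaling, both columns run from 0 to 1, so Salary no longer dwarfs Age.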

Standardization (Z-score Scaling)

Rescaling data to have a mean of 0 and a standard deviation of 1. Formula: (x - mean) / std
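
As a rough illustration of the z-score formula, the sketch below standardizes a made-up array of ages. The data is hypothetical, and it assumes NumPy's default population standard deviation (ddof=0); pass ddof=1 to std if you want the sample standard deviation instead.

```python
import numpy as np

# Hypothetical feature values.
ages = np.array([22.0, 25.0, 31.0, 38.0, 44.0])

mean = ages.mean()
std = ages.std()  # population standard deviation (ddof=0), NumPy's default

# (x - mean) / std, applied element-wise.
standardized = (ages - mean) / std

print(standardized)
print(standardized.mean(), standardized.std())
```

The printed mean is 0 and the standard deviation is 1, up to floating-point rounding.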

Lesson 24: Feature Engineering

Handling Missing Values

In ML, we often replace NaN (Not a Number) with the mean or median of the column.

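np.isnan(arr): Returns a boolean mask that is True wherever arr is NaN.
arr[np.isnan(arr)] = mean: Fills the NaNs with mean. Compute mean with np.nanmean(arr), because np.mean(arr) returns NaN if any element is NaN.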
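
Put together, mean imputation might look like the sketch below. The array is made up for illustration, and the copy is an optional choice so that the original data stays untouched.

```python
import numpy as np

# Hypothetical column with two missing entries.
arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# np.nanmean ignores NaNs; np.mean(arr) would return nan here.
mean = np.nanmean(arr)

# Boolean mask marking the missing positions.
mask = np.isnan(arr)

# Fill a copy so the original array is left unchanged.
filled = arr.copy()
filled[mask] = mean

print(filled)
```

To impute with the median instead, np.nanmedian(arr) can be used in place of np.nanmean(arr).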

One-Hot Encoding (NumPy Way)

Converting categorical labels (like "Red", "Blue") into binary vectors, with a single 1 marking the category.

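np.eye(num_categories)[label_indices]: A fast way to generate one-hot vectors. Here num_categories is the number of distinct categories and label_indices is an array of integer labels from 0 to num_categories - 1.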
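
The identity-matrix trick needs integer labels, so string labels have to be mapped to indices first. The sketch below assumes np.unique with return_inverse=True for that mapping, which is one option among several; the color data is made up.

```python
import numpy as np

# Hypothetical categorical feature.
colors = np.array(["Red", "Blue", "Green", "Blue"])

# np.unique returns the sorted distinct labels, and return_inverse=True
# also gives each element's index into that sorted list.
categories, label_indices = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing with label_indices picks one row per sample.
one_hot = np.eye(len(categories), dtype=int)[label_indices]

print(categories)
print(one_hot)
```

Each row of one_hot contains exactly one 1, and its column corresponds to the matching entry in categories.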

Practice: Preprocessing Pipeline

Challenge: Create an array of 20 random integers representing house prices.

Reshape it into a 2D column vector (20 rows, 1 column).
Apply Standardization using NumPy.
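Replace any outlier (a value more than 2 standard deviations away from the mean) with the mean value. On the standardized array, that means any value whose absolute value is greater than 2, and the mean is 0.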

Quiz

What is the result of (data - min) / (max - min)?

Standardization

Normalization (Min-Max Scaling)

Categorization

Vectorization

Key Takeaways

✅ Normalization and Standardization put features on comparable scales, so no feature dominates just because of its units or range.
✅ Use np.nanmean to compute a mean that ignores NaNs when filling missing values.
✅ np.eye is a clever trick for fast one-hot encoding.