NumPy for Machine Learning
Learn how to prepare raw data for AI models using normalization, standardization, and one-hot encoding with NumPy. This hands-on tutorial focuses on implementing these preprocessing steps directly in NumPy.
Module 11: NumPy for Machine Learning
Data must be preprocessed before you feed it into a machine learning model. Raw data often sits on different scales, contains missing values, or uses categorical labels that a model cannot consume directly.
Lesson 23: Data Preprocessing
Normalization (Min-Max Scaling)
Scaling data to a range between 0 and 1. This prevents features with large ranges (like Salary) from overpowering small ranges (like Age).
Formula: (x - min) / (max - min)
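The formula above can be applied per column with broadcasting. The following is a minimal sketch, assuming a small made-up feature matrix (the age and salary values are illustrative only) and assuming that constant columns should be left at 0 rather than divided by zero:

```python
import numpy as np

# Hypothetical feature matrix: column 0 is Age, column 1 is Salary.
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 100_000.0]])

# Per-column minimum and maximum (axis=0 reduces over rows).
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Guard against division by zero when a column is constant (assumption:
# such a column is mapped to all zeros).
col_range = col_max - col_min
col_range[col_range == 0] = 1.0

# Broadcasting applies (x - min) / (max - min) to every column at once.
X_scaled = (X - col_min) / col_range
```

After scaling, both columns run from 0 to 1, so Salary no longer dwarfs Age numerically.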
Standardization (Z-score Scaling)
Rescaling data to have a mean of 0 and a standard deviation of 1.
Formula: (x - mean) / std
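As a rough illustration of the z-score formula, the sketch below standardizes each column of a made-up matrix. It assumes the population standard deviation (NumPy's default, ddof=0) is the one wanted; pass ddof=1 to np.std if you prefer the sample estimate.

```python
import numpy as np

# Hypothetical feature matrix: column 0 is Age, column 1 is Salary.
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 100_000.0]])

# Per-column mean and standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 by default (population standard deviation)

# Avoid dividing by zero for constant columns (assumption: leave them at 0).
std[std == 0] = 1.0

# Broadcasting applies (x - mean) / std to every column.
X_standardized = (X - mean) / std
```

Each column of X_standardized now has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point error.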
Lesson 24: Feature Engineering
Handling Missing Values
In ML, we often replace NaN (Not a Number) with the mean or median of the column.
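- np.isnan(arr): Detects NaN by returning a boolean mask that is True wherever arr holds NaN.
- arr[np.isnan(arr)] = mean: Fills NaNs with a precomputed mean. Compute it with np.nanmean(arr), because np.mean(arr) returns NaN when any element is NaN.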
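A short sketch of mean imputation, using made-up values. The 1D case fills NaNs with the overall mean; the 2D case, an assumption about how you would extend this to a feature matrix, fills each NaN with the mean of its own column.

```python
import numpy as np

# 1D case: fill NaNs with the mean of the non-missing values.
arr = np.array([1.0, np.nan, 3.0, np.nan, 8.0])
mask = np.isnan(arr)          # True where a value is missing
fill_value = np.nanmean(arr)  # mean that ignores NaNs: (1 + 3 + 8) / 3
filled = arr.copy()           # keep the original array untouched
filled[mask] = fill_value

# 2D case: fill each NaN with the mean of its own column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
col_means = np.nanmean(X, axis=0)             # one mean per column
X_filled = np.where(np.isnan(X), col_means, X)  # col_means broadcasts over rows
```

Swapping np.nanmean for np.nanmedian gives median imputation, which is less sensitive to extreme values.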
One-Hot Encoding (NumPy Way)
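Converting categorical labels (like "Red", "Blue") into binary vectors with a single 1 marking the category.
np.eye(num_categories)[label_indices]: A fast way to generate one-hot vectors, where num_categories is the number of distinct labels and label_indices is an integer array of category indices.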
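The np.eye trick needs integer indices, so string labels have to be mapped to integers first. The sketch below assumes np.unique with return_inverse=True is an acceptable way to do that mapping; the color labels are made up for illustration.

```python
import numpy as np

# Hypothetical categorical labels.
labels = np.array(["Red", "Blue", "Green", "Blue"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the index of each original label within that sorted list.
categories, label_indices = np.unique(labels, return_inverse=True)

# Each row of the identity matrix is the one-hot vector for one category;
# fancy indexing picks the matching row for every label.
one_hot = np.eye(len(categories), dtype=int)[label_indices]
```

Because np.unique sorts the categories, the columns here correspond to "Blue", "Green", "Red" in that order, and every row of one_hot contains exactly one 1.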
Practice: Preprocessing Pipeline
Challenge: Create an array of 20 random integers representing house prices.
- Reshape it into a 2D column vector (20 rows, 1 column).
- Apply Standardization using NumPy.
- Replace any value that is more than 2 standard deviations away from the mean (outliers) with the mean value.
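One possible way to approach the challenge is sketched below. The price range, the fixed seed, and the variable names are all assumptions made for illustration, not part of the exercise statement.

```python
import numpy as np

# Assumed setup: 20 random integer "house prices" with a fixed seed
# so the run is reproducible.
rng = np.random.default_rng(seed=0)
prices = rng.integers(100_000, 500_000, size=20)

# Reshape into a column vector: 20 rows, 1 column.
prices = prices.reshape(-1, 1).astype(float)

# Standardize: (x - mean) / std.
mean = prices.mean()
std = prices.std()
z_scores = (prices - mean) / std

# Flag values more than 2 standard deviations from the mean
# and replace them with the mean.
outliers = np.abs(z_scores) > 2
cleaned = np.where(outliers, mean, prices)
```

With uniformly drawn prices the outlier mask may well be empty, since uniform data rarely strays more than 2 standard deviations from its mean. Appending one very large value to the array before the reshape is an easy way to see the replacement take effect.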
Quiz
Question 1 of 5: What is the result of (data - min) / (max - min)?
Key Takeaways
✅ Normalization and Standardization put features on comparable scales, so no feature dominates the model just because of its units or range.
✅ Use np.nanmean to compute a fill value that ignores NaNs when a dataset has missing values.
✅ np.eye is a clever trick for fast one-hot encoding.