Module 9: Performance & Optimization

Pandas is fast, but bad code can make it slow. This module teaches you how to write professional, high-performance Pandas code that handles millions of rows without crashing.

Lesson 19: Vectorization (Again!)

Just as with NumPy, avoid looping over rows in Pandas whenever a vectorized operation can do the job.

Bad: for index, row in df.iterrows(): ...
Good: df['col'] = df['col'] * 2

Vectorized operations run in compiled C code and are often orders of magnitude faster than Python loops. The exact speedup depends on the data and the operation.
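To see the difference on your own machine, here is a minimal timing sketch. The 20,000-row DataFrame and the column name are made up for illustration, and the timings will vary with hardware and Pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical example data: 20,000 rows in a single numeric column.
df = pd.DataFrame({"col": np.arange(20_000, dtype="int64")})

# Slow: a Python-level loop that visits one row at a time.
start = time.perf_counter()
looped = [row["col"] * 2 for _, row in df.iterrows()]
loop_seconds = time.perf_counter() - start

# Fast: one vectorized expression over the whole column.
start = time.perf_counter()
vectorized = df["col"] * 2
vectorized_seconds = time.perf_counter() - start

# Both approaches produce the same values.
assert vectorized.tolist() == [int(v) for v in looped]

print(f"iterrows loop: {loop_seconds:.4f} s")
print(f"vectorized:    {vectorized_seconds:.4f} s")
```

The gap usually widens as the row count grows, because the loop pays Python overhead on every row while the vectorized version pays it once.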

Lesson 20: Memory Optimization

Categorical Data

Storing text values that repeat millions of times (e.g., "Male", "Male", "Female") is wasteful. Convert such columns to the category dtype.

df['Gender'] = df['Gender'].astype('category')

This stores each unique string once and represents every row as a small integer code, which saves a lot of RAM when the column has only a few distinct values.
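The effect can be measured with a short sketch like the one below. The data is invented for illustration (two labels repeated across 300,000 rows), and the byte counts you see will depend on your Pandas version.

```python
import pandas as pd

# Hypothetical data: two distinct labels repeated many times.
df = pd.DataFrame({"Gender": ["Male", "Male", "Female"] * 100_000})

# Memory used by the plain text column.
before = df["Gender"].memory_usage(deep=True)

df["Gender"] = df["Gender"].astype("category")

# Memory used after conversion to category.
after = df["Gender"].memory_usage(deep=True)

print(f"before: {before:,} bytes")
print(f"after:  {after:,} bytes")

# The unique strings are stored once; each row holds a small integer code.
print(df["Gender"].cat.categories.tolist())
print(df["Gender"].cat.codes.dtype)
```

Note that the saving comes from repetition. A column where almost every value is unique (names, IDs) gains little from this conversion and may even use more memory.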

Downcasting Numbers

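Do you need int64 (up to about 9 quintillion) for a column of ages (0-100)? No. int8 (-128 to 127) is enough for small integers like these, and float32 is often enough for decimal columns that don't need full float64 precision.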
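A minimal sketch of downcasting might look like this. The column names and random data are assumptions made for the example, and the range check before the cast is one possible safeguard, not a required step.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages in 0-100 and a float measurement column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(0, 101, size=100_000, dtype=np.int64),
    "score": rng.random(100_000),  # float64 by default
})

before = df.memory_usage(deep=True)

# Check the range first: values outside -128..127 would overflow in int8.
assert df["age"].between(-128, 127).all()
df["age"] = df["age"].astype("int8")

# float32 keeps roughly 7 significant digits, which is less than float64.
df["score"] = df["score"].astype("float32")

after = df.memory_usage(deep=True)

print(before)
print(after)
```

Casting is lossy when the values don't fit: integers outside the target range overflow and floats lose precision, so it's worth checking the column's minimum and maximum before choosing a smaller type.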

Practice: The Optimizer

Challenge:

Create a DataFrame with 10 rows of "Yes" and "No" strings.
Check the memory usage (df.memory_usage(deep=True)).
Convert the column to "category" type and check memory usage again.

Quiz

Question 1 of 5

Which method of iterating over a DataFrame is generally the slowest and should be avoided?

Vectorization

.apply()

.iterrows()

.itertuples()

Key Takeaways

✅ Avoid loops. Use vectorization.
✅ Use astype('category') for text columns with repeated values.
✅ Check df.memory_usage(deep=True) to find heavy columns.