File Handling & Integration
Learn how to save and load NumPy data efficiently and how NumPy integrates with the broader Python ecosystem.
This hands-on tutorial focuses on the practical implementation of file handling and integration concepts.
Module 10: File Handling & Integration
In the real world, data doesn't live in code; it lives in files. NumPy provides specialized functions for handling CSVs, text files, and its own high-speed binary format.
Lesson 21: File I/O
Working with Text Files (CSV/TXT)
- np.loadtxt(fname, delimiter): Good for simple, clean numerical files.
- np.genfromtxt(fname, delimiter, names=True): More robust; handles missing values and headers.
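The difference between the two loaders is easiest to see side by side. The following is a minimal sketch, assuming two small in-memory CSVs built with io.StringIO in place of real files; the column names temp and pressure and all values are hypothetical.

```python
import io
import numpy as np

# Hypothetical clean CSV: every field is a number, no header row.
clean_csv = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
clean = np.loadtxt(clean_csv, delimiter=",")
print(clean.shape)  # (2, 3)

# Hypothetical messy CSV: a header row and one empty field.
messy_csv = io.StringIO("temp,pressure\n20.5,101.3\n,99.8\n22.1,100.2\n")
messy = np.genfromtxt(messy_csv, delimiter=",", names=True)

# names=True turns the header row into field names of a structured array.
print(messy.dtype.names)  # ('temp', 'pressure')

# The empty field is loaded as nan instead of raising an error.
print(messy["temp"])
```

With names=True the result is a structured array, so columns are accessed by name rather than by position.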
High-Speed Binary Files
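For very large datasets, text files such as CSVs are slow to read and write because every number must be converted to and from text. NumPy's binary formats, .npy and .npz, store the raw array bytes together with the dtype and shape, so they are typically much faster to load and often smaller on disk.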
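- np.save(file, arr): Saves a single array in .npy format.
- np.savez(file, a=arr1, b=arr2): Saves multiple arrays into a single, uncompressed .npz archive. Use np.savez_compressed with the same arguments to compress it.
- np.load(file): Loads the data back. A .npy file returns the array; a .npz file returns a dictionary-like object keyed by the names given when saving.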
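A round trip through both formats might look like the sketch below. It assumes a temporary directory so nothing is left on disk; the file names single.npy and bundle.npz and the sample arrays are hypothetical.

```python
import os
import tempfile
import numpy as np

arr1 = np.arange(6).reshape(2, 3)
arr2 = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # Single array: .npy
    npy_path = os.path.join(tmp, "single.npy")
    np.save(npy_path, arr1)
    loaded = np.load(npy_path)

    # Several arrays: .npz, keyed by the keyword names a and b
    npz_path = os.path.join(tmp, "bundle.npz")
    np.savez(npz_path, a=arr1, b=arr2)

    # np.load on a .npz returns a dictionary-like object; using it as a
    # context manager closes the underlying file when done.
    with np.load(npz_path) as bundle:
        names = sorted(bundle.files)
        a_back = bundle["a"]
        b_back = bundle["b"]

print(np.array_equal(loaded, arr1))  # True
print(names)                         # ['a', 'b']
```

The dtype and shape survive the round trip, which is not guaranteed when going through a text file.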
Lesson 22: NumPy with Ecosystem Libraries
NumPy is the backbone of the Python data science stack. Libraries such as Pandas and Matplotlib are built on it and accept NumPy arrays directly.
Integration with Pandas
A Pandas Series or DataFrame is typically backed by NumPy arrays, with labels (an index and column names) added on top.
- To NumPy: df.values or df.to_numpy()
- From NumPy: pd.DataFrame(numpy_arr)
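Both directions of the conversion can be sketched as follows, assuming pandas is installed. The column names x and y and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

numpy_arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# From NumPy: wrap the array in a DataFrame (column names are illustrative).
df = pd.DataFrame(numpy_arr, columns=["x", "y"])

# To NumPy: get the underlying values back as an ndarray.
back = df.to_numpy()
print(type(back).__name__)              # ndarray
print(np.array_equal(back, numpy_arr))  # True

# A single column (a Series) converts the same way.
col = df["x"].to_numpy()
print(col)
```

The labels (index and column names) are dropped on the way back to NumPy; only the values are kept.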
Integration with Matplotlib
Matplotlib's plotting functions accept NumPy arrays as coordinates. Other array-like inputs, such as lists, are converted to arrays internally.
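import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)  # 100 evenly spaced points from 0 to 10
y = np.sin(x)                # element-wise sine of every point
plt.plot(x, y)               # Matplotlib accepts x and y as NumPy arrays
plt.show()                   # display the figure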
Practice: CSV Data Cleaning
Challenge: Imagine you have a CSV with some missing values denoted by -999.
- Use np.genfromtxt() (conceptually) to load the data.
- Filter out all -999 values using boolean indexing.
- Calculate the mean of the remaining data.
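One possible solution to the challenge is sketched below. It assumes a tiny in-memory CSV in place of a real file; the values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical CSV content where -999 marks a missing reading.
csv_text = io.StringIO("12.0,15.5,-999\n-999,18.0,20.5\n")

# Step 1: load the data.
data = np.genfromtxt(csv_text, delimiter=",")

# Step 2: build a boolean mask and keep only the valid entries.
valid = data != -999
clean = data[valid]  # boolean indexing returns a flat 1-D array

# Step 3: mean of the remaining data.
mean_value = clean.mean()
print(mean_value)  # 16.5
```

Indexing with a boolean mask flattens the result, which is fine here because only the overall mean is needed.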
Quiz
Question 1 of 5: Which file format is specifically designed for high-speed NumPy storage?
Key Takeaways
✅ Use .npy for high-speed storage of NumPy data.
✅ genfromtxt is better than loadtxt for real-world (dirty) data.
✅ NumPy arrays are the common data structure shared by Pandas, SciPy, and Matplotlib.