Module 7

Libraries & Packages

pip, virtual environments, pandas, requests, and matplotlib — the Python ecosystem that transforms a scripting language into a professional powerhouse.

💡 What You'll Build
By the end of this module, you will install packages with pip, manage dependencies with virtual environments, analyze a CSV with pandas, call an API with requests, and create professional charts with matplotlib. You will set up a complete project structure with `requirements.txt` — the way real Python developers work.

How one library made Python the language of data science

In 2008, Wes McKinney, a quantitative analyst at a hedge fund, was frustrated. He needed to analyze financial data — millions of rows of stock prices, trades, and time series — and the existing tools were slow, clunky, and expensive. MATLAB cost thousands of dollars. R was powerful but awkward. Excel choked on large datasets.

So he built pandas — an open-source Python library for data analysis. He released it for free in 2009.

Today, pandas is installed over 100 million times per month. It single-handedly made Python the dominant language for data science, finance, and analytics. Data scientists at Netflix, Spotify, NASA, and every major bank use it daily. An entire industry runs on a library that one frustrated analyst built in his spare time.

This is the power of Python's ecosystem. You do not have to build everything yourself. Hundreds of thousands of libraries are available — free, one command to install — covering everything from data analysis to web scraping to machine learning.

In Module 6, you used csv and urllib — Python's built-in tools for data and APIs. They work, but they are verbose. Libraries like pandas, requests, and matplotlib do the same jobs in a fraction of the code, with far more power.

500,000+ packages on PyPI

100M+ pandas downloads per month

1 command to install any package

pip — the package installer

pip is Python's package manager. It downloads and installs libraries from PyPI (Python Package Index), the central repository of Python packages. Think of PyPI as an app store for Python code.

```bash
# Install a package
pip install pandas

# Install a specific version
pip install pandas==2.2.0

# Install multiple packages
pip install pandas matplotlib requests

# See what is installed
pip list

# Uninstall a package
pip uninstall pandas
```

⚠️ python vs python3, pip vs pip3
On some systems (especially Mac/Linux), `python` and `pip` point to Python 2, while `python3` and `pip3` point to Python 3. If `pip install pandas` gives a "command not found" error, try `pip3 install pandas`. To avoid confusion, always verify: `python --version` should say Python 3.x. If it says 2.x, use `python3` and `pip3` everywhere.
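A quick sanity check sidesteps the mismatch entirely: invoke pip *through* the interpreter you intend to use. This sketch assumes `python3` is on your PATH:

```shell
# Check which interpreter you actually have
python3 --version          # should report Python 3.x

# pip's version, plus which Python it is bound to
python3 -m pip --version

# Installing through the interpreter guarantees packages land in the
# same Python that runs your code:
# python3 -m pip install pandas
```

Because `python3 -m pip` runs pip as a module of that specific interpreter, it can never install into the "wrong" Python.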

Virtual environments — keeping projects separate

Imagine you have two projects. Project A needs pandas version 1.5. Project B needs pandas version 2.2. If both share the same Python installation, you cannot have both versions at once.

A virtual environment is a separate, isolated Python installation for each project. Think of it as giving each project its own toolbox instead of sharing one messy toolbox for everything.

```bash
# Create a virtual environment
python -m venv myproject_env

# Activate it (Mac/Linux)
source myproject_env/bin/activate

# Activate it (Windows)
myproject_env\Scripts\activate

# Your terminal now shows (myproject_env) — you are inside
# Now pip installs go into THIS environment only
pip install pandas

# Deactivate when done
deactivate
```

Step 1: Create — `python -m venv env_name` makes a new isolated environment

Step 2: Activate — `source env_name/bin/activate` (Mac/Linux) or `env_name\Scripts\activate` (Windows)

Step 3: Install — `pip install` commands now install ONLY in this environment

Step 4: Freeze — `pip freeze > requirements.txt` saves the exact list of packages

Step 5: Share — anyone can recreate your environment with `pip install -r requirements.txt`

🔑requirements.txt is your project's ingredient list
`pip freeze > requirements.txt` creates a file listing every installed package and its exact version. When a teammate clones your project, they run `pip install -r requirements.txt` and get the identical setup. Every professional Python project has a `requirements.txt`. It is the recipe card that ensures everyone is cooking with the same ingredients.
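For reference, a `requirements.txt` is just a plain text file, one pinned package per line. The packages and version numbers below are illustrative — yours will reflect whatever `pip freeze` finds in your environment:

```
pandas==2.2.0
requests==2.31.0
matplotlib==3.8.2
numpy==1.26.3
```

Note that `pip freeze` also lists packages you never installed directly — for example, numpy appears because pandas depends on it. That is intentional: pinning the full dependency tree is what makes the recreated environment identical.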

There Are No Dumb Questions

"Do I really need virtual environments? It seems like extra work."

For learning, you can skip them. For any real project you plan to share, deploy, or maintain, they are essential. Without them, installing a new package for one project can break a different project. The 30 seconds it takes to create a venv saves hours of debugging dependency conflicts later.

"What about conda? I see it mentioned in data science tutorials."

Conda is an alternative package manager popular in data science. It can manage both Python packages AND non-Python dependencies (like C libraries). For this course, pip and venv are simpler and sufficient. If you later do heavy data science or machine learning, you may want to explore conda or miniconda.

pandas — data analysis in one line

pandas turns Python into a spreadsheet on steroids. It reads CSVs, filters rows, calculates statistics, and handles missing data — tasks that take hours in Excel take seconds in pandas.

```python
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv("employees.csv")

# See the first 5 rows
print(df.head())

# Basic statistics
print(df.describe())

# Filter rows
engineers = df[df["department"] == "Engineering"]

# Calculate average salary
avg_salary = df["salary"].mean()
print(f"Average salary: ${avg_salary:,.2f}")

# Sort by salary, highest first
top_paid = df.sort_values("salary", ascending=False)

# Add a new column
df["bonus"] = df["salary"] * 0.1
```

| pandas operation | What it does | Excel equivalent |
| --- | --- | --- |
| `pd.read_csv("file.csv")` | Load a CSV | File → Open |
| `df.head()` | Show first 5 rows | Scroll to top |
| `df.describe()` | Summary statistics | Manual formulas |
| `df[df["col"] > 50]` | Filter rows | Filter button |
| `df["col"].mean()` | Average of column | `=AVERAGE()` |
| `df.sort_values("col")` | Sort by column | Sort A→Z |
| `df.groupby("col").mean()` | Group and aggregate | Pivot table |
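The last row of that table — `groupby` — is the pivot-table workhorse, so it deserves a tiny demonstration. Here is a minimal sketch using a DataFrame built inline (the department names and salaries are made-up example data):

```python
import pandas as pd

# Build a small DataFrame directly instead of reading a CSV
df = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Sales", "Sales", "Sales"],
    "salary": [95000, 105000, 60000, 70000, 80000],
})

# groupby splits the rows into one group per department,
# then mean() aggregates each group's salaries into a single number
avg_by_dept = df.groupby("department")["salary"].mean()
print(avg_by_dept)
# Engineering → 100000.0, Sales → 70000.0
```

One line replaces what would be a pivot table in Excel: split by a column, apply an aggregation, get one row per group.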


pandas in Action

25 XP

Create a CSV file called `sales_data.csv`:

```
product,category,units,price
Laptop,Electronics,120,999.99
Phone,Electronics,350,699.99
Desk,Furniture,85,249.99
Chair,Furniture,210,149.99
Keyboard,Electronics,500,79.99
Lamp,Furniture,175,39.99
```

Then write a pandas script that:

1. Reads the CSV
2. Adds a "revenue" column (units * price)
3. Finds the product with the highest revenue
4. Calculates the total revenue per category
5. Prints the results

_Hint: `df["revenue"] = df["units"] * df["price"]`. Use `df.sort_values("revenue", ascending=False).iloc[0]` for the top product. Use `df.groupby("category")["revenue"].sum()` for category totals._


requests — talking to APIs the easy way

In Module 6, we used urllib to call APIs. The requests library makes this much cleaner:

```python
import requests

# GET request — fetch data
response = requests.get("https://api.open-meteo.com/v1/forecast", params={
    "latitude": 40.71,
    "longitude": -74.01,
    "current_weather": True
})

data = response.json()    # Automatically parses JSON
weather = data["current_weather"]
print(f"NYC temperature: {weather['temperature']}°C")

# Check for errors
if response.status_code == 200:
    print("Success!")
else:
    print(f"Error: {response.status_code}")
```

urllib (built-in)

  • Verbose — 4+ lines per request
  • Must manually parse JSON
  • Error handling is clunky
  • Good for simple, quick scripts

requests (third-party)

  • Clean — 1-2 lines per request
  • .json() method built in
  • Status codes are easy to check
  • Best for real projects
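To make the comparison concrete, here is the same Open-Meteo call sketched with the built-in urllib: build the query string by hand, open the URL, read raw bytes, decode, then parse JSON yourself. The actual network request is left commented out so the sketch runs without internet access:

```python
import json
import urllib.parse
import urllib.request

# Manually encode the query parameters that requests handled via params=
params = urllib.parse.urlencode({
    "latitude": 40.71,
    "longitude": -74.01,
    "current_weather": "true",
})
url = f"https://api.open-meteo.com/v1/forecast?{params}"

# The request itself — uncomment to actually fetch:
# with urllib.request.urlopen(url) as response:
#     data = json.loads(response.read().decode("utf-8"))
#     print(data["current_weather"]["temperature"])
```

Every step that requests does for you — parameter encoding, decoding, JSON parsing — is explicit here. That is the whole case for the third-party library.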

There Are No Dumb Questions

"Why are there so many Python libraries? How do I know which to use?"

Python's philosophy is "batteries included" (good standard library) plus "there is a library for that" (rich ecosystem). For common tasks, one library dominates: pandas for data, requests for APIs, matplotlib for charts, flask or django for web. When in doubt, search "best Python library for X" — the community has strong consensus.

matplotlib — creating visualizations

matplotlib turns data into charts. It is the most widely used plotting library in Python and the foundation that other visualization libraries (seaborn, plotly) build on.

```python
import matplotlib.pyplot as plt

# Simple bar chart
products = ["Laptop", "Phone", "Desk", "Chair"]
revenue = [120000, 245000, 21250, 31500]

plt.figure(figsize=(8, 5))
plt.bar(products, revenue, color=["#3b82f6", "#8b5cf6", "#10b981", "#f59e0b"])
plt.title("Revenue by Product")
plt.xlabel("Product")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.savefig("revenue_chart.png")
plt.show()
```

```python
# Line chart — trends over time
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [45, 52, 49, 63, 58, 71]

plt.figure(figsize=(8, 5))
plt.plot(months, sales, marker="o", color="#8b5cf6", linewidth=2)
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("sales_trend.png")
plt.show()
```

```python
# Pie chart — proportions
categories = ["Electronics", "Furniture", "Clothing", "Food"]
sizes = [35, 25, 20, 20]

plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=categories, autopct="%1.1f%%",
        colors=["#3b82f6", "#10b981", "#f59e0b", "#ef4444"])
plt.title("Revenue by Category")
plt.tight_layout()
plt.savefig("category_pie.png")
plt.show()
```


Visualize Your Data

25 XP

Using the sales data from the previous challenge, create two charts:

1. A bar chart showing revenue by product
2. A pie chart showing revenue share by category

Save both as PNG files. Make them look professional with titles, labels, and colors.

_Hint: Use pandas to calculate the data, then pass the values to matplotlib. `plt.savefig("chart.png")` must come BEFORE `plt.show()` — otherwise the figure is cleared before saving._


The essential starter kit

Here are the libraries every Python beginner should know about:

| Library | Purpose | Install command |
| --- | --- | --- |
| pandas | Data analysis and manipulation | `pip install pandas` |
| requests | HTTP requests and APIs | `pip install requests` |
| matplotlib | Charts and visualizations | `pip install matplotlib` |
| numpy | Fast numerical computing (arrays, math) | `pip install numpy` |
| python-dotenv | Load environment variables from .env files | `pip install python-dotenv` |
| openpyxl | Read/write Excel files | `pip install openpyxl` |
| beautifulsoup4 | Web scraping (parse HTML) | `pip install beautifulsoup4` |
| pytest | Testing your code | `pip install pytest` |

<classifychallenge xp="25" title="Which Library?" items={["Read a 50,000-row CSV and calculate averages per category","Fetch the current Bitcoin price from a web API","Create a bar chart of sales by region","Multiply two large matrices of numbers","Scrape product prices from a website","Read and write Excel (.xlsx) files"]} options={["pandas","requests","matplotlib","numpy","beautifulsoup4","openpyxl"]} hint="pandas handles CSV/data analysis. requests handles HTTP/API calls. matplotlib creates charts. numpy does fast numerical computing. beautifulsoup4 parses HTML for web scraping. openpyxl reads and writes Excel files.">


Set Up a Professional Project

50 XP

Create a complete professional project setup:

1. Create a new directory called `my_data_project`
2. Create a virtual environment inside it
3. Activate the environment
4. Install pandas, requests, and matplotlib
5. Freeze the requirements: `pip freeze > requirements.txt`
6. Create a `main.py` that imports all three and prints their versions:

```python
import pandas as pd
import requests
import matplotlib

print(f"pandas: {pd.__version__}")
print(f"requests: {requests.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
```

_Hint: After `pip freeze > requirements.txt`, open the file and verify it lists all three packages with version numbers. This file is how you share your project's dependencies with others._


Back to Wes McKinney

McKinney built pandas because he was frustrated. The tools that existed were slow, expensive, or painful to use. So he built something better — and released it for free. That is the Python ecosystem in a nutshell: one frustrated developer's solution becomes the entire industry's standard tool. pandas, requests, matplotlib, numpy — all built by individuals or small teams who saw a problem and solved it.

You just learned to tap into that ecosystem. One pip install command gives you access to work that took brilliant developers years to build. That is leverage.

Next up: In the final module, every concept from this track comes together. You will build a complete data analysis project from scratch — load a dataset, clean it, analyze trends, create visualizations, and generate a report. It is the same workflow professional data analysts use every day, and it is portfolio-worthy.

Key takeaways

  • pip is Python's package manager — `pip install package_name` installs anything from PyPI's 500,000+ packages
  • Virtual environments isolate project dependencies — always use one for real projects (`python -m venv env_name`)
  • `requirements.txt` records your exact dependencies — create it with `pip freeze > requirements.txt`
  • pandas turns Python into a data analysis powerhouse — read CSVs, filter, aggregate, all in one line
  • requests makes API calls clean and simple — `response = requests.get(url)`, then `response.json()`
  • matplotlib creates publication-quality charts — bar, line, pie, scatter, and more
  • The ecosystem is Python's biggest strength — the language itself is simple; the libraries make it powerful


Knowledge Check

1. What is the purpose of a virtual environment in Python?

2. Which command saves a list of all installed packages and their versions to a file?

3. In pandas, what does `df[df['salary'] > 50000]` do?

4. Why must `plt.savefig('chart.png')` come BEFORE `plt.show()` in matplotlib?

Want to go deeper?

💻 Software Engineering Master Class

The complete software engineering program — from your first line of code to landing your first job.

View the full program