Your First Python Project
Build a complete data analysis project from scratch — read real data, clean it, analyze it, and create visualizations that tell a story.
The data analyst who got hired with one project
In 2022, a career changer named Marcus was applying for data analyst positions. He had no CS degree, no bootcamp certificate, and zero professional experience. What he did have was a GitHub repository with one project: a Python script that analyzed Airbnb listing data for his city, found pricing patterns by neighborhood, and produced a set of clean visualizations.
In his interview at a mid-size real estate company, the hiring manager pulled up Marcus's project on a screen. "Walk me through this," she said. Marcus explained every step — how he loaded the data, handled missing values, filtered outliers, grouped by neighborhood, and created charts that revealed which areas were overpriced.
He got the job. Not because of a degree or a certificate — because he could take a messy CSV file and turn it into insights a non-technical person could understand.
That is exactly what you are going to build in this module. Every concept from the previous seven modules comes together here: variables and f-strings (Module 2), control flow (Module 3), functions with parameters and returns (Module 4), lists and dictionaries (Module 5), CSV and JSON file handling (Module 6), and pandas, matplotlib, and virtual environments (Module 7). One project that proves you can write real Python.
The project: analyzing global temperature data
You will build a data analysis script that reads a dataset of global average temperatures by country, cleans it, analyzes trends, and creates visualizations. This is the same kind of work data analysts do every day at companies like Google, the World Bank, and consulting firms.
Here is what you will build:
Step 1: Set up the project — Create the directory, virtual environment, and install dependencies
Step 2: Create sample data — Generate a realistic CSV dataset to work with
Step 3: Load and explore — Read the CSV with pandas and understand its structure
Step 4: Clean the data — Handle missing values, fix types, remove outliers
Step 5: Analyze — Calculate statistics, find trends, compare groups
Step 6: Visualize — Create charts that tell the story
Step 7: Generate a report — Write the findings to a JSON summary file
Step 1: Set up the project
Every professional project starts with a clean structure.
```bash
mkdir temperature_analysis
cd temperature_analysis
python -m venv venv
source venv/bin/activate   # Mac/Linux
# venv\Scripts\activate    # Windows
pip install pandas matplotlib
pip freeze > requirements.txt
```

Your project folder should look like:

```
temperature_analysis/
    venv/
    requirements.txt
    create_data.py
    analyze.py
```
Step 2: Create sample data
In real projects, you get data from a database, API, or file download. For this project, let us generate a realistic dataset so everyone has the same starting point.
```python
# create_data.py
import csv
import random

random.seed(42)  # Makes random numbers reproducible

countries = ["USA", "UK", "Germany", "Japan", "Brazil",
             "India", "Australia", "Canada", "France", "Mexico"]

base_temps = {
    "USA": 12.5, "UK": 9.8, "Germany": 9.6, "Japan": 15.4,
    "Brazil": 25.0, "India": 24.7, "Australia": 21.8,
    "Canada": -0.5, "France": 11.2, "Mexico": 21.0
}

rows = [["year", "country", "avg_temp", "co2_emissions"]]
for year in range(1970, 2025):
    for country in countries:
        base = base_temps[country]
        warming = (year - 1970) * 0.02 + random.uniform(-0.5, 0.5)
        temp = round(base + warming, 1)
        # Some missing values (realistic)
        if random.random() < 0.03:
            temp = ""
        co2 = round(random.uniform(2.0, 16.0) + (year - 1970) * 0.05, 1)
        if random.random() < 0.02:
            co2 = ""
        rows.append([year, country, temp, co2])

with open("climate_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print(f"Created climate_data.csv with {len(rows) - 1} rows")
```

Run this script: `python create_data.py`. You should see "Created climate_data.csv with 550 rows."
Step 3: Load and explore
Now the real work begins. Create analyze.py:
```python
# analyze.py
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv("climate_data.csv")

# First look
print("Shape:", df.shape)  # (rows, columns)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
print(df.describe())
```

This is the exploration phase. Before you analyze anything, you need to understand what you are working with. How many rows? What columns? What types? Any missing values?
There Are No Dumb Questions
"What is df.shape?"
`df.shape` returns a tuple like `(550, 4)` — 550 rows and 4 columns. It is the first thing every data analyst checks. If you expected 1,000 rows and only see 500, something went wrong during loading.
"Why check for missing values before analyzing?"
Missing values cause incorrect results. If you average a column with missing values, pandas skips them — which might be fine, or might bias your results. Knowing WHERE data is missing helps you decide HOW to handle it. Always check before you calculate.
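Pandas' default NaN-skipping is easy to see on a toy Series (a minimal sketch; the numbers are made up for illustration):

```python
import pandas as pd

# A toy Series with one missing value (illustrative numbers)
temps = pd.Series([10.0, 12.0, None, 14.0])

# mean() skips missing values by default: (10 + 12 + 14) / 3
print(temps.mean())              # 12.0

# skipna=False propagates the gap instead of silently ignoring it
print(temps.mean(skipna=False))  # nan

# Count the gaps before you trust any average
print(temps.isnull().sum())      # 1
```

Whether 12.0 is a fair estimate depends on why the value is missing, which is exactly why the exploration step checks `isnull()` first.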
Explore Your Data
25 XP

Run the exploration code above and answer these questions:
1. How many rows and columns does the dataset have?
2. Which columns have missing values, and how many?
3. What is the minimum and maximum temperature in the dataset?
4. What is the average CO2 emissions across all countries?

_Hint: `df.shape` answers #1. `df.isnull().sum()` answers #2. `df.describe()` answers #3 and #4._

Step 4: Clean the data
Real data is messy. Our dataset has missing values that need handling.
```python
def clean_data(df):
    """Clean the climate dataset."""
    print(f"Rows before cleaning: {len(df)}")

    # Convert columns to proper types (handles empty strings)
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")

    # Option 1: Drop rows with missing values
    # df = df.dropna()

    # Option 2: Fill missing values with the column average (better here)
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())

    print(f"Missing values after cleaning: {df.isnull().sum().sum()}")
    print(f"Rows after cleaning: {len(df)}")
    return df


df = clean_data(df)
```

dropna() — delete rows with missing values
- ✓ Simple and safe
- ✓ Best when missing data is random and rare
- ✗ Loses data — fewer rows for analysis
- ✗ Might bias results if missing data is not random

fillna() — fill missing values
- ✓ Preserves all rows
- ✓ Best when you need every data point
- ✓ Mean, median, or forward-fill are common strategies
- ✗ Introduces approximation
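The trade-off shows up even on a tiny frame (a sketch with illustrative values, not data from climate_data.csv):

```python
import pandas as pd

# Toy frame with two missing temperatures (made-up values)
toy = pd.DataFrame({
    "country": ["USA", "UK", "USA", "UK"],
    "avg_temp": [12.5, None, 13.1, None],
})

# dropna(): fewer rows, but no invented values
dropped = toy.dropna()
print(len(dropped))  # 2

# fillna(): same row count, gaps replaced by the column mean (12.8)
filled = toy.copy()
filled["avg_temp"] = filled["avg_temp"].fillna(filled["avg_temp"].mean())
print(filled["avg_temp"].round(1).tolist())  # [12.5, 12.8, 13.1, 12.8]
```

Notice that both missing rows here belong to the UK, so filling with the overall mean pulls the UK toward US-like values. Median or per-group means are common refinements.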
Step 5: Analyze
Now extract insights from the clean data.
```python
def analyze_trends(df):
    """Calculate key statistics and trends."""
    results = {}

    # Overall statistics
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)

    # Temperature trend: compare first decade vs last decade
    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)

    # Hottest and coldest countries
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]

    # Highest CO2 emitter (most recent year)
    latest = df[df["year"] == df["year"].max()]
    top_emitter = latest.sort_values("co2_emissions", ascending=False).iloc[0]
    results["top_co2_emitter"] = top_emitter["country"]

    # Temperature by decade
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}

    return results


results = analyze_trends(df)
print("\n=== ANALYSIS RESULTS ===")
for key, value in results.items():
    print(f"{key}: {value}")
```

There Are No Dumb Questions
"What does `df['year'] // 10 * 10` do?"
It converts a year into its decade. `2023 // 10` is `202` (floor division drops the decimal), and `202 * 10` is `2020`. So 2023 becomes 2020, 2015 becomes 2010, and 1985 becomes 1980. This is a common trick for grouping time-series data by decade.
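You can check the trick in a few lines of plain Python:

```python
# (year // 10) * 10 snaps any year to the first year of its decade
for year in [1985, 2015, 2023]:
    print(year, "->", (year // 10) * 10)
# 1985 -> 1980
# 2015 -> 2010
# 2023 -> 2020
```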
Step 6: Visualize
Charts make data understandable. Create three visualizations:
```python
def create_visualizations(df):
    """Generate charts from the climate data."""

    # Chart 1: Temperature trend over time (global average per year)
    yearly_temp = df.groupby("year")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values,
             color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (°C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()
    print("Saved: chart_temperature_trend.png")

    # Chart 2: Average temperature by country (bar chart)
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_country_temps.png")

    # Chart 3: Temperature by decade (bar chart)
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_decade_temps.png")


create_visualizations(df)
```

Add a Fourth Chart
25 XP

Add a visualization that shows CO2 emissions by country as a horizontal bar chart. Sort countries from highest to lowest emissions. Use orange (#f59e0b) for the bars.

_Hint: Follow the same pattern as Chart 2, but aggregate "co2_emissions" instead of "avg_temp". Remember that `barh` draws bars bottom-to-top, so sort ascending with `sort_values(ascending=True)` to put the highest emitter at the top._

Step 7: Generate the report
The final step: save your findings as a structured JSON file that anyone can read.
```python
import json


def save_report(results):
    """Save analysis results to a JSON report."""
    report = {
        "title": "Global Climate Data Analysis",
        "dataset": "climate_data.csv",
        "findings": results,
        "charts_generated": [
            "chart_temperature_trend.png",
            "chart_country_temps.png",
            "chart_decade_temps.png"
        ]
    }
    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)
    print("\nSaved: analysis_report.json")


save_report(results)
```

The complete script
Here is the full analyze.py — everything in one file:
```python
import pandas as pd
import matplotlib.pyplot as plt
import json


def clean_data(df):
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())
    return df


def analyze_trends(df):
    results = {}
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)
    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}
    return results


def create_visualizations(df):
    yearly_temp = df.groupby("year")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values, color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (°C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()

    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()

    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()


def save_report(results):
    report = {
        "title": "Global Climate Data Analysis",
        "findings": results,
        "charts": ["chart_temperature_trend.png",
                   "chart_country_temps.png",
                   "chart_decade_temps.png"]
    }
    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)


# Main execution
print("Loading data...")
df = pd.read_csv("climate_data.csv")
print(f"Loaded {len(df)} rows, {len(df.columns)} columns")

print("\nCleaning data...")
df = clean_data(df)

print("\nAnalyzing trends...")
results = analyze_trends(df)
for key, value in results.items():
    print(f"  {key}: {value}")

print("\nCreating visualizations...")
create_visualizations(df)

print("\nSaving report...")
save_report(results)
print("\nDone! Check the generated files.")
```

Build and Run the Complete Project
50 XP

Build the complete project end to end:
1. Create the project directory and virtual environment
2. Run `create_data.py` to generate the dataset
3. Run `analyze.py` to perform the full analysis
4. Verify you have: `climate_data.csv`, three PNG charts, and `analysis_report.json`

Then extend the project with one of these features:
- Add a function that finds the country with the fastest temperature increase
- Create a scatter plot of temperature vs CO2 emissions
- Export the analysis to a nicely formatted text file instead of JSON

This project — with your extension — is portfolio-worthy. Push it to GitHub.

_Hint: For the fastest warming country, calculate each country's temperature in the first decade vs the last decade, then find the biggest difference._

What you have learned in this course
Look at how far you have come:
| Module | Skill | You can now... |
|---|---|---|
| 1. Why Python | Environment setup | Install Python, VS Code, write and run scripts |
| 2. Variables & Types | Data fundamentals | Store, convert, and manipulate strings, numbers, and booleans |
| 3. Control Flow | Logic | Make decisions with if/else and repeat with loops |
| 4. Functions | Reusability | Write modular, reusable code with parameters and returns |
| 5. Data Structures | Organization | Use lists, dicts, tuples, sets, and list comprehensions |
| 6. Files & Data | I/O | Read/write CSV, JSON, call APIs, handle errors |
| 7. Libraries | Ecosystem | Use pip, pandas, matplotlib, requests, virtual envs |
| 8. Final Project | Everything | Build a complete data analysis pipeline from scratch |
There Are No Dumb Questions
"What should I learn next after this course?"
Three paths, depending on your goal: Data analysis — learn SQL and advanced pandas (our Data Skills track covers this). Web development — learn Flask or Django to build web applications. Automation — start automating your own daily tasks with Python scripts. The best path is the one that solves a problem you personally care about.
"Is this enough to get a job?"
This course gives you the foundation. To be job-ready, build 2-3 more projects using real data, learn SQL, and practice on platforms like LeetCode or HackerRank. The project from this module is a strong portfolio piece — push it to GitHub and explain it in interviews.
Back to Marcus
Marcus walked into that interview with no CS degree, no bootcamp certificate, and no professional experience. He had one thing: a GitHub repository with a project exactly like the one you just built. He loaded a messy CSV, cleaned it, found patterns, and turned them into charts that told a story.
The hiring manager did not ask him about algorithms or data structures theory. She said, "Walk me through this." He did, and he got the job.
You just built the same thing. Push it to GitHub. When someone asks what you can do with Python, show them: a dataset loaded, cleaned, analyzed, and visualized — with three charts and a JSON report to prove it.
You started this track with print("Hello, World!"). You are ending it with a professional data analysis pipeline. That is the Python Fundamentals track, complete.
Key takeaways
- A professional project has structure: directory, virtual environment, requirements.txt, clean code in functions
- Data analysis follows a pipeline: load, explore, clean, analyze, visualize, report
- Always explore before analyzing — check shape, types, and missing values before calculating anything
- Cleaning is half the work — handling missing values and type conversions is what separates beginners from professionals
- groupby() is your most powerful tool in pandas — learn it well
- Visualizations tell the story — a chart communicates faster than a table of numbers
- This project is portfolio-worthy — push it to GitHub and explain it in job interviews
- You wrote real Python — every concept from 8 modules, combined into working software
Knowledge Check
1. What is the correct order for a data analysis pipeline?
2. In pandas, what does `df.groupby('country')['avg_temp'].mean()` do?
3. Why is `pd.to_numeric(df['column'], errors='coerce')` used during data cleaning?
4. What should you always do BEFORE starting any data analysis in pandas?