Your First Python Project
Build a complete data analysis project from scratch — read real data, clean it, analyze it, and create visualizations that tell a story.
The data analyst who got hired with one project
In 2022, a career changer named Marcus was applying for data analyst positions. He had no CS degree, no bootcamp certificate, and zero professional experience. What he did have was a GitHub repository with one project: a Python script that analyzed Airbnb listing data for his city, found pricing patterns by neighborhood, and produced a set of clean visualizations.
In his interview at a mid-size real estate company, the hiring manager pulled up Marcus's project on a screen. "Walk me through this," she said. Marcus explained every step — how he loaded the data, handled missing values, filtered outliers, grouped by neighborhood, and created charts that revealed which areas were overpriced.
He got the job. Not because of a degree or a certificate — because he could take a messy CSV file and turn it into insights a non-technical person could understand.
That is exactly what you are going to build in this module. Every concept from the previous seven modules comes together here: variables, control flow, functions, data structures, file I/O, libraries, and visualization. One project that proves you can write real Python.
The project: analyzing global temperature data
You will build a data analysis script that reads a dataset of global average temperatures by country, cleans it, analyzes trends, and creates visualizations. This is the same kind of work data analysts do every day at organizations like Google, the World Bank, and consulting firms.
Here is what you will build:
Step 1: Set up the project — Create the directory, virtual environment, and install dependencies
Step 2: Create sample data — Generate a realistic CSV dataset to work with
Step 3: Load and explore — Read the CSV with pandas and understand its structure
Step 4: Clean the data — Handle missing values, fix types, remove outliers
Step 5: Analyze — Calculate statistics, find trends, compare groups
Step 6: Visualize — Create charts that tell the story
Step 7: Generate a report — Write the findings to a JSON summary file
Step 1: Set up the project
Every professional project starts with a clean structure.
```bash
mkdir temperature_analysis
cd temperature_analysis

python -m venv venv
source venv/bin/activate   # Mac/Linux
# venv\Scripts\activate    # Windows

pip install pandas matplotlib
pip freeze > requirements.txt
```
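pip freeze pins the exact version of everything installed in the environment, including packages you did not install directly (pandas pulls in numpy, for example). The resulting requirements.txt looks something like this; the version numbers are illustrative and yours will differ:

```
matplotlib==3.9.0
numpy==2.0.1
pandas==2.2.2
...
```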
Your project folder should look like:
```
temperature_analysis/
    venv/
    requirements.txt
    create_data.py
    analyze.py
```
Step 2: Create sample data
In real projects, you get data from a database, API, or file download. For this project, let us generate a realistic dataset so everyone has the same starting point.
```python
# create_data.py
import csv
import random

random.seed(42)  # Makes random numbers reproducible

countries = ["USA", "UK", "Germany", "Japan", "Brazil",
             "India", "Australia", "Canada", "France", "Mexico"]

base_temps = {
    "USA": 12.5, "UK": 9.8, "Germany": 9.6, "Japan": 15.4,
    "Brazil": 25.0, "India": 24.7, "Australia": 21.8,
    "Canada": -0.5, "France": 11.2, "Mexico": 21.0
}

rows = [["year", "country", "avg_temp", "co2_emissions"]]
for year in range(1970, 2025):
    for country in countries:
        base = base_temps[country]
        warming = (year - 1970) * 0.02 + random.uniform(-0.5, 0.5)
        temp = round(base + warming, 1)
        # Some missing values (realistic)
        if random.random() < 0.03:
            temp = ""
        co2 = round(random.uniform(2.0, 16.0) + (year - 1970) * 0.05, 1)
        if random.random() < 0.02:
            co2 = ""
        rows.append([year, country, temp, co2])

with open("climate_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print(f"Created climate_data.csv with {len(rows) - 1} rows")
```
Run this script with `python create_data.py`. You should see "Created climate_data.csv with 550 rows."
Step 3: Load and explore
Now the real work begins. Create analyze.py:
```python
# analyze.py
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv("climate_data.csv")

# First look
print("Shape:", df.shape)  # (rows, columns)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
print(df.describe())
```
This is the exploration phase. Before you analyze anything, you need to understand what you are working with. How many rows? What columns? What types? Any missing values?
There Are No Dumb Questions
"What is df.shape?"
df.shapereturns a tuple like(550, 4)— 550 rows and 4 columns. It is the first thing every data analyst checks. If you expected 1,000 rows and only see 500, something went wrong during loading."Why check for missing values before analyzing?"
Missing values cause incorrect results. If you average a column with missing values, pandas skips them — which might be fine, or might bias your results. Knowing WHERE data is missing helps you decide HOW to handle it. Always check before you calculate.
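Because `df.shape` is a plain tuple, it unpacks directly into two variables, which is handy for log messages and sanity checks:

```python
# df.shape is a (rows, columns) tuple, so it unpacks cleanly
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```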
Exercise: Explore Your Data

Step 4: Clean the data
Real data is messy. Our dataset has missing values that need handling.
```python
def clean_data(df):
    """Clean the climate dataset."""
    print(f"Rows before cleaning: {len(df)}")

    # Convert columns to proper types (handles empty strings)
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")

    # Option 1: Drop rows with missing values
    # df = df.dropna()

    # Option 2: Fill missing values with column average (better)
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())

    print(f"Missing values after cleaning: {df.isnull().sum().sum()}")
    print(f"Rows after cleaning: {len(df)}")
    return df

df = clean_data(df)
```
Option 1: Drop missing rows (dropna)

- Simple and safe
- Loses data — fewer rows for analysis
- Best when missing data is random and rare
- Might bias results if missing data is not random

Option 2: Fill missing values (fillna)

- Preserves all rows
- Introduces approximation
- Best when you need every data point
- Mean, median, or forward-fill are common strategies
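If the global mean feels too blunt, a per-country fill is a common middle ground. This is a sketch, not part of the project script; it assumes the same df as above:

```python
# Sketch: fill each country's gaps with that country's own mean, so
# Canada's missing values are not pulled toward Brazil's warmer readings
df["avg_temp"] = df.groupby("country")["avg_temp"].transform(
    lambda s: s.fillna(s.mean())
)
```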
Step 5: Analyze
Now extract insights from the clean data.
```python
def analyze_trends(df):
    """Calculate key statistics and trends."""
    results = {}

    # Overall statistics
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)

    # Temperature trend: compare first decade vs last decade
    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)

    # Hottest and coldest countries
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]

    # Highest CO2 emitter (most recent year)
    latest = df[df["year"] == df["year"].max()]
    top_emitter = latest.sort_values("co2_emissions", ascending=False).iloc[0]
    results["top_co2_emitter"] = top_emitter["country"]

    # Temperature by decade
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}

    return results

results = analyze_trends(df)
print("\n=== ANALYSIS RESULTS ===")
for key, value in results.items():
    print(f"{key}: {value}")
```
There Are No Dumb Questions
"What does
df['year'] // 10 * 10do?"It converts a year into its decade.
2023 // 10is202(floor division drops the decimal).202 * 10is2020. So 2023 becomes 2020, 2015 becomes 2010, 1985 becomes 1980. This is a common trick for grouping time-series data by decade.
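You can see the mapping with a quick loop:

```python
# Floor division maps each year to the start of its decade
for year in (1985, 2015, 2023):
    print(year, "->", (year // 10) * 10)
# 1985 -> 1980
# 2015 -> 2010
# 2023 -> 2020
```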
Step 6: Visualize
Charts make data understandable. Create three visualizations:
```python
def create_visualizations(df):
    """Generate charts from the climate data."""

    # Chart 1: Temperature trend over time (global average per year)
    yearly_temp = df.groupby("year")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values,
             color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()
    print("Saved: chart_temperature_trend.png")

    # Chart 2: Average temperature by country (bar chart)
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_country_temps.png")

    # Chart 3: Temperature by decade (bar chart)
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_decade_temps.png")

create_visualizations(df)
```
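The exercise below asks for a fourth chart. One possibility, shown here only as a sketch (the chart choice and file name are assumptions, not part of the project spec), is a scatter plot relating CO2 emissions to temperature:

```python
# Sketch of one possible fourth chart: CO2 emissions vs temperature
plt.figure(figsize=(10, 5))
plt.scatter(df["co2_emissions"], df["avg_temp"], alpha=0.3, color="#10b981")
plt.title("CO2 Emissions vs Average Temperature", fontsize=14)
plt.xlabel("CO2 Emissions")
plt.ylabel("Average Temperature (C)")
plt.tight_layout()
plt.savefig("chart_co2_vs_temp.png", dpi=150)
plt.close()
```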
Exercise: Add a Fourth Chart

Step 7: Generate the report
The final step: save your findings as a structured JSON file that anyone can read.
```python
import json

def save_report(results):
    """Save analysis results to a JSON report."""
    report = {
        "title": "Global Climate Data Analysis",
        "dataset": "climate_data.csv",
        "findings": results,
        "charts_generated": [
            "chart_temperature_trend.png",
            "chart_country_temps.png",
            "chart_decade_temps.png"
        ]
    }
    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)
    print("\nSaved: analysis_report.json")

save_report(results)
```
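An optional sanity check, not part of the script: load the report back to confirm it is valid JSON.

```python
# Optional: read the report back and print the findings
with open("analysis_report.json") as f:
    report = json.load(f)
print(report["findings"])
```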
The complete script
Here is the full analyze.py — everything in one file:
```python
import pandas as pd
import matplotlib.pyplot as plt
import json

def clean_data(df):
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())
    return df

def analyze_trends(df):
    results = {}
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)
    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]
    latest = df[df["year"] == df["year"].max()]
    top_emitter = latest.sort_values("co2_emissions", ascending=False).iloc[0]
    results["top_co2_emitter"] = top_emitter["country"]
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}
    return results

def create_visualizations(df):
    yearly_temp = df.groupby("year")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values, color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()

    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()

    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()

def save_report(results):
    report = {
        "title": "Global Climate Data Analysis",
        "dataset": "climate_data.csv",
        "findings": results,
        "charts_generated": ["chart_temperature_trend.png",
                             "chart_country_temps.png",
                             "chart_decade_temps.png"]
    }
    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)

# Main execution
print("Loading data...")
df = pd.read_csv("climate_data.csv")
print(f"Loaded {len(df)} rows, {len(df.columns)} columns")

print("\nCleaning data...")
df = clean_data(df)

print("\nAnalyzing trends...")
results = analyze_trends(df)
for key, value in results.items():
    print(f"  {key}: {value}")

print("\nCreating visualizations...")
create_visualizations(df)

print("\nSaving report...")
save_report(results)
print("\nDone! Check the generated files.")
```
Exercise: Build and Run the Complete Project

What you have learned in this course
Look at how far you have come:
| Module | Skill | You can now... |
|---|---|---|
| 1. Why Python | Environment setup | Install Python, VS Code, write and run scripts |
| 2. Variables & Types | Data fundamentals | Store, convert, and manipulate strings, numbers, and booleans |
| 3. Control Flow | Logic | Make decisions with if/else and repeat with loops |
| 4. Functions | Reusability | Write modular, reusable code with parameters and returns |
| 5. Data Structures | Organization | Use lists, dicts, tuples, sets, and list comprehensions |
| 6. Files & Data | I/O | Read/write CSV, JSON, call APIs, handle errors |
| 7. Libraries | Ecosystem | Use pip, pandas, matplotlib, requests, virtual envs |
| 8. Final Project | Everything | Build a complete data analysis pipeline from scratch |
There Are No Dumb Questions
"What should I learn next after this course?"
Three paths, depending on your goal: Data analysis — learn SQL and advanced pandas (our Data Skills track covers this). Web development — learn Flask or Django to build web applications. Automation — start automating your own daily tasks with Python scripts. The best path is the one that solves a problem you personally care about.
"Is this enough to get a job?"
This course gives you the foundation. To be job-ready, build 2-3 more projects using real data, learn SQL, and practice on platforms like LeetCode or HackerRank. The project from this module is a strong portfolio piece — push it to GitHub and explain it in interviews.
Key takeaways
- A professional project has structure: directory, virtual environment, requirements.txt, clean code in functions
- Data analysis follows a pipeline: load, explore, clean, analyze, visualize, report
- Always explore before analyzing — check shape, types, and missing values before calculating anything
- Cleaning is half the work — handling missing values and type conversions is what separates beginners from professionals
- groupby() is your most powerful tool in pandas — learn it well (see the quick example after this list)
- Visualizations tell the story — a chart communicates faster than a table of numbers
- This project is portfolio-worthy — push it to GitHub and explain it in job interviews
- You wrote real Python — every concept from 8 modules, combined into working software
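As a quick reminder of the groupby() pattern, here is a tiny self-contained example using toy data (not the climate dataset):

```python
import pandas as pd

# Split rows by country, apply mean() to each group, combine the results
toy = pd.DataFrame({
    "country": ["USA", "USA", "Canada", "Canada"],
    "avg_temp": [12.0, 13.0, -1.0, 0.0],
})
print(toy.groupby("country")["avg_temp"].mean())
# country
# Canada    -0.5
# USA       12.5
# Name: avg_temp, dtype: float64
```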
Knowledge Check
1. What is the correct order for a data analysis pipeline?
2. In pandas, what does `df.groupby('country')['avg_temp'].mean()` do?
3. Why is `pd.to_numeric(df['column'], errors='coerce')` used during data cleaning?
4. What should you always do BEFORE starting any data analysis in pandas?