Module 8

Your First Python Project

Build a complete data analysis project from scratch — read real data, clean it, analyze it, and create visualizations that tell a story.

💡 What You'll Build
In this capstone module, you will build a complete data analysis project from scratch: generate a climate dataset, load it with pandas, clean missing values, calculate trends by country and decade, create three publication-quality charts with matplotlib, and save a JSON report. This is the same workflow professional data analysts use every day — and it is portfolio-worthy.

The data analyst who got hired with one project

In 2022, a career changer named Marcus was applying for data analyst positions. He had no CS degree, no bootcamp certificate, and zero professional experience. What he did have was a GitHub repository with one project: a Python script that analyzed Airbnb listing data for his city, found pricing patterns by neighborhood, and produced a set of clean visualizations.

In his interview at a mid-size real estate company, the hiring manager pulled up Marcus's project on a screen. "Walk me through this," she said. Marcus explained every step — how he loaded the data, handled missing values, filtered outliers, grouped by neighborhood, and created charts that revealed which areas were overpriced.

He got the job. Not because of a degree or a certificate — because he could take a messy CSV file and turn it into insights a non-technical person could understand.

That is exactly what you are going to build in this module. Every concept from the previous seven modules comes together here: variables and f-strings (Module 2), control flow (Module 3), functions with parameters and returns (Module 4), lists and dictionaries (Module 5), CSV and JSON file handling (Module 6), and pandas, matplotlib, and virtual environments (Module 7). One project that proves you can write real Python.

  • 7 modules of skills, combined into 1 project
  • 1 project to prove you can code
  • 100+ lines of professional Python

The project: analyzing global temperature data

You will build a data analysis script that reads a dataset of global average temperatures by country, cleans it, analyzes trends, and creates visualizations. This is the same kind of work data analysts do every day at companies like Google, the World Bank, and consulting firms.

Here is what you will build:

Step 1: Set up the project — Create the directory, virtual environment, and install dependencies

Step 2: Create sample data — Generate a realistic CSV dataset to work with

Step 3: Load and explore — Read the CSV with pandas and understand its structure

Step 4: Clean the data — Handle missing values, fix types, remove outliers

Step 5: Analyze — Calculate statistics, find trends, compare groups

Step 6: Visualize — Create charts that tell the story

Step 7: Generate a report — Write the findings to a JSON summary file

Step 1: Set up the project

Every professional project starts with a clean structure.

bash
mkdir temperature_analysis
cd temperature_analysis
python -m venv venv
source venv/bin/activate     # Mac/Linux
# venv\Scripts\activate      # Windows
pip install pandas matplotlib
pip freeze > requirements.txt

Your project folder should look like:

temperature_analysis/
  venv/
  requirements.txt
  create_data.py
  analyze.py

Step 2: Create sample data

In real projects, you get data from a database, API, or file download. For this project, let us generate a realistic dataset so everyone has the same starting point.

python
# create_data.py
import csv
import random

random.seed(42)  # Makes random numbers reproducible

countries = ["USA", "UK", "Germany", "Japan", "Brazil",
             "India", "Australia", "Canada", "France", "Mexico"]

base_temps = {
    "USA": 12.5, "UK": 9.8, "Germany": 9.6, "Japan": 15.4,
    "Brazil": 25.0, "India": 24.7, "Australia": 21.8,
    "Canada": -0.5, "France": 11.2, "Mexico": 21.0
}

rows = [["year", "country", "avg_temp", "co2_emissions"]]

for year in range(1970, 2025):
    for country in countries:
        base = base_temps[country]
        warming = (year - 1970) * 0.02 + random.uniform(-0.5, 0.5)
        temp = round(base + warming, 1)

        # Some missing values (realistic)
        if random.random() < 0.03:
            temp = ""

        co2 = round(random.uniform(2.0, 16.0) + (year - 1970) * 0.05, 1)
        if random.random() < 0.02:
            co2 = ""

        rows.append([year, country, temp, co2])

with open("climate_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print(f"Created climate_data.csv with {len(rows) - 1} rows")

Run this script: python create_data.py. You should see "Created climate_data.csv with 550 rows."

🔑 Why generate data instead of using a real dataset?
In tutorials, real datasets often have download links that break, formats that change without notice, or files too large to work with comfortably. Generated data ensures everyone can follow along regardless of internet connection. The skills transfer perfectly to real datasets — once you can analyze this, you can analyze any CSV. After finishing this module, try downloading real climate data from [datahub.io](https://datahub.io) and running the same analysis.

Step 3: Load and explore

Now the real work begins. Create analyze.py:

python
# analyze.py
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv("climate_data.csv")

# First look
print("Shape:", df.shape)          # (rows, columns)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
print(df.describe())

This is the exploration phase. Before you analyze anything, you need to understand what you are working with. How many rows? What columns? What types? Any missing values?

There Are No Dumb Questions

"What is df.shape?"

df.shape returns a tuple like (550, 4) — 550 rows and 4 columns. It is the first thing every data analyst checks. If you expected 1,000 rows and only see 500, something went wrong during loading.
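
A two-line check makes this concrete (the tiny frame below is illustrative, not the climate dataset):

```python
import pandas as pd

# A toy DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(toy.shape)  # (3, 2)
```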

"Why check for missing values before analyzing?"

Missing values cause incorrect results. If you average a column with missing values, pandas skips them — which might be fine, or might bias your results. Knowing WHERE data is missing helps you decide HOW to handle it. Always check before you calculate.
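
A quick toy example (illustrative values, not the climate dataset) shows how the skipping changes a result:

```python
import pandas as pd
import numpy as np

# Illustrative column with one missing reading
temps = pd.Series([10.0, 20.0, np.nan, 30.0])

# mean() silently skips NaN: (10 + 20 + 30) / 3
print(temps.mean())            # 20.0

# Treating the gap as 0 instead gives a different answer
print(temps.fillna(0).mean())  # 15.0
```

Neither number is "the" right answer — which one you want depends on what the missing value means, which is exactly why you check first.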


Explore Your Data

25 XP

Run the exploration code above and answer these questions:

1. How many rows and columns does the dataset have?
2. Which columns have missing values, and how many?
3. What is the minimum and maximum temperature in the dataset?
4. What is the average of co2_emissions across all countries?

_Hint: `df.shape` answers #1. `df.isnull().sum()` answers #2. `df.describe()` answers #3 and #4._


Step 4: Clean the data

Real data is messy. Our dataset has missing values that need handling.

python
def clean_data(df):
    """Clean the climate dataset."""
    print(f"Rows before cleaning: {len(df)}")

    # Convert columns to proper types (handles empty strings)
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")

    # Option 1: Drop rows with missing values
    # df = df.dropna()

    # Option 2: Fill missing values with column average (better)
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())

    print(f"Missing values after cleaning: {df.isnull().sum().sum()}")
    print(f"Rows after cleaning: {len(df)}")

    return df

df = clean_data(df)

dropna() — Delete missing rows

  • Simple and safe
  • Loses data — fewer rows for analysis
  • Best when missing data is random and rare
  • Might bias results if missing data is not random

fillna() — Fill missing values

  • Preserves all rows
  • Introduces approximation
  • Best when you need every data point
  • Mean, median, or forward-fill are common strategies
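
The two strategies side by side, on a tiny illustrative frame (not the climate dataset):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing temperature
toy = pd.DataFrame({"temp": [10.0, np.nan, 30.0]})

dropped = toy.dropna()                            # the NaN row is gone
filled = toy["temp"].fillna(toy["temp"].mean())   # NaN becomes the mean, 20.0

print(len(toy), len(dropped))  # 3 2
print(filled.tolist())         # [10.0, 20.0, 30.0]
```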

Step 5: Analyze

Now extract insights from the clean data.

python
def analyze_trends(df):
    """Calculate key statistics and trends."""
    results = {}

    # Overall statistics
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)

    # Temperature trend: compare first decade vs last decade
    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)

    # Hottest and coldest countries
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]

    # Highest CO2 emitter (most recent year)
    latest = df[df["year"] == df["year"].max()]
    top_emitter = latest.sort_values("co2_emissions", ascending=False).iloc[0]
    results["top_co2_emitter"] = top_emitter["country"]

    # Temperature by decade
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}

    return results

results = analyze_trends(df)

print("\n=== ANALYSIS RESULTS ===")
for key, value in results.items():
    print(f"{key}: {value}")

🔑 groupby() is the most powerful pandas method
`df.groupby("country")["avg_temp"].mean()` splits the data by country, calculates the average temperature for each group, and returns the results. This single line replaces what would be a complex loop with dictionaries. Mastering `groupby()` is the difference between writing 20 lines and writing 1 line that does the same thing.

There Are No Dumb Questions

"What does df['year'] // 10 * 10 do?"

It converts a year into its decade. 2023 // 10 is 202 (floor division drops the decimal). 202 * 10 is 2020. So 2023 becomes 2020, 2015 becomes 2010, 1985 becomes 1980. This is a common trick for grouping time-series data by decade.
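
You can verify the trick in a couple of lines:

```python
# Floor-divide by 10, then multiply back to snap a year to its decade
for year in [1985, 2015, 2023]:
    print(year, "->", (year // 10) * 10)
# 1985 -> 1980
# 2015 -> 2010
# 2023 -> 2020
```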

Step 6: Visualize

Charts make data understandable. Create three visualizations:

python
def create_visualizations(df):
    """Generate charts from the climate data."""

    # Chart 1: Temperature trend over time (global average per year)
    yearly_temp = df.groupby("year")["avg_temp"].mean()

    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values,
             color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (°C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()
    print("Saved: chart_temperature_trend.png")

    # Chart 2: Average temperature by country (bar chart)
    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()

    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_country_temps.png")

    # Chart 3: Temperature by decade (bar chart)
    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()

    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()
    print("Saved: chart_decade_temps.png")

create_visualizations(df)


Add a Fourth Chart

25 XP

Add a visualization that shows CO2 emissions by country as a horizontal bar chart, with countries sorted from highest to lowest emissions. Use orange (#f59e0b) for the bars. _Hint: Follow the same pattern as Chart 2, but aggregate `co2_emissions` instead of `avg_temp`: group by country, take the mean, then call `sort_values(ascending=True)` so the biggest emitter lands at the top (`barh` draws bars bottom-to-top)._


Step 7: Generate the report

The final step: save your findings as a structured JSON file that anyone can read.

python
import json

def save_report(results):
    """Save analysis results to a JSON report."""
    report = {
        "title": "Global Climate Data Analysis",
        "dataset": "climate_data.csv",
        "findings": results,
        "charts_generated": [
            "chart_temperature_trend.png",
            "chart_country_temps.png",
            "chart_decade_temps.png"
        ]
    }

    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)

    print("\nSaved: analysis_report.json")

save_report(results)

The complete script

Here is the full analyze.py — everything in one file:

python
import pandas as pd
import matplotlib.pyplot as plt
import json


def clean_data(df):
    df["avg_temp"] = pd.to_numeric(df["avg_temp"], errors="coerce")
    df["co2_emissions"] = pd.to_numeric(df["co2_emissions"], errors="coerce")
    df["avg_temp"] = df["avg_temp"].fillna(df["avg_temp"].mean())
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emissions"].mean())
    return df


def analyze_trends(df):
    results = {}
    results["total_records"] = len(df)
    results["year_range"] = f"{df['year'].min()}-{df['year'].max()}"
    results["avg_global_temp"] = round(df["avg_temp"].mean(), 2)

    early = df[df["year"] <= 1980]["avg_temp"].mean()
    recent = df[df["year"] >= 2015]["avg_temp"].mean()
    results["temp_change"] = round(recent - early, 2)

    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    results["coldest_country"] = country_temps.index[0]
    results["hottest_country"] = country_temps.index[-1]

    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    results["decade_trends"] = {str(k): round(v, 2) for k, v in decade_temps.items()}

    return results


def create_visualizations(df):
    yearly_temp = df.groupby("year")["avg_temp"].mean()

    plt.figure(figsize=(10, 5))
    plt.plot(yearly_temp.index, yearly_temp.values, color="#ef4444", linewidth=2)
    plt.title("Global Average Temperature Over Time", fontsize=14)
    plt.xlabel("Year")
    plt.ylabel("Average Temperature (°C)")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("chart_temperature_trend.png", dpi=150)
    plt.close()

    country_temps = df.groupby("country")["avg_temp"].mean().sort_values()
    plt.figure(figsize=(10, 5))
    colors = ["#3b82f6" if t < 15 else "#ef4444" for t in country_temps.values]
    plt.barh(country_temps.index, country_temps.values, color=colors)
    plt.title("Average Temperature by Country", fontsize=14)
    plt.xlabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_country_temps.png", dpi=150)
    plt.close()

    df["decade"] = (df["year"] // 10) * 10
    decade_temps = df.groupby("decade")["avg_temp"].mean()
    plt.figure(figsize=(10, 5))
    plt.bar([str(d) + "s" for d in decade_temps.index],
            decade_temps.values, color="#8b5cf6")
    plt.title("Average Temperature by Decade", fontsize=14)
    plt.xlabel("Decade")
    plt.ylabel("Average Temperature (°C)")
    plt.tight_layout()
    plt.savefig("chart_decade_temps.png", dpi=150)
    plt.close()


def save_report(results):
    report = {
        "title": "Global Climate Data Analysis",
        "findings": results,
        "charts": ["chart_temperature_trend.png",
                    "chart_country_temps.png",
                    "chart_decade_temps.png"]
    }
    with open("analysis_report.json", "w") as f:
        json.dump(report, f, indent=2)


# Main execution
print("Loading data...")
df = pd.read_csv("climate_data.csv")
print(f"Loaded {len(df)} rows, {len(df.columns)} columns")

print("\nCleaning data...")
df = clean_data(df)

print("\nAnalyzing trends...")
results = analyze_trends(df)
for key, value in results.items():
    print(f"  {key}: {value}")

print("\nCreating visualizations...")
create_visualizations(df)

print("\nSaving report...")
save_report(results)

print("\nDone! Check the generated files.")


Build and Run the Complete Project

50 XP

Build the complete project end to end:

1. Create the project directory and virtual environment
2. Run `create_data.py` to generate the dataset
3. Run `analyze.py` to perform the full analysis
4. Verify you have: `climate_data.csv`, three PNG charts, and `analysis_report.json`

Then extend the project with one of these features:

  • Add a function that finds the country with the fastest temperature increase
  • Create a scatter plot of temperature vs CO2 emissions
  • Export the analysis to a nicely formatted text file instead of JSON

This project — with your extension — is portfolio-worthy. Push it to GitHub.

_Hint: For the fastest warming country, calculate each country's temperature in the first decade vs the last decade, then find the biggest difference._


What you have learned in this course

Look at how far you have come:

| Module | Skill | You can now... |
| --- | --- | --- |
| 1. Why Python | Environment setup | Install Python, VS Code, write and run scripts |
| 2. Variables & Types | Data fundamentals | Store, convert, and manipulate strings, numbers, and booleans |
| 3. Control Flow | Logic | Make decisions with if/else and repeat with loops |
| 4. Functions | Reusability | Write modular, reusable code with parameters and returns |
| 5. Data Structures | Organization | Use lists, dicts, tuples, sets, and list comprehensions |
| 6. Files & Data | I/O | Read/write CSV, JSON, call APIs, handle errors |
| 7. Libraries | Ecosystem | Use pip, pandas, matplotlib, requests, virtual envs |
| 8. Final Project | Everything | Build a complete data analysis pipeline from scratch |

There Are No Dumb Questions

"What should I learn next after this course?"

Three paths, depending on your goal: Data analysis — learn SQL and advanced pandas (our Data Skills track covers this). Web development — learn Flask or Django to build web applications. Automation — start automating your own daily tasks with Python scripts. The best path is the one that solves a problem you personally care about.

"Is this enough to get a job?"

This course gives you the foundation. To be job-ready, build 2-3 more projects using real data, learn SQL, and practice on platforms like LeetCode or HackerRank. The project from this module is a strong portfolio piece — push it to GitHub and explain it in interviews.

Back to Marcus

Marcus walked into that interview with no CS degree, no bootcamp certificate, and no professional experience. He had one thing: a GitHub repository with a project exactly like the one you just built. He loaded a messy CSV, cleaned it, found patterns, and turned them into charts that told a story.

The hiring manager did not ask him about algorithms or data structures theory. She said, "Walk me through this." He did, and he got the job.

You just built the same thing. Push it to GitHub. When someone asks what you can do with Python, show them: a dataset loaded, cleaned, analyzed, and visualized — with three charts and a JSON report to prove it.

You started this track with print("Hello, World!"). You are ending it with a professional data analysis pipeline. That is the Python Fundamentals track, complete.

Key takeaways

  • A professional project has structure: directory, virtual environment, requirements.txt, clean code in functions
  • Data analysis follows a pipeline: load, explore, clean, analyze, visualize, report
  • Always explore before analyzing — check shape, types, and missing values before calculating anything
  • Cleaning is half the work — handling missing values and type conversions is what separates beginners from professionals
  • groupby() is your most powerful tool in pandas — learn it well
  • Visualizations tell the story — a chart communicates faster than a table of numbers
  • This project is portfolio-worthy — push it to GitHub and explain it in job interviews
  • You wrote real Python — every concept from 8 modules, combined into working software


Knowledge Check

1. What is the correct order for a data analysis pipeline?

2. In pandas, what does `df.groupby('country')['avg_temp'].mean()` do?

3. Why is `pd.to_numeric(df['column'], errors='coerce')` used during data cleaning?

4. What should you always do BEFORE starting any data analysis in pandas?

Want to go deeper?

💻 Software Engineering Master Class

The complete software engineering program — from your first line of code to landing your first job.

View the full program