Advanced SQL — Data Skills Essentials

The analyst who got promoted with one query

A product analyst at a fintech startup had been writing basic SQL for a year — SELECT, WHERE, GROUP BY. Solid but limited. One Friday, the VP of Product asked a question that stumped the whole analytics team: "For each customer, show me their most recent transaction, how it compares to their average, and rank all customers by lifetime value — but only for customers who made at least 3 purchases in the last 90 days."

Everyone else said they'd need to build it in Python or export the data to Excel. The analyst opened her SQL editor and wrote a single query using a CTE to filter active customers, a window function to calculate running averages and rankings, and a self-join to pull the most recent transaction.

Twelve lines of SQL. The VP had his answer in 30 seconds. The analyst was promoted to Senior Analyst two months later.

The difference between basic and advanced SQL isn't about writing more code. It's about solving harder problems in fewer steps.

🔑The career multiplier

Basic SQL (SELECT, WHERE, GROUP BY) gets you in the door. Advanced SQL (JOINs, window functions, CTEs, subqueries) gets you promoted. In senior analyst and analytics engineer interviews, window functions and CTEs are the most common technical questions.

JOINs: connecting the pieces

Real databases don't store everything in one table. Customer info lives in customers, orders in orders, products in products. JOINs connect them.

The four types of JOINs

JOIN type	What it returns	When to use it
INNER JOIN	Only rows that match in both tables	When you want customers who HAVE orders
LEFT JOIN	All rows from the left table + matches from right	When you want ALL customers, even those with no orders
RIGHT JOIN	All rows from the right table + matches from left	Rarely used — just swap table order and use LEFT JOIN
FULL OUTER JOIN	All rows from both tables	When you need to find mismatches between tables

-- INNER JOIN: Only customers who have placed orders
SELECT c.name, o.product, o.amount
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id;

-- LEFT JOIN: ALL customers, even if they haven't ordered
SELECT c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;

The LEFT JOIN is the workhorse of analytics. If a customer has no orders, the order columns show NULL — which tells you something valuable: this is a customer who hasn't bought anything.

✗ Without AI

✗Only matching rows
✗Drops customers with no orders
✗Smaller result set
✗Use when both sides must exist

✓ With AI

✓All rows from left table
✓Keeps customers with no orders (NULLs)
✓Larger result set
✓Use when you want the full picture

There Are No Dumb Questions

"How do I know which table is 'left' and which is 'right'?"

The table in the FROM clause is the left table. The table in the JOIN clause is the right table. FROM customers LEFT JOIN orders — customers is left, orders is right. Every customer appears; only matching orders appear.

"What happens if I JOIN on a column that isn't unique?"

You get duplicate rows. If customer 42 has 5 orders, a JOIN on customer_id produces 5 rows for customer 42 — one for each order. This is usually correct. But if you accidentally JOIN on a non-unique column (like name), you'll get a "cartesian explosion" — thousands of false matches. Always JOIN on IDs.

⚡

JOIN Exercises

25 XP

You have three tables: - `employees` (id, name, department_id, salary) - `departments` (id, dept_name, manager_id) - `projects` (id, project_name, lead_employee_id) Write the SQL for each: 1. List all employees with their department name (only employees who belong to a department): 2. List ALL departments, including those with no employees: 3. Show each project alongside its lead employee's name and department: 4. Find departments that have no employees assigned: *Hint: #4 uses a LEFT JOIN with a WHERE check for NULL.*

Subqueries: queries inside queries

A subquery is a SELECT statement nested inside another query. Think of it as a two-step question: first find something, then use that result.

In the WHERE clause

-- Find customers who spent more than the average
SELECT name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);

The inner query calculates the average. The outer query filters using that average. You can't do this with a single WHERE clause.

In the FROM clause (derived tables)

-- Average orders per customer (two-step: first count, then average)
SELECT AVG(order_count) as avg_orders_per_customer
FROM (
  SELECT customer_id, COUNT(*) as order_count
  FROM orders
  GROUP BY customer_id
) customer_orders;

With IN for list filtering

-- Find products that have never been ordered
SELECT name
FROM products
WHERE id NOT IN (SELECT DISTINCT product_id FROM orders);

⚠️Subquery performance

Subqueries in the WHERE clause run once for each row in the outer query (correlated subqueries) or once total (non-correlated). If your query is slow, the subquery is often the culprit. CTEs and JOINs frequently perform better on large datasets.

CTEs: readable, reusable query blocks

A Common Table Expression (CTE) is a named temporary result set. It makes complex queries readable by breaking them into logical steps.

WITH active_customers AS (
  SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total_spent
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY customer_id
  HAVING COUNT(*) >= 3
),
customer_details AS (
  SELECT c.name, c.email, ac.order_count, ac.total_spent
  FROM customers c
  JOIN active_customers ac ON c.id = ac.customer_id
)
SELECT *
FROM customer_details
ORDER BY total_spent DESC;

Without the CTE, this would be a mess of nested subqueries. With CTEs, each step has a name and a purpose:

active_customers — filter to customers with 3+ orders in 2025
customer_details — join in names and emails
Final SELECT — sort and display

There Are No Dumb Questions

"What's the difference between a CTE and a subquery?"

Functionally, they're often interchangeable. The difference is readability. A CTE sits at the top of the query with a clear name. A subquery is buried inside another query. For anything more than one level of nesting, CTEs are dramatically easier to read, debug, and maintain.

"Can I reference a CTE multiple times in the same query?"

Yes — that's one of the biggest advantages over subqueries. If you define WITH monthly_revenue AS (...), you can JOIN to it, filter by it, and aggregate it multiple times without re-running the query.

Window functions: the advanced analytics power tool

Window functions perform calculations across a set of rows related to the current row — without collapsing the rows into groups like GROUP BY does.

The key difference: GROUP BY reduces 1,000 rows to 10 groups. Window functions keep all 1,000 rows but add a calculated column.

ROW_NUMBER, RANK, DENSE_RANK

-- Rank customers by total spending
SELECT
  name,
  total_spent,
  ROW_NUMBER() OVER (ORDER BY total_spent DESC) as row_num,
  RANK() OVER (ORDER BY total_spent DESC) as rank,
  DENSE_RANK() OVER (ORDER BY total_spent DESC) as dense_rank
FROM customers;

Function	Ties behavior	Example (scores: 100, 95, 95, 90)
ROW_NUMBER	No ties — arbitrary order	1, 2, 3, 4
RANK	Same rank for ties, gaps after	1, 2, 2, 4
DENSE_RANK	Same rank for ties, no gaps	1, 2, 2, 3

Running totals and moving averages

-- Running total of revenue by month
SELECT
  month,
  revenue,
  SUM(revenue) OVER (ORDER BY month) as running_total,
  AVG(revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3mo
FROM monthly_revenue;

PARTITION BY: window within groups

-- Rank each employee within their department
SELECT
  name,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;

PARTITION BY is like GROUP BY for window functions — it creates separate "windows" for each partition. The rank resets for each department.

SUM() OVER — running totals and cumulative values

AVG() OVER — moving averages (great for smoothing trends)

ROW_NUMBER() OVER — unique numbering (great for deduplication)

RANK() OVER — ranking with ties (great for leaderboards)

LAG() / LEAD() OVER — access previous/next row (great for period-over-period comparison)

LAG and LEAD: comparing rows

-- Month-over-month revenue change
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) as change,
  ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
    / LAG(revenue) OVER (ORDER BY month), 1) as pct_change
FROM monthly_revenue;

This calculates "how much did revenue change from last month?" — one of the most common business questions, answered in a single query.

⚡

Window Function Challenge

50 XP

You have a table `sales` with columns: salesperson, region, month, revenue. Write SQL for each: 1. Rank salespeople within each region by total revenue: 2. Calculate a 3-month moving average of revenue for each salesperson: 3. For each salesperson, show their revenue alongside the previous month's revenue and the percentage change: 4. Find the top-performing salesperson in each region (rank = 1): *Hint: Use PARTITION BY for grouping, ORDER BY for sorting within the window, and a CTE for #4.*

Putting it all together

Here's the query that got our analyst promoted — reconstructed:

WITH active_customers AS (
  -- Step 1: Find customers with 3+ purchases in last 90 days
  SELECT customer_id
  FROM transactions
  WHERE transaction_date >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY customer_id
  HAVING COUNT(*) >= 3
),
customer_stats AS (
  -- Step 2: Calculate lifetime stats for active customers
  SELECT
    t.customer_id,
    c.name,
    SUM(t.amount) as lifetime_value,
    AVG(t.amount) as avg_transaction,
    MAX(t.transaction_date) as last_transaction_date,
    RANK() OVER (ORDER BY SUM(t.amount) DESC) as ltv_rank
  FROM transactions t
  JOIN customers c ON t.customer_id = c.id
  WHERE t.customer_id IN (SELECT customer_id FROM active_customers)
  GROUP BY t.customer_id, c.name
),
latest_transactions AS (
  -- Step 3: Get the most recent transaction for each customer
  SELECT *
  FROM (
    SELECT
      customer_id,
      amount as latest_amount,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY transaction_date DESC) as rn
    FROM transactions
  ) ranked
  WHERE rn = 1
)
-- Step 4: Combine everything
SELECT
  cs.name,
  cs.lifetime_value,
  cs.avg_transaction,
  lt.latest_amount,
  lt.latest_amount - cs.avg_transaction as vs_average,
  cs.ltv_rank
FROM customer_stats cs
JOIN latest_transactions lt ON cs.customer_id = lt.customer_id
ORDER BY cs.ltv_rank;

Four CTEs, each doing one clear thing. Window functions for ranking and getting the latest row. JOINs to combine the pieces. This is senior-level SQL — and you now know every building block.

⚡

Read the Query

25 XP

Look at the full query above and answer: 1. Why is ROW_NUMBER used in `latest_transactions` instead of MAX(transaction_date)? 2. What would happen if you changed INNER JOIN to LEFT JOIN on the final SELECT? 3. If two customers have the same lifetime_value, what rank does the second one get? (RANK vs DENSE_RANK vs ROW_NUMBER) 4. Why is the `active_customers` CTE filtered in a separate step instead of adding the HAVING clause directly in `customer_stats`? *These are the kinds of questions you'll get in senior analyst interviews.*

Key takeaways

LEFT JOIN is the analyst's best friend. It keeps all rows from the left table, filling NULLs for non-matches — essential for seeing the complete picture.
CTEs make complex queries readable. Name each step, break the logic into blocks, and compose them. Any query with more than one subquery should be a CTE.
Window functions analyze without collapsing. ROW_NUMBER, RANK, SUM OVER, LAG, and LEAD calculate across rows while keeping every row visible.
PARTITION BY is GROUP BY for windows. It creates separate calculations within groups without reducing the result set.
LAG/LEAD compare adjacent rows. Month-over-month changes, period comparisons, and sequential analysis are all one function call away.
Advanced SQL is a career differentiator. These skills are tested in every senior analyst and analytics engineer interview.

Knowledge Check

1.What is the key difference between INNER JOIN and LEFT JOIN?

2.What problem does a CTE (Common Table Expression) solve that subqueries don't?

3.Three employees have salaries of $90K, $85K, $85K, and $80K. Using RANK() OVER (ORDER BY salary DESC), what ranks do they receive?

4.What does LAG(revenue) OVER (ORDER BY month) return?

The analyst who got promoted with one query

Twelve lines of SQL. The VP had his answer in 30 seconds. The analyst was promoted to Senior Analyst two months later.

The difference between basic and advanced SQL isn't about writing more code. It's about solving harder problems in fewer steps.

🔑The career multiplier

JOINs: connecting the pieces

Real databases don't store everything in one table. Customer info lives in customers, orders in orders, products in products. JOINs connect them.

The four types of JOINs

JOIN type	What it returns	When to use it
INNER JOIN	Only rows that match in both tables	When you want customers who HAVE orders
LEFT JOIN	All rows from the left table + matches from right	When you want ALL customers, even those with no orders
RIGHT JOIN	All rows from the right table + matches from left	Rarely used — just swap table order and use LEFT JOIN
FULL OUTER JOIN	All rows from both tables	When you need to find mismatches between tables

-- INNER JOIN: Only customers who have placed orders
SELECT c.name, o.product, o.amount
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id;

-- LEFT JOIN: ALL customers, even if they haven't ordered
SELECT c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;

The LEFT JOIN is the workhorse of analytics. If a customer has no orders, the order columns show NULL — which tells you something valuable: this is a customer who hasn't bought anything.

✗ Without AI

✗Only matching rows
✗Drops customers with no orders
✗Smaller result set
✗Use when both sides must exist

✓ With AI

✓All rows from left table
✓Keeps customers with no orders (NULLs)
✓Larger result set
✓Use when you want the full picture

There Are No Dumb Questions

"How do I know which table is 'left' and which is 'right'?"

The table in the FROM clause is the left table. The table in the JOIN clause is the right table. FROM customers LEFT JOIN orders — customers is left, orders is right. Every customer appears; only matching orders appear.

"What happens if I JOIN on a column that isn't unique?"

You get duplicate rows. If customer 42 has 5 orders, a JOIN on customer_id produces 5 rows for customer 42 — one for each order. This is usually correct. But if you accidentally JOIN on a non-unique column (like name), you'll get a "cartesian explosion" — thousands of false matches. Always JOIN on IDs.

⚡

JOIN Exercises

25 XP

Subqueries: queries inside queries

A subquery is a SELECT statement nested inside another query. Think of it as a two-step question: first find something, then use that result.

In the WHERE clause

-- Find customers who spent more than the average
SELECT name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);

The inner query calculates the average. The outer query filters using that average. You can't do this with a single WHERE clause.

In the FROM clause (derived tables)

-- Average orders per customer (two-step: first count, then average)
SELECT AVG(order_count) as avg_orders_per_customer
FROM (
  SELECT customer_id, COUNT(*) as order_count
  FROM orders
  GROUP BY customer_id
) customer_orders;

With IN for list filtering

-- Find products that have never been ordered
SELECT name
FROM products
WHERE id NOT IN (SELECT DISTINCT product_id FROM orders);

⚠️Subquery performance

CTEs: readable, reusable query blocks

A Common Table Expression (CTE) is a named temporary result set. It makes complex queries readable by breaking them into logical steps.

WITH active_customers AS (
  SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total_spent
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY customer_id
  HAVING COUNT(*) >= 3
),
customer_details AS (
  SELECT c.name, c.email, ac.order_count, ac.total_spent
  FROM customers c
  JOIN active_customers ac ON c.id = ac.customer_id
)
SELECT *
FROM customer_details
ORDER BY total_spent DESC;

Without the CTE, this would be a mess of nested subqueries. With CTEs, each step has a name and a purpose:

active_customers — filter to customers with 3+ orders in 2025
customer_details — join in names and emails
Final SELECT — sort and display

There Are No Dumb Questions

"What's the difference between a CTE and a subquery?"

Functionally, they're often interchangeable. The difference is readability. A CTE sits at the top of the query with a clear name. A subquery is buried inside another query. For anything more than one level of nesting, CTEs are dramatically easier to read, debug, and maintain.

"Can I reference a CTE multiple times in the same query?"

Yes — that's one of the biggest advantages over subqueries. If you define WITH monthly_revenue AS (...), you can JOIN to it, filter by it, and aggregate it multiple times without re-running the query.

Window functions: the advanced analytics power tool

Window functions perform calculations across a set of rows related to the current row — without collapsing the rows into groups like GROUP BY does.

The key difference: GROUP BY reduces 1,000 rows to 10 groups. Window functions keep all 1,000 rows but add a calculated column.

ROW_NUMBER, RANK, DENSE_RANK

-- Rank customers by total spending
SELECT
  name,
  total_spent,
  ROW_NUMBER() OVER (ORDER BY total_spent DESC) as row_num,
  RANK() OVER (ORDER BY total_spent DESC) as rank,
  DENSE_RANK() OVER (ORDER BY total_spent DESC) as dense_rank
FROM customers;

Function	Ties behavior	Example (scores: 100, 95, 95, 90)
ROW_NUMBER	No ties — arbitrary order	1, 2, 3, 4
RANK	Same rank for ties, gaps after	1, 2, 2, 4
DENSE_RANK	Same rank for ties, no gaps	1, 2, 2, 3

Running totals and moving averages

-- Running total of revenue by month
SELECT
  month,
  revenue,
  SUM(revenue) OVER (ORDER BY month) as running_total,
  AVG(revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3mo
FROM monthly_revenue;

PARTITION BY: window within groups

-- Rank each employee within their department
SELECT
  name,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;

PARTITION BY is like GROUP BY for window functions — it creates separate "windows" for each partition. The rank resets for each department.

SUM() OVER — running totals and cumulative values

AVG() OVER — moving averages (great for smoothing trends)

ROW_NUMBER() OVER — unique numbering (great for deduplication)

RANK() OVER — ranking with ties (great for leaderboards)

LAG() / LEAD() OVER — access previous/next row (great for period-over-period comparison)

LAG and LEAD: comparing rows

-- Month-over-month revenue change
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) as change,
  ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
    / LAG(revenue) OVER (ORDER BY month), 1) as pct_change
FROM monthly_revenue;

This calculates "how much did revenue change from last month?" — one of the most common business questions, answered in a single query.

⚡

Window Function Challenge

50 XP

Putting it all together

Here's the query that got our analyst promoted — reconstructed:

WITH active_customers AS (
  -- Step 1: Find customers with 3+ purchases in last 90 days
  SELECT customer_id
  FROM transactions
  WHERE transaction_date >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY customer_id
  HAVING COUNT(*) >= 3
),
customer_stats AS (
  -- Step 2: Calculate lifetime stats for active customers
  SELECT
    t.customer_id,
    c.name,
    SUM(t.amount) as lifetime_value,
    AVG(t.amount) as avg_transaction,
    MAX(t.transaction_date) as last_transaction_date,
    RANK() OVER (ORDER BY SUM(t.amount) DESC) as ltv_rank
  FROM transactions t
  JOIN customers c ON t.customer_id = c.id
  WHERE t.customer_id IN (SELECT customer_id FROM active_customers)
  GROUP BY t.customer_id, c.name
),
latest_transactions AS (
  -- Step 3: Get the most recent transaction for each customer
  SELECT *
  FROM (
    SELECT
      customer_id,
      amount as latest_amount,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY transaction_date DESC) as rn
    FROM transactions
  ) ranked
  WHERE rn = 1
)
-- Step 4: Combine everything
SELECT
  cs.name,
  cs.lifetime_value,
  cs.avg_transaction,
  lt.latest_amount,
  lt.latest_amount - cs.avg_transaction as vs_average,
  cs.ltv_rank
FROM customer_stats cs
JOIN latest_transactions lt ON cs.customer_id = lt.customer_id
ORDER BY cs.ltv_rank;

Four CTEs, each doing one clear thing. Window functions for ranking and getting the latest row. JOINs to combine the pieces. This is senior-level SQL — and you now know every building block.

⚡

Read the Query

25 XP

Key takeaways

LEFT JOIN is the analyst's best friend. It keeps all rows from the left table, filling NULLs for non-matches — essential for seeing the complete picture.
CTEs make complex queries readable. Name each step, break the logic into blocks, and compose them. Any query with more than one subquery should be a CTE.
Window functions analyze without collapsing. ROW_NUMBER, RANK, SUM OVER, LAG, and LEAD calculate across rows while keeping every row visible.
PARTITION BY is GROUP BY for windows. It creates separate calculations within groups without reducing the result set.
LAG/LEAD compare adjacent rows. Month-over-month changes, period comparisons, and sequential analysis are all one function call away.
Advanced SQL is a career differentiator. These skills are tested in every senior analyst and analytics engineer interview.

Knowledge Check

1.What is the key difference between INNER JOIN and LEFT JOIN?

2.What problem does a CTE (Common Table Expression) solve that subqueries don't?

3.Three employees have salaries of $90K, $85K, $85K, and $80K. Using RANK() OVER (ORDER BY salary DESC), what ranks do they receive?

4.What does LAG(revenue) OVER (ORDER BY month) return?