A Comprehensive Guide in SQL Joins

Join is an operation that combines rows from two or more tables based on related columns.

The three pillars:

  • Grain — What does one row represent in each table?
  • Key(s) — Which columns uniquely identify a row (primary key) and connect tables (foreign key)?
  • Cardinality — one-to-one (1-1), one-to-many (1-), many-to-many (-*).

Type of Joins

JOIN (INNER)

Returns records that have matching values in both tables. If a row in one table doesn’t have a match in the other, it’s excluded.

Example tables (before):

Customers

Customer_ID Customer_Name
1 Alice
2 Bob
3 Carol

Orders

Order_ID Customer_ID Amount
101 1 50
102 1 30
103 4 20

Query

SELECT o.Order_ID, c.Customer_ID, c.Customer_Name, o.Amount
FROM Orders o
JOIN Customers c ON c.Customer_ID = o.Customer_ID;

Result (after):

Order_ID Customer_ID Customer_Name Amount
101 1 Alice 50
102 1 Alice 30

LEFT JOIN

Returns all records from the left table and the matched records from the right table. If there is no match, NULLs are returned for right columns.

Query

SELECT c.Customer_ID, c.Customer_Name, o.Order_ID, o.Amount
FROM Customers c
LEFT JOIN Orders o ON o.Customer_ID = c.Customer_ID;

Result (after):

Customer_ID Customer_Name Order_ID Amount
1 Alice 101 50
1 Alice 102 30
2 Bob NULL NULL
3 Carol NULL NULL

RIGHT JOIN

Returns all records from the right table and the matched records from the left table.

Query

SELECT c.Customer_ID, c.Customer_Name, o.Order_ID, o.Amount
FROM Customers c
RIGHT JOIN Orders o ON o.Customer_ID = c.Customer_ID;

Result (after):

Customer_ID Customer_Name Order_ID Amount
1 Alice 101 50
1 Alice 102 30
NULL NULL 103 20

FULL OUTER JOIN (FULL JOIN)

Returns all records when there is a match in either left or right table.

Query

SELECT c.Customer_ID, c.Customer_Name, o.Order_ID, o.Amount
FROM Customers c
FULL OUTER JOIN Orders o ON o.Customer_ID = c.Customer_ID;

Result (after):

Customer_ID Customer_Name Order_ID Amount
1 Alice 101 50
1 Alice 102 30
2 Bob NULL NULL
3 Carol NULL NULL
NULL NULL 103 20

Other Types of Joins

SELF JOIN

Merge a table with itself to compare rows or express hierarchies (e.g., manager → employee).

Employees

emp_id name manager_id
1 Alice NULL
2 Bob 1
3 Carol 1
4 Dan 2

Query

SELECT e.emp_id, e.name AS employee, m.name AS manager
FROM Employees e
JOIN Employees m ON m.emp_id = e.manager_id;

Result (after):

emp_id employee manager
2 Bob Alice
3 Carol Alice
4 Dan Bob

CROSS JOIN

Produces all possible pairings (Cartesian product). No join condition required.

Students

student
Anna
Ben

Subjects

subject
Math
English

Query

SELECT s.student, t.subject
FROM Students s
CROSS JOIN Subjects t;

Result (after):

student subject
Anna Math
Anna English
Ben Math
Ben English

SEMI JOIN

Returns left-table rows that have a match in the right table; only left columns are returned.

Query

SELECT c.Customer_ID, c.Customer_Name
FROM Customers c
WHERE EXISTS (
  SELECT 1
  FROM Orders o
  WHERE c.Customer_ID = o.Customer_ID
);

Result (after):

Customer_ID Customer_Name
1 Alice

ANTI JOIN

Returns left-table rows that have no match in the right table.

Query

SELECT c.Customer_ID, c.Customer_Name
FROM Customers c
LEFT JOIN Orders o ON o.Customer_ID = c.Customer_ID
WHERE o.Customer_ID IS NULL;

Result (after):

Customer_ID Customer_Name
2 Bob
3 Carol

Join Mindset

1) Start with the end in mind (declare the grain). Example: one row per (store_id, sku, date).

2) Pick a driving table (and be explicit why). Use LEFT when data quality isn’t guaranteed; INNER when referential integrity is trusted.

3) Make keys & cardinality explicit.

  • fact→dim = many→one
  • fact→fact = many↔many → pre-aggregate the many side.

4) Choose the right pattern. Enrich: LEFT/INNER · Presence: EXISTS (SEMI) · Absence: LEFT … WHERE right IS NULL or NOT EXISTS (ANTI) · Reconciliation: FULL · Combinations: CROSS (intentional).

5) Put filters in the right place (ON vs WHERE).

-- GOOD: keeps unmatched promos as NULL (true LEFT)
LEFT JOIN promo pr
  ON pr.sku = s.sku
 AND s.sale_dt BETWEEN pr.start_dt AND pr.end_dt
-- BAD: WHERE pr.start_dt ...  -- turns LEFT into INNER

6) Aggregate at the correct moment (pre-agg many side).

WITH pr_by_sale AS (
  SELECT sale_id, COUNT(*) AS promos
  FROM fact_promotions pr
  JOIN fact_sales s
    ON pr.sku = s.sku
   AND s.sale_dt BETWEEN pr.start_dt AND pr.end_dt
  GROUP BY sale_id
)
SELECT s.*, COALESCE(p.promos,0) AS promos
FROM fact_sales s
LEFT JOIN pr_by_sale p USING (sale_id);

7) Treat NULLs deliberately. NULL = NULL is unknown; use COALESCE only when correct.

8) Scaffolds help robustness.

WITH frame AS (
  SELECT c.dt, a.store_id, a.sku
  FROM dim_calendar c
  JOIN assortment a ON c.dt BETWEEN a.start_dt AND a.end_dt
  WHERE c.dt BETWEEN DATE '2025-09-01' AND DATE '2025-09-07'
)
SELECT f.dt, f.store_id, f.sku,
       COALESCE(SUM(s.units),0) AS units
FROM frame f
LEFT JOIN fact_sales s
  ON s.sale_dt = f.dt
 AND s.store_id = f.store_id
 AND s.sku = f.sku
GROUP BY 1,2,3;

Performance habits (BigQuery-like engines)

  • Filter early; select only needed columns.
  • Pre-aggregate many sides before joins.
  • Partition/cluster big facts by date and common keys.
  • Prefer EXISTS to JOIN + DISTINCT for presence tests.
  • Avoid SELECT * in production.

Reusable mini-templates

Presence (SEMI)

SELECT p.sku
FROM dim_product p
WHERE EXISTS (
  SELECT 1 FROM fact_sales s
  WHERE s.sku = p.sku
    AND s.sale_dt >= CURRENT_DATE - INTERVAL 30 DAY
);

Absence (ANTI)

SELECT p.sku
FROM dim_product p
LEFT JOIN fact_sales s
  ON s.sku = p.sku
 AND s.sale_dt >= CURRENT_DATE - INTERVAL 30 DAY
WHERE s.sku IS NULL;

SCD2 time-bounded

SELECT s.sale_id, d.category
FROM fact_sales s
JOIN dim_product_scd2 d
  ON d.sku = s.sku
 AND s.sale_dt >= d.effective_start_dt
 AND s.sale_dt <  d.effective_end_dt;

Many-to-many guard (pre-aggregate)

WITH sales_30 AS (
  SELECT sku, store_id, SUM(units) AS units_30
  FROM fact_sales
  WHERE sale_dt >= CURRENT_DATE - INTERVAL 30 DAY
  GROUP BY 1,2
)
SELECT f.store_id, f.sku, s.units_30
FROM frame f
LEFT JOIN sales_30 s USING (store_id, sku);

Mental checklist

  • What’s my result grain?
  • What’s the driver table?
  • What are the keys and the cardinality?
  • Do I need presence/absence (EXISTS / NOT EXISTS) or enrichment (JOIN)?
  • Are right-side filters in the ON clause?
  • Do I need to pre-aggregate to avoid duplication?
  • How do NULLs affect logic and totals?
  • Quick row-count & dup checks after each step.
  • Any performance wins (filter early, select few, partitions/cluster)?
  • Does the output still match the declared grain?