SQL Set Operations: UNION, INTERSECT, and EXCEPT

Q: What is the difference between UNION and UNION ALL?

UNION removes duplicates through sorting or hashing at a performance cost. UNION ALL combines results without deduplication and is faster. Use UNION ALL unless deduplication is explicitly required.

Q: What does INTERSECT do and when would I use it?

INTERSECT returns only rows appearing in both query results. Use it to find records that exist in multiple datasets simultaneously such as customers present in both your current and a legacy system.

Q: What rules must both queries in a UNION satisfy?

Both queries must have the same number of columns with compatible data types and must be explicitly cast if needed. Column names come from the first SELECT and ORDER BY applies only to the final combined result.

"A database is just a collection of Mathematical Sets. To master SQL is to master Set Theory."

In most of your SQL work, you use JOINs to "Connect" tables sideways (Horizontal Fusion). But what if you want to "Stack" tables on top of each other? Or find the "Overlap" between two different reports? In modern database engineering, we use Set Operations to manipulate result sets as singular mathematical entities.

Set Operations (UNION, INTERSECT, and EXCEPT) are the fundamental tools for comparing and combining entire result sets. Whether you are finding "Customers who also work for us" or discovering "Products that sold in London but NOT in New York," set operations provide a cleaner, more readable, and often faster solution than complex subqueries.

This 1,500+ word flagship guide is your deep-dive into the "Algebra of Data." We will explore the "Hardware-Mirror" of how the engine physically merges millions of rows, why the "Buffer Manager" hates UNION, and why a single ALL keyword can reduce your SSD I/O by 90%.

1. The Big Three: Vertical Data Fusion Geometry

Set operations combine the results of two SELECT statements into a single output. Imagine two spreadsheets stacked on top of each other.

UNION: Adds the rows of Query B to Query A. It performs an implicit DISTINCT across all columns to remove duplicates.
INTERSECT: Only returns rows that exist in BOTH queries. This is the Venn Diagram Overlap.
EXCEPT (or MINUS): Returns rows from Query A that DO NOT exist in Query B. This is the Set Difference.

The Strict Laws of Schema Parity

For the database engine to perform surface set operations, the two queries MUST adhere to these laws:

Column Count Parity: You cannot UNION a query with 3 columns to a query with 5.
Bit-Level Compatibility: Column 1 in both must be "Compatible." You can mix VARCHAR(10) and VARCHAR(50), but you cannot mix INT and JSONB without explicit casting.
Positional Mapping: SQL maps by position, not by name. If Query A has (name, age) and Query B has (age, name), the engine will try to put names into the age column, likely crashing or corrupting the results.

2. Hardware-Mirror: The Deduplication Physics

When you run a standard UNION, the database MUST ensure that no row appears twice. This is not a "Visual" check; it is a heavy physical calculation. The engine typically chooses between two strategies based on your work_mem settings.

Strategy A: The Hash-Unique Mirror

The engine builds a Hash Table in RAM. As it scans Query A, it calculates a hash of the entire row (all columns combined) and places it in the hash bucket. When it scans Query B, if a row's hash already exists, it is discarded.

The Performance Limit: If the cardinality of your sets exceeds your available work_mem (RAM), the engine must "Spill" the hash table to the SSD as a Temporary File.
The Result: Your query speed drops from 10,000,000 rows per second (RAM) to 200,000 rows per second (SSD latency).

Strategy B: The Sort-Merge Set Mirror

The engine sorts both result sets by all columns. Then, it walks down both sorted lists (like a zipper), comparing the current rows. Because the lists are sorted, it knows instantly if a duplicate is coming up.

The Hardware Profile: This relies heavily on Sequential I/O. It is often preferred for massive datasets that involve billions of rows, where hashing would cause too collisions or memory overhead.

3. The "ALL" Secret: Eliminating the Bottleneck

This is the "Senior Engineer's Law": Always use UNION ALL unless you explicitly require deduplication.

UNION: Execute A + Execute B + Wait to Sort/Hash + Wait to Deduplicate.
UNION ALL: Execute A + Stream Result -> Execute B + Stream Result.

The Physics of Latency: UNION ALL is a "Streaming Operation." The first row of the result hits the user as soon as the first row of Query A is found. Standard UNION is a "Blocking Operation"; nothing is sent to the user until the entire set is gathered and deduplicated. In high-traffic APIs, the difference between UNION and UNION ALL is the difference between a 100ms response and a 4-second timeout.

4. INTERSECT and EXCEPT: The Comparison Buffer

While EXCEPT is mathematically elegant, it is often misunderstood at the hardware level.

The Physics of Intersection (The Probe Scan)

When you INTERSECT two sets, the database creates a hash map of the smaller set and probes it with the larger set.

The Architect's Tip: If you are intersecting a table of "All Citizens" (10 million) with "Active Voters" (100k), the engine stays fast because the 100k hash table fits entirely in the CPU's cache.
Complexity: O(N + M).

EXCEPT vs. NOT IN: The Performance Mirror

Many juniors use WHERE id NOT IN (SELECT id FROM ...) instead of EXCEPT.

The NOT IN Trap: If the subquery returns even one NULL, the entire NOT IN evaluates to FALSE, returning no rows.
The EXCEPT Advantage: EXCEPT handles NULL values correctly in the mathematical sense (considering them as a distinct "Value" for set difference).
The CPU Profile: EXCEPT is usually faster for multi-column comparisons (e.g., "Find rows where Name AND Age AND Salary differ"), whereas NOT EXISTS is faster if you are only comparing a single Primary Key.

5. Global Use Case: The Multi-Shard Audit

Imagine you have a distributed database for "EcoStore Global." You have a sharded table for US orders and one for EU orders.

The Problem

You need a list of all unique "Premium Products" sold globally yesterday, sorted by popularity.

The Set Architecture

sql

(
  SELECT product_id, price FROM orders_us_shard WHERE type = 'PREMIUM'
  UNION
  SELECT product_id, price FROM orders_eu_shard WHERE type = 'PREMIUM'
)
ORDER BY price DESC;

The Mirror Logic: By wrapping the UNION in parentheses and putting the ORDER BY at the bottom, you tell the engine to merge the shards first, and then perform a global sort. If you put ORDER BY inside each select, the engine would sort the US list, sort the EU list, and then have to re-sort the combined list anyway.

6. Advanced Physics: EXCEPT in Zero-Downtime Migrations

During a database migration, you often copy data from a "Legacy Mirror" to a "Cloud Mirror." How do you prove they are identical?

The "Mirror Difference" Script

sql

-- Find rows in Legacy not in Cloud
(SELECT * FROM legacy_users) EXCEPT (SELECT * FROM cloud_users);

-- Find rows in Cloud not in Legacy
(SELECT * FROM cloud_users) EXCEPT (SELECT * FROM legacy_users);

If both queries return 0 rows, your data migration has Atomic Parity. This is a standard check for SOC2 and HIPAA compliance audits where "Data Integrity Proof" is a legal requirement.

7. Performance Lab: Tuning Set Operations

Work_Mem Tuning: If you see "External Merge Sort" in your EXPLAIN ANALYZE for a UNION, increase your session's memory for that query: SET work_mem = '128MB';.
Cast Early: If mixing types, cast them in the subquery: SELECT id::text FROM ... UNION SELECT name FROM .... This prevents the engine from guessing the type and choosing a inefficient "Catch-all" conversion.
Prefer Left Joins for Single Keys: If you only need to exclude data based on an ID, a LEFT JOIN ... WHERE NULL is often easier for the Query Planner to optimize than an EXCEPT.

8. Summary: The Set Theory Checklist

The "ALL" Priority: Never use plain UNION unless you explicitly want to risk the deduplication overhead.
Constraint Awareness: If you have a UNIQUE constraint on your columns, UNION ALL is guaranteed to be identical to UNION but much faster.
Mathematical Integrity: Use EXCEPT for audits and migrations where you need 100% bit-for-bit comparison between two sets.
Hardware Alignment: Watch for RAM spillage on INTERSECT. If the overlap set is huge, your system will swap to disk.

By mastering the vertical movement of data through Set Operations, you gain the power to harmonize disparate datasets into a single "Source of Truth." You no longer "Query for rows"; you "Architect the Set."

Phase 10: Set Logic Action Items

Refactor legacy UNION queries to UNION ALL where duplicates are impossible or acceptable.
Implement an EXCEPT-based audit script to verify data integrity between your Staging and Production environments.
Verify that all column types in your set operations are bit-aligned to avoid implicit conversion overhead.
Use EXPLAIN ANALYZE to check if your INTERSECT is using a "Hash Join" or falling back to a "Sort Merge."

Frequently Asked Questions

Q: What is the difference between UNION and UNION ALL?

UNION combines the results of two queries and removes duplicate rows - it performs a sort or hash operation to identify and eliminate duplicates, which has a performance cost. UNION ALL combines results without removing duplicates - it is faster because no deduplication step is needed. Use UNION when you genuinely need distinct rows across the combined sets. Use UNION ALL when you know there are no duplicates, when duplicates are meaningful (combining two logs), or when performance matters and deduplication is unnecessary. In practice, UNION ALL is the better default choice unless deduplication is required.

Q: What does INTERSECT do and when would I use it?

INTERSECT returns only rows that appear in both query results - the mathematical intersection of two sets. Use it to find records that exist in multiple datasets simultaneously: customers who have both placed an order and submitted a support ticket, products that appear in both the current and previous month's top sellers, or users present in both an old and new system during a migration. INTERSECT is often replaceable with an INNER JOIN or an EXISTS subquery - choose whichever makes the intent clearest for your specific query.

Q: What rules must both queries in a UNION satisfy?

Both SELECT statements must have the same number of columns, and corresponding columns must have compatible data types (or be explicitly cast to compatible types). Column names in the result come from the first SELECT statement - aliases in the second SELECT are ignored. ORDER BY applies to the final combined result and must appear after the last SELECT, not within each individual SELECT. These rules apply to UNION, UNION ALL, INTERSECT, and EXCEPT equally.

Part of the SQL Mastery Course - engineering the set.