SQL Set Operations: UNION, INTERSECT, and EXCEPT

SQL Set Operations: UNION, INTERSECT, and EXCEPT
"A database is just a collection of Mathematical Sets. To master SQL is to master Set Theory."
In most of your SQL work, you use JOINs to "Connect" tables sideways (Horizontal Fusion). But what if you want to "Stack" tables on top of each other? Or find the "Overlap" between two different reports? In modern database engineering, we use Set Operations to manipulate result sets as singular mathematical entities.
Set Operations (UNION, INTERSECT, and EXCEPT) are the fundamental tools for comparing and combining entire result sets. Whether you are finding "Customers who also work for us" or discovering "Products that sold in London but NOT in New York," set operations provide a cleaner, more readable, and often faster solution than complex subqueries.
This 1,500+ word flagship guide is your deep-dive into the "Algebra of Data." We will explore the "Hardware-Mirror" of how the engine physically merges millions of rows, why the "Buffer Manager" hates UNION, and why a single ALL keyword can reduce your SSD I/O by 90%.
1. The Big Three: Vertical Data Fusion Geometry
Set operations combine the results of two SELECT statements into a single output. Imagine two spreadsheets stacked on top of each other.
UNION: Adds the rows of Query B to Query A. It performs an implicitDISTINCTacross all columns to remove duplicates.INTERSECT: Only returns rows that exist in BOTH queries. This is the Venn Diagram Overlap.EXCEPT(or MINUS): Returns rows from Query A that DO NOT exist in Query B. This is the Set Difference.
The Strict Laws of Schema Parity
For the database engine to perform surface set operations, the two queries MUST adhere to these laws:
- Column Count Parity: You cannot
UNIONa query with 3 columns to a query with 5. - Bit-Level Compatibility: Column 1 in both must be "Compatible." You can mix
VARCHAR(10)andVARCHAR(50), but you cannot mixINTandJSONBwithout explicit casting. - Positional Mapping: SQL maps by position, not by name. If Query A has
(name, age)and Query B has(age, name), the engine will try to put names into the age column, likely crashing or corrupting the results.
2. Hardware-Mirror: The Deduplication Physics
When you run a standard UNION, the database MUST ensure that no row appears twice. This is not a "Visual" check; it is a heavy physical calculation. The engine typically chooses between two strategies based on your work_mem settings.
Strategy A: The Hash-Unique Mirror
The engine builds a Hash Table in RAM. As it scans Query A, it calculates a hash of the entire row (all columns combined) and places it in the hash bucket. When it scans Query B, if a row's hash already exists, it is discarded.
- The Performance Limit: If the cardinality of your sets exceeds your available
work_mem(RAM), the engine must "Spill" the hash table to the SSD as a Temporary File. - The Result: Your query speed drops from 10,000,000 rows per second (RAM) to 200,000 rows per second (SSD latency).
Strategy B: The Sort-Merge Set Mirror
The engine sorts both result sets by all columns. Then, it walks down both sorted lists (like a zipper), comparing the current rows. Because the lists are sorted, it knows instantly if a duplicate is coming up.
- The Hardware Profile: This relies heavily on Sequential I/O. It is often preferred for massive datasets that involve billions of rows, where hashing would cause too collisions or memory overhead.
3. The "ALL" Secret: Eliminating the Bottleneck
This is the "Senior Engineer's Law": Always use UNION ALL unless you explicitly require deduplication.
UNION: Execute A + Execute B + Wait to Sort/Hash + Wait to Deduplicate.UNION ALL: Execute A + Stream Result -> Execute B + Stream Result.
The Physics of Latency: UNION ALL is a "Streaming Operation." The first row of the result hits the user as soon as the first row of Query A is found. Standard UNION is a "Blocking Operation"; nothing is sent to the user until the entire set is gathered and deduplicated. In high-traffic APIs, the difference between UNION and UNION ALL is the difference between a 100ms response and a 4-second timeout.
4. INTERSECT and EXCEPT: The Comparison Buffer
While EXCEPT is mathematically elegant, it is often misunderstood at the hardware level.
The Physics of Intersection (The Probe Scan)
When you INTERSECT two sets, the database creates a hash map of the smaller set and probes it with the larger set.
- The Architect's Tip: If you are intersecting a table of "All Citizens" (10 million) with "Active Voters" (100k), the engine stays fast because the 100k hash table fits entirely in the CPU's cache.
- Complexity: O(N + M).
EXCEPT vs. NOT IN: The Performance Mirror
Many juniors use WHERE id NOT IN (SELECT id FROM ...) instead of EXCEPT.
- The
NOT INTrap: If the subquery returns even oneNULL, the entireNOT INevaluates toFALSE, returning no rows. - The
EXCEPTAdvantage:EXCEPThandlesNULLvalues correctly in the mathematical sense (considering them as a distinct "Value" for set difference). - The CPU Profile:
EXCEPTis usually faster for multi-column comparisons (e.g., "Find rows where Name AND Age AND Salary differ"), whereasNOT EXISTSis faster if you are only comparing a single Primary Key.
5. Global Use Case: The Multi-Shard Audit
Imagine you have a distributed database for "EcoStore Global." You have a sharded table for US orders and one for EU orders.
The Problem
You need a list of all unique "Premium Products" sold globally yesterday, sorted by popularity.
The Set Architecture
The Mirror Logic: By wrapping the UNION in parentheses and putting the ORDER BY at the bottom, you tell the engine to merge the shards first, and then perform a global sort. If you put ORDER BY inside each select, the engine would sort the US list, sort the EU list, and then have to re-sort the combined list anyway.
6. Advanced Physics: EXCEPT in Zero-Downtime Migrations
During a database migration, you often copy data from a "Legacy Mirror" to a "Cloud Mirror." How do you prove they are identical?
The "Mirror Difference" Script
If both queries return 0 rows, your data migration has Atomic Parity. This is a standard check for SOC2 and HIPAA compliance audits where "Data Integrity Proof" is a legal requirement.
7. Performance Lab: Tuning Set Operations
- Work_Mem Tuning: If you see "External Merge Sort" in your
EXPLAIN ANALYZEfor aUNION, increase your session's memory for that query:SET work_mem = '128MB';. - Cast Early: If mixing types, cast them in the subquery:
SELECT id::text FROM ... UNION SELECT name FROM .... This prevents the engine from guessing the type and choosing a inefficient "Catch-all" conversion. - Prefer Left Joins for Single Keys: If you only need to exclude data based on an ID, a
LEFT JOIN ... WHERE NULLis often easier for the Query Planner to optimize than anEXCEPT.
8. Summary: The Set Theory Checklist
- The "ALL" Priority: Never use plain
UNIONunless you explicitly want to risk the deduplication overhead. - Constraint Awareness: If you have a
UNIQUEconstraint on your columns,UNION ALLis guaranteed to be identical toUNIONbut much faster. - Mathematical Integrity: Use
EXCEPTfor audits and migrations where you need 100% bit-for-bit comparison between two sets. - Hardware Alignment: Watch for RAM spillage on
INTERSECT. If the overlap set is huge, your system will swap to disk.
By mastering the vertical movement of data through Set Operations, you gain the power to harmonize disparate datasets into a single "Source of Truth." You no longer "Query for rows"; you "Architect the Set."
Phase 10: Set Logic Action Items
- Refactor legacy
UNIONqueries toUNION ALLwhere duplicates are impossible or acceptable. - Implement an
EXCEPT-based audit script to verify data integrity between your Staging and Production environments. - Verify that all column types in your set operations are bit-aligned to avoid implicit conversion overhead.
- Use
EXPLAIN ANALYZEto check if yourINTERSECTis using a "Hash Join" or falling back to a "Sort Merge."
Read next: SQL Aggregate Functions: Summarizing the Global Mirror →
Part of the SQL Mastery Course — engineering the set.
