SQL SELECT Queries: Filtering and Logic

TopicTrick Team

In many high-level languages like JavaScript or Python, if you want to find an item in a list, you write a for loop. You are telling the computer how to find the data. In SQL, you do the opposite. You tell the computer what you want, and the Query Optimizer—the most sophisticated piece of software in the engine—decides the "How."

This 1,500+ word guide is your deep-dive into the "Extraction Layer." We will explore the anatomy of a query, the physics of filtering, and the dangerous "Ghost" of Three-Valued Logic.


1. The Anatomy of a Query: The Execution Order

Every SQL query follows a strict syntax, but the database engine actually executes the lines in a completely different order than you write them. Understanding this order is the first step toward optimization.

How you write it (Lexical Order):

  1. SELECT: What columns do I want?
  2. FROM: Which table am I looking at?
  3. WHERE: What is the condition?

How the Engine executes it (Logical Order):

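  1. FROM: The engine first identifies the source rows. For a join, this step is defined logically as a cartesian product of the tables filtered by the join condition, although real engines rarely materialize that product.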
  2. WHERE: It applies the filters to discard rows as early as possible.
  3. SELECT: Finally, it throws away the columns you didn't ask for (Projection).

Architect's Note: Because WHERE is evaluated before SELECT, you cannot use a name you created in the SELECT list (an alias) inside your WHERE clause. Example: SELECT age AS user_age FROM users WHERE user_age > 18 fails in most engines; PostgreSQL, for example, reports that the column user_age does not exist.
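Two portable ways around the alias restriction are to repeat the underlying column in WHERE, or to define the alias in a subquery and filter outside it. The sketch below runs both against an in-memory SQLite database through Python's sqlite3 module. The users table and its rows are invented for illustration, and SQLite is only a convenient stand-in: it is more lenient about aliases than PostgreSQL, so the failing form itself is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("alice", 34), ("bob", 17), ("carol", 18)],  # illustrative rows
)

# Option 1: filter on the underlying column instead of the alias.
repeat_expr = conn.execute(
    "SELECT age AS user_age FROM users WHERE age > 18 ORDER BY user_age"
).fetchall()

# Option 2: define the alias in a subquery; the outer WHERE is evaluated
# after the inner SELECT, so the alias exists by then.
subquery = conn.execute(
    """
    SELECT user_age
    FROM (SELECT age AS user_age FROM users) AS aliased
    WHERE user_age > 18
    ORDER BY user_age
    """
).fetchall()

print(repeat_expr)
print(subquery)
```

Both forms return the same rows. The subquery version is mostly useful when the aliased expression is long enough that repeating it would hurt readability.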


2. Hardware-Mirror: The Physics of "Seeking" vs. "Scanning"

When you write WHERE user_id = 500, what happens to the physical silicon?

The Full Table Scan (Sequential I/O)

If user_id is NOT indexed, the database engine must physically read every single 8KB "Page" of your table from the disk into the Buffer Pool (RAM).

  • The Thread Physics: To speed this up, engines such as PostgreSQL (since version 9.6) can use Parallel Sequential Scans, launching multiple background workers that read different parts of the table simultaneously.
  • The OS Buffer Mirror: Even if the database asks for a scan, the OS might already have the pages in its Page Cache. This makes the second scan significantly faster than the first.
  • The Performance: If you have 10 million rows, this can take seconds. On a high-traffic site, many concurrent scans can saturate your CPU and disk, and your app will time out. This is "Sequential I/O": predictable, but wasteful when you only need a handful of rows.

The Index Seek (Random I/O)

If you have an index, the engine uses a B-Tree structure.

  • The Process: It jumps to the Root Page, then to a Branch Page, then directly to the Leaf Page containing ID 500. It only touches 3 or 4 pages total.
  • The Performance: This typically takes less than 1 millisecond.
  • Mastery: Your goal as an architect is to ensure your WHERE clauses are SARGable (Search ARgument Able).
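The scan-versus-seek difference is visible in a query plan. As a minimal sketch, the code below asks SQLite (through Python's sqlite3 module) for its plan before and after an index exists. The table, the index name idx_users_user_id, and the plan helper are assumptions made for this example, and SQLite's plan wording differs from PostgreSQL's EXPLAIN output, though the scan-or-search distinction is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users (user_id, email) VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 1001)],
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM users WHERE user_id = 500"

# No index yet: the only option is to read the whole table.
plan_without_index = plan(query)

# With an index, the engine can search the B-Tree for the key instead.
conn.execute("CREATE INDEX idx_users_user_id ON users (user_id)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The exact plan text varies between SQLite versions, so look for the SCAN versus SEARCH keyword rather than matching the full string.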

The Visibility Map Mirror

How does the engine know which data is "Old" vs. "New"?

  • The MVCC Physics: In Postgres, every row carries hidden xmin and xmax system columns that record which transactions created and deleted it.
  • The Map: Postgres maintains a Visibility Map (VM) that tracks which pages contain only "All-Visible" rows (committed and seen by everyone).
  • Efficiency Mirror: When performing an "Index-Only Scan," the engine checks the VM first. If the page is all-visible, it skips visiting the actual table (the Heap) entirely, which removes the heap read for those rows and can cut the I/O for the query substantially.

3. Predicate Pushdown: The Filter Barrier

In modern distributed SQL systems (like CockroachDB or Amazon Aurora), the engine uses a technique called Predicate Pushdown.

The Flow Physics

Instead of the Storage Engine sending 1 million rows to the Relational Engine for filtering, the Relational Engine "Pushes" the filter logic down into the Storage Engine itself.

  • The Result: The storage layer only sends the relevant rows back up the stack.
  • The Performance Win: This reduces serialization cost and network bandwidth, because rows that fail the filter are never shipped between layers or nodes.
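The effect of pushdown can be sketched without a database at all. The toy below is not how CockroachDB or Aurora implement it; ROWS, storage_scan, and wants_eu are hypothetical names standing in for a storage layer and a filter. It only counts how many rows cross the boundary between the two layers in each case.

```python
# Hypothetical in-memory "storage engine": 1,000 rows, a quarter of them in "eu".
ROWS = [{"id": i, "region": "eu" if i % 4 == 0 else "us"} for i in range(1000)]

def storage_scan(predicate=None):
    """Yield rows from storage, applying the predicate there if one is pushed down."""
    for row in ROWS:
        if predicate is None or predicate(row):
            yield row

def wants_eu(row):
    """The filter the query cares about."""
    return row["region"] == "eu"

# Without pushdown: storage ships every row and the upper layer filters.
shipped_all = list(storage_scan())
result_without = [row for row in shipped_all if wants_eu(row)]

# With pushdown: the same filter runs inside the storage layer.
shipped_filtered = list(storage_scan(predicate=wants_eu))
result_with = shipped_filtered

print(len(shipped_all), len(shipped_filtered))
```

The two results are identical; only the number of rows shipped between layers changes, which is where the serialization and bandwidth savings come from.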

3. SARGable vs. Non-SARGable Queries

A "Non-SARGable" query is a perfectly valid SQL command that accidentally "Kills" the database's ability to use an index.

The Index Killer: Using Functions on Columns

Amateur SQL: WHERE UPPER(email) = 'ALICE@EXAMPLE.COM'

  • The Result: Because you wrapped email in the UPPER() function, the engine cannot use a plain index on email. It has to calculate UPPER() for every single row in the table.

Professional SQL: WHERE email = 'alice@example.com' (assuming the data is already normalized to lowercase).

The Wildcard Trap

Amateur SQL: WHERE name LIKE '%John%'

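  • The Result: A leading wildcard prevents the engine from "Jumping" to the start of the matching range in the index. It must scan the entire index or table.

Professional SQL: WHERE name LIKE 'John%' when a prefix match is what you actually need. (This is SARGable and can use a B-Tree index; in PostgreSQL that also depends on the column's collation or on an index built with text_pattern_ops.)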
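The function-on-column case is easy to check against a real planner. As a minimal sketch, the code below compares SQLite's plan for the wrapped and the bare form of the email filter. The table layout, the index name idx_users_email, and the plan helper are assumptions for this example, and SQLite stands in for whichever engine you actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")
conn.execute(
    "INSERT INTO users (email, name) VALUES ('alice@example.com', 'Alice')"
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Non-SARGable: the indexed column is hidden inside a function call.
wrapped = plan("SELECT * FROM users WHERE UPPER(email) = 'ALICE@EXAMPLE.COM'")

# SARGable: the bare column is compared with a constant.
bare = plan("SELECT * FROM users WHERE email = 'alice@example.com'")

print(wrapped)
print(bare)
```

Only the bare comparison lets the planner search idx_users_email; the wrapped form falls back to reading every row.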

4. Logical Operators and Short-Circuiting

SQL uses Boolean logic (AND, OR, NOT) to combine filters. But unlike modern programming languages, SQL engines sometimes optimize these in surprising ways.

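  • AND Short-Circuit: An engine may skip the rest of an AND once one part is FALSE, but SQL does not guarantee left-to-right evaluation. The optimizer is free to reorder predicates based on estimated cost.
    • Optimization Trick: Rather than relying on the order you write filters in, make sure the "Most Selective" filter (the one that removes the most rows) can use an index.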
  • The "OR" Performance Cost: OR is the enemy of performance. If you have WHERE id = 5 OR name = 'John', the engine often has to perform two separate index searches and then "Merge" the results in RAM. If you can avoid OR, do it.
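One way to see how little the written order of AND predicates matters is to compare plans for both orderings. The sketch below does that in SQLite via Python's sqlite3 module; the users table, the index idx_users_user_id, and the plan helper are assumptions for this example, and other engines may print their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_users_user_id ON users (user_id)")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Same two predicates, written in opposite orders.
indexed_first = plan("SELECT * FROM users WHERE user_id = 5 AND name = 'John'")
indexed_last = plan("SELECT * FROM users WHERE name = 'John' AND user_id = 5")

print(indexed_first)
print(indexed_last)
```

Both statements get the same plan: the planner picks the indexed predicate to drive the search regardless of where it appears in the WHERE clause.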

5. The Ghost in the Machine: Three-Valued Logic

In standard code, values are TRUE or FALSE. In SQL, there is a third state: UNKNOWN (NULL).

The 1% Database Bug

NULL represents "Information missing." Since the database doesn't know what is in the NULL cell, it cannot say it is "Equal" to anything.

  • 1 = 1 → TRUE
  • 1 = NULL → UNKNOWN
  • NULL = NULL → UNKNOWN
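These three comparisons can be evaluated directly. A minimal sketch using SQLite through Python's sqlite3 module follows; note that SQLite reports TRUE as the integer 1 and UNKNOWN as NULL, which Python shows as None. The fourth column, NULL IS NULL, is added here for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Evaluate each comparison as a plain expression, with no table involved.
row = conn.execute(
    "SELECT 1 = 1, 1 = NULL, NULL = NULL, NULL IS NULL"
).fetchone()

print(row)  # (1, None, None, 1)
```

Only IS NULL gives a definite answer when NULL is involved; every = comparison against NULL stays UNKNOWN.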

The Catastrophic Filter Bug

Imagine you want all users who are NOT in London:

SELECT * FROM users WHERE city != 'London'

If a user has a NULL city, they will NOT appear in your results. Why? Because NULL != 'London' is UNKNOWN, and SQL only returns rows where the result is strictly TRUE.

  • Architect's Standard: Always account for the "Ghost" state. Use WHERE city != 'London' OR city IS NULL.
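The missing-row bug and its fix can be reproduced in a few lines. The sketch below uses an in-memory SQLite database through Python's sqlite3 module; the three users are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("alice", "London"), ("bob", "Paris"), ("carol", None)],  # carol's city is NULL
)

# Naive filter: carol is dropped because NULL != 'London' is UNKNOWN.
naive = [r[0] for r in conn.execute(
    "SELECT name FROM users WHERE city != 'London' ORDER BY name"
)]

# NULL-aware filter: explicitly keep rows whose city is missing.
null_aware = [r[0] for r in conn.execute(
    "SELECT name FROM users WHERE city != 'London' OR city IS NULL ORDER BY name"
)]

print(naive)       # ['bob']
print(null_aware)  # ['bob', 'carol']
```

Whether carol belongs in the result is a business decision, but the query should make that decision explicit rather than silently dropping her.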

6. Case Study: The "Anti-Pattern" Table Lock

A logistics company had a query that ran every minute to find "Unprocessed Orders." WHERE status = 'PENDING' AND created_at > '2024-01-01'

The Problem: They had an index on created_at but NOT on status. As the year progressed, there were 5,000,000 orders from 2024. The database used the index to find all 5 million rows and then checked each one for 'PENDING'.

The Result: The query started taking 10 seconds. Because it was a write-heavy table, the locks created by this long query started backing up, eventually crashing the entire warehouse system.

The Fix: A "Composite Index" on (status, created_at). This allowed the engine to jump directly to the 'PENDING' section of the index and find the items in 2 ms.
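A rough way to see why the composite index helps is to compare which predicates the planner can resolve inside the index in each setup. The sketch below is not the company's actual schema; the orders table, the index names, and the plan_with_index helper are assumptions, and SQLite is used as a stand-in planner.

```python
import sqlite3

QUERY = (
    "SELECT * FROM orders "
    "WHERE status = 'PENDING' AND created_at > '2024-01-01'"
)

def plan_with_index(create_index_sql):
    """Build a fresh orders table with one index and return the query plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)"
    )
    conn.execute(create_index_sql)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    return " | ".join(row[-1] for row in rows)

# Original setup: only created_at is indexed, so status is checked row by row.
before = plan_with_index(
    "CREATE INDEX idx_orders_created_at ON orders (created_at)"
)

# The fix: one index that covers both filter columns, equality column first.
after = plan_with_index(
    "CREATE INDEX idx_orders_status_created_at ON orders (status, created_at)"
)

print(before)
print(after)
```

With the composite index, both the status equality and the created_at range appear as index constraints in the plan. Putting the equality column first is what lets the range on created_at narrow the search within the 'PENDING' entries.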


7. Summary: The Query Optimization Checklist

  1. Selectivity: Make sure the filters that discard the most data are the ones backed by an index. The optimizer, not the order of your WHERE clause, decides which filter runs first.
  2. Projection: Only SELECT what you need. SELECT * is a waste of network bandwidth and CPU cycles.
  3. Logic Integrity: Never use = NULL. Always use IS NULL.
  4. SARGability Check: Never use functions (UPPER, DATE, etc.) on the column side of your filter.
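  5. EXPLAIN ANALYZE: Before you push to production, run your query with EXPLAIN ANALYZE (it actually executes the statement, so be careful with writes). If you see "Seq Scan" on a large table for a query that should return only a few rows, you have failed the architecture test.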

Mastering selective querying is the transition from "Reading data" to "Architecting the Search Engine." By understanding the physics of the Index Seek and the logical traps of NULL, you build systems that scale gracefully from the first user to the billionth.


The Optimizer Cost Mirror

How does the engine decide between a Scan and a Seek? It's a math problem.

  • seq_page_cost: Defaults to 1.0 in PostgreSQL. This is the baseline cost for a sequential page read.
  • random_page_cost: Defaults to 4.0 (tuned for HDDs) and is commonly lowered to around 1.1 on SSD or NVMe storage.
  • The Threshold: As a rule of thumb, if the optimizer calculates that it needs to read more than ~10-15% of the table, it will often choose a Sequential Scan over an Index Seek to avoid the overhead of random seeks.
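To get a feel for the math, here is a deliberately simplified toy model, not the real planner formula. It charges one sequential page read per table page for a scan and one random page read per page touched for an index lookup, and ignores index pages, per-row CPU costs, caching, and how rows are clustered. Because the real planner adds those costs, its switch-over point is lower than what this toy reports; the function names are invented for the sketch, which only shows the direction of the effect.

```python
def seq_scan_cost(table_pages, seq_page_cost=1.0):
    """Toy cost of reading every page of the table in order."""
    return table_pages * seq_page_cost

def index_scan_cost(table_pages, fraction_of_pages, random_page_cost):
    """Toy cost of fetching a fraction of the table's pages at random."""
    return table_pages * fraction_of_pages * random_page_cost

def break_even_fraction(seq_page_cost, random_page_cost):
    """Fraction of pages at which both toy costs are equal."""
    return seq_page_cost / random_page_cost

hdd_break_even = break_even_fraction(1.0, 4.0)    # random reads cost 4x
nvme_break_even = break_even_fraction(1.0, 1.1)   # random reads cost 1.1x

# Example with a 10,000-page table and random_page_cost = 4.0.
small_lookup = index_scan_cost(10_000, 0.10, 4.0)  # touches 10% of pages
large_lookup = index_scan_cost(10_000, 0.50, 4.0)  # touches 50% of pages
full_scan = seq_scan_cost(10_000)

print(hdd_break_even, round(nvme_break_even, 3))
print(small_lookup < full_scan, large_lookup < full_scan)
```

Lowering random_page_cost moves the break-even point later, which is why the same query can flip from a sequential scan to an index scan after that setting is tuned for faster storage.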

Masterclass Filter Checklist

  • Audit SARGability: Remove all function calls (e.g., UPPER(), DATE()) from column filters.
  • Implement Composite Indexes: Group common filter columns in a single index structure.
  • Verify Null Logic: Audit all != filters for missing IS NULL coverage.
  • Test Selectivity: Avoid filtering on low-cardinality columns (e.g., Gender, Boolean) without an accompanying high-selectivity filter.


Part of the SQL Mastery Course — engineering the search.