DB2 WHERE, GROUP BY, and HAVING: Filtering and Grouping Data

DB2 WHERE, GROUP BY, and HAVING: Filtering and Grouping Data
Understanding how to correctly filter and group data is a core skill for any DB2 developer or SQL author working on IBM mainframe or DB2 LUW systems. The DB2 GROUP BY and HAVING clauses are among the most powerful — and most frequently misunderstood — features in SQL. Used correctly, they transform raw transactional data into actionable summaries. Used incorrectly, they silently return wrong results or cause hard-to-diagnose errors. This guide walks through every aspect of the WHERE, GROUP BY, and HAVING clauses in DB2, from basic filtering to advanced multidimensional grouping with ROLLUP, CUBE, and GROUPING SETS.
DB2 Logical Query Processing Order
Before diving into syntax, it is essential to understand the order in which DB2 logically processes the clauses of a SELECT statement. This order determines what each clause can and cannot reference:
FROM— identify the source tables and joinsWHERE— filter individual rowsGROUP BY— partition filtered rows into groupsHAVING— filter groups based on aggregate conditionsSELECT— compute output columns and aliasesORDER BY— sort the final result set
This explains why a WHERE clause cannot reference a SELECT alias (aliases don't exist yet at step 2), and why HAVING can reference aggregate functions (aggregation happens at step 3-4).
The WHERE Clause: Filtering Individual Rows
The WHERE clause evaluates a condition for each row. Only rows for which the condition is TRUE pass through to the next stage. In DB2's three-valued logic, conditions can evaluate to TRUE, FALSE, or UNKNOWN (when NULLs are involved) — only TRUE rows survive a WHERE filter.
Comparison Operators
-- Basic comparisons
SELECT EMPNO, LASTNAME, SALARY, DEPT
FROM EMPLOYEE
WHERE SALARY > 60000 -- greater than
AND DEPT <> 'D01' -- not equal (also valid: != )
AND HIREDATE >= '2020-01-01';DB2 supports: =, <> (or !=), >, >=, <, <=.
BETWEEN, IN, and LIKE
-- BETWEEN (inclusive on both ends)
SELECT EMPNO, SALARY
FROM EMPLOYEE
WHERE SALARY BETWEEN 50000 AND 80000;
-- IN: match any value in a list
SELECT EMPNO, LASTNAME
FROM EMPLOYEE
WHERE JOB IN ('ANALYST', 'DESIGNER', 'MANAGER');
-- NOT IN: exclude a list (beware of NULLs — see HAVING section)
SELECT EMPNO, LASTNAME
FROM EMPLOYEE
WHERE DEPT NOT IN ('D01', 'D02');
-- LIKE: pattern matching with wildcards
-- % matches any sequence of characters
-- _ matches exactly one character
SELECT EMPNO, LASTNAME
FROM EMPLOYEE
WHERE LASTNAME LIKE 'SM%' -- starts with SM
AND FIRSTNME LIKE '_A%'; -- second character is AIS NULL and IS NOT NULL
-- Filter on NULL presence
SELECT EMPNO, LASTNAME, BONUS
FROM EMPLOYEE
WHERE BONUS IS NULL; -- no bonus recorded
SELECT EMPNO, LASTNAME, BONUS
FROM EMPLOYEE
WHERE BONUS IS NOT NULL; -- has a bonusNever use = NULL in DB2 (or any SQL database). The expression BONUS = NULL always evaluates to UNKNOWN, so the WHERE clause never selects any rows. Always use IS NULL.
EXISTS: Testing Subquery Row Existence
-- Employees who have placed at least one order
SELECT E.EMPNO, E.LASTNAME
FROM EMPLOYEE E
WHERE EXISTS (
SELECT 1
FROM ORDERS O
WHERE O.SALES_REP = E.EMPNO
);
-- Employees who have placed NO orders
SELECT E.EMPNO, E.LASTNAME
FROM EMPLOYEE E
WHERE NOT EXISTS (
SELECT 1
FROM ORDERS O
WHERE O.SALES_REP = E.EMPNO
);EXISTS stops scanning as soon as it finds one matching row, making it efficient for semi-join patterns. NOT EXISTS is generally safer than NOT IN when the subquery might return NULLs (see the common mistakes section).
Logical Operators: AND, OR, NOT
-- AND, OR, NOT with explicit parentheses for clarity
SELECT EMPNO, LASTNAME, SALARY, JOB, DEPT
FROM EMPLOYEE
WHERE (JOB = 'ANALYST' OR JOB = 'DESIGNER')
AND SALARY > 70000
AND NOT DEPT = 'D99';Always use parentheses when mixing AND and OR. Without them, AND has higher precedence than OR, which often produces logically incorrect results that can be hard to spot.
WHERE with Subqueries
-- Employees earning above the company-wide average salary
SELECT EMPNO, LASTNAME, SALARY
FROM EMPLOYEE
WHERE SALARY > (SELECT AVG(SALARY) FROM EMPLOYEE)
ORDER BY SALARY DESC;
-- Employees in departments located in New York
SELECT E.EMPNO, E.LASTNAME
FROM EMPLOYEE E
WHERE E.DEPT IN (
SELECT D.DEPTNO
FROM DEPARTMENT D
WHERE D.LOCATION = 'NEW YORK'
);GROUP BY: Grouping Rows for Aggregation
GROUP BY divides the rows that survive the WHERE clause into groups, one per unique combination of the grouping columns. An aggregate function applied after GROUP BY returns one value per group rather than one value for the whole table.
GROUP BY with a Single Column
-- Employee count and total salary per department
SELECT DEPT,
COUNT(*) AS HEADCOUNT,
SUM(SALARY) AS TOTAL_SALARY
FROM EMPLOYEE
GROUP BY DEPT
ORDER BY TOTAL_SALARY DESC;GROUP BY with Multiple Columns
-- Breakdown by department AND job title
SELECT DEPT,
JOB,
COUNT(*) AS HEADCOUNT,
AVG(SALARY) AS AVG_SALARY,
MIN(SALARY) AS MIN_SALARY,
MAX(SALARY) AS MAX_SALARY
FROM EMPLOYEE
GROUP BY DEPT, JOB
ORDER BY DEPT, AVG_SALARY DESC;Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. If you add LASTNAME to the SELECT above without adding it to GROUP BY, DB2 raises a SQL error.
GROUP BY with Expressions
You can group by an expression — not just a raw column. DB2 evaluates the expression for each row and groups by the result.
-- Group sales by year and quarter derived from a date column
SELECT YEAR(SALE_DATE) AS SALE_YEAR,
QUARTER(SALE_DATE) AS SALE_QTR,
COUNT(*) AS ORDER_COUNT,
SUM(AMOUNT) AS TOTAL_SALES
FROM ORDERS
GROUP BY YEAR(SALE_DATE), QUARTER(SALE_DATE)
ORDER BY SALE_YEAR, SALE_QTR;Note: DB2 does not allow referencing the SELECT alias (SALE_YEAR) in GROUP BY — you must repeat the full expression YEAR(SALE_DATE).
HAVING: Filtering Groups After Aggregation
HAVING is the post-aggregation equivalent of WHERE. It evaluates a condition against each group and discards groups for which the condition is not TRUE. HAVING is where you place conditions involving aggregate functions.
Basic HAVING Usage
-- Only departments with more than 5 employees
-- AND an average salary above 65000
SELECT DEPT,
COUNT(*) AS HEADCOUNT,
AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
GROUP BY DEPT
HAVING COUNT(*) > 5
AND AVG(SALARY) > 65000
ORDER BY AVG_SALARY DESC;HAVING Without GROUP BY
HAVING can be used without GROUP BY, treating the entire table as a single group. This is niche but valid:
-- Only return a result if the average salary exceeds 60000
SELECT COUNT(*) AS TOTAL_EMP, AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
HAVING AVG(SALARY) > 60000;If the condition is false, the query returns no rows at all.
HAVING with Non-Aggregate Conditions
Technically you can put non-aggregate filters in HAVING, but it is always better practice to put them in WHERE for performance — WHERE reduces the rows before aggregation, while HAVING removes groups after the more expensive aggregation step.
-- CORRECT but inefficient: non-aggregate filter in HAVING
SELECT DEPT, COUNT(*)
FROM EMPLOYEE
HAVING DEPT <> 'D99' -- DB2 will aggregate D99 and then discard it
GROUP BY DEPT;
-- BETTER: non-aggregate filter in WHERE
SELECT DEPT, COUNT(*)
FROM EMPLOYEE
WHERE DEPT <> 'D99' -- D99 rows excluded before aggregation
GROUP BY DEPT;WHERE vs HAVING: The Definitive Comparison
| Criterion | WHERE | HAVING |
|---|---|---|
| Evaluates at stage | Before GROUP BY | After GROUP BY |
| Can filter on aggregate functions (COUNT, SUM…) | No | Yes |
| Can filter on individual row columns | Yes | Yes (but use WHERE) |
| Performance impact | Reduces rows before aggregation (cheaper) | Removes groups after aggregation (costlier) |
| Can use SELECT column aliases | No | No |
Rule of thumb: put every condition that does not involve an aggregate into WHERE. Put every condition that does involve an aggregate into HAVING.
ORDER BY with GROUP BY
ORDER BY sorts the final result and is always the last clause evaluated. You can sort by any column in the SELECT list, including aggregate expressions. On DB2 for z/OS, ORDER BY position numbers (e.g., ORDER BY 3 DESC) are supported but discouraged — they make queries fragile when columns are reordered.
-- Quarterly sales ranked by total revenue, highest first
SELECT YEAR(SALE_DATE) AS YR,
QUARTER(SALE_DATE) AS QTR,
SUM(AMOUNT) AS TOTAL,
COUNT(*) AS ORDERS
FROM ORDERS
GROUP BY YEAR(SALE_DATE), QUARTER(SALE_DATE)
ORDER BY TOTAL DESC, -- alias is valid in ORDER BY (unlike WHERE/HAVING)
YR ASC,
QTR ASC;GROUPING SETS: Precise Multi-Level Aggregation
GROUPING SETS allows you to specify multiple grouping specifications in a single query. DB2 computes each grouping set independently and unions the results. This replaces UNION ALL of multiple GROUP BY queries with a single, more efficient pass.
-- Three separate subtotals: by DEPT, by JOB, and the grand total
SELECT DEPT,
JOB,
SUM(SALARY) AS TOTAL_SALARY,
COUNT(*) AS HEADCOUNT
FROM EMPLOYEE
GROUP BY GROUPING SETS (
(DEPT), -- subtotal per department
(JOB), -- subtotal per job
() -- grand total
)
ORDER BY GROUPING(DEPT), GROUPING(JOB),
DEPT NULLS LAST, JOB NULLS LAST;DB2 uses NULL in the result rows to indicate the column is part of the roll-up rather than a genuine NULL in the data. Use the GROUPING() function to distinguish them.
ROLLUP: Hierarchical Subtotals
ROLLUP is a shorthand for generating subtotals at every level of a left-to-right column hierarchy, plus a grand total. For N columns in the ROLLUP, DB2 generates N+1 grouping combinations.
-- Sales hierarchy: Year > Month > Region
SELECT YEAR(SALE_DATE) AS SALE_YEAR,
MONTH(SALE_DATE) AS SALE_MONTH,
REGION,
SUM(AMOUNT) AS REVENUE,
COUNT(*) AS ORDERS
FROM ORDERS
GROUP BY ROLLUP(
YEAR(SALE_DATE),
MONTH(SALE_DATE),
REGION
)
ORDER BY GROUPING(YEAR(SALE_DATE)),
GROUPING(MONTH(SALE_DATE)),
GROUPING(REGION),
SALE_YEAR NULLS LAST,
SALE_MONTH NULLS LAST,
REGION NULLS LAST;The result includes:
- Year + Month + Region (detail rows)
- Year + Month subtotals
- Year subtotals
- Grand total row (all grouping columns are NULL)
CUBE: All-Combinations Multidimensional Analysis
CUBE generates subtotals for every possible combination of the grouping columns. For three columns, that is 2³ = 8 combinations. It is the SQL equivalent of a full OLAP cross-tabulation.
-- All-dimension analysis: Category × Region × Channel
SELECT CATEGORY,
REGION,
CHANNEL,
SUM(AMOUNT) AS TOTAL,
GROUPING(CATEGORY) AS GRP_CAT,
GROUPING(REGION) AS GRP_REG,
GROUPING(CHANNEL) AS GRP_CHAN
FROM SALES
GROUP BY CUBE(CATEGORY, REGION, CHANNEL)
ORDER BY GRP_CAT, GRP_REG, GRP_CHAN,
CATEGORY NULLS LAST,
REGION NULLS LAST,
CHANNEL NULLS LAST;The GROUPING() columns return 1 for subtotal rows and 0 for detail rows, making it easy to apply conditional formatting in a reporting tool.
Common Mistakes and How to Avoid Them
Mistake 1: Using a SELECT Alias in WHERE
-- WRONG: alias ANNUAL_SALARY does not exist during WHERE evaluation
SELECT EMPNO, SALARY * 12 AS ANNUAL_SALARY
FROM EMPLOYEE
WHERE ANNUAL_SALARY > 720000; -- ERROR: ANNUAL_SALARY is unknown
-- CORRECT: repeat the expression, or use a derived table
SELECT EMPNO, ANNUAL_SALARY
FROM (
SELECT EMPNO, SALARY * 12 AS ANNUAL_SALARY
FROM EMPLOYEE
) AS E
WHERE ANNUAL_SALARY > 720000;Mistake 2: Putting an Aggregate in WHERE
-- WRONG: aggregate function in WHERE
SELECT DEPT, COUNT(*)
FROM EMPLOYEE
WHERE COUNT(*) > 5 -- ERROR: aggregates not allowed in WHERE
GROUP BY DEPT;
-- CORRECT: aggregate condition in HAVING
SELECT DEPT, COUNT(*)
FROM EMPLOYEE
GROUP BY DEPT
HAVING COUNT(*) > 5;Mistake 3: NOT IN with a Subquery That Returns NULLs
-- DANGEROUS: if ANY row in the subquery returns NULL,
-- NOT IN evaluates to UNKNOWN for every row, returning zero results
SELECT EMPNO, LASTNAME
FROM EMPLOYEE
WHERE DEPT NOT IN (
SELECT MGR_DEPT FROM MANAGERS -- MGR_DEPT might contain NULLs
);
-- SAFE: use NOT EXISTS instead
SELECT E.EMPNO, E.LASTNAME
FROM EMPLOYEE E
WHERE NOT EXISTS (
SELECT 1
FROM MANAGERS M
WHERE M.MGR_DEPT = E.DEPT
);Mistake 4: Omitting a GROUP BY Column from SELECT
-- If you GROUP BY DEPT and JOB but only select DEPT,
-- the JOB breakdown is computed but silently discarded —
-- you may be aggregating at a finer grain than intended.
-- Always verify your GROUP BY matches your reporting grain.
SELECT DEPT,
SUM(SALARY) AS TOTAL -- JOB missing from SELECT, GROUP BY is DEPT+JOB
FROM EMPLOYEE
GROUP BY DEPT, JOB; -- This is legal but unusual; produces one row per DEPT+JOB
-- while only showing DEPT in the outputPerformance Considerations
Index Usage with WHERE
-- Sargable: index on HIREDATE can be used (range scan)
WHERE HIREDATE BETWEEN '2020-01-01' AND '2022-12-31'
-- Non-sargable: function wrapping the column prevents index use
WHERE YEAR(HIREDATE) BETWEEN 2020 AND 2022Sargable predicates (Search ARGument ABLE) allow DB2 to use an index to limit the rows scanned. Wrapping a column in a function makes it non-sargable. Rewrite date range predicates as BETWEEN literals for optimal index use.
Stage 1 vs Stage 2 Predicates (DB2 for z/OS)
In DB2 for z/OS, predicates are classified as Stage 1 (applied during the data manager scan, most efficient) or Stage 2 (applied after rows are returned to RDS, less efficient). Simple column comparisons (=, >, BETWEEN, IN) are typically Stage 1. Functions and complex expressions are Stage 2. Keeping your WHERE predicates Stage 1 is critical for mainframe performance.
Avoid SELECT * with GROUP BY
SELECT * is invalid with GROUP BY in most cases (you cannot include all columns in both GROUP BY and aggregates simultaneously). More importantly, selecting only the columns you need reduces the data volume DB2 must sort and return.
RUNSTATS and GROUP BY Plans
DB2's optimizer decides between sort-based and hash-based aggregation based on catalog statistics. After large data loads, run RUNSTATS on the tables and indexes involved in frequent GROUP BY queries to ensure the optimizer has accurate row counts and column cardinality estimates.
Full Example: Sales Performance Report
-- Complete sales performance report combining all concepts
SELECT R.REGION_NAME,
YEAR(O.ORDER_DATE) AS ORDER_YEAR,
COUNT(*) AS TOTAL_ORDERS,
COUNT(DISTINCT O.CUSTOMER_ID) AS UNIQUE_CUSTOMERS,
SUM(O.AMOUNT) AS GROSS_REVENUE,
SUM(O.AMOUNT - O.DISCOUNT) AS NET_REVENUE,
AVG(O.AMOUNT) AS AVG_ORDER_VALUE,
MAX(O.AMOUNT) AS LARGEST_ORDER
FROM ORDERS O
JOIN REGIONS R ON O.REGION_ID = R.REGION_ID
WHERE O.STATUS IN ('COMPLETED', 'SHIPPED')
AND O.ORDER_DATE BETWEEN '2025-01-01' AND '2026-12-31'
GROUP BY R.REGION_NAME, YEAR(O.ORDER_DATE)
HAVING SUM(O.AMOUNT) > 100000
AND COUNT(*) > 50
ORDER BY ORDER_YEAR, NET_REVENUE DESC;This query demonstrates every concept in this article: WHERE with IN and BETWEEN, a JOIN, GROUP BY with an expression, multiple aggregates, HAVING with aggregate conditions, and ORDER BY with a mix of column and aggregate expressions.
