MainframeDatabaseDB2 Complete Reference

DB2 WHERE, GROUP BY, and HAVING: Filtering and Grouping Data

TT
TopicTrick Team
DB2 WHERE, GROUP BY, and HAVING: Filtering and Grouping Data

DB2 WHERE, GROUP BY, and HAVING: Filtering and Grouping Data

Understanding how to correctly filter and group data is a core skill for any DB2 developer or SQL author working on IBM mainframe or DB2 LUW systems. The DB2 GROUP BY and HAVING clauses are among the most powerful — and most frequently misunderstood — features in SQL. Used correctly, they transform raw transactional data into actionable summaries. Used incorrectly, they silently return wrong results or cause hard-to-diagnose errors. This guide walks through every aspect of the WHERE, GROUP BY, and HAVING clauses in DB2, from basic filtering to advanced multidimensional grouping with ROLLUP, CUBE, and GROUPING SETS.


DB2 Logical Query Processing Order

Before diving into syntax, it is essential to understand the order in which DB2 logically processes the clauses of a SELECT statement. This order determines what each clause can and cannot reference:

  1. FROM — identify the source tables and joins
  2. WHERE — filter individual rows
  3. GROUP BY — partition filtered rows into groups
  4. HAVING — filter groups based on aggregate conditions
  5. SELECT — compute output columns and aliases
  6. ORDER BY — sort the final result set

This explains why a WHERE clause cannot reference a SELECT alias (aliases don't exist yet at step 2), and why HAVING can reference aggregate functions (aggregation happens at step 3-4).


The WHERE Clause: Filtering Individual Rows

The WHERE clause evaluates a condition for each row. Only rows for which the condition is TRUE pass through to the next stage. In DB2's three-valued logic, conditions can evaluate to TRUE, FALSE, or UNKNOWN (when NULLs are involved) — only TRUE rows survive a WHERE filter.

Comparison Operators

sql
-- Basic comparisons
SELECT EMPNO, LASTNAME, SALARY, DEPT
FROM   EMPLOYEE
WHERE  SALARY > 60000       -- greater than
  AND  DEPT <> 'D01'        -- not equal (also valid: != )
  AND  HIREDATE >= '2020-01-01';

DB2 supports: =, <> (or !=), >, >=, <, <=.

BETWEEN, IN, and LIKE

sql
-- BETWEEN (inclusive on both ends)
SELECT EMPNO, SALARY
FROM   EMPLOYEE
WHERE  SALARY BETWEEN 50000 AND 80000;

-- IN: match any value in a list
SELECT EMPNO, LASTNAME
FROM   EMPLOYEE
WHERE  JOB IN ('ANALYST', 'DESIGNER', 'MANAGER');

-- NOT IN: exclude a list (beware of NULLs — see HAVING section)
SELECT EMPNO, LASTNAME
FROM   EMPLOYEE
WHERE  DEPT NOT IN ('D01', 'D02');

-- LIKE: pattern matching with wildcards
-- % matches any sequence of characters
-- _ matches exactly one character
SELECT EMPNO, LASTNAME
FROM   EMPLOYEE
WHERE  LASTNAME LIKE 'SM%'      -- starts with SM
  AND  FIRSTNME LIKE '_A%';     -- second character is A

IS NULL and IS NOT NULL

sql
-- Filter on NULL presence
SELECT EMPNO, LASTNAME, BONUS
FROM   EMPLOYEE
WHERE  BONUS IS NULL;      -- no bonus recorded

SELECT EMPNO, LASTNAME, BONUS
FROM   EMPLOYEE
WHERE  BONUS IS NOT NULL;  -- has a bonus

Never use = NULL in DB2 (or any SQL database). The expression BONUS = NULL always evaluates to UNKNOWN, so the WHERE clause never selects any rows. Always use IS NULL.

EXISTS: Testing Subquery Row Existence

sql
-- Employees who have placed at least one order
SELECT E.EMPNO, E.LASTNAME
FROM   EMPLOYEE E
WHERE  EXISTS (
    SELECT 1
    FROM   ORDERS O
    WHERE  O.SALES_REP = E.EMPNO
);

-- Employees who have placed NO orders
SELECT E.EMPNO, E.LASTNAME
FROM   EMPLOYEE E
WHERE  NOT EXISTS (
    SELECT 1
    FROM   ORDERS O
    WHERE  O.SALES_REP = E.EMPNO
);

EXISTS stops scanning as soon as it finds one matching row, making it efficient for semi-join patterns. NOT EXISTS is generally safer than NOT IN when the subquery might return NULLs (see the common mistakes section).

Logical Operators: AND, OR, NOT

sql
-- AND, OR, NOT with explicit parentheses for clarity
SELECT EMPNO, LASTNAME, SALARY, JOB, DEPT
FROM   EMPLOYEE
WHERE  (JOB = 'ANALYST' OR JOB = 'DESIGNER')
  AND  SALARY > 70000
  AND  NOT DEPT = 'D99';

Always use parentheses when mixing AND and OR. Without them, AND has higher precedence than OR, which often produces logically incorrect results that can be hard to spot.

WHERE with Subqueries

sql
-- Employees earning above the company-wide average salary
SELECT EMPNO, LASTNAME, SALARY
FROM   EMPLOYEE
WHERE  SALARY > (SELECT AVG(SALARY) FROM EMPLOYEE)
ORDER BY SALARY DESC;

-- Employees in departments located in New York
SELECT E.EMPNO, E.LASTNAME
FROM   EMPLOYEE E
WHERE  E.DEPT IN (
    SELECT D.DEPTNO
    FROM   DEPARTMENT D
    WHERE  D.LOCATION = 'NEW YORK'
);

GROUP BY: Grouping Rows for Aggregation

GROUP BY divides the rows that survive the WHERE clause into groups, one per unique combination of the grouping columns. An aggregate function applied after GROUP BY returns one value per group rather than one value for the whole table.

GROUP BY with a Single Column

sql
-- Employee count and total salary per department
SELECT   DEPT,
         COUNT(*)     AS HEADCOUNT,
         SUM(SALARY)  AS TOTAL_SALARY
FROM     EMPLOYEE
GROUP BY DEPT
ORDER BY TOTAL_SALARY DESC;

GROUP BY with Multiple Columns

sql
-- Breakdown by department AND job title
SELECT   DEPT,
         JOB,
         COUNT(*)            AS HEADCOUNT,
         AVG(SALARY)         AS AVG_SALARY,
         MIN(SALARY)         AS MIN_SALARY,
         MAX(SALARY)         AS MAX_SALARY
FROM     EMPLOYEE
GROUP BY DEPT, JOB
ORDER BY DEPT, AVG_SALARY DESC;

Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. If you add LASTNAME to the SELECT above without adding it to GROUP BY, DB2 raises a SQL error.

GROUP BY with Expressions

You can group by an expression — not just a raw column. DB2 evaluates the expression for each row and groups by the result.

sql
-- Group sales by year and quarter derived from a date column
SELECT   YEAR(SALE_DATE)    AS SALE_YEAR,
         QUARTER(SALE_DATE) AS SALE_QTR,
         COUNT(*)           AS ORDER_COUNT,
         SUM(AMOUNT)        AS TOTAL_SALES
FROM     ORDERS
GROUP BY YEAR(SALE_DATE), QUARTER(SALE_DATE)
ORDER BY SALE_YEAR, SALE_QTR;

Note: DB2 does not allow referencing the SELECT alias (SALE_YEAR) in GROUP BY — you must repeat the full expression YEAR(SALE_DATE).


HAVING: Filtering Groups After Aggregation

HAVING is the post-aggregation equivalent of WHERE. It evaluates a condition against each group and discards groups for which the condition is not TRUE. HAVING is where you place conditions involving aggregate functions.

Basic HAVING Usage

sql
-- Only departments with more than 5 employees
-- AND an average salary above 65000
SELECT   DEPT,
         COUNT(*)    AS HEADCOUNT,
         AVG(SALARY) AS AVG_SALARY
FROM     EMPLOYEE
GROUP BY DEPT
HAVING   COUNT(*)    > 5
     AND AVG(SALARY) > 65000
ORDER BY AVG_SALARY DESC;

HAVING Without GROUP BY

HAVING can be used without GROUP BY, treating the entire table as a single group. This is niche but valid:

sql
-- Only return a result if the average salary exceeds 60000
SELECT COUNT(*) AS TOTAL_EMP, AVG(SALARY) AS AVG_SALARY
FROM   EMPLOYEE
HAVING AVG(SALARY) > 60000;

If the condition is false, the query returns no rows at all.

HAVING with Non-Aggregate Conditions

Technically you can put non-aggregate filters in HAVING, but it is always better practice to put them in WHERE for performance — WHERE reduces the rows before aggregation, while HAVING removes groups after the more expensive aggregation step.

sql
-- CORRECT but inefficient: non-aggregate filter in HAVING
SELECT   DEPT, COUNT(*)
FROM     EMPLOYEE
HAVING   DEPT <> 'D99'   -- DB2 will aggregate D99 and then discard it
GROUP BY DEPT;

-- BETTER: non-aggregate filter in WHERE
SELECT   DEPT, COUNT(*)
FROM     EMPLOYEE
WHERE    DEPT <> 'D99'   -- D99 rows excluded before aggregation
GROUP BY DEPT;

WHERE vs HAVING: The Definitive Comparison

CriterionWHEREHAVING
Evaluates at stageBefore GROUP BYAfter GROUP BY
Can filter on aggregate functions (COUNT, SUM…)NoYes
Can filter on individual row columnsYesYes (but use WHERE)
Performance impactReduces rows before aggregation (cheaper)Removes groups after aggregation (costlier)
Can use SELECT column aliasesNoNo

Rule of thumb: put every condition that does not involve an aggregate into WHERE. Put every condition that does involve an aggregate into HAVING.


ORDER BY with GROUP BY

ORDER BY sorts the final result and is always the last clause evaluated. You can sort by any column in the SELECT list, including aggregate expressions. On DB2 for z/OS, ORDER BY position numbers (e.g., ORDER BY 3 DESC) are supported but discouraged — they make queries fragile when columns are reordered.

sql
-- Quarterly sales ranked by total revenue, highest first
SELECT   YEAR(SALE_DATE)    AS YR,
         QUARTER(SALE_DATE) AS QTR,
         SUM(AMOUNT)        AS TOTAL,
         COUNT(*)           AS ORDERS
FROM     ORDERS
GROUP BY YEAR(SALE_DATE), QUARTER(SALE_DATE)
ORDER BY TOTAL DESC,    -- alias is valid in ORDER BY (unlike WHERE/HAVING)
         YR   ASC,
         QTR  ASC;

GROUPING SETS: Precise Multi-Level Aggregation

GROUPING SETS allows you to specify multiple grouping specifications in a single query. DB2 computes each grouping set independently and unions the results. This replaces UNION ALL of multiple GROUP BY queries with a single, more efficient pass.

sql
-- Three separate subtotals: by DEPT, by JOB, and the grand total
SELECT   DEPT,
         JOB,
         SUM(SALARY) AS TOTAL_SALARY,
         COUNT(*)    AS HEADCOUNT
FROM     EMPLOYEE
GROUP BY GROUPING SETS (
    (DEPT),      -- subtotal per department
    (JOB),       -- subtotal per job
    ()           -- grand total
)
ORDER BY GROUPING(DEPT), GROUPING(JOB),
         DEPT NULLS LAST, JOB NULLS LAST;

DB2 uses NULL in the result rows to indicate the column is part of the roll-up rather than a genuine NULL in the data. Use the GROUPING() function to distinguish them.


ROLLUP: Hierarchical Subtotals

ROLLUP is a shorthand for generating subtotals at every level of a left-to-right column hierarchy, plus a grand total. For N columns in the ROLLUP, DB2 generates N+1 grouping combinations.

sql
-- Sales hierarchy: Year > Month > Region
SELECT   YEAR(SALE_DATE)   AS SALE_YEAR,
         MONTH(SALE_DATE)  AS SALE_MONTH,
         REGION,
         SUM(AMOUNT)       AS REVENUE,
         COUNT(*)          AS ORDERS
FROM     ORDERS
GROUP BY ROLLUP(
    YEAR(SALE_DATE),
    MONTH(SALE_DATE),
    REGION
)
ORDER BY GROUPING(YEAR(SALE_DATE)),
         GROUPING(MONTH(SALE_DATE)),
         GROUPING(REGION),
         SALE_YEAR NULLS LAST,
         SALE_MONTH NULLS LAST,
         REGION     NULLS LAST;

The result includes:

  • Year + Month + Region (detail rows)
  • Year + Month subtotals
  • Year subtotals
  • Grand total row (all grouping columns are NULL)

CUBE: All-Combinations Multidimensional Analysis

CUBE generates subtotals for every possible combination of the grouping columns. For three columns, that is 2³ = 8 combinations. It is the SQL equivalent of a full OLAP cross-tabulation.

sql
-- All-dimension analysis: Category × Region × Channel
SELECT   CATEGORY,
         REGION,
         CHANNEL,
         SUM(AMOUNT)  AS TOTAL,
         GROUPING(CATEGORY) AS GRP_CAT,
         GROUPING(REGION)   AS GRP_REG,
         GROUPING(CHANNEL)  AS GRP_CHAN
FROM     SALES
GROUP BY CUBE(CATEGORY, REGION, CHANNEL)
ORDER BY GRP_CAT, GRP_REG, GRP_CHAN,
         CATEGORY NULLS LAST,
         REGION   NULLS LAST,
         CHANNEL  NULLS LAST;

The GROUPING() columns return 1 for subtotal rows and 0 for detail rows, making it easy to apply conditional formatting in a reporting tool.


Common Mistakes and How to Avoid Them

Mistake 1: Using a SELECT Alias in WHERE

sql
-- WRONG: alias ANNUAL_SALARY does not exist during WHERE evaluation
SELECT EMPNO, SALARY * 12 AS ANNUAL_SALARY
FROM   EMPLOYEE
WHERE  ANNUAL_SALARY > 720000;   -- ERROR: ANNUAL_SALARY is unknown

-- CORRECT: repeat the expression, or use a derived table
SELECT EMPNO, ANNUAL_SALARY
FROM (
    SELECT EMPNO, SALARY * 12 AS ANNUAL_SALARY
    FROM   EMPLOYEE
) AS E
WHERE  ANNUAL_SALARY > 720000;

Mistake 2: Putting an Aggregate in WHERE

sql
-- WRONG: aggregate function in WHERE
SELECT DEPT, COUNT(*)
FROM   EMPLOYEE
WHERE  COUNT(*) > 5     -- ERROR: aggregates not allowed in WHERE
GROUP BY DEPT;

-- CORRECT: aggregate condition in HAVING
SELECT DEPT, COUNT(*)
FROM   EMPLOYEE
GROUP BY DEPT
HAVING COUNT(*) > 5;

Mistake 3: NOT IN with a Subquery That Returns NULLs

sql
-- DANGEROUS: if ANY row in the subquery returns NULL,
-- NOT IN evaluates to UNKNOWN for every row, returning zero results
SELECT EMPNO, LASTNAME
FROM   EMPLOYEE
WHERE  DEPT NOT IN (
    SELECT MGR_DEPT FROM MANAGERS   -- MGR_DEPT might contain NULLs
);

-- SAFE: use NOT EXISTS instead
SELECT E.EMPNO, E.LASTNAME
FROM   EMPLOYEE E
WHERE  NOT EXISTS (
    SELECT 1
    FROM   MANAGERS M
    WHERE  M.MGR_DEPT = E.DEPT
);

Mistake 4: Omitting a GROUP BY Column from SELECT

sql
-- If you GROUP BY DEPT and JOB but only select DEPT,
-- the JOB breakdown is computed but silently discarded —
-- you may be aggregating at a finer grain than intended.

-- Always verify your GROUP BY matches your reporting grain.
SELECT   DEPT,
         SUM(SALARY) AS TOTAL   -- JOB missing from SELECT, GROUP BY is DEPT+JOB
FROM     EMPLOYEE
GROUP BY DEPT, JOB;             -- This is legal but unusual; produces one row per DEPT+JOB
                                -- while only showing DEPT in the output

Performance Considerations

Index Usage with WHERE

sql
-- Sargable: index on HIREDATE can be used (range scan)
WHERE HIREDATE BETWEEN '2020-01-01' AND '2022-12-31'

-- Non-sargable: function wrapping the column prevents index use
WHERE YEAR(HIREDATE) BETWEEN 2020 AND 2022

Sargable predicates (Search ARGument ABLE) allow DB2 to use an index to limit the rows scanned. Wrapping a column in a function makes it non-sargable. Rewrite date range predicates as BETWEEN literals for optimal index use.

Stage 1 vs Stage 2 Predicates (DB2 for z/OS)

In DB2 for z/OS, predicates are classified as Stage 1 (applied during the data manager scan, most efficient) or Stage 2 (applied after rows are returned to RDS, less efficient). Simple column comparisons (=, >, BETWEEN, IN) are typically Stage 1. Functions and complex expressions are Stage 2. Keeping your WHERE predicates Stage 1 is critical for mainframe performance.

Avoid SELECT * with GROUP BY

SELECT * is invalid with GROUP BY in most cases (you cannot include all columns in both GROUP BY and aggregates simultaneously). More importantly, selecting only the columns you need reduces the data volume DB2 must sort and return.

RUNSTATS and GROUP BY Plans

DB2's optimizer decides between sort-based and hash-based aggregation based on catalog statistics. After large data loads, run RUNSTATS on the tables and indexes involved in frequent GROUP BY queries to ensure the optimizer has accurate row counts and column cardinality estimates.


Full Example: Sales Performance Report

sql
-- Complete sales performance report combining all concepts
SELECT   R.REGION_NAME,
         YEAR(O.ORDER_DATE)              AS ORDER_YEAR,
         COUNT(*)                        AS TOTAL_ORDERS,
         COUNT(DISTINCT O.CUSTOMER_ID)   AS UNIQUE_CUSTOMERS,
         SUM(O.AMOUNT)                   AS GROSS_REVENUE,
         SUM(O.AMOUNT - O.DISCOUNT)      AS NET_REVENUE,
         AVG(O.AMOUNT)                   AS AVG_ORDER_VALUE,
         MAX(O.AMOUNT)                   AS LARGEST_ORDER
FROM     ORDERS     O
JOIN     REGIONS    R  ON O.REGION_ID = R.REGION_ID
WHERE    O.STATUS   IN ('COMPLETED', 'SHIPPED')
  AND    O.ORDER_DATE BETWEEN '2025-01-01' AND '2026-12-31'
GROUP BY R.REGION_NAME, YEAR(O.ORDER_DATE)
HAVING   SUM(O.AMOUNT) > 100000
     AND COUNT(*)       > 50
ORDER BY ORDER_YEAR, NET_REVENUE DESC;

This query demonstrates every concept in this article: WHERE with IN and BETWEEN, a JOIN, GROUP BY with an expression, multiple aggregates, HAVING with aggregate conditions, and ORDER BY with a mix of column and aggregate expressions.


Related Resources