SQL Project: E-commerce Analytics Engine

Welcome to the first Capstone Lab of the SQL Masterclass. You have spent the previous modules learning the "Grammar" of SQL, from the basics of SELECT to the advanced physics of MVCC and GIN indexes. Now, you will build a "Machine."
In the modern digital economy, the database is not just a storage box; it is the Engine of Truth. If this engine is slow, the mobile app lags, the Black Friday sales crash, and the company loses millions. As an Architect, your job is to ensure the Hardware Mirror of your code is perfectly aligned with the physical reality of SSDs and CPU cycles.
This project guide is a high-fidelity engineering assignment. You are the Lead Database Architect at "Arcane Commerce," a global marketplace. Your mission is to build the backend logic for their 2026 analytics dashboard, capable of handling millions of orders and real-time inventory checks.
1. Task 1: The "Hardware-Mirror" Infrastructure
Before you can analyze data, you must build a storage engine that doesn't collapse under load. Most developers default to SERIAL integers for Primary Keys. This is a "Legacy Trap."
The Mission: 3NF Relational Foundation
Design a 3rd Normal Form (3NF) relational schema that handles accounts, catalogs, and transactions.
- Schema Requirements:
  - `users`: Must handle geographic data, account age, and registration metadata.
  - `products`: Must support a base structure (SKU, price) plus custom metadata.
  - `orders`: The central fact table linking users and timestamps.
  - `order_items`: The grain of the transaction, storing price snapshots.
The Architect's Standard: UUID v7
Instead of standard integers or random UUID v4, use UUID v7 for all primary keys.
- The Physics: UUID v7 is "Time-Ordered." Unlike random UUID v4, which causes "Index Fragmentation" by inserting rows into random pages of the B-Tree, UUID v7 preserves "Sequential Ordering."
- B-Tree Page Splitting: When you use random IDs, the database has to split pages in the middle of the index to fit new rows. This results in Storage Bloat and higher Write Amplification. UUID v7 ensures new records land at the "Right-most Leaf," keeping the index dense and efficient.
- The Result: Your disk writes stay sequential, and your index lookups stay in the CPU cache.
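A minimal DDL sketch of this 3NF foundation follows. It assumes PostgreSQL 18's built-in `uuidv7()` function; on older versions, generate UUID v7 values in the application layer or via an extension, since core PostgreSQL only ships `gen_random_uuid()` (v4). Column names here are illustrative, not prescribed by the assignment.

```sql
-- Sketch: 3NF core schema with time-ordered UUID v7 primary keys.
-- Assumes PostgreSQL 18's uuidv7(); hedge: on older versions supply
-- UUID v7 values from the application or an extension.
CREATE TABLE users (
    user_id    uuid PRIMARY KEY DEFAULT uuidv7(),
    email      text NOT NULL UNIQUE,
    country    text NOT NULL,                       -- geographic data
    created_at timestamptz NOT NULL DEFAULT now()   -- registration metadata
);

CREATE TABLE products (
    product_id uuid PRIMARY KEY DEFAULT uuidv7(),
    sku        text NOT NULL UNIQUE,
    price      numeric(10,2) NOT NULL CHECK (price >= 0)
);

CREATE TABLE orders (
    order_id   uuid PRIMARY KEY DEFAULT uuidv7(),
    user_id    uuid NOT NULL REFERENCES users(user_id),
    order_date timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    order_id   uuid NOT NULL REFERENCES orders(order_id),
    product_id uuid NOT NULL REFERENCES products(product_id),
    quantity   int  NOT NULL CHECK (quantity > 0),
    unit_price numeric(10,2) NOT NULL,   -- price snapshot at purchase time
    PRIMARY KEY (order_id, product_id)
);
```

Because every new `uuidv7()` value sorts after all previous ones, inserts append to the right-most B-Tree leaf instead of splitting random interior pages.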
2. Task 2: The Hybrid Catalog (JSONB & GIN)
Marketing wants to store custom attributes for every product (e.g., "Fabric" for shirts, "Screen Type" for laptops, "Battery Capacity" for gadgets). If you create a new column for every possible attribute, your table will have 5,000 columns—an architectural failure known as "The Sparse Column Trap."
The Mission: Schema-less Flexibility
- Implementation: Add a `metadata` JSONB column to the `products` table.
- Storage Physics: Unlike standard `JSON` (which is stored as plain text and parsed on every read), `JSONB` is stored in a Decomposed Binary Format. This allows the engine to jump directly to specific keys without reading the entire document.
- Requirement: Create a GIN (Generalized Inverted Index) on the `metadata` column.
- The Challenge: Write a high-performance query using the `@>` containment operator to find all "Organic Cotton" shirts priced between $20 and $60.
The Mirror Detail: A GIN index doesn't index the rows; it indexes the Values inside the JSON. When you search for {"material": "cotton"}, the index tells the database exactly which RowIDs contain that specific key-value pair, bypassing a slow Sequential Scan of millions of products.
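A sketch of the hybrid catalog, building on the Task 1 schema. The JSON key names (`category`, `material`) are assumptions for illustration; use whatever keys your catalog actually stores.

```sql
-- Sketch: hybrid catalog with a GIN-indexed JSONB column.
ALTER TABLE products ADD COLUMN IF NOT EXISTS metadata jsonb NOT NULL DEFAULT '{}';

-- GIN indexes the keys and values INSIDE the documents, not the rows.
CREATE INDEX idx_products_metadata ON products USING gin (metadata);

-- The @> containment operator is GIN-accelerated; the price range
-- is filtered separately (add a B-Tree on price if it's selective).
SELECT product_id, sku, price
FROM   products
WHERE  metadata @> '{"category": "shirt", "material": "organic cotton"}'
AND    price BETWEEN 20 AND 60;
```

Note the operand order: `metadata @> '{...}'` asks "does this document contain this fragment?", which is the direction the default GIN operator class accelerates.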
3. Task 3: Analytical Intelligence (Cohort Analysis)
The CEO doesn't care about single orders. They care about Retention. In e-commerce, the most expensive thing you can do is acquire a new customer. The most profitable thing is keeping one.
The Mission: Retention & Customer Lifetime Value (LTV)
Use Window Functions and CTEs to build a "Customer Lifetime Value" (LTV) report that tracks user behavior over time.
- Step 1: The First Touch: Use `MIN(order_date) OVER (PARTITION BY user_id)` to find the date of every user's first purchase.
- Step 2: The Running Total: Use `SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)` to calculate the cumulative revenue per user.
- Step 3: Monthly Cohorts: Group users by the month they joined and calculate their average revenue after 3, 6, and 12 months.
The "Work_Mem" Challenge
When you run a Window Function over 10 million rows, the database must sort the data. If your work_mem setting is too low, the database will "Spill to Disk"—writing temporary files to the SSD. This is 100x slower than RAM. As an Architect, you must optimize your CTEs to ensure the sort happens entirely in memory.
SQL Logic Mirror:
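One possible shape for this report, assuming the Task 1 schema (order totals are derived from `order_items` since `orders` carries no amount column):

```sql
-- Sketch: first touch, running LTV, and monthly cohort revenue.
WITH order_totals AS (
    SELECT o.user_id, o.order_date,
           SUM(oi.quantity * oi.unit_price) AS amount
    FROM   orders o
    JOIN   order_items oi USING (order_id)
    GROUP  BY o.order_id, o.user_id, o.order_date
),
enriched AS (
    SELECT user_id, order_date, amount,
           MIN(order_date) OVER (PARTITION BY user_id)  AS first_order,
           SUM(amount) OVER (PARTITION BY user_id
                             ORDER BY order_date)        AS running_ltv
    FROM   order_totals
)
SELECT date_trunc('month', first_order) AS cohort_month,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '3 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_3m,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '6 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_6m,
       SUM(amount) FILTER (WHERE order_date < first_order + interval '12 months')
           / COUNT(DISTINCT user_id) AS revenue_per_user_12m
FROM   enriched
GROUP  BY 1
ORDER  BY 1;
```

If `EXPLAIN (ANALYZE)` shows `Sort Method: external merge`, the sort spilled to disk; raising `work_mem` for this session is the usual fix.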
4. Task 4: The Atomic Checkout Engine
Data integrity is the "Foundation of Trust." If an order is created but the inventory wasn't updated, the system is broken. If two people buy the "Last Item" at the same microsecond, and they both get a "Success" message, you have a physical supply chain failure.
The Mission: The process_order Engine
Write a Stored Procedure (PL/pgSQL) that handles a checkout atomically.
- Requirements:
- Isolation: Use `SELECT ... FOR UPDATE` to lock the specific row in the `inventory` table. This prevents "Lost Updates."
- Safety: Check that `inventory.quantity >= order_item.quantity`.
- Atomicity: If the quantity is insufficient, issue a `RAISE EXCEPTION`, which triggers an automatic `ROLLBACK` of the entire transaction.
- Audit: On success, insert an entry into the `audit_logs` table using the `JSONB` representation of the row to record the state before vs. after.
The Lock Contention Mirror
While SELECT FOR UPDATE is safe, it is also a bottleneck. If 5,000 people try to buy the same "Limited Edition" item, they will all queue up for a single row lock. Architects solve this by using Inventory Sharding or Soft-Reservation Logs, but for this project, you must implement the primary "Strict" lock logic.
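A sketch of the strict-lock version. It assumes an `inventory(product_id uuid PRIMARY KEY, quantity int)` table and an `audit_logs(entry jsonb, logged_at timestamptz)` table alongside the Task 1 schema; both names are placeholders.

```sql
-- Sketch: atomic checkout with a strict row lock.
CREATE OR REPLACE PROCEDURE process_order(
    p_order_id   uuid,
    p_product_id uuid,
    p_quantity   int
)
LANGUAGE plpgsql AS $$
DECLARE
    v_before inventory%ROWTYPE;
BEGIN
    -- Isolation: lock the inventory row until COMMIT/ROLLBACK.
    SELECT * INTO v_before
    FROM   inventory
    WHERE  product_id = p_product_id
    FOR UPDATE;

    IF NOT FOUND THEN
        RAISE EXCEPTION 'unknown product %', p_product_id;
    END IF;

    -- Safety: refuse to oversell.
    IF v_before.quantity < p_quantity THEN
        -- RAISE aborts the transaction; every prior write rolls back.
        RAISE EXCEPTION 'insufficient stock for %', p_product_id;
    END IF;

    UPDATE inventory
    SET    quantity = quantity - p_quantity
    WHERE  product_id = p_product_id;

    INSERT INTO order_items (order_id, product_id, quantity, unit_price)
    SELECT p_order_id, p_product_id, p_quantity, price
    FROM   products
    WHERE  product_id = p_product_id;

    -- Audit: record state before vs. after as JSONB.
    INSERT INTO audit_logs (entry, logged_at)
    VALUES (jsonb_build_object(
                'before', to_jsonb(v_before),
                'after',  (SELECT to_jsonb(i) FROM inventory i
                           WHERE i.product_id = p_product_id)),
            now());
END;
$$;
```

Call it with `CALL process_order(...)` inside a transaction; two concurrent callers for the same product serialize on the `FOR UPDATE` lock, so only one can take the last unit.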
5. Task 5: Performance Stress Lab (1 Million Row Benchmark)
Architecture is a theory until it hits 1 million rows. A query that takes 1ms on your laptop might take 30 seconds in production.
The Mission: The 2:00 AM Black Friday Simulation
- Step 1: Data Synthesis: Use `generate_series()` and `random()` to populate your tables with a realistic distribution of data.
- Step 2: The Bottleneck: Run a complex join across `users`, `orders`, and `products` without primary/foreign key indexes.
- Step 3: The Physics of EXPLAIN: Use `EXPLAIN (ANALYZE, BUFFERS)` to read the Hardware Cost Model.
  - Look for "Seq Scan": the engine is reading the entire file from the SSD.
  - Look for "Filter": the CPU is discarding rows after reading them from disk.
- Step 4: The Fix: Add B-Tree indexes on all join columns. Observe how the engine switches to a "Nested Loop" or "Hash Join," and watch the `Buffers: shared hit` count increase. This means your data is now being served from PostgreSQL's shared buffer cache in RAM, not from the slow disk.
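A sketch of the synthesis and measurement steps, assuming the Task 1 schema with defaulted primary keys; the country list and row counts are arbitrary:

```sql
-- Sketch: synthesize a million users, then inspect the plan.
INSERT INTO users (email, country)
SELECT 'user' || g || '@example.com',
       (ARRAY['US','DE','JP','BR'])[1 + floor(random() * 4)::int]
FROM   generate_series(1, 1000000) AS g;

-- Before indexing: expect "Seq Scan" plus a "Filter" line in the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, o.order_date
FROM   users u
JOIN   orders o ON o.user_id = u.user_id
WHERE  u.country = 'DE';

-- The Fix: index the join and filter columns, then re-run EXPLAIN.
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_users_country  ON users (country);
```

Run `ANALYZE` after bulk loading so the planner has fresh statistics; otherwise it may keep choosing a sequential scan regardless of your indexes.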
6. Task 6: Materialized Dashboards
Leadership needs a "One Second Refresh" dashboard for daily sales metrics. Joining 1 million rows and calculating aggregates on every page refresh of a busy internal tool is a waste of $1,000/month in cloud compute.
The Mission: Caching the State
- Implementation: Create a Materialized View named `daily_sales_stats` that calculates yesterday's total revenue, top products, and churn rate.
- The "Staleness" Mirror: Data in a Materialized View is "Stale"; it doesn't update automatically.
- Maintenance: Implement logic that runs `REFRESH MATERIALIZED VIEW CONCURRENTLY`. The `CONCURRENTLY` keyword ensures that users can still read the "Old" data while the "New" data is being calculated in the background, avoiding any downtime. Note that `CONCURRENTLY` requires a unique index on the view.
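A sketch built on the Task 1 schema (churn rate omitted for brevity, since its definition depends on your retention model):

```sql
-- Sketch: cached daily sales metrics.
CREATE MATERIALIZED VIEW daily_sales_stats AS
SELECT date_trunc('day', o.order_date)  AS sales_day,
       SUM(oi.quantity * oi.unit_price) AS total_revenue,
       COUNT(DISTINCT o.user_id)        AS active_buyers
FROM   orders o
JOIN   order_items oi USING (order_id)
GROUP  BY 1;

-- CONCURRENTLY refuses to run without a unique index on the view.
CREATE UNIQUE INDEX idx_daily_sales_day ON daily_sales_stats (sales_day);

-- Schedule this (cron, pg_cron, etc.); readers keep seeing the old
-- snapshot while the new one is computed in the background.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_stats;
```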
7. Task 7: Growth Engineering (Moving Averages)
In e-commerce, raw numbers are deceptive. You need to see the Trend.
The Mission: The 7-Day Revenue Curve
Build a query that calculates the 7-Day Moving Average of revenue.
- The Physics: Use the window frame
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. - The Goal: Smooth out the "Noise" of low-traffic weekdays vs high-traffic weekends to give the marketing team a clear view of marketing campaign impact.
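A sketch of the moving-average query, again deriving revenue from `order_items` under the Task 1 schema:

```sql
-- Sketch: 7-day moving average over daily revenue.
WITH daily AS (
    SELECT date_trunc('day', o.order_date)::date AS sales_day,
           SUM(oi.quantity * oi.unit_price)      AS revenue
    FROM   orders o
    JOIN   order_items oi USING (order_id)
    GROUP  BY 1
)
SELECT sales_day,
       revenue,
       AVG(revenue) OVER (ORDER BY sales_day
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS revenue_7d_ma
FROM   daily
ORDER  BY sales_day;
```

One caveat: `ROWS` counts rows, not calendar days. If some days have zero orders, densify the date axis first (e.g., with `generate_series`) so the frame really spans seven days.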
8. Summary: Assessment & Final Certification
To receive your "Lead Database Architect" certification for this project, your code must satisfy:
- Normalization Fidelity: No PII data should be duplicated across tables.
- Indexing Precision: Every query in your dashboard MUST result in an Index Scan or Index Only Scan. No Sequential Scans are allowed on tables over 10,000 rows.
- Concurrency Safety: Your checkout procedure must be immune to "Race Conditions" where two users buy the same final item.
- Hardware Awareness: You must be able to explain exactly why you chose `JSONB` over `JSON` and UUID v7 over `SERIAL`.
This project is the ultimate proof of your SQL competence. By building this system from scratch, you have proven that you can handle the scale and complexity of a modern digital economy. You have graduated from "Fetching data" to "Architecting Relevance."
9. Deliverables Checklist
- A complete `.sql` schema with UUID v7 primary keys and proper foreign key constraints.
- A stored procedure for atomic checkout with `FOR UPDATE` locking and `ROLLBACK` logic.
- A "Growth Report" SQL script utilizing LTV, retention cohorts, and moving averages.
- An `EXPLAIN (ANALYZE, BUFFERS)` log showing sub-1ms performance on indexed queries.
- A Materialized View implementation for heavy aggregate caching.
Part of the SQL Mastery Course — project-based excellence.
