C Hash Tables & Collision Resolution: Build an O(1) Key-Value Store

Table of Contents
- The Hash Table Concept: Trading Space for Speed
- Hash Functions: djb2, FNV-1a, and SipHash
- Chaining: Collision Resolution with Linked Lists
- Open Addressing: Linear and Quadratic Probing
- Robin Hood Hashing: Variance Minimization
- Load Factor and Rehashing
- Complete Hash Map Implementation
- Real-World Use Cases
- Performance Analysis: When Hash Tables Fail
- Frequently Asked Questions
- Key Takeaway
The Hash Table Concept: Trading Space for Speed
An array gives you O(1) access by numeric index. A hash table extends this to O(1) access by arbitrary key (string, integer, struct) by using a hash function to convert the key to a numeric index:
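The core mechanism can be sketched in a few lines. The byte-sum hash here is a deliberately naive stand-in for the real hash functions covered below; it has terrible distribution and is for illustration only:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy hash for illustration only: sums the bytes of the key.
   A real table would use djb2, FNV-1a, or SipHash. */
static uint64_t toy_hash(const char *key) {
    uint64_t h = 0;
    while (*key)
        h += (unsigned char)*key++;
    return h;
}

/* Map any 64-bit hash into a valid array index in [0, capacity). */
static size_t bucket_index(const char *key, size_t capacity) {
    return (size_t)(toy_hash(key) % capacity);
}
```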
The price: extra memory (the array must hold more slots than stored entries, typically 1.33-2× as many) and complexity (collision handling). For most workloads, this is an excellent trade.
Applications in production systems:
- Redis: All key-value storage uses hash tables; the server-side dict is a resizable hash table.
- Python dict: Open addressing with perturbation-based probing; resized at about 2/3 (66%) load.
- Linux kernel: Symbol tables, filesystem dentry caches, network packet routing tables.
- Compilers: Symbol tables for variable and function name resolution.
- Databases: Hash joins, hash indexes for equality queries.
Hash Functions: djb2, FNV-1a, and SipHash
A hash function converts an arbitrary byte sequence into a fixed-size integer. The ideal properties: fast, uniform distribution (minimizes collisions), and avalanche effect (small input changes produce wildly different outputs).
djb2 (Daniel J. Bernstein)
Simple, fast, historically popular for general string hashing:
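A minimal djb2 sketch (shown with a fixed `uint64_t` width for determinism; the classic version uses `unsigned long`):

```c
#include <stdint.h>

/* djb2: start at 5381, then hash = hash * 33 + c for each byte.
   The multiply-by-33 is implemented as a shift-and-add. */
uint64_t djb2(const char *s) {
    uint64_t hash = 5381;
    unsigned char c;
    while ((c = (unsigned char)*s++) != '\0')
        hash = ((hash << 5) + hash) + c;   /* hash * 33 + c */
    return hash;
}
```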
Pros: Extremely fast (few operations per byte). Cons: Known weaknesses against adversarial inputs (hash flooding attacks).
FNV-1a (Fowler-Noll-Vo)
Excellent distribution with a different formula — preferred for many embedded and systems use cases:
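A sketch of the 64-bit variant; the two constants are the standard FNV-64 offset basis and prime:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
   The XOR-before-multiply order is what distinguishes FNV-1a from FNV-1. */
uint64_t fnv1a(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-64 offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV-64 prime */
    }
    return h;
}
```

Because it takes an explicit length rather than scanning for `'\0'`, it hashes binary keys (structs, integers, buffers) as easily as strings.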
FNV-1a works on arbitrary byte sequences (not just null-terminated strings), making it useful for binary keys.
SipHash (Production Use)
When hash tables accept user-supplied keys (e.g., Python dict, Rust HashMap), a keyed hash like SipHash-1-3 provides protection against hash-flooding DoS attacks, where an attacker crafts inputs that all hash to the same bucket:
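The text names SipHash-1-3; the sketch below implements the original SipHash-2-4 construction (2 compression rounds per block, 4 finalization rounds) and assumes a little-endian host. The function name and seed parameters are illustrative:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ROTL64(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND                                                        \
    do {                                                                \
        v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32);   \
        v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                        \
        v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                        \
        v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32);   \
    } while (0)

/* SipHash-2-4 keyed by (k0, k1): the secret 128-bit seed is what defeats
   flooding -- an attacker cannot predict which bucket a key lands in. */
uint64_t siphash24(const void *data, size_t len, uint64_t k0, uint64_t k1) {
    const uint8_t *in = data;
    uint64_t v0 = 0x736f6d6570736575ULL ^ k0;
    uint64_t v1 = 0x646f72616e646f6dULL ^ k1;
    uint64_t v2 = 0x6c7967656e657261ULL ^ k0;
    uint64_t v3 = 0x7465646279746573ULL ^ k1;
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {           /* whole 8-byte blocks */
        uint64_t m;
        memcpy(&m, in + i, 8);               /* assumes little-endian host */
        v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
    }

    uint64_t last = (uint64_t)len << 56;     /* final block carries the length */
    for (size_t j = 0; i + j < len; j++)
        last |= (uint64_t)in[i + j] << (8 * j);
    v3 ^= last; SIPROUND; SIPROUND; v0 ^= last;

    v2 ^= 0xff;                              /* finalization */
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}
```

In practice, a table would generate `(k0, k1)` once at startup from a secure random source and keep it secret for its lifetime.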
Guideline: For internal, trusted data → djb2 or FNV-1a. For user-supplied network data → SipHash.
Chaining: Collision Resolution with Linked Lists
In chaining, each slot in the hash table array contains a pointer to a linked list. Colliding entries are added to the list at that slot:
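A minimal chaining sketch with a fixed bucket count and djb2-style hashing; the names (`ChainMap`, `chain_put`, `chain_get`) are illustrative:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 16

typedef struct Entry {
    char *key;
    int value;
    struct Entry *next;              /* next entry in this bucket's chain */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];        /* each slot heads a linked list */
} ChainMap;

static unsigned long chain_hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void chain_put(ChainMap *m, const char *key, int value) {
    unsigned long i = chain_hash(key) % NBUCKETS;
    for (Entry *e = m->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0) {          /* key exists: update in place */
            e->value = value;
            return;
        }
    Entry *e = malloc(sizeof *e);                /* collision or empty: new head */
    e->key = malloc(strlen(key) + 1);
    strcpy(e->key, key);
    e->value = value;
    e->next = m->buckets[i];
    m->buckets[i] = e;
}

int chain_get(const ChainMap *m, const char *key, int *out) {
    unsigned long i = chain_hash(key) % NBUCKETS;
    for (Entry *e = m->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0) {
            *out = e->value;
            return 1;
        }
    return 0;                                    /* key absent */
}
```

Chaining never "fills up": colliding keys simply lengthen a list, which is why it tolerates high load factors more gracefully than open addressing.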
Open Addressing: Linear and Quadratic Probing
Open addressing stores all entries in the array itself — no linked lists. When a collision occurs, the algorithm probes for an alternative slot:
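A linear-probing sketch with a fixed power-of-2 capacity; it assumes the table never fills completely and omits deletion and resizing:

```c
#include <stddef.h>
#include <string.h>

#define OA_CAP 16   /* power of 2; sketch assumes the table never fills */

typedef struct {
    const char *key;   /* NULL marks an empty slot */
    int value;
} OaSlot;

static unsigned long oa_hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void oa_put(OaSlot *t, const char *key, int value) {
    size_t i = oa_hash(key) & (OA_CAP - 1);
    while (t[i].key != NULL && strcmp(t[i].key, key) != 0)
        i = (i + 1) & (OA_CAP - 1);          /* linear probe: try the next slot */
    t[i].key = key;
    t[i].value = value;
}

int oa_get(const OaSlot *t, const char *key, int *out) {
    size_t i = oa_hash(key) & (OA_CAP - 1);
    while (t[i].key != NULL) {               /* an empty slot ends the probe chain */
        if (strcmp(t[i].key, key) == 0) {
            *out = t[i].value;
            return 1;
        }
        i = (i + 1) & (OA_CAP - 1);
    }
    return 0;
}
```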
Linear probing suffers from primary clustering: filled slots cluster together, causing long probe sequences. Quadratic probing (step = i²) and double hashing (step size derived from a second hash function) reduce clustering at the cost of complexity.
Robin Hood Hashing: Variance Minimization
Robin Hood hashing is an elegant open addressing variant, used by earlier versions of Rust's standard HashMap and by many modern hash table implementations. When inserting a new key, if the existing key at the probe position is richer (closer to its ideal slot) than the inserting key, they swap positions (take from the rich, give to the poor):
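A sketch of the insert path (fixed capacity, no resizing; for brevity it skips the existing-key update a full implementation would do; names are illustrative):

```c
#include <stddef.h>
#include <string.h>

#define RH_CAP 16   /* power of 2; sketch assumes the table never fills */

typedef struct {
    const char *key;   /* NULL marks an empty slot */
    int value;
} RhSlot;

static unsigned long rh_hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Probe distance: how far slot i is from this key's ideal ("home") slot. */
static size_t rh_dist(const char *key, size_t i) {
    return (i + RH_CAP - (rh_hash(key) & (RH_CAP - 1))) & (RH_CAP - 1);
}

void rh_put(RhSlot *t, const char *key, int value) {
    size_t i = rh_hash(key) & (RH_CAP - 1);
    size_t dist = 0;                         /* distance the carried entry has probed */
    for (;;) {
        if (t[i].key == NULL) {              /* empty slot: place the carried entry */
            t[i].key = key;
            t[i].value = value;
            return;
        }
        size_t existing = rh_dist(t[i].key, i);
        if (existing < dist) {               /* resident is richer: swap, carry it on */
            const char *tk = t[i].key;
            int tv = t[i].value;
            t[i].key = key;  t[i].value = value;
            key = tk;        value = tv;
            dist = existing;
        }
        i = (i + 1) & (RH_CAP - 1);
        dist++;
    }
}

int rh_get(const RhSlot *t, const char *key, int *out) {
    size_t i = rh_hash(key) & (RH_CAP - 1);
    while (t[i].key != NULL) {
        if (strcmp(t[i].key, key) == 0) { *out = t[i].value; return 1; }
        i = (i + 1) & (RH_CAP - 1);
    }
    return 0;
}
```

The swap keeps the variance of probe distances low: no single key ends up far from home while its neighbors sit in their ideal slots.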
This gives Robin Hood tables excellent practical performance; the technique has been used in production by Rust's standard library (prior to its SwissTable-based redesign), several C++ libraries, and many game engines.
Load Factor and Rehashing
The load factor α = count / capacity. As the table fills:
- At α = 0.5: the average chain holds 0.5 elements; a successful lookup costs about 1 + α/2 ≈ 1.25 comparisons.
- At α ≈ 0.67 (CPython's dict resize threshold): still only a few comparisons on average.
- At α = 1.0: the table is full; worst-case performance degrades to O(n).
When α exceeds a threshold (typically 0.75), the table must rehash: allocate a new, larger array (usually 2× the old capacity), then re-insert every existing entry:
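A sketch of resize-on-insert using linear probing (no deletion; the 0.75 threshold is checked with integer arithmetic to avoid floating point; names are illustrative):

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    const char **keys;   /* NULL marks an empty slot */
    int *values;
    size_t capacity;     /* always a power of 2 */
    size_t count;
} Map;

static unsigned long map_hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void map_init(Map *m, size_t capacity) {
    m->keys = calloc(capacity, sizeof *m->keys);
    m->values = malloc(capacity * sizeof *m->values);
    m->capacity = capacity;
    m->count = 0;
}

static void map_put_noresize(Map *m, const char *key, int value) {
    size_t i = map_hash(key) & (m->capacity - 1);
    while (m->keys[i] != NULL && strcmp(m->keys[i], key) != 0)
        i = (i + 1) & (m->capacity - 1);     /* linear probing */
    if (m->keys[i] == NULL)
        m->count++;
    m->keys[i] = key;
    m->values[i] = value;
}

static void map_rehash(Map *m) {
    Map bigger;
    map_init(&bigger, m->capacity * 2);      /* allocate at 2x the old capacity */
    for (size_t i = 0; i < m->capacity; i++)
        if (m->keys[i] != NULL)              /* re-insert every live entry; new */
            map_put_noresize(&bigger, m->keys[i], m->values[i]);  /* capacity => new slots */
    free(m->keys);
    free(m->values);
    *m = bigger;
}

void map_put(Map *m, const char *key, int value) {
    if (4 * (m->count + 1) > 3 * m->capacity)   /* would push alpha past 0.75 */
        map_rehash(m);
    map_put_noresize(m, key, value);
}

int map_get(const Map *m, const char *key, int *out) {
    size_t i = map_hash(key) & (m->capacity - 1);
    while (m->keys[i] != NULL) {
        if (strcmp(m->keys[i], key) == 0) { *out = m->values[i]; return 1; }
        i = (i + 1) & (m->capacity - 1);
    }
    return 0;
}
```

Note that entries must be re-inserted, not copied: the new capacity changes `hash & (capacity - 1)`, so most keys land in different slots.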
Rehashing is an O(n) operation, but it happens rarely enough that amortized insertion remains O(1).
Performance Analysis: When Hash Tables Fail
Hash tables are not universally fast. They degrade in several scenarios:
| Problem | Cause | Solution |
|---|---|---|
| Hash flooding attack | Attacker crafts keys that all hash to the same bucket | Use SipHash with secret seed |
| Poor hash function | Keys cluster in a subset of buckets | Use FNV-1a or SipHash |
| Very high load | Too many items for table size | Trigger rehash at α = 0.75 |
| Cache thrashing | Chaining dereferences scattered heap pointers | Use open addressing (more cache-friendly) |
| Key ordering required | Hash tables are unordered | Use a balanced BST (red-black tree) instead |
Frequently Asked Questions
How big should the initial table size be?
Use a power of 2 for bitwise modulo (hash & (capacity - 1) instead of hash % capacity — much faster). A prime number for non-power-of-2 tables reduces certain clustering patterns. Start with 16 or 64 and let rehashing handle growth.
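The power-of-2 trick in isolation:

```c
#include <stdint.h>
#include <stddef.h>

/* When capacity is a power of 2, masking with (capacity - 1) keeps exactly
   the low bits of the hash -- the same result as % capacity, computed with
   a single AND instead of a (much slower) integer division. */
size_t fast_index(uint64_t hash, size_t capacity) {
    return (size_t)(hash & (capacity - 1));
}
```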
What is a "tombstone" in open addressing?
When you delete a key in open addressing, you can't simply empty the slot — it would break probe chains for other keys that passed through this slot during insertion. Instead, mark it with a tombstone sentinel value. Tombstones are skipped during lookup but counted during insert. Tables with many deletions benefit from periodic compaction.
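A deletion sketch using a unique sentinel address as the tombstone (fixed capacity; for brevity, insertion skips the existing-key check a full implementation would do):

```c
#include <stddef.h>
#include <string.h>

#define TS_CAP 16   /* power of 2; sketch assumes the table never fills */

static const char DELETED_SENTINEL;   /* unique address, never a real key */
#define TOMBSTONE (&DELETED_SENTINEL)

typedef struct {
    const char *key;   /* NULL = never used, TOMBSTONE = deleted */
    int value;
} TSlot;

static unsigned long ts_hash(const char *s) {   /* djb2 */
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void ts_put(TSlot *t, const char *key, int value) {
    size_t i = ts_hash(key) & (TS_CAP - 1);
    while (t[i].key != NULL && t[i].key != TOMBSTONE)
        i = (i + 1) & (TS_CAP - 1);          /* tombstone slots are reusable */
    t[i].key = key;
    t[i].value = value;
}

int ts_get(const TSlot *t, const char *key, int *out) {
    size_t i = ts_hash(key) & (TS_CAP - 1);
    while (t[i].key != NULL) {               /* only a truly empty slot ends the chain */
        if (t[i].key != TOMBSTONE && strcmp(t[i].key, key) == 0) {
            *out = t[i].value;
            return 1;
        }
        i = (i + 1) & (TS_CAP - 1);          /* tombstones are probed past, not stopped at */
    }
    return 0;
}

void ts_delete(TSlot *t, const char *key) {
    size_t i = ts_hash(key) & (TS_CAP - 1);
    while (t[i].key != NULL) {
        if (t[i].key != TOMBSTONE && strcmp(t[i].key, key) == 0) {
            t[i].key = TOMBSTONE;            /* keep the probe chain intact */
            return;
        }
        i = (i + 1) & (TS_CAP - 1);
    }
}
```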
Why does Python's dict maintain insertion order?
CPython 3.7+ uses a combined compact dict with two arrays: a sparse indices array (the hash table proper) and a dense array of key-value pairs in insertion order. This is a more memory-efficient design that naturally preserves insertion order.
When is a hash table worse than a sorted array with binary search?
Binary search on a sorted array is O(log n) and works over one compact, contiguous block of memory. For small tables (< 32 elements), the entire array fits in a few cache lines, so binary search often outperforms hash lookup. For large, frequently queried tables, hash lookup consistently wins.
Key Takeaway
Hash tables represent the pinnacle of lookup efficiency: the fundamental data structure that transforms a simple array into an associative container. By mastering hash functions, collision resolution, and load factor management, you gain the ability to build the same kinds of key-value engines that power Redis, the CPython interpreter, and the Linux kernel's dentry cache.
This is not an abstract exercise — every major system software stack uses custom hash table implementations tuned for its specific access patterns. Writing your own from scratch is the only way to truly understand the trade-offs.
Read next: Function Pointers & Callbacks: Event-Driven C →
Part of the C Mastery Course — 30 modules from fundamentals to expert-level systems engineering.
