
C Hash Tables & Collision Resolution: Build an O(1) Key-Value Store

TopicTrick Team



The Hash Table Concept: Trading Space for Speed

An array gives you O(1) access by numeric index. A hash table extends this to O(1) access by arbitrary key (string, integer, struct) by using a hash function to convert the key to a numeric index:

[Diagram: key → hash function → numeric index → array slot]

The price: extra memory (the array must hold more slots than there are items, typically 1.33-2× as many) and added complexity (collision handling). For most workloads, this is an excellent trade.
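The core mechanism fits in a few lines of C: hash the key, then reduce the hash to a valid slot with a modulo. A deliberately naive sketch — the multiply-by-31 hash and the names `toy_hash`/`toy_index` are placeholders, not production code (real tables use the functions discussed below):

```c
/* Toy sketch of the core mechanism: key -> hash -> array index.
   The hash here (h * 31 + byte) is a placeholder, not production-grade. */
#define CAPACITY 16

static unsigned toy_hash(const char *key) {
    unsigned h = 0;
    for (; *key; key++)
        h = h * 31 + (unsigned char)*key;
    return h;
}

unsigned toy_index(const char *key) {
    return toy_hash(key) % CAPACITY;   /* reduce the hash to a valid slot */
}
```

Everything else in this article — hash function choice, collision resolution, resizing — exists to make that two-step mapping fast and robust.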

Applications in production systems:

  • Redis: All key-value storage uses hash tables; the server-side dict is a resizable hash table.
  • Python dict: Open addressing with random probing; resized at 66% load.
  • Linux kernel: Symbol tables, filesystem dentry caches, network packet routing tables.
  • Compilers: Symbol tables for variable and function name resolution.
  • Databases: Hash joins, hash indexes for equality queries.

Hash Functions: djb2, FNV-1a, and SipHash

A hash function converts an arbitrary byte sequence into a fixed-size integer. The ideal properties: fast, uniform distribution (minimizes collisions), and avalanche effect (small input changes produce wildly different outputs).

djb2 (Daniel J. Bernstein)

Simple, fast, historically popular for general string hashing:

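The classic rendering (the shift-add form `(hash << 5) + hash` computes `hash * 33` without a multiply):

```c
/* djb2: start from the magic seed 5381, then hash = hash * 33 + byte. */
unsigned long djb2(const char *str) {
    unsigned long hash = 5381;
    int c;
    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;   /* hash * 33 + c */
    return hash;
}
```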

Pros: Extremely fast (few operations per byte). Cons: Known weaknesses against adversarial inputs (hash flooding attacks).

FNV-1a (Fowler-Noll-Vo)

Excellent distribution from a simple XOR-then-multiply loop; preferred for many embedded and systems use cases:

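A 32-bit FNV-1a sketch. Taking an explicit length (rather than stopping at a NUL terminator) is what makes it work on binary keys:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a (32-bit): XOR the byte in first, then multiply by the FNV prime. */
uint32_t fnv1a(const void *data, size_t len) {
    const unsigned char *p = data;
    uint32_t hash = 2166136261u;           /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];                      /* XOR first (the "1a" order) */
        hash *= 16777619u;                 /* then multiply by the FNV prime */
    }
    return hash;
}
```

The XOR-before-multiply order is what distinguishes FNV-1a from the older FNV-1 and gives it better avalanche behavior on the low bits.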

FNV-1a works on arbitrary byte sequences (not just null-terminated strings), making it useful for binary keys.

SipHash (Production Use)

When hash tables accept user-supplied keys (e.g., Python dict, Rust HashMap), a keyed cryptographic hash like SipHash-1-3 provides protection against hash-flooding DoS attacks, where an attacker crafts inputs that all hash to the same bucket:

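A compact SipHash-1-3 sketch (1 compression round, 3 finalization rounds). It assumes a little-endian host for the `memcpy` load; `k0`/`k1` form the 128-bit secret seed — without knowing them, an attacker cannot predict which bucket a key lands in:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND do {                                              \
    v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
    v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                      \
    v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                      \
    v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

uint64_t siphash13(const uint8_t *in, size_t len, uint64_t k0, uint64_t k1) {
    uint64_t v0 = 0x736f6d6570736575ULL ^ k0;   /* "somepseu" */
    uint64_t v1 = 0x646f72616e646f6dULL ^ k1;   /* "dorandom" */
    uint64_t v2 = 0x6c7967656e657261ULL ^ k0;   /* "lygenera" */
    uint64_t v3 = 0x7465646279746573ULL ^ k1;   /* "tedbytes" */
    const uint8_t *end = in + len - (len % 8);

    for (; in != end; in += 8) {          /* full 8-byte blocks */
        uint64_t m;
        memcpy(&m, in, 8);                /* assumes little-endian */
        v3 ^= m; SIPROUND; v0 ^= m;
    }
    uint64_t b = (uint64_t)len << 56;     /* final block: length byte + tail */
    for (size_t i = 0; i < len % 8; i++)
        b |= (uint64_t)in[i] << (8 * i);
    v3 ^= b; SIPROUND; v0 ^= b;

    v2 ^= 0xff;                           /* finalization: 3 rounds */
    SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}
```

In production, prefer the reference implementation and generate the seed from the OS random source at process startup.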

Guideline: For internal, trusted data → djb2 or FNV-1a. For user-supplied network data → SipHash.


Chaining: Collision Resolution with Linked Lists

In chaining, each slot in the hash table array contains a pointer to a linked list. Colliding entries are added to the list at that slot:

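A minimal chaining sketch with a fixed bucket count (the names `Table`, `table_put`, and `table_get` are illustrative, not a standard API):

```c
#include <stdlib.h>
#include <string.h>

/* Chaining: each bucket heads a singly linked list; colliding
   entries are simply prepended to that bucket's chain. */
#define NBUCKETS 16

typedef struct Entry {
    char *key;
    int value;
    struct Entry *next;                    /* next entry in this bucket */
} Entry;

typedef struct { Entry *buckets[NBUCKETS]; } Table;

static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

void table_put(Table *t, const char *key, int value) {
    unsigned long i = djb2(key) % NBUCKETS;
    for (Entry *e = t->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0) { e->value = value; return; } /* update */
    Entry *e = malloc(sizeof *e);          /* new entry: prepend to chain */
    e->key = malloc(strlen(key) + 1);
    strcpy(e->key, key);
    e->value = value;
    e->next = t->buckets[i];
    t->buckets[i] = e;
}

int table_get(const Table *t, const char *key, int *out) {
    for (Entry *e = t->buckets[djb2(key) % NBUCKETS]; e; e = e->next)
        if (strcmp(e->key, key) == 0) { *out = e->value; return 1; }
    return 0;                              /* key not present */
}
```

Chaining is simple and tolerates load factors above 1.0, but every chain hop is a pointer dereference into scattered heap memory, which is hard on the cache.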

Open Addressing: Linear and Quadratic Probing

Open addressing stores all entries in the array itself — no linked lists. When a collision occurs, the algorithm probes for an alternative slot:

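A linear-probing sketch with a fixed capacity (a real table would rehash before filling; the `probe_put`/`probe_get` names are illustrative):

```c
#include <string.h>

/* Linear probing: all entries live in the array itself. On a collision,
   step to the next slot (wrapping around) until an empty slot or the
   matching key is found. */
#define CAP 16

typedef struct { const char *key; int value; } Slot;
static Slot slots[CAP];                 /* key == NULL marks an empty slot */

static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

int probe_put(const char *key, int value) {
    unsigned long i = djb2(key) % CAP;
    for (int n = 0; n < CAP; n++, i = (i + 1) % CAP) {
        if (!slots[i].key || strcmp(slots[i].key, key) == 0) {
            slots[i].key = key;         /* stores the caller's pointer */
            slots[i].value = value;
            return 1;
        }
    }
    return 0;                           /* table full */
}

int probe_get(const char *key, int *out) {
    unsigned long i = djb2(key) % CAP;
    for (int n = 0; n < CAP && slots[i].key; n++, i = (i + 1) % CAP)
        if (strcmp(slots[i].key, key) == 0) { *out = slots[i].value; return 1; }
    return 0;                           /* hit an empty slot: never inserted */
}
```

Note that `probe_get` stops at the first empty slot: an empty slot proves the key was never inserted, because insertion would have placed it there or earlier in the probe sequence.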

Linear probing suffers from primary clustering: runs of filled slots grow and merge, producing long probe sequences. Quadratic probing (step = i²) and double hashing (step size derived from a second, independent hash function) reduce clustering at the cost of complexity.


Robin Hood Hashing: Variance Minimization

Robin Hood hashing is an elegant open addressing variant, popularized in production by Rust's standard HashMap (before its later move to a SwissTable-style design) and adopted by many modern implementations. When inserting a new key, if the existing key at the probe position is richer (closer to its ideal slot) than the inserting key, the two swap and insertion continues with the displaced entry (take from the rich, give to the poor):

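A fixed-capacity sketch: each slot records its probe distance (`dist`), how far the entry landed from its ideal bucket. The names `rh_put`/`rh_get` are illustrative:

```c
#include <string.h>

/* Robin Hood probing: on insert, if the resident entry is "richer"
   (smaller probe distance) than the entry in hand, swap them and keep
   probing with the evicted entry. */
#define CAP 16

typedef struct { const char *key; int value; int dist; } RhSlot;
static RhSlot rh[CAP];                  /* key == NULL marks an empty slot */

static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

void rh_put(const char *key, int value) {
    unsigned long i = djb2(key) % CAP;
    const char *k = key; int v = value, d = 0;
    for (int n = 0; n < CAP; n++, i = (i + 1) % CAP, d++) {
        if (!rh[i].key) {                       /* empty: place entry here */
            rh[i].key = k; rh[i].value = v; rh[i].dist = d;
            return;
        }
        if (strcmp(rh[i].key, k) == 0) {        /* existing key: update */
            rh[i].value = v;
            return;
        }
        if (rh[i].dist < d) {                   /* resident is richer: swap */
            RhSlot tmp = rh[i];
            rh[i].key = k; rh[i].value = v; rh[i].dist = d;
            k = tmp.key; v = tmp.value; d = tmp.dist;
        }
    }
    /* table full: a real implementation rehashes instead of dropping */
}

int rh_get(const char *key, int *out) {
    unsigned long i = djb2(key) % CAP;
    for (int d = 0; d < CAP && rh[i].key; d++, i = (i + 1) % CAP) {
        if (strcmp(rh[i].key, key) == 0) { *out = rh[i].value; return 1; }
        if (rh[i].dist < d) return 0;   /* invariant: key would sit earlier */
    }
    return 0;
}
```

The payoff is in `rh_get`: the invariant lets lookups bail out early on missing keys, and probe-length variance stays low even at high load factors.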

This gives Robin Hood tables excellent practical performance; variants ship in production libraries, language runtimes, and game engines.


Load Factor and Rehashing

The load factor α = count / capacity. As the table fills:

  • At α = 0.5: average chain length 0.5; a successful lookup averages about 1 + α/2 ≈ 1.25 comparisons.
  • At α = 0.75 (a common rehash threshold): still only a few comparisons on average.
  • At α = 1.0: an open-addressing table is completely full and probes degrade toward O(n); with chaining the table keeps working, but chains grow linearly.

When α exceeds a threshold (typically 0.75), the table must rehash: allocate a new, larger array (usually 2× the old capacity), then re-insert every existing entry:

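A growth sketch on a linear-probing table. For brevity, `table_put` assumes distinct keys (a real implementation checks for an existing key before bumping `count`); the `place`/`table_new` helpers are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Rehash: once count/capacity crosses MAX_LOAD, allocate an array twice
   the size and re-insert every live entry (hashes map to new slots). */
#define MAX_LOAD 0.75

typedef struct { const char *key; int value; } Slot;
typedef struct { Slot *slots; size_t capacity, count; } Table;

static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void place(Slot *slots, size_t cap, const char *key, int value) {
    size_t i = djb2(key) % cap;
    while (slots[i].key)                       /* linear probe to free slot */
        i = (i + 1) % cap;
    slots[i].key = key;
    slots[i].value = value;
}

Table table_new(size_t capacity) {
    Table t = { calloc(capacity, sizeof(Slot)), capacity, 0 };
    return t;
}

void table_put(Table *t, const char *key, int value) {
    if ((double)(t->count + 1) / t->capacity > MAX_LOAD) {
        size_t newcap = t->capacity * 2;           /* grow 2x */
        Slot *ns = calloc(newcap, sizeof *ns);
        for (size_t i = 0; i < t->capacity; i++)   /* O(n) re-insert */
            if (t->slots[i].key)
                place(ns, newcap, t->slots[i].key, t->slots[i].value);
        free(t->slots);
        t->slots = ns;
        t->capacity = newcap;
    }
    place(t->slots, t->capacity, key, value);
    t->count++;
}

int table_get(const Table *t, const char *key, int *out) {
    size_t i = djb2(key) % t->capacity;
    while (t->slots[i].key) {
        if (strcmp(t->slots[i].key, key) == 0) { *out = t->slots[i].value; return 1; }
        i = (i + 1) % t->capacity;
    }
    return 0;
}
```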

Rehashing is an O(n) operation, but it happens rarely enough that amortized insertion remains O(1).


Performance Analysis: When Hash Tables Fail

Hash tables are not universally fast. They degrade in several scenarios:

  • Hash flooding attack: an attacker crafts keys that all hash to the same bucket. Solution: SipHash with a secret seed.
  • Poor hash function: keys cluster in a subset of buckets. Solution: use FNV-1a or SipHash.
  • Very high load: too many items for the table size. Solution: trigger a rehash at α = 0.75.
  • Cache thrashing: chaining dereferences scattered heap pointers. Solution: open addressing (more cache-friendly).
  • Key ordering required: hash tables are unordered. Solution: use a balanced BST (e.g., a red-black tree) instead.

Frequently Asked Questions

How big should the initial table size be? Use a power of 2 for bitwise modulo (hash & (capacity - 1) instead of hash % capacity — much faster). A prime number for non-power-of-2 tables reduces certain clustering patterns. Start with 16 or 64 and let rehashing handle growth.
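The bitmask trick in isolation (valid only when capacity is a power of two; `slot_of` is an illustrative name):

```c
#include <stdint.h>
#include <stddef.h>

/* When capacity is a power of 2, hash % capacity == hash & (capacity - 1):
   the mask keeps only the low-order bits, replacing an integer division
   with a single AND instruction. */
size_t slot_of(uint64_t hash, size_t capacity) {
    return (size_t)(hash & (capacity - 1));   /* capacity must be 2^k */
}
```

One caveat: masking discards the hash's high bits, so a power-of-2 table leans harder on the hash function's low-bit quality — another reason FNV-1a and SipHash beat ad-hoc hashes.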

What is a "Tombstone" in open addressing? When you delete a key in open addressing, you can't simply empty the slot — it would break probe chains for other keys that passed through this slot during insertion. Instead, mark it with a tombstone sentinel value. Tombstones are skipped during lookup but counted during insert. Tables with many deletions benefit from periodic compaction.

Why does Python's dict maintain insertion order? CPython 3.7+ uses a combined compact dict with two arrays: a sparse indices array (the hash table proper) and a dense array of key-value pairs in insertion order. This is a more memory-efficient design that naturally preserves insertion order.

When is a hash table worse than a sorted array with binary search? Binary search on a sorted array is O(log n) but has perfect cache locality — every comparison reads adjacent memory. For small tables (< 32 elements), binary search often outperforms hash lookup due to cache efficiency. For large, frequently queried tables, hash lookup consistently wins.


Key Takeaway

Hash tables represent the pinnacle of lookup efficiency: the fundamental data structure that transforms a simple array into an associative container. By mastering hash functions, collision resolution, and load factor management, you gain the ability to build the same kinds of key-value engines that power Redis, the CPython interpreter, and the Linux kernel's dentry cache.

This is not an abstract exercise — every major system software stack uses custom hash table implementations tuned for its specific access patterns. Writing your own from scratch is the only way to truly understand the trade-offs.

Read next: Function Pointers & Callbacks: Event-Driven C →


Part of the C Mastery Course — 30 modules from fundamentals to expert-level systems engineering.