How big should the initial hash table size be?

Use a power of 2 for bitwise modulo (hash & (capacity - 1) is faster than hash % capacity). Start with 16 or 64 buckets for most use cases. Keep the load factor (elements / buckets) below 0.7 - resize (double capacity and rehash) when it exceeds this threshold. A prime number for the initial size can reduce clustering for some hash functions but is rarely worth the modulo overhead.

What is the difference between chaining and open addressing for collision resolution?

Chaining stores a linked list at each bucket - collisions add to the list. It is simple, performs consistently up to high load factors, but uses extra memory for pointers and is cache-unfriendly. Open addressing probes for the next empty slot in the array - linear probing, quadratic probing, or double hashing. Open addressing is more cache-friendly and memory-efficient but degrades severely above 0.7 load factor. Most production hash tables use open addressing.

How do I choose a good hash function in C?

For strings, FNV-1a and djb2 are simple and effective. For integer keys, multiply by a large prime (Knuth: key * 2654435761). For security-sensitive use (hash tables exposed to user input), use SipHash to prevent hash flooding attacks - an attacker who can choose keys can degrade a naive hash table to O(n) per operation. The C standard library provides no hash function; include one explicitly.

What is a hash flooding attack and how do I prevent it?

A hash flooding attack provides input designed to hash to the same bucket, degrading the hash table to O(n) per operation and causing a denial-of-service. Prevent it by using a cryptographically keyed hash function (SipHash with a random per-process key) or by randomising the hash seed at startup. This is important for hash tables that index user-supplied data - web server request parameters, language runtime symbol tables.

C Hash Tables & Collision Resolution: Build an O(1) Key-Value Store

← Back to C Mastery

C Hash Tables & Collision Resolution: Build an O(1) Key-Value Store

The Hash Table Concept: Trading Space for Speed
Hash Functions: djb2, FNV-1a, and SipHash
Chaining: Collision Resolution with Linked Lists
Open Addressing: Linear and Quadratic Probing
Robin Hood Hashing: Variance Minimization
Load Factor and Rehashing
Complete Hash Map Implementation
Real-World Use Cases
Performance Analysis: When Hash Tables Fail
Frequently Asked Questions
Key Takeaway

The Hash Table Concept: Trading Space for Speed

An array gives you O(1) access by numeric index. A hash table extends this to O(1) access by arbitrary key (string, integer, struct) by using a hash function to convert the key to a numeric index:

The price: extra memory (the array must be larger than the number of items, typically 1.33-2x the load) and complexity (collision handling). For most workloads, this is an excellent trade.

Applications in production systems:

Redis: All key-value storage uses hash tables; the server-side dict is a resizable hash table.
Python dict: Open addressing with random probing; resized at 66% load.
Linux kernel: Symbol tables, filesystem dentry caches, network packet routing tables.
Compilers: Symbol tables for variable and function name resolution.
Databases: Hash joins, hash indexes for equality queries.

Hash Functions: djb2, FNV-1a, and SipHash

A hash function converts an arbitrary byte sequence into a fixed-size integer. The ideal properties: fast, uniform distribution (minimizes collisions), and avalanche effect (small input changes produce wildly different outputs).

djb2 (Daniel J. Bernstein)

Simple, fast, historically popular for general string hashing:

#include <stdint.h>

uint64_t hash_djb2(const char *str) {
    uint64_t hash = 5381;
    int c;
    while ((c = (unsigned char)*str++)) {
        hash = ((hash << 5) + hash) + c; // hash * 33 + c
    }
    return hash;
}

Pros: Extremely fast (few operations per byte). Cons: Known weaknesses against adversarial inputs (hash flooding attacks).

FNV-1a (Fowler-Noll-Vo)

Excellent distribution with a different formula - preferred for many embedded and systems use cases:

#define FNV_OFFSET_BASIS 14695981039346656037ULL
#define FNV_PRIME        1099511628211ULL

uint64_t hash_fnv1a(const void *data, size_t len) {
    uint64_t hash = FNV_OFFSET_BASIS;
    const uint8_t *bytes = (const uint8_t*)data;
    for (size_t i = 0; i < len; i++) {
        hash ^= bytes[i];
        hash *= FNV_PRIME;
    }
    return hash;
}

FNV-1a works on arbitrary byte sequences (not just null-terminated strings), making it useful for binary keys.

SipHash (Production Use)

When hash tables accept user-supplied keys (e.g., Python dict, Rust HashMap), a cryptographic hash like SipHash-1-3 provides protection against hash-flooding DoS attacks where an attacker crafts inputs that all hash to the same bucket:

// SipHash is more complex (uses a keyed block cipher internally)
// Python uses SipHash-1-3 by default since Python 3.4
// Use the reference implementation: https://github.com/veorq/SipHash

Guideline: For internal, trusted data -> djb2 or FNV-1a. For user-supplied network data -> SipHash.

Chaining: Collision Resolution with Linked Lists

In chaining, each slot in the hash table array contains a pointer to a linked list. Colliding entries are added to the list at that slot:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define HT_INITIAL_SIZE 64

typedef struct Entry {
    char        *key;
    char        *value;
    struct Entry *next; // Linked list chain for collision resolution
} Entry;

typedef struct {
    Entry  **buckets;    // Array of pointers to entry chains
    size_t   capacity;   // Number of buckets
    size_t   count;      // Number of stored items
} HashMap;

// Create a new hash map
HashMap *hm_create(void) {
    HashMap *hm = malloc(sizeof(HashMap));
    if (!hm) return NULL;
    hm->capacity = HT_INITIAL_SIZE;
    hm->count    = 0;
    hm->buckets  = calloc(hm->capacity, sizeof(Entry*)); // NULL-init all buckets
    if (!hm->buckets) { free(hm); return NULL; }
    return hm;
}

// Hash a string key to a bucket index
static size_t hash_to_index(const char *key, size_t capacity) {
    uint64_t hash = 5381;
    while (*key) {
        hash = ((hash << 5) + hash) + (unsigned char)*key++;
    }
    return (size_t)(hash % capacity);
}

// Insert or update a key-value pair: O(1) average
int hm_set(HashMap *hm, const char *key, const char *value) {
    size_t idx = hash_to_index(key, hm->capacity);
    
    // Search for existing key in chain (update if found)
    for (Entry *e = hm->buckets[idx]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            char *new_val = strdup(value);
            if (!new_val) return -1;
            free(e->value);
            e->value = new_val;
            return 0;
        }
    }
    
    // Key not found - prepend new entry to chain
    Entry *entry = malloc(sizeof(Entry));
    if (!entry) return -1;
    entry->key   = strdup(key);
    entry->value = strdup(value);
    if (!entry->key || !entry->value) {
        free(entry->key); free(entry->value); free(entry);
        return -1;
    }
    entry->next       = hm->buckets[idx]; // Prepend to chain
    hm->buckets[idx]  = entry;
    hm->count++;
    return 0;
}

// Retrieve a value by key: O(1) average, O(n) worst case
const char *hm_get(const HashMap *hm, const char *key) {
    size_t idx = hash_to_index(key, hm->capacity);
    for (Entry *e = hm->buckets[idx]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) return e->value;
    }
    return NULL; // Not found
}

// Delete a key: O(1) average
bool hm_delete(HashMap *hm, const char *key) {
    size_t idx = hash_to_index(key, hm->capacity);
    Entry **current = &hm->buckets[idx];
    
    while (*current) {
        Entry *entry = *current;
        if (strcmp(entry->key, key) == 0) {
            *current = entry->next;  // Unlink from chain
            free(entry->key);
            free(entry->value);
            free(entry);
            hm->count--;
            return true;
        }
        current = &entry->next;
    }
    return false;
}

// Free the entire hash map
void hm_destroy(HashMap *hm) {
    for (size_t i = 0; i < hm->capacity; i++) {
        Entry *e = hm->buckets[i];
        while (e) {
            Entry *next = e->next;
            free(e->key);
            free(e->value);
            free(e);
            e = next;
        }
    }
    free(hm->buckets);
    free(hm);
}

int main(void) {
    HashMap *hm = hm_create();
    
    hm_set(hm, "name",     "Alice");
    hm_set(hm, "language", "C");
    hm_set(hm, "role",     "Systems Engineer");
    
    printf("name:     %s\n", hm_get(hm, "name"));     // Alice
    printf("language: %s\n", hm_get(hm, "language")); // C
    printf("missing:  %s\n", hm_get(hm, "missing") ?: "(null)");
    
    hm_delete(hm, "role");
    hm_destroy(hm);
    return 0;
}

Open Addressing: Linear and Quadratic Probing

Open addressing stores all entries in the array itself - no linked lists. When a collision occurs, the algorithm probes for an alternative slot:

#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define OA_SIZE 128
#define DELETED_SENTINEL ((char*)-1) // Tombstone marker

typedef struct {
    char *key;
    char *value;
} OAEntry;

typedef struct {
    OAEntry slots[OA_SIZE];
    size_t  count;
} OAHashMap;

// Linear probing: if slot i is taken, try i+1, i+2, ...
size_t linear_probe(const OAHashMap *hm, const char *key) {
    uint64_t hash = 5381;
    while (*key) hash = ((hash << 5) + hash) + (unsigned char)*key++;
    size_t idx = (size_t)(hash % OA_SIZE);
    
    while (hm->slots[idx].key != NULL &&
           hm->slots[idx].key != DELETED_SENTINEL &&
           strcmp(hm->slots[idx].key, key) != 0) {
        idx = (idx + 1) % OA_SIZE; // Linear step
    }
    return idx;
}

Linear probing suffers from primary clustering - filled slots cluster together, causing long probe sequences. Quadratic probing (step = i²) and double hashing (step = prime hash) reduce clustering at the cost of complexity.

Robin Hood Hashing: Variance Minimization

Robin Hood hashing is an elegant open addressing variant used by Rust's HashMap and many modern hash table implementations. When inserting a new key, if the existing key at the probe position is richer (closer to its ideal slot) than the inserting key, they swap positions (take from the rich, give to the poor):

// Robin Hood property: each key has a "probe length" = actual_index - ideal_index
// On collision: the key with the longer probe length "steals" the slot
// Result: maximum probe length is O(log n) instead of O(n)
// Advantage: lookup time variance is dramatically reduced - better worst-case

This gives Robin Hood tables excellent practical performance - used in production by Rust STD, Boost (C++), and many game engines.

Load Factor and Rehashing

The load factor Î± = count / capacity. As the table fills:

At Î± = 0.5: Average chaining list = 0.5 elements, lookup ≈ 1.5 comparisons.
At Î± = 0.75 (Python dict): Average ≈ few comparisons, still fast.
At Î± = 1.0: Hash table is full - performance degrades to O(n) in worst case.

When Î± exceeds a threshold (typically 0.75), the table must rehash: allocate a new, larger array (usually 2x the old capacity), then re-insert every existing entry:

int hm_rehash(HashMap *hm) {
    size_t  new_capacity = hm->capacity * 2;
    Entry **new_buckets  = calloc(new_capacity, sizeof(Entry*));
    if (!new_buckets) return -1;
    
    for (size_t i = 0; i < hm->capacity; i++) {
        Entry *entry = hm->buckets[i];
        while (entry) {
            Entry *next  = entry->next;
            size_t new_i = hash_to_index(entry->key, new_capacity);
            entry->next     = new_buckets[new_i];
            new_buckets[new_i] = entry;
            entry = next;
        }
    }
    
    free(hm->buckets);
    hm->buckets  = new_buckets;
    hm->capacity = new_capacity;
    return 0;
}

// In hm_set, check load before inserting:
if ((double)hm->count / (double)hm->capacity >= 0.75) {
    if (hm_rehash(hm) != 0) return -1;
}

Rehashing is an O(n) operation, but it happens rarely enough that amortized insertion remains O(1).

Performance Analysis: When Hash Tables Fail

Hash tables are not universally fast. They degrade in several scenarios:

Problem	Cause	Solution
Hash flooding attack	Attacker crafts keys that all hash to the same bucket	Use SipHash with secret seed
Poor hash function	Keys cluster in a subset of buckets	Use FNV-1a or SipHash
Very high load	Too many items for table size	Trigger rehash at Î± = 0.75
Cache thrashing	Chaining dereferences scattered heap pointers	Use open addressing (more cache-friendly)
Key ordering required	Hash tables are unordered	Use a balanced BST (red-black tree) instead

Frequently Asked Questions

How big should the initial table size be? Use a power of 2 for bitwise modulo (hash & (capacity - 1) instead of hash % capacity - much faster). A prime number for non-power-of-2 tables reduces certain clustering patterns. Start with 16 or 64 and let rehashing handle growth.

What is a "Tombstone" in open addressing? When you delete a key in open addressing, you can't simply empty the slot - it would break probe chains for other keys that passed through this slot during insertion. Instead, mark it with a tombstone sentinel value. Tombstones are skipped during lookup but counted during insert. Tables with many deletions benefit from periodic compaction.

Why does Python's dict maintain insertion order? CPython 3.7+ uses a combined compact dict with two arrays: a sparse indices array (the hash table proper) and a dense array of key-value pairs in insertion order. This is a more memory-efficient design that naturally preserves insertion order.

When is a hash table worse than a sorted array with binary search? Binary search on a sorted array is O(log n) but has perfect cache locality - every comparison reads adjacent memory. For small tables (< 32 elements), binary search often outperforms hash lookup due to cache efficiency. For large, frequently queried tables, hash lookup consistently wins.

Key Takeaway

Hash Tables represent the pinnacle of Lookup Efficiency - the fundamental data structure that transforms a simple array into an associative container. By mastering hash functions, collision resolution, and load factor management, you gain the ability to build the same kinds of key-value engines that power Redis, the CPython interpreter, and the Linux kernel's dentry cache.

This is not an abstract exercise - every major system software stack uses custom hash table implementations tuned for its specific access patterns. Writing your own from scratch is the only way to truly understand the trade-offs.

Part of the C Mastery Course - 30 modules from fundamentals to expert-level systems engineering.

C Hash Tables & Collision Resolution: Build an O(1) Key-Value Store

Table of Contents

The Hash Table Concept: Trading Space for Speed

Hash Functions: djb2, FNV-1a, and SipHash

djb2 (Daniel J. Bernstein)

FNV-1a (Fowler-Noll-Vo)

SipHash (Production Use)

Chaining: Collision Resolution with Linked Lists

Open Addressing: Linear and Quadratic Probing

Robin Hood Hashing: Variance Minimization

Load Factor and Rehashing

Performance Analysis: When Hash Tables Fail

Frequently Asked Questions

Key Takeaway