C++ SIMD & Intrinsics: AVX2, Auto-Vectorization, std::simd (C++26) & Memory Alignment

Table of Contents
- SIMD Register Widths: SSE → AVX → AVX-512
- When SIMD Pays Off: Amdahl's Law Applied
- Auto-Vectorization: Helping the Compiler
- Memory Alignment: The SIMD Prerequisite
- AVX2 Intrinsics: Manual Vectorization
- Fused Multiply-Add (FMA): Free Arithmetic
- Horizontal Operations: Reductions
- std::experimental::simd (C++26 Preview)
- Vectorizing a Dot Product: 4 Levels of Optimization
- Detecting CPU Features at Runtime
- Frequently Asked Questions
- Key Takeaway
SIMD Register Widths: SSE → AVX → AVX-512
Register      | Width   | float32 | float64 | int32 | int64
--------------|---------|---------|---------|-------|------
XMM (SSE)     | 128-bit | 4       | 2       | 4     | 2
YMM (AVX2)    | 256-bit | 8       | 4       | 8     | 4
ZMM (AVX-512) | 512-bit | 16      | 8       | 16    | 8
Theoretical maximum FLOPS per cycle (one instruction):
SSE2 : 4 float adds OR 4 float muls → 4 FLOPS/cycle
AVX2 : 8 float FMAs (add+mul fused) → 16 FLOPS/cycle
AVX-512: 16 float FMAs → 32 FLOPS/cycle
// Scalar (1 float per instruction):
float a = 1.0f, b = 2.0f;
float c = a + b; // 1 operation
// SSE2 (4 floats per instruction):
#include <xmmintrin.h>
__m128 va = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // register holds {1,2,3,4} (set_ps lists elements high to low)
__m128 vb = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f); // {5,6,7,8}
__m128 vc = _mm_add_ps(va, vb); // {6,8,10,12}: 4 additions simultaneously!
// AVX2 (8 floats per instruction):
#include <immintrin.h>
__m256 va8 = _mm256_loadu_ps(array_a); // Load 8 floats from array_a
__m256 vb8 = _mm256_loadu_ps(array_b); // Load 8 floats from array_b
__m256 vc8 = _mm256_add_ps(va8, vb8); // 8 additions simultaneously!
When SIMD Pays Off: Amdahl's Law Applied
SIMD is optimal when:
- Data parallelism: same operation on many independent elements
- Contiguous memory: data is sequential (arrays, vectors — not linked lists)
- Hot loop: function called millions of times or operates on large arrays
- Compute-bound: bottleneck is arithmetic, not memory bandwidth or IO
SIMD is ineffective for:
- Control-flow heavy code (branches break vectorization)
- Small datasets (< 64 elements)
- Irregular memory access (random pointer following)
- Dependency between iterations (result[i] = result[i-1] + x[i])
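To put the Amdahl's Law framing into numbers, here is a small illustrative sketch (the 80% / 8× figures are hypothetical, not measurements):
// Amdahl's Law: if a fraction p of runtime is vectorizable and SIMD speeds it up s×,
// the overall speedup is 1 / ((1 - p) + p / s).
constexpr double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
static_assert(amdahl_speedup(0.80, 8.0) > 3.3 && amdahl_speedup(0.80, 8.0) < 3.4,
              "80% vectorizable at 8x (AVX2 floats) gives only ~3.3x overall");
// Lesson: vectorize the hot loop, but expect the full 8× only when nearly all
// of the runtime lives inside it.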
Auto-Vectorization: Helping the Compiler
Always try auto-vectorization before writing intrinsics. Compilers with -O2/-O3 automatically vectorize many loops:
#include <vector>
// THIS LOOP auto-vectorizes (simple, independent, contiguous):
void scale(float* data, float factor, int n) {
for (int i = 0; i < n; i++) {
data[i] *= factor; // data[i] independent — compiler uses SIMD
}
}
// THIS LOOP does NOT auto-vectorize (data dependency between iterations):
void prefix_sum(float* data, int n) {
for (int i = 1; i < n; i++) {
data[i] += data[i-1]; // data[i] depends on data[i-1] — cannot parallelize
}
}
// Help the compiler with hints:
void optimized_scale(float* __restrict__ data, float factor, int n) {
// __restrict__: promise data doesn't alias other pointers
data = static_cast<float*>(__builtin_assume_aligned(data, 32)); // 32-byte aligned
for (int i = 0; i < n; i++) {
data[i] *= factor; // Now compiler generates AVX2 code
}
}
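Vectorization pragmas are another way to nudge (or assert safety to) the vectorizer. A brief sketch; the OpenMP pragma assumes -fopenmp-simd, and pragma spellings differ between compilers:
// OpenMP SIMD pragma: asserts that iterations are safe to execute in SIMD lanes,
// overriding the compiler's conservative dependence analysis.
void pragma_scale(float* data, float factor, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}
// Clang alternative: #pragma clang loop vectorize(enable) interleave(enable)
// GCC alternative:   #pragma GCC ivdep   (ignore assumed loop-carried dependencies)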
// Check if compiler vectorized: godbolt.org — look for vmulps/vaddps instructions
// Compile with: clang++ -O3 -march=native -Rpass=loop-vectorize
Memory Alignment: The SIMD Prerequisite
SIMD loads/stores either require or prefer aligned memory: aligned loads (_mm256_load_ps) fault on misaligned addresses, while unaligned loads (_mm256_loadu_ps) work anywhere but can pay a penalty when an access straddles a cache line:
#include <cassert>
#include <cstdint>  // uintptr_t
#include <cstdlib>  // std::aligned_alloc, std::free
#include <vector>
// alignas: stack alignment
alignas(32) float stack_array[256]; // 32-byte aligned for AVX2
// Heap alignment with std::aligned_alloc (C++17).
// Note: the requested size must be a multiple of the alignment.
float* heap_array = static_cast<float*>(std::aligned_alloc(32, sizeof(float) * 256));
// ... use it, then release with std::free(heap_array), don't forget!
// std::vector with an aligned allocator:
template<typename T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;
    AlignedAllocator() = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) {} // allow container rebinding
    T* allocate(std::size_t n) {
        // Round the byte count up to a multiple of Alignment, as aligned_alloc requires.
        std::size_t bytes = (n * sizeof(T) + Alignment - 1) / Alignment * Alignment;
        return static_cast<T*>(std::aligned_alloc(Alignment, bytes));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
    friend bool operator==(const AlignedAllocator&, const AlignedAllocator&) { return true; }
    friend bool operator!=(const AlignedAllocator&, const AlignedAllocator&) { return false; }
};
using AVX2Vector = std::vector<float, AlignedAllocator<float, 32>>;
AVX2Vector avx_data(1024); // underlying buffer is 32-byte aligned
// Check alignment at runtime:
assert(reinterpret_cast<std::uintptr_t>(avx_data.data()) % 32 == 0);
AVX2 Intrinsics: Manual Vectorization
#include <immintrin.h> // Intel AVX/AVX2
// AVX2 naming convention:
// _mm256 → 256-bit YMM register
// _ps → packed single (float32)
// _pd → packed double (float64)
// _epi32 → packed int32
// loadu → unaligned load (slower but safe)
// load → aligned load (faster, segfaults if misaligned)
void array_add(const float* a, const float* b, float* out, int n) {
int i = 0;
// Process 8 elements at a time with AVX2:
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i); // Load 8 floats
__m256 vb = _mm256_loadu_ps(b + i);
__m256 vc = _mm256_add_ps(va, vb); // Add 8 floats
_mm256_storeu_ps(out + i, vc); // Store 8 floats
}
// Handle remaining elements (tail loop):
for (; i < n; i++) {
out[i] = a[i] + b[i]; // Scalar for leftovers
}
}
// Common AVX2 operations:
void avx2_demo(float* a, float* b, float* c, float* out, int n) {
for (int i = 0; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i);
__m256 vb = _mm256_loadu_ps(b + i);
__m256 vc = _mm256_loadu_ps(c + i);
// Fused multiply-add: out = a * b + c (ONE instruction for three ops!)
__m256 result = _mm256_fmadd_ps(va, vb, vc);
// Conditional blend based on comparison:
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(result, zero, _CMP_GT_OS); // result > 0?
__m256 clamped = _mm256_blendv_ps(zero, result, mask); // ReLU!
_mm256_storeu_ps(out + i, clamped);
}
}
std::experimental::simd (C++26 Preview)
std::simd (voted into C++26; available today as std::experimental::simd in GCC and Clang) provides portable SIMD without architecture-specific intrinsics:
#include <experimental/simd>
namespace stdx = std::experimental;
void portable_scale(float* data, float factor, int n) {
using simd_f = stdx::native_simd<float>; // Width determined by target CPU
constexpr int W = simd_f::size(); // 8 on AVX2, 16 on AVX-512
int i = 0;
for (; i + W <= n; i += W) {
simd_f v(data + i, stdx::element_aligned);
v *= factor;
v.copy_to(data + i, stdx::element_aligned);
}
// Tail loop:
for (; i < n; i++) data[i] *= factor;
}
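Reductions are also covered: stdx::reduce performs the horizontal sum that intrinsics make you spell out by hand. A sketch of a portable dot product using the same stdx namespace alias as above (dot_portable is an illustrative name, not a library function):
float dot_portable(const float* a, const float* b, int n) {
    using simd_f = stdx::native_simd<float>;
    constexpr int W = simd_f::size();
    simd_f acc = 0.0f;                        // one partial sum per SIMD lane
    int i = 0;
    for (; i + W <= n; i += W) {
        simd_f va(a + i, stdx::element_aligned);
        simd_f vb(b + i, stdx::element_aligned);
        acc += va * vb;                       // element-wise multiply-accumulate
    }
    float sum = stdx::reduce(acc);            // horizontal sum across lanes
    for (; i < n; i++) sum += a[i] * b[i];    // scalar tail
    return sum;
}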
// The same std::simd source (e.g. portable_scale above) compiles to:
//   SSE2-only x86: xmm registers, 4 floats at a time
//   AVX2 x86:      ymm registers, 8 floats at a time
//   ARM NEON:      128-bit NEON registers, 4 floats at a time
//   other targets: plain scalar code
Dot Product Optimization: 4 Levels
constexpr int N = 1024;
float a[N], b[N];
// Level 0: Scalar (baseline)
float dot_scalar(const float* a, const float* b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += a[i] * b[i];
return sum;
}
// Level 1: Auto-vectorized (just compile with -O3 -march=native)
// The compiler emits AVX2 code for the Level 0 loop; no source changes needed
// Level 2: AVX2 manual (better control)
float dot_avx2(const float* a, const float* b, int n) {
__m256 acc = _mm256_setzero_ps();
int i = 0;
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i);
__m256 vb = _mm256_loadu_ps(b + i);
acc = _mm256_fmadd_ps(va, vb, acc); // FMA: acc += a * b
}
// Horizontal sum of 8 accumulators:
__m128 low = _mm256_castps256_ps128(acc);
__m128 high = _mm256_extractf128_ps(acc, 1);
__m128 sum4 = _mm_add_ps(low, high);
sum4 = _mm_hadd_ps(sum4, sum4);
sum4 = _mm_hadd_ps(sum4, sum4);
float result = _mm_cvtss_f32(sum4);
for (; i < n; i++) result += a[i] * b[i]; // Tail
return result;
}
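For context on how timings like the ones below might be collected, here is a minimal std::chrono harness sketch (methodology illustrative; actual numbers depend on CPU, compiler, and flags):
#include <chrono>
// Times an n-element dot product function and returns average nanoseconds per call.
// 'reps' is chosen large enough to amortize timer overhead.
template <typename F>
double ns_per_call(F f, const float* a, const float* b, int n, int reps = 100000) {
    volatile float sink = 0.0f;               // keep results live so calls aren't elided
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) sink = sink + f(a, b, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}
// Usage: ns_per_call(dot_scalar, a, b, N) vs ns_per_call(dot_avx2, a, b, N)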
// Benchmark (N=1024, single core):
// Scalar: ~500ns
// Auto-vectorized: ~70ns (7× speedup)
// AVX2 manual: ~65ns (similar — compiler is good!)
// AVX-512 FMA: ~35ns (14× speedup vs scalar)
Frequently Asked Questions
Is AVX2 code portable across all x86 CPUs?
No. AVX2 requires Intel Haswell (2013) or AMD Ryzen (2017) or newer. AVX-512 first shipped in Intel's Skylake-SP/-X server parts (2017), reached client chips with Ice Lake (2019), and arrived on AMD with Zen 4 (2022). Always detect CPU features at runtime using __builtin_cpu_supports("avx2") (GCC/Clang) or CPUID, and provide fallback paths, as sketched below. For deployment, -march=x86-64-v3 gives an AVX2 baseline; otherwise dispatch dynamically.
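A minimal runtime-dispatch sketch using __builtin_cpu_supports (GCC/Clang); function names are illustrative, and __attribute__((target("avx2"))) lets the AVX2 path live in a translation unit otherwise compiled for the baseline (assumes #include <immintrin.h> from earlier):
__attribute__((target("avx2")))
static void scale_avx2(float* data, float factor, int n) {
    __m256 vf = _mm256_set1_ps(factor);
    int i = 0;
    for (; i + 7 < n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), vf));
    for (; i < n; i++) data[i] *= factor;
}
static void scale_fallback(float* data, float factor, int n) {
    for (int i = 0; i < n; i++) data[i] *= factor;
}
void scale_dispatch(float* data, float factor, int n) {
    if (__builtin_cpu_supports("avx2")) scale_avx2(data, factor, n);  // Haswell/Zen or newer
    else                                scale_fallback(data, factor, n);
}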
Should I use intrinsics or rely on auto-vectorization?
For most code: write clean loops and let the compiler auto-vectorize with -O3 -march=native. Check with Compiler Explorer (godbolt.org) using -Rpass=loop-vectorize. Only hand-write intrinsics when: (1) the loop has complex dependencies the compiler can't resolve, (2) you need specific instructions like gather/scatter, or (3) profiling shows the auto-vectorized code isn't optimal.
What is the performance impact of misaligned SIMD loads?
On modern Intel/AMD CPUs, unaligned loads (loadu) cost the same as aligned loads (load) as long as the access stays within one cache line. When a load straddles a 64-byte cache-line boundary (a "split load"), there is a penalty of roughly 4 cycles. For large array processing, align buffers to 64 bytes (the cache-line size) for best results.
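If you still want aligned loads on a buffer whose start you don't control, the usual pattern is a scalar peel loop. A sketch, reusing the includes from earlier (illustrative only, since loadu is typically just as fast when accesses stay within a cache line):
void scale_peeled(float* data, float factor, int n) {
    int i = 0;
    // Peel scalar iterations until data + i is 32-byte aligned (at most 7 of them).
    while (i < n && reinterpret_cast<std::uintptr_t>(data + i) % 32 != 0)
        data[i++] *= factor;
    __m256 vf = _mm256_set1_ps(factor);
    for (; i + 7 < n; i += 8) {
        __m256 v = _mm256_load_ps(data + i);           // aligned load is now safe
        _mm256_store_ps(data + i, _mm256_mul_ps(v, vf));
    }
    for (; i < n; i++) data[i] *= factor;              // scalar tail
}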
Key Takeaway
SIMD is the final frontier of single-thread performance optimization. Once you've eliminated algorithmic waste (wrong data structure), removed memory allocation overhead (RAII + pooling), and leveraged cache locality (struct-of-arrays layout), SIMD gives you another 4-16× speedup on data-parallel code. The std::simd (C++26) standard finally makes this portable — write once, compile to AVX2 on x86, NEON on ARM, and scalar on everything else.
Read next: Meta-programming: constexpr and Compile-Time Logic →
Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.
