C++ SIMD & Intrinsics: AVX2, Auto-Vectorization, std::simd (C++26) & Memory Alignment

Table of Contents
- SIMD Register Widths: SSE → AVX → AVX-512
- When SIMD Pays Off: Amdahl's Law Applied
- Auto-Vectorization: Helping the Compiler
- Memory Alignment: The SIMD Prerequisite
- AVX2 Intrinsics: Manual Vectorization
- Fused Multiply-Add (FMA): Free Arithmetic
- Horizontal Operations: Reductions
- std::experimental::simd (C++26 Preview)
- Vectorizing a Dot Product: 4 Levels of Optimization
- Detecting CPU Features at Runtime
- Frequently Asked Questions
- Key Takeaway
SIMD Register Widths: SSE → AVX → AVX-512
When SIMD Pays Off: Amdahl's Law Applied
SIMD is optimal when:
- Data parallelism: same operation on many independent elements
- Contiguous memory: data is sequential (arrays, vectors — not linked lists)
- Hot loop: function called millions of times or operates on large arrays
- Compute-bound: bottleneck is arithmetic, not memory bandwidth or IO
SIMD is ineffective for:
- Control-flow heavy code (branches break vectorization)
- Small datasets (< 64 elements)
- Irregular memory access (random pointer following)
- Dependency between iterations (result[i] = result[i-1] + x[i])
Auto-Vectorization: Helping the Compiler
Always try auto-vectorization before writing intrinsics. At -O2/-O3, compilers automatically vectorize many loops.
Memory Alignment: The SIMD Prerequisite
SIMD loads/stores either require or prefer aligned memory. Using the aligned variant (_mm256_load_ps) on a misaligned address crashes, while the unaligned variant (_mm256_loadu_ps) works anywhere at a possible performance penalty.
AVX2 Intrinsics: Manual Vectorization
std::experimental::simd (C++26 Preview)
std::simd (voted into C++26; available today as std::experimental::simd in GCC's libstdc++ via the <experimental/simd> header) provides portable SIMD without architecture-specific intrinsics.
Vectorizing a Dot Product: 4 Levels of Optimization
Frequently Asked Questions
Is AVX2 code portable across all x86 CPUs?
No. AVX2 requires Intel Haswell (2013) or AMD Ryzen (2017) or newer. AVX-512 requires Ice Lake (Intel, 2019) or Zen 4 (AMD, 2022). Always detect CPU features at runtime using __builtin_cpu_supports("avx2") (GCC/Clang) or CPUID, and provide fallback paths. In deployment, use -march=x86-64-v3 for AVX2 baseline or detect dynamically.
Should I use intrinsics or rely on auto-vectorization?
For most code: write clean loops and let the compiler auto-vectorize with -O3 -march=native. Check with Compiler Explorer (godbolt.org) using -Rpass=loop-vectorize. Only hand-write intrinsics when: (1) the loop has complex dependencies the compiler can't resolve, (2) you need specific instructions like gather/scatter, or (3) profiling shows the auto-vectorized code isn't optimal.
What is the performance impact of misaligned SIMD loads?
On modern Intel/AMD CPUs, misaligned loads (loadu vs load) carry no penalty as long as the access does not cross a cache-line boundary. When a load does straddle a 64-byte cache line ("split load"), it costs roughly an extra 4 cycles. For large array processing, align to 64 bytes (the cache line size) for best performance.
Key Takeaway
SIMD is the final frontier of single-thread performance optimization. Once you've eliminated algorithmic waste (wrong data structure), removed memory allocation overhead (RAII + pooling), and leveraged cache locality (struct-of-arrays layout), SIMD gives you another 4-16× speedup on data-parallel code. The std::simd (C++26) standard finally makes this portable — write once, compile to AVX2 on x86, NEON on ARM, and scalar on everything else.
Read next: Meta-programming: constexpr and Compile-Time Logic →
Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.
