
C++ SIMD & Intrinsics: AVX2, Auto-Vectorization, std::simd (C++26) & Memory Alignment

TopicTrick Team



SIMD Register Widths: SSE → AVX → AVX-512

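Each x86 SIMD generation doubles the vector register width: SSE brought 128-bit XMM registers (4 floats per instruction), AVX/AVX2 widened them to 256-bit YMM (8 floats), and AVX-512 to 512-bit ZMM (16 floats). A minimal sketch of the corresponding intrinsic types (function name is illustrative; build with -mavx2, plus -mavx512f for the guarded part):

```cpp
#include <immintrin.h>

void register_widths() {
    __m128 xmm = _mm_set1_ps(1.0f);     // SSE:      128-bit XMM, 4 float lanes
    __m256 ymm = _mm256_set1_ps(1.0f);  // AVX/AVX2: 256-bit YMM, 8 float lanes
    (void)xmm; (void)ymm;               // silence unused-variable warnings
#ifdef __AVX512F__
    __m512 zmm = _mm512_set1_ps(1.0f);  // AVX-512:  512-bit ZMM, 16 float lanes
    (void)zmm;
#endif
}
```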

When SIMD Pays Off: Amdahl's Law Applied
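
Amdahl's law sets the ceiling: if a fraction p of runtime is vectorizable and the SIMD unit processes s elements at once, the overall speedup is at most 1 / ((1 - p) + p / s). For example, vectorizing a loop that accounts for 50% of runtime with 8-wide AVX2 yields at most roughly 1.8× overall, so target the dominant loops first.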

SIMD is optimal when:

  • Data parallelism: same operation on many independent elements
  • Contiguous memory: data is sequential (arrays, vectors — not linked lists)
  • Hot loop: function called millions of times or operates on large arrays
  • Compute-bound: bottleneck is arithmetic, not memory bandwidth or IO

SIMD is ineffective for:

  • Control-flow heavy code (branches break vectorization)
  • Small datasets (< 64 elements)
  • Irregular memory access (random pointer following)
  • Dependency between iterations (result[i] = result[i-1] + x[i]); see the contrast sketch below
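
To make the last point concrete, here is a minimal contrast sketch (function names are illustrative): the first loop has independent iterations and vectorizes cleanly; the second carries a dependency across iterations and will not vectorize as written (parallel-scan reformulations exist but require restructuring).

```cpp
#include <cstddef>

// Vectorizes: every iteration is independent of the others.
void square_all(float* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        v[i] = v[i] * v[i];
}

// Does not vectorize as written: iteration i consumes the result of i - 1.
void prefix_sum(float* result, const float* x, std::size_t n) {
    if (n == 0) return;
    result[0] = x[0];
    for (std::size_t i = 1; i < n; ++i)
        result[i] = result[i - 1] + x[i];
}
```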

Auto-Vectorization: Helping the Compiler

Always try auto-vectorization before writing intrinsics. At -O2/-O3, compilers automatically vectorize many loops:

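A minimal sketch of a loop the auto-vectorizer handles well (function name is illustrative; __restrict is a common GCC/Clang/MSVC extension that promises the arrays don't alias):

```cpp
#include <cstddef>

// Independent iterations, contiguous access, no aliasing: ideal for the
// auto-vectorizer. Verify with clang++ -O3 -Rpass=loop-vectorize or
// g++ -O3 -fopt-info-vec.
void scale_add(float* __restrict out,
               const float* __restrict a,
               const float* __restrict b,
               std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 2.0f * a[i] + b[i];   // no cross-iteration dependency
}
```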

Memory Alignment: The SIMD Prerequisite

SIMD loads and stores either require or prefer aligned memory. Aligned instructions such as _mm256_load_ps fault on a misaligned address, while unaligned instructions such as _mm256_loadu_ps accept any address at the cost of a possible cache-line-split penalty:

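A minimal sketch of the two standard ways to get SIMD-friendly alignment (function name is illustrative): alignas for stack storage and std::aligned_alloc (C++17) for the heap:

```cpp
#include <immintrin.h>
#include <cstdlib>

alignas(32) float stack_buf[8];   // 32-byte alignment matches the YMM width

void alignment_demo() {
    // aligned_alloc requires the size to be a multiple of the alignment
    // (1024 * 4 bytes is a multiple of 64).
    float* heap = static_cast<float*>(std::aligned_alloc(64, 1024 * sizeof(float)));
    for (int i = 0; i < 1024; ++i) heap[i] = float(i);

    __m256 a = _mm256_load_ps(stack_buf);    // aligned load: faults if misaligned
    __m256 b = _mm256_loadu_ps(heap + 3);    // unaligned load: works at any address
    _mm256_storeu_ps(heap, _mm256_add_ps(a, b));

    std::free(heap);
}
```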

AVX2 Intrinsics: Manual Vectorization

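When the auto-vectorizer falls short, you can spell the vector operations out by hand. A minimal sketch of a SAXPY kernel (y = a*x + y) processing 8 floats per iteration, with a scalar tail for the leftovers (function name is illustrative; build with -mavx2 -mfma):

```cpp
#include <immintrin.h>
#include <cstddef>

void saxpy_avx2(float a, const float* x, float* y, std::size_t n) {
    __m256 va = _mm256_set1_ps(a);              // broadcast a into all 8 lanes
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);     // load 8 floats from x
        __m256 vy = _mm256_loadu_ps(y + i);     // load 8 floats from y
        _mm256_storeu_ps(y + i,
                         _mm256_fmadd_ps(va, vx, vy));  // y = a*x + y, fused
    }
    for (; i < n; ++i)                          // scalar tail: n % 8 leftovers
        y[i] = a * x[i] + y[i];
}
```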

std::experimental::simd (C++26 Preview)

std::simd (voted into C++26, available today as std::experimental::simd from the Parallelism TS v2 via GCC's <experimental/simd> header, also usable from Clang with libstdc++) provides portable SIMD without architecture-specific intrinsics:

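A minimal sketch using that header (function name is illustrative); native_simd<float> picks the widest vector type the target supports, so the same source compiles to AVX2 on x86 and NEON on ARM:

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;       // e.g. 8 lanes with AVX2

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + floatv::size() <= n; i += floatv::size()) {
        floatv va(a + i, stdx::element_aligned);  // load one vector's worth
        floatv vb(b + i, stdx::element_aligned);
        (va + vb).copy_to(out + i, stdx::element_aligned);
    }
    for (; i < n; ++i)                            // scalar tail
        out[i] = a[i] + b[i];
}
```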

Dot Product Optimization: 4 Levels

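One kernel taken through four levels of effort, as a minimal sketch; the exact breakdown is an assumption (scalar baseline, auto-vectorized build, AVX2 intrinsics, AVX2 with FMA):

```cpp
#include <immintrin.h>
#include <cstddef>

// Level 1: scalar baseline, one multiply-add per iteration.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Level 2: the same source compiled with -O3 -ffast-math, which lets the
// compiler reorder the floating-point reduction and vectorize it itself.

// Helper: horizontal sum of the 8 lanes of a __m256.
static float hsum256(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);   // 8 lanes -> 4
    s = _mm_hadd_ps(s, s);            // 4 -> 2
    s = _mm_hadd_ps(s, s);            // 2 -> 1
    return _mm_cvtss_f32(s);
}

// Level 3: AVX2 intrinsics, 8 lanes per iteration (separate mul + add).
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    float sum = hsum256(acc);
    for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
    return sum;
}

// Level 4: AVX2 + FMA fuses the multiply and add into one instruction.
float dot_fma(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    float sum = hsum256(acc);
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```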

Frequently Asked Questions

Is AVX2 code portable across all x86 CPUs? No. AVX2 requires Intel Haswell (2013) or AMD Ryzen (2017) or newer. AVX-512 requires Ice Lake (Intel, 2019) or Zen 4 (AMD, 2022). Always detect CPU features at runtime using __builtin_cpu_supports("avx2") (GCC/Clang) or CPUID, and provide fallback paths. In deployment, use -march=x86-64-v3 for AVX2 baseline or detect dynamically.
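
A minimal sketch of that runtime dispatch using the GCC/Clang builtins (the process_* kernels and select_kernel are hypothetical placeholders; in practice the AVX2 kernel lives in a translation unit compiled with -mavx2):

```cpp
#include <cstddef>

// Hypothetical kernels: one built with AVX2 enabled, one portable fallback.
void process_avx2(const float* in, float* out, std::size_t n);
void process_scalar(const float* in, float* out, std::size_t n);

using ProcessFn = void (*)(const float*, float*, std::size_t);

ProcessFn select_kernel() {
    __builtin_cpu_init();   // initialize the CPU-feature data (safe to call early)
    return __builtin_cpu_supports("avx2") ? process_avx2
                                          : process_scalar;
}
```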

Should I use intrinsics or rely on auto-vectorization? For most code: write clean loops and let the compiler auto-vectorize with -O3 -march=native. Check with Compiler Explorer (godbolt.org) using Clang's -Rpass=loop-vectorize or GCC's -fopt-info-vec. Only hand-write intrinsics when: (1) the loop has complex dependencies the compiler can't resolve, (2) you need specific instructions like gather/scatter, or (3) profiling shows the auto-vectorized code isn't optimal.

What is the performance impact of misaligned SIMD loads? On modern Intel/AMD CPUs, an unaligned load (loadu) costs the same as an aligned load (load) as long as the access stays within a single cache line. If the access straddles a 64-byte cache-line boundary (a "split load"), there's a penalty of roughly 4 cycles. For large array processing, align data to 64 bytes (the cache-line size) for best performance.


Key Takeaway

SIMD is the final frontier of single-thread performance optimization. Once you've eliminated algorithmic waste (wrong data structure), removed memory allocation overhead (RAII + pooling), and leveraged cache locality (struct-of-arrays layout), SIMD gives you another 4-16× speedup on data-parallel code. The std::simd (C++26) standard finally makes this portable — write once, compile to AVX2 on x86, NEON on ARM, and scalar on everything else.

Read next: Meta-programming: constexpr and Compile-Time Logic →


Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.