C++ SIMD & Intrinsics: AVX2, Auto-Vectorization, std::simd (C++26) & Memory Alignment

Table of Contents
- SIMD Register Widths: SSE → AVX → AVX-512
- When SIMD Pays Off: Amdahl's Law Applied
- Auto-Vectorization: Helping the Compiler
- Memory Alignment: The SIMD Prerequisite
- AVX2 Intrinsics: Manual Vectorization
- Fused Multiply-Add (FMA): Free Arithmetic
- Horizontal Operations: Reductions
- std::experimental::simd (C++26 Preview)
- Vectorizing a Dot Product: 4 Levels of Optimization
- Detecting CPU Features at Runtime
- Frequently Asked Questions
- Key Takeaway
SIMD Register Widths: SSE → AVX → AVX-512
Register      | Width   | float32 | float64 | int32 | int64
--------------|---------|---------|---------|-------|------
XMM (SSE)     | 128-bit | 4       | 2       | 4     | 2
YMM (AVX2)    | 256-bit | 8       | 4       | 8     | 4
ZMM (AVX-512) | 512-bit | 16      | 8       | 16    | 8
Theoretical maximum FLOPS per cycle (one instruction):
SSE2 : 4 float adds OR 4 float muls → 4 FLOPS/cycle
AVX2 : 8 float FMAs (add+mul fused) → 16 FLOPS/cycle
AVX-512: 16 float FMAs → 32 FLOPS/cycle
// Scalar (1 float per instruction):
float a = 1.0f, b = 2.0f;
float c = a + b; // 1 operation
// SSE2 (4 floats per instruction):
#include <xmmintrin.h>
__m128 va = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // register holds {1,2,3,4} (set_ps lists elements high to low)
__m128 vb = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f); // {5,6,7,8}
__m128 vc = _mm_add_ps(va, vb); // {6,8,10,12}: 4 additions simultaneously!
// AVX2 (8 floats per instruction):
#include <immintrin.h>
__m256 va8 = _mm256_loadu_ps(array_a); // Load 8 floats from array_a
__m256 vb8 = _mm256_loadu_ps(array_b); // Load 8 floats from array_b
__m256 vc8 = _mm256_add_ps(va8, vb8); // 8 additions simultaneously!
When SIMD Pays Off: Amdahl's Law Applied
SIMD is optimal when:
- Data parallelism: same operation on many independent elements
- Contiguous memory: data is sequential (arrays, vectors — not linked lists)
- Hot loop: function called millions of times or operates on large arrays
- Compute-bound: bottleneck is arithmetic, not memory bandwidth or IO
SIMD is ineffective for:
- Control-flow heavy code (branches break vectorization)
- Small datasets (< 64 elements)
- Irregular memory access (random pointer following)
- Dependency between iterations (result[i] = result[i-1] + x[i])
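To put the Amdahl's Law framing into numbers, here is a small illustrative sketch (the 80% / 8× figures are hypothetical, not measurements):
// Amdahl's Law: if a fraction p of runtime is vectorizable and SIMD speeds it up s×,
// the overall speedup is 1 / ((1 - p) + p / s).
constexpr double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
static_assert(amdahl_speedup(0.80, 8.0) > 3.3 && amdahl_speedup(0.80, 8.0) < 3.4,
              "80% vectorizable at 8x (AVX2 floats) gives only ~3.3x overall");
// Lesson: vectorize the hot loop, but expect the full 8× only when nearly all
// of the runtime lives inside it.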
Auto-Vectorization: Helping the Compiler
Always try auto-vectorization before writing intrinsics. Compilers with -O2/-O3 automatically vectorize many loops:
#include <vector>
// THIS LOOP auto-vectorizes (simple, independent, contiguous):
void scale(float* data, float factor, int n) {
for (int i = 0; i < n; i++) {
data[i] *= factor; // data[i] independent — compiler uses SIMD
}
}
// THIS LOOP does NOT auto-vectorize (data dependency between iterations):
void prefix_sum(float* data, int n) {
for (int i = 1; i < n; i++) {
data[i] += data[i-1]; // data[i] depends on data[i-1] — cannot parallelize
}
}
// Help the compiler with hints:
void optimized_scale(float* __restrict__ data, float factor, int n) {
// __restrict__: promise data doesn't alias other pointers
data = static_cast<float*>(__builtin_assume_aligned(data, 32)); // 32-byte aligned
for (int i = 0; i < n; i++) {
data[i] *= factor; // Now compiler generates AVX2 code
}
}
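Vectorization pragmas are another way to nudge (or assert safety to) the vectorizer. A brief sketch; the OpenMP pragma assumes -fopenmp-simd, and pragma spellings differ between compilers:
// OpenMP SIMD pragma: asserts that iterations are safe to execute in SIMD lanes,
// overriding the compiler's conservative dependence analysis.
void pragma_scale(float* data, float factor, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}
// Clang alternative: #pragma clang loop vectorize(enable) interleave(enable)
// GCC alternative:   #pragma GCC ivdep   (ignore assumed loop-carried dependencies)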
// Check if compiler vectorized: godbolt.org — look for vmulps/vaddps instructions
// Compile with: clang++ -O3 -march=native -Rpass=loop-vectorize
Memory Alignment: The SIMD Prerequisite
SIMD loads/stores either require or prefer aligned memory: aligned loads (_mm256_load_ps) fault on misaligned addresses, while unaligned loads (_mm256_loadu_ps) work anywhere but can pay a penalty when an access straddles a cache line:
#include <cassert>
#include <cstdint>  // uintptr_t
#include <cstdlib>  // std::aligned_alloc, std::free
#include <vector>
// alignas: stack alignment
alignas(32) float stack_array[256]; // 32-byte aligned for AVX2
// Heap alignment with std::aligned_alloc (C++17).
// Note: the requested size must be a multiple of the alignment.
float* heap_array = static_cast<float*>(std::aligned_alloc(32, sizeof(float) * 256));
// ... use it, then release with std::free(heap_array), don't forget!
// std::vector with an aligned allocator:
template<typename T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;
    AlignedAllocator() = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) {} // allow container rebinding
    T* allocate(std::size_t n) {
        // Round the byte count up to a multiple of Alignment, as aligned_alloc requires.
        std::size_t bytes = (n * sizeof(T) + Alignment - 1) / Alignment * Alignment;
        return static_cast<T*>(std::aligned_alloc(Alignment, bytes));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
    friend bool operator==(const AlignedAllocator&, const AlignedAllocator&) { return true; }
    friend bool operator!=(const AlignedAllocator&, const AlignedAllocator&) { return false; }
};
using AVX2Vector = std::vector<float, AlignedAllocator<float, 32>>;
AVX2Vector avx_data(1024); // underlying buffer is 32-byte aligned
// Check alignment at runtime:
assert(reinterpret_cast<std::uintptr_t>(avx_data.data()) % 32 == 0);
AVX2 Intrinsics: Manual Vectorization
#include <immintrin.h> // Intel AVX/AVX2
// AVX2 naming convention:
// _mm256 → 256-bit YMM register
// _ps → packed single (float32)
// _pd → packed double (float64)
// _epi32 → packed int32
// loadu → unaligned load (slower but safe)
// load → aligned load (faster, segfaults if misaligned)
void array_add(const float* a, const float* b, float* out, int n) {
int i = 0;
// Process 8 elements at a time with AVX2:
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i); // Load 8 floats
__m256 vb = _mm256_loadu_ps(b + i);
__m256 vc = _mm256_add_ps(va, vb); // Add 8 floats
_mm256_storeu_ps(out + i, vc); // Store 8 floats
}
// Handle remaining elements (tail loop):
for (; i < n; i++) {
out[i] = a[i] + b[i]; // Scalar for leftovers
}
}
// Common AVX2 operations:
void avx2_demo(float* a, float* b, float* c, float* out, int n) {
for (int i = 0; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i);
__m256 vb = _mm256_loadu_ps(b + i);
__m256 vc = _mm256_loadu_ps(c + i);
// Fused multiply-add: out = a * b + c (ONE instruction for three ops!)
__m256 result = _mm256_fmadd_ps(va, vb, vc);
// Conditional blend based on comparison:
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(result, zero, _CMP_GT_OS); // result > 0?
__m256 clamped = _mm256_blendv_ps(zero, result, mask); // ReLU!
_mm256_storeu_ps(out + i, clamped);
}
}
std::experimental::simd (C++26 Preview)
std::simd (voted into C++26; available today as std::experimental::simd in GCC and Clang) provides portable SIMD without architecture-specific intrinsics:
#include <experimental/simd>
namespace stdx = std::experimental;
void portable_scale(float* data, float factor, int n) {
using simd_f = stdx::native_simd<float>; // Width determined by target CPU
constexpr int W = simd_f::size(); // 8 on AVX2, 16 on AVX-512
int i = 0;
for (; i + W <= n; i += W) {
simd_f v(data + i, stdx::element_aligned);
v *= factor;
v.copy_to(data + i, stdx::element_aligned);
}
// Tail loop:
for (; i < n; i++) data[i] *= factor;
}
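Reductions are also covered: stdx::reduce performs the horizontal sum that intrinsics make you spell out by hand. A sketch of a portable dot product using the same stdx namespace alias as above (dot_portable is an illustrative name, not a library function):
float dot_portable(const float* a, const float* b, int n) {
    using simd_f = stdx::native_simd<float>;
    constexpr int W = simd_f::size();
    simd_f acc = 0.0f;                        // one partial sum per SIMD lane
    int i = 0;
    for (; i + W <= n; i += W) {
        simd_f va(a + i, stdx::element_aligned);
        simd_f vb(b + i, stdx::element_aligned);
        acc += va * vb;                       // element-wise multiply-accumulate
    }
    float sum = stdx::reduce(acc);            // horizontal sum across lanes
    for (; i < n; i++) sum += a[i] * b[i];    // scalar tail
    return sum;
}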
// The same std::simd source (e.g. portable_scale above) compiles to:
//   SSE2-only x86: xmm registers, 4 floats at a time
//   AVX2 x86:      ymm registers, 8 floats at a time
//   ARM NEON:      128-bit NEON registers, 4 floats at a time
//   other targets: plain scalar code
Dot Product Optimization: 4 Levels
constexpr int N = 1024;
float a[N], b[N];
// Level 0: Scalar (baseline)
float dot_scalar(const float* a, const float* b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += a[i] * b[i];
return sum;
}
// Level 1: Auto-vectorized (just compile with -O3 -march=native)
// The compiler emits AVX2 code for the Level 0 loop; no source changes needed
// Level 2: AVX2 manual (better control)
float dot_avx2(const float* a, const float* b, int n) {
__m256 acc = _mm256_setzero_ps();
int i = 0;
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i);
__m256 vb = _mm256_loadu_ps(b + i);
acc = _mm256_fmadd_ps(va, vb, acc); // FMA: acc += a * b
}
// Horizontal sum of 8 accumulators:
__m128 low = _mm256_castps256_ps128(acc);
__m128 high = _mm256_extractf128_ps(acc, 1);
__m128 sum4 = _mm_add_ps(low, high);
sum4 = _mm_hadd_ps(sum4, sum4);
sum4 = _mm_hadd_ps(sum4, sum4);
float result = _mm_cvtss_f32(sum4);
for (; i < n; i++) result += a[i] * b[i]; // Tail
return result;
}
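For context on how timings like the ones below might be collected, here is a minimal std::chrono harness sketch (methodology illustrative; actual numbers depend on CPU, compiler, and flags):
#include <chrono>
// Times an n-element dot product function and returns average nanoseconds per call.
// 'reps' is chosen large enough to amortize timer overhead.
template <typename F>
double ns_per_call(F f, const float* a, const float* b, int n, int reps = 100000) {
    volatile float sink = 0.0f;               // keep results live so calls aren't elided
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) sink = sink + f(a, b, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}
// Usage: ns_per_call(dot_scalar, a, b, N) vs ns_per_call(dot_avx2, a, b, N)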
// Benchmark (N=1024, single core):
// Scalar: ~500ns
// Auto-vectorized: ~70ns (7× speedup)
// AVX2 manual: ~65ns (similar — compiler is good!)
// AVX-512 FMA: ~35ns (14× speedup vs scalar)
Frequently Asked Questions
Is AVX2 code portable across all x86 CPUs?
No. AVX2 requires Intel Haswell (2013) or AMD Ryzen (2017) or newer. AVX-512 first shipped in Intel's Skylake-SP/-X server parts (2017), reached client chips with Ice Lake (2019), and arrived on AMD with Zen 4 (2022). Always detect CPU features at runtime using __builtin_cpu_supports("avx2") (GCC/Clang) or CPUID, and provide fallback paths, as sketched below. For deployment, -march=x86-64-v3 gives an AVX2 baseline; otherwise dispatch dynamically.
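A minimal runtime-dispatch sketch using __builtin_cpu_supports (GCC/Clang); function names are illustrative, and __attribute__((target("avx2"))) lets the AVX2 path live in a translation unit otherwise compiled for the baseline (assumes #include <immintrin.h> from earlier):
__attribute__((target("avx2")))
static void scale_avx2(float* data, float factor, int n) {
    __m256 vf = _mm256_set1_ps(factor);
    int i = 0;
    for (; i + 7 < n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), vf));
    for (; i < n; i++) data[i] *= factor;
}
static void scale_fallback(float* data, float factor, int n) {
    for (int i = 0; i < n; i++) data[i] *= factor;
}
void scale_dispatch(float* data, float factor, int n) {
    if (__builtin_cpu_supports("avx2")) scale_avx2(data, factor, n);  // Haswell/Zen or newer
    else                                scale_fallback(data, factor, n);
}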
Should I use intrinsics or rely on auto-vectorization?
For most code: write clean loops and let the compiler auto-vectorize with -O3 -march=native. Check with Compiler Explorer (godbolt.org) using -Rpass=loop-vectorize. Only hand-write intrinsics when: (1) the loop has complex dependencies the compiler can't resolve, (2) you need specific instructions like gather/scatter, or (3) profiling shows the auto-vectorized code isn't optimal.
What is the performance impact of misaligned SIMD loads?
On modern Intel/AMD CPUs, unaligned loads (loadu) cost the same as aligned loads (load) as long as the access stays within one cache line. When a load straddles a 64-byte cache-line boundary (a "split load"), there is a penalty of roughly 4 cycles. For large array processing, align buffers to 64 bytes (the cache-line size) for best results.
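If you still want aligned loads on a buffer whose start you don't control, the usual pattern is a scalar peel loop. A sketch, reusing the includes from earlier (illustrative only, since loadu is typically just as fast when accesses stay within a cache line):
void scale_peeled(float* data, float factor, int n) {
    int i = 0;
    // Peel scalar iterations until data + i is 32-byte aligned (at most 7 of them).
    while (i < n && reinterpret_cast<std::uintptr_t>(data + i) % 32 != 0)
        data[i++] *= factor;
    __m256 vf = _mm256_set1_ps(factor);
    for (; i + 7 < n; i += 8) {
        __m256 v = _mm256_load_ps(data + i);           // aligned load is now safe
        _mm256_store_ps(data + i, _mm256_mul_ps(v, vf));
    }
    for (; i < n; i++) data[i] *= factor;              // scalar tail
}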
Key Takeaway
SIMD is the final frontier of single-thread performance optimization. Once you've eliminated algorithmic waste (wrong data structure), removed memory allocation overhead (RAII + pooling), and leveraged cache locality (struct-of-arrays layout), SIMD gives you another 4-16× speedup on data-parallel code. The std::simd (C++26) standard finally makes this portable — write once, compile to AVX2 on x86, NEON on ARM, and scalar on everything else.
Read next: Meta-programming: constexpr and Compile-Time Logic →
Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.
