Good reference info:
The C/C++ AVX intrinsic functions are in the header "immintrin.h".
AVX uses dedicated 256-bit registers, with these C/C++ types:
The 256 bit (32 byte) registers have enough space to store:
For example, here's how you operate on 8 floats at a time, using dedicated AVX _mm256 intrinsic functions. Note how we're using _ps instructions both for load and add.
#include "immintrin.h" void foo(void) { float f[8]={1.0,2.0,1.2,2.1, 5.2,5.3,10.1,11.0}; __m256 v=_mm256_load_ps(&f[0]); v=_mm256_add_ps(v,v); _mm256_store_ps(&f[0],v); farray_print(f,8); }
And here's how you operate on 8 ints at a time. For ints, the arithmetic uses _epi32, and the load and store use the weird _si256 type, which just means one giant 256-bit block of integer data.
#include "immintrin.h" void foo(void) { int f[8]={1,2,0,3, 5,5,10,11}; __m256i v=_mm256_load_si256((const __m256i *)&f[0]); v=_mm256_add_epi32(v,v); _mm256_store_si256((__m256i *)&f[0],v); iarray_print(f,8); }
Usually it's as easy as just operating on 8 iterations of the loop at a time! But there are often minor complications, like scalar values that need to be broadcast to fill all 8 slots:
#include "immintrin.h" const int n=1024; float a[n], b[n]; float c=3.0; long foo(void) { bool use_AVX=true; if (use_AVX) { // fancy AVX loop: __m256 C=_mm256_broadcast_ss(&c); // splat c across all SIMD lanes for (int i=0;i<n;i+=8) { // b[i]=a[i]*c; __m256 A=_mm256_load_ps(&a[i]); __m256 B=_mm256_mul_ps(A,C); _mm256_store_ps(&b[i],B); } } else { // simple float loop: for (int i=0;i<n;i++) { b[i]=a[i]*c; } } return b[0]; }
On my Skylake machine, using AVX takes only 44ns to compute 1000 floats; the simple float loop takes 294ns (a 6.68x speedup!).
Where things get really tricky is when each float wants to do its own separate operations, like per-float branching. AVX handles branches exactly like SSE .
AVX instructions expect input operands to be aligned on a 32-byte boundary. On most 32-bit systems, malloc and new only return pointers aligned to an 8-byte boundary, which caused problems for SSE, which expected 16-byte alignment. So on most 64-bit systems, malloc and new always return data aligned to a 16-byte boundary--half the time you'll also be lucky and your inputs will also be aligned on a 32-byte boundary, but half the time you won't and your AVX code will crash.
If you're doing allocations yourself, C style, the function "_mm_malloc(size_in_bytes,32)" will return you a 32-byte aligned pointer, and it's available in the same Intel headers as the other _mm_ intrinsics.
For std::vector, you should use an aligning allocator like my osl/alignocator.h as an additional template argument:
std::vector<float, alignocator<float,32> > myVec;
If you don't do this, your first AVX load has a 50% chance of crashing.
Of course, you still need to advance by 32 bytes / 8 floats from the start of the vector.