Intel's Advanced Vector Extensions (AVX)
CS 641 Lecture, Dr. Lawlor
AVX has upgraded SSE to use 256-bit registers. These can
contain up to 8 floats, or 4 doubles! Clearly, with registers
that wide, you want to use struct-of-arrays, not
array-of-structs. A single 3D XYZ point is pretty lonely inside
an 8-wide vector, but three 8-wide vectors of XXXXXXXX YYYYYYYY
ZZZZZZZZ works fine.
#include <immintrin.h> /* AVX + SSE4 intrinsics header */
#include <iostream> /* for std::ostream, used in operator<< below */
using std::ostream;
// Internal class: do not use...
class not_vec8 {
	__m256 v; // bitwise inverse of our value (!!)
public:
	not_vec8(__m256 val) {v=val;}
	__m256 get(void) const {return v;} // returns INVERSE of our value (!!)
};
// This is the class to use!
class vec8 {
	__m256 v;
public:
	vec8(__m256 val) {v=val;}
	vec8(const float *src) {v=_mm256_loadu_ps(src);}
	vec8(float x) {v=_mm256_broadcast_ss(&x);}
	vec8 operator+(const vec8 &rhs) const {return _mm256_add_ps(v,rhs.v);}
	vec8 operator-(const vec8 &rhs) const {return _mm256_sub_ps(v,rhs.v);}
	vec8 operator*(const vec8 &rhs) const {return _mm256_mul_ps(v,rhs.v);}
	vec8 operator/(const vec8 &rhs) const {return _mm256_div_ps(v,rhs.v);}
	vec8 operator&(const vec8 &rhs) const {return _mm256_and_ps(v,rhs.v);}
	vec8 operator|(const vec8 &rhs) const {return _mm256_or_ps(v,rhs.v);}
	vec8 operator^(const vec8 &rhs) const {return _mm256_xor_ps(v,rhs.v);}
	// Comparisons return a per-lane mask: all-1 bits where true, all-0 where false.
	vec8 operator==(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_EQ_OQ);}
	vec8 operator!=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_NEQ_OQ);}
	vec8 operator<(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LT_OQ);}
	vec8 operator<=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LE_OQ);}
	vec8 operator>(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GT_OQ);}
	vec8 operator>=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GE_OQ);}
	not_vec8 operator~(void) const {return not_vec8(v);}
	__m256 get(void) const {return v;}
	float *store(float *ptr) {
		_mm256_storeu_ps(ptr,v); // unaligned store; use _mm256_store_ps if ptr is 32-byte aligned
		return ptr;
	}
	float &operator[](int index) { return ((float *)&v)[index]; }
	float operator[](int index) const { return ((const float *)&v)[index]; }
	friend ostream &operator<<(ostream &o,const vec8 &y) {
		o<<y[0]; for (int i=1;i<8;i++) o<<" "<<y[i]; // print all 8 lanes
		return o;
	}
	friend vec8 operator&(const vec8 &lhs,const not_vec8 &rhs)
		{return _mm256_andnot_ps(rhs.get(),lhs.get());}
	friend vec8 operator&(const not_vec8 &lhs, const vec8 &rhs)
		{return _mm256_andnot_ps(lhs.get(),rhs.get());}
	// Per-lane select: where our mask lanes are true, take "then"; elsewhere, "else_part".
	vec8 if_then_else(const vec8 &then,const vec8 &else_part) const {
		return _mm256_or_ps( _mm256_and_ps(v,then.v),
		                     _mm256_andnot_ps(v, else_part.v) );
	}
};
vec8 sqrt(const vec8 &v) {
	return _mm256_sqrt_ps(v.get()); // full-precision square root
}
vec8 rsqrt(const vec8 &v) {
	return _mm256_rsqrt_ps(v.get()); // fast approximate 1/sqrt, good to about 12 bits
}
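Since rsqrt is only an estimate, one common trick is a single Newton-Raphson refinement step, which roughly doubles the number of correct bits. Here's a minimal sketch using the vec8 class above (the helper name rsqrt_refined is mine, not part of the lecture code):
vec8 rsqrt_refined(const vec8 &x) {
	vec8 r=rsqrt(x); // hardware estimate, about 12 bits
	return r*(vec8(1.5f)-vec8(0.5f)*x*r*r); // Newton step: r' = r*(1.5 - 0.5*x*r*r)
}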
/* Partial dot products of a & b: _mm256_hadd_ps only adds within each
   128-bit half, so elements 0-3 hold the dot product of lanes 0-3,
   and elements 4-7 hold the dot product of lanes 4-7. */
inline vec8 dot(const vec8 &a,const vec8 &b) {
	vec8 t=a*b;
	__m256 vt=_mm256_hadd_ps(t.get(),t.get());
	vt=_mm256_hadd_ps(vt,vt);
	return vt;
}
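Because the hadd trick above stops at the 128-bit lane boundary, getting one scalar dot product of all 8 lanes takes a little more work. Here's a sketch (the helper name dot8 is mine) using standard SSE3/AVX intrinsics:
inline float dot8(const vec8 &a,const vec8 &b) {
	vec8 t=a*b;
	__m128 lo=_mm256_castps256_ps128(t.get());  // lanes 0-3
	__m128 hi=_mm256_extractf128_ps(t.get(),1); // lanes 4-7
	__m128 s=_mm_add_ps(lo,hi); // four pairwise sums
	s=_mm_hadd_ps(s,s); // two sums
	s=_mm_hadd_ps(s,s); // grand total in every lane
	return _mm_cvtss_f32(s); // extract lane 0
}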
float data[8]={1.2,2.3,3.4,1.5,10.0,100.0,1000.0,10000.0};
vec8 a(data);
vec8 b(10.0);
vec8 c(0.0);
int foo(void) {
	c=(a<b).if_then_else(3.7,c);
	return 0;
}
(Try this in NetRun now!)
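To make the struct-of-arrays layout concrete, here's a sketch (my example arrays, not the lecture's) that computes 8 vector lengths at once from separate X, Y, and Z arrays:
float px[8]={0,1,2,3,4,5,6,7};
float py[8]={1,1,1,1,2,2,2,2};
float pz[8]={0,0,1,1,0,0,1,1};
float len[8];
int length8(void) {
	vec8 x(px), y(py), z(pz); // three 8-wide vectors: XXXXXXXX YYYYYYYY ZZZZZZZZ
	vec8 L=sqrt(x*x+y*y+z*z); // all 8 3D lengths: three multiplies, two adds, one sqrt
	L.store(len);
	return 0;
}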
Performance, as you might expect, is still a single clock cycle despite the
longer vectors--Intel just built wider floating-point hardware!
The whole list of intrinsics is in "avxintrin.h"
(/usr/lib/gcc/x86_64-linux-gnu/4.4/include/avxintrin.h on my
machine). Note that the compare functions still work in basically
the same way as SSE, returning a mask that you then AND and OR to keep
the values you want; the one difference is that all the comparisons are
folded into a single _mm256_cmp_ps intrinsic, selected by a _CMP_* constant.
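For example, here's that AND/OR mask idiom written out with raw intrinsics (a sketch; newer code can instead use the single _mm256_blendv_ps intrinsic for the final select):
// out[i] = (a[i]<b[i]) ? x[i] : y[i], with no branches
__m256 select_lt(__m256 a,__m256 b,__m256 x,__m256 y) {
	__m256 mask=_mm256_cmp_ps(a,b,_CMP_LT_OQ); // all-1 bits in lanes where a<b
	return _mm256_or_ps(_mm256_and_ps(mask,x),     // keep x lanes where true
	                    _mm256_andnot_ps(mask,y)); // keep y lanes where false
}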
Encoding 64-bit values in x86 Machine Code
Back in 2003, when AMD wanted to add 64-bit capability to their processors, they had two options:
- Throw away the old 32-bit machine code, and start from
scratch. This destroys backward compatibility--Intel has tried rebooting the
instruction set with Itanium, which has not yet been a huge success.
- Try to patch the old 32-bit machine code into 64-bit mode.
This is what AMD chose to do, and it's what all 64-bit x86 machines are
based on today.
Here's how 8-bit, 16-bit, 32-bit, and 64-bit registers are written in modern x86_64:
mov al,1234
mov ax,1234
mov eax,1234
mov rax,1234
(Try this in NetRun now!)
And the machine code:
0: b0 d2 mov al,0xd2
2: 66 b8 d2 04 mov ax,0x4d2
6: b8 d2 04 00 00 mov eax,0x4d2
b: 48 c7 c0 d2 04 00 00 mov rax,0x4d2
prefix   opcode   data          assembly        meaning
         b0       d2            mov al,0xd2     8-bit load
66       b8       d2 04         mov ax,0x4d2    load with a 16-bit prefix (0x66)
         b8       d2 04 00 00   mov eax,0x4d2   load with default size of 32 bits
48       c7 c0    d2 04 00 00   mov rax,0x4d2   sign-extended load using REX 64-bit prefix (0x48)
Everything but that last line is 100% ordinary 32-bit machine code. This incremental approach is beneficial for everybody:
- Compiler writers only need to make minor changes for 64-bit machines.
- Ancient 32-bit machine code still runs fine, and can be incrementally rewritten to use 64-bit as needed.
- The 64-bit instruction decode hardware can be almost entirely shared with the old 32-bit decode path.
There are downsides, however:
- You're stuck with a lot of the old junk from the 1970s. Note how 8-bit loads use far fewer bytes than 64-bit loads, even though 64-bit loads are probably more common in modern code.
- Some of the bit allocations are... strange.
Encoding New Registers in 64-bit x86 code
In addition to allowing 64-bit operations, AMD extended the x86 register set to a total of 16 general-purpose registers.
Regarding bit allocations, the REX
64-bit prefix byte always starts with "0x4". It then has four
additional bits: the high bit (REX.W) says we're in 64-bit mode, and the low
three bits (REX.R, REX.X, and REX.B) supply the missing fourth, high bit for
each of the instruction's register number fields!
0: 48 01 c0 add rax,rax
3: 48 01 c8 add rax,rcx
6: 48 01 d0 add rax,rdx
9: 48 01 d8 add rax,rbx
c: 48 01 e0 add rax,rsp
f: 48 01 e8 add rax,rbp
12: 48 01 f0 add rax,rsi
15: 48 01 f8 add rax,rdi
18: 4c 01 c0 add rax,r8
1b: 4c 01 c8 add rax,r9
1e: 4c 01 d0 add rax,r10
21: 4c 01 d8 add rax,r11
24: 4c 01 e0 add rax,r12
27: 4c 01 e8 add rax,r13
2a: 4c 01 f0 add rax,r14
2d: 4c 01 f8 add rax,r15
(Try this in NetRun now!)
Note how a "0x01 0xc0" reads from rax with a REX prefix of 0x48; but it reads from r8 with a REX prefix of 0x4c.
The fact that the low three bits of the source register are in the last
byte, while the high bit of the register number is in the first byte, is
a consequence of this strange patching. At the circuit level,
this sort of thing is fairly easy to handle--just pull out the bits
wherever they show up, and run all the wires to the destination--but it
looks weird and is a bit more work for the compiler.
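As a concrete example, here's "4c 01 c8" (add rax,r9 from the listing above) decoded bit by bit:
4c = 0100 1100:  "0100" marks a REX prefix; then W=1 (64-bit operation),
                 R=1 (high bit of the reg field), X=0, B=0
01 =             add r/m, reg
c8 = 11 001 000: mod=11 (register-to-register),
                 reg=001 plus R's high bit = 1001 = r9 (the source),
                 rm=000 plus B's high bit = 0000 = rax (the destination)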
It's still worth it!
Encoding AVX Instructions
In addition to the wider 256-bit ymm registers, AVX allows three-operand arithmetic.
Here's how SSE instructions are encoded:
7: 0f 58 c4 addps xmm0,xmm4
a: 0f 58 cc addps xmm1,xmm4
d: 0f 58 d4 addps xmm2,xmm4
10: 0f 58 dc addps xmm3,xmm4
14: 0f 58 c4 addps xmm0,xmm4
17: 0f 58 c5 addps xmm0,xmm5
1a: 0f 58 c6 addps xmm0,xmm6
1d: 0f 58 c7 addps xmm0,xmm7
(Try this in NetRun now!)
The 0x0f byte is the two-byte opcode escape prefix; with no size prefix in front of it, this is "ps" (packed single-precision) mode. 0x58 is the opcode. The source and destination are in the ModR/M byte at the end.
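For example, here's "0f 58 c4" (addps xmm0,xmm4 above) decoded:
0f 58 =          two-byte opcode for addps
c4 = 11 000 100: mod=11 (register-to-register), reg=000 = xmm0 (the destination),
                 rm=100 = xmm4 (the source)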
Here's how AVX instructions are encoded:
7: c4 c1 5c 58 c0 vaddps ymm0,ymm4,ymm8
c: c4 c1 5c 58 c8 vaddps ymm1,ymm4,ymm8
11: c4 c1 5c 58 d0 vaddps ymm2,ymm4,ymm8
16: c4 c1 5c 58 d8 vaddps ymm3,ymm4,ymm8
        (the destination changes: the reg field of the final ModR/M byte)
1c: c4 c1 5c 58 c0 vaddps ymm0,ymm4,ymm8
21: c4 c1 54 58 c0 vaddps ymm0,ymm5,ymm8
26: c4 c1 4c 58 c0 vaddps ymm0,ymm6,ymm8
2b: c4 c1 44 58 c0 vaddps ymm0,ymm7,ymm8
        (the middle source changes: the vvvv field of the third VEX byte)
31: c4 c1 5c 58 c0 vaddps ymm0,ymm4,ymm8
36: c4 c1 5c 58 c1 vaddps ymm0,ymm4,ymm9
3b: c4 c1 5c 58 c2 vaddps ymm0,ymm4,ymm10
40: c4 c1 5c 58 c3 vaddps ymm0,ymm4,ymm11
        (the last source changes: the r/m field of the final ModR/M byte)
(Try this in NetRun now!)
The 0xc4 is a VEX prefix. This is one of several dozen opcodes that AMD
bulldozed back in 2003, now rehabilitated for an entirely new use by Intel.
The next two bytes are VEX info: they select 256-bit operation, name the
middle register, and hold the high bits of all the register numbers.
0x58 is still the opcode. The last source and destination are in the
ModR/M byte at the end.
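As a worked example, here's "c4 c1 5c 58 c0" (vaddps ymm0,ymm4,ymm8 above) decoded; the field names follow Intel's VEX documentation:
c4 =             3-byte VEX prefix
c1 = 1100 0001:  R=0, X=0, B=1 (these three bits are stored inverted),
                 map=00001 (the old 0x0f opcode map)
5c = 0101 1100:  W=0; vvvv is stored inverted as 1011, so the middle register
                 is 0100 = ymm4; L=1 (256-bit); pp=00 (no size prefix: "ps")
58 =             still the addps opcode
c0 = 11 000 000: reg=000 = ymm0 (the destination),
                 rm=000 plus B's high bit = 1000 = ymm8 (the last source)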
Weird, but it works!