x86 Machine Code and Dynamic Translation

CS 301 Lecture, Dr. Lawlor, 2005/12/07

The assembly instructions we've spent the semester learning all translate into binary.

It's easy enough to write some simple x86 machine code to do simple stuff. For example, the machine code for the "ret" instruction is the single byte 0xc3. We can run this machine code from C or C++ by just typecasting a pointer to that data to a function pointer, and then calling the function pointer:
(Executable NetRun Link)

const static char bar_arr[]={
	0xc3 /* machine code for "ret" instruction */
};

typedef int (*bar_fn)(void); /* function pointer typedef */
bar_fn bar=(bar_fn)bar_arr; /* typecast array to function pointer */
return bar(); /* call the function pointer */

This works, but returns some random value (whatever's in eax).

The machine code for "mov eax,0xDEADBEEF" is the 5 bytes 0xb8,0xEF,0xBE,0xAD,0xDE (because 0xb8 is the "move 4-byte immediate into eax" opcode, and the constant is stored in little-endian format). We can run this machine code just like before, but with this new array:

const static char bar_arr[]={
	0xb8,0xEF,0xBE,0xAD,0xDE, /* mov eax,0xDEADBEEF */
	0xc3 /* ret */
};

Try it!

In general, you can find machine code opcodes by:

Reading the funk_emu docs. Of course, funk_emu is far simpler than full x86.
Reading the reference manual. The Intel manual lists the opcode for every form of every instruction in the big table of instructions.
Sandpile.org has a good opcode map ("one byte opcodes").
Just compile, then disassemble the resulting code!

x86 Instruction Encoding

An x86 instruction can consist of about a zillion parts. See chapter 2 of the Intel reference manual for an exhaustive list.

"Prefix" bytes, used occasionally to do weird stuff:

0xF0 locks the processor bus for the next instruction, which is used for "atomic" instructions in multithreaded code.
0x66 indicates the next operation is a 16-bit operation (the default is 32 bits).
AMD's x86_64 uses 0x40-0x4F as prefix bytes (called "REX" prefixes). These bytes let you do 64-bit integer arithmetic, and access 8 new registers (r8-r15). See the (cursory) x86_64 cheat sheet for details.

The opcode, which is usually just one or two bytes. Sometimes the opcode is totally self-contained, and includes the destination registers inside it; but usually the registers to use are listed in...
Most instructions take their input registers or memory from the most hideous aspect of the x86 architecture--the "ModR/M" byte. It's so nasty it gets its own section, below.
The SIB (Scale-Index-Base) byte, which is only present if the R/M field of the ModR/M byte equals 4, and is used for some fairly rare memory access modes. Consists of a Scale (multiply index register by 1, 2, 4, or 8), Index register, and Base register. See the SIB table for meaning of various SIB values. Remember the weird array-access instructions like "mov word [eax + 4*ebx],3"? They use the SIB byte to represent the memory address.
Any displacement (memory offset) bytes specified by the ModR/M byte (specifically, mod==1 has a 1-byte displacement; mod==2 has a 4-byte displacement here).
Any immediate data, like an immediate addition value, goes after everything else.

Freaky, eh? It's not so bad in practice (Usually).

"0xC3" is just the "return from subroutine" instruction's opcode. There's no ModR/M byte, because this opcode takes no parameters (it's just listed as "0xC3" in the reference manual).
"0xB8" is the "move immediate 4-byte value into register" opcode. It's listed as "0xB8+rd", so you just add the destination register to the opcode: 0xB8 moves into EAX, 0xB9 moves into ECX, 0xBA moves into EBX, all the way to 0xBF moves into EDI. The opcode is followed by the 4-byte value (little-endian!) to move into the register. We used both these instructions above.

ModR/M

This byte is just a bit field giving one register called "reg/opcode" (which is usually the destination register, and determines the column in the ModR/M table), a selector called "mod" (which indicates whether r/m is treated as a plain register or a memory address), and a final register called "r/m" (usually the source register, which selects the row of the ModR/M table). These are stored in a single byte as shown below.

mod	reg/opcode	r/m
2 bits, selects memory or register access mode: 0: memory at register r/m 1: memory at register r/m+byte offset 2: memory at register r/m + 32-bit offset 3: register r/m itself (not memory)	3 bits, usually a register number Usually the destination (but for dyslexic instructions, it's the source register; or occasionally even an extra opcode)	3 bits, register number Usually defines the data source (except for dyslexic instructions) Treated as a pointer for mod!=3, treated as an ordinary register for mod==3. If r/m==4, indicates the real memory source is a SIB byte.

ModR/M byte encoding. See "ModR/M" table for meaning of values. ModR/M is much easier to write in octal, not hex, since the 3-bit fields match exactly with octal digits. You can write octal in C/C++ with a leading 0, so "0301" is octal for 011 000 001 (binary) or 0xC1 (hex).

x86 register numbering is about as bizarre as you'd expect:

Register	eax	ecx	edx	ebx	esp	ebp	esi	edi
Number	0	1	2	3	4	5	6	7

For example, a ModR/M byte like:

ModR/M==0005 (in octal) is composed of mod==0 (so accessing memory), reg/opcode==0 (meaning the normal-destination register is eax, register 0), and r/m==5 (meaning the source is memory pointed to by ebp, register 5).
ModR/M==0312 (octal again) means mod==3 (so we're only touching registers, not memory), reg/opcode==1 (meaning destination is ecx, register 1), and r/m==2 (meaning the normal-source register is edx, register 2)

The saddest part of the ModR/M byte is that there are sometimes two forms of a given instruction. The "normal" versions read from r/m and write to reg/opcode. The "dyslexic" versions do the exact opposite--they read from reg/opcode, and write to r/m. Normal and dyslexic are my own terminology; you won't find these names elsewhere. Normal versions can hence read from memory or a register, but can only write their results to a register. Dyslexic versions are the other way around: they can't read from memory (only registers), but can write to registers or memory.

For example, there are two flavors of 32-bit add: "0x03" is normal, "0x01" is dyslexic. Hence:

For "0x03 0312", we add the "r/m"-listed register 2 (edx) to the "reg/opcode"-listed register 1 (ecx).
For "0x01 0312", we add the "reg/opcode"-listed register 1 (ecx) to the "r/m"-listed register 2 (edx).

This means "0x03 0312" and "0x01 0321" are two different but totally equivalent ways to write "add ecx,edx".

Machine Code	Assembly	C
0x03 0312	add ecx,edx	c+=d;
0x03 0012	add ecx,[edx]	c+=*d;
0x01 0012	add [edx],ecx	*d+=c;
0x01 0312	add edx,ecx	d+=c;

Normal and dyslexic versions of exist for all the ancient instructions: ADD (0x03 and 0x01), SUB (0x2B and 0x29), MOV (0x8B and 0x89), CMP (0x3B and 0x39), AND (0x23 and 0x21), etc.

Only normal versions exist for the SSE instructions (0x0F 0x50-0x5F). For example,

0x0f, 0x58, 0301

"0x0f 0x58" is the "add SSE packed single" opcode, "addps", or "_mm_add_ps". "0xc1" is the ModR/M byte, which from the "ModR/M" table gives mod==3 (a register-register operation), reg/opcode==0 (the destination register is xmm0), and R/M==1 (the source register is xmm1). So all together this is "addps xmm0,xmm1".

Why does anybody care?

People who write assemblers, compilers, and linkers need to know about machine code. But lots of other people do too--people who write interpreters.

Here's an example program where we want to do some simple operations depending on the input file, but they're really slow when interpreted normally. The solution is to write a version of your interpreter loop that spits out machine code to solve the problem, instead of just solving the problem, then call the machine code! If you can make your interpreter use registers (e.g., because you've only got a few variables), you'll get excellent performance--just as good as compiled assembly!

Lots of people do this technique, often called dynamic binary translation:

Java does it to Java bytecode. They call this "Just-In-Time compilation", or JIT.
Microsoft does it to Common Language Runtime bytecode.
Lots of companies do this to run code for one processor on another processor. Apple's done it twice: during the transition from 68000 to PowerPC, and now during the transition to x86.

Here's an example of how to do dynamic binary translation for SSE instructions. The bottom line is that the dynamic version is about 30x faster than the interpreted version!
(Executable NetRun Link, although perhaps not in Internet Explorer; you may have to paste this code into NetRun manually. The initial input data consists of two lines:
r0*=r1;
r0+=r2;
)

#include <xmmintrin.h>

__m128 arr[8]; /* SSE temporary values */

/****************** Assembly support **************/
typedef void (*bar_fn)(void);
extern "C" void bar_asm(void);
bar_fn bar=bar_asm; /* function pointer pointing to current bar routine */

/* GCC-assembly version of bar routine, bar_asm */
__asm__(
".section \".text\"\n"
".globl foo\n"
".type bar_asm,@function\n"
"bar_asm:\n"
"  addps %xmm1,%xmm0\n"
"  mulps %xmm2,%xmm0\n"
"  ret\n"
);

/* Copies values in "arr" into SSE registers, calls bar function pointer, and 
   copies the result in xmm0 back into arr[0]. */
int call_bar(void) {
	__asm__(
	"  mov $arr,%ecx\n" // ecx points to "arr" array
	"  movaps 0x00(%ecx),%xmm0\n" // Fill SSE registers with inputs
	"  movaps 0x10(%ecx),%xmm1\n"
	"  movaps 0x20(%ecx),%xmm2\n"
	"  movaps 0x30(%ecx),%xmm3\n"
	);
	bar(); 
	__asm__(
	"  movaps %xmm0,0x00(%ecx)\n" // Extract outputs
	);
	return 0;
}
/************** Interpreter support ************/
/* One instruction in the bar routine */
class bar_op {
public:
	int i,o; /* input and output registers, 0-7 */
	char op; /* operation to perform (+,-,*, etc.) */
	bar_op(const char *line) { /* HACK: assumes line looks like "r1+=r0..." */
		o=line[1]-'0';
		i=line[5]-'0';
		op=line[2];
	}
};
/* Stored operations to make up the bar routine */
std::vector<bar_op> ops;

/* Interpreted bar routine: switch-and-execute */
int call_interp_bar(void) {
	for (unsigned int i=0;i<ops.size();i++) {
		bar_op &b=ops[i];
		switch(b.op) {
		case '+': arr[b.o]=_mm_add_ps(arr[b.o],arr[b.i]); break;
		case '-': arr[b.o]=_mm_sub_ps(arr[b.o],arr[b.i]); break;
		case '*': arr[b.o]=_mm_mul_ps(arr[b.o],arr[b.i]); break;
		case '/': arr[b.o]=_mm_div_ps(arr[b.o],arr[b.i]); break;
		default: printf("Unknown bar operation!\n"); exit(1); 
		};
	}
	return 0;
}

/* Dynamically assembled x86 bar routine: prepare machine code to run bar */
bar_fn bar_dynamic(void) {
	typedef unsigned char byte;
	std::vector<byte> *out=new std::vector<byte>;  /* stores output machine code */
	for (unsigned int i=0;i<ops.size();i++) {
		out->push_back(0x0F); /* all SSE opcodes start with 0x0F */
		bar_op &b=ops[i];
		switch(b.op) {
		case '+': out->push_back(0x58); break; /* SSE opcode for corresponding operation */
		case '-': out->push_back(0x5C); break;
		case '*': out->push_back(0x59); break;
		case '/': out->push_back(0x5E); break;
		default: printf("Unknown bar operation!\n"); exit(1); 
		};
		/* Prepare ModR/M byte giving input and output registers */
		int mod=3; /* register-register mode (octal 03ds) */
		int destreg=b.o; /* output register */
		int srcreg=b.i; /* source register */
		int ModRM=(mod<<6)+(destreg<<3)+srcreg;
		out->push_back((byte)ModRM);
	}
	out->push_back(0xc3); /* return instruction at end of routine */
	return (bar_fn)(&(*out)[0]);  /* typecast bytes to function pointer */
}

int foo(void) {
	int i;
/* Read the user's input bar routine */
	char line[100];
	while (read_string(line)) {
		if (line[0]=='r')
			ops.push_back(bar_op(line));
	}
/* Set up the (hardcoded) inputs */
	const static float in[8][4]={
		{1.0,2.0,3.0,4.0},
		{10.0,20.0,30.0,40.0},
		{100.0,200.0,300.0,400.0}
	};
	float out[4];
	for (i=0;i<8;i++) arr[i]=_mm_loadu_ps(in[i]);
/* Dynamically assemble bar routine, and make a test run */
	bar=bar_dynamic();
	call_bar();
	_mm_storeu_ps(out,arr[0]);
	printf("%f %f %f %f\n",out[0],out[1],out[2],out[3]);

/* Do timings */
	printf("Bar takes about %.2f ns/call in dynamic machine code\n",
		time_function(call_bar)*1.0e9);
	bar=bar_asm;
	printf("Bar takes about %.2f ns/call in real asm\n",
		time_function(call_bar)*1.0e9);
	printf("Bar takes about %.2f ns/call interpreted\n",
		time_function(call_interp_bar)*1.0e9);
	return 0;
}