Machine Code

CS 301 Lecture, Dr. Lawlor

"Machine code" is a block of binary data that the CPU interprets as a long table of commands. You'll be writing machine code for the homeworks. There's nothing magical about machine code, and in fact you can easily write a little program that walks through and executes a table of our own newly-defined "machine code":

const unsigned char table[]={
	0, /*yo! */
	1, /*print x */
	1, /*print x */
	0, /*yo! */
	2 /* exit */
};

int foo(void) {
	int i=0; /* our location in the table */
	while (1) /* always keep looping through the table */
	switch (table[i++]) { /* look at the next thing in the table */
	case 0: cout<<"Yo!\n"; break; /* single-Yo instruction */
	case 1: cout<<"x\n"; break; /* single-X instruction */
	case 2: return 0; /* stop looping through the table */
	default:
		cout<<"Unrecognized table entry!\n";
		return -999;
	}
}

(Try this in NetRun now!)

Rather than having two identical "print x" commands, we can make the "x" command repeatable, by adding a repetition count.

const unsigned char table[]={
	0, /*yo! */
	1, /*print x... */
	   2, /*       ... two times */
	0, /*yo! */
	2 /* exit */
};

int foo(void) {
	int i=0; /* our location in the table */
	while (1) /* always keep looping through the table */
	switch (table[i++]) { /* look at the next thing in the table */
	case 0: cout<<"Yo!\n"; break; /* single-Yo instruction */
	case 1: { /* multi-x instruction */
		int count=table[i++]; /* next byte in table is the x repeat count */
		for (int repeat=0;repeat<count;repeat++)
			cout<<"x\n";
		break;
	}
	case 2: return 0; /* stop looping through the table */
	default:
		cout<<"Unrecognized table entry!\n";
		return -999;
	}
}

(Try this in NetRun now!)

Note that 0, a "Yo!" instruction, stands alone in the table, while 1, a "multi-x" instruction, takes two bytes, because the second byte is the x count. The indented "2" is not an exit command; it's the repetition count for the 1 instruction!
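To make the "some instructions are longer than others" idea concrete, here is a minimal sketch that walks the same toy table without executing it and just counts instructions. The names toy_table, instruction_length, and count_instructions are made up for this illustration; foo is the usual NetRun-style entry point.

```cpp
#include <iostream>

/* Same toy table as above: opcode 1 is followed by a repeat count. */
const unsigned char toy_table[]={
	0, /* yo! */
	1, /* print x... */
	   2, /*       ... two times */
	0, /* yo! */
	2 /* exit */
};

/* Hypothetical helper: how many table bytes does this opcode occupy?
   Returns 0 for an unrecognized opcode. */
int instruction_length(unsigned char opcode) {
	switch (opcode) {
	case 0: return 1; /* yo: opcode only */
	case 1: return 2; /* multi-x: opcode plus one count byte */
	case 2: return 1; /* exit: opcode only */
	default: return 0;
	}
}

/* Hypothetical helper: step through the table one instruction at a time
   (without running anything) and count the instructions.
   Returns -1 on a bad opcode, or if an operand runs off the end. */
int count_instructions(const unsigned char *code,int size) {
	int count=0;
	int i=0; /* our location in the table */
	while (i<size) {
		int len=instruction_length(code[i]);
		if (len==0 || i+len>size) return -1;
		count++;
		i+=len; /* skip the opcode and any operand bytes */
	}
	return count;
}

int foo(void) {
	int n=count_instructions(toy_table,(int)sizeof(toy_table));
	std::cout<<n<<" instructions in "<<sizeof(toy_table)<<" bytes\n";
	return n;
}
```

The walker never looks at the count byte as an opcode, because the length of the previous instruction told it to skip over that byte. Start decoding at the wrong byte and the same table means something different.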

x86 Machine Code

You can of course use any numbers you like for the table values. Here's the same exact idea, but with x86-compatible instruction numbers:

const unsigned char table[]={
	0xb0, /*set x = ... */
	7, /*         ... this byte */
	0xc3 /* exit */
};

int foo(void) {
	int x=0; /* our "register" (temporary storage, and return value) */
	int i=0; /* our location in the table */
	while (1) /* always keep looping through the table */
	switch (table[i++]) { /* look at the next thing in the table */
	case 0xb0: { /* set-x instruction */
		x=table[i++]; /* next byte is the new value for x */
		break;
	}
	case 0xc3: return x; /* stop looping through the table */
	default:
		cout<<"Illegal instruction!\n";
		return -999;
	}
}

(Try this in NetRun now!)

Our table just has (8-bit) bytes in it, but sometimes we want to be able to set an entire (32-bit) int. The standard x86 solution is to split the integer into four bytes and store them lowest first ("little-endian" order): first the low byte (the lowest-valued byte, that is, the last two hex digits), then the not-so-low byte, the not-so-high byte, and finally the highest byte, like so:

const unsigned char table[]={
	0xb8, /* set x =... */
	4, /* low byte is 4 (that is, 0x04) */
	1, /* next byte is 1 (that is, 0x01) */
	0, /* highest two bytes are both zero */
	0,
	0xc3 /* return that */
};

int foo(void) {
	int x=0; /* register */
	int i=0;
	while (1) switch (table[i++]) {
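	case 0xb8: // bitwise magic!  Reassemble x from the next 4 bytes in the table.
		// The cast keeps a top byte of 0x80 or more from overflowing a signed int when shifted.
		x=table[i]|(table[i+1]<<8)|(table[i+2]<<16)|((unsigned int)table[i+3]<<24);
		i+=4;
		break;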
	case 0xc3: return x;
	default:
		cout<<"Illegal instruction!\n";
		return -999;
	}
}

(Try this in NetRun now!)

This returns "0x104". The "0x04" is the low byte. "0x01" is the next higher byte, and all higher bytes are zero.
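Going the other direction is the same bitwise trick in reverse. The sketch below splits an integer into four bytes, lowest byte first, and then reassembles it. The helper names store_le32 and load_le32 are invented for this example, and it assumes unsigned int is 32 bits, as it is on x86.

```cpp
#include <iostream>

/* Hypothetical helper: split value into four bytes, lowest byte first.
   Assumes unsigned int is 32 bits. */
void store_le32(unsigned int value,unsigned char *out) {
	out[0]=value&0xff;       /* low byte */
	out[1]=(value>>8)&0xff;  /* not-so-low byte */
	out[2]=(value>>16)&0xff; /* not-so-high byte */
	out[3]=(value>>24)&0xff; /* highest byte */
}

/* Hypothetical helper: reassemble the integer from four bytes, lowest first. */
unsigned int load_le32(const unsigned char *in) {
	return (unsigned int)in[0]
		|((unsigned int)in[1]<<8)
		|((unsigned int)in[2]<<16)
		|((unsigned int)in[3]<<24);
}

int foo(void) {
	unsigned char bytes[4];
	store_le32(0x104,bytes);
	for (int i=0;i<4;i++) /* print the bytes in table order */
		std::cout<<"byte "<<i<<" = 0x"<<std::hex<<(int)bytes[i]<<"\n";
	return load_le32(bytes);
}
```

For 0x104 the four bytes come out as 4, 1, 0, 0, which is exactly the order they appear after the 0xb8 byte in the table.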

What's amazing is that I can tell the CPU to execute the six table bytes above (0xb8, 4, 1, 0, 0, 0xc3), and it acts like a function that returns 0x104: the CPU is just table-driven hardware! For example, the byte "0xc3" tells an x86 CPU to return from the current function. The byte "0xb0" is followed by a one-byte parameter, which gets loaded into the low byte of the return value. So this code actually works! (It does need a system that lets you execute bytes stored in a data array, which many modern systems block by default.)
(Don't worry about the hideous C++ syntax for function pointer stuff.)

const unsigned char commands[]={ /* unsigned: 0xb0 and 0xc3 don't fit in a signed char */
	0xb0,73, /* load a value to return */
	0xc3 /* return from the current function */
};
int foo(void) {
	typedef int (*fnptr)(void); // pointer to a function returning an int
	fnptr f=(fnptr)commands; // typecast the command array to a function
	return f(); // call the new function!
}

(Try this in NetRun now!)

Terminology & Disassembly

These raw byte commands that the CPU executes are called "machine code". "Assembly language" is just a human-readable translation of machine code. An "assembler", like NASM, reads assembly language and writes executable machine code. A "disassembler", like PE Explorer or IDA Pro (for Windows), or objdump (for Linux or Mac OS X), reads an executable and writes assembly language (in NetRun, check the "Disassemble" box under "Options").
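A disassembler is really just another table walker: instead of executing each command, it prints the command's name. Here is a toy sketch that handles only the three x86 opcodes used in this lecture (0xb0 is "mov al, byte", 0xb8 is "mov eax, int", and 0xc3 is "ret"). The function name disassemble is made up for this example, and a real disassembler like objdump handles vastly more cases.

```cpp
#include <iostream>
#include <sstream>
#include <string>

/* Hypothetical toy disassembler: translate machine code bytes into
   assembly language text.  Only knows 0xb0, 0xb8, and 0xc3. */
std::string disassemble(const unsigned char *code,int size) {
	std::ostringstream out;
	out<<std::hex; /* print all operands in hex */
	int i=0; /* our location in the machine code */
	while (i<size) {
		switch (code[i]) {
		case 0xb0: /* one-byte operand follows */
			if (i+1>=size) return out.str()+"(truncated)\n";
			out<<"mov al, 0x"<<(int)code[i+1]<<"\n";
			i+=2;
			break;
		case 0xb8: { /* four-byte operand follows, lowest byte first */
			if (i+4>=size) return out.str()+"(truncated)\n";
			unsigned int v=(unsigned int)code[i+1]
				|((unsigned int)code[i+2]<<8)
				|((unsigned int)code[i+3]<<16)
				|((unsigned int)code[i+4]<<24);
			out<<"mov eax, 0x"<<v<<"\n";
			i+=5;
			break;
		}
		case 0xc3: /* no operand */
			out<<"ret\n";
			i+=1;
			break;
		default: /* give up: we can't know how long an unknown instruction is */
			out<<"(unknown byte 0x"<<(int)code[i]<<")\n";
			return out.str();
		}
	}
	return out.str();
}

int foo(void) {
	const unsigned char code[]={0xb8,4,1,0,0,0xc3};
	std::cout<<disassemble(code,(int)sizeof(code));
	return 0;
}
```

An assembler does the reverse lookup, turning text like "ret" back into the byte 0xc3.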

If you just want to look at the machine code inside a function, you can typecast a pointer to the function and start printing bytes of machine code:

int bar(void) { /* some random function: we look at bar's machine code below! */
	return 17;
}

int foo(void) {
	const unsigned char *data=(unsigned char *)(&bar);
	for (int i=0;i<10;i++) /* print out the bytes of the bar function */
		std::cout<<"0x"<<std::hex<<(int)data[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

This prints out the same bytes inside bar that you see in the "Disassembler" tab. Which instructions is the compiler using?