Assembly Language (at long last!)

CS 301 Lecture, Dr. Lawlor

OK, so in the last two weeks, we've looked at bits, bit operations, hexadecimal, tables, and finally machine code (in excruciating detail). Together, these are everything you need to know in order to understand assembly language. Assembly language is, simply, a line-by-line copy of machine code transcribed into human-readable words.

For example, we've been using the "move into register 0" instruction (0xb8) a lot. In an assembler, you can emit the same machine code with this little assembly language program:

mov eax,5
ret

(Try this in NetRun now!)

The assembler (NASM, in this case) will then spit out the following machine code:

00000000 <foo>:
   0:	b8 05 00 00 00       	mov    eax,0x5
   5:	c3                   	ret    
   6:	c3                   	ret

Note that the middle column contains the same bytes (0xb8 and so on) that we wrote by hand in HW2. (The duplicate "ret" instructions are because NetRun always puts in a spare "ret" instruction at the end, in case you forget.)
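To make the byte layout concrete, here's a minimal C++ sketch that builds the same six bytes the assembler emitted above. The helper name encode_mov_eax_ret is made up for illustration (it isn't part of NASM or NetRun), and it only handles this one instruction pair.

```cpp
#include <cstdint>
#include <vector>

// Sketch: build the machine code bytes for "mov eax,<value>" followed by "ret".
// encode_mov_eax_ret is a hypothetical helper name, not part of NASM or NetRun.
std::vector<unsigned char> encode_mov_eax_ret(std::uint32_t value) {
	std::vector<unsigned char> code;
	code.push_back(0xb8); // opcode: move a 32-bit constant into register 0 (eax)
	for (int i = 0; i < 4; i++) // the constant follows, low byte first (little-endian)
		code.push_back((unsigned char)((value >> (8 * i)) & 0xff));
	code.push_back(0xc3); // opcode: ret
	return code;
}
```

Calling encode_mov_eax_ret(5) should give back b8 05 00 00 00 c3, matching the middle column of the listing.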

The big advantage of using an assembler is that you don't need to remember all the funky arcane numbers, like 0xb8 or 0xc3 (these are "opcodes"). Instead, you remember a human-readable name like "mov" (short for "move"). This name is called an "opcode mnemonic", but it's always the first thing in a CPU "instruction", so I usually will say "the mov instruction" rather than "the instruction that the mov opcode mnemonic stands for".

There are several parts to this line:

"mov" is the "opcode", "instruction", or "mnemonic". It corresponds to the first byte (or so!) of the machine code, which tells the CPU what to do, in this case move a value from one place to another.
"eax" is the destination of the move, also known as the "destination operand". It's a register, register number 0, and it happens to be 32 bits wide, so this is a 32-bit move.
5 is the source of the moved data, also known as the "source operand". It's a constant, so you could use an expression (like "2+3*1") or a label (like "foo") instead.
A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly!
A newline. Unlike in C/C++/Java/C#/..., you MUST have a newline after each line of assembly.

Unlike C/C++, assembly is line-oriented, so the following WILL NOT WORK:

	mov eax,
	         5

Yup, line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text! I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:

	mov ecx, 5;  mov eax, 3;   Whoops!

It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
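Here's a rough C++ sketch of the comment rule, just to show what the assembler effectively sees on that line. The helper name strip_comment is hypothetical, and a real assembler presumably also has to cope with semicolons inside quoted strings, which this ignores.

```cpp
#include <string>

// Sketch of the comment rule: keep only the text before the first semicolon.
// strip_comment is a hypothetical helper, not how any real assembler is written.
std::string strip_comment(const std::string &line) {
	std::string::size_type semi = line.find(';');
	if (semi == std::string::npos) return line; // no comment on this line
	return line.substr(0, semi); // everything from the semicolon on is dropped
}
```

Feeding it "mov ecx, 5;  mov eax, 3;   Whoops!" leaves just "mov ecx, 5", so the second mov never reaches the CPU.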

Arithmetic In Assembly

Here's how you add two numbers in assembly:

Put the first number into a register
Put the second number into a register
Add the two registers
Return the result

Here's the C/C++ equivalent:

int a = 3;
int c = 7;
a += c;
return a;

And finally here's the assembly code:

mov eax, 3
mov ecx, 7
add eax, ecx
ret

(executable NetRun link)

Here are the x86 arithmetic instructions. Note that most of them take just two operands, the destination and the source; "not" and "idiv" take only one.

Opcode	Does	Example
add	+	add eax,ecx
sub	-	sub eax,ecx
imul	*	imul eax,ecx
idiv	/	idiv ecx <- Warning! Weirdness! (see below)
and	&	and eax,ecx
or	|	or eax,ecx
xor	^	xor eax,ecx
not	~	not eax

Be careful doing these! Assembly is *line* oriented, so you can't say anything like this:
    add edx,(sub eax,ecx)
but you can say:
    sub eax,ecx
    add edx,eax

In assembly, arithmetic has to be broken down into one operation at a time!
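As a quick illustration in C++, here's the nested form next to the one-operation-at-a-time form. The variables are named after registers purely for illustration, and the function names are made up.

```cpp
// Sketch: C++ allows nested expressions, assembly does not.
// The parameters are named after registers purely for illustration.
int nested(int eax, int ecx, int edx) {
	return edx + (eax - ecx); // one nested C++ expression
}

int one_step_at_a_time(int eax, int ecx, int edx) {
	eax -= ecx; // sub eax,ecx
	edx += eax; // add edx,eax
	return edx;
}
```

Both functions compute the same value; the second one just spells out each step the way the assembly has to.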

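Note that "idiv" is really weird. Basically, "idiv bot" divides eax by bot (the eax is hardcoded). But it also treats edx as the high bits above eax, so for a non-negative eax you have to set edx to zero first (a negative eax needs edx filled with copies of its sign bit instead, which the "cdq" instruction does).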

idiv bot
means:
top = eax+(edx<<32)
eax = top/bot
edx = top%bot

Here's an example:

mov eax,73; top
mov ecx,10; bottom
mov edx,0 ; high bits of top
idiv ecx ; divide eax by ecx
; now eax = 73/10, edx=73%10
(Try this in NetRun now!)

What a strange instruction!
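If it helps, here's a small C++ sketch of what "idiv bot" computes, following the top/bot pseudocode above. The Regs struct and simulate_idiv name are hypothetical, and the sketch doesn't model the crash the real instruction gives you when bot is zero or the quotient doesn't fit in 32 bits.

```cpp
#include <cstdint>

// Sketch of what "idiv bot" computes; names here are made up for illustration.
// Not modeled: the real instruction faults if bot is zero or the
// quotient does not fit in 32 bits.
struct Regs { std::int32_t eax, edx; };

Regs simulate_idiv(Regs r, std::int32_t bot) {
	// top = eax + (edx<<32): edx supplies the high 32 bits, eax the low 32 bits
	std::uint64_t bits = ((std::uint64_t)(std::uint32_t)r.edx << 32)
	                   | (std::uint64_t)(std::uint32_t)r.eax;
	std::int64_t top = (std::int64_t)bits;
	Regs out;
	out.eax = (std::int32_t)(top / bot); // quotient lands in eax
	out.edx = (std::int32_t)(top % bot); // remainder lands in edx
	return out;
}
```

With eax=73, edx=0, and bot=10 this gives eax=7 and edx=3. Leave a stray 1 in edx and the same divide gives eax=429496736 instead, which is why you clear edx first.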

Assembly Instructions

There are *lots* of instructions. A list of all possible x86 instructions can be found in:

Roger Jegerlehner's CodeTable, in categorized form.
Gary Burt's HTML table, just the basics, listed in alphabetical order, but maybe too terse.
sandpile.org has a list ordered by opcode number.
Giant Intel PDF reference manual (section 3.2), totally complete; but it's alphabetical, and it takes a long time to read a thousand pages!

The really important opcodes are listed in my cheat sheet. Most programs can be written with mov, the arithmetic instructions (add/sub/imul), the function call instructions (call/ret), the stack instructions (push/pop), and the conditional jumps (cmp/jmp/jl/je/jg/...). We'll learn about these over the next few weeks!

Assembly Registers

Registers are where you store data in assembly language--there aren't any variables, so everything has to either go in registers or somewhere in memory.

Here are the commonly-used x86 registers:

rax. This is the register that stores a function's return value.
rax, rcx, rdx, rsi, rdi. "Scratch" registers you can always overwrite with any value. Note that "rbx" is NOT scratch, for some odd historical reason.
rdi, rsi, rdx, rcx, ... In 64-bit mode, these registers contain function arguments, in left-to-right order.
rsp, rbp. Registers used to run the stack. Be careful with these!

"Scratch" registers you're allowed to overwrite and use for anything you want. "Preserved" registers serve some important purpose somewhere else, so you have to put them back ("save" the register) if you use them--for now, just leave them alone!

Each of these registers is available in several sizes:

rax is the 64-bit, "long" size register. It was added in 2003. The registers added with 64-bit mode are in the 64-bit column of the table below.
eax is the 32-bit, "int" size register. It was added in 1985. I'm in the habit of using this register size, since it also works in 32-bit mode, although I should probably use the longer rax registers for everything.
ax is the 16-bit, "short" size register. It was added in 1979.
al and ah are the 8-bit, "char" size parts of the register. al is the low 8 bits (like ax&0xff), ah is the high 8 bits (like ax>>8). They date all the way back to 1972.

Curiously, you can write a 64-bit value into rax, then read off the low 32 bits from eax, or the low 16 bits from ax, or the low 8 bits from al--it's just one register, but they keep on extending it!

rax: 64-bit
eax: 32-bit (the low half of rax)
ax: 16-bit (the low half of eax)

For example,

mov rcx,0xf00d00d2beefc03; load 64-bit constant
mov eax,ecx; pull out low 32 bits
ret

(Try this in NetRun now!)
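The same idea can be sketched in C++ with casts and masks. The helper names below are made up for illustration; they just pull the low bits out of a 64-bit value the way the smaller register names do.

```cpp
#include <cstdint>

// Sketch: the smaller register names are views of the low bits of one 64-bit register.
// These helper names are hypothetical, chosen to match the register names.
std::uint32_t eax_of(std::uint64_t rax) { return (std::uint32_t)rax; }               // low 32 bits
std::uint16_t ax_of(std::uint64_t rax)  { return (std::uint16_t)rax; }               // low 16 bits
std::uint8_t  al_of(std::uint64_t rax)  { return (std::uint8_t)(rax & 0xff); }       // low 8 bits, like ax&0xff
std::uint8_t  ah_of(std::uint64_t rax)  { return (std::uint8_t)((rax >> 8) & 0xff); } // next 8 bits, like ax>>8
```

For the constant used above, 0xf00d00d2beefc03, the low 32 bits are 0x2beefc03, the low 16 bits are 0xfc03, al is 0x03, and ah is 0xfc.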

Here's the full list of x86 registers. The registers added with 64-bit mode are in the 64-bit column.

Notes	64-bit long	32-bit int	16-bit short	8-bit char
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too.	rax	eax	ax	ah and al
Typical scratch register. Some instructions use it as a counter (such as SAL or REP).	rcx	ecx	cx	ch and cl
Scratch register. Multiply instructions put the high bits of the result here.	rdx	edx	dx	dh and dl
Preserved register: don't use it without saving it!	rbx	ebx	bx	bh and bl
The stack pointer. Points to the top of the stack (wait for the details!)	rsp	esp	sp	spl
Preserved register. Sometimes used to store the old value of the stack pointer, or the "base".	rbp	ebp	bp	bpl
Scratch register. Also used to pass function argument #2 in 64-bit mode (on Linux).	rsi	esi	si	sil
Scratch register. Function argument #1.	rdi	edi	di	dil
Scratch register. These were added in 64-bit mode, so the names are slightly different.	r8	r8d	r8w	r8b
Scratch register.	r9	r9d	r9w	r9b
Scratch register.	r10	r10d	r10w	r10b
Scratch register.	r11	r11d	r11w	r11b
Preserved register.	r12	r12d	r12w	r12b
Preserved register.	r13	r13d	r13w	r13b
Preserved register.	r14	r14d	r14w	r14b
Preserved register.	r15	r15d	r15w	r15b

Disassembly at Runtime

You can just typecast a function pointer over to an unsigned char pointer, and play with the bytes of a function's machine code:

#include <iostream>

int bar(void) { /* some random function: we look at bar's machine code below! */
	return 2;
}

int foo(void) {
	/* view bar's machine code as raw bytes */
	const unsigned char *data=(const unsigned char *)(&bar);
	for (int i=0;i<10;i++) /* print out the first 10 bytes of the bar function */
		std::cout<<"0x"<<std::hex<<(int)data[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

The first six bytes this prints out are *exactly* the machine code we would write:

0xb8    Opcode for "mov eax,"
0x2     Little-endian constant to load into eax (4 bytes)
0x0
0x0
0x0
0xc3    Opcode for "ret"

You can also click "Disassemble" in NetRun to see the assembly and machine code the compiler or assembler produces.
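Going the other direction, here's a tiny C++ sketch that reads the constant back out of those six bytes. decode_mov_eax is a hypothetical helper, and it's nowhere near a real disassembler: it recognizes only the "mov eax,constant" then "ret" pattern shown above and gives up on anything else.

```cpp
#include <cstdint>

// Sketch: decode the 6-byte "mov eax,<constant>; ret" pattern.
// decode_mov_eax is a hypothetical helper; it returns false for any other
// byte pattern rather than trying to be a real disassembler.
bool decode_mov_eax(const unsigned char *code, std::uint32_t *value) {
	if (code[0] != 0xb8) return false; // not the "mov eax," opcode
	std::uint32_t v = 0;
	for (int i = 0; i < 4; i++) // reassemble the little-endian constant, low byte first
		v |= (std::uint32_t)code[1 + i] << (8 * i);
	if (code[5] != 0xc3) return false; // not followed by "ret"
	*value = v;
	return true;
}
```

Given the bytes 0xb8 0x2 0x0 0x0 0x0 0xc3 it reports the constant 2, which is the value bar returns.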

Hierarchy of "REALLY"

Some people think Facebook is REALLY interesting.
Facebook's web developer says no, REALLY what's happening is my Javascript is moving some DOM XML around.
The main Firefox backend guy says no, REALLY what's happening is my C++ code is simulating your Javascript.
The lead compiler developer says no, REALLY what's happening is my compiler is generating some assembly code from your C++.
An assembly programmer says no, REALLY what's happening is my assembly code is moving some registers around.
A machine code programmer says no, REALLY what's happening is 0xb8 0x01 0x00 0x00 0x00 0xc3.
A high-level CPU designer says no, REALLY what's happening is your instructions are lighting up various circuits on my chip.
A low-level CPU designer says no, REALLY what's happening is your circuits are lighting up my transistors.
A solid-state physicist says no, REALLY what's happening is your transistors are changing the orbital states on my silicon atoms.
A quantum physicist says no, REALLY what's happening is your silicon atoms are just affecting the wavefunctions of my electrons.
An M-theory physicist says no, REALLY an electron is just an illusion caused by the 5-dimensional vibrations of an 11-dimensional membrane.

Thus, the notion that Facebook is interesting is REALLY an illusion.