Assembly Language (finally!)

CS 301 Lecture, Dr. Lawlor

OK, so in the last two weeks, we've looked at bits, bit operations, hexadecimal, tables, and finally machine code (in excruciating detail). Together, these are everything you need to know in order to understand assembly language. Assembly language is, simply, a line-by-line copy of machine code transcribed into human-readable words.

For example, we've been using the "move into register 0" instruction (0xb8) a lot. In an assembler, you can emit the same machine code with this little assembly language program:

mov eax,5
ret

(Try this in NetRun now!)

The assembler (NASM, in this case) will then spit out the following machine code:

00000000 <foo>:
   0:	b8 05 00 00 00       	mov    eax,0x5
   5:	c3                   	ret    
   6:	c3                   	ret

Note that the middle column contains the same bytes (0xb8 and so on) that we wrote by hand in HW2. (The duplicate "ret" instructions appear because NetRun always appends a spare "ret" at the end, in case you forget one.)
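To make the byte layout concrete, here's a small C sketch of what the assembler is doing for this one program. The function name encode_mov_eax_ret is made up for illustration; the only facts it relies on are the ones in the listing above: 0xb8 followed by the 32-bit constant (low byte first), then 0xc3.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (not part of NASM or NetRun): writes the machine
   code for "mov eax,value" followed by "ret" into out[], which must have
   room for 6 bytes. Returns the number of bytes written. */
size_t encode_mov_eax_ret(uint32_t value, uint8_t out[6])
{
    size_t n = 0;
    out[n++] = 0xb8;                  /* opcode: mov eax, 32-bit constant */
    for (int i = 0; i < 4; i++)       /* the constant, lowest byte first */
        out[n++] = (uint8_t)(value >> (8 * i));
    out[n++] = 0xc3;                  /* opcode: ret */
    return n;
}
```

Calling it with the value 5 fills the buffer with b8 05 00 00 00 c3, the same bytes as the listing.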

The big advantage of using an assembler is that you don't need to remember all the funky arcane numbers, like 0xb8 or 0xc3 (these are "opcodes"). Instead, you remember a human-readable name like "mov" (short for "move"). This name is called an "opcode mnemonic", but it's always the first thing in a CPU "instruction", so I usually will say "the mov instruction" rather than "the instruction that the mov opcode mnemonic stands for".

There are several parts to the line "mov eax,5":

"mov" is the "opcode", "instruction", or "mnemonic". It corresponds to the first byte (or so!) of the machine code, and it tells the CPU what to do, in this case move a value from one place to another.
"eax" is the destination of the move, also known as the "destination operand". It's a register, register number 0, and it happens to be 32 bits wide, so this is a 32-bit move.
5 is the source of the moved data, also known as the "source operand". It's a constant, so you could use an expression (like "2+3*1") or a label (like "foo") instead.
A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly!
A newline. Unlike in C/C++/Java/C#/..., you MUST have a newline after each line of assembly.

Unlike C/C++, assembly is line-oriented, so the following WILL NOT WORK:

	mov eax,
	         5

Yup, line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text!

Instructions

A list of all possible x86 instructions can be found in:

Roger Jegerlehner's CodeTable, in categorized form.
Gary Burt's HTML table, just the basics, listed in alphabetical order, but maybe too terse.
Giant Intel PDF reference manual (section 3.2), totally complete, but nearly impossible to understand.

The really important opcodes are listed in my cheat sheet. Most programs can be written with mov, the arithmetic instructions (add/sub/mul), the function call instructions (call/ret), the stack instructions (push/pop), and the compare and jump instructions (cmp/jmp/jl/je/jg/...). We'll learn about these over the next few weeks!

Registers

Here are the commonly-used x86 registers:

rax. This is the register that stores a function's return value.
rax, rcx, rdx, rsi, rdi. "Scratch" registers you can always overwrite with any value. Note that "rbx" is NOT scratch!
rdi, rsi, rdx, rcx, ... In 64-bit mode, these registers contain function arguments, in left-to-right order.
rsp, rbp. Registers used to run the stack. Be careful with these!
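Here's a tiny C function annotated with where each value lives, as a rough sketch. The function add3 is made up for illustration, and the register assignments in the comments assume the 64-bit calling convention used on Linux and macOS (64-bit Windows passes arguments in different registers).

```c
/* Hypothetical example function. Assuming the 64-bit Linux/macOS
   calling convention, the arguments arrive left to right:
     a in rdi, b in rsi, c in rdx
   and the return value goes back to the caller in rax. */
long add3(long a, long b, long c)
{
    return a + b + c;
}
```

So a call like add3(1,2,3) would load 1 into rdi, 2 into rsi, and 3 into rdx before the "call" instruction, then find the sum in rax afterward.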

Each of these registers is available in several sizes:

rax is the 64-bit, "long" size register. It was added in 2003.
eax is the 32-bit, "int" size register. It was added in 1985.
ax is the 16-bit, "short" size register. It was added in 1978.
al and ah are the 8-bit, "char" size parts of the register. They date back to 1971.

Curiously, you can write a 64-bit value into rax, then read off the low 32 bits from eax, or the low 16 bits from ax--it's just one register, but they keep on extending it!

rax: 64-bit
    eax: 32-bit (the low half of rax)
        ax: 16-bit (the low half of eax)

For example,

mov rcx,0xf00d00d2beefc03; load 64-bit constant
mov eax,ecx; pull out low 32 bits
ret

(Try this in NetRun now!)
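If it helps, you can think of the smaller register names as casts applied to one 64-bit value. Here's a minimal C sketch of that idea; the function names (low32, low16, low8, high8) are made up for illustration and just model reading eax, ax, al, and ah.

```c
#include <stdint.h>

/* Each function models reading one named piece of a 64-bit register. */
uint32_t low32(uint64_t r) { return (uint32_t)r; }         /* like eax: bits 0..31 */
uint16_t low16(uint64_t r) { return (uint16_t)r; }         /* like ax:  bits 0..15 */
uint8_t  low8(uint64_t r)  { return (uint8_t)r; }          /* like al:  bits 0..7  */
uint8_t  high8(uint64_t r) { return (uint8_t)(r >> 8); }   /* like ah:  bits 8..15 */
```

With the constant from the example above, low32(0xf00d00d2beefc03) gives 0x2beefc03, low16 gives 0xfc03, low8 gives 0x03, and high8 gives 0xfc.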

Arithmetic In Assembly

Here's how you add two numbers in assembly:

Put the first number into a register
Put the second number into a register
Add the two registers
Return the result

Here's the C/C++ equivalent:

int a = 3;
int c = 7;
a += c;
return a;

And finally here's the assembly code:

mov eax, 3
mov ecx, 7
add eax, ecx
ret

(executable NetRun link)

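Here are the x86 arithmetic instructions. Note that none of them takes three operands: most take just two, the destination and the source, and the result overwrites the destination. The exceptions are "not", which takes one operand, and "idiv", which takes only the divisor.

Opcode	Does	Example
add	+	add eax,ecx
sub	-	sub eax,ecx
imul	*	imul eax,ecx
idiv	/	idiv ecx (divides edx:eax by ecx; quotient goes in eax, remainder in edx)
and	&	and eax,ecx
or	|	or eax,ecx
xor	^	xor eax,ecx
not	~	not eax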
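Division is the odd one out: the idiv instruction takes only the divisor, and divides the 64-bit value held in the register pair edx:eax. Here's a rough C model of the usual 32-bit sequence "cdq" then "idiv". The struct and function names are made up for illustration, and the model assumes the divisor is nonzero and the quotient fits in 32 bits (the real instruction faults otherwise).

```c
#include <stdint.h>

/* Hypothetical model of the two registers that idiv writes. */
typedef struct {
    int32_t eax;  /* quotient */
    int32_t edx;  /* remainder */
} regs_t;

/* Models "cdq" followed by "idiv src".
   Assumes src != 0 and that the quotient fits in 32 bits. */
regs_t model_cdq_idiv(int32_t eax, int32_t src)
{
    int64_t dividend = eax;              /* cdq: sign-extend eax into edx:eax */
    regs_t r;
    r.eax = (int32_t)(dividend / src);   /* quotient, rounded toward zero */
    r.edx = (int32_t)(dividend % src);   /* remainder, same sign as the dividend */
    return r;
}
```

For example, dividing 17 by 5 leaves 3 in eax and 2 in edx; dividing -17 by 5 leaves -3 and -2.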

Be careful doing these! Assembly is *line* oriented, so you can't say:
    add (sub eax,ecx),edx
but you can say:
    sub eax,ecx
    add eax,edx

In assembly, arithmetic has to be broken down into one operation at a time!
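One way to get used to this is to write your C one operation per statement, so each line maps onto one instruction. Here's a small sketch of (a - c) + d done that way; the function name sub_then_add and the local variables named after registers are just for illustration.

```c
/* Computes (a - c) + d one operation at a time, the way the
   two-instruction sequence above does. The locals are named after
   the registers they stand in for. */
int sub_then_add(int a, int c, int d)
{
    int eax = a, ecx = c, edx = d;  /* pretend the inputs are already in registers */
    eax -= ecx;                     /* sub eax,ecx */
    eax += edx;                     /* add eax,edx */
    return eax;                     /* the result is left in eax */
}
```

Calling sub_then_add(10,3,5) gives 12: first 10-3 is 7, then 7+5 is 12.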