Assembly Language (at long last!)

CS 301 Lecture, Dr. Lawlor

OK, so in the last two weeks, we've looked at bits, bit operations, hexadecimal, tables, and finally machine code (in excruciating detail).  Together, these are everything you need to know in order to understand assembly language.  Assembly language is, simply, a line-by-line copy of machine code transcribed into human-readable words.

For example, we've been using the "move into register 0" instruction (0xb8) a lot.  In an assembler, you can emit the same machine code with this little assembly language program:
mov eax,5
ret

(Try this in NetRun now!)

The assembler (NASM, in this case) will then spit out the following machine code:
00000000 <foo>:
0: b8 05 00 00 00 mov eax,0x5
5: c3 ret
6: c3 ret
Note that the middle column contains the same bytes (0xb8 and so on) that we wrote by hand in HW2.  (The duplicate "ret" instructions are because NetRun always puts in a spare "ret" instruction at the end, in case you forget.)
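To make the link between the assembly and the bytes concrete, here's a rough C++ sketch of what the assembler is doing for this one tiny program.  The function name encode_mov_eax_ret is made up for illustration (it's not part of NASM or NetRun), and it only handles "mov eax,constant" followed by "ret":

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of NASM or NetRun): builds the machine code
// for "mov eax,value" followed by "ret".
std::vector<std::uint8_t> encode_mov_eax_ret(std::uint32_t value) {
    std::vector<std::uint8_t> code;
    code.push_back(0xb8);                 // opcode: move a 32-bit constant into eax
    for (int i = 0; i < 4; i++)           // the constant, low byte first (little-endian)
        code.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xff));
    code.push_back(0xc3);                 // opcode: ret
    assert(code.size() == 6);             // 1 opcode + 4 constant bytes + 1 ret
    return code;
}
```

Calling encode_mov_eax_ret(5) gives the bytes b8 05 00 00 00 c3, the same bytes as in the listing above (minus NetRun's spare "ret").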

The big advantage of using an assembler is that you don't need to remember all the funky arcane numbers, like 0xb8 or 0xc3 (these are "opcodes").  Instead, you remember a human-readable name like "mov" (short for "move").  This name is called an "opcode mnemonic", but it's always the first thing in a CPU "instruction", so I usually will say "the mov instruction" rather than "the instruction that the mov opcode mnemonic stands for".

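There are several parts to a line like "mov eax,5": the opcode mnemonic ("mov"), the destination register ("eax"), the source (here the constant 5), and optionally a comment, which starts at a semicolon and runs to the end of the line.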
Unlike C/C++, assembly is line-oriented, so the following WILL NOT WORK:
	mov eax,
5
Yup, line-oriented stuff is indeed annoying.  Be careful that your editor doesn't mistakenly add newlines to long lines of text!  I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:
	mov ecx, 5;  mov eax, 3;   Whoops!
It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
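Here's a small C++ sketch of why that happens.  It assumes the assembler simply ignores everything from the first semicolon to the end of the line; the helper name strip_comment is invented for this example, and a real assembler also has to cope with things like semicolons inside quoted strings, which this ignores.

```cpp
#include <string>

// Hypothetical helper: returns the part of a source line the assembler
// actually looks at, i.e. everything before the first semicolon.
std::string strip_comment(const std::string &line) {
    std::string::size_type semi = line.find(';');
    if (semi == std::string::npos) return line;  // no comment on this line
    return line.substr(0, semi);                 // drop the semicolon and the rest
}
```

Feeding it "mov ecx, 5;  mov eax, 3;   Whoops!" leaves just "mov ecx, 5": the second mov never reaches the assembler.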

Arithmetic In Assembly

Here's how you add two numbers in assembly: put each number into a register, add the two registers, and return the result.
Here's the C/C++ equivalent:
int a = 3;
int c = 7;
a += c;
return a;
And finally here's the assembly code:
mov eax, 3
mov ecx, 7
add eax, ecx
ret
(executable NetRun link)

Here are the x86 arithmetic instructions.  Note that most of them take just two registers, the destination and the source; "not" takes only one register, and "idiv" is a special case (see below).
Opcode  Does  Example
add     +     add eax,ecx
sub     -     sub eax,ecx
imul    *     imul eax,ecx
idiv    /     idiv ecx    <- Warning!  Weirdness!  (see below)
and     &     and eax,ecx
or      |     or eax,ecx
xor     ^     xor eax,ecx
not     ~     not eax

Be careful doing these!  Assembly is *line* oriented, so you can't say anything like this:
    add edx,(sub eax,ecx)
but you can say:
    sub eax,ecx
    add edx,eax

In assembly, arithmetic has to be broken down into one operation at a time!
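One way to picture this is to treat each register as a 32-bit variable and each instruction as a single compound assignment.  Here's a sketch of the two-instruction sequence above in C++; the struct and function names are invented for illustration.

```cpp
#include <cstdint>

// Hypothetical model: each register is just a 32-bit variable.
struct Regs {
    std::uint32_t eax, ecx, edx;
};

// Models "sub eax,ecx" then "add edx,eax": one operation per line, and each
// instruction overwrites its destination (the first operand).
Regs sub_then_add(Regs r) {
    r.eax -= r.ecx;   // sub eax,ecx
    r.edx += r.eax;   // add edx,eax
    return r;
}
```

Starting from eax=10, ecx=3, edx=100, this ends with eax=7 and edx=107.  Note that the old value of eax is gone: the destination always gets overwritten.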

Note that "idiv" is really weird. Basically, "idiv bot" divides eax by bot (the eax is hardcoded).  But it also treats edx as high bits above eax, so you have to set them to zero first.

  idiv bot
means:
  top = eax+(edx<<32)
  eax = top/bot
  edx = top%bot

Here's an example:
mov eax,73; top
mov ecx,10; bottom
mov edx,0 ; high bits of top
idiv ecx ; divide eax by ecx
; now eax = 73/10, edx=73%10

(Try this in NetRun now!)

What a strange instruction!
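If it helps, here's the same "idiv" pseudocode as a C++ sketch.  The names DivResult and model_idiv are made up, and the model assumes the divisor is nonzero and the quotient fits in 32 bits (the real instruction raises a divide error in those cases instead of returning).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two registers idiv writes.
struct DivResult {
    std::int32_t eax;  // quotient
    std::int32_t edx;  // remainder
};

// Models "idiv bot": divides the 64-bit value edx:eax by bot.
DivResult model_idiv(std::uint32_t eax, std::uint32_t edx, std::int32_t bot) {
    assert(bot != 0);  // assumption: the caller never divides by zero
    // top = eax + (edx << 32), treated as a signed 64-bit number
    std::int64_t top = static_cast<std::int64_t>(
        (static_cast<std::uint64_t>(edx) << 32) | eax);
    std::int64_t quot = top / bot;
    assert(quot >= INT32_MIN && quot <= INT32_MAX);  // assumption: quotient fits in eax
    return DivResult{static_cast<std::int32_t>(quot),
                     static_cast<std::int32_t>(top % bot)};
}
```

model_idiv(73, 0, 10) gives eax=7 and edx=3, like the example above.  With a leftover edx of 1, the "top" becomes 4294967369 and the quotient comes out as 429496736, which is why you have to zero edx first.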

Assembly Instructions

There are *lots* of instructions.  A list of all possible x86 instructions can be found in: The really important opcodes are listed in my cheat sheet.  Most programs can be written with mov, the arithmetic instructions (add/sub/imul), the function call instructions (call/ret), the stack instructions (push/pop), and the conditional jumps (cmp/jmp/jl/je/jg/...).   We'll learn about these over the next few weeks!

Assembly Registers

Registers are where you store data in assembly language--there aren't any variables, so everything has to either go in registers or somewhere in memory.

Here are the commonly-used x86 registers:
"Scratch" registers you're allowed to overwrite and use for anything you want.  "Preserved" registers serve some important purpose somewhere else, so you have to put them back ("save" the register) if you use them--for now, just leave them alone!

Each of these registers is available in several sizes:
Curiously, you can write a 64-bit value into rax, then read off the low 32 bits from eax, or the low 16 bits from ax, or the low 8 bits from al--it's just one register, but they keep on extending it!

rax: 64-bit
eax: 32-bit (low half of rax)
ax: 16-bit (low half of eax)
ah: 8-bit (high byte of ax)
al: 8-bit (low byte of ax)

For example,
mov rcx,0xf00d00d2beefc03; load 64-bit constant
mov eax,ecx; pull out low 32 bits
ret

(Try this in NetRun now!)
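In C++ terms, reading a smaller register name is like casting the 64-bit value down to a smaller integer type.  Here's a sketch; the helper names are invented, and they only model *reading* the sub-registers, not writing to them.

```cpp
#include <cstdint>

// Hypothetical helpers: each returns the part of a 64-bit register value
// that the smaller register name would read.
std::uint32_t low32(std::uint64_t r) { return static_cast<std::uint32_t>(r); }  // like eax
std::uint16_t low16(std::uint64_t r) { return static_cast<std::uint16_t>(r); }  // like ax
std::uint8_t low8(std::uint64_t r) { return static_cast<std::uint8_t>(r); }     // like al
std::uint8_t high8of16(std::uint64_t r) {                                       // like ah
    return static_cast<std::uint8_t>(r >> 8);
}
```

For the constant above, low32(0xf00d00d2beefc03) is 0x2beefc03, low16 is 0xfc03, low8 is 0x03, and high8of16 is 0xfc.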

Here's the full list of x86 registers.  The 64-bit registers are in the first column.

64-bit (long)  32-bit (int)  16-bit (short)  8-bit (char)  Notes
rax            eax           ax              ah and al     Values are returned from functions in this register.  Multiply instructions put the low bits of the result here too.
rcx            ecx           cx              ch and cl     Typical scratch register.  Some instructions use it as a counter (such as SAL or REP).
rdx            edx           dx              dh and dl     Scratch register.  Multiply instructions put the high bits of the result here.
rbx            ebx           bx              bh and bl     Preserved register: don't use it without saving it!
rsp            esp           sp              spl           The stack pointer.  Points to the top of the stack (wait for the details!)
rbp            ebp           bp              bpl           Preserved register.  Sometimes used to store the old value of the stack pointer, or the "base".
rsi            esi           si              sil           Scratch register.  Also used to pass function argument #2 in 64-bit mode (on Linux).
rdi            edi           di              dil           Scratch register.  Function argument #1.
r8             r8d           r8w             r8b           Scratch register.  These were added in 64-bit mode, so the names are slightly different.
r9             r9d           r9w             r9b           Scratch register.
r10            r10d          r10w            r10b          Scratch register.
r11            r11d          r11w            r11b          Scratch register.
r12            r12d          r12w            r12b          Preserved register.
r13            r13d          r13w            r13b          Preserved register.
r14            r14d          r14w            r14b          Preserved register.
r15            r15d          r15w            r15b          Preserved register.

Disassembly at Runtime

You can just typecast a function pointer over to an unsigned char pointer, and play with the bytes of a function's machine code:
#include <iostream>

int bar(void) { /* some random function: we look at bar's machine code below! */
	return 2;
}

int foo(void) {
	const unsigned char *data=(const unsigned char *)(&bar);
	for (int i=0;i<10;i++) /* print out the bytes of the bar function */
		std::cout<<"0x"<<std::hex<<(int)data[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

The first six bytes this prints are *exactly* the machine code we would write:

0xb8    Opcode for "mov eax,"
0x2     Little-endian constant to load into eax (4 bytes)
0x0
0x0
0x0
0xc3    Opcode for "ret"

You can also click "Disassemble" in NetRun to see the assembly and machine code the compiler or assembler produces.
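Going the other way, here's a sketch of how you might pull the constant back out of those bytes by hand.  The function name decode_mov_eax_constant is invented for this example, and it assumes the pointer really does point at a 0xb8 opcode followed by four more readable bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reassembles the 4-byte little-endian constant that
// follows a 0xb8 ("mov eax,") opcode.
std::uint32_t decode_mov_eax_constant(const unsigned char *code) {
    assert(code[0] == 0xb8);  // assumption: caller hands us a "mov eax," instruction
    std::uint32_t value = 0;
    for (int i = 0; i < 4; i++)  // byte i holds bits 8*i .. 8*i+7 of the constant
        value |= static_cast<std::uint32_t>(code[1 + i]) << (8 * i);
    return value;
}
```

Given the bytes b8 02 00 00 00 c3 from the listing above, this returns 2, the value bar returns.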

Hierarchy of "REALLY"

Thus, the notion that Facebook is interesting is REALLY an illusion.