Recall that when we write C++ code, the compiler transforms it into executable machine code that actually runs on the CPU hardware. Machine code is line-for-line equivalent to assembly language.
One way to start learning assembly language is to use a "Disassembler" to see what the compiler generates from your code. (In NetRun, Options -> Actions -> Disassemble, then run.) For example, given this C++ code:
long foo() { return 7; }
We can compile this using: g++ code.c -c -fomit-frame-pointer
We can then disassemble it with: objdump -drC -M intel code.o
code.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
   0:   b8 07 00 00 00          mov    eax,0x7
   5:   c3                      ret
The stuff on the right, starting with "mov" and "ret", is assembly language.
We can take this assembly language code, convert it to machine code using an "assembler" like nasm, and run it!
mov eax,7
ret
(NetRun takes care of the function setup, which we'll get to in the next few weeks.)
Assembly is a very strange language, designed mostly around the machine it runs on, not around the programmer. For example, "mov" and "ret" are instructions for the CPU to execute. You can't add new instructions without changing the CPU; for example, Intel added the instruction "aesenc" (AES encryption) in 2010. There are hundreds of instructions added over the years, but some commonly used instructions are:
We'll be working our way through these instructions this week!
In assembly you don't have variables, but operate on data in registers. A register is actually a tiny piece of memory hardware inside the CPU, with a fixed size. When the CPU executes a line like "mov eax,7" it stores the constant 7 into the register eax, which is 32 bits wide, the width of an "int" in C or C++. Just like most C++ programs spend their time shuffling values between variables, most assembly programs spend their time shuffling values between registers.
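To make "fixed size" concrete, here is a minimal C++ sketch that models a 32-bit register as a std::uint32_t. The function name add_in_register is made up for illustration, and the sketch assumes a typical platform where "int" is 32 bits wide; it is a model of the behavior, not how the CPU is actually built.

```cpp
#include <cstdint>

// Hypothetical model: a 32-bit register holds exactly 32 bits, no more.
// Adding two "registers" behaves roughly like "add eax,ecx":
// any carry out of the top bit does not fit and is dropped from the result.
std::uint32_t add_in_register(std::uint32_t eax, std::uint32_t ecx) {
    return eax + ecx; // unsigned arithmetic wraps around modulo 2^32
}
```

Calling add_in_register(0xFFFFFFFFu, 1u) gives 0, because the true answer needs 33 bits and the register only has 32.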
Here are some of the more friendly, easy to use 32-bit registers, and who uses them. (There are also other register sizes and types we'll be covering eventually.)
Notes | 32-bit |
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too. | eax |
Scratch register. Some instructions use it as a counter (such as SAL or REP). | ecx |
Scratch register. Multiply instructions put the high bits of the result here. | edx |
Scratch register. Function argument #1 in 64-bit Linux. | edi |
Scratch register. Also used to pass function argument #2 in 64-bit Linux. | esi |
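The "low bits" and "high bits" in the eax and edx rows can be sketched in C++. This is only a model, assuming the unsigned one-operand form of multiply (like "mul ecx"), which produces a 64-bit product from two 32-bit inputs; the names MulResult and widening_mul are invented for this example.

```cpp
#include <cstdint>

// Hypothetical model of the eax/edx pair after a widening multiply.
struct MulResult {
    std::uint32_t eax; // low 32 bits of the product
    std::uint32_t edx; // high 32 bits of the product
};

// Multiply two 32-bit values into a 64-bit product, then split the
// product into the two halves described in the register table.
MulResult widening_mul(std::uint32_t a, std::uint32_t b) {
    std::uint64_t product = static_cast<std::uint64_t>(a) * b;
    MulResult r;
    r.eax = static_cast<std::uint32_t>(product);       // low half
    r.edx = static_cast<std::uint32_t>(product >> 32); // high half
    return r;
}
```

For small inputs like 3 and 7 the high half is just zero; multiplying 0x10000 by 0x10000 gives 2^32, which leaves 0 in the low half and 1 in the high half.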
The big problem with registers is they're in *hardware*: you're stuck with the existing names and sizes, and every function has to share them, just like global variables. If you made up a new language where there are only five global variables, with weird hardcoded names, you'd be laughed straight to the HR office to be fired!
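Here is a rough C++ sketch of why sharing registers is a hazard. The globals eax and ecx are stand-ins for the real registers, and helper and caller are hypothetical functions made up for this example.

```cpp
#include <cstdint>

// Hypothetical stand-ins for two registers: plain globals shared by everyone.
std::uint32_t eax = 0;
std::uint32_t ecx = 0;

// A helper that uses ecx as scratch space and leaves its answer in eax.
void helper() {
    ecx = 100;
    eax = ecx + 1;
}

// The caller parks a value in ecx, calls the helper, then reads ecx back.
std::uint32_t caller() {
    ecx = 5;     // the caller's value
    helper();    // the helper overwrites ecx without asking
    return ecx;  // no longer 5
}
```

Here caller() returns 100, not 5: the helper trashed the "register" the caller was counting on. Any function you call may do the same to a scratch register.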
One caution: if you see some assembly where the register names have a percent sign in front of them, like "%eax", you're probably looking at the GNU/AT&T syntax, which annoyingly puts the operands in the reverse order (source first, destination last) from the Intel syntax we'll be using.
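Unlike C/C++, assembly language instruction and register names are not case sensitive. This means "mov eax,7" and "mOv EaX, 7" are equivalent. (Names you define yourself, like labels, are still case sensitive in nasm.)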
A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly! I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:
mov ecx, 5; mov eax, 3; Whoops!
It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
Unlike C/C++, assembly is line-oriented, so you need a newline after each line of assembly, and the following WILL NOT WORK:
mov eax,
5
Line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text!
Here's how you add two numbers. First, the C/C++ version:
int a = 3;
int c = 7;
a += c;
return a;
And finally here's the assembly code:
mov eax, 3
mov ecx, 7
add eax, ecx
ret
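Here are the x86 arithmetic instructions. Most of them take two operands, the destination first and then the source. The exceptions are "not", which takes a single register, and "idiv", which takes one operand and uses hardcoded registers for everything else.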
Opcode | C++ | Example |
add | + | add eax,ecx |
sub | - | sub eax,ecx |
imul | * | imul eax,ecx |
idiv | / | idiv ecx <- Warning! Weirdness! (see below) |
and | & | and eax,ecx |
or | | | or eax,ecx |
xor | ^ | xor eax,ecx |
not | ~ | not eax |
Be careful doing these! Assembly is *line* oriented, so you can't say anything like this:
add edx,(sub eax,ecx) ; won't work
add edx, eax-ecx ; won't work
but you can say:
sub eax,ecx
add edx,eax
In assembly, arithmetic has to be broken down into one operation at a time, one instruction per line!
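The same breakdown can be sketched in C++, one statement per instruction. The function add_of_difference is a made-up name, and its parameters are named after the registers purely for illustration.

```cpp
#include <cstdint>

// Computes edx + (eax - ecx) one operation per statement,
// mirroring "sub eax,ecx" followed by "add edx,eax".
std::uint32_t add_of_difference(std::uint32_t eax,
                                std::uint32_t ecx,
                                std::uint32_t edx) {
    eax -= ecx; // sub eax,ecx  (the old value of eax is overwritten)
    edx += eax; // add edx,eax
    return edx;
}
```

With eax=10, ecx=3, and edx=100, this returns 107. Note that the first step destroys the original eax, just like the real "sub" does; if you still need that value, copy it somewhere else first.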
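Note that "idiv" is really weird, even by the standards of assembly language. "idiv bot" divides eax by bot (the eax part is hardcoded). But it also treats edx as high bits above eax, so you have to set edx first: zero works when the top is not negative, and the "cdq" instruction sign-extends eax into edx when the top might be negative.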
idiv bot
means:
top = eax+(edx<<32)
eax = top/bot
edx = top%bot
Here's an example:
mov eax,73; top
mov ecx,10; bottom
mov edx,0 ; high bits of top
idiv ecx ; divide eax by ecx
; now eax = 73/10, edx=73%10
What a strange instruction!
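To see what "idiv" does with edx, here is a C++ model of the pseudocode above. It is a sketch under assumptions: DivResult and idiv_model are invented names, bot must not be zero, and the quotient must fit in 32 bits (the real CPU raises a divide error in those cases, which this model does not imitate). It also relies on the usual two's complement conversion from unsigned to signed 64-bit values.

```cpp
#include <cstdint>

// Hypothetical model of the register pair that "idiv" leaves behind.
struct DivResult {
    std::int32_t eax; // quotient
    std::int32_t edx; // remainder
};

// Model of "idiv bot": glue edx (high bits) and eax (low bits) into
// one signed 64-bit top, then divide. Assumes bot != 0 and a quotient
// that fits in 32 bits.
DivResult idiv_model(std::int32_t eax, std::int32_t edx, std::int32_t bot) {
    std::uint64_t bits =
        (static_cast<std::uint64_t>(static_cast<std::uint32_t>(edx)) << 32)
        | static_cast<std::uint32_t>(eax);
    std::int64_t top = static_cast<std::int64_t>(bits);
    DivResult r;
    r.eax = static_cast<std::int32_t>(top / bot); // quotient, rounds toward zero
    r.edx = static_cast<std::int32_t>(top % bot); // remainder, sign follows top
    return r;
}
```

idiv_model(73, 0, 10) gives quotient 7 and remainder 3, matching the example. With a negative top, edx has to be all one bits: idiv_model(-73, -1, 10) gives -7 and -3, while idiv_model(-73, 0, 10) treats the top as the large positive number 4294967223 and gives 429496722 and 3.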
CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.