Arrays, Address Arithmetic, and Strings

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

In both C and assembly, you can allocate and access memory in several different sizes:

C/C++ datatype   Bits   Bytes   Register   Access memory   Allocate memory
char             8      1       al         BYTE [ptr]      db
short            16     2       ax         WORD [ptr]      dw
int              32     4       eax        DWORD [ptr]     dd
long             64     8       rax        QWORD [ptr]     dq

For example, we can put full 64-bit numbers into memory using "dq", and then read them back out with QWORD[yourLabel].

Address Arithmetic

If you allocate more than one constant with dq, they appear at ascending addresses.  So this reads the 5, like you'd expect:

dos_equis:
	dq 5   ; writes this constant into a "Data Qword" (8 byte block)
	dq 13  ; writes another constant, at [dos_equis+8] (bytes) 

foo:
	mov rax, [dos_equis] ; read memory at this label
	ret

(Try this in NetRun now!)

Adding 8 bytes (the size of a dq: one 8-byte, 64-bit QWORD) to the address of the first constant puts us directly on top of the second constant, 13:

dos_equis:
	dq 5   ; writes this constant into a "Data Qword" (8 byte block)
	dq 13  ; writes another constant, at [dos_equis+8] (bytes)

foo:
	mov rax, [dos_equis+8] ; read memory at this label, plus 8 bytes
	ret

(Try this in NetRun now!)
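
For comparison, the same byte-offset arithmetic can be sketched in C. This is only an illustration: the helper name read_qword_at is made up for this example, and the array simply mirrors the two "dq" constants above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two 64-bit values at ascending addresses, mirroring "dq 5" and "dq 13". */
const int64_t dos_equis[2] = { 5, 13 };

/* Hypothetical helper: read the 8 bytes that start byte_offset bytes past
   base, roughly what "mov rax, [base+byte_offset]" does. */
int64_t read_qword_at(const void *base, size_t byte_offset)
{
    int64_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}
```

Here read_qword_at(dos_equis, 0) gives 5 and read_qword_at(dos_equis, 8) gives 13, matching the two assembly examples.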


Accessing an Array

An "array" is just a sequence of values stored at ascending addresses in memory.  If we list our data with "dq", the values show up in memory in that order, so we can do pointer arithmetic to pick out the value we want.  This returns 7:

mov rcx,my_arr ; rcx == address of the array
mov rax,QWORD [rcx+1*8] ; load element 1 of array
ret

my_arr:
dq 4 ; array element 0, stored at [my_arr]
dq 7 ; array element 1, stored at [my_arr+8]
dq 9 ; array element 2, stored at [my_arr+16]

(Try this in NetRun now!)

Did you ever wonder why the first array element is [0]?  It's because it's zero bytes from the start of the array, which is exactly where the pointer points!
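
The index-to-byte-offset scaling can be written out by hand in C as a rough sketch. The helper names byte_offset_of and load_element are invented for this example, and the array mirrors the three "dq" values in my_arr.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three "dq" values in my_arr. */
const int64_t my_arr[3] = { 4, 7, 9 };

/* Byte distance from the start of the array to element index:
   the "index*8" part of [rcx+index*8]. */
size_t byte_offset_of(size_t index)
{
    return index * sizeof(int64_t);
}

/* Load one element by doing the address arithmetic by hand. */
int64_t load_element(const int64_t *arr, size_t index)
{
    const unsigned char *bytes = (const unsigned char *)arr;
    return *(const int64_t *)(bytes + byte_offset_of(index));
}
```

Element 0 sits at byte offset 0, element 1 at offset 8, and load_element(my_arr, 1) gives 7, the same value as my_arr[1]. Normally the C compiler does this multiplication for you; in assembly you write it yourself.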

Keep in mind that each array element above is a "dq", an 8-byte long, so I move down by 8 bytes per element during indexing, and I load into the 64-bit register "rax".

If the array holds 4-byte integers, we'd declare them with "dd" (data DWORD), move down by 4 bytes per int array element, and load the answer into a 32-bit register like "eax".  But the pointer register is always 64 bits!

mov rcx,my_arr ; rcx == address of the array
mov eax,DWORD [rcx+1*4] ; load element 1 of array
ret

my_arr:
dd 0xaaabbbcc ; array element 0, stored at [my_arr]
dd 0xc001007 ; array element 1, stored at [my_arr+4]

(Try this in NetRun now!)

It's extremely easy to get a mismatch among these three sizes: the size you declared, the stride you index by, and the register you load into.  For example, if I declare values with dw (2-byte shorts), but load them into eax (4 bytes), I'll have loaded two values into one register.  So this code returns 0xbeefaabb, which is two 16-bit values combined into one 32-bit register (x86 is little-endian, so element 0 lands in the low 16 bits):

mov rcx,my_arr ; rcx == address of the array
mov eax,[rcx] ; load element 0 of array (OOPS! 32-bit load!)
ret

my_arr:
dw 0xaabb ; array element 0, stored at [my_arr]
dw 0xbeef ; array element 1, stored at [my_arr+2]

(Try this in NetRun now!)
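
To see where 0xbeefaabb comes from, here is a small C sketch that spells out the bytes in memory and combines them the way a little-endian load would. The names my_arr_bytes, load16, and load32 are made up for illustration; the byte order shown assumes x86's little-endian layout.

```c
#include <stdint.h>

/* The bytes NASM emits for "dw 0xaabb" then "dw 0xbeef": x86 is
   little-endian, so each value is stored low byte first. */
const uint8_t my_arr_bytes[4] = { 0xbb, 0xaa, 0xef, 0xbe };

/* Combine 2 bytes the way a 16-bit load (mov ax, WORD [p]) would. */
uint16_t load16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Combine 4 bytes the way a 32-bit load (mov eax, DWORD [p]) would. */
uint32_t load32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

A 16-bit load at offset 0 gives 0xaabb and at offset 2 gives 0xbeef, but a 32-bit load at offset 0 swallows both and gives 0xbeefaabb.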

You can reduce the likelihood of this type of error by adding an explicit memory size specifier, like "WORD" below.  That turns the mistake into an assemble-time error ("error: mismatch in operand sizes") instead of a wrong value at runtime.

mov rcx,my_arr ; rcx == address of the array
mov eax, WORD [rcx] ; assembler error: WORD is 16 bits, but eax is 32 bits
ret

my_arr:
dw 0xaabb ; array element 0, stored at [my_arr]
dw 0xbeef ; array element 1, stored at [my_arr+2]

(Try this in NetRun now!)

(If we really wanted to load a 16-bit value into a 32-bit register, we could use "movzx" (unsigned) or "movsx" (signed) instead of a plain "mov".)
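
The difference between the two kinds of widening can be sketched in C. The function names zero_extend16 and sign_extend16 are invented for this example; they only imitate what movzx and movsx do to a 16-bit value.

```c
#include <stdint.h>

/* Like movzx: the upper 16 bits become zero. */
uint32_t zero_extend16(uint16_t v)
{
    return (uint32_t)v;
}

/* Like movsx: the upper 16 bits become copies of bit 15 (the sign bit). */
int32_t sign_extend16(uint16_t v)
{
    int32_t wide = v;          /* 0 .. 65535 */
    if (v & 0x8000)            /* sign bit set: negative as a 16-bit value */
        wide -= 0x10000;
    return wide;
}
```

For the value 0xbeef, zero extension gives 0x0000beef, while sign extension gives -16657, which is 0xffffbeef as a 32-bit pattern. A value with bit 15 clear, like 0x1234, comes out the same either way.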

C++     Bits   Bytes   Assembly Create    Assembly Read               Example
char    8      1       db (data byte)     mov al, BYTE [rcx+i*1]      (Try this in NetRun now!)
short   16     2       dw (data WORD)     mov ax, WORD [rcx+i*2]      (Try this in NetRun now!)
int     32     4       dd (data DWORD)    mov eax, DWORD [rcx+i*4]    (Try this in NetRun now!)
long    64     8       dq (data QWORD)    mov rax, QWORD [rcx+i*8]    (Try this in NetRun now!)


Human                                C++         Assembly
Declare a long integer.              long y;     rdx (nothing to declare, just use a register)
Copy one long integer to another.    y=x;        mov rdx,rax
Declare a pointer to a long.         long *p;    rax (nothing to declare, use any 64-bit register)
Dereference (look up) the long.      y=*p;       mov rdx,QWORD [rax]
Find the address of a long.          p=&y;       mov rax,place_you_stored_Y
Access an array (easy way)           y=p[2];     (sorry, no easy way exists!)
Access an array (hard way)           p=p+2;      add rax,2*8 ; (move forward by two 8 byte longs)
                                     y=*p;       mov rdx, QWORD [rax] ; (grab that long)
Access an array (too clever)         y=*(p+2)    mov rdx, QWORD [rax+2*8] ; (yes, that actually works!)
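
The three array-access rows can be checked against each other with a short C sketch. The function names are made up to match the row labels; all three read the same element.

```c
/* Easy way: array indexing. */
long easy_way(const long *p)
{
    return p[2];
}

/* Hard way: move the pointer forward, then dereference it.
   C scales "+2" by sizeof(long) for you; assembly needs 2*8 spelled out. */
long hard_way(const long *p)
{
    p = p + 2;
    return *p;
}

/* Too clever: offset and dereference in one expression. */
long too_clever(const long *p)
{
    return *(p + 2);
}
```

Given an array such as {10, 20, 30, 40}, each function returns 30. The only difference on the assembly side is who multiplies the index by the element size.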

Loading from the wrong place, or loading the wrong amount of data, is an INCREDIBLY COMMON problem when using pointers, in any language.  You WILL make this mistake at some point over the course of the semester, so be careful!

C Strings in Assembly 

In plain C, you can put a string on the screen with the standard C library "puts" function:

puts("Yo!");

(Try this in NetRun now!)

You can expand this out a bit, by declaring a string variable.  In C, strings are stored as (constant) character pointers, or "const char *":

const char *theString="Yo!";
puts(theString);

(Try this in NetRun now!)

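Internally, the compiler does two things: it allocates memory for the string and fills it with the bytes 'Y', 'o', '!', followed by a zero byte (the "nul" terminator) that marks the end of the string; and it points theString at that memory.

In assembly, these are separate steps: you store the bytes yourself with "db" (data byte), including the trailing zero, and you put a label in front of them so you can refer to their address.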

Here's an example:

mov rdi, theString ; rdi points to our string
extern puts  ; declare the function
call puts    ; call it
ret

theString:    ; label, just like for jumping
	db `Yo!`,0  ; data bytes for string (don't forget nul!)

(Try this in NetRun now!)
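
To make the nul byte visible, here is a C sketch that spells the string out one byte at a time, the way "db" does. The helper names stored_bytes and visible_chars are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* The same four bytes as: db `Yo!`,0 */
const char theString[] = { 'Y', 'o', '!', 0 };

/* Bytes actually stored, including the nul terminator. */
size_t stored_bytes(void)
{
    return sizeof theString;
}

/* Characters before the nul, which is what puts would print. */
size_t visible_chars(void)
{
    return strlen(theString);
}
```

The string occupies 4 bytes but has only 3 visible characters. If you leave off the final 0, functions like puts and strlen keep reading whatever bytes happen to follow in memory.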

In assembly, there's no obvious way to tell the difference between a label designed for a jump instruction (a block of code), a label designed for a call instruction (a function), a label designed as a pointer (like a string), or many other uses--it's just a pointer!

Strings as Arrays

There's a classic terse C idiom for walking a string, by incrementing a char * to walk down through the bytes until you hit the zero byte at the end:
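    while (*p++!=0) { /* here p already points one char past the one just tested */ }

If you unpack this a bit, you find: "*p" reads the char that p points to; "p++" moves the pointer forward one byte, to the next char; "*p++" does both, reading the current char and then moving the pointer; and "!=0" stops the loop at the nul byte that ends the string.  Because the increment happens as part of the test, p has already moved on to the next char by the time the loop body runs.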

Here's a typical example, in C:

char s[]="string";   // declare a string
char *p=s;           // point to the start
while (*p++!=0) if (*p=='i') *p='a';  // replace i with a (s[0] is never checked: p has already advanced)
puts(s);

(Try this in NetRun now!)
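
A less terse version of the same walk, written as a sketch, tests and uses the same char before advancing, so the first char is examined too. The helper name replace_char and its return value are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: replace every 'from' with 'to', returning how many
   chars were changed. Stops at the nul byte that ends the string. */
size_t replace_char(char *s, char from, char to)
{
    size_t changed = 0;
    for (char *p = s; *p != 0; p++) {   /* test *p, use *p, then advance */
        if (*p == from) {
            *p = to;
            changed++;
        }
    }
    return changed;
}
```

Called on a modifiable copy of "string" with 'i' and 'a', this changes one char and leaves "strang". On "initial" it changes three chars, including the very first one.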

Here's a similar pointer-walking trick, in assembly:

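mov rdi,stringStart ; rdi points to the start of the string
again:
	add rdi,1 ; move pointer down the string (the first char is skipped)
	cmp BYTE[rdi],'a' ; did we hit the letter 'a'?
	jne again  ; if not, keep looking (no nul check: assumes the string contains an 'a')

extern puts
call puts ; rdi points at the first 'a', so this prints "a great string"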
ret

stringStart:
	db 'this is a great string',0

(Try this in NetRun now!)
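
The assembly loop relies on the string containing an 'a'. A more defensive version of the same search can be sketched in C as follows. The helper name find_char is made up for this example, and the standard library's strchr does essentially the same job.

```c
#include <stddef.h>

/* Hypothetical helper: return a pointer to the first 'target' in s, or
   NULL if the nul terminator is reached first. Unlike the assembly loop,
   this checks the first char and stops at the end of the string. */
const char *find_char(const char *s, char target)
{
    for (const char *p = s; *p != 0; p++) {
        if (*p == target)
            return p;
    }
    return NULL;
}
```

On "this is a great string" the search for 'a' stops 8 bytes in, so passing the result to puts would print the rest of the string from that point. A search for a letter that is not present returns NULL instead of running off the end.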

(We'll see how to declare modifiable strings later.)