Pointers, Pointer Arithmetic, and Messy Uncertain Death

CS 301 Lecture, Dr. Lawlor
Pointers are one of C++'s most powerful features, but also one of its most dangerous.  (Cue dramatic music.)  In assembly, we need pointers to access memory, which holds the values that won't fit in registers.

In C++, we use the "address-of" & operator to get the address of a variable, returning a pointer.  It uses the same symbol as bitwise AND, but the two are unrelated.  The corresponding dereference operator * looks at what the pointer points to, returning the underlying variable:
int x=3; // happy bunny
int *p=&x; // point to the happy bunny
std::cout<<"The pointer is "<<p<<std::endl;
int i=*p; // summon the happy bunny
return i;

(Try this in NetRun now!)
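
Dereferencing works for writes too: assigning to *p changes the variable p points to.  Here's a minimal sketch; the function name write_through_pointer is made up for illustration:

```cpp
// Hypothetical helper: shows that writing to *p changes x itself.
int write_through_pointer(void) {
    int x = 3;   // happy bunny
    int *p = &x; // point to the happy bunny
    *p = 7;      // write through the pointer: this modifies x
    return x;    // x is now 7
}
```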

You can also do pointer arithmetic, where you change what the pointer points to.  Pointer arithmetic is dangerous, because you can easily meddle with forces you can barely understand, much less control:
int x=3; // happy bunny
int *p=&x; // point to the happy bunny
for (int r=0;r<1000000;r++) p++; // move way past the bunnies
std::cout<<"The new pointer is "<<p<<std::endl;
int i=*p; // summon the EVIL DEMON MONSTER
return i;

(Try this in NetRun now!)

As you might expect, even reading from a way-bad pointer can cause your program to crash (or, adding a little drama, "to die... horribly!").  So you have to make sure that what you're pointing to is really there.  This is doubly true for writing--you have to make sure it's there, and that you're allowed to write to it.  The standard way to do this is just to be very careful about how you write your pointer manipulation code.
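
One way to "be very careful" is to check the index before you ever dereference.  This is only a sketch, and it assumes you know how many elements the array holds; checked_read and its parameters are names made up for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: read element i of an n-element int array,
// returning fallback instead of dereferencing when i is out of range.
int checked_read(const int *arr, std::size_t n, std::size_t i, int fallback) {
    if (arr == nullptr || i >= n) return fallback; // not ours: don't touch
    return *(arr + i); // safe: arr+i is inside the array
}
```

With a 3-element array, checked_read(a,3,1,-1) returns a[1], while checked_read(a,3,3,-1) returns -1 without touching memory.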

Here's some valid pointer manipulation code, where we use ++ to move the pointer to the next integer, -- to move it back to the previous integer, and then we add two to jump over two integers at once:

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
int *p=arr; /* points to arr[0] */
std::cout<<"At p: "<< *p <<std::endl;
p++; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< *p <<std::endl;
p++; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< *p <<std::endl;
p--; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< *p <<std::endl;
p=p+2; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2: "<< *p <<std::endl;
return 0;

(Try this in NetRun now!)

Note that this means that arrays are just a series of items at increasing addresses in memory.  That's all an array is.
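
Put another way, arr[i] is the same thing as *(arr+i).  A small sketch of that equivalence (both function names are made up for illustration):

```cpp
// Hypothetical helper: sums an n-element array using pointer arithmetic only.
int sum_with_pointers(const int *arr, int n) {
    int total = 0;
    for (const int *p = arr; p != arr + n; p++) // arr+n is one past the end
        total += *p; // same as arr[p-arr]
    return total;
}

// arr[i] and *(arr+i) name the same int, so their addresses match.
bool index_matches_pointer(const int *arr, int i) {
    return &arr[i] == arr + i;
}
```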

In C/C++, the compiler knows you're pointing to an integer.  So when you say "p=p+2", the compiler moves the pointer by two integers, which is a total of eight bytes (an int is four bytes on typical machines).  You can see that byte count by printing out the pointers as they move, like the following.  (Note now we're printing "p" the pointer; not "*p" the integer.)

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
int *p=arr; /* points to arr[0] */
std::cout<<"At p: "<< p <<std::endl;
p++; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< p <<std::endl;
p++; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< p <<std::endl;
p--; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< p <<std::endl;
p=p+2; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2: "<< p <<std::endl;
return 0;

(Try this in NetRun now!)
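
If you'd rather not subtract hex addresses by hand, you can have the machine compute the byte count.  This sketch casts both pointers to "char *" so the subtraction counts bytes instead of ints; byte_distance is a name made up for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes between two int pointers
// that point into the same array.
std::ptrdiff_t byte_distance(const int *from, const int *to) {
    return (const char *)to - (const char *)from;
}
```

On a machine with 4-byte ints, byte_distance(arr, arr+2) is 8, even though (arr+2)-arr is just 2.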

Pointers literally *are* just this byte count: the address of a byte in memory.  You can do pointer arithmetic on byte counts in C++ by typecasting your pointers to "char *", but the syntax looks a little weird, because to access an int, you have to cast back to "int *":

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
char *p=(char *)arr; /* points to the bytes in arr[0] */
std::cout<<"At p: "<< *(int *)p <<std::endl;
p+=4; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< *(int *)p <<std::endl;
p+=4; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< *(int *)p <<std::endl;
p-=4; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< *(int *)p <<std::endl;
p+=8; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2: "<< *(int *)p <<std::endl;
return 0;

(Try this in NetRun now!)

Byte pointers are useful to learn in C++, because they're all you get in assembly language!
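
If you do a lot of byte-offset access, you can wrap it in a little helper.  Note this sketch swaps in memcpy for the "*(int *)p" cast used above, so it also works when the byte address isn't a multiple of the int size; read_int_at is a name made up for illustration:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: read an int that starts byte_offset bytes past base.
// memcpy is used instead of *(int *)(...) so the address need not be
// int-aligned.
int read_int_at(const void *base, std::size_t byte_offset) {
    int value;
    std::memcpy(&value, (const char *)base + byte_offset, sizeof(value));
    return value;
}
```

For the arr[] above, read_int_at(arr, 3*sizeof(int)) reads the same int as arr[3].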

Memory Allocation

Memory (like real estate) in theory could be used by anybody for anything at any time (the anarchist squatter's paradise!).  Of course, in practice, it works a lot better to set up rules by which you can figure out what memory's yours, and what isn't (e.g., deeds, leases, rental contracts).  So a piece of memory can be:

  1. Owned by you, and used by you.  This is the good kind of memory, the only kind you should be using.
  2. Owned by somebody else, and erroneously used by you.  You can read or write surprisingly far past the end of an array before crashing, although you can easily end up overwriting something used by some other part of the program, or crashing yourself.  (A confusing "memory corruption" error.)
  3. Owned by somebody else, and deadly to even look at.  You get a "segmentation fault" access violation if you access this memory.  The CPU enforces the OS's wishes using the "page table", which you'll hear about in CS 321 (unless I tell you first!).
You can't tell the difference between type 2 memory (dangerous, but not right now) and type 3 memory (immediate death), so stick to type 1!

The bottom line is you really need to claim memory before using it, and then only use the part you claimed.  It's easy to accidentally run off the end of an array (owned by you, type 1 memory) into other bytes of memory owned by some other part of the program (type 2 memory), or delete an array (so it's no longer owned by you) and use it later, etc.  Sadly, in C++ it's up to you the programmer to make sure your uses of memory are correct, unlike Java or C#, where ordinary code doesn't use raw pointers and array indices are all carefully checked at runtime.
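
C++ will check indices for you if you ask nicely.  As a sketch, std::vector's at() method throws std::out_of_range on a bad index, while plain [] does no checking at all; at_rejects is a name made up for illustration:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns true if v.at(i) rejects index i.
bool at_rejects(const std::vector<int> &v, std::size_t i) {
    try {
        (void)v.at(i); // at() checks the index at runtime
        return false;  // index was in range
    } catch (const std::out_of_range &) {
        return true;   // index was past the end
    }
}
```

For a 4-element vector, at_rejects(v,3) is false and at_rejects(v,4) is true.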

Anyway, there are a bunch of different ways for your code to legally claim some memory, including:
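
Some common options, sketched in C++.  This is an illustrative selection rather than a complete list, and the names here (global_arr, claim_memory_demo, and so on) are made up for the example:

```cpp
#include <cstdlib>
#include <vector>

int global_arr[10]; // global array: yours for the whole run of the program

// Hypothetical demo: claims memory four more ways, uses each piece, and
// gives back the pieces that need explicit cleanup.
int claim_memory_demo(void) {
    int stack_arr[10];          // local array: yours until the function returns
    int *new_arr = new int[10]; // new[]: yours until you delete[] it
    int *malloc_arr = (int *)std::malloc(10 * sizeof(int)); // yours until free
    if (malloc_arr == nullptr) { delete[] new_arr; return -1; }
    std::vector<int> vec(10);   // vector: frees itself when it goes out of scope

    global_arr[9] = 1; stack_arr[9] = 2; new_arr[9] = 3;
    malloc_arr[9] = 4; vec[9] = 5;
    int total = global_arr[9] + stack_arr[9] + new_arr[9] + malloc_arr[9] + vec[9];

    std::free(malloc_arr); // after this, malloc_arr is no longer ours
    delete[] new_arr;      // likewise for new_arr
    return total;          // 1+2+3+4+5 = 15
}
```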

Pointers in Assembly Language

In assembly language, you store pointers in registers.  On a 64-bit machine a pointer needs a full 64-bit register like "rax", not a 32-bit register like "eax".  The syntax for dereferencing a pointer (to get at what it points to) is just "[rax]".  The array-looking square brackets say to access memory.  For example,
	mov rax,rcx
copies the value stored in register rcx into the register rax, overwriting whatever was stored in rax before.  By contrast,
	mov [rax],rcx
copies the value stored in register rcx into the memory pointed to by rax.  


In C++, this is like the difference between:
	someptr = 0;
which overwrites the pointer itself, and
	*someptr = 0;
which overwrites the memory the pointer points to.
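
Here's a small sketch that shows both assignments side by side.  The function name pointer_vs_pointee and the variable target are made up for illustration:

```cpp
// Hypothetical demo: returns 1 if writing *someptr changed the target,
// and overwriting someptr afterwards left the target alone.
int pointer_vs_pointee(void) {
    int target = 5;
    int *someptr = &target;
    *someptr = 0;  // overwrites target (the pointed-to memory)
    bool wrote_target = (target == 0);
    target = 5;
    someptr = 0;   // overwrites the pointer itself; target is untouched
    bool kept_target = (target == 5) && (someptr == 0);
    return wrote_target && kept_target;
}
```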


Here's a complete example of assembly memory access.  I call malloc to get 40 bytes of space.  malloc returns the starting address of this space in rax (the 64-bit version of eax).  That is, the rax register is acting like a pointer.  I can then read and write from the pointed-to memory using the bracket syntax shown above:
mov edi, 40; malloc's first (and only) parameter: number of bytes to allocate
extern malloc
call malloc
; on return, rax points to our newly-allocated memory
mov ecx,7; set up a constant
mov [rax],ecx; write it into memory
mov edx,[rax]; read it back from memory
mov eax,edx; copy into return value register
ret

(Try this in NetRun now!)
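
For comparison, here's roughly the same thing in C++.  It's an approximation rather than a line-for-line translation: it adds a null check and a free that the assembly skips, and malloc_write_read is a name made up for illustration:

```cpp
#include <cstdlib>

// Rough C++ counterpart of the assembly above.
int malloc_write_read(void) {
    int *p = (int *)std::malloc(40); // like "call malloc" with edi=40
    if (p == nullptr) return -1;     // the assembly skips this check
    *p = 7;                          // like mov [rax],ecx
    int result = *p;                 // like mov edx,[rax]
    std::free(p);                    // the assembly never frees the memory
    return result;
}
```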

Rather than copy via the ecx register, you can specify you want a 32-bit memory write and read using "DWORD" in front of the brackets, like this:
mov edi, 40; malloc's first (and only) parameter: number of bytes to allocate
extern malloc
call malloc
; on return, rax points to our newly-allocated memory
mov DWORD [rax],7; write constant into memory
mov eax,DWORD [rax]; read it back from memory
ret

(Try this in NetRun now!)

The available memory access sizes in NASM are "DWORD" for 32 bits (like an int), "QWORD" for 64 bits (pointer or long), "WORD" for 16 bits (short int), or "BYTE" for 8 bits (char or byte).
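
In C++ terms, those sizes line up with the fixed-width types from <cstdint>.  The sketch below checks the sizes at compile time, then shows that a DWORD-sized store touches exactly four bytes; bytes_changed_by_dword_store is a name made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Fixed-width C++ types matching each NASM size keyword.
static_assert(sizeof(std::uint8_t)  == 1, "BYTE is 8 bits");
static_assert(sizeof(std::uint16_t) == 2, "WORD is 16 bits");
static_assert(sizeof(std::uint32_t) == 4, "DWORD is 32 bits");
static_assert(sizeof(std::uint64_t) == 8, "QWORD is 64 bits");

// Hypothetical demo: counts how many bytes a DWORD-sized store changes.
int bytes_changed_by_dword_store(void) {
    unsigned char buf[8];
    std::memset(buf, 0xFF, sizeof(buf));   // fill with a marker value
    std::uint32_t zero = 0;
    std::memcpy(buf, &zero, sizeof(zero)); // like mov DWORD [buf],0
    int changed = 0;
    for (unsigned char b : buf)
        if (b != 0xFF) changed++;
    return changed; // only the first 4 bytes were overwritten
}
```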

You can do pointer arithmetic to move your pointer around, either by just modifying the register (as in rcx below), or by folding the new address into the memory access as an "offset" (as in the "DWORD [rax + 16]" below):
mov edi, 40; malloc's first (and only) parameter: number of bytes to allocate
extern malloc
call malloc
; on return, rax points to our newly-allocated memory
mov rcx,rax; copy the pointer
add rcx,16 ; shift new pointer down by 16 bytes
mov DWORD [rcx],7; write constant into memory at shifted pointer
mov eax,DWORD [rax+16]; read it back from same memory using constant offset
ret

(Try this in NetRun now!)
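
The same two styles of pointer arithmetic look like this in C++.  Again this is a rough counterpart, not an exact translation (it adds a null check and a free), and offset_write_read is a name made up for illustration:

```cpp
#include <cstdlib>

// Rough C++ counterpart of the assembly above, using byte pointers.
int offset_write_read(void) {
    char *base = (char *)std::malloc(40); // plays the role of rax
    if (base == nullptr) return -1;       // the assembly skips this check
    char *shifted = base + 16;            // like mov rcx,rax then add rcx,16
    *(int *)shifted = 7;                  // like mov DWORD [rcx],7
    int result = *(int *)(base + 16);     // like mov eax,DWORD [rax+16]
    std::free(base);                      // the assembly never frees the memory
    return result;
}
```

With 4-byte ints, byte offset 16 is the same spot as element 4 of an int array starting at base.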

In general, assembly language programs spend a lot of time copying data in and out of memory.  The advantage of doing this is that memory (gigabytes) is *way* bigger than the registers (the sixteen 64-bit general-purpose registers hold only 128 bytes!).