Bytes, ASCII, Big and Little Endian Integers

ASCII: Bytes as Characters

The American Standard Code for Information Interchange (ASCII) is a mapping from byte values 0 through 127 to characters: letters, digits, punctuation, and a set of non-printing control codes.

You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:

return '\n';

(Try this in NetRun now!)

This returns 10 (0xA), because that's the byte the ASCII committee chose to represent a newline.

You can print the whole ASCII table pretty easily, like:

for (int i=0;i<200;i++) {
	char c=(char)i;
	std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}

(Try this in NetRun now!)

You can think of ASCII codes as decimal or hexadecimal bytes. It's the same data, deep down.

Dec   Hex   Char
--------------------
0     00    NUL '\0'
1     01    SOH 
2     02    STX 
3     03    ETX 
4     04    EOT 
5     05    ENQ 
6     06    ACK 
7     07    BEL '\a'
8     08    BS  '\b'
9     09    HT  '\t'
10    0A    LF  '\n'
11    0B    VT  '\v'
12    0C    FF  '\f'
13    0D    CR  '\r'
14    0E    SO  
15    0F    SI  
16    10    DLE 
17    11    DC1 
18    12    DC2 
19    13    DC3 
20    14    DC4 
21    15    NAK 
22    16    SYN 
23    17    ETB 
24    18    CAN 
25    19    EM  
26    1A    SUB 
27    1B    ESC 
28    1C    FS  
29    1D    GS  
30    1E    RS  
31    1F    US

Dec   Hex   Char
-----------------
32    20    SPACE
33    21    ! 
34    22    " 
35    23    # 
36    24    $ 
37    25    % 
38    26    & 
39    27    ' 
40    28    ( 
41    29    ) 
42    2A    * 
43    2B    + 
44    2C    , 
45    2D    - 
46    2E    . 
47    2F    / 
48    30    0 
49    31    1 
50    32    2 
51    33    3 
52    34    4 
53    35    5 
54    36    6 
55    37    7 
56    38    8 
57    39    9 
58    3A    : 
59    3B    ; 
60    3C    < 
61    3D    = 
62    3E    > 
63    3F    ?

Dec   Hex   Char
--------------------
64    40    @
65    41    A
66    42    B
67    43    C
68    44    D
69    45    E
70    46    F
71    47    G
72    48    H
73    49    I
74    4A    J
75    4B    K
76    4C    L
77    4D    M
78    4E    N
79    4F    O
80    50    P
81    51    Q
82    52    R
83    53    S
84    54    T
85    55    U
86    56    V
87    57    W
88    58    X
89    59    Y
90    5A    Z
91    5B    [
92    5C    \	'\\'
93    5D    ]
94    5E    ^
95    5F    _

Dec   Hex   Char
----------------
96    60    `
97    61    a
98    62    b
99    63    c
100   64    d
101   65    e
102   66    f
103   67    g
104   68    h
105   69    i
106   6A    j
107   6B    k
108   6C    l
109   6D    m
110   6E    n
111   6F    o
112   70    p
113   71    q
114   72    r
115   73    s
116   74    t
117   75    u
118   76    v
119   77    w
120   78    x
121   79    y
122   7A    z
123   7B    {
124   7C    |
125   7D    }
126   7E    ~
127   7F    DEL
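
Here's a minimal sketch that rebuilds one row of the tables above for any character. The helper name describe_char is made up for this example; it just formats the same byte in decimal and in hex.

```cpp
#include <string>
#include <sstream>
#include <iomanip>

// Format one character as "dec 0xHEX", e.g. 'A' -> "65 0x41".
// describe_char is a hypothetical helper name, not part of the original notes.
std::string describe_char(char c) {
	int code = (unsigned char)c; // widen to int without sign extension
	std::ostringstream out;
	out << code << " 0x" << std::hex << std::uppercase
	    << std::setw(2) << std::setfill('0') << code;
	return out.str();
}
```

For example, describe_char('A') gives "65 0x41", matching the table row for A.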

ASCII codes less than 32 ("control characters") or greater than 127 ("high ASCII") show up in various weird ways depending on which machine and web browser you're running.

Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters:

[Figure: ASCII table.] Horizontal axis gives the low hex digit, vertical axis the high hex digit, and the entry is ASCII for that hex digit. E.g., "A" is 0x41.

Inside a C++ string, each character you type is stored as its ASCII byte value. You can enter hex bytes in the middle of a string using a backslash-x: the string "Funky symbol is \x80" ends with the byte 0x80, which displays as a Euro sign (€) in the Windows-1252 encoding.

Possibly the world's most obtuse first program is:

std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";

(Try this in NetRun now!)
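
If you'd rather not look up each byte by hand, a string like that can be generated. This is only a sketch; the helper name to_hex_escapes is an assumption made for illustration.

```cpp
#include <string>

// Convert each byte of s into a backslash-x escape, e.g. "Hi" -> "\x48\x69".
// to_hex_escapes is a hypothetical helper name used for illustration.
std::string to_hex_escapes(const std::string &s) {
	const char *digits = "0123456789ABCDEF";
	std::string out;
	for (unsigned char byte : s) {
		out += "\\x";
		out += digits[byte >> 4];  // high hex digit
		out += digits[byte & 0xF]; // low hex digit
	}
	return out;
}
```

Calling to_hex_escapes("Hello World!\n") reproduces the escapes in the program above. One caveat when typing such escapes by hand: a \x escape keeps consuming hex digits, so "\x41B" is not an 'A' followed by a 'B'.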

Here are the bytes inside the "cout" object itself, printed out as ASCII characters up to the first zero byte. They look like random garbage.

std::cout<<(const char *)&std::cout;

(Try this in NetRun now!)

"sizeof": Get Number of Bytes

Eight bits make a "byte" (note: it's pronounced exactly like "bite", but always spelled with a 'y'). In some networking manuals (and in French) the same eight bits are called an "octet"; hard drive sizes are in "Go" or "To", giga-octets or tera-octets, when sold in French. In DOS and Windows programming, 16 bits is a "WORD", 32 bits is a "DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts "word" means the machine's natural binary processing size, which is typically 32 or 64 bits nowadays. "Word" should be considered ambiguous; "bit" and "byte" have the same meaning everywhere.
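
Eight bits give 2^8 = 256 distinct values, so a byte counts from 0 to 255 and then wraps around. Here's a small sketch of that wraparound; it assumes the usual 8-bit char and 16-bit short, and the function names are made up for this example.

```cpp
// Add one to a value stored in a single byte.
// Assumes an 8-bit unsigned char, so 255+1 wraps around to 0.
unsigned char next_byte(unsigned char b) {
	return (unsigned char)(b + 1);
}

// Same idea for a 16-bit unsigned short: 65535+1 wraps around to 0.
unsigned short next_short(unsigned short s) {
	return (unsigned short)(s + 1);
}
```

So next_byte(255) returns 0, because the ninth bit of 256 has nowhere to go.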

Object	Overflow Value	Bits	Hex Digits (4 bits each)	Bytes (8 bits each)
Bit	2	1	less than 1	less than 1
Byte, char	256	8	2	1
"short" (or Windows WORD)	65536	16	4	2
"int" (Windows DWORD)	>4 billion	32	8	4
"long" (or in 32-bit C++, "long long")	>18 quintillion	64	16	8

There's a nice little built-in operator in C/C++ called "sizeof" that returns the number of bytes (technically, the number of characters) used by a variable or data type. Sadly, C/C++ don't specify how many bytes various data types like "int" have, so it depends on the machine:

sizeof()       32-bit x86         32-bit PowerPC    64-bit x86      Java / C#
               (little endian)    (big endian)      or Itanium
----------------------------------------------------------------------------------------
char           1                  1                 1               byte: 1, char: 2
short          2                  2                 2               2
int            4                  4                 4               4
long           4                  4                 8               8
long long      8                  8                 8               (no need for long long)
void *         4                  4                 8               (no pointers in Java)
float          4                  4                 4               4
double         8                  8                 8               8
long double    12                 8                 16              (no long double in Java)
Data model     "ILP32"            "ILP32"           "LP64"

Note the deciding difference between "32 bit machines" and "64 bit machines" is the size of a pointer--4 or 8 bytes. "int" is 4 bytes on all the modern machines listed here. "long" is 8 bytes in Java or on a 64-bit machine, and just 4 bytes on 32-bit machines (64-bit Windows is an exception: its "long" stays 4 bytes).

Here's a program that prints out the above:

char c;
short s;
int i;
long l;
long long ll;
void *v;
float f;
double d;
long double ld;
std::cout<<"sizeof(char)=="<<sizeof(c)<<"\n";
std::cout<<"sizeof(short)=="<<sizeof(s)<<"\n";
std::cout<<"sizeof(int)=="<<sizeof(i)<<"\n";
std::cout<<"sizeof(long)=="<<sizeof(l)<<"\n";
std::cout<<"sizeof(long long)=="<<sizeof(ll)<<"\n";
std::cout<<"sizeof(void *)=="<<sizeof(v)<<"\n";
std::cout<<"sizeof(float)=="<<sizeof(f)<<"\n";
std::cout<<"sizeof(double)=="<<sizeof(d)<<"\n";
std::cout<<"sizeof(long double)=="<<sizeof(ld)<<"\n";
return 0;

(executable NetRun link)

Try this out on some different machines! Note that on some Windows compilers, you might need to say "__int64" instead of "long long". Also note that "long long" has nothing to do with the Chinese concert pianist Lang Lang.
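
The "ILP32" and "LP64" names in the table can be worked out from sizeof alone. Here's a hedged sketch: the function name data_model is an assumption for illustration, and "LLP64" (the 64-bit Windows layout, where long stays 4 bytes) is an addition not covered in the table.

```cpp
#include <string>

// Guess the data model name from the sizes of int, long, and pointers.
// data_model is a hypothetical helper; anything unrecognized reports "other".
std::string data_model() {
	if (sizeof(int)==4 && sizeof(long)==4 && sizeof(void *)==4) return "ILP32";
	if (sizeof(int)==4 && sizeof(long)==8 && sizeof(void *)==8) return "LP64";
	if (sizeof(int)==4 && sizeof(long)==4 && sizeof(void *)==8) return "LLP64";
	return "other";
}
```

The names just list which types share the stated width: Int, Long, and Pointer are all 32 bits in ILP32, while only Long and Pointer are 64 bits in LP64.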

Big and Little Endian Memory Access

Let's say we ask the CPU to treat four bytes as a single integer, using a typecast like so:

const unsigned char table[]={
	1,2,3,4
};

int foo(void) {
	typedef int *myPtr;
	myPtr p=(myPtr)table;
	return p[0];
}

(Try this in NetRun now!)

This program returns "0x4030201", which is rather the opposite of what you might expect. The mismatch here is that we write (Arabic) numerals right-to-left, with the lowest-valued digit on the right (just like Arabic script), but we write table entries (and everything else) left-to-right.

So the CPU reads the first, leftmost table entry (1) to get the lowest-valued byte (0x01), which we write on the right side (0x...01). Similarly, the last table entry (4) is interpreted as the highest-valued byte (0x04), which we write on the left side (0x04...).

But this depends on the CPU! All x86 CPUs start with the lowest-valued byte (the "little end" of the integer comes first, hence "little endian"), but many other CPUs, such as the PowerPC, MIPS, and SPARC CPUs, start with the highest-valued byte (the "big end" of the integer, hence "big endian"). So this same code above returns 0x01020304 on a PowerPC--try this!
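
If you don't have a PowerPC handy, you can at least ask the machine which kind it is. This is a minimal sketch, with a made-up function name, that looks at the first byte in memory of the integer 1.

```cpp
#include <cstring>

// Returns true if the lowest-valued byte of an int is stored first in memory.
// is_little_endian is a hypothetical helper name used for illustration.
bool is_little_endian() {
	unsigned int x = 1;
	unsigned char first;
	std::memcpy(&first, &x, 1); // copy out the first byte of x
	return first == 1;          // little endian: the 0x01 byte comes first
}
```

On an x86 machine this should return true; on a big-endian machine the first byte is 0, so it returns false.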

The big and little endian naming confusion exists even in the non-computer world. Consider that the following are all little-endian (starting with the least-significant information):

Fairbanks, Alaska, USA
John Smith, Carpet Cleaner
Pittsburgh Technical Institute

Yet the following are all big-endian (starting with the biggest information):

University of Alaska Fairbanks
Cleaners, Carpet: Smith, John (like in a phonebook)
907 474-7678
$7.32
École Polytechnique de Montréal

You can see big- and little-endian byte storage going not just from bytes to ints, but also from ints to bytes:

int foo(void) {
	int x=0xa0b0c0d0; /* Integer value we'll pick apart into bytes */
	typedef unsigned char *myTable; /* We'll make it an array of chars */
	myTable table=(myTable)&x; /* point to the bytes of the integer x */
	for (int i=0;i<4;i++) /* print each byte of the integer x */
		std::cout<<std::hex<<(int)table[i]<<" ";
	std::cout<<std::endl;
	return 0;
}

(Try this in NetRun now!)

This code prints "d0 c0 b0 a0" on a little-endian machine--the first byte is the lowest-value "0xd0".
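
When you need the same answer on every CPU (say, reading a file format), one common approach is to build the integer from its bytes with shifts instead of a pointer typecast. A sketch under that assumption, with illustrative function names:

```cpp
#include <cstdint>

// Assemble a 32-bit integer from 4 bytes, lowest-valued byte first.
// Gives the same result on any CPU, because shifts work on values, not memory.
std::uint32_t read_le32(const unsigned char *p) {
	return (std::uint32_t)p[0] | ((std::uint32_t)p[1] << 8)
	     | ((std::uint32_t)p[2] << 16) | ((std::uint32_t)p[3] << 24);
}

// Same, but highest-valued byte first.
std::uint32_t read_be32(const unsigned char *p) {
	return ((std::uint32_t)p[0] << 24) | ((std::uint32_t)p[1] << 16)
	     | ((std::uint32_t)p[2] << 8) | (std::uint32_t)p[3];
}
```

With the table {1,2,3,4} from earlier, read_le32 gives 0x04030201 and read_be32 gives 0x01020304, no matter which machine runs it.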

Machine Code as Bytes

Here's some x86 machine code encoded into a C++ string, and run on the CPU. Once it's compiled, the bytes of this string work fine as CPU machine code, just like the char arrays in the homework.

const char *fn="\xb8\x07\x00\x00\x00\xc3";
return ((int (*)(void))(fn))();

(Try this in NetRun now!)

Here's the same machine code, encoded into a C++ "long". Remember that the long is stored in memory little-endian!

const static long fn=0xc300000007b8;
return ((int (*)(void))(&fn))();

(Try this in NetRun now!)

Note that newer x86 machines mark the stack with the "NX" (No eXecute) bit to prevent the CPU from executing code there. This is a useful security feature, but it means the above code crashes without the "const static".

std::vector<char> fn;
fn.push_back(0xb8);
fn.push_back(0x07);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0xc3);
return ((int (*)(void))(&fn[0]))();

(Try this in NetRun now!)

Here, I'm putting the bytes of machine code into a std::vector. This only works on my 32-bit machine; on my 64-bit machine, std::vector's storage space is marked NX, so this code crashes rather than running.
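
The earlier claim that the integer 0xc300000007b8 holds the same bytes as the machine code string can be checked without executing anything. This sketch uses a 64-bit std::uint64_t in place of "long" (an assumption, so the size is the same everywhere) and a made-up function name.

```cpp
#include <cstdint>
#include <cstring>

// Returns true if the first six bytes in memory of 0xc300000007b8
// equal the machine code bytes b8 07 00 00 00 c3.
// Expected to be true only on a little-endian machine.
bool long_matches_string() {
	const std::uint64_t fn = 0xc300000007b8ULL;
	const unsigned char code[6] = {0xb8, 0x07, 0x00, 0x00, 0x00, 0xc3};
	return std::memcmp(&fn, code, 6) == 0;
}
```

On a big-endian machine the integer's bytes come out in the reverse order, so the comparison fails there.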