Bytes, ASCII, Big and Little Endian Integers
CS 301 Lecture, Dr. Lawlor
ASCII: Bytes as Characters
The American Standard Code for Information Interchange (ASCII) is a mapping from byte values (0 through 127) to characters: letters, digits, punctuation, and a few dozen non-printing control codes.
You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:
return '\n';
(Try this in NetRun now!)
This returns 10 (0xA), because that's the byte the ASCII committee chose to represent a newline.
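Because a char is really just a small integer, ordinary arithmetic works on ASCII codes. Here's a minimal sketch of that idea; the helper names (ascii_code, digit_value, to_upper_ascii) are made up for this example, and it assumes an ASCII-compatible character set:

```cpp
// Hypothetical helpers: treat chars as the small integers they are.
// Assumes an ASCII-compatible character set.

// ASCII code of a character, as a plain int.
int ascii_code(char c) {
    return (unsigned char)c;  // via unsigned char, so bytes >= 128 stay positive
}

// '0'..'9' are codes 48..57, so subtracting '0' gives the digit's value.
int digit_value(char c) {
    return c - '0';
}

// 'a'..'z' sit exactly 32 (0x20) above 'A'..'Z' in the table.
char to_upper_ascii(char c) {
    if (c >= 'a' && c <= 'z') return (char)(c - 32);
    return c;  // everything else is left alone
}
```

For example, ascii_code('A') is 65, and digit_value('7') is 7.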
You can print the whole ASCII table pretty easily, like:
for (int i=0;i<200;i++) {
char c=(char)i;
std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}
(Try this in NetRun now!)
You can think of ASCII codes as decimal or hexadecimal bytes. It's the same data, deep down.
Dec Hex Char        Dec Hex Char    Dec Hex Char     Dec Hex Char
-----------------------------------------------------------------
  0 00  NUL '\0'     32 20  SPACE    64 40  @         96 60  `
  1 01  SOH          33 21  !        65 41  A         97 61  a
  2 02  STX          34 22  "        66 42  B         98 62  b
  3 03  ETX          35 23  #        67 43  C         99 63  c
  4 04  EOT          36 24  $        68 44  D        100 64  d
  5 05  ENQ          37 25  %        69 45  E        101 65  e
  6 06  ACK          38 26  &        70 46  F        102 66  f
  7 07  BEL '\a'     39 27  '        71 47  G        103 67  g
  8 08  BS '\b'      40 28  (        72 48  H        104 68  h
  9 09  HT '\t'      41 29  )        73 49  I        105 69  i
 10 0A  LF '\n'      42 2A  *        74 4A  J        106 6A  j
 11 0B  VT '\v'      43 2B  +        75 4B  K        107 6B  k
 12 0C  FF '\f'      44 2C  ,        76 4C  L        108 6C  l
 13 0D  CR '\r'      45 2D  -        77 4D  M        109 6D  m
 14 0E  SO           46 2E  .        78 4E  N        110 6E  n
 15 0F  SI           47 2F  /        79 4F  O        111 6F  o
 16 10  DLE          48 30  0        80 50  P        112 70  p
 17 11  DC1          49 31  1        81 51  Q        113 71  q
 18 12  DC2          50 32  2        82 52  R        114 72  r
 19 13  DC3          51 33  3        83 53  S        115 73  s
 20 14  DC4          52 34  4        84 54  T        116 74  t
 21 15  NAK          53 35  5        85 55  U        117 75  u
 22 16  SYN          54 36  6        86 56  V        118 76  v
 23 17  ETB          55 37  7        87 57  W        119 77  w
 24 18  CAN          56 38  8        88 58  X        120 78  x
 25 19  EM           57 39  9        89 59  Y        121 79  y
 26 1A  SUB          58 3A  :        90 5A  Z        122 7A  z
 27 1B  ESC          59 3B  ;        91 5B  [        123 7B  {
 28 1C  FS           60 3C  <        92 5C  \ '\\'   124 7C  |
 29 1D  GS           61 3D  =        93 5D  ]        125 7D  }
 30 1E  RS           62 3E  >        94 5E  ^        126 7E  ~
 31 1F  US           63 3F  ?        95 5F  _        127 7F  DEL
ASCII codes less than 32 ("control characters") or 128 and above ("high
ASCII") show up in various weird ways depending on which machine and
web browser you're running.
Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters:
ASCII  x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x     \0                            \n       \r
1x
2x        !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3x     0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4x     @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5x     P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6x     `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7x     p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
8x     €     ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ     Ž
9x        ‘  ’  “  ”  •  –  —  ˜  ™  š  ›  œ     ž  Ÿ
Ax        ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬     ®  ¯
Bx     °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿
Cx     À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
Dx     Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß
Ex     à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
Fx     ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ
(Blank cells are control characters, spaces, or bytes with no assigned character.)
ASCII table. Horizontal axis gives the low hex digit,
vertical axis the high hex digit, and the entry is the character for
that byte value. E.g., "A" is 0x41.
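Looking a byte up in this table is just a matter of splitting it into its two hex digits. Here's a small sketch of that lookup; the function names are invented for illustration:

```cpp
// Hypothetical helpers for reading the table: split a byte into hex digits.

// High hex digit: picks the table row (the "4x" in 0x41).
int high_digit(unsigned char byte) {
    return byte >> 4;
}

// Low hex digit: picks the table column (the "x1" in 0x41).
int low_digit(unsigned char byte) {
    return byte & 0xF;
}

// Going the other way: glue a row and a column back into one byte.
unsigned char from_digits(int high, int low) {
    return (unsigned char)((high << 4) | low);
}
```

So 'A' (0x41) lives in row 4, column 1, and row 6, column 1 gives you 'a'.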
Inside a C++ string, the characters you type are automatically
stored as their ASCII byte values. You can enter hex bytes in the
middle of a string using a backslash-x: the string "Funky symbol is
\x80" ends with the byte 0x80, which shows up as a Euro sign (€) in the
Windows-1252 encoding used by the table above.
Possibly the world's most obtuse first program is:
std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";
(Try this in NetRun now!)
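If you'd like to make your own obtuse strings, here's a sketch of a helper that rewrites any text as backslash-x escapes. The name to_hex_escapes is made up for this example; nothing like it is built into C++:

```cpp
#include <string>

// Hypothetical helper: rewrite every byte of `text` as a \xHH escape.
std::string to_hex_escapes(const std::string &text) {
    const char *digits = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : text) {  // unsigned, so bytes >= 0x80 index correctly
        out += "\\x";                  // a literal backslash, then 'x'
        out += digits[byte >> 4];      // high hex digit
        out += digits[byte & 0xF];     // low hex digit
    }
    return out;
}
```

Calling to_hex_escapes("Hello World!\n") gives back the same escapes used in the program above. One caveat when pasting escapes into your own code: a \x escape swallows every hex digit that follows it, so write "\x41" "BC" as two adjacent string literals if you mean the byte 0x41 followed by the letters BC.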
Here are the bytes inside the "cout" object itself, printed out as ASCII characters. They look like random garbage, and printing stops at the first zero byte.
std::cout<<(const char *)&std::cout;
(Try this in NetRun now!)
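Printing raw bytes as characters stops at the first zero byte and can scramble your terminal, so it's often nicer to look at the bytes in hex. Here's a rough sketch; hex_dump is a name invented for this example:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: format the first `count` bytes at `start` as hex.
std::string hex_dump(const void *start, std::size_t count) {
    const unsigned char *bytes = (const unsigned char *)start;
    const char *digits = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < count; i++) {
        if (i > 0) out += ' ';         // space between bytes
        out += digits[bytes[i] >> 4];  // high hex digit
        out += digits[bytes[i] & 0xF]; // low hex digit
    }
    return out;
}
```

For the object above you'd call something like hex_dump(&std::cout, sizeof(std::cout)); what comes out depends entirely on your compiler and library.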
"sizeof": Get Number of Bytes
Eight bits make a "byte" (note: it's pronounced exactly like "bite",
but always spelled with a 'y'), although in some rare networking
manuals (and in French) the same eight bits would be called an
"octet" (hard drive sizes are in "Go" or "To", Giga-octets or Tera-octets, when sold in French).
In DOS and Windows programming, 16 bits is a "WORD", 32 bits is
a
"DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts
"word" means the machine's
natural binary processing size, which ranges from 32 to 64 bits
nowadays. "word" should be considered ambiguous; "bit" and "byte" have the same meaning everywhere.
Object                                   Overflow Value    Bits   Hex Digits (4 bits each)   Bytes (8 bits each)
Bit                                      2                 1      less than 1                less than 1
Byte, char                               256               8      2                          1
"short" (or Windows WORD)                65536             16     4                          2
"int" (Windows DWORD)                    >4 billion        32     8                          4
"long" (or in 32-bit C++, "long long")   >18 quintillion   64     16                         8
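The "overflow value" is the count where an unsigned number of that size wraps back around to zero. Here's a small sketch of the wraparound, assuming 8-bit chars and 16-bit shorts as in the table (the function names are just for illustration):

```cpp
#include <limits>

// Hypothetical helpers showing unsigned wraparound at the overflow value.
// Assumes 8-bit unsigned char and 16-bit unsigned short.

// The sum is computed as an int, then the cast keeps only the low 8 bits.
unsigned char wrap_add_byte(unsigned char a, unsigned char b) {
    return (unsigned char)(a + b);
}

// Same idea, keeping only the low 16 bits.
unsigned short wrap_add_short(unsigned short a, unsigned short b) {
    return (unsigned short)(a + b);
}

// The largest storable value is always the overflow value minus one.
unsigned long long max_byte()  { return std::numeric_limits<unsigned char>::max(); }
unsigned long long max_short() { return std::numeric_limits<unsigned short>::max(); }
```

So 200+100 in a byte is 300-256, or 44, and 65535+1 in a short is 0.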
There's a nice little builtin operator in C/C++ called "sizeof" that
returns the number of bytes (technically, the number of chars)
used by a variable or data type. Sadly, C/C++ only specify minimum
sizes for data types like "int", so the actual size depends on the
machine and compiler:
sizeof, in bytes:
Type          32-bit x86        32-bit PowerPC   64-bit x86    Java / C#
              (little endian)   (big endian)     or Itanium
char          1                 1                1             1 (byte), 2 (Char)
short         2                 2                2             2
int           4                 4                4             4
long          4                 4                8             8
long long     8                 8                8             (no need for long long)
void *        4                 4                8             (no pointers in Java)
float         4                 4                4             4
double        8                 8                8             8
long double   12                8                16            (no long double in Java)
Data model    "ILP32"           "ILP32"          "LP64"
Note the deciding difference between "32 bit machines" and "64 bit
machines" is the size of a pointer--4 or 8 bytes. "int" is 4
bytes on all the machines above. "long" is 8 bytes in Java and on most
64-bit machines (64-bit Windows is the exception: it keeps "long" at 4
bytes), and just 4 bytes on 32-bit machines.
Here's a program that prints out the above:
char c;
short s;
int i;
long l;
long long ll;
void *v;
float f;
double d;
long double ld;
std::cout<<"sizeof(char)=="<<sizeof(c)<<"\n";
std::cout<<"sizeof(short)=="<<sizeof(s)<<"\n";
std::cout<<"sizeof(int)=="<<sizeof(i)<<"\n";
std::cout<<"sizeof(long)=="<<sizeof(l)<<"\n";
std::cout<<"sizeof(long long)=="<<sizeof(ll)<<"\n";
std::cout<<"sizeof(void *)=="<<sizeof(v)<<"\n";
std::cout<<"sizeof(float)=="<<sizeof(f)<<"\n";
std::cout<<"sizeof(double)=="<<sizeof(d)<<"\n";
std::cout<<"sizeof(long double)=="<<sizeof(ld)<<"\n";
return 0;
(executable NetRun link)
Try this out on some different machines! Note that on some
Windows compilers, you might need to say "__int64" instead of "long
long". Also note that "long long" has nothing to do with the
Chinese concert pianist Lang Lang.
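If your code really needs a particular size, one option is to ask for it by name with the fixed-width types in <cstdint>, and have the compiler check your assumptions. This is only a sketch: pointer_bits and data_model are names invented for this example, and the three data models listed are just the common ones.

```cpp
#include <climits>
#include <cstdint>

// These checks happen at compile time, so a surprise stops the build.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");
static_assert(sizeof(std::int16_t) == 2, "int16_t is always 2 bytes");
static_assert(sizeof(std::int32_t) == 4, "int32_t is always 4 bytes");
static_assert(sizeof(std::int64_t) == 8, "int64_t is always 8 bytes");

// Hypothetical helper: pointer width in bits (32 or 64 on typical machines).
int pointer_bits() {
    return (int)(sizeof(void *) * CHAR_BIT);
}

// Hypothetical helper: guess the data model from the sizes of long and pointers.
const char *data_model() {
    if (sizeof(void *) == 4 && sizeof(long) == 4) return "ILP32";
    if (sizeof(void *) == 8 && sizeof(long) == 8) return "LP64";
    if (sizeof(void *) == 8 && sizeof(long) == 4) return "LLP64"; // 64-bit Windows
    return "other";
}
```

Unlike "long", a std::int64_t is 8 bytes no matter which machine you compile on.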
Big and Little Endian Memory Access
Let's say we ask the CPU to treat four bytes as a single integer, using a typecast like so:
const unsigned char table[]={
1,2,3,4
};
int foo(void) {
typedef int *myPtr;
myPtr p=(myPtr)table;
return p[0];
}
(Try this in NetRun now!)
This program returns "0x4030201" on an x86 machine, which is rather the
opposite of what you might expect. The mismatch here is that we write
(arabic) numerals with the lowest-valued digit on the right, so they grow
right-to-left (just like written Arabic), but we write table entries
(and everything else) left-to-right.
So the CPU reads the first, leftmost table entry (1) to get the
lowest-valued byte (0x01), which we write on the right side
(0x...01). Similarly, the last table entry (4) is interpreted as
the highest-valued byte (0x04), which we write on the left side
(0x04...).
But this depends on the CPU! All x86 CPUs start with the
lowest-valued byte (the "little end" of the integer comes first, hence
"little endian"),
but many other CPUs, such as the PowerPC, MIPS, and SPARC CPUs, start
with the highest-valued byte (the "big end" of the integer, hence "big
endian"). So this same code above returns 0x01020304 on a
PowerPC--try this!
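If you want the same answer on every CPU, you can assemble the integer from bytes yourself using shifts, which work on values rather than memory layout. Here's a sketch under that assumption; the function names are invented, and the table is the same 1,2,3,4 as above:

```cpp
#include <cstdint>
#include <cstring>

const unsigned char table[] = {1, 2, 3, 4};

// True if this CPU stores the lowest-valued byte of an integer first.
bool is_little_endian() {
    std::uint32_t one = 1;
    unsigned char first;
    std::memcpy(&first, &one, 1);  // peek at the first byte in memory
    return first == 1;
}

// First byte is the lowest-valued: the x86 interpretation, on any CPU.
std::uint32_t load_le32(const unsigned char *b) {
    return (std::uint32_t)b[0] | ((std::uint32_t)b[1] << 8) |
           ((std::uint32_t)b[2] << 16) | ((std::uint32_t)b[3] << 24);
}

// First byte is the highest-valued: the PowerPC interpretation, on any CPU.
std::uint32_t load_be32(const unsigned char *b) {
    return ((std::uint32_t)b[0] << 24) | ((std::uint32_t)b[1] << 16) |
           ((std::uint32_t)b[2] << 8) | (std::uint32_t)b[3];
}

// Whatever this CPU does natively; memcpy sidesteps alignment problems.
std::uint32_t load_native32(const unsigned char *b) {
    std::uint32_t value;
    std::memcpy(&value, b, sizeof(value));
    return value;
}
```

Here load_le32(table) is 0x04030201 and load_be32(table) is 0x01020304 everywhere; only load_native32(table) changes from machine to machine.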
The big and little endian naming confusion exists even in the
non-computer world. Consider that the following are all
little-endian (starting with the least-significant information):
- Fairbanks, Alaska, USA
- John Smith, Carpet Cleaner
- Pittsburgh Technical Institute
Yet the following are all big-endian (starting with the biggest information):
You can see big- and little-endian byte storage going not just from bytes to ints, but also from ints to bytes:
int foo(void) {
int x=0xa0b0c0d0; /* Integer value we'll pick apart into bytes */
typedef unsigned char *myTable; /* We'll make it an array of chars */
myTable table=(myTable)&x; /* point to the bytes of the integer x */
for (int i=0;i<4;i++) /* print each byte of the integer x */
std::cout<<std::hex<<(int)table[i]<<" ";
std::cout<<std::endl;
return 0;
}
(Try this in NetRun now!)
This code prints "d0 c0 b0 a0" on a little-endian machine--the first byte is the lowest-value "0xd0".
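You can also pick an integer apart with shifts instead of pointers, which gives the same bytes on any CPU. Here's a sketch; byte_at and byte_swap32 are names made up for this example:

```cpp
#include <cstdint>

// Hypothetical helper: byte number i of x, counting from the lowest-valued byte.
// This matches the memory order on a little-endian machine.
unsigned int byte_at(std::uint32_t x, int i) {
    return (x >> (8 * i)) & 0xFF;
}

// Hypothetical helper: reverse the four bytes of x, converting between
// the big-endian and little-endian layouts.
std::uint32_t byte_swap32(std::uint32_t x) {
    return (x >> 24)                   // highest byte moves to the bottom
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);                  // lowest byte moves to the top
}
```

For 0xa0b0c0d0, byte_at gives d0, c0, b0, a0 for i of 0 through 3, and byte_swap32 gives 0xd0c0b0a0.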
Machine Code as Bytes
Here's some x86 machine code encoded into a C++ string, and run on
the CPU. Once it's compiled, the bytes of this string work fine as CPU
machine code, just like the char arrays in the homework.
const char *fn="\xb8\x07\x00\x00\x00\xc3";
return ((int (*)(void))(fn))();
(Try this in NetRun now!)
Here's the same machine code, encoded into a C++ "long". Remember that the long is stored in memory little-endian!
const static long fn=0xc300000007b8;
return ((int (*)(void))(&fn))();
(Try this in NetRun now!)
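To check that the long really holds the same six bytes as the string, you can peel the bytes off lowest-valued first, which is the order a little-endian machine stores them. This is only a sketch with invented names. The decoder understands just this one pattern: byte 0xb8 (x86 "mov eax," with a 4-byte little-endian constant) followed by 0xc3 ("ret").

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: the `count` lowest bytes of `value`,
// in the order a little-endian machine stores them in memory.
std::vector<unsigned char> little_endian_bytes(std::uint64_t value, int count) {
    std::vector<unsigned char> bytes;
    for (int i = 0; i < count; i++)
        bytes.push_back((unsigned char)(value >> (8 * i)));
    return bytes;
}

// Hypothetical decoder for this one six-byte pattern only.
// Returns the constant the code would return, or -1 if the bytes don't match.
long long decode_return_constant(const std::vector<unsigned char> &code) {
    if (code.size() != 6 || code[0] != 0xb8 || code[5] != 0xc3) return -1;
    return (long long)code[1] | ((long long)code[2] << 8) |
           ((long long)code[3] << 16) | ((long long)code[4] << 24);
}
```

Feeding 0xc300000007b8 through little_endian_bytes with a count of 6 gives b8 07 00 00 00 c3, and decoding that gives 7.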
Note that newer x86 machines mark the stack with the "NX" (No eXecute) bit
to prevent the CPU from executing code there. This is a useful
security feature, but it means the above code crashes without the
"const static".
std::vector<char> fn;
fn.push_back(0xb8);
fn.push_back(0x07);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0xc3);
return ((int (*)(void))(&fn[0]))();
(Try this in NetRun now!)
Here, I'm putting the bytes of machine code into a std::vector.
This only works on my 32-bit machine; on my 64-bit machine,
std::vector's storage space is marked NX, so this code crashes rather
than running.