ASCII and C Strings

CS 301 Lecture, Dr. Lawlor

Here's the American Standard Code for Information Interchange (ASCII), a simple way to represent the letters, digits, and punctuation of English text as numbers.  You can think of those numbers as decimal or hexadecimal:
Dec   Hex   Char
--------------------
0 00 NUL '\0'
1 01 SOH
2 02 STX
3 03 ETX
4 04 EOT
5 05 ENQ
6 06 ACK
7 07 BEL '\a'
8 08 BS '\b'
9 09 HT '\t'
10 0A LF '\n'
11 0B VT '\v'
12 0C FF '\f'
13 0D CR '\r'
14 0E SO
15 0F SI
16 10 DLE
17 11 DC1
18 12 DC2
19 13 DC3
20 14 DC4
21 15 NAK
22 16 SYN
23 17 ETB
24 18 CAN
25 19 EM
26 1A SUB
27 1B ESC
28 1C FS
29 1D GS
30 1E RS
31 1F US
Dec   Hex   Char
-----------------
32 20 SPACE
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
Dec   Hex   Char
--------------------
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \ '\\'
93 5D ]
94 5E ^
95 5F _
Dec   Hex   Char
----------------
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL

Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters:
ASCII x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x (control characters NUL through SI; see the table above)
1x (control characters DLE through US; see the table above)
2x   ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~ DEL
8x (mostly unprintable; any "high ASCII" glyphs here depend on the code page)
9x (mostly unprintable; any "high ASCII" glyphs here depend on the code page)
Ax   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ASCII table.  Horizontal axis gives the low hex digit, vertical axis the high hex digit, and the entry is the character for that hex code.  E.g., "A" is 0x41.

In C, C++, or NASM, you can just write 'A' (with single quotes) and the compiler will insert the number 0x41.  This is good, because it means you can compare "char" variables directly to character constants.  For example, this checks for uppercase letters:
    if ('A'<=c && c<='Z') ...

As another example, an extremely silly way to return 1 is to:
    return 'Q'-'P';

Theoretically, these examples wouldn't work if your compiler happened to use a character set other than ASCII, such as EBCDIC (which is a more convenient encoding on punched cards!).  I don't worry about this, personally--ASCII will never die out.

The bottom line to remember is that in ASCII, one character is one byte.
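
To see that, you can print the same one-byte value both ways (a tiny sketch; printf lives in <stdio.h>):
    char c='A';
    printf("%d %c\n",c,c); /* prints "65 A": the same byte shown as a number and as a character */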

C Strings

A "C string" is just an array of ASCII characters/bytes followed by a "terminating nul", which is an ASCII "nul" character/byte (that has the value zero).  So for example, the string "Yo!" actually occupies four bytes of memory: 'Y', 'o', '!', and the terminating 0.   This C code returns "4":
    char *c="Yo!";
    return sizeof(c);
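
Watch out, though: if c were declared as a *pointer* instead of an array, sizeof would report the size of the pointer itself (typically 4 or 8 bytes), not the 4 bytes of string storage.  A quick sketch of the difference:
    char arr[]="Yo!"; /* an array: sizeof gives the 4 bytes of string storage */
    char *ptr=arr;    /* a pointer: sizeof gives the pointer size, e.g. 8 bytes on a 64-bit machine */
    printf("%d %d\n",(int)sizeof(arr),(int)sizeof(ptr)); /* e.g. prints "4 8" */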

There's a bunch of handy routines built into the C standard library (#include <string.h>) to manipulate strings.  For example, you can copy a string and its terminating zero byte with "strcpy":
    char *c="Yo!";
    char d[20];
    strcpy(d,c); /* copies bytes from c into d */
    printf("%s\n",d);

There's also a nice routine "strlen" that returns the length of the string in bytes NOT counting the terminating zero.
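
For example, a quick sketch: strlen says "Yo!" is 3 bytes long, even though it occupies 4 bytes of storage:
    const char *c="Yo!";
    printf("%d\n",(int)strlen(c)); /* prints 3: 'Y', 'o', '!' -- the terminating 0 isn't counted */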

In C or C++, the compiler treats a double-quoted string as a *pointer* to the first character of the string.  So "A" takes 2 bytes, and has some weird pointer value; but 'A' takes 1 byte, and has the value 0x41.
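
Here's a little sketch of that difference (sizeof of a string literal counts the terminating 0 too):
    const char *p="A"; /* a pointer to 2 bytes of storage: 'A' then the terminating 0 */
    printf("%d %d\n",(int)sizeof("A"),p[0]); /* prints "2 65": two bytes of storage, first byte is 0x41 */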

The standard way to walk through a C string is to keep incrementing the pointer until you hit the terminating zero:
char d[20]; /* String storage */
strcpy(d,"Yo!"); /* Fill out string */

char *c=d; /* Points to start of string */
while (*c!=0) /* While we haven't hit the end of the string... */
{
    if (*c == 'o') {*c='a';} /* Read or write this character */
    c++; /* Advance to the next character */
}

printf("%s\n",d);
(executable NetRun link)

The whole middle loop can in fact be written as a single for loop:
    for (char *c=d;*c!=0;c++)  if (*c=='o') *c='a';
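
In fact, the same pointer-walking pattern is all strlen needs; here's a rough sketch (the name my_strlen is just made up for illustration):
int my_strlen(const char *c) /* counts bytes up to, but not including, the terminating 0 */
{
    int n=0;
    while (*c!=0) { n++; c++; }
    return n;
}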

And a string copy (just like strcpy) can be written like this:
char ds[20]; /* destination */
char *d=ds;
char *s="Yo!"; /* source */

while (0!=(*d++=*s++)) {} /* copy one byte and advance both pointers; the loop stops right after copying the terminating 0 */

printf("%s\n",ds);
(executable NetRun link)


Check out this little example of how to allocate, copy, modify, and print strings in pure assembly.

What happens if you call strcpy, but the destination isn't big enough to store all the characters?  Well, strcpy doesn't know that, and just keeps blindly writing along!  So if you allocate a little string buffer on the stack, and the string you're copying is longer than the buffer, some of the stack gets overwritten!  Not only will this overwritten stack cause horrible crashes, it can actually allow an attacker to execute his own code on the machine, just by passing it a carefully crafted "buffer overflow" string.   This is bad.   So you MUST either:
  1. Make sure, every time you call strcpy, sprintf, cin>>c_string, or *anything* that deals with C strings, that the source is small enough to fit inside the destination--if it won't fit, figure out a way to enlarge the destination or safely abort.
  2. Never use C string buffers.  Use std::string instead, which automatically resizes itself to fit the data you copy into it.
I prefer the second approach, myself.
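
Here's a rough sketch of the two approaches side by side (assuming the usual headers <cstring>, <string>, and <iostream>; the buffer and strings are made up for illustration):
char small[4]; /* room for only 3 characters plus the terminating 0 */
strcpy(small,"Yo!"); /* OK: exactly fits */
/* strcpy(small,"far too long");   <- would overrun the buffer: undefined behavior */

std::string safe="far too long"; /* std::string grows to fit whatever you assign */
safe+=" and it can keep growing";
std::cout<<safe<<"\n";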

Unicode

That's fine for English (which is nowadays always encoded using ASCII), but some languages use accented characters, and others use little ideographic pictures, so not every language's characters fit into a single byte.  Hence the invention of Unicode, which in principle uses one int per character (in C, a "wchar_t").  To work with normal text files, there's a way to encode Unicode characters into 8-bit chunks called UTF-8.  UTF-8 is defined so that plain old ASCII files work as expected (one byte, one character), but the high ASCII range is redefined to allow multi-byte characters.  This means you can mostly get away with ignoring other character sets and just treating all text as ASCII.  This only causes problems when rendering one character at a time, or doing operations on each character.
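
For example, here's a small sketch: the accented e in "café" is the two-byte UTF-8 sequence 0xC3 0xA9, so strlen, which counts bytes, reports 5 for what a human reads as 4 characters:
    const char *s="caf\xC3\xA9"; /* "café", with the accented e written as its UTF-8 bytes 0xC3 0xA9 */
    printf("%d bytes\n",(int)strlen(s)); /* prints "5 bytes" for a 4-character string */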

Some systems support "wide" character types like "wchar_t" (for Unicode characters), "std::wstring", "std::wcin", and "std::wcout" (for Unicode input and output).  The idea is that using wide characters would allow you to treat all Unicode and ASCII characters in the same way.  Sadly, these don't seem to work on my Linux machines yet...
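
For the record, here's a rough sketch of what wide-character output is supposed to look like; whether it actually prints correctly depends on the locale setup, which is exactly the part that tends to be broken:
#include <iostream>
#include <locale>
#include <string>
int main(void) {
    std::locale::global(std::locale("")); /* ask for the system's default locale; this step may throw or do nothing useful if the locale isn't configured */
    std::wstring w=L"wide text"; /* a string of wchar_t characters */
    std::wcout<<w<<L" has "<<w.size()<<L" characters\n";
    return 0;
}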