ASCII and C Strings

CS 301 Lecture, Dr. Lawlor

Here's the American Standard Code for Information Interchange (ASCII), a simple way to represent the letters, digits, and punctuation of English text as numbers.  You can think of those numbers as decimal or hexadecimal:
Dec   Hex   Char
--------------------
0 00 NUL '\0'
1 01 SOH
2 02 STX
3 03 ETX
4 04 EOT
5 05 ENQ
6 06 ACK
7 07 BEL '\a'
8 08 BS '\b'
9 09 HT '\t'
10 0A LF '\n'
11 0B VT '\v'
12 0C FF '\f'
13 0D CR '\r'
14 0E SO
15 0F SI
16 10 DLE
17 11 DC1
18 12 DC2
19 13 DC3
20 14 DC4
21 15 NAK
22 16 SYN
23 17 ETB
24 18 CAN
25 19 EM
26 1A SUB
27 1B ESC
28 1C FS
29 1D GS
30 1E RS
31 1F US
Dec   Hex   Char
-----------------
32 20 SPACE
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
Dec   Hex   Char
--------------------
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \ '\\'
93 5D ]
94 5E ^
95 5F _
Dec   Hex   Char
----------------
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL

Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters:
ASCII x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x (control characters NUL through SI; see the table above)
1x (control characters DLE through US; see the table above)
2x   ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~ DEL
8x (mostly unprintable; any "high ASCII" glyphs here depend on the code page)
9x (mostly unprintable; any "high ASCII" glyphs here depend on the code page)
Ax   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ASCII table.  Horizontal axis gives the low hex digit, vertical axis the high hex digit, and the entry is the character for that hex code.  E.g., "A" is 0x41.

In C, C++, or NASM, you can just write 'A' (with single quotes) and the compiler will insert the number 0x41.  This is good, because it means you can compare "char" variables directly to character constants.  For example, this checks for uppercase letters:
    if ('A'<=c && c<='Z') ...

As another example, an extremely silly way to return 1 is to:
    return 'Q'-'P';

Theoretically, these examples wouldn't work if your compiler happened to use a character set other than ASCII, such as EBCDIC (which is a more convenient encoding on punched cards!).  I don't worry about this, personally--ASCII will never die out.

The bottom line to remember is that in ASCII, one character is one byte.
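
To see that, you can print the same one-byte value both ways (a tiny sketch; printf lives in <stdio.h>):
    char c='A';
    printf("%d %c\n",c,c); /* prints "65 A": the same byte shown as a number and as a character */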

C Strings

A "C string" is just an array of ASCII characters/bytes followed by a "terminating nul", which is an ASCII "nul" character/byte (that has the value zero).  So for example, the string "Yo!" actually occupies four bytes of memory: 'Y', 'o', '!', and the terminating 0.   This C code returns "4":
    char *c="Yo!";
    return sizeof(c);
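
Watch out, though: if c were declared as a *pointer* instead of an array, sizeof would report the size of the pointer itself (typically 4 or 8 bytes), not the 4 bytes of string storage.  A quick sketch of the difference:
    char arr[]="Yo!"; /* an array: sizeof gives the 4 bytes of string storage */
    char *ptr=arr;    /* a pointer: sizeof gives the pointer size, e.g. 8 bytes on a 64-bit machine */
    printf("%d %d\n",(int)sizeof(arr),(int)sizeof(ptr)); /* e.g. prints "4 8" */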

There's a bunch of handy routines built into the C standard library (#include <string.h>) to manipulate strings.  For example, you can copy a string and its terminating zero byte with "strcpy":
    char *c="Yo!";
    char d[20];
    strcpy(d,c); /* copies bytes from c into d */
    printf("%s\n",d);

There's also a nice routine "strlen" that returns the length of the string in bytes NOT counting the terminating zero.
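
For example, a quick sketch: strlen says "Yo!" is 3 bytes long, even though it occupies 4 bytes of storage:
    const char *c="Yo!";
    printf("%d\n",(int)strlen(c)); /* prints 3: 'Y', 'o', '!' -- the terminating 0 isn't counted */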

In C or C++, the compiler treats a double-quoted string as a *pointer* to the first character of the string.  So "A" takes 2 bytes, and has some weird pointer value; but 'A' takes 1 byte, and has the value 0x41.
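
Here's a little sketch of that difference (sizeof of a string literal counts the terminating 0 too):
    const char *p="A"; /* a pointer to 2 bytes of storage: 'A' then the terminating 0 */
    printf("%d %d\n",(int)sizeof("A"),p[0]); /* prints "2 65": two bytes of storage, first byte is 0x41 */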

The standard way to walk through a C string is to keep incrementing the pointer until you hit the terminating zero:
char d[20]; /* String storage */
strcpy(d,"Yo!"); /* Fill out string */

char *c=d; /* Points to start of string */
while (*c!=0) /* While we haven't hit the end of the string... */
{
    if (*c == 'o') {*c='a';} /* Read or write this character */
    c++; /* Advance to the next character */
}

printf("%s\n",d);
(executable NetRun link)

The whole middle loop can in fact be written as a single for loop:
    for (char *c=d;*c!=0;c++)  if (*c=='o') *c='a';
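
In fact, the same pointer-walking pattern is all strlen needs; here's a rough sketch (the name my_strlen is just made up for illustration):
int my_strlen(const char *c) /* counts bytes up to, but not including, the terminating 0 */
{
    int n=0;
    while (*c!=0) { n++; c++; }
    return n;
}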

And a string copy (just like strcpy) can be written like this:
char ds[20]; /* destination */
char *d=ds;
char *s="Yo!"; /* source */

while (0!=(*d++=*s++)) {} /* copy one byte and advance both pointers; the loop stops right after copying the terminating 0 */

printf("%s\n",ds);
(executable NetRun link)


Check out this little example of how to allocate, copy, modify, and print strings in pure assembly.

What happens if you call strcpy, but the destination isn't big enough to store all the characters?  Well, strcpy doesn't know that, and just keeps blindly writing along!  So if you allocate a little string buffer on the stack, and the string you're copying is longer than the buffer, some of the stack gets overwritten!  Not only will this overwritten stack cause horrible crashes, it can actually allow an attacker to execute his own code on the machine, just by passing it a carefully crafted "buffer overflow" string.   This is bad.   So you MUST either:
  1. Make sure, every time you call strcpy, sprintf, cin>>c_string, or *anything* that deals with C strings, that the source is small enough to fit inside the destination--if it won't fit, figure out a way to enlarge the destination or safely abort.
  2. Never use C string buffers.  Use std::string instead, which automatically resizes itself to fit the data you copy into it.
I prefer the second approach, myself.
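
Here's a rough sketch of the two approaches side by side (assuming the usual headers <cstring>, <string>, and <iostream>; the buffer and strings are made up for illustration):
char small[4]; /* room for only 3 characters plus the terminating 0 */
strcpy(small,"Yo!"); /* OK: exactly fits */
/* strcpy(small,"far too long");   <- would overrun the buffer: undefined behavior */

std::string safe="far too long"; /* std::string grows to fit whatever you assign */
safe+=" and it can keep growing";
std::cout<<safe<<"\n";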

Unicode

That's fine for English (which is nowadays always encoded using ASCII), but some languages use accented characters, and others use little ideographic pictures, so not every language's characters fit into a single byte.  Hence the invention of Unicode, which in principle uses one int per character (in C, a "wchar_t").  To work with normal text files, there's a way to encode Unicode characters into 8-bit chunks called UTF-8.  UTF-8 is defined so that plain old ASCII files work as expected (one byte, one character), but the high ASCII range is redefined to allow multi-byte characters.  This means you can mostly get away with ignoring other character sets and just treating all text as ASCII.  This only causes problems when rendering one character at a time, or doing operations on each character.
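
For example, here's a small sketch: the accented e in "café" is the two-byte UTF-8 sequence 0xC3 0xA9, so strlen, which counts bytes, reports 5 for what a human reads as 4 characters:
    const char *s="caf\xC3\xA9"; /* "café", with the accented e written as its UTF-8 bytes 0xC3 0xA9 */
    printf("%d bytes\n",(int)strlen(s)); /* prints "5 bytes" for a 4-character string */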

Some systems support "wide" character types like "wchar_t" (for Unicode characters), "std::wstring", "std::wcin", and "std::wcout" (for Unicode input and output).  The idea is that using wide characters would allow you to treat all Unicode and ASCII characters in the same way.  Sadly, these don't seem to work on my Linux machines yet...
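
For the record, here's a rough sketch of what wide-character output is supposed to look like; whether it actually prints correctly depends on the locale setup, which is exactly the part that tends to be broken:
#include <iostream>
#include <locale>
#include <string>
int main(void) {
    std::locale::global(std::locale("")); /* ask for the system's default locale; this step may throw or do nothing useful if the locale isn't configured */
    std::wstring w=L"wide text"; /* a string of wchar_t characters */
    std::wcout<<w<<L" has "<<w.size()<<L" characters\n";
    return 0;
}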