String Parsing
CS 321 Lecture,
Dr. Lawlor, 2006/03/22
Reading text files can be quite painful. The problem is that
you've often got to slog through the contents of the file
yourself.
Check out these string input examples (Directory, Zip, Tar-gzip).
For example, to read a std::string from the standard input, you could do (like in this example):
std::string s;
std::cin >> s; // reads one whitespace-delimited word
But this stops reading at the first whitespace character (given the input
"hello world", it only reads "hello"). If you want to allow
spaces, and read all the way to, say, a semicolon, then somebody's got
to walk through the characters in a little loop until you hit the
semicolon. Sometimes it's possible to find a library routine to do
this--for example, std::istream::getline takes a "terminator" character
you can set to semicolon, although it reads into a bare "char *", not a
std::string (the free function std::getline(std::cin, s, ';') does read
into a std::string). So sometimes you have to build the little loop
yourself, like in this example.
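Here's a minimal sketch of that loop (not necessarily identical to the linked example): it pulls one character at a time off std::cin and stops at the first semicolon or at end-of-file.

#include <iostream>
#include <string>

int main(void) {
	std::string s;
	char c;
	// Grab characters one at a time, stopping at the semicolon (or end-of-input).
	while (std::cin.get(c) && c != ';')
		s += c;
	std::cout << "Read: \"" << s << "\"\n";
	return 0;
}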
The loop is nasty, and hence it's a good idea to hide it inside a subroutine, like in this example.
Putting code into a subroutine usually means generalizing what you
could otherwise have hardcoded, so in this case we pass the list of
terminator characters as a string. If this string gets long,
checking each character against every possible terminator would be slow.
So it sometimes makes sense to build a little table, to speed up this
character checking. The idea is to index the table by the next
character, which immediately tells you if you should stop reading or
not. See this example.
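For instance, here's a sketch of such a subroutine (the name read_until and its interface are made up for illustration; the class example may differ). It builds a 256-entry table indexed by the byte value, so the "should I stop?" test is a single array lookup no matter how many terminators you pass in.

#include <iostream>
#include <string>

// Read characters from 'in' until hitting any character listed in
// 'terminators', or end-of-file.  The 256-entry table turns the
// terminator test into one array lookup per character.
std::string read_until(std::istream &in, const std::string &terminators) {
	bool stop[256] = {false};
	for (unsigned int i = 0; i < terminators.size(); i++)
		stop[(unsigned char)terminators[i]] = true;

	std::string result;
	char c;
	while (in.get(c) && !stop[(unsigned char)c])
		result += c;
	return result;
}

int main(void) {
	std::string field = read_until(std::cin, ";,\n");
	std::cout << "Got: \"" << field << "\"\n";
	return 0;
}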
It's possible to use this table-driven approach to parse really
complicated languages--take CS 331 (computer languages), or look at the
parser code generated by YACC (yet another compiler compiler) to see how this is done.
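As a small taste, here's a sketch of the table-driven idea applied to scanning (the names and tables here are made up for illustration): a character-class table plus a state-transition table is enough to split "12+345" into number and operator tokens, and generated scanners and parsers build much bigger tables of exactly this flavor.

#include <iostream>
#include <string>

enum CharClass { C_DIGIT = 0, C_OTHER = 1 };
enum State     { S_START = 0, S_NUMBER = 1 };

int main(void) {
	// Character-class table, indexed by the byte value.
	CharClass cls[256];
	for (int i = 0; i < 256; i++) cls[i] = C_OTHER;
	for (int i = '0'; i <= '9'; i++) cls[i] = C_DIGIT;

	// State-transition table, indexed by [current state][character class].
	State next[2][2] = {
		{ S_NUMBER, S_START },  /* from S_START  */
		{ S_NUMBER, S_START }   /* from S_NUMBER */
	};

	std::string input = "12+345", token;
	State state = S_START;
	for (unsigned int i = 0; i < input.size(); i++) {
		unsigned char c = input[i];
		State s = next[state][cls[c]];
		if (s != state && !token.empty()) { // class changed: token boundary
			std::cout << "token: " << token << "\n";
			token = "";
		}
		token += (char)c;
		state = s;
	}
	if (!token.empty()) std::cout << "token: " << token << "\n";
	return 0;
}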
International Text
So far, we've treated strings as arrays of bytes, and assumed
characters were the same as bytes. That's fine for English (which
is nowadays almost always encoded using ASCII),
but some languages use accented characters, and others use
ideographic symbols, and so can't fit all their characters into a
single byte. Hence the invention of Unicode,
which assigns each character a numeric code point; in C and C++ a whole
code point is usually stored in an int-sized type called "wchar_t". To
work with normal text files, there's a standard way to encode Unicode
characters into 8-bit chunks called UTF-8.
UTF-8 is defined in such a way that plain old ASCII files work as
expected (one byte, one character), while byte values above 127 (the old
"high ASCII" range) are used to build multi-byte characters. This means you can mostly get away
with ignoring other character sets, and just treat all text as
ASCII. This only causes problems when rendering one character at
a time, or doing operations on each character.
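To see the encoding in action, here's a sketch that dumps the bytes of the UTF-8 string "café" (written with explicit hex escapes so it doesn't depend on how the source file is saved): the plain ASCII letters take one byte each, while the accented é (code point U+00E9) takes two bytes, both with the high bit set.

#include <cstdio>
#include <cstring>

int main(void) {
	const char *s = "caf\xc3\xa9"; /* the bytes of "café" in UTF-8 */
	// Prints 0x63 0x61 0x66 ('c' 'a' 'f', one ASCII byte each), then
	// 0xc3 0xa9, which together encode U+00E9 (e with acute accent).
	for (unsigned int i = 0; i < std::strlen(s); i++)
		std::printf("byte %u: 0x%02x\n", i, (unsigned char)s[i]);
	return 0;
}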
Some systems support "wide" characters: the "wchar_t" type holds a
Unicode character, "std::wstring" is a string of them, and "std::wcin"
and "std::wcout" do Unicode input and output. The idea is that using
wide characters would allow you to treat all Unicode and ASCII
characters in the same way. Sadly, these don't seem to work on my Linux machines yet...
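For completeness, a wide-character "hello" looks something like this sketch; whether the non-ASCII character actually displays correctly depends on the locale setup, which is exactly the part that tends to fail.

#include <iostream>
#include <string>

int main(void) {
	// One wchar_t per character, so one array entry per Unicode character
	// instead of one per byte.
	std::wstring w = L"wide hello";
	w += (wchar_t)0x00e9; // append U+00E9 (e with acute accent)
	std::wcout << w << std::endl; // may print garbage without the right locale
	return 0;
}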