XML
CS 321 Lecture,
Dr. Lawlor, 2006/04/14
XML
(eXtensible Markup Language) is an international-standard (W3C) way of
representing data in a plain-text markup language. It's suspiciously
similar to HTML, but unlike HTML, XML lets you make up your own tags,
like a new "crap" tag you use for describing statements you don't
believe: "<crap>I am an egg</crap>".
XML is usually parsed by a dedicated generic XML parsing library.
Libraries are built into JavaScript and .NET, but for C/C++ you have to
download one. I recommend "expat" 1.0, because it's a small and simple
parser. There's also a newer interface called the DOM (Document Object
Model, a W3C standard) that hands you the parsed XML as a tree of
objects, but it's not used very often yet.
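To make the "generic library" idea concrete, here is a minimal sketch of a stock parser handling a made-up tag. It assumes Python's standard xml.dom.minidom module, which the lecture does not name; the function name crap_texts and the sample "note" document are hypothetical, chosen only for illustration.

```python
from xml.dom import minidom


def crap_texts(xml_text):
    """Return the text inside every <crap> element of the document."""
    doc = minidom.parseString(xml_text)  # build the whole DOM tree
    texts = []
    for node in doc.getElementsByTagName("crap"):
        # Join the element's direct text children (none if it is empty).
        texts.append("".join(child.data for child in node.childNodes
                             if child.nodeType == child.TEXT_NODE))
    return texts


print(crap_texts("<note>He said <crap>I am an egg</crap>.</note>"))
# prints ['I am an egg']
```

The parser was never told about "note" or "crap"; any well-formed tag names work the same way.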
The list of valid tags that can be used in a document, and the fashion
in which tags can be nested, can be stored in a Document Type Definition,
or DTD.
A "Validating" XML parser can check the tags against the DTD, and give
good error messages if the tags don't make sense--this means you don't
have to do as much error checking in your own code.
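As a rough illustration of what a DTD buys you, the sketch below reads the element declarations from a DTD embedded in the document and reports any tag that was never declared. This is a hand-rolled stand-in, not a validating parser: it assumes Python's standard xml.parsers.expat module (a non-validating Expat wrapper), checks tag names only, and ignores nesting rules. The sample document and the name undeclared_tags are hypothetical.

```python
import xml.parsers.expat

# A document carrying its own DTD; <bogus/> is deliberately undeclared.
DOC = """<!DOCTYPE note [
  <!ELEMENT note (crap*)>
  <!ELEMENT crap (#PCDATA)>
]>
<note><crap>I am an egg</crap><bogus/></note>"""


def undeclared_tags(xml_text):
    """List start tags whose names have no <!ELEMENT ...> declaration."""
    declared = set()
    unknown = []
    parser = xml.parsers.expat.ParserCreate()

    def on_element_decl(name, model):
        declared.add(name)  # remember each declared tag name

    def on_start(name, attrs):
        if name not in declared:
            unknown.append(name)

    parser.ElementDeclHandler = on_element_decl
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True: this is the final chunk
    return unknown


print(undeclared_tags(DOC))
```

A real validating parser would also enforce the content model, such as which tags may appear inside "note", and would produce the error messages for you.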
There's also a standard called XSL, the eXtensible Stylesheet Language, for tag-to-tag conversion of XML documents to HTML (or other XML-style formats); the conversion part of it is known as XSLT.
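The idea of a tag-to-tag conversion can be sketched without XSL at all. The example below is plain Python using the standard xml.etree.ElementTree module, not an XSL stylesheet, and the tag mapping (TAG_MAP) and function name (to_html) are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from our made-up tags to HTML tags.
TAG_MAP = {"note": "p", "crap": "s"}


def to_html(xml_text):
    """Rename every element per TAG_MAP; unknown tags become <span>."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # visits the root and all descendants
        elem.tag = TAG_MAP.get(elem.tag, "span")
    return ET.tostring(root, encoding="unicode")


print(to_html("<note>He said <crap>I am an egg</crap>.</note>"))
# prints <p>He said <s>I am an egg</s>.</p>
```

An XSL stylesheet expresses this kind of mapping declaratively, as templates matched against tags, instead of as a loop.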
Parsing XML with Expat
"Expat" is a simple
XML parser library. To use the library, you build an XML_Parser
object and register a set of functions for the library to call when it:
- Hits a start tag, like <foo bar="baz">
- Hits some user data, like "bob"
- Hits an end tag, like </foo>
Your "start", "data", and "end" functions can do anything with the
XML--save it to a tree structure, look through until they find the data
they're looking for, or just print or convert the stuff as it goes by.
Here's a tiny example (Directory, Zip, Tar-gzip)
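Separately from the downloadable example above, here is a minimal sketch of the same start/data/end callback pattern. It assumes Python's standard xml.parsers.expat module, which wraps the Expat library, rather than the C API the lecture uses; in C the corresponding registration calls are XML_SetElementHandler and XML_SetCharacterDataHandler. The function name trace_events is hypothetical.

```python
import xml.parsers.expat


def trace_events(xml_text):
    """Parse xml_text and return the callbacks Expat made, in order."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Deliver each run of character data in a single callback.
    parser.buffer_text = True

    def on_start(name, attrs):  # a start tag, like <foo bar="baz">
        events.append(("start", name, attrs))

    def on_data(data):          # user data, like "bob"
        events.append(("data", data))

    def on_end(name):           # an end tag, like </foo>
        events.append(("end", name))

    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_data
    parser.EndElementHandler = on_end
    parser.Parse(xml_text, True)  # True: this is the final chunk
    return events


for event in trace_events('<foo bar="baz">bob</foo>'):
    print(event)
# prints:
# ('start', 'foo', {'bar': 'baz'})
# ('data', 'bob')
# ('end', 'foo')
```

Here the handlers just record what goes by; they could equally build a tree or stop at the first piece of data they care about.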