Databases and SQL

CS 321 2007 Lecture, Dr. Lawlor

There are lots of stupid things about ordinary binary and ASCII files:

You can't tell where the data for one object stops and the data for the next object begins. This is why tagged binary formats and XML-style tagged ASCII formats were developed.
You can't easily add one item to every object in an existing file.
You can't easily locate objects in the file, unless the file is sorted or you build an little "index". As the contents of the file changes, you need to keep the file sorted, or update the index, which is a pain.
It's easy for a file to get "corrupted", for example if a program crashes in the middle of writing it.

Databases fix every one of these shortcomings.

Databases (like files) contain tables (like arrays) which contain rows (like structs) which contain fields (just like fields of a struct). The database interface keeps track of field, row, and table names, so you don't have to tag things.
Databases make it very easy to add tables, rows, and fields, without losing the data already in the database.
Where databases really shine is doing searching. Databases can keep many indices for a single table, represent the indices with efficient balanced search trees, update the indices to follow changes in the table, and look up data via indices very quickly.
Databases are designed to never get corrupted, even if programs accessing the database, or the database server itself, crashes or loses power during a write. The last few pieces of data might not actually make it to disk, but the data in the database is going to be OK.

So databases provide awesome, incredible features that often come in handy in real programs.

Accessing Database Data

Unfortunately, the only standard way to access data in a database is via SQL (pronounced either one letter at a time like "ess-que-ell", or as one word like "see-kwel"), which stands for "HORRIBLE 1970'S-ERA TELETYPE DATABASE INTERFACE". No, that's not right. It's "Structured Query Language". Anyway, it's an ASCII-type language--you store the operation you want done on the database as a string, and the database server parses the string to figure out what to do.

For example, if you've set up a table (see below) called "log", you can read everything out of the table with this SQL command:

	SELECT * FROM log;

The "*" stands for the fields of the table you want to extract; in this case, all the fields. "SELECT item FROM log;" would pull out just the "item" field from the "log" table.

You can also use a "WHERE" clause to extract a subset of the data from a table. This would extract all the rows of the "log" table where the "importance" field is greater than or equal to 3:

	SELECT * FROM log WHERE importance>=3;

You can string together "WHERE" clauses with logical operators like "AND". Keep in mind--the database usually does not have to read the whole database to find these rows, because it can use its index.

	SELECT * FROM log WHERE importance>=3 AND name='cool';

SQL can even transform the data before comparing it in a "WHERE" clause. For example, you could use the SOUNDEX builtin function to grab all the log items whose name *sounds like* 'cool':

	SELECT * FROM log WHERE SOUNDEX(name)=SOUNDEX('cool');

Builtin functions in MySQL include UCASE (make uppercase), SOUNDEX (map to soundex codes), CONCAT (paste together two strings), and a zillion date functions. You can apply these functions anywhere, even as part of the "*", to uppercase the output of the SELECT command:

	SELECT UCASE(*) FROM log WHERE name='cool';

Inserting Database Data

You add rows to a database using the "INSERT" SQL command.

	INSERT INTO log (user,item) VALUES ('bob',17);

This adds a row to the "log" table, setting the fields "user" and "item" to the values 'bob' and "17".

You can also modify existing rows with the "UPDATE" command. For example, if 'bob' changes his name to 'bobina', you can update all his "log" table entries with:

	UPDATE log SET user='bobina' WHERE user='bob';

Warning! If you forget the "WHERE" clause, this will UPDATE every row of the "log" table! (Don't ask me how I know this...)

Important businesses like banks need the ability to make a sequence of SQL operations into a single indivisible unit called a "transaction". For example, to deposit a check you might do a SELECT on the source bank account number, UPDATE the source balance, SELECT the destination account, and UPDATE its balance. It's crucially important that all of these things happen together, so that for example a crash doesn't result in the money being lost in limbo, debited from one account but not credited to the other. So a bank will do a "BEGIN WORK;" command to start a transaction, issue their SQL, and finally do a "COMMIT;" to complete the transaction.

Setting up Databases and Tables (in MySQL)

One popular database server in the open-source world is MySQL.

MySQL stores the database in a file (some commercial databases prefer to take over an entire disk). This file is accessed exclusively by the "database server program" called "mysqld". This is not your code, and it doesn't run your code. All it does is take SQL queries from your program, and give you back the results. This is like the "reserve desk" at a library, where they don't trust you to run around in the stacks--instead, you (the program) have to ask the librarian (the mysqld server) to get your data for you.

MySQL is network-accessible, and hence has its own conception of account names and passwords--these have no relationship to the normal OS accounts on the actual machine, they exist only for the database. You can set your MySQL "root" password to "funker" like this:

	mysqladmin -u root password funker

You can then create a database, for example, called "testdb":

	mysqladmin -u root -pfunker create testdb

Finally, you can issue SQL commands to this database by running:

	mysql -u root -pfunker -Dtestdb

This will drop you into the MySQL command line.

There are SQL commands to create tables. For example:

CREATE TABLE log (
        num INT NOT NULL auto_increment,
        importance INT DEFAULT '0' NOT NULL,
        date TIMESTAMP NOT NULL,
        user VARCHAR(10) DEFAULT '',
        item VARCHAR(40),
        PRIMARY KEY (num),
        KEY xdate(date),
        KEY ximportance(importance)
);

SHOW TABLES;
DESCRIBE log;

This creates the "log" table with a bunch of fields. "num" and "importance" are integer fields. "date" is a TIMESTAMP field. "user" is a string with up to 10 characters. "item" is a string with up to 40 characters. "xdate" and "ximportance" are indices built to allow fast searching on the "date" and "importance" fields.

Now you can apply a bunch of INSERT commands to fill the new "log" table with data.

INSERT INTO log (importance,user,item) 
        VALUES (1,"olawlor",'Created table; start of debugging');
INSERT INTO log (importance,user,item) 
        VALUES (0,"olawlor",'Checking out tables...');
INSERT INTO log (importance,user,item) 
        VALUES (0,"nobody",'Silly crap');
INSERT INTO log (importance,user,item) 
        VALUES (1,"nobody",'Not so silly no more!');

And then you can apply SELECT commands to pull data out of the "log" table.

SELECT * FROM log;
SELECT * FROM log WHERE importance=1;
SELECT * FROM log WHERE importance=1 AND user="nobody";

Cool, yes? There's also a set of C functions you can call to do queries (as well as C#, Perl, Java, PHP, ...). For example:

  MYSQL *mysql;
  mysql=(MYSQL *)malloc(sizeof(MYSQL));
  mysql_init(mysql);
  mysql_connect(mysql,"localhost","root","funker");
  mysql_select_db(mysql,"testdb");
  MYSQL_RES *result=mysql_query(mysql,"INSERT INTO log (user,item) VALUES ('bill','Stuff');");
  mysql_close(mysql);

Connecting to the database server is quite slow--it seems to take milliseconds on my box. So you want to connect to the database once, and then issue a bunch of queries. Each query reasonably fast (assuming the data's still cached in memory!), in my tests taking around 64 microseconds per access. This is a thousand times slower than a single memory access, largely due to the overhead of switching to the separate database server process and parsing your ASCII request, but it's still way faster than a disk access. And unlike memory or even most disk I/O, database operations will survive across multiple runs of the program, and scale nicely to even very large databases--multi-terabyte databases are common in the commercial world!

Learn more by trying out some of the many tutorials on the net ("mysql tutorial" has 38 million hits):

MySQL is installed in the "images\linux" and "images\h2k_mysql" virtual machines on the 321 OS Sampler DVDs. Check out the "images\h2k_mysql\NOTES.txt" file for commands to run.
The MySQL documentation contains a decent tutorial on creating and connecting to databases.
DevShed also has a decent MySQL tutorial.
Virtual Workshops has a good SQL tutorial. Lots of nitty-gritty syntax illustrated.
CS 425 is an entire course on databases--building them, optimizing their performance, implementing them, and so on!