Databases and SQL
CS 321 2007 Lecture, Dr. Lawlor
There are lots of stupid things about ordinary binary and ASCII files:
- You can't tell where the data for one object stops and the data
for the next object begins. This is why tagged binary formats and
XML-style tagged ASCII formats were developed.
- You can't easily add one item to every object in an existing file.
- You can't easily locate objects in the file, unless the file is
sorted or you build an little "index". As the contents of the
file changes, you need to keep the file sorted, or update the index,
which is a pain.
- It's easy for a file to get "corrupted", for example if a program crashes in the middle of writing it.
Databases fix every one of these shortcomings.
- Databases (like files) contain tables (like arrays) which contain rows (like structs) which contain fields
(just like fields of a struct). The database interface keeps
track of field, row, and table names, so you don't have to tag things.
- Databases make it very easy to add tables, rows, and fields, without losing the data already in the database.
- Where databases really shine is doing searching. Databases
can keep many indices for a single table, represent the indices with
efficient balanced search trees, update the indices to follow changes
in the table, and look up data via indices very quickly.
- Databases are designed to never get corrupted, even if programs
accessing the database, or the database server itself, crashes or loses
power during a write. The last few pieces of data might not
actually make it to disk, but the data in the database is going to be
OK.
So databases provide awesome, incredible features that often come in handy in real programs.
Accessing Database Data
Unfortunately, the only standard way to access data in a database is via SQL
(pronounced either one letter at a time like "ess-que-ell", or as one
word like "see-kwel"), which stands for "HORRIBLE 1970'S-ERA TELETYPE
DATABASE INTERFACE". No, that's not right. It's "Structured
Query Language". Anyway, it's an ASCII-type language--you store
the operation you want done on the database as a string, and the
database server parses the string to figure out what to do.
For example, if you've set up a table (see below) called "log", you can read everything out of the table with this SQL command:
SELECT * FROM log;
The "*" stands for the fields of the table you want to extract; in this
case, all the fields. "SELECT item FROM log;" would pull out just
the "item" field from the "log" table.
You can also use a "WHERE" clause to extract a subset of the data from
a table. This would extract all the rows of the "log" table where
the "importance" field is greater than or equal to 3:
SELECT * FROM log WHERE importance>=3;
You can string together "WHERE" clauses with logical operators like "AND". Keep in mind--the database usually does not have to read the whole database to find these rows, because it can use its index.
SELECT * FROM log WHERE importance>=3 AND name='cool';
SQL can even transform the data before comparing it in a "WHERE" clause. For example, you could use the SOUNDEX builtin function to grab all the log items whose name *sounds like* 'cool':
SELECT * FROM log WHERE SOUNDEX(name)=SOUNDEX('cool');
Builtin functions in MySQL
include UCASE (make uppercase), SOUNDEX (map to soundex codes), CONCAT
(paste together two strings), and a zillion date functions. You
can apply these functions anywhere, even as part of the "*", to
uppercase the output of the SELECT command:
SELECT UCASE(*) FROM log WHERE name='cool';
Inserting Database Data
You add rows to a database using the "INSERT" SQL command.
INSERT INTO log (user,item) VALUES ('bob',17);
This adds a row to the "log" table, setting the fields "user" and "item" to the values 'bob' and "17".
You can also modify existing rows with the "UPDATE" command. For
example, if 'bob' changes his name to 'bobina', you can update all his
"log" table entries with:
UPDATE log SET user='bobina' WHERE user='bob';
Warning! If you forget the "WHERE" clause, this will UPDATE every row of the "log" table! (Don't ask me how I know this...)
Important businesses like banks need the ability to make a sequence of SQL operations into a single indivisible unit called a "transaction".
For example, to deposit a check you might do a SELECT on the source
bank account number, UPDATE the source balance, SELECT the destination
account, and UPDATE its balance. It's crucially important that
all of these things happen together, so that for example a crash
doesn't result in the money being lost in limbo, debited from one
account but not credited to the other. So a bank will do a "BEGIN WORK;" command to start a transaction, issue their SQL, and finally do a "COMMIT;" to complete the transaction.
Setting up Databases and Tables (in MySQL)
One popular database server in the open-source world is MySQL.
MySQL stores the database in a file (some commercial databases prefer
to take over an entire disk). This file is accessed exclusively
by the "database server program" called "mysqld". This is not
your code, and it doesn't run your code. All it does is take SQL
queries from your program, and give you back the results. This is
like the "reserve desk" at a library, where they don't trust you to run
around in the stacks--instead, you (the program) have to ask the
librarian (the mysqld server) to get your data for you.
MySQL is network-accessible, and hence has its own conception of
account names and passwords--these have no relationship to the normal
OS accounts on the actual machine, they exist only for the
database. You can set your MySQL "root" password to "funker" like
this:
mysqladmin -u root password funker
You can then create a database, for example, called "testdb":
mysqladmin -u root -pfunker create testdb
Finally, you can issue SQL commands to this database by running:
mysql -u root -pfunker -Dtestdb
This will drop you into the MySQL command line.
There are SQL commands to create tables. For example:
CREATE TABLE log (
num INT NOT NULL auto_increment,
importance INT DEFAULT '0' NOT NULL,
date TIMESTAMP NOT NULL,
user VARCHAR(10) DEFAULT '',
item VARCHAR(40),
PRIMARY KEY (num),
KEY xdate(date),
KEY ximportance(importance)
);
SHOW TABLES;
DESCRIBE log;
This creates the "log" table with a bunch of fields. "num" and
"importance" are integer fields. "date" is a TIMESTAMP
field. "user" is a string with up to 10 characters. "item"
is a string with up to 40 characters. "xdate" and "ximportance"
are indices built to allow fast searching on the "date" and
"importance" fields.
Now you can apply a bunch of INSERT commands to fill the new "log" table with data.
INSERT INTO log (importance,user,item)
VALUES (1,"olawlor",'Created table; start of debugging');
INSERT INTO log (importance,user,item)
VALUES (0,"olawlor",'Checking out tables...');
INSERT INTO log (importance,user,item)
VALUES (0,"nobody",'Silly crap');
INSERT INTO log (importance,user,item)
VALUES (1,"nobody",'Not so silly no more!');
And then you can apply SELECT commands to pull data out of the "log" table.
SELECT * FROM log;
SELECT * FROM log WHERE importance=1;
SELECT * FROM log WHERE importance=1 AND user="nobody";
Cool, yes? There's also a set of C functions you can call to do
queries (as well as C#, Perl, Java, PHP, ...). For example:
MYSQL *mysql;
mysql=(MYSQL *)malloc(sizeof(MYSQL));
mysql_init(mysql);
mysql_connect(mysql,"localhost","root","funker");
mysql_select_db(mysql,"testdb");
MYSQL_RES *result=mysql_query(mysql,"INSERT INTO log (user,item) VALUES ('bill','Stuff');");
mysql_close(mysql);
Connecting to the database server is quite slow--it seems to take
milliseconds on my box. So you want to connect to the database
once, and then issue a bunch of queries. Each query reasonably
fast (assuming the data's still cached in memory!), in my tests taking
around 64 microseconds per access. This is a thousand times
slower than a single memory access, largely due to the overhead of
switching to the separate database server process and parsing your
ASCII request, but it's still way faster than a disk access. And
unlike memory or even most disk I/O, database operations will survive
across multiple runs of the program, and scale nicely to even very
large databases--multi-terabyte databases are common in the commercial
world!
Learn more by trying out some of the many tutorials on the net ("mysql tutorial" has 38 million hits):
- MySQL is installed in the "images\linux" and "images\h2k_mysql"
virtual machines on the 321 OS Sampler DVDs. Check out the
"images\h2k_mysql\NOTES.txt" file for commands to run.
- The MySQL documentation contains a decent tutorial on creating and connecting to databases.
- DevShed also has a decent MySQL tutorial.
- Virtual Workshops has a good SQL tutorial. Lots of nitty-gritty syntax illustrated.
- CS 425 is an entire course on databases--building them, optimizing their performance, implementing them, and so on!