Death and Mayhem caused by Threads
CS 321 Lecture,
Dr. Lawlor, 2006/02/13
As we've seen, threads can cause weird race conditions. Sometimes race conditions affect code that's really important.
Deadlock on Mars
Processes and threads have a
"scheduling priority"--this controls the amount of CPU time dedicated to
running the process. Linux does this with a "nice" value, where a
higher nice value means a lower priority (because you're being nice to the
other processes that want the CPU). Nice values run from -20 to 19. You can run any program with a nice value using the "nice"
command, like:
nice -n 19 ./myprogram
In Linux, the "nice" value just scales the length of the process
timeslice (see linux/kernel/sched.c:task_timeslice). This is a
good thing, because it means everything on the system keeps running; the low-priority stuff just runs slower.
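A process can also change its own nice value while it runs. Here is a minimal sketch, assuming a Unix-like system and Python's standard "os" module; the increment of 5 is an arbitrary choice for illustration:

```python
import os

# An increment of 0 changes nothing and just reports the current nice value.
before = os.nice(0)

# Ask to be 5 steps "nicer" (lower priority). Ordinary users may raise their
# nice value but normally may not lower it again.
after = os.nice(5)

print("nice value went from", before, "to", after)
```

The kernel caps the result at the maximum nice value, so the change can be smaller than 5 if the process was already running at low priority.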
But Linux, like most other systems, also supports "priority classes",
where an item in the high-priority class will totally dominate the CPU,
and things in lower priority classes will only run if there's no
high-class work available. This total denial of the lower classes
can result in big problems. One of the simplest happens like this:
- Sally Schmaux, a low-priority process, grabs the disk lock and starts doing her unimportant disk stuff.
- Victor Vinci, a high-priority process, wakes up and steals the CPU from Sally, since Victor has important work to do.
- Victor needs to do something on the disk, so he grabs the disk lock.
- The disk lock is held by Sally, so Victor waits--he spins for a while, but then gives up the CPU voluntarily with a call like "sleep".
- The OS will *never* choose Sally over Victor, so it will always wake Victor up again.
This is a classic deadlock, usually called "priority inversion"--Victor is waiting for Sally to give up the
disk lock, but Sally can't give up the disk lock until Victor gives her
the CPU. The solution is called "priority inheritance"--Sally's
unimportant, and her locks are unimportant, but if somebody important
needs a lock she holds, that makes her important. In other words,
the OS needs to boost the priority of the holder of a lock that an
important process is waiting on.
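The Sally and Victor story can be sketched as a toy simulation. This is not how a real OS scheduler is written; the Task class, the simulate function, the tick counts, and the tie-breaking rule are all assumptions made up for illustration. The scheduler always runs the task with the highest effective priority, and with inheritance enabled the lock holder borrows the priority of the task waiting on it.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # larger number = more important
    work_left: int       # ticks of CPU still needed while holding the lock
    boosted_to: int = 0  # priority inherited from a waiter, if any

    def effective_priority(self) -> int:
        return max(self.priority, self.boosted_to)


def simulate(inherit: bool, max_ticks: int = 100):
    """Return the tick at which Victor gets the disk lock, or None if he never does."""
    sally = Task("Sally", priority=1, work_left=3)
    victor = Task("Victor", priority=10, work_left=0)
    lock_holder = sally  # Sally grabbed the disk lock before Victor woke up
    last_ran = None

    for tick in range(max_ticks):
        # Victor is always ready to retry; Sally is ready while she has work.
        runnable = [victor] + ([sally] if sally.work_left > 0 else [])
        # Strict priority: highest effective priority wins. On a tie, prefer
        # the task that did not run last tick (a simple round robin).
        chosen = max(runnable,
                     key=lambda t: (t.effective_priority(), t is not last_ran))
        last_ran = chosen

        if chosen is victor:
            if lock_holder is None:
                return tick  # Victor finally acquires the disk lock
            if inherit:
                # Priority inheritance: the holder becomes as important
                # as the important task waiting on the lock.
                lock_holder.boosted_to = victor.priority
            # Either way Victor gives up the CPU and retries later.
        else:
            sally.work_left -= 1
            if sally.work_left == 0:
                lock_holder = None    # Sally releases the disk lock
                sally.boosted_to = 0  # and drops back to her own priority
    return None


print("without inheritance:", simulate(inherit=False))
print("with inheritance:   ", simulate(inherit=True))
```

Without inheritance the scheduler picks Victor on every tick, so Sally never runs and the function gives up and returns None. With inheritance Sally gets enough CPU to finish her three ticks of work and release the lock, after which Victor acquires it.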
This "priority inheritance" is complicated, and slightly slows down
lock acquisition, so it was left turned off by the JPL engineers for the Mars Pathfinder mission, which landed the Sojourner rover in 1997. Big problem. The spacecraft's computer reset itself several times on the surface of Mars, until the engineers figured out what was happening and sheepishly turned priority inheritance on.
THERAC-25 Race Condition
The THERAC-25
was a piece of medical equipment capable of generating either an X-Ray
beam or scanning an Electron beam over the surface of a cancer
patient. When you chose "Electron" mode, the software would begin
this process:
- Enter an 8-second magnet calibration cycle, and then
- Remove the piece of metal used to generate X-Rays
When you chose "X-Ray" mode, the software would insert the piece of
metal used to generate X-Rays. If you chose "Electron" mode
(starting the calibration cycle) and then within 8 seconds went back
and chose "X-Ray" mode, the end of the Electron mode setup would remove
the crucial piece of metal needed for X-Ray mode, and a high-strength
electron beam was sent directly into the patient. At least two
people died as a direct result of this problem, and several more were
permanently maimed.
The problem is just one of simple multithreaded locking--the two modes
should have been exclusive, but because they were both coded in
assembly and run directly from the keyboard interrupt, they could
interleave in a strange way.
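The interleaving can be reproduced in a toy model. This is only a sketch, not the real machine's software (which was assembly): the Machine class, its method names, and the 0.05-second sleep standing in for the 8-second calibration are all assumptions for illustration. With use_lock set, each mode setup runs under one lock, so the two setups become exclusive.

```python
import threading
import time


class Machine:
    def __init__(self, use_lock: bool):
        self.mode = None
        self.target_in_place = False  # the piece of metal that generates X-rays
        self.use_lock = use_lock
        self.setup_lock = threading.Lock()
        self.calibrating = threading.Event()

    def electron_setup(self):
        self.mode = "electron"
        self.calibrating.set()
        time.sleep(0.05)              # stands in for the calibration cycle
        self.target_in_place = False  # remove the X-ray target at the end

    def xray_setup(self):
        self.mode = "xray"
        self.target_in_place = True   # insert the X-ray target

    def select(self, setup):
        if self.use_lock:
            with self.setup_lock:     # only one mode setup may run at a time
                setup()
        else:
            setup()


def operator_changes_mind(use_lock: bool):
    """Choose Electron mode, then choose X-Ray mode during calibration."""
    m = Machine(use_lock)
    t = threading.Thread(target=m.select, args=(m.electron_setup,))
    t.start()
    m.calibrating.wait()              # calibration is now under way
    m.select(m.xray_setup)
    t.join()
    return m.mode, m.target_in_place


print("no lock:  ", operator_changes_mind(use_lock=False))
print("with lock:", operator_changes_mind(use_lock=True))
```

Without the lock the machine ends up in X-ray mode with the target removed, which is the dangerous state described above. With the lock, the X-ray setup waits until the electron setup has completely finished, so the target is in place when X-ray mode is active.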
Ariane 5 Overflow Explosion
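The Ariane 5 rocket exploded on its first launch in 1996 due to a simple unhandled overflow: a 64-bit floating-point value was converted to a 16-bit signed integer that was too small to hold it.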
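The failure is widely reported as a 64-bit floating-point value being converted to a 16-bit signed integer that could not hold it. The sketch below is not the flight code (which was written in Ada); the function names and the sample value 40000.0 are assumptions for illustration. It shows three ways such a conversion can behave when the value is out of range.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_wrapping(x: float) -> int:
    """Keep only the low 16 bits, as an unchecked two's-complement store would."""
    return (int(x) - INT16_MIN) % 65536 + INT16_MIN


def to_int16_checked(x: float) -> int:
    """Refuse values that do not fit, so the caller has to handle the error."""
    value = int(x)
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return value


def to_int16_saturating(x: float) -> int:
    """Clamp out-of-range values to the nearest representable value."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


reading = 40000.0  # hypothetical sensor value, larger than 32767
print(to_int16_wrapping(reading))    # silently becomes a negative number
print(to_int16_saturating(reading))  # pinned at the largest 16-bit value
try:
    to_int16_checked(reading)
except OverflowError as err:
    print("overflow detected:", err)
```

A check that raises an error is only helpful if something sensible handles that error; an unhandled one just moves the failure somewhere else.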