Death and Mayhem caused by Threads

CS 321 Lecture, Dr. Lawlor, 2006/02/13

As we've seen, threads can cause weird race conditions.  Sometimes race conditions affect code that's really important.

Deadlock on Mars

Processes and threads have a "scheduling priority"--this controls the amount of time dedicated to running the process.  Linux does this with a "nice" value, where higher nice value has lower priority (because you're being nice to the CPU).  You can run any program with a nice value using the "nice" command, like:
    nice -n 20 ./myprogram

In Linux, the "nice" value just scales the length of the process timeslice (see linux/kernel/sched.c:task_timeslice).  This is a good thing, because it means everything on the system keeps running; the low-priority stuff just runs slower.

But Linux, like most other systems, also supports "priority classes", where an item in the high-priority class will totally dominate the CPU, and things in lower priority classes will only run if there's no high-class work available.  This total denial of the lower classes can result in big problems.  One of the simplest happens like this:
This is a classic deadlock--Victor is waiting for Sally to give up the disk lock, but Sally can't give up the disk lock until Victor gives her the CPU.  The solution is called "priority inheritance"--Sally's unimportant, and her locks are unimportant, but if somebody important needs a lock she holds, that makes her important.  In other words, the OS needs to boost the priority of the holder of a lock that an important process is waiting on.

This "priority inheritance" is complicated, and slightly slows down lock aquisition, so it was disabled by the JPL engineers for the Mars Sojourner rover that landed in 1997.  Big problem.  The rover locked up several times on the surface of Mars, until the engineers figured out what was happening and sheepishly turned priority inheritance back on.

THERAC-25 Race Condition

The THERAC-25 was a piece of medical equipment capable of generating either an X-Ray beam or scanning an Electron beam over the surface of a cancer patient.  When you chose "Electron" mode, the software would begin this process:
When you chose "X-Ray" mode, the software would insert the piece of metal used to generate X-Rays.  If you chose "Electron" mode (starting the calibration cycle) and then within 8 seconds went back and chose "X-Ray" mode, the end of the Electron mode setup would remove the crucial piece of metal needed for X-Ray mode, and a high-strength electron beam was sent directly into the patient.  At least two people died as a direct result of this problem, and several more were permanently maimed.

The problem is just one of simple multithreaded locking--the two modes should have been exclusive, but because they were both coded in assembly and run directly from the keyboard interrupt, they could interleave in a strange way.

Ariane 5 Overflow Explosion

The Ariane 5 rocket exploded at its first launch due to a simple overflow check.