Death and Denial: The Failure of the THERAC-25, A Medical Linear Accelerator

by

Anne Marie Porrello

in partial completion of the requirements for

Computer Science 440

Dennis W. Butler, Instructor


Imagine that your doctor has recently informed you that you have cancer, most probably terminal, and your only hope for a cure involves radiation therapy. How much are you going to question the radiation process and equipment? Are you going to ask:

I would guess that you might ask a few of these questions, but that you would assume that the machine delivering the radiation is safe and that the people who designed it and manipulate it are properly qualified. So, whose job is it to ask, and answer, the other questions?

Between June 1985 and January 1987, six patients were seriously injured or killed by unsafe administration of radiation from the Therac-25 medical linear accelerator. In this paper I will first explain what a medical linear accelerator is and then describe the birthing process of the Therac-25. Next, I will examine the accident history and explore the causes behind the accidents. I will study various propositions regarding the accidents made by Atomic Energy of Canada Limited (AECL), the company that designed and produced the Therac-25. In addition, I will examine the Therac-25's software bugs. Lastly, I will look at the government's reactions and explore what has been done to prevent similar accidents in the future. By the end, we should have an answer to the question: Whose job is it to ask, and answer, medical equipment safety questions?

What is the THERAC-25?

The Therac-25 is a medical linear accelerator manufactured by AECL. A linear accelerator ("linac") is a particle accelerator, a device that increases the energy of electrically charged atomic particles. The charged particles are accelerated by the introduction of an electric field, producing beams of particles which are then focused by magnets.

Linacs are used to treat cancer patients. A patient is exposed to beams of particles, or radiation, in doses designed to kill a malignancy. Since malignant tissues are more sensitive than normal tissues to radiation exposure, a treatment plan can be developed that permits the absorption of an amount of radiation that is fatal to tumor cells but causes relatively minor damage to normal tissue. Shallow tissue is treated with electrons, but to reach deeper tissue, X-ray photons are needed (Grolier, 1995).

The Hardware

AECL combined forces with a French company, CGR, and created two linacs before the Therac-25: the Therac-6 and the Therac-20. The Therac-6 is a 6 million electron volt (6-MeV) accelerator that produces X-rays only, and the Therac-20 is a 20-MeV X-ray or electron accelerator. (An eV, the electron volt, is a unit of energy: the work needed to move an electron through a potential difference of 1 volt (Grolier, 1995).) Eventually, after the companies ended their partnership, AECL developed the Therac-25. Like the Therac-20, the Therac-25 is a dual-mode machine, but it requires much less space because it has a unique design structure (Leveson and Turner, 1993, p. 19). The Therac-25 uses two magnets to fold the electron beam 180 degrees and 270 degrees before it reaches its target. By positioning elements correctly, a turntable controls which mode the machine will use. When the machine is in electron mode, magnets on the turntable spread the beam to a safe concentration. In electron mode, various levels of energy are available (from 5 to 25 MeV) (O'Brien, 1985, p. 101). In photon mode, a much greater electron-beam current is needed because a "beam flattener" is used to produce a consistent treatment area. Only one level of energy (25 MeV) is available in photon mode (O'Brien, 1985, p. 101). If the beam flattener is not in position, a dangerously high output rate will occur; this is a significant hazard of a dual-mode machine, because it is possible that not all the devices will be lined up properly. The turntable also includes a third position, the field-light position, which uses a light to help position patients correctly. When the machine is in the field-light position, no mechanism is used to control the beam concentration because no beam is expected. This produces another possible hazard of the machine, in the event that a beam is incorrectly produced (Leveson and Turner, 1993, p. 25).

The Therac-25 is enclosed in a radiation treatment room in order to prevent unnecessary radiation exposure to individuals working near the machine. The machine operator has contact with the patient through visual and audio monitors located within the treatment room (Leveson and Turner, 1993, p. 25).

The Software

"The design of real-time computing systems is the most challenging and complex task that can be undertaken by a software engineer. By its very nature, software for real-time systems makes demands on analysis, design, and testing techniques that are unknown in other application areas." (Pressman, 1992, p. 481)

"...All these computing interactions, be they helpful or intrusive, are examples of real-time computing. The computer is controlling something that interacts with reality on a timely basis. In fact, timing is the essence of the interaction... An unresponsive real-time system may be worse than no system at all." (Pressman, 1992, p. 481)

The Therac-25's software was developed from the Therac-20's software, which was developed from the Therac-6's software. One programmer, over several years, revised the Therac-6 software into the Therac-25 software (AECL has not released any information about the programmer or his credentials). An important difference between the Therac-20 software and the Therac-25 software is the overall role that each plays in the machine. In the Therac-20, the role of software is limited. The software simply adds convenience to the hardware. However, in the Therac-25, software exclusively performs many of the critical safety checks of the system; these safety checks are also included in the hardware of the Therac-20, but were not included in the Therac-25 hardware. The Therac-25 software is responsible for:

The last two responsibilities reveal some of the ways that the software is responsible for the safety of the system.

The Therac-25 runs on a custom-designed real-time operating system. The software has four major components: stored data, a scheduler, a set of critical and non-critical tasks, and interrupt services. The interrupt services include (among others) a treatment console screen interrupt handler and a treatment console keyboard interrupt handler. The scheduler directs all non-interrupt events and orders simultaneous events. Tasks are divided into critical and non-critical categories. Every 0.1 seconds tasks are initiated; critical tasks are executed first, with non-critical tasks taking up any remaining time. Critical tasks include:

Non-critical tasks include (among others):

The software of the Therac-25 also controls the positioning of the turntable, a possible hazard discussed previously, and checks the position of the turntable so that all necessary devices are in place (Leveson and Turner, 1993, p. 21).
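The 0.1-second scheduling cycle described earlier in this section can be illustrated with a small sketch. This is not AECL's code; the function name `run_cycle`, the task names, and the per-task costs are all hypothetical, and the rule that a non-critical task is skipped when it does not fit in the remaining time is an assumption made for illustration.

```python
CYCLE_MS = 100  # per the text: tasks are initiated every 0.1 seconds


def run_cycle(critical, non_critical, budget_ms=CYCLE_MS):
    """Run one scheduling cycle and return the names of the tasks that ran.

    Each task is a (name, cost_ms) pair. All names and costs are hypothetical.
    """
    ran = []
    remaining = budget_ms

    # Critical tasks always run first, whatever they cost.
    for name, cost in critical:
        ran.append(name)
        remaining -= cost

    # Non-critical tasks only use whatever time is left in the cycle
    # (assumption: a task that does not fit is skipped this cycle).
    for name, cost in non_critical:
        if cost <= remaining:
            ran.append(name)
            remaining -= cost

    return ran


if __name__ == "__main__":
    critical = [("critical_a", 30), ("critical_b", 40)]
    non_critical = [("noncritical_a", 20), ("noncritical_b", 50), ("noncritical_c", 10)]
    print(run_cycle(critical, non_critical))
```

The point of the sketch is only the ordering: safety-related work is supposed to get the processor before anything else, and everything else competes for the leftover time.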

The Therac-25 software also contained several "user-friendly" features. During system testing, operators complained that it took too long to enter the treatment plan, since it had to be done twice: once in the treatment room and a second time at a terminal outside the room. For convenience, AECL redesigned the software so operators could simply use a series of carriage returns, at the terminal outside the treatment room, to verify the data entered within the room (Leveson and Turner, 1993, p. 24). Another "convenient" feature of the Therac-25 involved a "proceed" key. There were two ways that the Therac-25 could shut down: a treatment suspend or a treatment pause. A treatment suspend indicated a serious error and required a complete system restart. A treatment pause, which was apparently not as serious, required only a single-key command (the "P" key) to restart the machine, and all treatment specifications remained intact. A treatment pause could occur five times before the machine required a complete system restart. With a treatment pause, only a terse error message was displayed: "Malfunction" followed by the number of the malfunction. However, there was no indication in the user's manual as to what each malfunction number meant (Leveson and Turner, 1993, p. 24).
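The pause-and-proceed behavior just described can be sketched as a small state model. This is only an illustration under stated assumptions: the class `Console`, its method names, and the constant `MAX_PAUSES` are hypothetical, and the sketch assumes that the fifth pause is the one that turns into a suspend.

```python
MAX_PAUSES = 5  # per the text: five pauses are allowed before a full restart


class Console:
    """Toy model of treatment pause versus treatment suspend (hypothetical)."""

    def __init__(self):
        self.pauses = 0
        self.state = "READY"

    def report_malfunction(self, number):
        """Pause the treatment and return the only text the operator sees."""
        if self.state != "SUSPENDED":
            self.pauses += 1
            # Assumption: the fifth pause becomes a suspend.
            self.state = "SUSPENDED" if self.pauses >= MAX_PAUSES else "PAUSED"
        # The message carries a number but no explanation of its meaning.
        return f"MALFUNCTION {number}"

    def press_proceed(self):
        """The single "P" keystroke: resume a paused treatment, settings intact."""
        if self.state == "PAUSED":
            self.state = "READY"
            return True
        return False  # a suspend cannot be cleared with "P"


if __name__ == "__main__":
    console = Console()
    for attempt in range(1, 6):
        message = console.report_malfunction(54)
        resumed = console.press_proceed()
        print(attempt, message, console.state, resumed)
```

Nothing in this loop asks the operator to find out what the number means before continuing, which is the design weakness the paragraph above points to.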

In later sections, I will discuss how the real-time nature of the system, the addition of user-friendly features, poor documentation, and failures to secure safety, contributed to the radiation accidents.

The Accidents

Six accidents involving enormous radiation overdoses to patients took place between 1985 and 1987. In this section I will simply give a brief overview of the accidents. See Table 1.0 for a summary of the accident history.

The first accident occurred at Kennestone Regional Oncology Center in Marietta, Georgia. On June 3, 1985, a sixty-one-year-old woman was receiving follow-up treatment after a malignant tumor was removed from her breast. When the machine was activated, she felt "a tremendous rush of heat... this red-hot sensation." She told the operator of the Therac-25, "You burned me." Although she later developed reddening and swelling in the center of the treatment area, AECL denied that the machine had burned the patient, and the swelling was attributed to a normal treatment reaction. Eventually, her shoulder froze and she began to experience spasms. She was admitted to the hospital, but her doctors continued to send her for Therac-25 radiation treatments. Eventually the patient's breast had to be removed, and she completely lost the use of her shoulder and arm (Leveson and Turner, 1993, p. 22).

The second accident occurred at the Ontario Cancer Foundation clinic in Canada. On July 26, 1985, a 40-year-old patient received her 24th Therac-25 treatment. During the treatment, the machine shut down with a treatment pause and issued an "H-tilt" error message. The operator pushed the "P" button, since the machine indicated that no dose had been delivered to the patient. The machine continued to shut down, and the operator pushed the "P" button each time until the machine suspended after the fifth attempt. Each time, the machine indicated that no dose had been given to the patient. The operator of the Therac-25 was used to this type of behavior from the machine and called the technician, who found nothing wrong with the machine. This also was a common situation. The patient, however, complained of an "electric tingling shock" in her hip. Eventually radiation overexposure was suspected and the patient was hospitalized. She died three months later of cancer, but a total hip replacement would have been necessary had she lived (Leveson and Turner, 1993, p. 23).

The third accident involved a woman who developed red parallel stripes on her hip, the treatment area. She was treated at the Yakima Valley Memorial Hospital in 1985. Her doctors continued to order treatments for her even after these stripes appeared. Radiation overexposure was not considered as a cause until over a year later. Eventually, the patient received surgical treatment and, except for minor disability and scarring, is alive and well today (Leveson and Turner, 1993, p. 26-27).

Another Therac-25 accident, the fourth in the series, occurred at the East Texas Cancer Center in March of 1986. A male patient was to receive therapy on his upper back. The Therac-25 operator had typed in incorrect treatment information by indicating X-ray mode instead of electron mode. She merely used the "cursor up" key to edit the mode entry, quickly pressed "enter" (one of the user-friendly features), and started treatment. The machine shut down with a treatment pause, and a "Malfunction 54" error message was displayed on the screen. This error message indicated that either a dose too high or a dose too low had been delivered. Since an underdose value appeared on the screen and the operator was used to quirks in the machine, she hit the "P" key to continue with the treatment. The machine repeated the "Malfunction 54" error message and indicated that the same underdose had been delivered. The operator had no contact with the patient, because the usual audio and video monitors were not working properly. After the first attempt at treatment, the patient felt an "electric shock," or as if "someone had poured hot coffee" on his back. He knew this was not normal and began to get up from the treatment table when the second treatment was delivered. The patient felt a tremendous shock in his arm, and felt that "his hand was leaving his body." He had to pound on the treatment room door to get the operator's attention. The patient eventually lost the use of his left arm and both legs, was unable to speak, and had several other complications. He died from complications five months later (Leveson and Turner, 1993, p. 27-28).

A fifth accident, the second at the East Texas Cancer Center, occurred in April of 1986, just one month later. As in the previous accident, the same operator entered the wrong mode of treatment, quickly edited in the correct mode, and hit a quick series of enter keys. The machine shut down again with a "Malfunction 54" message. This time, however, the intercom was working, and the operator heard a loud noise followed by moaning from the patient. The patient was receiving radiation on the side of his face. He died three weeks after the accident, after falling into a coma and suffering severe neurological damage (Leveson and Turner, 1993, p. 28).

The last of the accidents occurred at the Yakima Valley Memorial Hospital. On January 17, 1987, an operator placed a patient on the turntable in the field-light position for small position verification doses. After attempting to administer the treatment dose, the machine shut down with a quick malfunction message and a treatment pause. The operator pushed the "P" button, and the machine paused again. The machine indicated that the patient had received his prescribed 7 rad of treatment. The patient, however, complained of a "burning sensation" and died three months later from complications related to the overdose (Leveson and Turner, 1993, p. 33).

Table 1.0. Therac-25: Radiation Accident Summary

Date of the Accident   Location of the Accident   Extent of Injuries to Patient        Months After the First Accident
--------------------   ------------------------   ----------------------------------   -------------------------------
June 3, 1985           Marietta, GA               Breast removal, loss of use of arm   0
July 26, 1985          Ontario, Canada            Total hip replacement needed         1
January 6, 1986        Yakima, WA                 Minor disability and scarring        7
March 21, 1986         Tyler, TX                  Death                                9
April 11, 1986         Tyler, TX                  Death                                10
January 17, 1987       Yakima, WA                 Death                                19

AECL'S Reactions to the Accidents

Why were accidents allowed to occur over a 19-month period? The answer to this question lies in AECL's reactions when notified of possible Therac-25 radiation accidents. In this section, I will examine these reactions as well as AECL's overconfidence in the safety of the Therac-25 system.

After the first accident, in 1985, AECL was informed about the situation and was asked if the Therac-25 could operate in electron mode without scanning to spread the beam (as described in the hardware section). When AECL responded three days later, it was to say that improper scanning was impossible. The hospital staff had a difficult time discerning the cause of the first burn, because they had never seen a radiation burn of this severity. Eventually, the patient was estimated to have received a dose in the range of 15,000-20,000 rad (radiation absorbed dose). To help put this dosage into perspective, a normal dose is in the "200-rad range, and doses of 500-1,000 rad can be fatal if delivered to the whole body" (Leveson and Turner, 1993, p. 23). The patient eventually initiated a lawsuit against the hospital and AECL. Even upon notification of the lawsuit, AECL did not investigate the possible occurrence of a scanning failure. They continued to believe that such an event was impossible (Leveson and Turner, 1993, p. 23).

AECL responded to the second accident by sending a service engineer to investigate the Therac-25 machine. He was unable to reproduce the malfunction that took place, but suspected that the problem lay in a microswitch used to determine turntable position. In trying to fix this situation, AECL uncovered some problems involving the turntable positioning and made some hardware and software changes to fix them. After the changes, AECL wrote a letter to the hospital claiming to have increased the safety of the machine by "at least five orders of magnitude," yet they did not really discover why the accident occurred. They were merely guessing. AECL told only four users in the United States to discontinue treatment when an "H-tilt" error message appeared. AECL voluntarily recalled the machine while making the above-mentioned changes to it (Leveson and Turner, 1993, p. 23).

AECL's reaction to the third accident is perplexing. Upon receiving a letter of notification from the hospital describing the patient's injury, AECL responded with a letter informing the hospital that "after careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25" (Leveson and Turner, 1993, p. 27). The letter went on to explain that an overdose was impossible and that there had been no other similar accidents! The hospital was of the opinion that the safety improvements to the machine proclaimed by AECL (five orders of magnitude, that is, a 10,000,000 percent improvement!) guaranteed that the Therac-25 could not be responsible for the burn. No further action was taken (Leveson and Turner, 1993, p. 23-26).

AECL responded to the fourth accident by suggesting that an electrical problem could have caused the accident. Another engineering firm tested the machine for electrical problems, but found none. AECL continued to claim that the Therac-25 could not possibly overdose a patient, and that no other accidents had been reported to them. No other action was taken (Leveson and Turner, 1993, p. 28).

Unfortunately, it was not until the fifth accident that AECL responded in a thorough way. By this point, however, the FDA was also investigating the Therac-25. The next section will discuss further action taken by AECL.

A Collective Response

The fifth accident occurred at the same location as the fourth. As a result, someone besides the AECL engineers had knowledge that more than one possible accident had transpired while using the Therac-25. A physicist from the hospital where the two accidents occurred investigated both accidents thoroughly, discovering that the accidents were due to the quick changes made to the setup parameters by the machine operators. Through a quick series of returns, the physicist could reproduce the "Malfunction 54" error, something that AECL never could do (Leveson and Turner, 1993).

At about this same time, the FDA was also investigating the Therac-25 accidents. They determined that the Therac-25 was defective and required that AECL submit a corrective action plan (CAP) for FDA approval. They also mandated that AECL inform all users of the Therac-25 of possible machine malfunctions. In response, AECL wrote a letter to users understating the problems with the machine. The FDA responded to AECL's letter as follows:

"We have reviewed... [the] letter to purchasers and have concluded that it does not satisfy the requirements for notification to purchasers of a defect in an electronic product. Specifically, it does not describe the defect nor the hazards associated with it. The letter does not provide any reason for disabling the cursor key and the tone is not commensurate with the urgency for doing so. In fact, the letter implies the inconvenience to operators outweighs the need to disable the key. We request that you immediately re-notify purchasers." (Leveson and Turner, 1993, p. 31).

AECL submitted their first CAP, containing six items which included: fixing the software problem which caused the fifth accident, having Malfunctions 1-64 cause a machine suspend rather than a pause, and adding a new circuit that only administrative staff can reset if a high pulse is detected. No hardware safety interlocks were mentioned. However, AECL concluded, again, that the CAP changes would improve "machine safety by many orders of magnitude" (of an already ridiculously high figure) "and virtually eliminates the possibility of lethal doses as delivered in the Tyler incident" (Leveson and Turner, 1993, p. 32). The FDA was not satisfied, however, and was concerned with AECL's overall software engineering practice. Documentation of software specifications and details of software test plans were absent (Leveson and Turner, 1993, p. 32). (Obviously, no one at AECL had taken CSc440 with Dennis Butler.) Later on, in an FDA hearing, the quality assurance manager described testing as done in two parts. The first was a "small amount" of software testing, and the second involved total system testing of 2,700 hours. He later qualified that this was 2,700 hours of actual machine use (Leveson and Turner, 1993, p. 20)! The FDA eventually had to require that AECL do extensive testing on the system each time a software change was made, and that they write up a software testing plan and an installation testing plan.

Before AECL turned in their complete CAP, the sixth and last accident occurred. Although this accident was caused by a different software problem from the one in the Tyler accidents, the changes specified in the CAP would have prevented it. After the accident, the FDA declared the Therac-25 to be defective, informed users of the serious potential problems, and asked that the machine be used only if the need outweighed the potential risks (Leveson and Turner, 1993).

AECL eventually turned in their completed CAP. In reference to the test plan, an FDA reviewer wrote: "Amazingly, the test data presented to show that the software changes to handle the edit problems in the Therac-25 are appropriate prove the exact opposite result" (Leveson and Turner, 1993, p. 37). Although a data entry problem was determined to be the cause of the incorrect test results, the problem seems indicative of AECL's lack of quality control. Eventually, the CAP was accepted, which included a hardware safety interlock among 20 other hardware and software changes. The Therac-25 machine is still in use today.

The Software Errors

Each bug contained in the Therac-25 software was also found in the software of the Therac-20. However, the hardware safety interlocks in the Therac-20 prevented any accidents from occurring in that machine (Leveson and Turner, 1993, p. 29). Although an overall poor engineering process was used, the main weakness of the Therac-25 software lies in the lack of specifications and formal testing procedures. As a result, certain bugs remained in the software when the product was distributed to users. In this section, I will discuss the software bugs that contributed to the Therac-25 accidents.

The Therac-25 software errors that caused radiation overexposures can be reduced to interface errors. The first of these errors involved the entering of treatment data by the machine operator. Once an operator enters treatment information at the terminal outside the treatment room, the magnets used to filter and control radiation levels are set. There are several magnets, and the process takes about 8 seconds. If the operator makes a very quick change to the treatment information, within 1 second, the change is registered. If the operator is slower and takes more than 8 seconds, the change is also registered. However, if the change occurs after the first second but within the 8 seconds it takes to set the magnets, the change is not detected; the magnets continue to be set up for the original entry, and thus the level of radiation is set up improperly. As I mentioned earlier, this is the main hazard of a dual-mode system, and it is what happened in the fourth and fifth accidents (Leveson and Turner, 1993, p. 30). Once the magnets are set, no test is performed to double-check that the treatment information entered matches how the magnets are set. Another variable, which controls whether photon or electron mode is to be used, does detect the operator's edit and sets the mode to the edited mode. As mentioned earlier, much higher levels of radiation are needed in photon mode than in electron mode to produce the same level of output. Therefore, if the beam is set for photon mode but the turntable is set up for electron mode, a radiation overdose occurs (Jacky, 1989).
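The timing window described above can be made concrete with a small, deterministic sketch. It is not a reconstruction of the Therac-25 code: the names `run_setup` and `is_consistent`, the mode strings, and the exact boundaries of the window are assumptions for illustration. It models only which piece of state sees the operator's edit, not the physics of the beam.

```python
MAGNET_SETUP_SECONDS = 8.0  # per the text: setting the magnets takes about 8 seconds
EDIT_CHECK_WINDOW = 1.0     # per the text: an edit within about 1 second is registered


def run_setup(entered_mode, edit_time=None, edited_mode=None):
    """Return (mode_variable, magnet_setting) after entry and an optional edit.

    edit_time is how many seconds after the first entry the operator edits
    the mode; None means the operator made no edit.
    """
    mode_variable = entered_mode   # selects photon or electron mode
    magnet_setting = entered_mode  # what the magnets are being set up for

    if edit_time is not None:
        mode_variable = edited_mode  # this variable always picks up the edit
        noticed = (edit_time <= EDIT_CHECK_WINDOW
                   or edit_time >= MAGNET_SETUP_SECONDS)
        if noticed:
            magnet_setting = edited_mode  # setup is redone with the new data
        # Otherwise the magnets keep being set for the original entry.

    return mode_variable, magnet_setting


def is_consistent(mode_variable, magnet_setting):
    """The cross-check that, per the text, was never performed."""
    return mode_variable == magnet_setting


if __name__ == "__main__":
    for t in (0.5, 4.0, 9.0):
        state = run_setup("xray", edit_time=t, edited_mode="electron")
        print(t, state, "consistent" if is_consistent(*state) else "MISMATCH")
```

Only the edit made at 4.0 seconds leaves the two values disagreeing. A final comparison like `is_consistent`, run before the beam is enabled, is the kind of check the paragraph above says was missing.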

The second of the software errors, which caused the sixth and possibly other accidents, involved the nature of the real-time system. When the turntable is in test mode, a variable called Class3 is set to a non-zero value. As long as the operator is testing the position of the light beam, the variable increments. Once testing procedures are complete, the variable is set to zero and the radiation beam is allowed to pass. Class3, however, was stored in one byte of memory. A single byte can hold only the values 0 through 255, so every 256th increment rolls the value over to zero. In the sixth accident, the operator pushed the set button at the exact moment that Class3 rolled over to zero. As a result, a full prescription beam was released without any of the beam flatteners in place (Leveson and Turner, 1993).
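The rollover can be demonstrated in a few lines. This is a minimal sketch, not the original code: the helper names are hypothetical, and the one-byte storage is imitated by masking the value to 8 bits.

```python
def increment_one_byte(value):
    """Increment a counter stored in a single byte; 255 + 1 wraps to 0."""
    return (value + 1) & 0xFF


def beam_allowed(class3):
    """Per the text, a zero value is read as 'testing complete, beam may pass'."""
    return class3 == 0


def unsafe_passes(total_passes):
    """Return the pass numbers at which the counter wrongly reads as zero."""
    class3 = 0
    unsafe = []
    for n in range(1, total_passes + 1):
        class3 = increment_one_byte(class3)  # still testing, so keep incrementing
        if beam_allowed(class3):
            unsafe.append(n)
    return unsafe


if __name__ == "__main__":
    print(unsafe_passes(512))  # [256, 512]
```

On passes 256 and 512 the counter reads zero even though testing is still in progress. Assigning a fixed non-zero value instead of incrementing would avoid the wrap, since the flag could then never pass through zero by accident.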

Conclusions

Many factors contributed to the Therac-25 accidents, and AECL was responsible for many of them. AECL's overall engineering method was poor. For example, only one programmer was assigned to create the machine's complex, real-time software. Formal software specifications and testing criteria were not written. In addition, very little software and system testing was performed. To achieve the levels of machine safety proclaimed by AECL's engineers, the machine would have had to be run and tested for over 100,000 years (Littlewood and Strigini, 1992). In addition, when old bugs are fixed in a program, new bugs tend to crop up. According to Littlewood and Strigini (1992), after a software bug is fixed, there is only about a fifty percent chance that the program will function for the same length of time, without failure, as it did before the bug was fixed. AECL's claims of safety improvement by "orders of magnitude" after machine fixes were completely unfounded. In addition, AECL's denial that the machine could malfunction, despite finding and fixing several problems, was groundless. As long as AECL was convinced that their machine could not cause a radiation overdose, they were not going to discover any machine deficiencies.

However, AECL was not completely to blame for the Therac-25 accidents; machine operators and technicians also contributed their share of mistakes. For example, it seems strange to me that the operators of the Therac-25 would eventually become comfortable with operating the machine despite frequent error messages. According to Jacky (1989), the Therac-25 typically issued as many as forty error messages a day! Especially since the consequences of machine failure could be death, I believe that the operators had the responsibility of insisting that the machine function properly without the errors, or at least of requiring better documentation of possible errors and their causes. Also, operators of the machine seemed to rely too much on the inflated machine safety statistics supplied by AECL as reasons for not further investigating possible overdoses.

Lastly, even the federal government contributed its share to the Therac-25 accidents. Despite knowledge of AECL's poor engineering practices, the FDA allowed the Therac-25 to continue to be used. The FDA also appeared to have too much confidence in AECL's machine safety figures.

I believe that manufacturers are ultimately responsible for the safety of medical equipment that they produce, since they know the machine better than anyone else. And, as Jacky (1989) points out, safety should be manufactured into the product, not an afterthought added to it. However, technicians and operators of medical equipment also share in the responsibility of ensuring safety. Since they have day-to-day contact with the machine, they should be aware of any quirks or inconsistencies. Doctors who monitor patients receiving treatment from medical equipment are also responsible for ensuring patient safety. If a doctor suspects possible equipment malfunction, the matter should be investigated. It also seems to me that treatment should be suspended until the investigation is complete. The federal government should also be involved in ensuring the safety of medical equipment if any question of safety should arise. Finally, patients should always question the safety of medical equipment that could cause injury or death, especially in a society where profit and prestige can take precedence over health and safety. Since the patients are the ones at risk, unfortunately the final responsibility ultimately falls on their shoulders.

References

  1. Gowen, Lon D. and James S. Collofello. "Software Safety and Preliminary Hazard Analysis." Professional Safety November 1994: 20-25.

  2. Jacky, Jonathan. "Programmed for Disaster." The Sciences 29 (1989): 22-27.

  3. Leveson, Nancy G., and Clark S. Turner. "An Investigation of the Therac-25 Accidents." Computer July 1993 : 18-41.

  4. "Linear Accelerators." The 1995 Grolier Multimedia Encyclopedia. CD-ROM. Danbury: Grolier, 1995.

  5. Littlewood, Bev and Lorenzo Strigini. "The Risks of Software." Scientific American, Nov 1992: 62-75.

  6. O'Brien, P. "Radiation Protection Aspects of a New High-Energy Linear Accelerator." Med. Phys. Jan/Feb 1985: 101-107.

  7. Pressman, Roger. Software Engineering, A Practitioner's Approach. 3rd ed. New York: McGraw-Hill, Inc, 1992.

  8. Wallace, Dolores, D. Richard Kuhn, and Laura M. Ippolito. "An Analysis of Selected Software Safety Standards." IEEE AES Magazine August 1992: 3-13.