System Crashes: A Case Report

There is increasing talk of the convergence of biomedical and information technologies. The connection of medical devices to information networks presents new opportunities and challenges. We present here a specific case of serious problems that arose from a series of circumstances, some of which were likely preventable and some of which were not. The goal is to promote further discussion and planning for prevention, organizational resilience, and effective response systems, and especially to promote more effective working relationships among all of the entities involved in developing, managing, and servicing the systems.

This case describes a real event, but the company and organization names are fictitious. It is presented in two parts, with this first installment describing the events. We encourage readers to consider what processes might have been put in place to prevent this occurrence and to e-mail their suggestions to the editor at [email protected]. In a later article we will discuss your suggestions.

Case Report

In October 2006, the system that manages electrocardiograms (ECGs)—the ECG manager—for what we will call a large academic medical center (LAMC) came crashing down. Because of the advantages of the system, some LAMC services had become completely dependent on it. In the emergency department (ED) and operating room (OR), clinicians used it to retrieve ECG histories on patients, and in the cardiology accounting office, the information was used to automatically generate billings. It was 4 months before the system returned to minimal functionality. Even now—1 year later—problems continue and the system has not returned to its former functionality, awaiting release of a new version of software that incorporates fixes for the bugs identified.

The original system was deployed in 1999 on the then-current NT operating system. The hospital biomedical engineering department (BME) was responsible for the application, which resided on a server in the health care system’s information technology (IT) server farm. The manufacturer, which we will call Global Biomedical Devices Inc (GBD), specified the server, whose hardware differed from the IT standard. Thus, BME was responsible for the server’s support. At the hospital’s request, GBD wrote software for its server to provide billing functionality. This nonstandard application fed data to an HL7 server and subsequently to a billing server and results application. The accounting office was thrilled to have automation of what had been a cumbersome, paper-based process.

In 2005, a seemingly simple upgrade of the server’s RAID drives resulted in a multitude of hardware and software failures that required rebuilding the system. Because the daily tape backups were not working, the data had to be copied to an IS computer. When the backup was restored, it was discovered that the data were corrupted, and a virus was detected. The system was cleaned, and no data were lost. Subsequently, however, the system needed more frequent reboots, which may have been associated with the initial attack. Early in 2006, funding was requested to replace the system; the replacement was scheduled for mid-2007.

The Crash

In October 2006, the system crashed. It was also infected with an NT virus. Unfortunately, the operating system vendor was no longer releasing patches for NT systems. GBD continued to support the clinical ECG manager system, even though both the OS and the server hardware it ran on were essentially out of support. The system was brought back online a week later but began to exhibit worsening signs of instability. GBD was called in to troubleshoot under its service contract. Multiple hardware and software problems ensued. GBD engineers reloaded software and replaced several pieces of hardware, including a seemingly bad tape drive and hard drives. During this time, it was also discovered that several patient stress test systems owned by the nuclear cardiology division were infected with the virus. These systems, still supported by GBD, were also running on an NT platform.

LAMC engineers requested that the manufacturer rebuild the server offline and reload the software. After several unsuccessful attempts at repair, GBD agreed. They began the process in mid-November. Almost 6 weeks had passed since the system first crashed in early October. During the rebuild, they ran into multiple hardware problems and quickly learned that they could no longer obtain some of the required replacement parts such as a motherboard and a RAID controller. Having started the process, they could not go backward. Unfortunately, because of the vintage of the system, a redundant server—which would have been achieved with the new system—was not possible. ECGs could not be transmitted to the server for archiving and subsequent billing.

This time, the company’s engineers were unable to successfully repair the system. There seemed to be irreparable damage to the server, and replacement hardware was not readily available. Five days into the rebuild process, GBD decided to upgrade the core system, server hardware, and software (new OS and application software). From mid-November to early December, a litany of bureaucratic challenges and a series of technical issues delayed the installation of the new server. During this time there was an increasing sense of urgency among users of the system. Although contingency systems were put in place, expectations across departments did not seem to be aligned. Frustration was increasing among personnel in billing, cardiology, IT, and BME. Although the problem was consuming many staff hours, tangible results were not evident to the users. Around the end of November, the server was installed, but there were problems with recovering the backup data set.

On December 1, 2006, the new server was placed online. Some of the problems were resolved: ECGs were successfully transmitted and archived. But, one new problem after another was discovered, associated with incompatibilities among the new system, the new version of the application, the original nonstandard interface, and several new security protocols from the OS vendor. Ultimately, all HL7 interfaces were rendered temporarily unusable.

These were the kinds of problems that might have been sorted out in testing during a planned replacement process. In countless meetings and conference calls over the next several months, engineers and IS specialists from various IS divisions struggled to learn why the HL7 interfaces still were not functional and why the system remained unstable.

From the BME perspective, the IT organization was inflexible in supporting possible solutions to the problem. There was great frustration in trying to understand and gain cooperation from their IT colleagues. BME was using all of its available (but very limited) resources to understand the problem, but they had no authority to convene the necessary players and get the manufacturer, LAMC IT, and BME staff together to work toward a solution.

Personnel at GBD and LAMC IT who had been involved in the original implementation that had created the connectivity to the network and the billing application had long since left. There was not sufficient documentation. No one person really understood the various subsystem dependencies. Although BME was the system administrator, another semi-independent IT subdivision had developed the billing application. It was also challenging to get the needed assistance from the various IT divisions that were involved in network management. Because it was a medical device, there were limitations on how the original NT system could be made to fit the network standards. Thus, while the server resided in the IT server farm, support and day-to-day control were the domain of the BME team.

From the IT perspective, BME did not have adequate knowledge of the system dependencies and had not responded quickly enough to what appeared to be a critical situation. Why, they wondered, didn’t BME hunker down and go 24/7 until the problem was solved?

From December until February 2007, the ECG management system operated with temporary, limited availability until a new, compatible HL7 interface server was installed. This enabled the HL7 interfaces, yet billing issues and other system failures persisted for 8 months beyond the crash. In addition, an access issue remains for users: the increased security of the new operating system limits the number of users and requires individual user accounts, so the 14,000 users who previously had been assigned access could not be transferred to the new system automatically. Under the new system, each user must be assigned a distinct ID and system password one at a time, a process estimated to require 3 person-months.

Although the basic archiving, retrieval, and billing functions are operable, problems with the interface and workstation crashes persist. While the cause of the interface problem has been identified, it cannot be remedied until the manufacturer releases the next software patch, which, due to regulatory constraints, will take 3 to 5 additional months. Another bug related to the interface has yet to be resolved.

Discussion

What technical and organizational issues does this event illustrate? Was this event preventable? Are such events likely to happen in the future? Could this happen in your hospital? If it is not likely, what steps have you taken to avoid it? How well does your biomedical engineering department communicate and work with your IT organization?

As biomedical and information systems continue to collide, circumstances such as those in this case study have the potential to affect every facility on some level. Please consider the above questions and e-mail your answers to the editor. In the next installment, we will discuss your suggestions and other possible solutions.

Jeffrey B. Cooper, PhD, is director of biomedical engineering, Partners Healthcare System, and professor of anaesthesia, Harvard Medical School; Patricia Volpe, BSBE, MBA, is director of biomedical engineering, Massachusetts General Hospital, Boston; and James Noga, BS, MS, is the chief information officer, Massachusetts General Hospital.
