Crisis management: Planning for a systems crash

Technically Speaking

Morris Stemp

Morris Stemp is the CEO of Stemp Systems Group, a health IT solutions provider in New York City.

With increasing regulations, decreasing reimbursements, meaningful use compliance, and a host of clinical and operational issues to contemplate, the last thing a practice administrator wants to think about is planning for a systems crash. Yet, this is exactly what is mandated by various data security provisions of both HIPAA and the HITECH law passed as part of 2009’s stimulus package. The two requirements below apply to physician practices:

  • 164.308(a)(7)(i) Standard: Contingency plan. Establish (and implement as needed) policies and procedures for responding to an emergency or other occurrence (for example, fire, vandalism, system failure, and natural disaster) that damages systems that contain electronic protected health information.
  • 164.308(a)(7)(ii) Implementation specifications: (A) Data backup plan (Required). Establish and implement procedures to create and maintain retrievable exact copies of electronic protected health information. (B) Disaster recovery plan (Required). Establish (and implement as needed) procedures to restore any loss of data.

Disaster planning

A full organizational disaster or business continuity plan would include planning for events that could force the total or partial closure of a medical facility, such as a fire or flood. This column focuses on a single component of that comprehensive plan — planning for a systems crash in key clinical or operational systems that would adversely affect the smooth and effective delivery of vital patient services. Additionally, it is not sufficient just to have a plan. To be in compliance with HIPAA, the plan must be documented as a set of written policies and procedures.

Analysis of a systems crash

A systems crash is basically a dramatic term for a systems failure; the most critical impact of a systems failure is that key systems are not accessible in the usual and customary manner by the doctors, medical staff, and patients (if applicable). Therefore, the defining feature of a “crash” is really the inaccessibility of the key programs or data, regardless of why they are inaccessible.

Systems can fail or become inaccessible in many ways. The most obvious failure is a failure of a key hardware component of the computers or related infrastructure (network switches, routers, etc.) on which the systems operate. But there are a number of other failures which can cause systems to be inaccessible, including:

  • The operating system of a server hosting key programs or data could become corrupt, causing the server not to boot. Computer viruses and rootkits (which hackers can use to install and run malware) are two ways a server operating system can become corrupt.
  • The database in which the data is stored can become corrupt, or the data can become inconsistent (i.e., different and conflicting versions of the same data appear in different places). Data in a database can be inadvertently updated with erroneous transactions or damaged by poorly written update programs. Computer viruses can also damage data and index tables inside a database.
  • Many medical facilities use systems that are hosted on remote servers not located inside their facility. Any multi-location practice must, by definition, have at least one location that connects remotely to a central location hosting the servers. These connections to remotely hosted servers are generally achieved via a VPN (virtual private network) over the Internet or some other type of dedicated telecommunications network, possibly a point-to-point T1 or a multiprotocol label switching (MPLS) cloud. Any failure or downtime of the Internet or telecommunication circuits, either at the remote site or the central location, will cause systems at the central location to be inaccessible via remote access. A communications failure at a remote site takes down just that one site, while a communications failure at the main site affects all remote sites and is thus much more serious.
  • All systems are powered via the regional electrical grid. The recent power outages on the East Coast caused by Hurricane Irene are an example of how power outages can take a system down. Parts of Connecticut were without power for an entire week! Even if a medical practice had its systems hosted in sunny Florida, that practice would have had no Internet service to support its remote connection. Of course the practice is probably not able to open its doors or see patients if it has no electricity, but what if there is a patient emergency and the doctor has to view the electronic health records to provide emergency care?

Planning for the inevitable

Planning for a system failure requires implementing different solutions to address each of the possible causes of a systems crash. It is worth noting that for most of the causes of system inaccessibility described above, having a usable data backup would be no help at all. Data backups (assuming they are usable, a major assumption not tested frequently enough in real life) resolve only data corruption; many of the causes of inaccessibility described above stem from external factors.

Internet backup: For many smaller practices, the lifeline to their medical records is the Internet connection in their office, which enables the practice to connect to its EHR system, usually hosted remotely either by a local hospital or in a data center rented by the EHR software company. The simplest way to avoid a common outage is to install a second Internet connection that uses a different connection technology: if the practice currently uses a cable modem, install a Verizon FIOS or DSL line. With proper configuration, the second line serves not only as a standby failover line but can also provide supplemental bandwidth when both connections are working.
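As a sanity check between outages, a practice (or its IT vendor) may want to verify that both lines are actually alive rather than discovering a dead backup circuit during a crisis. The short sketch below illustrates the idea only; the gateway addresses are hypothetical placeholders, and a dual-WAN router or the vendor's own monitoring tools would normally do this job.

    """Minimal connectivity check for a dual-Internet setup (a sketch).
    The gateway addresses are placeholders; substitute the LAN addresses
    of your own primary and backup routers or modems."""
    import subprocess

    # Hypothetical addresses -- replace with your actual equipment.
    TARGETS = {
        "primary gateway (cable modem)": "192.168.100.1",
        "backup gateway (FIOS/DSL router)": "192.168.1.1",
        "outside host (tests the active Internet path)": "8.8.8.8",
    }

    def is_reachable(ip: str) -> bool:
        """Send a single ping; use '-n' instead of '-c' on Windows."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        for label, ip in TARGETS.items():
            status = "UP" if is_reachable(ip) else "DOWN"
            print(f"{label:50s} {ip:15s} {status}")

Run on a schedule, a check like this tells staff at a glance which path is down instead of leaving them to guess why "the Internet is slow."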

Anti-virus (AV) software: AV software on all servers and workstations is an absolute requirement to avoid a crash. It is critical not only to install the software but to keep it current with updated software versions and virus definitions. Virus developers are a notoriously crafty and creative group, and even with updated systems, viruses can still sneak in with varying impact. Before rolling out a new AV system, test it on just a few workstations or servers first: AV software can adversely affect performance, and the configuration frequently must be refined by excluding specific files and folders from scanning.

Back-up power source: Electrical outages come in varying scope, from a momentary blip to interruptions of days or even weeks. Even a momentary outage can cause servers to shut down abruptly, leaving the operating system and database in a potentially unstable state, and nothing short of self-generated power (as used by hospitals) will keep systems running through an extended outage. A reasonable precaution for small practices is to plug all servers into a battery-backed uninterruptible power supply (UPS), which can provide immediate power to a system for at least five minutes and can signal the attached servers to shut down gracefully if power is not restored within a specified time. Note that a surge strip does not provide an acceptable level of protection. A UPS can be purchased online or in most office-supply and electronics stores.
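For readers who want to see what "graceful shutdown" means in practice, here is a minimal sketch of the logic. The UPS vendor's own software (or the open-source Network UPS Tools suite) normally handles this automatically; the UPS name, the five-minute grace period, and the use of the upsc utility are assumptions for illustration only.

    """Illustration of the 'shut down gracefully if still on battery'
    logic a UPS provides.  In real deployments the UPS vendor's software
    or NUT's upsmon does this; the names and timings below are assumed."""
    import subprocess
    import time

    UPS_NAME = "ups@localhost"      # hypothetical NUT UPS identifier
    GRACE_SECONDS = 5 * 60          # shut down if still on battery after 5 min

    def on_battery() -> bool:
        """Query UPS status via NUT's upsc; 'OB' means on battery."""
        out = subprocess.run(
            ["upsc", UPS_NAME, "ups.status"],
            capture_output=True, text=True, check=True,
        ).stdout
        return "OB" in out

    if __name__ == "__main__":
        if on_battery():
            time.sleep(GRACE_SECONDS)
            if on_battery():                           # power still not restored
                subprocess.run(["shutdown", "-h", "now"])  # graceful halt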

Data backup: Finally, we have all probably found ourselves wishing we had a good, current backup. There are many ways to create one, from simply copying a data file or program directory onto a local hard disk or USB drive to storing multiple versions of these copies at timed intervals, both on site and at one or more secure remote off-site facilities. (To learn more about different backup possibilities, see the sidebar.)

Communications

Another key element of crisis management is communications. How the response to a crash is managed, and how the status of the recovery is communicated to users, is vital to restoring confidence in the system.

While it is usually not possible to cover up a systems crash, I have observed a tendency among technical staff to minimize the impact of system interruptions and understate the time it will take to resolve a failure. Managing users' expectations and communicating accurate estimates of when systems will be restored provides a good deal of comfort to otherwise harried medical staff. It may be possible for staff to work around the system or put pre-planned system-down workflows into effect, but the anticipated amount of downtime dictates which reaction is appropriate.

Testing and practice

The final key element of crisis management and planning for a systems crash is to test all of the recovery and failover solutions and to practice putting the recovery plans into action. Testing should be done frequently, with methods and schedules that make sense for each recovery scenario. For example, the only way to test whether the failover Internet line really fails over in a timely and seamless manner is to disconnect the primary circuit.

The only foolproof way to test a disaster recovery plan is to simulate the disaster in all its gory detail: a full systems shutdown, full failover, and then total recovery back to the original systems. This true failover/failback test requires downtime and staff time, and it is not a prospect doctors want to contemplate with a waiting room full of patients. Yet it is truly the only way to simulate a real-life system crash.

It is important that plans to update existing systems or install new systems include the appropriate updates to the system disaster recovery plans. The testing process may also reveal aspects of the recovery plans which are out of date and must therefore be revised.

Conclusion

It’s not possible to predict every event that could lead to a systems crash or to systems becoming inaccessible. But it is possible to plan responses to the many events that can reasonably be foreseen based on the technology behind the systems and on history. A medical practice should not expect to avoid every disaster, but with proper planning, documentation, testing, and practice, it can respond rapidly to a crisis and restore access for physicians and staff as quickly as possible, avoiding any lapse in service to its patients.


Backup needed

It is obvious (although not always a matter of actual practice) that a medical practice running an EHR system on its own servers must have a working and tested backup system. But even if you use hosted or cloud-based EHR systems, you should be concerned about data backup. Why? Walk around your office and look at the programs each staff member uses during the day outside of the EHR system. They probably include software to create patient, payer, and vendor letters, charts, office signs, spreadsheets, schedules, conference presentations, etc.; scan copies of vendor invoices and EOBs; and send and receive email. 

Where is all this data stored? Does each user save their own information on their individual workstations? Is shared data stored on some server, central storage device, or possibly one user’s workstation? Does any of this information include HIPAA-protected patient health information, and if so, how is it secured? Can you afford to lose some or all of this data to a natural disaster, fire, or theft? I think the answers are clear and in most cases, the exposure is great. You need a backup strategy.

Backup strategy

The first step in creating a backup strategy is to centralize the storage of all data in appropriately secured folders on a central server or a network-attached storage (NAS) device. Of course, HIPAA requires that these central storage repositories be kept in a locked area. Storing data on a workstation is a huge risk: workstation hard drives may fail, the workstations are not locked up, and users may download programs from the Internet that can corrupt the operating system or the data, or install viruses.

It is much easier to back up data from a single storage location than to back up, monitor, and test multiple locations. Once all data is stored in a central location, it is possible to implement an effective backup strategy that can be monitored and tested regularly.

It is vital to have backups of both your data and your operating system, and the backup itself is just one of three key steps in the process; the other two are monitoring and testing. Most backup systems provide a logging and notification feature that logs all activity and sends alerts for any errors or failures, along with optional notifications of success. Make sure you configure the alerting system and run some test alerts.
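As one illustration of the monitoring step, the sketch below checks how old the newest file in a backup destination is and emails an alert if no new backup has appeared recently. The folder path, the 26-hour threshold, and the mail addresses and server are placeholders, and most commercial backup products provide this kind of alerting themselves.

    """A minimal backup-age check (a sketch, not a product).
    Paths, the 26-hour threshold, and the mail settings are placeholders;
    adjust them to your own backup destination and mail server."""
    import os
    import smtplib
    import time
    from email.message import EmailMessage

    BACKUP_DIR = r"\\nas\backups\practice"   # hypothetical backup destination
    MAX_AGE_HOURS = 26                       # daily backup plus some slack

    def newest_file_age_hours(path: str) -> float:
        """Return the age in hours of the most recently modified file."""
        newest = max(
            (os.path.getmtime(os.path.join(root, name))
             for root, _, files in os.walk(path)
             for name in files),
            default=0.0,   # no files at all -> treated as infinitely old
        )
        return (time.time() - newest) / 3600

    def send_alert(body: str) -> None:
        """Email the alert; addresses and server are placeholders."""
        msg = EmailMessage()
        msg["Subject"] = "Backup alert"
        msg["From"] = "backup-monitor@example.com"
        msg["To"] = "admin@example.com"
        msg.set_content(body)
        with smtplib.SMTP("mail.example.com") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        age = newest_file_age_hours(BACKUP_DIR)
        if age > MAX_AGE_HOURS:
            send_alert(f"No new backup files in {age:.1f} hours -- investigate.")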

Backup solutions

By now, you’re probably thinking, “I get it! I get it! I need to have off-site backups of my data and the operating system. Now give me some guidance on how to get those backups done!” So let’s look at a couple of solutions on the market and examine how they work to create an effective backup solution. 

The first backup methodology I will review is essentially a basic copy of data files and folders from the working (source) location to a backup destination. Dozens of programs can facilitate this copy process along with effective logging and alert notifications. Many are free; most cost under $100. (We use a system called SyncBack that offers more than 100 features.) The key to SyncBack and programs like it is a process called “synchronization,” which copies only data that does not already exist, or has changed, on the backup destination, thus speeding up the transfer. These programs are extremely flexible about backup destinations, allowing local USB drives, remote file servers, FTP servers, and even Google and Amazon storage servers.
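To make the synchronization idea concrete, the following stripped-down sketch walks a source folder and copies only files that are missing from the destination or newer at the source. It is not SyncBack or any other product, just the core technique; the source and destination paths are placeholders, and real tools add logging, retries, deletions, and filters.

    """Stripped-down one-way folder synchronization (illustration only)."""
    import os
    import shutil

    SOURCE = r"D:\PracticeData"              # hypothetical working folder
    DESTINATION = r"E:\Backup\PracticeData"  # hypothetical backup folder

    def sync(source: str, destination: str) -> None:
        for root, _, files in os.walk(source):
            rel = os.path.relpath(root, source)
            dest_dir = os.path.join(destination, rel)
            os.makedirs(dest_dir, exist_ok=True)
            for name in files:
                src_file = os.path.join(root, name)
                dst_file = os.path.join(dest_dir, name)
                # Copy only if missing or the source copy is newer.
                if (not os.path.exists(dst_file)
                        or os.path.getmtime(src_file) > os.path.getmtime(dst_file)):
                    shutil.copy2(src_file, dst_file)  # copy2 preserves timestamps
                    print("copied", src_file)

    if __name__ == "__main__":
        sync(SOURCE, DESTINATION)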

Carbonite and Mozy are two heavily marketed backup services that are essentially file and folder backup services that offer only an Internet-based remote data center storage option. One of the key benefits of these services compared to SyncBack, for example, is that they provide a very simple user interface for less technical users. But they also offer much less flexibility.

The backup methodologies described above are great for backing up files and folders, but none has an option for backing up the entire operating system. That requires a process known as “imaging” or “cloning,” in which an exact replica of a workstation or server hard disk is created and stored on an external device. One of the most popular systems for accomplishing this is Acronis, which offers a range of solutions from individual computers to servers.

This type of backup can restore an entire hard disk, or it can be used to restore individual files and folders. Should your hard disk fail or your operating system need to be completely reloaded due to corruption or a virus, you can use the image backup to totally restore your system to exactly the way it looked at the time of the backup. Using a process called bare-metal restore, there is no need to first load an operating system and then restore your data — the backup includes everything you need.
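To illustrate the imaging concept (not the way Acronis or any other commercial product works internally), the sketch below clones an entire disk to an image file using the standard Linux dd utility. The device name and destination path are assumptions, and a live system should normally be imaged from rescue media rather than while it is running.

    """Conceptual disk-imaging example using the standard 'dd' utility.
    Device and path are hypothetical; run from rescue media, not on a
    live, mounted disk, and expect to need root privileges."""
    import subprocess

    SOURCE_DISK = "/dev/sda"                # hypothetical system disk
    IMAGE_FILE = "/mnt/backup/sda.img"      # hypothetical destination

    def create_image(disk: str, image: str) -> None:
        """Clone the whole disk, block for block, into an image file."""
        subprocess.run(
            ["dd", f"if={disk}", f"of={image}", "bs=4M", "status=progress"],
            check=True,
        )

    if __name__ == "__main__":
        create_image(SOURCE_DISK, IMAGE_FILE)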

Experiencing any type of system downtime is always a major hassle. Having an effective backup plan can make the difference between a hassle that lasts an hour and a real disaster which can force you to shut down your practice. So repeat after me: “Backup, monitor, test…backup, monitor, test…”


 
