The HIPAA Cluster: Ensuring Data

By Sean Igo

Research in health-care-related fields, including clinical, biomedical, nursing, and public health research, is an increasingly important and well-funded endeavor, but it comes with certain complications. Chief among these is that any experiments involving humans, and the resulting data, are required to conform to stringent legal and ethical standards, as set out in the Health Insurance Portability and Accountability Act of 1996, known by its acronym "HIPAA."

Such data, called Protected Health Information (PHI), must conform to HIPAA privacy standards. Therefore, it is important that researchers using PHI have access to a secure computing environment in which they may store and manipulate PHI. At the same time, clinical research is increasingly making use of computationally intensive techniques such as data mining, machine learning, statistics, and operations on large data sets. These requirements - large capacity, secure storage and high performance computing - make CHPC the ideal organization to maintain a HIPAA-compliant computational research environment.

Stimulated by NIH grant funding, CHPC now maintains such an environment, named "homer" after University of Utah Professor Emeritus Homer Warner, a pioneer in the field of Biomedical Informatics. Informally, the environment is known as the "HIPAA sandbox." Created through a collaboration between CHPC and the University's Department of Biomedical Informatics, it is isolated from the main CHPC clusters, and access to it is highly controlled.

What is in the HIPAA Sandbox?

The HIPAA sandbox consists of three kinds of computers:

- Windows interactive nodes, which run the Windows XP operating system,

- Linux interactive nodes, running Red Hat Enterprise Linux 5, and

- Compute nodes, which constitute a parallel-processing supercomputing environment organized around the MPI parallel application library.

Interactive nodes are machines used to run software directly, much as you would run it on a personal desktop or laptop computer, with two differences. First, they sit behind the sandbox's security perimeter, accessed through a secure Virtual Private Network (VPN). Second, they're multicore, server-grade machines much more powerful than the typical personal computer. Interactive nodes are also used to submit batch processes to the compute nodes.

The HIPAA sandbox compute nodes are powerful machines just like the Linux-based interactive nodes, except that they are networked to operate in concert on parallel processing tasks. Instead of interacting directly with the compute nodes, users submit jobs to be run in batches, as with CHPC's other computing clusters. The compute nodes will soon be expanded to 32 cores, communicating over a high-speed InfiniBand network.

The sandbox also includes a multi-terabyte storage system. Regular backups are performed, and the backup media are securely stored, handled only by HIPAA-trained CHPC staff.

As of this writing, the HIPAA sandbox is in its early stages. In addition to the three new compute nodes, new interactive nodes are planned by researchers starting new projects requiring HIPAA compliance. As with CHPC's other resources, researchers are able to buy additional hardware for their particular needs.

What kind of research will be done in the Sandbox?

One type of research is the application of Natural Language Processing (NLP) techniques (see CHPC's spring 2009 newsletter) to understand clinical text. A large volume of text is created during the course of treating patients, including admission notes, nursing notes, surgical narratives, and discharge summaries, and much could be learned through analysis of this text. For example, by tracking events that occur during the course of a patient's treatment, a computer system might be able to support or warn against subsequent courses of action. Another possibility is tracking the medications a patient has taken and raising a red flag against prescribing drugs that interact harmfully with the earlier medicines.
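The medication "red flag" idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name `flag_interactions`, the table `INTERACTIONS`, and the placeholder drug names are hypothetical, and the table carries no clinical meaning. A real system would first have to extract the medication history from clinical text, which is the hard part.

```python
# Hypothetical interaction table. The drug names are placeholders,
# not real medications, and this is not clinical guidance.
INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}),
    frozenset({"drug_c", "drug_d"}),
}


def flag_interactions(history, candidate):
    """Return the earlier medications that interact with the candidate drug."""
    candidate = candidate.lower()
    return [
        earlier
        for earlier in history
        if frozenset({earlier.lower(), candidate}) in INTERACTIONS
    ]


if __name__ == "__main__":
    # The history would come from NLP over the patient's notes.
    print(flag_interactions(["Drug_A", "drug_x"], "drug_b"))
```

Storing each pair as a `frozenset` makes the lookup independent of the order in which the two drugs were prescribed.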

There are several biomedically oriented text processing packages currently available. One of them, MetaMap, is an application available online from the National Library of Medicine. It segments input text into phrases and attempts to categorize them according to standard medical terminology. For example, it can recognize the phrase "lung cancer" as a neoplastic process.
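To make the segment-then-categorize idea concrete, here is a toy Python sketch. It is not MetaMap and does not use MetaMap's interface; the names `LEXICON`, `split_phrases`, and `categorize` are hypothetical, the segmentation rule is deliberately crude, and the one-entry lexicon stands in for a full medical terminology.

```python
import re

# Toy lexicon. Only the "lung cancer" entry comes from the article;
# a real system would draw on a large standard terminology.
LEXICON = {"lung cancer": "Neoplastic Process"}


def split_phrases(text):
    """Crude segmentation: break on punctuation and a few connective words."""
    parts = re.split(r"[.,;:]|\b(?:and|with|of|for)\b", text.lower())
    return [part.strip() for part in parts if part.strip()]


def categorize(text):
    """Pair each phrase with its category, or None if it is not in the lexicon."""
    return [(phrase, LEXICON.get(phrase)) for phrase in split_phrases(text)]


if __name__ == "__main__":
    for phrase, category in categorize("History of lung cancer, stable."):
        print(phrase, "->", category)
```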

A common shortcoming of these currently available medical NLP packages is that they expect the text given to them to be well-formed: grammatically correct, with words spelled properly. In practice, this is not always the case. One project currently hosted on the HIPAA sandbox attempts to address this problem. The project is called "POET," for Parseable Output Extracted from Text. Its goal is to develop tools that convert poorly formed text into a form more suitable for input to these medical NLP applications. It consists of several subprojects, including a module to recognize medical acronyms and abbreviations and modules to extract grammatical structure from tabular or list data.
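As a rough picture of what an abbreviation-handling step might do, the Python sketch below expands a few shorthand tokens before the text is handed to a downstream NLP tool. It is not POET's implementation; the table `ABBREVIATIONS` and the function `expand_abbreviations` are hypothetical, and the simple lookup ignores the fact that many clinical abbreviations are ambiguous and need context to resolve.

```python
# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "sob": "shortness of breath",
    "c/o": "complains of",
}


def expand_abbreviations(text):
    """Replace known abbreviations, keeping any trailing punctuation."""
    expanded = []
    for token in text.split():
        core = token.rstrip(".,;:")
        trailing = token[len(core):]
        replacement = ABBREVIATIONS.get(core.lower(), core)
        expanded.append(replacement + trailing)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_abbreviations("Pt c/o SOB, hx of asthma."))
```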

The Sandbox is rapidly acquiring new users from various organizations on campus, including the Departments of Biomedical Informatics (DBI) and Radiology, the College of Nursing, and the School of Medicine's FURTHeR project, an infrastructure being built under the large NIH translational research grant called the Clinical and Translational Science Award (CTSA). When CHPC's director, Dr. Julio Facelli, and his colleague at the DBI, Dr. John Hurdle, canvassed PIs of NIH-funded projects at the Health Science Campus last spring, they recruited no fewer than eight PIs who were willing to shift their biomedical computing environment to a homer-like setting. For these researchers, the real appeal of CHPC as a computing resource is that, in addition to potential access to high performance computing power, CHPC handles systems management issues: rapid response to electrical power problems, reliable cooling and heating, VPN support for a work-anywhere computing experience, a HIPAA environment more secure than their office computers or departmental servers, and automatic upgrades of key software.

How do I get started using the Sandbox?

If you are conducting research that uses data governed by HIPAA privacy rules and believe that CHPC's HIPAA-compliant environment would be useful to you, please contact us to set up a CHPC account.

Permission to use a given dataset is governed by the approval of the University's Institutional Review Board (IRB). Researchers must submit a proposal to the IRB listing the data to be used and the people who will have access to it. If the IRB approves the use of the data in question, the researcher is given an IRB number. In order to store the data in CHPC's HIPAA Sandbox, the researcher must provide CHPC with this number and a list of the users who will be permitted to see the data. Thereafter, the data may be transferred to CHPC and only the IRB-approved users will be able to work with it.
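The access rule described above, where a dataset is tied to an IRB number and a list of approved users, can be sketched in Python as follows. This is only an illustration of the policy, not how CHPC enforces it; the `Dataset` class, the `can_access` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A stored dataset with its IRB number and approved user list."""
    name: str
    irb_number: str  # supplied by the researcher after IRB approval
    approved_users: set = field(default_factory=set)


def can_access(dataset, username):
    """Allow access only if an IRB number is on file and the user is listed."""
    return bool(dataset.irb_number) and username in dataset.approved_users


if __name__ == "__main__":
    # Placeholder values, not a real IRB number or real accounts.
    notes = Dataset("clinical_notes", "IRB_PLACEHOLDER", {"alice"})
    print(can_access(notes, "alice"))
    print(can_access(notes, "bob"))
```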

The initial response to the HIPAA-compliant homer environment has exceeded expectations. Homer was designed to grow gracefully, and CHPC and faculty at the DBI are pursuing ways to secure more hardware to support the growing interest. Policies governing use of homer are expected to evolve as interest grows.