Identify Patient Sets - De-ID
University of Pittsburgh | Health Sciences @ Pitt | Center for Biomedical Informatics  

De-ID Overview

The problem to solve
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) does not permit the use of protected health information (PHI) in research studies except with the explicit consent of the patient. However, HIPAA does allow for the creation of de-identified health information. For clinical researchers to use clinical data without such consent and still comply with HIPAA, the records must first be de-identified.

De-ID process
The extraction of PHI and its de-identification require a defined and structured process. The Clinical Research Informatics Service (CRIS) at the University of Pittsburgh oversees the data extraction and de-identification process involving databases at UPMC. CRIS serves as the IRB-designated "honest broker" for approved studies, including those involving multiple data sources. This lets the researcher rely on CRIS to coordinate data collection and linkage files across the multiple sources, without the risk of re-identification by any one of the data sources.

Use of the De-ID program is limited to IRB-approved projects. CRIS works closely with the clinical researcher before IRB submission to ensure that the de-identified information will be usable in the particular study. CRIS requires a copy of the IRB approval letter.

De-ID mechanics
De-ID uses a set of heuristics to identify the presence of any of the 18 HIPAA identifiers within the text. Supplemental dictionaries of geographic locations, hospital names, and popular names found in the U.S. Census are also used to locate identifiable text. The UMLS Metathesaurus is used to ensure that words or phrases that may be medical terms containing proper names are preserved.
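
The combination described above (pattern heuristics, name dictionaries, and a medical-term check) can be illustrated with a minimal Python sketch. This is not De-ID's actual implementation: the two regular expressions, the tiny `NAME_DICTIONARY` and `MEDICAL_TERMS` sets, and the function name `find_identifiers` are all hypothetical stand-ins for the real heuristics, Census name lists, and UMLS Metathesaurus lookup.

```python
import re

# Hypothetical stand-ins for the supplemental name dictionary and the
# UMLS Metathesaurus lookup; the real resources are far larger.
NAME_DICTIONARY = {"parkinson", "smith", "jones"}
MEDICAL_TERMS = {"parkinson disease", "parkinson's disease"}

# Illustrative patterns for two of the 18 identifiers (assumed formats).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_identifiers(text):
    """Return sorted (start, end, label) spans flagged as identifiable."""
    spans = []
    # Heuristic step: pattern matches for numeric identifiers.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    # Dictionary step: flag words that appear in the name dictionary.
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() not in NAME_DICTIONARY:
            continue
        # Preservation step: skip a name that begins a known medical term.
        rest = text[m.start():].lower()
        if any(rest.startswith(term) for term in MEDICAL_TERMS):
            continue
        spans.append((m.start(), m.end(), "NAME"))
    return sorted(spans)
```

In this sketch, "Smith" in "Dr. Smith" would be flagged as a name, while "Parkinson" in "Parkinson disease" would be left in place because the phrase matches the medical-term list.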

De-ID replaces the identifiable text with specific tags. A name found multiple times in a report is consistently replaced with the same tag, which improves the readability of the report. The downside of applying De-ID is that a small amount of clinical information is removed during de-identification. In our work to date, this has caused only minor problems.
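
Consistent replacement can be sketched as a lookup table from each name to the tag it was first given. The tag format `[NAME-n]` and the function `replace_names` below are assumptions made for illustration; the actual tags De-ID emits are not specified here.

```python
def replace_names(text, name_spans):
    """Replace each (start, end) name span with a tag.

    The same name (case-insensitive) always receives the same tag,
    so a reader can still follow who is who within one report.
    """
    tags = {}      # maps a lowercased name to its assigned tag
    pieces = []    # output fragments, joined at the end
    cursor = 0     # position in text up to which output has been built
    for start, end in sorted(name_spans):
        key = text[start:end].lower()
        if key not in tags:
            # Hypothetical tag format; numbering follows first appearance.
            tags[key] = f"[NAME-{len(tags) + 1}]"
        pieces.append(text[cursor:start])
        pieces.append(tags[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

For example, given the text "Smith saw Jones. Smith agreed." and the spans of the three names, both occurrences of "Smith" receive the first tag and "Jones" receives the second.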

The 18 HIPAA identifiers [45 CFR 164.514(b)(2)(i)]
1. Names of individuals, relatives, employers or household members
2. Geographical subdivisions smaller than a state, except the first 3 digits of zip codes; however, if the region covered by those 3 digits contains 20,000 or fewer people, the entire zip code must be replaced.
3. All elements of dates (except year) directly relating to an individual; all ages over 89 years must be grouped into a single category of 90 or older
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Other account numbers
11. License numbers
12. Vehicle identifiers
13. Device identifiers and serial numbers
14. Web universal resource locators (URLs)
15. Internet protocol (IP) addresses
16. Biometric identifiers (this item does not occur in free-text clinical reports)
17. Full face photographic images or any other comparable image (this item does not occur in free-text clinical reports)
18. Any other unique identifying number, characteristic or code
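
Items 2 and 3 in the list above are generalization rules rather than simple removals, and they can be sketched in a few lines of Python. The function names, the `[ZIP]` placeholder, and the caller-supplied set of low-population prefixes are all hypothetical; this is an illustration of the rules as stated, not De-ID's own code.

```python
from datetime import date


def generalize_zip(zip_code, low_population_prefixes, placeholder="[ZIP]"):
    """Keep only the first 3 digits of a zip code.

    If the 3-digit region is listed as low-population, replace the
    whole code with a placeholder instead (hypothetical tag).
    """
    prefix = zip_code[:3]
    if prefix in low_population_prefixes:
        return placeholder
    return prefix


def generalize_age(age):
    """Group all ages over 89 into a single '90 or older' category."""
    return "90 or older" if age > 89 else str(age)


def generalize_date(d):
    """Drop every element of a date except the year."""
    return str(d.year)
```

A caller would supply the set of low-population prefixes from census data; for instance, `generalize_zip("15213", {"999"})` keeps the prefix `"152"` because that prefix is not in the (made-up) set.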
  De-ID, IPS & Encoder Copyright © 1999-2002 University of Pittsburgh. All rights reserved.