Scalable Natural Language De-identification based on Machine Learning Approaches

Li, Muqun

dc.creator	Li, Muqun
dc.date.accessioned	2020-08-22T00:03:49Z
dc.date.available	2018-03-27
dc.date.issued	2018-03-27
dc.identifier.uri	https://etd.library.vanderbilt.edu/etd-03262018-113355
dc.identifier.uri	http://hdl.handle.net/1803/11460
dc.description.abstract	Electronic medical record (EMR) systems have been progressively adopted in numerous aspects of clinical care and healthcare endeavors. As the quantity and diversity of such data grow, so too has its repurposing to support secondary use in a number of settings (e.g., public health and biomedical research). However, the dissemination of such data has been largely limited to structured data, as documents that contain natural language (e.g., communications between clinicians) have posed concerns over the extent to which the privacy of the corresponding patients can be upheld. To mitigate this concern, various federal and state laws, and the agencies that oversee them, recommend minimizing the amount of information disclosed and adhering to de-identification principles, such as those specified in the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996. De-identification aims to remove protected health information (PHI), including explicit identifiers (e.g., patient names) and quasi-identifiers (e.g., dates of birth). While structured data is relatively straightforward to de-identify, unstructured natural language is more challenging because it is not always evident when a potential identifier is communicated. As a consequence, it is improbable in practice, whether manually or automatically, to detect and amend every potential identifier in a scalable manner without affecting non-identifying information. This dissertation seeks to address the scalability challenge in de-identification systems based on machine learning by achieving three tasks in the context of natural language de-identification. Starting with a collection of unannotated natural language clinical data, which may be exploited by malicious attackers when shared, the ultimate aim of the system is to successfully recognize and, subsequently, protect the PHI.
The first task of this dissertation introduces a framework, based on game theory, to model the costs and benefits for a healthcare organization (HCO) that shares EMR data and a recipient (who is a potential adversary) that may exploit the residual identifiers. Upon doing so, we introduce a strategy to discover an optimized solution for the HCO that minimizes the training resources it must allocate to achieve natural language de-identification at a sufficient level of performance. The second aspect of the scalability challenge this dissertation focuses on is how to better utilize a given set of training data for de-identification. We propose and develop a feature extraction and clustering strategy to partition clinical documents into inferred types over which de-identification models are trained, tested, and ultimately applied. For the last part of the problem, we incorporate active learning into the de-identification workflow and conduct studies to show that, if the machine learning de-identification system can actively request information from outside of the system (e.g., from a knowledgeable human assistant) to help create a better model, then less training data will be needed to maintain (or even improve) the performance of trained models. Simulations on a real-world clinical trials dataset and a publicly available i2b2 dataset demonstrate the effectiveness of active learning compared with passive learning in de-identification.
dc.format.mimetype	application/pdf
dc.subject	de-identification
dc.subject	game theory
dc.subject	EMR
dc.subject	active learning
dc.subject	machine learning
dc.subject	natural language processing
dc.subject	data privacy
dc.subject	clustering
dc.title	Scalable Natural Language De-identification based on Machine Learning Approaches
dc.type	dissertation
dc.contributor.committeeMember	Yevgeniy Vorobeychik
dc.contributor.committeeMember	Douglas H. Fisher
dc.contributor.committeeMember	Daniel Fabbri
dc.contributor.committeeMember	Lynette Hirschman
dc.contributor.committeeMember	Khaled El Emam
dc.type.material	text
thesis.degree.name	PHD
thesis.degree.level	dissertation
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Vanderbilt University
local.embargo.terms	2018-03-27
local.embargo.lift	2018-03-27
dc.contributor.committeeChair	Bradley A. Malin

Files in this item

Name: Li.pdf
Size: 5.591Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.