Computational Phenotyping and Phenome-wide Association Studies: Leveraging Machine Learning and Natural Language Processing to Understand Electronic Health Record Data

Teixeira, Pedro Luis, Jr.

dc.creator	Teixeira, Pedro Luis, Jr.
dc.date.accessioned	2020-08-22T20:53:17Z
dc.date.available	2017-08-27
dc.date.issued	2015-08-27
dc.identifier.uri	https://etd.library.vanderbilt.edu/etd-08262015-232710
dc.identifier.uri	http://hdl.handle.net/1803/14022
dc.description.abstract	The aims of this project are 1) to evaluate various data sources and algorithms for identifying hypertensive individuals within the electronic health record, and 2) to develop and evaluate a novel method for identifying associations between genotypes and natural language processing-based phenotypes extracted from the electronic health record. The author evaluated data sources and hypertension phenotyping algorithms using a set of 631 individuals whose hypertension status was determined by manual review of their electronic health record data. Combinations of data sources outperformed methods that used any single category alone. Random forest models trained with billing codes, medications, vital signs, and hypertension concept counts achieved a median AUC of 0.976. The best algorithms performed similarly at a second site. The author also developed a novel method for phenome-wide association studies using natural language processing-based phenotypes (NLP-PheWAS). Using 29,722 individuals with exome data, the author extracted 11,553 unique concepts from narrative text after negation, note section, and semantic type filtering. The method replicated 43.7% of known, statistically powered associations from the National Human Genome Research Institute’s genome-wide association catalog. NLP-PheWAS also identified two potentially novel associations among the SNPs studied: one between optic disc neovascularization and rs1497546, and one between Langerhans cell histiocytosis and rs7193343. NLP-PheWAS is a promising method for rapidly discovering and interpreting novel associations and for improving understanding of genetic influences within the rapidly expanding narrative text of electronic health records.
dc.format.mimetype	application/pdf
dc.subject	biomedical informatics
dc.subject	phenome-wide association studies
dc.subject	hypertension
dc.subject	random forests
dc.subject	machine learning
dc.subject	natural language processing
dc.title	Computational Phenotyping and Phenome-wide Association Studies: Leveraging Machine Learning and Natural Language Processing to Understand Electronic Health Record Data
dc.type	dissertation
dc.contributor.committeeMember	Thomas A. Lasko, M.D., Ph.D.
dc.contributor.committeeMember	Todd L. Edwards, M.S., Ph.D.
dc.contributor.committeeMember	S. Trent Rosenbloom, M.D., MPH
dc.contributor.committeeMember	Dan M. Roden, M.D.
dc.type.material	text
thesis.degree.name	PHD
thesis.degree.level	dissertation
thesis.degree.discipline	Biomedical Informatics
thesis.degree.grantor	Vanderbilt University
local.embargo.terms	2017-08-27
local.embargo.lift	2017-08-27
dc.contributor.committeeChair	Joshua C. Denny, M.D., M.S.

Files in this item

Name: Teixeira.pdf
Size: 4.309 MB
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.