Identifying high quality MEDLINE articles and web sites using machine learning
Aphinyanaphongs, Yindalon
2007-12-28
Abstract
In this dissertation, I explore the applicability of text categorization machine learning methods to identifying clinically pertinent, evidence-based articles in the medical literature and web pages on the Internet. In the first series of experiments, I found that text categorization techniques identify high-quality internal medicine articles in the content categories of prognosis, diagnosis, etiology, and treatment better than the Clinical Query Filters of PubMed. In a second set of experiments, I established that the text categorization models generalize both to time periods outside the training set and to areas outside internal medicine, including pediatrics, oncology, and surgery. My third set of experiments revealed that text categorization models built for a specific purpose identify articles better than both bibliometric measures (number of citations and impact factor) and web-based measures (Google PageRank, Yahoo WebRanks, and total web page hit count). In the fourth set of experiments, I used a labeled gold standard to build models with high discriminatory power for purpose, format, and additional content categories. Furthermore, we built a system called EBMSearch that applies these models to all of MEDLINE. Finally, I extended these methods to the web and built the first validated models that identify websites making false cancer treatment claims; these models outperform previous unvalidated models and PageRank by 30% in area under the receiver operating characteristic curve. In conclusion, machine learning-based text categorization methods provide a powerful framework for identifying clinically applicable articles in the medical literature and on the Internet.