Title: Machine Learning Techniques for the Diagnosis of Pediatric Tuberculosis
Authors: Coston, Amanda
Date: 2013-07-26
2013-05-06
Publisher: Princeton University
Source:
Type: Princeton University Senior Theses
Subject:
Description: The goal of this project was twofold: first, to improve the performance of machine learning algorithms for the diagnosis of pediatric tuberculosis, and second, to use machine learning algorithms to better understand the problem of diagnosis. We constructed and examined Bayes nets using a MATLAB toolbox by Kevin Murphy, and we experimented with 26 other machine learning algorithms in the Weka software package. We found that while the Bayes nets have better accuracy when we initialize parameters based on medical knowledge, creating our own structure based on medical knowledge did not increase performance; a naive Bayes net does better than our handcrafted Bayes net. Neither the Bayes nets nor any of the Weka algorithms performed at the level necessary for use in real medical settings. Calibration curves show that the predicted probabilities of the Bayes nets and Weka algorithms do not correspond to the probability of positive diagnosis. Among the Weka algorithms, we found that decision algorithms generally have better performance, with the alternating decision tree and the ensemble methods (bagging and AdaBoost) on decision stumps performing the best. Overall, false negative rates are much higher than false positive rates, which does not bode well for practical applications, since false negatives have especially dire consequences in real life. We found that we could lower the false negative rates and generally improve the performance of the Bayes nets by guessing the label of unknown instances, a method we call predictive labeling. Using a variety of algorithms, we also tested which features were most important to diagnosis. The structure of alternating decision trees as well as traditional decision trees contributed to our understanding. We also randomized the data for each feature to see which had the greatest effect on performance, reasoning that the feature whose randomization had the greatest effect would be the most important. In addition, we implemented an explanation algorithm that selects, for each patient, the feature whose absence would change the probability of diagnosis the most. Using these algorithms, we found that the most important features for diagnosis were malaise and weight loss. Moving forward, we recommend obtaining larger and more comprehensive data sets that may yield better performance from the Bayes nets and other machine learning algorithms.
Language: English

Similar articles:

Engineering solutions for a carbon-constrained world by Celia, M. A.; Nordbotten, J. M.
Impact of capillary forces on large-scale migration of CO2 by Nordbotten, Jan M.; Dahle, Helge K.
Impact of geological heterogeneity on early-stage CO2 plume migration by Ashraf, Meisam; Lie, Knut-Andreas; Nilsen, Halvor M.; Nordbotten, Jan M.; Skorstad, Arne
A model-oriented benchmark problem for CO2 storage by Dahle, Helge K.; Eigestad, Geir T.; Nordbotten, Jan M.; Pruess, K.
CO2 trapping in sloping aquifers: High resolution numerical simulations by Elenius, Maria; Tchelepi, Hamdi; Johannsen, Klaus
Report from CO2 storage workshop by Dahle, Helge K.; Lien, Martha; Nordbotten, Jan M.; Lie, Knut-Andreas; Braathen, Alvar; Helmig, Rainer; Class, Holger; Celia, Michael A.
Summary of Princeton Workshop on Geological Storage of CO2 by Celia, Michael A.; Nordbotten, Jan M.; Bachu, Stefan; Kavetski, Dmitri; Gasda, Sarah