Distribution-Driven Augmentation of Real-World Datasets for Improved Cancer Diagnostics With Machine Learning

Price, Stephen

Student Work

Recent advancements in technology and the growth of machine learning (ML) have led to an abundance of data available for modeling. However, not all data is suitable for ML. This is particularly true for medical datasets, which often have imbalanced diagnostic classes and entries with missing values. Such datasets lead to biased, skewed, or inaccurate ML models. In addition, valuable information for model training may be fragmented across multiple datasets, making it inaccessible to ML. To address these challenges, we propose Distribution-Driven Augmentation (DDA), a novel framework for curating real-world datasets and making them suitable for ML. DDA curates data through generative augmentation, using Kernel Density Estimators to create new values for class-balancing, null-filling, and joining datasets. Implemented on datasets familiar to the research community, DDA yields promising results, improving model performance on a benchmark medical dataset from 4.96% to 76.3%. Additionally, models trained on medical datasets joined with the DDA joining algorithm achieved an average F1 score of 98.7%, outperforming models trained on either dataset individually. These results highlight the benefits of a general framework to curate and join real-world datasets, opening the door for expanding the use of machine learning in the medical domain and specifically for diagnosing cancer.
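
To make the idea of generative augmentation with Kernel Density Estimators concrete, the following is a minimal sketch of KDE-based class-balancing and null-filling. It is not the DDA implementation from the report. The function names (kde_sample, balance_classes, fill_nulls), the use of a Gaussian kernel with Scott's rule-of-thumb bandwidth, and the choice to sample each feature independently are all assumptions made for illustration.

```python
import random
import statistics


def kde_sample(values, n_new, rng):
    """Draw n_new samples from a 1-D Gaussian KDE fitted to values.

    Assumes at least two observed values (needed for the standard deviation).
    """
    # Scott's rule of thumb for a 1-D Gaussian kernel (an assumed choice).
    bandwidth = statistics.stdev(values) * len(values) ** (-1 / 5)
    # Sampling from a Gaussian KDE is equivalent to picking an observed
    # point at random and adding zero-mean Gaussian noise with that bandwidth.
    return [rng.gauss(rng.choice(values), bandwidth) for _ in range(n_new)]


def balance_classes(rows, labels, seed=0):
    """Oversample every minority class up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        n_new = target - len(members)
        if n_new == 0:
            continue
        # Simplifying assumption: each feature is sampled independently,
        # so correlations between features are not preserved.
        columns = [
            kde_sample([r[j] for r in members], n_new, rng)
            for j in range(len(members[0]))
        ]
        for new_row in zip(*columns):
            out_rows.append(list(new_row))
            out_labels.append(label)
    return out_rows, out_labels


def fill_nulls(column, seed=0):
    """Replace None entries with draws from a KDE of the observed entries."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    draws = iter(kde_sample(observed, column.count(None), rng))
    return [next(draws) if v is None else v for v in column]
```

For example, calling balance_classes on a dataset with four rows of one class and two of another returns the original six rows followed by two synthetic rows of the minority class. A fuller treatment would likely fit a multivariate density per class so that relationships between features carry over into the generated rows.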

This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.

Creator