Clinical Trial Prediction via Natural Language Processing and Graph Mining

Rambacher, Calvin

Student Work

This MQP was completed by a single student, who built a baseline model for clinical trial prediction. The goal was to predict a clinical trial's phase from historical medical data. The student used Python's ElementTree API to parse more than 100,000 XML files into a database. The database was then cleaned with Pandas, using pairwise and listwise deletion of missing values and reformatting of the remaining values. The cleaned data was split into a training set and a testing set. The training set was used to fit CatBoost, a gradient boosting model designed for data with many categorical variables, and the testing set was used to evaluate the model's performance. The fitted model was then analyzed to determine how much each factor influences the clinical trial phase. The analysis found that the source, meaning where the clinical trial originates, has the greatest impact on how well the trial proceeds through the phases.
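The parsing, cleaning, and splitting steps described above can be illustrated with a minimal sketch. It is not the student's code. The tag names (`source`, `enrollment`, `phase`), the `clinical_study` root element, the sample documents, and the helper functions are all assumptions made for illustration, since the abstract does not give the XML schema. The sketch uses only the Python standard library, so plain functions stand in for Pandas, and the CatBoost fitting step is omitted.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical field names; the real XML schema is not given in the abstract.
FIELDS = ("source", "enrollment", "phase")


def parse_trial(xml_text):
    """Parse one trial document into a flat dict; absent tags become None."""
    root = ET.fromstring(xml_text)
    return {name: root.findtext(name) for name in FIELDS}


def listwise_delete(records):
    """Drop every record that has at least one missing or empty field."""
    return [
        r for r in records
        if all(r[name] not in (None, "") for name in FIELDS)
    ]


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def make_doc(source, enrollment, phase):
    """Build a made-up trial document; None omits the tag entirely."""
    parts = []
    for tag, value in zip(FIELDS, (source, enrollment, phase)):
        if value is not None:
            parts.append(f"<{tag}>{value}</{tag}>")
    return "<clinical_study>" + "".join(parts) + "</clinical_study>"


# Invented sample data: five complete records and two incomplete ones.
SAMPLE_DOCS = [
    make_doc("Site A", "120", "Phase 1"),
    make_doc("Site B", "45", "Phase 2"),
    make_doc("Site C", "300", "Phase 3"),
    make_doc("Site A", "80", None),        # phase tag absent
    make_doc("Site D", "", "Phase 2"),     # enrollment tag present but empty
    make_doc("Site B", "60", "Phase 1"),
    make_doc("Site E", "210", "Phase 4"),
]

records = [parse_trial(doc) for doc in SAMPLE_DOCS]
clean = listwise_delete(records)
train, test = train_test_split(clean)
```

In this sketch, `train` would be passed to the model-fitting step and `test` held back for evaluation. Listwise deletion is shown because it is the simpler of the two deletion strategies named above. Pairwise deletion would instead keep a record for any analysis that does not need its missing fields.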

This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.

Creator