NHANES Analysis

Assessing Cancer Risk via Gradient-Boosted Decision Trees

PDF GitHub Repo

Project Status: [On Hold]

Introduction

The purpose of this project is to develop a cancer classification and risk prediction model in the hopes of providing advance warning to patients at risk for cancer. Every year, the National Health and Nurtitional Examination Survey (NHANES) gathers patient data detailing symptoms, demographic information, family history, diet, lab results, and comorbidities (among other considerations). This data is made available to allow the general public to analyze and develop models based on the responses. The main difficulty lies within the quality of the data as many of the responses are missing. In this project, I present an analysis of feature selection and engineering before developing an object-oriented pipeline that allows for seamless trials of parameterizations. Ultimately, a gradient-boosted decision tree model performed optimally, achieving a 24.2% recall on a heldout test set. I believe there is still room for improvement in the feature selection and preprocessing steps, so this project remains [On Hold] until the opportunity to continue arises.

Methods Used

Object-Oriented Programming
Pipelining
Inferential Statistics
Missing Value Imputation, Categorical Encoding
Feature Selection, Filtering
Decision Trees, SVMs
Ensembling Methods, Boosting
Data Visualization
Predictive Modeling
Model Persistence

Technologies

Python, Jupyter
Pandas, Numpy
Scikit-Learn
Tensorflow
Seaborn, Matplotlib
joblib

Project Overview

Data
- 50,000 patients with over 500 possible survey responses
- Roughly 9% are positive for having cancer at some point in their lives
- Requires substantial hand-selection and preprocessing
Feature Selection
- Began with a correlational filter to remove highly correlated features
- Passed data through mutual information filter to identify subset with some relationship to the target class
- Finally passed data through variance filter to remove features with minimal variance
- Lasso regression and tree-based methods were also tested, but resulted in declines in model performance
Performance Metric
- Chose 5-fold cross-validation on balanced training set to assess overall accuracy
Predictive Modeling
- Ran parameter tuning grid searches over Decision Trees, SVMs, Random Forests, GBDTs, and Neural Networks
- Found that an optimized GBDT delivered an accuracy of 76.5% and recall of 24.2%