Gene Selection
Project Status: [Completed]
Introduction
The purposes of this project are ultimately to help guide genetic research and to construct a classifier that may quickly distinguish Acute Lymphoblastic Leukemia from Acute Myeloid Leukemia to ensure the patient receives the proper treatment. This is achieved by identifying combinations of genes that play a significant role in distinguishing the two cancer classes. The struggle in this process is that the data contains very few samples (72) with respect to the number of features (9000+). In this project, I approach this challenge by leveraging bootstrapping with artificial neural networks to form a collection of models that can then be used to score and rank each gene. Finally, the number of highest scoring genes is optimized with a K-Nearest Neighbors model to identify the sub-selection most suited to distinguising the two cancer classes.
Methods Used
- Inferential Statistics
- Predictive Modeling
- Machine Learning
- Neural Networks
- Algorithm Development
- Data Visualization
Technologies
- Python, Jupyter
- Pandas, Numpy
- Scikit-Learn
- Tensorflow
- Seaborn, Matplotlib
Project Overview
- Data
- Data consists of 72 patients with over 9000 gene expression levels
- 47 classified as patients with ALL, 25 with AML
- Data already cleaned and suitable for use in ML models
- Baseline Model for Bootstrapping
- Wanted a model that allowed for nonlinear combinations of gene expressions
- Needed a model that remained weak and trained quickly
- Chose to use a small densely connected neural network with high dropout rate to satisfy the preconditions
- Gene Scoring
- Initial plan was to use individual model accuracies to score the individual genes
- Majority of models only predicted majority class, however, so new methodology needed to be developed
- Decided to only consider models that scored above that threshold to identify genes (or combinations) that may have predictive power
- KNN Optimization
- Performed a grid search over the number of highest scoring genes to find the optimal number suited for classification
- Identified 48 genes that when used in combination achieved an overall accuracy of 97.2% for classifying patients as having ALL or AML