Gene Selection

Project Status: [Completed]

Introduction

This project has two goals: to help guide genetic research, and to build a classifier that can quickly distinguish Acute Lymphoblastic Leukemia (ALL) from Acute Myeloid Leukemia (AML) so that the patient receives the proper treatment. Both are pursued by identifying combinations of genes that play a significant role in distinguishing the two cancer classes. The difficulty is that the data contains very few samples (72) relative to the number of features (9000+). I approach this challenge by bootstrapping artificial neural networks to form a collection of models that can then be used to score and rank each gene. Finally, the number of highest-scoring genes is optimized with a K-Nearest Neighbors model to identify the subset best suited to distinguishing the two cancer classes.
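
The bootstrapped scoring idea described above can be illustrated with a minimal sketch. It is not the project's actual code: it uses scikit-learn's `MLPClassifier` as a stand-in for the TensorFlow network (so there is no dropout), and the function name `score_genes`, the random gene subset per model, the out-of-bag evaluation, and the "accuracy above the majority-class baseline" scoring rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def score_genes(X, y, n_models=100, genes_per_model=20, seed=0):
    """Hypothetical scoring rule: credit every gene used by a bootstrapped
    model that beats the majority-class baseline on its out-of-bag patients."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    scores = np.zeros(n_genes)
    for _ in range(n_models):
        # Bootstrap: draw patients with replacement; the rest are out-of-bag.
        boot = rng.integers(0, n_samples, size=n_samples)
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0 or np.unique(y[boot]).size < 2:
            continue  # nothing to evaluate on, or only one class to learn
        # Assumption: each model sees a small random subset of genes.
        genes = rng.choice(n_genes, size=genes_per_model, replace=False)
        model = MLPClassifier(
            hidden_layer_sizes=(4,),  # deliberately small network
            solver="lbfgs",           # works well on very small data sets
            alpha=1.0,                # strong L2 penalty (stand-in for dropout)
            max_iter=200,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        model.fit(X[boot][:, genes], y[boot])
        accuracy = model.score(X[oob][:, genes], y[oob])
        # Accuracy obtained by always predicting the most common class.
        baseline = max(np.mean(y[oob] == c) for c in np.unique(y))
        if accuracy > baseline:
            scores[genes] += accuracy - baseline
    return scores
```

Calling `score_genes(X, y)` on a `(patients, genes)` array and a label vector returns one non-negative score per gene; sorting the scores in descending order gives a ranking of the kind used later for gene selection.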

Methods Used

  • Inferential Statistics
  • Predictive Modeling
  • Machine Learning
  • Neural Networks
  • Algorithm Development
  • Data Visualization

Technologies

  • Python, Jupyter
  • Pandas, NumPy
  • scikit-learn
  • TensorFlow
  • Seaborn, Matplotlib

Project Overview

  • Data
    • Data consists of 72 patients with over 9000 gene expression levels
    • 47 classified as patients with ALL, 25 with AML
    • Data already cleaned and suitable for use in ML models
  • Baseline Model for Bootstrapping
    • Wanted a model that allowed for nonlinear combinations of gene expressions
    • Needed a model that remained weak and trained quickly
    • Chose to use a small densely connected neural network with high dropout rate to satisfy the preconditions
  • Gene Scoring
    • Initial plan was to use individual model accuracies to score the individual genes
    • Most models only ever predicted the majority class, however, so a new methodology had to be developed
    • Decided to consider only models that scored above the majority-class accuracy threshold, in order to identify genes (or combinations of genes) that may have predictive power
  • KNN Optimization
    • Performed a grid search over the number of highest scoring genes to find the optimal number suited for classification
    • Identified 48 genes that, when used in combination, achieved an overall accuracy of 97.2% in classifying patients as having ALL or AML
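
The KNN optimization step can be sketched roughly as follows. This is an illustrative sketch rather than the project's implementation: the function name `best_gene_count`, the use of 5-fold cross-validation, the feature standardization, and `n_neighbors=5` are assumptions, and `scores` is taken to be a per-gene score vector from the gene scoring step.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def best_gene_count(X, y, scores, candidate_counts, n_neighbors=5, cv=5):
    """Search over how many top-scoring genes to keep, using the
    cross-validated accuracy of a KNN classifier as the criterion."""
    ranked = np.argsort(scores)[::-1]  # gene indices, highest score first
    results = {}
    for k in candidate_counts:
        top_genes = ranked[:k]
        # Scale features so no single gene dominates the distance metric.
        model = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=n_neighbors),
        )
        results[k] = cross_val_score(model, X[:, top_genes], y, cv=cv).mean()
    # On ties, the candidate listed first wins.
    best_k = max(results, key=results.get)
    return best_k, results
```

For example, `best_gene_count(X, y, scores, range(1, 101))` would evaluate every subset size from 1 to 100 genes and return the best size along with the mean accuracy for each candidate.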