CS 5043: HW5: RNNs + CNNs

Assignment notes:

The Problem

Proteins are chains of amino acids that perform many different biological functions, depending on the specific sequence of amino acids. Families of amino acid chains exhibit similarities in their structure and function. For a new chain, one problem we would like to solve is that of predicting the family that it most likely belongs to. In this assignment, we will be classifying amino acid chains as one of 18 different families.

Data Set

The data set is available on SCHOONER. The data are already partitioned into five independent folds, with the classes stratified across the folds (the samples for class k are distributed equally across the five folds). However, the classes have different numbers of examples, with as much as a 1:10 ratio between the minority and majority classes.
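One common way to account for a class imbalance like the one described above is to weight each class inversely to its frequency. The assignment does not state that this is required, so treat the following as an optional, minimal sketch. The function name and the toy labels are hypothetical and are not part of pfam_loader.py.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class_id: weight}, with weight inversely proportional to class count.

    The weights are scaled so that the average weight over all samples is 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy labels only (not the real data): class 0 is 10x more common than class 1
toy_labels = [0] * 10 + [1] * 1
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

A dictionary of this form (integer class index mapped to a float weight) is the format that the `class_weight` argument of Keras `Model.fit` expects.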

Each example consists of:

There are two ways to load the data (provided in pfam_loader.py):

Both loaders return the same dictionary format (documented in pfam_loader.py), and the data sets fit entirely in RAM. Two important properties of this dictionary are:

You can also use create_tf_datasets() to convert this dictionary representation into TF Datasets, which makes it easy to choose a batch size that fits well within GPU memory.

Deep Learning Experiment

Objective: Create two different neural network models that can predict the family of a given amino acid chain. We will compare two different architectures:
  1. SimpleRNN + AveragePooling: One or more layers of SimpleRNN modules, separated by an average pooling step:
  2. CNN: One or more layers of 1D CNN + Appropriate MaxPooling
The precise definition of these is up to you, but you should stay within these classes of solutions. You should also tune the hyper-parameters of each model so that it performs as well as it can on the validation set, without changing the model architecture. That said, you should expect some performance differences between these model types.
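In both architectures, each pooling step shortens the sequence that the next layer sees, which matters when choosing how many layers to stack. The sketch below computes the resulting sequence length, assuming the Keras defaults for AveragePooling1D and MaxPooling1D ('valid' padding, strides equal to the pool size). The helper names and the 1000-step input length are hypothetical; the actual sequence length depends on the data set.

```python
def pooled_length(length, pool_size, strides=None):
    """Output length of a 1D pooling layer with 'valid' padding.

    strides defaults to pool_size, matching the Keras pooling-layer default.
    """
    if strides is None:
        strides = pool_size
    if length < pool_size:
        return 0  # no full pooling window fits in the sequence
    return (length - pool_size) // strides + 1


def length_after_stack(length, pool_sizes):
    """Apply a series of pooling steps and return the final sequence length."""
    for p in pool_sizes:
        length = pooled_length(length, p)
    return length


# Hypothetical 1000-step sequence passed through two pooling steps of size 4
print(length_after_stack(1000, [4, 4]))  # → 62
```

Here the first pooling step reduces 1000 steps to 250, and the second reduces 250 to 62, so a second SimpleRNN layer would only need to process a fraction of the original time steps.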

Notes:

Performance Reporting

Once you have selected a reasonable architecture and set of hyper-parameters, perform 5 rotations of experiments for each architecture (10 different models). Produce the following figures:
  1. Figure 0a,b: Network architectures from plot_model()

  2. Figure 1: Training set accuracy as a function of epoch for each rotation. Include both models. Each model type should be clearly indicated by the line style/color.

  3. Figure 2: Validation set accuracy as a function of epoch for each rotation. Include both models.

  4. Figure 3: Bar plot of test set accuracy for both models.

  5. Figure 4: Scatter plot of test set accuracy: RNN vs CNN
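The five rotations mentioned above can be pictured as cycling which folds play the training, validation, and test roles. The following is a minimal sketch under an assumed scheme (three training folds, then one validation fold, then one test fold, all shifted by the rotation number). The function name and the scheme itself are assumptions; the loaders in pfam_loader.py may already handle rotations for you and may use a different assignment.

```python
def rotation_folds(rotation, n_folds=5, n_train=3):
    """Return (train_folds, val_fold, test_fold) for one rotation.

    Assumed scheme: training folds start at `rotation`; the validation and
    test folds are the last two folds of the cycle, wrapping around.
    """
    train = [(rotation + i) % n_folds for i in range(n_train)]
    val = (rotation + n_folds - 2) % n_folds
    test = (rotation + n_folds - 1) % n_folds
    return train, val, test


# One line per rotation; each architecture would be trained once per rotation
for r in range(5):
    train, val, test = rotation_folds(r)
    print(f"rotation {r}: train={train} val={val} test={test}")
```

Under this scheme, every fold serves as the test fold exactly once across the five rotations, so the five test-set accuracies per architecture come from disjoint test data.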


Provided Code

In code for class:


Hints


What to Hand In

Turn in a single zip file that contains:

Do not turn in pickle files.

Grading

References


andrewhfagg -- gmail.com

Last modified: Wed Apr 3 16:12:59 2024