CS 5043: HW5: RNNs + CNNs
Assignment notes:
- Deadline: Tuesday April 2nd @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. As with HW3/4,
you may share solution-specific code snippets in the open on
Slack (only!), but
not full solutions. In addition, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
Proteins are chains of amino acids that perform many different
biological functions, depending on the specific sequence of amino
acids. Families of amino acid chains exhibit similarities in their
structure and function. For a new chain, one problem we would like to
solve is that of predicting the family that it most likely belongs
to. In this assignment, we will be classifying amino acid chains as
one of 46 different families.
Data Set
The data set is available on SCHOONER:
- /home/fagg/datasets/pfam: directory tree containing the data
(including two tar files). Note that we are using the
pfamB data set.
The data are already partitioned into five independent folds, with the
classes stratified across the folds (the samples for class k are
distributed equally across the five folds). However, the different
classes have different numbers of examples, with as much as a 1:10
ratio between the minority and majority classes.
Each example consists of:
- A tokenized string of length 3934 amino acids. The strings in
the data set have been padded this time on the left-hand side,
just as in our example in class.
In addition to
the padding token, there is also a token that corresponds to
the "unknown" amino acid. Within each string, there can be long
runs of "unknown" tokens.
- A tokenized class label.
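To make the input encoding concrete, here is a minimal sketch of left padding, assuming hypothetical token ids for the padding and "unknown" tokens (the real vocabulary and tokenization are defined in pfam_loader.py):

```python
# Token ids below are assumptions for illustration only; the actual
# ids come from the tokenization performed in pfam_loader.py.
PAD = 0      # padding token (assumed id)
UNK = 1      # "unknown" amino acid token (assumed id)
MAX_LEN = 3934

def left_pad(tokens, max_len=MAX_LEN, pad=PAD):
    """Pad a tokenized amino acid chain on the left to a fixed length."""
    return [pad] * (max_len - len(tokens)) + list(tokens)

# A short example chain containing a run of "unknown" tokens.
chain = [5, 9, UNK, UNK, UNK, 12, 7]
padded = left_pad(chain, max_len=10)
```

In the real data set every string is already padded to the full 3934 tokens, so no padding step is needed at training time.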
There are two ways to load the data (provided in pfam_loader.py):
- prepare_data_set(): loads the raw csv data, constructs the
train/validation/test data sets, and performs the
tokenization. These files are smaller, but require CPU
processing before training.
- load_rotation(): loads an already constructed rotation from a
pickle file. These files are a lot larger, but require no
processing once loaded.
Both loaders return the same data set format (documented in
pfam_loader.py), and the data sets fit entirely in RAM.
Deep Learning Experiment
Objective: Create three different neural network models that can predict the family
of a given amino acid chain. We will compare three different architectures:
- SimpleRNN: One or more layers of SimpleRNN modules
- CNN: One or more layers of 1D CNN + Appropriate MaxPooling
- SimpleRNN + AveragePooling: Multiple layers of SimpleRNN
modules separated by AveragePooling modules.
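As one concrete illustration of the CNN family above, here is a minimal Keras sketch. The vocabulary size, embedding width, filter counts, kernel sizes, and pool sizes are all placeholder assumptions to be tuned against the validation set, not a recommended configuration:

```python
import tensorflow as tf
from tensorflow import keras

N_TOKENS = 25    # assumed vocabulary size (padding + unknown + amino acids)
N_CLASSES = 46
MAX_LEN = 3934

# CNN family sketch: Embedding -> (Conv1D + MaxPooling1D) blocks ->
# global pooling -> softmax over the 46 families.
model = keras.Sequential([
    keras.layers.Input(shape=(MAX_LEN,)),
    keras.layers.Embedding(N_TOKENS, 16),
    keras.layers.Conv1D(32, 5, activation='elu'),
    keras.layers.MaxPooling1D(4),
    keras.layers.Conv1D(64, 5, activation='elu'),
    keras.layers.MaxPooling1D(4),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(N_CLASSES, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['sparse_categorical_accuracy'])
```

The SimpleRNN variants follow the same pattern, with the Conv1D/MaxPooling1D blocks replaced by SimpleRNN layers (and, for the third architecture, AveragePooling1D layers between them).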
The precise definition of these is up to you, but you should stay
within these classes of solutions. You should also adjust
hyper-parameters for each so that they can do
their best (with respect to the validation set) without changing model
architecture. That said, you should expect some performance
differences between these model types.
Notes:
- Each network should have an Embedding layer
- Your network should have 46 outputs, one for each class. Use
the softmax() nonlinearity for the final layer.
- Class labels from the loader are integers (they are not one-hot
encoded). You can either convert the integers to a 1-hot
encoded representation and use categorical cross-entropy
for your loss, or you can keep the integers and use sparse
categorical cross-entropy (this loss function will
automatically do the conversion for you).
- Likewise, you will need to use sparse categorical
accuracy if you are using the raw integers.
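The equivalence between the two label formats can be checked directly in NumPy: categorical cross-entropy on one-hot labels and sparse categorical cross-entropy on the raw integers compute the same quantity.

```python
import numpy as np

# Integer class labels, as returned by the loader (values in 0..45).
y_int = np.array([3, 0, 45])
n_classes = 46

# One-hot conversion: row i has a 1 in column y_int[i].
y_onehot = np.eye(n_classes)[y_int]

# Toy softmax outputs for three samples.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Categorical cross-entropy on the one-hot labels...
cce = -np.mean(np.sum(y_onehot * np.log(probs), axis=1))
# ...equals sparse categorical cross-entropy on the raw integers.
scce = -np.mean(np.log(probs[np.arange(3), y_int]))
```

In Keras, the same choice shows up as `categorical_crossentropy` vs. `sparse_categorical_crossentropy` in model.compile().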
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, produce the following figures:
- Figure 0a,b,c: Network architectures from plot_model()
- Figure 1: Training set accuracy as a function of epoch for each
rotation. Include all 3 models. Each model type should be
clearly indicated by the line style/color
- Figure 2: Validation set accuracy as a function of epoch for
each of the rotations. Include all three models
- Figure 3: Test set accuracy for all three models. For rotation
triples (corresponding rotation for the three models), compute:
- test(CNN) - test(SimpleRNN), and
- test(SimpleRNN+Avg) - test(SimpleRNN)
Create a scatter plot using these two as coordinates.
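Computing the coordinates for Figure 3 amounts to two element-wise differences over the per-rotation test accuracies. The accuracy values below are hypothetical placeholders; substitute the numbers measured from your own five rotations:

```python
import numpy as np

# Hypothetical per-rotation test accuracies (one entry per rotation).
acc_rnn     = np.array([0.61, 0.58, 0.63, 0.60, 0.59])  # SimpleRNN
acc_cnn     = np.array([0.82, 0.80, 0.84, 0.81, 0.83])  # CNN
acc_rnn_avg = np.array([0.74, 0.71, 0.76, 0.73, 0.72])  # SimpleRNN+Avg

# One (x, y) point per rotation triple.
x = acc_cnn - acc_rnn          # test(CNN) - test(SimpleRNN)
y = acc_rnn_avg - acc_rnn      # test(SimpleRNN+Avg) - test(SimpleRNN)
```

The scatter plot itself is then just `plt.scatter(x, y)` with appropriate axis labels; points above/right of the origin indicate rotations where the alternative model beat the plain SimpleRNN.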
Provided Code
In code for class:
- pfam_loader.py: data loading and conversion tools
Hints
- If you are using the GPU on the supercomputer, you might try
unrolling your recurrent layer.
- You will probably not be able to use the entire dataset all at
once in model.fit(). Instead, use create_tf_datasets() to
translate the numpy arrays into TF datasets that are batched.
- The training set has 110K samples in it; it can take a long
time to touch all of these samples in a single epoch. If you
want to reduce the number of samples you use for each epoch,
turn on repeat in create_tf_datasets() and set steps_per_epoch
in model.fit() to the number of batches you want to use per epoch.
- The dataset is sorted by class, so you can end up with a
strange training effect when batching. So, if you batch, then
you should shuffle, too. I am using a shuffle buffer of 100.
- If you set steps_per_epoch larger than the number of batches in
your training set, then training will halt after one epoch.
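The batching, shuffling, repeating, and caching hints above can be sketched directly with tf.data. The array shapes here are toy stand-ins for the real 110K x 3934 training arrays, and the pipeline is an assumption about what create_tf_datasets() builds; use the provided function in practice:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the tokenized inputs and integer labels
# returned by the loader.
ins = np.zeros((1000, 8), dtype=np.int32)
outs = np.zeros((1000,), dtype=np.int32)

# cache -> shuffle (small buffer) -> batch -> repeat -> prefetch.
ds = (tf.data.Dataset.from_tensor_slices((ins, outs))
      .cache()
      .shuffle(100)
      .batch(32)
      .repeat()
      .prefetch(tf.data.AUTOTUNE))

# With repeat() on, model.fit() must be told how many batches make
# up one "epoch", e.g.:
# model.fit(ds, epochs=10, steps_per_epoch=20, ...)
```

Shuffling before batching is what breaks up the class-sorted ordering of the data set; without it, each batch would contain mostly one class.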
- Turning on caching helps on the GPU (I haven't tested the
CPU-only case)
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook files (.ipynb)
[Gradescope can render notebook files directly - no need to
convert to pdf!]
- Figures 0-3
Do not turn in pickle files.
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 15 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 20 pts: Reasonable relative test set performance for all rotations
References
- Full Data Set
- Pfam: The protein families database in 2021: J. Mistry,
S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar,
E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj,
L.J. Richardson, R.D. Finn, A. Bateman Nucleic Acids Research
(2020) doi: 10.1093/nar/gkaa913
andrewhfagg -- gmail.com
Last modified: Sat Apr 1 17:58:59 2023