CS 5043: HW5: RNNs + CNNs
Assignment notes:
- Deadline: Thursday, April 4th @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets in the open on
Slack (only!), but not full solutions. In addition, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
Proteins are chains of amino acids that perform many different
biological functions, depending on the specific sequence of amino
acids. Families of amino acid chains exhibit similarities in their
structure and function. For a new chain, one problem we would like to
solve is that of predicting the family that it most likely belongs
to. In this assignment, we will be classifying amino acid chains as
one of 18 different families.
Data Set
The data set is available on SCHOONER:
- /home/fagg/datasets/pfam: directory tree containing the data
(including two tar files). Note that we are using the
pfam data set.
The data are already partitioned into five independent folds, with the
classes stratified across the folds (the samples for class k are
distributed equally across the five folds). However, the different
classes have different numbers of examples, with as much as a 1:10
ratio between the minority and majority classes.
Each example consists of:
- A tokenized string of length 1574 amino acids. The strings in the
data set have been padded on the left-hand side. In addition to the
padding token, there is also a token that corresponds to the
"unknown" amino acid. Within each string, there can be long runs of
"unknown" tokens.
- A tokenized class label.
There are two ways to load the data (provided in pfam_loader.py):
- prepare_data_set(): loads the raw csv data, constructs the
train/validation/test data sets, and performs the
tokenization. These files are smaller, but require CPU
processing before training.
- load_rotation(): loads an already-constructed rotation from a
pickle file. These files are much larger, but require no
processing once loaded. I suggest using this approach.
Both loaders return the same dictionary format (documented in
pfam_loader.py), and the data sets fit entirely in RAM. Two
important properties of this dictionary are:
- n_tokens: the total number of distinct tokens (including the
padding and unknown tokens)
- n_classes: the number of distinct PFAM classes in the data set.
You can also use create_tf_datasets() to convert this dictionary
representation into TF Datasets, which makes it easy to choose
batch sizes that fit well in GPU memory.
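The conversion from the loader dictionary to a TF Dataset can be sketched as below. The arrays here are random stand-ins for the real tokenized strings and labels (the actual key names and array shapes are documented in pfam_loader.py); create_tf_datasets() does the equivalent work for you.

```python
import numpy as np
import tensorflow as tf

# Stand-ins for the arrays inside the loader dictionary: 100 dummy
# sequences of length 1574, with n_tokens and n_classes as reported
# by the dictionary (values here are placeholders).
n_tokens, n_classes, seq_len = 25, 18, 1574
ins = np.random.randint(0, n_tokens, size=(100, seq_len), dtype=np.int32)
outs = np.random.randint(0, n_classes, size=(100,), dtype=np.int32)

# Manual equivalent of what create_tf_datasets() produces for one split:
ds = tf.data.Dataset.from_tensor_slices((ins, outs))
ds = ds.shuffle(100).batch(16)

x, y = next(iter(ds))
print(x.shape, y.shape)  # (16, 1574) (16,)
```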
Deep Learning Experiment
Objective: Create two different neural network models that can predict the family
of a given amino acid chain. We will compare two different architectures:
- SimpleRNN + AveragePooling: One or more layers of SimpleRNN
modules, separated by an average pooling step:
- Use AveragePooling1D
- You may use a Bidirectional SimpleRNN in place of the
SimpleRNN
- All but the last RNN will return sequences
- CNN: One or more layers of 1D CNN + Appropriate MaxPooling
The precise definition of these models is up to you, but you should stay
within these classes of solutions. You should also tune the
hyper-parameters for each model so that it performs as well as possible
(with respect to the validation set) without changing the model
architecture. That said, you should expect some performance
differences between these model types.
Notes:
- Perform the experiment for rotations 0...4 for both model
types.
- Each network type should have an Embedding layer. Embedding
layers effectively pre-multiply the 1-hot encoded token with a
trainable matrix. Since the 1-hot encoding has exactly one '1', this
multiplication effectively selects a column from the matrix
(the Embedding layer is actually more clever than doing a large
matrix multiplication).
This trainable matrix allows the network to figure out which
tokens should share similar representations and which should
have very different representations. The size of the embedding
should be less than the number of distinct tokens.
- Your network should have 18 outputs, one for each class. Use
the softmax() nonlinearity for the final layer.
- Class labels from the loader are integers (they are not one-hot
encoded). You can either convert the integers to a 1-hot
encoded representation and use categorical cross-entropy
for your loss, or you can keep the integers and use sparse
categorical cross-entropy (this loss function will
automatically do the conversion for you).
- Likewise, use sparse categorical accuracy as your metric if you
are using the raw integer labels.
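Putting the notes above together, one possible shape for the two model types is sketched below. All layer sizes are placeholders, not tuned recommendations; the n_tokens/n_classes/sequence-length values would come from the loader dictionary.

```python
import tensorflow as tf
from tensorflow import keras

n_tokens, n_classes, seq_len = 25, 18, 1574  # placeholders; use the loader's values

# RNN stack: Embedding -> SimpleRNN (returning sequences) ->
# AveragePooling1D -> final SimpleRNN (no sequences) -> softmax output.
rnn = keras.Sequential([
    keras.Input(shape=(seq_len,), dtype='int32'),
    keras.layers.Embedding(n_tokens, 16),   # embedding dim < n_tokens
    keras.layers.SimpleRNN(32, return_sequences=True),
    keras.layers.AveragePooling1D(pool_size=4),
    keras.layers.SimpleRNN(32),             # last RNN does not return sequences
    keras.layers.Dense(n_classes, activation='softmax'),
])

# CNN stack: Embedding -> Conv1D/MaxPooling blocks -> global pooling ->
# softmax output.
cnn = keras.Sequential([
    keras.Input(shape=(seq_len,), dtype='int32'),
    keras.layers.Embedding(n_tokens, 16),
    keras.layers.Conv1D(32, 5, activation='relu'),
    keras.layers.MaxPooling1D(4),
    keras.layers.Conv1D(64, 5, activation='relu'),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(n_classes, activation='softmax'),
])

# Integer labels -> sparse loss and metric; clipnorm per the hints below.
for model in (rnn, cnn):
    model.compile(optimizer=keras.optimizers.Adam(clipnorm=1e-2),
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])

print(rnn.output_shape, cnn.output_shape)  # both (None, 18)
```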
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, perform 5 rotations of experiments for each
architecture (10 different models). Produce the following figures:
- Figure 0a,b: Network architectures from plot_model()
- Figure 1: Training set accuracy as a function of epoch for each
rotation. Include both models. Each model type should be
clearly indicated by the line style/color
- Figure 2: Validation set accuracy as a function of epoch for
each of the rotations. Include both models
- Figure 3: Bar plot of test set accuracy for both models.
- Figure 4: Scatter plot of test set accuracy: RNN vs CNN
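A sketch of the style expected for Figure 1 (and, analogously, Figure 2) is below: one curve per rotation per model type, with line style/color distinguishing the model types. The accuracy curves here are synthetic placeholders standing in for the Keras history objects you will collect.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

epochs = np.arange(50)

fig, ax = plt.subplots()
for rotation in range(5):
    # Fake accuracy histories; replace with history.history['...'] values.
    rnn_acc = 1 - np.exp(-epochs / (10 + rotation))
    cnn_acc = 1 - np.exp(-epochs / (8 + rotation))
    # Label only the first rotation so the legend has one entry per model.
    ax.plot(epochs, rnn_acc, 'b-', label='RNN' if rotation == 0 else None)
    ax.plot(epochs, cnn_acc, 'r--', label='CNN' if rotation == 0 else None)
ax.set_xlabel('Epoch')
ax.set_ylabel('Training accuracy')
ax.legend()
fig.savefig('figure1.png')

print(len(ax.lines))  # 10 curves: 5 rotations x 2 models
```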
Provided Code
In code for class:
- pfam_loader.py: data loading and conversion tools
Hints
- If you are using the GPU on the supercomputer, you might try
unrolling your recurrent layer.
- The training set has 110K samples; it can take a long time to
touch all of them in every epoch. If you want to reduce the
number of samples used per epoch, turn on repeat
in create_tf_datasets() and set steps_per_epoch in model.fit() to
the number of batches you want to use per epoch.
- The dataset is sorted by class, which can produce a strange
training effect when batching. If you batch, then you should
shuffle, too. I am using a shuffle buffer of 100.
- If you set steps_per_epoch larger than the number of batches in
your training set (and repeat is off), then training will halt
after one epoch.
- Turning on caching helps on the GPU (I haven't tested the
CPU-only case). The dataset is small enough to fit in RAM.
- It is possible to achieve an accuracy of .995 on independent
data with this data set. A very naive 1-layer SimpleRNN will
only achieve about .25. With the RNN and CNN architectures, you
should be able to achieve performance much closer to this high
end than the low end.
- In RNNs, we have very deep models and coupled parameters; you
will find that there are some regions
of the error space that have very steep gradients. I find that
RNNs benefit from gradient clipping. The Adam optimizer has an
argument clipnorm that allows you to enable this. I am
using 10^-2 and it is working quite well.
- You can also pull the pkl version of the datasets from
/scratch/fagg/pfam. Using this will improve start-up
time for your jobs.
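The shuffle/cache/repeat hints above can be combined into one input pipeline, sketched here with random stand-in arrays in place of the real training split (create_tf_datasets() builds the equivalent pipeline when the corresponding options are on):

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for the training arrays from the loader dictionary.
ins = np.random.randint(0, 25, size=(1000, 1574), dtype=np.int32)
outs = np.random.randint(0, 18, size=(1000,), dtype=np.int32)

# cache -> shuffle -> batch -> repeat: caching keeps the data in RAM,
# shuffling (buffer of 100, as above) breaks up the class-sorted
# ordering before batching, and repeat() allows steps_per_epoch to be
# smaller than a full pass over the data.
ds = (tf.data.Dataset.from_tensor_slices((ins, outs))
      .cache()
      .shuffle(100)
      .batch(32)
      .repeat())

# With repeat() on, model.fit(ds, steps_per_epoch=10, ...) would touch
# only 10 batches (320 samples) per "epoch".
x, y = next(iter(ds))
print(x.shape)  # (32, 1574)
```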
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook files (.ipynb)
- Figures 0-4
Do not turn in pickle files.
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 15 pts: Figure 4
- 10 pts: Reasonable test set performance for all rotations for at least one
of your model types (.95 or better)
References
- Full Data Set
- Pfam: The protein families database in 2021. J. Mistry,
S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar,
E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj,
L.J. Richardson, R.D. Finn, and A. Bateman. Nucleic Acids Research
(2020). doi: 10.1093/nar/gkaa913
andrewhfagg -- gmail.com
Last modified: Wed Apr 3 16:12:59 2024