CS 5043: HW6: String Classification
Assignment notes:
- Deadline: Tuesday, April 22nd @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets in the open on
Slack (only!), but not full solutions. In addition, downloading
solution-specific code is not allowed.
- Do not submit MSWord documents.
The Problem
Proteins are chains of amino acids that perform many different
biological functions, depending on the specific sequence of amino
acids. Families of amino acid chains exhibit similarities in their
structure and function. For a new chain, one problem we would like to
solve is that of predicting the family that it most likely belongs
to. In this assignment, we will be classifying amino acid chains as
one of 46 different families.
Data Set
Deep Learning Experiment
Objective: Compare the performance of three different neural network
model types that predict the family of a given amino acid chain. Each
model type will be composed of stacks of a different core layer type.
For my implementation, I have one network-building function that
generates all three of these model types (I suggest doing the same).
Your overall architecture will look like this:
- Embedding layer
- Preprocessing Conv1D layer with striding to reduce the temporal
length of the strings
- One or more modules composed of:
- SimpleRNN (keras.layers.SimpleRNN) layer or GRU layer
(keras.layers.GRU). Return sequences is True
- 1D average pooling
- A final recurrent layer with return sequences set to False
- One or more Dense layers, with the final output layer using a
softmax non-linearity
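As a sketch, a single builder for the SimpleRNN and GRU variants might
look like this (the function name, layer sizes, kernel widths, and
activations below are illustrative choices, not required values):

```python
from tensorflow import keras

def build_recurrent_model(n_tokens, maxlen, n_embeddings, n_classes,
                          rnn_layer=keras.layers.GRU, n_units=(64, 64)):
    # Illustrative sketch: all sizes/activations are example values
    inputs = keras.Input(shape=(maxlen,))
    # Embedding: integer tokens -> dense vectors
    x = keras.layers.Embedding(n_tokens, n_embeddings)(inputs)
    # Striding Conv1D shortens the temporal length (here by 8x)
    x = keras.layers.Conv1D(64, kernel_size=8, strides=8,
                            activation='elu')(x)
    # Recurrent modules: all but the last return full sequences
    for units in n_units[:-1]:
        x = rnn_layer(units, return_sequences=True)(x)
        x = keras.layers.AveragePooling1D(pool_size=2)(x)
    # Final recurrent layer collapses the sequence to one state vector
    x = rnn_layer(n_units[-1], return_sequences=False)(x)
    x = keras.layers.Dense(64, activation='elu')(x)
    outputs = keras.layers.Dense(n_classes, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)
```

Passing rnn_layer=keras.layers.SimpleRNN gives the first model type;
the same function can be extended with an attention branch to cover
the third.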
The Multi-Headed Attention network is a more involved architecture:
- Embedding layer
- PositionalEncoding layer that augments the tokens with position
information (use the default 'add' combination type)
- Preprocessing Conv1D layer with striding to reduce the temporal
length of the strings
- One or more modules composed of:
- MHA layers (keras.layers.MultiHeadAttention). Note
that when an instance of this class is "called", it takes two
arguments (query and key/value; in this situation, the same
tensor is used for both)
- 1D average pooling
- A layer that reduces the set of hyper tokens to a single one.
You could explicitly clip out one of the tokens from the
sequence or use something like GlobalMaxPooling1D
- One or more Dense layers, with the final output layer using a
softmax non-linearity
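A corresponding sketch of the attention variant (sizes are again
illustrative; the PositionalEncoding step from the provided
positional_encoder.py is marked with a comment because its interface
is not reproduced here):

```python
from tensorflow import keras

def build_attention_model(n_tokens, maxlen, n_embeddings, n_classes,
                          num_heads=4, head_size=8, n_modules=2):
    inputs = keras.Input(shape=(maxlen,))
    x = keras.layers.Embedding(n_tokens, n_embeddings)(inputs)
    # Insert the provided PositionalEncoding layer here ('add' mode)
    x = keras.layers.Conv1D(64, kernel_size=8, strides=8,
                            activation='elu')(x)
    for _ in range(n_modules):
        # Self-attention: the same tensor serves as query and key/value
        x = keras.layers.MultiHeadAttention(num_heads=num_heads,
                                            key_dim=head_size)(x, x)
        x = keras.layers.AveragePooling1D(pool_size=2)(x)
    # Reduce the remaining hyper tokens to a single vector
    x = keras.layers.GlobalMaxPooling1D()(x)
    x = keras.layers.Dense(64, activation='elu')(x)
    outputs = keras.layers.Dense(n_classes, activation='softmax')(x)
    return keras.Model(inputs=inputs, outputs=outputs)
```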
The precise definition of these is up to you, but you should stay
within these classes of solutions. You should also adjust
hyper-parameters for each so that they can do
their best (with respect to the validation set) without changing model
architecture. That said, you should expect some performance
differences between these model types.
Model Notes
- Perform the experiment for rotations 0...4 for all three model
types.
- Each network type should have an Embedding layer. Embedding
layers effectively pre-multiply the 1-hot encoded token with a
trainable matrix. Since the 1-hot encoding has exactly one '1', this
multiplication effectively selects a column from the matrix
(the EmbeddingLayer is actually more clever than doing a large
matrix multiplication).
This trainable matrix allows the network to figure out which
tokens should share similar representations and which should
have very different representations. The number of embeddings
should be less than the number of distinct tokens.
- Your network should have 46 outputs, one for each class. Use
the softmax() nonlinearity for the final layer.
- Class labels from the loader are integers (they are not one-hot
encoded). You can either convert the integers to a 1-hot
encoded representation and use categorical cross-entropy
for your loss, or you can keep the integers and use sparse
categorical cross-entropy (this loss function will
automatically do the conversion for you).
- Likewise, you will need to use sparse categorical
accuracy if you are using the raw integers
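For example, with integer labels the compile step pairs the sparse
loss with the sparse metric (the tiny stand-in model and random data
below are only for illustration):

```python
import numpy as np
from tensorflow import keras

# Tiny stand-in model; the point is the sparse loss/metric pairing
model = keras.Sequential([keras.Input(shape=(4,)),
                          keras.layers.Dense(46, activation='softmax')])
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.SparseCategoricalCrossentropy(),
              metrics=[keras.metrics.SparseCategoricalAccuracy()])

# Integer class labels (0..45) work directly -- no one-hot conversion
x = np.random.rand(8, 4).astype('float32')
y = np.random.randint(0, 46, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```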
API Notes
Loading Data
dat = load_rotation(basedir=args.dataset, rotation=args.rotation, version=args.version)
- dataset is the pfam directory that you are using
- rotation is an integer
- version should be 'B'
- dat is a dictionary containing the three data sets for this
rotation, as well as key meta-data
Convert to TF Dataset
Translates the dat structure from load_rotation() into a set of
three TF Datasets
dataset_train, dataset_valid, dataset_test = create_tf_datasets(
    dat,
    batch=batch,
    prefetch=args.prefetch,
    shuffle=args.shuffle,
    repeat=(args.steps_per_epoch is not None))
- Since individual examples are large, I suggest a small batch
size (~8)
Embedding Layer
An embedding layer translates a sequence of integers of some length
into a sequence of token embeddings. Each integer value corresponds to
one of the unique tokens. The length of the strings and the number of
unique tokens is given by the data set that you load. The number of
embeddings should be smaller than the number of unique tokens.
tensor = keras.Input(shape=(maxlen,))  # maxlen: string length from the
                                       # data set (avoid the name len,
                                       # which shadows the builtin)
input_tensor = tensor
tensor = keras.layers.Embedding(n_tokens,
                                n_embeddings)(tensor)
- Input tensor shape is: (batch, maxlen)
- Output tensor shape is: (batch, maxlen, n_embeddings)
Multi-Headed Attention
tensor = keras.layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=head_size)(tensor, tensor)
- num_heads is the number of attention heads to use
- key_dim is the length of the key/query/value vectors for
the individual heads. This should be small relative to the
full state vectors stored in the tensor input
Input and output tensor shape is: (batch, len, embeddings)
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, perform five rotations of experiments. Be
careful in your choice of GPU for each model type.
Produce the
following figures/results:
- Figures 0a,b,c: Network architecture from plot_model(). Please
include tensor shapes.
- Figures 1a,b: Training and validation set accuracy as a function of epoch for each rotation (each figure has fifteen curves).
- Figure 2: Validation set accuracy for the GRU model vs validation set
accuracy for the Attention model. There will be one curve per
rotation; each point along the curve represents the performance
of the two models for a specific epoch of training. Because these two models will require different
numbers of training epochs, pad the shorter of the two with its
final value so that both vectors are the same length.
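One way to do this padding (a small helper; the names are mine, and
any equivalent approach is fine):

```python
import numpy as np

def pad_to_match(curve_a, curve_b):
    """Pad the shorter accuracy curve with its final value so both
    curves have the same length."""
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    n = max(len(a), len(b))
    pad = lambda v: np.concatenate([v, np.full(n - len(v), v[-1])])
    return pad(a), pad(b)
```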
- Figure 3: Bar graph showing test accuracy for the
three different model types (one group of bars for each
rotation). Report these accuracies in text form, too.
- Figure 4: Combining the test data across the five folds, show
the contingency table. Show counts (not percentages).
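One way to accumulate the counts across folds (a sketch;
sklearn.metrics.confusion_matrix on the concatenated labels is an
equivalent alternative):

```python
import numpy as np

def contingency_table(y_true_folds, y_pred_folds, n_classes=46):
    """Accumulate a counts-based contingency table over all folds.
    Each argument is a list of integer label arrays, one per fold."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    for y_true, y_pred in zip(y_true_folds, y_pred_folds):
        # Rows: true class; columns: predicted class
        np.add.at(table, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return table
```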
- Reflection:
- Describe the specific choices you made in the model
architectures.
- Discuss in detail how consistent your model performance
is across the different rotations.
- Discuss the relative performance of the three model types.
- Compare and contrast the different models with respect
to the required number of training epochs and the amount
of time required to complete each epoch. Why does the
third model type require so much compute time for each
epoch?
Provided Code
In code for class:
- pfam_loader.py: data loading and conversion tools
- positional_encoder.py: positional encoder for Attention
Hints
- The dnn_2025_04 environment does not have GPU support (though it
is more up to date); dnn still has GPU support.
- If you are using the GPU on the supercomputer, you might try
unrolling your recurrent layer. Note that using the GPU and
unrolling the network can really help
or really hurt, depending on the specifics of library and
driver versions.
- The training set has ~200K samples in it; it can take a long
time to touch all of these samples. If you want to reduce the
number of samples used for each epoch, turn on repeat
in create_tf_datasets() and set steps_per_epoch in model.fit() to
the number of batches you want to use per epoch. You must do both.
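The pairing looks like this (the tiny stand-in dataset and model
below are only for illustration; with the real loader, repeat is
turned on via create_tf_datasets()):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Stand-in for dataset_train: it must repeat() so that a fixed
# steps_per_epoch never exhausts it part-way through training
x = np.random.rand(32, 4).astype('float32')
y = np.random.randint(0, 3, size=(32,))
dataset_train = tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat()

model = keras.Sequential([keras.Input(shape=(4,)),
                          keras.layers.Dense(3, activation='softmax')])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Use 2 batches per epoch instead of all 4
history = model.fit(dataset_train, epochs=2, steps_per_epoch=2, verbose=0)
```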
- The dataset is sorted by class, so you can end up with a
strange training effect when batching. So, if you batch, then
you should shuffle, too. I am using a shuffle buffer of 100+.
- If you set steps_per_epoch larger than the number of batches in
your training set, then training will halt after one epoch.
- Turning on caching helps on the GPU (I haven't tested the
CPU-only case). The dataset is small enough to fit in RAM.
- It is possible to achieve an accuracy of .995 for independent
data with this data set. A very naive 1-layer SimpleRNN will
only achieve low performance.
- In RNNs, we have very deep models and coupled parameters; you
will find that there are some regions
of the error space that have very steep gradients. I find that
RNNs benefit from gradient clipping. The Adam optimizer has an
argument clipnorm that allows you to enable this. I am
using 10^-2 and it is working quite well.
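For example (1e-2 matches the clipnorm value mentioned above; the
learning rate is illustrative):

```python
from tensorflow import keras

# Adam with gradient-norm clipping to tame the steep regions of the
# error surface that deep, coupled RNN parameters produce
opt = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1e-2)
```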
- You can also pull the pkl version of the datasets from
/scratch/fagg/pfam. Using this will improve start-up
time for your jobs.
- Be careful to only reserve a GPU if you are actually going to
use it.
Network / Training Notes
Frequent updates here...
- dnn supports GPUs; dnn_2025_04 does not
- So far, I have not had any success with using GPUs with any of
the architectures (they either error out or seem to go into
infinite loops). However, I have been running in CPU mode on
GPU nodes. Be careful about your thread and memory usage.
- The first 1D convolutional layer is really important for making
progress. I am using striding of 16 right now.
- RNN/GRU models can take a long time to show any interesting
performance increases. My GRU networks require 100-150s
per epoch, running for about 24 hours before they asymptote.
- Attention model: I have a model with 167K parameters that is
taking ~95s for each training epoch (full data set). I was
able to train up to .94+ (validation) in about 6 hours.
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook files (.ipynb)
- All of your text files (.txt)
- Figures + Reflection
Do not turn in pickle files.
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 15 pts: Figure 4
- 10 pts: Reasonable test set performance for all rotations for at least one
of your model types (.95 accuracy or better)
References
- Full Data Set
- J. Mistry, S. Chuguransky, L. Williams, M. Qureshi,
G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin,
S. Raj, L.J. Richardson, R.D. Finn, A. Bateman. Pfam: The
protein families database in 2021. Nucleic Acids Research
(2020). doi: 10.1093/nar/gkaa913
andrewhfagg -- gmail.com
Last modified: Mon Apr 14 14:54:42 2025