CS 5043: HW6: Advanced RNNs and Attention
Assignment notes:
- Deadline: Thursday, April 11th @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets in the open on
Slack (only!), but not full solutions. In addition, downloading
solution-specific code is not allowed.
The Problem
We are using the same problem type as in the previous homework
assignment, but we are using a more complicated data set. Specifically:
- Strings are now 3934 amino acids long (as before, they have been
padded so every example has the same number of tokens).
- We are now distinguishing between 46 different protein
families.
- The full data set is ~200,000 samples.
- Use dataset version "B" when loading.
Because of the scale of the problem, we are going to use advanced
techniques for classifying these strings: a GRU-based network and a
Multi-Headed Attention network, as described below.
Deep Learning Experiment
You are to implement two different network architectures:
GRU network (a sketch follows this list):
- Embedding layer
- Preprocessing Conv1D layer with stride of 4.
- One or more GRU layers (keras.layers.GRU), with the last one
configured with return_sequences=False.
- One or more Dense layers, with the output using a softmax
non-linearity
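Below is a minimal Keras sketch of this architecture. The layer widths, kernel size, vocabulary size (n_tokens), optimizer settings, and the sparse loss (which assumes integer class labels) are placeholder assumptions, not required values; only the overall structure follows the specification above.

```python
# Minimal sketch of the GRU architecture; all hyper-parameters below are
# placeholder assumptions (n_classes=46 and the stride of 4 come from the
# assignment; everything else is illustrative).
from tensorflow import keras
from tensorflow.keras import layers

def build_gru_model(len_max=3934, n_tokens=25, n_classes=46):
    model = keras.Sequential([
        layers.Input(shape=(len_max,)),
        # Token embedding for the amino-acid sequence
        layers.Embedding(input_dim=n_tokens, output_dim=32),
        # Preprocessing Conv1D with stride 4 shortens the sequence 4x
        layers.Conv1D(filters=64, kernel_size=8, strides=4,
                      padding='same', activation='elu'),
        # Stacked GRUs; only the last one drops the sequence dimension
        layers.GRU(64, return_sequences=True),
        layers.GRU(64, return_sequences=False),
        layers.Dense(128, activation='elu'),
        # Softmax output over the 46 protein families
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model
```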
Multi-Headed-Attention network (remember that you need to use the
Model API for this; a sketch follows this list):
- Embedding layer.
- Preprocessing Conv1D layer with stride of 4.
- PositionalEncoding layer that augments the tokens with position
information (use the default 'add' combination type)
- One or more MHA layers (keras.layers.MultiHeadAttention). Note
that when an instance of this class is called, it takes two
arguments (the query and the key/value tensors; in this situation,
they are the same tensor).
- A layer that reduces the set of hyper tokens to a single one. You
could explicitly clip out one of the tokens from the sequence, or
use something like GlobalMaxPooling1D.
- One or more Dense layers, with the output using a softmax
non-linearity
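Below is a sketch of this architecture using the Model (functional) API. The PositionalEncoding class shown here is only a sinusoidal stand-in for the course-provided layer (which supports the 'add' combination type mentioned above); all sizes, the number of MHA layers, and the optimizer settings are placeholder assumptions.

```python
# Sketch of the Multi-Headed Attention architecture using the Model API.
# The PositionalEncoding class is a stand-in for the course-provided layer;
# all sizes below are placeholder assumptions.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class PositionalEncoding(layers.Layer):
    """Stand-in 'add'-style positional encoding: adds fixed sinusoidal
    position information to the (shortened) token sequence."""
    def build(self, input_shape):
        seq_len, d_model = input_shape[1], input_shape[2]
        pos = np.arange(seq_len)[:, np.newaxis]
        i = np.arange(d_model)[np.newaxis, :]
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles[:, 0::2])
        enc[:, 1::2] = np.cos(angles[:, 1::2])
        self.encoding = tf.constant(enc[np.newaxis, ...], dtype=tf.float32)

    def call(self, x):
        return x + self.encoding

def build_mha_model(len_max=3934, n_tokens=25, n_classes=46,
                    n_embedding=32, n_heads=4, key_dim=16, n_mha_layers=2):
    inputs = keras.Input(shape=(len_max,))
    x = layers.Embedding(input_dim=n_tokens, output_dim=n_embedding)(inputs)
    # Preprocessing Conv1D with stride 4 shortens the sequence 4x
    x = layers.Conv1D(filters=n_embedding, kernel_size=8, strides=4,
                      padding='same', activation='elu')(x)
    x = PositionalEncoding()(x)

    for _ in range(n_mha_layers):
        # Self-attention: the same tensor is passed as query and key/value
        x = layers.MultiHeadAttention(num_heads=n_heads, key_dim=key_dim)(x, x)

    # Reduce the set of hyper tokens to a single vector
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(128, activation='elu')(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model
```

Residual connections and layer normalization around each attention layer (as in a standard Transformer block) are not shown; the specification above only requires one or more MHA layers, so adding them is a design choice left to you.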
NOTE:
The dnn environment has a library problem that shows up when we try to
use GRU and Attention layers. Please use our older environment
instead:
conda activate tf
module load cuDNN/8.9.2.26-CUDA-12.2.0
Performance Reporting
Once you have selected a reasonable architecture and set of
hyper-parameters, produce the following figures (a plotting sketch
follows this list):
- Figure 0a,b: Network architectures from plot_model()
- Figure 1: Training set Accuracy as a function of epoch for each of
the five rotations for both models (all on one figure).
- Figure 2: Validation set Accuracy as a function of epoch for
each of the rotations (both models on one figure)
- Figure 3: Scatter plot of Test Accuracy for the GRU vs
Attention models.
- Figure 4: Scatter plot of the number of training epochs for the
GRU and Attention models.
- Reflection: answer the following questions:
- For your Multi-Headed Attention implementation, explain
how you translated your last MHA layer into an output
probability distribution
- Is there a difference in performance between the two
model types?
- How much computation did you need for the
training for each model type in terms of the number of
epochs and time?
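A minimal plotting sketch for Figures 1 and 3 follows, assuming each rotation's results were saved to a pickle containing the Keras history dictionary and the test accuracy. The file names and dictionary keys ('history', 'test_accuracy') are hypothetical placeholders and should be adapted to however you actually save your results.

```python
# Sketch for Figures 1 and 3. File names and dictionary keys are
# hypothetical placeholders for your own results format.
import pickle
import matplotlib.pyplot as plt

def load_results(pattern, n_rotations=5):
    """Load one saved result dictionary per rotation."""
    results = []
    for r in range(n_rotations):
        with open(pattern % r, 'rb') as fp:
            results.append(pickle.load(fp))
    return results

gru = load_results('results/gru_rot_%d.pkl')
mha = load_results('results/mha_rot_%d.pkl')

# Figure 1: training accuracy vs epoch for every rotation of both models
plt.figure()
for r, res in enumerate(gru):
    plt.plot(res['history']['sparse_categorical_accuracy'],
             label='GRU rot %d' % r)
for r, res in enumerate(mha):
    plt.plot(res['history']['sparse_categorical_accuracy'], '--',
             label='MHA rot %d' % r)
plt.xlabel('Epoch')
plt.ylabel('Training accuracy')
plt.legend()
plt.savefig('figure1.png')

# Figure 3: test accuracy, one point per rotation (GRU on x, Attention on y)
plt.figure()
plt.scatter([res['test_accuracy'] for res in gru],
            [res['test_accuracy'] for res in mha])
plt.plot([0.9, 1.0], [0.9, 1.0], 'k:')  # reference diagonal
plt.xlabel('GRU test accuracy')
plt.ylabel('Attention test accuracy')
plt.savefig('figure3.png')
```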
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
[Gradescope can render notebook files directly - no need to
convert to pdf!]
- Figures 0-4
- Reflection
Grading
- 20 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figures 0a,b
- 10 pts: Figure 1
- 10 pts: Figure 2
- 10 pts: Figure 3
- 10 pts: Figure 4
- 15 pts: Reasonable test set performance for all rotations of at
least one of the architectures (0.99 or better)
- 15 pts: Reflection
References
- Full Data Set: J. Mistry, S. Chuguransky, L. Williams, M. Qureshi,
G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj,
L.J. Richardson, R.D. Finn, A. Bateman. Pfam: The protein families
database in 2021. Nucleic Acids Research (2020).
doi: 10.1093/nar/gkaa913
- Keras Multi-headed Attention Layer
Hints
andrewhfagg -- gmail.com