CS 5043: HW8: Conditional Diffusion Networks
Assignment notes:
- Deadline: Thursday, May 2nd @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets in the open on
Slack (only!), but not full solutions. In addition, downloading
solution-specific code is not allowed.
The Problem
We will continue our exploration into generating synthetic satellite
images that are conditioned on semantic image labels. Here, we will
use a Markov chain of models that slowly introduce noise into a real
image (this is how we will generate our training data), and a reverse
set of models that will start from a completely random image and
slowly remove noise until a meaningful satellite image is revealed.
Your model will take as input:
- A noised image
- A time step (its position in the denoising sequence)
- The semantic label image
And produce as output an estimate of the noise that must be removed
from the noised input image in order to make it less noisy.
Once your model is trained, we will use it to generate synthetic
images by:
- Randomly producing a noisy image (each pixel channel is drawn
from a standard Normal distribution)
- Over T time steps:
- Use the learned model to estimate the noise that must be
removed
- Subtract this estimate from the current image
- Inject a small amount of new noise to help the search
process
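To make this loop concrete, below is a minimal sampling sketch. It
assumes a trained model and schedule arrays (alpha, alpha_bar, sigma)
from the provided schedule code, and uses the standard DDPM-style
update; T, R, C, and the placeholder labels are illustrative
assumptions, not prescriptions.

    import numpy as np

    # Minimal sampling sketch (standard DDPM-style update); `model` is
    # your trained network, and alpha/alpha_bar/sigma are assumed to
    # come from the provided schedule code.
    T, R, C = 50, 64, 64
    labels = np.zeros((1, R, C, 7), dtype=np.float32)  # placeholder: use real labels

    # Start from pure noise: each pixel channel drawn from a standard Normal
    x = np.random.normal(size=(1, R, C, 3)).astype(np.float32)

    for t in range(T - 1, -1, -1):
        # Estimate the noise to remove at this time step
        eps = model.predict({'label_input': labels,
                             'image_input': x,
                             'time_input': np.array([[t]])},
                            verbose=0)

        # Subtract the (scaled) noise estimate from the current image
        x = (x - (1.0 - alpha[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps) \
            / np.sqrt(alpha[t])

        # Inject a small amount of fresh noise to help the search process
        if t > 0:
            x = x + sigma[t] * np.random.normal(size=x.shape).astype(np.float32)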
The Generator
We have provided a new version of the chesapeake_loader.py file with a
new TF Dataset creator called create_diffusion_dataset(). An
example of its use is given in the hw8_examples.ipynb file.
Following Algorithm 18.1 in the book, this generator produces batches
of examples. For each example, the generator randomly selects a time
step in the noising process to sample from and produces a single
2-tuple of the form (see the sketch after this list):
- A dictionary of input tensors. The names of the tensors are:
- label_input (shape: examples,R,C,7): A 1-hot encoded
representation of the semantic labels
- image_input (shape: examples,R,C,3): a noised
version of the real image that corresponds to the
labels. Note: the individual channels are in the range
+/-1 (they have been scaled from 0...1)
- time_input (shape: examples,1): An integer time
index that is the step in the noising process that this
sample was taken from.
- A tensor of desired noise output (shape: examples,R,C,3). Each
  channel is drawn from a standard Normal distribution, so the
  values are unbounded, but most will fall within the +/-1 range.
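As a quick sanity check, here is a hedged sketch of pulling one batch
and confirming this structure. The actual arguments to
create_diffusion_dataset() are demonstrated in hw8_examples.ipynb,
not here; the call below is a placeholder.

    from chesapeake_loader import create_diffusion_dataset

    # Placeholder call: supply the arguments shown in hw8_examples.ipynb
    ds = create_diffusion_dataset()

    inputs, noise = next(iter(ds))
    print(inputs['label_input'].shape)   # (examples, R, C, 7): 1-hot labels
    print(inputs['image_input'].shape)   # (examples, R, C, 3): noised image, +/-1
    print(inputs['time_input'].shape)    # (examples, 1): integer time step
    print(noise.shape)                   # (examples, R, C, 3): target noise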
The Model
You should start from the semantic labeling U-net that you have
already developed in HW 4 or your U-net from HW 7. The differences are:
- You now have 3 inputs. NOTE: the names of these Inputs MUST
be the same as the dictionary key names used by the
generator. If you
don't do this, then TF will not know how to connect the
generator values to the inputs of your model.
- You are predicting unbounded noise values. Think about what
this means for your output non-linearity.
- You can use MSE or MAE as the loss function.
- You should translate the position index into a form that can be
appended to the semantic labels. You have a couple of options:
- Copy a scaled version of this index (scaled so its
value is in the range 0...1) across a plane that is the
same size as your semantic labels: examples,R,C,1.
- Use the provided PositionEncoder layer, which uses the same
  positional encoding as in Attention: it takes as
  input a time step index and translates it into an
  embedding of a defined length. The shape of this for
  every example is examples,embedding_size. I am
  experimenting with an embedding size of 30 right now.
  Then, copy the embedding vector to each pixel:
  examples,R,C,embedding_size.
I provide the PositionEncoder class, and give examples of using
it and of copying a value across pixels in the examples notebook.
- I concatenate the input image, semantic label image, and the time
  encoding "image" together along the channel axis as input to the
  U-net (a sketch follows this list). In addition,
  I introduce the semantic labels and time encodings at each step
  of the "up" side of the U (at the same point where the skip
  connections are appended). I scale these down to the right
  sizes using AveragePooling2D.
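Putting these pieces together, here is a minimal sketch of the input
stage and output head, using the scaled-plane option for the time
index. Only the three Input names are required to be exactly as
shown; the image size, time step count, and stand-in layers are
assumptions, and the full U-net body is elided.

    import tensorflow as tf
    from tensorflow import keras

    R, C, T = 64, 64, 50  # assumed image size and number of time steps

    # Input names MUST match the generator's dictionary keys
    label_input = keras.Input(shape=(R, C, 7), name='label_input')
    image_input = keras.Input(shape=(R, C, 3), name='image_input')
    time_input = keras.Input(shape=(1,), name='time_input')

    # Option 1: scale the time index to 0...1 and copy it across an
    # R x C plane (shape: examples, R, C, 1)
    time_plane = keras.layers.Lambda(
        lambda t: tf.tile(tf.reshape(tf.cast(t, tf.float32) / (T - 1),
                                     (-1, 1, 1, 1)),
                          (1, R, C, 1)))(time_input)

    # Concatenate image, labels, and time encoding along the channel axis
    x = keras.layers.Concatenate(axis=-1)([image_input, label_input,
                                           time_plane])

    # ... full U-net body elided; stand-in conv + batch norm only ...
    x = keras.layers.Conv2D(32, 3, padding='same', activation='elu')(x)
    x = keras.layers.BatchNormalization()(x)

    # Unbounded noise prediction -> linear output activation
    noise_output = keras.layers.Conv2D(3, 1, padding='same',
                                       activation='linear',
                                       name='noise_output')(x)

    model = keras.Model(inputs=[label_input, image_input, time_input],
                        outputs=noise_output)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss='mse')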
The Fake Image Generator
After you have trained your model, use the provided generator notebook
to visualize the results. A few notes:
- The configuration of your image size, time steps, and the
  training schedule parameters (alpha, beta, sigma) must be the
  same as those used during the training process.
- We are using one of our older TF Dataset generators to pull in
corresponding labels and true images (no need for the other
information).
Training Process
You can train your model using a conventional model.fit() process with
early stopping (no meta models!).
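A minimal sketch of such a run, assuming a TF Dataset ds_train from
create_diffusion_dataset(); the epoch count, patience, and
steps-per-epoch below are placeholders informed by the Hints section.

    from tensorflow import keras

    # Early stopping on the training loss (no validation set is assumed)
    early = keras.callbacks.EarlyStopping(monitor='loss',
                                          patience=50,  # assumption
                                          restore_best_weights=True)

    history = model.fit(ds_train,
                        epochs=2000,          # upper bound; early stopping decides
                        steps_per_epoch=16,   # hints suggest 16-32
                        callbacks=[early])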
Experiments
We are only working to get a single model running for this assignment.
Once you have completed your main experiment, produce
the following:
- Figure 0: Model architectures from plot_model() (see the sketch after this list)
- Figure 1: Show training set MSE as a function of epoch.
- Figure 2: Show 2-3 examples of the model producing a new image
from noise. Show each time step (or maybe every other time step).
- Figure 3: Show a gallery of final images.
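For Figure 0, plot_model() from Keras can be used directly; the file
name below is illustrative.

    from tensorflow.keras.utils import plot_model

    # Render the model architecture to an image file for Figure 0
    plot_model(model, to_file='figure0_model.png', show_shapes=True)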
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
- Figures 0-3
Grading
- 30 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 15 pts: Convincing generated images (interesting textures and
no strange colors)
- 10 pts: Bonus if you can convincingly generate roads, buildings
or water features
Hints
- I am not using a GPU except for some very small scale
experiments. I have yet to get a GPU working for this on
Schooner.
- 20 time steps is too small to give solid results (but there is
hope). I am currently experimenting with 50.
- I am liking batch sizes in the range of 8 to 16 (the latter
gives more stable results), with steps-per-epoch in the range
of 16 to 32.
- The provided code that produces the beta, alpha, and sigma
  schedules is free for you to use (a sketch of the conventions
  appears after these hints). Beta and sigma increase linearly
  with increasing time (alpha is computed automatically from beta).
  Lots of people have different ideas
  as to what these schedules should look like. For the code that I
  provided, a max beta of 0.2 is working well.
- Batch Normalization is critical.
- I am using a learning rate of 10^-4 (so, go slow)
- WandB is useful again since we are performing individual
training runs.
- I suggest starting small to get things working (small images
and small number of timesteps) and then moving
on to larger scales.
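For reference, here is a sketch of schedules consistent with the hint
above, using the standard DDPM convention alpha = 1 - beta. The
provided course code is authoritative; T and the schedule endpoints
here are assumptions.

    import numpy as np

    T = 50           # number of time steps (assumption)
    beta_max = 0.2   # max beta that is working well, per the hint above

    beta = np.linspace(beta_max / T, beta_max, T)  # linear increase with time
    alpha = 1.0 - beta                             # computed automatically from beta
    alpha_bar = np.cumprod(alpha)                  # cumulative product over time
    sigma = np.linspace(0.0, 0.1, T)               # also linear; endpoint assumed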
Frequently Asked Questions
andrewhfagg -- gmail.com
Last modified: Thu Apr 25 10:58:43 2024