CS 5043: HW7: Conditional Diffusion Networks
DRAFT
Assignment notes:
- Deadline: Thursday, May 1st @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets openly on Slack (and only
there), but not full solutions. In addition, downloading
solution-specific code is not allowed.
The Problem
We will generate synthetic satellite
images that are conditioned on semantic image labels. Here, the
forward process is a Markov chain that slowly introduces noise into a
real image (this is how we will generate our training data), and the
reverse process uses a learned model that starts from a completely
random image and slowly removes noise until a meaningful satellite
image is revealed.
Your model will take as input:
- A noised image
- A time stamp (in the denoising sequencing)
- The semantic label image
And produce as output an estimate of the noise that must be removed
from the input noised image in order to make it less noisy.
Once your model is trained, we will use it to generate synthetic
images by the following procedure (a code sketch follows this list):
- Randomly producing a noisy image (each pixel channel is drawn
from a standard Normal distribution)
- Over T time steps:
- Use the learned model to estimate the noise that must be
removed
- Subtract this estimate from the current image
- Inject a small amount of new noise to help the search
process
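Here is a minimal sketch of this sampling loop, written as one common
DDPM-style update. It assumes a trained model, schedules beta, alpha,
and sigma as returned by compute_beta_alpha() (with alpha being the
cumulative \bar{alpha}), a one-hot label image labels of shape
(1,R,C,7), and patch dimensions R and C; the exact scaling constants
here are one standard instantiation, not necessarily the ones you
should use:

    import numpy as np

    T = len(beta)
    x = np.random.normal(size=(1, R, C, 3)).astype('float32')   # pure noise

    for t in range(T - 1, -1, -1):
        # Estimate the noise present in the current image
        eps = model.predict({'image_input': x,
                             'label_input': labels,
                             'time_input': np.array([[t]])},
                            verbose=0)

        # Remove the (scaled) noise estimate: standard DDPM update
        x = (x - beta[t] / np.sqrt(1.0 - alpha[t]) * eps) / np.sqrt(1.0 - beta[t])

        # Inject a small amount of fresh noise to help the search process
        if t > 0:
            x = x + sigma[t] * np.random.normal(size=x.shape).astype('float32')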
The Generator
We have provided a new version of the chesapeake_loader4.py file with a
new TF Dataset creator called create_diffusion_dataset(). An
example of its use is given in the hw7_examples.ipynb file.
Following Algorithm 18.1 in the book, this
generator produces batches of examples. For each example, the
generator randomly selects a time step in the noising process to
sample from and produces a single 2-tuple of the form (a quick
structure check follows this list):
- A dictionary of input tensors. The names of the tensors are:
- label_input (shape: examples,R,C,7): A 1-hot encoded
representation of the semantic labels
- image_input (shape: examples,R,C,3): a noised
version of the real image that corresponds to the
labels. Note: the individual channels are in the range
+/-1 (they have been scaled from 0...1 for you already)
- time_input (shape: examples,1): An integer time
index that is the step in the noising process that this
sample was taken from.
- A tensor of desired noise output (shape: examples,R,C,3). Each
channel is drawn from a standard Normal distribution, so
individual values are unbounded, though most will fall within a
few standard deviations of zero.
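Once you have created the Dataset (see the create_diffusion_dataset()
call below), you can confirm this structure with something like:

    # Pull one batch from the training Dataset and verify the shapes
    inputs, noise = next(iter(ds_train))
    print(inputs['label_input'].shape)   # (batch, R, C, 7)
    print(inputs['image_input'].shape)   # (batch, R, C, 3)
    print(inputs['time_input'].shape)    # (batch, 1)
    print(noise.shape)                   # (batch, R, C, 3)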
The Model
You should start from the semantic labeling U-net that you have
already developed in HW 4. The differences are:
- You now have 3 inputs. NOTE: the names of these Inputs MUST
match the dictionary key names used by the generator;
otherwise, TF will not know how to connect the
generator values to the inputs of your model.
- You are predicting unbounded noise values. Think about what
this means for your output non-linearity.
- You can use MSE or MAE as the loss function.
- You should translate the time index into a form that can be
appended to the semantic labels.
I provide the PositionEncoder class, and give examples of using
it and of copying a value across pixels in the examples notebook.
- I append the input image, semantic label image, and the time
encoding "image" together as input to the U-net (see the sketch
below). In addition, I introduce the semantic labels and time
encodings at each step of the "up" side of the U (at the same
time that the skip connections are appended), scaling them down
to the right sizes using AveragePooling2D.
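The sketch below shows one way to wire the named inputs and to copy
the time encoding across pixels. The PositionEncoder constructor
arguments are omitted (see the examples notebook for its actual
interface), and UpSampling2D stands in for the ExpandDims/Tile
approach used in the notebook; treat this as an outline under those
assumptions, not the required solution:

    from tensorflow import keras
    from diffusion_tools import PositionEncoder

    R = C = 64   # example patch size

    # Input names MUST match the generator's dictionary keys
    image_input = keras.Input(shape=(R, C, 3), name='image_input')
    label_input = keras.Input(shape=(R, C, 7), name='label_input')
    time_input  = keras.Input(shape=(1,), name='time_input')

    # Encode the scalar time index (constructor arguments omitted;
    # see the examples notebook for the class's real interface)
    t_enc = PositionEncoder()(time_input)                    # -> (batch, n_enc)

    # Copy the encoding across all pixels so it can be concatenated
    t_img = keras.layers.Reshape((1, 1, -1))(t_enc)
    t_img = keras.layers.UpSampling2D(size=(R, C))(t_img)    # -> (batch, R, C, n_enc)

    # Stack image, labels, and time encoding as the U-net input
    stack = keras.layers.Concatenate()([image_input, label_input, t_img])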
Provided Tools
diffusion_tools.py
- compute_beta_alpha(): Generates the schedules (an illustrative
computation follows this list):
- beta: the noise injected at each step t; a linear function of t
- alpha (the paper's \bar{alpha}): the cumulative noise injected
from time 0 through time t
- sigma: the noise level added during each inference step
- compute_beta_alpha2(): Same, but using a sine shape for beta
- convert_image(): Converts each channel of an image that is
zero-centered with a standard deviation of about 1 into the range
0...1
- PositionEncoder(): Position encoder layer from previous
homework
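To make the schedule semantics concrete, here is an illustrative,
stand-alone computation of a linear schedule. compute_beta_alpha()
plays this role for you; the function name and the sigma choice below
are mine, not the provided code's:

    import numpy as np

    def linear_schedule(T, beta_max=0.2, beta_min=1e-4):
        beta = np.linspace(beta_min, beta_max, T)   # per-step injected noise
        alpha = np.cumprod(1.0 - beta)              # cumulative (the paper's \bar{alpha})
        sigma = np.sqrt(beta)                       # one common inference-noise choice
        return beta, alpha, sigma

    # With alpha in hand, a noised training example at step t is
    #   x_t = sqrt(alpha[t]) * x_0 + sqrt(1 - alpha[t]) * eps,   eps ~ N(0, I)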
chesapeake_loader4.py
Key function:
ds_train, ds_valid = create_diffusion_dataset(
base_dir=args.dataset,
patch_size=args.image_size[0],
fold=args.rotation,
filt=args.train_filt,
cache_dir=args.cache,
repeat=args.repeat,
shuffle=args.shuffle,
batch_size=args.batch,
prefetch=args.prefetch,
num_parallel_calls=args.num_parallel_calls,
alpha=alpha,
time_sampling_exponent=args.time_sampling_exponent)
Training Process
You can train your model using a conventional model.fit() process with
early stopping (no meta models!).
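A minimal sketch of such a run (the callback and fit parameters are
illustrative, not prescribed):

    from tensorflow import keras

    early = keras.callbacks.EarlyStopping(monitor='val_loss',
                                          patience=50,
                                          restore_best_weights=True)

    history = model.fit(ds_train,
                        epochs=args.epochs,
                        steps_per_epoch=args.steps_per_epoch,
                        validation_data=ds_valid,
                        callbacks=[early])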
Inference Process
I suggest doing this in a notebook. To reload a saved model, you will
need:
model = keras.models.load_model(fname + '_model.h5',
                                custom_objects={'PositionEncoder': PositionEncoder,
                                                'ExpandDims': keras.src.ops.numpy.ExpandDims,
                                                'Tile': keras.src.ops.numpy.Tile,
                                                'mse': 'mse'})
Experiments
We are only working to get a single model running for this assignment.
Once you have completed your main experiment, produce
the following:
- Figure 0: Model architecture from plot_model()
- Figures 1a,b: Show training and validation set MSE (or MAE) as
a function of epoch.
- Figure 2: Show 2-3 examples of the model producing a new image
from noise. Show each time step (or maybe every other time step).
- Figure 3: Show a gallery of final images (a plotting sketch
follows this list).
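For Figures 2 and 3, remember that the model works in zero-centered
channel space; convert_image() maps images back to displayable 0...1
values. A small gallery sketch, where finals is an assumed list of
generated (R,C,3) images and convert_image() is assumed to take one
image and return its 0...1 version:

    import matplotlib.pyplot as plt
    from diffusion_tools import convert_image

    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    for ax, img in zip(axes.ravel(), finals):
        ax.imshow(convert_image(img))   # map to 0...1 for display
    for ax in axes.ravel():
        ax.axis('off')
    fig.tight_layout()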
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
- Figures 0-3
Grading
- 30 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 15 pts: Convincing generated images (interesting textures and
no strange colors)
- 5 pts: Bonus if you can convincingly generate roads, buildings
or water features
Hints
- I am not using a GPU except for some very small scale
experiments. I have yet to get a GPU working for this on
Schooner.
- 20 time steps is too small to give solid results (but there is
hope). I am currently experimenting with 50. The original
paper used 1000. Remember that you need to adjust the beta
sequence when you change the number of steps.
- I am liking batch sizes in the range of 8 to 16 (the latter
gives more stable results), with steps-per-epoch in the range
of 16 to 32.
- The provided code that produces the beta, alpha and sigma
schedules is free to use. Beta and sigma increase linearly
with increasing time (alpha is computed automatically from beta).
Lots of people have different ideas
as to what these should look like. For the code that I
provided, a max beta of 0.2 is working well.
- Batch Normalization is critical.
- I am using an lrate of 10^-4 (so, go slow)
- WandB is useful again since we are performing individual
training runs.
- I suggest starting small to get things working (small patch
sizes and small number of timesteps) and then moving
on to larger scales.
Frequently Asked Questions
andrewhfagg -- gmail.com
Last modified: Tue Apr 29 13:27:01 2025