CS 5043: HW7: Conditional Diffusion Networks
DRAFT
Assignment notes:
- Deadline: Thursday, May 1st @11:59pm.
- Hand-in procedure: submit a zip file to Gradescope
- This work is to be done on your own. However, you may share
solution-specific code snippets openly on Slack (and only
there), but not full solutions. In addition, downloading
solution-specific code is not allowed.
The Problem
We will generate synthetic satellite
images that are conditioned on semantic image labels. Here, the
forward process is a Markov chain that slowly introduces noise into a
real image (this is how we will generate our training data), and the
reverse process uses a learned model that starts from a completely
random image and slowly removes noise until a meaningful satellite
image is revealed.
Your model will take as input:
- A noised image
- A time stamp (in the denoising sequencing)
- The semantic label image
And produce as output an estimate of the noise that must be removed
from the input noised image in order to make it less noisy.
Once your model is trained, we will use it to generate synthetic
images by the following procedure (a code sketch follows this list):
- Randomly producing a noisy image (each pixel channel is drawn
from a standard Normal distribution)
- Over T time steps:
- Use the learned model to estimate the noise that must be
removed
- Subtract this estimate from the current image
- Inject a small amount of new noise to help the search
process
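Here is a minimal sketch of this sampling loop, written as one common
DDPM-style update. It assumes a trained model, schedules beta, alpha,
and sigma as returned by compute_beta_alpha() (with alpha being the
cumulative \bar{alpha}), a one-hot label image labels of shape
(1,R,C,7), and patch dimensions R and C; the exact scaling constants
here are one standard instantiation, not necessarily the ones you
should use:

    import numpy as np

    T = len(beta)
    x = np.random.normal(size=(1, R, C, 3)).astype('float32')   # pure noise

    for t in range(T - 1, -1, -1):
        # Estimate the noise present in the current image
        eps = model.predict({'image_input': x,
                             'label_input': labels,
                             'time_input': np.array([[t]])},
                            verbose=0)

        # Remove the (scaled) noise estimate: standard DDPM update
        x = (x - beta[t] / np.sqrt(1.0 - alpha[t]) * eps) / np.sqrt(1.0 - beta[t])

        # Inject a small amount of fresh noise to help the search process
        if t > 0:
            x = x + sigma[t] * np.random.normal(size=x.shape).astype('float32')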
The Generator
We have provided a new version of the chesapeake_loader4.py file with a
new TF Dataset creator called create_diffusion_dataset(). An
example of its use is given in the hw7_examples.ipynb file.
Following Algorithm 18.1 in the book, this
generator produces batches of examples. For each example, the
generator randomly selects a time step in the noising process to
sample from and produces a single 2-tuple of the form (a quick
structure check follows this list):
- A dictionary of input tensors. The names of the tensors are:
- label_input (shape: examples,R,C,7): A 1-hot encoded
representation of the semantic labels
- image_input (shape: examples,R,C,3): a noised
version of the real image that corresponds to the
labels. Note: the individual channels are in the range
+/-1 (they have been scaled from 0...1 for you already)
- time_input (shape: examples,1): An integer time
index that is the step in the noising process that this
sample was taken from.
- A tensor of desired noise output (shape: examples,R,C,3). Each
channel is drawn from a standard Normal distribution, so
individual values are unbounded, though most will fall within a
few standard deviations of zero.
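Once you have created the Dataset (see the create_diffusion_dataset()
call below), you can confirm this structure with something like:

    # Pull one batch from the training Dataset and verify the shapes
    inputs, noise = next(iter(ds_train))
    print(inputs['label_input'].shape)   # (batch, R, C, 7)
    print(inputs['image_input'].shape)   # (batch, R, C, 3)
    print(inputs['time_input'].shape)    # (batch, 1)
    print(noise.shape)                   # (batch, R, C, 3)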
The Model
You should start from the semantic labeling U-net that you have
already developed in HW 4. The differences are:
- You now have 3 inputs. NOTE: the names of these Inputs MUST
match the dictionary key names used by the generator;
otherwise, TF will not know how to connect the
generator values to the inputs of your model.
- You are predicting unbounded noise values. Think about what
this means for your output non-linearity.
- You can use MSE or MAE as the loss function.
- You should translate the time index into a form that can be
appended to the semantic labels.
I provide the PositionEncoder class, and give examples of using
it and of copying a value across pixels in the examples notebook.
- I append the input image, semantic label image, and the time
encoding "image" together as input to the U-net (see the sketch
below). In addition, I introduce the semantic labels and time
encodings at each step of the "up" side of the U (at the same
time that the skip connections are appended), scaling them down
to the right sizes using AveragePooling2D.
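The sketch below shows one way to wire the named inputs and to copy
the time encoding across pixels. The PositionEncoder constructor
arguments are omitted (see the examples notebook for its actual
interface), and UpSampling2D stands in for the ExpandDims/Tile
approach used in the notebook; treat this as an outline under those
assumptions, not the required solution:

    from tensorflow import keras
    from diffusion_tools import PositionEncoder

    R = C = 64   # example patch size

    # Input names MUST match the generator's dictionary keys
    image_input = keras.Input(shape=(R, C, 3), name='image_input')
    label_input = keras.Input(shape=(R, C, 7), name='label_input')
    time_input  = keras.Input(shape=(1,), name='time_input')

    # Encode the scalar time index (constructor arguments omitted;
    # see the examples notebook for the class's real interface)
    t_enc = PositionEncoder()(time_input)                    # -> (batch, n_enc)

    # Copy the encoding across all pixels so it can be concatenated
    t_img = keras.layers.Reshape((1, 1, -1))(t_enc)
    t_img = keras.layers.UpSampling2D(size=(R, C))(t_img)    # -> (batch, R, C, n_enc)

    # Stack image, labels, and time encoding as the U-net input
    stack = keras.layers.Concatenate()([image_input, label_input, t_img])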
Provided Tools
diffusion_tools.py
- compute_beta_alpha(): Generates the schedules (an illustrative
computation follows this list):
- beta: the noise injected at each step t; a linear function of t
- alpha (the paper's \bar{alpha}): the cumulative noise injected
from time 0 through time t
- sigma: the noise level added during each inference step
- compute_beta_alpha2(): Same, but using a sine shape for beta
- convert_image(): Converts each channel of an image that is
zero-centered with a standard deviation of about 1 into the range
0...1
- PositionEncoder(): Position encoder layer from previous
homework
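To make the schedule semantics concrete, here is an illustrative,
stand-alone computation of a linear schedule. compute_beta_alpha()
plays this role for you; the function name and the sigma choice below
are mine, not the provided code's:

    import numpy as np

    def linear_schedule(T, beta_max=0.2, beta_min=1e-4):
        beta = np.linspace(beta_min, beta_max, T)   # per-step injected noise
        alpha = np.cumprod(1.0 - beta)              # cumulative (the paper's \bar{alpha})
        sigma = np.sqrt(beta)                       # one common inference-noise choice
        return beta, alpha, sigma

    # With alpha in hand, a noised training example at step t is
    #   x_t = sqrt(alpha[t]) * x_0 + sqrt(1 - alpha[t]) * eps,   eps ~ N(0, I)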
chesapeake_loader4.py
Key function:
ds_train, ds_valid = create_diffusion_dataset(
base_dir=args.dataset,
patch_size=args.image_size[0],
fold=args.rotation,
filt=args.train_filt,
cache_dir=args.cache,
repeat=args.repeat,
shuffle=args.shuffle,
batch_size=args.batch,
prefetch=args.prefetch,
num_parallel_calls=args.num_parallel_calls,
alpha=alpha,
time_sampling_exponent=args.time_sampling_exponent)
Training Process
You can train your model using a conventional model.fit() process with
early stopping (no meta models!).
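A minimal sketch of such a run (the callback and fit parameters are
illustrative, not prescribed):

    from tensorflow import keras

    early = keras.callbacks.EarlyStopping(monitor='val_loss',
                                          patience=50,
                                          restore_best_weights=True)

    history = model.fit(ds_train,
                        epochs=args.epochs,
                        steps_per_epoch=args.steps_per_epoch,
                        validation_data=ds_valid,
                        callbacks=[early])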
Inference Process
I suggest doing this in a notebook. To reload a saved model, you will
need:
model = keras.models.load_model(fname + '_model.h5',
                                custom_objects={'PositionEncoder': PositionEncoder,
                                                'ExpandDims': keras.src.ops.numpy.ExpandDims,
                                                'Tile': keras.src.ops.numpy.Tile,
                                                'mse': 'mse'})
Experiments
We are only working to get a single model running for this assignment.
Once you have completed your main experiment, produce
the following:
- Figure 0: Model architecture from plot_model()
- Figures 1a,b: Show training and validation set MSE (or MAE) as
a function of epoch.
- Figure 2: Show 2-3 examples of the model producing a new image
from noise. Show each time step (or maybe every other time step).
- Figure 3: Show a gallery of final images (a plotting sketch
follows this list).
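For Figures 2 and 3, remember that the model works in zero-centered
channel space; convert_image() maps images back to displayable 0...1
values. A small gallery sketch, where finals is an assumed list of
generated (R,C,3) images and convert_image() is assumed to take one
image and return its 0...1 version:

    import matplotlib.pyplot as plt
    from diffusion_tools import convert_image

    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    for ax, img in zip(axes.ravel(), finals):
        ax.imshow(convert_image(img))   # map to 0...1 for display
    for ax in axes.ravel():
        ax.axis('off')
    fig.tight_layout()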
What to Hand In
Turn in a single zip file that contains:
- All of your python code (.py) and any notebook file (.ipynb)
- Figures 0-3
Grading
- 30 pts: Clean, general code for model building (including
in-code documentation)
- 10 pts: Figure 0
- 15 pts: Figure 1
- 15 pts: Figure 2
- 15 pts: Figure 3
- 15 pts: Convincing generated images (interesting textures and
no strange colors)
- 5 pts: Bonus if you can convincingly generate roads, buildings
or water features
Hints
- I am not using a GPU except for some very small scale
experiments. I have yet to get a GPU working for this on
Schooner.
- 20 time steps is too small to give solid results (but there is
hope). I am currently experimenting with 50. The original
paper used 1000. Remember that you need to adjust the beta
sequence when you change the number of steps.
- I am liking batch sizes in the range of 8 to 16 (the latter
gives more stable results), with steps-per-epoch in the range
of 16 to 32.
- The provided code that produces the beta, alpha and sigma
schedules is free to use. Beta and sigma increase linearly
with increasing time (alpha is computed automatically from beta).
Lots of people have different ideas
as to what these should look like. For the code that I
provided, a max beta of 0.2 is working well.
- Batch Normalization is critical.
- I am using an lrate of 10^-4 (so, go slow)
- WandB is useful again since we are performing individual
training runs.
- I suggest starting small to get things working (small patch
sizes and small number of timesteps) and then moving
on to larger scales.
Frequently Asked Questions
andrewhfagg -- gmail.com
Last modified: Tue Apr 29 13:27:01 2025