We empirically demonstrate that increased complexity in medical imaging data degrades GAN performance at generating realistic medical images, and we show how this insight can guide efficient GAN training to supply machine learning models with high-fidelity synthetic training data for clinical applications.
The proliferation of machine learning models in diverse clinical applications has created a growing need for high-quality medical data, which is often scarce due to patient privacy concerns and the costs of wet lab work and data annotation. Generative Adversarial Networks (GANs) address this problem by synthesizing medical images. However, the optimal training set sizes required to efficiently train a GAN to produce high-fidelity images are unknown. Existing methods primarily emphasize architecture-centric modifications to achieve desired outcomes, paying less attention to data-centric approaches.
We propose a data-centric optimization method for efficient GAN training in medical image synthesis using two state-of-the-art GANs, StyleGAN 3 and SPADE-GAN, based on the relationship between a dataset's image complexity distribution and the fidelity of images synthesized at varying training set sizes.
Objectively, image complexity can be defined as the variety of features and details within an image. Entropy is a traditional, heuristic-based method for quantifying the complexity of images in small-scale datasets: it measures the uncertainty or "surprise" in data, specifically the variation in the distribution of pixel intensities of a grayscale image.
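As a minimal illustration (our own sketch, not code from the paper), Shannon entropy over a grayscale image's pixel-intensity histogram can be computed with NumPy; the `bins` parameter and function name are our choices.

```python
import numpy as np

def shannon_entropy(image: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of a grayscale image's pixel-intensity histogram."""
    hist, _ = np.histogram(image.ravel(), bins=bins)
    p = hist / hist.sum()   # normalize counts to a probability distribution
    p = p[p > 0]            # drop empty bins so log2 is defined
    return float(-np.sum(p * np.log2(p)))
```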
Our approach utilizes Larkin's delentropy, which incorporates a new density function known as the deldensity. As a joint probability function, it is formulated as
\[ p(i,j) = \frac{1}{4WH} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \delta_{i,\,d_{x}(w,h)}\, \delta_{j,\,d_{y}(w,h)}, \]
where \( d_{x} \) and \( d_{y} \) denote the image gradients in the \( x \) and \( y \) directions (computed with derivative kernels), \( \delta \) is the Kronecker delta describing the binning operation required to generate a histogram, and \( H \) and \( W \) are the image's height and width. From this density, we can then calculate delentropy as
\[ DE = -\frac{1}{2} \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} p(i,j) \log_{b} p(i,j), \]
where \( I \) and \( J \) represent the number of bins (discrete cells) in the 2D distribution, and the factor of \( \frac{1}{2} \) is derived from Papoulis' generalized sampling expansion.
Delentropy analyzes the relationship between the local and global features of an image, specifically accounting for the image's gradient vector field and pixel co-occurrence, and thus encapsulates both its compositional and spatial information as a whole.
To interpret this measure: a high delentropy suggests an image contains more sophisticated detail and is therefore more complex, while a low delentropy indicates a simpler structure and a less detailed image.
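A minimal sketch of delentropy along these lines, assuming NumPy: gradients are estimated with `np.gradient` rather than a specific derivative kernel, and the histogram is normalized to a proper probability distribution in place of the explicit \( \frac{1}{4WH} \) factor.

```python
import numpy as np

def delentropy(image: np.ndarray, bins: int = 256) -> float:
    """Larkin-style delentropy of a grayscale image (illustrative sketch)."""
    # Estimate the gradient vector field; np.gradient uses central differences.
    dy, dx = np.gradient(image.astype(np.float64))

    # 2D histogram of (dx, dy) pairs: this binning plays the role of the
    # Kronecker deltas in the deldensity formula. Normalizing by the total
    # count yields a joint probability distribution.
    hist, _, _ = np.histogram2d(dx.ravel(), dy.ravel(), bins=bins)
    p = hist / hist.sum()

    # Delentropy with a base-2 logarithm; empty bins contribute nothing.
    p = p[p > 0]
    return float(-0.5 * np.sum(p * np.log2(p)))
```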
The goal of our pipeline is to identify the role of training set size in the fidelity of images generated by StyleGAN 3 and SPADE-GAN, and to compare those results against the delentropy (image complexity) distribution of each dataset.
We first resized all training images to a consistent 512x512 resolution; training parameters were then chosen based on the size of the preprocessed images, as documented in the GANs' official implementations. Because SPADE-GAN relies on segmentation masks, we used pre-existing annotations where available or generated masks with TorchXRayVision.
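For concreteness, the resizing step could look like the following sketch, assuming Pillow; the directory layout, `preprocess` name, and PNG glob are illustrative, not taken from the paper's code.

```python
from pathlib import Path
from PIL import Image

def preprocess(src_dir: str, dst_dir: str, size: int = 512) -> None:
    """Resize every PNG in src_dir to size x size and write it to dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.png"):
        img = Image.open(path).convert("RGB")          # consistent channel layout
        img = img.resize((size, size), Image.LANCZOS)  # 512x512 training resolution
        img.save(out / path.name)
```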
Following our data-centric approach, StyleGAN 3 and SPADE-GAN were run with their official, publicly available implementations, default hyperparameters, and no modifications to either network's architecture. For each GAN training run, the training set size was set to 500 images (baseline), 1000 images (middle ground), or 2500 images (comprehensive), randomly sampled for each run. Each trained GAN was then used to generate synthetic medical images, whose fidelity was evaluated for each training set size.
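Subset construction for each run could look like this sketch (the fixed seed, file pattern, and helper name are our assumptions, not the paper's code):

```python
import random
from pathlib import Path

def sample_training_set(image_dir: str, n: int, seed: int = 0) -> list[Path]:
    """Randomly sample n images from image_dir to form a training subset."""
    paths = sorted(Path(image_dir).glob("*.png"))
    return random.Random(seed).sample(paths, n)

# The three training set sizes used in our runs.
subsets = {n: sample_training_set("data/chest_xray", n, seed=n)
           for n in (500, 1000, 2500)}
```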
The Fréchet Inception Distance (FID) was used to evaluate the fidelity of the synthetically generated images (i.e., GAN performance) for each training set size. A lower FID score signifies that a GAN is more proficient at generating synthetic data close to its target distribution. From this, we obtained fidelity curves for each dataset, shown below, describing how FID scores trend with increasing training set size.
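As an illustration of the metric, FID can be computed with the torchmetrics implementation; this sketch uses random placeholder tensors where the real and synthetic image batches would go, and is not the paper's evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Both batches must be uint8 tensors of shape (N, 3, H, W).
# Placeholders only: substitute real dataset images and GAN outputs here.
real_images = torch.randint(0, 256, (64, 3, 512, 512), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 512, 512), dtype=torch.uint8)

fid.update(real_images, real=True)   # features of the target distribution
fid.update(fake_images, real=False)  # features of the synthetic images
print(fid.compute())                 # lower is better
```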
Predictably, FID scores consistently decreased with increasing dataset size.
Training StyleGAN 3 on 2500 images reduced FID scores by 48% on average compared to training on 500 images; SPADE-GAN showed an analogous average FID reduction of 31%. SPADE-GAN outperformed StyleGAN 3 across all datasets and training set sizes, with FID scores averaging 33% lower.
Reason: SPADE-GAN's architecture leverages segmentation masks for extra structural information, whereas StyleGAN 3 relies on raw images alone.
The graphs below show the per-image distribution of complexities (delentropies) for each medical image dataset.
The Chest X-ray dataset, with the most homogeneous distribution, yielded the lowest FID scores, indicating that GANs had easier training runs on it. The Polyps Set, with the widest distribution and the most complex images, correlated with the highest FID scores, suggesting that GANs had more challenging training runs on it.
Insight: this reveals an inverse relationship, where GAN performance decreases as the spread of image complexities within a dataset increases.
SPADE-GAN outperformed StyleGAN 3, with lower FID scores and smoother, non-overlapping FID curves. Its performance plateaued after 1000 images, with little perceptual improvement in the generated images, as shown in the comparison of synthetic images. StyleGAN 3's performance did not plateau between 500 and 2500 images; its FID scores showed increasingly negative slopes, indicating improved feature capture beyond 1000 images.
Insight: FID curves serve as benchmarks for datasets with comparable delentropy distributions and SOTA GAN models.
Due to limited resources, training set sizes were constrained to 500, 1000, and 2500 images, leading to coarse-grained results. In addition, FID was used as the sole evaluation metric, which may not reflect how well synthetic images perform on a downstream task. Finally, a study of other generative models, such as VAEs, autoregressive models, and diffusion models, may provide further insights.
1. We empirically demonstrate that higher image complexity leads to poorer image fidelity and weaker GAN performance.
2. We demonstrate that, given a dataset with a similar delentropy distribution, healthcare professionals can reference our benchmarks via the closest image fidelity curve to estimate the training set size needed to produce the desired image fidelity (see the sketch after this list).
3. We show the potential for studies using a data-centric approach with image complexity as a guide for model training—more thorough experiments are required before a truly comprehensive representation can be reached.
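As a hypothetical illustration of contribution 2 (referenced above), a practitioner could match a new dataset to the benchmark whose delentropy distribution is closest, e.g. by Wasserstein distance; this matching procedure is our suggestion, not part of the paper.

```python
from scipy.stats import wasserstein_distance

def closest_benchmark(new_delentropies, benchmarks):
    """Return the name of the benchmark dataset whose per-image delentropy
    distribution is closest to the new dataset's.

    benchmarks: dict mapping dataset name -> list of per-image delentropies.
    """
    return min(
        benchmarks,
        key=lambda name: wasserstein_distance(new_delentropies, benchmarks[name]),
    )
```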
We are grateful to Michael Lam and Kevin Zhu for their excellent mentorship, constructive feedback, and unwavering support throughout our research.
@InProceedings{Cagas_2024_ACCV,
author = {Cagas, William and Ko, Chan and Hsiao, Blake and Grandhi, Shryuk and Bhattacharya, Rishi and Zhu, Kevin and Lam, Michael},
title = {Medical Imaging Complexity and its Effects on GAN Performance},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
month = {December},
year = {2024},
pages = {207--217}
}