Simply put, diffusion models work (and this is extremely hand-wavy) somewhat like the "reverse" of an image recognizer.
First the model is taught the concepts of images by training on images (building neural circuits) so it can detect image features. Then you do the reverse: iterate on noise, cleaning it up step by step, to generate features.
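If it helps, here's a toy sketch of the "iterate on noise" part. It's just as hand-wavy as the explanation: `denoise_step`, `generate`, and the `strength` parameter are made-up names, and the "denoiser" cheats by already knowing the target pattern. In a real diffusion model that role is played by a trained neural network that has only learned image features, not any single target.

```python
import numpy as np

def denoise_step(x, target, strength=0.1):
    # Hypothetical stand-in for a trained network: remove a bit of "noise"
    # by nudging the sample toward a pattern the model has "learned".
    return x + strength * (target - x)

def generate(target, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    for _ in range(steps):
        x = denoise_step(x, target)        # each pass removes a little noise
    return x

pattern = np.array([0.0, 1.0, 0.0, 1.0])   # a tiny 4-pixel "image"
sample = generate(pattern)
print(np.round(sample, 2))
```

Each pass shrinks the remaining gap to the pattern by 10%, so after 50 passes the noise is almost entirely gone. Real samplers use a noise schedule and a learned noise prediction instead of this fixed nudge.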
Holy fucking shit
How did you find this archeological piece?