Text-to-image generative models have recently revolutionized artificial intelligence (AI) and the way creative image synthesis is done. They use powerful language models to understand input text prompts, breaking the text into units called tokens and encoding them into embeddings that capture the key information contained in the given text.
Large vision-language models such as CLIP use these embeddings with a contrastive learning objective for multimodal retrieval tasks, which involve finding closely matching pairs of texts and images. CLIP exploits large datasets of image-text pairs to learn the relationships between images and their captions. Well-established diffusion models, such as Stable Diffusion, DALL-E, or Midjourney, use CLIP for semantic awareness in the diffusion process, which gradually adds noise to an image and then learns to remove it in order to recover a clean, accurate result.
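To make the retrieval idea concrete, here is a minimal sketch of CLIP-style matching: the text prompt and each candidate image are mapped to embedding vectors, and the image whose embedding is most similar (by cosine similarity) to the text embedding is the best match. The embeddings and the helper names below are illustrative stand-ins; in the real system they would come from CLIP's trained text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_best_image(text_embedding, image_embeddings):
    """Return the index of the image embedding closest to the text
    embedding, as a CLIP-style retrieval step would."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy example: pretend these came from CLIP's text and image encoders.
text = [0.9, 0.1, 0.0]
images = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.1], [0.0, 0.0, 1.0]]
best = retrieve_best_image(text, images)  # the second image aligns best
```

CLIP's contrastive objective pushes matching text-image pairs toward high similarity and mismatched pairs toward low similarity, which is what makes this simple nearest-embedding lookup work at scale.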
From these complex models, simpler but still robust solutions can be derived through Score Distillation Sampling (SDS). SDS optimizes an image (or a smaller generative model) using the noise predictions, or scores, produced by a larger pre-trained diffusion model, which serves as a guide for the optimization process.
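The core of SDS can be sketched in a few lines: the optimized image is perturbed with noise, the pre-trained diffusion model predicts that noise given the text prompt, and the difference between predicted and injected noise is used directly as a gradient on the image (the denoiser's Jacobian is dropped, as in the SDS formulation). In the sketch below, `noise_pred` stands in for the output of a pre-trained text-conditioned denoiser, which is assumed rather than implemented here.

```python
def sds_gradient(noise_pred, noise, weight=1.0):
    """SDS gradient per pixel: grad = w(t) * (eps_theta - eps),
    where eps_theta is the diffusion model's noise prediction for the
    noised image and text prompt, and eps is the injected noise."""
    return [weight * (p - n) for p, n in zip(noise_pred, noise)]

def sds_step(image, noise, noise_pred, lr=0.1):
    """One gradient-descent update of the optimized image."""
    grad = sds_gradient(noise_pred, noise)
    return [x - lr * g for x, g in zip(image, grad)]

# Toy update: in practice, noise_pred comes from a model like Stable Diffusion.
image = [0.0, 0.0]
updated = sds_step(image, noise=[0.1, 0.2], noise_pred=[0.3, 0.4])
```

Repeating this step drives the image toward regions the diffusion model considers likely under the prompt, which is exactly what makes SDS a distillation of the larger model's knowledge.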
Although very powerful and effective at simplifying complex diffusion models, SDS suffers from synthetic artifacts. One of its main issues is mode collapse, a tendency to converge toward specific modes. This often produces blurry output that captures only the elements explicitly mentioned in the prompt, as in Figure 2.
To address this, a new score distillation technique called Delta Denoising Score (DDS) has been proposed. The name comes from the way the distillation gradient is computed. Unlike SDS, which queries the generative model with a single image-text pair, DDS adds a reference query, in which the text matches the image content.
The resulting gradient is the difference, or delta, between the scores of the two queries.
The basic form of DDS requires two image-text pairs: one is the reference, which does not change during optimization, and the other is the optimization target, which must match the target text prompt. DDS yields effective gradients that modify the edited regions of an image while leaving the other regions untouched.
In DDS, the source image and its text annotation help estimate the unwanted, noisy gradient directions produced by SDS. When finely or partially editing an image with a new text description, the reference estimate helps obtain a cleaner gradient direction for updating the image.
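The delta mechanism can be sketched directly: the diffusion model's noise prediction for the (edited image, target text) query and for the (source image, source text) reference query are subtracted, so the noisy component shared by both queries cancels out and only the edit direction remains. As before, the two `noise_pred` inputs are stand-ins for outputs of a pre-trained text-conditioned diffusion model.

```python
def dds_gradient(noise_pred_target, noise_pred_source):
    """Delta Denoising Score gradient: the delta between the score of the
    (edited image, target text) query and the (source image, source text)
    reference query. The shared noisy SDS component cancels out."""
    return [t - s for t, s in zip(noise_pred_target, noise_pred_source)]

def dds_step(image, noise_pred_target, noise_pred_source, lr=0.1):
    """One gradient-descent update of the edited image."""
    grad = dds_gradient(noise_pred_target, noise_pred_source)
    return [x - lr * g for x, g in zip(image, grad)]

# Where the two predictions agree (unedited regions), the gradient is zero,
# so those pixels are left untouched.
unchanged = dds_step([1.0], noise_pred_target=[0.4], noise_pred_source=[0.4])
```

This cancellation is what lets DDS edit only the regions implied by the new prompt, without a mask telling it where to look.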
Moreover, DDS can modify images by changing their textual descriptions, without requiring a visual mask to be computed or provided. In addition, it allows training an image-to-image model without paired training data, resulting in a zero-shot image translation method. According to the authors, this zero-shot training technique can be used for single- and multi-task image translation, and the source distribution can include both real and synthetic images.
The figure below compares the performance of DDS against recent state-of-the-art methods for image-to-image translation.
This was a summary of Delta Denoising Score, a new AI technique that provides accurate, clean, and detailed text-driven image editing and image-to-image translation. If you are interested, you can learn more about this technique in the links below.
Check out the Paper and Project page.
Daniel Lorenzi received his M.Sc. in Information and Communication Technology for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at Alpen-Adria-Universität (AAU) Klagenfurt, currently working in the Christian Doppler Laboratory ATHENA. His research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE assessment.