Stability AI has partnered with the DeepFloyd AI research lab to present a research version of its latest technology, called DeepFloyd IF. It is a text-to-image cascaded pixel diffusion model designed to generate high-quality images from text inputs. The model is available under a non-commercial, research-only license, so research laboratories can explore and experiment with advanced text-to-image generation methods. The release aligns with Stability AI's commitment to sharing innovative technologies with the broader research community. The company plans to eventually release a fully open source DeepFloyd IF model.
The newly released DeepFloyd IF model has several notable features. First, it uses the T5-XXL-1.1 language model as a text encoder to help it understand text prompts. It also uses cross-attention layers to better align the text prompt with the generated image. One notable feature is its ability to turn textual descriptions into images in which different objects appear in the specified spatial relationships, which was previously a difficult task for text-to-image models. Another is the high degree of photorealism in the generated images, reflected in a zero-shot FID score of 6.66 on the COCO dataset. The model can also generate images with non-square aspect ratios, such as portrait or landscape, in addition to the standard square format.
In addition to text-to-image generation, the DeepFloyd IF model offers zero-shot image-to-image translation. This is achieved by resizing the original image to 64 px, adding noise through forward diffusion, and then denoising the image through backward diffusion guided by a new text prompt. The style can be changed further in the super-resolution modules via a text prompt. This approach allows the style, patterns, and details of the output image to be modified while preserving the basic look of the source image, without any fine-tuning.
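The noising step of the image-to-image procedure described above can be sketched in a few lines. This is a minimal illustration, not DeepFloyd IF's code: it uses the standard DDPM forward formula, the `strength` parameter and the `alpha_bar_for_strength` mapping are hypothetical, and the prompt-guided denoising that would follow is left out because it requires the trained model.

```python
import math
import random


def forward_diffuse(pixels, alpha_bar, rng):
    """Noise pixel values with the standard DDPM forward step:
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


def alpha_bar_for_strength(strength):
    """Hypothetical mapping from an edit strength in [0, 1] to the
    fraction of the source signal that is kept."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return 1.0 - strength


rng = random.Random(0)

# Stand-in for a source image already resized to 64 x 64 (one channel).
source = [0.5] * (64 * 64)

# Strength 0 adds no noise, so the source image is returned unchanged.
untouched = forward_diffuse(source, alpha_bar_for_strength(0.0), rng)

# A higher strength keeps less of the source, leaving more for the
# prompt-guided denoiser to redraw.
noised = forward_diffuse(source, alpha_bar_for_strength(0.7), rng)
```

The amount of noise controls the trade-off the article describes: less noise preserves more of the source image's basic look, while more noise gives the new prompt more freedom to change style and detail.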
The DeepFloyd IF model generates high-quality images from text prompts in three stages. Before the first stage, the frozen T5-XXL language model converts the text prompt into a text representation. In the first stage, a base diffusion model turns that representation into a 64 x 64 image. In the second stage, a text-conditional super-resolution model upscales the image to 256 x 256. In the third stage, a second super-resolution model refines the image to a resolution of 1024 x 1024. The IF release includes several versions of the base model and of the super-resolution model, which differ in parameter count. Although the third-stage model is not yet available, alternative upscaling models such as the Stable Diffusion x4 Upscaler can be used in its place.
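The resolution chain of the cascade can be written down as a small data structure. The sketch below only models the resolutions named in the article; the `Stage` class, the stage labels, and the helper functions are hypothetical and do not correspond to any DeepFloyd IF API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str              # illustrative label, not an official model name
    in_px: Optional[int]   # None for the base stage, which starts from text
    out_px: int


CASCADE = [
    Stage("base (text -> image)", None, 64),
    Stage("super-resolution 1", 64, 256),
    Stage("super-resolution 2", 256, 1024),
]


def upscale_factors(stages):
    """Per-side upscaling factor of every stage that takes an image as input."""
    return [s.out_px // s.in_px for s in stages if s.in_px is not None]


def check_chain(stages):
    """True when each stage consumes the resolution the previous one produced."""
    return all(b.in_px == a.out_px for a, b in zip(stages, stages[1:]))


factors = upscale_factors(CASCADE)   # one entry per super-resolution stage
chain_ok = check_chain(CASCADE)
```

Each super-resolution stage multiplies the side length by four, so the final image has 256 times as many pixels as the base model's 64 x 64 output. Because the stages only need to agree on resolution, the last one can be swapped for another x4 upscaler, as the article notes.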
The DeepFloyd IF model was trained on a custom high-quality dataset called LAION-A, which contains 1 billion (image, text) pairs. The dataset is an aesthetic subset of the English part of the LAION-5B dataset, and it was filtered with custom filters to remove inappropriate content. The model is initially released under a research license, and the creators welcome feedback to improve its performance and scalability. The model can be applied in areas such as art, design, storytelling, virtual reality, and accessibility. The creators also pose a number of research questions about the technical, academic, and ethical aspects of the model. The model weights are available on DeepFloyd's Hugging Face space, and the model card and code are available on GitHub. A Gradio demo is open to everyone, and the creators invite people to join the public discussions.
Niharika is a Technical Consultant Intern at Marktechpost. She is a third year undergraduate student and is currently pursuing a Bachelor of Technology degree from Indian Institute of Technology (IIT), Kharagpur. She is a highly motivated person with a keen interest in machine learning, data science, and artificial intelligence and an avid reader of the latest developments in these areas.