Humans interact with the world through two principal channels: language and vision. Much of the recent excitement around combining them stems from the capabilities of Large Language Models (LLMs), which have taken the field by storm with dramatic performance gains. LLMs such as GPT-3, T5, and PaLM have learned to read, summarize, and generate text in ways that increasingly resemble human language use.
Artificial intelligence researchers have been working toward a general-purpose assistant that can follow multimodal language-and-vision instructions aligned with human intent to complete real-world tasks. To this end, language-augmented foundation vision models have been developed for open-world visual understanding, covering tasks such as classification, detection, segmentation, captioning, and visual generation and editing. With the release of OpenAI's GPT-4, the model behind the popular chatbot ChatGPT, multimodal capabilities have become a prominent addition to the LLM landscape.
In a recent paper, the authors present the first attempt to use GPT-4 to generate multimodal language-image instruction-following data. The team introduces LLaVA (Large Language and Vision Assistant), a large, end-to-end trained multimodal model that connects a vision encoder with Vicuna for general-purpose visual and language understanding. Vicuna is an open-source 13B-parameter chatbot trained by fine-tuning LLaMA on user-shared conversations.
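To make the architecture concrete, below is a minimal, schematic sketch (not the authors' implementation) of how a CLIP vision encoder can be wired to a causal language model through a learned linear projection. The checkpoint names and the single-linear-layer connector are illustrative assumptions.

```python
# Schematic sketch of a LLaVA-style model: a CLIP vision encoder whose patch
# features are projected into the word-embedding space of a language model.
# Checkpoint ids below are assumptions chosen for illustration.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class VisionLanguageConnector(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="lmsys/vicuna-13b-v1.5"):  # assumed checkpoint ids
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Linear projection from CLIP hidden size to the LLM embedding size
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.llm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        # Patch-level image features from the vision encoder
        img_feats = self.vision(pixel_values=pixel_values).last_hidden_state
        img_tokens = self.proj(img_feats)               # map into LLM space
        txt_tokens = self.llm.get_input_embeddings()(input_ids)
        # Prepend the projected image tokens to the text embeddings
        inputs_embeds = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In this sketch, the projected image features act as a sequence of "visual tokens" that the language model attends to alongside the text prompt; training end-to-end lets the projection learn to align visual features with the language model's embedding space.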
LLaVA is an attempt to extend instruction tuning into the multimodal space. The main objective is a visual assistant that can effectively follow multimodal language-and-vision instructions aligned with human intent, helping users complete real-world tasks. The team's main contributions are as follows:
- Multimodal instruction-following data – The team presents a data-reformation perspective and a pipeline for converting image-text pairs into an instruction-following format with the help of GPT-4 (a minimal sketch of such a pipeline appears after this list).
- Large multimodal models – The team develops a large multimodal model by connecting CLIP's open-set visual encoder with the LLaMA language decoder and fine-tuning it end-to-end on the generated instructional vision-language data.
- The pilot study verifies the effectiveness of the generated data for LMM instruction tuning and offers practical tips for building a general-purpose, instruction-following visual assistant.
- State-of-the-art performance is achieved with the help of GPT-4 on the Science QA multimodal reasoning dataset.
- Open-source release – The generated multimodal instruction data, the codebase for data generation and model training, the model checkpoint, and a visual chat demo are publicly available at https://github.com/haotian-liu/LLaVA .
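As referenced in the first bullet, the data-reformation idea is that language-only GPT-4 never sees the image; it receives symbolic descriptions of an image-text pair (captions and bounding boxes) and is prompted to rewrite them as an instruction-following conversation. The sketch below is a hedged illustration of that idea; the prompt wording, model name, and helper function are assumptions, not the authors' released prompts or scripts.

```python
# Hedged sketch of reformatting one image-text pair into instruction-following
# dialogue with a language-only model. Prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an AI visual assistant. You are given captions and object "
    "bounding boxes describing an image. Generate a multi-turn conversation "
    "between a user asking about the image and an assistant answering as if "
    "it can see the image."
)

def reformat_pair(captions: list[str], boxes: list[str]) -> str:
    """Turn one image-text pair into instruction-following dialogue text."""
    context = "Captions:\n" + "\n".join(captions) + "\nBoxes:\n" + "\n".join(boxes)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

# Example call with toy COCO-style annotations (hypothetical values):
# reformat_pair(["A dog catches a frisbee in a park."],
#               ["dog: [0.21, 0.40, 0.55, 0.92]",
#                "frisbee: [0.48, 0.25, 0.60, 0.38]"])
```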
LLaVA demonstrates impressive multimodal chat capabilities, achieving a relative score of 85.1% compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 reaches a new state-of-the-art accuracy of 92.53%. These results make LLaVA a promising approach and a significant contribution to the growing family of multimodal language models.
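For readers unfamiliar with "relative score", the idea (as described in the paper) is that a judge model rates both the candidate model's answers and GPT-4's reference answers on the same questions, and the relative score is the ratio of the two average ratings. The tiny sketch below illustrates that aggregation only; the rating scale and procedure details are assumptions.

```python
# Illustrative only: relative score as the ratio of mean judge ratings for a
# candidate model versus a GPT-4 reference, expressed as a percentage.
def relative_score(model_ratings: list[float], reference_ratings: list[float]) -> float:
    """Mean candidate rating divided by mean reference rating, in percent."""
    assert len(model_ratings) == len(reference_ratings) and model_ratings
    mean_model = sum(model_ratings) / len(model_ratings)
    mean_reference = sum(reference_ratings) / len(reference_ratings)
    return 100.0 * mean_model / mean_reference

# e.g. relative_score([7.5, 8.0, 6.5], [9.0, 8.5, 8.0]) -> roughly 86%
```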
Check out the research paper, code, and project page for more details.
Tania Malhotra is a final-year undergraduate at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is passionate about data science, with strong analytical and critical-thinking skills, and a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.