Learn about AudioGPT: a multi-modal AI system that connects ChatGPT to audio foundation models



Large language models now strongly shape the AI community, and the introduction of ChatGPT and GPT-4 has advanced natural language processing. With massive web text data and robust engineering, LLMs can read, write, and converse like humans. Despite successful applications in text processing and generation, success in the audio modality (speech, music, sound, and talking heads) remains limited, even though audio is highly valuable because: 1) in real-world scenarios, humans communicate using spoken language in day-to-day conversations and use spoken assistants to make life more convenient; 2) processing audio information is required to achieve success in artificial generation.

A critical step for LLMs toward more advanced AI systems is understanding and producing speech, music, sound, and talking heads. Despite the advantages of the audio modality, it is still difficult to train LLMs that support audio processing because of the following issues: 1) Data: very few sources provide real-world spoken conversations, and obtaining human-labeled speech data is expensive and time-consuming. In addition, multilingual conversational speech data is needed, and the amount available is limited compared with the vast corpora of web text. 2) Computational resources: training a multi-modal LLM from scratch is computationally intensive and time-consuming.

In this work, researchers from Zhejiang University, Peking University, Carnegie Mellon University, and Renmin University of China present “AudioGPT,” a system designed to excel at understanding and generating audio in spoken dialogues. In particular:


  1. They use a variety of audio foundation models to process complex audio information rather than training a multi-modal LLM from scratch.
  2. They connect the LLM to input/output interfaces for speech conversations rather than training a spoken language model.
  3. They use the LLM as a general-purpose interface that enables AudioGPT to solve many audio understanding and generation tasks.

Training from scratch would be pointless because audio foundation models can already understand and produce speech, music, sound, and talking heads.

The AudioGPT process can be divided into four stages, as shown in Figure 1:

• Modality transformation: Input/output interfaces convert between speech and text, bridging the gap between spoken language and the text-based ChatGPT.

• Task analysis: ChatGPT uses a conversation engine and prompt manager to determine the user's intent when processing audio information.

• Model assignment: After receiving structured arguments for prosody, timbre, and language control, ChatGPT assigns audio foundation models for understanding and generation.

• Response generation: After the audio foundation models have run, a final response is generated and returned to the user.
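
The four stages above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: every function name, the `Task` type, the `MODEL_REGISTRY` table, and the task labels are hypothetical, keyword rules stand in for ChatGPT's intent analysis, and the "models" are stubs that return placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Task:
    name: str                                            # detected task family (hypothetical labels)
    args: Dict[str, str] = field(default_factory=dict)   # structured arguments for the model


def modality_transformation(query: Union[str, bytes]) -> str:
    """Stage 1: turn spoken input into text (stub standing in for a speech recognizer)."""
    if isinstance(query, bytes):
        return query.decode("utf-8")  # placeholder: pretend the bytes are the transcript
    return query


def task_analysis(text: str) -> Task:
    """Stage 2: infer the user's intent (toy keyword rules standing in for the LLM)."""
    lowered = text.lower()
    if "transcribe" in lowered:
        return Task("speech_recognition", {"input": text})
    if "music" in lowered:
        return Task("music_generation", {"prompt": text})
    if "say" in lowered:
        return Task("text_to_speech", {"text": text, "language": "en"})
    return Task("chat", {"text": text})


# Stage 3: hypothetical registry mapping task names to foundation-model stubs.
MODEL_REGISTRY: Dict[str, Callable[[Dict[str, str]], str]] = {
    "speech_recognition": lambda args: "<transcript>",
    "music_generation": lambda args: "<music clip>",
    "text_to_speech": lambda args: "<speech audio>",
    "chat": lambda args: "<text reply>",
}


def model_assignment(task: Task) -> Callable[[Dict[str, str]], str]:
    """Stage 3: pick the model for the task, falling back to plain chat."""
    return MODEL_REGISTRY.get(task.name, MODEL_REGISTRY["chat"])


def response_generation(task: Task, model: Callable[[Dict[str, str]], str]) -> str:
    """Stage 4: run the chosen model and wrap its output as the final answer."""
    return f"[{task.name}] {model(task.args)}"


def run_pipeline(query: Union[str, bytes]) -> str:
    text = modality_transformation(query)
    task = task_analysis(text)
    model = model_assignment(task)
    return response_generation(task, model)


print(run_pipeline("Please say hello in a calm voice"))
print(run_pipeline(b"Generate some piano music"))
```

Keeping each stage as a separate function mirrors the division in Figure 1: the language model only routes requests and arguments, while the audio work is delegated to whichever registered model matches the task.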

Figure 1: AudioGPT overview. Modality transformation, task analysis, model assignment, and response generation are the four stages that make up AudioGPT. To handle difficult audio tasks, it equips ChatGPT with audio foundation models. In addition, it connects to a modality transformation interface to enable spoken communication. The authors develop design principles to evaluate the consistency, capability, and robustness of a multi-modal LLM.

Evaluating how well a multi-modal LLM understands human intention and coordinates cooperation between different foundation models has become an increasingly popular research question. Experimental results show that AudioGPT can process complex audio information in multi-round dialogue for various AI applications, including the generation and understanding of speech, music, sound, and talking heads. In this study, the authors describe the design principles and evaluation process for the consistency, capability, and robustness of AudioGPT.
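
The article does not spell out how these properties are measured, so the following is only one possible way to quantify consistency, under a stated assumption: that consistency can be approximated by how often the system's chosen task matches a human-labeled intent. The function `consistency_score`, the stand-in `toy_assigner`, and the example prompts and labels are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def consistency_score(
    assign_task: Callable[[str], str],
    labelled_prompts: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose assigned task equals the human-labeled task.

    Assumption: agreement with human labels is used as a proxy for consistency.
    """
    pairs = list(labelled_prompts)
    if not pairs:
        raise ValueError("need at least one labelled prompt")
    hits = sum(1 for prompt, expected in pairs if assign_task(prompt) == expected)
    return hits / len(pairs)


def toy_assigner(prompt: str) -> str:
    """Deliberately simplistic stand-in for the LLM's task analysis."""
    return "music_generation" if "music" in prompt.lower() else "text_to_speech"


examples = [
    ("Compose some calm music", "music_generation"),
    ("Read this sentence aloud", "text_to_speech"),
    ("Transcribe the recording", "speech_recognition"),  # the toy assigner misses this one
    ("Make music for a game", "music_generation"),
]

score = consistency_score(toy_assigner, examples)
print(f"consistency: {score:.2f}")
```

Capability and robustness would need different measurements (for example, task-specific quality metrics and behavior on unsupported or malformed requests), which this sketch does not attempt.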

One of the paper's main contributions is AudioGPT itself, which equips ChatGPT with audio foundation models for complex audio tasks. A modality transformation interface is connected to ChatGPT, which acts as a general-purpose interface, to enable spoken communication. The authors also describe design principles and an evaluation process for multi-modal LLMs and assess the consistency, capability, and robustness of AudioGPT. AudioGPT effectively understands and generates audio over multiple rounds of dialogue, enabling people to produce rich and diverse audio content with unprecedented ease. The code has been open-sourced on GitHub.




Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.

