What will artificial intelligence (AI) look like in the future? Imagine AI systems that can understand and carry out complex tasks from a single simple command, and that can visually read a user's expressions and movements to gauge their emotional state. This is no longer a scene from a Hollywood science fiction movie, but "multimodal AI", which is steadily becoming reality.
According to a recent report on the US Forbes website, giants such as Meta Platforms, OpenAI and Google have all launched their own multimodal AI systems and are sparing no effort to increase investment in their research and development, refining their models to improve the accuracy of dynamic content output and thereby enhance the interactive experience between AI and users.
Multimodal AI marks a paradigm shift. It will profoundly change the face of many industries and reshape the digital world.
Giving AI “multi-sensory” capabilities
How do humans understand the world? We rely on multiple senses such as sight, hearing, and touch to receive information from countless sources. The human brain integrates these complex data patterns to draw a vivid "picture" of reality.
IBM's official website defines multimodal AI as machine learning models that can integrate and process data from multiple modalities (data types), including input in the form of text, images, audio and video. It is like giving AI a full set of senses, allowing it to perceive and understand input from multiple angles.
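As a rough illustration of what "integrating modalities" can mean in code, the minimal sketch below scores how well several text captions describe an image by embedding both in a shared space with the openly available CLIP model. The model choice and the image URL are illustrative assumptions, not part of the systems discussed in this article.

```python
# Minimal sketch of cross-modal processing: embed an image and several
# text captions in a shared space and score how well they match (CLIP).
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder URL; swap in any local or remote photo.
image = Image.open(requests.get("https://example.com/food.jpg", stream=True).raw)
captions = ["a plate of pasta", "a bowl of salad", "a cup of coffee"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption better describes the image.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```

The point of the sketch is only that text and image end up in one comparable representation, which is the basic building block behind the richer systems described below.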
This ability to understand and create information across different modalities goes beyond earlier single-modal AI, which focused on integrating and processing a single, specific type of data, and it has won the favor of major technology giants.
At this year's Mobile World Congress, Qualcomm deployed its own large multimodal model on an Android phone for the first time. Whether users input photos, voice or other information, they can communicate smoothly with the AI assistant. For example, a user can take a photo of food and ask the AI assistant: What are these ingredients? What dishes can be made from them? How many calories would each dish contain? The AI assistant can give detailed answers based on the photo.
In May this year, OpenAI released the multimodal model GPT-4o, which supports any combination of text, audio and images as input and output. The next day, Google launched its latest multimodal AI product, Gemini 1.5 Pro.
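To give a concrete sense of how such a model is driven in practice, here is a minimal sketch that sends GPT-4o a photo together with a text question via the OpenAI Python SDK, mirroring the food-photo scenario above. The image URL is a placeholder, and the call assumes a valid API key is configured; it is a sketch, not a description of any vendor's internal implementation.

```python
# Minimal sketch: ask GPT-4o a question about a photo (text + image input).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What ingredients are in this photo, and roughly "
                         "how many calories is each dish?"},
                # Placeholder URL; point this at a real photo of the meal.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/food.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```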
On September 25, Meta Platforms released its latest open-source large language model, Llama 3.2. CEO Mark Zuckerberg said in his keynote speech that this is the company's first open-source multimodal model able to process text and visual data simultaneously, marking significant progress in AI's ability to understand more complex application scenarios.
Quietly driving change across many fields
Multimodal AI is quietly changing the face of many fields.
In the field of health care, IBM's Watson Health comprehensively analyzes patients' imaging data, medical-record text and genetic data, helping doctors diagnose diseases more accurately and strongly supporting them in formulating personalized treatment plans.
The creative industries are also being transformed. Digital marketing experts and filmmakers are using this technology to create customized content. Imagine: from just a simple prompt or concept, an AI system can write a compelling script, generate a storyboard (a series of illustrations arranged to tell a visual story), compose a soundtrack, and even produce a preliminary cut of scenes.
Education and training are also moving toward personalized learning with the help of multimodal AI. An adaptive learning platform developed by Newton Company in the United States uses multimodal AI to analyze students' learning behavior, expressions and voices in depth, adjusting teaching content and difficulty in real time. Experimental data show that this approach can improve students' learning efficiency by 40%.
Customer service is another promising application of multimodal AI. Chatbots can not only respond to text queries but also understand a customer's tone of voice, analyze their facial expressions, and respond with appropriate language and visual cues. This more human-like communication promises to transform the way businesses interact with customers.
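As a purely hypothetical sketch of what combining such signals might look like, the toy code below fuses a text-sentiment score and a facial-expression label into a simple response policy. All names and thresholds here are illustrative assumptions, not any vendor's actual system.

```python
# Hypothetical sketch: fuse two modality signals (text sentiment and a
# facial-expression label) to choose how a support bot should respond.
from dataclasses import dataclass

@dataclass
class CustomerSignal:
    text_sentiment: float  # -1.0 (negative) .. 1.0 (positive), from a text model
    expression: str        # e.g. "neutral", "smiling", "frustrated", from a vision model

def choose_tone(signal: CustomerSignal) -> str:
    """Pick a response style from the combined multimodal signals."""
    if signal.text_sentiment < -0.3 or signal.expression == "frustrated":
        return "apologetic tone, offer escalation to a human agent"
    if signal.text_sentiment > 0.3 and signal.expression == "smiling":
        return "upbeat tone, suggest related products"
    return "neutral tone, answer the question directly"

print(choose_tone(CustomerSignal(text_sentiment=-0.6, expression="frustrated")))
```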
Technical and ethical challenges still to be overcome
However, the development of multimodal AI also faces many challenges.
Henry Idel, founder of the AI consulting company Hidden Space, said that the power of multimodal AI lies in its ability to integrate multiple data types, but how to fuse these data effectively remains a technical challenge.
In addition, multimodal AI models often consume large amounts of computing resources, which inevitably raises the cost of putting them to use.
More notably, multimodal data contains more personal information. When multimodal AI systems can easily identify faces, voices and even emotional states, how can we ensure that personal privacy is respected and protected? And what effective measures can prevent such systems from being used to create "deepfakes" or other misleading content? These are all questions worth pondering.