Chinese Research Institute Introduces Multimodal Model Unifying Text, Image, Video

Tue Oct 22 2024

BEIJING: The Beijing Academy of Artificial Intelligence (BAAI) has launched Emu3, a multimodal model that unifies text, image, and video processing using next-token prediction, in a significant advancement for artificial intelligence.

The model aims to reshape the way AI handles multiple types of media simultaneously, pushing beyond traditional language models.

Emu3 operates by tokenizing text, images, and videos into a unified discrete space, allowing a single transformer to be trained from scratch on a combination of multimodal sequences, according to Wang Zhongyuan, director of BAAI.

This approach does away with the diffusion and compositional models that such tasks have typically required.
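To make the idea concrete, here is a minimal sketch in Python (PyTorch) of what training a single transformer with next-token prediction over a shared discrete token space could look like. The vocabulary sizes, model dimensions, and tokenizer split below are illustrative assumptions, not Emu3's actual design; in practice the visual token IDs would come from a learned vision tokenizer rather than random sampling.

```python
# Illustrative sketch of unified next-token prediction over multimodal
# sequences. All names and sizes are assumptions for demonstration,
# not Emu3's actual tokenizers or architecture.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000     # hypothetical text tokenizer vocabulary
VISION_VOCAB = 16_384   # hypothetical visual codebook (e.g. a VQ tokenizer)
VOCAB = TEXT_VOCAB + VISION_VOCAB  # one shared discrete space

class UnifiedTransformer(nn.Module):
    def __init__(self, dim=512, heads=8, layers=6, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)     # one table for all modalities
        self.pos = nn.Embedding(max_len, dim)     # learned positions
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)         # next token, any modality

    def forward(self, tokens):
        n = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(n, device=tokens.device))
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(tokens.device)
        return self.head(self.backbone(x, mask=mask))

# An interleaved sequence: text tokens followed by image tokens, all in the
# same ID space (image IDs are simply offset by TEXT_VOCAB).
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))
seq = torch.cat([text, image], dim=1)

model = UnifiedTransformer()
logits = model(seq[:, :-1])  # predict each next token from its prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)
)
loss.backward()
```

The point of the sketch is that image and video tokens live in the same ID space as text tokens, so a single causal transformer and a single cross-entropy loss cover every modality at once.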


“Emu3 demonstrates that next-token prediction can serve as a powerful framework for multimodal AI, delivering state-of-the-art performance across a variety of tasks,” Wang said in a press release. He added that Emu3 can both generate and understand multimodal content.

According to BAAI, the new model has outperformed several well-established task-specific systems on both generation and perception tasks. The academy also announced that it has open-sourced Emu3’s key technologies and models for the international technology community.

Technology experts have hailed the development as a significant opportunity to explore multimodality through a single architecture, sidestepping the complexity of combining diffusion models with large language models (LLMs).

“Emu3 opens up new possibilities for practical applications, such as robot intelligence, autonomous driving, and multimodal dialogue systems,” Wang added.
