Multimodal Generative AI

Introduction

The generative artificial intelligence (AI) domain spans multiple data types, including text, images, audio, and more. This emerging field leverages the complexity and richness of the real world, transforming how machines understand and generate multimedia content. Generative AI, historically rooted in unimodal systems that handle a single data type, has evolved. Classic examples like text-based large language models perform tasks within their respective domains effectively. The significant breakthrough, however, comes with multimodal generative AI, which integrates and synthesizes information across multiple sensory inputs, reflecting more closely how we perceive and interact with the world. This blog discusses a modest sampling of the available multimodal models.

Multimodal Models Landscape

The field of multimodal models is rapidly advancing, with significant strides being made across various AI applications, including text, image, video, and audio processing. These models are designed to handle and integrate multiple types of data, enhancing the scope and performance of generative AI. Through strategic modifications and integrations, these models are set to continue revolutionizing how we understand and generate data across various modalities.

Stable Diffusion

Stable Diffusion – This model is a key player, particularly for image generation; a demo is available here: Stable Diffusion Demo. The original authors of Stable Diffusion were Stability AI, CompVis, and Runway. Stability AI collaborated with Eleuther AI and LAION to implement an update, Stable Diffusion 2. Stable Diffusion 1 focuses primarily on generating high-quality images from textual prompts, but it has limitations in handling complex textures and achieving finer details. Stable Diffusion 2 introduced improvements in the training process and text conditioning, building on techniques like classifier-free guidance. These advancements allow Stable Diffusion 2 to produce more detailed and diverse outputs with better handling of lighting and shadowing effects. Moreover, Stable Diffusion 2 supports a wider range of input conditions, improving multimodal task performance compared to Stable Diffusion 1.
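
For readers who want to try this themselves, here is a minimal text-to-image sketch. It assumes the Hugging Face diffusers library and the publicly released stabilityai/stable-diffusion-2-1 checkpoint, which are our choices for illustration rather than tooling described above.

import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained Stable Diffusion 2.1 pipeline (text encoder, VAE, U-Net, scheduler).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

prompt = "a photorealistic lighthouse on a rocky coast at sunset"  # hypothetical prompt

# guidance_scale controls how strongly the output follows the prompt
# (classifier-free guidance, discussed in the training overview below).
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("lighthouse.png")

Higher guidance_scale values trade output diversity for closer adherence to the prompt, which connects directly to the text-to-image guidance step described next.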

Stable Diffusion 2 is trained using a diffusion process, a generative method that involves learning to reverse a stepwise degradation of data. Here’s an overview of the training process, with a minimal code sketch after the list:

1. Diffusion Process – Stable Diffusion 2, like its predecessor, is based on the denoising diffusion probabilistic model (DDPM). During training, images are progressively corrupted by adding noise in a series of steps. The model learns to reverse this process, generating high-quality images from random noise by predicting the noise added at each step.

2. Latent Space Representation – Instead of working directly with pixel-level data, Stable Diffusion 2 operates in a latent space. This is achieved using a variational autoencoder (VAE), where the model first encodes images into a lower-dimensional latent space, and the diffusion process occurs within this latent space. This allows for a more efficient and scalable image generation process since the model only needs to focus on learning in a smaller, compressed space.

3. Text-to-Image Guidance – Stable Diffusion 2 uses a technique called classifier-free guidance, where it is trained both with and without text conditioning. During inference, it can adjust the guidance scale to influence the strength of the relationship between text prompts and image generation. This approach provides greater control over how closely the generated image aligns with the input prompt.

4. Noise Prediction and Denoising – At each step of the diffusion process, Stable Diffusion 2 predicts the noise added to the image in that step. It iteratively removes noise, refining the image with each denoising step. Through this reverse process, it eventually generates a high-quality image from a noisy representation. Overall, Stable Diffusion 2 training involves a combination of latent space diffusion, noise prediction, and advanced text conditioning, leading to a powerful and versatile image generation model.
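
To make these steps concrete, here is a heavily simplified, hypothetical training step for a latent diffusion model in PyTorch. The vae_encoder, unet, and text_encoder arguments are stand-in modules rather than the actual Stable Diffusion 2 components; the structure simply mirrors the four steps above.

# A simplified, hypothetical latent-diffusion training step (PyTorch).
# vae_encoder, unet, and text_encoder are placeholder modules, not the real
# Stable Diffusion 2 components.
import torch
import torch.nn.functional as F

def training_step(vae_encoder, unet, text_encoder, images, captions,
                  alphas_cumprod, num_timesteps=1000, cond_dropout=0.1):
    # alphas_cumprod: precomputed cumulative noise-schedule terms, shape
    # (num_timesteps,), assumed to live on the same device as the latents.

    # 1. Encode images into the lower-dimensional latent space (VAE encoder).
    latents = vae_encoder(images)

    # 2. Sample a random timestep and Gaussian noise for each example.
    t = torch.randint(0, num_timesteps, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)

    # 3. Corrupt the latents: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise.
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise

    # 4. Classifier-free guidance: randomly drop the text condition so the model
    #    also learns an unconditional noise estimate.
    text_emb = text_encoder(captions)
    drop = torch.rand(latents.shape[0], device=latents.device) < cond_dropout
    text_emb = torch.where(drop.view(-1, 1, 1), torch.zeros_like(text_emb), text_emb)

    # 5. Predict the added noise and regress it with a mean-squared-error loss.
    noise_pred = unet(noisy_latents, t, text_emb)
    return F.mse_loss(noise_pred, noise)

At inference time the process runs in reverse: the model produces both a conditional and an unconditional noise estimate, and the guidance scale interpolates between them at each denoising step, which is how the guidance_scale argument in the earlier generation sketch exerts its effect.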

Segment Anything Model 2 (SAM 2)

Meta’s SAM 2 model stands out as a state-of-the-art tool in image segmentation, with the primary capability of segmenting static images based on OpenCLIP embeddings. However, SAM 2 has even more intriguing potential when integrated with Meta’s ImageBind model. ImageBind, designed for multimodal understanding, enables SAM 2 to extend its segmentation capabilities beyond static images to video content. This integration occurs through a modification in which ImageBind replaces OpenCLIP as the embedding generator. Once this switch is made, SAM 2 can leverage ImageBind’s ability to process videos and segment them frame by frame. For reference, here is a link to Meta’s SAM 2 model paper: Meta’s SAM 2 Model Paper

ImageBind

ImageBind is a cutting-edge multimodal model designed to unify inputs from diverse modalities, including text, image, video, and audio. By binding together these different data types, it enables more seamless and integrated outputs. ImageBind is unique in its ability to process and connect inputs without needing explicit modality-specific annotations, integrating six types of data: visual data (both image and video), thermal data (infrared images), text, audio, depth information, and, intriguingly, movement readings from an inertial measurement unit (IMU). This integration of multiple data types into a single embedding space is a concept that will only fuel the ongoing boom in generative AI.

ImageBind is a Transformer-based multimodal model that generates joint embeddings across six modalities: vision (image and video), text, depth, thermal, audio, and inertial measurement unit (IMU) signals. ImageBind was developed by Meta and introduced in this paper: Meta’s ImageBind Paper. The model uses the Transformer architecture for the encoder of every modality and has a separate encoder for each one, except for images and videos, which share the same encoder. The encoders are trained with a contrastive learning objective between pairs of modalities, where one modality in the pair is the (augmented) image modality and the other is a different modality.
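
As a rough illustration of that contrastive objective, the sketch below aligns image embeddings with embeddings from one other modality (audio, in this example) using a symmetric InfoNCE-style loss. The embeddings and temperature value are placeholders, not the actual ImageBind components.

# A simplified, hypothetical InfoNCE-style contrastive loss between image
# embeddings and embeddings from another modality (e.g., audio) for a batch
# of matched pairs. Illustrative only.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, other_emb, temperature=0.07):
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with sample j
    # of the other modality. Matching pairs lie on the diagonal.
    logits = image_emb @ other_emb.T / temperature
    targets = torch.arange(image_emb.shape[0], device=image_emb.device)

    # Symmetric cross-entropy: pull matching pairs together, push others apart.
    loss_i2o = F.cross_entropy(logits, targets)
    loss_o2i = F.cross_entropy(logits.T, targets)
    return (loss_i2o + loss_o2i) / 2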

ImageBind does not require all modalities to appear together in the same dataset. Instead, it takes advantage of the natural binding property of images, demonstrating that aligning each modality’s embeddings with image embeddings gives rise to emergent cross-modal alignment. As a result, ImageBind provides emergent zero-shot classification and retrieval between pairs of modalities that were never observed together during training, in addition to pairs that include images. ImageBind supports two core capabilities:

1. Feature Embedding Generation – Takes as input texts, RGB images, videos, depth images, thermal images, audio clips, and/or IMU signals, and generates feature embeddings.

2. Zero-shot Classification – Takes as input texts, RGB images, videos, depth images, thermal images, audio clips, and/or IMU signals, and predicts matching classes (labels) between every pair of modalities. A usage sketch covering both capabilities follows this list.
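
The sketch below shows how both capabilities look in practice, following the usage pattern in the README of Meta’s open-source ImageBind repository. The exact import paths, asset file names, and checkpoint availability are assumptions and may differ across versions.

# A usage sketch following the pattern in the ImageBind GitHub README.
# Import paths and asset file names are assumptions and may vary by version.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model.
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

text_list = ["a dog", "a car", "a bird"]            # hypothetical labels
image_paths = ["assets/dog.jpg", "assets/car.jpg"]  # hypothetical files
audio_paths = ["assets/dog_bark.wav"]               # hypothetical file

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

# 1. Feature embedding generation: one embedding per input, all in a shared space.
with torch.no_grad():
    embeddings = model(inputs)

# 2. Zero-shot classification: softmax over cross-modal similarities.
vision_vs_text = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
audio_vs_text = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(vision_vs_text)  # rows: images, columns: text labels
print(audio_vs_text)   # rows: audio clips, columns: text labels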

Here’s an overview of how ImageBind works:

1. Unified Latent Space – ImageBind works by mapping data from different modalities into a common latent space. A latent space is a lower-dimensional representation where features from different types of data can be understood and compared. For example, an image and a corresponding audio file can be represented in this space in such a way that they can be linked or associated even if they were not explicitly paired during training. This means that the model can generalize across various inputs and understand the relationships between different modalities.

2. Multimodal Representation Learning – Unlike traditional models that may require modality-specific embeddings or annotations (like pairing an image with a specific audio clip during training), ImageBind learns multimodal representations in a self-supervised manner. During training, ImageBind is exposed to data from multiple modalities, but not necessarily paired data. It learns to bind different modalities by leveraging the underlying structure and context of the data. For instance, ImageBind can associate the sound of a lion roaring with images of a lion without needing explicit image-audio pairs in the training set.

3. Cross-Modal Retrieval and Generation – Once trained, ImageBind can perform tasks like cross-modal retrieval, where it retrieves data in one modality based on input from another. For instance, given a textual prompt, it can retrieve relevant images or audio files. Similarly, given an image, it can generate or retrieve related audio. This ability to switch between modalities and process them in an integrated manner makes it highly versatile for generative AI tasks across multiple data types. A minimal retrieval sketch appears after this list.

4. Scalability and Flexibility – ImageBind is scalable, meaning it can be extended to new modalities without requiring significant retraining or adaptation. The same alignment approach that already covers data types like thermal and depth imagery could, in principle, be extended to additional sensor types, further increasing its applicability in diverse fields, from virtual reality to autonomous systems.

5. Application in SAM 2 – In the context of SAM 2 (Segment Anything Model 2), ImageBind is used to enhance segmentation capabilities. If SAM 2 is modified to use ImageBind instead of OpenCLIP, it gains the ability to segment videos, rather than being limited to static images. This is because ImageBind can handle video frames, associating them with other modalities such as audio or textual prompts to perform fine-grained segmentation across time rather than just within a single frame.
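
As referenced in the cross-modal retrieval item above, here is a small, self-contained retrieval sketch over a shared embedding space: given a query embedding from one modality (e.g., text), it returns the nearest items from a gallery of embeddings in another modality (e.g., images). The embeddings are assumed to come from a joint-embedding model such as ImageBind; everything else is illustrative.

# A minimal cross-modal retrieval sketch over a shared embedding space.
# query_emb and gallery_embs are assumed to come from a joint-embedding model
# such as ImageBind; the ranking logic itself is generic.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return indices and scores of the k gallery items most similar to the query."""
    # Cosine similarity between the query and every gallery item.
    query = F.normalize(query_emb, dim=-1)           # shape: (dim,)
    gallery = F.normalize(gallery_embs, dim=-1)      # shape: (num_items, dim)
    scores = gallery @ query                         # shape: (num_items,)
    top = torch.topk(scores, k=min(k, gallery.shape[0]))
    return top.indices, top.values

# Example: retrieve images for a text query (placeholder embeddings).
text_query = torch.randn(1024)
image_gallery = torch.randn(500, 1024)
indices, scores = retrieve_top_k(text_query, image_gallery, k=5)
print(indices, scores)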

Conclusion

The multimodal AI landscape is vibrant and rapidly evolving, with models like Stable Diffusion, SAM 2, and ImageBind pushing the boundaries of generative and segmentation tasks. Each of these key models is tailored to integrate multiple types of data, such as text, images, audio, and more. Stable Diffusion is one prominent model currently used for generating high-quality images from textual prompts. It works through a diffusion process that iteratively refines outputs to create photo-realistic visuals, making it a popular choice for text-to-image generation tasks. Stable Diffusion’s versatility, along with its open-source accessibility, has broadened its adoption across various creative and industrial applications.

Meta’s ImageBind represents a significant leap in multimodal AI by enabling the binding of six different data modalities (image and video, text, audio, depth, thermal, and IMU). This model operates on the principle of aligning these diverse forms of data into a shared space, which allows for more natural and integrated understanding across modalities (Meta ImageBind Paper). This capability opens up possibilities in cross-modal retrieval and understanding, pushing the boundaries of what AI can achieve in real-world applications. Another major innovation is Meta’s SAM 2, which builds on prior segmentation models to create a more refined, generalized approach to segmenting any object in an image with minimal guidance. SAM 2 is particularly useful in enhancing tasks that require detailed image analysis, such as medical imaging and autonomous vehicle navigation.

These models highlight the growing trend of integrating various data types, aiming to deliver seamless and efficient AI solutions. The advancements in these models provide a foundation for future research and applications in multimodal generative AI, offering promising directions for creative, scientific, and industrial use cases. In a future blog we will discuss fine-tuning multimodal generative AI models. Please contact us for more information on how we can help supercharge your organization with Generative AI!