Extending Text2Image Models to Accept Multi-Modal Conditions by Encoding to the CLIP Latent Space


"Anything to image, learning about cultures": Introduction and Motivation

Motivation

Image 1: Cultural areas

In our quest to authentically represent cultural narratives, the necessity of multimodal technologies is paramount. These technologies, particularly with innovations like CLIP, facilitate a fusion of text, image, and audio inputs, enabling a nuanced portrayal of cultural realities. This approach aligns with the modern definition of intelligence, which encompasses not just logical reasoning but also the understanding of complex ideas and experiences. Our project is driven by this understanding and the need to explore cultural perspectives deeply. Current research on multimodal generative models is burgeoning, yet there's a pressing need to focus on the depth of cultural understanding within these models. It's through this comprehension that we can appreciate the vast tapestry of human traditions and foster a deeper empathy across diverse societies. Programs that focus on the interpretation of cultural symbols are key to unlocking a more profound grasp of our multifaceted world.

Our team, rooted in the diverse heritages of China and Mexico, is motivated by the capabilities of multimodal technologies in interpreting varied cultural expressions. We aim to transcend common cultural stereotypes, seeking a more authentic and comprehensive understanding of our diverse backgrounds. This project is our endeavor to bridge cultural divides and foster a deeper appreciation of the rich tapestry of human traditions through the lens of multimodal technology.

Leveraging GenAI technology, our project presents a platform that highlights the richness of global cultures through aspects like food, clothing, paintings, and instruments. It aims to deepen appreciation for cultural heritage and facilitates cultural exchange, enabling users to explore and connect with the diversity of the world's communities.

Principal goal

The primary objective of our project is to explore the degree to which multimodal models comprehend cultural elements. By integrating two open-source models, we've enabled the generation of images based on inputs from voice, image, and text. We've established a website to demonstrate the usability of our model and have embarked on a series of benchmarks and experiments to probe the model's understanding of cultural nuances.

Deliverables

Image 2: Network project architecture

At the end of the project, five main deliverables can be identified:

  1. A comprehensive website that showcases our multimodal generative model, offering users the ability to search and generate images.
  2. A curated multimodal dataset encompassing four cultural themes, serving as the foundation for our benchmark.
  3. Benchmark results of our model's search capabilities, showing its effectiveness and precision.
  4. Experimental findings on the model's ability to generate culturally relevant content, demonstrating its understanding and application in diverse cultural contexts.
  5. Results from experiments testing the model's coherence abilities.

Project Plan and Milestones

Weekly Project Plan

The Digital Humanities course spans 14 weeks; however, only the 11 weeks allocated to the project were considered for planning.

Below is the week-by-week plan for this period.

The work was organized into four streams: model programming, web page development, dataset construction, and experimentation and evaluation.

  • Week 4: Identified the type of model to use.
  • Week 5: ImageBind model development; first website demo/draft.
  • Week 6: ImageBind model test; work on the backend and on combining the two models; dataset search plan.
  • Week 7: Stable Diffusion model deployment; division of subtopics and data levels.
  • Week 8: Manual dataset collection by levels and samples.
  • Week 9: Manual dataset collection (continued); wiki page for the project development.
  • Week 10: Manual dataset collection (continued); continued website construction.
  • Week 11: Website improvement, connecting frontend and backend; Culture Search benchmark and data analysis.
  • Week 12: Added samples for the different categories under test; Culture Generation benchmark and data analysis; tested the results of the two models.
  • Week 13: Finalized and improved the website; published the website; tested the two functions and models.
  • Week 14: Prepared the wiki page and final report; final presentation; final project performance.

Milestones

A "milestone" is a project marker, for the development of this project, 3 important milestones were defined, which represented the 3 essential phases of the project. For each milestone, specific tasks were assigned that indicated whether it was complete.

Milestone 1 Model development

Delivery date November 10

This milestone focuses on the development of the multimodal image-generation pipeline built on the CLIP latent space: the deployment of both the embedding model and the diffusion model used for image creation, as well as the collection of the raw material necessary for their operation.

This milestone comes first because it represents the project's initial progress: a working model from which to start and later adapt to a user-friendly environment for deployment.

Among the key activities we find:

  • Culture Images Dataset Collection
  • ImageBind Model Deployment
  • Diffusion Model Deployment


Milestone 2 Website development and improvement

Delivery date November 25

This second milestone focuses on designing a web environment that serves as a user interface and allows us to use our models.

The activities carried out in this milestone are:

  • Website Demo Development
  • Website Frontend Design
  • Combination of Two Models
  • Website backend development
  • Website Improvement

Milestone 3 Models test

Delivery date December 14

The ability of our models to search for and generate images is undoubtedly fundamental to the project, but it is not the only requirement: the generated image must match what is sought and represent the culture from a realistic perspective. Analyzing the quality of the models is therefore essential; this planning, testing, and analysis of the results determines whether the models are viable for the proposed objective.

Among the activities required to validate the operation, quality and precision of the models we find:

  • Search Benchmark
  • Generation Experiment
  • Model Coherence Experiment

Milestone 4 Final deliveries

Delivery date December 19

This final milestone represents the final deliverables of the project, among which we find:

  • Wiki page documentation
  • Github code report
  • Presentation speech
  • PPTX

Among the final deliverables we produced as a team, three main elements stand out: the models for both search and image generation; the website and user environment, fully functional and linking the backend and frontend; and the first data collection, generated manually to ensure the quality of the project.

Methodology

Deep Learning Models

Image 3: ImageBind

ImageBind

ImageBind, released in May 2023 by Meta Research, is an embedding model that combines data from six modalities: images and video, text, audio, thermal imaging, depth, and IMUs, which contain sensors including accelerometers and orientation monitors. Using ImageBind, you can provide data in one modality – for example, audio – and find related documents in different modalities, such as video.

Through ImageBind, Meta Research has shown that data from many modalities can be combined in the same embedding space, allowing richer embeddings. This contrasts with previous approaches, where an embedding space typically included data from only one or two modalities.

ImageBind follows a series of pioneering open-source computer-vision models released by Meta Research, including the Segment Anything Model, which set a new standard for zero-shot image segmentation, and DINOv2, another zero-shot computer vision model.
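
To make this concrete, the following is a minimal sketch of how ImageBind maps several modalities into one embedding space, based on the usage example in the public facebookresearch/ImageBind repository; the file names are placeholders and exact module paths may differ between versions.

```python
# Minimal ImageBind sketch: embed text, an image, and an audio clip into the same
# latent space and compare them. File names ("erhu.jpg", "erhu.wav") are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["an erhu", "a violin"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["erhu.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["erhu.wav"], device),
}
with torch.no_grad():
    emb = model(inputs)  # one embedding per input, all in the same space

# Cross-modal similarity: which text best describes the image / the audio?
print(torch.softmax(emb[ModalityType.VISION] @ emb[ModalityType.TEXT].T, dim=-1))
print(torch.softmax(emb[ModalityType.AUDIO] @ emb[ModalityType.TEXT].T, dim=-1))
```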

Image 4: Stable Diffusion

Stable Diffusion

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. It is considered to be a part of the ongoing AI spring.

It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. It was developed by researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway with a compute donation by Stability AI and training data from non-profit organizations.

Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been open sourced, and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney which were accessible only via cloud services.
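
As an illustration of this accessibility, here is a minimal text-to-image sketch assuming the Hugging Face diffusers library and the public Stable Diffusion 2.1 weights; the model id and prompt are examples only.

```python
# Minimal text-to-image sketch with open-source Stable Diffusion weights via diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")  # fits on a consumer GPU

image = pipe("a traditional Chinese landscape painting of mountains and rivers").images[0]
image.save("landscape.png")
```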

Combination

Our model is the combination of ImageBind and Stable Diffusion: more specifically, we use the unified latent space of ImageBind and the inference part of Stable-Diffusion-2-1-unclip. ImageBind transforms multimodal inputs (text, image, and audio) into a single latent space, and the embeddings from this space are then fed into Stable Diffusion to generate images. Because ImageBind's vision and text encoders are built on the same OpenCLIP model whose image embeddings condition Stable-Diffusion-2-1-unclip, the two models share an (approximately) common latent space, which is why we can treat the output of ImageBind directly as the input of Stable Diffusion. The resulting system is, to our knowledge, among the more capable open-source multimodal search and generation pipelines that can run on ordinary consumer computers.
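
The sketch below shows one way this combination can be wired together, using the facebookresearch/ImageBind package and the diffusers StableUnCLIPImg2ImgPipeline; the image_embeds argument and the audio file are assumptions based on the public APIs and may need adjusting (for example, some implementations rescale audio embeddings before generation).

```python
# Sketch of the ImageBind -> Stable-Diffusion-2-1-unclip pipeline described above.
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda"

# 1. Encode an audio clip (or text / image) into the shared latent space.
bind = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)
inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(["erhu.wav"], device)}
with torch.no_grad():
    audio_embed = bind(inputs)[ModalityType.AUDIO]  # (1, 1024), same size as CLIP ViT-H

# 2. Feed the embedding to the unCLIP variant of Stable Diffusion as image
#    conditioning, bypassing its own CLIP image encoder.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
image = pipe(image_embeds=audio_embed.half(), prompt="").images[0]
image.save("generated_from_audio.png")
```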

Dataset Build

The dataset was generated manually, focusing on four areas chosen to analyze different cultural perspectives of the world. Representing cultural perspectives here means recognizing and accurately depicting cultural symbols, costumes, architecture, and other elements unique to various cultures.

The dataset has over 300 samples that examine different cultures along two branches: "Cultures around the world", focused on geographic locations, and "Civilizations through time", focused on ancient cultures and on how temporality affects cultural perspective.

Here is the access to the dataset: https://drive.google.com/drive/folders/1HVIvpNfguROwW224Fko3LD4m3nJP6-FC?usp=sharing

Content

The content of our dataset focuses on 4 areas of analysis: instruments, clothing, paintings, and food, with the content of each of these being as follows:

  • Instruments:
    • China:
      • Stringed instruments: guzheng, pipa, erhu, gaohu, harp, dulcimer.
      • Wind instruments: flute, xiao, suona.
      • Percussion instruments: gongs, drums, and wooden fish.
    • Europe:
      • Stringed instruments: violin, cello, guitar, harp.
      • Woodwind instruments: flute, clarinet, oboe.
      • Brass instruments: trumpet, trombone, French horn.
      • Keyboard instruments: piano, organ.
    • America:
      • Stringed instruments: chalapatita, bass, cuatro.
      • Wind instruments: pan flute, saxophone.
      • Percussion instruments: conga drums, marimba.
  • Clothing:
    • Historical period
      • Ancient costumes: Egyptian, Greek, and Roman costumes.
      • Medieval costumes
      • Renaissance costumes
      • 17th to 19th century costumes
      • 20th century clothing
    • Geographical location
      • Asian costumes
      • African clothing
      • European clothing
      • Arab and Middle Eastern clothing
      • Costumes of the Americas
  • Painting
    • Artistic trends
      • Renaissance
      • Baroque
      • Rococo
      • Romanticism
      • Impressionism
      • Post-impressionism
      • Modern and contemporary art
    • Drawing
    • Chinese paintings
      • Landscape
      • Flowers and birds
      • Characters
  • Food
    • Asia
    • America
    • Europe
    • Middle East

Collection

For the development of each sample of our dataset we looked for at least 2 of the following elements:

  • Audio
  • Image
  • Text

These elements are paired with one another, and each sample is designed to support a later analysis of the quality and accuracy of our model.
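
For illustration only, a single sample in such a dataset can be thought of as a small record pairing the available modalities with its cultural labels; the field names and paths below are hypothetical and do not describe the actual layout of our Drive folder.

```python
# Hypothetical structure of one multimodal dataset sample; at least two of the
# modality fields (image, audio, text) are present for every sample.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CultureSample:
    segment: str                    # "instruments", "clothing", "paintings" or "food"
    region_or_period: str           # e.g. "China" or "Renaissance"
    label: str                      # e.g. "erhu"
    image_path: Optional[str] = None
    audio_path: Optional[str] = None
    text: Optional[str] = None

sample = CultureSample(
    segment="instruments",
    region_or_period="China",
    label="erhu",
    image_path="instruments/china/erhu.jpg",   # placeholder path
    audio_path="instruments/china/erhu.wav",   # placeholder path
)
```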

Benchmark and Experiment

To ensure the quality and verify the operation of our models, we have carried out different benchmark processes to determine the precision and quality of the models.

Search Benchmark

Approach

ImageBind can transform different kinds of modalities into embeddings in the same latent space. The more closely related two pieces of information are, the closer they lie in this latent space and the more similar their embeddings. For example, given a picture of a dog, an audio clip of a dog barking, and the text 'a beautiful city', the first two will be closer in the space than the latter two. With the help of ImageBind, we can therefore directly search for related pictures using sound, text, or pictures, without retraining the model.

In this part, we use ImageBind as a search engine for our culture dataset, in order to test whether the search side of our model has some ability to understand culture. In detail, we divide our culture dataset into four parts: paintings, clothing from different historical periods, musical instruments, and food and clothing from different regions. For each part, we visualize the distribution of the multimodal embeddings in the latent space and compute the search accuracy to measure its capabilities.
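
A minimal sketch of the retrieval step is given below: dataset items are ranked by cosine similarity to a query embedding. The embeddings are assumed to come from ImageBind as sketched earlier, and the accuracy helper is illustrative rather than our exact evaluation code.

```python
# Rank dataset items by cosine similarity to a query embedding and compute a
# simple top-1 accuracy over a labelled query set (labels are illustrative).
import torch
import torch.nn.functional as F

def search(query_embed: torch.Tensor, gallery_embeds: torch.Tensor, top_k: int = 5):
    """query_embed: (D,), gallery_embeds: (N, D); returns indices of the top_k matches."""
    sims = F.cosine_similarity(query_embed.unsqueeze(0), gallery_embeds, dim=-1)  # (N,)
    return sims.topk(top_k).indices.tolist()

def top1_accuracy(query_embeds, query_labels, gallery_embeds, gallery_labels):
    hits = 0
    for q, label in zip(query_embeds, query_labels):
        best = search(q, gallery_embeds, top_k=1)[0]
        hits += int(gallery_labels[best] == label)
    return hits / len(query_labels)
```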

Result Analysis

  • Paintings
Image 5: Distribution in Latent Space
Image 6: Search Result
In this part, we want to check whether our model can distinguish between different types of paintings. During the search phase, almost all searches found the correct category; only the search with the text 'a sketch' returned incorrect results. At the same time, looking at Image 5, we can see that the search side of our model can distinguish text from images (the squares and circles form two groups) and can also distinguish between different categories of oil paintings (circles of the same color cluster together), which indicates that the search side of our model has a certain cultural awareness in painting.
  • Clothing in different historical period
Image 7: Distribution in Latent Space
Image 8: Search Result
In this part, we want to check whether our model can distinguish clothing from different historical periods. Looking at Image 8, we see that although the search accuracy is not bad (three out of four searches return correctly), in Image 7 on the left the circles of the same color do not cluster together well, indicating that the search side of the model is not ideal at understanding clothing from different historical periods.
  • Instruments
Image 9: Histogram of Search Result
In this part, we want to check whether our model can distinguish between different types of instruments. What is special here is that the samples are audio-image pairs: we use the audio to search for the relevant images. Note that each picture corresponding to an audio clip has three attributes, namely the region (South America, China, Europe), the type (string, percussion, woodwind, keyboard, and so on), and the name (harp, dulcimer, suona, flute, and so on) of the instrument. Looking at Image 9 above, in about half of the searches the picture of the corresponding instrument is found directly through the audio, and more than 80% of the searches return the correct type and region. Although the search accuracy is not as high as when searching for images with text, given the difficulty of the task we can credit the model with good capabilities in instrument perception.
  • Food and Clothing in different area
Image 10: Distribution in Latent Space
Image 11: Search Result
In this part, we want to check whether our model can distinguish food and clothing from different areas when they appear together. What is special here is that two kinds of items exist in the dataset, namely food and clothing, and they also come from different areas. Looking at Image 11, we see that the model achieves good search accuracy with text, but it easily confuses America and Europe (the results show that the model mistook European-style clothes for American clothes and vice versa). We believe this is because the two regions have a certain cultural connection. Besides, according to Image 10, the model has better regional perception for food than for clothing (squares of the same color cluster more significantly).

Generation Experiment

Approach

The standard approach in machine learning is to evaluate a system on a set of standard benchmark datasets, ensuring that they cover a range of tasks and domains. This method, however, may not be fully suitable for evaluating the capabilities of generative AI models, due to their vast training data and their ability to perform tasks beyond the typical scope of narrow AI systems. Since we do not have access to the full details of their vast training data, we have to assume that they have potentially seen every existing benchmark, or at least some similar data. The second reason for going beyond traditional benchmarks is probably more significant: one of the key aspects of generative AI models' intelligence is their generality, the ability to seemingly understand and connect any topic and to perform tasks that go beyond the typical scope of narrow AI systems. Benchmarks for such generative or interactive tasks can be designed too, but the metric of evaluation becomes a challenge (see e.g. [PSZ+21] for some recent progress in this active research area in NLP).

So, to design an experiment that tests multimodal generation models like ImageBind and Stable Diffusion for their understanding of cultural elements, we take inspiration from the approach used to study GPT-4's intelligence, as detailed in the paper "Sparks of Artificial General Intelligence". In this experiment, we tested various themes including food, instruments, clothing, and painting, using text+image and audio+image as inputs.
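
As a hedged illustration of how these two input configurations can be set up with the pipeline sketched in the Combination section: the prompts, file names, and the 50/50 embedding mix below are examples, not the exact settings of our runs.

```python
# Illustrative conditioning setups for the generation experiment; prompts, file
# names and the 50/50 embedding mix are examples only.
import torch
from PIL import Image
from diffusers import StableUnCLIPImg2ImgPipeline
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda"
bind = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)

# (a) Image + text: the input image drives the unCLIP image conditioning,
#     while the text prompt drives the usual text conditioning.
dinner = Image.open("family_dinner.jpg")  # placeholder input image
out_a = pipe(image=dinner, prompt="Korean families enjoying fried chicken and kimchi").images[0]

# (b) Audio + image: encode both with ImageBind and mix them in the shared
#     latent space before passing the result as image conditioning.
inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(["erhu.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["erhu.wav"], device),
}
with torch.no_grad():
    emb = bind(inputs)
mixed = 0.5 * emb[ModalityType.VISION] + 0.5 * emb[ModalityType.AUDIO]  # example weights
out_b = pipe(image_embeds=mixed.half(), prompt="").images[0]
```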

Result Analysis

  • Food
Input: Image+Text
    • Indian family dinner (input image) with text: "Indian family gathering eating traditional Indian cuisine. Introduce elements like a plate of Basmati rice, a bowl of rich, creamy Dal Makhani, vibrant vegetable Biryani, and freshly baked Naan bread. Decorate with garnishes like coriander leaves and slices of lemon."
    • Japanese family dinner (input image) with text: "Japanese family gathering eating traditional Japanese cuisine. Introduce elements like a plate of sushi and a bowl of ramen."
    • Mexican family dinner (input image) with text: "Envision a lively Mexican family gathering, embracing the rich culinary traditions of Mexico."
    • Korean family dinner (input image) with text: "Korean families are sitting at tables, enjoying crispy Korean fried chicken and various side dishes like kimchi and pickled radishes."
    • Chinese family dinner (input image) with text: "A warm and festive scene of a Chinese family gathered around a table for a Chinese New Year celebration. The family is enjoying a hotpot meal. There is a hotpot on the table."

The model demonstrates a foundational grasp of the distinctions among various traditional cuisines, capturing the general essence of each. However, it lacks clarity in the finer details.
  • Indian: The model depicts an Indian family gathering to enjoy a traditional meal. While the facial features of the individuals are not accurately represented, resulting in a weird appearance, the model successfully portrays recognizable elements of traditional Indian cuisine, such as curry and baked pancakes.
  • Japanese: The sushi and sashimi in the image are not clearly defined but are identifiable as such.
  • Mexican: The image features baked corn, identifiable as part of traditional Mexican cuisine. However, Diego and Max pointed out that corn is not actually eaten this way.
  • Korean: Despite the weird facial features, the image successfully represents Korean cuisine through recognizable elements like fried chicken and kimchi.
  • Chinese: Although the hot pot is not depicted precisely, the image shows the scene of Chinese people gathering for a hot pot meal during the New Year celebration.


  • Painting
Input: Image+Text
    • Input 1: Lang Shining's Flower and Bird Album (input image) with text: "An oil painting"
    • Input 2: Monet's Water Lilies (input image) with text: "A Chinese traditional freehand brushwork landscape painting"
    • Input 3: Monet's Water Lilies (input image) with text: "A rococo style oil painting"

Overall, the model displays a rudimentary ability to discern oil paintings from Chinese traditional paintings, yet it needs to improve at accurately separating the genres within oil paintings.
  • Oil Painting: The model can somewhat distinguish between oil paintings and traditional Chinese paintings, as seen in its attempt to replicate oil painting textures and colors. However, it struggles to differentiate between oil painting genres, blending them with Chinese art's distinct outlines and spatial awareness. This creates a mixed style, indicating the model understands various artistic mediums but lacks precision in defining specific oil painting genres.
  • Chinese traditional freehand brushwork landscape painting: The image combines Monet's Impressionism with Chinese brushwork, reflecting a mix of Monet's color palette and the defined outlines typical of Chinese painting. This blend indicates the model's grasp of both styles but also its difficulty in separating them distinctly.
  • Rococo style oil painting: The third image, aiming to combine Monet's Impressionism with Rococo elements, mainly displays Monet's influence. The lily pads and colors align with his style, but the Rococo's decorative aspects are subdued.
  • Instrument
Input: Image+Audio
    • Bird: bird image + bird audio
    • Suona: suona image + suona audio
    • Flute: flute image + flute audio
    • Drums: drums image + drums audio
    • Erhu: erhu image + erhu audio
    • Piano: piano image + piano audio
    • Cello: cello image + cello audio
    • Conga: conga image + conga audio


The model exhibits a notable limitation in recognizing the sounds of different musical instruments. While it can identify and visually represent bird calls by generating images of birds, it cannot discern and visually translate the sounds of musical instruments. This suggests that the model's capacity to process and correlate audio input with appropriate visual output is currently restricted and needs further development, especially when the auditory stimuli are complex, such as those from musical instruments. The challenge lies in the model's interpretive capability to integrate and coherently express varied types of sensory data, particularly auditory, into a corresponding visual representation.
  • Clothing
Input: Image+Text
    • Chinese Qing Dynasty clothing 1: input image of Chinese traditional clothing, text "a Chinese emperor of the Qing Dynasty wearing Qing clothing"
    • Chinese Qing Dynasty clothing 2: input image of Chinese traditional clothing, text "clothing of a Chinese emperor of the Qing Dynasty"
    • Chinese Qing Dynasty clothing 3: input image of Chinese traditional clothing, text "a cloth during China's Qing Dynasty"
    • Korean Hanbok: input image of Korean clothing, text "Korean traditional clothing"
    • Japanese kimono: input image of Japanese clothing, text "Japanese traditional clothing"
    • Mexican: input image of Mexican clothing, text "Mexican traditional clothing"

Despite some inconsistencies, particularly with the Hanbok, the images generally show a level of cultural insight and detail in representing traditional garments.
  • Chinese: The model successfully depicts Qing dynasty clothing with its signature yellow color and detailed patterns, showing imperial connections.
  • Korean: For the Korean Hanbok, the representation is less precise, capturing some of the attire's grace but missing key design elements.
  • Japanese: The Japanese kimono is well-rendered, aligning with the traditional aesthetics for special occasions.
  • Mexican: Mexican clothing is vibrantly portrayed, reflecting the festive spirit of cultural attire.

Coherence Experiment

Approach

In this phase of the experiment, we aim to assess the coherence between image generation and retrieval capabilities of ImageBind and Stable Diffusion models. To accomplish this, we use the images generated in the generation phase as input queries in the search part, conducting searches within our curated dataset. This approach allows us to measure the models' ability to produce and recognize consistent visual content, which is essential for ensuring the reliability and applicability of such models in practical scenarios. Emphasizing coherence is crucial, as it underpins the models' effectiveness in generating contextually relevant and recognizable images that are indispensable for robust image-based search applications.

For each search of our database using an image generated by our model, a coherence value is assigned as follows:


Image 32: Coherence weighting

In this table it is key to differentiate the meaning of the values assigned to each coherence level. Four values were assigned, as shown in the table:

Value 1 is assigned when the search model returns an image that matches neither the segment used in the generation, nor the geographical location, nor the culture.

Value 2 is assigned when the retrieved image matches at the first level, i.e. it belongs to the same segment used in the generation (clothing, instruments, food, paintings) but matches neither the geographical region nor the temporality.

Value 3 is assigned when the retrieved image shares second-level coherence with the generated image: not only the same segment, but also the same geographic region or temporality. At this level the retrieved image shares the geographic location but not the specific culture or country.

Finally, value 4 is assigned when the retrieved image shares the segment, the geographic area or temporality, and the precise culture of the image used in its creation; this is the highest level of coherence between the two models.

This scale and weighting allow us to measure the level of coherence between our search system and our generation system, by checking that the products of the generation model are understood by our own search model.
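
Expressed as code, the rubric can be read as the following small function; the argument names are illustrative and are not taken from our actual logging sheet.

```python
# Hedged sketch of the 4-level coherence rubric described above.
def coherence_score(same_segment: bool, same_region_or_period: bool, same_culture: bool) -> int:
    """Map the match between a generated image and its retrieved image to a 1-4 score."""
    if not same_segment:
        return 1  # matches neither segment, region/temporality, nor culture
    if not same_region_or_period:
        return 2  # same segment only
    if not same_culture:
        return 3  # same segment and same region or temporality
    return 4      # same segment, region/temporality and precise culture
```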

Results Analysis

The experiment involved the 25 images produced in the generation experiment. We categorized the results into four cultural themes: food, clothing, instruments, and paintings, assigning a coherence level from one to four to each generation.

Image 33: Data logging


Once each test had been logged and categorized, we analyzed the count of the values for each of the four categories: food, clothing, instruments, and paintings.


Image 34: Search table for section
Image 35: Generate table for section

The resulting average per category, for both search and generation, is as follows:

Image 36: Search coherence Graph
Image 37: Generation coherence Graph

From Images 36 and 37, we discerned the coherence performance across these thematic sections. The bar graphs show the coherence scores, allowing us to calculate an average performance for each category. Our models predominantly achieved coherence levels between one and two, indicating a basic understanding of the generated content in relation to the search parameters.

The data tables provided insights into the depth of the model's representational accuracy within each thematic section. It was observed that the sections of painting and clothing yielded coherence scores above two, suggesting a more nuanced understanding and identification of the content up to a second-tier level, such as the style of painting or the geographic origin of clothing.

Specifically, in the search process, the model demonstrated a better interpretation of paintings with an average score above 2.5, suggesting the model's ability to recognize up to a second level of detail. Similarly, clothing-related images were processed with a coherence value over 2.0, indicating the model's capacity to discern images of clothing up to the geographic zone from which they originate.

Limitation and Future Work

In discussing the limitations of our project, several key factors arise.

Firstly, the reliance on a manually curated dataset for testing introduces an inherent bias, as the data may not fully capture the diversity or complexity of real-world scenarios. Therefore, the outcomes we've observed, while informative, should be viewed with an understanding that they may not align perfectly with more diverse or complex situations.

Secondly, the manually collected cultural dataset is still small, which means we could only conduct relatively simple tests rather than more in-depth ones, and the conclusions drawn therefore carry a certain degree of randomness.

Thirdly, when examining the outputs generated by the model, it is clear that although it has a general grasp of cultural themes across food, clothing, and art, it struggles significantly with the recognition of musical instruments from audio inputs. Furthermore, the detail in the imagery produced is often lacking, resulting in peculiarities that detract from the authenticity of the representation.

Lastly, the open-source models used in our project, while robust, do not match the scale of more advanced, commercially available multimodal models such as GPT-4. Coupled with our computational resource constraints, this gap underscores the challenge of reaching the standards set by state-of-the-art technologies. Therefore, while our model demonstrates promising capabilities, it cannot yet compare with the more sophisticated multimodal systems in the field.

Github Report

https://github.com/HAOTIAN89/A-useful-generateAI-tool-for-DH

References

[1] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.

[2] Girdhar, Rohit, et al. "Imagebind: One embedding space to bind them all." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[3] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.