Terzani online museum

From FDHwiki
Revision as of 21:41, 13 December 2020

Introduction

The Terzani online museum is a student course project created in the context of the DH-405 course. From the archive of digitized photos taken by the journalist and writer Tiziano Terzani, we created a semi-generic method to transform IIIF manifests of photographs into a web application. Through this platform, the users can easily navigate the photos based on their location, filter them based on their content, or look for similar images to the one they upload. Moreover, the colorization of greyscale photos is also possible.

The Web application is available following this link.

Photograph of Tiziano Terzani

Motivation

Many inventions in human history have set the course for the future, especially those that helped people pass on their knowledge. Storytelling is an essential part of the human journey. From family pictures to the exploration of deep space, stories form connections. Historically, tales were mostly transmitted orally. This tradition slowly gave way to writing, which also kept evolving. Each method of transmission has influenced its time and the way historians perceive it. In that context, the 19th-century film camera transformed how stories were shared. For the first time in history, scenes could be accurately and instantaneously captured. Moreover, this invention soon became accessible to a large public as its production cost fell. As a result, an abundance of photographs was taken throughout the 20th century. However, today, the vast majority of them lie in drawers or archives, at risk of being damaged or destroyed. One way to preserve this knowledge is to digitize it, but this alone neither revives it nor gives it the importance it deserves. Therefore, the aim of this project is to create a medium for these large collections of photos so that anyone, anywhere, can easily access them for research purposes or simply to explore a different time.

Our work specifically focuses on Tiziano Terzani, an Italian journalist and writer. During the second half of the 20th century, he traveled extensively in East Asia[1] and witnessed many important events. He and his team captured pictures of immense historical value. The Cini Foundation digitized some of his photos[2]. However, the foundation did not organise them, which makes navigating the collection tedious. Thus, we created a web application that facilitates access to Terzani's photo archive.

Description of the realization

The Terzani Online Museum is a web application with multiple features allowing users to navigate through Terzani's photo collections. The different pages of the website described below are accessible on the top navigation bar of the website.

Terzani Online Museum Home

Home

The home page welcomes the users to the website. It invites them to read about Terzani or to learn about the project on the about page.

About

The about page describes the website's features to the visitors. It guides them through the usage of the gallery, landmarks, text queries, and image queries.

Terzani Online Museum Gallery

Gallery

The gallery allows users to quickly and easily explore photo collections of specific countries. On the website's gallery, users can find a world map centered on Asia. On top of this map, a red overlay shows the countries for which photo collections are available. Clicking on any country displays the associated photos on the right side of the page. By clicking on an image, users can open a modal window with the full-size photo (the gallery itself shows cropped versions) alongside its IIIF annotation. An option to colorize the images is also available on the modal window.

The gallery also lets users see at a glance the famous landmarks present in the photographs. Clicking on the Show landmarks button displays markers at the locations of the landmarks. Clicking on a marker opens a small pop-up with the location name and a button that shows the photos of that landmark.


Search

On the Terzani Online Museum's search page, users can explore the photographs depending on what they depict. The requests can either be made by text, or by image. The search results are displayed similarly to the gallery page.

Text queries

Users are invited to write the content they are looking for in the Terzani photo collections. This content can correspond to multiple things. It can be general labels associated with the photographs, specific localized objects in the image, or it can be recognized text from the photos.

Below the text field, users can select two additional parameters to tune their queries. Only show bounding boxes restricts the results to the localized objects and crops their display around them. Search for exact word constrains the search to precise matches of the input, so that results generated by possible combinations of words are not displayed.

Side by side an original image from the Terzani archive and its automatically colorized version

Image queries

Users can also upload an image from their device and obtain the 20 most similar pictures from all collections.


Photo colorization

To breathe life into the photo collections, we implemented a colorization feature. When users click on a photo and on the Colorize button, a new window displays the automatically colorized picture.

Note: this feature is currently disabled on the website because the server lacks a GPU.

Methods

Data Processing

Acquiring IIIF annotations

As the IIIF annotations of photographs form the basis of the project, the first step is to collect them. The Terzani archive is available on the Cini Foundation server[3]. However, it does not provide an API to download the IIIF manifests of the collections. Therefore, we use Python's Beautiful Soup module to read the root page of the archive[4] and to extract all collection IDs. Using the collected IDs, we obtain the corresponding IIIF manifest of the collection using urllib. We can then read these manifests and only keep the annotations of photographs whose label explicitly states that it represents its front side.
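
The filtering step described above can be sketched as follows. This is a minimal illustration, assuming the manifests follow the IIIF Presentation 2 layout (a `sequences` list, each holding `canvases`). The marker string `"recto"`, the function name, and the sample manifest are assumptions made for the example, since the exact wording of the labels in the Terzani archive is not stated here.

```python
import json


def front_side_canvases(manifest, front_marker="recto"):
    """Keep only the canvases whose label mentions the front-side marker.

    front_marker is a hypothetical value; adapt it to the real labels.
    """
    kept = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            label = str(canvas.get("label", ""))
            if front_marker.lower() in label.lower():
                kept.append(canvas)
    return kept


# Tiny hand-written manifest used only to exercise the function.
sample_manifest = json.loads("""
{
  "sequences": [
    {"canvases": [
      {"@id": "canvas/1", "label": "Photo 1 - Recto"},
      {"@id": "canvas/2", "label": "Photo 1 - Verso"}
    ]}
  ]
}
""")

front_canvases = front_side_canvases(sample_manifest)
```

In the project, the manifest would be downloaded with urllib and parsed with `json.loads` before being passed to such a function.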

As we want to display the photos in a gallery sorted by country, we need to associate each IIIF annotation with the photo's origin. This information is available on the root page of the Terzani archive[5], as the collections' names take after their origin. As these names are written in Italian and are not all formatted the same, we manually map each photo collection to its country. In this process, we ignored the collections that have multiple country names.

Annotating the Photographs

Once in possession of all the photographs' IIIF annotations, we annotate them using Google Cloud Vision. This tool provides a Python API with a myriad of annotation features. For the scope of this project, we decided to use the following:

  • Object localization: Detects which objects are on the image with their bounding box.
  • Text detection: Detects text on the image (OCR) along with its bounding box.
  • Label detection: Provides general labels to the whole image.
  • Landmark detection: Provides the name of the place and its coordinates if the image contains a famous landmark.
  • Web detection: Searches if the same photo is on the web and returns its references alongside a description. We make use of this description as an additional label for the whole image.
  • Logo detection: Detects any (famous) product logos within an image along with a bounding box.

For each IIIF annotation, we first read the image data into byte format and then use the Google Vision API to get the additional annotations. However, some of the information returned by the API cannot be used as it is. We processed the bounding boxes and all texts in the following way:

  • Bounding boxes: To be able to display the bounding box with the IIIF format, we need its top-left corner coordinates as well as its width and height. For the OCR text, logo, and landmark detection, the coordinates of the bounding box are relative to the image, and thus we can use them directly.
    • As for object localization, the API normalizes the bounding box coordinates between 0 and 1. The width and height of the photo are present in its IIIF annotation, which allows us to "de-normalize" the coordinates.
  • Texts: The Google API returns text in English for the various detections and in other identified languages for OCR text detection. To improve the search results, we add tags produced by the following cleansing steps alongside the original annotation returned by the API.
    • Lower Case: Converts all the characters in the text to lowercase
    • Tokens: Converts the strings into words using nltk word tokenizer.
    • Punctuation: Removes all word punctuation.
    • Stem: Converts the words into their stem form using the porter stemmer from nltk.
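
The text cleansing steps above can be sketched as follows. To keep the example free of external dependencies, a regular expression stands in for the nltk word tokenizer and the stemmer is passed in as a function. Both choices are assumptions made for illustration; in the project, the `stem` method of nltk's `PorterStemmer` would play the role of the `stem` argument.

```python
import re


def clean_text(text, stem=lambda word: word):
    """Lowercase, tokenize, drop punctuation, then stem each word.

    stem is a pluggable function; the default leaves words unchanged.
    """
    lowered = text.lower()
    # Split into runs of word characters and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    # Keep only the tokens made of word characters.
    words = [token for token in tokens if re.fullmatch(r"\w+", token)]
    return [stem(word) for word in words]


tags = clean_text("Buddhist Temple, Bangkok!")
```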

We then store the annotations and bounding box information together in JSON format.
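
As an illustration of the bounding box processing, the sketch below de-normalizes the vertices of a localized object and serializes the result as JSON. The vertices are represented as plain dictionaries with `x` and `y` keys between 0 and 1, and the record field names (`label`, `bbox`) are hypothetical rather than the project's actual schema.

```python
import json


def denormalize_box(vertices, image_width, image_height):
    """Turn normalized polygon vertices into top-left x, y, width, height in pixels."""
    xs = [vertex["x"] * image_width for vertex in vertices]
    ys = [vertex["y"] * image_height for vertex in vertices]
    x, y = round(min(xs)), round(min(ys))
    return {
        "x": x,
        "y": y,
        "width": round(max(xs)) - x,
        "height": round(max(ys)) - y,
    }


# Hand-written example: an object covering the lower middle of a 1000x800 photo.
normalized = [
    {"x": 0.25, "y": 0.5},
    {"x": 0.75, "y": 0.5},
    {"x": 0.75, "y": 1.0},
    {"x": 0.25, "y": 1.0},
]
box = denormalize_box(normalized, image_width=1000, image_height=800)
record = json.dumps({"label": "dog", "bbox": box})
```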

Photo feature vector

A general CNN [1]

The feature vector of a photograph is used in the search for similar images. For each photo in the collection, we generate a 512-dimensional vector using ResNet to represent the image. The feature vector, which is the output of a Convolutional Neural Network, is a numeric representation of the photo. Recently, deep neural networks have been trained with great success on tasks such as classification and localization, reaching near-human performance. The hidden layers of these networks learn intermediate representations of the image, and thus they can serve as a representation of the image itself. Hence, for this project, we used a pretrained ResNet 18 to generate the feature vectors of the photo collections. We chose ResNet 18 because of its relatively small feature size. We take the feature vector from the output of the average pooling layer, where the feature learning part ends. As with the annotations, a JSON document stores the vectors.

Database

Schema of the MongoDB collections

As the data is primarily unstructured owing to the non-definitive number of tags, annotations, and bounding boxes an image can have, we use a NoSQL database and choose MongoDB due to its representation of data as documents. Using PyMongo, we created three different collections in this database.

  • Image Annotations: This is the main collection. Each object has a unique ID and contains a IIIF annotation alongside Google Vision's additional annotations that have a bounding box (object localization, landmark, and OCR).
  • Image Feature Vectors: This collection contains the mapping between the object ID and its corresponding feature vector.
  • Image Tags: This is a meta collection on top of the Image Annotations that speeds up text search queries by searching the labels and returning the IDs of the related images. It contains one object for each annotation, bounded object, landmark, and text detected by Google Vision, and each object stores a list of the IDs of the corresponding photos.
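
The Image Tags collection behaves like an inverted index. The sketch below builds such an index with plain Python dictionaries instead of MongoDB documents and shows how a text query could use it. The scoring rule (number of query words matching an image) and all names are assumptions made for illustration, not the project's actual implementation.

```python
from collections import Counter, defaultdict


def build_tag_index(image_tags):
    """Map each tag to the list of image IDs that carry it."""
    index = defaultdict(list)
    for image_id, tags in image_tags.items():
        for tag in set(tags):
            index[tag].append(image_id)
    return dict(index)


def rank_images(index, query_words):
    """Order image IDs by how many query words they match (assumed score)."""
    scores = Counter()
    for word in query_words:
        scores.update(index.get(word, []))
    return [image_id for image_id, _ in scores.most_common()]


# Hypothetical tags for three photos.
image_tags = {
    "img1": ["temple", "monk"],
    "img2": ["temple"],
    "img3": ["car"],
}
tag_index = build_tag_index(image_tags)
ranked = rank_images(tag_index, ["temple", "monk"])
```

With this layout, a query only touches the entries of the words it contains rather than scanning every image annotation.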

Website

Back-end technologies

Since the server handles the data with the same Python tools used to create it and does not have to manage complex features like authentication, we chose Flask, a Python web framework that provides the essential tools to build a web server.

The server primarily processes the users' queries. Along with bridging the client and the database, it also takes care of colorizing photos, which is computationally heavy.

Front-end technologies

To build our webpages, we use conventional HTML5 and CSS3. To make the website responsive on all kinds of devices and screen sizes, we use Twitter's CSS framework Bootstrap. The client-side programming uses JavaScript with the help of the jQuery library. Finally, for easy usage of data coming from the server, we use the Jinja2 templating language.

Gallery by country

To create the interactive map, we used the open-source JavaScript library Leaflet. To highlight the countries that Terzani visited, we used the feature that allows us to display GeoJSON on top of the map. We used GeoJSON maps to construct a document that contains the countries we mapped manually.

When the user clicks on a country, the client makes an AJAX request to the server. In turn, the server queries the database to get the IIIF annotations of pictures matching the requested country. When the client gets this information back, it uses the image links from the IIIF annotations to display them to the user. The total number of results for a given country serves to compute the number of pages required to display all of them, with each page containing 21 images. To create the pagination, we use HTML <a> tags, which, on click, make an AJAX request to the server asking for the relevant IIIF annotations.
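
The pagination arithmetic can be sketched as follows. Only the page size of 21 comes from the description above; the function names and the 1-based page numbering are assumptions.

```python
import math

IMAGES_PER_PAGE = 21  # page size stated in the text


def page_count(total_results, per_page=IMAGES_PER_PAGE):
    """Number of pages needed to display all results."""
    return math.ceil(total_results / per_page)


def page_slice(items, page, per_page=IMAGES_PER_PAGE):
    """Items shown on a given page, with pages numbered from 1."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


pages_for_100_results = page_count(100)
last_page = page_slice(list(range(100)), page=5)
```

For example, 100 results need five pages, and the fifth page holds the remaining 16 images.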

Map of landmarks

When the user clicks on the Show landmarks button, an AJAX request is made to the server asking for the name and geolocation of all landmarks in the database. With this information, we use the Leaflet library to create a marker for each unique landmark. Additionally, Leaflet allows creating a customized pop-up that opens when the marker is clicked. These pop-ups contain simple HTML with a button which, on click, queries for the IIIF annotations of the corresponding landmark.

Text queries pipeline

Search by Text

Querying photographs by text happens in multiple steps described below. The numbers correspond to the numbers on the schema on the right.

  1. The user enters their query on the search bar. The client makes a request containing the user input to the server.
  2. Upon receiving the user's text query, the server
    1. Tokenizes it into lowercase words and removes any punctuation.
      1. The words undergo stemming unless the user asked for an exact match.
    2. Queries the Image Tags collection to retrieve the image IDs corresponding to each word.
  3. The MongoDB database responds with the desired object IDs.
  4. Upon receiving the object IDs, the server
    1. Orders the images by text-matching score.
    2. Queries the Image Annotations collection to retrieve the IIIF annotations of these objects.
      1. If the user checked the Only show bounding boxes checkbox, the server also asks for the bounding boxes information.
  5. The MongoDB database responds with the desired IIIF annotations and the bounding boxes if requested.
  6. When the server gets the IIIF annotations
    1. It constructs the IIIF image URLs of all results so that the resulting image has the shape of a square.
      1. If the user requests to show the bounding boxes only, then the server creates the IIIF image URL to obtain the bounding box region of the image.
    2. The client receives the IIIF annotations and image URLs from the server.
  7. Using Jinja2, the client creates an HTML <img> for each Image URL.
    1. The image data, hosted on the Cini Foundation server, are queried using the IIIF image URLs.
  8. The Cini Foundation server answers with the image data, and the results are displayed to the user.
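
Step 6 relies on the URL syntax of the IIIF Image API, which has the form `{base}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds a square thumbnail URL and a bounding box crop URL. The base URL, the thumbnail width, and the helper names are assumptions, and the `square` region keyword requires a server that implements version 2.1 or later of the Image API.

```python
def iiif_image_url(service_base, region="square", size="300,",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble an IIIF Image API URL from its path segments."""
    base = service_base.rstrip("/")
    return f"{base}/{region}/{size}/{rotation}/{quality}.{fmt}"


def box_region(box):
    """Region segment 'x,y,w,h' for a pixel bounding box."""
    return f"{box['x']},{box['y']},{box['width']},{box['height']}"


# Hypothetical image service used only for the example.
service = "https://example.org/iiif/photo-42"
square_url = iiif_image_url(service)
crop_url = iiif_image_url(
    service,
    region=box_region({"x": 250, "y": 400, "width": 500, "height": 400}),
)
```

Because the crop is expressed in the URL, the server never has to download or process the image data itself.
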
Image queries pipeline

Search by Image

The process to query similar photographs is similar to the text queries.

  1. The user uploads an image from their computer. The client makes a request containing the data of this image to the server.
  2. Upon receiving this request, the server
    1. Computes the feature vector of the user's image using a ResNet 18, in the same way as when creating the database.
    2. It then queries the database for all feature vectors.
  3. The database answers with all feature vectors.
  4. When the server has all the feature vectors, it creates a similarity vector between the user uploaded image and all of the images returned by the database.
    1. The server obtains the similarity between the feature vectors using Cosine similarity.
    2. Then, the server selects the 20 images with the highest similarity and queries the Image Annotations collection for the IIIF annotations of the photos.
  5. The remaining steps fall in place similarly to the text search case without the bounding box requirement.
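
The similarity ranking in step 4 can be sketched in pure Python as follows. The dictionary layout mapping image IDs to vectors and the function names are assumptions, and a real implementation would more likely use a numerical library for vectors of dimension 512.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query_vector, stored_vectors, top_k=20):
    """IDs of the top_k stored vectors closest to the query vector."""
    scored = (
        (cosine_similarity(query_vector, vector), image_id)
        for image_id, vector in stored_vectors.items()
    )
    return [image_id for _, image_id in heapq.nlargest(top_k, scored)]


# Toy 2-dimensional vectors standing in for the 512-dimensional ones.
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
closest = most_similar([1.0, 0.1], stored, top_k=2)
```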

Image colorization

The tool for image colorization is called DeOldify. DeOldify uses a deep generative model trained with a technique called NoGAN to transform a black & white image into a coloured one. All details about this tool can be found on its GitHub page. When the user clicks on the Colorise photo button, the client makes a POST request to the server with the selected image URL. In turn, the server initializes a DeOldify instance which applies its precomputed model to the selected black & white image and returns a colorised version. Before this image is returned to the user, it is cached to avoid colorising the same image again.
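
The caching step can be sketched as follows. The description above does not say how the cache is implemented, so this is only one possible approach: a file cache keyed by a hash of the image URL. The `colorize` argument is a hypothetical callable that stands in for the DeOldify call and returns the colorised image as bytes.

```python
import hashlib
from pathlib import Path


def cached_colorize(image_url, colorize, cache_dir):
    """Return the path of the colorised image, computing it only once per URL."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The URL hash gives a stable, filesystem-safe file name.
    key = hashlib.sha256(image_url.encode("utf-8")).hexdigest()
    cached_file = cache_dir / f"{key}.jpg"
    if not cached_file.exists():
        cached_file.write_bytes(colorize(image_url))
    return cached_file
```

A second request for the same URL then only reads the stored file and skips the expensive model inference.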

Quality assessment

Assessing the quality of our product is rather tricky. While our project makes use of many technologies (Google Vision, DeOldify, ...), we did not train any model or modify them in any way. Thus, it is not our job to evaluate their quality. As the Terzani Online Museum is a user-centered project, we thought it made more sense for the users, not the developers, to assess its quality. We therefore gathered feedback in the form of guided and non-guided user testing. However, we still provide our own critical views on what the users cannot see, namely the data processing part.

Data Processing

In an ideal scenario, we would have liked the data processing part to be fully generic and automated. However, scraping the IIIF annotations from the Cini Foundation server requires some manual work. Indeed, the lack of an API to easily access any IIIF manifest forced us to parse the structure of the Terzani archive webpage. This means that if the structure of this page were to change, the code we wrote to scrape the IIIF annotations would be useless. Moreover, as the country of a collection is not available in the IIIF annotations, we have to set it manually from the name of the collection. The rest of the pipeline, however, is fully automated and generic. This is why we consider our method semi-generic: some manual work, namely scraping the IIIF annotations and assigning a country to each of them, has to be done before running the automated script.

Concerning the creation of new tags and annotations for the photographs using the Google Vision API, we can generally say that the results are sufficiently reliable and coherent. However, as the API doesn't provide any control over the alphabets used for OCR, we noticed that it very often fails to detect words written in Chinese, Japanese, Vietnamese, Hindi, etc. Results for English text, however, are very impressive and sometimes more precise than the human eye.

The annotation step is also very time-consuming. Because the images are not stored on Google Cloud Storage, the process cannot happen asynchronously, which leads to a large amount of waiting time. A further improvement of this project would be to make our own code asynchronous to accelerate the process. Moreover, we could also parallelize the computation of feature vectors to optimize the data processing even further.

Website user feedback

Text queries

The first feedback we got about text queries was that it was counterintuitive for them not to match the exact text input. Indeed, we originally thought it would be too strenuous for the user to search for exact words to find a match and therefore resorted to partial matches. This, however, creates unexpected results where you get photos of cathedrals when looking for cats, and photos of carvings when searching for cars. We answered that concern by adding the Search for exact words checkbox, which disables the partial matches.

Otherwise, the users were mostly happy with this feature and had fun making queries. The failing cases (e.g. the bounding boxes for "dog" also show photos of a pig and a monkey) were seen as more amusing than annoying. We asked the users to rate on a scale of 1 to 7 how relevant the results of their queries were (1 being irrelevant and 7 entirely relevant). The average relevancy score over all testers was approximately 6.2, which suggests that this feature is working well.

Image queries

Concerning the image queries, users remarked that it would be practical to display the uploaded image next to the results. We therefore decided to implement this suggested feature and to also display the selected image before making the search queries.

Users were mostly pleased with the results, though not as impressed as with the text queries. They noticed that a picture of a face results in photos of people and a picture of a house gives building pictures, but they didn't get an extremely similar photo that amazed them. We also asked them to rate the relevancy of this feature's results on a scale of 1 to 7 and got an average score of 5.8. It should be noted, however, that this feature was tested on a subset of 1000 images from the 8500 available ones. Increasing the number of potential results would also increase the chances of finding similar images.

While users did not complain about query time, the image queries take about 1-2 seconds to execute. This is due to the fact that the feature vector of the uploaded image has to be sequentially compared with all feature vectors from the database. As a further optimization, we could parallelize this computation to make it scale better and faster overall.

Gallery

Many testers noticed the same odd behaviour with the Show landmarks button. It would be more intuitive for this button to become Hide landmarks once clicked. With the current behaviour, multiple clicks on this button keep adding the same markers again and again, resulting in the shadow of the markers growing. This is an undesired behaviour that we should seek to correct in a future version.

Code Realisation and GitHub Repository

The GitHub repository of the project is at Terzani Online Museum. The project has two principal components. The first one corresponds to the creation of a database of the images with their corresponding tags, bounding boxes of objects, landmarks and text identified, and their feature vectors. The functions related to these operations are inside the folder (package) terzani and the corresponding script in the scripts folder. The second component is the website, which is in the website directory. The details of installation and usage are available on the GitHub repository.

Limitations/Scope for Improvement

While some limitations and improvements for this project have already been cited in the Quality assessment section, we still provide the complete list in this section.

  • Due to lack of time, we were unable to change the behavior of the Show landmarks button on the gallery page, which keeps on adding markers.
  • The partial matches for the text queries could be enhanced with natural language processing to avoid getting semantically different results (e.g. cathedral when looking for cat).
  • An option to search for similar photographs could be added on the modal window displaying the details of an image.
  • The confidence scores of Google Vision annotations could be used to sort the results.
  • The pagination currently shows all the page numbers. It could be reduced to a number picker instead.
  • The comparison of a feature vector with the whole feature vector database could be parallelized.
  • The creation of the database could be made asynchronous and/or parallel.

Extension idea

An idea to bring this project further would be to couple it with Terzani's writings. As Terzani has written many books and articles about Asia, it would be interesting to try to match his photographs with his texts. This way, the readers' experience would be enhanced by having a visualisation of what they are reading.

Schedule

We spent the first couple of weeks setting the scope of our project. The original idea was to only colorise the photographs from the Terzani archive, but we quickly realized that powerful software capable of doing so already existed. Therefore, we changed our goal during week 5 and made the following schedule for the Terzani online museum.

☑: Completed ☒: Partially completed ☐: Did not undertake

Timeframe Tasks
Week 5-6
  • Investigate methods to scrape images from the Cini IIIF manifest. ☑
  • Study methods and models to colorize images. ☑
Week 6-7
  • Exploring Google Vision API. ☑
  • Prototype Image colorization. ☑
  • Investigate web technologies to create a website. ☑
  • Preliminary prototyping of the website. ☑
Week 7-8
  • Designing the database. ☑
  • Script to run the Google Vision API on the images and store the results in the database. ☑
  • Develop a basic text-matching based search engine. ☑
Week 8-9
  • Prepare a midterm presentation. ☑
  • Use Google Vision's tags in the text search queries. ☑
  • Enhance website UI. ☑
Week 9-10
  • Fill the database with the photographs' feature vectors. ☑
  • Pre-process Google Vision annotations (lemmatization, tokenization, ...) to enhance search queries. ☑
  • Manually attribute a country to each photo collection. ☑
  • Create a photo gallery to explore the photo by country. ☑
Week 10-11
  • Improving the Website UI. ☑
  • Create an inverted file to process the search queries faster. ☑
  • Start looking for users to give feedback. ☑
Week 11-12
  • Create an Image-based search engine. ☑
  • Add an image colorization option to the website. ☑
  • Hosting the website. ☑
  • User feedback. ☑
Week 12-13
  • Website modifications based on user feedback. ☒
  • Code refactoring. ☑
  • Report writing. ☑
  • Exploratory Data Analysis on the Image annotation data (optional). ☐
  • Add feedback option on the website. ☑
Week 13-14
  • Report writing. ☑
  • Code refactoring. ☑

References