Terzani online museum

= Introduction =
The [[Terzani online museum]] is a student course project created in the context of the [http://fdh.epfl.ch/index.php/Main_Page DH-405 course]. Starting from the archive of digitized photos taken by the journalist and writer [https://en.wikipedia.org/wiki/Tiziano_Terzani Tiziano Terzani], we created a semi-generic method to transform IIIF manifests of photographs into a web application. Through this platform, users can navigate the photos by location, filter them by content, or retrieve images similar to one they upload. Greyscale photos can also be colorized.


The web application is available at [https://terzani-demo.dhlab.epfl.ch/ this link].
[[Image:Terzani.jpg|170px|right|thumb|Photograph of Tiziano Terzani]]
 
= Motivation =
 
Many inventions in human history have set the course for the future, particularly those that helped people pass on their knowledge. Storytelling is an essential part of the human journey. From family pictures to the exploration of deep space, stories form connections. [https://en.wikipedia.org/wiki/Smriti Historically], tales were mostly transmitted orally. This tradition slowly gave way to writing, which itself kept evolving. Each method of transmission has influenced its time and the way historians perceive it. In that context, the 19th-century film camera transformed how stories are shared: for the first time in history, scenes could be captured accurately and instantaneously. As production costs fell, photography became accessible to a large public, and photographs grew abundant throughout the 20th century. Today, the vast majority of these photographs lie in drawers or archives, at risk of being damaged or destroyed. One way to preserve this knowledge is to digitize it, but that step alone neither revives the photographs nor gives them the spotlight. This project therefore aims to create a medium for historical photo collections so that anyone, from anywhere, can easily access them for research or to explore a different time.
 
Our work focuses on Tiziano Terzani, an Italian journalist and writer. During the second half of the 20th century, he traveled extensively in East Asia[https://amzn.to/3oDXoYg] and witnessed many important events. He and his team captured pictures of immense historical value. The Cini Foundation digitized some of his [http://dl.cini.it/collections/show/1353 work], but the digitized archive lacks organization, which makes navigating the collections tedious. Our web application alleviates this by facilitating access to Terzani's photo archive in multiple ways.
 
= Description of the realization =
 
The [https://terzani-demo.dhlab.epfl.ch/ Terzani Online Museum] is a web application with multiple features that let users navigate Terzani's photo collections. The pages described below are accessible from the navigation bar at the top of the website.
[[File:terzani_home.jpg|thumbnail|400px|right|Terzani Online Museum Home]]
=== Home ===
The home page welcomes the users to the website. It invites them to read about Terzani or to learn about the project on the ''about'' page.
 
=== About ===
The ''about'' page describes the website's features to the visitors. It guides them through the usage of the gallery, landmarks, text queries, and image queries.
 
[[File:gallery.PNG|thumbnail|600px|right|Terzani Online Museum Gallery]]
=== Gallery ===
The ''gallery'' allows visitors to quickly and easily explore the photo collections of specific countries. On the website's gallery, users find a world map centered on Asia. On top of this map, a red overlay shows the countries for which photo collections are available. By clicking on any country, users can view the pictures associated with it. Clicking on an image opens a modal window with the full-size photo (the gallery itself shows cropped versions) alongside its IIIF annotation. An option to colorize the image is also available in this modal window.
 
The ''gallery'' also lets users see at a glance the famous landmarks present in the photographs. Clicking the <code>Show landmarks</code> button above the map displays markers at the landmarks' locations. Clicking a marker opens a pop-up with the location name and a button to show the photos of that landmark.
 
=== Search ===
 
On the Terzani Online Museum's search page, users can explore the photographs depending on their content. The requests can either be made by text, or by image. The search results are displayed similarly to the ''gallery'' page.
 
==== Text queries ====
Users are invited to type the content they are looking for in the Terzani collection photos into a text field. This content can match several things: general labels associated with the photographs, specific localized objects in the image, or text recognized in the photos.
Below the text field, users can set two additional parameters to tune their queries. ''Only show bounding boxes'' restricts the results to the localized objects and crops their display around them. ''Search with exact word'' constrains the search to precise matches of the input, excluding partial matches. [[File:Colorisation.png|600px|right|thumb|Side by side, an original image from the Terzani archive and its automatically colorized version]]
 
==== Image queries ====
Users can also upload an image from their device and obtain the 20 most similar pictures from all collections.
 
 
=== Photo colorization ===
 
To breathe life into the photo collections, we implemented a colorization feature. When users click on a photo and the ''Colorize'' button, a new window displays the automatically colorized picture.
 
<span style=color:red>Note: this feature is currently disabled on the website because of the lack of GPU</span>


= Methods =
== Data Processing ==
=== Acquiring IIIF annotations ===
As the IIIF annotations of photographs form the basis of the project, the first step is to collect them. The Terzani archive is available on the [http://dl.cini.it/ Cini Foundation server]. However, it does not provide an API to download the IIIF manifests of the collections. Therefore, we use Python's [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup] module to read the root page of the archive[http://dl.cini.it/collections/show/1352] and to extract the collection IDs. Using the collected IDs, we obtain the corresponding IIIF manifest of the collection using [https://docs.python.org/3/library/urllib.html urllib]. We can then read these manifests and only keep the annotations of photographs whose label explicitly states that it represents its front side.
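As a minimal sketch of this scraping step, the collection IDs can be pulled out of the root page's HTML with Beautiful Soup. Note that the link structure assumed here (`/collections/show/<id>`) is an illustration based on the archive URLs above, not the page's exact markup:

```python
from bs4 import BeautifulSoup

def extract_collection_ids(html: str) -> list:
    """Collect collection IDs from links of the form /collections/show/<id>
    (the exact link structure of the Cini page is an assumption here)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"].rsplit("/", 1)[-1]
            for a in soup.find_all("a", href=True)
            if "/collections/show/" in a["href"]]
```

Each ID can then be turned into a manifest URL and fetched with urllib, as described above.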
As we want to display the photos in a gallery sorted by country, we need to associate each IIIF annotation with the photo's country of origin. This information is available on the root page of the [http://dl.cini.it/collections/show/1352 Terzani archive], as the collections' names take after their origin. Because these names are written in Italian and are not all formatted the same, we manually map each photo collection to its country. In this process, we ignore the collections whose names contain multiple countries.
=== Annotating the Photographs===
Once in possession of all the photographs' IIIF annotations, we annotate them using [https://cloud.google.com/vision Google Cloud Vision]. This tool provides a Python API with a myriad of annotation features. For the scope of this project, we decided to use the following:
* Object localization: Detects the objects in the images along with their bounding box.
* Text detection: Recognizes text in the image alongside its bounding box.
* Label detection: Provides general labels for the whole image.
* Landmark detection: Detects and returns the name of the place and its coordinates if the image contains a famous landmark.
* Web detection: Searches if the same photo is on the web and returns its references alongside a description. We make use of this description as an additional label for the whole image.
* Logo detection: Detects any (famous) product logos within an image along with a bounding box.
For each IIIF annotation, we first read the image data into byte format and then use Google Vision's API to get the additional annotations. However, some of the information returned by the API cannot be used as is. We process bounding boxes and all texts the following way:
* Bounding boxes: To crop the photos around bounding boxes with the IIIF format, we need the top-left corner coordinates as well as the width and height. For the OCR text, logo, and landmark detection, the coordinates of the bounding box are relative to the image, so we can use them directly.
** For object localization, however, the API normalizes the bounding box coordinates between 0 and 1. The width and height of the photo are present in its IIIF annotation, which allows us to ''de-normalize'' the coordinates.


* Texts: The Google API returns text in English for the various detections and in other identified languages for OCR text detection. To improve the search results, along with the original annotation returned by the API, we also add tags obtained after some cleansing steps:
** Lower case: Converts all the characters in the text to lowercase.
** Tokens: Splits the strings into words using the ''nltk'' word tokenizer.
** Punctuation: Removes all punctuation.
** Stem: Converts the words into their stem form using the Porter stemmer from ''nltk''.
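The cleansing steps above can be sketched as follows. A simple regex tokenizer stands in for nltk's word tokenizer (and drops punctuation at the same time) to keep the example self-contained; the Porter stemmer is the one from ''nltk'':

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def make_tags(text: str) -> list:
    # Lowercase, tokenize (regex stand-in for nltk's word_tokenize,
    # which also removes punctuation here), then stem each word.
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words]
```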


We then store the annotations and bounding box information together in JSON format.


=== Photo feature vector ===


[[Image:Feature_vector.jpg|500px|right|thumb|A general CNN <ref> https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html </ref>]]
The feature vector of a photograph is used in the search for similar images. For each photo in the collection, we generate a 512-dimensional vector using a [https://en.wikipedia.org/wiki/Residual_neural_network ResNet] to represent the image. The feature vector, which is the output of a convolutional neural network, is a numeric representation of the photo. Deep neural networks trained for tasks such as classification and localization now approach near-human performance, and their hidden layers learn intermediate representations of the image, which can therefore serve as a representation of the image itself. Hence, for this project, we used a pre-trained [https://pytorch.org/hub/pytorch_vision_resnet/ ResNet-18] to generate the feature vectors of the photo collections. We chose ResNet because of its relatively small feature size. We take the feature vector as the output of the average-pooling layer, where the feature-learning part of the network ends. As with the annotations, a JSON document stores the vectors.


=== Database ===


[[Image:Database_design.png|700px|left|thumb|Schema of the MongoDB collections]]
As the data is primarily unstructured owing to the non-definitive number of tags, annotations, and bounding boxes an image can have, we use a [https://en.wikipedia.org/wiki/NoSQL NoSQL] database and choose [https://en.wikipedia.org/wiki/MongoDB MongoDB] due to its representation of data as documents. Using [https://pymongo.readthedocs.io/en/stable/ PyMongo], we created three different collections in this database.


* '''Image Annotations:''' This is the main collection. Each object has a unique ID and contains an IIIF annotation alongside Google Vision's additional annotations that have a bounding box (object localization, landmark, and OCR).
* '''Image Feature Vectors:''' This collection contains the mapping between the object ID and its corresponding feature vector.
* '''Image Tags:''' This is a meta collection on top of ''Image Annotations'' that helps process text search queries faster. It contains one object for each label, localized object, landmark, and text detected by Google Vision, and each object stores the list of IDs of the photos it corresponds to.
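As an illustration, the three collections store documents shaped roughly like this. The field names below are ours, chosen for the example, not the project's definitive schema:

```python
# Hypothetical example documents for the three MongoDB collections.
image_annotation = {
    "_id": "photo_0001",                      # unique photo ID
    "iiif": {"width": 4000, "height": 2600},  # trimmed IIIF annotation
    "objects": [{"label": "rickshaw", "bbox": [120, 80, 640, 480]}],
    "landmarks": [],
    "ocr": [],
}
image_feature_vector = {
    "_id": "photo_0001",
    "vector": [0.13, -0.52],                  # 512 floats in practice
}
image_tag = {
    "_id": "rickshaw",                        # one document per stemmed tag
    "photo_ids": ["photo_0001"],
}
```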




== Website ==
=== Back-end technologies ===


Because the server mainly hands data to the client in the form it was created in and does not have to manage complex features like authentication, we chose [https://flask.palletsprojects.com/en/1.1.x/ Flask], a Python web framework that provides the essential tools to build a web server.
[[File:Dataprocessing.png|600px|center|thumb|Dataprocessing Pipeline]]


[[File:website_architecture.png|600px|center|thumb|Website Pipeline]]
The server primarily processes the users' queries. Along with bridging the client and the database, it also takes care of colorizing photos.
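A minimal sketch of such a server; the route name and the in-memory stand-in for the MongoDB ''Image Tags'' collection are assumptions for the example, not the project's actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory stand-in for the "Image Tags" MongoDB collection.
IMAGE_TAGS = {"templ": ["photo_1", "photo_7"]}

@app.route("/search")
def search():
    # The real server stems and tokenizes the query before looking it up.
    word = request.args.get("q", "").lower()
    return jsonify(ids=IMAGE_TAGS.get(word, []))
```

The client would call this endpoint with the user's query and render the returned IDs.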


=== Front-end technologies ===


To build our web pages, we use conventional HTML5 and CSS3. To make the website responsive across devices and screen sizes, we use Twitter's CSS framework Bootstrap. The client-side programming uses JavaScript with the help of the jQuery library. Finally, for easy use of data coming from the server, we use the Jinja2 templating language.
=== Gallery by country ===
To create the interactive map, we used the open-source JavaScript library [https://leafletjs.com/ Leaflet]. To highlight the countries that Terzani visited, we used its ability to display [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] on top of the map. We used [https://geojson-maps.ash.ms/ GeoJSON maps] to build a document containing the countries we mapped manually.
 
When the user clicks on a country, the client sends a request to the server. In turn, the server queries the database for the IIIF annotations of pictures matching the requested country. When the client gets this information back, it uses the image links from the IIIF annotations to display them to the user. The total number of results for a given country determines the number of pages required to display all of them, with 21 images per page. To create the pagination, we use HTML <code><a></code> tags which, on click, request the relevant IIIF annotations from the server.
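The page count follows directly from the total number of results, for example:

```python
import math

def page_count(total_results: int, per_page: int = 21) -> int:
    # 21 images per page, as in the gallery.
    return math.ceil(total_results / per_page)
```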
 
=== Map of landmarks ===
 
When the user clicks on the <code>Show landmarks</code> button, a request is made to the server for the name and geolocation of all landmarks in the database. With this information, the ''Leaflet'' library creates a marker for each unique landmark. ''Leaflet'' also allows attaching a customized pop-up that opens when a marker is clicked. These pop-ups contain simple HTML with a button which, on click, queries the IIIF annotations of the corresponding landmark.
 
[[File:Text_search_architecture.png|400px|right|thumb|Text queries pipeline]]
=== Search by Text ===
 
Querying photographs by text happens in multiple steps described below. The numbers correspond to the numbers on the schema on the right.
 
# Users enter their query in the search bar and the client makes a request containing the input to the server.
# Upon receiving the user text query, the server tokenizes it into lower case words and removes any punctuation. The words also undergo stemming if the user did not indicate to search with an exact match. Then the server queries the <code>Image Tag</code> collection to retrieve the image IDs corresponding to each word.
# The MongoDB database responds with the requested object IDs.
# Upon receiving the object IDs, the server orders the images in the sequence of text matching score. It then queries the <code>Image Annotation</code> collection to retrieve the IIIF annotation of these objects. If the user checked the <code>Only show bounding boxes</code> checkbox, the server also asks for the bounding boxes information.
# The MongoDB database responds with the requested IIIF annotations and the bounding boxes if needed.
# When the server gets the IIIF annotations, it constructs the IIIF image URLs of all results so that the resulting image has the shape of a square. However, if the user requested to show the bounding boxes only, then the server creates the IIIF image URLs so that they will crop each photo around its bounding box. The client then receives this information from the server.
# Using Jinja2, the client creates an HTML <code><img></code> tag for each Image URL and queries the data hosted on the Cini Foundation server.
# The Cini Foundation server answers with the image data and they can be displayed to the users.
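Step 6 relies on the IIIF Image API, whose URLs encode the crop region and size in the path. A sketch of the two URL shapes (the base URL is a placeholder, and the <code>square</code> region keyword assumes an IIIF Image API 2.1+ server):

```python
def square_url(base: str, size: int = 400) -> str:
    # "square" asks the IIIF server for a centered square crop;
    # "!w,h" scales the result to fit within the given box.
    return f"{base}/square/!{size},{size}/0/default.jpg"

def bbox_url(base: str, x: int, y: int, w: int, h: int) -> str:
    # Crop the photo around a bounding box given in absolute pixels.
    return f"{base}/{x},{y},{w},{h}/full/0/default.jpg"
```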
 
[[File:Image_queries.png|400px|right|thumb|Image queries pipeline]]
 
=== Search by Image ===
 
The process for querying similar photographs resembles that of the text queries.
 
# Users upload an image from their device. The client makes a request containing the data of this image to the server.
# Upon receiving this request, the server computes the feature vector of the user's image using a ResNet-18, as described earlier. It then queries the database for all feature vectors.
# The database answers with all feature vectors.
# When the server has all the feature vectors, it creates a similarity vector between the user uploaded image and all of the images returned by the database. The server obtains the similarity between the feature vectors using [https://en.wikipedia.org/wiki/Cosine_similarity Cosine similarity]. Then, it selects the top 20 images having the highest similarity and queries the <code>Image Annotation</code> database for the corresponding IIIF of the photos.
# The remaining steps (6, 7, and 8 on the schema) fall into place similarly to the text search case, without the bounding box requirement.
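Step 4 can be sketched with NumPy; the project's exact scoring code may differ:

```python
import numpy as np

def top_k_similar(query_vec, feature_matrix, k=20):
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    m = feature_matrix / np.linalg.norm(feature_matrix, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k most similar stored images, best first.
    return np.argsort(sims)[::-1][:k]
```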
 
=== Image colorization ===
 
The tool we use for image colorization is [https://deoldify.ai/ DeOldify]. DeOldify uses a deep generative model called NoGAN to transform a black & white image into a colored one. Details about this tool are on its [https://github.com/jantic/DeOldify GitHub page].
When the user clicks on the <code>Colorise the photo</code> button, the client makes a POST request to the server with the selected image URL. In turn, the server initializes a DeOldify instance, which applies its precomputed model to the selected black & white image and returns a colorized version. Before returning this image to the client, the server caches it to avoid colorizing the same image again.
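The caching behavior can be sketched with <code>functools.lru_cache</code>, keyed on the image URL. The <code>colorize</code> stand-in below is hypothetical; the real server calls DeOldify:

```python
from functools import lru_cache

CALLS = []  # records which URLs actually reached the model

@lru_cache(maxsize=128)
def colorize(image_url: str) -> str:
    # Stand-in for the DeOldify model call (hypothetical).
    CALLS.append(image_url)
    return f"colorized:{image_url}"
```

A second request for the same URL is served from the cache without re-running the model.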


= Quality assessment =
Assessing the quality of this project is not straightforward. While the project makes use of many technologies (Google Vision, DeOldify, etc.), we did not train or modify any models, so assessing the results with model metrics is not relevant. At the same time, the Terzani Online Museum is a user-centered project that lets users go through thousands of photographs at their convenience. Hence, the assessment is based on user experience: we gathered feedback in the form of guided and non-guided user testing. Nevertheless, we also provide our critical views on what the users cannot see, namely the data processing.
=== Data Processing ===
In an ideal scenario, we would have liked the data processing to be generic and automated. The scraping of IIIF annotations from the Cini Foundation server, however, requires some manual work. Indeed, the lack of an API to easily access the IIIF manifests forced us to parse the structure of the Terzani archive webpage; if the page structure were to change, the scraping program would break. Moreover, as the country of a collection is not available in the IIIF annotations, we set it manually using the collection names. The rest of the pipeline is fully automated and generic. For these reasons, we consider that we have developed a semi-generic method, where some manual work - scraping IIIF annotations and assigning a country to each of them - has to be performed before running the automated script.


Concerning the creation of tags and annotations for the photographs using Google Vision's API, we can generally say that the results are sufficiently reliable and coherent. For text, however, the API is constrained in the languages it can detect automatically, so most of the detected text is text written in the English alphabet. Nevertheless, for the text that is visible, the API often recognizes content that is barely evident to the human eye.


Annotation through the API is a time-consuming step. As we do not store any of the images on Google Cloud Storage, the process cannot happen asynchronously, which results in long lead times. A further improvement to this project - at some additional cost - would be to store the photos on cloud storage and make the program asynchronous to accelerate the process. Moreover, we could also parallelize the computation of feature vectors to optimize data processing even further.


=== Website user feedback ===


==== Text queries ====


The first feedback about text queries was that they were sometimes counter-intuitive. It is strenuous for users to search for exact words and find a match, so we resorted to regular expressions for the search queries. This, however, can return many results that are not always relevant: for instance, a search for a car or a cat can show images of a carving or a cathedral. We resolved this by adding the ''Search with exact words'' checkbox to disable partial matches.


Otherwise, users were mostly happy with this feature and had fun making queries. Failure cases, such as the bounding boxes for ''dog'' additionally returning photos of a pig or a monkey, were seen as amusing rather than annoying. We asked the users to rate the results of their queries on a scale of 1 to 7 (1 being irrelevant and 7 being entirely relevant). The average relevancy score across all testers is approximately 6.2, which suggests that this feature works well.


==== Image queries ====
As with text search, the results in this section are not measurable through a metric. We observe that results are returned based on the structures present in the source image, and they are appropriate most of the time: the engine returns faces for faces, buildings for buildings, and cars for cars. Since most of the photographs are monochromatic, the colors in the source image do not significantly aid the search.
Users were mostly pleased with the results, though not as impressed as for the text queries. This feature received an average relevancy score of 5.8. On a side note, this test website only has a subset of 1000 images from the 8500 available ones. Augmenting the search space might also augment the chances of finding similar images. The feature of showing the image uploaded by the user was under development during the testing phase for some users and they have pointed to add it.


While users did not complain about query time, image queries take about 1-2 seconds to execute. This is because the feature vector of the uploaded image has to be compared sequentially with all feature vectors from the database. As a further optimization, we could parallelize this computation to make it scale better and run faster overall.


 
 
==== Gallery ====
Many testers noticed the same unusual behavior of the <code>Show landmarks</code> button: it would be more intuitive if, once clicked, it became <code>Hide landmarks</code>. The current behavior leads users to repeatedly click the button, adding the same markers again and making the marker shadows on the map grow each time. We should correct this bug in a future version.


= Code Realisation and Github Repository =
The GitHub repository of the project is at [https://github.com/JanMaxime/terzani_online_museum ''Terzani Online Museum'']. There are two principal components. The first is the creation of a database of the images with their corresponding tags, bounding boxes of objects, landmarks, identified text, and feature vectors. The functions related to these operations are inside the ''terzani'' folder (package), with the corresponding scripts in the ''scripts'' folder. The second component is the website, located in the ''website'' directory. Installation and usage details are available on the GitHub repository.

= Limitations/Scope for Improvement =
Incorporating user feedback, we list potential improvements for a future version of the project.
 
* Due to lack of time, we could not change the behavior of the <code>Show landmarks</code> button on the ''gallery'' page, which keeps adding markers.
* Enhancing partial text matching on text queries to avoid returning semantically different results.
* Extending the similar-image search option to photos already within the collection.
* Using Google Vision's annotation confidence scores to sort the results.
* The pagination currently shows all page numbers; it could be replaced by a number picker.
* Parallelizing the comparison of feature vectors.
* Making the creation of the database asynchronous and/or parallel.
 
= Extension idea =
 
One way to take this project further would be to couple it with Terzani's writings. As Terzani wrote many books and articles about Asia, it would be interesting to match his photographs with his texts. This way, the users' experience would be enhanced by having visual support while reading.


= Schedule =  
= Schedule =  
We spent the first week setting up the scope of our project. The original idea was only to colorize the photographs from the Terzani archive, but we quickly realized that state-of-the-art software capable of doing this already exists. Therefore, we revised our goals during week 5 and made the following schedule for the [[Terzani online museum]].
☑: Completed
☒: Partially completed
☐: Did not undertake


{|class="wikitable"
!Timeframe
!Tasks
|-
|Week 5-6
|
* Investigate methods to scrape images from the Cini IIIF manifest. ☑
* Study methods and models to colorize images. ☑
|-
|Week 6-7
|
* Exploring Google Vision API. ☑
* Prototype Image colorization. ☑
* Investigate web technologies to create a website. ☑
* Preliminary prototyping of the website. ☑
|-
|Week 7-8
|
* Designing the database. ☑
* Script to run the Google Vision API on the images and store them in the database. ☑
* Develop a basic text-matching based search engine. ☑
|-
|Week 8-9
|
* Prepare a midterm presentation. ☑
* Use Google Vision's tags in the text search queries. ☑
* Enhance website UI. ☑
|-
|Week 9-10
|
* Fill database with photographs' feature vectors. ☑
* Pre-process Google Vision annotations (lemmatization, tokenization, ...) to enhance search queries. ☑
* Manually attribute a country to each photo collection. ☑
* Create a photo gallery to explore the photos by country. ☑
|-
|Week 10-11
|
* Improving the Website UI. ☑
* Create an inverted file to process the search queries faster. ☑
* Start recruiting users for feedback. ☑
|-
|Week 11-12
|
* Create an Image-based search engine. ☑
* Allow image colorization option to the website. ☑
* Hosting the website. ☑
* User feedback. ☑
|-
|Week 12-13
|
* Website modifications based on user feedback. ☒
* Code refactoring. ☑
* Report writing. ☑
* Exploratory Data Analysis on the Image annotation data (optional). ☐
* Add feedback option on the website. ☑
|-
|Week 13-14
|
* Report writing. ☑
* Code refactoring. ☑
|}
= Authors & Developers =
Maxime François Jan
Ravinithesh Reddy Annapureddy
= References =
= References =

Latest revision as of 15:14, 14 December 2020

Introduction

The Terzani online museum is a student course project created in the context of the DH-405 course. From the archive of digitized photos taken by the journalist and writer Tiziano Terzani, we created a semi-generic method to transform IIIF manifests of photographs into a web application. Through this platform, the users can easily navigate the photos based on their location, filter them based on their content, or look for similar images to the one they upload. Moreover, the colorization of greyscale photos is also possible.

The web application is available by following this link.

Photograph of Tiziano Terzani

Motivation

Many inventions in human history have set the course for the future, particularly those that helped people pass on their knowledge. Storytelling is an essential part of the human journey. From family pictures to the exploration of deep space, stories form connections. Historically, tales were mostly transmitted orally. This tradition slowly gave way to writing, which in turn kept evolving. Each method of transmission influenced its times and the way historians perceive them. In that context, the 19th-century film camera transformed how stories are shared: for the first time in history, scenes could be captured accurately and instantaneously. Photography eventually became accessible to a large public as production costs fell, and photographs grew abundant throughout the 20th century. Today, the vast majority of these photographs lie in drawers or archives, at risk of being damaged or destroyed. One way to preserve this knowledge is to digitize it, but that step alone might not revive the photographs nor give them the spotlight. This project therefore aims to create a medium for historical photo collections so that anyone, from anywhere, can easily access them for research or to explore a different time.

Our work focuses on Tiziano Terzani, an Italian journalist and writer. During the second half of the 20th century, he traveled extensively in East Asia[1] and witnessed many important events. He and his team captured pictures of immense historical value. The Cini Foundation digitized some of his work; however, the digitized archive lacks organization, making navigation through the collections tedious. Our web application alleviates this by facilitating access to Terzani's photo archive in multiple ways.

Description of the realization

The Terzani Online Museum is a web application with multiple features allowing users to navigate through Terzani's photo collections. The different pages of the website described below are accessible on the top navigation bar of the website.

Terzani Online Museum Home

Home

The home page welcomes the users to the website. It invites them to read about Terzani or to learn about the project on the about page.

About

The about page describes the website's features to the visitors. It guides them through the usage of the gallery, landmarks, text queries, and image queries.

Terzani Online Museum Gallery

Gallery

The gallery allows visitors to quickly and easily explore the photo collections of specific countries. On the website's gallery, users find a world map centered on Asia. On top of this map, a red overlay marks the countries for which photo collections are available. By clicking on any of these countries, users can view the pictures associated with it. Clicking on an image opens a modal window with the full-size photo - unlike the gallery, where cropped versions are shown - alongside its IIIF annotation. An option to colorize the image is also available in this modal window.

The gallery also lets users see at a glance the famous landmarks present in the photographs. To do so, users can click the Show landmarks button above the map to display markers at the landmarks' locations. Clicking a marker opens a pop-up with the location name and a button to show the photos of that landmark.

Search

On the Terzani Online Museum's search page, users can explore the photographs depending on their content. The requests can either be made by text, or by image. The search results are displayed similarly to the gallery page.

Text queries

Users are invited to type into a text field the content they are looking for in the Terzani collection photos. This content can correspond to several things: general labels associated with the photographs, specific localized objects in the image, or text recognized in the photos.

Below the text field, users can select two additional parameters to tune their queries. Only show bounding boxes restricts the results to localized objects and crops their display around them. Search with exact word constrains the search domain to match the input precisely, thereby excluding partial matches.

Side by side: an original image from the Terzani archive and its automatically colorized version

Image queries

Users can also upload an image from their device and obtain the 20 most similar pictures from all collections.


Photo colorization

To breathe life into the photo collections, we implemented a colorization feature. When users click on a photo and the Colorize button, a new window displays the automatically colorized picture.

Note: this feature is currently disabled on the website because of the lack of a GPU.

Methods

Data Processing

Acquiring IIIF annotations

As the IIIF annotations of photographs form the basis of the project, the first step is to collect them. The Terzani archive is available on the Cini Foundation server. However, it does not provide an API to download the IIIF manifests of the collections. Therefore, we use Python's Beautiful Soup module to read the root page of the archive[2] and extract the collection IDs. Using the collected IDs, we obtain the corresponding IIIF manifest of each collection with urllib. We then read these manifests and keep only the annotations of photographs whose label explicitly states that they represent the front side.
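The ID-collection step above can be sketched as follows. This is a minimal, self-contained illustration: the href pattern and the sample HTML are assumptions for the purpose of the example (the real archive markup may differ, and the project parses it with Beautiful Soup rather than a regex).

```python
import re

def extract_collection_ids(html):
    """Collect the collection IDs referenced on an archive page.

    Assumes links of the hypothetical form .../collections/show/<numeric id>;
    duplicates are dropped while preserving first-seen order.
    """
    ids = re.findall(r'/collections/show/(\d+)', html)
    return list(dict.fromkeys(ids))

# Hypothetical snippet of the archive's root page
sample = (
    '<a href="http://dl.cini.it/collections/show/1353">Vietnam</a>'
    '<a href="http://dl.cini.it/collections/show/1360">India</a>'
    '<a href="http://dl.cini.it/collections/show/1353">Vietnam</a>'
)
print(extract_collection_ids(sample))  # ['1353', '1360']
```

Each extracted ID would then be used to fetch the corresponding IIIF manifest with urllib, as described above.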

As we want to display the photos in a gallery sorted by country, we need to associate each IIIF annotation with the photo's origin. This information is available on the root page of the Terzani archive, as the collections' names take after their origin. As these names are written in Italian and are not all formatted the same, we manually map each photo collection to its country. In this process, we ignored the collections that have multiple country names.

Annotating the Photographs

Once in possession of all the photographs' IIIF annotations, we annotate them using Google Cloud Vision. This tool provides a Python API with a myriad of annotation features. For the scope of this project, we decided to use the following:

  • Object localization: Detects the objects in the images along with their bounding box.
  • Text detection: Recognizes text in the image alongside its bounding box.
  • Label detection: Provides general labels for the whole image.
  • Landmark detection: Detects and returns the name of the place and its coordinates if the image contains a famous landmark.
  • Web detection: Searches if the same photo is on the web and returns its references alongside a description. We make use of this description as an additional label for the whole image.
  • Logo detection: Detects any (famous) product logos within an image along with a bounding box.

For each IIIF annotation, we first read the image data into byte format and then use Google Vision's API to get the additional annotations. However, some of the information returned by the API cannot be used as is. We processed bounding boxes and texts in the following way:

  • Bounding boxes: To be able to crop the photos around bounding boxes with the IIIF format, we need its top-left corner coordinates as well as its width and height. For the OCR text, logo, and landmark detection, the coordinates of the bounding box are relative to the image, and thus we can use them directly.
    • As for object localization, the API normalizes the bounding box coordinates between 0 and 1. The width and height of the photo are present in its IIIF annotation, which allows us to de-normalize the coordinates.
  • Texts: Google's API returns text in English for the various detections, and in other identified languages for OCR text detection. To improve the search results, along with the original annotations returned by the API, we also add tags after performing some cleansing steps:
    • Lower Case: Converts all the characters in the text to lowercase.
    • Tokens: Converts the strings into words using nltk word tokenizer.
    • Punctuation: Removes all word punctuation.
    • Stem: Converts the words into their stem form using the porter stemmer from nltk.
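The two processing steps above can be sketched as below. The bounding-box de-normalization follows the description in the text; the tag cleansing is simplified to stay dependency-free (the project uses nltk's word tokenizer and Porter stemmer, which this sketch replaces with a plain regex tokenizer and no stemming).

```python
import re
import string

def denormalize_bbox(norm_vertices, width, height):
    """Turn Google Vision's normalized object-localization vertices
    (values in [0, 1]) into the x,y,w,h pixel region that IIIF expects."""
    xs = [v[0] * width for v in norm_vertices]
    ys = [v[1] * height for v in norm_vertices]
    x, y = min(xs), min(ys)
    return int(x), int(y), int(max(xs) - x), int(max(ys) - y)

def make_tags(text):
    """Lowercase, tokenize, and strip punctuation (stemming omitted here)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in string.punctuation]

# A 0.25-0.75 box on a 1000x800 photo becomes a 500x400 pixel region
print(denormalize_bbox([(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)], 1000, 800))
# (250, 200, 500, 400)
print(make_tags("Temple of Heaven, Beijing."))
# ['temple', 'of', 'heaven', 'beijing']
```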

We then store the annotations and bounding box information together in JSON format.

Photo feature vector

A general CNN [1]

The feature vector of a photograph is used in the search for similar images. For each photo in the collection, we generate a 512-dimensional vector using ResNet to represent the image. The feature vector, which is the output of a convolutional neural network, is a numeric representation of the photo. Recently, deep neural networks have been trained to perform tasks such as classification and localization with near-human accuracy. The hidden layers of these networks learn intermediate representations of the image, which can therefore serve as a representation of the image itself. Hence, for this project, we used a pre-trained ResNet-18 to generate the feature vectors of the photo collections. We chose ResNet because of its relatively small feature size. We take the feature vector as the output of the average pooling layer, where the feature-learning part of the network ends. As with the annotations, a JSON document stores the vectors.

Database

Schema of the MongoDB collections

As the data is primarily unstructured, owing to the variable number of tags, annotations, and bounding boxes an image can have, we use a NoSQL database and choose MongoDB for its representation of data as documents. Using PyMongo, we created three collections in this database.

  • Image Annotations: This is the main collection. Each object has a unique ID and contains an IIIF annotation alongside Google Vision's additional annotations that have a bounding box (object localization, landmark, and OCR).
  • Image Feature Vectors: This collection contains the mapping between the object ID and its corresponding feature vector.
  • Image Tags: This is a meta collection on top of Image Annotations that helps process text search queries faster. It contains one object for each label, localized object, landmark, and text detected by Google Vision, each storing a list of IDs of the photos corresponding to it.
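The three document shapes can be illustrated as below. Field names and values here are assumptions for illustration, not the project's exact schema, and the lookup is shown against plain Python lists rather than a live PyMongo collection (a real query would be along the lines of `db.image_tags.find_one({"tag": tag})`).

```python
# Illustrative document shapes for the three MongoDB collections.
image_annotation = {
    "_id": "img-001",
    "iiif": {"@id": "http://dl.cini.it/iiif/img-001", "width": 1000, "height": 800},
    "labels": ["temple", "architecture"],
    "objects": [{"name": "person", "bbox": [250, 200, 500, 400]}],
    "landmarks": [],
    "texts": [],
}

image_feature_vector = {
    "image_id": "img-001",
    "vector": [0.12, -0.07, 0.31],  # 512 floats in practice
}

image_tag = {
    "tag": "temple",
    "image_ids": ["img-001"],  # every photo carrying this tag
}

def photos_for_tag(tag_collection, tag):
    """Sketch of the Image Tags lookup used to answer text queries."""
    for doc in tag_collection:
        if doc["tag"] == tag:
            return doc["image_ids"]
    return []

print(photos_for_tag([image_tag], "temple"))  # ['img-001']
```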


Website

Back-end technologies

Because the data can be handled in the same form in which it was created, and because we do not need complex features like authentication, we chose Flask, a Python web framework that provides the essential tools to build a web server.

The server primarily processes the users' queries. Along with bridging the client and the database, it also takes care of colorizing photos.

Front-end technologies

To build our webpages, we use conventional HTML5 and CSS3. To make the website responsive on all kinds of devices and screen sizes, we use Twitter's CSS framework Bootstrap. The client-side programming uses JavaScript with the help of the jQuery library. Finally, for easy usage of data coming from the server, we use the Jinja2 templating language.

Gallery by country

To create the interactive map, we used the open-source JavaScript library Leaflet. To highlight the countries that Terzani visited, we used Leaflet's ability to display GeoJSON on top of the map, and used GeoJSON maps to construct a document containing the countries we mapped manually.

When the user clicks on a country, the client sends a request to the server. In turn, the server queries the database for the IIIF annotations of pictures matching the requested country. When the client receives this information, it uses the image links from the IIIF annotations to display the photos to the user. The total number of results for a given country determines the number of pages required to display all of them, with 21 images per page. For the pagination, we use HTML <a> tags which, on click, request the relevant IIIF annotations from the server.

Map of landmarks

When the user clicks on the Show landmarks button, a request is made to the server for the name and geolocation of every landmark in the database. With this information, the Leaflet library creates a marker for each unique landmark. Leaflet also allows a customized pop-up to open when a marker is clicked. These pop-ups contain simple HTML with a button which, on click, queries the IIIF annotations of the corresponding landmark.

Text queries pipeline

Search by Text

Querying photographs by text happens in multiple steps described below. The numbers correspond to the numbers on the schema on the right.

  1. Users enter their query in the search bar and the client makes a request containing the input to the server.
  2. Upon receiving the user text query, the server tokenizes it into lower case words and removes any punctuation. The words also undergo stemming if the user did not indicate to search with an exact match. Then the server queries the Image Tag collection to retrieve the image IDs corresponding to each word.
  3. The MongoDB database responds with the requested object IDs.
  4. Upon receiving the object IDs, the server orders the images in the sequence of text matching score. It then queries the Image Annotation collection to retrieve the IIIF annotation of these objects. If the user checked the Only show bounding boxes checkbox, the server also asks for the bounding boxes information.
  5. The MongoDB database responds with the requested IIIF annotations and the bounding boxes if needed.
  6. When the server gets the IIIF annotations, it constructs the IIIF image URLs of all results so that the resulting image has the shape of a square. However, if the user requested to show the bounding boxes only, then the server creates the IIIF image URLs so that they will crop each photo around its bounding box. The client then receives this information from the server.
  7. Using Jinja2, the client creates an HTML <img> tag for each Image URL and queries the data hosted on the Cini Foundation server.
  8. The Cini Foundation server answers with the image data and they can be displayed to the users.
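Step 6 above, building IIIF image URLs, can be sketched as follows. The `{region}/{size}/{rotation}/{quality}.{format}` path segments follow the IIIF Image API; the base identifier and the `!300,300` thumbnail size are assumptions for illustration, not necessarily the values the site uses.

```python
def iiif_url(base, region="full", size="full"):
    """Build an IIIF Image API URL from an image's service endpoint."""
    return f"{base}/{region}/{size}/0/default.jpg"

def square_region(width, height):
    """Centered square crop, expressed as an x,y,w,h IIIF region string."""
    side = min(width, height)
    x = (width - side) // 2
    y = (height - side) // 2
    return f"{x},{y},{side},{side}"

base = "http://dl.cini.it/iiif/photo123"  # hypothetical identifier

# Square result thumbnail for the gallery grid:
print(iiif_url(base, square_region(1000, 800), "!300,300"))
# http://dl.cini.it/iiif/photo123/100,0,800,800/!300,300/0/default.jpg

# Cropping around a detected bounding box instead:
print(iiif_url(base, "250,200,500,400"))
# http://dl.cini.it/iiif/photo123/250,200,500,400/full/0/default.jpg
```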
Image queries pipeline

Search by Image

The process for querying similar photographs resembles the text query pipeline.

  1. Users upload an image from their device. The client makes a request containing the data of this image to the server.
  2. Upon receiving this request, the server computes the feature vector of the user's image with ResNet-18, as described earlier. It then queries the database for all feature vectors.
  3. The database answers with all feature vectors.
  4. When the server has all the feature vectors, it computes the cosine similarity between the uploaded image's vector and each vector returned by the database. It then selects the 20 images with the highest similarity and queries the Image Annotation collection for the corresponding IIIF annotations of the photos.
  5.-8. The remaining steps proceed as in the text search case, without the bounding box requirement.
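The similarity ranking in step 4 can be sketched with NumPy as below; the vector dimensions and the top-k value are as described in the text, while the random data is only there to make the example self-contained.

```python
import numpy as np

def top_k_similar(query_vec, db_vecs, k=20):
    """Rank database feature vectors by cosine similarity to the query.

    db_vecs has shape (n_images, 512); returns the indices of the k most
    similar images, best first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                      # one cosine similarity per image
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 512))               # stand-in feature database
query = db[42] + 0.01 * rng.normal(size=512)   # near-duplicate of image 42
print(top_k_similar(query, db, k=3)[0])        # 42
```

Because this comparison is currently done against every stored vector, it is also the step the text identifies as a candidate for parallelization.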

Image colorization

The tool for image colorization is called DeOldify. DeOldify uses a deep generative model called NoGAN to transform a black & white image into a colored one. Details about this tool are on its GitHub page. When the user clicks the Colorise the photo button, the client makes a POST request to the server with the selected image URL. The server then initializes a DeOldify instance, which applies its pre-trained model to the selected black & white image and returns a colorized version. Before returning the image to the client, the server caches it to avoid colorizing the same image twice.

Quality assessment

Assessing the quality of this project is not straightforward. While the project uses many technologies (Google Vision, DeOldify, etc.), we did not train or modify any models, so assessing the result with model metrics is not relevant. At the same time, the Terzani Online Museum is a user-centered project that lets users browse thousands of photographs at their convenience. The assessment is therefore based on user experience: we gathered feedback in the form of guided and non-guided user testing. Nevertheless, we also provide our own critical view of what users cannot see, namely the data processing.

Data Processing

In an ideal scenario, we would have liked data processing to be generic and automated. The scraping of IIIF annotations from the Cini Foundation server, however, requires some manual work. Indeed, the lack of an API for the IIIF manifests forced us to parse the structure of the Terzani archive webpage; if the page structure were to change, the scraping program would break. Moreover, as the country of a collection is not available in the IIIF annotations, we set it manually from the collection names. The rest of the pipeline is fully automated and generic. We therefore consider this a semi-generic method, where some manual work - scraping the IIIF annotations and assigning a country to each collection - has to be done before running the automated scripts.

Concerning the creation of tags and annotations for the photographs using Google Vision's API, we can generally assess that the results are sufficiently reliable and coherent. For text, however, the API is constrained in the languages it can detect automatically, so most of the detected text is written in the Latin alphabet. Nevertheless, for text that is visible, the API often recognizes content that is not evident to the human eye.

Annotation through the API is a time-consuming step. As we do not store any of the images on Google Cloud Storage, the process cannot run asynchronously and incurs a long lead time. A further improvement - at some additional cost - would be to store the photos on cloud storage so that the program can run asynchronously, accelerating the process. We could also parallelize the computation of feature vectors to optimize data processing even further.

Website user feedback

Text queries

The first feedback about text queries was that they were sometimes counter-intuitive: requiring users to guess the exact words made it strenuous to find a match. We therefore resorted to regular expressions for the search queries. This, however, can return many results that are not always relevant; for instance, a search for car or cat can show images of a carving or a cathedral. We resolved this by adding the Search with exact words checkbox to disable partial matches.
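The difference between the two matching modes can be illustrated with a small sketch. The patterns below are illustrative, not the site's exact code: the exact-word mode is shown as a word-boundary regex, which is one common way to suppress the "car matches carving" behavior described above.

```python
import re

labels = ["car", "carving", "cathedral", "cat", "vintage car"]

def match(query, exact=False):
    """Return the labels hit by a partial (substring-style) or
    exact-word (word-boundary) regex search."""
    pattern = rf"\b{re.escape(query)}\b" if exact else re.escape(query)
    return [label for label in labels if re.search(pattern, label)]

print(match("car"))              # ['car', 'carving', 'vintage car']
print(match("car", exact=True))  # ['car', 'vintage car']
```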

Otherwise, users were mostly happy with this feature and had fun making queries. Failure cases, such as the bounding boxes for dog additionally returning photos of a pig or a monkey, were seen as amusing rather than annoying. We asked users to rate the results of their queries on a scale of 1 to 7 (1 being irrelevant and 7 entirely relevant). The average relevancy score across all testers is approximately 6.2, which suggests that this feature works well.

Image queries

Users were mostly pleased with the results, though not as impressed as with the text queries. This feature received an average relevancy score of 5.8. Note that the test website only holds a subset of 1,000 images out of the 8,500 available; enlarging the search space should also increase the chances of finding similar images. A preview of the uploaded image was still under development during part of the testing phase, and several testers suggested adding it.

While users did not complain about query time, the image queries take about 1-2 seconds to execute. This is because the feature vector of the uploaded image has to be compared sequentially with all feature vectors from the database. As a further optimization, we could parallelize this computation to make it scale better.


Gallery

Many testers noticed the same unusual behavior with the Show landmarks button. It would be more intuitive if, once clicked, this button became Hide landmarks. The current behavior leads users to click the button repeatedly, adding the same markers again and making the marker shadows on the map grow darker each time. We should correct this bug in a future version.

Code Realisation and Github Repository

The GitHub repository of the project is at Terzani Online Museum. There are two principal components. The first is the creation of a database of the images with their corresponding tags, bounding boxes of objects, landmarks, identified text, and feature vectors. The functions related to these operations are inside the terzani folder (package), with the corresponding scripts in the scripts folder. The second component is the website, located in the website directory. Installation and usage details are available on the GitHub repository.

Limitations/Scope for Improvement

Incorporating user feedback, we list potential improvements for a future version of the project.

  • Due to lack of time, we could not change the behavior of the Show landmarks button on the gallery page, which keeps adding markers.
  • Enhancing partial text matching on text queries to avoid returning semantically different results.
  • Extending the similar-image search option to photos already within the collection.
  • Using Google Vision's annotation confidence scores to sort the results.
  • The pagination currently shows all page numbers; it could be replaced by a number picker.
  • Parallelizing the comparison of feature vectors.
  • Making the creation of the database asynchronous and/or parallel.

Extension idea

One way to take this project further would be to couple it with Terzani's writings. As Terzani wrote many books and articles about Asia, it would be interesting to match his photographs with his texts. This way, the users' experience would be enhanced by having visual support while reading.

Schedule

We spent the first week setting up the scope of our project. The original idea was only to colorize the photographs from the Terzani archive, but we quickly realized that state-of-the-art software capable of doing this already exists. Therefore, we revised our goals during week 5 and made the following schedule for the Terzani online museum.

☑: Completed ☒: Partially completed ☐: Did not undertake

Timeframe Tasks
Week 5-6
  • Investigate methods to scrape images from the Cini IIIF manifest. ☑
  • Study methods and models to colorize images. ☑
Week 6-7
  • Exploring Google Vision API. ☑
  • Prototype Image colorization. ☑
  • Investigate web technologies to create a website. ☑
  • Preliminary prototyping of the website. ☑
Week 7-8
  • Designing the database. ☑
  • Script to run the Google Vision API on the images and store them in the database. ☑
  • Develop a basic text-matching based search engine. ☑
Week 8-9
  • Prepare a midterm presentation. ☑
  • Use Google Vision's tags in the text search queries. ☑
  • Enhance website UI. ☑
Week 9-10
  • Fill database with photographs' feature vectors. ☑
  • Pre-process Google Vision annotations (lemmatization, tokenization, ...) to enhance search queries. ☑
  • Manually attribute a country to each photo collection. ☑
  • Create a photo gallery to explore the photos by country. ☑
Week 10-11
  • Improving the Website UI. ☑
  • Create an inverted file to process the search queries faster. ☑
  • Start recruiting users for feedback. ☑
Week 11-12
  • Create an Image-based search engine. ☑
  • Allow image colorization option to the website. ☑
  • Hosting the website. ☑
  • User feedback. ☑
Week 12-13
  • Website modifications based on user feedback. ☒
  • Code refactoring. ☑
  • Report writing. ☑
  • Exploratory Data Analysis on the Image annotation data (optional). ☐
  • Add feedback option on the website. ☑
Week 13-14
  • Report writing. ☑
  • Code refactoring. ☑

Authors & Developers

Maxime François Jan

Ravinithesh Reddy Annapureddy

References