Rolandi Librettos

Introduction

The Fondo Ulderico Rolandi is one of the greatest collections of librettos (the text booklets of operas) in the world. The collection, which is in the possession of the Fondazione Cini, consists of around 32'000 librettos spanning a time period from the 16th to the 20th century. The collection is being digitized and made accessible to the public in the online archives of the Fondazione Cini, where currently 1'110 librettos are available.

Project Abstract

The Rolandi Librettos can be considered a collection of many unstructured documents, where each document describes an opera performance. Each document contains entity information about the place, the time, and the people (e.g. composer, actors) involved in the opera. In our project we want to extract as much entity information about the operas as possible. This includes information such as the title of the opera, when and in which city it was performed, who the composer was, and so on. By extracting the entity information and linking it to internal and external entities, it is possible to construct one comprehensive data set which describes the Rolandi Collection. Linking information to external entities would allow us to connect our data set to the real world; this would, for example, include linking every city name to a real place and assigning geographical coordinates (longitude and latitude) to it. Constructing such links in the data set would allow us, for example, to trace popular operas which were played several times in different places, or famous directors who directed many operas in different places. In a last step we want to construct one comprehensive end product which represents the Rolandi Collection as a whole. Thus we want to visualize the distribution of opera librettos in space and time and potentially construct indications of linking.

Planning

The draft of the project and the tasks for each week are listed below:

Weekly working plan

Week 4 (07.10)
- Evaluate which APIs to use (IIIF)
- Write a scraper to scrape IIIF manifests from the libretto website

Week 5 (14.10)
- Process the images: apply Tesseract OCR
- Extract dates and clean the dataset to create the initial DataFrame

Week 6 (21.10)
- Design and develop the initial structure for the visualization (using the dates data)
- Run a sanity check on the initial DataFrame by hand
- Match the list of cities extracted from the OCR using search techniques

Week 7 (28.10)
- Remove irrelevant backgrounds of images
- Extract age and gender from images
- Design the data model
- Extract tags, names, birth and death years from the metadata

Week 8 (04.11)
- Get coordinates for each city and translate city names
- Extract additional metadata (opera title, maestro) from the title of the libretto
- Set up the map and slider in the visualization and order by year

Week 9 (11.11)
- Add metadata information to the visualization via an information pane
- Check in with the Cini Foundation
- Prepare the wiki outline and the midterm presentation

Week 10 (18.11)
- Compile a list of musical theatres
- Get better recall and precision on the city information
- Identify composers and get performer information
- Extract corresponding information from the MediaWiki API for entities (theatres etc.)

Week 11 (25.11)
- Integrate the visualization's zoom functionality with the data pipeline to see intra-level info
- Link similar entities together (which directors performed the same play in different cities?)

Week 12 (02.12)
- Serve the website and compute performance metrics for our data analysis
- Communicate with and get feedback from the Cini Foundation
- Continuously work on the report and the presentation

Week 13 (09.12)
- Finish off the project website and work, and give a presentation of our results ⬜️



Motivation

- speed up metadata extraction

- extend the existing metadata

- reduce the existing metadata to atomic entities

- visualize the metadata in an interactive and understandable way

Realisation

Most Common Cities
Nr. City Number of Librettos
1 Venice 411
2 Rome 98
3 Reggio nell'Emilia 37
4 Bologna 29
5 Florence 26

Most Common Theaters
Nr. Theater City Number of Librettos
1 Teatro La Fenice Venice 92
2 Teatro di Sant'Angelo Venice 53
3 Teatro Giustiniani Venice 30
4 Teatro di S. Benedetto Venice 28
5 Teatro Vendramino Venice 21

Most Common Libretto Titles
Nr. Title Number of Librettos
1 La vera costanza 7
2 Il geloso in cimento 7
3 Antigona 7
4 Artaserse 7
5 La moglie capricciosa 6

Most Common Composers
Nr. Composer Number of Librettos
1 Giuseppe Foppa 14
2 Giovanni Bertati 13
3 Aurelio Aureli 9
4 Saverio Mercadante 8
5 Pietro Metastasio 6

Methodology

Our data processing pipeline conceptually consists of four steps: 1) data collection, 2) data extraction, 3) data linking, and 4) visualization. In practice, these steps can run in parallel, and the data extraction of different entities runs independently and sequentially. Furthermore, the data source from which we try to extract a given entity, which in our case is either the coperta or the extensive title of the libretto, strongly influences the chosen methodology. In the following, we describe our data processing pipeline under these different circumstances and goals.


Data Collection

First, we used a scraper to obtain the metadata and the images for each libretto from the online libretto archive of the Cini Foundation. In the IIIF framework, every object, or manifest (in our case a libretto), is one .json document which stores metadata and a link to a digitized version of the libretto. With the Python libraries BeautifulSoup and requests we downloaded those manifests and saved them locally. The manifests already contained entity information such as the publishing year and a long, extensive title description, which we extracted for each libretto.
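
As a rough illustration, the following sketch shows how such a manifest scrape could look with requests and BeautifulSoup; the listing URL and the 'manifest.json' link pattern are hypothetical placeholders, not the actual structure of the Cini archive.

```python
# Hypothetical sketch of the IIIF manifest scraping step; ARCHIVE_URL and the
# link pattern are assumptions, not the real Cini archive layout.
import json
import os

import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://archivi.cini.it/librettos"  # hypothetical listing page
OUT_DIR = "manifests"
os.makedirs(OUT_DIR, exist_ok=True)

listing = requests.get(ARCHIVE_URL, timeout=30)
soup = BeautifulSoup(listing.text, "html.parser")

# Assume each libretto entry links to a IIIF manifest ending in 'manifest.json'
manifest_urls = [a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].endswith("manifest.json")]

for i, url in enumerate(manifest_urls):
    manifest = requests.get(url, timeout=30).json()
    # Each manifest stores metadata (year, title description) and image links
    with open(os.path.join(OUT_DIR, f"libretto_{i}.json"), "w", encoding="utf-8") as f:
        json.dump(manifest, f, ensure_ascii=False)
```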

Data Extraction

After obtaining the year, an extensive title description, and a link to the digitization, we were able to extract further entity information for each libretto from two main sources: the title description and the coperte of the librettos.

Our pipeline

Data Extraction from Copertas

The coperta of a libretto corresponds to the book cover, which contains various information about the content and circumstances. A crucial feature for us was that only the coperte contained information about where the librettos were printed and distributed. Furthermore, the coperte sometimes also mentioned the names of the composers. To localize the librettos, we therefore had to extract the city information from the coperte.
First, we downloaded the coperte from the Cini online archives. The coperte are specially tagged in the IIIF manifests, thus they could be downloaded separately.
Subsequently, we made the coperte machine-readable with an optical character recognition (OCR) algorithm. For this task we chose Tesseract, which has the advantages of being easily usable as a Python plugin and having no rate limit or associated costs. This was particularly advantageous because we often reran our code or experimentally OCRed additional pages, so a rate limit would have been cumbersome. On the other hand, the OCR quality of Tesseract was at times very low, and because of this lacking quality we were not able to extract some entities.
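
A minimal sketch of the OCR step with Tesseract (via the pytesseract plugin) is shown below; the input and output folders are hypothetical and assume the coperte have already been downloaded as images.

```python
# Sketch of the OCR step; folder paths are hypothetical placeholders.
from pathlib import Path

import pytesseract
from PIL import Image

COPERTE_DIR = Path("data/coperte")       # downloaded coperta images (assumed)
TEXT_DIR = Path("data/coperte_text")     # where the OCR output is written
TEXT_DIR.mkdir(parents=True, exist_ok=True)

for image_path in sorted(COPERTE_DIR.glob("*.jpg")):
    # 'ita' selects the Italian language data shipped with Tesseract
    text = pytesseract.image_to_string(Image.open(image_path), lang="ita")
    (TEXT_DIR / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
```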

To extract city information, we used a dictionary approach. We used the Python library geonamescache, which contains lists of cities around the world with information about city population, longitude, latitude, and city name variations. With geonamescache, we compiled a list of Italian cities which we would then search for in the coperte. At this step, we already considered name variations and filtered for cities with a modern population greater than 20'000 inhabitants. With this procedure, we obtained a first city extraction, which yielded a city name for 63% of all our coperte.
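
A sketch of this dictionary lookup is given below; it assumes that the geonamescache city records expose 'name', 'countrycode', 'population' and 'alternatenames' fields, and the OCR text passed in is a placeholder.

```python
# Sketch of the dictionary-based city extraction with geonamescache; the
# field names used here ('alternatenames' in particular) are an assumption
# about the library's city records.
import geonamescache

gc = geonamescache.GeonamesCache()

# Italian cities above 20'000 inhabitants, including known name variations
city_variants = {}
for city in gc.get_cities().values():
    if city["countrycode"] == "IT" and city["population"] > 20000:
        for variant in [city["name"], *city.get("alternatenames", [])]:
            if variant:
                city_variants[variant.lower()] = city["name"]

def find_city(ocr_text: str) -> str | None:
    """Return the canonical name of the first known city found in the text."""
    lowered = ocr_text.lower()
    for variant, canonical in city_variants.items():
        if variant in lowered:
            return canonical
    return None
```
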
Based on this first extraction, we enhanced the city extraction with the following measures:
1) Given the sub-optimal OCR quality of Tesseract, many city names were spelled incorrectly. To account for this, we selected the 10 most common cities from our first extraction and searched the coperte for very similar variations, implemented as a similarity matching on the city names (see the sketch after this list).
2) We extended our city list to central European cities with a modern population greater than 150'000 inhabitants.
3) We did a sanity check on our extracted cities and excluded cities that were unlikely to be correct. For instance, several librettos were supposedly performed in the modern city of 'Casale'; the word 'Casale', however, rather referred to the Italian word for 'house' (either the house in which the opera was performed, or a house, i.e. family, of lords that was mentioned).
With these measures, we improved the quality of our city extraction and increased the city extraction rate to 73%.
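
For the similarity matching in step 1, a minimal sketch is shown below; it uses difflib as a stand-in for the string similarity that was actually used, and the list of common cities is only an illustrative subset.

```python
# Sketch of the fuzzy matching against the most common cities; difflib is
# an assumed stand-in for the similarity measure, and COMMON_CITIES is only
# an illustrative subset.
import difflib

COMMON_CITIES = ["venezia", "roma", "bologna", "firenze", "milano"]

def fuzzy_find_city(ocr_text: str, cutoff: float = 0.85) -> str | None:
    """Return a common city whose spelling closely matches a token in the text."""
    for token in ocr_text.lower().split():
        matches = difflib.get_close_matches(token, COMMON_CITIES, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return None
```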

Data Extraction from Titles

When looking at the metadata already available on the Cini website, we noticed that the title information was, in reality, a rather comprehensive sentence describing different attributes of the librettos. An example of the already available title information is the following: 'Adelaide di Borgogna, melodramma serio in due atti di Luigi Romanelli, musica di Pietro Generali. Da rappresentarsi in Venezia nel Teatro di San Benedetto la primavera 1829' (in English: 'Adelaide di Borgogna, serious melodrama in two acts by Luigi Romanelli, music by Pietro Generali. To be performed in Venice in the Teatro di San Benedetto in the spring of 1829'). In this sentence, we can identify the first few words until the first comma as the actual title of the opera, in this case, Adelaide di Borgogna. Right after the title, one or two words are used to describe the genre of the opera at hand, a serious melodramma. Other information in this sentence includes the composer/director of the opera (Pietro Generali), the theater where it was represented (Teatro di San Benedetto), and the occasion (spring of 1829). In this specific case, the information about the occasion is uninteresting, but it sometimes specifies whether it was played at a Carnival or at a city fair.

In this section, we focused our efforts on extracting this information as single entities. Separating this sentence into atomic metadata that can be inserted into a table is, in fact, useful for better retrieval, clustering, interpretation, and other tasks. In the following subsections, we explain how the extraction of atomic entities from the title was carried out.

Title

In order to extract the actual title information, we made use of the fact that the latter is almost always at the beginning of the sentence and followed by genre information. In the extraction, we used a simple regular expression that matches different expressions about the genre of the opera and selected the part of the text preceding the matched expression. A little cleaning was applied to the extracted words: trailing whitespace and full stops were removed.
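
A minimal sketch of this step is shown below; the genre keywords in the pattern are an illustrative subset, not the project's actual expression list.

```python
# Sketch of the regex-based title extraction; the genre keywords are an
# illustrative subset of the expressions that were matched.
import re

GENRE_PATTERN = re.compile(
    r"\b(melodramma|dramma|commedia|farsa|tragedia|intermezzo|opera)\b",
    flags=re.IGNORECASE,
)

def extract_title(description: str) -> str | None:
    """Return the text preceding the first genre keyword, lightly cleaned."""
    match = GENRE_PATTERN.search(description)
    if match is None:
        return None
    return description[: match.start()].strip().strip(".,").strip()

# extract_title("Adelaide di Borgogna, melodramma serio in due atti ...")
# -> "Adelaide di Borgogna"
```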

In order to group together the same plays (invariant to changes in spelling and the use of different words), we used spacy to obtain a vector representation of the titles, which we then clustered. Each word in the title was mapped to a vector with the Italian spacy model, which is trained using FastText CBOW on Wikipedia and OSCAR (Common Crawl), yielding a list of vectors for each title. The vectors in the list were averaged and the result fed to K-means clustering. The sklearn implementation of K-means was used, with default parameters and K=830. This parameter was manually adjusted until each cluster contained only one title entity with the lowest overhead. For each cluster, the most recurring title was selected and given to all elements of the cluster.
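
A sketch of this clustering step is given below, assuming the Italian spacy model it_core_news_md (which ships word vectors that are averaged in doc.vector) as a stand-in for the exact model used; K defaults to the 830 reported above.

```python
# Sketch of the title clustering; it_core_news_md is an assumed model name,
# and doc.vector already averages the token vectors as described above.
from collections import Counter

import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("it_core_news_md")

def cluster_titles(titles: list[str], n_clusters: int = 830) -> list[str]:
    """Map each title to the most recurring title of its K-means cluster."""
    vectors = np.array([nlp(t).vector for t in titles])
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)
    canonical = {}
    for label in set(labels):
        members = [t for t, l in zip(titles, labels) if l == label]
        canonical[label] = Counter(members).most_common(1)[0][0]
    return [canonical[label] for label in labels]
```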

Finally, this title was used to retrieve a link to the opera's Italian Wikipedia page through the MediaWiki API. To do so, the requests library was used with session.get. In a second stage, the search was refined to title + ' opera'.
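
The linking could look roughly like the sketch below, which queries the public MediaWiki opensearch endpoint of the Italian Wikipedia; the exact endpoint and parameters the project used are not documented here, so this is an assumption.

```python
# Sketch of the title-to-Wikipedia linking via the MediaWiki opensearch API;
# the endpoint and parameters are assumptions about how the lookup was done.
import requests

session = requests.Session()
API_URL = "https://it.wikipedia.org/w/api.php"

def find_wiki_link(title: str) -> str | None:
    """Return the first Italian Wikipedia URL found for `title + ' opera'`."""
    response = session.get(
        API_URL,
        params={
            "action": "opensearch",
            "search": f"{title} opera",   # refined query, as described above
            "limit": 1,
            "format": "json",
        },
        timeout=30,
    )
    _, _, _, urls = response.json()
    return urls[0] if urls else None
```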

Location

In order to extract the location where the opera was played, regular expressions were used once again to identify the chunk of text containing the location. Specifically, the chunk was delimited at the beginning by expressions about the location (such as theater, church, house, ...) and at the end by expressions about the time (winter, a specific year, at noon, ...). The selected chunk of text, however, was not as precise as for the title. Some manual preprocessing was applied (such as mapping 'S.' to 'Saint').

To further improve the selection, spacy was used to extract named entities from the chunk of text. Only the LOC (location) entities were searched for and the first appearing was selected.
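
A minimal sketch of this entity filtering is shown below; it assumes the same Italian spacy model as above and that the regex-selected chunk is passed in as a string.

```python
# Sketch of the LOC entity selection on the regex-selected location chunk;
# the model name is an assumption.
import spacy

nlp = spacy.load("it_core_news_md")

def first_location_entity(chunk: str) -> str | None:
    """Return the first LOC (location) entity recognized in the chunk."""
    for ent in nlp(chunk).ents:
        if ent.label_ == "LOC":
            return ent.text
    return None
```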

At this point geopy was used to retrieve the latitude and longitude of the locations. In the search, the theater name + city name was used on the Nominatim geocoder.
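
A sketch of this geocoding step with geopy's Nominatim geocoder follows; the user_agent string and the example query are illustrative.

```python
# Sketch of the Nominatim geocoding of 'theater name + city name'; the
# user_agent value is an arbitrary placeholder.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="rolandi-librettos")

def geocode_theater(theater: str, city: str) -> tuple[float, float] | None:
    """Return (latitude, longitude) for 'theater, city', or None if not found."""
    location = geolocator.geocode(f"{theater}, {city}")
    if location is None:
        return None
    return (location.latitude, location.longitude)

# Example: geocode_theater("Teatro La Fenice", "Venezia")
```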

Since the number of matched locations was quite low, and geopy is not very robust to small changes in spelling, the locations were clustered. Just as for the titles, the locations were first mapped to spacy word vectors. This time only the location name was used, as it is more discriminative (e.g. for 'Theater Saint Benedict', the vector was produced as the average of the vectors for 'Saint' and 'Benedict', excluding 'Theater'). K-means was once again used for clustering, and K was set to 150. This number was again manually calibrated until all clusters appeared correct.

Finally, the most recurring location, longitude, and latitude of each cluster were given to all elements in the cluster.

Genre

As for the title, the genre was extracted by matching regular expressions on the title description. Some manual mapping was then applied to normalize the matched expressions. Each genre was mapped to its spacy word vectors, and the genres were clustered with K-means into 15 genre clusters.

Composer

The composer name + 'maestro' was linked to an Italian Wikipedia page using the requests library with session.get, analogously to the title linking.

Occasion

To extract the occasion, the spacy entities of type MISC and ORG were searched in the title description, and those referring to a carnival or a fair were selected. In parallel, a regular expression searched for the words carnival and fair and selected a chunk of 17 characters after the match. The spacy match, if any, was returned; otherwise, the regex match was used, split on the full stop and stripped of trailing spaces.
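
A minimal sketch of this occasion extraction is given below; the Italian keywords 'carnevale' and 'fiera' are assumptions about the actual words matched, and the model name is the same assumed Italian spacy model as above.

```python
# Sketch of the occasion extraction: spacy MISC/ORG entities first, then a
# regex fallback; keywords and model name are assumptions.
import re

import spacy

nlp = spacy.load("it_core_news_md")
KEYWORDS = re.compile(r"carnevale|fiera", flags=re.IGNORECASE)
# Fallback: keyword plus a chunk of up to 17 following characters
OCCASION_RE = re.compile(r"(?:carnevale|fiera).{0,17}", flags=re.IGNORECASE)

def extract_occasion(description: str) -> str | None:
    """Return a MISC/ORG entity mentioning a carnival or fair, else a regex chunk."""
    for ent in nlp(description).ents:
        if ent.label_ in {"MISC", "ORG"} and KEYWORDS.search(ent.text):
            return ent.text
    match = OCCASION_RE.search(description)
    if match:
        # Split on the full stop and strip trailing spaces, as described above
        return match.group(0).split(".")[0].strip()
    return None
```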

Visualization

- place

- time

- linking entities

- others

Quality assessment

Evaluation

In order to evaluate our results, we first calculate the count and percentage of entities we could extract for a given class with respect to the number of librettos that were available. In the information retrieval context, this corresponds to the recall plus the error rate of our extraction, i.e. how many entities were retrieved at all, regardless of whether they are correct. The relative and absolute numbers of this retrieval rate are denoted in the table below:

Feature Extraction/Linking
Feature                      Relative   Absolute
Cities                       72.97%     810
Theaters                     76.21%     846
Composer                     27.74%     308
Genre                        95.04%     1055
Occasion                     1.62%      18
Theater Localization         51.44%     571
Title Wiki Data Linking      90.90%     1009
Composer Wiki Data Linking   23.96%     266

This, however, does not tell us what percentage of the retrieved entities is actually correct. Therefore, in a second step, we analyze the percentage of our entities that is correctly identified. To compute this, we randomly selected a subset of 20 librettos and extracted the ground truth by hand. By comparing our extracted entities with this ground truth, we could compute confusion matrices and metrics such as precision and recall.

City Extraction (rows: predicted class, columns: true class)
                      True False   True Positive
Predicted False            0             7
Predicted Positive         1            12

Theater Extraction (rows: predicted class, columns: true class)
                      True False   True Positive
Predicted False            5             3
Predicted Positive         2            10

Composer (rows: predicted class, columns: true class)
                      True False   True Positive
Predicted False           16             3
Predicted Positive         0             1

Theater Localisation (rows: predicted class, columns: true class)
                      True False   True Positive
Predicted False           10             6
Predicted Positive         0             4

Given the ground truth and our predicted labels, we could then calculate precision and recall metrics for the positivity rate of our feature extractions. Recall in this sense refers to the rate of true positives our extraction finds, and precision to the ratio of true positives to predicted positives. The table below shows the recall and precision rates for each feature extraction; the number of observations on which the metrics are based is indicated in brackets.
Please note that for the linking of entities, only correctly extracted entities were considered; thus the number of observations between a specific extracted entity and the linking of the same entity might vary.


Recall/Precision Feature Extraction and Linking
Feature                    Precision       Recall
Cities                     92.3% (N=13)    63.2% (N=19)
Composer                   100% (N=1)      25% (N=4)
Theaters                   83.3% (N=12)    77% (N=13)
Title                      80% (N=20)      100% (N=16)
Genre                      100% (N=20)     100% (N=20)
City Localization          100% (N=12)     100% (N=12)
Theater Localization       100% (N=12)     0.44% (N=12)
Title Wiki Data Linking    66.6% (N=20)    77% (N=20)

Reliability

- how reliable the extraction is

- what are the limitations (i.e. theaters changing names etc)

Efficiency of algorithms

- both computational and qualitative

- how well they can generalize to new data

Limitations and possible improvements

Links