Revision as of 21:28, 6 December 2020

Introduction

The Fondo Ulderico Rolandi is one of the greatest collections of librettos (the text booklets of operas) in the world. The collection, which is held by the Fondazione Cini, consists of around 32'000 librettos spanning the 16th to the 20th century. It is being digitized and made accessible to the public in the online archives of the Fondazione Cini, where 1'110 librettos are currently available.

Project Abstract

The Rolandi Librettos can be considered a collection of many unstructured documents, where each document describes an opera performance. Each document contains structured entity information about the place, the time and the people (e.g. composer, actors) involved in that opera. In our project we want to extract as much of this entity information as possible, such as the title of the opera, when and in which city it was performed, and who composed it. By extracting the entity information and linking it to internal and external entities, it is possible to construct one comprehensive data set that describes the Rolandi Collection. Linking the information to external entities would allow us to connect our data set to the real world. This would, for example, include linking every city name to a real place and assigning geographical coordinates (longitude and latitude) to it. Such links would allow us, for example, to trace popular operas that were performed several times in different places, or famous directors who directed many operas in different places. In a last step we want to construct one comprehensive end product that represents the Rolandi Collection as a whole. To this end we want to visualize the distribution of the librettos in space and time and, potentially, indicate the links between them.

Planning

The draft project plan and the tasks assigned to each week are listed below:

Weekly working plan
Timeframe	Task	Completion
Week 4
07.10	Evaluating which APIs to use (IIIF)	✅
07.10	Write a scraper to scrape IIIF manifests from the Libretto website	✅
Week 5
14.10	Processing of images: apply Tesseract OCR	✅
14.10	Extract dates and clean the dataset to create the initial DataFrame	✅
Week 6
21.10	Design and develop initial structure for the visualization (using dates data)	✅
	Running a sanity check on the initial DataFrame by hand
	Matching list of cities extracted from OCR using search techniques
Week 7
28.10	Remove irrelevant backgrounds of images	✅
	Extract age and gender from images
	Design data model
	Extract tags, names, birth and death years out of metadata
Week 8
04.11	Get coordinates for each city and translation of city names	✅
	Extract additional metadata (opera title, maestro) from the title of the libretto
	Setting up map and slider in the visualization and order by year
Week 9
11.11	Adding metadata information in visualization by having information pane	✅
	Checking in with the Cini Foundation
	Preparing the Wiki outline and the midterm presentation
Week 10
18.11	Compiling a list of musical theatres	✅
	Getting better recall and precision on the city information
	Identifying composers and getting a performer's information
	Extracting corresponding information for the MediaWiki API for entities (theatres etc.)
Week 11
25.11	Integrate visualization's zoom functionality with the data pipeline to see intra-level info	✅
25.11	Linking similar entities together (which directors performed the same play in different cities?)	✅
Week 12
02.12	Serving the website and computing performance metrics for our data analysis	✅
	Communicate and get feedback from the Cini Foundation
	Continuously working on the report and the presentation
Week 13
09.12	Finishing the project website and remaining work, presenting our results	⬜️

Methodology

Our data processing pipeline consists of three steps:

Collecting data

- scraper

- putting metadata into dataframe
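The step from a scraped manifest to a DataFrame row could look roughly like the sketch below. It assumes the manifests follow the IIIF Presentation API 2.x layout (`metadata` as a list of label/value pairs, page images under `sequences` and `canvases`) and that labels and values are plain strings; the function name `manifest_to_record` and the example manifest are hypothetical and not taken from the project code.

```python
from typing import Any


def manifest_to_record(manifest: dict[str, Any]) -> dict[str, Any]:
    """Flatten one IIIF Presentation 2.x manifest into a single metadata record."""
    record: dict[str, Any] = {
        "manifest_id": manifest.get("@id"),
        "label": manifest.get("label"),
    }
    # In IIIF 2.x, "metadata" is a list of {"label": ..., "value": ...} pairs.
    for entry in manifest.get("metadata", []):
        record[str(entry["label"]).lower()] = entry["value"]
    # Collect the image URL of every page (canvas) in the first sequence.
    sequences = manifest.get("sequences") or [{}]
    canvases = sequences[0].get("canvases", [])
    record["image_urls"] = [
        canvas["images"][0]["resource"]["@id"]
        for canvas in canvases
        if canvas.get("images")
    ]
    return record


# Hypothetical manifest with made-up values, shaped like a IIIF 2.x response.
example_manifest = {
    "@id": "https://example.org/iiif/libretto-1/manifest",
    "label": "Opera d'esempio",
    "metadata": [
        {"label": "Title", "value": "Opera d'esempio"},
        {"label": "Date", "value": "1750"},
    ],
    "sequences": [
        {
            "canvases": [
                {"images": [{"resource": {"@id": "https://example.org/iiif/libretto-1/p1.jpg"}}]},
                {"images": [{"resource": {"@id": "https://example.org/iiif/libretto-1/p2.jpg"}}]},
            ]
        }
    ],
}

record = manifest_to_record(example_manifest)
```

A list of such records can be passed directly to `pandas.DataFrame(records)` to obtain one row per libretto.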

Metadata extraction from titles

- composer

- location

-- from title information

-- using regex to find chunk of text that contains the location

-- search for location using spacy on the extracted chunk of text

-- use geopy to find latitude and longitude of the extracted location

-- use kmeans to cluster the extracted locations (make invariant to small changes in name)

-- infer latitude, longitude and location of a place if another element in the cluster has it

- title

- genre

- occasion
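Two of the location steps listed above can be sketched with the standard library alone: cutting the location-bearing chunk out of a title with a regex, and filling in missing coordinates from another member of the same cluster. The pattern, the example title, the cluster labels and the approximate coordinates are all illustrative assumptions; in the pipeline the chunk would then be passed to spaCy and geopy, and the labels would come from the k-means step, none of which is reproduced here.

```python
import re
from typing import Optional

Coord = tuple[float, float]

# Assumed phrasing: the place follows "rappresentarsi in" or "rappresentarsi a".
PLACE_CHUNK = re.compile(r"rappresentarsi\s+(?:in|a)\s+([^,.;]+)", re.IGNORECASE)


def location_chunk(title: str) -> Optional[str]:
    """Return the part of the title that should contain the location, if any."""
    match = PLACE_CHUNK.search(title)
    return match.group(1).strip() if match else None


def propagate_coordinates(
    labels: list[int], coords: list[Optional[Coord]]
) -> list[Optional[Coord]]:
    """Fill missing coordinates from another element of the same cluster."""
    known: dict[int, Coord] = {}
    for label, coord in zip(labels, coords):
        if coord is not None:
            known.setdefault(label, coord)  # keep the first known value per cluster
    return [
        coord if coord is not None else known.get(label)
        for label, coord in zip(labels, coords)
    ]


# Made-up title used only to exercise the pattern.
title = "Opera d'esempio, dramma per musica da rappresentarsi in Venezia nel Teatro Esempio, l'anno 1750"
chunk = location_chunk(title)  # "Venezia nel Teatro Esempio"

# Hypothetical cluster labels for four spelling variants; only one was geocoded.
names = ["Venezia", "Venezia.", "Venetia", "Milano"]
labels = [0, 0, 0, 1]
coords = [(45.44, 12.33), None, None, None]  # approximate, illustrative values
filled = propagate_coordinates(labels, coords)
```

After propagation the three Venice variants share one coordinate pair, while the unmatched cluster stays empty instead of receiving a guessed value.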

Metadata extraction from copertas

- place

- composer

Visualization

- place

- time

- linking entities

- others

Quality assessment

Evaluation

First, we evaluate how many entities of each class we could extract, both in absolute terms and as a percentage of the available librettos. The relative and absolute retrieval rates are given in the table below:

Feature Extraction
Feature Extraction	Cities	Theaters	Composer	Genre	Occasion
Relative	72.97%	76.21%	27.74%	95.04%	1.62%
Absolute	810	846	308	1055	18
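As a quick check, the relative row can be recomputed from the absolute counts and the 1'110 digitized librettos mentioned in the introduction. The sketch below truncates to two decimals rather than rounding; that choice is an assumption inferred from the table values, not something the text states.

```python
TOTAL_LIBRETTOS = 1110  # digitized librettos, as stated in the introduction

# Absolute counts copied from the table above.
extracted = {
    "Cities": 810,
    "Theaters": 846,
    "Composer": 308,
    "Genre": 1055,
    "Occasion": 18,
}


def retrieval_rate(count: int, total: int) -> float:
    """Percentage of librettos with an extracted entity, truncated to 2 decimals."""
    return (count * 10000 // total) / 100


rates = {name: retrieval_rate(count, TOTAL_LIBRETTOS) for name, count in extracted.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```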

This, however, does not tell us what percentage of the retrieved entities is actually correct. Therefore, in a second step we analyze which percentage of our entities is correct. To compute this, we randomly selected a subset of 20 librettos and extracted the ground truth by hand. By comparing our extracted entities with this ground truth we can compute confusion matrices and metrics such as precision and recall.

Confusion Matrix City Extraction
	True Class: Negative	True Class: Positive
Predicted Class: Negative	0	12
Predicted Class: Positive	1	7

Confusion Matrix Theater Extraction
	True Class: Negative	True Class: Positive
Predicted Class: Negative	5	3
Predicted Class: Positive	2	10

Confusion Matrix Theater Localisation
	True Class: Negative	True Class: Positive
Predicted Class: Negative	10	6
Predicted Class: Positive	0	4

Given the ground truth and our predicted labels, we can now calculate the precision and recall metrics:

Recall/Precision Feature Extraction and Linking
	Cities	Composer	Theaters	Title	Genre	Occasion
Precision	87.5%	100%	83.3%	80%	100%	0%
Recall	37%	25%	77%	100%	100%	0%
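The city and theater figures in this table follow from the confusion matrices above when rows are read as the predicted class and columns as the true class. The sketch below makes that reading explicit; the `Confusion` class and its field names are illustrative and not part of the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a 2x2 confusion matrix."""

    tn: int  # predicted negative, truly negative
    fn: int  # predicted negative, truly positive
    fp: int  # predicted positive, truly negative
    tp: int  # predicted positive, truly positive

    def precision(self) -> float:
        """Share of predicted positives that are correct: TP / (TP + FP)."""
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        """Share of true positives that were found: TP / (TP + FN)."""
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0


# Counts copied from the city and theater extraction matrices.
city = Confusion(tn=0, fn=12, fp=1, tp=7)
theater = Confusion(tn=5, fn=3, fp=2, tp=10)

for name, matrix in (("Cities", city), ("Theaters", theater)):
    print(f"{name}: precision {matrix.precision():.1%}, recall {matrix.recall():.1%}")
```

For cities this gives a precision of 7/8 and a recall of 7/19; for theaters a precision of 10/12 and a recall of 10/13, which match the reported values after rounding.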

Reliability

- how reliable the extraction is

- what the limitations are (e.g. theaters changing names)

Efficiency of algorithms

- both computational and qualitative

- how well they can generalize to new data

Results

- small analysis of results

Motivation

- speed up metadata extraction

- extend existing metadata

- reduce to atomic entity existing metadata

- visualize metadata in an interactive and understandable way

Realisation

- description of the realization (?)
