Opera Rolandi archive: Difference between revisions

Revision as of 09:41, 12 December 2020

Abstract

The Fondazione Giorgio Cini has digitized 36,000 pages from Ulderico Rolandi's opera libretti collection. This collection contains contemporary works of 17th- and 18th-century composers. The libretti have diverse content, which offered us many possibilities for analysis.

This project concentrates on illustrating the characters’ interactions in Rolandi's libretti collection through network visualization. We also highlight the importance of each character in the libretto in which they appear. To achieve this, we retrieved the relevant information using deep learning models and OCR. We started from a subset of Rolandi’s libretti collection and then generalized the algorithm to the whole collection.

Planning

Week	To do
12.11. (week 9)	Step 1: Segmentation model training, fine tuning & testing
19.11. (week 10)	Step 2: Information extraction & cleaning
26.11. (week 11)	Finishing Step 2
03.12. (week 12)	Step 3: Information storing & network visualization
10.12. (week 13)	Finishing Step 3 and Finalize Report and Wikipage (Step 4: Generalization)
17.12. (week 14)	Final Presentation

Step 1

Train model on diverse random images of Rolandi’s libretti collection (better for generalization aspects)
Test model on diverse random images of Rolandi’s libretti collection
Test on a single chosen libretto:
- If the results are bad, train the model on more images from this libretto
- If the results are still bad (and the planning allows it), try to help the model by pre-processing the images beforehand, e.g. converting them to black and white (a filter that accentuates shades of black)
Choose a well-formatted and not too damaged libretto (to be able to do step 2)

Step 2

Extract essential information from the libretto with OCR
- names, scenes and descriptions
- if bad results, apply pre-processing to make the handwriting sharper for reading

If OCR extracts variants of same name:
- perform clustering to give it a common name
- apply distance measuring techniques
- find the real name (not just the abbreviation) of the character using the introduction at the start of the scene

Step 3

For one libretto:

Extract the information below and store it in a tree format (json file):
Assessment of extraction and cleaning results
Create the relationship network:
- nodes = characters’ name
- links = interactions
- weight of the links = importance of a relationship
- weight of nodes = speech weight of a character + normalization so that all the scenes have the same weight/importance

Step 4: Optional

See how our network algorithm generalizes (according to the success of step 1) to five Rolandi’s libretti.
If this does not generalize well, strengthen our deep learning model with more images coming from these libretti.

Description of the Methods

File:Lol.png

Data processing pipeline

Motivation

Analysis of Rolandi’s libretti -> heavy scraping
Research on opera movements, a chronology, recurring themes
OCR (with Google Cloud Vision)
Translation (with DeepL) ->
NLP (extract topics per libretto) ->
Creation of an interface (JS) to visualize the different clusters
Propose the topics as new metadata per libretto

Second idea:

Why it is better
why all the previous points are resolved by this idea
and what it brings in addition (why our project is interesting)

Dataset

Choice of datasets (text format, legibility of names, clean separation of scenes/names)
Choice to build everything based on Antigone

Segmentation

Choice to segment only the scenes and the abbreviations of the characters
Choice not to focus on the dialogues (because the scan orientation disturbs the extraction of the box coordinates and therefore the reordering of the boxes into a logical sequence)
Examples of segmentations performed (green background, red names, ...) + number of examples for training and validation
Using the dh_segment model, we obtain for each pixel of the test images the probability of belonging to the green, red, yellow or blue class. Each map of class probabilities has the same dimensions as the image from which the words are going to be extracted.

OCR

Use the Google Vision API to extract all the words of our scans
We store each word and its box coordinates in a csv file
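
As an illustration of this storage step, here is a minimal sketch using only the Python standard library. The record layout (text, x_min, y_min, x_max, y_max) and the sample words are assumptions made for the example; they are not the actual output format of the Google Vision API nor the column names used in the project.

```python
import csv
import io

# Hypothetical OCR output: one record per word with its bounding box in pixels.
# The field names are assumptions, not the project's real csv header.
FIELDS = ["text", "x_min", "y_min", "x_max", "y_max"]

words = [
    {"text": "SCENA", "x_min": 120, "y_min": 80, "x_max": 210, "y_max": 110},
    {"text": "Ant.", "x_min": 40, "y_min": 150, "x_max": 85, "y_max": 175},
]


def write_words_csv(records, handle):
    """Write one csv row per OCR word, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def read_words_csv(handle):
    """Read the rows back, converting the coordinates to integers."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({k: (row[k] if k == "text" else int(row[k])) for k in FIELDS})
    return rows


# An in-memory buffer stands in for the csv file on disk.
buffer = io.StringIO()
write_words_csv(words, buffer)
buffer.seek(0)
loaded = read_words_csv(buffer)
```

Keeping the four box corners per word is what later allows the words to be matched against the segmentation masks and put back in reading order.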

Combination of OCR and Segmentation to Find Words of Interest

Using the per-pixel class probabilities computed by dh_segment, we can now create a mask to extract the words belonging to each class. As the probability is distributed among all classes, one has to define an ocr_threshold, which specifies the probability from which a pixel is considered to belong to a class. The resulting 1 and 0 values are stored as a mask for each attribute.
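
The thresholding can be sketched as follows. This is a simplified illustration, not the project's code: plain nested lists stand in for the probability arrays, the values are invented, and treating a probability equal to ocr_threshold as belonging to the class is an assumption.

```python
def build_mask(prob_map, ocr_threshold):
    """Turn a per-pixel probability map for one class into a 0/1 mask.

    A pixel is set to 1 when its probability reaches ocr_threshold
    (using >= rather than > is an assumption of this sketch).
    """
    return [[1 if p >= ocr_threshold else 0 for p in row] for row in prob_map]


# Invented 3x3 probability map for the "name" class.
name_probs = [
    [0.05, 0.10, 0.20],
    [0.70, 0.90, 0.30],
    [0.80, 0.95, 0.10],
]

name_mask = build_mask(name_probs, ocr_threshold=0.5)
```

One such mask would be built per attribute (names, scenes, descriptions), each possibly with its own threshold.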

We define a range of the image from which to extract the words for each attribute (e.g. to extract the names, we look at the leftmost part of the left page and the leftmost part of the right page). We use this trick for two reasons: first, it helps us order the words, taking those from the first page and then those from the second page; second, it speeds up the algorithm, which does not have to go through the whole image. We nevertheless choose a fairly large range to be sure not to miss any data. The picture below illustrates this. TODO insert schema.

Then we consider each word and its bounding box. We compute the mean of the class probability over the pixels of a smaller region, a fixed ratio of the box, centered on the true box. We do not read all the pixels of the box, because pixels further from the center are less likely to belong to the class and, in the end, only the central ones matter. This improves the precision of our results. TODO create and insert schema.

We then compute the mean value of the corresponding pixels in the mask, which, recall, are either 0 or 1. We define a mean_threshold for each attribute, which specifies the mean value from which the word in the box is kept as part of the class. Only these words are kept for our analysis. TODO create and insert schema.
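
A minimal sketch of this box test is given below, under stated assumptions: the box is given as (x_min, y_min, x_max, y_max) in pixels, the mask is a nested list of 0/1 values, and the helper names central_region, mean_in_box and word_in_class are hypothetical. Only mean_threshold comes from the description above.

```python
def central_region(box, ratio):
    """Shrink a box (x_min, y_min, x_max, y_max) around its center.

    ratio is the fraction of the width and height that is kept.
    """
    x_min, y_min, x_max, y_max = box
    dx = int((x_max - x_min) * (1 - ratio) / 2)
    dy = int((y_max - y_min) * (1 - ratio) / 2)
    return x_min + dx, y_min + dy, x_max - dx, y_max - dy


def mean_in_box(mask, box):
    """Mean of the 0/1 mask values inside the box (0.0 for an empty box)."""
    x_min, y_min, x_max, y_max = box
    values = [mask[y][x] for y in range(y_min, y_max) for x in range(x_min, x_max)]
    return sum(values) / len(values) if values else 0.0


def word_in_class(mask, box, ratio, mean_threshold):
    """Keep the word when the central region is mostly covered by the mask."""
    return mean_in_box(mask, central_region(box, ratio)) >= mean_threshold


# Invented 6x6 mask whose central 4x4 block belongs to the class.
mask = [[1 if 1 <= x <= 4 and 1 <= y <= 4 else 0 for x in range(6)] for y in range(6)]
box = (0, 0, 6, 6)

kept_with_center = word_in_class(mask, box, ratio=0.5, mean_threshold=0.6)
kept_with_full_box = word_in_class(mask, box, ratio=1.0, mean_threshold=0.6)
```

In this toy case the word is kept when only the central region is read, but rejected when the whole box is averaged, which illustrates why the border pixels are ignored.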

After that, we return, for each extracted word of interest, the x and y coordinates of the top-left corner of its box, taken from the OCR results. We first order the words by their x coordinates, so that the left page comes before the right page. This is done using the image ranges defined for each attribute. Then we order the words by their y coordinates, from top to bottom.
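
The ordering can be sketched as below. The ranges, coordinates and abbreviations are invented for the example, and the function name order_words is hypothetical; words falling outside both ranges are simply dropped here, which is an assumption about how out-of-range words are handled.

```python
def order_words(words, left_range, right_range):
    """Order words page by page (left, then right) and top to bottom.

    words: list of dicts with the top-left corner "x", "y" and the "text".
    left_range / right_range: (x_start, x_end) intervals for each page.
    """
    def page_of(word):
        if left_range[0] <= word["x"] <= left_range[1]:
            return 0
        if right_range[0] <= word["x"] <= right_range[1]:
            return 1
        return None  # outside both extraction ranges

    paged = [(page_of(w), w) for w in words]
    kept = [(page, w) for page, w in paged if page is not None]
    kept.sort(key=lambda item: (item[0], item[1]["y"]))
    return [w["text"] for _, w in kept]


# Invented example: name abbreviations in the left margin of each page.
words = [
    {"text": "Cre.", "x": 1050, "y": 100},
    {"text": "Ant.", "x": 60, "y": 300},
    {"text": "Ism.", "x": 55, "y": 120},
    {"text": "noise", "x": 600, "y": 50},
]

ordered = order_words(words, left_range=(0, 200), right_range=(1000, 1200))
```

Sorting on the page index first, rather than on the raw x coordinate, avoids small horizontal offsets within a page changing the reading order.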

In the end, we have for each page the extracted words and their respective attribute (name, scene, description) in the order of appearance.

Creating the Network of a Libretto

For Names:

Extract essential information from the libretto with OCR (names, scenes and descriptions)
Clean variants of the same name by applying the Levenshtein distance
Why clustering did not work: k-means requires knowing the number of clusters in advance, which is not our case. K-medoids is a possible solution, but it is complicated to implement for strings. Because of time constraints, we decided to focus on common similarity distances, in our case an edit distance, the Levenshtein distance.
Link the real name to the abbreviation by using the description field (pattern matching)
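
The cleaning of name variants could look like the sketch below. The Levenshtein distance itself is standard; the grouping rule (map each variant to the most frequent name within max_distance edits), the max_distance value and the sample abbreviations are assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def canonicalize(names, max_distance=1):
    """Replace each name by the most frequent variant within max_distance."""
    canonical = []
    mapping = {}
    # Most frequent spellings are visited first and become the references.
    for name, _ in Counter(names).most_common():
        match = next((c for c in canonical
                      if levenshtein(name, c) <= max_distance), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return [mapping[name] for name in names]


# Invented OCR output with misread abbreviations.
extracted = ["Ant.", "Ant.", "Aut.", "Cre.", "Cre", "Cre.", "Ism."]
cleaned = canonicalize(extracted)
```

Unlike k-means, this greedy grouping does not need the number of characters in advance, but it depends on the chosen max_distance: short abbreviations that differ by a single letter would be merged.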

For Scenes:

Extract the occurrences of "Scena" and its variants
Find the mentions of "SCENA PRIMA" to delimit where the acts start
Increment the scene number by 1 at each "Scena" mention
Problem encountered: as the heading "SCENA PRIMA" takes up a lot of space, it is counted as two scenes. We therefore remove "SCENA" from "SCENA PRIMA" so that it is not counted twice and the numbering is not incremented wrongly.
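
The scene numbering, including the "SCENA PRIMA" fix, can be sketched as follows. The input is assumed to be the ordered list of words tagged as scene headings; the function name number_scenes and the (act, scene) output format are hypothetical.

```python
def number_scenes(tokens):
    """Assign (act, scene) numbers to an ordered list of scene-heading words.

    "PRIMA" opens a new act at scene 1; every other "SCENA" adds one scene.
    A "SCENA" directly followed by "PRIMA" is dropped so that the heading
    "SCENA PRIMA" is not counted as two scenes.
    """
    upper = [token.upper() for token in tokens]
    cleaned = []
    for i, token in enumerate(upper):
        follows_prima = i + 1 < len(upper) and upper[i + 1] == "PRIMA"
        if token == "SCENA" and follows_prima:
            continue
        cleaned.append(token)

    labels = []
    act, scene = 0, 0
    for token in cleaned:
        if token == "PRIMA":
            act, scene = act + 1, 1
        elif token == "SCENA" and act > 0:
            scene += 1
        else:
            continue  # unrelated word, or a scene before any act was found
        labels.append((act, scene))
    return labels


# Two acts: three scenes in the first, two in the second.
headings = ["SCENA", "PRIMA", "Scena", "SCENA", "SCENA", "PRIMA", "SCENA"]
numbered = number_scenes(headings)
```

Without the first loop, the same input would yield one extra scene per act, which is the double counting described above.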

Return a dictionary in a tree structure containing the acts, the scenes, and the names of the characters appearing in each scene together with their number of occurrences in that scene.
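
A possible shape for this tree is sketched below. The key names ("act_1", "scene_1"), the event format and the character names other than Antigone are assumptions made for the example; the actual json layout of the project is not documented here.

```python
import json
from collections import Counter


def build_tree(events):
    """Build {act: {scene: {character: occurrences}}} from ordered events.

    events: list of ("scene", (act, scene)) and ("name", character) pairs,
    in reading order. Names seen before the first scene are ignored.
    """
    tree = {}
    current = None
    for kind, value in events:
        if kind == "scene":
            act, scene = value
            scenes = tree.setdefault(f"act_{act}", {})
            current = scenes.setdefault(f"scene_{scene}", Counter())
        elif kind == "name" and current is not None:
            current[value] += 1
    # Convert the counters to plain dicts so the tree serializes to json.
    return {act: {scene: dict(counts) for scene, counts in scenes.items()}
            for act, scenes in tree.items()}


# Invented reading-order events for a short libretto.
events = [
    ("scene", (1, 1)), ("name", "Antigone"), ("name", "Creonte"), ("name", "Antigone"),
    ("scene", (1, 2)), ("name", "Ismene"),
    ("scene", (2, 1)), ("name", "Creonte"),
]

tree = build_tree(events)
tree_json = json.dumps(tree, indent=2)
```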

Graph Representation of the Network

Using the tree-structured dictionary, create an interactive graph using D3.js. Most of the code comes from https://bl.ocks.org/steveharoz/8c3e2524079a8c440df60c1ab72b5d03
We need to create new json data formatted in the following way:
- A key "nodes" which contains:
  - key id: the name of the character
  - key act_N: the weight of the node, being the number of times the character appears in act N. There is one "act" key per act in the libretto.
- A key "links" which contains:
  - key source: a character name
  - key target: another character name
  - key act_N: the weight of the link, being the number of scenes of act N in which the source and target characters appear together. There is one "act" key per act in the libretto.
We create a drop-down box to let the user decide which libretto network to visualize
We create checkboxes for the acts to let the user decide for which acts to visualize the relationship network.
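
The conversion from the tree to this nodes/links format can be sketched in Python as below. The input tree layout ({act: {scene: {character: occurrences}}}) and the sample characters are assumptions; the output keys id, source, target and act_N follow the description above.

```python
from itertools import combinations


def tree_to_graph(tree):
    """Convert {act: {scene: {character: count}}} into nodes and links."""
    acts = list(tree)
    node_weights = {}
    link_weights = {}
    for act in acts:
        for characters in tree[act].values():
            # Node weight: number of times the character appears in the act.
            for name, count in characters.items():
                node = node_weights.setdefault(name, {a: 0 for a in acts})
                node[act] += count
            # Link weight: number of scenes shared by the two characters.
            for source, target in combinations(sorted(characters), 2):
                link = link_weights.setdefault((source, target),
                                               {a: 0 for a in acts})
                link[act] += 1
    nodes = [{"id": name, **weights}
             for name, weights in sorted(node_weights.items())]
    links = [{"source": s, "target": t, **weights}
             for (s, t), weights in sorted(link_weights.items())]
    return {"nodes": nodes, "links": links}


# Invented two-act tree.
tree = {
    "act_1": {"scene_1": {"Antigone": 2, "Creonte": 1},
              "scene_2": {"Antigone": 1, "Ismene": 1}},
    "act_2": {"scene_1": {"Antigone": 1, "Creonte": 3}},
}

graph = tree_to_graph(tree)
```

Storing one weight per act on every node and link is what lets the act checkboxes recompute the displayed weights on the client side by summing only the selected act_N keys.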

Generalization

We tried applying all of the above to a new libretto, Gli
Problems encountered:
- The printing format is not the same (i.e. the text is concentrated in 2/5 of the page)
- The lines of text are very close to one another
- Many unnecessary words are extracted by the OCR step, so the threshold hyperparameters are too low.
- We try to extract too many character names: the top_N hyperparameter (the number of most common extracted name abbreviations that are kept) is too high.

Quality Assessment

Accuracy of the segmentation testing?
We use zss, which measures an edit distance between tree structures.
We computed the total edit distance between the tree structure extracted from the libretto and a ground-truth tree built manually.
We also computed the edit distance per act between the extracted tree and the ground-truth one. We expect a high edit distance whenever our model misses a scene or adds a wrong one: this shifts the numbering of the following scene nodes by 1, so their children (the characters) no longer match. The per-act plot therefore points to the act containing the added or missing scene.
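
To illustrate the per-act comparison without relying on zss, here is a simplified stand-in: each act is reduced to the ordered list of its scenes, each scene to the set of its characters, and a sequence edit distance is computed between the extracted and ground-truth lists. This is not the tree edit distance computed by zss, and the tree layout and sample data are assumptions.

```python
def edit_distance(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def per_act_distance(extracted, truth):
    """Compare, act by act, the scene sequences of two libretto trees."""
    distances = {}
    for act in sorted(set(extracted) | set(truth)):
        # Each scene is represented by the set of characters appearing in it.
        ext = [frozenset(chars) for chars in extracted.get(act, {}).values()]
        ref = [frozenset(chars) for chars in truth.get(act, {}).values()]
        distances[act] = edit_distance(ext, ref)
    return distances


# Invented example: the extraction missed the second scene of act 1.
truth = {
    "act_1": {"scene_1": {"Antigone": 2, "Creonte": 1},
              "scene_2": {"Ismene": 1},
              "scene_3": {"Antigone": 1}},
    "act_2": {"scene_1": {"Creonte": 1}},
}
extracted = {
    "act_1": {"scene_1": {"Antigone": 2, "Creonte": 1},
              "scene_2": {"Antigone": 1}},
    "act_2": {"scene_1": {"Creonte": 1}},
}

distances = per_act_distance(extracted, truth)
total_distance = sum(distances.values())
```

Because this sequence comparison aligns scenes rather than comparing them by number, a single missed scene costs one edit here; a distance that compares nodes by their scene number, as described above, would also penalize every shifted scene after it.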

What Still Needs To Be Done

TODO Elisa

Links
