Opera Rolandi archive
Revision as of 21:14, 11 December 2020
Abstract
The Fondazione Giorgio Cini has digitized 36,000 pages from Ulderico Rolandi's opera libretti collection. This collection contains contemporary works of 17th- and 18th-century composers. These opera libretti have diverse content, which offered us many possibilities for analysis.
This project chose to concentrate on illustrating the characters’ interactions in the libretti of Rolandi’s collection through network visualization. We also highlighted the importance of each character in the libretto they appear in. To achieve this, we retrieved the important information using deep learning models and OCR. We started from a subset of Rolandi’s libretti collection and then generalized this algorithm to the whole collection.
Planning
Week | To do
---|---
12.11. (week 9) | Step 1: Segmentation model training, fine-tuning & testing
19.11. (week 10) | Step 2: Information extraction & cleaning
26.11. (week 11) | Finishing Step 2
03.12. (week 12) | Step 3: Information storing & network visualization
10.12. (week 13) | Finishing Step 3 and finalizing the report and wiki page (Step 4: Generalization)
17.12. (week 14) | Final presentation
Step 1
- Train model on diverse random images of Rolandi’s libretti collection (better for generalization aspects)
- Test model on diverse random images of Rolandi’s libretti collection
- Test on a single chosen libretto:
- If bad results, train the model on more images coming from the libretto
- If results are still bad (and the planning allows it), try to help the model by pre-processing the images beforehand, e.g. black-and-white images (a filter that accentuates shades of black)
- Choose a well-formatted and not too damaged libretto (to be able to do step 2)
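The black-and-white pre-processing mentioned above can be sketched as a simple intensity threshold. This is a minimal pure-Python illustration operating on a 2-D list of 0–255 grey values (a real pipeline would use Pillow or OpenCV); `binarize` and its mean-based fallback threshold are illustrative assumptions, the latter a crude stand-in for adaptive methods such as Otsu's:

```python
def binarize(page, threshold=None):
    """Binarize a greyscale page given as a 2-D list of 0-255 intensities.

    Pixels darker than the threshold become pure black (0), the rest pure
    white (255), accentuating the ink. If no threshold is given, the page
    mean is used as a rough adaptive default.
    """
    flat = [px for row in page for px in row]
    t = threshold if threshold is not None else sum(flat) / len(flat)
    return [[0 if px < t else 255 for px in row] for row in page]


# Tiny 2x2 "page": two dark pixels (ink) and two light ones (paper).
page = [[30, 200], [40, 220]]
print(binarize(page))  # [[0, 255], [0, 255]]
```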
Step 2
- Extract essential information from the libretto with OCR
- names, scenes and descriptions
- If results are bad, apply pre-processing to make the writing sharper to read
- If OCR extracts variants of same name:
- perform clustering to give it a common name
- apply distance measuring techniques
- find the real name (not just the abbreviation) of the character using the introduction at the start of the scene
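The distance-based cleaning of name variants can be sketched with the Levenshtein edit distance: each OCR'd name is merged into the first known form that is within a few edits of it. `group_variants` and its threshold of 2 edits are illustrative assumptions, not the project's exact code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def group_variants(names, max_dist=2):
    """Greedily map each OCR'd name to the first canonical form within
    max_dist edits, creating a new canonical form otherwise."""
    canon, mapping = [], {}
    for n in names:
        for c in canon:
            if levenshtein(n.lower(), c.lower()) <= max_dist:
                mapping[n] = c
                break
        else:
            canon.append(n)
            mapping[n] = n
    return mapping


# Hypothetical OCR output with spelling variants of two abbreviations.
print(group_variants(["Antig.", "Antg.", "Creon.", "Creo."]))
# {'Antig.': 'Antig.', 'Antg.': 'Antig.', 'Creon.': 'Creon.', 'Creo.': 'Creon.'}
```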
Step 3
For one libretto:
- Extract the information below and store it in a tree format (JSON file):
- Assessment of extraction and cleaning results
- Create the relationship network:
- nodes = characters’ name
- links = interactions
- weight of the links = importance of a relationship
- weight of nodes = a character’s speech weight, normalized so that all scenes have the same weight/importance
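A minimal sketch of this network construction, assuming each scene is summarized as a dict mapping character names to their number of lines (the per-scene normalization implements the "same weight for all scenes" idea above; `build_network` is a hypothetical helper, not the project's actual code):

```python
from collections import Counter
from itertools import combinations


def build_network(scenes):
    """Build a character network from per-scene line counts.

    scenes: list of dicts {character: number of lines spoken in that scene}.
    Node weights accumulate each character's *share* of the scene's speech,
    so every scene contributes equally; link weights count how many scenes
    two characters appear in together.
    """
    node_w, link_w = Counter(), Counter()
    for scene in scenes:
        total = sum(scene.values()) or 1
        for char, lines in scene.items():
            node_w[char] += lines / total           # normalized speech weight
        for a, b in combinations(sorted(scene), 2):
            link_w[(a, b)] += 1                     # co-occurrence weight
    return node_w, link_w


# Two hypothetical scenes from a libretto.
nodes, links = build_network([{"Antigona": 3, "Creonte": 1},
                              {"Antigona": 2}])
print(nodes["Antigona"])                 # 0.75 + 1.0 = 1.75
print(links[("Antigona", "Creonte")])    # 1
```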
Step 4: Optional
- See how our network algorithm generalizes (depending on the success of Step 1) to five of Rolandi’s libretti.
- If it does not generalize well, strengthen our deep learning model with more images coming from these libretti.
Description of the Methods
Dataset
- Choice of datasets (text formatting, readability of the names, clean separation of scenes/names)
- Choice to build everything based on Antigone
Segmentation
- Choice to segment only the scenes and the characters' abbreviations
- Choice not to focus on the dialogues (because the orientation of the scan disturbs the extraction of the boxes' coordinates and thus the re-ordering of the boxes into a logical sequence)
- Example of segmentations performed (green background, red names, ...) + number of examples for training and validation
- Testing accuracy?
OCR
Combination of OCR and Segmentation to Find Words of Interest
For Names:
- Extract essential information from the libretto with OCR (names, scenes and descriptions)
- Clean variants of the same name by applying the Levenshtein distance
- Why clustering did not work: k-means requires knowing the number of clusters in advance, which is not our case. K-medoids could be a possible solution but is complicated to implement for strings. Because of time constraints, we decided to focus on common similarity distances, in our case an edit distance, the Levenshtein distance.
- Link real name to the abbreviation by using the description field (pattern matching)
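The pattern matching that links an abbreviation to the real name could be as simple as a prefix test against the names found in the description field. `link_abbreviations` is a hypothetical sketch of that idea, not the project's actual implementation:

```python
def link_abbreviations(abbrevs, full_names):
    """Match each abbreviation (e.g. 'Antig.') to the first full name
    from the description field that starts with the same letters."""
    links = {}
    for ab in abbrevs:
        stem = ab.rstrip(".").lower()
        for name in full_names:
            if name.lower().startswith(stem):
                links[ab] = name
                break
    return links


# Hypothetical abbreviations and cast names from a scene description.
print(link_abbreviations(["Antig.", "Cre."],
                         ["Antigona", "Creonte", "Emone"]))
# {'Antig.': 'Antigona', 'Cre.': 'Creonte'}
```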
For Scenes:
- Extract occurrences of "Scena" and its variants
- Find mentions of "SCENA PRIMA" to delimit where the acts start
- Increment the scene counter at each scene mention to assign its number
- Problem encountered: as "SCENA PRIMA" takes up a lot of space, the phrase is counted as two scenes. We therefore need to remove the "SCENA" in "SCENA PRIMA" so that it is not counted twice and the counter is not wrongly incremented.
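The scene-numbering logic above, including the special handling of "SCENA PRIMA", can be sketched at token level. This assumes the OCR tokens arrive in reading order; `number_scenes` is an illustrative reconstruction, not the project's exact code:

```python
def number_scenes(tokens):
    """Assign scene numbers while walking OCR tokens in reading order.

    'SCENA PRIMA' marks the start of a new act, so the counter resets
    to 1; every further 'SCENA' increments the counter. The 'SCENA'
    belonging to 'SCENA PRIMA' is skipped so the phrase is not counted
    as two scenes.
    """
    numbers = []
    scene = 0
    for i, tok in enumerate(tokens):
        t = tok.upper()
        if t == "SCENA":
            if i + 1 < len(tokens) and tokens[i + 1].upper() == "PRIMA":
                continue  # counted when we reach 'PRIMA'
            scene += 1
            numbers.append(scene)
        elif t == "PRIMA" and i > 0 and tokens[i - 1].upper() == "SCENA":
            scene = 1  # a new act starts here
            numbers.append(scene)
    return numbers


# One act of three scenes followed by a new act of two scenes.
print(number_scenes(["SCENA", "PRIMA", "SCENA", "SCENA",
                     "SCENA", "PRIMA", "SCENA"]))  # [1, 2, 3, 1, 2]
```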