Opera Regeolocation in Venice (1660-1760)
Latest revision as of 13:56, 15 December 2025
This project is presented by Christophe Bitar and Eliott Bell, master's students at EPFL, as part of the course DH-405 Foundations of Digital Humanities, taught by Prof. Kaplan, Collège des Humanités, EPFL.
Introduction
[Eliott & Christophe]
GitHub repository (data extraction)
Motivation
[Christophe]
Mind map
Research questions
Biography via data
State of the art and literature
CORAGO + books
Project plan
Here is the plan for our project. The core idea is to represent on an interface the information about operas performed in Venice between 1660 and 1760, as documented in the book by Eleanor Selfridge-Field (Stanford, 2007). We collect data about composers, writers, dates and opera houses.
After the midterm presentation, the work is divided in two: Christophe works on the interface implementation while Eliott works on the last elements of the database.
The plan is organised in modules: once one part is completed, we can move on to the next and enrich the data. This approach guarantees that we reach the minimum viable project, which consists in showing the operas through time and space. The NER extraction, which is potentially more difficult, is undertaken only if time allows. We also make sure to keep enough time to debug the interface and analyze the results.
| Week | Eliott Bell | Christophe Bitar |
|---|---|---|
| Week 7 | Scan & OCR | Literature search |
| Week 8 | Pattern matching | State of the art |
| Week 9 | Midterm presentation | |
| Week 10 | Matching and cleaning data | Working on the interface |
| Week 11 | NER of entities | Working on the interface |
| Week 12 | Implement full database - Cleaning data | |
| Week 13 | Debugging, feedback, analysis | |
| Week 14 | Final report | |
Methodology
From a book to a database
OCR
The first concrete step of the project was to extract as much relevant information from A History of Venetian Opera and Related Genres as possible in order to create an extensive historical dataset. To that effect, after a digitised copy of the book was found, an optical character recognition (OCR) scan was performed on the PDF file to retrieve the text content. This was done using the python-tesseract library.
Once this was done, some superficial data cleanup was performed in order to remove any non-standard characters that wouldn't be handled correctly by the next few functions.
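As an illustration of what such a cleanup pass can look like, here is a minimal sketch using only the Python standard library. The function name `clean_ocr_text` and the specific rules (Unicode normalisation, ASCII quotes and dashes, removal of control characters such as page-break form feeds) are assumptions chosen for the example, not the rules actually used in the project.

```python
import re
import unicodedata

# Hypothetical mapping of typographic characters to plain ASCII equivalents.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def clean_ocr_text(raw: str) -> str:
    """Return OCR output with non-standard characters normalised or removed."""
    # Compose letters and combining accents ("o" + grave accent becomes "ò").
    text = unicodedata.normalize("NFC", raw)
    text = text.translate(ASCII_EQUIVALENTS)
    # Drop control and format characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Legitimate accented letters are deliberately kept, since names such as Niccolò Jommelli depend on them.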
Pattern matching
Two different methods were used to extract the desired data. For the basic, systematic information about each opera (i.e. its title, composer(s), librettist(s), venue and sorting date), the decision was taken to use pattern matching, as this data was well-structured and did not depend on context. The procedure was to identify the pattern used by the author of the book, write the regular expression corresponding to that pattern, then match the entries against it to retrieve the desired data automatically.
Because certain entries contained inconsistencies and additional information, a few adjustments had to be made to the regexes to make them more flexible. After these adjustments, the few errors that remained were corrected by hand in the final dataset in order to keep the regexes reasonably short.
Once the dataset was consistent, it was exported into a JSON file. In addition to the data, a unique ID was added to each opera entry in order to easily cross-reference the database with the following elements.
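The steps above can be sketched as follows. This is only an illustration under stated assumptions: the `SORTING DATE: YYYY-MM-DD` label is taken from an extract of the OCR output quoted later on this page, while the field names `uid` and `sorting_date` and the helper names are hypothetical and do not necessarily match the project's actual schema.

```python
import json
import re

# Assumed pattern, based on the "SORTING DATE: 1750-12-26" label seen in the OCR text.
SORTING_DATE = re.compile(r"SORTING DATE:\s*(\d{4})-(\d{2})-(\d{2})")


def parse_entry(uid: int, entry_text: str) -> dict:
    """Extract the structured fields of one opera entry (only the date here)."""
    match = SORTING_DATE.search(entry_text)
    return {
        "uid": uid,  # unique ID used to cross-reference other files
        "sorting_date": "-".join(match.groups()) if match else None,
    }


def build_dataset(entries: list[str]) -> str:
    """Parse every entry, number them from 1, and serialise the result as JSON."""
    records = [parse_entry(uid, text) for uid, text in enumerate(entries, start=1)]
    # ensure_ascii=False keeps accented names readable in the output file.
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Entries for which the pattern fails get `None` instead of raising an error, which makes them easy to list and correct by hand afterwards.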
NER
In addition to the systematic data on each opera, most entries in the book feature a text paragraph giving information about the context in which the opera was produced. To retrieve some of that information, a named entity recognition (NER) scan was done with the Python spaCy library. For each opera production, a list of named entities, including locations and people, was extracted. The decision was then taken to select the 100 most frequently mentioned entities, along with the entries and specific sentences in which they appear. Some entities were discarded from the set for different reasons, such as offering little interesting information (the most frequent entity was “Venice”), being too vague (e.g. surnames that could refer to multiple people) or being inaccurate.
The resulting JSON file uses the same UID system as the original database so that the two can be linked consistently. Alongside the named entities, spaCy gives access to the sentence an entity appears in through the ent.sent.text attribute. In order to provide more information on the context in which these entities appear, these sentences were included in the dataset as well.
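The selection step described above can be sketched with the standard library alone. The sketch assumes that the NER pass has already produced a list of `(uid, entity, sentence)` tuples; with spaCy these would typically be read from `ent.text` and `ent.sent.text` for each `ent` in `doc.ents`. The function name `top_entities`, its parameters and the output layout are hypothetical.

```python
from collections import Counter


def top_entities(mentions, n=100, discard=frozenset({"Venice"})):
    """Keep the n most frequently mentioned entities with their entries.

    mentions: iterable of (uid, entity, sentence) tuples.
    discard:  entities removed by hand (too frequent, too vague, inaccurate).
    """
    mentions = list(mentions)
    # Count every mention of each entity that was not discarded.
    counts = Counter(entity for _, entity, _ in mentions if entity not in discard)
    kept = {entity for entity, _ in counts.most_common(n)}
    # Group the opera UID and context sentence of each mention by entity.
    index = {}
    for uid, entity, sentence in mentions:
        if entity in kept:
            index.setdefault(entity, []).append({"uid": uid, "sentence": sentence})
    return index
```

Keeping the UID next to each sentence is what allows the interface to filter the map and to show the matching context in the sidebar.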
Enriching the data
Finding coordinates
[Christophe]
Sources, Wikidata, etc.
Interface design
[Christophe]
Slider, filters, maps, modes, credits, etc. Google AI: trial and error
Results
Interface Usability
Comparison mode
In order to allow comparative data analysis, a "Comparison Mode" was implemented into the interface. It enables users to apply two search filters in parallel.
Entity search
Alongside the standard method of filtering operas based on theater, composer or librettist, an option to quickly browse through entities mentioned in the entries was implemented. By selecting a certain entity, the map filters operas that mention it and the sidebar shows a list of said operas as well as the sentence the entity appears in, allowing many opportunities for deeper research. Some of them are discussed in the Historical results section.
Data exportation
When filtering entries by any criterion, users can export the list of all corresponding operas as a CSV file.
On top of that, an option to easily export and download visual data from the map of Venice was added next to the histogram. This allows users to seamlessly take screenshots of the desired data, or even create GIFs showing its evolution through time and space.
Historical results
[Christophe]
Opera house as an incubator
Composers' journeys
Pollarolo
Links between cities
Trends in music in Venice
Between opera houses
Discussion and limitations
Technical issues and potential improvements
OCR inaccuracies
A frequent issue with the OCR scan was caused by the language model used: since the pytesseract library was run with its default English language model, characters that do not occur in English weren't recognised properly. This is most prominent with the "ò" character, which often became "d" or "e"; thus, for example, the name of the composer Niccolò Jommelli often became Niccold or Niccole. Such errors were corrected by hand when noticed, though some may remain in the dataset, as no thorough human review was carried out.
Another problem that might have arisen from OCR errors concerns entry separation. The different opera entries were split using pattern matching: as all entries end with a line that starts with "Listed as", this was used as the separation rule. However, there might be cases where the OCR misread characters in this line; in such cases, the following entry would have been entirely omitted from the extraction. This could be detected and corrected with a full comparison of the original source against the resulting dataset.
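The separation rule can be illustrated with a short sketch. It is not the project's actual code: the function name `split_entries` and the exact regular expression are assumptions, and only the "Listed as" convention comes from the book.

```python
import re

# An entry is assumed to end with a line that starts with "Listed as".
ENTRY_END = re.compile(r"^Listed as.*$", re.MULTILINE)


def split_entries(text: str) -> list[str]:
    """Split the OCR text into entries, each ending with its "Listed as" line."""
    entries = []
    start = 0
    for match in ENTRY_END.finditer(text):
        entries.append(text[start:match.end()].strip())
        start = match.end()
    # Text after the last closing line is kept rather than silently dropped.
    tail = text[start:].strip()
    if tail:
        entries.append(tail)
    return entries
```

In this sketch, a misread closing line (for example "Lisled as") makes two entries merge into one, so comparing the number of extracted entries with the number of entries in the book would be one way to spot the problem.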
NER limitations
The NER scan presented two main issues. The first one is that, in its base form, spaCy is unable to detect the layout of a document. This is especially problematic for footnotes, as the entities they mention often end up linked to the wrong entry. This is also partly due to the way entries were split after the OCR scan. A potential fix would be to use a document layout analysis tool to associate footnotes with the right paragraphs before performing NER; for example, this could be done with the spaCyLayout extension for spaCy.
A related problem, also stemming from the lack of layout recognition, is that no distinction is made between entities mentioned in the historical context and those appearing in the opera's synopsis. This affects some entities in the database, such as Armenia, which is mentioned in 14 operas, but always as part of the plot. Such entities don't provide much relevant historical insight.
The second issue with the NER scan revolves around the context sentences extracted alongside the entities: the delimitations often aren't correct, creating a lot of inelegant entries saturated with irrelevant information. For example, the entry for the opera Arianna e Teseo (1750) mentions the city of Vienna; the context sentence that was extracted reads:
Winter (St. Stephen's) With balli SORTING DATE: 1750-12-26 Without dedicatee Pariati's text had been set by Porpora in 1714 for Vienna and was produced in Venice as 1727/11.
The issue might again come from a lack of layout awareness, as everything prior to the relevant sentence is in fact part of different paragraphs.
Interface bugs
Though the interface mostly functions as intended, a few minor bugs remain. The most prominent one affects the way the map displays Comparison Mode under conditions that have not yet been identified. This could be fixed with more time spent debugging the website.
Data extraction
In A History of Venetian Opera and Related Genres, each entry additionally features the genre of the opera, which wasn't incorporated into the dataset due to a lack of time. The genre of each work could be a valuable piece of information to add, as analysing the evolution of different musical trends can be a crucial tool in order to get a grasp of an era's zeitgeist.
Historical limitations of the method
[Christophe]
Domino effect, not sufficient to grasp the daily life