Revision as of 15:04, 12 December 2025

This project is presented by Christophe Bitar and Eliott Bell, master's students at EPFL, as part of the course DH-405 Foundations of Digital Humanities, taught by Prof. Kaplan, Collège des Humanités, EPFL.

Introduction

[Eliott & Christophe]

GitHub repository (data extraction)

GitHub repository (Website)

Website

Motivation

[Christophe]

Mind map

Research questions

Biography via data

State of the art and literature

CORAGO + books

Project plan

Here is the plan for our project. The core idea is to display on an interface the information about operas performed in Venice between 1660 and 1760, as recorded in the book by Eleanor Selfridge-Field (Stanford, 2007). We collect data about composers, librettists, dates and opera houses.

After the midterm presentation, the work is split in two: Christophe works on the interface implementation while Eliott completes the remaining elements of the database.

The plan is organised in modules: once one part is complete, we can move on and enrich the data. This approach guarantees that we reach the minimum viable project, which consists of showing the operas through time and space. The NER extraction, which may prove more difficult, will be attempted only if time allows. We also set aside sufficient time to debug the interface and analyse the results.

**Planning**
Week	Eliott Bell	Christophe Bitar
Week 7	Scan & OCR	Literature search
Week 8	Pattern matching	State of the art
Week 9	Midterm presentation
Week 10	Matching and cleaning data	Working on the interface
Week 11	NER of entities	Working on the interface
Week 12	Implement full database - Cleaning data
Week 13	Debugging, feedback, analysis
Week 14	Final report

Methodology

From a book to a database

OCR

The first concrete step of the project was to extract as much relevant information as possible from A New Chronology of Venetian Opera and Related Genres, 1660-1760 in order to create an extensive historical dataset. To that end, once a digitised copy of the book had been found, optical character recognition (OCR) was run on the PDF file to retrieve the text content. This was done using the python-tesseract library.

Once this was done, some superficial data cleanup was performed in order to remove any non-standard characters that wouldn't be handled correctly by the next few functions.
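The two steps above can be sketched as follows. This is a minimal illustration rather than the project's actual script: the use of pdf2image to render pages, the 300 dpi setting and the exact cleanup rules are assumptions, and the helper names ocr_pdf and clean_text are hypothetical.

```python
import re
import unicodedata


def ocr_pdf(pdf_path: str, lang: str = "eng") -> list[str]:
    """Return one OCR'd text string per page of the PDF."""
    # Imported lazily so that clean_text() can be used without the OCR stack.
    # Assumption: pages are rendered to images with pdf2image before OCR.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=300)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def clean_text(raw: str) -> str:
    """Superficial cleanup of OCR output (hypothetical rules)."""
    # Fold compatibility characters such as ligatures into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Replace typographic quotes with their ASCII equivalents.
    text = text.translate(str.maketrans({
        "\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'",
    }))
    # Drop control and format characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Accented letters such as "ò" are left untouched by this cleanup, so that names can still be corrected later.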

Pattern matching

Two different methods were used to extract the desired data. For the basic, systematic information about each opera (i.e. its title, composer(s), librettist(s), venue and sorting date), pattern matching was chosen, as this data is well structured and does not depend on context. The procedure was to identify the pattern the author uses to lay out each entry, write the corresponding regular expression, then match it against the entries to retrieve the desired data automatically.

ADD REGEX ILLUSTRATION

To make data extraction more consistent, a few adjustments had to be made to the regexes in order to make them more flexible, as certain entries contained some inconsistencies and additional information. After the adjustments, the few errors that remained were corrected by hand in the final dataset in order to keep regexes reasonably short.
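As an illustration of the approach, here is a minimal sketch of a regular expression with named groups. The entry layout below ("Music by", "Text by", "Venue:") is a hypothetical stand-in, not the layout actually used in the book, and ENTRY_RE, split_names and parse_entries are illustrative names. The optional parts of the pattern show the kind of flexibility described above.

```python
import re

# Hypothetical entry layout, for illustration only.
ENTRY_RE = re.compile(
    r"""
    ^(?P<date>\d{4}(?:/\d+)?)\s+             # sorting date, with an optional "/n" suffix
    (?P<title>.+?)\s*\n                      # title runs to the end of the line
    Music\s+by\s+(?P<composers>.+?)\s*\n     # one or more composers
    Text\s+by\s+(?P<librettists>.+?)\s*\n    # one or more librettists
    Venue:\s*(?P<venue>.+?)\s*$              # opera house
    """,
    re.VERBOSE | re.MULTILINE,
)


def split_names(field: str) -> list[str]:
    """Split 'A, B and C' into individual names."""
    return [n.strip() for n in re.split(r",|;|\band\b", field) if n.strip()]


def parse_entries(text: str) -> list[dict]:
    """Return one record per entry matched in the OCR text."""
    records = []
    for m in ENTRY_RE.finditer(text):
        records.append({
            "date": m["date"],
            "title": m["title"],
            "composers": split_names(m["composers"]),
            "librettists": split_names(m["librettists"]),
            "venue": m["venue"],
        })
    return records


# Placeholder entry, not taken from the book.
SAMPLE = """\
1700/3 Example Title
Music by Composer A and Composer B
Text by Librettist C
Venue: Theatre D
"""
```

Calling parse_entries(SAMPLE) yields a single record whose composers field is a list of two names. Entries that do not fit the pattern are simply not matched, which is why a manual pass over the final dataset remains necessary.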

Once the dataset was consistent, it was exported into a JSON file. In addition to the data, a unique ID was added to each opera entry in order to easily cross-reference the database with the following elements.
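A minimal sketch of this export step, assuming the records are Python dictionaries. The ID format ("opera-0001") and the function names are hypothetical; the source only states that each entry receives a unique ID.

```python
import json


def add_ids(records: list[dict], prefix: str = "opera") -> list[dict]:
    """Return copies of the records with a sequential unique ID added."""
    return [
        {"id": f"{prefix}-{i:04d}", **record}
        for i, record in enumerate(records, start=1)
    ]


def export_json(records: list[dict], path: str) -> None:
    """Write the records to a UTF-8 JSON file, keeping accented letters readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Assigning the IDs once, before any further processing, means that later files can refer to an opera by ID even if titles are corrected afterwards.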

NER

In addition to the systematic data on each opera, most entries in the book feature a text paragraph disclosing information about the context the opera was released in. To retrieve some of that information, a named entity recognition (NER) scan was done with the Python spaCy library. For each opera production, a list of named entities, including locations and people, was extracted. The decision was then taken to select the 100 most frequently mentioned entities, along with the entries and specific sentences in which they appear. Some entities were discarded from the set for different reasons, such as lack of interesting information (the most frequent entity was “Venice”), being too vague (e.g. surnames that could refer to multiple people), being inaccurate, etc.
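The selection described above can be sketched as follows. This is an illustration under assumptions: the model name en_core_web_sm, the set of entity labels kept and the function names are not confirmed by the source.

```python
from collections import Counter, defaultdict

# Assumption: only people and places are kept.
KEEP_LABELS = {"PERSON", "GPE", "LOC"}


def extract_mentions(entries: dict[str, str], model: str = "en_core_web_sm"):
    """Return (opera_id, entity, label, sentence) tuples for each entry text."""
    import spacy  # imported lazily; the model must be installed separately

    nlp = spacy.load(model)
    mentions = []
    for opera_id, text in entries.items():
        for ent in nlp(text).ents:
            if ent.label_ in KEEP_LABELS:
                mentions.append(
                    (opera_id, ent.text, ent.label_, ent.sent.text.strip())
                )
    return mentions


def top_entities(mentions, n: int = 100, discard=frozenset()):
    """Keep the n most frequent entities with the entries and sentences citing them."""
    counts = Counter(name for _, name, _, _ in mentions if name not in discard)
    top = [name for name, _ in counts.most_common(n)]
    keep = set(top)
    index = defaultdict(list)
    for opera_id, name, _, sentence in mentions:
        if name in keep:
            index[name].append({"id": opera_id, "sentence": sentence})
    return {name: index[name] for name in top}
```

The discard argument covers the manual exclusions mentioned above, for example top_entities(mentions, n=100, discard={"Venice"}).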

The resulting JSON file features the same UID system as the original database to link them in a consistent manner. Alongside the named entities, the spaCy library allo

Enriching the data

Finding coordinates

[Christophe]

Sources, Wikidata, etc.

Interface design

[Christophe]

Slider, filters, maps, modes, credits, etc. Google AI: trial and error

Results

Technical Assessment

[Eliott]

What did we/the machine manage to do?

Interface Usability

[Eliott]

Comparison mode

The Comparison Mode, here displaying works by Antonio Vivaldi and Marc'Antonio Ziani. The histogram at the bottom allows us to quickly see that Ziani precedes Vivaldi.

In order to allow comparative data analysis, a "Comparison Mode" was implemented into the interface. It enables users to apply two search filters in parallel.
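The idea of applying two filters in parallel can be sketched in a few lines. The snippet below is written in Python for illustration only and is not the website's code; the field names ("year", "composers"), the ten-year bins and the placeholder records are assumptions.

```python
from collections import Counter


def matches(opera: dict, criteria: dict) -> bool:
    """True if every requested value appears in the corresponding field."""
    for field, wanted in criteria.items():
        value = opera.get(field)
        values = value if isinstance(value, list) else [value]
        if wanted not in values:
            return False
    return True


def histogram(operas: list[dict], bin_size: int = 10) -> dict[int, int]:
    """Count operas per period, keyed by the first year of each bin."""
    counts = Counter(o["year"] // bin_size * bin_size for o in operas)
    return dict(sorted(counts.items()))


def compare(operas: list[dict], filter_a: dict, filter_b: dict):
    """Apply two filters to the same dataset and return one histogram each."""
    side_a = [o for o in operas if matches(o, filter_a)]
    side_b = [o for o in operas if matches(o, filter_b)]
    return histogram(side_a), histogram(side_b)


# Placeholder records, not taken from the dataset.
OPERAS = [
    {"id": "opera-0001", "year": 1701, "composers": ["Composer X"]},
    {"id": "opera-0002", "year": 1705, "composers": ["Composer X"]},
    {"id": "opera-0003", "year": 1733, "composers": ["Composer Y"]},
]
```

Comparing the two histograms side by side is what makes it possible to see at a glance that one composer's activity precedes the other's.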

Entities research

Capturing image and GIF

Historical results

[Christophe]

Opera house as an incubator

Composers' journeys

Pollarolo

Links between cities

Trends in music in Venice

Between opera houses

Discussion and limitations

Technical bugs

[Eliott]

OCR detection:
- "Niccolò" (Jommelli) is read as "Niccold" or "Niccole", because "ò" is not in the English character set used by pytesseract
- Entries are separated on the "Listed as" line, so some entries may have been omitted where that line was misread by the OCR
Pattern matching
NER:
- Footnote entities are often attached to the wrong entry because of how entries are separated
- Context sentences are often not delimited correctly
Interface:
- Bugs with comparison mode
- Search bar filters the operas on the map, not the theater/composer/librettist lists

Historical limitations of the method

[Christophe]

Domino effect, not sufficient to grasp the daily life

Future possible improvements

[Eliott]

Linking footnotes with correct entries (document/logical layout analysis?)
Cleaning data
Fixing interface bugs
Adding opera genres for more thorough findings

Opera Regeolocation in Venice (1660-1760): Difference between revisions

Contents

Introduction

Motivation

Research questions

State of the art and literature

Project plan

Methodology

From a book to a database

OCR

Pattern matching

NER

Enriching the data

Finding coordinates

Interface design

Results

Technical Assessment

Interface Usability

Comparison mode

Entities research

Capturing image and GIF

Historical results

Opera house as an incubator

Composers' journeys

Links between cities

Trends in music in Venice

Between opera houses

Discussion and limitations

Technical bugs

Historical limitations of the method

Future possible improvements

What we learnt
