Opera Regeolocation in Venice (1660-1760)

From FDHwiki

This project was carried out by Christophe Bitar and Eliott Bell, master's students at EPFL, as part of the course DH-405 Foundations of Digital Humanities, taught by Prof. Kaplan, Collège des Humanités, EPFL.

Introduction

[Eliott & Christophe]

GitHub repository (data extraction)

GitHub repository (Website)

Website

Motivation

[Christophe]

Mind map

Research questions

Biography via data

State of the art and literature

CORAGO + books

Project plan

Here is the plan for our project. The core idea is to represent on an interface the information about operas performed in Venice between 1660 and 1760, according to the book by Eleanor Selfridge-Field (Stanford, 2007). We collect data about composers, librettists, dates, and opera houses.

After the midterm presentation of the project, we split the work in two: Christophe works on the interface implementation while Eliott completes the remaining elements of the database.

The plan is modular: once one part is complete, we can move on and enrich the data. This approach guarantees that we reach the minimum viable product, which consists of showing the operas through time and space. The NER extraction, which may prove more difficult, will be attempted only if time allows. We also reserve enough time to debug the interface and analyze the results.

Planning
Week | Eliott Bell | Christophe Bitar
Week 7 | Scan & OCR | Literature search
Week 8 | Pattern matching | State of the art
Week 9 | Midterm presentation
Week 10 | Matching and cleaning data | Working on the interface
Week 11 | NER of entities | Working on the interface
Week 12 | Implement full database - Cleaning data
Week 13 | Debugging, feedback, analysis
Week 14 | Final report

Methodology

Pipeline

From a book to a database

[Eliott]

OCR

  • pytesseract
  • Cleanup of non-standard characters
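The two OCR steps above could look roughly like the following minimal sketch. The `pytesseract.image_to_string` call is the library's standard entry point; everything else (the function names `ocr_page` and `clean_text`, and the contents of the `REPLACEMENTS` table) is a hypothetical illustration, not the project's actual cleanup rules.

```python
import re
import unicodedata

# Hypothetical replacement table: typographic characters that OCR often
# emits, mapped to plain equivalents. The project's real table may differ.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\ufb01": "fi", "\ufb02": "fl", # fi and fl ligatures
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}


def ocr_page(image_path, lang="eng"):
    """Run Tesseract on one scanned page (needs pytesseract and Pillow)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def clean_text(text):
    """Normalise OCR output into a predictable character set."""
    # Compose base letter + combining accent into one code point.
    text = unicodedata.normalize("NFC", text)
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Drop control and format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Keeping the cleanup separate from the OCR call means the character rules can be adjusted and re-run on saved OCR output without rescanning the pages.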

Pattern matching

  • Division into entries
  • Bits of data extracted
  • Regular expressions + example
  • json dataset + UIDs
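As an illustration of the pattern-matching steps listed above, here is a minimal sketch that splits OCR text into entries, extracts a few fields with regular expressions, and serialises the result to JSON with one UID per entry. The entry layout in `SAMPLE`, the field patterns, and the `op0001` UID scheme are all assumptions made for the example; only the use of the "Listed as" line as a separator comes from this page, and whether that line opens or closes an entry in the book is also assumed here.

```python
import json
import re

# Hypothetical entry layout, invented for this example.
SAMPLE = (
    "Listed as: Opera Uno\n"
    "Music by Composer One. Text by Poet One.\n"
    "Teatro Esempio, 1672.\n"
    "Listed as: Opera Due\n"
    "Music by Composer Two. Text by Poet Two.\n"
    "Teatro Modello, 1701.\n"
)

# Zero-width match at the start of every "Listed as" line.
ENTRY_START = re.compile(r"^(?=Listed as\b)", re.MULTILINE)
TITLE = re.compile(r"^Listed as:?\s*(?P<title>.+)$", re.MULTILINE)
COMPOSER = re.compile(r"Music by (?P<composer>[^.\n]+)")
LIBRETTIST = re.compile(r"Text by (?P<librettist>[^.\n]+)")
THEATER = re.compile(r"(?P<theater>Teatro [^,\n]+)")
YEAR = re.compile(r"\b(?P<year>1[67]\d{2})\b")


def _first(pattern, text, group):
    """Return the named group of the first match, or None."""
    match = pattern.search(text)
    return match.group(group).strip() if match else None


def parse_entries(text):
    """Split text into entries and extract one record per entry."""
    chunks = [c.strip() for c in ENTRY_START.split(text) if c.strip()]
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        year = _first(YEAR, chunk, "year")
        entries.append({
            "uid": f"op{index:04d}",  # hypothetical sequential UID
            "title": _first(TITLE, chunk, "title"),
            "composer": _first(COMPOSER, chunk, "composer"),
            "librettist": _first(LIBRETTIST, chunk, "librettist"),
            "theater": _first(THEATER, chunk, "theater"),
            "year": int(year) if year else None,
            "text": chunk,  # raw text kept for later NER
        })
    return entries


if __name__ == "__main__":
    print(json.dumps(parse_entries(SAMPLE), indent=2, ensure_ascii=False))
```

Fields that fail to match are stored as `None` rather than dropped, so incomplete entries stay visible in the dataset and can be reviewed during cleaning.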

NER

  • Justification
  • spaCy scan + context sentences
  • Link with entries using UIDs
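The NER steps above can be sketched as follows, assuming each entry is a dictionary with a `uid` and its raw `text`. The function names, the record layout, and the choice of the `en_core_web_sm` model and of the `PERSON`, `GPE`, and `ORG` labels are assumptions for illustration; `doc.ents`, `ent.label_`, and `ent.sent` are standard spaCy attributes.

```python
def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline (assumed model name; must be installed)."""
    import spacy
    return spacy.load(model)


def entities_for_entry(entry, nlp, labels=("PERSON", "GPE", "ORG")):
    """Return one record per entity, linked back to the entry by UID.

    `entry` is assumed to be a dict with "uid" and "text" keys.
    `nlp` is any callable that returns a spaCy Doc (or a compatible
    object) whose sentences are segmented, since ent.sent needs that.
    """
    doc = nlp(entry["text"])
    records = []
    for ent in doc.ents:
        if ent.label_ not in labels:
            continue  # skip dates, numbers, and other unwanted types
        records.append({
            "entry_uid": entry["uid"],         # link to the opera entry
            "text": ent.text,
            "label": ent.label_,
            "context": ent.sent.text.strip(),  # sentence holding the entity
        })
    return records
```

A typical call would be `entities_for_entry(entry, load_pipeline())` for each entry of the JSON dataset. Storing the UID on every entity record keeps the entities in a separate table that can be joined back to the operas in the interface.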

Enriching the data

Finding coordinates

[Christophe]

Sources, Wikidata, etc.
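One possible way to retrieve coordinates from Wikidata is to query its public SPARQL endpoint for the "coordinate location" property (P625) of an item. The sketch below is an assumption about how this could be done, not a description of the method actually used in the project; the function names and the User-Agent string are hypothetical. Wikidata returns coordinates as a WKT literal of the form `Point(longitude latitude)`, so the parser swaps the order to return (latitude, longitude).

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
_POINT = re.compile(
    r"Point\((?P<lon>-?\d+(?:\.\d+)?) (?P<lat>-?\d+(?:\.\d+)?)\)"
)


def build_query(qid):
    """Build a SPARQL query for the coordinates (P625) of one item."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    return f"SELECT ?coord WHERE {{ wd:{qid} wdt:P625 ?coord . }}"


def parse_point(wkt):
    """Convert 'Point(lon lat)' into a (lat, lon) tuple, or None."""
    match = _POINT.fullmatch(wkt.strip())
    if match is None:
        return None
    return float(match["lat"]), float(match["lon"])


def fetch_coordinates(qid):
    """Query Wikidata over the network; returns (lat, lon) or None."""
    params = urllib.parse.urlencode(
        {"query": build_query(qid), "format": "json"}
    )
    request = urllib.request.Request(
        WIKIDATA_SPARQL + "?" + params,
        headers={"User-Agent": "opera-geolocation-sketch/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    bindings = data["results"]["bindings"]
    return parse_point(bindings[0]["coord"]["value"]) if bindings else None
```

Opera houses that no longer exist may have no Wikidata item or no P625 statement, in which case `fetch_coordinates` returns `None` and the location has to come from another source.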

Interface design

[Christophe]

Slider, filters, maps, modes, credits, etc. Google AI: trial and error

Results

Technical Assessment

[Eliott]

What did we/the machine manage to do?

Interface Usability

[Eliott]

Comparison mode

Entities research

Capturing image and GIF

Historical results

[Christophe]

Opera house as an incubator

Composers journeys

Pollarolo

Links between cities

Trends in Music in Venice

Between opera houses

Discussion and limitations

Technical bugs

[Eliott]

  • OCR detection:
    • Niccolò (Jommelli) is read as Niccold or Niccole because "ò" is not in the English character set used by pytesseract
    • Entry separation relies on the "Listed as" line, so some entries may have been omitted where that line was scanned incorrectly
  • Pattern matching
  • NER:
    • Footnote entities often associated with the wrong entry because of entry separation
    • Context sentences often aren't delimited correctly
  • Interface:
    • Bugs with comparison mode
    • Search bar filters the operas on the map, not the theater/composer/librettist lists
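The OCR misreadings listed above could be repaired after the fact with a small correction table, as in this minimal sketch. The table holds only the two variants reported on this page; the names `OCR_FIXES` and `fix_known_misreadings` are hypothetical. An alternative worth testing, assuming the Italian language data is installed for Tesseract, is to pass `lang="eng+ita"` to pytesseract so that accented letters are recognised at OCR time.

```python
import re

# Known misreadings reported for this dataset, mapped to the intended form.
OCR_FIXES = {
    "Niccold": "Niccolò",
    "Niccole": "Niccolò",
}

# Match any misreading as a whole word only, so longer words that merely
# start with one of these strings are left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in OCR_FIXES) + r")\b"
)


def fix_known_misreadings(text):
    """Replace each known OCR misreading with its corrected spelling."""
    return _PATTERN.sub(lambda match: OCR_FIXES[match.group(1)], text)
```

A table like this only fixes variants that have already been spotted, so it complements rather than replaces a better OCR configuration.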

Historical limitations of the method

[Christophe]

Domino effect, not sufficient to grasp the daily life

Future possible improvements

[Eliott]

  • Linking footnotes with correct entries (document/logical layout analysis?)
  • Cleaning data
  • Fixing interface bugs
  • Adding opera genres for more thorough findings

What we learnt