Data Ingestion of Guide Commericiale

From FDHwiki
Jump to navigation Jump to search

Introduction

Project Timeline & Milestones

Timeframe Task Completion
Week 4
  • Explore and define possible approaches to the problem
  • Explore the existing Guida Commerciale documents
Week 5
  • Establish a pipeline for the project
  • Explore pricing plans for OpenAI GPT models and Geocoding platforms
  • Set up the GitHub repository
Week 6
  • Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
  • Attempt to cluster pages based on different page types using an unsupervised learning approach
Week 7
  • Autumn Break
Week 8
  • Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
  • Set up the semantic annotation platform, INCEpTION, for the new approach
Week 9
  • Complete midterm presentation
  • Run OCR with Pytesseract
  • Extract text using PDFPlumber
  • Compare the two text extraction approaches, choose the most suitable approach, and pre-process the text output
Week 10
  • Set up existing model of annotations completed on the 1864 Guida Commericiale
  • Use INCEpTION to complete semantic annotation and named-entity recognition of 1853 Guida Commericiale
  • Convert this into a CSV file for further processing
Week 11
  • Complete processing and cleaning of the CSV file
    • Occupation Broadcasting
    • Logical Segmentation
    • Parish/Church Reconciliation
  • Complete mapping of parishes to districts (sestiere)
Week 12
  • Plot the district and number on a map for all the entries in the 1853 Guida Commericiale
  • Obtain accuracy metrics including accuracy of PDFPlumber and the entire pipeline
  • Finalize potential approaches to analysis
Week 13
  • Complete analysis based on the map entries plotted
  • Derive conclusions based on the analysis and report findings
Week 14
  • Complete the written deliverables and the GitHub repository
  • Prepare the final presentation discussing the results

Methodology

During our work, we approached our problem in two ways: in a more generalizable way that could process any guide commercial, and a more streamlined way that would require more manual annotation to improve results. While we started with our first appr

Results

Limitations & Future Work

Limitations

OCR Results

Time and Money

Github Repository

Data Ingestion of Guide Commericiale

Acknowledgements

References