Data Ingestion of Guide Commericiale: Difference between revisions

From FDHwiki

Revision as of 12:00, 28 November 2024

Introduction

Project Timeline & Milestones

Timeframe Task Completion
Week 4
  • Explore and define possible approaches to the problem
  • Explore the existing Guida Commerciale documents
Week 5
  • Establish a pipeline for the project
  • Explore pricing plans for OpenAI GPT models and geocoding platforms
  • Set up the GitHub repository
Week 6
  • Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
  • Attempt to cluster pages based on different page types using an unsupervised learning approach
Week 7
  • Autumn Break
Week 8
  • Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
  • Set up the semantic annotation platform, INCEpTION, for the new approach
Week 9
  • Complete midterm presentation
  • Run OCR with Pytesseract
  • Extract text using PDFPlumber
  • Compare the two text extraction approaches, choose the most suitable approach, and pre-process the text output
Week 10
  • Set up the existing model of annotations completed on the 1864 Guida Commerciale
  • Use INCEpTION to complete semantic annotation and named-entity recognition of the 1853 Guida Commerciale
  • Convert this into a CSV file for further processing
Week 11
  • Complete processing and cleaning of the CSV file
    • Occupation Broadcasting
    • Logical Segmentation
    • Parish/Church Reconciliation
  • Complete mapping of parishes to districts (sestiere)
Week 12
  • Plot the district and number on a map for all the entries in the 1853 Guida Commerciale
  • Finalize potential approaches to analysis
Week 13
  • Complete analysis based on the map entries plotted
  • Derive conclusions based on the analysis and report findings
Week 14
  • Complete the written deliverables and the GitHub repository
  • Prepare the final presentation discussing the results
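
The Week 11 cleaning steps could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code. It takes "Occupation Broadcasting" to mean filling a blank occupation cell with the most recent non-empty occupation above it, and it treats the parish-to-sestiere mapping as a plain lookup table. The column names (`name`, `occupation`, `parish`), the helper names, the sample rows, and the two-entry lookup are all hypothetical placeholders rather than data from the Guida Commerciale.

```python
import csv
import io

# Hypothetical parish -> sestiere lookup (illustrative only); a real table
# would need to cover every parish that appears in the annotated CSV.
PARISH_TO_SESTIERE = {
    "S. Geremia": "Cannaregio",
    "S. Pietro di Castello": "Castello",
}

# Invented placeholder rows, not taken from the source documents.
SAMPLE_CSV = """name,occupation,parish
Entry A,Baker,S. Geremia
Entry B,,S. Pietro di Castello
Entry C,Tailor,Unknown Parish
"""


def broadcast_occupation(rows):
    """Fill empty 'occupation' cells with the last non-empty value above them.

    Assumption: this is what "occupation broadcasting" refers to.
    """
    current = ""
    filled = []
    for row in rows:
        occupation = (row.get("occupation") or "").strip()
        if occupation:
            current = occupation
        filled.append({**row, "occupation": current})
    return filled


def add_sestiere(rows, lookup=PARISH_TO_SESTIERE):
    """Attach a 'sestiere' column; parishes missing from the lookup get ''."""
    return [
        {**row, "sestiere": lookup.get((row.get("parish") or "").strip(), "")}
        for row in rows
    ]


def clean(csv_text):
    """Parse CSV text, broadcast occupations, then map parishes to sestieri."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return add_sestiere(broadcast_occupation(rows))


if __name__ == "__main__":
    for row in clean(SAMPLE_CSV):
        print(row)
```

Leaving unmatched parishes with an empty `sestiere` value, rather than raising an error, keeps those rows visible for the manual parish/church reconciliation step.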

Methodology

Results

Limitations & Future Work

GitHub Repository

Data Ingestion of Guide Commericiale

Acknowledgements

References