Introduction
Project Timeline & Milestones
| Timeframe | Task | Completion |
|---|---|---|
| Week 4 | - Explore and define possible approaches to the problem<br>- Explore the existing Guida Commerciale documents | ✓ |
| Week 5 | - Establish a pipeline for the project<br>- Explore pricing plans for OpenAI GPT models and geocoding platforms<br>- Set up the GitHub repository | ✓ |
| Week 6 | - Start experimenting with different OCR models and potential alternatives, such as PDF text extraction libraries<br>- Attempt to cluster pages by page type using an unsupervised learning approach | ✓ |
| Week 7 | | ✓ |
| Week 8 | - Create a new approach/pipeline that relies on basic domain knowledge to obtain better results<br>- Set up the semantic annotation platform, INCEpTION, for the new approach | ✓ |
| Week 9 | - Complete midterm presentation<br>- Run OCR with Pytesseract<br>- Extract text using PDFPlumber<br>- Compare the two text extraction approaches, choose the most suitable one, and pre-process the text output | ✓ |
| Week 10 | - Set up the existing model of annotations completed on the 1864 Guida Commerciale<br>- Use INCEpTION to complete semantic annotation and named-entity recognition of the 1853 Guida Commerciale<br>- Convert the annotations into a CSV file for further processing | ✓ |
| Week 11 | - Complete processing and cleaning of the CSV file<br>- Occupation Broadcasting<br>- Logical Segmentation<br>- Parish/Church Reconciliation<br>- Complete mapping of parishes to districts (sestieri) | ✓ |
| Week 12 | - Plot the district and number on a map for all entries in the 1853 Guida Commerciale<br>- Obtain accuracy metrics, including the accuracy of PDFPlumber and of the entire pipeline<br>- Finalize potential approaches to analysis | |
| Week 13 | - Complete analysis based on the plotted map entries<br>- Derive conclusions from the analysis and report findings | |
| Week 14 | - Complete the written deliverables and the GitHub repository<br>- Prepare the final presentation discussing the results | |
Methodology
During our work, we approached the problem in two ways: a more generalizable approach that could process any Guida Commerciale, and a more streamlined approach that requires more manual annotation to improve results. While we started with our first appr
Results
Limitations & Future Work
Limitations
OCR Results
Time and Money
GitHub Repository
Data Ingestion of Guide Commerciali
Acknowledgements
References