Introduction

Project Timeline & Milestones

Timeframe	Task	Completion
Week 4	Explore and define possible approaches to the problem; explore the existing Guida Commerciale documents	✓
Week 5	Establish a pipeline for the project; explore pricing plans for OpenAI GPT models and geocoding platforms; set up the GitHub repository	✓
Week 6	Start experimenting with different OCR models and potential alternatives, such as PDF text extraction libraries; attempt to cluster pages by page type using an unsupervised learning approach	✓
Week 7	Autumn Break	✓
Week 8	Create a new approach/pipeline that relies on basic domain knowledge to obtain better results; set up the semantic annotation platform, INCEpTION, for the new approach	✓
Week 9	Complete midterm presentation; run OCR with Pytesseract; extract text using PDFPlumber; compare the two text extraction approaches, choose the most suitable one, and pre-process the text output	✓
Week 10	Set up the existing model of annotations completed on the 1864 Guida Commerciale; use INCEpTION to complete semantic annotation and named-entity recognition of the 1853 Guida Commerciale; convert the annotations into a CSV file for further processing	✓
Week 11	Complete processing and cleaning of the CSV file: occupation broadcasting, logical segmentation, parish/church reconciliation; complete mapping of parishes to districts (sestiere)	✓
Week 12	Plot the district and number on a map for all the entries in the 1853 Guida Commerciale; obtain accuracy metrics, including the accuracy of PDFPlumber and of the entire pipeline; finalize potential approaches to analysis
Week 13	Complete analysis based on the plotted map entries; derive conclusions from the analysis and report findings
Week 14	Complete the written deliverables and the GitHub repository; prepare the final presentation discussing the results

Methodology

During our work, we approached the problem in two ways: a more generalizable approach that could process any Guida Commerciale, and a more streamlined approach that requires more manual annotation to improve results. While we started with our first appr

Data Ingestion of Guide Commerciale

Contents

Introduction

Project Timeline & Milestones

Methodology

Results

Limitations & Future Work

Limitations

OCR Results

Time and Money

Github Repository

Acknowledgements

References
