Data Ingestion of Guide Commericiale: Difference between revisions

From FDHwiki
Jump to navigation Jump to search
Line 107: Line 107:


=== Pipeline ===
=== Pipeline ===
# Divide pages into groups with usable data and ones without (never got done)
# Process pages by:
# Process pages with clean data by:
## Convert batch of pages into images
## Convert batch of pages into images
## Pre-process images for OCR using CV2 and Pillow
## Pre-process images for OCR using CV2 and Pillow

Revision as of 13:35, 28 November 2024

Introduction

Project Timeline & Milestones

Timeframe Task Completion
Week 4
  • Explore and define possible approaches to the problem
  • Explore the existing Guida Commerciale documents
Week 5
  • Establish a pipeline for the project
  • Explore pricing plans for OpenAI GPT models and Geocoding platforms
  • Set up the GitHub repository
Week 6
  • Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
  • Attempt to cluster pages based on different page types using an unsupervised learning approach
Week 7
  • Autumn Break
Week 8
  • Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
  • Set up the semantic annotation platform, INCEpTION, for the new approach
Week 9
  • Complete midterm presentation
  • Run OCR with Pytesseract
  • Extract text using PDFPlumber
  • Compare the two text extraction approaches, choose the most suitable approach, and pre-process the text output
Week 10
  • Set up existing model of annotations completed on the 1864 Guida Commericiale
  • Use INCEpTION to complete semantic annotation and named-entity recognition of 1853 Guida Commericiale
  • Convert this into a CSV file for further processing
Week 11
  • Complete processing and cleaning of the CSV file
    • Occupation Broadcasting
    • Logical Segmentation
    • Parish/Church Reconciliation
  • Complete mapping of parishes to districts (sestiere)
Week 12
  • Plot the district and number on a map for all the entries in the 1853 Guida Commericiale
  • Obtain accuracy metrics including accuracy of PDFPlumber and the entire pipeline
  • Finalize potential approaches to analysis
Week 13
  • Complete analysis based on the map entries plotted
  • Derive conclusions based on the analysis and report findings
Week 14
  • Complete the written deliverables and the GitHub repository
  • Prepare the final presentation discussing the results

Methodology

During our work, we approached our problem in two ways: in a more generalizable way that could process any guide commercial, and a more streamlined way that would require more manual annotation to improve results. We started with our first approach, [and worked on this for the first couple months], but came to realize its margin for error was for too high, making us pivot toward our second approach. The points below highlight each step of the way for both these approaches.

Approach 1: General

Pipeline Overview

Pipeline Overview This process aims to extract, annotate, standardize, and map data from historical documents. By leveraging OCR, named entity recognition (NER), and geocoding, the pipeline converts unstructured text into geographically and semantically meaningful data.

Pipeline

  1. Process pages by:
    1. Convert batch of pages into images
    2. Pre-process images for OCR using CV2 and Pillow
    3. Use Pytesseract for OCR to convert image to string
  2. Perform Named Entity Recognition by:
    1. Prompt GPT-4o to identify names, professions, and addresses, and to turn this into entries in a table format.
    2. Standardize addresses that include abbreviations, shorthand, etc.
    3. Append “Venice, Italy” to the end of addresses to ensure a feasible location
  3. Geocode addresses by:
    1. Use a geocoding API like LocationHQ to convert the address to coordinates
    2. Take top result of search, and append to the map

Perform OCR

To start, we used PDF Plumber for extracting pages from our document in order to turn them into images. With the goal of extracting text from these pages, we then used libraries such as CV2 and Pillow in order to pre-process each image with filtering and thresholding to improve clarity, and lastly performed OCR using Pytesseract, tweaking certain settings to accommodate the old scripture better.

Named Entity Recognition

After having extracted all text from a page, we can then prompt GPT-4o to identify names, professions, and addresses, and turn our text into a table format with an entry for each person. Furthermore, since these documents use abbreviations and shorthand for simplifying their writing process, we need to undo these by replacing them with their full meaning. Finally, to make geocoding in the future possible, we need to specify that each entry belongs to Venice by adding "Venice, Italy" to the end of each address.

Geocoding

For geocoding, we found LocationHQ to be a feasible solution for our cause.

Approach 2: Streamlined

Pipeline Overview

This approach outlines a systematic process for extracting, annotating, and analyzing data from historical Venetian guide commercials. The aim is to convert unstructured document data into a clean, structured dataset that supports geographic mapping and data analysis.

Pipeline

  1. Manually inspect entire document for page ranges with digestible data
  2. Perform text extraction using PDF Plumber
  3. Semantic annotation with INCEpTION, separating our pages into entries with first names, last names, occupations, addresses, etc.
  4. Clean and format data?
    1. What were the steps?
  5. Map parish to provinces using dictionary
  6. Plot on a map using parish and number from entry
  7. Perform data analysis.

Explanation

Results

Limitations & Future Work

Limitations

OCR Results

Time and Money

Github Repository

Data Ingestion of Guide Commericiale

Acknowledgements

References