Data Ingestion of Guide Commericiale: Difference between revisions
Revision as of 13:09, 28 November 2024
Introduction
Project Timeline & Milestones
Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Week 8 | | ✓ |
Week 9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | ✓ |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Methodology
We approached the problem in two ways: a more generalizable way that could process any guide commercial, and a more streamlined way that requires more manual annotation but produces better results. We started with the first approach and worked on it for the first couple of months, but came to realize that its margin of error was far too high, which led us to pivot to the second approach. The points below describe each step of both approaches.
Approach 1: General
Pipeline
- Divide pages into groups with usable data and groups without (not completed)
- Process pages with clean data by:
  - Converting a batch of pages into images
  - Pre-processing the images for OCR using CV2 and Pillow
  - Using Pytesseract for OCR to convert each image to a string
- Perform Named Entity Recognition by:
  - Prompting GPT-4o to identify names, professions, and addresses, and to turn these into entries in a table format
  - Standardizing addresses that include abbreviations, shorthand, etc.
  - Appending “Venice, Italy” to the end of each address to ensure a feasible location
- Geocode addresses by:
  - Using a geocoding API like LocationHQ to convert each address to coordinates
  - Taking the top result of the search and adding it to the map
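The address standardization and geocoding steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the abbreviation table, the function names `standardize_address` and `geocode_top_result`, and the `search` callable (a stand-in for whichever geocoding API is used) are all assumptions made for the example.

```python
# Hypothetical abbreviation table for illustration only; the shorthand
# actually found in the guides may differ and would need to be collected.
ABBREVIATIONS = {
    "Fond.": "Fondamenta",
    "C.": "Calle",
}


def standardize_address(raw, suffix="Venice, Italy"):
    """Expand known abbreviations and append the city suffix if missing."""
    tokens = raw.split()
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    address = " ".join(expanded).strip(" ,")
    if not address.lower().endswith(suffix.lower()):
        address = f"{address}, {suffix}"
    return address


def geocode_top_result(address, search):
    """Return the first candidate from a geocoder, or None if there is none.

    `search` is a hypothetical callable that takes an address string and
    returns a list of (latitude, longitude) candidates, best match first.
    """
    results = search(address)
    return results[0] if results else None
```

Passing the geocoder in as a callable keeps the sketch independent of any particular service. Taking only the top result is simple, but a wrong first match is silently accepted, which is one way the margin of error can grow.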
Explanation
Approach 2: Streamlined
Pipeline Overview
This approach outlines a systematic process for extracting, annotating, and analyzing data from historical Venetian guide commercials. The aim is to convert unstructured document data into a clean, structured dataset that supports geographic mapping and data analysis.
Pipeline
- Manually inspect the entire document for page ranges with digestible data
- Perform text extraction using PDF Plumber
- Perform semantic annotation with INCEpTION, separating pages into entries with first names, last names, occupations, addresses, etc.
- Clean and format data?
- What were the steps?
- Map parishes to provinces using a dictionary
- Plot each entry on a map using its parish and number
- Perform data analysis
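The dictionary-based parish mapping step above could look something like the sketch below. The mapping contents are placeholders rather than real parish or province names, and the entry fields (`parish`, `number`) and function names are assumptions based on the pipeline description, not the project's actual schema.

```python
from typing import Optional

# Placeholder mapping for illustration; the real keys and values would
# come from the annotated entries and a manually curated lookup table.
PARISH_TO_PROVINCE = {
    "parish a": "Province X",
    "parish b": "Province Y",
}


def lookup_province(parish: str) -> Optional[str]:
    """Normalize case and whitespace, then look the parish up."""
    key = " ".join(parish.lower().split())
    return PARISH_TO_PROVINCE.get(key)


def enrich_entry(entry: dict) -> dict:
    """Return a copy of the entry with a 'province' field added.

    Unknown parishes get None so they can be reviewed by hand
    instead of being silently dropped.
    """
    enriched = dict(entry)
    enriched["province"] = lookup_province(entry.get("parish", ""))
    return enriched
```

For example, `enrich_entry({"parish": "  Parish  A ", "number": "1234"})` yields an entry whose `province` is `"Province X"`, while an unrecognized parish yields `None`.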
Explanation
Results
Limitations & Future Work
Limitations
OCR Results
Time and Money
GitHub Repository