Data Ingestion of Guide Commericiale
Introduction
Project Timeline & Milestones
| Timeframe | Task | Completion |
|---|---|---|
| Week 4 | | ✓ |
| Week 5 | | ✓ |
| Week 6 | | ✓ |
| Week 7 | | ✓ |
| Week 8 | | ✓ |
| Week 9 | | ✓ |
| Week 10 | | ✓ |
| Week 11 | | ✓ |
| Week 12 | | |
| Week 13 | | |
| Week 14 | | |
Methodology
During our work, we approached the problem in two ways: a more generalizable pipeline that could process any guide commerciale, and a more streamlined pipeline that requires more manual annotation but produces better results. We started with the first approach and worked on it for the first couple of months, but came to realize its margin for error was far too high, which made us pivot toward the second approach. The sections below describe each step of both approaches.
Approach 1: General
Pipeline Overview
This process aims to extract, annotate, standardize, and map data from historical documents. By leveraging OCR, named entity recognition (NER), and geocoding, the pipeline converts unstructured text into geographically and semantically meaningful data.
Pipeline
- Process pages:
  - Convert a batch of pages into images
  - Pre-process the images for OCR using CV2 and Pillow
  - Use Pytesseract to OCR each image into a string
- Perform named entity recognition:
  - Prompt GPT-4o to identify names, professions, and addresses, and turn them into entries in a table format
  - Standardize addresses that use abbreviations, shorthand, etc.
  - Append "Venice, Italy" to the end of each address to ensure a resolvable location
- Geocode addresses:
  - Use a geocoding API such as LocationHQ to convert each address into coordinates
  - Take the top result of the search and add it to the map
Divide Up Pages
Since we wanted this solution to work for any guide commerciale, not just the ones provided to us, we intended to develop a way to parse through a document and identify the pages that contained usable data. While we built a working solution that could weed out pages with no information at all, we never found a reliable way to identify pages with the specific type of data we wanted. One idea was to train a model to recognize patterns within each page and group pages based on them, but due to timeline constraints we decided against it. As a result, we never finalized this step for approach 1 and proceeded with the rest of the pipeline.
Perform OCR
To start, we used PDF Plumber to extract pages from the document and turn them into images. To get text out of these pages, we then used libraries such as CV2 and Pillow to pre-process each image with filtering and thresholding to improve clarity, and finally performed OCR with Pytesseract, tweaking certain settings to better accommodate the old typeface.
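The snippet below is a minimal sketch of this step. It assumes pdfplumber, OpenCV, NumPy, Pillow, and pytesseract are installed and that the Italian Tesseract language data is available; the file name, page number, resolution, and Tesseract options are illustrative rather than the exact values we used.

```python
import cv2
import numpy as np
import pdfplumber
import pytesseract


def ocr_page(pdf_path: str, page_number: int) -> str:
    """Render one PDF page to an image, pre-process it, and run OCR."""
    with pdfplumber.open(pdf_path) as pdf:
        # .to_image() renders the page; .original is the underlying PIL image.
        pil_image = pdf.pages[page_number].to_image(resolution=300).original

    # Grayscale plus Otsu thresholding to sharpen the old print before OCR.
    gray = cv2.cvtColor(np.array(pil_image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # "ita" assumes the Italian Tesseract data is installed; --psm 6 treats the
    # page as a single uniform block of text.
    return pytesseract.image_to_string(binary, lang="ita", config="--psm 6")


if __name__ == "__main__":
    print(ocr_page("guida_commerciale.pdf", 95))  # hypothetical file and page
```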
Named Entity Recognition
After extracting all text from a page, we prompt GPT-4o to identify names, professions, and addresses, and to turn the text into a table with one entry per person. Since these documents use abbreviations and shorthand to keep entries short, we also need to expand these back to their full forms. Finally, to make geocoding possible later on, we specify that each entry belongs to Venice by appending "Venice, Italy" to the end of each address.
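As a rough illustration, and not the exact prompt or schema we used, the sketch below asks GPT-4o for semicolon-separated name/profession/address triples through the OpenAI Python client and appends the city to each address; the prompt wording, output format, and field names are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract every person listed in the following directory text. "
    "Return one line per person in the format: name; profession; address. "
    "Expand abbreviations and shorthand to their full form.\n\n{text}"
)


def extract_entries(page_text: str) -> list[dict]:
    """Ask GPT-4o for name/profession/address triples and tag them with the city."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(text=page_text)}],
    )
    entries = []
    for line in response.choices[0].message.content.splitlines():
        parts = [part.strip() for part in line.split(";")]
        if len(parts) != 3:
            continue  # skip headers or malformed lines
        name, profession, address = parts
        entries.append({
            "name": name,
            "profession": profession,
            # Append the city so the address can be geocoded later.
            "address": f"{address}, Venice, Italy",
        })
    return entries
```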
Geocoding
For geocoding, we found LocationHQ to be a workable solution for our purposes, successfully mapping addresses to nearby locations within Venice. To integrate it into the project, we queried each address in our table. While some queries resolved to a single address, others returned several possible options. Our heuristic in those cases was simply to pick the top option, as it appeared to be the closest match each time. Finally, after gathering coordinates for the addresses, we plotted them onto a visual map, although we ultimately did not proceed with this map.
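A hedged sketch of this step is shown below. The endpoint URL, parameter names, and response shape are placeholders rather than the actual API details, but most forward-geocoding search APIs return a ranked list of candidates in a similar way.

```python
import requests

GEOCODE_URL = "https://geocoder.example.com/v1/search"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder key


def geocode(address: str) -> tuple[float, float] | None:
    """Query the geocoding API and keep the top-ranked match, if any."""
    response = requests.get(
        GEOCODE_URL,
        params={"key": API_KEY, "q": address, "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    if not results:
        return None
    top = results[0]  # heuristic from the text above: take the first match
    return float(top["lat"]), float(top["lon"])
```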
Approach 2: Streamlined
Pipeline Overview
This approach outlines a systematic process for extracting, annotating, and analyzing data from historical Venetian guide commerciali. The aim is to convert unstructured document data into a clean, structured dataset that supports geographic mapping and data analysis.
Pipeline
- Manually inspect the entire document for page ranges with usable data
- Perform text extraction using PDF Plumber
- Perform semantic annotation with INCEpTION, splitting each page into entries with first names, last names, occupations, addresses, etc.
- Clean and format the extracted data
- Map each parish to its province using a dictionary (see the sketch after this list)
- Plot each entry on a map using its parish and street number
- Perform data analysis
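As a minimal sketch of the parish-to-province mapping step referenced above, the snippet below uses pandas and a lookup dictionary; the dictionary entries, column names, and example rows are hypothetical and only illustrate the lookup.

```python
import pandas as pd

# Hypothetical excerpt of the parish-to-province lookup; the real dictionary
# is built from the annotated entries and is much longer.
PARISH_TO_PROVINCE = {
    "S. Marco": "Venezia",
    "S. Polo": "Venezia",
    "S. Croce": "Venezia",
}


def add_province(entries: pd.DataFrame) -> pd.DataFrame:
    """Attach a province column by looking each parish up in the dictionary."""
    entries = entries.copy()
    entries["province"] = entries["parish"].map(PARISH_TO_PROVINCE)
    return entries


# Example usage with made-up rows:
df = pd.DataFrame([
    {"last_name": "Rossi", "occupation": "orefice", "parish": "S. Marco", "number": "1234"},
    {"last_name": "Bianchi", "occupation": "sarto", "parish": "S. Polo", "number": "567"},
])
print(add_province(df))
```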
Find page ranges of interest
The main difference between this approach and approach 1 is that instead of relying on a program to group the pages, we manually went through the document and found the range of pages that was of interest to us. This let us go into the next steps of the pipeline with confidence, knowing we did not need to deal with the uncertainty of whether a page contained irrelevant data.
Perform Text Extraction
After testing both OCR and direct text extraction, we found that PDF Plumber's function for extracting text straight from the PDF worked much more reliably. Because of this, we used it for all of our pages, producing output similar to the OCR but with much higher accuracy. This also removed the need for image pre-processing, making the pipeline much simpler.
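A short sketch of this step is shown below, assuming pdfplumber; the file name and page range are placeholders.

```python
import pdfplumber


def extract_page_range(pdf_path: str, first: int, last: int) -> list[str]:
    """Pull the embedded text layer for each page in the chosen (0-based) range."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() reads the PDF's own text layer, so no OCR or
        # image pre-processing is needed.
        return [pdf.pages[i].extract_text() or "" for i in range(first, last + 1)]


pages = extract_page_range("guida_commerciale.pdf", 95, 150)  # hypothetical file and range
```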
Semantic Annotation
For semantic annotation, we switched from GPT-4o to INCEpTION, largely because previous projects had already trained a model for extracting names, addresses, and professions from similar text documents. Another major benefit was that we no longer relied on an external API for this step, which let us bypass the cost limits on our requests.
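Since INCEpTION is operated through its web interface, the code side of this step mostly consists of reading the exported annotations back in. The sketch below assumes a simple token-and-label tab-separated export with hypothetical label names; INCEpTION's real export formats (e.g. WebAnno TSV or CoNLL variants) carry more structure than this.

```python
from collections import defaultdict


def read_annotations(tsv_path: str) -> dict[str, list[str]]:
    """Group annotated tokens by label from a token<TAB>label export.

    The two-column layout and the label names are assumptions for
    illustration only.
    """
    spans = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split("\t")
            if len(parts) < 2 or parts[0].startswith("#"):
                continue
            token, label = parts[0], parts[1]
            if label != "O":  # "O" marks tokens outside any entity
                spans[label].append(token)
    return dict(spans)


# e.g. read_annotations("page_095_annotations.tsv") might yield something like
# {"LAST_NAME": [...], "FIRST_NAME": [...], "OCCUPATION": [...], "ADDRESS": [...]}
```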
Data Clean Up
The data clean-up steps have not yet been written up.
Plot on a Map
This step has not yet been completed.
Data Analysis
This step has not yet been completed.
Results
Limitations & Future Work
Limitations
OCR Results
Time and Money
Github Repository