Revision as of 15:54, 18 December 2023

Introduction & Motivation

Deliverables

39,587 records related to postcards with image copyrights, along with their metadata, from the Europeana website.
OCR results of a sample set of 350 images containing text.
GPT-3.5 prediction results for a sample set of 350 images containing text, based on OCR results.
A high-quality, manually annotated Ground Truth for a sample set of 309 images.
GPT-3.5 prediction results for Ground Truth.
GPT-4 prediction results for Ground Truth.
An interactive webpage displaying the mapping of the postcards.
A GitHub repository containing all the code for the project.

Methodologies

Data collection

We collected the data with a web scraper built on the APIs provided by Europeana. First, we used the search API to obtain all records related to postcards with an open copyright status, which yielded 39,587 records. We then filtered these records using the record API, retaining only those whose metadata allowed direct image retrieval via a web scraper, which left 20,000 records in total. We organized this metadata, keeping only the attributes relevant to our research, such as the providing country, the providing institution, and potential coordinates. Finally, we drew a random sample from this metadata and downloaded the corresponding images locally for analysis.
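The filtering, attribute selection, and random sampling steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the metadata has already been fetched into a list of dictionaries, and the field names (`image_url`, `institution`, `latitude`, `longitude`) as well as the helper names are hypothetical, not Europeana's own schema.

```python
import random

# Hypothetical attribute names kept for analysis (not Europeana's real field names).
KEPT_FIELDS = ("id", "country", "institution", "latitude", "longitude", "image_url")


def has_direct_image(record):
    """Return True if the record carries a URL from which the image can be fetched directly."""
    url = record.get("image_url")
    return isinstance(url, str) and url.lower().startswith(("http://", "https://"))


def slim(record):
    """Keep only the attributes relevant to the analysis; missing ones become None."""
    return {field: record.get(field) for field in KEPT_FIELDS}


def sample_records(records, k, seed=0):
    """Filter to records with a direct image URL, then draw a reproducible random sample."""
    usable = [slim(r) for r in records if has_direct_image(r)]
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return rng.sample(usable, min(k, len(usable)))


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        {"id": "/1/a", "country": "France", "image_url": "https://example.org/a.jpg"},
        {"id": "/1/b", "country": "Italy", "image_url": None},
    ]
    print(sample_records(demo, k=1))
```

Fixing the seed is a design choice that makes the sample repeatable, so the same images can be downloaded again when later steps are rerun.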

OCR

Prediction using GPT

The Build of Ground Truth

Web Application

Result Assessment

Limitations & Future work

Project plan & milestones

Timeframe	Task	Completion
Week 4	Explore postcard search results on Europeana's website; study the Europeana API documentation and get an access key; extract postcard data using the Europeana API	✅
Week 5	Clean the data using metadata; analyze the Europeana postcard data; prepare sample image sets and explore prediction methods	✅
Week 6	Decide to focus on postcards with text; test and evaluate the effectiveness of multiple OCR models	✅
Week 7	Use OCR and NER for prediction; test and evaluate the effectiveness of multiple NER tools; explore alternative prediction methods	✅
Week 8	Introduce ChatGPT for prediction (OCR + GPT-3.5 + NER); try making predictions directly with GPT-4	✅
Week 9	Optimize the GPT-3.5 prompt for better results; compare the results of OCR + GPT-3.5 (optimized prompts) with those of GPT-4	✅
Week 10	Complete the pipeline for the entire prediction process; prepare a sample set to evaluate its performance	✅
Week 11	Explore visualization methods; refine and analyze the test set	✅
Week 12	Use the TA's annotation tool to build a ground truth; build the visualization platform	✅
Week 13	Test and refine the web application; analyze the results of the test set evaluation	✅
Week 14	Prepare the final report and presentation	✅

GitHub Repository

Europeana-mapping-postcards


Europeana: mapping postcards: Difference between revisions
