Europeana: mapping postcards

From FDHwiki
Revision as of 10:36, 19 December 2023

Introduction & Motivation

Deliverables

  • 39,587 postcard records with open image copyrights, along with their metadata, from the Europeana website.
  • OCR results for a sample set of 350 images containing text.
  • GPT-3.5 prediction results, based on the OCR output, for the sample set of 350 images containing text.
  • A high-quality, manually annotated Ground Truth for a sample set of 309 images.
  • GPT-3.5 prediction results on the Ground Truth set.
  • GPT-4 prediction results on the Ground Truth set.
  • An interactive webpage displaying the mapping of the postcards.
  • A GitHub repository containing all the code for the project.

Methodologies

Data collection

We collected the data through the APIs provided by Europeana. First, we used the Search API to retrieve all postcard records with an open copyright status, yielding 39,587 records. We then filtered these through the Record API, retaining only records whose metadata allowed direct image retrieval by a web scraper, amounting to 20,000 records in total. From this metadata we kept only the attributes relevant to our research, such as the providing country, the providing institution, and any available coordinates. Finally, we randomly sampled from these records and downloaded the corresponding images locally for analysis.
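As an illustration, the collection step might be sketched as below. The endpoint and the `reusability`, `cursor`, and `edmIsShownBy` names follow Europeana's public API, but the helper names, parameter defaults, and the filtering rule are our own assumptions, not the project's actual code:

```python
import urllib.parse

# Europeana Search API endpoint (public, requires a free API key)
SEARCH_API = "https://api.europeana.eu/record/v2/search.json"

def build_search_url(api_key, query="postcard", reusability="open",
                     rows=100, cursor="*"):
    """Build a cursor-paginated Search API request for openly licensed postcards."""
    params = {
        "wskey": api_key,
        "query": query,
        "reusability": reusability,  # restrict to open copyright statuses
        "rows": rows,
        "cursor": cursor,            # cursor pagination walks the full result set
    }
    return SEARCH_API + "?" + urllib.parse.urlencode(params)

def keep_record(item):
    """Retain only records whose metadata exposes a directly downloadable image."""
    # edmIsShownBy holds a direct link to the media file when one exists
    return bool(item.get("edmIsShownBy"))

url = build_search_url("YOUR_API_KEY")
```

Looping on the `nextCursor` value returned by each response would then walk through all matching records, after which `keep_record` drops those without a retrievable image.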

OCR

Prediction using ChatGPT

Applying NER directly to OCR results performed poorly: the OCR output may contain recognition errors, and the text on the postcard may not name any location at all. We therefore introduced an LLM, ChatGPT, for this task. Using the OpenAI APIs, we explored two approaches: using GPT-3.5 to predict locations from the OCR results, and using GPT-4 to predict directly from the images. In both cases we required ChatGPT to return a fixed JSON object containing the predicted country and city, which eliminates the need for a separate NER step. Both methods improved significantly on our previous attempts. GPT-4 performed better, since it also receives the image itself as additional information, but after several rounds of optimizing the GPT-3.5 prompt, its results were not much inferior to GPT-4's. Considering the cost, feeding OCR results to GPT-3.5 with an optimized prompt is an excellent method, so we adopted it as our main pipeline.
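The GPT-3.5 branch of the pipeline can be sketched as follows. The prompt wording, helper names, and fallback behaviour are illustrative assumptions; only the general idea, OCR text in, a fixed-format JSON object with country and city out, comes from the description above:

```python
import json

# Illustrative prompt: the project's actual optimized prompt is not reproduced here
PROMPT_TEMPLATE = (
    "The following text was extracted from a postcard by OCR and may contain "
    "recognition errors:\n\n{ocr_text}\n\n"
    "Predict where the postcard is from. Reply with only a JSON object of the "
    'form {{"country": "...", "city": "..."}}; use null when unsure.'
)

def build_prompt(ocr_text):
    """Wrap raw OCR output in a location-prediction instruction for GPT-3.5."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text)

def parse_prediction(reply):
    """Parse the model's JSON reply; fall back to nulls on malformed output."""
    try:
        data = json.loads(reply)
        return {"country": data.get("country"), "city": data.get("city")}
    except (json.JSONDecodeError, TypeError):
        return {"country": None, "city": None}
```

The prompt built here would be sent to the chat-completions endpoint via the OpenAI API, and `parse_prediction` turns each reply into a uniform record, so no NER pass is needed downstream.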

Building the Ground Truth

Web Application

Result Assessment

Limitations & Future work

For postcards without text, we currently rely on predictions made by GPT-4, which can be expensive. In future work, we may need to explore additional methods.

Project plan & Milestones

Project plan

Timeframe and tasks
Week 4
  • Explore postcard search results on Europeana's website
  • Study the Europeana API documentation and get an access key.
  • Extract data of postcards using the Europeana API
Week 5
  • Clean data using metadata.
  • Analyze the data of Europeana postcards
  • Prepare sample image sets and explore prediction methods
Week 6
  • Decide to focus on postcards with text
  • Test and evaluate the effectiveness of multiple OCR models
Week 7
  • Use OCR and NER for prediction
  • Test and evaluate the effectiveness of multiple NER tools
  • Explore alternative forecasting methods
Week 8
  • Introduce ChatGPT into the prediction pipeline (OCR + GPT-3.5 + NER)
  • Try to make predictions directly using GPT-4
Week 9
  • Optimize GPT-3.5 prompt for better results
  • Compare the results of OCR + GPT-3.5 (optimized prompts) to those of GPT-4.
Week 10
  • Complete the pipeline for the entire prediction process
  • Prepare a sample set to evaluate the effect
Week 11
  • Explore the visualization methods
  • Refine the test set and analyze it
Week 12
  • Use the TA's annotation tool for building a ground truth
  • Build the visualization platform
Week 13
  • Test and refine the Web application
  • Analyze the results of the test set evaluation
Week 14
  • Prepare the final report and presentation

Milestones

GitHub Repository

Europeana-mapping-postcards

References