Europeana: mapping postcards
Introduction & Motivation
Deliverables
- 39,587 records related to postcards with image copyrights, along with their metadata, from the Europeana website.
- OCR results of a sample set of 350 images containing text.
- GPT-3.5 prediction results for a sample set of 350 images containing text, based on OCR results.
- A high-quality, manually annotated Ground Truth for a sample set of 309 images.
- GPT-3.5 prediction results for Ground Truth.
- GPT-4 prediction results for Ground Truth.
- An interactive webpage displaying the mapping of the postcards.
- The GitHub repository contains all the codes for the whole project.
Methodologies
Data collection
Using the APIs provided by Europeana, we used web scrapers to collect relevant data. Initially, we utilized the search API to obtain all records related to postcards with an open copyright status, resulting in 39,587 records. Subsequently, we filtered these records using the record API, retaining only those records whose metadata allowed for direct image retrieval via a web scraper, amounting to 20,000 records in total. We then organized this metadata, preserving only the attributes relevant to our research, such as the providing country, the providing institution, and potential coordinates. Employing a method of random sampling with this metadata, we downloaded some image samples locally for analysis.
Optical character recognition(OCR)
This project aims to accurately extract textual information from various types of postcards in the European region and further utilize this information for geographic location recognition. To address the diversity of languages and scripts across the European region, the project adopts a multilingual model to ensure coverage of multiple languages, thereby enhancing the comprehensiveness and accuracy of recognition.
PaddleOCR
PaddleOCR offers specialized models encompassing 80 minority languages, such as Italian and Bulgarian, which are particularly beneficial for this project.
In the project, postcards obtained from Europeana serve as the input for the original images (Fig. 1), and segmentation (Fig. 2) is conducted using these original images.
Based on the OCR results, we remove images that do not contain any textual information from the dataset.
Prediction using ChatGPT
Due to the suboptimal performance of applying NER directly on OCR results, as OCR may contain grammatical errors in recognition or the text on the postcard itself may lack the names of locations, we decided to introduce an LLM, like ChatGPT, to attempt this task. Using OpenAI APIs, we mainly explored two approaches: one was to use GPT-3.5 for location prediction based on OCR results, and the other was to directly use GPT-4 for predictions based on images. Additionally, we required ChatGPT to return a fixed JSON format object, including the predicted country and city, eliminating the need for NER. We found that both methods significantly improved upon previous efforts. Although GPT-4 showed better performance, as it also had the image itself as additional information, we discovered that after multiple optimizations of the GPT-3.5 prompt, its results were not much inferior to GPT-4. Moreover, considering the cost, using OCR results with an optimized prompt for GPT-3.5 is an economical method. Therefore, we use this as our main pipeline.
Construction of Ground Truth
To scientifically evaluate the effectiveness of our prediction pipeline, it is necessary to create a ground truth for testing. To minimize the occurrence of postcard backs, we selected IDs that contain only one image for testing. Due to the highly uneven distribution of postcard providers on Europeana, we stipulated that no more than 30 IDs from the same provider were included in our random sampling. After sampling randomly from 35,000 IDs, we obtained 535 IDs from 24 different providers. Through the OCR process, we identified 350 IDs with recognizable text on the image, and after manual screening, we found 309 IDs to be meaningful postcards. We used GPT-3.5 to predict the OCR results of these 309 IDs and obtained a preliminary set of predictions, which we refer to as a noisy test set, as it is likely that there are still errors from the OCR model.
With the help of VGG Image Annotator (VIA), we decided to manually annotate this sample set of 309 IDs.During the annotation process, we only marked the text printed on the postcards, adopting a uniform standard and not annotating any handwritten script added later to the postcards. Additionally, we designated the origin (country and city) of each postcard, combining the postcard itself and its metadata. For postcards that mention a place name but cannot be located, we marked the country or city of origin as undefined. For other postcards whose origin could not be determined from the available information, we marked the country or city of origin as null.
After completing the Ground Truth, we then used our GPT-3.5 pipeline to directly predict and assess the manually annotated correct text results of the Ground Truth. Simultaneously, we used GPT-4 to perform predictive assessments on the Ground Truth as a comparison, to better evaluate the effectiveness of our prediction pipeline.
Web Application
Result Assessment
Limitations & Future work
For postcards without text, we currently rely on predictions made by GPT-4, which can be expensive. Subsequently, we may need to explore more methods.
Project plan & Milestones
Project plan
Timeframe | Task | Completion |
---|---|---|
Week 4 |
|
✓ |
Week 5 |
|
✓ |
Week 6 |
|
✓ |
Week 7 |
|
✓ |
Week 8 |
|
✓ |
Week 9 |
|
✓ |
Week 10 |
|
✓ |
Week 11 |
|
✓ |
Week 12 |
|
✓ |
Week 13 |
|
✓ |
Week 14 |
|
✓ |