Europeana: mapping postcards
Jingbang.liu (talk | contribs) |
Revision as of 15:54, 18 December 2023
Introduction & Motivation
Deliverables
- 39,587 records related to postcards with an open image copyright status, along with their metadata, collected from the Europeana website.
- OCR results of a sample set of 350 images containing text.
- GPT-3.5 prediction results for a sample set of 350 images containing text, based on OCR results.
- A high-quality, manually annotated Ground Truth for a sample set of 309 images.
- GPT-3.5 prediction results for Ground Truth.
- GPT-4 prediction results for Ground Truth.
- An interactive webpage displaying the mapping of the postcards.
- A GitHub repository containing all the code for the project.
Methodologies
Data collection
Using the APIs provided by Europeana, we wrote a web scraper to collect the relevant data. We first used the Search API to retrieve all records related to postcards with an open copyright status, which yielded 39,587 records. We then filtered these records using the Record API, keeping only those whose metadata allowed direct image retrieval by the scraper, which left 20,000 records. From this metadata we retained only the attributes relevant to our research, such as the providing country, the providing institution, and any available coordinates. Finally, we drew a random sample from this metadata and downloaded the corresponding images locally for analysis.
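The collection steps can be illustrated with a minimal Python sketch. It is not the project's actual scraper: the function names are hypothetical, the query string `postcard` is an assumption, and the field names (`country`, `dataProvider`, `edmIsShownBy`, `edmPlaceLatitude`, `edmPlaceLongitude`) are assumed to match the items returned by the Europeana Search API. For simplicity, the sketch filters on the search item itself rather than calling the Record API, and it performs no network requests.

```python
import random
from urllib.parse import urlencode

# Assumed Search API endpoint; check the Europeana documentation before use.
SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_search_url(api_key, cursor="*", rows=100):
    """Build one page request for openly licensed postcard records."""
    params = {
        "wskey": api_key,       # personal API key
        "query": "postcard",    # assumed query term
        "reusability": "open",  # restrict to open copyright status
        "rows": rows,
        "cursor": cursor,       # "*" starts cursor-based pagination
    }
    return SEARCH_URL + "?" + urlencode(params)


def first(value):
    """Return the first element of a list-valued field, or the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def extract_row(item):
    """Keep only the attributes relevant to the mapping task."""
    return {
        "id": item.get("id"),
        "country": first(item.get("country")),
        "institution": first(item.get("dataProvider")),
        "image_url": first(item.get("edmIsShownBy")),
        "lat": first(item.get("edmPlaceLatitude")),
        "lon": first(item.get("edmPlaceLongitude")),
    }


def keep_downloadable(rows):
    """Retain rows whose metadata exposes a direct image URL."""
    return [r for r in rows if r["image_url"]]


def sample_rows(rows, k, seed=0):
    """Draw a reproducible random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```

In use, each page of results would be passed through `extract_row`, filtered with `keep_downloadable`, and the pooled rows sampled with `sample_rows` before downloading images. Fixing the seed keeps the sample reproducible across runs.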
OCR
Prediction using GPT
The Build of Ground Truth
Web Application
Result Assessment
Limitations & Future work
Project plan & milestones
Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✅ |
Week 5 | | ✅ |
Week 6 | | ✅ |
Week 7 | | ✅ |
Week 8 | | ✅ |
Week 9 | | ✅ |
Week 10 | | ✅ |
Week 11 | | ✅ |
Week 12 | | ✅ |
Week 13 | | ✅ |
Week 14 | | ✅ |