Europeana: A New Spatiotemporal Search Engine

From FDHwiki
Revision as of 19:45, 20 December 2022 by Xingchen.li (talk | contribs)

Introduction

Motivation

Project Plan and Milestones

Date Task Completion
By Week 3
  • Brainstorm project ideas.
  • Prepare slides for initial project idea presentation.
By Week 5
  • Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods.
  • Decide to focus on text processing.
  • Select a subset collection from the "Newspaper collection" of Europeana for our project.
  • Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.
By Week 6
  • Each of us reads some pages of the journal to get an overall understanding of it.
  • Find that the accuracy of the existing OCR results is unsatisfactory and decide to improve them before analyzing the text.
  • Request the data.
By Week 7
  • Research OCR methods and find some for Italian italics.
  • Get the text by web analysis.
  • Use DeepL to translate the text from French to English and back to French, then check the results.
  • Reproduce the OCR method from the literature and find that recognition has improved.
By Week 8
  • Apply OCRopus to a small set of images.
  • Use a grammar checker to analyze the result of OCRopus.
By Week 9
  • Prototype design.
  • Database design.
By Week 10
  • Get access to Europeana's API.
  • Use the API to extract the URL for each page of our specific newspaper.
  • Download each page of our specific newspaper as an image using the URLs we obtained.
By Week 11
  • Run OCR using the better model and the Kraken engine.
  • Store the text we get in the database.
  • Search for a grammar checker to optimize the text we get.
By Week 12
  • Use the newly selected grammar checker API to optimize the text.
  • Use entropy to analyze the result of the final text.
By Week 13
  • Build the website from our prototype.
  • Use different text analysis methods (LDA, n-grams, and named-entity recognition) to analyze the text.
By Week 14
  • Final report and presentation.
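
Two of the analysis steps in the plan above, the entropy check (Week 12) and the n-gram counts (Week 13), can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code. The page does not say which entropy definition was used, so character-level Shannon entropy is an assumption here, and the function names `char_entropy` and `word_ngrams` are hypothetical.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.

    Assumption: entropy is measured over single characters; the project
    may have used a different unit (e.g. words).
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Count word n-grams after lowercasing and splitting on whitespace."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    sample = "la clef du cabinet des princes de l'europe"
    print(char_entropy(sample))
    print(word_ngrams(sample, 2).most_common(3))
```

Comparing `char_entropy` on the same page before and after OCR correction gives one rough signal of how much the character distribution changed. It does not measure accuracy on its own.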

GitHub Repository

https://github.com/XinyiDyee/Europeana-Search-Engine

Reference