Europeana: A New Spatiotemporal Search Engine

From FDHwiki

Revision as of 19:45, 20 December 2022

Introduction

Motivation

Project Plan and Milestones

Date Task Completion
By Week 3
  • Brainstorm project ideas.
  • Prepare slides for the initial project idea presentation.
By Week 5
  • Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods.
  • Decide to focus on text processing.
  • Select a subset collection from the "Newspaper collection" of Europeana for our project.
  • Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.
By Week 6
  • Each of us reads some pages of the journal to get an overall understanding of it.
  • Find that the accuracy of the existing OCR results is unsatisfactory and decide to improve them before text analysis.
  • Request the data.
By Week 7
  • Research OCR methods and find some for Italian italics.
  • Get text by web analysis.
  • Use DeepL to translate the French text to English and then back to French, and check the results.
  • Reproduce the OCR method from the literature and find that recognition has improved.
By Week 8
  • Apply OCRopus to a small set of images.
  • Use a grammar checker to analyze the result of OCRopus.
By Week 9
  • Prototype design.
  • Database design.
By Week 10
  • Get access to Europeana's API.
  • Use the API to extract the URL for each page of our specific newspaper.
  • Download each page of our specific newspaper as an image using the extracted URLs.
By Week 11
  • Run OCR using the better model and the Kraken engine.
  • Store the resulting text in the database.
  • Search for a grammar checker to optimize the resulting text.
By Week 12
  • Use the newly selected grammar checker API to optimize the text.
  • Use entropy to analyze the final text.
By Week 13
  • Build the website from our prototype.
  • Use different text analysis methods (LDA, n-grams, and named entity recognition) to analyze the text.
By Week 14
  • Final report and presentation.
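
The Week 12 entropy step could be sketched as follows. The plan does not say how entropy is computed, so this is only an illustration that assumes character-level Shannon entropy; the function name `shannon_entropy` and the sample strings are hypothetical and not taken from the project code.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the character-level Shannon entropy of `text`, in bits per character.

    Assumption: entropy is measured over the distribution of single characters.
    The project may instead use words or n-grams.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical samples: a clean phrase and a version with OCR-like noise.
    clean = "les princes de l'Europe"
    noisy = "1es pr!nc3s d€ l'Eur0pe"
    print(f"clean: {shannon_entropy(clean):.3f} bits/char")
    print(f"noisy: {shannon_entropy(noisy):.3f} bits/char")
```

Comparing this value before and after grammar correction, or between OCR models, gives one rough signal of how the character distribution changes. It does not measure OCR accuracy directly.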

Github Repository

https://github.com/XinyiDyee/Europeana-Search-Engine

Reference