Introduction
Motivation
Project Plan and Milestones
Date
|
Task
|
Completion
|
By Week 3
|
- Brainstorm project ideas.
- Prepare slides for the initial project idea presentation.
|
✓
|
By Week 5
|
- Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods.
- Decide to focus on text processing.
- Select a subset of Europeana's "Newspaper collection" for our project.
- Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.
|
✓
|
By Week 6
|
- Each of us reads some pages of the journal to get an overall understanding of it.
- Find that the accuracy of the OCR results is unsatisfactory and decide to improve them before analyzing the text.
- Request the data.
|
✓
|
By Week 7
|
- Research OCR methods and find some for Italian italics.
- Get the text by web analysis.
- Use DeepL to translate the text from French to English and then back to French, and check the results.
- Reproduce the OCR method from the literature and find that recognition has improved.
|
✓
|
By Week 8
|
- Apply OCRopus to a small set of images.
- Use a grammar checker to analyze the result of OCRopus.
|
✓
|
By Week 9
|
- Prototype design.
- Database design.
|
✓
|
By Week 10
|
- Get access to Europeana's API.
- Use the API to extract the URL for each page of our selected newspaper.
- Download each page of the newspaper as an image using the extracted URLs.
|
✓
|
By Week 11
|
- Run OCR using the better model and the Kraken engine.
- Store the text we get in the database.
- Search for a grammar checker to optimize the text we get.
|
✓
|
By Week 12
|
- Use the newly selected grammar checker API to optimize the text.
- Use entropy to analyze the final text.
|
✓
|
By Week 13
|
- Build the website from our prototype.
- Use different text analysis methods (LDA, n-grams, and named entity recognition) to analyze the text.
|
✓
|
By Week 14
|
- Final report and presentation.
|
✓
|
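The entropy analysis listed for Week 12 can be illustrated with a minimal sketch. The plan does not say how entropy is computed, so this assumes character-level Shannon entropy in bits per character; the function name `char_entropy` is hypothetical and not taken from the project repository.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.

    Assumption: entropy is measured over single characters. The project may
    have used words or another unit instead.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed characters
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this reading, one possible use is to compare the value for a page before and after grammar correction: text with many spurious OCR characters tends to have a flatter character distribution and therefore a higher entropy.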
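Of the text analysis methods listed for Week 13, n-gram counting is the simplest to sketch. The example below is an illustration only: the tokenization (lowercasing and splitting on word characters) and the function name `word_ngrams` are assumptions, not the project's actual implementation.

```python
import re
from collections import Counter


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Count word-level n-grams in a text.

    Assumption: tokens are lowercase runs of word characters. In Python 3,
    the pattern \\w also matches accented letters such as those in French.
    """
    tokens = re.findall(r"\w+", text.lower())
    # Zip n shifted copies of the token list to form consecutive n-tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))
```

Calling `word_ngrams(page_text, 2).most_common(10)` on the text of one page would then list its ten most frequent word pairs.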
GitHub Repository
https://github.com/XinyiDyee/Europeana-Search-Engine
Reference