Revision as of 19:45, 20 December 2022

Introduction

Motivation

Project Plan and Milestones

Date	Task	Completion
By Week 3	Brainstorm project ideas. Prepare slides for the initial project idea presentation.	✓
By Week 5	Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods. Decide to focus on text processing. Select a subset collection from the "Newspaper collection" of Europeana for our project. Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.	✓
By Week 6	Each of us reads some pages of the journal to get an overall understanding of it. We find that the accuracy of the existing OCR results is unsatisfactory and decide to improve them before text analysis. Request the data.	✓
By Week 7	Research OCR methods and find some for Italian italics. Get the text by web analysis. Use DeepL to translate FR to EN and then EN back to FR, and check the results. Reproduce the OCR method from the literature and find that recognition has improved.	✓
By Week 8	Apply OCRopus to a small set of images. Use a grammar checker to analyze the result of OCRopus.	✓
By Week 9	Prototype design. Database design.	✓
By Week 10	Get access to Europeana's API. Use the API to extract the URL of each page of our chosen newspaper. Download each page as an image from the extracted URLs.	✓
By Week 11	Run OCR using the better model and the Kraken engine. Store the resulting text in the database. Search for a grammar checker to improve the text.	✓
By Week 12	Use the newly selected grammar checker API to improve the text. Use entropy to evaluate the final text.	✓
By Week 13	Build the website from our prototype. Use different text analysis methods (LDA, n-grams, and named entity recognition) to analyze the text.	✓
By Week 14	Final report and presentation.	✓
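
The plan does not say how entropy or n-grams were computed. As one plausible reading of the Week 12 and Week 13 tasks, the following minimal Python sketch measures the Shannon entropy of a text's character distribution and counts word n-grams. The function names `char_entropy` and `ngram_counts` and the whitespace tokenization are assumptions made for illustration, not the project's actual code.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution.

    Hypothetical helper: the plan only says "use entropy", so the choice of
    character-level entropy is an assumption.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams after lowercasing and splitting on whitespace.

    Hypothetical helper: a real pipeline would likely use a proper
    tokenizer and remove punctuation first.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

Comparing `char_entropy` on the OCR output before and after correction gives a rough signal of how noisy the character distribution is, while `ngram_counts(text, 2).most_common(10)` lists the most frequent bigrams in a page.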

Github Repository

https://github.com/XinyiDyee/Europeana-Search-Engine

@@ Line 21: / Line 21: @@
 * Decide to focus on text processing.
 * Select a subset collection from the "Newspaper collection" of Europeana for our project.
-* Check the content of "La clef du cabinet des princes de l'Europe" and roughly select 3 topics we may focus on.
+* Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.
 | align="center" | ✓
 |-
@@ Line 44: / Line 44: @@
 |By Week 8
 |
-* Apply ocropus to a small set of images.
+* Apply OCRopus to a small set of images.
+* Use a grammar checker to analyze the result of OCRopus.
 | align="center" | ✓
 |-
@@ Line 50: / Line 51: @@
 |By Week 9
 |
-* Preprocess the data. (ReOCR the images)
 * Prototype design.
 * Database design.
@@ Line 58: / Line 58: @@
 |By Week 10
 |
-* To be filled
+* Get Europeana's API
+* Use the API to extract the URL for each page of our specific newspaper.
-| align="center" |
+* Download each page of our specific newspaper as images using the URL we got.
+| align="center" | ✓
 |-
 |By Week 11
 |
-* Content analysis.
+* OCR using the better model and Kraken engine,
-| align="center" |
+* Store the text we get in the database.
+* Share for a grammar checker to optimize the text we get.
+| align="center" | ✓
 |-
 |By Week 12
 |
-* To be filled
+* Use new selected grammar checker API to optimize the text.
+* Use entropy to analyze the result of the final text.
-| align="center" |
+| align="center" | ✓
 |-
 |By Week 13
 |
-* Build the web.
+* Build the web from our prototype.
-| align="center" |
+* Use different text analysis methods: LDA, n-gram, and name entity, to analyze the text
+| align="center" | ✓
 |-
 |By Week 14
 |
-* Final report.
+* Final report and presentation.
-| align="center" |
+| align="center" | ✓
 |-

Europeana: A New Spatiotemporal Search Engine: Difference between revisions
