Introduction and Motivation

1. search engine kraken 2. model ocr17

Europeana, a repository of Europe’s digital cultural heritage, covers themes such as art, photography, and newspapers. Because Europeana spans such diverse topics, it is difficult to present digital materials in a way that suits each kind of content. Searching for a specific topic takes several steps, and the results may still not match the user's intent. After studying the structure of Europeana in depth, we decided to build a new search engine that presents resources according to their content. Given the time available and the size of our group, we chose newspapers as the theme for our engine. To narrow the task further, we selected a single title, La clef du cabinet des princes de l'Europe, as our target.

La clef du cabinet des princes de l'Europe was the first magazine in Luxembourg. It appeared monthly from July 1704 to July 1794. Europeana holds 1,317 issues of the magazine, each between 75 and 85 pages long. To reduce the data to a scale that our laptops could handle, we randomly selected 8,000 pages covering the whole time span of the magazine.

To present this magazine well in our engine, our work consists mainly of OCR, text analysis, database design, and webpage design.

OCR (optical character recognition) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. Converting physical material into digital form not only helps preserve historical and cultural heritage but also makes computational analysis of the material possible. In our work, we used OCR to convert the page images of the magazine to text and stored the text in the database, which makes it much easier to process further.

For the text analysis part of our work, we applied three methods to the resulting text: named entity recognition, LDA, and n-grams.

To present the magazine, we developed a webpage that provides the search and analysis functions. The webpage aims to be interactive and to give users an efficient way to reach the content they are looking for.

Deliverables

The 8,000 page images of La clef du cabinet des princes de l'Europe (July 1704 to July 1794), downloaded from Europeana's website.
The OCR results for the 8,000 pages in text format.
The dataset containing the text and the results of text analysis based on LDA, named entity recognition, and n-grams.
The webpage presenting the contents and analysis results for La clef du cabinet des princes de l'Europe.
The GitHub repository containing all the code for the project.

Methodologies

This project has three main parts: text processing, database development, and the web application. The three parts were improved together in a synergetic process. The toolkit consists of Python for text processing and the web application, MySQL for database development, and Flask as the web framework. The final dataset covers 100 newspaper issues (7,950 pages) in four versions: images, text from Europeana, text after OCR, and text after OCR and grammar checking.

The synergetic process

Text processing

Data acquisition

Using the API provided by Europeana's staff, we acquired the relevant data with a web crawler. We first get the unique identifier of each issue, then use it to get the image URL and the OCR text provided by Europeana. We also get the publication date and the page number of every image, which lets us locate each page and retrieve it later. The data is stored in <Title, Year, Month, Page, Identifier, Image_url, Text> format. The crawling result is shown below.

1317 Issues
- Number of pages per issue: roughly 80 pages
- Number of words per page: roughly 100 words
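
As a minimal sketch of the <Title, Year, Month, Page, Identifier, Image_url, Text> record format described above, each crawled page could be held in a small Python structure like the one below. The class name, the helper function, the identifiers, and the URLs are all hypothetical placeholders; the actual crawler and storage code may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One crawled page, mirroring the stored field order."""
    title: str
    year: int
    month: int
    page: int
    identifier: str   # issue identifier (placeholder values below)
    image_url: str    # placeholder URLs below, not real Europeana URLs
    text: str         # OCR text as provided by the source


def chronological_key(record):
    """Sort pages by publication date, then by page number."""
    return (record.year, record.month, record.page)


TITLE = "La clef du cabinet des princes de l'Europe"
records = [
    PageRecord(TITLE, 1704, 8, 1, "issue-b", "https://example.org/issue-b/1.jpg", "..."),
    PageRecord(TITLE, 1704, 7, 2, "issue-a", "https://example.org/issue-a/2.jpg", "..."),
    PageRecord(TITLE, 1704, 7, 1, "issue-a", "https://example.org/issue-a/1.jpg", "..."),
]

ordered = sorted(records, key=chronological_key)
for record in ordered:
    print(record.year, record.month, record.page, record.identifier)
```

Keeping the date and page number as separate integer fields is what makes it straightforward to locate a page or to retrieve a date range later.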

Optical character recognition (OCR)

Ground truth and the model

The reliability of OCR models depends on both the quantity and the quality of training data. Quantity needs to be produced and made freely available to other scholars. Quality, on the other hand, needs to be properly defined, since philological traditions vary from one place to another and from one period to another. Successful recognition of this type of newspaper requires a model that targets 18th-century French and is trained on a dataset of sufficient quality and quantity. Therefore, the recognition model used in this project is the one trained by OCR17.

1. Corpus building: The training data is selected according to two main categories of information: bibliographical (printing date and place, literary genre, author) and computational (size and resolution of the images). Regarding dates, prints are diachronically distributed over the century, with special attention to books printed between 1620 and 1700. Regarding genre, the result can be seen as a two-tier corpus, with a primary tier consisting of literary texts (drama, poetry, novels, ...) and a secondary tier made up of scientific works (medicine, mechanics, physics, ...).

Figure 3: Distribution of the prints in the training corpus per decade	Figure 4: Distribution of the prints in the training corpus per genre

The corpus is deliberately imbalanced for two main reasons. On the one hand, dramatic texts tend to be printed in italics at the beginning of the 17th century. On the other hand, they traditionally use capital letters to indicate the name of the speaker, which is an easy way to increase the number of such rarer glyphs and also helps in dealing with highly complex layouts.

At the same time, although low image resolution tends to cause recognition errors, the model is able to handle low-resolution images properly.

2. Transcription rules:

3. Model

OCR Procedure

The images for OCR are crawled through the Europeana API. Their resolution is 72 dpi, and they are in grayscale with a white background. Such a low resolution introduces significant changes in the shape of letters. However, only a few lines in the images are blurry or affected by glare, so noise is not a big problem during OCR. Moreover, thanks to the simple and clear layout of the newspaper, the segmentation results are quite good.

Based on the above, the first step is binarization, which converts grayscale images into black-and-white (BW) images. Comparing the BW images with the originals shows that characters in the binarized images are fainter and less distinct. The original images are therefore used for segmentation.

Figure 3: Original Image	Figure 4: Binarized Image
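
To illustrate what a binarization step does, here is a minimal pure-Python sketch based on Otsu's method. This is an assumption made for illustration only: the report does not state which binarization algorithm was applied, and Kraken ships its own binarizer. The function names and the tiny sample of pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg          # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]


# Hypothetical flattened grayscale patch: dark ink next to light paper.
sample = [20, 30, 25, 200, 210, 220]
threshold = otsu_threshold(sample)
print(threshold, binarize(sample, threshold))
```

A single global threshold like this discards the intermediate grey levels along character edges, which is one plausible reason why characters can look less distinct after binarization on low-resolution scans.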

The next step is to segment pages into lines and regions. Since the whole OCR procedure is carried out with the Kraken engine, page segmentation is performed by its default trainable baseline segmenter, which can detect both lines of different types and regions.

Finally, recognition requires the grayscale images, their page segmentation, and the model file. The recognized records are serialized and output as a text file.

The OCR procedure

Grammar checker

To improve the OCR results, this project used a grammar checker API to refine the text. After we send a request to the grammar checker's server, it returns a JSON file that contains all the suggested modifications for the given text. Using the offset and length information in the JSON file, we can locate the word to be modified in the original text. For every modification, we replaced the original word with the first suggested value.
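
The replacement step can be sketched as follows. The shape of the response used here (a list of matches, each with "offset", "length", and a "replacements" list of {"value": ...} entries) is an assumption, since the report does not name the grammar checker or show its JSON; the function name and the sample sentence are hypothetical as well.

```python
def apply_corrections(text, matches):
    """Replace each flagged span with its first suggested value."""
    corrected = text
    # Work from the end of the text backwards so that earlier offsets
    # stay valid even when a replacement changes the text length.
    for match in sorted(matches, key=lambda m: m["offset"], reverse=True):
        replacements = match.get("replacements", [])
        if not replacements:
            continue  # nothing suggested, keep the original word
        start = match["offset"]
        end = start + match["length"]
        corrected = corrected[:start] + replacements[0]["value"] + corrected[end:]
    return corrected


# Hypothetical OCR output and checker response.
sample_text = "Le roi a recu les ambasadeurs."
sample_matches = [
    {"offset": 9, "length": 4,
     "replacements": [{"value": "reçu"}, {"value": "reçue"}]},
    {"offset": 18, "length": 11,
     "replacements": [{"value": "ambassadeurs"}]},
]

print(apply_corrections(sample_text, sample_matches))
```

Applying the matches from right to left is a design choice: a replacement that is longer or shorter than the original word would otherwise shift the offsets of every match that follows it.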

Text Analysis

Named entity
N-gram

Database development

Local MySQL database setup

MySQL database design

MySQL interaction

Webpage applications

Tools

Flask is a Python framework for building web apps, known for being small, lightweight, and simple. MySQL is a database system widely used for developing web-based software applications. We use Flask to build the web front end and MySQL as the local database it connects to. This lets us retrieve data from the local server and present it on the webpage.

Feature Design

Retrieval Method

Interface

Quality Assessment

Subjective Assessment

Objective Assessment

origin 0.56
reocr 0.57
grammar 0.61

Limitations

Project Plan and Milestones

Date	Task	Completion
By Week 3	Brainstorm project ideas. Prepare slides for initial project idea presentation.	✓
By Week 5	Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods. Decide to focus on text processing. Select a subset collection from the "Newspaper collection" of Europeana for our project. Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.	✓
By Week 6	Each of us reads some pages of the journal to get an overall understanding of it. We find that the accuracy of the existing OCR results is unsatisfactory and decide to improve them before text analysis. Request data.	✓
By Week 7	Research OCR methods and find some methods for Italian italics. Get text by web analysis. Use DeepL to translate from French to English and back to French, then check the results. Reproduce the OCR method from the literature and find that recognition has improved.	✓
By Week 8	Apply OCRopus to a small set of images. Use a grammar checker to analyze the result of OCRopus.	✓
By Week 9	Prototype design. Database design.	✓
By Week 10	Get Europeana's API. Use the API to extract the URL of each page of our newspaper. Download each page as an image using those URLs.	✓
By Week 11	Run OCR using the better model and the Kraken engine. Store the resulting text in the database. Search for a grammar checker to optimize the text.	✓
By Week 12	Use the newly selected grammar checker API to optimize the text. Use entropy to analyze the result of the final text.	✓
By Week 13	Build the website from our prototype. Use different text analysis methods, N-gram and named entity recognition, to analyze the text.	✓
By Week 14	Final report and presentation.	✓

GitHub Repository

https://github.com/XinyiDyee/Europeana-Search-Engine

Europeana: A New Spatiotemporal Search Engine
