Europeana: A New Spatiotemporal Search Engine: Difference between revisions

From FDHwiki
Tag: Manual revert
 
(11 intermediate revisions by 3 users not shown)
Line 2: Line 2:
[[File:topic.jpg|400px|right|]]
[[File:topic.jpg|400px|right|]]


This project aims at build up a webpage for more accurate and flexible search for archives and improve the search function of Europeana. Due to time and hardware limitation, the project focus on one part of dataset from Europeana--the French newspaper ''La clef du cabinet des princes de l'Europe''. ''La clef du cabinet des princes de l'Europe'' is the first newspaper in Luxembourg. It appeared monthly from July 1704 to July 1794. There are 1,317 issues in Europeana. The page number for most issues is around 80. In order to reduce the amount of data to a scale that can be dealt with on our laptops, we randomly selected 7950 pages from the whole time span of the magazine. In the project, we mainly implement OCR, grammar-checker, text analsis, database design and webpage development on our dataset.
This project aims to build a webpage for more accurate and flexible archive search and to improve the search function of Europeana. Due to time and hardware limitations, the project focuses on one part of the Europeana dataset: the French newspaper ''La clef du cabinet des princes de l'Europe'', the first newspaper in Luxembourg. It appeared monthly from July 1704 to July 1794. Europeana holds 1,317 issues, most of which have around 80 pages. To reduce the data to a scale that our laptops can handle, we randomly selected 7,950 pages from the whole time span of the newspaper. In the project, we mainly implement OCR, grammar checking, text analysis, database design, and webpage development on our dataset.


Due to the unsatisfied results of OCR provided by Europeana, we tried OCR again to convert the image format newspaper to text and store the text in the database, which increases the accuracy. The OCR process is assisted with Kraken OCR engine and a trained model from OCR17. For the text analysis part of our work, we used 2 methods--name entity and n-gram--to deal with the text we obtained. For the presentation of the magazine, we developed a webpage to realize the search and analysis functions. The webpage aim at realizing interactivity between users, and let users have an efficient way to reach the content they'd like to get.
Because the OCR results provided by Europeana were unsatisfactory, we ran OCR again to convert the newspaper images to text and stored the text in the database, which increased the accuracy. The OCR process uses the Kraken OCR engine and a trained model from OCR17. For the text analysis, we used two methods, named-entity recognition and n-grams, to process the text we obtained. To present the newspaper, we developed a webpage that provides the search and analysis functions. The webpage aims to be interactive and to give users an efficient way to reach the content they want.


= Motivation =
= Motivation =
Line 19: Line 19:


= Milestones =
= Milestones =
1. A great improvement of text recognition accuracy with a new OCR method and grammar-checker.
1. A great improvement in text recognition accuracy with a new OCR method and grammar-checker.


2. A structured and functionable Database with relatively a large scale of dataset.
2. A structured and functional database with a relatively large-scale dataset.


3. A light and interactive webpage for accurate search in time and space.
3. A light and interactive webpage for accurate search in time and space.
Line 29: Line 29:
= Methodologies =
= Methodologies =
[[File:process.jpg|400px|right|thumb|The synergetic process]]
[[File:process.jpg|400px|right|thumb|The synergetic process]]
This project includes three main parts which are text processing, database development and web applications. At the same time, the project is conducted with a synergetic process of improving those three parts. Toolkits of this project contain Python for text processing and web applications, MySQL for database development, and FLASK for the webpage framework. In the end, the dataset is composed of four versions for 100 newspaper issues including 7950 pages, that is images, text from Europeana, text after OCR and text after OCR and grammar-checker.   
This project includes three main parts: text processing, database development, and web applications. The three parts are improved together in a synergetic process. The toolkit contains Python for text processing and web applications, MySQL for database development, and Flask as the web framework. In the end, the dataset is composed of four versions of 100 newspaper issues totalling 7,950 pages: images, text from Europeana, text after OCR, and text after OCR and grammar checking.


==Text processing==
==Text processing==
===Data acquisition===
===Data acquisition===
Using the API given by Europeana's staff, the relevant data is acquired by web crawler. We first get the unique identifier for each issue, then use it to get the image url and ocr text provided by Europeana. We also get the publication date and the page number of every images, which is helpful for us to locate every page and retrieve them in the future. The data is stored in ''<Title, Year, Month, Page, Identifier, Image_url, Text>'' format.
Using the API provided by Europeana's staff, we acquire the relevant data with a web crawler. We first get the unique identifier of each issue, then use it to get the image URL and the OCR text provided by Europeana. We also get the publication date and the page number of every image, which helps us locate and retrieve each page later. The data is stored in ''<Title, Year, Month, Page, Identifier, Image_url, Text>'' format.
The crawling result is shown below.
The crawling result is shown below.
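
As an illustration of the ''<Title, Year, Month, Page, Identifier, Image_url, Text>'' format, one crawled page could be held in a small Python record like the sketch below. The class name, the field types, and the placeholder identifier and URL are assumptions made for this example; they are not taken from the project's crawler.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class PageRecord:
    """One crawled page: <Title, Year, Month, Page, Identifier, Image_url, Text>."""
    title: str
    year: int
    month: int
    page: int
    identifier: str   # unique issue identifier returned by the API
    image_url: str
    text: str         # OCR text as provided by Europeana

    def key(self) -> Tuple[int, int, int]:
        """(year, month, page) is enough to locate a page later."""
        return (self.year, self.month, self.page)


# Placeholder values only; the identifier and URL are hypothetical.
record = PageRecord(
    title="La clef du cabinet des princes de l'Europe",
    year=1748,
    month=6,
    page=7,
    identifier="example-identifier",
    image_url="https://example.org/page.jpg",
    text="",
)
print(record.key())
print(list(asdict(record)))
```

Keeping the date and page number as separate integer fields makes it straightforward to sort pages chronologically and to move to the previous or next page.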


Line 47: Line 47:
|[[File:Labelled genre_dist.png|100px|center|thumb|Figure 2: Distribution of the prints in the training corpus per genre]]
|[[File:Labelled genre_dist.png|100px|center|thumb|Figure 2: Distribution of the prints in the training corpus per genre]]
|}
|}
The reliability of OCR models depends on both the quantity and the quality of training data. Quantity needs to be produced and made freely available to other scholars. On the other hand, quality needs to be properly defined, since philological traditions vary from one place to another, but also from one period to another. The essentials of successful recognition for this type of newspaper are to target the old French during 18 centuries while meeting both quality and quantity of dataset. Therefore, the model for recognition used in this project is trained by OCR17. The corpus of Ground Truth(GT) is made of 30,000 lines taken from 37 French prints of the 17th century, following strict philological guidelines.  
The reliability of OCR models depends on both the quantity and the quality of training data. Training data needs to be produced in quantity and made freely available to other scholars. Quality, on the other hand, needs to be properly defined, since philological traditions vary from one place to another, and also from one period to another. Successful recognition for this type of newspaper requires a model that targets the French of the 18th century while meeting requirements on both the quality and the quantity of the dataset. Therefore, the recognition model used in this project is the one trained by OCR17. Its Ground Truth (GT) corpus is made of 30,000 lines taken from 37 French prints of the 17th century, following strict philological guidelines.




'''1. Corpus building: ''' The training data is selected according to two main categories bibliographical (printing date and place, literary genre, author) and computational (size and resolution of the images) information. Regarding dates, prints are diachronically distributed over the century, with a special attention for books printed between 1620 and 1700. Regarding genre, the result can be seen as a two-tier corpus with a primary one consisting of literary texts (drama, poetry, novels. . . ) and a secondary one made of scientific works (medicine, mechanics, physics. . . ). A summary for corpus is showed in figure 1 and figure 2. The inbalanced corpus are made for two main reasons. On the one hand, dramatic texts tend to be printed in italics at the beginning of the 17th century. On the other hand, they traditionally use capital letters to indicate the name of the speaker, which is an easy way to increase the amount of such rarer glyphs and is also helpful to deal with highly complex layouts.
'''1. Corpus building: ''' The training data is selected according to two main categories of information: bibliographical (printing date and place, literary genre, author) and computational (size and resolution of the images). Regarding dates, prints are diachronically distributed over the century, with special attention to books printed between 1620 and 1700. Regarding genre, the result can be seen as a two-tier corpus, with a primary tier consisting of literary texts (drama, poetry, novels, ...) and a secondary tier made of scientific works (medicine, mechanics, physics, ...). A summary of the corpus is shown in Figures 1 and 2. The corpus is deliberately imbalanced, for two main reasons. On the one hand, dramatic texts tend to be printed in italics at the beginning of the 17th century. On the other hand, they traditionally use capital letters to indicate the name of the speaker, which is an easy way to increase the amount of such rarer glyphs and is also helpful to deal with highly complex layouts.
At the same time, low resolution of images would wrong recognition(fig. 3), the model is able handle low resolution images properly.  
At the same time, low image resolution can cause wrong recognition (Fig. 3), and the model is able to handle low-resolution images properly.
[[File:res_e.png|235px|right|thumb|Figure 3: Impact of resolution on letter e]]
[[File:res_e.png|235px|right|thumb|Figure 3: Impact of resolution on letter e]]




'''2. Transaction rules''':
'''2. Transcription rules''':
The transaction guideline in this model is to encode as much information as possible, as long as it is available in unicode. The result is therefore a mix between graphetic and graphemic transcription.  
The transcription guideline for this model is to encode as much information as possible, as long as it is available in Unicode. The result is therefore a mix of graphetic and graphemic transcription.
[[File:sample1.png|800px|center|thumb|Figure 4: Excerpt of Marie de Gournay, Egalité, 1622]]
[[File:sample1.png|800px|center|thumb|Figure 4: Excerpt of Marie de Gournay, Egalité, 1622]]




In practice(Fig. 4), it means that it do not dissimilate‹u›/‹v› (diuin) or ‹i›/‹j›, we do not normalise accents (interprete and not interprète), we keep historical, diactritical (Eſcripts and not Ecrits) or calligraphic letters (celuy and not celui). We keep the long s (meſme and not mesme), but most of the other allographetic variations are not encoded(Fig. 5).
In practice (Fig. 4), this means that we do not dissimilate ‹u›/‹v› (diuin) or ‹i›/‹j›, we do not normalise accents (interprete and not interprète), and we keep historical, diacritical (Eſcripts and not Ecrits), or calligraphic letters (celuy and not celui). We keep the long s (meſme and not mesme), but most of the other allographic variations are not encoded (Fig. 5).


One exception has been made to our unicode rule: aesthetic ligatures that still exist in French (‹œ› vs ‹oe›) have been encoded, but not those that have disappeared despite their existence in unicode.(Fig. 6)
One exception has been made to our Unicode rule: aesthetic ligatures that still exist in French (‹œ› vs ‹oe›) have been encoded, but not those that have disappeared, despite their existence in Unicode (Fig. 6).
{|class="wikitable" style="margin: 1em auto 1em auto;"
{|class="wikitable" style="margin: 1em auto 1em auto;"
|-
|-
Line 70: Line 70:


'''3. Model''':
'''3. Model''':
The model has been trained on Kraken OCR engine and tested with small samples of 18th century out-of-domain prints to test the generality of our model – only with roman or italic typefaces. On top of training a model using the default setup regarding the network structure, training parameters. . . , several modifications, have been tested to maximize the final scores. The training process is completed on Kraken which is an optical character recognition package that can be trained fairly easily for a large number of scripts. Some training details can be seen [https://kraken.re/master/training.html#training/ here].
The model has been trained with the Kraken OCR engine and tested on small samples of 18th-century out-of-domain prints, with roman or italic typefaces only, to check its generality. Besides training a model with the default setup (network structure, training parameters, ...), several modifications have been tested to maximize the final scores. Kraken is an optical character recognition package that can be trained fairly easily for a large number of scripts. Some training details can be seen [https://kraken.re/master/training.html#training/ here].
{| class="wikitable" style="margin:auto"
{| class="wikitable" style="margin:auto"
|+ Table 1: Scores
|+ Table 1: Scores
Line 86: Line 86:


====OCR Procedure====
====OCR Procedure====
Images for OCR is crawled from Europeana API. The resolution is 72 dpi and the color mode is in grayscale with white background. A low resolution introduces significant changes in the shape of letters. However, a few lines in images are blurry and glared, so noise is not a big problem during OCR. Besides, due to the simple and clear layout of the newspaper, the results of segmentation are pretty good.  
Images for OCR are crawled from the Europeana API. The resolution is 72 dpi and the color mode is grayscale with a white background. A low resolution introduces significant changes in the shape of letters. However, only a few lines in the images are blurry or affected by glare, so noise is not a big problem during OCR. Besides, thanks to the simple and clear layout of the newspaper, the segmentation results are good.


Based on the above, the first step is '''binarization''' to convert greyscale images into black-and-white(BW) images. By comparing the BW image(Fig. 7) with the original one(Fig. 8), it finds that characters on binarized images are milder and vaguer although it can get rid of some noise which is not a big issue for this type of newspaper. In this case, the original images are used to segment. e
Based on the above, the first step is '''binarization''', which converts greyscale images into black-and-white (BW) images. Comparing the BW image (Fig. 7) with the original one (Fig. 8) shows that characters in the binarized images are fainter and vaguer, even though binarization removes some noise, which is not a big issue for this type of newspaper. Therefore, the original images are used for segmentation.
Line 96: Line 96:
|}
|}


The next step is to segment pages into lines and regions. Since the whole procedure of OCR is carried out on Kraken engine, page '''segmentation''' is implemented by the default trainable baseline segmenter that is capable of detecting both lines of different types and regions.(Fig. 9)  
The next step is to segment pages into lines and regions. Since the whole OCR procedure is carried out with the Kraken engine, page '''segmentation''' is implemented with the default trainable baseline segmenter, which is capable of detecting both lines of different types and regions (Fig. 9).


At last, '''recognition''' requires grey-scale images, page segmentation for images, and the model file. The recognized records are output as a text file after serialization.
Finally, '''recognition''' requires the grey-scale images, the page segmentation of each image, and the model file. The recognized records are output as a text file after serialization.
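
The segmentation and recognition steps can be chained in a single Kraken command-line call. The sketch below only builds that command in Python; it assumes the <code>kraken</code> executable is installed, that the <code>segment -bl</code> and <code>ocr -m</code> subcommands behave as in the Kraken documentation for the installed version, and that the file names (all hypothetical) point to a page image and to the OCR17 model file. Binarization is left out because the original images are used for segmentation.

```python
from pathlib import Path
from typing import List


def kraken_ocr_command(image: Path, output: Path, model: Path) -> List[str]:
    """Build a Kraken CLI call: baseline segmentation followed by recognition.

    Rough shell equivalent (flags may differ between Kraken versions):
        kraken -i page.jpg page.txt segment -bl ocr -m model.mlmodel
    """
    return [
        "kraken",
        "-i", str(image), str(output),   # input image and output text file
        "segment", "-bl",                # trainable baseline segmenter
        "ocr", "-m", str(model),         # recognition with the given model
    ]


# Hypothetical file names, for illustration only.
cmd = kraken_ocr_command(
    Path("1748_06_007.jpg"), Path("1748_06_007.txt"), Path("ocr17.mlmodel")
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```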
Line 110: Line 110:
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.  
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.  


Named entity is realized by using Spacy API with entityfishing pipeline. Spacy API provides a French named-entity models and four categories of entity which are ''person'', ''location'', ''organization'' and ''others'' can be recognized. The model we selected is ''fr_core_news_lg'' with 500k unique vectors(300 dimensions). The evaluation score of this model is as following:
Named-entity recognition is realized using the spaCy API with the entity-fishing pipeline. spaCy provides French named-entity models that recognize four categories of entity: ''person'', ''location'', ''organization'', and ''others''. The model we selected is ''fr_core_news_lg'', with 500k unique vectors (300 dimensions). The evaluation scores of this model are as follows:
{| class="wikitable" style="margin:auto"
{| class="wikitable" style="margin:auto"
|+ Table 2: Scores
|+ Table 2: Scores
Line 131: Line 131:
The version of MySQL server in this project is 8.0.28. The data interacts with MySQL database via Python and its package, pymysql and sqlalchemy. The database contains six main schemas ''newspaper_grammar_check_text'', ''newspaper_info_content_all'', ''newspaper_ner_freq_label_url'', ''newspaper_page_ner'', ''newspaper_reocr_text'', and ''newspaper_n_gram''. The details information of this database can be seen in the ER relationships figure.  
The version of the MySQL server in this project is 8.0.28. The data interacts with the MySQL database via Python and its packages pymysql and sqlalchemy. The database contains six main schemas: ''newspaper_grammar_check_text'', ''newspaper_info_content_all'', ''newspaper_ner_freq_label_url'', ''newspaper_page_ner'', ''newspaper_reocr_text'', and ''newspaper_n_gram''. Detailed information on this database can be seen in the ER relationships figure.


Schema ''newspaper_info_content_all'' includes all original text, image URL and date information. Schema ''newspaper_reocr_text'' includes text after OCR in two formats--one is line by line as in pages, and another is in one line getting rid of line break <\n>. Schema ''newspaper_grammar_check_text'' includes OCR text after grammar checking in on line. Schema ''newspaper_page_ner'' includes all entity information like entities, frequencies, QID of wiki knowledge, and URL of wiki knowledge in dictionary format. Those four schemas' primary key is <year, month, page>, which means each row presents one page. The schema ''newspaper_ner_freq_label_url'' stores entities in a sparse way. Each row presents one entity in one page. The last schema ''newspaper_n_gram'' contains all words in the text and their frequency. Each row presents one word in page.
Schema ''newspaper_info_content_all'' includes all original text, image URLs, and date information. Schema ''newspaper_reocr_text'' includes the text after OCR in two formats: one keeps the line breaks of the page, and the other is a single line with the line breaks <\n> removed. Schema ''newspaper_grammar_check_text'' includes the OCR text after grammar checking. Schema ''newspaper_page_ner'' includes all entity information (entities, frequencies, and the QID and URL of the matching wiki knowledge entry) in dictionary format. The primary key of these four schemas is <year, month, page>, which means each row represents one page. The schema ''newspaper_ner_freq_label_url'' stores entities in a sparse way: each row represents one entity on one page. The last schema, ''newspaper_n_gram'', contains all words in the text and their frequencies; each row represents one word on one page.
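
The page-level and entity-level layouts described above can be sketched as follows. To keep the example self-contained it uses Python's built-in <code>sqlite3</code> module with an in-memory database instead of the MySQL server used in the project, and every column name other than year, month, and page is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the project's MySQL server
conn.executescript("""
CREATE TABLE newspaper_reocr_text (
    year          INTEGER NOT NULL,
    month         INTEGER NOT NULL,
    page          INTEGER NOT NULL,
    text_lines    TEXT,   -- OCR text with the page's line breaks (assumed name)
    text_one_line TEXT,   -- same text with line breaks removed (assumed name)
    PRIMARY KEY (year, month, page)           -- one row per page
);
CREATE TABLE newspaper_ner_freq_label_url (
    year   INTEGER NOT NULL,
    month  INTEGER NOT NULL,
    page   INTEGER NOT NULL,
    entity TEXT NOT NULL,
    label  TEXT,
    freq   INTEGER,
    url    TEXT,
    PRIMARY KEY (year, month, page, entity)   -- one row per entity per page
);
""")

# Made-up sample rows, for illustration only.
conn.execute(
    "INSERT INTO newspaper_reocr_text VALUES (?, ?, ?, ?, ?)",
    (1748, 6, 7, "ligne un\nligne deux", "ligne un ligne deux"),
)
conn.executemany(
    "INSERT INTO newspaper_ner_freq_label_url VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1748, 6, 7, "Paris", "LOC", 3, None),
        (1748, 6, 7, "Vienne", "LOC", 1, None),
    ],
)

rows = conn.execute(
    "SELECT entity, freq FROM newspaper_ner_freq_label_url "
    "WHERE year = ? AND month = ? AND page = ? ORDER BY freq DESC",
    (1748, 6, 7),
).fetchall()
print(rows)
```

Storing one entity per row makes it cheap to filter by entity type and to sort pages by entity frequency, which the dictionary-style page table would not allow directly in SQL.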


Such design is all for the web application such as search and visualization.
This design serves the web applications, such as search and visualization.


==Webpage applications==
==Webpage applications==
===Tools===
===Tools===
Flask is a Python framework for building web apps. It's famous for being small, light and simple. And MySQL is a database system used for developing web-based software applications. We use Flask to build the front-end content and MySQL to connect with local database. By doing so we are able to retrieve data from the local server and present it on the webpage.
Flask is a Python framework for building web apps, known for being small, light, and simple. MySQL is a database system widely used for web-based software applications. We use Flask to build the web content and connect it to a local MySQL database. By doing so, we are able to retrieve data from the local server and present it on the webpage.


===Feature Design===
===Feature Design===
Our main goal is to design an efficient system for indexing the content of the newspapers and making it searchable. We also want to provide useful insights for the users. This may involve implementing techniques such as full-text indexing, metadata indexing, and natural language processing. So we design our web features as follows.
Our main goal is to design an efficient system for indexing the content of the newspapers and making it searchable. We also want to provide useful insights for the users. This may involve techniques such as full-text indexing, metadata indexing, and natural language processing. We therefore design our web features as follows.
* Search Page
* Search Page
** Full Text Search: Allow users to search by the full text of the newspaper collection.
** Full-Text Search: Allow users to search by the full text of the newspaper collection.
** Named Entities Search: Create facets using different types of named entities and allow users search by them.
** Named Entities Search: Create facets using different types of named entities and allow users to search by them.
** N-gram Search: Show the changes of the word frequency in a long time period and give some inspiration.
** N-gram Search: Show how word frequencies change over a long time period, to give users some inspiration.
* Detailed Page
* Detailed Page
** Display the full text.
** Display the full text.
** Show the named entities contained in this page and provide external links to help users understand better.
** Show the named entities contained in this page and provide external links to help users understand better.
** Able to go to previous or next page.
** Able to go to the previous or next page.


===Retrieval Method===
===Retrieval Method===
We use full-text search and facet search in the web.
We use full-text search and facet search on the webpage.
*Full-text Search
*Full-text Search
**For full-text search, users can search for documents based on the full text of the newspaper. And we uses fuzzy match when executing the query statement. This allows us to get more retrieval results even when the search keyword isn't very precise.
**For full-text search, users can search for documents based on the full text of the newspaper. We use fuzzy matching when executing the query statement, which returns more retrieval results even when the search keyword isn't very precise.
*Facet Search
*Facet Search
**For facet research, we create 3 different categories for retrieval -- People, Location, and Organization. When doing facet search, we still use fuzzy match to get as much results as possible. Besides, we also calculate the occurrences of the search keyword in each page, and output the results by descending order. In this way, we can put the most relevant data to the front and improve the search results.
**For facet search, we create three categories for retrieval: People, Location, and Organization. When doing facet searches, we still use fuzzy matching to get as many results as possible. Besides, we also count the occurrences of the search keyword on each page and output the results in descending order. In this way, we put the most relevant data first and improve the search results.
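
The ranking step can be sketched in a few lines of Python. The example assumes that "fuzzy match" means a case-insensitive substring match (similar to a SQL <code>LIKE '%keyword%'</code> query); the function name and the sample pages are made up for illustration, and the project's actual query may differ.

```python
from typing import Dict, List, Tuple

PageKey = Tuple[int, int, int]  # (year, month, page)


def rank_pages(pages: Dict[PageKey, str], keyword: str) -> List[Tuple[PageKey, int]]:
    """Return pages containing the keyword, most occurrences first."""
    needle = keyword.lower()
    if not needle:
        return []
    hits = []
    for key, text in pages.items():
        count = text.lower().count(needle)  # substring match, ignoring case
        if count > 0:
            hits.append((key, count))
    # Sort by descending count; ties are broken chronologically.
    hits.sort(key=lambda item: (-item[1], item[0]))
    return hits


# Made-up sample pages.
pages = {
    (1748, 6, 7): "Le Roi de France et la Reine de France",
    (1748, 6, 8): "Nouvelles de Vienne",
    (1750, 1, 2): "La France",
}
print(rank_pages(pages, "franc"))
```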


===Interface===
===Interface===
We wish the web to have coherent themes, and to be easy to use and navigate with a clear hierarchy and logical organization of content. Following these rules we design the interface of the search engine, which is shown below.
We want the website to have a coherent theme and to be easy to use and navigate, with a clear hierarchy and a logical organization of content. Following these rules, we design the interface of the search engine, which is shown below.


{|class="wikitable" style="margin: 1em auto 1em auto;"
{|class="wikitable" style="margin: 1em auto 1em auto;"
Line 180: Line 180:
{|class="wikitable" style="margin: 1em auto 1em auto;"
{|class="wikitable" style="margin: 1em auto 1em auto;"
|[[File:oori.png|300px|right|thumb|Figure 18: The selected page: 1748-06-07]]
|[[File:oori.png|300px|right|thumb|Figure 18: The selected page: 1748-06-07]]
|[[File:ori.png|400px|right|thumb|Figure 19: Text obtained from Europeana directly: : 1748-06-07]]
|[[File:ori.png|500px|right|thumb|Figure 19: Text obtained from Europeana directly: 1748-06-07]]
|}
|}
{|class="wikitable" style="margin: 1em auto 1em auto;"
{|class="wikitable" style="margin: 1em auto 1em auto;"
|[[File:ocr.png|300px|center|thumb|Figure 20: The result after implementing OCR: 1748-06-07]]
|[[File:ocr.png|450px|center|thumb|Figure 20: The result after implementing OCR: 1748-06-07]]
|[[File:gc.png|400px|center|thumb|Figure 21: The optimized text after implementing the grammar checker: 1748-06-07]]
|[[File:gc.png|500px|center|thumb|Figure 21: The optimized text after implementing the grammar checker: 1748-06-07]]
|}
|}


==Objective Assessment==
==Objective Assessment==
To quantitatively compare the improvement brought by ReOCR and Grammar checker, we used two objective methods to measure it -- Dictionary method and Entropy method.
To quantitatively measure the improvement brought by ReOCR and the grammar checker, we used two objective methods: the dictionary method and the entropy method.


For dictionary method, we simply calculate the ratio of the "correct" words in the corpus. "Correct" means the word can be retrieved in a french dictionary we download from the Internet. For the original text, the correct ratio is 55.5%. For ReOCR text, the ratio is 57.3%. And for text refined by grammar checker, the ratio goes up to 60.9%.
For the dictionary method, we simply calculate the ratio of "correct" words in the corpus. "Correct" means that the word can be found in a French dictionary that we downloaded from the Internet. For the original text, the correct ratio is 55.5%. For the ReOCR text, the ratio is 57.3%. For the text refined by the grammar checker, the ratio goes up to 60.9%.
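
A minimal sketch of the dictionary method is given below. The tokenization rule (runs of letters, lower-cased) and the tiny word set are assumptions made for the example; the project's dictionary file and its exact tokenization are not specified here.

```python
import re
from typing import Set

# A "word" is a run of letters (accented letters included); digits and
# punctuation act as separators. This rule is an assumption of the sketch.
WORD_RE = re.compile(r"[^\W\d_]+")


def correct_word_ratio(text: str, dictionary: Set[str]) -> float:
    """Share of words in `text` that appear in `dictionary` (0.0 if no words)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)


# Toy dictionary; "Frnace" simulates an OCR error.
dictionary = {"le", "roi", "de", "france"}
ratio = correct_word_ratio("Le Roi de Frnace", dictionary)
print(ratio)
```

Note that historical spellings kept by the transcription rules (such as ''celuy'') would count as "incorrect" against a modern dictionary, so the ratio is better read as a relative comparison between text versions than as an absolute accuracy.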


For entropy, it is a measure of the amount of uncertainty or randomness in a system. In the case of language, this could refer to the probability of different words or combinations of words occurring in a given text or speech. If a language has a high entropy, it would have a large number of words and a wide variety of word combinations, making it more difficult to predict which words or combinations are likely to occur next. On the other hand, if a language has a low entropy, it would have fewer words and a more predictable structure, making it easier to predict which words or combinations are likely to occur next.
Entropy is a measure of the amount of uncertainty or randomness in a system. In the case of language, it can describe the probability of different words or combinations of words occurring in a given text or speech. A language with high entropy has a large number of words and a wide variety of word combinations, which makes it more difficult to predict which words or combinations are likely to occur next. A language with low entropy, on the other hand, has fewer words and a more predictable structure, which makes the next words or combinations easier to predict.


[[File:entropy.png|800px|center|thumb|Formular 1: Entropy of a distribution '''P''']]
[[File:entropy.png|800px|center|thumb|Formula 1: Entropy of a distribution '''P''']]
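
The entropy of a text can be estimated from its character n-gram frequencies, as in the sketch below. It assumes the standard Shannon entropy, H(P) = -Σ p(x) log p(x), with a base-2 logarithm and overlapping character n-grams; the base and the exact n-gram definition used in the project are not stated, so both are assumptions.

```python
import math
from collections import Counter


def ngram_entropy(text: str, n: int = 1) -> float:
    """Shannon entropy (in bits) of the character n-gram distribution of `text`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Overlapping n-grams: "abab" with n=2 gives "ab", "ba", "ab".
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(ngram_entropy("aaaa"))      # a single repeated symbol carries no uncertainty
print(ngram_entropy("abab"))      # two equally likely symbols
print(ngram_entropy("abab", 2))   # 2-gram entropy of the same string
```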


So our hypothesis is, the more random the occurrence of characters, the greater the information entropy. We crawl part of the old french books from the Gutenberg project to generate a corpus as our benchmark. We also generate a completely random charset as a comparison. Then we calculate the 1-gram and 2-gram entropy for each corpus we mentioned above. The results are listed below. As we can see, the entropy reduces as we improve the OCR result in different ways. And the text refined by grammar checker has the best performance compared with the original and ReOCR results.
Our hypothesis is that the more random the occurrence of characters, the greater the information entropy. We crawled some old French books from Project Gutenberg to generate a benchmark corpus. We also generated a completely random character set for comparison. We then calculated the 1-gram and 2-gram entropy of each corpus mentioned above. The results are listed below. As we can see, the entropy decreases as we improve the OCR result in different ways, and the text refined by the grammar checker performs best compared with the original and ReOCR results.
{| class="wikitable" style="margin:auto"
{| class="wikitable" style="margin:auto"
|+ Entropy for different text
|+ Entropy for different text
Line 215: Line 215:
= Limitations =
= Limitations =
==OCR results==
==OCR results==
Although the model of OCR is verified to be strong and we have improved the OCR results compared to the original ones, the results are still not perfect. For example, in segmentation process, some annotations are still recognized as main body text and the first special capital letters are segmented wrongly, which can be seen in figure 9. Those wrong segmentation decreases the accuracy of OCR. Secondly, the resolution of images is low which would cause some significant changes in characters. The transcription rules remains some problems. For example, it cannot figure out 'u' and 'v' so the word divin becomes diuin which is unreadable.
Although the OCR model is verified to be strong and we have improved the OCR results compared to the original ones, the results are still not perfect. First, in the segmentation process, some annotations are still recognized as main body text, and the decorated initial capital letters are segmented wrongly, as can be seen in Figure 9. These wrong segmentations decrease the accuracy of OCR. Second, the low resolution of the images causes significant changes in the shape of characters. Third, the transcription rules also cause some problems. For example, the model does not distinguish 'u' from 'v', so the word divin becomes diuin, which is hard to read.


== Limited Dataset ==
== Limited Dataset ==

Latest revision as of 13:45, 20 December 2023

Introduction


This project aims to build a webpage for more accurate and flexible archive search and to improve the search function of Europeana. Due to time and hardware limitations, the project focuses on one part of the Europeana dataset: the French newspaper La clef du cabinet des princes de l'Europe, the first newspaper in Luxembourg. It appeared monthly from July 1704 to July 1794. Europeana holds 1,317 issues, most of which have around 80 pages. To reduce the data to a scale that our laptops can handle, we randomly selected 7,950 pages from the whole time span of the newspaper. In the project, we mainly implement OCR, grammar checking, text analysis, database design, and webpage development on our dataset.

Due to the unsatisfied results of OCR provided by Europeana, we tried OCR again to convert the image format newspaper to text and store the text in the database, which increases the accuracy. The OCR process is assisted by the Kraken OCR engine and a trained model from OCR17. For the text analysis part of our work, we used 2 methods--name entity and n-gram--to deal with the text we obtained. For the presentation of the magazine, we developed a webpage to realize the search and analysis functions. The webpage aim at realizing interactivity between users, and let users have an efficient way to reach the content they'd like to get.

Motivation

Europeana, as a container of Europe's digital cultural heritage, covers different themes such as art, photography, and newspapers. Because Europeana covers such diverse topics, it is difficult to present digital materials in a way that suits their content. Searching for a specific topic requires going through several steps, and the search results might still not match the user's intention. After studying the structure of Europeana in depth, we decided to create a new search engine that presents the resources according to their contents. Taking the time available and the size of our group into account, we selected the theme Newspaper as the content for our engine. To narrow down the task further, we selected the newspaper La clef du cabinet des princes de l'Europe as our target. Since the OCR results provided by Europeana are not ideal, we implemented a new OCR method and a grammar checker to increase the accuracy of text recognition.

Deliverables

  • 7950 pages of La clef du cabinet des princes de l'Europe from July 1704 to July 1794 in image format from Europeana's website.
  • OCR results for 7950 pages in text format.
  • OCR results after grammar checking for 7950 pages in text format.
  • A dataset of the text and the results of text analysis based on named entities and n-grams.
  • A database containing all image URLs, text data, and metadata.
  • A webpage to present the contents and analysis results for La clef du cabinet des princes de l'Europe.
  • A GitHub repository containing all the code for the whole project.

Milestones

1. A great improvement in text recognition accuracy with a new OCR method and grammar-checker.

2. A structured and functional database with a relatively large-scale dataset.

3. A light and interactive webpage for accurate search in time and space.

4. The final result is complete and systematic.

Methodologies

The synergetic process

This project includes three main parts: text processing, database development, and web applications. The project is conducted as a synergetic process in which these three parts are improved together. The toolkit contains Python for text processing and web applications, MySQL for database development, and Flask as the webpage framework. In the end, the dataset consists of four versions of 100 newspaper issues (7950 pages): the images, the text from Europeana, the text after OCR, and the text after OCR and grammar checking.

Text processing

Data acquisition

Using the API given by Europeana's staff, the relevant data is acquired by the web crawler. We first get the unique identifier for each issue, then use it to get the image URL and OCR text provided by Europeana. We also get the publication date and the page number of every image, which is helpful for us to locate every page and retrieve them in the future. The data is stored in <Title, Year, Month, Page, Identifier, Image_url, Text> format. The crawling result is shown below.

  • 1317 Issues
    • Number of pages per issue: roughly 80 pages
    • Number of words per page: roughly 200-300 words
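The record format described above can be sketched as a small data class. This is only an illustration under stated assumptions: the class name `PageRecord`, the field types, and the placeholder identifier and URL are hypothetical and not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One crawled page: <Title, Year, Month, Page, Identifier, Image_url, Text>."""
    title: str
    year: int
    month: int
    page: int
    identifier: str  # unique identifier of the issue
    image_url: str
    text: str        # OCR text as provided by Europeana

    def key(self):
        # Publication date plus page number locates a page within the newspaper
        return (self.year, self.month, self.page)


record = PageRecord(
    title="La clef du cabinet des princes de l'Europe",
    year=1704,
    month=7,
    page=1,
    identifier="example-issue-id",             # placeholder, not a real identifier
    image_url="https://example.org/page.jpg",  # placeholder URL
    text="...",
)
print(record.key())  # (1704, 7, 1)
```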

Optical character recognition (OCR)

Ground truth and the model

Figure 1: Distribution of the prints in the training corpus per decade
Figure 2: Distribution of the prints in the training corpus per genre

The reliability of OCR models depends on both the quantity and the quality of training data. Quantity needs to be produced and made freely available to other scholars. On the other hand, quality needs to be properly defined, since philological traditions vary from one place to another, but also from one period to another. Successful recognition for this type of newspaper requires a model that targets 18th-century French while meeting both the quality and the quantity requirements for training data. Therefore, the recognition model used in this project is the one trained by OCR17. Its Ground Truth (GT) corpus is made of 30,000 lines taken from 37 French prints of the 17th century, following strict philological guidelines.


1. Corpus building: The training data is selected according to two main categories of information: bibliographical (printing date and place, literary genre, author) and computational (size and resolution of the images). Regarding dates, prints are diachronically distributed over the century, with special attention to books printed between 1620 and 1700. Regarding genre, the result can be seen as a two-tier corpus, with a primary tier consisting of literary texts (drama, poetry, novels...) and a secondary one made of scientific works (medicine, mechanics, physics...). A summary of the corpus is shown in Figures 1 and 2. The corpus is deliberately imbalanced for two main reasons. On the one hand, dramatic texts tend to be printed in italics at the beginning of the 17th century. On the other hand, they traditionally use capital letters to indicate the name of the speaker, which is an easy way to increase the amount of such rarer glyphs and also helps with highly complex layouts. In addition, low image resolution can cause wrong recognition (Fig. 3), and the model is able to handle low-resolution images properly.

Figure 3: Impact of resolution on letter e


2. Transcription rules: The transcription guideline in this model is to encode as much information as possible, as long as it is available in Unicode. The result is therefore a mix of graphetic and graphemic transcription.

Figure 4: Excerpt of Marie de Gournay, Egalité, 1622


In practice (Fig. 4), this means that the transcription does not dissimilate ‹u›/‹v› (diuin) or ‹i›/‹j›, does not normalise accents (interprete and not interprète), and keeps historical, diacritical (Eſcripts and not Ecrits) or calligraphic letters (celuy and not celui). The long s is kept (meſme and not mesme), but most of the other allographic variations are not encoded (Fig. 5).

One exception has been made to the Unicode rule: aesthetic ligatures that still exist in French (‹œ› vs ‹oe›) have been encoded, but not those that have disappeared despite their existence in Unicode (Fig. 6).

Figure 5: Examples of ignored allographetic variants, Rotrou, Alphrede, 1639
Figure 6: Examples of ignored ligatures, Rotrou, Alphrède, 1639

3. Model: The model has been trained with the Kraken OCR engine, an optical character recognition package that can be trained fairly easily for a large number of scripts. It has been tested with small samples of 18th-century out-of-domain prints to assess its generality (only with roman or italic typefaces). On top of training a model with the default setup (network structure, training parameters, etc.), several modifications have been tested to maximize the final scores (Table 1). Some training details can be seen here.

Table 1: Scores
Model | Test | 16th c. prints | 18th c. prints | 19th c. prints
Basic model | 97.47% | 97.74% | 97.78% | 94.50%
with enlarged network | 97.92% | 98.06% | 97.78% | 94.23%
+ artificial data | 96.65% | 97.26% | 97.74% | 95.50%
with enlarged network | 97.26% | 97.68% | 97.84% | 94.84%

OCR Procedure

Images for OCR are crawled through the Europeana API. The resolution is 72 dpi and the images are greyscale with a white background. A low resolution introduces significant changes in the shape of letters. However, only a few lines in the images are blurry or affected by glare, so noise is not a big problem during OCR. Besides, thanks to the simple and clear layout of the newspaper, the segmentation results are good.

Based on the above, the first step is binarization, which converts greyscale images into black-and-white (BW) images. Comparing the original image (Fig. 7) with the binarized one (Fig. 8), we find that characters in the binarized image are fainter and blurrier, although binarization removes some noise, which is not a big issue for this type of newspaper. Therefore, the original images are used for segmentation.

Figure 7: Original Image
Figure 8: Binarized Image
Figure 9: Segmented Image

The next step is to segment pages into lines and regions. Since the whole OCR procedure is carried out with the Kraken engine, page segmentation is implemented by the default trainable baseline segmenter, which is capable of detecting both lines of different types and regions (Fig. 9).

At last, recognition requires grey-scale images, page segmentation for images, and the model file. The recognized records are output as a text file after serialization.

Figure 10: The OCR procedure

Grammar checker

The OCR results keep the historical letterforms of old French, which would hurt the performance of the grammar checker, so we first need to convert these historical characters to modern ones. To refine the OCR text, we then used a grammar checker API (WebSpellChecker)[1]. After we send a request to the grammar checker's server, it returns a JSON file that contains all the proposed modifications for the given text. Using the offset and length information in the JSON file, we can locate each word that should be modified in the original text. For every modification, we replace the original word with the first suggested value.
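The replacement step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `apply_corrections` and the keys `offset`, `length`, and `suggestions` are assumptions about how the returned modifications might be represented, and the real response layout of the grammar checker may differ.

```python
def apply_corrections(text, matches):
    """Replace each flagged span with its first suggestion.

    `matches` is a list of dicts with the hypothetical keys "offset",
    "length" and "suggestions".
    """
    # Work from the end of the text so that earlier offsets stay valid
    for m in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if not m["suggestions"]:
            continue  # nothing proposed: keep the original word
        start = m["offset"]
        end = start + m["length"]
        text = text[:start] + m["suggestions"][0] + text[end:]
    return text


ocr_text = "le meſme iour"
matches = [
    {"offset": 3, "length": 5, "suggestions": ["même", "mesme"]},
    {"offset": 9, "length": 4, "suggestions": ["jour"]},
]
print(apply_corrections(ocr_text, matches))  # le même jour
```

Applying the modifications from right to left matters because a replacement of a different length would otherwise shift the offsets of every later modification.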

Figure 11: The text obtained from OCR and the text optimized by the grammar checker

Text Analysis

Named entity

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Named-entity recognition is implemented with spaCy and the entityfishing pipeline. spaCy provides French named-entity models that recognize four categories of entity: person, location, organization, and others. The model we selected is fr_core_news_lg, with 500k unique vectors (300 dimensions). The evaluation scores of this model are as follows:

Table 2: Scores
score (max=1)
Named entities (precision) 0.84
Named entities (recall) 0.84
Named entities (F-score) 0.84

The named-entity results are used for search by category and keyword in the webpage application.

N-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. By choosing the size of n, this method can infer the structure of a sentence from the probability of the occurrence of n words. A window of size n passes over each word in the sentence in turn and treats that word and the next n-1 words as one statistical object. This project uses the n-gram method to calculate word frequencies, which may reflect some features of a specific era and the customary usage of specific words during a specific time span.
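The sliding window described above can be sketched in a few lines. This is an illustrative example only: the function name `ngram_counts` is hypothetical, and splitting on whitespace is an assumed, simplified tokenization.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens."""
    # zip over n shifted copies of the token list yields each window once
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)


tokens = "le roi et la reine et le roi".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(unigrams[("roi",)])      # 2
print(bigrams[("le", "roi")])  # 2
```

With n = 1 the counts are plain word frequencies, which is the quantity plotted over time in an n-gram search.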

Database development

Figure 12: The ER-relationship of Europeana Database

The MySQL server version used in this project is 8.0.28. The data interacts with the MySQL database via Python and its packages pymysql and sqlalchemy. The database contains six main schemas: newspaper_grammar_check_text, newspaper_info_content_all, newspaper_ner_freq_label_url, newspaper_page_ner, newspaper_reocr_text, and newspaper_n_gram. Detailed information about this database can be seen in the ER relationship figure (Fig. 12).

Schema newspaper_info_content_all includes all original text, image URLs, and date information. Schema newspaper_reocr_text includes the text after OCR in two formats: one keeps the line breaks as on the page, and the other is a single line with the line breaks <\n> removed. Schema newspaper_grammar_check_text includes the OCR text after grammar checking. Schema newspaper_page_ner includes all entity information, such as entities, frequencies, QID of wiki knowledge, and URL of wiki knowledge, in dictionary format. The primary key of those four schemas is <year, month, page>, which means each row represents one page. The schema newspaper_ner_freq_label_url stores entities in a sparse way: each row represents one entity on one page. The last schema, newspaper_n_gram, contains all words in the text and their frequencies: each row represents one word on one page.

This design serves the web applications, such as search and visualization.
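The relation between the per-page dictionary format and the sparse one-entity-per-row format can be sketched as below. The dictionary layout, the key names `freq`, `label`, and `url`, and the example URLs are assumptions made for illustration; the actual column names are those shown in the ER figure.

```python
def to_sparse_rows(year, month, page, page_entities):
    """Turn one page's entity dictionary into one row per entity."""
    return [
        (year, month, page, name, info["freq"], info["label"], info["url"])
        for name, info in page_entities.items()
    ]


# Hypothetical layout of the entity dictionary stored for one page
page_entities = {
    "Paris": {"freq": 3, "label": "LOC", "url": "https://example.org/entity/paris"},
    "Vienne": {"freq": 1, "label": "LOC", "url": "https://example.org/entity/vienne"},
}
rows = to_sparse_rows(1748, 6, 7, page_entities)
for row in rows:
    print(row)
```

Storing one entity per row lets the search filter and sort by entity, category, and frequency with ordinary SQL conditions instead of parsing a dictionary for every page.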

Webpage applications

Tools

Flask is a Python framework for building web apps, known for being small, light, and simple. MySQL is a database system widely used for web-based software applications. We use Flask to build the web application and connect it to a local MySQL database. This lets us retrieve data from the local server and present it on the webpage.

Feature Design

Our main goal is to design an efficient system for indexing the content of newspapers and making them searchable. We also want to provide useful insights for the users. This may involve implementing techniques such as full-text indexing, metadata indexing, and natural language processing. So we design our web features as follows.

  • Search Page
    • Full-Text Search: Allow users to search by the full text of the newspaper collection.
    • Named Entities Search: Create facets using different types of named entities and allow users to search by them.
    • N-gram Search: Show the changes in the word frequency over a long time period and give some inspiration.
  • Detailed Page
    • Display the full text.
    • Show the named entities contained in this page and provide external links to help users understand better.
    • Able to go to the previous or next page.

Retrieval Method

We use full-text search and facet search on the web.

  • Full-text Search
    • For a full-text search, users can search for documents based on the full text of the newspaper. We use fuzzy matching when executing the query statement, which returns more results even when the search keyword is not very precise.
  • Facet Search
    • For a facet search, we create three categories for retrieval: People, Location, and Organization. Facet searches also use fuzzy matching to get as many results as possible. In addition, we count the occurrences of the search keyword on each page and output the results in descending order. In this way, the most relevant pages come first, which improves the search results.
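The facet search with occurrence-based ranking can be sketched in plain Python. This is a simplified stand-in rather than the project's query code: it assumes that "fuzzy match" means case-insensitive substring matching (similar in spirit to SQL's LIKE '%keyword%'), and the function name `facet_search` and the row layout are hypothetical.

```python
def facet_search(rows, keyword, category):
    """Return (page_key, occurrences) pairs, most frequent first.

    `rows` are (page_key, entity, label, freq) tuples, a simplified
    stand-in for the sparse entity table.
    """
    needle = keyword.lower()
    totals = {}
    for page_key, entity, label, freq in rows:
        # Keep rows of the requested category whose entity contains the keyword
        if label == category and needle in entity.lower():
            totals[page_key] = totals.get(page_key, 0) + freq
    # Pages with the most occurrences come first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


rows = [
    ((1748, 6, 7), "Paris", "LOC", 3),
    ((1748, 6, 7), "Comté de Paris", "LOC", 1),
    ((1748, 6, 8), "Paris", "LOC", 5),
    ((1750, 1, 2), "Louis de Paris", "PER", 4),
]
print(facet_search(rows, "paris", "LOC"))
# [((1748, 6, 8), 5), ((1748, 6, 7), 4)]
```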

Interface

We wish the web to have coherent themes, and to be easy to use and navigate with a clear hierarchy and logical organization of content. Following these rules, we design the interface of the search engine, which is shown below.

Figure 13: Homepage
Figure 14: N-gram search
Figure 15: Search Results
Figure 16: Details
Figure 17: N-gram of Roi

Quality Assessment

As text recognition is the focal point of this project and our text analysis relies heavily on it, this section uses two assessment methods to test the quality of text recognition.

Subjective Assessment

Subjective assessment measures the quality of different samples directly through visual comparison by a human reader. This part uses a randomly selected page as the sample and shows the text obtained directly from Europeana (Europeana provides its own OCR, which can return the text of designated pages), the result after our OCR, and the optimized text after the grammar checker. Comparing these results with the original page shows that, although some specific characters are still misrecognized, the result after ReOCR and the grammar checker is significantly improved.

Figure 18: The selected page: 1748-06-07
Figure 19: Text obtained from Europeana directly: 1748-06-07
Figure 20: The result after implementing OCR: 1748-06-07
Figure 21: The optimized text after implementing the grammar checker: 1748-06-07

Objective Assessment

To quantitatively compare the improvement brought by ReOCR and the grammar checker, we used two objective methods: the dictionary method and the entropy method.

For the dictionary method, we simply calculate the ratio of "correct" words in the corpus. "Correct" means that the word can be found in a French dictionary that we downloaded from the Internet. For the original text, the correct ratio is 55.5%. For the ReOCR text, the ratio is 57.3%. For the text refined by the grammar checker, the ratio goes up to 60.9%.

Entropy is a measure of the amount of uncertainty or randomness in a system. In the case of language, this could refer to the probability of different words or combinations of words occurring in a given text or speech. If a language has high entropy, it has a large number of words and a wide variety of word combinations, making it more difficult to predict which words or combinations are likely to occur next. On the other hand, if a language has low entropy, it has fewer words and a more predictable structure, making it easier to predict which words or combinations are likely to occur next.
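The dictionary method can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `correct_ratio`, the lowercase normalization, and the letter-only tokenization are choices made for the example, and the toy word set stands in for the downloaded French dictionary.

```python
import re


def correct_ratio(text, dictionary):
    """Share of word tokens that appear in the dictionary (case-insensitive)."""
    # Tokens are maximal runs of letters; digits and punctuation are ignored
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


dictionary = {"le", "roi", "est", "divin"}  # toy stand-in for a French word list
print(correct_ratio("Le roi est diuin", dictionary))  # 0.75
```

In this toy case the unnormalized spelling diuin is counted as incorrect, which is why text that keeps historical spellings scores lower with this method.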

Formula 1: Entropy of a distribution P
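For reference, the standard Shannon entropy of a discrete distribution P over n-grams x can be written as follows. The base-2 logarithm (entropy measured in bits) is an assumption here, since the base is not stated in the text.

```latex
H(P) = -\sum_{x} P(x) \log_2 P(x)
```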

So our hypothesis is that the more random the occurrence of characters, the greater the information entropy. We crawled some old French books from Project Gutenberg to build a benchmark corpus, and we also generated a completely random character sequence for comparison. Then we calculated the 1-gram and 2-gram entropy for each of these corpora. The results are listed below. As we can see, both the ReOCR text and the grammar-checked text have lower entropy than the original text, and all of them are far below the random text. The ReOCR text has the lowest entropy (4.13 and 7.74), with the grammar-checked text slightly higher (4.17 and 7.76).

Entropy for different texts
Text | 1-gram Entropy | 2-gram Entropy
Old French books | 4.24 | 7.88
Original text | 4.27 | 7.99
ReOCR text | 4.13 | 7.74
Grammar checker refined text | 4.17 | 7.76
Random text | 5.41 | 10.83
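An entropy calculation of this kind can be sketched as below. It is an illustration rather than the project's code: it assumes character-level n-grams and a base-2 logarithm, neither of which is stated explicitly, and the function name `ngram_entropy` is hypothetical.

```python
import math
from collections import Counter


def ngram_entropy(text, n):
    """Shannon entropy (in bits) of the character n-gram distribution of text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    # Sum of -p * log2(p) over the observed n-grams
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(ngram_entropy("abab", 1))  # 1.0
print(ngram_entropy("abcd", 1))  # 2.0
```

Two equally likely characters give 1 bit and four give 2 bits, so a text drawn uniformly from a larger character set reaches a higher entropy, as the random text does in the table.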

Limitations

OCR results

Although the OCR model has been verified to be strong and we have improved the OCR results compared to the original ones, the results are still not perfect. First, in the segmentation process, some annotations are still recognized as main body text and the decorated initial capital letters are segmented wrongly, as can be seen in Figure 9. These wrong segmentations decrease the accuracy of OCR. Secondly, the resolution of the images is low, which causes significant changes in the shape of characters. Thirdly, the transcription rules still cause some problems. For example, they do not distinguish 'u' and 'v', so the word divin becomes diuin, which is hard to read.

Limited Dataset

Due to time and hardware limitations, we could process text for only 7950 pages (100 issues) of a single newspaper in a single language. Therefore, our data is limited in amount, genre, and language. This lack of diversity limits the functions our webpage can offer, such as searching for stories in history.

Project Plan

Date Task Completion
By Week 3
  • Brainstorm project ideas.
  • Prepare slides for initial project idea presentation.
By Week 5
  • Discuss the differences between image analysis and text analysis in terms of related algorithms, processing toolkits, implementation difficulties and display methods.
  • Decide to focus on text processing.
  • Select a subset collection from the "Newspaper collection" of Europeana for our project.
  • Check the content of "La clef du cabinet des princes de l'Europe" and learn its structure and time span.
By Week 6
  • Each of us read some pages of the journal to get an overall understanding of it.
  • We find that the accuracy of the OCR results isn't very satisfying and decide to improve the OCR results before text analysis.
  • Request for data.
By Week 7
  • Research in OCR methods and find some OCR methods for Italian italics
  • Get text by web analysis
  • Use DeepL to translate FR to ENG, and then translate ENG to FR, finally check results
  • Reproduce the OCR method from the literature and find that recognition has improved.
By Week 8
  • Apply OCRopus to a small set of images.
  • Use a grammar checker to analyze the result of OCRopus.
By Week 9
  • Prototype design.
  • Database design.
By Week 10
  • Get Europeana's API
  • Use the API to extract the URL for each page of our specific newspaper.
  • Download each page of our specific newspaper as images using the URL we got.
By Week 11
  • Run OCR using the better model and the Kraken engine.
  • Store the text we get in the database.
  • Search for a grammar checker to optimize the text we get.
By Week 12
  • Use the newly selected grammar checker API to optimize the text.
  • Use entropy to analyze the result of the final text.
By Week 13
  • Build the web from our prototype.
  • Use different text analysis methods, N-gram and named entity, to analyze the text.
By Week 14
  • Final report and presentation.

GitHub Repository

https://github.com/XinyiDyee/Europeana-Search-Engine

Reference

G. (2013, August 4). Entropy for N-Grams. Normal-extensions. Retrieved December 21, 2022, from https://kiko.gfjaru.com/2013/08/04/entropy-for-n-grams/

Lo, J. (2018, September 8). The Entropy of Language. Medium. Retrieved December 21, 2022, from https://medium.com/language-insights/the-entropy-of-system-life-and-language-43d89c0d185b

Simon Gabay, Thibault Clérice, Christian Reul. OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more). date. hal-02577236, from https://hal.archives-ouvertes.fr/hal-02577236/document