Deciphering Venetian handwriting: Difference between revisions

Revision as of 22:30, 1 December 2020

Introduction

The goal of this project is to create a pipeline that re-establishes the mapping between the original Venetian cadastre, the "Sommarioni", and its digital version, an Excel spreadsheet. While the transcription in the spreadsheet was being produced, the link to the original pages was lost. The purpose of this project is to take the spreadsheet and re-establish the link to the source document. To achieve this we use two deep neural networks: one identifies the areas of the image that contain text, the other processes the handwriting and produces the text.

Planning

Week	Task
09	Segment patches of text in the Sommarioni : (page id, patch)
10	Mapping transcription (excel file) -> page id (proof of concept)
11	Mapping transcription (excel file) -> page id (on the whole dataset)
12	Depending on the quality of the results : improve the mapping of page id, more precise matching, web viewer
13	Final results, final evaluation & final report writing
14	Final project presentation

Week 09

Input : Sommarioni images
Output : Patches of pixels containing text, with the coordinates of each patch in the Sommarioni
Step 1 : Segment handwritten text regions in Sommarioni images
Step 2 : Extraction of the patches
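
The extraction step can be illustrated with a minimal sketch. It assumes the segmentation yields axis-aligned boxes given as (x, y, width, height) and represents the page as a plain nested list of pixel rows. The function name `extract_patches` and the dictionary layout are hypothetical, chosen only for illustration, and are not taken from the project code.

```python
def extract_patches(page, boxes, page_id):
    """Crop text patches from a page image.

    page    : list of pixel rows (a nested list standing in for an image array)
    boxes   : list of (x, y, w, h) tuples, assumed to come from the segmentation
    page_id : identifier of the page, kept with each patch
    """
    height = len(page)
    width = len(page[0]) if height else 0
    patches = []
    for x, y, w, h in boxes:
        # Clip the box to the page so a slightly oversized box cannot fail.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x1 <= x0 or y1 <= y0:
            continue  # the box lies entirely outside the page
        pixels = [row[x0:x1] for row in page[y0:y1]]
        patches.append({
            "page_id": page_id,
            "box": (x0, y0, x1 - x0, y1 - y0),  # coordinates in the page
            "pixels": pixels,
        })
    return patches


# Tiny 10 x 10 synthetic "page" whose pixel value encodes its position.
page = [[10 * r + c for c in range(10)] for r in range(10)]
patches = extract_patches(page, [(2, 3, 4, 2), (8, 8, 5, 5), (20, 20, 3, 3)], "p1")
```

Keeping the clipped box next to the pixels preserves the coordinates of the patch in the Sommarioni, which the later matching steps rely on.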

Week 10

Input : transcription (Excel File), tuples (page id, patch) extracted in week 9
Output : line in the transcription -> page id
Step 1 : Handwritten text recognition (HTR) on each patch and cleaning : (patch, text)
Step 2 : Find matching pairs between recognized text and transcription
Step 3 : New Excel file with the new page id column
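
One possible way to realize Step 2 is approximate string matching, sketched below with Python's standard `difflib`. This is an assumption made for illustration: the source does not state which matching technique the project uses, and the names `match_rows_to_pages` and `threshold` as well as the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_rows_to_pages(rows, recognized, threshold=0.6):
    """Map each transcription row to the page of its best matching patch.

    rows       : list of strings taken from the transcription
    recognized : list of (page_id, text) pairs produced by the HTR step
    threshold  : minimum similarity to accept a match (assumed value)
    Returns {row_index: page_id or None}.
    """
    mapping = {}
    for i, row in enumerate(rows):
        best_page, best_score = None, threshold
        for page_id, text in recognized:
            score = similarity(row, text)
            if score >= best_score:
                best_page, best_score = page_id, score
        mapping[i] = best_page
    return mapping


# Made-up example data: HTR output usually contains small spelling errors.
rows = ["Giovanni Battista Rossi", "Casa con bottega", "Orto"]
recognized = [
    ("p12", "giovani batista rossi"),
    ("p13", "casa con botega"),
    ("p13", "xyz"),
]
mapping = match_rows_to_pages(rows, recognized)
```

Rows whose best score stays below the threshold are mapped to `None`, so they can be left blank in the new page id column instead of receiving a wrong page.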

Week 11

Step 1 : Apply the pipeline validated in week 10 to the whole dataset
Step 2 : Evaluate the quality and, based on that, decide on the tasks for the next weeks

Week 12

Depending on the quality of the matching
- Improve image segmentation
- More precise matching (Excel cell) -> (page id, patch) in order to have the precise box of each written text
- Use an IIIF image viewer to present the results of the project more attractively
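
If the viewer mentioned above is one that follows the IIIF Image API, a patch can be addressed directly by its box through the region part of the image URL. The sketch below builds such a URL. The server address and page identifier are placeholders, the helper name `iiif_region_url` is hypothetical, and the size keyword `max` assumes a server implementing version 2.1 or 3.0 of the API.

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, box, size="max"):
    """Build an IIIF Image API URL for a rectangular region.

    URL pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    box is (x, y, w, h) in pixels of the full image.
    """
    x, y, w, h = box
    region = f"{x},{y},{w},{h}"
    # The identifier must be percent-encoded, including any slashes.
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/0/default.jpg"


# Placeholder server and identifier, for illustration only.
url = iiif_region_url("https://example.org/iiif/", "sommarioni/page 12", (100, 140, 200, 90))
```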

Methodology

1. The first part of the project consisted of extracting the areas of the page image that contain text. This step is required because our handwriting recognition model takes a single line of text as input. To extract the patches, the first step is to identify the location of the text on the page and to produce a baseline under it. We did this with P2PaLA. Since there is no ground truth for baseline identification in the Sommarioni dataset, we used the pre-trained model provided with the P2PaLA repository. For the same reason, we have no metric to measure the quality of this step of our pipeline, so we conducted a qualitative visual inspection of the results. The output is remarkably good given that the model was not trained on data from this dataset. A few false positives were found, but no false negatives. Having false positives at this point in the pipeline is acceptable, since many steps in the pipeline are aimed at detecting and removing them.
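
To make the step from baselines to line patches concrete, here is a minimal sketch. It assumes the baselines are available as PAGE XML, with `Baseline` elements carrying a `points` attribute of the form "x,y x,y ...". The margins `above` and `below` are assumed values, not parameters used in the project, and the function name `baseline_boxes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}Baseline' -> 'Baseline'."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Parse 'x1,y1 x2,y2 ...' into a list of integer (x, y) tuples."""
    return [tuple(int(float(v)) for v in pair.split(",")) for pair in points.split()]


def baseline_boxes(page_xml, above=60, below=20):
    """Turn each baseline into an (x, y, w, h) box around the text line.

    above / below: assumed pixel margins over and under the baseline.
    """
    root = ET.fromstring(page_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "Baseline":
            continue
        pts = parse_points(elem.get("points", ""))
        if len(pts) < 2:
            continue  # a single point does not define a line
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        x0 = min(xs)
        y0 = max(0, min(ys) - above)
        boxes.append((x0, y0, max(xs) - x0, max(ys) + below - y0))
    return boxes


# Hand-written sample document, for illustration only.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
    "<Page><TextRegion>"
    '<TextLine id="l1"><Baseline points="100,200 300,210"/></TextLine>'
    '<TextLine id="l2"><Baseline points="50,40"/></TextLine>'
    "</TextRegion></Page></PcGts>"
)
boxes = baseline_boxes(sample)
```

Skipping degenerate baselines is one simple example of the kind of filtering that removes false positives later in the pipeline.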

@@ Line 1: / Line 1: @@
 ==Introduction==
-The goal of this create a pipeline that allows to reestablish the mapping between the original Venetian castrate and the digital version of it as an excel spread sheet.
+The goal of this create a pipeline that allows to reestablish the mapping between the original Venetian castrate "Sommarioni" and the digital version of it as an excel spread sheet.
-Will producing the transcription of the the ,link to the original pages was lost. The purpose of this project is to take the spreadsheet and reestablish the link to the
+Will producing the transcription of the the spreadsheet the link to the original pages was lost. The purpose of this project is to take the spreadsheet and reestablish the link to the
-source document. To do this we use high quality scans of the Venetian cadastre called "Sommarioni".
+source document. The technologies used to achieve this goal are 2 deep neural nets, one to identify areas of the image that contain text, the other to process the handwriting and produce the text.
 ==Planning==
@@ Line 54: / Line 54: @@
 ** Use a IIF image viewer to show the results of the project in a more fancy way
-==Historical introduction to the source==
 ==Methodology ==
+. The first part of the project consisted in extracting the areas on the image of the page that contain text. This is a required step since our hand writing recognition model requires as an input a single line of text. The extract the patches the first step is to identify the location of the text on the page and to produce a baseline under it. We did this with the [https://github.com/lquirosd/P2PaLA P2PaLA]. Since there is no ground truth for baseline identification in the Sommarioni dataset we used the pre-trained model provided with the P2PaLA repository. Since there is no groundtruth we do not have a metric to measure the quality of this step of our pipeline. We conducted qualitative visual inspection of results. The output is remarkably good given that it was not trained on data from this dataset. A few false positives were found, but no false negatives.
+Having false positives at this point in the pipeline is fine since there are many steps in the pipeline that are aimed at detecting and removing them.
 ==Quality assessment==
 ==Links==
