Primary sources

From FDHwiki
Revision as of 14:26, 15 December 2017 by Mathieu.clavel (talk | contribs)
Jump to navigation Jump to search

A method for retrieving mentions of named entities in Primary Sources using Word Spotting in the context of ClioWire.

Motivations

Technology is not yet able to fully retranscribe text from old handwritten documents, yet these documents constitute the vast majority of primary sources of interest in a historian perspective. Therefore, the opportunity to retrieve even mentions of named entities becomes very valuable. For instance, a historian looking for records on a certain person could know quickly in which pages of which documents this person is mentioned, with a certain level of confidence. This would save him a lot of time scrutinizing the old writing and would make the information, even though not fully accessible since not digitized, much faster and convenient to retrieve. It could also foster a more focused, perhaps on-demand finer grained transcription (using other methods, costlier algorithms or taking advantage of human expertise) of particularly relevant documents.

Bibliography and state of the art

When working with digitized documents, Optical Character Recognition (OCR) is traditionally used to recognize words in a character-by-character fashion. However, in the case of offline handwritten text recognition, it does perform rather poorly. A more suited approach to this problem is the Word Spotting technology. First proposed in S. Madhvanath et al., (1996) [1], this approach does not attempt to recognize characters or wors per say, but tries instead to retrieve all instances matching a user query (either a word or an image) in a set of document images. It is noted that Word Spotting is of primary interest in the case of historical documents. [2] [3]

There are three main characteristics that constrain Word Spotting methods[3]:

  1. The need for a segmentation, which can be done by word, line or arbitrarily. Bad segmentation might results in errors if poorly chosen, thus there have been attempts at coming up with segmentation-free methods cf. Leydier et Al. (2007)[4].
  2. The way of searching for a word, either by string or by example. Search by example is more restrictive, provided an example as first to be identified prior to querying, while other methods have attempted to allow for a search by string cf. Edwards et Al. (2005) [5]
  3. The use of a training set, i.e. human annotated data. Most often, methods based on a training set appear to perform far better [3].

Currently, the best performing methods rely on segmentation by word, querying by example and using a training set. Moreover, they also make use of Neural Networks.[3] However, a preferable method would not need segmentation, and allow to query by string. New methods with such features have emerged and show promising results, c.f. Wilkinson et al. (2017)[6].

  1. Manmatha, R., Han, C., & Riseman, E. M. (1996, June). Word spotting: A new approach to indexing handwriting. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR'96, 1996 IEEE Computer Society Conference on (pp. 631-637). IEEE.
  2. Rath, T. M., & Manmatha, R. (2007). Word spotting for historical documents. International Journal on Document Analysis and Recognition, 9(2), 139-152.
  3. 3.0 3.1 3.2 3.3 Giotis, A. P., Sfikas, G., Gatos, B., & Nikou, C. (2017). A survey of document image word spotting techniques. Pattern Recognition, 68, 310-332.
  4. Leydier, Y., Lebourgeois, F., & Emptoz, H. (2007). Text search for medieval manuscript images. Pattern Recognition, 40(12), 3552-3567.
  5. Edwards, J., Teh, Y. W., Bock, R., Maire, M., Vesom, G., & Forsyth, D. A. (2005). Making latin manuscripts searchable using gHMM's. In Advances in Neural Information Processing Systems (pp. 385-392).
  6. Wilkinson, T., Lindström, J., & Brun, A. (2017). Neural Ctrl-F: Segmentation-free Query-by-String Word Spotting in Handwritten Manuscript Collections. arXiv preprint arXiv:1703.07645.

Detailed description of the method

Argus & Word Spotting

For Word Spotting itself, our method relies on a commercial algorithm called ArgusSearch, edited by the German company Planet, which we use as a black box. The Word Spotting is done on the company servers side, where an undisclosed trained model is used to perform the indexing of pages. For a subscription priced on the basis of a number of pages to index, one can send high resolution pictures to their servers and receive in a short while a database with the results of the indexing, which we can then query for names.

However, the usage of this black-box at the core of our bot is not very problematic. It could be swapped easily for any other better performing algorithm, without changing the core functionalities of the method - for instance, the algorithm being developed at the DHLab at the time. Eventually, using several algorithms and comparing their results would allow cross-checking the results.

Our implementation

To use this database, we created a simple server that connects to a postgresSQL database that contains the index. Then we can simply write http queries to search the indexed pages, the results returned is a json file containing the confidence of the match, the reference of the page with the match as well as the coordinates of the bounding box (which delimits the area of the match) We also created a simple front-end interface to allow for manual assessment in a visual way of the quality of results.

The pipeline of our method

To retrieve the named entities (for the scope of this project, persons names), we were given a list of 3749 Venetian named entities, which we parsed down to a meaningful list of 405 unique last names. The use of Venetian names was relevant to our project, since the actual documents we had at our disposition came from the State Archives of Venice. Looking up full name (first + last names) gave too little results, while looking up first names only, although it got many matches from the database, did not appear relevant to us compared to last names only. As such, there would be no point in feeding a massive number of random mentions of first names - unless other bots are integrated, which could for instance infer and retrieve subsequent last names from the coordinates of the bounding boxes. We also tried queries based on publicly available list of Italian names, which wasn’t quite as successful.

Pulses

It is not desirable to to query the server whenever and each time one is looking for information, hence ideally as much information as possible should be processed from the corpus at once, and published on ClioWire as formatted pulses. This way, knowledge is actually created by extracting key dormant informations from the documents. For this project, we created pulses based on the results of queries made with list of Venetian last names previously mentioned, filtered after a confidence threshold of 60%.

As was agreed between all the actors building ClioWire (which novel information architecture rests on a mutual negotiation of the tags between the contributors), our pulses have the format “#Mention #Name_entity #In #document_uid #Source url #Prob_score score”

Example : Mention #Francesco_554eacdd_6734_4c91_9b9a_5e40869403e7_000490_942_5513_363_139 #In #554eacdd_6734_4c91_9b9a_5e40869403e7 #Source https://images.center/iiif/554eacdd-6734-4c91-9b9a-5e40869403e7-000490/942,5513,363,139/full/0/default.jpg #Prob_Score 0.95

The pulse is organised in three key tags : first, it is clearly specify that the pulse is essentially the #mention of a #named entity, which is the name itself attached to the UUID of the volume. The purpose of this long unique identifier is to anticipate and resolve the expected ambiguities, whence the important need to have every single mention receive an unique id. It is always possible to define equivalence between mentioned entities in successive pulses. Then we specify again #in which #volume UUID is the mention present, to be able to search directly for volumes UUID in ClioWire (it is expected that future bots will be able to create links from UUIDs to volumes’ titles). Then we have the #source tag to provide a link to the bounded picture of the spotted named entity in the page, hosted on the DHlab server, and resized with IIIF. At last, there is an optional parameter #Prob_Score that shows the confidence score returned by the algorithm.

Meet Francesco_raspi_bot

The Python script *...* accessible in the GitHub repository perform the query on our corpus with the provided list of names. The pulses created are then posted on ClioWire by Francesco_raspi_bot (named after the glorious Venice Time Machine legendary character) bot on ClioWire. While creating the pulses, we set a json file to log the mention that have been posted as pulses already, so that they would not be added twice as two identical pulses. This file can be given as an argument to our script to control the pulse submission process.

Finally, our whole method is essentially designed to become a bot of its own. With further work, it should be made automatic, so that new pulses with named entities published on ClioWire would eventually trigger further queries on the available corpora, and feed ClioWire back with more pulses.


Quantitive analysis of the performances

We were first given a sample of 791 pages. Then at the end of the project, the indexing of a collection of 30’000 pages was started, yet some problems were encountered on the Argus indexing server side, and by the time of performing the analysis, we had but another subset of 8’201 pages. A prior limitation of our analysis lies in the fact that many of the sample pages were indexes, meaning they could have been potentially written in a way that is easier for the model to classify (more clearly than less formally formatted handwritten documents).

We firsted tested our preliminary sample of 791 pages to get an idea of the results we would get. Using a list of 980 italian firstnames : 904 occurences > 80% confidence Using the partially parsed list of venetian first + last names (677) : 173 occurences > 60% confidence not much Using the partially parsed list of venetian last names (405) : 3097 occurences > 60% confidence Using the partially parsed list of venetian last names (405) : 1192 occurences > 80% confidence much better but obviously our results are more vague. However, it appeared that a strategy focusing on full names and last names only was the most interesting. Full names are ideal, as they allow us to identify a person up to individuals who share the same names. Last names discriminate a family, whereas first names are too loosely defining entities to be considered useful. However, searching first name can still be relevant, since another bot could come and link the first name and last name. The next step of our properly quantitative analysis is to compute the distribution of number of occurences per confidence values. For this, we will use the 8’201 indexed pages database that we have up to this day (the final submission of pulses will be made out of the 30’000 indexed pages database). The results are as follows : [plot]


Precision/Recall

Because we mostly lack reference, estimating the quantitative performance by mean of the usual precision/recall with our method is not straightforward. As such, we cannot estimate the proper recall of the method, as we have no reasonable way to count the total number of occurrences of named entities documents, other than going through the 791 pages of the sample subset manually looking for them. We can estimate the precision of the method (estimate false positives and false negatives). To do so, we run a query with the 405 parsed Venetian names that we have and we set a score for each name, the number of hits that it gets from the database above a threshold confidence of 60%. In the ArgusSearch SDK settings, the threshold is set at 26,4%, but we found that under 60% the precision was mostly too low to bother counting occurences. Then, we can rank the names by number of occurrence, pick say the 15 highest and manually query them through our visual interface. Finally, we can count the number of results that appear correct to our eyes s by intervals of confidence. Results are as follows

[plot]

There’s a problem when the name is also an italian noun (eg. Porta is retrieved as porto) Precision really depends on the name

Regarding the recall, we could eventually filter our results by the page where the match is found, and compute recall on an even smaller sample. We chose to compute a less rigorous recall based on the results of the indexing. It happens that correct matches are given with a low index of confidence. We can consider them as false negatives and count their number to estimate this lesser recall. Results are as follows [plot]

As we cn see, the Argus system has mostly a high precision above …% confidence