Primary sources: Difference between revisions
Revision as of 01:02, 10 November 2017
Bibliography and state of the art
When working with digitized documents, Optical Character Recognition (OCR) is traditionally used to recognize words in a character-by-character fashion. However, it performs rather poorly on offline handwritten text. A better-suited approach for this kind of document is Word Spotting, first proposed by Manmatha et al. (1996) [1]. This approach does not try to recognize individual characters or words; instead, it tries to retrieve all instances of a user query (either a text string or an example image) in a set of document images.
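The retrieval idea can be illustrated with a minimal sketch. It assumes that word images have already been segmented and converted into fixed-length feature vectors; the vectors, the location labels and the choice of cosine similarity are illustrative assumptions and are not taken from any of the cited works.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no information to compare
    return dot / (norm_a * norm_b)


def spot_word(query_vec, candidates, top_k=3):
    """Rank candidate word images by similarity to the query image.

    candidates: list of (location, feature_vector) pairs, where the
    location string is a hypothetical identifier such as "page1:word3".
    Returns the top_k (score, location) pairs, best match first.
    """
    scored = [(cosine_similarity(query_vec, vec), loc) for loc, vec in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy data: made-up 3-dimensional descriptors for three word images.
candidates = [
    ("page1:word3", [0.9, 0.1, 0.0]),
    ("page1:word7", [0.0, 1.0, 0.2]),
    ("page2:word1", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # descriptor of the example image chosen by the user

results = spot_word(query, candidates, top_k=2)
for score, location in results:
    print(f"{location}: {score:.3f}")
```

Nothing is transcribed in this sketch: the output is a ranked list of locations, which is what distinguishes spotting from recognition.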
One particularly important use case of Word Spotting is in historical documents. [2] [3]
There are three main characteristics that constrain Word Spotting methods[3]:
- The need for segmentation, which can be by word, by line, or absent altogether (segmentation-free). A poorly chosen segmentation can introduce additional errors, so some authors have worked towards segmentation-free methods, cf. Leydier et al. (2007)[4].
- The way of searching for a word, either by string or by example. Search by example is more limited, since the user first needs to find an instance of the word to use as the query. Some methods allow search by string, cf. Edwards et al. (2005)[5].
- Finally, whether a training set, i.e. human-annotated data, is needed. In most cases, methods that use a training set perform considerably better[3].
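These three characteristics can be written down as a small data structure, which makes it easy to filter methods by their constraints. The following is only a sketch: the class names, the `select` helper and the two example entries (`method_a`, `method_b`) are hypothetical and do not describe any published method.

```python
from dataclasses import dataclass
from enum import Enum


class Segmentation(Enum):
    WORD = "word"
    LINE = "line"
    FREE = "free"  # segmentation-free


class QueryType(Enum):
    STRING = "string"
    EXAMPLE = "example"


@dataclass(frozen=True)
class MethodProfile:
    """Constraints of one Word Spotting method along the three axes."""
    name: str
    segmentation: Segmentation
    query_type: QueryType
    needs_training: bool


def select(methods, segmentation=None, query_type=None, needs_training=None):
    """Return the methods matching every criterion that is not None."""
    matches = []
    for m in methods:
        if segmentation is not None and m.segmentation != segmentation:
            continue
        if query_type is not None and m.query_type != query_type:
            continue
        if needs_training is not None and m.needs_training != needs_training:
            continue
        matches.append(m)
    return matches


# Hypothetical entries, for illustration only.
methods = [
    MethodProfile("method_a", Segmentation.WORD, QueryType.EXAMPLE, True),
    MethodProfile("method_b", Segmentation.FREE, QueryType.STRING, True),
]

wanted = select(methods, segmentation=Segmentation.FREE, query_type=QueryType.STRING)
print([m.name for m in wanted])
```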
Currently, the best-performing methods use segmentation by word and query by example, require a training set, and make use of neural networks.[3] However, the most desirable characteristics would be, at a minimum, to need no segmentation and to support query by string. Some new methods with these features have emerged and show promising results, cf. Wilkinson et al. (2017)[6].
- [1] Manmatha, R., Han, C., & Riseman, E. M. (1996, June). Word spotting: A new approach to indexing handwriting. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR'96, 1996 IEEE Computer Society Conference on (pp. 631-637). IEEE.
- [2] Rath, T. M., & Manmatha, R. (2007). Word spotting for historical documents. International Journal on Document Analysis and Recognition, 9(2), 139-152.
- [3] Giotis, A. P., Sfikas, G., Gatos, B., & Nikou, C. (2017). A survey of document image word spotting techniques. Pattern Recognition, 68, 310-332.
- [4] Leydier, Y., Lebourgeois, F., & Emptoz, H. (2007). Text search for medieval manuscript images. Pattern Recognition, 40(12), 3552-3567.
- [5] Edwards, J., Teh, Y. W., Bock, R., Maire, M., Vesom, G., & Forsyth, D. A. (2005). Making Latin manuscripts searchable using gHMM's. In Advances in Neural Information Processing Systems (pp. 385-392).
- [6] Wilkinson, T., Lindström, J., & Brun, A. (2017). Neural Ctrl-F: Segmentation-free Query-by-String Word Spotting in Handwritten Manuscript Collections. arXiv preprint arXiv:1703.07645.