Extracting Toponyms from Maps of Jerusalem
[[File:etmj_intro_sample_raw.png|thumb|A sample of the linguistic, geometrical, and typographical diversity of toponyms in 19th-century maps of Jerusalem.]]
Revision as of 20:38, 13 December 2023
Rubric
Written deliverables (Wiki writing) (40%)
Project plan and milestones (10%) (>300 words)
Motivation and description of the deliverables (10%) (>300 words)
Detailed description of the methods (10%) (>500 words)
Quality assessment and discussion of limitations (10%) (>300 words)
The indicated word counts are lower bounds. The detailed description in particular can be extended if needed.
Project Timeline
| Timeframe | Task | Completion |
|---|---|---|
| Week 4 | | ✓ |
| Week 5 | | ✓ |
| Week 6 | | ✓ |
| Week 7 | | ✓ |
| Week 8 | | ✓ |
| Week 9 | | ✓ |
| Week 10 | | ✓ |
| Week 11 | | ✓ |
| Week 12 | | |
| Week 13 | | |
| Week 14 | | |
Introduction
We aim to programmatically and accurately extract toponyms (place names) from historical maps of Jerusalem. With the help of a scene text recognition tool built by the Machines Reading Maps (MRM) team specifically for toponym extraction, we develop a novel label extraction and processing pipeline capable of significant accuracy improvements relative to MRM's mapKurator spotter module alone. We then explore the success of our pipeline during generalization, describe the limitations of our approach, and suggest possibilities for future progress in accuracy or end-user interactivity.
Motivation
Text extraction from maps
The study of the geography and chronology of neighborhoods in Jerusalem can provide valuable insights into the city's past and present. The location of a neighborhood can often reflect the social, economic, and political forces that shaped it, as well as the cultural traditions and values of its residents.
Examining the founding year of a neighborhood can also provide insight into the city's history and development. Visualizing the location and founding year of neighborhoods in Jerusalem can be a powerful tool for understanding the city's past and present. By mapping and analyzing these data, it is possible to gain a deeper understanding of the cultural, social, and economic dynamics of different neighborhoods and the forces that have shaped them.
A city with a history as rich and varied as Jerusalem's has been described in many different accounts. These accounts, drawn from various sources, are an important basis for studying the city, and integrating the information they contain is one of the focuses of our research.
Existing interactive maps of Jerusalem neighborhoods usually include neither the neighborhoods that once existed nor the dates on which neighborhoods were built. Our work therefore contributes to the study of the history of the Jerusalem community.
Deliverables
- OCR results of Development of Jerusalem neighborhoods information from Jerusalem and its Environs.
- Crawler results from the Wikipedia category Neighbourhoods of Jerusalem, the Wikidata list of places in Jerusalem, and the Wikidata entity Neighborhood of Jerusalem.
- An integrated database combining multiple information sources after perfect matching and fuzzy matching.
- An interactive and user-friendly website showing the changes in neighborhoods of Jerusalem, which contains:
  - A timeline page that illustrates the evolution of the construction of neighborhoods in Jerusalem over time.
  - An inhabitant page and an initiative page that respectively show the inhabitants and initiative information about each neighborhood.
  - A dedicated page that contains relevant information for each neighborhood.
  - A search function that enables users to search for neighborhoods by name.
Methodology
MapKurator
Pyramid
Text Rectification
Let $G_{j,k}$ represent subset $k$ of ground truth label $G_j$. Because we do not require $G_{j,k}$ to be a proper subset, it is possible that $G_{j,k} = G_j$. Now let the set $S_{j,k} = \{L_1, L_2, ..., L_{p_{j,k}}\}$ refer to the $p_{j,k}$ extracted labels that correspond to $G_{j,k}$ in its entirety. The goal of the Text Rectification stage is to retain the single most accurate extracted label $L_i$ in a given $S_{j,k}$ and exclude the rest. In other words, we filter $S_{j,k}$ so that just one "representative" remains for $G_{j,k}$.
To organize our extracted labels into the sets $S_{j,k}$, we represent each extracted bounding box $P_i$ as a four-dimensional vector formed from its bottom-left and top-right Cartesian coordinates, and run DBSCAN on these vectors. [TODO: add DBSCAN hyperparameters and confirm that the following still holds.] Outliers are placed into their own individual $S_{j,k}$.
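The grouping step described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the `dbscan` and `group_boxes` helpers, the `eps` and `min_samples` values, and the sample boxes are all assumptions made for the example, since the hyperparameters are not stated here. In practice a library implementation such as scikit-learn's `DBSCAN` would typically be used instead.

```python
import math


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one cluster id per point, with -1 marking noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core point, so keep expanding
    return labels


def group_boxes(boxes, eps=10.0, min_samples=2):
    """Group (x_min, y_min, x_max, y_max) boxes; each outlier gets its own group.

    eps and min_samples are hypothetical values chosen for illustration only.
    """
    labels = dbscan([tuple(b) for b in boxes], eps, min_samples)
    next_id = max(labels, default=-1) + 1
    groups = {}
    for idx, lab in enumerate(labels):
        if lab == -1:  # outlier: slot it into an individual set
            lab, next_id = next_id, next_id + 1
        groups.setdefault(lab, []).append(idx)
    return list(groups.values())


# Three near-duplicate detections of one label and one unrelated detection.
boxes = [(0, 0, 100, 20), (2, 1, 101, 22), (1, 0, 99, 19), (500, 500, 600, 520)]
groups = group_boxes(boxes)
```

Each returned group is a list of indices into `boxes` and plays the role of one $S_{j,k}$ in this sketch.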
To filter each $S_{j,k}$ down to its most appropriate representative, we first attempt to retain the label in $S_{j,k}$ with the highest $C_i$. [TODO: describe the RANSAC step.]
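As a rough sketch of this first filtering attempt, the snippet below keeps the highest-scoring label from each group. It assumes that $C_i$ is a recognition confidence score attached to each extracted label; the `Label` type, the `pick_representatives` helper, and the sample values are hypothetical and exist only for this example. No later fallback step is modelled here.

```python
from typing import List, NamedTuple, Tuple


class Label(NamedTuple):
    text: str
    box: Tuple[float, float, float, float]
    confidence: float  # assumed meaning of C_i


def pick_representatives(groups: List[List[Label]]) -> List[Label]:
    """Keep, for each non-empty group, the label with the highest confidence."""
    return [max(group, key=lambda lab: lab.confidence) for group in groups if group]


# Hypothetical groups standing in for two sets S_{j,k}.
groups = [
    [
        Label("Jerusalm", (0, 0, 100, 20), 0.71),
        Label("Jerusalem", (1, 0, 101, 21), 0.93),
    ],
    [Label("Gate", (300, 300, 340, 312), 0.80)],
]
representatives = pick_representatives(groups)
```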
Let $\sigma_{j,k}^{*}$ be the single label from set $S_{j,k}$ after Text Rectification has occurred.
Text Amalgamation
Let $A_{j} = \{\sigma_{j,1}^{*}, \sigma_{j,2}^{*}, ...,\sigma_{j,r_j}^{*}\}$ refer to the $r_j$ extracted and Text-Rectified labels corresponding to subsets of $G_j$. The goal of the Text Amalgamation stage is to retain a single label from $A_{j}$: the label $\alpha_{j}^{*} = G_{j}$.
This process is performed iteratively. The first step in the amalgamation sequence computes pairwise geometric and textual intersection over minimum (IoM) values between all $\sum_{j=1}^{M}r_j$ labels in the set $R = \{\sigma_{1,1}^*, \sigma_{1,2}^*, ..., \sigma_{1, r_1-1}^*, \sigma_{1, r_1}^*, \sigma_{2, 1}^*, ..., \sigma_{M, r_M-1}^*, \sigma_{M, r_M}^*\}$.
For example, suppose we are comparing $\sigma_{a,b}^*$ and $\sigma_{c,d}^*$. In this case, geometric IoM equals the area of $P_{a,b} \cap P_{c,d}$ divided by the area of the smaller polygon. Textual IoM, meanwhile, equals the number of characters shared by $T_{a,b}$ and $T_{c,d}$ (counting repeated characters) divided by the length of the longer string.
Pairs with geometric IoM value $\gamma_{geom} > 0.75$ and textual IoM value $\gamma_{text} > 0.5$ are considered to exhibit a subset-parent relationship and are therefore amalgamated: (1) a new label $\sigma_{a,b,c,d}^*$ is added to $R$ with $P_{a,b,c,d} = P_{a,b} \cup P_{c,d}$ and $T_{a,b,c,d}$ equal to the longer of $T_{a,b}$ and $T_{c,d}$, and (2) both $\sigma_{a,b}^*$ and $\sigma_{c,d}^*$ are dropped from $R$.
When all possible amalgamations have been made among the pairs satisfying the $\gamma_{geom}$ and $\gamma_{text}$ conditions, the sequence begins anew with the updated $R$. The amalgamation stage terminates when $R$ yields no further amalgamations.
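The amalgamation loop can be illustrated with the sketch below. It makes several simplifying assumptions that are not part of the method as described: polygons are replaced by axis-aligned rectangles `(x_min, y_min, x_max, y_max)`, the union $P_{a,b} \cup P_{c,d}$ is approximated by the rectangle enclosing both boxes, and shared characters are counted as a multiset intersection. All function names are hypothetical; only the thresholds 0.75 and 0.5 come from the text.

```python
from collections import Counter
from itertools import combinations


def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def geometric_iom(a, b):
    """Intersection area divided by the area of the smaller rectangle."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    smaller = min(area(a), area(b))
    if w <= 0 or h <= 0 or smaller == 0:
        return 0.0
    return (w * h) / smaller


def textual_iom(s, t):
    """Shared characters (with repetition) divided by the longer string's length."""
    shared = sum((Counter(s) & Counter(t)).values())
    longer = max(len(s), len(t))
    return shared / longer if longer else 0.0


def amalgamate(labels, gamma_geom=0.75, gamma_text=0.5):
    """Repeatedly merge (text, box) pairs until no pair passes both thresholds."""
    labels = list(labels)
    merged_any = True
    while merged_any:
        merged_any = False
        used, merged = set(), []
        for i, j in combinations(range(len(labels)), 2):
            if i in used or j in used:
                continue  # each label takes part in at most one merge per pass
            (ta, ba), (tb, bb) = labels[i], labels[j]
            if geometric_iom(ba, bb) > gamma_geom and textual_iom(ta, tb) > gamma_text:
                # Enclosing rectangle stands in for the polygon union (assumption).
                box = (min(ba[0], bb[0]), min(ba[1], bb[1]),
                       max(ba[2], bb[2]), max(ba[3], bb[3]))
                merged.append((max(ta, tb, key=len), box))
                used.update((i, j))
                merged_any = True
        labels = merged + [lab for k, lab in enumerate(labels) if k not in used]
    return labels


# A partial detection nested inside a full one, plus an unrelated label.
labels = [
    ("JERUS", (0, 0, 50, 10)),
    ("JERUSALEM", (0, 0, 100, 10)),
    ("Gate", (300, 300, 340, 312)),
]
result = amalgamate(labels)
```

Because every merge replaces two labels with one, the number of labels strictly decreases and the loop is guaranteed to terminate.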
Once both rectification and amalgamation have occurred, the set of labels $L_1, L_2, ..., L_N$ has been condensed to $\alpha_{1}^*, \alpha_{2}^*, ..., \alpha_{M}^*$.
Word Combination
Evaluation
Results
Limitations
Future work
GitHub Repository
References
Literature
- Kim, Jina, et al. "The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps." arXiv preprint arXiv:2306.17059 (2023).
- Li, Zekun, et al. "An automatic approach for generating rich, linked geo-metadata from historical map images." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
Webpages
Acknowledgements
We thank Professor Frédéric Kaplan, Sven Najem-Meyer, and Beatrice Vaienti of the DHLAB for their valuable guidance over the course of this project.