Extracting Toponyms from Maps of Jerusalem
Project Timeline
Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Week 8 | | ✓ |
Week 9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | ✓ |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Introduction & Motivation
Methodology
MapKurator
Pyramid
Text Rectification
- Let $G_{j,k}$ denote the $k$-th subset of ground truth label $G_j$. Because $G_{j,k}$ is not required to be a proper subset, it is possible that $G_{j,k} = G_j$. Let the set $S_{j,k} = \{L_1, L_2, ..., L_{p_{j,k}}\}$ contain the $p_{j,k}$ extracted labels that each correspond to the whole of $G_{j,k}$. The goal of the Text Rectification stage is to retain the single most accurate extracted label $L_i$ in each $S_{j,k}$ and discard the rest, so that every $G_{j,k}$ is left with exactly one 'representative'.
- To organize extracted labels into sets $S_{j,k}$:
  - Vectorize each extracted bounding box $P_i$ using its bottom-left and top-right Cartesian coordinates.
  - Run DBSCAN on the resulting four-dimensional vectors.
  - \textcolor{red}{Include DBSCAN hyperparameters. Confirm if outliers are still allocated to individual $S_{j,k}$.}
- To filter $S_{j,k}$ collections to their most appropriate representatives:
  - Attempt to retain the label within $S_{j,k}$ with the highest $C_i$.
  - \textcolor{red}{Further detail on the RANSAC process is needed.}
- Let $\sigma_{j,k}^{*}$ represent the single label from set $S_{j,k}$ after Text Rectification has been completed.
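
The grouping and filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the DBSCAN routine is a small pure-Python stand-in for a library version, the values `eps=10.0` and `min_samples=2` are placeholders (the actual hyperparameters are not yet documented), noise points are assumed to form their own singleton sets, and $C_i$ is assumed to be a per-label confidence score. The RANSAC step is not covered. The function names `dbscan` and `rectify` and the sample data are hypothetical.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [m for m in range(len(points))
                           if dist(points[j], points[m]) <= eps]
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def rectify(boxes, texts, confidences, eps=10.0, min_samples=2):
    """Group labels by bounding box, keep the highest-confidence one per group.

    boxes: (x_min, y_min, x_max, y_max) per label, i.e. the 4-D vector of P_i.
    eps and min_samples are placeholder values (assumption).
    """
    cluster_ids = dbscan(boxes, eps, min_samples)
    groups = {}
    for i, cid in enumerate(cluster_ids):
        # Assumption: each noise point becomes its own singleton set S_{j,k}.
        key = cid if cid != -1 else ("noise", i)
        groups.setdefault(key, []).append(i)
    return [texts[max(members, key=lambda i: confidences[i])]
            for members in groups.values()]


# Hypothetical extracted labels: three overlapping reads of one toponym,
# two of another, and one isolated label.
boxes = [(10, 10, 60, 30), (12, 11, 61, 31), (11, 9, 59, 29),
         (200, 200, 260, 220), (201, 202, 262, 221),
         (500, 500, 540, 520)]
texts = ["Jerusalm", "Jerusalem", "Jerusaiem",
         "Damascus Gate", "Damascus Gat", "Zion"]
confidences = [0.71, 0.93, 0.64, 0.88, 0.80, 0.55]

representatives = rectify(boxes, texts, confidences)
print(representatives)
```

Each returned string plays the role of $\sigma_{j,k}^{*}$ for its set. Treating noise points as singleton sets means no extracted label is silently dropped at this stage; whether that matches the intended handling of outliers is still an open question in the notes above.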
Word Amalgamation
Word Combination
Evaluation
Results
Limitations
Future work
GitHub Repository
References
Literature
- Kim, Jina, et al. "The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps." arXiv preprint arXiv:2306.17059 (2023).
- Li, Zekun, et al. "An automatic approach for generating rich, linked geo-metadata from historical map images." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020).