Extracting Toponyms from Maps of Jerusalem
Revision as of 22:09, 5 December 2023
Project Timeline
Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Week 8 | | ✓ |
Week 9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | ✓ |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Introduction & Motivation
Methodology
MapKurator
Pyramid
Text Rectification
- Let \(G_{j,k}\) represent subset \(k\) of ground truth label \(G_j\). Because we do not require \(G_{j,k}\) to be a proper subset, it is possible that \(G_{j,k} = G_j\). Now let the set \(S_{j,k} = \{L_1, L_2, \ldots, L_{p_{j,k}}\}\) refer to the \(p_{j,k}\) extracted labels that correspond to \(G_{j,k}\) in its entirety. The goal of the Text Rectification stage is to retain the single most accurate extracted label \(L_i\) in a given \(S_{j,k}\) and exclude the rest; in other words, to filter \(S_{j,k}\) so that just one 'representative' remains for \(G_{j,k}\).
- To organize our extracted labels into the sets \(S_{j,k}\), we represent each extracted bounding box \(P_i\) as a four-dimensional vector formed from its bottom-left and top-right Cartesian coordinates, and run DBSCAN on these vectors. \(\textcolor{red}{\text{Tack on DBSCAN hyperparameters. Also, is this still true? ->}}\) Outliers are slotted into their own individual \(S_{j,k}\).
- To filter our \(S_{j,k}\) sets down to their most appropriate representatives, we first attempt to retain the label inside \(S_{j,k}\) with the highest \(C_i\). \(\textcolor{red}{\text{Go into RANSAC.}}\)
- Let \(\sigma_{j,k}^{*}\) be the single label that remains from set \(S_{j,k}\) after Text Rectification has occurred.
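The grouping and filtering steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes \(C_i\) is a per-label confidence score, the dictionary keys (`text`, `box`, `conf`) and the `eps` and `min_samples` values are hypothetical placeholders (no DBSCAN hyperparameters are stated here), and it covers only the first-pass "highest \(C_i\)" rule. A library implementation such as scikit-learn's `DBSCAN` would normally be used; a small pure-Python version is written out here only so the sketch is self-contained.

```python
from math import dist

NOISE = -1  # cluster id given to outliers


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: returns one cluster id per point, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels


def rectify(extracted, eps=10.0, min_samples=2):
    """Group labels by bounding box and keep the most confident one per group.

    Each label is a dict with hypothetical keys:
      'text', 'box' = (x_min, y_min, x_max, y_max), 'conf'.
    eps and min_samples are illustrative values, not the project's settings.
    """
    ids = dbscan([label["box"] for label in extracted], eps, min_samples)
    groups = {}
    for idx, (cid, label) in enumerate(zip(ids, extracted)):
        # Each outlier becomes its own singleton set S_{j,k}.
        key = ("noise", idx) if cid == NOISE else ("cluster", cid)
        groups.setdefault(key, []).append(label)
    # First-pass filter: the label with the highest confidence in each set.
    return [max(group, key=lambda label: label["conf"]) for group in groups.values()]


extracted = [
    {"text": "JERUSALEM", "box": (100, 200, 300, 240), "conf": 0.91},
    {"text": "JERUSALFM", "box": (102, 201, 298, 239), "conf": 0.74},
    {"text": "Jaffa Gate", "box": (500, 50, 620, 80), "conf": 0.88},
]
representatives = rectify(extracted)
```

In this toy input the two near-identical boxes fall into one cluster and the higher-confidence reading is kept, while the isolated box is treated as an outlier and retained as its own set.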
Word Amalgamation
Word Combination
Evaluation
Results
Limitations
Future work
Github Repository
References
Literature
- Kim, Jina, et al. "The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps." arXiv preprint arXiv:2306.17059 (2023).
- Li, Zekun, et al. "An automatic approach for generating rich, linked geo-metadata from historical map images." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020).