Extracting Toponyms from Maps of Jerusalem: Difference between revisions

Revision as of 22:11, 5 December 2023

Project Timeline

Timeframe	Task	Completion
Week 4	Finalize and present project proposals. Toponym extraction project selected.	✓
Week 5	Survey SOTA toponym extraction tools.	✓
Week 6	Port MapKurator's Spotter tool and model weights into Windows-based Python. Select two (later four) maps to use when implementing, evaluating, and fine-tuning MapKurator's model.	✓
Week 7	Create ground truth labels for first map with VIA's online interface.	✓
Week 8	Create ground truth labels for second map. Implement 1:1-matched precision and recall via IoU (geometry) and normalized Levenshtein (text). Calculate baseline accuracy statistics.	✓
Week 9	Implement multi-layer pyramid application of MapKurator's Spotter.	✓
Week 10	Create ground truth labels for third map. Implement toponym rectification and amalgamation on pyramid-derived toponyms.	✓
Week 11	Calculate pyramid accuracy statistics. Fine-tune toponym rectification and amalgamation. Deliver Midterm presentation.	✓
Week 12	Launch Wiki. Group words into toponyms via polygon size and location. Apply NLP tools to correct toponyms based on MapKurator strategy.
Week 13	Create ground truth labels for fourth map. Calculate final accuracy statistics. Hierarchize final toponyms and develop Voronoi map.
Week 14	Prototype toponym-disagreement visualizer. Finalize Wiki and deliver presentation.

Introduction & Motivation

A sample of the linguistic, geometrical, and typographical diversity in 19th-century maps of Jerusalem.

Methodology

MapKurator

Pyramid

Text Rectification

Let $G_{j,k}$ denote subset $k$ of ground truth label $G_j$. Because $G_{j,k}$ is not required to be a proper subset, $G_{j,k} = G_j$ is possible. Let the set $S_{j,k} = \{L_1, L_2, \ldots, L_{p_{j,k}}\}$ contain the $p_{j,k}$ extracted labels that correspond to $G_{j,k}$ in its entirety. The goal of the Text Rectification stage is to retain the single most accurate extracted label $L_i$ in each $S_{j,k}$ and discard the rest, so that exactly one 'representative' remains for $G_{j,k}$.

To organize the extracted labels into the sets $S_{j,k}$, we represent each extracted bounding box $P_i$ by its bottom-left and top-right Cartesian coordinates and run DBSCAN on the resulting four-dimensional vectors. \textcolor{red}{Tack on DBSCAN hyperparameters. Also, is this still true? ->} Each outlier is placed in its own individual $S_{j,k}$.
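
The grouping step described above can be sketched as follows. This is only an illustration under stated assumptions: the DBSCAN hyperparameters are not given in this section, so `eps` and `min_samples` below are placeholder values, the box coordinates are made up, and the helper names `dbscan` and `group_boxes` are hypothetical. A small pure-Python DBSCAN is used to keep the sketch dependency-free; a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` would be the more usual choice.

```python
from collections import deque
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE for outliers."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def neighbours(i):
        # Brute-force region query; the result includes point i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def group_boxes(boxes, eps, min_samples):
    """Group boxes (x_min, y_min, x_max, y_max) into candidate sets S_{j,k}."""
    labels = dbscan(boxes, eps, min_samples)
    groups = {}
    for index, label in enumerate(labels):
        # Every outlier gets a unique key so it forms its own singleton set.
        key = ("outlier", index) if label == NOISE else ("cluster", label)
        groups.setdefault(key, []).append(index)
    return sorted(groups.values())


# Three near-duplicate detections of one word plus one isolated detection.
boxes = [
    (100, 100, 200, 130),
    (102, 101, 198, 131),
    (99, 98, 203, 129),
    (400, 400, 500, 430),
]
print(group_boxes(boxes, eps=10.0, min_samples=2))  # [[0, 1, 2], [3]]
```

Because all four corner coordinates enter the distance, two boxes only cluster together when they agree in both position and size, which is what makes the four-dimensional representation suitable for finding near-duplicate detections.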

To filter each $S_{j,k}$ down to its most appropriate representative, we first attempt to retain the label in $S_{j,k}$ with the highest $C_i$. \textcolor{red}{Go into RANSACK.}

Let $\sigma_{j,k}^{*}$ be the single label from set $S_{j,k}$ after Text Rectification has occurred.
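
As a minimal sketch of the first filtering attempt, the snippet below keeps, for each set $S_{j,k}$, the label with the highest $C_i$. It assumes that $C_i$ is a per-label confidence score, which this section does not state explicitly; the function name `rectify` and the example values are hypothetical, and any later refinement step is not covered.

```python
def rectify(groups, confidences):
    """Return, for each group of label indices, the index with the highest C_i.

    groups      -- list of lists of label indices (one list per S_{j,k})
    confidences -- confidences[i] is the assumed score C_i of label L_i
    """
    # max() returns the first maximal element, so ties go to the lowest index.
    return [max(group, key=lambda i: confidences[i]) for group in groups]


# Two candidate sets: three overlapping detections and one singleton.
groups = [[0, 1, 2], [3]]
confidences = [0.71, 0.93, 0.88, 0.40]
print(rectify(groups, confidences))  # [1, 3]
```

Each returned index plays the role of $\sigma_{j,k}^{*}$ for its set.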

Word Amalgamation

Word Combination

Single Line	Multiple Lines	Curved Line

Evaluation
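
The project timeline describes 1:1-matched precision and recall computed via IoU (geometry) and normalized Levenshtein distance (text). The sketch below shows one way these measures could fit together. It is not the project's implementation: it assumes axis-aligned boxes rather than general polygons, a greedy highest-IoU-first matching, and illustrative thresholds `min_iou` and `min_text`; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s, t):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def text_similarity(s, t):
    """1 minus the Levenshtein distance normalized by the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def match_one_to_one(truth, extracted, min_iou=0.5):
    """Greedy 1:1 matching: best-IoU pairs first, each label used at most once."""
    pairs = sorted(
        ((iou(g["box"], e["box"]), gi, ei)
         for gi, g in enumerate(truth)
         for ei, e in enumerate(extracted)),
        reverse=True,
    )
    used_truth, used_extracted, matches = set(), set(), []
    for score, gi, ei in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if gi in used_truth or ei in used_extracted:
            continue
        used_truth.add(gi)
        used_extracted.add(ei)
        matches.append((gi, ei))
    return matches


def precision_recall(truth, extracted, min_iou=0.5, min_text=0.75):
    """Count a match as correct only if its text is also similar enough."""
    correct = sum(
        1
        for gi, ei in match_one_to_one(truth, extracted, min_iou)
        if text_similarity(truth[gi]["text"], extracted[ei]["text"]) >= min_text
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


truth = [
    {"box": (0, 0, 10, 10), "text": "Jerusalem"},
    {"box": (20, 20, 30, 30), "text": "Jaffa Gate"},
]
extracted = [
    {"box": (1, 0, 10, 10), "text": "Jerusalern"},
    {"box": (50, 50, 60, 60), "text": "Zion"},
    {"box": (20, 20, 30, 31), "text": "Jaffa Gate"},
]
precision, recall = precision_recall(truth, extracted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case both ground truth labels find a matching extraction, while the unmatched "Zion" detection lowers precision only.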

Results

Limitations

Future work

GitHub Repository

Jerusalem Maps EPFL DH405

References

Literature

Kim, Jina, et al. "The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps." arXiv preprint arXiv:2306.17059 (2023).
Li, Zekun, et al. "An automatic approach for generating rich, linked geo-metadata from historical map images." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020).
