Extracting Toponyms from Maps of Jerusalem

From FDHwiki
Revision as of 22:11, 5 December 2023

Project Timeline

Timeframe Task Completion
Week 4
  • Finalize and present project proposals.
    • Toponym extraction project selected.
Week 5
  • Survey SOTA toponym extraction tools.
Week 6
  • Port MapKurator's Spotter tool and model weights into Windows-based Python.
  • Select two (later four) maps to use when implementing, evaluating, and fine-tuning MapKurator's model.
Week 7
  • Create ground truth labels for first map with VIA's online interface.
Week 8
  • Create ground truth labels for second map.
  • Implement 1:1-matched precision and recall via IoU (geometry) and normalized Levenshtein (text).
  • Calculate baseline accuracy statistics.
Week 9
  • Implement multi-layer pyramid application of MapKurator's Spotter.
Week 10
  • Create ground truth labels for third map.
  • Implement toponym rectification and amalgamation on pyramid-derived toponyms.
Week 11
  • Calculate pyramid accuracy statistics.
  • Fine-tune toponym rectification and amalgamation.
  • Deliver Midterm presentation.
Week 12
  • Launch Wiki.
  • Group words into toponyms via polygon size and location.
  • Apply NLP tools to correct toponyms based on MapKurator strategy.
Week 13
  • Create ground truth labels for fourth map.
  • Calculate final accuracy statistics.
  • Hierarchize final toponyms and develop Voronoi map.
Week 14
  • Prototype toponym-disagreement visualizer.
  • Finalize Wiki and deliver presentation.
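For reference, the two matching metrics named under Week 8 can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and normalizes the Levenshtein distance by the length of the longer string. Both conventions, and all function names, are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0
```

Under these conventions, a higher IoU indicates better geometric agreement and a lower normalized distance indicates better textual agreement.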

Introduction & Motivation

A sample of the linguistic, geometrical, and typographical diversity in 19th-century maps of Jerusalem.

Methodology

MapKurator

Pyramid

Text Rectification

Let $G_{j,k}$ denote the $k$-th subset of ground truth label $G_j$. Because we do not require $G_{j,k}$ to be a proper subset, it is possible that $G_{j,k} = G_j$. Now let the set $S_{j,k} = \{L_1, L_2, \ldots, L_{p_{j,k}}\}$ contain the $p_{j,k}$ extracted labels that correspond to $G_{j,k}$ in its entirety. The goal of the Text Rectification stage is to retain the single most accurate extracted label $L_i$ in each $S_{j,k}$ and discard the rest; in other words, to filter $S_{j,k}$ so that just one 'representative' remains for $G_{j,k}$.

To organize our extracted labels into the sets $S_{j,k}$, we vectorize each extracted bounding box $P_i$ by the Cartesian coordinates of its bottom-left and top-right corners and run DBSCAN on the resulting four-dimensional vectors. \textcolor{red}{Tack on DBSCAN hyperparameters. Also, is this still true? ->} Each outlier is slotted into its own individual $S_{j,k}$.
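The clustering step above can be sketched as follows. This is a minimal, standard-library DBSCAN written for illustration only; in practice a library implementation such as scikit-learn's `DBSCAN` would likely be used. The function names and the `eps` and `min_samples` values are hypothetical, since the hyperparameters are not stated here.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # All points within eps of point i (Euclidean), including i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels


def group_boxes(boxes, eps, min_samples):
    """Group box indices into sets; each outlier becomes its own singleton set."""
    labels = dbscan(boxes, eps, min_samples)
    groups = {}
    next_id = max(labels, default=-1) + 1
    for idx, lab in enumerate(labels):
        if lab == -1:
            lab, next_id = next_id, next_id + 1
        groups.setdefault(lab, []).append(idx)
    return list(groups.values())


# Each box is (x_min, y_min, x_max, y_max); eps=5.0 and min_samples=2 are illustrative.
boxes = [(10, 10, 50, 20), (11, 9, 51, 21), (12, 10, 49, 20), (200, 200, 260, 215)]
print(group_boxes(boxes, eps=5.0, min_samples=2))  # → [[0, 1, 2], [3]]
```

The three near-identical boxes form one set, while the distant box is treated as noise and placed in a set of its own.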

To filter each $S_{j,k}$ down to its most appropriate representative, we first attempt to retain the label in $S_{j,k}$ with the highest $C_i$. \textcolor{red}{Go into RANSAC.}
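The first filtering attempt can be sketched as below. The sketch assumes that $C_i$ is a per-label confidence score reported by the text spotter, which this section does not state explicitly. The `ExtractedLabel` class and the `rectify` function are hypothetical names introduced for illustration, and the RANSAC step is not covered.

```python
from dataclasses import dataclass


@dataclass
class ExtractedLabel:
    text: str
    confidence: float  # assumed meaning of C_i: the spotter's confidence score


def rectify(groups):
    """Keep the highest-confidence label from each non-empty set S_{j,k}."""
    return [max(group, key=lambda label: label.confidence)
            for group in groups if group]


# Illustrative values only; these are not taken from the project's data.
groups = [
    [ExtractedLabel("JERUSALEM", 0.91), ExtractedLabel("JERUSALFM", 0.62)],
    [ExtractedLabel("Gate", 0.80)],
]
print([label.text for label in rectify(groups)])  # → ['JERUSALEM', 'Gate']
```

When two labels tie, `max` returns the first one encountered, so the order of labels within a set matters under this sketch.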

Let $\sigma_{j,k}^{*}$ denote the single label retained from set $S_{j,k}$ once Text Rectification is complete.

Word Amalgamation

Word Combination

Single Line
Multiple Lines
Curved Line

Evaluation

Results

Limitations

Future work

Github Repository

Jerusalem Maps EPFL DH405

References

Literature

  • Kim, Jina, et al. "The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps." arXiv preprint arXiv:2306.17059 (2023).
  • Li, Zekun, et al. "An automatic approach for generating rich, linked geo-metadata from historical map images." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

Webpages