Abstract

Introduction

Drawing on "Hadji in Syria: or, Three years in Jerusalem" by Sarah Barclay Johnson (1858), this project digitally maps the toponyms in Johnson's 19th-century account of Jerusalem in order to connect the city's past and present. By visualizing the toponyms Johnson recorded, the project aims to offer a dynamic tool for scholars and enthusiasts and to contribute to the ongoing dialogue on the city's historical evolution.

This spatialization pays homage to Johnson's literary contribution and serves as a digital window onto Jerusalem as a cultural crossroads. The project invites users to engage with the city's history, fostering a deeper understanding of its heritage and of the interconnected narratives that have shaped it. By combining literature, history, and technology, we hope to build a narrative that spans time and enriches our collective understanding of Jerusalem's intricate past.

Motivation

Deliverables

Methodology

Assessing and Preparing OCR-Derived Text for Analysis

For this project, which analyzes a single book, the first step was to acquire a text version of it. We downloaded the OCR text from Google Books and then assessed its quality. The key metric was the ratio of “words in the text that exist in a dictionary” to “total words” in the text, calculated using the NLTK library. Because the book includes multiple languages, we set the threshold for this ratio at 70%: a text that met or exceeded it was regarded as satisfactory for our analysis.
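The ratio described above can be sketched as follows. This is a minimal illustration rather than the project's actual script: the tokenization rule, the function names, and the choice of NLTK's `words` corpus as the dictionary are assumptions.

```python
import re

QUALITY_THRESHOLD = 0.70  # the 70% threshold mentioned in the text


def dictionary_ratio(text, vocabulary):
    """Return the share of word tokens in `text` that appear in `vocabulary`."""
    # Assumption: a "word" is any run of ASCII letters; the original
    # tokenization is not documented.
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in vocabulary)
    return known / len(tokens)


def is_quality_sufficient(text, vocabulary, threshold=QUALITY_THRESHOLD):
    """True when the dictionary ratio meets or exceeds the threshold."""
    return dictionary_ratio(text, vocabulary) >= threshold


def load_nltk_vocabulary():
    """Load an English word list from NLTK.

    Requires `pip install nltk` and a one-time `nltk.download("words")`.
    Which NLTK resource the project used is an assumption.
    """
    from nltk.corpus import words
    return {word.lower() for word in words.words()}
```

With NLTK installed, `is_quality_sufficient(ocr_text, load_nltk_vocabulary())` would apply the check to the whole book, where `ocr_text` is a hypothetical variable holding the downloaded OCR text.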

Following this quality assessment, we proceeded with text preprocessing, adapted to the specific needs of our study. Notably, we chose not to remove stop words or convert the text to lowercase, maintaining the original structure and form of the text.

Detecting Locations

NER with spaCy

spaCy API

The main problem with the spaCy output is mislabeling. In theory we only need to retrieve the toponyms, i.e. entities labelled “GPE” or “LOC”, but spaCy labelled some toponyms as “PERSON” or “ORG”. In other words, if we select only “GPE” and “LOC”, we lose some toponyms; if we also select “ORG” and “PERSON”, we pick up some non-toponyms.
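The trade-off can be made concrete with a small sketch. It keeps “GPE” and “LOC” entities as toponyms and sets “PERSON” and “ORG” entities aside for manual review instead of silently keeping or dropping them. The review bucket, the function names, and the `en_core_web_sm` model are assumptions for illustration; the project does not state how it handled these cases.

```python
TOPONYM_LABELS = {"GPE", "LOC"}        # labels that should mark places
AMBIGUOUS_LABELS = {"PERSON", "ORG"}   # labels that sometimes hide places


def split_entities(entities):
    """Split (text, label) pairs into likely toponyms and items to review."""
    toponyms, to_review = [], []
    for text, label in entities:
        if label in TOPONYM_LABELS:
            toponyms.append(text)
        elif label in AMBIGUOUS_LABELS:
            to_review.append(text)
        # every other label (DATE, CARDINAL, ...) is ignored
    return toponyms, to_review


def extract_entities(text, model_name="en_core_web_sm"):
    """Run spaCy NER and return (text, label) pairs.

    Requires spaCy and the named model to be installed; the model
    choice here is an assumption.
    """
    import spacy
    nlp = spacy.load(model_name)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

Calling `split_entities(extract_entities(chapter_text))` on a chapter (with `chapter_text` a hypothetical string) would give one list to map directly and a second, shorter list to check by hand.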

Difficulties when working with historical content

When applying Named Entity Recognition (NER) with spaCy to historical content, we encountered significant challenges. The main problem was the frequent misidentification of locations, largely because place names change over time, especially in historical and biblical contexts. These names also vary across languages, adding to the complexity. Furthermore, even on a close reading it was occasionally difficult to determine the present-day identity of a name, because names have shifted over the centuries. Beyond the linguistic and technical challenge, this complexity showed how essential it is to understand the relationships between locations and what they mean within the book's narrative, since these relationships strongly affect how the text is interpreted.

GPT-4

Preliminary Analysis for Model Selection - Assessment Focusing on Chapter 3

Manual detection

In our preliminary analysis for model selection, we focused on Chapter 3 for a detailed assessment. The first step was to manually detect and label named entities, specifically the locations mentioned in the text, by highlighting the relevant text segments and then gathering the identified locations into a spreadsheet. To check the accuracy of this identification, each location was cross-referenced with its corresponding Wikipedia page. This step mattered most where the context of the book made it difficult to understand the exact nature of a place, even after reading the entire paragraph. This manually labelled list provided a solid foundation for assessing our named entity recognition approaches.

spaCy Results

GPT-4 Results

Since GPT-4 outperformed spaCy NER on this chapter, the GPT-4 prompt was used to retrieve the locations in the book.
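One way to compare two extractors against the manually labelled list is to compute precision, recall, and F1 over sets of place names. The sketch below assumes exact matching after lowercasing and whitespace normalization; the metric the project actually used is not stated, and the function names are hypothetical.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())


def evaluate(predicted, reference):
    """Score predicted place names against a manually labelled reference."""
    predicted_set = {normalize(name) for name in predicted}
    reference_set = {normalize(name) for name in reference}
    true_positives = len(predicted_set & reference_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running `evaluate` once on the spaCy output and once on the GPT-4 output for Chapter 3, against the same manual list, gives directly comparable scores. Exact matching is strict, so spelling variants of the same place would count as misses unless they are reconciled first.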

Matching Wikipedia Pages

Using the Wikipedia API, each location identified by GPT-4 was searched on Wikipedia and the first relevant result was recorded, together with the first image found on the recorded page. This approach was initially used to verify the accuracy of the manually determined locations. Once all locations had been obtained, it was also used to visualize the author's path and to acquire coordinates for locations that lacked them.
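A search of this kind can be sketched with the MediaWiki Action API and the Python standard library. The snippet covers only the "first search result" step; the function names and the User-Agent string are placeholders, and the project's own script may use a different client or endpoint.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(place_name, limit=1):
    """Build a MediaWiki full-text search URL for a place name."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": place_name,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def first_result_title(response):
    """Return the title of the first search hit, or None if there is none."""
    results = response.get("query", {}).get("search", [])
    return results[0]["title"] if results else None


def search_wikipedia(place_name):
    """Query Wikipedia and return the first matching page title (needs network)."""
    request = urllib.request.Request(
        build_search_url(place_name),
        # Placeholder User-Agent; replace with real contact details.
        headers={"User-Agent": "toponym-mapping-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return first_result_title(json.load(reply))
```

Because the first hit is not always the intended place, especially for historical or biblical names, the returned title is best treated as a candidate to confirm rather than a final match.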

Tracking Author's Route on Maps

Finalizing the List of Coordinates

Fuzzy matching GPT-4 results with an existing location list
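As an illustration of this step, the sketch below matches an extracted name against a list of known locations using `difflib` from the Python standard library. The library, the 0.8 similarity cutoff, and the function name are assumptions; the project does not say which fuzzy-matching tool or threshold it used.

```python
import difflib


def match_location(name, known_locations, cutoff=0.8):
    """Return the closest known location for `name`, or None if none is close.

    Matching is case-insensitive; `cutoff` is a similarity score in [0, 1].
    """
    # Map lowercase forms back to the original spelling in the list.
    lookup = {known.lower(): known for known in known_locations}
    candidates = difflib.get_close_matches(
        name.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[candidates[0]] if candidates else None
```

A high cutoff keeps OCR slips such as a dropped letter while rejecting unrelated names; names that fall below it return `None` and can be resolved another way, for example through their matched Wikipedia page.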

Retrieving coordinates by matched Wikipedia pages

Visualization by GeoPandas

QGIS
GeoPandas
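Both QGIS and GeoPandas can read GeoJSON, so one simple hand-off format is a feature collection holding each located place as a point plus a line connecting them in visiting order. The sketch below builds such a structure with the standard library only. The function name is hypothetical, and the coordinates in the usage note are approximate values for illustration, not project data.

```python
import json


def build_route_geojson(stops):
    """Build a GeoJSON FeatureCollection from ordered (name, lat, lon) stops."""
    features = [
        {
            "type": "Feature",
            "properties": {"name": name, "order": order},
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        }
        for order, (name, lat, lon) in enumerate(stops, start=1)
    ]
    if len(stops) >= 2:
        # A line through the stops, in order, represents the route.
        features.append(
            {
                "type": "Feature",
                "properties": {"name": "route"},
                "geometry": {
                    "type": "LineString",
                    "coordinates": [[lon, lat] for _, lat, lon in stops],
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}


def save_geojson(collection, path):
    """Write the collection to disk so QGIS or GeoPandas can open it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(collection, handle, ensure_ascii=False, indent=2)
```

For example, `save_geojson(build_route_geojson([("Jerusalem", 31.78, 35.22), ("Bethlehem", 31.70, 35.20)]), "route.geojson")` writes a file that can be added to QGIS as a vector layer or loaded with `geopandas.read_file("route.geojson")`.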

Creating a Platform for Final Output

Results

Limitations and Further Work

- API for a smoother process
- Improve with the full book instead of chapter by chapter

Conclusion

Project Timeline & Milestones

Timeframe	Task	Completion
Week 4	Exploring literature of NER; Finding textual data of the book	✓
Week 5	Pre-processing text; Quality assessment of the data	✓
Week 6	Applying NER using spaCy	✓
Week 7	Manually labelling chapter 3; GPT-4 prompt engineering	✓
Week 8	Working on mapping with QGIS	✓
Week 9	Finalizing GPT-4 prompt; Automating Wikipedia page search	✓
Week 10	Finalizing the list of manually detected locations; Evaluation of GPT-4 and spaCy results for chapter 3	✓
Week 11	Matching the coordinates of the locations from chapter 3; QGIS mapping of the locations from chapter 3	✓
Week 12	Visualizing the full chapter 3 journey; Retrieving the locations from the entire book	✓
Week 13	Matching the coordinates of the locations from the entire book; Retrieving coordinates from matched Wikipedia pages; QGIS mapping of the locations from the entire book; Visualizing the full journey	✓
Week 14	Develop a platform to display outputs; Complete GitHub repository; Complete Wiki page; Complete presentation

GitHub Repository

GitHub Link

Spatialising Sarah Barclay Johnson's travelogue around Jerusalem (1858)
