Spatialising Sarah Barclay Johnson's travelogue around Jerusalem (1858)
Abstract
Introduction
Drawing on "Hadji in Syria: or, Three years in Jerusalem" by Sarah Barclay Johnson (1858), this project digitally maps the toponyms in Johnson's 19th-century account of Jerusalem, with the aim of connecting the past and the present. By visualizing the toponyms Johnson recorded, the project aims to offer a dynamic tool for scholars and enthusiasts, contributing to the ongoing dialogue on the city's historical evolution.
This spatialization pays homage to Johnson's literary contribution and serves as a digital window onto a cultural crossroads: Jerusalem. The project invites users to engage with the city's history, fostering a deeper understanding of its heritage and of the interconnected narratives that have shaped it. By combining literature, history, and technology, we hope to enrich our collective understanding of Jerusalem's intricate past.
Motivation
Deliverables
- Pre-processed textual dataset of the book.
- Application results of NER using Spacy and GPT-4 on the book's text.
- Comparative analysis report of NER effectiveness between Spacy and GPT-4.
- Manually labelled dataset for a selected chapter for NER validation.
- QGIS mapping files visualizing named locations from selected chapters and the entire book.
- Visual representative maps of the narrative journey, highlighting key locations and paths.
- Developed scripts for automating Wikipedia page searches and extracting location coordinates.
- Dataset with matched coordinates for all identified locations.
- A developed platform to display project outputs.
Methodology
Assessing and Preparing OCR-Derived Text for Analysis
The first step was to acquire a text version of the book. We downloaded the OCR text from Google Books and then assessed its quality. The key metric in our assessment was the ratio of “words in the text that exist in a dictionary” to “total words” in the text, calculated using the NLTK library. Because the book contains passages in multiple languages, we set a threshold of 70% for this ratio: if the text met or exceeded it, we regarded its quality as satisfactory for our analysis.
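The ratio described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name, the tokenisation rule, and the toy vocabulary are assumptions. With NLTK installed and its `words` corpus downloaded, the vocabulary could instead be built from `nltk.corpus.words.words()`.

```python
import re

def dictionary_word_ratio(text, vocabulary):
    """Return the share of alphabetic tokens in `text` that appear in `vocabulary`."""
    # Keep alphabetic tokens only, so digits and punctuation are not counted as words.
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in vocabulary)
    return known / len(tokens)

# Hypothetical toy vocabulary, used here only to keep the example self-contained.
VOCABULARY = {"the", "city", "of", "is", "built", "on", "hills"}
QUALITY_THRESHOLD = 0.70  # threshold stated in the text

sample = "The city of Jerusalem is built on hills"
ratio = dictionary_word_ratio(sample, VOCABULARY)
print(f"ratio = {ratio:.3f}, satisfactory = {ratio >= QUALITY_THRESHOLD}")
```

Proper nouns such as "Jerusalem" count against the ratio in this sketch, which is one reason a threshold well below 100% is reasonable for a travelogue.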
Following this quality assessment, we proceeded with text preprocessing, adapted to the specific needs of our study. Notably, we chose not to remove stop words or convert the text to lowercase, maintaining the original structure and form of the text.
Detecting Locations
NER with Spacy
In the initial stages of the project, SpaCy was employed for Named Entity Recognition (NER) to analyze the text and automatically classify entities such as locations, organizations, and people. SpaCy's pre-trained models and linguistic features identify named entities within the text, which allows toponyms to be tagged automatically. Our focus was on extracting toponyms, i.e. place names or locations relevant to the geographic context of the travelogue. In SpaCy's labeling system these are usually "GPE" (geopolitical entities such as countries and cities) and "LOC" (non-GPE locations such as mountains and bodies of water).
However, as the project progressed, it became apparent that SpaCy's performance in accurately labeling toponyms was not entirely satisfactory, encountering mislabeling issues that could impact the precision of the spatial representation. Here is the SpaCy output from one sample paragraph:
Toponym Name | NER Label | Correct Label |
---|---|---|
Bethlehem | ORG | GPE |
Bethany | GPE | GPE |
Mary | PERSON | PERSON |
meek | PERSON | N.A. |
Lazarus | ORG | PERSON |
Christ | ORG | PERSON |
Calvary | ORG | LOC |
Olivet | PERSON | LOC |
This sample shows the mislabeling problem. In theory we only need to retrieve the toponyms, i.e. “GPE” and “LOC”, but SpaCy labelled some of them as “PERSON” or “ORG”. In other words, if we only select “GPE” and “LOC”, we lose some toponyms; if we also select “ORG” and “PERSON”, we pick up some non-toponyms.
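The trade-off can be made concrete with the rows of the sample table above. The sketch below does not call SpaCy; it only replays the predicted and correct labels from the table, and the function name is an illustrative assumption.

```python
# (entity text, label predicted by SpaCy, correct label), copied from the sample table.
# None stands for "N.A." (not a named entity).
SAMPLE = [
    ("Bethlehem", "ORG", "GPE"),
    ("Bethany", "GPE", "GPE"),
    ("Mary", "PERSON", "PERSON"),
    ("meek", "PERSON", None),
    ("Lazarus", "ORG", "PERSON"),
    ("Christ", "ORG", "PERSON"),
    ("Calvary", "ORG", "LOC"),
    ("Olivet", "PERSON", "LOC"),
]
TOPONYM_LABELS = {"GPE", "LOC"}

def evaluate_filter(rows, kept_labels):
    """Return (toponyms found, toponyms missed, non-toponyms let through)."""
    found, missed, noise = [], [], []
    for text, predicted, correct in rows:
        is_toponym = correct in TOPONYM_LABELS
        is_kept = predicted in kept_labels
        if is_toponym and is_kept:
            found.append(text)
        elif is_toponym:
            missed.append(text)
        elif is_kept:
            noise.append(text)
    return found, missed, noise

strict = evaluate_filter(SAMPLE, {"GPE", "LOC"})
broad = evaluate_filter(SAMPLE, {"GPE", "LOC", "ORG", "PERSON"})
print("GPE/LOC only:", strict)
print("with ORG/PERSON:", broad)
```

On this sample, the strict filter keeps one of the four toponyms, while the broad filter keeps all four but also lets four non-toponyms through.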
Difficulties when working with historical content
When applying Named Entity Recognition (NER) with SpaCy to historical content, we encountered significant challenges. The main problem was the frequent misidentification of locations, which results from place names changing over time, especially in historical and biblical contexts. These names also vary across languages, adding to the complexity. Furthermore, even on a close reading, it was occasionally difficult to determine the present-day identity of these places, because their names have changed over the centuries. Beyond being a linguistic and technical challenge, this complexity showed how essential it is to understand the relationships between locations and what they mean within the book's narrative, since these relationships strongly affect how the text is interpreted and understood overall.
GPT-4
Given the issues we encountered with SpaCy, we decided to try GPT-4. We accessed it through the ChatGPT interface with the Plus subscription plan, which includes the ability to attach documents and a model geared towards data analysis.
Our idea was, once again, to collect all the locations in chronological order and to encapsulate them in a standardized format such as JSON. In addition, we decided to take advantage of the LLM by retrieving the interactions that the author had with each place.
However, our prompt engineering went through many iterations before we got the results we were aiming for. The main issues we encountered were:
- the returned JSON file would not include all the locations in the text
- the labels and the format of the entry would change with every request
- the process would stop working whenever an error was found in the analysis
For these reasons, we ultimately defined our goal as maximising the number of locations the author had a real interaction with, rather than all the locations cited in the text. In other words, we wanted to make the model's task more specific while preserving its generalisability.
The prompt
Context of the task
I'm analyzing a travelogue written in the 19th century around Ancient Israel and Jerusalem, with the future goal of mapping all the locations visited.
Instructions of the task
I'm gonna provide you the chapter. Can you provide me the specific named entities of the places the author had an interaction with?
Guidelines for the entries
I want the named entities in chronological list. I don't want all the places mentioned, just the places observed from the distance, passed by, visited or where the author stayed for the night. Generate a neutral description of the interaction the author had with the place, in one sentence.
Specificity and comprehensivity
Please provide a comprehensive list of the locations mentioned in the text, including any buildings, gates, streets, and significant landmarks. Do not provide only the main places, but a comprehensive set of places.
Technical details of the entries
Please provide them in a JSON format, where you specify the name, the brief description and one label from ['observed', 'mentioned', 'visited', 'stayed']
Example of an often ignored place
Example of an entry:
{
"place": "Church of Yacobeiah", "interaction": "visited", "description": "Visited the ruins of the Church of Yacobeiah, a relic of the crusaders." }
Iteration nudge to avoid involuntary crashes
If the first language model does not work, try with something else. Keep trying until you get a solution.
Additional details to get uniform results and avoid misunderstandings
Keep the answer short. Do not invent any additional 'interaction' label, assign only one 'interaction' from the list ['observed', 'mentioned', 'visited', 'stayed']. Return me the final json (not just a sample of it).
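Because the labels and the format of the entries could change between requests, a small validation step on the returned JSON can catch deviations early. The sketch below is an assumption about how such a check might look, not the project's code: it assumes the response is a JSON array of objects with the keys shown in the example entry, and the second entry in the sample is hypothetical.

```python
import json

ALLOWED_INTERACTIONS = {"observed", "mentioned", "visited", "stayed"}
REQUIRED_KEYS = {"place", "interaction", "description"}

def is_valid_entry(entry):
    """Check that an entry has exactly the expected keys and an allowed label."""
    return (
        isinstance(entry, dict)
        and set(entry) == REQUIRED_KEYS
        and isinstance(entry["place"], str)
        and bool(entry["place"].strip())
        and entry["interaction"] in ALLOWED_INTERACTIONS
    )

def validate_entries(raw_json):
    """Split a JSON array of entries into (valid, rejected) lists."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of entries")
    valid = [entry for entry in entries if is_valid_entry(entry)]
    rejected = [entry for entry in entries if not is_valid_entry(entry)]
    return valid, rejected

sample_response = """
[
  {"place": "Church of Yacobeiah", "interaction": "visited",
   "description": "Visited the ruins of the Church of Yacobeiah, a relic of the crusaders."},
  {"place": "Example Gate", "interaction": "passed",
   "description": "Hypothetical entry with a label outside the allowed list."}
]
"""
valid, rejected = validate_entries(sample_response)
print(len(valid), "valid,", len(rejected), "rejected")
```

Rejected entries can then be inspected by hand or the request repeated, instead of silently mixing inconsistent labels into the dataset.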
Preliminary Analysis for Model Selection - Assessment Focusing on Chapter 3
Manual detection
In our preliminary analysis for model selection, we focused on Chapter 3 for a detailed assessment. The initial step involved manually detecting and labeling named entities, specifically locations mentioned in the text. This was achieved by highlighting relevant text segments and subsequently gathering these identified locations into a spreadsheet. To ensure the accuracy of our location identification, each location was cross-referenced with its corresponding Wikipedia page. This was an important step, especially in cases where the context of the book made it challenging to understand the exact nature of the places mentioned, even after analyzing the entire paragraph. This method provided a solid foundation for our named entity recognition approach.
Spacy Results
With SpaCy, we included results labeled as "GPE", "LOC", "ORG", and "PERSON". To better illustrate the SpaCy results, we drew a Venn diagram whose regions, with their sizes, are:
A: False negative, those toponyms in the book that SpaCy didn't get; (16)
B: True positive, with correct labels (GPE & LOC); (13)
C: True positive, with wrong labels (ORG & PERSON); (17)
D: False positive, with correct labels (GPE & LOC); (68)
E: False positive, with wrong labels (ORG & PERSON); (13)
Based on these sets, we proposed several metrics to assess the SpaCy results:
Accuracy: (B+C)/(A+B+C) = 0.652
Precision: (B+C)/(B+C+D+E) = 0.270
Mislabelling rate: C/(B+C) = 0.567
Thus, if we exclude the entities labeled as "ORG" and "PERSON", the new Accuracy would be B/(A+B) = 0.448, and the new Precision would be B/(B+E) = 0.5.
In either case, the results are not satisfactory.
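The figures above can be reproduced from the set sizes. The snippet below only replays the arithmetic: it takes the counts and the formulas exactly as stated in the text, and the variable names are illustrative.

```python
# Set sizes from the Venn diagram description above.
A, B, C, D, E = 16, 13, 17, 68, 13

accuracy = (B + C) / (A + B + C)
precision = (B + C) / (B + C + D + E)
mislabelling_rate = C / (B + C)

# Variants after excluding "ORG" and "PERSON", using the formulas as written in the text.
accuracy_gpe_loc = B / (A + B)
precision_gpe_loc = B / (B + E)

metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Mislabelling rate": mislabelling_rate,
    "Accuracy (GPE & LOC only)": accuracy_gpe_loc,
    "Precision (GPE & LOC only)": precision_gpe_loc,
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```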
GPT-4 Results
We opted for 2 metrics:
- Accuracy: the places identified by both us and GPT, as a share of the places identified by us
- Precision: the places identified by both us and GPT, as a share of the places identified by GPT
Sometimes GPT would identify an entity with a slight variation from the wording in the book, and the two would automatically be counted as different.
Although our goal is to keep our intervention to a minimum, we manually double-checked the entries without a correspondence and counted such variants as correctly identified.
Since the GPT-4 results outperformed those of SpaCy NER, the GPT-4 prompt presented above was used to retrieve the locations in the book.
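The two metrics amount to a set comparison between the manual list and the GPT list. The sketch below is a minimal illustration under stated assumptions: the place lists are hypothetical examples rather than project data, and the `normalise` helper (case folding and whitespace collapsing) is one assumed way to absorb trivial variations before the manual check described above.

```python
def normalise(name):
    """Case-fold a place name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())

def overlap_metrics(manual, predicted):
    """Compare two lists of place names as sets of normalised strings."""
    manual_set = {normalise(name) for name in manual}
    predicted_set = {normalise(name) for name in predicted}
    if not manual_set or not predicted_set:
        raise ValueError("both lists must contain at least one place")
    shared = manual_set & predicted_set
    return {
        "accuracy": len(shared) / len(manual_set),
        "precision": len(shared) / len(predicted_set),
        # Entries to double-check by hand: found by the model, absent from the manual list.
        "unmatched": sorted(predicted_set - manual_set),
    }

# Hypothetical lists, for illustration only.
manual_places = ["Jaffa Gate", "Mount of Olives", "Bethany", "Pool of Siloam"]
gpt_places = ["jaffa  gate", "Mount of Olives", "Mount Olivet"]

result = overlap_metrics(manual_places, gpt_places)
print(result)
```

The `unmatched` list is where a variant such as a different spelling would surface for manual review.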
Matching Wikipedia Pages
Using the Wikipedia API, locations identified by GPT were searched in Wikipedia, and the first relevant result was recorded. Additionally, the first image found on the page of the recorded link was retrieved. This approach was initially used to verify the accuracy of the manually determined locations. Once all locations had been obtained, it was also used to visualize the author's path and to acquire coordinates for locations that lacked them.
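A search of this kind can be expressed with the MediaWiki Action API (`action=query`, `list=search`). The sketch below only builds the request URL and reads the first hit from a response; it performs no network call. The English-language endpoint, the function names, and the rule of taking the first hit as "the first relevant result" are assumptions, and the sample response is a hand-written stand-in for a real one.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint: the English Wikipedia instance of the MediaWiki Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_search_url(place, limit=1):
    """Build a full-text search request for a place name."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": place,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

def first_result_url(response):
    """Return the page URL of the first search hit, or None if there is no hit."""
    hits = response.get("query", {}).get("search", [])
    if not hits:
        return None
    title = hits[0]["title"]
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

url = build_search_url("Jaffa Gate")
# Hand-written stand-in for the decoded JSON body of a search response.
sample_response = {"query": {"search": [{"title": "Jaffa Gate"}]}}
page_url = first_result_url(sample_response)
print(url)
print(page_url)
```

In practice the URL would be fetched with an HTTP client and the decoded JSON passed to `first_result_url`.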
Tracking Author's Route on Maps
Finalizing the List of Coordinates
1. Fuzzy matching GPT-4 results with an existing location list
We had a list of existing toponyms in Jerusalem, with coordinates, that had previously been extracted from a map. The list is multilingual: a single place may have several names in different languages, including English, Arabic, German, and Hebrew. Because the toponyms retrieved from the travelogue may be in languages other than English (for example in quotations or in historical and cultural references), and because the same place may be referred to in slightly different ways, a fuzzy matching algorithm was applied to match the toponyms retrieved by GPT against the multilingual toponyms in the list. We used the "fuzzywuzzy" package and accepted matches with a similarity score over 80. After fuzzy matching, we got X unique toponyms with Y unique coordinates.
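The matching step can be sketched as below. To stay dependency-free, this example uses `difflib.SequenceMatcher` from the standard library as a stand-in for the "fuzzywuzzy" scorer, so scores may differ from the project's. The gazetteer entries, their approximate coordinates, and the function names are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stand-in for a fuzzywuzzy score)."""
    return 100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def best_match(toponym, gazetteer, threshold=80):
    """Return (matched name, coordinates, score) for the best match above the threshold.

    `gazetteer` maps each known name, in any language, to a (lat, lon) pair.
    """
    best_name, best_score = None, 0.0
    for name in gazetteer:
        score = similarity(toponym, name)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score > threshold:
        return best_name, gazetteer[best_name], best_score
    return None

# Hypothetical gazetteer with approximate coordinates, for illustration only.
GAZETTEER = {
    "Jaffa Gate": (31.777, 35.228),
    "Damascus Gate": (31.782, 35.230),
}

print(best_match("Yaffa Gate", GAZETTEER))
print(best_match("Pool of Siloam", GAZETTEER))
```

A spelling variant such as "Yaffa Gate" clears the threshold, while a place absent from the list returns `None` and is left for the Wikipedia-based step.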
2. Retrieving coordinates by matched Wikipedia pages
Using the previously obtained Wikipedia links, coordinates were gathered by web scraping. To exclude broader locations such as the 'Mediterranean Sea', the coordinates obtained were filtered to the Jerusalem region. These coordinates were then added for the locations whose coordinates could not be determined in the previous step, producing the final version of the dataset.
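The regional filter and the gap-filling step can be sketched as follows. The bounding box is a hypothetical extent chosen for illustration, since the exact limits used are not stated, and the sample coordinates are approximate. Function names are assumptions as well.

```python
# Hypothetical bounding box around the Jerusalem region, in decimal degrees.
JERUSALEM_BBOX = {"min_lat": 31.60, "max_lat": 31.95, "min_lon": 35.00, "max_lon": 35.40}

def in_region(lat, lon, bbox=JERUSALEM_BBOX):
    """Return True if the point lies inside the bounding box."""
    return (bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"])

def filter_to_region(places, bbox=JERUSALEM_BBOX):
    """Keep only places that have coordinates inside the bounding box."""
    return {
        name: coords
        for name, coords in places.items()
        if coords is not None and in_region(coords[0], coords[1], bbox)
    }

def fill_missing(primary, fallback):
    """Use fallback coordinates only where the primary source has none."""
    merged = dict(primary)
    for name, coords in fallback.items():
        if merged.get(name) is None:
            merged[name] = coords
    return merged

# Illustrative inputs: coordinates scraped from Wikipedia pages (approximate values).
scraped = {
    "Jaffa Gate": (31.777, 35.228),
    "Mediterranean Sea": (35.0, 18.0),
    "Unresolved Place": None,
}
# Illustrative output of the fuzzy-matching step; None marks a location without a match.
fuzzy_matched = {"Damascus Gate": (31.782, 35.230), "Jaffa Gate": None}

final = fill_missing(fuzzy_matched, filter_to_region(scraped))
print(final)
```

Coordinates from the existing list take priority in this sketch, and scraped coordinates are only used to fill the remaining gaps.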
Chapter | Count of Geometry | Count of Location |
---|---|---|
1 | 1 | 7 |
2 | 2 | 16 |
3 | 21 | 32 |
4 | 9 | 16 |
5 | 4 | 6 |
6 | 1 | 4 |
7 | 4 | 11 |
8 | 7 | 10 |
9 | 2 | 7 |
10 | 4 | 8 |
11 | 3 | 8 |
12 | 2 | 4 |
13 | 4 | 6 |
14 | - | 4 |
16 | 4 | 12 |
18 | 4 | 7 |
19 | 2 | 7 |
20 | 3 | 5 |
Grand Total | 77 | 170 |
Visualization by GeoPandas
Daniele
- QGIS
- GeoPandas
Creating a Platform for Final Output
To display our final output, we built a website on which interactive maps about the travelogue are shown by chapters.
Results
Limitations and Further Work
- API for a smoother process
- Improve with the full book instead of chapter by chapter
- Automated application for all historical context
- Augmented interaction on map
Conclusion
Project Timeline & Milestones
Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Week 8 | | ✓ |
Week 9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | ✓ |
Week 12 | | ✓ |
Week 13 | | ✓ |
Week 14 | | |
GitHub Repository
Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors:
Authors: