Marino Sanudo's Diary: Difference between revisions
Line 1: | Line 1: | ||
==Introductions== | ==Introductions== | ||
[[File:SanudoImage.jpg|300px|thumb|right|Marino Sanudo]] | |||
The project focused on analyzing the diaries of Marino Sanudo [https://it.wikipedia.org/wiki/Marin_Sanudo_il_Giovane], a key historical source for understanding the Renaissance period. The primary goal was to create an index of people and places mentioned in the diaries, pair these entities, and analyze the potential relationships between them. | |||
<br><br> | <br><br> | ||
==Historical Context== | ==Historical Context== | ||
[[File:Libretto.jpg|190px|thumb|left|A page of one diary]] | |||
Who Was Marino Sanudo? | Who Was Marino Sanudo? | ||
Marino Sanudo (1466–1536) was a Venetian historian, diarist, and politician whose extensive diaries, Diarii, provide a meticulous chronicle of daily life, politics, and events in Renaissance Venice. Sanudo devoted much of his life to recording the intricacies of Venetian society, governance, and international relations, making him one of the most significant chroniclers of his era. | Marino Sanudo (1466–1536) was a Venetian historian, diarist, and politician whose extensive diaries, Diarii, provide a meticulous chronicle of daily life, politics, and events in Renaissance Venice. Sanudo devoted much of his life to recording the intricacies of Venetian society, governance, and international relations, making him one of the most significant chroniclers of his era. | ||
The Importance of His Diaries | The Importance of His Diaries | ||
Line 12: | Line 17: | ||
Relevance Today | Relevance Today | ||
Studying Marino Sanudo’s diaries remains highly relevant for modern historians, linguists, and data analysts. They provide a primary source for understanding Renaissance politics, diplomacy, and social hierarchies. Furthermore, the diaries’ exhaustive detail lends itself to contemporary methods of analysis, such as network mapping and data visualization, enabling new interpretations and uncovering hidden patterns in historical relationships. By examining the interconnectedness of individuals and places, Sanudo’s work sheds light on the broader dynamics of Renaissance Europe, offering lessons that resonate even in today’s globalized world. | Studying Marino Sanudo’s diaries remains highly relevant for modern historians, linguists, and data analysts. They provide a primary source for understanding Renaissance politics, diplomacy, and social hierarchies. Furthermore, the diaries’ exhaustive detail lends itself to contemporary methods of analysis, such as network mapping and data visualization, enabling new interpretations and uncovering hidden patterns in historical relationships. By examining the interconnectedness of individuals and places, Sanudo’s work sheds light on the broader dynamics of Renaissance Europe, offering lessons that resonate even in today’s globalized world. | ||
==Motivation and description of the deliverables == | |||
The decision to analyze Marino Sanudo’s diaries stemmed from their exceptional value as a primary source for understanding Renaissance Venice and its influence on European history. Sanudo’s meticulous documentation provides unparalleled insights into the sociopolitical and cultural dynamics of the time, capturing events ranging from significant political maneuvers to the nuances of daily life. This project aimed to leverage this rich historical resource to explore connections between individuals and places, offering a fresh perspective on the networks that shaped Renaissance society. | |||
Our motivation was twofold: to deepen historical understanding and to demonstrate the potential of digital humanities. By employing modern data analysis tools and visualization techniques, we aimed to uncover patterns and relationships that might remain hidden in traditional textual analysis. Sanudo’s diaries, with their wealth of names, places, and detailed events, provided the ideal foundation for such an interdisciplinary approach, bridging the gap between historical research and innovative technology. | |||
The project deliverables reflect this dual objective. First, we developed an indexed dataset of names and places mentioned in Sanudo’s diaries, categorized by the relationships and contexts in which they appeared. This dataset forms the basis for further exploration of Renaissance networks. Second, we analyzed these relationships to identify significant patterns, such as the prominence of certain individuals in specific locations or events, and created visualizations to illustrate these findings. These analyses contribute not only to a deeper understanding of Venetian society but also to broader discussions about the interconnectedness of Renaissance Europe. | |||
To ensure accessibility, we documented our research on a dedicated website and created a Wikipedia page summarizing the project and its findings. These platforms serve as public-facing resources, promoting engagement with the material and showcasing the value of combining historical research with digital tools. This project exemplifies how modern methodologies can enrich our appreciation of the past. | |||
== Project Plan and Milestones== | == Project Plan and Milestones== | ||
The project was organized on a weekly basis to ensure steady progress and a balanced workload. Each phase was carefully planned with clearly defined objectives and milestones, promoting effective collaboration and equitable division of tasks among team members. | The project was organized on a weekly basis to ensure steady progress and a balanced workload. Each phase was carefully planned with clearly defined objectives and milestones, promoting effective collaboration and equitable division of tasks among team members. | ||
The first milestone (13.10) involved deciding on the project's focus. After thorough discussions, we collectively chose to analyze Marino Sanudo’s diaries, given their historical significance and potential for data-driven exploration. This phase established a shared understanding of the project, laid the foundation for subsequent work, and clarified the scope of our research.
The second milestone (14.10) focused on optimizing the extraction of indexes for names and places from the diaries. This required refining our methods for data extraction and ensuring accuracy in capturing and categorizing entities. Alongside this, we worked on identifying the geolocations of the places mentioned in the index, using historical and modern mapping tools to ensure precise identification. | The second milestone (14.10) focused on optimizing the extraction of indexes for names and places from the diaries. This required refining our methods for data extraction and ensuring accuracy in capturing and categorizing entities. Alongside this, we worked on identifying the geolocations of the places mentioned in the index, using historical and modern mapping tools to ensure precise identification. This step was critical to linking historical references with real-world locations, providing a solid basis for subsequent analysis. | ||
The final stages of the project (19.12) marked the transition to analyzing relationships between the extracted names and places. Using the indexed data, we explored potential connections, identifying patterns and trends that revealed insights into Renaissance Venice's social, political, and geographic networks. We collaboratively built a Wikipedia page to document our research and created a dedicated website to present our results in an accessible and visually engaging manner. This phase also included preparing for the final presentation, ensuring that every team member contributed to summarizing and showcasing the work.
By adhering to this structured approach and dividing tasks equitably, we achieved a comprehensive analysis of Sanudo’s diaries. Combining historical research with modern digital tools, we uncovered new insights into his world and its relevance today. | |||
{| class="wikitable" | {| class="wikitable" | ||
|+ '''Workflow''' | |+ '''Workflow''' | ||
Line 59: | Line 75: | ||
'''Deliver GitHub + wiki on 18.12''' <br> '''Final presentation on 19.12''' | '''Deliver GitHub + wiki on 18.12''' <br> '''Final presentation on 19.12''' | ||
|} | |} | ||
= Methodology = | = Methodology = | ||
== Data preparation == | == Data preparation == | ||
The first step was to obtain a text version of the diaries. After exploring several sources, including The Online Books Page [https://onlinebooks.library.upenn.edu/webbin/metabook?id=sanudodiary], we identified three potential websites for downloading the text. Ultimately, we selected the version available on the Internet Archive [https://archive.org/details/idiariidimarino00allegoog] because it offered a more comprehensive set of tools for our analysis. We downloaded the text from this source and compared it to the versions from Google Books and HathiTrust, confirming that it best suited our needs.
We then decided to focus our analysis on the indices included in each volume, which listed the names of people and places alongside the corresponding column numbers. | We then decided to focus our analysis on the indices included in each volume, which listed the names of people and places alongside the corresponding column numbers. | ||
=== Place index ===
[[File:Screenshot 2024-12-15 alle 15.21.47.png|200px|thumb|left|Place Index in the diary]] | |||
Our primary focus was on the places mentioned in Venice. The index of places was significantly shorter, allowing us to analyze it manually. Each entry in the index included headings that often indicated a hierarchical relationship, suggesting that a location belonged to a broader area indicated by the preceding indentation. | Our primary focus was on the places mentioned in Venice. The index of places was significantly shorter, allowing us to analyze it manually. Each entry in the index included headings that often indicated a hierarchical relationship, suggesting that a location belonged to a broader area indicated by the preceding indentation. | ||
Line 85: | Line 103: | ||
This structured dataset captures both the explicit details from the index and the inferred hierarchical relationships, making it suitable for further analysis and exploration. | This structured dataset captures both the explicit details from the index and the inferred hierarchical relationships, making it suitable for further analysis and exploration. | ||
=== Name index ===
The index of names proved to be significantly more challenging to analyze than the index of places, as it spanned approximately 80 pages. To tackle this, we first provided examples of the desired output and then used an Italian-trained OCR model to process the text and generate a preliminary table of names. This approach differed from traditional OCR methods, allowing for a more accurate extraction tailored to our project. | The index of names proved to be significantly more challenging to analyze than the index of places, as it spanned approximately 80 pages. To tackle this, we first provided examples of the desired output and then used an Italian-trained OCR model to process the text and generate a preliminary table of names. This approach differed from traditional OCR methods, allowing for a more accurate extraction tailored to our project. | ||
==== Dataset Structure ====
[[File:Screenshot 2024-12-17 alle 17.00.46.png|400px|thumb|left|Name Index structure]] | |||
The dataset generated for people consists of various features, such as a unique identifier (id) for each entry, the primary name of the person (name), any alternative names or variations (alias), the specific volume where the person is mentioned (volume), the column number within the index where the person is listed (column), the broader family or hierarchical category to which the person belongs (parents), and additional details or notes about the person (description). This structured dataset organizes the data in a way that preserves both explicit information and contextual relationships, enabling in-depth analysis and exploration. | |||
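The record layout described above can be pictured as a small typed structure. This is only a sketch: the field names come from the description, while the Python types and the sample values are assumptions made for illustration and are not taken from the actual dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonEntry:
    """One row of the name index (types are assumed, not confirmed)."""
    id: int                      # unique identifier of the entry
    name: str                    # primary name of the person
    alias: Optional[str]         # alternative names or spellings, if any
    volume: int                  # volume in which the person is mentioned
    column: int                  # column number listed in the index
    parents: Optional[str]       # family or hierarchical heading above the entry
    description: Optional[str]   # additional notes from the index


# Hypothetical example row, invented for illustration only.
example = PersonEntry(
    id=1,
    name="Example Name",
    alias=None,
    volume=5,
    column=78,
    parents="Example Family",
    description="illustrative entry",
)
```

Keeping `parents` as a separate field is what lets the familial groupings inferred from indentation survive alongside the literal index text.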
==== Cleaning and Error Correction ==== | ==== Cleaning and Error Correction ==== | ||
Line 97: | Line 117: | ||
==== Observations ==== | ==== Observations ==== | ||
[[File:FAMILY.png|220px|thumb|right|Family structure index]] | |||
Several observations emerged during the analysis. The first surname listed under a heading often applied to subsequent names following the same indentation, providing insight into familial groupings. In some cases, ellipses (...) were used in place of names. This practice was historically employed for names that were unknown at the time of writing, with the intention that they could be added later if discovered, or to deliberately anonymize individuals. These findings enriched our understanding of the index, offering valuable context for both its historical and structural significance. | Several observations emerged during the analysis. The first surname listed under a heading often applied to subsequent names following the same indentation, providing insight into familial groupings. In some cases, ellipses (...) were used in place of names. This practice was historically employed for names that were unknown at the time of writing, with the intention that they could be added later if discovered, or to deliberately anonymize individuals. These findings enriched our understanding of the index, offering valuable context for both its historical and structural significance. | ||
Line 113: | Line 133: | ||
=== Text Management === | === Text Management === | ||
[[File:text.png|400px|thumb|right|text dataset structure]] | |||
After completing the processing of the indexes, the final step in preparing the data was to work on the full text. Although we considered using an alternative OCR system, the OCR provided by Internet Archive proved particularly useful. This system included OCR pages in JSON format, which provided the start and end character for each page. Thanks to this feature, it was possible to split the text into individual pages, a crucial step given the need to organize the data by columns.
Once the text was divided into pages, it became necessary to identify and align the columns. A manual review of the OCR revealed that the text columns of interest only began after page 25. From that point onward, the columns were numbered starting at 5 and 6. This numbering was chosen to create a system consistent with the original text, ensuring the columns were properly aligned to the required format. | Once the text was divided into pages, it became necessary to identify and align the columns. A manual review of the OCR revealed that the text columns of interest only began after page 25. From that point onward, the columns were numbered starting at 5 and 6. This numbering was chosen to create a system consistent with the original text, ensuring the columns were properly aligned to the required format. | ||
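A minimal sketch of these two steps follows. It assumes that the page boundaries are available as `(start, end)` character offsets with an exclusive end, that every page of the main text carries exactly two consecutively numbered columns, and that the page holding columns 5 and 6 has index 25. The function names and the exact page index are hypothetical, not the project's own code.

```python
def split_pages(full_text, page_spans):
    """Cut the OCR text into pages using (start, end) character offsets."""
    return [full_text[start:end] for start, end in page_spans]


def columns_for_page(page_index, first_text_page=25, first_column=5):
    """Return the (left, right) column numbers printed on a page.

    Pages before the main text have no numbered columns, so None is returned.
    """
    if page_index < first_text_page:
        return None
    left = first_column + 2 * (page_index - first_text_page)
    return (left, left + 1)
```

With this mapping, an index entry that cites a column number can be traced back to the page text that contains it.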
Line 124: | Line 144: | ||
==== Pipeline ==== | ==== Pipeline ==== | ||
[[File:indexplace.png|600px|thumb|right|Output with latitude and longitude]] | |||
Load and preprocess data | Load and preprocess data | ||
Line 159: | Line 179: | ||
The API works best with simple and precise input, whereas long or complex phrases hinder matching. | The API works best with simple and precise input, whereas long or complex phrases hinder matching. | ||
==== Venetian Church Dictionary ==== | ==== Venetian Church Dictionary ==== | ||
[[File:chiesa.png|500px|thumb|right|Church dictionary]] | |||
This dictionary aids in geolocating churches by matching entries from the dataset with those in the dictionary using Italian and Venetian labels. | This dictionary aids in geolocating churches by matching entries from the dataset with those in the dictionary using Italian and Venetian labels. | ||
Line 165: | Line 186: | ||
Filter dataset entries containing "chiesa" (church) or synonyms in the 'name' field. | Filter dataset entries containing "chiesa" (church) or synonyms in the 'name' field. | ||
Use only church entries with geolocation and either 'venetianLabel' or 'italianLabel' not null. | Use only church entries with geolocation and either 'venetianLabel' or 'italianLabel' not null. | ||
Assess similarity between strings using Sentence-BERT (SBERT) [https://sbert.net/].
Advantages: | Advantages: | ||
Restricting to church-related entries minimizes unrelated matches. | Restricting to church-related entries minimizes unrelated matches. | ||
SBERT captures semantic similarities, such as "Chiesa di San Marco" and "Basilica di San Marco." | SBERT captures semantic similarities, such as "Chiesa di San Marco" and "Basilica di San Marco." | ||
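The two filtering steps that precede the SBERT comparison can be sketched as follows. The keys 'name', 'venetianLabel' and 'italianLabel' are those mentioned above; the coordinate keys `lat` and `lon` and the synonym list are assumptions made for this example, and the embedding step itself is left out.

```python
# Assumed list of church-related words; the project's real list is not given.
CHURCH_WORDS = ("chiesa", "basilica")


def is_church_entry(entry):
    """True if the place entry's 'name' mentions a church-related word."""
    name = (entry.get("name") or "").lower()
    return any(word in name for word in CHURCH_WORDS)


def usable_dictionary_entry(church):
    """True if a dictionary entry has coordinates and at least one label."""
    has_coords = church.get("lat") is not None and church.get("lon") is not None
    has_label = bool(church.get("venetianLabel") or church.get("italianLabel"))
    return has_coords and has_label
```

Only entries passing both filters would then be encoded and compared, which keeps the number of string pairs small.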
==== Catastici and Sommarioni ====
This step aimed to associate private houses ("casa") with coordinates. However, no results were obtained due to several challenges, including hardware limitations that prevented the generation of string encodings for the large dataset within a reasonable timeframe, the possibility that the specific volume analyzed (vol. 5) lacked relevant entries, and a temporal mismatch between the Marino Sanudo volume analyzed and the Catastici records.
To address these challenges, a dictionary was created from the Catastici database to facilitate semantic similarity assessments with entries from the name index. | To address these challenges, a dictionary was created from the Catastici database to facilitate semantic similarity assessments with entries from the name index. | ||
[[File:cata.png|600px|thumb|right|Catastici dataframe]] | |||
From the Catastici JSON file, the property’s function, owner’s name, and coordinates were extracted. A structured dataframe was created by combining these tags. The geometry field was split into two columns for latitude and longitude. Duplicates were removed based on the function and owner’s name, keeping only the first occurrence for each combination. A new column, named "name," was generated by concatenating the function and owner’s name. Subsequently, only entries related to private houses ("casa") were retained for further analysis.
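The processing steps above can be sketched with plain Python dictionaries. The input keys (`function`, `owner_name`, `geometry`) and the assumption that the geometry is an `[x, y]` pair are hypothetical; the real Catastici file may be structured differently, and the project itself worked with a dataframe.

```python
def build_casa_table(features):
    """Build deduplicated rows and keep only private houses ("casa")."""
    seen = set()
    rows = []
    for feat in features:
        function = feat["function"]
        owner = feat["owner_name"]
        key = (function, owner)
        if key in seen:
            continue  # keep only the first occurrence of each combination
        seen.add(key)
        x, y = feat["geometry"]["coordinates"]  # assumed [x, y] order
        rows.append({
            "function": function,
            "owner": owner,
            "lat": y,
            "lon": x,
            "name": f"{function} {owner}",  # concatenated matching string
        })
    # Retain only the entries whose function refers to a private house.
    return [row for row in rows if "casa" in row["function"].lower()]
```

The concatenated `name` is the string that would later be compared against index entries.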
Since the dataset used a different local coordinate system, it was necessary to convert any matched coordinates to the appropriate reference system. | Since the dataset used a different local coordinate system, it was necessary to convert any matched coordinates to the appropriate reference system. | ||
Line 194: | Line 216: | ||
== Finding the relationship== | == Finding the relationship== | ||
=== Basic Relationships ===
In this analysis, a simple method was used to establish relationships by considering places and people mentioned within the same column. This approach assumes that if a place and a person appear together in the same entry, they might be related. However, this assumption has limitations. | In this analysis, a simple method was used to establish relationships by considering places and people mentioned within the same column. This approach assumes that if a place and a person appear together in the same entry, they might be related. However, this assumption has limitations. | ||
Line 203: | Line 225: | ||
In summary, this naive approach relies on the proximity of entries within the index to suggest relationships, but it lacks the ability to verify the true nature of these connections. More advanced methods would be required to establish more accurate relationships. | In summary, this naive approach relies on the proximity of entries within the index to suggest relationships, but it lacks the ability to verify the true nature of these connections. More advanced methods would be required to establish more accurate relationships. | ||
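The naive pairing can be illustrated in a few lines. This is a sketch under the assumption that each index entry is a dictionary with a `name` and a single `column` value; the function name is hypothetical.

```python
from collections import defaultdict


def cooccurrences(people, places):
    """Pair every person with every place cited in the same column."""
    places_by_column = defaultdict(list)
    for place in places:
        places_by_column[place["column"]].append(place["name"])

    pairs = []
    for person in people:
        for place_name in places_by_column.get(person["column"], []):
            pairs.append((person["name"], place_name, person["column"]))
    return pairs
```

Every pair produced this way is only a candidate: sharing a column says nothing about whether the person and the place are actually connected in the text.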
=== Advanced Relationship === | |||
==== Goal of the Analysis ==== | ==== Goal of the Analysis ==== | ||
The goal of the improved relationship analysis was to determine whether a person and a place occurred together more frequently than just a few times in the dataset. This could provide useful insights, especially if these relationships were later weighted for further investigation. To achieve this, the first step was to select relevant text for each entry by evaluating the structure and content in the 'merged_data' dataframe. | The goal of the improved relationship analysis was to determine whether a person and a place occurred together more frequently than just a few times in the dataset. This could provide useful insights, especially if these relationships were later weighted for further investigation. To achieve this, the first step was to select relevant text for each entry by evaluating the structure and content in the 'merged_data' dataframe. | ||
==== Handling Names and Places ==== | ==== Handling Names and Places ==== | ||
One challenge we faced was that the names of people or places were not always directly referenced in the text but often appeared with additional attributes, aliases, or titles. Therefore, we had to extract all the data related to a place and a name and construct two lists: one for the words by which places were likely to be called in the text and another for the words associated with names. This strategy allowed us to leave the original text intact and only evaluate the presence of at least one token from the place list and one from the name list, enabling us to read the output in a way that retained both syntactic and semantic meaning. | One challenge we faced was that the names of people or places were not always directly referenced in the text but often appeared with additional attributes, aliases, or titles. Therefore, we had to extract all the data related to a place and a name and construct two lists: one for the words by which places were likely to be called in the text and another for the words associated with names. This strategy allowed us to leave the original text intact and only evaluate the presence of at least one token from the place list and one from the name list, enabling us to read the output in a way that retained both syntactic and semantic meaning. | ||
{| class="wikitable" style="margin:auto; text-align:center;" | |||
|+ Example | |||
|- | |||
! name in place index !! column !! actual name in the column | |||
|- | |||
| sala del consiglio dei x || 78 || sala | |||
|} | |||
==== Refining the Process ==== | ==== Refining the Process ==== | ||
[[File:tot.png|600px|thumb|right|All the features]] | |||
To ensure accuracy, we first processed the lists by removing short words, stop words, numbers, NaN values, and special characters. Our initial trial checked whether 'name' and 'place' appeared within the same sentence. However, this approach produced only a small number of results (209 out of 3,170), many of them insignificant.
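The cleaning step and the same-sentence check might look roughly like this. The stop-word set, the minimum token length and the function names are assumptions made for the example, not the values used in the project.

```python
import re

# Assumed, deliberately short stop-word list for illustration.
STOP_WORDS = {"di", "del", "dei", "della", "da", "il", "la", "le", "in"}


def clean_tokens(values, min_len=3):
    """Turn raw index fields into a set of lowercase word tokens."""
    tokens = set()
    for value in values:
        if not isinstance(value, str):
            continue  # skips NaN and other missing values
        # Letters only: digits and special characters are dropped.
        for tok in re.findall(r"[^\W\d_]+", value.lower()):
            if len(tok) >= min_len and tok not in STOP_WORDS:
                tokens.add(tok)
    return tokens


def sentence_mentions_both(sentence, name_tokens, place_tokens):
    """True if the sentence contains a name token and a place token."""
    words = set(re.findall(r"[^\W\d_]+", sentence.lower()))
    return bool(words & name_tokens) and bool(words & place_tokens)
```

For the index entry "sala del consiglio dei x", this cleaning keeps "sala" and "consiglio", so a column that only says "sala" can still be matched.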
Line 225: | Line 251: | ||
==== Improving Token Matching ==== | ==== Improving Token Matching ==== | ||
To further refine the process, we introduced a more advanced method: assessing the similarity of tokens using Levenshtein distance. This technique calculates the similarity between two strings, which helped overcome errors introduced by OCR. By applying Levenshtein’s ratio, a similarity value between 0 and 1 was generated for each pair of tokens. If the similarity exceeded a set threshold, the tokens were considered a match. This approach further increased the number of valid results, bringing the total to 1,727 out of 3,170.
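The idea can be sketched with a plain dynamic-programming edit distance. The normalisation used here (one minus the distance divided by the longer length) is one common way to obtain a value between 0 and 1 and may differ from the ratio computed by the library used in the project; the threshold of 0.8 is likewise an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def tokens_match(a, b, threshold=0.8):
    """Treat two tokens as the same word if they are similar enough."""
    return similarity(a.lower(), b.lower()) >= threshold
```

A single OCR slip such as "consigllo" for "consiglio" still passes the threshold, whereas unrelated words do not.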
==== Categorizing Relationships ==== | ==== Categorizing Relationships ==== | ||
Once the matches were identified, the next step was to categorize the relationships between people and places. We used the minimum and maximum sentence boundaries identified earlier to pass these sentences to a large language model (LLM) system for relationship categorization. The LLM categorized the relationships into predefined types, such as a person nominated in that place but not physically present, a person belonging to the place, or a person working or living there. In cases where the model could not find a clear connection, such as in sentences like “John was appointed to the city council,” the system might return an empty string, indicating no clear relationship. For instances where the relationship was unclear but still relevant, such as “Jane recently visited her uncle’s house in Paris,” the model would categorize the relationship as "other," implying that the relationship could not be easily classified into the predefined labels. | Once the matches were identified, the next step was to categorize the relationships between people and places. We used the minimum and maximum sentence boundaries identified earlier to pass these sentences to a large language model (LLM) system for relationship categorization. The LLM categorized the relationships into predefined types, such as a person nominated in that place but not physically present, a person belonging to the place, or a person working or living there. In cases where the model could not find a clear connection, such as in sentences like “John was appointed to the city council,” the system might return an empty string, indicating no clear relationship. For instances where the relationship was unclear but still relevant, such as “Jane recently visited her uncle’s house in Paris,” the model would categorize the relationship as "other," implying that the relationship could not be easily classified into the predefined labels. | ||
{| class="wikitable" style="margin:auto; text-align:left;" | |||
|+ category | |||
|- | |||
! person nominated in that place, but not physically there | |||
|- | |||
! person belongs to there | |||
|- | |||
! person works there | |||
|- | |||
! person lives there | |||
|- | |||
! person meets someone there | |||
|- | |||
! person studies there | |||
|- | |||
! person participates in an event there | |||
|- | |||
! person owns the place | |||
|- | |||
! person is there | |||
|- | |||
! person crosses the place | |||
|- | |||
! a person was seen there | |||
|- | |||
! person visiting there | |||
|- | |||
! people exiled from the place | |||
|- | |||
! people escape from the place | |||
|- | |||
! other | |||
|} | |||
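Because an LLM returns free text, its answer has to be checked against the label list before it is stored. The sketch below shows one way to do that; the labels are those in the table above, while the function name and the rule of mapping any unrecognised answer to "other" are assumptions, not the project's documented behaviour.

```python
CATEGORIES = [
    "person nominated in that place, but not physically there",
    "person belongs to there",
    "person works there",
    "person lives there",
    "person meets someone there",
    "person studies there",
    "person participates in an event there",
    "person owns the place",
    "person is there",
    "person crosses the place",
    "a person was seen there",
    "person visiting there",
    "people exiled from the place",
    "people escape from the place",
    "other",
]


def normalise_category(raw):
    """Map a raw model answer to a known label, "" or "other"."""
    label = (raw or "").strip().lower()
    if not label:
        return ""  # the model found no clear relationship
    for category in CATEGORIES:
        if label == category.lower():
            return category
    return "other"  # assumed fallback for unrecognised answers
```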
==== Outcome and Significance ==== | ==== Outcome and Significance ==== | ||
Line 235: | Line 294: | ||
This method significantly improved the relationship analysis by providing a more structured way to understand and categorize interactions between people and places within the dataset. The categorization made the results more readable and allowed for deeper investigations in subsequent stages. | This method significantly improved the relationship analysis by providing a more structured way to understand and categorize interactions between people and places within the dataset. The categorization made the results more readable and allowed for deeper investigations in subsequent stages. | ||
==Result==
[[File:dat.png|600px|thumb|right|Output dataset ]] | |||
The output dataset includes two main columns, 'Min_Distance_Relationship' and 'Max_Distance_Relationship'. Each holds a string composed of 'Sentence', 'Person', 'Location', 'Category', and 'Description'. The Description field, in particular, elaborates on the relationship between the individual and the place, making it a critical element for understanding the historical context and the nature of these associations.
[[File:grap.png|600px|thumb|right]] | |||
The comparative analysis of "Min Distance" and "Max Distance" contexts reveals notable patterns across various relationship categories. The most prominent category, “Person nominated in that place, but not physically there,” appears frequently in both contexts. In the Min Distance context, this relationship is recorded 796 times, compared to 764 in the Max Distance context, showing a decrease of 32. This trend indicates that indirect associations, where individuals are connected to places without physical presence, are prevalent but slightly less frequent when the spatial context expands. Similarly, the category “Other” remains significant, accounting for general or ambiguous relationships. Its slight decrease from 619 to 611 between the two contexts underscores the stability of undefined spatial associations.
Categories such as “Person works there” and “Person participates in an event there” show notable increases. Specifically, the “Person works there” category rises from 150 occurrences in the Min Distance context to 192 in the Max Distance context, a positive difference of 42. This suggests that work-related spatial relationships are more frequently referenced in broader spatial contexts. Similarly, “Person participates in an event there” increases by 25, from 51 to 76, indicating that participation in events becomes more prominent in larger spatial frames, potentially reflecting the visibility or importance of events beyond local settings.
In contrast, categories emphasizing physical presence and local connections, such as “Person belongs to there” and “Person visiting there,” show declines. “Person belongs to there” decreases by 10, from 50 in the Min Distance context to 40 in the Max Distance context, while “Person visiting there” decreases by 12, from 40 to 28. These shifts suggest that identity-based associations, such as belonging or visiting, hold more contextual significance in localized spatial relationships and become less prominent as the spatial frame widens.
The category “People exiled from the place” also decreases, dropping from 10 occurrences in Min Distance to 5 in Max Distance, a reduction of 5. This suggests that references to exile are more common in localized contexts. Likewise, categories such as “Person owns the place” and “Person is there” show minor decreases, each by 1, reflecting relative stability but reduced prominence over larger distances. Meanwhile, categories like “Person lives there” and “Person studies there” remain unchanged across both contexts, highlighting the consistent documentation of residential and educational relationships regardless of spatial scope.
Interestingly, there is a slight increase in the category “Person meets someone there,” which rises from 2 to 3 in the Max Distance context. Additionally, the category “A person was seen there” appears exclusively in the Max Distance context, suggesting that sightings are occasionally referenced in broader spatial frames.
Overall, the trends reveal that relationships related to work and event participation gain prominence in broader spatial contexts, while categories tied to physical presence, belonging, and exile remain more localized. The consistency observed in ownership and residential relationships underscores their stable relevance regardless of distance. These findings emphasize how spatial framing shapes the documentation and interpretation of human interactions with places, with certain associations gaining broader recognition while others remain firmly tied to localized contexts.
By analyzing both the Description column and the comparative data, the dataset provides a rich resource for exploring how historical relationships between people and places were recorded, offering significant insights into the sociocultural dynamics of the past.
The comparative analysis of “Min Distance” and “Max Distance” contexts not only reveals patterns within individual categories but also highlights broader trends in how spatial relationships are documented across different frames of reference. Incorporating the counts of rows where categories are equal versus not equal, we observe that 929 rows have matching categories (True), while 798 rows differ (False), showcasing a significant number of instances where relationships are reinterpreted or shift in emphasis as the spatial frame expands.
===Matching vs. Non-Matching Categories===
[[File:match.png|600px|thumb|right]]
The near parity between rows with matching and non-matching categories suggests that while many relationships remain consistent between the Min and Max Distance contexts, a substantial portion undergoes reclassification. This underscores the nuanced nature of spatial relationships, where context plays a crucial role in shaping interpretations.
A closer look at the rows with differing categories reveals notable patterns. The most frequent mismatch involves the transition between “person nominated in that place, but not physically there” and “other” in both directions, with 224 cases shifting from “person nominated” to “other” and 216 cases moving in the opposite direction. These shifts suggest ambiguity or broader interpretations when moving between localized and expanded spatial contexts. The “other” category often serves as a catch-all for relationships that are less explicitly defined or harder to categorize.
Another notable difference involves “person nominated in that place, but not physically there” and “person works there,” with 62 cases transitioning from “person nominated” to “person works there” and 42 cases moving in the opposite direction. This highlights how work-related associations might emerge or gain prominence in specific contexts while being classified as nominations in others.
Differences also appear with the category “person participates in an event there.” Specifically, there are 36 cases shifting from “person nominated in that place, but not physically there” to “person participates in an event there,” 25 cases transitioning from “other” to “person participates in an event there,” and 21 cases moving from “person participates in an event there” back to “person nominated in that place.” These differences emphasize how event participation can sometimes be interpreted as part of a nomination or as an independent relational context.
Other mismatches include transitions between “person visiting there” and “person nominated in that place, but not physically there,” with 19 cases reflecting this change. Also, there are 18 cases where “other” transitions to “person works there,” indicating that some work-related relationships may initially be too vague or undefined in localized contexts.
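The match and transition counts discussed above can be reproduced with a simple tally. The following is a minimal sketch, assuming each person-place pair carries one category per context; the function name and the toy rows are hypothetical and are not taken from the project dataset.

```python
from collections import Counter

def compare_categories(rows):
    """rows: iterable of (min_category, max_category), one per person-place pair."""
    rows = list(rows)
    # Rows whose category is identical in both contexts.
    matches = sum(1 for lo, hi in rows if lo == hi)
    # Directed transitions (Min Distance category -> Max Distance category).
    transitions = Counter((lo, hi) for lo, hi in rows if lo != hi)
    return matches, len(rows) - matches, transitions

# Toy rows for illustration only.
sample = [
    ("person nominated", "other"),
    ("person nominated", "other"),
    ("other", "person nominated"),
    ("person works there", "person works there"),
    ("other", "person works there"),
]
matches, mismatches, transitions = compare_categories(sample)
for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")
```

Keeping the transitions directed makes it possible to see asymmetries such as the 62 versus 42 cases between “person nominated” and “person works there.”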
Overall Insights
The discrepancies between categories reflect the fluidity of interpreting historical relationships based on spatial scope. Broader contexts tend to emphasize work, events, and indirect associations, while localized contexts retain a stronger focus on physical presence, belonging, and direct ties to place. This duality is further supported by the stability of certain categories (e.g., residential relationships and ownership) across spatial frames.
The 798 rows with non-matching categories thus provide valuable insight into how spatial and social relationships adapt and are documented differently based on context, enriching our understanding of the interplay between people, places, and the nature of their associations in historical data.
=Conclusions=
===Achieved Results===
Our work led to the reorganization and in-depth analysis of a complex corpus, making the data more accessible and comprehensible for further investigation. The extraction and structuring of information from indices and texts, the identification of relationships between people and places, and the application of advanced techniques such as textual similarity calculations allowed us to create richer and more useful datasets. The comparative analysis of relationships in the “Min Distance” and “Max Distance” contexts has revealed notable insights into the documentation of spatial relationships and their evolution across varying frames of reference. This analysis, combined with the categorization of relationships, provides a valuable lens for understanding how individuals were connected to places, both directly and indirectly, in the historical records.
===Quality Assessment===
The data extraction process, particularly through the use of OCR technology, was a significant starting point. However, as expected with OCR, the accuracy of the results depended heavily on the quality of the original scanned text and the effectiveness of the OCR model. Although an Italian-trained OCR model was used, errors were still present, especially in the case of handwritten or degraded text. The most common errors included the misinterpretation of page numbers as names, misreading of some letters or words, and the inclusion of incorrect dashes ("-") in place of spaces or letters. These errors were manually reviewed and corrected to a degree, but the sheer volume of data made it impractical to address all issues.
One critical aspect of the dataset involved the handling of numeric data. The OCR often misclassified page numbers, years, and quantifications in ways that required manual inspection and filtering. While we were able to transfer page references to the correct “pages” column and eliminate redundant numbers, the process of distinguishing between pagination, quantifications, and historical years was difficult, particularly with two-digit numbers. These errors were mitigated, but at the cost of excluding potentially important information.
The hierarchical relationships in the indexes (such as the relationships between places and their broader geographical categories) were successfully inferred from the indentation structure of the index. However, this inference required a careful manual review to ensure that hierarchical relationships were accurately captured, and some ambiguity remained, particularly with places that could have multiple potential parent locations.
Additionally, the analysis of mismatched categories between “Min Distance” and “Max Distance” relationships highlighted both the consistency and variability of spatial associations. While many categories aligned, a significant portion (798 rows) showed differences, revealing how certain relationships (e.g., “person nominated in that place, but not physically there”) might shift to more ambiguous categories (“other”) or more specific ones (“person works there”) as the spatial context expanded. These differences provided a deeper understanding of how relationships were interpreted in historical records and their dependence on contextual framing.
===Limitations===
Despite our best efforts, several key limitations remained in the dataset. One of the most significant challenges was the manual validation and cleaning of data, especially for the name index. With approximately 80 pages of names, the process was time-consuming and prone to oversight. While the OCR helped automate much of the extraction, some entries required a more nuanced interpretation, which could have resulted in missing or incorrect relationships between people and places.
Another limitation arose from the geolocation process. Although we used multiple sources, including the Nominatim API and the Venetian Church Dictionary, to extract coordinates, this step was not fully successful for all entries. The data available for private houses, especially in historical records like the Catastici/Sommarioni, was incomplete, leading to the absence of geolocations for many private addresses. This challenge was further compounded by the temporal mismatch between Sanudo’s work and the Catastici data, which reflected a different time period and may not have matched Sanudo’s references.
The analysis of mismatched categories, while insightful, also revealed potential limitations in the dataset. The high frequency of transitions between “person nominated in that place, but not physically there” and “other” reflects a degree of ambiguity that could stem from the subjective interpretation of relationships or incomplete textual information. Similarly, work-related and event participation categories demonstrated shifts that were context-dependent, underscoring the need for more sophisticated algorithms or manual interventions to capture the full complexity of these relationships.
The use of string-similarity techniques, such as the Levenshtein distance for token matching, was an essential tool for improving the relationship analysis, but it was not flawless. The computational resources available for processing the full dataset were limited, which constrained our ability to run more advanced matching algorithms at scale. This limitation, coupled with the complexity of the historical language used in Sanudo’s diaries, meant that some relationships were missed or misclassified, especially those involving aliases or titles not captured by the matching algorithm.
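For readers unfamiliar with the Levenshtein distance, the sketch below shows the standard dynamic-programming computation and one way it could tolerate OCR misspellings during token matching. The helper <code>fuzzy_contains</code> and the threshold of one edit are illustrative assumptions, not the project's actual settings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_contains(token, words, max_distance=1):
    """Hypothetical helper: True if any word is within max_distance edits of token."""
    return any(levenshtein(token.lower(), w.lower()) <= max_distance for w in words)
```

A small threshold catches single-letter OCR slips, but it cannot recover aliases or titles that share no spelling with the indexed name, which is the weakness noted above.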
===Further Work===
Given the limitations mentioned, there is considerable room for further work. First, additional refinement of the OCR process could improve data accuracy. Leveraging more advanced OCR models, particularly those based on deep learning, or even training a custom OCR model specifically designed for historical Italian texts, could help reduce the number of errors in data extraction.
In terms of relationship analysis, further research could include the use of natural language processing (NLP) models tailored to historical texts to better capture the nuances of relationships. These models could be trained specifically on Renaissance-era Italian to better understand syntactic and semantic patterns in the text.
Additionally, a deeper exploration of mismatched categories could offer new insights into the contextual shifts in relationships. For instance, analyzing the reasons why specific transitions occur more frequently, such as those involving “person nominated” and “other,” could uncover patterns in how historical documentation categorized indirect or less explicit relationships.
Finally, a broader exploration of other historical records related to Venice during the Renaissance could complement Sanudo’s diaries, allowing for a more comprehensive analysis of the relationships between people, places, and events.
It is important to note that due to time constraints, our work has focused exclusively on volume 5 of Marino Sanudo’s diaries, while there are a total of 58 volumes. Expanding this approach to cover the full corpus would provide a more holistic view of the social and geographical networks of Renaissance Venice.
In conclusion, while the dataset produced in this project offers valuable insights into the social and geographical networks of Renaissance Venice, the limitations outlined highlight the challenges of working with historical texts and the importance of refining methods for future work. The comparative analysis of differing categories and the exploration of contextual shifts in relationships represent a foundational step toward understanding the complexities of human interaction and spatial associations in historical records.
=GitHub repositories=
Link: https://github.com/dhlab-class/fdh-2024-student-projects-marcopolo
=Website=
Link: https://fht-epfl.github.io/marcopolo-sanudo-fdh/
=References=
https://zentry.com/
Sbert: https://sbert.net/
Revision as of 12:23, 18 December 2024
Introductions
The project focused on analyzing the diaries of Marino Sanudo [1], a key historical source for understanding the Renaissance period. The primary goal was to create an index of people and places mentioned in the diaries, pair these entities, and analyze the potential relationships between them.
Historical Context
Who Was Marino Sanudo?
Marino Sanudo (1466–1536) was a Venetian historian, diarist, and politician whose extensive diaries, Diarii, provide a meticulous chronicle of daily life, politics, and events in Renaissance Venice. Sanudo devoted much of his life to recording the intricacies of Venetian society, governance, and international relations, making him one of the most significant chroniclers of his era.
The Importance of His Diaries
Sanudo’s Diarii span nearly four decades, comprising 58 volumes of detailed observations. These writings offer invaluable insights into the political maneuvers of the Venetian Republic, social customs, and the geographical scope of Renaissance trade and diplomacy. His work captures not only significant historical events but also the daily rhythms of Venetian life, painting a vivid picture of one of the most influential states of the time.
Relevance Today
Studying Marino Sanudo’s diaries remains highly relevant for modern historians, linguists, and data analysts. They provide a primary source for understanding Renaissance politics, diplomacy, and social hierarchies. Furthermore, the diaries’ exhaustive detail lends itself to contemporary methods of analysis, such as network mapping and data visualization, enabling new interpretations and uncovering hidden patterns in historical relationships. By examining the interconnectedness of individuals and places, Sanudo’s work sheds light on the broader dynamics of Renaissance Europe, offering lessons that resonate even in today’s globalized world.
Motivation and description of the deliverables
The decision to analyze Marino Sanudo’s diaries stemmed from their exceptional value as a primary source for understanding Renaissance Venice and its influence on European history. Sanudo’s meticulous documentation provides unparalleled insights into the sociopolitical and cultural dynamics of the time, capturing events ranging from significant political maneuvers to the nuances of daily life. This project aimed to leverage this rich historical resource to explore connections between individuals and places, offering a fresh perspective on the networks that shaped Renaissance society.
Our motivation was twofold: to deepen historical understanding and to demonstrate the potential of digital humanities. By employing modern data analysis tools and visualization techniques, we aimed to uncover patterns and relationships that might remain hidden in traditional textual analysis. Sanudo’s diaries, with their wealth of names, places, and detailed events, provided the ideal foundation for such an interdisciplinary approach, bridging the gap between historical research and innovative technology.
The project deliverables reflect this dual objective. First, we developed an indexed dataset of names and places mentioned in Sanudo’s diaries, categorized by the relationships and contexts in which they appeared. This dataset forms the basis for further exploration of Renaissance networks. Second, we analyzed these relationships to identify significant patterns, such as the prominence of certain individuals in specific locations or events, and created visualizations to illustrate these findings. These analyses contribute not only to a deeper understanding of Venetian society but also to broader discussions about the interconnectedness of Renaissance Europe.
To ensure accessibility, we documented our research on a dedicated website and created a Wikipedia page summarizing the project and its findings. These platforms serve as public-facing resources, promoting engagement with the material and showcasing the value of combining historical research with digital tools. This project exemplifies how modern methodologies can enrich our appreciation of the past.
Project Plan and Milestones
The project was organized on a weekly basis to ensure steady progress and a balanced workload. Each phase was carefully planned with clearly defined objectives and milestones, promoting effective collaboration and equitable division of tasks among team members.
The first milestone (13.10) involved deciding on the project's focus. After thorough discussions, we collectively chose to analyze Marino Sanudo’s diaries, given their historical significance and potential for data-driven exploration. This phase established a shared understanding of the project, laid the foundation for subsequent work, and clarified the scope of our research.
The second milestone (14.10) focused on optimizing the extraction of indexes for names and places from the diaries. This required refining our methods for data extraction and ensuring accuracy in capturing and categorizing entities. Alongside this, we worked on identifying the geolocations of the places mentioned in the index, using historical and modern mapping tools to ensure precise identification. This step was critical to linking historical references with real-world locations, providing a solid basis for subsequent analysis.
The final stages of the project (19.12) marked the transition to analyzing relationships between the extracted names and places. Using the indexed data, we explored potential connections, identifying patterns and trends that revealed insights into Renaissance Venice's social, political, and geographic networks. We collaboratively built a Wikipedia page to document our research and created a dedicated website to present our results in an accessible and visually engaging manner. This phase also included preparing for the final presentation, ensuring that every team member contributed to summarizing and showcasing the work.
By adhering to this structured approach and dividing tasks equitably, we achieved a comprehensive analysis of Sanudo’s diaries. Combining historical research with modern digital tools, we uncovered new insights into his world and its relevance today.
Week | Task |
---|---|
07.10 - 13.10 | Define project and structure work |
14.10 - 20.10 | Manually write a place index |
21.10 - 27.10 | Autumn vacation |
28.10 - 03.11 | Work on the name dataset |
04.11 - 10.11 | Finish the geolocation |
11.11 - 17.11 | Midterm presentation on 14.11 |
18.11 - 24.11 | Find naive relationship |
25.11 - 01.12 | Standardization of the text |
02.12 - 08.12 | Find relationship based on the distance |
09.12 - 15.12 | Finish writing the wiki |
16.12 - 22.12 | Deliver GitHub + wiki on 18.12 |
Methodology
Data preparation
In our project, which involved analyzing a specific book, the initial step was to obtain the text version of the book. After exploring several sources, including The Online Books Page [2], we identified three potential websites for downloading the text. Ultimately, we selected the version available on the Internet Archive [3] because it offered a more comprehensive set of tools for our analysis. We downloaded the text from this source and compared it to versions from Google Books and HathiTrust, confirming that it best suited our needs.
We then decided to focus our analysis on the indices included in each volume, which listed the names of people and places alongside the corresponding column numbers.
Place index
Our primary focus was on the places mentioned in Venice. The index of places was significantly shorter, allowing us to analyze it manually. Each entry in the index included headings that often indicated a hierarchical relationship, suggesting that a location belonged to a broader area indicated by the preceding indentation.
Dataset structure
The generated dataset consists of the following features:
id: A unique identifier assigned to each entry.
place: The primary name of the location mentioned in the index.
alias: Any alternative names or variations associated with the place.
volume: The specific volume of the book in which the place is mentioned.
column: The column number within the index where the place is listed.
parents: The broader location or hierarchical category to which the place belongs, derived from the indentation structure in the index.
This structured dataset captures both the explicit details from the index and the inferred hierarchical relationships, making it suitable for further analysis and exploration.
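The rule behind the parents field can be stated compactly: the parent of an entry is the nearest preceding entry with a smaller indentation. The place index itself was compiled manually, so the following is only an illustrative sketch of that rule; the function name and the sample entries are hypothetical.

```python
def parents_from_indentation(lines):
    """Return (place, parent) pairs; parent is None for top-level entries."""
    stack = []   # (indent, name) of the currently open ancestors
    result = []
    for raw in lines:
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        name = raw.strip()
        # Close every entry that is at the same level or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        result.append((name, stack[-1][1] if stack else None))
        stack.append((indent, name))
    return result

# Hypothetical index fragment.
index_lines = [
    "Venezia",
    "  San Marco",
    "    Palazzo Ducale",
    "  Rialto",
    "Padova",
]
pairs = parents_from_indentation(index_lines)
```

A stack-based pass like this assigns exactly one parent per entry, so places that could plausibly belong to several broader areas would still need manual review.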
Name index
The index of names proved to be significantly more challenging to analyze than the index of places, as it spanned approximately 80 pages. To tackle this, we first provided examples of the desired output and then used an Italian-trained OCR model to process the text and generate a preliminary table of names. This approach differed from traditional OCR methods, allowing for a more accurate extraction tailored to our project.
Dataset Structure
The dataset generated for people consists of various features, such as a unique identifier (id) for each entry, the primary name of the person (name), any alternative names or variations (alias), the specific volume where the person is mentioned (volume), the column number within the index where the person is listed (column), the broader family or hierarchical category to which the person belongs (parents), and additional details or notes about the person (description). This structured dataset organizes the data in a way that preserves both explicit information and contextual relationships, enabling in-depth analysis and exploration.
Cleaning and Error Correction
After the automated extraction, the output was manually reviewed and corrected. Given that OCR often makes mistakes when generating text from images, our goal was to improve the dataset's quality without conducting a complete manual overhaul. Instead, we focused on correcting the most frequent and disruptive errors. A common issue encountered was the misinterpretation of page numbers as names, which led to cascading errors that required cleaning. Another frequent error involved the inclusion of dashes ("-") generated by the OCR, which needed removal.
Observations
Several observations emerged during the analysis. The first surname listed under a heading often applied to subsequent names following the same indentation, providing insight into familial groupings. In some cases, ellipses (...) were used in place of names. This practice was historically employed for names that were unknown at the time of writing, with the intention that they could be added later if discovered, or to deliberately anonymize individuals. These findings enriched our understanding of the index, offering valuable context for both its historical and structural significance.
Handling Numeric Characters
During the analysis of the "description" column, numeric characters were frequently observed due to incorrect OCR processing. Many of these numbers corresponded to page references and needed to be transferred to the "pages" column. However, the column also contained numbers unrelated to pagination, such as quantifications (e.g., "he had 3 children") or references to historical years. Historical years were often recognizable by their formatting, typically enclosed in round brackets or preceded by Italian prepositions commonly used to denote time periods, such as "nel," "da," or "di." Additionally, the structure of the dataset limited the maximum number of valid columns for volume 5 to 1074, meaning any larger number was undoubtedly a reference to a historical year.
Extraction Methodology
To address these inconsistencies, a systematic extraction process was designed. All numbers with three or four digits were identified within the "description" column. Numbers exceeding the maximum column count of 1074 were flagged as historical years, as were those preceded by prepositions indicating temporal references. Valid page numbers were transferred to the "pages" column, and once extracted, these numbers were removed from the "description" column to eliminate redundancy. This process relied on a combination of regular expressions and filtering techniques to ensure precision.
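The rules above translate fairly directly into regular expressions. The sketch below is one possible reading of them, not the project's actual code: the function name, the exact patterns, and the punctuation clean-up are assumptions, while the limit of 1074 and the prepositions come from the text.

```python
import re

MAX_COLUMN = 1074  # highest valid column in volume 5, as stated in the text
# Three- or four-digit numbers that are not part of a longer digit run.
NUMBER = re.compile(r"(?<!\d)\d{3,4}(?!\d)")
# A temporal preposition or an opening bracket immediately before the number.
YEAR_PREFIX = re.compile(r"(?:\b(?:nel|da|di)\s+|\()$", re.IGNORECASE)

def split_pages_and_years(description):
    """Return (cleaned_description, pages, years) for one description cell."""
    pages, years, kept, last = [], [], [], 0
    for match in NUMBER.finditer(description):
        value = int(match.group())
        before = description[:match.start()]
        if value > MAX_COLUMN or YEAR_PREFIX.search(before):
            years.append(value)          # years stay in the description
        else:
            pages.append(value)          # page references are moved out
            kept.append(description[last:match.start()])
            last = match.end()
    kept.append(description[last:])
    cleaned = re.sub(r"\s{2,}", " ", "".join(kept)).strip(" ,;")
    return cleaned, pages, years
```

For example, <code>split_pages_and_years("podestà di Padova nel 1503, 412, 987")</code> keeps 1503 in the description as a year and moves 412 and 987 to the page list. Two-digit numbers are deliberately ignored by the pattern, in line with the limitation described in the next section.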
Limitations
Handling two-digit numbers posed significant challenges. Distinguishing between numbers representing pagination and those used for quantifications (e.g., "2 children") required manual inspection, which was impractical at scale. Consequently, two-digit numbers were excluded from the automated process to minimize errors. Although this approach may have resulted in some missing pagination data, it effectively avoided inaccuracies caused by misclassification. This method underscored the need to balance accuracy and practicality in processing complex datasets while acknowledging the inherent limitations of such efforts.
Text Management
After completing the processing of the indexes, the final step in preparing the data was to work on the full text. Although we considered using an alternative OCR system, the OCR provided by Internet Archive proved particularly useful. This system included OCR pages in JSON format, which provided the start and end character for each page. Thanks to this feature, it was possible to split the text into individual pages, a crucial step given the need to organize the data by columns.
Once the text was divided into pages, it became necessary to identify and align the columns. A manual review of the OCR revealed that the text columns of interest only began after page 25. From that point onward, the columns were numbered starting at 5 and 6. This numbering was chosen to create a system consistent with the original text, ensuring the columns were properly aligned to the required format.
Through this methodology, each page was associated with a "start column" and an "end column," ensuring accurate structuring of the data in alignment with the original document format.
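A minimal sketch of the two steps follows. It assumes the character offsets have already been read from the OCR JSON into a list of (start, end) pairs, that every page holds two columns, and that the first page carrying columns 5 and 6 is passed as <code>first_page</code>; the default of 25 and all names here are hypothetical.

```python
def split_into_pages(text, offsets):
    """offsets: list of (start, end) character positions, one pair per page."""
    return [text[start:end] for start, end in offsets]

def columns_for_page(page_number, first_page=25, first_column=5, columns_per_page=2):
    """Return (start_column, end_column), or None for pages before the diary text."""
    if page_number < first_page:
        return None
    start = first_column + (page_number - first_page) * columns_per_page
    return start, start + columns_per_page - 1
```

Under these assumptions the first page of text maps to columns 5 and 6, the next to 7 and 8, and so on, which is what allows an index reference to a column to be resolved to a page of OCR text.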
Finding the Geolocation
Extraction of Coordinates
Pipeline
Load and preprocess data
Nominatim API: Extract coordinates for "famous places."
Venetian Church Dictionary: Match coordinates for churches.
Catastici/Sommarioni: Match coordinates for private houses.
ChatGPT API: Fill in missing data.
Process: Each step enriches the dataset with newly found coordinates. Subsequent steps only process entries lacking geolocation, avoiding overwriting or misclassification of previous entries.
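The "fill only what is still missing" behaviour can be sketched as a loop over an ordered list of sources. The entry layout and the geocoder callables below are hypothetical stand-ins for the real Nominatim, dictionary, Catastici, and ChatGPT steps.

```python
def enrich(entries, geocoders):
    """entries: list of dicts with a 'coords' key (None when unknown).
    geocoders: ordered list of (source_name, function); each function takes an
    entry and returns (lat, lon) or None."""
    for source, geocode in geocoders:
        for entry in entries:
            if entry.get("coords") is not None:
                continue  # never overwrite a result from an earlier step
            coords = geocode(entry)
            if coords is not None:
                entry["coords"] = coords
                entry["source"] = source
    return entries
```

Recording the source next to each coordinate makes it easy to audit later which step produced a given location, which matters because the steps differ in reliability.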
Considerations
Nominatim API: Returns few but highly accurate results.
Venetian Church Dictionary: Highly effective in associating churches, although some errors may arise depending on thresholds.
Catastici/Sommarioni: Currently yields no results, likely due to limited testing capability on the full dataset, volume-specific content variations, and the temporal mismatch between Marino Sanudo's document and the Catastici data.
ChatGPT API: Associates most remaining instances with some errors, though not significantly high.
Nominatim API Trial
Nominatim is a geocoding API that processes either structured or free-form textual descriptions. For this study, the free-form query method was used.
Key Findings:
Using 'name' and 'city' fields yielded accurate but limited results (6/65). Including 'alias' and 'father' fields in the query reduced accuracy significantly. The API works best with simple and precise input, whereas long or complex phrases hinder matching.
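As an illustration of the free-form method, the sketch below only builds the request URL for Nominatim's public search endpoint, combining a name and a city into a single <code>q</code> parameter. The function name and the default city are assumptions; the actual request (which should send an identifying User-Agent, as the Nominatim usage policy requires) is left out.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_query_url(name, city="Venezia", limit=1):
    """Build a free-form search URL from the 'name' and 'city' fields only."""
    params = {"q": f"{name}, {city}", "format": "json", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"

url = build_query_url("Ponte di Rialto")
```

Keeping the query to these two short fields reflects the finding above that long or complex phrases hinder matching.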
Venetian Church Dictionary
This dictionary aids in geolocating churches by matching entries from the dataset with those in the dictionary using Italian and Venetian labels.
Steps:
Filter dataset entries containing "chiesa" (church) or synonyms in the 'name' field.
Use only church entries with geolocation and either 'venetianLabel' or 'italianLabel' not null.
Assess similarity between strings using Sentence-BERT (SBERT)[4].
Advantages:
Restricting to church-related entries minimizes unrelated matches.
SBERT captures semantic similarities, such as "Chiesa di San Marco" and "Basilica di San Marco."
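The matching step can be sketched independently of the embedding model: given one vector per label, pick the dictionary entry with the highest cosine similarity and accept it only above a threshold. In the project the vectors would come from SBERT (for instance via the sentence-transformers library); here they are passed in as plain lists, and the threshold of 0.8 is a hypothetical value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(query_vec, candidates, threshold=0.8):
    """candidates: dict mapping label -> embedding.
    Return (label, score) for the closest label, or None if below threshold."""
    best = None
    for label, vec in candidates.items():
        score = cosine(query_vec, vec)
        if best is None or score > best[1]:
            best = (label, score)
    return best if best is not None and best[1] >= threshold else None
```

The threshold is where the errors mentioned under Considerations can arise: set too low, unrelated churches are linked; set too high, valid variants of the same name are rejected.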
Catastici and Sommarioni
This step aimed to associate private houses ("casa") with coordinates. However, no results were obtained due to several challenges, including hardware limitations that prevented the generation of string encoding for the large dataset within a reasonable timeframe, the possibility that the specific volume analyzed (vol. 5) lacked relevant entries, and a temporal mismatch between the analyzed Marino Sanudo document and the Catastici records.
To address these challenges, a dictionary was created from the Catastici database to facilitate semantic similarity assessments with entries from the name index.
From the Catastici JSON file, the property’s function, owner’s name, and coordinates were extracted. A structured dataframe was created by combining these tags. The geometry field was split into two columns for latitude and longitude. Duplicates were removed based on the function and owner’s name, keeping only the first occurrence for each combination. A new column, named "name," was generated by concatenating the function and owner’s name. Subsequently, only entries related to private houses ("casa") were retained for further analysis.
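The preparation steps can be sketched in plain Python as follows. The record keys ("function", "owner", "geometry") and the assumption that the geometry is a (latitude, longitude) pair are hypothetical; the real Catastici file may name and order these fields differently.

```python
def prepare_catastici(records):
    """Build the lookup table of private houses from raw Catastici records."""
    seen = set()
    rows = []
    for rec in records:
        key = (rec["function"], rec["owner"])
        if key in seen:
            continue  # keep only the first occurrence of each combination
        seen.add(key)
        lat, lon = rec["geometry"]  # split geometry into two columns
        rows.append({
            "name": f"{rec['function']} {rec['owner']}",  # concatenated label
            "function": rec["function"],
            "owner": rec["owner"],
            "lat": lat,
            "lon": lon,
        })
    # Retain only private houses.
    return [row for row in rows if "casa" in row["function"].lower()]
```

The resulting "name" strings are what would then be compared against entries from the name index.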
Since the dataset used a different local coordinate system, it was necessary to convert any matched coordinates to the appropriate reference system.
Despite following this methodology, no valid results were obtained. The primary reasons were hardware constraints, the volume-specific nature of the analyzed data, and the temporal mismatch between the Catastici records and the Marino Sanudo document.
This step underscored the limitations of the Catastici dataset for this project and highlighted the need for improved computational resources and greater dataset compatibility to achieve better outcomes in future research.
ChatGPT API
For the remaining entries, ChatGPT was combined with geocoding APIs like OpenCage or Nominatim to infer coordinates.
Advantages:
Combines ChatGPT's descriptive power with the precision of geocoding APIs.
Limitations:
Requires external geocoding APIs, as ChatGPT alone cannot retrieve real-time coordinates.
Finding the relationship
Basic Relationships
In this analysis, a simple method was used to establish relationships by considering places and people mentioned within the same column. This approach assumes that if a place and a person appear together in the same entry, they might be related. However, this assumption has limitations.
Co-occurrence in the Same Column: Places and people mentioned in the same column were linked, based on the assumption that their proximity suggests some form of relationship. However, we cannot be certain if this proximity indicates an actual relationship or if they are simply listed together without a true connection.
Limitations:
Temporal Gaps: A shared column does not show whether the person and the place belong to the same passage; either one may instead relate to people or places discussed on the preceding or following pages.
Context: Without additional context or a deeper understanding of the index structure, we cannot definitively determine whether the place and person are directly related, or if they are merely listed next to each other by chance.
In summary, this naive approach relies on the proximity of entries within the index to suggest relationships, but it lacks the ability to verify the true nature of these connections. More advanced methods would be required to establish more accurate relationships.
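As a rough illustration of the naive approach, the sketch below links every person and place indexed under the same column number. The dictionary layout (entry name mapped to a list of column numbers) and the sample entries are assumptions made for the example, not the project's actual data structures.

```python
from collections import defaultdict

def cooccurrence_pairs(person_index, place_index):
    """Link every person and place that are indexed under the same column."""
    people_by_column = defaultdict(set)
    for person, columns in person_index.items():
        for column in columns:
            people_by_column[column].add(person)

    pairs = set()
    for place, columns in place_index.items():
        for column in columns:
            for person in people_by_column.get(column, ()):
                pairs.add((person, place, column))
    return sorted(pairs)

# Invented index entries, for illustration only.
person_index = {"Contarini Zuane": [78, 102], "Zen Piero": [80]}
place_index = {"sala del consiglio dei x": [78], "Padova": [80, 102]}
pairs = cooccurrence_pairs(person_index, place_index)
```

Every pair produced this way is only a candidate: nothing in the output says whether the two entries are actually connected in the text.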
Advanced Relationship
Goal of the Analysis
The goal of the improved relationship analysis was to determine whether a person and a place occurred together more frequently than just a few times in the dataset. This could provide useful insights, especially if these relationships were later weighted for further investigation. To achieve this, the first step was to select relevant text for each entry by evaluating the structure and content in the 'merged_data' dataframe.
Handling Names and Places
One challenge we faced was that the names of people or places were not always directly referenced in the text but often appeared with additional attributes, aliases, or titles. Therefore, we had to extract all the data related to a place and a name and construct two lists: one for the words by which places were likely to be called in the text and another for the words associated with names. This strategy allowed us to leave the original text intact and only evaluate the presence of at least one token from the place list and one from the name list, enabling us to read the output in a way that retained both syntactic and semantic meaning.
name in place index | column | actual name in the column |
---|---|---|
sala del consiglio dei x | 78 | sala |
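The two-list idea can be sketched as follows: one token set is built from everything known about a place, another from everything known about a person, and a text is kept when it contains at least one token from each. The helper names and the sample sentences are hypothetical; the original text is never modified.

```python
import re

def tokens(*fields):
    """Collect lowercase word tokens from every non-empty index field."""
    found = set()
    for field in fields:
        if field:
            found.update(re.findall(r"\w+", field.lower()))
    return found

def mentions_both(text, place_tokens, name_tokens):
    """True when the text contains at least one place token and one name token."""
    text_tokens = set(re.findall(r"\w+", text.lower()))
    return bool(text_tokens & place_tokens) and bool(text_tokens & name_tokens)

# Invented example: a place entry without alias, a person entry with a description.
place_tokens = tokens("sala del consiglio dei x", None)
name_tokens = tokens("Contarini Zuane", "procurator")

hit = mentions_both("Fo in sala sier Zuane Contarini", place_tokens, name_tokens)
miss = mentions_both("Vene in collegio l'orator di Franza", place_tokens, name_tokens)
```

In this example the place is found through the single word "sala", mirroring the table row above. Very common tokens such as "del" or "dei" would produce spurious matches, which is why the lists need cleaning before use.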
Refining the Process
To ensure accuracy, we first cleaned the lists by removing short words, stop words, numbers, NaN values, and special characters. Our initial trial checked whether 'name' and 'place' appeared within the same sentence. However, this approach yielded only a small number of results (209 out of 3,170), many of them insignificant.
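A minimal sketch of the cleaning step and the same-sentence check is given below. The stop-word list, the minimum token length of 3, and the sentence-splitting rule are all assumptions chosen for illustration; the project's actual settings are not documented here.

```python
import math
import re

# Illustrative stop words only, not the project's list.
STOP_WORDS = {"del", "dei", "di", "de", "la", "il", "et", "in"}

def clean_tokens(values, min_length=3):
    """Drop NaN values, numbers, special characters, short words and stop words."""
    cleaned = set()
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        # Letters only: digits, punctuation and underscores are discarded.
        for token in re.findall(r"[^\W\d_]+", str(value).lower()):
            if len(token) >= min_length and token not in STOP_WORDS:
                cleaned.add(token)
    return cleaned

def sentences_with_both(text, place_tokens, name_tokens):
    """Return the sentences that contain a place token and a name token."""
    hits = []
    for sentence in re.split(r"(?<=[.;!?])\s+", text):
        words = set(re.findall(r"[^\W\d_]+", sentence.lower()))
        if words & place_tokens and words & name_tokens:
            hits.append(sentence)
    return hits

place_tokens = clean_tokens(["sala del consiglio dei x", float("nan"), "78"])
name_tokens = clean_tokens(["Contarini Zuane"])
# Invented text, for illustration only.
found = sentences_with_both(
    "Fo in sala sier Zuane Contarini. Poi vene l'orator di Franza.",
    place_tokens,
    name_tokens,
)
```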
The second trial expanded the search by looking for 'place' combined with 'alias_place' and 'name' with 'description'. This broader search significantly increased the number of results, reaching more than 1,000.
Improving Token Matching
To further refine the process, we introduced a more advanced method: assessing the similarity of tokens using Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Because OCR errors usually alter only a few characters, this helped recover tokens that exact matching missed. Levenshtein’s ratio gave a similarity value between 0 and 1 for each pair of tokens; if the similarity exceeded a set threshold, the tokens were considered a match. This approach further increased the number of valid results, bringing the total to 1,727 out of 3,170.
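The matching rule can be illustrated with a small pure-Python sketch. Two points are assumptions: the similarity is normalized here as 1 minus the distance divided by the longer length, which may differ from the ratio computed by the library actually used, and the threshold of 0.8 is a placeholder, since the project's value is not stated.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def fuzzy_match(token, candidates, threshold=0.8):
    """True when the token is close enough to at least one candidate."""
    return any(similarity(token, candidate) >= threshold for candidate in candidates)

# "consiglo" stands in for an OCR misreading of "consiglio".
ocr_hit = fuzzy_match("consiglo", {"consiglio"})
ocr_miss = fuzzy_match("padova", {"consiglio"})
```

The threshold trades recall for precision: a lower value recovers more OCR-damaged tokens but also lets short, unrelated words match each other.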
Categorizing Relationships
Once the matches were identified, the next step was to categorize the relationships between people and places. We used the minimum and maximum sentence boundaries identified earlier to pass these sentences to a large language model (LLM) system for relationship categorization. The LLM categorized the relationships into predefined types, such as a person nominated in that place but not physically present, a person belonging to the place, or a person working or living there. In cases where the model could not find a clear connection, such as in sentences like “John was appointed to the city council,” the system might return an empty string, indicating no clear relationship. For instances where the relationship was unclear but still relevant, such as “Jane recently visited her uncle’s house in Paris,” the model would categorize the relationship as "other," implying that the relationship could not be easily classified into the predefined labels.
relationship category |
---|
person nominated in that place, but not physically there |
person belongs to there |
person works there |
person lives there |
person meets someone there |
person studies there |
person participates in an event there |
person owns the place |
person is there |
person crosses the place |
a person was seen there |
person visiting there |
people exiled from the place |
people escape from the place |
other |
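One way to keep the model's answers within the predefined labels is sketched below. The prompt wording, the helper names, and the `ask_llm` callable are hypothetical; no particular LLM API is assumed, and a stand-in function replaces the model so the example runs offline.

```python
CATEGORIES = [
    "person nominated in that place, but not physically there",
    "person belongs to there",
    "person works there",
    "person lives there",
    "person meets someone there",
    "person studies there",
    "person participates in an event there",
    "person owns the place",
    "person is there",
    "person crosses the place",
    "a person was seen there",
    "person visiting there",
    "people exiled from the place",
    "people escape from the place",
    "other",
]

def build_prompt(sentence, person, place):
    """Assemble a classification prompt (wording is illustrative)."""
    options = "\n".join(f"- {label}" for label in CATEGORIES)
    return (
        "Classify the relationship between the person and the place.\n"
        f"Sentence: {sentence}\nPerson: {person}\nPlace: {place}\n"
        "Answer with exactly one of these labels, or with an empty string "
        "if there is no clear relationship:\n"
        f"{options}"
    )

def normalize_label(raw_answer):
    """Map the raw answer to a known label; anything else becomes ''."""
    answer = raw_answer.strip().strip("\"'.").lower()
    return answer if answer in CATEGORIES else ""

def categorize(sentence, person, place, ask_llm):
    """ask_llm is any callable that takes a prompt and returns text."""
    return normalize_label(ask_llm(build_prompt(sentence, person, place)))

def fake_llm(prompt):
    # Stand-in for a real model call, used only to make the sketch runnable.
    return ' "Person works there." '

label = categorize("Fo in sala sier Zuane Contarini", "Zuane Contarini", "sala", fake_llm)
```

Validating the answer against the closed list means an unexpected reply is recorded as an empty string rather than as a new, unplanned category.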
Outcome and Significance
This method significantly improved the relationship analysis by providing a more structured way to understand and categorize interactions between people and places within the dataset. The categorization made the results more readable and allowed for deeper investigations in subsequent stages.
Result
The dataset output includes the columns 'Min_Distance_Relationship' and 'Max_Distance_Relationship'. Each of these two columns holds a string composed of 'Sentence', 'Person', 'Location', 'Category', and 'Description'. The Description field, in particular, provides valuable insights by elaborating on the relationships between individuals and places, making it a critical element for understanding historical context and the nature of these associations.
The comparative analysis of "Min Distance" and "Max Distance" contexts reveals notable patterns across various relationship categories. The most prominent category, “Person nominated in that place, but not physically there,” appears frequently in both contexts. In the Min Distance context, this relationship is recorded 796 times, compared to 764 in the Max Distance context, showing a decrease of 32. This trend indicates that indirect associations, where individuals are connected to places without physical presence, are prevalent but slightly less frequent when the spatial context expands. Similarly, the category “Other” remains significant, accounting for general or ambiguous relationships. Its slight decrease from 619 to 611 between the two contexts underscores the stability of undefined spatial associations.
Categories such as “Person works there” and “Person participates in an event there” show notable increases. Specifically, the “Person works there” category rises from 150 occurrences in the Min Distance context to 192 in the Max Distance context, a positive difference of 42. This suggests that work-related spatial relationships are more frequently referenced in broader spatial contexts. Similarly, “Person participates in an event there” increases by 25, from 51 to 76, indicating that participation in events becomes more prominent in larger spatial frames, potentially reflecting the visibility or importance of events beyond local settings.
In contrast, categories emphasizing physical presence and local connections, such as “Person belongs to there” and “Person visiting there,” show declines. “Person belongs to there” decreases by 10, from 50 in the Min Distance context to 40 in the Max Distance context, while “Person visiting there” decreases by 12, from 40 to 28. These shifts suggest that identity-based associations, such as belonging or visiting, hold more contextual significance in localized spatial relationships and become less prominent as the spatial frame widens.
The category “People exiled from the place” also decreases, dropping from 10 occurrences in Min Distance to 5 in Max Distance, a reduction of 5. This suggests that references to exile are more common in localized contexts. Likewise, categories such as “Person owns the place” and “Person is there” show minor decreases, each by 1, reflecting relative stability but reduced prominence over larger distances. Meanwhile, categories like “Person lives there” and “Person studies there” remain unchanged across both contexts, highlighting the consistent documentation of residential and educational relationships regardless of spatial scope.
Interestingly, there is a slight increase in the category “Person meets someone there,” which rises from 2 to 3 in the Max Distance context. Additionally, the category “A person was seen there” appears exclusively in the Max Distance context, suggesting that sightings are occasionally referenced in broader spatial frames.
Overall, the trends reveal that relationships related to work and event participation gain prominence in broader spatial contexts, while categories tied to physical presence, belonging, and exile remain more localized. The consistency observed in ownership and residential relationships underscores their stable relevance regardless of distance. These findings emphasize how spatial framing shapes the documentation and interpretation of human interactions with places, with certain associations gaining broader recognition while others remain firmly tied to localized contexts.
By analyzing both the Description column and the comparative data, the dataset provides a rich resource for exploring how historical relationships between people and places were recorded, offering significant insights into the sociocultural dynamics of the past.
The comparative analysis of “Min Distance” and “Max Distance” contexts not only reveals patterns within individual categories but also highlights broader trends in how spatial relationships are documented across different frames of reference. Incorporating the counts of rows where categories are equal versus not equal, we observe that 929 rows have matching categories (True), while 798 rows differ (False), showcasing a significant number of instances where relationships are reinterpreted or shift in emphasis as the spatial frame expands.
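The per-category differences and the matching versus non-matching counts can be derived from pairs of labels as in the sketch below. The toy rows are invented; only the totals 929 and 798 at the end come from the analysis above.

```python
from collections import Counter

def compare_contexts(rows):
    """rows: list of (min_distance_category, max_distance_category) pairs."""
    min_counts = Counter(min_cat for min_cat, _ in rows)
    max_counts = Counter(max_cat for _, max_cat in rows)
    categories = sorted(set(min_counts) | set(max_counts))
    # Positive values mean the category is more frequent in the Max Distance context.
    differences = {c: max_counts[c] - min_counts[c] for c in categories}
    matching = sum(1 for min_cat, max_cat in rows if min_cat == max_cat)
    return differences, matching, len(rows) - matching

# Invented toy rows, for illustration only.
toy_rows = [
    ("other", "other"),
    ("person works there", "person works there"),
    ("other", "person works there"),
    ("person visiting there", "other"),
]
differences, matching, differing = compare_contexts(toy_rows)

# Consistency check on the reported figures: 929 matching + 798 differing rows.
reported_total = 929 + 798
matching_share = 929 / reported_total
```

The reported counts add up to 1,727 rows, so roughly 54% of the rows keep the same category in both contexts.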
Matching vs. Non-Matching Categories
The near parity between rows with matching and non-matching categories suggests that while many relationships remain consistent between the Min and Max Distance contexts, a substantial portion undergoes reclassification. This underscores the nuanced nature of spatial relationships, where context plays a crucial role in shaping interpretations.
A closer look at the rows with differing categories reveals notable patterns. The most frequent mismatch involves the transition between “person nominated in that place, but not physically there” and “other” in both directions, with 224 cases shifting from “person nominated” to “other” and 216 cases moving in the opposite direction. These shifts suggest ambiguity or broader interpretations when moving between localized and expanded spatial contexts. The “other” category often serves as a catch-all for relationships that are less explicitly defined or harder to categorize.
Another notable difference involves “person nominated in that place, but not physically there” and “person works there,” with 62 cases transitioning from “person nominated” to “person works there” and 42 cases moving in the opposite direction. This highlights how work-related associations might emerge or gain prominence in specific contexts while being classified as nominations in others.
Differences also appear with the category “person participates in an event there.” Specifically, there are 36 cases shifting from “person nominated in that place, but not physically there” to “person participates in an event there,” 25 cases transitioning from “other” to “person participates in an event there,” and 21 cases moving from “person participates in an event there” back to “person nominated in that place.” These differences emphasize how event participation can sometimes be interpreted as part of a nomination or as an independent relational context.
Other mismatches include transitions between “person visiting there” and “person nominated in that place, but not physically there,” with 19 cases reflecting this change. Also, there are 18 cases where “other” transitions to “person works there,” indicating that some work-related relationships may initially be too vague or undefined in localized contexts.
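Transition counts such as those discussed above can be obtained by counting ordered pairs of differing labels. The sketch below uses invented toy rows and hypothetical helper names; it only shows the counting technique.

```python
from collections import Counter

def transition_counts(rows):
    """Count (min_category, max_category) pairs, ignoring rows that agree."""
    return Counter((min_cat, max_cat) for min_cat, max_cat in rows if min_cat != max_cat)

def two_way(transitions, first, second):
    """Counts for first -> second and second -> first, in that order."""
    return transitions[(first, second)], transitions[(second, first)]

NOMINATED = "person nominated in that place, but not physically there"

# Invented toy rows, for illustration only.
toy_rows = [
    (NOMINATED, "other"),
    (NOMINATED, "other"),
    ("other", NOMINATED),
    (NOMINATED, "person works there"),
    ("other", "other"),
]
transitions = transition_counts(toy_rows)
forward, backward = two_way(transitions, NOMINATED, "other")
most_frequent = transitions.most_common(1)[0]
```

Looking at both directions of a pair shows whether a shift is systematic or, as with nearly balanced counts, more likely a sign of ambiguity between the two labels.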
Overall Insights
The discrepancies between categories reflect the fluidity of interpreting historical relationships based on spatial scope. Broader contexts tend to emphasize work, events, and indirect associations, while localized contexts retain a stronger focus on physical presence, belonging, and direct ties to place. This duality is further supported by the stability of certain categories (e.g., residential relationships and ownership) across spatial frames.
The 798 rows with non-matching categories thus provide valuable insight into how spatial and social relationships adapt and are documented differently based on context, enriching our understanding of the interplay between people, places, and the nature of their associations in historical data.
Conclusions
Achieved Results
Our work led to the reorganization and in-depth analysis of a complex corpus, making the data more accessible and comprehensible for further investigation. The extraction and structuring of information from indices and texts, the identification of relationships between people and places, and the application of advanced techniques such as textual similarity calculations allowed us to create richer and more useful datasets. The comparative analysis of relationships in the “Min Distance” and “Max Distance” contexts has revealed notable insights into the documentation of spatial relationships and their evolution across varying frames of reference. This analysis, combined with the categorization of relationships, provides a valuable lens for understanding how individuals were connected to places, both directly and indirectly, in the historical records.
Quality Assessment
The data extraction process, particularly through the use of OCR technology, was a significant starting point. However, as expected with OCR, the accuracy of the results depended heavily on the quality of the original scanned text and the effectiveness of the OCR model. Although an Italian-trained OCR model was used, errors were still present, especially in the case of handwritten or degraded text. The most common errors included the misinterpretation of page numbers as names, the misreading of some letters or words, and the insertion of spurious dashes ("-") in place of spaces or letters. These errors were manually reviewed and partly corrected, but the sheer volume of data made it impractical to address all issues.
One critical aspect of the dataset involved the handling of numeric data. The OCR often misclassified page numbers, years, and quantifications in ways that required manual inspection and filtering. While we were able to transfer page references to the correct “pages” column and eliminate redundant numbers, the process of distinguishing between pagination, quantifications, and historical years was difficult, particularly with two-digit numbers. These errors were mitigated, but at the cost of excluding potentially important information.
The hierarchical relationships in the indexes (such as the relationships between places and their broader geographical categories) were successfully inferred from the indentation structure of the index. However, this inference required a careful manual review to ensure that hierarchical relationships were accurately captured, and some ambiguity remained, particularly with places that could have multiple potential parent locations.
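Inferring parents from indentation can be done with a stack, as in the following sketch. The two-space indent width and the sample index lines are assumptions for illustration; real OCR output would need the indentation to be normalized first.

```python
def parse_hierarchy(lines, indent_width=2):
    """Map each index entry to its parent entry, based on leading spaces."""
    stack = []    # (depth, name) of the currently open ancestors
    parents = {}
    for line in lines:
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent_width
        name = line.strip()
        # Close every entry that is at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parents[name] = stack[-1][1] if stack else None
        stack.append((depth, name))
    return parents

# Invented index fragment, for illustration only.
index_lines = [
    "Venezia",
    "  San Marco",
    "    sala del consiglio dei x",
    "  Rialto",
    "Padova",
]
parents = parse_hierarchy(index_lines)
```

Because the result is keyed by entry name, an entry that appears under several parents keeps only the last one, which is one place where the ambiguity mentioned above would surface.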
Additionally, the analysis of mismatched categories between “Min Distance” and “Max Distance” relationships highlighted both the consistency and variability of spatial associations. While many categories aligned, a significant portion (798 rows) showed differences, revealing how certain relationships (e.g., “person nominated in that place, but not physically there”) might shift to more ambiguous categories (“other”) or more specific ones (“person works there”) as the spatial context expanded. These differences provided a deeper understanding of how relationships were interpreted in historical records and their dependence on contextual framing.
Limitations
Despite our best efforts, several key limitations remained in the dataset. One of the most significant challenges was the manual validation and cleaning of data, especially for the name index. With approximately 80 pages of names, the process was time-consuming and prone to oversight. While the OCR helped automate much of the extraction, some entries required a more nuanced interpretation, which could have resulted in missing or incorrect relationships between people and places.
Another limitation arose from the geolocation process. Although we used multiple sources, including the Nominatim API and the Venetian Church Dictionary, to extract coordinates, this step was not fully successful for all entries. The data available for private houses, especially in historical records like the Catastici/Sommarioni, was incomplete, leading to the absence of geolocations for many private addresses. This challenge was further compounded by the temporal mismatch between Sanudo’s work and the Catastici data, which reflected a different time period and may not have matched Sanudo’s references.
The analysis of mismatched categories, while insightful, also revealed potential limitations in the dataset. The high frequency of transitions between “person nominated in that place, but not physically there” and “other” reflects a degree of ambiguity that could stem from the subjective interpretation of relationships or incomplete textual information. Similarly, work-related and event participation categories demonstrated shifts that were context-dependent, underscoring the need for more sophisticated algorithms or manual interventions to capture the full complexity of these relationships.
The use of string-similarity techniques, such as the Levenshtein distance for token matching, was an essential tool for improving the relationship analysis, but it was not flawless. The computational resources available for processing the full dataset were limited, which constrained our ability to run more advanced matching algorithms at scale. This limitation, coupled with the complexity of the historical language used in Sanudo’s diaries, meant that some relationships were missed or misclassified, especially those involving aliases or titles not captured by the matching algorithm.
Further Work
Given the limitations mentioned, there is considerable room for further work. First, additional refinement of the OCR process could improve data accuracy. Leveraging more advanced OCR models, particularly those based on deep learning, or even training a custom OCR model specifically designed for historical Italian texts, could help reduce the number of errors in data extraction.
In terms of relationship analysis, further research could include the use of natural language processing (NLP) models tailored to historical texts to better capture the nuances of relationships. These models could be trained specifically on Renaissance-era Italian to better understand syntactic and semantic patterns in the text.
Additionally, a deeper exploration of mismatched categories could offer new insights into the contextual shifts in relationships. For instance, analyzing the reasons why specific transitions occur more frequently—such as those involving “person nominated” and “other”—could uncover patterns in how historical documentation categorized indirect or less explicit relationships.
Finally, a broader exploration of other historical records related to Venice during the Renaissance could complement Sanudo’s diaries, allowing for a more comprehensive analysis of the relationships between people, places, and events.
It is important to note that due to time constraints, our work has focused exclusively on volume 5 of Marino Sanudo’s diaries, while there are a total of 58 volumes. Expanding this approach to cover the full corpus would provide a more holistic view of the social and geographical networks of Renaissance Venice.
In conclusion, while the dataset produced in this project offers valuable insights into the social and geographical networks of Renaissance Venice, the limitations outlined highlight the challenges of working with historical texts and the importance of refining methods for future work. The comparative analysis of differing categories and the exploration of contextual shifts in relationships represent a foundational step toward understanding the complexities of human interaction and spatial associations in historical records.
GitHub repositories
Link: https://github.com/dhlab-class/fdh-2024-student-projects-marcopolo
Website
Link: https://fht-epfl.github.io/marcopolo-sanudo-fdh/
References
Sanudo Diarii Volume 5: https://books.google.ch/books?id=cm6srb292ToC&redir_esc=y
Hero video 1: http://www.unabibliotecaunlibro.it/video?ID=227&PID=64
Hero video 2: https://www.youtube.com/watch?v=JphHw6iU4m8
Website design: https://www.youtube.com/watch?v=b7a_Y1Ja6js
Sbert: https://sbert.net/