Sanudo's Diary

From FDHwiki
Revision as of 13:56, 5 December 2024


Our Github Page | Link to Google Doc | Diaries Frontend



Introduction

The main work of this project is to convert the geoname entities in the index of Sanudo's diary into digital form.

The ultimate goal is to present the named entities in the diary interactively, e.g. on a map-based website, by georeferencing the relevant named entities to places in Venice. The aim is to provide an immersive experience that lets users explore the diary in a spatially interactive format, deepening their engagement with the historical narrative.



About Sanudo


Marin Sanudo
Sanudo's Background Information
Status | Venetian historian, author and diarist. Aristocrat.
Occupation | Historian
Life | 22 May 1466 - † 4 April 1536


Sanudo's Diary

The Diaries of Marin Sanudo represent one of the most comprehensive daily records of events ever compiled by a single individual in early modern Europe. They offer insights into various aspects of Venetian life, from "diplomacy to public spectacles, politics to institutional practices, state councils to public opinion, mainland territories to overseas possessions, law enforcement to warfare, the city's landscape to the lives of its inhabitants, and from religious life to fashion, prices, weather, and entertainment".[2]

English selection & comment of Diary



Transforming Sanudo’s index (1496 - 1533)

Diary Background Information
Diary Duration | 1496-1533 (37 years)
Quantity | 58 volumes, around 40,000 pages
Content Style | Deals with any matter regardless of its ‘importance’



"...the continuity of events and institutions collapses into the quotidian. ...Unreflecting, pedantic, and insatiable, he aimed "to seek out every occurrence, no matter how slight," for he believed that the truth of events could only be grasped through an abundance of facts. He gathered those facts in the chancellery of the Ducal Palace and in the streets of the city, transcribing official legislation and ambassadorial dispatches, reporting popular opinion and Rialto gossip"[1]

   BirdsEyeView1528.jpg


Methodology

Generating Place Entities

Our first task was to extract the relevant place entities from the OCR-generated text file of Sanudo's index. This was essential because our eventual goal was to visualize name and place entities (i.e. important places and names documented by Sanudo) from the index on a map of Venice. For this, we needed to keep only the place entities located in Venice (discarding those outside it) and link them to coordinates to display on a map. Furthermore, each place corresponds to one or more "indices", i.e. the columns in which the place name appears in Sanudo's diaries.

1. Pre-processing OCR generated text file
The text file was sourced from the Internet Archive. We processed it to split the content into columns by identifying a specific text pattern. In summary, the columns are delineated by 2x3 line breaks and two page numbers, which served as key markers for the division.
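
As a rough illustration of this kind of pattern-based splitting, the sketch below cuts a text at every point where a run of three or more line breaks, a line holding two numbers, and another run of three or more line breaks occur in sequence. This reading of the "2x3 line breaks and two page numbers" marker is an assumption, and the regular expression and the function name `split_columns` are hypothetical, not the project's actual code.

```python
import re

# Assumed separator (hypothetical): 3+ line breaks, a line with two page
# numbers, then 3+ line breaks again. The real OCR layout may differ.
COLUMN_BREAK = re.compile(r"\n{3,}[ \t]*\d+[ \t]+\d+[ \t]*\n{3,}")


def split_columns(text: str) -> list[str]:
    """Split OCR text into column chunks at every assumed separator."""
    parts = COLUMN_BREAK.split(text)
    # Drop empty chunks and surrounding whitespace left over by the split.
    return [part.strip() for part in parts if part.strip()]


sample = "first column\n\n\n495 496\n\n\nsecond column\n\n\n497 498\n\n\nthird column"
columns = split_columns(sample)
```

On real OCR output the separator would likely need tuning, since page numbers are often misread as letters or dropped entirely.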

The place entities exist in the diary in the form of an indexed list at the end of certain columns. We extracted a list of place entities by

2. Place name verification pipeline
We identified three APIs that we could use to determine whether an entity exists in Venice: Nominatim (OpenStreetMap), WikiData, and Geonames. We created a pipeline that extracts place entities from Sanudo's index and queries the three APIs to check whether each entity exists in Venice. If any of the three APIs found a match for the place in Venice, we stored the id, alternative names, coordinates and place indices. Below is an example of the resulting JSON for "Marano":

    {
        "id": 21,
        "place_name": "Marano",
        "place_alternative_name": [
            "FrìuliX"
        ],
        "place_index": [
            "496",
            "497",
            "546",
            "550",
            "554",
            "556",
            "557",
            "584"
        ],
        "nominatim_coords": [
            "45.46311235",
            "12.120517939849389"
        ],
        "geodata_coords": null,
        "wikidata_coords": null,
        "nominatim_match": true,
        "geodata_match": false,
        "wikidata_match": false,
        "latitude": "45.46311235",
        "longitude": "12.120517939849389",
        "agreement_count": 1
    }
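
The match flags and `agreement_count` in a record like this could be derived from the three API results roughly as follows. This is a minimal sketch under stated assumptions: `merge_api_results` is a hypothetical helper, `agreement_count` is assumed to count how many APIs returned a match, and the final `latitude`/`longitude` are assumed to come from the first API that matched (Nominatim, then Geodata, then WikiData). The project's actual rules may differ.

```python
from typing import Optional

Coords = Optional[tuple[str, str]]  # (latitude, longitude), or None if no match


def merge_api_results(nominatim: Coords, geodata: Coords, wikidata: Coords) -> dict:
    """Combine per-API lookups into the match/coordinate fields of a record."""
    results = {"nominatim": nominatim, "geodata": geodata, "wikidata": wikidata}
    record: dict = {}
    for source, coords in results.items():
        record[f"{source}_coords"] = list(coords) if coords else None
        record[f"{source}_match"] = coords is not None
    # Assumption: use the coordinates of the first API that found a match.
    chosen = next((c for c in results.values() if c is not None), None)
    record["latitude"] = chosen[0] if chosen else None
    record["longitude"] = chosen[1] if chosen else None
    # Assumption: agreement_count is the number of APIs that matched.
    record["agreement_count"] = sum(c is not None for c in results.values())
    return record


marano = merge_api_results(("45.46311235", "12.120517939849389"), None, None)
```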


We analyzed which of the three APIs could successfully determine whether the place entities extracted from Sanudo's diary were in Venice. One challenge was disambiguating place names: many names may have changed over time or may refer to multiple locations. To resolve this, we manually checked for ambiguities and then refined the place extraction pipeline to eliminate conflicts in the results. For instance, several places listed "veneziano" as an alternative name, which had to be removed from the search query sent to the API. Furthermore, the Geodata API mistakenly included some locations in Greece in the final list of places in "Venice". We added special cases to the place_extraction pipeline to cover these errors.
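
The two fixes described above could look something like the sketch below. It is illustrative only: the helper names, the set of generic terms, and the bounding-box values are all assumptions. In particular, the bounding-box check is a swapped-in technique for rejecting far-away matches such as those in Greece; the project itself describes hand-written special cases.

```python
# Hypothetical values: terms too generic to send to an API, and a rough
# bounding box around the Venice area (min_lat, max_lat, min_lon, max_lon).
GENERIC_TERMS = {"veneziano"}
VENICE_BBOX = (45.2, 45.7, 12.0, 12.7)


def clean_alternative_names(names: list[str]) -> list[str]:
    """Drop alternative names that are too generic to use in a search query."""
    return [n for n in names if n.strip().lower() not in GENERIC_TERMS]


def in_venice_area(lat, lon, bbox=VENICE_BBOX) -> bool:
    """Return True if the coordinates fall inside the assumed bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= float(lat) <= max_lat and min_lon <= float(lon) <= max_lon


names = clean_alternative_names(["veneziano", "FrìuliX"])
marano_ok = in_venice_area("45.46311235", "12.120517939849389")
athens_ok = in_venice_area(37.98, 23.73)  # a location in Greece
```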




Database Entity

We created a database with the following schema to define a place entity. The database consists of three tables:

1. places

Column Name | Data Type | Not Null | Default Value | Primary Key | Description
id | INTEGER | YES | NULL | YES | A unique identifier for each place in the database.
place_name | TEXT | YES | NULL | NO | The name of the place (e.g., city, town, landmark).
latitude | REAL | YES | NULL | NO | The latitude of the place in decimal degrees (geographical coordinate).
longitude | REAL | YES | NULL | NO | The longitude of the place in decimal degrees (geographical coordinate).

2. alternative_names

Column Name | Data Type | Not Null | Default Value | Primary Key | Description
id | INTEGER | YES | NULL | YES | A unique identifier for each alternative name entry.
place_id | INTEGER | YES | NULL | NO | The ID of the place from the `places` table, linking alternative names to places.
alternative_name | TEXT | YES | NULL | NO | An alternative or historical name for the place.


3. place_indexes

Column Name | Data Type | Not Null | Default Value | Primary Key | Description
id | INTEGER | YES | NULL | YES | A unique identifier for each place index entry.
place_id | INTEGER | YES | NULL | NO | The ID of the place from the `places` table, linking indexes to places.
place_index | INTEGER | YES | NULL | NO | An index or reference number associated with the place (e.g., historical, geographical, or catalog number).
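
For illustration, the three tables could be created and filled from a pipeline record as sketched below. This assumes SQLite (suggested by the INTEGER/TEXT/REAL types but not stated on this page); the `REFERENCES` constraints and the helper `build_database` are likewise assumptions, not the project's actual code.

```python
import sqlite3

# Schema following the three tables above. SQLite and the foreign-key
# constraints are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE places (
    id INTEGER PRIMARY KEY,
    place_name TEXT NOT NULL,
    latitude REAL NOT NULL,
    longitude REAL NOT NULL
);
CREATE TABLE alternative_names (
    id INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES places(id),
    alternative_name TEXT NOT NULL
);
CREATE TABLE place_indexes (
    id INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES places(id),
    place_index INTEGER NOT NULL
);
"""


def build_database(records: list[dict], path: str = ":memory:") -> sqlite3.Connection:
    """Create the three tables and load JSON-style place records into them."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    for rec in records:
        conn.execute(
            "INSERT INTO places (id, place_name, latitude, longitude) VALUES (?, ?, ?, ?)",
            (rec["id"], rec["place_name"], float(rec["latitude"]), float(rec["longitude"])),
        )
        conn.executemany(
            "INSERT INTO alternative_names (place_id, alternative_name) VALUES (?, ?)",
            [(rec["id"], name) for name in rec["place_alternative_name"]],
        )
        conn.executemany(
            "INSERT INTO place_indexes (place_id, place_index) VALUES (?, ?)",
            [(rec["id"], int(idx)) for idx in rec["place_index"]],
        )
    conn.commit()
    return conn


conn = build_database([{
    "id": 21,
    "place_name": "Marano",
    "place_alternative_name": ["FrìuliX"],
    "place_index": ["496", "497"],
    "latitude": "45.46311235",
    "longitude": "12.120517939849389",
}])
```

Keeping alternative names and indexes in their own tables lets one place carry any number of each, at the cost of a join when reading a full record back.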

Project Milestones

2024/10/10 Week 3

    Started work on the project. Conducted background research, clarified project goals, and reviewed related articles.
  • Goals:
    • Obtain diary text and extract place names.
    • Match person names with place names.
    • Link named entities to specific diary content.
  • Planned integration with the Venice interactive map frontend.

2024/11/14 Week 9

    Midterm presentation completed.
  • Summary:
    • Current Progress:
      • Data processing: Data obtaining, index extraction, column extraction
      • Filtering: Venice-name verification pipeline
    • Future Plan:
      • Handle data discrepancies
      • Extract people names associated with place entities
      • Embed information into a map
  • Next Steps:
    • Week 10: Extract people names associated with place entities
    • Week 11: Create GIS-compatible format for pipeline output
    • Week 12 (optional): Test integration with Venice front-end
    • Week 13 (optional): Add context for name-place pairing

2024/12/5 Week 12

    The main task is now almost complete.
  • Current Progress:
    • Place Index Navigator: can now match a given geoname entity and jump back to the related paragraph.
    • Data structure (geoname, column number, related paragraph) built in CSV format
    • Italian name extraction: can now extract all Italian name entities from a text paragraph.
    • Refined the Geoname verification pipeline and fixed several bugs
  • Next steps:
    • Work visualization
    • Embed information into a map

Results

Conclusion

References

Frontend references:
https://pov-dev.up.railway.app/ (development version)
https://pov.up.railway.app/

Interactive(?) book:
https://valley.newamericanhistory.org/

Document references:
[1] Finlay, Robert. “Politics and History in the Diary of Marino Sanuto.” Renaissance Quarterly, vol. 33, no. 4, 1980, pp. 585–98. JSTOR, https://doi.org/10.2307/2860688. Accessed 10 Oct. 2024.
[2] Image source: https://evolution.veniceprojectcenter.org/evolution.html
[3] Ferguson, Ronnie. “The Tax Return (1515) of Marin Sanudo: Fiscality, Family, and Language in Renaissance Venice.” Italian Studies 79, no. 2 (2024): 137–54. doi:10.1080/00751634.2024.2348379.