Sanudo's Diary
Revision as of 13:37, 5 December 2024
Click to go back to Project lists
Our Github Page
Link to Google Doc
Diaries
Frontend
Introduction
The main work of this project is to convert the geoname entities in the index of Sanudo's diary into digital form.
The ultimate goal is to present the diary's named entities interactively, for example on a map-based website, by georeferencing the relevant entities to places in Venice. The aim is to provide an immersive experience that lets users explore the diary in a spatially interactive format, deepening their engagement with the historical narrative.
About Sanudo
Status | Venetian historian, author and diarist. Aristocrat. |
Occupation | Historian |
Life | 22 May 1466 - † 4 April 1536 |
Sanudo's Diary
The Diaries of Marin Sanudo represent one of the most comprehensive daily records of events ever compiled by a single individual in early modern Europe. They offer insights into various aspects of Venetian life, from "diplomacy to public spectacles, politics to institutional practices, state councils to public opinion, mainland territories to overseas possessions, law enforcement to warfare, the city's landscape to the lives of its inhabitants, and from religious life to fashion, prices, weather, and entertainment".[2]
English selection & comment of Diary
Transforming Sanudo’s index (1496 - 1533)
Diary Duration | 1496-1533 | 37 years |
Quantity | 58 Volumes | around 40000 pages |
Content Style | deal with any matter | regardless of its ‘importance’ |
"...the continuity of events and institutions collapses into the quotidian. ...Unreflecting, pedantic, and insatiable, he aimed "to seek out every occurrence, no matter how slight," for he believed that the truth of events could only be grasped through an abundance of facts. He gathered those facts in the chancellery of the Ducal Palace and in the streets of the city, transcribing official legislation and ambassadorial dispatches, reporting popular opinion and Rialto gossip"[1]
Project Milestones
2024/10/10 Week 3
Started work on the project. Conducted background research and clarified project goals. Reviewed related articles.
- Goals:
- Obtain diary text and extract place names.
- Match person names with place names.
- Link named entities to specific diary content.
- Planned integration with the Venice interactive map frontend.
2024/11/14 Week 9
Midterm presentation completed.
- Summary:
- Current Progress:
- Data processing: Data obtaining, index extraction, column extraction
- Filtering: Venice-name verification pipeline
- Future Plan:
- Handle data discrepancies
- Extract people names associated with place entities
- Embed information into a map
- Next Steps:
- Week 10: Extract people names associated with place entities
- Week 11: Create GIS-compatible format for pipeline output
- Week 12 (optional): Test integration with Venice front-end
- Week 13 (optional): Add context for name-place pairing
2024/12/5 Week 12
The main task is now almost complete.
- Current Progress:
- Place Index Navigator: given a geoname entity, it can now find and jump back to the related paragraph.
- Data structure (geoname, column number, related paragraph) built in CSV format
- Italian name extraction: can now extract all Italian name entities from a text paragraph.
- Refined the Geoname verification pipeline, fixed several bugs
- Next steps:
- Work visualization
- Embed information into a map
Technical Details
Pre-processing the OCR-generated text file
The text file was sourced from the Internet Archive. We split its content into columns by matching a recurring text pattern: each column boundary is marked by 2x3 line breaks and two page numbers, which serve as the key markers for the division.
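The splitting step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that "2x3 line breaks and two page numbers" means a run of at least three line breaks, a line holding two numbers, and another run of at least three line breaks. The names `COLUMN_BREAK` and `split_columns` are hypothetical.

```python
import re

# Assumed boundary marker: 3+ line breaks, a line with two numbers, 3+ line breaks.
# The real OCR file may need a looser or stricter pattern.
COLUMN_BREAK = re.compile(r"\n{3,}[ \t]*(\d+)[ \t]+(\d+)[ \t]*\n{3,}")


def split_columns(text):
    """Split OCR text into chunks, keeping the number pair that precedes each chunk."""
    # With two capture groups, re.split returns:
    # [text0, n1, n2, text1, n1, n2, text2, ...]
    parts = COLUMN_BREAK.split(text)
    columns = [{"numbers": None, "text": parts[0].strip()}]
    for i in range(1, len(parts), 3):
        first, second, chunk = parts[i], parts[i + 1], parts[i + 2]
        columns.append({"numbers": (int(first), int(second)), "text": chunk.strip()})
    return columns


sample = "alpha text\n\n\n495 496\n\n\nbeta text\n\n\n\n497 498\n\n\ngamma text"
for column in split_columns(sample):
    print(column)
```

Keeping the number pair with each chunk makes it possible to map an index entry back to its position in the text later on.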
Place entity extraction from OCR-generated text files
Geoname verification pipeline
We identified three APIs that could be used to check whether an entity exists in Venice: Nominatim (OpenStreetMap), Wikidata, and GeoNames. We built a pipeline that extracts place entities from Sanudo's index and queries all three APIs for each entity. If any of the three APIs found a match for the place in Venice, we stored the id, alternative names, coordinates and place indices. Below is an example of the resulting JSON for "Marano":
{
  "id": 21,
  "place_name": "Marano",
  "place_alternative_name": [
    "FrìuliX"
  ],
  "place_index": [
    "496",
    "497",
    "546",
    "550",
    "554",
    "556",
    "557",
    "584"
  ],
  "nominatim_coords": [
    "45.46311235",
    "12.120517939849389"
  ],
  "geodata_coords": null,
  "wikidata_coords": null,
  "nominatim_match": true,
  "geodata_match": false,
  "wikidata_match": false,
  "latitude": "45.46311235",
  "longitude": "12.120517939849389",
  "agreement_count": 1
}
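A record with the fields shown above could be assembled roughly as follows. This is a sketch under assumptions: the function name `build_record`, its arguments, and the rule that the first matching API (in the order Nominatim, geodata, Wikidata) supplies `latitude` and `longitude` are all hypothetical. The API lookups themselves are left out; the function only merges their results.

```python
API_LABELS = ("nominatim", "geodata", "wikidata")  # labels taken from the JSON keys


def build_record(place_id, name, alt_names, indices, api_coords):
    """Merge per-API lookup results into one record.

    api_coords maps an API label to a (lat, lon) pair of strings,
    or to None (or a missing key) when that API found no match.
    Returns None when no API matched, since only matched places are stored.
    """
    record = {
        "id": place_id,
        "place_name": name,
        "place_alternative_name": list(alt_names),
        "place_index": list(indices),
    }
    for api in API_LABELS:
        coords = api_coords.get(api)
        record[f"{api}_coords"] = list(coords) if coords else None
    for api in API_LABELS:
        record[f"{api}_match"] = record[f"{api}_coords"] is not None

    matched = [api for api in API_LABELS if record[f"{api}_match"]]
    if not matched:
        return None
    # Assumption: the first matching API provides the final coordinates.
    record["latitude"], record["longitude"] = record[f"{matched[0]}_coords"]
    record["agreement_count"] = len(matched)
    return record


marano = build_record(
    21, "Marano", ["FrìuliX"], ["496", "497"],
    {"nominatim": ("45.46311235", "12.120517939849389")},
)
print(marano["agreement_count"], marano["latitude"], marano["longitude"])
```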
We analyzed which of the three APIs could successfully determine whether the place entities extracted from Sanudo's diary were in Venice. One challenge was disambiguating place names: many names may have changed over time or may refer to several locations. To resolve this, we manually checked for ambiguities and then refined the place extraction pipeline to eliminate conflicts in the results. For instance, several places listed "veneziano" as an alternative name, which had to be removed from the search query sent to the API. Furthermore, the GeoNames API (the geodata fields in the JSON above) mistakenly returned some locations in Greece among the places in "Venice". We added special cases to the place_extraction pipeline to cover these errors.
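The two fixes described above can be illustrated with a short sketch. The text only says that "veneziano" was stripped from queries and that special cases were added for the Greek results; the bounding-box check below is a substitute technique chosen for illustration, not the project's method. The box values and the names `GENERIC_ALT_NAMES`, `VENICE_BBOX`, `build_query` and `in_venice_area` are assumptions.

```python
# "veneziano" comes from the text; any further entries would be assumptions.
GENERIC_ALT_NAMES = {"veneziano"}

# Hypothetical bounding box loosely covering the Venice area; tune before use.
VENICE_BBOX = {"lat_min": 45.0, "lat_max": 46.0, "lon_min": 11.5, "lon_max": 13.5}


def build_query(place_name, alt_names):
    """Build a search string, dropping generic alternative names such as 'veneziano'."""
    kept = [a for a in alt_names if a.strip().lower() not in GENERIC_ALT_NAMES]
    return " ".join([place_name, *kept])


def in_venice_area(lat, lon, bbox=VENICE_BBOX):
    """Return True when the coordinates fall inside the assumed bounding box."""
    lat, lon = float(lat), float(lon)
    return (bbox["lat_min"] <= lat <= bbox["lat_max"]
            and bbox["lon_min"] <= lon <= bbox["lon_max"])


print(build_query("Murano", ["veneziano"]))
print(in_venice_area("45.46311235", "12.120517939849389"))
print(in_venice_area(37.98, 23.73))  # a point in Greece
```

A coordinate filter of this kind would reject results in Greece without listing each one by hand, at the cost of also rejecting legitimate overseas places that the diary mentions.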
Database Entity
We created a database with the following schema to define a place entity. It consists of three tables:
1. places
Column Name | Data Type | Not Null | Default Value | Primary Key | Description |
---|---|---|---|---|---|
id | INTEGER | YES | NULL | YES | A unique identifier for each place in the database. |
place_name | TEXT | YES | NULL | NO | The name of the place (e.g., city, town, landmark). |
latitude | REAL | YES | NULL | NO | The latitude of the place in decimal degrees (geographical coordinate). |
longitude | REAL | YES | NULL | NO | The longitude of the place in decimal degrees (geographical coordinate). |
2. alternative_names
Column Name | Data Type | Not Null | Default Value | Primary Key | Description |
---|---|---|---|---|---|
id | INTEGER | YES | NULL | YES | A unique identifier for each alternative name entry. |
place_id | INTEGER | YES | NULL | NO | The ID of the place from the `places` table, linking alternative names to places. |
alternative_name | TEXT | YES | NULL | NO | An alternative or historical name for the place. |
3. place_indexes
Column Name | Data Type | Not Null | Default Value | Primary Key | Description |
---|---|---|---|---|---|
id | INTEGER | YES | NULL | YES | A unique identifier for each place index entry. |
place_id | INTEGER | YES | NULL | NO | The ID of the place from the `places` table, linking indexes to places. |
place_index | INTEGER | YES | NULL | NO | An index or reference number associated with the place (e.g., historical, geographical, or catalog number). |
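The three tables above could be created and filled as in the sketch below. It assumes SQLite (the INTEGER, TEXT and REAL types suggest it, but the text does not name the engine). The foreign-key constraints on `place_id` and the helper names `create_db` and `insert_place` are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS places (
    id INTEGER PRIMARY KEY,
    place_name TEXT NOT NULL,
    latitude REAL NOT NULL,
    longitude REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS alternative_names (
    id INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES places(id),
    alternative_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS place_indexes (
    id INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES places(id),
    place_index INTEGER NOT NULL
);
"""


def create_db(path=":memory:"):
    """Open a SQLite database and create the three tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def insert_place(conn, record):
    """Store one pipeline record (as in the JSON example) across the three tables."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO places (id, place_name, latitude, longitude) VALUES (?, ?, ?, ?)",
            (record["id"], record["place_name"],
             float(record["latitude"]), float(record["longitude"])),
        )
        conn.executemany(
            "INSERT INTO alternative_names (place_id, alternative_name) VALUES (?, ?)",
            [(record["id"], alt) for alt in record["place_alternative_name"]],
        )
        conn.executemany(
            "INSERT INTO place_indexes (place_id, place_index) VALUES (?, ?)",
            [(record["id"], int(idx)) for idx in record["place_index"]],
        )


conn = create_db()
insert_place(conn, {
    "id": 21, "place_name": "Marano",
    "place_alternative_name": ["FrìuliX"],
    "place_index": ["496", "497"],
    "latitude": "45.46311235", "longitude": "12.120517939849389",
})
print(conn.execute("SELECT place_index FROM place_indexes ORDER BY place_index").fetchall())
```

Splitting alternative names and indexes into their own tables keeps one row per value, so a lookup by index number is a plain join on `place_id`.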
Results
Conclusion
References
Frontend references:
https://pov-dev.up.railway.app/ (development version)
https://pov.up.railway.app/
Interactive(?) book:
https://valley.newamericanhistory.org/
Document references:
[1] Finlay, Robert. “Politics and History in the Diary of Marino Sanuto.” Renaissance Quarterly, vol. 33, no. 4, 1980, pp. 585–98. JSTOR, https://doi.org/10.2307/2860688. Accessed 10 Oct. 2024.
[2] Image source: https://evolution.veniceprojectcenter.org/evolution.html
[3] Ferguson, Ronnie. “The Tax Return (1515) of Marin Sanudo: Fiscality, Family, and Language in Renaissance Venice.” Italian Studies 79, no. 2 (2024): 137–54. doi:10.1080/00751634.2024.2348379.