Revision as of 18:01, 18 December 2024

Introduction

The historical records of land management, cadastre, and taxation provide invaluable insights into the socio-economic and administrative evolution of regions over time. Two of the most significant resources for understanding such systems in historical Venice are the Catastici (1740) and the Sommarioni (1808). These two books offer different perspectives on Venetian land parcels, their ownership, and their taxation structures, reflecting shifts in governance and societal organization over time.

The Catastici, compiled in 1740, serves as a snapshot of Venetian land management under the Venetian Republic. It meticulously documents the ownership, division, and use of land parcels during a period of relative stability. This record provides a foundation for understanding how land was distributed and administered before the onset of major political upheavals.

The Sommarioni, written in 1808, emerged during a period of significant political and social disruption following the fall of the Venetian Republic and under Napoleonic rule. This document captures a transformed landscape, reflecting the influence of changing administrative structures, evolving property ownership patterns, and new taxation policies. It reveals a dynamic reorganization of Venetian society and the economic pressures that shaped the era.

Historical transactional data is often difficult to interpret and lacks a clear representation of relationships between records. This project aims to transform structured data from historical archives related to cadastral records and leases into vivid and intuitive natural language descriptions. In addition to creating accurate and comprehensive descriptions for Venetian parcels from each period, we will compare and connect data from the two eras. Through this approach, the project seeks to provide new insights into the evolution of Venice’s administrative and socioeconomic framework.

Project Milestones and Pipeline

Workflow
Week	Task	Status
07.10 - 13.10	Define research questions; review relevant literature	Done
14.10 - 20.10	Perform initial data checking and cleaning; address dataset-related questions	Done
21.10 - 27.10	Autumn vacation	Done
28.10 - 03.11	Align Catastici and Sommarioni datasets; continue data cleaning	Done
04.11 - 10.11	Develop description templates and prompts; prepare for the midterm presentation	Done
11.11 - 17.11	Midterm presentation (14.11); refine the description template and prompts	Done
18.11 - 24.11	Translate Italian data into English	Done
25.11 - 01.12	Design an evaluation plan; evaluate the prompts	Done
02.12 - 08.12	Generate final results; evaluate the prompts and translation; begin writing the wikipage	Done
09.12 - 15.12	Write the wikipage; organize GitHub code; prepare for the final presentation	Done
16.12 - 22.12	Deliver GitHub repository and wikipage (18.12); final presentation (19.12)

Pipeline

Methodology

The Catastici and Sommarioni datasets were derived from historical records through OCR conversion, which introduced significant noise. To address this, we first cleaned the datasets, extracted the relevant information, and translated the data into English (excluding names of people and places). Next, we selected and ordered the key content required for generating descriptions, and designed prompt templates to guide GPT-4o-mini during generation.

The generated descriptions were evaluated manually to ensure their accuracy, conciseness, and plausibility. Additionally, we translated the descriptions into other languages and evaluated their adaptability in multilingual contexts.

Data Processing

Catastici

1. owner first name

2. title & mestiere

3. rent

4. quantity income & quality income

Sommarioni

1. Organize data by parcel

The Sommarioni dataset records the attribute information of different parcels, including their location, area, ownership, as well as details about the parcel's owner and the owner's family. However, some parcels also contain several subparcels with different parcel types. As a result, we group the information of the subparcels together to generate our description.
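The grouping step can be sketched as follows. This is a minimal illustration using only the standard library; the field names `parcel_id`, `parcel_type`, and `area` and the sample rows are assumptions made for the example, not the dataset's actual schema or values.

```python
from collections import defaultdict


def group_subparcels(rows, key="parcel_id"):
    """Collect all subparcel rows that share the same parcel identifier."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


# Hypothetical rows: field names and values are illustrative only.
rows = [
    {"parcel_id": "A1", "parcel_type": "building", "area": 120.0},
    {"parcel_id": "A1", "parcel_type": "courtyard", "area": 30.0},
    {"parcel_id": "B7", "parcel_type": "building", "area": 85.5},
]

parcels = group_subparcels(rows)
for parcel_id, subparcels in parcels.items():
    types = [s["parcel_type"] for s in subparcels]
    total_area = sum(s["area"] for s in subparcels)
    print(parcel_id, types, total_area)
```

Each parcel then carries the full list of its subparcels, so a single description can mention every parcel type it contains.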

2. Clean owner name

The column 'owner_standardised' records the names of the owners. However, it also contains additional information about the owner, such as their identity. To avoid confusion, we have separated this information. For example, in "CROTTA Lucrezia (Vedova CALVO)", the parenthetical indicates that CROTTA Lucrezia is the widow of CALVO ("Vedova" means widow in Italian). We therefore split the entry into the owner's name and the extra information. To keep the data coherent, only the extra information is translated in the next step; the names are left unchanged.

Original
owner_standardised
CROTTA Lucrezia (Vedova CALVO)

Separated
owner name	extra info
CROTTA Lucrezia	Vedova CALVO
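A possible way to perform this split is with a regular expression. The sketch below assumes that the extra information always appears as a single trailing parenthetical; entries that follow another pattern would need additional rules. The function name `split_owner` is hypothetical.

```python
import re

# Matches "NAME (extra)" where the parenthetical closes the string.
_OWNER_PATTERN = re.compile(r"^(?P<name>.*?)\s*\((?P<extra>[^()]*)\)\s*$")


def split_owner(owner_standardised):
    """Return (owner_name, extra_info); extra_info is '' if there is no parenthetical."""
    match = _OWNER_PATTERN.match(owner_standardised)
    if match is None:
        return owner_standardised.strip(), ""
    return match.group("name").strip(), match.group("extra").strip()


print(split_owner("CROTTA Lucrezia (Vedova CALVO)"))
```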

Translation

Both datasets are in Italian, which can cause problems when generating English descriptions. We therefore translated the following properties into English.

Properties Selected for Translation
Catastici
Sommarioni	extra information (separated above)	place	ownership_types	qualities	own_title	own_other_notes

Geographical Connection

To analyze the changes in Venice between 1740 and 1808, we link the two datasets based on their geographical coordinates. The Catastici dataset comprises point data, while the Sommarioni dataset consists of polygon data. The connection between the two is established by identifying which Catastici points fall within which Sommarioni parcels. Since not all Sommarioni parcels contain Catastici points, and not all Catastici points fall within a Sommarioni parcel, the summary description generation focuses solely on the overlapping cases, where Catastici points are contained within Sommarioni parcels.
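The point-in-polygon linking can be illustrated with a dependency-free sketch based on the ray-casting test. This is not necessarily how the project implemented it (a GIS library's spatial join would be the usual choice); the identifiers and coordinates below are hypothetical, and the sketch assumes simple, non-overlapping polygons.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; points lying exactly on an edge
    are not handled specially.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_points_to_parcels(points, parcels):
    """Map each parcel id to the ids of the points it contains."""
    links = {parcel_id: [] for parcel_id in parcels}
    for point_id, coords in points.items():
        for parcel_id, polygon in parcels.items():
            if point_in_polygon(coords, polygon):
                links[parcel_id].append(point_id)
                break  # assumption: parcels do not overlap
    return links


# Hypothetical geometry, for illustration only.
sommarioni = {
    "S1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "S2": [(5, 0), (9, 0), (9, 4), (5, 4)],
}
catastici = {"C1": (1, 1), "C2": (3, 2), "C3": (20, 20)}

links = link_points_to_parcels(catastici, sommarioni)
# Keep only parcels that contain at least one Catastici point.
matched = {pid: pts for pid, pts in links.items() if pts}
print(matched)
```

Parcels without any point (here `S2`) and points outside every parcel (here `C3`) drop out, which mirrors the restriction to overlapping cases.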

Content Selection and Ordering

Catastici

Taking data quality into account, we selected specific properties from the Catastici dataset and categorized them into six groups.

Sommarioni

Taking data quality into account, we selected specific properties from the Sommarioni dataset and categorized them into five groups.

Selected Properties from Sommarioni
Location	Features	Owner & Ownership	Owners' Family	Other Notes
district_acronym	parcel type	owner_standardised	own_father	own_other_notes
place	qualities	ownership_types	own_father_is_q	parcel_ids
area		own_title	own_mother
			own_siblings
			own_husband
			own_husband_is_q

Template Design

Our project utilizes GPT-4o-mini to generate descriptions. To provide the language model with clear instructions and produce well-structured content, we divide the description into three paragraphs. The first paragraph introduces the information from Catastici (1740), the second paragraph details the data from Sommarioni (1808), and the third paragraph connects the two datasets, summarizing their relationship.

We aim for Paragraphs 1 and 2 to be accurate and concise, including all relevant information about the geometry parcels in the dataset while avoiding the inclusion of any fictional or irrelevant data. For Paragraph 3, our goal is to ensure plausibility by logically connecting the changes and similarities between the two descriptions, avoiding unreasonable or speculative elements, and adequately addressing implications or transitions inferred from the original data.

To ensure the language model effectively understands our instructions, we designed a specific prompt template for each paragraph. The final versions of our templates are shown below, and we discuss how we refined them in the next section.
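As an illustration only (this is not the project's template), a per-paragraph prompt could be assembled from the selected properties roughly as follows. The instruction wording, the function name `build_prompt`, and the sample record values are all assumptions; only the field names `place`, `qualities`, and `own_title` come from the property table above.

```python
def build_prompt(source_name, year, record):
    """Assemble a prompt for one paragraph from the non-empty fields of a record."""
    facts = [
        f"- {key}: {value}"
        for key, value in record.items()
        if value not in (None, "")
    ]
    lines = [
        f"Write one concise paragraph describing a Venetian parcel "
        f"as recorded in the {source_name} ({year}).",
        "Use only the facts listed below; do not add information that is not listed.",
        "Facts:",
        *facts,
    ]
    return "\n".join(lines)


# Hypothetical record; empty fields are skipped so the model is not
# tempted to comment on missing data.
record = {"place": "Calle del Forno", "qualities": "house", "own_title": ""}
prompt = build_prompt("Sommarioni", 1808, record)
print(prompt)
```

Dropping empty fields before prompting is one simple way to pursue the accuracy and conciseness goals stated above, since the model only sees facts that exist in the data.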

Prompt Evaluation

The evaluation process consists of five main steps. First, the reviewers carefully read the input content to thoroughly understand the context, key details, and specific requirements. Second, the reviewers score each description based on predefined criteria. Third, we assess the inter-annotator agreement among the reviewers to ensure consistency. Fourth, we analyze the evaluation results to identify patterns and insights. Finally, we refine and improve our prompt template based on these findings. Given that the requirements and objectives of each paragraph vary, we have developed distinct evaluation metrics tailored to each paragraph. The evaluation criteria of each paragraph are explained below.

The Evaluation Steps

Evaluation Criteria for Catastici and Sommarioni Paragraph

Since the first two paragraphs primarily focus on the information in the dataset, we prioritize accuracy and conciseness as the main evaluation indicators. The detailed criteria are explained below.

Evaluate Description for Accuracy

For each description, assess whether:

All facts are correctly represented.
The description aligns precisely with the metadata, without misrepresentation or omission of key details.
There are no fabricated or incorrect elements.

Assign a score of:

0: If the description contains inaccuracies, misrepresentation, omissions, or fabricated elements.
1: If the description is fully accurate, complete, and faithfully represents the metadata.

Evaluate Description for Conciseness

For each description, assess whether:

The description is free from redundant or repetitive content.
The information is presented succinctly, without unnecessary elaboration or verbosity.
The description focuses on delivering the key details without including irrelevant information.

Assign a score of:

0: If the description contains redundancy, repetition, or excessive elaboration.
1: If the description is concise, focused, and avoids unnecessary content.

To evaluate the generated descriptions more thoroughly and comprehensively, we divide the content based on the original data into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.

Evaluation Metrics for Catastici
	Location	Features	Owners' Name	Owners' Title & Job	Tenant	Payment
Accurate	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1
Concise	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1

Evaluation Metrics for Sommarioni
	Location	Features	Owner & Ownership	Owners' Family	Other Notes
Accurate	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1
Concise	0 or 1	0 or 1	0 or 1	0 or 1	0 or 1
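How the individual 0/1 labels are combined into a per-aspect score is not spelled out here. One natural choice, assumed in the sketch below, is the mean of all labels for an aspect, pooled over annotators and samples. The function name and the sample labels are hypothetical.

```python
def aspect_scores(labels):
    """Average the binary labels of each aspect.

    labels maps an aspect name to a list of 0/1 labels pooled over
    annotators and samples (an assumed aggregation scheme).
    """
    return {
        aspect: sum(values) / len(values)
        for aspect, values in labels.items()
    }


# Hypothetical labels for two aspects, four judgments each.
example = {
    "Location": [1, 1, 0, 1],
    "Features": [1, 1, 1, 1],
}
scores = aspect_scores(example)
print(scores)
```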

Evaluation Criteria for Summary Paragraph

The summary paragraph is designed to provide a cohesive overview, integrating the information from the preceding two paragraphs. Its purpose is to bridge the two datasets by logically connecting the observed changes and similarities in the descriptions, while analyzing the underlying reasons for these changes. To achieve this, we have selected plausibility as our evaluation criterion.

For each generated description, evaluate whether it:

Logically connects the changes and similarities between the two descriptions.
Avoids introducing unreasonable, exaggerated, or speculative elements.
Maintains consistency with real-world knowledge and contextual logic.
Adequately addresses any implications or transitions inferred from the original descriptions.

Assign a score of:

0: If the description contains illogical conclusions, exaggerated assumptions, or inconsistencies with the provided information or context.
1: If the description is logical, reasonable, and aligns well with the context and common sense.

Considering the informational overlap between the two datasets, we divide the summary content into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.

Evaluation Metrics for Summary
	Location	Features	Owner Information	Tenant
Accurate	0 or 1	0 or 1	0 or 1	0 or 1
Concise	0 or 1	0 or 1	0 or 1	0 or 1

Evaluation Criteria for Translation

Our project aims to provide descriptions in multiple languages. After generating the English descriptions, we translated them into various languages. However, since Chinese is the only target language in which we are proficient, we used the Chinese version as an example for evaluation. The evaluation criteria and metrics are outlined below.

Evaluation Results and Analysis

Results

Evaluation of Catastici Description

We first calculated the inter-annotator agreement between the two annotators to ensure the quality and consistency of the annotations. One of the most commonly used methods for assessing agreement is Cohen's Kappa coefficient. However, under extreme class imbalance Cohen's Kappa becomes unreliable or even misleading. When nearly all items receive the same label (such as '1', which happens when the results are close to perfect), the expected chance agreement approaches 1, so Kappa can be close to zero despite high raw agreement; it is undefined when both annotators assign the same single label to every item.

In contrast, percentage agreement is a simpler measure that directly calculates the proportion of matching labels between two raters without accounting for chance agreement. This makes it more suitable for extreme cases, as it is not affected by class imbalance or skewed label distributions. Therefore, we ultimately chose percentage agreement to evaluate the validity of the annotation results. Its formula is as follows:

Percentage Agreement = Number of consistent labels / Total number of labels
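The contrast between the two measures can be checked with a short sketch. The functions below implement the formula above and the standard definition of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e); the label lists are made-up examples, not the project's annotations.

```python
def percentage_agreement(a, b):
    """Proportion of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's Kappa for two raters; None when it is undefined (p_e == 1)."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1 - p_e)


# Made-up example: the raters disagree on one item out of ten.
rater_a = [1] * 9 + [0]
rater_b = [1] * 10
agreement = percentage_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, kappa)

# Degenerate case: both raters label everything '1'.
perfect = cohens_kappa([1] * 10, [1] * 10)
print(perfect)
```

In the first case the raters agree on 90% of the items, yet Kappa is zero because rater B never uses the label 0; in the second case Kappa cannot be computed at all. Percentage agreement stays interpretable in both.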

Inter-Annotator Agreement for Catastici Description
	Location	Features	Owners' Name	Owners' Title & Job	Tenant	Payment
Prompt1-Accuracy	0.95	1.0	0.95	0.9	1.0	0.95
Prompt2-Accuracy	1.0	1.0	0.95	1.0	1.0	0.95
Prompt1-Conciseness	1.0	1.0	1.0	0.9	1.0	0.95
Prompt2-Conciseness	1.0	1.0	0.7	1.0	1.0	0.95

Evaluation Results for Catastici Description
	Location	Function	Owner Name	Owner title & job	Tenant	Payment
Prompt1-Accuracy	0.675	1.0	0.875	0.55	1.0	0.975
Prompt2-Accuracy	1.0	1.0	0.875	1.0	1.0	0.975
Prompt1-Conciseness	1.0	1.0	1.0	0.2	1.0	0.275
Prompt2-Conciseness	1.0	1.0	0.6	1.0	1.0	0.975

Evaluation of Sommarioni Description

Inter-Annotator Agreement for Sommarioni Description
	Location	Function	Ownership	Owner Family	Other Notes
Prompt1-Accuracy	0.9375	0.8125	0.8125	0.9375	0.875
Prompt2-Accuracy	0.95	1.0	0.85	0.9	0.95
Prompt1-Conciseness	0.875	0.9375	0.8125	0.9375	0.875
Prompt2-Conciseness	1.0	1.0	1.0	0.95	1.0

Evaluation Results for Sommarioni Description
	Location	Function	Ownership	Owner Family	Other Notes
Prompt1-Accuracy	0.96875	0.71875	0.90625	0.96875	0.75
Prompt2-Accuracy	0.925	1.0	0.825	0.95	0.975
Prompt1-Conciseness	0.75	0.96875	0.65625	0.09375	0.75
Prompt2-Conciseness	0.85	1.0	1.0	0.975	1.0

Evaluation of Summary

Inter-Annotator Agreement on Plausibility for Summary
	ParcelInfo	OwnerInfo	RentalInfo
Plausibility	0.8125	0.75	0.9375

Evaluation Results for Summary
	ParcelInfo	OwnerInfo	RentalInfo
Plausibility	0.91	0.75	0.97

Evaluation of Translation

Limitations and Future Work

Prompt Template Refinement

As discussed in the previous section, we conducted only two rounds of refinement for our prompt template. The accuracy and conciseness of the Catastici and Sommarioni paragraphs, as well as the plausibility of the summary paragraph, can still be improved. Additionally, the descriptions are currently quite similar to one another; designing different tones would make them more diverse and engaging.

Model Improvement

Due to budget limitations, we only used GPT-4o-mini to generate our descriptions and translations. This model has some limitations, particularly in translation. Therefore, our next step is to involve other large language models, such as GPT-4 and T5, to address these issues.

Reference

GitHub Repositories

GitHub Link

Credits

Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors: Alexander Rusnak, Tristan Karch, Tommy Bruzzese
Authors: Ruyin Feng, Zhichen Fang


Generation of Textual Description for Parcels: Difference between revisions
