Generation of Textual Description for Parcels
Revision as of 18:01, 18 December 2024
Introduction
The historical records of land management, cadastre, and taxation provide invaluable insights into the socio-economic and administrative evolution of regions over time. Among the most significant resources for understanding such systems in ancient Venice are the Catastici (1740) and the Sommarioni (1808). These two registers offer different perspectives on Venetian land parcels, their ownership, and their taxation structures, reflecting the shifts in governance and societal organization over time.
The Catastici, compiled in 1740, serves as a snapshot of Venetian land management under the Venetian Republic. It meticulously documents the ownership, division, and use of land parcels during a period of relative stability. This record provides a foundation for understanding how land was distributed and administered before the onset of major political upheavals.
The Sommarioni, written in 1808, emerged during a period of significant political and social disruption following the fall of the Venetian Republic and under Napoleonic rule. This document captures a transformed landscape, reflecting the influence of changing administrative structures, evolving property ownership patterns, and new taxation policies. It reveals a dynamic reorganization of Venetian society and the economic pressures that shaped the era.
Historical transactional data is often difficult to interpret and lacks a clear representation of relationships between records. This project aims to transform structured data from historical archives related to cadastral records and leases into vivid and intuitive natural language descriptions. In addition to creating accurate and comprehensive descriptions for Venetian parcels from each period, we will compare and connect data from the two eras. Through this approach, the project seeks to provide new insights into the evolution of Venice’s administrative and socioeconomic framework.
Project Milestones and Pipeline
Week | Task | Status |
---|---|---|
07.10 - 13.10 | Define research questions; review relevant literature | Done |
14.10 - 20.10 | Perform initial data checking and cleaning; address dataset-related questions | Done |
21.10 - 27.10 | Autumn vacation | Done |
28.10 - 03.11 | Align Catastici and Sommarioni dataset; continue data cleaning | Done |
04.11 - 10.11 | Develop description templates and prompts; prepare for the midterm presentation | Done |
11.11 - 17.11 | Midterm presentation (14.11); refine the description template and prompts | Done |
18.11 - 24.11 | Translate Italian data into English | Done |
25.11 - 01.12 | Design an evaluation plan; evaluate the prompts | Done |
02.12 - 08.12 | Generate final results; evaluate the prompts and translation; begin writing the wikipage | Done |
09.12 - 15.12 | Write the wikipage; organize GitHub code; prepare for the final presentation | Done |
16.12 - 22.12 | Deliver GitHub repository and wikipage (18.12); final presentation (19.12) | |
Methodology
The Catastici and Sommarioni datasets were derived from historical records through OCR conversion, which introduced significant noise. To address this, we first cleaned the datasets to remove noisy data, extracted relevant information, and standardized the data into English (excluding names of people and places). Next, we selected and ordered the key content required for generating descriptions, and designed prompt templates to guide GPT in generating them.
The generated descriptions were evaluated manually to ensure their accuracy, conciseness, and plausibility. Additionally, we translated the descriptions into other languages and evaluated their adaptability in multilingual contexts.
Data Processing
Catastici
1. owner first name
2. title & mestiere
3. rent
4. quantity income & quality income
Sommarioni
1. Organize data by parcel
The Sommarioni dataset records the attribute information of different parcels, including their location, area, ownership, as well as details about the parcel's owner and the owner's family. However, some parcels also contain several subparcels with different parcel types. As a result, we group the information of the subparcels together to generate our description.
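The grouping step can be sketched as follows. This is a minimal illustration, not the project's code: the keys `parcel_id`, `parcel_type`, and `area` and the sample values are assumptions made for the example, and the real Sommarioni column names may differ.

```python
from collections import defaultdict


def group_by_parcel(rows):
    """Collect subparcel rows that share a parcel id (hypothetical key names)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["parcel_id"]].append(row)
    return dict(grouped)


# Invented sample rows: parcel 101 has two subparcels of different types.
rows = [
    {"parcel_id": 101, "parcel_type": "house", "area": 120.0},
    {"parcel_id": 101, "parcel_type": "courtyard", "area": 35.5},
    {"parcel_id": 102, "parcel_type": "shop", "area": 48.0},
]
parcels = group_by_parcel(rows)
```

Each value in `parcels` is the list of subparcel rows that would feed one description.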
2. Clean owner name
The column 'owner_standardised' records the names of the owners. However, it also contains additional information about the owner, such as their identity. To avoid confusion, we separated this information from the name. For example, the entry "CROTTA Lucrezia (Vedova CALVO)" states that CROTTA Lucrezia is the widow of CALVO ("Vedova" means widow in Italian), so we split it into the owner's name and the extra information. To maintain the coherence of the data, we only translate the extra information in the next step.
owner_standardised |
---|
CROTTA Lucrezia (Vedova CALVO) |
owner name | extra info |
---|---|
CROTTA Lucrezia | Vedova CALVO |
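One way to perform this split is a regular expression. The sketch below assumes the extra information always appears as a single parenthesised note at the end of the value; the function name `split_owner` is hypothetical, and entries in other formats would need additional rules.

```python
import re

# Assumed pattern: a name followed by one parenthesised note at the end.
_OWNER_PATTERN = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<extra>[^()]*)\)\s*$")


def split_owner(value):
    """Return (owner_name, extra_info); extra_info is "" when no note is present."""
    match = _OWNER_PATTERN.match(value.strip())
    if match is None:
        return value.strip(), ""
    return match.group("name").strip(), match.group("extra").strip()


name, extra = split_owner("CROTTA Lucrezia (Vedova CALVO)")
```

Values without a parenthesised note are returned unchanged with an empty extra field, so the function can be applied to the whole column.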
Translation
Because both datasets are in Italian, feeding them directly to the model could cause problems when generating English descriptions. We therefore translated the following properties into English.
Catastici | ||||||
Sommarioni | extra information (separated above) | place | ownership_types | qualities | own_title | own_other_notes |
Geographical Connection
To analyze the changes in ancient Venice between 1740 and 1808, we link the two datasets based on their geographical coordinates. The Catastici dataset comprises point data, while the Sommarioni dataset consists of polygon data. The connection between the two is established by identifying which Catastici points fall within specific Sommarioni parcels. Since not all Sommarioni parcels have corresponding Catastici data, and not all Catastici data matches a Sommarioni parcel, the summary description generation focuses solely on the overlapping portions—specifically, cases where Catastici data is contained within Sommarioni parcels.
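The containment test behind this linking can be illustrated with a plain ray-casting check. This is a simplified sketch under stated assumptions: parcels are simple polygons given as vertex lists, parcels do not overlap, and all identifiers and coordinates are invented. A real pipeline would more likely rely on a geospatial library's spatial join.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; points exactly on an edge are not
    handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_points_to_parcels(points, parcels):
    """Map each parcel id to the ids of the points it contains."""
    links = {parcel_id: [] for parcel_id in parcels}
    for point_id, point in points.items():
        for parcel_id, polygon in parcels.items():
            if point_in_polygon(point, polygon):
                links[parcel_id].append(point_id)
                break  # assumption: parcels do not overlap
    return links


# Invented example: two square parcels and three points.
parcels = {
    "S1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "S2": [(5, 0), (9, 0), (9, 4), (5, 4)],
}
points = {"C1": (1, 1), "C2": (6, 2), "C3": (20, 20)}
links = link_points_to_parcels(points, parcels)
```

Points that fall in no parcel (here `C3`) and parcels with an empty list correspond to the unmatched records that are left out of the summary description.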
Content Selection and Ordering
Catastici
Taking data quality into account, we selected specific properties from the Catastici dataset and categorized them into six groups.
Sommarioni
Taking data quality into account, we selected specific properties from the Sommarioni dataset and categorized them into five groups.
Location | Features | Owner & Ownership | Owners' Family | Othernotes |
---|---|---|---|---|
district_acronym | parcel type | owner_standardised | own_father | own_other_notes |
place | qualities | ownership_types | own_father_is_q | parcel_ids |
 | area | own_title | own_mother | |
 | | | own_siblings | |
 | | | own_husband | |
 | | | own_husband_is_q | |
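The selection and ordering of content can be expressed as a mapping from group name to column names. The sketch below is illustrative only: it lists just a few of the columns from the table, the function name `select_content` is hypothetical, and the sample record values are invented.

```python
# A subset of the column groups from the table; dictionary order defines the
# order in which content is presented.
SOMMARIONI_GROUPS = {
    "Location": ["district_acronym", "place"],
    "Owner & Ownership": ["owner_standardised", "ownership_types", "own_title"],
    "Owners' Family": ["own_father", "own_mother"],
}


def select_content(record, groups=SOMMARIONI_GROUPS):
    """Return {group: {column: value}}, keeping only non-empty values."""
    selected = {}
    for group, columns in groups.items():
        values = {c: record[c] for c in columns if record.get(c) not in (None, "")}
        if values:  # drop groups with no usable data
            selected[group] = values
    return selected


# Invented sample record with a missing family field.
record = {
    "district_acronym": "SM",
    "place": "Calle Larga",
    "owner_standardised": "CROTTA Lucrezia",
    "own_father": "",
}
selected = select_content(record)
```

Dropping empty fields before generation is one way to reduce the chance that the model invents content for missing data.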
Template Design
Our project utilizes GPT-4o-mini to generate descriptions. To provide the language model with clear instructions and produce well-structured content, we divide the description into three paragraphs. The first paragraph introduces the information from Catastici (1740), the second paragraph details the data from Sommarioni (1808), and the third paragraph connects the two datasets, summarizing their relationship.
We aim for Paragraphs 1 and 2 to be accurate and concise, including all relevant information about the geometry parcels in the dataset while avoiding the inclusion of any fictional or irrelevant data. For Paragraph 3, our goal is to ensure plausibility by logically connecting the changes and similarities between the two descriptions, avoiding unreasonable or speculative elements, and adequately addressing implications or transitions inferred from the original data.
To ensure the language model effectively understands our instructions, we designed a specific prompt template for each paragraph. The final versions of our templates are shown below, and we discuss how the prompt templates were refined in the next section.
Prompt Evaluation
The evaluation process consists of five main steps. First, the reviewers carefully read the input content to thoroughly understand the context, key details, and specific requirements. Second, the reviewers score each description based on predefined criteria. Third, we assess the inter-annotator agreement among the reviewers to ensure consistency. Fourth, we analyze the evaluation results to identify patterns and insights. Finally, we refine and improve our prompt template based on these findings. Given that the requirements and objectives of each paragraph vary, we have developed distinct evaluation metrics tailored to each paragraph. The evaluation criteria of each paragraph are explained below.
Evaluation Criteria for Catastici and Sommarioni Paragraph
Since the first two paragraphs primarily focus on the information in the dataset, we prioritize accuracy and conciseness as the main evaluation indicators. The detailed criteria are explained below.
Evaluate Description for Accuracy
For each description, assess whether:
- All facts are correctly represented.
- The description aligns precisely with the metadata, without misrepresentation or omission of key details.
- There are no fabricated or incorrect elements.
Assign a score of:
- 0: If the description contains inaccuracies, misrepresentation, omissions, or fabricated elements.
- 1: If the description is fully accurate, complete, and faithfully represents the metadata.
Evaluate Description for Conciseness
For each description, assess whether:
- The description is free from redundant or repetitive content.
- The information is presented succinctly, without unnecessary elaboration or verbosity.
- The description focuses on delivering the key details without including irrelevant information.
Assign a score of:
- 0: If the description contains redundancy, repetition, or excessive elaboration.
- 1: If the description is concise, focused, and avoids unnecessary content.
To evaluate the generated descriptions more thoroughly and comprehensively, we divide the content based on the original data into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.
Location | Features | Owners' Name | Owners' Title & Job | Tenant | Payment | |
---|---|---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Location | Features | Owner & Ownership | Owners' Family | Othernotes | |
---|---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
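Binary per-aspect scores of this kind can be summarized as a mean per aspect. The sketch below assumes that aggregation (the mean of 0/1 scores over all scored descriptions); the function name `aspect_means` and the sample score sheets are invented for illustration.

```python
def aspect_means(score_sheets):
    """Average binary (0/1) scores per aspect across all scored descriptions.

    score_sheets: list of dicts mapping aspect name -> 0 or 1.
    """
    totals = {}
    for sheet in score_sheets:
        for aspect, score in sheet.items():
            if score not in (0, 1):
                raise ValueError(f"score for {aspect!r} must be 0 or 1")
            totals.setdefault(aspect, []).append(score)
    return {aspect: sum(scores) / len(scores) for aspect, scores in totals.items()}


# Invented accuracy scores for four descriptions.
accuracy_sheets = [
    {"Location": 1, "Features": 1, "Owner & Ownership": 0},
    {"Location": 1, "Features": 1, "Owner & Ownership": 1},
    {"Location": 0, "Features": 1, "Owner & Ownership": 1},
    {"Location": 1, "Features": 1, "Owner & Ownership": 1},
]
means = aspect_means(accuracy_sheets)
```

The same function applies unchanged to conciseness scores, since both metrics use the same 0/1 scale.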
Evaluation Criteria for Summary Paragraph
The summary paragraph is designed to provide a cohesive overview, integrating the information from the preceding two paragraphs. Its purpose is to bridge the two datasets by logically connecting the observed changes and similarities in the descriptions, while analyzing the underlying reasons for these changes. To achieve this, we have selected plausibility as our evaluation criterion.
For each generated description, evaluate whether it:
- Logically connects the changes and similarities between the two descriptions.
- Avoids introducing unreasonable, exaggerated, or speculative elements.
- Maintains consistency with real-world knowledge and contextual logic.
- Adequately addresses any implications or transitions inferred from the original descriptions.
Assign a score of:
- 0: If the description contains illogical conclusions, exaggerated assumptions, or inconsistencies with the provided information or context.
- 1: If the description is logical, reasonable, and aligns well with the context and common sense.
Considering the informational overlap between the two datasets, we divide the summary content into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.
Location | Features | Owner Information | Tenant | |
---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Evaluation Criteria for Translation
Our project aims to provide descriptions in multiple languages. After generating the English descriptions, we translated them into various languages. However, since we are only proficient in Chinese, we used the Chinese version as an example for evaluation. The evaluation criteria and metrics are outlined below.
Evaluation Results and Analysis
Results
Evaluation of Catastici Description
We first calculated the inter-annotator agreement between the two annotators to ensure the quality and consistency of the annotations. One of the most commonly used methods for assessing agreement is Cohen's Kappa coefficient. However, in scenarios with extreme class imbalance (e.g., when all data points are assigned the same label, such as '1', which can occur with perfect results), Cohen's Kappa becomes unreliable or even misleading.
In contrast, percentage agreement is a simpler measure that directly calculates the proportion of matching labels between two raters without accounting for chance agreement. This makes it more suitable for extreme cases, as it is not affected by class imbalance or skewed label distributions. Therefore, we ultimately chose percentage agreement to evaluate the validity of the annotation results. Its formula is as follows:
Percentage Agreement = Number of consistent labels / Total number of labels
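The contrast between the two measures can be reproduced on a small example. The sketch below implements both from their standard definitions; the label lists are invented to mimic the near-perfect, highly imbalanced case described above.

```python
def percentage_agreement(labels_a, labels_b):
    """Share of items on which two raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa; returns None when chance agreement is 1 (kappa undefined)."""
    n = len(labels_a)
    p_o = percentage_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Expected agreement by chance, from each rater's label frequencies.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1 - p_e)


# Invented labels: rater A gives 1 everywhere, rater B disagrees once.
rater_a = [1] * 20
rater_b = [1] * 19 + [0]
agreement = percentage_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 19 of 20 items, yet kappa is 0 because the expected chance agreement is just as high. When both raters assign the same label to every item, kappa is undefined.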
Location | Features | Owners' Name | Owners' Title & Job | Tenant | Payment | |
---|---|---|---|---|---|---|
Prompt1-Accuracy | 0.95 | 1.0 | 0.95 | 0.9 | 1.0 | 0.95 |
Prompt2-Accuracy | 1.0 | 1.0 | 0.95 | 1.0 | 1.0 | 0.95 |
Prompt1-Conciseness | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 0.95 |
Prompt2-Conciseness | 1.0 | 1.0 | 0.7 | 1.0 | 1.0 | 0.95 |
Location | Function | Owner Name | Owner title & job | Tenant | Payment | |
---|---|---|---|---|---|---|
Prompt1-Accuracy | 0.675 | 1.0 | 0.875 | 0.55 | 1.0 | 0.975 |
Prompt2-Accuracy | 1.0 | 1.0 | 0.875 | 1.0 | 1.0 | 0.975 |
Prompt1-Concise | 1.0 | 1.0 | 1.0 | 0.2 | 1.0 | 0.275 |
Prompt2-Concise | 1.0 | 1.0 | 0.6 | 1.0 | 1.0 | 0.975 |
Evaluation of Sommarioni Description
Location | Function | Ownership | Owner Family | Other Notes | |
---|---|---|---|---|---|
Prompt1-Accuracy | 0.9375 | 0.8125 | 0.8125 | 0.9375 | 0.875 |
Prompt2-Accuracy | 0.95 | 1.0 | 0.85 | 0.9 | 0.95 |
Prompt1-Conciseness | 0.875 | 0.9375 | 0.8125 | 0.9375 | 0.875 |
Prompt2-Conciseness | 1.0 | 1.0 | 1.0 | 0.95 | 1.0 |
Location | Function | Ownership | Owner Family | Other Notes | |
---|---|---|---|---|---|
Prompt1-Accuracy | 0.96875 | 0.71875 | 0.90625 | 0.96875 | 0.75 |
Prompt2-Accuracy | 0.925 | 1.0 | 0.825 | 0.95 | 0.975 |
Prompt1-Concise | 0.75 | 0.96875 | 0.65625 | 0.09375 | 0.75 |
Prompt2-Concise | 0.85 | 1.0 | 1.0 | 0.975 | 1.0 |
Evaluation of Summary
ParcelInfo | OwnerInfo | RentalInfo | |
---|---|---|---|
Plausibility | 0.8125 | 0.75 | 0.9375 |
ParcelInfo | OwnerInfo | RentalInfo | |
---|---|---|---|
Plausibility | 0.91 | 0.75 | 0.97 |
Evaluation of Translation
Limitations and Future Work
- Prompt Template Refinement
As discussed in the previous section, we conducted only two rounds of refinement for our prompt templates. The accuracy and conciseness of the Catastici and Sommarioni paragraphs, as well as the plausibility of the summary paragraph, can still be improved. Additionally, all the descriptions are currently quite similar to each other. It would be beneficial to design different tones for the descriptions to make them more diverse and engaging.
- Model Improvement
Due to budget limitations, we only used GPT-4o-mini to generate our descriptions and translations. This model has some limitations, particularly in translation. Therefore, our next step is to involve other large language models, such as GPT-4 and T5, to address these issues.
Reference
GitHub Repositories
Credits
Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors: Alexander Rusnak, Tristan Karch, Tommy Bruzzese
Authors: Ruyin Feng, Zhichen Fang