Generation of Textual Description for Parcels

Revision as of 21:25, 18 December 2024

Introduction

The historical records of land management, cadastre, and taxation provide invaluable insights into the socio-economic and administrative evolution of regions over time. Among the most significant resources for understanding such systems in Venice are the Catastici (1740) and the Sommarioni (1808). These two records offer different perspectives on Venetian land parcels, their ownership, and their taxation structures, reflecting shifts in governance and societal organization over time.

The Catastici, compiled in 1740, serves as a snapshot of Venetian land management under the Venetian Republic. It meticulously documents the ownership, division, and use of land parcels during a period of relative stability. This record provides a foundation for understanding how land was distributed and administered before the onset of major political upheavals.

The Sommarioni, written in 1808, emerged during a period of significant political and social disruption following the fall of the Venetian Republic and under Napoleonic rule. This document captures a transformed landscape, reflecting the influence of changing administrative structures, evolving property ownership patterns, and new taxation policies. It reveals a dynamic reorganization of Venetian society and the economic pressures that shaped the era.

Historical transactional data is often difficult to interpret and lacks a clear representation of relationships between records. This project aims to transform structured data from historical archives related to cadastral records and leases into vivid and intuitive natural language descriptions. In addition to creating accurate and comprehensive descriptions for Venetian parcels from each period, we will compare and connect data from the two eras. Through this approach, the project seeks to provide new insights into the evolution of Venice’s administrative and socioeconomic framework.

Project Milestones and Pipeline

Workflow
{| class="wikitable"
|-
! Week !! Task !! Status
|-
| 07.10 - 13.10 || Define research questions; review relevant literature || Done
|-
| 14.10 - 20.10 || Perform initial data checking and cleaning; address dataset-related questions || Done
|-
| 21.10 - 27.10 || Autumn vacation || Done
|-
| 28.10 - 03.11 || Align Catastici and Sommarioni dataset; continue data cleaning || Done
|-
| 04.11 - 10.11 || Develop description templates and prompts; prepare for the midterm presentation || Done
|-
| 11.11 - 17.11 || Midterm presentation (14.11); refine the description template and prompts || Done
|-
| 18.11 - 24.11 || Translate Italian data into English || Done
|-
| 25.11 - 01.12 || Design an evaluation plan; evaluate the prompts || Done
|-
| 02.12 - 08.12 || Generate final results; evaluate the prompts and translation; begin writing the wikipage || Done
|-
| 09.12 - 15.12 || Write the wikipage; organize GitHub code; prepare for the final presentation || Done
|-
| 16.12 - 22.12 || Deliver GitHub repository and wikipage (18.12); final presentation (19.12) ||
|}
Pipeline

Methodology

The Catastici and Sommarioni datasets were derived from historical records through OCR conversion, which introduced significant noise. To address this, we first cleaned the datasets to remove noisy data, extracted relevant information, and standardized the data into English (excluding names of people and places). Next, we selected and ordered the key content required for generating descriptions, and designed prompt templates to guide GPT's generation.

The generated descriptions were evaluated manually to ensure their accuracy, conciseness, and plausibility. Additionally, we translated the descriptions into other languages and evaluated their adaptability in multilingual contexts.

Data Processing


Catastici

1. owner first name

The owner_first_name field in the Catastici dataset contains various forms of representation. Most entries record the owner's first name directly, while unknown names are marked with a placeholder, "-". However, in addition to these, we identified 3,599 entries where the "first name" includes the placeholder "_". These placeholders do not represent the owner's real name but symbolize a certain familial identity associated with the owner. We collected all values representing family relationships and identities, identifying a total of 42 unique entries. Below are some examples.

{| class="wikitable"
|+ Family Relationships and Identities
|-
! Original Values !! English Translation
|-
| _cugini || cousins
|-
| _nepoti || grandchildren
|-
| _herede || heir
|-
| _nepote || grandchild
|-
| _fratelli || brothers
|}
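
The three kinds of owner_first_name values described above (a real name, the "-" placeholder, and an identity marked with "_") can be separated with a small helper. This is only a sketch: it assumes the "_" marker always appears at the start of the value, as in the examples, and the function name split_first_name is hypothetical rather than taken from the project code.

```python
# Translations copied from the example table; the full dataset has 42 identity values.
IDENTITY_TRANSLATIONS = {
    "cugini": "cousins",
    "nepoti": "grandchildren",
    "herede": "heir",
    "nepote": "grandchild",
    "fratelli": "brothers",
}


def split_first_name(owner_first_name):
    """Return (first_name, identity) for one owner_first_name value.

    Assumption: identities start with "_" and unknown names are "-".
    """
    value = (owner_first_name or "").strip()
    if value in ("", "-"):
        # Unknown owner name: neither a name nor an identity.
        return None, None
    if value.startswith("_"):
        # Familial identity, e.g. "_cugini" -> "cugini".
        return None, value[1:]
    return value, None


def translate_identity(identity):
    """Look up the English gloss, falling back to the original Italian."""
    return IDENTITY_TRANSLATIONS.get(identity, identity)
```

For instance, split_first_name("_cugini") yields an identity of "cugini", which translate_identity maps to "cousins", while a regular name such as "Veneranda" is returned unchanged as the first name.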

We focused on investigating this subset and proposed three potential hypotheses:

A. The familial identity serves as a supplementary label for an owner with a recorded name, meaning the familial identity and the actual name refer to the same person.

B. The familial identity is directly related to another owner, indicating a specific interpersonal relationship.

C. The familial identity is unrelated to any other owner and represents a separate individual.

Upon further investigation, we ruled out the first two hypotheses. The supporting examples are as follows.

In example A, the term figlioli translates to "children." It is implausible that Veneranda and figlioli refer to the same individual: the owner_first_name field lists figlioli (children), while the owner_title_std field includes titles such as nobil homo ser (noble gentleman) and nobil domina (noble lady). These titles indicate that the entry covers multiple individuals of different genders, so hypothesis A can be dismissed.

If hypothesis B were valid, the dataset would contain numerous contradictory relationships. For instance, Angela, Elena, and Cattarina in example B are traditional Italian female names, while _moglie means "wife." Same-sex marriage was not legally recognized in 1740, so the second owner could not be the wife of Angela, Elena, and Cattarina. Hypothesis B is therefore also rejected.

This leaves hypothesis C as the most plausible explanation: there is no direct relationship between the names and the familial roles appearing in the owner_first_name field.

Family Relationships and Identities
Example A Example B
Cousins Cousins

2. rent

The rent field exhibited inconsistencies in currency units and included non-rental information. To resolve this, we converted all entries to a single unit, the ducat, using the conversion rules below. After this unification, we identified 15 entries containing non-monetary noise, which we reviewed and corrected manually.

Currency Conversion Rules
1 ducato = 1488 denari
1 lira = 240 denari
1 grosso = 62 denari
1 soldo = 12 denari
{| class="wikitable"
|+ Examples of Data Requiring Manual Processing
|-
! Original Values !! English Translation
|-
| porzione di casa || Portion of a house
|-
| casa in soler || House in Soler
|-
| libertà di traghetto || Ferry liberty
|-
| 7 lire, 15 soldi, ogni tre mesi || 1860 denari for every three months
|}
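
The conversion rules above can be applied mechanically to rent strings that contain amounts and units. The sketch below is illustrative only: the function names, the plural-form aliases, and the regular expression are assumptions, not the project's actual code. Strings without a recognizable amount return None so they can be routed to manual review.

```python
import re

# Conversion rules from the table above, expressed in denari.
DENARI_PER_UNIT = {"ducato": 1488, "lira": 240, "grosso": 62, "soldo": 12, "denaro": 1}

# Assumed plural spellings mapped to the singular unit names.
PLURALS = {"ducati": "ducato", "lire": "lira", "grossi": "grosso",
           "soldi": "soldo", "denari": "denaro"}


def rent_to_denari(text):
    """Sum all '<number> <unit>' pairs in a rent string, in denari.

    Returns None when no known currency amount is found.
    """
    total = 0
    found = False
    for amount, unit in re.findall(r"(\d+)\s*([a-z]+)", text.lower()):
        unit = PLURALS.get(unit, unit)
        if unit in DENARI_PER_UNIT:
            total += int(amount) * DENARI_PER_UNIT[unit]
            found = True
    return total if found else None


def denari_to_ducats(denari):
    """Express an amount of denari in ducats (1 ducato = 1488 denari)."""
    return denari / DENARI_PER_UNIT["ducato"]
```

With these rules, "7 lire, 15 soldi, ogni tre mesi" gives 7 × 240 + 15 × 12 = 1860 denari, matching the table, while "porzione di casa" gives None. The payment period ("ogni tre mesi") is not handled by this sketch and would still need manual treatment.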


3. quantity income & quality income

The fields quantity income and quality income capture non-monetary forms of rent payment or supplementary information about the lease. When non-monetary items are recorded as payment, quantity income specifies the amount, while quality income describes the items provided. Additionally, other rental-related information may be logged in either of these fields. Therefore, an essential step is to combine these two fields into a single representation to consolidate all non-monetary rental details.

{| class="wikitable"
|+ Combined Representation of Quantity and Quality Income
|-
! quantity_income !! quality_income !! combination method !! example
|-
| NaN || NaN || / || quantity_income: NaN; quality_income: NaN
|-
| NaN || 1 or more than 1 items || quality_income || quantity_income: NaN; quality_income: per carità
|-
| 1 or more than 1 items || NaN || quantity_income || quantity_income: 12 lire in contanti; quality_income: NaN
|-
| 1 item || 1 item || quantity_income + quality_income || quantity_income: 2; quality_income: legne di manzo
|-
| more than 1 items || more than 1 items (but equal to quantity_income) || split and add them separately || quantity_income: 2 pera, 8 lire; quality_income: caponi, regalia
|-
| 1 item || 2 items || split and use the quantity_income to add each quality_income item || quantity_income: 8 lire de piccoli; quality_income: sapone, zuccaro
|}
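
The combination rules in the table can be sketched as a single function. This is a minimal illustration under two assumptions that are not stated in the source: items within a field are separated by commas, and combinations not covered by the table are simply listed one after another. The names split_items and combine_income are hypothetical.

```python
def split_items(value):
    """Split a field into comma-separated items; missing values give []."""
    # NaN is the only float that is not equal to itself.
    if value is None or (isinstance(value, float) and value != value):
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def combine_income(quantity_income, quality_income):
    """Combine the two fields into a list of non-monetary rent items."""
    quantities = split_items(quantity_income)
    qualities = split_items(quality_income)
    if not quantities:
        # Covers both "NaN / NaN" and "only quality_income present".
        return qualities
    if not qualities:
        return quantities
    if len(quantities) == len(qualities):
        # Pair items position by position (includes the 1 item + 1 item case).
        return [f"{q} {k}" for q, k in zip(quantities, qualities)]
    if len(quantities) == 1:
        # One quantity applies to every quality item.
        return [f"{quantities[0]} {k}" for k in qualities]
    # Assumption: combinations outside the table are kept side by side.
    return quantities + qualities
```

For the table's examples, combine_income("2 pera, 8 lire", "caponi, regalia") pairs the items as "2 pera caponi" and "8 lire regalia", and combine_income("8 lire de piccoli", "sapone, zuccaro") attaches the single quantity to each of the two items.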

Sommarioni

1. Organize data by parcel

The Sommarioni dataset records the attribute information of different parcels, including their location, area, ownership, as well as details about the parcel's owner and the owner's family. However, some parcels also contain several subparcels with different parcel types. As a result, we group the information of the subparcels together to generate our description.
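
Grouping subparcels under their parent parcel could look like the sketch below. The field names parcel_number and parcel_type are placeholders chosen for illustration; the actual column names in the Sommarioni dataset may differ.

```python
from collections import defaultdict


def group_by_parcel(rows, key="parcel_number"):
    """Collect all subparcel rows that share the same parcel identifier.

    `rows` is a list of dicts; `key` is an assumed column name.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


# Hypothetical rows: parcel 101 has two subparcels of different types.
rows = [
    {"parcel_number": 101, "parcel_type": "house"},
    {"parcel_number": 101, "parcel_type": "courtyard"},
    {"parcel_number": 102, "parcel_type": "shop"},
]
parcels = group_by_parcel(rows)
```

Each entry of parcels then holds every subparcel of one parcel, so a single description can mention all of its types together.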

2. Clean owner name

The column 'owner_standardised' records the names of the owners. However, it also contains additional information about the owner, such as their identity. To avoid confusion, we have separated this information. For example, "CROTTA Lucrezia" is the widow of "CALVO" (where "Vedova" means widow in Italian). Therefore, we separated the two parts into the owner’s name and extra information. To maintain the coherence of the data, we will only translate the extra information in the next step.

{| class="wikitable"
|+ Original
|-
! owner_standardised
|-
| CROTTA Lucrezia (Vedova CALVO)
|}

{| class="wikitable"
|+ Separated
|-
! owner name !! extra info
|-
| CROTTA Lucrezia || Vedova CALVO
|}
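
The separation shown above can be sketched with a regular expression. The sketch assumes the extra information always appears in trailing parentheses, as in the example; other layouts in owner_standardised would need additional rules. The function name split_owner is hypothetical.

```python
import re

# Name first, then an optional "(extra info)" group at the end of the string.
OWNER_PATTERN = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<extra>.*)\))?\s*$")


def split_owner(owner_standardised):
    """Return (owner_name, extra_info); extra_info is None when absent."""
    text = owner_standardised.strip()
    match = OWNER_PATTERN.match(text)
    if match is None:
        # Unexpected layout: keep the whole string as the name.
        return text, None
    return match.group("name").strip(), match.group("extra")
```

Applied to "CROTTA Lucrezia (Vedova CALVO)", this returns the name "CROTTA Lucrezia" and the extra information "Vedova CALVO", which is the only part passed on to translation.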

Translation

Since all the values are in Italian, generating English descriptions directly from them is difficult. We therefore used the GPT-4o API to translate the following properties into English.

Translated Properties in the Sommarioni

  • extra information (separated above)
  • place
  • ownership_types
  • qualities
  • own_title
  • own_other_notes

Translated Properties in the Catastici

  • function
  • owner_entity_group_std
  • owner_mestiere_std
  • owner_title_std
  • owner identities in owner_first_name (with placeholder "_")
  • quality_income, quantity_income (together)

Geographical Connection

To analyze the changes in Venice between 1740 and 1808, we link the two datasets based on their geographical coordinates. The Catastici dataset comprises point data, while the Sommarioni dataset consists of polygon data. We establish the connection by identifying which Catastici points fall within which Sommarioni parcels. Since not all Sommarioni parcels have corresponding Catastici data, and not all Catastici points match a Sommarioni parcel, the summary description generation focuses solely on the overlapping portion, that is, cases where Catastici points are contained within Sommarioni parcels.
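
The point-in-polygon matching can be illustrated with a dependency-free ray-casting test. This is a sketch of the general technique, not the project's implementation, which may well rely on a GIS library's spatial join instead. The data layout (dicts of ids to coordinates) and the function names are assumptions, and polygons are treated as simple rings without holes.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` (x, y) inside `polygon` (list of vertices)?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_points_to_parcels(points, parcels):
    """Map each Catastici point id to the first Sommarioni parcel containing it."""
    links = {}
    for point_id, point in points.items():
        for parcel_id, polygon in parcels.items():
            if point_in_polygon(point, polygon):
                links[point_id] = parcel_id
                break
    return links
```

Points that fall in no parcel are simply absent from the result, which mirrors the restriction of the summary generation to the overlapping portion of the two datasets.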

Content Selection and Ordering


Catastici

Taking data quality into account, we selected specific properties from the Catastici dataset and categorized them into 6 groups.

{| class="wikitable"
|+ Selected Properties from Catastici
|-
! Location !! Function !! Owner & Ownership !! Owner Title & Job !! Tenant !! Payment
|-
| place || function || owner_first_name || owner_title_std || tenant_name || rent
|-
| sestiere || || owner_family_name || owner_mestiere_std || || quantity_income
|-
| || || owner_entity_group || || || quality_income
|-
| || || owner_identity || || ||
|}

Sommarioni

Taking data quality into account, we selected specific properties from the Sommarioni dataset and categorized them into 5 groups.

{| class="wikitable"
|+ Selected Properties from Sommarioni
|-
! Location !! Features !! Owner & Ownership !! Owners' Family !! Othernotes
|-
| district_acronym || parcel type || owner_standardised || own_father || own_other_notes
|-
| place || qualities || ownership_types || own_father_is_q || parcel_ids
|-
| area || || own_title || own_mother ||
|-
| || || || own_siblings ||
|-
| || || || own_husband ||
|-
| || || || own_husband_is_q ||
|}

Template Design


Our project leverages GPT-4o-mini to generate descriptions. To guide the language model effectively and produce well-structured content, we generate three types of descriptions. The first type introduces information from Catastici (1740), while the second type details data from Sommarioni (1808). These two descriptions focus on individual parcels of land, aiming to be both accurate and concise. They are designed to include all relevant information about the parcels in the dataset, while avoiding any fictional or irrelevant data.

Building on these two types of descriptions, we generate a third type—a summary description—which seeks to enrich the information by uncovering internal connections and deeper implications within the data. For example, the same parcel of land might belong to the same family during the Catastici and Sommarioni periods. By linking the descriptions from these two eras, we can uncover hidden narratives of familial inheritance. Thus, for the summary description, our primary goal is to ensure plausibility by logically connecting the changes and similarities between the two descriptions, avoiding unreasonable or speculative elements, and addressing implications or transitions inferred from the original data. This approach not only enables us to compare societal characteristics across different eras, such as changes in rent and social hierarchies, but also allows us to explore administrative boundaries and property inheritance over time.

To ensure the language model interprets our instructions effectively, we designed specific prompt templates for each paragraph. The finalized templates are shown below, and the next section will discuss the iterative process of refining these prompt templates.

Prompt Evaluation


The evaluation process consists of five main steps. First, the reviewers carefully read the input content to thoroughly understand the context, key details, and specific requirements. Second, the reviewers score each description based on predefined criteria. Third, we assess the inter-annotator agreement among the reviewers to ensure consistency. Fourth, we analyze the evaluation results to identify patterns and insights. Finally, we refine and improve our prompt template based on these findings. Given that the requirements and objectives of each paragraph vary, we have developed distinct evaluation metrics tailored to each paragraph. The evaluation criteria of each paragraph are explained below.

The Evaluation Steps

Evaluation Criteria for Catastici and Sommarioni Paragraph

Since the first two paragraphs primarily focus on the information in the dataset, we prioritize accuracy and conciseness as the main evaluation indicators. The detailed criteria are explained below.

Evaluate Description for Accuracy

For each description, assess whether:

  • All facts are correctly represented.
  • The description aligns precisely with the metadata, without misrepresentation or omission of key details.
  • There are no fabricated or incorrect elements.

Assign a score of:

  • 0: If the description contains inaccuracies, misrepresentation, omissions, or fabricated elements.
  • 1: If the description is fully accurate, complete, and faithfully represents the metadata.

Evaluate Description for Conciseness

For each description, assess whether:

  • The description is free from redundant or repetitive content.
  • The information is presented succinctly, without unnecessary elaboration or verbosity.
  • The description focuses on delivering the key details without including irrelevant information.

Assign a score of:

  • 0: If the description contains redundancy, repetition, or excessive elaboration.
  • 1: If the description is concise, focused, and avoids unnecessary content.

To evaluate the generated descriptions more thoroughly and comprehensively, we divide the content based on the original data into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.

{| class="wikitable"
|+ Evaluation Metrics for Catastici
|-
! !! Location !! Features !! Owners' Name !! Owners' Title & Job !! Tenant !! Payment
|-
| Accurate || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1
|-
| Concise || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1
|}

{| class="wikitable"
|+ Evaluation Metrics for Sommarioni
|-
! !! Location !! Features !! Owner & Ownership !! Owners' Family !! Othernotes
|-
| Accurate || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1
|-
| Concise || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1 || 0 or 1
|}

Evaluation Criteria for Summary Paragraph

Since the summary is a comprehensive description based on the Catastici and Sommarioni descriptions, it is challenging to evaluate each detailed property individually. Thus, we grouped all properties into three major categories. The first category includes information related to the parcel itself, such as geographic location, administrative region, function, and type. The second category pertains to the owner, encompassing details such as the owner's name, identity, and relationships. The third category covers rental-related information, including tenants and rental payments.

In evaluating the summaries, we focus more on whether the new descriptions logically and reasonably explain the changes, differences, and consistencies between the two time periods, while aligning with common sense and contextual logic. Therefore, we chose plausibility as the primary metric for this evaluation.

For each generated description, evaluate whether it:

  • Logically connects the changes and similarities between the two descriptions.
  • Avoids introducing unreasonable, exaggerated, or speculative elements.
  • Maintains consistency with real-world knowledge and contextual logic.
  • Adequately addresses any implications or transitions inferred from the original descriptions.

Assign a score of:

  • 0: If the description contains illogical conclusions, exaggerated assumptions, or inconsistencies with the provided information or context.
  • 1: If the description is logical, reasonable, and aligns well with the context and common sense.

The evaluation metrics are shown below.

{| class="wikitable"
|+ Evaluation Metrics for Summary
|-
! !! ParcelInfo !! OwnerInfo !! RentalInfo
|-
| Plausibility || 0 or 1 || 0 or 1 || 0 or 1
|}

Evaluation Criteria for Translation

Our project aims to provide descriptions in multiple languages. After generating the English descriptions, we translated them into various languages. However, since we are only proficient in Chinese, we used the Chinese version as an example for evaluation. The evaluation criteria and metrics are outlined below.

For each generated description, evaluate whether it:

  • Accuracy: Ensure the translation faithfully conveys the meaning of the original description without omissions, distortions, or errors.
  • Readability: Assess how natural and fluent the translation is, ensuring it adheres to Chinese linguistic norms and avoids awkward or unnatural phrasing.
  • Clarity: Confirm that the translated description is easy to understand and does not cause confusion or ambiguity.

Assign a score of:

  • 1 (Very Poor): The translation contains significant errors or is extremely difficult to understand. It fails to convey the original meaning or has substantial inaccuracies.
  • 2 (Poor): The translation has noticeable inaccuracies or awkward phrasing that make it challenging to follow, though some meaning is conveyed.
  • 3 (Average): The translation is accurate but includes some phrases that are unnatural or difficult to comprehend in Chinese. It generally conveys the intended meaning but lacks fluency.
  • 4 (Good): The translation is mostly accurate, clear, and reasonably fluent, with only minor issues in adherence to Chinese language norms.
  • 5 (Excellent): The translation is both accurate and highly fluent, seamlessly adhering to Chinese linguistic norms and providing an easy-to-understand and natural reading experience.


Test Inter-Annotator Agreement

We calculated the inter-annotator agreement between the two annotators to ensure the quality and consistency of the annotations. One of the most commonly used methods for assessing agreement is Cohen's Kappa coefficient. However, in scenarios with extreme class imbalance (e.g., when all data points are assigned the same label, such as '1', which can occur with perfect results), Cohen's Kappa becomes unreliable or even misleading.

In contrast, percentage agreement is a simpler measure that directly calculates the proportion of matching labels between two raters without accounting for chance agreement. This makes it more suitable for extreme cases, as it is not affected by class imbalance or skewed label distributions. Therefore, we ultimately chose percentage agreement to evaluate the validity of the annotation results.
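
The contrast between the two measures can be made concrete with a short sketch. The functions below follow the standard definitions of percentage agreement and Cohen's Kappa; the function names and the choice to return None for an undefined Kappa are assumptions made for illustration.

```python
def percentage_agreement(labels_a, labels_b):
    """Share of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa; returns None when chance agreement is 1 (undefined)."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Expected agreement if both annotators labelled independently.
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        return None  # division by zero: both annotators used a single label
    return (p_observed - p_expected) / (1 - p_expected)
```

When both annotators assign 1 to every item, percentage agreement is 1.0 but Kappa is undefined. When they disagree on a single item out of twenty, with one annotator using only the label 1, percentage agreement is 0.95 while Kappa drops to 0, which illustrates why Kappa is hard to interpret on such skewed labels.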

Evaluation and Final Result

Evaluation of Catastici Description


During the process of generating descriptions for Catastici and Sommarioni, we conducted two rounds of evaluation in total. The first evaluation revealed several shortcomings in the prompt design and data formatting. Based on these findings, we made targeted improvements and corrections, which resulted in significant enhancements during the second evaluation.

Misleading Features in Missing Data: After the first round of evaluation, we observed that large language models (LLMs) often fabricate information for missing features, despite explicit instructions in the prompt prohibiting such behavior. To address this, we chose to remove missing features entirely from the data provided to the LLM. Additionally, we further clarified in the prompt that generating fabricated information is strictly prohibited, reducing the likelihood of hallucinated content.

Information Misunderstanding: In the Catastici dataset, the owner's identity can take various forms, such as names, institutions, or familial relationships, which can overwhelm and confuse GPT when generating descriptions. To mitigate this, we enhanced the template by providing a clearer explanation of the data structure and offering more detailed guidance in the second round of evaluation.

Tense and Grammar Requirements: While GPT generally produces high-quality text, occasional errors in tense or noun plurality were observed. To improve accuracy, we updated the prompt to include explicit instructions for checking and ensuring correct tense and grammatical usage.
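
Removing missing features before building the prompt input, as described under "Misleading Features in Missing Data", could be sketched as follows. The set of values treated as missing (None, NaN, empty strings, and the "-" placeholder) and the JSON serialization are assumptions; the record shown is a hypothetical example, not real data.

```python
import json


def is_missing(value):
    """True for None, NaN, empty strings, and the '-' placeholder (assumed)."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN check
        return True
    return isinstance(value, str) and value.strip() in ("", "-")


def drop_missing(record):
    """Keep only the fields that carry information for the language model."""
    return {key: value for key, value in record.items() if not is_missing(value)}


def build_prompt_input(record):
    """Serialize the cleaned record for insertion into a prompt template."""
    return json.dumps(drop_missing(record), ensure_ascii=False)


# Hypothetical record with two missing fields.
record = {"place": "Rialto", "function": "casa", "tenant_name": None,
          "owner_first_name": "-", "rent": float("nan")}
cleaned = drop_missing(record)
```

Because the model never sees the empty fields, it has nothing to fill in or comment on, which complements the explicit instruction in the prompt.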

Catastici Original Template


Catastici Refined Template

The evaluation scores and the inter-annotator agreement are shown below. Prompt 1 was used in the initial evaluation round. In this round, the accuracy and conciseness scores for the owner's title and job, along with the accuracy of location and the conciseness of payment, were notably low. After refining both the dataset and the prompt (Prompt 2), the scores for these properties improved substantially.

Catastici Evaluation

Percentage Agreement = Number of consistent labels / Total number of labels

{| class="wikitable"
|+ Inter-Annotator Agreement for Catastici Description
|-
! !! Location !! Features !! Owners' Name !! Owners' Title & Job !! Tenant !! Payment
|-
| Prompt1-Accuracy || 0.95 || 1.0 || 0.95 || 0.9 || 1.0 || 0.95
|-
| Prompt2-Accuracy || 1.0 || 1.0 || 0.95 || 1.0 || 1.0 || 0.95
|-
| Prompt1-Conciseness || 1.0 || 1.0 || 1.0 || 0.9 || 1.0 || 0.95
|-
| Prompt2-Conciseness || 1.0 || 1.0 || 0.7 || 1.0 || 1.0 || 0.95
|}

{| class="wikitable"
|+ Average Evaluation Scores for Catastici Description
|-
! !! Location !! Function !! Owner Name !! Owner title & job !! Tenant !! Payment
|-
| Prompt1-Accuracy || 0.675 || 1.0 || 0.875 || 0.55 || 1.0 || 0.975
|-
| Prompt2-Accuracy || 1.0 || 1.0 || 0.875 || 1.0 || 1.0 || 0.975
|-
| Prompt1-Concise || 1.0 || 1.0 || 1.0 || 0.2 || 1.0 || 0.275
|-
| Prompt2-Concise || 1.0 || 1.0 || 0.6 || 1.0 || 1.0 || 0.975
|}

Evaluation of Sommarioni Description

After our first round of description generation and evaluation, we discovered several problems with the Sommarioni paragraph.

Meaningless Information: When certain properties in the data record are missing, the description often explicitly mentions the absence of these data. Given that many records contain missing information, this results in excessive and unnecessary details, making the description less concise and harder to follow. As a result, we emphasized that the generated description should not include missing data in the second prompt.

Information Misunderstanding: We used parcel_ids to indicate the number of other properties owned by the owner. However, the language model occasionally misinterpreted it as the identifier of a specific parcel. To avoid this, we renamed abstract property names to more descriptive ones; for example, parcel_ids became parcel_num.

Sommarioni Original Template

Sommarioni Refined Template

After evaluating the performance of Prompt 1 used in Round 1, we identified its primary weakness in terms of conciseness, particularly for the properties of owner family and ownership. Therefore, we focused on enhancing conciseness in Prompt 2.

Pipeline

{| class="wikitable"
|+ Inter-Annotator Agreement for Sommarioni Description
|-
! !! Location !! Function !! Ownership !! Owner Family !! Other Notes
|-
| Prompt1-Accuracy || 0.9375 || 0.8125 || 0.8125 || 0.9375 || 0.875
|-
| Prompt2-Accuracy || 0.95 || 1.0 || 0.85 || 0.9 || 0.95
|-
| Prompt1-Conciseness || 0.875 || 0.9375 || 0.8125 || 0.9375 || 0.875
|-
| Prompt2-Conciseness || 1.0 || 1.0 || 1.0 || 0.95 || 1.0
|}

{| class="wikitable"
|+ Average Evaluation Scores for Sommarioni Description
|-
! !! Location !! Function !! Ownership !! Owner Family !! Other Notes
|-
| Prompt1-Accuracy || 0.96875 || 0.71875 || 0.90625 || 0.96875 || 0.75
|-
| Prompt2-Accuracy || 0.925 || 1.0 || 0.825 || 0.95 || 0.975
|-
| Prompt1-Concise || 0.75 || 0.96875 || 0.65625 || 0.09375 || 0.75
|-
| Prompt2-Concise || 0.85 || 1.0 || 1.0 || 0.975 || 1.0
|}

Evaluation of Summary


As explained in the evaluation criteria for the summary paragraph, we grouped all properties into three categories (parcel, owner, and rental information) and scored each summary for plausibility within each category.

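{| class="wikitable"
|+ Inter-Annotator Agreement on Plausibility for Summary
|-
! !! ParcelInfo !! OwnerInfo !! RentalInfo
|-
| Plausibility || 0.8125 || 0.75 || 0.9375
|}

{| class="wikitable"
|+ Evaluation Results for Summary
|-
! !! ParcelInfo !! OwnerInfo !! RentalInfo
|-
| Plausibility || 0.91 || 0.75 || 0.97
|}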


Pipeline

Evaluation of Translation


Pipeline

Results


Limitations and Future Work

  • Prompt Template Refinement

As discussed in the previous section, we conducted only two rounds of refinement for our prompt templates. The accuracy and conciseness of the Catastici and Sommarioni paragraphs, as well as the plausibility of the summary paragraph, can still be improved. Additionally, all the descriptions are currently quite similar to each other. It would be beneficial to design different tones for the descriptions to make them more diverse and engaging.

  • Model Improvement

Due to budget limitations, we used only GPT-4o-mini to generate our descriptions and translations. This model has some limitations, particularly in translation. Therefore, our next step is to involve other large language models, such as GPT-4 and T5, to address these issues.

Reference

GitHub Repositories

GitHub Link

Credits

Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors: Alexander Rusnak, Tristan Karch, Tommy Bruzzese
Authors: Ruyin Feng, Zhichen Fang