Generation of Textual Description for Parcels
Introduction
The historical records of land management, cadastre, and taxation provide invaluable insights into the socio-economic and administrative evolution of regions over time. Two of the most significant resources for understanding such systems in Venice are the Catastici (1740) and the Sommarioni (1808). These two records offer different perspectives on Venetian land parcels, their ownership, and their taxation structures, reflecting the shifts in governance and societal organization over time.
The Catastici, compiled in 1740, serves as a snapshot of Venetian land management under the Venetian Republic. It meticulously documents the ownership, division, and use of land parcels during a period of relative stability. This record provides a foundation for understanding how land was distributed and administered before the onset of major political upheavals.
The Sommarioni, written in 1808, emerged during a period of significant political and social disruption following the fall of the Venetian Republic and under Napoleonic rule. This document captures a transformed landscape, reflecting the influence of changing administrative structures, evolving property ownership patterns, and new taxation policies. It reveals a dynamic reorganization of Venetian society and the economic pressures that shaped the era.
Historical transactional data is often difficult to interpret and lacks a clear representation of relationships between records. This project aims to transform structured data from historical archives related to cadastral records and leases into vivid and intuitive natural language descriptions. In addition to creating accurate and comprehensive descriptions for Venetian parcels from each period, we will compare and connect data from the two eras. Through this approach, the project seeks to provide new insights into the evolution of Venice’s administrative and socioeconomic framework.
Project Milestones and Pipeline
Week | Task | Status |
---|---|---|
07.10 - 13.10 | Define research questions; review relevant literature | Done |
14.10 - 20.10 | Perform initial data checking and cleaning; address dataset-related questions | Done |
21.10 - 27.10 | Autumn vacation | Done |
28.10 - 03.11 | Align Catastici and Sommarioni datasets; continue data cleaning | Done |
04.11 - 10.11 | Develop description templates and prompts; prepare for the midterm presentation | Done |
11.11 - 17.11 | Midterm presentation (14.11); refine the description template and prompts | Done |
18.11 - 24.11 | Translate Italian data into English | Done |
25.11 - 01.12 | Design an evaluation plan; evaluate the prompts | Done |
02.12 - 08.12 | Generate final results; evaluate the prompts and translation; begin writing the wikipage | Done |
09.12 - 15.12 | Write the wikipage; organize GitHub code; prepare for the final presentation | Done |
16.12 - 22.12 | Deliver GitHub repository and wikipage (18.12); final presentation (19.12) | |
Methodology
The Catastici and Sommarioni datasets were derived from historical records through OCR conversion, which introduced significant noise. To address this, we first cleaned the datasets to remove noisy data, extracted relevant information, and standardized the data into English (excluding names of people and places). Next, we selected and ordered the key content required for generating descriptions, and designed prompt templates to guide GPT during generation.
The generated descriptions were evaluated manually to ensure their accuracy, conciseness, and plausibility. Additionally, we translated the descriptions into other languages and evaluated their adaptability in multilingual contexts.
Data Processing
Catastici
1. owner first name
The owner_first_name field in the Catastici dataset contains various forms of representation. Most entries record the owner's first name directly, while unknown names are marked with a placeholder, "-". However, in addition to these, we identified 3,599 entries where the "first name" includes the placeholder "_". These placeholders do not represent the owner's real name but symbolize a certain familial identity associated with the owner. We collected all values representing family relationships and identities, identifying a total of 42 unique entries. Below are some examples.
Original Values | English Translation |
---|---|
_cugini | cousins |
_nepoti | grandchildren |
_herede | heir |
_nepote | grandchild |
_fratelli | brothers |
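Collecting these relationship labels can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the placeholder appears as a leading "_" (as in the examples above), and the function name `relationship_labels` is hypothetical.

```python
def relationship_labels(first_names):
    """Return the sorted unique owner_first_name values that start with '_'.

    Assumption: relationship labels are marked by a leading underscore,
    as in the examples listed in the table above.
    """
    return sorted(
        {name for name in first_names
         if isinstance(name, str) and name.startswith("_")}
    )


# Toy input: real names, the unknown-name marker "-", and relationship labels.
sample = ["Veneranda", "_cugini", "-", "_herede", "_cugini", None]
labels = relationship_labels(sample)
```

Non-string values (for example missing entries) are skipped, so the function can be applied directly to a raw column.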
We focused on investigating this subset and proposed three potential hypotheses:
A. The familial identity serves as a supplementary label for an owner with a recorded name, meaning the familial identity and the actual name refer to the same person.
B. The familial identity is directly related to another owner, indicating a specific interpersonal relationship.
C. The familial identity is unrelated to any other owner and represents a separate individual.
Upon further investigation, we rejected the first two hypotheses. The supporting examples are as follows.
In example A, the term figlioli translates to "children." It is implausible that Veneranda and figlioli refer to the same individual: the owner_first_name field lists figlioli (children), while the owner_title_std field includes titles such as nobil homo ser (noble gentleman) and nobil domina (noble lady). These titles indicate that the owner entry represents multiple individuals of different genders, so hypothesis A can be dismissed.
If hypothesis B were valid, the dataset would contain numerous contradictory relationships. For instance, Angela, Elena, and Cattarina in example B are traditional Italian female names, while _moglie means "wife." Same-sex marriage was not legally recognized in 1740, so the second owner could not be the wife of Angela, Elena, and Cattarina. Hypothesis B is therefore also rejected.
This leaves hypothesis C as the most plausible explanation: there is no direct relationship between the names and the familial roles appearing in the owner_first_name field.
2. rent
The rent field exhibited inconsistencies in currency units and included non-rental information. To resolve this, we standardized all entries to ducats under the gold standard. After this unification, we identified 15 non-monetary noise entries, which we manually reviewed and corrected.
Currency Conversion Rules |
---|
1 ducato = 1488 denari |
1 lira = 240 denari |
1 grosso = 62 denari |
1 soldo = 12 denari |
Original Values | English Translation |
---|---|
porzione di casa | Portion of a house |
casa in soler | House in Soler |
libertà di traghetto | Ferry liberty |
7 lire, 15 soldi, ogni tre mesi | 1860 denari for every three months |
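The conversion rules above can be applied mechanically. The sketch below is an illustration only: it sums every "number unit" pair found in a rent string into denari using the rates from the table. The singular/plural spellings in `UNIT_ALIASES`, the `denaro` entry, and the function name `rent_to_denari` are assumptions, not part of the original pipeline.

```python
import re

# Rates from the conversion table above; "denaro" (rate 1) is an added assumption.
DENARI_PER_UNIT = {"ducato": 1488, "lira": 240, "grosso": 62, "soldo": 12, "denaro": 1}

# Assumed singular/plural spellings mapped to a canonical unit name.
UNIT_ALIASES = {
    "ducato": "ducato", "ducati": "ducato",
    "lira": "lira", "lire": "lira",
    "grosso": "grosso", "grossi": "grosso",
    "soldo": "soldo", "soldi": "soldo",
    "denaro": "denaro", "denari": "denaro",
}

AMOUNT_RE = re.compile(r"(\d+)\s+([a-z]+)")


def rent_to_denari(text):
    """Return the total amount in denari, or None if no monetary amount is found."""
    total = 0
    found = False
    for number, unit in AMOUNT_RE.findall(text.lower()):
        canonical = UNIT_ALIASES.get(unit)
        if canonical is None:
            continue  # a number followed by a non-currency word
        total += int(number) * DENARI_PER_UNIT[canonical]
        found = True
    return total if found else None
```

For the last row of the table, 7 lire and 15 soldi give 7 × 240 + 15 × 12 = 1860 denari. Entries such as "porzione di casa" return `None`, which flags them for manual review. Payment periods ("ogni tre mesi") are ignored by this sketch and would need separate handling.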
3. quantity income & quality income
The fields quantity income and quality income capture non-monetary forms of rent payment or supplementary information about the lease. When non-monetary items are recorded as payment, quantity income specifies the amount, while quality income describes the items provided. Additionally, other rental-related information may be logged in either of these fields. Therefore, an essential step is to combine these two fields into a single representation to consolidate all non-monetary rental details.
quantity_income | quality_income | combination method | example |
---|---|---|---|
NaN | NaN | / | quantity_income: NaN quality_income: NaN |
NaN | 1 or more than 1 items | quality_income | quantity_income: NaN quality_income: per carità |
1 or more than 1 items | NaN | quantity_income | quantity_income: 12 lire in contanti quality_income: NaN |
1 item | 1 item | quantity_income + quality_income | quantity_income: 2 quality_income: legne di manzo |
more than 1 items | more than 1 items (but equal to quantity_income) | split and add them separately | quantity_income: 2 pera, 8 lire quality_income: caponi, regalia |
1 item | 2 items | split and use the quantity_income to add each quality_income item | quantity_income: 8 lire de piccoli quality_income: sapone, zuccaro |
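The combination rules in the table can be expressed as a small function. This is a sketch under stated assumptions: items are taken to be comma-separated, missing values are `None` or NaN, and the final fallback for unequal item counts not covered by the table is an invented choice. The names `split_items` and `combine_income` are hypothetical.

```python
def split_items(value):
    """Split a comma-separated field into items; None, NaN, or empty gives []."""
    if value is None or value != value:  # NaN is the only value not equal to itself
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def combine_income(quantity, quality):
    """Merge quantity_income and quality_income following the table above."""
    q_items = split_items(quantity)
    k_items = split_items(quality)
    if not q_items and not k_items:
        return None                      # both missing
    if not q_items:
        return ", ".join(k_items)        # only quality_income present
    if not k_items:
        return ", ".join(q_items)        # only quantity_income present
    if len(q_items) == len(k_items):
        # Same number of items: pair them one to one.
        return ", ".join(f"{q} {k}" for q, k in zip(q_items, k_items))
    if len(q_items) == 1:
        # One quantity shared by several quality items.
        return ", ".join(f"{q_items[0]} {k}" for k in k_items)
    # Case not covered by the table (assumption): keep both parts side by side.
    return f"{', '.join(q_items)}; {', '.join(k_items)}"
```

For example, `combine_income("2 pera, 8 lire", "caponi, regalia")` pairs the items positionally, while `combine_income("8 lire de piccoli", "sapone, zuccaro")` repeats the single quantity for each quality item.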
Sommarioni
1. Organize data by parcel
The Sommarioni dataset records the attribute information of different parcels, including their location, area, ownership, as well as details about the parcel's owner and the owner's family. However, some parcels also contain several subparcels with different parcel types. As a result, we group the information of the subparcels together to generate our description.
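Grouping subparcel rows under their parent parcel might look like the sketch below. The key name `parcel_number` and the `parcel_type` field are hypothetical stand-ins for whatever identifies a parcel and its type in the actual dataset.

```python
from collections import defaultdict


def group_subparcels(rows, key="parcel_number"):
    """Group row dictionaries by parcel, preserving the original row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def parcel_types(rows):
    """List the distinct parcel types of a group in first-seen order."""
    seen = []
    for row in rows:
        if row["parcel_type"] not in seen:
            seen.append(row["parcel_type"])
    return seen


# Toy rows with invented values, for illustration only.
rows = [
    {"parcel_number": 101, "parcel_type": "building"},
    {"parcel_number": 101, "parcel_type": "courtyard"},
    {"parcel_number": 102, "parcel_type": "building"},
]
groups = group_subparcels(rows)
```

Each group can then be passed as a whole to the description generator, so that one description covers all subparcels of a parcel.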
2. Clean owner name
The column 'owner_standardised' records the names of the owners. However, it also contains additional information about the owner, such as their identity. To avoid confusion, we have separated this information. For example, "CROTTA Lucrezia" is the widow of "CALVO" (where "Vedova" means widow in Italian). Therefore, we separated the two parts into the owner’s name and extra information. To maintain the coherence of the data, we will only translate the extra information in the next step.
owner_standardised |
---|
CROTTA Lucrezia (Vedova CALVO) |
owner name | extra info |
---|---|
CROTTA Lucrezia | Vedova CALVO |
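The split shown above can be sketched with a regular expression. This assumes the extra information is always a single trailing parenthesised group, as in the example; values that do not fit that shape are returned unchanged. The function name `split_owner` is hypothetical.

```python
import re

# Name, then optionally one trailing "(...)" group holding the extra information.
OWNER_RE = re.compile(r"^(?P<name>[^()]*?)\s*(?:\((?P<extra>[^()]*)\))?\s*$")


def split_owner(value):
    """Split an owner_standardised value into (owner name, extra info or None)."""
    match = OWNER_RE.match(value.strip())
    if match is None:
        # Unexpected shape (e.g. several parenthesised groups): leave untouched.
        return value.strip(), None
    extra = match.group("extra")
    return match.group("name").strip(), (extra.strip() if extra else None)
```

Only the second element of the returned pair would then be sent to translation, which keeps personal names in their original form.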
Translation
Because all values are in Italian, generating English descriptions directly from them is difficult. We therefore used the GPT-4o API to translate the following properties into English.
Translated Properties in the Sommarioni |
---|
extra information (separated above) |
place |
ownership_types |
qualities |
own_title |
own_other_notes |
Translated Properties in the Catastici |
---|
function |
owner_entity_group_std |
owner_mestiere_std |
owner_title_std |
owner identities in owner_first_name (with placeholder "_") |
quality_income, quantity_income (together) |
Geographical Connection
To analyze the changes in ancient Venice between 1740 and 1808, we link the two datasets based on their geographical coordinates. The Catastici dataset comprises point data, while the Sommarioni dataset consists of polygon data. The connection between the two is established by identifying which Catastici points fall within specific Sommarioni parcels. Since not all Sommarioni parcels have corresponding Catastici data, and not all Catastici data matches a Sommarioni parcel, the summary description generation focuses solely on the overlapping portions—specifically, cases where Catastici data is contained within Sommarioni parcels.
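The containment test behind this linkage can be illustrated with a plain ray-casting check. This is a dependency-free sketch, not the project's implementation (a spatial join in a GIS library would be the more usual choice): polygons are assumed to be simple rings of (x, y) vertices without holes, and the identifiers used are invented.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices without holes."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_points_to_parcels(points, parcels):
    """Map each Catastici point id to the first Sommarioni parcel containing it."""
    links = {}
    for point_id, coords in points.items():
        for parcel_id, polygon in parcels.items():
            if point_in_polygon(coords, polygon):
                links[point_id] = parcel_id
                break
    return links


# Toy geometry: one square parcel and two points, one inside and one outside.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
links = link_points_to_parcels({"c1": (1, 1), "c2": (5, 5)}, {"s1": square})
```

Points that fall in no parcel are simply absent from the result, which mirrors the restriction to overlapping cases described above.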
Content Selection and Ordering
Catastici
Taking data quality into account, we selected specific properties from the Catastici dataset and categorized them into six groups.
Sommarioni
Taking data quality into account, we selected specific properties from the Sommarioni dataset and categorized them into five groups.
Location | Features | Owner & Ownership | Owners' Family | Other Notes |
---|---|---|---|---|
district_acronym | parcel type | owner_standardised | own_father | own_other_notes |
place | qualities | ownership_types | own_father_is_q | parcel_ids |
area | | own_title | own_mother | |
| | | own_siblings | |
| | | own_husband | |
| | | own_husband_is_q | |
Template Design
Our project utilizes GPT-4o-mini to generate descriptions. To provide the language model with clear instructions and produce well-structured content, we divide the description into three paragraphs. The first paragraph introduces the information from Catastici (1740), the second paragraph details the data from Sommarioni (1808), and the third paragraph connects the two datasets, summarizing their relationship.
We aim for Paragraphs 1 and 2 to be accurate and concise, including all relevant information about the geometry parcels in the dataset while avoiding the inclusion of any fictional or irrelevant data. For Paragraph 3, our goal is to ensure plausibility by logically connecting the changes and similarities between the two descriptions, avoiding unreasonable or speculative elements, and adequately addressing implications or transitions inferred from the original data.
To ensure the language model effectively understands our instructions, we designed a specific prompt template for each paragraph. The final versions of our templates are shown below, and we discuss how the prompt templates were refined in the next section.
Prompt Evaluation
The evaluation process consists of five main steps. First, the reviewers carefully read the input content to thoroughly understand the context, key details, and specific requirements. Second, the reviewers score each description based on predefined criteria. Third, we assess the inter-annotator agreement among the reviewers to ensure consistency. Fourth, we analyze the evaluation results to identify patterns and insights. Finally, we refine and improve our prompt template based on these findings. Given that the requirements and objectives of each paragraph vary, we have developed distinct evaluation metrics tailored to each paragraph. The evaluation criteria of each paragraph are explained below.
Evaluation Criteria for Catastici and Sommarioni Paragraph
Since the first two paragraphs primarily focus on the information in the dataset, we prioritize accuracy and conciseness as the main evaluation indicators. The detailed criteria are explained below.
Evaluate Description for Accuracy
For each description, assess whether:
- All facts are correctly represented.
- The description aligns precisely with the metadata, without misrepresentation or omission of key details.
- There are no fabricated or incorrect elements.
Assign a score of:
- 0: If the description contains inaccuracies, misrepresentation, omissions, or fabricated elements.
- 1: If the description is fully accurate, complete, and faithfully represents the metadata.
Evaluate Description for Conciseness
For each description, assess whether:
- The description is free from redundant or repetitive content.
- The information is presented succinctly, without unnecessary elaboration or verbosity.
- The description focuses on delivering the key details without including irrelevant information.
Assign a score of:
- 0: If the description contains redundancy, repetition, or excessive elaboration.
- 1: If the description is concise, focused, and avoids unnecessary content.
To evaluate the generated descriptions more thoroughly and comprehensively, we divide the content based on the original data into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.
Criterion | Location | Features | Owners' Name | Owners' Title & Job | Tenant | Payment |
---|---|---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Criterion | Location | Features | Owner & Ownership | Owners' Family | Other Notes |
---|---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Evaluation Criteria for Summary Paragraph
The summary paragraph is designed to provide a cohesive overview, integrating the information from the preceding two paragraphs. Its purpose is to bridge the two datasets by logically connecting the observed changes and similarities in the descriptions, while analyzing the underlying reasons for these changes. To achieve this, we have selected plausibility as our evaluation criterion.
For each generated description, evaluate whether it:
- Logically connects the changes and similarities between the two descriptions.
- Avoids introducing unreasonable, exaggerated, or speculative elements.
- Maintains consistency with real-world knowledge and contextual logic.
- Adequately addresses any implications or transitions inferred from the original descriptions.
Assign a score of:
- 0: If the description contains illogical conclusions, exaggerated assumptions, or inconsistencies with the provided information or context.
- 1: If the description is logical, reasonable, and aligns well with the context and common sense.
Considering the informational overlap between the two datasets, we divide the summary content into different aspects and assess the accuracy and conciseness of each aspect. The evaluation metrics are shown below.
Criterion | Location | Features | Owner Information | Tenant |
---|---|---|---|---|
Accurate | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Concise | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Evaluation Criteria for Translation
Our project aims to provide descriptions in multiple languages. After generating the English descriptions, we translated them into various languages. However, since we are only proficient in Chinese, we used the Chinese version as an example for evaluation. The evaluation criteria and metrics are outlined below.
For each translated description, evaluate the following aspects:
- Accuracy: Ensure the translation faithfully conveys the meaning of the original description without omissions, distortions, or errors.
- Readability: Assess how natural and fluent the translation is, ensuring it adheres to Chinese linguistic norms and avoids awkward or unnatural phrasing.
- Clarity: Confirm that the translated description is easy to understand and does not cause confusion or ambiguity.
Assign a score of:
- 1 (Very Poor): The translation contains significant errors or is extremely difficult to understand. It fails to convey the original meaning or has substantial inaccuracies.
- 2 (Poor): The translation has noticeable inaccuracies or awkward phrasing that make it challenging to follow, though some meaning is conveyed.
- 3 (Average): The translation is accurate but includes some phrases that are unnatural or difficult to comprehend in Chinese. It generally conveys the intended meaning but lacks fluency.
- 4 (Good): The translation is mostly accurate, clear, and reasonably fluent, with only minor issues in adherence to Chinese language norms.
- 5 (Excellent): The translation is both accurate and highly fluent, seamlessly adhering to Chinese linguistic norms and providing an easy-to-understand and natural reading experience.
Test Inter-Annotator Agreement
We calculated the inter-annotator agreement between the two annotators to ensure the quality and consistency of the annotations. One of the most commonly used methods for assessing agreement is Cohen's Kappa coefficient. However, in scenarios with extreme class imbalance (e.g., when all data points are assigned the same label, such as '1', which can occur with perfect results), Cohen's Kappa becomes unreliable or even misleading.
In contrast, percentage agreement is a simpler measure that directly calculates the proportion of matching labels between two raters without accounting for chance agreement. This makes it more suitable for extreme cases, as it is not affected by class imbalance or skewed label distributions. Therefore, we ultimately chose percentage agreement to evaluate the validity of the annotation results.
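The difference between the two measures can be seen in a short sketch. Both functions below follow the standard definitions (observed agreement, and Cohen's kappa as (p_o - p_e) / (1 - p_e)); the label lists are invented for illustration and are not the project's annotations.

```python
from collections import Counter


def percentage_agreement(a, b):
    """Share of items on which two annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa, or None when chance agreement is 1 and kappa is undefined."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected chance agreement from each annotator's label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)


# Invented labels: both annotators give '1' to every description.
all_ones = [1] * 10
# Invented labels: nine '1's each, disagreeing on two descriptions.
rater_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
rater_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
```

With `all_ones` for both annotators, percentage agreement is 1.0 while kappa is undefined (division by zero). With `rater_a` and `rater_b`, the annotators agree on 8 of 10 items, yet kappa is negative because almost all agreement is attributed to chance.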
Evaluation Results and Analysis
Results
Evaluation of Catastici Description
- Meaningless Information
When certain properties in the data record are missing, the description often explicitly mentions the absence of these data. Given that many records contain missing information, this results in excessive and unnecessary details, making the description less concise and harder to follow.
- Information Misunderstanding
We used parcel_ids to indicate the number of other properties owned by the owner. However, the language model occasionally misinterprets it as the identifier for a specific parcel.
The evaluation scores and the inter-annotator agreement results are shown below.
Percentage Agreement = Number of consistent labels / Total number of labels
Metric | Location | Features | Owners' Name | Owners' Title & Job | Tenant | Payment |
---|---|---|---|---|---|---|
Prompt1-Accuracy | 0.95 | 1.0 | 0.95 | 0.9 | 1.0 | 0.95 |
Prompt2-Accuracy | 1.0 | 1.0 | 0.95 | 1.0 | 1.0 | 0.95 |
Prompt1-Conciseness | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 0.95 |
Prompt2-Conciseness | 1.0 | 1.0 | 0.7 | 1.0 | 1.0 | 0.95 |
Metric | Location | Function | Owner Name | Owner title & job | Tenant | Payment |
---|---|---|---|---|---|---|
Prompt1-Accuracy | 0.675 | 1.0 | 0.875 | 0.55 | 1.0 | 0.975 |
Prompt2-Accuracy | 1.0 | 1.0 | 0.875 | 1.0 | 1.0 | 0.975 |
Prompt1-Concise | 1.0 | 1.0 | 1.0 | 0.2 | 1.0 | 0.275 |
Prompt2-Concise | 1.0 | 1.0 | 0.6 | 1.0 | 1.0 | 0.975 |
Evaluation of Sommarioni Description
After our first attempt at description generation, we discovered several problems with the Sommarioni paragraph.
- Meaningless Information
When certain properties in the data record are missing, the description often explicitly mentions the absence of these data. Given that many records contain missing information, this results in excessive and unnecessary details, making the description less concise and harder to follow.
- Information Misunderstanding
We used parcel_ids to indicate the number of other properties owned by the owner. However, the language model occasionally misinterprets it as the identifier for a specific parcel.
As a result, we propose the following revised template. In our refined template, we emphasize that the generated description should not include missing data. Additionally, we have renamed parcel_ids to parcel_num.
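One complementary way to keep missing data out of the generated text is to remove empty fields before the record reaches the prompt. The sketch below is an assumption about how this could be done, not the authors' template: it drops `None`, NaN, and empty strings, and applies the parcel_ids to parcel_num rename. The function names and the sample values are hypothetical.

```python
import math


def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def prepare_record(record, renames=None):
    """Drop missing fields and rename keys before building the prompt."""
    renames = renames or {"parcel_ids": "parcel_num"}
    return {renames.get(key, key): value
            for key, value in record.items() if not is_missing(value)}


def render_fields(record):
    """Render the remaining fields as 'key: value' lines for the prompt body."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


# Invented record for illustration only.
raw = {"place": "Rialto", "own_father": None, "own_mother": float("nan"), "parcel_ids": 3}
clean = prepare_record(raw)
```

Filtering at this stage means the model never sees the empty fields, so the instruction not to mention missing data acts as a second safeguard rather than the only one.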
Metric | Location | Function | Ownership | Owner Family | Other Notes |
---|---|---|---|---|---|
Prompt1-Accuracy | 0.9375 | 0.8125 | 0.8125 | 0.9375 | 0.875 |
Prompt2-Accuracy | 0.95 | 1.0 | 0.85 | 0.9 | 0.95 |
Prompt1-Conciseness | 0.875 | 0.9375 | 0.8125 | 0.9375 | 0.875 |
Prompt2-Conciseness | 1.0 | 1.0 | 1.0 | 0.95 | 1.0 |
Metric | Location | Function | Ownership | Owner Family | Other Notes |
---|---|---|---|---|---|
Prompt1-Accuracy | 0.96875 | 0.71875 | 0.90625 | 0.96875 | 0.75 |
Prompt2-Accuracy | 0.925 | 1.0 | 0.825 | 0.95 | 0.975 |
Prompt1-Concise | 0.75 | 0.96875 | 0.65625 | 0.09375 | 0.75 |
Prompt2-Concise | 0.85 | 1.0 | 1.0 | 0.975 | 1.0 |
Evaluation of Summary
Metric | ParcelInfo | OwnerInfo | RentalInfo |
---|---|---|---|
Plausibility | 0.8125 | 0.75 | 0.9375 |
Metric | ParcelInfo | OwnerInfo | RentalInfo |
---|---|---|---|
Plausibility | 0.91 | 0.75 | 0.97 |
Evaluation of Translation
Limitations and Future Work
- Prompt Template Refinement
As discussed in the previous section, we conducted only two rounds of refinement for our prompt template. The accuracy and conciseness of the Catastici and Sommarioni paragraphs, as well as the plausibility of the summary paragraph, can still be improved. Additionally, all the descriptions are currently quite similar to each other. It would be beneficial to design different tones for the descriptions to make them more diverse and engaging.
- Model Improvement
Due to budget limitations, we only used GPT-4o-mini to generate our descriptions and translations. This model has some limitations, particularly in translation. Therefore, our next step is to involve other large language models, such as GPT-4 and T5, to address these issues.
Reference
GitHub Repositories
Credits
Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors: Alexander Rusnak, Tristan Karch, Tommy Bruzzese
Authors: Ruyin Feng, Zhichen Fang