Generation of Textual Description: Difference between revisions

From FDHwiki
Jump to navigation Jump to search
 
(71 intermediate revisions by 3 users not shown)
Line 1: Line 1:
=Introduction=
=Introduction=
Leveraging two invaluable historical datasets - the 1740 Catastici and the 1808 Sommarioni - our project aim to generate detailed textual descriptions of parcels in Venice. These datasets, rich in historical, spatial, and social information, provide a comprehensive view of land ownership, urban development, and social structure in Venice across two distinct periods. The 1740 Catastici offers early insights into parcel functions, rent payments, and tenant names, while the 1808 Sommarioni provides a more detailed and standardized survey, including owner data, and normalized ownership types and qualities. By integrating these datasets, we can create informative descriptions of each parcel, including their location, function, ownership details, and historical context, thereby enhancing our understanding of Venice's evolution and cultural significance.
Leveraging two invaluable historical datasets - the 1740 Catastici and the 1808 Sommarioni - our project aim to generate detailed textual descriptions of parcels in Venice. These datasets, rich in historical, spatial, and social information, provide a comprehensive view of land ownership, urban development, and social structure in Venice across two distinct periods. The 1740 Catastici offers early insights into parcel functions, rent payments, and tenant names, while the 1808 Sommarioni provides a more detailed and standardized survey, including owner data, and normalized ownership types and qualities. By integrating these datasets, we can create informative descriptions of each parcel, including their location, function, ownership details, and historical context, thereby enhancing our understanding of Venice's evolution and cultural significance.
[[File:data-points.png|thumb|upright=2.0| A visualization of the datapoints and an example of their parcel information on the Catastici 1740 Dataset. Source: [https://pov.up.railway.app/ Parcel of Venice]]]


=Motivation=
=Motivation=
Our motivation for utilizing the 1740 Catastici and 1808 Sommarioni datasets stems from the rich historical insights they offer, despite the challenges in comprehending and connecting the data. These datasets, like many other historical records, face significant challenges in terms of data consistency and coherence. The data, often unstructured and manually constructed, presents inconsistencies due to its varied and sometimes incomplete nature. Fields may be missing, mistranscribed, or contain varying levels of detail, making it difficult to connect the different pieces of information into a coherent narrative. Additionally, the use of old Italian and Venetian dialects can further complicate the interpretation and context of the data.
Given these challenges, especially for non-historians, our goal is to simplify and contextualize this complex information. By cleaning, organizing, and integrating the data from these datasets, we aim to create coherent and informative textual descriptions. Leveraging in-context learning and prompting with GPT-4, we seek to depict a clear description of the story behind each datapoint, making the historical narratives more accessible and understandable for a broader audience. This approach will not only facilitate a deeper understanding of Venice's history but also make the datasets more user-friendly and interpretable.


=Deliverables=
=Deliverables=
* Pre-processed, standardized dataset from the original source 
* A manually-crafted dictionary of old Venice house functions, matching terms used in the dataset with translations and detailed historical description 
* A text generation pipeline using in-context learning methods 
* A set of evaluation metrics to assess the text generation pipeline across different perspectives
=Methodology=
==Data Exploration==
===Data clusterings and patterns===
After a general review of the Catastici data, it was observed that each data point contains a series of empty fields. Many of these fields appeared to follow the same pattern of missing values. Since the generated text needs to handle various entries with different available data, the first step involved categorizing the data points based on their missing values and then addressing each category.


= Project Timeline & Milestones =
In the initial step, all possible patterns of missing values were extracted, and their frequency within the dataset was analyzed. Among the 36K data points, 28K samples were found to align with 8 major patterns. In the table below, the 8 frequent patterns are present in order of popularity with an 'X' indicating that the data is present in the given template.


{|class="wikitable"
{| class="wikitable"
! style="text-align:center;"|Timeframe
|+ Frequent Patterns in the Dataset
! Task
! pattern id !! owner entity !! owner entity group !! owner first name !! owner family name !! owner family group !! owner title !! an rendi !! ten name
! Completion
|-
|-
| align="center" |Week 4
| 2 || || || X || X || X || || X || X
|
* Exploring the dataset
* Exploring in-context learning models for text summarization
| align="center" |
|-
|-
| align="center" |Week 5
| 0 || || || X || X || X || X || X || X
|
* Identify patterns and edge cases from the dataset (e.g missing fields, "odd" values)
* Define different summarization formats accordingly to be used for in-context learning
* Explore the connection between the Catastici and Sommarioni dataset
| align="center" |
|-
|-
| align="center" |Week 6
| 12 || X || X || || || || || X || X
|
* Refine summarization formats
* Construct a pipeline connecting translation generation, summarization and validation
| align="center" |
|-
|-
| align="center" |Week 7
| 19 || || || X || X || X || || X || X
|
* Evaluate summarization results
| align="center" |  
|-
|-
| align="center" |Week 8
| 8 || || || X || X || X || X || X || X
|
* Prepare for mid-term presentation
| align="center" |  
|-
|-
| align="center" |Week 9
| 1 || X || X || || || || X || X || X
|
* Explore father-son relationship among Catastici and Sommarioni dataset
| align="center" |  
|-
|-
| align="center" |Week 10
| 23 || || || X || X || X || || ||  
|
* Standardization of monthly rent column
| align="center" |  
|-
| align="center" |Week 11
|
* Verified and refined standardized rent values
|-
|-
| 5 || || || || X || X || X || X || X
|}
The main text generation task was then divided into two subtasks. The first subtask addressed the most frequent patterns, where a suitable example was crafted for each pattern to be used as context for the large language model. The second subtask focused on optimizing the context and prompt to ensure high-quality descriptions could also be generated for the less frequent patterns in the dataset.
==== Frequent Patterns Categorization ====
In tackling the first subtask, the 8 major patterns were categorized based on the following features:
* Whether the property is owned by a person or a different entity
* Whether the owner's title is present or not
* Whether the tenant’s name is present or not
For each combination of these features, a sample paragraph was crafted for a corresponding sample data point, providing a precise explanation of the parcel. Below are two examples crafted for pattern id 0 and 1 respectively:
'''Pattern #0 Sample:'''
<pre>
The property with ID 1 is located in Campo vicino alla Chiesa and serves as a casa e bottega da barbier (house and workshop for a barber). Owned by Liberal Campi, who holds the title of Secondo Prete (Second Priest), the property is tenanted by Francesco Zeni. In 1700s-1800s Venice, such properties were often dual-purpose, combining living spaces with workplaces. Barbers at the time may have also performed minor medical tasks in addition to their grooming services.
</pre>
'''Pattern #1 Sample:'''
<pre>
The property with ID 3 is located in Campo vicino alla Chiesa and serves as a bottega da strazariol (a workshop for a dealer or repairer of old clothes). The owner is listed as Pievan di San Cancian (the parish priest of San Cancian), with the title of Pievano. The tenant is Bortolamio Piazza. In the 1700s and 1800s, such workshops were associated with tradespeople engaged in repairing or selling second-hand clothing.
</pre>
When a given data point matched one of the frequent patterns, the pre-crafted sample data and text were added to the prompt, enabling the language model to generate descriptions for new data points. The outcomes consistently resulted in well-written and reliable descriptions. To provide the sample data point to the language model, the following structure is used in the prompt:
<pre>
... As an example of descriptive text I generated: {incontext_learning_text}; for sample data: {incontext_learning_data}.
</pre>
==== Addressing Less Frequent Patterns ====
Addressing the second subtask, which involved the less frequent patterns, the dataset revealed over 160 different patterns, each representing fewer than 100 samples. Many of these patterns were present in fewer than five samples. Due to the variety, it was impractical to craft a tailored text for each individual pattern. Instead, an approach was devised where examples of two frequent patterns, along with their desired manually crafted paragraphs, were provided to the language model. The prompt was enhanced to acknowledge the possibility of missing values and instructed the model to adjust the paragraph slightly if necessary. Additionally, the prompt included a request for the model to highlight missing values, as this information could be useful to the end-user.
==Data Enrichment==
=== Functionality of the parcels and historical translations ===
In working towards the goal of creating descriptive text for each data point, it became clear that simply listing the names of owners and the functions of parcels was just a straightforward conversion of tabular data into text. While this approach might be informative, it adds little value for the end-user. On the other hand, including some historical context about how the parcel's function fit into 1700s-1800s Venetian society could make the descriptions far more engaging and useful.
The first step in adding historical context was to ask the language model to use the functionality data to provide insights grounded in the time and place where these functions existed. While the language model (specifically GPT-4o) could sometimes deliver helpful information, it often struggled. The model seemed to lack sufficient training on the historical and linguistic nuances needed for this project.
After a few trials, it became clear that the main issue was the model's limited understanding of 18th-century Italian and how it differs from modern Italian. Once the correct meanings of the words in the functionality field were provided, the model was able to produce much more accurate and relevant descriptions.
As mentioned above, when dealing with the historical context and asking the GPT-4o model to provide explanations based on historical facts and information, it was observed that the GPT-4o model lacks an understanding of certain words in 18th-century Italian. Some words and phrases related to the functionality of parcels are not accurately translated by the GPT-4o model, resulting in generated text that is sometimes inaccurate or, in certain cases, misleading.
Take the following sample data entry as an example:
* '''General Information'''
** '''ID''': 238
** '''Place''': Campiel del remer
* '''Ownership Details'''
** '''Owner First Name''': Gerolamo
** '''Owner Family Name''': GRADENIGO
** '''Owner Title''': NOBIL HOMO
* '''Functionality'''
** '''Catastici Function''': bastion inviamento (bastion with trade designation)
** '''Standardized Functions''': bastione, inviamento
** '''Sommarioni Functions''': BOTTEGA, CASA
Using the finalized prompt and without manually providing the translations, the model generated the following text regarding the context of the parcel's functionality:
<pre>
In the 1700s-1800s Venice, the term "bastion inviamento" suggests a space used for fortification activities and dispatch operations.
</pre>


| align="center" |Week 12
However, based on the available sources, it is known that `bastione` refers to ''"a large tavern or commercial space, often used for selling wine,"'' which is quite different from what the language model might infer from modern Italian. To address this discrepancy, a dictionary titled ''Dizionario del dialetto veneziano'' was used as the primary reference, supplemented by two additional resources: a website called [https://historywalksvenice.com/ History Walks Venice] and an index of Venetian toponymy titled ''Toponomastica Veneziana''.
|
* Added district mean rent column and integrated additional information into the data
* Refine text prompts to incorporate new information
* Implement evaluation methods for the final generated text
| align="center" |
|-


| align="center" |Week 13
For each standardized function used in the dataset, these three resources were combined with the team members' expertise in Italian to determine accurate translations. The translations derived from these efforts were then compiled into a concise dictionary to ensure the language model generated historically accurate and contextually appropriate descriptions.
|
* Refine evaluation methods
* Split evaluation tasks among members
| align="center" |
|-


| align="center" |Week 14
This dictionary is provided to the GPT-4 model to ensure it generates factually and historically accurate explanations about the functionality of parcels. In the system prompt, only the necessary definitions are included to avoid redundant context and minimize resource usage.
|
* Verify and combine evaluation tasks
* Finalize Wikipage and prepare for presentation
| align="center" |
|-
|}


=Methodology=
With the updated translation provided to the language model, a description for the above example was generated as follows:
==Generating Summarization Formats for In-context Learning==


<pre>
In the context of the 1700s-1800s Venice, a "bastion inviamento" may imply a versatile commercial space potentially used for selling wine by the glass and/or was designated for a specific trade, granted a particular right or possibly a license related to lucrative professions like baking or wine selling.
</pre>


==Functionality of Parcels==
Using the compiled dictionary, the model accurately describes the functionality of the parcel within the historical context of 1700-1800s Venice.
blablabla


==Standardization of Monthly Rent==
===Standardization of Monthly Rent===
===Exploring Value Formats===
====Exploring Value Formats====
Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price in relation to its neighborhood, which may reflect its relative status at that time. Among 35,946 data rows, we identified that only 28,610 (~80%) contained numeric values, with the remaining entries consisting of null values or text data.
Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price in relation to its neighborhood, which may reflect its relative status at that time. Among 35,946 data rows, we identified that only 28,610 (~80%) contained numeric values, with the remaining entries consisting of null values or text data.


Line 135: Line 168:
|}
|}


===Format Selection and Hypothesis===
====Format Selection and Hypothesis====
From the identified text formats, we decided to exclude entries related to function, ownership, or other non-monetary details due to uncertainty about their origins. These deviations may stem from transcription errors or the inclusion of special characters.
From the identified text formats, we decided to exclude entries related to function, ownership, or other non-monetary details due to uncertainty about their origins. These deviations may stem from transcription errors or the inclusion of special characters.


Line 196: Line 229:
|}
|}


===Adding Mean District Rent Value for Parcels===
====Adding Mean District Rent Value for Parcels====
With the standardized rent values for each parcel, we extended our analysis to compare individual parcels against the mean rent value of their respective districts (sestiere).
With the standardized rent values for each parcel, we extended our analysis to compare individual parcels against the mean rent value of their respective districts (sestiere).


Line 209: Line 242:
From this visualization, we can derive interesting insights about neighborhoods. For example, San Marco (SM) has the highest mean rent, likely due to its central location within the city. This information enables further enrichment for text generation, not only within individual districts but also for comparisons across districts, offering a richer understanding of the socio-economic dynamics of the city.
From this visualization, we can derive interesting insights about neighborhoods. For example, San Marco (SM) has the highest mean rent, likely due to its central location within the city. This information enables further enrichment for text generation, not only within individual districts but also for comparisons across districts, offering a richer understanding of the socio-economic dynamics of the city.


=Results=
==Prompt Engineering==
 
The prompt used to generate the description for each given input data is as follows. The steps involved in crafting this prompt and the rationale behind including each part are detailed below:
 
<pre>
Generate a concise and factual descriptive text for the given input data about historical Venetian properties. The text should include:
 
1. Property Summary: Start with the property's ID, location, district name, and function. Use the Italian terms (e.g., casa, bottega) with an English explanation in parentheses.
 
2. Ownership Details: Mention the owner. It can be either an entity or a person defined by first name, family name, and possibly title. If it is not an entity and a first name, family name, or title is missing, clearly state "no [specific detail] available" without overexplaining. Include the tenant's name (based on 1740 data) and additional details, if provided.
 
3. Historical Context: Offer a brief explanation of the property's function, referencing its role in 1700s-1800s Venice, phrased with appropriate caution (e.g., "may suggest," "indicates," "was often associated with") rather than definitive language. Only use the information provided by this dictionary for historical context: {translations_dict}
 
Avoid speculative storytelling or unnecessary elaboration or phrases like "reflecting a common arrangement".
Avoid extrapolating beyond the data provided unless well-established historical knowledge applies directly.
Avoid describing the Venetian society when providing Historical Context about the functionality. Focus on the function rather than society.
 
The field "som_function" (always refer to it as function in Sommarioni Data), if available, indicates the function in 1808 based on Sommarioni Data rather than field "function" which indicates the function in 1740. If the "som_function" is available, compare it to "function" and explain the difference.
 
Provide insight on rent amount (std_rent - unit is 'Soldi') and rent relative to the district (sestiere_rent_mean) if data is available.
 
Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."
 
As an example of descriptive text I generated: <{incontext_learning_text}> for sample data: <{incontext_learning_data}>.
 
Generate for input data: <{input_sample_data}>.
</pre>
 
=== Ask for Basic Information ===
 
To establish a foundation for generating the descriptions, the prompt begins by clearly stating the task: to create a descriptive text. Early in the prompt, it emphasizes that the text must be concise and factual. While further details on what is meant by ''concise'' and ''factual'' are provided later, these requirements are highlighted from the outset.
 
Additionally, the language model is given the context of the data, which focuses on historical Venetian properties, ensuring that the descriptions align with the subject matter.
 
<pre>
Generate a concise and factual descriptive text for the given input data about historical Venetian properties.
</pre>
 
The prompt then outlines the basic information to be included in the generated description. It starts with the parcel ID, which serves as an index, followed by the parcel's location, district name, and its function. For the function, an English translation is provided to give the user a clearer understanding of the parcel. The detailed process of creating these translations is described in the '''Data Enrichment''' section.
 
<pre>
The text should include:
1. Property Summary: Start with the property's ID, location, district name, and function. Use the Italian terms (e.g., casa, bottega) with an English explanation in parentheses.
</pre>
 
The second set of basic information to include in the descriptive text is ownership details, encompassing both the owner and tenant information. As explained in the '''Clustering''' section, ownership can be categorized into two types: an entity or a person. This distinction is provided to the language model, which is also instructed to note missing data where applicable.
 
It was clarified that if the owner is an entity, the absence of a first name, last name, or title should not be considered missing data. In contrast, for private owners, any missing details in these fields should be explicitly pointed out in the description.
 
In rare cases, the language model has referred to the tenant's name as the ''current tenant.'' To address this, it is explicitly stated that the tenancy information is based on the 1740 data.
 
<pre>
2. Ownership Details: Mention the owner. It can be either an entity or a person defined by first name, family name, and possibly title. If it is not an entity and a first name, family name, or title is missing, clearly state "no [specific detail] available" without overexplaining.
Include the tenant's name (based on 1740 data) and additional details, if provided.
</pre>
 
=== Use of Translations ===
 
As outlined in the '''Data Enrichment''' section, one key enhancement added to the descriptions is the historical context of the parcel's functionality. In many cases, the GPT-4o model struggled to accurately translate the parcel's functionality into English, especially when considering the historical time and place. To address this, a dictionary was compiled using various resources alongside the team’s knowledge of the Italian language. This dictionary ensures accurate context is provided to the language model, enabling it to generate detailed and relevant descriptions of the parcels' functionality.
 
For each parcel, the necessary translations from the crafted dictionary are included in the prompt to ensure the descriptions are historically accurate and contextually relevant.
 
<pre>
3. Historical Context: Offer a brief explanation of the property's function, referencing its role in 1700s-1800s Venice, phrased with appropriate caution (e.g., "may suggest," "indicates," "was often associated with") rather than definitive language.
Only use the information provided by this dictionary for historical context: {translations_dict}
</pre>
 
=== Ensuring Historical Accuracy Over Narrative Storytelling ===
 
Evaluating the descriptions generated by the GPT-4o model, it was discovered that the language model occasionally shifts toward storytelling rather than adhering strictly to historical facts. While the story provided by the model might be correct, redundant information without concrete factual evidence is avoided.
 
For example:
 
* '''General Information'''
** '''ID''': 3
** '''Place''': Campo vicino alla Chiesa
 
* '''Ownership Details'''
** '''Owner Title''': PIEVANO
 
* '''Functionality'''
** '''Catastici Function''': bottega da strazariol
 
Generated description:
 
<pre>
... Workshops like this were typically associated with tradespeople specializing in the repair or resale of second-hand clothing, which was a common trade in Venice during the 1700s-1800s.
</pre>
 
Although this may be plausible, it cannot always be verified and is considered out of scope. The model is instructed to avoid such speculative storytelling and focus on providing only data-supported insights.
 
To enforce this, the prompt includes guidelines:
 
<pre>
Avoid speculative storytelling or unnecessary elaboration or phrases like "reflecting a common arrangement."
Avoid extrapolating beyond the data provided unless well-established historical knowledge applies directly.
</pre>
 
=== Use of Sommarioni Data ===
 
To further enrich the generated text, ''Sommarioni'' data has also been incorporated. This allows comparisons of functionality between 1740 and 1808, providing insight into changes or consistencies over time.
 
<pre>
The field "som_function" (always refer to it as function in Sommarioni Data), if available, indicates the function in 1808 based on Sommarioni Data rather than field "function," which indicates the function in 1740. If the "som_function" is available, compare it to "function" and explain the difference.
</pre>
 
=== Use of Standard Rent and District Rent ===
 
As the final piece of information, the prompt highlights the calculated standard rent in units of "Soldi" and compares it with the average rent cost in the district. This ensures clarity on the economic context.
 
<pre>
Provide insight on rent amount (std_rent - unit is 'Soldi') and rent relative to the district (sestiere_rent_mean) if data is available.
</pre>
 
Example:
 
<pre>
The rent for this property is 744 Soldi, which is below the district (sestiere) rent mean of 2836 Soldi, suggesting its rental cost was relatively low in comparison.
</pre>
 
=== Prompt Ending and In-Context Learning ===
 
The system prompt concludes with an emphasis on adhering to the provided guidelines, focusing on historical accuracy and avoiding an overly confident tone. The final section provides a sample data point and its manually crafted description for in-context learning, helping the model align with the required format.
 
<pre>
Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."
As an example of descriptive text I generated: <{incontext_learning_text}> for sample data: <{incontext_learning_data}>.
</pre>
 
=== User Prompt ===
 
All the above-mentioned prompts are components of the system prompt. Once the model is aligned, the input can be provided, and the model is instructed to generate a description.
Having generated the description, the model is then asked to provide an Italian version of the text. It was necessary to clarify that parts already translated from Italian to English should not be retranslated into Italian in the final output.
 
<pre>
Generate for input data: <{input_sample_data}>
Then, provide an Italian translation of the text. Do not include in the translation the parts that are already translated from Italian to English.
</pre>
 
=== Handling Less Frequent Patterns ===
 
As previously mentioned, not all data entries follow the same pattern of data availability. As described in the '''Clustering''' section, it was observed that while 8 major patterns could represent most data points, there are numerous other patterns that represent smaller sets of data points, sometimes consisting of only a single entry.
 
For these data points, a slightly different approach was taken in prompt engineering. For the frequent patterns, the pre-crafted sample corresponding to the specific pattern of the input data was provided to the language model. In contrast, for less frequent patterns, no pre-crafted sample was available for each pattern. Instead, two samples from the frequent patterns were used (one representing a parcel owned by an entity and one by a person). The model was explicitly instructed once more to be cautious about missing data and to highlight it when applicable.
 
The prompt ending, including the in-context learning section, for data points from less frequent patterns is as follows:
 
<pre>
If there is missing or misformatted data, identify and address it. For instance, if a first name or place is missing, clearly indicate that the data is absent.
Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."
 
As an example of descriptive text, I generated: {incontext_learning_text_1} for sample data: {incontext_learning_data_1}. And: {incontext_learning_text_2} for sample data: {incontext_learning_data_2}"
</pre>
 
==Evaluation Metrics==
=== Generated Text in English ===
To evaluate the generated textual descriptions, we define three main metrics:
* '''Completeness''': Assessing whether all necessary fields (properties) are included, ensuring they are non-empty and relevant.
* '''Accuracy''': Verifying the correctness of the information in the included fields.
* '''Creativity''': Evaluating the variation in structure across generated texts for different samples
 
The metrics were evaluated using both manual and automatic methods:
* '''Completeness''':
**'''Manual Evaluation''': A scoring system of 0, 1, or 2 was applied to 100 samples by the research team. A score of 0 indicated missing properties, 1 indicated all required properties were included, and 2 indicated that additional irrelevant properties were added.
**'''Automatic Evaluation''': A "coverage score" was computed for 481 samples. This score, ranging from 0 to 1, reflects the ratio of properties included and the accuracy of the information presented.
 
* '''Accuracy''':
**'''Manual Evaluation''': Descriptions were scored as 0 or 1 based on the correctness of the included information. Any instance of incorrect data resulted in a score of 0. This evaluation was conducted on the same 100 samples used for manual completeness scoring.
**'''Automatic Evaluation''': The "coverage score" used for completeness also accounted for accuracy by verifying both the inclusion and correctness of the properties.
 
* '''Creativity''': Creativity was assessed using large language models, specifically ChatGPT 4o, to analyze the structural and stylistic variation in 200 generated texts. Clustering techniques were employed, and the results of these clusters were manually reviewed to verify the validity of the variation.
 
=== Generated Text in Italian ===
 
To test the multilingual capabilities of the generated textual descriptions, we translated the already-checked factual samples into Italian and had an expert fluent in Italian review them. The expert evaluation focused on:
 
*'''Soundness of Translation''': Ensuring the translations were linguistically accurate and preserved the meaning of the original descriptions.
 
*'''Clarity and Coherence''': Verifying that the descriptions were natural and contextually appropriate in Italian.
 
=Evaluation Results=
 
This is the final version of a generated text after refinement of prompts:
<pre>
The property with ID 220 is located in Fondamenta del Forner within the district of San Polo. Its function is listed as a stazio da erbarol (station of a herbalist). The owner is noted as Provveditori di Commun, with no available personal name or family name. The tenant in 1740 is identified as Antonio Scarpa. Historically, such stations were associated with places for stopping and conducting specific activities, including the trade of goods like herbs. The property's rent, set at 1736 Soldi, is below the average rent of 4124 Soldi for the San Polo district.
 
La proprietà con ID 220 si trova in Fondamenta del Forner nel distretto di San Polo. La sua funzione è indicata come stazio da erbarol. Il proprietario è registrato come Provveditori di Commun, senza nome o cognome personale disponibili. L'inquilino nel 1740 è identificato come Antonio Scarpa. Storicamente, tali stazioni erano associate a luoghi di sosta e conduzione di attività specifiche, incluso il commercio di beni. L'affitto della proprietà, fissato a 1736 Soldi, è inferiore alla media dell'affitto di 4124 Soldi per il distretto di San Polo.
</pre>
 
Next, we present the results of our generated textual descriptions, evaluated using the three proposed metrics outlined earlier. 
 
==Completeness and Accuracy==
Using the automatic evaluation method, approximately 73% of the generated texts achieved a coverage score above 80%. This result indicates that the majority of descriptions included most of the required properties and accurate information. However, this left 27% of the samples below the threshold, prompting further manual investigation.
 
One significant limitation of the automatic approach lies in its reliance on keyword matching. This approach, while efficient for large amounts of samples, fails to account for some cases. Consequently, it can only be used as a preliminary metric to identify patterns in the data and flag samples that require closer manual inspection. An example illustrating this limitation is the sample below; the translation of ''fratelli'' to ''brothers'' was linguistically correct, yet the automatic system did not recognize it as a match, leading to an inaccurately reduced score for the sample.
 
<pre>
 
input_data:  {...'owner_first_name': 'Paolo | _fratelli', 'owner_family_name': 'BEMBO | BEMBO'...}
generated_text:  ...It is owned by Paolo and his brothers, identified collectively with the family name Bembo...
 
</pre>
 
To address the gaps in the automatic evaluation, we conducted a manual review. This process involved all team members independently rating 100 randomly selected samples for completeness and accuracy. The manual evaluation revealed that 100% of the samples were rated 1 (fully complete) by all team members. This confirms that all necessary properties were included without omissions. The manual ratings also showed that 100% of the samples were accurate, meaning no errors or incorrect information.
 
==Creativity==
Creativity was assessed by evaluating structural variation in the generated texts. To do this, we leveraged ChatGPT 4o's capabilities to cluster samples based on their structural and stylistic differences, with a focus on paraphrasing and the order in which the properties of the entities are mentioned. Initial attempts involved clustering the full dataset of 481 samples. ChatGPT 4o, however, produced an overwhelming number of clusters, indicating challenges in distinguishing subtle variations in sentence structure. Recognizing this limitation, we reduced the dataset to 200 samples to refine the clustering process. We explicitly instructed ChatGPT 4o to focus on structural variations and avoid over-grouping by the ''function'' field. Despite these refinements, ChatGPT 4o continued to group samples according to it.
 
The clustering difficulties are explainable and align with our expectations. Samples that have the same function field often share similar fields and templates, leading to similar sentence structures that ChatGPT 4o could not effectively differentiate. With manual inspection, our analysis confirmed the presence of subtle variations in sentence order and phrasing, as illustrated in the example below. These variations, while not extensive, demonstrate that our generated texts do not rigidly follow a single template.
 
<pre>
Owned by Michelangelo and his brothers, surnamed Lini, who hold the title of Nobil Homo, there are more than three owners.
</pre>
 
<pre>
The owner is identified as Nobil Homo Capello, with no first name available.
</pre>
 
 
Below is a summary of the key results from both automatic and manual evaluations across the three metrics:
 
{| class="wikitable"
! Metric        !! Evaluation Method      !! Result (%) !! Notes
|-
| Completeness  || Automatic Evaluation    || 73        || Coverage score > 80% for most samples.
|-
| Completeness  || Manual Evaluation      || 100        || Rated 1 by all team members.
|-
| Accuracy      || Automatic Evaluation    || 73        || Same as completeness due to shared metric.
|-
| Accuracy      || Manual Evaluation      || 100        || Rated accurate by all team members.
|-
| Creativity    || ChatGPT 4o Clustering        || N/A        || Subtle variations detected manually, clustering struggled to differentiate nuances.
|}
 
 
Our decision to prioritize factuality and accuracy over creativity was rooted in the objectives of our project, and demonstrated in the achieved results. Our target audience, primarily concerned with the accuracy and completeness of information, places minimal emphasis on stylistic variations in the generated texts. By ensuring that all generated descriptions were factually correct and complete, we achieved our core goal, even at the expense of higher stylistic diversity. The results demonstrate the robustness of our approach in generating accurate and complete descriptions.
 
=Limitations and Future Work=
While this project has established a comprehensive pipeline for enriching data and generating textual descriptions, there are several areas where future work can enhance the quality and depth of the results: 
* '''Further data standardization''': Currently, we have only standardized the rent column due to its identified frequent patterns. However, there are additional unstandardized text fields that could provide valuable information, such as the quantity or quality of income. Further standardization of these fields could infer more detailed insights into owners' or tenants' social and economic status.
* '''Handling uncertain data''': At present, we ignore data attributes that do not satisfy certain predetermined formats. Future work could involve taking extra steps to handle these uncertain cases, such as manual checks to verify accuracy, or developing additional strategies for interpretation through hypotheses or further processing (for instance, we ignore very high rent values (5 or 6-digit) for them being the outliers of our data - but was that true or there were these exceptional parcels that are highly expensive due to certain economical situation?) 
* '''Deepening the connection between Catastici and Sommarioni''': We have demonstrated the changes in parcel functions between the Catastici and Sommarioni datasets, which offer historical and economic insights. However, the Sommarioni dataset contains more potential information that requires deeper exploration. For instance, the detailed owner information, including family relationships, titles, and parcel ownership, can be used to detect interesting inheritance patterns (e.g., whether a parcel was inherited from father to son or grandson). This could significantly enhance the summarization text by providing a more nuanced understanding of historical and economic shifts. 
* '''Extensive model testing and comparative analysis''': Due to monetary constraints, we only tested GPT-4 on a subset of samples. Future work should involve refining the pipeline and running the final version over the entire dataset. Additionally, testing with different models and conducting a comparative analysis using existing or new evaluation metrics would provide a more critical and comprehensive assessment of the results. This approach would help in identifying the most effective model and fine-tuning the pipeline for optimal performance.
 
By addressing these limitations, future work can further enrich the data, improve the accuracy of the generated texts, and provide a more comprehensive understanding of the historical and economic context of Venice during these periods.
 
= Project Timeline & Milestones =
 
{|class="wikitable"
! style="text-align:center;"|Timeframe
! Task
! Completion
|-
| align="center" |Week 3
|
* Conduct literature research on in-context learning
* Exploring the datasets
* Exploring in-context learning approaches for textual description generation
| align="center" | ✓
|-
| align="center" |Week 4
|
* Identify patterns and edge cases from the dataset (e.g missing fields, "odd" values)
* Craft manual descriptions for each frequent pattern to be used for in-context learning
| align="center" | ✓
|-
| align="center" |Week 5
|
* Refine the description generation structure
* Compile a dictionary based on available resources to translate functionalities
* Construct a pipeline for preparing the data, bringing the translations, generating the description and validation
| align="center" | ✓
|-
| align="center" |Week 6
|
* Refine the prompt to use less definitive language and stick to the factual information
* Exploring how to address the less frequent patterns
* Evaluate description generation results
| align="center" | ✓
|-
| align="center" |Week 7
|
* Autumn break
| align="center" | ✓
|-
| align="center" |Week 8
|
* Prepare for mid-term presentation
| align="center" | ✓
|-
| align="center" |Week 9
|
* Exploring the connection between the Catastici and Sommarioni dataset
* Making use of the change in functionality between Catastici and Sommarioni in the generated text
| align="center" | ✓
|-
| align="center" |Week 10
|
* Refine the compiled dictionary and add interpretations to provide historical context
* Standardization of monthly rent column
| align="center" | ✓
|-
| align="center" |Week 11
|
* Verify and refine standardized rent values
| align="center" | ✓
|-
| align="center" |Week 12
|
* Add district mean rent column and integrate additional information into the data
* Refine text prompts to incorporate new information
* Implement evaluation methods for the final generated text
| align="center" | ✓
|-
 
| align="center" |Week 13
|
* Refine evaluation methods
* Manual and automatic evaluation of generated texts
| align="center" | ✓
|-
 
| align="center" |Week 14
|
* Verify and combine evaluation tasks
* Finalize Wikipage and prepare for presentation
| align="center" | ✓
|-
|}


=Limitations and further work=
=Conclusion= 
In this work, we explored the 1740 Catastici and 1808 Sommarioni datasets to uncover patterns and generate textual descriptions of parcel information. We enriched the data by standardizing key values, such as rent prices and property functions, complemented by English translations. This processed information was structured into in-context learning templates, enabling our language model (GPT-4) to generate coherent and meaningful textual descriptions, including multilingual outputs in Italian. Finally, we evaluated the generated descriptions using three key metrics: completeness, accuracy, and creativity. Our results demonstrate that the descriptions are factual and reliable, with high accuracy confirmed through a manual review of a 100-entry subsample.


=Conclusion=
=Github Repository=
[https://github.com/dhlab-class/fdh-2024-student-projects-nastaran-rubi-fawzia/tree/main GitHub Link]


=Credits=
=Credits=

Latest revision as of 23:59, 18 December 2024

Introduction

Leveraging two invaluable historical datasets - the 1740 Catastici and the 1808 Sommarioni - our project aim to generate detailed textual descriptions of parcels in Venice. These datasets, rich in historical, spatial, and social information, provide a comprehensive view of land ownership, urban development, and social structure in Venice across two distinct periods. The 1740 Catastici offers early insights into parcel functions, rent payments, and tenant names, while the 1808 Sommarioni provides a more detailed and standardized survey, including owner data, and normalized ownership types and qualities. By integrating these datasets, we can create informative descriptions of each parcel, including their location, function, ownership details, and historical context, thereby enhancing our understanding of Venice's evolution and cultural significance.

A visualization of the datapoints and an example of their parcel information on the Catastici 1740 Dataset. Source: Parcel of Venice

Motivation

Our motivation for utilizing the 1740 Catastici and 1808 Sommarioni datasets stems from the rich historical insights they offer, despite the challenges in comprehending and connecting the data. These datasets, like many other historical records, face significant challenges in terms of data consistency and coherence. The data, often unstructured and manually constructed, presents inconsistencies due to its varied and sometimes incomplete nature. Fields may be missing, mistranscribed, or contain varying levels of detail, making it difficult to connect the different pieces of information into a coherent narrative. Additionally, the use of old Italian and Venetian dialects can further complicate the interpretation and context of the data.

Given these challenges, especially for non-historians, our goal is to simplify and contextualize this complex information. By cleaning, organizing, and integrating the data from these datasets, we aim to create coherent and informative textual descriptions. Leveraging in-context learning and prompting with GPT-4, we seek to depict a clear description of the story behind each datapoint, making the historical narratives more accessible and understandable for a broader audience. This approach will not only facilitate a deeper understanding of Venice's history but also make the datasets more user-friendly and interpretable.

Deliverables

  • Pre-processed, standardized dataset from the original source
  • A manually-crafted dictionary of old Venice house functions, matching terms used in the dataset with translations and detailed historical description
  • A text generation pipeline using in-context learning methods
  • A set of evaluation metrics to assess the text generation pipeline across different perspectives

Methodology

Data Exploration

Data clusterings and patterns

After a general review of the Catastici data, it was observed that each data point contains a series of empty fields. Many of these fields appeared to follow the same pattern of missing values. Since the generated text needs to handle various entries with different available data, the first step involved categorizing the data points based on their missing values and then addressing each category.

In the initial step, all possible patterns of missing values were extracted, and their frequency within the dataset was analyzed. Among the 36K data points, 28K samples were found to align with 8 major patterns. In the table below, the 8 frequent patterns are present in order of popularity with an 'X' indicating that the data is present in the given template.

Frequent Patterns in the Dataset
pattern id owner entity owner entity group owner first name owner family name owner family group owner title an rendi ten name
2 X X X X X
0 X X X X X X
12 X X X X
19 X X X X X
8 X X X X X X
1 X X X X X
23 X X X
5 X X X X X

The main text generation task was then divided into two subtasks. The first subtask addressed the most frequent patterns, where a suitable example was crafted for each pattern to be used as context for the large language model. The second subtask focused on optimizing the context and prompt to ensure high-quality descriptions could also be generated for the less frequent patterns in the dataset.

Frequent Patterns Categorization

In tackling the first subtask, the 8 major patterns were categorized based on the following features:

  • Whether the property is owned by a person or a different entity
  • Whether the owner's title is present or not
  • Whether the tenant’s name is present or not

For each combination of these features, a sample paragraph was crafted for a corresponding sample data point, providing a precise explanation of the parcel. Below are two examples crafted for pattern id 0 and 1 respectively:

Pattern #0 Sample:

The property with ID 1 is located in Campo vicino alla Chiesa and serves as a casa e bottega da barbier (house and workshop for a barber). Owned by Liberal Campi, who holds the title of Secondo Prete (Second Priest), the property is tenanted by Francesco Zeni. In 1700s-1800s Venice, such properties were often dual-purpose, combining living spaces with workplaces. Barbers at the time may have also performed minor medical tasks in addition to their grooming services.

Pattern #1 Sample:

The property with ID 3 is located in Campo vicino alla Chiesa and serves as a bottega da strazariol (a workshop for a dealer or repairer of old clothes). The owner is listed as Pievan di San Cancian (the parish priest of San Cancian), with the title of Pievano. The tenant is Bortolamio Piazza. In the 1700s and 1800s, such workshops were associated with tradespeople engaged in repairing or selling second-hand clothing.

When a given data point matched one of the frequent patterns, the pre-crafted sample data and text were added to the prompt, enabling the language model to generate descriptions for new data points. The outcomes consistently resulted in well-written and reliable descriptions. To provide the sample data point to the language model, the following structure is used in the prompt:

... As an example of descriptive text I generated: {incontext_learning_text}; for sample data: {incontext_learning_data}.

Addressing Less Frequent Patterns

Addressing the second subtask, which involved the less frequent patterns, the dataset revealed over 160 different patterns, each representing fewer than 100 samples. Many of these patterns were present in fewer than five samples. Due to the variety, it was impractical to craft a tailored text for each individual pattern. Instead, an approach was devised where examples of two frequent patterns, along with their desired manually crafted paragraphs, were provided to the language model. The prompt was enhanced to acknowledge the possibility of missing values and instructed the model to adjust the paragraph slightly if necessary. Additionally, the prompt included a request for the model to highlight missing values, as this information could be useful to the end-user.

Data Enrichment

Functionality of the parcels and historical translations

In working towards the goal of creating descriptive text for each data point, it became clear that simply listing the names of owners and the functions of parcels was just a straightforward conversion of tabular data into text. While this approach might be informative, it adds little value for the end-user. On the other hand, including some historical context about how the parcel's function fit into 1700s-1800s Venetian society could make the descriptions far more engaging and useful.

The first step in adding historical context was to ask the language model to use the functionality data to provide insights grounded in the time and place where these functions existed. While the language model (specifically GPT-4o) could sometimes deliver helpful information, it often struggled. The model seemed to lack sufficient training on the historical and linguistic nuances needed for this project.

After a few trials, it became clear that the main issue was the model's limited understanding of 18th-century Italian and how it differs from modern Italian. Once the correct meanings of the words in the functionality field were provided, the model was able to produce much more accurate and relevant descriptions.

As mentioned above, when dealing with the historical context and asking the GPT-4o model to provide explanations based on historical facts and information, it was observed that the GPT-4o model lacks an understanding of certain words in 18th-century Italian. Some words and phrases related to the functionality of parcels are not accurately translated by the GPT-4o model, resulting in generated text that is sometimes inaccurate or, in certain cases, misleading.

Take the following sample data entry as an example:

  • General Information
    • ID: 238
    • Place: Campiel del remer
  • Ownership Details
    • Owner First Name: Gerolamo
    • Owner Family Name: GRADENIGO
    • Owner Title: NOBIL HOMO
  • Functionality
    • Catastici Function: bastion inviamento (bastion with trade designation)
    • Standardized Functions: bastione, inviamento
    • Sommarioni Functions: BOTTEGA, CASA

Using the finalized prompt and without manually providing the translations, the model generated the following text regarding the context of the parcel's functionality:

In the 1700s-1800s Venice, the term "bastion inviamento" suggests a space used for fortification activities and dispatch operations.

However, based on the available sources, it is known that `bastione` refers to "a large tavern or commercial space, often used for selling wine," which is quite different from what the language model might infer from modern Italian. To address this discrepancy, a dictionary titled Dizionario del dialetto veneziano was used as the primary reference, supplemented by two additional resources: a website called History Walks Venice and an index of Venetian toponymy titled Toponomastica Veneziana.

For each standardized function used in the dataset, these three resources were combined with the team members' expertise in Italian to determine accurate translations. The translations derived from these efforts were then compiled into a concise dictionary to ensure the language model generated historically accurate and contextually appropriate descriptions.

This dictionary is provided to the GPT-4 model to ensure it generates factually and historically accurate explanations about the functionality of parcels. In the system prompt, only the necessary definitions are included to avoid redundant context and minimize resource usage.

With the updated translation provided to the language model, a description for the above example was generated as follows:

In the context of the 1700s-1800s Venice, a "bastion inviamento" may imply a versatile commercial space potentially used for selling wine by the glass and/or was designated for a specific trade, granted a particular right or possibly a license related to lucrative professions like baking or wine selling.

Using the compiled dictionary, the model accurately describes the functionality of the parcel within the historical context of 1700-1800s Venice.

Standardization of Monthly Rent

Exploring Value Formats

Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price in relation to its neighborhood, which may reflect its relative status at that time. Among 35,946 data rows, we identified that only 28,610 (~80%) contained numeric values, with the remaining entries consisting of null values or text data.

Upon exploration, we encountered examples of text data such as:

'10 ducati, 19 grossi' '40 ducati e 14 grossi' 'casa in soler'
'libertà di traghetto' '20 lire' '26 lire' '15 lire'...

These examples illustrate the diversity of information present in the text entries, including multiple currencies, descriptive text about the parcel’s function, and potential typographical errors.

To standardize the non-numeric rent values, we developed an iterative approach. Using regular expressions (regex), we captured patterns within the data. After each iteration, we matched the patterns against the dataset, identified any emerging new patterns, and incorporated them into our existing pattern set. This process was repeated until no new patterns could be found.

In the final iteration, we identified six main patterns as follows:

Pattern Name Example Notes
Single currency 30 lire with optional "de piccoli" or "di piccoli"
Dual currency 7 ducati, 18 grossi with optional "de piccoli" or "di piccoli"
Three-part currency 10 ducati, 2 lire, 8 soldi
Fractional or "e mezzo" units 8 ducati e mezzo
Time-related mentions al mese, ogni tre mesi, per metà
Function or Ownership casa, bottega
Others 1[5], 1' Unmatched with any of the above

Format Selection and Hypothesis

From the identified text formats, we decided to exclude entries related to function, ownership, or other non-monetary details due to uncertainty about their origins. These deviations may stem from transcription errors or the inclusion of special characters.

For analysis, we focused on the remaining matched patterns, which revealed a mix of currencies within the entries. We identified five currency types used:

  • Ducati/ducato [basis currency]
  • Lire/lira/libre
  • Grossi/grosso
  • Soldi

To standardize the dataset, we made the following assumptions:

  1. Currency synonyms: Terms like ducati and ducato refer to the same currency, as do lire, lira, libre, and grossi/grosso.
  2. Default currency: If no currency is mentioned, it defaults to ducati.
  3. Irrelevant qualifiers: Phrases like de piccoli or di piccoli are ignored in numeric contexts. For example, 50 lire is treated the same as 50 lire di piccoli. Historically, "lire di piccoli" referred to a monetary unit based on the piccolo, a base coin in Venice, but not a distinct currency itself.
  4. Exchange rate: Based on Wikipedia, we use the following conversion:
    1. 1 ducati = 6.2 lire = 24 grossi = 124 solid
    2. While exchange rates were historically dynamic and context-dependent, we adopted this commonly cited, publicly available rate.
  5. Unmatched formats: Entries that do not align with the identified currency patterns are excluded from analysis and marked as -1 due to uncertainty (e.g., transcription errors or unclear historical context).

Standardization Strategy

We standardized all values by converting them to the smallest unit (soldi):

  • Numeric values: Convert from ducati to soldi
  • Non-numeric values:
    • Matched format: Standardize as below
    • Unmatched format: Ignore (no analysis)

Value Standardization Method

  1. Single currency: For entries like 30 lire, convert directly using the chosen exchange rate.
  2. Dual currency: For entries like 7 ducati, 18 grossi, separate the components, convert each, and combine the results.
  3. Three-part currency: For entries like 10 ducati, 2 lire, 8 soldi, follow the same process as dual currency but account for the third component.
  4. Fractional units: For entries like 8 ducati e mezzo or 8 e mezzo, treat e mezzo (and similar terms) as +0.5, then convert.
  5. Time-related mentions:
    1. al mese = "monthly" (standard conversion).
    2. ogni tre mesi = "every three months" (divide by 3).
    3. per metà = "for half" (divide by 2; likely indicates partial or shared payment responsibility). For example, 36 lire per metà translates to 18 lire each for two parties or simply half of 36 lire.
    4. per 3 mesi = "for 3 months" (remain as is; indicates a fixed three-month payment).

This process results in a new dataset column, std_rent, which contains standardized monthly rent values in soldi. A few samples of the result is shown as below:

an_rendi std_rent
30 3720.0
30.5 lire di picolli 189.1
7 ducati, 18 grossi di piccolo 1300.0
10.0 ducati, 2 lire, 8 soldi 1260.4
30 lire per mezzo 189.1
40 lire al mese 248.0

Adding Mean District Rent Value for Parcels

With the standardized rent values for each parcel, we extended our analysis to compare individual parcels against the mean rent value of their respective districts (sestiere).

During this process, we identified additional anomalies, such as values with more than four digits. Upon verification, these entries were recognized as transcription errors, likely due to confusion with the ID field. Such values were excluded from further analysis.

To compute the mean rent price for each district, we excluded both the invalid entries and the standardized -1 values, as defined earlier. The results are visualized in the following graph:

The mean monthly rent price of each district (sestiere)

We further visualized the mean district rent values on a map, utilizing geographic coordinates to explore spatial patterns and gain insights into neighborhood characteristics and rent distributions.

Distribution of mean monthly rent price by district (sestiere), visualized on exact coordinates

From this visualization, we can derive interesting insights about neighborhoods. For example, San Marco (SM) has the highest mean rent, likely due to its central location within the city. This information enables further enrichment for text generation, not only within individual districts but also for comparisons across districts, offering a richer understanding of the socio-economic dynamics of the city.

Prompt Engineering

The prompt used to generate the description for each given input data is as follows. The steps involved in crafting this prompt and the rationale behind including each part are detailed below:

Generate a concise and factual descriptive text for the given input data about historical Venetian properties. The text should include:

1. Property Summary: Start with the property's ID, location, district name, and function. Use the Italian terms (e.g., casa, bottega) with an English explanation in parentheses.

2. Ownership Details: Mention the owner. It can be either an entity or a person defined by first name, family name, and possibly title. If it is not an entity and a first name, family name, or title is missing, clearly state "no [specific detail] available" without overexplaining. Include the tenant's name (based on 1740 data) and additional details, if provided.

3. Historical Context: Offer a brief explanation of the property's function, referencing its role in 1700s-1800s Venice, phrased with appropriate caution (e.g., "may suggest," "indicates," "was often associated with") rather than definitive language. Only use the information provided by this dictionary for historical context: {translations_dict} 

Avoid speculative storytelling or unnecessary elaboration or phrases like "reflecting a common arrangement".
Avoid extrapolating beyond the data provided unless well-established historical knowledge applies directly.
Avoid describing the Venetian society when providing Historical Context about the functionality. Focus on the function rather than society.

The field "som_function" (always refer to it as function in Sommarioni Data), if available, indicates the function in 1808 based on Sommarioni Data rather than field "function" which indicates the function in 1740. If the "som_function" is available, compare it to "function" and explain the difference.

Provide insight on rent amount (std_rent - unit is 'Soldi') and rent relative to the district (sestiere_rent_mean) if data is available.

Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."

As an example of descriptive text I generated: <{incontext_learning_text}> for sample data: <{incontext_learning_data}>.

Generate for input data: <{input_sample_data}>.

Ask for Basic Information

To establish a foundation for generating the descriptions, the prompt begins by clearly stating the task: to create a descriptive text. Early in the prompt, it emphasizes that the text must be concise and factual. While further details on what is meant by concise and factual are provided later, these requirements are highlighted from the outset.

Additionally, the language model is given the context of the data, which focuses on historical Venetian properties, ensuring that the descriptions align with the subject matter.

Generate a concise and factual descriptive text for the given input data about historical Venetian properties.

The prompt then outlines the basic information to be included in the generated description. It starts with the parcel ID, which serves as an index, followed by the parcel's location, district name, and its function. For the function, an English translation is provided to give the user a clearer understanding of the parcel. The detailed process of creating these translations is described in the Data Enrichment section.

The text should include:
1. Property Summary: Start with the property's ID, location, district name, and function. Use the Italian terms (e.g., casa, bottega) with an English explanation in parentheses.

The second set of basic information to include in the descriptive text is ownership details, encompassing both the owner and tenant information. As explained in the Clustering section, ownership can be categorized into two types: an entity or a person. This distinction is provided to the language model, which is also instructed to note missing data where applicable.

It was clarified that if the owner is an entity, the absence of a first name, last name, or title should not be considered missing data. In contrast, for private owners, any missing details in these fields should be explicitly pointed out in the description.

In rare cases, the language model has referred to the tenant's name as the current tenant. To address this, it is explicitly stated that the tenancy information is based on the 1740 data.

2. Ownership Details: Mention the owner. It can be either an entity or a person defined by first name, family name, and possibly title. If it is not an entity and a first name, family name, or title is missing, clearly state "no [specific detail] available" without overexplaining.
Include the tenant's name (based on 1740 data) and additional details, if provided.

Use of Translations

As outlined in the Data Enrichment section, one key enhancement added to the descriptions is the historical context of the parcel's functionality. In many cases, the GPT-4o model struggled to accurately translate the parcel's functionality into English, especially when considering the historical time and place. To address this, a dictionary was compiled using various resources alongside the team’s knowledge of the Italian language. This dictionary ensures accurate context is provided to the language model, enabling it to generate detailed and relevant descriptions of the parcels' functionality.

For each parcel, the necessary translations from the crafted dictionary are included in the prompt to ensure the descriptions are historically accurate and contextually relevant.

3. Historical Context: Offer a brief explanation of the property's function, referencing its role in 1700s-1800s Venice, phrased with appropriate caution (e.g., "may suggest," "indicates," "was often associated with") rather than definitive language.
Only use the information provided by this dictionary for historical context: {translations_dict}

Ensuring Historical Accuracy Over Narrative Storytelling

Evaluating the descriptions generated by the GPT-4o model, it was discovered that the language model occasionally shifts toward storytelling rather than adhering strictly to historical facts. While the story provided by the model might be correct, redundant information without concrete factual evidence is avoided.

For example:

  • General Information
    • ID: 3
    • Place: Campo vicino alla Chiesa
  • Ownership Details
    • Owner Title: PIEVANO
  • Functionality
    • Catastici Function: bottega da strazariol

Generated description:

... Workshops like this were typically associated with tradespeople specializing in the repair or resale of second-hand clothing, which was a common trade in Venice during the 1700s-1800s.

Although this may be plausible, it cannot always be verified and is considered out of scope. The model is instructed to avoid such speculative storytelling and focus on providing only data-supported insights.

To enforce this, the prompt includes guidelines:

Avoid speculative storytelling or unnecessary elaboration or phrases like "reflecting a common arrangement."
Avoid extrapolating beyond the data provided unless well-established historical knowledge applies directly.

Use of Sommarioni Data

To further enrich the generated text, Sommarioni data has also been incorporated. This allows comparisons of functionality between 1740 and 1808, providing insight into changes or consistencies over time.

The field "som_function" (always refer to it as function in Sommarioni Data), if available, indicates the function in 1808 based on Sommarioni Data rather than field "function," which indicates the function in 1740. If the "som_function" is available, compare it to "function" and explain the difference.

Use of Standard Rent and District Rent

As the final piece of information, the prompt highlights the calculated standard rent in units of "Soldi" and compares it with the average rent cost in the district. This ensures clarity on the economic context.

Provide insight on rent amount (std_rent - unit is 'Soldi') and rent relative to the district (sestiere_rent_mean) if data is available.

Example:

The rent for this property is 744 Soldi, which is below the district (sestiere) rent mean of 2836 Soldi, suggesting its rental cost was relatively low in comparison.

Prompt Ending and In-Context Learning

The system prompt concludes with an emphasis on adhering to the provided guidelines, focusing on historical accuracy and avoiding an overly confident tone. The final section provides a sample data point and its manually crafted description for in-context learning, helping the model align with the required format.

Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."
As an example of descriptive text I generated: <{incontext_learning_text}> for sample data: <{incontext_learning_data}>.

User Prompt

All the above-mentioned prompts are components of the system prompt. Once the model is aligned, the input can be provided, and the model is instructed to generate a description. Having generated the description, the model is then asked to provide an Italian version of the text. It was necessary to clarify that parts already translated from Italian to English should not be retranslated into Italian in the final output.

Generate for input data: <{input_sample_data}>
Then, provide an Italian translation of the text. Do not include in the translation the parts that are already translated from Italian to English.

Handling Less Frequent Patterns

As previously mentioned, not all data entries follow the same pattern of data availability. As described in the Clustering section, it was observed that while 8 major patterns could represent most data points, there are numerous other patterns that represent smaller sets of data points, sometimes consisting of only a single entry.

For these data points, a slightly different approach was taken in prompt engineering. For the frequent patterns, the pre-crafted sample corresponding to the specific pattern of the input data was provided to the language model. In contrast, for less frequent patterns, no pre-crafted sample was available for each pattern. Instead, two samples from the frequent patterns were used (one representing a parcel owned by an entity and one by a person). The model was explicitly instructed once more to be cautious about missing data and to highlight it when applicable.

The prompt ending, including the in-context learning section, for data points from less frequent patterns is as follows:

If there is missing or misformatted data, identify and address it. For instance, if a first name or place is missing, clearly indicate that the data is absent.
Follow this structure to ensure clarity, professionalism, and historical accuracy, maintaining a formal tone suitable for a reference text. Use varied phrasing for common words like "typically" or "often."

As an example of descriptive text, I generated: {incontext_learning_text_1} for sample data: {incontext_learning_data_1}. And: {incontext_learning_text_2} for sample data: {incontext_learning_data_2}"

Evaluation Metrics

Generated Text in English

To evaluate the generated textual descriptions, we define three main metrics:

  • Completeness: Assessing whether all necessary fields (properties) are included, ensuring they are non-empty and relevant.
  • Accuracy: Verifying the correctness of the information in the included fields.
  • Creativity: Evaluating the variation in structure across generated texts for different samples

The metrics were evaluated using both manual and automatic methods:

  • Completeness:
    • Manual Evaluation: A scoring system of 0, 1, or 2 was applied to 100 samples by the research team. A score of 0 indicated missing properties, 1 indicated all required properties were included, and 2 indicated that additional irrelevant properties were added.
    • Automatic Evaluation: A "coverage score" was computed for 481 samples. This score, ranging from 0 to 1, reflects the ratio of properties included and the accuracy of the information presented.
  • Accuracy:
    • Manual Evaluation: Descriptions were scored as 0 or 1 based on the correctness of the included information. Any instance of incorrect data resulted in a score of 0. This evaluation was conducted on the same 100 samples used for manual completeness scoring.
    • Automatic Evaluation: The "coverage score" used for completeness also accounted for accuracy by verifying both the inclusion and correctness of the properties.
  • Creativity: Creativity was assessed using large language models, specifically ChatGPT 4o, to analyze the structural and stylistic variation in 200 generated texts. Clustering techniques were employed, and the results of these clusters were manually reviewed to verify the validity of the variation.

Generated Text in Italian

To test the multilingual capabilities of the generated textual descriptions, we translated the already-checked factual samples into Italian and had an expert fluent in Italian review them. The expert evaluation focused on:

  • Soundness of Translation: Ensuring the translations were linguistically accurate and preserved the meaning of the original descriptions.
  • Clarity and Coherence: Verifying that the descriptions were natural and contextually appropriate in Italian.

Evaluation Results

This is the final version of a generated text after refinement of prompts:

The property with ID 220 is located in Fondamenta del Forner within the district of San Polo. Its function is listed as a stazio da erbarol (station of a herbalist). The owner is noted as Provveditori di Commun, with no available personal name or family name. The tenant in 1740 is identified as Antonio Scarpa. Historically, such stations were associated with places for stopping and conducting specific activities, including the trade of goods like herbs. The property's rent, set at 1736 Soldi, is below the average rent of 4124 Soldi for the San Polo district.

La proprietà con ID 220 si trova in Fondamenta del Forner nel distretto di San Polo. La sua funzione è indicata come stazio da erbarol. Il proprietario è registrato come Provveditori di Commun, senza nome o cognome personale disponibili. L'inquilino nel 1740 è identificato come Antonio Scarpa. Storicamente, tali stazioni erano associate a luoghi di sosta e conduzione di attività specifiche, incluso il commercio di beni. L'affitto della proprietà, fissato a 1736 Soldi, è inferiore alla media dell'affitto di 4124 Soldi per il distretto di San Polo.

Next, we present the results of our generated textual descriptions, evaluated using the three proposed metrics outlined earlier.

Completeness and Accuracy

Using the automatic evaluation method, approximately 73% of the generated texts achieved a coverage score above 80%. This result indicates that the majority of descriptions included most of the required properties and accurate information. However, this left 27% of the samples below the threshold, prompting further manual investigation.

One significant limitation of the automatic approach lies in its reliance on keyword matching. This approach, while efficient for large amounts of samples, fails to account for some cases. Consequently, it can only be used as a preliminary metric to identify patterns in the data and flag samples that require closer manual inspection. An example illustrating this limitation is the sample below; the translation of fratelli to brothers was linguistically correct, yet the automatic system did not recognize it as a match, leading to an inaccurately reduced score for the sample.


input_data:   {...'owner_first_name': 'Paolo | _fratelli', 'owner_family_name': 'BEMBO | BEMBO'...}
generated_text:   ...It is owned by Paolo and his brothers, identified collectively with the family name Bembo...

To address the gaps in the automatic evaluation, we conducted a manual review. This process involved all team members independently rating 100 randomly selected samples for completeness and accuracy. The manual evaluation revealed that 100% of the samples were rated 1 (fully complete) by all team members. This confirms that all necessary properties were included without omissions. The manual ratings also showed that 100% of the samples were accurate, meaning no errors or incorrect information.

Creativity

Creativity was assessed by evaluating structural variation in the generated texts. To do this, we leveraged ChatGPT 4o's capabilities to cluster samples based on their structural and stylistic differences, with a focus on paraphrasing and the order in which the properties of the entities are mentioned. Initial attempts involved clustering the full dataset of 481 samples. ChatGPT 4o, however, produced an overwhelming number of clusters, indicating challenges in distinguishing subtle variations in sentence structure. Recognizing this limitation, we reduced the dataset to 200 samples to refine the clustering process. We explicitly instructed ChatGPT 4o to focus on structural variations and avoid over-grouping by the function field. Despite these refinements, ChatGPT 4o continued to group samples according to it.

The clustering difficulties are explainable and align with our expectations. Samples that have the same function field often share similar fields and templates, leading to similar sentence structures that ChatGPT 4o could not effectively differentiate. With manual inspection, our analysis confirmed the presence of subtle variations in sentence order and phrasing, as illustrated in the example below. These variations, while not extensive, demonstrate that our generated texts do not rigidly follow a single template.

Owned by Michelangelo and his brothers, surnamed Lini, who hold the title of Nobil Homo, there are more than three owners.
The owner is identified as Nobil Homo Capello, with no first name available.


Below is a summary of the key results from both automatic and manual evaluations across the three metrics:

Metric Evaluation Method Result (%) Notes
Completeness Automatic Evaluation 73 Coverage score > 80% for most samples.
Completeness Manual Evaluation 100 Rated 1 by all team members.
Accuracy Automatic Evaluation 73 Same as completeness due to shared metric.
Accuracy Manual Evaluation 100 Rated accurate by all team members.
Creativity ChatGPT 4o Clustering N/A Subtle variations detected manually, clustering struggled to differentiate nuances.


Our decision to prioritize factuality and accuracy over creativity was rooted in the objectives of our project, and demonstrated in the achieved results. Our target audience, primarily concerned with the accuracy and completeness of information, places minimal emphasis on stylistic variations in the generated texts. By ensuring that all generated descriptions were factually correct and complete, we achieved our core goal, even at the expense of higher stylistic diversity. The results demonstrate the robustness of our approach in generating accurate and complete descriptions.

Limitations and Future Work

While this project has established a comprehensive pipeline for enriching data and generating textual descriptions, there are several areas where future work can enhance the quality and depth of the results:

  • Further data standardization: Currently, we have only standardized the rent column due to its identified frequent patterns. However, there are additional unstandardized text fields that could provide valuable information, such as the quantity or quality of income. Further standardization of these fields could infer more detailed insights into owners' or tenants' social and economic status.
  • Handling uncertain data: At present, we ignore data attributes that do not satisfy certain predetermined formats. Future work could involve taking extra steps to handle these uncertain cases, such as manual checks to verify accuracy, or developing additional strategies for interpretation through hypotheses or further processing (for instance, we ignore very high rent values (5 or 6-digit) for them being the outliers of our data - but was that true or there were these exceptional parcels that are highly expensive due to certain economical situation?)
  • Deepening the connection between Catastici and Sommarioni: We have demonstrated the changes in parcel functions between the Catastici and Sommarioni datasets, which offer historical and economic insights. However, the Sommarioni dataset contains more potential information that requires deeper exploration. For instance, the detailed owner information, including family relationships, titles, and parcel ownership, can be used to detect interesting inheritance patterns (e.g., whether a parcel was inherited from father to son or grandson). This could significantly enhance the summarization text by providing a more nuanced understanding of historical and economic shifts.
  • Extensive model testing and comparative analysis: Due to monetary constraints, we only tested GPT-4 on a subset of samples. Future work should involve refining the pipeline and running the final version over the entire dataset. Additionally, testing with different models and conducting a comparative analysis using existing or new evaluation metrics would provide a more critical and comprehensive assessment of the results. This approach would help in identifying the most effective model and fine-tuning the pipeline for optimal performance.

By addressing these limitations, future work can further enrich the data, improve the accuracy of the generated texts, and provide a more comprehensive understanding of the historical and economic context of Venice during these periods.

Project Timeline & Milestones

Timeframe Task Completion
Week 3
  • Conduct literature research on in-context learning
  • Exploring the datasets
  • Exploring in-context learning approaches for textual description generation
Week 4
  • Identify patterns and edge cases from the dataset (e.g missing fields, "odd" values)
  • Craft manual descriptions for each frequent pattern to be used for in-context learning
Week 5
  • Refine the description generation structure
  • Compile a dictionary based on available resources to translate functionalities
  • Construct a pipeline for preparing the data, bringing the translations, generating the description and validation
Week 6
  • Refine the prompt to use less definitive language and stick to the factual information
  • Exploring how to address the less frequent patterns
  • Evaluate description generation results
Week 7
  • Autumn break
Week 8
  • Prepare for mid-term presentation
Week 9
  • Exploring the connection between the Catastici and Sommarioni dataset
  • Making use of the change in functionality between Catastici and Sommarioni in the generated text
Week 10
  • Refine the compiled dictionary and add interpretations to provide historical context
  • Standardization of monthly rent column
Week 11
  • Verify and refine standardized rent values
Week 12
  • Add district mean rent column and integrate additional information into the data
  • Refine text prompts to incorporate new information
  • Implement evaluation methods for the final generated text
Week 13
  • Refine evaluation methods
  • Manual and automatic evaluation of generated texts
Week 14
  • Verify and combine evaluation tasks
  • Finalize Wikipage and prepare for presentation

Conclusion

In this work, we explored the 1740 Catastici and 1808 Sommarioni datasets to uncover patterns and generate textual descriptions of parcel information. We enriched the data by standardizing key values, such as rent prices and property functions, complemented by English translations. This processed information was structured into in-context learning templates, enabling our language model (GPT-4) to generate coherent and meaningful textual descriptions, including multilingual outputs in Italian. Finally, we evaluated the generated descriptions using three key metrics: completeness, accuracy, and creativity. Our results demonstrate that the descriptions are factual and reliable, with high accuracy confirmed through a manual review of a 100-entry subsample.

Github Repository

GitHub Link

Credits

Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisors: Alexander Rusnak, Tristan Karch, Tommy Bruzzese
Authors: Nastaran Hashemisanjani, Fawzia Zeitoun, Bich Ngoc (Rubi) Doan