Generation of Textual Description: Difference between revisions
| align="center" |Week 8
|
* Prepare for mid-term presentation
| align="center" |
|-
| align="center" |Week 9
|
* Explore father-son relationships in the Catastici and Sommarioni datasets
| align="center" |
|-
| align="center" |Week 10
|
* Standardization of monthly rent column
| align="center" |
|-
| align="center" |Week 11
|
* Verified and refined standardized rent values
|-
| align="center" |Week 12
|
* Added district mean rent column and integrated additional information into the data
| align="center" |
|-
=Methodology=
==Generating Summarization Formats for In-context Learning==
==Functionality of Parcels==
blablabla
==Standardization of Monthly Rent==
===Exploring Value Formats===
Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price in relation to its neighborhood, which may reflect its relative status at that time. Among 35,946 data rows, we identified that only 28,610 (~80%) contained numeric values, with the remaining entries consisting of null values or text data.

Upon exploration, we encountered examples of text data such as:

'10 ducati, 19 grossi' '40 ducati e 14 grossi' 'casa in soler'
'libertà di traghetto' '20 lire' '26 lire' '15 lire'...

These examples illustrate the diversity of information present in the text entries, including multiple currencies, descriptive text about the parcel’s function, and potential typographical errors.

To standardize the non-numeric rent values, we developed an iterative approach. Using regular expressions (regex), we captured patterns within the data. After each iteration, we matched the patterns against the dataset, identified any emerging new patterns, and incorporated them into our existing pattern set. This process was repeated until no new patterns could be found.

In the final iteration, we identified six main patterns, plus a catch-all category for entries that match none of them:
{|class="wikitable"
!scope="col" width="150"|Pattern Name
!|Example
!|Notes
|-
|Single currency
|30 lire
|with optional "de piccoli" or "di piccoli"
|-
|Dual currency
|7 ducati, 18 grossi
|with optional "de piccoli" or "di piccoli"
|-
|Three-part currency
|10 ducati, 2 lire, 8 soldi
|
|-
|Fractional or "e mezzo" units
|8 ducati e mezzo
|
|-
|Time-related mentions
|al mese, ogni tre mesi, per metà
|
|-
|Function or Ownership
|casa, bottega
|
|-
|Others
|1[5], 1'
|Unmatched with any of the above
|}
===Format Selection and Hypothesis===
From the identified text formats, we decided to exclude entries related to function, ownership, or other non-monetary details due to uncertainty about their origins. These deviations may stem from transcription errors or the inclusion of special characters.

For analysis, we focused on the remaining matched patterns, which revealed a mix of currencies within the entries. We identified four currency types:
* Ducati/ducato [base currency]
* Lire/lira/libre
* Grossi/grosso
* Soldi
To standardize the dataset, we made the following assumptions:
# '''Currency synonyms''': Ducati and ducato refer to the same currency, as do lire, lira, and libre, and likewise grossi and grosso.
# '''Default currency''': If no currency is mentioned, it defaults to ducati.
# '''Irrelevant qualifiers''': Phrases like de piccoli or di piccoli are ignored in numeric contexts. For example, 50 lire is treated the same as 50 lire di piccoli. Historically, "lire di piccoli" referred to a monetary unit based on the piccolo, a base coin in Venice, but not a distinct currency itself.
# '''Exchange rate''': Based on [https://en.wikipedia.org/wiki/Venetian_lira Wikipedia], we use the following conversion:
## 1 ducato = 6.2 lire = 24 grossi = 124 soldi
## While exchange rates were historically dynamic and context-dependent, we adopted this commonly cited, publicly available rate.
# '''Unmatched formats''': Entries that do not align with the identified currency patterns are excluded from analysis and marked as <code>-1</code> due to uncertainty (e.g., transcription errors or unclear historical context).
===Standardization Strategy===
We standardized all values by converting them to the smallest unit (''soldi''):
* '''Numeric values''': Convert from ducati to soldi
* '''Non-numeric values''':
** Matched format: Standardize as below
** Unmatched format: Ignore (no analysis)
===Value Standardization Method===
# Single currency: For entries like 30 lire, convert directly using the chosen exchange rate.
# Dual currency: For entries like 7 ducati, 18 grossi, separate the components, convert each, and combine the results.
# Three-part currency: For entries like 10 ducati, 2 lire, 8 soldi, follow the same process as dual currency but account for the third component.
# Fractional units: For entries like 8 ducati e mezzo or 8 e mezzo, treat e mezzo (and similar terms) as +0.5, then convert.
# Time-related mentions:
## al mese = "monthly" (standard conversion).
## ogni tre mesi = "every three months" (divide by 3).
## per metà = "for half" (divide by 2; likely indicates partial or shared payment responsibility). For example, 36 lire per metà translates to 18 lire each for two parties or simply half of 36 lire.
## per 3 mesi = "for 3 months" (remain as is; indicates a fixed three-month payment).
This process results in a new dataset column, <code>std_rent</code>, which contains standardized monthly rent values in ''soldi''.
===Adding Mean District Rent Value for Parcels===
With the standardized rent values for each parcel, we extended our analysis to compare individual parcels against the mean rent value of their respective districts (sestiere).

During this process, we identified additional anomalies, such as values with more than four digits. Upon verification, these entries were recognized as transcription errors, likely due to confusion with the ID field. Such values were excluded from further analysis.

To compute the mean rent price for each district, we excluded both the invalid entries and the standardized <code>-1</code> values, as defined earlier. The results are visualized in the following graph:
[[File:mean-sestiere-rent.png|thumb|The mean monthly rent price of each district (sestiere)]]

We further visualized the mean district rent values on a map, utilizing geographic coordinates to explore spatial patterns and gain insights into neighborhood characteristics and rent distributions.
[[File:Mean-district-rent.png|thumb|Distribution of mean monthly rent price by district (sestiere), visualized on exact coordinates]]

From this visualization, we can derive interesting insights about neighborhoods. For example, San Marco (SM) has the highest mean rent, likely due to its central location within the city. This information enables further enrichment for text generation, not only within individual districts but also for comparisons across districts, offering a richer understanding of the socio-economic dynamics of the city.
=Results=
=Limitations and further work=
=Conclusion=
=Credits=
Latest revision as of 23:00, 11 December 2024
Introduction
Motivation
Deliverables
Project Timeline & Milestones
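Timeframe | Task | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | |
Week 8 | Prepare for mid-term presentation | |
Week 9 | Explore father-son relationships in the Catastici and Sommarioni datasets | |
Week 10 | Standardization of monthly rent column | |
Week 11 | Verified and refined standardized rent values | |
Week 12 | Added district mean rent column and integrated additional information into the data | |
Week 13 | | |
Week 14 | | |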
Methodology
Generating Summarization Formats for In-context Learning
Functionality of Parcels
blablabla
Standardization of Monthly Rent
Exploring Value Formats
Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price in relation to its neighborhood, which may reflect its relative status at that time. Among 35,946 data rows, we identified that only 28,610 (~80%) contained numeric values, with the remaining entries consisting of null values or text data.
Upon exploration, we encountered examples of text data such as:
'10 ducati, 19 grossi' '40 ducati e 14 grossi' 'casa in soler' 'libertà di traghetto' '20 lire' '26 lire' '15 lire'...
These examples illustrate the diversity of information present in the text entries, including multiple currencies, descriptive text about the parcel’s function, and potential typographical errors.
To standardize the non-numeric rent values, we developed an iterative approach. Using regular expressions (regex), we captured patterns within the data. After each iteration, we matched the patterns against the dataset, identified any emerging new patterns, and incorporated them into our existing pattern set. This process was repeated until no new patterns could be found.
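One round of this loop can be sketched as follows. This is a minimal illustration, not the project's actual code: the regexes, the `unmatched_shapes` helper, and the idea of collapsing digits to `#` so that similar entries group together are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical starting pattern set; the project's real regexes are not given in the text.
PATTERNS = [
    re.compile(r"^\d+\s+(ducati|ducato|lire|lira|libre|grossi|grosso|soldi)$"),
]

def unmatched_shapes(entries, patterns):
    """Count the 'shapes' of entries that no current pattern matches.

    Digits are collapsed to '#' so that '20 lire' and '26 lire' share one bucket.
    """
    shapes = Counter()
    for entry in entries:
        text = " ".join(entry.lower().split())
        if not any(p.match(text) for p in patterns):
            shapes[re.sub(r"\d+", "#", text)] += 1
    return shapes

sample = ["10 ducati, 19 grossi", "40 ducati e 14 grossi", "20 lire", "26 lire", "casa in soler"]

# Round 1: only the single-currency pattern exists, so the dual-currency shapes surface.
round_1 = unmatched_shapes(sample, PATTERNS)

# Round 2: add a pattern for the newly observed shape and check what is still unmatched.
PATTERNS.append(re.compile(r"^\d+\s+\w+(,| e)\s+\d+\s+\w+$"))
round_2 = unmatched_shapes(sample, PATTERNS)
```

On this sample, `round_1` lists three unmatched shapes and `round_2` only `casa in soler`; the loop stops once a round surfaces nothing that looks like a monetary format.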
In the final iteration, we identified six main patterns, plus a catch-all category for entries that match none of them:
Pattern Name | Example | Notes |
---|---|---|
Single currency | 30 lire | with optional "de piccoli" or "di piccoli" |
Dual currency | 7 ducati, 18 grossi | with optional "de piccoli" or "di piccoli" |
Three-part currency | 10 ducati, 2 lire, 8 soldi | |
Fractional or "e mezzo" units | 8 ducati e mezzo | |
Time-related mentions | al mese, ogni tre mesi, per metà | |
Function or Ownership | casa, bottega | |
Others | 1[5], 1' | Unmatched with any of the above |
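The categories in the table could be assigned with an ordered list of regexes, as in the sketch below. The category names come from the table; the regexes themselves, the `classify` function, and the matching order are assumptions for illustration and cover only the example strings shown on this page.

```python
import re

CURRENCY = r"(?:ducati|ducato|lire|lira|libre|grossi|grosso|soldi)"
PICCOLI = r"(?:\s+d[ei]\s+piccoli)?"          # optional "de piccoli" / "di piccoli"
AMOUNT = rf"\d+\s+{CURRENCY}{PICCOLI}"
SEP = r"(?:\s*,\s*|\s+e\s+)"                  # components joined by "," or "e"

# Order matters: the first pattern that matches decides the category.
CATEGORIES = [
    ("Time-related mentions", re.compile(r"(?:al mese|ogni tre mesi|per metà|per \d+ mesi)")),
    ('Fractional or "e mezzo" units', re.compile(rf"^\d+(?:\s+{CURRENCY})?\s+e\s+mezzo$")),
    ("Three-part currency", re.compile(rf"^{AMOUNT}{SEP}{AMOUNT}{SEP}{AMOUNT}$")),
    ("Dual currency", re.compile(rf"^{AMOUNT}{SEP}{AMOUNT}$")),
    ("Single currency", re.compile(rf"^{AMOUNT}$")),
    ("Function or Ownership", re.compile(r"\b(?:casa|bottega)\b")),
]

def classify(entry):
    """Return the table category for one rent entry, or 'Others' if nothing matches."""
    text = " ".join(entry.lower().split())
    for name, pattern in CATEGORIES:
        if pattern.search(text):
            return name
    return "Others"

labels = [classify(e) for e in ["30 lire", "7 ducati, 18 grossi", "8 ducati e mezzo", "1[5]"]]
```

Putting the time-related and fractional patterns first keeps an entry such as `36 lire per metà` from being reported as unmatched merely because of its suffix.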
Format Selection and Hypothesis
From the identified text formats, we decided to exclude entries related to function, ownership, or other non-monetary details due to uncertainty about their origins. These deviations may stem from transcription errors or the inclusion of special characters.
For analysis, we focused on the remaining matched patterns, which revealed a mix of currencies within the entries. We identified four currency types:
- Ducati/ducato [base currency]
- Lire/lira/libre
- Grossi/grosso
- Soldi
To standardize the dataset, we made the following assumptions:
- Currency synonyms: Ducati and ducato refer to the same currency, as do lire, lira, and libre, and likewise grossi and grosso.
- Default currency: If no currency is mentioned, it defaults to ducati.
- Irrelevant qualifiers: Phrases like de piccoli or di piccoli are ignored in numeric contexts. For example, 50 lire is treated the same as 50 lire di piccoli. Historically, "lire di piccoli" referred to a monetary unit based on the piccolo, a base coin in Venice, but not a distinct currency itself.
- Exchange rate: Based on Wikipedia, we use the following conversion:
- 1 ducato = 6.2 lire = 24 grossi = 124 soldi
- While exchange rates were historically dynamic and context-dependent, we adopted this commonly cited, publicly available rate.
- Unmatched formats: Entries that do not align with the identified currency patterns are excluded from analysis and marked as -1 due to uncertainty (e.g., transcription errors or unclear historical context).
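The synonym, default-currency, and exchange-rate assumptions translate into a small lookup, sketched below. The rate is the one stated above (1 ducato = 6.2 lire = 24 grossi = 124 soldi); the canonical unit names and the `to_soldi` helper are hypothetical names chosen for this example.

```python
# Soldi per unit, derived from the stated rate 1 ducato = 6.2 lire = 24 grossi = 124 soldi.
SOLDI_PER_DUCATO = 124
SOLDI_PER_UNIT = {
    "ducato": SOLDI_PER_DUCATO,
    "lira": SOLDI_PER_DUCATO / 6.2,    # 20 soldi per lira
    "grosso": SOLDI_PER_DUCATO / 24,   # about 5.17 soldi per grosso
    "soldo": 1,
}

# Assumption 1: spelling variants map to one canonical unit name.
SYNONYMS = {
    "ducati": "ducato", "ducato": "ducato",
    "lire": "lira", "lira": "lira", "libre": "lira",
    "grossi": "grosso", "grosso": "grosso",
    "soldi": "soldo",
}

def to_soldi(amount, currency=None):
    """Convert an amount to soldi; a missing currency defaults to ducati (assumption 2)."""
    unit = SYNONYMS[currency.lower()] if currency else "ducato"
    return amount * SOLDI_PER_UNIT[unit]

fifty_lire = to_soldi(50, "lire")   # 50 lire, roughly 1000 soldi under the assumed rate
bare_number = to_soldi(3)           # no currency given, so 3 ducati
```

Because `de piccoli` and `di piccoli` are treated as irrelevant qualifiers, they would be stripped before the lookup rather than given an entry of their own.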
Standardization Strategy
We standardized all values by converting them to the smallest unit (soldi):
- Numeric values: Convert from ducati to soldi
- Non-numeric values:
- Matched format: Standardize as below
- Unmatched format: Ignore (no analysis)
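The branching above can be sketched as a single dispatch function. The names `standardize_cell` and `parse_text` are hypothetical; `parse_text` stands for whatever routine handles the matched text formats and is assumed to return `None` when nothing matches. Leaving nulls as `None` is also an assumption, since the text does not say how they are stored.

```python
import re

SOLDI_PER_DUCATO = 124  # from the stated rate 1 ducato = 124 soldi
UNMATCHED = -1          # marker used for entries excluded from analysis

def standardize_cell(value, parse_text):
    """Return the rent in soldi, UNMATCHED for unparseable text, or None for nulls."""
    if value is None:
        return None  # assumption: null rents stay null
    text = str(value).strip()
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        # Bare numbers carry no currency, so they are read as ducati.
        return float(text) * SOLDI_PER_DUCATO
    parsed = parse_text(text)
    return UNMATCHED if parsed is None else parsed

# Usage with a stand-in parser that recognises nothing.
never_matches = lambda text: None
numeric = standardize_cell("12", never_matches)
excluded = standardize_cell("casa in soler", never_matches)
```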
Value Standardization Method
- Single currency: For entries like 30 lire, convert directly using the chosen exchange rate.
- Dual currency: For entries like 7 ducati, 18 grossi, separate the components, convert each, and combine the results.
- Three-part currency: For entries like 10 ducati, 2 lire, 8 soldi, follow the same process as dual currency but account for the third component.
- Fractional units: For entries like 8 ducati e mezzo or 8 e mezzo, treat e mezzo (and similar terms) as +0.5, then convert.
- Time-related mentions:
- al mese = "monthly" (standard conversion).
- ogni tre mesi = "every three months" (divide by 3).
- per metà = "for half" (divide by 2; likely indicates partial or shared payment responsibility). For example, 36 lire per metà translates to 18 lire each for two parties or simply half of 36 lire.
- per 3 mesi = "for 3 months" (remain as is; indicates a fixed three-month payment).
This process results in a new dataset column, std_rent, which contains standardized monthly rent values in soldi.
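The rules above can be combined into one parser, sketched here under the stated exchange rate. The function name `parse_rent`, the regexes, and the handling of suffixes are illustrative assumptions; the sketch covers only the formats listed on this page.

```python
import re

# Soldi per unit under the assumed rate 1 ducato = 6.2 lire = 24 grossi = 124 soldi.
RATES = {
    "ducati": 124.0, "ducato": 124.0,
    "lire": 20.0, "lira": 20.0, "libre": 20.0,
    "grossi": 124.0 / 24, "grosso": 124.0 / 24,
    "soldi": 1.0,
}
UNIT = "|".join(RATES)
# One component: a number, an optional currency, optional "de/di piccoli", optional "e mezzo".
COMPONENT = re.compile(rf"(\d+)(?: ({UNIT}))?(?: d[ei] piccoli)?( e mezzo)?")
# Time-related suffixes and their divisors; "al mese" and "per N mesi" leave the value as is.
DIVISORS = [(r" ogni tre mesi$", 3), (r" per metà$", 2), (r" al mese$", 1), (r" per \d+ mesi$", 1)]

def parse_rent(entry):
    """Return the rent in soldi, or None if the entry fits none of the formats."""
    text = " ".join(entry.lower().split())  # normalise case and whitespace
    divisor = 1
    for suffix, value in DIVISORS:
        if re.search(suffix, text):
            text, divisor = re.sub(suffix, "", text), value
            break
    total = 0.0
    # Split on "," or "e", but keep "e mezzo" attached to its component.
    for part in re.split(r"\s*,\s*| e (?!mezzo)", text):
        match = COMPONENT.fullmatch(part)
        if match is None:
            return None
        number, unit, half = match.groups()
        # A missing currency defaults to ducati; "e mezzo" adds 0.5 to the number.
        total += (int(number) + (0.5 if half else 0.0)) * RATES[unit or "ducati"]
    return total / divisor

three_part = parse_rent("10 ducati, 2 lire, 8 soldi")  # 1240 + 40 + 8 soldi
shared = parse_rent("36 lire per metà")                # half of 720 soldi
```

A `None` result corresponds to the unmatched case, which the surrounding pipeline would record as -1.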
Adding Mean District Rent Value for Parcels
With the standardized rent values for each parcel, we extended our analysis to compare individual parcels against the mean rent value of their respective districts (sestiere).
During this process, we identified additional anomalies, such as values with more than four digits. Upon verification, these entries were recognized as transcription errors, likely due to confusion with the ID field. Such values were excluded from further analysis.
To compute the mean rent price for each district, we excluded both the invalid entries and the standardized -1 values, as defined earlier. The results are visualized in the following graph:
[Figure: The mean monthly rent price of each district (sestiere)]
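We further visualized the mean district rent values on a map, using geographic coordinates to explore spatial patterns and gain insights into neighborhood characteristics and rent distributions.
[Figure: Distribution of mean monthly rent price by district (sestiere), visualized on exact coordinates]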
From this visualization, we can derive interesting insights about neighborhoods. For example, San Marco (SM) has the highest mean rent, likely due to its central location within the city. This information enables further enrichment for text generation, not only within individual districts but also for comparisons across districts, offering a richer understanding of the socio-economic dynamics of the city.
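The district mean described above can be sketched in plain Python as follows. The sample rows are invented for illustration, and the cutoff is passed in as a parameter because the text does not say whether the "more than four digits" check applies to the raw value or to the standardized one; `district_means` and `attach_district_mean` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean

UNMATCHED = -1  # marker for entries excluded during standardization

def district_means(rows, max_valid):
    """rows: iterable of (sestiere, std_rent) pairs; returns {sestiere: mean std_rent}."""
    by_district = defaultdict(list)
    for sestiere, std_rent in rows:
        # Skip nulls, unmatched entries, and implausibly large values.
        if std_rent is None or std_rent == UNMATCHED or std_rent > max_valid:
            continue
        by_district[sestiere].append(std_rent)
    return {name: mean(values) for name, values in by_district.items()}

def attach_district_mean(rows, means):
    """Add the district mean as a third field so each parcel can be compared to it."""
    return [(sestiere, rent, means.get(sestiere)) for sestiere, rent in rows]

# Invented sample data: two districts, one unmatched entry, one transcription error.
rows = [("SM", 1240), ("SM", 2480), ("SM", -1), ("CN", 620), ("CN", 1234567)]
means = district_means(rows, max_valid=99999)
enriched = attach_district_mean(rows, means)
```

Excluded rows still receive their district's mean in `enriched`, which keeps the comparison column populated even where the parcel's own rent could not be standardized.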