Generation of Textual Description

From FDHwiki

Introduction

Motivation

Deliverables

Project Timeline & Milestones

Timeframe Task Completion
Week 4
  • Exploring the dataset
  • Exploring in-context learning models for text summarization
Week 5
  • Identify patterns and edge cases in the dataset (e.g., missing fields, "odd" values)
  • Define different summarization formats accordingly to be used for in-context learning
  • Explore the connection between the Catastici and Sommarioni datasets
Week 6
  • Refine summarization formats
  • Construct a pipeline connecting translation generation, summarization and validation
Week 7
  • Evaluate summarization results
Week 8
  • Prepare for mid-term presentation
Week 9
  • Explore father-son relationships across the Catastici and Sommarioni datasets
Week 10
  • Standardization of monthly rent column
Week 11
  • Verify and refine standardized rent values
Week 12
  • Add a district mean rent column and integrate additional information into the data
Week 13
  • TBD
Week 14
  • TBD

Methodology

Generating Summarization Formats for In-context Learning

Functionality of Parcels

blablabla

Standardization of Monthly Rent

Inferring accurate values from rent data can offer valuable insights into a parcel's history, such as its price relative to its neighborhood, which may reflect its relative status at that time. Of the 35,946 rows in the data, we found that only 28,610 (~80%) contained numeric values; the remaining entries were either null or text.
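As a rough illustration of how such a count can be obtained, the following sketch labels each raw rent cell as null, numeric, or text. The function name `classify_rent` and the sample values are hypothetical and are not taken from the project code.

```python
from collections import Counter


def classify_rent(value):
    """Label a raw rent cell as 'null', 'numeric', or 'text'."""
    # Treat None and float NaN (value != value) as missing.
    if value is None or (isinstance(value, float) and value != value):
        return "null"
    text = str(value).strip()
    if not text:
        return "null"
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "text"


# Hypothetical sample of raw cells, not the real column.
sample = [120, "45", None, "3776 lire", "casa in soler", ""]
counts = Counter(classify_rent(v) for v in sample)
```

Applied to the full column, the share of `numeric` labels would correspond to the ~80% figure reported above.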

Upon exploration, we encountered examples of text data such as:

```
'1022 lire de piccoli' '3776 lire' '22 ducati, 22 grossi'
'10 ducati, 19 grossi' '40 ducati e 14 grossi' 'casa in soler'
'libertà di traghetto' '20 lire' '26 lire' '15 lire'...
```

These examples illustrate the diversity of information present in the text entries, including multiple currencies, descriptive text about the parcel’s function, and potential typographical errors.

To standardize the non-numeric rent values, we developed an iterative approach. Using regular expressions (regex), we captured patterns within the data. After each iteration, we matched the patterns against the dataset, identified any emerging new patterns, and incorporated them into our existing pattern set. This process was repeated until no new patterns could be found.
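One iteration of this loop can be sketched as follows. This is a minimal illustration under assumptions: the helper `unmatched_entries`, the starting pattern, and the sample values are hypothetical, and the actual pattern set used in the project may differ.

```python
import re
from collections import Counter


def unmatched_entries(values, patterns):
    """Count the values that none of the compiled patterns fully match."""
    leftovers = Counter()
    for value in values:
        text = value.strip().lower()
        if not any(p.fullmatch(text) for p in patterns):
            leftovers[text] += 1
    return leftovers


# Hypothetical starting pattern set: a single "<number> lire" rule.
patterns = [re.compile(r"\d+ lire( d[ei] piccoli)?")]
values = ["20 lire", "1022 lire de piccoli", "22 ducati, 22 grossi", "casa in soler"]

# First pass: see which entries are still unexplained.
first_pass = unmatched_entries(values, patterns)

# The leftovers suggest a "<n> ducati, <n> grossi" rule; add it and rerun.
patterns.append(re.compile(r"\d+ ducati(,| e) \d+ grossi"))
second_pass = unmatched_entries(values, patterns)
```

Sorting the leftovers by frequency (for example with `first_pass.most_common()`) is one way to decide which new pattern to add next; the loop stops once the leftovers contain no recurring structure.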

In the final iteration, we identified six main patterns as follows:

| Pattern Name | Example | Notes |
| Single currency | 30 lire | with optional "de piccoli" or "di piccoli" |
| Dual currency | 7 ducati, 18 grossi | with optional "de piccoli" or "di piccoli" |
| Three-part currency | 10 ducati, 2 lire, 8 soldi | |
| Fractional or "e mezzo" units | 8 ducati e mezzo or 8 e mezzo | |
| Time-related mentions | al mese, ogni tre mesi, per metà | |
| Function or Ownership | casa, bottega | |
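The currency patterns above can be captured with a single regular expression that extracts (quantity, unit) pairs. The sketch below is illustrative only: the unit spellings are assumed from the examples shown, the names `UNIT`, `AMOUNT`, and `parse_amounts` are hypothetical, and the unit-less form "8 e mezzo" as well as any conversion between currencies are deliberately left out.

```python
import re

# Assumed unit spellings, inferred from the examples; the real data may contain more.
UNIT = r"(ducat[oi]|lir[ae]|sold[oi]|gross[oi])"
# A number, a unit, and an optional "e mezzo" (one half) suffix.
AMOUNT = re.compile(rf"(\d+)\s+{UNIT}\b(\s+e\s+mezzo)?")


def parse_amounts(text):
    """Return a list of (quantity, unit) pairs found in a rent string."""
    pairs = []
    for number, unit, half in AMOUNT.findall(text.lower()):
        # "e mezzo" adds one half of the preceding unit.
        quantity = int(number) + (0.5 if half else 0)
        pairs.append((quantity, unit))
    return pairs


single = parse_amounts("1022 lire de piccoli")
dual = parse_amounts("40 ducati e 14 grossi")
triple = parse_amounts("10 ducati, 2 lire, 8 soldi")
fractional = parse_amounts("8 ducati e mezzo")
descriptive = parse_amounts("casa in soler")
```

Entries that yield an empty list, such as "casa in soler", fall into the time-related or function/ownership categories and would need separate handling.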

Results

Limitations and further work

Conclusion

Credits