Love2Late

From FDHwiki

Project Description

The goal of the project is to collect data from matrimonial ads from the beginning of the 20th century and to find matches between the people who published them. The interface of the website will consist of two parts:

  • a search engine where one can enter their own “ad” (give some information and choose between certain criteria) in order to find their match from the archives
  • some observations and analysis of the data itself. Can everyone publishing an ad find a person similar enough in the database? Who do people match with? Can a difference between time periods be observed?

Considering that all the publications are anonymous, the results of a search will be presented as extracts from the actual pages of the journals the corresponding ads came from along with some publishing information (e.g. date, place).

Planning

Project Progress

Extraction and OCR

The first phase of the project was finding periodicals from the beginning of the 20th century that published matrimonial ads. Once the corpus was complete, the page images were extracted using Raphael Barman’s Gallica wrapper, and the metadata from Gallica was stored. Optical character recognition was performed using Transkribus, producing XML files that contain the text equivalent of each page as well as the coordinates of the place where each segment appears.
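Transkribus exports page transcriptions in PAGE XML, which pairs each text line with its pixel coordinates. The sketch below shows how such a file could be parsed with the standard library; the XML snippet is a hypothetical minimal example, not an actual export from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal PAGE-style export with a single text line;
# real Transkribus files contain many regions and lines per page.
PAGE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,240 100,240"/>
        <TextEquiv><Unicode>Monsieur, 40 ans, situation stable...</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(xml_text):
    """Return (text, polygon) for every text line on the page."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//p:TextLine", NS):
        text = line.find("p:TextEquiv/p:Unicode", NS).text
        polygon = [tuple(map(int, pt.split(",")))
                   for pt in line.find("p:Coords", NS).get("points").split()]
        lines.append((text, polygon))
    return lines

for text, polygon in extract_lines(PAGE_XML):
    print(text, polygon)
```

The polygon makes it possible to crop the corresponding extract from the page image when presenting a search result.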

Database creation

The text was segmented based on its content and its location in the periodical (an ad always has the same format and never appears in the margins). As a result, the ads are extracted relatively easily and the irrelevant information is discarded. Subsequently, some information was retrieved directly from each ad, such as the age and gender of the participants and the keywords with which they presented themselves. The end result is a database that records every ad, basic information about its author, and the journal it appeared in (publication details and the page image).
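Retrieving the age and gender directly from an ad can be done with simple pattern matching, since the ads follow a regular format. This is a minimal sketch under that assumption; the example ad, the field names, and the gendered keywords are illustrative, not the project's actual schema.

```python
import re

# Hypothetical cleaned ad text, in the style of the corpus.
ad = "Dame distinguée, 35 ans, bonne éducation, épouserait monsieur sérieux."

def parse_ad(text):
    """Extract age and gender from an ad with simple heuristics."""
    record = {"age": None, "gender": None}
    # Ages in the corpus appear as "NN ans".
    m = re.search(r"(\d{2})\s*ans", text)
    if m:
        record["age"] = int(m.group(1))
    # Gender inferred from the opening noun of the ad.
    opening = text.split(",")[0].lower()
    if any(w in opening for w in ("dame", "demoiselle", "veuve")):
        record["gender"] = "F"
    elif any(w in opening for w in ("monsieur", "homme", "veuf")):
        record["gender"] = "M"
    return record

print(parse_ad(ad))  # {'age': 35, 'gender': 'F'}
```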

Cleaning the data

Once the database was created, some missing information, as well as errors from the OCR, needed to be addressed. Peculiar entries were examined manually and removed when necessary.
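Finding the peculiar entries that deserve manual review can be partly automated with sanity checks. The sketch below is one possible approach; the thresholds and messages are hypothetical, not the checks actually used in the project.

```python
# Hypothetical sanity checks that flag entries for manual review;
# thresholds are illustrative.
def flag_for_review(record):
    """Return a list of reasons why a record looks suspicious."""
    issues = []
    age = record.get("age")
    if age is not None and not (16 <= age <= 90):
        issues.append("implausible age (likely OCR error)")
    if record.get("gender") is None:
        issues.append("gender could not be inferred")
    return issues

# An OCR error such as "135 ans" would be caught here.
print(flag_for_review({"age": 135, "gender": None}))
```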

Text processing

The text of the ads was parsed using the Python library spaCy in order to tokenize words and associate a part-of-speech tag with each of them. This step made it possible to find patterns that distinguish the different parts of an announcement, and it opened the way for further natural language processing. An immediate observation was that the part of the ad describing what the person is looking for is almost always preceded by one of several key words (cherche, épouserait, désirerait, etc.). As a result, every ad could be broken into two parts: the description of the person and that of the person they wish to encounter.
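The split on key words described above can be sketched with a simple regular expression; the pivot words come from the corpus, while the example ad and the key-word list (which is longer in practice) are illustrative.

```python
import re

# Pivot words observed in the corpus separate the self-description
# from the description of the sought partner (list abridged).
PIVOT = re.compile(r"\b(cherche|épouserait|désirerait)\b", re.IGNORECASE)

def split_ad(text):
    """Split an ad into (self-description, sought-description)."""
    m = PIVOT.search(text)
    if m is None:
        return text, ""
    return text[:m.start()].strip(" ,."), text[m.start():].strip()

self_part, sought_part = split_ad(
    "Monsieur, 40 ans, situation stable, épouserait demoiselle sérieuse.")
print(self_part)    # Monsieur, 40 ans, situation stable
print(sought_part)  # épouserait demoiselle sérieuse.
```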

Next steps

  • Separate the different attributes found in the bulk of the ads into categories such as profession, physical description, moral qualities, fortune... Categories will be based on the observation of recurrent information contained in the ads.
  • Implement a classification algorithm that forms clusters of similar descriptions and attributes for each of the categories defined in the previous step.
  • Choose an appropriate metric to calculate the similarity between ads and find the matches that minimize the sum of the distances over all attributes. The distance for an attribute will be based on the cluster it belongs to and the cluster to which the compared attribute belongs. Distances will be specific to each category.
  • Find patterns and extract some statistics about the matches
  • Visualize the obtained results and build the web interface
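The planned matching step could look roughly like the sketch below: each attribute is mapped to a cluster label per category, and the match score is the sum of per-category cluster distances. The category names, cluster labels, and distance tables are hypothetical placeholders; the real ones will come from the clustering step.

```python
# Hypothetical per-category distances between clusters; symmetric
# pairs are stored once. Real tables would be learned from the data.
CLUSTER_DIST = {
    "profession": {("manual", "manual"): 0, ("manual", "clerical"): 1,
                   ("clerical", "clerical"): 0},
    "fortune":    {("modest", "modest"): 0, ("modest", "comfortable"): 2,
                   ("comfortable", "comfortable"): 0},
}

def cluster_distance(category, a, b):
    """Distance between two cluster labels within one category."""
    table = CLUSTER_DIST[category]
    return table.get((a, b), table.get((b, a), 0))

def match_score(ad, sought):
    """Sum of per-category distances; lower means a better match."""
    return sum(cluster_distance(c, ad[c], sought[c]) for c in ad if c in sought)

ad = {"profession": "clerical", "fortune": "comfortable"}
sought = {"profession": "manual", "fortune": "modest"}
print(match_score(ad, sought))  # 1 + 2 = 3
```

Scoring every ad against the sought-partner description of every other ad and keeping the minimum-score pairs would then yield the matches to analyze.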
Step | Things to do                              | Week
-----|-------------------------------------------|------
1    | Classification of information by category | 10-11
2    | Clustering within each category           | 11
3    | Implement metric and compute matches      | 11-12
4    | Data analysis                             | 13
5    | Data visualization                        | 13