Love2Late
Todor.manev (talk | contribs)
Revision as of 16:07, 22 November 2019
Project Description
The goal of the project is to collect data from matrimonial ads published at the beginning of the 20th century and to find matches between the people who placed them. The website interface will consist of two parts:
- a search engine where users can enter their own “ad” (provide some information and choose among certain criteria) in order to find their match in the archives
- some observations and analysis of the data itself. Can everyone publishing an ad find a person similar enough in the database? Who do people match with? Can a difference between time periods be observed?
Considering that all the publications are anonymous, the results of a search will be presented as extracts from the actual pages of the journals the corresponding ads came from along with some publishing information (e.g. date, place).
Planning
Step | Things to do | Week |
---|---|---|
1 | Classification of information by category | 10-11 |
2 | Clustering in category | 11 |
3 | Implement metric and compute matches | 11-12 |
4 | Data analysis | 13 |
5 | Data visualization | 13 |
Project Progress
Extraction and OCR
The first phase of the project was to find periodicals from the beginning of the 20th century that published matrimonial ads. Once the corpus was complete, the images were extracted using Raphael Barman’s Gallica wrapper, and the metadata from Gallica was stored. Optical character recognition (OCR) was then performed using Transkribus, producing XML files that contain the text of each page as well as the coordinates of the place where each segment appears.
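As an illustration of how such OCR output could be read, here is a minimal sketch. It assumes the XML files follow the PAGE XML layout (2013-07-15 schema), which is an assumption and not something stated above; the sample document and the function names `parse_points` and `read_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the OCR export uses the PAGE XML namespace below.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Tiny hand-written sample standing in for one exported page (not real data).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,200 900,200 900,260 100,260"/>
      <TextLine id="r1l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>Monsieur 35 ans, employé, épouserait demoiselle sérieuse.</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Return one dict per text line: its id, text and bounding box."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        if coords is None or unicode_el is None:
            continue  # skip lines without geometry or recognised text
        pts = parse_points(coords.get("points", ""))
        if not pts:
            continue
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        lines.append({
            "id": line.get("id"),
            "text": unicode_el.text or "",
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return lines


if __name__ == "__main__":
    for entry in read_lines(SAMPLE):
        print(entry["id"], entry["bbox"], entry["text"])
```

Keeping the bounding box next to the text makes it possible to use a segment's position on the page later on.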
Database creation
The text was segmented based on its content and its location in the periodical (an ad always has the same format and never appears in the margins). As a result, the ads can be extracted relatively easily and the irrelevant text discarded. Subsequently, some information was retrieved directly from each ad, such as the age and gender of the participants and the keywords with which they presented themselves. The result of this process is a database containing, for every ad, its text, basic information about its author, and the journal it appeared in (publication details and the page image).
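The kind of extraction described above could look roughly like the following sketch. The margin ratio, the gender vocabulary and the function names are hypothetical, and the rule that the earliest gendered term describes the author is a simplifying assumption, not the project's documented method.

```python
import re

# Assumption: ages are written as two digits followed by "ans".
AGE_RE = re.compile(r"\b(\d{2})\s*ans\b", re.IGNORECASE)

# Hypothetical vocabulary; a real list would come from reading the ads.
GENDER_TERMS = {
    "M": ("monsieur", "veuf", "jeune homme"),
    "F": ("demoiselle", "dame", "veuve", "jeune fille"),
}


def in_main_column(bbox, page_width, margin_ratio=0.08):
    """True if a box (x_min, y_min, x_max, y_max) lies outside the side margins."""
    margin = page_width * margin_ratio
    x_min, _, x_max, _ = bbox
    return x_min >= margin and x_max <= page_width - margin


def extract_basic_info(ad_text):
    """Return the first age found and the gender of the earliest gendered term."""
    age_match = AGE_RE.search(ad_text)
    age = int(age_match.group(1)) if age_match else None

    gender, best_pos = None, None
    for label, terms in GENDER_TERMS.items():
        for term in terms:
            m = re.search(r"\b" + re.escape(term) + r"\b", ad_text, re.IGNORECASE)
            # Assumption: the author is described before the person sought.
            if m and (best_pos is None or m.start() < best_pos):
                gender, best_pos = label, m.start()
    return {"age": age, "gender": gender}


if __name__ == "__main__":
    print(extract_basic_info("Monsieur 35 ans, employé, épouserait demoiselle sérieuse."))
    print(in_main_column((300, 200, 900, 260), page_width=2000))
```

Entries for which no age or gender is found come back as `None`, which makes them easy to flag for manual review.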
Cleaning the data
Once the database was created, missing information and OCR errors needed to be addressed. Peculiar entries were examined manually and removed if needed.
Text processing
The text of the ads was parsed with the Python library spaCy in order to tokenize it and assign a part-of-speech tag to each word. This step made it possible to find patterns that help distinguish the different parts of an announcement, and it opens the way to further use of natural language processing tools. Its immediate result was the observation that the part of the ad describing what the person is looking for is almost always preceded by one of several keywords (cherche, épouserait, désirerait, etc.). As a result, every ad could be broken into two parts: the description of the author and the description of the person they wish to meet.
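The two-part split can be illustrated with a short sketch. For brevity it uses plain regular expressions instead of spaCy tokens, lists only the three keywords quoted above (the full list is not given here), and the function name `split_ad` is hypothetical.

```python
import re

# Only the keywords quoted in the text; the real list is longer.
SEEKING_KEYWORDS = ("cherche", "épouserait", "désirerait")

SPLIT_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SEEKING_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def split_ad(ad_text):
    """Split an ad into (self-description, description of the person sought).

    The split happens at the first keyword; if none is found, the whole ad
    is treated as self-description and the second part is empty.
    """
    match = SPLIT_RE.search(ad_text)
    if match is None:
        return ad_text.strip(), ""
    about_self = ad_text[:match.start()].strip(" ,;")
    about_other = ad_text[match.end():].strip(" ,;.")
    return about_self, about_other


if __name__ == "__main__":
    print(split_ad("Monsieur 35 ans, employé, épouserait demoiselle sérieuse."))
```

The word boundaries keep a keyword from matching inside a longer word such as "recherche".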
Next steps
- Separate different attributes found in the bulk of ads into categories such as profession, physical description, moral qualities, fortune... Categories will be created from observations on recurrent information in the ads.
- Implement a clustering algorithm that groups similar descriptions and attributes within each of the categories defined during the previous step.
- Choose an appropriate metric to calculate similarity between ads and find matches that minimise the distances between attributes. These distances will be based on the clusters the attributes belong to and will be specific to each category.
- Find patterns and extract some statistics about the matches
- Visualize the obtained results and build the web interface
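One possible shape for the cluster-based metric mentioned in the list above is sketched below. It assumes each ad has already been reduced to one cluster id per category; the category names, distance values, weights and function names are all hypothetical, since the metric has not been chosen yet.

```python
# Hypothetical per-category distances between cluster ids (symmetric,
# keyed by the sorted pair). Identical clusters have distance 0.
CLUSTER_DISTANCE = {
    "profession": {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 0.7},
    "fortune": {(0, 1): 1.0},
}

# Hypothetical weights expressing how much each category matters.
WEIGHTS = {"profession": 1.0, "fortune": 2.0}


def category_distance(category, a, b):
    """Distance between two cluster ids within one category."""
    if a == b:
        return 0.0
    key = (min(a, b), max(a, b))
    # Unknown pairs default to the maximum distance of 1.0 (an assumption).
    return CLUSTER_DISTANCE.get(category, {}).get(key, 1.0)


def ad_distance(wanted, offered):
    """Weighted mean distance over the categories both sides specify."""
    shared = wanted.keys() & offered.keys()
    if not shared:
        return float("inf")  # nothing to compare
    total = sum(
        WEIGHTS.get(c, 1.0) * category_distance(c, wanted[c], offered[c])
        for c in shared
    )
    return total / sum(WEIGHTS.get(c, 1.0) for c in shared)


def best_match(wanted, candidates):
    """Name of the candidate ad closest to what is wanted, or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda name: ad_distance(wanted, candidates[name]))


if __name__ == "__main__":
    wanted = {"profession": 0, "fortune": 1}
    candidates = {
        "ad_17": {"profession": 1, "fortune": 1},
        "ad_42": {"profession": 0, "fortune": 0},
    }
    print(best_match(wanted, candidates))
```

In this sketch `wanted` would come from the part of an ad describing the person sought and each candidate from the self-description part of another ad. Averaging only over shared categories avoids penalising ads that leave a category unspecified.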