Humans of Paris 1900: Difference between revisions
Revision as of 12:26, 12 December 2019
Motivation
We take inspiration from the famous Instagram page Humans of New York, which features pictures and stories of people living in present-day New York. In a similar fashion, our project, Humans of Paris, aims to be a platform that connects us to the people of 19th century Paris. Photography was still in its early stages when Nadar took up the craft in his atelier in Paris. Through the thousands of pictures taken by him and his son, we can get a glimpse of who lived at the time. We explore the use of deep learning models to cluster similar faces, which gives an alternative view of the collection and allows for serendipitous discovery of patterns and people. There is a story behind every person, and our interface highlights this by associating each person's story with their picture.
Historical Background and Nadar's Collection
Implementation
Website Description
In more concrete terms, our project involves four core interfaces motivated by the above.
- A home page highlighting the most known individuals
- A page (FaceMap) that highlights similarities and differences in the faces of the people in the dataset.
- A page to find your 19th century doppelganger, for fun and to spark interest in people the user might otherwise never have known existed.
- A way to search using tags, to allow users to find individuals of interest.
With each person in the pictures we associate background information crawled from Wikipedia.
Methods & Evaluation
Getting & Processing Metadata
As a first step, we used the library provided by Raphael to get a list of all the photos in the Nadar collection on Gallica. Nadar's collection contains a variety of genres: portraits, comics, caricatures, landscapes, sculptures, etc. To keep our emphasis on 'people of 19th century Paris', we filtered out photographs that are not directly relevant to people of that time.
- Getting individual portraits
We used the metadata of the Gallica collection to filter out irrelevant photographs. Among the attributes in the metadata, we concentrated on 'dc:subject'. This field contains a list with detailed information about the photograph and the entity shown in it: [Names of individuals],(year of birth - year of death) -- [genre of the photograph]. For each row, we ignored subjects that do not contain the substring "-- Portraits" and returned the new list of subjects. This way, we discard landscapes, comics, caricatures, and sculptures. After keeping only 'Portraits', we had a column of lists of varying length, because the number of people featured in a photograph differs. Since our intention is to connect ourselves to the people of 19th century Paris by presenting the story of each Parisian in the photograph, we filtered out the photographs that feature more than one person. To filter and gain insight into people who worked in the same field or had the same role, we created the concept of tags, which help us access and query groups of people.
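The filtering described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the record ids, the sample names, and the assumption that each photograph's 'dc:subject' values are already available as a list of strings are all hypothetical.

```python
PORTRAIT_MARKER = "-- Portraits"


def portrait_subjects(subjects):
    """Keep only the dc:subject entries that contain the portrait marker."""
    return [s for s in subjects if PORTRAIT_MARKER in s]


def single_person_portraits(records):
    """Return {record_id: subject} for photos that show exactly one person.

    `records` maps a (hypothetical) record id to its list of dc:subject strings.
    """
    kept = {}
    for record_id, subjects in records.items():
        portraits = portrait_subjects(subjects)
        if len(portraits) == 1:  # drop non-portraits and group portraits
            kept[record_id] = portraits[0]
    return kept


# Made-up sample records, for illustration only.
records = {
    "photo-1": ["Dupont, Jean (1820-1890) -- Portraits"],
    "photo-2": ["Paris (France) -- Vues"],
    "photo-3": [
        "Martin, Anne (1830-1900) -- Portraits",
        "Martin, Paul (1825-1880) -- Portraits",
    ],
}

print(single_person_portraits(records))
```

Only `photo-1` survives here: `photo-2` has no portrait subject and `photo-3` features two people.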
- First attempt
The Gallica 'dc:subject' metadata holds very brief information on each person: only name, year of birth, and year of death, which is not enough to categorize each person. The 'dc:title' metadata gives the title of each photograph, but the titles vary widely in wording. Moreover, for some photographs, especially 'Portrait du theatre', the descriptions refer to the fictional characters that performers had played, not to the performers themselves. As a result, we had to find another dataset to finish this task.
- Second attempt
Data.bnf.fr is a BnF project that aims to make the data produced by the BnF more visible on the Web and to federate it within and outside the catalogues. Since Gallica is also a BnF project, it is reasonable to assume that a person named in the Gallica metadata will have a document or page in the data.bnf.fr semantic web. To get the corresponding pages, we queried data.bnf.fr through a Python SPARQL API. The data.bnf.fr schema has three name attributes: foaf:name, foaf:givenName, and foaf:familyName. We checked how names are arranged in <dc:subject>, compared this arrangement with some names in data.bnf.fr, rearranged the names accordingly, and then queried them.
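As a rough sketch of the name rearrangement and query construction, the snippet below parses a subject string and builds a SPARQL query from it. It assumes, hypothetically, that subjects look like "Family, Given (YYYY-YYYY) -- Portraits" and that matching plain literals on foaf:givenName and foaf:familyName is sufficient; the real data may need language tags or looser matching. The function names are illustrative, and sending the query to the endpoint is left out.

```python
import re

# Assumed subject layout: "Family, Given (1820-1890) -- Portraits".
SUBJECT_PATTERN = re.compile(
    r"^(?P<family>[^,]+),\s*(?P<given>[^(]+?)\s*\((?P<birth>\d{4})-(?P<death>\d{4})\)"
)


def parse_subject(subject):
    """Split a dc:subject string into name parts and life dates, or return None."""
    match = SUBJECT_PATTERN.match(subject)
    if match is None:
        return None
    return {
        "family": match["family"].strip(),
        "given": match["given"].strip(),
        "birth": int(match["birth"]),
        "death": int(match["death"]),
    }


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_person_query(given, family, limit=20):
    """Build a SPARQL query that looks a person up by given and family name."""
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?person ?name WHERE {\n"
        f'  ?person foaf:givenName "{escape_literal(given)}" ;\n'
        f'          foaf:familyName "{escape_literal(family)}" ;\n'
        "          foaf:name ?name .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


parsed = parse_subject("Dupont, Jean (1820-1890) -- Portraits")  # made-up name
query = build_person_query(parsed["given"], parsed["family"])
print(query)
```

Keeping the birth and death years from the subject string is useful later, when several entities share the same name.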
Sometimes the query result contained several different entities with the same name. For these cases, we exploited the fact that we are handling 19th century data from a single author, Nadar. Among the namesakes, we chose the one whose lifetime overlaps the most with that of Nadar.
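The namesake rule can be illustrated with a small overlap computation. This is a sketch under assumptions: candidates are represented as dictionaries with hypothetical `uri`, `birth`, and `death` keys, and the reference lifespan is taken to be that of Félix Nadar (1820-1910).

```python
# Assumption: the reference is Félix Nadar's lifespan, 1820-1910.
NADAR_LIFESPAN = (1820, 1910)


def overlap_years(a, b):
    """Number of years shared by two (start, end) intervals, 0 if disjoint."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start)


def pick_namesake(candidates, reference=NADAR_LIFESPAN):
    """Return the candidate whose lifetime overlaps the reference the most."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: overlap_years((c["birth"], c["death"]), reference),
    )


# Made-up candidates sharing one name.
candidates = [
    {"uri": "candidate-a", "birth": 1750, "death": 1825},
    {"uri": "candidate-b", "birth": 1830, "death": 1895},
]

print(pick_namesake(candidates)["uri"])
```

Here `candidate-b` is chosen, since it shares 65 years with the reference interval against 5 for `candidate-a`. Ties are resolved in favour of the first candidate in the list, which is a choice of this sketch rather than something the text specifies.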
Creating Tags out of metadata
The table on each data.bnf.fr HTML page contains a variety of useful information: name, nationality, language, gender, short description, etc. We used the description and the nationality to create tags.
Our idea for extracting tags from descriptions is to use the most frequent nouns that appear across different people, excluding stopwords. The problem is that some important keywords and entities are broken apart by a word count. For example, "Legion d'honneur" is split into "legion" and "d'honneur", and the original meaning is lost.
We handled this manually. First, we computed the word count without stopwords and collected the frequent words that are not profession-related nouns. We then checked the notes containing those words and examined whether they are part of important (and frequent) phrases. We concatenated each phrase found this way into a single word, so that it can be used as a tag.
After this step, we obtained 300 major keywords, some of which are feminine or plural forms of another. These semantic duplicates were cleaned manually. As a future improvement, this job could be done with a proper French NLP library.
Once the list of all possible tags is complete, we assign tags to each picture by intersecting the list of all tags with the list of words in its description. Finally, the nationality is added to the list of tags.
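The tagging steps can be sketched as below. The stopword list, the phrase table, and the sample descriptions are made up for illustration; the sketch counts all remaining words rather than nouns only, and it does not merge feminine or plural variants, both of which were handled manually in the project.

```python
import re
from collections import Counter

# Hypothetical, hand-maintained resources.
STOPWORDS = {"de", "la", "le", "et", "du", "des"}
PHRASES = {"légion d'honneur": "légion_d'honneur"}


def tokenize(description):
    """Lowercase, merge known phrases into one token, and drop stopwords."""
    text = description.lower().replace("’", "'")
    for phrase, merged in PHRASES.items():
        text = text.replace(phrase, merged)
    return [w for w in re.findall(r"[\w'-]+", text) if w not in STOPWORDS]


def frequent_words(descriptions, top_n):
    """Words ranked by the number of descriptions (people) they appear in."""
    counts = Counter()
    for description in descriptions:
        counts.update(set(tokenize(description)))  # count once per person
    return [word for word, _ in counts.most_common(top_n)]


def assign_tags(description, all_tags, nationality=None):
    """Intersect a description's words with the tag list, then add nationality."""
    tags = set(tokenize(description)) & set(all_tags)
    if nationality:
        tags.add(nationality.lower())
    return sorted(tags)


descriptions = [
    "Peintre. Chevalier de la Légion d'honneur",
    "Écrivain et peintre",
    "Compositeur. Officier de la Légion d'honneur",
]
all_tags = frequent_words(descriptions, top_n=2)
print(assign_tags(descriptions[0], all_tags, nationality="France"))
```

Merging phrases before splitting into words is what keeps "Légion d'honneur" intact as a single tag instead of two unrelated tokens.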
Project execution plan
Milestones
Timeframe | Task | Completion |
---|---|---|
Week 4 (07.10) | Understanding Gallica; Query Gallica API | ✓ |
Week 5 (14.10) | Start preprocessing images; Choose suitable Wikipedia API | ✓ |
Week 6 (21.10) | Choose face recognition library; Get facial vectors; Try database design with Docker & Flask | ✓ |
Week 7 (28.10) | Remove irrelevant backgrounds of images; Extract age and gender from images; Design data model; Extract tags, names, birth and death years out of metadata | ✓ |
Week 8 (04.11) | Set up database environment; Set up mockup user-interface; Prepare midterm presentation | ✓ |
Week 9 (11.11) | Get tags, names, birth and death years in ready-to-use format; Handle Wikipedia false positives; Integrate face recognition functionalities into database | ✓ |
Week 10 (18.11) | Create draft of the website (frontend); Create FaceMap using D3 | |
Week 11 (25.11) | Integrate all functionalities; Finalize project website | |
Week 12 (02.12) | Write project report | |