Paris: address book of the past

From FDHwiki

Revision as of 12:33, 20 December 2022

Introduction

This project works with around 4.4 million datapoints extracted from Paris address books (Bottin Data). The address books date from the period 1839-1922 and contain the name, profession and place of residence of Parisian citizens. In a first step, we align this data with geodata of Paris’s street network; in a second step, we conduct a geospatial analysis on the resulting data.

Motivation

In the 19th century, Paris underwent great transformations. As in many other European cities, industrialization radically changed people’s way of life and completely reordered the workings of both economy and society. At the same time, the city grew rapidly: while only half a million people lived there in 1800, the number of inhabitants increased by a factor of nearly 7 within one century. To control the expansion of the city and to improve people’s living conditions, the city was rebuilt during the Haussmann period (1853-1870), which produced the grand boulevards and the general cityscape that we know today.

While these developments have been well documented and studied extensively, they have mainly been described qualitatively, e.g. by looking at the development of certain streets. The Bottin dataset provides information on persons, their professions and their locations during exactly the period of change described above. It can therefore open new perspectives for research, as it permits analysis of the economic and social transformation on a larger scale. With this data, it becomes possible to follow the development of a chosen profession across the whole city, or to look at the economic transformation of an arrondissement throughout the century.

This project gives an idea of the potential of the Bottin Data to contribute to research on Paris’s development during the 19th century.

Deliverables

  • Github
  • Google Drive


Organisation

Planning:

Week      Part        Tasks
Week 4    Alignment   tba
Week 5                tba
Week 6                tba
Week 7
Week 8
Week 9
Week 10
Week 11   Analysis
Week 12
Week 13
Week 14

Alignment

The Pipeline of the Alignment

The alignment happened in two steps: First, two street network datasets of the years 1836 and 2022 were combined to create one dataset which incorporates all available geodata. Second, this dataset was used to align the Bottin datapoints with a matching street and thus a georeference.

Aligning Street Datasets

The Street Datasets

Vasserot Data (1836)

Open Street Data (2022)


Aligning Bottin Data with Street Data

Once the combined dataset of past and present streets in Paris has been constructed, it is used to assign a geolocation to each datapoint of the Bottin Data. This chapter describes the alignment process in detail: we first introduce the Bottin dataset, then elaborate on the methods of the alignment process, and finally estimate the quality of the alignment.
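As a rough illustration of aligning one Bottin street string with the street dataset, the following Python sketch tries an exact match first and falls back to fuzzy matching. The function names, the normalization, the similarity cutoff and the placeholder coordinates are assumptions made for illustration. The project's actual procedure may differ, and `difflib` serves here only as a stand-in similarity measure.

```python
import difflib


def normalize(name):
    """Lowercase and collapse whitespace (hypothetical preprocessing step)."""
    return " ".join(name.lower().split())


def align_street(raw, street_coords, cutoff=0.8):
    """Return (street, coords, method) for one raw street string.

    street_coords maps street names to coordinates. The method is
    'perfect', 'fuzzy' or 'unmatched'. The cutoff is an assumed value.
    """
    lookup = {normalize(street): street for street in street_coords}
    key = normalize(raw)

    # Step 1: exact match on the normalized name.
    if key in lookup:
        street = lookup[key]
        return street, street_coords[street], "perfect"

    # Step 2: fall back to the most similar known street name, if any
    # candidate reaches the similarity cutoff.
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if close:
        street = lookup[close[0]]
        return street, street_coords[street], "fuzzy"

    return None, None, "unmatched"


# Placeholder coordinates, not real geodata.
streets = {"Rue de Rivoli": (0.0, 0.0), "Rue Saint-Honoré": (1.0, 1.0)}

print(align_street("rue de rivoli", streets))
print(align_street("Rue de Rivol1", streets))   # simulated OCR misread
print(align_street("Boulevard Voltaire", streets))
```

Returning the matching method alongside the result makes it possible to assess the quality of perfect and fuzzy matches separately.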

Bottin Dataset

The dataset referred to as “Bottin Data” consists of slightly over 4.4 million datapoints which have been extracted via Optical Character Recognition (OCR) from Paris address books named “Didot-Bottin”. The address books cover 55 different years between 1839 and 1922, with between 37’177 entries (1839) and 130’958 entries (1922) per year. The address books can be examined at the Gallica portal[1]. Each entry consists of a person’s name, their profession or activity, their address and the year of the address book in which it was published.

Because the entries were extracted from the scanned books automatically by OCR, the data is not without errors. Di Lenardo et al.[2], who were responsible for the extraction process, estimated a character error rate of 2.6% with a standard deviation of 0.1%, and an error rate per line of 21% with a standard deviation of 2.9%.

Tagged Professions

Alignment Process

Preprocessing

Perfect Matching

Fuzzy Matching

Quality Assessment

Analysis

Limitations of the Project

We faced several challenges while working on the project.

OCR Mistakes

  • OCR mistakes combined with abbreviations lead to wrong fuzzy matches. Idea: use a customized distance function which penalizes character substitutions and insertions at the end of the string more heavily than at the beginning of the string (EXAMPLE)
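The idea of a position-dependent distance can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the linear weighting scheme, the parameter `alpha` and the function name are assumptions. With `alpha=0` the function reduces to the ordinary Levenshtein distance.

```python
def weighted_edit_distance(a, b, alpha=1.0):
    """Edit distance in which edits near the end of the string cost more.

    An edit at 0-based position pos costs 1 + alpha * pos / max(len(a), len(b)),
    so the cost grows linearly from 1.0 at the start towards 1.0 + alpha
    at the end (assumed weighting scheme).
    """
    n, m = len(a), len(b)
    longest = max(n, m, 1)

    def cost(pos):
        return 1.0 + alpha * pos / longest

    # dist[i][j] = cheapest way to turn a[:i] into b[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + cost(i - 1)
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + cost(j - 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pos = max(i, j) - 1
            same = a[i - 1] == b[j - 1]
            substitution = dist[i - 1][j - 1] + (0.0 if same else cost(pos))
            deletion = dist[i - 1][j] + cost(i - 1)
            insertion = dist[i][j - 1] + cost(j - 1)
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m]


# A misread first character is cheaper than a misread last character.
early = weighted_edit_distance("nue monge", "rue monge")
late = weighted_edit_distance("rue monga", "rue monge")
print(early, late)
```

Whether a linear weighting is appropriate, and how large `alpha` should be, would have to be checked against manually verified matches.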

Missing streets

  • Some streets appear in neither of the two datasets (1836 and 2022), e.g. rue d'Allemagne. Possible remedy: obtain additional street data from the 19th century
  • The location given in the Bottin dataset is not necessarily a street, e.g. "cloître"

Ambiguous streetnames

  • In the 1836 street network dataset, some datapoints share the same streetname but are not located near each other
  • Many Bottin datapoints were aligned on the "short streetname", although this streetname might refer to many different streets (was the reference clear at the time the address book was published?). Possible remedy: incorporate information on when each street was built
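To get a feeling for how often a short streetname is ambiguous, one could group the full street names by their short form and list the groups with more than one member. The Python sketch below assumes a simple, hypothetical definition of "short streetname" (street type and leading particles removed); the list of street types is illustrative and not exhaustive, and the project may define the short name differently.

```python
from collections import defaultdict

# Assumed, non-exhaustive vocabulary for this illustration.
STREET_TYPES = ("rue", "boulevard", "avenue", "place", "passage", "quai")
PARTICLES = ("de", "du", "des", "la", "le")


def short_name(full_name):
    """Strip the street type and leading particles (hypothetical definition)."""
    parts = full_name.lower().split()
    if parts and parts[0] in STREET_TYPES:
        parts = parts[1:]
    while parts and parts[0] in PARTICLES:
        parts = parts[1:]
    return " ".join(parts)


def ambiguous_short_names(full_names):
    """Map each short name to its full names, keeping only ambiguous ones."""
    groups = defaultdict(set)
    for name in full_names:
        groups[short_name(name)].add(name)
    return {short: sorted(names) for short, names in groups.items() if len(names) > 1}


names = ["Rue de Rivoli", "Rue Voltaire", "Boulevard Voltaire", "Quai Voltaire"]
print(ambiguous_short_names(names))
```

Datapoints whose short name appears in this mapping could then be excluded from the analysis or resolved with additional information such as the street type.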


Matching on street level

  • Each datapoint is represented by the centroid of its street, which is imprecise, especially for long streets. Possible remedy: an additional step to clean and align address numbers
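The size of this imprecision can be estimated per street. The Python sketch below assumes that each street is available as a polyline in planar coordinates (e.g. metres); this representation and the function names are assumptions for illustration. It computes the length-weighted centroid and the largest distance from that centroid to any vertex of the street, i.e. a rough bound on how far an address on that street can lie from its assigned point.

```python
import math


def polyline_centroid(points):
    """Length-weighted centroid of a polyline given as [(x, y), ...]."""
    total = 0.0
    cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        length = math.hypot(x2 - x1, y2 - y1)
        # Each segment contributes its midpoint, weighted by its length.
        cx += length * (x1 + x2) / 2
        cy += length * (y1 + y2) / 2
        total += length
    if total == 0.0:
        # Degenerate street (single point or zero length).
        return points[0]
    return (cx / total, cy / total)


def max_offset(points):
    """Largest distance between the centroid and any vertex of the street."""
    cx, cy = polyline_centroid(points)
    return max(math.hypot(x - cx, y - cy) for x, y in points)


short_street = [(0.0, 0.0), (100.0, 0.0)]
long_street = [(0.0, 0.0), (3000.0, 0.0)]
print(max_offset(short_street), max_offset(long_street))
```

In this toy example, an address at the end of the 3000-unit street lies 1500 units from the centroid, compared with 50 units for the short street.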

Small part of potential analysis

  • Time constraint: this was a semester project, and the alignment already took up much of the available resources
  • Knowledge constraint: more (historical) knowledge about Paris is needed to put the analysis into context

Outlook

Possible Directions of Research

  • Improving the alignment: include address numbers; include more street network datasets; align in two steps, first on the short streetname and then on the type of street; use another distance function with a heavier penalty for substitutions at the end of the string; align on the whole datapoint (name, profession, street) to account for businesses existing for more than one year
  • Work on professions: cluster them into thematic fields, possibly classify them by social reputation (-> gentrification of a quartier)
  • Analysis: gentrification, development of arrondissements, influence of political decisions on the social/economic landscape, ...
  • Visualization: interactive map

A good alignment is needed to derive facts and knowledge from the data.

References

  1. Gallica portal, Bibliothèque Nationale de France, https://gallica.bnf.fr/accueil/en/content/accueil-en?mode=desktop; example for an address book on https://gallica.bnf.fr/ark:/12148/bpt6k6314697t.r=paris%20paris?rk=21459;2
  2. di Lenardo, Isabella; Barman, Raphaël; Descombes, Albane; Kaplan, Frédéric, 2019, "Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922.", https://doi.org/10.34894/MNF5VQ, DataverseNL, V2