Influencers of the past
Revision as of 10:35, 11 December 2019
Abstract
The goal of this project is to show who the notable people in Paris were in 1884 and 1908 and where they lived. Our expected output is a webpage showing maps from both 1884 and 1908, with clusters indicating the number of inhabitants. The more you zoom in, the more details you can see. You can click on a point to get more information about someone (e.g. his/her name). We will provide an analysis of the results.
Here is the initial sketch of our project: Sketch of Influencers of the past. You can find the final website at the following link: Project website.
Planning
Task | Status | Deadline |
---|---|---|
Extract the data | Done | (: |
Clean the data | Done | (: |
Get coordinates of the addresses | Done | 22.11.19 |
Georeference old maps | Done | 22.11.19 |
Display people on maps | Done | 29.11.19 |
Web interface and analysis | Done | 06.12.19 |
Historical sources
In this project we are dealing with two main sources: the Annuaire du grand monde parisien of 1884 and the Paris-mondain : annuaire du grand monde parisien et de la colonie étrangère... of 1908. These annuaires contain a list of people considered famous and influential at the time, giving for each of them their name and address. As stated in the preface by the author of the 1884 annuaire, Pol Hanin, the goal of such a book was to honor the high society of Paris and create a truly useful list of famous people[1].
For the visualisation, we have used two old maps of Paris. For the year 1884, we have the Nouveau plan complet illustré de la ville de Paris en 1884 by Alexandre Aimé Vuillemin and Charles Dyonnet. For the year 1908, we have the Plan de Paris, Mars 1908 et du chemin de fer métropolitain, distinguant les lignes déclarées d'utilité publique; les lignes concédées à titre éventuel et la concession de la Cie Nord-Sud by L. Wuhrer. Both maps are stored at the Bibliothèque nationale de France, in the département Cartes et plans. Note that the second map also shows the subway network of Paris at the time, the first line of which was opened on July 19th, 1900 for the Olympic Games of that year[2].
Main steps
Extracting the data from the directories
Our first step is to extract all the names and addresses from the two directories. To do so, we use Transkribus[3] to get the OCR output and then start parsing the information.
Cleaning the data
This is the principal step in our project. The data the OCR gives us is quite messy: there are a lot of errors, and we definitely need to correct them if we hope to obtain the geocoordinates of our addresses. We also need to harmonise our results. For instance, we want to treat 'r.' and 'rue' (the French word for 'street'), or 'bd' and 'boulevard', in the same way. Having all our addresses in a standardized form also makes it easier to retrieve the corresponding geocoordinates. The principal challenge of this step is that we have two different OCR outputs for the two years (1884 and 1908). We thus had to implement two specific parsers.
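The harmonisation described above can be sketched as a small token-level substitution. This is only an illustration: the abbreviation table contains just the two examples mentioned in the text, and the function name `normalise_address` is hypothetical, not taken from the project's parsers.

```python
import re

# Illustrative table: only the two abbreviations named in the text.
# A real cleaning step would need a much longer list.
ABBREVIATIONS = {
    "r.": "rue",
    "bd": "boulevard",
}


def normalise_address(raw: str) -> str:
    """Lower-case an address, collapse whitespace and expand known abbreviations."""
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    tokens = cleaned.split(" ")
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(normalise_address("R.  de Rivoli 12"))
    print(normalise_address("Bd Haussmann 3"))
```

Working on whole tokens rather than raw substrings avoids rewriting letters inside longer words, for example the "bd" that could appear in the middle of a name.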
Finding the geolocation of the addresses
To be able to show the addresses on the map, we need to find their geolocation (latitude/longitude coordinates). We proceeded in two steps. First, we used the list of addresses of Paris created by the DHLab. This database provides a list of old Paris addresses with their start and end dates (if known) and their geocoordinates (latitude and longitude, directly in the format EPSG:3857 handled by Leaflet[4]). To complete our database, we then used the GeoPy library [5], a Python client for geocoding services. It simply takes our remaining addresses and returns their geocoordinates.
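The two-step lookup can be sketched as follows. The names `locate` and `to_web_mercator` are hypothetical, the historical list is assumed to be loaded into a plain dictionary, and the text does not say which geocoding service was queried through GeoPy, so the Nominatim example in the comment is only one possibility. The second function shows the standard spherical formula that turns latitude/longitude in degrees into EPSG:3857 metres.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def locate(
    address: str,
    historical_index: Dict[str, Coordinates],
    fallback: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Try the historical address list first, then a fallback geocoder."""
    if address in historical_index:
        return historical_index[address]
    # Assumption: adding the city helps a modern geocoder disambiguate.
    return fallback(address + ", Paris, France")


def to_web_mercator(latitude: float, longitude: float) -> Tuple[float, float]:
    """Convert degrees to EPSG:3857 (x, y) coordinates in metres."""
    x = EARTH_RADIUS_M * math.radians(longitude)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(latitude) / 2))
    return x, y


# With geopy installed, a fallback could be built like this (not run here,
# and Nominatim is an assumed choice of service):
#   from geopy.geocoders import Nominatim
#   geocoder = Nominatim(user_agent="influencers-of-the-past-demo")
#   def fallback(query):
#       hit = geocoder.geocode(query)
#       return (hit.latitude, hit.longitude) if hit else None
```

Passing the geocoder in as a function keeps the lookup order testable without network access, and makes it easy to count how many addresses fall through both steps.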
Georeference old maps of Paris
Once we have the geocoordinates of our addresses, we need to georeference old maps of Paris. To do so we use Georeferencer[6]. By locating homologous points on the old map and on the present-day map, this tool makes it possible to project geocoordinates onto the old map. The result can then be used with the library Leaflet[7] to visualise our results.
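To give an idea of how homologous points define a projection, the sketch below fits an affine transformation from exactly three control points and applies it to a new point. This is an assumption made for illustration: the text does not describe which transformation Georeferencer actually uses, and the function name `fit_affine` is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def _solve3(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system with Cramer's rule."""
    det = _det3(matrix)
    if abs(det) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for row in range(3):
            replaced[row][col] = rhs[row]
        solution.append(_det3(replaced) / det)
    return solution


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> Callable[[Point], Point]:
    """Return a function mapping src coordinates to dst coordinates.

    src: three points in geographic coordinates.
    dst: the same three points in old-map (e.g. pixel) coordinates.
    """
    if len(src) != 3 or len(dst) != 3:
        raise ValueError("exactly three control points are required")
    matrix = [[x, y, 1.0] for x, y in src]
    a, b, c = _solve3(matrix, [p[0] for p in dst])
    d, e, f = _solve3(matrix, [p[1] for p in dst])

    def transform(point: Point) -> Point:
        x, y = point
        return (a * x + b * y + c, d * x + e * y + f)

    return transform
```

Three points determine an affine map exactly. With more control points, as is usual on old maps, one would instead fit the coefficients by least squares or use a more flexible transformation.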
Visualise results
Once we have all our elements we can start visualising our results. At first we tried to keep using Python, with the module Folium [8] (which wraps Leaflet). However, the results were not great: the map took a long time to load and we had little control over how the people were displayed. This is why we decided to switch to JavaScript, which also made it much simpler to embed the maps in our website. Then we had to decide how to display the famous people on the map. The naive way would be to simply put all our addresses on the map, but given the large number of addresses we have (a few thousand) this would result in an overcrowded map.

Our first idea is therefore to cluster our addresses when they are near each other. At low zoom levels, this makes it possible to visualise 'influential' neighbourhoods, for instance. Then, as one zooms in, one eventually reaches a level where each person is shown as a dot. At that level, clicking on a dot opens a pop-up with additional information on the person (such as the name). To do so we use the plugin Leaflet.markercluster.

This is a first step to show how "clustered" the famous people are, but we want to implement other visualisations to show it better. The first one uses the plugin Leaflet.heat, a simple heatmap plugin, to represent the density of famous people. The second one adds the arrondissements of Paris[9] to the map, colouring them according to the number of famous people within. Finally, the same thing is done with the quarters of Paris[10]. Notice that both the arrondissements and the quarters date from 1860[11] and have not changed much up to the present day, meaning that finding the fanciest quarters is meaningful (even without knowing the precise history of Paris).
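Colouring the arrondissements or quarters requires counting how many people fall inside each polygon. The sketch below shows that counting step with a standard ray-casting test. It is written in Python for consistency with the earlier data-processing steps, it is not the project's JavaScript code, and it assumes each area is a single simple polygon given as a list of (longitude, latitude) vertices; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Polygon = List[Point]


def contains(polygon: Polygon, point: Point) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[i - 1]  # previous vertex, wrapping around
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_per_area(areas: Dict[str, Polygon], people: Iterable[Point]) -> Counter:
    """Count how many points fall inside each named area."""
    counts = Counter({name: 0 for name in areas})
    for person in people:
        for name, polygon in areas.items():
            if contains(polygon, person):
                counts[name] += 1
                break  # areas are assumed not to overlap
    return counts
```

The resulting counts can then be mapped to a colour scale, one colour per area. Points that fall outside every polygon are simply not counted.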
Implementation details
Quality assessment
In this section, we assess the quality of our processes. First we evaluate the quality of our parsing of the OCR output, comparing the number of entries in the annuaire with the number of addresses we have to clean. Note that for the 1908 list, the OCR has given us a text file, and therefore this evaluation combines both the quality of the OCR and the quality of our parsing methods to extract each name/address pair. For the 1884 list, the OCR directly gives us a table of names and addresses, thus this evaluation assesses the quality of the OCR, over which we have no control. To estimate the number of entries in the actual annuaire, we have counted them manually on a few pages to get the mean number of people per page (a fairly constant number due to the clear structure of the annuaires) and multiplied it by the number of pages. We get the following results:
Year | Entries per page | Number of pages | Total number of entries | Output of the OCR | After removing missing values | Quality assessment [%] |
---|---|---|---|---|---|---|
1884 | 35 | 182 | ~6400 | 5709 | 5590 | 87 |
1908 | 40 | 310 | ~12400 | 11045 | 8394 | 68 |
Then we need to evaluate the quality of our cleaning of the addresses and how many coordinates we managed to get. Those two steps are evaluated together, as the quality of our cleaning directly affects the number of coordinates we get. Here are the results:
Year | Entries pre-cleaning | Entries with coordinates | Quality assessment [%] |
---|---|---|---|
1884 | 5590 | 3572 | 64 |
1908 | 8394 | 5295 | 63 |
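The percentages in both tables, and the overall coverage that combines them, can be recomputed from the raw counts. The sketch below uses only numbers from the tables. Rounding the estimated totals to the nearest hundred is inferred from the "~6400" figure, and the variable names are illustrative.

```python
# Raw counts taken from the two tables above.
COUNTS = {
    1884: {"per_page": 35, "pages": 182, "extracted": 5590, "geocoded": 3572},
    1908: {"per_page": 40, "pages": 310, "extracted": 8394, "geocoded": 5295},
}


def percentage(part: float, whole: float) -> int:
    """Share of `part` in `whole`, rounded to the nearest percent."""
    return round(100 * part / whole)


def assess(counts: dict) -> dict:
    """Recompute the estimated total and the three quality percentages."""
    # Assumption: the estimate is rounded to the nearest hundred entries.
    estimated_total = round(counts["per_page"] * counts["pages"], -2)
    return {
        "estimated_total": estimated_total,
        "extraction": percentage(counts["extracted"], estimated_total),
        "geocoding": percentage(counts["geocoded"], counts["extracted"]),
        "overall": percentage(counts["geocoded"], estimated_total),
    }


if __name__ == "__main__":
    for year, counts in COUNTS.items():
        print(year, assess(counts))
```

The overall figure is the geocoded count divided by the estimated total, which is close to, though not always exactly equal to, the product of the two rounded stage percentages.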
Overall, we have managed to get the coordinates of 56% of the people in the 1884 annuaire and 43% of the people in the 1908 annuaire. These numbers seem quite low, but it is important to stress that, especially for the 1908 entries, the OCR output was of poor quality. This was to be expected, as the 1908 annuaire itself has less structure than the 1884 one and the quality of the images on Gallica is poorer. In some cases, it would be hard even for a human to read the exact information, as in the following example.
Links
References
- ↑ Preface of the Annuaire du grand monde parisien, 1884
- ↑ Gustave Fulgence, « Le plan de métro de Paris revisité », Confins (Online), 30; posted on February 24th 2017, seen on December 8th 2019; DOI : 10.4000/confins.11869
- ↑ Transkribus
- ↑ Leaflet
- ↑ GeoPy Contributors, "GeoPy Documentation", 26/05/2019
- ↑ Georeferencer
- ↑ Leaflet
- ↑ "Folium documentation"
- ↑ "Geocoordinates of Paris arrondissements"
- ↑ "Geocoordinates of Paris quarters"
- ↑ "Historical quarters of Paris"