Projects
Revision as of 05:56, 9 October 2019

Projects 2019

Guidelines

  • (1) Select an image series on Paris
  • (2) Make a sketch (drawing + textual description) of a final interface
  • (3) Extract the maximum information out of it (Train a segmenter, Train a handwritten recognition system)
  • (4) Export this information to other places on the Web, or build a website with specific services

As this year's focus is Paris and interaction between projects is desired, projects should ideally make use of a common reference system. Examples of such a system are street names, addresses or, in some cases, names of people.

Image collections

Specific examples

  • Photographs by Eugène Atget documenting the forgotten Paris of the 19th century; they can be found on Gallica and in the INHA library.
  • Architectural photography collection of the Archives de Paris.
  • Grand monde de Paris, an address book of famous people of Paris, found on Gallica.
  • Names and reservations for some theatres of Paris, found on Gallica, together with plans of the theatre halls, also found on Gallica.

Resources

Data extraction tools

  • Gallica wrapper to download documents, images and OCR output from Gallica directly from Python.
  • Transkribus (https://transkribus.eu) to produce your own OCR.
  • VGG Image Annotator (VIA, http://www.robots.ox.ac.uk/~vgg/software/via/) to annotate your documents.
  • OCR and HTR are available for documents that do not yet have them (caveat for HTR: the data series should be uniform and must be annotated by hand).
  • dhSegment (https://dhlab-epfl.github.io/dhSegment/) to segment visually distinct parts of documents.
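Besides the wrapper, Gallica documents can also be fetched directly through its IIIF Image API. The sketch below only builds the request URL; the ark identifier is a made-up example, and the URL pattern should be checked against Gallica's current IIIF documentation before use.

```python
def gallica_iiif_url(ark, page=1, region="full", size="full"):
    """Build a IIIF Image API URL for a page of a Gallica document,
    identified by its ark. Region and size follow IIIF syntax,
    e.g. region="0,0,500,500" or size="!1000,1000"."""
    return (f"https://gallica.bnf.fr/iiif/ark:/12148/{ark}"
            f"/f{page}/{region}/{size}/0/native.jpg")

# Hypothetical document identifier, for illustration only:
url = gallica_iiif_url("btv1b8449691v", page=3)
print(url)
```

The returned URL can then be downloaded with any HTTP client (e.g. `urllib.request` or `requests`).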

Databases/APIs

Sketches

  • Guilhem Sicard, Todor Manev: Sketch of Love2Late, Sketch of The 1897 directory of businesses
  • Giacomo Alliata, Andrea Scalisi: Sketch of Influencers of the past, Sketch of Artwork origins in Paris
  • Leonore Guillain, Haeeun Kim, Liamarcia Bifano: Sketch of Humans of Paris 1900, Sketch of ImmoSearch Paris 1900, Sketch of Job Search Paris 1900
  • Arthur Parmentier, Bertil Wicht: Sketch of Book editing in Paris 1847, Sketch of Trip advisor Paris, Sketch of Bike from Paris
  • Robin Szymczak, Cédric Tomasini: Sketch of Virtual Louvre, Sketch of Diving into the Opera, Sketch of Article in context

Projects 2018

  • (1) Choose a map on Gallica: https://gallica.bnf.fr/accueil/?mode=desktop
  • (2) Extract the maximum information out of it (Train a segmenter, Train a handwritten recognition system)
  • (3) Export this information to other places on the Web, or build a website with specific services

Projects websites

Shortest-Path Route Extraction From City Map

The goal of this project is to give people an idea of what it might have been like to find one's way around cities of the past, by building a navigation tool not unlike Google Maps or similar services - except that it is for the past. The application will create and display the shortest path between user-selected start and end points, using information extracted from an old city map. The finished application will be made available via a web interface.

On a technical level, the approach is based on extracting the road network from the map, representing it as an undirected planar graph, then applying Dijkstra's algorithm to solve the routing problem.
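As a sketch of that routing step, a standard Dijkstra over an adjacency-dict graph might look like the following; the graph, node names and edge weights are invented for illustration, not taken from the project's map.

```python
import heapq

def dijkstra(graph, start, end):
    """Return (distance, path) for the shortest route in an undirected
    weighted graph given as {node: {neighbour: edge_length}}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == end:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Toy road network: intersections as nodes, street segments as weighted edges.
roads = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"A": 2.0, "C": 1.0, "D": 4.0},
    "C": {"A": 5.0, "B": 1.0, "D": 1.0},
    "D": {"B": 4.0, "C": 1.0},
}

print(dijkstra(roads, "A", "D"))  # → (4.0, ['A', 'B', 'C', 'D'])
```

In the real application the nodes and edge lengths would come from the road network extracted from the scanned map.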

Members: Jonathan and Florian

Train schedules

This project aims to show how one could travel between France, Switzerland and the north of Italy back in the middle of the 19th century. From the journey paths, schedules and prices, one will realize how easy it is now to travel through Europe and how big the impact of the evolution of both the railway network and the technology has been. These data will be extracted from a document from 1858 and presented in a CFF-like website, so that one can put oneself in the shoes of a railway user of the 1850s.

Members: Anna Fredrikson and Olivier Dietrich

Paris 1909 TripAdvisor

The project intends to recreate the cultural geography of Paris in the Belle Époque, immersing the user in the world of cabarets, balls and theatres, the universe described by Zola and Proust, and painted by Renoir and Toulouse-Lautrec. In order to give a new perspective on how this legendary world was actually structured and perceived, we are going to digitize the authentic Plan des plaisirs et attractions de Paris created in 1909 and augment it with evidence and descriptions from the Guide des plaisirs à Paris: timetables, advice and guidance on what to wear, what to say, and where to go.

- Alina, Maryam, and Paola

A century in Beijing

Using this map by France's Service Géographique de l'Armée from 1900 we will follow the evolution of the urban landscape of the central part of China's capital over the last hundred years.

The planned goals

  • To align maps from different time periods and see how the landscape changed. The town's straightforward rectangular planning will allow us to make matches more easily.
  • The map has a rich legend with toponymic information in French and the dated French system of transliteration of Mandarin. The plan is to extract and match these place names with their modern counterparts.
  • Add the old pictures of significant buildings that are no longer there if it's possible to find them.
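The first goal, aligning maps across periods, usually comes down to estimating a transform from manually matched control points. Below is a minimal sketch assuming three hypothetical landmark pairs and a plain affine model; a real pipeline would use a georeferencing tool and many more points.

```python
def solve3(m, v):
    """Solve a 3x3 linear system m @ s = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    solution = []
    for i in range(3):
        mi = [row[:] for row in m]
        for r in range(3):
            mi[r][i] = v[r]
        solution.append(det(mi) / d)
    return solution

def affine_from_points(src, dst):
    """Fit x' = a x + b y + c, y' = d x + e y + f from 3 control point pairs."""
    m = [[x, y, 1.0] for x, y in src]
    a, b, c = solve3(m, [x for x, _ in dst])
    d, e, f = solve3(m, [y for _, y in dst])
    return (a, b, c, d, e, f)

def apply_affine(t, p):
    a, b, c, d, e, f = t
    x, y = p
    return (a * x + b * y + c, d * x + e * y + f)

# Hypothetical control points: the same landmarks located on the 1900 map
# (pixel coordinates) and on a modern georeferenced map.
old = [(100, 200), (400, 220), (150, 500)]
new = [(10.0, 20.0), (40.0, 22.0), (15.0, 50.0)]
t = affine_from_points(old, new)
```

With the transform estimated, any point on the old map can be projected onto the modern one with `apply_affine`.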

Members: Jimin and Anton

Coal supply in the German Empire

Main ideas

  • To study the coal supply and demand of the German Empire for the year 1881.
  • Interactive visualization of Germany's main coal production and consumption centers.
  • Dynamic visualization of coal transport flows according to the different mining basins and transport routes.
  • Differentiating the production and consumption centers from the transport hubs.
  • Creation of a website to present the results.

Members : Axel Matthey and Rémi Petitpierre

Paris Metropolitan, an evolution

Definition of the project

This group will analyze the evolution of the Paris Metropolitan system from its inception. The group will look at maps covering the planning of the metro, from the definition of the routes to the addition of stations: a first map from 1908 showing the actual metro after its construction began in 1900, a second map from 1915, with the impacts of the First World War already visible, and a third map from 1950, a more contemporary look at the metro as we know it today. The goal is to analyze how different areas of major cultural attractions evolved around or hand in hand with the metro stations and the overall Paris metro system - basically answering the chicken-and-egg question - and how the metro was impacted by catastrophic events such as wars.

Selected maps

The different maps selected for the project are the following:

  • Plan de Paris, avec le tracé du chemin de fer métropolitain (projet de l'administration) et les différentes lignes d'omnibus et de tramways, 1882
  • Plan de Paris [indiquant les lignes projetées du chemin de fer métropolitain en souterrain, tranchée, Viaduc, 1895
  • Paris, chemin de fer métropolitain ; lignes en exploitation, 1908
  • Paris Nouveau plan de Paris avec toutes les lignes du Métropolitain et du Nord-Sud, 1915
  • Paris. Plan d'ensemble par arrondissements. Métropolitain : [vers 1950]

Members: Evgeniy Chervonenko and Valentine Bernasconi

Projects 2017

All the projects are pieces of a larger puzzle. The goal is to experiment with a new approach to knowledge production and negotiation, based on a platform intermediate between Wikipedia and Twitter.

The platform is called ClioWire

ClioWire: Platform management and development

This group will manage the experimental platform of the course. They will have to run the platform and develop additional features for processing and presenting the pulses. The initial code base is Mastodon.

The group will write bots for rewriting pulses and progressively converging towards articulation/datafication of the pulses.

Knowledge required : Python, Javascript, basic linux administration.

Resp. Vincent and Orlin

- Albane - Cédric
Platform management and development : State of art and Bibliography

Platform management and development : methodology

Platform management and development : Quantitative analysis of performance

GitHub page of the project : [1]

Secondary sources

The goal is to extract, from a collection of 3000 scanned books about Venice, all the sentences containing at least two named entities and transform them into pulses. This should constitute a de facto set of relevant information drawn from a large base of Venetian documents.
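The extraction step might be sketched as follows. The sentence splitter and the hard-coded entity list are simplifying assumptions; the real pipeline would run proper named-entity recognition over the OCR output of the books.

```python
import re

def sentences_with_entities(text, entities, minimum=2):
    """Yield (sentence, found_entities) for every sentence that mentions
    at least `minimum` of the given named entities."""
    # Naive splitter: break after sentence-final punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = [e for e in entities if e in sentence]
        if len(found) >= minimum:
            yield sentence.strip(), found

# Invented sample text and entity list, for illustration only.
text = ("Francesco Raspi met Battista Nanni in Venice. "
        "The doge remained in the palace.")
entities = ["Francesco Raspi", "Battista Nanni"]

pulses = [f"Pulse: {s}" for s, _ in sentences_with_entities(text, entities)]
print(pulses)  # only the first sentence contains two entities
```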

Resp. Giovanni / Matteo

- Hakim - Marion

Named Entity Recognition

GitHub page of the project : [2]

Primary sources

This group will look for named entities in digitized manuscripts and post pulses about these mentions.

  • The group will use Wordspotting methods based on a commercial algorithm. During the project, the group will have to set up a dedicated pipeline for indexing and searching the documents digitized in the Venice Time Machine project and other primary sources, using the software component provided.
  • The group will have to search for lists of names or regular expressions. A method based on a predefined list will be compared with a recursive method based on the results provided by the Wordspotting components.
  • Two types of pulses will be produced: (a) "Mention of Francesco Raspi in document X"; (b) "Francesco Raspi and Battista Nanni linked (document Y)".
  • The creation of a simple web front end to test the Wordspotting algorithm would help assess the quality of the method.
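A minimal sketch of the list-based search and the two pulse types described above; the names, document identifiers and texts are invented for illustration, and a real run would operate on the Wordspotting index rather than on raw strings.

```python
import re
from itertools import combinations

# Hypothetical name list; in the project this would come from archival indexes.
names = ["Francesco Raspi", "Battista Nanni"]
pattern = re.compile("|".join(re.escape(n) for n in names))

# Hypothetical transcriptions keyed by document identifier.
documents = {
    "doc_X": "... Francesco Raspi appears as witness ...",
    "doc_Y": "... Francesco Raspi and Battista Nanni linked by contract ...",
}

def mention_pulses(documents, pattern):
    """Type (a) pulses: one per distinct name mentioned per document."""
    pulses = []
    for doc_id, text in documents.items():
        for match in set(pattern.findall(text)):
            pulses.append(f"Mention of {match} in document {doc_id}")
    return sorted(pulses)

def link_pulses(documents, pattern):
    """Type (b) pulses: pairs of names co-occurring in one document."""
    pulses = []
    for doc_id, text in documents.items():
        found = sorted(set(pattern.findall(text)))
        for a, b in combinations(found, 2):
            pulses.append(f"{a} and {b} linked (document {doc_id})")
    return pulses

print(mention_pulses(documents, pattern))
print(link_pulses(documents, pattern))
```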

Supervisor : Sofia

Skills : Java, simple Linux administration

- Raphael - Mathieu

Primary sources

Image banks

The goal is to transform the metadata of CINI, which has been OCRed, into pulses. One challenge is to deal with OCR errors and possible disambiguation.
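For the OCR-error challenge, one simple baseline is fuzzy matching against an authority list with Python's standard difflib. The authority names below are illustrative, not the actual CINI vocabulary.

```python
from difflib import get_close_matches

# Hypothetical authority list of artist names, for illustration only.
authority = ["Tiziano Vecellio", "Jacopo Tintoretto", "Paolo Veronese"]

def correct_ocr(name, authority, cutoff=0.75):
    """Map a possibly garbled OCR'd name to its closest authority entry,
    or return None if nothing is similar enough."""
    matches = get_close_matches(name, authority, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_ocr("Jacopo Tintorctto", authority))  # OCR confused e/c
```

The `cutoff` parameter trades recall against the risk of wrong merges; disambiguation between genuinely similar names would need additional context.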

Supervision: Lia

Newspaper, Wikipedia, Semantic Web

The goal is to find all the sentences in a large newspaper archive that contain at least two named entities. These sentences should be posted as pulses.

The named entity detection has already been done. The only challenge is to retrieve the corresponding sentences in the digitized transcriptions.

In addition, this group should look for ways of massively importing elements of knowledge from other sources (DBpedia, RDF databases).
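As a sketch of that import step, the snippet below turns a couple of N-Triples statements into textual pulses. The triples are a simplified, hand-written sample in DBpedia's style, not actual query results, and a real importer would use an RDF library or SPARQL endpoint.

```python
import re

# Hand-written sample in N-Triples style, for illustration only.
ntriples = """\
<http://dbpedia.org/resource/Venice> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Italy> .
<http://dbpedia.org/resource/Venice> <http://dbpedia.org/ontology/populationTotal> "261905" .
"""

TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*")\s*\.')

def _label(term):
    """Reduce a URI or literal to a short human-readable label."""
    return term.strip('<>"').rsplit("/", 1)[-1]

def triples_to_pulses(data):
    """Turn subject/predicate/object triples into short textual pulses."""
    return [f"{_label(s)} {_label(p)} {_label(o)}"
            for s, p, o in TRIPLE.findall(data)]

print(triples_to_pulses(ntriples))
```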

Resp. Maud

Skills: Python or Java

- Laurene and Santiago


Newspaper, Wikipedia, Semantic Web : State of art and Bibliography