Projects
All the projects are pieces of a larger puzzle. The goal is to experiment with a new approach to knowledge production and negotiation, based on a platform that sits midway between Wikipedia and Twitter.
The platform is called ClioWire.
ClioWire: Platform management and development
This group will manage the experimental platform of the course. They will have to run the platform and develop additional features for processing and presenting the pulses. The initial code base is Mastodon.
The group will write bots for rewriting pulses and progressively converging towards articulation/datafication of the pulses; a minimal sketch of such a bot is given below.
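Since the code base is Mastodon, one way to write such a bot is through the Mastodon.py client (an assumption; any Mastodon-compatible client would do). The instance URL, token and rewriting rule below are placeholders:

```python
from mastodon import Mastodon

# Connect to the ClioWire instance (URL and token are placeholders).
client = Mastodon(
    access_token="YOUR_BOT_TOKEN",
    api_base_url="https://cliowire.example.org",
)

def rewrite(text):
    """Toy rewriting rule: normalise whitespace and tag the pulse.

    A real bot would move towards articulated/datafied pulses here.
    """
    return " ".join(text.split()) + " #rewritten"

# Read recent pulses from the public timeline and repost rewritten versions.
# Note: the Mastodon API returns statuses with HTML in "content"; a real bot
# would strip markup and skip its own posts to avoid feedback loops.
for status in client.timeline_public(limit=5):
    client.status_post(rewrite(status["content"]))
```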
Knowledge required: Python, JavaScript, basic Linux administration.
Resp. Vincent and Orlin
- Albane
- Cédric
Platform management and development: State of the art and bibliography
Secondary sources
The goal is to extract, from a collection of 3000 scanned books about Venice, all the sentences containing at least two named entities and to transform them into pulses. This should constitute a de facto set of relevant information drawn from a large base of Venetian documents. A minimal sketch of this filtering step is given below.
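One possible way to implement the filter, sketched here with spaCy and its Italian model it_core_news_sm (both assumptions; the books' language mix and the actual NER pipeline are not specified):

```python
import spacy

# Load a generic NER model (it_core_news_sm is an assumption;
# the books are presumably mostly in Italian).
nlp = spacy.load("it_core_news_sm")

def sentences_with_entities(text, min_entities=2):
    """Yield sentences mentioning at least `min_entities` named entities."""
    doc = nlp(text)
    for sent in doc.sents:
        if len(sent.ents) >= min_entities:
            yield sent.text.strip(), [(e.text, e.label_) for e in sent.ents]

# Each yielded sentence is a candidate pulse (file name is hypothetical).
for sentence, entities in sentences_with_entities(open("book_0001.txt").read()):
    print(sentence, entities)
```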
Resp. Giovanni / Matteo
- Hakim
- Marion
Primary sources
This group will look for named entities in digitized manuscripts and post pulses about these mentions.
- The group will use word-spotting methods based on a commercial algorithm. During the project, the group will have to set up a dedicated pipeline for indexing and searching the documents digitized in the Venice Time Machine project and other primary sources, using the software component provided.
- The group will have to search for lists of names or regular expressions. A method based on a predefined list will be compared with a recursive method based on the results provided by the word-spotting components.
- Two types of pulses will be produced: (a) "Mention of Francesco Raspi in document X"; (b) "Francesco Raspi and Battista Nanni linked (document Y)".
- The creation of a simple web front end to test the word-spotting algorithm would help assess the quality of the method; a minimal sketch of the list-based search is given after this list.
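A minimal sketch of the predefined-list search over a transcription, using only the standard library (names, document ID and text are illustrative; the commercial word-spotting component itself is not reproduced here):

```python
import itertools
import re

# Predefined list of names to search for (entries are illustrative).
NAMES = ["Francesco Raspi", "Battista Nanni"]

def pulses_for_document(doc_id, text):
    """Emit the two pulse types for one transcribed document."""
    found = [n for n in NAMES if re.search(re.escape(n), text)]
    # Pulse type (a): single mentions.
    for name in found:
        yield f"Mention of {name} in document {doc_id}"
    # Pulse type (b): co-occurrences within the same document.
    for a, b in itertools.combinations(found, 2):
        yield f"{a} and {b} linked (document {doc_id})"

for pulse in pulses_for_document("X", "... Francesco Raspi wrote to Battista Nanni ..."):
    print(pulse)
```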
Supervisor: Sofia
Skills: Java, simple Linux administration
- Raphael
- Mathieu
Image banks
The goal is to transform the CINI metadata, which has been OCRed, into pulses. One challenge is dealing with OCR errors and possible disambiguation; a sketch of one fuzzy-matching approach is given below.
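One possible way to handle OCR errors, sketched here with Python's standard difflib against a hypothetical authority list of names (entries and threshold are illustrative):

```python
import difflib

# Authority list of known names (entries are illustrative).
AUTHORITY = ["Tintoretto", "Tiziano Vecellio", "Paolo Veronese"]

def correct(token, cutoff=0.8):
    """Map an OCRed token to the closest known name, if any is close enough."""
    matches = difflib.get_close_matches(token, AUTHORITY, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct("Tintorett0"))  # -> "Tintoretto"
```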
Supervision: Lia
Newspaper, Wikipedia, Semantic Web
The goal is to find all the sentences in a large newspaper archive that contain at least two named entities. These sentences should be posted as pulses.
The named-entity detection has already been done; the only challenge is to retrieve the corresponding sentences in the digitized transcriptions.
In addition, this group should look for ways of massively importing elements of knowledge from other sources (DBpedia, RDF databases); a sketch of such an import query is given below.
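A sketch of such an import, assuming the SPARQLWrapper library and the public DBpedia endpoint (the query itself is illustrative):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query the public DBpedia endpoint for people born in Venice.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?person WHERE { ?person dbo:birthPlace dbr:Venice . } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    # Each result could be rephrased as a pulse, e.g. "<person> born in Venice".
    print(row["person"]["value"])
```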
Resp. Maud
Skills: Python or Java
- Laurene and Santiago
Bibliography
The database from “Le Temps” contains more than two centuries of historical, geographical and political data. All this information was analysed and aggregated into Resource Description Framework (RDF) triples, each made of a subject, a predicate and an object. This organization makes it possible to synthesize many ideas and links, such as the ownership of an object, or the filiation or relationship between two people. The general idea is to link concepts in a specific way, which results in a graph: RDF data is meant to be shown as a graph, which is its most powerful representation.

It may seem easy to represent the data as a graph of links; however, the RDF graph can be highly problematic because of its size and complexity. The network it creates is too dense for people to get an overview of it. The use of SPARQL, a query language and protocol for RDF, is therefore necessary: such a tool allows us to retrieve information from the graph.

When information is aggregated from the original text, it is serialized, meaning that the data is encapsulated into a smaller format, RDF in this case. In order to retrieve meaningful data, that is, natural language sentences, our work is to deserialize the graph and obtain short sentences from it (a minimal sketch is given after the bibliography). This will give us a better view of what the RDF graph contains. To guide us in our work, we will use the following bibliography:
- "RDF serialization from JSON data: The case of JSON data in Diavgeia.gov.gr" by Stamatios Theocharis & George Tsihnrintzis (2016).
This case study will show us how the serialization process occurs. Knowing how a much larger structure is encapsulated will allow us to reconstruct the initial data. When serializing the original data from "Le Temps", the process included many steps that require a level of comprehension we will try to reach by analysing the steps taken by Theocharis & Tsihrintzis.
- "Jumping NLP Curves: A Review of Natural Language Processing Research", by Erik Cambria & Bebo White (2014).
This article, along with the knowledge provided by our supervisor, will help us understand how natural language processing works. Because our RDF data is a direct product of such processing, gathering information on how it is structured and conceptualized will expand our understanding of the construction of the graph we are working with.
- "PANTO: A Portable Natural Language Interface to Ontologies", by Chong Wang, Miao Xiong, Qi Zhou and Young Fu (2007).
This article presents a system that translates natural language queries into SPARQL queries. Using this tool and learning more about it will improve our command of both query forms. During this project we will inevitably have to master SPARQL, and knowing how it relates to natural language will foster our understanding of the structure of queries.
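As a rough illustration of the deserialization step discussed above, here is a minimal sketch assuming rdflib and a hypothetical Turtle extract of the graph (letemps.ttl); the label extraction is deliberately naive:

```python
from rdflib import Graph

g = Graph()
g.parse("letemps.ttl", format="turtle")  # hypothetical extract of the "Le Temps" graph

def label(term):
    """Naive label: keep the last URI segment, or the literal itself."""
    text = str(term)
    return text.rsplit("/", 1)[-1].rsplit("#", 1)[-1].replace("_", " ")

# "Deserialize" each triple into a short subject-predicate-object sentence.
for s, p, o in g:
    print(f"{label(s)} {label(p)} {label(o)}.")
```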