Newspaper, Wikipedia, Semantic Web: State of the Art and Bibliography

From FDHwiki

The database from “Le Temps” contains more than two centuries of historical, geographical and political data. All this information was analysed and aggregated into Resource Description Framework (RDF) statements, each of which is a triple of information consisting of a subject, a predicate and an object. In Semantic Web applications, and more generally in popular uses of RDF, resources tend to be represented by URIs that can be used to access the actual data on the World Wide Web, or that may identify resources outside the Web (non-dereferenceable URIs).
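As a concrete illustration, the triple structure can be sketched in Python by representing each RDF statement as a (subject, predicate, object) tuple. The URIs and values below are invented examples, not actual identifiers from the "Le Temps" database:

```python
# Each RDF statement is a triple: (subject, predicate, object).
# Subjects and predicates are URIs; objects are URIs or literal values.
# All identifiers below are hypothetical, for illustration only.
triples = [
    ("http://example.org/person/JeanDupont",   # subject: a resource URI
     "http://xmlns.com/foaf/0.1/name",         # predicate: the FOAF "name" property
     "Jean Dupont"),                           # object: a literal value
    ("http://example.org/person/JeanDupont",
     "http://example.org/ontology/bornIn",
     "http://example.org/place/Geneva"),       # object: another resource URI
]

for subject, predicate, obj in triples:
    print(subject, "--", predicate, "->", obj)
```

Both statements share the same subject, which is exactly what makes the triples connect into a graph rather than stay a flat list.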

With such an organisation, one is able to express many ideas and links, such as the ownership of an object, or the kinship or relationship between two people. The general idea is to link concepts in a specific way, which results in a graph. Indeed, RDF data is meant to be shown as a graph; this is its most powerful representation: subject and object become nodes, and the predicate becomes the arc linking them (a node-arc-node link). However, the graph of RDF triples can be highly problematic because of its size and complexity: the network it creates is too dense for people to get an overview of it. The use of SPARQL, a query language and protocol for RDF, is therefore necessary. Such a tool allows us to retrieve information from the graph. A SPARQL query specifies which items to extract and in what form (SELECT clause), together with a dataset definition stating which RDF graph patterns are required (WHERE clause, the query pattern).
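To make the SELECT/WHERE pattern concrete, here is a hedged sketch: a SPARQL query as it might be written, followed by a toy Python matcher that mimics what a query engine does with the WHERE pattern (None plays the role of a SPARQL variable). The data and the "ex:" prefix are invented, not taken from the project's dataset:

```python
# The SPARQL query this toy example mimics (prefix and URIs are invented):
#
#   SELECT ?person
#   WHERE { ?person ex:bornIn ex:Geneva . }

def select(triples, pattern):
    """Return all triples matching (s, p, o); None behaves like a SPARQL variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Hypothetical data: two people, two birthplaces.
data = [
    ("ex:JeanDupont", "ex:bornIn", "ex:Geneva"),
    ("ex:MarieDurand", "ex:bornIn", "ex:Lausanne"),
]

# "Who was born in Geneva?" -- only the JeanDupont triple matches.
matches = select(data, (None, "ex:bornIn", "ex:Geneva"))
```

A real SPARQL engine does much more (joins between several patterns, variable bindings, filters), but the core idea is the same: a query pattern selects the subgraph we care about out of a graph too dense to read directly.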

When information is aggregated from the original text, it is serialized: a large amount of data is encapsulated into a compact format, RDF in this case. In order to retrieve meaningful data, that is, natural-language sentences, from the RDF triples, our work is to deserialize the graph and obtain short sentences from it, thus extracting a readable data structure from a format that can be stored easily (RDF). This will give us a clearer vision of what the RDF contains. The new format resembles a Twitter message: it is called a "pulse" and is limited to 140 characters. It is created in order to feed the practical-class platform ClioWire.
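The deserialization step we aim for can be sketched as follows: take one triple, strip the URI prefixes down to human-readable labels, and join the parts into a short sentence capped at 140 characters. The labelling heuristic here is our own simplification for illustration, not the project's final method:

```python
def to_pulse(triple, max_len=140):
    """Turn one RDF triple into a short natural-language 'pulse'.

    The label() heuristic (keep the last URI segment) is a simplification;
    a real deserializer would look up proper labels in the graph itself.
    """
    def label(term):
        # "http://example.org/place/Geneva" -> "Geneva"; plain literals pass through.
        return term.rsplit("/", 1)[-1].rsplit("#", 1)[-1]

    subject, predicate, obj = triple
    sentence = "{} {} {}.".format(label(subject), label(predicate), label(obj))
    return sentence[:max_len]  # enforce the 140-character pulse limit

# Hypothetical triple -> pulse:
pulse = to_pulse(("http://example.org/person/JeanDupont",
                  "http://example.org/ontology/bornIn",
                  "http://example.org/place/Geneva"))
# pulse == "JeanDupont bornIn Geneva."
```

Truncating at 140 characters is the crudest possible policy; a production version would need to shorten labels more gracefully, but the sketch shows the direction of the pipeline: graph in, tweet-sized sentence out.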

To guide us in our work, we will use the following bibliography:

- RDF serialization from JSON data: The case of JSON data in Diavgeia.gov.gr [1], by Stamatios Theocharis & George Tsihrintzis (2016). This case study will show us how the serialization process occurs. Knowing how a much larger structure is encapsulated will allow us to reconstruct the initial data. When the original data from "Le Temps" was serialized, the process included many steps requiring a level of comprehension we will try to reach by analysing the steps described by Theocharis & Tsihrintzis.

- Jumping NLP Curves: A Review of Natural Language Processing Research [2], by Erik Cambria & Bebo White (2014). This article, along with the knowledge provided by our supervisor, will help us understand how Natural Language Processing works. Because RDF triples are a direct product of it, gathering information on how they are structured and conceptualized will expand our understanding of the construction of the graph we are working with.

- PANTO: A Portable Natural Language Interface to Ontologies [3], by Chong Wang, Miao Xiong, Qi Zhou and Yong Yu (2007). This article presents a system that translates natural-language queries into SPARQL queries. Using this tool and learning more about it will improve our skill with both kinds of query. During this project it will be unavoidable to master the use of SPARQL, and knowing how this language relates to natural language will foster our understanding of the structure of queries.

- Real-time #SemanticWeb in <= 140 chars [4], by Joshua Shinavier. This article starts from the conviction that the "real-time" social Web can mash up with the Semantic Web. With the massive use of the GAFA platforms in recent years, these platforms have begun to supply such data, as we can observe with Twitter Places and Twitter Annotations. Combining this concept of annotation with the Semantic Web raises several questions: how do we interlink annotation vocabularies? How do we query over the data? By addressing the inverse problem, converting Tweets into RDF rather than RDF into Tweets, this article highlights the problem from the other side, which is quite similar and offers us a different point of view.

For more information follow this link to find our GitHub repository:

https://github.com/CSantiStSup/FDH_Le_temps


References

  1. "RDF serialization from JSON data: The case of JSON data in Diavgeia.gov.gr"
  2. "Jumping NLP Curves: A Review of Natural Language Processing Research"
  3. "PANTO: A Portable Natural Language Interface to Ontologies"
  4. "Real-time #SemanticWeb in <= 140 chars"