Platform management and development: methodology
For platform management, we decided to develop a tool that visualises the geographical density of the pulses. The task is split among three Python programs:
GeoParsingBot: detects geographical references in the pulses of ClioWire. The bot fetches the coordinates of each geographical reference found in a pulse and posts, as a reply, a geocoded version of the pulse (a geopulse).
MapBot: fetches all pulses that were previously geocoded, converts them to JSON, and saves the result in a local file.
MapApp: uses the JSON file made by MapBot to produce an HTML page showing the density map of the pulses.
A geopulse is defined as a pulse whose content contains a geographical reference and which carries the coordinates of that reference. To be considered a geopulse, a pulse must bear at least these three tags:
#geocoding #LocationName #plongitude_decimals_latitude_decimals
No other type of pulse should be added to the geographic density map.
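The three-tag rule above can be checked mechanically. The following is a minimal sketch, not the project's actual code: the function name `parse_geopulse` is hypothetical, and it assumes the coordinate tag reads as `#p`, then the integer and decimal parts of the longitude, then those of the latitude, all separated by underscores. The encoding of negative coordinates is not described here, so the sketch does not handle them.

```python
import re

# Assumed layout of the coordinate tag: "#p<lon>_<lon decimals>_<lat>_<lat decimals>".
# Negative coordinates are not covered because their encoding is not specified.
COORD_TAG = re.compile(r"^#p(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_geopulse(text):
    """Return (location_names, (lon, lat)) if `text` looks like a geopulse, else None."""
    tags = re.findall(r"#\w+", text)
    if "#geocoding" not in tags:
        return None
    coords = None
    names = []
    for tag in tags:
        if tag == "#geocoding":
            continue
        match = COORD_TAG.match(tag)
        if match:
            lon_int, lon_dec, lat_int, lat_dec = match.groups()
            coords = (float(f"{lon_int}.{lon_dec}"), float(f"{lat_int}.{lat_dec}"))
        else:
            # Every other hashtag is treated as a candidate location name.
            names.append(tag[1:])
    if coords is None or not names:
        return None
    return names, coords
```

A pulse lacking any one of the three tags is rejected, which matches the rule that no other type of pulse should reach the density map.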
The inner workings of the three programs that find, format and display the geographical names found on ClioWire are explained below.
GeoParsingBot: The task of this bot is to produce geopulses from ordinary ClioWire pulses. The bot is meant to fetch only pulses produced by specific data-scraping accounts. Since geoparsing works best when the language and the geographical focus area of the text are known, the geoparsing should be configured specifically for each account. To avoid processing the same pulse twice, the bot keeps, in an external file, a reference to the most recent pulse it geocoded. To do the geoparsing (the act of detecting geographical references in natural-language text and assigning coordinates to them), it uses a Python library called “geoparsepy” [1]. This library performs geoparsing locally, using a PostgreSQL [2] database loaded with OpenStreetMap [3] geographical references. In this project, all sources came either from scrapes of the Swiss newspaper “Le Temps” or from scans of Venetian books, so only data on European places was loaded into the database, both to increase performance and to reduce false positives and wrong georeferences.
Geoparsepy was chosen as the geoparsing tool because it can geoparse very large datasets quickly once a mandatory preprocessing phase has been completed. Since the volume of data held by ClioWire can be enormous, performance and offline computing were crucial criteria when choosing a geoparsing technology.
The library can detect locations and find their OpenStreetMap IDs (OSMIDs). The OSMID is the internal number that OpenStreetMap (OSM) uses to identify the geographic entities it holds. In the current state of the project, geoparsepy cannot directly retrieve the coordinates from this ID, so another service is used to translate it into actual geographic coordinates: the Overpass API [4]. This service is a read-only interface to OSM data, intended for users who only want to extract data from it. This is in contrast to the classic OSM API, which is oriented towards writing modifications and restricts clients that read too much through it. Once the coordinates are retrieved, the original pulse is posted again with three hashtags appended: #geocoding, a hashtag containing the name of the georeferenced location, and a hashtag containing the coordinates. As described above, these three elements define the newly formatted pulse as a geopulse.
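The ID-to-coordinates lookup can be sketched as follows. This is an illustration under stated assumptions, not the bot's actual code: the function names are hypothetical, the endpoint is the public Overpass instance rather than one confirmed by the project, and the sketch assumes the element type (node, way or relation) is known alongside the ID, since an Overpass QL query needs both.

```python
import json
import urllib.parse
import urllib.request

# Public Overpass instance; which endpoint the project used is an assumption.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_query(osm_type, osm_id):
    """Build an Overpass QL query returning one element with a representative point."""
    if osm_type not in ("node", "way", "relation"):
        raise ValueError(f"unsupported OSM element type: {osm_type}")
    # "out center" adds a centre point for ways and relations; nodes carry lat/lon directly.
    return f"[out:json];{osm_type}({osm_id});out center;"


def extract_coordinates(response_text):
    """Return (lon, lat) from an Overpass JSON response, or None if nothing usable is found."""
    elements = json.loads(response_text).get("elements", [])
    if not elements:
        return None
    element = elements[0]
    point = element if "lat" in element else element.get("center", {})
    if "lat" not in point or "lon" not in point:
        return None
    return point["lon"], point["lat"]


def fetch_coordinates(osm_type, osm_id, url=OVERPASS_URL, timeout=30):
    """Send the query to Overpass over HTTP POST and return (lon, lat) or None."""
    query = build_overpass_query(osm_type, osm_id)
    payload = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(url, data=payload, timeout=timeout) as response:
        return extract_coordinates(response.read().decode("utf-8"))
```

Separating query building and response parsing from the network call keeps the two pure steps testable offline, which matters when the bot processes large batches of pulses.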
In short, GeoParsingBot scans the pulses posted since its last run (the date of that run is stored in a file, along with the identifiers of the GeoParsingBot account) and tries to match the words it finds against the names of locations listed in a separate database. To ease disambiguation, and because the scanned pulses come from the Swiss newspaper Le Temps and from archives located in Italy, we chose a database focused mostly on European cities, with more precise locations around Switzerland. Pulses that contain such geographical information are called geopulses; GeoParsingBot processes them so that the geographical names they contain are displayed as GeoEntities.
MapBot: This bot fetches geopulses from ClioWire. It retrieves only geopulses thanks to the “#geocoding” tag. The pulses it reads are then written to a JSON file (geopulses.json) in the GeoJSON format. Each entry is a geographical point defined by a pair of coordinates, with a unique identifier, a content and entities. The entities are the tags contained in the pulse; they serve as the index that makes a search engine possible on the final map. This is one of the reasons why, in the geopulse format, the georeferenced entity is added as a tag: so that it can be searched on the map.
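As an illustration of the file MapBot writes, here is a minimal sketch that turns already-parsed geopulses into a GeoJSON FeatureCollection. The input dictionary keys (`id`, `lon`, `lat`, `content`, `entities`) and both function names are assumptions made for this example; only the GeoJSON layout itself (a `Feature` with a `Point` geometry and free-form `properties`) follows the standard.

```python
import json


def geopulses_to_geojson(geopulses):
    """Convert geopulse dicts into a GeoJSON FeatureCollection of points."""
    features = []
    for pulse in geopulses:
        features.append({
            "type": "Feature",
            "id": pulse["id"],
            # GeoJSON orders positions as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [pulse["lon"], pulse["lat"]]},
            # The tags are kept as "entities" so they can be indexed for search.
            "properties": {"content": pulse["content"], "entities": pulse["entities"]},
        })
    return {"type": "FeatureCollection", "features": features}


def save_geojson(collection, path="geopulses.json"):
    """Write the collection to disk as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(collection, handle, ensure_ascii=False, indent=2)
```

Note the coordinate order: GeoJSON puts longitude first, whereas many mapping calls expect latitude first, so the two should not be mixed up when the file is read back.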
Each bot has access to a file containing the ID of the last pulse it scanned, so that it does not scan the same pulses twice.
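A checkpoint file of this kind could look like the sketch below. It assumes, for illustration only, that pulse IDs are integers that increase over time and that the file holds a single ID; the function names are hypothetical.

```python
import os


def load_last_pulse_id(path):
    """Return the stored pulse ID, or None if the file is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def save_last_pulse_id(path, pulse_id):
    """Store the ID via a temporary file so a crash cannot leave a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(str(pulse_id))
    os.replace(tmp_path, path)


def select_new_pulses(pulses, last_id):
    """Keep only pulses newer than last_id (assumes IDs grow with time)."""
    if last_id is None:
        return list(pulses)
    return [pulse for pulse in pulses if pulse["id"] > last_id]
```

Saving the checkpoint only after a batch has been fully processed means an interrupted run repeats some work instead of skipping pulses.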
MapApp: Strictly speaking, this program is the end application of the whole geoparsing pipeline. It creates a map displaying all the geopulses, each indicated by a marker. The map is zoomed automatically according to the maximum and minimum coordinates found in the geopulses. A search bar lets users navigate through the displayed pulses. Markers matching a keyword search made by the user are circled in red. The map is produced using the folium Python library [5], which takes GeoJSON data and builds from it an HTML file that displays the data on a geographic representation of the Earth. The HTML file is hosted on a server, and MapBot displays the map URL in its description so that anybody on ClioWire can access the final result of the geoparsing of pulses.
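The automatic zoom and the keyword search both reduce to simple passes over the GeoJSON features. The sketch below shows that logic in plain Python, independently of folium; the function names are hypothetical, and the feature layout (an `id`, a point geometry, and an `entities` list under `properties`) is assumed for illustration. The bounding box is returned as south-west and north-east corners in latitude-first order, the shape that folium's `Map.fit_bounds` accepts.

```python
def bounding_box(collection):
    """Return [[min_lat, min_lon], [max_lat, max_lon]] over all point features, or None."""
    positions = [f["geometry"]["coordinates"] for f in collection["features"]]
    if not positions:
        return None
    # GeoJSON positions are [longitude, latitude].
    lons = [position[0] for position in positions]
    lats = [position[1] for position in positions]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


def matching_ids(collection, keyword):
    """Return the IDs of features with an entity containing `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        feature["id"]
        for feature in collection["features"]
        if any(needle in entity.lower() for entity in feature["properties"]["entities"])
    ]
```

The IDs returned by the search step identify which markers to highlight, for instance by circling them in red.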