Named Entity Recognition: Difference between revisions

From FDHwiki
== Discussion of the State of the Art ==
Named entity recognizers long relied on algorithms based on statistical models such as the Hidden Markov Model (HMM)
<ref>Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. [https://dl.acm.org/citation.cfm?id=1119204 Named entity recognition with character-level models]. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 180-183.</ref>
<ref>Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. [https://dl.acm.org/citation.cfm?id=1118965 Effective adaptation of a Hidden Markov Model-based named entity recognizer for biomedical domain]. In Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13 (BioMed '03), Vol. 13. Association for Computational Linguistics, Stroudsburg, PA, USA, 49-56.</ref>
and Conditional Random Fields (CRFs)
<ref>Andrew McCallum and Wei Li. 2003. [https://dl.acm.org/citation.cfm?id=1119206 Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons]. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 188-191.</ref>
<ref>Burr Settles. 2004. [https://dl.acm.org/citation.cfm?id=1567618 Biomedical named entity recognition using conditional random fields and rich feature sets]. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA '04), Nigel Collier, Patrick Ruch, and Adeline Nazarenko (Eds.). Association for Computational Linguistics, Stroudsburg, PA, USA, 104-107.</ref>.
An HMM is a statistical model that describes a system as being in one of a number of possible states. Each state is associated with possible outputs and their respective probabilities, and the system changes state with a certain probability. A CRF is another statistical model whose distinctive feature is its context awareness: where a discrete classifier maps a single sample to a single class, a CRF outputs a sequence of labels for a sequence of samples.
For instance, the widely used Stanford Named Entity Recognizer
<ref>https://nlp.stanford.edu/software/CRF-NER.html</ref>
uses CRFs.

In recent years, however, advances in both GPU technology and deep learning techniques have triggered the advent of architectures based on Long Short-Term Memory Neural Networks (LSTMNNs)
<ref>James Hammerton. 2003. [https://dl.acm.org/citation.cfm?id=1119202 Named entity recognition with long short-term memory]. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 172-175.</ref>. LSTMNNs are often used in conjunction with CRFs and achieve performance equivalent to the aforementioned methods <ref name="a">Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. [https://arxiv.org/abs/1603.01360 Neural Architectures for Named Entity Recognition].</ref>
<ref>Zhiheng Huang, Wei Xu, Kai Yu. 2015. [https://arxiv.org/abs/1508.01991 Bidirectional LSTM-CRF Models for Sequence Tagging].</ref> without the need for complex feature engineering <ref name="a" /><ref>Chen Lyu, Bo Chen, Yafeng Ren, Donghong Ji. 2017. [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1868-5 Long short-term memory RNN for biomedical named entity recognition].</ref>
<ref>Jason P. C. Chiu, Eric Nichols. 2015. [https://arxiv.org/abs/1511.08308 Named Entity Recognition with Bidirectional LSTM-CNNs].</ref>.
LSTMNNs provide a model which has become a fundamental component for major companies like Google, which prefer them over earlier approaches
<ref>Hasim Sak, Andrew Senior, Francoise Beaufays. 2014. [https://static.googleusercontent.com/media/research.google.com/it//pubs/archive/43895.pdf Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling]</ref>.

It is worth noting that a recent paper
<ref>Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum. 2017. [https://arxiv.org/abs/1702.02098 Fast and Accurate Sequence Labeling with Iterated Dilated Convolutions].</ref>
proposes using Iterated Dilated Convolutional Neural Networks (ID-CNNs) in place of LSTMNNs to drastically reduce computation time by fully exploiting the GPU's parallel architecture while keeping the same level of accuracy, which suggests ID-CNNs could be the next step in improving NER.

As for the current state of NER research, matters of great concern include training data scarcity and inter-domain generalization
<ref name="id">Vivek Kulkarni, Yashar Mehdad, Troy Chevalier. 2016. [https://arxiv.org/abs/1612.00148 Domain Adaptation for Named Entity Recognition in Online Media with Word Embeddings].</ref>.
To perform well on a language domain, current NER systems need large labeled datasets related to that domain
<ref>Isabelle Augenstein, Leon Derczynski, Kalina Bontcheva. 2017. [https://arxiv.org/abs/1701.02877 Generalisation in Named Entity Recognition: A Quantitative Analysis].</ref>.
Such training data is not available for all language domains, which makes it impossible to apply NER effectively to them. Furthermore, if a language domain does not follow strict language conventions and allows for a wide use of the language, the model will fail to generalize because of excessive heterogeneity in the data. Examples of such domains are Sport and Finance.
This is why one of the big challenges is “adapt[ing] models learned on domains where large amounts of annotated training data are available to domains with scarce annotated data”
<ref name="id"/>.
== Bibliography ==
<references />
== Quantitative analysis of the performance ==
The Python script designed as part of this project generates at least six different kinds of pulses for each detected named entity. Since the database contains approximately 2300 books and 600 journal issues, it is interesting to compute a lower bound for the number of pulses produced. To do this, some assumptions about the average number of named entities detected in a single document must be made. The log of the script, which records the number of named entities associated with each processed document, suggests that 1000 named entities per document is a good estimate of the average. A second assumption concerns the number of articles contained in a journal issue. Browsing the database suggests that 10 is an acceptable value.
Under these assumptions, the number of pulses produced by parsing the whole database is
(2300 books * 1000 named_entities/book + 600 journal_issues * 10 articles/journal_issue * 1000 named_entities/article) * 6 pulses/named_entity = 49.8 million pulses, or roughly 50 million pulses.
The question now is how large a database must be in order to contain all these pulses. Note that pulses are not the only elements stored in the output database. It also holds metadata on documents, most notably the number of pulses produced from each processed document. The size of this metadata is negligible compared to that of the pulses, so it is not considered in the estimate. The final size of the output database is extrapolated from its current size. As of now the database contains around 400,000 pulses and its size is on the order of 220 MB. A database of 50 million pulses should therefore be 125 times larger, which means a size of 27.5 GB. Considering that the input database is 80 GB in size, it can be argued that pulses are an interesting way of condensing the critical parts of the information.
Lastly, the time needed to process the whole database is estimated. The processing of the input database has two bottlenecks: the limited number of requests that can be sent daily to the NER service, Dandelion, and the writing of pulses into the output database. With these limitations, it was still possible to process 100 books in 16 hours. Assuming that processing a book takes as long as processing a journal issue, the total time to process the 2900 documents in the input database would be 464 hours, or roughly 20 days.
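The three estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constants are the figures quoted in the text, while the function and variable names are illustrative and do not come from the project script.

```python
# Figures quoted in the text; all names here are illustrative.
BOOKS = 2300
JOURNAL_ISSUES = 600
ARTICLES_PER_ISSUE = 10        # assumed average, from browsing the database
ENTITIES_PER_DOCUMENT = 1000   # assumed average, from the script log
PULSES_PER_ENTITY = 6          # lower bound on the kinds of pulses


def estimated_pulses():
    """Lower bound on the number of pulses for the whole database."""
    parsed_texts = BOOKS + JOURNAL_ISSUES * ARTICLES_PER_ISSUE
    return parsed_texts * ENTITIES_PER_DOCUMENT * PULSES_PER_ENTITY


def estimated_size_gb(target_pulses=50_000_000,
                      current_pulses=400_000,
                      current_size_mb=220):
    """Linear extrapolation of the database size from its current size."""
    scale = target_pulses / current_pulses
    return scale * current_size_mb / 1000


def estimated_hours(documents=BOOKS + JOURNAL_ISSUES,
                    hours_per_100_documents=16):
    """Processing time, assuming books and journal issues take equally long."""
    return documents * hours_per_100_documents / 100


print(estimated_pulses())       # 49800000
print(estimated_size_gb())      # 27.5
print(estimated_hours())        # 464.0
print(estimated_hours() / 24)   # about 19.3 days
```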
== Detailed description of the methods ==
The database used for this project is large, with approximately 2300 books and 600 journal issues. It is a MongoDB database, so all texts and metadata are stored as JSON. As there are two types of sources, books and articles, each method has to be adapted to each type.
The database where the books and articles are stored has many collections, but only four of them are used for this project. The first is the metadata collection, which contains:
* the _id of the book
* the creator (the author)
* the language in which the book is written
* the title
* the bid (id of the book)
* the date
* the type of document (in this project, only the monograph type will be used for this collection).
The second collection used is the documents collection. It contains:
* the _id
* the bid
* the type (which can be here monograph or journal issue)
* a list of page ids.
For journal issues, there is also a list of articles, and for each article there are:
* the author
* the title
* the start and end page of the article
The third collection is the pages collection, which has two values of interest: the full text of a page and its printed page number (there is also the _id, as in every collection).
The last collection used is bibliodb_articles. This collection of articles contains:
* the _id
* the journal bid
* the journal title
* the title of the article
* the year
* the volume (of the journal it comes from)
The first step is to get the full text of each book or article, as books and articles are not stored as a whole but page by page.
For books, this starts with getting all the BIDs (book ids) from the metadata collection. Next, the list of page ids of each book is retrieved from the documents collection.
Finally, the full text of each page is read from the pages collection.
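The book retrieval described above can be sketched as follows. To keep the example self-contained, the three collections are represented by plain lists of dictionaries rather than a live MongoDB connection, and the field names `pages` and `fulltext` are assumptions: the text only states that such values exist, not what they are called.

```python
def book_full_text(bid, documents, pages):
    """Concatenate the page texts of one book, in stored page order."""
    # Find the document entry that lists the page ids of this book.
    document = next(d for d in documents if d["bid"] == bid)
    text_by_id = {p["_id"]: p["fulltext"] for p in pages}
    return "\n".join(text_by_id[page_id] for page_id in document["pages"])


def all_book_texts(metadata, documents, pages):
    """Map each monograph BID found in the metadata to its full text."""
    return {
        m["bid"]: book_full_text(m["bid"], documents, pages)
        for m in metadata
        if m["type"] == "monograph"
    }


# Tiny in-memory stand-ins for the three collections (hypothetical data).
metadata = [{"bid": "b1", "title": "A Book", "type": "monograph"}]
documents = [{"bid": "b1", "type": "monograph", "pages": ["p1", "p2"]}]
pages = [
    {"_id": "p2", "fulltext": "Second page.", "printed_page_number": 2},
    {"_id": "p1", "fulltext": "First page.", "printed_page_number": 1},
]

texts = all_book_texts(metadata, documents, pages)
```

With a real database, the list comprehensions would be replaced by queries on the corresponding collections.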
For articles, more steps are needed to obtain the full text. The first step is to look up the article in bibliodb_articles (which contains every article) to find the id of the journal that contains it. At the same time, a metadata record is created for each article, as none existed (it is needed later). Authors, journal bid, journal title, article title, year and volume are stored in the metadata table. Next, the start and end page numbers of the article are retrieved from the documents collection using the journal id. From these two page numbers, a table with the ids of all pages of the article is built (in documents, all pages of a document are stored in an array). The last step is to read the full text of each page of the article from the pages collection.
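A possible shape for the article retrieval is sketched below, again on in-memory lists of dictionaries. Several points are assumptions made for illustration only: the field names, the matching of an article to its journal-issue entry by title, and the interpretation of the start and end pages as 1-based, inclusive positions in the issue's page array.

```python
def article_metadata(article):
    """Build the metadata record that the source database lacks for articles."""
    keys = ("authors", "journal_bid", "journal_title", "title", "year", "volume")
    return {key: article.get(key) for key in keys}


def article_page_ids(issue, start_page, end_page):
    """Ids of the pages of one article inside a journal issue.

    Assumption: start_page and end_page are 1-based, inclusive positions
    in the issue's page array.
    """
    return issue["pages"][start_page - 1:end_page]


def article_full_text(article, documents, pages):
    """Concatenate the page texts of one article."""
    issue = next(d for d in documents if d["bid"] == article["journal_bid"])
    # Assumption: the article is matched to its issue entry by title.
    entry = next(a for a in issue["articles"] if a["title"] == article["title"])
    page_ids = article_page_ids(issue, entry["start_page"], entry["end_page"])
    text_by_id = {p["_id"]: p["fulltext"] for p in pages}
    return "\n".join(text_by_id[page_id] for page_id in page_ids)


# Hypothetical data: one journal issue with four pages and one article.
article = {"authors": ["A. Author"], "journal_bid": "j1",
           "journal_title": "A Journal", "title": "An Article",
           "year": 1990, "volume": 3}
documents = [{"bid": "j1", "type": "journal issue",
              "pages": ["p1", "p2", "p3", "p4"],
              "articles": [{"author": "A. Author", "title": "An Article",
                            "start_page": 2, "end_page": 3}]}]
pages = [{"_id": f"p{i}", "fulltext": f"Text of page {i}."} for i in range(1, 5)]

full_text = article_full_text(article, documents, pages)
```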
The second step is to send all our data to Dandelion in order to extract all the entities that are contained in the books and articles.
Dandelion is an API that extracts entities (persons, works, organisations, places, events and concepts) from text. Dandelion answers with a JSON file containing all entities found in the text. For each entity, this file gives the spot (the text span detected as an entity), the label (the name of the entity), the start and end positions of the spot in the text, the Wikipedia/DBpedia page linked to the entity, and a list of categories the entity belongs to (for example « Paintings of Leonardo da Vinci », « Italian paintings », etc. for the entity « Mona Lisa »). Using Dandelion imposes two constraints: it allows 30,000 requests per day (after an upgrade; a basic account allows only 1000 requests per day), and each request must not exceed 1 MiB. To handle these constraints, a full text that exceeds 1 MiB is split into smaller chunks that fit the limit. To do this, the text is split into sentences, and sentences are added one by one until the new text reaches the maximum size. The same is then done for the remaining sentences. For the majority of books and articles, however, the full text does not exceed the maximum size.
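The chunking strategy can be sketched as below. This is not the project's code: the sentence splitter is a simple regular expression chosen for illustration, the size is measured in UTF-8 bytes, and a single sentence longer than the limit is emitted as its own oversized chunk, a case the text does not discuss.

```python
import re

MAX_BYTES = 1024 * 1024  # 1 MiB request limit mentioned in the text


def split_into_chunks(text, max_bytes=MAX_BYTES):
    """Greedily pack whole sentences into chunks of at most max_bytes."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by spaces.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # The chunk is full: store it and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# A small limit makes the behaviour visible on a short text.
print(split_into_chunks("One. Two. Three.", max_bytes=9))
# ['One. Two.', 'Three.']
```

Each chunk is then sent as a separate request, so the daily request quota still has to be tracked by the caller.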
The third step is to create pulses for each entity found by Dandelion.
To build these pulses, only the label, the spot and the Wikipedia URL are kept from the Dandelion answer. There are 7 types of pulses, some of which have different versions for books and articles.
The pulses fall into two groups. The first group contains the pulses about a single entity:
* The equality pulse, of the form: '' #eq #entity_label wikipedia_URL ''
* The mention pulse, of the form: ''#mention #entity_label #in #title '', where title can be the book_title_p42 (if the entity is mentioned on page 42 of the book) or the article_title
* The in pulse, of the form: '' #book_title_p42 #in #book_title'' for books, and '' #article_title #in #journal_title '' for articles
* The creator pulse, of the form: '' #creator #book_title #author '' for books, and '' #creator #article_title #author '' for articles (if there is more than one author, there is one hashtag per author)
* A pulse that contains all this information in a more readable way, of the form: '' entity (link_wikipedia) is present in book 'book_title' by author_name at page page_number. #entity #book_title #author '' for books, and '' entity_label (link_wikipedia) is present in article 'article_title' by author(s)_name at page page_number in the volume volume_number of journal 'journal_title'. #entity_label #article_title #author #journal_title'' for articles
The creator pulse is not created for each entity but once for each book and article.
The in pulse is created once for each article.
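The hashtag pulses above are simple string templates, as the following sketch shows. The `hashtag` helper is hypothetical: the text does not say how labels and titles containing spaces are turned into hashtags, so joining the words with underscores is an assumption.

```python
def hashtag(text):
    """Turn a label or title into a hashtag (assumed underscore convention)."""
    return "#" + "_".join(text.split())


def equality_pulse(label, wikipedia_url):
    return f"#eq {hashtag(label)} {wikipedia_url}"


def book_mention_pulse(label, book_title, page):
    return f"#mention {hashtag(label)} #in {hashtag(book_title)}_p{page}"


def article_mention_pulse(label, article_title):
    return f"#mention {hashtag(label)} #in {hashtag(article_title)}"


def book_in_pulse(book_title, page):
    return f"{hashtag(book_title)}_p{page} #in {hashtag(book_title)}"


def article_in_pulse(article_title, journal_title):
    return f"{hashtag(article_title)} #in {hashtag(journal_title)}"


def creator_pulse(title, authors):
    """One hashtag per author, for a book or an article."""
    author_tags = " ".join(hashtag(author) for author in authors)
    return f"#creator {hashtag(title)} {author_tags}"


print(equality_pulse("Mona Lisa", "https://en.wikipedia.org/wiki/Mona_Lisa"))
# #eq #Mona_Lisa https://en.wikipedia.org/wiki/Mona_Lisa
print(book_mention_pulse("Mona Lisa", "A Book", 42))
# #mention #Mona_Lisa #in #A_Book_p42
```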
The second group contains the pulses about two entities:
* The copresence pulse, of the form : '' #copresence #entity1_label #entity2_label #book_title_p42 '' for books and '' #copresence #entity1_label #entity2_label #article_title '' for articles
* A pulse where this information is contained in a more readable way, of the form : '' entity1_label (link_wikipedia1) and entity2_label (link_wikipedia2) are distance_in_page pages distant in the book 'book_title' by author_name. #entity1 #entity2 #book_title #author '' for books and '' entity1 (link_wikipedia1) and entity2 (link_wikipedia2) are distance_in_page pages distant in the article 'article_title' by author(s)_name present in volume volume_number of journal 'journal_title'. #entity1 #entity2 #article_title #author(s) #journal_title '' for articles
The copresence pulse is only created when two entities are on the same page of a book. An entity can have multiple copresence pulses, especially for entities that are in an article.
For pulses from books, the page number is needed for some pulses (#in, #mention, #copresence). To retrieve it, the book containing the entity is scanned until the spot word is found.
For pulses that contain two entities, two methods are used depending on whether the source is a book or an article. For books, as the goal is to find entities on the same page, entities are taken two by two (Dandelion returns the entities in order of appearance). The page number of each entity is then found, and a copresence pulse is created if the difference between the two pages is zero. An article is short, so it is relevant to mention the copresence of all its entities. To do that, each entity is combined with all other entities, so the number of #copresence pulses is higher for articles than for books.
All pulses created are pushed to a new MongoDB database, together with the original books and articles, to keep track of which books and articles have already been processed (the original database is accessible in read-only mode only).
This new database has three collections. The first is the collection of pulses, in which each pulse has a type, which can be:
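The two pairing strategies can be sketched as follows. The phrase "taken two by two" is read here as consecutive pairs in order of appearance, which is an assumption, and the function names are illustrative.

```python
from itertools import combinations


def book_copresence_pairs(entities):
    """Pair consecutive entities of a book that fall on the same page.

    `entities` is a list of (label, page_number) tuples in order of
    appearance. Returns (label1, label2, page_number) triples.
    """
    return [
        (label_a, label_b, page_a)
        for (label_a, page_a), (label_b, page_b) in zip(entities, entities[1:])
        if page_a - page_b == 0
    ]


def article_copresence_pairs(labels):
    """Combine every entity of an article with every other entity."""
    return list(combinations(labels, 2))


book_entities = [("Venice", 1), ("Titian", 1), ("Rome", 2), ("Raphael", 2)]
print(book_copresence_pairs(book_entities))
# [('Venice', 'Titian', 1), ('Rome', 'Raphael', 2)]
print(article_copresence_pairs(["Venice", "Titian", "Rome"]))
# [('Venice', 'Titian'), ('Venice', 'Rome'), ('Titian', 'Rome')]
```

For n entities, an article yields n(n-1)/2 pairs while a book yields at most n-1, which matches the remark that articles produce more copresence pulses. This sketch does not filter out pairs in which both labels refer to the same entity.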
* 1 for sentence pulse about one entity
* 2 for sentence pulse about two entities
* book_copresence_pulse
* article_copresence_pulse
* book_mention_pulse
* article_mention_pulse
* book_in_pulse
* article_in_pulse
* entity_eq_pulse
* creator_pulse
For each type of pulse, there are a timestamp, the id and the pulse itself. All remaining fields are the data used to create the pulse. For example, a book_copresence_pulse additionally stores the names of the two entities, their page numbers and the title of the book.
The second collection is the collection of processed books. For each book, there are:
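As an illustration, a book_copresence_pulse record might be assembled as below before being inserted. Only the kinds of values come from the text; the field names are assumptions, and the id is left out on the assumption that MongoDB assigns an `_id` on insertion.

```python
from datetime import datetime, timezone


def book_copresence_record(entity1, entity2, page1, page2, book_title, pulse):
    """Build the record stored for one book copresence pulse."""
    return {
        "type": "book_copresence_pulse",
        "timestamp": datetime.now(timezone.utc),
        "pulse": pulse,
        # Data used to create the pulse (field names are hypothetical).
        "entity1": entity1,
        "entity2": entity2,
        "page1": page1,
        "page2": page2,
        "book_title": book_title,
    }


record = book_copresence_record(
    "Venice", "Titian", 42, 42, "A Book",
    "#copresence #Venice #Titian #A_Book_p42",
)
```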
* the creator
* the language
* the type
* the title
* the bid
* the date
* the timestamp
* the list of pulses created from this book.
The third collection is the collection of processed articles, which contains:
* the id
* the list of authors
* the journal bid
* the journal title
* the volume
* the year
* the article title
* the list of pulses associated to the article
To publish the pulses on ClioWire, a script retrieves them from the database in JSON format, and a small Python program then publishes them.

Latest revision as of 21:01, 15 December 2017

Discussion of the State of the Art

Named Entity Recognizers relied for a long time on algorithms based on statistical models such as the Hidden Markov Model (HMM) [1] [2] and the Conditional Random Fields (CRFs) [3] [4]. For instance, the widely used Stanford Named Entity Recognizer [5] uses CRFs.

In recent years however the advancement in both GPU technology and deep learning techniques triggered the advent of architectures making use of Long Short-Term Memory Neural Networks (LSTMNNs) [6]. LSTMNNs are often used in conjunction with CRFs and allow to obtain a performance equivalent to the aforementioned methods [7] [8] without the need of performing complex feature engineering [7][9] [10]. LSTMNNs provide a model which has become a fundamental component for major companies like Google which prefer them over precedent approaches [11].

It is worth nothing that a recent paper [12] introduces the possibility to use Iterated Dilated Convolutional Neural Networks (ID-CNNs) in place of LSTMNNs to drastically improve computation time through full exploitation of the GPU's parallelizable architecture while keeping the same level of accuracy, which suggests ID-CNNs could be the next step in improving NER.

As far as the current state of the NER reasearch, matters of great concern include training data scarcity and inter-domain generalization [13]. In order to be efficient on a language domain, current NER systems need large labeled datasets related to that domain [14]. This training data isn’t available for all language domains, which leads to the impossibility of applying NER efficiently to them. Furthermore, if a language domain doesn’t follow strict language conventions and allows for a wide use of the language, then the model will fail to generalize due to excessive heterogeneity in the data. Examples of such domains are Sport and Finance. This is the reason for which one of the big challenges is “adapt[ing] models learned on domains where large amounts of annotated training data are available to domains with scarce annotated data” [13].

Bibliography

  1. Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. Named entity recognition with character-level models. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 180-183.
  2. Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. Effective adaptation of a Hidden Markov Model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13 (BioMed '03), Vol. 13. Association for Computational Linguistics, Stroudsburg, PA, USA, 49-56.
  3. Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 188-191.
  4. Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA '04), Nigel Collier, Patrick Ruch, and Adeline Nazarenko (Eds.). Association for Computational Linguistics, Stroudsburg, PA, USA, 104-107.
  5. https://nlp.stanford.edu/software/CRF-NER.html
  6. James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4 (CONLL '03), Vol. 4. Association for Computational Linguistics, Stroudsburg, PA, USA, 172-175.
  7. 7.0 7.1 Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. Neural Architectures for Named Entity Recognition.
  8. Zhiheng Huang, Wei Xu, Kai Yu. 2015. Bidirectional {LSTM-CRF} Models for Sequence Tagging.
  9. Chen Lyu, Bo Chen, Yafeng Ren, Donghong Ji. 2017.Long short-term memory RNN for biomedical named entity recognition.
  10. Jason P. C. Chiu, Eric Nichols. 2015. Named Entity Recognition with Bidirectional LSTM-CNNs.
  11. Hasim Sak, Andrew Senior, Francoise Beaufays. 2014. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling
  12. Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum. 2017. Fast and Accurate Sequence Labeling with Iterated Dilated Convolutions.
  13. 13.0 13.1 Vivek Kulkarni, Yashar Mehdad, Troy Chevalier.2016. Domain Adaptation for Named Entity Recognition in Online Media with Word Embeddings.
  14. Isabelle Augenstein, Leon Derczynski, Kalina Bontcheva. 2017. Generalisation in Named Entity Recognition: {A} Quantitative Analysis.

Quantitive analysis of the performances

Through the Python script designed as part of this project it is possible to generate at least six different kinds of pulses for each detected named entity. Considering that the database contains approximately 2300 books and 600 journal issues, it would be interesting to compute a lower bound for the number of produced pulses. In order to do this, some assumptions concerning the average number of named entity detected in a single document must be formulated. By examining the log of the script, which registers the number of named entity associated to each processed document, we were led to consider that 1000 named entity per document is a good estimate of the average performance. A second assumption must be done with regard to the number of articles contained in a journal issue. By browsing the database, it was concluded that 10 is an acceptable value. Under these assumption, the number of pulses produced by parsing the whole database is

(2300 books * 1000 named_entities/book + 600 journal_issues * 10 articles/journal_issue * 1000 named_entities/articles) * 6 pulses/named_entity = 50 millions pulses.

The question now is how large a database must be in order to contain all these pulses. Please note that pulses are not the only elements stored in the output database. There’s also metadata information on documents, among which the most notable is the number of pulses produced from each processed document. The size of this metadata however is negligible compared to the one of pulses and so it won’t be considered in the estimate. In order to compute the final size of the output database, a speculation based on the current dimension of the database will be done. As of now the database contains around 400’000 pulses and its size is in the order of 220MB. Therefore, a database of 50 millions of pulses should be 125 times larger, which means a size of 27.5 GB. Considering that the input database is 80 GB in size, it can be argued that pulses represent an interesting way of condensing critical parts of information.

Lastly, an estimation of the needed time to process the whole database will be provided. The bottlenecks in the processing of the input database are two: the limited number of requests which can be sent daily to the NER service, Dandelion, and the writing of pulses into the output database. With these limitations, it was still possible to process 100 books in 16 hours. Under the assumption that processing a book takes as long as processing a journal issue, the total time to process the 2900 documents in the input database would be 464 hours, or 20 days.

Detailed description of the methods

For this project, the database is pretty big, with approximately 2300 books and 600 journal issues. This is a MongoDatabase, so all texts and metadata are stored in json. As there is two types of sources, books and articles, each method should be adapted for each type.

The database where the books and articles are stored has a lot of collections but for this project, only four of them will be used. First, the metadata collection, in which there are :

  • the _id of the book
  • the creator (the author)
  • the language in which the book is written
  • the title, the bid (id of the book)
  • the date
  • the type of document (in this project, only the monograph type will be used for this collection).

The second collection used is the documents collection. In it, there are :

  • the _id
  • the bid
  • the type (which can be here monograph or journal issue)
  • a list of pages id.

For the journal issues, there is also a list of articles, and for each article there are :

  • the author
  • the title
  • the start and end page of the article

The third collection is the pages one, which has two interesting values, the full text of a page and its printed page number (there is also the _id, as for every collection).

The last collection used is the bibliodb_articles one. In this one, which is a collection of articles, there are :

  • the _id
  • the journal bid
  • the journal title
  • the title of the article
  • the year
  • the volume (of the journal it comes from)

The first step is to get the full text of a book/article, as the books/articles are not stored fully together, but page by page.

For the books, the first step is to get all the BID (book id) in the metadata collection. After that, the second step is to go in the documents collection and get for each book the list of id of the pages. The last step is to go into the pages collection to get the full text that is contained in each page.

For the articles, more steps are needed to have the full text. The first step is to go into the bibliodb_articles (which contains every article), in order to have the journal id in which the article is contained. In the same time, a metadata is created for each article as there were none (and it is needed for after). Authors, journal bid, journal title, article title, year and volume are stored in the metadata table. After that, the start page number and the end page number are retrieved from the documents collection with the help of journal id. With the start and the end page number of the article, a table with the id of all pages of the article is created (as in documents, all pages of the document are stored in an array). The last step is to go into the pages collection to get the full text contained in each page of the article.


The second step is to send all our data to Dandelion in order to extract all the entities that are contained in the books and articles.

Dandelion is an API that extracts entities (that can be person, work, organisations, places, events and concept) from text. The answer of Dandelion is a json file containing all entities found on the text. In this json file, for each entity, Dandelion returns the spot of the entity (the name found in text detected as entity), the label of the entity (name of the entity), the start and the end of the word detected (related to the position in the text), wikpedia/dbpedia page linked to the entity, and a list of categories the entity is part of (for example « Paintings of Leonardo da Vinci », « Italian paintings », etc. for the entity « Mona Lisa »). Two constraints appear with the use of Dandelion : Dandelion allows 30’000 requests per day (after upgrade, basic account allows only 1000 requests/day) and each request shouldn’t exceed 1MiB. To handle theses constraints, the idea is to split the full text in small chunks that fit the requirement if the full text exceeds 1MiB. To do this, the text is split in sentences, and sentences are add one by one until the new text reaches the maximum size. And the same is done for the removing sentences. But for a majority of books and articles, the full text doesn’t exceed the maximum size.


The third step is to create pulses for each entity found by Dandelion. To build these pulses, only the label, the spot and the Wikipedia URL are kept from the Dandelion response. There are 7 types of pulses, some of which have different versions for books and articles.

These pulses fall into two groups. The first group contains the pulses about a single entity:

  • The equality pulse, of the form:  #eq #entity_label wikipedia_URL
  • The mention pulse, of the form:  #mention #entity_label #in #title , where title is either book_title_p42 (if the entity is mentioned on page 42 of the book) or article_title
  • The in pulse, of the form:  #book_title_p42 #in #book_title  for books, and  #article_title #in #journal_title  for articles
  • The creator pulse, of the form:  #creator #book_title #author  for books, and  #creator #article_title #author  for articles (a work can have more than one author, in which case there are more hashtags)
  • A pulse that contains all this information in a more readable form:  entity (link_wikipedia) is present in book 'book_title' by author_name at page page_number. #entity #book_title #author  for books, and  entity_label (link_wikipedia) is present in article 'article_title' by author(s)_name at page page_number in the volume volume_number of journal 'journal_title'. #entity_label #article_title #author #journal_title  for articles

The creator pulse is not created for each entity but once for each book and article. The in pulse is created once for each article.
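
As an illustration, the hashtag pulses for a book could be assembled like this. The helper names and the convention of joining words with underscores inside a hashtag are assumptions; only the pulse templates come from the list above.

```python
def tag(value):
    """Turn a label or title into a hashtag (underscore joining is assumed)."""
    return "#" + "_".join(str(value).split())


def book_entity_pulses(label, wiki_url, book_title, page):
    """Build the equality, mention and in pulses for one entity of a book."""
    page_tag = f"{tag(book_title)}_p{page}"
    return {
        "eq": f"#eq {tag(label)} {wiki_url}",
        "mention": f"#mention {tag(label)} #in {page_tag}",
        "in": f"{page_tag} #in {tag(book_title)}",
    }


def creator_pulse(title, authors):
    """Build the creator pulse, with one hashtag per author."""
    return " ".join(["#creator", tag(title)] + [tag(author) for author in authors])
```

Since the creator pulse is built once per book or article, creator_pulse would be called outside the loop over entities.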


The second group contains the pulses about two entities:

  • The copresence pulse, of the form:  #copresence #entity1_label #entity2_label #book_title_p42  for books and  #copresence #entity1_label #entity2_label #article_title  for articles
  • A pulse that contains this information in a more readable form:  entity1_label (link_wikipedia1) and entity2_label (link_wikipedia2) are distance_in_page pages distant in the book 'book_title' by author_name. #entity1 #entity2 #book_title #author  for books and  entity1 (link_wikipedia1) and entity2 (link_wikipedia2) are distance_in_page pages distant in the article 'article_title' by author(s)_name present in volume volume_number of journal 'journal_title'. #entity1 #entity2 #article_title #author(s) #journal_title  for articles

For books, the copresence pulse is only created when two entities are on the same page. An entity can have several copresence pulses, especially when it appears in an article. For books, the page number is needed for some pulses (#in, #mention, #copresence). To retrieve it, the book containing the entity is scanned until the spot is found.
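
The page lookup can be sketched as a linear scan. The representation of a book as a list of pages with number and text fields, and the optional start argument, are assumptions made for this example.

```python
def find_page(pages, spot, start=0):
    """Return the number of the first page, from index `start`, containing spot.

    `pages` is a list of dicts with hypothetical "number" and "text" fields.
    Returns None when the spot is not found.
    """
    for page in pages[start:]:
        if spot in page["text"]:
            return page["number"]
    return None
```

A spot that occurs on several pages resolves to its first occurrence. Because the entities arrive in order of appearance, a caller could pass the index of the last match as start to reduce this effect.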


Pulses that contain two entities are built differently for books and for articles. For books, since the aim is to find entities on the same page, entities are taken two by two (Dandelion returns the entities in order of appearance). The page number of each entity is then looked up, and a copresence pulse is created if the difference between the two page numbers is zero. Since articles are short, it is relevant to record the copresence of all their entities: each entity is combined with every other entity, so the number of #copresence pulses is higher for articles than for books.
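
The two pairing strategies can be compared in a short sketch. Reading "two by two" as consecutive pairs in order of appearance is an interpretation, and the function names are illustrative.

```python
from itertools import combinations


def book_copresence_pairs(entities):
    """Pair consecutive entities that share a page.

    `entities` is a list of (label, page_number) tuples in order of appearance.
    """
    return [
        (first, second)
        for (first, page_a), (second, page_b) in zip(entities, entities[1:])
        if page_a == page_b
    ]


def article_copresence_pairs(labels):
    """Pair every entity of an article with every other entity."""
    return list(combinations(labels, 2))
```

For n entities, an article yields n(n-1)/2 pairs, whereas a book yields at most n-1, which matches the observation that articles produce more #copresence pulses.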


All the pulses created are pushed to a new MongoDB database, together with the original books and articles, in order to know which books and articles have already been processed (access to the original database is read-only).

This new database has three collections. The first one is the collection of pulses, in which each pulse has a type, which can be:

  • 1 for sentence pulse about one entity
  • 2 for sentence pulse about two entities
  • book_copresence_pulse
  • article_copresence_pulse
  • book_mention_pulse
  • article_mention_pulse
  • book_in_pulse
  • article_in_pulse
  • entity_eq_pulse
  • creator_pulse

Every pulse, whatever its type, has a timestamp, an id and the pulse text itself. The remaining fields are the data used to create the pulse. For example, a book_copresence_pulse additionally stores the names of the two entities, their page numbers and the title of the book.

The second collection is the collection of processed books. For each book, it stores:

  • the creator
  • the language
  • the type
  • the title
  • the bid
  • the date
  • the timestamp
  • the list of pulses created from this book.

The third collection is the collection of processed articles, which stores:

  • the id
  • the list of authors
  • the journal bid
  • the journal title
  • the volume
  • the year
  • the article title
  • the list of pulses associated to the article
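
To make the layout of the pulses collection concrete, here is a hedged sketch of one record and of its JSON export. The type value book_copresence_pulse comes from the list above; the key names, the ISO timestamp format and both function names are assumptions.

```python
import json
from datetime import datetime, timezone


def book_copresence_record(pulse_id, pulse_text, entity1, entity2,
                           page1, page2, book_title):
    """Build one document for the pulses collection (key names are assumed)."""
    return {
        "id": pulse_id,
        "type": "book_copresence_pulse",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pulse": pulse_text,
        # Data used to create the pulse.
        "entity1": entity1,
        "entity2": entity2,
        "page1": page1,
        "page2": page2,
        "book_title": book_title,
    }


def to_json(records):
    """Serialize pulse records to JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Records for the other pulse types would share the id, type, timestamp and pulse keys and differ only in the additional fields.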

To publish the pulses on ClioWire, a script retrieves them from the database in JSON format, and a small Python program then publishes them.