Gallica wrapper
== Basics of Gallica API ==
=== ARK identifier ===
ARK (Archival Resource Key) is a unique identifier for any kind of resource. Its goal is to create a strong, permanent link between the identifier and the object. It rests on the following principles:
* Responsibility of the organization for permanent access to its resources.
* Possibility to add a suffix to query the metadata of the resource.
* Opacity of the ID, so that it never needs to change even if new properties of the resource arise.
* Complete control over all published IDs to ensure there are no duplicates.
An ARK has the following form: <code>ark:/12148/bpt6k205233j</code>, where:
* <code>ark:</code> is the scheme
* <code>12148</code> is the Name Assigning Authority Number (NAAN), a unique number given to each institution by the California Digital Library (CDL). This particular number is the one of the BnF.
* <code>bpt6k205233j</code> is the ARK name, the identifier of the resource within the institution designated by the NAAN.
ARK is the identifier scheme used at the BnF and is thus used in the wrapper to identify periodicals and documents. For more information on the usage of ARK at the BnF, see the dedicated BnF page.
=== Document API ===
The Document API allows finding information on periodicals and on documents.
==== Periodical ====
The API gives all the years and dates of publication, as well as the ARKs of the issues.
==== Documents ====
The API gives the metadata of a document in the form of an OAI XML record containing Dublin Core properties. These properties hold much valuable information, such as authors, dates, subjects, etc.
The API can also be used to find the pagination of a document, which in turn can give access to its OCR under the [https://en.wikipedia.org/wiki/ALTO_(XML) XML ALTO] format and its [https://iiif.io/ IIIF] images.
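As an illustration, these document services live at URLs of the following shape. This is a minimal sketch: the endpoint names come from Gallica's public API documentation, but the exact parameter format should be double-checked before relying on it.

```python
# Sketch of the Document API endpoints described above; treat the
# parameter formats as assumptions to verify against Gallica's docs.
BASE = "https://gallica.bnf.fr/services"

def oai_record_url(ark_name):
    """Metadata (OAI / Dublin Core) record of a document."""
    return f"{BASE}/OAIRecord?ark={ark_name}"

def issues_url(ark_name):
    """Publication years and issues of a periodical."""
    return f"{BASE}/Issues?ark=ark:/12148/{ark_name}/date"

def pagination_url(ark_name):
    """Pagination of a document (its pages, and whether OCR exists)."""
    return f"{BASE}/Pagination?ark={ark_name}"
```

Each call returns an XML payload that the wrapper parses on your behalf.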
=== Search API ===
Gallica offers a search API that uses the SRU (Search/Retrieve via URL) standard. The details will not be covered here but can be found on the corresponding Gallica documentation page (in French).
It is used directly by the main Gallica website, so by observing the URL when doing a search, it is possible to infer the corresponding API call. The results are paginated with up to 50 records per page (15 by default), so it is necessary to handle the pagination.
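A minimal sketch of such an inferred call, assuming the parameter names visible in the URLs the Gallica website itself produces:

```python
from urllib.parse import urlencode

def sru_url(query, start_record=0, maximum_records=50):
    """Build a Gallica SRU search URL; the parameter names are inferred
    from the URLs of the Gallica website, so verify them before use."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,  # the server caps this at 50
    }
    return "https://gallica.bnf.fr/SRU?" + urlencode(params)

# Handling the pagination means re-issuing the call with an increasing
# startRecord until fewer than maximum_records records come back.
```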
== Using the wrapper ==
=== Installation ===
The wrapper requires Python 3.
* Clone the repository with <code>git clone https://github.com/raphaelBarman/fdh-gallica.git</code>
* Install its requirements (using <code>pip install -r requirements.txt</code> or, with conda, <code>conda install --file requirements.txt</code>).
* Then install the wrapper package with <code>python setup.py install</code>.
=== Command line tools ===
The wrapper provides two command line tools: <code>gallica_exporter.py</code> and <code>gallica_download.py</code>.
==== gallica_exporter.py ====
This tool generates a [https://en.wikipedia.org/wiki/Comma-separated_values CSV] file of URLs and paths for the metadata, IIIF images and OCR of a given document, periodical or search.
Type <code>python gallica_exporter.py -h</code> for more information about the different options.
The tool takes as input either the ARK of a document or periodical, or a search query, and generates a list of all the relevant URLs. The paths form a hierarchy: for a periodical, the year sits at the top level; then each document gets a folder named after its ARK name, containing an XML file with its OAI metadata, an "images" folder with its JPEG images and an "alto" folder with its XML ALTO OCR (if it exists). By default, the script generates paths relative to the directory it is executed in and creates a file named "download_urls_paths.csv".
This can generate a lot of results, so be sure to check the file before launching the downloader script.
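The hierarchy described above can be sketched in a few lines. The file name "oai.xml" and the page numbering scheme are assumptions for illustration, not necessarily the exact names the exporter writes.

```python
from pathlib import Path

def document_paths(root, ark_name, pages, year=None):
    """Illustrative layout: <year>/<ark name>/ with an OAI metadata file
    plus "images" and "alto" subfolders, one file per page."""
    base = Path(root) / ark_name if year is None else Path(root) / str(year) / ark_name
    return {
        "metadata": base / "oai.xml",  # file name assumed
        "images": [base / "images" / f"{p}.jpg" for p in pages],
        "alto": [base / "alto" / f"{p}.xml" for p in pages],
    }
```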
Example usage:
* <code>python gallica_exporter.py -p "12148/cb42659777h"</code> to generate URLs for all the issues of the [https://gallica.bnf.fr/ark:/12148/cb42659777h/date.item Bazar parisien]
* <code>python gallica_exporter.py -d "12148/bpt6k855459k"</code> to generate URLs for the [https://gallica.bnf.fr/ark:/12148/bpt6k855459k Liste des marchands bonnetiers un des six corps des marchands à Paris]
* <code>python gallica_exporter.py -s "Eugène Atget" --doc-type "image" --max-records 60</code> to generate URLs for the first [https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&version=1.2&startRecord=0&maximumRecords=15&page=1&query=%28bibliotheque%20adj%20%22Biblioth%C3%A8que%20nationale%20de%20France%22%29%20and%20%28gallica%20all%20%22eug%C3%A8ne%20atget%22%29%20and%20%28dc.type%20all%20%22image%22%29%20sortby%20dc.date%2Fsort.ascending 60 documents of type image that contain "Eugène Atget"]
==== gallica_download.py ====
This tool takes the list built by the previous tool and downloads all the files. Downloads run in parallel; hopefully the BnF's servers are strong enough to handle your download, but if you see a lot of errors, reduce the number of processes (the "-p" flag). Failures can be logged using the "-f" flag.
Once again, type <code>python gallica_download.py -h</code> for more information about the different options.
Example usage:
<code>python gallica_download.py -f failures.txt download_urls_paths.csv</code> will download every URL to the path specified in "download_urls_paths.csv" and store any failures in a file named "failures.txt".
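The core loop of such a downloader can be sketched as follows. This is a generic illustration, not the actual implementation of <code>gallica_download.py</code>, and it assumes a two-column url,path CSV without a header row.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

def read_pairs(csv_path):
    """Read (url, destination path) rows from the exporter's CSV
    (assumed here to be two columns without a header)."""
    with open(csv_path, newline="") as f:
        return [(row[0], row[1]) for row in csv.reader(f) if row]

def download_all(pairs, processes=8):
    """Fetch every pair in parallel; return the urls that failed."""
    failures = []

    def fetch(pair):
        url, path = pair
        try:
            Path(path).parent.mkdir(parents=True, exist_ok=True)
            urlretrieve(url, path)
        except OSError:  # URLError and HTTPError are OSError subclasses
            failures.append(url)

    with ThreadPoolExecutor(max_workers=processes) as pool:
        list(pool.map(fetch, pairs))
    return failures
```

Lowering <code>processes</code> plays the same role as the "-p" flag: fewer concurrent requests, fewer server-side errors.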
=== API wrapper ===
For more advanced and custom usage, the wrapper can be used directly from Python.
It provides the <code>fdh_gallica</code> package, which contains wrappers for periodicals, documents and searches.
The periodical and document objects are built from an ARK identifier.
==== Periodical ====
For a periodical, it is possible to query:
* The years in which issues appeared
* Its issues for a given year (returned as Document objects)
* All its issues (across all years)
==== Document ====
For a document, it is possible to query:
* Its OAI metadata as a dict parsed from XML (using [https://github.com/martinblech/xmltodict xmltodict])
* Its pagination metadata as a dict parsed from XML
* Its page numbers
* Whether it has ALTO OCR
* The IIIF URL of the image of a particular page
* The IIIF URLs of the images of the whole document
* The URL of its ALTO XML for a particular page
* The URLs of the ALTO XML of the whole document
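The per-page URLs behind the last four items have shapes like the following. The IIIF pattern follows the IIIF Image API, and the ALTO pattern uses the RequestDigitalElement service observed on Gallica; both should be treated as assumptions to verify.

```python
# Sketch of the per-page URL patterns; double-check both against the
# actual URLs the wrapper produces before relying on them.
def iiif_image_url(ark_name, page, region="full", size="full"):
    """IIIF Image API URL for one page of a Gallica document."""
    return (f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}"
            f"/f{page}/{region}/{size}/0/native.jpg")

def alto_url(ark_name, page):
    """ALTO XML OCR of one page, via the RequestDigitalElement service."""
    return ("https://gallica.bnf.fr/RequestDigitalElement"
            f"?O={ark_name}&E=ALTO&Deb={page}")
```

The <code>region</code> and <code>size</code> parameters accept any value allowed by the IIIF Image API, so the same pattern can fetch thumbnails or crops.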
==== Search ====
A Search object is built from a query with the following fields:
* The "All" field
* The "dc.type" field (to filter by type of document, e.g. "image" or "book")
* The "dc.creator" field
* The "dc.title" field
* Any other field/search-term pair
The Search object then needs to be executed using its <code>execute</code> method. Once executed, three pieces of information are stored in the object:
* <code>.records</code>: the xmltodict parse of the XML records (the raw results)
* <code>.documents</code>: the Document objects corresponding to the records
* <code>.failures</code>: the URLs of the pages that were not fetched correctly (it is possible to retry them using <code>.retry()</code>)
Latest revision as of 14:45, 9 October 2019