As part of the SemBib project, I was led to choose a unique identifier for each author. Following my usual strategy, I started by using identifiers defined in our namespace, with our prefix. Thus, it was possible to produce results quickly and to encourage others to participate in the project.
I apply this method to the elements I need immediately, here the authors. I try to minimize the identifiers and vocabularies that I define ad hoc and to use the most common, well-known vocabularies; but, working in an agile way, I do not want to hold up the first application results with a long search through all the pre-existing vocabularies I could link to.
In a second step, I look for vocabularies or identifiers that create links with other datasets. The basic principle is to create new versions of my data (with backward compatibility if the data has been published), notably using owl:sameAs. I intend to consolidate this strategy in the coming months and I am open to advice.
By chance, I found that I am identified by IdRef with the permanent link http://www.idref.fr/157248550. So I went to find out more about it. The page at http://www.idref.fr gives little information. The subtitle of IdRef is ‘The repository of the Sudoc authorities’, which is not very clear unless you know that SuDoc is ‘The catalog of the University System of Documentation’. The bottom of the page carries the heading ‘ABES – Bibliographical Agency of Higher Education’, which already suggests a more direct link with the SemBib project and prompted me to dig deeper.
In fact, the online documentation taught me that my IdRef identifier is http://www.idref.fr/157248550/id. I can obtain an XML representation of my bibliographic record as held by ABES at http://www.idref.fr/157248550.xml (note that it is very incomplete), and the record in JSON at http://www.idref.fr/services/biblio/157248550.json.
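To make these addresses easier to reuse, here is a minimal Python sketch that derives them from a record number and builds an owl:sameAs statement towards the IdRef identifier. The function names and the local author URI (`http://example.org/sembib/author/jcm`) are hypothetical; only the URL patterns come from the addresses quoted above. Record numbers are kept as strings so that leading zeros are preserved.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
IDREF_BASE = "http://www.idref.fr"


def idref_links(ppn: str) -> dict:
    """Return the IdRef addresses quoted in the text for one record number."""
    return {
        "page": f"{IDREF_BASE}/{ppn}",
        "id": f"{IDREF_BASE}/{ppn}/id",
        "xml": f"{IDREF_BASE}/{ppn}.xml",
        "biblio_json": f"{IDREF_BASE}/services/biblio/{ppn}.json",
    }


def same_as_triple(local_uri: str, ppn: str) -> str:
    """Build one N-Triples line linking a local author URI to its IdRef id."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{idref_links(ppn)['id']}> ."


if __name__ == "__main__":
    # The local URI below is a placeholder, not the real SemBib namespace.
    print(same_as_triple("http://example.org/sembib/author/jcm", "157248550"))
```

Publishing such a triple next to the existing data keeps the local identifier valid while adding the link to IdRef.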
Find a researcher’s identifier
I wondered if I would be able to associate an idref identifier with all Telecom ParisTech researchers. As there are about 200 researchers and as many PhD students at Telecom ParisTech, I would like to automate this.
Unfortunately, on the day I started my tests (16/1/2017), the examples in section 2.3 did not work. So I posted a message to the email address listed in the documentation. Very quickly I had an answer, with several proposals (thank you F.M., whose answer is largely reproduced below). I will limit myself here to describing publicly accessible solutions.
The first is to query IdRef’s Solr search engine with a person's last name and first name (service documented at http://documentation.abes.fr/aideidrefdeveloppeur/ch02s01.html). Example query: http://www.idref.fr/Sru/Solr?q=persname_t:(Moissinac AND Jean-Claude)&fl=ppn_z,affcourt_z&wt=xml
The q parameter contains the search that Solr will perform. Here we run a ‘contains the words’ search, indicated by the _t suffix, on the persname (person name) field; the field name is followed by a colon and then the search terms, here a list of words to look for. If the words are found, as in the example above, a response like the following is returned:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">ppn_z</str>
      <str name="q">persname_t:(Moissinac AND Jean-Claude)</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="ppn_z">157248550</str>
    </doc>
  </result>
</response>
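Extracting the identifiers from such a response can be sketched in Python with the standard library. This is only an illustration: the sample string is a cleaned-up copy of the response above and assumes the usual lowercase Solr element names, and `extract_ppns` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the response shown above (assumed lowercase Solr tags).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="ppn_z">157248550</str>
    </doc>
  </result>
</response>"""


def extract_ppns(xml_text: str):
    """Return (numFound, list of ppn_z values) from a Solr XML response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result = root.find("result")
    if result is None:
        return 0, []
    num_found = int(result.get("numFound", "0"))
    ppns = [
        node.text.strip()
        for node in result.findall("doc/str[@name='ppn_z']")
        if node.text
    ]
    return num_found, ppns


if __name__ == "__main__":
    print(extract_ppns(SAMPLE))
```

The numFound attribute is returned alongside the identifiers so that the caller can tell an empty answer from an ambiguous one.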
By replacing wt=xml at the end of the query with wt=json, we get a response formatted in JSON (as of 27/1/2017, not served with the right MIME type).
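Since I want to automate this for a few hundred names, the query URL can be built programmatically. The sketch below only assembles the URL with the q, fl and wt parameters seen in the example; `build_person_query` is a hypothetical helper, and the percent-encoding applied by `urlencode` is an assumption about what the service accepts (the example above is written unencoded).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_ENDPOINT = "http://www.idref.fr/Sru/Solr"


def build_person_query(last_name, first_name=None,
                       fields=("ppn_z", "affcourt_z"), fmt="xml"):
    """Assemble a persname_t query URL; fmt is 'xml' or 'json'."""
    words = [last_name] + ([first_name] if first_name else [])
    params = {
        "q": "persname_t:(" + " AND ".join(words) + ")",
        "fl": ",".join(fields),
        "wt": fmt,
    }
    return SOLR_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_person_query("Moissinac", "Jean-Claude", fmt="json")
    print(url)
    # Decoding the query string shows the parameters as sent.
    print(parse_qs(urlsplit(url).query))
```

Leaving out the first name widens the search to every record containing the last name.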
This method may return nothing if the IdRef database does not contain exactly the strings you are looking for, or too many records if the search is too broad. For example, a search limited to the word Moissinac gives 9 answers, which then have to be discriminated, for instance by fetching the bibliographic record of each one (see above) and eliminating the inappropriate ones. Among the 9 responses for ‘Moissinac’, the first one is id 056874022, associated with the record http://www.idref.fr/services/biblio/056874022.json, where the field “name” has the value “Moissinac, Bernard”.
One can, for example, compare each “name” field with the reference string “Moissinac, Jean-Claude” using the Levenshtein distance. This should suffice to discriminate most cases. One can also manually check all the names for which 0 or several answers are obtained (assuming that a single answer is the right one). We will later consider automated testing of other fields in the record.
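The comparison by Levenshtein distance can be sketched as follows. This is a plain dynamic-programming implementation, not a library call. The candidate list is illustrative: the first entry is quoted above, and I assume that the “name” field of record 157248550 reads “Moissinac, Jean-Claude”.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_record(reference: str, candidates: dict):
    """Return (ppn, name, distance) for the name closest to the reference."""
    ppn, name = min(
        candidates.items(),
        key=lambda item: levenshtein(reference.lower(), item[1].lower()),
    )
    return ppn, name, levenshtein(reference.lower(), name.lower())


# Illustrative candidates: ppn -> value of the "name" field.
CANDIDATES = {
    "056874022": "Moissinac, Bernard",
    "157248550": "Moissinac, Jean-Claude",  # assumed value of the field
}

if __name__ == "__main__":
    print(closest_record("Moissinac, Jean-Claude", CANDIDATES))
```

In practice a maximum acceptable distance would be needed as well, so that a poor best match is sent to manual checking rather than accepted.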
The second method consists in querying the Solr engine of theses.fr with a surname and first name, plus an additional constraint linking this person with Telecom ParisTech (= Paris, ENST) or idref id 026375273:
You can get an output in xml or json format.
This only works for people who have been involved in a thesis, and they are not necessarily associated with “Paris, ENST”, so the constraint is strong. Moreover, theses.fr does not seem to know our different names (Telecom Paris, Telecom Paris, Telecom ParisTech …) or, in any case, does not recognize them as various names of the same organization. The ideal would be to find an idref identifier for our institution and to use it in the search criteria. We will not deal with that today.
Rely on VIAF
The article at http://corist-shs.cnrs.fr/IDCharers_2016 contains ideas: instead of using IDREF, rely on ORCID or VIAF, with which IDREF has exchange agreements.
A tour of the ORCID APIs and especially the VIAF API (https://platform.worldcat.org/api-explorer/apis/VIAF) gives me the following link:
This link allows me to find my VIAF identifier and many others (SUDOC / IDREF in particular).
I only have to apply this query to all the names of Telecom ParisTech researchers, hoping that there will not be too many ambiguities (which show up as a numberOfRecords field greater than 1) or missing entries (which show up as a numberOfRecords field equal to 0).
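Sorting the results by record count can be sketched like this: zero records means the person is missing, exactly one is taken as a match, and more than one is an ambiguity to resolve. The function names and the sample counts are hypothetical; only the numberOfRecords field comes from the text.

```python
def classify_lookup(number_of_records: int) -> str:
    """Map a numberOfRecords value to 'missing', 'unique' or 'ambiguous'."""
    if number_of_records < 0:
        raise ValueError("numberOfRecords cannot be negative")
    if number_of_records == 0:
        return "missing"
    if number_of_records == 1:
        return "unique"
    return "ambiguous"


def triage(counts: dict) -> dict:
    """Group researcher names by the outcome of their lookup."""
    groups = {"missing": [], "unique": [], "ambiguous": []}
    for name, number_of_records in counts.items():
        groups[classify_lookup(number_of_records)].append(name)
    return groups


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sample_counts = {"Researcher A": 1, "Researcher B": 0, "Researcher C": 4}
    print(triage(sample_counts))
```

Only the “missing” and “ambiguous” groups then need a manual check or a second, more precise query.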
Next: SPARQL access
In a future post, we will explore the SPARQL access of the ABES: https://lod.abes.fr/sparql.