Using TfIdf concepts on graphs

Tf-Idf is a weighting method, often used to evaluate the importance of a word in a document.

The idea here is to evaluate the importance of a predicate or a predicate-value pair for an entity of a semantic graph.

Tf-Idf on predicates

We will, for this first evaluation, consider that an entity of a graph is represented by the predicates that describe it. The equivalent of a document in the classic use of Tf-Idf is therefore an entity and its list of predicates (we could extend it to several neighborhood levels of the entity).

The “raw” frequency of a predicate is simply the number of occurrences of this predicate for the considered entity. This raw frequency can be used directly to express the frequency of a predicate. For documents, it is also possible to use the number of occurrences of a word relative to the total number of words in the document. With documents of roughly homogeneous length, the choice matters little; otherwise, it can quickly become important. Similarly, we propose to use the number of occurrences of a predicate associated with an entity, normalized by the total number of predicates associated with this entity.

to(p,e)=number of occurrences of p for the entity e

can be obtained, for example, on our graph <> with the SPARQL query:

select (count(?o) as ?c) where {
  graph <> {
    <target entity> <target predicate> ?o } }

tp(e)=number of predicates associated with e

can be obtained, for example, on our graph <> with a SPARQL query following this model:

select (count(?p) as ?c) where {
  graph <> {
    <target entity> ?p []} }

tf(p,e)= to(p,e)/tp(e)

The inverse document frequency (IDF) is a measure of the importance of the term in the entire body of documents. For us, it measures the importance of a predicate over the whole graph.

We will evaluate it using the logarithm of the inverse of the proportion of entities that use this predicate:

D = number of entities in the graph

can be obtained, for example, on our graph <> with the SPARQL query:

select (count(distinct ?s) as ?c) where {
  graph <> {
    ?s ?p []} }

d(p) = number of entities that use the predicate p

On the previous graph, this is obtained for the most used predicates with:

select ?p (count(distinct ?s) as ?c) where {
  graph <> {
    ?s ?p []} }
group by ?p
order by desc(?c)


idf(p) = log(D/d(p))


tfidf(p,e) = tf(p,e)*idf(p)
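These formulas can be sketched directly in Python. The counts would come from the SPARQL queries above; the numbers used in the example call below are hypothetical, not taken from the graph discussed in this post:

```python
import math

def tf(to_pe, tp_e):
    """Normalized frequency: occurrences of predicate p for entity e,
    divided by the total number of predicates of e."""
    return to_pe / tp_e

def idf(D, d_p):
    """Log of the inverse proportion of entities that use predicate p."""
    return math.log(D / d_p)

def tfidf(to_pe, tp_e, D, d_p):
    return tf(to_pe, tp_e) * idf(D, d_p)

# Hypothetical counts: the entity uses p 5 times out of 10 predicate
# occurrences; the graph has 100 entities, 10 of which use p.
print(round(tfidf(5, 10, 100, 10), 3))
```

Note that tfidf is 0 whenever every entity of the graph uses the predicate (d(p) = D), which is exactly the case of the third predicate in the example further down.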

For example, for the previous graph


* for the predicate <>
d(<>) = 118659

* for the predicate <>
d(<>) = 2079

* for the predicate <>
d(<>) = 1793

* for the predicate <> 
d(<>) = 1581

for the entity <ées>, with the query

select ?p (count(?p) as ?c) where {
  graph <> {
    <ées> ?p ?o } }
group by ?p
order by desc(?c)

the following predicates are found (among a total of 84 predicate occurrences):

<>	48
<>	8
<>	5
<>	5
<>	3
<>	3
<>	3
<>	2
<>	1
<>	1
<>	1
<>	1
<>	1
<>	1
<>	1

And, for example, the following tfidf

tfidf(<>) = 0.63

tfidf(<>) = 0.41

tfidf(<>) = 0

tfidf(<>) = 0.43


We can also do this type of calculation on the predicates used by classes, or on the (predicate, value) pairs used. We can also consider the incoming predicates of an entity.

Posted in Non classé | Leave a comment

Artworks in Wikidata

As part of the Data&Musée project, we are interested in the linked data available about artworks, the artists, and the museums and monuments related to these works. We have already addressed this issue in the Artworks in DBpedia post. We now look at their coverage by Wikidata, from its SPARQL endpoint.

Technical note: the following RDF prefixes are used
prefix wd: <>
prefix wdt: <>


Number of Artworks and their Types

We looked in Wikidata for entities of the types creative work (wd:Q17537576), painting (wd:Q3305213) and artwork (wd:Q838948).

SELECT ?type (count(DISTINCT ?oeuvre) as ?c) WHERE {
  VALUES ?type {wd:Q838948 wd:Q17537576 wd:Q3305213}
  ?oeuvre wdt:P31 ?type. }
group by ?type
order by desc(?c)

Which gives:

type c
wd:Q3305213 365806
wd:Q17537576 8105
wd:Q838948 3578

Links between artworks and museums

Now, let’s look for those that have a direct link to a museum. To avoid timeouts on Wikidata, we issue one query per artwork type, on the following model:

SELECT ?type (COUNT(DISTINCT ?oeuvre) AS ?c) WHERE {
  VALUES ?type { wd:Q838948 }
  ?oeuvre wdt:P31 ?type.
  { ?oeuvre ?link ?museum. } UNION { ?museum ?link ?oeuvre. }
  ?museum wdt:P31 wd:Q33506. }
GROUP BY ?type

which gives:

type c
wd:Q838948 299
wd:Q17537576 2
wd:Q3305213 52095

One can also ask for the number of entities of type wd:Q33506 (museum):

select (count(distinct ?museum) as ?c)
where {
  ?museum wdt:P31 wd:Q33506 }

There are 34606 of them. This suggests that many museums are known, and that there are many links with artworks, especially paintings.
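As a quick sanity check on these orders of magnitude, the linkage rate for paintings can be computed from the counts reported above:

```python
# Counts reported by the queries above.
paintings_total = 365806    # instances of painting (wd:Q3305213)
paintings_linked = 52095    # of which have a direct link to a museum

rate = paintings_linked / paintings_total
print(f"{rate:.1%}")  # share of paintings directly linked to a museum
```

So roughly one painting in seven has a direct museum link.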

Links between artworks and people

Now let’s see how the artworks are related to people (wd:Q5), type by type, on the following model:

SELECT ?type (COUNT(DISTINCT ?oeuvre) AS ?c) WHERE {
  VALUES ?type { wd:Q838948 }
  ?oeuvre wdt:P31 ?type.
  { ?oeuvre ?link ?artist. } UNION { ?artist ?link ?oeuvre. }
  ?artist wdt:P31 wd:Q5. }
GROUP BY ?type

which gives:

type c
wd:Q838948 2512
wd:Q17537576 269
wd:Q3305213 >9422 (timeout on ?oeuvre ?link ?artist)

We see that in Wikidata there are links between artworks and people, but only for a relatively small proportion of artworks. There is still work to be done to improve Wikidata’s knowledge of artworks! It’s your turn!


Posted in Data&Musée, Public data | Leave a comment

Paris Musées and Wikidata: establishing links

As of 6/1/2019, my list of establishments attached to Paris Musées includes 14 museums with 16 denominations (see the end of this post). It was built by hand from the Paris Musées web site.

I have established several methods for finding links between these museums and Wikidata entities representing them.

The simplest gives 15 links. It searches for a link from the museum name using the search feature provided by Wikidata’s WDQS service. My method does not retain any result if the search yields more than one, since there is then uncertainty about the correct answer. Thus ‘The Catacombs’ is not found by this method because the WDQS search yields two results.

By adding as criterion the city of the museum, Paris, and the fact that it is in France I get 8 entities.

Adding to the simple method the criterion that the entity must be a museum, I get 14 entities. Two entities are missing. The ‘Palais Galliera’ is an instance of ‘palace’, which is not a type derived from ‘museum’. The ‘Petit Palais’ is of type ‘museum building’, which derives from ‘building’ but not from ‘museum’.

I get 12 entities adding the fact that the museum is in France.

If I combine the museum and city criteria with the simple method, I get 8 entities.

This gives us 5 methods.

Entities without wikidata link

Every denomination of the Paris Musées museums obtains a Wikidata link by at least one of the proposed methods.

Entity with non-homogeneous wikidata links

Only one denomination, Musée Zadkine, obtained different links depending on the method used. There is indeed a Zadkine museum in Arques and one in Paris. The methods return one or the other of the museums. The geographical criterion makes it possible to obtain the right link.

We must now verify that the answers selected are accurate for all cases.

Check of the 16 entities obtained for the 16 denominations


The evaluation is as follows:

  • results found: 16
  • desired results: 16
  • exact results: 16
  • inaccurate results: 0

Which in terms of precision and recall, gives us:

precision = number of exact results / number of results found = 16/16 = 100%

recall = number of exact links found / number of findable links = 16/16 = 100%

And so an F-Measure of:

f-measure = 2 * 1 * 1 / (1 + 1) = 1 = 100%
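The evaluation above can be written out as a small helper (a minimal sketch using the counts from this post):

```python
def f_measure(exact, found, findable):
    """Precision, recall and F-measure computed from raw counts."""
    precision = exact / found
    recall = exact / findable
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# 16 exact results out of 16 found, 16 findable links.
p, r, f = f_measure(exact=16, found=16, findable=16)
print(p, r, f)  # 1.0 1.0 1.0
```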

On this small series of denominations, the proposed methods prove completely satisfactory. We will soon publish the results of these methods on other datasets and technical details on the methods used.

Posted in Public data, Semantic taging, SPARQL | Leave a comment

Find the Wikidata element corresponding to an entity we know in DBPedia

Suppose we are interested in an entity in DBPedia, for example: (associated web page: )

which describes the Carnavalet Museum.

We want to automatically find a possible entity in Wikidata describing the same entity.

In DBPedia, an entity is always associated with the Wikipedia page that was used to generate it. This page is designated by the value of the property that gives the provenance information of the entity.

Here, the value is:

By following this link, we can retrieve the corresponding Wikipedia page (and, by the way, all the original text). What interests us is the tag with the id “t-wikibase”: if present, it is a link to the Wikidata entity corresponding to the page. This link is in the href attribute of the <a> tag contained in the element with id “t-wikibase”.

<li id="t-wikibase">
    <a href=""
       title="Link to the repository item of the connected data [g]"
       accesskey="g">Wikidata item</a>
</li>
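Extracting that link can be sketched with the standard library’s HTML parser; the page fetch is omitted here, and both the sample snippet and the Wikidata id Q12345 in it are hypothetical placeholders, not the real values for this page:

```python
from html.parser import HTMLParser

class WikibaseLinkParser(HTMLParser):
    """Collects the href of the first <a> inside the element with id 't-wikibase'."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "t-wikibase":
            self.in_target = True
        elif self.in_target and tag == "a" and self.href is None:
            self.href = attrs.get("href")

# Hypothetical snippet standing in for the fetched Wikipedia page.
sample = ('<li id="t-wikibase">'
          '<a href="https://www.wikidata.org/wiki/Special:EntityPage/Q12345"'
          ' accesskey="g">Wikidata item</a></li>')

parser = WikibaseLinkParser()
parser.feed(sample)
print(parser.href)
```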


Good: we found the Wikidata entity we were looking for:

which redirects to the entity:


Posted in Data&Musée | Leave a comment

Statistics about DBPedia-fr

I needed the number of distinct entities described in dbpedia-fr. We will see a problem to consider when using linked data through public endpoints.

My first attempt was to get this information with the query

select count(distinct ?r) where { ?r ?p ?l }

but it always failed on a timeout.

While if I just counted the triples with

select count(?r) where { ?r ?p ?l }

I got


Nandana Mihindukulasooriya pointed me to Loupe for statistics on datasets, but I wanted to be able to get the information directly and programmatically by querying the source I was using: here DBPedia-Fr, but potentially with a method applicable to other datasets.

Hugh Williams told me that my query was heavily resource-intensive (in the implementation of Virtuoso, which hosts DBPedia and which seems to go through the construction of a very large hash table). He informed me that DBPedia (English) provides an up-to-date description of its dataset following the principles of the VoID technical note: VoID for DBPedia; this avoids processing heavy queries repeatedly to obtain this information. But the French version does not offer this at the time of writing this post.

John Walker offered to rewrite my query as follows:

select (count(?s) as ?c) where { select distinct ?s where { ?s ?p []} }

It seems that this formulation consumes fewer resources, and I get the expected result:


Why is this formulation more efficient than the other? It seems semantically equivalent to me; one would have to understand the details of the implementation to see which formulation is preferable. On this subject, there is an entire chapter on query optimization in Bob DuCharme’s very good book on SPARQL.

Alasdair J G Gray pointed me to the report Dataset Descriptions: HCLS Community Profile, and in particular section 6.6, where examples of statistics to obtain on datasets are presented, together with the SPARQL queries that make it possible to obtain them.

The results below correspond to the application of these queries to DBPedia-Fr.

Description | Query | Result
Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 185404575
Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | 6015375
Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | time out
Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { ?s ?p ?o } | 20321
Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctObjects) { ?s ?p ?o FILTER(!isLiteral(?o)) } | time out
Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { ?s a ?o } | 442
Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o FILTER(isLiteral(?o)) } | time out
Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 14

The same queries applied to DBPedia give:

Description | Query | Result | VoID data
Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 438038746 | 438038866
Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | none |
Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | none | 33996245
Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { [] ?p ?o } | 64119 | 64119
Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctObjects) { ?s ?p ?o FILTER(!isLiteral(?o)) } | none | 164615518 (?)
Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { [] a ?o } | 370680 | 370680
Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o FILTER(isLiteral(?o)) } | 29883287 |
Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 19 |
Number of distinct objects | | | 164615518

To obtain some results, the time allowed for the query had to be increased; for others, the query proposed in the document above was slightly modified. DBPedia does not ‘time out’ but returns an empty response when it was unable to process the query (see the discussion here). The DBPedia maintainers suggest not treating their server as a production server, and installing a copy if more guaranteed results are wanted; indeed, they have to handle a large mass of users and cannot offer a total guarantee of service. As a result, some queries do not get a result, even with an increased processing time.

Nevertheless, DBPedia makes much more intensive use of classes than DBPedia-Fr. I will return to the use of classes in a future post.

We have seen here some workarounds for problems in the processing time of certain queries; but these workarounds do not always work, even on queries that seem simple and of basic use (counting certain types of elements in a dataset).

Update 20/4/2018

Kingsley Idehen pointed me to

and more precisely to the section “Virtuoso Anytime Query”, which shows how DBPedia adds a header to the response to warn that the results are incomplete when the timeout is reached.


Posted in Non classé | Leave a comment

Where Telecom ParisTech publishes regularly: technical viewpoint

In the article “Where Telecom ParisTech publishes regularly“, I showed an example of use of the semantic representation of our bibliography: a graph showing the conference series mainly used by Telecom ParisTech researchers to publish scientific results.

I will give here some technical elements which made it possible to obtain this result. To lighten the notations, prefixes are used; they are explained at the end of this article.

For starters, a URI has been assigned to each publication. They are of the form:

where NNNNN is a unique number assigned to a publication in the bibliographic database of Télécom ParisTech. For example:

Each conference article is associated with a conference by the predicate <> that we have defined for our own purposes.

For example, the previous article is associated with a conference by the triple:

tpt:13187  sb:inConf conf:ISM2012

Each conference is associated with a conference series using the predicate


defined by the Springer publisher for the endpoint to its data graph (where possible, predicates are chosen among useful predicates already used by other data sets).

conf:ISM2012 ns0:hasSeries conf:ism

Thus the following query gives the number of publications per conference series, sorted by decreasing value:

SELECT ?source (count(?paper) as ?count) where {
  graph ?g {
    ?paper sb:inConf ?conf .
    ?conf ns0:hasSeries ?urisource .
    ?urisource dc:label ?source } }
group by ?source
order by desc(?count)

By requesting the response in TSV format, we can directly feed it into a chart defined with the D3.js JavaScript graphics library. The code used for our graphical representation is a very simple variant of the example presented here.
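Reading such a TSV response can be sketched with the standard library; the rows below are made-up sample data standing in for what the SPARQL endpoint would return (SPARQL TSV results keep the leading `?` in the header names):

```python
import csv
import io

# Hypothetical TSV response from the SPARQL endpoint above.
tsv_response = "?source\t?count\nICASSP\t30\nISM\t12\n"

reader = csv.DictReader(io.StringIO(tsv_response), delimiter="\t")
counts = {row["?source"]: int(row["?count"]) for row in reader}
print(counts)  # {'ICASSP': 30, 'ISM': 12}
```

The resulting pairs can then be handed to the charting code as-is.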

Additional note: prefixes used above

prefix conf: <>
prefix tpt: <>
prefix sb: <>
prefix ns0: <>


Posted in Non classé | Leave a comment

Where Telecom ParisTech publishes regularly

(click to magnify)

Our bibliographic database does not make it easy to highlight conferences and journals where we publish often. The semantic approach that we started with the SemBib project provides answers.

As part of the SemBib project, semi-automatic work was done to identify the entities that are significant from this point of view. In particular, this involved listing the scientific conferences where we publish. In a future post we will present the method for deduplicating references and making them consistent.

Let’s illustrate the problem: the ICASSP conference is present in the database with 30 different titles for more than 80 publications.

We published 2190 articles in 1340 conferences from November 2010 to November 2015. Above, a chart of the series of conferences where we have published the most.



Posted in Non classé | Leave a comment

Using NLTK on Heroku with Python

On the principle of the “Extract PDF text with Python” post, I will create a service that uses the NLTK package. NLTK is a set of tools for building language-processing programs in Python. It therefore requires using Python.

Basic phases

I summarize the steps detailed in the post mentioned above:

  • create a folder for this service
  • create a virtual environment for local development
  • create a requirements.txt file with the list of dependencies (including nltk, see below)
  • create a folder nltk_service
  • in this folder, create two files: and (empty for now)
  • start a local server (replacing pdf_service with nltk_service in
  • create a git repository
  • add project files to the local git repository
  • update local repository
  • connect to Heroku
  • create a new service on Heroku
  • push local development towards heroku
  • launch an instance of the service
  • test online

There, an empty service was created. We will complement it with an implementation of one of the possibilities offered by NLTK.

A simple service with NLTK

I want to create a service that finds the words and other elements that make up a text, and then eliminates the “stopwords” – those words that do not contribute significantly to certain analyzes of the content of a text, such as ‘the’, ‘a’, ‘to’ in English- and finally, counts the number of words in the text and the number of occurrences of each word retained.

To eliminate stopwords (as well as for many other treatments) NLTK uses data files distributed under the name ‘NLTK Data‘. Files listing the stopwords of a set of languages are available. But Heroku is not very suitable for the use of large volumes of static files; Heroku is made to host and run programs.

Hosting static files for Heroku

In the article “Using AWS S3 to Store Static Assets and File Uploads“, it is suggested that good scalability can be achieved by deploying the static files used by an application on Amazon S3. Indeed, files hosted on Heroku may have to be downloaded again every time an application is put to sleep and then reloaded for a new use. With a large volume of static files, this can penalize the response times of an application. The file system offered by Heroku is said to be ‘ephemeral’.

Normally, NLTK-related data are downloaded with an interactive program that lets you select what to download and ensures that the loaded data will be found by the NLTK tools. In the case of Heroku, we cannot do this, and must instead use a command-line download.

This is the first method I explored. The first idea is to load the data locally and then push it to Heroku, but this would burden the git repository used in our exchanges with Heroku with all the static data of nltk_data. A solution is available here: it is the one I adopted as a first approach. A test with all the nltk_data data fails. With just the stopwords corpus (python -m nltk.downloader stopwords), the wordnet corpus (python -m nltk.downloader wordnet) and the punkt tokenizer (python -m nltk.downloader punkt), the deployment runs smoothly.

Another idea is to use AWS Simple Storage Service, Amazon’s cloud storage service. I will explore this possibility in a future post.

The service

from flask import request, abort
from flask.ext import restful
from flask.ext.restful import reqparse
from nltk_service import app
import urllib2
import io
import os
import collections
import json
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))
stops = english_stops.union(french_stops)

def words_counting(wordslist):
    wordscounter = {}
    wordscounter["cnt"] = collections.Counter()
    wordscounter["wordscount"] = 0
    for w in wordslist:
        word = w.lower()
        if word not in stops:
            syn = wordnet.synsets(word)
            if syn:
                lemmas = syn[0].lemmas()
                res = lemmas[0].name()
                wordscounter["cnt"][res] += 1
                wordscounter["wordscount"] += 1
    wordscounter["cnt"] = wordscounter["cnt"].most_common()
    return wordscounter

def filewords(path):
    text = urllib2.urlopen(path).read().decode('utf-8')
    wordslist = word_tokenize(text)
    jsonwords = words_counting(wordslist)
    return json.dumps(jsonwords)

@app.route('/words/<path:url>', methods=['GET'])
def get_words(url):
    return filewords(url)

Calling <mydomain>/words/<url of a text file> returns a JSON structure with an array of words associated with their numbers of occurrences (in the cnt array) and the total number of words taken into account (in wordscount); this will make it easy to aggregate the results of several files and to compute word occurrence frequencies.
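A simplified, self-contained version of the counting logic illustrates the returned JSON structure; it omits the WordNet lemmatization step of the service above, and the three-word stopword list here is a stand-in for the NLTK stopword files:

```python
import collections
import json

stops = {"the", "a", "to"}  # stand-in for the NLTK stopword lists

def words_counting(wordslist):
    """Count non-stopword tokens; mirrors the JSON shape of the service."""
    cnt = collections.Counter(
        w.lower() for w in wordslist if w.lower() not in stops)
    return {"cnt": cnt.most_common(), "wordscount": sum(cnt.values())}

print(json.dumps(words_counting("The cat saw a cat".split())))
```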


The use of NLTK with Heroku is validated. We now need to define the structure(s) (a priori JSON) that will allow us to progressively transport and enrich the data associated with a source or a set of sources.

Posted in NLP, Tutorial | Leave a comment

Services for bibliographic analysis

I present here the needs related to our approach of analysis of the production and publication of scientific documents – essentially articles – by Telecom ParisTech. It is the goal of the SemBib project.

The articles

Télécom ParisTech has a bibliographic database that contains the bulk of our publications. For each publication, we have the title, the authors’ names, the date of publication and some other metadata. Sometimes there is a link to the publication. This link is not always filled in and, when it is, it can take different forms:

  • it is sometimes a link to a direct online access to a digital version of the article, usually in PDF format
  • it is sometimes a link to a web page that contains a link to access, often costly, a digital version of the article
  • sometimes it is a dead link

Sometimes the links to digital documents cannot be exploited by a robot to build a collection of all the documents. We will see in a future post the solutions implemented to recover the highest possible proportion of documents.

An evaluation over five years gives about 4000 publications, of which one quarter have a link, which can be automatically and easily exploited in a little more than half the cases. On some documents we have a limitation on publication rights which prevents us from putting them online. This has a significant impact on the solutions we will be able to exploit.


As of the date of publication of this post, we have 420 documents requiring 180 MB for source documents. If we can recover the 4000 documents, we will need about 2 GB for the source documents.
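The 2 GB figure follows from a simple extrapolation of the current collection (a back-of-the-envelope check):

```python
docs_now, size_now_mb = 420, 180   # current collection
docs_target = 4000                 # expected full collection

estimate_mb = size_now_mb / docs_now * docs_target
print(round(estimate_mb / 1024, 1), "GB")  # roughly 1.7 GB, i.e. about 2 GB
```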

We plan to store intermediate treatment results; for example, we will store

  • the plain text extracts from PDF documents
  • dictionary of words associated with a document
  • metadata associated with a document, resulting from different stages of processing

Several of us are working on this project and we plan to involve students for whom this type of work can lead to very formative projects. We must therefore make our sources and results accessible via the network.

I have access to unlimited online storage via a hosting solution which, on the other hand, only supports developments in PHP. It is this hosting that I will use.


A very good tool for applying analysis treatments to texts is NLTK. NLTK is written in Python, so the processing cannot be done on the hosting that serves our sources.

For the processing of semantic representations, we use on the one hand tools written in Java based on Apache Jena / Fuseki, and on the other hand a Virtuoso server accessible only on the internal network of Telecom ParisTech.

The need to host services written in different languages, which none of our direct hosting solutions provides, leads us to adopt a distributed solution based on web services. We will see in a next post the steps taken to create a service in Python, using NLTK and hosted on Heroku.

Algorithmic starting point

We started from the article “Using Linked Data Traversal to Label Academic Communities” by Tiddi, d’Aquin and Motta (SAVE-SD 2015). However, we have completed or modified the proposed procedure. For example, we need to consider publications in several languages (at least French and English); we also decided to rely on Wordnet when it made sense. We have undertaken the implementation of the following steps:

  • development of a list of researchers in the school with links to departments and research teams
  • retrieval of the list of publications for the last 5 years; the bibliographic database gives us a result in BIBTEX format that we translated into a JSON structure; some thought should be given to making the solution more modular, for example by retrieving the database year by year and taking into account the need for updates, since some publications are declared late in the database; moreover, the data obtained present a series of defects handled in our first working phase,
  • for each reference, attempt to retrieve a digital version of the document
  • for each retrieved reference, extract the raw text from the document (see
  • for each reference recovered, to gather metadata (authors, cited references …) either from the bibtex above, or by analysis of the document,
  • for each retrieved reference, identify the language of the document
  • pass each text in lowercase
  • eliminate the empty words in each text as well as the numbers and the punctuations
  • stemmatize and / or lemmatize the words of each text
  • replace each root resulting from the stemmatization by a word that shares this root (eg the shortest word)
  • filter the list of retained words (blacklist, minimum length …)
  • search for a mapping of each word to a concept of the semantic web / LOD, for example by referring to DBPedia, Wordnet, the reference vocabularies of IEEE and ACM…; we will need to evaluate the number of unmapped terms and look for a solution for these terms
  • evaluate the number of words per article, by corpus (ex: Telecom ParisTech corpus for a given interval of years, ditto for a department, for an author)
  • construct the TfIdf  matrix of the corpus; methods are implemented which facilitate the updating of this matrix, for example when adding a new article
  • reduce this matrix by eliminating words present in a high proportion of documents (application of a threshold, for example 25%); these words being considered as weakly discriminating; they can be kept as potentially representative of the overall production of Telecom ParisTech
  • reduce this matrix by eliminating the words present too few times on the corpus; these words being considered as structuring for the corpus
  • application of a latent semantic search (LSA)
  • clustering of the remaining words either by clustering methods applied to the matrix or by grouping the words on structural criteria of Telecom ParisTech (departments, teams, projects, etc.)

These steps should provide us with the baseline data for our analyzes. In future blog posts, we will study some of these steps in more detail. We will also propose approaches based on semantic web technologies.

Posted in NLP, SemBib | Leave a comment

Extract PDF text with Python

As part of our SemBib project to analyze the scientific production of Telecom ParisTech, I recover a lot of PDF files. To analyze the content, I need to get the raw text. In addition, as indicated in the blog Services for bibliographic analysis, I made the choice to implement our developments by web services.

I will show here how I developed a REST service for raw-text extraction from PDF with Python, and how I deploy it on Heroku (note: I develop under Windows 10).

Note: I am freshly converted to Python; my code is not exemplary and deserves to be cleaned up / improved, but it is enough for me as a proof of feasibility.

Create a folder for my service

Create a virtual environment for local development with the command

virtualenv venv

Create a requirements.txt file with the following content


The file indicates the pdfminer package for processing pdf files.

(see for generic explanations on the dependencies of the project)

In the folder that contains the requirements.txt file, create a pdf_service folder. In this folder, create two files: and . The first one initializes the Flask application:

import os
from flask import Flask, request, jsonify
from flask_restful import Resource, Api
from flask import make_response

app = Flask(__name__)

import pdf_service.resources

And file

import json
from flask import request, abort
from flask.ext import restful
from flask.ext.restful import reqparse
from pdf_service import app
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import urllib2
from urllib2 import Request
from StringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    # read the remote PDF into an in-memory file
    data = urllib2.urlopen(Request(path)).read()
    memoryFile = StringIO(data)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(memoryFile, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

@app.route('/', methods=['GET'])
def get_root():
    return "index"

@app.route('/pdftotxt/<path:url>', methods=['GET'])
def get_url(url):
    return convert_pdf_to_txt(url)

To start a local server

For tests, I created the file

from pdf_service import app

launched with the command


To create an empty git repository

git init

To add local files to the git

git add .

Be careful to create a .gitignore by specifying the files and directories to ignore so as not to overload the communication

To commit to the local repository, which will serve as the reference for exchanges with Heroku

git commit -m "my commit message"

To connect to Heroku

heroku login

then enter your account information

To destroy a service on Heroku

For example, to make room, remove tests that have become useless …

heroku apps:destroy --app <name of the service to destroy>

To create the service on heroku

heroku create

To push local development towards heroku

git push heroku master

(by default, this triggers a Python 2.7 deployment, with the dependencies described in requirements.txt)
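For the service to actually start, Heroku also reads a Procfile at the root of the repository, which declares how to launch the web process. A minimal sketch, assuming the application is served with gunicorn (gunicorn is not mentioned in this post; it would then also have to be listed in requirements.txt):

```text
web: gunicorn pdf_service:app
```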

To start an instance of the service

heroku ps:scale web=1

To see what runs

heroku ps

Rename the service

The following command changes the name of the service (from the one generated automatically by the Heroku tools) and the associated git remote; it must be run in the folder where the service was created:

heroku apps:rename <new name>


Posted in NLP, SemBib, Tutorial | Leave a comment

First contact with the tools of the Bibliographic Agency of Higher Education

As part of the SemBib project, I was led to choose a unique identifier for each author. Following my usual strategy, I started by using identifiers defined in our namespace, with our prefix. Thus, it was possible to produce results quickly and to encourage others to participate in the project.

I apply this method to the elements I need immediately, here the authors; I try to minimize the identifiers and vocabularies that I define ad hoc and to use the most common and well-known vocabularies, but in an agile approach I do not want to block first application results with a long search through all the pre-existing vocabularies to which I could bind.

In a second step, I look for vocabularies or identifiers to create links with other datasets. The basic principle is to create new versions of my data, with backward compatibility if the data has been published, especially through the use of owl:sameAs. I intend to consolidate this strategy in the coming months, and I am open to advice.

By chance, I found that IdRef identifies me with a permanent link, so I went to see what it is. The IdRef home page gives little information. The subtitle of IdRef is 'The repository of the Sudoc authorities', which is not very clear, unless we know that SuDoc is 'The catalog of the University System of Documentation'. The bottom of the page contains a header 'ABES - Bibliographical Agency of Higher Education', which suggests a more direct link with the SemBib project and prompted me to dig deeper.

In fact, the online documentation taught me my idref id. I can obtain an XML representation of the contents of my bibliographic record as held by the ABES (note that it is very incomplete), and likewise the record in JSON.

In principle, a query should also find my VIAF identifier, but it cannot.

Find a researcher’s identifier

I wondered if I would be able to associate an idref identifier with all Telecom ParisTech researchers. As there are about 200 researchers and as many PhD students at Telecom ParisTech, I would like to automate this.

Unfortunately, on the day I started my tests (16/1/2017), the examples in section 2.3 of the documentation did not work. So I sent a message to the email address listed in the documentation. Very quickly I had an answer, with several proposals (thank you F.M., whose answer is largely reproduced below). I will limit myself here to describing the publicly accessible solutions.

First method

The first is to query IdRef's Solr search engine with a person's surname and first name. Example of query parameters: q=persname_t:(Moissinac AND Jean-Claude)&fl=ppn_z,affcourt_z&wt=xml

The q parameter contains the search that Solr will perform. Here, we make a 'contains the words' search, indicated by the _t suffix, on the persname (person name) field, followed by a colon and the list of words to search for. If the searched words are found, as in the example above, we obtain a response such as:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">ppn_z</str>
      <str name="q">persname_t:(Moissinac AND Jean-Claude)</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="ppn_z">157248550</str>
    </doc>
  </result>
</response>

By replacing wt=xml at the end of the query with wt=json, we get a response formatted in JSON (as of 27/1/2017, not served with the right MIME type).
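Such a query URL can be assembled programmatically; a minimal sketch in Python, where the endpoint is a placeholder since the base URL of the service is not reproduced here:

```python
from urllib.parse import urlencode

def idref_solr_query(last_name, first_name, endpoint="https://example.org/Sru/Solr"):
    # Build a Solr query URL on the persname field; the endpoint is a placeholder
    params = {
        "q": "persname_t:(%s AND %s)" % (last_name, first_name),
        "fl": "ppn_z,affcourt_z",  # fields to return
        "wt": "json",              # response format
    }
    return endpoint + "?" + urlencode(params)

url = idref_solr_query("Moissinac", "Jean-Claude")
```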

The method runs the risk of getting nothing if the IdRef database does not contain exactly the strings you are looking for, or of retrieving too many things if you broaden the search too much. For example, a search limited to the word Moissinac gives 9 answers that have to be discriminated, for example by retrieving the bibliographic records (cf. above) and eliminating the inappropriate ones. Among the 9 responses for 'Moissinac', the first one is id 056874022, associated with a record where the field "name" has the value "Moissinac, Bernard". One can, for example, compare the different "name" fields against the reference string "Moissinac, Jean-Claude" with the Levenshtein distance. This should suffice to discriminate most cases properly. One can also apply a specific check to all the names for which zero or several answers are obtained (assuming that when there is only one answer, it is the right one). We will later consider automated testing of other fields in the record.
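The Levenshtein comparison suggested above can be sketched as follows (a standard dynamic-programming implementation; the helper names are mine):

```python
def levenshtein(a, b):
    # Edit distance: minimum number of insertions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# keep the candidate whose "name" field is closest to the reference string
candidates = ["Moissinac, Bernard", "Moissinac, Jean-Claude"]
best = min(candidates, key=lambda n: levenshtein(n, "Moissinac, Jean-Claude"))
```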

Second method

The second method consists in querying another Solr engine with a surname / first name and an additional constraint linking this person with Telecom ParisTech (= "Paris, ENST", idref id 026375273).

You can get an output in xml or json format.

The people sought must have been involved in a thesis, and are not necessarily associated with "Paris, ENST" in the database: the constraint is strong. Moreover, the engine does not seem to know our different names (Telecom Paris, Telecom ParisTech, …), or in any case does not identify them as various denominations of the same organization. The ideal would be to find an idref identifier of our institution and to use it in the search criteria. We will not deal with that today.

Rely on VIAF

The article contains ideas: instead of using IDREF, rely on ORCID or VIAF, with which IDREF has exchange agreements.

A tour of the ORCID APIs and especially the VIAF API gives me a query that allows me to find my VIAF identifier and many others (SUDOC/IDREF in particular).

I only have to repeat this query for all the names of Telecom ParisTech researchers, hoping that there will not be too many ambiguities (which translates to a numberOfRecords field greater than 1) or missing entries (which translates to a numberOfRecords field equal to 0).
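The triage by numberOfRecords can be sketched as a tiny helper (the function and the sample counts are mine, not an actual VIAF API call):

```python
def triage(number_of_records):
    # 0 hits = identifier missing, 1 hit = assumed correct, more = to be disambiguated
    if number_of_records == 0:
        return "missing"
    if number_of_records == 1:
        return "match"
    return "ambiguous"

# hypothetical result counts for three searched names
results = {"Dupont": 3, "Moissinac": 1, "Unknown": 0}
status = {name: triage(n) for name, n in results.items()}
```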

Next: SPARQL access

In a future post, we will explore the SPARQL access point of the ABES.

Posted in Public data, Semantic taging, SemBib | 1 Comment

Unique Identifiers of Researchers versus Unicity of Identifiers of Researchers

As mentioned in the article “First contact with the tools of the ABES“, for the SemBib project, I started by using my own identifiers for the researchers. Then, I wanted to use identifiers coming from reference sources, starting with the identifiers IDREF of the ABES.

I had put my finger into the gears.

Given the difficulties encountered in recovering the IDREF identifiers from all Telecom ParisTech researchers, and having seen that ABES has agreements with VIAF, I looked for what I could do with VIAF. VIAF is a ‘virtual international authority file’ created by a set of national libraries. It manages a set of unique identifiers of people. We have seen in the post above how I retrieved from VIAF information about Telecom ParisTech researchers.

By analyzing the data, I was able to see that the VIAF data came from a variety of sources; step by step, from these sources, I found identifiers of people from:

BNF: the Bibliothèque Nationale de France assigns identifiers to authors and especially authors of scientific publications,
ARK: an identification system also used by the BNF,
SUDOC: it is a catalog produced by the ABES, which manages IDREF identifiers,
ISNI: for “International Standard Name Identifier”, defined by an ISO standard, also used, inter alia, by the BNF (see ISNI and BNF)
ORCID: these identifiers concern authors and contributors in the fields of higher education and research; there are imperfect links between ISNI and ORCID (see Relationship between ORCID and ISNI);
RERO: appears to be defined by the Network of the Libraries of Western Switzerland;
LC: identifiers used by the Library of Congress;
KRNLK: the Linked Open Data access point of the National Library of Korea, which includes SPARQL access;
ICCU: used by the Central Institute for a Unified Catalog of Italian Libraries (ICCU)
LNB: identifiers in the National Library of Latvia
NKC: identifiers of the Czech National Library
NLI: identifiers of the National Library of Israel
NLP: identifiers of the National Library of Poland
NSK: identifiers of the university library of Zagreb;
NUKAT: comes from the NUKAT Center of the University of Warsaw
SELIBR: used by LIBRIS, a search service that provides information on titles held by Swedish university and research libraries;
WKD: concerns the identifiers used by WikiData;
BLSA: probably originated from the British Library
NTA: identifiers of the Royal Library of the Netherlands
I probably forgot some others…

These identifiers all designate a researcher uniquely in an identification system. But, as we have just seen, there can be many identifiers for the same researcher; not all researchers have all the identifiers, but they often have several.

For example, Antonio Casilli is identified at least by:

BNF, ARK, ISNI, VIAF, LC, SUDOC, ORCID, DNB (1012066622), NUKAT (n 2016165182)

The researchers thus have several more or less equivalent identifiers that can be useful to know in a Linked Open Data approach: if one wants to be able to link the data on a researcher, one must already be able to link their unique identifiers! I will come back in a future post about the decentralized solution that I propose.
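Once the equivalent identifiers are known, linking them can be expressed with owl:sameAs. A minimal sketch emitting N-Triples from a list of URIs assumed to denote the same researcher (the example URIs are hypothetical):

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triples(uris):
    # Link the first URI to every other one with owl:sameAs, as N-Triples lines
    canonical = uris[0]
    return ["<%s> <%s> <%s> ." % (canonical, OWL_SAME_AS, u) for u in uris[1:]]

triples = same_as_triples([
    "http://example.org/person/casilli",  # our own identifier (hypothetical)
    "http://viaf.org/viaf/0000000",       # hypothetical VIAF URI
])
```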

Note: it reminds me of the joke about video standards: "there are N different standards, which is too many; to finish with it, I will make a single format that brings together the best of each standard"; after such work, we do not have 1 standard but N+1 standards …

Posted in Semantic taging, SemBib | Leave a comment

Getting Started with the SPARQL Access Point of the Publisher Springer

As part of the SemBib project, I will discover with you the public SPARQL access point of the scientific publisher Springer. For a first contact, we must get acquainted, and some classic queries will help us.

First, discover the properties used:

PREFIX rdf: <> 
PREFIX spr: <> 
PREFIX rdfs: <>
PREFIX dc: <>
select distinct ?p ?label where {
?s ?p ?o .
optional { ?p rdfs:label ?label }
} limit 100

which takes about 23 seconds (I cheated somewhat: I executed it a first time to see the URIs used, in order to deduce the prefixes to define and make the result below more compact).

It gives:

| p                           | label                        |
| rdf:type                    |                              |
| dc:creator                  |                              |
| dc:date                     |                              |
| dc:description              |                              |
| dc:publisher                |                              |
| dc:rights                   |                              |
| rdfs:label                  |                              |
| rdfs:domain                 |                              |
| rdfs:range                  |                              |
| rdfs:subPropertyOf          |                              |
| spr:hasDBLPID               | "has DBLP ID"@en             |
| spr:confSeriesName          | "conference series name"@en  |
| spr:confYear                | "conference year"@en         |
| spr:confAcronym             | "conference acronym"@en      |
| spr:confName                | "conference name"@en         |
| spr:confCity                | "conference city"@en         |
| spr:confCountry             | "conference country"@en      |
| spr:confStartDate           | "conference start date"@en   |
| spr:confEndDate             | "conference end date"@en     |
| spr:hasSeries               | "ConferenceSeries"@en        |
| spr:volumeNumber            | "Volume number"@en           |
| spr:title                   | "Title"@en                   |
| spr:subtitle                | "Subtitle"@en                |
| spr:ISBN                    | "ISBN"@en                    |
| spr:EISBN                   | "eISBN"@en                   |
| spr:bookSeriesAcronym       | "book series acronym"@en     |
| spr:hasConference           | "Conference"@en              |
| spr:bookDOI                 | "DOI"@en                     |
| spr:isIndexedByScopus       | "Is indexed by Scopus"@en    |
| spr:scopusSearchDate        | "Scopus search date"@en      |
| spr:isAvailableAt           | "Available at"@en            |
| spr:isIndexedByCompendex    | "Is indexed by Compendex"@en |
| spr:compendexSearchDate     | "Compendex search date"@en   |
| spr:confNumber              | "conference number"@en       |
| spr:copyrightYear           | "Copyright year"@en          |
| spr:firstPage               | "First page"@en              |
| spr:lastPage                | "Last page"@en               |
| spr:chapterRegistrationDate | "Registration date"@en       |
| spr:chapterOf               | "Book"@en                    |
| spr:chapterOnlineDate       | "Online date"@en             |
| spr:copyrightHolder         | "Copyright Holder"@en        |
| spr:metadataRights          | "Metadata Rights"@en         |
| spr:abstractRights          | "Abstract Rights"@en         |
| spr:bibliographyRights      | "Bibliography Rights"@en     |
| spr:bodyHtmlRights          | "Body HTML Rights"@en        |
| spr:bodyPdfRights           | "Body PDF Rights"@en         |
| spr:esmRights               | "ESM Rights"@en              |

On the one hand, this shows us that performance is not exceptional. On the other hand, we see that for the most part Springer defined its own ontology: the data are accessible, but not really connected to the rest of the world through shared concepts. The data are described with 47 predicates (properties).
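Queries like these can also be submitted programmatically over HTTP. A minimal sketch with Python's standard library; the endpoint URL below is a placeholder, and the query/format parameter names assume a Virtuoso-style endpoint:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://example.org/sparql"  # placeholder: the real endpoint URL is not reproduced here

def sparql_url(query, endpoint=ENDPOINT):
    # Build a GET URL asking for results in the SPARQL-results JSON format
    return endpoint + "?" + urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })

def sparql_select(query, endpoint=ENDPOINT):
    # Run the query and return the list of result bindings
    with urlopen(sparql_url(query, endpoint)) as resp:
        return json.load(resp)["results"]["bindings"]
```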

The following query (without repeating the prefixes defined above) gives us the number of distinct 'subjects' present in the database: 451277.


select (count(distinct ?s) as ?size) where {
?s ?p ?o .
}

and the following one gives the number of triples describing these subjects: 3490865, that is to say about 8 predicates per distinct subject, which is not much to give great detail about bibliographic references. It can be assumed that there will be little data on each reference.

select (count(?s) as ?size) where {
?s ?p ?o .
}
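The "about 8 predicates per subject" figure quoted above is just the ratio of the two counts:

```python
total_triples = 3490865      # number of triples describing the subjects
distinct_subjects = 451277   # number of distinct subjects

# average number of predicates (triples) per subject: roughly 7.7
ratio = total_triples / distinct_subjects
```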

The relatively small number of predicates per subject prompts me to look for the most frequently used ones:

select ?p (count(?p) as ?freq) where {
?s ?p ?o .
}
group by ?p
order by desc(?freq)

giving (by removing the least used, essentially relating to rights issues):

| p                           | freq   |
| rdf:type                    | 451316 |
| spr:bookDOI                 | 441266 |
| spr:title                   | 439864 |
| spr:chapterOf               | 381656 |
| spr:firstPage               | 381646 |
| spr:lastPage                | 381646 |
| spr:chapterRegistrationDate | 245143 |
| spr:chapterOnlineDate       | 188209 |
| spr:EISBN                   | 59611  |
| spr:isIndexedByScopus       | 59370  |
| spr:scopusSearchDate        | 59370  |
| spr:ISBN                    | 59101  |
| spr:isAvailableAt           | 59101  |
| spr:copyrightYear           | 55964  |
| spr:subtitle                | 40321  |
| spr:volumeNumber            | 34988  |
| spr:bookSeriesAcronym       | 17665  |
| spr:compendexSearchDate     | 11400  |
| spr:isIndexedByCompendex    | 11400  |
| spr:hasConference           | 9509   |
| spr:confCity                | 8487   |
| spr:confCountry             | 8487   |
| spr:confName                | 8487   |
| spr:hasSeries               | 8487   |
| spr:confEndDate             | 8479   |
| spr:confStartDate           | 8479   |
| spr:confYear                | 8479   |
| spr:confAcronym             | 8233   |
| spr:confNumber              | 8021   |

We see that most of the information available on an element of the database consists of: its type, its DOI, its title, the book of which it is a chapter, and its first and last pages. Other information relates in particular to the conferences from which the documents may originate.

Predicates with domain and range

We see that the domain property (which gives us the category of objects to which the predicate applies) and the range property (which gives us the category of possible values for the predicate) seem to be filled in for some predicates.

With the same small cheat as above for the prefixes, the following query gives us, in 15 seconds, the domains and ranges used:


PREFIX rdf: <> 
PREFIX spr: <>               
PREFIX rdfs: <>
PREFIX xs: <>
PREFIX sxs: <>
PREFIX dc: <>
PREFIX spc:<>                       
select distinct ?p ?domain ?range where {
?s ?p ?o .
?p rdfs:domain ?domain .
?p rdfs:range ?range .
} limit 100

The result is:

| p                           | domain                | range                |
| spr:confSeriesName          | spc:ConferenceSeries  | rdf:langString       |
| spr:confYear                | spc:Conference        | xs:date              |
| spr:confAcronym             | spc:Conference        | rdf:langString       |
| spr:confName                | spc:Conference        | rdf:langString       |
| spr:confCity                | spc:Conference        | rdf:langString       |
| spr:confCountry             | spc:Conference        | rdf:langString       |
| spr:confStartDate           | spc:Conference        | xs:date              |
| spr:confEndDate             | spc:Conference        | xs:date              |
| spr:hasSeries               | spc:Conference        | spc:ConferenceSeries |
| spr:confNumber              | spc:Conference        | xs:int               |
| spr:volumeNumber            | spc:ProceedingsVolume | rdfs:literal         |
| spr:title                   | spc:ProceedingsVolume | rdf:langString       |
| spr:subtitle                | spc:ProceedingsVolume | rdf:langString       |
| spr:ISBN                    | spc:ProceedingsVolume | rdfs:literal         |
| spr:EISBN                   | spc:ProceedingsVolume | rdfs:literal         |
| spr:bookSeriesAcronym       | spc:ProceedingsVolume | rdfs:literal         |
| spr:hasConference           | spc:ProceedingsVolume | spc:Conference       |
| spr:bookDOI                 | spc:ProceedingsVolume | rdf:langString       |
| spr:isIndexedByScopus       | spc:ProceedingsVolume | sxs:boolean          |
| spr:scopusSearchDate        | spc:ProceedingsVolume | sxs:dateTime         |
| spr:isIndexedByCompendex    | spc:ProceedingsVolume | sxs:boolean          |
| spr:compendexSearchDate     | spc:ProceedingsVolume | sxs:dateTime         |
| spr:volumeNumber            | spc:Book              | rdfs:literal         |
| spr:title                   | spc:Book              | rdf:langString       |
| spr:subtitle                | spc:Book              | rdf:langString       |
| spr:ISBN                    | spc:Book              | rdfs:literal         |
| spr:EISBN                   | spc:Book              | rdfs:literal         |
| spr:bookSeriesAcronym       | spc:Book              | rdfs:literal         |
| spr:bookDOI                 | spc:Book              | rdf:langString       |
| spr:isIndexedByScopus       | spc:Book              | sxs:boolean          |
| spr:scopusSearchDate        | spc:Book              | sxs:dateTime         |
| spr:copyrightYear           | spc:Book              | sxs:date             |
| spr:isIndexedByCompendex    | spc:Book              | sxs:boolean          |
| spr:compendexSearchDate     | spc:Book              | sxs:dateTime         |
| spr:title                   | spc:BookChapter       | rdf:langString       |
| spr:subtitle                | spc:BookChapter       | rdf:langString       |
| spr:bookDOI                 | spc:BookChapter       | rdf:langString       |
| spr:firstPage               | spc:BookChapter       | sxs:int              |
| spr:lastPage                | spc:BookChapter       | sxs:int              |
| spr:chapterRegistrationDate | spc:BookChapter       | sxs:date             |
| spr:chapterOnlineDate       | spc:BookChapter       | sxs:date             |
| spr:chapterOf               | spc:BookChapter       | spc:Book             |
| spr:copyrightYear           | spc:BookChapter       | sxs:date             |
| spr:copyrightHolder         | spc:BookChapter       | rdf:string           |
| spr:metadataRights          | spc:BookChapter       | rdf:string           |
| spr:abstractRights          | spc:BookChapter       | rdf:string           |
| spr:bibliographyRights      | spc:BookChapter       | rdf:string           |
| spr:bodyHtmlRights          | spc:BookChapter       | rdf:string           |
| spr:bodyPdfRights           | spc:BookChapter       | rdf:string           |
| spr:esmRights               | spc:BookChapter       | rdf:string           |

Exploring some predicates


I expected dc:creator to be used for author names. But dc:creator takes only one value, "Springer"@en, no doubt designating the creator of the database. No other predicate seems to carry the names of the authors.


The following query will allow us to see the distribution of the types used:

select distinct ?o (count(?o) as ?typecount) where {
?s a ?o .
}
group by ?o
order by desc(?typecount)


| o                                                | typecount |
| spc:BookChapter                                  | 381657    |
| spc:Book                                         | 50102     |
| spc:ProceedingsVolume                            | 9509      |
| spc:Conference                                   | 8487      |
| spc:ConferenceSeries                             | 1477      |
| rdf:Property                                     | 39        |
| <> | 34        |
| <>            | 5         |
| <>   | 5         |
| <>                | 2         |


This predicate probably associates a DOI with each document; by its nature, a DOI identifies a document uniquely. I am interested in the form Springer uses to record the DOI (I noted, for example, that in the Telecom ParisTech database, various forms are used).

select ?doi where {
?s spr:bookDOI ?doi .
} limit 5


| doi                            |
| "10.1007/978-3-319-09147-1"@en |
| "10.1007/978-3-319-09147-1"@en |
| "10.1007/978-3-319-10762-2"@en |
| "10.1007/978-3-319-10762-2"@en |
| "10.1007/978-3-319-07785-7"@en |

We see a homogeneous representation of the DOIs in the Springer database; I have checked this on a larger number of examples.
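Such a homogeneity check can be automated with a simple pattern test on the returned values (the regular expression below is my own approximation of the bare '10.prefix/suffix' DOI shape):

```python
import re

DOI_RE = re.compile(r'^10\.\d{4,}/\S+$')  # bare DOI: '10.', a registrant prefix, '/', a suffix

def is_plain_doi(value):
    # True when the value is a bare DOI such as 10.1007/978-3-319-09147-1
    # (no 'doi:' scheme, no resolver URL in front)
    return bool(DOI_RE.match(value))

samples = ["10.1007/978-3-319-09147-1", "10.1007/978-3-319-10762-2"]
all_plain = all(is_plain_doi(s) for s in samples)
```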

Predicates about conferences

Several predicates seem to concern conference series. I will look at how many are concerned and which conference series have had the most occurrences.

There are 1477 subjects of type spc:ConferenceSeries (cf. the most frequent types above).

The following query will give us the conference series which have given rise to the most publications by Springer:


select ?name (count(distinct ?conf) as ?c) where {
?conf a spc:Conference .
?conf spr:hasSeries ?serie .
?serie a spc:ConferenceSeries .
?serie spr:confSeriesName ?name
}
group by ?name
order by desc(?c)
limit 20


| name                                                                                            | c  |
| "International Colloquium on Automata, Languages, and Programming"@en                           | 40 |
| "International Symposium on Mathematical Foundations of Computer Science"@en                    | 38 |
| "Annual Cryptology Conference"@en                                                               | 33 |
| "Annual International Conference on the Theory and Applications of Cryptographic Techniques"@en | 33 |
| "International Conference on Applications and Theory of Petri Nets and Concurrency"@en          | 32 |
| "International Workshop on Graph-Theoretic Concepts in Computer Science"@en                     | 32 |
| "International Symposium on Distributed Computing"@en                                           | 29 |
| "International Conference on Computer Aided Verification"@en                                    | 28 |
| "International Conference on Concurrency"@en                                                    | 28 |
| "International Conference on Information Security and Cryptology"@en                            | 28 |
| "International Conference on Advanced Information Systems Engineering"@en                       | 27 |
| "International Semantic Web Conference"@en                                                      | 27 |
| "Ada-Europe International Conference on Reliable Software Technologies"@en                      | 26 |
| "European Conference on Object-Oriented Programming"@en                                         | 26 |
| "European Symposium on Programming Languages and Systems"@en                                    | 25 |
| "International Conference on Conceptual Modeling"@en                                            | 25 |
| "International Workshop on Languages and Compilers for Parallel Computing"@en                   | 25 |
| "Annual Symposium on Combinatorial Pattern Matching"@en                                         | 24 |
| "Annual Symposium on Theoretical Aspects of Computer Science"@en                                | 24 |
| "International Conference on Algorithmic Learning Theory"@en                                    | 24 |

This probably gives an overview of the themes most published by Springer.

Update frequency of the database

Scientific papers are published every month.

To get an idea of the freshness of the data available here, I made a first test with a book to which I contributed, "Multimodal Interaction with W3C Standards", whose DOI is 10.1007/978-3-319-42816-1. It is not in the database as of 3/12/2016.

Some predicates suggest date information. I will look for the most recent date in the database, using the most frequent of these predicates, spr:chapterRegistrationDate, which gives dates of the form "2017-09-09"^^xs:date.


The request

select distinct ?date where {
?s spr:chapterRegistrationDate ?date .
}
order by desc(?date)
limit 5

gives the following surprising result:

| date                  |
| "2017-09-09"^^xs:date |
| "2017-07-25"^^xs:date |
| "2017-06-14"^^xs:date |
| "2017-05-20"^^xs:date |
| "2016-12-19"^^xs:date |

The most recently registered document is dated in the future!?!

In any case, this suggests that the database is regularly updated, even if the posted dates are to be interpreted in a way I do not know at the moment.
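The observation can be reproduced offline with the five dates returned above; the query day used here is an assumption on my part:

```python
from datetime import date

# the five most recent spr:chapterRegistrationDate values returned by the query
registration_dates = [date(2017, 9, 9), date(2017, 7, 25), date(2017, 6, 14),
                      date(2017, 5, 20), date(2016, 12, 19)]
query_day = date(2017, 1, 27)  # assumed day the query was run

# registration dates lying strictly after the query day, i.e. "in the future"
in_future = [d for d in registration_dates if d > query_day]
```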


This exploration confirms an intuition I have had since the beginning of the SemBib project: there are more and more sources of bibliographic data, but each has its own objectives and is incomplete for other purposes, such as SemBib's.

This also confirms the axis chosen for Sembib: to constitute a graph of data specific to SemBib, but interconnected with other graphs. SemBib advocates a federation of interconnected bibliographic graphs.

Posted in Public data, Semantic taging, SPARQL | Leave a comment

A country without war

I saw the question "Is there a country that has never been in a war?", which refers to

World peace? These are the only 11 countries in the world that are actually free from conflict

and I thought that it is a good exercise for the semantic web. I think it will help me illustrate the power and the limitations of the large masses of data made freely available on the web in machine-readable form (LOD, Linked Open Data).

The references above suggest a particular dimension of the issue: the time dimension! What period are we talking about? More on this later. But let us first interrogate some concepts: country, war, region.


First, consider what DBPedia records as countries. Let's start with the concept Country:

select count(distinct ?pays) where {
?pays a <> .
}

giving 3294. Surprising! I had in mind that there are just under 200 countries in the world. A Google search for the number of countries in the world gives further guidance: we find, for example, that the IEP considers 162 countries in a recent study, and the UN seems to recognize 197, but in that case only countries constituting sovereign states are counted.

Wikipedia gives us an indication of the time dimension: the French article Liste des pays du Monde evokes the evolution of the number of countries over time.

Even though DBPedia identifies states that no longer exist today, the disparity is huge. We will have to understand what differentiates the Country concept in DBPedia from, for example, the countries recognized by the UN, and how to reconcile the two approaches.

For example, we see many types associated with these entities, including Country by Yago (5528 entities), umbel (2385 entities) and wikidata:Q6256 (5506 entities). These variations probably will not help.


Let us now count the military conflicts listed by DBPedia. I retained the MilitaryConflict concept.

select count(distinct ?conflict) where {
?conflict a <> .
}

gives 13354 (as of 29.05.2016).

Now, let us count the conflicts explicitly linked to countries:

select count(distinct ?conflict) where {
?conflict a <> .
?conflict ?p ?pays .
?pays a <>
}

gives 7750. And the number of countries involved in these conflicts is given by

select count(distinct ?pays) where {
?conflict a <> .
?conflict ?p ?pays .
?pays a <>
}

which is limited to 1115. By viewing just the first 100 countries, we find, for example, "Province of Quebec". To arrive at a form of aggregation attaching a conflict to countries existing today, we would have to see whether a 'country' associated with a conflict can be identified as part of a country existing today.

32 different properties are used for these links, and I have a little trouble interpreting some of them. Some are understandable enough: the place property is probably used to indicate the place of the conflict; the combatant property probably indicates the origin of the fighters involved. In both cases, we may well consider that the named countries were involved in the conflict concerned.

Another look can be taken at the DBPedia data by noting that some conflicts are associated, via the wordnet_type property, with the synset war-noun-1.


The idea is next to specify the period on which we will focus our interest.

Are there any dates or times associated with conflicts recorded by DBPedia?

Some conflicts are associated with one or more dates (start? end?); others with endDate and startDate; others have no direct indication, but have links to battles carrying date information.

When this information is available, we can say that a country was at war within the period concerned. Otherwise, we cannot say anything: we do not know at what time the country was affected by the observed conflict.

We will have to try to identify the countries that have been involved in a conflict identified by DBPedia (which probably lacks contemporary conflicts).


For simplicity, I have sought to identify the countries recognized by the UN. There is a concept in DBPedia that should help us: Member State of the United Nations. But on a few examples, I see that this concept is associated with countries by the dc:subject property, which is a little vague. This gave me the opportunity to see that countries are associated with the rdf:type yago:MemberStatesOfTheUnitedNations, which seems more accurate.

select count(distinct ?pays) where {
?pays a <> .
}

gives us 186 countries.

I will focus on these countries to assess whether they were involved in a military conflict at one time or another.

select count(distinct ?pays) where {
  ?pays a <> .
  ?conflict a <> .
  { ?conflict ?p1 ?pays } UNION { ?pays ?p2 ?conflict }
}

gives 139, suggesting that 47 UN member countries are not related to any military conflict known to DBPedia, as far as we can judge from the knowledge directly represented in DBPedia (not inferred knowledge).

We will have to express a negation: countries that are not in the list of countries that had a military conflict.

I think the request

select distinct ?country where {
  ?country a yago:MemberStatesOfTheUnitedNations .
  FILTER NOT EXISTS {
    ?conflict a dbo:MilitaryConflict .
    { ?country ?p1 ?conflict } UNION { ?conflict ?p2 ?country }
  }
}

will give us the countries not bound by any DBPedia property with a military conflict. This gives 15 results:


Two results are obviously not countries (the query could be refined to detect and exclude them). That leaves 13 results that we need to look at a little closer. I will not review every result, but let us look at a few significant examples.

Monaco is a tiny state. The city-state was annexed by Julius Caesar, for example; without a fight? without use of force? Monaco managed to maintain some neutrality during World War II. Did Monaco really avoid any armed conflict?

Micronesia consists of a group of Pacific islands. In particular, it was invaded by Japan during World War II. I think it has experienced at least one armed conflict, but it does not appear as such in DBPedia.

An interesting case is the Ivory Coast. Wikipedia knows that this country is experiencing political and military problems (see but DBPedia seems to ignore it. Perhaps the matter is too recent and current?


We see through this example that data are available that can help answer questions (and this is already big progress), but there is still much to do, a lot of intelligence to apply along the path from the data to the answer to a specific question. Data should be used with caution, specifying the boundaries of the answers. For example, in the case described in this post, a correct formulation of an answer that can be obtained is: the list of countries recognized by the UN of which we are certain, from the knowledge represented in DBPedia, that they have been linked to a military conflict.

Posted in DBpedia, Public data, Semantic taging, SPARQL, Tutorial, web | Leave a comment

CORS, semantic web and linked data

In this post, I talk about CORS and about solutions for using data served from a server different from the one serving the web page that uses it.

The development of the semantic web and linked data certainly goes through the development of websites that exploit data made available with semantic web technologies. The best-known example is the use of DBPedia to complement the content of a web page.

But using DBPedia from a web page implies sending a request to a SPARQL access point (en, fr) from the page, then using the data obtained to enrich the page. Obviously, this requires bypassing the security rule which forbids using, in a web page, content from a server other than the server where the page is served (unless it is a particular kind of content, such as JavaScript code or a JPEG image): the CORS principle. In what follows, I will call the server which serves the web page the ‘source server’, and the one from which we try to recover data the ‘alien server’.

I read a lot about CORS before I understood an essential thing: to bypass CORS, the server that provides the data must implement specific solutions; therefore, you cannot use just any source, only a source willing to cooperate with you. We will see later that one of the solutions allows you to exploit any source, but through your own server, which then becomes the data source for the web page.

Three solutions are known to query an alien server:

  • the declaration of the source server on the alien server as an authorized requester,
  • sending the data in the form of JavaScript code (the JSONP method),
  • routing the data from the alien server to the web page via the source server (a proxy).

In the first solution, the source server’s administrator has arranged with the alien server’s administrator to register the source server in a list of authorized requesters.

After this registration, a web page that comes from the source server will be able to send requests to the alien server and obtain data.

So there is a strong prerequisite: the administrators of the two servers must be in contact, and the data server (alien) must have referenced the other server (source). This is probably the best solution in terms of security, but it is very restrictive. In particular, there is a big uncertainty and delay between the moment you identify a data source that you want to use and the moment you can actually use that source (if it allows you at all).

The way to do this registration depends completely on the nature of the alien server. For example, I administer shadok, a Virtuoso server ( and I was able to declare a server ( that will host pages making SPARQL queries on the shadok server. To do this, I added it to the list of servers accepted by shadok, following the instructions found here.
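To make the idea concrete, here is a generic sketch (in plain JavaScript, not Virtuoso configuration) of the check an alien server performs in this declarative solution; the whitelist and origin names are illustrative assumptions:

```javascript
// List of source servers registered as authorized requesters
// (illustrative values, not the real shadok configuration).
var allowedOrigins = ['http://source-server.example'];

// Given the Origin header of an incoming request, return the value to put
// in the Access-Control-Allow-Origin response header, or null to refuse.
function corsHeaderFor(origin) {
  return allowedOrigins.indexOf(origin) !== -1 ? origin : null;
}
```

An allowed origin gets its own value echoed back in the Access-Control-Allow-Origin response header; any other origin gets no header, and the browser blocks the response.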

A basic example of usage is visible by following this link:

JSONP method

The idea here is simple: because pages can load JavaScript, send them JavaScript. The data server (alien) sends a call to a function that receives the data as a parameter. The source page must define that function’s code. This solution is known as JSONP.

The page contains for example the function definition

var queryAnswer;
function myjsonp(data) { // defines the callback function whose name is passed as a parameter of the request
  queryAnswer = data;
}

The following URL, which can be recovered by making a test directly with the user interface at

gives as result

{ "head": { "link": [], "vars": ["Concept"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }} ] } }

But if you make this request in a web page, it will fail because it will look for data on a server different from the origin of the page.

Adding to the above query the following parameter


This defines the callback parameter and assigns it the value myjsonp (or whatever name you have given to your callback function). The Virtuoso server that hosts DBPedia will use this setting to encapsulate the answer in a call to the myjsonp function.

This query can be executed if it is interpreted by the browser as JavaScript code loading:

<script type="text/javascript" src="…here the above request…"></script>

Thus, when interpreting this line, the browser receives from the alien server a JavaScript function call and executes it; given how our function is defined, the result of the SPARQL request will be stored in the global variable queryAnswer (with our very simple myjsonp sample function).

The alien server (Virtuoso here) implements a specific treatment of the request: instead of returning the raw data, it sends the data encapsulated in a function call.

Thus, for DBPedia, if you ask for a JSON response type and add the callback parameter, you get JSONP. For the above query, the received result is:


myjsonp({ "head": { "link": [], "vars": ["Concept"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }},
    { "Concept": { "type": "uri", "value": "" }} ] } })

We see that it is the same as before, wrapped in myjsonp(…).

An example is shown here:

For dynamic behavior, it is necessary to produce some JavaScript code that creates the script tag and injects it into the page.
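A minimal sketch of such dynamic injection, assuming a Virtuoso-style endpoint that accepts query, format and callback URL parameters (the endpoint URL and names below are illustrative):

```javascript
// Build the JSONP request URL from the endpoint, the SPARQL query
// and the name of the globally defined callback function.
function buildJsonpUrl(endpoint, sparql, callbackName) {
  return endpoint +
    '?query=' + encodeURIComponent(sparql) +
    '&format=json' +
    '&callback=' + encodeURIComponent(callbackName);
}

// Inject a <script> tag so the browser loads the JSONP response as code.
function injectJsonp(url) {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src = url;
  document.getElementsByTagName('head')[0].appendChild(script);
}

// The global callback that the alien server's response will invoke.
function myjsonp(data) {
  console.log(data.results.bindings);
}

// Usage (in a browser):
// injectJsonp(buildJsonpUrl('http://dbpedia.org/sparql',
//                           'select distinct ?Concept where {[] a ?Concept} LIMIT 5',
//                           'myjsonp'));
```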


The last solution that I will present is the use of a proxy in the source server.

Because the server that sends the data must trust the page that asks for them, one solution is to request these data from the server that provided the page.

In this case, the page sends the request (for example, the one defined in the previous section) as a parameter of a service on the source server. The service running on the source server retrieves this parameter, executes the request against the alien server, receives the result and returns it to the page.

This solution has drawbacks:

  • it adds a processing load on the source server,
  • it requires a transfer of data from the alien server to the source server, then from the source server to the page that has sent the request,
  • it adds latency.

But it also has advantages:

  • it can query any data source, at the cost of implementing something like a proxy-cache,
  • it reduces the dependence on the alien server’s hazards, and in that case the latency can even be reduced (in the example above, the SPARQL code would run once on the alien server, and the result would be stored on the source server; subsequent executions of the same query would get the result directly from the cache on the source server).

I will give elements in a future post for the implementation of this solution.

To go further

A more detailed presentation of CORS can be found here

JSONP support by the Ajax calls offered by jQuery

where it may be noted in particular that jQuery converts requests for JSON data into JSONP requests when the request targets an origin different from that of the current page.

Some supplements on the details of CORS:


You will probably need the proxy solution for certain data sources that offer neither of the previous two solutions, so it could be a good idea to support the proxy solution on your server now. We will see how to do it in a future post.

You will see that more and more sources offer you the JSONP solution; it is quite easy to implement and will probably make your pages more dynamic.

Finally, the declarative solution is limited to data sources whose configuration you can change yourself (or have someone change for you): this will surely be the least common case. Where it is possible, remember that it is still the safest.




Posted in SPARQL, web | Tagged , | Leave a comment

Some french SPARQL endpoints

A short post, which will evolve from time to time, listing some SPARQL endpoints that have a significant relation with France, for example:

  • data produced by a French institution;
  • data about French resources.


The Bibliothèque Nationale de France, the main and official library in France, has published a lot of data about artworks, authors and artists.


This project helps many French actors working on cultural data (Digital Humanities) to publish data through a single access point. A lot of data sets have been published.

DBPEDIA français

A workgroup has taken on the job of extending the DBPedia initiative to the French Wikipedia. The SPARQL endpoint is here: (virtuoso endpoint)

(to check: what is


The European project Europeana collects a lot of European cultural data, including French data.

GIVINGSENSE (experimental)

… and really small compared with the previous projects. Our team is publishing data and enrichments about the French curriculum and school system. It is experimental, a work in progress. Not everything is as clean as it should be, but things are usable as is. The project is commented on here:

Datasets are progressively integrated into a triple store with a SPARQL endpoint: (Virtuoso endpoint)

Available datasets are for instance:

Posted in SPARQL | Tagged , , , | Leave a comment

Web interaction based on a simple ontology


As part of our work, we identified several ontologies for which a useful visualization mode is to display a tree of concepts. In fact, for many ontologies, or even plain sets of RDF triples, a ‘view’ over the data can be a tree. We will see an example that preserves semantic links.

We will see that this is doable with some automated transformations and some simple code.

To move forward on this idea, we chose the Bloom ontology that we created in the ILOT project (see this article). In its current state, it is primarily a thesaurus, which could be represented in SKOS, but which is an OWL ontology (for reasons related to the conduct of the ILOT project) using part of the SKOS vocabulary. Concretely, there is a vocabulary tree with concepts/general words and words arranged in a hierarchy, limiting the semantics to broader-concept links. Our ontology is available here.

For Web visualization of a tree, we quickly identified the javascript library d3.js as reliable and highly customizable, offering various opportunities to display tree data.

d3.js offers features to load data described in JSON and display them. For this, we will start from a model with default operations on the tree structure, which requires a well-defined structure for the JSON object, but which is based on methods that can be redefined. At first glance, we simply have to redefine two things:

  • children, a function which, starting from a node, returns an array of its direct children,
  • a function that returns the label to display for a given node.

But first, we must generate the JSON structure that we will use.

Transforming OWL ontology to JSON-LD

Our ontology can easily be converted into JSON-LD with RDF-Translator. It is an online tool, whose sources are available, which converts files from one representation of RDF triples to another. We must give the input format (here RDF/XML) and the output format (here JSON-LD).

JSON-LD is a recent W3C specification for representing RDF triples in JSON, organized in a way that facilitates the exchange of data between applications and between web sites. A future post will tell you more about JSON-LD.

Once the JSON-LD file is copied, we will make it a little easier to handle with JavaScript by adding a context (a context is something like a list of templates which, applied to the rest of the file, simplify its presentation while preserving the content). For this, I used the ‘json-ld playground’ (a tool with samples for playing with JSON-LD) which allowed me to add the little context below to the previously generated JSON-LD file:

    "subClassOf": "",
    "label": {
      "@id": "",
      "@container": "@language"
    },
    "prefLabel": {
      "@id": "",
      "@container": "@language"
    },
    "close": "",
    "bloom": {
      "@id": ""
    },
    "scopeNote": {
      "@id": "",
      "@type": ""
    }

allowing for example to transform the following portion of JSON-LD:

"": [
    {
        "@value": "detect",
        "@language": "en"
    },
    {
        "@value": "détecter ",
        "@language": "fr"
    }
]

into the following, thanks to the ‘filter’ defined by the context:

"prefLabel": {
        "en": [
          "compter ",

which is more convenient to handle. The resulting JSON-LD file is here.

Displaying with d3

Starting from the ‘Indented Tree’ example that appears on the d3 examples page, we looked at how to exploit such a representation with our JSON-LD file. The first step is, of course, to replace the data used in the example with ours:

d3.json("bloom/ontobloom.jsld", ...

The ellipsis replaces a function passed as a parameter, which uses the data read from the JSON file at the given path.

This does not work straight away. As mentioned above, you need to tell the code how to find the root of the tree that interests us:

flare["@id"] = "";

where we give the id of the node that should be matched in the function that finds the children of a node:

childrenFinder = function(node) {
    var j = 0;
    var children = new Array();
    var id = node["@id"];
    if (id !== undefined) {
        root["@graph"].forEach(function(n, i) {
            if (n.subClassOf !== undefined)
                if ("@id" in n.subClassOf)
                    if (n.subClassOf["@id"] === id) {
                        children[j] = n; j++;
                    }
        });
    }
    return children;
}

This function is quite simple and obviously dependent on the input data. It is called recursively on the children of the node, the children of those children, and so on.
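To illustrate the recursion on a miniature example (the data below is made up, not the real ontology), the same parent test can be used to build a plain nested tree from a flat @graph array:

```javascript
// Build a nested tree from a flat JSON-LD @graph array, using the same
// subClassOf test as childrenFinder above (illustrative sketch).
function buildTree(graph, rootId) {
  var children = graph.filter(function(n) {
    return n.subClassOf !== undefined && n.subClassOf["@id"] === rootId;
  });
  return {
    "@id": rootId,
    children: children.map(function(c) {
      return buildTree(graph, c["@id"]); // recurse on each child
    })
  };
}
```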

Once defined, it is used as a parameter of the display method of the tree structure:

// override the standard method for finding children
tree.children(childrenFinder)
// build the visual structure of the tree

We then find the piece of code that inserts the text in the tree; the function that gives the label associated with each node is changed:

      .attr("dy", 3.5)
      .attr("dx", 5.5)
      .text(function(d) {
          var label = (d.label !== undefined ? d.label : "?");
          return label;
      });

The result is shown here.

If you look in a debugger (for example, the development tools integrated into Chrome) at the object that you get on a mouse click in the tree, you see that this object is a direct reflection of the original structure of the ontology; it contains an “@id” field whose value is the URI of the element of the ontology associated with the clicked item in the tree, for example:

"@id": ""

So, at low cost, you have a user interface with semantics: it is ready to establish links with the rest of the wonderful world of linked open data and the semantic web.

Posted in OWL Cookbook, Visualization, web | Tagged , , | 1 Comment

Using public data: educational resources

Our team has undertaken the reuse of public data, especially in the field of culture. For this, we process public data available on to improve their usability in the semantic web and LOD (Linked Open Data) context.

The first set of re-published data is the one entitled

Educational resources for teaching art history

published by the French Ministry of Culture and Communication, under the ‘Open License’.

The original file is a CSV file containing the description of 5000 worksheets. This file has multiple columns holding multiple values separated by ‘,’ or ‘;’. We have improved the structure of the data by splitting these values apart so they can be easily recognized for use in the semantic web, and in particular with SPARQL queries. In this regard, see, and the ongoing integration within the ILOT project.
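As an illustration of the kind of cleanup involved (a sketch, not our actual conversion code), splitting a multi-valued cell can be done like this:

```javascript
// Split a CSV cell containing several values separated by ',' or ';'
// into an array of trimmed, non-empty values.
function splitMultiValue(cell) {
  return cell.split(/[,;]/)
             .map(function(v) { return v.trim(); })
             .filter(function(v) { return v.length > 0; });
}
```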

The file was converted into RDF/XML. It is available at the following address. It is published under a Creative Commons Share Alike license.

Our data were announced in, here:

Future developments of our work are expected to enrich the data with:

  • links to external data (e.g. links to DBPedia for people and places, links to GeoNames for places),
  • links to an enriched description of all tags and keywords used in this file.

Feel free to use this resource, and please keep us informed if you use or improve it.

Posted in DBpedia, OWL Cookbook, Public data | Tagged , , | Leave a comment

Ontologies, properties and inheritance of features

I know that I will hurt some specialists of ontologies by speaking about inheritance of features.

But let me tell a story.

I’m quite enthusiastic about using ontologies, but I was also, until recently, a newbie in the domain. I am making quick progress :).

I had a very common need, but I was stuck on that for a while.

My need:

  • I have a tree
  • Nodes have a direct ancestor (parent) and indirect ancestors
  • hasParent is a property which links a node to another node that is its direct ancestor in the tree
  • hasAncestor is a property which links a node to another node located somewhere between the node and the tree’s root
  • hasAncestor must be transitive
  • hasParent isn’t transitive

I would like to get the rule:

a :hasParent b   implies   a :hasAncestor b

I checked a lot of documents and couldn’t figure out how to do it (directly in RDF/XML or interactively with Protégé).

The solution was simple and, as such, not clearly visible; here it is in Turtle syntax:

:hasParent rdfs:subPropertyOf :hasAncestor.

My blindness was due to my inability to imagine that a sub-property doesn’t share its ‘features’ with its super-property.


:hasParent rdfs:subPropertyOf :hasAncestor .
:hasAncestor rdf:type owl:TransitiveProperty .

does not mean that :hasParent is also transitive.

Transitivity isn’t “inherited” down the property hierarchy, so it’s possible to have a non-transitive sub property of a transitive super property.

As soon as I have understood that fact, I searched where I had missed something.

At first, I spontaneously thought that a sub-property inherits behavior from its super-property.

Some good specialists, like Dave Reynolds, think that for OWL ontologies and the underlying logic, using terms like “inheritance” can trip you up, especially when it comes to property axioms. In the RDF/OWL way of thinking, a property corresponds to a set of pairs of things that are related by the property. So saying

:hasParent rdfs:subPropertyOf :hasAncestor

means, and only means, that the set of pairs of things related by :hasParent is a subset of the set of pairs of things related by :hasAncestor.
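This set-of-pairs reading can be made concrete with a toy model (the node names and pair lists are invented for illustration):

```javascript
// Model each property as a set of pairs. hasParent's pairs are a subset of
// hasAncestor's pairs, yet hasParent itself is not transitive.
var hasParent = [["c", "b"], ["b", "a"]];          // node -> direct parent
var hasAncestor = hasParent.concat([["c", "a"]]);  // plus the transitive pair

// Does the relation rel hold between x and y?
function holds(rel, x, y) {
  return rel.some(function(p) { return p[0] === x && p[1] === y; });
}

// subPropertyOf: every hasParent pair is also a hasAncestor pair...
var isSubProperty = hasParent.every(function(p) {
  return holds(hasAncestor, p[0], p[1]);
});
// ...but the transitive consequence (c, a) holds for hasAncestor
// without being a hasParent pair.
```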

We can’t assume that property characteristics are inherited: some are (e.g. functionality), some aren’t (e.g. transitivity and symmetry). And, surely, thinking of this in terms of inheritance is very confusing.

Some “features” of OWL are “inherited”. Others are not. For example:

(a subPropertyOf b) and (b inverseOf c)  doesn't imply (a inverseOf c)
(a subPropertyOf b) and (b equivalentPropertyOf c) doesn't imply (a equivalentPropertyOf c)
(a subPropertyOf b) and (b type SymmetricProperty) doesn't imply (a type SymmetricProperty)
(a subPropertyOf b) and (b type TransitiveProperty) doesn't imply (a type TransitiveProperty)

Where can we find that?

Aidan Hogan suggests that “if you want a non-technical means of introducing the features of OWL, examples using IF — THEN — (i.e., rules) will give a sound but incomplete picture. Studying the rules in OWL 2 RL/RDF is a great starting point for anyone wanting to learn a bit about what the *key* entailments of the OWL (2) features are (and without having to get into the formal semantics):

The OWL features mean more than what’s represented in these rules, but IF you can understand these rules, THEN you’ll have a working knowledge of OWL.”


Posted in OWL Cookbook | Leave a comment