Statistics about DBPedia-fr

I needed the number of distinct entities described in DBPedia-Fr. Along the way, we will see a problem to keep in mind when using linked data through public access points.

My first attempt was to get this information with the query

select count(distinct ?r) where { ?r ?p ?l }

but it always failed with a timeout.

If, on the other hand, I simply counted the triples with

select count(?r) where { ?r ?p ?l }

I got

185404575

Nandana Mihindukulasooriya pointed me to Loupe for statistics on datasets, but I wanted to obtain the information programmatically, by directly querying the source I was using: here DBPedia-Fr, but ideally with a method applicable to other datasets.
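As a minimal sketch of what "programmatically" can look like, the Python below builds a SPARQL protocol request and extracts a single count from a SPARQL JSON result. The endpoint address `http://fr.dbpedia.org/sparql` and the helper names are assumptions made for illustration; substitute the access point you actually use.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint address (not given in this post); replace as needed.
ENDPOINT = "http://fr.dbpedia.org/sparql"


def build_request(endpoint, query):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def extract_count(body):
    """Return the integer of a one-row, one-column result, or None if no row came back."""
    bindings = json.loads(body)["results"]["bindings"]
    if not bindings:
        return None
    (cell,) = bindings[0].values()  # exactly one variable is expected
    return int(cell["value"])


def count(endpoint, query, timeout_s=60):
    """Send the query over the network and return the count (client-side timeout in seconds)."""
    with urlopen(build_request(endpoint, query), timeout=timeout_s) as response:
        return extract_count(response.read().decode("utf-8"))
```

A call such as `count(ENDPOINT, "select (count(*) as ?c) where { ?s ?p ?o }")` would then return the number of triples, provided the endpoint answers within the allotted time.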

Hugh Williams told me that my query was very resource intensive (in the implementation of Virtuoso, which hosts DBPedia, it seems to involve building a very large hash table). He also told me that DBPedia (English) publishes an up-to-date description of its dataset following the principles proposed in the technical note on VoID: VoID for DBPedia; this avoids having heavy queries processed over and over just to obtain this information. The French version, however, does not offer this at the time of writing.

John Walker suggested rewriting my query as follows:
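When a dataset does publish a VoID description, the statistics can be read rather than computed. The sketch below is one possible way to do that: the query relies only on terms from the VoID vocabulary (`void:triples`, `void:distinctSubjects`, and so on), while the helper name and the assumption that the description is queryable on the same endpoint are mine.

```python
import json

VOID_NS = "http://rdfs.org/ns/void#"

# Reads precomputed statistics instead of counting over the whole dataset.
VOID_QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?property ?value WHERE {
  ?dataset a void:Dataset ;
           ?property ?value .
  FILTER(?property IN (void:triples, void:entities, void:classes,
                       void:properties, void:distinctSubjects,
                       void:distinctObjects))
}
"""


def void_statistics(body):
    """Turn the SPARQL JSON results of VOID_QUERY into {dataset: {term: int}}."""
    stats = {}
    for row in json.loads(body)["results"]["bindings"]:
        dataset = row["dataset"]["value"]
        # Keep only the local name, e.g. "triples" for void:triples.
        term = row["property"]["value"][len(VOID_NS):]
        stats.setdefault(dataset, {})[term] = int(row["value"]["value"])
    return stats
```

The design choice here is simply to shift the cost to whoever maintains the dataset: the expensive counts are computed once and published, and every client reads a handful of triples.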

select (count(?s) as ?c) where { select distinct ?s where { ?s ?p []} }

This formulation seems to consume fewer resources, and I get the expected result:

10515620

Why is this formulation more efficient than the other? The two seem semantically equivalent to me; one would need to understand the details of the implementation to work out which formulation is preferable. On this subject, there is an entire chapter on query optimization in Bob DuCharme's very good book on SPARQL.

Alasdair J G Gray pointed me to the report Dataset Descriptions: HCLS Community Profile, and in particular to section 6.6, which presents examples of statistics to gather on datasets together with the SPARQL queries that produce them.

The results below correspond to the application of these queries to DBPedia-Fr.

| Description | Query | Result |
| --- | --- | --- |
| Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 185404575 |
| Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | 6015375 |
| Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | time out |
| Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { ?s ?p ?o } | 20321 |
| Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o ) AS ?distinctObjects){ ?s ?p ?o FILTER(!isLiteral(?o)) } | time out |
| Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { ?s a ?o } | 442 |
| Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o filter(isLiteral(?o)) } | time out |
| Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g ) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 14 |

The same queries applied to DBPedia give:

| Description | Query | Result | VoID data |
| --- | --- | --- | --- |
| Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 438038746 | 438038866 |
| Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | nothing | |
| Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | nothing | 33996245 |
| Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { [] ?p ?o } | 64119 | 64119 |
| Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o ) AS ?distinctObjects){ ?s ?p ?o FILTER(!isLiteral(?o)) } | nothing | 164615518 (?) |
| Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { [] a ?o } | 370680 | 370680 |
| Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o filter(isLiteral(?o)) } | 29883287 | |
| Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g ) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 19 | |
| Number of distinct objects | | | 164615518 |

To obtain some results, it was necessary to increase the time allowed for processing the query; for others, the query proposed in the above document was slightly modified. DBPedia does not report a ‘time out’, but returns an empty response when it is unable to process the query (see discussion here). The maintainers of DBPedia advise against treating their server as a production server and suggest installing a local copy if more reliable results are needed; they have to serve a very large number of users and cannot offer a full guarantee of service. As a result, some queries return no result even with a longer processing time.

Nevertheless, these figures show that DBPedia makes much more intensive use of classes than DBPedia-Fr (370680 distinct classes against 442). I will return to the use of classes in a future post.

We have seen here some workarounds for long processing times on certain queries; but these workarounds do not always work, even for queries that seem simple and basic (counting certain kinds of elements in a dataset).
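Because a public endpoint may time out or answer with nothing, a script that gathers these statistics has to record failures rather than stop at the first one. The following is a hedged sketch of that idea: `collect_statistics` and the `run_query` callable are hypothetical names, and `run_query` is assumed to return an integer, return `None` for an empty response, or raise `TimeoutError`.

```python
# Three of the queries used in this post, keyed by their description.
QUERIES = {
    "Number of triples":
        "SELECT (COUNT(*) AS ?triples) { ?s ?p ?o }",
    "Number of distinct subjects":
        "SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o }",
    "Number of distinct classes":
        "SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { ?s a ?o }",
}


def collect_statistics(queries, run_query):
    """Run every query and note time outs and empty answers instead of failing.

    `run_query` is any callable taking a SPARQL string (an assumption of
    this sketch): it returns an int, returns None when the endpoint sent
    an empty response, or raises TimeoutError.
    """
    results = {}
    for label, query in queries.items():
        try:
            value = run_query(query)
        except TimeoutError:
            results[label] = "time out"
            continue
        results[label] = "nothing" if value is None else value
    return results
```

Passing the network call in as a parameter keeps the bookkeeping independent of any particular endpoint or HTTP library, so the same loop can be pointed at DBPedia-Fr, DBPedia, or a local copy.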

Update 20/4/2018

Kingsley Idehen pointed me to

https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71

and more precisely to the section “Virtuoso Anytime Query”, which shows how DBPedia adds headers to the response to warn that the results are incomplete when the timeout is reached.
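A client can check for such a warning before trusting a count. The sketch below assumes the alert is carried in an `X-SQL-State` response header with the value `S1TAT`, which is my recollection of Virtuoso's behaviour and should be verified against the linked post; the function name is hypothetical.

```python
def is_partial_result(headers):
    """Return True when the response headers signal an interrupted (partial) answer.

    Assumption of this sketch: the server sets `X-SQL-State: S1TAT` when
    the anytime timeout cut the query short.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-sql-state", "").strip() == "S1TAT"
```

With `urllib.request`, the `headers` attribute of the response can be passed directly; a count obtained from a response flagged this way should be read as a lower bound rather than as the true figure.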

 
