I needed the number of distinct entities described in DBPedia-Fr. Along the way, we will see a problem to keep in mind when using linked data served through public SPARQL endpoints.
My first attempt was to get this information with the query
select count(distinct ?r) where { ?r ?p ?l }
but it always failed with a timeout.
When I simply counted the triples instead, with
select count(?r) where { ?r ?p ?l }
I got
185404575
Nandana Mihindukulasooriya pointed me to Loupe for dataset statistics, but I wanted to obtain the information programmatically, by querying the source I was actually using: here DBPedia-Fr, but ideally with a method applicable to other datasets.
Hugh Williams told me that my query was very resource intensive (in Virtuoso, the store that hosts DBPedia, it apparently involves building a very large hash table). He pointed out that DBPedia (English) publishes an up-to-date description of its dataset following the principles of the technical note on VoID: VoID for DBPedia. This avoids running the same heavy queries over and over just to obtain these figures. The French version, however, did not offer this at the time of writing.
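As a rough illustration of how such a VoID description could be read programmatically, here is a minimal Python sketch. The property names (`void:triples`, `void:distinctSubjects`, and so on) come from the VoID vocabulary; the function names are mine, and the assumption that the description is loaded in the store and typed as `void:Dataset` would need to be checked against the endpoint in question.

```python
import json

VOID_NS = "http://rdfs.org/ns/void#"

# Statistics properties defined by the VoID vocabulary.
VOID_STATS = ["triples", "entities", "classes", "properties",
              "distinctSubjects", "distinctObjects"]


def void_stats_query():
    """Build a SPARQL query listing the VoID statistics of each described dataset."""
    values = " ".join(f"void:{name}" for name in VOID_STATS)
    return (
        f"PREFIX void: <{VOID_NS}>\n"
        "SELECT ?dataset ?stat ?value WHERE {\n"
        "  ?dataset a void:Dataset ; ?stat ?value .\n"
        f"  VALUES ?stat {{ {values} }}\n"
        "}"
    )


def parse_void_stats(sparql_json):
    """Turn a SPARQL JSON result (as text) into {dataset: {stat: int}}."""
    stats = {}
    for row in json.loads(sparql_json)["results"]["bindings"]:
        # Keep only the local name, e.g. "triples" for void:triples.
        name = row["stat"]["value"][len(VOID_NS):]
        dataset = row["dataset"]["value"]
        stats.setdefault(dataset, {})[name] = int(row["value"]["value"])
    return stats
```

Reading these precomputed figures costs the server almost nothing compared with recounting the whole dataset on every request.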
John Walker suggested rewriting my query as follows:
select (count(?s) as ?c) where { select distinct ?s where { ?s ?p []} }
This form seems to consume fewer resources, and I got the expected result:
10515620
Why is this form more efficient than the other? The two seem semantically equivalent to me; one would have to understand the details of the implementation to know which form is preferable. On this subject, there is an entire chapter on query optimization in Bob DuCharme's very good book on SPARQL.
Alasdair J G Gray pointed me to the report Dataset Descriptions: HCLS Community Profile, and in particular to section 6.6, which presents examples of statistics to collect on datasets together with the SPARQL queries that compute them.
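The rewrite is mechanical, so it can be generated rather than typed by hand. The following Python sketch builds both forms of the query from a variable name and a graph pattern; the helper names are hypothetical, and nothing here explains why the second form is cheaper, it only reproduces the transformation.

```python
def count_distinct_direct(var, pattern):
    """COUNT(DISTINCT ...) in a single aggregate: the form that timed out for me."""
    return f"SELECT (COUNT(DISTINCT ?{var}) AS ?c) WHERE {{ {pattern} }}"


def count_distinct_subquery(var, pattern):
    """Same count, with DISTINCT moved into a subquery and a plain COUNT outside."""
    return (
        f"SELECT (COUNT(?{var}) AS ?c) WHERE "
        f"{{ SELECT DISTINCT ?{var} WHERE {{ {pattern} }} }}"
    )
```

Calling `count_distinct_subquery("s", "?s ?p []")` yields the query suggested above, up to letter case and spacing.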
The results below correspond to the application of these queries to DBPedia-Fr.
Description | Query | Result |
---|---|---|
Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 185404575 |
Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | 6015375 |
Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | time out |
Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { ?s ?p ?o } | 20321 |
Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o ) AS ?distinctObjects){ ?s ?p ?o FILTER(!isLiteral(?o)) } | time out |
Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { ?s a ?o } | 442 |
Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o filter(isLiteral(?o)) } | time out |
Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g ) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 14 |
The same queries applied to DBPedia give:
Description | Query | Result | VoID data |
---|---|---|---|
Number of triples | SELECT (COUNT(*) AS ?triples) { ?s ?p ?o } | 438038746 | 438038866 |
Number of distinct typed entities | SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] } | no result | |
Number of distinct subjects | SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) { ?s ?p ?o } | no result | 33996245 |
Number of distinct properties | SELECT (COUNT(DISTINCT ?p) AS ?distinctProperties) { [] ?p ?o } | 64119 | 64119 |
Number of distinct non-literal objects | SELECT (COUNT(DISTINCT ?o ) AS ?distinctObjects){ ?s ?p ?o FILTER(!isLiteral(?o)) } | no result | 164615518 (?) |
Number of distinct classes | SELECT (COUNT(DISTINCT ?o) AS ?distinctClasses) { [] a ?o } | 370680 | 370680 |
Number of distinct literal objects | SELECT (COUNT(DISTINCT ?o) AS ?distinctLiterals) { ?s ?p ?o filter(isLiteral(?o)) } | 29883287 | |
Number of graphs in the data set | SELECT (COUNT(DISTINCT ?g ) AS ?graphs) { GRAPH ?g { ?s ?p ?o }} | 19 | |
Number of distinct objects | | | 164615518 |
To obtain some of these results, I had to increase the time allowed for processing the query; for others, I slightly modified the query proposed in the document above. DBPedia does not report a "time out": it returns an empty response when it was unable to process the query (see discussion here). The maintainers of DBPedia recommend not treating their server as a production server, and installing a local copy if more reliable results are needed; they have to serve a very large number of users and cannot guarantee the service. As a result, some queries get no result even with a longer processing time.
Note also that DBPedia makes much more intensive use of classes than DBPedia-Fr (370680 distinct classes versus 442). I will return to the use of classes in a future post.
We have seen here a few workarounds for queries that take too long to process; but these workarounds do not always succeed, even on queries that look simple and basic (counting certain kinds of elements in a dataset).
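For the record, here is a small Python sketch of how the allowed time can be passed along with the query and how an empty answer can be told apart from a count. It assumes a Virtuoso endpoint that accepts a `timeout` request parameter in milliseconds and a `format` parameter; the function names and default values are mine, and the endpoint URL is left to the caller.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request_url(endpoint, query, timeout_ms=None):
    """Build the GET URL; timeout_ms is the processing time requested from the server."""
    params = {"query": query, "format": "application/sparql-results+json"}
    if timeout_ms is not None:
        params["timeout"] = str(timeout_ms)  # assumption: Virtuoso-style parameter
    return f"{endpoint}?{urlencode(params)}"


def read_count(sparql_json):
    """Return the single count in a SPARQL JSON result, or None if the answer is empty."""
    if not sparql_json.strip():
        return None
    bindings = json.loads(sparql_json)["results"]["bindings"]
    if not bindings or not bindings[0]:
        return None
    (cell,) = bindings[0].values()  # exactly one projected variable is expected
    return int(cell["value"])


def run_count(endpoint, query, timeout_ms=None, http_timeout_s=120):
    """Send the query; network errors and client-side timeouts are reported as None."""
    url = build_request_url(endpoint, query, timeout_ms)
    try:
        with urlopen(Request(url), timeout=http_timeout_s) as response:
            return read_count(response.read().decode("utf-8"))
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return None
```

Returning `None` for both a timeout and an empty response mirrors the two behaviours seen in the tables: a caller only learns that no count was obtained, not why.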
Update 20/4/2018
Kingsley Idehen pointed me to
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
and more precisely to the section "Virtuoso Anytime Query", which shows how DBPedia adds headers to the response to warn that the results are incomplete when the timeout has been reached.
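A client can check for this warning before trusting a count. The Python sketch below assumes, from my reading of that section, that the relevant headers are `X-SQL-State` (with the value `S1TAT`) and `X-SQL-Message`; these names should be verified against the article and the actual responses of the endpoint, and the function name is hypothetical.

```python
def incomplete_result_warning(headers):
    """Return the server's message if the headers flag a partial answer, else None.

    `headers` is any mapping of header names to values, for example the
    `headers` attribute of a urllib response.
    """
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-sql-state") == "S1TAT":  # assumed "anytime query" state code
        return normalized.get("x-sql-message", "incomplete results (no message given)")
    return None
```

With such a check, a figure like the number of distinct subjects can be reported as a lower bound rather than silently taken as exact.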