Tf-Idf is a weighting method, often used to evaluate the importance of a word in a document.
The idea here is to evaluate the importance of a predicate or a predicate-value pair for an entity of a semantic graph.
Tf-Idf on predicates
For this first evaluation, we consider that an entity of a graph is represented by the predicates that describe it. The equivalent of a document in the classic use of Tf-Idf is therefore an entity and its list of predicates (this could be extended to several levels of the entity's neighborhood).
The “raw” frequency of a predicate is then simply the number of occurrences of this predicate for the considered entity. This raw count can be used directly as the frequency of a predicate. For documents, it is also possible to use the number of occurrences of a word divided by the number of words in the document. When documents are of roughly homogeneous length, the choice matters little; otherwise, it can quickly become significant. Similarly, we propose to use the number of occurrences of a predicate associated with an entity, normalized by the total number of predicate occurrences associated with this entity.
to(p,e)=number of occurrences of p for the entity e
can be obtained, for example, on our graph <http://givingsense.eu/datamusee/onto/parismusees> with the SPARQL query:
select (count(?o) as ?c) where { graph <http://givingsense.eu/datamusee/onto/parismusees> { <target entity> <target predicate> ?o } }
tp(e) = total number of predicate occurrences (triples) associated with e
can be obtained, for example, on our graph <http://givingsense.eu/datamusee/onto/parismusees> with a SPARQL query following this template:
select (count(?p) as ?c) where { graph <http://givingsense.eu/datamusee/onto/parismusees> { <target entity> ?p []} }
tf(p,e)= to(p,e)/tp(e)
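As a minimal sketch of the definitions above, the following Python computes to(p,e), tp(e) and tf(p,e) from an in-memory list of triples rather than from a SPARQL endpoint. The function name `term_frequencies` and the `ex:` identifiers in the toy data are illustrative assumptions, not part of the graph described here.

```python
from collections import Counter

def term_frequencies(triples, entity):
    """Return {predicate: tf(p, e)} for one entity.

    `triples` is an iterable of (subject, predicate, object) tuples.
    """
    # to(p, e): occurrences of each predicate with `entity` as subject
    occurrences = Counter(p for s, p, _ in triples if s == entity)
    # tp(e): total number of predicate occurrences for the entity
    total = sum(occurrences.values())
    if total == 0:
        return {}
    return {p: n / total for p, n in occurrences.items()}

# Hypothetical toy data, not taken from the real graph
toy_triples = [
    ("ex:museum", "ex:link", "ex:a"),
    ("ex:museum", "ex:link", "ex:b"),
    ("ex:museum", "ex:link", "ex:c"),
    ("ex:museum", "ex:label", "Museum"),
    ("ex:other", "ex:label", "Other"),
]
tf = term_frequencies(toy_triples, "ex:museum")
```

Because of the normalization, the tf values of an entity sum to 1 whatever the number of triples that describe it.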
The inverse document frequency (IDF) is a measure of the importance of a term across the entire body of documents. In our case, it measures the importance of a predicate across the whole graph.
We will evaluate it using the logarithm of the inverse of the proportion of entities that use this predicate:
D = number of entities in the graph
can be obtained, for example, on our graph <http://givingsense.eu/datamusee/onto/parismusees> with the SPARQL query:
select (count(distinct ?s) as ?c) where { graph <http://givingsense.eu/datamusee/onto/parismusees> { ?s ?p []} }
d(p) = number of entities that use the predicate p
On the same graph, the following query returns this count for every predicate, most used first:
select ?p (count(distinct ?s) as ?c) where { graph <http://givingsense.eu/datamusee/onto/parismusees> { ?s ?p []} } group by ?p order by desc(?c)
then (the example values below are consistent with a base-2 logarithm)
idf(p) = log2(D/d(p))
and
tfidf(p,e) = tf(p,e)*idf(p)
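The formulas above can be sketched in Python as follows. This is an illustration under assumptions: the triples are held in memory instead of being queried with SPARQL, the logarithm is taken in base 2 (which matches the example values given in this article), and the function name and `ex:` identifiers are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_for_entity(triples, entity):
    """Return {predicate: tfidf(p, e)} for `entity`, using log base 2."""
    triples = list(triples)
    # D: number of distinct entities appearing as a subject
    subjects = {s for s, _, _ in triples}
    # d(p): number of distinct subjects that use each predicate
    users = defaultdict(set)
    for s, p, _ in triples:
        users[p].add(s)
    # to(p, e) and tp(e) for the target entity
    occurrences = Counter(p for s, p, _ in triples if s == entity)
    total = sum(occurrences.values())
    scores = {}
    for p, n in occurrences.items():
        tf = n / total
        idf = math.log2(len(subjects) / len(users[p]))
        scores[p] = tf * idf
    return scores

# Hypothetical toy graph: every entity has a type, only ex:e1 has links
toy = [
    ("ex:e1", "ex:type", "ex:Museum"),
    ("ex:e1", "ex:link", "ex:e2"),
    ("ex:e1", "ex:link", "ex:e3"),
    ("ex:e1", "ex:link", "ex:e4"),
    ("ex:e2", "ex:type", "ex:Museum"),
    ("ex:e3", "ex:type", "ex:Museum"),
    ("ex:e4", "ex:type", "ex:Museum"),
]
scores = tfidf_for_entity(toy, "ex:e1")
```

In this toy graph, `ex:type` is used by every entity, so its idf is 0 and it contributes nothing, while `ex:link` is both frequent for `ex:e1` and rare in the graph, so it receives the highest weight.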
For example, for the previous graph
D=255075
* for the predicate <http://dbpedia.org/ontology/wikiPageWikiLink>, d(<http://dbpedia.org/ontology/wikiPageWikiLink>) = 118659
* for the predicate <http://purl.org/dc/terms/subject>, d(<http://purl.org/dc/terms/subject>) = 2079
* for the predicate <http://xmlns.com/foaf/0.1/primaryTopic>, d(<http://xmlns.com/foaf/0.1/primaryTopic>) = 1793
* for the predicate <http://www.w3.org/2002/07/owl#sameAs>, d(<http://www.w3.org/2002/07/owl#sameAs>) = 1581
for the entity <http://fr.dbpedia.org/resource/Paris_Musées>, with the query
select ?p (count(?p) as ?c) where { graph <http://givingsense.eu/datamusee/onto/parismusees> { <http://fr.dbpedia.org/resource/Paris_Musées> ?p ?o } } group by ?p order by desc(?c)
the following predicates are found (84 predicate occurrences in total):
<http://dbpedia.org/ontology/wikiPageWikiLink> 48
<http://fr.dbpedia.org/property/wikiPageUsesTemplate> 8
<http://purl.org/dc/terms/subject> 5
<http://www.w3.org/2002/07/owl#sameAs> 5
<http://dbpedia.org/ontology/abstract> 3
<http://www.w3.org/2000/01/rdf-schema#comment> 3
<http://www.w3.org/2000/01/rdf-schema#label> 3
<http://dbpedia.org/ontology/wikiPageExternalLink> 2
<http://dbpedia.org/ontology/wikiPageID> 1
<http://dbpedia.org/ontology/wikiPageLength> 1
<http://dbpedia.org/ontology/wikiPageOutDegree> 1
<http://dbpedia.org/ontology/wikiPageRevisionID> 1
<http://www.w3.org/ns/prov#wasDerivedFrom> 1
<http://xmlns.com/foaf/0.1/homepage> 1
<http://xmlns.com/foaf/0.1/isPrimaryTopicOf> 1
This gives, for example, the following tfidf values for this entity:
tfidf(<http://dbpedia.org/ontology/wikiPageWikiLink>) = 0.63
tfidf(<http://purl.org/dc/terms/subject>) = 0.41
tfidf(<http://xmlns.com/foaf/0.1/primaryTopic>) = 0 (the entity does not use this predicate, so tf = 0)
tfidf(<http://www.w3.org/2002/07/owl#sameAs>) = 0.43
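The arithmetic behind these values can be checked with a short Python sketch that uses only the figures quoted in this article (D, the d(p) counts, the per-entity occurrence counts and the total of 84). The base-2 logarithm is an assumption inferred from the listed results, and the prefixed names (`dbo:`, `dcterms:`, `foaf:`, `owl:`) are shorthand introduced here for the full predicate URIs.

```python
import math

D = 255075   # number of entities in the graph
TOTAL = 84   # predicate occurrences for the entity Paris_Musées

# predicate: (occurrences for the entity, d(p) in the graph)
figures = {
    "dbo:wikiPageWikiLink": (48, 118659),
    "dcterms:subject": (5, 2079),
    "foaf:primaryTopic": (0, 1793),   # not used by the entity
    "owl:sameAs": (5, 1581),
}

def tfidf(occurrences, total, d_p, n_entities):
    """tf(p, e) * idf(p), with a base-2 logarithm (assumed)."""
    return (occurrences / total) * math.log2(n_entities / d_p)

scores = {p: tfidf(n, TOTAL, d, D) for p, (n, d) in figures.items()}
for predicate, value in scores.items():
    print(f"{predicate}: {value:.3f}")
```

The results should agree with the values listed above to about two decimal places. Note that wikiPageWikiLink still ranks first despite its low idf, because it accounts for more than half of the entity's predicate occurrences.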
Extensions
The same type of calculation can be applied to the predicates used by classes, or to the (predicate, value) pairs used. Predicates incoming to an entity can also be taken into account.
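One possible way to picture these extensions is to change only what is counted for an entity, leaving the tf and idf formulas unchanged. The sketch below is an assumption about how this could be organized, not a method defined in this article; the function name `entity_features`, the `kind` parameter and the `ex:` identifiers are hypothetical.

```python
from collections import Counter

def entity_features(triples, entity, kind="predicate"):
    """Count the features that describe `entity`.

    kind="predicate": outgoing predicates (the case treated above)
    kind="pair":      outgoing (predicate, value) pairs
    kind="incoming":  predicates of triples whose object is `entity`
    """
    if kind == "predicate":
        return Counter(p for s, p, o in triples if s == entity)
    if kind == "pair":
        return Counter((p, o) for s, p, o in triples if s == entity)
    if kind == "incoming":
        return Counter(p for s, p, o in triples if o == entity)
    raise ValueError(f"unknown kind: {kind}")

# Hypothetical toy data
toy = [
    ("ex:e1", "ex:type", "ex:Museum"),
    ("ex:e1", "ex:locatedIn", "ex:Paris"),
    ("ex:e2", "ex:managedBy", "ex:e1"),
    ("ex:e3", "ex:managedBy", "ex:e1"),
]
pairs = entity_features(toy, "ex:e1", kind="pair")
incoming = entity_features(toy, "ex:e1", kind="incoming")
```

The resulting counts play the role of to(p,e); the document frequency d would then be computed over the same kind of feature.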