SPARQL: Number of Entities and Instances?

I'm doing a small task with SPARQL queries. I want to get the number of entities and the number of instances. I have basic knowledge of SPARQL and RDF. I wrote a SPARQL query to get the number of entities, but I'm not 100% sure it's right. The endpoint I'm using is DBpedia. Here's the query.
#Number of Entities
SELECT (count(?entity) AS ?Entities)
WHERE{ ?entity rdf:type ?type.
}
-----------
Output:
113715893
The output above gives me a big number. Is that the right query to get the number of entities?
I also have to get the number of instances. I'm not sure what 'instances' means; I assume it refers to subclasses or something similar.
Can anyone help me out with the task?

The problem with the terms entity and instance is that they are often used with different meanings. I assume entity means every URI that can be a subject, while instance means every entity that is an instance of an owl:Class.
For the entities the query would be:
SELECT (count(distinct ?entity) AS ?Entities)
WHERE{ ?entity ?p ?o}
For instances I would write the following query:
SELECT (count(distinct ?instance) AS ?Instances)
WHERE{ ?instance a ?class . ?class a owl:Class }
You may have noticed the distinct before the variable I want to count. This matters in your case: staying with your attempt, an entity can have multiple types, and for each of these types you get one binding for the combination of the entity and type variables. As a result, the entity is counted once for each type found by your query, so an entity with two types is counted twice. I assume you want to count each entity only once, so you need the distinct keyword on the variable you count. This ensures that only different entities bound to this variable are counted.
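To make the double counting concrete, here is a minimal Python sketch over a handful of made-up triples (the ex: names and the tiny dataset are assumptions for illustration, not DBpedia data). It mimics the three counts discussed above: rows of the original query, distinct subjects, and distinct instances of declared owl:Class resources.

```python
# Toy (subject, predicate, object) triples; every ex: name is invented for illustration.
RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"

triples = [
    ("ex:Alice", RDF_TYPE, "ex:Person"),
    ("ex:Alice", RDF_TYPE, "ex:Athlete"),   # Alice has two types
    ("ex:Bob", RDF_TYPE, "ex:Person"),
    ("ex:Person", RDF_TYPE, OWL_CLASS),
    ("ex:Athlete", RDF_TYPE, OWL_CLASS),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]


def count_type_rows(triples):
    """Mimics count(?entity) over { ?entity rdf:type ?type }: one row per (entity, type) pair."""
    return sum(1 for _, p, _ in triples if p == RDF_TYPE)


def count_distinct_entities(triples):
    """Mimics count(distinct ?entity) over { ?entity ?p ?o }: each subject counted once."""
    return len({s for s, _, _ in triples})


def count_distinct_instances(triples):
    """Mimics count(distinct ?instance) over { ?instance a ?class . ?class a owl:Class }."""
    classes = {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}
    return len({s for s, p, o in triples if p == RDF_TYPE and o in classes})


print(count_type_rows(triples))          # Alice contributes two rows here
print(count_distinct_entities(triples))
print(count_distinct_instances(triples))
```

On this toy data the row count is larger than the distinct-subject count because ex:Alice appears once per type, which is the same effect that inflates the number in the original query.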

Related

How to determine whether two SPARQL queries are identical using Python?

When using SPARQL to query an RDF dataset, the same query can be written in many different ways. For example, a SPARQL query is invariant under permutation of some of its clauses, and its variables can be renamed. How can we identify such identical SPARQL queries? Ideally, there would be a Python package that parses a SPARQL query (i.e., a string object) into a query object, such that different strings sharing the same underlying query are parsed into the same object; we could then simply compare the parsed query objects to determine whether two SPARQL queries are identical. Is there any tool like this (prepareQuery() in rdflib doesn't seem to work this way)? If not, what should I do?
Semantically identical queries example:
SELECT ?x WHERE { ?x foaf:haha ?k . ?person foaf:knows ?x . }
SELECT ?s WHERE { ?person foaf:knows ?s . ?s foaf:haha ?k . }
The paper "Generating SPARQL Query Containment Benchmarks using the SQCFramework" by Muhammad Saleem et al. mentions "SPARQL query containment solvers", where
Query containment is the problem of deciding if the result set of a query Q1 is included
in the result set of another query Q2
If you use such a solver to test whether the result set of Q1 is a subset of Q2 and vice versa, you have established that they are semantically identical.
As for your "off-the-shelf tool": the SQCFramework paper mentions that such solvers are tested in another paper, "Evaluating and benchmarking SPARQL query containment solvers" by M.W. Chekol et al.
As for the complexity and computability, the latter paper mentions:
The query containment problem for full SPARQL is undecidable [15, 1].
Hence, it is necessary to reduce SPARQL in order to consider it. A
double exponential upper bound has been proven for the containment and
equivalence problems of SPARQL queries without OPTIONAL , FILTER and
under set semantics [7].
However, query containment in both directions is only one way to determine identity of queries. I am not aware of a proof that query identity has better complexity/computability than query containment (or of a proof to the contrary).
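For the narrow case in the question (the same triple patterns in a different order, with renamed variables), a brute-force check is possible without a containment solver. The following is only a sketch under strong assumptions: the queries are already parsed into lists of (subject, predicate, object) strings, they contain nothing but a basic graph pattern and a SELECT list, and same_bgp is a hypothetical helper, not part of rdflib or any other package. It tests syntactic equality up to renaming, which is sufficient but not necessary for semantic equivalence, and it tries every variable permutation, so it only suits small queries.

```python
from itertools import permutations


def is_var(term):
    """SPARQL variables are written with a leading '?' in this sketch."""
    return term.startswith("?")


def variables(patterns):
    """Variables of a list of triple patterns, in order of first appearance."""
    seen = []
    for triple in patterns:
        for term in triple:
            if is_var(term) and term not in seen:
                seen.append(term)
    return seen


def rename(patterns, mapping):
    """Apply a variable renaming and forget the order of the triple patterns."""
    return frozenset(tuple(mapping.get(t, t) for t in triple) for triple in patterns)


def same_bgp(select_a, patterns_a, select_b, patterns_b):
    """True if query A becomes query B under some one-to-one variable renaming,
    ignoring the order of triple patterns. Factorial in the number of variables."""
    vars_a, vars_b = variables(patterns_a), variables(patterns_b)
    if len(vars_a) != len(vars_b) or len(select_a) != len(select_b):
        return False
    target = frozenset(patterns_b)
    for perm in permutations(vars_b):
        mapping = dict(zip(vars_a, perm))
        # The projected variables must line up as well.
        if [mapping.get(v, v) for v in select_a] != list(select_b):
            continue
        if rename(patterns_a, mapping) == target:
            return True
    return False


# The two example queries from the question, parsed by hand.
q1 = (["?x"], [("?x", "foaf:haha", "?k"), ("?person", "foaf:knows", "?x")])
q2 = (["?s"], [("?person", "foaf:knows", "?s"), ("?s", "foaf:haha", "?k")])
# A different query: ?s is the subject of both patterns.
q3 = (["?s"], [("?s", "foaf:knows", "?person"), ("?s", "foaf:haha", "?k")])

print(same_bgp(*q1, *q2))
print(same_bgp(*q1, *q3))
```

Anything beyond plain triple patterns (OPTIONAL, FILTER, UNION, property paths) falls outside this sketch, which is where the containment solvers mentioned above come in.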

Mapping DBpedia types to Wikipedia Categories

I am trying to map DBpedia types to Wikipedia categories. A simple example would be the following SPARQL query:
select distinct ?cat where {
?s a dbpedia-owl:LacrossePlayer; dcterms:subject ?cat . filter(regex(?cat,'players','i') )
} limit 100
But this is highly inefficient as it has to first map the DBpedia types to DBpedia Named Entities(resources) and then extract their corresponding Wikipedia categories. I am trying to do this mapping for a lot of other DBpedia types.
Is there a direct or more efficient way to do this?
Improving the filter may help…
As an initial note, you may get some speedup if you remove or improve your filter. You can, of course, just remove it, but you could also make it more efficient, since you're not really using any special regular expressions. Just do
filter contains(lcase(str(?cat)),'players')
to check whether the URI for ?cat contains the string players. It might even be better (I'm not sure) to grab the English rdfs:label of ?cat and check that, since you wouldn't have to do the case or string conversions.
… but there are lots of results.
But this is highly inefficient as it has to first map the DBpedia
types to DBpedia Named Entities(resources) and then extract their
corresponding Wikipedia categories. I am trying to do this mapping for
a lot of other DBpedia types. Is there a direct or more efficient way
to do this?
I'm not sure exactly what's inefficient in this. The only way that DBpedia types and categories are associated is that resources have types (via rdf:type) and have categories (via dcterms:subject). If you want to find the connections, then you'll need to find the instances of the type and the categories to which they belong. You could look into whether any particular infoboxes provide categories to articles and are also used in the infobox mapping to provide DBpedia types. That's the only way I can think of to relate categories and DBpedia types directly, without going through instances, and I don't know whether the current dataset has that kind of information.
In general, since Wikipedia categories are not a type hierarchy, there will be lots of categories with which instances of any particular type are associated. For instance, we can count the number of categories associated with the types Fish and LacrossePlayer with a query like this:
select ?type (count(distinct ?category) as ?nCategories) where {
values ?type { dbpedia-owl:Fish dbpedia-owl:LacrossePlayer }
?type ^a/dcterms:subject ?category
}
group by ?type
SPARQL results
type nCategories
http://dbpedia.org/ontology/LacrossePlayer 346
http://dbpedia.org/ontology/Fish 2375
That query responds pretty quickly, and you can even get those categories pretty easily, too:
select distinct ?type ?category where {
values ?type { dbpedia-owl:Fish dbpedia-owl:LacrossePlayer }
?type ^a/dcterms:subject ?category
}
order by ?type
limit 4000
When you start using types that have many more instances, though, these counts get big, and the queries take a while to return. E.g., a very common type like Place:
select ?type (count(distinct ?category) as ?nCategories) where {
values ?type { dbpedia-owl:Place }
?type ^a/dcterms:subject ?category
}
group by ?type
type nCategories
http://dbpedia.org/ontology/Place 191172
I wouldn't suggest trying to pull all that data down from the remote server. If you want to extract it, you should load the data locally.
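Once the data is available locally, the type-to-category association used in the queries above (instances of a type, then their dcterms:subject values) can be computed in a single pass. This is a minimal sketch assuming the triples are already in memory as (subject, predicate, object) tuples; the ex: and cat: resources are made up for illustration, and categories_per_type is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up sample triples; real DBpedia dumps would be parsed into this shape first.
triples = [
    ("ex:Salmon", "rdf:type", "dbpedia-owl:Fish"),
    ("ex:Trout", "rdf:type", "dbpedia-owl:Fish"),
    ("ex:Salmon", "dcterms:subject", "cat:Salmonidae"),
    ("ex:Salmon", "dcterms:subject", "cat:Commercial_fish"),
    ("ex:Trout", "dcterms:subject", "cat:Salmonidae"),
    ("ex:PlayerA", "rdf:type", "dbpedia-owl:LacrossePlayer"),
    ("ex:PlayerA", "dcterms:subject", "cat:Lacrosse_players"),
]


def categories_per_type(triples, type_pred="rdf:type", cat_pred="dcterms:subject"):
    """Map each type to the set of categories of its instances
    (the same join as ?type ^a/dcterms:subject ?category)."""
    types_of = defaultdict(set)
    cats_of = defaultdict(set)
    for s, p, o in triples:
        if p == type_pred:
            types_of[s].add(o)
        elif p == cat_pred:
            cats_of[s].add(o)
    result = defaultdict(set)
    for resource, types in types_of.items():
        for t in types:
            result[t] |= cats_of.get(resource, set())
    # Types whose instances have no category produce no rows in SPARQL either.
    return {t: cats for t, cats in result.items() if cats}


for rdf_type, cats in sorted(categories_per_type(triples).items()):
    print(rdf_type, len(cats))
```

The sets give the distinct categories per type, so their sizes correspond to count(distinct ?category) grouped by ?type.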

Query DBpedia for countries with human development index

I'm trying to query for all Countries in DBpedia and get their human development index.
The query I am trying is:
SELECT *
WHERE {
?Country a <http://dbpedia.org/ontology/Country> .
?Country <http://dbpedia.org/ontology/humanDevelopmentIndex> ?humanDevelopmentIndex .
}
LIMIT 1000
Would anyone be able to explain why this query isn't returning any results? It seems straightforward to me.
You're not getting anything back because apparently, none of the countries in DBpedia actually have a humanDevelopmentIndex property associated with them.
You can verify this for yourself. If you simplify your query to just get back countries:
SELECT *
WHERE {
?Country a <http://dbpedia.org/ontology/Country> .
}
LIMIT 1000
You will get back a list of countries, so clearly it is the addition of the other property pattern that causes the query to not match any results. Also, if you take a look at the data for, for example, Australia in DBpedia, you will not find the property you want there.
The reason it doesn't appear is that the data you want is probably located in the ontology_infobox_properties or the ontology_infobox_properties_specific dataset. These are not exposed in the public endpoint, but you can download them.

Sparql Keys vs distinct values

I have a SPARQL query that returns duplicates, and I want to remove them based on one of the values only (subjectID). DISTINCT does not do this: it seems to find unique combinations of all the selected values rather than unique values of a single variable.
I saw someone here propose GROUP BY, but that only seems applicable if I list all the variables after GROUP BY (otherwise my SPARQL endpoint complains, e.g. Non-group key variable in SELECT: ?occupation).
I tried running an inner SELECT, but it doesn't seem to work for this specific query. So it might be an issue with the query itself (the values of the livedIn optional seem to be causing the duplicates)?
I'm happy enough with relational DBs but early in the learning curve with SPARQL, so feel free to explain the obvious for the otherwise uninitiated! :)
select distinct
?subjectID ?englishName ?sex ?locatedIn15Name
?dob ?dod ?dom ?bornLocationName ?occupation
where {
?person a hc:Person ;
hc:englishName ?englishName ;
hc:sex ?sex;
hc:subjectID ?subjectID;
optional { ?person hc:livedIn11 ?livedIn11 .
?livedIn11 hc:englishName ?lived11LocationName .
?livedIn11 hc:locatedIn11 ?locatedIn11 .
?locatedIn11 hc:englishName ?locatedIn11Name .
?locatedIn11 hc:locatedIn15 ?locatedIn15 .
?locatedIn15 hc:englishName ?locatedIn15Name .
} .
optional {?person hc:born ?dob } .
optional {?person hc:dateOfDeath ?dod } .
optional {?person hc:dateOfMarriage ?dom } .
optional { ?person hc:bornIn ?bornIn .
?bornIn hc:englishName ?bornLocationName .
?bornIn hc:easting ?easting .
?bornIn hc:northing ?northing } .
optional { ?person hc:occupation ?occupation }
FILTER regex(?englishName, "^FirstName LastName")
}
GROUP BY
?subjectID ?englishName ?sex
?locatedIn15Name ?dob ?dod ?dom
?bornLocationName ?occupation
Re the error message:
Non-group key variable in SELECT: ?occupation
You can avoid this by using the SAMPLE() aggregate - this will allow you to just group on ?subjectID but still select values for the rest of the variables provided you only care about getting one value for those other variables.
Here's a simple example of this:
SELECT ?subjectID (SAMPLE(?dob) AS ?dateOfBirth)
WHERE
{
?person a hc:Person ;
hc:subjectID ?subjectID .
OPTIONAL { ?person hc:born ?dob }
}
GROUP BY ?subjectID
The first thing to note is that there is no such thing as a key, really, in RDF/SPARQL. You're querying a graph, and ?subjectID may simply have several possible combinations of values for the other variables you are selecting. This is caused by the shape of the graph you're querying: perhaps your person has more than one English name, or indeed the other way around: the same English name can be shared by more than one person.
A SPARQL SELECT query is a strange beast: it queries a graph structure but presents the result as a flat table (technically, it's a sequence of sets of variable bindings, but it amounts to the same thing). Duplicates occur because different combinations of values for your variables can be found by basically following different paths in the graph.
The fact that you get duplicate values for ?subjectID in your result is therefore unavoidable, simply because these are, from the point of view of the RDF graph, unique solutions to your query. You cannot filter out results without actually losing information, so in general it's hard to give you a solution without knowing more about exactly which 'duplicates' you want to discard: do you only want one possible English name for each subject, or one possible date of birth (even though there may be more than one in your data)?
However, here are some tips for handling/processing such results more easily:
First of all, you could choose to use an ORDER BY clause on your ?subjectID variable. This will still give you several rows with the same value for ?subjectID, but they'll all be in order, so you can process your result more efficiently.
Another solution is to split your query in two: do a first query that only selects all unique subjects (and possibly all other values for which you know, in advance, that they will be unique given the subject), then iterate over the result and do a separate query to get the other values you're interested in, for each individual subjectID value. This solution may sound like heresy (especially if you're from an SQL background), but it might actually be quicker and easier than trying to do everything in one huge query.
Yet another solution is the one suggested by RobV: using a SAMPLE aggregate on a particular variable to select just one (arbitrary) value per group. A variation on that is to use the GROUP_CONCAT aggregate, which creates a single value by concatenating all possible values into a single string.
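To show what SAMPLE and GROUP_CONCAT do to such duplicate rows, here is a small Python sketch of their grouping behaviour. The result rows, the subjectID values, and the helper names are invented for illustration; a real endpoint may pick a different value for SAMPLE, since the choice is arbitrary (this sketch simply takes the first one it sees).

```python
from collections import defaultdict

# Hypothetical result rows: subject 42 appears twice because it has two occupations.
rows = [
    {"subjectID": "42", "occupation": "farmer"},
    {"subjectID": "42", "occupation": "miller"},
    {"subjectID": "43", "occupation": "weaver"},
]


def group_rows(rows, key):
    """GROUP BY ?key: collect the rows that share the same key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def sample(rows, key, var):
    """Like SAMPLE(?var): one arbitrary bound value per group (here the first one)."""
    return {
        k: next((r[var] for r in group if r.get(var) is not None), None)
        for k, group in group_rows(rows, key).items()
    }


def group_concat(rows, key, var, separator=" "):
    """Like GROUP_CONCAT(?var; separator=...): join all bound values of a group."""
    return {
        k: separator.join(r[var] for r in group if r.get(var) is not None)
        for k, group in group_rows(rows, key).items()
    }


print(sample(rows, "subjectID", "occupation"))
print(group_concat(rows, "subjectID", "occupation", separator=", "))
```

Either way there is one output row per subjectID; SAMPLE discards the other values, while GROUP_CONCAT keeps them in a single string.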

SPARQL: Extracting Unique Entities from DBpedia

Consider the following script:
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT *
WHERE {
?s dcterms:subject category:Living_people .
?s foaf:name ?name
}
LIMIT 10000
When running it, I get something like this in result:
Sir Alexander Chapman Ferguson
Sir Alex Ferguson
Though they are different entries, they clearly refer to the same entity. I would like the SPARQL endpoint itself to return reduced output, because post-processing the output data may be challenging in this case. Could you help me with that? What should be fixed in my query?
As you see when you run your query, both the rows that you mention refer to the same resource: <http://dbpedia.org/resource/Alex_Ferguson>. The fact that you get multiple rows in your query result is simply because there are multiple names for this person.
So if you just need to ensure that you don't get duplicates in your application, simply make sure that your application treats each unique value for "s" in your query result as a separate person.
On the other hand, if your problem is the fact that you get multiple names for a person, you could perhaps use some other properties. For example, dbpedia:fullname only has a single entry, likewise the properties dbpedia:surname and dbpedia:givenName.
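Treating each unique value of "s" as one person can be done in a few lines on the client side. The sketch below assumes the endpoint returns the standard SPARQL JSON results format; the response text is hand-written for illustration (the second resource is invented), and names_by_resource is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL JSON results layout; not actual endpoint output.
sample_response = """
{
  "head": {"vars": ["s", "name"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://dbpedia.org/resource/Alex_Ferguson"},
     "name": {"type": "literal", "value": "Sir Alexander Chapman Ferguson"}},
    {"s": {"type": "uri", "value": "http://dbpedia.org/resource/Alex_Ferguson"},
     "name": {"type": "literal", "value": "Sir Alex Ferguson"}},
    {"s": {"type": "uri", "value": "http://example.org/resource/Someone_Else"},
     "name": {"type": "literal", "value": "Someone Else"}}
  ]}
}
"""


def names_by_resource(response_text):
    """Collapse result rows so that each resource URI appears once, with all its names."""
    data = json.loads(response_text)
    people = {}
    for binding in data["results"]["bindings"]:
        uri = binding["s"]["value"]
        people.setdefault(uri, []).append(binding["name"]["value"])
    return people


people = names_by_resource(sample_response)
for uri, names in people.items():
    print(uri, names)
```

Each key of the resulting dictionary is one person, so picking a preferred name (the first, the shortest, and so on) becomes an application-level decision rather than a query problem.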