SPARQL - Traverse to top (broadest)

I have an internal triples dataset.
I am trying to implement a typeahead feature for an application using the dataset.
I am trying to figure out how I can conditionally traverse to the top concept.
In this picture, if I searched for Dreamworks, it would be 3 layers down (Business Organizations -> Private Company -> Dreamworks).
If I search for something else, it will obviously be at a different layer of the graph.
I am trying to get the value "Organization" or "Person" as the top level.
This very straightforward query works:
SELECT DISTINCT ?subjectPrefLabel ?a ?b ?o
WHERE
{
?subject skosxl:prefLabel/skosxl:literalForm ?subjectPrefLabel .
?subject skos:broader/skos:broader/skos:broader/skosxl:prefLabel/skosxl:literalForm ?a
FILTER regex(?subjectPrefLabel, "Dreamworks", 'i')
}
ORDER BY ?subjectPrefLabel
However, this is obviously unsustainable, since the developer would need to know how many levels down the hierarchy the concept sits in order to issue the correct number of skos:broader steps.
Is there any way I can conditionally discover the highest level of the hierarchy?
As this is an internal ontology, I cannot share any data, so I know it will be hard to work with. Any advice is appreciated.

Using the Kleene star operator * in combination with a FILTER that checks there is no broader concept should return only the top concepts:
SELECT DISTINCT ?subjectPrefLabel ?a ?b ?o WHERE {
?subject skosxl:prefLabel/skosxl:literalForm ?subjectPrefLabel .
?subject skos:broader* ?concept.
?concept skosxl:prefLabel/skosxl:literalForm ?a
FILTER NOT EXISTS { ?concept skos:broader ?supConcept }
FILTER regex(?subjectPrefLabel, "Dreamworks", 'i')
}
ORDER BY ?subjectPrefLabel
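
To make the logic of `skos:broader*` plus `FILTER NOT EXISTS` concrete, here is a minimal Python sketch that does the same walk over an in-memory dictionary. The hierarchy is an assumption pieced together from the example in the question (the real ontology is internal), and `top_concepts` is a hypothetical helper name, not part of any library.

```python
# Toy SKOS hierarchy: concept -> list of broader (parent) concepts.
# The names are assumptions taken from the example in the question.
BROADER = {
    "Dreamworks": ["Private Company"],
    "Private Company": ["Business Organizations"],
    "Business Organizations": ["Organization"],
}


def top_concepts(concept, broader=BROADER):
    """Return every ancestor-or-self of `concept` that has no broader concept.

    Mirrors `?subject skos:broader* ?concept` (zero or more hops) combined
    with `FILTER NOT EXISTS { ?concept skos:broader ?supConcept }`.
    """
    seen = set()
    stack = [concept]
    tops = set()
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles in the hierarchy
            continue
        seen.add(node)
        parents = broader.get(node, [])
        if parents:
            stack.extend(parents)  # keep climbing
        else:
            tops.add(node)  # nothing broader: this is a top concept
    return tops
```

Because the path allows zero hops, a concept that is already at the top is returned as its own top concept, which matches what the SPARQL query does.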

Related

Can this SPARQL search query be made more efficient?

I have a compound 'search' query made in SPARQL that
(1) Searches for unique subject URIs that are of a certain rdf:type:
Example:
SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
{
GRAPH ?g
{
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.example.org/widget>.
}
} OFFSET 10000 LIMIT 100
This query is quite simple and just returns all subjects of type 'widget'.
(2) For the returned page of satisfying subject URIs, search for all subject URIs that have a reference to those subject URIs (i.e. referencing entities), specifying the reference predicate URIs that indicate a reference.
Let's say the previous query (1) returned 2 subject URIs http://www.example.org/widget100 and http://www.example.org/widget101 and the referencing predicate I wanted to query for was http://www.example.org/widget:
Example:
SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
WHERE {
{
?s <http://www.example.org/widget> <http://www.example.org/widget100>
}
UNION
{
?s <http://www.example.org/widget> <http://www.example.org/widget101>
}
}
If the previous page returned 100 subject URIs, there would be 100 'UNION' statements here for each subject.
This query works - it selects the subject URIs of the given type, and returns the additional subject URIs that reference those subjects with the given reference predicate.
The problem is in practice, when I have 100,000s of triples across my query graphs, even on a fast machine on an in-memory graph this query is taking typically 1 minute+ to execute. This is unacceptably slow for users for this fairly typical search scenario.
Under profiling, both queries take roughly 50% of the query time.
I have enough experience with SPARQL to construct a query like the one above, but I am certainly not an expert. I am wondering if this could be made more efficient. For example, could it be combined into a single query that might at least reduce query times by 50%+? Is my use of UNIONs across potentially many subjects replaceable by a more efficient method?
Thank you
SPARQL Guy
UPDATE: I have managed to reduce the query down to a single query of the following form:
SELECT *
FROM NAMED <http://www.example.org/widgets>
FROM NAMED <http://www.example.org/widgetstats>
FROM NAMED <http://www.example.org/widgetmetadata>
FROM NAMED <http://www.example.org/widgetfactory>
WHERE
  { { SELECT ?s ?p ?o
      WHERE
        { GRAPH ?g
            { ?s ?p ?o }
          { SELECT ?s
            WHERE
              { GRAPH ?i
                  { ?s a <http://www.example.org/widget> }
              }
            OFFSET 0
            LIMIT 100
          }
        }
    }
    UNION
    { SELECT ?s ?p ?o
      WHERE
        { GRAPH ?g
            { ?s ?p ?o }
          { SELECT DISTINCT ?s
            WHERE
              { GRAPH ?h
                  { OPTIONAL
                      { ?s <http://www.example.org/widgetstats/widget> ?x }
                    OPTIONAL
                      { ?s <http://www.example.org/widgetmetadata/widget> ?x }
                    OPTIONAL
                      { ?s <http://www.example.org/widgetfactory/widget> ?x }
                  }
                { SELECT ?x
                  WHERE
                    { GRAPH ?i
                        { ?x a <http://www.example.org/widget> }
                    }
                  OFFSET 0
                  LIMIT 100
                }
              }
          }
        }
    }
  }
This improves query speed by approx. 50%. The query can, though, I think be made faster. This form of query - fetching first all triples associated with the primary entities of the given type followed by all the triples associated with the referencing entities - requires two identical innermost subqueries, fetching the unique subjects of the given type.
Is there any way of reducing this query down - perhaps performing with a single query instead of a UNION of two subqueries? I am assuming this will probably improve performance further.
UPDATE 2: I couldn't improve on the query above (first update), so I will mark this as the answer for now.
If you still want the paging of the first query then probably the best approach would be to combine the queries using a SPARQL subquery.
Note that with subqueries you work from the inside out, so the subquery selects the widgets and the outer query expands to find the references. If you are using FROM NAMED then you need to match on the graph (assuming your results are in a named graph and you aren't working with a union default graph). The OFFSET and LIMIT on the inner query mean that the example below returns references to the 3rd widget (in whatever default sort order the engine is applying).
I'm not sure if this will speed up the overall query time, but it is worth experimenting with and saves you a bunch of string concatenation!
PREFIX ex: <http://www.example.org/>
SELECT ?s FROM NAMED ex:g1 FROM NAMED ex:g2 WHERE {
GRAPH ?h {
?s ex:widget ?x
}
{
SELECT ?x WHERE {
GRAPH ?g {
?x a ex:widget
}
} OFFSET 2 LIMIT 1
}
}
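
The inside-out evaluation described above can be sketched in plain Python: first pick one page of widgets, then keep only the triples that reference that page. This is an illustration of the idea, not of how any particular engine executes it. The triples and the helper names `page_of_widgets` and `referencing_subjects` are made up for the example.

```python
# Hypothetical triples (subject, predicate, object); all names are made up.
TYPE = "rdf:type"
TRIPLES = [
    ("widget100", TYPE, "ex:widget"),
    ("widget101", TYPE, "ex:widget"),
    ("widget102", TYPE, "ex:widget"),
    ("stat1", "ex:widget", "widget100"),
    ("stat2", "ex:widget", "widget102"),
    ("meta1", "ex:widget", "widget102"),
]


def page_of_widgets(triples, offset, limit):
    """Inner subquery: subjects of type ex:widget, paged with OFFSET/LIMIT."""
    widgets = [s for s, p, o in triples if p == TYPE and o == "ex:widget"]
    return widgets[offset:offset + limit]


def referencing_subjects(triples, offset, limit):
    """Outer query: subjects that point at a widget on the selected page."""
    page = set(page_of_widgets(triples, offset, limit))
    return [s for s, p, o in triples if p == "ex:widget" and o in page]
```

With `offset=2, limit=1` the page holds only the third widget, so only the subjects referencing that widget come back, and no per-widget UNION branches are needed.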

How to limit COUNT(*) for SPARQL

I have a query like the one below. My goal is, for each item that itemX is related to through a specific relationship, to count how many subjects have that same relationship with the item.
SELECT ?b ?c (COUNT(*) as ?count)
WHERE{
itemX ?b ?c.
?a ?b ?c.
}
GROUP BY ?b ?c
This is a hypothetical result
?b|?c|?count
rel1|item1|100
instanceof|human|10000000
rel2|item2|1
I run this query against Wikidata (with appropriate modifications). The issue is that ?count can be in the millions, and the query easily hits the 60-second time limit.
Therefore, I would like a compromise. I still want a count, but for each ?b-?c pair I will only look at a maximum of 10000 ?a. For instance, if ?b is "instance of" and ?c is "human", there are millions of humans, and listing them all before aggregating may result in a timeout. Hence, I want the query to list only 10000 ?a for "instance of" and "human" and then aggregate them. The desired outcome is below.
?b|?c|?count
rel1|item1|100
instanceof|human|10000
rel2|item2|1
I tried something like this.
SELECT ?b ?c (COUNT(*) as ?count)
WHERE{
itemX ?b ?c.
{
SELECT ?a
{
?a ?b ?c.
}
LIMIT 10000
}
}
GROUP BY ?b ?c
However, I get this result instead.
?b|?c|?count
rel1|item1|10000
instanceof|human|10000
rel2|item2|10000
I wonder where I went wrong.
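
A likely explanation, sketched here in Python rather than SPARQL: a subquery is evaluated on its own, and only the variables it projects are visible outside it. The attempted subquery projects only ?a, so its ?b and ?c are unrelated to the outer ones. It returns 10000 arbitrary rows once, and those rows are then joined with every outer row, which gives 10000 for each group. The data and function names below are hypothetical and only model that behaviour next to the desired per-pair cap.

```python
from itertools import islice

# Hypothetical data: (b, c) pair -> the subjects ?a that have that relation.
SUBJECTS = {
    ("rel1", "item1"): range(100),
    ("instanceof", "human"): range(1_000_000),
    ("rel2", "item2"): range(1),
}


def capped_counts(subjects, cap):
    """Desired behaviour: look at no more than `cap` subjects per pair."""
    return {
        pair: sum(1 for _ in islice(members, cap))
        for pair, members in subjects.items()
    }


def independent_subquery_counts(subjects, cap):
    """Model of the attempted query: the LIMIT is applied once, globally."""
    all_rows = (a for members in subjects.values() for a in members)
    sub_rows = list(islice(all_rows, cap))  # the subquery result, no ?b/?c
    # No shared variables, so every outer row joins with every subquery row.
    return {pair: len(sub_rows) for pair in subjects}
```

The first function stops iterating as soon as the cap is reached, which is the point of the compromise: the large group is never enumerated in full.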

GraphDB - very slow sparql query with two connections

My database has information about documents, where each document has a category, e.g.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://example.com/>
:doc1 :hasCategory :category1 .
:category1 rdfs:label "Law" .
There are about 100k statements like this.
Running a simple query to get counts of documents per category:
SELECT ?category (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
} GROUP BY ?category
takes about 0.1s to run.
But to return the category labels as well:
SELECT ?category ?label (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
?category rdfs:label ?label .
} GROUP BY ?category ?label
this query takes more than 7s to run.
Why would the difference be so large, and is there a more optimised query I can use to get the labels?
I found I can get the desired result in 0.2s with the following query:
SELECT ?category (sample(?lbl) as ?label) ?count WHERE {
?category rdfs:label ?lbl .
{
SELECT ?category (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
} GROUP BY ?category
}
} GROUP BY ?category ?count
But I don't really understand why it's more efficient.
GraphDB versions before the 8.6 release implement the GROUP BY operation with a naive LinkedHashMap, where the hash key is composed of all elements of the projection. To calculate the hash code, the engine translates each internal identifier to an RDF value. If the strings are long, they are read from an external collection, resulting in an extra disk operation and additional CPU time to calculate the hash code.
The only ways to optimise the query are to switch to GraphDB 8.6 (currently a late release candidate), which implements a more optimised aggregate algorithm, or to reduce the GROUP BY projection as you did in your answer.
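
The "reduce the GROUP BY projection" idea can be illustrated with a small Python sketch: aggregate on the short category key alone, and attach the label only once per group afterwards. This is just a model of the query shape, not of GraphDB's internals. The documents, the second category and its label are hypothetical.

```python
from collections import Counter

# Hypothetical data; only "category1" / "Law" appears in the question.
DOC_CATEGORY = [
    ("doc1", "category1"),
    ("doc2", "category1"),
    ("doc3", "category2"),
    ("doc1", "category1"),  # duplicate row, removed by the distinct step
]
LABELS = {"category1": "Law", "category2": "Tax"}


def counts_with_labels(doc_category, labels):
    """Count distinct documents per category, then look up each label once."""
    distinct_pairs = set(doc_category)  # count(distinct ?doc)
    counts = Counter(category for _doc, category in distinct_pairs)
    # The label is fetched once per category, not once per grouped row.
    return {cat: (labels[cat], n) for cat, n in counts.items()}
```

The grouping key here is only the category identifier, so the potentially long label string never takes part in hashing during aggregation.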

select all people in dbpedia associated with particular concept/occupation/profession

I am trying to select all people in Wikipedia who have a particular profession or occupation, or who are known for a particular field (i.e. the value contains the string of the occupation or profession).
SELECT * WHERE {
?p a <http://dbpedia.org/ontology/Person> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
?p <http://dbpedia.org/ontology/knownFor> ?knownFor.
}
LIMIT 100
This produces roughly what I want, but I also want to include their profession or occupation (bound to ?profession or ?occupation). However, the following doesn't return anything:
SELECT * WHERE {
?p a <http://dbpedia.org/ontology/Person> .
?p <http://dbpedia.org/ontology/influenced> ?influenced.
?p <http://dbpedia.org/ontology/knownFor> ?knownFor.
?p <http://dbpedia.org/ontology/profession> ?profession.
}
LIMIT 100
Including ?occupation doesn't work either, as some people don't have a profession entry but do have an occupation. I just want all people (and their influenced list) for a particular profession/occupation, e.g. "psychology" or "architect". Basically, I want a table like this:
Person | Profession/Occupation | Known For | Influenced Person | Profession/Occupation | Known For
The below is obviously wrong but something like this....
SELECT * WHERE {
?p a <http://dbpedia.org/ontology/Person> .
?s <http://dbpedia.org/ontology/profession> ?profession. # or ?fields / ?occupation
?p <http://dbpedia.org/ontology/influenced> ?influenced.
?p <http://dbpedia.org/ontology/knownFor> ?knownFor.
?s <http://dbpedia.org/ontology/profession> ?profession. # or ?fields / ?occupation
}
LIMIT 100
I don't know the exact reason*, but it seems that people in DBpedia who have dbo:influenced or dbo:knownFor don't have dbo:profession or dbo:occupation.
So your query won't work, and you will have to find some other way to get the information you need, possibly by looking at dbp:fields or rdf:type.
* My guess is that dbo:influenced and dbo:knownFor are used for scientists, and scientists don't have an occupation or profession in DBpedia.
DBpedia is an extraction of Wikipedia infoboxes. If you take one example from the results of the first query and check it in Wikipedia, you will see that no profession or occupation is used in the infobox. For example, let us look at http://dbpedia.org/page/Jerry_Coyne, whose Wikipedia page is https://en.wikipedia.org/wiki/Jerry_Coyne. On the Wikipedia page, we cannot find any information about profession or occupation in the infobox. This is because https://en.wikipedia.org/wiki/Template:Infobox_scientist is used as the template, and there is no occupation/profession field in that template. On the other hand, in the https://en.wikipedia.org/wiki/Template:Infobox_person template we can find occupation and known for. Therefore, the query below will work:
SELECT * WHERE {
?p a <http://dbpedia.org/ontology/Person> .
?p <http://dbpedia.org/ontology/knownFor> ?knownFor.
?p <http://dbpedia.org/ontology/occupation> ?occupation.
}
LIMIT 100
But this one will only match data extracted from pages that use https://en.wikipedia.org/wiki/Template:Infobox_person and will not return any page that uses https://en.wikipedia.org/wiki/Template:Infobox_scientist.
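
The effect described above comes from every triple pattern in a basic graph pattern being required: a resource only matches if it has all the listed properties. A small Python sketch with made-up resources shows why combining properties that come from different infobox templates yields nothing. `scientist_a`, `person_b` and `subjects_with_all` are hypothetical names, and the property sets are illustrative rather than real DBpedia data.

```python
# Hypothetical resources and the properties each one carries.
PROPERTIES = {
    "scientist_a": {"dbo:influenced", "dbo:knownFor"},  # Infobox_scientist style
    "person_b": {"dbo:occupation", "dbo:knownFor"},     # Infobox_person style
}


def subjects_with_all(required, properties=PROPERTIES):
    """Return the subjects that have every property in `required`."""
    return sorted(s for s, props in properties.items() if required <= props)
```

Asking for `dbo:influenced`, `dbo:knownFor` and `dbo:profession` together returns an empty list here, while `dbo:knownFor` with `dbo:occupation` returns only the resource modelled on the person template.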

How to get all semantic data that is common between three classes using SPARQL?

I have a scenario where the user can select multiple classes, and I need to build a dynamic SPARQL query to fetch all the data that satisfies all the relationships that exist (if any) between the selected classes. How do I achieve this?
Suppose the user selects three classes: Population, Drugname and SideEffects. Now they try to find all the common data that exists between these three classes. How do I build a SPARQL query to achieve this? I need a sample SPARQL query that joins these three classes.
If you want to retrieve the properties for which some resources have the same value (and so you probably want to retrieve that value, too), you can use a query like this:
select ?p ?o where {
dbpedia:Bob_Dylan ?p ?o .
dbpedia:Tom_Waits ?p ?o .
dbpedia:The_Byrds ?p ?o .
}
SPARQL results from DBpedia
p                                                o
--------------------------------------------------------------------------------------------------------------------------
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://www.w3.org/2002/07/owl#Thing
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/ontology/Agent
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://schema.org/MusicGroup
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/YagoLegalActor
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/YagoLegalActorGeo
http://purl.org/dc/terms/subject                 http://dbpedia.org/resource/Category:Rock_and_Roll_Hall_of_Fame_inductees
http://dbpedia.org/ontology/recordLabel          http://dbpedia.org/resource/Asylum_Records
http://dbpedia.org/ontology/genre                http://dbpedia.org/resource/Rock_music
http://dbpedia.org/property/genre                http://dbpedia.org/resource/Rock_music
http://dbpedia.org/property/label                http://dbpedia.org/resource/Asylum_Records
http://dbpedia.org/property/wordnet_type         http://www.w3.org/2006/03/wn/wn20/instances/synset-musician-noun-1
You could also use a query like this to find the subjects that are related to all three of these individuals by some particular property (but on DBpedia it doesn't have any answers):
select ?s ?p where {
?s ?p dbpedia:Bob_Dylan, dbpedia:Tom_Waits, dbpedia:The_Byrds .
}
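
The first query above (the one selecting ?p ?o) amounts to intersecting the (predicate, object) pairs of each resource. A minimal Python sketch of that intersection follows. The two shared pairs are taken from the results listed above, the `ex:` triples are made up to give each resource something unshared, and `common_pairs` is a hypothetical helper name.

```python
# Small illustrative triple set; the ex: triples are invented for contrast.
TRIPLES = {
    ("dbpedia:Bob_Dylan", "rdf:type", "owl:Thing"),
    ("dbpedia:Tom_Waits", "rdf:type", "owl:Thing"),
    ("dbpedia:The_Byrds", "rdf:type", "owl:Thing"),
    ("dbpedia:Bob_Dylan", "dbo:genre", "dbr:Rock_music"),
    ("dbpedia:Tom_Waits", "dbo:genre", "dbr:Rock_music"),
    ("dbpedia:The_Byrds", "dbo:genre", "dbr:Rock_music"),
    ("dbpedia:Bob_Dylan", "ex:note", "only on one resource"),
    ("dbpedia:Tom_Waits", "ex:other", "also only on one resource"),
}


def common_pairs(triples, subjects):
    """Return the (predicate, object) pairs shared by every given subject."""
    pair_sets = [
        {(p, o) for s, p, o in triples if s == subject}
        for subject in subjects
    ]
    return set.intersection(*pair_sets) if pair_sets else set()
```

When the query is built dynamically, the same shape extends to any number of selected resources: one triple pattern per resource, all sharing the same ?p and ?o variables.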