How can I query a remote endpoint (such as the DBpedia or Wikidata endpoints) and insert the resulting triples into a local graph? I know there are operations like INSERT, ADD and COPY that can be used for such tasks. What I don't understand is how to address a remote endpoint while updating my local graph. Could someone provide a minimal example or the main steps?
I'm using Apache Jena Fuseki v2 on Windows and this is my query so far:
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wd: <http://www.wikidata.org/entity/>
INSERT
{ GRAPH <???> { ?s ?p ?o } } #don't know what to insert here for "GRAPH"
WHERE
{ GRAPH <???> #don't know what to insert here for "GRAPH" either
{ #a working example query for wikidata:
?s wdt:P31 wd:Q5. #humans
?s wdt:P54 wd:Q43310. #germans
?s wdt:P1344 wd:Q79859. #part of world cup 2014
?s ?p ?o.
}
}
The local endpoint I'm querying is http://localhost:3030/mylocaldb/update. I've read that /update is necessary to edit the database (I'm not sure I understood that correctly, though).
Is my approach correct so far? Or is something more needed, such as additional scripting outside SPARQL?
Taken from the SPARQL 1.1 Update W3C specs:
The syntax is
( WITH IRIref )?
INSERT QuadPattern
( USING ( NAMED )? IRIref )*
WHERE GroupGraphPattern
If the INSERT template specifies GRAPH blocks then these will be the graphs affected. Otherwise, the operation will be applied to the default graph, or, respectively, to the graph specified in the WITH clause, if one was specified. If no USING (NAMED) clause is present, then the pattern in the WHERE clause will be matched against the Graph Store, otherwise against the dataset specified by the USING (NAMED) clauses. The matches against the WHERE clause create bindings to be applied to the template for determining triples to be inserted (following the same rules as for DELETE/INSERT).
So basically, you can omit the GRAPH block from the INSERT template if you want to store the triples in the default graph; otherwise, the GRAPH block names the graph in which the data will be stored.
Regarding the WHERE clause, usually you would have to use the SERVICE keyword here to apply federated querying on the Wikidata endpoint (https://query.wikidata.org/bigdata/namespace/wdq/sparql):
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wd: <http://www.wikidata.org/entity/>
INSERT
{ ?s ?p ?o }
WHERE
{ SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql>
{ #a working example query for wikidata:
?s wdt:P31 wd:Q5. #humans
?s wdt:P54 wd:Q43310. #germans
?s wdt:P1344 wd:Q79859. #part of world cup 2014
?s ?p ?o.
}
}
I tested it with Apache Jena and it inserts 4462 triples into my local dataset.
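For reference, an update like the one above is sent to Fuseki as an HTTP POST to the dataset's update endpoint. Here is a minimal Python sketch, assuming the local endpoint URL from the question (http://localhost:3030/mylocaldb/update) and the application/sparql-update content type from the SPARQL 1.1 Protocol. The name build_update_request is made up for illustration, and the request is only built here, not sent:

```python
import urllib.request

# Assumption: the local Fuseki update endpoint named in the question.
UPDATE_ENDPOINT = "http://localhost:3030/mylocaldb/update"

# The federated INSERT from the answer above (default graph as target).
UPDATE = """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
INSERT { ?s ?p ?o }
WHERE {
  SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
    ?s wdt:P31 wd:Q5 .
    ?s wdt:P54 wd:Q43310 .
    ?s wdt:P1344 wd:Q79859 .
    ?s ?p ?o .
  }
}
"""


def build_update_request(endpoint, update):
    """Build (but do not send) a POST request carrying a SPARQL update."""
    return urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update; charset=utf-8"},
        method="POST",
    )


request = build_update_request(UPDATE_ENDPOINT, UPDATE)

# To run it against a Fuseki server that is actually listening:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

No scripting is strictly required, since the same update can be pasted into the Fuseki web interface; the sketch only shows what travels over the wire.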
I have the following federated SPARQL query that works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?s WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
}
I monitor the sub SPARQL queries that are being executed under the hood and notice that TopBraid correctly executes the following query to the http://dbpedia.org/sparql endpoint:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
}
while Apache Fuseki executes the following sub query:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace(?actorName, " ", "_"))
}
Notice the difference: TopBraid replaces the variable ?actorName with a particular value, "Paul Reubens", while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint because ?actorName is used in the result set but not assigned.
Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki execute this federated query correctly?
update 1: to clarify the difference in behaviour between TopBraid and Apache Fuseki a bit more: TopBraid executes the linkedmdb.org subquery first and then executes the dbpedia.org subquery once for each result of the linkedmdb.org query (substituting ?actorName with the values from those results). I assumed Apache Fuseki behaves similarly, but its first subquery to dbpedia.org fails (because ?actorName is used in the result set but not assigned), so it does not continue. I am therefore not sure whether it actually intends to execute the dbpedia.org subquery multiple times, because it never gets that far.
update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed that in stack traces from TopBraid the package name is something like com.topbraid.jena.* which might indicate they use a modified version of Jena/ARQ?
update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?". Both TopBraid and Apache Fuseki use exactly this method for the following query:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?film ?label ?subject WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
?film a movie:film .
?film rdfs:label ?label .
?film owl:sameAs ?dbpediaLink
FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
}
SERVICE <http://dbpedia.org/sparql> {
?dbpediaLink dcterms:subject ?subject
}
}
LIMIT 50
but I agree that in principle they should execute both parts once and join them, but maybe for performance reasons they chose a different strategy?
Additionally, notice that the above query works on Apache Fuseki, while the first query of this post does not. So Apache Fuseki actually behaves like TopBraid in this particular case. The difference seems to be between using a URI variable (?dbpediaLink) in two triple patterns (which works in Fuseki) and using a string variable (?actorName) from a triple pattern inside a FILTER regex function (which does not work in Fuseki).
Updated (Simpler) Response
In the original answer I wrote (below), I said that the issue was that SPARQL queries are executed innermost first. I think that that still applies here, but I think the problem can be isolated even more easily. If you have
service <ex1> { ... }
service <ex2> { ... }
then the results have to be what you'd get from executing each query separately on the endpoints and then joining the results. The join will merge any results where the common variables have the same values. E.g.,
service <ex1> { values ?a { 1 2 3 } }
service <ex2> { values ?a { 2 3 4 } }
would execute, and you'd have two possible values for ?a in the outer query (2 and 3). In your query, the second service can't produce any results. If you take:
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
and execute it at DBpedia, you shouldn't get any results, because ?actorName isn't bound, so the filter will never succeed. It appears that TopBraid is performing the first service first and then injecting the resulting values into your second service. That's convenient, but I don't think it's correct, because it returns different results than what you'd get if the DBpedia query had been executed first and the other query executed second.
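The join semantics described above can be sketched in a few lines of Python. This is only an illustration of the SPARQL join over solution sets, not how ARQ implements it; solutions are modelled as plain dicts from variable name to value, and the helper names compatible and join are made up:

```python
def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def join(left, right):
    """Merge every compatible pair of solutions from the two result sets."""
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


# The two VALUES blocks from the example above.
ex1 = [{"a": 1}, {"a": 2}, {"a": 3}]
ex2 = [{"a": 2}, {"a": 3}, {"a": 4}]
print(join(ex1, ex2))  # [{'a': 2}, {'a': 3}]

# If the second service returns no solutions on its own (because the
# filter sees an unbound ?actorName), the join is empty as well.
actors = [{"actorName": "Paul Reubens"}]
dbpedia = []
print(join(actors, dbpedia))  # []
```

Swapping the arguments gives the same solutions, which is why the order of the two service blocks should not matter for the result.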
Original Answer
Subqueries in SPARQL are executed inner-most first. That means that a query like
select * {
{ select ?x { ?x a :Cat } }
?x foaf:name ?name
}
would first find all the cats, and would then find their names. "Candidate" values for ?x are determined first by the subquery, and then those values for ?x are made available to the outer query. Now, when there are two subqueries, e.g.,
select * {
{ select ?x { ?x a :Cat } }
{ select ?x ?name { ?x foaf:name ?name } }
}
the first subquery is going to find all the cats. The second subquery finds all the names of everything that has a name, and then in the outer query, the results are joined to get just the names of the cats. The values of ?x from the first subquery aren't available during the execution of the second subquery. (At least in principle, a query optimizer might be able to figure out that some things should be restricted.)
My understanding is that service blocks have the same kind of semantics. In your query, you have:
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
You say that tracing shows that TopBraid is executing
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
}
If TopBraid already executed the first service block and got a unique solution, then that might be an acceptable optimization, but what if, for instance, the first query had returned multiple bindings for ?actorName? Surely you wouldn't expect the second service block to be executed for each one of them? Instead, the second service block is executed as written, and will return a result set that will be joined with the result set from the first.
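For comparison, the per-binding strategy that the tracing suggests (run the first block, then substitute each ?actorName value into the second) can be sketched in Python. This only illustrates the idea and is not how TopBraid or ARQ implement it: substitute is a hypothetical helper that does naive textual replacement, and the second actor name is a placeholder added to show the multi-binding case.

```python
import re

# The second service block as Fuseki sends it, with ?actorName left in place.
SECOND_BLOCK = (
    'SELECT * WHERE { ?s ?p ?o '
    'FILTER regex(str(?s), replace(?actorName, " ", "_")) }'
)


def substitute(query, var, value):
    """Replace every occurrence of ?var with a quoted string literal."""
    literal = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    # Match ?var only when it is not the prefix of a longer variable name.
    pattern = r"\?" + re.escape(var) + r"(?![A-Za-z0-9_])"
    return re.sub(pattern, lambda match: literal, query)


# Solutions of the first service block; the second entry is a placeholder.
bindings = [{"actorName": "Paul Reubens"}, {"actorName": "Some Other Actor"}]

# One remote query per binding, which is what makes this strategy costly
# when the first block returns many solutions.
queries = [substitute(SECOND_BLOCK, "actorName", b["actorName"]) for b in bindings]
for query in queries:
    print(query)
```

With two bindings this already means two round trips to DBpedia, each of which still scans every triple because of the regex filter.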
It probably "doesn't work" in Jena because the second query doesn't actually bind any variables, so it pretty much has to look at every triple in the data, which is obviously going to take a long time.
I think that you can get around this by nesting the service calls. If nested services are all launched by the "local" endpoint (i.e., nesting a service call doesn't ask a remote endpoint to make another remote query), then you might be able to do:
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
That might get you the kind of optimization that you want, but that still seems like it might not work unless DBpedia has some efficient ways of figuring out which triples to retrieve based on computing the replace. You're asking DBpedia to look at all its triples, and then to keep the ones where the string form of the subject matches a particular regular expression. It'd probably be better to construct that IRI manually in a subquery and then search for it. I.e.,
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
{ select ?dbpediaActor {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
# bind needs a fresh variable; note the trailing slash on the namespace
bind(iri(concat("http://dbpedia.org/resource/",
replace(?actorName," ","_")))
as ?dbpediaActor)
} }
?dbpediaActor ?p ?o
}
(long comment)
Consider:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?s WHERE {
{
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
{
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
}
That is the same query, but with no SERVICE calls.
?actorName does not occur in any pattern inside the second inner {}.
As join is a commutative operation, the following has the same answers as the first query:
SELECT ?s WHERE {
{
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
{
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
}
The SERVICE version highlights this because the parts are executed separately on different machines.
The join of the two parts happens on the results of each part.
Hello everybody, I'm using a SPARQL query to retrieve the properties and values of a specified resource. For example, if I ask for Barry White, I obtain birth place, associatedBand, recordLabels and so on.
For some instances, however, such as "Hammerfall", I obtain only these results:
Query results
But I want the properties and values shown on this page: Correct results.
My query is:
PREFIX db: <http://dbpedia.org/resource/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX onto: <http://dbpedia.org/ontology/>
SELECT ?property ?value
WHERE { db:Hammerfall ?property ?value }
Can anyone tell me how to access the correct resource and obtain the correct properties and values in every case?
select ?p ?o { dbpedia:HammerFall ?p ?o }
SPARQL results
The particular prefix doesn't matter; I just used dbpedia: because it's predefined on the endpoint as http://dbpedia.org/resource/, just like your db:. The issue is that HammerFall has a majuscule F in the middle, but your query uses a minuscule f.
As an alternative, since the results for Hammerfall (with a minuscule f) do include
http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/HammerFall
you could use a property path to follow any wikiPageRedirects paths:
select ?p ?v {
dbpedia:Hammerfall dbpedia-owl:wikiPageRedirects* ?hammerfall .
?hammerfall ?p ?v
}
SPARQL results
See Retrieving dbpedia-owl:type value of resource with dbpedia-owl:wikiPageRedirect value? for more about that approach.
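To see what the wikiPageRedirects* path does, here is a small Python sketch over an in-memory list of triples. It only illustrates "zero or more" path semantics: the helper follow is hypothetical, the dbr:/dbo: names are abbreviations used for readability, and the genre triple is an illustrative sample rather than data fetched from DBpedia.

```python
REDIRECTS = "dbo:wikiPageRedirects"


def follow(triples, start, predicate):
    """Return start plus every node reachable via zero or more predicate edges."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


triples = [
    ("dbr:Hammerfall", REDIRECTS, "dbr:HammerFall"),
    ("dbr:HammerFall", "dbo:genre", "dbr:Power_metal"),  # illustrative sample triple
]

# ?hammerfall matches the start node itself (zero steps) and its redirect target.
targets = follow(triples, "dbr:Hammerfall", REDIRECTS)

# ?p ?v for every matched node, as in the second triple pattern of the query.
properties = sorted((p, o) for s, p, o in triples if s in targets)
print(properties)
```

Because the path allows zero steps, the redirect triple of the misspelled resource shows up in the results alongside the properties of the real one.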
Is it possible to programmatically add an OPTIONAL clause to a SPARQL query using the Jena ARQ API?
I would like to programmatically take this query:
select ?concept ?p ?o where {?s ?p ?o . } limit 10
To this:
SELECT ?concept ?p ?o ?test WHERE
{
?s ?p ?o
OPTIONAL { ?concept <http://www.test.com/test> ?test }
}
LIMIT 10
Through ARQ it's simple to add the additional result variable ?test:
Query q = QueryFactory.create(query);
q.addResultVar(var);
But from what I've found in the API docs and trawling across the net it's not possible to add an OPTIONAL clause. Do I need to use a different library?
Yes, you can. See this introduction to the topic on the Apache Jena site.
Your starting point is getting the query pattern:
Element pattern = q.getQueryPattern();
That will be an ElementGroup if I remember correctly. Add the optional in there:
((ElementGroup) pattern).addElement(new ElementOptional(...));
The ... bit will be an ElementTriplesBlock, which is pretty straightforward.
Inelegant, however. In general I'd recommend using visitors and the algebra representation, but this direct route should work.
I am trying to retrieve the value of dbpedia-owl:influenced from a DBpedia page, e.g. Andy_Warhol.
The query I write is:
PREFIX rsc : http://dbpedia.org/resource
PREFIX dbpedia-owl :http://dbpedia.org/ontology
SELECT ?o WHERE {
rsc:Andy_Warhol dbpedia-owl:infuenced ?o .
}
but it is EMPTY.
Strangely, when I run the same query for another property from the ontology, such as "birthPlace", the SPARQL engine does return a result:
SELECT ?o WHERE {
rsc:Andy_Warhol dbpedia-owl:birthplace ?o .
}
which is a link to another resource:
dbpedia.org/resource/Pittsburgh
I am confused about how to write this query.
Besides the formal errors addressed in Joshua's answer, there is also a semantic problem: in this case, the properties you are looking for seem to be attached to the entities that were influenced.
This query might give you the desired results:
PREFIX rsc: <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?s WHERE {
?s dbpedia-owl:influencedBy rsc:Andy_Warhol .
}
run query
There are a few issues here. One is that the SPARQL, as presented, isn't correct. I edited to make the prefix syntax legal, but the prefixes were still wrong (they didn't end with a final slash). You don't want to be querying for http://dbpedia.org/resourceAndy_Warhol after all; you want to query for http://dbpedia.org/resource/Andy_Warhol. Some standard namespaces for DBpedia are listed on their SPARQL endpoint. Using those namespaces and the SPARQL endpoint, we can ask for all the triples that have http://dbpedia.org/resource/Andy_Warhol as the subject with this query:
SELECT * WHERE {
dbpedia:Andy_Warhol ?p ?o .
}
In the results produced there, you'll see the one using http://dbpedia.org/ontology/birthPlace (note the capital P in birthPlace), but you won't see any triples with the predicate http://dbpedia.org/ontology/infuenced, so it makes sense that your first query has no results. Do you have some reason to suppose that there should be some results?
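The effect of the missing trailing slash is easy to check outside SPARQL, because a prefixed name is expanded by plain string concatenation. A minimal Python sketch (the helper expand is made up for illustration and ignores escaping rules in local names):

```python
def expand(prefixed_name, prefixes):
    """Expand prefix:local to a full IRI by concatenating namespace and local part."""
    prefix, local = prefixed_name.split(":", 1)
    return prefixes[prefix] + local


without_slash = {"rsc": "http://dbpedia.org/resource"}
with_slash = {"rsc": "http://dbpedia.org/resource/"}

print(expand("rsc:Andy_Warhol", without_slash))
# http://dbpedia.org/resourceAndy_Warhol
print(expand("rsc:Andy_Warhol", with_slash))
# http://dbpedia.org/resource/Andy_Warhol
```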